An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami*
*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
• General Architecture and Specifications
• Design Procedure and Tool Chain
• Preliminary Results
• Conclusions and Future Work
WAHA 2009, Kyushu University
Introduction
Parallel computer clusters with General-Purpose Processors (GPPs) are often used for HPC
Various accelerators are used with GPPs for further performance improvement
• PowerXCell, GPGPU, GRAPE-DR, ClearSpeed, etc.
• Small size and low power consumption compared to processors with similar ...
Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)
A large memory bandwidth is demanded in conventional accelerators for high-performance computation
On-chip memories are often used to hide memory access latency
Large-Scale Reconfigurable Data-Path (LSRDP):
• is introduced as an alternative accelerator
• reduces the number of memory accesses
• is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
• is suitable for high-performance scientific computations
Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor
Features:
• Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped
• Pipeline execution
• Burst transfer is used for input/output of rearranged data from/to memory
[Figure: GPP, Main Memory, Scratchpad Memory, SMAC, and the LSRDP (SB, rows of FUs, ORNs). ORN: Operand Routing Network]
Reconfigurable data-path includes:
• A large number of floating-point Functional Units (FUs)
• Reconfigurable Operand Routing Network (ORN)
• Dynamic reconfiguration facilities
• Streaming Buffers (SB) for I/O ports
• Implementation by SFQ circuits
Single-Flux Quantum (SFQ) against CMOS
CMOS issues:
• High electric power consumption
• High heat radiation and difficulties in high-density packing
• Memory wall problem, which limits the processing speed
SFQ features:
• High-speed switching and signal transmission
• Low power consumption
• Compact implementation of a system (small area)
• No cost for latch
• Suitable for pipeline processing of data stream
• Serial bit-level processing
CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits (SFQ-LSRDP)
• Kyushu Univ. (Architecture, Compiler and Applications): Prof. K. Murakami, Dr. K. Inoue, Dr. H. Honda, Dr. F. Mehdipour, H. Kataoka
• Superconducting Research Lab. (SRL) (SFQ process): Dr. S. Nagasawa et al.
• Yokohama National Univ. (SFQ-FPU chip, cell library): Prof. N. Yoshikawa et al.
• Nagoya Univ. (SFQ-RDP chip, cell library, and wiring): Prof. A. Fujimaki et al.
• Nagoya Univ. (CAD for logic design and arithmetic circuits): Prof. N. Takagi (Leader) et al.
Goals of the Project
Discovering appropriate applications
Developing compiler tools
Developing performance analyzing tools
Designing and implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits
LSRDP General Architecture and Specifications
Parameters Should Be Decided Within the LSRDP Design Procedure
[Figure: a matrix of PEs (PE1, PE2, PE3, ..., PEm per row) of a given Width and Height, with Operand Routing Networks (ORN) between rows and Streaming Buffers (SB) at the top and bottom]
• Core structure: a matrix of PEs. Width and Height?
• PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)
• Layout: FU types (ADD/SUB and MUL)?
• Maximum Connection Length (MCL) between consecutive rows?
• Reconfiguration mechanism? (PE, ORN, Immediate data)
• On-chip memory configuration?
LSRDP Architecture
Processing Elements (PEs)
• FU: implements basic 64-bit double-precision floating-point operations, including ADD, SUB and MUL
• TU (Transfer Unit): a routing resource for transferring data from a row to a non-consecutive row
[Figure: PE including two components (FU and TU) with four functionalities]
Layout Types - Type I
[Figure: W × H array of PEs with ORNs between rows; each PE contains A, M and T. A: ADD/SUB, M: MUL, T: Transfer Unit]
Each PE implements ADD/SUB and MUL
Flexible, but consumes a lot of resources
Layout Types - Type II (Checkered)
[Figure: W × H array of PEs with ORNs between rows; ADD/SUB+TU and MUL+TU PEs alternate in a checkered pattern. A: ADD/SUB, M: MUL, T: Transfer Unit]
Each PE implements ADD/SUB or MUL
Layout Types - Type III (Striped)
[Figure: W × H array of PEs with ORNs between rows; rows of MUL+TU PEs alternate with rows of ADD/SUB+TU PEs. A: ADD/SUB, M: MUL, T: Transfer Unit]
Each PE implements ADD/SUB or MUL
Type II or III, which one is more efficient?
Maximum Connection Length (MCL)
[Figure: PEs (i,0) ... (i,j), (i,j+1), (i,j+2) ... in row i connected through the ORN to PEs (i+1,0) ... (i+1,j), (i+1,j+1), (i+1,j+2) ... (i+1,j+L) in row i+1; example connections of length 0 and length 2, and the longest connection of length L]
MCL: maximum horizontal distance between two PEs located in two consecutive rows
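The MCL definition can be illustrated with a minimal Python sketch. It assumes PEs are addressed as (row, column) pairs as in the figure, and that a placed DFG is given as a list of edges between consecutive rows. The function names and the example edges are hypothetical and are not part of the LSRDP tool chain.

```python
def connection_length(src_col, dst_col):
    """Horizontal distance between a PE in row i and a PE in row i+1."""
    return abs(dst_col - src_col)


def max_connection_length(edges):
    """Largest connection length over all edges of a placed DFG.

    edges: iterable of ((src_row, src_col), (dst_row, dst_col)) pairs,
    each connecting a PE to a PE in the next row (assumed input format).
    """
    mcl = 0
    for (src_row, src_col), (dst_row, dst_col) in edges:
        if dst_row != src_row + 1:
            raise ValueError("edges must connect consecutive rows")
        mcl = max(mcl, connection_length(src_col, dst_col))
    return mcl


# Hypothetical placement: connection lengths are 0, 2 and 1.
example_edges = [((0, 0), (1, 0)), ((0, 1), (1, 3)), ((1, 3), (2, 2))]
print(max_connection_length(example_edges))
```

A mapping fits a given architecture only if this value does not exceed the MCL supported by its ORNs.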
An ORN Structure
A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.
[Figure: two rows of FPUs and transfer units (T) connected through an ORN built from CB, ½CB and T2 blocks; legend: FPU, 2-bit shift register, ORN]
The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches
Dynamic Reconfiguration Mechanism
[Figure: PE with inputs Input-A, Input-B and Input-C from the ORN, MUXes, a 64-bit Immediate Register, an FU (A op B) and a Transfer Unit]
Conf. Reg. [bit]: $\log(2 \times (2\,\mathrm{MCL}+1)) \times 3$
Three bit-stream lines for dynamic reconfiguration of:
• Immediate registers (64 bit) in each PE
• Selector bits for muxes selecting the input data of FUs
• Cross-bar switches in ORNs
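The configuration register size given on the slide, log(2 × (2MCL+1)) × 3 bits, can be evaluated with a short Python sketch. It assumes a base-2 logarithm rounded up to whole bits, and it reads 2 × (2MCL+1) as the number of candidate sources for each of the three PE inputs; both are assumptions, and the function name is hypothetical.

```python
import math


def config_register_bits(mcl, inputs=3):
    """Selector bits for one PE, following log(2 x (2*MCL+1)) x 3.

    Assumptions: the logarithm is base 2 and is rounded up to an integer
    number of bits; `inputs` is the number of PE inputs (A, B and C).
    """
    sources = 2 * (2 * mcl + 1)  # assumed number of selectable sources
    bits_per_input = math.ceil(math.log2(sources))
    return bits_per_input * inputs


for mcl in (0, 1, 2, 4):
    print(mcl, config_register_bits(mcl))
```

Under these assumptions the register grows only logarithmically with the MCL, so a longer MCL mainly costs ORN resources rather than configuration bits.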
Design Procedure and Tool Chain
Compiler and Design Flow
[Figure: compiler and design flow. Application Code goes through Hardware-Software Partitioning (manual or automatic) into Critical Parts (h/w part) and Non-Critical Parts (s/w part). Other boxes: DFG Generation, DFGs, Mapping, Placement, Routing, Port positioning, Bit-Stream Generation, Configurations (for LSRDP), s/w Part Modification, Binary Code (for GPP), LSRDP Architecture, and Design Phase: Analyzing DFG mapping results]
• DFGs are manually generated from critical parts of applications
• DFG mapping results are used for ...
In order to obtain $u^{(n+1)}(x_i, y_j)$ in the next iteration, the current values of five variables, i.e. $u^{(n)}(x_i, y_j)$, $u^{(n)}(x_{i\pm1}, y_j)$ and $u^{(n)}(x_i, y_{j\pm1})$, are needed
Red/Black Gauss-Seidel
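The five-point dependence above can be made concrete with a minimal Python sketch of one red/black sweep. It assumes the standard 5-point discretization of the Poisson equation on a square grid with spacing h and fixed boundary values, plus an optional relaxation factor omega (omega = 1 gives plain Gauss-Seidel). The function name, the sign convention for f and the grid setup are assumptions for illustration, not the presenters' code.

```python
def red_black_sweep(u, f, h, omega=1.0):
    """One in-place red/black sweep for the 5-point Poisson stencil.

    u, f: square grids as lists of lists; the boundary of u stays fixed.
    Assumed equation: laplacian(u) = f, discretized with grid spacing h.
    """
    n = len(u)
    for color in (0, 1):  # 0: "red" points, 1: "black" points
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 != color:
                    continue
                # Neighbours of a red point are all black, and vice versa,
                # so every point of one colour can be updated independently.
                gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                             + u[i][j - 1] + u[i][j + 1]
                             - h * h * f[i][j])
                u[i][j] = (1.0 - omega) * u[i][j] + omega * gs
    return u


# Hypothetical 5x5 test problem: boundary fixed at 1, zero right-hand side.
n = 5
u = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
      for j in range(n)] for i in range(n)]
f = [[0.0] * n for _ in range(n)]
for _ in range(200):
    red_black_sweep(u, f, h=0.25)
print(u[2][2])
```

Because points of one colour depend only on points of the other colour, each half-sweep is a set of independent stencil evaluations, which is the kind of regular computation that a DFG can express.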
Example of extracted DFGs (Poisson)
Maximum Poisson DFG
• Inputs: 32
• Outputs: 1
• FUs: 721
• Immediates: 364
Performance Evaluation: Simulation Environment
[Figure: GPP, Main Memory and LSRDP]
• GPP: exec. time measurement by processor simulator
• LSRDP: estimation by performance modeling
Variable parameters:
• Freq. of GPP and LSRDP
• Bandwidth between main memory and LSRDP
• Latency of reconfiguration time
• # of FPUs in LSRDP
• Supporting FPU types (Add, Mul, Div, Exp, Sqrt, Error function units are supported)
Use streaming buffer in the LSRDP chip
I/O data is sorted in the main memory.
[Figure: Estimated performance improvement of 1-dim heat equation by LSRDP calc. Bars: Only GPP, and LSRDP at main memory bandwidths of 12.8, 25.6, 51.2 and 102.4 GByte/sec; vertical axis from 0 to 1.2; stacked components: Trec, Ttra, Trearr, Tst, Tcal, ETgpp; horizontal axis: Main Mem. bandwidth [GByte/sec]]
[Figure: Estimated performance improvement of 1-dim heat equation by LSRDP calc. (Heat). Bars at main memory bandwidths of 12.8, 25.6, 51.2 and 102.4 GByte/sec; stacked components: Trec, Ttra, Trearr, Tst, Tcal, ETgpp; vertical axis from 0 to 0.35: Normalized exec. time by GPP (3 GHz) calc.; horizontal axis: Main Mem. bandwidth [GByte/sec]]
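Increasing the number of iterations by enlarging the DFG in the Poisson Red/Black method
[Figure: two DFG sizes]
• One evaluation of the SOR formula: inputs from 4+1 nodes, output for the 1 center node
• Two iterations of the SOR formula: inputs from 9+4 nodes, output for the 1 center node
• Enlarging the DFG increases the number of iterations that can be computed at once
• The number of required inputs increases accordingly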
Implementation of Heat calculation to LSRDP
Original GPP code:
  Loop j
    Loop i
      T(xi,tj)
    End Loop
  End Loop
LSRDP code:
  LSRDP Reconfiguration
  Loop j'
    Input Data Rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data Rearrangement
  End Loop
Implementation of Poisson calculation to LSRDP
Original GPP code:
  Loop Iter
    Loop i
      Loop j
        u(xi,yj)
      End Loop
    End Loop
  End Loop
LSRDP code:
  LSRDP Reconfiguration
  Loop Iter'
    Input Data rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data rearrangement
  End Loop
Implementation of ERI-Rec calculation to LSRDP
Original GPP code:
  Loop I,J,K,L
    Loop contraction
      Initial Integral Calc.
      Recursive Calc.
    End Loop
    Partial Fock Calc.
  End Loop
LSRDP code:
  Loop I,J,K,L
    LSRDP Reconfiguration
    Loop contraction
      Initial Integral Calc.
    End Loop
    Input Data rearrangement
    Loop N
      LSRDP pipeline calc. (Recursive DFG calc.)
    End Loop
    Output Data rearrangement
    Partial Fock Calc.
  End Loop
• Initial Integral Calc.: 1/Sqrt, Exp, Fm(T) are utilized => GPP calculation
• Recursive Calc.: only ADD/SUB, MUL => LSRDP calculation
Vertical vs. Horizontal DFG Decomposition
Original:
  Loop N
    Reconfiguration
    Loop M
      LSRDP pipeline calc.
    End Loop
  End Loop
Vertical Decomp.:
  Loop n ( > N)
    Reconfiguration
    Loop M
      LSRDP pipeline calc.
    End Loop
  End Loop
Horizontal Decomp.:
  Loop N
    Reconfiguration
    Loop M
      1st LSRDP pipeline calc.
    End Loop
  End Loop
  Loop N
    Reconfiguration
    Loop M
      2nd LSRDP pipeline calc.
    End Loop
  End Loop
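The loop structures above differ in how often the LSRDP is reconfigured and how many pipeline calculations are issued. A rough Python sketch of that bookkeeping follows; the helper name and the values chosen for N, n and M are hypothetical, and the sketch only counts loop bodies without modeling their duration.

```python
def loop_counts(outer, inner, passes=1):
    """Return (reconfigurations, pipeline calculations) for a loop nest.

    outer:  trip count of the loop containing the reconfiguration
    inner:  trip count of the pipeline-calculation loop (M on the slide)
    passes: number of such loop nests executed one after another
    """
    return passes * outer, passes * outer * inner


N, n, M = 10, 15, 100  # hypothetical trip counts, with n > N

print("original:  ", loop_counts(N, M))
print("vertical:  ", loop_counts(n, M))
print("horizontal:", loop_counts(N, M, passes=2))
```

With these made-up numbers the two-pass variant doubles both counts, while the single-pass variant with n > N scales them by n/N; which one is cheaper in practice depends on the reconfiguration latency and on how much data each pass moves.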
Performance Evaluation: Execution Time Modeling
$ET_{TOTAL} = ET_{GPP} + ET_{LSRDP} + T_{Overhead}$
$ET_{LSRDP} = T_{cal} + ST_{LSRDP}$
• $ET$: execution time
• $T_{cal}$: calculation time; total pipeline depth in the given program + # of rows of LSRDP (latency of LSRDP)
• $ST_{LSRDP}$: stall time; latency of LSRDP <-> Mem for first input and last output + stall from Bandwidth_req > Bandwidth_mem
• $T_{Overhead}$: sort data + reconfig. + send signal for comm.
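The execution time model can be written down as a small Python sketch. The class and field names are hypothetical, the sample numbers are made up, and the conversion of the calculation time from cycles to seconds by dividing by the LSRDP frequency is an assumption rather than something stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class ExecutionTimeModel:
    """ET_TOTAL = ET_GPP + ET_LSRDP + T_Overhead, ET_LSRDP = T_cal + ST_LSRDP."""
    et_gpp: float      # time of the part left on the GPP
    t_cal: float       # LSRDP calculation time
    st_lsrdp: float    # LSRDP stall time
    t_overhead: float  # data sorting, reconfiguration, communication signal

    @property
    def et_lsrdp(self):
        return self.t_cal + self.st_lsrdp

    @property
    def et_total(self):
        return self.et_gpp + self.et_lsrdp + self.t_overhead


def calculation_time(total_pipeline_depth, lsrdp_rows, lsrdp_freq_hz):
    """T_cal as (total pipeline depth + number of LSRDP rows) cycles.

    Dividing by the clock frequency to get seconds is an assumption.
    """
    return (total_pipeline_depth + lsrdp_rows) / lsrdp_freq_hz


# Made-up values, for illustration only.
model = ExecutionTimeModel(et_gpp=1.0, t_cal=2.0, st_lsrdp=3.0, t_overhead=4.0)
print(model.et_lsrdp, model.et_total)
```

Keeping the terms separate makes it easy to vary one parameter at a time, for example the stall time as a function of memory bandwidth.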
Layout Types - Type I
[Figure: W × H array of PEs with ORNs between rows; each PE contains A, M and T. A: ADD/SUB, M: MUL, T: Transfer Unit]
Each PE implements ADD/SUB and MUL
Total No. of PEs = W * H
Total Area = W * H * [Area(MUL) + Area(ADD/SUB) + Area(TU)] + Area(ORNs)
Layout Types - Type II
[Figure: W × H array of PEs with ORNs between rows; ADD/SUB+TU and MUL+TU PEs alternate in a checkered pattern. A: ADD/SUB, M: MUL, T: Transfer Unit]
Each PE implements ADD/SUB or MUL
Total No. of PEs = W * H
Total Area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
Layout Types - Type III
[Figure: W × H array of PEs with ORNs between rows; rows of MUL+TU PEs alternate with rows of ADD/SUB+TU PEs. A: ADD/SUB, M: MUL, T: Transfer Unit]
Each PE implements ADD/SUB or MUL
Total No. of PEs = W * H
Total Area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
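The area formulas for the three layout types can be compared with a short Python sketch. The function names are hypothetical and the unit areas in the example are made-up numbers chosen only to exercise the formulas; they are not measurements of SFQ cells.

```python
def area_type1(w, h, a_mul, a_add, a_tu, a_orn):
    """Type I: every PE holds MUL, ADD/SUB and a TU."""
    return w * h * (a_mul + a_add + a_tu) + a_orn


def area_type2_or_3(w, h, a_mul, a_add, a_tu, a_orn):
    """Type II and III: half the PEs hold MUL+TU, half hold ADD/SUB+TU."""
    return (0.5 * w * h * (a_mul + a_tu)
            + 0.5 * w * h * (a_add + a_tu)
            + a_orn)


# Made-up unit areas for a 4 x 4 array.
params = dict(w=4, h=4, a_mul=3, a_add=2, a_tu=1, a_orn=10)
print(area_type1(**params))
print(area_type2_or_3(**params))
```

Types II and III share the same formula, so under this model they differ only in how well DFGs map onto them, not in area.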
CB Various Functionalities
• CB: two inputs/two outputs; four cases are possible; reconfigurable
• ½CB: one input/two outputs; four cases are possible; reconfigurable
[Figure: the four configurations 00, 01, 10 and 11 of the CB and of the ½CB]
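A behavioural Python sketch of switches with four 2-bit configurations is given below. The slide only states that four cases exist for each switch; the specific meaning assigned to each code here (straight, crossed and the two broadcasts for the CB, and per-output enable bits for the ½CB) is an assumption made for illustration, and the function names are hypothetical.

```python
def crossbar_2x2(config, in0, in1):
    """Return (out0, out1) of a 2-input/2-output switch.

    The code-to-behaviour mapping is assumed, not taken from the slides.
    """
    table = {
        0b00: (in0, in1),  # straight
        0b01: (in1, in0),  # crossed
        0b10: (in0, in0),  # broadcast input 0
        0b11: (in1, in1),  # broadcast input 1
    }
    return table[config]


def half_crossbar(config, in0, idle=None):
    """Return (out0, out1) of a 1-input/2-output switch.

    Assumed mapping: the high bit enables out0, the low bit enables out1.
    """
    out0 = in0 if config & 0b10 else idle
    out1 = in0 if config & 0b01 else idle
    return out0, out1


for code in range(4):
    print(format(code, "02b"), crossbar_2x2(code, "a", "b"),
          half_crossbar(code, "x"))
```

Two configuration bits per switch are enough to select among four cases, which matches the 00/01/10/11 labels in the figure.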
The number of FPUs is M, and the number of Transfer Units (T) is also M; MCL is the maximum connection length if we consider FPUs only =>