Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations
2012/12/07, The Third International Conference on Networking and Computing, International Workshop on Challenges on Massively Parallel Processors (CMPP), 11:00–11:30 (25-minute presentation, 5-minute question and discussion time)
Ryohei Kobayashi†1, Shinya Takamaeda-Yamazaki†1,†2, Kenji Kise†1
†1 Tokyo Institute of Technology, Japan / †2 JSPS Research Fellow, Japan
Transcript
An iterative computation that updates a data set using nearest-neighbor values, called a stencil.
One method to obtain approximate solutions of partial differential equations (used in, e.g., thermodynamics, hydrodynamics, electromagnetism, ...).
[Figure: the data set is updated from time-step k to the next time-step]
v1[i][j] is updated by the summation of four values (Cx: weighting factor).
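A minimal sketch of this update in C (not the authors' code; the neighbor pattern follows the "up, down, left and right grid-points multiplied by a weighting factor and summed" wording used later for the MADD unit, and all names are illustrative):

```c
/* One stencil time-step: each interior point of v1 becomes the weighted sum
 * of its four neighbors in v0 (four multiplications and three additions). */
void stencil_step(int n, int m, float v0[n][m], float v1[n][m],
                  float c0, float c1, float c2, float c3)
{
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < m - 1; j++) {
            v1[i][j] = c0 * v0[i - 1][j]   /* up    */
                     + c1 * v0[i + 1][j]   /* down  */
                     + c2 * v0[i][j - 1]   /* left  */
                     + c3 * v0[i][j + 1];  /* right */
        }
    }
}
```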
Motivation(2/2)
Small or big?? [comparison figure]
ScalableCore System
A tile-architecture simulator built from multiple low-end FPGAs
►A high-speed simulation environment for many-core processor research
►We use the hardware components of the system as an infrastructure for HPC hardware accelerators.
[Photo: one FPGA node, carrying an FPGA, SRAM, and a configuration PROM]
*Takamaeda-Yamazaki, S. (ARC 2012).
Our Plan
One node → 4 nodes (2×2) → 100 nodes (10×10): the final goal
Now implementing
Parallel Stencil Computation Using Multiple FPGAs
Block Division and Assignment to Each FPGA
・The data set is divided into several blocks according to the number of FPGAs (a sketch follows below).
・Each FPGA performs the stencil computation on its block in parallel.
[Figure legend: grid-point; group of grid-points assigned to one FPGA; communication; data subset communicated with the neighbor FPGAs]
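A hedged sketch of this decomposition in C (illustrative names only; the slides do not give the mapping code): the global N×M grid is cut into P×Q equal blocks, one per FPGA, and after every Iteration each FPGA exchanges only its block-boundary values with its neighbor FPGAs.

```c
#include <assert.h>

/* Which FPGA in a P x Q array owns global grid-point (i, j) when the
 * N x M grid is divided into P x Q equal blocks (one block per FPGA). */
typedef struct { int x, y; } fpga_id;

fpga_id owner_of(int i, int j, int N, int M, int P, int Q)
{
    assert(N % P == 0 && M % Q == 0);            /* assume an even division */
    fpga_id id = { i / (N / P), j / (M / Q) };
    return id;
}
```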
The Computing Order of Grid-points on FPGA
[Figure: computing orders of the grid-points, including the proposed method]
Our proposed method increases the acceptable communication latency! Now, let's compare model (a) with the proposed method.
Comparison between (a) and (b) (1/2)
・"Iteration": the sequential process of computing all the grid-points at one time-step.
・We suppose that updating the value of one grid-point takes exactly one cycle.
・Each FPGA updates its assigned data of sixteen grid-points (0 to 15) during every Iteration.
In order not to stall the computation of B1, the value of A13 must be communicated within three cycles (14, 15, 16) after it is computed…
In order not to stall the computation of D1 in Iteration 2 (17th cycle), the margin for sending the value of C1 (computed in the 1st cycle) is 15 cycles.
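(A cross-check, under the assumption that the sixteen grid-points form a 4×4 block: the first case allows N − 1 = 3 cycles of margin and the second allows N × M − 1 = 15 cycles, matching the two figures above and the general case on the next slide.)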
Comparison between (a) and (b) (N×M grid-points)
(a) If the N×M grid-points are assigned to a single FPGA, every shared value must be communicated within N−1 cycles.
(b) Proposed method: if the N×M grid-points are assigned to a single FPGA, every shared value must be communicated within N×M−1 cycles.
[Figure: two neighboring FPGAs, each holding an N×M block; the timelines show an N−1 cycle window before the Iteration end for (a) and an N×M−1 cycle window for (b)]
The proposed method increases the acceptable communication latency!!
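As a rough worked example (assuming for illustration the 64×128 block per FPGA used later in the evaluation, i.e., N = 64 and M = 128): (a) allows N − 1 = 63 cycles of communication latency, while the proposed method allows N × M − 1 = 8,191 cycles, a window roughly 130 times larger.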
Computing Order with the Proposed Method Applied
[Figure: the computation order over the grid-points]
This method ensures a margin of about one Iteration. As the number of grid-points increases, the acceptable latency scales with it.
Architecture and Implementation
System Architecture
[Block diagram: a Spartan-6 FPGA with its configuration ROM (XCF04S) and JTAG port; four Ser/Des links and mux trees (mux8, mux2) carrying data to/from the adjacent units to the north, south, east, and west; clock and reset inputs; a memory unit built from BlockRAMs; and a computation unit of eight MADD units controlled by GATE[0]–GATE[3]]
Relationship between the Data Subset and BlockRAM (Memory unit)
BlockRAM: low-latency SRAM embedded in each FPGA.
The data set assigned to each FPGA is split in the vertical direction and stored across the BlockRAMs (0–7).
If a 64×128 data set is assigned to one FPGA, each split data set (8×128) is stored in one of the BlockRAMs (0–7), as sketched below.
[Figure: a 4×4 FPGA array with the data assigned, and the BlockRAMs inside one FPGA]
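An illustrative address-mapping sketch in C (an assumption; the slides only state the 64×128 → eight 8×128 split):

```c
/* The 64x128 block assigned to one FPGA is cut along its 64-wide direction
 * into eight 8x128 slices, one slice per BlockRAM (0..7). */
enum { BLOCK_W = 64, BLOCK_H = 128, N_BRAM = 8, SLICE_W = BLOCK_W / N_BRAM };

/* Which BlockRAM holds column x, and the word address of (x, y) inside it. */
static inline int bram_id(int x)          { return x / SLICE_W; }
static inline int bram_addr(int x, int y) { return y * SLICE_W + (x % SLICE_W); }
```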
Relationship between MADD and BlockRAM (Memory unit)
・The data set stored in each BlockRAM is computed by its corresponding MADD.
・Each MADD performs its computation in parallel.
・The computed data is written back to the BlockRAM.
MADD Architecture(Computation unit)
MADD
►Multiplier: seven pipeline stages
►Adder: seven pipeline stages
►Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754.
Grid-points 1–8 are loaded from BlockRAM and input to the multiplier in cycles 0–7.
[Pipeline diagram: grid-points 1–8 flowing through the multiplier and into the adder inputs (Input1/Input2), each unit drawn as 8 stages]
MADD Pipeline Operation (in cycles 8〜15)
The computation of grid-points 11–18
The computation results are output from the multiplier and, at the same time, grid-points 10–17 are input to the multiplier in cycles 8–15.
[Pipeline diagram: grid-points 10–17 following 1–8 through the multiplier and adder stages]
MADD Pipeline Operation (in cycles 16〜23)
The computation of grid-points 11–18
Grid-points 12–19 are input to the multiplier and, at the same time, the values of grid-points 1–8 and 10–17 multiplied by a weighting factor are summed in cycles 16–23.
The computation results, in which the data of the up, down, left, and right grid-points are multiplied by a weighting factor and summed, are output in cycles 40–48.
[Pipeline diagram: grid-points 11–18 and 20–27 in the pipeline]
MADD Pipeline Operation(Computation unit)
The filling rate of the pipeline: ((N − 8) / N) × 100%, where N is the number of cycles taken by this computation (a worked example follows below).
► Achieves high computation performance with a small circuit area.
► This scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder.
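(For instance, with a hypothetical pass of N = 128 cycles, the filling rate would be ((128 − 8) / 128) × 100% = 93.75%.)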
Initialization Mechanism(1/2)
[Figure: a 4×4 FPGA array with the Master at (0,0); arrows labeled "x-coordinate + 1" and "y-coordinate + 1" show how the coordinates propagate from node to node]
・To determine its computation order, every FPGA uses its own position coordinate in the system (a minimal model is sketched below).
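A minimal host-side model of this coordinate propagation in C (an assumption for illustration; the real mechanism is implemented in the FPGA logic, and the exact protocol is not shown on the slide):

```c
/* The master at (0,0) starts; every other node takes its x-coordinate from
 * its west neighbor plus one and its y-coordinate from its north neighbor
 * plus one, so the coordinates spread across the whole P x Q array. */
void assign_coords(int P, int Q, int xs[P][Q], int ys[P][Q])
{
    for (int j = 0; j < Q; j++) {         /* rows, north to south  */
        for (int i = 0; i < P; i++) {     /* columns, west to east */
            xs[i][j] = (i == 0) ? 0 : xs[i - 1][j] + 1;   /* from the west  */
            ys[i][j] = (j == 0) ? 0 : ys[i][j - 1] + 1;   /* from the north */
        }
    }
}
```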
Initialization Mechanism(2/2)
・The array system must precisely synchronize the start timing of the computation in the first Iteration.
・If there is a skew, the array system cannot obtain the data of the communication region needed for the next Iteration.
[Figure: the start signal of the computation being sent to every FPGA in the 4×4 array]
Evaluation
Environment
FPGA: Xilinx Spartan-6 XC6SLX16
► BlockRAM: 72 KB
Design tool: Xilinx ISE WebPACK 13.3
Hardware description language: Verilog HDL
Implementation of MADD: IP core generated by the Xilinx CORE Generator
► Implementing a single MADD uses four of the 32 DSP blocks that a Spartan-6 FPGA has.
◇Therefore, at most eight MADDs can be implemented in a single FPGA (32 / 4 = 8).
[Photo: the ScalableCore board; hardware configuration of the FPGA array]
The SRAM on the board is not used.
Performance of Single FPGA Node(1/2)
Grid size: 64×128
Iterations: 500,000
Performance and power consumption (160 MHz)
►Performance: 2.24 GFlop/s
►Power consumption: 2.37 W
Peak = 2 × F × N_FPGA × N_MADD × 7/8
Peak: peak performance [GFlop/s], F: operating frequency [GHz], N_FPGA: the number of FPGAs, N_MADD: the number of MADDs, 7/8: average utilization of a MADD unit (four multiplications and three additions per grid-point update).
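As a sanity check using the figures above (F = 0.16 GHz, N_FPGA = 1, N_MADD = 8): Peak = 2 × 0.16 × 1 × 8 × 7/8 = 2.24 GFlop/s, which agrees with the performance reported for the single FPGA node.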