
An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer

F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*, and K. Murakami*

*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: [email protected], [email protected]

WAHA 2009, Kyushu University

Agenda

• Introduction
• Large-Scale Reconfigurable Data-Path (LSRDP)
• General Architecture and Specifications
• Design Procedure and Tool Chain
• Preliminary Results
• Conclusions and Future Work


Introduction

Parallel computer clusters built from General-Purpose Processors (GPPs) are often used for high-performance computing (HPC).

Various accelerators are used with GPPs for further performance improvement:
• PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc.
• Small size and low power consumption compared to processors with similar performance

http://www.elsa-jp.co.jp/products/hpc/tesla/s1070/index.html

TSUBAME

http://it.nikkei.co.jp/

NVIDIA Tesla S1070

http://www.top500.org/system/9485

Roadrunner with PowerXcell


Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)

Conventional accelerators demand a large memory bandwidth for high-performance computation.

On-chip memories are often used to hide memory access latency.

Large-Scale Reconfigurable Data-Path (LSRDP):
• is introduced as an alternative accelerator
• reduces the number of memory accesses
• is implemented by Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
• is suitable for high-performance scientific computations


Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor

Features:
• Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped
• Pipelined execution
• Burst transfer is used for input/output of rearranged data from/to memory

[Figure: block diagram with Main Memory, GPP, Scratchpad Memory, SMAC, and the LSRDP (Streaming Buffers (SB), rows of FUs, and ORNs between the rows). ORN: Operand Routing Network]

The reconfigurable data-path includes:
• A large number of floating-point Functional Units (FUs)
• A reconfigurable Operand Routing Network (ORN)
• Dynamic reconfiguration facilities
• Streaming Buffers (SBs) for I/O ports

Implementation by SFQ circuits


Single-Flux Quantum (SFQ) against CMOS

CMOS issues:
• High electric power consumption
• High heat radiation and difficulties in high-density packing
• Memory wall problem, which limits the processing speed

SFQ features:
• High-speed switching and signal transmission
• Low power consumption
• Compact implementation of a system (small area)
• No cost for latch
• Suitable for pipeline processing of data streams
• Serial bit-level processing


CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits

SFQ-LSRDP project groups:
• Kyushu Univ. (Prof. K. Murakami, Dr. K. Inoue, Dr. H. Honda, Dr. F. Mehdipour, H. Kataoka): architecture, compiler, and applications
• Superconducting Research Lab. (SRL) (Dr. S. Nagasawa et al.): SFQ process
• Yokohama National Univ. (Prof. N. Yoshikawa et al.): SFQ-FPU chip, cell library
• Nagoya Univ. (Prof. A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
• Nagoya Univ. (Prof. N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits


Goals of the Project

• Discovering appropriate applications
• Developing compiler tools
• Developing performance analysis tools
• Designing and implementing the SFQ-LSRDP architecture, considering the features and limitations of SFQ circuits

LSRDP General Architecture and Specifications


Parameters to Be Decided Within the LSRDP Design Procedure

[Figure: the LSRDP core as a Width × Height matrix of PEs (PE1, PE2, PE3, ..., PEm in each row), with an Operand Routing Network (ORN) between consecutive rows and Streaming Buffers (SB)]

• Core structure: a matrix of PEs. Width and Height?
• PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)
• Layout: FU types (ADD/SUB and MUL)?
• Maximum Connection Length (MCL) between consecutive rows?
• Reconfiguration mechanism (PE, ORN, immediate data)?
• On-chip memory configuration?


LSRDP Architecture

Processing Elements (PEs)

FU:
• implements basic 64-bit double-precision floating-point operations: ADD, SUB, and MUL

TU (Transfer Unit):
• a routing resource for transferring data from a row to a non-consecutive row

A PE includes two components (FU and TU) and provides four functionalities: FU, TU, FU + TU, and TU + TU.


Layout Types - Type I

[Figure: W × H array of PEs with an ORN between consecutive rows. Every PE contains an ADD/SUB unit (A), a MUL unit (M), and a Transfer Unit (T)]

• Each PE implements ADD/SUB and MUL
• Flexible, but consumes a lot of resources


Layout Types - Type II (Checkered)

[Figure: W × H array of PEs with an ORN between consecutive rows. PEs of type MUL + TU (M T) and ADD/SUB + TU (A T) alternate in a checker pattern]

• Each PE implements ADD/SUB or MUL


Layout Types - Type III (Striped)

[Figure: W × H array of PEs with an ORN between consecutive rows. Rows of MUL + TU PEs (M T) alternate with rows of ADD/SUB + TU PEs (A T)]

• Each PE implements ADD/SUB or MUL

Type II or III: which one is more efficient?


Maximum Connection Length (MCL)

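[Figure: PEs (i, 0), ..., (i, j), (i, j+1), (i, j+2) in row i are connected through the ORN to PEs (i+1, 0), ..., (i+1, j), (i+1, j+1), (i+1, j+2), ..., (i+1, j+L) in row i+1. Marked examples: connection length = 0, connection length = 2, and the longest connection length = L]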

MCL: maximum horizontal distance between two PEs located in two consecutive rows
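The MCL definition can be made concrete with a small sketch. It assumes, hypothetically, that a mapped DFG is available as a dictionary from node names to (row, column) PE positions plus a list of edges between consecutive rows; the function names and this data layout are illustrative and are not taken from the authors' tool chain.

```python
from typing import Dict, Iterable, Tuple

Position = Tuple[int, int]  # (row, column) of a PE in the array


def connection_length(src: Position, dst: Position) -> int:
    """Horizontal distance of a connection between two consecutive rows."""
    if dst[0] != src[0] + 1:
        raise ValueError("connections must go from row i to row i+1")
    return abs(dst[1] - src[1])


def maximum_connection_length(
    placement: Dict[str, Position], edges: Iterable[Tuple[str, str]]
) -> int:
    """MCL of a mapped DFG: the longest horizontal hop over all edges."""
    return max(
        (connection_length(placement[u], placement[v]) for u, v in edges),
        default=0,
    )


# Hypothetical placement of four nodes on two consecutive rows.
placement = {"a": (0, 0), "b": (0, 2), "c": (1, 0), "d": (1, 4)}
edges = [("a", "c"), ("b", "c"), ("b", "d"), ("a", "d")]
print(maximum_connection_length(placement, edges))
```

In this toy placement the edge from "a" to "d" spans four columns, so the ORN between the two rows would have to support a connection length of 4.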


An ORN Structure

[Figure: an ORN between two rows of FPUs and transfer units (T), built from ½CB, CB, and T2 (2-bit shift register) blocks]

The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches.

Source: A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.


Dynamic Reconfiguration Mechanism

[Figure: PE and ORN reconfiguration structure. Each PE contains an FU (A op B), a Transfer Unit, a 64-bit immediate register, and MUXes that select Input-A, Input-B, and Input-C from the ORN. Configuration register size: log(2 × (2MCL+1)) × 3 [bit]]

Three bit-stream lines are used for dynamic reconfiguration of:
• Immediate registers (64-bit) in each PE
• Selector bits for the MUXes selecting the input data of FUs
• Crossbar switches in ORNs
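As a rough illustration of the configuration-register size log(2 × (2MCL+1)) × 3 [bit] given on this slide, the sketch below evaluates it for a given MCL. It assumes a base-2 logarithm rounded up to whole bits and reads the factor 3 as the three PE inputs (A, B, C); both readings are assumptions, and the function name is hypothetical.

```python
def selector_bits_per_pe(mcl: int, inputs: int = 3) -> int:
    """Bits needed to select each PE input among 2 * (2*MCL + 1) sources.

    Assumption: log is base 2 and is rounded up to an integer number of bits.
    """
    if mcl < 0:
        raise ValueError("MCL must be non-negative")
    sources = 2 * (2 * mcl + 1)
    bits_per_input = (sources - 1).bit_length()  # exact ceil(log2(sources))
    return bits_per_input * inputs


for mcl in (4, 9, 19):
    print(mcl, selector_bits_per_pe(mcl))
```

Under these assumptions the register grows only logarithmically with MCL, whereas the ORN itself grows much faster.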


Design Procedure and Tool Chain


Compiler and Design Flow

[Figure: compilation and design flow. Application Code → Hardware-Software Partitioning (manual or automatic) → Critical Parts (h/w part) and Non-Critical Parts (s/w part). H/w part: DFG Generation → DFGs → Mapping (Placement, Routing, Port positioning; uses the LSRDP Architecture) → Bit-Stream Generation → Configurations (for LSRDP). S/w part: s/w Part Modification → Binary Code (for GPP). Design phase: analyzing DFG mapping results]

• DFGs are manually generated from critical parts of applications
• DFG mapping results are used for
  • analyzing LSRDP architecture statistics
  • generating LSRDP configuration bit-streams


LSRDP Design Procedure

Inputs: DFGs and LSRDP hardware constraints

For each parameter:
• Choosing a design parameter
• Mapping DFGs onto the LSRDP
• Obtaining required statistics
• Choosing the appropriate value
• Analyzing the mapping results

Output: an appropriate value for each parameter


Benchmark Applications for Design Procedures

Finite difference method calculation of 2nd-order partial differential equations:
• 1-dim heat equation (Heat)
• 1-dim vibration equation (Vibration)
• 2-dim Poisson equation (Poisson)

Quantum chemistry application:
• Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)

Only ADD/SUB and MUL operations are used in the critical calculations of all the above applications.


DFG Extraction - Heat Equation

1-dim. heat equation for T(x,t):

$$\frac{\partial T(x,t)}{\partial t} = A\,\frac{\partial^2 T(x,t)}{\partial x^2} \qquad (A \text{ is const.})$$

Calculation by the Finite Difference Method (FDM):

$$T(x_i,t_{j+1}) = D * T(x_i,t_j) + B * \bigl\{T(x_{i+1},t_j) + T(x_{i-1},t_j)\bigr\}$$

[Figure: basic DFG with inputs T(i-1,j), T(i,j), T(i+1,j), immediates D and B, two additions and two multiplications, and output T(i,j+1)]

Basic DFG corresponding to the minimum FDM calculation

Basic DFG can be extended to horizontal and vertical directions to make a larger DFG
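A minimal sketch of what one basic DFG computes, applied across a whole row of grid points. It reads the slide's update as T(x_i,t_{j+1}) = D·T(x_i,t_j) + B·[T(x_{i+1},t_j) + T(x_{i-1},t_j)], i.e. two multiplications and two additions per output. The slide only states that B and D are constants; taking B = A·Δt/Δx² and D = 1 − 2B is the standard explicit scheme and is an assumption here, as are the function name and the fixed boundary values.

```python
from typing import List


def heat_step(T: List[float], B: float, D: float) -> List[float]:
    """One explicit FDM time step; the two boundary values are kept fixed."""
    new = list(T)
    for i in range(1, len(T) - 1):
        neighbours = T[i - 1] + T[i + 1]   # first ADD
        new[i] = D * T[i] + B * neighbours  # two MULs and the second ADD
    return new


# Hypothetical initial profile: a single hot point in the middle.
row = [0.0, 0.0, 1.0, 0.0, 0.0]
print(heat_step(row, B=0.25, D=0.5))
```

Chaining several such steps corresponds to extending the basic DFG vertically, while computing several neighbouring outputs at once corresponds to extending it horizontally.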


Example of extracted DFGs - Heat

[Figure: a huge sample DFG (Heat)]

Inputs: 32, Outputs: 16, Operations: 721, Immediates: 364


DFG Classification

Class     # of FUs   # of Inputs   # of Outputs
RDP-S     128        19            12
RDP-M     512        19            12
RDP-L     1024       38            24
RDP-XL    > 1024     64            52

# of DFGs (legible entries): Heat (3), Poi (1), Vib (2), Eri (4); Heat (1), Poi (1), Vib (1), Eri (4)

Due to the broad range of DFG sizes, DFGs are classified as S, M, L, or XL with respect to their size and the number of input/output nodes.

In total, 24 DFGs are prepared as benchmark DFGs.


Mapping DFGs onto LSRDP

Inputs: DFG and LSRDP architecture description

• Placing DFG nodes on the LSRDP
• Routing connections
• Placing I/O nodes
• Routing input/output connections

Output: configuration file

[Figure: mapped DFG with the longest connections marked]


Preliminary Results


LSRDP Specifications: Width & Height

LSRDP dimensions and the number of input/output ports:

           # of Input ports   # of Output ports   Width   Height
LSRDP-S    19                 12                  16      16
LSRDP-M    19                 12                  32      16
LSRDP-L    38                 24                  64      32


LSRDP Specifications: MCL

[Figure: connections from PE (i, j) to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1); MCL = L]

           MCL (avg/max)   ORN size: # of inputs (avg/max)   ORN size: # of outputs
LSRDP-S    4/8             18/34                             3
LSRDP-M    5/9             22/38                             3
LSRDP-L    5/9             22/34                             3

Needs further MCL optimization


Analyzing Various LSRDP Layouts

Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost (except for the ERI1 DFG, which gives a better size for Layout III).

Required size per layout (rows in slide order; benchmark labels on the slide: Vibration, Poisson, ERI1, ERI2, Heat):

Layout I   Layout II   Layout III
8x3        8x3         8x4
10x8       10x8        10x11
10x10      10x12       15x18
6x2        9x3         6x2
10x10      10x10       15x8


LSRDP at One Glance (1/2)

Functional units: ADD/SUB, MUL
Layout: Type II (checker pattern)
Operations: 64-bit floating point
Processing structure: Pipelined
PE structure: FU, T, FU+T, T+T

LSRDP size                Small         Medium        Large
No. of inp/out ports      19/12         19/12         38/24
Width/Height              16/16         32/16         64/32
Conf. bit-stream size
  Imm. regs               16*16*64      32*16*64      64*32*64
  ORNs                    16*BSS(ORN)   32*BSS(ORN)   64*BSS(ORN)
  PEs                     16*16*2       32*16*2       64*32*2
ORN inputs, outputs       22, 3         26, 3         26, 3

ORN structure: Cross-bar switch
ORN connection type: One-directional
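The configuration bit-stream entries in the table are simple products, which the following sketch evaluates. BSS(ORN) is not given a numeric value on the slide, so it appears here as the hypothetical parameter `orn_bits` (read as the bit-stream size of one ORN, which is an assumption); the function name is illustrative as well.

```python
from typing import Dict


def config_bitstream_bits(width: int, height: int,
                          orn_count: int, orn_bits: int) -> Dict[str, int]:
    """Configuration bit-stream size, split into the three table rows."""
    pes = width * height
    return {
        "imm_regs": pes * 64,           # one 64-bit immediate register per PE
        "pes": pes * 2,                 # 2 configuration bits per PE
        "orns": orn_count * orn_bits,   # orn_bits stands in for BSS(ORN)
    }


sizes = {"S": (16, 16, 16), "M": (32, 16, 32), "L": (64, 32, 64)}
for name, (w, h, orns) in sizes.items():
    # orn_bits=0 leaves the unknown ORN contribution out of the printout.
    print(name, config_bitstream_bits(w, h, orns, orn_bits=0))
```

Because the configuration is loaded bit-serially, the immediate registers dominate the known part of the bit-stream: 64 bits per PE against 2 selector bits per PE.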


LSRDP at One Glance (2/2)

Internal memory
  Type: Immediate registers
  Size and count: 64-bit registers, one register for each PE
  Communication mechanism: Serial

External memory
  No. of memory modules: 16
  Data transfer rate: 1800 Mbps/pin
  Overall data transfer rate: 24 GB/s
  Mem. to LSRDP bus width: 64 bit
  Channels per module: Two

Reconfiguration mechanism: Bit-serial configuration through a serial chain


Preliminary Performance Evaluation

Base processor configuration:
  Processor type: Out-of-order
  GPP operating frequency: 3.2 GHz
  Inst. issue width: 4 instructions/cc
  Inst. decode width: 4 instructions/cc
  Cache configuration:
    L1 data: 64 KB (128 B entry, 2-way, 2 cc)
    L1 instruction: 64 KB (64 B entry, 1-way, 1 cc)
    L2 unified: 4 MB (128 B entry, 4-way, 16 cc)
  Latency of main memory: 300 cc
  L2 to main memory: bus width 64 Bytes, freq. 800 MHz

GPP+LSRDP configuration:
  LSRDP operating frequency: 80 GHz
  Reconfiguration latency: 1 cc
  Latency, SPM <-> LSRDP: 1 cc
  Latency, main memory <-> SPM: 7500 cc
  Bandwidth, SPM <-> LSRDP: max. 64 * 8 Bytes/cc
  Bandwidth, main memory <-> SPM: 102.4 GB/sec

GPP: execution time measured by means of a processor simulator
LSRDP: estimated by performance modeling


Preliminary Performance Evaluation (Heat)

[Figure: stacked bar chart of execution time normalized by GPP execution time (axis 0 to 0.4) for Heat (M) and Heat (L), each in Basic and Reuse mode. Components: Reconf., Comm., Rearrange, Stall, LSRDP, GPP]

Basic: SB only
Reuse: SB + SPM

Data reuse is employed to avoid the need for data rearrangement as well as frequent data retrieval from the scratchpad memory.


Preliminary Performance Evaluation (Poisson)

[Figure: stacked bar chart of execution time normalized by GPP execution time (axis 0 to 0.4) for poisson(S), poisson(M), and poisson(L). Components: Reconf., Comm., Rearrange, Stall, LSRDP, GPP]

Only a small fraction of the time is spent processing on the LSRDP; the main fraction is due to the various overhead times as well as the execution time on the GPP.


Conclusions & Future Work

A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced.

24 benchmark Data Flow Graphs (DFGs) were manually generated.

The LSRDP micro-architecture was designed based on the characteristics of scientific applications via a quantitative approach.

The LSRDP is promising for resolving issues originating from CMOS technology as well as for achieving considerable performance.

Future Work:

• To achieve higher performance, the various overhead costs, mainly related to data management, must be reduced.

• To reduce the implementation cost of the LSRDP, we will focus on reducing the maximum connection length and the ORN size.


Acknowledgement

This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).


Thanks!

Any Questions?


Backup Slides



SFQ (Single Flux Quantum) Circuit

High speed, low power consumption, and operation by a different principle from CMOS

[Figure: a superconducting loop containing a Josephson junction and holding a single flux quantum. Labels: Φ0, L, Ic, Ib, tunneling effect; pulse of 2 mV and 2 ps]


Mapping Results

Class     # of FUs   # of Inputs   # of Outputs   Width (max/avg)   Height (max/avg)   Total # of PEs (max/avg)   Extra TUs (max/avg)
RDP-S     128        19            12             26/14.9           10/6.7             98/51.7                    56/23.5
RDP-M     512        19            12             26/17.1           16/9.3             170/77.8                   92/37.1
RDP-L     1024       38            24             58/40             24/14.4            730/260.1                  428/141.3
RDP-XL    > 1024     64            52             122/45.3          25/12.4            1217/350.4                 1065/240

For each class, a lot of extra TUs are needed to map all DFGs.

PE types: FU, T+FU, T, T+T


Connection Length Minimization - Results

Final optimized Maximum Connection Length (MCL) results:

         MCL (avg/max)
RDP-S    4/9
RDP-M    5/9
RDP-L    9.3/19

ORNs should provide a connection length of 9 in LSRDP-S/M (MCL = 9). For LSRDP-L, MCL = 19!

⇒ Serious implementation cost. Is it possible to decrease it?

[Figure: connections from PE (i, j) to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1); MCL = L]


Distributions of Connection Lengths

Average fraction of connection lengths in the RDP-S maps:

Connection length   0     1     2    3    4    5    6    7    8    9
Fraction            79%   10%   4%   3%   2%   1%   1%   0%   0%   0%

• 93% of connection lengths are 0 ~ 2
• Only a small fraction of the connections results in larger ORNs
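The 93% figure follows from a running sum over the distribution, as the short sketch below shows. The percentages are the rounded values from the slide; the helper `smallest_length_covering` is a hypothetical name introduced only to illustrate how a smaller MCL budget could be read off such a distribution.

```python
from itertools import accumulate

# Percentage of connections with length 0..9 in the RDP-S maps (from the slide).
fractions = [79, 10, 4, 3, 2, 1, 1, 0, 0, 0]
cumulative = list(accumulate(fractions))


def smallest_length_covering(target_percent: int) -> int:
    """Smallest connection length that covers at least target_percent of connections."""
    for length, covered in enumerate(cumulative):
        if covered >= target_percent:
            return length
    raise ValueError("target cannot be reached")


print(cumulative[2])                  # share of connections with length 0..2
print(smallest_length_covering(98))  # length needed to cover 98% of connections
```

With these rounded numbers, most connections are served by a short-range ORN; only the remaining few percent drive the network to its full MCL.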


Analyzing Various LSRDP Layouts

• Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost as well
• Similarly small sizes are achieved for Layout I and II for the majority of DFGs (except the ERI1 DFG, which gives a better size for Layout III)

Required size per layout (rows in slide order; benchmark labels on the slide: Vibration, Poisson, ERI1, ERI2, Heat):

Layout I   Layout II   Layout III
8x3        8x3         8x4
10x8       10x8        10x11
10x10      10x12       15x18
6x2        9x3         6x2
10x10      10x10       15x8


Why is only the ERI1 DFG suitable for Layout III?

[Figure: the Heat DFG shown with RDP Layout Type II (ADD/SUB and MUL PEs alternating within each row, ORNs between rows) and the ERI1 DFG shown with RDP Layout Type III (rows of MUL PEs alternating with rows of ADD/SUB PEs, ORNs between rows)]


FU Layout for DIV, SQRT, EXP operations

[Figure: LSRDP array of FU rows and ORNs (ORN: Operand Routing Network) with a DIV unit of three times larger latency; pipeline execution is based on ADD and MUL latency]

• Where should we place FUs with different latencies?
• Heterogeneous configuration of the FU array?

16-bit floating-point DIV, SQRT, and EXP functional units have already been developed with current SFQ technology.


Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.

[Figure: stacked bar chart. Y-axis: execution time normalized by GPP (3 GHz) calc. (0 to 8). X-axis: main memory bandwidth [GByte/sec], 12.8 and 102.4, for poisson(S), poisson(M), and poisson(L). Components: Treconf, Ttra, Trearr, Tst, Tcal, Tgpp]


Estimated performance improvement of ERI calculation by LSRDP

[Figure: stacked bar chart for ERI. Y-axis: normalized execution time (0 to 1). Bars: GPP only (3 GHz) and LSRDP. Components: Treconf, Ttra, Trearr, Tst, Tcal, Tgpp]


Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)

DFG sizes are already determined by the original recursive formula.

           No. of Operations   No. of Inputs   No. of Outputs
(ps,ss)    9                   8               3
(ps,ps)    51                  16              9
(pp,ss)    66                  14              9
(pp,ps)    252                 22              27
(pp,pp)    1004                28              81


What types of software/algorithms are suitable for LSRDP ?

• The same calculations have to be performed repeatedly: the LSRDP is used as a high-throughput accelerator.
• The input/output data size is small compared with the amount of operations.

[Figure: a small input and a small output around a large amount of calculations on the LSRDP, with memory access crossed out]


Exploration of suitable applications for LSRDP

Application:
• Matrix elements calculation
• Molecular integral calculations in the molecular orbital method
• Monte Carlo type simulation
• etc.

Numerical calculation library:
• Special function (promising?)
• Differential equation
• Numerical integration
• Matrix operation (difficult?)
• Triangular matrix
• Simultaneous equation
• etc.

Investigating applicability to various applications


Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.

~ Up to (pp,pp) Recursive Calculation ~

# of Inputs: max. 28
# of Outputs: 1 ~ 81

[Equations: recursion relations that build (p_i s, ss)^{(0)}, (p_i s, p_k s)^{(0)}, (p_i p_j, ss)^{(0)}, (p_i p_j, p_k s)^{(0)}, and (p_i p_j, p_k p_l)^{(0)} from lower integrals with superscripts (0) and (1), using the coefficients PA, PB, QC, QD, WP, WQ, and Z. The first relation reads:]

$$(p_i s, ss)^{(0)} = PA_i\,(ss,ss)^{(0)} + WP_i\,(ss,ss)^{(1)}$$

• (ss,ss)^{(m)} and all coefficients are given as input
• (i,j,k,l = x,y,z): a p function has 3 components (as a 1-dim array)
• Each DFG has only ADD (SUB) and MUL FUs
• DFG sizes are determined by the original calculation algorithm


[Figure: scatter plot "DFG distribution for each application". X-axis: # of Inputs (0 to 70). Y-axis: # of FUs (0 to 1000). Series: Poisson (3 DFGs), Vibration (7 DFGs), Heat (6 DFGs), ERI-Rec (8 DFGs)]

DFGs have different qualities in terms of the # of FUs and the # of inputs and outputs.


Example of MCL (Heat)

[Figure: the original Heat DFG (I/O: 8/4, FUs: 32) and its mapping result, with the MCL marked]


Example of extracted DFGs (ERI-Rec)

Maximum DFG of ERI-Rec, (pipj,pkpl): Inputs: 28, Outputs: 81, FUs: 1004, Immediates: 0

After vertical partitioning: Inputs: 24, Outputs: 1, FUs: 108, Immediates: 0


Poisson Equation

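2D Poisson equation:

$$\frac{\partial^2 u(x,y)}{\partial x^2} + \frac{\partial^2 u(x,y)}{\partial y^2} = f(x,y)$$

Successive Over-Relaxation (SOR) method (ω is const.):

$$u^{(n+1)}(x_i,y_j) = (1-\omega) * u^{(n)}(x_i,y_j) + \frac{\omega}{4}\Bigl\{u^{(n)}(x_{i+1},y_j) + u^{(n)}(x_{i-1},y_j) + u^{(n)}(x_i,y_{j+1}) + u^{(n)}(x_i,y_{j-1}) - h^2 f(x_i,y_j)\Bigr\}$$

In order to obtain u^{(n+1)}(x_i,y_j) in the next iteration, the current values of five variables, i.e. u^{(n)}(x_i,y_j), u^{(n)}(x_{i±1},y_j), and u^{(n)}(x_i,y_{j±1}), are needed.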

Red/Black Gauss Seidel
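A minimal sketch of one red/black SOR iteration for the 2D Poisson equation on a square grid with spacing h. It assumes the sign convention ∂²u/∂x² + ∂²u/∂y² = f, so that the source term enters as −h²·f; the function name, the nested-list grid, and the fixed (Dirichlet) boundary are illustrative choices rather than details from the slides.

```python
from typing import List

Grid = List[List[float]]


def red_black_sor_iteration(u: Grid, f: Grid, h: float, omega: float) -> Grid:
    """Update all interior points in place: first one colour, then the other."""
    n = len(u)
    for colour in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 != colour:
                    continue
                neighbours = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                u[i][j] = (1 - omega) * u[i][j] + omega / 4 * (neighbours - h * h * f[i][j])
    return u


# Hypothetical 3x3 problem: boundary fixed at 1, zero source term.
u = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
f = [[0.0] * 3 for _ in range(3)]
red_black_sor_iteration(u, f, h=1.0, omega=1.0)
print(u[1][1])
```

Points of one colour depend only on points of the other colour, so all updates of a colour are independent, which is what makes the scheme attractive for a pipelined data-path.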



Example of extracted DFGs (Poisson)

Maximum Poisson DFG: Inputs: 32, Outputs: 1, FUs: 721, Immediates: 364


Performance Evaluation: Simulation Environment

[Figure: GPP, main memory, and LSRDP]

GPP: execution time measured by a processor simulator
LSRDP: estimated by performance modeling

Variable parameters:
• Frequency of GPP and LSRDP
• Bandwidth between main memory and LSRDP
• Latency of reconfiguration
• # of FPUs in LSRDP
• Supported FPU types (Add, Mul, Div, Exp, Sqrt, and Error function units are supported)

The streaming buffer in the LSRDP chip is used.

I/O data is sorted in the main memory.


Estimated performance improvement of 1-dim heat equation by LSRDP calc.

[Figure: two stacked bar charts. Y-axis: execution time normalized by GPP (3 GHz) calc. First chart (0 to 1.2): GPP only versus LSRDP at 12.8, 25.6, 51.2, and 102.4. Second chart (0 to 0.35): the Heat LSRDP bars alone at 12.8, 25.6, 51.2, and 102.4. Components: Trec, Ttra, Trearr, Tst, Tcal, ETgpp]


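Increasing the number of iterations by enlarging the DFG in the Poisson Red/Black method

• Input of 4+1 nodes, output of 1 center node: one evaluation of the SOR formula
• Input of 9+4 nodes, output of 1 center node: two iterations of the SOR formula
• Enlarging the DFG increases the number of iterations that can be computed at once
• The required number of inputs increases accordingly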


Implementation of Heat calculation to LSRDP

Original GPP code:

Loop j
    Loop i
        T(xi,tj)
    End Loop
End Loop

LSRDP code:

LSRDP Reconfiguration
Loop j'
    Input Data Rearrangement
    Loop N
        LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data Rearrangement
End Loop


Implementation of Poisson calculation to LSRDP

Original GPP code:

Loop Iter
    Loop i
        Loop j
            u(xi,yj)
        End Loop
    End Loop
End Loop

LSRDP code:

LSRDP Reconfiguration
Loop Iter'
    Input Data rearrangement
    Loop N
        LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data rearrangement
End Loop


Implementation of ERI-Rec calculation to LSRDP

Original GPP code:

Loop I,J,K,L
    Loop contraction
        Initial Integral Calc.
        Recursive Calc.
    End Loop
    Partial Fock Calc.
End Loop

LSRDP code:

Loop I,J,K,L
    LSRDP Reconfiguration
    Loop contraction
        Initial Integral Calc.
    End Loop
    Input Data rearrangement
    Loop N
        LSRDP pipeline calc. (Recursive DFG calc.)
    End Loop
    Output Data rearrangement
    Partial Fock Calc.
End Loop

• Initial Integral Calc.: 1/Sqrt, Exp, and Fm(T) are used => GPP calculation
• Recursive Calc.: only ADD/SUB and MUL => LSRDP calculation


Vertical vs. Horizontal DFG Decomposition

Original:

Loop N
    Reconfiguration
    Loop M
        LSRDP pipeline calc.
    End Loop
End Loop

Vertical Decomp.:

Loop n ( > N)
    Reconfiguration
    Loop M
        LSRDP pipeline calc.
    End Loop
End Loop

Horizontal Decomp.:

Loop N
    Reconfiguration
    Loop M
        1st LSRDP pipeline calc.
    End Loop
End Loop
Loop N
    Reconfiguration
    Loop M
        2nd LSRDP pipeline calc.
    End Loop
End Loop




Performance Evaluation: Execution Time Modeling

$$ET_{TOTAL} = ET_{GPP} + ET_{LSRDP} + T_{Overhead}$$

$$ET_{LSRDP} = T_{cal} + ST_{LSRDP}$$

• ET: execution time
• T_cal: calculation time = total pipeline depth in the given program + # of rows of LSRDP (latency of LSRDP)
• ST_LSRDP: stall time = latency of LSRDP <-> Mem for the first input and the last output + stall from Bandwidth_req > Bandwidth_mem
• T_Overhead: sort data + reconfiguration + send signal for communication
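The execution time model on this slide (total time = GPP time + LSRDP time + overhead, with LSRDP time = calculation time + stall time, and calculation time = total pipeline depth + number of LSRDP rows) can be written down as a small sketch. The class and field names are hypothetical, all terms are assumed to be expressed in one common time unit, and the numbers in the example are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutionTimeModel:
    et_gpp: float          # execution time of the part that stays on the GPP
    pipeline_depth: float  # total pipeline depth in the given program
    lsrdp_rows: float      # number of LSRDP rows (pipeline fill latency)
    stall: float           # ST_LSRDP: memory latency and bandwidth stalls
    overhead: float        # T_Overhead: data sorting, reconfiguration, signalling

    @property
    def t_cal(self) -> float:
        return self.pipeline_depth + self.lsrdp_rows

    @property
    def et_lsrdp(self) -> float:
        return self.t_cal + self.stall

    @property
    def et_total(self) -> float:
        return self.et_gpp + self.et_lsrdp + self.overhead


# Illustrative values only; they are not measurements from the slides.
model = ExecutionTimeModel(et_gpp=100, pipeline_depth=1000, lsrdp_rows=16,
                           stall=50, overhead=200)
print(model.t_cal, model.et_lsrdp, model.et_total)
```

Separating the terms this way makes it easy to see which component dominates when the memory bandwidth or the reconfiguration latency is varied.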


Layout Types - Type I

[Figure: W × H array of PEs with an ORN between consecutive rows. Every PE contains an ADD/SUB unit (A), a MUL unit (M), and a Transfer Unit (T)]

• Each PE implements ADD/SUB and MUL
• Total no. of PEs = W * H
• Total area = W*H*[Area(MUL) + Area(ADD/SUB) + Area(TU)] + Area(ORNs)


Layout Types - Type II

[Figure: W × H array of PEs with an ORN between consecutive rows. PEs of type ADD/SUB + TU and MUL + TU alternate]

• Each PE implements ADD/SUB or MUL
• Total area = ½*W*H*[Area(MUL) + Area(TU)] + ½*W*H*[Area(ADD/SUB) + Area(TU)] + Area(ORNs)


Layout Types - Type III

[Figure: W × H array of PEs with an ORN between consecutive rows. Rows of ADD/SUB + TU PEs alternate with rows of MUL + TU PEs]

• Total area = ½*W*H*[Area(MUL) + Area(TU)] + ½*W*H*[Area(ADD/SUB) + Area(TU)] + Area(ORNs)
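The area formulas of the three layout types can be compared with a short sketch. Type I uses W*H*[Area(MUL) + Area(ADD/SUB) + Area(TU)] + Area(ORNs), while Types II and III use ½*W*H*[Area(MUL) + Area(TU)] + ½*W*H*[Area(ADD/SUB) + Area(TU)] + Area(ORNs). The unit areas in the example are arbitrary placeholders, not values from the slides, and the function names are hypothetical.

```python
def area_type1(w: int, h: int, a_mul: float, a_add: float,
               a_tu: float, a_orns: float) -> float:
    """Every PE holds a MUL unit, an ADD/SUB unit, and a TU."""
    return w * h * (a_mul + a_add + a_tu) + a_orns


def area_type2_or_3(w: int, h: int, a_mul: float, a_add: float,
                    a_tu: float, a_orns: float) -> float:
    """Half of the PEs hold MUL + TU, the other half ADD/SUB + TU."""
    half = 0.5 * w * h
    return half * (a_mul + a_tu) + half * (a_add + a_tu) + a_orns


# Placeholder unit areas, chosen only to make the arithmetic easy to follow.
params = dict(w=16, h=16, a_mul=4.0, a_add=2.0, a_tu=1.0, a_orns=100.0)
print(area_type1(**params))
print(area_type2_or_3(**params))
```

For equal W, H, and ORN area, the saving of Type II or III over Type I is ½*W*H*[Area(MUL) + Area(ADD/SUB)], provided the DFGs still fit into an array of the same size.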


CB Various Functionalities

CB:
• two inputs / two outputs
• four cases are possible (00, 01, 10, 11)
• reconfigurable

½CB:
• one input / two outputs
• four cases are possible (00, 01, 10, 11)
• reconfigurable


An ORN Structure

The number of FPUs is M and the number of Transfer Units (T) is also M; MCL is the maximum connection length when only FPUs are considered. The ORN then requires:
• ½CB: 2×M
• T2: M + 4×MCL + 2
• CB: (2×MCL+1) × (4×M−1)

[Figure: an ORN between two rows of five FPUs and five transfer units (T), built from ½CB, CB, and T2 blocks]

• CB: 351 JJs
• ½CB: 216 JJs
• T2 is a 2-bit shift register

"+": scalable, pipelined, easily re-designed for any N and M

"–": large number of Josephson junctions (M ½CBs and (2×M+1)×MCL CBs)

Reduction of the number of Josephson junctions is essential!

Source: A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.
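To get a feeling for why the Josephson-junction (JJ) count matters, the sketch below combines the component counts stated on this slide (½CB: 2×M, T2: M + 4×MCL + 2, CB: (2×MCL+1)×(4×M−1)) with the per-component figures of 351 JJs per CB and 216 JJs per ½CB. The slide gives no JJ count for the T2 shift register, so T2 is counted but left out of the JJ total; the function name and the example values of M and MCL are illustrative assumptions.

```python
from typing import Dict

JJ_PER_CB = 351        # from the slide
JJ_PER_HALF_CB = 216   # from the slide


def orn_components(m: int, mcl: int) -> Dict[str, int]:
    """Component counts of one ORN for M FPUs and a given MCL."""
    half_cb = 2 * m
    t2 = m + 4 * mcl + 2
    cb = (2 * mcl + 1) * (4 * m - 1)
    return {
        "half_cb": half_cb,
        "t2": t2,
        "cb": cb,
        # Crossbar switches only; T2 registers are excluded (no JJ figure given).
        "switch_jjs": cb * JJ_PER_CB + half_cb * JJ_PER_HALF_CB,
    }


print(orn_components(m=5, mcl=2))
```

Since the CB count scales with the product of M and MCL, halving the MCL roughly halves the switch JJ count of every ORN.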


Reconfiguration Mechanism

[Figure: state diagram of reconfiguration in LSRDP. States: initial state, idle, reconfiguration (ORN, immediate, PE), wait, and execution. Transitions: starting of reconfiguration, end of reconfiguration, starting of execution, end of execution]

[Figure: PE & ORN reconfiguration structure. Each PE contains an FU (A op B), a Transfer Unit, a 64-bit immediate register, and MUXes that select Input-A, Input-B, and Input-C from the ORN. Configuration register size: log(2 × (2MCL+1)) × 3 [bit]]


LSRDP Design Procedure

Start:
• Choosing a design parameter based on the priority of parameters
• Mapping DFGs onto the LSRDP (placement, routing, I/O positioning)
• Obtaining required statistics from the mapping results
• Choosing the appropriate value of the respective design parameter
• Analyzing the mapping results


10 TFLOPS SFQ-RDP computer

[Figure: system diagram with SMACs, SB, and rows of FPUs and ORNs at 4.2 K, connected to a CMOS CPU and memory modules]

• SFQ RDP: 32 FPUs × 32 chips (4 GFLOPS/FPU)
• SFQ Streaming Buffer: 64 Kb × 2 chips
• CMOS CPU: 1 chip
• 1024 FPUs per MCM (34 chips) × 4 MCMs
• Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
• 2 TB memory module (FB-DIMM [DDR3 @ 1333 MHz, 128 GB] × 16 modules)
• SFQ 0.5 um process


Power Consumption Comparison for 10 TFLOPS computers

                    MPU           Mem.      HD      Freezer   Air conditioning   Total
CMOS RISC (90 nm)   125 kW        12.5 kW   5 kW    -         43 kW              186 kW
SFQ RDP (0.5 um)    3 W + 3.3 W   250 W     100 W   1 kW      0.1 kW             1.5 kW


Power Consumption / Performance

System             Performance (GFlops)   Power consumption (W)   Power cons./Perf. (W/GFlops)   Remarks
SFQ-RDP            ~10K                   ~1.5K                   ~0.15                          0.5 um, whole system
SFQ-RDP            ~10K                   ~6.3 (MPU)              ~0.63*10^(-3)                  0.5 um, MPU
GRAPE-DR           512                    65                      0.12                           Chip
CSX600             25                     10                      0.40                           ClearSpeed chip
Cell               192 (single)           32                      0.17                           Inside chip, SPE core
Cell (eDP)         1.33PF (#12960)?       12960*110W?             -                              Roadrunner
GeForce8800GTX     518 (single)           150                     0.29                           Chip
SX-9               -                      -                       18.3                           512 nodes, whole system: 15.4 MW
CMOS RISC (90nm)   ~10K                   ~186K                   ~18.6
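The power-per-performance column is simply power divided by performance, which the following sketch recomputes for the rows that list both numbers. The figures are the approximate values from the slide ("~10K" is taken as 10000 GFlops, and so on), and the dictionary layout is an illustrative choice.

```python
# (performance in GFlops, power in W), approximate values from the slide
systems = {
    "SFQ-RDP (whole system)": (10_000, 1_500),
    "SFQ-RDP (MPU only)": (10_000, 6.3),
    "GRAPE-DR": (512, 65),
    "CSX600": (25, 10),
    "Cell": (192, 32),
    "GeForce8800GTX": (518, 150),
    "CMOS RISC (90nm)": (10_000, 186_000),
}


def watts_per_gflops(performance_gflops: float, power_w: float) -> float:
    return power_w / performance_gflops


for name, (perf, power) in systems.items():
    print(f"{name}: {watts_per_gflops(perf, power):.5f} W/GFlops")
```

The recomputed ratios agree with the slide to within rounding, and they show that most of the SFQ system's power budget goes to the surrounding system (memory and cooling) rather than to the SFQ MPU itself.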
