MEMOCode 2007 Design Contest – MIT Submission N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan

Mar 30, 2015


Page 1

MEMOCode 2007 Design Contest – MIT Submission

N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan

Page 2

Resources

• Five “insufficiently busy” grad students

• Three weeks
  – Nine man-weeks used

• Bluespec expertise
  – Easy parameterization / fast concurrency

• The promise of food

Page 3

Basic Facts

• Matrix multiply is embarrassingly parallel
  – More multipliers and adders should help

• Matrices are too large to be stored in FPGA memory

• Time was short, so the design needed to be partitioned to make use of all designers
  – Latency-insensitive methodology

Page 4

Outline

• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do

Page 5

The Standard N^3 Algorithm

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];

Page 6

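and blocking is well understood…

Split each loop into a block loop (ib, jb, kb) and an offset loop (io, jo, k):

for (int ib = 0; ib < N; ib += K)
  for (int io = 0; io < K; io++)
    for (int jb = 0; jb < N; jb += K)
      for (int jo = 0; jo < K; jo++)
        for (int kb = 0; kb < N; kb += K)
          for (int k = 0; k < K; k++)
            c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];

Swap the loops so that the block loops are outermost, which reduces memory traffic. The three inner loops form the kernel:

for (int ib = 0; ib < N; ib += K)
  for (int jb = 0; jb < N; jb += K)
    for (int kb = 0; kb < N; kb += K)
      // Kernel: one K x K x K block multiply
      for (int io = 0; io < K; io++)
        for (int jo = 0; jo < K; jo++)
          for (int k = 0; k < K; k++)
            c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];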
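The blocking idea can be checked in plain C. The following is a minimal sketch, not the contest code: it assumes flat row-major storage, integer elements, a matrix size that is a multiple of the block size, and that all three dimensions are blocked (the `kb` loop). The function names are hypothetical.

```c
#include <stddef.h>

/* Reference: the standard triple loop on row-major n x n matrices. */
void matmul_naive(size_t n, const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            int sum = 0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked version: the three outer loops select K x K blocks, the three
 * inner loops are the kernel that works entirely inside those blocks.
 * Assumes n is a multiple of kblk (an assumption of this sketch). */
void matmul_blocked(size_t n, size_t kblk, const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0;
    for (size_t ib = 0; ib < n; ib += kblk)
        for (size_t jb = 0; jb < n; jb += kblk)
            for (size_t kb = 0; kb < n; kb += kblk)
                /* kernel: one block of A times one block of B */
                for (size_t io = 0; io < kblk; io++)
                    for (size_t jo = 0; jo < kblk; jo++)
                        for (size_t ko = 0; ko < kblk; ko++)
                            c[(ib + io) * n + (jb + jo)] +=
                                a[(ib + io) * n + (kb + ko)] *
                                b[(kb + ko) * n + (jb + jo)];
}
```

Both functions should produce the same result for any input; the blocked form only changes the order in which the products are accumulated, so that each loaded block is reused K times before it is replaced.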

Page 7

Outline

• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do

Page 8

Hardware Facts

• If we accelerate the computation, DRAM access becomes the bottleneck

• CPU has slow access to DRAM
  – HW can directly access DRAM via PLB (Processor Local Bus)

Page 9

Hardware Facts

• CPU-to-HW memory bandwidth is bounded at 150 MB/sec
  – With software overhead in data orchestration, probably only 50% of this bandwidth can be used

• Memory bus supports 800 MB/sec
  – A direct interface can provide up to a 5x improvement over software transfer

• Special hardware may not be complicated because memory access patterns are simple
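The bandwidth comparison is simple arithmetic, sketched here in C. The two peak figures come from the slide; the helper names are hypothetical, and the 50% derating is the slide's own estimate rather than a measurement.

```c
/* Bandwidth figures quoted on the slide, in MB/sec. */
#define CPU_PATH_MB_PER_SEC 150.0
#define PLB_MB_PER_SEC      800.0

/* Usable bandwidth after derating a peak figure by `efficiency` (0..1). */
double usable_bandwidth(double peak_mb_per_sec, double efficiency)
{
    return peak_mb_per_sec * efficiency;
}

/* How many times faster the `fast` path is than the `slow` path. */
double speedup(double fast_mb_per_sec, double slow_mb_per_sec)
{
    return fast_mb_per_sec / slow_mb_per_sec;
}
```

Comparing the two peak figures, `speedup(PLB_MB_PER_SEC, CPU_PATH_MB_PER_SEC)` is about 5.3, which matches the "up to 5x" on the slide. Against the derated software path (75 MB/sec) the ratio would be roughly twice that, but only if the PLB peak were fully achieved.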

Page 10

High Level Architecture

[Figure: block diagram showing three Func Units, CPU, PLB, DRAM, and Interconnection Logic]

Page 11

Architecture

[Figure: block diagram showing three Func Units, Controller, Feeder, Switch, PLB Master, CPU, PLB, and DRAM]

Page 12

Software Example (C = A x B)

[Figure: the architecture diagram (three Func Units, Controller, Feeder, Switch, PLB Master, CPU, PLB, DRAM) annotated with blocks A, B, and C and the instructions Ld A 0, Ld B 0, St C 0, MAC 0]

In reality, the execution of several blocks will be overlapped
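To make the instruction stream concrete, here is a hedged C sketch of how software might generate the load / multiply-accumulate / store sequence for a whole blocked multiply on a single functional unit. The opcode names, the `Instr` layout, and `gen_program` are hypothetical; the slides do not show the real encoding, and this sketch ignores the overlapping of blocks mentioned above.

```c
#include <stddef.h>

/* Hypothetical encoding of the functional-unit instructions. */
typedef enum { OP_ZERO, OP_LOAD_A, OP_LOAD_B, OP_MAC, OP_STORE_C } Opcode;

typedef struct {
    Opcode op;
    size_t row;  /* block row the instruction refers to */
    size_t col;  /* block column the instruction refers to */
} Instr;

/* Append one instruction if there is room; always return the new count. */
static size_t emit(Instr *out, size_t cap, size_t n,
                   Opcode op, size_t row, size_t col)
{
    if (n < cap) {
        out[n].op = op;
        out[n].row = row;
        out[n].col = col;
    }
    return n + 1;
}

/* Emit the program for C = A x B on an nb x nb grid of blocks.
 * Returns the number of instructions the full program needs; only the
 * first `cap` of them are written to `out`. */
size_t gen_program(size_t nb, Instr *out, size_t cap)
{
    size_t n = 0;
    for (size_t ib = 0; ib < nb; ib++)
        for (size_t jb = 0; jb < nb; jb++) {
            n = emit(out, cap, n, OP_ZERO, ib, jb);        /* C block = 0 */
            for (size_t kb = 0; kb < nb; kb++) {
                n = emit(out, cap, n, OP_LOAD_A, ib, kb);  /* A block (ib,kb) */
                n = emit(out, cap, n, OP_LOAD_B, kb, jb);  /* B block (kb,jb) */
                n = emit(out, cap, n, OP_MAC, ib, jb);     /* C += A*B */
            }
            n = emit(out, cap, n, OP_STORE_C, ib, jb);     /* write C block */
        }
    return n;
}
```

Calling `gen_program(nb, NULL, 0)` returns only the instruction count, which is a convenient way to size the buffer before generating the program for real.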

Page 13

Outline

• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do

Page 14

Functional Unit - Design

• Instructions:
  – Load operand (memory)
  – Store operand (memory)
  – Zero (C = 0)
  – Multiply-Add-Accumulate (C += A*B)

• Two FSMs (Read/Write and Compute)
  – Allows overlapping of instructions
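The four instructions can be modelled in software as operations on three block memories. This is only a functional sketch in C: the `FuncUnit` type, the function names, the block size of 4, and the integer element type are all assumptions made for illustration. It runs one instruction at a time, whereas the hardware overlaps load/store with compute by using two FSMs.

```c
#include <string.h>

#define BLK 4  /* block edge length; hypothetical value for illustration */

/* Software stand-in for one functional unit's three block memories. */
typedef struct {
    int a[BLK][BLK];
    int b[BLK][BLK];
    int c[BLK][BLK];
} FuncUnit;

/* Load operand: copy a row-major BLK x BLK block from memory into A or B. */
void fu_load_a(FuncUnit *fu, const int *src) { memcpy(fu->a, src, sizeof fu->a); }
void fu_load_b(FuncUnit *fu, const int *src) { memcpy(fu->b, src, sizeof fu->b); }

/* Zero: C = 0. */
void fu_zero(FuncUnit *fu) { memset(fu->c, 0, sizeof fu->c); }

/* Multiply-Add-Accumulate: C += A * B. */
void fu_mac(FuncUnit *fu)
{
    for (int i = 0; i < BLK; i++)
        for (int j = 0; j < BLK; j++)
            for (int k = 0; k < BLK; k++)
                fu->c[i][j] += fu->a[i][k] * fu->b[k][j];
}

/* Store operand: copy the C block back to memory. */
void fu_store_c(const FuncUnit *fu, int *dst) { memcpy(dst, fu->c, sizeof fu->c); }
```

A typical sequence is zero, then repeated load A / load B / MAC, then store; issuing MAC twice on the same operands accumulates the product twice, which is what makes the accumulate semantics visible.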

Page 15

Functional Unit – Algorithm

• Take the algorithm and unroll P loop iterations

• Adder tree of P inputs
  – Critical path grows logarithmically

• Can pipeline
  – Complicated because of parameterization

for (int i = 0; i < K; i++)
  for (int j = 0; j < K; j++)
    for (int k = 0; k < K; k++)
      c[i][j] += a[i][k] * b[k][j];

[Figure: adder tree for P = 8. Eight multipliers form A[i][k+n] * B[k+n][j] for n = 0..7; three levels of adders sum the products, and a final adder accumulates the sum into C[i][j]]
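The unrolled step with an adder tree can be sketched in C as follows. This is an illustration under assumptions, not the hardware description: `P` is fixed at 8 to match the figure and must be a power of two, the operands are assumed to be already gathered into two length-P arrays, and the function name is hypothetical.

```c
#define P 8  /* unroll factor; 8 matches the figure, power of two assumed */

/* One unrolled step: P products reduced by a balanced adder tree.
 * `levels`, if non-NULL, receives the tree depth (log2 of P). */
int dot_p_tree(const int *a_row, const int *b_col, int *levels)
{
    int t[P];
    int depth = 0;

    for (int n = 0; n < P; n++)              /* P multipliers side by side */
        t[n] = a_row[n] * b_col[n];

    for (int width = P; width > 1; width /= 2) {  /* one tree level per pass */
        for (int n = 0; n < width / 2; n++)
            t[n] = t[2 * n] + t[2 * n + 1];
        depth++;
    }

    if (levels)
        *levels = depth;
    return t[0];
}
```

The caller adds the returned sum into `c[i][j]`, which corresponds to the final accumulate adder. The depth grows as log2(P), which is why the critical path lengthens as P increases unless the tree is pipelined.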

Page 16

Functional Unit – Algorithm

• Different algorithm
  – Reorder multiplies
  – Writes c[i][j] multiple times

• Unroll by P
  – Same number of adders and multipliers
  – Shorter critical path

• Pipelining is easy
  – 2 stages

for (int i = 0; i < K; i++)
  for (int j = 0; j < K; j++)
    for (int k = 0; k < K; k++)
      c[j][k] += a[i][k] * b[j][i];

[Figure: parallel multiply-add lanes. Lane n computes C[j][k+n] += A[i][k+n] * B[j][i]; lanes n = 0..5 are shown, and B[j][i] is common to all lanes]
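The reordering can be illustrated in C. In this sketch the innermost loop updates different elements of C on every iteration, so unrolling it by P gives P independent multiply-add lanes and no adder tree. The index names follow the usual C = A x B convention rather than the labels on the slide; that renaming, the flat row-major layout, and the function name are assumptions of this sketch.

```c
#include <stddef.h>

/* Reordered kernel on row-major k x k blocks: for each scalar a[i][m],
 * update a whole row of c. Iterations of the innermost loop write
 * different c elements, so they are independent of one another. */
void kernel_reordered(size_t k, const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < k; i++)
        for (size_t m = 0; m < k; m++) {
            int scalar = a[i * k + m];       /* operand shared by all lanes */
            for (size_t j = 0; j < k; j++)   /* P of these can run in parallel */
                c[i * k + j] += scalar * b[m * k + j];
        }
}
```

The cost of this ordering is that each element of C is read and written k times instead of once, which is acceptable when C lives in on-chip block RAM.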

Page 17

FU Microarchitecture

[Figure: FU microarchitecture showing BRAM A, BRAM B, BRAM C, three FSM blocks, a LOAD/STORE FSM, a COMPUTE FSM, a multiplier, and an adder]

Page 18

Memory Bus Master (PLB)

• 32-bit bus interface

• 16-word burst transfers
  – Amortize bus setup costs

• DRAM may refresh during transfer
  – Added burst buffer for rapid recovery

[Figure: PLB master block diagram showing the PLB Bus, Input Burst Buffer, Output Burst Buffer, Bus Control FSM, Store FSM, Load FSM, Store Data, Load Data, and PLB Commands]
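Why bursts amortize setup cost can be shown with a little arithmetic in C. The bus width and burst length come from the slide; the per-burst setup overhead is a free parameter here because the slides do not give a figure, and the model assumes one word moves per data cycle. The function names are hypothetical.

```c
#define BUS_BYTES_PER_WORD 4   /* 32-bit bus, from the slide */
#define BURST_WORDS        16  /* 16-word bursts, from the slide */

/* Bursts needed to move `words` bus words, rounded up. */
unsigned long bursts_needed(unsigned long words)
{
    return (words + BURST_WORDS - 1) / BURST_WORDS;
}

/* Fraction of bus cycles that carry data when every burst pays
 * `setup_cycles` of overhead (a hypothetical parameter). */
double burst_efficiency(unsigned burst_words, unsigned setup_cycles)
{
    return (double)burst_words / (double)(burst_words + setup_cycles);
}
```

For any fixed setup cost, `burst_efficiency` rises with the burst length: with a hypothetical 4-cycle setup, a single-word transfer would use a fifth of the bus cycles for data, while a 16-word burst would use four fifths.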

Page 19

Memory Bus Master (PLB)

• Half of critical path through bus arbiter
  – Beyond our control

• Substantial retiming needed
  – Register pushing
  – State decoupling

• Need fine-grained control over scheduling

[Figure: PLB master block diagram showing the PLB Bus, Input Burst Buffer, Output Burst Buffer, Bus Control FSM, Store FSM, Load FSM, Store Data, Load Data, and PLB Commands]

Page 20

Outline

• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do

Page 21

Design Parameters

• Architecture: Number of functional units

• Functional Unit: degree of parallelism, matrix size

• Memory Bus (PLB) Master: matrix memory layout, matrix size

• Switch: Number of functional units

• Algorithm Generator: Block size

Page 22

Final Results

• 100 MHz

• 1 Functional Unit
  – 64^2 subblocks
  – 8 complex multiplies

• Lines of code – 10K total
  – Unit testing framework – 1.5K
  – C code – 2K
  – BSV – 5.5K
  – Multiple FU implementations – 1K
  – Additional unused hardware – 1K

• More than 3 GOps/sec

Page 23

Performance

Size      Time (µs)
64^2      799
128^2     5120
256^2     45300
512^2     332000
1024^2    2710000   (annotated "125x" on the slide)
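The measured times can be converted into a throughput figure with a short C helper. This is a rough sketch: it assumes an N x N multiply performs N^3 multiply-accumulates and that each complex multiply-accumulate counts as 8 real operations (4 multiplies and 4 additions). The operation-counting convention is an assumption, not something the slides state, and the function name is hypothetical.

```c
/* Operations per second for an n x n matrix multiply that took `time_us`
 * microseconds, counting `ops_per_mac` operations per multiply-accumulate. */
double ops_per_second(double n, double time_us, double ops_per_mac)
{
    double macs = n * n * n;                  /* N^3 multiply-accumulates */
    return macs * ops_per_mac / (time_us * 1e-6);
}
```

Under that counting convention, `ops_per_second(1024, 2710000, 8)` comes out slightly above 3e9, in line with the "more than 3 GOps/sec" reported in the results.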

Page 24

Things we would have done with more time

• We believe we could have obtained 10 billion ops per second

• 32-bit PLB -> 64-bit PLB
  – Double memory bandwidth
  – Fairly simple improvement

• Multiple Clock Domains – implemented, but had trouble synthesizing in EDK

• Play with # of FUs / registers per FU
  – HW parameterized for this

• Explore alternative machine organization

• Algorithmic Exploration

Page 25

Fin