Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware
Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM)
DAC HLS Tutorial, San Francisco 2009


Dec 21, 2015

Page 1: Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware

Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM)
DAC HLS Tutorial, San Francisco 2009

Page 2: Dynamic Application Specific Customization of HW

[Figure: “your app here”; a JIT compiler configures logic]

Inspired by ASIC paradigm:
• High Performance
• Low Power

Page 3: Liquid Metal: “JIT the Hardware”

• Single language for programming HW & SW
  • Run in a standard JVM, or synthesize in HW
  • Fluidly move computation between HW & SW
  • Do for HW (viz. FPGAs) what FORTRAN did for computing
• Address critical technology trends
  • Power: address impractical growth of power and cooling demands
  • Architecture: enabling million-way parallelism vs. small-scale multicores
  • Versatility: in-the-field and on-the-fly customization to end-user applications
  • Applications: demand for pervasive streaming and mobile content (WWW, multimedia, gaming)

[Figure labels: ASIC-like; Reconfigurable]

Page 4: Lime: the Liquid Metal Language

Design Principles:
• Object-oriented, Java-like, Java-compatible
• Raise level of abstraction
• Parallel constructs that simplify code
• Target synthesis while retaining generality

Page 5: 4 reasons not another *C to HDL approach

• Emphasis on programmer productivity
  • Leverage rich Java IDEs, libraries, and analysis
• Not an auto-parallelization approach
  • Lime is explicitly parallel and synthesizable
• Fast fail-safe mechanism
  • Lime may be refined into parallel SW implementation
• Intrinsic opportunity for online optimizations
  • Static optimizations with dynamic refinement

Page 6: Lime Overview

HW (FPGA):
• Computation is well encapsulated
• Data-flow driven computation
• Multiple “clock domains”
• Bit-level control and reasoning
• Memory usage statically determined before layout

Lime:
• Tasks, value types
• Streaming primitives
• Rate “matching” operators
• Abstract OO programming down to the bit-level!
• Ordinal-indexed arrays, bounded loops
• Template-like generics

Page 7: Streams: Exposing Computational Structure

• Stream primitives are integral to the language
• Tasks in streams are strongly isolated
  • Only the endpoints may perform side-effects
  • Provide macro-level functional programming abstraction…
  • …while allowing traditional imperative programming inside

Page 8: A Brief Introduction to Stream Operations

A finite stream literal:

    int stream s1 = { 1, 1, 2, 3, 5, 8 };

An infinite stream of 3’s:

    int stream s2 = task 3;

Stream expressions:

    int stream s3 = s2 * 17;
    double stream s4 = Math.sin(s1);
    double stream s5 = s3 + s4;

These operations create and connect tasks. Execution occurs later: lazy computation, functional.
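As a rough analogy only, the lazy, functional flavor of these stream expressions can be mimicked with Python iterators. This is not Lime: the helper names `literal`, `constant`, and `lift` are invented for illustration, and ending the combined stream when the finite input runs out is an assumption, since the slide does not say how Lime combines a finite stream with an infinite one.

```python
import itertools
import math

def literal(values):
    """Finite stream literal, like `{ 1, 1, 2, 3, 5, 8 }` (hypothetical helper)."""
    return iter(values)

def constant(value):
    """Infinite stream of one value, like `task 3` (hypothetical helper)."""
    return itertools.repeat(value)

def lift(fn, *streams):
    """Apply fn element-wise; nothing runs until the result is consumed."""
    return map(fn, *streams)

s1 = literal([1, 1, 2, 3, 5, 8])
s2 = constant(3)
s3 = lift(lambda x: x * 17, s2)
s4 = lift(math.sin, s1)
s5 = lift(lambda a, b: a + b, s3, s4)

# Building s1..s5 only connected the stages; consuming s5 drives them.
# Assumption: the combined stream ends when the finite input s1 is exhausted.
result = list(s5)
```

Each element of `result` is `3 * 17 + sin(x)` for the corresponding `x` in the literal; no arithmetic happens until `list(s5)` pulls values through the chain.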

Page 9: Simple Audio Processing

    value int[] squareWave(int freq, int rate, int amplitude) {
        int wavelength = rate / freq;
        int[] samples = new int[wavelength];

        for (int s: 1::wavelength)
            samples[s] = (s <= wavelength/2) ? 0 : amplitude;

        return (value int[]) samples;
    }

    int stream sqwaves = task squareWave(1000, 44100, 80);

    task AudioSink(44100).play(sqwaves);
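For readers without a Lime toolchain, a plain Python approximation of the same idea is sketched below. It uses 0-based indexing, and repeating the period endlessly is an assumption about what `task squareWave(...)` produces; `AudioSink` is not modeled, and `sample_stream` is a made-up helper name.

```python
import itertools

def square_wave(freq, rate, amplitude):
    """One period of a square wave: low for the first half, high for the rest."""
    wavelength = rate // freq
    half = wavelength // 2
    return [0 if s < half else amplitude for s in range(wavelength)]

def sample_stream(period):
    """Endless stream of samples made by repeating one period (assumed behavior)."""
    return itertools.cycle(period)

period = square_wave(1000, 44100, 80)

# Take two periods' worth of samples from the endless stream.
first_samples = list(itertools.islice(sample_stream(period), 2 * len(period)))
```

With the slide's parameters, one period holds 44100 // 1000 = 44 samples, half of them low and half at the given amplitude.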

Page 10: Liquid Metal Tool Chain

[Figure: tool-chain diagram. Lime source enters the Quicklime Front-End Compiler, which produces the Streaming IR. The Optimus Back-End Compiler (with an FPGA Model) emits HDL, which the Xilinx VHDL Compiler turns into a Xilinx bitfile for a Virtex5 FPGA. The Crucible Back-End Compiler emits C, which the Cell SDK turns into a Cell binary for the Cell BE. LM VM labels appear alongside the targets.]

Page 11: Streaming Intermediate Representation (SIR)

[Figure: the SIR constructs. Task; Pipeline; SplitJoin (splitter, joiner); Feedback Loop (joiner, splitter); Switch (switch, joiner)]

• Task may be stateless or have state
• Task mapped to “module” with FIFO I/O
• Task graphs are hierarchical & structured
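To make "hierarchical & structured" concrete, here is a minimal sketch of how such a task graph could be represented as nested containers. The class and field names are illustrative assumptions, not the actual SIR data structures, and only three of the constructs are modeled.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Task:
    name: str
    stateful: bool = False  # a task may be stateless or have state

@dataclass
class Pipeline:
    stages: List["Node"]    # children run one after another

@dataclass
class SplitJoin:
    branches: List["Node"]  # children run side by side between a splitter and a joiner

Node = Union[Task, Pipeline, SplitJoin]

def leaf_tasks(node):
    """Flatten the hierarchy into the list of task names, in graph order."""
    if isinstance(node, Task):
        return [node.name]
    children = node.stages if isinstance(node, Pipeline) else node.branches
    names = []
    for child in children:
        names.extend(leaf_tasks(child))
    return names

graph = Pipeline([
    Task("source"),
    SplitJoin([Task("f0"), Task("f1", stateful=True)]),
    Task("sink"),
])
names = leaf_tasks(graph)
```

Because every container has a single entry and a single exit, a compiler pass can recurse over the structure instead of analyzing an arbitrary graph.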

Page 12: SIR Compiler Optimizations

• Address FPGA compilation challenges
  • Finite, non-virtualizable device
  • Complex optimization space: throughput, latency, power, area
  • Very long synthesis times (minutes to hours)
• Task fusion and fission
  • Load balancing, scalability
• Stream buffer allocation
  • Locality enhancing, manage cache footprint or SRAM and control logic complexity
• Data access fusion
  • Reduce critical path length, improve communication-to-computation balance

Page 13: Preliminary Liquid Metal Results on Energy Consumption: FPGA vs PPC 405

[Bar chart. x-axis: FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete ..., DES, Matrix Mult..., Matrix Bloc..., Average. y-axis: Fraction of PowerPC Energy (0 to 0.8). Off-scale bar labels: ~1.4, ~1.4, ~1.4, 2.25.]

• Liquid Metal on Virtex 4 FPGA, 1.6W
• C reference implementation on PPC 405, 0.5W
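The chart reports energy rather than power, so the higher-wattage FPGA can still come out ahead if it finishes sooner. A small worked calculation, assuming energy is simply average power times run time and that the two wattages above hold for the whole run (neither assumption is stated on the slide):

```python
FPGA_POWER_W = 1.6  # from the slide
PPC_POWER_W = 0.5   # from the slide

def energy_fraction(speedup):
    """FPGA energy as a fraction of PowerPC energy.

    speedup = (PPC run time) / (FPGA run time); energy = power * time (assumed).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (FPGA_POWER_W / PPC_POWER_W) / speedup

# Speedup at which both platforms spend the same energy.
break_even = FPGA_POWER_W / PPC_POWER_W
```

Under these assumptions the FPGA must run more than 3.2 times faster than the PPC 405 to use less energy; a fraction above 1, like the off-scale bars, would correspond to a speedup below that threshold.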

Page 14: Preliminary Liquid Metal Results on Parallelism: FPGA vs PPC 405

• Liquid Metal on Virtex 4 FPGA, 1.6W
• C reference implementation on PPC 405, 0.5W

Page 15: Handel-C Comparison

• Compared DES and DCT with hand-optimized Handel-C implementation
• Performance
  • 5% faster before optimizations
  • 12x faster after optimizations
• Area
  • 66% larger before optimizations
  • 90% larger after optimizations

Page 16: Overview

• Compilation Flow
• Scheduling
• Optimizations
• Results

Page 17: Top Level Compilation

[Figure: a stream graph made of Source, Filters, a Round-Robin Splitter(8,8,8,8), a Round-Robin Joiner(1,1,1,1), and Sink, with nodes labeled A to J, shown next to its hardware mapping. Each node becomes a module with Init and Work blocks and its own Controller. A generic filter template shows inputs i0, i1, … ix, outputs O0 … Om, and memories M0, M1, … Mn; an array a[ ] and an index i are also drawn.]
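The splitter and joiner weights in the figure can be read as follows, assuming the usual round-robin convention from StreamIt-style languages (the slide does not define it): the splitter sends 8 consecutive items to each branch in turn, and the joiner takes 1 item from each branch in turn. The function names below are hypothetical.

```python
def round_robin_split(items, weights):
    """Deal items to branches, weights[k] consecutive items at a time (weights > 0)."""
    branches = [[] for _ in weights]
    source = iter(items)
    while True:
        for branch, weight in zip(branches, weights):
            for _ in range(weight):
                try:
                    branch.append(next(source))
                except StopIteration:
                    return branches

def round_robin_join(branches, weights):
    """Collect weights[k] items from each branch in turn until one runs dry."""
    positions = [0] * len(branches)
    out = []
    while True:
        for k, weight in enumerate(weights):
            if positions[k] + weight > len(branches[k]):
                return out
            out.extend(branches[k][positions[k]:positions[k] + weight])
            positions[k] += weight

items = list(range(32))
branches = round_robin_split(items, (8, 8, 8, 8))

# Each parallel filter here sums its 8 inputs and emits 1 output.
filtered = [[sum(branch)] for branch in branches]
joined = round_robin_join(filtered, (1, 1, 1, 1))
```

The 8-in, 1-out shape of each branch matches the weights: four filters that each consume 8 items and produce 1 fit a (8,8,8,8) splitter feeding a (1,1,1,1) joiner.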

Page 18: Filter Compilation

    bb1: sum = 0
         i = 0
    bb2: temp = pop( )
    bb3: sum = sum + temp
         i = i + 1
         Branch bb2 if i < 8
    bb4: push(sum)

[Figure: each basic block becomes a hardware module with Control in, Control outs, Ack, Live data ins, Live data outs, and Memory/Queue ports. The modules bb1 to bb4 are linked by Control Token and Ack signals; live-out data is held in Registers and selected through muxes; FIFO Read and FIFO Write ports connect the filter to its channels.]
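The control-token idea can be imitated in software: exactly one basic block is active at a time, and it hands the token to its successor when it finishes. The sketch below is a behavioral model only, written under that assumption; it ignores Ack signals, registers, and timing, and `run_filter` is an invented name.

```python
from collections import deque

def run_filter(inputs):
    """Run one firing of the filter by passing a control token between blocks."""
    fifo_in = deque(inputs)   # stands in for the FIFO read port
    fifo_out = []             # stands in for the FIFO write port
    live = {}                 # live values carried between blocks

    def bb1():
        live["sum"], live["i"] = 0, 0
        return "bb2"

    def bb2():
        live["temp"] = fifo_in.popleft()   # temp = pop()
        return "bb3"

    def bb3():
        live["sum"] += live["temp"]
        live["i"] += 1
        return "bb2" if live["i"] < 8 else "bb4"

    def bb4():
        fifo_out.append(live["sum"])       # push(sum)
        return None                        # firing complete

    blocks = {"bb1": bb1, "bb2": bb2, "bb3": bb3, "bb4": bb4}
    token, trace = "bb1", []
    while token is not None:
        trace.append(token)
        token = blocks[token]()
    return fifo_out, trace

outputs, trace = run_filter(range(1, 9))
```

The trace shows bb1 once, the bb2/bb3 pair eight times, and bb4 once, which is the sequence of control tokens the hardware modules would exchange for one firing.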

Page 19: Operation Compilation

    sum = sum + temp
    i = i + 1
    Branch bb2 if i < 8

[Figure: a generic functional unit (FU) with inputs i0 … im, outputs o0 … on, and a predicate input. The basic block above is built from two ADD units, one CMP unit, and Registers, with operands i, 1, temp, sum, and the constant 8. A Control in signal enters the block, and two control outputs are labeled Control out 3 and Control out 4.]

Page 20: Static Stream Scheduling

[Figure: Filter 1 (Push 2) connected to Filter 2 (Pop 3)]

• Each queue has to be deep enough to hold values generated from a single execution of the connected filter
• Double buffering is needed
• Buffer access is non-blocking
• A controller module is needed to orchestrate the schedule
• Controller uses finite state machine to execute the steady state schedule
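For the Push 2 / Pop 3 pair in the figure, a steady-state schedule must fire the two filters so that the channel ends each round with as many items as it started with. The standard way to find the firing counts is the synchronous-dataflow balance equation; the sketch below applies it, though the slide itself does not show how Optimus derives its schedule.

```python
from math import gcd

def steady_state(push_rate, pop_rate):
    """Smallest firing counts (producer, consumer) that balance the channel.

    Solves push_rate * producer_firings == pop_rate * consumer_firings.
    Also returns the number of items moved per steady-state round.
    """
    items_per_round = push_rate * pop_rate // gcd(push_rate, pop_rate)  # lcm
    return items_per_round // push_rate, items_per_round // pop_rate, items_per_round

producer_firings, consumer_firings, items_per_round = steady_state(2, 3)
```

Here Filter 1 fires three times and Filter 2 twice per round, moving six items; a controller's finite state machine would step through such a round repeatedly.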

Page 21: Greedy Stream Scheduling

[Figure: Filter 1 connected to Filter 2]

• Filters fire eagerly
• Blocking channel access
• Allows for potentially smaller channels
• Controller is not needed
• Results produced with lower latency
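A toy simulation can illustrate why eager firing with blocking access gets by with a small channel. It reuses the Push 2 / Pop 3 rates from the static scheduling figure and makes simplifying assumptions that are not from the slides: each filter fires at most once per step, the producer goes first within a step, and a filter blocks when the channel lacks space or data.

```python
def run_greedy(capacity, push=2, pop=3, steps=20):
    """Simulate eager firing over a bounded channel.

    Returns the number of items consumed, or None if neither filter can fire
    (deadlock, meaning the channel is too small).
    """
    occupancy = 0
    consumed = 0
    for _ in range(steps):
        fired = False
        if occupancy + push <= capacity:   # producer blocks when the channel is full
            occupancy += push
            fired = True
        if occupancy >= pop:               # consumer blocks until enough data arrives
            occupancy -= pop
            consumed += pop
            fired = True
        if not fired:
            return None
    return consumed

works_with_four = run_greedy(capacity=4)
deadlocks_with_three = run_greedy(capacity=3)
```

In this model a channel of 4 entries keeps both filters running, while 3 entries deadlock, since the producer cannot add its second pair and the consumer never sees three items.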

Page 22: Latency Comparison

[Bar chart. x-axis: FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average. y-axis: Latency of Static Relative to Greedy (0 to 18).]

Page 23: Area Comparison

[Bar chart. x-axis: FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average. y-axis: % of FPGA Area (0 to 100). Legend: Circuits with static scheduler; Circuits with greedy scheduler.]

Page 24: Optimizations

• Streaming optimizations (macro functional)
  • Channel allocation, channel access fusion, critical path balancing, filter fission and fusion, etc.
  • These optimizations need global information about the stream graph
  • Typically performed manually using existing tools
• Classic optimizations (micro functional)
  • Flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc.
  • Typically included in existing compilers and tools

Page 25: Channel Allocation

• Larger channels:
  • More SRAM
  • More control logic
  • Fewer stalls
• Interlocking makes sure that each filter gets the right data or blocks
• What is the right channel size?

Page 26: Channel Allocation Algorithm

• Set the size of the channels to infinity.
• Warm up the queues.
• Record the steady state instruction schedules for each pair.
• Unroll the schedules to have the same number of pushes and pops.
• Find the maximum number of overlapping lifetimes.

Page 27: Channel Allocation Example

[Figure: stream graph with Source, Filter 1, Filter 2, and Sink]

    Producer    Consumer
    ----        ----
    ----        ----
    push        pop
    ----        ----
    push        ----
    ----        ----
    push        pop
    push        ----
    push        pop
    ----        pop
    ----        pop
    push        pop

Max overlap = 3
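The last step of the channel allocation algorithm, finding the maximum number of overlapping lifetimes, can be sketched as below. The two schedules are the producer and consumer columns as read from this example, one entry per cycle. Pairing the k-th push with the k-th pop (FIFO order) and counting a value as live in both its push cycle and its pop cycle are assumptions; with them, the result agrees with the slide's "Max overlap = 3".

```python
producer = "-- -- push -- push -- push push push -- -- push".split()
consumer = "-- -- pop -- -- -- pop -- pop pop pop pop".split()

def max_overlap(producer, consumer):
    """Largest number of values that are live in the channel at the same time."""
    push_cycles = [cycle for cycle, op in enumerate(producer) if op == "push"]
    pop_cycles = [cycle for cycle, op in enumerate(consumer) if op == "pop"]
    if len(push_cycles) != len(pop_cycles):
        raise ValueError("schedules must be unrolled to equal pushes and pops")

    # FIFO channel: the k-th value pushed is the k-th value popped.
    lifetimes = list(zip(push_cycles, pop_cycles))

    best = 0
    for cycle in range(len(producer)):
        live = sum(1 for start, end in lifetimes if start <= cycle <= end)
        best = max(best, live)
    return best

channel_depth = max_overlap(producer, consumer)
```

The peak occurs in the cycle where the third, fourth, and fifth values are all in flight, so a channel with three entries is enough for this schedule.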

Page 28: Channel Allocation

[Bar chart. x-axis: FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete ..., DES, Matrix Mul..., Matrix Blo..., Average. y-axis: Relative Channel Size After Optimization (0 to 0.8).]

Page 29: Channel Access Fusion

• Each channel access (push or pop) takes one cycle
  • Communication to computation ratio
  • Longer critical path latency
  • Limits task-level parallelism

Page 30: Channel Access Fusion Algorithm

• Clustering channel access operations
  • Loop unrolling
  • Code motion
• Balancing the groups
• Similar to vectorization
  • Wide channels

[Figure: filters annotated with read (r) and write (w) accesses, and channels annotated with multiplicities: Write Mult. = 1, Read Mult. = 8; Write Mult. = 8, Read Mult. = 8; Write Mult. = 4, Read Mult. = 1]

Page 31: Access Fusion Example

Original filter:

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    push(sum);

After access fusion:

    int sum = 0;
    int t1, t2, t3, t4;
    for (int i = 0; i < 8; i++) {
        (t1, t2, t3, t4) = pop4();
        sum += t1 + t2 + t3 + t4;
    }
    push(sum);

Some caveats

Original filter:

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    pop();
    pop();
    push(sum);

After unrolling:

    int sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += pop();
        sum += pop();
        sum += pop();
        sum += pop();
    }
    pop();
    pop();
    push(sum);
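The effect of the first transformation can be checked with a small model that counts channel accesses, following the earlier statement that each access takes one cycle. The `Channel` class and its `pop4` method are illustrative stand-ins, not the Optimus channel interface.

```python
from collections import deque

class Channel:
    """FIFO channel that counts how many accesses (cycles) were spent on it."""
    def __init__(self, values):
        self.queue = deque(values)
        self.accesses = 0

    def pop(self):
        self.accesses += 1
        return self.queue.popleft()

    def pop4(self):
        """One wide access that returns four consecutive values."""
        self.accesses += 1
        return tuple(self.queue.popleft() for _ in range(4))

def unfused(channel):
    total = 0
    for _ in range(32):
        total += channel.pop()
    return total

def fused(channel):
    total = 0
    for _ in range(8):
        t1, t2, t3, t4 = channel.pop4()
        total += t1 + t2 + t3 + t4
    return total

narrow, wide = Channel(range(32)), Channel(range(32))
narrow_sum, wide_sum = unfused(narrow), fused(wide)
```

Both versions compute the same sum, but the fused one touches the channel 8 times instead of 32. One plausible reading of the caveat is that the second filter pops 34 values per firing, which is not a multiple of four, so its two trailing pops do not fill a complete group.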

Page 32: Access Fusion

[Bar chart. x-axis: FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete ..., DES, Matrix Mult..., Matrix Bloc..., Average. y-axis: Speedup (x100%) (0 to 8).]

Page 33: Critical Path Balancing

• Critical path is set by the longest combinational path in the filters
• Optimus uses its internal FPGA model to estimate how this impacts throughput and latency
• Balancing Algorithm:
  • Optimus takes target clock as input
  • Start with least number of basic blocks
  • Form USE/DEF chains for the filter
  • Use the internal FPGA model to measure critical path latency
  • Break the paths whose latency exceeds the target

Page 34: Critical Path Balancing Example

[Figure: a dataflow graph of Mul, Add, Sub, and Shift operations shown before and after balancing, with operations annotated by stage numbers 1 to 3]

    Operation    Delay
    Add/Sub      4
    Shift        2
    Multiply     10
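A greedy sketch of the "break the paths whose latency exceeds the target" step is shown below. It uses the delay table from this slide (units unspecified) but is not Optimus's actual algorithm: the operation graph, the target of 14, and the rule of starting a new stage as soon as a chain would exceed the target are all illustrative assumptions.

```python
DELAY = {"Mul": 10, "Add": 4, "Sub": 4, "Shift": 2}  # from the slide's delay table

def balance(ops, target):
    """Assign each operation to a stage so no in-stage chain exceeds target.

    ops: list of (name, kind, predecessor_names) in topological order.
    Returns {name: (stage, delay accumulated within that stage)}.
    Assumes every single operation fits within the target on its own.
    """
    placed = {}
    for name, kind, preds in ops:
        delay = DELAY[kind]
        # An operation cannot run before the latest stage of its inputs.
        stage = max((placed[p][0] for p in preds), default=1)
        # Inputs from earlier stages arrive from registers at time 0.
        start = max((placed[p][1] for p in preds if placed[p][0] == stage), default=0)
        if start + delay > target:
            stage, start = stage + 1, 0   # break the path with a register
        placed[name] = (stage, start + delay)
    return placed

ops = [
    ("m1", "Mul", []),
    ("m2", "Mul", []),
    ("a1", "Add", ["m1", "m2"]),
    ("s1", "Shift", ["a1"]),
    ("a2", "Add", ["s1"]),
]
stages = {name: stage for name, (stage, _) in balance(ops, target=14).items()}
```

With a target of 14, the multiplies and the first add share stage 1 (10 + 4), and the shift and final add are pushed into stage 2, which mirrors the stage numbering idea in the figure.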

Page 35: Liquid Metal

• Interdisciplinary effort addressing the entire stack
• One language for programming HW (FPGAs) and SW
• Liquid Metal VM: JIT the hardware!

[Figure: “Program all with Lime”. The Liquid Metal VM sits above GPU, Multicore CPU, ???, and FPGA targets.]

Page 36: Streaming IR

• Expose structure: computation and communication
• Uniform framework for pipeline and data parallelism
• Canonical representation for stream-aware optimizations

Page 37: Streaming Optimizations

• Macro-functional
  • Fold streaming IR graphs into FPGA…
  • Fusion, fission, replication
  • …subject to latency, area, and throughput constraints
• Micro-functional
  • Micro-pipelining
  • Channel allocation
  • Access fusion
  • Flip-flop elimination

Page 38: Ongoing Effort

• Application development
  • Streaming for enterprise and consumer
  • Real-time applications
• Compiler and JIT
  • Pre-provisioning profitable HW implementations
  • Runtime opportunities to “JIT” the HW
• Advanced dynamic reconfiguration support in VM
  • Predictive, hides latency
• New platforms
  • Tightly coupled, higher bandwidth, lower latency communication
  • Heterogeneous MPSoC systems – FPGA + processors