Presenter MaxAcademy Lecture Series – V1.0, September 2011 Stream Scheduling

Dec 14, 2015


Herbert Mayers

Presenter MaxAcademy Lecture Series – V1.0, September 2011

Stream Scheduling

Overview

• Latencies in stream computing
• Scheduling algorithms
• Stream offsets

Latencies in Stream Computing

• Consider a simple arithmetic pipeline: (A + B) + C
• Each operation has a latency
  – Number of cycles from input to output
  – May be zero
  – Throughput is still 1 value per cycle; an operation with latency L can have L values in flight in the pipeline

[Figure: Basic hardware implementation. Two chained adders take InputA, InputB and InputC and drive a single Output.]

[Figure, three animation frames: the values 1, 2 and 3 move through the same circuit. Captions: Data propagates through the circuit in “lock step”; Data arrives at wrong time due to pipeline latency (marked X).]

[Figure: the same circuit with a buffer added. Caption: Insert buffering to correct.]

[Figure, six animation frames: Now with buffering. The values 1, 2 and 3 enter the circuit, become 3 and 3 at the second adder, and leave as 6. Final caption: Success!]
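The animation can be reproduced with a small cycle-by-cycle model. This is only a sketch: the helper names `delayed` and `add_streams`, the 1-cycle adder latency, and the three-value input streams are assumptions made for illustration, not MaxCompiler code.

```python
def delayed(stream, cycles):
    """Model a pipeline delay: each value appears `cycles` cycles later."""
    return [None] * cycles + list(stream)


def add_streams(x, y, latency=1):
    """Pipelined adder: adds whatever is on both inputs in the same cycle.

    The sum appears `latency` cycles later. None marks a cycle with no
    valid data on at least one input.
    """
    n = min(len(x), len(y))
    sums = [None if x[t] is None or y[t] is None else x[t] + y[t]
            for t in range(n)]
    return delayed(sums, latency)


# Hypothetical input streams; index = cycle at which the value arrives.
a = [1, 10, 100]
b = [2, 20, 200]
c = [3, 30, 300]

ab = add_streams(a, b)                     # a + b, valid one cycle late

# No buffering: c[t] meets (a + b) from the previous cycle.
misaligned = [v for v in add_streams(ab, c) if v is not None]

# One cycle of buffering on c realigns the two inputs of the second adder.
aligned = [v for v in add_streams(ab, delayed(c, 1)) if v is not None]

print(misaligned)  # [33, 330]
print(aligned)     # [6, 60, 600]
```

Without the buffer, the second adder pairs each value of C with the sum from the previous cycle, so every result mixes data from two different stream positions. Delaying C by the latency of the first adder restores the intended (A + B) + C per element.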

Stream Scheduling Algorithms

• A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
• Can be applied automatically to a large dataflow graph (many thousands of nodes)
• Can try to optimize for various metrics
  – Latency from inputs to outputs
  – Amount of buffering inserted (generally the most interesting)
  – Area (resource sharing)

ASAP: As Soon As Possible

[Figure, five animation frames: ASAP scheduling of (A + B) + C. Captions: Build up circuit incrementally, keeping track of latencies; Input latencies are mismatched; Insert buffering. Latency labels: 0 at InputA, InputB and InputC; 1 after the first adder; 1 and 1 at the inputs of the second adder once buffering is inserted; 2 at the Output.]
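The ASAP walk-through can be sketched in a few lines of Python. The graph encoding (a `latency` dictionary and a `preds` dictionary), the function name `asap_schedule`, and the 1-cycle adder latency are assumptions made for this illustration; the slides do not say how MaxCompiler represents or schedules the graph internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def asap_schedule(latency, preds):
    """Return (ready, buffers) for an ASAP schedule.

    latency: node -> latency in cycles
    preds:   node -> list of nodes feeding it
    ready[n] is the cycle at which n's output is valid, inputs fixed at 0.
    """
    ready = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        arrivals = [ready[p] for p in preds.get(node, [])]
        ready[node] = max(arrivals, default=0) + latency[node]
    # An edge that arrives early is buffered until the latest input arrives.
    buffers = {
        (p, n): (ready[n] - latency[n]) - ready[p]
        for n, ps in preds.items()
        for p in ps
    }
    return ready, buffers


# (A + B) + C with assumed latencies: adders 1 cycle, inputs/outputs 0.
latency = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out": 0}
preds = {"add1": ["A", "B"], "add2": ["add1", "C"], "out": ["add2"]}

ready, buffers = asap_schedule(latency, preds)
print(ready["add1"], ready["add2"], ready["out"])  # 1 2 2
print(buffers[("C", "add2")])                      # 1
```

Under these assumptions the only buffered edge is InputC into the second adder, which matches the single buffer inserted in the slides.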

ALAP: As Late As Possible

[Figure, five animation frames: ALAP scheduling of the same circuit. Captions: Start at output; Latencies are negative relative to end of circuit; Buffering is saved. Latency labels: 0 at the Output, -1 and -1 at the inputs of the second adder, -2 and -2 at the inputs of the first adder.]
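A matching sketch for ALAP walks the graph from the outputs backwards. As before, the encoding, the name `alap_schedule`, and the 1-cycle adder latency are illustrative assumptions rather than anything taken from MaxCompiler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def alap_schedule(latency, preds):
    """Return (ready, buffers) for an ALAP schedule.

    ready[n] is the cycle at which n's output is valid, measured
    relative to the outputs, which are fixed at cycle 0.
    """
    # Invert the predecessor map so each node knows its consumers.
    succs = {}
    for node, ps in preds.items():
        for p in ps:
            succs.setdefault(p, []).append(node)

    order = list(TopologicalSorter(preds).static_order())
    ready = {}
    for node in reversed(order):  # consumers before producers
        # A consumer s needs its inputs at cycle ready[s] - latency[s].
        deadlines = [ready[s] - latency[s] for s in succs.get(node, [])]
        ready[node] = min(deadlines, default=0)

    buffers = {
        (p, n): (ready[n] - latency[n]) - ready[p]
        for n, ps in preds.items()
        for p in ps
    }
    return ready, buffers


latency = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out": 0}
preds = {"add1": ["A", "B"], "add2": ["add1", "C"], "out": ["add2"]}

ready, buffers = alap_schedule(latency, preds)
print(ready["A"], ready["B"], ready["C"])  # -2 -2 -1
print(sum(buffers.values()))               # 0
```

Because InputC is allowed to start one cycle later than InputA and InputB, no edge needs a buffer in this single-output example.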

[Figure, two animation frames: the same circuit with a second output, Output2, alongside Output1. Captions: Sometimes this is suboptimal. What if we add an extra output?; Unnecessary buffering is added; Neither ASAP nor ALAP can schedule this design optimally. Latency labels: -2, -2, -1, -1 and 0 as before, with Output2 also at 0.]

Optimal Scheduling

• ASAP fixes the inputs in place; ALAP fixes the outputs in place
• More complex scheduling algorithms may be able to find a better schedule, e.g. using integer linear programming (ILP)
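One common way to pose scheduling as an ILP is sketched below. The notation is an assumption introduced here, not taken from the slides: t_v is the cycle at which node v's output is valid, L_v is the latency of v, and b_{uv} is the number of buffer stages placed on the edge from u to v. This is not a statement of how MaxCompiler's scheduler is implemented.

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{(u,v) \in E} b_{uv} \\
\text{subject to} \quad & b_{uv} = (t_v - L_v) - t_u && \text{for every edge } (u,v) \in E \\
                        & b_{uv} \ge 0,\; b_{uv} \in \mathbb{Z} && \text{for every edge } (u,v) \in E \\
                        & t_v \in \mathbb{Z} && \text{for every node } v
\end{aligned}
```

Each constraint says that a value leaving u at cycle t_u must be held for b_{uv} cycles so that it arrives exactly when v starts, at cycle t_v - L_v. In this formulation neither the inputs nor the outputs are pinned to cycle 0, so the solver is free to shift them relative to each other.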

Buffering data on-chip

• Consider:

  a[i] = a[i] + (a[i - 1] + b[i - 1])

• We can see that we might need some explicit buffering to hold more than one data element on-chip
• We could do this explicitly, with buffering elements:

  a = a + (buffer(a, 1) + buffer(b, 1))

[Figure, two animation frames: InputA and InputB each pass through a Buffer(1) into the first adder; the second adder drives the Output. Captions: The buffer has zero latency in the schedule; This will schedule thus; Buffering = 3. Latency labels: 0 at the inputs, 0 after each Buffer(1), 1 at both inputs of the second adder, 2 at the Output.]

Buffers and Latency

• Accessing previous values with buffers is looking backwards in the stream
• This is equivalent to having a wire with negative latency
  – Cannot be implemented directly, but can affect the schedule

[Figure, two animation frames: the same circuit with Offset(-1) wires in place of the buffers. Captions: Offset wires can have negative latency; This is scheduled; Buffering = 0. Latency labels: 0 at the inputs, -1 after each Offset(-1), 0 after the first adder, 1 at the Output.]

Stream Offsets

• A stream offset is just a wire with a positive or negative latency
• Negative latencies look backwards in the stream
• Positive latencies look forwards in the stream
• The entire dataflow graph will re-schedule to make sure the right data value is present when needed
• Buffering could be placed anywhere, or pushed into inputs or outputs, which is better than manual instantiation

[Figure: InputA feeds an adder both directly and through an Offset(1) wire; the adder drives the Output.]

a[i] = a[i] + a[i + 1]

a = a + stream.offset(a, +1)

[Figure: the same circuit after scheduling. Caption: Scheduling produces a circuit with 1 buffer. Latency labels: 0 at InputA, 1 and 1 at the inputs of the adder, 2 at the Output.]
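The buffering counts quoted in the offset slides (3 with explicit buffers, 0 with Offset(-1) wires, 1 with an Offset(+1) wire) can be checked with a small model in which every edge carries its own wire latency. The function name `schedule`, the `(src, dst, wire_latency)` edge format, and the 1-cycle adder latency are assumptions made for this sketch; it uses an ASAP-style pass and is not MaxCompiler's scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule(latency, edges):
    """ASAP-style schedule where each edge is (src, dst, wire_latency).

    wire_latency is 0 for a plain wire and k for a stream offset of k.
    Returns (ready, buffering): output-valid cycles and total buffer stages.
    """
    incoming = {}
    for src, dst, wire in edges:
        incoming.setdefault(dst, []).append((src, wire))
    deps = {dst: {src for src, _ in ins} for dst, ins in incoming.items()}

    ready = {}
    for node in TopologicalSorter(deps).static_order():
        arrivals = [ready[s] + w for s, w in incoming.get(node, [])]
        ready[node] = max(arrivals, default=0) + latency[node]

    # Early arrivals wait until the node starts at ready[d] - latency[d].
    buffering = sum(ready[d] - latency[d] - (ready[s] + w)
                    for s, d, w in edges)
    return ready, buffering


base = {"a": 0, "b": 0, "add1": 1, "add2": 1, "out": 0}

# a + (buffer(a, 1) + buffer(b, 1)): buffers have zero latency in the schedule.
lat_buf = {**base, "buf_a": 0, "buf_b": 0}
edges_buf = [("a", "buf_a", 0), ("b", "buf_b", 0),
             ("buf_a", "add1", 0), ("buf_b", "add1", 0),
             ("a", "add2", 0), ("add1", "add2", 0), ("add2", "out", 0)]
_, extra = schedule(lat_buf, edges_buf)
total_with_buffers = extra + 2        # plus the two explicit Buffer(1) stages

# Same expression with Offset(-1) wires instead of explicit buffers.
edges_off = [("a", "add1", -1), ("b", "add1", -1),
             ("a", "add2", 0), ("add1", "add2", 0), ("add2", "out", 0)]
ready_off, total_with_offsets = schedule(base, edges_off)

# a + stream.offset(a, +1)
ready_fwd, total_forward = schedule(
    {"a": 0, "add": 1, "out": 0},
    [("a", "add", 0), ("a", "add", 1), ("add", "out", 0)])

print(total_with_buffers, total_with_offsets, total_forward)  # 3 0 1
```

In the negative-offset case the wire latency of -1 cancels the adder latency of 1, so both inputs of the second adder line up without any inserted buffer. In the forward-offset case the direct path has to wait one cycle for the value one position ahead in the stream, which is the single buffer shown on the slide.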

Exercises

For the questions below, assume that the latency of an addition operation is 10 cycles and a multiply takes 5 cycles, while inputs and outputs take 0 cycles.

1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph.

2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling to:

   a) c = ((a1 + a2) + a3) + a4
   b) c = (a1 + a2) + (a3 + a4)

3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:

   a) c = ((a1 * a2) + (a3 * a4)) + a1
   b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4

   How many values of stream a1 will be buffered on-chip for (b)?