High Performance Architectures
Dataflow, Part 3
Dataflow Processors
Recall from basic processor pipelining: hazards limit performance.
- Structural hazards
- Data hazards, due to true dependences or name (false) dependences: anti and output dependences
- Control hazards
Name dependences can be removed by:
- the compiler (register renaming)
- renaming hardware (advanced superscalars)
- the single-assignment rule (dataflow computers)
Data hazards due to true dependences and control hazards can be avoided if succeeding instructions in the pipeline stem from different contexts (dataflow computers, multithreaded processors).
Dataflow Model of Computation
Enabling rule: an instruction is enabled (i.e. executable) if all its operands are available. (In the von Neumann model, an instruction is enabled if the PC points to it.)
The computational rule, or firing rule, specifies when an enabled instruction is actually executed.
Basic instruction firing rule: an instruction is fired (i.e. executed) when it becomes enabled. The effect of firing an instruction is the consumption of its input data (operands) and the generation of output data (results).
Where are the structural hazards?
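The enabling and firing rules above can be sketched as a toy interpreter in a few lines of Python (illustrative only, not any real machine's scheduler; the class and function names are invented for this sketch):

```python
class Node:
    def __init__(self, op, n_inputs, dests):
        self.op = op                     # function applied when the node fires
        self.slots = [None] * n_inputs   # operand slots, None = token not yet arrived
        self.dests = dests               # (node, slot) pairs receiving the result

    def receive(self, slot, value, ready):
        self.slots[slot] = value
        if all(s is not None for s in self.slots):   # enabling rule: all operands available
            ready.append(self)                       # basic firing rule: fire once enabled

def run(ready):
    results = []
    while ready:
        node = ready.pop()
        out = node.op(*node.slots)                   # firing consumes the operands...
        node.slots = [None] * len(node.slots)
        for dest, slot in node.dests:                # ...and produces result tokens
            dest.receive(slot, out, ready)
        if not node.dests:
            results.append(out)
    return results

# Compute (2 + 3) * 4 as a two-node graph: no program counter,
# only operand availability drives execution.
mul = Node(lambda a, b: a * b, 2, [])
add = Node(lambda a, b: a + b, 2, [(mul, 0)])
ready = []
add.receive(0, 2, ready)
add.receive(1, 3, ready)
mul.receive(1, 4, ready)
print(run(ready))  # [20]
```

Note that no structural hazards appear here: the sketch, like the basic model, silently assumes unlimited execution resources.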
Dataflow Languages
Main characteristic: the single-assignment rule. A variable may appear on the left side of an assignment only once within the area of the program in which it is active.
Examples: VAL, Id, LUCID.
A dataflow program is compiled into a dataflow graph: a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependences among instructions. The dataflow graph is similar to a dependence graph used in the intermediate representations of compilers.
During the execution of the program, data propagate along the arcs in data packets, called tokens. This flow of tokens enables some of the nodes (instructions) and fires them.
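The single-assignment rule can be illustrated in any language by simply never reusing a name (a hypothetical fragment; the variable names are invented):

```python
# Imperative style reassigns x, creating an output (name) dependence:
#   x = a + b
#   x = x * c
# Single-assignment style gives every value a fresh name, so each
# "variable" corresponds to exactly one arc in the dataflow graph:
def single_assignment(a, b, c):
    x1 = a + b      # first and only definition of x1
    x2 = x1 * c     # a new name instead of reassigning x
    return x2

print(single_assignment(1, 2, 3))  # 9
```

With no name reuse there are no anti or output dependences left to remove, which is exactly why dataflow computers rely on this rule instead of renaming hardware.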
Dataflow Architectures: Overview
Pure dataflow computers: static, dynamic, and the explicit token store architecture.
Hybrid dataflow computers: augmenting the dataflow computation model with control-flow mechanisms, such as the RISC approach, complex machine operations, multithreading, large-grain computation, etc.
Pure Dataflow
A dataflow computer executes a program by receiving, processing, and sending out tokens, each containing some data and a tag. Dependences between instructions are translated into tag matching and tag transformation.
Processing starts when a set of matched tokens arrives at the execution unit. The instruction to be fetched from the instruction store (according to the tag information) contains information about what to do with the data and how to transform the tags.
The matching unit and the execution unit are connected by an asynchronous pipeline, with queues added between the stages.
Some form of associative memory is required to support token matching: a real memory with associative access, a simulated memory based on hashing, or a direct matched memory.
Static Dataflow
A dataflow graph is represented as a collection of activity templates, each containing: the opcode of the represented instruction, operand slots for holding operand values, and destination address fields, referring to the operand slots in subsequent activity templates that need to receive the result value.
Each token consists only of a value and a destination address.
Dataflow Graph and Activity Template
[Figure: a dataflow graph computing z = sqrt(x * y), with data arcs carrying data tokens and acknowledgement arcs carrying acknowledge signals, alongside the corresponding activity templates for nodes ni (*) and nj (sqrt).]
Acknowledgement Signals
Notice that different tokens destined for the same destination cannot be distinguished. The static dataflow approach therefore allows at most one token on any one arc.
The basic firing rule is extended as follows: an enabled node is fired if there is no token on any of its output arcs.
The restriction is implemented by acknowledge signals (additional tokens) traveling along additional arcs from consuming to producing nodes.
Using acknowledgement signals, the firing rule can be restored to its original form: a node is fired at the moment when it becomes enabled.
Again: structural hazards are ignored, assuming unlimited resources!
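A minimal sketch of the one-token-per-arc restriction and its acknowledge signals (illustrative names; arcs are modeled as single-slot buffers with a free flag that the consumer's acknowledgement resets):

```python
class Arc:
    def __init__(self):
        self.token = None   # at most one token per arc (static dataflow)
        self.free = True    # set again by the consumer's acknowledge signal

def try_fire(op, in_arcs, out_arcs):
    # Extended firing rule: all inputs present AND every output arc is free.
    if all(a.token is not None for a in in_arcs) and all(a.free for a in out_arcs):
        result = op(*[a.token for a in in_arcs])
        for a in in_arcs:
            a.token = None
            a.free = True          # acknowledge: tell the producer the arc is empty
        for a in out_arcs:
            a.token = result
            a.free = False         # occupied until the consumer acknowledges
        return True
    return False

x, y, out = Arc(), Arc(), Arc()
x.token, y.token = 6, 7
fired = try_fire(lambda a, b: a * b, [x, y], [out])
print(fired, out.token)  # True 42
```

A second attempt to fire the same node before `out` is acknowledged fails, which is precisely how the token traffic ends up doubled: every data token eventually costs one acknowledge token.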
MIT Static Dataflow Machine
[Figure: processing elements (PEs) connected by a communication network. Each PE contains an activity store, instruction queue, fetch unit, update unit, operation unit(s), and send/receive units (SU, RU) handling local communication and traffic to/from the communication network.]
Deficiencies of Static Dataflow
- Consecutive iterations of a loop can only be pipelined.
- Due to acknowledgment tokens, the token traffic is doubled.
- Lack of support for programming constructs that are essential to modern programming languages: no procedure calls, no recursion.
Advantage: a simple model.
Dynamic Dataflow
Each loop iteration or subprogram invocation should be able to execute in parallel as a separate instance of a reentrant subgraph. The replication is only conceptual.
Each token has a tag: the address of the instruction for which the particular data value is destined, and context information.
Each arc can be viewed as a bag that may contain an arbitrary number of tokens with different tags.
The enabling and firing rule is now: a node is enabled and fired as soon as tokens with identical tags are present on all input arcs.
Structural hazards are again ignored!
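For dyadic instructions, tag matching reduces to a lookup in a waiting-token store; a hash table can stand in for the (too expensive) associative memory. A minimal sketch, with invented names:

```python
waiting = {}   # tag -> first operand waiting for its partner (the matching store)

def match(tag, value):
    """Return a matched operand pair for a dyadic instruction, or None if waiting."""
    if tag in waiting:
        return (waiting.pop(tag), value)   # partner found: the node is enabled
    waiting[tag] = value                   # first token arrives: wait for its match
    return None

# Two loop iterations (contexts 0 and 1) of the same instruction "add"
# coexist in flight, distinguished only by their tags:
assert match(("add", 0), 2) is None        # iteration 0's first token waits
assert match(("add", 1), 10) is None       # iteration 1 waits independently
print(match(("add", 0), 3))   # (2, 3): iteration 0 enabled
print(match(("add", 1), 20))  # (10, 20): iteration 1 enabled
```

The sketch also shows the pipeline-bubble problem mentioned below: the first token of every dyadic instruction does no useful work beyond being stored.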
MERGE and SWITCH Nodes
[Figure: (a) MERGE and (b) SWITCH node behavior, shown before and after execution. A SWITCH node routes its input token X to its T or F output arc according to a boolean control token.]
Branch Implementations
[Figure: two implementations of a branch selecting f(x) or g(x) under predicate P. Branch: a SWITCH node routes x to node f or g depending on the boolean b. Speculative branch evaluation: a COPY node sends x to both f and g, and a CHOOSE node selects one of the results.]
Basic Loop Implementation
[Figure: a loop built from a SWITCH node controlled by predicate P and context-manipulating nodes L, D, D-1, L-1 around the loop body f, feeding the new x back into the loop.]
L: initiation, new loop context. D: increments the loop iteration number. D-1: resets the loop iteration number to 1. L-1: restores the original context.
Function Application
[Figure: function application via an APPLY node, with context nodes A and A-1 and BEGIN/END nodes bracketing the function body between nbegin and nend.]
A: creates a new context. BEGIN: replicates tokens for each fork. END: returns results, unstacks the return address. A-1: replicates the output for the successors.
MIT Tagged-Token Dataflow Architecture
[Figure: processing elements and I-structure storage units connected by a communication network. Each PE contains a token queue, a wait-match unit with waiting token store, an instruction fetch unit with program store and constant store, an ALU, a form token and form tag unit, and send/receive units (SU, RU) for local communication and traffic to/from the communication network.]
Manchester Dataflow Machine
[Figure: a host, processing elements, and structure storage units connected by switches. Within each PE, a ring of token queue, matching unit, and instruction store feeds a processing unit containing multiple ALUs, with input and output links to the switch.]
Advantages and Deficiencies of Dynamic Dataflow
Major advantage: better performance than static dataflow, because multiple tokens are allowed on each arc, thereby unfolding more parallelism.
Problems:
- Efficient implementation of the matching unit that collects tokens with matching tags. Associative memory would be ideal. Unfortunately, it is not cost-effective, since the amount of memory needed to store tokens waiting for a match tends to be very large. All existing machines use some form of hashing technique.
- Bad single-thread performance (when not enough workload is present).
- Dyadic instructions lead to pipeline bubbles when the first operand tokens arrive.
- No instruction locality, no use of registers.
Explicit Token Store (ETS) Approach
Target: efficient implementation of token matching.
Basic idea: allocate a separate frame in a frame memory for each active loop iteration or subprogram invocation. A frame consists of slots; each slot holds an operand that is used in the corresponding activity.
Access to the slots is direct (i.e. through offsets relative to the frame pointer), so no associative search is needed.
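The ETS matching step can be sketched as follows (illustrative names and frame size; real machines such as Monsoon keep a presence bit per slot in hardware):

```python
FRAME_SIZE = 8
frame_memory = {}   # frame pointer (FP) -> list of slots; None = presence bit off

def allocate_frame(fp):
    """One frame per active loop iteration or subprogram invocation."""
    frame_memory[fp] = [None] * FRAME_SIZE

def ets_token(fp, offset, value, op):
    """Process a token <FP, offset, value>: direct slot access, no associative search."""
    slot = frame_memory[fp][offset]
    if slot is None:                       # presence bit empty: store operand and wait
        frame_memory[fp][offset] = value
        return None
    frame_memory[fp][offset] = None        # clear the presence bit
    return op(slot, value)                 # both operands present: fire

allocate_frame(100)
assert ets_token(100, 2, 5, lambda a, b: a + b) is None  # first operand waits in FP+2
print(ets_token(100, 2, 7, lambda a, b: a + b))  # 12
```

The crucial difference from the hash-based matching store of dynamic dataflow machines is that the slot address is computed directly from FP plus a compile-time offset.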
Explicit Token Store
[Figure: a token <FP, IP, 3.01> addressing a frame in frame memory via the frame pointer FP; each frame slot holds a presence bit and a value (e.g. 2.34 at FP+2). The instruction memory, indexed by IP, holds for each instruction (+, -, *, sqrt) its opcode, the operand offset in the activation frame, and its left/right destinations.]
Monsoon, an Explicit Token Store Machine
[Figure: processing elements and I-structure storage units connected by a multistage packet-switching network. Each PE pipeline consists of instruction fetch, effective address generation, presence bit operation, frame operation (on frame memory), ALU, and form token stages, fed by user and system token queues, with a connection to/from the communication network.]
WaveCache: A Dataflow Processor
WaveScalar is the ISA of a dataflow processor named WaveCache. The WaveCache is a grid of approximately 2K processing elements (PEs) arranged into clusters of 16.
WaveCache: A Dataflow Processor
A WaveScalar executable contains an encoding of the program dataflow graph. The instructions explicitly send data values to the instructions that need them instead of broadcasting them via the register file. The potential consumers are known at compile time but, depending on control flow, only a subset of them should receive the values at run time.
WaveCache: A Dataflow Processor
Traditional imperative languages provide the programmer with a model of memory known as total load-store ordering. WaveScalar brings load-store ordering to dataflow computing using wave-ordered memory.
Wave-ordered memory annotates each memory operation with its location in its wave and its ordering relationships (defined by the control flow graph) with other memory operations in the same wave.
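The effect of such ordering annotations can be sketched very roughly as follows (a deliberate simplification: real wave-ordered memory uses <predecessor, this, successor> triples with wildcards for branches; here each operation just names a single predecessor, and all names are invented):

```python
completed = set()   # sequence numbers of finished memory operations
pending = {}        # predecessor seq -> operations waiting for it

def arrive(seq, pred, action, log):
    """Deliver memory op `seq` that must follow op `pred` (None = no predecessor)."""
    if pred is None or pred in completed:
        action()                                 # issue the memory operation
        log.append(seq)
        completed.add(seq)
        for s, a in pending.pop(seq, []):        # completing may unblock successors
            arrive(s, seq, a, log)
    else:
        pending.setdefault(pred, []).append((seq, action))

log, mem = [], {}
arrive(2, 1, lambda: mem.__setitem__("x", 9), log)    # store arrives early, must wait
arrive(1, None, lambda: mem.setdefault("x", 0), log)  # load that precedes it in program order
print(log, mem["x"])  # [1, 2] 9
```

Even though the tokens of a dataflow execution arrive in arbitrary order, the memory interface replays the loads and stores of a wave in the order the control flow graph defines.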