Communication Overhead Estimation on Multicores
S. M. Farhad
The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, Bernhard Scholz
Feb 01, 2016
Outline
Motivation
  Multicore trend
  Stream programming
Profiling communication overhead
Related works
Motivation
[Figure: number of cores per chip (1 to 512, logarithmic axis) by year (1975 to 2010), for processors from the 4004 to Larrabee, classified as unicore, homogeneous multicore, or heterogeneous multicore. Courtesy: Scott '08]
[Figure: programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind; Stream Programming]
Stream Programming Paradigm
Programs expressed as stream graphs
Streams: infinite sequences of data elements
Actors: functions applied to streams
[Figure: an actor with an input stream and an output stream]
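A minimal Python sketch of the two concepts on this slide, assuming generators as a stand-in for streams. The names `counter`, `scale`, and `running_sum` are illustrative and are not taken from StreamIt or from the talk.

```python
from itertools import count, islice

def counter():
    """Source stream: an unbounded sequence 0, 1, 2, ..."""
    return count(0)

def scale(stream, factor):
    """Actor: multiplies every element of its input stream."""
    for x in stream:
        yield x * factor

def running_sum(stream):
    """Actor with state: emits the sum of all elements seen so far."""
    total = 0
    for x in stream:
        total += x
        yield total

# Stream graph: counter -> scale -> running_sum
graph = running_sum(scale(counter(), 2))

# The stream is unbounded, so only a finite prefix is consumed here.
first_five = list(islice(graph, 5))
```

Each actor only sees its input stream and produces an output stream, which is what makes the communication between actors explicit.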
Properties of Stream Programs
Regular and repeating computation
Independent actors with explicit communication
Producer/consumer dependencies
[Figure: example stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, and Speaker]
StreamIt Language
An implementation of stream programming
Hierarchical structure
Each construct has a single input stream and a single output stream
[Figure: StreamIt constructs: filter; pipeline; splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, splitter). A nested component may be any StreamIt language construct]
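The hierarchical, single-input/single-output structure can be sketched in Python as composable functions. This is an illustration only, not StreamIt syntax: it works on finite lists instead of unbounded streams, assumes a round-robin splitter and joiner, and omits the feedback loop. The names `filter_`, `pipeline`, and `splitjoin` are hypothetical helpers.

```python
def filter_(fn):
    """Leaf construct: applies fn to every element of the input list."""
    return lambda items: [fn(x) for x in items]

def pipeline(*stages):
    """Sequential composition: the output of each stage feeds the next."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run

def splitjoin(*branches):
    """Round-robin splitter, parallel branches, round-robin joiner.

    Assumes the input length is a multiple of the number of branches.
    """
    n = len(branches)
    def run(items):
        outputs = [branch(items[i::n]) for i, branch in enumerate(branches)]
        joined = []
        for group in zip(*outputs):  # interleave one element per branch
            joined.extend(group)
        return joined
    return run

# Any construct can be nested inside any other, since all map list -> list.
program = pipeline(
    filter_(lambda x: x + 1),
    splitjoin(filter_(lambda x: x * 10), filter_(lambda x: -x)),
)
result = program([0, 1, 2, 3])
```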
How to Estimate the Communication Overhead?
Problems in Measuring Communication Overhead
Reasons:
  Multicores are not communication-exposed architectures
  Complex cache hierarchy
  Cache coherence protocols
Consequence:
  Cannot directly measure the communication cost
  Estimate the communication cost by measuring the execution time of actors
Measuring the Communication Overhead of an Edge
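[Figure: actors i and k connected by an edge. No communication cost: both actors on Processor 1, with execution times $t_i$ and $t_k$. With communication cost: the actors on Processor 1 and Processor 2, with execution times $t'_i$ and $t'_k$]
$C(i,k) = (t'_i - t_i) + (t'_k - t_k)$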
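The estimate on this slide, the increase in the execution times of actors i and k once their edge crosses processors, can be sketched in Python as follows. The sign convention (split-placement time minus same-processor time, summed over both actors) is a reading of the slide, and the parameter names and timings are hypothetical.

```python
def communication_cost(t_i, t_k, t_i_split, t_k_split):
    """Estimate C(i, k) from four measured actor execution times.

    t_i, t_k:             times with both actors on the same processor
    t_i_split, t_k_split: times with the actors on different processors
    """
    return (t_i_split - t_i) + (t_k_split - t_k)

# Hypothetical timings in microseconds, not measured data.
cost = communication_cost(t_i=10.0, t_k=20.0, t_i_split=12.5, t_k_split=23.0)
```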
How to Minimize the Required Number of Experiments
[Figure: a pipeline A, B, C with edges 1 and 2. Graph coloring of the edges requires 2 + 1 experiments]
[Figure: pipelines partitioned over Processor 1 and Processor 2, one placement with the even-numbered edges across the partition and one with the odd-numbered edges across the partition]
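For a pipeline, the two placements can be generated mechanically. The Python sketch below assumes edges are numbered 1, 2, ... along the pipeline and reads "2 + 1" as two split placements plus one baseline run with every actor on the same processor; that reading and the function names are assumptions.

```python
def pipeline_placement(num_actors, crossing_parity):
    """Place pipeline actors on processors 1 and 2.

    Edge e (1-based) connects actor e-1 to actor e (0-based actors). Only
    edges whose number has the given parity (0 = even, 1 = odd) cross the
    partition.
    """
    placement = [1]
    for edge in range(1, num_actors):
        prev = placement[-1]
        crosses = edge % 2 == crossing_parity
        placement.append(3 - prev if crosses else prev)  # 3 - p swaps 1 and 2
    return placement

def crossing_edges(placement):
    """Return the 1-based numbers of edges whose endpoints differ."""
    return [e for e in range(1, len(placement))
            if placement[e - 1] != placement[e]]

odd = pipeline_placement(6, crossing_parity=1)
even = pipeline_placement(6, crossing_parity=0)
baseline = [1] * 6  # all actors on one processor: no edge crosses
```

Every edge crosses the partition in exactly one of the two split placements, so three runs cover a pipeline of any length.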
Obs. 1: There is no loop of three actors in a stream graph
[Figure: actors i, k, and l placed on Processor 1 and Processor 2]
Obs. 2: There is no interference of adjacent nodes between edges
[Figure: stream graph with actors A to F placed on processors P-1 to P-4, shown for the blue-colored edges]
Remove Interference
Convert to a line graph
Add interference edges
Use vertex coloring algorithm
[Figure: stream graph on actors A to F and its line graph with vertices AB, BC, BD, CE, DE, and EF]
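The three steps can be sketched in Python on the example graph. The sketch assumes that two edges interfere when a third stream edge joins an endpoint of one to an endpoint of the other, and it uses a simple greedy vertex coloring; the talk does not say which coloring algorithm was used, so both choices are assumptions.

```python
from itertools import combinations

def color_edges(edges):
    """Greedy coloring of stream-graph edges.

    Two edges conflict if they share an actor (line-graph adjacency) or if a
    third edge joins an endpoint of one to an endpoint of the other (the
    assumed meaning of "interference").
    """
    edge_set = {frozenset(e) for e in edges}

    def conflict(e, f):
        if set(e) & set(f):
            return True
        return any(frozenset((u, v)) in edge_set for u in e for v in f)

    conflicts = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if conflict(e, f):
            conflicts[e].add(f)
            conflicts[f].add(e)

    colors = {}
    for e in edges:  # greedy: smallest color unused by conflicting edges
        used = {colors[f] for f in conflicts[e] if f in colors}
        colors[e] = next(c for c in range(len(edges)) if c not in used)
    return colors, conflicts

edges = [("A", "B"), ("B", "C"), ("B", "D"),
         ("C", "E"), ("D", "E"), ("E", "F")]
colors, conflicts = color_edges(edges)
```

Edges with the same color can be profiled in the same experiment. In this example AB and EF end up sharing a color, while every other edge needs its own.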
Processor Leveling Graph
[Figure: stream graph on actors A to F and, for the blue-colored edges, its processor leveling graph with nodes A, {B, C, D, E}, and F]
Coloring the Processor Leveling Graph
[Figure: processor leveling graph with nodes A, {B, C, D, E}, and F, colored with Processor 1 and Processor 2]
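One way to obtain such a placement is to merge actors joined by edges that are not being measured and then 2-color the merged graph. The Python sketch below follows that reading of the slides; the function name, the union-find contraction, and the breadth-first coloring are assumptions, and the blue-colored edges are taken to be AB and EF.

```python
from collections import deque

def assign_processors(actors, edges, measured):
    """Contract unmeasured edges, then 2-color the contracted graph.

    Returns {actor: 1 or 2}, or None if no such 2-coloring exists.
    """
    # Union-find: actors joined by an unmeasured edge share a group.
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    for u, v in edges:
        if (u, v) not in measured:
            parent[find(u)] = find(v)

    # Measured edges become edges between groups.
    neighbours = {find(a): set() for a in actors}
    for u, v in measured:
        gu, gv = find(u), find(v)
        if gu == gv:
            return None  # this measured edge cannot cross the partition
        neighbours[gu].add(gv)
        neighbours[gv].add(gu)

    # Breadth-first 2-coloring of the groups.
    processor = {}
    for start in neighbours:
        if start in processor:
            continue
        processor[start] = 1
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in neighbours[g]:
                if h not in processor:
                    processor[h] = 3 - processor[g]
                    queue.append(h)
                elif processor[h] == processor[g]:
                    return None
    return {a: processor[find(a)] for a in actors}

actors = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("B", "D"),
         ("C", "E"), ("D", "E"), ("E", "F")]
placement = assign_processors(actors, edges, measured={("A", "B"), ("E", "F")})
```

Only the measured edges connect different groups, so they are the only edges that cross the processor boundary in the resulting placement.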
Measuring the Communication Cost
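[Figure: for the blue-colored edges, the stream graph on actors A to F and its processor leveling graph with nodes A, {B, C, D, E}, and F, placed on Processor 1 and Processor 2. Execution times are measured for A, B, E, and F; primed times are those measured with the blue edges across the partition]
$C(A,B) = (t'_A - t_A) + (t'_B - t_B)$
$C(E,F) = (t'_E - t_E) + (t'_F - t_F)$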
Profiling Performance
Benchmark    Total Edges  Profiling Steps  Steps/Edge (%)  Err (%)
SAR          44           3                7               10
MatrixMult   88           21               24              17
MergeSort    37           4                11              31
FMRadio      21           3                14              24
DCT          28           9                32              14
RadixSort    12           2                17              5
FFT          26           3                12              27
MPEG         56           17               30              15
Channel      22           6                27              11
BeamFormer   39           5                13              13
GM                                         17              15
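The derived columns can be recomputed from the raw counts. The Python sketch below assumes that Steps/Edge is the number of profiling steps divided by the number of edges, rounded to a whole percent, and that GM is the geometric mean of each percentage column; both are inferred from the numbers rather than stated on the slide.

```python
import math

# (benchmark, total edges, profiling steps, error %) as listed in the table.
rows = [
    ("SAR", 44, 3, 10), ("MatrixMult", 88, 21, 17), ("MergeSort", 37, 4, 31),
    ("FMRadio", 21, 3, 24), ("DCT", 28, 9, 14), ("RadixSort", 12, 2, 5),
    ("FFT", 26, 3, 27), ("MPEG", 56, 17, 15), ("Channel", 22, 6, 11),
    ("BeamFormer", 39, 5, 13),
]

def geometric_mean(values):
    """Geometric mean computed via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Profiling steps as a percentage of the edge count, per benchmark.
steps_per_edge = [round(100 * steps / total) for _, total, steps, _ in rows]

gm_steps = round(geometric_mean(steps_per_edge))
gm_err = round(geometric_mean([err for *_, err in rows]))
```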
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]
Questions?