Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism
Wei Du, Gagan Agrawal
Ohio State University
Mar 13, 2016
Distributed Data-Intensive Applications
Fast growing datasets
Remote data access
Distributed data storage
More connected world
[Figure: data sources distributed across the Internet]
Implementation: Local processing
Requirements: Huge Storage/Powerful Computer/Fast Connection
Implementation: Remote processing
Requirements: Complex Analysis at Data Centers
A Practical Solution
Our hypothesis: the coarse-grained pipelined execution model is a good match
Coarse-Grained Pipelined Execution
Definition: Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units.
Example: K-nearest Neighbor (KNN). Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), we want to find the nearest K neighbors of p within R.
Stages: Range_query, then Find the K-nearest neighbors
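The KNN example can be sketched as a two-stage pipeline over packets of points. This is only an illustration: the function names `range_query` and `k_nearest`, the packet layout (a list of 3-D tuples), and the use of squared Euclidean distance are assumptions, not part of the original system.

```python
import heapq

def range_query(packets, lo, hi):
    """Stage 1: keep only the points of each packet that fall inside R = <lo, hi>."""
    for packet in packets:
        yield [pt for pt in packet
               if all(l <= c <= h for l, c, h in zip(lo, pt, hi))]

def k_nearest(packets, p, k):
    """Stage 2: maintain the K points closest to p across all filtered packets."""
    def sq_dist(pt):
        # Squared Euclidean distance preserves the nearest-neighbor ordering.
        return sum((a - b) ** 2 for a, b in zip(pt, p))

    best = []
    for packet in packets:
        best = heapq.nsmallest(k, best + packet, key=sq_dist)
    return best

# Two packets flow through stage 1 and then stage 2.
packets = [[(0, 0, 0), (1, 1, 1), (9, 9, 9)],
           [(2, 2, 2), (0.5, 0.5, 0.5)]]
filtered = range_query(packets, lo=(0, 0, 0), hi=(5, 5, 5))
neighbors = k_nearest(filtered, p=(0, 0, 0), k=2)
```

Because `range_query` is a generator, the second stage consumes each packet as soon as the first stage has filtered it, which mirrors the packet-by-packet flow of a pipeline on a single machine.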
Challenges
Computation associated with an application needs to be decomposed into stages
Decomposition decisions are dependent on the execution environment
Generating code for each stage (SC03)
Other performance issues for the pipelined execution (ICPP04)
Adapting to the dynamic execution environment (SC04)
RoadMap
Filter Decomposition Problem
MIN_ONETRIP Algorithm
MIN_BOTTLENECK Algorithm
MIN_TOTAL Algorithm
Experimental Results
Related Work
Conclusion
Filter Decomposition
[Figure: n atomic filters f1, ..., fn mapped onto a computation pipeline of m computing units C1, ..., Cm connected by links L1, ..., Lm-1; example placements group consecutive filters on one unit]
Goal: Find a placement $p(f_1, f_2, \ldots, f_n) = (F_1, F_2, \ldots, F_m)$, where $F_i = f_{i_1}, f_{i_1+1}, \ldots, f_{i_k}$ with $1 \le i_1, i_k \le n$ for each $1 \le i \le m$, such that the predicted execution time is minimal.
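Cost Model
[Figure: filters f1-f4 placed on computing units C1, C2, C3 connected by links L1 and L2]
Bottleneck stage: the b-th stage, the slowest stage in the pipeline
Execution time for N packets, with C2 as the bottleneck:
$T = T(C_1) + T(L_1) + N \cdot T(C_2) + T(L_2) + T(C_3) = \sum_i T_i + (N-1) \cdot T_b$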
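The cost model can be checked with a few lines of code. The sketch below assumes that every stage (computing unit or link) is paid once by the first packet and that the bottleneck stage paces each of the remaining N-1 packets; the function name `pipeline_time` and the sample stage times are made up for illustration.

```python
def pipeline_time(stage_times, num_packets):
    """Predicted time for num_packets packets on a pipeline.

    stage_times lists the per-packet time of every stage in order,
    e.g. [T(C1), T(L1), T(C2), T(L2), T(C3)].
    """
    bottleneck = max(stage_times)  # T_b, the slowest stage
    return sum(stage_times) + (num_packets - 1) * bottleneck

# Hypothetical stage times with C2 (5 units) as the bottleneck.
stages = [2, 1, 5, 1, 3]
total = pipeline_time(stages, num_packets=10)
# Same value as T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3).
expanded = 2 + 1 + 10 * 5 + 1 + 3
```

For a single packet the bottleneck term vanishes and only the sum of the stage times matters; for many packets the bottleneck term dominates.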
Three Algorithms
MIN_ONETRIP Algorithm: dynamic programming algorithm to minimize $\sum_i T_i$
MIN_BOTTLENECK Algorithm: dynamic programming algorithm to minimize $T_b$
MIN_TOTAL Algorithm: greedy algorithm that tries to minimize $T$
$T = \sum_i T_i + (N-1) \cdot T_b$
Filter Decomposition: MIN_ONETRIP
Goal: minimize time spent by one packet on the pipeline
[Figure: the last computing units Cm-2, Cm-1, Cm and links Lm-2, Lm-1, with filters fn-1 and fn placed on them]
$T[i,j]$: min cost of doing computations $f_1, \ldots, f_i$ on computing units $C_1, \ldots, C_j$, where the results of $f_i$ are on $C_j$.
$$T[i,j] = \min \begin{cases} T[i-1,j] + \mathrm{Cost\_comp}(P(C_j), Task(f_i)) \\ T[i,j-1] + \mathrm{Cost\_comm}(B(L_{j-1}), Vol(f_i)) \end{cases}$$
Goal: $T[n,m]$
Cost: $O(mn)$
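A minimal sketch of this dynamic program follows. Several details are assumptions not stated on the slide: Cost_comp is modeled as work divided by computing power, Cost_comm as volume divided by bandwidth, the raw input has volume `vol[0]` and starts on C1 (so T[0,1] = 0), and the function name `min_onetrip` is hypothetical.

```python
def min_onetrip(task, vol, power, bandwidth):
    """Minimum one-packet traversal time T[n][m].

    task[i-1]      : work of filter f_i
    vol[i]         : output volume of f_i (vol[0] is the raw input, an assumption)
    power[j-1]     : computing power P(C_j)
    bandwidth[j-1] : bandwidth B(L_j) of the link between C_j and C_{j+1}
    """
    n, m = len(task), len(power)
    inf = float("inf")
    T = [[inf] * (m + 1) for _ in range(n + 1)]
    T[0][1] = 0.0  # assumed base case: input resides on C1, nothing computed yet
    for i in range(n + 1):
        for j in range(1, m + 1):
            if i == 0 and j == 1:
                continue
            best = inf
            if i >= 1:  # f_i is computed on C_j
                best = min(best, T[i - 1][j] + task[i - 1] / power[j - 1])
            if j >= 2:  # the results of f_i are shipped over L_{j-1}
                best = min(best, T[i][j - 1] + vol[i] / bandwidth[j - 2])
            T[i][j] = best
    return T[n][m]

# Two filters on two units: compute f1 on C1, ship its small output, compute f2 on C2.
result = min_onetrip(task=[4, 6], vol=[10, 2, 1], power=[1, 2], bandwidth=[1])
```

Each table entry looks at two neighbors only, which gives the O(mn) cost.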
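Filter Decomposition: MIN_BOTTLENECK
[Figure: candidate placements on the last computing unit Cm, ranging from fn alone to f2, ..., fn, with the remaining filters on the preceding units; links Lm-2 and Lm-1 connect Cm-2, Cm-1, Cm]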
Goal: minimize time spent at the bottleneck stage
$N[i,j]$: min cost of the bottleneck stage for computing $f_1, \ldots, f_i$ on computing units $C_1, \ldots, C_j$, where the results of $f_i$ are on $C_j$.
$$N[i,j] = \min \begin{cases} \max\{ N[i,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), Vol(f_i)) \} \\ \max\{ N[i-1,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), Vol(f_{i-1})),\ \mathrm{Cost\_comp}(P(C_j), Task(f_i)) \} \\ \quad \vdots \\ \max\{ N[1,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), Vol(f_1)),\ \mathrm{Cost\_comp}(P(C_j), Task(f_2) + \cdots + Task(f_i)) \} \end{cases}$$
Cost: $O(mn^2)$
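The recurrence can be sketched as follows. As assumptions beyond the slide: Cost_comp is work divided by power, Cost_comm is volume divided by bandwidth, the base case places f1..fi entirely on C1, and an extra case k = 0 (forwarding the raw input of volume `vol[0]`) is allowed. The name `min_bottleneck` is hypothetical.

```python
def min_bottleneck(task, vol, power, bandwidth):
    """Minimum achievable bottleneck-stage time N[n][m].

    task[i-1]      : work of filter f_i
    vol[i]         : output volume of f_i (vol[0] is the raw input, an assumption)
    power[j-1]     : computing power P(C_j)
    bandwidth[j-1] : bandwidth B(L_j) of the link between C_j and C_{j+1}
    """
    n, m = len(task), len(power)
    prefix = [0.0]  # prefix[i] = Task(f_1) + ... + Task(f_i)
    for t in task:
        prefix.append(prefix[-1] + t)

    N = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        N[i][1] = prefix[i] / power[0]  # assumed base case: f_1..f_i all on C1
    for j in range(2, m + 1):
        for i in range(n + 1):
            # f_1..f_k finish on C_{j-1}; f_{k+1}..f_i run on C_j (none if k == i).
            N[i][j] = min(
                max(N[k][j - 1],
                    vol[k] / bandwidth[j - 2],
                    (prefix[i] - prefix[k]) / power[j - 1])
                for k in range(i + 1)
            )
    return N[n][m]

result = min_bottleneck(task=[4, 6], vol=[10, 2, 1], power=[1, 2], bandwidth=[1])
```

Every entry scans up to n + 1 split points k, which is where the extra factor of n in the O(mn^2) cost comes from.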
Filter Decomposition: MIN_TOTAL
Goal: minimize the predicted execution time T
[Figure: filters f1-f5 to be placed on computing units C1-C4 connected by links L1-L3. Estimated cost for each candidate group on C1: f1: T1; f1, f2: T2; f1 - f3: T3; f1 - f4: T4. Min{T1 … T4} = T2, so f1, f2 is chosen.]
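Experimental Results
4 Configurations
3 Applications: Virtual Microscope, Iso-Surface Rendering
[Figure: the four pipeline configurations; labels as extracted, one line per configuration: "1 1 11 1", "1 1 10.1 0.5", "1 1 0.011 0.001", "0.1 1 0.011 0.001"]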
Used Applications
Virtual Microscope (Vmscope): an emulation of a microscope
input: a rectangular region, a resolution value
output: portion of the original image with certain resolution
Experimental Results: Virtual Microscope
3 queries
  Q1: 1 packet
  Q2: 4 packets
  Q3: 4500 packets
4 Algorithms
  MIN_ONETRIP
  MIN_BOTTLENECK
  MIN_TOTAL
  Exhaustive_Search
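A baseline such as Exhaustive_Search can be sketched by enumerating every way to cut the filter sequence into m consecutive (possibly empty) groups and scoring each placement. This is an illustration under stated assumptions, not the original implementation: computation time is work divided by power, communication time is volume divided by bandwidth, `vol[0]` is the raw input volume, and the score counts every stage once plus the bottleneck stage for each remaining packet. The name `exhaustive_search` is hypothetical.

```python
from itertools import combinations_with_replacement

def exhaustive_search(task, vol, power, bandwidth, num_packets):
    """Return (best predicted time, cut points) over all contiguous placements.

    A tuple of cut points c_1 <= ... <= c_{m-1} assigns filters
    f_{c_{j-1}+1} .. f_{c_j} to unit C_j, with c_0 = 0 and c_m = n.
    """
    n, m = len(task), len(power)
    best_time, best_cuts = float("inf"), None
    for cuts in combinations_with_replacement(range(n + 1), m - 1):
        bounds = (0,) + cuts + (n,)
        stages = []
        for j in range(m):
            # Computation stage on unit C_{j+1}.
            stages.append(sum(task[bounds[j]:bounds[j + 1]]) / power[j])
            if j < m - 1:
                # Communication stage on the link that follows this unit.
                stages.append(vol[bounds[j + 1]] / bandwidth[j])
        total = sum(stages) + (num_packets - 1) * max(stages)
        if total < best_time:
            best_time, best_cuts = total, cuts
    return best_time, best_cuts

one_packet = exhaustive_search([4, 6], [10, 2, 1], [1, 2], [1], num_packets=1)
many_packets = exhaustive_search([4, 6], [10, 2, 1], [1, 2], [1], num_packets=100)
```

The number of placements grows combinatorially with n and m, so a search like this is practical only as a reference point for small pipelines.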
[Figure: four bar charts of execution time (in ms) for queries Q1, Q2, Q3 under MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]
Experimental Results: Virtual Microscope
Two observations
The performance variance between different algorithms is small
The Exha_Search does not always give the best placement
  characteristics are based on one-packet information
  combining two filters as one saves copying cost
Used Applications
Iso-surface rendering (Iso)
input: a 3-D grid, a scalar value, a view screen with angle specified
output: a surface seen from a certain angle, which captures points in the grid whose scalar value matches the given iso-surface value
Experimental Results: Iso
2 Implementations
  ZBUF
  ACTP
2 Datasets
  small: 3 packets
  large: 47 packets
4 Algorithms
  MIN_ONETRIP
  MIN_BOTTLENECK
  MIN_TOTAL
  Exhaustive_Search
[Figure: two bar charts (small dataset, large dataset) of execution time (in ms) for the ZBUF and ACTP implementations under MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]
Experimental Results: Iso
The MIN_TOTAL algorithm gives the best placement for the small dataset
The MIN_ONETRIP algorithm finds the best placement for the large dataset
This application is very data-dependent!
Experimental Results: Iso
[Figure: two charts (ZBUF, ACTP) of execution time (in ms) against number of runs (3, 10, 100) under MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]
Conclusion & Future Work
Our algorithms perform quite well
Future Work
To find more accurate characteristics of applications
  estimate of the performance change resulting from combining multiple atomic filters
  estimate of the impact of data dependence
Thank you!