BERKELEY PAR LAB
Lithe: Enabling Efficient Composition of Parallel Libraries
Heidi Pan, Benjamin Hindman, Krste Asanović
[email protected] {benh, krste}@eecs.berkeley.edu
Massachusetts Institute of Technology, UC Berkeley
HotPar, Berkeley, CA, March 31, 2009
Need both programmer productivity and performance!
Functionality: [image] or [image] or [image] or [image]
3
Composability is Key to Productivity
Functional Composability
Code reuse: same library implementation, different apps (App 1 and App 2 both use sort).
Modularity: same app, different library implementations (App uses sort, implemented as bubblesort or quicksort).
4
Composability is Key to Productivity
Performance Composability
[Diagram: fast + fast gives fast; fast + faster gives fast(er).]
5
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Harts, Lithe
Evaluation
6
Motivational Example
Sparse QR Factorization (SPQR) (Tim Davis, Univ of Florida)
Software Architecture: Column Elimination Tree, Frontal Matrix Factorization
System Stack: SPQR, TBB, MKL, OpenMP, OS, Hardware
7
Out-of-the-Box Performance
[Figure: Performance of SPQR on 16-core Machine. Time (sec) per input matrix (deltaX, landmark, ESOC, Rucci) for the Out-of-the-Box configuration, with a "sequential" annotation.]
8
Out-of-the-Box Libraries Oversubscribe the Resources
[Diagram: TBB and OpenMP run virtualized kernel threads on top of the OS; the hardware provides Core 0 to Core 3.]
9
MKL Quick Fix
Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
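As a minimal sketch of the quick fix above, the snippet below sets OMP_NUM_THREADS=1 only for a child process rather than for the whole shell. The helper names and the stand-in child command are assumptions made for illustration; a real run would launch the application that calls MKL.

```python
import os
import subprocess
import sys


def sequential_mkl_env(base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS forced to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"  # the setting quoted on the slide
    return env


def run_with_sequential_mkl(argv):
    """Run a command (hypothetical application) under the modified environment."""
    return subprocess.run(argv, env=sequential_mkl_env(), check=True,
                          capture_output=True, text=True)


# Stand-in child process: it only echoes the variable it inherited.
child = [sys.executable, "-c",
         "import os; print(os.environ['OMP_NUM_THREADS'])"]
result = run_with_sequential_mkl(child)
print(result.stdout.strip())
```

Passing the variable through `env=` leaves the parent's environment untouched, so other programs started from the same shell keep their own threading settings.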
10
Sequential MKL in SPQR
[Diagram: TBB and OpenMP above the OS and hardware, with Core 0 to Core 3.]
11
Sequential MKL Performance
[Figure: Performance of SPQR on 16-core Machine. Time (sec) per input matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box and Sequential MKL.]
12
SPQR Wants to Use Parallel MKL
No task-level parallelism!
Want to exploit matrix-level parallelism.
13
Share Resources Cooperatively
Tim Davis manually tunes libraries to effectively partition the resources.
[Diagram: TBB and OpenMP above the OS and hardware. TBB_NUM_THREADS = 2 covers Core 0 and Core 1; OMP_NUM_THREADS = 2 covers Core 2 and Core 3.]
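The thread-count arithmetic behind oversubscription and manual partitioning can be sketched as follows. This is a simplified model, not a measurement: it assumes every TBB worker may be inside an MKL call at the same time and that each such call runs its own OpenMP team, so the peak number of runnable threads is the product of the two settings. The function names are illustrative.

```python
def peak_threads(tbb_num_threads, omp_num_threads):
    """Peak runnable threads if every TBB worker runs its own OpenMP team."""
    return tbb_num_threads * omp_num_threads


def oversubscription(tbb_num_threads, omp_num_threads, cores):
    """Runnable threads per core; 1.0 means one thread per core."""
    return peak_threads(tbb_num_threads, omp_num_threads) / cores


CORES = 4  # the four-core machine drawn in the diagrams

# Assumed out-of-the-box defaults: each library sizes itself to the machine.
print(oversubscription(CORES, CORES, CORES))

# Manual partition from the slide: TBB_NUM_THREADS = 2, OMP_NUM_THREADS = 2.
print(oversubscription(2, 2, CORES))
```

Under this model the tuned setting only fits the machine for one particular core count, which is why the numbers have to be chosen again for every machine and every caller.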
14
Manually Tuned Performance
[Figure: Performance of SPQR on 16-core Machine. Time (sec) per input matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box, Sequential MKL, and Manually Tuned.]
15
Manual Tuning Cannot Share Resources Effectively
Give resources to OpenMP
Give resources to TBB
16
Manual Tuning Destroys Functional Composability
[Diagram labels: Tim Davis; LAPACK (Ax=b); MKL; OpenMP; OMP_NUM_THREADS = 4]
17
Manual Tuning Destroys Performance Composability
[Diagram labels: App; SPQR; MKL v1, MKL v2, MKL v3; cores 0 to 3]
18
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution:
Harts: better resource abstraction
Lithe: framework for sharing resources
TBBLithe Performance (µbench included with release)
           tree sum   preorder   fibonacci
TBBLithe   54.80ms    228.20ms   8.42ms
TBB        54.80ms    242.51ms   8.72ms
OpenMPLithe Performance (NAS parallel benchmarks)
              conjugate gradient (cg)   LU solver (lu)   multigrid (mg)
OpenMPLithe   57.06s                    122.15s          9.23s
OpenMP        57.00s                    123.68s          9.54s
All results on Linux 2.6.18, 8-core Intel Clovertown.
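To read the two tables as relative overhead, the sketch below computes the percentage change of each Lithe-based runtime against its baseline, using only the times reported above. The dictionary layout and function name are illustrative choices; a negative value means the Lithe version was faster.

```python
# (Lithe version, baseline) pairs copied from the tables above.
TBB_MS = {
    "tree sum": (54.80, 54.80),
    "preorder": (228.20, 242.51),
    "fibonacci": (8.42, 8.72),
}
OPENMP_S = {
    "cg": (57.06, 57.00),
    "lu": (122.15, 123.68),
    "mg": (9.23, 9.54),
}


def relative_change_percent(lithe, baseline):
    """Percentage change of the Lithe time relative to the baseline time."""
    return (lithe - baseline) / baseline * 100.0


def summarize(table):
    """Map each benchmark name to its relative change in percent."""
    return {name: relative_change_percent(lithe, base)
            for name, (lithe, base) in table.items()}


for label, table in (("TBBLithe vs TBB", TBB_MS),
                     ("OpenMPLithe vs OpenMP", OPENMP_S)):
    for name, change in summarize(table).items():
        print(f"{label:22s} {name:10s} {change:+.2f}%")
```

Only conjugate gradient comes out slightly slower under Lithe in these numbers; the other benchmarks are equal or marginally faster.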
35
Performance Characteristics of SPQR (Input = ESOC)
[Figure: surface plot of Time (sec), in bands from 50 to 350, over NUM_OMP_THREADS (1 to 8) and NUM_TBB_THREADS (2 to 8).]
36
Performance Characteristics of SPQR (Input = ESOC)
[Figure: surface plot of Time (sec), in bands from 50 to 350, over NUM_OMP_THREADS (1 to 8) and NUM_TBB_THREADS (2 to 8).]
Sequential (TBB=1, OMP=1): 172.1 sec
37
Performance Characteristics of SPQR (Input = ESOC)
[Figure: surface plot of Time (sec), in bands from 50 to 350, over NUM_OMP_THREADS (1 to 8) and NUM_TBB_THREADS (2 to 8).]
Out-of-the-Box (TBB=8, OMP=8): 111.8 sec
38
Performance Characteristics of SPQR (Input = ESOC)
[Figure: surface plot of Time (sec), in bands from 50 to 350, over NUM_OMP_THREADS (1 to 8) and NUM_TBB_THREADS (2 to 8), with the Out-of-the-Box and Manually Tuned points marked.]
Manually Tuned: 70.8 sec
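The three ESOC times quoted on these slides (172.1 sec sequential, 111.8 sec out-of-the-box, 70.8 sec manually tuned) can be turned into speedups with a few lines of arithmetic. The sketch below uses only those reported numbers; the variable and function names are illustrative.

```python
# ESOC run times in seconds, as reported on the slides.
ESOC_SECONDS = {
    "sequential": 172.1,       # TBB=1, OMP=1
    "out_of_the_box": 111.8,   # TBB=8, OMP=8
    "manually_tuned": 70.8,
}


def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


base = ESOC_SECONDS["sequential"]
for name, seconds in ESOC_SECONDS.items():
    print(f"{name:15s} {seconds:6.1f} s  {speedup(base, seconds):.2f}x vs sequential")

# Gain from manual tuning relative to the out-of-the-box configuration.
tuning_gain = speedup(ESOC_SECONDS["out_of_the_box"],
                      ESOC_SECONDS["manually_tuned"])
print(f"manual tuning over out-of-the-box: {tuning_gain:.2f}x")
```

Both parallel configurations stay far below the core count in speedup, which is consistent with the talk's point that thread-count settings alone do not share the machine well.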
39
Performance of SPQR with Lithe
[Figure: Time (sec) per input matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box, Manually Tuned, and Lithe.]
40
Future Work
[Diagram: SPQR on top of TBBLithe, OpenMPLithe, CtLithe, and CilkLithe, each exposing enter, yield, req, reg, unreg.]
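To make the idea of cooperative hart sharing concrete, here is a toy model in which a parent scheduler hands harts to child schedulers on request and takes them back when a child yields. It is not the Lithe API: the class, its fields, the `grant` helper, and the exact semantics given to `reg`, `unreg`, `req`, `enter`, and `yield_` are all assumptions, with names borrowed from the labels on the slide (`yield_` carries a trailing underscore because `yield` is a Python keyword).

```python
class ToyScheduler:
    """Toy parent/child scheduler that counts harts; purely illustrative."""

    def __init__(self, name, harts=0):
        self.name = name
        self.harts = harts    # harts this scheduler currently holds
        self.parent = None
        self.requests = {}    # child -> number of harts still requested

    def reg(self, child):
        """Register a child scheduler with this parent."""
        child.parent = self
        self.requests[child] = 0

    def unreg(self, child):
        """Unregister a child and reclaim any harts it still holds."""
        self.harts += child.harts
        child.harts = 0
        child.parent = None
        del self.requests[child]

    def req(self, child, count):
        """Record that a child would like `count` more harts."""
        self.requests[child] += count

    def enter(self):
        """A hart arrives at this scheduler from its parent."""
        self.harts += 1

    def yield_(self, child):
        """A child hands one hart back; False if it holds none."""
        if child.harts == 0:
            return False
        child.harts -= 1
        self.harts += 1
        return True

    def grant(self, child):
        """Hypothetical helper: pass one free hart to a requesting child."""
        if self.harts == 0 or self.requests.get(child, 0) == 0:
            return False
        self.harts -= 1
        self.requests[child] -= 1
        child.enter()
        return True


app = ToyScheduler("app", harts=4)
tbb, omp = ToyScheduler("tbb"), ToyScheduler("omp")
app.reg(tbb)
app.reg(omp)
app.req(tbb, 2)
app.req(omp, 4)
while app.grant(tbb):    # tbb receives the two harts it asked for
    pass
while app.grant(omp):    # omp receives the remaining two
    pass
app.yield_(tbb)          # tbb gives one hart back to the app
app.grant(omp)           # the freed hart goes to omp's pending request
total = app.harts + tbb.harts + omp.harts
print(tbb.harts, omp.harts, app.harts, total)
```

The point of the sketch is the invariant: harts only move between schedulers, so the total never exceeds what the application started with, unlike thread counts chosen independently by each library.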
41
Conclusion
Composability essential for parallel programming to become widely adopted.
Parallel libraries need to share resources cooperatively.
Lithe project contributions:
Harts: better resource model for parallel programming
Lithe: enables parallel codes to interoperate by standardizing the sharing of harts
[Diagram labels: SPQR; MKL; OpenMP; TBB; functionality; resource management; cores 0 to 3]
42
Acknowledgements
We would like to thank George Necula and the rest of Berkeley Par Lab for their feedback on this work.
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). This work has also been in part supported by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.