Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

Heidi Pan, Benjamin Hindman, Krste Asanović

HotPar Berkeley, CA March 31, 2009

[email protected] {benh, krste}@eecs.berkeley.edu

Massachusetts Institute of Technology UC Berkeley

How to Build Parallel Apps?

[Figure: system stack of App, OS, and Hardware (Core 0 through Core 7). The App layer is annotated with Functionality (several alternative implementations) and Resource Management.]

Need both programmer productivity and performance!

Composability is Key to Productivity

Functional Composability

code reuse: same library implementation (sort), different apps (App 1, App 2)

modularity: same app, different library implementations of sort (bubblesort, quicksort)

Composability is Key to Productivity

Performance Composability

[Figure: fast + fast = fast; faster + fast = fast(er)]

Talk Roadmap

Problem: Efficient parallel composability is hard!

Solution: Harts, Lithe

Evaluation

Motivational Example

Sparse QR Factorization (SPQR), by Tim Davis, Univ. of Florida

Software Architecture: Column Elimination Tree, Frontal Matrix Factorization

System Stack: SPQR, TBB, MKL, OpenMP, OS, Hardware

Out-of-the-Box Performance

[Figure: Performance of SPQR on 16-core Machine. Time (sec) per Input Matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box against sequential.]

Out-of-the-Box Libraries Oversubscribe the Resources

[Figure: TBB and OpenMP each create their own virtualized kernel threads, which the OS multiplexes onto the Hardware (Core 0 through Core 3).]

MKL Quick Fix

Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm

If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.
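As a minimal sketch of this workaround, assuming a Python driver launches the threaded application as a child process, the variable can be set in the environment handed to that child. The helper name `sequential_mkl_env` is hypothetical; only the `OMP_NUM_THREADS` setting comes from the guidance quoted above.

```python
import os


def sequential_mkl_env(base_env=None):
    """Return a copy of the environment with OpenMP threading turned off.

    Hypothetical helper: it only sets OMP_NUM_THREADS=1, as the quoted
    guidance recommends when several application threads call a threaded MKL.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


# Build the environment for a child process; it could then be passed as
# subprocess.run(cmd, env=env) for whatever command runs the application.
env = sequential_mkl_env({"PATH": "/usr/bin", "OMP_NUM_THREADS": "8"})
print(env["OMP_NUM_THREADS"])
```

The setting is process-wide, which is exactly why it is a blunt tool: every caller of the library in that process loses its parallelism, not only the one that was oversubscribing.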

Sequential MKL in SPQR

[Figure: TBB and OpenMP over the OS and Hardware (Core 0 through Core 3), with MKL restricted to one OpenMP thread per caller.]

Sequential MKL Performance

[Figure: Time (sec) per Input Matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box against Sequential MKL.]

SPQR Wants to Use Parallel MKL

No task-level parallelism!

Want to exploit matrix-level parallelism.

Share Resources Cooperatively

Tim Davis manually tunes libraries to effectively partition the resources.

[Figure: TBB and OpenMP over the OS and Hardware. TBB_NUM_THREADS = 2 gives TBB Core 0 and Core 1; OMP_NUM_THREADS = 2 gives OpenMP Core 2 and Core 3.]

Manually Tuned Performance

[Figure: Time (sec) per Input Matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box, Sequential MKL, and Manually Tuned.]

Manual Tuning Cannot Share Resources Effectively

Give resources to OpenMP

Give resources to TBB

Manual Tuning Destroys Functional Composability

[Figure: call chain from Tim Davis's code through LAPACK (Ax=b) and MKL down to OpenMP, where OMP_NUM_THREADS = 4 is set.]

Manual Tuning Destroys Performance Composability

[Figure: SPQR composed with MKL v1, v2, and v3, and an App that calls SPQR, across cores 0 through 3.]

Talk Roadmap


Harts: better resource abstraction

Lithe: framework for sharing resources

Evaluation

Virtualized Threads are Bad

[Figure: the OS multiplexes the threads of App 1 (TBB), App 1 (OpenMP), and App 2 onto Core 0 through Core 7.]

Different codes compete unproductively for resources.

Space-Time Partitions aren’t Enough

Space-time partitions isolate different apps.

[Figure: the OS assigns Partition 1 to SPQR (using MKL, OpenMP, and TBB) and Partition 2 to App 2.]

What to do within an app?

Harts: Hardware Thread Contexts

Represent real hardware resources.

Requested, not created.

OS doesn't manage harts for app.

[Figure: SPQR, MKL, OpenMP, and TBB run directly on Harts provided by the OS.]

Sharing Harts

[Figure: within one Partition, TBB and OpenMP share Hart 0 through Hart 3 over time.]

Cooperative Hierarchical Schedulers

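[Figure: an application call graph spanning OMP, TBB, Cilk, and Ct code, next to the matching library (scheduler) hierarchy: Cilk at the root, TBB and Ct beneath it, and OpenMP at the bottom level.]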

Modular: Each piece of the app scheduled independently.

Hierarchical: Caller gives resources to callee to execute on its behalf.

Cooperative: Callee gives resources back to caller when done.

A Day in the Life of a Hart

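[Figure: scheduler hierarchy with Cilk at the root, TBB and Ct beneath it, and OpenMP at the bottom level; a timeline follows one hart as it works through the TBB Tasks.]

TBB Sched: next? Execute TBB task.

TBB Sched: next? Execute TBB task.

TBB Sched: next? Nothing left to do, give hart back to parent.

Cilk Sched: next? Don't start new task, finish existing one first.

Ct Sched: next?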
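The walk-through above can be mimicked with a toy model. This is a minimal sketch, not Lithe's implementation: the `Scheduler` class, the `run_hart` function, the task names, and the policy of preferring a child with unfinished work are all assumptions made for illustration.

```python
from collections import deque


class Scheduler:
    """Toy scheduler: a queue of task names plus parent/child links."""

    def __init__(self, name, tasks=(), parent=None):
        self.name = name
        self.tasks = deque(tasks)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def has_work(self):
        return bool(self.tasks) or any(c.has_work() for c in self.children)


def run_hart(start, max_steps=20):
    """Follow a single hart through the hierarchy and return its trace."""
    trace, current = [], start
    for _ in range(max_steps):
        if current is None:
            break
        busy_child = next((c for c in current.children if c.has_work()), None)
        if busy_child is not None:
            # Assumed policy: finish existing child work before new tasks.
            trace.append(f"{current.name}: enter {busy_child.name}")
            current = busy_child
        elif current.tasks:
            trace.append(f"{current.name}: execute {current.tasks.popleft()}")
        else:
            # Nothing left to do: hand the hart back to the parent.
            trace.append(f"{current.name}: yield to parent")
            current = current.parent
    return trace


cilk = Scheduler("Cilk", tasks=["cilk-new"])
tbb = Scheduler("TBB", tasks=["tbb-1", "tbb-2"], parent=cilk)
ct = Scheduler("Ct", tasks=["ct-1"], parent=cilk)

trace = run_hart(tbb)
for line in trace:
    print(line)
```

The hart itself never changes; only the scheduler that decides its next step does, which is the property the slide is illustrating.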

Standard Lithe ABI

[Figure: a Caller and a Callee exchange values through call and return (interface for exchanging values). Likewise, a Parent Scheduler and a Child Scheduler, for example CilkLithe with TBBLithe or TBBLithe with OpenMPLithe, share harts through register, unregister, request, enter, and yield (interface for sharing harts).]

Analogous to function call ABI for enabling interoperable codes.

Mechanism for sharing harts, not policy.

Lithe Runtime

[Figure: without Lithe, TBB and OpenMP run directly on the OS and Hardware. With Lithe, TBBLithe and OpenMPLithe run on the Lithe runtime, which tracks the harts, the scheduler hierarchy, and the current scheduler, and routes enter, yield, request, register, and unregister calls between TBBLithe and OpenMPLithe.]

Register / Unregister

Register dynamically adds the new scheduler to the hierarchy.

[Figure: timeline of the TBBLithe Scheduler running matmult(), with the register and unregister entries of its interface highlighted.]

matmult(){
    register(OpenMPLithe);
    :
    :
    unregister(OpenMPLithe);
}

Request

Request asks for more harts from the parent scheduler.

[Figure: timeline of the TBBLithe Scheduler running matmult(), with the request entry of its interface highlighted.]

matmult(){
    register(OpenMPLithe);
    request(n);
    :
    unregister(OpenMPLithe);
}

Enter / Yield

Enter/Yield transfers additional harts between the parent and child.

[Figure: timeline in which the TBBLithe Scheduler calls enter(OpenMPLithe) to give a hart to the child, and the child later calls yield() to return it.]
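The five operations can be tied together with a small bookkeeping model. This is a sketch under stated assumptions rather than the Lithe interface itself: the `Sched` class and the Python function signatures are hypothetical, harts are reduced to counters, and the parent's policy (grant requests but keep one hart for itself) is invented for the example, since Lithe supplies the mechanism and leaves policy to the schedulers.

```python
class Sched:
    """Toy scheduler that only counts the harts it currently holds."""

    def __init__(self, name, harts=0):
        self.name = name
        self.harts = harts
        self.parent = None
        self.requested = 0  # harts asked of the parent, not yet granted


def register(parent, child):
    """Add child under parent; the calling hart moves into the child."""
    child.parent = parent
    parent.harts -= 1
    child.harts += 1


def request(child, n):
    """Child asks its parent for n more harts; the parent decides when."""
    child.requested += n


def enter(child):
    """Parent grants one of its harts to the child."""
    parent = child.parent
    if parent.harts == 0:
        raise RuntimeError("parent has no hart to give")
    parent.harts -= 1
    child.harts += 1
    child.requested = max(0, child.requested - 1)


def yield_hart(child):
    """Child returns one hart to its parent."""
    child.harts -= 1
    child.parent.harts += 1


def unregister(child):
    """Remove child from the hierarchy; its last hart returns to the parent."""
    assert child.harts == 1, "child must yield granted harts first"
    yield_hart(child)
    child.parent = None


tbb = Sched("TBBLithe", harts=4)
omp = Sched("OpenMPLithe")

register(tbb, omp)
request(omp, 2)
while omp.requested and tbb.harts > 1:  # assumed policy: keep one hart
    enter(omp)
peak = (tbb.harts, omp.harts)

while omp.harts > 1:
    yield_hart(omp)
unregister(omp)
print(peak, (tbb.harts, omp.harts))
```

The total number of harts stays constant throughout; they only move between parent and child, which is what distinguishes this model from each library creating its own threads.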

SPQR with Lithe

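[Figure: timeline of one matmult call. SPQR runs on TBBLithe; MKL's matmult registers OpenMPLithe (reg), requests harts (req), receives them through enter, returns them through yield, and unregisters (unreg).]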

SPQR with Lithe

[Figure: timeline of four concurrent matmult calls made by SPQR through TBBLithe into MKL and OpenMPLithe, each performing its own reg, req, and unreg.]

Talk Roadmap


Harts Lithe

Evaluation

Implementation

[Figure: TBBLithe and OpenMPLithe run on Lithe, which runs on Harts.]

Harts: simulated using pinned Pthreads on x86-Linux (~600 lines of C & assembly)

Lithe: user-level library (register, unregister, request, enter, yield, ...) (~2000 lines of C, C++, assembly)

TBBLithe: ~1500 / ~8000 relevant lines added/removed/modified

OpenMPLithe (GCC4.4): ~1000 / ~6000 relevant lines added/removed/modified
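To illustrate what "pinned" threads means, here is a minimal Python analogue, assuming Linux. It is not the authors' code, which uses Pthreads from C and assembly; the function name `start_pinned_workers` is hypothetical. It starts one thread per CPU available to the process and restricts each thread to its own CPU with `os.sched_setaffinity`, falling back to unpinned threads where that call is unavailable or refused.

```python
import os
import threading


def start_pinned_workers(work):
    """Run work(hart_id) on one thread per available CPU, each pinned to it."""
    can_pin = hasattr(os, "sched_setaffinity")  # present on Linux
    cpus = sorted(os.sched_getaffinity(0)) if can_pin else [0]
    results = {}

    def body(hart_id, cpu):
        if can_pin:
            try:
                # With pid 0, Linux applies the mask to the calling thread.
                os.sched_setaffinity(0, {cpu})
            except OSError:
                pass  # pinning refused; continue unpinned
        results[hart_id] = work(hart_id)

    threads = [threading.Thread(target=body, args=(i, cpu))
               for i, cpu in enumerate(cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = start_pinned_workers(lambda hart_id: hart_id * hart_id)
print(len(results), "simulated harts")
```

Because the number of threads equals the number of CPUs and each thread stays on one CPU, the kernel has nothing left to multiplex, which is what lets a thread stand in for a hart.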

No Lithe Overhead w/o Composing

TBBLithe Performance (µbench included with release)

            tree sum    preorder    fibonacci
TBBLithe    54.80ms     228.20ms    8.42ms
TBB         54.80ms     242.51ms    8.72ms

OpenMPLithe Performance (NAS parallel benchmarks)

               conjugate gradient (cg)    LU solver (lu)    multigrid (mg)
OpenMPLithe    57.06s                     122.15s           9.23s
OpenMP         57.00s                     123.68s           9.54s

All results on Linux 2.6.18, 8-core Intel Clovertown.
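The "no overhead" claim can be checked directly from the timings above. The snippet below is a small worked calculation, with variable names chosen for this example; it reports the Lithe-based runtime's time relative to the stock runtime, where a negative percentage means the Lithe version was faster.

```python
# (Lithe version, stock version) timings copied from the tables above.
timings = {
    "tree sum": (54.80, 54.80),    # ms, TBBLithe vs TBB
    "preorder": (228.20, 242.51),  # ms
    "fibonacci": (8.42, 8.72),     # ms
    "cg": (57.06, 57.00),          # s, OpenMPLithe vs OpenMP
    "lu": (122.15, 123.68),        # s
    "mg": (9.23, 9.54),            # s
}

# Relative difference in percent: positive means Lithe was slower.
overhead = {
    name: 100.0 * (lithe - stock) / stock
    for name, (lithe, stock) in timings.items()
}

for name, pct in overhead.items():
    print(f"{name:10s} {pct:+.2f}%")

worst = max(overhead, key=overhead.get)
print("largest slowdown:", worst)
```

Only cg is slower under Lithe, and by roughly a tenth of a percent; the other benchmarks are equal or slightly faster.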

Performance Characteristics of SPQR (Input = ESOC)

[Figure: surface plot of Time (sec), shaded in 50-second bands from 50 to 350, over NUM_OMP_THREADS (1 to 8) and NUM_TBB_THREADS (up to 8).]

Sequential (TBB=1, OMP=1): 172.1 sec

Out-of-the-Box (TBB=8, OMP=8): 111.8 sec

Manually Tuned: 70.8 sec
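For scale, the three ESOC timings quoted on these slides (172.1 sec sequential, 111.8 sec out-of-the-box, 70.8 sec manually tuned) can be turned into speedups. This is a worked calculation only; the variable names are chosen for the example.

```python
# ESOC timings in seconds, as quoted on the slides.
sequential = 172.1      # TBB=1, OMP=1
out_of_the_box = 111.8  # TBB=8, OMP=8
manually_tuned = 70.8

speedups = {
    "out-of-the-box vs sequential": sequential / out_of_the_box,
    "manually tuned vs sequential": sequential / manually_tuned,
    "manually tuned vs out-of-the-box": out_of_the_box / manually_tuned,
}

for label, value in speedups.items():
    print(f"{label}: {value:.2f}x")
```

Oversubscribing with eight threads in each library gives only about a 1.5x gain over the sequential run, while the best manual partition reaches about 2.4x.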

Performance of SPQR with Lithe

[Figure: Time (sec) per Input Matrix (deltaX, landmark, ESOC, Rucci), comparing Out-of-the-Box, Manually Tuned, and Lithe.]

Future Work

[Figure: SPQR on top of TBBLithe, OpenMPLithe, CtLithe, and CilkLithe, each exposing enter, yield, req, reg, and unreg.]

Conclusion

Composability is essential for parallel programming to become widely adopted.

Parallel libraries need to share resources cooperatively.

Lithe project contributions:

Harts: better resource model for parallel programming

Lithe: enables parallel codes to interoperate by standardizing the sharing of harts

[Figure: the SPQR stack (MKL, OpenMP, TBB) with functionality separated from resource management over harts 0 through 3.]

Acknowledgements

We would like to thank George Necula and the rest of Berkeley Par Lab for their feedback on this work.

Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). This work has also been in part supported by a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.
