Cilk-5: The implementation of the Cilk-5 Multithreaded language. By: Matteo Frigo, Charles E. Leiserson, Keith H. Randall, MIT, 1998. Presented by: Roni Licher, Technion, 2013. Based on Charles E. Leiserson (2006): http://supertech.csail.mit.edu/cilk/lecture-1.ppt

Page 1: Cilk-5

Cilk-5

Based on Charles E. Leiserson (2006): http://supertech.csail.mit.edu/cilk/lecture-1.ppt

The implementation of the Cilk-5 Multithreaded language

By: Matteo Frigo, Charles E. Leiserson, Keith H. Randall. MIT, 1998.

Presented by: Roni Licher Technion, 2013.

Page 2: Cilk-5

Cilk

A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.

Example applications:

● virus shell assembly
● graphics rendering
● n-body simulation
● heuristic search
● dense and sparse matrix computations
● friction-stir welding simulation
● artificial evolution

Page 3: Cilk-5

Shared-Memory Multiprocessor

[Figure: several processors (P), each with its own cache ($), connected through a network to shared memory and I/O.]

In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!

Page 4: Cilk-5

Cilk Is Simple

• Cilk extends the C language with just a handful of keywords.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.

Page 5: Cilk-5

Fibonacci

C elision:

    int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }

Cilk code:

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
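The serial-elision claim can be tried out with an ordinary C compiler. The sketch below is an illustration, not part of the original slides: it assumes that defining the three keywords as empty macros is an acceptable way to obtain the elision, so that the Cilk version of fib compiles and runs as plain serial C.

```c
/* Sketch (assumption: empty macros stand in for "deleting" the keywords).
   After preprocessing, this is exactly the C elision of the Cilk code. */
#define cilk
#define spawn
#define sync

cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* becomes an ordinary call */
        y = spawn fib(n - 2);   /* becomes an ordinary call */
        sync;                   /* becomes an empty statement */
        return x + y;
    }
}
```

Compiled this way the program has no parallelism at all; it only shows that removing the keywords leaves a legal C program with the same result.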

Page 6: Cilk-5

Basic Cilk Keywords

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

cilk: Identifies a function as a Cilk procedure, capable of being spawned in parallel.

spawn: The named child Cilk procedure can execute in parallel with the parent caller.

sync: Control cannot pass this point until all spawned children have returned.

Page 7: Cilk-5

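Dynamic Multithreading

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }

Example: fib(4)

The computation dag unfolds dynamically.

“Processor oblivious”

[Figure: the computation dag for fib(4). fib(4) spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) and fib(0).]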

Page 8: Cilk-5

Multithreaded Computation

• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.

[Figure: a computation dag running from the initial thread to the final thread, with spawn edges, return edges, and continue edges labeled.]

Page 9: Cilk-5

Cactus Stack

Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)

Cilk’s cactus stack supports several views in parallel.

[Figure: procedure A spawns B and C, and C spawns D and E. Views of stack: A sees A; B sees A, B; C sees A, C; D sees A, C, D; E sees A, C, E.]

Page 10: Cilk-5

Algorithmic Complexity Measures

T_P = execution time on P processors

Page 11: Cilk-5

Algorithmic Complexity Measures

T_P = execution time on P processors

T_1 = work

Page 12: Cilk-5

Algorithmic Complexity Measures

T_P = execution time on P processors

T_1 = work
T_∞ = span*

* Also called critical-path length or computational depth.

Page 13: Cilk-5

Algorithmic Complexity Measures

T_P = execution time on P processors

T_1 = work
T_∞ = span*

LOWER BOUNDS
• T_P ≥ T_1/P
• T_P ≥ T_∞

* Also called critical-path length or computational depth.

Page 14: Cilk-5

Speedup

Definition: T_1/T_P = speedup on P processors.

If T_1/T_P = Θ(P) ≤ P, we have linear speedup;
if T_1/T_P = P, we have perfect linear speedup;
if T_1/T_P > P, we have superlinear speedup, which is not possible in our model, because of the lower bound T_P ≥ T_1/P.

Page 15: Cilk-5

Parallelism

Because we have the lower bound T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is

T_1/T_∞ = parallelism
        = the average amount of work per step along the span.

Page 16: Cilk-5

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = ?
Span: T_∞ = ?

Work: T_1 = 17
Span: T_∞ = 8

[Figure: the fib(4) dag, with the eight threads along the critical path numbered 1 to 8.]

Page 17: Cilk-5

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Parallelism: T_1/T_∞ = 2.125

Using many more than 2 processors makes little sense.
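The numbers for fib(4) can be reproduced with two short recurrences. The sketch below rests on an assumed thread model that the slides do not spell out: a call with n < 2 is a single thread, and a call with n >= 2 consists of three threads (up to the first spawn, between the two spawns, and after the sync), each taking unit time. The function names are hypothetical.

```c
/* Work: count every unit-time thread in the dag of fib(n). */
static long fib_work(int n) {
    if (n < 2) return 1;                       /* one thread, no spawns */
    return 3 + fib_work(n - 1) + fib_work(n - 2);
}

/* Span: longest path through the dag of fib(n). */
static long fib_span(int n) {
    if (n < 2) return 1;
    /* first thread -> child fib(n-1) -> thread after the sync */
    long via_first = 1 + fib_span(n - 1) + 1;
    /* first and second thread -> child fib(n-2) -> thread after the sync */
    long via_second = 2 + fib_span(n - 2) + 1;
    return via_first > via_second ? via_first : via_second;
}
```

Under this model fib_work(4) is 17 and fib_span(4) is 8, matching the work and span on the slide, so the parallelism is 17/8 = 2.125.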

Page 18: Cilk-5

Cilk’s Work-Stealing Scheduler

Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack.

[Figure: four processors (P), each with its own deque. Animation step: Spawn!]

Page 19: Cilk-5

[Animation step: Spawn! Spawn!]

Page 20: Cilk-5

[Animation step: Return!]

Page 21: Cilk-5

[Animation step: Return!]

Page 22: Cilk-5

When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

[Animation step: Steal!]

Page 23: Cilk-5

[Animation step: Steal!]

Page 24: Cilk-5

[Animation continues.]

Page 25: Cilk-5

[Animation step: Spawn!]
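The two ends of the deque can be illustrated with a small single-threaded sketch. It is not the Cilk runtime: the type work_deque, the capacity DEQUE_CAP, and the function names are hypothetical, threads are represented by integer ids, and there is no synchronization between owner and thief.

```c
#include <stdbool.h>

#define DEQUE_CAP 64   /* hypothetical fixed capacity for this sketch */

typedef struct {
    int items[DEQUE_CAP];
    int top;      /* index of the oldest entry; thieves steal here */
    int bottom;   /* one past the newest entry; the owner works here */
} work_deque;

static void deque_init(work_deque *d) { d->top = 0; d->bottom = 0; }

/* Owner: push a ready thread at the bottom, as on a spawn. */
static bool push_bottom(work_deque *d, int id) {
    if (d->bottom == DEQUE_CAP) return false;
    d->items[d->bottom++] = id;
    return true;
}

/* Owner: pop the newest thread from the bottom, as on a return. */
static bool pop_bottom(work_deque *d, int *id) {
    if (d->bottom == d->top) return false;   /* deque is empty */
    *id = d->items[--d->bottom];
    return true;
}

/* Thief: take the oldest thread from the top of the victim's deque. */
static bool steal_top(work_deque *d, int *id) {
    if (d->bottom == d->top) return false;   /* nothing to steal */
    *id = d->items[d->top++];
    return true;
}
```

After pushing 1, 2, 3, a steal returns 1 while the owner's next pop returns 3: the owner sees stack order at the bottom, and the thief takes the oldest entry at the top.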

Page 26: Cilk-5

The T.H.E. Protocol

• Deques are held in shared memory.
• Workers operate at the end, thieves at the front.
• We must prevent race conditions where a thief and victim try to access the same procedure frame.
• Locking deques would be expensive for workers.
• The T.H.E. Protocol removes the overhead of the common case, where there is no conflict.

Page 27: Cilk-5

The T.H.E. Protocol

• Assumes only reads and writes are atomic.
• Head of the deque is H, tail is T, and (T ≥ H).
  - Only the thief can change H.
  - Only the worker can change T.
• To steal, thieves must get the lock L.
  - At most two processors operate on the deque.
• Three cases of interaction:
  - Two or more items on deque: each gets one.
  - One item on deque: either worker or thief gets the frame, but not both.
  - No items on deque: both worker and thief fail.

Page 28: Cilk-5

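T.H.E. Protocol: The Worker/Victim

Worker/victim:

    push() {
      T++;
    }

    pop() {
      T--;
      if (H > T) {          /* possible conflict with a thief */
        T++;
        lock(L);
        T--;
        if (H > T) {        /* deque is empty: undo and fail */
          T++;
          unlock(L);
          return FAILURE;
        }
        unlock(L);
      }
      return SUCCESS;
    }

Thief:

    steal() {
      lock(L);
      H++;
      if (H > T) {          /* nothing left to steal: undo and fail */
        H--;
        unlock(L);
        return FAILURE;
      }
      unlock(L);
      return SUCCESS;
    }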
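The pseudocode above can be written as compilable C. The following is a sketch under stated assumptions rather than the Cilk-5 runtime code: H and T are C11 sequentially consistent atomics (standing in for the slide's "reads and writes are atomic"), the lock L is a simple spinlock built on atomic_flag, and only the indices are tracked, not the frames themselves. All names (the_deque, the_pop, and so on) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_long H;   /* head: changed only by thieves */
    atomic_long T;   /* tail: changed only by the worker */
    atomic_flag L;   /* lock taken by thieves, and by the worker on conflict */
} the_deque;

static void the_init(the_deque *d) {
    atomic_init(&d->H, 0);
    atomic_init(&d->T, 0);
    atomic_flag_clear(&d->L);
}

static void the_lock(the_deque *d)   { while (atomic_flag_test_and_set(&d->L)) { } }
static void the_unlock(the_deque *d) { atomic_flag_clear(&d->L); }

/* Worker: push never takes the lock. */
static void the_push(the_deque *d) { atomic_fetch_add(&d->T, 1); }

/* Worker: the common case (no conflict) takes no lock. */
static bool the_pop(the_deque *d) {
    atomic_fetch_sub(&d->T, 1);
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_fetch_add(&d->T, 1);          /* back off and retry under the lock */
        the_lock(d);
        atomic_fetch_sub(&d->T, 1);
        if (atomic_load(&d->H) > atomic_load(&d->T)) {
            atomic_fetch_add(&d->T, 1);      /* deque is empty */
            the_unlock(d);
            return false;
        }
        the_unlock(d);
    }
    return true;
}

/* Thief: always works under the lock. */
static bool the_steal(the_deque *d) {
    the_lock(d);
    atomic_fetch_add(&d->H, 1);
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_fetch_sub(&d->H, 1);          /* nothing to steal */
        the_unlock(d);
        return false;
    }
    the_unlock(d);
    return true;
}
```

With two items on the deque, one steal and one pop both succeed; after that, further steals and pops fail and leave H and T unchanged, which corresponds to the "no items on deque" case listed in the slides.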

Page 29: Cilk-5

Performance of Work-Stealing

Theorem: Cilk’s work-stealing scheduler achieves an expected running time of

T_P ≤ T_1/P + O(T_∞)

on P processors.

Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T_1. The expected cost of all steals is O(P T_∞). Since there are P processors, the expected time is (T_1 + O(P T_∞))/P = T_1/P + O(T_∞). ■

Page 30: Cilk-5

The Work First Principle

Definitions:
T_S: the running time of the C elision
c_1 = T_1/T_S: work overhead
c_∞: critical-path overhead

Assumption: P ≪ T_1/T_∞

Page 31: Cilk-5

The Work First Principle

Definitions:
T_S: the running time of the C elision
c_1 = T_1/T_S: work overhead
c_∞: critical-path overhead

Assumption: P ≪ T_1/T_∞

Work-first justification: Since P ≪ T_1/T_∞ is equivalent to T_∞ ≪ T_1/P,

(1) T_P ≤ T_1/P + O(T_∞)
(2) T_P ≤ T_1/P + c_∞ T_∞
(3) T_P ≤ c_1 T_S/P + c_∞ T_∞

T_P ≈ c_1 T_S/P

Minimize c_1, even at the expense of a larger c_∞.

Page 32: Cilk-5

Cilk Chess Programs

● Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA’s 512-node Connection Machine CM5.
● Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs’ 1824-node Intel Paragon.
● Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University’s 64-processor SGI Origin 2000.
● Cilkchess tied for 3rd in the 1999 WCCC running on NASA’s 256-node SGI Origin 2000.

Page 33: Cilk-5

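Socrates Normalized Speedup

[Figure: plot of normalized speedup, (T_1/T_P)/(T_1/T_∞), against normalized machine size, P/(T_1/T_∞), with both axes marked at 0.01, 0.1, and 1. It shows the measured speedup together with the curves T_P = T_1/P + T_∞, T_P = T_1/P, and T_P = T_∞.]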

Developing Socrates

• For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois.
• The developers had easy access to a similar 32-processor CM5 at MIT.
• One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine.
• After a back-of-the-envelope calculation, the proposed “improvement” was rejected!

Page 35: Cilk-5

Socrates Speedup Paradox

T_P ≈ T_1/P + T_∞

                   Original program             Proposed program
T_32 (measured)    65 seconds                   40 seconds
T_1                2048 seconds                 1024 seconds
T_∞                1 second                     8 seconds
T_32 (model)       2048/32 + 1 = 65 seconds     1024/32 + 8 = 40 seconds
T_512 (model)      2048/512 + 1 = 5 seconds     1024/512 + 8 = 10 seconds
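The back-of-the-envelope calculation is easy to reproduce. The helper below is a minimal sketch of the model T_P ≈ T_1/P + T_∞ used on this slide; the function name predicted_time is hypothetical, and the inputs are the work and span values given for the two versions of the program.

```c
/* Predicted running time on `processors` processors under the model
   T_P ~ T_1/P + T_inf, with work = T_1 and span = T_inf in seconds. */
static double predicted_time(double work, double span, int processors) {
    return work / processors + span;
}
```

With the slide's values, predicted_time(2048, 1, 32) is 65 and predicted_time(1024, 8, 32) is 40, so the proposed program wins on 32 processors. On 512 processors the order flips: predicted_time(2048, 1, 512) is 5 while predicted_time(1024, 8, 512) is 10, because the larger span dominates once T_1/P becomes small.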

Page 36: Cilk-5

Cilk Performance

● Cilk’s “work-stealing” scheduler achieves
  ■ T_P = T_1/P + O(T_∞) expected time (provably);
  ■ T_P ≈ T_1/P + T_∞ time (empirically).
● Near-perfect linear speedup if P ≪ T_1/T_∞.
● The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
● Empirical results show an average speedup of 6.2 on an 8-processor machine.

Page 37: Cilk-5

Cilk quick history

• 1994 - Cilk 1
• 1998 - Cilk 5
• 2005 - JCilk
• 2006 - Cilk Arts
• 2008 - Cilk++
• 2009 - Intel buys Cilk Arts
• 2010 - Intel released a commercial implementation in its compilers

Page 38: Cilk-5

Questions?