Cilk-5
Based on Charles E. Leiserson (2006): http://supertech.csail.mit.edu/cilk/lecture-1.ppt
The implementation of the Cilk-5 Multithreaded language
By: Matteo Frigo, Charles E. Leiserson, Keith H. Randall. MIT, 1998.
Presented by: Roni Licher Technion, 2013.
Cilk
A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.
In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!
[Figure: shared-memory multiprocessor. Processors (P), each with a cache ($), are connected by a network to memory and I/O.]
Cilk Is Simple
• Cilk extends the C language with just a handful of keywords.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.
Fibonacci

C elision:

    int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
      }
    }

Cilk code:

    cilk int fib (int n) {
      if (n<2) return (n);
      else {
        int x,y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return (x+y);
      }
    }
Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
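One way to see the serial elision concretely is to erase the three keywords with the preprocessor, so that a plain C compiler accepts the Cilk source unchanged. This is only an illustrative sketch of the idea, under the assumption that no included header uses the names `cilk`, `spawn`, or `sync`; it is not a description of how the Cilk compiler itself works.

```c
/* Sketch: define the Cilk keywords away so the file compiles as its C elision. */
#define cilk
#define spawn
#define sync

cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* becomes an ordinary call */
        y = spawn fib(n - 2);   /* becomes an ordinary call */
        sync;                   /* expands to an empty statement */
        return x + y;
    }
}
```

Compiled this way, `fib(10)` runs serially and returns 55, the same result the Cilk semantics require.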
Basic Cilk Keywords
• cilk: Identifies a function as a Cilk procedure, capable of being spawned in parallel.
• spawn: The named child Cilk procedure can execute in parallel with the parent caller.
• sync: Control cannot pass this point until all spawned children have returned.
Dynamic Multithreading
The computation dag unfolds dynamically.
“Processor oblivious”
Example: fib(4)
[Figure: computation dag for fib(4). The procedure instances are labeled by argument: 4 at the root; 3 and 2 below it; then 2, 1, 1, 0; then 1, 0.]
Multithreaded Computation
• The dag $G = (V, E)$ represents a parallel instruction stream.
• Each vertex $v \in V$ represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge $e \in E$ is either a spawn edge, a return edge, or a continue edge.
[Figure: computation dag from the initial thread to the final thread, with spawn edges, return edges, and continue edges labeled.]
Cactus Stack
[Figure: A spawns B and C; C spawns D and E. Views of stack: A sees A; B sees A, B; C sees A, C; D sees A, C, D; E sees A, C, E.]
Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)
Cilk’s cactus stack supports several views in parallel.
Algorithmic Complexity Measures
$T_P$ = execution time on $P$ processors
$T_1$ = work
$T_\infty$ = span*
LOWER BOUNDS
• $T_P \ge T_1/P$
• $T_P \ge T_\infty$
* Also called critical-path length or computational depth.
Speedup
Definition: $T_1/T_P$ = speedup on $P$ processors.
If $T_1/T_P = \Theta(P) \le P$, we have linear speedup;
if $T_1/T_P = P$, we have perfect linear speedup;
if $T_1/T_P > P$, we have superlinear speedup, which is not possible in our model, because of the lower bound $T_P \ge T_1/P$.
Parallelism
Because we have the lower bound $T_P \ge T_\infty$, the maximum possible speedup given $T_1$ and $T_\infty$ is
$T_1/T_\infty$ = parallelism
= the average amount of work per step along the span.
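The two lower bounds combine into a single cap on speedup: the speedup on P processors can be at most P and at most the parallelism. A minimal sketch of that arithmetic follows; the function names `tp_lower_bound` and `max_speedup` are made up for illustration.

```c
/* Lower bound on T_P implied by the two laws: T_P >= T_1/P and T_P >= T_inf. */
static double tp_lower_bound(double t1, double tinf, int p) {
    double work_bound = t1 / p;
    return work_bound > tinf ? work_bound : tinf;
}

/* Largest speedup T_1/T_P consistent with that bound, i.e. min(P, T_1/T_inf). */
static double max_speedup(double t1, double tinf, int p) {
    return t1 / tp_lower_bound(t1, tinf, p);
}
```

For a computation with work 17 and span 8 (the fib(4) figures used in this deck), `max_speedup(17, 8, 2)` is 2, and for any larger P the result stays at 17/8 = 2.125.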
Example: fib(4)
Assume for simplicity that each Cilk thread in fib() takes unit time to execute.
Work: $T_1 = 17$
Span: $T_\infty = 8$
Parallelism: $T_1/T_\infty = 17/8 = 2.125$
Using many more than 2 processors makes little sense.
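The work and span figures for fib(4) can be reproduced with two short recurrences. The sketch below assumes a particular thread count that the slides do not state explicitly: a leaf invocation (n < 2) is one thread, and a non-leaf invocation is three threads (the code up to the first spawn, the code up to the second spawn, and the code after the sync). The names `fib_work` and `fib_span` are hypothetical.

```c
/* Work: total number of unit-time threads in the dag for fib(n).
 * Assumption: 1 thread per leaf, 3 threads per non-leaf invocation. */
static int fib_work(int n) {
    if (n < 2) return 1;
    return 3 + fib_work(n - 1) + fib_work(n - 2);
}

/* Span: number of threads on the longest path through the dag. */
static int fib_span(int n) {
    if (n < 2) return 1;
    /* first thread -> child fib(n-1) -> thread after the sync */
    int via_first = 1 + fib_span(n - 1) + 1;
    /* first thread -> second thread -> child fib(n-2) -> thread after the sync */
    int via_second = 2 + fib_span(n - 2) + 1;
    return via_first > via_second ? via_first : via_second;
}
```

Under these assumptions `fib_work(4)` is 17 and `fib_span(4)` is 8, matching the work and span given for the example.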
Cilk’s Work-Stealing Scheduler
Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack.
When a processor runs out of work, it steals a thread from the top of a random victim’s deque.
[Figure: animation of four processors (P) and their deques, stepping through Spawn!, Return!, and Steal! events.]
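The deque discipline described above can be sketched as a small data structure: the owner pushes and pops at the bottom, while a thief removes from the top. This is a single-threaded illustration only, with no synchronization and no random victim selection; the type `work_deque`, the capacity `DEQUE_CAP`, and the function names are all assumptions made for the example.

```c
#include <stdbool.h>

#define DEQUE_CAP 64   /* hypothetical fixed capacity, for illustration only */

typedef struct {
    int items[DEQUE_CAP];  /* each int stands in for a ready thread */
    int top;               /* index of the oldest item: thieves take from here */
    int bottom;            /* one past the newest item: the owner works here */
} work_deque;

static void deque_init(work_deque *d) {
    d->top = 0;
    d->bottom = 0;
}

/* Owner: a spawn pushes a ready thread onto the bottom. */
static bool push_bottom(work_deque *d, int thread_id) {
    if (d->bottom == DEQUE_CAP) return false;   /* full */
    d->items[d->bottom++] = thread_id;
    return true;
}

/* Owner: a return pops the most recently pushed thread (stack order). */
static bool pop_bottom(work_deque *d, int *thread_id) {
    if (d->bottom == d->top) return false;      /* empty */
    *thread_id = d->items[--d->bottom];
    return true;
}

/* Thief: a steal removes the oldest thread from the top. */
static bool steal_top(work_deque *d, int *thread_id) {
    if (d->bottom == d->top) return false;      /* nothing to steal */
    *thread_id = d->items[d->top++];
    return true;
}
```

After pushing threads 1, 2, 3, the owner's `pop_bottom` yields 3 while a thief's `steal_top` yields 1, so the two ends of the deque rarely touch the same item.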
The T.H.E. Protocol
• Deques are held in shared memory.
– Workers operate at the end, thieves at the front.
• We must prevent race conditions where a thief and victim try to access the same procedure frame.
• Locking deques would be expensive for workers.
• The T.H.E. protocol removes overhead in the common case, where there is no conflict.
The T.H.E. Protocol
• Assumes only reads and writes are atomic.
• Head of the deque is H, tail is T, and T ≥ H.
– Only a thief can change H.
– Only the worker can change T.
• To steal, a thief must acquire the lock L.
– At most two processors operate on a deque at a time.
• Three cases of interaction:
– Two or more items on deque: each gets one.
– One item on deque: either worker or thief gets the frame, but not both.
– No items on deque: both worker and thief fail.
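One way to realize these three cases is sketched below using C11 atomics. H, T, and L are the names from the slide; everything else (the `the_deque` type, the function names, the spinlock built on `atomic_flag`, and the exact retry-under-lock step the worker takes on a conflict) is an assumption made for illustration, not something these slides spell out. The deque here tracks only indices, with items occupying positions H up to T-1, and stores no frames.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int H;    /* head: changed only by thieves */
    atomic_int T;    /* tail: changed only by the owning worker */
    atomic_flag L;   /* lock: taken by thieves, and by the worker only on conflict */
} the_deque;

static void the_init(the_deque *d) {
    atomic_init(&d->H, 0);
    atomic_init(&d->T, 0);
    atomic_flag_clear(&d->L);
}

static void the_lock(the_deque *d)   { while (atomic_flag_test_and_set(&d->L)) { } }
static void the_unlock(the_deque *d) { atomic_flag_clear(&d->L); }

/* Worker: push a frame by advancing T. */
static void the_push(the_deque *d) {
    atomic_store(&d->T, atomic_load(&d->T) + 1);
}

/* Worker: pop without locking unless a conflict is detected. */
static bool the_pop(the_deque *d) {
    atomic_store(&d->T, atomic_load(&d->T) - 1);          /* optimistic claim */
    if (atomic_load(&d->H) > atomic_load(&d->T)) {        /* possible conflict */
        atomic_store(&d->T, atomic_load(&d->T) + 1);      /* back off */
        the_lock(d);
        atomic_store(&d->T, atomic_load(&d->T) - 1);      /* retry under the lock */
        if (atomic_load(&d->H) > atomic_load(&d->T)) {
            atomic_store(&d->T, atomic_load(&d->T) + 1);  /* deque is empty */
            the_unlock(d);
            return false;
        }
        the_unlock(d);
    }
    return true;
}

/* Thief: always takes the lock, then advances H. */
static bool the_steal(the_deque *d) {
    the_lock(d);
    atomic_store(&d->H, atomic_load(&d->H) + 1);          /* optimistic claim */
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_store(&d->H, atomic_load(&d->H) - 1);      /* nothing to steal */
        the_unlock(d);
        return false;
    }
    the_unlock(d);
    return true;
}
```

In the common case the worker's `the_pop` touches only T and reads H, so it pays for the lock only when a thief is contending for the last frame.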
Performance of Work-Stealing
Theorem: Cilk’s work-stealing scheduler achieves an expected running time of
$T_P \le T_1/P + O(T_\infty)$
on $P$ processors.
Pseudoproof. A processor is either working or stealing. The total time all processors spend working is $T_1$. The expected cost of all steals is $O(P T_\infty)$. Since there are $P$ processors, the expected time is
$(T_1 + O(P T_\infty))/P = T_1/P + O(T_\infty)$. ■
The Work First Principle
Definitions:
• $T_S$: the running time of the C elision
• $c_1 = T_1/T_S$: work overhead
• $c_\infty$: critical-path overhead
Assumption: $P \ll T_1/T_\infty$
Work-first justification: Since $P \ll T_1/T_\infty$ is equivalent to $T_\infty \ll T_1/P$,
(1) $T_P \le T_1/P + O(T_\infty)$
(2) $T_P \le T_1/P + c_\infty T_\infty$
(3) $T_P \le c_1 T_S/P + c_\infty T_\infty$
$T_P \approx c_1 T_S/P$
Minimize $c_1$, even at the expense of a larger $c_\infty$.
Cilk Chess Programs
● Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA’s 512-node Connection Machine CM5.
● Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs’ 1824-node Intel Paragon.
● Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University’s 64-processor SGI Origin 2000.
● Cilkchess tied for 3rd in the 1999 WCCC running on NASA’s 256-node SGI Origin 2000.
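Socrates Normalized Speedup
[Figure: log-log plot of normalized speedup $(T_1/T_P)/(T_1/T_\infty)$ against normalized machine size $P/(T_1/T_\infty)$, both axes running from 0.01 to 1. Measured speedup is plotted against the curves $T_P = T_1/P + T_\infty$, $T_P = T_\infty$, and $T_P = T_1/P$.]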
Developing Socrates
• For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois.
• The developers had easy access to a similar 32-processor CM5 at MIT.
• One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine.
• After a back-of-the-envelope calculation, the proposed “improvement” was rejected!
Socrates Speedup Paradox
$T_P \approx T_1/P + T_\infty$
Original program: $T_{32} = 2048/32 + 1 = 65$ seconds
Proposed program: $T'_{32} = 1024/32 + 8 = 40$ seconds
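The back-of-the-envelope calculation can be replayed with the performance model above. In this sketch the work and span values are read off the slide's own arithmetic (2048 and 1 seconds for the original program, 1024 and 8 seconds for the proposed one); the function name `model_time` is hypothetical, and evaluating the model at 512 processors is an extrapolation to the competition machine's size.

```c
/* Performance model from the slide: T_P ~= T_1/P + T_inf, times in seconds. */
static double model_time(double t1, double tinf, int p) {
    return t1 / p + tinf;
}
```

Under this model, `model_time(2048, 1, 32)` is 65 and `model_time(1024, 8, 32)` is 40, so the proposed program wins on 32 processors. At 512 processors the values are 5 and 10 seconds respectively, so the proposed program would be expected to run about twice as slowly on the competition machine.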
Original program Proposed programT32 = 65 seconds T