Cilk-5

Based on Charles E. Leiserson (2006): http://supertech.csail.mit.edu/cilk/lecture-1.ppt

The Implementation of the Cilk-5 Multithreaded Language

By: Matteo Frigo, Charles E. Leiserson, and Keith H. Randall, MIT, 1998.

Presented by: Roni Licher, Technion, 2013.

Cilk

A C-based language for programming dynamic multithreaded applications on shared-memory multiprocessors.

Example applications:

● virus shell assembly
● graphics rendering
● n-body simulation
● heuristic search
● dense and sparse matrix computations
● friction-stir welding simulation
● artificial evolution

Shared-Memory Multiprocessor

In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!

[Figure: several processors (P), each with its own cache ($), connected by a network to shared memory and I/O.]

Cilk Is Simple

• Cilk extends the C language with just a handful of keywords.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.

Fibonacci

C elision:

    int fib (int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

Cilk code:

    cilk int fib (int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
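
One way to see the serial elision concretely is to erase the three keywords with the C preprocessor. The sketch below is an illustration only: the empty macros are an assumption made for this example, not a description of how the Cilk-5 toolchain itself produces the elision.

```c
/* Assumption for illustration: defining the Cilk keywords as empty
   macros turns the Cilk procedure into its C elision. */
#define cilk
#define spawn
#define sync

/* Same text as the Cilk version of fib; after preprocessing it is plain C. */
cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* becomes an ordinary call */
        y = spawn fib(n - 2);   /* becomes an ordinary call */
        sync;                   /* becomes an empty statement */
        return x + y;
    }
}
```

Compiled with an ordinary C compiler, a call such as fib(10) runs serially and returns the same value the parallel version would compute.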

Basic Cilk Keywords

cilk: Identifies a function as a Cilk procedure, capable of being spawned in parallel.

spawn: The named child Cilk procedure can execute in parallel with the parent caller.

sync: Control cannot pass this point until all spawned children have returned.

July 13, 2006 7


Dynamic Multithreading

The computation dag unfolds dynamically.

Example: fib(4)

“Processor oblivious”

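[Figure: computation dag for fib(4), with one procedure instance per call: fib(4), fib(3), fib(2), fib(2), fib(1), fib(1), fib(1), fib(0), fib(0).]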

Multithreaded Computation

• The dag $G = (V, E)$ represents a parallel instruction stream.
• Each vertex $v \in V$ represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge $e \in E$ is either a spawn edge, a return edge, or a continue edge.

[Figure: computation dag running from the initial thread to the final thread, with spawn edges, return edges, and continue edges labeled.]

Cactus Stack

[Figure: procedure A spawns B and C; C spawns D and E. Views of stack: A sees A; B sees A, B; C sees A, C; D sees A, C, D; E sees A, C, E.]

Cilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)

Cilk’s cactus stack supports several views in parallel.

Algorithmic Complexity Measures

$T_P$ = execution time on $P$ processors
$T_1$ = work
$T_\infty$ = span*

LOWER BOUNDS
• $T_P \ge T_1/P$
• $T_P \ge T_\infty$

* Also called critical-path length or computational depth.

Speedup

Definition: $T_1/T_P$ = speedup on $P$ processors.

If $T_1/T_P = \Theta(P) \le P$, we have linear speedup;
if $T_1/T_P = P$, we have perfect linear speedup;
if $T_1/T_P > P$, we have superlinear speedup, which is not possible in our model, because of the lower bound $T_P \ge T_1/P$.

Parallelism

Because we have the lower bound $T_P \ge T_\infty$, the maximum possible speedup given $T_1$ and $T_\infty$ is

$T_1/T_\infty$ = parallelism = the average amount of work per step along the span.

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: $T_1 = 17$
Span: $T_\infty = 8$
Parallelism: $T_1/T_\infty = 2.125$

[Figure: fib(4) dag with the threads on the critical path numbered 1 through 8.]

Using many more than 2 processors makes little sense.
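
The work and span figures for fib(4) can be checked with two short recurrences. This is a sketch under the unit-time assumption stated in the example, plus one further assumption made here: a leaf call (n < 2) is a single thread, and any other call consists of three threads (before the first spawn, between the two spawns, and after the sync). The function names fib_work and fib_span are hypothetical.

```c
/* Work: total number of unit-time threads executed by fib(n). */
long fib_work(int n) {
    if (n < 2) return 1;                 /* leaf: one thread */
    return 3 + fib_work(n - 1) + fib_work(n - 2);
}

/* Span: length of the longest path through the dag of fib(n). */
long fib_span(int n) {
    if (n < 2) return 1;
    /* first thread -> child fib(n-1) -> thread after the sync */
    long via_first = 1 + fib_span(n - 1) + 1;
    /* first two threads -> child fib(n-2) -> thread after the sync */
    long via_second = 2 + fib_span(n - 2) + 1;
    return via_first > via_second ? via_first : via_second;
}
```

Under these assumptions fib_work(4) is 17 and fib_span(4) is 8, so the parallelism is 17/8 = 2.125, in agreement with the values on the slide.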

Cilk’s Work-Stealing Scheduler

Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack.

When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

[Figure: animation of four processors (P), each with its own deque. Spawns push work onto the bottom of a deque, returns pop it, and an idle processor steals from the top of another processor's deque.]

The T.H.E. Protocol

• Deques are held in shared memory.
  – Workers operate at the tail, thieves at the head.
• We must prevent race conditions where a thief and a victim try to access the same procedure frame.
• Locking deques would be expensive for workers.
• The T.H.E. protocol removes the overhead of the common case, where there is no conflict.

The T.H.E. Protocol

• Assumes only that reads and writes are atomic.
• The head of the deque is H, the tail is T, and T ≥ H.
  – Only a thief can change H.
  – Only the worker can change T.
• To steal, a thief must acquire the lock L.
  – At most two processors operate on a deque at any time.
• Three cases of interaction:
  – Two or more items on the deque: each gets one.
  – One item on the deque: either the worker or the thief gets the frame, but not both.
  – No items on the deque: both the worker and the thief fail.

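T.H.E. Protocol: The Worker/Victim

    push() {
        T++;
    }

    pop() {
        T--;                      /* optimistically claim the tail frame */
        if (H > T) {              /* possible conflict with a thief */
            T++;
            lock(L);
            T--;                  /* retry under the lock */
            if (H > T) {          /* deque is empty */
                T++;
                unlock(L);
                return FAILURE;
            }
            unlock(L);
        }
        return SUCCESS;
    }

The thief:

    steal() {
        lock(L);
        H++;                      /* claim the head frame */
        if (H > T) {              /* nothing left to steal */
            H--;
            unlock(L);
            return FAILURE;
        }
        unlock(L);
        return SUCCESS;
    }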
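
The pseudocode above can be exercised as plain C. The following is a minimal sketch, not the Cilk-5 runtime: it tracks only the indices H and T (no frames are stored), uses C11 sequentially consistent atomics to stand in for the slide's assumption of atomic reads and writes, and uses an atomic_flag spinlock in place of the lock L. All names (the_deque, the_init, the_push, the_pop, the_steal) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_long H;   /* head index: changed only by thieves */
    atomic_long T;   /* tail index: changed only by the worker */
    atomic_flag L;   /* spinlock standing in for the lock L (assumption) */
} the_deque;

void the_init(the_deque *d) {
    atomic_init(&d->H, 0);
    atomic_init(&d->T, 0);
    atomic_flag_clear(&d->L);
}

void the_lock(the_deque *d)   { while (atomic_flag_test_and_set(&d->L)) { /* spin */ } }
void the_unlock(the_deque *d) { atomic_flag_clear(&d->L); }

/* Worker: make one more frame available at the tail. */
void the_push(the_deque *d) { atomic_fetch_add(&d->T, 1); }

/* Worker: take the tail frame; the lock is touched only on a possible conflict. */
bool the_pop(the_deque *d) {
    atomic_fetch_sub(&d->T, 1);
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_fetch_add(&d->T, 1);
        the_lock(d);
        atomic_fetch_sub(&d->T, 1);
        if (atomic_load(&d->H) > atomic_load(&d->T)) {
            atomic_fetch_add(&d->T, 1);   /* deque is empty: undo and fail */
            the_unlock(d);
            return false;
        }
        the_unlock(d);
    }
    return true;
}

/* Thief: take the head frame, always under the lock. */
bool the_steal(the_deque *d) {
    the_lock(d);
    atomic_fetch_add(&d->H, 1);
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_fetch_sub(&d->H, 1);       /* nothing to steal: undo and fail */
        the_unlock(d);
        return false;
    }
    the_unlock(d);
    return true;
}
```

With two frames pushed, one steal and one pop both succeed, and any further pop or steal fails, which matches the "two or more items" and "no items" cases listed earlier. The worker's common path (a successful pop with no thief nearby) never acquires the lock.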

Performance of Work-Stealing

Theorem: Cilk’s work-stealing scheduler achieves an expected running time of

$T_P \le T_1/P + O(T_\infty)$

on $P$ processors.

Pseudoproof. A processor is either working or stealing. The total time all processors spend working is $T_1$. The expected cost of all steals is $O(P T_\infty)$. Since there are $P$ processors, the expected time is $(T_1 + O(P T_\infty))/P = T_1/P + O(T_\infty)$. ■

The Work First Principle

Definitions:
$T_S$: the running time of the C elision
$c_1 = T_1/T_S$: work overhead
$c_\infty$: critical-path overhead

Assumption: $P \ll T_1/T_\infty$

Work-first justification: Since $P \ll T_1/T_\infty$ is equivalent to $T_\infty \ll T_1/P$,

(1) $T_P \le T_1/P + O(T_\infty)$
(2) $T_P \le T_1/P + c_\infty T_\infty$
(3) $T_P \le c_1 T_S/P + c_\infty T_\infty$

$T_P \approx c_1 T_S/P$

Minimize $c_1$, even at the expense of a larger $c_\infty$.

Cilk Chess Programs

● Socrates placed 3rd in the 1994 International Computer Chess Championship running on NCSA’s 512-node Connection Machine CM5.

● Socrates 2.0 took 2nd place in the 1995 World Computer Chess Championship running on Sandia National Labs’ 1824-node Intel Paragon.

● Cilkchess placed 1st in the 1996 Dutch Open running on a 12-processor Sun Enterprise 5000. It placed 2nd in 1997 and 1998 running on Boston University’s 64-processor SGI Origin 2000.

● Cilkchess tied for 3rd in the 1999 WCCC running on NASA’s 256-node SGI Origin 2000.

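Socrates Normalized Speedup

[Figure: log-log plot of measured speedup. Horizontal axis: $P/(T_1/T_\infty)$; vertical axis: $(T_1/T_P)/(T_1/T_\infty)$; both axes run from 0.01 to 1. Reference curves: $T_P = T_1/P$, $T_P = T_\infty$, and $T_P = T_1/P + T_\infty$.]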

Developing Socrates

• For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois.

• The developers had easy access to a similar 32-processor CM5 at MIT.

• One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine.

• After a back-of-the-envelope calculation, the proposed “improvement” was rejected!

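Socrates Speedup Paradox

$T_P \approx T_1/P + T_\infty$

Original program:
• Measured $T_{32} = 65$ seconds
• $T_1 = 2048$ seconds
• $T_\infty = 1$ second
• $T_{32} = 2048/32 + 1 = 65$ seconds
• $T_{512} = 2048/512 + 1 = 5$ seconds

Proposed program:
• Measured $T_{32} = 40$ seconds
• $T_1 = 1024$ seconds
• $T_\infty = 8$ seconds
• $T_{32} = 1024/32 + 8 = 40$ seconds
• $T_{512} = 1024/512 + 8 = 10$ seconds

The proposed program is faster on 32 processors but twice as slow on 512, which is why the change was rejected.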
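
The back-of-the-envelope calculation is easy to reproduce. The helper below is a hypothetical illustration that simply evaluates the empirical model, running time on P processors = work / P + span, for the work and span values quoted on the slide.

```c
/* Hypothetical helper: evaluates the model T_P ~= T_1 / P + T_inf.
   work is T_1 in seconds, span is T_inf in seconds. */
double predicted_time(double work, double span, int processors) {
    return work / processors + span;
}
```

For example, predicted_time(2048.0, 1.0, 32) gives 65 and predicted_time(1024.0, 8.0, 32) gives 40, while on 512 processors the same two programs come out at 5 and 10 seconds respectively.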

Cilk Performance

● Cilk’s “work-stealing” scheduler achieves
  ■ $T_P = T_1/P + O(T_\infty)$ expected time (provably);
  ■ $T_P \approx T_1/P + T_\infty$ time (empirically).
● Near-perfect linear speedup if $P \ll T_1/T_\infty$.
● The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.
● Empirical results show an average speedup of 6.2 on an 8-processor machine.

Cilk quick history

• 1994 - Cilk 1
• 1998 - Cilk 5
• 2005 - JCilk
• 2006 - Cilk Arts
• 2008 - Cilk++
• 2009 - Intel buys Cilk Arts
• 2010 - Intel releases a commercial implementation in its compilers

Questions?
