Toward Extreme-Scale Manycore Architectures Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu USC October 2016
Page 1: Toward Extreme-Scale Manycore Architectures

Toward Extreme-Scale Manycore Architectures

Josep Torrellas Department of Computer Science

University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

USC

October 2016

Page 2: Toward Extreme-Scale Manycore Architectures

Accelerated Progress in Transistor Integration

•  Large multicores for data centers and cloud

Intel Xeon Phi 7290F (Oct 2016): 72 cores, 288 contexts, 260 W

Intel 3D XPoint memory

•  3D stacked chips

Micron’s Hybrid Memory Cube

Page 3: Toward Extreme-Scale Manycore Architectures

Research is Pushing Ever Farther Ahead

[Figure: cross-section of a 3D-stacked processor and DRAM package. Labels: heat sink, integrated heat spreader (IHS), thermal interface material (TIM), motherboard, processor silicon, processor frontside metal (Cu), DRAM frontside metal (Al), DRAM silicon, die-to-die (D2D) layer, through-silicon vias (TSVs), C4 pads]

•  Research on stacking multiple processor and memory dies

Runnemede prototype [HPCA-13]

•  More integration → 1,000 cores/chip

Page 4: Toward Extreme-Scale Manycore Architectures

Meanwhile: Power Wall… and Performance Wall

•  University of Illinois Blue Waters Supercomputer
–  Performance: 11 PF
–  Power: 6-11 MW (idle to loaded)
–  1 MW = $1M per year in electricity

•  Technology improvements in speed and power are slowing down

→  Computer architecture innovations become strategic
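As a rough worked figure, taking the slide's rule of thumb (1 MW costs about $1M per year) at face value, the quoted power range translates into the following annual electricity cost. The symbol P for the average power draw is introduced here only for illustration.

```latex
\[
\text{annual cost} \approx P \times \frac{\$1\text{M}}{\text{MW}\cdot\text{year}},
\qquad
P = 6\ \text{MW} \;\Rightarrow\; \$6\text{M/year},
\qquad
P = 11\ \text{MW} \;\Rightarrow\; \$11\text{M/year}
\]
```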

Page 5: Toward Extreme-Scale Manycore Architectures

What We Need

•  Very high energy efficiency

•  Faster communication and synchronization

•  Ease of programming

All at the same time

Page 6: Toward Extreme-Scale Manycore Architectures

Today’s Discussion

•  Focus: Reducing the cost of basic primitives for parallelism
•  Flavor of other challenges: energy, programmability

Page 7: Toward Extreme-Scale Manycore Architectures

Quest

Making synchronization inexpensive

Page 8: Toward Extreme-Scale Manycore Architectures

Making Synchronization Inexpensive

•  Make memory fences free
[Figure: three processors, each executing a write, a fence, and a read: (wr x; fence; rd z), (wr y; fence; rd x), (wr z; fence; rd y)]

•  Breaking serialization in lock-free synchronization
[Figure: several processors issuing Compare&Swap (CAS) on the same variable x]

•  Scalable concurrent priority queues
[Figure: QueueHead pointing to a linked list of nodes]

[ISCA-13][ASPLOS-15][ASPLOS-16]

Page 9: Toward Extreme-Scale Manycore Architectures

Making Synchronization Inexpensive

•  Make memory fences free (WeeFence) [ISCA-13]
[Figure: three processors, each executing a write, a fence, and a read: (wr x; fence; rd z), (wr y; fence; rd x), (wr z; fence; rd y)]

•  Breaking serialization in lock-free synchronization

•  Scalable concurrent priority queues

Page 10: Toward Extreme-Scale Manycore Architectures

Fence: a Primitive for Parallelism

•  Instruction inserted by programmers or compilers
•  Prevents the compiler and HW from reordering memory accesses

[Figure: timeline of Write y; Fence; Read x; Read z]

•  Accesses after the fence cannot be observed by another processor until the accesses before it are finished:
–  reads retired
–  writes retired + drained from write buffer

Page 11: Toward Extreme-Scale Manycore Architectures

Use of Fences (I)

•  Programmers insert fences in codes with fine-grain sharing:
–  Work-stealing algorithm in Cilk

take():   Tail = …;  fence;  … = Head     (worker dequeues from tail and checks head)
steal():  Head = …;  fence;  … = Tail     (thief takes from head and checks tail)

Enforce the correct order between accesses
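The take()/steal() pattern above can be sketched in C++ roughly as follows. This is a simplified illustration in the style of Cilk's THE protocol, not the Cilk runtime itself: the class name WorkDeque, the fixed capacity, the missing overflow check, and the use of a mutex to resolve conflicts are all assumptions made for the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical, simplified work-stealing deque (fixed capacity, no overflow check).
class WorkDeque {
public:
    // Owner only: append a task at the tail.
    void push(int task) {
        long t = tail_.load(std::memory_order_relaxed);
        tasks_[t % kCapacity].store(task, std::memory_order_relaxed);
        tail_.store(t + 1, std::memory_order_release);
    }

    // Owner only: Tail = ...; fence; ... = Head.
    bool take(int& out) {
        long t = tail_.load(std::memory_order_relaxed) - 1;
        tail_.store(t, std::memory_order_release);               // Tail = ...
        std::atomic_thread_fence(std::memory_order_seq_cst);     // fence
        if (head_.load(std::memory_order_relaxed) > t) {         // ... = Head
            // Possible conflict with a thief: undo, then retry under the lock.
            tail_.store(t + 1, std::memory_order_release);
            std::lock_guard<std::mutex> guard(lock_);
            t = tail_.load(std::memory_order_relaxed) - 1;
            tail_.store(t, std::memory_order_release);
            if (head_.load(std::memory_order_relaxed) > t) {
                tail_.store(t + 1, std::memory_order_release);   // deque is empty
                return false;
            }
        }
        out = tasks_[t % kCapacity].load(std::memory_order_relaxed);
        return true;
    }

    // Any thief: Head = ...; fence; ... = Tail.
    bool steal(int& out) {
        std::lock_guard<std::mutex> guard(lock_);
        long h = head_.load(std::memory_order_relaxed);
        head_.store(h + 1, std::memory_order_relaxed);           // Head = ...
        std::atomic_thread_fence(std::memory_order_seq_cst);     // fence
        if (h + 1 > tail_.load(std::memory_order_acquire)) {     // ... = Tail
            head_.store(h, std::memory_order_relaxed);           // nothing to steal
            return false;
        }
        out = tasks_[h % kCapacity].load(std::memory_order_relaxed);
        return true;
    }

private:
    static constexpr long kCapacity = 1024;
    std::atomic<int> tasks_[kCapacity]{};
    std::atomic<long> head_{0};
    std::atomic<long> tail_{0};
    std::mutex lock_;
};
```

The two sequentially consistent fences are the ones the slide refers to: without them, the worker's read of Head and the thief's read of Tail could both bypass the preceding write, and both could claim the last task.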

Page 12: Toward Extreme-Scale Manycore Architectures

Use of Fences (II)

•  Compilers insert fences in C++:
–  Programmer uses intentional data race for performance → declares variable as atomic
–  Compiler inserts fence after the access, does not reorder
–  Hardware does not reorder across fence
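A minimal C++ sketch of this usage, assuming a simple flag handoff between two threads. The names payload, ready, and handoff are hypothetical. The default std::atomic operations are sequentially consistent; on x86, compilers typically implement such a store with an xchg instruction or a mov followed by mfence.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain data, published through the flag
std::atomic<bool> ready{false};  // the intentionally shared variable, declared atomic

// Runs one producer and one consumer; returns the value the consumer saw.
int handoff() {
    int seen = -1;
    std::thread producer([] {
        payload = 42;
        ready.store(true);       // seq_cst store: not reordered before the payload write
    });
    std::thread consumer([&seen] {
        while (!ready.load()) {  // seq_cst load
            std::this_thread::yield();
        }
        seen = payload;          // observes 42 once the flag is seen
    });
    producer.join();
    consumer.join();
    return seen;
}
```

If ready were a plain bool, the program would have a data race and neither the compiler nor the hardware would be obliged to keep the two writes in order.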

Page 13: Toward Extreme-Scale Manycore Architectures

If We Remove Fences: Incorrect Execution

Initially x = y = 0

PA                      PB
A0: x = 1               B0: y = 1
    fence                   fence
A1: t0 = y              B1: t1 = x

•  With fences: t1 = 1, t0 = 1, or both = 1
•  Without fences: t0 = t1 = 0 is possible (order A1 B0 B1 A0: in PA, rd y executes before wr x is propagated to memory)
•  Unintuitive bug: Sequential Consistency (SC) violation

SC: execution appears as if accesses from multiple threads were interleaved in a uniprocessor, e.g., A0 A1 B0 B1, B0 B1 A0 A1, or A0 B0 A1 B1
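The two-processor example above can be written as a small C++ test. This is a sketch, with the function name both_zero_observed chosen only for illustration. With the two sequentially consistent fences in place, the C++ memory model rules out t0 = t1 = 0; removing them makes that outcome permitted (whether it shows up in practice depends on the machine and on timing).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the PA/PB example `iterations` times and returns true if the
// outcome t0 == t1 == 0 is ever observed.
bool both_zero_observed(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int t0 = -1, t1 = -1;
        std::thread pa([&] {
            x.store(1, std::memory_order_relaxed);                // A0: x = 1
            std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
            t0 = y.load(std::memory_order_relaxed);               // A1: t0 = y
        });
        std::thread pb([&] {
            y.store(1, std::memory_order_relaxed);                // B0: y = 1
            std::atomic_thread_fence(std::memory_order_seq_cst);  // fence
            t1 = x.load(std::memory_order_relaxed);               // B1: t1 = x
        });
        pa.join();
        pb.join();
        if (t0 == 0 && t1 == 0) return true;
    }
    return false;
}
```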

Page 14: Toward Extreme-Scale Manycore Architectures

Fence Overhead

•  Naïve implementation: stall all memory operations following the fence
–  The processor quickly stalls

Page 15: Toward Extreme-Scale Manycore Architectures

Modern Implementations: Perform Speculation

[Figure: timeline of Write, Fence, Read, showing w1 in the Write Buffer (WB) and w2, f, r in the Reorder Buffer (ROB)]

•  Reads following fences can load data speculatively
–  If no processor observes it, no problem
–  If coherence transaction received, rd is squashed and retried

•  Still: speculative reads cannot retire until the WB is drained

Expensive: a fence in a Xeon desktop stalls for 20 to 200 cycles. In a large MP?

Page 16: Toward Extreme-Scale Manycore Architectures

What if Fences Were Free?

•  Programmers could write faster fine-grained concurrent algorithms

•  C++/Java programmers would not have to worry about data races
–  Declare all shared variables as atomic
–  Compiler puts many fences, hardware still runs fast
–  Guaranteed Sequential Consistency (SC)

Page 17: Toward Extreme-Scale Manycore Architectures

Proposal: WeeFence (or WFence)

•  Goal: Eliminate any stall in the pipeline [ISCA-13]

•  Post-fence read retires before the pre-fence writes have drained
–  “Skip” the fence

[Figure: timeline of Write, Fence, Read, comparing speculative execution with WeeFence. Writes w1 and w2 sit in the WB while f and r are in the Reorder Buffer (ROB)]

Substantial gains when write misses pile up before the fence

Page 18: Toward Extreme-Scale Manycore Architectures

But… Not Stalling Can Cause Incorrect Execution

Initially x = y = 0

PA                      PB
A0: x = 1               B0: y = 1
    WeeFence                WeeFence
A1: t0 = y              B1: t1 = x

[Figure: timeline of Write, Fence, Read; the write is propagated to memory after the read]

•  What we want: not stall but avoid these SC violations
•  Solution: Allow the reorder, check for this case, and stall the read (B1)

Page 19: Toward Extreme-Scale Manycore Architectures

But… Not Stalling Can Cause Incorrect Execution

[Figure: same PA/PB example and timeline as on the previous slide]

•  What we want: not stall but avoid these SC violations
•  Solution: Allow the reorder, check for this case, and stall the read (B1)
•  Conventional fences always conservatively stall ↔ Not WeeFence

Page 20: Toward Extreme-Scale Manycore Architectures

WeeFence: The Idea

•  At a fence: record the thread’s incomplete writes in a HW structure

•  Allow post-fence reads to execute before pre-fence writes complete
•  Check post-fence reads (rd x) against HW structure to find conflicts with other threads’ incomplete writes.
–  Conflict? Stall read
–  Else: Retire

PA                      PB
wr x                    wr y
WeeFence                WeeFence
rd y                    rd x

Prevent “rd x” retiring early if:
-  There is a concurrent fence
-  Accesses vars in opposite order

Page 21: Toward Extreme-Scale Manycore Architectures

How WFence Works

PA: wr x; Wfence1; rd y
PB: wr y; Wfence2; rd x

[Figure: steps (1) to (3): PA’s PS {x} is recorded in the Table, then rd y executes]

PS: Pending Set
BS: Bypass Set

Page 22: Toward Extreme-Scale Manycore Architectures

How WFence Works

PA: wr x; Wfence1; rd y
PB: wr y; Wfence2; rd x

[Figure: steps (1) to (6): PA’s PS {x} is recorded in the Table and rd y executes; (4) Wfence2 records PB’s PS {y}; (5) rd x goes through a local check; (6) rd x stalls]

PS: Pending Set
BS: Bypass Set

Page 23: Toward Extreme-Scale Manycore Architectures

How WFence Works (II)

PA: wr x; Wfence1; rd y
PB: wr y; wr x (no fence present in x86)

[Figure: steps (1) to (4): PA’s PS {x} is recorded in the Table, rd y executes, and (4) y enters PA’s BS]

PS: Pending Set
BS: Bypass Set

Page 24: Toward Extreme-Scale Manycore Architectures

How WFence Works (II)

PA: wr x; Wfence1; rd y
PB: wr y; wr x (no fence present in x86)

[Figure: steps (1) to (5): PA’s PS {x} is recorded in the Table, rd y executes, (4) y enters PA’s BS, and (5) the coherence transaction stalls]

PS: Pending Set
BS: Bypass Set

Page 25: Toward Extreme-Scale Manycore Architectures

Summary: How WFence Works

PA: wr x; Wfence1; rd y

[Figure: annotated steps: (1) PS {x} goes to the Table; (2) check; (4) y enters the BS; (5) execute & retire; (6) stall. Step (3) carries no label; a variable z also appears in the figure]

PS: Pending Set
BS: Bypass Set
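The Pending Set / Bypass Set checks described on these slides can be illustrated with a toy software model. This is only a sketch of the checks as the slides describe them, not the hardware design from [ISCA-13]; the class WFenceModel, its method names, and the use of ints for processors and chars for addresses are assumptions made here.

```cpp
#include <cassert>
#include <map>
#include <set>

// Toy model: processors are ints, addresses are chars (all names hypothetical).
class WFenceModel {
public:
    // At a fence: record the processor's incomplete pre-fence writes as its PS.
    void fence(int proc, const std::set<char>& incomplete_writes) {
        pending_[proc] = incomplete_writes;
    }

    // Post-fence read: stalls (returns false) if the address is in another
    // processor's PS; otherwise it retires and the address joins this BS.
    bool try_retire_read(int proc, char addr) {
        for (const auto& entry : pending_) {
            if (entry.first != proc && entry.second.count(addr)) return false;
        }
        bypass_[proc].insert(addr);
        return true;
    }

    // Write by `writer`: stalls (returns false) while the address is in the
    // BS of another processor whose fence is still active.
    bool try_complete_write(int writer, char addr) const {
        for (const auto& entry : bypass_) {
            if (entry.first != writer && entry.second.count(addr)) return false;
        }
        return true;
    }

    // All pre-fence writes of `proc` have completed: clear its PS and BS.
    void fence_done(int proc) {
        pending_.erase(proc);
        bypass_.erase(proc);
    }

private:
    std::map<int, std::set<char>> pending_;  // table of Pending Sets
    std::map<int, std::set<char>> bypass_;   // per-processor Bypass Sets
};
```

In the two-fence example, PA's rd y retires early and y enters PA's BS, while PB's rd x stalls against PA's PS {x}; once PA's fence completes, PB can proceed.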

Page 26: Toward Extreme-Scale Manycore Architectures

WFence

PA: wr x; Wfence; rd z
PB: wr y; Wfence; rd x
PC: wr z; Wfence; rd y

•  Cycles are rare: WFence typically executes without stalling the processor
•  Works with cycles with any number of processors

•  No compiler support needed: unmodified off-the-shelf executable

Page 27: Toward Extreme-Scale Manycore Architectures

Improving WeeFence

•  The Global State is expensive to maintain with many threads

•  Can we eliminate the Global State (Pending Set)?

Page 28: Toward Extreme-Scale Manycore Architectures

Eliminating the Global State [ASPLOS-15]

PA: wr x; Wfence1; rd y
PB: wr y; Wfence2; rd x

[Figure: steps (1) to (4): (1) y and (2) x are recorded; (3) and (4) both processors stall]

Deadlock… Insight: no deadlock if one processor stalls at the fence and generates no BS

Page 29: Toward Extreme-Scale Manycore Architectures

Asymmetric Fences: Strong Fence + Weak Fences [ASPLOS-15]

PA: wr x; fence; rd z
PB: wr y; fence; rd x
PC: wr z; fence; rd y

[Figure: one processor uses a conventional fence; the others use WeeFence without PS]

•  Given a conflict cycle with N processors:
–  1 strong fence (no BS) = conventional fence
–  N-1 weak fences that allow reordering = WeeFences without PS

Page 30: Toward Extreme-Scale Manycore Architectures

Where to Put Strong Fence?

PA:  tmp->field = 10;  fence1;  obj = tmp;
PB:  if (obj) { fence2;  a = obj->field; }

•  Put strong fence in fence1, why?
–  It only executes once, at initialization

•  Work-stealing algorithm in Cilk:
–  Weak fences → workers
–  Strong fences → thieves

•  Software transactional memory:
–  Weak fences → reads
–  Strong fences → writes
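The PA/PB initialization example on this slide can be sketched in standard C++ as follows. Standard C++ has no notion of strong versus weak (asymmetric) fences, so both fence1 and fence2 are modeled here as conventional sequentially consistent fences; the names Obj, publish, and read_field are hypothetical.

```cpp
#include <atomic>
#include <cassert>

struct Obj { int field; };
std::atomic<Obj*> obj{nullptr};

// PA: initialize the object, fence1, then publish the pointer.
void publish(Obj* tmp) {
    tmp->field = 10;
    std::atomic_thread_fence(std::memory_order_seq_cst);  // fence1 (runs once, at initialization)
    obj.store(tmp, std::memory_order_relaxed);
}

// PB: returns the field if the object has been published, or -1 otherwise.
int read_field() {
    Obj* p = obj.load(std::memory_order_relaxed);
    if (p) {
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fence2 (runs on every access)
        return p->field;
    }
    return -1;
}
```

Because fence2 sits on the frequently executed path and fence1 runs once, making fence1 the strong (stalling) fence keeps the common case cheap, which is the slide's argument.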

Page 31: Toward Extreme-Scale Manycore Architectures

Results

•  Kernels with fences: WFence eliminates >90% of the fence stall time

•  Full apps: WFence reduces the overhead of fences-everywhere (hence guaranteeing SC) from 40% to 2%

[Figure: Baseline vs. WFence]

Page 32: Toward Extreme-Scale Manycore Architectures

Making Synchronization Inexpensive

•  Make memory fences free

•  Breaking serialization in lock-free synchronization (CASPAR) [ASPLOS-16]
[Figure: several processors issuing Compare&Swap (CAS) on the same variable x]

•  Scalable concurrent priority queues

Page 33: Toward Extreme-Scale Manycore Architectures

Bottleneck: Many Processors Synch on Same Var [ASPLOS-16]

•  Operating systems, databases, language runtimes, mem allocators
•  Lock-free synchronization: Manipulates data using atomic instructions instead of locks

Compare&Swap(addr, old, new):
    if (mem[addr] == old) { mem[addr] = new }
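In C++, the Compare&Swap semantics above map onto std::atomic's compare_exchange_strong. A minimal sketch, with the wrapper name compare_and_swap chosen only for illustration:

```cpp
#include <atomic>
#include <cassert>

// Atomically: if (mem == old_value) { mem = new_value; return true; } else return false.
bool compare_and_swap(std::atomic<int>& mem, int old_value, int new_value) {
    // On failure, compare_exchange_strong writes the current value into its
    // first argument; old_value is a local copy here, so that is discarded.
    return mem.compare_exchange_strong(old_value, new_value);
}
```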

Page 34: Toward Extreme-Scale Manycore Architectures

Simple Example Lock-Free Synchronization

[Figure: several processors issuing CAS on the same variable x]

Compare&Swap(addr, old, new):
    if (mem[addr] == old) { mem[addr] = new }

Everyone adds 1:

while (true) {
    old = x
    new = old + 1
    if (CAS(&x, old, new)) return
}
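The "everyone adds 1" loop can be written as runnable C++ along these lines. This is a sketch; add_one and count_with_threads are hypothetical names, and the thread counts in the usage are arbitrary.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Lock-free increment: retry until the CAS succeeds.
void add_one(std::atomic<int>& x) {
    while (true) {
        int old_value = x.load();                 // old = x
        int new_value = old_value + 1;            // new = old + 1
        if (x.compare_exchange_strong(old_value, new_value)) return;
    }
}

// Starts `threads` threads that each call add_one `adds_per_thread` times.
int count_with_threads(int threads, int adds_per_thread) {
    std::atomic<int> x{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&x, adds_per_thread] {
            for (int i = 0; i < adds_per_thread; ++i) add_one(x);
        });
    }
    for (auto& th : pool) th.join();
    return x.load();
}
```

No increment is lost, but under contention only one CAS per round succeeds and the others retry, which is the serialization the following slides address.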

Page 35: Toward Extreme-Scale Manycore Architectures

Example: Pushing Nodes into Stack

new = malloc();
while (true) {
    old = top
    new->next = old
    if (CAS(&top, old, new)) return
}

[Figure: top points to a node; processors PA and PB]
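The push loop above corresponds to the following C++ sketch. The Node layout and the helper depth are assumptions added for illustration, and nodes are never freed here.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> top{nullptr};

// Lock-free push: link the new node to the current top, then CAS top to it.
void push(int value) {
    Node* fresh = new Node{value, nullptr};                   // new = malloc()
    while (true) {
        Node* old_top = top.load();                           // old = top
        fresh->next = old_top;                                // new->next = old
        if (top.compare_exchange_strong(old_top, fresh)) return;  // CAS(&top, old, new)
    }
}

// Counts the nodes currently reachable from top (for illustration only).
int depth() {
    int n = 0;
    for (Node* p = top.load(); p != nullptr; p = p->next) ++n;
    return n;
}
```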

Page 36: Toward Extreme-Scale Manycore Architectures

Example: Pushing Nodes into Stack

[Figure: same code as on the previous slide; PA and PB have each allocated a new node]

Page 37: Toward Extreme-Scale Manycore Architectures

Example: Pushing Nodes into Stack

[Figure: same code; PA and PB each read top: oldA = top, oldB = top]

Page 38: Toward Extreme-Scale Manycore Architectures

Example: Pushing Nodes into Stack

[Figure: same code; after oldA = top and oldB = top, top is updated to point to a new node]

Page 39: Toward Extreme-Scale Manycore Architectures

Example: Pushing Nodes into Stack

[Figure: same code; one processor's CAS fails because top no longer equals its old value]

CAS failed

Page 40: Toward Extreme-Scale Manycore Architectures

Problem: Serialization

[Figure: timeline for PA, PB, PC with nodes newA, newB, newC. Each processor repeats “ld old; CAS”; the failed attempts are marked as waste]

Our Goal: All processors perform a successful CAS at the same time, in parallel

Page 41: Toward Extreme-Scale Manycore Architectures

CASPAR: Main Idea

Two steps:
•  Queue the “ld old” requests in HW in the directory
–  Provides efficient serialization: only one proc attempts the CAS at a time (others remain idle)
–  Similar to past work

•  Break serialization: Two new ideas:
–  Eager forwarding
–  Parallel validation

Page 42: Toward Extreme-Scale Manycore Architectures

Queue “ld old” Requests in HW in Directory

[Figure: processors PA, PB, PC, PD run the stack-push code; the directory queues their requests ldPA, ldPB, ldPC, ldPD for the line]

Similar to past work…

Page 43: Toward Extreme-Scale Manycore Architectures

Basic Hardware: Queue of Requests in Directory

[Figure: processors PA, PB, PC, PD; the directory queue holds ldPA, ldPB, ldPC, ldPD; cache line]

Page 44: Toward Extreme-Scale Manycore Architectures

Basic Hardware: Queue of Requests in Directory

[Figure: a CAS is performed; the directory queue holds ldPB, ldPC, ldPD]

Page 45: Toward Extreme-Scale Manycore Architectures

Basic Hardware: Queue of Requests in Directory

[Figure: the directory queue holds ldPB, ldPC, ldPD]

Page 46: Toward Extreme-Scale Manycore Architectures

Basic Hardware: Queue of Requests in Directory

[Figure: a CAS is performed; the directory queue holds ldPC, ldPD]

Page 47: Toward Extreme-Scale Manycore Architectures

Basic Hardware: Queue of Requests in Directory

[Figure: the directory queue holds ldPC, ldPD]

Page 48: Toward Extreme-Scale Manycore Architectures

Basic Hardware: Queue of Requests in Directory

[Figure: a CAS is performed; the directory queue holds ldPD]

Completely serial execution

Page 49: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

Observation: In a proc,
•  “new” does not depend on “old”
•  “new” is ready well before CAS

[Figure: timeline for PA, PB, PC with nodes newA, newB, newC. Each processor repeats “ld old; CAS”; the failed attempts are marked as waste]

Page 50: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

Predecessors: Eagerly forward “new”
Successors:
*  Use “new” to satisfy “ld old”
*  Perform a successful CAS
*  Continue speculatively like TM, no stall

[Figure: same timeline as on the previous slide]

Page 51: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

Predecessors: Eagerly forward “new”
Successors:
*  Use “new” to satisfy “ld old”
*  Perform a successful CAS
*  Continue speculatively like TM, no stall

[Figure: timeline for PA, PB, PC with nodes newA, newB, newC. All three perform “ld old; CAS” without waiting; two speculative-execution regions then run instructions i1-i5]

*  All CAS succeeded in parallel
*  All wasted time is eliminated
*  Execution continues; does not stop
*  Need a validation step to compare forwarded and real value

Page 52: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: processors PA, PB, PC, PD have produced newA, newB, newC; the directory queue holds ldPA, ldPB, ldPC, ldPD for the line]

Page 53: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: same directory state as on the previous slide]

All procs decode CAS, find that “new” has been produced, and forward it to the directory in parallel

Page 54: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: newA, newB, newC, newD have been forwarded to the directory, which holds them next to the queued ldPA, ldPB, ldPC, ldPD]

Page 55: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: same directory state as on the previous slide]

Proc i uses new(i-1) as the response to its “ld old” and proceeds speculatively in parallel

Page 56: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: same directory state; all processors perform their CAS in parallel, speculatively]

Parallel CAS (speculative)

Page 57: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: the cache line is shown; the directory holds ldPA, ldPB, ldPC, ldPD with newA, newB, newC, newD]

Page 58: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: the cache line is shown; the directory holds ldPB, ldPC, ldPD with newB, newC, newD]

Validate:
*  Compare the final value of the line to newA forwarded earlier on
*  Commit speculative execution
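Eager forwarding followed by validation can be illustrated with a toy sequential model. This is a sketch of the idea as the slides describe it, not the CASPAR hardware: the function committed_cas_count, the use of ints for line values, the optional external write, and the assumption that a failed validation also squashes all successors are introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// news[i] is the "new" value produced by the i-th queued processor; top is the
// line's value before any CAS. Optionally, a hypothetical external write sets
// the line to written_value just before processor write_before validates.
// Returns how many speculative CAS operations commit.
int committed_cas_count(int top, const std::vector<int>& news,
                        int write_before = -1, int written_value = 0) {
    int line = top;      // real value of the line, updated in queue order
    int committed = 0;
    for (std::size_t i = 0; i < news.size(); ++i) {
        if (static_cast<int>(i) == write_before) line = written_value;
        // Eager forwarding: processor i used new(i-1) as its "old" value.
        int forwarded_old = (i == 0) ? top : news[i - 1];
        // Validation: the forwarded value must match the real value of the line.
        if (line != forwarded_old) break;   // squash i and its successors
        line = news[i];                     // commit the speculative CAS
        ++committed;
    }
    return committed;
}
```

With no interference every forwarded value validates and all CAS operations commit; an unexpected write to the line makes the affected processor and those queued behind it fail validation.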

Page 59: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: Validate; the directory holds ldPB, ldPC, ldPD with newB, newC, newD]

Page 60: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: Validate; the directory holds ldPC, ldPD with newC, newD]

Page 61: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: Validate; the directory holds ldPC, ldPD with newC, newD]

Page 62: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (1): Eager Forwarding

[Figure: Validate at PD; the directory holds ldPD with newD]

Parallel CAS execution

Still serial validation

Page 63: Toward Extreme-Scale Manycore Architectures

Limitation of Eager Forwarding

[Figure: timeline for PA, PB, PC with nodes newA, newB, newC. After “ld old; CAS”, two speculative-execution regions run instructions i1-i5]

Long speculation increases the chances of squashing the threads

Page 64: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (2): Parallel Validation

[Figure: same timeline as on the previous slide, with two speculative-execution regions running instructions i1-i5]

Idea: Validate in the directory without ever sending line to cores
How: Use new(i) stored in directory

Page 65: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (2): Parallel Validation

Idea: Validate in the directory without ever sending line to cores
How: Use new(i) stored in directory

[Figure: timeline for PA, PB, PC with nodes newA, newB, newC. Parallel validation takes place in the directory; the speculative-execution regions shrink to instructions i1-i3]

Speculative execution reduced to a minimum
Execution does not stop

Page 66: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (2): Parallel Validation

[Figure: processors PA, PB, PC, PD; the directory holds ldPA, ldPB, ldPC, ldPD with newA, newB, newC, newD; all processors perform their CAS in parallel, speculatively]

Parallel CAS (speculative)

Page 67: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (2): Parallel Validation

[Figure: the cache line is shown; the directory holds ldPA, ldPB, ldPC, ldPD with newA, newB, newC, newD]

Page 68: Toward Extreme-Scale Manycore Architectures

Breaking Serialization (2): Parallel Validation

[Figure: Validate & Commit; the directory holds ldPB, ldPC, ldPD with newB, newC, newD, and the three are validated at the same time]

Parallel CAS execution

Parallel validation

Page 69: Toward Extreme-Scale Manycore Architectures

Summary

•  Full parallel synchronization
–  Parallel successful CAS execution
–  Parallel validation

•  Large speedups for 64-core runs:
–  Throughput of kernels increases by 80% avg
–  Execution time of application sections reduces by 60% avg

Page 70: Toward Extreme-Scale Manycore Architectures

Making Synchronization Inexpensive

•  WeeFence: Make memory fences free

•  CASPAR: Breaking the serialization in lock-free synchronization

•  Scalable concurrent priority queues
[Figure: QueueHead pointing to a linked list of nodes]

Page 71: Toward Extreme-Scale Manycore Architectures

Today’s Discussion

•  Focus: Reducing the cost of basic primitives for parallelism
•  Flavor of other challenges: energy, programmability

[Figure labels: Energy-efficiency, Performance, Programmability]

Page 72: Toward Extreme-Scale Manycore Architectures

QuickRec: A Prototype of Record and Replay (RnR) [ISCA-13]

•  Finds non-deterministic software bugs and security intrusions

•  Built FPGA platform with a Pentium multicore

•  HW + OS record all non-deterministic events, so that a parallel program can be replayed deterministically

Page 73: Toward Extreme-Scale Manycore Architectures

ScalCore: a Core for Voltage Scalability [HPCA-16]

•  Decouple the Vdd of logic and storage structures in the pipeline

•  Reconfigure pipeline to fuse the faster storage-intensive stages

[Figure: pipeline of Logic Stage 1, Storage Stage 2a, Storage Stage 2b, and Logic Stage 3, separated by latches and driven by CLK, shown in two modes. HPMode: Enable Flow-through = 0, all stages at Vnom. EEMode: Enable Flow-through = 1, stages at Vlogic or Vop]

Page 74: Toward Extreme-Scale Manycore Architectures

Control-Theoretic Energy/Performance Controllers [ISCA-16]

•  Design control-theoretic controllers that track multiple outputs while actuating on multiple inputs

[Figure: feedback loop of Controller and System. References: BIPS Ref, Power Ref. Outputs (y): BIPS, Power. Inputs (u): Cache size, Freq, ROB size]

•  Attains most efficient use of resources to deliver highest performance

Page 75: Toward Extreme-Scale Manycore Architectures

Conclusion

•  Lots of room to innovate in computer architecture at this time
•  Many exciting interdisciplinary avenues of research:
–  Performance, energy-efficiency & programmability

[Figure: monolithic architecture vs. 3D-stacked layers (non-volatile memory, volatile memory, compute layer, volatile memory, non-volatile memory)]
