Parallelism and Heterogeneity Scaling Laws

Apr 23, 2022

dariahiddleston
Page 1: Parallelism and Heterogeneity Scaling Laws

Parallelism and Heterogeneity Scaling Laws

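Page 2: Parallelism and Heterogeneity Scaling Laws

Amdahl’s Law [1967, Gene Amdahl]

Maximum speedup achievable on a multicore

Program phases: a serial phase (fraction 1 - F of the single-core time) and a parallel phase (fraction F).

$$\text{Time on 1 core} = (1 - F) + F = 1$$

$$\text{Time on } N \text{ cores} = (1 - F) + \frac{F}{N}$$

$$\text{Speedup} = \frac{1}{(1 - F) + \frac{F}{N}}$$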

If F = 0.35@ 4 cores, speedup = 2@ cores, speedup = 31
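
A minimal sketch of Amdahl's Law as stated on this slide, useful for trying other values of F and N. The helper names `amdahl_speedup` and `amdahl_limit` are assumptions made for illustration; they do not come from the slides.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup on n cores when a fraction f of the single-core time is parallel."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Serial part is unchanged; parallel part is divided across n cores.
    return 1.0 / ((1.0 - f) + f / n)


def amdahl_limit(f: float) -> float:
    """Upper bound on the speedup as the core count grows without bound."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    for f in (0.5, 0.9, 0.99):
        print(f"F={f}: 4 cores -> {amdahl_speedup(f, 4):.2f}x, "
              f"limit -> {amdahl_limit(f):.0f}x")
```

The limit 1 / (1 - F) is what remains once the parallel term F / N has shrunk to nothing: the serial phase alone bounds the speedup.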

Page 3: Parallelism and Heterogeneity Scaling Laws

Strong scaling vs Weak scaling

Strong Scaling: If a new machine has K times more resources, how much does performance improve?

Weak Scaling: If a new machine has K times more resources, can we solve a bigger problem size?

[Figure: speedup (0 to 80) vs. number of cores (1 to 256) for F = 0.5, 0.9, and 0.99. At 99% parallel, the speedup is 72x at 256 cores.]
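
The strong-scaling curves can be regenerated with a few lines. This is a sketch only: the function and variable names are hypothetical, and the core counts and F values are the ones on the plot axes.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's Law: fraction f is parallel, run on n cores."""
    return 1.0 / ((1.0 - f) + f / n)


CORE_COUNTS = [1, 2, 4, 8, 16, 32, 64, 128, 256]
FRACTIONS = [0.5, 0.9, 0.99]


def strong_scaling_table(fractions=FRACTIONS, core_counts=CORE_COUNTS):
    """Map each parallel fraction to its speedups over the core counts."""
    return {f: [amdahl_speedup(f, n) for n in core_counts] for f in fractions}


table = strong_scaling_table()
print("cores  " + " ".join(f"{n:>6d}" for n in CORE_COUNTS))
for f, row in table.items():
    print(f"F={f:<5}" + " ".join(f"{s:6.1f}" for s in row))
```

The last entry of the F = 0.99 row is about 72, matching the annotation on the slide, while the F = 0.5 row never reaches 2.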

Page 4: Parallelism and Heterogeneity Scaling Laws

Amdahl’s Law for Multicores [Hill and Marty, 2008]

Multicore chip partitioned into
multiple cores (each includes its L1 cache)
uncore (Intel terminology for the shared L2 cache, L3)

Resources per chip are bounded
Area, power, $, or a combination
Bound of N total resources per chip. How many cores? How big is each?

[Figure: four cores C0 to C3, each with a private L1 cache, sharing a cache and memory.]

Page 5: Parallelism and Heterogeneity Scaling Laws

Core Types

Your favorite trick can be used to improve single-core performance using the same resources
becoming increasingly hard to do power-efficiently

Wimpy Core: consumes 1 CU (CU: measure of core resources); performance = 1

Hulk Core: consumes R CUs; performance = Perf(R)

Page 6: Parallelism and Heterogeneity Scaling Laws

Hulk Cores

If Perf(R) >= R, always use the hulk cores: they speed up everything.

Unfortunately, life isn’t easy: Perf(R) < R.

Assume $$\text{Perf}(R) = \sqrt{R}$$
A reasonable assumption? Microprocessor examples seem to indicate so.

How to design a core for a specific Perf(R)? Basic idea: do many instructions in parallel.

Page 7: Parallelism and Heterogeneity Scaling Laws

Multicores under consideration

Symmetric

Asymmetric

Morphing

Page 8: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicores

How many cores? How big is each core?

Chip is bounded to N CUs; each core has R CUs

Number of cores per chip = N/R

For example, let's say N = 16

[Figure: three 16-CU chips with R = 1 (16 cores), R = 4 (4 cores), and R = 16 (1 core).]

Page 9: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore: Performance

Serial phase (1-F) runs on 1 thread on 1 core with performance Perf(R)
Execution time = (1-F) / Perf(R)

Parallel phase uses all N/R cores, each core at Perf(R)
Execution time = F / [Perf(R) * N/R]

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F \cdot R}{\text{Perf}(R) \cdot N}}$$

Serial phase: sped up by Perf(R). Parallel phase: sped up by more cores!
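
A short sketch of the symmetric-multicore speedup, assuming the deck's Perf(R) = √R model for a core built from R CUs. The function names are hypothetical.

```python
import math


def perf(r: float) -> float:
    """Single-core performance of an R-CU core (deck's assumption: sqrt(R))."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Speedup of an N-CU chip split into N/R identical cores of R CUs each."""
    if not 1 <= r <= n:
        raise ValueError("need 1 <= r <= n")
    serial = (1.0 - f) / perf(r)      # one thread on one R-CU core
    parallel = f * r / (perf(r) * n)  # all N/R cores, each at Perf(R)
    return 1.0 / (serial + parallel)


if __name__ == "__main__":
    # 16-CU chip, F = 0.9, eight 2-CU cores.
    print(f"{symmetric_speedup(0.9, 16, 2):.2f}")
```

With R = 1 the expression reduces to plain Amdahl's Law on N cores, since Perf(1) = 1.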

Page 10: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore (Chip = 16 CUs)

[Figure: speedup (0 to 16) vs. per-core CU (R = 1, 2, 4, 8, 16; 16 cores at R = 1, 4 cores at R = 4, 1 core at R = 16) for F = 0.5.]

F = 0.5: R = 16 (1 core) is optimal!

Need lots of parallelism in multicore world!

Page 11: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore (Chip = 16 CUs)

[Figure: speedup (0 to 16) vs. per-core CU (R = 1, 2, 4, 8, 16; 16 cores at R = 1, 4 cores at R = 4, 1 core at R = 16) for F = 0.5 and 0.9.]

F = 0.9, R = 2: speedup of 6.7x at 8 cores

More parallelism helps; but limited speedup!

Page 12: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore (Chip = 16 CUs)

[Figure: speedup (0 to 16) vs. per-core CU (R = 1, 2, 4, 8, 16; 16 cores at R = 1, 4 cores at R = 4, 1 core at R = 16) for F = 0.5, 0.9, 0.99, and 0.999.]

Applications with high F: significant performance loss with bigger cores

$$\text{Performance loss} \propto \frac{R}{\sqrt{R}} = \sqrt{R}$$

Page 13: Parallelism and Heterogeneity Scaling Laws

Model bias towards parallelism

Remember: Perf(R) = √R when scaling up a CPU
Let's say the 1st-gen system is a single 1CU core

Now consider a 2nd-gen 4-CU system
Four 1CU cores or one 4CU core?
When F = 0.999, always pick four 1CU cores: speedup ~4, vs. speedup = 2 for one 4CU core

Even the parallel fraction is not perfectly parallel
Synchronization, contention, locks, etc.
Need SW-Perf(R) (depends on application)
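
The 4-CU comparison can be checked numerically. The sketch below assumes Perf(R) = √R and the symmetric speedup formula from the earlier slide; `symmetric_speedup` is a hypothetical helper name.

```python
import math


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """N-CU chip made of N/R cores of R CUs each, with Perf(R) = sqrt(R)."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


four_small = symmetric_speedup(0.999, 4, 1)  # four 1CU cores
one_big = symmetric_speedup(0.999, 4, 4)     # one 4CU core
print(f"four 1CU cores: {four_small:.2f}x, one 4CU core: {one_big:.2f}x")

# Sweep F to see where the model starts to prefer the four small cores.
for f in (0.5, 0.6, 0.7, 0.9, 0.999):
    small = symmetric_speedup(f, 4, 1)
    big = symmetric_speedup(f, 4, 4)
    print(f"F={f:<5} four small={small:.2f}x one big={big:.2f}x")
```

Under this model the single 4CU core always delivers 2x, because both phases run at Perf(4) = 2, so the choice depends entirely on whether F is high enough for four small cores to beat 2x.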

Page 14: Parallelism and Heterogeneity Scaling Laws

Multicore Moore’s Law

Since the 1970s: Technology Moore’s Law
Double transistors every 2 years. Should possibly continue...

Microarchitect’s Moore’s Law
Double single-thread performance every 2 years
Stopped due to power required

Multicore’s Moore’s Law
2x cores every 2 years (1 in 2007, 8 in 2010)
Need to double software threads every two years
Need HW to enable 2x threads every two years

Page 15: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. per-core CU (R = 1 to 256 in powers of two; 256 cores at R = 1, 16 cores at R = 16, 1 core at R = 256) for F = 0.99 and 0.999.]

F = 0.999: R = 1, 204x at 256 cores. More wimpy cores.

F = 0.99: R = 2 (vs. R = 1 on the 16-CU chip), 80x at 128 cores. More cores, lil hulk cores!

Page 16: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. per-core CU (R = 1 to 256 in powers of two; 256 cores at R = 1, 16 cores at R = 16, 1 core at R = 256) for F = 0.5, 0.9, 0.99, and 0.999.]

F = 0.999: R = 1, 204x at 256 cores. More wimpy cores.

F = 0.99: R = 2 (vs. R = 1 on the 16-CU chip), 80x at 128 cores. More cores, lil hulk cores!

F = 0.9: R = 32 (vs. R = 2 on the 16-CU chip), 8 cores on both the 256-CU and the 16-CU chip. Hulk cores.

Page 17: Parallelism and Heterogeneity Scaling Laws

Symmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. per-core CU (R = 1 to 256 in powers of two; 256 cores at R = 1, 16 cores at R = 16, 1 core at R = 256) for F = 0.5, 0.9, 0.99, and 0.999.]

F = 0.999: R = 1, 204x at 256 cores. More wimpy cores.

F = 0.99: R = 2 (vs. R = 1 on the 16-CU chip), 80x at 128 cores. More cores, lil hulk cores!

F = 0.9: R = 32 (vs. R = 2 on the 16-CU chip), 8 cores on both the 256-CU and the 16-CU chip. Hulk cores.

With more CUs per chip, need hulk cores
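
The trend toward bigger cores on bigger chips can be reproduced with a small sweep. This sketch assumes Perf(R) = √R and restricts R to the power-of-two sizes on the plot axis; `best_core_size` is a hypothetical helper name.

```python
import math


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """N-CU chip made of N/R cores of R CUs each, with Perf(R) = sqrt(R)."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def best_core_size(f: float, n: int) -> int:
    """Power-of-two core size R (1..N) that maximizes the symmetric speedup."""
    sizes = [2 ** k for k in range(n.bit_length()) if 2 ** k <= n]
    return max(sizes, key=lambda r: symmetric_speedup(f, n, r))


for n in (16, 256):
    for f in (0.5, 0.9, 0.99, 0.999):
        r = best_core_size(f, n)
        print(f"N={n:3d} F={f:<5} best R={r:3d} cores={n // r:3d} "
              f"speedup={symmetric_speedup(f, n, r):6.1f}")
```

For F = 0.9 the best size moves from R = 2 on the 16-CU chip to R = 32 on the 256-CU chip, with 8 cores in both cases, which is the comparison annotated on the plot.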

Page 18: Parallelism and Heterogeneity Scaling Laws

Cost-Effective Multicore Computing

Is Speedup (N cores) < N that bad?

It depends on the cost of adding cores: $$$, power
Cost-ratio = Cost (N cores) / Cost (1)

If chip budget is cost, Cost-ratio << 1.
Much of multicore cost is outside the core [IEEE 1995]
Caches, memory controller, SSD, etc.

If power is cost, cost-ratio can approach 1

Multicore computing is effective if Speedup (N cores) > Cost-ratio
Intel 6-core = $1600; AMD 10-core = $2000
If 10-core speedup > 1.25x, then cost-effective
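
The price example can be worked through in a few lines. The sketch assumes the criterion implied by that example, namely that the speedup must exceed the ratio of the two prices; the function names are hypothetical, and the example compares a 10-core part against a 6-core part rather than against a single core.

```python
def cost_ratio(cost_new: float, cost_base: float) -> float:
    """How many times more the new configuration costs than the baseline."""
    if cost_base <= 0:
        raise ValueError("cost_base must be positive")
    return cost_new / cost_base


def is_cost_effective(speedup: float, cost_new: float, cost_base: float) -> bool:
    """True when the speedup outpaces the extra cost (assumed criterion)."""
    return speedup > cost_ratio(cost_new, cost_base)


ratio = cost_ratio(2000.0, 1600.0)  # 10-core price over 6-core price
print(f"cost ratio = {ratio:.2f}")
for speedup in (1.1, 1.25, 1.5):
    verdict = is_cost_effective(speedup, 2000.0, 1600.0)
    print(f"speedup {speedup:.2f}x -> cost-effective: {verdict}")
```

A speedup far below the ideal 10/6 can still be worth paying for, because the price grows much more slowly than the core count.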

Page 19: Parallelism and Heterogeneity Scaling Laws

Multicores in Servers and Clients

Multicore parallelism pays off where cost-ratio is low and applications have the parallelism (high F)

Clients (high F is hard)
Smartphones just moved to dual-cores; how many cores?

Servers can use vast parallelism (MapReduce, data analysis)
natural overlap across clients
Causing move to cloud computing

Page 20: Parallelism and Heterogeneity Scaling Laws

Asymmetric Multicores

Enhance some cores to improve performance for the serial phase.

Many designs possible (in this talk, 1 Hulk core)

How to enhance a core? Coming up in the last third of the class

Page 21: Parallelism and Heterogeneity Scaling Laws

Asymmetric Multicores

Total chip resources = N CUs

Assume two types of cores on-chip
One Hulk core of R CUs, plus N-R 1CU cores
Total cores = N-R+1

Page 22: Parallelism and Heterogeneity Scaling Laws

Asymmetric Cores: Performance

Program phases: serial (1-F) and parallel (F).

Serial phase = (1-F) / [K * Perf(R)]
Parallel phase = F / [K * Perf(R) + N - (K * R)]

where K is the number of Hulk cores.

In our case, K = 1

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{\text{Perf}(R) + N - R}}$$
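
A sketch of the K = 1 asymmetric speedup, assuming Perf(R) = √R as in the rest of the deck. The name `asymmetric_speedup` is hypothetical; the sample points (R = 41 and R = 118 on a 256-CU chip) are the ones annotated on the asymmetric plot.

```python
import math


def asymmetric_speedup(f: float, n: int, r: int) -> float:
    """One Hulk core of R CUs plus N - R wimpy 1CU cores, Perf(R) = sqrt(R)."""
    if not 1 <= r <= n:
        raise ValueError("need 1 <= r <= n")
    p = math.sqrt(r)
    serial = (1.0 - f) / p        # serial phase runs on the Hulk core
    parallel = f / (p + (n - r))  # Hulk core and all wimpy cores together
    return 1.0 / (serial + parallel)


if __name__ == "__main__":
    print(f"F=0.99, R=41:  {asymmetric_speedup(0.99, 256, 41):.1f}x")
    print(f"F=0.9,  R=118: {asymmetric_speedup(0.9, 256, 118):.1f}x")
```

With R = 1 the Hulk core is just another wimpy core and the formula falls back to Amdahl's Law on N cores.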

Page 23: Parallelism and Heterogeneity Scaling Laws

Asymmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 230.6) vs. per-core CU (R = 1 to 256 in powers of two; 256 cores at R = 1, 1 Hulk plus 240 wimpy cores at R = 16, 1 core at R = 256) for F = 0.5, 0.9, 0.99, and 0.999.]

F = 0.99: R = 41 (vs. 3 for symmetric), 216 cores (vs. 85), speedup = 166 (vs. 80)

F = 0.9: R = 118 (vs. 28 for symmetric), 139 cores (vs. 9), speedup = 65.6

Asymmetric cores offer great potential
with 1 Hulk core, speedup increases significantly
helps take care of Amdahl’s law

Asymmetric cores provide bang for the buck

Page 24: Parallelism and Heterogeneity Scaling Laws

Asymmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 240) vs. number of Hulk cores (1, 2, 4, 8, 16; 1 Hulk plus 240 wimpy cores, 4 Hulk plus 192 wimpy cores, 16 Hulk cores) for F = 0.5, 0.9, 0.99, and 0.999.]

Low parallelism: only Hulk!

Higher parallelism: more wimpy!

As F increases, always increase wimpy cores!
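
The trade-off between Hulk and wimpy cores can be explored with the K-Hulk expressions given on the asymmetric performance slide. This sketch follows those expressions as written, including the K * Perf(R) term in the serial phase, and assumes Perf(R) = √R. The Hulk size R = 16 is inferred from the plot labels (1 Hulk plus 240 wimpy cores on a 256-CU chip); the function name is hypothetical.

```python
import math


def asymmetric_speedup_k(f: float, n: int, r: int, k: int) -> float:
    """K Hulk cores of R CUs each plus N - K*R wimpy cores (slide's formula)."""
    if k < 1 or k * r > n:
        raise ValueError("need k >= 1 and k * r <= n")
    hulk = k * math.sqrt(r)         # combined Hulk throughput
    serial = (1.0 - f) / hulk       # as written on the slide
    parallel = f / (hulk + n - k * r)
    return 1.0 / (serial + parallel)


N, R = 256, 16
for f in (0.5, 0.9, 0.99, 0.999):
    row = " ".join(f"K={k:2d}:{asymmetric_speedup_k(f, N, R, k):6.1f}"
                   for k in (1, 2, 4, 8, 16))
    print(f"F={f:<5} {row}")
```

In this model F = 0.5 does best with all 16 Hulk cores, while F = 0.999 does best with a single Hulk core and 240 wimpy cores.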

Page 25: Parallelism and Heterogeneity Scaling Laws

Asymmetric Multicores: Challenge

Task Management : How to schedule computation?

Locality : How to keep data close to task?

Coordinate Tasks : How to synchronize data?

Page 26: Parallelism and Heterogeneity Scaling Laws

Morphing Multicores

Chip consists of N 1CU cores
efficient for the parallel phase

At runtime, glue R 1CU cores to create an R-CU core
improves performance for the serial phase

How to dynamically glue cores?
Not the focus; needs future research

Advantage: Can harness all cores on the chip
Core optimized

Page 27: Parallelism and Heterogeneity Scaling Laws

Morphing Multicores: Performance

N 1CU cores, from which R 1CU cores are glued

Serial phase uses the R-CU core at Perf(R)
Execution time = (1-F) / Perf(R)

Parallel phase uses N cores
Execution time = F / N

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{N}}$$
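
A sketch of the morphing-multicore speedup, assuming Perf(R) = √R for the glued core. The name `morphing_speedup` is hypothetical.

```python
import math


def morphing_speedup(f: float, n: int, r: int) -> float:
    """N 1CU cores; R of them are glued into one R-CU core for the serial phase."""
    if not 1 <= r <= n:
        raise ValueError("need 1 <= r <= n")
    serial = (1.0 - f) / math.sqrt(r)  # glued core at Perf(R)
    parallel = f / n                   # all N 1CU cores
    return 1.0 / (serial + parallel)


if __name__ == "__main__":
    for r in (1, 16, 256):
        print(f"F=0.9, R={r:3d}: {morphing_speedup(0.9, 256, r):.1f}x")
```

Because gluing cores costs nothing in the parallel phase under this model, the speedup only grows with R, so the largest glued core (R = N) is always the best choice.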

Page 28: Parallelism and Heterogeneity Scaling Laws

Morphing Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. Hulk-core CU (R = 1 to 256 in powers of two) for F = 0.5, 0.9, 0.99, and 0.999.]

Morphing multicores are awesome!

Especially at higher chip resource levels

How to glue!

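Page 29: Parallelism and Heterogeneity Scaling Laws

Multicore Amdahl’s Law

Symmetric

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F \cdot R}{\text{Perf}(R) \cdot N}}$$

Asymmetric

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{\text{Perf}(R) + N - R}}$$

Morphing

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{N}}$$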
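
The three laws can be compared side by side. The sketch assumes Perf(R) = √R and one Hulk core in the asymmetric case; all function names are hypothetical.

```python
import math


def symmetric(f: float, n: int, r: int) -> float:
    """N/R identical cores of R CUs each."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def asymmetric(f: float, n: int, r: int) -> float:
    """One R-CU Hulk core plus N - R wimpy 1CU cores."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f / (p + n - r))


def morphing(f: float, n: int, r: int) -> float:
    """N 1CU cores, R of them glued together for the serial phase."""
    return 1.0 / ((1.0 - f) / math.sqrt(r) + f / n)


N = 256
print(" F      R   symmetric  asymmetric  morphing")
for f in (0.9, 0.99, 0.999):
    for r in (16, 64):
        print(f"{f:<6} {r:3d}  {symmetric(f, N, r):9.1f}  "
              f"{asymmetric(f, N, r):10.1f}  {morphing(f, N, r):8.1f}")
```

All three share the same serial term; they differ only in parallel-phase throughput, which is N/√R for symmetric, √R + N - R for asymmetric, and N for morphing, so for the same F, N, and R the morphing chip is never slower than the asymmetric one, which in turn is never slower than the symmetric one.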

Page 30: Parallelism and Heterogeneity Scaling Laws

Challenges (1/2)

Serial fraction (1-F) has fine-grain parallelism

Parallel fraction (F) has serialization overheads
You will learn about these in the next 2-3 weeks.

Software challenges for asymmetric and dynamic multicores

How much parallelism in future software?

Page 31: Parallelism and Heterogeneity Scaling Laws

Challenges (2/2)

Parallelism all the time?

Amdahl’s Law affects serial fraction? Need to increase core speed.

Lots of walls: power, area, shared caches
How to scale CPU performance?