Faster unicores are still needed

1

André Seznec

INRIA/IRISA

2

DAL: Defying Amdahl’s Law

• ERC advanced grant to A. Seznec (2011-2016)

DAL objective:

« Given that Amdahl’s Law is forever, propose (impact) the microarchitecture of the 2020 general-purpose manycore »

3

Multicores are everywhere

• Multicores in servers, desktops, laptops: 2-4-8-12 out-of-order (O-O-O) cores

• Multicores in smartphones, tablets: 2-4-(not that simple) cores

• Manycores for niche markets: 48-80-100 simple cores

Tilera, Intel Xeon Phi

4

Multicore/multithread for everyone

• End user: improved usage comfort

Can surf the web and listen to MP3s

• Parallel performance for the masses?

Very few (scalable) mainstream parallel apps

Graphics

Niche market segments

5

No parallel software bonanza in the near future

• Inherited sequential legacy codes

• Parallelism is not cost-effective for most apps

• Sequential programming will remain dominant

6

Inherited sequential legacy codes

• Software is more resilient than hardware

Apps survive and evolve for years, often decades

Very few parallel apps now

• Redeveloping parallel apps from scratch is unlikely

• Compute-intensive sections will be parallelized

But significant code sections will remain sequential

7

Parallelism is not cost-effective for most apps

• Why parallelism? Only for performance

• But costly:

Difficult, time-consuming, error-prone

Poorly portable: functionality and performance

8

Sequential programming will remain dominant

Just easier

The « Joe » programmer

Portability, maintenance, debugging

+ compiler to parallelize

+ parallel libraries

+ software components (developed by experts)

9

Looking backwards

10

2002: The End of the Uniprocessor Road

• Power and temperature walls: stopped the frequency increase

• 2x transistors: 5%? 10%? more performance (if any)

Economic logic: buy smaller chips!

IC industry needs to sell new (expensive) chips:

Marketing:

« You need hyperthreading, 2, 4, 8 cores »

11

Marketing multicores to the masses

2002- ..

GREAT !!

[Figure labels: SMT; Dual-core SMT; Quad-core SMT]

12

And now ?

The end user is not such a fool ..

13

Following the trend: 2020

• Silicon area, power envelope ≈ 100 Nehalem-class cores

or

≈ 1,000 simple cores

(VLIW, in-order superscalar)

14

Amdahl’s Law

“Cannot run faster than sequential part”

[Figure: execution time split into a sequential (seq.) part and a parallel part]
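The law quoted above can be written as a short function. This is a minimal sketch of the textbook form of Amdahl's Law; the function and parameter names are illustrative and do not come from the slides.

```python
def amdahl_speedup(seq_fraction: float, processors: int) -> float:
    """Speedup over one processor when only the parallel part scales."""
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - seq_fraction
    # The sequential part is untouched; the parallel part is divided evenly.
    return 1.0 / (seq_fraction + parallel_fraction / processors)


def speedup_limit(seq_fraction: float) -> float:
    """Upper bound on speedup with an unlimited number of processors."""
    return float("inf") if seq_fraction == 0.0 else 1.0 / seq_fraction


if __name__ == "__main__":
    # 1% sequential code on 1000 processors: about 91x, bounded by 100x.
    print(round(amdahl_speedup(0.01, 1000), 2), speedup_limit(0.01))
```

Even a 1% sequential fraction keeps 1000 processors below a 100x speedup, which is the motivation for the rest of the talk.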

15

OK, parallel applications do not scale

• Our recent study on parallel application scaling:

• In general: bp > -1: sublinear scaling

• Sometimes: bs > 0: sequential part increases

[Equation labels: Execution time, Input set, Processor number]
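The formula on this slide did not survive extraction, so the following is only a hedged illustration of what the two exponents mean. It assumes a power-law form in which the sequential and parallel times scale as P^bs and P^bp with the processor number P; the dependence on the input set is left out, and all names and default values are hypothetical.

```python
def modeled_time(procs: int, seq_time: float = 1.0, par_time: float = 99.0,
                 bs: float = 0.0, bp: float = -1.0) -> float:
    """Assumed model: time = seq_time * P**bs + par_time * P**bp.

    bp = -1 is perfect linear scaling of the parallel part;
    bp > -1 is sublinear scaling;
    bs > 0 means the sequential part grows with the processor count.
    """
    return seq_time * procs ** bs + par_time * procs ** bp


if __name__ == "__main__":
    for label, kwargs in [("ideal", {}),
                          ("bp = -0.8", {"bp": -0.8}),
                          ("bs = 0.1", {"bs": 0.1})]:
        print(label, round(modeled_time(100, **kwargs), 3))
```

With bs = 0 and bp = -1 this reduces to the classic Amdahl form; either deviation makes the 100-processor run slower than the ideal case.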

16

But let us use a naive (overoptimistic) model

• A parallel application:

Parallel section: can use 1000 processors

Sequential section: runs on a single processor

SEQ: constant fraction of sequential code

linear speed-up

17

Complex cores against simple cores

• CC: 100 complex cores vs SC: 1000 simple cores

with a complex core 2X faster than a simple one

if SEQ > 0.8% then CC > SC
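The 0.8% threshold follows from the naive model by equating the two execution times. The sketch below normalizes a simple core to speed 1 and describes each chip as a pair (sequential speed, parallel throughput); the helper names are illustrative.

```python
def exec_time(seq: float, seq_speed: float, par_throughput: float) -> float:
    """Normalized run time: sequential part on one core, the rest in parallel."""
    return seq / seq_speed + (1.0 - seq) / par_throughput


def breakeven_seq(a, b) -> float:
    """SEQ above which chip `a` beats chip `b`.

    Assumes `a` has the faster sequential core and the lower parallel
    throughput. Solves seq/sa + (1-seq)/pa = seq/sb + (1-seq)/pb for seq.
    """
    (sa, pa), (sb, pb) = a, b
    seq_gain = 1.0 / sb - 1.0 / sa   # time saved per unit of sequential work
    par_loss = 1.0 / pa - 1.0 / pb   # time lost per unit of parallel work
    return par_loss / (seq_gain + par_loss)


SC = (1.0, 1000.0)       # 1000 simple cores
CC = (2.0, 100 * 2.0)    # 100 complex cores, each 2X a simple core

if __name__ == "__main__":
    print(f"CC beats SC when SEQ > {breakeven_seq(CC, SC):.2%}")
```

The break-even value is 0.004 / 0.504, about 0.79%, which rounds to the 0.8% quoted on the slide.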

18

And hybrid SC + CC?

CC_SC: 50 complex + 500 simple cores

if SEQ > 0.2% then CC_SC > SC
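The 0.2% figure can be reproduced numerically under one reading of the naive model: the sequential section runs on a complex core and the parallel section runs on the 500 simple cores only. That reading is an assumption (the slide does not say whether the complex cores join the parallel section), and the function names are illustrative.

```python
def time_sc(seq: float) -> float:
    """1000 simple cores, simple core speed normalized to 1."""
    return seq + (1.0 - seq) / 1000.0


def time_hybrid(seq: float) -> float:
    """50 complex + 500 simple cores.

    Assumption: sequential code runs on a 2X complex core and the
    parallel section uses the 500 simple cores only.
    """
    return seq / 2.0 + (1.0 - seq) / 500.0


def crossover(time_a, time_b, iterations: int = 60) -> float:
    """Bisection for the SEQ above which time_a < time_b (single crossing)."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if time_a(mid) < time_b(mid):
            high = mid
        else:
            low = mid
    return high


if __name__ == "__main__":
    print(f"CC_SC beats SC when SEQ > {crossover(time_hybrid, time_sc):.2%}")
```

Under this assumption the crossover is 0.001 / 0.501, about 0.20%.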

19

And if ..

• Use a huge amount of resources for a single core:

10X the area of the complex core

10X the power of the complex core

Use all the uniprocessor techniques:

Very wide issue (8 – 16?), ultimate frequency (« heat and run »), helper threads, value prediction

Invent new techniques

Ultra Complex cores

20

DAL architecture proposition

• Heterogeneous architecture:

A few ultra complex cores to enable performance on sequential codes and/or critical sections

A « sea » of simple cores for parallel sections

21

For the naive model

« DAL »: UC_SC

5 ultra complex cores + 500 simple cores

• If SEQ > 0.13% then « DAL » > SC

• « DAL » always better than UC, CC, CC_SC
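The two claims above can be checked in the naive model, with one caveat: the slides do not state how fast an ultra complex core is. The sketch below assumes it is 2X a complex core (4X a simple core), a value chosen because it reproduces the 0.13% threshold. It also assumes that UC means 10 ultra complex cores (the area of 100 complex cores) and that the parallel sections of the hybrids run on the simple cores only.

```python
SIMPLE, COMPLEX = 1.0, 2.0
ULTRA = 4.0  # assumption: not given in the slides

# name -> (speed of the core running sequential code, parallel throughput)
CONFIGS = {
    "SC": (SIMPLE, 1000 * SIMPLE),
    "CC": (COMPLEX, 100 * COMPLEX),
    "UC": (ULTRA, 10 * ULTRA),
    "CC_SC": (COMPLEX, 500 * SIMPLE),
    "DAL": (ULTRA, 500 * SIMPLE),
}


def exec_time(name: str, seq: float) -> float:
    """Normalized execution time of one configuration for a given SEQ."""
    seq_speed, par_throughput = CONFIGS[name]
    return seq / seq_speed + (1.0 - seq) / par_throughput


def best(seq: float) -> str:
    """Configuration with the lowest execution time for a given SEQ."""
    return min(CONFIGS, key=lambda name: exec_time(name, seq))


if __name__ == "__main__":
    for seq in (0.001, 0.002, 0.01, 0.1):
        print(f"SEQ = {seq:.1%}: best is {best(seq)}")
```

With these assumptions the DAL versus SC crossover is 0.001 / 0.751, about 0.13%, and DAL is never slower than UC, CC or CC_SC for any SEQ between 0 and 1.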

22

Need for research on faster unicores

• Silicon area is a second-order issue

can use the area of 10 complex cores

• Power/energy is a second-order issue

can use the power of 10 complex cores

23

Ongoing work: Revisiting Value Prediction

with Arthur Pérais

24

Value prediction?

Lipasti et al., Gabbay and Mendelson, 1996

Basic idea: eliminate (some) true data dependencies by predicting instruction results

[Figure: dependence graph of instructions I0, I1, I3, I4, I5 with edge labels +2, +3, +1, +3]

25

Value Prediction:

• Large body of research, 1996-2002

• Quite efficient: surprisingly high number of predictable instructions

• Not implemented so far:

High cost: is it still relevant now?

High penalty on misprediction: don’t lose all the benefit

26

Last Value Predictor

• Just predict the last produced value

Set-associative table

Use confidence counters

Analogy with PC-based branch prediction
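A last value predictor can be sketched in a few lines. The version below is a simplification under stated assumptions: the table is direct-mapped with tags rather than set-associative, the confidence counter is a plain saturating counter, and the class and method names are illustrative.

```python
class LastValuePredictor:
    """PC-indexed table holding the last result and a confidence counter."""

    def __init__(self, entries: int = 1024, conf_max: int = 7):
        self.entries = entries
        self.conf_max = conf_max
        self.table = {}  # index -> [tag, last_value, confidence]

    def _slot(self, pc: int):
        return pc % self.entries, pc // self.entries  # (index, tag)

    def predict(self, pc: int):
        """Return the predicted value, or None when not confident enough."""
        index, tag = self._slot(pc)
        entry = self.table.get(index)
        if entry and entry[0] == tag and entry[2] == self.conf_max:
            return entry[1]
        return None

    def train(self, pc: int, actual: int) -> None:
        """Update the entry with the value the instruction really produced."""
        index, tag = self._slot(pc)
        entry = self.table.get(index)
        if entry is None or entry[0] != tag:
            self.table[index] = [tag, actual, 0]      # allocate or replace
        elif entry[1] == actual:
            entry[2] = min(entry[2] + 1, self.conf_max)
        else:
            entry[1], entry[2] = actual, 0            # new value, confidence lost
```

A prediction is only issued once the same value has been seen often enough to saturate the counter, mirroring how a PC-indexed branch predictor gates on its counters.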

27

Stride value predictor

• Prediction = last value + (last difference)

[Figure: table indexed by PC feeding an adder (+)]

Analogy with stride prefetcher, but also with loop predictor
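The stride scheme can be sketched the same way. This is an illustrative, untagged version: the dictionary keyed by PC stands in for a hardware table, and the confidence policy (count repeated strides, reset on a change) is an assumption.

```python
class StridePredictor:
    """Predicts last value + stride once the stride has been stable."""

    def __init__(self, conf_max: int = 7):
        self.conf_max = conf_max
        self.table = {}  # pc -> [last_value, stride, confidence]

    def predict(self, pc: int):
        """Return last + stride, or None when confidence is not saturated."""
        entry = self.table.get(pc)
        if entry and entry[2] == self.conf_max:
            return entry[0] + entry[1]
        return None

    def train(self, pc: int, actual: int) -> None:
        """Record the real result and update the stride and its confidence."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0, 0]
            return
        stride = actual - entry[0]
        if stride == entry[1]:
            entry[2] = min(entry[2] + 1, self.conf_max)
        else:
            entry[1], entry[2] = stride, 0
        entry[0] = actual
```

Feeding it the results 10, 14, 18, ... of a loop induction variable eventually yields predictions of the next element, much as a stride prefetcher locks onto an address stream.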

28

Finite Context Method predictors

Use the history of the last values produced by the instruction

[Figure: table indexed by PC]

Analogy with local history branch predictor
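A two-level sketch makes the idea concrete: the first level keeps the last few values per instruction, and the second level maps that local history to the value that followed it last time. Real designs hash the history into a finite table; the exact-match dictionaries, the `order` parameter and the names below are simplifying assumptions.

```python
class FCMPredictor:
    """Order-n Finite Context Method value predictor (idealized tables)."""

    def __init__(self, order: int = 2, conf_max: int = 7):
        self.order = order            # number of past values in the context
        self.conf_max = conf_max
        self.history = {}             # pc -> tuple of the last `order` values
        self.values = {}              # (pc, history) -> [next_value, confidence]

    def predict(self, pc: int):
        """Return the value that followed this context before, if confident."""
        hist = self.history.get(pc, ())
        entry = self.values.get((pc, hist))
        if len(hist) == self.order and entry and entry[1] == self.conf_max:
            return entry[0]
        return None

    def train(self, pc: int, actual: int) -> None:
        """Learn which value follows the current context, then shift it in."""
        hist = self.history.get(pc, ())
        if len(hist) == self.order:
            entry = self.values.get((pc, hist))
            if entry is None:
                self.values[(pc, hist)] = [actual, 0]
            elif entry[0] == actual:
                entry[1] = min(entry[1] + 1, self.conf_max)
            else:
                entry[0], entry[1] = actual, 0
        self.history[pc] = (hist + (actual,))[-self.order:]
```

Unlike the last value and stride schemes, this captures repeating patterns such as 1, 2, 3, 1, 2, 3, where neither the value nor the stride is constant.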

29

And global value history

• Just no sense!

Need the history of the last instructions

Too late!!

• But global branch history!?!

ITTAGE is the state-of-the-art indirect branch predictor!!

And it predicts values (branch targets)!

30

ITTAGE

VTAGE

[Figure: a tagless base predictor indexed by pc, plus tagged components indexed by pc and global history h[0:L1], h[0:L2], h[0:L3]; tag comparisons (=?) select the prediction]

Longest matching component provides the prediction
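The selection rule can be illustrated with a heavily simplified model. This is a sketch of the idea only, not the VTAGE design: exact-match dictionaries replace hashed indices and partial tags, confidence and usefulness counters are omitted, and the history lengths, class and method names are all hypothetical.

```python
class VTageSketch:
    """Tagless base table plus components keyed by pc and branch history."""

    def __init__(self, history_lengths=(2, 4, 8)):
        self.history_lengths = history_lengths
        self.base = {}                                # pc -> value (tagless)
        self.tagged = [{} for _ in history_lengths]   # (pc, history) -> value
        self.ghist = ()                               # newest outcome last

    def record_branch(self, taken: bool) -> None:
        """Shift one branch outcome into the global history."""
        longest = max(self.history_lengths)
        self.ghist = (self.ghist + (bool(taken),))[-longest:]

    def _key(self, pc: int, comp: int):
        return pc, self.ghist[-self.history_lengths[comp]:]

    def _provider(self, pc: int):
        # Search from the longest history down: the longest match provides.
        for comp in reversed(range(len(self.tagged))):
            if self._key(pc, comp) in self.tagged[comp]:
                return comp
        return None

    def predict(self, pc: int):
        comp = self._provider(pc)
        if comp is None:
            return self.base.get(pc)
        return self.tagged[comp][self._key(pc, comp)]

    def train(self, pc: int, actual: int) -> None:
        comp = self._provider(pc)
        predicted = self.predict(pc)
        if comp is None:
            self.base[pc] = actual
        else:
            self.tagged[comp][self._key(pc, comp)] = actual
        # On a misprediction, allocate in the next longer-history component.
        nxt = 0 if comp is None else comp + 1
        if predicted != actual and nxt < len(self.tagged):
            self.tagged[nxt][self._key(pc, nxt)] = actual
```

For an instruction whose result depends on the direction of the preceding branch, the base table alone keeps flipping, while the history-indexed component learns one value per branch pattern.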

31

The repair issue on misprediction

I0 I1 I3 I4 I5

misprediction

32

Pipeline squash

• Handled like an exception or a branch misprediction

• Very high penalty

I0 I1 I3 I4 I5

33

Selective replay

• Cancel all dependent instructions, but save the others

• Very complex to implement: Unlimited dependence chains

I0 I1 I3 I4 I5
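The difference between the two repair schemes is the set of instructions that must be re-executed. The sketch below uses the instruction names from the figure, but the dependence edges are hypothetical, chosen only to illustrate the contrast.

```python
def squash_set(program_order, mispredicted):
    """Pipeline squash: every instruction younger than the misprediction."""
    start = program_order.index(mispredicted) + 1
    return program_order[start:]


def replay_set(program_order, producers, mispredicted):
    """Selective replay: only direct and transitive dependents.

    `producers` maps an instruction to the set of instructions it reads.
    One pass in program order suffices because producers come first.
    """
    tainted = {mispredicted}
    replay = []
    start = program_order.index(mispredicted) + 1
    for instr in program_order[start:]:
        if producers.get(instr, set()) & tainted:
            tainted.add(instr)
            replay.append(instr)
    return replay


ORDER = ["I0", "I1", "I3", "I4", "I5"]
# Hypothetical dependences: I1 and I4 read I0, I3 reads I1, I5 reads I3.
PRODUCERS = {"I1": {"I0"}, "I3": {"I1"}, "I4": {"I0"}, "I5": {"I3"}}

if __name__ == "__main__":
    print("squash:", squash_set(ORDER, "I1"))
    print("replay:", replay_set(ORDER, PRODUCERS, "I1"))
```

With a misprediction on I1, squashing discards I3, I4 and I5, while selective replay keeps I4. In hardware the transitive chain has no fixed bound, which is the implementation difficulty the slide points at.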

34

Critical path

• Predicted value needed late in the pipeline: dispatch time is sufficient

• Except that:

35

An FCM implementation issue

[Figure: FCM predictor indexed by PC, with a speculative window]

Must take the last local values

Might be a critical path

36

Critical path on the stride value predictor

[Figure: stride predictor indexed by PC, with an adder (+) and a speculative window]

Stride AND spec. last value must be high confidence

Can be reused on the next cycle

37

Experiments

• 8-way superscalar, deep pipeline

• Use prediction only on high confidence

3-bit counters: predict only when saturated, reset on misprediction

38

Squashing

39

Selective replay

40

High confidence through probabilistic counters

• Need for very high confidence:

95% accuracy is insufficient, >> 99% needed

TRADING ACCURACY AGAINST COVERAGE

• Saturation with only very low probability: 1/32, 1/256
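A probabilistic confidence counter can be sketched as follows. It assumes a single step probability applied to every forward transition and a reset on each misprediction; the slides quote 1/32 and 1/256 but do not say how these are assigned to transitions, so that detail and all names below are hypothetical.

```python
import random


class ProbabilisticCounter:
    """Small saturating counter that only sometimes counts a correct outcome."""

    def __init__(self, step_probability: float = 1 / 32, bits: int = 3, rng=None):
        self.maximum = (1 << bits) - 1
        self.step_probability = step_probability
        self.rng = rng or random.Random()
        self.value = 0

    def update(self, correct: bool) -> None:
        """Reset on a misprediction; otherwise advance with low probability."""
        if not correct:
            self.value = 0
        elif (self.value < self.maximum
              and self.rng.random() < self.step_probability):
            self.value += 1

    @property
    def confident(self) -> bool:
        """Predictions are used only when the counter is saturated."""
        return self.value == self.maximum
```

With a step probability of 1/32, a 3-bit counter needs 7 × 32 = 224 correct outcomes on average to saturate, so it behaves like a much wider counter at the storage cost of 3 bits. Fewer instructions reach saturation, which lowers coverage in exchange for accuracy.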

41

Squashing

42

And hybrids

43

Current status

• All value predictors are amenable to very high confidence

No complex selective repair needed

• No need for local value prediction

No complex critical path in the local value predictor

44

Ongoing work: Selective Prediction of Predicated Instructions

with Nathanael Prémillieu

45

Who cares about predicated instructions?

• CMOV in all ISAs

• ARM, Itanium: all instructions are predicated

Out-of-order execution: just a nightmare

46

The multiple definition problem

Before renaming:

I1: R1 ← R2, R3 (p)

I2: R4 ← R1, R2

After renaming:

I1: P1 ← P15, P22 (p)

I2: P13 ← ???, P15

[Figure: mapping table]

47

Expansion/Serialization

After renaming:

I1a: P1 ← P15, P22

I1b: P27 ← (p) ? P1 : P11

I2: P13 ← P27, P15

• Create an extra instruction

• Force I1b → I2 dependency

48

Aggressive serialization

I1: P18 ← (p) ? (op P15, P22) : P23

I2: P13 ← P18, P15

• No expansion, but an extra operand on I1:

complexity on register file, issue logic, bypass network

• Force I1 → I2 dependency

49

Predicting the predicates

• Use branch history or branch+predicate history to predict the predicates

Eliminate multiple definitions

Predicate mispredictions become branch mispredictions
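The effect of a predicted predicate on renaming can be sketched on the two-instruction example used earlier. This is an illustrative model only: instructions are (destination, sources, predicate) tuples, the initial mapping of R1 to P11 is taken from the expansion example, and verification of the prediction and recovery are left out.

```python
def rename(instructions, mapping, predicted_true, free_regs):
    """Rename with predicted predicates.

    An instruction whose predicate is predicted false is dropped at rename
    (shown as None) and leaves the mapping untouched; one predicted true is
    renamed like an ordinary instruction. Either way each logical register
    has a single definition. `mapping` is updated in place.
    """
    renamed = []
    for dest, sources, predicate in instructions:
        if predicate is not None and not predicted_true[predicate]:
            renamed.append(None)
            continue
        phys_sources = [mapping[src] for src in sources]
        mapping[dest] = next(free_regs)
        renamed.append((mapping[dest], phys_sources))
    return renamed


# I1: R1 <- R2, R3 (p)   and   I2: R4 <- R1, R2
PROGRAM = [("R1", ["R2", "R3"], "p"), ("R4", ["R1", "R2"], None)]
INITIAL_MAP = {"R1": "P11", "R2": "P15", "R3": "P22"}

if __name__ == "__main__":
    for guess in (True, False):
        regs = iter(["P1", "P13"])
        print(guess, rename(PROGRAM, dict(INITIAL_MAP), {"p": guess}, regs))
```

When p is predicted true, I2 reads P1 as in the slides; when predicted false, I2 reads the old mapping P11. If the guess turns out wrong, everything renamed after I1 is incorrect, which is why such a misprediction costs as much as a branch misprediction.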

50

Not that convincing !

51

• Filter the predicate prediction

• Replay the mispredicted predicates at rename time

52

53

• Predicate prediction + filtering allows:

Better performance

Without aggressive out-of-order implementation

• Current compilers are « shy » about using predication

might be worth reconsidering

54

Conclusion

Faster cores are needed:

Amdahl’s law,

Uniprocessor workload

Silicon, power, etc. are available:

Just grab the resource from the rest of the system

Do research as if (area, power) were not a constraint:

Then, take into account the constraints

(or somebody else will manage to do it)

Faster unicores are still needed