1
Faster unicores are still needed
André Seznec
INRIA/IRISA
Jan 19, 2016
2
DAL: Defying Amdahl’s Law
• ERC advanced grant to A. Seznec (2011-2016)
DAL objective:
« Given that Amdahl’s Law is forever, propose (impact) the microarchitecture of the 2020
general-purpose manycore »
3
Multicores are everywhere
• Multicores in servers, desktops, laptops: 2-4-8-12 out-of-order (O-O-O) cores
• Multicores in smartphones, tablets: 2-4 (not that simple) cores
• Manycores for niche markets: 48-80-100 simple cores
Tilera, Intel Phi
4
Multicore/multithread for everyone
• End user: improved usage comfort
Can surf the web and listen to MP3s
• Parallel performance for the masses?
Very few (scalable) mainstream parallel apps
Graphics
Niche market segments
5
No parallel software bonanza
in the near future
• Inheritance of sequential legacy codes
• Parallelism is not cost-effective for most apps
• Sequential programming will remain dominant
6
Inheritance of sequential legacy codes
• Software is more resilient than hardware
Apps survive and evolve for years, often decades
Very few parallel apps now
• Redevelopment of parallel apps from scratch is unlikely
• Computing-intensive sections will be parallelized
But significant code sections will remain sequential
7
Parallelism is not cost-effective
for most apps
• Why parallelism? Only for performance
• But costly:
Difficult, costly in developer time, error-prone
Poorly portable: both functionality and performance
8
Sequential programming will
remain dominant
Just easier
The « Joe » programmer
Portability, maintenance, debug
+ compiler to parallelize
+ parallel libraries
+ software components (developed by experts)
9
Looking backwards
10
2002: The End of the Uniprocessor Road
• Power and temperature walls: stopped the frequency increase
• 2x transistors: 5 %? 10 %? more performance (if any)
Economic logic: buy smaller chips!
IC industry needs to sell new (expensive) chips:
Marketing: « You need hyperthreading, 2, 4, 8 cores »
11
Marketing multicores to the masses
2002- ..
GREAT !!
[Figure: successive chip generations: SMT, dual-core SMT, quad-core SMT]
12
And now ?
The end user is not such a fool ..
13
Following the trend: 2020
• Silicon area, power envelope:
≈ 100 Nehalem-class cores
or
≈ 1,000 simple cores (VLIW, in-order superscalar)
14
Amdahl’s Law
“Cannot run faster than sequential part”
[Figure: execution time split into a sequential (seq.) part and a parallel part]
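For reference, the usual statement of Amdahl's law is given here. The symbols are chosen for this note and do not appear on the slide: s is the sequential fraction of the work and N is the number of processors.

```latex
S(N) = \frac{1}{s + \dfrac{1 - s}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s}
```

With s = 1 %, for instance, the speed-up can never exceed 100, however many processors are used.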
15
OK, parallel applications do not scale
• Our recent study on parallel application scaling:
• In general: bp > -1: sublinear scaling
• Sometimes: bs > 0: sequential part increases
[Model: execution time as a function of input set and processor number]
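The exponents bp and bs suggest a power law in the processor number. The sketch below assumes that form; the exact model of the study is not shown on the slide, the input set size is left out, and the names `exec_time`, `c_seq` and `c_par` and their default values are hypothetical.

```python
def exec_time(p, c_seq=1.0, b_s=0.0, c_par=100.0, b_p=-1.0):
    """Execution time on p processors under an ASSUMED power-law model:
    sequential part c_seq * p**b_s plus parallel part c_par * p**b_p."""
    return c_seq * p ** b_s + c_par * p ** b_p


def speedup(p, **params):
    """Speed-up on p processors relative to one processor."""
    return exec_time(1, **params) / exec_time(p, **params)


if __name__ == "__main__":
    cases = [
        ("ideal (bp = -1, bs = 0)", {}),
        ("sublinear parallel part (bp = -0.9)", {"b_p": -0.9}),
        ("growing sequential part (bs = 0.1)", {"b_s": 0.1}),
    ]
    for label, params in cases:
        print(f"{label}: speed-up on 1000 processors = {speedup(1000, **params):.1f}")
```

With bp = -1 and bs = 0 the model reduces to Amdahl's law; moving either exponent in the direction named on the slide lowers the speed-up.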
16
But let us use a naive (overoptimistic) model
• A parallel application:
Parallel section: can use 1000 processors, with linear speed-up
Sequential section: runs on a single processor
SEQ: constant fraction of sequential code
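A minimal sketch of this naive model, with the total work normalized to 1 and speeds expressed in units of one simple core. The function name `naive_time` and its parameter names are hypothetical.

```python
def naive_time(seq, seq_speed, par_throughput):
    """Normalized execution time under the naive model.

    seq            -- SEQ, the constant fraction of sequential code (0..1)
    seq_speed      -- speed of the single core running the sequential section
    par_throughput -- combined speed of the cores running the parallel
                      section (linear speed-up is assumed)
    """
    if not 0.0 <= seq <= 1.0:
        raise ValueError("seq must be a fraction between 0 and 1")
    return seq / seq_speed + (1.0 - seq) / par_throughput


if __name__ == "__main__":
    # 1000 simple cores (speed 1 each) and 1 % of sequential code
    t = naive_time(0.01, seq_speed=1.0, par_throughput=1000.0)
    print(f"speed-up over one simple core: {1.0 / t:.1f}")
```

Even with only 1 % of sequential code, 1000 simple cores give a speed-up of roughly 91 rather than 1000.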
17
Complex cores against simple cores
• CC: 100 complex cores vs SC: 1000 simple cores,
with a complex core 2X faster than a simple core
If SEQ > 0.8 % then CC > SC
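The 0.8 % threshold can be checked with a short sketch. It assumes that execution time is SEQ / (sequential speed) + (1 - SEQ) / (parallel throughput) and that, on CC, the parallel section uses all 100 complex cores at 2X each. This reading is an assumption, but it reproduces the figure on the slide. The function `crossover_seq` is a hypothetical helper.

```python
def crossover_seq(cfg_a, cfg_b):
    """SEQ fraction at which two configurations take the same time.

    Each configuration is (seq_speed, par_throughput) in simple-core units.
    Time is linear in SEQ: T = SEQ * (1/seq_speed - 1/par) + 1/par.
    """
    (sa, pa), (sb, pb) = cfg_a, cfg_b
    slope_a = 1.0 / sa - 1.0 / pa
    slope_b = 1.0 / sb - 1.0 / pb
    return (1.0 / pb - 1.0 / pa) / (slope_a - slope_b)


CC = (2.0, 100 * 2.0)    # 100 complex cores, each 2X a simple core (assumed reading)
SC = (1.0, 1000 * 1.0)   # 1000 simple cores

if __name__ == "__main__":
    print(f"CC beats SC when SEQ > {crossover_seq(CC, SC):.2%}")
```

Under these assumptions the crossover is 0.004 / 0.504, about 0.79 %, consistent with the 0.8 % quoted above.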
18
And hybrid SC + CC ?
CC_SC: 50 complex + 500 simple cores
If SEQ > 0.2 % then CC_SC > SC
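A worked derivation of the 0.2 % threshold. It assumes that the sequential section runs on one complex core (2X a simple core) and that the parallel section runs on the 500 simple cores only. This reading is not stated on the slide, but it reproduces the figure.

```latex
\begin{aligned}
T_{\mathrm{CC\_SC}} &= \frac{\mathit{SEQ}}{2} + \frac{1 - \mathit{SEQ}}{500},
\qquad
T_{\mathrm{SC}} = \mathit{SEQ} + \frac{1 - \mathit{SEQ}}{1000} \\[4pt]
T_{\mathrm{CC\_SC}} < T_{\mathrm{SC}}
&\iff (1 - \mathit{SEQ})\left(\frac{1}{500} - \frac{1}{1000}\right) < \frac{\mathit{SEQ}}{2} \\
&\iff 0.001 < 0.501\,\mathit{SEQ} \\
&\iff \mathit{SEQ} > \frac{0.001}{0.501} \approx 0.2\,\%
\end{aligned}
```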
19
And if ..
• Use a huge amount of resource for a single core:
10X the area of the complex core
10X the power of the complex core
Use all the uniprocessor techniques:
Very wide issue (8 – 16 ?), ultimate frequency (« heat and run »), helper threads, value prediction
Invent new techniques
Ultra Complex cores
20
DAL architecture proposition
• Heterogeneous architecture: A few ultra complex cores
to enable performance on sequential codes and/or critical sections
A « sea » of simple cores for parallel sections
21
For the naive model
« DAL » : UC_SC
5 ultra complex cores + 500 simple cores
• If SEQ > 0.13 % then « DAL » > SC
• « DAL » always better than UC, CC, CC_SC
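The comparison can be sketched numerically. Two assumptions are not on the slides. First, an ultra complex core is taken to be 4X faster than a simple core (2X a complex core), a value chosen because it reproduces the 0.13 % threshold. Second, in the hybrid configurations the parallel section runs on the simple cores only. All names in the code are hypothetical.

```python
# (sequential speed, parallel throughput), in units of one simple core
CONFIGS = {
    "SC": (1.0, 1000.0),    # 1000 simple cores
    "CC": (2.0, 200.0),     # 100 complex cores, 2X each
    "CC_SC": (2.0, 500.0),  # 50 complex + 500 simple; parallel part on simple cores
    "DAL": (4.0, 500.0),    # 5 ultra complex + 500 simple; 4X is an ASSUMPTION
}


def config_time(name, seq):
    """Normalized execution time of a configuration for a given SEQ fraction."""
    seq_speed, par_throughput = CONFIGS[name]
    return seq / seq_speed + (1.0 - seq) / par_throughput


def best_config(seq):
    """Name of the configuration with the lowest execution time."""
    return min(CONFIGS, key=lambda name: config_time(name, seq))


if __name__ == "__main__":
    for seq in (0.0005, 0.002, 0.01, 0.1):
        times = ", ".join(f"{n}={config_time(n, seq):.5f}" for n in CONFIGS)
        print(f"SEQ={seq:.2%}: best={best_config(seq)} ({times})")
```

Under these assumptions SC only wins for SEQ below about 0.13 %, and DAL is faster than both CC and CC_SC for any non-zero SEQ.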
22
Need for research on faster unicores
• Silicon area is a 2nd-order issue:
can use the area of 10 complex cores
• Power/energy is a 2nd-order issue:
can use the power of 10 complex cores
23
Ongoing work: Revisiting Value Prediction
with Arthur Pérais
24
Value prediction ?
Lipasti et al., Gabbay and Mendelson, 1996
Basic idea: eliminate (some) true data dependencies by
predicting instruction results
[Figure: dependence graph of instructions I0, I1, I3, I4, I5 with annotations +1, +2, +3, +3]
25
Value Prediction:
• Large body of research, 1996-2002
• Quite efficient: surprisingly high number of predictable instructions
• Not implemented so far:
High cost: is it still relevant now?
High penalty on misprediction: don’t lose all the benefit
26
Last Value Predictor
• Just predict the last produced value
Set-associative table
Use confidence counters
Analogy with PC-based branch prediction
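A minimal sketch of a last value predictor with confidence counters. For brevity the table is direct-mapped with tags, whereas the slide mentions a set-associative table. The class name, the table size and the 3-bit counter range are assumptions made for illustration.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self, entries=1024, conf_max=7):
        self.entries = entries
        self.conf_max = conf_max   # 7 = saturated 3-bit counter
        self.table = {}            # index -> [tag, last_value, confidence]

    def _index_tag(self, pc):
        return pc % self.entries, pc // self.entries

    def predict(self, pc):
        """Return the predicted value, or None when confidence is too low."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry and entry[0] == tag and entry[2] == self.conf_max:
            return entry[1]
        return None

    def update(self, pc, value):
        """Train with the value actually produced by the instruction."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry and entry[0] == tag:
            if entry[1] == value:
                entry[2] = min(entry[2] + 1, self.conf_max)
            else:
                entry[1], entry[2] = value, 0   # new value, confidence reset
        else:
            self.table[idx] = [tag, value, 0]   # allocate or replace the entry
```

In this sketch an instruction must produce the same value eight times in a row (one allocation, then seven confidence increments) before its prediction is used.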
27
Stride value predictor
• Predict last value + stride (the last difference)
[Figure: PC-indexed table feeding an adder (+)]
Analogy with stride prefetcher, but also with loop predictor
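A minimal sketch of a stride value predictor. Confidence counters and a finite table are omitted to keep the mechanism visible, and the class name is hypothetical.

```python
class StridePredictor:
    """Predicts last value + last observed difference, per instruction (PC)."""

    def __init__(self):
        self.table = {}   # pc -> (last_value, stride)

    def predict(self, pc):
        """Return last value + stride, or None if the PC was never seen."""
        if pc not in self.table:
            return None
        last_value, stride = self.table[pc]
        return last_value + stride

    def update(self, pc, value):
        """Record the produced value and the difference with the previous one."""
        last_value, _ = self.table.get(pc, (value, 0))
        self.table[pc] = (value, value - last_value)


if __name__ == "__main__":
    sp = StridePredictor()
    for v in (10, 14, 18):      # e.g. an address incremented by 4 in a loop
        sp.update(0x400, v)
    print(sp.predict(0x400))
```

A stride of 0 makes this behave like a last value predictor, which is why the two are often combined.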
28
Finite Context Method predictors
Use the history of the last values produced by the instruction
[Figure: predictor indexed by PC]
Analogy with local history branch predictor
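A minimal sketch of a two-level FCM predictor. A hardware design would typically hash the value history to index a shared second-level table; here a dictionary keyed by (PC, history) stands in for it. The class name and the `order` parameter are hypothetical.

```python
class FCMPredictor:
    """Finite Context Method: the last `order` values select the prediction."""

    def __init__(self, order=2):
        self.order = order
        self.history = {}      # level 1: pc -> tuple of the last `order` values
        self.next_value = {}   # level 2: (pc, history) -> value that followed

    def predict(self, pc):
        """Return the value that followed this history last time, or None."""
        hist = self.history.get(pc, ())
        return self.next_value.get((pc, hist))

    def update(self, pc, value):
        """Learn that `value` followed the current history, then shift it in."""
        hist = self.history.get(pc, ())
        if len(hist) == self.order:
            self.next_value[(pc, hist)] = value
        self.history[pc] = (hist + (value,))[-self.order:]


if __name__ == "__main__":
    fcm = FCMPredictor(order=2)
    for v in (1, 2, 3, 1, 2):   # repeating pattern that has no constant stride
        fcm.update(0x400, v)
    print(fcm.predict(0x400))
```

Reading the per-instruction history at prediction time is the step that needs the most recent local values.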
29
And global value history
• Just makes no sense! Needs the results of the last instructions: too late!!
• But global branch history!?!
ITTAGE is the state-of-the-art indirect branch predictor!! And it predicts values (branch targets)!
30
ITTAGE
[Figure: ITTAGE organization. A tagless base predictor indexed by pc, plus tagged components indexed by pc and global history h[0:L1], h[0:L2], h[0:L3]; tag comparisons (=?) select the prediction]
VTAGE
Longest matching component provides the prediction
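A much simplified sketch of the lookup rule. Full (pc, history slice) keys stand in for hashed indices and partial tags, and the confidence and usefulness counters are left out. The history lengths, the allocation policy and all names are assumptions made for illustration, not the actual design.

```python
HISTORY_LENGTHS = [2, 4, 8]   # hypothetical L1 < L2 < L3 (branch outcomes)


class VTAGESketch:
    def __init__(self):
        self.base = {}                                    # tagless: pc -> value
        self.tagged = [dict() for _ in HISTORY_LENGTHS]   # (pc, slice) -> value

    @staticmethod
    def _key(pc, ghist, length):
        return (pc, tuple(ghist[-length:]))

    def _provider(self, pc, ghist):
        """Index of the longest matching tagged component, or -1 for the base."""
        for i in reversed(range(len(HISTORY_LENGTHS))):
            if self._key(pc, ghist, HISTORY_LENGTHS[i]) in self.tagged[i]:
                return i
        return -1

    def predict(self, pc, ghist):
        """The longest matching component provides the prediction."""
        i = self._provider(pc, ghist)
        if i >= 0:
            return self.tagged[i][self._key(pc, ghist, HISTORY_LENGTHS[i])]
        return self.base.get(pc)

    def update(self, pc, ghist, value):
        """On a wrong prediction, allocate in a component with a longer history."""
        i = self._provider(pc, ghist)
        if self.predict(pc, ghist) == value:
            return
        if i == -1 and pc not in self.base:
            self.base[pc] = value                 # first sighting: fill the base
        elif i + 1 < len(HISTORY_LENGTHS):
            j = i + 1
            self.tagged[j][self._key(pc, ghist, HISTORY_LENGTHS[j])] = value
        else:                                     # already the longest component
            self.tagged[i][self._key(pc, ghist, HISTORY_LENGTHS[i])] = value
```

Here `ghist` is a list of recent branch outcomes (0 or 1). An instruction whose result depends on the path taken to reach it ends up with one entry per distinguishing history.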
31
The repair issue on misprediction
[Figure: instruction sequence I0, I1, I3, I4, I5 with a value misprediction]
32
Pipeline squash
• Acts as on an exception or a branch misprediction
• Very high penalty
[Figure: instruction sequence I0, I1, I3, I4, I5]
33
Selective replay
• Cancel all dependent instructions, but save the others
• Very complex to implement: Unlimited dependence chains
[Figure: instruction sequence I0, I1, I3, I4, I5]
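What selective replay has to compute can be sketched as a transitive closure over consumers. The dependence graph below is hypothetical (the slide's figure is not recoverable), and the function name is illustrative.

```python
from collections import deque


def instructions_to_replay(consumers, mispredicted):
    """All instructions that directly or transitively consumed the
    mispredicted result; the other instructions keep their results."""
    to_replay = set()
    work = deque([mispredicted])
    while work:
        inst = work.popleft()
        for consumer in consumers.get(inst, ()):
            if consumer not in to_replay:
                to_replay.add(consumer)
                work.append(consumer)
    return to_replay


# Hypothetical dependences: I0 -> I1 -> I3, and I4 -> I5 (independent chain)
CONSUMERS = {"I0": ["I1"], "I1": ["I3"], "I3": [], "I4": ["I5"], "I5": []}

if __name__ == "__main__":
    print(sorted(instructions_to_replay(CONSUMERS, "I0")))
```

In software this is a few lines; in hardware the chains can be arbitrarily long and must be tracked for every in-flight prediction, which is where the complexity comes from.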
34
Critical path
• Predicted value needed late in the pipeline: dispatch time is sufficient
• Except that:
35
An FCM implementation issue
[Figure: FCM predictor indexed by PC, with a Speculative Window]
Must take the last local values
Might be a critical path
36
Critical path on the stride value predictor
[Figure: stride predictor indexed by PC, with an adder (+) and a Speculative Window]
Stride AND speculative last value must be high confidence
Can be reused on the next cycle
37
Experiments
• 8-way superscalar, deep pipeline
• Use prediction only on high confidence
3-bit counters: predict only when saturated, reset on misprediction
38
Squashing
39
Selective replay
40
High confidence through probabilistic counters
• Need for very high confidence: 95 % accuracy is insufficient, >> 99 % needed
TRADING ACCURACY AGAINST COVERAGE
• Saturation with only very low probability: 1/32, 1/256
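A minimal sketch of a probabilistic confidence counter: a 3-bit counter that is incremented only with a low probability on a correct prediction and reset on a misprediction. Applying the same probability to every step is an assumption, and the class and parameter names are hypothetical.

```python
import random


class ProbabilisticConfidence:
    """Small counter that behaves like a much wider one."""

    def __init__(self, increment_prob=1 / 32, max_value=7, rng=None):
        self.increment_prob = increment_prob
        self.max_value = max_value          # 7 = saturated 3-bit counter
        self.rng = rng or random.Random()
        self.value = 0

    def on_correct(self):
        """Correct prediction: increment only with probability increment_prob."""
        if self.value < self.max_value and self.rng.random() < self.increment_prob:
            self.value += 1

    def on_mispredict(self):
        """Misprediction: lose all confidence."""
        self.value = 0

    @property
    def confident(self):
        """The prediction is used only when the counter is saturated."""
        return self.value == self.max_value
```

With a probability of 1/32 per step, saturating the counter takes about 7 × 32 = 224 consecutive correct predictions on average. Accuracy goes up, and coverage goes down because fewer predictions are used.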
41
Squashing
42
And hybrids
43
Current status
• All value predictors are amenable to very high confidence
No complex selective repair needed
• No need for local value prediction
No complex critical path in the local value predictor
44
Ongoing work: Selective Prediction of Predicated Instructions
with Nathanael Prémillieu
45
Who cares about predicated instructions ?
• CMOV in all ISAs
• ARM, Itanium: all instructions are predicated
Out-of-order execution: just a nightmare
46
The multiple definition problem
Before renaming:
I1: R1 ← R2, R3 (p)
I2: R4 ← R1, R2
After renaming:
I1: P1 ← P15, P22 (p)
I2: P13 ← ???, P15
[Figure: Mapping Table]
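The multiple definition problem can be reproduced with a toy renamer. The instruction encoding, the function name and the old mapping R1 → P11 are assumptions made for illustration; the other register numbers follow the slide.

```python
def rename(instructions, mapping, free_regs):
    """Rename (dest, sources, predicated) instructions in program order.

    Returns a list of (physical dest, physical sources). After a predicated
    write, the mapping of dest is ambiguous: a pair (old, new).
    """
    renamed = []
    for dest, srcs, predicated in instructions:
        phys_srcs = [mapping[s] for s in srcs]
        new_reg = free_regs.pop(0)
        if predicated:
            # Two possible producers until the predicate is known
            mapping[dest] = (mapping[dest], new_reg)
        else:
            mapping[dest] = new_reg
        renamed.append((new_reg, phys_srcs))
    return renamed


mapping = {"R1": "P11", "R2": "P15", "R3": "P22"}   # R1 -> P11 is an assumption
program = [
    ("R1", ["R2", "R3"], True),    # I1: R1 <- R2, R3 (p)
    ("R4", ["R1", "R2"], False),   # I2: R4 <- R1, R2
]
renamed = rename(program, mapping, ["P1", "P13"])

if __name__ == "__main__":
    for dest, srcs in renamed:
        print(dest, "<-", srcs)
```

The first source of I2 comes out as the pair ("P11", "P1"): the renamer cannot tell which physical register will hold R1, which is the "???" on the slide.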
47
Expansion/Serialization
After renaming:
I1a: P1 ← P15, P22
I1b: P27 ← (p) ? P1 : P11
I2: P13 ← P27, P15
• Create an extra instruction
• Force an I1b → I2 dependency
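Expansion can be sketched as a renamer that splits each predicated instruction into a compute operation and a select operation. The tuple encoding, the "op" and "select" labels and the function name are assumptions; the register numbers follow the slide, with P11 read as the previous mapping of R1.

```python
def rename_with_expansion(instructions, mapping, free_regs):
    """Rename (dest, sources, predicate) instructions; predicate is None
    for unpredicated instructions. Predicated ones expand into two ops."""
    renamed = []
    for dest, srcs, pred in instructions:
        phys_srcs = [mapping[s] for s in srcs]
        result = free_regs.pop(0)
        renamed.append(("op", result, phys_srcs))
        if pred is None:
            mapping[dest] = result
        else:
            old = mapping[dest]
            merged = free_regs.pop(0)
            # Extra instruction: merged <- pred ? result : old
            renamed.append(("select", merged, [pred, result, old]))
            mapping[dest] = merged   # later readers depend on the select
    return renamed


mapping = {"R1": "P11", "R2": "P15", "R3": "P22"}
program = [
    ("R1", ["R2", "R3"], "p"),     # I1: R1 <- R2, R3 (p)
    ("R4", ["R1", "R2"], None),    # I2: R4 <- R1, R2
]
renamed = rename_with_expansion(program, mapping, ["P1", "P27", "P13"])

if __name__ == "__main__":
    for kind, dest, srcs in renamed:
        print(kind, dest, "<-", srcs)
```

The mapping is unambiguous again, at the cost of one more instruction and a serialized dependence through the select.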
48
Aggressive serialization
I1: P18 ← (p) ? (op P15, P22) : P23
I2: P13 ← P18, P15
• No expansion, but an extra operand on I1:
complexity on register file, issue logic, bypass network
• Force an I1 → I2 dependency
49
Predicting the predicates
• Use branch history or branch+predicate history to predict the predicates
Eliminate multiple definitions
Predicate mispredictions become branch mispredictions
50
Not that convincing !
51
• Filter the predicate prediction
• Replay the mispredicted predicates at rename time
52
53
• Predicate prediction + filtering allows:
Better performance
Without aggressive out-of-order implementation
• Current compilers are « shy » about using predication:
might be worth reconsidering
54
Conclusion
Faster cores are needed:
Amdahl’s law,
Uniprocessor workload
Silicon, power, etc. are available:
Just grab the resources from the rest of the system
Do research as if (area, power) were not a constraint:
Then, take into account the constraints
(or somebody else will manage to do it)