Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures
Carlos Molina ψ,ф, Jordi Tubella ф, Antonio González λ,ф
ISHPC-VI, Nara City (Japan) - September 7-9, 2005
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, [email protected]m
ф Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain, {antonio,cmolina,jordit}@ac.upc
ψ Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain, [email protected]
ARITH Is: if the source values do not match, the instruction is re-executed.
STORES: the effective address is re-computed if verification fails, and memory is updated with the value obtained from the non-speculative architectural state.
LOADS: the effective address is re-computed if verification fails, and the destination value obtained from memory is committed to the register file.
Rdest ← M[Rsource1, literal]
[Diagram labels: Non-Speculative Memory Hierarchy, Non-Speculative Register File]
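The three verification rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware design: the names `LabEntry`, `ArchState` and `verify` are hypothetical, the effective address is assumed to be base register plus literal, stores are assumed to carry the base register first and the data register second, and unmapped memory is assumed to read as 0.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class LabEntry:
    """One speculated instruction, as a hypothetical look-ahead buffer record."""
    kind: str                          # "arith", "load" or "store"
    srcs: Tuple[str, ...]              # source register names
    spec_src_vals: Tuple[int, ...]     # source values seen by the speculative thread
    dest: Optional[str] = None         # destination register (arith, load)
    spec_value: int = 0                # speculated result (arith only)
    spec_addr: int = 0                 # speculated effective address (load, store)
    literal: int = 0                   # address displacement (load, store)
    op: Optional[Callable[..., int]] = None  # operation to redo (arith only)


@dataclass
class ArchState:
    """Non-speculative register file and memory."""
    regs: Dict[str, int]
    mem: Dict[int, int]


def verify(entry: LabEntry, state: ArchState) -> bool:
    """Commit one entry to the non-speculative state.

    Returns True when part of the instruction had to be redone.
    """
    # Read the current non-speculative values of the source registers.
    actual = tuple(state.regs[r] for r in entry.srcs)

    if entry.kind == "arith":
        # Re-execute only if the speculated source values were wrong.
        redo = actual != entry.spec_src_vals
        state.regs[entry.dest] = entry.op(*actual) if redo else entry.spec_value
        return redo

    if entry.kind in ("load", "store"):
        # srcs[0] is the base register (assumption: address = base + literal).
        redo = actual[0] != entry.spec_src_vals[0]
        addr = actual[0] + entry.literal if redo else entry.spec_addr
        if entry.kind == "load":
            # The value always comes from non-speculative memory.
            state.regs[entry.dest] = state.mem.get(addr, 0)
        else:
            # The stored value always comes from the non-speculative register.
            state.mem[addr] = actual[1]
        return redo

    raise ValueError(f"unknown instruction kind: {entry.kind}")
```

In this sketch a mismatch costs one local re-execution or address re-computation instead of discarding the whole speculative thread, which is the effect the enhanced verification engine aims for.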
Incorrectly Speculated Is
[Figure: breakdown of incorrectly speculated Is per benchmark, scale 0% to 100%. Categories: Rest, Stores, Loads, Branches, Simple Is. Benchmarks: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean.]
On average, close to 90% of the incorrectly speculated instructions are branches, loads, stores and single-cycle instructions.
Only 1% of the Is inserted in the LAB (look-ahead buffer) are incorrectly predicted.
Experimental Framework
Simulator: Alpha version of the SimpleScalar Toolset
Benchmarks: Spec2000, ref input
Maximum optimization level: DEC C & F77 compilers with -non_shared -O5
Statistics: collected for 250 million instructions, skipping an initial part of 500 million instructions
Simulation Parameters
Base microarchitecture:
  out-of-order machine, 4 instructions per cycle
  I cache: 16KB, D cache: 16KB, L2 shared: 256KB
  bimodal predictor
TSMA additional structures:
  each thread: I window, reorder buffer, register file
  speculative data cache: 1KB
  trace table: 128 entries, 4-way set associative
  look-ahead buffer: 128 entries
  verification engine: up to 8 instructions per cycle, only one I re-executed per cycle
Thread Synchronizations
[Figure: thread synchronizations per benchmark, scale 0 to 100, Conventional VE vs. Enhanced VE. Benchmarks: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean.]
On average, the number of thread synchronizations drops by about 10 percentage points (from 30% to 20%).
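The 30% and 20% averages quoted above can be compared in two ways, and the short check below makes the difference explicit. It is plain arithmetic on the two averages from the slide; the function name `sync_reduction` is hypothetical.

```python
def sync_reduction(conventional_pct: float, enhanced_pct: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    points = conventional_pct - enhanced_pct
    relative = points / conventional_pct * 100
    return points, relative


# Average thread synchronization rates from the slide: 30% and 20%.
points, relative = sync_reduction(30.0, 20.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative reduction")
# prints "10 percentage points, 33.3% relative reduction"
```

Read this way, the "about 10%" on the slide is an absolute difference; relative to the conventional VE, roughly one in three synchronizations is avoided.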
Speedup
[Figure: speedup per benchmark, scale 1.00 to 1.45, Conventional VE vs. Enhanced VE. Benchmarks: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean.]
On average, the performance improvement is around 9%.
Executed Is Reduced
[Figure: reduction in executed Is per benchmark, scale 0 to 100. Benchmarks: Ammp, Apsi, Crafty, Eon, Equake, Gcc, Mcf, Mesa, Mgrid, Sixtrack, Vortex, Vpr, A_Mean.]
On average, the enhanced VE reduces the number of executed instructions by almost 8%.
Conclusions
TSMA:
  a significant number of Is are correctly executed, but discarded when synchronizing
  novel hardware technique to enhance TSMA
Enhanced Verification Engine:
  thread synchros are delayed or even aborted
  branches, loads, stores and single-cycle Is are reconsidered
Results show:
  speedup of 38% (9% improvement)
  misprediction rate of 20% (10 percentage points lower)