Page 1
idempotent (ī-dəm-pō-tənt) adj. 1 of, relating to, or being a mathematical quantity which when applied to itself equals itself; 2 of, relating to, or being an operation under which a mathematical quantity is idempotent.
idempotent processing (ī-dəm-pō-tənt prə-ses-iŋ) n. the application of only idempotent operations in sequence; said of the execution of computer programs in units of only idempotent computations, typically, to achieve restartable behavior.
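The distinction can be made concrete in C. A minimal sketch (the function names and the in-place variant are illustrative, not from the deck): re-executing an idempotent computation from the start yields the same result, while a computation that overwrites its own inputs does not.

```c
/* Idempotent: never overwrites its inputs, so re-execution
 * from the start always produces the same result. */
int sum(const int *array, int len) {
    int x = 0;                      /* x is initialized inside the region */
    for (int i = 0; i < len; ++i)
        x += array[i];
    return x;
}

/* Not idempotent: clobbers its input in place,
 * so running it twice scales twice. */
void scale_in_place(int *array, int len, int k) {
    for (int i = 0; i < len; ++i)
        array[i] *= k;
}
```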
Page 2
Static Analysis and Compiler Design for Idempotent Processing
Marc de Kruijf, Karthikeyan Sankaralingam, Somesh Jha
PLDI 2012, Beijing
Page 3
Example
int sum(int *array, int len) {
    int x = 0;
    for (int i = 0; i < len; ++i)
        x += array[i];
    return x;
}

source code
Page 4
Example
R2 = load [R1]
R3 = 0
LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
EXIT:
    return R3

assembly code

[figure: faults, exceptions, and mis-speculations can strike at any instruction, e.g. at a load]
Page 5
Example
R2 = load [R1]
R3 = 0
LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
EXIT:
    return R3
assembly code
BAD STUFF HAPPENS!
Page 6
Example

R2 = load [R1]
R3 = 0
LOOP:
    R4 = load [R0 + R2]
    R3 = add R3, R4
    R2 = sub R2, 1
    bnez R2, LOOP
EXIT:
    return R3

assembly code

R0 and R1 are unmodified
just re-execute!
conventional approach: use checkpoints/buffers
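The "just re-execute" recovery idea can be sketched in C with a hypothetical fault-injection harness; `setjmp`/`longjmp` stand in for the hardware restart mechanism, and none of these names come from the paper. Because the region's live-in state (the array pointer and length) is never written inside the region, recovery is just a jump back to the region entry, with no checkpoint or buffer.

```c
#include <setjmp.h>

/* Hypothetical harness: inject nfaults "faults", each of which
 * restarts the region from its entry point. */
static jmp_buf region_entry;
static int faults_to_inject;

static void maybe_fault(void) {
    if (faults_to_inject > 0) {
        --faults_to_inject;
        longjmp(region_entry, 1);   /* bad stuff happens: restart region */
    }
}

int sum_with_faults(const int *array, int len, int nfaults) {
    faults_to_inject = nfaults;
    setjmp(region_entry);           /* region entry point */
    int x = 0;                      /* locals re-initialized on every (re-)entry */
    for (int i = 0; i < len; ++i) {
        maybe_fault();
        x += array[i];
    }
    return x;
}
```

The answer is the same whether zero or several faults occur mid-region, because every value the region reads is either a preserved input or produced after the restart point.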
Page 7
It’s Idempotent!
7
idempoh… what…?
int sum(int *data, int len) {
    int x = 0;
    for (int i = 0; i < len; ++i)
        x += data[i];
    return x;
}
Page 8
Idempotent Processing

idempotent regions, ALL THE TIME
Page 9
Idempotent Processing

executive summary

[figure: normal compiler vs. custom compiler region partitioning]

how? idempotence is inhibited by clobber antidependences
so: cut semantic clobber antidependences

low runtime overhead (typically 2-12%)
Page 10
Presentation Overview

❶ Idempotence
❷ Algorithm
❸ Results
Page 11
What is Idempotence?
[figure: example operation sequence]

is this idempotent? Yes
Page 12
What is Idempotence?
[figure: example operation sequence]

how about this? No
Page 13
What is Idempotence?
[figure: example operation sequence]

maybe this? Yes
Page 14
What is Idempotence?
operation sequence | dependence chain | idempotent?
write | (none) | Yes
read, write | antidependence (exposed read) | No
write, read, write | antidependence (read not exposed) | Yes

it's all about the data dependences
Page 15
What is Idempotence?
operation sequence | dependence chain | idempotent?
write, read | flow (read of the region's own write) | Yes
read, write | antidependence (exposed read) | No
write, read, write | antidependence (read not exposed) | Yes

it's all about the data dependences

CLOBBER ANTIDEPENDENCE: antidependence with an exposed read
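The two antidependence flavors can be shown in C (illustrative function names, not from the paper). In the first region the read is exposed, so the later write clobbers the region's input; in the second, the read sees the region's own write, so re-execution is safe.

```c
/* read, write: the read is exposed (it observes a value defined
 * before the region), and the write clobbers it. Re-executing
 * the region reads the clobbered value -> different answer. */
int clobber_region(int *x) {
    int t = *x;      /* exposed read of the region's input */
    *x = t + 1;      /* clobber antidependence */
    return t;
}

/* write, read, write: the read is NOT exposed (it observes the
 * region's own write), so this is not a clobber and the region
 * is idempotent. */
int safe_region(int *x) {
    *x = 42;         /* write first */
    int t = *x;      /* read of the region's own write */
    *x = t + 1;
    return t;
}
```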
Page 16
Semantic Idempotence

two types of program state:

(1) local ("pseudoregister") state:
    can be renamed to remove clobber antidependences*
    does not semantically constrain idempotence

(2) non-local ("memory") state:
    cannot "rename" to avoid clobber antidependences
    semantically constrains idempotence

semantic idempotence = no non-local clobber antidependences

*preserve local state by renaming and careful allocation
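The renaming idea can be illustrated in C, modeling a pseudoregister as a global variable (an illustration only, not the paper's IR): a clobber on local state disappears once the clobbered value gets a fresh name.

```c
/* "Pseudoregister" r2 is a live-in to the region. */
static int r2;

/* Before renaming: r2 is decremented in place, so a re-execution
 * would start from a clobbered r2. */
int sum_clobbers(const int *base) {
    int r3 = 0;
    while (r2 != 0) { r3 += base[r2]; r2 -= 1; }
    return r3;
}

/* After renaming: the counter gets a fresh name t, r2 is preserved,
 * and the region can always be re-executed from its inputs. */
int sum_renamed(const int *base) {
    int r3 = 0;
    int t = r2;                  /* renamed copy; r2 stays unmodified */
    while (t != 0) { r3 += base[t]; t -= 1; }
    return r3;
}
```

Memory state has no analogous escape: a store to `[x]` cannot be "renamed" away, which is why only non-local clobber antidependences semantically constrain idempotence.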
Page 17
Presentation Overview

❶ Idempotence
❷ Algorithm
❸ Results
Page 18
Region Construction Algorithm
steps one, two, and three

Step 1: transform function
    remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences
    cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance
    account for loops, optimize for dynamic behavior
Page 19
Step 1: Transform

not one, but two transformations

Transformation 1: SSA for pseudoregister antidependences

But we still have a problem: region identification depends on clobber antidependences, which in turn depend on region boundaries
Page 20
Step 1: Transform

not one, but two transformations

Transformation 1: SSA for pseudoregister antidependences
Transformation 2: Scalar replacement of memory variables

    [x] = a;          [x] = a;
    b = [x];    =>    b = a;
    [x] = c;          [x] = c;

non-clobber antidependences… GONE!
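A runnable C version of the transformation above (function names are hypothetical): scalar replacement forwards the stored value `a` to the load, so the write-read-write chain on `[x]` collapses to two stores and the non-clobber antidependence vanishes.

```c
/* Before scalar replacement: [x] is written, read, and written again.
 * The read sees the region's own write (a non-clobber), but the
 * memory antidependence still complicates region identification. */
int region_before(int *x, int a, int c) {
    *x = a;
    int b = *x;      /* load of the just-stored value */
    *x = c;
    return b;
}

/* After scalar replacement: the load is replaced by the known value a.
 * Only the two stores remain; the antidependence is gone. */
int region_after(int *x, int a, int c) {
    *x = a;
    int b = a;       /* forwarded value; no load */
    *x = c;
    return b;
}
```

Both versions compute the same result and leave the same final memory state; only the dependence structure changes.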
Page 21
Step 1: Transform

not one, but two transformations

Transformation 1: SSA for pseudoregister antidependences
Transformation 2: Scalar replacement of memory variables

region identification depends on clobber antidependences, which depend on region boundaries
Page 22
Region Construction Algorithm
steps one, two, and three

Step 1: transform function
    remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences
    cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance
    account for loops, optimize for dynamic behavior
Page 23
Step 2: Cut the CFG
cut, cut, cut…

construct regions by "cutting" non-local antidependences

[figure: CFG with an antidependence being cut]
Page 24
Step 2: Cut the CFG
rough sketch

larger is (generally) better:
large regions amortize the cost of input preservation

[sketch: sources of overhead vs. region size; optimal region size?]

but where to cut…?
Page 25
Step 2: Cut the CFG
but where to cut…?

goal: the minimum set of cuts that cuts all antidependence paths

intuition: minimum cuts -> fewest regions -> large regions

approach: a series of reductions:
    minimum vertex multi-cut (NP-complete)
    -> minimum hitting set among paths
    -> minimum hitting set among "dominating nodes"

details in paper…
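The hitting-set step admits the textbook greedy approximation, sketched below; this is not necessarily the paper's exact reduction, and all names and the fixed-size representation are illustrative. Each antidependence path is a set of CFG nodes at which a cut would sever it; the greedy loop repeatedly cuts the node lying on the most still-uncut paths.

```c
#include <stdbool.h>

#define MAX_PATHS 8
#define MAX_NODES 16

/* Greedy hitting set: on_path[p][n] is true if cutting node n severs
 * path p. Returns the number of cuts chosen, written to cuts_out. */
int greedy_cuts(bool on_path[MAX_PATHS][MAX_NODES], int npaths, int nnodes,
                int cuts_out[MAX_NODES]) {
    bool covered[MAX_PATHS] = {false};
    int ncuts = 0;
    for (;;) {
        int best = -1, best_count = 0;
        for (int n = 0; n < nnodes; ++n) {
            int count = 0;
            for (int p = 0; p < npaths; ++p)
                if (!covered[p] && on_path[p][n]) ++count;
            if (count > best_count) { best_count = count; best = n; }
        }
        if (best < 0) break;            /* every path is cut */
        cuts_out[ncuts++] = best;
        for (int p = 0; p < npaths; ++p)
            if (on_path[p][best]) covered[p] = true;
    }
    return ncuts;
}
```

Fewer cuts means fewer region boundaries, and hence larger regions, matching the intuition on this slide.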
Page 26
Region Construction Algorithm
steps one, two, and three

Step 1: transform function
    remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences
    cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance
    account for loops, optimize for dynamic behavior
Page 27
Step 3: Loop-Related Refinements

loops affect correctness and performance

correctness: not all local antidependences are removed by SSA…
    loop-carried antidependences may clobber
    (depends on boundary placement; handled as a post-pass)

performance: loops tend to execute multiple times…
    to maximize region size, place cuts outside of loops
    (algorithm modified to prefer cuts outside of loops)
details in paper…
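The performance point can be made concrete with a toy region count (hypothetical instrumentation, not the paper's compiler): a cut inside a loop body opens a fresh region every iteration, while a cut before the loop leaves one large region covering the whole loop.

```c
/* Cut placed inside the loop body: every iteration crosses the cut,
 * so a new region begins on each trip. */
int count_regions_cut_inside(int trip_count) {
    int regions = 1;                 /* region open at function entry */
    for (int i = 0; i < trip_count; ++i)
        regions += 1;                /* boundary crossed each iteration */
    return regions;
}

/* Cut placed before the loop: the loop body contains no boundary,
 * so the whole loop executes as one region. */
int count_regions_cut_outside(int trip_count) {
    (void)trip_count;
    return 1;                        /* single region spans the loop */
}
```

Dynamic region size is total instructions divided by region count, so moving the cut out of the loop grows regions by roughly the trip count.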
Page 28
27
Presentation Overview
❶ Idempotence
❷ Algorithm
❸ Results
=
Page 29
Results
compiler implementation
- paper compiler implementation in LLVM v2.9
- LLVM v3.1 source code release in the July timeframe

experimental data
(1) runtime overhead, (2) region size, (3) use case
Page 30
Runtime Overhead

[bar chart: percent overhead, instruction count and execution time, y-axis 0-12%, for SPEC INT, SPEC FP, PARSEC, and OVERALL benchmark suites (gmean); overall gmean: 7.6% instruction count, 7.7% execution time]
Page 31
Region Size

[bar chart: average dynamic region size in instructions, log scale 1-100, for compiler-generated regions across SPEC INT, SPEC FP, PARSEC, and OVERALL benchmark suites (gmean); overall gmean: 28 instructions]
Page 32
Use Case

hardware fault recovery

[bar chart: percent overhead, y-axis 0-35%, for SPEC INT, SPEC FP, PARSEC, and OVERALL benchmark suites (gmean); overall gmean: idempotence 8.2, checkpoint/log 24.0, instruction TMR 30.5]
Page 33
Presentation Overview

❶ Idempotence
❷ Algorithm
❸ Results
Page 34
Summary & Conclusions

summary
- idempotent processing: large (low-overhead) idempotent regions, all the time
- static analysis, compiler algorithm: (a) remove artifacts, (b) partition, (c) compile
- low overhead: 2-12% runtime overhead typical
Page 35
Summary & Conclusions

conclusions
- several applications already demonstrated:
    CPU hardware simplification (MICRO '11)
    GPU exceptions and speculation (ISCA '12)
    hardware fault recovery (this paper)
- future work: more applications, hybrid techniques; optimal region size?; enabling even larger region sizes
Page 36
Back-up Slides
Page 37
Error Recovery
dealing with side-effects

mis-speculation (e.g. branch misprediction)
- compiler handles pseudoregister state
- for non-local memory, a store buffer is assumed

arbitrary failure (e.g. hardware fault)
- ECC and other verification assumed
- variety of existing techniques; details in paper

exceptions
- generally no side-effects beyond out-of-order-ness
- fairly easy to handle
Page 38
Optimal Region Size?

it depends… (rough sketch, not to scale)

[sketch: overhead vs. region size, with contributions from detection latency, register pressure, and re-execution time]
Page 39
Prior Work
relating to idempotence

Technique | Year | Domain
Sentinel Scheduling | 1992 | Speculative memory re-ordering
Fast Mutual Exclusion | 1992 | Uniprocessor mutual exclusion
Multi-Instruction Retry | 1995 | Branch and hardware fault recovery
Atomic Heap Transactions | 1999 | Atomic memory allocation
Reference Idempotency | 2006 | Reducing speculative storage
Restart Markers | 2006 | Virtual memory in vector machines
Data-Triggered Threads | 2011 | Data-triggered multi-threading
Idempotent Processors | 2011 | Hardware simplification for exceptions
Encore | 2011 | Hardware fault recovery
iGPU | 2012 | GPU exception/speculation support
Page 40
Detailed Runtime Overhead

[bar chart: percent overhead, instruction count and execution time, y-axis 0-30%, for SPEC INT, SPEC FP, PARSEC suites (gmean) and outliers gcc and namd; overall gmean: 7.6 / 7.7; outliers exhibit non-idempotent inner loops + high register pressure]
Page 41
Detailed Region Size

[bar chart: average number of instructions, log scale 10-10000, for compiler, ideal, and ideal w/o outliers, across SPEC INT, SPEC FP, PARSEC suites (gmean) and outliers hmmer and lbm; overall gmean: 28 (compiler) / 11645 (ideal); >1,000,000 ideal w/o outliers; outlier regions limited by aliasing information]