Static Analysis and Compiler Design for Idempotent Processing

Feb 23, 2016

olesia

Page 1: Static Analysis and Compiler Design for Idempotent Processing

idempotent (ī-dəm-pō-tənt) adj. 1 of, relating to, or being a mathematical quantity which when applied to itself equals itself; 2 of, relating to, or being an operation under which a mathematical quantity is idempotent.

idempotent processing (ī-dəm-pō-tənt prə-ses-iŋ) n. the application of only idempotent operations in sequence; said of the execution of computer programs in units of only idempotent computations, typically, to achieve restartable behavior.

Page 2: Static Analysis and Compiler Design for Idempotent Processing

Static Analysis and Compiler Design for Idempotent Processing

Marc de Kruijf, Karthikeyan Sankaralingam, Somesh Jha

PLDI 2012, Beijing

Page 3: Static Analysis and Compiler Design for Idempotent Processing

Example

source code:

    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

Page 4: Static Analysis and Compiler Design for Idempotent Processing

Example

assembly code:

            R2 = load [R1]
            R3 = 0
    LOOP:   R4 = load [R0 + R2]
            R3 = add R3, R4
            R2 = sub R2, 1
            bnez R2, LOOP
    EXIT:   return R3

load? possible failures: faults, exceptions, mis-speculations

Page 5: Static Analysis and Compiler Design for Idempotent Processing

Example

assembly code:

            R2 = load [R1]
            R3 = 0
    LOOP:   R4 = load [R0 + R2]
            R3 = add R3, R4
            R2 = sub R2, 1
            bnez R2, LOOP
    EXIT:   return R3

BAD STUFF HAPPENS!

Page 6: Static Analysis and Compiler Design for Idempotent Processing

Example

assembly code:

            R2 = load [R1]
            R3 = 0
    LOOP:   R4 = load [R0 + R2]
            R3 = add R3, R4
            R2 = sub R2, 1
            bnez R2, LOOP
    EXIT:   return R3

R0 and R1 are unmodified: just re-execute!

convention: use checkpoints/buffers
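
The point of this slide (the region reads its inputs but never overwrites them, so recovery is simply re-execution) can be sketched in C. This is a minimal sketch, not the authors' recovery mechanism: `sum_with_fault`, `sum_recover`, and the `fail_at` parameter are hypothetical names introduced here to simulate a fault partway through the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: sums data[0..len), but "faults" when the loop reaches
 * index fail_at, losing its partial result. The inputs (data, len) are only
 * read, never written, which is what makes the region idempotent. */
static bool sum_with_fault(const int *data, size_t len, size_t fail_at, int *out) {
    assert(data != NULL || len == 0);
    assert(out != NULL);
    int x = 0;                          /* local state, rebuilt on every attempt */
    for (size_t i = 0; i < len; ++i) {
        if (i == fail_at)
            return false;               /* simulated fault: partial x is discarded */
        x += data[i];
    }
    *out = x;
    return true;
}

/* Recovery by re-execution: no checkpoint or buffer is consulted, because
 * the first attempt did not overwrite anything it needs to read again. */
static int sum_recover(const int *data, size_t len, size_t fail_at) {
    int result = 0;
    if (!sum_with_fault(data, len, fail_at, &result)) {
        /* Retry from the region entry; (size_t)-1 means "no fault". */
        (void)sum_with_fault(data, len, (size_t)-1, &result);
    }
    return result;
}
```

Calling `sum_recover(data, 4, 2)` on `{1, 2, 3, 4}` abandons the first attempt at index 2 and still returns the same sum as a fault-free run.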

Page 7: Static Analysis and Compiler Design for Idempotent Processing

It’s Idempotent!

idempoh… what…?

    int sum(int *data, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += data[i];
        return x;
    }

Page 8: Static Analysis and Compiler Design for Idempotent Processing

Idempotent Processing

idempotent regions ALL THE TIME

Page 9: Static Analysis and Compiler Design for Idempotent Processing

Idempotent Processing
executive summary

how? idempotence is inhibited by clobber antidependences

cut semantic clobber antidependences

[Figure: normal compiler vs. custom compiler]

low runtime overhead (typically 2-12%)

Page 10: Static Analysis and Compiler Design for Idempotent Processing

Presentation Overview

❶ Idempotence

❷ Algorithm

❸ Results

Page 11: Static Analysis and Compiler Design for Idempotent Processing

What is Idempotence?

[Figure] is this idempotent? Yes

Page 12: Static Analysis and Compiler Design for Idempotent Processing

What is Idempotence?

[Figure] how about this? No

Page 13: Static Analysis and Compiler Design for Idempotent Processing

What is Idempotence?

[Figure] maybe this? Yes

Page 14: Static Analysis and Compiler Design for Idempotent Processing

What is Idempotence?

operation sequence     dependence chain     idempotent?
write                  [diagram]            Yes
read, write            [diagram]            No
write, read, write     [diagram]            Yes

it’s all about the data dependences

Page 15: Static Analysis and Compiler Design for Idempotent Processing

What is Idempotence?

operation sequence     dependence chain     idempotent?
write, read            [diagram]            Yes
read, write            [diagram]            No
write, read, write     [diagram]            Yes

it’s all about the data dependences

CLOBBER ANTIDEPENDENCE: antidependence with an exposed read
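
The three rows of the table can be checked by hand with a small C sketch. It assumes a single memory cell and hypothetical region functions (`region_write_read`, `region_read_write`, `region_write_read_write`); `is_idempotent` compares one execution against two from a single starting value, so it is a test for that value, not a proof.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* write, read: the read sees the region's own write, so it is not exposed. */
static void region_write_read(int *m) { *m = 5; int t = *m; (void)t; }

/* read, write: the read is exposed (it sees the value from before the
 * region) and is then overwritten: a clobber antidependence. */
static void region_read_write(int *m) { int t = *m; *m = t + 1; }

/* write, read, write: there is an antidependence (read, then write), but the
 * read is covered by the first write, so nothing from outside is clobbered. */
static void region_write_read_write(int *m) { *m = 5; int t = *m; *m = t + 1; }

/* Runs the region once and twice from the same starting value and reports
 * whether the final memory state is the same. */
static bool is_idempotent(void (*region)(int *), int initial) {
    assert(region != NULL);
    int once = initial, twice = initial;
    region(&once);
    region(&twice);
    region(&twice);
    return once == twice;
}
```

Only `region_read_write` fails the check: its second execution reads the value its first execution wrote.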

Page 16: Static Analysis and Compiler Design for Idempotent Processing

Semantic Idempotence
two types of program state

(1) local (“pseudoregister”) state:
    can be renamed to remove clobber antidependences*
    does not semantically constrain idempotence

(2) non-local (“memory”) state:
    cannot “rename” to avoid clobber antidependences
    semantically constrains idempotence

semantic idempotence = no non-local clobber antidep.

preserve local state by renaming and careful allocation
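
The difference between the two kinds of state can be illustrated with a short C sketch. The `Regs` struct stands in for two pseudoregisters and is purely hypothetical, as are the function names; a real compiler performs the renaming on its intermediate representation, not on a struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for two pseudoregisters. */
typedef struct { int r0, r1; } Regs;

/* Local state, not renamed: r0 is read and then overwritten, so
 * re-executing the region would increment it a second time. */
static void increment_in_place(Regs *r) {
    assert(r != NULL);
    r->r0 = r->r0 + 1;
}

/* Local state, renamed: the result goes to a fresh pseudoregister, the
 * input r0 survives, and re-execution recomputes the same r1. */
static void increment_renamed(Regs *r) {
    assert(r != NULL);
    r->r1 = r->r0 + 1;
}

/* Non-local state: the program fixes the location (the caller observes
 * *counter), so the write cannot be redirected to a fresh name. */
static void increment_memory(int *counter) {
    assert(counter != NULL);
    *counter = *counter + 1;
}
```

Running each function twice from the same start shows the effect: the renamed version ends in the same state as a single run, while the in-place and memory versions do not.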

Page 17: Static Analysis and Compiler Design for Idempotent Processing

Presentation Overview

❶ Idempotence

❷ Algorithm

❸ Results

Page 18: Static Analysis and Compiler Design for Idempotent Processing

Region Construction Algorithm

steps one, two, and three

Step 1: transform function
    remove artificial dependences, remove non-clobbers

Step 2: construct regions around antidependences
    cut all non-local antidependences in the CFG

Step 3: refine for correctness & performance
    account for loops, optimize for dynamic behavior

Page 19: Static Analysis and Compiler Design for Idempotent Processing

Step 1: Transform
not one, but two transformations

Transformation 1: SSA for pseudoregister antidependences

But we still have a problem:

[Diagram: clobber antidependences, region boundaries, and region identification, linked in a cycle by “depends on” arrows]

Page 20: Static Analysis and Compiler Design for Idempotent Processing

Step 1: Transform
not one, but two transformations

Transformation 1: SSA for pseudoregister antidependences

Transformation 2: Scalar replacement of memory variables

    before:         after:
    [x] = a;        [x] = a;
    b = [x];        b = a;
    [x] = c;        [x] = c;

non-clobber antidependences… GONE!
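
The before/after pair on this slide can be written as two C functions to confirm that they compute the same result. This is a sketch under the assumption that `*x` is ordinary memory (not `volatile`, not written concurrently); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the load of [x] sits between two stores to [x]. The load followed
 * by the second store is an antidependence, but not a clobber, because the
 * load reads a value written inside the same region. */
static int store_load_store(int *x, int a, int c) {
    assert(x != NULL);
    *x = a;
    int b = *x;     /* load from memory */
    *x = c;
    return b;
}

/* After scalar replacement: the load is replaced by the value that was just
 * stored, so no read of [x] remains and the antidependence disappears. */
static int store_copy_store(int *x, int a, int c) {
    assert(x != NULL);
    *x = a;
    int b = a;      /* forwarded value, no memory read */
    *x = c;
    return b;
}
```

Both versions return `a` and leave `c` in memory; only the second is free of the memory read.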

Page 21: Static Analysis and Compiler Design for Idempotent Processing

Step 1: Transform
not one, but two transformations

Transformation 1: SSA for pseudoregister antidependences

Transformation 2: Scalar replacement of memory variables

[Diagram: clobber antidependences, region boundaries, and region identification, linked in a cycle by “depends on” arrows]

Page 22: Static Analysis and Compiler Design for Idempotent Processing

Region Construction Algorithm

steps one, two, and three

Step 1: transform function
    remove artificial dependences, remove non-clobbers

Step 2: construct regions around antidependences
    cut all non-local antidependences in the CFG

Step 3: refine for correctness & performance
    account for loops, optimize for dynamic behavior

Page 23: Static Analysis and Compiler Design for Idempotent Processing

Step 2: Cut the CFG
cut, cut, cut…

construct regions by “cutting” non-local antidependences

[Figure: antidependence]

Page 24: Static Analysis and Compiler Design for Idempotent Processing

Step 2: Cut the CFG
but where to cut…?

sources of overhead

[Figure: rough sketch of overhead vs. region size]

optimal region size?

larger is (generally) better: large regions amortize the cost of input preservation

Page 25: Static Analysis and Compiler Design for Idempotent Processing

Step 2: Cut the CFG
but where to cut…?

goal: the minimum set of cuts that cuts all antidependence paths

intuition: minimum cuts → fewest regions → large regions

approach: a series of reductions:
    minimum vertex multi-cut (NP-complete)
    minimum hitting set among paths
    minimum hitting set among “dominating nodes”

details in paper…
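
To make the hitting-set formulation concrete, here is a sketch of a plain greedy heuristic in C. It is not the algorithm from the paper, whose details the slide defers; it only illustrates the problem shape. The encoding is an assumption made for this sketch: each antidependence path is a bitmask over at most 32 candidate cut nodes, and `greedy_hitting_set` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_NODES = 32, MAX_PATHS = 64 };

/* paths[i] is a bitmask: bit v set means that a cut at node v cuts path i.
 * Returns a bitmask of chosen cut nodes that together hit every non-empty
 * path. Greedy choice: the result is a valid set of cuts, but it is not
 * guaranteed to be a minimum one. */
static uint32_t greedy_hitting_set(const uint32_t *paths, size_t n_paths) {
    assert(n_paths <= MAX_PATHS);
    assert(paths != NULL || n_paths == 0);
    bool hit[MAX_PATHS] = { false };
    uint32_t cuts = 0;
    for (;;) {
        int best_node = -1;
        size_t best_count = 0;
        /* Pick the node that lies on the most paths not yet cut. */
        for (int v = 0; v < MAX_NODES; ++v) {
            size_t count = 0;
            for (size_t i = 0; i < n_paths; ++i)
                if (!hit[i] && ((paths[i] >> v) & 1u))
                    ++count;
            if (count > best_count) {
                best_count = count;
                best_node = v;
            }
        }
        if (best_node < 0)
            break;                      /* no remaining path can be cut */
        cuts |= UINT32_C(1) << best_node;
        for (size_t i = 0; i < n_paths; ++i)
            if ((paths[i] >> best_node) & 1u)
                hit[i] = true;
    }
    return cuts;
}
```

With paths `{0x6, 0x4, 0xC}` every path passes through node 2, so a single cut suffices and the function returns `0x4`; fewer cuts mean fewer, larger regions.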

Page 26: Static Analysis and Compiler Design for Idempotent Processing

Region Construction Algorithm

steps one, two, and three

Step 1: transform function
    remove artificial dependences, remove non-clobbers

Step 2: construct regions around antidependences
    cut all non-local antidependences in the CFG

Step 3: refine for correctness & performance
    account for loops, optimize for dynamic behavior

Page 27: Static Analysis and Compiler Design for Idempotent Processing

Step 3: Loop-Related Refinements
loops affect correctness and performance

correctness: Not all local antidependences removed by SSA…
    loop-carried antidependences may clobber
    depends on boundary placement; handled as a post-pass

performance: Loops tend to execute multiple times…
    to maximize region size, place cuts outside of loop
    algorithm modified to prefer cuts outside of loops

details in paper…

Page 28: Static Analysis and Compiler Design for Idempotent Processing

Presentation Overview

❶ Idempotence

❷ Algorithm

❸ Results

Page 29: Static Analysis and Compiler Design for Idempotent Processing

Results

compiler implementation
    – Paper compiler implementation in LLVM v2.9
    – LLVM v3.1 source code release in July timeframe

experimental data
    (1) runtime overhead
    (2) region size
    (3) use case

Page 30: Static Analysis and Compiler Design for Idempotent Processing

Runtime Overhead
as a percentage

[Bar chart: percent overhead in instruction count and execution time for the benchmark suites SPEC INT, SPEC FP, PARSEC, and OVERALL (gmean); y-axis 0 to 12; data labels: 7.6, 7.7]

Page 31: Static Analysis and Compiler Design for Idempotent Processing

Region Size
average number of instructions

[Bar chart, log scale 1 to 100: compiler-generated dynamic region size for the benchmark suites SPEC INT, SPEC FP, PARSEC, and OVERALL (gmean); data label: 28]

Page 32: Static Analysis and Compiler Design for Idempotent Processing

Use Case
hardware fault recovery

[Bar chart: percent overhead of idempotence, checkpoint/log, and instruction TMR for the benchmark suites SPEC INT, SPEC FP, PARSEC, and OVERALL (gmean); y-axis 0 to 35; data labels: 8.2, 24.0, 30.5]

Page 33: Static Analysis and Compiler Design for Idempotent Processing

Presentation Overview

❶ Idempotence

❷ Algorithm

❸ Results

Page 34: Static Analysis and Compiler Design for Idempotent Processing

Summary & Conclusions
summary

idempotent processing
    – large (low-overhead) idempotent regions all the time

static analysis, compiler algorithm
    – (a) remove artifacts (b) partition (c) compile

low overhead
    – 2-12% runtime overhead typical

Page 35: Static Analysis and Compiler Design for Idempotent Processing

Summary & Conclusions
conclusions

several applications already demonstrated
    – CPU hardware simplification (MICRO ’11)
    – GPU exceptions and speculation (ISCA ’12)
    – hardware fault recovery (this paper)

future work
    – more applications, hybrid techniques
    – optimal region size?
    – enabling even larger region sizes

Page 36: Static Analysis and Compiler Design for Idempotent Processing

Back-up Slides

Page 37: Static Analysis and Compiler Design for Idempotent Processing

Error recovery
dealing with side-effects

exceptions
    – generally no side-effects beyond out-of-order-ness
    – fairly easy to handle

mis-speculation (e.g. branch misprediction)
    – compiler handles for pseudoregister state
    – for non-local memory, store buffer assumed

arbitrary failure (e.g. hardware fault)
    – ECC and other verification assumed
    – variety of existing techniques; details in paper

Page 38: Static Analysis and Compiler Design for Idempotent Processing

Optimal Region Size?
it depends… (rough sketch not to scale)

[Figure: overhead vs. region size, with curves for detection latency, register pressure, and re-execution time]

Page 39: Static Analysis and Compiler Design for Idempotent Processing

Prior Work
relating to idempotence

Technique                   Year   Domain
Sentinel Scheduling         1992   Speculative memory re-ordering
Fast Mutual Exclusion       1992   Uniprocessor mutual exclusion
Multi-Instruction Retry     1995   Branch and hardware fault recovery
Atomic Heap Transactions    1999   Atomic memory allocation
Reference Idempotency       2006   Reducing speculative storage
Restart Markers             2006   Virtual memory in vector machines
Data-Triggered Threads      2011   Data-triggered multi-threading
Idempotent Processors       2011   Hardware simplification for exceptions
Encore                      2011   Hardware fault recovery
iGPU                        2012   GPU exception/speculation support

Page 40: Static Analysis and Compiler Design for Idempotent Processing

Detailed Runtime Overhead
as a percentage

[Bar chart: percent overhead in instruction count and execution time; suites (gmean): SPEC INT, SPEC FP, PARSEC, OVERALL; outliers: gcc, namd; y-axis 0 to 30; data labels: 7.6, 7.7]

outliers: non-idempotent inner loops + high register pressure

Page 41: Static Analysis and Compiler Design for Idempotent Processing

Detailed Region Size
average number of instructions

[Bar chart, log scale 10 to 10000: compiler, ideal, and ideal w/o outliers; suites (gmean): SPEC INT, SPEC FP, PARSEC, OVERALL; outliers: hmmer, lbm; data labels as extracted: 28 / 11645, >1,000,000]

limited aliasing information