Taming Hardware Event Samples for FDO Compilation. Dehao Chen (Tsinghua University); Neil Vachharajani, Robert Hundt, Shih-wei Liao (Google); Vinodha Ramasamy; Paul Yuan (Peking University); Wenguang Chen, Weimin Zheng (Tsinghua University)

Dec 25, 2015

Page 1: Taming Hardware Event Samples for FDO Compilation

Dehao Chen (Tsinghua University)
Neil Vachharajani, Robert Hundt, Shih-wei Liao (Google)
Vinodha Ramasamy
Paul Yuan (Peking University)
Wenguang Chen, Weimin Zheng (Tsinghua University)

Page 2: Why FDO?

• Feedback-Directed Optimization (FDO)
• Performance improvements
  – 5% speedup on SPEC2000 INT
  – Small? Huge for millions of computers
• Not widely adopted

Page 3: Instrumentation-based FDO

1. gcc -fprofile-generate … → instrumented binary
2. Run the instrumented binary on a representative workload → .gcda files
3. gcc -fprofile-use … → FDO-optimized binary

Drawbacks:
1. Have to build twice
2. Instrumentation run is slow
3. Need representative input
4. Perturbs execution

Page 4: Sample-based FDO

• gcc -O2 -g … → normal binary
• The normal binary runs in its running environment on a real-world workload
• Profiling tools (OProfile, pfmon) → profile data
• gcc -fsample-profile … → FDO-optimized binary

1. A previous deployment/test binary can be used to collect the profile
2. Profiling input: real traffic
3. Profiling does not perturb code

Page 5: PMU Sampling

• Performance monitoring unit (PMU)
  o Captures events generated by the CPU: cache miss, instruction retired, clock tick
  o Configurable counters increment on selected events
  o Optional interrupt on counter overflow
• Sampling
  o On interrupt, capture the instruction pointer (IP)
  o Can also sample other state: registers, other PMU counters
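The overflow-driven sampling described above can be modeled in a few lines of software. This is only a sketch under stated assumptions: a real PMU is programmed through hardware registers and kernel interfaces, whereas here the event stream is a plain Python list and the names `sample_ips`, `trace`, and `period` are hypothetical.

```python
from collections import Counter


def sample_ips(retired_ips, period):
    """Model a PMU counter that overflows every `period` events.

    retired_ips: iterable of instruction addresses, one per retired instruction.
    Returns a Counter mapping each captured IP to its number of samples.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    samples = Counter()
    counter = 0
    for ip in retired_ips:
        counter += 1              # counter increments on the selected event
        if counter == period:     # overflow raises an interrupt
            samples[ip] += 1      # the handler records the instruction pointer
            counter = 0           # the counter is re-armed for the next period
    return samples


# Toy trace (assumption): a 4-instruction loop body executed 1000 times.
trace = [0x1000, 0x1004, 0x1008, 0x100C] * 1000
profile = sample_ips(trace, period=7)
```

In this idealized model the sample lands exactly on the instruction that caused the overflow, so the four addresses receive nearly equal counts. Real hardware does not behave this precisely, which is the accuracy problem the later slides address.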

Page 6: Sampling Instructions Retired

Instruction samples (count, address):
1499  0x76a1f4
1517  0x76a48a
1489  0x76aea1
1527  0x76c09d
1     0x77e3cf
733   0x77ee7e
1242  0x78109d

Symbolized samples (output of the symbolizer):
0x76a1f4 : 1499  foo.c:1183
0x76a48a : 1517  foo.c:992
0x76aea1 : 1489  foo.c:906
0x76c09d : 1527  foo.h:1821
0x77e3cf : 1     bar.c:3481
0x77ee7e : 733   bar.c:4759
0x78109d : 1242  bar.c:4762

Basic-block counts annotated on the CFG by GCC:
foo.c:853 : 0
foo.c:906, foo.c:992 : 3006
foo.c:1183 : 1499
foo.c:1325 : 0
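The step from symbolized samples to basic-block counts can be sketched as follows. The sample data comes from the slide. The block names `B1` to `B4`, the `blocks` mapping, and the choice to sum the per-line counts are assumptions; summing was chosen because 1489 + 1517 = 3006 matches the count the slide shows for the block covering foo.c:906 and foo.c:992.

```python
# (address, sample count, source location), as listed on the slide.
symbolized = [
    (0x76A1F4, 1499, "foo.c:1183"),
    (0x76A48A, 1517, "foo.c:992"),
    (0x76AEA1, 1489, "foo.c:906"),
    (0x76C09D, 1527, "foo.h:1821"),
    (0x77E3CF, 1, "bar.c:3481"),
    (0x77EE7E, 733, "bar.c:4759"),
    (0x78109D, 1242, "bar.c:4762"),
]

# Hypothetical mapping from basic block to the source lines it covers.
blocks = {
    "B1": ["foo.c:853"],
    "B2": ["foo.c:906", "foo.c:992"],
    "B3": ["foo.c:1183"],
    "B4": ["foo.c:1325"],
}


def block_counts(samples, blocks):
    """Sum the samples of every source line that belongs to each block."""
    per_line = {}
    for _addr, count, loc in samples:
        per_line[loc] = per_line.get(loc, 0) + count
    return {
        name: sum(per_line.get(line, 0) for line in lines)
        for name, lines in blocks.items()
    }


counts = block_counts(symbolized, blocks)
```

Blocks whose lines never received a sample (`B1` and `B4` here) come out as 0, which is the inconsistency the flow-based correction has to repair.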

Page 7: Sampling Instructions Retired

Same instruction samples and symbolized samples as on page 6. After the counts are made flow consistent, the CFG is annotated as follows:

Basic-block counts:
foo.c:853 : 4505
foo.c:906, foo.c:992 : 3006
foo.c:1183 : 1499
foo.c:1325 : 4505

Edge counts: 3006, 3006, 1499, 1499

[Levin et al. Complementing Missing and Inaccurate Profiling using a Minimum Cost Circulation Algorithm. HiPEAC'08]
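A small check makes the notion of flow consistency concrete: for every block, the count should equal the sum of its incoming edge counts and the sum of its outgoing edge counts. The sketch below is not the minimum cost circulation algorithm itself; it only tests the property. The diamond-shaped CFG (`B1` branching to `B2` and `B3`, both joining at `B4`) and all names are assumptions read off the slide layout.

```python
def flow_violations(block_count, edge_count, entry, exit_):
    """Return (block, direction, edge_sum, count) for every inconsistent block."""
    violations = []
    for block, count in block_count.items():
        inflow = sum(c for (_src, dst), c in edge_count.items() if dst == block)
        outflow = sum(c for (src, _dst), c in edge_count.items() if src == block)
        if block != entry and inflow != count:
            violations.append((block, "in", inflow, count))
        if block != exit_ and outflow != count:
            violations.append((block, "out", outflow, count))
    return violations


# Assumed diamond CFG with the edge counts shown on the slide.
edges = {
    ("B1", "B2"): 3006,
    ("B1", "B3"): 1499,
    ("B2", "B4"): 3006,
    ("B3", "B4"): 1499,
}
raw = {"B1": 0, "B2": 3006, "B3": 1499, "B4": 0}            # sampled counts
adjusted = {"B1": 4505, "B2": 3006, "B3": 1499, "B4": 4505}  # corrected counts

raw_problems = flow_violations(raw, edges, "B1", "B4")
adjusted_problems = flow_violations(adjusted, edges, "B1", "B4")
```

With the raw counts, the entry and exit blocks are flagged because 3006 + 1499 = 4505 flows through them while they were sampled 0 times. With the adjusted counts nothing is flagged.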

Page 8: Accuracy Challenge

Page 9: Accuracy Challenge

Page 10: Accuracy Challenge

Page 11: Improving the Accuracy

• Flow consistency
  – Use the minimum cost circulation (MCC) algorithm
  – Control flow → network flow
• Predict aggregation/shadow effect
  – Sample multiple events
  – Use the prediction to adjust the frequency

Page 12

Sampled counts: Branch: 7954, Taken: 7922, Join: 1049

[Figure: the CFG converted to a flow network with Source and Sink nodes, shown before and after adjustment. Values before: 7954, 7922, 0, 1049, 6905, 32, 6873. Values after: 7922, 7922, 1049, 6905, 32, 6873, with 6905 = 6873 + 32.]

[Levin et al. Complementing Missing and Inaccurate Profiling using a Minimum Cost Circulation Algorithm. HiPEAC'08]

Page 13: Use Prediction to Adjust Profile

• Cost function in MCC
  – Each basic block has two edges attached
    • Forward (flow represents increasing the count)
    • Backward (flow represents decreasing the count)
  – Cost function for each edge
    • A larger cost discourages changing the count in that direction
• Using the prediction
  – Over-sampled: high cost on forward edge
  – Under-sampled: high cost on backward edge

[Figure: nodes BB1’ and BB1’’ connected by a forward edge and a backward edge]
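The way a prediction biases the cost function can be sketched as below. This only evaluates the cost of a candidate adjustment; it is not an MCC solver. The cost values (`base`, `penalty`), the prediction labels, and the example numbers are illustrative assumptions, not figures from the talk.

```python
def edge_costs(prediction, base=1.0, penalty=10.0):
    """Return per-unit (forward, backward) costs for one basic block.

    Forward cost is paid per unit of count increase, backward cost per
    unit of count decrease.
    """
    if prediction == "over":    # over-sampled: discourage increasing further
        return penalty * base, base
    if prediction == "under":   # under-sampled: discourage decreasing further
        return base, penalty * base
    return base, base           # no prediction: both directions cost the same


def adjustment_cost(sampled, adjusted, prediction):
    """Total cost of moving every block from its sampled to its adjusted count."""
    total = 0.0
    for block, count in sampled.items():
        forward, backward = edge_costs(prediction.get(block, "none"))
        delta = adjusted[block] - count
        total += forward * delta if delta > 0 else backward * -delta
    return total


# Hypothetical predictions: B2 looks over-sampled, B3 looks under-sampled.
sampled = {"B2": 3006, "B3": 1499}
prediction = {"B2": "over", "B3": "under"}

with_prediction = adjustment_cost(sampled, {"B2": 2906, "B3": 1599}, prediction)
against_prediction = adjustment_cost(sampled, {"B2": 3106, "B3": 1399}, prediction)
```

Moving each count by 100 in the direction the prediction suggests costs 200.0 here, while moving against it costs 2000.0, so a cost-minimizing solver would favor the first adjustment.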

Page 14: Predict Aggregation/Shadow

• Model aggregation effects
  – Long-latency instructions
  – Sample major long-latency events
    • Branch mispredict, cache/DTLB miss, etc.
    • Estimate the stalls these events will cause
    • Skid has little influence on long-latency events

Page 15: Predict Aggregation/Shadow

• Model shadow effects
  – CPU_CORE_CYCLES event
    • Time-based sampling
    • Skid will only shift the profile
    • CPU_CYCLE - INST_RETIRED → stalled cycles (with skid)
    • Each stalled cycle will set a shadow area
• Aggregation and shadow co-exist
  – Heuristic to check which one dominates
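The subtraction on this slide can be illustrated with a per-address difference of two profiles. This is a rough sketch: it assumes both profiles have already been scaled to the same sampling period so that their counts are comparable, and the function name and sample values are hypothetical.

```python
def stalled_cycles(cycle_samples, retired_samples):
    """Per-address difference: cycle samples minus instructions-retired samples."""
    addresses = sorted(set(cycle_samples) | set(retired_samples))
    return {
        addr: cycle_samples.get(addr, 0) - retired_samples.get(addr, 0)
        for addr in addresses
    }


# Hypothetical, already-normalized sample counts per instruction address.
cycles = {0x10: 120, 0x14: 900, 0x18: 100}
retired = {0x10: 110, 0x14: 100, 0x18: 105}

stalls = stalled_cycles(cycles, retired)
```

An address with a large positive difference, such as 0x14 in this toy data, is a candidate stall site. Because of skid, the stall may actually belong to a nearby instruction rather than to the address that received the samples.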

Page 16: Evaluation: Accuracy

[Figure: bar chart with a y-axis from 50% to 100%, comparing Static Estimation, MCC, Our Prediction, and Perfect Prediction]

Page 17: Evaluation: Performance

[Figure: bar chart with a y-axis from -5% to 15%, comparing Sample FDO and Instr FDO on 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, and the geometric mean]

Page 18: Conclusion and Future Work

• Sampling-based FDO is promising
• The artifacts in PMU data can be compensated for with appropriate understanding and heuristics, which improves the accuracy by 6%
• Sample-based value profiling
• Future: Last Branch Register, for a more precise edge profile at the binary level
• Sample-based LIPO

Page 19: Questions?