Oct 2005 1 Using Platform-Specific Performance Counters for Dynamic Compilation Florian Schneider and Thomas Gross ETH Zurich

Dec 31, 2015


Shawn Charles
Transcript

Oct 2005 1

Using Platform-Specific Performance Counters for

Dynamic Compilation

Florian Schneider and Thomas Gross

ETH Zurich


Introduction & Motivation

• Dynamic compilers common execution platform for OO languages (Java, C#)

• Properties of OO programs difficult to analyze at compile-time

• JIT compiler can immediately use information obtained at run-time


Introduction & Motivation

Types of information:

1. Profiles: e.g. execution frequency of methods / basic blocks

2. Hardware-specific properties: cache misses, TLB misses, branch prediction failures


Outline

1. Introduction

2. Requirements

3. Related work

4. Implementation

5. Results

6. Conclusions


Requirements

• Infrastructure flexible enough to measure different execution metrics
  – Hide machine-specific details from VM
  – Keep changes to the VM/compiler minimal

• Runtime overhead of collecting information from the CPU low

• Information must be precise to be useful for online optimization


Related work

• Profile guided optimization
  – Code positioning [Pettis PLDI90]

• Hardware performance monitors
  – Relating HPM data to basic blocks [Ammons PLDI97]
  – “Vertical profiling” [Hauswirth OOPSLA 2004]

• Dynamic optimization
  – Mississippi delta [Adl-Tabatabai PLDI2004]
  – Object reordering [Huang OOPSLA 2004]

• Our work:
  – No instrumentation
  – Use profile data + hardware info
  – Targets fully automatic dynamic optimization


Hardware performance monitors

• Sampling-based counting
  – CPU reports state every n events
  – Precision platform-dependent (pipelines, out-of-order execution)

• Sampling provides method, basic block, or instruction-level information
  – Newer CPUs support precise sampling (e.g. P4, Itanium)


Hardware performance monitors

• Way to localize performance bottlenecks
  – Sampling interval determines how fine-grained the information is

• Smaller sampling interval → more data
  – Trade-off: precision vs. runtime overhead
  – Need enough samples for a representative picture of the program behavior
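
To make the trade-off concrete, here is a minimal sketch in Python. It is not the authors' implementation: the event stream, the instruction names, and the two intervals are invented for illustration. It takes every n-th event from a synthetic stream of miss events and counts how many samples each interval yields.

```python
from collections import Counter
from itertools import cycle, islice

def sample_every_n(events, n):
    """Histogram of the events found at every n-th position of the stream."""
    return Counter(islice(events, n - 1, None, n))

# Synthetic stream of 100,000 miss events (made up for illustration):
# load_A causes 70% of the events, load_B 20%, load_C 10%.
pattern = ["load_A"] * 7 + ["load_B"] * 2 + ["load_C"]
events = list(islice(cycle(pattern), 100_000))

# Intervals chosen coprime to the pattern length so that sampling does not
# lock onto a single position of the repeating pattern.
coarse = sample_every_n(events, 10_007)
fine = sample_every_n(events, 1_009)

print(sum(coarse.values()), dict(coarse))
print(sum(fine.values()), dict(fine))
```

On this synthetic stream the coarse interval yields 9 samples and never observes load_C, while the finer interval yields 99 samples and sees all three instructions, at roughly ten times the sampling cost.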


Implementation

Main parts

1. Kernel module: low level access to hardware, per process counting

2. User-space library: hides kernel & device driver details from VM

3. Java VM thread: collects samples periodically, maps samples to Java code
   – Implemented on top of Jikes RVM


System overview


Implementation

• Supported events:
  – L1 and L2 cache misses
  – DTLB misses
  – Branch prediction

• Parameters of the monitoring module:
  – Buffer size (fixed)
  – Polling interval (fixed)
  – Sampling interval (adaptive)

• Keep runtime overhead constant by changing interval during run-time automatically
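
The slides do not say how the interval is adapted. One possible rule, sketched below in Python under that assumption, is proportional: since the number of samples in a polling period is roughly events / interval, scaling the interval by samples_seen / target_samples aims for a fixed number of samples per period, and therefore a roughly fixed sampling cost. The function name, the bounds `lo` and `hi`, and the halving rule for empty periods are all hypothetical.

```python
def next_interval(interval, samples_seen, target_samples, lo=1_000, hi=1_000_000):
    """Pick the sampling interval for the next polling period.

    samples_seen is roughly events / interval, so the interval that would
    have produced target_samples is interval * samples_seen / target_samples.
    The result is clamped to [lo, hi] (hypothetical bounds).
    """
    if samples_seen == 0:
        # No events observed: sample more often, but stay above the lower bound.
        return max(lo, interval // 2)
    scaled = interval * samples_seen // target_samples
    return min(hi, max(lo, scaled))

# Twice as many samples as wanted: double the interval.
print(next_interval(10_000, samples_seen=200, target_samples=100))
# Half as many samples as wanted: halve the interval.
print(next_interval(10_000, samples_seen=50, target_samples=100))
```

A production controller would probably smooth the estimate over several periods to avoid oscillation; this sketch reacts to a single period only.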


From raw data to Java

• Determine method + bytecode instr
  – Build sorted method table
  – Map offset to bytecode

    0x080485e1: mov 0x4(%esi),%esi
    0x080485e4: mov $0x4,%edi
    0x080485e9: mov (%esi,%edi,4),%esi
    0x080485ec: mov %ebx,0x4(%esi)
    0x080485ef: mov $0x4,%ebx
    0x080485f4: push %ebx
    0x080485f5: mov $0x0,%ebx
    0x080485fa: push %ebx
    0x080485fb: mov 0x8(%ebp),%ebx
    0x080485fe: push %ebx
    0x080485ff: mov (%ebx),%ebx
    0x08048601: call *0x4(%ebx)
    0x08048604: add $0xc,%esp
    0x08048607: mov 0x8(%ebp),%ebx
    0x0804860a: mov 0x4(%ebx),%ebx

Bytecode labels attached to the listing: GETFIELD, ARRAYLOAD, INVOKEVIRTUAL
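
The two lookup steps named above (sorted method table, then offset to bytecode) can be sketched with binary search. This is an illustrative Python sketch, not Jikes RVM code: the method names, address ranges, and offset-to-bytecode entries are hypothetical and do not claim to describe the listing on this slide.

```python
from bisect import bisect_right

# Hypothetical method table, sorted by start address: (start, end, name).
METHODS = [
    (0x08048500, 0x080485E0, "A.foo"),
    (0x080485E0, 0x08048620, "B.bar"),
    (0x08048700, 0x08048780, "C.baz"),
]
STARTS = [start for start, _, _ in METHODS]

# Hypothetical per-method map from machine-code offset to bytecode, sorted by
# offset; each entry marks where the code for a bytecode begins.
BYTECODE_MAP = {
    "B.bar": [(0x01, "GETFIELD"), (0x04, "ARRAYLOAD"), (0x0C, "INVOKEVIRTUAL")],
}

def lookup(pc):
    """Map a sampled PC to (method name, bytecode), or None if unknown."""
    i = bisect_right(STARTS, pc) - 1      # last method starting at or before pc
    if i < 0:
        return None
    start, end, name = METHODS[i]
    if pc >= end:
        return None                       # PC falls in a gap, e.g. native code
    entries = BYTECODE_MAP.get(name, [])
    offsets = [off for off, _ in entries]
    j = bisect_right(offsets, pc - start) - 1
    return name, (entries[j][1] if j >= 0 else None)

print(lookup(0x080485E9))   # inside B.bar, offset 9
print(lookup(0x08048650))   # between two methods
```

Each lookup costs two binary searches, so it stays logarithmic in the number of compiled methods and in the number of bytecodes per method.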


From raw data to Java

• Sample gives PC + register contents

• PC → machine code → compiled Java code → bytecode instruction

• For data address: use registers + machine code to calculate target address:
  – GETFIELD → indirect load
    mov 12(eax), eax // 12 = offset of field
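
As a sketch of the address calculation, the Python function below parses an AT&T-style memory operand of the form `disp(base,index,scale)` and evaluates base + index × scale + disp from sampled register values. The parser and its name are assumptions made for illustration; a real implementation would decode the machine code rather than text. It also assumes the sampled register values are those from before the load executes, which matters for `mov 12(eax), eax` because the load overwrites its own base register.

```python
import re

# disp(base) or disp(base,index,scale); the displacement and the "%" prefix are optional.
OPERAND = re.compile(
    r"^(?P<disp>-?(?:0x[0-9a-fA-F]+|\d+))?"
    r"\(%?(?P<base>\w+)(?:,%?(?P<index>\w+),(?P<scale>[1248]))?\)$"
)

def effective_address(operand, regs):
    """Compute base + index * scale + disp from sampled register contents."""
    m = OPERAND.match(operand.replace(" ", ""))
    if m is None:
        raise ValueError(f"not a memory operand: {operand!r}")
    disp = int(m.group("disp"), 0) if m.group("disp") else 0
    addr = regs[m.group("base")] + disp
    if m.group("index"):
        addr += regs[m.group("index")] * int(m.group("scale"))
    return addr

# GETFIELD as on the slide: the field lives 12 bytes past the object pointer in eax.
print(hex(effective_address("12(eax)", {"eax": 0x1000})))
# Array load with base, index, and element size 4.
print(hex(effective_address("(%esi,%edi,4)", {"esi": 0x2000, "edi": 4})))
```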


Engineering issues

• Lookup of PC to get method / BC instr must be efficient
  – Done in parallel with user program
  – Use binary search / hash table
  – Update at recompilation, GC

• Identify 100% of instructions (PCs):
  – Include samples from application, VM, and library code
  – Dealing with native parts


Infrastructure

• Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime platform

• Pentium 4, 3 GHz, 1G RAM, 1M L2 cache

• Measured data show:
  – Runtime overhead
  – Extraction of meaningful information


Runtime overhead

Program      Orig [sec], [score]    Sampling interval 10000    Sampling interval 1000
javac               7.18                    2.0%                       2.4%
raytrace            4.04                    2.4%                       2.0%
jess                2.93                    0.6%                       0.1%
jack                2.73                    3.5%                       2.7%
db                 10.49                    0.1%                       3.1%
compress            6.50                    0.9%                       1.5%
mpegaudio           6.54                    1.3%                       0.3%
jbb              6209.67                    2.4%                       4.6%
average                                     1.6%                       2.1%

• Experiment setup: monitor L2 cache misses
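
The "average" row appears to be the arithmetic mean of the eight per-benchmark overheads. A short Python check, with the values copied from the table above:

```python
# Overhead in percent per benchmark: (sampling interval 10000, sampling interval 1000).
overhead = {
    "javac": (2.0, 2.4),
    "raytrace": (2.4, 2.0),
    "jess": (0.6, 0.1),
    "jack": (3.5, 2.7),
    "db": (0.1, 3.1),
    "compress": (0.9, 1.5),
    "mpegaudio": (1.3, 0.3),
    "jbb": (2.4, 4.6),
}

avg_10000 = sum(a for a, _ in overhead.values()) / len(overhead)
avg_1000 = sum(b for _, b in overhead.values()) / len(overhead)
print(f"{avg_10000:.2f}% {avg_1000:.2f}%")
```

Both means agree with the reported 1.6% and 2.1% to within rounding.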


Runtime overhead: specJBB

[Plot: performance relative to "no sampling" (y-axis, 1 to 1.06) against sampling interval (x-axis, 0 to 40000)]

Total cost / sample: ~ 3000 cycles
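
A back-of-the-envelope model relates this per-sample cost to overhead. The Python sketch below uses the ~3000 cycles per sample from this slide and the 3 GHz clock from the Infrastructure slide; the event rate is a hypothetical input, and the model ignores fixed costs such as polling.

```python
CYCLES_PER_SAMPLE = 3_000        # approximate cost per sample, from the slide
CLOCK_HZ = 3_000_000_000         # 3 GHz Pentium 4, from the Infrastructure slide

def sampling_overhead(events_per_second, interval):
    """Fraction of CPU cycles spent taking samples (simplified model)."""
    samples_per_second = events_per_second / interval
    return samples_per_second * CYCLES_PER_SAMPLE / CLOCK_HZ

# Hypothetical workload producing 10 million monitored events per second.
for interval in (1_000, 10_000):
    print(interval, f"{sampling_overhead(10_000_000, interval):.2%}")
```

Under this model the overhead is inversely proportional to the sampling interval: a tenfold larger interval gives a tenfold smaller sampling cost.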


Measurements

• Measure which instructions produce most events (cache misses, branch mispred)
  – Potential for data locality and control flow optimizations

• Compare different spec-benchmarks
  – Find “hot spots”: instructions that produce 80% of all measured events
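
One way to read the "hot spot" definition is as the smallest set of instructions, ranked by sample count, that together account for 80% of all samples. The Python sketch below implements that reading; the function name and the sample counts are hypothetical.

```python
def hot_spots(samples_per_instruction, percent=80):
    """Instructions, in order of decreasing sample count, that together
    cover at least `percent` percent of all samples."""
    total = sum(samples_per_instruction.values())
    ranked = sorted(samples_per_instruction.items(),
                    key=lambda item: item[1], reverse=True)
    covered, chosen = 0, []
    for pc, count in ranked:
        if covered * 100 >= percent * total:   # integer comparison, no rounding
            break
        chosen.append(pc)
        covered += count
    return chosen

# Hypothetical sample counts keyed by instruction address.
counts = {0x10: 50, 0x14: 30, 0x18: 10, 0x1C: 6, 0x20: 4}
print([hex(pc) for pc in hot_spots(counts)])
```

In this example two of the five instructions already cover 80% of the samples, so an optimizer could restrict its attention to those two.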


L1/L2 Cache misses

[Plots: "db L1 misses" and "db L2 misses"; x-axis: top-100 memory load instructions; y-axis: # of samples]

db L1 misses: 80% quantile = 21 instructions (N=571)

db L2 misses: 80% quantile = 13 (N=295)


L1/L2 Cache misses

[Plots: "specJBB L1 misses" and "specJBB L2 misses"; x-axis: memory load instructions; y-axis: # of samples]

specJBB L1 misses: 80% quantile = 477 (N=8526)

specJBB L2 misses: 80% quantile = 76 (N=2361)


L1/L2 Cache misses

[Plots: "javac L1 misses" and "javac L2 misses"; x-axis: memory load instructions; y-axis: # of samples]

javac L1 misses: 80% quantile = 1296 (N=3172)

javac L2 misses: 80% quantile = 153 (N=672)


Branch prediction

[Plots: "specJBB branch mispredictions", "javac branch mispredictions", and "db branch mispredictions"; x-axis: branch instructions; y-axis: # of samples]

specJBB branch mispredictions: 80% quantile = 307 (N=4193)

javac branch mispredictions: 80% quantile = 1575 (N=7478)


Summary

80%-quantile in % of total    L1 misses    L2 misses    Branch pred.
specJBB                          5.6%         3.2%          7.3%
javac                           40.9%        22.7%         21.1%
db                               3.7%         4.4%          0.8%

• Distribution of events over the program differs significantly between benchmarks

• Challenge: Are data precise enough to guide optimizations in a dynamic compiler?
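
The percentages in the summary table appear to be the 80%-quantile instruction count divided by N, using the values annotated on the preceding plots. The Python check below recomputes them under that assumption; the pairing of each annotation with a metric is inferred from which ratio matches the table, and the plots give no annotation for db branch mispredictions, so that cell is not checked.

```python
# (instructions in the 80% quantile, N, percentage reported in the summary table)
reported = {
    ("specJBB", "L1 misses"): (477, 8526, 5.6),
    ("specJBB", "L2 misses"): (76, 2361, 3.2),
    ("specJBB", "Branch pred."): (307, 4193, 7.3),
    ("javac", "L1 misses"): (1296, 3172, 40.9),
    ("javac", "L2 misses"): (153, 672, 22.7),
    ("javac", "Branch pred."): (1575, 7478, 21.1),
    ("db", "L1 misses"): (21, 571, 3.7),
    ("db", "L2 misses"): (13, 295, 4.4),
}

# Recompute each percentage from the plot annotations.
computed = {key: 100 * q / n for key, (q, n, _) in reported.items()}
for key, value in computed.items():
    print(key, f"computed {value:.2f}%, reported {reported[key][2]}%")
```

Every recomputed value lies within 0.1 percentage point of the reported one, which supports this reading of the table.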


Further work

• Apply information in optimizer
  – Data: access path expressions p.x.y
  – Control flow: inlining, I-cache locality

• Investigate flexible sampling interval

• Further optimizations of monitoring system
  – Replacing expensive JNI calls
  – Avoid copying of samples


Concluding remarks

• Precise performance event monitoring is possible with low overhead (~ 2%)

• Monitoring infrastructure tied into Jikes RVM compiler

• Instruction level information allows optimizations to focus on “hot spots”

• Good platform to study coupling compiler decisions to hardware-specific platform properties