Using Platform-Specific Performance Counters for Dynamic Compilation. Florian Schneider and Thomas Gross, ETH Zurich. Oct 2005.

Oct 2005 1

Using Platform-Specific Performance Counters for Dynamic Compilation

Florian Schneider and Thomas Gross

ETH Zurich

Oct 2005 2

Introduction & Motivation

• Dynamic compilers are a common execution platform for OO languages (Java, C#)

• Properties of OO programs are difficult to analyze at compile time

• JIT compiler can immediately use information obtained at run-time

Oct 2005 3

Introduction & Motivation

Types of information:

1. Profiles: e.g. execution frequency of methods / basic blocks

2. Hardware-specific properties: cache misses, TLB misses, branch prediction failures

Oct 2005 4

Outline

1. Introduction

2. Requirements

3. Related work

4. Implementation

5. Results

6. Conclusions

Oct 2005 5

Requirements

• Infrastructure must be flexible enough to measure different execution metrics
– Hide machine-specific details from the VM
– Keep changes to the VM/compiler minimal

• Runtime overhead of collecting information from the CPU must be low

• Information must be precise to be useful for online optimization

Oct 2005 6

Related work

• Profile-guided optimization
– Code positioning [Pettis PLDI90]

• Hardware performance monitors
– Relating HPM data to basic blocks [Ammons PLDI97]
– “Vertical profiling” [Hauswirth OOPSLA 2004]

• Dynamic optimization
– Mississippi delta [Adl-Tabatabai PLDI2004]
– Object reordering [Huang OOPSLA 2004]

• Our work:
– No instrumentation
– Use profile data + hardware info
– Targets fully automatic dynamic optimization

Oct 2005 7

Hardware performance monitors

• Sampling-based counting
– CPU reports state every n events
– Precision is platform-dependent (pipelines, out-of-order execution)

• Sampling provides method, basic block, or instruction-level information
– Newer CPUs support precise sampling (e.g. P4, Itanium)

Oct 2005 8

Hardware performance monitors

• Way to localize performance bottlenecks
– Sampling interval determines how fine-grained the information is

• Smaller sampling interval yields more data
– Trade-off: precision vs. runtime overhead
– Need enough samples for a representative picture of the program behavior

Oct 2005 9

Implementation

Main parts

1. Kernel module: low-level access to hardware, per-process counting

2. User-space library: hides kernel & device driver details from VM

3. Java VM thread: collects samples periodically, maps samples to Java code
– Implemented on top of Jikes RVM

Oct 2005 10

System overview

Oct 2005 11

Implementation

• Supported events:
– L1 and L2 cache misses
– DTLB misses
– Branch prediction

• Parameters of the monitoring module:
– Buffer size (fixed)
– Polling interval (fixed)
– Sampling interval (adaptive)

• Keep runtime overhead constant by automatically adjusting the sampling interval at run time
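The slides do not say how the sampling interval is adapted. The following is a minimal sketch of one possible feedback rule, not the authors' implementation. It assumes a fixed cost per sample in cycles and a roughly constant event rate, so that overhead is about inversely proportional to the interval. The class name, parameters, and the 2x step limit are all hypothetical.

```java
// Hypothetical feedback controller: names, constants, and the update rule are
// illustrative assumptions, not taken from the monitoring module itself.
final class SamplingIntervalController {
    private final double targetOverhead;  // desired fraction of cycles spent sampling, e.g. 0.02
    private final long cyclesPerSample;   // assumed cost of taking and processing one sample
    private final long minInterval;
    private final long maxInterval;
    private long interval;                // current number of events between two samples

    SamplingIntervalController(double targetOverhead, long cyclesPerSample,
                               long initialInterval, long minInterval, long maxInterval) {
        this.targetOverhead = targetOverhead;
        this.cyclesPerSample = cyclesPerSample;
        this.interval = initialInterval;
        this.minInterval = minInterval;
        this.maxInterval = maxInterval;
    }

    /** Called once per polling period; returns the interval to use for the next period. */
    long update(long samplesInPeriod, long cyclesInPeriod) {
        if (cyclesInPeriod <= 0) {
            return interval; // nothing measured, keep the current setting
        }
        double overhead = (double) samplesInPeriod * cyclesPerSample / cyclesInPeriod;
        // Scale the interval by measured/target overhead, but change it by at most
        // a factor of two per period to avoid oscillation.
        double factor = Math.max(0.5, Math.min(2.0, overhead / targetOverhead));
        long next = Math.round(interval * factor);
        interval = Math.max(minInterval, Math.min(maxInterval, next));
        return interval;
    }
}
```

With a 2% target, a period that spends 4% of its cycles on sampling doubles the interval, and a period that spends 1% halves it.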

Oct 2005 12

From raw data to Java

• Determine method + bytecode instr
– Build sorted method table
– Map offset to bytecode

0x080485e1: mov 0x4(%esi),%esi
0x080485e4: mov $0x4,%edi
0x080485e9: mov (%esi,%edi,4),%esi
0x080485ec: mov %ebx,0x4(%esi)
0x080485ef: mov $0x4,%ebx
0x080485f4: push %ebx
0x080485f5: mov $0x0,%ebx
0x080485fa: push %ebx
0x080485fb: mov 0x8(%ebp),%ebx
0x080485fe: push %ebx
0x080485ff: mov (%ebx),%ebx
0x08048601: call *0x4(%ebx)
0x08048604: add $0xc,%esp
0x08048607: mov 0x8(%ebp),%ebx
0x0804860a: mov 0x4(%ebx),%ebx

Bytecode instructions marked in the listing: GETFIELD, ARRAYLOAD, INVOKEVIRTUAL
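The two lookup steps (sorted method table, then offset to bytecode) could look roughly like the sketch below. It is an illustration under assumptions, not Jikes RVM code: `CodeMap`, `CompiledMethod`, and their fields are hypothetical names, and each compiled method is assumed to carry a sorted array of machine-code offsets with a parallel array of bytecode indices.

```java
import java.util.Arrays;

// Hypothetical lookup structure; not the data structures used by Jikes RVM.
final class CodeMap {
    static final class CompiledMethod {
        final String name;
        final long start;           // address of the first machine instruction
        final long end;             // first address after the method (exclusive)
        final int[] offsets;        // sorted offsets at which a bytecode's machine code begins
        final int[] bytecodeIndex;  // bytecode index for each entry of offsets

        CompiledMethod(String name, long start, long end, int[] offsets, int[] bytecodeIndex) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.offsets = offsets;
            this.bytecodeIndex = bytecodeIndex;
        }

        /** Returns the bytecode index whose machine code covers pc, or -1 if none. */
        int bytecodeIndexAt(long pc) {
            int i = Arrays.binarySearch(offsets, (int) (pc - start));
            if (i < 0) {
                i = -i - 2; // index of the last offset that is <= the searched offset
            }
            return i >= 0 ? bytecodeIndex[i] : -1;
        }
    }

    private final CompiledMethod[] methods; // sorted by start address
    private final long[] starts;

    CodeMap(CompiledMethod[] input) {
        methods = input.clone();
        Arrays.sort(methods, (a, b) -> Long.compare(a.start, b.start));
        starts = new long[methods.length];
        for (int i = 0; i < methods.length; i++) {
            starts[i] = methods[i].start;
        }
    }

    /** Binary search for the method containing pc; null for unknown (e.g. native) code. */
    CompiledMethod methodAt(long pc) {
        int i = Arrays.binarySearch(starts, pc);
        if (i < 0) {
            i = -i - 2; // last method starting at or before pc
        }
        if (i < 0 || pc >= methods[i].end) {
            return null;
        }
        return methods[i];
    }
}
```

A sampled PC is first resolved with `methodAt(pc)` and then with `bytecodeIndexAt(pc)` on the result. In a real VM the table would have to be rebuilt or patched whenever methods are recompiled or code is moved.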

Oct 2005 13

From raw data to Java

• Sample gives PC + register contents

• PC → machine code → compiled Java code → bytecode instruction

• For data address: use registers + machine code to calculate target address:
– GETFIELD → indirect load
mov 12(eax), eax // 12 = offset of field
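As a small illustration of the data-address step, the sketch below applies the usual IA-32 addressing formula (base + index * scale + displacement, truncated to 32 bits) to register values taken from a sample. It assumes the instruction has already been decoded into those operands. The class and method names are hypothetical.

```java
// Hypothetical helper: assumes the load instruction at the sampled PC was already
// decoded into base register value, index register value, scale, and displacement.
final class EffectiveAddress {
    /** IA-32 effective address: base + index * scale + displacement, modulo 2^32. */
    static long compute(long base, long index, int scale, int displacement) {
        return (base + index * scale + displacement) & 0xFFFFFFFFL;
    }
}
```

For the GETFIELD example `mov 12(eax), eax`, the target address would be `EffectiveAddress.compute(eax, 0, 1, 12)`, where `eax` is the register value recorded in the sample.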

Oct 2005 14

Engineering issues

• Lookup of PC to get method / BC instr must be efficient
– Done in parallel with user program
– Use binary search / hash table
– Update at recompilation, GC

• Identify 100% of instructions (PCs):
– Include samples from application, VM, and library code
– Dealing with native parts

Oct 2005 15

Infrastructure

• Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime platform

• Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache

• Measured data show:
– Runtime overhead
– Extraction of meaningful information

Oct 2005 16

Runtime overhead

Program     Orig [sec], [score]   Sampling interval 10000   Sampling interval 1000
javac       7.18                  2.0%                      2.4%
raytrace    4.04                  2.4%                      2.0%
jess        2.93                  0.6%                      0.1%
jack        2.73                  3.5%                      2.7%
db          10.49                 0.1%                      3.1%
compress    6.50                  0.9%                      1.5%
mpegaudio   6.54                  1.3%                      0.3%
jbb         6209.67               2.4%                      4.6%
average                           1.6%                      2.1%

• Experiment setup: monitor L2 cache misses

Oct 2005 17

Runtime overhead: specJBB

[Figure: performance relative to "no sampling" (y-axis, 1 to 1.06) as a function of the sampling interval (x-axis, 0 to 40000)]

Total cost / sample: ~ 3000 cycles

Oct 2005 18

Measurements

• Measure which instructions produce most events (cache misses, branch mispredictions)
– Potential for data locality and control flow optimizations

• Compare different SPEC benchmarks
– Find “hot spots”: instructions that produce 80% of all measured events
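The hot-spot measure can be sketched as follows: sort the per-instruction sample counts in descending order and count how many instructions are needed to cover the chosen fraction of all samples. This is an illustrative reading of the "80% quantile" used on the following slides. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for the hot-spot count; not part of the monitoring system.
final class HotSpots {
    /**
     * Returns the smallest number of instructions, taken in order of decreasing
     * sample count, whose samples add up to at least fraction of all samples.
     */
    static int quantileCount(long[] samplesPerInstruction, double fraction) {
        long[] sorted = samplesPerInstruction.clone();
        Arrays.sort(sorted); // ascending; walked from the end below
        long total = 0;
        for (long count : sorted) {
            total += count;
        }
        if (total == 0) {
            return 0; // no events recorded
        }
        double threshold = fraction * total;
        long running = 0;
        int instructions = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            instructions++;
            if (running >= threshold) {
                break;
            }
        }
        return instructions;
    }
}
```

For sample counts {60, 25, 10, 5}, `quantileCount(counts, 0.8)` returns 2, because the two hottest instructions account for 85 of the 100 samples.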

Oct 2005 19

L1/L2 Cache misses

[Figure: db L1 misses, number of samples for the top-100 memory load instructions. 80% quantile = 21 instructions (N=571)]

[Figure: db L2 misses, number of samples for the top-100 memory load instructions. 80% quantile = 13 (N=295)]

Oct 2005 20

L1/L2 Cache misses

[Figure: specJBB L1 misses, number of samples per memory load instruction. 80% quantile = 477 (N=8526)]

[Figure: specJBB L2 misses, number of samples per memory load instruction. 80% quantile = 76 (N=2361)]

Oct 2005 21

L1/L2 Cache misses

[Figure: javac L1 misses, number of samples per memory load instruction. 80% quantile = 1296 (N=3172)]

[Figure: javac L2 misses, number of samples per memory load instruction. 80% quantile = 153 (N=672)]

Oct 2005 22

Branch prediction

[Figure: specJBB branch mispredictions, number of samples per branch instruction. 80% quantile = 307 (N=4193)]

[Figure: javac branch mispredictions, number of samples per branch instruction. 80% quantile = 1575 (N=7478)]

[Figure: db branch mispredictions, number of samples per branch instruction]

Oct 2005 23

Summary

80%-quantile in % of total   L1 misses   L2 misses   Branch pred.
specJBB                      5.6%        3.2%        7.3%
javac                        40.9%       22.7%       21.1%
db                           3.7%        4.4%        0.8%

• The distribution of events over the program differs significantly between benchmarks

• Challenge: Are data precise enough to guide optimizations in a dynamic compiler?
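The percentages in the summary table appear to be the 80% quantile divided by N from the per-benchmark figures, where N is presumably the number of distinct instructions that received samples. The small sketch below reproduces that arithmetic. The helper name is hypothetical and the rounding to one decimal is an assumption.

```java
// Hypothetical helper reproducing the summary-table arithmetic.
final class QuantileShare {
    /** Share of instructions in the 80% quantile, as a percentage rounded to one decimal. */
    static double percentOfTotal(int quantileInstructions, int totalInstructions) {
        return Math.round(1000.0 * quantileInstructions / totalInstructions) / 10.0;
    }
}
```

For db L1 misses, `percentOfTotal(21, 571)` gives 3.7, matching the table. For javac L2 misses, 153 of 672 rounds to 22.8 rather than the 22.7 shown, which suggests that entry was truncated rather than rounded.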

Oct 2005 24

Further work

• Apply information in optimizer
– Data: access path expressions p.x.y
– Control flow: inlining, I-cache locality

• Investigate flexible sampling interval

• Further optimizations of monitoring system
– Replacing expensive JNI calls
– Avoid copying of samples

Oct 2005 25

Concluding remarks

• Precise performance event monitoring is possible with low overhead (~ 2%)

• Monitoring infrastructure tied into Jikes RVM compiler

• Instruction level information allows optimizations to focus on “hot spots”

• Good platform to study coupling compiler decisions to hardware-specific platform properties
