May 10, 2015
Performance Profiling of Virtual Machines
Jiaqing Du+, Nipun Sehrawat*, Willy Zwaenepoel+
+EPFL, Switzerland
*University of Illinois at Urbana-Champaign
Performance Profiling
• Use CPU performance counters
• Monitor software runtime behavior
• Incur very low overhead
• Used extensively: OProfile, VTune, …
%CYCLE Function Module
98.5529 vmx_vcpu_run kvm-intel.ko
0.2226 (no symbols) libc.so
0.1034 hpet_cpuhp_notify vmlinux
0.1034 native_patch vmlinux
Jiaqing Du, VEE, March 9, 2011
Terminology
[Figure: three configurations, each on a CPU with a PMU. (1) Native profiling: the profiler runs in the OS. (2) Guest-wide profiling: the profiler runs in the guest, on top of a VMM. (3) System-wide profiling: a profiler runs in the VMM and another in the guest.]
Profiling with Virtual Machines
                       Para-virtualization  Hardware assistance  Binary translation
Guest-wide profiling   ?                    ?                    ?
System-wide profiling  XenOprof             ?                    ?
Profilers do not work well with virtual machines.
Contributions
                       Para-virtualization  Hardware assistance  Binary translation
Guest-wide profiling   ?                    ?                    ?
System-wide profiling  XenOprof             ?                    ?
(1) Give solutions
(2) Implement prototypes
Outline
• Native profiling
• Guest-wide profiling
• System-wide profiling
• Evaluation
Native Profiling
• Performance monitoring unit (PMU)
  – consists of a set of event counters
  – generates an interrupt when a counter overflows
• PMU-based profiler
[Figure: a PMU-based profiler on a CPU with a PMU, split between user space and the kernel, with four stages: Configure, Control, Collect, Interpret. Annotations: previous PC value, process identifier.]
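The collect step described above can be sketched in C. This is a toy model under stated assumptions, not the profiler from the talk: the counter is simulated in software rather than read from real PMU hardware, and `sim_pmu`, `pmu_configure`, `pmu_count`, and `MAX_SAMPLES` are hypothetical names introduced only for illustration. Each simulated overflow records the two values named on the slide, the PC and the process identifier.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 1024

/* One sample, as recorded when the counter overflows. */
struct sample {
    uint64_t pc;   /* program counter at the time of the overflow */
    int      pid;  /* process that was running */
};

/* Simulated PMU: a single event counter plus the buffer it feeds. */
struct sim_pmu {
    uint64_t period;     /* events between two overflows (nonzero) */
    uint64_t remaining;  /* events left until the next overflow */
    struct sample buf[MAX_SAMPLES];
    size_t nsamples;
};

/* Configure: choose the sampling period and reset the state. */
void pmu_configure(struct sim_pmu *pmu, uint64_t period)
{
    pmu->period = period ? period : 1;  /* a zero period would never advance */
    pmu->remaining = pmu->period;
    pmu->nsamples = 0;
}

/* Stand-in for the overflow interrupt handler: store one sample. */
static void pmu_overflow(struct sim_pmu *pmu, uint64_t pc, int pid)
{
    if (pmu->nsamples < MAX_SAMPLES) {
        pmu->buf[pmu->nsamples].pc = pc;
        pmu->buf[pmu->nsamples].pid = pid;
        pmu->nsamples++;
    }
}

/* Collect: account `events` events observed while `pid` executed at `pc`. */
void pmu_count(struct sim_pmu *pmu, uint64_t events, uint64_t pc, int pid)
{
    while (events >= pmu->remaining) {
        events -= pmu->remaining;
        pmu->remaining = pmu->period;
        pmu_overflow(pmu, pc, pid);
    }
    pmu->remaining -= events;
}
```

With a period of 100, feeding 250 events at one PC and then 50 more at another yields three samples. The interpret step would then map each stored PC to a function and module, as in the %CYCLE table earlier.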
Guest-wide Profiling
• Profiler runs in the guest and only profiles the guest
[Figure: the Configure, Control, Collect, and Interpret stages all run inside the guest; the VMM sits between the guest and the CPU with its PMU.]
Challenge: synchronous interrupt delivery to the guest
Injected interrupts should be handled right after guest resumes execution.
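The requirement above can be illustrated with a small C model. It is only a sketch: `struct vcpu`, `vmm_on_pmu_overflow`, and `vmm_resume_guest` are hypothetical names, guest instructions are assumed to be 4 bytes long, and no real VMM interface is used. The point is the ordering inside `vmm_resume_guest`: the pending interrupt is delivered before the guest executes anything else, so the PC that the guest's profiler records is the PC at which the counter overflowed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy virtual CPU state (hypothetical layout). */
struct vcpu {
    uint64_t guest_pc;     /* guest program counter saved at VM exit */
    bool     pmi_pending;  /* a PMU overflow is waiting to be injected */
    uint64_t sampled_pc;   /* PC last recorded by the guest's profiler */
    unsigned samples;      /* samples taken by the guest's profiler */
};

/* Guest side: the profiler's handler records the interrupted PC. */
static void guest_pmi_handler(struct vcpu *v)
{
    v->sampled_pc = v->guest_pc;
    v->samples++;
}

/* VMM side: the hardware counter overflowed while the guest was running. */
void vmm_on_pmu_overflow(struct vcpu *v, uint64_t pc_at_exit)
{
    v->guest_pc = pc_at_exit;
    v->pmi_pending = true;
}

/* VMM side: resume the guest and let it run `n` instructions. */
void vmm_resume_guest(struct vcpu *v, unsigned n)
{
    if (v->pmi_pending) {
        v->pmi_pending = false;
        guest_pmi_handler(v);  /* delivered before any guest instruction runs */
    }
    v->guest_pc += 4u * n;     /* toy model: every instruction is 4 bytes */
}
```

If the handler call were moved below the `guest_pc` update, the recorded PC would drift by `4 * n` bytes, which is the misattribution that synchronous delivery avoids.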
System-wide Profiling (1/3)
• Reveal runtime behavior of both VMM and guest(s)
[Figure: the Configure, Control, Collect, and Interpret stages run in the VMM, on a CPU with a PMU, underneath Guest1 and Guest2.]
Challenge: interpret samples belonging to the guest
Do not know the internals of a guest.
System-wide Profiling (2/3)
• Interpret guest samples: full delegation
[Figure: full delegation. The VMM and the guest each show all four stages (Configure, Control, Collect, Interpret), on a CPU with a PMU.]
System-wide Profiling (3/3)
• Interpret guest samples: interpretation delegation
[Figure: interpretation delegation. The VMM and the guest each show the Configure, Control, Collect, and Interpret stages and are connected by a shared buffer, on a CPU with a PMU.]
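One plausible shape for the shared buffer is a single-producer, single-consumer ring: the VMM appends raw guest samples, and the guest drains them for interpretation. The sketch below is an assumption made for illustration, not the layout used in the prototypes; `sample_ring`, `ring_push`, `ring_pop`, and `RING_SLOTS` are hypothetical names. It also omits the memory barriers and the notification mechanism that real memory shared between a VMM and a guest would need.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8  /* must be a power of two */

/* Raw sample collected by the VMM on behalf of a guest. */
struct guest_sample {
    uint64_t pc;       /* guest program counter at the overflow */
    uint32_t vcpu_id;  /* virtual CPU the sample belongs to */
};

/* Ring shared between one producer (VMM) and one consumer (guest). */
struct sample_ring {
    uint32_t head;  /* next slot the VMM writes */
    uint32_t tail;  /* next slot the guest reads */
    struct guest_sample slot[RING_SLOTS];
};

/* VMM side: append a sample; returns false (sample dropped) if full. */
bool ring_push(struct sample_ring *r, struct guest_sample s)
{
    if (r->head - r->tail == RING_SLOTS)
        return false;
    r->slot[r->head % RING_SLOTS] = s;
    r->head++;
    return true;
}

/* Guest side: take the oldest sample; returns false if the ring is empty. */
bool ring_pop(struct sample_ring *r, struct guest_sample *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return true;
}
```

Dropping samples when the ring is full keeps the VMM side non-blocking; whether that trade-off is acceptable depends on how quickly the guest drains the buffer.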
PMU Multiplexing
• When to save & restore performance counters?
• CPU switch
  – only in-guest execution is accounted to the guest
• Domain switch
  – in-VMM execution is also accounted to the guest
[Figure: two execution timelines that interleave guest1, guest2, and VMM segments in which the VMM performs I/O for a guest. Under CPU switch, only the in-guest segments are accounted to guest 1 or guest 2. Under domain switch, each guest is also accounted the VMM segments that serve it.]
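The difference between the two policies can be expressed as an accounting rule over a timeline of execution segments. The following C sketch is a simplified model with hypothetical names (`struct segment`, `account`, `enum policy`); it assumes that every VMM segment is already tagged with the guest it is working for, and it models only the resulting attribution, not the actual saving and restoring of counters.

```c
#include <stddef.h>
#include <stdint.h>

enum mode   { IN_GUEST, IN_VMM };
enum policy { CPU_SWITCH, DOMAIN_SWITCH };

/* One stretch of execution on the physical CPU. */
struct segment {
    enum mode mode;   /* did guest code or VMM code run? */
    int domain;       /* guest on whose behalf the CPU was working */
    uint64_t events;  /* events counted during this stretch */
};

/* Events accounted to `guest` over timeline `t` under policy `p`. */
uint64_t account(const struct segment *t, size_t n, int guest, enum policy p)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].domain != guest)
            continue;
        /* CPU switch counts in-guest execution only; domain switch also
         * counts in-VMM execution done for the same guest. */
        if (t[i].mode == IN_GUEST || p == DOMAIN_SWITCH)
            total += t[i].events;
    }
    return total;
}
```

For a timeline in which guest 1 runs for 100 events and the VMM then spends 30 events on I/O for guest 1, CPU switch accounts 100 events to guest 1 and domain switch accounts 130.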
Implementation
                       Para-virtualization  KVM  QEMU
Guest-wide profiling   ?                    √    ?
System-wide profiling  XenOprof             √    √
Evaluation question #1
How much does profiling slow down programs?
Profiling Overhead
• Measure execution time
  – a computation-intensive program
  – with and without profiling
  – about 400 counter overflows per second
Profiling environment Increased execution time
Native Linux 0.04% ± 0.004%
KVM guest-wide 0.39% ± 0.045%
KVM system-wide 0.44% ± 0.043%
QEMU system-wide 0.94% ± 0.044%
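The "increased execution time" column compares run time with and without profiling. A minimal sketch of that arithmetic follows; `mean_time` and `overhead_percent` are hypothetical helper names, and how the ± values in the table were obtained is not stated on the slide, so the sketch does not attempt to reproduce them.

```c
#include <stddef.h>

/* Arithmetic mean of n run times; returns 0.0 for an empty set. */
double mean_time(const double *runs, size_t n)
{
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}

/* Increase in execution time, in percent of the unprofiled time.
 * t_base must be nonzero. */
double overhead_percent(double t_base, double t_profiled)
{
    return 100.0 * (t_profiled - t_base) / t_base;
}
```

For example, a program that takes 200 s without profiling and 201 s with profiling has an overhead of 0.5%.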
Evaluation question #2
Are profiling results accurate?
Profiling Accuracy (1/4)
• A computation-intensive benchmark
• compute_{a|b}() does floating point arithmetic
• Monitor CPU cycles
int main(int argc, char *argv[])
{
    while (1) {
        compute_a();
        compute_b();
    }
}
Profiling Accuracy (2/4)
• Comparison with native profiling
[Figure: bar chart. X-axis: routine name (compute_a, compute_b). Y-axis: cycle %, 0 to 90. Series: Native, KVM guest-wide, KVM system-wide, QEMU system-wide.]
Profiling Accuracy (3/4)
• A memory-intensive benchmark
• Randomly access a fixed-size region of memory
• Monitor last level cache misses
struct item {
    struct item *next;
    long pad[NUM_PAD];
};

void chase_pointer(void)
{
    struct item *p = &randomly_connected_items;
    while (p != NULL)
        p = p->next;
}
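The slide does not show how the items are connected. One way to build such a chain, given here only as an assumed sketch, is to shuffle the item indices and link the items in the shuffled order. `build_chain` and `chase` are hypothetical names, and the value of `NUM_PAD` is an assumption chosen so that an item occupies 64 bytes on a typical LP64 system.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_PAD 7  /* assumed: 8 + 7 * 8 = 64 bytes per item on LP64 */

struct item {
    struct item *next;
    long pad[NUM_PAD];
};

/* Link the n items of `items` into one NULL-terminated chain that visits
 * every item exactly once in random order. Returns the head of the chain,
 * or NULL if n is 0 or memory allocation fails. */
struct item *build_chain(struct item *items, size_t n, unsigned seed)
{
    if (n == 0)
        return NULL;
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        order[i] = i;

    /* Fisher-Yates shuffle. rand() with modulo is slightly biased and its
     * range may be small, which is acceptable for a sketch. */
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    for (size_t i = 0; i + 1 < n; i++)
        items[order[i]].next = &items[order[i + 1]];
    items[order[n - 1]].next = NULL;

    struct item *head = &items[order[0]];
    free(order);
    return head;
}

/* Follow the chain to its end and return the number of items visited. */
size_t chase(const struct item *p)
{
    size_t hops = 0;
    while (p != NULL) {
        p = p->next;
        hops++;
    }
    return hops;
}
```

Varying `n` changes the working set size, which is the parameter swept in the cache-miss comparison.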
Profiling Accuracy (4/4)
• Comparison with native profiling
[Figure: line chart. X-axis: working set size (KB), 256 to 3072 in steps of 256. Y-axis: cache misses per memory access, 0 to 1.6. Series: Native, KVM guest-wide, KVM system-wide, QEMU system-wide.]
Evaluation question #3
What is the difference between CPU switch and domain switch?
Recap
• CPU switch
• Domain switch
[Figure: the two accounting timelines from the PMU Multiplexing slide, repeated. Under CPU switch, only in-guest segments are accounted to guest 1 or guest 2. Under domain switch, the VMM segments that serve a guest are accounted to it as well.]
Profiling Packet Receive (1/2)
• Experiment
  – push packets to a Linux guest in KVM
  – run OProfile in the guest
  – monitor instruction retirements
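[Figure: experiment setup. Labels: Linux on hardware with a NIC; a second machine with hardware and a NIC, on which KVM exposes a virtual NIC to Linux.]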
Profiling Packet Receive (2/2)
CPU switch
INSTR Function
167 csum_partial
106 csum_partial_copy_generic
74 copy_to_user
47 ipt_do_table
38 tcp_v4_rcv
… …
Domain switch
INSTR Function
2261 cp_interrupt
1336 cp_rx_poll
1034 cp_start_xmit
421 native_apic_mem_write
374 native_apic_mem_read
191 csum_partial
105 csum_partial_copy_generic
94 copy_to_user
79 ipt_do_table
51 tcp_v4_rcv
Row groups: cp_interrupt through native_apic_mem_read are I/O related; csum_partial through tcp_v4_rcv are packet processing.
Domain switch gives more insight for I/O operations.
Related Work
• XenOprof
  – first profiler targeting virtual machines
  – system-wide profiling for Xen
• Linux perf
  – a profiling infrastructure for Linux
  – limited support for profiling KVM Linux guests
• VMware vmkperf
  – only reads and writes CPU performance counters
Conclusions
                       Para-virtualization  Hardware assistance  Binary translation
Guest-wide profiling   √                    √                    √
System-wide profiling  XenOprof             √                    √