IO Virtualization Performance
HUANG Zhiteng ([email protected])
Agenda
• IO Virtualization Overview
  • Software solution
  • Hardware solution
  • IO performance trend
• How IO Virtualization Performed in Micro Benchmark
  • Network
  • Disk
• Performance in Enterprise Workloads
  • Web Server: PV, VT-d and native performance
  • Database: VT-d vs. native performance
  • Consolidated workload: SR-IOV benefit
• Direct IO (VT-d) Overhead Analysis
IO Virtualization Overview
IO virtualization enables VMs to utilize the Input/Output resources of the hardware platform. In this session we cover network and storage.

Software solutions – the two we are familiar with on Xen:
• Emulated devices (QEMU): good compatibility, very poor performance.
• Para-virtualized (PV) devices: need driver support in the guest, but offer much better performance than QEMU emulation.
Both require the participation of Dom0 (the driver domain) to serve a VM's IO requests.
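To make the contrast concrete, here is a minimal sketch of the network-device line in an xm-style guest configuration file (which happens to use Python syntax); the device model and bridge name are illustrative, not taken from the test setup:

```python
# Alternative 1 - emulated (QEMU) NIC for an HVM guest: best compatibility,
# worst performance; QEMU in Dom0 emulates a legacy e1000 for every packet.
vif = ['type=ioemu, model=e1000, bridge=xenbr0']

# Alternative 2 - para-virtualized NIC (netfront/netback): the guest must carry
# the PV driver, but packets move over shared-memory rings instead of emulation.
vif = ['bridge=xenbr0']
```

Either way, netback (or QEMU) in Dom0 still has to handle every packet, which is exactly the software overhead the hardware assists on the next slide aim to remove.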
SoftwareServices&
group
IO Virtualization Overview – Hardware Solution
The platform offers three kinds of hardware assists to accelerate IO; a single technology or a combination can be used to address various usages.
• VMDq (Virtual Machine Device Queue): separate Rx & Tx queue pairs on the NIC for each VM, with a software "switch" in the VMM. Network only; requires specific OS and VMM support.
• Direct IO (VT-d): improved IO performance through direct assignment of an I/O device to an unmodified or para-virtualized VM; the VM exclusively owns the device (a configuration sketch follows below).
• SR-IOV (Single Root I/O Virtualization): changes to the I/O device silicon to support multiple PCI device IDs, so that one I/O device can serve multiple directly assigned guests (one device, multiple virtual functions). Requires VT-d.
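As an illustration, VT-d direct assignment is expressed in the same xm-style configuration; a minimal sketch, assuming the device has already been hidden from Dom0 (e.g. bound to pciback) and using a hypothetical PCI address:

```python
# Hand the whole physical device (here a NIC) to the guest.  With VT-d, its DMA
# and interrupts are remapped straight into the guest, bypassing Dom0's netback.
pci = ['0000:07:00.0']   # hypothetical bus:device.function of the assigned NIC
```

SR-IOV uses the same assignment mechanism, except that the address refers to one virtual function of the device rather than the whole physical device.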
SoftwareServices&
group
IO Virtualization Overview - Trend
• Solid State Drives (SSD): provide hundreds of MB/s of bandwidth and >10,000 IOPS from a single device*.
• PCIe 2.0: doubles the per-lane bit rate from 2.5GT/s to 5.0GT/s.
• 40Gb/s and 100Gb/s Ethernet: Draft 3.0 scheduled for Nov. 2009, standard approval expected in 2010**.
• Fibre Channel over Ethernet (FCoE): unified IO consolidates network (IP) and storage (SAN) traffic onto a single connection.

The trend: much higher throughput and denser IO capacity.
* http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403
** See http://en.wikipedia.org/wiki/100_Gigabit_Ethernet
How IO Virtualization Performed in Micro Benchmark – Network

[Figure: iperf transmit bandwidth (Gb/s) with CPU utilization – HVM + PV driver: 4.68; PV guest: 8.47 (1.81x); HVM + VT-d: 9.54 (1.13x); 30 VMs + SR-IOV: 19]
[Figure: iperf receive bandwidth (Gb/s) with CPU utilization – HVM + PV driver: 1.46; PV guest: 3.10 (2.12x); HVM + VT-d: 9.43 (3.04x)]
Iperf with a 10Gb/s Ethernet NIC was used to benchmark the TCP bandwidth of the different device models(*).

Thanks to VT-d, the VM easily achieved 10GbE line rate in both directions with much lower resource consumption.

With SR-IOV, we were able to reach 19Gb/s transmit bandwidth with 30 VFs assigned to 30 VMs.
* HVM + VT-d uses a 2.6.27 kernel, while the PV guest and HVM + PV driver use 2.6.18.
* We turned off multiqueue support in the NIC driver for HVM + VT-d because the 2.6.18 kernel lacks multi-TX-queue support; for the iperf test there was therefore only one TX/RX queue in the NIC and all interrupts were sent to a single physical core.
* ITR (Interrupt Throttle Rate) was set to 8000 in all cases.
[Hardware: SR-IOV dual-port 10GbE NIC]
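For reference, a minimal sketch of how such a TCP bandwidth measurement can be driven from the VM under test (assuming iperf 2.x is installed in both the VM and the traffic peer; the peer address, run length and stream count are placeholders, not the exact parameters used here):

```python
import re
import subprocess

PEER = "192.168.1.100"   # hypothetical iperf server ("iperf -s" running there)

def tcp_bandwidth_gbps(duration=60, streams=4):
    """Run an iperf client and return the reported aggregate bandwidth in Gb/s."""
    out = subprocess.run(
        ["iperf", "-c", PEER, "-t", str(duration), "-P", str(streams), "-f", "g"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The last "... x.xx Gbits/sec" line (the [SUM] line for parallel streams)
    # carries the result we want.
    rates = re.findall(r"([\d.]+)\s+Gbits/sec", out)
    return float(rates[-1]) if rates else 0.0

if __name__ == "__main__":
    print(f"TCP throughput: {tcp_bandwidth_gbps():.2f} Gb/s")
```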
How IO Virtualization Performed in Micro Benchmark – Network (cont.)
[Figure: packet transmit performance in million packets per second (Mpps) – PV guest, 1 queue: 0.33; HVM + VT-d, 1 queue: 3.90 (11.7x); HVM + VT-d, 4 queues: 8.15 (2.1x)]
Packet transmit rate is another essential aspect of high-throughput networking. Using the Linux kernel packet generator (pktgen) with small UDP packets (128 bytes), HVM + VT-d can send nearly 4 million packets/s with 1 TX queue and over 8 million packets/s with 4 TX queues.

PV performance was far behind due to its long packet-processing path.
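pktgen is driven entirely through /proc/net/pktgen (root and the pktgen module are required); a minimal sketch of a single-queue run with 128-byte packets, where the interface name, destination address and MAC are placeholders:

```python
def pgset(path, cmd):
    """Write one pktgen command to a /proc/net/pktgen control file."""
    with open(path, "w") as f:
        f.write(cmd + "\n")

DEV = "eth0"   # placeholder: the 10GbE interface under test

# Bind the device to the kernel thread for CPU 0 (one TX queue in this sketch).
pgset("/proc/net/pktgen/kpktgend_0", "rem_device_all")
pgset("/proc/net/pktgen/kpktgend_0", f"add_device {DEV}")

# 128-byte packets, unlimited count, no inter-packet delay.
dev = f"/proc/net/pktgen/{DEV}"
pgset(dev, "pkt_size 128")
pgset(dev, "count 0")
pgset(dev, "delay 0")
pgset(dev, "dst 192.168.1.100")          # placeholder destination IP
pgset(dev, "dst_mac 00:1b:21:00:00:01")  # placeholder destination MAC

# Start transmitting on all configured threads (blocks until stopped).
pgset("/proc/net/pktgen/pgctrl", "start")
```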
How IO Virtualization Performed in Micro Benchmark – Disk IO
[Figure: IOmeter disk bandwidth (MB/s), sequential read – PV guest: 1,711; HVM + VT-d: 4,911 (2.9x)]

[Figure: IOmeter disk IOPS, random read – PV guest: 7,056; HVM + VT-d: 18,725 (2.7x)]
We measured disk bandwidth with sequential reads and IOPS with random reads to check block-device performance.

HVM + VT-d outperforms the PV guest by roughly 3x in both tests.
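The numbers above come from IOmeter with many outstanding IOs; purely as an illustration of the two access patterns, a simplified Python sketch with a single outstanding request and a placeholder device path (a faithful measurement would also bypass the page cache, e.g. with O_DIRECT):

```python
import os, random, time

DEV   = "/dev/sdb"       # placeholder block device under test
BLOCK = 1024 * 1024      # 1 MiB requests for sequential bandwidth
IO    = 4096             # 4 KiB requests for random IOPS
SPAN  = 10 * 1024**3     # region of the device to exercise (10 GiB)

def seq_bandwidth(seconds=10):
    """Sequential read bandwidth in MB/s (single outstanding request)."""
    fd, done, start = os.open(DEV, os.O_RDONLY), 0, time.time()
    while time.time() - start < seconds:
        if not os.pread(fd, BLOCK, done % SPAN):
            break
        done += BLOCK
    os.close(fd)
    return done / (time.time() - start) / 1e6

def random_iops(seconds=10):
    """Random 4 KiB read IOPS (single outstanding request)."""
    fd, ios, start = os.open(DEV, os.O_RDONLY), 0, time.time()
    while time.time() - start < seconds:
        os.pread(fd, IO, random.randrange(0, SPAN, IO))
        ios += 1
    os.close(fd)
    return ios / (time.time() - start)

print(f"seq: {seq_bandwidth():.0f} MB/s, rand: {random_iops():.0f} IOPS")
```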
Performance in Enterprise Workloads – Web Server
[Figure: Web Server performance (maximum sessions) with CPU utilization – HVM + PV driver: 5,000; PV guest: 9,000 (1.8x); HVM + VT-d: 24,500 (2.7x)]
Web Server simulates a support website where connected users browse and download files. We measure the maximum number of simultaneous user sessions the web server can support while still satisfying the QoS criteria.

Only HVM + VT-d was able to push the server's utilization to ~100%. The PV solutions hit a bottleneck and failed the QoS criteria while CPU utilization was still well below that.
Performance in Enterprise Workloads – Database
[Figure: Decision Support DB performance (QphH) – Native: 11,443; HVM + VT-d: 10,762 (94.06% of native)]

[Figure: OLTP DB performance – Native: 199.71; HVM + VT-d storage & NIC: 184.08 (92.2%); HVM + VT-d storage only: 127.56 (63.9%)]
Decision Support DB requires high disk bandwidth, while OLTP DB is IOPS-bound and also needs a certain amount of network bandwidth to connect to its clients.

The HVM + VT-d combination achieved >90% of native performance in these two DB workloads.
Performance in Enterprise Workloads – Consolidation with SR-IOV
The consolidation workload runs multiple tiles of servers on the same physical machine. One tile consists of one Web Server instance, one J2EE application server and one mail server, six VMs in total. It is a complex workload that consumes CPU, memory, disk and network.

The PV solution could only support 4 tiles on a two-socket server, and it failed the Web Server QoS criteria before saturating the CPU.

As a pilot, we enabled an SR-IOV NIC for the Web Server. This brought a >49% performance increase and allowed the system to support two more tiles (12 more VMs); see the figure and configuration sketch below.
[Figure: SR-IOV benefit on the consolidated workload – performance ratio and system utilization: PV guest 1.00 vs. HVM + SR-IOV 1.49]
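For reference, enabling VFs on the NIC in Dom0 and handing one VF to each Web Server VM could look roughly like the sketch below (module parameter, VF count and PCI address are illustrative; the exact mechanism depends on the NIC driver and Xen version used):

```python
# Dom0 side (shell, shown as comments): create the virtual functions, e.g.
#   modprobe ixgbe max_vfs=30          # driver-specific parameter of that era;
#                                      # newer kernels expose sriov_numvfs in sysfs
#   lspci | grep "Virtual Function"    # each VF appears as its own PCI device
#
# Guest side: each Web Server VM's xm-style configuration assigns one VF by its
# (hypothetical) PCI address, exactly like any other VT-d direct assignment.
pci = ['0000:07:10.0']   # one VF out of the pool created above
```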
Direct IO (VT-d) Overhead
[Figure: VT-d, Xen cycles breakdown for SPECweb – categories: interrupt window, INTR, APIC access, IO instruction; the two largest components are 6.37% and 4.73%]
APIC access and interrupt delivery consumed the most cycles. (Note that some interrupts arrive while the CPU is halted (HLT), so they are not counted here.)

Across various workloads we have seen Xen introduce about 5~12% overhead, mainly spent on serving interrupts. The Intel OTC team has developed a patch set that eliminates part of this Xen software overhead; see Xiaowei's session for details.
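As a toy illustration of how such a breakdown is computed, the share of each VM-exit reason is simply its cycle count divided by all sampled cycles; the numbers below are made up for the sketch, not the measured data:

```python
# Hypothetical cycle counts sampled over a fixed interval (not measured data).
cycles = {
    "guest":            9_000_000_000,  # cycles spent running guest code
    "apic_access":        300_000_000,  # VM exits for virtual APIC accesses
    "intr_delivery":      250_000_000,  # external-interrupt exits + injection
    "io_instruction":      60_000_000,  # port IO emulation
    "interrupt_window":    40_000_000,  # interrupt-window exits
}

total = sum(cycles.values())
xen_overhead = 1 - cycles["guest"] / total
print(f"Xen overhead: {xen_overhead:.1%}")      # ~6.7% in this toy example
for reason, c in cycles.items():
    if reason != "guest":
        print(f"  {reason:18s} {c / total:6.2%} of all cycles")
```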
[Figure: VT-d cases, CPU utilization breakdown (%) into Dom0, Xen, guest kernel and guest user for the Disk Bandwidth and Web Server workloads – the Xen component is about 5.94% and 11.878% respectively]
CREDIT
Many thanks to DUAN, Ronghui and XIANG, Kai for providing the VT-d network and SR-IOV data.
QUESTIONS?
BACKUP
Configuration
Hardware configuration:
Intel® Nehalem-EP server system
CPU: 2-socket Nehalem, 2.66 GHz, 8MB LLC, C0 stepping; hardware prefetchers OFF, Turbo mode OFF, EIST OFF.
NIC: Intel 10Gb XF SR NIC (82598EB) – two single-port NICs installed on one machine and one dual-port NIC installed on the server.
RAID bus controller: LSI Logic MegaRAID SAS 8888ELP x3, with 6 disk arrays (each 12 x 70GB SAS HDDs).
Memory: 64GB (16 x 4GB DDR3 1066MHz), 32GB on each node.
Software configuration:
Xen changeset 18771 for the network/disk micro-benchmarks, changeset 19591 for the SR-IOV test.
VM configuration per test case:
• Network micro-benchmark: 4 vCPUs, 64GB memory
• Storage micro-benchmark: 2 vCPUs, 12GB memory
• Web Server: 4 vCPUs, 64GB memory
• Database: 4 vCPUs, 12GB memory
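Put together, a hedged sketch of what one of the HVM + VT-d guest configurations above might look like in an xm-style config file; the name, device paths and PCI addresses are placeholders, and only the vCPU/memory figures come from the table:

```python
# Web Server VM (HVM + VT-d): 4 vCPUs, 64 GB memory as listed above.
name    = "webserver-vtd"
builder = "hvm"
vcpus   = 4
memory  = 65536                      # MB (64 GB)

# Directly assigned devices (hypothetical PCI addresses):
pci = ['0000:07:00.0',               # 10GbE NIC for client traffic
       '0000:0b:00.0']               # RAID controller for the web content

disk = ['phy:/dev/vg0/webserver,hda,w']   # boot disk still served by Dom0
vif  = []                                 # no PV/emulated NIC when the NIC is assigned
```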