Page 1:

Virtualization
Adam Belay <[email protected]>

Page 2:

What is a virtual machine?

• A simulation of a computer
• Running as an application on a host computer
• Accurate
• Isolated
• Fast

Page 3:

Why use a virtual machine?

• To run multiple operating systems (e.g. Windows and Linux)
• To manage big machines (allocate cores and memory at O/S granularity)
• Kernel development (e.g. like QEMU + JOS)
• Better fault isolation (defense in depth)
• To package applications with a specific kernel version and environment
• To improve resource utilization

Page 4:

How accurate do we have to be?

• Must handle weird quirks in existing OSes
  • Even bug-for-bug compatibility

• Must maintain isolation even with malicious software
  • The guest cannot break out of the VM!

• Must be impossible for the guest to distinguish the VM from a real machine
  • Some VMs compromise, modifying the guest kernel to reduce the accuracy requirement

Page 5:

VMs are an old idea

• 1960s: IBM used VMs to share big machines
• 1970s: IBM specialized CPUs for virtualization
• 1990s: VMware repopularized VMs for x86 HW
• 2000s: AMD & Intel specialized CPUs for virtualization

Page 6:

Process Architecture

Diagram: vi, gcc, and firefox run as processes on top of the OS, which runs on the hardware.

Page 7:

VM Architecture

• What if the process abstraction looked just like HW?

Diagram: the OS (VMM) runs on the hardware; alongside ordinary processes such as vi, gcc, and firefox, each guest OS runs on its own virtual HW.

Page 8:

Comparing a process and HW

Process:
• Nonprivileged registers and instructions
• Virtual memory
• Signals
• File system and sockets

Hardware:
• All registers and instructions
• Virtual memory and MMU
• Traps and interrupts
• I/O devices and DMA

Page 9:

Can a CPU be virtualized?

Requirements to be “classically virtualizable,” defined by Popek and Goldberg in 1974:

1. Fidelity: Software on the VMM executes identically to its execution on hardware, barring timing effects.
2. Performance: An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM.
3. Safety: The VMM manages all hardware resources.

Page 10:

Why not simulation?

• VMM interprets each instruction (e.g. BOCHS)
• Maintains machine state for each register
• Emulates I/O ports and memory
• Violates the performance requirement

Page 11:

Idea: Execute guest instructions on the real CPU whenever possible

• Works fine for most instructions
  • E.g. add %eax, %ebx
• But privileged instructions could be harmful
  • Would violate the safety property

Page 12:

Idea: Run guest kernels at CPL 3

• Ordinary instructions work fine
• Privileged instructions should trap to the VMM (general protection fault)
• The VMM can apply privileged operations to “virtual” state, not to the real hardware (see the sketch below)
• This is called “trap-and-emulate”
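To make the trap-and-emulate flow concrete, here is a minimal dispatch sketch in C. The struct, opcode handling, and helper names are my own illustration, not a real VMM: the guest's privileged instruction raises #GP at CPL 3, and the handler applies its effect to virtual state and advances the guest %rip.

/* Illustrative trap-and-emulate dispatch; all names are made up.
 * Prefixes (e.g. REX) are ignored to keep the sketch short. */
#include <stdint.h>

struct vcpu {
    uint64_t rip;   /* guest instruction pointer */
    uint64_t cr3;   /* virtual %cr3: guest page-table root */
};

/* Called on a general protection fault taken while running the guest.
 * 'insn' points at the faulting instruction's bytes; 'gpr_value' is the
 * value of the source register already decoded by the caller. */
static void vmm_handle_gp(struct vcpu *v, const uint8_t *insn, uint64_t gpr_value)
{
    /* 0x0F 0x22 /r is "mov to control register". */
    if (insn[0] == 0x0F && insn[1] == 0x22) {
        int cr = (insn[2] >> 3) & 7;     /* which control register */
        if (cr == 3)
            v->cr3 = gpr_value;          /* update only the *virtual* %cr3 */
        v->rip += 3;                     /* skip the emulated instruction */
        return;
    }
    /* other privileged instructions (lgdt, lidt, hlt, ...) handled similarly */
}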

Page 13:

Trap-and-emulate example

• CLI/STI – disable and enable interrupts
• The EFLAGS IF bit tracks the current status
• The VMM maintains a virtual copy of the EFLAGS register
• The VMM controls the hardware EFLAGS
  • It probably leaves interrupts enabled even if the VM runs CLI
• The VMM looks at the virtual EFLAGS register to decide when to interrupt the guest
• The VMM must make sure the guest only sees the virtual EFLAGS (sketch below)
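A minimal sketch of this virtual-EFLAGS bookkeeping, with made-up names: cli/sti only touch the virtual IF bit, and the VMM consults that bit before injecting a queued virtual interrupt, even though the hardware IF stays enabled.

/* Illustrative virtual-interrupt bookkeeping; not a real VMM's code. */
#include <stdbool.h>
#include <stdint.h>

struct vcpu {
    uint64_t eflags;         /* virtual EFLAGS; hardware IF stays enabled */
    uint8_t  pending_vector; /* queued virtual interrupt, if any */
    bool     irq_pending;
};

#define EFLAGS_IF (1ull << 9)

/* Guest executed cli/sti: only the virtual copy changes. */
static void emulate_cli(struct vcpu *v) { v->eflags &= ~EFLAGS_IF; }
static void emulate_sti(struct vcpu *v) { v->eflags |=  EFLAGS_IF; }

/* Before resuming the guest, the VMM checks the virtual IF bit to decide
 * whether the queued virtual interrupt may be injected now. */
static bool should_inject_irq(const struct vcpu *v)
{
    return v->irq_pending && (v->eflags & EFLAGS_IF);
}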

Page 14:

What about virtual memory?

• Want to maintain the illusion that each VM has dedicated physical memory
• The guest wants to start at PA 0 and use all of RAM
• The VMM needs to support many guests; they can’t all really use the same physical addresses
• Idea:
  • Claim RAM is smaller than real RAM
  • Keep paging enabled
  • Maintain a “shadow” copy of the guest page table
  • The shadow maps VAs to different PAs than the guest requests
  • The real %CR3 points to the shadow table
  • The virtual %CR3 points to the guest page table

Page 15:

Virtualization memory diagram

Diagram:
Host page table: host virtual address → host physical address

Page 16:

Virtualization memory diagram

Diagram:
Host page table:   host virtual address → host physical address
Guest PT:          guest virtual address → guest physical address
VMM map:           guest physical address → host physical address
Shadow page table: guest virtual address → host physical address (composition of the guest PT and the VMM map)

Page 17:

Example:

• The guest wants guest-physical page @ 0x1000000
• The VMM map redirects guest-physical 0x1000000 to host-physical 0x2000000
• The VMM traps if the guest changes %cr3 or writes to the guest page table
  • It transfers each guest PTE to the shadow page table
  • It uses the VMM map to translate guest-physical page addresses in the page table to host-physical addresses (see the sketch below)
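A sketch of the shadow-PTE update this example describes, with illustrative names; the toy gpa_to_hpa() stands in for the VMM map (here it simply adds the 16 MiB offset from the example above).

/* Hypothetical shadow-PTE update, run when a trap catches the guest
 * writing one of its page-table entries.  All names are illustrative. */
#include <stdint.h>

#define PTE_FLAGS_MASK 0xFFFull      /* present, writable, user, accessed, ... */

typedef uint64_t pte_t;

/* Toy VMM map: every guest frame is offset by 16 MiB in this sketch,
 * so guest-physical 0x1000000 becomes host-physical 0x2000000. */
static uint64_t gpa_to_hpa(uint64_t gpa) { return gpa + 0x1000000; }

/* shadow[] is the table the real %cr3 points at; guest_pte is the value
 * the guest just tried to install in its own page table. */
static void shadow_pte_update(pte_t *shadow, unsigned idx, pte_t guest_pte)
{
    uint64_t gpa   = guest_pte & ~PTE_FLAGS_MASK;  /* guest-physical target */
    uint64_t flags = guest_pte &  PTE_FLAGS_MASK;
    shadow[idx] = gpa_to_hpa(gpa) | flags;         /* redirect to host-physical */
}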

Page 18:

Why can’t the VMM modify the guest page table in place?

Page 19:

Need a shadow copy of all privileged state

• So far we discussed EFLAGS and page tables
• Also need the GDT, IDT, LDTR, %CR*, etc.

Page 20:

Unfortunately, trap-and-emulate is not possible on x86

Two problems:
1. Some instructions behave differently at CPL 3 instead of trapping
2. Some registers leak state that reveals whether the CPU is running at CPL 3

• Violates the fidelity property

Page 21:

x86 isn’t classically virtualizable

Problems at CPL 3 versus CPL 0:
• mov %cs, %ax
  • %cs contains the CPL in its lower two bits (demonstration below)
• popfl/pushfl
  • Privileged bits, including EFLAGS.IF, are masked out
• iretq
  • No ring change, so it doesn’t restore SS/ESP

Page 22:

Two possible solutions

1. Binary translation
   • Rewrite offending instructions to behave correctly
2. Hardware virtualization
   • The CPU maintains shadow state internally and directly executes privileged guest instructions

Page 23:

Strawman binary translation

• Replace all instructions that cause violations with INT $3, which traps
• INT $3 is one byte, so it can fit inside any x86 instruction without changing size/layout (toy sketch below)
• But this is unrealistic
  • We don’t know the difference between code and data, or where instruction boundaries lie
• VMware’s solution is much more sophisticated
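A toy version of the strawman, with hypothetical names: patch the first byte of a known-offending instruction with int3 and remember the original byte so a #BP handler could emulate it later. The unrealistic part is exactly what the slide notes: this assumes we already know where instruction boundaries are.

/* Toy strawman patching; names and layout are illustrative only. */
#include <stdint.h>

struct patch {
    uint8_t *addr;    /* location of the privileged instruction */
    uint8_t  saved;   /* original first byte, needed for later emulation */
};

static void install_breakpoint(struct patch *p, uint8_t *insn)
{
    p->addr  = insn;
    p->saved = *insn;
    *insn    = 0xCC;  /* int3 is one byte, so size/layout is unchanged */
}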

Page 24:

VMware’s binary translator

• The kernel is translated dynamically, like a JIT
  • Idea: scan only as executed, since execution reveals instruction boundaries
  • When the VMM first loads the guest kernel, it rewrites from the entry point to the first jump
  • Most instructions translate identically
• Instructions are translated in chunks
  • Called a basic block
  • Either 12 instructions or up to the control-flow instruction, whichever occurs first (schematic loop below)
• Only guest kernel code is translated
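A schematic of the basic-block rule above, in C; decode_insn() and the emit_* helpers are placeholders for a real decoder and code emitter, not VMware's interfaces.

/* Schematic basic-block translator; helpers are declared but not defined. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct insn { size_t len; bool control_flow; /* other decoded fields ... */ };

size_t decode_insn(const uint8_t *code, struct insn *out);   /* returns length */
void   emit_ident(const struct insn *i);         /* most instructions copy through */
void   emit_control_flow(const struct insn *i);  /* rewritten with continuations */

/* Translate one basic block starting at 'code': stop after 12 instructions
 * or at the first control-flow instruction, whichever comes first. */
const uint8_t *translate_basic_block(const uint8_t *code)
{
    for (int n = 0; n < 12; n++) {
        struct insn i;
        code += decode_insn(code, &i);
        if (i.control_flow) {
            emit_control_flow(&i);
            break;
        }
        emit_ident(&i);
    }
    return code;   /* next guest address, translated lazily when reached */
}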

Page 25:

The guest kernel shares an address space with the VMM

• Uses segmentation to protect VMM memory
  • The VMM is loaded at high virtual addresses, the translated guest kernel at low addresses
  • Segment limits are programmed to “truncate” the address space, preventing all segments except %GS from reaching the VMM
• What if a guest kernel instruction uses the %GS selector?
  • %GS provides fast access to data shared between the guest kernel and the VMM
• Assumption: translated code can’t violate isolation
  • It can never directly access %GS, %CR3, the GDT, etc.

Page 26:

Why put the guest kernel and the VMM in the same address space?

Page 27:

Why put the guest kernel and the VMM in the same address space?

• Shared state becomes inexpensive to access, e.g. cli -> “vcpu.flags.IF := 0”
• Translated code is safe; it can’t violate isolation after translation

Page 28:

Translation example

• All control flow requires indirection

C source:

int isPrime(int a) {
    for (int i = 2; i < a; i++) {
        if (a % i == 0) return 0;
    }
    return 1;
}

Original: isPrime()
mov %ecx, %edi   # %ecx = %edi (a)
mov %esi, $2     # %esi = 2
cmp %esi, %ecx   # is i >= a?
jge prime        # if yes jump   <- end of basic block

Page 29:

Translation example

• All control flow requires indirection

Original: isPrime()
mov %ecx, %edi   # %ecx = %edi (a)
mov %esi, $2     # %esi = 2
cmp %esi, %ecx   # is i >= a?
jge prime        # if yes jump

Translated: isPrime()'
mov %ecx, %edi   # IDENT
mov %esi, $2
cmp %esi, %ecx
jge [takenAddr]  # JCC
jmp [fallthrAddr]

Page 30:

Translation example

• Brackets represent continuations
  • The first time they are executed, they jump into the BT and generate the next basic block
  • Can elide “jmp [fallthrAddr]” if it’s the next address translated
• Indirect control flow is harder (lookup sketch below)
  • “(jmp, call, ret) does not go to a fixed target, preventing translation-time binding. Instead, the translated target must be computed dynamically, e.g., with a hash table lookup. The resulting overhead varies by workload but is typically a single-digit percentage.” – from the paper
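A minimal sketch of the dynamic lookup the quoted passage mentions; the hash-table layout and helper names are illustrative, and the slow-path translator is only declared here.

/* Map a guest address to its translated address; fall back to the BT on miss. */
#include <stddef.h>
#include <stdint.h>

#define TC_HASH_SIZE 4096

struct tc_entry {
    uint64_t guest_pc;     /* original guest address */
    void    *translated;   /* corresponding translated code, or NULL */
};

static struct tc_entry tc_hash[TC_HASH_SIZE];

void *translate_basic_block_at(uint64_t guest_pc);   /* slow path: run the BT */

void *lookup_translated(uint64_t guest_pc)
{
    struct tc_entry *e = &tc_hash[(guest_pc >> 4) % TC_HASH_SIZE];
    if (e->guest_pc == guest_pc && e->translated)
        return e->translated;                  /* fast path: hash hit */
    e->guest_pc   = guest_pc;                  /* miss: translate and cache */
    e->translated = translate_basic_block_at(guest_pc);
    return e->translated;
}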

Page 31:

Hardware virtualization

• The CPU maintains a guest copy of privileged state in a special region called the virtual machine control structure (VMCS)
• The CPU operates in two modes
  • VMX non-root mode: runs the guest kernel
  • VMX root mode: runs the VMM
• Hardware saves and restores privileged register state to and from the VMCS as it switches modes
• Each mode has its own separate privilege rings

• Net effect: hardware can run most privileged guest instructions directly, without emulation (conceptual run loop below)
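A conceptual root-mode run loop, assuming made-up structure and helper names (not Intel's or KVM's actual interfaces): the VMM enters non-root mode, the guest runs directly until something forces an exit, and the VMM handles the exit reason and resumes.

/* Conceptual VMX-style run loop; every name here is illustrative. */
#include <stdint.h>

struct vmcs_like {
    uint64_t guest_rip, guest_rsp, guest_cr3;   /* guest privileged state */
};

enum exit_reason { EXIT_IO, EXIT_PAGE_FAULT, EXIT_OTHER };

/* vmenter(): hardware loads guest state from the VMCS-like area, switches
 * to non-root mode, runs the guest until an exit, then saves guest state
 * back and returns to the VMM in root mode with the exit reason. */
enum exit_reason vmenter(struct vmcs_like *vmcs);
void emulate_io(struct vmcs_like *vmcs);
void handle_guest_page_fault(struct vmcs_like *vmcs);

void run_vcpu(struct vmcs_like *vmcs)
{
    for (;;) {
        switch (vmenter(vmcs)) {   /* guest runs directly until an exit */
        case EXIT_IO:         emulate_io(vmcs);              break;
        case EXIT_PAGE_FAULT: handle_guest_page_fault(vmcs); break;
        default:              /* cpuid, hlt, interrupts, ... */ break;
        }
    }
}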

Page 32:

What about the MMU?

• Hardware effectively maintains two page tables (see the two-level walk sketch below)
  • The normal page table, controlled by the guest kernel
  • The extended page table (EPT), controlled by the VMM
• EPT didn’t exist when VMware published the paper

Diagram:
Guest PT: guest virtual address → guest physical address
EPT:      guest physical address → host physical address
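A schematic of the resulting two-dimensional translation, with walk() standing in for a full 4-level page walk; the names are illustrative. In hardware, each guest page-table access made during the guest walk is itself translated through the EPT, which is why TLB misses cost more than with a single table.

/* Schematic two-dimensional address translation with EPT. */
#include <stdint.h>

uint64_t walk(uint64_t table_root, uint64_t addr);   /* one 4-level page walk */

uint64_t translate(uint64_t guest_cr3, uint64_t ept_root, uint64_t gva)
{
    uint64_t gpa = walk(guest_cr3, gva);  /* guest PT: guest virtual -> guest physical */
    return walk(ept_root, gpa);           /* EPT: guest physical -> host physical */
}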

Page 33:

What’s better, HW or SW virtualization?

Page 34:

What’s better, HW or SW virtualization?

• Software virtualization advantages
  • Trap emulation: most traps can be replaced with callouts
  • Emulation speed: BT can generate purpose-built emulation code; hardware traps must decode the instruction, etc.
  • Callout avoidance: sometimes BT can even inline callouts

• Hardware virtualization advantages
  • Code density: translated code requires more instructions and larger opcodes
  • Precise exceptions: BT must perform extra work to recover guest state
  • System calls: don’t require VMM intervention

Page 35:

What’s better, HW or SW virtualization?

[Figure 4. Virtualization nanobenchmarks. CPU cycles on a log scale (smaller is better) for Native, Software VMM, and Hardware VMM across: ptemod, divzero, pgfault, call/ret, cr8wr, in, syscall.]

[…] between the two VMMs, the hardware VMM inducing approximately 4.4 times greater overhead than the software VMM. Still, this program stresses many divergent paths through both VMMs, such as system calls, context switching, creation of address spaces, modification of traced page table entries, and injection of page faults.

6.3 Virtualization nanobenchmarks

To better understand the performance differences between the two VMMs, we wrote a series of “nanobenchmarks” that each exercise a single virtualization-sensitive operation. Often, the measured operation is a single instruction long. For precise control over the executed code, we repurposed a custom OS, FrobOS, that VMware developed for VMM testing.

Our modified FrobOS boots, establishes a minimal runtime environment for C code, calibrates its measurement loops, and then executes a series of virtualization-sensitive operations. The test repeats each operation many times, amortizing the cost of the binary translator’s adaptations over multiple iterations. In our experience, this is representative of guest behavior, in which adaptation converges on a small fraction of poorly behaving guest instructions. The results of these nanobenchmarks are presented in Figure 4. The large spread of cycle counts requires the use of a logarithmic scale.
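For flavor, here is a generic user-space cycle-counting harness in the spirit of these nanobenchmarks. It is an illustration only (FrobOS itself is not public) and assumes an x86 host with rdtsc and a GCC/Clang toolchain.

/* Generic cycle-counting harness; not FrobOS code. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define ITERS 1000000ull

int main(void)
{
    uint64_t start = rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* the single virtualization-sensitive operation under test would go
         * here, e.g. a syscall, a PTE write, or an in from port 0x80 */
        __asm__ volatile ("" ::: "memory");   /* keep the loop from being elided */
    }
    uint64_t cycles = rdtsc() - start;
    printf("%llu cycles/op\n", (unsigned long long)(cycles / ITERS));
    return 0;
}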

syscall. This test measures round-trip transitions from user level to supervisor level via the syscall and sysret instructions. The software VMM introduces a layer of code and an extra privilege transition, requiring approximately 2000 more cycles than a native system call. In the hardware VMM, system calls execute without VMM intervention, so as we expect, the hardware VMM executes system calls at native speed.

in. We execute an in instruction from port 0x80, the BIOS POST port. Native execution accesses an off-CPU register in the chipset, requiring 3209 cycles. The software VMM, on the other hand, translates the in into a short sequence of instructions that interacts with the virtual chipset model. Thus, the software VMM executes this instruction fifteen times faster than native. The hardware VMM must perform a vmm/guest round trip to complete the I/O operation. This transition causes in to consume 15826 cycles in the tested system.

cr8wr. %cr8 is a privileged register that determines which pending interrupts can be delivered. Only %cr8 writes that reduce %cr8 below the priority of the highest pending virtual interrupt cause an exit [24]. Our FrobOS test never takes interrupts, so no %cr8 write in the test ever causes an exit. As with syscall, the hardware VMM’s performance is similar to native. The software VMM translates %cr8 writes into a short sequence of simple instructions, completing the %cr8 write in 35 cycles, about four times faster than native.

[Figure 5. Sources of virtualization overhead in an XP boot/halt. Overhead in seconds for the Software VMM and Hardware VMM across: translate, ptemod, pgfault, call/ret, cr8wr, in/out, syscall.]

call/ret. BT slows down indirect control flow. We target this overhead by repeatedly calling a subroutine. Since the hardware VMM executes calls and returns without modification, the hardware VMM and native both execute the call/return pair in 11 cycles. The software VMM introduces an average penalty of 40 cycles, requiring 51 cycles.

pgfault. In both VMMs, the software MMU interposes on both true and hidden page faults. This test targets the overheads for true page faults. While both VMM paths are logically similar, the software VMM (3927 cycles) performs much better than the hardware VMM (11242 cycles). This is due mostly to the shorter path whereby the software VMM receives control; page faults, while by no means cheap natively (1093 cycles on this hardware), are faster than a vmrun/exit round-trip.

divzero. Division by zero has fault semantics similar to those of page faults, but does not invoke the software MMU. While division by zero is uncommon in guest workloads, we include this nanobenchmark to clarify the pgfault results. It allows us to separate out the virtualization overheads caused by faults from the overheads introduced by the virtual MMU. As expected, the hardware VMM (1014 cycles) delivers near native performance (889 cycles), decisively beating the software VMM (3223 cycles).

ptemod. Both VMMs use the shadowing technique described in Section 2.4 to implement guest paging with trace-based coherency. The traces induce significant overheads for PTE writes, causing very high penalties relative to the native single cycle store. The software VMM adaptively discovers the PTE write and translates it into a small program that is cheaper than a trap but still quite costly. This small program consumes 391 cycles on each iteration. The hardware VMM enters and exits guest mode repeatedly, causing it to perform approximately thirty times worse than the software VMM, requiring 12733 cycles.

To place this data in context, Figure 5 shows the total overheads incurred by each nano-operation during a 64-bit Windows XP Professional boot/halt. Although the pgfault nanobenchmark has much higher cost on the hardware VMM than the software VMM, the boot/halt workload took so few true page faults that the difference does not affect the bottom line materially. In contrast, the guest performed over 1 million PTE modifications, causing high overheads for the hardware VMM. While the figure may suggest that in/out dominates the execution profile of the hardware VMM, the vast majority of these instructions originate in atypical BIOS code that is unused after initial boot.

Page 36:

What’s better, a shadow page table or EPT?

Page 37:

What’s better, a shadow page table or EPT?

• EPT is faster when page table contents change frequently
  • Fewer traps
• A shadow page table is faster when the page table is stable
  • Less TLB-miss overhead
  • One page table to walk through instead of two

Page 38:

Conclusion

• Virtualization transformed cloud computing and had a tremendous impact
  • Virtualization on PCs was also big, but less significant
• VMware made virtualization possible, through BT, on an architecture that couldn’t be classically virtualized (x86)
  • This prompted Intel and AMD to change their hardware, which is sometimes faster and sometimes slower than BT

Page 39:

A decade later, what’s changed?

• HW virtualization became much faster
  • Fewer traps, better microcode, more dedicated logic
  • Almost all CPU architectures support HW virtualization
  • EPT is widely available
• VMMs became commoditized
  • BT technology was hard to build
  • VMMs based on HW virtualization are much easier to implement
  • Xen, KVM, Hyper-V, etc.
• I/O devices aren’t just emulated; they can be exposed directly
  • An IOMMU provides paging protection for DMA