Page 1:

Virtualization
Adam Belay <[email protected]>

Page 2:

What is a virtual machine?

• A simulation of a computer
• Running as an application on a host computer
• Accurate
• Isolated
• Fast

Page 3:

Why use a virtual machine?

• To run multiple operating systems (e.g. Windows and Linux)
• To manage big machines (allocate cores and memory at O/S granularity)
• Kernel development (e.g. like QEMU + JOS)
• Better fault isolation (defense in depth)
• To package applications with a specific kernel version and environment
• To improve resource utilization

Page 4:

How accurate do we have to be?

• Must handle weird quirks in existing OSes
  • Even bug-for-bug compatibility

• Must maintain isolation even with malicious software
  • The guest cannot break out of the VM!

• Must be impossible for the guest to distinguish the VM from a real machine
  • Some VMs compromise, modifying the guest kernel to reduce the accuracy requirement

Page 5:

VMs are an old idea

• 1960s: IBM used VMs to share big machines
• 1970s: IBM specialized CPUs for virtualization
• 1990s: VMware repopularized VMs for x86 HW
• 2000s: AMD & Intel specialized CPUs for virtualization

Page 6:

Process Architecture

Diagram: vi, gcc, and firefox run as processes on top of the OS, which runs on the hardware.

Page 7:

VM Architecture

• What if the process abstraction looked just like HW?

Diagram: the OS (VMM) runs on the hardware; alongside ordinary processes such as vi, gcc, and firefox, each guest OS runs on its own virtual HW.

Page 8:

Comparing a process and HW

Process:
• Nonprivileged registers and instructions
• Virtual memory
• Signals
• File system and sockets

Hardware:
• All registers and instructions
• Virtual memory and MMU
• Traps and interrupts
• I/O devices and DMA

Page 9:

Can a CPU be virtualized?

Requirements to be “classically virtualizable,” defined by Popek and Goldberg in 1974:

1. Fidelity: Software on the VMM executes identically to its execution on hardware, barring timing effects.
2. Performance: An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM.
3. Safety: The VMM manages all hardware resources.

Page 10:

Why not simulation?

• VMM interprets each instruction (e.g. BOCHS)
• Maintains machine state for each register
• Emulates I/O ports and memory
• Violates the performance requirement

Page 11:

Idea: Execute guest instructions on the real CPU whenever possible

• Works fine for most instructions
  • E.g. add %eax, %ebx
• But privileged instructions could be harmful
  • Would violate the safety property

Page 12:

Idea: Run guest kernels at CPL 3

• Ordinary instructions work fine
• Privileged instructions should trap to the VMM (general protection fault)
• The VMM can apply privileged operations to “virtual” state, not to the real hardware (see the sketch below)
• This is called “trap-and-emulate”
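To make the trap-and-emulate flow concrete, here is a minimal dispatch sketch in C. The struct, opcode handling, and helper names are my own illustration, not a real VMM: the guest's privileged instruction raises #GP at CPL 3, and the handler applies its effect to virtual state and advances the guest %rip.

/* Illustrative trap-and-emulate dispatch; all names are made up.
 * Prefixes (e.g. REX) are ignored to keep the sketch short. */
#include <stdint.h>

struct vcpu {
    uint64_t rip;   /* guest instruction pointer */
    uint64_t cr3;   /* virtual %cr3: guest page-table root */
};

/* Called on a general protection fault taken while running the guest.
 * 'insn' points at the faulting instruction's bytes; 'gpr_value' is the
 * value of the source register already decoded by the caller. */
static void vmm_handle_gp(struct vcpu *v, const uint8_t *insn, uint64_t gpr_value)
{
    /* 0x0F 0x22 /r is "mov to control register". */
    if (insn[0] == 0x0F && insn[1] == 0x22) {
        int cr = (insn[2] >> 3) & 7;     /* which control register */
        if (cr == 3)
            v->cr3 = gpr_value;          /* update only the *virtual* %cr3 */
        v->rip += 3;                     /* skip the emulated instruction */
        return;
    }
    /* other privileged instructions (lgdt, lidt, hlt, ...) handled similarly */
}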

Page 13:

Trap-and-emulate example

• CLI/STI – disable and enable interrupts
• The EFLAGS IF bit tracks the current status
• The VMM maintains a virtual copy of the EFLAGS register
• The VMM controls the hardware EFLAGS
  • It probably leaves interrupts enabled even if the VM runs CLI
• The VMM looks at the virtual EFLAGS register to decide when to interrupt the guest
• The VMM must make sure the guest only sees the virtual EFLAGS (sketch below)
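A minimal sketch of this virtual-EFLAGS bookkeeping, with made-up names: cli/sti only touch the virtual IF bit, and the VMM consults that bit before injecting a queued virtual interrupt, even though the hardware IF stays enabled.

/* Illustrative virtual-interrupt bookkeeping; not a real VMM's code. */
#include <stdbool.h>
#include <stdint.h>

struct vcpu {
    uint64_t eflags;         /* virtual EFLAGS; hardware IF stays enabled */
    uint8_t  pending_vector; /* queued virtual interrupt, if any */
    bool     irq_pending;
};

#define EFLAGS_IF (1ull << 9)

/* Guest executed cli/sti: only the virtual copy changes. */
static void emulate_cli(struct vcpu *v) { v->eflags &= ~EFLAGS_IF; }
static void emulate_sti(struct vcpu *v) { v->eflags |=  EFLAGS_IF; }

/* Before resuming the guest, the VMM checks the virtual IF bit to decide
 * whether the queued virtual interrupt may be injected now. */
static bool should_inject_irq(const struct vcpu *v)
{
    return v->irq_pending && (v->eflags & EFLAGS_IF);
}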

Page 14:

What about virtual memory?

• Want to maintain the illusion that each VM has dedicated physical memory
• The guest wants to start at PA 0 and use all of RAM
• The VMM needs to support many guests; they can’t all really use the same physical addresses
• Idea:
  • Claim RAM is smaller than real RAM
  • Keep paging enabled
  • Maintain a “shadow” copy of the guest page table
  • The shadow maps VAs to different PAs than the guest requests
  • The real %CR3 points to the shadow table
  • The virtual %CR3 points to the guest page table

Page 15:

Virtualization memory diagram

Diagram:
Host page table: host virtual address → host physical address

Page 16:

Virtualization memory diagram

Diagram:
Host page table:   host virtual address → host physical address
Guest PT:          guest virtual address → guest physical address
VMM map:           guest physical address → host physical address
Shadow page table: guest virtual address → host physical address (composition of the guest PT and the VMM map)

Page 17:

Example:

• The guest wants guest-physical page @ 0x1000000
• The VMM map redirects guest-physical 0x1000000 to host-physical 0x2000000
• The VMM traps if the guest changes %cr3 or writes to the guest page table
  • It transfers each guest PTE to the shadow page table
  • It uses the VMM map to translate guest-physical page addresses in the page table to host-physical addresses (see the sketch below)
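A sketch of the shadow-PTE update this example describes, with illustrative names; the toy gpa_to_hpa() stands in for the VMM map (here it simply adds the 16 MiB offset from the example above).

/* Hypothetical shadow-PTE update, run when a trap catches the guest
 * writing one of its page-table entries.  All names are illustrative. */
#include <stdint.h>

#define PTE_FLAGS_MASK 0xFFFull      /* present, writable, user, accessed, ... */

typedef uint64_t pte_t;

/* Toy VMM map: every guest frame is offset by 16 MiB in this sketch,
 * so guest-physical 0x1000000 becomes host-physical 0x2000000. */
static uint64_t gpa_to_hpa(uint64_t gpa) { return gpa + 0x1000000; }

/* shadow[] is the table the real %cr3 points at; guest_pte is the value
 * the guest just tried to install in its own page table. */
static void shadow_pte_update(pte_t *shadow, unsigned idx, pte_t guest_pte)
{
    uint64_t gpa   = guest_pte & ~PTE_FLAGS_MASK;  /* guest-physical target */
    uint64_t flags = guest_pte &  PTE_FLAGS_MASK;
    shadow[idx] = gpa_to_hpa(gpa) | flags;         /* redirect to host-physical */
}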

Page 18:

Why can’t the VMM modify the guest page table in place?

Page 19:

Need a shadow copy of all privileged state

• So far we discussed EFLAGS and page tables
• Also need the GDT, IDT, LDTR, %CR*, etc.

Page 20:

Unfortunately, trap-and-emulate is not possible on x86

Two problems:
1. Some instructions behave differently at CPL 3 instead of trapping
2. Some registers leak state that reveals whether the CPU is running at CPL 3

• Violates the fidelity property

Page 21:

x86 isn’t classically virtualizable

Problems at CPL 3 versus CPL 0:
• mov %cs, %ax
  • %cs contains the CPL in its lower two bits (demonstration below)
• popfl/pushfl
  • Privileged bits, including EFLAGS.IF, are masked out
• iretq
  • No ring change, so it doesn’t restore SS/ESP

Page 22:

Two possible solutions

1. Binary translation
   • Rewrite offending instructions to behave correctly
2. Hardware virtualization
   • The CPU maintains shadow state internally and directly executes privileged guest instructions

Page 23:

Strawman binary translation

• Replace all instructions that cause violations with INT $3, which traps
• INT $3 is one byte, so it can fit inside any x86 instruction without changing size/layout (toy sketch below)
• But this is unrealistic
  • We don’t know the difference between code and data, or where instruction boundaries lie
• VMware’s solution is much more sophisticated
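A toy version of the strawman, with hypothetical names: patch the first byte of a known-offending instruction with int3 and remember the original byte so a #BP handler could emulate it later. The unrealistic part is exactly what the slide notes: this assumes we already know where instruction boundaries are.

/* Toy strawman patching; names and layout are illustrative only. */
#include <stdint.h>

struct patch {
    uint8_t *addr;    /* location of the privileged instruction */
    uint8_t  saved;   /* original first byte, needed for later emulation */
};

static void install_breakpoint(struct patch *p, uint8_t *insn)
{
    p->addr  = insn;
    p->saved = *insn;
    *insn    = 0xCC;  /* int3 is one byte, so size/layout is unchanged */
}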

Page 24:

VMware’s binary translator

• The kernel is translated dynamically, like a JIT
  • Idea: scan only as executed, since execution reveals instruction boundaries
  • When the VMM first loads the guest kernel, it rewrites from the entry point to the first jump
  • Most instructions translate identically
• Instructions are translated in chunks
  • Called a basic block
  • Either 12 instructions or up to the control-flow instruction, whichever occurs first (schematic loop below)
• Only guest kernel code is translated
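A schematic of the basic-block rule above, in C; decode_insn() and the emit_* helpers are placeholders for a real decoder and code emitter, not VMware's interfaces.

/* Schematic basic-block translator; helpers are declared but not defined. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct insn { size_t len; bool control_flow; /* other decoded fields ... */ };

size_t decode_insn(const uint8_t *code, struct insn *out);   /* returns length */
void   emit_ident(const struct insn *i);         /* most instructions copy through */
void   emit_control_flow(const struct insn *i);  /* rewritten with continuations */

/* Translate one basic block starting at 'code': stop after 12 instructions
 * or at the first control-flow instruction, whichever comes first. */
const uint8_t *translate_basic_block(const uint8_t *code)
{
    for (int n = 0; n < 12; n++) {
        struct insn i;
        code += decode_insn(code, &i);
        if (i.control_flow) {
            emit_control_flow(&i);
            break;
        }
        emit_ident(&i);
    }
    return code;   /* next guest address, translated lazily when reached */
}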

Page 25:

The guest kernel shares an address space with the VMM

• Uses segmentation to protect VMM memory
  • The VMM is loaded at high virtual addresses, the translated guest kernel at low addresses
  • Segment limits are programmed to “truncate” the address space, preventing all segments except %GS from reaching the VMM
• What if a guest kernel instruction uses the %GS selector?
  • %GS provides fast access to data shared between the guest kernel and the VMM
• Assumption: translated code can’t violate isolation
  • It can never directly access %GS, %CR3, the GDT, etc.

Page 26:

Why put the guest kernel and the VMM in the same address space?

Page 27:

Why put the guest kernel and the VMM in the same address space?

• Shared state becomes inexpensive to access, e.g. cli -> “vcpu.flags.IF := 0”
• Translated code is safe; it can’t violate isolation after translation

Page 28:

Translation example

• All control flow requires indirection

C source:

int isPrime(int a) {
    for (int i = 2; i < a; i++) {
        if (a % i == 0) return 0;
    }
    return 1;
}

Original: isPrime()
mov %ecx, %edi   # %ecx = %edi (a)
mov %esi, $2     # %esi = 2
cmp %esi, %ecx   # is i >= a?
jge prime        # if yes jump   <- end of basic block

Page 29:

Translation example

• All control flow requires indirection

Original: isPrime()
mov %ecx, %edi   # %ecx = %edi (a)
mov %esi, $2     # %esi = 2
cmp %esi, %ecx   # is i >= a?
jge prime        # if yes jump

Translated: isPrime()'
mov %ecx, %edi   # IDENT
mov %esi, $2
cmp %esi, %ecx
jge [takenAddr]  # JCC
jmp [fallthrAddr]

Page 30:

Translation example

• Brackets represent continuations
  • The first time they are executed, they jump into the BT and generate the next basic block
  • Can elide “jmp [fallthrAddr]” if it’s the next address translated
• Indirect control flow is harder (lookup sketch below)
  • “(jmp, call, ret) does not go to a fixed target, preventing translation-time binding. Instead, the translated target must be computed dynamically, e.g., with a hash table lookup. The resulting overhead varies by workload but is typically a single-digit percentage.” – from the paper
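A minimal sketch of the dynamic lookup the quoted passage mentions; the hash-table layout and helper names are illustrative, and the slow-path translator is only declared here.

/* Map a guest address to its translated address; fall back to the BT on miss. */
#include <stddef.h>
#include <stdint.h>

#define TC_HASH_SIZE 4096

struct tc_entry {
    uint64_t guest_pc;     /* original guest address */
    void    *translated;   /* corresponding translated code, or NULL */
};

static struct tc_entry tc_hash[TC_HASH_SIZE];

void *translate_basic_block_at(uint64_t guest_pc);   /* slow path: run the BT */

void *lookup_translated(uint64_t guest_pc)
{
    struct tc_entry *e = &tc_hash[(guest_pc >> 4) % TC_HASH_SIZE];
    if (e->guest_pc == guest_pc && e->translated)
        return e->translated;                  /* fast path: hash hit */
    e->guest_pc   = guest_pc;                  /* miss: translate and cache */
    e->translated = translate_basic_block_at(guest_pc);
    return e->translated;
}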

Page 31:

Hardware virtualization

• The CPU maintains a guest copy of privileged state in a special region called the virtual machine control structure (VMCS)
• The CPU operates in two modes
  • VMX non-root mode: runs the guest kernel
  • VMX root mode: runs the VMM
• Hardware saves and restores privileged register state to and from the VMCS as it switches modes
• Each mode has its own separate privilege rings

• Net effect: hardware can run most privileged guest instructions directly, without emulation (conceptual run loop below)
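A conceptual root-mode run loop, assuming made-up structure and helper names (not Intel's or KVM's actual interfaces): the VMM enters non-root mode, the guest runs directly until something forces an exit, and the VMM handles the exit reason and resumes.

/* Conceptual VMX-style run loop; every name here is illustrative. */
#include <stdint.h>

struct vmcs_like {
    uint64_t guest_rip, guest_rsp, guest_cr3;   /* guest privileged state */
};

enum exit_reason { EXIT_IO, EXIT_PAGE_FAULT, EXIT_OTHER };

/* vmenter(): hardware loads guest state from the VMCS-like area, switches
 * to non-root mode, runs the guest until an exit, then saves guest state
 * back and returns to the VMM in root mode with the exit reason. */
enum exit_reason vmenter(struct vmcs_like *vmcs);
void emulate_io(struct vmcs_like *vmcs);
void handle_guest_page_fault(struct vmcs_like *vmcs);

void run_vcpu(struct vmcs_like *vmcs)
{
    for (;;) {
        switch (vmenter(vmcs)) {   /* guest runs directly until an exit */
        case EXIT_IO:         emulate_io(vmcs);              break;
        case EXIT_PAGE_FAULT: handle_guest_page_fault(vmcs); break;
        default:              /* cpuid, hlt, interrupts, ... */ break;
        }
    }
}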

Page 32:

What about the MMU?

• Hardware effectively maintains two page tables (see the two-level walk sketch below)
  • The normal page table, controlled by the guest kernel
  • The extended page table (EPT), controlled by the VMM
• EPT didn’t exist when VMware published the paper

Diagram:
Guest PT: guest virtual address → guest physical address
EPT:      guest physical address → host physical address
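A schematic of the resulting two-dimensional translation, with walk() standing in for a full 4-level page walk; the names are illustrative. In hardware, each guest page-table access made during the guest walk is itself translated through the EPT, which is why TLB misses cost more than with a single table.

/* Schematic two-dimensional address translation with EPT. */
#include <stdint.h>

uint64_t walk(uint64_t table_root, uint64_t addr);   /* one 4-level page walk */

uint64_t translate(uint64_t guest_cr3, uint64_t ept_root, uint64_t gva)
{
    uint64_t gpa = walk(guest_cr3, gva);  /* guest PT: guest virtual -> guest physical */
    return walk(ept_root, gpa);           /* EPT: guest physical -> host physical */
}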

Page 33:

What’s better, HW or SW virtualization?

Page 34:

What’s better, HW or SW virtualization?

• Software virtualization advantages
  • Trap emulation: most traps can be replaced with callouts
  • Emulation speed: BT can generate purpose-built emulation code; hardware traps must decode the instruction, etc.
  • Callout avoidance: sometimes BT can even inline callouts

• Hardware virtualization advantages
  • Code density: translated code requires more instructions and larger opcodes
  • Precise exceptions: BT must perform extra work to recover guest state
  • System calls: don’t require VMM intervention

Page 35:

What’s better, HW or SW virtualization?

[Figure 4. Virtualization nanobenchmarks. CPU cycles on a log scale (smaller is better) for Native, Software VMM, and Hardware VMM across: ptemod, divzero, pgfault, call/ret, cr8wr, in, syscall.]

[…] between the two VMMs, the hardware VMM inducing approximately 4.4 times greater overhead than the software VMM. Still, this program stresses many divergent paths through both VMMs, such as system calls, context switching, creation of address spaces, modification of traced page table entries, and injection of page faults.

6.3 Virtualization nanobenchmarks

To better understand the performance differences between the two VMMs, we wrote a series of “nanobenchmarks” that each exercise a single virtualization-sensitive operation. Often, the measured operation is a single instruction long. For precise control over the executed code, we repurposed a custom OS, FrobOS, that VMware developed for VMM testing.

Our modified FrobOS boots, establishes a minimal runtime environment for C code, calibrates its measurement loops, and then executes a series of virtualization-sensitive operations. The test repeats each operation many times, amortizing the cost of the binary translator’s adaptations over multiple iterations. In our experience, this is representative of guest behavior, in which adaptation converges on a small fraction of poorly behaving guest instructions. The results of these nanobenchmarks are presented in Figure 4. The large spread of cycle counts requires the use of a logarithmic scale.
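For flavor, here is a generic user-space cycle-counting harness in the spirit of these nanobenchmarks. It is an illustration only (FrobOS itself is not public) and assumes an x86 host with rdtsc and a GCC/Clang toolchain.

/* Generic cycle-counting harness; not FrobOS code. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define ITERS 1000000ull

int main(void)
{
    uint64_t start = rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* the single virtualization-sensitive operation under test would go
         * here, e.g. a syscall, a PTE write, or an in from port 0x80 */
        __asm__ volatile ("" ::: "memory");   /* keep the loop from being elided */
    }
    uint64_t cycles = rdtsc() - start;
    printf("%llu cycles/op\n", (unsigned long long)(cycles / ITERS));
    return 0;
}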

syscall. This test measures round-trip transitions from user level to supervisor level via the syscall and sysret instructions. The software VMM introduces a layer of code and an extra privilege transition, requiring approximately 2000 more cycles than a native system call. In the hardware VMM, system calls execute without VMM intervention, so as we expect, the hardware VMM executes system calls at native speed.

in. We execute an in instruction from port 0x80, the BIOS POST port. Native execution accesses an off-CPU register in the chipset, requiring 3209 cycles. The software VMM, on the other hand, translates the in into a short sequence of instructions that interacts with the virtual chipset model. Thus, the software VMM executes this instruction fifteen times faster than native. The hardware VMM must perform a vmm/guest round trip to complete the I/O operation. This transition causes in to consume 15826 cycles in the tested system.

cr8wr. %cr8 is a privileged register that determines which pending interrupts can be delivered. Only %cr8 writes that reduce %cr8 below the priority of the highest pending virtual interrupt cause an exit [24]. Our FrobOS test never takes interrupts, so no %cr8 write in the test ever causes an exit. As with syscall, the hardware VMM’s performance is similar to native. The software VMM translates %cr8 writes into a short sequence of simple instructions, completing the %cr8 write in 35 cycles, about four times faster than native.

[Figure 5. Sources of virtualization overhead in an XP boot/halt. Overhead in seconds for the Software VMM and Hardware VMM across: translate, ptemod, pgfault, call/ret, cr8wr, in/out, syscall.]

call/ret. BT slows down indirect control flow. We target this overhead by repeatedly calling a subroutine. Since the hardware VMM executes calls and returns without modification, the hardware VMM and native both execute the call/return pair in 11 cycles. The software VMM introduces an average penalty of 40 cycles, requiring 51 cycles.

pgfault. In both VMMs, the software MMU interposes on both true and hidden page faults. This test targets the overheads for true page faults. While both VMM paths are logically similar, the software VMM (3927 cycles) performs much better than the hardware VMM (11242 cycles). This is due mostly to the shorter path whereby the software VMM receives control; page faults, while by no means cheap natively (1093 cycles on this hardware), are faster than a vmrun/exit round-trip.

divzero. Division by zero has fault semantics similar to those of page faults, but does not invoke the software MMU. While division by zero is uncommon in guest workloads, we include this nanobenchmark to clarify the pgfault results. It allows us to separate out the virtualization overheads caused by faults from the overheads introduced by the virtual MMU. As expected, the hardware VMM (1014 cycles) delivers near native performance (889 cycles), decisively beating the software VMM (3223 cycles).

ptemod. Both VMMs use the shadowing technique described in Section 2.4 to implement guest paging with trace-based coherency. The traces induce significant overheads for PTE writes, causing very high penalties relative to the native single cycle store. The software VMM adaptively discovers the PTE write and translates it into a small program that is cheaper than a trap but still quite costly. This small program consumes 391 cycles on each iteration. The hardware VMM enters and exits guest mode repeatedly, causing it to perform approximately thirty times worse than the software VMM, requiring 12733 cycles.

To place this data in context, Figure 5 shows the total overheads incurred by each nano-operation during a 64-bit Windows XP Professional boot/halt. Although the pgfault nanobenchmark has much higher cost on the hardware VMM than the software VMM, the boot/halt workload took so few true page faults that the difference does not affect the bottom line materially. In contrast, the guest performed over 1 million PTE modifications, causing high overheads for the hardware VMM. While the figure may suggest that in/out dominates the execution profile of the hardware VMM, the vast majority of these instructions originate in atypical BIOS code that is unused after initial boot.

Page 36:

What’s better, a shadow page table or EPT?

Page 37:

What’s better, a shadow page table or EPT?

• EPT is faster when page table contents change frequently
  • Fewer traps
• A shadow page table is faster when the page table is stable
  • Less TLB-miss overhead
  • One page table to walk through instead of two

Page 38:

Conclusion

• Virtualization transformed cloud computing and had a tremendous impact
  • Virtualization on PCs was also big, but less significant
• VMware made virtualization possible, through BT, on an architecture that couldn’t be classically virtualized (x86)
  • This prompted Intel and AMD to change their hardware, which is sometimes faster and sometimes slower than BT

Page 39:

A decade later, what’s changed?

• HW virtualization became much faster
  • Fewer traps, better microcode, more dedicated logic
  • Almost all CPU architectures support HW virtualization
  • EPT is widely available
• VMMs became commoditized
  • BT technology was hard to build
  • VMMs based on HW virtualization are much easier to implement
  • Xen, KVM, Hyper-V, etc.
• I/O devices aren’t just emulated; they can be exposed directly
  • An IOMMU provides paging protection for DMA