Duke Systems: Intro to Clouds. Jeff Chase, Dept. of Computer Science, Duke University.


Virtual machines, Part 1

The story so far: OS platforms

• OS platforms let us run programs in contexts.

• Contexts are protected/isolated to varying degrees.

• The OS platform TCB offers APIs to create and manipulate protected contexts.

– It enforces isolation of contexts for running programs.

– It governs access to hardware resources.

• Classical example:

– Unix context: process

– Unix TCB: kernel

– Unix kernel API: syscalls

The story so far: layered platforms

• We can layer “new” platforms on “old” ones.

– The outer layer hides the inner layer,

– covering the inner APIs and abstractions, and

– replacing them with the model of the new platform.

• Example: Android over Linux

Android AMS

JVM+lib

Native virtual machines (VMs)

• Slide a hypervisor underneath the kernel.

– New OS/TCB layer: virtual machine monitor (VMM).

• Kernel and processes run in a virtual machine (VM).

– The VM “looks the same” to the OS as a physical machine.

– The VM is a sandboxed/isolated context for an entire OS.

• A VMM can run multiple VMs on a shared computer.

[Figure: a host hypervisor/VMM runs three guest (tenant) VM contexts. Guest VM1, VM2, and VM3 each run their own OS kernel (1, 2, 3) and their own processes (P1A, P2B, P3C).]

What is a “program” for a VM?

VMM/hypervisor is a new layer of OS platform, with a new kind of protected context. What is a program?

What kind of program do we launch into a VM context?

guest kernel

app

hypervisor/VMM

app app

It’s called a virtual appliance or VM image.

A VM is called an instance of the image.


A virtual appliance contains a complete OS system image, with file tree and apps.

[Graphics are from rPath inc. and VMware inc.]

Thank you, VMware

When virtual is better than real

Motivation: support multiple OS

When virtual is better than real: everyone plays nicely together

[image from virtualbox.org]

The story so far: protected CPU mode

[Figure: timeline of a CPU alternating between user mode and kernel mode. Events shown: u-start, syscall trap, fault, clock interrupt, interrupt return, u-return. Kernel mode is split into the kernel “top half” and the kernel “bottom half” (interrupt handlers).]

Kernel handler manipulates CPU register context to return to selected user context.

Any kind of machine exception transfers control to a registered (trusted) kernel handler running in a protected CPU mode.

A closer look

[Figure: the same timeline in more detail (boot, u-start, syscall trap, fault, clock interrupt, interrupt return, u-return), showing the user stack, the kernel stack, and the handler dispatch table.]

IA/x86 Protection Rings (CPL)

• Modern CPUs have multiple protected modes.

• History: IA/x86 rings (CPL)

– Four built-in privilege levels (Rings 0, 1, 2, 3)

– Ring 0 – “Kernel mode” (most privileged)

– Ring 3 – “User mode”

• Unix uses only two modes:

– user – untrusted execution

– kernel – trusted execution

Increasing Privilege Level

Ring 0

Ring 1

Ring 2

Ring 3

CPU Privilege Level (CPL)

[Fischbach]

Protection Rings

• New Intel VT and AMD SVM CPUs introduce new protected modes for VMM hypervisors.

• We can think of it as a new inner ring: one ring to bind them all.

• Warning: this is an oversimplification: the actual architecture is more complex for backward compatibility.

hypervisor

kernel

user

hypervisor

guest

user

Protection Rings

• Computer scientists have drawn these rings since the 1960s.

• They represent layering: the outer ring “hides” the interface of the inner ring.

• The machine defines the events (exceptions) that transition to higher privilege (inner ring).

• Inner rings register handlers to intercept selected events.

• But the picture is misleading….

Increasing Privilege Level

Ring 0

Ring 1

Ring 2

Ring 3

[Fischbach]

Protection Rings

• We might just as well draw it “inside out”.

• Now the ring represents power: what the code at that ring can access or modify.

• Bigger rings have more power.

• Inclusion: bigger rings can see or do anything that the smaller rings can do.

• And they can manipulate the state of the rings they contain.

• But still misleading: there are multiple ‘instances’ of the weaker rings.

hypervisor

guest

user

Maybe a better picture…

There are multiple ‘instances’ of the weaker rings.

And powers are nested: an outer ring limits the “sandbox” or scope of the rings it contains.

Post-note

• The remaining slides in the section are just more slides to reinforce these concepts.

• We didn’t see them in class.

• There is more detail in the reading…

[Figure: a CPU core with registers R0 through Rn, a PC, and a mode field.]

CPU mode (a field in some status register) indicates whether a machine CPU (core) is running in a user program or in the protected kernel.

Some instructions or register accesses are legal only when the CPU (core) is executing in kernel mode.

CPU mode transitions to kernel mode only on machine exception events (trap, fault, interrupt), which transfer control to a handler registered by the kernel with the machine at boot time.

So only the kernel program chooses what code ever runs in the kernel mode (or so we hope and intend).

A kernel handler can read the user register values at the time of the event, and modify them arbitrarily before (optionally) returning to user mode.

Kernel Mode

U/K
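
The mode rules above can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyCPU`, its method names, and the event names are hypothetical and do not correspond to any real instruction set. It shows that handlers are registered at boot in kernel mode, that an exception is the only path into kernel mode, and that a handler may read and modify user registers before returning.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

USER, KERNEL = "user", "kernel"


@dataclass
class ToyCPU:
    """Hypothetical single-core CPU: a mode field, registers, handler table."""
    mode: str = KERNEL  # the machine boots in kernel mode
    regs: Dict[str, int] = field(default_factory=lambda: {"R0": 0, "PC": 0})
    handlers: Dict[str, Callable[["ToyCPU"], None]] = field(default_factory=dict)

    def register_handler(self, event: str, handler) -> None:
        # Privileged operation: legal only in kernel mode, else it faults.
        if self.mode != KERNEL:
            self.raise_exception("fault")
            return
        self.handlers[event] = handler

    def raise_exception(self, event: str) -> None:
        # The only way into kernel mode: switch mode, then run the handler.
        self.mode = KERNEL
        self.handlers[event](self)

    def return_to_user(self) -> None:
        self.mode = USER


log = []


def on_trap(cpu: ToyCPU) -> None:
    log.append(("trap", cpu.mode, cpu.regs["R0"]))  # read the user register
    cpu.regs["R0"] = 42                             # modify it, e.g. a result
    cpu.return_to_user()


def on_fault(cpu: ToyCPU) -> None:
    log.append(("fault", cpu.mode))
    cpu.return_to_user()


cpu = ToyCPU()
cpu.register_handler("trap", on_trap)    # boot time, kernel mode
cpu.register_handler("fault", on_fault)
cpu.return_to_user()

cpu.regs["R0"] = 7
cpu.raise_exception("trap")                    # user code issues a syscall trap
cpu.register_handler("trap", lambda c: None)   # user code tries a privileged op
```

The last line does not replace the trap handler: because the CPU is in user mode, the attempt is turned into a fault that the kernel handles.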

Exceptions: trap, fault, interrupt

• Synchronous (caused by an instruction)

– Intentional (happens every time): trap. A system call: open, close, read, write, fork, exec, exit, wait, kill, etc.

– Unintentional (contributing factors): fault. Invalid or protected address or opcode, page fault, overflow, etc.

• Asynchronous (caused by some other event)

– Intentional (happens every time): “software interrupt”. Software requests an interrupt to be delivered at a later time.

– Unintentional (contributing factors): interrupt. Caused by an external event: I/O op completed, clock tick, power fail, etc.

Kernel Stacks and Trap/Fault Handling

[Figure labels: data, user and kernel stacks, syscall dispatch table]

Processes execute user code on a user stack in the user virtual memory in the process virtual address space.

Each process has a second kernel stack in kernel space (virtual memory accessible only to the kernel).

System calls and faults run in kernel mode on the process kernel stack.

Kernel code running in P’s process context (i.e., on its kstack) has access to P’s virtual memory.

The syscall handler makes an indirect call through the system call dispatch table to the handler registered for the specific system call.
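
The indirect call through the dispatch table can be sketched as follows. This is a minimal illustration, not a real kernel interface: the syscall numbers, the `proc` dictionary, and the handler names are all hypothetical.

```python
# Hypothetical syscall numbers; real kernels define their own numbering.
SYS_GETPID, SYS_WRITE = 0, 1


def sys_getpid(proc):
    return proc["pid"]


def sys_write(proc, fd, data):
    # Append to an in-memory "file" and report how much was written.
    proc["files"][fd].append(data)
    return len(data)


# The dispatch table: one registered handler per system call number.
SYSCALL_TABLE = {SYS_GETPID: sys_getpid, SYS_WRITE: sys_write}


def syscall_trap_handler(proc, number, *args):
    """Runs "in kernel mode" on behalf of proc after a syscall trap."""
    handler = SYSCALL_TABLE.get(number)
    if handler is None:
        return -1  # unknown syscall number: return an error to the caller
    return handler(proc, *args)  # the indirect call through the table


proc = {"pid": 7, "files": {1: []}}
pid = syscall_trap_handler(proc, SYS_GETPID)
written = syscall_trap_handler(proc, SYS_WRITE, 1, "hi")
bad = syscall_trap_handler(proc, 99)
```

The trap handler itself never needs to know what each call does; adding a system call means adding one table entry.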

More on VMs

Recent CPUs support additional protected mode(s) for hypervisors. When the hypervisor initializes, it selects some set of event types to intercept, and registers handlers for them.

Selected machine events occurring in user mode or kernel mode transfer control to a hypervisor handler. For example, a guest OS kernel accessing device registers may cause the physical machine to invoke the hypervisor to intervene.

In addition, the VM architecture has another level of indirection in the MMU page tables: the hypervisor can specify and restrict what parts of physical memory are visible to each guest VM.

A guest VM kernel can map to or address a physical memory frame or command device DMA I/O to/from a physical frame if and only if the hypervisor permits it.

If any guest VM tries to do anything weird, then the hypervisor regains control and can see or do anything to any part of the physical or virtual machine state before (optionally) restarting the guest VM.
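
The extra level of indirection in the page tables can be sketched as a two-stage lookup. This is a toy model under stated assumptions: the class and method names are hypothetical, page tables are plain dictionaries, and a real MMU performs both stages in hardware.

```python
PAGE = 4096


class ToyHypervisor:
    """Hypothetical model: the hypervisor owns the second translation stage."""

    def __init__(self):
        # Per-VM map from guest "physical" frame to real host frame.
        self.second_stage = {}

    def grant(self, vm, guest_frame, host_frame):
        self.second_stage.setdefault(vm, {})[guest_frame] = host_frame

    def translate(self, vm, guest_page_table, vaddr):
        vpage, offset = divmod(vaddr, PAGE)
        # Stage 1: controlled by the guest kernel (it may map anything).
        guest_frame = guest_page_table[vpage]
        # Stage 2: controlled only by the hypervisor.
        host_frame = self.second_stage.get(vm, {}).get(guest_frame)
        if host_frame is None:
            # The hypervisor regains control instead of the access going through.
            raise PermissionError(f"{vm}: guest frame {guest_frame} not granted")
        return host_frame * PAGE + offset


hv = ToyHypervisor()
hv.grant("vm1", guest_frame=0, host_frame=5)

guest_table = {0: 0, 1: 9}  # the guest maps page 1 to a frame it was never given
ok_addr = hv.translate("vm1", guest_table, 0x10)

try:
    hv.translate("vm1", guest_table, PAGE)
    denied = False
except PermissionError:
    denied = True
```

Whatever the guest writes into its own page table, an address reaches physical memory only if the second-stage map has an entry for it.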

If you are interested…

2.1 The Intel VT-x Extension

In order to improve virtualization performance and simplify VMM implementation, Intel has developed VT-x [37], a virtualization extension to the x86 ISA. AMD also provides a similar extension with a different hardware interface called SVM [3].

The simplest method of adapting hardware to support virtualization is to introduce a mechanism for trapping each instruction that accesses privileged state so that emulation can be performed by a VMM. VT-x embraces a more sophisticated approach, inspired by IBM’s interpretive execution architecture [31], where as many instructions as possible, including most that access privileged state, are executed directly in hardware without any intervention from the VMM. This is possible because hardware maintains a “shadow copy” of privileged state. The motivation for this approach is to increase performance, as traps can be a significant source of overhead.

VT-x adopts a design where the CPU is split into two operating modes: VMX root and VMX non-root mode. VMX root mode is generally used to run the VMM and does not change CPU behavior, except to enable access to new instructions for managing VT-x. VMX non-root mode, on the other hand, restricts CPU behavior and is intended for running virtualized guest OSes.

Transitions between VMX modes are managed by hardware. When the VMM executes the VMLAUNCH or VMRESUME instruction, hardware performs a VM entry; placing the CPU in VMX non-root mode and executing the guest. Then, when action is required from the VMM, hardware performs a VM exit, placing the CPU back in VMX root mode and jumping to a VMM entry point. Hardware automatically saves and restores most architectural state during both types of transitions. This is accomplished by using buffers in a memory resident data structure called the VM control structure (VMCS).

In addition to storing architectural state, the VMCS contains a myriad of configuration parameters that allow the VMM to control execution and specify which type of events should generate VM exits. This gives the VMM considerable flexibility in determining which hardware is exposed to the guest. For example, a VMM could configure the VMCS so that the HLT instruction causes a VM exit or it could allow the guest to halt the CPU. However, some hardware interfaces, such as the interrupt descriptor table (IDT) and privilege modes, are exposed implicitly in VMX non-root mode and never generate VM exits when accessed. Moreover, a guest can manually request a VM exit by using the VMCALL instruction.

Virtual memory is perhaps the most difficult hardware feature for a VMM to expose safely. A straw man solution would be to configure the VMCS so that the guest has access to the page table root register, %CR3. However, this would place complete trust in the guest because it would be possible for it to configure the page table to access any physical memory address, including memory that belongs to the VMM. Fortunately, VT-x includes a dedicated hardware mechanism, called the extended page table (EPT), that can enforce memory isolation on guests with direct access to virtual memory. It works by applying a second, underlying, layer of address translation that can only be configured by the VMM. AMD’s SVM includes a similar mechanism to the EPT, referred to as a nested page table (NPT).

From Dune: Safe User-level Access to Privileged CPU Features, Belay et al. (Stanford), OSDI, October 2012

VT in a Nutshell

• New VM mode bit

– Orthogonal to kernel/user mode or rings (CPL)

• If VM mode is off

– Machine looks just like it always did

• If VM bit is on

– Machine is running a guest VM

– “VMX non-root operation”

– Various events cause gated entry into hypervisor

– “virtualization intercept”

– Hypervisor can control which events cause intercepts

– Hypervisor can examine/manipulate guest VM state
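
The idea that the hypervisor chooses which events cause intercepts can be sketched as a toy run loop. This is only an illustration: the event names and `run_guest` are hypothetical, and nothing here models real VMCS fields or the VMLAUNCH/VMRESUME instructions.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_guest(events: List[str],
              intercepts: Set[str],
              handlers: Dict[str, Callable[[str], str]]) -> List[Tuple[str, ...]]:
    """Replay a guest's events; intercepted ones cause a (simulated) VM exit."""
    log = []
    for event in events:
        if event in intercepts:
            # Gated entry into the hypervisor, which handles the event
            # and then resumes the guest.
            action = handlers[event](event)
            log.append(("vm-exit", event, action))
        else:
            # Not intercepted: the event is handled without leaving the guest.
            log.append(("guest", event))
    return log


guest_events = ["add", "hlt", "io", "add"]
handlers = {
    "hlt": lambda e: "schedule another VM",
    "io": lambda e: "emulate device",
}

# Configuration A: intercept both HLT and device I/O.
trace_a = run_guest(guest_events, {"hlt", "io"}, handlers)
# Configuration B: let the guest halt the CPU itself; intercept only I/O.
trace_b = run_guest(guest_events, {"io"}, handlers)
```

The same guest behaves differently under the two configurations, which is the flexibility the HLT example in the reading describes.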

Services, Part 2

There is another motivation for VMs and hypervisors. Application services and computational jobs need access to computing power “on tap”. Virtualization allows the owner of a server to “slice and dice” server resources and allocate the virtual slices out to customers as VMs. The customers can install and manage their own software their own way in their own VMs. That is cloud hosting.

Services

RPC

GET (HTTP)

End-to-end application delivery

Cloud and Software-as-a-Service (SaaS)

Rapid evolution, no user upgrade, no user data management.

Agile/elastic deployment on virtual infrastructure.

Where is your application?

Where is your data?

Where is your OS?

Networking

[Figure labels: channel, binding, connection, endpoint, port; node A, node B]

Some IPC mechanisms allow communication across a network.

E.g.: sockets using Internet communication protocols (TCP/IP).

Each endpoint on a node (host) has a port number.

Each node has one or more interfaces, each on at most one network.

Each interface may be reachable on its network by one or more names.

E.g. an IP address and an (optional) DNS name.

Operations:

• advertise (bind)

• listen

• connect (bind)

• close

• write/send

• read/receive
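
The socket operations listed above map directly onto Python's standard `socket` module. A minimal sketch, assuming both endpoints run on the same host over the loopback interface; the uppercase "service" is just a placeholder.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection, read a message, send back its uppercase form."""
    conn, _peer = listener.accept()
    with conn:
        data = conn.recv(1024)        # read/receive
        conn.sendall(data.upper())    # write/send


# Server endpoint: advertise (bind) and listen.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))       # port 0 asks the OS for any free port
listener.listen(1)
host, port = listener.getsockname()   # the endpoint's IP address and port

server = threading.Thread(target=serve_once, args=(listener,))
server.start()

# Client endpoint: connect (the OS binds a local port implicitly).
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect((host, port))
client.sendall(b"hello")
reply = client.recv(1024)
client.close()                        # close

server.join()
listener.close()
```

The server's address and port are what a client must learn, for example through a DNS name, before it can connect.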

SaaS platform elements

[wiki.eeng.dcu.ie]

“Classical OS”

browser

container

[Graphic from Amazon: Mike Culver, Web Scale Computing]

Motivation: “Success disaster”


Virtual Cloud hosting, Part 2

“Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

- US National Institute of Standards and Technology http://www.csrc.nist.gov/groups/SNS/cloud-computing/

Client Server(s)

Cloud > server-based computing

• Client/server model (1980s - )

• Now called Software-as-a-Service (SaaS)

Cloud Provider(s)

Host

Guest

Client Service

Host/guest model

• Service is hosted by a third party.

– flexible programming model

– cloud APIs for service to allocate/link resources

– on-demand: pay as you grow

OS

VMM

Physical

Platform

Client Service

IaaS: infrastructure services

Deployment of private clouds is growing rapidly w/ open IaaS cloud software.

Hosting performance and isolation are determined by the virtualization layer.

Virtual machines: VMware, KVM, etc.

OS

VMM (optional)

Physical

Platform

Client Service

PaaS cloud services define the high-level programming models, e.g., for clusters or specific application classes.

PaaS: platform services

Hadoop, grids, batch job services, etc. can also be viewed as part of the PaaS category.

Note: can deploy them over IaaS.

• Varying workload, fixed system: varying performance

• Varying workload, varying system: fixed performance

• Varying workload, varying system: target performance

“Elastic Cloud”

Resource Control

Elastic provisioning

Managing Energy and Server Resources in Hosting Centers, SOSP, October 2001.

EC2 The canonical public cloud

Virtual Appliance

Image

OpenStack, the Cloud Operating System

Management Layer That Adds Automation & Control

[Anthony Young @ Rackspace]

IaaS Cloud APIs (OpenStack, EC2)

• Query of availability zones (i.e. clusters in Eucalyptus)

• SSH public key management (add, list, delete)

• VM management (start, list, stop, reboot, get console output)

• Security group management

• Volume and snapshot management (attach, list, detach, create, bundle, delete)

• Image management (bundle, upload, register, list, deregister)

• IP address management (allocate, associate, list, release)

Adding storage

Competing Cloud Models: PaaS vs. IaaS

• Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.

• Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Amazon Elastic Compute Cloud (EC2), Eucalyptus, OpenNebula

Post-note

• The remaining slides weren’t discussed.

• Some give more info on the various forms of cloud computing following the NIST model. Just understand IaaS and PaaS hosting models.

• The “Adaptation” slides deal with resource management: what assurances does the holder of virtual infrastructure have about how much resource it will receive, and how good its performance will (therefore) be? We’ll discuss this more later.

• The last slide refers to an advanced cloud project at Duke and RENCI.org, partially funded by NSF Global Environment for Network Innovations (geni.net).

Managing images

• “Let a thousand flowers bloom.”

• Curated image collections are needed!

• “Virtual appliance marketplace”

Infrastructure as a Service (IaaS)

“Consumers of IaaS have access to virtual computers, network-accessible storage, network infrastructure components, and other fundamental computing resources…and are billed according to the amount or duration of the resources consumed.”

Cloud Models

• Cloud Software as a Service (SaaS)

– Use provider’s applications over a network

• Cloud Platform as a Service (PaaS)

– Deploy customer-created applications to a cloud

• Cloud Infrastructure as a Service (IaaS)

– Rent processing, storage, network capacity, and other fundamental computing resources

NIST Cloud Definition Framework

• Deployment Models: Private Cloud, Community Cloud, Public Cloud, Hybrid Clouds

• Service Models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS)

• Essential Characteristics: On Demand Self-Service, Broad Network Access, Rapid Elasticity, Resource Pooling, Measured Service

• Common Characteristics: Massive Scale, Resilient Computing, Homogeneity, Geographic Distribution, Virtualization, Service Orientation, Low Cost Software, Advanced Security

Adaptations: Describing IaaS Services

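[Figure: a computer’s resources (CPU, memory, disk, BW) described as vectors. Allocations ra=(8,4), rb=(4,8), and rc=(4,4) are plotted on axes of CPU shares and memory shares, with 16 marked on an axis.]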
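
One way to read the resource vectors in this figure is as (CPU shares, memory shares) pairs that must fit within a host's capacity. The sketch below works through that arithmetic. It assumes the 16 in the figure is the capacity on each axis, which the extracted slide text does not confirm; the helper names are hypothetical.

```python
def add(u, v):
    """Component-wise sum of two resource vectors."""
    return tuple(a + b for a, b in zip(u, v))


def sub(u, v):
    """Component-wise difference of two resource vectors."""
    return tuple(a - b for a, b in zip(u, v))


def fits(demand, available):
    """A demand fits only if it fits in every dimension."""
    return all(d <= a for d, a in zip(demand, available))


capacity = (16, 16)   # ASSUMPTION: 16 CPU shares and 16 memory shares
ra, rb, rc = (8, 4), (4, 8), (4, 4)

used = add(ra, rb)                 # a and b placed on the same host
remaining = sub(capacity, used)    # what is left for a third slice
c_fits = fits(rc, remaining)
big_fits = fits((8, 2), remaining) # too much CPU, even though memory is free
```

Under that assumption, a and b together use (12, 12), leaving exactly (4, 4) for c. A request fails if it exceeds the remainder in any single dimension.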

Adaptations: service classes

• Must adaptations promise performance isolation?

• There is a wide range of possible service classes…to the extent that we can reason about them.

Continuum of service classes:

• Available surplus

• Weak effort

• Best effort

• Proportional share

• Elastic reservation

• Hard reservation

Reflects load factor or overbooking degree

Reflects priority

Constructing “slices”

• I like to use TinkerToys as a metaphor for creating a slice in the GENI federated cloud.

• The parts are virtual infrastructure resources: compute, networking, storage, etc.

• Parts come in many types, shapes, sizes.

• Parts interconnect in various ways.

• We combine them to create useful built-to-order assemblies.

• Some parts are programmable.

• Where do the parts come from?
