Page 1: Cloud Computing

Cloud Computing: Past, Present, and Future

Professor Anthony D. Joseph*, UC Berkeley
Reliable Adaptive Distributed Systems Lab

RWTH Aachen, 22 March 2010

*Director, Intel Research Berkeley
http://abovetheclouds.cs.berkeley.edu/

Page 2: Cloud Computing

RAD Lab 5-year Mission

Enable 1 person to develop, deploy, and operate a next-generation Internet application

• Key enabling technology: statistical machine learning
– debugging, monitoring, power management, auto-configuration, performance prediction, ...

• Highly interdisciplinary faculty & students
– PIs: Patterson/Fox/Katz (systems/networks), Jordan (machine learning), Stoica (networks & P2P), Joseph (security), Shenker (networks), Franklin (DB)
– 2 postdocs, ~30 PhD students, ~6 undergrads

• Grad/Undergrad teaching integrated with research

Page 3: Cloud Computing

Course Timeline

• Friday
– 10:00-12:00 History of Cloud Computing: time-sharing, virtual machines, datacenter architectures, utility computing
– 12:00-13:30 Lunch
– 13:30-15:00 Modern Cloud Computing: economics, elasticity, failures
– 15:00-15:30 Break
– 15:30-17:00 Cloud Computing Infrastructure: networking, storage, computation models

• Monday
– 10:00-12:00 Cloud Computing research topics: scheduling, multiple datacenters, testbeds

Page 4: Cloud Computing

NEXUS: A COMMON SUBSTRATE FOR CLUSTER COMPUTING

Joint work with Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Scott Shenker, and Ion Stoica

Page 5: Cloud Computing

Recall: Hadoop on HDFS

[Diagram: each slave node runs a datanode daemon and a tasktracker on top of the Linux file system; the namenode runs the namenode daemon; a job submission node runs the jobtracker.]

Adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under a Creative Commons Attribution 3.0 License)

Page 6: Cloud Computing

Problem

• Rapid innovation in cluster computing frameworks
• No single framework optimal for all applications
• Energy efficiency means maximizing cluster utilization
• Want to run multiple frameworks in a single cluster

Page 7: Cloud Computing

What do we want to run in the cluster?

Dryad

Apache Hama

Pregel

Pig

Page 8: Cloud Computing

Why share the cluster between frameworks?

• Better utilization and efficiency (e.g., take advantage of diurnal patterns)

• Better data sharing across frameworks and applications

Page 9: Cloud Computing

Solution

Nexus is an “operating system” for the cluster over which diverse frameworks can run

– Nexus multiplexes resources between frameworks
– Frameworks control job execution

Page 10: Cloud Computing

Goals

• Scalable
• Robust (i.e., simple enough to harden)
• Flexible enough for a variety of different cluster frameworks
• Extensible enough to encourage innovative future frameworks

Page 11: Cloud Computing

Question 1: Granularity of Sharing

Option: Coarse-grained sharing
– Give a framework a (slice of a) machine for its entire duration

[Diagram: the cluster statically partitioned into slices, one per framework instance (Hadoop 1, Hadoop 2, Hadoop 3).]

Data locality compromised if machine held for long time

Hard to account for new frameworks and changing demands -> hurts utilization and interactivity

Page 12: Cloud Computing

Question 1: Granularity of Sharing

Nexus: Fine-grained sharing
– Support frameworks that use smaller tasks (in time and space) by multiplexing them across all available resources

Frameworks can take turns accessing data on each node

Can resize frameworks' shares to get utilization & interactivity

[Diagram: tasks from Hadoop 1, Hadoop 2, and Hadoop 3 interleaved across every node in the cluster.]

Page 13: Cloud Computing

Question 2: Resource Allocation

Option: Global scheduler
– Frameworks express their needs in a specification language; a global scheduler matches resources to frameworks

• Requires encoding a framework’s semantics using the language, which is complex and can lead to ambiguities

• Restricts frameworks if specification is unanticipated

Designing a general-purpose global scheduler is hard

Page 14: Cloud Computing

Question 2: Resource Allocation

Nexus: Resource offers
– Offer free resources to frameworks; let frameworks pick which resources best suit their needs

+ Keeps Nexus simple and allows us to support future jobs
– Distributed decisions might not be optimal

Page 15: Cloud Computing

Outline

• Nexus Architecture
• Resource Allocation
• Multi-Resource Fairness
• Implementation
• Results

Page 16: Cloud Computing

NEXUS ARCHITECTURE

Page 17: Cloud Computing

Overview

[Diagram: a Nexus master coordinates several Nexus slaves. Framework schedulers (Hadoop v19, Hadoop v20, MPI) submit their jobs to the master; each slave hosts the corresponding framework executors (Hadoop v19, Hadoop v20, MPI), which run the frameworks' tasks.]

Page 18: Cloud Computing

Resource Offers

[Diagram: the Nexus master picks a framework to offer to and sends it a resource offer; the MPI and Hadoop schedulers manage their jobs, whose executors and tasks run on the Nexus slaves.]

Page 19: Cloud Computing

Resource Offers

[Diagram: as before, the Nexus master picks a framework to offer to and sends it a resource offer.]

offer = list of {machine, free_resources}

Example: [ {node 1, <2 CPUs, 4 GB>}, {node 2, <2 CPUs, 4 GB>} ]
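To make the offer format concrete, here is a minimal Python sketch of the data structure described above; the class and field names are illustrative, not the actual Nexus API.

```python
# Minimal sketch (illustrative names, not the actual Nexus API): a resource
# offer is a list of per-machine free resources sent to one framework scheduler.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MachineOffer:
    machine: str                      # e.g., "node 1"
    free_resources: Dict[str, float]  # e.g., {"cpus": 2, "mem_gb": 4}

# The example from the slide: two nodes, each with 2 CPUs and 4 GB free.
offer: List[MachineOffer] = [
    MachineOffer("node 1", {"cpus": 2, "mem_gb": 4}),
    MachineOffer("node 2", {"cpus": 2, "mem_gb": 4}),
]

# A framework scheduler accepts the subset it wants for its tasks;
# whatever it declines goes back into the pool for other frameworks.
```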

Page 20: Cloud Computing

Resource Offers

[Diagram: the framework scheduler performs framework-specific scheduling over the offer it receives; the Nexus master picks which framework to offer to, and the slaves launch and isolate the executors that run the tasks.]

Page 21: Cloud Computing

Resource Offer Details

• Min and max task sizes to control fragmentation
• Filters let a framework restrict the offers sent to it
– By machine list
– By quantity of resources
• Timeouts can be added to filters
• Frameworks can signal when to destroy filters, or when they want more offers
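A hedged sketch of how such a filter might be expressed; the class and method names are hypothetical, not the real Nexus interface.

```python
# Illustrative filter sketch (hypothetical names, not the Nexus API): a filter
# tells the master which offers to withhold from a framework, and can expire.
import time

class OfferFilter:
    def __init__(self, skip_machines=None, min_resources=None, timeout_s=None):
        self.skip_machines = set(skip_machines or [])   # restrict by machine list
        self.min_resources = min_resources or {}        # restrict by resource quantity
        self.expires_at = time.time() + timeout_s if timeout_s else None

    def suppresses(self, machine, free):
        """Return True if an offer for `machine` should not be sent."""
        if self.expires_at is not None and time.time() > self.expires_at:
            return False                                # expired filters no longer apply
        if machine in self.skip_machines:
            return True
        return any(free.get(r, 0) < need for r, need in self.min_resources.items())
```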

Page 22: Cloud Computing

Using Offers for Data Locality

We found that a simple policy called delay scheduling can give very high locality:

– The framework waits for offers on nodes that have its data
– If it has waited longer than a certain delay, it starts launching non-local tasks
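The sketch below illustrates the delay-scheduling rule in a few lines of Python; the job fields and the 5-second default are assumptions for illustration, not the actual implementation.

```python
# Delay scheduling sketch (assumed job fields, illustrative only): accept a
# data-local offer immediately, and accept a non-local offer only after the
# job has already waited longer than the configured delay.
import time

def should_launch(job, offered_machine, delay_s=5.0, now=None):
    now = time.time() if now is None else now
    if offered_machine in job.preferred_nodes:   # node holds the job's data
        job.waiting_since = now                  # reset the wait clock
        return True
    if now - job.waiting_since > delay_s:        # waited long enough; go non-local
        return True
    return False                                 # decline and keep waiting
```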

Page 23: Cloud Computing

Framework Isolation

• The isolation mechanism is pluggable due to the inherent performance/isolation tradeoff

• The current implementation supports Solaris projects and Linux containers
– Both isolate CPU, memory, and network bandwidth
– Linux developers are working on disk I/O isolation

• Other options: VMs, Solaris zones, policing

Page 24: Cloud Computing

RESOURCE ALLOCATION

Page 25: Cloud Computing

Allocation Policies

• Nexus picks the framework to offer resources to, and hence controls how many resources each framework can get (but not which)
• Allocation policies are pluggable to suit organization needs, through allocation modules
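As a rough illustration of what a pluggable allocation module could look like, here is a hedged sketch; the interface and the fair-share rule shown are assumptions, not the actual Nexus plugin API.

```python
# Hypothetical allocation-module interface (not the real Nexus API): the policy
# only chooses WHICH framework gets the next offer, not which resources it takes.
class AllocationModule:
    def next_framework(self, frameworks):
        raise NotImplementedError

class FairShareModule(AllocationModule):
    def next_framework(self, frameworks):
        # Offer to the framework currently furthest below its target share.
        return min(frameworks,
                   key=lambda f: f.current_share / max(f.target_share, 1e-9))
```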

Page 26: Cloud Computing

Example: Hierarchical Fairshare Policy

[Figure: an example organization hierarchy and the resulting cluster shares. Facebook.com splits the cluster 20% to the Spam department and 80% to the Ads department; within Spam, User 1 gets 70% and User 2 gets 30%, for effective cluster shares of 14% and 6%. A companion plot shows cluster utilization over time as the users' jobs (Job 1-4) come and go.]
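The effective shares in the figure follow directly from multiplying shares down the hierarchy; a quick check, using only the percentages shown:

```python
# Hierarchical fair share: a user's effective cluster share is the product of
# the shares along the path from the root to that user.
spam_dept, ads_dept = 0.20, 0.80       # Facebook.com splits 20%/80%
user1, user2 = 0.70, 0.30              # Spam dept. splits 70%/30% between its users

print(f"User 1: {spam_dept * user1:.0%}")   # 14%, as in the figure
print(f"User 2: {spam_dept * user2:.0%}")   # 6%, as in the figure
```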

Page 27: Cloud Computing

Revocation

Killing tasks to make room for other users

Not the normal case, because fine-grained tasks enable quick reallocation of resources

Sometimes necessary:
– Long-running tasks never relinquishing resources
– Buggy job running forever
– Greedy user who decides to make his tasks long

Page 28: Cloud Computing

Revocation Mechanism

The allocation policy defines a safe share for each user
– Users will get at least their safe share within a specified time

Revoke only if a user is below its safe share and is interested in offers
– Revoke tasks from users farthest above their safe share
– Frameworks are warned before their tasks are killed
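A compact sketch of this policy; the data structures (user allocations, task lists, a warn hook) are assumptions made for illustration, not the actual Nexus code.

```python
# Revocation sketch (assumed data structures): revoke only when a user is below
# its safe share and wants offers, taking tasks from the users farthest above
# their own safe shares, and warning each framework before a task is killed.
def tasks_to_revoke(users, needy, amount_needed):
    if needy.allocation >= needy.safe_share or not needy.wants_offers:
        return []                                    # no revocation needed
    victims = sorted(users, key=lambda u: u.allocation - u.safe_share, reverse=True)
    revoked, reclaimed = [], 0.0
    for user in victims:
        if user is needy or user.allocation <= user.safe_share:
            continue
        for task in list(user.tasks):
            if reclaimed >= amount_needed or user.allocation <= user.safe_share:
                break
            task.warn()                              # framework warned before the kill
            revoked.append(task)
            reclaimed += task.resources
            user.allocation -= task.resources
    return revoked
```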

Page 29: Cloud Computing

How Do We Run MPI?

Users are always told their safe share
– Avoid revocation by staying below it

Giving each user a small safe share may not be enough if jobs need many machines

Can run a traditional grid or HPC scheduler as a user with a larger safe share of the cluster, and have MPI jobs queue up on it
– E.g., Torque gets 40% of the cluster

Page 30: Cloud Computing

Example: Torque on Nexus

[Diagram: Torque runs as a Nexus framework with a safe share of 40% of the cluster; MPI jobs queue up on Torque, while the Facebook.com hierarchy (Spam and Ads departments with their users and jobs) shares the rest.]

Page 31: Cloud Computing

MULTI-RESOURCE FAIRNESS

Page 32: Cloud Computing

What is Fair?

• Goal: define a fair allocation of resources in the cluster between multiple users
• Example: suppose we have:
– 30 CPUs and 30 GB RAM
– Two users with equal shares
– User 1 needs <1 CPU, 1 GB RAM> per task
– User 2 needs <1 CPU, 3 GB RAM> per task
• What is a fair allocation?

Page 33: Cloud Computing

Definition 1: Asset Fairness

• Idea: give weights to resources (e.g., 1 CPU = 1 GB) and equalize the value of the resources given to each user
• Algorithm: when resources are free, offer them to whoever has the least value
• Result:
– U1: 12 tasks: 12 CPUs, 12 GB ($24)
– U2: 6 tasks: 6 CPUs, 18 GB ($24)

PROBLEM: User 1 has < 50% of both CPUs and RAM

[Chart: User 1's and User 2's resulting shares of CPU and RAM.]
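A quick check of the numbers above, under the slide's equal weighting of 1 CPU = 1 GB:

```python
# Asset fairness check: both users end up with $24 worth of resources,
# and the allocation fits in the 30-CPU / 30-GB cluster.
u1_tasks, u2_tasks = 12, 6
assert u1_tasks * (1 + 1) == u2_tasks * (1 + 3) == 24   # equal "asset" value
assert u1_tasks + u2_tasks <= 30                        # CPUs used: 18
assert u1_tasks * 1 + u2_tasks * 3 <= 30                # GB of RAM used: 30
```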

Page 34: Cloud Computing

Lessons from Definition 1

• “You shouldn’t do worse than if you ran a smaller, private cluster equal in size to your share”
• Thus, given N users, each user should get ≥ 1/N of his dominating resource (i.e., the resource that he consumes most of)

Page 35: Cloud Computing

Def. 2: Dominant Resource Fairness

• Idea: give every user an equal share of her dominant resource (i.e., the resource she consumes most of)
• Algorithm: when resources are free, offer them to the user with the smallest dominant share (i.e., the fractional share of her dominant resource)
• Result:
– U1: 15 tasks: 15 CPUs, 15 GB
– U2: 5 tasks: 5 CPUs, 15 GB

[Chart: each user ends up with 50% of her dominant resource (CPU for User 1, RAM for User 2).]
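The greedy rule above is easy to simulate; the sketch below reproduces the slide's 30 CPU / 30 GB example and is only an illustration of DRF, not the Nexus implementation.

```python
# Dominant Resource Fairness sketch: repeatedly hand one task's worth of
# resources to the user with the smallest dominant share, until nothing fits.
capacity = {"cpus": 30.0, "mem_gb": 30.0}
demands = {"U1": {"cpus": 1, "mem_gb": 1},   # dominant resource: CPU (tied with RAM)
           "U2": {"cpus": 1, "mem_gb": 3}}   # dominant resource: RAM
used = {u: {r: 0.0 for r in capacity} for u in demands}
tasks = {u: 0 for u in demands}

def dominant_share(user):
    return max(used[user][r] / capacity[r] for r in capacity)

def fits(demand):
    free = {r: capacity[r] - sum(used[u][r] for u in used) for r in capacity}
    return all(free[r] >= demand[r] for r in capacity)

while True:
    user = min(demands, key=dominant_share)  # smallest dominant share goes next
    if not fits(demands[user]):
        break
    for r in capacity:
        used[user][r] += demands[user][r]
    tasks[user] += 1

print(tasks)   # {'U1': 15, 'U2': 5}: each user holds 50% of its dominant resource
```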

Page 36: Cloud Computing

Fairness Properties

[Table: fairness properties (Pareto efficiency, single-resource fairness, bottleneck fairness, share guarantee, population monotonicity, envy-freedom, resource monotonicity) compared across the Asset, Dynamic, CEEI, and DRF schedulers. All four satisfy Pareto efficiency and single-resource fairness; none satisfies resource monotonicity.]

Page 37: Cloud Computing

IMPLEMENTATION

Page 38: Cloud Computing

Implementation Stats

7000 lines of C++

APIs in C, C++, Java, Python, Ruby

Executor isolation using Linux containers and Solaris projects

Page 39: Cloud Computing

Frameworks

Ported frameworks:
– Hadoop (900-line patch)
– MPI (160-line wrapper scripts)

New frameworks:
– Spark, a Scala framework for iterative jobs (1300 lines)
– Apache+haproxy, an elastic web server farm (200 lines)

Page 40: Cloud Computing

RESULTS

Page 41: Cloud Computing

Overhead

[Charts: completion times with and without Nexus. MPI LINPACK: 50.9 s unmodified vs. 51.8 s on Nexus. Hadoop WordCount: 159.9 s unmodified vs. 166.2 s on Nexus.]

Less than 4% overhead seen in practice

Page 42: Cloud Computing

Dynamic Resource Sharing

[Chart: share of the cluster over time (roughly 0-350 s) for MPI, Hadoop, and Spark frameworks dynamically sharing resources.]

Page 43: Cloud Computing

Multiple Hadoops Experiment

[Diagram: three Hadoop instances (Hadoop 1, Hadoop 2, Hadoop 3), each confined to its own static partition of the cluster.]

Page 44: Cloud Computing

Multiple Hadoops Experiment

[Diagram: the same three Hadoop instances with their tasks finely interleaved across all nodes of the cluster.]

Page 45: Cloud Computing

Results with 16 Hadoops

[Charts: data locality and performance for 16 Hadoop instances. Percent local maps: 18% for separate Hadoops, 50% on Nexus with no delay scheduling, 92% with 1 s delay scheduling, 97% with 5 s delay scheduling. Job running times: 565 s, 486 s, 369 s, and 338 s, respectively.]

Page 46: Cloud Computing

WEB SERVER FARM FRAMEWORK

Page 47: Cloud Computing

Web Framework Experiment

[Diagram: a load-generation framework (httperf tasks driven by a load-gen scheduler and executors) sends HTTP requests to the web server farm; an haproxy-based scheduler performs load calculation and uses Nexus resource offers and task status updates to manage the Apache web executor tasks running on the Nexus slaves.]

Page 48: Cloud Computing

Web Framework Results

Page 49: Cloud Computing

Future Work

Experiment with parallel programming models

Further explore low-latency services on Nexus (web applications, etc.)

Shared services (e.g., BigTable, GFS)

Deploy to users and open source

Page 50: Cloud Computing

CLOUD COMPUTING TESTBEDS

Page 51: Cloud Computing

OPEN CIRRUS™: SEIZING THE OPEN SOURCE CLOUD STACK OPPORTUNITY

A JOINT INITIATIVE SPONSORED BY HP, INTEL, AND YAHOO!

http://opencirrus.org/

Page 52: Cloud Computing

Proprietary cloud computing stacks (only the top layer is publicly accessible):

GOOGLE
– Applications
– Application frameworks: MapReduce, Sawzall, Google App Engine, Protocol Buffers
– Software infrastructure: VM management; job scheduling (Borg); storage management (GFS, BigTable); monitoring (Borg)
– Hardware infrastructure: Borg

AMAZON
– Applications
– Application frameworks: EMR (Hadoop)
– Software infrastructure: VM management (EC2); job scheduling; storage management (S3, EBS); monitoring
– Hardware infrastructure

MICROSOFT
– Applications
– Application frameworks: .NET Services
– Software infrastructure: VM management (Fabric Controller); job scheduling (Fabric Controller); storage management (SQL Services, blobs, tables, queues); monitoring (Fabric Controller)
– Hardware infrastructure: Fabric Controller

Page 53: Cloud Computing

Open cloud computing stacks (heavily fragmented today!):

– Applications
– Application frameworks: Pig, Hadoop, MPI, Sprout, Mahout
– Software infrastructure: VM management (Eucalyptus, Enomalism, Tashi, Reservoir, Nimbus, oVirt); job scheduling (Maui/Torque); storage management (HDFS, KFS, Gluster, Lustre, PVFS, MooseFS, HBase, Hypertable); monitoring (Ganglia, Nagios, Zenoss, MON, Moara)
– Hardware infrastructure: PRS, Emulab, Cobbler, xCat

Page 54: Cloud Computing

Open Cirrus™ Cloud Computing Testbed

Shared: research, applications, infrastructure (12K cores), data sets

Global services: sign-on, monitoring, store. Open-source stack (PRS, Tashi, Hadoop)

Sponsored by HP, Intel, and Yahoo! (with additional support from NSF)

• 9 sites currently, with a target of around 20 in the next two years

Page 55: Cloud Computing

Open Cirrus Goals

• Goals
• Foster new systems and services research around cloud computing
• Catalyze an open-source stack and APIs for the cloud

• How are we unique?
• Support for systems research and applications research
• Federation of heterogeneous datacenters

Page 56: Cloud Computing

Open Cirrus Organization

• Central Management Office oversees Open Cirrus
• Currently owned by HP

• Governance model
• Research team
• Technical team
• New site additions
• Support (legal (export, privacy), IT, etc.)

• Each site
• Runs its own research and technical teams
• Contributes individual technologies
• Operates some of the global services

• E.g.,
• HP site supports the portal and PRS
• Intel site is developing and supporting Tashi
• Yahoo! contributes to Hadoop

Page 57: Cloud Computing

Intel BigData Open Cirrus Site

[Diagram: the site connects to the Internet over a 45 Mb/s T3 link. Racks include blade racks of 40 nodes, 1U/2U racks of 15 nodes, a mobile rack of 8 nodes, and 3U storage racks of 5 nodes with 12 x 1 TB disks each, all connected through 48 Gb/s rack switches over 1 Gb/s point-to-point links, with per-port monitored and controlled PDUs. Node configurations range from single-core Xeon (Irwindale/Pentium 4, 6 GB DRAM) through dual- and quad-core Xeon 5160/E5345/E5420/E5440 to quad-core Xeon E5520 (Nehalem-EP) with 16 GB DRAM and 6 x 1 TB disks. Totals: 198 nodes, 1,364 cores, 1,768 GB DRAM, 746 spindles, 610 TB of storage.]

http://opencirrus.intel-research.net

Page 58: Cloud Computing

Open Cirrus Sites

Site | #Cores | #Srvrs | Public | Memory | Storage | Spindles | Network | Focus
HP | 1,024 | 256 | 178 | 3.3 TB | 632 TB | 1,152 | 10G internal, 1 Gb/s x-rack | Hadoop, Cells, PRS, scheduling
IDA | 2,400 | 300 | 100 | 4.8 TB | 43 TB + 16 TB SAN | 600 | 1 Gb/s | Apps based on Hadoop, Pig
Intel | 1,364 | 198 | 145 | 1.8 TB | 610 TB local, 60 TB attached | 746 | 1 Gb/s | Tashi, PRS, MPI, Hadoop
KIT | 2,048 | 256 | 128 | 10 TB | 1 PB | 192 | 1 Gb/s | Apps with high throughput
UIUC | 1,024 | 128 | 64 | 2 TB | ~500 TB | 288 | 1 Gb/s | Datasets, cloud infrastructure
CMU | 1,024 | 128 | 64 | 2 TB | -- | -- | 1 Gb/s | Storage, Tashi
Yahoo (M45) | 3,200 | 480 | 400 | 2.4 TB | 1.2 PB | 1,600 | 1 Gb/s | Hadoop on demand
Total | 12,074 | 1,746 | 1,029 | 26.3 TB | 4 PB | | |

Page 59: Cloud Computing

Testbed Comparison

Testbed | Type of research | Approach | Participants | Distribution
Open Cirrus | Systems & services | Federation of heterogeneous data centers | HP, Intel, IDA, KIT, UIUC, Yahoo!, CMU | 7 (9) sites, 1,746 nodes, 12,074 cores
IBM/Google | Data-intensive applications research | A cluster supported by Google and IBM | IBM, Google, Stanford, U. Wash, MIT | 1 site
TeraGrid | Scientific applications | Multi-site heterogeneous clusters, supercomputers | Many schools and orgs | 11 partners in the US
PlanetLab | Systems and services | A few 100 nodes hosted by research institutions | Many schools and orgs | >700 nodes worldwide
EmuLab | Systems | A single-site cluster with flexible control | University of Utah | >300 nodes at Utah
Open Cloud Consortium | Interoperability across clouds using open APIs | Multi-site heterogeneous clusters, focus on the network | 4 centers | 480 cores, distributed in four locations
Amazon EC2 | Commercial use | Raw access to virtual machines | Amazon | 1 site
LANL/NSF cluster | Systems | Re-use of LANL's retiring clusters | CMU, LANL, NSF | 1 site, 1000s of older, still useful nodes

Page 60: Cloud Computing

Open Cirrus Stack

[Diagram: the base of the Open Cirrus stack: compute, network, and storage resources plus power and cooling, with a management and control subsystem exposed as the Physical Resource Set (Zoni) service.]

Credit: John Wilkes (HP)

Page 61: Cloud Computing

Open Cirrus Stack

[Diagram: on top of the Zoni service sit PRS clients, each with its own "physical data center"; examples include research use, Tashi, an NFS storage service, and an HDFS storage service.]

Page 62: Cloud Computing

Open Cirrus Stack

[Diagram: the same stack with virtual clusters (e.g., Tashi) added on top of the PRS clients.]

Page 63: Cloud Computing

Open Cirrus Stack

[Diagram: the stack in use:]

1. Application running
2. On Hadoop
3. On a Tashi virtual cluster
4. On a PRS
5. On real hardware

Page 64: Cloud Computing

Open Cirrus Stack

[Diagram: the same stack, adding experiment save/restore on top of the BigData application running on Hadoop.]

Page 65: Cloud Computing

Open Cirrus Stack

[Diagram: the same stack, adding platform services.]

Page 66: Cloud Computing

Open Cirrus Stack

[Diagram: the same stack, adding user services.]

Page 67: Cloud Computing

Open Cirrus Stack

[Diagram: the complete Open Cirrus stack: Zoni, PRS clients (research, Tashi, NFS and HDFS storage services), virtual clusters, Hadoop and the BigData application, experiment save/restore, platform services, and user services.]

Page 68: Cloud Computing

System Organization

• Compute nodes are divided into dynamically allocated, VLAN-isolated PRS subdomains

• Apps switch back and forth between virtual and physical

[Diagram: example subdomains: open service research, Tashi development, proprietary service research, apps running in a VM management infrastructure (e.g., Tashi), open workload monitoring and trace collection, and a production storage service.]

Page 69: Cloud Computing

Open Cirrus stack - Zoni

• Zoni service goals
• Provide mini-datacenters to researchers
• Isolate experiments from each other
• Provide a stable base for other research

• Zoni service approach
• Allocate sets of physically co-located nodes, isolated inside VLANs

• Zoni code from HP is being merged into the Tashi Apache project and extended by Intel
• Running on the HP site
• Being ported to the Intel site
• Will eventually run on all sites

Page 70: Cloud Computing

Open Cirrus Stack - Tashi

• An open-source Apache Software Foundation project sponsored by Intel (with CMU, Yahoo!, HP)
• Infrastructure for cloud computing on Big Data
• http://incubator.apache.org/projects/tashi

• Research focus:
• Location-aware co-scheduling of VMs, storage, and power
• Seamless physical/virtual migration

• Joint with Greg Ganger (CMU), Mor Harchol-Balter (CMU), Milan Milenkovic (CTG)

Page 71: Cloud Computing

Tashi High-Level Design

[Diagram: a Cluster Manager (CM) and a Scheduler coordinate a set of nodes running a Storage Service and a Virtualization Service.]

• Cluster nodes are assumed to be commodity machines
• Services are instantiated through virtual machines
• Data location and power information is exposed to the scheduler and services
• The CM maintains databases and routes messages; its decision logic is limited
• Most decisions happen in the scheduler, which manages compute, storage, and power in concert
• The storage service aggregates the capacity of the commodity nodes to house Big Data repositories

Page 72: Cloud Computing

Location Matters (calculated)

[Chart: calculated per-disk throughput (MB/s) for a 40-rack x 30-node x 2-disk cluster under random vs. location-aware placement. Location-aware placement improves throughput by 3.6x (Disk-1G), 11x (SSD-1G), 3.5x (Disk-10G), and 9.2x (SSD-10G).]

Page 73: Cloud Computing


Open Cirrus Stack - Hadoop

• An open-source Apache Software Foundation project sponsored by Yahoo!

• http://wiki.apache.org/hadoop/ProjectDescription

• Provides a parallel programming model (MapReduce), a distributed file system (HDFS), and a parallel database
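As a reminder of what the MapReduce model looks like, here is a tiny word-count example written as plain Python functions rather than against the actual Hadoop API.

```python
# MapReduce illustration (plain Python, not Hadoop code): map emits (word, 1)
# pairs, the framework groups them by key, and reduce sums each group.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

lines = ["the cloud", "the cluster"]
groups = defaultdict(list)
for line in lines:                      # "map" phase
    for key, value in map_fn(line):
        groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())   # "reduce" phase
print(result)   # {'the': 2, 'cloud': 1, 'cluster': 1}
```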

Page 74: Cloud Computing

What kinds of research projects are Open Cirrus sites looking for?

• Open Cirrus is seeking research in the following areas (different centers will weight these differently):
• Datacenter federation
• Datacenter management
• Web services
• Data-intensive applications and systems

• The following kinds of projects are generally not of interest:
• Traditional HPC application development
• Production applications that just need lots of cycles
• Closed-source system development

Page 75: Cloud Computing

How do users get access to Open Cirrus sites?

• Project PIs apply to each site separately.

• Contact names, email addresses, and web links for applications to each site will be available on the Open Cirrus web site (which goes live Q2 09)
– http://opencirrus.org

• Each Open Cirrus site decides which users and projects get access to its site.

• Developing a global sign-on for all sites (Q2 09)
– Users will be able to log in to each Open Cirrus site for which they are authorized using the same login and password.

Page 76: Cloud Computing

Summary and Lessons

• Intel is collaborating with HP and Yahoo! to provide a cloud computing testbed for the research community

• Using the cloud as an accelerator for interactive streaming/big data apps is an important usage model

• Primary goals are to:
• Foster new systems research around cloud computing
• Catalyze an open-source reference stack and APIs for the cloud
– Access model, local and global services, application frameworks
• Explore location-aware and power-aware workload scheduling
• Develop integrated physical/virtual allocations to combat cluster squatting
• Design cloud storage models
– GFS-style storage systems are not mature, and the impact of SSDs is unknown
• Investigate new application framework alternatives to MapReduce/Hadoop

Page 77: Cloud Computing

OTHER CLOUD COMPUTING RESEARCH TOPICS: ISOLATION AND DC ENERGY

Page 78: Cloud Computing

Heterogeneity in Virtualized Environments

• VM technology isolates CPU and memory, but disk and network are shared
– Full bandwidth when there is no contention
– Equal shares when there is contention

• 2.5x performance difference

[Chart: I/O performance per VM (MB/s) for 1-7 VMs on a physical host, measured on EC2 small instances.]

Page 79: Cloud Computing

Isolation Research

• Need predictable variance over raw performance
• Some resources that people have run into problems with:
– Power, disk space, disk I/O rate (drive, bus), memory space (user/kernel), memory bus, cache at all levels (TLB, etc.), hyperthreading, CPU rate, interrupts
– Network: NIC (Rx/Tx), switch, cross-datacenter, cross-country
– OS resources: file descriptors, ports, sockets

Page 80: Cloud Computing

Datacenter Energy

• EPA, 8/2007:
– 1.5% of total U.S. energy consumption
– Growing from 60 to 100 billion kWh in 5 years
– 48% of a typical IT budget spent on energy

• 75 MW of new DC deployments in PG&E's service area, and those are just the ones they know about! (expect another 2x)

• Microsoft: $500M new Chicago facility
– Three substations with a capacity of 198 MW
– 200+ shipping containers with 2,000 servers each
– Overall growth of 20,000/month

Page 81: Cloud Computing


Power/Cooling Issues

Page 82: Cloud Computing

First Milestone: DC Energy Conservation

• DCs limited by power
– For each dollar spent on servers, add $0.48 (2005) / $0.71 (2010) for power and cooling
– $26B spent to power and cool servers in 2005, growing to $45B in 2010

• Within DC racks, network equipment is often the "hottest" component in the hot spot

Page 83: Cloud Computing

Thermal Image of Typical Cluster Rack

[Thermal image of a typical cluster rack, with the rack switch highlighted as the hottest spot.]

M. K. Patterson, A. Pratt, P. Kumar, "From UPS to Silicon: an end-to-end evaluation of datacenter efficiency", Intel Corporation

Page 84: Cloud Computing

DC Networking and Power

• Selectively power down ports/portions of network elements
• Enhanced power-awareness in the network stack
– Power-aware routing and support for system virtualization
• Support for datacenter "slice" power-down and restart
– Application- and power-aware media access/control
• Dynamic selection of full/half duplex
• Directional asymmetry to save power, e.g., 10 Gb/s send, 100 Mb/s receive
– Power-awareness in applications and protocols
• Hard state (proxying), soft state (caching), protocol/data "streamlining" for power as well as bandwidth reduction
• Power implications for topology design
– Tradeoffs in redundancy/high-availability vs. power consumption
– VLAN support for power-aware system virtualization

Page 85: Cloud Computing

Summary

• Many areas for research into Cloud Computing!
– Datacenter design, languages, scheduling, isolation, energy efficiency (at all levels)

• Opportunities to try out research at scale!
– Amazon EC2, Open Cirrus, ...

Page 86: Cloud Computing

UC Berkeley

Thank you!

[email protected]
http://abovetheclouds.cs.berkeley.edu/
