Cluster Computing Overview
CS444I Internet Services, Winter 00
© 1999-2000 Armando Fox  [email protected]
© 1999, Armando Fox
Today’s Outline
Clustering: the Holy Grail
The Case For NOW
Clustering and Internet Services
Meeting the Cluster Challenges
Clustering: Holy Grail
Goal: Take a cluster of commodity workstations and make them look like a supercomputer.
Problems
  Application structure
  Partial failure management
  Interconnect technology
  System administration
Cluster Prehistory: Tandem NonStop
Early (1974) foray into transparent fault tolerance through redundancy
  Mirror everything (CPU, storage, power supplies…), can tolerate any single fault (later: processor duplexing)
  “Hot standby” process pair approach
What’s the difference between high availability and fault tolerance?
Noteworthy
  “Shared nothing”--why?
  Performance and efficiency costs?
  Later evolved into Tandem Himalaya, which used clustering for both higher performance and higher availability
Pre-NOW Clustering in the 90’s
IBM Parallel Sysplex and DEC OpenVMS
  Targeted at conservative (read: mainframe) customers
  Shared disks allowed under both (why?)
  All devices have cluster-wide names (shared everything?)
  1500 installations of Sysplex, 25,000 of OpenVMS Cluster
Programming the clusters
  All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware
  OpenVMS: cluster support exists even in single-node OS!
  An advantage of locking into proprietary interfaces
What about fault tolerance?
The Case For NOW: MPP’s a Near Miss
uproc perf. improves 50% / yr (4%/month)
  1 year lag: WS = 1.50 MPP node perf.
  2 year lag: WS = 2.25 MPP node perf.
No economy of scale in 100s => +$
Software incompatibility (OS & apps) => +$$$$
More efficient utilization of compute resources (statistical multiplexing)
“Scale makes availability affordable” (Pfister)
Which of these do commodity clusters actually solve?
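The lag figures on this slide follow from compounding the 50%/yr growth rate. Here is a minimal sketch of that arithmetic. The function names are illustrative and not from the lecture. Note that strict monthly compounding of 50%/yr works out to roughly 3.4%/month, so the 4%/month on the slide reads as a rounded figure.

```python
# Annual single-processor performance growth quoted on the slide (50% per year).
ANNUAL_GROWTH = 1.50


def workstation_advantage(lag_years: float, annual_growth: float = ANNUAL_GROWTH) -> float:
    """Speed of a current workstation relative to an MPP node whose
    processor is lag_years behind the commodity part."""
    return annual_growth ** lag_years


def monthly_rate(annual_growth: float = ANNUAL_GROWTH) -> float:
    """Compound monthly growth rate equivalent to the annual growth factor."""
    return annual_growth ** (1 / 12) - 1


if __name__ == "__main__":
    for lag in (1, 2):
        print(f"{lag} year lag: WS = {workstation_advantage(lag):.2f} x MPP node perf.")
    print(f"compound monthly rate: {monthly_rate():.1%}")
```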
Philosophy: “Systems of Systems”
Higher Order systems research: aggressively use off-the-shelf hardware and OS software
Advantages:
  easier to track technological advances
  less development time
  easier to transfer technology (reduce lag)
New challenges (“the case against NOW”):
  maintaining performance goals
  system is changing underneath you
  underlying system has other people's bugs
  underlying system is poorly documented
Clusters: “Enhanced Standard Litany”
Hardware redundancy
Aggregate capacity
Incremental scalability
Absolute scalability
Price/performance sweet spot
Software engineering
Partial failure management
Incremental scalability
System administration
Heterogeneity
Clustering and Internet Services
Aggregate capacity
  TB of disk storage, THz of compute power (if we can harness in parallel!)
Redundancy
  Partial failure behavior: only small fractional degradation from loss of one node
  Availability: industry average across “large” sites during 1998 holiday season was 97.2% availability (source: CyberAtlas)
  Compare: mission-critical systems have “four nines” (99.99%)
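To put the two availability figures side by side, it helps to convert them into downtime per year. The sketch below does that, and also shows the "small fractional degradation" claim under one simplifying assumption (not stated on the slide): load spreads evenly and every node contributes equal capacity. All names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of unavailability per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def capacity_after_failures(nodes: int, failed: int = 1) -> float:
    """Fraction of aggregate capacity left after `failed` nodes are lost,
    assuming identical nodes and perfectly even load (an idealization)."""
    if not 0 <= failed <= nodes:
        raise ValueError("failed must be between 0 and nodes")
    return (nodes - failed) / nodes


if __name__ == "__main__":
    print(f"97.2%  -> {downtime_hours_per_year(0.972):.1f} hours/year")
    print(f"99.99% -> {downtime_hours_per_year(0.9999) * 60:.1f} minutes/year")
    print(f"100 nodes, 1 lost -> {capacity_after_failures(100):.0%} capacity")
```

Under these assumptions, 97.2% corresponds to about ten days of downtime a year, while four nines corresponds to under an hour.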
Spike Absorption
Internet traffic is self-similar
  Bursty at all granularities less than about 24 hours
  What’s bad about burstiness?
Spike Absorption
  Diurnal variation: peak vs. average demand typically a factor of 3 or more
  Starr Report: CNN peaked at 20M hits/hour (compared to usual peak of 12M hits/hour; that’s +66%)
Really the holy grail: capacity on demand
  Is this realistic?
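The numbers above can be checked with a few lines of arithmetic. This is a rough sketch that assumes capacity is fixed and provisioned exactly for the daily peak, which is an illustrative simplification and not something the slide states. The function names are hypothetical.

```python
def spike_increase(spike_peak: float, usual_peak: float) -> float:
    """Fractional increase of a spike over the usual peak load."""
    return (spike_peak - usual_peak) / usual_peak


def idle_fraction_at_average(peak_to_average: float) -> float:
    """Fraction of capacity idle at average load, if capacity is sized
    exactly for the peak (assumed fixed provisioning)."""
    return 1.0 - 1.0 / peak_to_average


if __name__ == "__main__":
    # Starr Report day at CNN: 20M hits/hour vs. a usual peak of 12M hits/hour.
    print(f"spike over usual peak: {spike_increase(20e6, 12e6):.1%}")
    # Diurnal peak-to-average ratio of 3, as quoted on the slide.
    print(f"idle at average load:  {idle_fraction_at_average(3):.1%}")
```

With a peak-to-average ratio of 3, a site sized for its daily peak leaves about two thirds of its capacity idle at average load, and still falls short on a Starr Report day.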
Diurnal Cycle (UCB dialups, Jan. 1997)
~750 modems at UC Berkeley
Instrumented early 1997
Clustering and Internet Workloads
Internet vs. “traditional” workloads
  e.g. Database workloads (TPC benchmarks)
  e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
Some characteristic differences
  Read mostly
  Quality of service (best-effort vs. guarantees)
  Task granularity
“Embarrassingly parallel”
  …but are they balanced? (we’ll return to this later)
Meeting the Cluster Challenges
Software & programming models
Partial failure and application semantics
System administration
Software Challenges
Message-passing & Active Messages
Shared memory: Network RAM
  CC-NUMA, Software DSM: Anyone who thinks cache misses can take milliseconds is an idiot. (Paraphrasing Larry McVoy at OSDI 96)
MP vs SM a long-standing religious debate
Arbitrary object migration (“network transparency”)
  What are the problems with this?
  Hints: RPC, checkpointing, residual state
Partial Failure Management
What does partial failure mean for…
  a transactional database?
  A read-only database striped across cluster nodes?
  A compute-intensive shared service?
What are appropriate “partial failure abstractions”?
  Incomplete/imprecise results?
  Longer latency?
What current programming idioms make partial failure hard?
  Hint: remember the original RPC papers?
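One way to picture the "incomplete results" abstraction for a read-only database striped across nodes is a query that skips unreachable stripes and reports how much of the data it actually covered. This is only a sketch under stated assumptions: the node names, the `PartialResult` type, and the in-memory `partitions` dictionary are all hypothetical stand-ins, and failure detection is reduced to a given set of down nodes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class PartialResult:
    hits: List[str]   # matches found on the nodes that answered
    coverage: float   # fraction of partitions that answered (1.0 = complete)


def query_striped(partitions: Dict[str, List[str]],
                  down_nodes: Set[str],
                  predicate: Callable[[str], bool]) -> PartialResult:
    """Run a read-only query over every reachable partition.

    A failed node makes the answer incomplete rather than making the
    whole query fail; the caller sees the coverage and can decide.
    """
    if not partitions:
        raise ValueError("no partitions configured")
    hits: List[str] = []
    answered = 0
    for node, rows in partitions.items():
        if node in down_nodes:
            continue  # skip the failed stripe instead of raising
        answered += 1
        hits.extend(row for row in rows if predicate(row))
    return PartialResult(hits=hits, coverage=answered / len(partitions))


if __name__ == "__main__":
    data = {"n0": ["apple", "avocado"], "n1": ["banana", "apricot"], "n2": ["cherry"]}
    result = query_striped(data, down_nodes={"n1"}, predicate=lambda r: r.startswith("a"))
    print(result.hits, f"coverage={result.coverage:.2f}")
```

The same trade would not be acceptable for the transactional database in the first question, which is part of what makes the choice of abstraction application-specific.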
Software Challenges, Again?
Real issue: we have to think differently about programming…
  …to harness clusters?
  …to get decent failure semantics?
  …to really exploit software modularity?
Traditional uniprocessor programming idioms/models don’t seem to scale up to clusters
Question: Is there a “natural to use” cluster model that scales down to uniprocessors?
  If so, is it general or application-specific?
  What would be the obstacles to adopting such a model?
System Administration on a Cluster
Thanks to Eric Anderson (1998) for some of this material.
Total cost of ownership (TCO) way high for clusters
  Median sysadmin cost per machine per year (1996): ~$700
  Cost of a headless workstation today: ~$1500
Previous Solutions
  Pay someone to watch
  Ignore or wait for someone to complain
  “Shell Scripts From Hell” (not general => vast repeated work)
Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
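The two cost figures above imply that administration overtakes hardware as the dominant cost fairly quickly. A small sketch of that comparison follows. It takes the ~$700 and ~$1500 figures at face value even though they come from different years, and the three-year lifetime in the usage example is an assumption for illustration only.

```python
ADMIN_COST_PER_YEAR = 700.0   # median sysadmin cost per machine per year (1996 figure)
HARDWARE_COST = 1500.0        # headless workstation price quoted on the slide


def years_until_admin_exceeds_hardware(hardware: float = HARDWARE_COST,
                                       admin_per_year: float = ADMIN_COST_PER_YEAR) -> float:
    """Years of operation after which cumulative admin cost equals the purchase price."""
    return hardware / admin_per_year


def admin_share_of_tco(lifetime_years: float,
                       hardware: float = HARDWARE_COST,
                       admin_per_year: float = ADMIN_COST_PER_YEAR) -> float:
    """Fraction of total cost of ownership spent on administration,
    counting only hardware and admin (power, space, etc. are ignored)."""
    admin = admin_per_year * lifetime_years
    return admin / (hardware + admin)


if __name__ == "__main__":
    print(f"break-even after {years_until_admin_exceeds_hardware():.1f} years")
    print(f"admin share over 3 years: {admin_share_of_tco(3):.0%}")
```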
System Administration, cont’d.
Extensible Scalable Monitoring For Clusters of Computers (Anderson & Patterson, UC Berkeley)
Relational tables allow properties & queries of interest to evolve as the cluster evolves
Extensive visualization support allows humans to make sense of masses of data
Multiple levels of caching decouple data collection from aggregation
Data updates can be “pulled” on demand or triggered by push
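To illustrate why a relational layout lets the monitored properties evolve, here is a minimal sketch using Python's built-in sqlite3 module. It is not the design of the Anderson & Patterson system: the `samples` table, its columns, and the helper names are assumptions made up for this example. Each measurement is stored as a row, so a new property is just a new `metric` value and needs no schema change.

```python
import sqlite3
from typing import Optional


def make_store() -> sqlite3.Connection:
    """Create an in-memory store with one row per (node, metric) sample."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE samples (node TEXT, metric TEXT, value REAL)")
    return db


def record(db: sqlite3.Connection, node: str, metric: str, value: float) -> None:
    """Store one sample; this could be driven by a push from the node or a pull."""
    db.execute("INSERT INTO samples VALUES (?, ?, ?)", (node, metric, value))


def cluster_average(db: sqlite3.Connection, metric: str) -> Optional[float]:
    """Aggregate one metric across all nodes; None if nothing was recorded."""
    row = db.execute(
        "SELECT AVG(value) FROM samples WHERE metric = ?", (metric,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    db = make_store()
    record(db, "n0", "load", 0.5)
    record(db, "n1", "load", 1.5)
    record(db, "n0", "disk_free_gb", 2.0)  # new property, same table
    print(cluster_average(db, "load"))
```

Because queries are ordinary SQL, a new question about the cluster (for example, a per-node maximum) is a new query over the same table and does not require changing the collectors.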
Visualizing Data: Example
Display aggregates of various interesting machine properties on the NOW’s
Note use of aggregation, color
Case Study: The Berkeley NOW
History and Pictures of an early research cluster
  NOW-0: four HP-735’s
  NOW-1: 32 headless Sparc-10’s and Sparc-20’s
  NOW-2: 100 UltraSparc 1’s, Myrinet interconnect
  inktomi.berkeley.edu: four Sparc-10’s
  www.hotbot.com: 160 Ultra’s, 200 CPU’s total
  NOW-3: eight 4-way SMP’s
Myrinet interconnection
  In addition to commodity switched Ethernet
  Originally Sparc SBus, now available on PCIbus
The Adventures of NOW: Applications
AlphaSort: 8.41 GB in one minute, 95 UltraSparcs
  (runner up: Ordinal Systems nSort on SGI Origin, 5 GB)
  pre-1997 record, 1.6 GB on an SGI Challenge
40-bit DES key crack in 3.5 hours
  “NOW+”: headless and some headed machines
inktomi.berkeley.edu (now inktomi.com)
  now fastest search engine, largest aggregate capacity
TranSend proxy & Top Gun Wingman Pilot browser
  ~15,000 users, 3-10 machines
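The headline numbers above translate into per-second rates with a little arithmetic. The sketch below makes two assumptions that the slide does not state: 1 GB is taken as 10^9 bytes, and the DES figure is treated as a sweep of the entire 40-bit keyspace in 3.5 hours, so the key rate is an estimate only.

```python
GB = 10**9  # assumed decimal gigabyte


def sort_rate_bytes_per_sec(gigabytes: float, seconds: float, nodes: int = 1) -> float:
    """Sustained sort throughput per node, in bytes per second."""
    return gigabytes * GB / seconds / nodes


def keys_per_second(key_bits: int, hours: float) -> float:
    """Key-test rate needed to sweep a full keyspace in the given time
    (assumes exhaustive search of all 2**key_bits keys)."""
    return 2 ** key_bits / (hours * 3600)


if __name__ == "__main__":
    aggregate = sort_rate_bytes_per_sec(8.41, 60)
    per_node = sort_rate_bytes_per_sec(8.41, 60, nodes=95)
    print(f"sort: {aggregate / 1e6:.0f} MB/s aggregate, {per_node / 1e6:.2f} MB/s per node")
    print(f"speedup over pre-1997 record: {8.41 / 1.6:.1f}x")
    print(f"DES: {keys_per_second(40, 3.5) / 1e6:.1f} million keys/s")
```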
The Adventures of NOW: Tools
GLUnix (coming up, later today)
xFS, a serverless network filesystem
  Why not just a big RAID on a single server?
Support for the Myrinet fast interconnect
  Active Message (AM-1 and AM-2) over Myrinet
  Fast Sockets: one-copy TCP fast path over AM-1 on Myrinet
Moral: cluster tools are hard?
Cluster Summary
Clusters have potential advantages…but serious challenges to achieving them in practice
  Kind of like Network Computers?
Everyone and their brother is now selling a cluster
  Who’s selling a system, and who’s selling a promise?
  Can clustering be sold as a “secret sauce”?
Next: non-clustering, and approaches to clustering