Source: infolab.stanford.edu/infoseminar.Archive/WinterY2009/dewitt-slides.…
Clustera: A data-centric approach to
scalable cluster management
David J. DeWitt, Jeff Naughton,
Eric Robinson, Andrew Krioukov, Srinath Shankar, Joshua Royalty,
Erik Paulson
Computer Sciences Department
University of Wisconsin-Madison
Outline
A historical perspective
A taxonomy of current cluster management systems
Clustera - the first DBMS-centric cluster management system
Examples and experimental results
Wrapup and summary
A Historical Perspective
Concept of a “cluster” seems to have originated with Wilkes’s idea of a “Processor bank” in 1980
“Remote Unix” (RU) project at Wisconsin in 1984
• Ran on a cluster of 20 VAX 11/750s
• Supported remote execution of jobs
• I/O calls redirected to submitting machine
“RU” became Condor in late 1980s (Livny)
• Job checkpointing
• Support for non-dedicated machines (e.g. workstations)
• Today, deployed on 1500+ clusters and 100K+ machines worldwide (biggest clusters of 8000-15000 nodes)
Cluster of 20 VAX 11/750s circa 1985 (Univ. Wisconsin)
No, Google did not invent clusters
Clusters and Parallel DB Systems
The Gamma and RU/Condor projects started at the same time, using the same hardware. Different focuses:
• 1 to 10 identical 2.4 GHz Intel Core2 Duo, 4GB RAM, no cache limit
DBMS (IBM DB2 v8.1)
• 3.0 GHz Xeon (x2) with HT, 4GB RAM, 1GB buffer pool
Job queue preloaded with fixed-length “sleep” jobs
• Enables targeting specific throughput rates
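The link between fixed-length jobs and a target throughput can be sketched with simple arithmetic: if every execution slot is always busy, the completion rate is the number of slots divided by the job length. The sketch below is illustrative only. The function names are hypothetical, and pairing the 90 nodes with 4 concurrent jobs per node from the caching experiment with the 40 jobs/sec target from the fault-tolerance experiment is an assumption; the slides do not state the sleep length actually used.

```python
def steady_state_throughput(total_slots: int, job_length_sec: float) -> float:
    """Jobs completed per second when every slot always has a job running."""
    if total_slots <= 0 or job_length_sec <= 0:
        raise ValueError("slots and job length must be positive")
    return total_slots / job_length_sec


def sleep_length_for_target(total_slots: int, target_jobs_per_sec: float) -> float:
    """Sleep-job length (seconds) that yields the target rate with all slots busy."""
    if total_slots <= 0 or target_jobs_per_sec <= 0:
        raise ValueError("slots and target rate must be positive")
    return total_slots / target_jobs_per_sec


# Assumed configuration: 90 nodes x 4 concurrent jobs/node (caching slide),
# targeting 40 jobs/sec (fault-tolerance slide). The pairing is hypothetical.
total_slots = 90 * 4
length = sleep_length_for_target(total_slots, 40.0)
print(f"{total_slots} slots need {length:.1f}-second sleep jobs for 40 jobs/sec")
```

Because the jobs do no real work, changing the sleep length is the only knob needed to move the offered load up or down.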
Evaluation of Alternative Caching Policies
Caching alternatives:
no caching, asynchronous invalidation, synchronous replication
90 Nodes, 4 concurrent jobs/node
Application Server Fault Tolerance
Approach: maintain a target throughput rate of 40 jobs/sec; start with 4 servers and kill one off every 5 minutes; monitor job completion and error rates
Key insight: Clustera displays consistent performance with rapid failover – of 47,535 jobs that successfully completed, only 21 had to be restarted due to error
Figure phases: 4 Servers, 3 Servers, 2 Servers, 1 Server
Figure annotations, in order of appearance:
• 13 jobs cancelled and restarted in 4th minute
• 0 jobs cancelled and restarted
• 8 jobs cancelled and restarted in 14th minute
• 0 jobs cancelled and restarted
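As a quick check on the numbers above, the short sketch below confirms that the per-annotation restart counts (13, 0, 8, 0) add up to the 21 restarts quoted in the key insight, and expresses them as a fraction of the 47,535 completed jobs. The variable and function names are illustrative; only the figures themselves come from the slide.

```python
# Figures quoted on the slide.
completed_jobs = 47_535
restarted_jobs = 21

# Restart counts from the figure annotations, in the order they appear.
annotated_restarts = [13, 0, 8, 0]


def restart_rate(restarted: int, completed: int) -> float:
    """Fraction of completed jobs that had to be restarted."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    return restarted / completed


rate = restart_rate(restarted_jobs, completed_jobs)
print(f"annotated restarts total: {sum(annotated_restarts)}")
print(f"restart rate: {rate:.4%}")
```

The restart rate works out to roughly 0.04% of completed jobs, which is the quantitative basis for the "rapid failover" claim.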
Application Server Summary
Clustera can make efficient use of additional application server capacity
The Clustera mid-tier “scales-out” effectively
• About same as “scale-up” – not shown
System exhibits consistent performance and rapid failover in the face of application server failure
Still two single points of failure. Would the behavior change if we:
• Used redundancy or round-robin DNS to set up a highly available load balancer?
• Used replication to set up a highly available DBMS?
Summary & Future Work
Cluster management is truly a data management task
The combination of an RDBMS and an application server seems to work very well
Looks feasible to build a cluster management system to handle a variety of different workload types
Unsolved challenges:
• Scalability of really short jobs (1 second) with the PULL model
• Make it possible for mortals to write abstract schedulers
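To see why very short jobs stress a pull model, the toy simulation below counts how many job requests reach the server per second when every execution slot pulls a new job the instant its previous one finishes. This is a minimal sketch under stated assumptions, not Clustera's implementation: the function name is hypothetical, request latency and scheduling cost are ignored, and the 360 slots (90 nodes with 4 concurrent jobs each) are borrowed from the caching experiment.

```python
import heapq


def pull_requests_per_second(total_slots: int, job_length_sec: float,
                             duration_sec: float) -> float:
    """Average pull requests per second over a simulated run.

    Each slot pulls a job at time 0 and again whenever its job finishes.
    """
    if total_slots <= 0 or job_length_sec <= 0 or duration_sec <= 0:
        raise ValueError("all arguments must be positive")
    # Min-heap of (time of next pull, slot id).
    events = [(0.0, slot) for slot in range(total_slots)]
    heapq.heapify(events)
    pulls = 0
    while events[0][0] < duration_sec:
        now, slot = heapq.heappop(events)
        pulls += 1  # the slot asks the server for its next job
        heapq.heappush(events, (now + job_length_sec, slot))
    return pulls / duration_sec


# Assumed cluster size: 90 nodes x 4 slots (hypothetical pairing).
slots = 90 * 4
for job_length in (9.0, 1.0):
    rate = pull_requests_per_second(slots, job_length, duration_sec=90.0)
    print(f"{job_length:.0f}-second jobs: {rate:.0f} pull requests/sec")
```

Under these assumptions, shrinking the job length multiplies the request rate the server and DBMS must absorb by the same factor, which is the scalability concern the bullet raises.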
Bizarre feeling to walk away from a project in the middle