Computing at Massive-Scale: Scalability and Dependability Challenges
Renyu Yang and Professor Jie Xu
Distributed Systems and Service Group
IEEE SOSE 2016, Oxford, March 2016
CoLAB
Cloud Datacenters and Virtualization
• Cloud computing is primarily a business model.
• It provides dynamic computing resources to businesses: computing resources can be rented rather than owned outright.
• Two key characteristics: multi-tenancy and resource elasticity.
Fig: A Cloud provider's scheduler places Customers A, B and C on shared Machines 1 to 3, each offering CPU, memory and storage
Overview: Computing at Massive-scale
Fig: Task submission to schedulers that place the compute tasks of frameworks Fm 1 to Fm 3 on shared compute nodes (Fm = Framework)
New trends and characteristics
• Varying request and resource heterogeneity
• Workload diversity and resource sharing
• Increasing scale of request and cluster
• Frequent failure occurrence
• Multiple computing frameworks often have to run under a unified scheduler while handling varying requests. The diverse workloads are usually co-located on a shared hardware cluster in order to improve utilization.
Varying request and resource heterogeneity
• “Varying” Features
• Various customers with diverse resource estimation
• Varying resource demand dimensions and attributes
• Various resource usage patterns
1. Architecture
2. Number of Cores
3. Max Disks
4. Min Disks
5. Number of CPUs
6. Kernel Version
7. CPU Clock Speed
8. Ethernet Speed
9. Platform Family
Fig: Different constraints in task resource request
• The request heterogeneity can be attributed to the highly dynamic Cloud environment, where users with different computation purposes co-exist with diverse resource requirements and patterns
• This heterogeneity increases scheduling complexity, since the system has to pre-filter the candidate target servers for each specific request
Fig: Varying user and resource request
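The pre-filtering step mentioned above can be sketched as follows. This is a minimal illustration, not the scheduler of any particular system: the `Server` and `Request` classes, the attribute names and the example values are all hypothetical, chosen to mirror a few of the constraint types listed in the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    attrs: dict  # hypothetical attributes, e.g. {"architecture": "x86_64", "cores": 16}


@dataclass
class Request:
    # Each constraint maps an attribute name to a predicate over its value.
    constraints: dict = field(default_factory=dict)


def prefilter(servers, request):
    """Return the servers whose attributes satisfy every constraint."""
    return [
        s for s in servers
        if all(attr in s.attrs and ok(s.attrs[attr])
               for attr, ok in request.constraints.items())
    ]


servers = [
    Server("m1", {"architecture": "x86_64", "cores": 16, "ethernet_gbps": 10}),
    Server("m2", {"architecture": "arm64", "cores": 32, "ethernet_gbps": 10}),
    Server("m3", {"architecture": "x86_64", "cores": 4, "ethernet_gbps": 1}),
]
req = Request({"architecture": lambda v: v == "x86_64",
               "cores": lambda v: v >= 8})
candidates = [s.name for s in prefilter(servers, req)]
print(candidates)
```

Every additional constraint dimension adds one more predicate to evaluate per server, which is one way to see why heterogeneous requests raise the cost of each scheduling decision.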
Workload Diversity and Resource Sharing
• Cluster computing systems are increasingly specialized for particular application domains and purposes
• Offline processing vs. long-running services
• Batch job: MapReduce, Dryad
• In-memory computing: Spark
• Stream processing: Storm and MillWheel
• Interactive SQL queries: Dremel and Hive
• Graph processing: Pregel
• DAG processing: Tez and FuxiJob
• Machine learning: GraphLab
• Virtual Machine or Container: EC2, Docker
• Strong requirements: resource sharing, high server utilization and efficient data sharing
Fig: Static partitioning vs. dynamic sharing of a shared cluster (MapReduce, Spark, VM)
Increasing Request and Cluster Scale

Statistical data during 2015 Alibaba double-eleven shopping festival
Type                                                               Number
Peak order number                                                  Over 120,000 per second
Total payment transactions in Alipay (Alibaba Payment system)      710 million
Peak payment transactions in Alipay                                85,900 per second
Peak transactions processed on AliCloud (Alibaba Cloud) platform   140,000 per second

Statistical data of one production system in Ali-Cloud
Type            Number
Server number   4830
Job number      91,990
Task number     42,266,899
Worker number   16,295,167
• The growing cluster size makes cluster management harder and increases scheduling complexity
• A transparent user experience is highly desirable during request bursts: users should see no noticeable response latency or service time-outs, even when the workload exceeds system capacity
Frequent failure occurrence
• As the scale of a cluster increases, the probability of hardware and software failures also rises. Failures have become the norm rather than the exception at massive scale.
• The increased cluster size itself introduces much more uncertainty and reduces overall system reliability, because every additional node and software component contributes its own probability of failure.
• Some root causes:
• OS crash/network disconnection
• disk hang or insufficient memory (OOM)
• bugs in code/excessive system utilization
• performance interference
• network congestion
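A back-of-the-envelope calculation illustrates why failures become the norm at scale. The sketch below assumes that nodes fail independently and with the same probability in a given time window; both the independence assumption and the per-node probability of 0.001 are illustrative choices, not measurements from the talk.

```python
def p_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Hypothetical per-node failure probability of 0.1% per time window.
for n in (1, 100, 5000):
    print(n, round(p_any_failure(0.001, n), 4))
```

Under these assumptions a failure that is rare for a single node is close to certain somewhere in a cluster of several thousand nodes. Correlated failures, such as a power outage, are not captured by this simple model.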
Fig: Timeline of major Cloud service outages, 2009 to 2014
• 2009.2.24: Gmail failure for 4 hours
• 2009.3: Azure outage for 22 hours
• 2010.1: Salesforce cloud service outage
• 2011.4: Yahoo mailbox outages, affecting 0.25 billion users
• 2011.4: Amazon outage for 4 days
• 2012.7: Azure failure for 2.5 hours
• 2012.10: AliCloud power outage
• 2013.3: Hotmail and Outlook outage for 17 hours
• 2014.8: Azure system outage
• 2014.11: AWS, Rackspace and SoftLayer rebooting
Challenges and Methodology
Scalability and dependability have become two fundamental challenges for all distributed computing at massive scale.
Data-driven Analysing, Modelling, Problem Finding:
• A good understanding of Cloud Computing workloads helps to:
• Identify the resource demand dimensions and attributes for a better datacenter planning
• Identify resource usage inefficiencies
• Identify resource usage patterns to improve the QoS
• Identify relationships between workload parameters and their impact on the productivity of the overall datacenter
• Identifying the workload characteristics from a real production system allows us to:
• Design experimental scenarios to simulate environments and evaluate mechanisms for datacenter operational improvements following realistic conditions
• Find system bottlenecks, and further improve the system performance
Cloud Datacenter Case Studies
• 29 days, 12,500+ servers, 27,000,000 tasks
• 365 days (year), 100+ servers per site, 1000 tasks per site
• 60 days, 5,000+ servers, 185,444 tasks
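Comprehensive and Correlation Analysis
• Task: length, CPU, memory
• User: submission rate, CPU estimated, memory estimated
User and Task profile definition:
$U = \{u_1, u_2, u_3, \ldots, u_i\}$, with $u_i = \{f(\text{submission rate}), f(\text{CPU estimated}), f(\text{memory estimated})\}$
$T = \{t_1, t_2, t_3, \ldots, t_i\}$, with $t_i = \{f(\text{length}), f(\text{CPU}), f(\text{memory})\}$
Expectation of User and Task profile:
$E(u_i) = u_i \, P(u_i)$
$E(t_i) = t_i \, (P(t_i) \, P(u_j))$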
Analytics: Workload Models
K = cluster number; S_K = sum of squares; α = weighted variable (coherence); d = number of dimensions
D.T. Pham “Selection of K in k-means clustering” Proc. Inst. Mech. Eng. C. Mech. Eng. Sci., 219 (1) (2005)
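The variables above belong to the K-selection function f(K) from the cited paper by Pham et al. The sketch below implements that function as it is commonly stated: f(1) = 1, f(K) = S_K / (α_K · S_{K-1}) for K > 1, with α_2 = 1 − 3/(4d) and α_K = α_{K-1} + (1 − α_{K-1})/6 for K > 2. Treat this as an assumption to check against the paper. The S_K values are made up for illustration, and in practice they would come from running k-means for each K.

```python
def alpha_weights(k_max: int, n_dims: int) -> dict:
    """Weight factors alpha_K for K = 2..k_max (assumes n_dims > 1)."""
    alphas = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alphas[k] = 1.0 - 3.0 / (4.0 * n_dims)
        else:
            alphas[k] = alphas[k - 1] + (1.0 - alphas[k - 1]) / 6.0
    return alphas


def f_of_k(distortions: list, n_dims: int) -> list:
    """distortions[K-1] is S_K, the within-cluster sum of squares for K clusters."""
    alphas = alpha_weights(len(distortions), n_dims)
    scores = [1.0]  # f(1) = 1 by definition
    for k in range(2, len(distortions) + 1):
        s_prev, s_k = distortions[k - 2], distortions[k - 1]
        scores.append(1.0 if s_prev == 0 else s_k / (alphas[k] * s_prev))
    return scores


s_k = [100.0, 40.0, 35.0, 33.0]  # made-up S_K values for K = 1..4
scores = f_of_k(s_k, n_dims=2)
# The K with the smallest f(K) is taken as the suggested number of clusters.
best_k = min(range(len(scores)), key=scores.__getitem__) + 1
print(scores, best_k)
```

Small values of f(K) indicate that moving from K−1 to K clusters reduces the distortion by more than would be expected for uniformly distributed data, so the K with the smallest score is preferred.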
• Workload is a combination of tasks and users (customers).
• Characteristics, behavioural patterns and relationships of workload.
Fig: Workload model definition and workload clusterization
Example: Google Workload
Users (Month, Day 2, Day 18, Day 26)

Cluster  Population %  Requested CPU            Requested Memory         Submission Rate (Hourly)
                       Mean    Stdev.  Cv       Mean    Stdev.  Cv       Mean      Stdev.   Cv
U1       37.03         0.010   0.004   0.388    0.016   0.013   0.854    34.94     94.00    2.691
U2       0.71          0.016   0.011   0.689    0.019   0.013   0.658    2498.21   2034.6   0.814
U3       6.37          0.135   0.048   0.358    0.094   0.136   1.453    4.71      10.82    2.295
U4       6.37          0.025   0.018   0.718    0.092   0.031   0.342    13.49     19.47    1.444
U5       22.64         0.063   0.011   0.168    0.030   0.020   0.648    73.40     170.44   2.322
U6       26.89         0.032   0.006   0.197    0.014   0.010   0.752    43.63     105.18   2.411

Task dimension characteristics

Param   Cluster  Month                                Day 2
                 Mean         Stdev.       Cv         Mean         Stdev.       Cv
CPU     T1       0.029        0.028        0.966      0.029        0.025        0.862
CPU     T2       0.095        0.088        0.926      0.071        0.071        1
CPU     T3       0.006        0.012        2          0.007        0.012        1.714
Mem     T1       0.011        0.01         0.909      0.013        0.01         0.769
Mem     T2       0.049        0.031        0.633      0.047        0.021        0.447
Mem     T3       0.002        0.003        1.5        0.003        0.003        1
Length  T1       16,605,683   32,753,760   1.972      9,787,032    15,519,963   1.586
Length  T2       123,974,450  250,146,79   2.018      30,932,490   40,683,248   1.315
Length  T3       739,117      4,056,404    5.488      245,445      655,190      2.669

Param   Cluster  Day 18                               Day 26
                 Mean         Stdev.       Cv         Mean         Stdev.       Cv
CPU     T1       0.028        0.014        0.492      0.006        0.006        1
CPU     T2       0.076        0.051        0.667      0.065        0.04         0.615
CPU     T3       0.005        0.005        0.984      0.026        0.012        0.462
Mem     T1       0.009        0.006        0.632      0.001        0.001        1
Mem     T2       0.040        0.017        0.428      0.031        0.018        0.581
Mem     T3       0.001        0.001        1.075      0.009        0.004        0.444
Length  T1       41,329,800   103,613,335  2.507      13,669,736   16,538,165   1.21
Length  T2       117,493,568  388,077,476  3.303      82,300,581   54,360,253   0.661
Length  T3       7,658,844    25,068,810   3.273      613,803      1,450,884    2.364
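The Cv columns in the tables above are coefficients of variation, that is, the standard deviation divided by the mean. A minimal sketch of the computation follows. The function names are illustrative, the sample list is made up, and using the population standard deviation is an assumption, since the slides do not say which estimator was used.

```python
import statistics


def coefficient_of_variation(samples):
    """Cv = population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


def cv_from_summary(mean, stdev):
    """Cv from already-aggregated statistics, as reported in the tables."""
    return stdev / mean


# Month CPU row for task cluster T1: mean 0.029, Stdev. 0.028.
print(round(cv_from_summary(0.029, 0.028), 3))
# Made-up raw samples, to show the computation from unaggregated data.
print(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]))
```

A Cv above 1, as seen for task length in most clusters, means the spread exceeds the mean, so the mean alone is a poor summary of that dimension.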
Fig: Percentage of tasks during run-time (0 to 50%) for series S-T1 to S-T3 and R-T1 to R-T3, grouped by user cluster U1 to U6
Scalability Challenges
• Request handling scalability
• How to enable high cluster throughput with low-latency request handling and allocation decisions?
• Resource scheduling scalability
• How to make prompt scheduling decisions at millisecond rate for interactive query tasks or millions of queued resource requests?
• Communication and message scalability
• How to avoid message flooding whilst guaranteeing timely component communication (resource request/reclaim, heartbeat)?
Fig: Scalability covers request handling, resource scheduling, and communication and messaging scalability; the driving factors are request number and frequency, resource dimension and amount, and system scale and complexity
Scalability Solutions (1)
Architectural evolution
• Single-master scheduling
• Delegates every scheduling decision and all state monitoring and updating to a single master node (such as the JobTracker in Hadoop 1.0)
• Overloaded JobTracker, and single point of failure
• Supports only one type of computing paradigm/framework (slots only for Map or Reduce)
• Two-level scheduling
• Decouples resource management from framework- or application-specific scheduling, placing them in two separate layers
• The central resource manager is responsible for resource negotiation among different resource requests, while each application master takes charge of job scheduling
• Decentralized scheduling
• Multiple distributed scheduler replicas run as threads or independent processes, and each scheduler can handle requests simultaneously based on its locally cached states or globally shared states
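The separation of concerns in two-level scheduling can be sketched as below. This is a toy model under strong simplifications: resources are reduced to interchangeable slots, and the class and method names (`ResourceManager.negotiate`, `ApplicationMaster.schedule`) are hypothetical rather than the API of any real system.

```python
class ResourceManager:
    """Level 1: arbitrates cluster capacity among applications."""

    def __init__(self, free_slots: int):
        self.free_slots = free_slots

    def negotiate(self, app_id: str, wanted: int) -> int:
        granted = min(wanted, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_slots += count


class ApplicationMaster:
    """Level 2: decides which of its own tasks run on the granted slots."""

    def __init__(self, app_id, tasks):
        self.app_id = app_id
        self.pending = list(tasks)
        self.running = []

    def schedule(self, rm):
        granted = rm.negotiate(self.app_id, len(self.pending))
        self.running += self.pending[:granted]
        self.pending = self.pending[granted:]
        return granted


rm = ResourceManager(free_slots=5)
batch = ApplicationMaster("batch", ["map-0", "map-1", "map-2"])
stream = ApplicationMaster("stream", ["bolt-0", "bolt-1", "bolt-2"])
batch.schedule(rm)
stream.schedule(rm)
print(batch.running, stream.running, stream.pending, rm.free_slots)
```

The resource manager never inspects tasks, and the application masters never see each other's demands, which is what allows different frameworks to share one cluster.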
Scalability Solutions (2)
Incremental scheduling
• To respond rapidly and make scheduling decisions at such a high rate, the central resource manager cannot recalculate the complete mapping of all machine resources to all application tasks for every decision
• Only the changed part will be calculated
• Locality-tree based incremental scheduling
• Incremental resource request and allocation protocol
• A resource request is sent only once and stays in effect until the application master releases the resources
• Scheduling tree
• Multi-level waiting queue
• Different priorities and constraint labels
• Quota-group control (access control)
Fig: Scheduling tree example with multi-level waiting queues
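The idea of recalculating only the changed part can be sketched with one priority-ordered waiting queue per machine. This is a simplified illustration: a real locality tree would also have rack and cluster levels, quota groups are omitted, and all names here are hypothetical.

```python
import heapq
import itertools


class IncrementalScheduler:
    def __init__(self, free):
        self.free = dict(free)                  # machine -> free slots
        self.waiting = {m: [] for m in free}    # machine -> priority heap
        self.granted = []                       # (app, machine) decisions
        self._order = itertools.count()         # FIFO tie-breaker

    def request(self, app, machine, priority):
        """Register a request once; it stays queued until it is satisfied."""
        heapq.heappush(self.waiting[machine], (priority, next(self._order), app))
        self._schedule(machine)

    def release(self, machine, slots=1):
        """Freed resources trigger rescheduling for this machine only."""
        self.free[machine] += slots
        self._schedule(machine)

    def _schedule(self, machine):
        queue = self.waiting[machine]
        while self.free[machine] > 0 and queue:
            _, _, app = heapq.heappop(queue)
            self.free[machine] -= 1
            self.granted.append((app, machine))


sched = IncrementalScheduler({"m1": 1, "m2": 0})
sched.request("appA", "m1", priority=2)   # granted immediately
sched.request("appB", "m1", priority=2)   # waits
sched.request("appC", "m1", priority=1)   # waits; a lower number means higher priority
sched.release("m1")                       # only m1's queue is re-evaluated
print(sched.granted)
```

Each event touches a single queue, so the cost of a decision depends on the size of the change rather than on the size of the cluster.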
Scalability Solutions (3)
Decentralized scheduling
• Option 1: Local state replica coordinated by central master
• The functionality of the central master can be reduced to that of a coordinator that only synchronizes all states
• Conflict resolution is critically important
• Option 2: Shared states visible to all schedulers without a central coordinator
• The communal states can be protected with exclusive locking or with lock-free optimistic concurrency control using incremental transactions
• Option 3: Stateless distributed scheduling
• Sampling-based probing for low latency
• Each autonomous scheduler detects servers with fewer queued tasks by probing m random servers and assigns the tasks of its jobs to the targeted machines
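Sampling-based probing (Option 3) can be sketched in a few lines. The function below is a hypothetical, single-process illustration in which queue lengths are read directly from a dictionary; in a real deployment each probe would be a network round trip, and the answers could be stale by the time the task arrives.

```python
import random


def place_task(queue_lengths, m, rng):
    """Probe m random servers and enqueue the task on the least loaded one."""
    probed = rng.sample(sorted(queue_lengths), m)
    target = min(probed, key=queue_lengths.__getitem__)
    queue_lengths[target] += 1
    return target


rng = random.Random(0)                        # fixed seed for repeatability
queues = {f"s{i}": 0 for i in range(20)}      # hypothetical server names
for _ in range(200):
    place_task(queues, m=2, rng=rng)
print(max(queues.values()), min(queues.values()))
```

The scheduler keeps no cluster state between decisions, so any number of schedulers can run side by side without coordination; the price is that each decision sees only m servers rather than the whole cluster.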
Scalability Solutions (4)
Incremental communication
• An incremental request will be sent only when the resource demands are dynamically adjusted:
• reducing frequency of message passing
• improving the whole cluster utilization
• Core techniques in a messenger:
• Message order-preserving
• Idempotent message resending
• Message deduplication
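The three messenger techniques listed above can be sketched together with per-message sequence numbers and cumulative acknowledgements. This is an illustrative model, not the actual messenger protocol: the `Sender` and `Receiver` classes and their fields are hypothetical, and the network is replaced by direct method calls.

```python
class Sender:
    def __init__(self):
        self.max_seq = 0        # highest sequence number assigned so far
        self.acked = 0          # highest sequence number acknowledged
        self.buffer = {}        # seq -> payload, kept until acknowledged

    def send(self, payload):
        self.max_seq += 1
        self.buffer[self.max_seq] = payload
        return (self.max_seq, payload)

    def unacked(self):
        """Messages to resend; resending is safe because the receiver dedups."""
        return [(seq, self.buffer[seq]) for seq in sorted(self.buffer)]

    def on_ack(self, ack):
        for seq in [s for s in self.buffer if s <= ack]:
            del self.buffer[seq]
        self.acked = max(self.acked, ack)


class Receiver:
    def __init__(self):
        self.delivered_up_to = 0   # all seq <= this were delivered in order
        self.pending = {}          # out-of-order messages waiting for a gap
        self.delivered = []

    def receive(self, seq, payload):
        if seq > self.delivered_up_to:            # otherwise a duplicate: drop
            self.pending.setdefault(seq, payload)
        while self.delivered_up_to + 1 in self.pending:
            self.delivered_up_to += 1
            self.delivered.append(self.pending.pop(self.delivered_up_to))
        return self.delivered_up_to               # cumulative ack


sender, receiver = Sender(), Receiver()
m1 = sender.send("request 2 slots")
m2 = sender.send("release 1 slot")
receiver.receive(*m2)                 # arrives first: held back
ack = receiver.receive(*m1)           # fills the gap, both delivered in order
for msg in sender.unacked():          # suppose the ack was lost: resend all
    ack = receiver.receive(*msg)      # duplicates are ignored
sender.on_ack(ack)
print(receiver.delivered, sender.buffer)
```

Because delivery is keyed on the sequence number, a resent message has no additional effect, which is what makes the resend idempotent from the application's point of view.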
Cluster Partition
• A compute cluster can be divided into several area partitions, and each manager replica is responsible for request handling and information delegation of servers within its specified partition
• Consistency is guaranteed by an elected central coordinator (only the coordinator can make changes to the permanent store)
Fig: Message re-sending and de-duplication (Sender App and Receiver App exchange RPC calls through Messengers that track {max, ack} state per MessageBuf)
Fig: Google Cluster Partition [EuroSys'15]
Dependability Challenges
• Providers are under great pressure to provide uninterrupted, reliable service to consumers while reducing the operational costs caused by software and hardware failures within the system.
• Faults and handling coverage: Components within the resource manager are likely to experience different types of faults, ranging from crash-stop to late-timing failures, with different underlying root causes
• Recovery effectiveness and efficiency: factors to consider include the full recovery time, the system utilization, the additional resource cost of the recovery, and the latent negative impact on other components or workloads
• User-perceived impact: a user-transparent failover technique should recover the service without changes noticeable to consumers
Fig: Dependability covers fault coverage, recovery effectiveness and efficiency, and user-perceived impact; the driving factors are failure MTTF, the amount of workloads and subcomponents, and system complexity
Dependability Solutions (1)
Rapid and Effective Component Failover
• Failover with reduced checkpointing
• Soft-state inference: Collects and exploits states from neighboring components instead of relying solely on hard state periodically collected from dedicated backup systems
• Hybrid recovery techniques: Recovery that combines light-weight hard state with soft-state inference
• Minimized worker eviction
• The behavior of a master or agent is loosely coupled from that of its workers during execution, so non-faulty workers are not automatically evicted
• State-inference to identify late-timing or inaccessible agents
• Adaptive resource reservation for running/faulty workers
Fig: Soft-state inference applied to Fuxi system
Fig: Fault Recovery Finite-State Machine (FSM)
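The combination of light-weight hard state and soft-state inference can be sketched as follows. This is a hypothetical illustration: it assumes the checkpoint holds only the list of applications, while the current allocations are inferred from what each node agent reports after the master restarts. The function and field names are not taken from Fuxi.

```python
from collections import defaultdict


def rebuild_state(hard_state, agent_reports):
    """Merge a light-weight checkpoint with soft state reported by agents."""
    allocations = defaultdict(dict)
    for node, report in agent_reports.items():
        for app, slots in report.items():
            allocations[app][node] = slots
    apps = sorted(hard_state["applications"])
    return {
        "applications": apps,
        "allocations": {app: dict(allocations.get(app, {})) for app in apps},
    }


# Hard state: small and rarely changing, so cheap to checkpoint.
checkpoint = {"applications": ["job1", "job2"]}
# Soft state: what each agent says is running on its node right now.
reports = {"n1": {"job1": 2}, "n2": {"job1": 1, "job2": 3}}
state = rebuild_state(checkpoint, reports)
print(state["allocations"])
```

Because the frequently changing allocation table is never checkpointed, normal operation pays little for fault tolerance, and the cost moves to the comparatively rare recovery path.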
Dependability Solutions (2)
Optimized Recovery Time vs. Degraded Service Level
• Recovery Time or Information Completeness
• Incomplete information may arise when timed-out components cannot contribute their states in time. The state-collection time also depends closely on the cluster scale, the number of applications, and application-specific configurations.
• Insufficient collection time leads to incomplete states and subsequent degraded service level
• Flexible and customizable configurations
• Such flexibility through customization offers adaptive control of recovery overheads and allows possible trade-offs between the full state recovery and various levels of degraded recovery with incomplete state inference.
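The trade-off between recovery time and information completeness can be made concrete with a configurable collection window. The sketch below is illustrative only: the component names and arrival times are made up, and completeness is simply the fraction of components that reported within the window.

```python
def collect_states(arrival_times, deadline):
    """Return components whose state arrived within the collection window."""
    on_time = sorted(c for c, t in arrival_times.items() if t <= deadline)
    completeness = len(on_time) / len(arrival_times)
    return on_time, completeness


# Hypothetical seconds after failover at which each component reports.
arrivals = {"agent-1": 0.4, "agent-2": 1.2, "agent-3": 3.5, "app-master-1": 0.9}
for deadline in (1.0, 2.0, 5.0):
    _, completeness = collect_states(arrivals, deadline)
    print(deadline, completeness)
```

A short window lets the master resume quickly with partial knowledge, whereas a long window yields complete state at the cost of a longer recovery; exposing the window as a configuration parameter lets operators choose the balance.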
Dependability Solutions (3)
Maximized fault coverage
• Diverse failure types: crash-stop failures, timing failures, etc.
• Different failure combinations
• Failure correlations and simultaneous component failures
• Root-cause analysis
Blacklist and alarm dashboard
• Multi-level blacklist
• cluster-level, task-level and job-level
• System health self-checker and dashboard
• monitor and diagnose node health, process status and system features
• The right tools can quickly find the root cause, minimizing the duration of the failure
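One way a multi-level blacklist could escalate is sketched below. The escalation rule and the thresholds are assumptions made for illustration: repeated task failures on a node make one job avoid it, and a node avoided by several jobs is banned for the whole cluster. The slides do not specify the actual policy.

```python
from collections import defaultdict


class Blacklist:
    def __init__(self, task_limit=2, job_limit=2):
        self.task_limit = task_limit        # task failures before a job avoids a node
        self.job_limit = job_limit          # avoiding jobs before a cluster-wide ban
        self.failures = defaultdict(int)    # (job, node) -> failed task count
        self.job_level = defaultdict(set)   # job -> nodes it avoids
        self.cluster_level = set()          # nodes banned for every job

    def report_task_failure(self, job, node):
        self.failures[(job, node)] += 1
        if self.failures[(job, node)] >= self.task_limit:
            self.job_level[job].add(node)
        jobs_avoiding = sum(node in nodes for nodes in self.job_level.values())
        if jobs_avoiding >= self.job_limit:
            self.cluster_level.add(node)

    def usable(self, job, node):
        return node not in self.cluster_level and node not in self.job_level[job]


bl = Blacklist()
for job in ("j1", "j1", "j2", "j2"):
    bl.report_task_failure(job, "n1")
print(bl.usable("j3", "n1"), bl.usable("j3", "n2"))
```

Escalating in stages keeps a node that is bad for only one job available to others, while a node that fails for many jobs is removed from scheduling altogether.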
Data-driven Methodology Applied Into Engineering
Future Directions
• Big Data as a Service (BDaaS)
• Data processing API, data sharing, API composition
• Debugging large-scale distributed applications
• Debugging or investigating a distributed application performance issue
• Time-consuming for engineers and technical staff to find the root causes of problems
• History-Based Optimization (HBO) approach
• Accurate estimation of resource requirements
• User demands, system patterns, etc.
• Simulation of large-scale system behavior
• Cost-effective technique to evaluate system functionality and performance in a simulation environment
• Application in container-based systems
• Lightweight containers, e.g. Docker
• IoE applications
• Cloud-Network-Edge
• Dependable, real-time capability with low latency
Conclusions
• Exploiting the inherent workload heterogeneity that exists in Cloud environments provides an excellent mechanism that helps to improve both the performance of running tasks and the system efficiency
• Large-scale distributed systems may run millions of service instances concurrently, with an increased probability of frequent and simultaneous failures
• Relying on real data is critical to understanding the real challenges in massive-scale computing and formulating assumptions under realistic operational circumstances
• Experience gained from Cloud and distributed computing will facilitate the development of future-generation computing systems that support a range of intelligent human decisions
Our Main Contributions
Topic 1 - Analysis, Modeling and Simulation
• [1] I. S. Moreno, P. Garraghan, P. Townend, and J. Xu. An approach for characterizing workloads in Google cloud to derive realistic resource utilization models. In Proceedings of IEEE SOSE 2013, Best Paper Award
• [2] R. Yang, I. S. Moreno, J. Xu and T. Wo. An analysis of performance interference effects on energy-efficiency of virtualized cloud environments. In Proceedings of the IEEE CloudCom, 2013
• [3] P.Garraghan, P.Townend, J.Xu, "An Analysis of the Server Characteristics and Resource Utilization in Google Cloud" in the proceedings of the IEEE IC2E, 2013.
• [4] P.Garraghan, P.Townend, J.Xu, "An Empirical Failure-Analysis of a Large-Scale Cloud Computing Environment" in the proceedings of IEEE HASE 2014
• [5] I. S. Moreno, P. Garraghan, P. Townend, and J. Xu. Analysis, modeling and simulation of workload patterns in a large-scale utility cloud, in IEEE Transactions on Cloud Computing, 2014
• [6] P. Garraghan, I. S. Moreno, P. Townend, and J. Xu. An analysis of failure-related energy waste in a large-scale cloud environment, in IEEE Transactions on Emerging Topics in Computing, 2014
• [7] P. Garraghan, D.McKee, X. Ouyang, D. Webster and J. Xu. SEED: A Scalable Approach for Cyber-Physical System Simulation, in IEEE Transactions on Services Computing, 2015
Our Main Contributions
Topic 2 – Scalable Resource Scheduling at Scale
• [1] I. S. Moreno, R. Yang, J. Xu and T. Wo. Improved energy-efficiency in cloud datacenters with interference-aware virtual machine placement. In Proceedings of the IEEE ISADS 2013, Best Paper Award
• [2] Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu. Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. In Proceedings of the VLDB Endowment, 2014
• [3] Y.Wang, R. Yang, T. Wo, W. Jiang and C. Hu. Improving utilization through dynamic VM resource allocation in hybrid cloud environment. In Proceedings of the IEEE ICPADS 2014
• [4] P.Garraghan, X.Ouyang, P.Townend, J.Xu. Timely Long Tail Identification through Agent Based Monitoring and Analytics, In Proceedings of IEEE ISORC, 2015
• [5] R. Yang, T. Wo, C. Hu, J. Xu and M. Zhang. D2PS: a Dependable Data Provisioning Service in Multi-Tenants Cloud Environments, In Proceedings of IEEE HASE, 2016
• [6] X.Ouyang, P.Garraghan, D.McKee, P.Townend, and J.Xu. Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation, In Proceedings of IEEE AINA, 2016
• [7] X.Ouyang, P.Garraghan, R.Yang, P.Townend and J.Xu. Reducing Late-Timing Failure at Scale: Straggler Causes Analysis and Occurrence Prediction. In Proceedings of IEEE/IFIP DSN 2016 (under review)
Our Main Contributions
Topic 3 – Dependable and Reliable Computing at Scale
• [1] Y. Zhang, R. Yang, T. Wo, C. Hu, J. Kang and L. Cui. CloudAP: Improving the QoS of Mobile Applications with Efficient VM Migration. In Proceedings of IEEE HPCC, 2013
• [2] L. Cui, J. Li, T. Wo, B. Li, R. Yang, Y. Cao and J. Huai. HotRestore: a fast restore system for virtual machine cluster. In Proceedings of USENIX LISA, 2014
• [3] Y. Huang, R. Yang, L. Cui, T. Wo, C. Hu and B. Li. VMCSnap: Taking Snapshots of Virtual Machine Cluster with Memory Deduplication. In Proceedings of IEEE SOSE, 2014
• [4] J. Li, J. Zheng, L. Cui and R. Yang. ConSnap: Taking continuous snapshots for running state protection of virtual machines. In Proceedings of IEEE ICPADS, 2014
• [5] R.Yang and J.Xu. Computing at Massive Scale: Scalability and Dependability Challenges. In Proceedings of IEEE SOSE 2016, Invited Visionary Paper (In press)
• [6] R.Yang, Y.Zhang, P.Garraghan, Y.Feng, J.Ouyang, J.Xu, Z.Zhang, C.Li. Reliable Compute Service in Massive-scale Systems through Rapid Low-cost Failover. In IEEE Transactions on Services Computing, 2016 (In press)
• [7] P.Garraghan, X.Ouyang, R.Yang and J.Xu. Straggler Root-Cause Analysis and Detection in Massive-scale Cloud Datacenters. In IEEE Transactions on Services Computing, 2016 (under review)
Thanks!
Renyu Yang
Beihang University/Alibaba Cloud Inc, China
http://act.buaa.edu.cn/yangrenyu
Professor Jie Xu
University of Leeds, UK
http://www.comp.leeds.ac.uk/jxu/