Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
Systems for Resource Management
Corso di Sistemi e Architetture per Big Data A.A. 2016/17
Valeria Cardellini
The reference Big Data stack
[Figure: the reference Big Data stack layers: Resource Management, Data Storage, Data Processing, High-level Interfaces; plus Support / Integration as a cross-cutting layer]
Outline
• Cluster management systems
  – Mesos
  – YARN
  – Borg
  – Omega
  – Kubernetes
• Resource management policies
  – DRF
Motivations
• Rapid innovation in cloud computing
• No single framework optimal for all applications
• Running each framework on its dedicated cluster:
  – Expensive
  – Hard to share data
A possible solution
• Run multiple frameworks on a single cluster
• How can the (virtual) cluster resources be shared among multiple, non-homogeneous frameworks executed in virtual machines/containers?
• The classical solution: static partitioning
• Is it efficient?
What we need
• The "Datacenter as a Computer" idea (D. Patterson)
  – Share resources to maximize their utilization
  – Share data among frameworks
  – Provide a unified API to the outside
  – Hide the internal complexity of the infrastructure from applications
• The solution: a cluster-scale resource manager that employs dynamic partitioning
Apache Mesos
• Cluster manager that provides a common resource sharing layer over which diverse frameworks can run
• Abstracts the entire datacenter into a single pool of computing resources, simplifying running distributed systems at scale
• A distributed system to run distributed systems on top of it
Apache Mesos (2)
• Designed and developed at the University of California, Berkeley; now a top-level open-source Apache project: mesos.apache.org
• Twitter and Airbnb as first users; now supports some of the largest applications in the world
• Cluster: a dynamically shared pool of resources
  [Figure: dynamic partitioning vs. static partitioning]
Mesos goals
• High utilization of resources
• Supports diverse frameworks (current and future)
• Scalability to 10,000s of nodes
• Reliability in face of failures
Mesos in the data center
• Where does Mesos fit as an abstraction layer in the datacenter?
Computation model
• A framework (e.g., Hadoop, Spark) manages and runs one or more jobs
• A job consists of one or more tasks
• A task (e.g., map, reduce) consists of one or more processes running on the same machine
What Mesos does
• Enables fine-grained sharing of resources (CPU, RAM, …) across frameworks, at the level of tasks within a job
• Provides common functionalities:
- Failure detection
- Task distribution
- Task starting
- Task monitoring
- Task killing
- Task cleanup
Fine-grained sharing
• Allocation at the level of tasks within a job
• Improves utilization, latency, and data locality
[Figure: coarse-grained sharing vs. fine-grained sharing]
Frameworks on Mesos
• Frameworks must be aware of running on Mesos
  – DevOps tooling: Vamp (deployment and workflow tool for container orchestration)
  – Long-running services: Aurora (service scheduler), …
  – Big Data processing: Hadoop, Spark, Storm, MPI, …
  – Batch scheduling: Chronos, …
  – Data storage: Alluxio, Cassandra, ElasticSearch, …
  – Machine learning: TFMesos (TensorFlow in Docker on Mesos)
• See the full list at mesos.apache.org/documentation/latest/frameworks/
Mesos: architecture
• Master-slave architecture
• Slaves publish available resources to master
• Master sends resource offers to frameworks
• Master election and service discovery via ZooKeeper
Source: Mesos: a platform for fine-grained resource sharing in the data center, NSDI'11
Mesos component: Apache ZooKeeper
• Coordination service for maintaining configuration information, naming, and providing distributed synchronization and group services
• Used in many distributed systems, among which Mesos, Storm and Kafka
• Allows distributed processes to coordinate with each other through a shared hierarchical name space of data (znodes)
  – File-system-like API
  – Name space similar to a standard file system
  – Limited amount of data in znodes
  – It is not really a file system, database, key-value store, or lock service
• Provides high throughput, low latency, highly available, strictly ordered access to the znodes
Mesos component: ZooKeeper (2)
• Replicated over a set of machines that maintain an in-memory image of the data tree
  – Read requests are processed locally by the ZooKeeper server
  – Write requests are forwarded to the other ZooKeeper servers, which reach consensus before a response is generated (primary-backup system)
  – Uses a leader election protocol (Zab, a Paxos-like atomic broadcast protocol) to determine which server is the leader
• Implements atomic broadcast
  – Processes deliver the same messages (agreement) and deliver them in the same order (total order)
  – Message = state update
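To make the znode model concrete, here is a minimal usage sketch with the kazoo Python client; the ensemble addresses, paths, and payload are made-up examples. Ephemeral (and sequential) znodes are the typical building block for the membership and leader-election patterns that systems like Mesos rely on.

from kazoo.client import KazooClient

# Hypothetical ZooKeeper ensemble; replace with real host:port pairs.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# znodes form a file-system-like hierarchy and hold a small amount of data.
zk.ensure_path("/cluster/agents")

# Ephemeral + sequential znodes disappear when the session dies: a simple
# way to advertise live agents (or elect a leader by picking the lowest id).
zk.create("/cluster/agents/agent-", b"cpus:4;mem:16384",
          ephemeral=True, sequence=True)

print(zk.get_children("/cluster/agents"))   # currently registered agents
zk.stop()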
Mesos and framework components
• Mesos components
  - Master
  - Slaves (or agents)
• Framework components
  - Scheduler: registers with the master to be offered resources
  - Executors: launched on agent nodes to run the framework's tasks
Scheduling in Mesos
• Scheduling mechanism based on resource offers
  - Mesos offers available resources to frameworks
  - Each resource offer contains a list of <agent ID, resource1: amount1, resource2: amount2, ...>
  - Each framework chooses which resources to use and which tasks to launch (see the sketch below)
• Two-level scheduler architecture
  - Mesos delegates the actual scheduling of tasks to frameworks
  - Improves scalability: the master does not have to know the scheduling intricacies of every type of application that it supports
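As a rough illustration of the framework side of this offer cycle (using made-up data structures, not the actual Mesos scheduler API), a framework scheduler receives one offer per agent, decides how many of its pending tasks fit, and implicitly declines the rest:

from dataclasses import dataclass

@dataclass
class Offer:
    agent_id: str
    cpus: float
    mem: int  # MB

class FrameworkScheduler:
    """Framework-side scheduler: receives offers from the Mesos master and
    decides which resources to use and which tasks to launch on them."""

    def __init__(self, task_cpus, task_mem, tasks_to_run):
        self.task_cpus = task_cpus
        self.task_mem = task_mem
        self.pending = tasks_to_run

    def resource_offers(self, offers):
        launches = []
        for offer in offers:
            # Launch as many pending tasks as fit in this offer;
            # anything unused is implicitly declined back to the master.
            n = int(min(offer.cpus // self.task_cpus,
                        offer.mem // self.task_mem,
                        self.pending))
            if n > 0:
                launches.append((offer.agent_id, n))
                self.pending -= n
        return launches

offers = [Offer("agent-1", cpus=4, mem=8192), Offer("agent-2", cpus=1, mem=2048)]
scheduler = FrameworkScheduler(task_cpus=1, task_mem=2048, tasks_to_run=5)
print(scheduler.resource_offers(offers))   # [('agent-1', 4), ('agent-2', 1)]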
Mesos: resource offers
• Resource allocation is based on Dominant Resource Fairness (DRF)
Mesos: resource offers in details
• Slaves continuously send status updates about resources to the Master
Mesos: resource offers in details (2)
Mesos: resource offers in details (3)
• Framework scheduler can reject offers
Mesos: resource offers in details (4)
• Framework scheduler selects resources and provides tasks
– Hard to anticipate future frameworks' requirements
– Need to refactor existing frameworks
Two-level scheduling in Mesos
• Resource offer
  – Vector of available resources on a node
  – E.g., node1: <1CPU, 1GB>, node2: <4CPU, 16GB>
• Master sends resource offers to frameworks
• Frameworks select which offers to accept and which tasks to run
• Push task placement to frameworks
• Pros:
  – Simple: easier to scale and make resilient
  – Easy to port existing frameworks and support new ones
• Cons:
  – Distributed scheduling decisions: not optimal
Mesos: resource allocation
• Based on Dominant Resource Fairness (DRF) algorithm
DRF: background on fair sharing
• Consider a single resource: fair sharing
  – n users want to share a resource, e.g., CPU
  – Solution: allocate each user 1/n of the shared resource
• Generalized by max-min fairness
  – Handles the case where a user wants less than its fair share
  – E.g., user 1 wants no more than 20%
• Generalized by weighted max-min fairness
  – Gives weights to users according to their importance
  – E.g., user 1 gets weight 1, user 2 weight 2
Max-min fairness
• 1 resource type: CPU
• Total resources: 20 CPUs
• User 1 has x tasks and wants <1CPU> per task
• User 2 has y tasks and wants <2CPU> per task

  max(x, y)       (maximize allocation)
  subject to
  x + 2y ≤ 20     (CPU constraint)
  x = 2y          (fairness: both users get the same amount of CPU)

  Solution: x = 10, y = 5
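A standard way to compute max-min fair shares for elastic demands is progressive filling (water-filling). The sketch below is a generic illustration (not from the slides); the demand numbers are chosen to match the "user 1 wants no more than 20%" case above.

def max_min_allocation(capacity, demands):
    """Water-filling max-min fairness for a single resource.
    demands: user -> maximum amount that user wants."""
    alloc = {u: 0.0 for u in demands}
    remaining = dict(demands)                 # residual demand per user
    while remaining and capacity > 1e-9:
        fair_share = capacity / len(remaining)
        for user in list(remaining):
            grant = min(fair_share, remaining[user])
            alloc[user] += grant
            remaining[user] -= grant
            capacity -= grant
            if remaining[user] <= 1e-9:
                del remaining[user]           # user is satisfied
    return alloc

# 20 CPUs; user 1 wants at most 20% (4 CPUs), users 2 and 3 want as much as possible.
print(max_min_allocation(20, {"user1": 4, "user2": 20, "user3": 20}))
# -> roughly {'user1': 4.0, 'user2': 8.0, 'user3': 8.0}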
Why is fair sharing useful?
• Proportional allocation
  – User 1 gets weight 2, user 2 weight 1
• Priorities
  – Give user 1 weight 1000, user 2 weight 1
• Reservations
  – Ensure user 1 gets 10% of a resource: give user 1 weight 10 and make the weights sum to 100
• Isolation policy
  – Users cannot affect others beyond their fair share
Why is fair sharing useful? (2)
• Share guarantee
  – Each user can get at least 1/n of the resource
  – But will get less if its demand is less
• Strategy-proofness
  – Users are not better off by asking for more than they need
  – Users have no reason to lie
• Max-min fairness is the only reasonable mechanism with these two properties
• Many schedulers use max-min fairness
  – OS, networking, datacenters (e.g., YARN), …
Max-min fairness drawback
• When is max-min fairness not enough?
• Need to schedule multiple, heterogeneous resources (CPU, memory, disk, I/O)
• Single-resource example
  – 1 resource: CPU
  – User 1 wants <1CPU> per task
  – User 2 wants <2CPU> per task
• Multi-resource example
  – 2 resources: CPUs and memory
  – User 1 wants <1CPU, 4GB> per task
  – User 2 wants <3CPU, 1GB> per task
• What is a fair allocation?
A first (wrong) solution
• Asset fairness: gives weights to resources (e.g., 1 CPU = 1 GB) and equalizes the total value allocated to each user
• Total resources: 28 CPUs and 56 GB RAM (here 1 CPU is valued as 2 GB, the cluster's CPU:RAM ratio)
  – User 1 has x tasks and wants <1CPU, 2GB> per task
  – User 2 has y tasks and wants <1CPU, 4GB> per task
• Asset fairness yields:
    max(x, y)
    x + y ≤ 28      (CPU constraint)
    2x + 4y ≤ 56    (RAM constraint)
    4x = 6y         (equalize the total value allocated to the two users)
• User 1: x = 12, i.e., <43%CPU, 43%GB> (sum = 86%)
• User 2: y = 8, i.e., <29%CPU, 57%GB> (sum = 86%)
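A quick arithmetic check of the allocation above (the numbers come straight from the slide):

cpu_total, ram_total = 28, 56
x, y = 12, 8                                    # tasks for user 1 and user 2

user1 = (x * 1 / cpu_total, x * 2 / ram_total)  # <1 CPU, 2 GB> per task
user2 = (y * 1 / cpu_total, y * 4 / ram_total)  # <1 CPU, 4 GB> per task

print(user1, sum(user1))   # roughly (0.43, 0.43), total value about 0.86
print(user2, sum(user2))   # roughly (0.29, 0.57), total value about 0.86
# User 1 gets less than 50% of *both* resources: asset fairness violates
# the share guarantee, as discussed on the next slide.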
A first (wrong) solution (2)
• Problem: violates the share guarantee
  – User 1 gets less than 50% of both CPU and RAM
  – It would be better off in a separate cluster with half the resources
What Mesos needs
• A fair-sharing policy that provides:
  – Share guarantee
  – Strategy-proofness
• Challenge: can we generalize max-min fairness to multiple resources?

Dominant Resource Fairness (DRF)
• Dominant resource of a user: the resource the user has the biggest share of
  – Example:
    • Total resources: <8CPU, 5GB>
    • User 1 allocation: <2CPU, 1GB>
    • 2/8 = 25% CPU and 1/5 = 20% RAM
    • Dominant resource of user 1 is CPU (25% > 20%)
• Dominant share of a user: the fraction of the dominant resource allocated to the user
  – User 1's dominant share is 25%
DRF (2)
• Apply max-min fairness to dominant shares: give every user an equal share of its dominant resource
• Equalize the dominant shares of the users
  – Total resources: <9CPU, 18GB>
  – User 1 wants <1CPU, 4GB> per task; dominant resource for user 1: RAM (1/9 < 4/18)
  – User 2 wants <3CPU, 1GB> per task; dominant resource for user 2: CPU (3/9 > 1/18)

    max(x, y)
    x + 3y ≤ 9          (CPU constraint)
    4x + y ≤ 18         (RAM constraint)
    (4/18)x = (3/9)y    (equalize dominant shares)

• User 1: x = 3, i.e., <3CPU, 12GB> = <33%CPU, 66%GB>
• User 2: y = 2, i.e., <6CPU, 2GB> = <66%CPU, 11%GB>
Online DRF
• Whenever there are available resources and tasks to run:
  – Mesos presents resource offers to the framework with the smallest dominant share among all frameworks
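A minimal, illustrative sketch of this online rule (not Mesos code); it reproduces the <9CPU, 18GB> example from the previous slide:

def online_drf(total, demands):
    """Repeatedly offer resources to the user with the smallest dominant
    share; stop when no user's next task fits.
    total: resource -> capacity; demands: user -> per-task requirement."""
    used = {r: 0.0 for r in total}
    alloc = {u: {r: 0.0 for r in total} for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(u):
        return max(alloc[u][r] / total[r] for r in total)

    def fits(u):
        return all(used[r] + demands[u][r] <= total[r] for r in total)

    candidates = set(demands)
    while candidates:
        user = min(candidates, key=dominant_share)   # smallest dominant share first
        if not fits(user):
            candidates.discard(user)                 # its next task no longer fits
            continue
        for r in total:
            alloc[user][r] += demands[user][r]
            used[r] += demands[user][r]
        tasks[user] += 1
    return tasks, alloc

tasks, alloc = online_drf(
    {"cpu": 9, "mem": 18},
    {"user1": {"cpu": 1, "mem": 4}, "user2": {"cpu": 3, "mem": 1}})
print(tasks)   # {'user1': 3, 'user2': 2}: <3 CPU, 12 GB> and <6 CPU, 2 GB>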
DRF: efficiency-fairness trade-off
• DRF can leave resources under-utilized
• DRF schedules at the level of tasks (which can lead to sub-optimal job completion times)
• Fairness is fundamentally at odds with overall efficiency (how should we trade them off?)
Mesos and containers
• Mesos plays well with existing container technologies (e.g., Docker) and also provides its own container technology
  – Docker containers: tasks run inside a Docker container
  – Mesos containers: tasks run with an array of pluggable isolators provided by Mesos
• Also supports composing different container technologies (e.g., Docker and Mesos)
• At which scale? 50,000 live containers
Mesos deployment and evolution
• Mesos clusters can be deployed
  – On nearly every IaaS cloud provider infrastructure
  – In your own physical datacenter
• Mesos platform evolution towards microservices
• Mesosphere's Datacenter Operating System (DC/OS)
  – Open-source operating system and distributed system built upon Mesos
• Marathon
  – Container orchestration platform for Mesos and DC/OS
Apache Hadoop YARN
• YARN: Yet Another Resource Negotiator
  – A framework for job scheduling and cluster resource management
• Turns Hadoop into an analytics platform in which resource management functions are separated from the programming model
  – Many scheduling-related functions are delegated to per-job components
• Why the separation?
  – Provides flexibility in the choice of programming framework
  – The platform can support not only MapReduce, but also other models (e.g., Spark, Storm)
Source: Apache Hadoop YARN: Yet Another Resource Negotiator, SoCC'13
YARN architecture
• Global ResourceManager (RM)
• A set of per-application ApplicationMasters (AMs)
• A set of NodeManagers (NMs)
YARN architecture: ResourceManager
• One RM per cluster
  – Central: global view
  – Enables global properties: fairness, capacity, locality
• Job requests are submitted to the RM
  – Request-based
  – To start a job (application), the RM finds a container in which to spawn the AM
• Container
  – Logical bundle of resources (CPU, memory)
• The RM only handles an overall resource profile for each application
  – Local optimization is up to the application
• Preemption
  – Request resources back from an application
  – Checkpoint snapshot instead of explicitly killing jobs, or migrate computation to other containers
YARN architecture: ResourceManager (2)
• No static resource partitioning
• The RM is the resource scheduler in the system
  – Organizes the global assignment of computing resources to application instances
• Composed of Scheduler and ApplicationsManager
  – Scheduler: responsible for allocating resources to the various running applications
    • Pluggable scheduling policy to partition the cluster resources among the various applications
    • Available pluggable schedulers: CapacityScheduler and FairScheduler
  – ApplicationsManager: accepts job submissions and negotiates the first set of resources for job execution
YARN architecture: ApplicationMaster
• One AM per application
• The head of a job
• Runs as a container
• Issues resource requests to the RM to obtain containers
  – Number of containers, resources per container, locality preferences, ...
• Dynamically changes its resource consumption, based on the containers it receives from the RM
• Requests are late-binding
  – The process spawned is not bound to the request, but to the lease
  – The conditions that caused the AM to issue the request may no longer hold when it receives its resources
• Can run any user code, e.g., MapReduce, Spark, etc.
• The AM determines the semantics of the success or failure of the container
YARN architecture: NodeManager
• One NM per node
  – The worker daemon
  – Runs as a local agent on the execution node
• Registers with the RM, heartbeats its status, and receives instructions
• Reports resources to the RM: memory, CPU, ...
  – Responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing)
• Containers are described by a container launch context (CLC)
  – The command necessary to create the process
  – Environment variables, security tokens, …
YARN architecture: NodeManager (2)
• Configures the environment for task execution
• Garbage collection
• Auxiliary services
  – A process may produce data that persist beyond the life of the container
  – E.g., output of intermediate data between map and reduce tasks
YARN: application startup
• Submitting the application: passing a CLC for the AM to the RM
• The RM launches the AM, which registers with the RM
• The AM periodically advertises its liveness and resource requirements over the heartbeat protocol
• Once the RM allocates a container, the AM constructs a CLC to launch the container on the corresponding NM
  – The AM monitors the status of the running container and stops it when the resource should be reclaimed
• Once the AM is done with its work, it should unregister from the RM and exit cleanly
Mesos vs. YARN
  Mesos                                      YARN
  ...                                        Designed and optimized for Hadoop jobs (stateless batch jobs)
  Multi-dimensional resource granularity     RAM/CPU slots resource granularity
  Multiple masters                           Single ResourceManager
  Written in C++                             Written in Java
  Linux cgroups (performance isolation)      Simple Linux processes
Cluster resource management: Hot research topic
• Borg @ EuroSys 2015
  – Cluster management system used by Google for the last decade, ancestor of Kubernetes
• Firmament @ OSDI 2016, firmament.io
  – Scheduler (with a Kubernetes plugin called Poseidon) that balances the high-quality placement decisions of a centralized scheduler with the speed of a distributed scheduler
• Slicer @ OSDI 2016
  – Google's auto-sharding service that separates out the responsibility for sharding, balancing, and re-balancing from individual application frameworks
Container orchestration
• Container orchestration: the set of operations that allows cloud and application providers to define how to select, deploy, monitor, and dynamically control the configuration of multi-container packaged applications in the cloud
  – The tooling used by organizations adopting containers in enterprise production to integrate and manage those containers at scale
• Some examples
  – Cloudify
  – Kubernetes
  – Docker Swarm
Container-management systems at Google
• Application-oriented shift: "Containerization transforms the data center from being machine-oriented to being application-oriented"
• Goal: to allow container technology to operate at Google scale
  – Everything at Google runs as a container
• Borg -> Omega -> Kubernetes
  – Borg and Omega: purely Google-internal systems
  – Kubernetes: open source
Borg
Cluster management system at Google that achieves high utilization by:
• Admission control
• Efficient task-packing
• Over-commitment
• Machine sharing
The user perspective
• Users: mainly Google developers and system administrators
• Heterogeneous workload: mainly production and batch
• Cell: subset of cluster machines
  – Around 10K nodes
  – Machines in a cell are heterogeneous in many dimensions
• Jobs and tasks
  – A job is a group of tasks
  – Each task maps to a set of Linux processes running in a container on a machine
    • task = container
The user perspective (2)
• Alloc (allocation)
  – Reserved pool of resources on a machine in which one or more tasks can run
  – The smallest deployable unit that can be created, scheduled, and managed
  – Task = alloc, job = alloc set
• Priority, quota, and admission control
  – A job has a priority (preempting)
  – Quota is used to decide which jobs to admit for scheduling
• Naming and monitoring
  – A service's clients and other systems need to be able to find tasks, even after their relocation to a new machine
  – BNS example: 50.jfoo.ubar.cc.borg.google.com
  – Monitoring of the health of the task and thousands of performance metrics
Scheduling a job
job hello_world = {
  runtime = { cell = 'ic' }              // what cell should run it in?
  binary = '../hello_world_webserver'    // what program to run?
  args = { port = '%port%' }
  requirements = {
    RAM = 100M
    disk = 100M
    CPU = 0.1
  }
  replicas = 10000
}
Borg architecture
• Borgmaster
  – Main Borgmaster process (five replicas)
  – Scheduler
• Borglets
  – Manage and monitor tasks and resources
  – The Borgmaster polls the Borglets every few seconds
Borg architecture
• Fauxmaster: high-fidelity Borgmaster simulator
  – Simulates previous runs from checkpoints
  – Contains the full Borg code
• Used for debugging, capacity planning, and evaluating new policies and algorithms
Scheduling in Borg
• Two-phase scheduling algorithm
  1. Feasibility checking: find machines for a given job
  2. Scoring: pick one machine
     – User preferences and built-in criteria, e.g.:
       • Minimize the number and priority of the preempted tasks
       • Pick machines that already have a copy of the task's packages
       • Spread tasks across power and failure domains
       • Pack by mixing high- and low-priority tasks
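An illustrative sketch of the two phases; the scoring terms below are simplified stand-ins for Borg's real criteria, and the machine and task attributes are made up:

def schedule(task, machines):
    """Phase 1: feasibility (machines that meet the task's requirements).
    Phase 2: scoring (pick the best feasible machine)."""
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_ram"] >= task["ram"]]
    if not feasible:
        return None

    def score(m):
        # Prefer machines where nothing must be preempted, that already cache
        # the task's packages, and where CPU and RAM stay evenly utilized.
        preempt_cost = sum(t["priority"] for t in m.get("preemptable", []))
        package_hit = 1 if task["binary"] in m.get("cached_packages", ()) else 0
        stranding = abs((m["free_cpu"] - task["cpu"]) / m["cpu"]
                        - (m["free_ram"] - task["ram"]) / m["ram"])
        return (-preempt_cost, package_hit, -stranding)   # lexicographic preference

    return max(feasible, key=score)

machines = [
    {"name": "m1", "cpu": 8, "ram": 32, "free_cpu": 2, "free_ram": 4,
     "cached_packages": {"hello_world_webserver"}},
    {"name": "m2", "cpu": 8, "ram": 32, "free_cpu": 6, "free_ram": 16,
     "cached_packages": set()},
]
task = {"cpu": 0.1, "ram": 0.1, "binary": "hello_world_webserver"}
print(schedule(task, machines)["name"])   # m1: it already caches the binary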
Borg’s allocation policies
• Advanced bin-packing algorithms
  – Avoid stranding of resources
  – Bin packing problem: objects of different volumes vi must be packed into a finite number of bins, each of volume V, in a way that minimizes the number of bins used
  – Bin packing is an NP-hard problem -> many heuristics, among which: best fit, first fit, worst fit
• Evaluation metric: cell compaction
  – Find the smallest cell into which we can pack the workload

Borg scheduler scalability
• Separate scheduler
• Separate threads to poll the Borglets
• Partition functions across the five replicas
• Score caching
  – Evaluating feasibility and scoring a machine is expensive, so cache the results
• Equivalence classes
  – Group tasks with identical requirements
• Relaxed randomization
  – Wasteful to calculate feasibility and scores for all the machines in a large cell
Borg ecosystem
• Many different systems built in, on, and around Borg to improve upon the basic container-management services
  – Naming and service discovery (the Borg Name Service, or BNS)
  – Master election, using Chubby
  – Application-aware load balancing
  – Horizontal (instance number) and vertical (instance size) auto-scaling
  – Rollout tools for deployment of new binaries and configuration data
  – Workflow tools
  – Monitoring tools
Omega
• Offspring of Borg
• State of the cluster stored in a centralized Paxos-based transaction-oriented store
  – Accessed by the different parts of the cluster control plane (such as schedulers), using optimistic concurrency control to handle the occasional conflicts
[Figure: comparison of Mesos and Borg]
Kubernetes (K8s)
• Google's open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure
  kubernetes.io
  kubernetes.io/docs/tutorials/kubernetes-basics/
• Loosely inspired by Borg
  – Borglet => Kubelet
  – alloc => pod
  – Borg containers => Docker
  – Declarative specifications
• Improved
  – Job => labels
  – Managed ports => IP per pod
  – Monolithic master => micro-services
Pod
• Pod: the basic unit in Kubernetes
  – Group of containers that are co-scheduled
  – A resource envelope in which one or more containers run
  – Containers that are part of the same pod are guaranteed to be scheduled together onto the same machine, and can share state via local volumes
• Users organize pods using labels (see the sketch below)
  – Label: arbitrary key/value pair attached to a pod
  – E.g., role=frontend and stage=production
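A short sketch of creating and selecting labeled pods with the official Kubernetes Python client; the pod name, image, and label values are arbitrary examples:

from kubernetes import client, config

config.load_kube_config()                     # assumes a local kubeconfig
api = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="frontend-0",
        labels={"role": "frontend", "stage": "production"}),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="web", image="nginx:1.21",
                           ports=[client.V1ContainerPort(container_port=80)])]))
api.create_namespaced_pod(namespace="default", body=pod)

# Labels are how users group pods, much as a Borg job groups its tasks.
pods = api.list_namespaced_pod("default",
                               label_selector="role=frontend,stage=production")
print([p.metadata.name for p in pods.items])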
IP per pod
• In Borg, containers running on a machine share the host's IP address, so Borg assigns the containers unique port numbers
  – This burdens infrastructure and application developers, e.g., it requires replacing the DNS service
• Kubernetes assigns an IP address per pod, thus aligning network identity (IP address) with application identity
  – Easier to run off-the-shelf software on Kubernetes
[Figure: Borg vs. Kubernetes network identity]
Omega and Kubernetes
• Like Omega, Kubernetes has a shared persistent store
  – Components watch for changes to relevant objects
• In contrast to Omega, state is accessed through a domain-specific REST API that applies higher-level versioning, validation, semantics, and policy, in support of a more diverse array of clients
  – Stronger focus on the experience of developers writing cluster applications
Kubernetes architecture
• Master: cluster control plane
• Kubelets: node agents
• Cluster state backed by distributed storage system (etcd, a highly available distributed key-value store)
Kubernetes on Mesos
• Mesos allows dynamic sharing of cluster resources between Kubernetes and other first-class Mesos frameworks such as HDFS, Spark, and Chronos
  kubernetes.io/docs/getting-started-guides/mesos/
References
• B. Burns et al., Borg, Omega, and Kubernetes. Commun. ACM, 2016.
• A. Ghodsi et al., Dominant resource fairness: fair allocation of multiple resource types, NSDI 2011.
• B. Hindman et al., Mesos: a platform for fine-grained resource sharing in the data center, NSDI 2011.
• M. Schwarzkopf et al., Omega: flexible, scalable schedulers for large compute clusters, EuroSys 2013.
• V. K. Vavilapalli et al., Apache Hadoop YARN: yet another resource negotiator, SOCC 2013.
• A. Verma et al., Large-scale cluster management at Google with Borg, EuroSys 2015.