From the dive bars of Silicon Valley to the World Tour
Carlo Curino
Aug 23, 2014
PhD + PostDoc in databases
“we did all of this 30 years ago!”
Yahoo! Research
“webscale, webscale, webscale!”
Microsoft – CISL
“enterprise + cloud + search engine + big-data”
My perspective
Cluster as an embedded system (map-reduce)
single-purpose clusters
General purpose cluster OS (YARN, Mesos, Omega, Corona)
standardizing access to computational resources
Real-time OS for the cluster!? (Rayon)
predictable resource allocation
Cluster stdlib (REEF)
factoring out common functionalities
Agenda
Cluster as an Embedded System “the era of map-reduce-only clusters”
Purpose-built technology
Within large web companies
Well targeted mission (process webcrawl)
→ scale and fault tolerance
The origin
Google leading the pack
Google File System + MapReduce (2003/2004)
Open-source and parallel efforts
Yahoo! Hadoop ecosystem HDFS + MR (2006/2007)
Microsoft Scope/Cosmos (2008) (more than MR)
In-house growth
What was the key to success for Hadoop?
In-house growth (following the Hadoop story)
Access, access, access…
All the data sits in the DFS
Trivial to use massive compute power
→ lots of new applications
But… everything has to be MR
Cast any computation as a map-only job
MPI, graph processing, streaming, launching web-servers!?!
Popularization
Everybody wants Big Data
Insight from raw data is cool
Outside MS and Google, Big-Data == Hadoop
Hadoop as catch-all big-data solution (and cluster manager)
Not just massive in-house clusters
New challenges?
New deployment environments
Small clusters (10s of machines)
Public Cloud
New deployment challenges
Small clusters
Efficiency matters more than scalability
Admin/tuning done by mere mortals
Cloud
Untrusted users (security)
Users are paying (availability, predictability)
Users are unrelated to each other (performance isolation)
Classic MapReduce
Classic Hadoop (1.0) Architecture
[Diagram: a Client submits Job1 to the JobTracker, whose Scheduler assigns Map and Reduce tasks to slots on the TaskTrackers.]
JobTracker handles resource management: global invariants (fairness/capacity); it determines who runs, with which resources, and where
JobTracker manages the MapReduce application flow: maps before reducers, re-run upon failure, etc.
What are the key shortcomings of (old) Hadoop?
Hadoop 1.0 Shortcomings
Programming model rigidity
JobTracker manages resources
JobTracker manages application workflow (data dependencies)
Performance and Availability
Map vs Reduce slots lead to low cluster utilization (~70%)
JobTracker had too much to do: scalability concern
JobTracker is a single point of failure
Hadoop 1.0 Shortcomings (similar to original MR)
General purpose cluster OS “Cluster OS (YARN)”
YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)*
Request-based central scheduler
Mesos (2011, UCB, open-sourced, tested at Twitter)*
Offer-based two level scheduler
Omega (2013, Google, simulation?)*
Shared-state-based scheduling
Corona (2013, Facebook, production)
YARN-like, but offer-based
Four proposals
* all three starred papers won best-paper or best-student-paper awards
YARN
[Stack diagram, Hadoop 1 World vs. Hadoop 2 World:
Users: ad-hoc apps in both worlds
Application frameworks: Hive / Pig in both worlds
Programming model(s): MR v1 vs. MR v2, Tez, Giraph, Storm, Dryad, REEF, ...
Cluster OS (resource management): Hadoop 1.x (MapReduce) vs. YARN
File system: HDFS 1 vs. HDFS 2
Hardware]
A new architecture for Hadoop
Decouples resource management from programming model
(MapReduce is an “application” running on YARN)
YARN (or Hadoop 2.x)
YARN (Hadoop 2) Architecture
[Diagram: a Client submits Job1 to the Resource Manager, whose Scheduler hands out containers; a per-application App Master runs on a NodeManager and negotiates access to more resources via ResourceRequests, launching Tasks across NodeManagers.]
Flexibility, Performance and Availability
Multiple Programming Models
Central components do less → scale better
Easier High-Availability (e.g., RM vs AM)
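The request-based protocol at the heart of YARN (the App Master sends ResourceRequests; the Resource Manager's scheduler grants containers out of free node capacity) can be sketched as a toy model. All class and method names below are illustrative simplifications, not the real org.apache.hadoop.yarn API:

```java
import java.util.*;

// Toy model (NOT the real YARN API) of request-based scheduling:
// the ApplicationMaster asks for containers via ResourceRequests and
// the ResourceManager's scheduler grants them from free node capacity.
public class RequestBasedScheduling {

    static class ResourceRequest {
        final int memoryMb, numContainers;  // capability and count
        ResourceRequest(int memoryMb, int numContainers) {
            this.memoryMb = memoryMb;
            this.numContainers = numContainers;
        }
    }

    static class Scheduler {
        // free memory per node, in MB
        final Map<String, Integer> freeMemory = new LinkedHashMap<>();

        Scheduler(Map<String, Integer> nodes) { freeMemory.putAll(nodes); }

        // Greedily grant containers; returns the node hosting each grant.
        List<String> allocate(ResourceRequest req) {
            List<String> granted = new ArrayList<>();
            for (Map.Entry<String, Integer> node : freeMemory.entrySet()) {
                while (granted.size() < req.numContainers
                        && node.getValue() >= req.memoryMb) {
                    node.setValue(node.getValue() - req.memoryMb);
                    granted.add(node.getKey());
                }
            }
            return granted;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> nodes = new LinkedHashMap<>();
        nodes.put("n1", 4096);
        nodes.put("n2", 2048);
        Scheduler rm = new Scheduler(nodes);
        // The AM negotiates three 1 GB containers:
        System.out.println(rm.allocate(new ResourceRequest(1024, 3)));
        // prints [n1, n1, n1]
    }
}
```

The point of the decoupling is visible even in the toy: the scheduler only matches resources, while application flow (what to run in each container) stays with the AM.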
Why does this matter?
System     | Jobs/Day    | Tasks/Day | Cores pegged
Hadoop 1.0 | 77k         | 4M        | 3.2
YARN       | 125k (150k) | 12M (15M) | 6 (10)
Anything else you can think of?
Maintenance, Upgrade, and Experimentation
Run with multiple framework versions (at one time)
Trying out a new idea is as easy as launching a job
Anything else you can think of?
Real-time OS for the cluster (?) “predictable resource allocation”
YARN (Cosmos, Mesos, and Corona)
support instantaneous scheduling invariants (fairness/capacity)
maximize cluster throughput (eye to locality)
Current trends
New applications (require “gang” and dependencies)
Consolidation of production/test clusters + Cloud
(SLA jobs mixed with best-effort jobs)
Motivation
Job/Pipeline with SLAs: 200 CPU hours by 6am (e.g., Oozie)
Service: daily ebb/flows, reserve capacity accordingly (e.g., Samza)
Gang: I need 50 concurrent containers for 3 hours (e.g., Giraph)
Example Use Cases
In a consolidated cluster:
Time-based SLAs for production jobs (completion deadline)
Good latency for best-effort jobs
High cluster utilization/throughput
(Support rich applications: gang and skylines)
High-Level Goals
Decompose time-based SLAs into:
resource definition: via RDL
predictable resource allocation: planning + scheduling
Divide and Conquer time-based SLAs
Expose application needs to the planner
time: start (s), finish (f)
resources: capacity (w), total parallelism (h),
minimum parallelism (l), min lease duration (t)
Resource Definition Language (RDL) 1/2
Skylines / pipelines:
dependencies: among atomic allocations
(ALL, ANY, ORDER)
Resource Definition Language (RDL) 2/2
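A minimal model of an RDL atomic allocation, using the symbols from the slides (s, f, w, h, l, t) and the three interpreters (ALL, ANY, ORDER). This is a sketch of the concepts only, not YARN's actual ReservationDefinition API, and the feasibility check is a simplified necessary condition, not the full planner:

```java
// Toy model of an RDL atomic allocation: validity window [s, f),
// total capacity w (container-hours), max/min parallelism h/l, and
// minimum lease duration t. feasible() checks the demand can fit.
public class RdlSketch {

    enum Interpreter { ALL, ANY, ORDER }  // how dependent atoms compose

    static class Atom {
        final long s, f;  // start / finish of the window (hours)
        final long w;     // total demand (container-hours)
        final long h, l;  // max / min parallelism (containers)
        final long t;     // minimum lease duration (hours)

        Atom(long s, long f, long w, long h, long l, long t) {
            this.s = s; this.f = f; this.w = w;
            this.h = h; this.l = l; this.t = t;
        }

        boolean feasible() {
            return f > s
                && l <= h
                && f - s >= t          // window fits the minimum lease
                && w <= h * (f - s);   // demand fits under max parallelism
        }
    }

    public static void main(String[] args) {
        // "200 CPU-hours by 6am", at most 50 concurrent containers,
        // at least 10 at a time, in leases of 3+ hours:
        Atom job = new Atom(0, 6, 200, 50, 10, 3);
        System.out.println(job.feasible());  // true: 200 <= 50 * 6
    }
}
```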
Important classes
Framework semantics: Perforator modeling of Scope/Hive
Machine Learning: gang + bounded iterations (PREDict)
Periodic jobs: history-based resource definition
Coming up with RDL specs
prediction
[Plan queue hierarchy:
Root 100%
  Production 60% (J1 10%, J2 40%, J3 10%)
  Staging 15%
  Post 5%
  Best Effort 20%]
Planning vs Scheduling
[Diagram: the Plan Follower turns the time-aware plan for jobs J1 and J3 into the scheduler's instantaneous queue configuration.]
Scheduling (fine-grained but time-oblivious) vs. Planning (coarse but time-aware)
[Diagram: a ReservationService (Plan + Sharing Policy) inside the Resource Manager mediates between the Resource Definition and the scheduler, with preemption and a system model / feedback loop.]
Some example runs:
lots of queues for gridmix
Microsoft pipelines
Dynamic queues
Improves
production job SLAs
best-effort jobs latency
cluster utilization and throughput
Comparing against Hadoop CapacityScheduler
Under promise, over deliver
Plan for late execution, and run as early as you can
Greedy Agent
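"Plan for late execution, and run as early as you can" can be sketched as a backward-greedy placement: walk from the deadline toward the present, claiming capacity as late as possible so the plan stays flexible, while the runtime scheduler remains free to dispatch the work earlier on idle resources. This is a hypothetical simplification of the agents, not their actual implementation:

```java
import java.util.Arrays;

// Toy backward-greedy agent: place w container-steps of demand, at most
// h per step, as LATE as the free plan capacity allows before the deadline.
public class GreedyAgent {

    // freeCapacity[i] = spare containers the plan has at time step i.
    // Returns the per-step allocation, or null if the demand cannot fit.
    static int[] placeLate(int[] freeCapacity, int deadline, int w, int h) {
        int[] alloc = new int[freeCapacity.length];
        int remaining = w;
        for (int step = deadline - 1; step >= 0 && remaining > 0; step--) {
            int grab = Math.min(h, Math.min(freeCapacity[step], remaining));
            alloc[step] = grab;
            remaining -= grab;
        }
        return remaining == 0 ? alloc : null;
    }

    public static void main(String[] args) {
        // 8 container-steps of work, <= 3 at a time, due at step 4:
        int[] alloc = placeLate(new int[]{5, 5, 5, 5}, 4, 8, 3);
        System.out.println(Arrays.toString(alloc));  // [0, 2, 3, 3]
    }
}
```

Placing late "under-promises" (the reservation is guaranteed only just in time); running early on spare capacity "over-delivers".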
Coping with imperfections (system)
compensate RDL based on black-box models of overheads
Coping with Failures (system)
re-plan (move/kill allocations) in response to system-observable resource issues
Coping with Failures/Misprediction (user)
continue in best-effort mode when reservation expires
re-negotiate existing reservations
Dealing with “Reality”
Sharing Policy: CapacityOverTimePolicy
constrains both the instantaneous max and a running avg
e.g., no user can exceed an instantaneous 30% allocation, and
an average of 10% in any 24h period of time
single partial scan of plan: O(|alloc| + |window|)
User Quotas (trade-off flexibility to fairness)
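The two-part quota check above (no step above an instantaneous max, no sliding window above a running average) can be done in a single pass over the plan with a running window sum. The method below is a simplified model of that idea, not Hadoop's CapacityOverTimePolicy implementation:

```java
// Toy single-scan quota check in the spirit of CapacityOverTimePolicy:
// (a) no time step exceeds instMax (e.g., 0.30 of the cluster), and
// (b) no `window` consecutive steps (e.g., 24 hourly steps) average
// above avgMax (e.g., 0.10). One pass with a sliding sum: O(|alloc|).
public class SharingPolicySketch {

    static boolean withinQuota(double[] alloc, double instMax,
                               double avgMax, int window) {
        double sum = 0;
        for (int i = 0; i < alloc.length; i++) {
            if (alloc[i] > instMax) return false;       // instantaneous cap
            sum += alloc[i];
            if (i >= window) sum -= alloc[i - window];  // slide the window
            if (i >= window - 1 && sum / window > avgMax) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        double[] bursty = {0.30, 0.30, 0.0, 0.0};
        // Each step respects the 30% cap, but a 2-step window averages 30% > 10%:
        System.out.println(withinQuota(bursty, 0.30, 0.10, 2));               // false
        System.out.println(withinQuota(new double[]{0.05, 0.05}, 0.30, 0.10, 2));  // true
    }
}
```

The running-average constraint is what lets a user spike to the instantaneous cap occasionally without being able to hold it, trading a little flexibility for long-run fairness.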
Introduce Admission Control and Time-based SLAs (YARN-1051)
New ReservationService API (to reserve resources)
Agents + Plan + SharingPolicy to organize future allocations
Leverage underlying scheduler
Future Directions
Work with MSR-India on RDL estimates for Hive and MR
Advanced agents for placement ($$-based and optimal algos)
Enforcing decisions (Linux Containers, Drawbridge, Pacer)
Conclusion
Cluster stdlib: REEF “factoring out recurring components”
Dryad DAG computations
Tez DAG computations (focus on interactive and Hive support)
Storm stream processing
Spark interactive / in-memory / iterative
Giraph graph processing, Bulk Synchronous Parallel (à la Pregel)
Impala scalable, interactive, SQL-like query
HoYA HBase on YARN
Stratosphere parallel iterative computations
REEF, Weave, Spring-Hadoop meta-frameworks to help build apps
Focusing on YARN: many applications
Lots of repeated work
Communication
Configuration
Data and Control Flow
Error handling / fault-tolerance
Common “better than Hadoop” tricks:
Avoid Scheduling overheads
Control Excessive disk IO
Are YARN/Mesos/Omega enough?
The Challenge
YARN / HDFS
SQL / Hive … … Machine Learning
Fault Tolerance
Row/Column Storage
High Bandwidth Networking
The Challenge
YARN / HDFS
SQL / Hive … … Machine Learning
Fault Awareness
Local data caching
Low Latency Networking
REEF in the Stack
[Diagram: REEF layered between YARN / HDFS and application frameworks such as SQL / Hive and Machine Learning.]
REEF in the Stack (Future)
[Diagram: future REEF additionally offers an Operator API and Library as a Logical Abstraction on top of that layer.]
REEF
[Diagram: the YARN architecture with REEF: the Client submits Job1 to the Resource Manager, and each container on the NodeManagers hosts an Evaluator running a Task plus services (e.g., name-based communication) on the REEF runtime.]
Driver: user control-flow logic; retains state; event-based control flow; fault detection
Evaluator: user data-crunching logic; injection-based checkable configuration
REEF: Computation and Data Management
Extensible control flow + data management services: Storage, Network, State Management
Job Driver: control-plane implementation; user code executed on YARN's Application Master
Activity: user code executed within an Evaluator
Evaluator: execution environment for Activities; one Evaluator is bound to one YARN container
Control flow is centralized in the Driver: Evaluator and Task configuration and launch
Error handling is centralized in the Driver: all exceptions are forwarded to the Driver
All APIs are asynchronous
Support for: caching, checkpointing, group communication
Example apps running on REEF: MR, async PageRank, ML regressions, PCA, distributed shell, ...
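The event-based control flow above can be sketched as a tiny event bus: the Driver registers handlers for runtime events (evaluator allocated, task failed, ...) instead of polling, which is also why error handling centralizes there naturally. Event names and classes below are illustrative, not the REEF API:

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal event bus standing in for the REEF runtime's event dispatch.
class EventBus {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();

    void on(String event, Consumer<String> h) {
        handlers.computeIfAbsent(event, k -> new ArrayList<>()).add(h);
    }

    void fire(String event, String payload) {
        handlers.getOrDefault(event, List.of()).forEach(h -> h.accept(payload));
    }
}

// Toy Driver: reacts to runtime events; all control-flow (and fault)
// decisions live here, while Evaluators just crunch data.
public class ToyDriver {
    final List<String> log = new ArrayList<>();

    ToyDriver(EventBus bus) {
        bus.on("EvaluatorAllocated", id -> log.add("launch task on " + id));
        bus.on("TaskCompleted",      id -> log.add("release " + id));
        bus.on("TaskFailed",         id -> log.add("resubmit on " + id));
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        ToyDriver driver = new ToyDriver(bus);
        bus.fire("EvaluatorAllocated", "eval-1");
        bus.fire("TaskFailed", "eval-1");
        System.out.println(driver.log);
        // prints [launch task on eval-1, resubmit on eval-1]
    }
}
```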
REEF Summary
(Open-sourced with Apache License)
Big-Data Systems
Ongoing focus
Future work
Leverage high-level app semantics
Coordinate tiered storage and scheduling
Conclusions
Adding Preemption to YARN, and open-sourcing it to Apache
Limited mechanisms to “revise current schedule”
Patience
Container killing
To enforce global properties
Leave resources fallow (e.g., CapacityScheduler) → low utilization
Kill containers (e.g., FairScheduler) → wasted work
(Old) new trick
Support work-preserving preemption
(via) checkpointing → more than preemption
State of the Art
Changes throughout YARN
[Diagram: the YARN architecture, highlighting that preemption touches the Client, the RM Scheduler, the NodeManagers, the App Master, and the Tasks.]
PreemptionMessage {
  Strict   { Set<ContainerID> }
  Flexible { Set<ResourceRequest>, Set<ContainerID> }
}
Collaborative application: policy-based binding for Flexible preemption requests
Use of preemption context: outdated information, delayed effects of actions, multi-actor orchestration
Interesting type of preemption: the RM issues a declarative request, and the AM binds it to containers
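When the AM binds a Flexible request to containers, it needs a victim-selection policy. One hypothetical policy (illustrative only, not the Hadoop default) is to yield the youngest containers first, since they have done the least work:

```java
import java.util.*;

// Toy AM-side binding of a Flexible preemption request: the RM asks for
// an amount of memory back; the AM picks which containers to yield.
// Yielding the youngest first minimizes wasted work (illustrative policy).
public class FlexiblePreemption {

    static class Container {
        final String id; final int memoryMb; final long startTime;
        Container(String id, int memoryMb, long startTime) {
            this.id = id; this.memoryMb = memoryMb; this.startTime = startTime;
        }
    }

    static List<String> chooseVictims(List<Container> running, int memoryToFree) {
        List<Container> byYoungest = new ArrayList<>(running);
        byYoungest.sort(
            Comparator.comparingLong((Container c) -> c.startTime).reversed());
        List<String> victims = new ArrayList<>();
        int freed = 0;
        for (Container c : byYoungest) {
            if (freed >= memoryToFree) break;
            victims.add(c.id);
            freed += c.memoryMb;
        }
        return victims;
    }

    public static void main(String[] args) {
        List<Container> running = List.of(
            new Container("c1", 1024, 100),   // oldest: most work done
            new Container("c2", 1024, 200),
            new Container("c3", 1024, 300));  // youngest
        System.out.println(chooseVictims(running, 2048));  // [c3, c2]
    }
}
```

This is exactly the "collaborative" part: the RM states what it needs back, and the AM, which knows its tasks' progress, decides how to comply.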
Changes throughout YARN
[Diagram: the same architecture, with an MR App Master responding to the preemption request.]
When can I preempt? Tag safe UDFs or user-saved state:
@Preemptable
public class MyReducer { … }
Common Checkpoint Service:
WriteChannel cwc = cs.create();
cwc.write(…state…);
CheckpointID cid = cs.commit(cwc);
ReadChannel crc = cs.open(cid);
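A minimal in-memory sketch of the checkpoint-service shape shown above (create / write / commit / open). The real service would persist to durable storage such as HDFS; the types here only mirror the slide's names, with a String CheckpointID and a byte[] standing in for the ReadChannel:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the Common Checkpoint Service API from the slide:
// create() -> WriteChannel, commit() -> CheckpointID, open(cid) to read.
public class CheckpointServiceSketch {

    static class WriteChannel {
        final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        void write(byte[] state) { buf.write(state, 0, state.length); }
    }

    private final Map<String, byte[]> store = new HashMap<>();
    private int next = 0;

    WriteChannel create() { return new WriteChannel(); }

    String commit(WriteChannel cwc) {  // returns a CheckpointID
        String cid = "ckpt-" + next++;
        store.put(cid, cwc.buf.toByteArray());
        return cid;
    }

    byte[] open(String cid) { return store.get(cid); }  // ReadChannel stand-in

    public static void main(String[] args) {
        CheckpointServiceSketch cs = new CheckpointServiceSketch();
        WriteChannel cwc = cs.create();
        cwc.write("reducer state".getBytes());
        String cid = cs.commit(cwc);
        System.out.println(new String(cs.open(cid)));  // prints: reducer state
    }
}
```

The commit step is what makes this usable for work-preserving preemption: a task is killable the moment its state is committed, and a restarted task resumes by opening the CheckpointID.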
CapacityScheduler + Unreservation + Preemption: memory utilization
CapacityScheduler (allow overcapacity)
CapacityScheduler (no overcapacity)
[Diagram: the preemption changes mapped onto the architecture as Apache JIRAs: MR-5176, MR-5189, MR-5192, MR-5194, MR-5196, MR-5197, and YARN-569.]
(Metapoint) Experience contributing to Apache
Engaging with OSS:
talk with active developers
show early/partial work
small patches
ok to leave things unfinished
With @Preemptable
tag imperative code with semantic property
Generalize this trick
expose semantic properties to platform (@PreserveSortOrder)
allow platforms to optimize execution (map-reduce pipelining)
REEF seems the logical place to do this.
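The tagging trick generalizes with a runtime-visible annotation that the platform queries via reflection before applying an optimization. A self-contained sketch (the nested annotation and reducer classes are illustrative; @PreserveSortOrder would work the same way for pipelining decisions):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch of tagging UDFs with semantic properties: the platform checks
// for the annotation at runtime and only then enables the optimization
// (here: preempting a reducer mid-run).
public class SemanticTags {

    @Retention(RetentionPolicy.RUNTIME)  // visible to reflection
    @Target(ElementType.TYPE)
    @interface Preemptable {}

    @Preemptable
    static class MyReducer { /* stateless, or state is user-saved */ }

    static class StatefulReducer { /* unsaved side state: do not preempt */ }

    static boolean canPreempt(Class<?> udf) {
        return udf.isAnnotationPresent(Preemptable.class);
    }

    public static void main(String[] args) {
        System.out.println(canPreempt(MyReducer.class));       // true
        System.out.println(canPreempt(StatefulReducer.class)); // false
    }
}
```

The key design point is that the annotation is declarative: user code states a property, and the platform, not the user, decides whether and when to exploit it.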
Tagging UDFs
(Basic) building block for:
efficient preemption
dynamic optimizations (task splitting, efficiency improvements)
fault tolerance
Other uses for Checkpointing