From the dive bars of Silicon Valley to the World Tour
Carlo Curino

The Evolution of Big Data Frameworks

Aug 23, 2014



The talk presents the evolution of Big-Data systems from single-purpose MapReduce frameworks to fully general computational infrastructures. In particular, I will follow the evolution of Hadoop, and show the benefits and challenges of a new architectural paradigm that decouples the resource management component (YARN) from the specifics of the application frameworks (e.g., MapReduce, Tez, REEF, Giraph, Naiad, Dryad, Spark, ...). We argue that besides the primary goals of increasing scalability and programming model flexibility, this transformation dramatically facilitates innovation.

In this context, I will present some of our contributions to the evolution of Hadoop (namely: work-preserving preemption, and predictable resource allocation), and comment on the fascinating experience of working on open-source technologies from within Microsoft. The current Hadoop APIs (HDFS and YARN) provide the cluster equivalent of an OS API. With this as a backdrop, I will present our attempt to create the equivalent of a stdlib for the cluster: the REEF project.

Carlo A. Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL, leading the Relational Cloud project. He worked at Yahoo! Research as a Research Scientist, focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the Cloud and Information Services Lab (CISL), where he works on big-data platforms and cloud computing.
Transcript
Page 1: The Evolution of Big Data Frameworks

From the dive bars of Silicon Valley to the World Tour

Carlo Curino

Page 2: The Evolution of Big Data Frameworks

PhD + PostDoc in databases

“we did all of this 30 years ago!”

Yahoo! Research

“webscale, webscale, webscale!”

Microsoft – CISL

“enterprise + cloud + search engine + big-data”

My perspective

Page 3: The Evolution of Big Data Frameworks

Cluster as an Embedded System (MapReduce)

single-purpose clusters

General purpose cluster OS (YARN, Mesos, Omega, Corona)

standardizing access to computational resources

Real-time OS for the cluster !? (Rayon)

predictable resource allocation

Cluster stdlib (REEF)

factoring out common functionalities

Agenda

Page 4: The Evolution of Big Data Frameworks

Cluster as an Embedded System: “the era of MapReduce-only clusters”

Page 5: The Evolution of Big Data Frameworks

Purpose-built technology

Within large web companies

Well-targeted mission (process the web crawl)

→ scale and fault tolerance

The origin

Google leading the pack

Google File System + MapReduce (2003/2004)

Open-source and parallel efforts

Yahoo! Hadoop ecosystem HDFS + MR (2006/2007)

Microsoft Scope/Cosmos (2008) (more than MR)

Page 6: The Evolution of Big Data Frameworks

In-house growth

What was the key to success for Hadoop?

Page 7: The Evolution of Big Data Frameworks

In-house growth (following the Hadoop story)

Access, access, access…

All the data sits in the DFS

Trivial to use massive compute power

→ lots of new applications

But… everything has to be MR

Cast any computation as a map-only job (see the sketch below)

MPI, graph processing, streaming, launching web-servers!?!
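For concreteness, here is a minimal sketch of the "map-only wrapper" trick, using the standard Hadoop MapReduce API; the mapper body is a placeholder for whatever arbitrary code one wants to smuggle onto the cluster.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MapOnlyLauncher {

  // The "map" function is just a vehicle: the framework sees a map task,
  // the user runs whatever they want inside it.
  public static class ArbitraryWorkMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Arbitrary user code goes here (an MPI rank, a graph worker,
      // even a web server); no output is emitted.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-wrapper");
    job.setJarByClass(MapOnlyLauncher.class);
    job.setMapperClass(ArbitraryWorkMapper.class);
    job.setNumReduceTasks(0);                       // map-only: no shuffle, no reduce
    job.setInputFormatClass(TextInputFormat.class); // dummy input to spawn tasks
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}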

Page 8: The Evolution of Big Data Frameworks

Popularization

Everybody wants Big Data

Insight from raw data is cool

Outside MS and Google, Big-Data == Hadoop

Hadoop as catch-all big-data solution (and cluster manager)

Page 9: The Evolution of Big Data Frameworks

Not just massive in-house clusters

New challenges?

New deployment environments

Small clusters (10s of machines)

Public Cloud

Page 10: The Evolution of Big Data Frameworks

New deployment challenges

Small clusters

Efficiency matters more than scalability

Admin/tuning done by mere mortals

Cloud

Untrusted users (security)

Users are paying (availability, predictability)

Users are unrelated to each other (performance isolation)

Page 11: The Evolution of Big Data Frameworks

Classic MapReduce

Page 12: The Evolution of Big Data Frameworks

Classic MapReduce

Page 13: The Evolution of Big Data Frameworks

Classic MapReduce

Page 14: The Evolution of Big Data Frameworks

Classic Hadoop (1.0) Architecture

[Architecture diagram: the Client submits Job1 to the JobTracker (which hosts the Scheduler); the JobTracker drives the TaskTrackers, each running Map and Reduce tasks.]

The JobTracker handles resource management: it enforces global invariants (fairness/capacity) and determines who runs, with which resources, and where.

It also manages the MapReduce application flow: maps before reducers, re-run upon failure, etc.

Page 15: The Evolution of Big Data Frameworks

What are the key shortcomings of (old) Hadoop?

Hadoop 1.0 Shortcomings

Page 16: The Evolution of Big Data Frameworks

Programming model rigidity

JobTracker manages resources

JobTracker manages application workflow (data dependencies)

Performance and Availability

Map vs Reduce slots lead to low cluster utilization (~70%)

JobTracker had too much to do: scalability concern

JobTracker is a single point of failure

Hadoop 1.0 Shortcomings (similar to original MR)

Page 17: The Evolution of Big Data Frameworks

General purpose cluster OS “Cluster OS (YARN)”

Page 18: The Evolution of Big Data Frameworks

YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)*

Request-based central scheduler

Mesos (2011, UCB, open-sourced, tested at Twitter)*

Offer-based two level scheduler

Omega (2013, Google, simulation?)*

Shared-state-based scheduling

Corona (2013, Facebook, production)

YARN-like but offer-based

Four proposals

* all three starred systems won best-paper or best-student-paper awards

Page 19: The Evolution of Big Data Frameworks

YARN

[Diagram: Hadoop 1 World vs. Hadoop 2 World, layered as Users, Application Frameworks, Programming Model(s), Cluster OS (Resource Management), File System, Hardware. In the Hadoop 1 world, Hadoop 1.x (MapReduce) bundles MR v1 with resource management on HDFS 1, with Hive / Pig and ad-hoc apps on top. In the Hadoop 2 world, YARN provides the cluster OS on HDFS 2, and MR v2, Tez, Giraph, Storm, Dryad, REEF, ... run as application frameworks, again with Hive / Pig and ad-hoc apps above them.]

Page 20: The Evolution of Big Data Frameworks

A new architecture for Hadoop

Decouples resource management from programming model

(MapReduce is an “application” running on YARN)

YARN (or Hadoop 2.x)

Page 21: The Evolution of Big Data Frameworks

YARN (Hadoop 2) Architecture

[Architecture diagram: the Client submits Job1 to the Resource Manager (which hosts the Scheduler); NodeManagers on the worker nodes run the per-application App Master and its Tasks in containers.]

The App Master negotiates access to more resources (ResourceRequest); see the sketch below.
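For concreteness, a minimal sketch of this negotiation using YARN's AMRMClient; container sizes and counts are illustrative, and error handling, unregistration, and the NodeManager side are omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerNegotiation {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // register this AM with the RM

    // Ask for 10 containers of 2 GB / 1 vcore each, anywhere, at priority 0.
    Resource capability = Resource.newInstance(2048, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 10; i++) {
      rmClient.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat: the ResourceRequests ride on allocate(), and the RM answers
    // with whatever containers it has granted so far.
    AllocateResponse response = rmClient.allocate(0.1f);
    for (Container container : response.getAllocatedContainers()) {
      // launch a task in 'container' via NMClient (omitted)
    }
  }
}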

Page 22: The Evolution of Big Data Frameworks

Flexibility, Performance and Availability

Multiple Programming Models

Central components do less → scale better

Easier High-Availability (e.g., RM vs AM)

Why does this matter?

System       Jobs/Day      Tasks/Day    Cores pegged
Hadoop 1.0   77k           4M           3.2
YARN         125k (150k)   12M (15M)    6 (10)

Page 23: The Evolution of Big Data Frameworks

Anything else you can think of?

Page 24: The Evolution of Big Data Frameworks

Maintenance, Upgrade, and Experimentation

Run with multiple framework versions (at one time)

Trying out a new idea is as easy as launching a job

Anything else you can think of?

Page 25: The Evolution of Big Data Frameworks

Real-time OS for the cluster (?) “predictable resource allocation”

Page 26: The Evolution of Big Data Frameworks

YARN (Cosmos, Mesos, and Corona)

support instantaneous scheduling invariants (fairness/capacity)

maximize cluster throughput (with an eye to locality)

Current trends

New applications (requiring “gang” allocation and dependencies)

Consolidation of production/test clusters + Cloud

(SLA jobs mixed with best-effort jobs)

Motivation

Page 27: The Evolution of Big Data Frameworks

Job/Pipeline with SLAs: 200 CPU hours by 6am (e.g., Oozie)

Service: daily ebb/flows, reserve capacity accordingly (e.g., Samza)

Gang: I need 50 concurrent containers for 3 hours (e.g., Giraph)

Example Use Cases

Page 28: The Evolution of Big Data Frameworks

In a consolidated cluster:

Time-based SLAs for production jobs (completion deadline)

Good latency for best-effort jobs

High cluster utilization/throughput

(Support rich applications: gang and skylines)

High-Level Goals

Page 29: The Evolution of Big Data Frameworks

Decompose time-based SLAs into

resource definition: via RDL

predictable resource allocation: planning + scheduling

Divide and Conquer time-based SLAs

Page 30: The Evolution of Big Data Frameworks

Expose application needs to the planner

time: start (s), finish (f)

resources: capacity (w), total parallelism (h), minimum parallelism (l), min lease duration (t)

Resource Definition Language (RDL) 1/2

Page 31: The Evolution of Big Data Frameworks

Skylines / pipelines:

dependencies: among atomic allocations

(ALL, ANY, ORDER)

Resource Definition Language (RDL) 2/2
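As a concrete illustration, the Giraph gang use case from a few slides back ("50 concurrent containers for 3 hours") can be written down with these primitives. The sketch below uses the YARN reservation records associated with the YARN-1051 line of work mentioned later in this deck; treat the exact class and method signatures as an assumption to be checked against the real API.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;

public class GangReservationExample {

  public static ReservationDefinition giraphGang(long now) {
    // Atomic allocation: capability w = 4 GB / 1 core per container,
    // total parallelism h = 50, minimum parallelism l = 50 (gang semantics),
    // minimum lease duration t = 3 hours.
    ReservationRequest gang = ReservationRequest.newInstance(
        Resource.newInstance(4096, 1), // w: per-container capability
        50,                            // h: total containers
        50,                            // l: minimum concurrency (gang)
        3 * 3600 * 1000L);             // t: duration in ms

    // R_ALL: all atomic allocations must be satisfied; R_ANY and R_ORDER
    // express the other dependency types on the slide.
    ReservationRequests requests = ReservationRequests.newInstance(
        Collections.singletonList(gang), ReservationRequestInterpreter.R_ALL);

    // s = now, f = now + 6h: the window within which the gang must run;
    // the definition is then handed to the ReservationService for planning.
    return ReservationDefinition.newInstance(
        now, now + 6 * 3600 * 1000L, requests, "giraph-gang");
  }
}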

Page 32: The Evolution of Big Data Frameworks

Important classes

Framework semantics: Perforator modeling of Scope/Hive

Machine Learning: gang + bounded iterations (PREDict)

Periodic jobs: history-based resource definition

Coming up with RDL specs

prediction

Page 33: The Evolution of Big Data Frameworks

Planning vs Scheduling

[Diagram: a ReservationService sits alongside the Resource Manager. Resource Definitions feed Planning (coarse-grained but time-aware), governed by a Plan Sharing Policy; a Plan Follower translates the plan into dynamically sized queues for the Scheduler (fine-grained but time-oblivious), with Preemption and a system model / feedback loop closing the gap. Example queue hierarchy: Root 100% = Production 60% (J1 10%, J2 40%, J3 10%) + Staging 15% + Post 5% + Best Effort 20%.]

Page 34: The Evolution of Big Data Frameworks

Some example runs:

lots of queues for gridmix

Microsoft pipelines

Dynamic queues

Page 35: The Evolution of Big Data Frameworks

Improves

production job SLAs

best-effort jobs latency

cluster utilization and throughput

Comparing against Hadoop CapacityScheduler

Page 36: The Evolution of Big Data Frameworks

Under promise, over deliver

Plan for late execution, and run as early as you can

Greedy Agent

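A minimal sketch of the "plan for late execution" idea, under the assumption that the plan exposes free capacity per time step; this illustrates the greedy principle, not the actual agent implementation.

public final class GreedyLatePlacement {

  /**
   * Place a job's demand as LATE as its deadline allows, filling free plan
   * capacity backwards from the deadline ("under promise"); the scheduler
   * remains free to run the work earlier if the cluster is idle
   * ("over deliver").
   *
   * @param freeCapacity free plan capacity at each time step
   * @param demand       total capacity-steps the job needs
   * @param deadlineStep step (exclusive) by which the demand must be placed
   * @return per-step allocation, or null if the demand does not fit
   */
  public static double[] place(double[] freeCapacity, double demand, int deadlineStep) {
    double[] allocation = new double[freeCapacity.length];
    double remaining = demand;
    for (int t = deadlineStep - 1; t >= 0 && remaining > 0; t--) {
      double grab = Math.min(freeCapacity[t], remaining);
      allocation[t] = grab;
      remaining -= grab;
    }
    return remaining > 0 ? null : allocation;
  }

  private GreedyLatePlacement() { }
}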

Page 37: The Evolution of Big Data Frameworks

Coping with imperfections (system)

compensate RDL based on black-box models of overheads

Coping with Failures (system)

re-plan (move/kill allocations) in response to system-observable resource issues

Coping with Failures/Misprediction (user)

continue in best-effort mode when reservation expires

re-negotiate existing reservations

Dealing with “Reality”

Page 38: The Evolution of Big Data Frameworks

Sharing Policy: CapacityOverTimePolicy

constrains the instantaneous max and the running avg

e.g., no user can exceed an instantaneous 30% allocation, and an average of 10% over any 24h window

single partial scan of plan: O(|alloc| + |window|)

User Quotas (trading off flexibility for fairness)
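A minimal sketch of such a check, assuming the user's planned allocation is given as capacity per time step; this illustrates the single-pass, sliding-window idea behind the O(|alloc| + |window|) bound, not the actual CapacityOverTimePolicy code.

public final class CapacityOverTimeCheck {

  /**
   * Illustrative only: validate a user's planned allocation against an
   * instantaneous cap (e.g., 30%) and a running-average cap (e.g., 10%
   * over any 24h window), in a single pass with a sliding window sum.
   */
  public static boolean isValid(double[] userAllocPerStep,
                                double clusterCapacity,
                                double maxInstantFraction,
                                double maxAverageFraction,
                                int windowSteps) {
    double windowSum = 0.0;
    for (int t = 0; t < userAllocPerStep.length; t++) {
      double alloc = userAllocPerStep[t];
      // Instantaneous constraint: never exceed the cap at any single step.
      if (alloc > maxInstantFraction * clusterCapacity) {
        return false;
      }
      // Sliding-window average constraint.
      windowSum += alloc;
      if (t >= windowSteps) {
        windowSum -= userAllocPerStep[t - windowSteps];
      }
      int stepsInWindow = Math.min(t + 1, windowSteps);
      if (windowSum / stepsInWindow > maxAverageFraction * clusterCapacity) {
        return false;
      }
    }
    return true;
  }

  private CapacityOverTimeCheck() { }
}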

Page 39: The Evolution of Big Data Frameworks

Introduce Admission Control and Time-based SLAs (YARN-1051)

New ReservationService API (to reserve resources)

Agents + Plan + SharingPolicy to organize future allocations

Leverage underlying scheduler

Future Directions

Work with MSR-India on RDL estimates for Hive and MR

Advanced agents for placement ($$-based and optimal algos)

Enforcing decisions (Linux Containers, Drawbridge, Pacer)

Conclusion

Page 40: The Evolution of Big Data Frameworks

Cluster stdlib: REEF “factoring out recurring components”

Page 41: The Evolution of Big Data Frameworks

Dryad: DAG computations

Tez: DAG computations (focus on interactive queries and Hive support)

Storm: stream processing

Spark: interactive / in-memory / iterative

Giraph: graph processing, Bulk Synchronous Parallel (à la Pregel)

Impala: scalable, interactive, SQL-like queries

HoYA: HBase on YARN

Stratosphere: parallel iterative computations

REEF, Weave, Spring-Hadoop: meta-frameworks to help build apps

Focusing on YARN: many applications

Page 42: The Evolution of Big Data Frameworks

Lots of repeated work

Communication

Configuration

Data and Control Flow

Error handling / fault-tolerance

Common “better than Hadoop” tricks:

Avoid Scheduling overheads

Control Excessive disk IO

Are YARN/Mesos/Omega enough?

Page 43: The Evolution of Big Data Frameworks

The Challenge

YARN / HDFS

SQL / Hive … … Machine Learning

Fault Tolerance

Row/Column Storage

High Bandwidth Networking

Page 44: The Evolution of Big Data Frameworks

The Challenge

YARN / HDFS

SQL / Hive … … Machine Learning

Fault Awareness

Local data caching

Low Latency Networking

Page 45: The Evolution of Big Data Frameworks

SQL / Hive

The Challenge

YARN / HDFS

… … Machine Learning

Page 46: The Evolution of Big Data Frameworks

SQL / Hive

REEF in the Stack

YARN / HDFS

… … Machine Learning

REEF

Page 47: The Evolution of Big Data Frameworks

REEF in the Stack (Future)

YARN / HDFS

SQL / Hive … … Machine Learning

REEF

Operator API and Library

Logical Abstraction

Page 48: The Evolution of Big Data Frameworks

REEF

[Same YARN architecture diagram as before: Client, Resource Manager with Scheduler, NodeManagers running the App Master and Tasks, with the App Master negotiating access to more resources (ResourceRequest).]

Page 49: The Evolution of Big Data Frameworks

REEF

[Architecture diagram: on top of YARN (Client, Resource Manager with Scheduler, NodeManagers), the REEF runtime (REEF RT) runs a Driver on the Application Master, holding the user's control-flow logic: it retains state, uses event-based control flow, injection-based checkable configuration, and fault detection. Evaluators on the NodeManagers each host a Task (the user's data-crunching logic) plus services, reachable via name-based communication.]

Page 50: The Evolution of Big Data Frameworks

REEF: Computation and Data Management

[Diagram: an extensible control flow layered over data-management services: storage, network, and state management.]

Job Driver: control-plane implementation. User code executed on YARN's Application Master.

Activity: user code executed within an Evaluator.

Evaluator: execution environment for Activities. One Evaluator is bound to one YARN Container.
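The three concepts above map naturally onto a small set of interfaces. The following is an illustrative sketch only: hypothetical Java interfaces that mirror the Driver / Activity / Evaluator division of labor, not REEF's actual API.

// Hypothetical interfaces, for illustration of the REEF roles only.

/** Control-plane logic; runs on YARN's Application Master and retains state. */
interface JobDriver {
  /** React to a newly allocated Evaluator: configure and submit an Activity. */
  void onEvaluatorAllocated(EvaluatorHandle evaluator);
  /** Centralized error handling: all Activity failures are forwarded here. */
  void onActivityFailed(String activityId, Throwable cause);
}

/** User data-crunching logic; executed inside an Evaluator. */
interface Activity {
  byte[] run(byte[] input) throws Exception;
}

/** Handle to an execution environment bound to one YARN container. */
interface EvaluatorHandle {
  void submit(Class<? extends Activity> activity, byte[] configuration);
  void close();
}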

Page 51: The Evolution of Big Data Frameworks

Control flow is centralized in the Driver: Evaluator and Task configuration and launch.

Error handling is centralized in the Driver: all exceptions are forwarded to the Driver.

All APIs are asynchronous.

Support for: caching / checkpointing / group communication.

Example apps running on REEF: MR, async PageRank, ML regressions, PCA, distributed shell, ...

REEF Summary

(Open-sourced with Apache License)

Page 52: The Evolution of Big Data Frameworks

Big-Data Systems

Ongoing focus

Future work

Leverage high-level app semantics

Coordinate tiered storage and scheduling

Conclusions

Page 53: The Evolution of Big Data Frameworks

Adding Preemption to YARN, and open-sourcing it to Apache

Page 54: The Evolution of Big Data Frameworks

Limited mechanisms to “revise current schedule”

Patience

Container killing

To enforce global properties

Leave resources fallow (e.g., CapacityScheduler) → low utilization

Kill containers (e.g., FairScheduler) → wasted work

(Old) new trick

Support work-preserving preemption

(via) checkpointing → more than preemption

State of the Art

Page 55: The Evolution of Big Data Frameworks

Changes throughout YARN

[Architecture diagram: Client, RM with Scheduler, NodeManagers running the App Master and Tasks; the preemption changes touch all of these components.]

PreemptionMessage {
  Strict   { Set<ContainerID> }
  Flexible { Set<ResourceRequest>, Set<ContainerID> }
}

Collaborative application: policy-based binding for Flexible preemption requests.

Use of preemption. Context: outdated information, delayed effects of actions, multi-actor orchestration.

Interesting type of preemption: the RM issues a declarative request; the AM binds it to specific containers (see the sketch below).
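On the AM side, the PreemptionMessage arrives piggy-backed on the allocate() heartbeat response. A minimal sketch of how an ApplicationMaster might react, using the YARN protocol records named above; the checkpoint-and-release step is application-specific and only hinted at here.

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionContract;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.api.records.StrictPreemptionContract;

public class PreemptionHandler {

  /** Inspect the PreemptionMessage carried by an allocate() response. */
  public void handle(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null) {
      return; // nothing to preempt this heartbeat
    }
    StrictPreemptionContract strict = msg.getStrictContract();
    if (strict != null) {
      for (PreemptionContainer c : strict.getContainers()) {
        // These containers WILL be reclaimed: checkpoint their state now.
        checkpointAndRelease(c);
      }
    }
    PreemptionContract flexible = msg.getContract();
    if (flexible != null) {
      // Declarative request: the RM asks for resources back and the AM
      // chooses (policy-based) which of its containers to give up.
      for (PreemptionContainer c : flexible.getContainers()) {
        checkpointAndRelease(c);
      }
    }
  }

  private void checkpointAndRelease(PreemptionContainer c) {
    // Application-specific: save task state via the Checkpoint Service
    // (next slide), then release the container back to the ResourceManager.
  }
}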

Page 56: The Evolution of Big Data Frameworks

Changes throughout YARN

[Same architecture diagram, with the MapReduce AM (MR AM) and its Tasks highlighted as the parties that react to preemption.]

When can I preempt? Tag safe UDFs or user-saved state:

@Preemptable
public class MyReducer { … }

Common Checkpoint Service:

WriteChannel cwc = cs.create();     // open a checkpoint write channel
cwc.write(…state…);                 // save the task's state
CheckpointID cid = cs.commit(cwc);  // commit and obtain a checkpoint id
ReadChannel crc = cs.open(cid);     // reopen it later to resume

Page 57: The Evolution of Big Data Frameworks


CapacityScheduler + Unreservation + Preemption: memory utilization

CapacityScheduler (allow overcapacity)

CapacityScheduler (no overcapacity)

Page 58: The Evolution of Big Data Frameworks

[Architecture diagram annotated with the JIRAs touched by this work: MR-5176, MR-5189, MR-5192, MR-5194, MR-5196, MR-5197, and YARN-569, spread across the Client, RM/Scheduler, NodeManagers, App Master, and Tasks.]

(Metapoint) Experience contributing to Apache

Engaging with OSS: talk with active developers; show early/partial work; keep patches small; it is ok to leave things unfinished.

Page 59: The Evolution of Big Data Frameworks

With @Preemptable

tag imperative code with a semantic property

Generalize this trick

expose semantic properties to the platform (@PreserveSortOrder)

allow platforms to optimize execution (map-reduce pipelining)

REEF seems the logical place to do this.

Tagging UDFs
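A minimal sketch of what such a tag could look like: a hypothetical @PreserveSortOrder annotation (the name comes from the slide; the annotation and the platform check below are illustrative, not an existing API).

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical semantic-property tag: the UDF promises not to disturb sort order. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PreserveSortOrder { }

/** Example UDF carrying the tag; the imperative code itself is unchanged. */
@PreserveSortOrder
class PassThroughReducer {
  // ... reducer logic ...
}

class PlatformOptimizer {
  /** The platform checks the tag before enabling map-reduce pipelining. */
  static boolean canPipeline(Class<?> udf) {
    return udf.isAnnotationPresent(PreserveSortOrder.class);
  }
}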

Page 60: The Evolution of Big Data Frameworks

(Basic) Building block for:

Efficient preemption

Dynamic Optimizations (task splitting, efficiency improvements)

Fault Tolerance

Other uses for Checkpointing

Page 61: The Evolution of Big Data Frameworks
