
September SF Hadoop User Group 2013


YARN talk, post-beta, given at the September 2013 SF Hadoop User Group Meetup. Meetup link: http://www.meetup.com/hadoopsf/events/136499862/
Transcript
Page 1: September SF Hadoop User Group 2013

Hadoop YARN
SF Hadoop Users Meetup

Vinod Kumar Vavilapalli

vinodkv [at] { apache dot org | hortonworks dot com }

@tshooter

Page 2: September SF Hadoop User Group 2013

Myself

• 6.25 Hadoop-years old
• Previously at Yahoo!, @Hortonworks now
• Last thing at college – a two-node Tomcat cluster. Three months later, first thing at the job, brought down an 800-node cluster ;)

• Hadoop YARN lead. Apache Hadoop PMC, Apache Member

• MapReduce, HadoopOnDemand, CapacityScheduler, Hadoop security

• Ambari / Stinger / random troubleshooting

Page 3: September SF Hadoop User Group 2013

YARN: A new abstraction layer

HADOOP 1.0 – Single-use system (batch apps)

  MapReduce (cluster resource management & data processing)
  HDFS (redundant, reliable storage)

HADOOP 2.0 – Multi-purpose platform (batch, interactive, online, streaming, …)

  MapReduce (data processing)   Others (data processing)
  YARN (cluster resource management)
  HDFS2 (redundant, reliable storage)

Page 4: September SF Hadoop User Group 2013

Concepts

[Layer diagram]
  Jobs – Job #1, Job #2
  Applications & Frameworks – MRv2, Tez
  Platform – YARN, HDFS

Page 5: September SF Hadoop User Group 2013

Concepts

• Platform
• Framework
• Application
  – Application is a job submitted to the framework
  – Example – a MapReduce job
• Container (sketched below)
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
  – container_0 = 2GB, 1 CPU
  – container_1 = 1GB, 6 CPU
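A container's capability is just a small resource vector. As a minimal sketch, here are the two example containers above expressed with the Hadoop 2.x records API (memory in MB plus virtual cores, the two dimensions that API exposes; the class name is made up):

    import org.apache.hadoop.yarn.api.records.Resource;

    public class ContainerSizes {
        public static void main(String[] args) {
            // container_0 = 2GB, 1 CPU and container_1 = 1GB, 6 CPU from the slide
            Resource container0 = Resource.newInstance(2048, 1); // MB, vcores
            Resource container1 = Resource.newInstance(1024, 6);
            System.out.println(container0 + " and " + container1);
        }
    }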

Page 6: September SF Hadoop User Group 2013

Architecture

Page 7: September SF Hadoop User Group 2013

Hadoop MapReduce Classic

• JobTracker

–Manages cluster resources and job scheduling

• TaskTracker

–Per-node agent

–Manages tasks

Page 8: September SF Hadoop User Group 2013

Current Limitations

• Scalability
  – Maximum cluster size – 4,000 nodes
  – Maximum concurrent tasks – 40,000
  – Coarse synchronization in JobTracker

• Single point of failure
  – Failure kills all queued and running jobs
  – Jobs need to be re-submitted by users

• Restart is very tricky due to complex state

Page 9: September SF Hadoop User Group 2013

Current Limitations contd.

• Hard partition of resources into map and reduce slots
  – Low resource utilization

• Lacks support for alternate paradigms
  – Iterative applications implemented using MapReduce are 10x slower
  – Hacks for the likes of MPI / graph processing

• Lack of wire-compatible protocols
  – Client and cluster must be of same version
  – Applications and workflows cannot migrate to different clusters

Page 10: September SF Hadoop User Group 2013

Requirements

• Reliability

• Availability

• Utilization

• Wire Compatibility

• Agility & Evolution – Ability for customers to control upgrades to the grid software stack
• Scalability – Clusters of 6,000-10,000 machines
  – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
  – 100,000+ concurrent tasks
  – 10,000 concurrent jobs

Page 11: September SF Hadoop User Group 2013

Architecture: Philosophy

• General-purpose, distributed application framework
  – Cannot scale monolithic masters. Or monsters?
  – Distribute responsibilities

• ResourceManager – central scheduler
  – Only resource arbitration
  – No failure handling
  – Provides necessary information to AMs

• Push every possible responsibility to ApplicationMaster(s)
  – Don't trust ApplicationMaster(s)
  – User-land library!

Page 12: September SF Hadoop User Group 2013

Architecture

• Resource Manager
  – Global resource scheduler
  – Hierarchical queues

• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring

• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – E.g. the MapReduce Application Master

Page 13: September SF Hadoop User Group 2013

YARN Architecture

Page 14: September SF Hadoop User Group 2013

Apache Hadoop MapReduce on YARN

[Diagram] Two MapReduce jobs sharing a YARN cluster: a central ResourceManager (containing the Scheduler), a NodeManager on every node, MR AM 1 managing containers map 1.1, map 1.2 and reduce 1.1, and MR AM 2 managing containers map 2.1, map 2.2, reduce 2.1 and reduce 2.2, all spread across the NodeManagers.

Page 15: September SF Hadoop User Group 2013

Global Scheduler (ResourceManager)

• Resource arbitration
• Multiple resource dimensions
  – <priority, data-locality, memory, cpu, …>
• In-built support for data-locality (see the sketch below)
  – Node, rack etc.
  – Unique to YARN
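As an aside on the locality point: with the AMRMClient library (covered later in this deck), an ApplicationMaster can state that preference directly in a container request. A minimal sketch, with made-up host and rack names:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class LocalityRequestSketch {
        public static void main(String[] args) {
            // Prefer the node holding the data, then its rack, then anywhere.
            ContainerRequest nearTheData = new ContainerRequest(
                    Resource.newInstance(1024, 1),        // memory (MB), vcores
                    new String[] {"node17.example.com"},  // preferred nodes (hypothetical)
                    new String[] {"/rack4"},              // preferred racks (hypothetical)
                    Priority.newInstance(1));
            System.out.println(nearTheData);
        }
    }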

Page 16: September SF Hadoop User Group 2013

Scheduler Concepts

• Input from AM(s) is a dynamic list of ResourceRequests (see the sketch after this list)
  – <resource-name, resource-capability>
  – Resource name: hostname / rackname / any
  – Resource capability: (memory, cpu, …)
  – Essentially an inverted <name, capability> request map from AM to RM
  – No notion of tasks!

• Output – Container
  – Resource(s) granted on a specific machine
  – Verifiable allocation: via Container Tokens
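As a rough illustration of what one entry in that ResourceRequest list looks like through the Hadoop 2.x records API (the priority, size and count below are made up for the example):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceRequest;

    public class RequestSketch {
        public static void main(String[] args) {
            // <resource-name, resource-capability>: ask for 10 containers of
            // 1 GB / 1 vcore anywhere in the cluster ("*" = any host or rack).
            ResourceRequest anywhere = ResourceRequest.newInstance(
                    Priority.newInstance(1),        // priority
                    ResourceRequest.ANY,            // resource name: host / rack / "*"
                    Resource.newInstance(1024, 1),  // capability: memory (MB), cpu
                    10);                            // number of such containers
            System.out.println(anywhere);
        }
    }

Note that nothing in the request names a task; mapping granted containers back to its own tasks is the AM's job.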

Page 17: September SF Hadoop User Group 2013

Fault tolerance

• Task/container failures
  – Application Masters should take care of these; it's their business

• Node failures
  – ResourceManager marks the node as failed and informs all the apps / Application Masters. AMs can choose to ignore the failure or rerun the work, depending on what they want.

• Application Master failures
  – ResourceManager restarts AMs that have failed
  – One application can have multiple ApplicationAttempts
  – Every ApplicationAttempt should store state, so that the next ApplicationAttempt can recover from failure

• ResourceManager failures
  – ResourceManager saves state; can do host/IP failover today
  – Recovers state, but kills all current work as of now
  – Work-preserving restart
  – HA

Page 18: September SF Hadoop User Group 2013

Writing your own apps

Page 19: September SF Hadoop User Group 2013

Application Master

• Dynamically allocated per-application on startup
• Responsible for individual application scheduling and life-cycle management

• Request and obtain containers for its tasks
  – Do a second-level schedule, i.e. containers to component tasks
  – Start/stop containers on NodeManagers

• Handle all task/container errors
• Obtain resource hints/meta-information from RM for better scheduling
  – Peek-ahead into resource availability
  – Faulty resources (node, rack etc.)

Page 20: September SF Hadoop User Group 2013

Writing Custom Applications

• Grand total of 3 protocols (a client-side submission sketch follows this list)

• ApplicationClientProtocol
  – Application-launching program
  – submitApplication

• ApplicationMasterProtocol
  – Protocol between AM & RM for resource allocation
  – registerApplication / allocate / finishApplication

• ContainerManagementProtocol
  – Protocol between AM & NM for container start/stop
  – startContainer / stopContainer
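To make the ApplicationClientProtocol leg concrete, here is a minimal sketch of a launching program going through the YarnClient library; the application name, AM command line and 512 MB / 1 vcore sizing are made up, and a real client would also ship the AM jar and environment as local resources:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitSketch {
        public static void main(String[] args) throws Exception {
            // Client side of ApplicationClientProtocol, wrapped by YarnClient.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the RM for a new application and fill in the submission context.
            ApplicationSubmissionContext appContext =
                    yarnClient.createApplication().getApplicationSubmissionContext();
            appContext.setApplicationName("hello-yarn"); // illustrative name

            // Container that will run our ApplicationMaster (command is hypothetical).
            ContainerLaunchContext amContainer =
                    Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(
                    Collections.singletonList("java -Xmx256m my.example.MyAppMaster"));
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(512, 1)); // AM container size

            // submitApplication - the call the slide highlights.
            yarnClient.submitApplication(appContext);
        }
    }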

Page 21: September SF Hadoop User Group 2013

Other things to take care of

• Container/tasks
• Client
• UI
• Recovery
• Container -> AM communication
• Application History

Page 22: September SF Hadoop User Group 2013

Libraries for app/framework writers

• YarnClient, AMRMClient, NMClient (an AM-side sketch follows this list)
• More projects:
  – Higher-level APIs
  – Weave, REEF
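As a rough sketch of the AMRMClient / NMClient pair at work, a stripped-down ApplicationMaster loop might look like the code below: one container request, no error handling, a stand-in "sleep 10" task command, and empty host/tracking-URL registration values, all assumptions made for the example:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class AppMasterSketch {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // ApplicationMasterProtocol, wrapped by AMRMClient.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, ""); // host, port, tracking URL

            // ContainerManagementProtocol, wrapped by NMClient.
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(conf);
            nmClient.start();

            // Ask for one 1 GB / 1 vcore container anywhere in the cluster.
            rmClient.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

            // Heartbeat until the RM grants the container, then launch a task in it.
            boolean launched = false;
            while (!launched) {
                AllocateResponse response = rmClient.allocate(0.1f);
                for (Container container : response.getAllocatedContainers()) {
                    ContainerLaunchContext ctx =
                            Records.newRecord(ContainerLaunchContext.class);
                    ctx.setCommands(Collections.singletonList("sleep 10")); // stand-in task
                    nmClient.startContainer(container, ctx);
                    launched = true;
                }
                Thread.sleep(1000);
            }

            rmClient.unregisterApplicationMaster(
                    FinalApplicationStatus.SUCCEEDED, "done", "");
        }
    }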

Page 23: September SF Hadoop User Group 2013

Other goodies

• Rolling upgrades
• Multiple versions of MR at the same time
• Same scheduling algorithms – Capacity, fairness
• Secure from start
• Locality for generic apps
• Log aggregation
• Everything on the same cluster

Page 24: September SF Hadoop User Group 2013

Existing applications

Page 25: September SF Hadoop User Group 2013

Compatibility with Apache Hadoop 1.x

• org.apache.hadoop.mapred
  – Add 1 property to your existing mapred-site.xml (snippet after this list)
  – mapreduce.framework.name = yarn
  – Continue submitting using bin/hadoop – nothing else, just run your MapReduce jobs!

• org.apache.hadoop.mapreduce
  – Generally run without changes, with a recompile, or with minor updates
  – If your existing apps fail, recompile against the new MRv2 jars

• Pig
  – Scripts built on Pig 10.1+ run without changes

• Hive
  – Queries built on Hive 10.0+ run without changes

• Streaming, Pipes, Oozie, Sqoop ….
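For reference, the one-property change called out above, as it would appear in mapred-site.xml:

    <!-- mapred-site.xml: route MapReduce job submission to YARN (MRv2) -->
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>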

Page 26: September SF Hadoop User Group 2013

Any Performance Gains?

• Significant gains across the board!

• MapReduce
  – Lots of runtime improvements
  – Map side, reduce side
  – Better shuffle

• Much better throughput
• Yahoo! can run many more jobs on fewer nodes in less time

More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/

Page 27: September SF Hadoop User Group 2013

Testing?

• Testing, *lots* of it
• Benchmarks: blog post soon
• Integration testing / full-stack
  – HBase
  – Pig
  – Hive
  – Oozie
  – …
• Functional tests

Page 28: September SF Hadoop User Group 2013

Deployment

• Beta last month
  – "Beta" is a misnomer: 10s of PB of storage already run on 0.23, a previous state of YARN before 2.0
  – Significantly wide variety of applications and load

• GA
  – Very soon, less than a month away
  – Bugs: blockers only now

Page 29: September SF Hadoop User Group 2013

How do I get it?

Page 30: September SF Hadoop User Group 2013

YARN beta releases

• Apache Hadoop Core 2.1.0-beta
  – Official beta release from Apache
  – YARN APIs are stable
  – Backwards compatible with MapReduce 1 jobs
  – Blocker bugs have been resolved

• Features in HDP 2.0 Beta
  – Apache Ambari deploys YARN and MapReduce 2
  – Capacity Scheduler for YARN
  – Full stack tested

Page 31: September SF Hadoop User Group 2013

Future

Page 32: September SF Hadoop User Group 2013

Looking ahead

• YARN improvements
• Alternate programming models: Apache Tez, Storm
• Long(er)-running services (e.g. HBase): Hoya
• ResourceManager HA
• Work-preserving restart of ResourceManager
• Reconnect running containers to AMs
• Gang scheduling
• Multi-dimensional resources: CPU is in; disk (capacity, IOPS), network?

Page 33: September SF Hadoop User Group 2013

Ecosystem

• Spark (UCB) on YARN
• Real-time data processing
  – Storm (Twitter) on YARN
• Graph processing – Apache Giraph on YARN
• OpenMPI on YARN?
• PaaS on YARN?
• Yarnify: *. on YARN

Page 34: September SF Hadoop User Group 2013

Questions & Answers

TRY: download at hortonworks.com

LEARN: Hortonworks University

FOLLOW: twitter @hortonworks
Facebook: facebook.com/hortonworks

MORE EVENTS: hortonworks.com/events

Further questions & comments: [email protected]