Jul 09, 2020


Extending Hadoop beyond MapReduce

Page 1

Mahadev Konar

Co-Founder

@mahadevkonar (@hortonworks)

Bio

•  Apache Hadoop since 2006 - committer and PMC member

–  Developed and supported MapReduce @ Yahoo!

–  Core member of the design and development team for MR Next Gen

•  Apache ZooKeeper since 2008 – committer and current PMC chair

–  Led Apache ZooKeeper development and support @ Yahoo!

•  Co-founder @hortonworks

Agenda

• Overview

• Current Limitations and Requirements

• Architectures

• Improvements and Updates

• Q&A

Hadoop MapReduce Classic

•  JobTracker

–  Manages cluster resources and job scheduling

•  TaskTracker

–  Per-node agent

–  Manage tasks

Current Limitations

•  Hard partition of resources into map and reduce slots

–  Low resource utilization

•  Lacks support for alternate paradigms

–  Iterative applications implemented using MapReduce are 10x slower

–  Hacks for the likes of MPI/Graph Processing

•  Lack of wire-compatible protocols

–  Client and cluster must be of same version

–  Applications and workflows cannot migrate to different clusters
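
The cost of the hard map/reduce slot partition can be illustrated with a small calculation. This is only a sketch: the slot counts and task demand are made-up numbers, and `slot_utilization` and `pooled_utilization` are hypothetical helpers, not Hadoop APIs.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots busy when map and reduce slots cannot be swapped."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_slots, map_tasks, reduce_tasks):
    """Fraction of capacity busy when any task may use any free unit."""
    busy = min(total_slots, map_tasks + reduce_tasks)
    return busy / total_slots


# Hypothetical node: 8 map slots and 4 reduce slots, during a map-heavy phase.
fixed = slot_utilization(map_slots=8, reduce_slots=4, map_tasks=20, reduce_tasks=0)
pooled = pooled_utilization(total_slots=12, map_tasks=20, reduce_tasks=0)
print(f"fixed slots: {fixed:.0%}, pooled: {pooled:.0%}")
```

With these assumed numbers the reduce slots sit idle while map tasks queue, so a third of the node is wasted under the fixed split.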

Current Limitations

•  Utilization

•  Scalability

–  Maximum Cluster size – 4,000 nodes

–  Maximum concurrent tasks – 40,000

–  Coarse synchronization in JobTracker

•  Single point of failure

–  Failure kills all queued and running jobs

–  Jobs need to be re-submitted by users

•  Restart is very tricky due to complex state

Requirements

•  Reliability

•  Availability

•  Utilization

•  Wire Compatibility

•  Agility & Evolution – Ability for customers to control upgrades to the grid software stack.

•  Scalability - Clusters of 6,000-10,000 machines

–  Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks

–  100,000+ concurrent tasks

–  10,000 concurrent jobs

Design Centre

•  Split up the two major functions of JobTracker

–  Cluster resource management

–  Application life-cycle management

•  MapReduce becomes user-land library

Architecture

•  Application

–  Application is a job submitted to the framework

–  Example – Map Reduce Job

•  Container

–  Basic unit of allocation

–  Example – container A = 2GB, 1CPU

–  Replaces the fixed map/reduce slots
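
A minimal sketch of the container idea, assuming a container is described only by memory and CPU as in the "2GB, 1CPU" example. The `Container` class and `max_containers` helper are hypothetical names for illustration, not the framework's API; the node size is borrowed from the Requirements slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """A resource grant; the fields mirror the slide's '2GB, 1CPU' example."""
    memory_mb: int
    vcores: int


def max_containers(node_memory_mb, node_cores, container):
    """How many identical containers fit on one node (the scarcer resource wins)."""
    by_memory = node_memory_mb // container.memory_mb
    by_cpu = node_cores // container.vcores
    return min(by_memory, by_cpu)


container_a = Container(memory_mb=2048, vcores=1)
# Node size taken from the Requirements slide: 16 cores, 48 GB RAM.
print(max_containers(48 * 1024, 16, container_a))
```

Unlike a fixed slot count, the number of containers per node here follows from the request size, so a job asking for larger containers simply gets fewer of them.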

Architecture

•  Resource Manager

–  Global resource scheduler

–  Hierarchical queues

•  Node Manager

–  Per-machine agent

–  Manages the life-cycle of container

–  Container resource monitoring

•  Application Master

–  Per-application

–  Manages application scheduling and task execution

–  E.g. MapReduce Application Master
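
The division of labour between the three roles can be sketched as a toy model. Everything below is an assumption made for illustration: the class and method names are hypothetical, scheduling is first-fit on memory only, and no real protocol or queue hierarchy is modelled.

```python
class NodeManager:
    """Per-machine agent: tracks free memory and launches containers."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.running = []

    def launch(self, label, memory_mb):
        self.free_mb -= memory_mb
        self.running.append(label)


class ResourceManager:
    """Global scheduler: places containers on nodes with enough free memory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(label, memory_mb)
                return node.name
        return None  # nothing free right now; the caller must retry later


class ApplicationMaster:
    """Per-application: asks the ResourceManager for one container per task."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, num_tasks, memory_mb):
        return [self.rm.allocate(f"{self.app_id}-task{i}", memory_mb)
                for i in range(num_tasks)]


def submit(rm, app_id, num_tasks, am_memory_mb=1024, task_memory_mb=2048):
    """Client-side view: the RM first places the AM, then the AM places tasks."""
    am_node = rm.allocate(f"{app_id}-AM", am_memory_mb)
    if am_node is None:
        return None, []
    am = ApplicationMaster(app_id, rm)
    return am_node, am.run_tasks(num_tasks, task_memory_mb)


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
am_node, task_nodes = submit(rm, "job1", num_tasks=3)
print(am_node, task_nodes)
```

The point of the sketch is that the ResourceManager only hands out resources; which task runs where, and in what order, is decided by the per-application master.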

Architecture

[Figure: architecture diagram, built up over three animation steps. Components: Client, ResourceManager, NodeManager, App Mstr, Container. Labeled flows: Job Submission, Node Status, Resource Request, MapReduce Status.]

Architecture – Resource Manager

[Figure: Resource Manager containing three components: Applications Manager, Scheduler, Resource Tracker.]

•  Applications Manager

–  Responsible for launching and monitoring Application Masters (per Application process)

–  Restarts an Application Master on failure

•  Scheduler

–  Responsible for allocating resources to the Application

•  Resource Tracker

–  Responsible for managing the nodes in the cluster
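
The "restarts an Application Master on failure" behaviour can be sketched as a small retry policy. This is an assumption-laden toy: the `ApplicationsManager` class below is hypothetical, and the `max_attempts` limit is an invented knob that does not come from the slides.

```python
class ApplicationsManager:
    """Toy monitor that relaunches a failed Application Master."""

    def __init__(self, max_attempts=2):
        # max_attempts is an assumed policy knob, not taken from the slides.
        self.max_attempts = max_attempts
        self.attempts = {}

    def launch(self, app_id):
        """Record one more launch attempt for this application."""
        self.attempts[app_id] = self.attempts.get(app_id, 0) + 1
        return self.attempts[app_id]

    def on_am_failure(self, app_id):
        """Relaunch the AM if attempts remain; otherwise report the app as failed."""
        if self.attempts.get(app_id, 0) < self.max_attempts:
            self.launch(app_id)
            return "RESTARTED"
        return "FAILED"


mgr = ApplicationsManager(max_attempts=2)
mgr.launch("app_1")
print(mgr.on_am_failure("app_1"))  # first failure
print(mgr.on_am_failure("app_1"))  # second failure
```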

Improvements vis-à-vis classic MapReduce

•  Utilization

–  Generic resource model

•  Memory

•  CPU

•  Disk b/w

•  Network b/w

–  Remove fixed partition of map and reduce slots

•  Scalability

–  Application life-cycle management is very expensive

–  Partition resource management and application life-cycle management

–  Application management is distributed

–  Hardware trends - Currently run clusters of 4,000 machines

•  6,000 2012 machines > 12,000 2009 machines

•  <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB>
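
The hardware comparison can be checked with simple arithmetic. The sketch below uses the per-machine figures quoted above, taking the lower bound for the 2012 machines (16 cores, 48G, 24TB); `cluster_totals` is a hypothetical helper.

```python
# Per-machine specs quoted on the slide (lower bounds for the 2012 machines).
machine_2012 = {"cores": 16, "ram_gb": 48, "disk_tb": 24}
machine_2009 = {"cores": 8, "ram_gb": 16, "disk_tb": 4}


def cluster_totals(machines, spec):
    """Aggregate capacity of a cluster of identical machines."""
    return {resource: machines * amount for resource, amount in spec.items()}


totals_2012 = cluster_totals(6000, machine_2012)
totals_2009 = cluster_totals(12000, machine_2009)
for resource in machine_2012:
    print(resource, totals_2012[resource], totals_2009[resource])
```

At the lower bound the core counts tie (96,000 each), while RAM (288,000 GB against 192,000 GB) and disk (144,000 TB against 48,000 TB) already favour the 6,000-machine cluster; any machine above 16 cores tips the core count as well.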

•  Fault Tolerance and Availability

–  Resource Manager

•  No single point of failure – state saved in ZooKeeper (coming soon)

•  Application Masters are restarted automatically on RM restart

–  Application Master

•  Optional failover via application-specific checkpoint

•  MapReduce applications pick up where they left off via state saved in HDFS

•  Wire Compatibility

–  Protocols are wire-compatible

–  Old clients can talk to new servers

–  Rolling upgrades
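
The "pick up where they left off" recovery can be sketched with a checkpoint file. This is a simplified illustration, not the MapReduce Application Master's actual recovery code: a local JSON file stands in for state saved in HDFS, and `run_job` and its `fail_after` parameter are hypothetical.

```python
import json
import os
import tempfile


def run_job(task_ids, checkpoint_path, fail_after=None):
    """Run tasks in order, recording each finished task in a checkpoint file.

    A local JSON file stands in for the state the slides say is saved in HDFS.
    fail_after simulates an Application Master crash after that many tasks.
    Returns the list of tasks executed during this attempt.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for task in task_ids:
        if task in done:
            continue  # finished in an earlier attempt; do not redo it
        if fail_after is not None and len(executed) == fail_after:
            return executed  # simulated crash; the checkpoint survives
        executed.append(task)
        done.append(task)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return executed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "job_state.json")
    first = run_job(["m0", "m1", "m2", "r0"], path, fail_after=2)
    second = run_job(["m0", "m1", "m2", "r0"], path)
    print(first, second)
```

The second attempt skips the two tasks recorded before the simulated crash and runs only the remaining ones.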

•  Innovation and Agility

–  MapReduce now becomes a user-land library

–  Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)

•  Faster deployment cycles for improvements

–  Customers upgrade MapReduce versions on their schedule

–  Users can customize MapReduce

•  Support for programming paradigms other than MapReduce

–  MPI

–  Master-Worker

–  Machine Learning

–  Iterative processing

–  Enabled by allowing the use of paradigm-specific application master

–  Run all on the same Hadoop cluster

Is it released?

•  Available in 0.23.1 release

•  Coming soon: 0.23.2 release

Any Performance Gains?

• 2x+ across the board

• MapReduce

– Unlock lots of improvements from Terasort record (Owen/Arun, 2009)

– Shuffle 30%+

– Small Jobs – Uber AM

– Re-use task slots (container reuse)
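
The "Uber AM" item refers to running a small job's tasks inside the Application Master itself rather than requesting a separate container per task. A rough sketch of that decision follows; the function names are hypothetical and the size thresholds are illustrative assumptions, not values from the slides.

```python
def should_run_uber(num_maps, num_reduces, max_maps=9, max_reduces=1):
    """Decide whether a job is small enough to run inside the AM itself.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    return num_maps <= max_maps and num_reduces <= max_reduces


def containers_needed(num_maps, num_reduces):
    """One container for the AM, plus one per task unless the job runs 'uber'."""
    if should_run_uber(num_maps, num_reduces):
        return 1
    return 1 + num_maps + num_reduces


print(containers_needed(4, 1))     # small job
print(containers_needed(200, 10))  # large job
```

For a small job the saving is the scheduling round trips and container launches that would otherwise dominate its runtime.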

More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/

Testing?

• Testing, *lots* of it

• Benchmarks (every release should be at least as good as the last one)

• Integration testing

– HBase

– Pig

– Hive

– Oozie

• Functional tests

– Nightly

– Over 1,000 functional tests for Map-Reduce alone

– Several hundred for Pig/Hive etc.

• Deployment discipline

Benchmarks

• Benchmark every part of the HDFS & MR pipeline

– HDFS read/write throughput

– NN operations

– Scan, Shuffle, Sort

• GridMix v3

– Run production traces in test clusters

– Thousands of jobs

– Stress mode v/s Replay mode
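
The difference between the two modes can be sketched as two ways of turning a production trace into a submission plan. This is a simplification under stated assumptions: replay is taken to mean keeping the trace's original submit offsets, stress is modelled as fixed-size waves that ignore trace timing, and both function names and the sample trace are hypothetical.

```python
def replay_schedule(trace):
    """Replay mode: keep each job's original submit offset from the trace."""
    start = min(t for t, _ in trace)
    return [(t - start, job) for t, job in sorted(trace)]


def stress_schedule(trace, max_pending):
    """Stress mode: ignore trace timing and submit in waves of max_pending jobs."""
    jobs = [job for _, job in sorted(trace)]
    return [jobs[i:i + max_pending] for i in range(0, len(jobs), max_pending)]


# Hypothetical trace: (submit time in seconds, job name).
trace = [(100, "jobA"), (130, "jobB"), (400, "jobC")]
print(replay_schedule(trace))
print(stress_schedule(trace, max_pending=2))
```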

Deployment

• Alpha/Test (early UAT) in November 2011

– Small scale (500-800 nodes)

• Alpha in February 2012

– Majority of users

– ~1000 nodes per cluster, > 2,000 nodes in all

• Beta

– Misnomer: 10s of PB of storage

– Significantly wide variety of applications and load

–  4000+ nodes per cluster, > 15000 nodes in all

– Q2, 2012

• Production

– Well, it’s production

– Mid-to-late Q2 2012

Questions?

hadoop-0.23.1 (alpha release):

http://hadoop.apache.org/common/releases.html

Release Documentation:

http://hadoop.apache.org/common/docs/r0.23.1/

Hortonworks website:

http://hortonworks.com/

Other Resources

© Hortonworks Inc. 2012

• Hadoop Summit

– June 13-14

– San Jose, California

– www.Hadoopsummit.org

• Hadoop Training and Certification

– Developing Solutions Using Apache Hadoop

– Administering Apache Hadoop

– http://hortonworks.com/training/

• On-demand Webinars

– Available now on Hortonworks website

– http://hortonworks.com/webinars/

Thank You @mahadevkonar