Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Presented by Dheeraj Kapur and Savitha Ravikrishnan | June 30, 2016
Agenda
Topic                                                  Speaker(s)
Introduction, HDFS RU, HBase RU & Storm RU             Dheeraj Kapur
YARN RU, Component RU, Distributed Cache & Sharelib    Savitha Ravikrishnan
Q&A                                                    All Presenters
Hadoop Summit 2016
Hadoop at Yahoo
Grid Infrastructure at Yahoo
▪ A multi-tenant, secure, distributed compute and storage environment based on the Hadoop stack for large-scale data processing.
▪ 3 data centers, over 45k physical nodes.
▪ 18 YARN (Hadoop) clusters, ranging from 350 to 5,200 nodes.
▪ 9 HBase clusters, ranging from 80 to 1,080 nodes.
▪ 13 Storm clusters, ranging from 40 to 250 nodes.
Grid Stack
▪ Hadoop Storage: HDFS; HBase as NoSQL store
▪ Hadoop Compute: YARN (MapReduce) and Tez for batch processing; Storm for stream processing; Spark for iterative programming
▪ Hadoop Services: HCatalog for metadata registry; Pig for ETL; Hive for SQL; Oozie for workflows; proxy services; GDM for data management; CaffeOnSpark for ML
▪ Backend Support: ZooKeeper
▪ Support: Support Shop, monitoring, Starling for logging
Deployment Model
▪ Hadoop clusters: DataNode and NodeManager colocated on worker nodes; NameNode and RM as masters.
▪ HBase clusters: DataNodes and RegionServers colocated; NameNode and HBase Master as masters.
▪ Storm clusters: Nimbus as master, Supervisors as workers.
▪ Shared services: administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; Oozie server; HS2/HCat.
▪ Applications and data: data feeds and data stores.
HDFS
Hadoop Rolling Upgrade
▪ Complete CI/CD for HDFS and YARN upgrades
▪ Builds software and config "tgz" bundles and pushes them to repo servers
▪ Software and configs are installed in a pre-deploy phase and activated during the upgrade
▪ Slow upgrade: one node per cycle
▪ Each component is upgraded independently, i.e. HDFS, YARN & clients
Release Configs/Bundles:
  ---
  doc: This file is auto generated
  packages:
    - label: hadoop
      version: 2.7.2.13.1606200235-20160620-000
    - label: conf
      version: 2.7.2.13.1606200235-20160620-000
    - label: gridjdk
      version: 1.7.0_17.1303042057-20160620-000
    - label: yjava_jdk
      version: 1.8.0_60.51-20160620-000
(Diagram: pre-deploy and CI/CD flow: Jenkins start → Git (release info) → Repo Farm → ygrid-deploy-software → package download (pre-deploy) onto NameNode, DataNodes, ResourceManager, HBase Master, RegionServers and gateways.)
HDFS Upgrade (RU process)
1. Create directory structure
2. Put NN in RU mode
3. SNN upgrade → NN failover → SNN upgrade (involves service and IP failover from NN to SNN and vice versa)
4. For each DN: select DN → check installed version → stop DN (safeupgrade-dn) → activate new software → start DN → wait for DN to join
   • After every 100 hosts are successfully upgraded, check HDFS used percentage and live-node consistency on the NNs
   • Stop/terminate the RU on more than X failures
5. Finalize RU
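The per-DataNode loop can be sketched as below; this is a minimal, hypothetical Python model of the control flow only (the node names, injected upgrade/health-check functions, and thresholds are illustrative, not Yahoo's actual tooling):

```python
def rolling_upgrade(nodes, upgrade_node, cluster_healthy,
                    max_failures=5, check_every=100):
    """Upgrade nodes one per cycle; abort after too many failures."""
    failures, upgraded = 0, 0
    for node in nodes:
        try:
            upgrade_node(node)   # stop DN, activate new software, start, wait to rejoin
            upgraded += 1
        except RuntimeError:
            failures += 1
            if failures > max_failures:   # terminate RU on more than X failures
                raise RuntimeError("aborting after %d failures" % failures)
        # periodic cluster-level sanity check (e.g. HDFS used %, live nodes)
        if upgraded and upgraded % check_every == 0 and not cluster_healthy():
            raise RuntimeError("cluster health check failed")
    return upgraded, failures
```

Injecting the per-node and cluster-level checks as callables keeps the loop itself independent of HDFS specifics, which is why the same shape reappears for YARN and HBase.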
Hadoop 2.7.x improvements over 2.6.x
Performance
▪ Reduced NN failover time by parallelizing quota initialization
▪ Fixed a DataNode layout inefficiency that caused high I/O load
▪ Used an offline upgrade script to speed up the layout upgrade
▪ Added a fake metrics sink to work around a JMX cache issue that was causing delays in DataNode upgrade/health checks
▪ Improved DataNode shutdown speed
Failure handling
▪ Reduced read/write failures by blocking clients until the DN is fully initialized
YARN
YARN Rolling Upgrade
▪ Minimize downtime, maximize service availability
▪ Work-preserving restart on RM and NM
▪ Retains state for 10 minutes
▪ Ensures that applications continue running during an RM restart
▪ Save state, update software, restart, and restore state
▪ Uses LevelDB as the state store
▪ After the RM restarts, it loads all application metadata and other credentials from the state store and populates them into memory
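Work-preserving restart is driven by standard Hadoop configuration; a minimal sketch of the relevant yarn-site.xml properties might look like the following (the recovery directory path is illustrative, and the slides do not show Yahoo's actual settings):

```xml
<!-- Sketch: enable RM recovery with a LevelDB-backed state store -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore</value>
</property>
<!-- Sketch: enable NM recovery so containers survive an NM restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/run/yarn/nm-recovery</value>
</property>
```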
YARN Upgrade (RU process)
(Diagram: CI/CD-driven YARN upgrade: Jenkins start → Git (release info) → RU process.)
1. Create directory structure
2. For each NM: select NM → check installed version → safestop NM (kill -9) → activate new software → start NM → wait for NM to join
   • Stop/terminate the RU on more than X failures
3. ResourceManager upgrade
4. HistoryServer upgrade
5. Timeline Server upgrade
Distributed cache & Sharelib
Distributed Cache
▪ Distributed cache distributes application-specific, large, read-only files efficiently.
▪ Applications specify the files to be cached via URLs (hdfs://) in the job configuration.
▪ DistributedCache tracks the modification timestamps of the cached files.
▪ DistributedCache can distribute simple read-only data or text files as well as more complex types such as archives and JAR files.
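The timestamp tracking is what keeps cached copies consistent with their sources; a toy Python sketch of that idea (purely illustrative, not Hadoop's implementation) might look like:

```python
import os
import shutil

class TimestampCache:
    """Re-copy a source file into the cache only when its mtime changes."""
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.mtimes = {}            # source path -> mtime at last copy
        os.makedirs(cache_dir, exist_ok=True)

    def localize(self, src):
        dst = os.path.join(self.cache_dir, os.path.basename(src))
        mtime = os.path.getmtime(src)
        if self.mtimes.get(src) != mtime:   # stale or never cached
            shutil.copy2(src, dst)
            self.mtimes[src] = mtime
        return dst
```

An unchanged modification timestamp means the cached copy is reused; a changed one triggers re-localization.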
Sharelib
▪ "Sharelib" is a management system for a directory in HDFS named /sharelib, which exists on every cluster.
▪ Shared libraries can simplify the deployment and management of applications.
▪ The target directory is /sharelib, under which you will find:
  • /sharelib/v1 - where all the packages are
  • /sharelib/v1/conf - where the unique metafile for the cluster is (and all previous versions)
  • /sharelib/v1/{tez, pig, ...} - where the package versions are kept
▪ The links/tags (metafile) are unique per cluster.
▪ Grid Ops maintains the shared libraries on HDFS of each cluster.
▪ Packages in the shared libraries include mapreduce, pig, hbase, hcatalog, hive and oozie.
(Diagram: Sharelib update flow: Jenkins start → Git bundles → download packages from the dist repo → re-package and upload package → re-generate meta info (HDFS) → upload to Oozie → verify DistCache → generate client updates for subsystems.)
Component Upgrade
▪ New releases: the CI environment continuously releases certified builds and their versions.
▪ Generate state: package rulesets contain the list of core packages and their dependencies for each and every cluster.
▪ Deploy cookbooks: contain Chef code and configuration that is pushed to the Chef server.
▪ Deploy pipelines: YAML files that specify the flow and order of the deploy for every environment/cluster.
▪ Validation jobs: run after a deploy completes on all nodes, ensuring end-to-end functionality works as expected.
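A deploy pipeline of the kind described might look like the following YAML sketch; the schema, stage names, and job names here are hypothetical, invented for illustration (the slides do not show the actual file format):

```yaml
# Hypothetical deploy-pipeline sketch (illustrative schema, not Yahoo's)
cluster: example-cluster
order:
  - stage: canary          # small first wave to catch regressions early
    nodes: 1%
    on_failure: abort
  - stage: batch           # remaining nodes in controlled batches
    nodes: 25%
    on_failure: abort
validation:
  - job: end_to_end_smoke_test
    run_after: all_nodes_deployed
```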
Components Upgrade: CI process
(Diagram:
1. New release: build farms publish certified releases and component versions into Git bundles.
2. Package rulesets: rule-set files (cluster/component specific) and certified package version info are turned into state files by Ruby (Rake): rspec/rubocop checks, state generate, compare & upload, validate and increment version.
3. Deploy cookbooks: cookbooks, roles, environment and attribute files from Git (release info) are pushed via build farms and Artifactory to the Chef server.)
Components Upgrade cont.: CD process
(Diagram: step 4, deploy pipeline: state files from Git (release info) and build farms feed the deploy pipeline; Ruby (Rake) enforces min size, zero-downtime check, target size and validation; on each component node, chef-client converges the cookbooks with graceful shutdown and health check, coordinated through Chef.)
HBase
HBase Rolling Upgrade
Release Configs:
  default:
    group: 'all'
    command: 'start'
    system: 'ALL'
    verbose: 'true'
    retry: 3
    upgradeREST: 'false'
    upgradeGateway: 'true'
    dryrun: 'false'
    force: 'false'
    upgrade_type: 'rolling'
    skip_nn_upgrade: 'false'
    skip_master_upgrade: 'false'

Workflow definitions:
  default:
    continue_on_failure:
      - broken
      - badnodes
  relux.red:
    - master
    - default
    - user
    - ca_soln-stage
    - perf,perf2,projects
    - restALL
▪ Workflow-based system
▪ Complete CI/CD for HDFS and HBase upgrades
▪ Builds tgz bundles and pushes them to repo servers
▪ Installs software beforehand, activates the new release during the upgrade
▪ Each component and region group is upgraded independently, i.e. HDFS, then groups of regionservers
HBase Upgrade (RU process)
(Diagram: CI/CD process: Jenkins start → Git (release info); package and conf versions are pulled from the repo server.)
1. Put NN in RU mode & upgrade NN/SNN (HDFS rolling-upgrade process)
2. Master upgrade
3. Region-server upgrade process: iterate over each group, then over each server in the group. For each DN/RS: offload regions → stop regionserver → DN safeupgrade, stop DN → upgrade and start DN → upgrade and start RS → reload regions
4. Stargate upgrade
5. Gateway upgrade
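The group-wise regionserver loop can be modeled as below; a hypothetical Python sketch where the offload/upgrade/reload steps are injected callables (not Yahoo's actual tooling):

```python
def upgrade_region_groups(groups, offload, upgrade, reload_regions):
    """Upgrade regionservers group by group; regions are moved off a
    server before its upgrade and moved back afterwards."""
    done = []
    for group in groups:                 # iterate over each group
        for server in group:             # iterate over each server in a group
            regions = offload(server)    # offload regions to peer servers
            upgrade(server)              # stop RS, safeupgrade DN, restart both
            reload_regions(server, regions)  # move the regions back
            done.append(server)
    return done
```

Offloading before the stop and reloading after the restart is what keeps region availability up while each server is cycled.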
STORM
Storm Rolling Upgrade

Release Configs:
  default:
    parallel: 10
    verbose: 'true'
    retry: 3
    dryrun: 'false'
    upgrade_type: 'rolling'
    quarantine: 'true'
    terminate_on_failure: 'true'
    sup_failure_threshold: 10
    sendmail_to: '[email protected]'
    sendmail_cc: '[email protected], [email protected]'
    cluster_workflow:
      cluster1.colo1: pacemaker_drpc
      cluster2.colo2: default
Workflow definitions:
  default:
    rolling_task:
      - upgradeNimbus
      - bounceNimbus
      - upgradeSupervisor
      - bounceSupervisor
      - upgradeDRPC
      - bounceDRPC
      - upgradeGateways
      - doGatewayTask
      - verifySupervisor
      - runDRPCTestTopology
      - verifySoftwareVersion
    full_upgrade_task:
      - killAllTopologies
      - specifyOperation_stop
      - sleep10
      - bounceNimbus
      - bounceSupervisor
      - bounceDRPC
      - clearDiskCache
      - cleanZKP
      - upgradeNimbus
      - upgradeSupervisor
      - upgradeDRPC
      - specifyOperation_start
      - bounceNimbus
      - bounceSupervisor
      - bounceDRPC
      - upgradeGateways
      - doGatewayTask
      - verifySupervisor
      - runDRPCTestTopology
      - verifySoftwareVersion
▪ Complete CI/CD system; statefiles are built per component and pushed to Artifactory before the upgrade
▪ Installs software beforehand, activates the new release during the upgrade
▪ Each component is upgraded independently, i.e. Pacemaker, Nimbus, DRPC & Supervisor
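A hypothetical sketch of how per-cluster workflow selection and task execution could be driven (the task names mirror the workflow definitions above; the runner itself and its failure semantics are invented for illustration):

```python
def run_upgrade(cluster, workflows, cluster_workflow, tasks,
                sup_failure_threshold=10):
    """Run the task list for a cluster's assigned workflow; repeated
    task failures abort the upgrade once the threshold is reached."""
    name = cluster_workflow.get(cluster, "default")  # per-cluster override
    failures = 0
    for task in workflows[name]:
        ok = tasks[task]()               # each task returns True/False
        if not ok:
            failures += 1
            if failures >= sup_failure_threshold:
                raise RuntimeError("%s: too many failures on %s" % (name, task))
    return name, failures
```

The cluster_workflow mapping mirrors the release config above: clusters without an explicit entry fall back to the default workflow.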
Storm Upgrade CI/CD process
(Diagram: Jenkins start → Git (release info) → Artifactory (state files & release info); RE Jenkins and SD process.)
Pacemaker upgrade → Nimbus upgrade → Supervisor upgrade → bounce workers → DRPC upgrade → verify supervisors → run test/validation topology → audit all components
▪ RE Jenkins drives statefile generation for each component and updates git with release info
▪ Statefiles are published in Artifactory and downloaded during the upgrade
▪ The upgrade fails if more than X supervisors fail to upgrade
Rolling Upgrade timeline

Component (nodes)    Parallelism   Hadoop 2.6.x   Hadoop 2.7.x   HBase 0.98.x   Storm 0.10.1.x
HDFS (4k nodes)      1             4 days         1 day          X              X
YARN (4k nodes)      1             1 day          1 day          X              X
HBase (1k nodes)     1-4           4-5 days       X              4-5 days       X
Storm (350 nodes)    10            X              X              X              4-6 hrs
Components           1             1-2 hrs        1-2 hrs        1-2 hrs        X
Rolling Upgrade Impact: YTD Availability by Cluster
(Chart: per-cluster YTD availability percentages; labeled values include 99.687, 99.705, 99.898, 99.928, 99.940 and 99.990.)
Thank You