Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Presented by Dheeraj Kapur and Savitha Ravikrishnan | June 30, 2016
Agenda
Topic                                                  Speaker(s)
Introduction, HDFS RU, HBase RU & Storm RU             Dheeraj Kapur
YARN RU, Component RU, Distributed Cache & Sharelib    Savitha Ravikrishnan
Q&A                                                    All Presenters
Hadoop Summit 2016
Hadoop at Yahoo
Grid Infrastructure at Yahoo
▪ A multi-tenant, secure, distributed compute and storage environment based on the Hadoop stack for large-scale data processing.
▪ 3 data centers, over 45k physical nodes.
▪ 18 YARN (Hadoop) clusters, ranging from 350 to 5,200 nodes.
▪ 9 HBase clusters, ranging from 80 to 1,080 nodes.
▪ 13 Storm clusters, ranging from 40 to 250 nodes.
Grid Stack
▪ Hadoop Storage: HDFS; HBase as NoSQL store
▪ Hadoop Compute: YARN (MapReduce) and Tez for batch processing; Storm for stream processing; Spark for iterative programming
▪ Hadoop Services: HCatalog for metadata registry; Pig for ETL; Hive for SQL; Oozie for workflows; proxy services; GDM for data management; CaffeOnSpark for ML
▪ Backend Support: ZooKeeper
▪ Support: Support Shop, monitoring, Starling for logging
Deployment Model
▪ Hadoop clusters: DataNode and NodeManager colocated on worker nodes; NameNode and RM as masters.
▪ HBase clusters: DataNodes and RegionServers colocated; NameNode and HBase Master as masters.
▪ Storm clusters: Nimbus as master, Supervisors as workers.
▪ Shared services: administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; Oozie server; HS2/HCat.
▪ Applications and data: data feeds and data stores.
HDFS
Hadoop Rolling Upgrade
▪ Complete CI/CD for HDFS and YARN upgrades
▪ Builds software and config "tgz" bundles and pushes them to repo servers
▪ Software and configs are installed in a pre-deploy phase and activated during the upgrade
▪ Slow upgrade: one node per cycle
▪ Each component is upgraded independently, i.e. HDFS, YARN & clients
Release Configs/Bundles:
  ---
  doc: This file is auto generated
  packages:
    - label: hadoop
      version: 2.7.2.13.1606200235-20160620-000
    - label: conf
      version: 2.7.2.13.1606200235-20160620-000
    - label: gridjdk
      version: 1.7.0_17.1303042057-20160620-000
    - label: yjava_jdk
      version: 1.8.0_60.51-20160620-000
(Diagram: pre-deploy and CI/CD flow: Jenkins start → Git (release info) → Repo Farm → ygrid-deploy-software → package download (pre-deploy) onto NameNode, DataNodes, ResourceManager, HBase Master, RegionServers and gateways.)
HDFS Upgrade (RU process)
1. Create directory structure
2. Put NN in RU mode
3. SNN upgrade → NN failover → SNN upgrade (involves service and IP failover from NN to SNN and vice versa)
4. For each DN: select DN → check installed version → stop DN (safeupgrade-dn) → activate new software → start DN → wait for DN to join
   • After every 100 hosts are successfully upgraded, check HDFS used percentage and live-node consistency on the NNs
   • Stop/terminate the RU on more than X failures
5. Finalize RU
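The per-DataNode loop can be sketched as below; this is a minimal, hypothetical Python model of the control flow only (the node names, injected upgrade/health-check functions, and thresholds are illustrative, not Yahoo's actual tooling):

```python
def rolling_upgrade(nodes, upgrade_node, cluster_healthy,
                    max_failures=5, check_every=100):
    """Upgrade nodes one per cycle; abort after too many failures."""
    failures, upgraded = 0, 0
    for node in nodes:
        try:
            upgrade_node(node)   # stop DN, activate new software, start, wait to rejoin
            upgraded += 1
        except RuntimeError:
            failures += 1
            if failures > max_failures:   # terminate RU on more than X failures
                raise RuntimeError("aborting after %d failures" % failures)
        # periodic cluster-level sanity check (e.g. HDFS used %, live nodes)
        if upgraded and upgraded % check_every == 0 and not cluster_healthy():
            raise RuntimeError("cluster health check failed")
    return upgraded, failures
```

Injecting the per-node and cluster-level checks as callables keeps the loop itself independent of HDFS specifics, which is why the same shape reappears for YARN and HBase.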
Hadoop 2.7.x improvements over 2.6.x
Performance
▪ Reduced NN failover time by parallelizing quota initialization
▪ Fixed a DataNode layout inefficiency that caused high I/O load
▪ Used an offline upgrade script to speed up the layout upgrade
▪ Added a fake metrics sink to work around a JMX cache issue that was causing delays in DataNode upgrade/health checks
▪ Improved DataNode shutdown speed
Failure handling
▪ Reduced read/write failures by blocking clients until the DN is fully initialized
YARN
YARN Rolling Upgrade
▪ Minimize downtime, maximize service availability
▪ Work-preserving restart on RM and NM
▪ Retains state for 10 minutes
▪ Ensures that applications continue running during an RM restart
▪ Save state, update software, restart, and restore state
▪ Uses LevelDB as the state store
▪ After the RM restarts, it loads all application metadata and other credentials from the state store and populates them into memory
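Work-preserving restart is driven by standard Hadoop configuration; a minimal sketch of the relevant yarn-site.xml properties might look like the following (the recovery directory path is illustrative, and the slides do not show Yahoo's actual settings):

```xml
<!-- Sketch: enable RM recovery with a LevelDB-backed state store -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore</value>
</property>
<!-- Sketch: enable NM recovery so containers survive an NM restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/run/yarn/nm-recovery</value>
</property>
```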
YARN Upgrade (RU process)
(Diagram: CI/CD-driven YARN upgrade: Jenkins start → Git (release info) → RU process.)
1. Create directory structure
2. For each NM: select NM → check installed version → safestop NM (kill -9) → activate new software → start NM → wait for NM to join
   • Stop/terminate the RU on more than X failures
3. ResourceManager upgrade
4. HistoryServer upgrade
5. Timeline Server upgrade
Distributed cache & Sharelib
Distributed Cache
▪ Distributed cache distributes application-specific, large, read-only files efficiently.
▪ Applications specify the files to be cached via URLs (hdfs://) in the job configuration.
▪ DistributedCache tracks the modification timestamps of the cached files.
▪ DistributedCache can distribute simple read-only data or text files as well as more complex types such as archives and JAR files.
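The timestamp tracking is what keeps cached copies consistent with their sources; a toy Python sketch of that idea (purely illustrative, not Hadoop's implementation) might look like:

```python
import os
import shutil

class TimestampCache:
    """Re-copy a source file into the cache only when its mtime changes."""
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.mtimes = {}            # source path -> mtime at last copy
        os.makedirs(cache_dir, exist_ok=True)

    def localize(self, src):
        dst = os.path.join(self.cache_dir, os.path.basename(src))
        mtime = os.path.getmtime(src)
        if self.mtimes.get(src) != mtime:   # stale or never cached
            shutil.copy2(src, dst)
            self.mtimes[src] = mtime
        return dst
```

An unchanged modification timestamp means the cached copy is reused; a changed one triggers re-localization.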
Sharelib
▪ "Sharelib" is a management system for a directory in HDFS named /sharelib, which exists on every cluster.
▪ Shared libraries can simplify the deployment and management of applications.
▪ The target directory is /sharelib, under which you will find:
  • /sharelib/v1 - where all the packages are
  • /sharelib/v1/conf - where the unique metafile for the cluster is (and all previous versions)
  • /sharelib/v1/{tez, pig, ...} - where the package versions are kept
▪ The links/tags (metafile) are unique per cluster.
▪ Grid Ops maintains the shared libraries on HDFS of each cluster.
▪ Packages in the shared libraries include mapreduce, pig, hbase, hcatalog, hive and oozie.
(Diagram: Sharelib update flow: Jenkins start → Git bundles → download packages from the dist repo → re-package and upload package → re-generate meta info (HDFS) → upload to Oozie → verify DistCache → generate client updates for subsystems.)
Component Upgrade
▪ New releases: the CI environment continuously releases certified builds and their versions.
▪ Generate state: package rulesets contain the list of core packages and their dependencies for each and every cluster.
▪ Deploy cookbooks: contain Chef code and configuration that is pushed to the Chef server.
▪ Deploy pipelines: YAML files that specify the flow and order of the deploy for every environment/cluster.
▪ Validation jobs: run after a deploy completes on all nodes, ensuring end-to-end functionality works as expected.
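A deploy pipeline of the kind described might look like the following YAML sketch; the schema, stage names, and job names here are hypothetical, invented for illustration (the slides do not show the actual file format):

```yaml
# Hypothetical deploy-pipeline sketch (illustrative schema, not Yahoo's)
cluster: example-cluster
order:
  - stage: canary          # small first wave to catch regressions early
    nodes: 1%
    on_failure: abort
  - stage: batch           # remaining nodes in controlled batches
    nodes: 25%
    on_failure: abort
validation:
  - job: end_to_end_smoke_test
    run_after: all_nodes_deployed
```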
Components Upgrade: CI process
(Diagram:
1. New release: build farms publish certified releases and component versions into Git bundles.
2. Package rulesets: rule-set files (cluster/component specific) and certified package version info are turned into state files by Ruby (Rake): rspec/rubocop checks, state generate, compare & upload, validate and increment version.
3. Deploy cookbooks: cookbooks, roles, environment and attribute files from Git (release info) are pushed via build farms and Artifactory to the Chef server.)
Components Upgrade cont.: CD process
(Diagram: step 4, deploy pipeline: state files from Git (release info) and build farms feed the deploy pipeline; Ruby (Rake) enforces min size, zero-downtime check, target size and validation; on each component node, chef-client converges the cookbooks with graceful shutdown and health check, coordinated through Chef.)
HBase
HBase Rolling Upgrade
Release Configs:
  default:
    group: 'all'
    command: 'start'
    system: 'ALL'
    verbose: 'true'
    retry: 3
    upgradeREST: 'false'
    upgradeGateway: 'true'
    dryrun: 'false'
    force: 'false'
    upgrade_type: 'rolling'
    skip_nn_upgrade: 'false'
    skip_master_upgrade: 'false'

Workflow definitions:
  default:
    continue_on_failure:
      - broken
      - badnodes
  relux.red:
    - master
    - default
    - user
    - ca_soln-stage
    - perf,perf2,projects
    - restALL
▪ Workflow-based system
▪ Complete CI/CD for HDFS and HBase upgrades
▪ Builds tgz bundles and pushes them to repo servers
▪ Installs software beforehand, activates the new release during the upgrade
▪ Each component and region group is upgraded independently, i.e. HDFS, then groups of regionservers
HBase Upgrade (RU process)
(Diagram: CI/CD process: Jenkins start → Git (release info); package and conf versions are pulled from the repo server.)
1. Put NN in RU mode & upgrade NN/SNN (HDFS rolling-upgrade process)
2. Master upgrade
3. Region-server upgrade process: iterate over each group, then over each server in the group. For each DN/RS: offload regions → stop regionserver → DN safeupgrade, stop DN → upgrade and start DN → upgrade and start RS → reload regions
4. Stargate upgrade
5. Gateway upgrade
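The group-wise regionserver loop can be modeled as below; a hypothetical Python sketch where the offload/upgrade/reload steps are injected callables (not Yahoo's actual tooling):

```python
def upgrade_region_groups(groups, offload, upgrade, reload_regions):
    """Upgrade regionservers group by group; regions are moved off a
    server before its upgrade and moved back afterwards."""
    done = []
    for group in groups:                 # iterate over each group
        for server in group:             # iterate over each server in a group
            regions = offload(server)    # offload regions to peer servers
            upgrade(server)              # stop RS, safeupgrade DN, restart both
            reload_regions(server, regions)  # move the regions back
            done.append(server)
    return done
```

Offloading before the stop and reloading after the restart is what keeps region availability up while each server is cycled.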
STORM
Storm Rolling Upgrade

Release Configs:
  default:
    parallel: 10
    verbose: 'true'
    retry: 3
    dryrun: 'false'
    upgrade_type: 'rolling'
    quarantine: 'true'
    terminate_on_failure: 'true'
    sup_failure_threshold: 10
    sendmail_to: '[email protected]'
    sendmail_cc: '[email protected], [email protected]'
    cluster_workflow:
      cluster1.colo1: pacemaker_drpc
      cluster2.colo2: default
Workflow definitions:
  default:
    rolling_task:
      - upgradeNimbus
      - bounceNimbus
      - upgradeSupervisor
      - bounceSupervisor
      - upgradeDRPC
      - bounceDRPC
      - upgradeGateways
      - doGatewayTask
      - verifySupervisor
      - runDRPCTestTopology
      - verifySoftwareVersion
    full_upgrade_task:
      - killAllTopologies
      - specifyOperation_stop
      - sleep10
      - bounceNimbus
      - bounceSupervisor
      - bounceDRPC
      - clearDiskCache
      - cleanZKP
      - upgradeNimbus
      - upgradeSupervisor
      - upgradeDRPC
      - specifyOperation_start
      - bounceNimbus
      - bounceSupervisor
      - bounceDRPC
      - upgradeGateways
      - doGatewayTask
      - verifySupervisor
      - runDRPCTestTopology
      - verifySoftwareVersion
▪ Complete CI/CD system; statefiles are built per component and pushed to Artifactory before the upgrade
▪ Installs software beforehand, activates the new release during the upgrade
▪ Each component is upgraded independently, i.e. Pacemaker, Nimbus, DRPC & Supervisor
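A hypothetical sketch of how per-cluster workflow selection and task execution could be driven (the task names mirror the workflow definitions above; the runner itself and its failure semantics are invented for illustration):

```python
def run_upgrade(cluster, workflows, cluster_workflow, tasks,
                sup_failure_threshold=10):
    """Run the task list for a cluster's assigned workflow; repeated
    task failures abort the upgrade once the threshold is reached."""
    name = cluster_workflow.get(cluster, "default")  # per-cluster override
    failures = 0
    for task in workflows[name]:
        ok = tasks[task]()               # each task returns True/False
        if not ok:
            failures += 1
            if failures >= sup_failure_threshold:
                raise RuntimeError("%s: too many failures on %s" % (name, task))
    return name, failures
```

The cluster_workflow mapping mirrors the release config above: clusters without an explicit entry fall back to the default workflow.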
Storm Upgrade CI/CD process
(Diagram: Jenkins start → Git (release info) → Artifactory (state files & release info); RE Jenkins and SD process.)
Pacemaker upgrade → Nimbus upgrade → Supervisor upgrade → bounce workers → DRPC upgrade → verify supervisors → run test/validation topology → audit all components
▪ RE Jenkins drives statefile generation for each component and updates git with release info
▪ Statefiles are published in Artifactory and downloaded during the upgrade
▪ The upgrade fails if more than X supervisors fail to upgrade
Rolling Upgrade timeline

Component (nodes)    Parallelism   Hadoop 2.6.x   Hadoop 2.7.x   HBase 0.98.x   Storm 0.10.1.x
HDFS (4k nodes)      1             4 days         1 day          X              X
YARN (4k nodes)      1             1 day          1 day          X              X
HBase (1k nodes)     1-4           4-5 days       X              4-5 days       X
Storm (350 nodes)    10            X              X              X              4-6 hrs
Components           1             1-2 hrs        1-2 hrs        1-2 hrs        X
Rolling Upgrade Impact: YTD Availability by Cluster
(Chart: per-cluster YTD availability percentages; labeled values include 99.687, 99.705, 99.898, 99.928, 99.940 and 99.990.)
Thank You