Hadoop in Virtual Machines
Richard McDougall, VMware
Sanjay Radia, Hortonworks
Hadoop Summit, 2012
Part 1
Say What?
• VMs will just add overhead, due to I/O virtualization
• VMs run on SAN, we’re all about local disks
• Hadoop does its own cluster management
• It’ll do resource management in 2.0
• And even HA is coming to Hadoop
• And… what is the point, anyway?
But you’ve been asking…
• Can I virtualize my Hadoop, so that it is easier and quicker to get a cluster up and running?
• Is it possible to run Hadoop on those spare machine cycles I have on hundreds/thousands of nodes?
• Can I make my system more available by using some of the standard HA features?
And the savvy are asking…
• Can I avoid having to install special hardware for the master services, like name-node, job-tracker?
• Can I dynamically change the size of the cluster to use more resources?
• Can I use VM isolation to increase security or guard against resource-intensive neighbors?
• Is it feasible to provision virtual-clusters, giving out one each to a business unit?
Ok, so first what about the concerns?
SAN Storage
$2 - $10/Gigabyte
$1M gets: 0.5 Petabytes
1,000,000 IOPS, 1 Gbyte/sec
NAS Filers
$1 - $5/Gigabyte
$1M gets: 1 Petabyte
400,000 IOPS, 2 Gbyte/sec
• Use your SAN? … if you want to.
Local Storage
$0.05/Gigabyte
$1M gets: 20 Petabytes
10,000,000 IOPS, 800 Gbytes/sec
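The capacity figures above follow from dividing the budget by the price per gigabyte. A minimal sketch of that arithmetic, assuming decimal units (1 Petabyte = 1,000,000 Gigabytes) and taking the low end of each quoted price range; the function and variable names are illustrative only:

```python
# Capacity that a fixed budget buys at a given price per gigabyte.
# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000


def petabytes_for_budget(budget_dollars: float, dollars_per_gb: float) -> float:
    """Return the capacity in petabytes that the budget buys."""
    return budget_dollars / dollars_per_gb / GB_PER_PB


BUDGET = 1_000_000  # $1M, as on the slide

# Low end of each price range quoted on the slide, in dollars per gigabyte.
PRICE_PER_GB = {"SAN": 2.00, "NAS": 1.00, "Local": 0.05}

capacity_pb = {
    tier: petabytes_for_budget(BUDGET, price)
    for tier, price in PRICE_PER_GB.items()
}

for tier, pb in capacity_pb.items():
    print(f"{tier}: {pb:g} PB per $1M")
```

At the high end of the SAN and NAS ranges ($10 and $5 per Gigabyte) the same budget buys proportionally less, so the slide's figures are best-case numbers for those two tiers.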
Hadoop Using Local Disks
[Diagram: a virtualization host running a Hadoop virtual machine (Datanode and Task Tracker) next to another workload. The Hadoop VM uses three VMDKs, each formatted as Ext4; the OS image VMDK is on shared storage.]
Hadoop Perf in a VM (Ratio is elapsed time to physical, Lower Is Better)
[Chart: ratio to native (y-axis, 0 to 1.2) for Pi, TestDFSIO-write, TestDFSIO-read, TeraGen 1 TB, TeraSort 1 TB, TeraValidate 1 TB, TeraGen 3.5 TB, TeraSort 3.5 TB, and TeraValidate 3.5 TB; series: 1 VM and 2 VMs.]
Evolution of Hadoop on VMs
[Diagram: three stages, from combined storage/compute VMs, to compute VMs separated from storage, to separate compute clusters for tenants T1 and T2 over shared storage.]
Current Hadoop: Combined Storage/Compute
- Hadoop in VM
- VM lifecycle determined by Datanode
- NOT elastic
- Limited to Hadoop multi-tenancy
Separate Storage
- Separate compute from data
- Elastic compute
- Enable shared workloads
- Raise utilization
Separate Compute Clusters
- Separate virtual clusters per tenant
- Stronger VM-grade security and resource isolation
- Enable deployment of multiple Hadoop runtime versions
1. Hadoop Task Tracker and Data Node in a VM
[Diagram: a virtualization host running one virtual Hadoop node (Datanode on a VMDK, plus a Task Tracker with two slots) next to another workload.]
Add/remove slots?
Grow/shrink by tens of GB?
Grow/shrink of a VM is one approach
2. Add/remove Virtual Nodes
[Diagram: a virtualization host running two virtual Hadoop nodes, each with a Datanode on a VMDK and a Task Tracker with two slots, next to another workload.]
Just add/remove more virtual nodes?
But State makes it hard to power-off a node
[Diagram: a virtualization host running one virtual Hadoop node (Datanode on a VMDK, plus a Task Tracker with two slots) next to another workload.]
Powering off the Hadoop VM would in effect fail the datanode
Adding a node needs data…
[Diagram: a virtualization host with existing virtual Hadoop nodes (Datanode on a VMDK, plus a Task Tracker with two slots) and a newly added virtual Hadoop node whose Datanode starts empty.]
Adding a node would require TBs of data replication
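To get a feel for why replicating terabytes onto a new node is slow, here is a rough estimate of the copy time. It is a sketch under stated assumptions: decimal units, a network link used only for replication, and example inputs (10 TB, 10 Gbit/s) that are hypothetical and not taken from the slides.

```python
def replication_hours(data_tb: float, network_gbit_per_s: float) -> float:
    """Hours needed to copy data_tb terabytes over a link of the given speed.

    Assumes decimal units and a link dedicated to replication,
    which is optimistic for a busy cluster.
    """
    data_bits = data_tb * 1e12 * 8
    seconds = data_bits / (network_gbit_per_s * 1e9)
    return seconds / 3600


# Hypothetical example: 10 TB onto a new datanode over a 10 Gbit/s link.
hours = replication_hours(10, 10)
print(f"{hours:.1f} hours")
```

Even under these favorable assumptions the copy takes hours, which is far longer than the time needed to power a VM on or off.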
3. Separated Compute and Data
[Diagram: a virtualization host running several virtual Hadoop compute nodes, each a Task Tracker with two slots, next to another workload; the Datanode runs in its own virtual Hadoop node.]
Truly Elastic Hadoop: scalable through virtual nodes
Dataflow with separated Compute/Data
[Diagram: on one virtualization host, a virtual Hadoop node running the Datanode (with a VMDK) and a virtual Hadoop node running a NodeManager with two slots; their virtual NICs connect through the virtual switch, above the NIC drivers.]
Performance Analysis of Split
Split: 1 Datanode VM and 1 compute node VM per host
Combined: 1 combined compute/Datanode VM per host
[Diagram: Datanode and NodeManager VMs laid out across the hosts for the two configurations.]
Workload: Teragen, Terasort, Teravalidate
HW configuration: 8 cores, 96 GB RAM, 16 disks per host x 2 nodes
Performance Analysis of Split (Elapsed time: ratio to combined)
[Chart: elapsed time as a ratio to combined (y-axis, 0 to 1.2) for Teragen, Terasort, and Teravalidate; series: Combined and Split.]
Tying it together: Elastic Hadoop
[Diagram: six hosts. Runtime layer: three Virtual Hadoop clusters fed by a queue, serving tenants such as Coke and Pepsi. Data layer: Distributed File System (HDFS, KFS, GPFS, MAPR, Isilon,…) with three namespaces labeled Public, Public, and Secret.]
Demo: Shrink/Expand Cluster
[Diagram: each physical host runs a Datanode, NodeManager VMs, and web server VMs.]
Setup: 1 Datanode, 2 NodeManagers, and 2 web servers on each physical host
[Diagram: the same hosts with fewer NodeManagers and more web servers running; the Datanodes stay up.]
When web load is high in the daytime, we can suspend some NodeManagers and power on more web servers.
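The shrink/expand idea can be sketched as a small per-host policy. This is an illustration only, not the logic used in the demo: `web_load` is a hypothetical 0.0 to 1.0 utilization signal, `vm_slots` is the number of compute VMs a host can run (4 in the setup above), and the Datanode VM is deliberately left out because it stays powered on.

```python
from dataclasses import dataclass


@dataclass
class HostPlan:
    web_servers: int
    node_managers: int


def plan_host(web_load: float, vm_slots: int = 4, min_web: int = 1) -> HostPlan:
    """Split a host's compute VM slots between web servers and NodeManagers.

    web_load is a hypothetical 0.0-1.0 utilization signal. The Datanode VM
    is not counted in vm_slots: it stays powered on so HDFS data stays available.
    """
    if not 0.0 <= web_load <= 1.0:
        raise ValueError("web_load must be between 0.0 and 1.0")
    # Give web servers a share of slots proportional to load, within bounds.
    web = max(min_web, round(web_load * vm_slots))
    web = min(web, vm_slots)
    # Whatever is left over is available to Hadoop compute.
    return HostPlan(web_servers=web, node_managers=vm_slots - web)


print(plan_host(0.5))   # balanced split
print(plan_host(0.75))  # daytime: more web servers, fewer NodeManagers
```

Because compute is separated from data, changing the NodeManager count does not move any HDFS blocks, which is what makes this kind of adjustment cheap.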
Demo
Part 2
Expand Hadoop Ecosystem
• Hortonworks goal
  – Expand Hadoop ecosystem
  – Provide first class support of various platforms
• Hadoop should run well on VMs
• VMs offer several advantages as presented earlier
• Take advantage of vSphere for HA
VMware-Hortonworks Joint Engineering
• First class support for VMs
  – Topology plugins (HADOOP-8468)
    • 2 VMs can be on same host
      – Pick closer data
      – Schedule tasks closer
      – Don’t put two replicas on same host
  – MR-tmp on HDFS using block pools
    • Elastic Compute-VMs will not need local disk
  – Fast communications within VMs
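The "don't put two replicas on same host" rule can be illustrated with a small placement function. This is a simplified sketch, not the HADOOP-8468 implementation: it ignores racks, and the VM-to-host mapping and all names are hypothetical.

```python
import random
from typing import Dict, List, Optional


def place_replicas(vm_to_host: Dict[str, str], replicas: int = 3,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Pick datanode VMs for a block so no two replicas share a physical host."""
    rng = rng or random.Random()
    # Group datanode VMs by the physical host they run on.
    by_host: Dict[str, List[str]] = {}
    for vm, host in vm_to_host.items():
        by_host.setdefault(host, []).append(vm)
    if replicas > len(by_host):
        raise ValueError("not enough physical hosts for the requested replicas")
    # Choose distinct hosts first, then one VM on each chosen host.
    hosts = rng.sample(sorted(by_host), replicas)
    return [rng.choice(by_host[h]) for h in hosts]


# Hypothetical cluster: three physical hosts with two datanode VMs each.
cluster = {
    "vm1": "hostA", "vm2": "hostA",
    "vm3": "hostB", "vm4": "hostB",
    "vm5": "hostC", "vm6": "hostC",
}
print(place_replicas(cluster))
```

Without host awareness, a placement policy that only sees VMs could put two replicas on vm1 and vm2, and a single physical failure would then take out both copies.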
Hadoop Total System Availability Architecture
[Diagram: an HA cluster for the master daemons (three servers running NN and JT, with N+K failover). During NN failover, apps running outside the cluster pause/retry and the JT pauses/retries and goes into safemode, while jobs run on the slave nodes of the Hadoop cluster.]
© Hortonworks Inc. 2011
HA is coming in 1.0, using the Total System Availability Architecture
HA in Hadoop 1 with HDP1
• Total System Availability Architecture
  – Namenode
    • Clients pause automatically
    • JobTracker pauses automatically
  – Other Hadoop master services (JT, …) coming
• Use industry proven HA framework
  – VMware vSphere-HA
    • Failover, fencing, …
    • Corner cases are tricky; if not addressed, corruption
  – Additional benefits:
    • N-N & N+K failover
    • Migration for maintenance
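The "clients pause automatically" behavior amounts to retrying an operation while the NameNode is unavailable, instead of failing right away. A minimal sketch of that pattern follows; it is not the actual Hadoop client code, and `NameNodeUnavailable`, `call_with_pause`, and the default timings are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NameNodeUnavailable(Exception):
    """Hypothetical error raised while the NameNode is failing over."""


def call_with_pause(operation: Callable[[], T], max_wait_s: float = 900.0,
                    pause_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry operation, pausing between attempts, until it succeeds or time runs out."""
    waited = 0.0
    while True:
        try:
            return operation()
        except NameNodeUnavailable:
            # Give up only after the total pause budget is spent.
            if waited >= max_wait_s:
                raise
            sleep(pause_s)
            waited += pause_s
```

The `sleep` parameter exists so the pause can be replaced in tests; in normal use the default `time.sleep` applies. The pause budget should be at least as long as the expected failover time for the cluster.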
Hadoop NN/JT HA with vSphere
NameNode HA – Failover Times
• NameNode failover times with vSphere and LinuxHA
  – Failure detection + failover: 0.5 to 2 minutes
  – OS bootup needed for vSphere: 10-20 seconds
  – Namenode startup (exit safemode)
    • Small/medium clusters: 1 to 2 minutes
    • Large cluster: 5 to 15 minutes
• NameNode startup time measurements
  – 60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 40 sec
  – 180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 120 sec
Cold failover is good enough for small/medium clusters
Failure detection and automatic failover dominates
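Adding the phases above gives an end-to-end range for a cold failover on vSphere. The sketch below only sums the ranges quoted on the slide and derives a startup rate from the two measurements; treating the phases as strictly sequential is an assumption.

```python
# Per-phase ranges in seconds, taken from the slide.
DETECT_AND_FAILOVER = (30, 120)   # 0.5 to 2 minutes
OS_BOOT = (10, 20)                # needed for vSphere
NN_STARTUP = {
    "small/medium": (60, 120),    # 1 to 2 minutes
    "large": (300, 900),          # 5 to 15 minutes
}


def total_failover_range(cluster_size: str) -> tuple:
    """Sum the per-phase (min, max) ranges, assuming the phases run in sequence."""
    phases = [DETECT_AND_FAILOVER, OS_BOOT, NN_STARTUP[cluster_size]]
    return (sum(lo for lo, _ in phases), sum(hi for _, hi in phases))


# Startup measurements from the slide: (million blocks, seconds).
MEASUREMENTS = [(6, 40), (18, 120)]
seconds_per_million_blocks = [secs / blocks for blocks, secs in MEASUREMENTS]

for size in NN_STARTUP:
    lo, hi = total_failover_range(size)
    print(f"{size}: {lo} to {hi} seconds")
print([round(rate, 2) for rate in seconds_per_million_blocks])
```

Both measurements work out to the same startup cost per million blocks, which suggests that NameNode startup time grew roughly linearly with block count across these two clusters.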
Demo
Summary
• Advantages of Hadoop on VMs
  – Cluster management
  – Cluster consolidation
  – Greater elasticity in mixed environment
  – Alternate multi-tenancy to capacity scheduler’s offerings
• HA for Hadoop master daemons
  – vSphere based HA for NN, JT, … in Hadoop 1
  – Total System Availability Architecture