Hadoop in Virtual Machines. Richard McDougall, VMware; Sanjay Radia, Hortonworks. Hadoop Summit, 2012

Hadoop on Virtual Machines

Jan 28, 2015


Hadoop on Virtualization talk at Hadoop Summit 2012
Richard McDougall
Sanjay Radia
Transcript
Page 1: Hadoop on Virtual Machines

Hadoop in Virtual Machines

Richard McDougall, VMware
Sanjay Radia, Hortonworks

Hadoop Summit, 2012

Page 2: Hadoop on Virtual Machines

Part 1

Page 3: Hadoop on Virtual Machines

Say What?

• VMs will just add overhead, due to I/O virt
• VMs run on SAN, we’re all about local disks
• Hadoop does its own cluster management
• It’ll do resource management in 2.0
• And even HA is coming to Hadoop

• And… what is the point, anyway?

Page 4: Hadoop on Virtual Machines

But you’ve been asking…

• Can I virtualize my Hadoop, so that it is easier and quicker to get a cluster up and running?

• Is it possible to run Hadoop on those spare machine cycles I have on hundreds/thousands of nodes?

• Can I make my system more available by using some of the standard HA features?

Page 5: Hadoop on Virtual Machines

And the savvy are asking…

• Can I avoid having to install special hardware for the master services, like name-node, job-tracker?

• Can I dynamically change the size of the cluster to use more resources?

• Can I use VM isolation to increase security or guard against resource-intensive neighbors?

• Is it feasible to provision virtual-clusters, giving out one each to a business unit?

Page 6: Hadoop on Virtual Machines

Ok, so first what about the concerns?

SAN Storage
• $2 - $10/Gigabyte
• $1M gets: 0.5 Petabytes, 1,000,000 IOPS, 1 Gbyte/sec

NAS Filers
• $1 - $5/Gigabyte
• $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec

Local Storage
• $0.05/Gigabyte
• $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec

• Use your SAN? … if you want to.
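The capacity figures on this slide can be checked with simple arithmetic. The sketch below assumes decimal units (1 Petabyte = 1,000,000 Gigabytes) and a flat price per gigabyte; the function and constant names are illustrative only. With those assumptions, the quoted capacities line up with the low end of each price range.

```python
# Low-end price per gigabyte for each tier, as quoted on the slide.
PRICE_PER_GB = {
    "SAN Storage": 2.00,
    "NAS Filers": 1.00,
    "Local Storage": 0.05,
}

GB_PER_PB = 1_000_000  # assumption: decimal units


def petabytes_for_budget(budget_dollars, price_per_gb):
    """Return how many petabytes a budget buys at a flat price per gigabyte."""
    return budget_dollars / price_per_gb / GB_PER_PB


def capacity_table(budget_dollars):
    """Map each storage tier to the capacity the budget buys at its low-end price."""
    return {
        tier: petabytes_for_budget(budget_dollars, price)
        for tier, price in PRICE_PER_GB.items()
    }


if __name__ == "__main__":
    for tier, petabytes in capacity_table(1_000_000).items():
        print(f"{tier}: {petabytes:g} PB")
```

At the high end of the SAN and NAS ranges ($10 and $5 per gigabyte) the same budget would buy proportionally less.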

Page 7: Hadoop on Virtual Machines

Hadoop Using Local Disks

[Diagram: a Virtualization Host running a Hadoop Virtual Machine (Datanode and Task Tracker) next to an Other Workload VM. The Hadoop VM uses three VMDKs, each formatted as Ext4; the OS Image VMDK sits on Shared Storage.]

Page 8: Hadoop on Virtual Machines

Hadoop Perf in a VM (Ratio is elapsed time to physical, Lower Is Better)

[Chart: elapsed-time ratio to native (y-axis, 0 to 1.2) for 1 VM and 2 VMs, across Pi, TestDFSIO-write, TestDFSIO-read, TeraGen 1 TB, TeraSort 1 TB, TeraValidate 1 TB, TeraGen 3.5 TB, TeraSort 3.5 TB, and TeraValidate 3.5 TB.]

Page 9: Hadoop on Virtual Machines

Evolution of Hadoop on VMs

Hadoop in VM (Current Hadoop: Combined Storage/Compute)
- VM lifecycle determined by Datanode
- NOT Elastic
- Limited to Hadoop Multi-Tenancy

Separate Storage
- Separate compute from data
- Elastic compute
- Enable shared workloads
- Raise utilization

Separate Compute Clusters
- Separate virtual clusters per tenant
- Stronger VM-grade security and resource isolation
- Enable deployment of multiple Hadoop runtime versions

Page 10: Hadoop on Virtual Machines

1. Hadoop Task Tracker and Data Node in a VM

[Diagram: a Virtualization Host with a Virtual Hadoop Node (VMDK, Datanode, Task Tracker with slots) and an Other Workload VM.]

• Add/Remove Slots?
• Grow/Shrink by tens of GB?

Grow/Shrink of a VM is one approach

Page 11: Hadoop on Virtual Machines

2. Add/remove Virtual Nodes

[Diagram: a Virtualization Host with two Virtual Hadoop Nodes (each with a VMDK, a Datanode, and a Task Tracker with two slots) and an Other Workload VM.]

Just add/remove more virtual nodes?

Page 12: Hadoop on Virtual Machines

But State makes it hard to power-off a node

[Diagram: a Virtualization Host with one Virtual Hadoop Node (VMDK, Datanode, Task Tracker with two slots) and an Other Workload VM.]

Powering off the Hadoop VM would in effect fail the datanode

Page 13: Hadoop on Virtual Machines

Adding a node needs data…

[Diagram: a Virtualization Host with a Virtual Hadoop Node (VMDK, Datanode, Task Tracker with two slots), an Other Workload VM, and a second Virtual Hadoop Node of the same shape being added.]

Adding a node would require TBs of data replication
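To get a feel for why this matters, here is a rough back-of-the-envelope sketch of how long it takes to copy a given amount of data to a new node. The data size and network rate in the example are hypothetical inputs, not figures from the talk, and the estimate ignores disk throughput, protocol overhead, and competing traffic.

```python
def replication_hours(data_tb, bandwidth_gbit_per_s):
    """Hours needed to copy data_tb terabytes at a sustained network rate.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Gbit = 1e9 bits) and that
    the network link is the only bottleneck.
    """
    if data_tb < 0 or bandwidth_gbit_per_s <= 0:
        raise ValueError("data size must be >= 0 and bandwidth must be > 0")
    bits_to_move = data_tb * 1e12 * 8
    seconds = bits_to_move / (bandwidth_gbit_per_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 10 TB over a 10 Gbit/s link.
    print(f"{replication_hours(10, 10):.2f} hours")
```

Even under these optimistic assumptions the copy is measured in hours, which is why adding or removing nodes that hold data is not a quick elasticity mechanism.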

Page 14: Hadoop on Virtual Machines

2. Separated Compute and Data

[Diagram: a Virtualization Host with one Virtual Hadoop Node running the Datanode, several Virtual Hadoop Nodes each running a Task Tracker with two slots, and an Other Workload VM.]

Truly Elastic Hadoop: Scalable through virtual nodes

Page 15: Hadoop on Virtual Machines

Dataflow with separated Compute/Data

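[Diagram: on a Virtualization Host, a Virtual Hadoop Node running the Datanode (with a VMDK) and a Virtual Hadoop Node running a NodeManager with two slots. Each has a Virtual NIC attached to a Virtual Switch, which sits above the NIC Drivers.]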

Page 16: Hadoop on Virtual Machines

Performance Analysis of Split

Configurations compared:
• Split: 1 Datanode VM, 1 Compute node VM per Host
• Combined: 1 Combined Compute/Datanode VM per Host

Workload: Teragen, Terasort, Teravalidate
HW Configuration: 8 cores, 96GB RAM, 16 disks per host x 2 nodes

Page 17: Hadoop on Virtual Machines

Performance Analysis of Split (Elapsed time: ratio to combined)

[Chart: elapsed-time ratio (y-axis, 0 to 1.2) for the Combined and Split configurations on Teragen, Terasort, and Teravalidate.]

Page 18: Hadoop on Virtual Machines

Tying it together: Elastic Hadoop

[Diagram: six Hosts carry a Data Layer, a Distributed File System (HDFS, KFS, GPFS, MAPR, Isilon,…) with three Namespaces labeled Public, Public, and Secret. Above it, a Runtime Layer holds several Virtual Hadoop clusters and a Queue, with tenants labeled Coke and Pepsi.]

Page 19: Hadoop on Virtual Machines

Demo: Shrink/Expand Cluster

Page 20: Hadoop on Virtual Machines

Demo: Shrink/Expand Cluster

[Diagram: physical hosts, each running a Datanode, NodeManagers, and Web Servers.]

Setup: 1 Datanode, 2 NodeManagers, and 2 web servers on each physical host

Page 21: Hadoop on Virtual Machines

Demo: Shrink/Expand Cluster

[Diagram: the same hosts, each still running a Datanode, with some NodeManagers suspended and more Web Servers powered on.]

When web load is high in daytime, we can suspend some NodeManagers and power on more web servers.
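The shrink/expand idea in the demo can be expressed as a small policy function. This is only a sketch: the load metric, the threshold, and the VM counts chosen for the high-load case are assumptions made for illustration, and the code does not call any vSphere or Hadoop API. The one thing taken from the slides is that the Datanode stays up while NodeManagers and web servers trade places.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostPlan:
    """Number of VMs of each kind that should be running on one physical host."""
    datanodes: int
    nodemanagers: int
    web_servers: int


# Baseline from the demo setup: 1 Datanode, 2 NodeManagers, 2 web servers.
BASELINE = HostPlan(datanodes=1, nodemanagers=2, web_servers=2)


def plan_host(web_load, high_load_threshold=0.7):
    """Pick a per-host VM mix from the current web load.

    web_load is a hypothetical metric in [0, 1] (fraction of peak traffic).
    Above the threshold, one NodeManager is suspended and one extra web
    server is powered on. The Datanode is never touched, so HDFS data
    stays available while compute capacity shrinks.
    """
    if not 0.0 <= web_load <= 1.0:
        raise ValueError("web_load must be between 0 and 1")
    if web_load >= high_load_threshold:
        return HostPlan(
            datanodes=BASELINE.datanodes,
            nodemanagers=BASELINE.nodemanagers - 1,
            web_servers=BASELINE.web_servers + 1,
        )
    return BASELINE


if __name__ == "__main__":
    print(plan_host(0.9))  # daytime, high web load
    print(plan_host(0.2))  # night, low web load
```

Keeping the total number of elastic VMs per host constant is a design choice of this sketch; a real policy could just as well size each pool independently.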

Page 22: Hadoop on Virtual Machines

Demo

Page 23: Hadoop on Virtual Machines

Part 2

Page 24: Hadoop on Virtual Machines

Expand Hadoop Ecosystem

• Hortonworks goal
  – Expand Hadoop ecosystem
  – Provide first class support of various platforms
• Hadoop should run well on VMs
• VMs offer several advantages as presented earlier
• Take advantage of vSphere for HA


Page 25: Hadoop on Virtual Machines

VMware-Hortonworks Joint Engineering

• First class support for VMs
  – Topology plugins (Hadoop-8468)
    • 2 VMs can be on same host
      – Pick closer data
      – Schedule tasks closer
      – Don’t put two replicas on same host
  – MR-tmp on HDFS using block pools
    • Elastic Compute-VMs will not need local disk
  – Fast communications within VMs
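The "don't put two replicas on same host" rule can be illustrated with a short sketch. This is not the Hadoop-8468 implementation and not HDFS's block placement policy; it is a hypothetical greedy selection that only shows the constraint once the physical host of each datanode VM is known. The function name and the VM-to-host mapping are made up for the example.

```python
def choose_replica_targets(vm_to_host, candidates, replicas=3):
    """Pick datanode VMs so that no two replicas share a physical host.

    vm_to_host: dict mapping each datanode VM name to its physical host.
    candidates: datanode VM names in preference order (e.g. closest first).
    Returns up to `replicas` VM names; fewer if there are not enough
    distinct physical hosts among the candidates.
    """
    chosen = []
    used_hosts = set()
    for vm in candidates:
        host = vm_to_host[vm]
        if host in used_hosts:
            continue  # another replica already lives on this physical host
        chosen.append(vm)
        used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen


if __name__ == "__main__":
    # Hypothetical layout: vm1 and vm2 are two VMs on the same physical host.
    layout = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}
    print(choose_replica_targets(layout, ["vm1", "vm2", "vm3", "vm4"]))
```

Without the host mapping, vm1 and vm2 would look like independent nodes, and a single physical host failure could take out two replicas at once.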


Page 26: Hadoop on Virtual Machines

Hadoop Total System Availability Architecture

[Diagram: an HA Cluster for Master Daemons (three Servers hosting the NN and JT, with Failover and N+K failover) above the Slave Nodes of the Hadoop Cluster, which run the jobs. Labels on the failover path: "Apps Running Outside", "Apps pause/retry", and "Pause/retry JT into Safemode".]

Page 27: Hadoop on Virtual Machines

© Hortonworks Inc. 2011

HA is coming in 1.0 Using Total System Availability Architecture


Page 28: Hadoop on Virtual Machines


HA in Hadoop 1 with HDP1

• Total System Availability Architecture
  – Namenode
    • Clients pause automatically
    • JobTracker pauses automatically
  – Other Hadoop master services (JT, …) coming
• Use industry proven HA framework
  – VMware vSphere-HA
    • Failover, fencing, …
    • Corner cases are tricky – if not addressed, corruption
  – Additional benefits:
    • N-N & N+K failover
    • Migration for maintenance

Page 29: Hadoop on Virtual Machines

Hadoop NN/JT HA with vSphere


Page 30: Hadoop on Virtual Machines

NameNode HA – Failover Times

• NameNode Failover times with vSphere and LinuxHA
  – Failure detection + Failover – 0.5 to 2 minutes
  – OS bootup needed for vSphere – 10-20 seconds
  – Namenode Startup (exit safemode)
    • Small/Medium clusters – 1 to 2 minutes
    • Large cluster – 5 to 15 minutes
• NameNode startup time measurements
  – 60 Nodes, 60K files, 6 million blocks, 300 TB raw storage – 40 sec
  – 180 Nodes, 200K files, 18 million blocks, 900TB raw storage – 120 sec

Cold Failover is good enough for small/medium clusters
Failure Detection and Automatic Failover Dominates
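The phase durations above can be added up to estimate an end-to-end failover window. The sketch assumes the phases run strictly one after another and simply sums the low and high ends of each range; the slide does not state this explicitly, so treat the totals as a rough bound. The names are illustrative.

```python
# Phase durations in seconds as (low, high), taken from the slide.
DETECTION_AND_FAILOVER = (30, 120)  # 0.5 to 2 minutes
OS_BOOTUP = (10, 20)                # needed for vSphere
NAMENODE_STARTUP = {
    "small/medium": (60, 120),      # 1 to 2 minutes
    "large": (300, 900),            # 5 to 15 minutes
}


def failover_window(cluster_size, include_os_bootup=True):
    """Return (low, high) total failover time in seconds.

    Assumption: detection/failover, OS bootup, and NameNode startup happen
    back to back, so their durations add.
    """
    phases = [DETECTION_AND_FAILOVER, NAMENODE_STARTUP[cluster_size]]
    if include_os_bootup:
        phases.append(OS_BOOTUP)
    low = sum(phase[0] for phase in phases)
    high = sum(phase[1] for phase in phases)
    return low, high


if __name__ == "__main__":
    for size in NAMENODE_STARTUP:
        low, high = failover_window(size)
        print(f"{size}: {low / 60:.1f} to {high / 60:.1f} minutes")
```

Under these assumptions a small/medium cluster is back within a few minutes, with detection and failover a large share of the total, while on a large cluster NameNode startup becomes the biggest term.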


Page 31: Hadoop on Virtual Machines

Demo

Page 32: Hadoop on Virtual Machines

Summary

• Advantages of Hadoop on VMs
  – Cluster Management
  – Cluster consolidation
  – Greater Elasticity in mixed environment
  – Alternate multi-tenancy to capacity scheduler’s offerings
• HA for Hadoop Master Daemons
  – vSphere based HA for NN, JT, … in Hadoop 1
  – Total System Availability Architecture
