VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Big Data Extensions: Advanced Features and Customer Case Study

Jayanth Gummaraju, VMware

Sasha Kipervarg, Identified, Inc.

VAPP5484

#VAPP5484

2

Data Is Exploding & Hadoop Is Driving Growth

Unstructured data driving growth

[Chart: structured vs. unstructured data volume by year, 2011-2020]

Complex unstructured data forecasted to outpace structured relational data by 10x by 2020

Hadoop adoption is ramping

• Evaluating: 53%
• In production: 23%
• Piloting: 18%
• Testing: 2%
• Don't know: 2%
• Other: 2%

Source: Forrester Survey of 60 CIOs, September 2011

• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy
• Hadoop’s ability to process raw data at cost presents intriguing value proposition

3

Agenda

Big Data Extensions Overview

Virtualized Hadoop at Identified Inc.

Advanced Features

4

Questions for Audience

Familiarity with Hadoop

1. New to Hadoop

2. Reasonably familiar

3. Expert

Hadoop cluster sizes

1. < 10 nodes

2. 10-50 nodes

3. > 50 nodes

Virtualizing Hadoop

1. Never virtualized

2. Actively exploring virtualization

3. Running virtualized Hadoop in test-dev/production

5

Big Data on vSphere: Value Proposition

Basic Features

• Fast provisioning

• Minutes/hours instead of days

• Workload Consolidation

• Multiple virtual clusters co-exist on same physical hardware

• High Availability

• Not limited to NameNode, JobTracker

Advanced Features

• Auto-elasticity

• High Resource Utilization

• True multi-tenancy

• VM-grade security, performance, and configuration isolation

6

vSphere Big Data Extensions: Program Highlights

Serengeti
• Open source project
• Tool to simplify virtualized Hadoop deployment & operations

Hadoop Virtualization Extensions
• Virtualization changes for core Hadoop
• Contributed back to Apache Hadoop

vSphere Resource Management
• Advanced resource management on vSphere
• Big Data applications-specific extension to DRS

7

What is Hadoop?

Distributed processing of large data sets across clusters of computers

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

[Diagram: Input Data → Split → map() → [k1, v1] → Sort by k1 → Merge → [k1, [v1, v2, v3,…]] → reduce() → Output Data. Several map() and reduce() instances run in parallel.]
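The map, sort, merge, and reduce stages named on this slide can be illustrated with a minimal single-process sketch in Python. This is not Hadoop's API: the function names (map_fn, shuffle, reduce_fn, run_job) and the word-count job are assumptions chosen only to show how [k1, v1] pairs become [k1, [v1, v2, v3,…]] groups.

```python
from collections import defaultdict
from itertools import chain


def map_fn(split):
    """map(): emit one [k1, v1] pair per word in an input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Sort by k1 and merge the values into [k1, [v1, v2, v3, ...]]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_fn(key, values):
    """reduce(): collapse the merged values for one key into a result."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, shuffle, then reduce each key group."""
    mapped = chain.from_iterable(map_fn(s) for s in splits)
    return dict(reduce_fn(k, vs) for k, vs in shuffle(mapped))


# Two hypothetical input splits standing in for blocks of a large file.
counts = run_job(["big data on vSphere", "big data extensions"])
```

In a real cluster each map_fn call would run on a different node and the shuffle would move data over the network; here everything runs in one process so the data flow is easy to follow.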

8

Tasks Are Scheduled Where Data Resides

[Diagram: a Job submitted to the JobTracker reads an input file divided into Splits 1-3 (64MB each). Slave Nodes 1-3 each run a DataNode holding one 64MB block (Blocks 1-3) and a TaskTracker running the matching task (Tasks 1-3). The NameNode is also shown.]
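A rough sketch of the locality idea on this slide: prefer a node that already stores the split's block, and fall back to any node with a free slot. The function name, the data structures, and the fallback rule are assumptions for illustration; the real JobTracker assigns tasks in response to TaskTracker heartbeats and is considerably more involved.

```python
def schedule(block_locations, free_slots):
    """Assign each split to a node, preferring nodes that hold its block.

    block_locations: split name -> list of nodes storing that block
    free_slots:      node name  -> number of free task slots
    Returns split name -> chosen node (None if no slot is free anywhere).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for split, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node = local[0]  # data-local: the task reads its block from local disk
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                assignment[split] = None  # nothing free; the task has to wait
                continue
            # Remote fallback (assumed policy): pick the least loaded node.
            node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[split] = node
    return assignment


# Mirrors the slide: three splits, one block per slave node.
placement = schedule(
    {"split-1": ["slave-1"], "split-2": ["slave-2"], "split-3": ["slave-3"]},
    {"slave-1": 1, "slave-2": 1, "slave-3": 1},
)
```

With one free slot per node, every task lands on the node that stores its block, which is the situation the slide depicts.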

9

Myth: Virtual Performance Is Sub-optimal

[http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013]

(lower is better)

32 hosts/3.6GHz 8 cores/15K RPM 146GB SAS disks/10GbE/72-96GB RAM

10

Agenda

Advanced Features

12

Compute-Data Separation

Combined Storage/Compute (Hadoop in VM)
• VM lifecycle determined by Datanode
• Limited elasticity
• Limited to Hadoop Multi-Tenancy

Separate Storage (compute VM and storage VM)
• Separate compute from data
• Elastic compute
• Enable shared workloads
• Raise utilization

Separate Compute Tenants (T1, T2)
• Separate virtual clusters per tenant
• Stronger VM-grade security and resource isolation
• Enable deployment of multiple Hadoop runtime versions

[Diagram: one slave node shown in each of the three configurations above]

13

Dataflow with Separated Compute/Data

[Diagram: on one ESX host, a virtual Hadoop node running the DataNode (backed by a VMDK) and a virtual Hadoop node running the TaskTracker (with task slots) are connected through their virtual NICs and the virtual switch, above the NIC drivers.]

14

Elastic Scalability & Multi-Tenancy

Deploy separate compute clusters for different tenants sharing HDFS.

According to priority and available resources, power-on/off compute VMs

[Diagram: on VMware vSphere + Big Data Extensions, a shared data layer sits beneath a compute layer with two clusters, Experimentation and Production (recommendation engine). Each cluster has its own JobTracker and a set of compute VMs drawn from a dynamic resource pool.]

15

Auto-elastic Hadoop in Action

[Diagram: three ESX hosts with DRS, managed by a VirtualCenter Management Server. Components shown: Hadoop HDFS VMs (one data VM per host, on local disks), Hadoop compute VMs (TT), non-Hadoop VMs, SAN/NAS, JT, NN, and the VHM.]

JT: JobTracker
TT: TaskTracker
NN: NameNode
VHM: Virtual Hadoop Manager

16

Advanced Resource Management using Virtual Hadoop Manager

[Diagram: the Virtual Hadoop Manager (VHM) runs the auto-scaling algorithms over three inputs: cluster configuration from Serengeti, VC state and stats (stats and VM configuration) from vCenter Server and the vCenter DB, and Hadoop state and stats (slots used, pending work) from the JobTracker. It issues VC actions (manual/auto power on/off) to vCenter Server and Hadoop actions (decommission, recommission) to the JobTracker, which manages the TaskTrackers.]

17

Auto-Scaling Algorithms: 5 Key Insights

① Expand or Shrink clusters based on ambient data

• Expand when there is work and no imminent contention

• Shrink when there is contention

• Predictable scaling for matching customer expectation, ease of testing, etc.

② Use contention detection as an input to scaling response

• Contention reflects user's resource control settings and workload demands

③ Act as an extension to DRS for distributed applications spanning multiple VMs

• A glue between DRS and Application-scheduler

• Penalize few VMs heavily rather than all VMs lightly/uniformly

④ React only if there is true contention and in a timely manner

• Actively used resources are deprived

• Do not react to transients

⑤ Use Hysteresis and Control Theory concepts to guide decisions

• E.g., transient windows and thresholds, feedback from previous actions, etc.
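Insights ④ and ⑤ can be illustrated with a small hysteresis trigger: it changes state only after a metric stays beyond a threshold for a whole window of consecutive samples, and it uses separate thresholds for turning on and off. This is a sketch of the general technique, not VHM's implementation; the class name, thresholds, and window length are assumptions.

```python
class HysteresisTrigger:
    """Flip state only after `window` consecutive samples cross a threshold."""

    def __init__(self, high, low, window):
        self.high = high        # switch on at or above this value
        self.low = low          # switch off at or below this value
        self.window = window    # consecutive samples required (transient window)
        self.active = False
        self._streak = 0

    def update(self, value):
        """Feed one sample and return the current state."""
        # Only samples that argue for leaving the current state count.
        if self.active:
            flipping = value <= self.low
        else:
            flipping = value >= self.high
        self._streak = self._streak + 1 if flipping else 0
        if self._streak >= self.window:
            self.active = not self.active
            self._streak = 0
        return self.active


# Hypothetical contention samples: the single spike (12) is ignored,
# while the sustained run (11, 12, 13) switches the trigger on.
trigger = HysteresisTrigger(high=10.0, low=4.0, window=3)
samples = [2, 12, 3, 11, 12, 13, 8, 5, 3, 2, 1]
states = [trigger.update(s) for s in samples]
```

The gap between the high and low thresholds keeps the trigger from oscillating when the metric hovers near a single cut-off.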

18

Shrinking-related Metrics

CPU is being deprived

• VC metric: CPU Ready

• Time that vCPU is ready to run, but cannot be scheduled on a pCPU

Memory is being deprived

• VC metrics:

• Usage: Active Memory, Granted Memory

• Reclamation: Memory Ballooning, Host Swap

• Typically starts with ballooning then leads to host swapping

TaskTracker is dead or faulty

• Hadoop metrics: Alive Nodes and Task Failures
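One way the shrinking signals listed above could be combined is sketched below. The metric categories come from the slide, but the field names, the threshold values, and the rule for choosing which VM to shrink are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    name: str
    cpu_ready_pct: float          # share of the interval a vCPU waited for a pCPU
    ballooned_mb: float           # memory reclaimed by ballooning
    swapped_mb: float             # memory reclaimed by host swap
    tasktracker_alive: bool = True
    task_failures: int = 0


def shrink_reasons(s, cpu_ready_limit=10.0, balloon_limit_mb=256.0, failure_limit=5):
    """Return the reasons (possibly none) why this VM is a shrink candidate."""
    reasons = []
    if s.cpu_ready_pct > cpu_ready_limit:
        reasons.append("cpu deprived")
    # Reclamation typically starts with ballooning, then host swapping,
    # so swapping is reported as the more severe memory signal.
    if s.swapped_mb > 0:
        reasons.append("host swapping")
    elif s.ballooned_mb > balloon_limit_mb:
        reasons.append("ballooning")
    if not s.tasktracker_alive or s.task_failures > failure_limit:
        reasons.append("tasktracker faulty")
    return reasons


def pick_victims(samples, max_victims=1):
    """Penalize a few VMs heavily: choose the VMs with the most reasons."""
    scored = [(len(shrink_reasons(s)), s.name) for s in samples]
    flagged = sorted((t for t in scored if t[0] > 0), key=lambda t: (-t[0], t[0] and t[1]))
    return [name for _, name in flagged[:max_victims]]


vms = [
    VmSample("tt-a", cpu_ready_pct=2.0, ballooned_mb=0.0, swapped_mb=0.0),
    VmSample("tt-b", cpu_ready_pct=15.0, ballooned_mb=512.0, swapped_mb=0.0),
    VmSample("tt-c", cpu_ready_pct=1.0, ballooned_mb=0.0, swapped_mb=0.0,
             tasktracker_alive=False),
]
victims = pick_victims(vms)
```

Here tt-b shows both CPU and memory pressure, so it is selected ahead of tt-c, which has a single fault signal.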

19

Expansion-related Metrics

Jobs are present

• Hadoop metrics: jobs_preparing, jobs_running

High slot usage

• Hadoop metrics: map_slots_used, max_map_slots, reduce_slots_used, max_reduce_slots

High task throughput

• Hadoop metrics: maps_completed, reduces_completed

No imminent contention

• VC metrics: CPU Ready, Memory Ballooning
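The four expansion conditions above can be read as a single conjunction, sketched here. The Hadoop metric names are those listed on the slide; the function name, the dictionary layout, the vc keys (cpu_ready_pct, ballooned_mb), and the thresholds are assumptions, and the way the conditions are combined is only one plausible reading.

```python
def should_expand(prev, cur, vc, slot_usage_limit=0.9, cpu_ready_limit=5.0):
    """Decide whether to power on more compute VMs.

    prev, cur: Hadoop metric samples from two consecutive polls
    vc:        vCenter-side contention metrics for the cluster
    """
    # Jobs are present.
    jobs_present = cur["jobs_preparing"] + cur["jobs_running"] > 0

    # High slot usage across map and reduce slots.
    total = cur["max_map_slots"] + cur["max_reduce_slots"]
    used = cur["map_slots_used"] + cur["reduce_slots_used"]
    slots_busy = total > 0 and used / total >= slot_usage_limit

    # High task throughput: tasks completed since the previous poll.
    done_now = cur["maps_completed"] + cur["reduces_completed"]
    done_before = prev["maps_completed"] + prev["reduces_completed"]
    progressing = done_now > done_before

    # No imminent contention on the hosts.
    calm = vc["cpu_ready_pct"] < cpu_ready_limit and vc["ballooned_mb"] == 0

    return jobs_present and slots_busy and progressing and calm


prev = {"maps_completed": 100, "reduces_completed": 10}
cur = {
    "jobs_preparing": 1, "jobs_running": 2,
    "map_slots_used": 19, "max_map_slots": 20,
    "reduce_slots_used": 9, "max_reduce_slots": 10,
    "maps_completed": 140, "reduces_completed": 12,
}
expand = should_expand(prev, cur, {"cpu_ready_pct": 1.0, "ballooned_mb": 0.0})
```

The same CPU Ready and ballooning metrics that drive shrinking act here as a veto, so the cluster grows only when the extra VMs would not immediately run into contention.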

20

Auto-elasticity Demo

21

What’s Next?

Resource management enhancements

• Algorithmic optimizations

• Contention metrics related to Disk/Network IO

Auto-elasticity support for YARN and HBase

• YARN – Hadoop 2.x

• HBase – Hadoop database

Serengeti enhancements

• Support for additional Hadoop distros

Hadoop extensions

• Dynamic resource configuration

22

Main Takeaways

Value proposition

• Fast provisioning

• Workload consolidation

• Elasticity → better resource utilization

• Multi-tenancy using VMs → differentiated service

Key technologies

• Serengeti

• Advanced Resource Management

• Hadoop Virtualization Extensions


Make vSphere the platform of choice for running Big Data

23

Questions?

Contact information

• Jayanth Gummaraju [email protected]

• Sasha Kipervarg [email protected]

Other related sessions

• Breakout session (VAPP5402, VAPP5762)

• Big Data Panel (VAPP5626)

• Hands-on lab (HOL-SDC-1309)

For more information (including download information)

• vSphere Big Data Extensions http://www.vmware.com/hadoop

• Project Serengeti http://www.projectserengeti.org


THANK YOU
