VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Big Data Extensions: Advanced Features and Customer Case Study

Jayanth Gummaraju, VMware

Sasha Kipervarg, Identified, Inc.

VAPP5484

#VAPP5484

Data Is Exploding & Hadoop Is Driving Growth

Unstructured data driving growth

[Chart: structured vs. unstructured data, 2011 to 2020]

Complex unstructured data forecasted to outpace structured relational data by 10x by 2020

Hadoop adoption is ramping

[Chart: survey responses on Hadoop adoption status]

• Evaluating: 53%

• In-production

• Piloting: 18%

• Testing: 2%

• Don't know: 2%

• Other: 2%

Source: Forrester Survey of 60 CIOs, September 2011

• Unstructured data explosion and Hadoop capabilities are causing CIOs to reconsider enterprise data strategy

• Hadoop's ability to process raw data at cost presents an intriguing value proposition

Agenda

Big Data Extensions Overview

Virtualized Hadoop at Identified Inc.

Advanced Features

Questions for Audience

Familiarity with Hadoop

1. New to Hadoop

2. Reasonably familiar

3. Expert

Hadoop cluster sizes

1. < 10 nodes

2. 10-50 nodes

3. > 50 nodes

Virtualizing Hadoop

1. Never virtualized

2. Actively exploring virtualization

3. Running virtualized Hadoop in test-dev/production

Big Data on vSphere: Value Proposition

Basic Features

• Fast provisioning

• Minutes/hours instead of days

• Workload Consolidation

• Multiple virtual clusters co-exist on same physical hardware

• High Availability

• Not limited to NameNode, JobTracker

Advanced Features

• Auto-elasticity

• High Resource Utilization

• True multi-tenancy

• VM-grade security, performance, and configuration isolation

Serengeti

vSphere Resource Management

Hadoop Virtualization Extensions

vSphere Big Data Extensions: Program Highlights

Serengeti

• Open source project

• Tool to simplify virtualized Hadoop deployment & operations

Hadoop Virtualization Extensions

• Virtualization changes for core Hadoop

• Contributed back to Apache Hadoop

vSphere Resource Management

• Advanced resource management on vSphere

• Big Data application-specific extension to DRS

What is Hadoop?

Distributed processing of large data sets across clusters of computers

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

[Diagram: MapReduce data flow. An input file is distributed across Slave Node 1, Slave Node 2, and Slave Node 3; [k1, v1] pairs are grouped into [k1, [v1, v2, v3,…]] and passed to reduce().]
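The map, group, and reduce steps named in the diagram can be illustrated with a single-process word count. This is only a sketch of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample splits are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group values by key: [k1, v1] pairs become [k1, [v1, v2, ...]]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, group by key, then reduce each group."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical input: three splits standing in for three slave nodes.
counts = run_job(["big data on vSphere", "big data extensions", "data"])
```

In a real cluster each split would be mapped on a different node and the grouping step would move data over the network; here everything runs in one process to keep the model visible.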

Tasks Are Scheduled Where Data Resides

[Diagram: the JobTracker divides a job's input file into Split 1, Split 2, and Split 3 (64MB each) and schedules Task 1, Task 2, and Task 3 on the TaskTrackers whose co-located DataNodes hold Block 1, Block 2, and Block 3 (64MB each). The NameNode is shown alongside the DataNodes.]
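A minimal sketch of locality-aware placement, assuming a simplified model in which each split has a known list of nodes holding its block and each node has a fixed number of free task slots. The function `schedule` and its data shapes are hypothetical and do not reflect the JobTracker's actual scheduler.

```python
def schedule(block_locations, free_slots):
    """Place each split on a node that holds its block when possible.

    block_locations: split name -> list of nodes holding that block
    free_slots: node name -> number of free task slots
    Returns split name -> (chosen node, True if the placement is data-local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for split, nodes in block_locations.items():
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            chosen = local[0]
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                break  # no free slots anywhere; remaining splits must wait
            # Fall back to the node with the most free slots (remote read).
            chosen = max(candidates, key=lambda n: slots[n])
        slots[chosen] -= 1
        assignment[split] = (chosen, chosen in nodes)
    return assignment


# Hypothetical layout matching the slide: one block per node.
blocks = {"split-1": ["node-1"], "split-2": ["node-2"], "split-3": ["node-3"]}
plan = schedule(blocks, {"node-1": 1, "node-2": 1, "node-3": 1})
```

When the node holding the data has no free slot, the sketch falls back to a remote node and marks the placement as non-local.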

Myth: Virtual Performance Is Sub-optimal

[http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013]

[Chart: performance comparison; lower is better]

Test configuration: 32 hosts, 3.6GHz 8 cores, 15K RPM 146GB SAS disks, 10GbE, 72-96GB RAM

Agenda

Advanced Features

Compute-Data Separation

Combined Storage/Compute (Hadoop in VM)

• VM lifecycle determined by DataNode

• Limited elasticity

• Limited to Hadoop multi-tenancy

Separate Storage

• Separate compute from data

• Elastic compute

• Enable shared workloads

• Raise utilization

Separate Compute Tenants

• Separate virtual clusters per tenant

• Stronger VM-grade security and resource isolation

• Enable deployment of multiple Hadoop runtime versions

Dataflow with Separated Compute/Data

[Diagram: on a single ESX host, a virtual Hadoop node running the DataNode and a virtual Hadoop node running the TaskTracker exchange data through their virtual NICs and the virtual switch, which sits above the NIC drivers.]

Elastic Scalability & Multi-Tenancy

Deploy separate compute clusters for different tenants sharing HDFS.

Power compute VMs on or off according to priority and available resources.

[Diagram: VMware vSphere + Big Data Extensions hosts a shared data layer beneath a compute layer. The Experimentation and Production (recommendation engine) tenants each have their own compute VMs, drawn from a dynamic resource pool.]
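One way to picture priority-based power-on/off decisions is a greedy allocation: tenants are served in priority order until the shared capacity runs out. This is an illustrative sketch only; the function `plan_power_states`, the numeric priorities, and the notion of a single VM-count capacity are assumptions, not how Big Data Extensions is documented to behave.

```python
def plan_power_states(tenants, capacity):
    """Decide how many compute VMs each tenant keeps powered on.

    tenants: list of dicts with 'name', 'priority' (higher wins), and
             'wanted' (number of compute VMs the tenant could use)
    capacity: total compute VMs the shared hosts can run at once
    Returns tenant name -> number of compute VMs to keep powered on.
    """
    powered_on = {}
    remaining = capacity
    # Serve the highest-priority tenant first; lower priorities get what is left.
    for tenant in sorted(tenants, key=lambda t: t["priority"], reverse=True):
        granted = min(tenant["wanted"], remaining)
        powered_on[tenant["name"]] = granted
        remaining -= granted
    return powered_on


# Hypothetical tenants matching the slide's two clusters.
tenants = [
    {"name": "experimentation", "priority": 1, "wanted": 6},
    {"name": "production", "priority": 10, "wanted": 5},
]
plan = plan_power_states(tenants, capacity=8)
```

Because both tenants read from the same HDFS data layer, powering a compute VM off in this model removes only task capacity, not data.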

Auto-elastic Hadoop in Action

[Diagram: ESX hosts under DRS, managed by a VirtualCenter Management Server. Each host runs a Hadoop HDFS data VM on local disks; Hadoop compute VMs and non-Hadoop VMs share the hosts, with SAN/NAS storage also shown.]

Legend:

• JT: JobTracker

• TT: TaskTracker

• NN: NameNode

• VHM: Virtual Hadoop Manager

Advanced Resource Management using Virtual Hadoop Manager

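[Diagram: the Virtual Hadoop Manager (VHM) sits between vCenter Server, the Hadoop JobTracker, and Serengeti.]

• JobTracker to VHM: state and stats (slots used, pending work)

• VHM to JobTracker: commands (decommission, recommission)

• vCenter Server (vCenter DB) to VHM: stats and VM configuration

• VHM to vCenter Server: manual/auto power on/off

• Serengeti to VHM: cluster configuration

• Inside VHM: the auto-scaling algorithms take vCenter state and stats, Hadoop state and stats, and cluster configuration as inputs and produce vCenter actions and Hadoop actions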

Auto-Scaling Algorithms: 5 Key Insights

① Expand or Shrink clusters based on ambient data

• Expand when there is work and no imminent contention

• Shrink when there is contention

• Predictable scaling for matching customer expectation, ease of testing, etc.

② Use contention detection as an input to scaling response

• Contention reflects user's resource control settings and workload demands

③ Act as an extension to DRS for distributed applications spanning multiple VMs

• A glue between DRS and Application-scheduler

• Penalize a few VMs heavily rather than all VMs lightly/uniformly

④ React only if there is true contention and in a timely manner

• Actively used resources are deprived

• Do not react to transients

⑤ Use Hysteresis and Control Theory concepts to guide decisions

• E.g., transient windows and thresholds, feedback from previous actions, etc.
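The idea of ignoring transients and acting only on sustained signals can be sketched as a small controller that requires several consecutive samples before it reacts. The class `ElasticityController`, the window length, and the boolean inputs are hypothetical simplifications; the actual Virtual Hadoop Manager algorithms are not described at this level in the slides.

```python
class ElasticityController:
    """Scale decision that waits out a transient window before acting."""

    def __init__(self, window=3):
        self.window = window        # consecutive samples required before acting
        self.contended_streak = 0   # samples in a row showing contention
        self.work_streak = 0        # samples in a row with work and no contention

    def observe(self, contended, pending_work):
        """Feed one sample; return 'shrink', 'expand', or 'hold'."""
        if contended:
            self.contended_streak += 1
            self.work_streak = 0
        elif pending_work:
            self.work_streak += 1
            self.contended_streak = 0
        else:
            self.contended_streak = 0
            self.work_streak = 0

        if self.contended_streak >= self.window:
            self.contended_streak = 0  # reset so the next action needs fresh evidence
            return "shrink"
        if self.work_streak >= self.window:
            self.work_streak = 0
            return "expand"
        return "hold"


controller = ElasticityController(window=3)
# Hypothetical samples: (contended, pending_work). The first contention
# sample is a one-off transient; the last three are sustained.
samples = [(True, True), (False, True), (True, True), (True, True), (True, True)]
decisions = [controller.observe(c, w) for c, w in samples]
```

In this run the isolated first sample produces no action, and the controller only shrinks once contention has persisted for the full window.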

Shrinking-related Metrics

CPU is being deprived

• VC metric: CPU Ready

• Time that vCPU is ready to run, but cannot be scheduled on a pCPU

Memory is being deprived

• VC metrics:

• Usage: Active Memory, Granted Memory

• Reclamation: Memory Ballooning, Host Swap

• Typically starts with ballooning then leads to host swapping

TaskTracker is dead or faulty

• Hadoop metrics: Alive Nodes and Task Failures
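As a rough illustration, the shrink signals above could be combined into a per-VM check and a victim selection that penalizes only a few VMs. The field names (`cpu_ready_pct`, `ballooned_mb`, `host_swapped_mb`, `tasktracker_alive`), the 10% CPU Ready limit, and the rule that any ballooning counts as deprivation are all assumptions chosen for the example, not values from the talk.

```python
def is_deprived(vm, cpu_ready_pct_limit=10.0):
    """True when a compute VM shows CPU or memory deprivation,
    or when its TaskTracker is not alive."""
    return (
        vm["cpu_ready_pct"] > cpu_ready_pct_limit
        or vm["ballooned_mb"] > 0
        or vm["host_swapped_mb"] > 0
        or not vm["tasktracker_alive"]
    )


def pick_vms_to_decommission(vms, max_victims=1):
    """Choose at most max_victims deprived VMs: dead TaskTrackers first,
    then the VMs with the highest CPU Ready."""
    deprived = [vm for vm in vms if is_deprived(vm)]
    deprived.sort(key=lambda vm: (vm["tasktracker_alive"], -vm["cpu_ready_pct"]))
    return [vm["name"] for vm in deprived[:max_victims]]


# Hypothetical per-VM samples.
vms = [
    {"name": "tt-1", "cpu_ready_pct": 2.0, "ballooned_mb": 0,
     "host_swapped_mb": 0, "tasktracker_alive": True},
    {"name": "tt-2", "cpu_ready_pct": 25.0, "ballooned_mb": 512,
     "host_swapped_mb": 0, "tasktracker_alive": True},
    {"name": "tt-3", "cpu_ready_pct": 12.0, "ballooned_mb": 0,
     "host_swapped_mb": 0, "tasktracker_alive": True},
]
victims = pick_vms_to_decommission(vms)
```

Capping the number of victims keeps the response concentrated on a few VMs instead of throttling every VM slightly.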

Expansion-related Metrics

Jobs are present

• Hadoop metrics: jobs_preparing, jobs_running

High slot usage

• Hadoop metrics: map_slots_used, max_map_slots, reduce_slots_used, max_reduce_slots

High task throughput

• Hadoop metrics: maps_completed, reduces_completed

No imminent contention

• VC metrics: CPU Ready, Memory Ballooning
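The expansion conditions can be read as a conjunction: jobs are present, slots are heavily used, and there is no imminent contention. The sketch below uses the Hadoop metric names from the slide as dictionary keys; the function `should_expand`, the `cpu_ready_pct` and `ballooned_mb` keys, and both thresholds are assumptions for illustration. Task throughput is left out to keep the example short.

```python
def should_expand(hadoop, vc, slot_usage_limit=0.8, cpu_ready_pct_limit=5.0):
    """Return True when the cluster has work, slots are nearly full,
    and the hosts show no sign of imminent contention."""
    jobs_present = hadoop["jobs_preparing"] + hadoop["jobs_running"] > 0

    total_slots = hadoop["max_map_slots"] + hadoop["max_reduce_slots"]
    used_slots = hadoop["map_slots_used"] + hadoop["reduce_slots_used"]
    high_slot_usage = total_slots > 0 and used_slots / total_slots >= slot_usage_limit

    no_contention = (
        vc["cpu_ready_pct"] < cpu_ready_pct_limit and vc["ballooned_mb"] == 0
    )
    return jobs_present and high_slot_usage and no_contention


# Hypothetical metric snapshot: 25 of 30 slots in use, hosts not contended.
hadoop_stats = {
    "jobs_preparing": 1, "jobs_running": 2,
    "map_slots_used": 18, "max_map_slots": 20,
    "reduce_slots_used": 7, "max_reduce_slots": 10,
}
vc_stats = {"cpu_ready_pct": 1.5, "ballooned_mb": 0}
expand = should_expand(hadoop_stats, vc_stats)
```

With the same Hadoop metrics, any ballooning on the hosts makes the check return False, which mirrors the "no imminent contention" condition.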

Auto-elasticity Demo

What’s Next?

Resource management enhancements

• Algorithmic optimizations

• Contention metrics related to Disk/Network IO

Auto-elasticity support for YARN and HBase

• YARN – Hadoop 2.x

• HBase – Hadoop database

Serengeti enhancements

• Support for additional Hadoop distros

Hadoop extensions

• Dynamic resource configuration

Main Takeaways

Value proposition

• Fast provisioning

• Workload consolidation

• Elasticity for better resource utilization

• Multi-tenancy using VMs for differentiated service

Key technologies

• Serengeti

• Advanced Resource Management

• Hadoop Virtualization Extensions

Make vSphere the platform of choice for running Big Data

Questions?

Contact information

• Jayanth Gummaraju jgummaraju@vmware.com

• Sasha Kipervarg sasha@identified.com

Other related sessions

• Breakout session (VAPP5402, VAPP5762)

• Big Data Panel (VAPP5626)

• Hands-on lab (HOL-SDC-1309)

For more information (including download information)

• vSphere Big Data Extensions http://www.vmware.com/hadoop

• Project Serengeti http://www.projectserengeti.org

THANK YOU
