Big Data Extensions: Advanced Features and Customer Case Study
Jayanth Gummaraju, VMware
Sasha Kipervarg, Identified, Inc.
VAPP5484 #VAPP5484

VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Jan 22, 2015

Jayanth Gummaraju, VMware
Sasha Kipervarg, Identified, Inc.

Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Transcript
Page 1: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Big Data Extensions: Advanced Features and Customer Case Study

Jayanth Gummaraju, VMware

Sasha Kipervarg, Identified, Inc.

VAPP5484

#VAPP5484

Page 2: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Data Is Exploding & Hadoop Is Driving Growth

Unstructured data driving growth

[Chart: Structured vs. Unstructured data, 2011-2020]

Complex unstructured data forecasted to outpace structured relational data by 10x by 2020

Hadoop adoption is ramping

Evaluating: 53%
In production: 23%
Piloting: 18%
Testing: 2%
Don't know: 2%
Other: 2%

Source: Forrester Survey of 60 CIOs, September 2011

• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider enterprise data strategy

• Hadoop’s ability to process raw data at cost presents intriguing value proposition

Page 3: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Agenda

Big Data Extensions Overview

Virtualized Hadoop at Identified Inc.

Advanced Features

Page 4: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Questions for Audience

Familiarity with Hadoop

1. New to Hadoop

2. Reasonably familiar

3. Expert

Hadoop cluster sizes

1. < 10 nodes

2. 10-50 nodes

3. > 50 nodes

Virtualizing Hadoop

1. Never virtualized

2. Actively exploring virtualization

3. Running virtualized Hadoop in test-dev/production

Page 5: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Big Data on vSphere: Value Proposition

Basic Features

• Fast provisioning

• Minutes/hours instead of days

• Workload Consolidation

• Multiple virtual clusters co-exist on same physical hardware

• High Availability

• Not limited to NameNode, JobTracker

Advanced Features

• Auto-elasticity

• High Resource Utilization

• True multi-tenancy

• VM-grade security, performance, and configuration isolation

Page 6: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

vSphere Big Data Extensions: Program Highlights

Serengeti

• Open source project

• Tool to simplify virtualized Hadoop deployment & operations

Hadoop Virtualization Extensions

• Virtualization changes for core Hadoop

• Contributed back to Apache Hadoop

vSphere Resource Management

• Advanced resource management on vSphere

• Big Data applications-specific extension to DRS

Page 7: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

What is Hadoop?

Distributed processing of large data sets across clusters of computers

[Figure: MapReduce dataflow from Input Data to Output Data, with stages labeled Split, map(), [k1, v1], Sort by k1, Merge, [k1, [v1, v2, v3,…]], and reduce()]

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
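
The split, map, sort, merge, and reduce stages named on this slide can be illustrated with a small in-memory word count. This is only a sketch under stated assumptions: it runs in a single Python process rather than on a cluster, and the function names (`map_fn`, `shuffle`, `reduce_fn`, `run_job`) are hypothetical names chosen for illustration, not Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain


def map_fn(split):
    """map(): emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Sort by key and merge values: [(k, v), ...] -> [(k, [v1, v2, ...]), ...]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_fn(key, values):
    """reduce(): collapse the merged values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, then shuffle, then reduce each key."""
    mapped = chain.from_iterable(map_fn(split) for split in splits)
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped))
```

Calling `run_job(["big data on vSphere", "big clusters big data"])` counts each word across both splits. In a real cluster the map and reduce calls would run as separate tasks on different nodes.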

Page 8: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Tasks Are Scheduled Where Data Resides

[Figure: An input file is divided into Split 1, Split 2, and Split 3 (64MB each). The JobTracker's job runs Task 1, Task 2, and Task 3 on the TaskTrackers of Slave Node 1, Slave Node 2, and Slave Node 3, whose DataNodes hold Block 1, Block 2, and Block 3 (64MB each). The NameNode is also shown.]
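
The idea that tasks are placed on the node holding their data can be sketched as a simple assignment routine. This is a simplified illustration, not the JobTracker's actual scheduler: the function `schedule` and its two dictionary inputs are assumptions made for the example, and the fallback to a non-local node is one plausible policy among several.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: {block_id: [node, ...]}  (hypothetical block-to-node map)
    free_slots:      {node: free task slots}  (hypothetical slot report)
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dictionary is not mutated
    plan = {}
    for block, holders in sorted(block_locations.items()):
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No holder has a free slot: fall back to any node with capacity,
            # which means the block must be read over the network.
            candidates = [n for n, free in sorted(slots.items()) if free > 0]
            if not candidates:
                raise RuntimeError(f"no free slot for {block}")
            node, is_local = candidates[0], False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan
```

With three blocks on three nodes and one free slot per node, every task lands on the node that holds its block. When a holder has no free slot, the task is marked non-local.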

Page 9: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Myth: Virtual Performance Is Sub-optimal

[http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013]

(lower is better)

32 hosts/3.6GHz 8 cores/15K RPM 146GB SAS disks/10GbE/72-96GB RAM

Page 10: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Agenda

Big Data Extensions Overview

Virtualized Hadoop at Identified Inc.

Advanced Features

Page 11: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Agenda

Big Data Extensions Overview

Virtualized Hadoop at Identified Inc.

Advanced Features

Page 12: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Compute-Data Separation

[Figure: a slave node in three configurations: one VM combining storage and compute (Hadoop in VM); separate storage and compute VMs; one storage VM with separate compute VMs for tenants T1 and T2]

Combined Storage/Compute

• VM lifecycle determined by DataNode

• Limited elasticity

• Limited to Hadoop multi-tenancy

Separate Storage

• Separate compute from data

• Elastic compute

• Enable shared workloads

• Raise utilization

Separate Compute Tenants

• Separate virtual clusters per tenant

• Stronger VM-grade security and resource isolation

• Enable deployment of multiple Hadoop runtime versions

Page 13: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Dataflow with Separated Compute/Data

[Figure: An ESX host running two virtual Hadoop nodes. One hosts the DataNode, backed by a VMDK; the other hosts the TaskTracker with its task slots. Each node has a virtual NIC attached to the virtual switch, which sits above the NIC drivers.]

Page 14: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Elastic Scalability & Multi-Tenancy

Deploy separate compute clusters for different tenants sharing HDFS.

Power compute VMs on or off according to priority and available resources.

[Figure: On VMware vSphere + Big Data Extensions, a shared data layer serves a compute layer with two tenants, Experimentation and Production (recommendation engine). Each tenant has its own JobTracker and a set of compute VMs drawn from a dynamic resource pool.]

Page 15: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Auto-elastic Hadoop in Action

[Figure: Three ESX hosts run Hadoop compute VMs (JT and TT), Hadoop HDFS VMs (data VMs, including the NN), and non-Hadoop VMs. Storage is on local disks and SAN/NAS. The VirtualCenter Management Server, with DRS and the VHM, manages the hosts.]

JT: JobTracker

TT: TaskTracker

NN: NameNode

VHM: Virtual Hadoop Manager

Page 16: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Advanced Resource Management using Virtual Hadoop Manager

[Figure: Virtual Hadoop Manager (VHM) architecture]

• Inputs: cluster configuration from Serengeti; VC state and stats (stats and VM configuration) from vCenter Server and the vCenter DB; Hadoop state and stats (slots used, pending work) from the JobTracker

• Auto-scaling algorithms run inside the VHM

• Outputs: VC actions (manual/auto power on/off) through vCenter Server; Hadoop actions (decommission, recommission) sent to the JobTracker, which manages the TaskTrackers

Page 17: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Auto-Scaling Algorithms: 5 Key Insights

① Expand or Shrink clusters based on ambient data

• Expand when there is work and no imminent contention

• Shrink when there is contention

• Predictable scaling for matching customer expectation, ease of testing, etc.

② Use contention detection as an input to scaling response

• Contention reflects user's resource control settings and workload demands

③ Act as an extension to DRS for distributed applications spanning multiple VMs

• A glue between DRS and Application-scheduler

• Penalize few VMs heavily rather than all VMs lightly/uniformly

④ React only if there is true contention and in a timely manner

• Actively used resources are deprived

• Do not react to transients

⑤ Use Hysteresis and Control Theory concepts to guide decisions

• E.g., transient windows and thresholds, feedback from previous actions, etc.
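
Insights ④ and ⑤ (ignore transients, use windows and thresholds) can be sketched as a small hysteresis trigger. This is an illustrative sketch only: the class name `HysteresisTrigger`, the two-threshold design, and the window length are assumptions made for the example, not the VHM's actual algorithm or parameters.

```python
from collections import deque


class HysteresisTrigger:
    """Change state only when a metric stays past a threshold for a full window."""

    def __init__(self, high, low, window):
        if low > high:
            raise ValueError("low threshold must not exceed high threshold")
        self.high = high   # enter the contended state above this value
        self.low = low     # leave it only after dropping below this value
        self.samples = deque(maxlen=window)
        self.contended = False

    def update(self, value):
        """Record one sample and return the (possibly unchanged) state."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        if full and not self.contended and min(self.samples) > self.high:
            self.contended = True    # sustained contention, not a transient
        elif full and self.contended and max(self.samples) < self.low:
            self.contended = False   # sustained recovery
        return self.contended
```

Because the entry and exit thresholds differ and every sample in the window must agree, a single spike or a single quiet sample does not flip the state, which keeps the scaling response from oscillating.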

Page 18: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Shrinking-related Metrics

CPU is being deprived

• VC metric: CPU Ready

• Time that vCPU is ready to run, but cannot be scheduled on a pCPU

Memory is being deprived

• VC metrics:

• Usage: Active Memory, Granted Memory

• Reclamation: Memory Ballooning, Host Swap

• Typically starts with ballooning then leads to host swapping

TaskTracker is dead or faulty

• Hadoop metrics: Alive Nodes and Task Failures
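
As a worked example for the CPU Ready metric: vCenter reports CPU Ready as a summation in milliseconds over a sampling interval (20 seconds for real-time statistics), so it is usually converted to a percentage before being compared with a threshold. The helper names below are hypothetical, and the memory classification simply mirrors the slide's observation that ballooning typically precedes host swapping.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    interval_s=20 corresponds to vCenter's real-time sampling interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100 / (interval_s * 1000)


def memory_pressure(ballooned_mb, swapped_mb):
    """Classify reclamation state; host swap is treated as the more severe sign."""
    if swapped_mb > 0:
        return "swapping"
    if ballooned_mb > 0:
        return "ballooning"
    return "none"
```

For instance, a summation of 1000 ms in a 20-second sample means the vCPU waited for a physical CPU during 5% of the interval.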

Page 19: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Expansion-related Metrics

Jobs are present

• Hadoop metrics: jobs_preparing, jobs_running

High slot usage

• Hadoop metrics: map_slots_used, max_map_slots, reduce_slots_used, max_reduce_slots

High task throughput

• Hadoop metrics: maps_completed, reduces_completed

No imminent contention

• VC metrics: CPU Ready, Memory Ballooning
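
The expansion conditions listed on this slide can be combined into a single predicate. The sketch below is a hedged illustration: the field names reuse the Hadoop metric names from the slide, but the `ClusterSample` type, the `should_expand` function, and both threshold values are assumptions chosen for the example, and task throughput is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ClusterSample:
    jobs_preparing: int
    jobs_running: int
    map_slots_used: int
    max_map_slots: int
    reduce_slots_used: int
    max_reduce_slots: int
    cpu_ready_pct: float   # VC metric, already converted to a percentage
    ballooned_mb: float    # VC metric


def slot_usage(sample):
    """Fraction of all map and reduce slots currently in use."""
    total = sample.max_map_slots + sample.max_reduce_slots
    used = sample.map_slots_used + sample.reduce_slots_used
    return used / total if total else 0.0


def should_expand(sample, usage_threshold=0.8, ready_threshold_pct=5.0):
    """Expand only if there is work, slots are busy, and no contention looms.

    Both thresholds are illustrative placeholders, not values from the talk.
    """
    has_work = sample.jobs_preparing + sample.jobs_running > 0
    busy = slot_usage(sample) >= usage_threshold
    no_contention = (
        sample.cpu_ready_pct < ready_threshold_pct and sample.ballooned_mb == 0
    )
    return has_work and busy and no_contention
```

A cluster with running jobs and 26 of 30 slots in use would qualify for expansion under these placeholder thresholds, unless ballooning or a high CPU Ready percentage signals imminent contention.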

Page 20: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Auto-elasticity Demo

Page 21: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

What’s Next?

Resource management enhancements

• Algorithmic optimizations

• Contention metrics related to Disk/Network IO

Auto-elasticity support for YARN and HBase

• YARN – Hadoop 2.x

• HBase – Hadoop database

Serengeti enhancements

• Support for additional Hadoop distros

Hadoop extensions

• Dynamic resource configuration

Page 22: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Main Takeaways

Value proposition

• Fast provisioning

• Workload consolidation

• Elasticity for better resource utilization

• Multi-tenancy using VMs for differentiated service

Key technologies

• Serengeti

• Advanced Resource Management

• Hadoop Virtualization Extensions

Make vSphere the platform of choice for running Big Data

Page 23: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

Questions?

Contact information

• Jayanth Gummaraju

• Sasha Kipervarg

Other related sessions

• Breakout sessions (VAPP5402, VAPP5762)

• Big Data Panel (VAPP5626)

• Hands-on lab (HOL-SDC-1309)

For more information (including download information)

• vSphere Big Data Extensions http://www.vmware.com/hadoop

• Project Serengeti http://www.projectserengeti.org

Page 24: VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study

THANK YOU
