© 2009 VMware Inc. All rights reserved
Is Your Cloud Ready for Big Data?
Richard McDougall
CTO, Storage and Application Services
Not Just for the Web Giants – The Intelligent Enterprise
Real-time analysis allows instant understanding of market dynamics.
Retailers can have an intimate understanding of their customers' needs and use direct, targeted marketing.
Market Segment Analysis → Personalized Customer Targeting
The Emerging Pattern of Big Data Systems: Retail Example
Real-Time Streams
Exa-scale Data Store
Parallel Data Processing
Real-Time Processing
Machine Learning
Data Science
Cloud Infrastructure
Storage: Plan for Peta-scale Data Storage and Processing
[Chart: PB of Data (log scale, 0.01 to 1000) by year, 2000 to 2015; series: Online Apps, Analytics]
Analytics Rapidly Outgrows Traditional Data Size by 100x
Unprecedented Scale
“Data transparency, amplified by Social Networks, generates data at a scale never seen before”
- The Human Face of Big Data
We are creating an Exabyte of data every minute in 2013
Yottabyte by 2030
A single GE Jet Engine produces 10 Terabytes of data in one hour – 90 Petabytes per year.
This enables early detection of faults and common-mode failures, and provides product engineering feedback.
Post Mortem → Proactively Maintained Connected Product
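The yearly figure for the jet engine can be checked with a short calculation. This is a back-of-the-envelope sketch: it assumes the engine produces data around the clock for a full year and uses decimal units (1 PB = 1000 TB), neither of which the slide states. The function name is made up for illustration.

```python
# Back-of-the-envelope check of the jet engine data volume.
# Assumptions (not stated on the slide): continuous operation, decimal units.
TB_PER_HOUR = 10            # figure quoted on the slide
HOURS_PER_YEAR = 24 * 365   # assumes the engine runs all year
TB_PER_PB = 1000            # decimal petabytes


def yearly_petabytes(tb_per_hour: float, hours: int = HOURS_PER_YEAR) -> float:
    """Convert an hourly data rate in TB into a yearly volume in PB."""
    return tb_per_hour * hours / TB_PER_PB


print(f"{yearly_petabytes(TB_PER_HOUR):.1f} PB per year")
```

Under these assumptions the result is 87.6 PB per year, which rounds to the 90 Petabytes quoted.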
The Emerging Pattern of Big Data Systems: Manufacturing
Exa-scale Data Store
Parallel Data Processing
Real-Time Processing
Machine Learning
Data Science
Cloud Infrastructure
Real-Time Sensor
Analytics Support Product
Engineering
Cloud Platform
Cloud Platform: Supporting Mixed Big Data Workloads
Cloud Infrastructure
Machine Learning
Hadoop
Real-Time Analytics
Management
Network/Security
Storage/Availability
Compute
Cloud Platform: Supporting Multiple Tenants
Cloud Infrastructure
Management
Network/Security
Storage/Availability
Compute
Web User Analytics
Financial Analysis
Historical Customer Behavior
What if you can…
[Diagram labels: Experimentation, Test/Dev, Test, Production; workloads: Recommendation engine, Ad targeting]
One physical platform to support multiple virtual big data clusters
Values of a Cloud Platform for Big Data
1. Agility / Rapid deployment
2. Lower Capex
3. Isolation for resource control and security
4. Operational efficiency
Management
Network/Security
Storage/Availability
Compute
Hadoop as a Service
Elasticity & Multi-tenancy
• Shrink and expand cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for entire Hadoop stack
• One click to set up
• Battle-tested
Operational Simplicity
• Rapid deployment
• One-stop command center
• Easy to configure/reconfigure
Self Service Access to Big Data Environments
Templates for Different Cloud Users
Developer: 3 Hadoop nodes • Cloudera, Pivotal, MapR • Small VM • Local storage • No HA • …
Data Scientist: 5 Hadoop nodes • Cloudera, Pivotal • Hive, Pig • Medium VM • HA • …
High priority: 50 Hadoop nodes • Cloudera • Hive, Pig • Large VM • HA • …
…: … • … • …
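One way to picture the templates is as a small catalogue that a self-service portal consults before provisioning. The sketch below is purely illustrative: the class, field names and `request_cluster` helper are hypothetical and do not come from any VMware product API. MapR is read as belonging to the Developer template, and fields the slide elides with "…" are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ClusterTemplate:
    """One template from the slide; all field names are hypothetical."""
    user: str
    hadoop_nodes: int
    distributions: Tuple[str, ...]
    vm_size: str
    high_availability: bool
    tools: Tuple[str, ...] = ()
    storage: Optional[str] = None  # None where the slide does not say


# Values copied from the slide; nothing beyond them is filled in.
TEMPLATES: Dict[str, ClusterTemplate] = {
    "developer": ClusterTemplate(
        "Developer", 3, ("Cloudera", "Pivotal", "MapR"), "small", False,
        storage="local"),
    "data_scientist": ClusterTemplate(
        "Data Scientist", 5, ("Cloudera", "Pivotal"), "medium", True,
        tools=("Hive", "Pig")),
    "high_priority": ClusterTemplate(
        "High priority", 50, ("Cloudera",), "large", True,
        tools=("Hive", "Pig")),
}


def request_cluster(role: str, distribution: str) -> ClusterTemplate:
    """Return the template for a role after checking the distribution."""
    template = TEMPLATES[role]
    if distribution not in template.distributions:
        raise ValueError(f"{distribution} is not offered to {template.user}")
    return template


print(request_cluster("developer", "MapR").hadoop_nodes)
```

Keeping the templates as immutable data means a user picks a role rather than individual settings, which is the point of self-service access.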
Big Data needs a Mix of Workloads
Compute layer
• Hadoop batch analysis
• HBase real-time queries
• NoSQL: Cassandra, Mongo, etc.
• Big SQL: Impala, Pivotal HAWQ
• Other: Spark, Shark, Solr, Platfora, etc.
File System/Data Store
Platform Virtualization Technology
Host Host Host Host Host Host Host
Strong Isolation between Workloads is Key
Hungry Workload 1
Reckless Workload 2
Nosy Workload 3
Virtualization Platform
Community activity in Isolation and Resource Management
YARN
• Goal: Support workloads other than M-R on Hadoop
• Initial need is for MPI/M-R from Yahoo
• Non-POSIX file system self-selects workload types
Mesos
• Distributed Resource Broker
• Mixed workloads with some RM
• Active project, in use at Twitter
• Leverages OS virtualization – e.g. cgroups
Virtualization
• Virtual machine as the primary isolation, resource management and versioned deployment container
• Basis for Project Serengeti
Use case: Elastic Hadoop with Tiered SLA
• Production workloads have high priority
• Experimentation workloads have lower priority
[Diagram: Experimentation and Production dynamic resource pools of Compute VMs in the compute layer, running Experimentation MapReduce, Production MapReduce and the Production recommendation engine over the data layer on vSphere]
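The tiered SLA can be illustrated with a strict-priority split of a fixed pool of Compute VMs. This is a deliberately simplified sketch, not how vSphere resource pools are implemented: the function name, the priority values and the per-pool demands are all hypothetical, and only the total of 16 Compute VMs is taken from the diagram.

```python
from typing import Dict, List, Tuple


def allocate_by_priority(total_vms: int,
                         requests: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Grant Compute VMs to pools in descending priority order.

    Each request is (pool name, priority, VMs wanted). A higher priority
    is served first; lower-priority pools receive whatever is left.
    """
    remaining = total_vms
    allocation: Dict[str, int] = {}
    for name, _priority, wanted in sorted(requests, key=lambda r: r[1],
                                          reverse=True):
        granted = min(wanted, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


# Hypothetical demands: production wants 12 VMs, experimentation wants 10.
pools = [("experimentation", 1, 10), ("production", 10, 12)]
print(allocate_by_priority(16, pools))
```

With these made-up demands production receives all 12 VMs it asks for and experimentation gets the remaining 4. When production demand drops, rerunning the allocation hands the freed VMs back to experimentation, which is the elastic behavior the slide describes.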
Cloud Enabled Auto-elastic Hadoop
[Diagram: three ESX hosts with local disks and SAN/NAS storage, hosting non-Hadoop VMs, Hadoop compute VMs, Hadoop HDFS VMs (DATA VMs), and the JT, NN, TT and VHM components]
JT: JobTracker TT: TaskTracker NN: NameNode VHM: Virtual Hadoop Manager
Hadoop Performance with Virtualization
[http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013]
(lower is better)
32 hosts/3.6GHz 8 cores/15K RPM 146GB SAS disks/10GbE/72-96GB RAM
Network Platform
A Typical Network Architecture
[Diagram: 16 Hosts in groups of four, each group under a Top of Rack Switch; four L2 Switches; two Aggregating Switches]
Traditional Networks: Core Switch is the Choke Point
[Diagram: network topology (Core, Aggregation, Rack, Hosts) alongside modeled bandwidth, which is non-uniform and labelled 100s of Gbits and 10s of Gbits]
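A common way to reason about a choke point in a tiered network is the oversubscription ratio at each tier: the bandwidth facing the hosts divided by the bandwidth of the uplinks. The sketch below assumes that this is the effect the slide models; the port counts and link speeds are hypothetical and are not taken from the slide.

```python
def oversubscription(hosts: int, host_gbits: float,
                     uplinks: int, uplink_gbits: float) -> float:
    """Ratio of host-facing bandwidth to uplink bandwidth at one switch tier."""
    return (hosts * host_gbits) / (uplinks * uplink_gbits)


# Hypothetical rack: 40 hosts at 10 Gbit/s sharing 4 uplinks at 10 Gbit/s.
rack_ratio = oversubscription(hosts=40, host_gbits=10, uplinks=4, uplink_gbits=10)
print(f"rack oversubscription {rack_ratio:.0f}:1")
```

A ratio above 1:1 means hosts in different racks cannot all communicate at full speed at once, which matters for shuffle-heavy big data jobs.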
Modern Networks: Great for Big Data
[Diagram: network topology (Spine, Leaf, Hosts) alongside modeled bandwidth, which is uniform]
Flat Networks Allow for New Infrastructure Models
[Diagram: Hosts and Storage units attached to Top of Rack Switches, contrasting converged storage and compute with separated storage]
Storage Platform
Use Local Disk where it’s Needed
Storage type     Cost per Gigabyte   $1M gets         IOPS      Throughput
SAN Storage      $2 - $10            0.5 Petabytes    200,000   8 Gbyte/sec
NAS Filers       $1 - $5             1 Petabyte       200,000   10 Gbyte/sec
Local Storage    $0.05               10 Petabytes     400,000   250 Gbytes/sec
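The "$1M gets" figures follow from dividing the budget by the cost per Gigabyte. The sketch below redoes that division using the prices on the slide; the function and constant names are made up for illustration, and decimal units (1 PB = 1,000,000 GB) are an assumption.

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # decimal units, an assumption

# (low, high) cost per Gigabyte in USD, as quoted on the slide.
COST_PER_GB = {
    "SAN Storage": (2.00, 10.00),
    "NAS Filers": (1.00, 5.00),
    "Local Storage": (0.05, 0.05),
}


def petabytes_for_budget(budget_usd: float, cost_per_gb: float) -> float:
    """Raw capacity in PB that a budget buys at a given price per GB."""
    return budget_usd / cost_per_gb / GB_PER_PB


for name, (low, high) in COST_PER_GB.items():
    best = petabytes_for_budget(BUDGET_USD, low)
    worst = petabytes_for_budget(BUDGET_USD, high)
    print(f"{name}: {worst:.2f} to {best:.2f} PB")
```

The SAN and NAS results at the low end of each price range match the slide (0.5 and 1 Petabyte). For local storage the raw division gives 20 Petabytes while the slide quotes 10; the source does not say what accounts for the difference.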
Storage Economics: Traditional vs. Scale-out
[Chart: Cost per GB ($0 to $5.50) versus Petabytes Deployed (0.5 to 128); series: Traditional SAN/NAS; Distributed Object Storage (HDFS, MAPR, CEPH, Gluster, Scality); Scale-out NAS (Isilon)]
Big Data Storage Architectures
External SAN With HDFS
Local Disks With HDFS or Other
External Scale-out NAS
HDFS, CEPH, MAPR, Gluster, Scality, …
Features from New Storage Solutions
• Snapshots
• Clones
• Erasure Coding
• NFS Access
• Universal File Store
• Geo Replication
• POSIX Support
• SSD Capability
• QoS Controls
Summary
Customers Winning from Consolidated Big Data Platforms
“Dedicated hardware makes no sense”
“Software-defined Datacenter enables rapid deployment, multiple tenants and labs”
“Our mixed workloads include Hadoop, Database, ETL and App-servers”
“Any performance penalties are minor”
Management
Network/Security
Storage/Availability
Compute
Cloud Infrastructure is Ready for Big Data – Are you?
Cloud Infrastructure
Is Your Cloud Ready for Big Data?
Richard McDougall
CTO, Storage and Application Services
@richardmcdougll