Best Practices Building a Multi-tenant Big Data Infrastructure
STAC Summit 2014 - NYC
Gord Sissons, [email protected] @GJSissons
© 2014 IBM Corporation
Jul 10, 2015
Agenda
What do we mean by multi-tenancy?
Our evolving view - from HPC to HPA
Enter Big Data
Client example – multi-tenant Hadoop
New frameworks & Benchmarking Hadoop
Closing thoughts
Multi-tenancy is an overloaded term
Means different things to different people
Virtualization
Multiple users, lines-of-business
Multiple application instances & versions
Multi-tenant datastores – security isolation
Multiple distributed frameworks
Multiple instances of the same framework
Our viewpoint is shaped by managing scaled-out cluster infrastructure for the financial services community
Our evolving view of multi-tenancy
From a shared infrastructure for risk analytics to born-in-the-cloud frameworks
HPC, HPA: IBM Platform Symphony
Low latency scheduling
Dynamic resource sharing
ISV applications
Extensive APIs
High-performance SOA
A high-performance, shared grid infrastructure for risk analytics
Batch: IBM Platform LSF
Multi-headed configurations
Batch workloads
On a shared infrastructure, sharing resources according to policy – a broad set of workloads
Client requirements
Need for guaranteed service levels, notion of ownership
Time-variant, directed sharing policies
Dynamic, transparent service orchestration
Support for multiple concurrent applications
Agile flexing & resource reclaim
A simple value proposition to the business – sign on to a shared
infrastructure and have guaranteed resource ownership, and a better
quality of service than you could realize on dedicated infrastructure
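The ownership and reclaim requirements above can be illustrated with a minimal sketch. It assumes resources are counted in abstract "slots"; the `Tenant` class, the `allocate` function, and the tenant names are hypothetical and do not represent how IBM Platform Symphony implements its policies. Time-variant policies are not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    owned: int   # slots this tenant is guaranteed (hypothetical unit)
    demand: int  # slots the tenant currently wants


def allocate(tenants: list[Tenant]) -> dict[str, int]:
    """Serve each tenant from its own slots first, then lend idle slots."""
    # Step 1: guaranteed share. An owner is always served from its own
    # slots first, which is what makes reclaim possible when demand rises.
    alloc = {t.name: min(t.demand, t.owned) for t in tenants}
    idle = sum(t.owned for t in tenants) - sum(alloc.values())

    # Step 2: lend idle slots one at a time, round-robin, to tenants
    # whose demand exceeds what they own.
    hungry = [t for t in tenants if t.demand > alloc[t.name]]
    while idle > 0 and hungry:
        for t in list(hungry):
            if idle == 0:
                break
            alloc[t.name] += 1
            idle -= 1
            if alloc[t.name] == t.demand:
                hungry.remove(t)
    return alloc


risk = Tenant("risk", owned=60, demand=10)
fraud = Tenant("fraud", owned=40, demand=90)

quiet = allocate([risk, fraud])  # risk is mostly idle, so fraud borrows
risk.demand = 60                 # risk now needs all of its own slots
busy = allocate([risk, fraud])   # the borrowed slots return to risk
print(quiet, busy)
```

In this toy model a tenant never receives less than the smaller of its demand and its ownership, which is the sense in which ownership is "guaranteed"; borrowing only ever uses slots that their owner is not asking for.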
Enter Hadoop - much attention for new workloads
[Figure: MapReduce data flow. Labels: C = Client, M = Master. Input Files (split 0 to split 5), Map Phase (three Map tasks), Intermediate Files, Reduce Phase (three Reduce tasks), Output Files (output 0 to output 2).]
Data warehouse modernization
Fraud analytics
Audit & compliance
Social media analytics
360 view of the customer
Machine data analytics
Text analytics
Tick analytics
Trade visibility
Click-stream analytics
Vehicle telematics
History repeating itself: much as distributed systems dominate large-scale HPC, the same is becoming true in data management
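The map, intermediate, and reduce phases named in the diagram on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop code; the function names and the sample splits are made up for the example.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group the intermediate pairs by key, as done between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    # Map phase: each split is processed independently, so on a real
    # cluster these calls could run in parallel on different nodes.
    intermediate = []
    for split in splits:
        intermediate.extend(map_phase(split))
    # Reduce phase: one call per distinct key.
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["tick trade tick", "trade fraud", "tick"]
counts = run_job(splits)
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes the model a fit for scaled-out clusters.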
Our evolving view of multi-tenancy
Big Data: IBM Platform Symphony Advanced Edition
MapReduce
Multitenancy
Agile Scheduling
Hadoop MapReduce
Advanced, high-performance MapReduce framework with Hadoop compatibility and multitenancy
Cluster Sprawl – The Elephant in the Room
Diverse applications with different dependencies
Different distributions, versions & tools
Life cycle management challenges – dev, QA, test, production
Big Data is more than just Hadoop – multiple projects and frameworks
Our evolving view of multi-tenancy
Big Data: IBM Platform Symphony Advanced Edition
Low latency MapReduce
Multitenancy
Agile Scheduling
Hadoop MapReduce
Advanced, high-performance MapReduce framework with 100% Hadoop compatibility and sophisticated multitenancy
Application Frameworks: IBM Application Services Controller
Complex Service Orchestration
Advanced Services
“Born in the cloud” application frameworks
Customer example
US financial institution, approx. 9M customers
Retail banking, credit cards, insurance, portfolio mgmt, real-estate, retirement planning & more
Began Hadoop journey in ~2010
Deliver new services, reduce costs, off-load warehouse, provide timely data access to analysts & data scientists
Target application areas
CRM, click-stream analytics, fraud alerting, actuarial underwriting, social data analytics, vehicle telematics / geo-spatial analytics
Rapid success, internal demand & security requirements
drove the need for an architecture re-think in ~2012
Deployed IBM Platform Symphony MapReduce + Elastic Storage
(based on IBM GPFS) realizing a shared, multi-tenant analytics grid
Shared infrastructure – current state
[Diagram: applications App #1 through App #n, each paired with its own user group (User Group #1 through User Group #n), running on one shared infrastructure stack.]
Over two-dozen lines of business sharing production cluster
1 PB deployed, rapid growth trajectory; ~40% reduction in storage requirement
Security isolation, guaranteed service-levels, show-back accounting
Significant performance & operational gains, higher infrastructure utilization
Avoided the need for additional production clusters
InfoSphere BigInsights - Enterprise-grade Hadoop
Platform Symphony MapReduce – Multi-tenancy, high-performance, service level guarantees
IBM Elastic Storage (based on IBM GPFS) - HDFS compatible, POSIX, enterprise-features
Planned cluster expansion – early 2015
Expanding the Hadoop infrastructure
Deploying Spark to support new applications
Big R deployment serving the data scientist community
Pilot Hadoop-as-a-service on cloud
SQL-on-Hadoop deployment to serve demand from analysts
Hadoop-DS Benchmark – October 2014
IBM developed benchmark reflecting growing interest in SQL-on-Hadoop
Showcase IBM’s Big SQL capability
Big Data DS benchmark - based on TPC-DS
Fully complies with the TPC-DS schema requirement
Uses all 99 queries
Meets the multi-user requirement
Has been audited by a TPC-DS auditor but as a non-TPC benchmark
Select deviations from TPC-DS due to Hadoop limitations:
No data maintenance operations, referential integrity enforcement, or ACID
property validation as these are not feasible with HDFS
Additional statistics used
Metric adjustments
No price/performance measures included
Not an official TPC benchmark result
Benchmarking SQL language compatibility
Key points
With competing solutions, many queries needed to be re-written
Owing to various restrictions, some queries could not be re-written or failed at run-time
Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real production databases is another
Minimum 3.6x speed advantage across the common set of 46 queries
InfoSphere BigInsights runs all queries with 12 allowable modifications
Detailed presentation on SlideShare: http://www.slideshare.net/IBM_IM/hadoop-ds-benchmark-results
Audited by InfoSizing, certified TPC auditors – letter of attestation available
What about YARN?
Yet Another Resource Negotiator
Resource manager included in Hadoop 2.x and later
Decouples Hadoop workload & resource management
Introduces a general purpose application container
Enjoys broad industry support
By all means use it, but understand current limitations
Missing flexible resource sharing policies, not yet widely deployed outside Hadoop contexts, limited application service orchestration capabilities
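As a rough illustration of what decoupling resource management from workload management means, the toy model below lets any framework ask a shared resource manager for a generic container. All class, method, node, and application names here are hypothetical; this is not YARN's API, and real YARN scheduling is considerably more involved.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Container:
    """A generic slice of one node, independent of the framework using it."""
    app_id: str
    node: str
    memory_mb: int


@dataclass
class ToyResourceManager:
    free_mb: dict[str, int]  # node name -> free memory in MB
    granted: list[Container] = field(default_factory=list)

    def request(self, app_id: str, memory_mb: int) -> Optional[Container]:
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free_mb.items():
            if free >= memory_mb:
                self.free_mb[node] = free - memory_mb
                container = Container(app_id, node, memory_mb)
                self.granted.append(container)
                return container
        return None  # no capacity; the caller must wait or retry

    def release(self, container: Container) -> None:
        """Return a container's memory to its node."""
        self.granted.remove(container)
        self.free_mb[container.node] += container.memory_mb


rm = ToyResourceManager({"node1": 4096, "node2": 2048})
a = rm.request("mapreduce-job", 3072)  # fits on node1
b = rm.request("spark-app", 2048)      # node1 is too full, lands on node2
c = rm.request("spark-app", 2048)      # nothing left anywhere
rm.release(a)                          # the MapReduce job finishes
d = rm.request("spark-app", 2048)      # node1 has room again
print(a, b, c, d)
```

Note that this toy grants containers strictly first-come, first-served. It has no notion of ownership, lending, or reclaim between tenants, which is the kind of sharing policy the slide lists as a current limitation.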
Closing thoughts
Be clear on what you mean by multi-tenancy
The right approach to building a shared infrastructure will depend on what you have
Consider the need for policy management and the ability to orchestrate services for a wide variety of distributed frameworks
http://ibm.com/platformcomputing
http://ibm.com/hadoop