© 2014 IBM Corporation Best Practices Building a Multi-tenant Big Data Infrastructure STAC Summit 2014 - NYC Gord Sissons, [email protected] @GJSissons
STAC Summit 2014 - Building a multitenant Big Data infrastructure

Jul 10, 2015

Technology

Gord Sissons

IBM presentation to the STAC Summit in New York City, November 13, 2014. Best practices for building a multitenant big data infrastructure with Hadoop.
Transcript
Page 1: STAC Summit 2014 - Building a multitenant Big Data infrastructure

© 2014 IBM Corporation

Best Practices Building a

Multi-tenant Big Data Infrastructure

STAC Summit 2014 - NYC

Gord Sissons, [email protected] @GJSissons

Page 2: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Agenda

What do we mean by multi-tenancy?

Our evolving view - from HPC to HPA

Enter Big Data

Client example – multi-tenant Hadoop

New frameworks & Benchmarking Hadoop

Closing thoughts

Page 3: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Multi-tenancy is an over-loaded term

Virtualization

Multiple users, lines-of-business

Multiple application instances & versions

Multi-tenant datastores – security isolation

Multiple distributed frameworks

Multiple instances of the same framework

Our viewpoint is shaped by managing scaled-out cluster infrastructure for the financial services community

Means different things to different people

Page 4: STAC Summit 2014 - Building a multitenant Big Data infrastructure

HPC, HPA

IBM Platform Symphony

Low latency scheduling

Dynamic resource sharing

ISV applications

Extensive APIs

High-performance SOA

A high-performance, shared grid infrastructure for risk analytics

From a shared infrastructure for risk analytics to born-in-the-cloud frameworks

Batch

IBM Platform LSF

Multi-headed configurations

Batch workloads

On a shared infrastructure, sharing resources according to policy – a broad set of workloads

Our evolving view of multi-tenancy

Page 5: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Client requirements

Need for guaranteed service levels, notion of ownership

Time-variant, directed sharing policies

Dynamic, transparent service orchestration

Support for multiple concurrent applications

Agile flexing & resource reclaim

A simple value proposition to the business: sign on to a shared infrastructure and have guaranteed resource ownership, and a better quality of service than you could realize on dedicated infrastructure
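The ownership, flexing and reclaim requirements above can be illustrated with a minimal Python sketch. It is not how IBM Platform Symphony implements its policies; the `Tenant` class, the `allocate` function and the slot counts are hypothetical names and numbers chosen for illustration. Each tenant is guaranteed its owned slots, idle slots are lent to tenants with unmet demand, and lent slots return to the owner as soon as the owner's demand rises.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Tenant:
    """Hypothetical tenant record: a line of business on the shared grid."""
    name: str
    owned: int       # slots this tenant is guaranteed (its "ownership")
    demand: int = 0  # slots this tenant currently wants


def allocate(tenants: list[Tenant]) -> dict[str, int]:
    """Return slots per tenant: ownership first, then lend out idle slots."""
    # Step 1: guarantee. Every tenant gets its demand, up to what it owns.
    alloc = {t.name: min(t.demand, t.owned) for t in tenants}
    # Step 2: owned-but-unused slots form the lending pool.
    idle = sum(t.owned for t in tenants) - sum(alloc.values())
    # Step 3: flexing. Lend idle slots to tenants with unmet demand,
    # visiting tenants in name order (an arbitrary, assumed tie-break).
    for t in sorted(tenants, key=lambda t: t.name):
        extra = min(idle, t.demand - alloc[t.name])
        alloc[t.name] += extra
        idle -= extra
    return alloc


if __name__ == "__main__":
    tenants = [Tenant("fraud", owned=40, demand=90),
               Tenant("risk", owned=60, demand=10)]
    print(allocate(tenants))   # fraud borrows the slots risk is not using
    tenants[1].demand = 60     # risk now needs everything it owns
    print(allocate(tenants))   # reclaim: fraud drops back to its own share
```

Reclaim is modeled here simply by recomputing the allocation when demand changes. A real scheduler would also have to decide how quickly running tasks on borrowed slots are preempted, which this sketch leaves out.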

Page 6: STAC Summit 2014 - Building a multitenant Big Data infrastructure

[Figure: MapReduce data flow. Input files (splits 0 to 5) feed the map phase (three Map tasks), which writes intermediate files; the reduce phase (three Reduce tasks) writes the output files (outputs 0 to 2). A client (C) submits the job and a master (M) coordinates it.]

Enter Hadoop - much attention for new workloads

Data warehouse modernization

Fraud analytics

Audit & compliance

Social media analytics

360 view of the customer

Machine data analytics

Text analytics

Tick analytics

Trade visibility

Click-stream analytics

Vehicle telematics

History repeating itself - Much as distributed systems dominate large-scale HPC, the same is becoming true in data management
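The MapReduce data flow pictured on this slide (input splits, map phase, intermediate files, reduce phase, output files) can be sketched in a few lines of plain Python. This is a single-process word-count illustration, not Hadoop code; the function names and the simple byte-sum partitioner are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped, num_reducers):
    """Group intermediate pairs by key and route each key to one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            # Deterministic toy partitioner (assumption): byte sum modulo
            # the reducer count, so a given key always lands on one reducer.
            idx = sum(key.encode()) % num_reducers
            partitions[idx][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reduce task: sum the counts collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(splits, num_reducers=3):
    """Run map, shuffle and reduce; return one output dict per reducer."""
    mapped = [map_phase(s) for s in splits]          # map phase
    partitions = shuffle(mapped, num_reducers)       # intermediate files
    return [reduce_phase(p) for p in partitions]     # output files


if __name__ == "__main__":
    outputs = run_job(["tick trade tick", "trade fraud", "tick"])
    for i, out in enumerate(outputs):
        print(f"output {i}: {out}")
```

In a real cluster the map and reduce tasks run on different nodes and the intermediate data crosses the network, which is why scheduling latency and data placement matter for multi-tenant performance.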

Page 7: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Big Data

IBM Platform Symphony Advanced Edition

MapReduce

Multitenancy

Agile Scheduling

Hadoop MapReduce

Advanced, high-performance MapReduce framework with Hadoop compatibility and multitenancy

Our evolving view of multi-tenancy

Page 8: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Cluster Sprawl – The Elephant in the Room

Diverse applications with different dependencies

Different distributions, versions & tools

Life cycle management challenges – dev, QA, test, production

Big Data is more than just Hadoop – multiple projects and frameworks

Page 9: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Big Data

IBM Platform Symphony Advanced Edition

Low latency MapReduce

Multitenancy

Agile Scheduling

Hadoop MapReduce

Advanced, high-performance MapReduce framework with 100% Hadoop compatibility and sophisticated multitenancy

Application Frameworks

IBM Application Services Controller

Complex Service Orchestration

Advanced Services

“Born in the cloud” application frameworks

Our evolving view of multi-tenancy

Page 10: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Customer example

US financial institution, approx. 9M customers

Retail banking, credit cards, insurance, portfolio management, real estate, retirement planning & more

Began Hadoop journey in ~2010

Deliver new services, reduce costs, off-load warehouse, provide timely data access to analysts & data scientists

Target application areas

CRM, click-stream analytics, fraud alerting, actuarial underwriting, social data analytics, vehicle telematics / geo-spatial analytics

Rapid success, internal demand & security requirements drove the need for an architecture re-think in ~2012

Deployed IBM Platform Symphony MapReduce + Elastic Storage (based on IBM GPFS), realizing a shared, multi-tenant analytics grid

Page 11: STAC Summit 2014 - Building a multitenant Big Data infrastructure

[Figure: applications #1 through #n, each with its own user group (#1 through #n), running on one shared infrastructure]

Shared infrastructure – current state

Over two dozen lines of business sharing a production cluster

1 PB deployed, rapid growth trajectory; ~40% reduction in storage requirement

Security isolation, guaranteed service-levels, show-back accounting

Significant performance & operational gains, higher infrastructure utilization

Avoided the need for additional production clusters

InfoSphere BigInsights - Enterprise-grade Hadoop

Platform Symphony MapReduce – Multi-tenancy, high-performance, service level guarantees

IBM Elastic Storage (based on IBM GPFS) - HDFS compatible, POSIX, enterprise-features
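Show-back accounting, mentioned above, means reporting to each line of business what it consumed on the shared cluster without necessarily billing for it. The sketch below shows one way such a report could be computed; the record layout `(tenant, slots, hours)`, the function name `show_back` and the rate are hypothetical and do not come from the presentation or from any IBM product.

```python
from collections import defaultdict


def show_back(usage_records, rate_per_slot_hour):
    """Aggregate slot-hours per tenant and price them at a flat rate.

    usage_records: iterable of (tenant, slots, hours) tuples, a made-up
    layout standing in for whatever the scheduler's accounting log holds.
    """
    slot_hours = defaultdict(float)
    for tenant, slots, hours in usage_records:
        slot_hours[tenant] += slots * hours
    # Round to cents for reporting purposes.
    return {tenant: round(total * rate_per_slot_hour, 2)
            for tenant, total in slot_hours.items()}


if __name__ == "__main__":
    records = [("fraud", 10, 2.0), ("risk", 4, 0.5), ("fraud", 5, 1.0)]
    for tenant, cost in show_back(records, rate_per_slot_hour=0.10).items():
        print(f"{tenant}: {cost:.2f}")
```

A flat rate per slot-hour is the simplest possible model; a production report would more likely weight by resource type or time of day.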

Page 12: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Planned cluster expansion – early 2015

Expanding the Hadoop infrastructure

Deploying Spark to support new applications

Big R deployment serving the data scientist community

Pilot Hadoop-as-a-service on cloud

SQL-on-Hadoop deployment to serve demand from analysts

Page 13: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Hadoop-DS Benchmark – October 2014

IBM-developed benchmark reflecting growing interest in SQL-on-Hadoop

Showcase IBM’s Big SQL capability

Big Data DS benchmark - based on TPC-DS

Fully complies with the TPC-DS schema requirement

Uses all 99 queries

Meets the multi-user requirement

Has been audited by a TPC-DS auditor but as a non-TPC benchmark

Select deviations from TPC-DS due to Hadoop limitations:

No data maintenance operations, referential integrity enforcement, or ACID property validation, as these are not feasible with HDFS

Additional statistics used

Metric adjustments

No price/performance measures included

Not an official TPC benchmark result

Page 14: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Benchmarking SQL language compatibility

Key points

With competing solutions, many queries needed to be re-written

Owing to various restrictions, some queries could not be re-written or failed at run-time

Re-writing queries in a benchmark scenario where results are known is one thing; doing this against real production databases is another

Minimum 3.6x speed advantage across the set of 46 common queries

InfoSphere BigInsights runs all queries with 12 allowable modifications

Detailed presentation on SlideShare: http://www.slideshare.net/IBM_IM/hadoop-ds-benchmark-results

Audited by InfoSizing, certified TPC auditors – letter of attestation available

Page 15: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Resource manager included in Hadoop 2.x and later

Decouples Hadoop workload & resource management

Introduces a general purpose application container

Enjoys broad industry support

By all means use it, but understand current limitations

Missing flexible resource sharing policies, not yet widely deployed outside Hadoop contexts, limited application service orchestration capabilities

What about YARN?

Yet Another Resource Negotiator

Page 16: STAC Summit 2014 - Building a multitenant Big Data infrastructure

Closing thoughts

http://ibm.com/platformcomputing

http://ibm.com/hadoop

Be clear on what you mean by multi-tenancy

The right approach to building a shared infrastructure will depend on what you have

Consider the need for policy management and the ability to orchestrate services for a wide variety of distributed frameworks
