Lecture 11: Introduction to BigData Analysis Stack

Jeremy Wei, Center for HPC, SJTU

May 8th, 2017

CS075
Jun 26, 2018

Outline

• What is BigData?

• Hadoop Ecosystem

• Spark’s Improvement over Hadoop

What is Big Data?

• Volume: exceeds the limits of traditional column-and-row relational DBs and is constantly growing

– Requires vertical scalability: the ability to grow storage to accommodate new ‘records’

• Velocity: arrives rapidly, often in real time

– Requires data streaming: real-time processing, analysis and transformation

• Variety: does not have a standard structure, e.g. text, images

– Requires horizontal scalability: the ability to add additional data structures

“Big data are high volume, high velocity, and high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” (Gartner 2012)

How is big data generated?

• Social media: posts, pictures and videos

• Purchase transaction records

• Mobile phone GPS signals

• High-volume administrative & transactional records

• Sensors gathering information, e.g. climate, traffic, etc.

• Digital satellite images

Pilot 1: Smart meter project

[Figure: Irish smart meter pilot study. Single meter, total daily electricity consumption (kWh) by day, July 2009 to December 2010. Annotations: Christmas 2009; Christmas 2010; consecutive days with low consumption, possibly a week away?]

Pilot 1: Smart meter project

Research Question: Investigate the potential of smart meter electricity data (high frequency, every 30 mins) to identify household occupancy levels and, potentially, household structure

• England and Ireland both conducted pilots of rollout in 2009-2010; data now available for research

• Southampton University commissioned by Beyond 2011 to conduct preliminary research (due mid Feb 2014)

Pilot 2: Mobile Phone Project

• 4 pilot projects:

– Smartmeter

– Mobile Phones

– Prices

– Twitter

• RD&I Research Innovation Labs

Pilot 2: Mobile Phone Project

Research Question: To investigate the possibility of using mobile phone data to model population flows, e.g. travel-to-work statistics

• Location data:

– Telefonica proposal to provide aggregate data on origin-destination flows

• Requirement to engage with GDS before proceeding further

Pilot 3: Prices Project

Research Question: To investigate how we can scrape prices data from the internet and how this data could be used within price statistics

• 2-day workshop held with big data experts from Statistics Netherlands

• Focus on groceries

• Early prototype code in place

• Engagement with Billion Prices Project

Pilot 3: Prices by webscraping

Rendered webpage:

HTML code:......

</div><div class="productLists" id="endFacets-1"><ul class="cf products line"><li id="p-254942348-3" class=" first"><div

class="desc"><h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348" class="si_pl_254942348-title"><span class="image"><img

src="http://img.tesco.com/Groceries/pi/121\5010044000121\IDShot_90x90.jpg" alt="" /><!----></span>Warburtons Toastie Sliced

White Bread 800G</a></h3><p class="limitedLife"><a href="http://www.tesco.com/groceries/zones/default.aspx?name=quality-and-

freshness">Delivering the freshest food to your door- Find out more &gt;</a></p><div class="descContent"><!----><div

class="promo"><a href="/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31234788" title="All products available for this offer" id="flyout-254942348-promo-A31234788--pos" class="promoFlyout"><span class="promoImgBox"><img

src="/Groceries/UIAssets/I/Sites/Retail/Superstore/Online/Product/pos/2for.png" class="promoFlyout promo" alt="Special Offer"

id="flyout-254942348-promo-A31234788--posimg" /></span><em>Any 2 for £2.00</em></a><span> valid from 21/1/2014 until

10/2/2014</span></div><div class="tools"><div class="moreInfo"><a href="/groceries/Product/Details/?id=254942348"

class="midiFlyout" id="flyout-254942348-midi-0-"><img class="midiFlyout hd" src="http://ui.tescoassets.com/groceries/UIAssets/I/../Compressed/I_635209615845382232/Sites/Retail/Superstore/Online/Product/i

nfoBlue.gif" alt="" title="View product information" id="flyout-254942348-midi-1-" /></a></div><!----><div

class="links"><ul><li><a

href="http://www.tesco.com/groceries/product/browse/default.aspx?notepad=white%20sliced%20loaf%20800g&amp;N=4294793217"

class="shelfFlyout active plaintooltip" id="s-tt-254942348" title="Premium White Bread"> Rest of <span class="hide">Premium White Bread <!----></span>shelf </a></li></ul></div></div></div></div><div class="quantity"><div class="content addToBasket"><p

class="price"><span class="linePrice">£1.45<!----></span><span class="linePriceAbbr"> (£0.18/100g)</span></p><h4

class="hide">Add to basket</h4><form method="post" id="fMultisearch-254942348"

.....
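
To make the scraping step concrete, here is a minimal Scala sketch that pulls the product name and shelf price out of a fragment like the one above. The object name `PriceScrape`, the trimmed `sample` string, and both regular expressions are illustrative assumptions, not the pilot's prototype code. A production scraper would more likely use a real HTML parser, because regular expressions break as soon as the markup changes.

```scala
object PriceScrape {
  // Trimmed-down, hypothetical fragment modeled on the listing above.
  val sample: String =
    """<h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348">""" +
    """<span class="image"><img src="x.jpg" alt="" /></span>Warburtons Toastie Sliced White Bread 800G</a></h3>""" +
    """<p class="price"><span class="linePrice">£1.45<!----></span>""" +
    """<span class="linePriceAbbr"> (£0.18/100g)</span></p>"""

  // Product name: the text between the image span and the closing </a></h3>.
  private val NameRe = """</span>([^<]+)</a></h3>""".r
  // Shelf price: pounds and pence after the pound sign inside the linePrice span.
  private val PriceRe = """class="linePrice">£(\d+\.\d{2})""".r

  def productName(html: String): Option[String] =
    NameRe.findFirstMatchIn(html).map(_.group(1).trim)

  // Returns the price in pence to avoid floating-point money values.
  def priceInPence(html: String): Option[Int] =
    PriceRe.findFirstMatchIn(html).map { m =>
      val parts = m.group(1).split('.')
      parts(0).toInt * 100 + parts(1).toInt
    }

  def main(args: Array[String]): Unit = {
    println(productName(sample))
    println(priceInPence(sample))
  }
}
```

Returning `Option` keeps a missing price (for example, an out-of-stock product) from crashing a long scraping run.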

Pilot 3: The Billion Prices Project @ MIT

[Figure: Daily Online Price Index (United States), annotated with the date Lehman Brothers files for bankruptcy (15 Sept 2008)]

Pilot 4: Twitter Project

Research Question: To investigate how to capture geo-located tweets from Twitter and how this data might provide insights on commuting patterns and internal migration

• Opportunity to start experimenting early on with big data technologies

• Pilot work has successfully harvested geo-located tweets from the live Twitter feed using Python and the Twitter API

• Need to determine whether planned application will exceed rate limits

Pilot 4: Twitter Project

Temporal Patterns of International Mobility by selected country

Pilot 4: Mobility patterns from Twitter

[Figure: map; labeled locations: Dover, Calais]

Implication of BigData

• Analyzing the whole collection, not just samples.

• Long-tailed low-density heterogeneous data.

• Correlation, not causation.

Architecture of Big Data Analytics

Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications

[Diagram, four stages from left to right, with raw data flowing into the second stage and transformed data into the third:]

• Big Data Sources: internal, external; multiple formats, multiple locations, multiple applications

• Big Data Transformation: Middleware; Extract, Transform, Load; Data Warehouse; Traditional Format (CSV, Tables)

• Big Data Platforms & Tools: Hadoop, MapReduce, Pig, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Avro, Mahout, Others

• Big Data Analytics Applications: Queries, Reports, OLAP, Data Mining

• MapReduce: Processing

• HDFS: Storage

Source: http://hadoop.apache.org/

Why Hadoop?

MapReduce Workflow

MapReduce

HBase: Hadoop Database

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Pig: High-level scripts for Hadoop

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Sqoop: SQL importer for Hadoop

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Hive: SQL on MapReduce

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Impala: High-performance SQL on Hadoop

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Hadoop Ecosystem

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Limitations of Hadoop

• Requires synchronization before the shuffle, so the slowest task can become the bottleneck.

• Requires serialization between iterations, which makes it a poor fit for iterative jobs.

• Batch jobs are NOT optimized for streaming or interactive queries.

• Does NOT utilize memory efficiently.
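
The first limitation comes down to simple arithmetic: because the shuffle cannot start until every map task has finished, a stage lasts as long as its slowest task, not its average task. The sketch below assumes all tasks run concurrently and uses made-up task durations; `StageBarrier` and its method names are hypothetical.

```scala
object StageBarrier {
  // With a barrier before the shuffle, the stage ends when the slowest task ends.
  def stageTime(taskSeconds: Seq[Double]): Double = taskSeconds.max

  // Average task time, for comparison with the barrier-imposed stage time.
  def meanTime(taskSeconds: Seq[Double]): Double = taskSeconds.sum / taskSeconds.size

  def main(args: Array[String]): Unit = {
    // Made-up durations in seconds; the last task is a straggler.
    val tasks = Seq(10.0, 11.0, 9.0, 40.0)
    println(s"mean task time: ${meanTime(tasks)} s")
    println(s"stage time:     ${stageTime(tasks)} s")
  }
}
```

A single straggler more than doubles the stage time here, even though three of the four tasks finish in about ten seconds.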

Next-gen Big Data Processing Goals

• Low latency (interactive) queries on historical data: enable faster decisions

– E.g., identify why a site is slow and fix it

• Low latency queries on live data (streaming): enable decisions on real-time data

– E.g., detect & block worms in real-time (a worm may infect 1 million hosts in 1.3 sec)

• Sophisticated data processing: enable “better” decisions

– E.g., anomaly detection, trend analysis

Today’s Open Analytics Stack…

• …mostly focused on large on-disk datasets: great for batch but slow

[Diagram: stack layers: Infrastructure, Storage, Data Processing, Application]

Goals

[Diagram: Batch, Interactive, Streaming. One stack to rule them all!]

• Easy to combine batch, streaming, and interactive computations

• Easy to develop sophisticated algorithms

• Compatible with existing open source ecosystem (Hadoop/HDFS)

Support Interactive and Streaming Comp.

• Aggressive use of memory

• Why?

1. Memory transfer rates >> disk or SSDs

2. Many datasets already fit into memory

• Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory

• e.g., 1TB = 1 billion records @ 1KB each

3. Memory density (still) grows with Moore’s law

• RAM/SSD hybrid memories on the horizon

[Diagram: high-end datacenter node: 16 cores; 128-512GB memory at 40-60GB/s; 1-4TB at 1-4GB/s (x4 disks); 10-30TB at 0.2-1GB/s (x10 disks); 10Gbps]

Support Interactive and Streaming Comp.

• Increase parallelism

• Why?

– Reduce work per node → improve latency

• Techniques:

– Low-latency parallel scheduler that achieves high locality

– Optimized parallel communication patterns (e.g., shuffle, broadcast)

– Efficient recovery from failures and straggler mitigation

[Diagram: with more nodes working in parallel, the time to result drops from T to Tnew (< T)]

Support Interactive and Streaming Comp.

• Trade between result accuracy and response times

• Why?

– In-memory processing does not guarantee interactive query processing

• e.g., ~10’s of sec just to scan 512 GB RAM!

• Gap between memory capacity and transfer rate increasing

• Challenges:

– accurately estimate error and running time for…

– … arbitrary computations

[Diagram: node with 16 cores, 128-512GB memory, 40-60GB/s transfer rate; annotations: doubles every 18 months; doubles every 36 months]
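
The "~10's of sec" figure follows directly from the numbers on the slide, as this short Scala sketch shows. It assumes, as the diagram layout suggests but the extracted text does not state outright, that capacity is the quantity doubling every 18 months and transfer rate the one doubling every 36 months. `ScanTime` and its method names are hypothetical.

```scala
object ScanTime {
  // Seconds needed to stream `sizeGB` of data at `bandwidthGBps`.
  def scanSeconds(sizeGB: Double, bandwidthGBps: Double): Double =
    sizeGB / bandwidthGBps

  // Factor by which a full-memory scan slows down after `months`, ASSUMING
  // capacity doubles every `capDoubling` months and bandwidth every `bwDoubling`.
  def scanTimeGrowth(months: Double, capDoubling: Double = 18, bwDoubling: Double = 36): Double =
    math.pow(2, months / capDoubling) / math.pow(2, months / bwDoubling)

  def main(args: Array[String]): Unit = {
    println(scanSeconds(512, 60)) // best case on the slide's 40-60GB/s range
    println(scanSeconds(512, 40)) // worst case on that range
    println(scanTimeGrowth(36))   // slowdown of a full scan after three years
  }
}
```

Under these assumptions a full scan of 512 GB takes roughly 8.5 to 12.8 seconds, and the scan time doubles every three years, which is the motivation for trading accuracy for response time.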

Spark

In-Memory Cluster Computing for

Iterative and Interactive Applications

UC Berkeley

Background

• Commodity clusters have become an important computing platform for a variety of applications

– In industry: search, machine translation, ad targeting, …

– In research: bioinformatics, NLP, climate simulation, …

• High-level cluster programming models like MapReduce power many of these apps

• Theme of this work: provide similarly powerful abstractions for a broader class of applications

Motivation

Current popular programming models for clusters transform data flowing from stable storage to stable storage, e.g., MapReduce:

[Diagram: Input → Map, Map, Map → Reduce, Reduce → Output]

Motivation

• Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:

– Iterative algorithms (many in machine learning)

– Interactive data mining tools (R, Excel, Python)

• Spark makes working sets a first-class concept to efficiently support these apps

Spark Goal

• Provide distributed memory abstractions for clusters to support apps with working sets

• Retain the attractive properties of MapReduce:

– Fault tolerance (for crashes & stragglers)

– Data locality

– Scalability

Solution: augment data flow model with “resilient distributed datasets” (RDDs)

Programming Model

• Resilient distributed datasets (RDDs)

– Immutable collections partitioned across cluster that can be rebuilt if a partition is lost

– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)

– Can be cached across parallel operations

• Parallel operations on RDDs

– Reduce, collect, count, save, …

• Restricted shared variables

– Accumulators, broadcast variables

Example: Log Mining

• Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")          // base RDD
val errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
val messages = errors.map(_.split('\t')(2))       // keep the third tab-separated field
val cachedMsgs = messages.cache()                 // cached RDD

cachedMsgs.filter(_.contains("foo")).count        // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers and receives results; each worker reads one block (Block 1 to Block 3) and holds one cache (Cache 1 to Cache 3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

RDDs in More Detail

• An RDD is an immutable, partitioned, logical collection of records

– Need not be materialized, but rather contains information to rebuild a dataset from stable storage

• Partitioning can be based on a key in each record (using hash or range partitioning)

• Built using bulk transformations on other RDDs

• Can be cached for future reuse
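
As an illustration of the two key-based partitioning schemes named above, the following plain-Scala sketch assigns records to partitions by hash and by range. It is a simplified model written for this note, not Spark's partitioner code; `Partitioning`, `hashPartition`, and `rangePartition` are hypothetical names.

```scala
object Partitioning {
  // Hash partitioning: partition = hash(key) mod n; floorMod keeps the result non-negative.
  def hashPartition(key: Any, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode(), numPartitions)

  // Range partitioning: `upperBounds` holds the sorted, inclusive upper bound of every
  // partition except the last; keys above all bounds land in the last partition.
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val idx = upperBounds.indexWhere(key <= _)
    if (idx >= 0) idx else upperBounds.length
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(3 -> "c", 17 -> "q", 42 -> "z", 8 -> "h")
    // Group the records by the partition their key is assigned to.
    println(records.groupBy { case (k, _) => hashPartition(k, 4) })
    println(records.groupBy { case (k, _) => rangePartition(k, Seq(10, 20)) })
  }
}
```

Hash partitioning spreads keys evenly but scatters neighbors, while range partitioning keeps nearby keys together, which helps sorted or range-scanned workloads.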

RDD Operations

Transformations (define a new RDD)

• map, filter, sample, union, groupByKey, reduceByKey, join, cache

Parallel operations (Actions) (return a result to driver)

• reduce, collect, count, save, lookupKey
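
The split between transformations and actions matters because transformations only describe a dataset, while an action forces the work to run. The sketch below imitates that behavior with a lazy view from the Scala standard library; it is an analogy on a local collection, not Spark code, and `LazyOps` is a hypothetical name.

```scala
object LazyOps {
  // Returns (map calls before the action, result of the action, map calls after it).
  def run(): (Int, Int, Int) = {
    var mapCalls = 0
    val data = Vector(1, 2, 3, 4, 5, 6)

    // "Transformations": the view only records what to do; nothing runs yet.
    val evensSquared = data.view
      .filter(_ % 2 == 0)
      .map { x => mapCalls += 1; x * x }
    val callsBeforeAction = mapCalls

    // "Action": asking for a result runs the whole pipeline.
    val total = evensSquared.sum
    (callsBeforeAction, total, mapCalls)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The map function is not called at all until `sum` asks for a result, which is the same reason an RDD "need not be materialized" until a parallel operation uses it.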

RDD Fault Tolerance

• RDDs maintain lineage information that can be used to reconstruct lost partitions

• e.g.:

val cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()

[Diagram: lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
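
A toy model helps show how lineage turns into fault recovery: each dataset remembers only its parent and the function applied to it, so a lost cached partition can be rebuilt by replaying that chain for the one partition. Everything below (`Lineage`, `Source`, `Filtered`, `Mapped`, `Cached`, `lose`) is a hypothetical single-machine sketch, not Spark's implementation.

```scala
object Lineage {
  // A dataset is either read from (simulated) stable storage or derived from a parent.
  sealed trait Dataset[A] {
    // Recompute one partition by walking the lineage back to the source.
    def compute(partition: Int): Seq[A]
  }
  final case class Source[A](blocks: Vector[Seq[A]]) extends Dataset[A] {
    def compute(partition: Int): Seq[A] = blocks(partition)
  }
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  }
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }
  // A cache on top of a lineage: a missing partition is rebuilt from the parent.
  final class Cached[A](parent: Dataset[A]) extends Dataset[A] {
    private val cache = scala.collection.mutable.Map.empty[Int, Seq[A]]
    var recomputations = 0
    def compute(partition: Int): Seq[A] =
      cache.getOrElseUpdate(partition, { recomputations += 1; parent.compute(partition) })
    def lose(partition: Int): Unit = { cache -= partition }
  }

  // Returns the rebuilt partition 1 and the total number of partition computations.
  def demo(): (Seq[String], Int) = {
    val blocks = Vector(
      Seq("INFO\tboot\tok", "ERROR\tdisk\tfull"),
      Seq("ERROR\tnet\ttimeout", "WARN\tcpu\thot"))
    val msgs = new Cached(
      Mapped(Filtered(Source(blocks), (l: String) => l.startsWith("ERROR")),
             (l: String) => l.split('\t')(2)))
    msgs.compute(0); msgs.compute(1) // first use fills the cache
    msgs.lose(1)                     // simulate losing one cached partition
    (msgs.compute(1), msgs.recomputations)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Only the lost partition is recomputed (three computations in total instead of four), and no replica of the data had to be kept.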

Example 1: Logistic Regression

• Goal: find best line separating two sets of points

[Figure: two sets of points, with a random initial line and the target separating line]

Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()  // parse the points once, keep them in memory
var w = Vector.random(D)                               // random initial line
for (i <- 1 to ITERATIONS) {
  // gradient of the logistic loss, summed over all points
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
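
For readers without a cluster at hand, the same update rule can be tried on a local collection. This sketch mirrors the loop above with plain arrays; the four-point `toy` dataset (with a constant 1.0 as the last feature to act as a bias term), the zero starting vector (instead of the slide's random one, so that runs are repeatable), and the names `LogReg`, `step`, `train`, and `predict` are all assumptions made for illustration.

```scala
object LogReg {
  final case class Point(x: Array[Double], y: Double) // y is +1.0 or -1.0

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // One iteration of the slide's update:
  //   w -= sum over points of (1 / (1 + exp(-y * (w dot x))) - 1) * y * x
  def step(w: Array[Double], data: Seq[Point]): Array[Double] = {
    val gradient = data
      .map { p =>
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * scale)
      }
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    w.zip(gradient).map { case (wi, gi) => wi - gi }
  }

  def train(data: Seq[Point], iterations: Int): Array[Double] =
    (1 to iterations).foldLeft(Array.fill(data.head.x.length)(0.0))((w, _) => step(w, data))

  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0

  // Made-up, linearly separable points.
  val toy: Seq[Point] = Seq(
    Point(Array(2.0, 2.0, 1.0), 1.0),
    Point(Array(3.0, 1.0, 1.0), 1.0),
    Point(Array(-1.0, -2.0, 1.0), -1.0),
    Point(Array(-2.0, -1.0, 1.0), -1.0))

  def main(args: Array[String]): Unit = {
    val w = train(toy, 10)
    println("Final w: " + w.mkString("(", ", ", ")"))
  }
}
```

Each iteration reads every point again, which is why keeping `data` in memory pays off so clearly for this kind of algorithm.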

Logistic Regression Performance

[Figure: running time by number of iterations]

• Hadoop: 127 s / iteration

• Spark: first iteration 174 s, further iterations 6 s

Example 2: MapReduce

• MapReduce data flow can be expressed using RDD transformations

val res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map { case (key, vals) => myReduceFunc(key, vals) }

Or with combiners:

val res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map { case (key, v) => myReduceFunc(key, v) }
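
The two variants can be compared on a local collection. In the sketch below, `mapReduce` follows the group-then-reduce shape and `mapCombine` follows the combiner shape, which merges values pairwise as they arrive and never builds the per-key list. Both functions and the `MiniMapReduce` object are hypothetical stand-ins built on standard Scala collections, not Spark APIs.

```scala
object MiniMapReduce {
  // groupByKey style: collect all values per key, then reduce each group.
  def mapReduce[A, K, V, R](data: Seq[A])(mapFn: A => Seq[(K, V)])(reduceFn: (K, Seq[V]) => R): Map[K, R] =
    data.flatMap(mapFn)
      .groupBy(_._1)
      .map { case (k, pairs) => k -> reduceFn(k, pairs.map(_._2)) }

  // reduceByKey style: merge values pairwise with an associative combiner.
  def mapCombine[A, K, V](data: Seq[A])(mapFn: A => Seq[(K, V)])(combine: (V, V) => V): Map[K, V] =
    data.flatMap(mapFn).foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(combine(_, v)).getOrElse(v))
    }

  // Word count as the map function: emit (word, 1) for every word in a line.
  def words(line: String): Seq[(String, Int)] = line.split(" ").toSeq.map(w => (w, 1))

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or", "not to be")
    println(mapReduce(lines)(words)((_, ones) => ones.sum))
    println(mapCombine(lines)(words)(_ + _))
  }
}
```

Both calls produce the same word counts; the combiner form is preferable whenever the reduce function is associative, because it avoids holding every value for a key at once.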

Example 3

Other Spark Applications

• Twitter spam classification (Justin Ma)

• EM alg. for traffic prediction (Mobile Millennium)

• K-means clustering

• Alternating Least Squares matrix factorization

• In-memory OLAP aggregation on Hive data

• SQL on Spark (future work)

Berkeley Data Analytics Stack (BDAS)

[Diagram: the Infrastructure, Storage, Data Processing, and Application layers, with Resource Management and Data Management layers added]

• Resource Management: share infrastructure across frameworks (multi-programming for datacenters)

• Data Management: efficient data sharing across frameworks

• Data Processing: in-memory processing; trade between time, quality, and cost

• Application: new apps: AMP-Genomics, Carat, …

Berkeley AMPLab

• “Launched” January 2011: 6 Year Plan

– 8 CS Faculty

– ~40 students

– 3 software engineers

• Organized for collaboration: Algorithms, Machines, People

Berkeley AMPLab

• Funding:

– XData, CISE Expedition Grant

– Industrial, founding sponsors

– 18 other sponsors

Goal: Next Generation of Analytics Data Stack for Industry & Research:

• Berkeley Data Analytics Stack (BDAS)

• Release as Open Source

Berkeley Data Analytics Stack (BDAS)

• Existing stack components…

[Diagram: Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI, …; Data Management: HDFS; Resource Management: (nothing listed)]

Mesos

• Management platform that allows multiple frameworks to share a cluster

• Compatible with existing open analytics stack

• Deployed in production at Twitter on 3,500+ servers

[Diagram: Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI, …; Data Management: HDFS; Resource Management: Mesos]

Spark

• In-memory framework for interactive and iterative computations

– Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction

• Scala interface, Java and Python APIs

[Diagram: Data Processing: Spark, Hadoop, HIVE, Pig, Storm, MPI, …; Data Management: HDFS; Resource Management: Mesos]

Spark Streaming [Alpha Release]

• Large scale streaming computation

• Ensures exactly-once semantics

• Integrated with Spark → unifies batch, interactive, and streaming computations!

[Diagram: Data Processing: Spark, Spark Streaming, Hadoop, HIVE, Pig, Storm, MPI; Data Management: HDFS; Resource Management: Mesos]

Shark → Spark SQL

• HIVE over Spark: SQL-like interface (supports Hive 0.9)

– up to 100x faster for in-memory data, and 5-10x for disk

• In tests on a cluster with hundreds of nodes at …

[Diagram: Data Processing: Spark, Spark Streaming, Shark, Hadoop, HIVE, Pig, Storm, MPI; Data Management: HDFS; Resource Management: Mesos]

Tachyon

• High-throughput, fault-tolerant in-memory storage

• Interface compatible with HDFS

• Support for Spark and Hadoop

[Diagram: Data Processing: Spark, Spark Streaming, Shark, Hadoop, HIVE, Pig, Storm, MPI; Data Management: HDFS, Tachyon; Resource Management: Mesos]

BlinkDB

• Large scale approximate query engine

• Allow users to specify error or time bounds

• Preliminary prototype starting to be tested at Facebook

[Diagram: Data Processing: Spark, Spark Streaming, Shark, BlinkDB, Hadoop, HIVE, Pig, Storm, MPI, …; Data Management: HDFS, Tachyon; Resource Management: Mesos]
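
The idea of answering from a sample and reporting an error bound can be shown in a few lines. The sketch below estimates a mean from a uniform random sample and attaches an approximate 95% bound using a normal approximation. It illustrates the general sampling technique only; it makes no claim about how BlinkDB selects samples or computes its bounds, and `ApproxMean`, `estimate`, and the synthetic data are assumptions.

```scala
import scala.util.Random

object ApproxMean {
  // Estimate the mean from a uniform random sample (with replacement) and return
  // (estimate, approximate 95% error bound = 1.96 standard errors).
  def estimate(data: IndexedSeq[Double], sampleSize: Int, rng: Random): (Double, Double) = {
    val sample = IndexedSeq.fill(sampleSize)(data(rng.nextInt(data.length)))
    val mean = sample.sum / sampleSize
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (sampleSize - 1)
    (mean, 1.96 * math.sqrt(variance / sampleSize))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42) // fixed seed so the run is repeatable
    // Synthetic column: values 0..99 repeated, so the exact mean is 49.5.
    val data = IndexedSeq.tabulate(1000000)(i => (i % 100).toDouble)
    val (mean, bound) = estimate(data, 10000, rng)
    println(f"estimate = $mean%.2f +/- $bound%.2f (exact = ${data.sum / data.length}%.2f)")
  }
}
```

Here only 1% of the rows are read; quadrupling the sample size would halve the error bound, which is the accuracy-versus-time trade that users control when they specify error or time bounds.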

SparkGraph

• GraphLab API and Toolkits on top of Spark

• Fault tolerance by leveraging Spark

[Diagram: Data Processing: Spark, Spark Streaming, Shark, BlinkDB, Spark Graph, Hadoop, HIVE, Pig, Storm, MPI, …; Data Management: HDFS, Tachyon; Resource Management: Mesos]

MLlib

• Declarative approach to ML

• Develop scalable ML algorithms

• Make ML accessible to non-experts

[Diagram: Data Processing: Spark, Spark Streaming, Shark, BlinkDB, Spark Graph, MLbase, Hadoop, HIVE, Pig, Storm, MPI, …; Data Management: HDFS, Tachyon; Resource Management: Mesos]

Compatible with Open Source Ecosystem

• Support existing interfaces whenever possible

[Diagram: Data Processing: Spark, Spark Streaming, Shark, BlinkDB, Spark Graph, MLbase, Hadoop, HIVE, Pig, Storm, MPI, …; Data Management: HDFS, Tachyon; Resource Management: Mesos. Annotations: GraphLab API; Hive interface and shell; HDFS API; compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos]

Compatible with Open Source Ecosystem

• Use existing interfaces whenever possible

[Diagram: Data Processing: Spark, Spark Streaming, Shark, BlinkDB, Spark Graph, MLbase, Hadoop, HIVE, Pig, Storm, MPI, …; Data Management: HDFS, Tachyon; Resource Management: Mesos. Annotations: support HDFS API, S3 API, and Hive metadata; support Hive API; accept inputs from Kafka, Flume, Twitter, TCP sockets, …]

Summary

• Support interactive and streaming computations

– In-memory, fault-tolerant storage abstraction, low-latency scheduling, ...

• Easy to combine batch, streaming, and interactive computations

– Spark execution engine supports all comp. models

• Easy to develop sophisticated algorithms

– Scala interface, APIs for Java, Python, Hive QL, …

– New frameworks targeted to graph-based and ML algorithms

• Compatible with existing open source ecosystem

• Open source (Apache/BSD) and fully committed to releasing high-quality software

– Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)

[Diagram: Spark at the center of Batch, Interactive, and Streaming]

Conclusion

• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics

• RDDs provide:

– Lineage info for fault recovery and debugging

– Adjustable in-memory caching

– Locality-aware parallel operations

• We plan to make Spark the basis of a suite of batch and interactive data analysis tools