A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011

Dec 19, 2015

Page 1: A Berkeley View of Big Data

Ion Stoica
UC Berkeley

BEARS
February 17, 2011

Page 2: Big Data is Massive…

• Facebook:
  – 130 TB/day: user logs
  – 200-400 TB/day: 83 million pictures
• Google: > 25 PB/day processed data
• Data generated by LHC: 1 PB/sec
• Total data created in 2010: 1 ZettaByte (1,000,000 PB)/year
  – ~60% increase every year

Page 3: …and Grows Bigger and Bigger!

• More and more devices
• More and more people
• Cheaper and cheaper storage
  – ~50% increase in GB/$ every year

Page 4: …and Grows Bigger and Bigger!

• Log everything!
  – Don’t always know what question you’ll need to answer
• Hard to decide what to delete
  – Thankless decision: people know only when you are wrong!
  – “Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations”
• Stored data grows faster than GB/$
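
The gap between data growth and storage price can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes the deck's ~60% yearly growth in created data applies to stored data as well, pairs it with the ~50% yearly improvement in GB/$, and uses a hypothetical function name.

```python
def relative_storage_spend(years, data_growth=0.60, gb_per_dollar_growth=0.50):
    """Spend needed to store everything, relative to year 0.

    Assumption: stored volume grows by data_growth per year while each
    dollar buys gb_per_dollar_growth more capacity per year.
    """
    yearly_factor = (1 + data_growth) / (1 + gb_per_dollar_growth)
    return yearly_factor ** years


for years in (1, 5, 10):
    print(years, round(relative_storage_spend(years), 2))
```

Under these assumptions the bill rises by roughly 7% per year even though storage keeps getting cheaper, and it nearly doubles within a decade.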

Page 5: What is Big Data?

• You don’t need to be big to have a big data problem!
  – Inadequate tools to analyze data
  – Data management may dominate infrastructure cost

Data that is expensive to manage, and hard to extract value from

Page 6: Big Data is not Cheap!

• Storing and managing 1 PB of data: $500K-$1M/year
  – Facebook: 200 PB/year
• “Typical” cloud-based service startup (e.g., Conviva)
  – Log storage dominates infrastructure cost

[Figure: share of infrastructure cost (0%-100%), storage cluster vs. other, 2007-2010; ~1 PB storage capacity]
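
A quick back-of-the-envelope calculation shows what these figures imply. This sketch assumes the per-petabyte cost scales linearly with volume and reads the slide's 200 PB/year as the volume Facebook has to store and manage; the function name is hypothetical.

```python
def yearly_storage_cost(petabytes, cost_per_pb_low=500_000, cost_per_pb_high=1_000_000):
    """Return the (low, high) yearly cost in dollars for a given volume.

    Assumption: cost grows linearly with the number of petabytes.
    """
    return petabytes * cost_per_pb_low, petabytes * cost_per_pb_high


low, high = yearly_storage_cost(200)
print(f"${low / 1e6:.0f}M-${high / 1e6:.0f}M per year")
```

Under these assumptions, 200 PB works out to roughly $100M-$200M per year.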

Page 7: Hard to Extract Value from Data!

• Data is
  – Diverse, variety of sources
  – Uncurated, no schema, inconsistent semantics, syntax
  – Integration a huge challenge
• No easy way to get answers that are
  – High-quality
  – Timely
• Challenge: maximize value from data by getting best possible answers

Page 8: Requires Multifaceted Approach

• Three dimensions to improve data analysis
  – Improving scale, efficiency, and quality of algorithms (Algorithms)
  – Scaling up datacenters (Machines)
  – Leverage human activity and intelligence (People)
• Need to adaptively and flexibly combine all three dimensions

Page 9: Algorithms, Machines, People

• Today’s apps: fixed point in solution space

[Figure: solution space with axes Algorithms, Machines, People; example points: search, Watson/IBM]

Need techniques to dynamically pick best operating point

Page 10: The AMP Lab

[Figure: solution space with axes Algorithms, Machines, People; example points: search, Watson/IBM]

Make sense of data at scale by tightly integrating algorithms, machines, and people

Page 11: AMP Faculty and Sponsors

• Faculty
  – Alex Bayen (mobile sensing platforms)
  – Armando Fox (systems)
  – Michael Franklin (databases): Director
  – Michael Jordan (machine learning): Co-director
  – Anthony Joseph (security & privacy)
  – Randy Katz (systems)
  – David Patterson (systems)
  – Ion Stoica (systems): Co-director
  – Scott Shenker (networking)
• Sponsors:

Page 12: Algorithms

• State-of-the-art Machine Learning (ML) algorithms do not scale
  – Prohibitive to process all data points

[Figure: estimate vs. # of data points, with the true answer marked]

How do you know when to stop?

Page 13: Algorithms

• Given any problem, data and a budget
  – Immediate results with continuous improvement
  – Calibrate answer: provide error bars

[Figure: estimate vs. # of data points, with the true answer marked]

Error bars on every answer!

Page 14: Algorithms

• Given any problem, data and a time budget
  – Immediate results with continuous improvement
  – Calibrate answer: provide error bars

[Figure: estimate vs. time (# of data points), with the true answer marked]

Stop when error smaller than a given threshold
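
One way to picture "immediate results with error bars" is a running estimate of a mean that stops as soon as its confidence interval is tight enough. The sketch below is an illustration only, not the method from the talk: it assumes independent samples, a normal-approximation 95% interval, and a hypothetical function name and synthetic data.

```python
import math
import random


def estimate_mean_until(stream, max_error, z=1.96, min_samples=30):
    """Consume values until the error bar drops below max_error.

    Returns (estimate, half_width, samples_used). The half-width is
    z * standard error, i.e. a normal-approximation confidence interval.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    half_width = math.inf
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            variance = m2 / (n - 1)
            half_width = z * math.sqrt(variance / n)
            if half_width < max_error:
                break  # error bar is small enough: stop early
    return mean, half_width, n


# Synthetic data standing in for a large dataset (assumption for the demo).
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(100_000)]
estimate, error, used = estimate_mean_until(iter(data), max_error=0.1)
print(f"{estimate:.2f} +/- {error:.2f} after {used} of {len(data)} points")
```

The loop touches only as many points as the requested error requires, which is the point of the slide: the threshold, not the dataset size, decides when to stop.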

Page 15: Algorithms

• Given any problem, data and a time budget
  – Automatically pick the best algorithm

[Figure: estimate vs. time for a simple and a sophisticated algorithm, with the true answer marked; annotations: "pick sophisticated", "pick simple", "error too high"]
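
The selection step can be sketched as a lookup over predicted error curves. Everything here is hypothetical: the error profiles are made-up functions of the time budget, and the talk does not say how such profiles would be obtained.

```python
from typing import Callable, Dict, Optional

# Maps a time budget (seconds) to a predicted error. Hypothetical profiles.
ErrorProfile = Callable[[float], float]


def pick_algorithm(profiles: Dict[str, ErrorProfile],
                   time_budget: float,
                   max_error: float) -> Optional[str]:
    """Return the algorithm with the lowest predicted error at the budget.

    Returns None when every algorithm exceeds max_error ("error too high").
    """
    best_name, best_error = None, max_error
    for name, predicted_error in profiles.items():
        err = predicted_error(time_budget)
        if err <= best_error:
            best_name, best_error = name, err
    return best_name


profiles = {
    # Converges quickly but plateaus at a higher error.
    "simple": lambda t: 0.10 + 1.0 / (1.0 + t),
    # Slow to start, but reaches a lower error given enough time.
    "sophisticated": lambda t: 0.01 + 20.0 / (1.0 + t),
}

for budget in (1, 9, 999):
    print(budget, pick_algorithm(profiles, budget, max_error=0.5))
```

With these made-up profiles, a very short budget yields no acceptable answer, a medium budget favors the simple algorithm, and a long budget favors the sophisticated one.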

Page 16: Machines

• “The datacenter as a computer” still in its infancy
  – Special purpose clusters, e.g., Hadoop cluster
  – Highly variable performance
  – Hard to program
  – Hard to debug

Page 17: Machines

• Make datacenter a real computer!
• Share datacenter between multiple cluster computing apps
• Provide new abstractions and services

[Figure: layered stack. Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos). Legend: AMP stack, Existing stack]

Page 18: Machines

• Make datacenter a real computer!
• Support existing cluster computing apps

[Figure: layered stack. Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos); Hadoop, MPI, Hypertable, Cassandra, Hive. Legend: AMP stack, Existing stack]

Page 19: Machines

• Make datacenter a real computer!
• Spark: support interactive and iterative data analysis (e.g., ML algorithms)
• SCADS: consistency adjustable data store
• PIQL: predictive & insightful query language

[Figure: layered stack. Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos); Hadoop, MPI, Hypertable, Cassandra, Hive, Spark, SCADS, PIQL. Legend: AMP stack, Existing stack]

Page 20: Machines

• Make datacenter a real computer!
• Advanced ML algorithms
• Interactive data mining
• Collaborative visualization

[Figure: layered stack. Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos); Hadoop, MPI, Hypertable, Cassandra, Hive, Spark, SCADS, PIQL; Applications, tools. Legend: AMP stack, Existing stack]

Page 21: People

• Humans can make sense of messy data!

Page 22: People

• Make people an integrated part of the system!
  – Leverage human activity
  – Leverage human intelligence (crowdsourcing):
    • Curate and clean dirty data
    • Answer imprecise questions
    • Test and improve algorithms
• Challenge
  – Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)

[Figure: loop between people and Machines + Algorithms; arrow labels: data, activity; Questions; Answers]
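
A common, simple way to cope with inconsistent crowd answers is to ask several people the same question and keep the majority answer only when agreement is high enough. The sketch below illustrates that general idea; it is not taken from the talk, and the function name and the 0.6 agreement threshold are assumptions.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.6):
    """Majority vote over crowd answers to one question.

    Returns (answer, agreement). The answer is None when there are no
    answers or when the winning share is below min_agreement.
    """
    if not answers:
        return None, 0.0
    label, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement < min_agreement:
        return None, agreement  # too inconsistent: ask more people
    return label, agreement


print(aggregate_answers(["cat", "cat", "dog"]))
print(aggregate_answers(["yes", "no"]))
```

Returning the agreement alongside the answer lets the rest of the pipeline decide whether to accept the result, collect more answers, or fall back to an algorithm.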

Page 23: Real Applications

• Mobile Millennium Project
  – Alex Bayen, Civil and Environmental Engineering, UC Berkeley
• Microsimulation of urban development
  – Paul Waddell, College of Environmental Design, UC Berkeley
• Crowd based opinion formation
  – Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley
• Personalized Sequencing
  – Taylor Sittler, UCSF

Page 24: Personalized Sequencing

Page 25: The AMP Lab

[Figure: solution space with axes Algorithms, Machines, People; applications shown: Sequencing, Microsimulation, Mobile Millennium]

Make sense of data at scale by tightly integrating algorithms, machines, and people

Page 26: Big Data in 2020

Almost Certainly:
• Create a new generation of big data scientists
• A real datacenter OS
• ML becoming an engineering discipline
• People deeply integrated in big data analysis pipeline

If We’re Lucky:
• System will know what to throw away
• Generate new knowledge that an individual person cannot

Page 27: Summary

• Goal: Tame Big Data Problem
  – Get results with right quality at the right time
• Approach: Holistically integrate Algorithms, Machines, and People
• Huge research issues across many domains