A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011

Feb 25, 2016

Transcript
Page 1: A Berkeley View of Big Data

A Berkeley View of Big Data

Ion Stoica, UC Berkeley

BEARS, February 17, 2011

Page 2: A Berkeley View of Big Data

Big Data is Massive…
• Facebook:
  – 130TB/day: user logs
  – 200-400TB/day: 83 million pictures
• Google: > 25 PB/day processed data
• Data generated by LHC: 1 PB/sec
• Total data created in 2010: 1.ZettaByte (1,000,000 PB)/year
  – ~60% increase every year

Page 3: A Berkeley View of Big Data

…and Grows Bigger and Bigger!
• More and more devices
• More and more people
• Cheaper and cheaper storage
  – ~50% increase in GB/$ every year

Page 4: A Berkeley View of Big Data

…and Grows Bigger and Bigger!
• Log everything!
  – Don’t always know what question you’ll need to answer
• Hard to decide what to delete
  – Thankless decision: people know only when you are wrong!
  – “Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations”
• Stored data grows faster than GB/$
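The gap between the two growth rates can be put into numbers with a back-of-the-envelope sketch. It uses the rates quoted on the slides (~60% more data per year, ~50% more GB/$ per year) and assumes, purely for illustration, that stored data grows at the same rate as data created and that both rates stay constant. Neither assumption comes from the talk.

```python
# Rates quoted on the slides; holding them constant over time is an assumption.
DATA_GROWTH = 0.60           # ~60% more data every year
GB_PER_DOLLAR_GROWTH = 0.50  # ~50% more GB per dollar every year


def relative_storage_cost(years):
    """Cost of storing all data after `years`, relative to the cost in year 0."""
    data = (1 + DATA_GROWTH) ** years
    gb_per_dollar = (1 + GB_PER_DOLLAR_GROWTH) ** years
    return data / gb_per_dollar


for y in (1, 5, 10):
    print(f"year {y:2d}: {relative_storage_cost(y):.2f}x the year-0 cost")
```

Under these assumptions the storage bill grows by a factor of 1.6/1.5 each year, so it keeps rising even though storage keeps getting cheaper.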

Page 5: A Berkeley View of Big Data

What is Big Data?

• You don’t need to be big to have a big data problem!
  – Inadequate tools to analyze data
  – Data management may dominate infrastructure cost

Data that is expensive to manage, and hard to extract value from

Page 6: A Berkeley View of Big Data

Big Data is not Cheap!
• Storing and managing 1PB data: $500K-$1M/year
  – Facebook: 200 PB/year
• “Typical” cloud-based service startup (e.g., Conviva)
  – Log storage dominates infrastructure cost

[Chart: share of infrastructure cost (0% to 100%), storage cluster vs. other, 2007 to 2010; ~1PB storage capacity]
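As a worked example of the scale involved, the two figures on this slide can be multiplied together. The sketch assumes the per-petabyte cost applies linearly at Facebook's volume, which the slide does not state; the function name is hypothetical.

```python
# Figures quoted on the slide.
COST_PER_PB_YEAR_USD = (500_000, 1_000_000)  # $500K-$1M per PB per year
FACEBOOK_PB_PER_YEAR = 200


def yearly_cost_range(petabytes):
    """Return (low, high) yearly cost in USD, assuming linear scaling."""
    low, high = COST_PER_PB_YEAR_USD
    return petabytes * low, petabytes * high


low, high = yearly_cost_range(FACEBOOK_PB_PER_YEAR)
print(f"${low / 1e6:.0f}M to ${high / 1e6:.0f}M per year")
# prints "$100M to $200M per year"
```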

Page 7: A Berkeley View of Big Data

Hard to Extract Value from Data!
• Data is
  – Diverse, variety of sources
  – Uncurated, no schema, inconsistent semantics, syntax
  – Integration a huge challenge
• No easy way to get answers that are
  – High-quality
  – Timely
• Challenge: maximize value from data by getting best possible answers

Page 8: A Berkeley View of Big Data

Requires Multifaceted Approach
• Three dimensions to improve data analysis
  – Improving scale, efficiency, and quality of algorithms (Algorithms)
  – Scaling up datacenters (Machines)
  – Leverage human activity and intelligence (People)
• Need to adaptively and flexibly combine all three dimensions

Page 9: A Berkeley View of Big Data

Algorithms, Machines, People
• Today’s apps: fixed point in solution space

Need techniques to dynamically pick best operating point

[Diagram: solution space with axes Algorithms, Machines, and People; example points: search, Watson/IBM]

Page 10: A Berkeley View of Big Data

The AMP Lab

Make sense of data at scale by tightly integrating algorithms, machines, and people

[Diagram: axes Algorithms, Machines, and People; example points: search, Watson/IBM]

Page 11: A Berkeley View of Big Data

AMP Faculty and Sponsors
• Faculty
  – Alex Bayen (mobile sensing platforms)
  – Armando Fox (systems)
  – Michael Franklin (databases): Director
  – Michael Jordan (machine learning): Co-director
  – Anthony Joseph (security & privacy)
  – Randy Katz (systems)
  – David Patterson (systems)
  – Ion Stoica (systems): Co-director
  – Scott Shenker (networking)
• Sponsors:

Page 12: A Berkeley View of Big Data

Algorithms
• State-of-the-art Machine Learning (ML) algorithms do not scale
  – Prohibitive to process all data points

How do you know when to stop?

[Plot: estimate vs. # of data points, converging toward the true answer]

Page 13: A Berkeley View of Big Data

Algorithms
• Given any problem, data and a budget
  – Immediate results with continuous improvement
  – Calibrate answer: provide error bars

Error bars on every answer!

[Plot: estimate with error bars vs. # of data points, converging toward the true answer]

Page 14: A Berkeley View of Big Data

Algorithms
• Given any problem, data and a time budget
  – Immediate results with continuous improvement
  – Calibrate answer: provide error bars

Stop when error smaller than a given threshold

[Plot: estimate with error bars vs. # of data points / time, converging toward the true answer]
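The idea on the last three slides (immediate results, error bars on every answer, stop once the error is below a threshold or the budget runs out) can be sketched for the simplest possible case, estimating a mean. This is only an illustration, not the AMP Lab's method: it assumes independent samples and a normal-approximation error bar, and the function name, parameters, and synthetic data are all hypothetical.

```python
import math
import random


def estimate_mean(stream, threshold, max_points, z=1.96, min_points=30):
    """Running mean with an error bar; stops early once the bar is tight.

    Returns (estimate, half_width, points_used). `max_points` plays the
    role of the budget; `threshold` is the target error-bar half width.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    half_width = math.inf
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= max(min_points, 2):
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err  # ~95% interval under the normal assumption
            if half_width <= threshold:
                break  # error bar is small enough: stop early
        if n >= max_points:
            break  # budget exhausted: return the best answer so far
    return mean, half_width, n


# Synthetic data (an assumption for the demo): true answer is 10.0.
rng = random.Random(0)
data = (rng.gauss(10.0, 2.0) for _ in range(1_000_000))
est, err, used = estimate_mean(data, threshold=0.1, max_points=100_000)
print(f"estimate = {est:.3f} +/- {err:.3f} after {used} points")
```

The caller always gets an answer together with its error bar, and only as many data points are read as the threshold or the budget requires.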

Page 15: A Berkeley View of Big Data

Algorithms
• Given any problem, data and a time budget
  – Automatically pick the best algorithm

[Plot: estimate vs. time for a simple and a sophisticated algorithm, relative to the true answer; annotations: "pick sophisticated", "pick simple", "error too high"]
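One way to read "automatically pick the best algorithm" is as a selection over predicted error at the given time budget. The sketch below assumes each candidate comes with a model of its error as a function of running time; the `Candidate` type, the `pick_algorithm` function, and both error curves are made up for illustration and are not taken from the talk or from the slide's plot.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    # Hypothetical model: predicted error after running for t seconds.
    predicted_error: Callable[[float], float]


def pick_algorithm(candidates: Sequence[Candidate], time_budget: float) -> Candidate:
    """Return the candidate with the lowest predicted error at the budget."""
    return min(candidates, key=lambda c: c.predicted_error(time_budget))


# Made-up curves: "simple" improves quickly but plateaus at a higher error,
# "sophisticated" starts out worse but reaches a lower error given more time.
simple = Candidate("simple", lambda t: 0.10 + 1.0 / (1.0 + t))
sophisticated = Candidate("sophisticated", lambda t: 0.01 + 5.0 / (1.0 + t))

for budget in (1.0, 100.0):
    choice = pick_algorithm([simple, sophisticated], budget)
    print(f"budget {budget:>5}s -> {choice.name}")
```

With these particular curves the choice flips as the budget grows; where the crossover falls in practice depends entirely on the real error models.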

Page 16: A Berkeley View of Big Data

Machines
• “The datacenter as a computer” still in its infancy
  – Special purpose clusters, e.g., Hadoop cluster
  – Highly variable performance
  – Hard to program
  – Hard to debug

[Figure: datacenter “=?” computer]

Page 17: A Berkeley View of Big Data

Machines
• Make datacenter a real computer!
• Share datacenter between multiple cluster computing apps
• Provide new abstractions and services

[Diagram, layers from bottom to top: Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos). Labels: AMP stack, existing stack]
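To illustrate what sharing a datacenter between multiple cluster computing apps can mean, here is a toy allocator that splits a pool of slots using max-min fairness. It is not how Mesos works (the slides give no scheduling details); max-min fairness is swapped in as a simple, well-known policy, and the slot counts, demands, and function name are invented for the example.

```python
def share_slots(capacity, demands):
    """Split `capacity` slots among apps by max-min fairness.

    `demands` maps app name -> number of slots requested. Slots are handed
    out one at a time, round-robin, to apps that still want more, so no app
    gets more than it asked for and leftover capacity goes to the others.
    """
    allocation = {app: 0 for app in demands}
    remaining = capacity
    while remaining > 0:
        hungry = [app for app in demands if allocation[app] < demands[app]]
        if not hungry:
            break  # every demand is satisfied; some capacity stays idle
        for app in hungry:
            if remaining == 0:
                break
            allocation[app] += 1
            remaining -= 1
    return allocation


# Hypothetical demands on a 100-slot cluster.
print(share_slots(100, {"Hadoop": 70, "MPI": 20, "Hive": 50}))
# prints {'Hadoop': 40, 'MPI': 20, 'Hive': 40}
```

The small MPI job is fully served, and the two larger jobs split what is left evenly instead of each needing a dedicated cluster.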

Page 18: A Berkeley View of Big Data

Machines
• Make datacenter a real computer!

Support existing cluster computing apps

[Diagram, layers from bottom to top: Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos); Hadoop, MPI, Hypertable, …, Cassandra, Hive. Labels: AMP stack, existing stack]

Page 19: A Berkeley View of Big Data

Machines
• Make datacenter a real computer!
  – Spark: support interactive and iterative data analysis (e.g., ML algorithms)
  – SCADS: consistency adjustable data store
  – PIQL: predictive & insightful query language

[Diagram, layers from bottom to top: Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos); Hadoop, MPI, Hypertable, …, Cassandra, Hive, Spark, SCADS, PIQL. Labels: AMP stack, existing stack]

Page 20: A Berkeley View of Big Data

Machines
• Make datacenter a real computer!

Applications, tools
• Advanced ML algorithms
• Interactive data mining
• Collaborative visualization

[Diagram, layers from bottom to top: Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux); Datacenter “OS” (e.g., Mesos); Hadoop, MPI, Hypertable, …, Cassandra, Hive, Spark, SCADS, PIQL; Applications, tools. Labels: AMP stack, existing stack]

Page 21: A Berkeley View of Big Data

People
• Humans can make sense of messy data!

Page 22: A Berkeley View of Big Data

People
• Make people an integrated part of the system!
  – Leverage human activity
  – Leverage human intelligence (crowdsourcing):
    • Curate and clean dirty data
    • Answer imprecise questions
    • Test and improve algorithms
• Challenge
  – Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)

[Diagram: people supply data and activity to Machines + Algorithms; questions go out to people and answers come back]
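A common way to cope with inconsistent answer quality in crowdsourcing is to ask several people the same question and keep the majority answer only when agreement is high enough. The sketch below shows that technique as an illustration; it is not described in the talk, and the function name, the 60% agreement threshold, and the sample answers are assumptions.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.6):
    """Majority vote over crowd answers to a single question.

    Returns (answer, agreement). The answer is None when the most common
    response falls below `min_agreement`, signalling that the question
    should be re-asked or escalated rather than trusted.
    """
    if not answers:
        return None, 0.0
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return (top if agreement >= min_agreement else None), agreement


print(aggregate_answers(["cat", "cat", "dog", "cat", "cat"]))
# prints ('cat', 0.8)
print(aggregate_answers(["cat", "dog"]))
# prints (None, 0.5)
```

Raising the threshold buys higher answer quality at the price of more workers per question, which is the kind of quality, time, and cost trade-off the slide refers to.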

Page 23: A Berkeley View of Big Data

Real Applications
• Mobile Millennium Project
  – Alex Bayen, Civil and Environmental Engineering, UC Berkeley
• Microsimulation of urban development
  – Paul Waddell, College of Environmental Design, UC Berkeley
• Crowd based opinion formation
  – Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley
• Personalized Sequencing
  – Taylor Sittler, UCSF

Page 24: A Berkeley View of Big Data

Personalized Sequencing


Page 25: A Berkeley View of Big Data

The AMP Lab

Make sense of data at scale by tightly integrating algorithms, machines, and people

[Diagram: axes Algorithms, Machines, and People, with applications: Sequencing, Microsimulation, Mobile Millennium]

Page 26: A Berkeley View of Big Data

Big Data in 2020

Almost Certainly:
• Create a new generation of big data scientists
• A real datacenter OS
• ML becoming an engineering discipline
• People deeply integrated in big data analysis pipeline

If We’re Lucky:
• System will know what to throw away
• Generate new knowledge that an individual person cannot

Page 27: A Berkeley View of Big Data

Summary
• Goal: Tame Big Data Problem
  – Get results with right quality at the right time
• Approach: Holistically integrate Algorithms, Machines, and People
• Huge research issues across many domains