Carnegie Mellon
A Framework for Machine Learning and Data Mining in the Cloud
Yucheng Low, Aapo Kyrola, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Joe Hellerstein
Big Data is Everywhere
YouTube: 72 hours of video uploaded per minute
Wikipedia: 28 million pages
Facebook: 900 million users
Flickr: 6 billion photos
“… data a new class of economic asset, like currency or gold.”
“…growing at 50 percent a year…”
How will we design and implement Big Learning systems?
Big Learning
Shift Towards Use Of Parallelism in ML
GPUs Multicore Clusters Clouds Supercomputers
ML experts repeatedly solve the same parallel design challenges:
Race conditions, distributed state, communication…
Resulting code is very specialized: difficult to maintain, extend, debug…
Graduate students
Avoid these problems by using high-level abstractions
MapReduce – Map Phase
[Figure: CPUs 1–4 each process an independent slice of the data, emitting partial results (12.9, 42.3, 21.3, 25.8, 24.1, 84.3, 18.4, 84.4, 17.5, 67.5, 14.9, 34.3).]
Embarrassingly parallel: independent computation, no communication needed.
MapReduce – Reduce Phase
[Figure: CPUs 1 and 2 fold the mappers' partial results into the aggregates 2226.26 and 1726.31.]
[Figure: example pipeline. Images labeled Attractive (A) or Ugly (U); the map phase extracts image features, and the reduce phase aggregates them into attractive-face and ugly-face statistics.]
MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks!
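As a concrete illustration, here is a minimal single-process sketch of the map/reduce pattern (the helper name `map_reduce` and the toy data are ours, not part of Hadoop or any framework):

```python
from collections import defaultdict
from functools import reduce

def map_reduce(records, mapper, reducer):
    # Map phase: embarrassingly parallel. Each record is processed
    # independently, emitting (key, value) pairs; no communication needed.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: aggregate all values that share a key.
    return {k: reduce(reducer, vs) for k, vs in groups.items()}

# Toy version of the face-statistics example: sum a feature per label.
data = [("A", 12.9), ("U", 42.3), ("A", 21.3), ("U", 25.8)]
totals = map_reduce(data, lambda r: [(r[0], r[1])], lambda a, b: a + b)
```

In a real framework the map calls run on different machines; the single loop above stands in for that distribution.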
Data-Parallel (MapReduce):
- Cross Validation
- Feature Extraction
- Computing Sufficient Statistics

Graph-Parallel:
- Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
- Semi-Supervised Learning: Label Propagation, CoEM
- Graph Analysis: PageRank, Triangle Counting
- Collaborative Filtering: Tensor Factorization

Is there more to Machine Learning?
Exploit Dependencies
[Figure: a social network where a user's interests (Hockey vs. Scuba Diving) can be inferred from friends' interests; exploiting dependencies even reveals niche communities such as Underwater Hockey.]
Graphs are Everywhere
- Collaborative Filtering (Netflix): Users × Movies
- Text Analysis (Wiki): Docs × Words
- Probabilistic Analysis: Social Networks
Properties of Computation on Graphs
- Dependency graph (my interests depend on my friends' interests)
- Local updates
- Iterative computation
ML Tasks Beyond Data-Parallelism
Data-Parallel (MapReduce):
- Cross Validation
- Feature Extraction
- Computing Sufficient Statistics

Graph-Parallel:
- Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
- Semi-Supervised Learning: Label Propagation, CoEM
- Graph Analysis: PageRank, Triangle Counting
- Collaborative Filtering: Tensor Factorization
Algorithms implemented on GraphLab: Bayesian Tensor Factorization, Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, LDA, Linear Solvers, Splash Sampler, Alternating Least Squares, …many others…
2010: Shared Memory
- Limited CPU power
- Limited memory
- Limited scalability

Distributed Cloud
Unlimited computation resources! (up to funding limitations)
New challenges:
- Distributing state
- Data consistency
- Fault tolerance
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
Data Graph
Data is associated with vertices and edges.
- Graph: social network
- Vertex data: user profile, current interest estimates
- Edge data: relationship (friend, classmate, relative)
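A minimal sketch of the data-graph idea (our own illustrative class, not GraphLab's actual C++ API): arbitrary user data attached to vertices and edges.

```python
class DataGraph:
    """Sketch of a data graph: user data lives on vertices and edges."""
    def __init__(self):
        self.vertex_data = {}   # vertex -> data (e.g. interest estimates)
        self.edge_data = {}     # (u, v) -> data (e.g. relationship type)
        self.adj = {}           # vertex -> set of neighboring vertices

    def add_vertex(self, v, data=None):
        self.vertex_data[v] = data
        self.adj.setdefault(v, set())

    def add_edge(self, u, v, data=None):
        for w in (u, v):                     # ensure both endpoints exist
            self.adj.setdefault(w, set())
            self.vertex_data.setdefault(w, None)
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.edge_data[(u, v)] = data

# Tiny social-network example in the spirit of the slide:
g = DataGraph()
g.add_vertex("alice", {"interests": ["hockey"]})
g.add_vertex("bob", {"interests": ["scuba"]})
g.add_edge("alice", "bob", {"relationship": "friend"})
```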
Distributed Graph
- Partition the graph across multiple machines.
- "Ghost" vertices maintain the adjacency structure and replicate remote data.
- Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / …).
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
Update Functions
A user-defined program, applied to a vertex, that transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  // (standard PageRank: damped sum of the neighbors' ranks)
  vertex.PageRank = 0.15 + 0.85 * sum(neighbor.PageRank * edge.weight)
  // Reschedule neighbors if needed
  if vertex.PageRank changes then reschedule_all_neighbors;
}

Dynamic computation: the update function is applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation.
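A runnable single-machine sketch of this dynamic, scheduler-driven PageRank (function and parameter names are ours; GraphLab's engine is distributed and written in C++). Only vertices whose value changes reschedule their neighbors:

```python
from collections import deque

def dynamic_pagerank(out_links, damping=0.85, tol=1e-6):
    # Build reverse adjacency so each vertex can read its in-neighbors.
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    n = len(out_links)
    pr = {v: 1.0 / n for v in out_links}
    scheduled = deque(out_links)      # the scheduler: pending vertices
    in_queue = set(out_links)
    while scheduled:                  # repeats until the scheduler is empty
        v = scheduled.popleft()
        in_queue.discard(v)
        new = (1 - damping) / n + damping * sum(
            pr[u] / len(out_links[u]) for u in in_links[v])
        if abs(new - pr[v]) > tol:    # reschedule neighbors if changed
            for w in out_links[v]:
                if w not in in_queue:
                    scheduled.append(w)
                    in_queue.add(w)
        pr[v] = new
    return pr

# 3-cycle: PageRank is uniform, so the schedule drains immediately.
ranks = dynamic_pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```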
Why Dynamic?
Asynchronous Belief Propagation on a graphical model:
[Figure: cumulative vertex update counts; some regions need many updates, others few. The algorithm identifies and focuses on the hidden sequential structure.]
Challenge = Boundaries
Shared Memory Dynamic Schedule
[Figure: a shared scheduler holds pending vertices (a, b, h, i, …); CPU 1 and CPU 2 pull vertices from the scheduler, run the update function, and push rescheduled neighbors back. The process repeats until the scheduler is empty.]
Distributed Scheduling
[Figure: the graph, and its schedule, are partitioned across machines.]
Each machine maintains a schedule over the vertices it owns. Distributed consensus is used to identify completion.
Ensuring Race-Free Code
How much can computation overlap?
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
PageRank Revisited

Pagerank(scope) {
  …
}

Racing PageRank
This was actually encountered in user code: without consistency guarantees, concurrent update functions race on reads and writes of shared vertex data, producing subtle bugs. In the buggy version the vertex's PageRank is used as an in-place accumulator, so neighbors can observe intermediate values; the fix stages the computation in a temporary (tmp) and writes the vertex value once.
Throughput != Performance
No consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.
Racing Collaborative Filtering
[Figure: training RMSE (log scale, 0.01–1) vs. time (0–1600 s) for d=20 with and without consistency; the racing run converges to a higher error.]
Serializability
For every parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: updates interleaved on CPU 1 and CPU 2 in parallel correspond to an equivalent single-CPU sequential ordering over time.]
Serializability Example
[Figure: under edge consistency, each update function gets write access to its vertex and adjacent edges, and read access to adjacent vertices.]
Update functions one vertex apart can be run in parallel: the overlapping regions are only read.
Stronger / weaker consistency levels are available; user-tunable consistency levels trade off parallelism and consistency.
Distributed Consistency
- Solution 1: Graph Coloring
- Solution 2: Distributed Locking
Edge Consistency via Graph Coloring
Vertices of the same color are all at least one vertex apart. Therefore, all vertices of the same color can be run in parallel!
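A sketch of the idea (greedy coloring is our stand-in; the slides assume a coloring is supplied, and `greedy_color` is not a GraphLab API):

```python
def greedy_color(adj):
    # Assign each vertex the smallest color unused by its neighbors.
    color = {}
    for v in sorted(adj):                       # deterministic order
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
color = greedy_color(adj)

# Same-color vertices share no edge, so their edge-consistent scopes
# are disjoint and each batch below may safely run in parallel.
for c in sorted(set(color.values())):
    batch = [v for v in color if color[v] == c]
```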
Chromatic Distributed Engine
[Figure: timeline on each machine. Execute tasks on all vertices of color 0; ghost synchronization completion + barrier; execute tasks on all vertices of color 1; ghost synchronization completion + barrier; and so on through the colors.]
Matrix Factorization: Netflix Collaborative Filtering
Alternating Least Squares (ALS) matrix factorization. Model: 0.5 million nodes, 99 million edges.
[Figure: bipartite Users × Movies graph; each vertex holds a latent factor vector of dimension d.]
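A toy sketch of ALS on the bipartite rating graph (d = 1 latent factor for brevity, invented data; this is the alternating idea only, not GraphLab's implementation): fix the movie factors and solve for each user in closed form, then swap.

```python
# Ratings live on edges of the Users x Movies graph.
ratings = {("u1", "m1"): 4.0, ("u1", "m2"): 2.0,
           ("u2", "m1"): 4.0, ("u2", "m2"): 2.0}
users = {"u1", "u2"}
movies = {"m1", "m2"}
x = {u: 1.0 for u in users}     # user factors
y = {m: 1.0 for m in movies}    # movie factors
lam = 1e-6                      # small ridge term for stability

for _ in range(50):             # alternate the two least-squares solves
    for u in users:             # fix y, solve for x[u] (scalar closed form)
        es = [(m, r) for (uu, m), r in ratings.items() if uu == u]
        x[u] = sum(y[m] * r for m, r in es) / (
            sum(y[m] ** 2 for m, _ in es) + lam)
    for m in movies:            # fix x, solve for y[m]
        es = [(u, r) for (u, mm), r in ratings.items() if mm == m]
        y[m] = sum(x[u] * r for u, r in es) / (
            sum(x[u] ** 2 for u, _ in es) + lam)

pred = x["u1"] * y["m1"]        # reconstructed rating
```

Since the toy rating matrix is rank 1, the factors recover it almost exactly.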
Netflix Collaborative Filtering
[Figure, left: speedup vs. # machines for D=100 and D=20 against ideal scaling. Figure, right: runtime vs. # machines for Hadoop, MPI, and GraphLab (the cost of Hadoop).]
CoEM (Rosie Jones, 2005): Named Entity Recognition Task
Is "Cat" an animal? Is "Istanbul" a place?
[Figure: bipartite graph linking noun phrases ("the cat", "Australia", "Istanbul") to contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]
Vertices: 2 million. Edges: 200 million.

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time)
Problems:
- Requires a graph coloring to be available.
- Frequent barriers make it extremely inefficient for highly dynamic systems where only a small number of vertices are active in each round.
Distributed Consistency
- Solution 1: Graph Coloring
- Solution 2: Distributed Locking
Distributed Locking
Edge consistency can be guaranteed through locking: each vertex carries a reader-writer (RW) lock.

Consistency Through Locking
Acquire a write-lock on the center vertex and read-locks on adjacent vertices.
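A minimal sketch of this scheme with per-vertex reader-writer locks (our own toy RW lock, not GraphLab's C++ internals): write-lock the center, read-lock the neighbors, acquiring in a canonical sorted order so overlapping scopes cannot deadlock.

```python
import threading

class RWLock:
    """Toy reader-writer lock built on a condition variable."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0
        self._no_readers = threading.Condition(self._mutex)

    def acquire_read(self):
        with self._mutex:
            self._readers += 1

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._no_readers.notify_all()

    def acquire_write(self):
        self._mutex.acquire()          # blocks new readers and writers
        while self._readers:
            self._no_readers.wait()    # wait out in-flight readers

    def release_write(self):
        self._mutex.release()

def lock_scope(locks, center, neighbors):
    # Canonical (sorted) acquisition order prevents deadlock
    # between overlapping scopes.
    for v in sorted({center, *neighbors}):
        locks[v].acquire_write() if v == center else locks[v].acquire_read()

def unlock_scope(locks, center, neighbors):
    for v in {center, *neighbors}:
        locks[v].release_write() if v == center else locks[v].release_read()

locks = {v: RWLock() for v in "abcd"}
lock_scope(locks, "a", ["b", "c"])     # write a, read b and c
# ... run the update function on vertex a's scope here ...
unlock_scope(locks, "a", ["b", "c"])
```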
Consistency Through Locking
- Multicore setting: PThread RW-locks.
- Distributed setting: distributed locks. Challenge: latency of remote lock requests.
Solution: pipelining.
[Figure: vertices A, B, C, D partitioned across Machine 1 and Machine 2; locking a scope that spans machines requires high-latency remote requests.]
No Pipelining
[Timeline: lock scope 1 → process request 1 → scope 1 acquired → update_function 1 → release scope 1 → process release 1. Each scope is locked, updated, and released before the next request is issued.]
Pipelining / Latency Hiding
Hide latency using pipelining.
[Timeline: lock requests for scopes 1, 2, and 3 are issued back-to-back; while the later acquisitions are in flight, update_function 1 runs and scope 1 is released, then update_function 2, and so on.]
Latency Hiding
Hide latency using request buffering.
[Timeline: as with pipelining, buffered lock requests for scopes 2 and 3 overlap with executing update_function 1.]
Residual BP on a 190K-vertex, 560K-edge graph, 4 machines:
- No pipelining: 472 s
- Pipelining: 10 s
- 47x speedup
Video Cosegmentation
Probabilistic inference task: identify segments that mean the same thing across frames.
1740 frames. Model: 10.5 million nodes, 31 million edges.
Video Cosegmentation Speedups
[Figure: GraphLab speedup vs. # machines, close to ideal.]
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
What if machines fail? How do we provide fault tolerance?
Checkpoint:
1: Stop the world.
2: Write state to disk.
Snapshot Performance
[Figure: progress over time with no snapshot, with snapshots, and with one slow machine; the snapshot time stretches when a machine is slow.]
Because we have to stop the world, one slow machine slows everything down!
How can we do better? Take advantage of consistency.
Checkpointing
1985: Chandy-Lamport invented an asynchronous snapshotting algorithm for distributed systems.
[Figure: the graph divides into snapshotted and not-yet-snapshotted regions as the snapshot propagates.]
Fine-grained Chandy-Lamport is easily implemented within GraphLab as an update function!
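A sketch of the snapshot expressed in the update-function style (assumed names and a toy sequential driver; full Chandy-Lamport also records in-flight channel messages, which the consistency model stands in for here): a vertex saves its own state, marks itself snapshotted, and schedules unsnapshotted neighbors, so the snapshot spreads as a wave.

```python
def snapshot_update(v, graph, adj, schedule, saved):
    # Update function: snapshot this vertex and propagate the wave.
    if graph[v]["snapshotted"]:
        return
    saved[v] = dict(graph[v]["data"])   # persist local state
    graph[v]["snapshotted"] = True
    for n in adj[v]:                    # schedule unsnapshotted neighbors
        if not graph[n]["snapshotted"]:
            schedule.append(n)

# Toy 3-vertex path graph a - b - c, seeded at vertex a.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
graph = {v: {"data": {"x": i}, "snapshotted": False}
         for i, v in enumerate(adj)}
saved, schedule = {}, ["a"]
while schedule:                         # stand-in for the GraphLab scheduler
    snapshot_update(schedule.pop(), graph, adj, schedule, saved)
```

No global barrier is needed: vertices outside the wave keep computing, which is why one slow machine no longer stalls everyone.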
Async. Snapshot Performance
[Figure: progress over time with no snapshot, with async snapshots, and with one slow machine.]
No penalty incurred by the slow machine!
Summary
- Extended the GraphLab abstraction to distributed systems.
- Two different methods of achieving consistency: graph coloring, and distributed locking with pipelining.
- Efficient implementations; asynchronous fault tolerance with fine-grained Chandy-Lamport.
- Performance, usability, efficiency, scalability.
Carnegie Mellon University
Release 2.1 + many toolkits: http://graphlab.org
- PageRank: 40x faster than Hadoop; 1B edges per second.
- Triangle Counting: 282x faster than Hadoop; 400M triangles per second.
Major improvements to be published in OSDI 2012.