Carnegie Mellon
A Framework for Machine Learning and Data Mining in the Cloud
Yucheng Low, Aapo Kyrola, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Joe Hellerstein
Big Data is Everywhere
YouTube: 72 hours of video uploaded per minute
Wikipedia: 28 million pages
Facebook: 900 million users
Flickr: 6 billion photos
“… data a new class of economic asset, like currency or gold.”
“…growing at 50 percent a year…”
How will we design and implement Big Learning systems?
Big Learning
Shift Towards Use Of Parallelism in ML
GPUs Multicore Clusters Clouds Supercomputers
ML experts repeatedly solve the same parallel design challenges:
Race conditions, distributed state, communication…
Resulting code is very specialized: difficult to maintain, extend, debug…
Graduate students
Avoid these problems by using high-level abstractions
MapReduce – Map Phase
[Figure: CPUs 1–4 each process an independent slice of the data, emitting partial results (12.9, 42.3, 21.3, 25.8, 24.1, 84.3, 18.4, 84.4, 17.5, 67.5, 14.9, 34.3).]
Embarrassingly parallel: independent computation, no communication needed.
MapReduce – Reduce Phase
[Figure: CPUs 1 and 2 fold the mappers' partial results into the aggregates 2226.26 and 1726.31.]
[Figure: example pipeline. Images labeled Attractive (A) or Ugly (U); the map phase extracts image features, and the reduce phase aggregates them into attractive-face and ugly-face statistics.]
MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks!
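As a concrete illustration, here is a minimal single-process sketch of the map/reduce pattern (the helper name `map_reduce` and the toy data are ours, not part of Hadoop or any framework):

```python
from collections import defaultdict
from functools import reduce

def map_reduce(records, mapper, reducer):
    # Map phase: embarrassingly parallel. Each record is processed
    # independently, emitting (key, value) pairs; no communication needed.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: aggregate all values that share a key.
    return {k: reduce(reducer, vs) for k, vs in groups.items()}

# Toy version of the face-statistics example: sum a feature per label.
data = [("A", 12.9), ("U", 42.3), ("A", 21.3), ("U", 25.8)]
totals = map_reduce(data, lambda r: [(r[0], r[1])], lambda a, b: a + b)
```

In a real framework the map calls run on different machines; the single loop above stands in for that distribution.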
Data-Parallel (MapReduce):
- Cross Validation
- Feature Extraction
- Computing Sufficient Statistics

Graph-Parallel:
- Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
- Semi-Supervised Learning: Label Propagation, CoEM
- Graph Analysis: PageRank, Triangle Counting
- Collaborative Filtering: Tensor Factorization

Is there more to Machine Learning?
Exploit Dependencies
[Figure: a social network where a user's interests (Hockey vs. Scuba Diving) can be inferred from friends' interests; exploiting dependencies even reveals niche communities such as Underwater Hockey.]
Graphs are Everywhere
- Collaborative Filtering (Netflix): Users × Movies
- Text Analysis (Wiki): Docs × Words
- Probabilistic Analysis: Social Networks
Properties of Computation on Graphs
- Dependency graph (my interests depend on my friends' interests)
- Local updates
- Iterative computation
ML Tasks Beyond Data-Parallelism
Data-Parallel (MapReduce):
- Cross Validation
- Feature Extraction
- Computing Sufficient Statistics

Graph-Parallel:
- Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
- Semi-Supervised Learning: Label Propagation, CoEM
- Graph Analysis: PageRank, Triangle Counting
- Collaborative Filtering: Tensor Factorization
Algorithms implemented on GraphLab: Bayesian Tensor Factorization, Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, LDA, Linear Solvers, Splash Sampler, Alternating Least Squares, …many others…
2010: Shared Memory
- Limited CPU power
- Limited memory
- Limited scalability

Distributed Cloud
Unlimited computation resources! (up to funding limitations)
New challenges:
- Distributing state
- Data consistency
- Fault tolerance
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
Data Graph
Data is associated with vertices and edges.
- Graph: social network
- Vertex data: user profile, current interest estimates
- Edge data: relationship (friend, classmate, relative)
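A minimal sketch of the data-graph idea (our own illustrative class, not GraphLab's actual C++ API): arbitrary user data attached to vertices and edges.

```python
class DataGraph:
    """Sketch of a data graph: user data lives on vertices and edges."""
    def __init__(self):
        self.vertex_data = {}   # vertex -> data (e.g. interest estimates)
        self.edge_data = {}     # (u, v) -> data (e.g. relationship type)
        self.adj = {}           # vertex -> set of neighboring vertices

    def add_vertex(self, v, data=None):
        self.vertex_data[v] = data
        self.adj.setdefault(v, set())

    def add_edge(self, u, v, data=None):
        for w in (u, v):                     # ensure both endpoints exist
            self.adj.setdefault(w, set())
            self.vertex_data.setdefault(w, None)
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.edge_data[(u, v)] = data

# Tiny social-network example in the spirit of the slide:
g = DataGraph()
g.add_vertex("alice", {"interests": ["hockey"]})
g.add_vertex("bob", {"interests": ["scuba"]})
g.add_edge("alice", "bob", {"relationship": "friend"})
```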
Distributed Graph
- Partition the graph across multiple machines.
- "Ghost" vertices maintain the adjacency structure and replicate remote data.
- Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / …).
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
Update Functions
A user-defined program, applied to a vertex, that transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  // (standard PageRank: damped sum of the neighbors' ranks)
  vertex.PageRank = 0.15 + 0.85 * sum(neighbor.PageRank * edge.weight)
  // Reschedule neighbors if needed
  if vertex.PageRank changes then reschedule_all_neighbors;
}

Dynamic computation: the update function is applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation.
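A runnable single-machine sketch of this dynamic, scheduler-driven PageRank (function and parameter names are ours; GraphLab's engine is distributed and written in C++). Only vertices whose value changes reschedule their neighbors:

```python
from collections import deque

def dynamic_pagerank(out_links, damping=0.85, tol=1e-6):
    # Build reverse adjacency so each vertex can read its in-neighbors.
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    n = len(out_links)
    pr = {v: 1.0 / n for v in out_links}
    scheduled = deque(out_links)      # the scheduler: pending vertices
    in_queue = set(out_links)
    while scheduled:                  # repeats until the scheduler is empty
        v = scheduled.popleft()
        in_queue.discard(v)
        new = (1 - damping) / n + damping * sum(
            pr[u] / len(out_links[u]) for u in in_links[v])
        if abs(new - pr[v]) > tol:    # reschedule neighbors if changed
            for w in out_links[v]:
                if w not in in_queue:
                    scheduled.append(w)
                    in_queue.add(w)
        pr[v] = new
    return pr

# 3-cycle: PageRank is uniform, so the schedule drains immediately.
ranks = dynamic_pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```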
Why Dynamic?
Asynchronous Belief Propagation on a graphical model:
[Figure: cumulative vertex update counts; some regions need many updates, others few. The algorithm identifies and focuses on the hidden sequential structure.]
Challenge = Boundaries
Shared Memory Dynamic Schedule
[Figure: a shared scheduler holds pending vertices (a, b, h, i, …); CPU 1 and CPU 2 pull vertices from the scheduler, run the update function, and push rescheduled neighbors back. The process repeats until the scheduler is empty.]
Distributed Scheduling
[Figure: the graph, and its schedule, are partitioned across machines.]
Each machine maintains a schedule over the vertices it owns. Distributed consensus is used to identify completion.
Ensuring Race-Free Code
How much can computation overlap?
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
PageRank Revisited

Pagerank(scope) {
  …
}

Racing PageRank
This was actually encountered in user code: without consistency guarantees, concurrent update functions race on reads and writes of shared vertex data, producing subtle bugs. In the buggy version the vertex's PageRank is used as an in-place accumulator, so neighbors can observe intermediate values; the fix stages the computation in a temporary (tmp) and writes the vertex value once.
Throughput != Performance
No consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.
Racing Collaborative Filtering
[Figure: training RMSE (log scale, 0.01–1) vs. time (0–1600 s) for d=20 with and without consistency; the racing run converges to a higher error.]
Serializability
For every parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: updates interleaved on CPU 1 and CPU 2 in parallel correspond to an equivalent single-CPU sequential ordering over time.]
Serializability Example
[Figure: under edge consistency, each update function gets write access to its vertex and adjacent edges, and read access to adjacent vertices.]
Update functions one vertex apart can be run in parallel: the overlapping regions are only read.
Stronger / weaker consistency levels are available; user-tunable consistency levels trade off parallelism and consistency.
Distributed Consistency
- Solution 1: Graph Coloring
- Solution 2: Distributed Locking
Edge Consistency via Graph Coloring
Vertices of the same color are all at least one vertex apart. Therefore, all vertices of the same color can be run in parallel!
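A sketch of the idea (greedy coloring is our stand-in; the slides assume a coloring is supplied, and `greedy_color` is not a GraphLab API):

```python
def greedy_color(adj):
    # Assign each vertex the smallest color unused by its neighbors.
    color = {}
    for v in sorted(adj):                       # deterministic order
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
color = greedy_color(adj)

# Same-color vertices share no edge, so their edge-consistent scopes
# are disjoint and each batch below may safely run in parallel.
for c in sorted(set(color.values())):
    batch = [v for v in color if color[v] == c]
```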
Chromatic Distributed Engine
[Figure: timeline on each machine. Execute tasks on all vertices of color 0; ghost synchronization completion + barrier; execute tasks on all vertices of color 1; ghost synchronization completion + barrier; and so on through the colors.]
Matrix Factorization: Netflix Collaborative Filtering
Alternating Least Squares (ALS) matrix factorization. Model: 0.5 million nodes, 99 million edges.
[Figure: bipartite Users × Movies graph; each vertex holds a latent factor vector of dimension d.]
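A toy sketch of ALS on the bipartite rating graph (d = 1 latent factor for brevity, invented data; this is the alternating idea only, not GraphLab's implementation): fix the movie factors and solve for each user in closed form, then swap.

```python
# Ratings live on edges of the Users x Movies graph.
ratings = {("u1", "m1"): 4.0, ("u1", "m2"): 2.0,
           ("u2", "m1"): 4.0, ("u2", "m2"): 2.0}
users = {"u1", "u2"}
movies = {"m1", "m2"}
x = {u: 1.0 for u in users}     # user factors
y = {m: 1.0 for m in movies}    # movie factors
lam = 1e-6                      # small ridge term for stability

for _ in range(50):             # alternate the two least-squares solves
    for u in users:             # fix y, solve for x[u] (scalar closed form)
        es = [(m, r) for (uu, m), r in ratings.items() if uu == u]
        x[u] = sum(y[m] * r for m, r in es) / (
            sum(y[m] ** 2 for m, _ in es) + lam)
    for m in movies:            # fix x, solve for y[m]
        es = [(u, r) for (u, mm), r in ratings.items() if mm == m]
        y[m] = sum(x[u] * r for u, r in es) / (
            sum(x[u] ** 2 for u, _ in es) + lam)

pred = x["u1"] * y["m1"]        # reconstructed rating
```

Since the toy rating matrix is rank 1, the factors recover it almost exactly.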
Netflix Collaborative Filtering
[Figure, left: speedup vs. # machines for D=100 and D=20 against ideal scaling. Figure, right: runtime vs. # machines for Hadoop, MPI, and GraphLab (the cost of Hadoop).]
CoEM (Rosie Jones, 2005): Named Entity Recognition Task
Is "Cat" an animal? Is "Istanbul" a place?
[Figure: bipartite graph linking noun phrases ("the cat", "Australia", "Istanbul") to contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]
Vertices: 2 million. Edges: 200 million.

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time)
Problems:
- Requires a graph coloring to be available.
- Frequent barriers make it extremely inefficient for highly dynamic systems where only a small number of vertices are active in each round.
Distributed Consistency
- Solution 1: Graph Coloring
- Solution 2: Distributed Locking
Distributed Locking
Edge consistency can be guaranteed through locking: each vertex carries a reader-writer (RW) lock.

Consistency Through Locking
Acquire a write-lock on the center vertex and read-locks on adjacent vertices.
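A minimal sketch of this scheme with per-vertex reader-writer locks (our own toy RW lock, not GraphLab's C++ internals): write-lock the center, read-lock the neighbors, acquiring in a canonical sorted order so overlapping scopes cannot deadlock.

```python
import threading

class RWLock:
    """Toy reader-writer lock built on a condition variable."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0
        self._no_readers = threading.Condition(self._mutex)

    def acquire_read(self):
        with self._mutex:
            self._readers += 1

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._no_readers.notify_all()

    def acquire_write(self):
        self._mutex.acquire()          # blocks new readers and writers
        while self._readers:
            self._no_readers.wait()    # wait out in-flight readers

    def release_write(self):
        self._mutex.release()

def lock_scope(locks, center, neighbors):
    # Canonical (sorted) acquisition order prevents deadlock
    # between overlapping scopes.
    for v in sorted({center, *neighbors}):
        locks[v].acquire_write() if v == center else locks[v].acquire_read()

def unlock_scope(locks, center, neighbors):
    for v in {center, *neighbors}:
        locks[v].release_write() if v == center else locks[v].release_read()

locks = {v: RWLock() for v in "abcd"}
lock_scope(locks, "a", ["b", "c"])     # write a, read b and c
# ... run the update function on vertex a's scope here ...
unlock_scope(locks, "a", ["b", "c"])
```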
Consistency Through Locking
- Multicore setting: PThread RW-locks.
- Distributed setting: distributed locks. Challenge: latency of remote lock requests.
Solution: pipelining.
[Figure: vertices A, B, C, D partitioned across Machine 1 and Machine 2; locking a scope that spans machines requires high-latency remote requests.]
No Pipelining
[Timeline: lock scope 1 → process request 1 → scope 1 acquired → update_function 1 → release scope 1 → process release 1. Each scope is locked, updated, and released before the next request is issued.]
Pipelining / Latency Hiding
Hide latency using pipelining.
[Timeline: lock requests for scopes 1, 2, and 3 are issued back-to-back; while the later acquisitions are in flight, update_function 1 runs and scope 1 is released, then update_function 2, and so on.]
Latency Hiding
Hide latency using request buffering.
[Timeline: as with pipelining, buffered lock requests for scopes 2 and 3 overlap with executing update_function 1.]
Residual BP on a 190K-vertex, 560K-edge graph, 4 machines:
- No pipelining: 472 s
- Pipelining: 10 s
- 47x speedup
Video Cosegmentation
Probabilistic inference task: identify segments that mean the same thing across frames.
1740 frames. Model: 10.5 million nodes, 31 million edges.
Video Cosegmentation Speedups
[Figure: GraphLab speedup vs. # machines, close to ideal.]
The GraphLab Framework
- Graph-Based Data Representation
- Update Functions (User Computation)
- Consistency Model
What if machines fail? How do we provide fault tolerance?
Checkpoint:
1: Stop the world.
2: Write state to disk.
Snapshot Performance
[Figure: progress over time with no snapshot, with snapshots, and with one slow machine; the snapshot time stretches when a machine is slow.]
Because we have to stop the world, one slow machine slows everything down!
How can we do better? Take advantage of consistency.
Checkpointing
1985: Chandy-Lamport invented an asynchronous snapshotting algorithm for distributed systems.
[Figure: the graph divides into snapshotted and not-yet-snapshotted regions as the snapshot propagates.]
Fine-grained Chandy-Lamport is easily implemented within GraphLab as an update function!
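A sketch of the snapshot expressed in the update-function style (assumed names and a toy sequential driver; full Chandy-Lamport also records in-flight channel messages, which the consistency model stands in for here): a vertex saves its own state, marks itself snapshotted, and schedules unsnapshotted neighbors, so the snapshot spreads as a wave.

```python
def snapshot_update(v, graph, adj, schedule, saved):
    # Update function: snapshot this vertex and propagate the wave.
    if graph[v]["snapshotted"]:
        return
    saved[v] = dict(graph[v]["data"])   # persist local state
    graph[v]["snapshotted"] = True
    for n in adj[v]:                    # schedule unsnapshotted neighbors
        if not graph[n]["snapshotted"]:
            schedule.append(n)

# Toy 3-vertex path graph a - b - c, seeded at vertex a.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
graph = {v: {"data": {"x": i}, "snapshotted": False}
         for i, v in enumerate(adj)}
saved, schedule = {}, ["a"]
while schedule:                         # stand-in for the GraphLab scheduler
    snapshot_update(schedule.pop(), graph, adj, schedule, saved)
```

No global barrier is needed: vertices outside the wave keep computing, which is why one slow machine no longer stalls everyone.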
Async. Snapshot Performance
[Figure: progress over time with no snapshot, with async snapshots, and with one slow machine.]
No penalty incurred by the slow machine!
Summary
- Extended the GraphLab abstraction to distributed systems.
- Two different methods of achieving consistency: graph coloring, and distributed locking with pipelining.
- Efficient implementations; asynchronous fault tolerance with fine-grained Chandy-Lamport.
- Performance, usability, efficiency, scalability.
Carnegie Mellon University
Release 2.1 + many toolkits: http://graphlab.org
- PageRank: 40x faster than Hadoop; 1B edges per second.
- Triangle Counting: 282x faster than Hadoop; 400M triangles per second.
Major improvements to be published in OSDI 2012.