GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein
Carnegie Mellon University
Transcript
Page 1:

Carnegie Mellon

GraphLab: A New Framework for Parallel Machine Learning

Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein

Page 2:

2

[Figure: processor speed (GHz, log scale) vs. release date, 1988–2010. Sequential performance increased exponentially until roughly 2004, then flattened to constant sequential performance, while parallel performance keeps increasing exponentially: "Exponential Parallelism".]

13 Million Wikipedia Pages. 3.6 Billion photos on Flickr.

Page 3:

Parallel Programming is Hard

Designing efficient parallel algorithms is hard:
Race conditions and deadlocks
Parallel memory bottlenecks
Architecture-specific concurrency
Difficult to debug

ML experts repeatedly address the same parallel design challenges

3

Avoid these problems by using high-level abstractions.

Graduate students

Page 4:

MapReduce – Map Phase

4

Embarrassingly parallel, independent computation.

[Figure: CPU 1–CPU 4 each independently compute a value (12.9, 42.3, 21.3, 25.8).]

No communication needed.

Page 5:

MapReduce – Map Phase

5

Embarrassingly parallel, independent computation.

[Figure: CPU 1–CPU 4 continue with further values (24.1, 84.3, 18.4, 84.4) alongside the earlier ones.]

No communication needed.

Page 6:

MapReduce – Map Phase

6

Embarrassingly parallel, independent computation.

[Figure: CPU 1–CPU 4 compute yet more values (17.5, 67.5, 14.9, 34.3); still no coordination between CPUs.]

No communication needed.

Page 7:

MapReduce – Reduce Phase

7

[Figure: CPU 1 and CPU 2 fold the mapped values (12.9, 42.3, 21.3, 25.8, 24.1, 84.3, 18.4, 84.4, 17.5, 67.5, 14.9, 34.3) into aggregated results.]

Fold/Aggregation
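To make the two phases concrete, here is a minimal, self-contained C++ illustration (not Hadoop or GraphLab code; the squaring "map" is an arbitrary stand-in): the map step touches each element independently, so it parallelizes trivially, and the reduce step folds the mapped values into one result.

```cpp
// Minimal sketch of the map/reduce pattern from these slides.
// The "map" function here (squaring) is an arbitrary placeholder.
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> inputs = {12.9, 42.3, 21.3, 25.8, 24.1, 84.3, 18.4, 84.4};

  // Map phase: each element is transformed independently -- no communication,
  // so each iteration could run on a different CPU.
  std::vector<double> mapped;
  for (double x : inputs) mapped.push_back(x * x);

  // Reduce phase: fold/aggregate the mapped values into a single result.
  double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
  std::printf("aggregated result = %f\n", total);
  return 0;
}
```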

Page 8:

Related Data

8

Interdependent Computation: Not MapReduceable

Page 9:

Parallel Computing and ML

Not all algorithms are efficiently data parallel.

9

Data-Parallel: Cross-Validation, Feature Extraction

Complex Parallel Structure: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Sampling, Lasso

Page 10:

Common Properties

10

1) Sparse Data Dependencies
   Examples: Sparse Primal SVM, Tensor/Matrix Factorization

2) Local Computations
   Examples: Expectation Maximization, Optimization

3) Iterative Updates
   Examples: Sampling, Belief Propagation

[Figure: two local computations, Operation A and Operation B, each reading and writing a small neighborhood of the data graph.]
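These three properties suggest a common computational skeleton. The following is a hypothetical sketch of that shape (not GraphLab code, and the neighbor-averaging update is an arbitrary stand-in): local updates over a sparse dependency graph, repeated until convergence.

```cpp
// Hypothetical skeleton shared by the algorithms above: repeat local updates,
// each depending only on a sparse set of neighbors, until values stop changing.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

void iterate_local_updates(std::vector<double>& value,
                           const std::vector<std::vector<std::size_t>>& neighbors,
                           double tolerance) {
  double max_change = tolerance + 1.0;
  while (max_change > tolerance) {                       // 3) iterative updates
    max_change = 0.0;
    for (std::size_t v = 0; v < value.size(); ++v) {
      if (neighbors[v].empty()) continue;
      // 2) local computation over 1) sparse data dependencies.
      double sum = 0.0;
      for (std::size_t u : neighbors[v]) sum += value[u];
      const double updated = sum / static_cast<double>(neighbors[v].size());
      max_change = std::max(max_change, std::abs(updated - value[v]));
      value[v] = updated;
    }
  }
}
```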

Page 11:

Gibbs Sampling

11

[Figure: a Markov random field over variables X1–X9; each variable depends only on its graph neighbors.]

1) Sparse Data Dependencies

2) Local Computations

3) Iterative Updates

Page 12:

GraphLab is the Solution
Designed specifically for ML needs:
Express data dependencies
Iterative

Simplifies the design of parallel programs:
Abstract away hardware issues
Automatic data synchronization
Addresses multiple hardware architectures

Implementation here is multi-core; a distributed implementation is in progress.

12

Page 13:

Carnegie Mellon

GraphLab: A New Framework for Parallel Machine Learning

Page 14:

GraphLab

14

The GraphLab Model: Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes

Page 15:

Data Graph

15

A graph with data associated with every vertex and edge.

[Figure: a graph over variables X1–X11. Vertex data, e.g. for X3: the current sample value x3 and the sample counts C(X3). Edge data, e.g. for the edge (X6, X9): the binary potential Φ(X6, X9).]
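To make the data graph concrete, here is a minimal C++ sketch of the kind of vertex and edge data this Gibbs-sampling example attaches to the graph. The struct names and the adjacency-list layout are illustrative assumptions, not GraphLab's actual data structures.

```cpp
// Illustrative (not GraphLab's real) data-graph types for the Gibbs example:
// every vertex and every edge carries user-defined data.
#include <vector>

struct VertexData {
  int sample = 0;                  // current sample value, e.g. x3
  std::vector<int> sample_counts;  // C(X_i): how often each value was sampled
};

struct EdgeData {
  std::vector<double> binary_potential;  // flattened table Phi(X_i, X_j)
};

struct Edge {
  int source = 0, target = 0;
  EdgeData data;
};

struct DataGraph {
  std::vector<VertexData> vertices;              // data on every vertex
  std::vector<Edge> edges;                       // data on every edge
  std::vector<std::vector<int>> adjacent_edges;  // edge ids incident to each vertex
};
```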

Page 16:

Update Functions

16

Update functions are operations that are applied to a vertex and transform the data in the scope of that vertex.

Gibbs Update:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex
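As a concrete illustration, here is a hypothetical C++ sketch of such a Gibbs update, written against the illustrative DataGraph structs sketched on Page 15; the function signature and data layout are assumptions, not GraphLab's actual update-function API.

```cpp
// Hypothetical sketch of a Gibbs-sampling update function (not the real
// GraphLab interface).  It assumes the illustrative DataGraph/Edge structs
// from the Page 15 sketch and a row-major layout for the potential table.
#include <random>
#include <vector>

void gibbs_update(DataGraph& graph, int vertex_id, std::mt19937& rng) {
  VertexData& v = graph.vertices[vertex_id];
  const int num_values = static_cast<int>(v.sample_counts.size());

  // Unnormalized conditional P(X_v = k | samples on adjacent vertices).
  std::vector<double> conditional(num_values, 1.0);
  for (int eid : graph.adjacent_edges[vertex_id]) {
    const Edge& e = graph.edges[eid];
    const int neighbor = (e.source == vertex_id) ? e.target : e.source;
    const int neighbor_sample = graph.vertices[neighbor].sample;  // read neighbor
    for (int k = 0; k < num_values; ++k) {
      // Read the edge potential Phi(k, neighbor_sample).
      conditional[k] *= e.data.binary_potential[k * num_values + neighbor_sample];
    }
  }

  // Compute (draw) a new sample for the current vertex and record it.
  std::discrete_distribution<int> dist(conditional.begin(), conditional.end());
  v.sample = dist(rng);
  v.sample_counts[v.sample] += 1;
}
```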

Page 17:

Update Function Schedule

17

[Figure: a data graph over vertices a–k; the scheduler holds a queue of update tasks (a, h, a, i, b, d) that CPU 1 and CPU 2 pull from and execute.]

Page 18:

Update Function Schedule

18

[Figure: CPU 1 and CPU 2 have taken the first tasks (a and h); tasks a, i, b, d remain in the schedule.]

Page 19:

Static Schedule
The scheduler determines the order of update function evaluations.

19

Synchronous Schedule: Every vertex updated simultaneously

Round Robin Schedule: Every vertex updated sequentially

Page 20:

Need for Dynamic Scheduling

20

[Figure: part of the graph has converged while another part is slowly converging; effort should be focused on the slowly converging region.]

Page 21:

Dynamic Schedule

21

[Figure: the same data graph over vertices a–k; as CPU 1 and CPU 2 execute tasks (a, h, ...), the updates insert new tasks (b, i, ...) into the schedule.]

Page 22:

Dynamic Schedule
Update functions can insert new tasks into the schedule.

22

FIFO Queue: Wildfire BP [Selvatici et al.]
Priority Queue: Residual BP [Elidan et al.]
Splash Schedule: Splash BP [Gonzalez et al.]

Obtain different algorithms simply by changing a flag!

--scheduler=fifo --scheduler=priority --scheduler=splash
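As a sketch of why one flag is enough to switch algorithms (the class and function names below are hypothetical, not GraphLab's API), the flag can simply select which scheduler implementation the engine is constructed with:

```cpp
// Hypothetical sketch of mapping a --scheduler flag to a scheduling policy.
// The class names are illustrative; they are not GraphLab's real types.
#include <memory>
#include <stdexcept>
#include <string>

struct Scheduler { virtual ~Scheduler() = default; };
struct FifoScheduler     : Scheduler {};  // FIFO queue      -> Wildfire BP behavior
struct PriorityScheduler : Scheduler {};  // priority queue  -> Residual BP behavior
struct SplashScheduler   : Scheduler {};  // splash schedule -> Splash BP behavior

std::unique_ptr<Scheduler> make_scheduler(const std::string& flag) {
  if (flag == "fifo")     return std::make_unique<FifoScheduler>();
  if (flag == "priority") return std::make_unique<PriorityScheduler>();
  if (flag == "splash")   return std::make_unique<SplashScheduler>();
  throw std::invalid_argument("unknown --scheduler value: " + flag);
}
```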

Page 23:

Global Information

What if we need global information?

23

Sum of all the vertices?

Algorithm Parameters?

Sufficient Statistics?

Page 24:

Shared Data Table (SDT)
Global constant parameters

24

Constant: Total # Samples
Constant: Temperature

Page 25:

Sync Operation
Sync is a fold/reduce operation over the graph.

25

[Figure: Sync! A pass over the graph folds every vertex's value into an accumulator, then a final Apply step transforms the result.]

Accumulate Function: Add
Apply Function: Divide by |V|

Accumulate performs an aggregation over vertices.
Apply makes a final modification to the accumulated data.
Example: compute the average of all the vertices.
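A small C++ sketch of this average-of-all-vertices example follows; the accumulate/apply function signatures are illustrative assumptions rather than GraphLab's real sync API.

```cpp
// Sketch of a sync operation: fold (accumulate) over all vertices, then apply
// one final transformation.  Here it computes the average vertex value.
#include <cstddef>
#include <vector>

struct Accumulator {
  double sum = 0.0;
  std::size_t count = 0;
};

// Accumulate: called once per vertex during the fold.
void accumulate(Accumulator& acc, double vertex_value) {
  acc.sum += vertex_value;
  acc.count += 1;
}

// Apply: one final modification to the accumulated data (divide by |V|).
double apply(const Accumulator& acc) {
  return acc.count == 0 ? 0.0 : acc.sum / static_cast<double>(acc.count);
}

double average_vertex_value(const std::vector<double>& vertex_values) {
  Accumulator acc;
  for (double v : vertex_values) accumulate(acc, v);
  return apply(acc);
}
```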

Page 26:

Shared Data Table (SDT)
Global constant parameters
Global computation (Sync Operation)

26

Constant: Total # Samples
Constant: Temperature
Sync: Sample Statistics
Sync: Log-likelihood

Page 27:

Carnegie Mellon

Safety and Consistency

27

Page 28:

Write-Write Race

28

A write-write race occurs if adjacent update functions write simultaneously.

[Figure: the left update and the right update write to the same data; only one of the writes survives in the final value.]

Page 29:

Race Conditions + Deadlocks

Just one of the many possible races.
Race-free code is extremely difficult to write.

29

The GraphLab design ensures race-free operation.

Page 30:

Scope Rules

30

Guaranteed safety for all update functions

Page 31:

Full Consistency

31

Only update functions two vertices apart are allowed to run in parallel, which reduces the opportunities for parallelism.

Page 32:

Obtaining More Parallelism

32

Not all update functions will modify the entire scope!

Belief Propagation: only uses edge data.
Gibbs Sampling: only needs to read adjacent vertices.

Page 33:

Edge Consistency

33

Page 34:

Obtaining More Parallelism

34

“Map” operations. Feature extraction on vertex data

Page 35:

Vertex Consistency

35

Page 36:

Sequential Consistency
GraphLab guarantees sequential consistency.

36

For every parallel execution, there exists a sequential execution of update functions which will produce the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 over time, and an equivalent sequential execution on a single CPU.]

Page 37:

GraphLab

37

The GraphLab Model: Data Graph, Shared Data Table, Scheduling, Update Functions and Scopes

Page 38:

Carnegie Mellon

Experiments

38

Page 39:

Experiments
Shared-memory implementation in C++ using Pthreads.
Tested on a 16-processor machine: 4x quad-core AMD Opteron 8384, 64 GB RAM.

39

Algorithms implemented:
Belief Propagation + Parameter Learning
Gibbs Sampling
CoEM
Lasso
Compressed Sensing
SVM
PageRank
Tensor Factorization

Page 40:

Graphical Model Learning

40

3D retinal image denoising

Data Graph: 256x64x64 vertices

Update Function: Belief Propagation

Sync: Edge-potential
Acc: compute inference statistics
Apply: take a gradient step
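For context, the gradient step the Apply stage takes is presumably the standard maximum-likelihood gradient for MRF parameters (this equation is an assumption based on the slide's description, not something stated on it), where the model expectation is estimated from the inference statistics gathered by the accumulate stage:

```latex
% Standard MRF parameter-learning gradient (assumed form, not from the slides):
% theta parameterizes the shared edge potential, f are its sufficient statistics,
% eta is a step size.
\[
\frac{\partial \ell(\theta)}{\partial \theta}
  = \mathbb{E}_{\text{data}}\!\left[f(x_i, x_j)\right]
  - \mathbb{E}_{\theta}\!\left[f(X_i, X_j)\right],
\qquad
\theta \leftarrow \theta + \eta\, \frac{\partial \ell(\theta)}{\partial \theta}.
\]
```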

Page 41:

Graphical Model Learning

41

[Figure: speedup vs. number of CPUs (1–16) against optimal, for the approximate priority schedule and the splash schedule.]

15.5x speedup on 16 CPUs

Page 42:

Graphical Model Learning

42

Standard parameter learning takes the gradient only after inference is computed (iterated inference and gradient steps).

With GraphLab: take the gradient step while inference is running (parallel inference + gradient step).

[Figure: runtime of the iterated approach (2100 sec) vs. the simultaneous approach (700 sec).]

3x faster!

Page 43:

Gibbs Sampling
Two methods for sequential consistency:

43

Scopes: Edge Scope
graphlab(gibbs, edge, sweep);

Scheduling: Graph Coloring
[Figure: CPU 1, CPU 2, and CPU 3 update vertices of one color per time step t0–t3; adjacent vertices never share a color.]
graphlab(gibbs, vertex, colored);
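A minimal sketch of the coloring idea follows (hypothetical code, not the graphlab(...) calls above, and it reuses the illustrative DataGraph and gibbs_update sketches from earlier pages): because adjacent vertices never share a color, all vertices of one color can be updated in parallel without locking while preserving sequential consistency.

```cpp
// Sketch of the colored schedule: process one color at a time; vertices of the
// same color are never adjacent, so their updates are independent.
#include <cstddef>
#include <random>
#include <vector>

void colored_sweep(DataGraph& graph,
                   const std::vector<int>& vertex_color,  // a valid graph coloring
                   int num_colors,
                   std::mt19937& rng) {
  for (int color = 0; color < num_colors; ++color) {
    // Every iteration of this inner loop is independent of the others, so the
    // vertices of this color could be split across CPUs (each CPU would then
    // need its own random-number generator).
    for (std::size_t v = 0; v < graph.vertices.size(); ++v) {
      if (vertex_color[v] == color) {
        gibbs_update(graph, static_cast<int>(v), rng);  // update sketched earlier
      }
    }
  }
}
```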

Page 44:

Gibbs Sampling
Protein-protein interaction networks [Elidan et al. 2006]

Pairwise MRF: 14K vertices, 100K edges

10x speedup. Scheduling reduces locking overhead.

44

[Figure: speedup vs. number of CPUs (1–16) against optimal, for the round-robin schedule and the colored schedule.]

Page 45:

CoEM (Rosie Jones, 2005)
Named Entity Recognition task

Is "Dog" an animal? Is "Catalina" a place?

[Figure: a bipartite graph connecting noun phrases ("the dog", "Australia", "Catalina Island") to the contexts they occur in ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]

Graph sizes:
Small: 0.2M vertices, 20M edges
Large: 2M vertices, 200M edges

Hadoop: 95 cores, 7.5 hrs

Page 46:

CoEM (Rosie Jones, 2005)

46

[Figure: speedup vs. number of CPUs (1–16) against optimal, for the small and large graphs.]

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min

15x faster! 6x fewer CPUs!

Page 47:

Lasso

47

L1-regularized Linear Regression

Shooting Algorithm (coordinate descent). Due to the properties of the update, full consistency is needed.

Page 48:

Lasso

48

L1-regularized Linear Regression

Shooting Algorithm (coordinate descent). Due to the properties of the update, full consistency is needed.

Page 49:

Lasso

49

L1-regularized Linear Regression

Shooting Algorithm (coordinate descent). Due to the properties of the update, full consistency is needed.

Finance Dataset from Kogan et al [2009].
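For reference, the per-coordinate update the shooting algorithm performs is the standard Lasso coordinate-descent step below (this derivation is not on the slides); each update touches only the features and residuals connected to coordinate j, which is what the data-graph formulation exploits and why full consistency matters for correctness.

```latex
% Standard shooting / coordinate-descent update for the Lasso objective
%   min_w  (1/2) * ||y - X w||_2^2 + lambda * ||w||_1
% updating one coordinate j at a time while the others are held fixed.
\[
r^{(j)} = y - \sum_{k \neq j} x_k w_k, \qquad
w_j \leftarrow \frac{S\!\left(x_j^{\top} r^{(j)},\ \lambda\right)}{x_j^{\top} x_j},
\qquad
S(a, \lambda) = \operatorname{sign}(a)\,\max\!\left(|a| - \lambda,\ 0\right).
\]
```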

Page 50:

Full Consistency

50

[Figure: speedup vs. number of CPUs (1–16) against optimal under full consistency, for the dense and sparse datasets.]

Page 51:

Relaxing Consistency

51

Why does this work? (Open question)

[Figure: speedup vs. number of CPUs (1–16) against optimal for the dense and sparse datasets when consistency is relaxed.]

Page 52:

GraphLab
An abstraction tailored to machine learning. It provides a parallel framework which compactly expresses:
Data/computational dependencies
Iterative computation

Achieves state-of-the-art parallel performance on a variety of problems.
Easy to use.

52

Page 53:

Future Work

Distributed GraphLab:
Load balancing
Minimize communication
Latency hiding
Distributed data consistency
Distributed scalability

GPU GraphLab:
Memory bus bottleneck
Warp alignment

State-of-the-art performance for <Your Algorithm Here>.

53

Page 54:

Carnegie Mellon

Parallel GraphLab 1.0

Available Today

http://graphlab.ml.cmu.edu

54

Documentation… Code… Tutorials…