Carnegie Mellon
Distributed Parallel Inference on Large Factor Graphs
Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, David O'Hallaron
Nov 28, 2014
Exponential Parallelism
[Figure: processor speed (GHz, log scale from 0.01 to 10) vs. release date, 1988-2010. Sequential performance increased exponentially until the mid-2000s and has since been constant, while parallel performance continues to increase exponentially.]
Distributed Parallel Setting
Opportunities: access to larger systems (8 CPUs to 1000 CPUs) and a linear increase in RAM, cache capacity, and memory bandwidth.
Challenges: distributed state, communication, and load balancing.
[Figure: nodes, each with CPU, cache, bus, and memory, connected by a fast reliable network.]
Graphical Models and Parallelism
Graphical models provide a common language for general-purpose parallel algorithms in machine learning.
A parallel inference algorithm would improve: protein structure prediction, movie recommendation, and computer vision.
Inference is a key step in learning graphical models.
Belief Propagation (BP)
A message-passing algorithm, and a naturally parallel one.
Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel: each CPU (CPU 1 through CPU n) reads the old messages and writes a disjoint set of new messages. Map-Reduce ready!
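The synchronous round above can be sketched in a few lines. This is a minimal sketch for a pairwise model (the talk uses general factor graphs); the function names and data layout are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

# messages[(i, j)] is the old message from vertex i to neighbor j.
# Each new message reads ONLY old messages, so every directed edge
# can be processed on a different CPU.

def neighbors(edges, i):
    return [b for a, b in edges if a == i] + [a for a, b in edges if b == i]

def synchronous_bp_step(edges, node_pot, edge_pot, messages):
    """One synchronous BP round: compute all new messages from old ones."""
    new_messages = {}
    for (i, j) in messages:                      # one independent task per edge
        # product of i's node potential and inbound messages, excluding j
        m = node_pot[i].copy()
        for k in neighbors(edges, i):
            if k != j:
                m = m * messages[(k, i)]
        # marginalize i's states through the edge potential to get m_{i->j}
        out = edge_pot[(i, j)].T @ m
        new_messages[(i, j)] = out / out.sum()   # normalize
    return new_messages
```

Because every update depends only on the previous round's messages, the loop body maps directly onto a Map phase, matching the "Map-Reduce ready" claim.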
Hidden Sequential Structure
[Figure: a chain graphical model with evidence at both ends; information must propagate sequentially along the chain.]

Running Time (chain of n vertices, p processors):
Naturally parallel (synchronous): the time for a single parallel iteration times the number of iterations, giving 2n²/p for p ≤ 2n.
Optimal sequential algorithm (forward-backward): 2n with p = 1.
Optimal parallel: n with p = 2.
There is a large gap between 2n²/p and the optimal parallel running time n.
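Written out, the quantities on this slide are as follows; the iteration count n for synchronous BP on a chain is filled in from the standard argument that information crosses one edge per round.

```latex
\begin{align*}
T_{\text{sync}}(n,p) &= \underbrace{\tfrac{2n}{p}}_{\text{one parallel iteration}}
                        \times \underbrace{n}_{\text{iterations}}
                      = \frac{2n^2}{p}, \qquad p \le 2n,\\[4pt]
T_{\text{fwd-bwd}}(n) &= 2n \qquad (p = 1),\\[4pt]
T_{\text{opt}}(n)     &= n  \qquad (p = 2).
\end{align*}
```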
Parallelism by Approximation
τε represents the minimal sequential structure: vertices farther than τε apart have influence below ε, so the chain can be cut into segments that are processed in parallel.
[Figure: true messages vs. their τε-approximation on a 10-vertex chain; a synchronous schedule vs. the optimal schedule.]

Optimal Parallel Scheduling
Each processor (Processor 1, 2, 3) runs a sequential forward-backward sweep on its own contiguous segment of the chain; the running time has a parallel component, a sequential component, and a gap. In [AIStats 09] we demonstrated that this algorithm is optimal.
The Splash Operation
Generalizes the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree with fixed size.
2) Forward pass: compute all messages at each vertex.
3) Backward pass: compute all messages at each vertex.
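The three steps above can be sketched as follows. This is a minimal sketch: `update_vertex` is a hypothetical callback standing in for recomputing all of a vertex's outbound messages, and the graph is a plain adjacency dict rather than the talk's factor-graph data structure.

```python
from collections import deque

def splash(root, adj, splash_size, update_vertex):
    """Run one Splash rooted at `root` on graph `adj` (vertex -> neighbors)."""
    # 1) Grow a breadth-first spanning tree of fixed size around the root.
    order, seen, frontier = [], {root}, deque([root])
    while frontier and len(order) < splash_size:
        v = frontier.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    # 2) Forward pass: update vertices from the leaves in toward the root.
    for v in reversed(order):
        update_vertex(v)
    # 3) Backward pass: update vertices from the root back out to the leaves.
    for v in order:
        update_vertex(v)
    return order
```

On a chain, a Splash rooted at one end degenerates into exactly the forward-backward sweep, which is why it generalizes the optimal chain schedule.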
Running Parallel Splashes
Partition the graph, schedule Splashes locally, and transmit the messages along the boundary of the partition. Each CPU (CPU 1, 2, 3) maintains its own local state and runs Splashes concurrently.
Key challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?

Where do we Splash?
Assign priorities and use a scheduling queue to select roots. But how do we assign priorities?
Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
Small change: an expensive no-op.
Large change: an informative update.
Problem with Message Scheduling
Small changes in messages do not imply small changes in belief: several small changes to the inbound messages can combine into a large change in the belief.
Large changes in a single message do not imply large changes in belief: one substantially changed message can be outweighed by the remaining inbound messages, leaving the belief almost unchanged.
Belief Residual Scheduling
Assign priorities based on the cumulative change in belief: each time an inbound message changes, accumulate the change into the vertex residual, rv ← rv + |message change|.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
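The two priority rules can be sketched side by side. The helper names and the use of an L1 difference (with beliefs taken as normalized products of inbound messages, node potentials omitted) are simplifying assumptions, not the talk's exact definitions.

```python
import numpy as np

def message_residual(old_msg, new_msg):
    """Residual BP priority: change in a single inbound message."""
    return np.abs(new_msg - old_msg).sum()

def belief(inbound):
    """Normalized product of a vertex's inbound messages."""
    b = np.ones_like(inbound[0])
    for m in inbound:
        b = b * m
    return b / b.sum()

def belief_residual(old_inbound, new_inbound):
    """Belief scheduling priority: change in the resulting belief."""
    return np.abs(belief(new_inbound) - belief(old_inbound)).sum()
```

With a near-deterministic first message, a large swing in the second message barely moves the belief, illustrating why message residuals can over-prioritize uninformative updates.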
Message vs. Belief Scheduling
Belief scheduling improves accuracy more quickly and improves convergence.
[Figure: L1 error in beliefs vs. time (seconds), where belief scheduling reaches lower error faster than message scheduling; and % of runs converged within 4 hours, where belief residuals converge more often than message residuals.]
Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes: vertices with low belief residuals are excluded from the Splash.
[Figure: a Splash shrinking around a region of low belief residuals.]
Splash Size
Using Splash pruning, our algorithm is able to dynamically select the optimal Splash size.
[Figure: running time (seconds) vs. Splash size (messages); without pruning, running time varies strongly with the chosen size, while with pruning it remains near the minimum across sizes.]
Example
[Figure: a synthetic noisy image, its factor graph, and a map of vertex updates showing many updates in some regions and few in others.]
The algorithm identifies and focuses on hidden sequential structure.
Distributed Belief Residual Splash (DBRSplash) Algorithm
Partition the factor graph over processors, schedule Splashes locally using belief residuals, and transmit messages on the boundary. Each CPU maintains local state and a scheduling queue; CPUs communicate over a fast reliable network.
Theorem: Given a uniform partitioning of the chain graphical model, DBRSplash retains optimality, matching the optimal parallel running time.
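One CPU's event loop can be sketched as below. All helper names are hypothetical stand-ins: `splash_fn` runs a Splash at a root and returns the boundary messages to ship plus updated residuals, while `send`/`recv` abstract the network layer. Stale queue entries are tolerated rather than removed, a common simplification in residual schedulers.

```python
import heapq

def dbrsplash_worker(local_vertices, residuals, splash_fn, send, recv,
                     max_splashes):
    """Run one processor's DBRSplash loop for `max_splashes` Splashes."""
    # Priority queue keyed on negated belief residual (largest first).
    queue = [(-residuals[v], v) for v in local_vertices]
    heapq.heapify(queue)
    for _ in range(max_splashes):
        # Absorb boundary messages from other CPUs, raising residuals.
        for v, r in recv():
            residuals[v] = max(residuals.get(v, 0.0), r)
            heapq.heappush(queue, (-residuals[v], v))
        if not queue:
            break
        _, root = heapq.heappop(queue)
        boundary_msgs, new_residuals = splash_fn(root)
        send(boundary_msgs)                 # messages crossing the cut
        for v, r in new_residuals.items():  # reschedule changed vertices
            residuals[v] = r
            heapq.heappush(queue, (-r, v))
```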
Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication. Goal: balance computation across processors and minimize communication.

The Partitioning Problem
Objective: minimize communication cost while ensuring work balance.
The work term depends on per-vertex update counts; the communication term depends on the edges cut between processors. The problem is NP-hard, so we use the METIS fast partitioning heuristic. But the update counts are not known in advance!
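The two quality terms of the objective can be sketched directly; the input shapes (`part` mapping vertex to processor, `work` as estimated update counts) are illustrative assumptions.

```python
def work_imbalance(part, work, p):
    """Maximum processor load divided by the ideal balanced load."""
    loads = [0.0] * p
    for v, cpu in part.items():
        loads[cpu] += work[v]
    ideal = sum(loads) / p
    return max(loads) / ideal          # 1.0 means perfectly balanced

def communication_cost(part, edges):
    """Number of edges cut by the partition (messages crossing CPUs)."""
    return sum(1 for u, v in edges if part[u] != part[v])
```

A partitioner must trade these off: merging everything onto one CPU zeroes the cut but maximizes imbalance, and vice versa.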
Unknown Update Counts
Update counts are determined by belief scheduling and depend on the graph structure, the factors, and more. There is little correlation between past and future update counts.
[Figure: update counts on the noisy image vs. an uninformed cut.]

Uninformed Cuts
Cutting without update counts yields greater work imbalance but lower communication cost: one processor may get too much work and another too little.
[Figure: work imbalance and communication cost on Denoise and UW-Systems, comparing uninformed and optimal cuts.]
Over-Partitioning
Over-cut the graph into k·p partitions and randomly assign the pieces to CPUs. This increases balance but also increases communication cost (more boundary).
[Figure: without over-partitioning, each CPU owns one large block; with k = 6, many small blocks are interleaved across CPU 1 and CPU 2.]
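The over-partitioning step itself is simple to sketch. Here `partition` is a hypothetical stand-in for the actual cut routine (METIS in the talk); only the k·p-pieces-then-random-assignment structure is from the slides.

```python
import random

def over_partition(vertices, p, k, partition, seed=0):
    """Cut into k*p pieces, then randomly assign pieces to p CPUs."""
    pieces = partition(vertices, k * p)          # k*p small pieces
    rng = random.Random(seed)
    cpu_of_piece = [rng.randrange(p) for _ in range(k * p)]
    return {v: cpu_of_piece[piece]
            for piece, vs in enumerate(pieces)
            for v in vs}
```

Random assignment of many small pieces averages out the unknown per-piece work, which is why balance improves even without update counts.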
Over-Partitioning Results
Over-partitioning provides a simple method to trade between work balance and communication cost.
[Figure: as the partition factor k grows from 1 to 15, work imbalance falls while communication cost rises.]
CPU Utilization
Over-partitioning improves CPU utilization.
[Figure: active CPUs over time for the UW-Systems MLN and for Denoise; 10x over-partitioning keeps more CPUs active for longer than no over-partitioning.]
DBRSplash Algorithm
1) Over-partition the factor graph and randomly assign the pieces to processors.
2) Schedule Splashes locally using belief residuals.
3) Transmit messages on the boundary.
Each CPU maintains local state and a scheduling queue; CPUs communicate over a fast reliable network.
Experiments
Implemented in C++ using MPICH2 as the message-passing API.
Ran on the Intel OpenCirrus cluster: 120 processors (15 nodes, each with two quad-core Intel Xeon processors, connected by a gigabit Ethernet switch).
Tested on Markov Logic Networks obtained from Alchemy [Domingos et al., SSPR 08].
We present results on the largest (UW-Systems) and smallest (UW-Languages) MLNs.
Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors. Single-processor running time: 1 hour.
Speedup is linear to super-linear up to 120 CPUs; the super-linearity comes from cache efficiency.
[Figure: speedup vs. number of CPUs for no over-partitioning and 5x over-partitioning, against the linear-speedup line.]
Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors. Single-processor running time: 1.5 minutes.
Speedup is linear to super-linear up to 30 CPUs; beyond that, network costs quickly dominate the short running time.
[Figure: speedup vs. number of CPUs for no over-partitioning and 5x over-partitioning, against the linear-speedup line.]
Summary
The Splash operation generalizes the optimal parallel schedule on chain graphs.
Belief-based scheduling addresses the issues with message scheduling and improves accuracy and convergence.
DBRSplash is an efficient distributed parallel inference algorithm that uses over-partitioning to improve work balance.
Experimental results on large factor graphs show linear to super-linear speedup using up to 120 processors.
Thank You
Acknowledgements: Intel Research Pittsburgh (OpenCirrus cluster), AT&T Labs, DARPA.
Exponential Parallelism (from Saman Amarasinghe)
[Figure: cores per chip (1 to 512, log scale) vs. release year (1970 onward). Single-core processors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Opteron, Power4, Power6, PA-8800) give way to multicore designs (Niagara, Yonah, Pentium Extreme, Tanglewood, Cell, Intel Tflops, Xbox360, Opteron 4P, Xeon MP, Raw, Cavium Octeon, Raza XLR, Cisco CSR-1, Broadcom 1480, Picochip PC102, Ambric AM2045), with core counts growing exponentially.]
AIStats09 Speedup
[Figure: speedup results from the AIStats 09 paper on 3D video prediction and protein side-chain prediction.]