Carnegie Mellon
Thesis Defense
Joseph K. Bradley
Learning Large-Scale Conditional Random Fields

Committee:
• Carlos Guestrin (U. of Washington, Chair)
• Tom Mitchell
• John Lafferty (U. of Chicago)
• Andrew McCallum (U. of Massachusetts at Amherst)

1/18/2013
Modeling Distributions

Goal: Model distribution P(X) over random variables X.
E.g.: Model life of a grad student.
Markov Random Fields (MRFs)

Goal: Model distribution P(X) over random variables X.
E.g.: Model life of a grad student.

X1: losing sleep?  X2: deadline?  X3: sick?  X4: losing hair?  X5: overeating?  X6: loud roommate?  X7: taking classes?  X8: cold weather?  X9: exercising?  X10: single?  X11: gaining weight?
Markov Random Fields (MRFs)

[Figure: the variables X1...X11 as nodes in a graph. Edges give the graphical structure; each edge carries a factor (parameters).]
Goal: Model distribution P(X) over random variables X.

Conditional Random Fields (CRFs)

[Figure: a CRF with output variables Y1...Y5 connected to observed variables X1...X6.]

MRFs model P(X); CRFs model P(Y|X) (Lafferty et al., 2001).
CRFs do not model P(X), and they have simpler structure (over Y only).
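To make the contrast concrete, here is the standard log-linear form of both models, a sketch in the usual notation (θ for parameters, φ for features, Z for the partition function; these symbols are this writeup's convention, not copied from the slides):

```latex
% MRF: joint distribution over X
P_\theta(X) = \frac{1}{Z(\theta)} \exp\Big( \sum_j \theta_j \, \phi_j(X) \Big)

% CRF: conditional distribution of Y given X (Lafferty et al., 2001).
% The partition function Z(X, \theta) is computed per observed X,
% so the model never has to represent P(X) itself.
P_\theta(Y \mid X) = \frac{1}{Z(X, \theta)} \exp\Big( \sum_j \theta_j \, \phi_j(Y, X) \Big)
```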
MRFs & CRFs

Benefits:
• Principled statistical and computational framework
• Large body of literature

Applications:
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...
Challenges

Goal: Given data, learn CRF structure and parameters.

[Figure: an example CRF over outputs Y1...Y5 and inputs X1...X6.]

• Many learning methods require inference, i.e., answering queries P(A|B): NP-hard to approximate (Roth, 1996).
• Learning is a big structured optimization problem: NP-hard in general (Srebro, 2003).
• Approximations often lack strong guarantees.
Thesis Statement
CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems.
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
Outline

Scaling core methods:
• Parameter Learning: learning without intractable inference
• Structure Learning: learning tractable structures

Parallel scaling:
• Parallel Regression: multicore sparse regression

(Both core methods are solved via sparse regression.)
Outline

This part: Parameter Learning (learning without intractable inference).
Log-linear MRFs

Goal: Model distribution P(X) over random variables X.

[Figure: the MRF over X1...X11 again.]

A log-linear MRF has the form

  P_θ(X) = exp( θ · φ(X) ) / Z(θ),

where θ are the parameters and φ(X) are the features.

All results generalize to CRFs.
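As a minimal sketch of this parameterization (a hypothetical toy model, not from the slides), the snippet below scores a pairwise Ising-style MRF and computes the partition function Z by brute-force enumeration; this exponential-time enumeration is exactly what becomes intractable for large models:

```python
import itertools
import numpy as np

# Toy chain X1 - X2 - X3 with associative pairwise factors (hypothetical example).
EDGES = [(0, 1), (1, 2)]
THETA = np.array([0.8, 0.8])

def log_score(x, theta, edges):
    """theta . phi(x): +theta_e when an edge's endpoints agree, -theta_e otherwise."""
    return sum(t * (1.0 if x[i] == x[j] else -1.0)
               for t, (i, j) in zip(theta, edges))

def mrf_prob(x, theta, edges, n_vars):
    """P(x) = exp(theta . phi(x)) / Z, with Z by brute-force enumeration (2^n terms)."""
    log_z = np.logaddexp.reduce([log_score(a, theta, edges)
                                 for a in itertools.product([0, 1], repeat=n_vars)])
    return np.exp(log_score(x, theta, edges) - log_z)

print(mrf_prob((1, 1, 1), THETA, EDGES, n_vars=3))
```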
Parameter Learning: MLE

Parameter Learning: given structure Φ and samples from P_θ*(X), learn parameters θ.

Traditional method: max-likelihood estimation (MLE). Minimize the objective:

  Loss(θ) = -(1/n) Σ_i log P_θ(x^(i))

Gold Standard: MLE is (optimally) statistically efficient.
Parameter Learning: MLE

MLE requires inference, which is provably hard for general MRFs (Roth, 1996). (PAC = Probably Approximately Correct; Valiant, 1984.)

MPLE (max pseudolikelihood estimation; Besag, 1975) instead minimizes a sum of local conditional losses:

  Loss_MPLE(θ) = -(1/n) Σ_i Σ_j log P_θ( x_j^(i) | x_-j^(i) )

Each conditional involves only one variable and its neighbors, so MPLE avoids intractable inference.
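A sketch of the MPLE objective for the toy Ising-style chain used earlier (the model and function names are this writeup's, not the thesis code); note that each conditional needs only a two-term normalizer, never the global Z:

```python
import numpy as np

def neg_log_pseudolikelihood(theta, data, edges):
    """-sum_i sum_j log P_theta(x_j | x_-j) for a binary pairwise model."""
    nll = 0.0
    for x in data:
        for j in range(len(x)):
            # Score x_j = 0 and x_j = 1 with all other variables held fixed;
            # only edges touching j matter.
            scores = []
            for v in (0, 1):
                s = sum(t * (1.0 if (v if a == j else x[a]) ==
                                    (v if b == j else x[b]) else -1.0)
                        for t, (a, b) in zip(theta, edges) if j in (a, b))
                scores.append(s)
            # log P(x_j | x_-j) = score(x_j) - log(exp(score_0) + exp(score_1))
            nll -= scores[x[j]] - np.logaddexp(scores[0], scores[1])
    return nll

# Usage sketch: minimize with any off-the-shelf optimizer,
# e.g. scipy.optimize.minimize.
```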
Sample Complexity: MLE

Our Theorem: a bound on n (# training examples needed) in terms of:
• the number of parameters (the length of θ)
• Λmin: the minimum eigenvalue of the Hessian of the loss at θ*
• ε: the parameter error (in L1 norm)
• δ: the probability of failure

Recall: MLE requires intractable inference.
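The slide presents the bound only pictorially; as a hedged sketch, bounds in this family (cf. Ravikumar et al., 2010) typically take a shape like the following, with r the number of parameters. The exact constants and exponents in the thesis may differ:

```latex
n \;=\; O\!\left( \frac{r^{2}\,\log(r/\delta)}{\Lambda_{\min}^{2}\,\varepsilon^{2}} \right)
\quad\Longrightarrow\quad
\Pr\left[\, \|\hat{\theta} - \theta^{*}\|_{1} \le \varepsilon \,\right] \ge 1 - \delta
```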
Sample Complexity: MPLE

Our Theorem: a bound on n (# training examples needed) of the same form, but with
Λmin = min_i [ the minimum eigenvalue of the Hessian of component i at θ* ],
again in terms of the number of parameters, the parameter error ε (L1), and the probability of failure δ.

Recall: MPLE needs only tractable inference. This gives PAC learnability for many MRFs!
Sample Complexity: MPLE

Our Theorem: a bound on n (# training examples needed). PAC learnability for many MRFs!

Related Work:
• Ravikumar et al. (2010): regression Y_i ~ X with Ising models; the basis of our theory.
• Liang & Jordan (2008): asymptotic analysis of MLE and MPLE; our bounds match theirs.
• Abbeel et al. (2006): the only previous method with PAC bounds for high-treewidth MRFs. We extend their work (extension to CRFs, algorithmic improvements, analysis); their method is very similar to MPLE.
Trade-offs: MLE & MPLE

Our Theorem (the bound on n) exposes a sample complexity vs. computational complexity trade-off:
• MLE: larger Λmin => lower sample complexity, but higher computational complexity.
• MPLE: smaller Λmin => higher sample complexity, but lower computational complexity.
Trade-offs: MPLE

Joint optimization for MPLE: optimize all components together. Lower sample complexity.

Disjoint optimization for MPLE: optimize each variable's conditional (X1 | rest, X2 | rest, ...) separately. A parameter shared by two components gets 2 estimates; average the estimates. Data-parallel.

This is a sample complexity vs. parallelism trade-off.
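A minimal sketch of the disjoint scheme (the solver stub and names are hypothetical): each component is fit independently, which is embarrassingly data-parallel, and edge parameters estimated twice are averaged:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fit_component(args):
    """Fit the local conditional X_j | X_-j. Returns estimates of theta_e for
    every edge e incident to j. Stub: a real convex solver goes here."""
    j, data, edges = args
    return {k: 0.0 for k, e in enumerate(edges) if j in e}  # placeholder values

def disjoint_mple(data, n_vars, edges):
    # Components are independent problems, so they can run on separate workers.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(fit_component,
                                [(j, data, edges) for j in range(n_vars)]))
    # Each edge parameter is estimated by both endpoint components: average them.
    theta = np.zeros(len(edges))
    for k in range(len(edges)):
        estimates = [r[k] for r in results if k in r]
        theta[k] = np.mean(estimates)
    return theta
```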
Synthetic CRFs

Factor types: random and associative. Structures: chains, stars, grids.
Factor strength = strength of variable interactions.
Predictive Power of Bounds

Prediction: errors should be ordered MLE < MPLE < MPLE-disjoint.

[Plot: L1 parameter error ε vs. # training examples for MLE, MPLE, and MPLE-disjoint (lower is better). Factors: random, fixed strength; length-4 chains.]
Predictive Power of Bounds

[Plot: actual error ε vs. the MLE & MPLE sample complexity bounds (lower is better; models to the right are harder). Factors: random; length-6 chains; 10,000 training examples.]
Failure Modes of MPLE

How do Λmin(MLE) and Λmin(MPLE) vary for different models? We vary three model properties that drive sample complexity: model diameter, factor strength, and node degree.
Λmin: Model Diameter

[Plot: Λmin ratio MLE/MPLE (higher = MLE better) vs. model diameter. Factors: associative, fixed strength; chains.]

Relative MPLE performance is independent of diameter in chains. (Same for random factors.)
Λmin: Factor Strength

[Plot: Λmin ratio MLE/MPLE (higher = MLE better) vs. factor strength. Factors: associative; length-8 chains.]

MPLE performs poorly with strong factors. (Same for random factors, and for star & grid models.)
Λmin: Node Degree

[Plot: Λmin ratio MLE/MPLE (higher = MLE better) vs. node degree. Factors: associative, fixed strength; stars.]

MPLE performs poorly with high-degree nodes. (Same for random factors.)
Stochastic Coordinate Descent (SCD)

While not converged:
  Choose a random coordinate j.
  Update w_j (closed-form minimization).
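A compact sketch of SCD for the Lasso; the closed-form coordinate update is the standard soft-thresholding step (variable names are this writeup's, not the thesis code):

```python
import numpy as np

def scd_lasso(X, y, lam, n_iters=10_000, seed=0):
    """min_w 0.5*||Xw - y||^2 + lam*||w||_1 via stochastic coordinate descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                        # residual y - Xw, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)       # precomputed ||X_j||^2
    for _ in range(n_iters):
        j = rng.integers(d)             # choose a random coordinate
        rho = X[:, j] @ r + col_sq[j] * w[j]
        w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
        r -= X[:, j] * (w_new - w[j])   # keep the residual consistent with w
        w[j] = w_new
    return w
```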
Is SCD inherently sequential?

Shotgun Algorithm (Parallel SCD)
While not converged:
  On each of P processors in parallel:
    Choose a random coordinate j.
    Update w_j (same update as in Shooting/SCD).

Nice case: uncorrelated features. Bad case: correlated features.
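A simplified sketch of the Shotgun idea: each round draws P coordinates and applies their updates computed from the same snapshot of w, mimicking P simultaneous processors (a real multicore version updates shared memory concurrently, as in the thesis):

```python
import numpy as np

def shotgun_lasso(X, y, lam, P=8, n_rounds=2_000, seed=0):
    """Parallel SCD sketch: P coordinate updates per round from one snapshot."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_rounds):
        r = y - X @ w                               # shared residual snapshot
        for j in rng.choice(d, size=min(P, d), replace=False):
            rho = X[:, j] @ r + col_sq[j] * w[j]    # same update as Shooting
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w
```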
Shotgun: Theory

Convergence Theorem. Assume the number of parallel updates satisfies P < d/ρ + 1, where d is the number of features and ρ is the spectral radius of XᵀX. Then after T iterations,

  E[ F(w^(T)) ] - F(w*) <= d ( ||w*||₂²/2 + F(w^(0)) ) / ( (T+1) P ),

where F(w^(T)) is the final objective and F(w*) the optimal objective.

Generalizes the bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
Shotgun: Theory

The theorem caps the number of parallel updates at P_max ≈ d/ρ, where ρ is the spectral radius of XᵀX:
• Nice case (uncorrelated features): ρ = 1, so P_max = d.
• Bad case (correlated features): ρ = d at worst, so P_max = 1.
Shotgun: Theory

The Convergence Theorem predicts near-linear speedups, up to the threshold P_max. Experiments match our theory!
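The threshold is easy to estimate for a given design matrix; a sketch assuming unit-normalized columns, as in the Shotgun analysis:

```python
import numpy as np

def shotgun_pmax(X):
    """P_max ~ d / rho, with rho the spectral radius of X^T X (columns normalized)."""
    d = X.shape[1]
    Xn = X / np.linalg.norm(X, axis=0)           # normalize each column
    rho = np.linalg.eigvalsh(Xn.T @ Xn)[-1]      # largest eigenvalue (symmetric PSD)
    return max(1, int(d / rho))

X = np.random.default_rng(0).standard_normal((200, 50))
print(shotgun_pmax(X))  # weakly correlated columns -> a relatively large P_max
```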
[Plots: T (iterations to convergence) vs. P (parallel updates). Mug32_singlepixcam: P_max = 79. SparcoProblem7: P_max = 284.]
Lasso Experiments

Compared many algorithms:
• Interior point (L1_LS)
• Shrinkage (FPC_AS, SpaRSA)
• Projected gradient (GPSR_BB)
• Iterative hard thresholding (Hard_l0)
• Also ran: GLMNET, LARS, SMIDAS

Setup: 35 datasets; λ = 0.5 and 10. Our methods: Shooting and Shotgun with P = 8 (multicore). Dataset groups: Single-Pixel Camera, Sparco (van den Berg et al., 2009), Sparse Compressed Imaging, and Large Sparse Datasets.

Shotgun proves most scalable & robust.
Shotgun: Speedup

Aggregated results from all tests.

[Plots: speedup vs. # cores, against the optimal linear line, in three panels: Lasso Iteration Speedup, Lasso Time Speedup, Logistic Reg. Time Speedup.]

Lasso time speedups are not so great, but we are doing fewer iterations! Explanation: the memory wall (Wulf & McKee, 1995): the memory bus gets flooded. Logistic regression uses more FLOPS per datum, so the extra computation hides memory latency, giving better speedups on average.
Summary: Parallel Regression

• Shotgun: parallel coordinate descent on multicore.
• Analysis: near-linear speedups, up to a problem-dependent limit.
• Extensive experiments (37 datasets, 7 other methods): our theory predicts empirical behavior well, and Shotgun is one of the most scalable methods.

Shotgun decomposes computation by coordinate updates, trading a little extra computation for a lot of parallelism.

Recall the thesis statement: we can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.