Max-Margin Latent Variable Models
M. Pawan Kumar
Feb 23, 2016
With Daphne Koller, Ben Packer,
Kevin Miller, Rafi Witten, Tim Tang, Danny Goodman,
Haithem Turki, Dan Preston, Dan Selsam, Andrej Karpathy
Computer Vision Data
[Figure: Log(Size) vs. Information. Segmentation: ~2,000 images; Bounding Box: ~12,000; Image-Level (e.g., "Car", "Chair"): >14M.]
Computer Vision Data
[Figure: Log(Size) vs. Information. Segmentation: ~2,000 images; Bounding Box: ~12,000; Image-Level: >14M; Noisy Label: >6B.]
Learn with missing information (latent variables)
Outline
• Two Types of Problems
• Latent SVM (Background)
• Self-Paced Learning
• Max-Margin Min-Entropy Models
• Discussion
Annotation Mismatch
Learn to classify an image
Image x
Annotation a = "Deer"
Latent variable h (e.g., the object's location)
Mismatch between desired and available annotations
Exact value of latent variable is not "important"
Annotation Mismatch
Learn to classify a DNA sequence
Sequence x
Annotation a ∈ {+1, −1}
Latent variables h (e.g., the motif's location)
Mismatch between desired and available annotations
Exact value of latent variable is not "important"
Output Mismatch
Learn to segment an image
Available: image-level pair (x, a), e.g., a = "Cow"; desired: full segmentation (a, h)
Mismatch between desired output and available annotations
Exact value of latent variable is important
Output Mismatch
Learn to classify actions
Image x with latent person boxes ha, hb; annotation a ∈ {+1, −1} for "jumping"
Mismatch between desired output and available annotations
Exact value of latent variable is important
Outline
• Two Types of Problems
• Latent SVM (Background)
• Self-Paced Learning
• Max-Margin Min-Entropy Models
• Discussion
Latent SVM
Image x
Annotation a = "Deer"
Latent variable h
Features: $\Phi(x, a, h)$
Parameters: $w$
Score: $w^\top \Phi(x, a, h)$
Andrews et al., 2001; Smola et al., 2005; Felzenszwalb et al., 2008; Yu and Joachims, 2009
Inference: $(a(w), h(w)) = \arg\max_{a,h} w^\top \Phi(x, a, h)$
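As a concrete aside, a minimal sketch of this joint maximization, assuming finite annotation and latent spaces and a user-supplied feature map `phi` (both are placeholders, not part of the original system):

```python
import numpy as np

def latent_svm_predict(w, x, labels, latents, phi):
    """Joint inference (a(w), h(w)) = argmax_{a,h} w^T phi(x, a, h),
    by brute-force enumeration over finite label and latent spaces."""
    return max(((a, h) for a in labels for h in latents),
               key=lambda ah: float(w @ phi(x, *ah)))
```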
Parameter Learning
maxh wT(xi,ai,h)
≥
wT(x,a,h)
+ Δ(ai,a) - ξi
min ||w||2 + CΣi ξi
Annotation Mismatch
Optimization
Update $h_i^* = \arg\max_h w^\top \Phi(x_i, a_i, h)$
Update w by solving a convex problem:
$\min_w \; \|w\|^2 + C \sum_i \xi_i$
s.t. $w^\top \Phi(x_i, a_i, h_i^*) - w^\top \Phi(x_i, a, h) \;\geq\; \Delta(a_i, a) - \xi_i \quad \forall a, h$
Repeat until convergence
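A minimal sketch of this alternating (CCCP-style) procedure, assuming finite label and latent spaces; `phi` and `delta` are placeholder callables, and a few subgradient steps stand in for an exact convex QP solver:

```python
import numpy as np

def cccp_latent_svm(X, A, labels, latents, phi, delta,
                    C=1.0, outer_iters=10, inner_iters=50, lr=0.01):
    """Alternate between imputing h_i* and updating w (a sketch)."""
    d = phi(X[0], A[0], latents[0]).shape[0]
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Step 1: impute latent variables under the current w.
        H = [max(latents, key=lambda h: w @ phi(x, a, h))
             for x, a in zip(X, A)]
        # Step 2: update w on the resulting convex problem; subgradient
        # steps on the margin-rescaled hinge stand in for an exact solver.
        for _ in range(inner_iters):
            grad = 2.0 * w
            for x, a_true, h_star in zip(X, A, H):
                # Most violated (a, h) under the loss-augmented score.
                a_hat, h_hat = max(
                    ((a, h) for a in labels for h in latents),
                    key=lambda ah: w @ phi(x, *ah) + delta(a_true, ah[0]))
                margin = (w @ phi(x, a_hat, h_hat) + delta(a_true, a_hat)
                          - w @ phi(x, a_true, h_star))
                if margin > 0:  # constraint violated: hinge is active
                    grad += C * (phi(x, a_hat, h_hat)
                                 - phi(x, a_true, h_star))
            w -= lr * grad
    return w
```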
Outline
• Two Types of Problems
• Latent SVM (Background)
• Self-Paced Learning
• Max-Margin Min-Entropy Models
• Discussion
Self-Paced Learning
Kumar, Packer and Koller, NIPS 2010
1 + 1 = 2
1/3 + 1/6 = 1/2
$e^{i\pi} + 1 = 0$
"Math is for losers!!"
FAILURE … BAD LOCAL MINIMUM
Self-Paced Learning
Kumar, Packer and Koller, NIPS 2010
1 + 1 = 2
1/3 + 1/6 = 1/2
$e^{i\pi} + 1 = 0$
"Euler was a Genius!!"
SUCCESS … GOOD LOCAL MINIMUM
Optimization
Update $h_i^* = \arg\max_h w^\top \Phi(x_i, a_i, h)$
Update w and sample weights $v_i \in \{0, 1\}$ by solving a convex problem:
$\min_{w, v} \; \|w\|^2 + C \sum_i v_i \xi_i - \lambda \sum_i v_i$
s.t. $w^\top \Phi(x_i, a_i, h_i^*) - w^\top \Phi(x_i, a, h) \;\geq\; \Delta(a_i, a) - \xi_i \quad \forall a, h$
Anneal $\lambda \leftarrow \lambda\mu$; repeat until convergence
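A minimal sketch of the self-paced selection step, assuming the per-sample hinge losses $\xi_i$ have already been computed under the current w; `update_w` and `compute_hinge` in the commented loop are hypothetical stand-ins:

```python
import numpy as np

def self_paced_weights(hinge_losses, lam, C=1.0):
    """Closed-form optimum of v for fixed w: minimizing
    C * v_i * xi_i - lam * v_i over v_i in {0, 1} selects exactly
    the "easy" samples with C * xi_i < lam."""
    return (C * np.asarray(hinge_losses) < lam).astype(float)

# Sketch of the outer loop (update_w / compute_hinge are placeholders):
#   v, lam = np.ones(n), lam0
#   while not converged:
#       w = update_w(data, v)            # convex in w for fixed v
#       xi = compute_hinge(data, w)      # per-sample slacks
#       v = self_paced_weights(xi, lam)  # easy-sample selection
#       lam *= mu                        # anneal: admit harder samples
```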
Image Classification
Kumar, Packer and Koller, NIPS 2010
[Bar charts: training Objective (≈4.4–4.75) and Test Error (≈14.5–17.5%), CCCP vs. SPL.]
HOG-Based Model. Dalal and Triggs, 2005
Image Classification
~ 5000 images
50/50 train/test split
5 folds
PASCAL VOC 2007 Dataset
Car vs. Not-Car
Image Classification
Witten, Miller, Kumar, Packer and Koller, In Preparation
[Chart: training Objective with HOG + Dense SIFT + Dense Color SIFT features.]
SPL+ – Different features choose different "easy" samples
Image Classification
Witten, Miller, Kumar, Packer and Koller, In Preparation
[Chart: Mean Average Precision with HOG + Dense SIFT + Dense Color SIFT features.]
SPL+ – Different features choose different "easy" samples
Motif Finding
~ 40,000 sequences
50/50 train/test split
5 folds
UniProbe Dataset
Binding vs. Not-Binding
Motif Finding
Kumar, Packer and Koller, NIPS 2010
[Bar charts: training Objective (≈0–140) and Test Error (≈28–36%), CCCP vs. SPL.]
Motif + Markov Background Model. Yu and Joachims, 2009
Semantic Segmentation
Stanford Background: Train – 572 images, Validation – 53 images, Test – 90 images
VOC Segmentation 2009: Train – 1274 images, Validation – 225 images, Test – 750 images
Semantic Segmentation
+ additional weakly supervised training data:
Bounding Box Data (VOC Detection 2009): Train – 1564 images
Image-Level Data (ImageNet): Train – 1000 images
Semantic Segmentation
Kumar, Turki, Preston and Koller, ICCV 2011
[Bar charts: VOC Overlap (≈22–30) and SBD Overlap (≈52–55.5), SUP vs. CCCP vs. SPL.]
Region-based Model. Gould, Fulton and Koller, 2009
SUP – Supervised Learning (Segmentation Data Only)
Action Classification
PASCAL VOC 2011
Bounding Box Data: Train – 3000 instances, Test – 3000 instances
+ Noisy Data: Train – 10000 images
Action Classification
Packer, Kumar, Tang and Koller, In Preparation
[Bar chart: Mean Average Precision (≈60.8–62.8), SUP vs. CCCP vs. SPL.]
Poselet-based Model. Maji, Bourdev and Malik, 2011
Self-Paced Multiple Kernel Learning
Kumar, Packer and Koller, In Preparation
1 + 1 = 2 (Integers)
1/3 + 1/6 = 1/2 (Rational Numbers)
$e^{i\pi} + 1 = 0$ (Imaginary Numbers)
USE A FIXED MODEL
Self-Paced Multiple Kernel Learning
Kumar, Packer and Koller, In Preparation
1 + 1 = 2 (Integers)
1/3 + 1/6 = 1/2 (Rational Numbers)
$e^{i\pi} + 1 = 0$ (Imaginary Numbers)
ADAPT THE MODEL COMPLEXITY
Optimization
Update $h_i^* = \arg\max_h w^\top \Phi(x_i, a_i, h)$
Update $\hat{w}$ and kernel weights $c$ by solving a convex problem:
$\min_{w, v, c} \; \|w\|^2 + C \sum_i v_i \xi_i - \lambda \sum_i v_i$
s.t. $w^\top \Phi(x_i, a_i, h_i^*) - w^\top \Phi(x_i, a, h) \;\geq\; \Delta(a_i, a) - \xi_i$, with $v_i \in \{0, 1\}$
Kernel: $K_{ij} = \Phi(x_i, a_i, h_i)^\top \Phi(x_j, a_j, h_j)$, combined as $K = \sum_k c_k K_k$
Anneal $\lambda \leftarrow \lambda\mu$; repeat until convergence
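A minimal sketch of the kernel-combination step, assuming precomputed base kernel matrices (the example kernels and weights below are made up; in SPMKL the weights c would be learned jointly with w, here they are only applied):

```python
import numpy as np

def combine_kernels(base_kernels, c):
    """Form K = sum_k c_k K_k from precomputed base kernels.

    base_kernels : list of (n, n) PSD matrices K_k
    c            : nonnegative mixing weights, one per kernel
    A convex combination of PSD matrices is again PSD, so K is a
    valid kernel.
    """
    c = np.asarray(c, dtype=float)
    assert np.all(c >= 0), "weights must be nonnegative to keep K PSD"
    return sum(ck * Kk for ck, Kk in zip(c, base_kernels))

# Hypothetical usage with two base kernels on 3 points:
K1 = np.eye(3)            # e.g., a linear kernel
K2 = np.ones((3, 3))      # e.g., a constant/bias kernel
K = combine_kernels([K1, K2], [0.7, 0.3])
```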
Image Classification
Kumar, Packer and Koller, In Preparation
[Bar charts: training Objective (≈0–1) and Test Error (≈0–18%), FIXED vs. SPMKL.]
HOG-Based Model. Dalal and Triggs, 2005
Motif Finding
~ 40,000 sequences
50/50 train/test split
5 folds
UniProbe Dataset
Binding vs. Not-Binding
Motif Finding
Kumar, Packer and Koller, NIPS 2010
[Bar charts: training Objective (≈69–78) and Test Error (≈8.5–11.5%), FIXED vs. SPMKL.]
Motif + Markov Background Model. Yu and Joachims, 2009
Outline
• Two Types of Problems
• Latent SVM (Background)
• Self-Paced Learning
• Max-Margin Min-Entropy Models
• Discussion
MAP Inference
$\Pr(a, h | x) = \exp(w^\top \Phi(x, a, h)) / Z(x)$
Example joint distributions over a 3×3 grid of latent locations h:
Pr(a1, h | x):      Pr(a2, h | x):
0.00 0.00 0.25      0.00 0.00 0.01
0.00 0.25 0.00      0.00 0.24 0.00
0.00 0.00 0.25      0.00 0.00 0.00
MAP inference: $\min_{a,h} -\log \Pr(a, h | x)$
Value of latent variable?
Min-Entropy Inference
$\min_a -\log \Pr(a | x) + H_\alpha(\Pr(h | a, x)) \;=\; \min_a H_\alpha(Q(a; x, w))$
$Q(a; x, w)$ = set of all $\{\Pr(a, h | x)\}$
$H_\alpha$ = Rényi entropy of the generalized distribution
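To make the quantities concrete, a small sketch computing $H_\alpha$ of the generalized distribution for the toy grids above. It assumes the slide's identity $H_\alpha(Q(a)) = -\log \Pr(a|x) + H_\alpha(\Pr(h|a,x))$, which simplifies to the expression in the code; as $\alpha \to \infty$ it tends to $-\log \max_h \Pr(a, h|x)$:

```python
import numpy as np

def renyi_entropy_generalized(joint_over_h, alpha):
    """H_alpha of a generalized distribution Q(a) = {Pr(a,h|x), all h}.

    Simplifies -log Pr(a|x) + H_alpha(Pr(h|a,x)) to
    (log sum_h p_h^alpha - log sum_h p_h) / (1 - alpha), alpha != 1.
    """
    p = np.asarray(joint_over_h, dtype=float).ravel()
    p = p[p > 0]  # zero cells contribute nothing
    if np.isinf(alpha):
        return -np.log(p.max())
    return (np.log((p ** alpha).sum()) - np.log(p.sum())) / (1.0 - alpha)

# Toy grids from the slide: a1 spreads its mass, a2 concentrates it.
Q_a1 = [[0.00, 0.00, 0.25], [0.00, 0.25, 0.00], [0.00, 0.00, 0.25]]
Q_a2 = [[0.00, 0.00, 0.01], [0.00, 0.24, 0.00], [0.00, 0.00, 0.00]]
for alpha in (1.5, 10.0, np.inf):
    scores = {"a1": renyi_entropy_generalized(Q_a1, alpha),
              "a2": renyi_entropy_generalized(Q_a2, alpha)}
    # Min-entropy inference picks the annotation with the lowest H_alpha.
    print(alpha, min(scores, key=scores.get), scores)
```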
Max-Margin Min-Entropy Models
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
$\min_w \; \|w\|^2 + C \sum_i \xi_i$
s.t. $H_\alpha(Q(a; x_i, w)) - H_\alpha(Q(a_i; x_i, w)) \;\geq\; \Delta(a_i, a) - \xi_i$, $\xi_i \geq 0$
Like latent SVM, minimizes $\Delta(a_i, a_i(w))$
In fact, when α = ∞ …
Max-Margin Min-Entropy Models
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
$\min_w \; \|w\|^2 + C \sum_i \xi_i$
s.t. $\max_h w^\top \Phi(x_i, a_i, h) - \max_h w^\top \Phi(x_i, a, h) \;\geq\; \Delta(a_i, a) - \xi_i$, $\xi_i \geq 0$
Like latent SVM, minimizes $\Delta(a_i, a_i(w))$
In fact, when α = ∞ … Latent SVM
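A one-line check of this reduction (my own derivation from the slide's definitions): at α = ∞ the Rényi entropy of the generalized distribution collapses to a negative max, and the log-partition terms cancel in the constraint.

```latex
H_\infty(Q(a; x, w)) = -\log \max_h \Pr(a, h | x)
                     = -\max_h w^\top \Phi(x, a, h) + \log Z(x)
\;\Rightarrow\;
H_\infty(Q(a; x_i, w)) - H_\infty(Q(a_i; x_i, w))
  = \max_h w^\top \Phi(x_i, a_i, h) - \max_h w^\top \Phi(x_i, a, h),
```

which is exactly the latent SVM margin constraint above.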
Image Classification
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
[Results charts not recovered.]
HOG-Based Model. Dalal and Triggs, 2005
Motif Finding
~ 40,000 sequences
50/50 train/test split
5 folds
UniProbe Dataset
Binding vs. Not-Binding
Motif Finding
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
[Results chart not recovered.]
Motif + Markov Background Model. Yu and Joachims, 2009
Outline
• Two Types of Problems
• Latent SVM (Background)
• Self-Paced Learning
• Max-Margin Min-Entropy Models
• Discussion
Very Large Datasets
• Initialize parameters using supervised data
• Impute latent variables (inference)
• Select easy samples (very efficient)
• Update parameters using incremental SVM
• Refine efficiently with proximal regularization (see the sketch below)
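A self-contained toy sketch of this pipeline. Everything here is illustrative: the data, feature map, and hinge are made up; subgradient steps stand in for an incremental SVM, and a quadratic pull toward the previous iterate stands in for proximal refinement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable data: x in R^2, label a = +/-1, latent h in {0, 1}
# picks which of two feature views carries the signal.
def phi(x, a, h):
    z = np.zeros(4)
    z[2 * h: 2 * h + 2] = a * x
    return z

X = rng.normal(size=(200, 2))
A = np.where(X[:, 0] > 0, 1.0, -1.0)

C, lam, mu, rho, lr = 1.0, 0.5, 1.3, 0.1, 0.05
w = np.zeros(4)

# 1. Initialize parameters using a small "supervised" subset (h known = 0).
for x, a in zip(X[:50], A[:50]):
    if w @ phi(x, a, 0) < 1.0:
        w += lr * C * phi(x, a, 0)

for _ in range(10):
    w_prev = w.copy()
    # 2. Impute latent variables (inference).
    H = [max((0, 1), key=lambda h: w @ phi(x, a, h)) for x, a in zip(X, A)]
    # 3. Select easy samples: small hinge loss, very cheap to test.
    xi = np.array([max(0.0, 1.0 - w @ phi(x, a, h))
                   for x, a, h in zip(X, A, H)])
    easy = C * xi < lam
    # 4. Update parameters on easy samples (stand-in for incremental SVM).
    for x, a, h, e in zip(X, A, H, easy):
        if e and w @ phi(x, a, h) < 1.0:
            w += lr * C * phi(x, a, h)
    # 5. Refine with proximal regularization toward the previous iterate.
    w -= lr * (2.0 * w + rho * (w - w_prev))
    lam *= mu
```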
Output Mismatch
Objective: $\sum_h \Pr_\theta(h | a, x) \, \Delta(a, h, a(w), h(w)) + A(\theta)$
$A(\theta)$: C. R. Rao's Relative Quadratic Entropy
Minimize over w and θ, alternating between the two:
Minimize over w (θ fixed)
[Animation frames: distribution $\Pr_\theta(h, a | x)$ over (a1, h) and (a2, h).]
Minimize over θ (w fixed)
[Animation frames: distribution $\Pr_\theta(h, a | x)$ over (a1, h) and (a2, h).]
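A toy sketch of this alternation, entirely illustrative: the loss table is made up, and a softmin with temperature `eta` stands in for the actual Rao-entropy-regularized update of θ.

```python
import numpy as np

def update_theta(losses_for_pred, eta=2.0):
    """theta-step (sketch): with w fixed, put more mass of
    Pr_theta(h|a,x) on latent values with low loss; the softmin
    temperature eta stands in for the entropy penalty A(theta)."""
    p = np.exp(-eta * np.asarray(losses_for_pred, dtype=float))
    return p / p.sum()

# losses[j, h] = Delta(a, h, a_j(w), h_j(w)) for candidate prediction j
# against latent value h (a made-up 2-predictions-by-3-latents table).
losses = np.array([[0.0, 1.0, 1.0],
                   [1.0, 0.2, 0.3]])
p_h = np.full(3, 1.0 / 3.0)              # start from a uniform theta
for _ in range(3):
    j = int(np.argmin(losses @ p_h))     # w-step: minimize expected loss
    p_h = update_theta(losses[j])        # theta-step: reweight latents
```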