Loss-based Learning with Weak Supervision

M. Pawan Kumar

Jan 02, 2016
Transcript
Page 1: Loss-based Learning  with Weak Supervision

Loss-based Learning with Weak Supervision

M. Pawan Kumar

Page 2: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000.

Page 3: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000; Bounding Box: ~ 1 M.

Page 4: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000; Bounding Box: ~ 1 M; Image-Level ("Car", "Chair"): > 14 M.

Page 5: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000; Bounding Box: ~ 1 M; Image-Level: > 14 M; Noisy Label: > 6 B.

Page 6: Loss-based Learning  with Weak Supervision

Learn with missing information (latent variables)

Detailed annotation is expensive

Sometimes annotation is impossible

Desired annotation keeps changing

Computer Vision Data

Page 7: Loss-based Learning  with Weak Supervision

• Two Types of Problems

• Part I – Annotation Mismatch

• Part II – Output Mismatch

Outline

Page 8: Loss-based Learning  with Weak Supervision

Annotation Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Classification

Mismatch between desired and available annotations

Exact value of latent variable is not “important”

Desired output during test time is y

Page 9: Loss-based Learning  with Weak Supervision

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Classification

Page 10: Loss-based Learning  with Weak Supervision

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Detection

Mismatch between output and available annotations

Exact value of latent variable is important

Desired output during test time is (y,h)

Page 11: Loss-based Learning  with Weak Supervision

Part I

Page 12: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Page 13: Loss-based Learning  with Weak Supervision

Weakly Supervised Data

Input x

Output y ∈ {-1,+1}

Hidden h

x

y = +1

h

Page 14: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

x

y = +1

h

Page 15: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,+1,h) = [ Φ(x,h) ; 0 ]

x

y = +1

h

Page 16: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,-1,h) = [ 0 ; Φ(x,h) ]

x

y = +1

h

Page 17: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

Score f : Ψ(x,y,h) → (-∞, +∞)

Optimize score over all possible y and h

x

y = +1

h

Page 18: Loss-based Learning  with Weak Supervision

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Latent SVM

Parameters

Page 19: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

Empirical risk minimization

minw Σi Δ(yi, yi(w))

No restriction on the loss function

Annotation mismatch

Page 20: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Empirical risk minimization

minw Σi Δ(yi, yi(w))

Non-convex

Parameters cannot be regularized

Find a regularization-sensitive upper bound

Page 21: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Δ(yi, yi(w)) = wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w)) - wTΨ(xi,yi(w),hi(w))

Page 22: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Δ(yi, yi(w)) ≤ wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w)) - maxhi wTΨ(xi,yi,hi)

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Page 23: Loss-based Learning  with Weak Supervision

Learning Latent SVM

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Parameters can be regularized

Is this also convex?

Page 24: Loss-based Learning  with Weak Supervision

Learning Latent SVM

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Convex - Convex

Difference of convex (DC) program

Page 25: Loss-based Learning  with Weak Supervision

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Learning

Recap
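
The recap above fixes the three ingredients of a latent SVM: a joint feature map Ψ, a linear score wTΨ(x,y,h), and prediction by maximising the score jointly over y and h. Below is a minimal sketch of that prediction rule for small, enumerable output and latent spaces; the names psi, labels and latent_space are illustrative placeholders, not part of the tutorial.

```python
import numpy as np

def predict_latent_svm(w, x, labels, latent_space, psi):
    """Latent SVM prediction: maximise the score w . psi(x, y, h)
    jointly over the output y and the latent variable h."""
    best = (None, None, -np.inf)
    for y in labels:
        for h in latent_space:
            score = float(np.dot(w, psi(x, y, h)))
            if score > best[2]:
                best = (y, h, score)
    return best  # (y(w), h(w), best score)
```

In real applications the argmax is a combinatorial problem (e.g., over poselet configurations or bounding boxes) and is computed with problem-specific inference rather than exhaustive enumeration.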

Page 26: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Page 27: Loss-based Learning  with Weak Supervision

Learning Latent SVM

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Difference of convex (DC) program

Page 28: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Linear upper-bound of concave part

Page 29: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Optimize the convex upper bound

Page 30: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Linear upper-bound of concave part

Page 31: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Until Convergence

Page 32: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Linear upper bound?

Page 33: Loss-based Learning  with Weak Supervision

Linear Upper Bound

Current estimate = wt

hi* = argmaxhi wtTΨ(xi,yi,hi)

-wTΨ(xi,yi,hi*)  ≥  - maxhi wTΨ(xi,yi,hi)

Page 34: Loss-based Learning  with Weak Supervision

CCCP for Latent SVM

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence
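
As a rough illustration of the CCCP loop on this slide, the sketch below alternates between imputing hi* with the current parameters and solving the resulting convex subproblem. solve_structural_svm stands in for any ε-optimal solver of the structural-SVM step (e.g., a cutting-plane method); it, psi, delta, labels and latent_space are hypothetical placeholders, not the tutorial's code.

```python
import numpy as np

def cccp_latent_svm(data, labels, latent_space, psi, delta,
                    solve_structural_svm, dim, C=1.0, max_iter=20, tol=1e-4):
    """CCCP for latent SVM: impute latent variables with the current w,
    then solve the convex upper bound; `data` is a list of (x_i, y_i)."""
    w = np.zeros(dim)                      # initial estimate w0
    prev_obj = np.inf
    for _ in range(max_iter):
        # 1. Impute latent variables: h_i* = argmax_h w . psi(x_i, y_i, h)
        imputed = [max(latent_space, key=lambda h: np.dot(w, psi(x_i, y_i, h)))
                   for x_i, y_i in data]
        # 2. Solve the convex upper bound (a structural SVM with fixed h_i*)
        w, obj = solve_structural_svm(data, imputed, labels, latent_space,
                                      psi, delta, C)
        # 3. Stop when the upper-bound objective no longer decreases
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return w
```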

Page 35: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Page 36: Loss-based Learning  with Weak Supervision

Action Classification

Input x

Output y = “Using Computer”

PASCAL VOC 2011

80/20 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train Input xi Output yi

Page 37: Loss-based Learning  with Weak Supervision

• 0-1 loss function

• Poselet-based feature vector

• 4 seeds for random initialization

• Code + Data

• Train/Test scripts with hyperparameter settings

Setup

http://www.centrale-ponts.fr/tutorials/cvpr2013/

Page 38: Loss-based Learning  with Weak Supervision

Objective

Page 39: Loss-based Learning  with Weak Supervision

Train Error

Page 40: Loss-based Learning  with Weak Supervision

Test Error

Page 41: Loss-based Learning  with Weak Supervision

Time

Page 42: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Page 43: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence

Overfitting in initial iterations

Page 44: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε'-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Anneal the tolerance: ε' = ε'/K

Repeat until convergence and ε' = ε
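
A small sketch of the annealing schedule described above, assuming hypothetical helpers impute_latent and solve_convex_step for the two CCCP updates: each outer round is solved only to a loose tolerance ε', which is divided by K until it reaches the final tolerance ε.

```python
def cccp_annealed_tolerance(impute_latent, solve_convex_step, w0,
                            eps_final=1e-3, eps_start=1.0, K=10.0, max_rounds=100):
    """Tolerance annealing around CCCP (a sketch): early rounds are solved
    loosely, later rounds to the target tolerance eps_final."""
    w, eps = w0, eps_start
    for _ in range(max_rounds):
        h_star = impute_latent(w)                 # h_i* under the current w
        w, converged = solve_convex_step(w, h_star, eps)
        if converged and eps <= eps_final:        # repeat until convergence and eps' = eps
            break
        eps = max(eps / K, eps_final)             # anneal the tolerance
    return w
```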

Page 45: Loss-based Learning  with Weak Supervision

Objective

Page 46: Loss-based Learning  with Weak Supervision

Objective

Page 47: Loss-based Learning  with Weak Supervision

Train Error

Page 48: Loss-based Learning  with Weak Supervision

Train Error

Page 49: Loss-based Learning  with Weak Supervision

Test Error

Page 50: Loss-based Learning  with Weak Supervision

Test Error

Page 51: Loss-based Learning  with Weak Supervision

Time

Page 52: Loss-based Learning  with Weak Supervision

Time

Page 53: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Page 54: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence

Overfitting in initial iterations

Page 55: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C'∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Anneal the regularization: C' = C' x K

Repeat until convergence and C' = C

Page 56: Loss-based Learning  with Weak Supervision

Objective

Page 57: Loss-based Learning  with Weak Supervision

Objective

Page 58: Loss-based Learning  with Weak Supervision

Train Error

Page 59: Loss-based Learning  with Weak Supervision

Train Error

Page 60: Loss-based Learning  with Weak Supervision

Test Error

Page 61: Loss-based Learning  with Weak Supervision

Test Error

Page 62: Loss-based Learning  with Weak Supervision

Time

Page 63: Loss-based Learning  with Weak Supervision

Time

Page 64: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Kumar, Packer and Koller, NIPS 2010

Page 65: Loss-based Learning  with Weak Supervision

1 + 1 = 2

1/3 + 1/6 = 1/2

e^{iπ} + 1 = 0

Math is for losers !!

FAILURE … BAD LOCAL MINIMUM

CCCP for Human Learning

Page 66: Loss-based Learning  with Weak Supervision

Euler was a Genius!!

SUCCESS … GOOD LOCAL MINIMUM

1 + 1 = 2

1/3 + 1/6 = 1/2

e^{iπ} + 1 = 0

Self-Paced Learning

Page 67: Loss-based Learning  with Weak Supervision

Start with “easy” examples, then consider “hard” ones

Easy vs. Hard

Expensive

Easy for human Easy for machine

Self-Paced Learning

Simultaneously estimate easiness and parameters. Easiness is a property of data sets, not single instances.

Page 68: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

CCCP for Latent SVM

Page 69: Loss-based Learning  with Weak Supervision

min ||w||2 + C∑i ξi

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

Self-Paced Learning

Page 70: Loss-based Learning  with Weak Supervision

min ||w||2 + C∑i vi ξi

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

vi ∈ {0,1}

Trivial Solution

Self-Paced Learning

Page 71: Loss-based Learning  with Weak Supervision

vi ∈ {0,1}

min ||w||2 + C∑i vi ξi - ∑i vi/K

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

Large K, Medium K, Small K

Self-Paced Learning

Page 72: Loss-based Learning  with Weak Supervision

vi ∈ [0,1]

min ||w||2 + C∑i vi ξi - ∑i vi/K

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

Large K, Medium K, Small K

Biconvex Problem

Alternating Convex Search

Self-Paced Learning

Page 73: Loss-based Learning  with Weak Supervision

SPL for Latent SVM

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i vi ξi - ∑i vi/K
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Decrease K by a constant factor
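
For fixed w the v-update of the biconvex objective above has a closed form: vi = 1 exactly when C·ξi < 1/K, so only examples with small slack (the currently "easy" ones) are kept. The sketch below alternates this update with a weighted latent-SVM solve and decreases K so that harder examples enter gradually; compute_slacks and solve_weighted_latent_svm are hypothetical placeholders for the inner CCCP machinery, and the annealing factor is illustrative.

```python
import numpy as np

def select_easy_examples(slacks, C, K):
    """Closed-form v update for fixed w: v_i = 1 if C * xi_i < 1/K, else 0."""
    return (C * np.asarray(slacks) < 1.0 / K).astype(float)

def self_paced_learning(initial_w, compute_slacks, solve_weighted_latent_svm,
                        C=1.0, K=100.0, anneal=1.3, K_min=1e-3, n_rounds=20):
    """Self-paced learning sketch: alternate the v update with a weighted
    latent-SVM solve, decreasing K to admit harder examples over time."""
    w = initial_w
    for _ in range(n_rounds):
        v = select_easy_examples(compute_slacks(w), C, K)   # fix w, solve v
        w = solve_weighted_latent_svm(w, v)                 # fix v, solve w
        K = max(K / anneal, K_min)                          # decrease K
    return w
```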

Page 74: Loss-based Learning  with Weak Supervision

Objective

Page 75: Loss-based Learning  with Weak Supervision

Objective

Page 76: Loss-based Learning  with Weak Supervision

Train Error

Page 77: Loss-based Learning  with Weak Supervision

Train Error

Page 78: Loss-based Learning  with Weak Supervision

Test Error

Page 79: Loss-based Learning  with Weak Supervision

Test Error

Page 80: Loss-based Learning  with Weak Supervision

Time

Page 81: Loss-based Learning  with Weak Supervision

Time

Page 82: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Behl, Jawahar and Kumar, In Preparation

Page 83: Loss-based Learning  with Weak Supervision

Ranking

Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6

Average Precision = 1

Page 84: Loss-based Learning  with Weak Supervision

Ranking

Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6

Average Precision = 1, Accuracy = 1
Average Precision = 0.92, Accuracy = 0.67
Average Precision = 0.81

Page 85: Loss-based Learning  with Weak Supervision

Ranking

During testing, AP is frequently used

During training, a surrogate loss is used

Contradictory to loss-based learning

Optimize AP directly
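
For reference, average precision is the mean of precision@k over the ranks k at which relevant items appear. The snippet below computes it for a binary relevance list; the second example is one ranking of three relevant items among six that reproduces the ≈0.92 value quoted on the earlier slide (an illustrative choice of ranking, not read off the slide images).

```python
def average_precision(ranked_labels):
    """AP of a ranked list: ranked_labels[i] is 1 if the item at rank i+1
    is relevant, 0 otherwise; AP averages precision@k over relevant ranks."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

# Three relevant items among six:
print(average_precision([1, 1, 1, 0, 0, 0]))   # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))   # 0.9166..., ~0.92
```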

Page 86: Loss-based Learning  with Weak Supervision

Results

Statistically significant improvement

Page 87: Loss-based Learning  with Weak Supervision

Speed – Proximal Regularization

Start with a good initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi + Ct ||w - wt||2
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence

Page 88: Loss-based Learning  with Weak Supervision

Speed – Cascades

Weiss and Taskar, AISTATS 2010; Sapp, Toshev and Taskar, ECCV 2010

Page 89: Loss-based Learning  with Weak Supervision

Accuracy – (Self) Pacing

Pacing the sample complexity – NIPS 2010

Pacing the model complexity

Pacing the problem complexity

Page 90: Loss-based Learning  with Weak Supervision

Building Accurate Systems

Model

Inference

Learning

85%

5%

10%

Learning cannot provide huge gains without a good model

Inference cannot provide huge gains without a good model

Page 91: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions
– Latent Variable Dependent Loss
– Max-Margin Min-Entropy Models

Outline – Annotation Mismatch

Yu and Joachims, ICML 2009

Page 92: Loss-based Learning  with Weak Supervision

Latent Variable Dependent Loss

Δ(yi, yi(w), hi(w)) = wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w), hi(w)) - wTΨ(xi,yi(w),hi(w))

Page 93: Loss-based Learning  with Weak Supervision

Latent Variable Dependent Loss

Δ(yi, yi(w), hi(w)) ≤ wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w), hi(w)) - maxhi wTΨ(xi,yi,hi)

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Page 94: Loss-based Learning  with Weak Supervision

Latent Variable Dependent Loss

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y,h) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Page 95: Loss-based Learning  with Weak Supervision

Optimizing Precision@k

Input X = {xi, i = 1, …, n}

Annotation Y = {yi, i = 1, …, n} ∈ {-1,+1}n

Latent H = ranking

Δ(Y*, Y, H) = 1 - Precision@k
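
A small sketch of this latent-variable dependent loss, assuming the latent ranking H is given as a list of item indices ordered best to worst; the function name and argument layout are illustrative.

```python
def precision_at_k_loss(true_labels, ranking, k):
    """Delta(Y*, Y, H) = 1 - Precision@k: fraction of the top-k items in the
    latent ranking H whose ground-truth annotation in Y* is not +1."""
    top_k = ranking[:k]
    relevant = sum(1 for i in top_k if true_labels[i] == +1)
    return 1.0 - relevant / k

# Items 0 and 2 are positive; the ranking places item 0 first, item 1 second.
print(precision_at_k_loss([+1, -1, +1, -1], ranking=[0, 1, 2, 3], k=2))  # 0.5
```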

Page 96: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions
– Latent Variable Dependent Loss
– Max-Margin Min-Entropy (M3E) Models

Outline – Annotation Mismatch

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

Page 97: Loss-based Learning  with Weak Supervision

Running vs. Jumping Classification

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h):
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Page 98: Loss-based Learning  with Weak Supervision

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h):
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y2,h):
0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

Only maximum score used

No other useful cue?

Uncertainty in h

Running vs. Jumping Classification

Page 99: Loss-based Learning  with Weak Supervision

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy { Hα(Pw(h|y,x)) – log Pw(y|x) }

Partition Function

Marginalized Probability

Rényi Entropy

Rényi Entropy of Generalized Distribution Gα(y;x,w)

M3E

Page 100: Loss-based Learning  with Weak Supervision

Gα(y;x,w) = [1/(1-α)] log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1. Shannon Entropy of Generalized Distribution

[ - Σh Pw(y,h|x) log(Pw(y,h|x)) ] / [ Σh Pw(y,h|x) ]

Rényi Entropy

Page 101: Loss-based Learning  with Weak Supervision

Gα(y;x,w) = [1/(1-α)] log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution

- maxh log(Pw(y,h|x))

Rényi Entropy

Page 102: Loss-based Learning  with Weak Supervision

Gα(y;x,w) = [1/(1-α)] log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution

- maxh wTΨ(x,y,h)

Same prediction as latent SVM

Rényi Entropy
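
A quick numerical sketch of the Rényi entropy of the generalized distribution, consistent with the two limits on these slides: α → 1 recovers the Shannon-style expression and large α approaches -log maxh Pw(y,h|x). Zero entries are dropped in the example because they contribute nothing but would break the α = 1 formula; the values 0.24 and 0.01 are the nonzero scores shown for y2 on the earlier slide.

```python
import numpy as np

def renyi_entropy_generalized(p, alpha):
    """G_alpha = 1/(1-alpha) * log( sum_h p_h^alpha / sum_h p_h ) for the
    generalized (unnormalised over h) distribution p_h = Pw(y, h | x)."""
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        # alpha -> 1 limit: -sum p log p / sum p (generalized Shannon entropy)
        return float(-np.sum(p * np.log(p)) / np.sum(p))
    return float(np.log(np.sum(p ** alpha) / np.sum(p)) / (1.0 - alpha))

p = [0.24, 0.01]   # nonzero entries of Pw(y2, h | x) from the earlier slide
print(renyi_entropy_generalized(p, alpha=1.0))      # ~1.55
print(renyi_entropy_generalized(p, alpha=1000.0))   # ~1.43, close to -log(0.24)
print(-np.log(0.24))                                # 1.4271...
```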

Page 103: Loss-based Learning  with Weak Supervision

Training data {(xi,yi), i = 1,2,…,n}

Highly non-convex in w

Cannot regularize w to prevent overfitting

w* = argminw Σi Δ(yi,yi(w))

Learning M3E

Page 104: Loss-based Learning  with Weak Supervision

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = Gα(yi(w);xi,w) + Δ(yi,yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + Δ(yi,yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi,y) - Gα(y;xi,w) }

Learning M3E

Page 105: Loss-based Learning  with Weak Supervision

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) – Gα(y;xi,w) ≤ ξi

When α tends to infinity, M3E = Latent SVM

Other values can give better results

Learning M3E

Page 106: Loss-based Learning  with Weak Supervision

Motif + Markov Background Model. Yu and Joachims, 2009

Motif Finding Results

Page 107: Loss-based Learning  with Weak Supervision

Part II

Page 108: Loss-based Learning  with Weak Supervision

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Detection

Mismatch between output and available annotations

Exact value of latent variable is important

Desired output during test time is (y,h)

Page 109: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 110: Loss-based Learning  with Weak Supervision

Weakly Supervised Data

Input x

Output y ∈ {0,1,…,C}

Hidden h

x

y = 0

h

Page 111: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

x

y = 0

h

Page 112: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,0,h) = [ Φ(x,h) ; 0 ; … ; 0 ]

x

y = 0

h

Page 113: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,1,h) = [ 0 ; Φ(x,h) ; 0 ; … ; 0 ]

x

y = 0

h

Page 114: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,C,h) = [ 0 ; … ; 0 ; Φ(x,h) ]

x

y = 0

h

Page 115: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

Score f : Ψ(x,y,h) → (-∞, +∞)

Optimize score over all possible y and h

x

y = 0

h

Page 116: Loss-based Learning  with Weak Supervision

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Linear Model

Parameters

Page 117: Loss-based Learning  with Weak Supervision

Minimizing General Loss

minw Σi Δ(yi,hi,yi(w),hi(w))   (Supervised Samples)

+ Σi Δ'(yi,yi(w),hi(w))   (Weakly Supervised Samples)

Unknown latent variable values

Page 118: Loss-based Learning  with Weak Supervision

Minimizing General Loss

minw Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Page 119: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Kumar, Packer and Koller, ICML 2012

Page 120: Loss-based Learning  with Weak Supervision

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Page 121: Loss-based Learning  with Weak Supervision

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Page 122: Loss-based Learning  with Weak Supervision

Solution

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Diagram: Pθ(hi|yi,xi), a distribution over hi

Page 123: Loss-based Learning  with Weak Supervision

Solution: Use two different distributions for the two different tasks

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Page 124: Loss-based Learning  with Weak Supervision

The Ideal Case: No latent variable uncertainty, correct prediction

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi,hi(w)); Pθ(hi|yi,xi) over hi, concentrated on hi(w)

Page 125: Loss-based Learning  with Weak Supervision

In Practice: Restrictions in the representation power of models

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Page 126: Loss-based Learning  with Weak Supervision

Our Framework: Minimize the dissimilarity between the two distributions

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

User-defined dissimilarity measure

Page 127: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 128: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Hi(w,θ) - β Σh,h' Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi)

Page 129: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

Page 130: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

minw,θ Σi [ Hi(w,θ) - β Hi(θ,θ) ]
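
A minimal sketch of the two terms of this objective for a discrete latent space, assuming a user-supplied loss Δ; the toy 0/1 loss, the distribution values and the function names are illustrative, not taken from the tutorial.

```python
import numpy as np

def H_w_theta(delta, y_i, y_pred, h_pred, p_theta):
    """H_i(w,theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) * P_theta(h | y_i, x_i):
    expected loss of the prediction under the imputation distribution."""
    return sum(p_theta[h] * delta(y_i, h, y_pred, h_pred)
               for h in range(len(p_theta)))

def H_theta_theta(delta, y_i, p_theta):
    """H_i(theta,theta) = sum_{h,h'} Delta(y_i, h, y_i, h') P_theta(h) P_theta(h'):
    the self-dissimilarity (diversity) term of Rao's coefficient."""
    n = len(p_theta)
    return sum(p_theta[h] * p_theta[hp] * delta(y_i, h, y_i, hp)
               for h in range(n) for hp in range(n))

# Toy example: 0/1 loss on (y, h), hypothetical imputation distribution.
delta = lambda y, h, y2, h2: float(y != y2 or h != h2)
p_theta = np.array([0.7, 0.2, 0.1])
obj_i = H_w_theta(delta, 0, 0, 0, p_theta) - 0.5 * H_theta_theta(delta, 0, p_theta)
print(obj_i)   # one term of sum_i [H_i(w,theta) - beta * H_i(theta,theta)], beta = 0.5
```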

Page 131: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 132: Loss-based Learning  with Weak Supervision

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence

End

Fix w and optimize θ

Fix θ and optimize w

Page 133: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

Diagram: Pθ(hi|yi,xi) over hi, with hi(w) marked

Page 134: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

Diagram: Pθ(hi|yi,xi) over hi, with hi(w) marked

Page 135: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

Diagram: Pθ(hi|yi,xi) over hi

Page 136: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

Diagram: Pθ(hi|yi,xi) over hi

Stochastic subgradient descent

Page 137: Loss-based Learning  with Weak Supervision

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

Δ independent of h, implies latent SVM

Concave-Convex Procedure (CCCP)

Page 138: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 139: Loss-based Learning  with Weak Supervision

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train Input xi Output yi

Page 140: Loss-based Learning  with Weak Supervision

Results – 0/1 Loss

Bar chart: Average Test Loss per fold (Fold 1 to Fold 5), LSVM vs. Our method; y-axis from 0 to 1.2

Statistically Significant

Page 141: Loss-based Learning  with Weak Supervision

Results – Overlap Loss

Bar chart: Average Test Loss per fold (Fold 1 to Fold 5), LSVM vs. Our method; y-axis from 0.62 to 0.74

Statistically Significant

Page 142: Loss-based Learning  with Weak Supervision

Questions?

http://www.centrale-ponts.fr/personnel/pawan