Loss-based Learning with Weak Supervision

M. Pawan Kumar

Jan 02, 2016
Transcript
Page 1: Loss-based Learning  with Weak Supervision

Loss-based Learning with Weak Supervision

M. Pawan Kumar

Page 2: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000.

Page 3: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000; Bounding Box: ~ 1 M.

Page 4: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000; Bounding Box: ~ 1 M; Image-Level ("Car", "Chair"): > 14 M.

Page 5: Loss-based Learning  with Weak Supervision

Chart: Computer Vision Data (Information vs. Log(Size)). Segmentation: ~ 2000; Bounding Box: ~ 1 M; Image-Level: > 14 M; Noisy Label: > 6 B.

Page 6: Loss-based Learning  with Weak Supervision

Learn with missing information (latent variables)

Detailed annotation is expensive

Sometimes annotation is impossible

Desired annotation keeps changing

Computer Vision Data

Page 7: Loss-based Learning  with Weak Supervision

• Two Types of Problems

• Part I – Annotation Mismatch

• Part II – Output Mismatch

Outline

Page 8: Loss-based Learning  with Weak Supervision

Annotation Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Classification

Mismatch between desired and available annotations

Exact value of latent variable is not “important”

Desired output during test time is y

Page 9: Loss-based Learning  with Weak Supervision

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Classification

Page 10: Loss-based Learning  with Weak Supervision

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Detection

Mismatch between output and available annotations

Exact value of latent variable is important

Desired output during test time is (y,h)

Page 11: Loss-based Learning  with Weak Supervision

Part I

Page 12: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Page 13: Loss-based Learning  with Weak Supervision

Weakly Supervised Data

Input x

Output y ∈ {-1,+1}

Hidden h

x

y = +1

h

Page 14: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

x

y = +1

h

Page 15: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,+1,h) = [ Φ(x,h) ; 0 ]

x

y = +1

h

Page 16: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,-1,h) = [ 0 ; Φ(x,h) ]

x

y = +1

h

Page 17: Loss-based Learning  with Weak Supervision

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

Score f : Ψ(x,y,h) → (-∞, +∞)

Optimize score over all possible y and h

x

y = +1

h

Page 18: Loss-based Learning  with Weak Supervision

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Latent SVM

Parameters

Page 19: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

Empirical risk minimization

minw Σi Δ(yi, yi(w))

No restriction on the loss function

Annotation mismatch

Page 20: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Empirical risk minimization

minw Σi Δ(yi, yi(w))

Non-convex

Parameters cannot be regularized

Find a regularization-sensitive upper bound

Page 21: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Δ(yi, yi(w)) = wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w)) - wTΨ(xi,yi(w),hi(w))

Page 22: Loss-based Learning  with Weak Supervision

Learning Latent SVM

Δ(yi, yi(w)) ≤ wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w)) - maxhi wTΨ(xi,yi,hi)

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Page 23: Loss-based Learning  with Weak Supervision

Learning Latent SVM

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Parameters can be regularized

Is this also convex?

Page 24: Loss-based Learning  with Weak Supervision

Learning Latent SVM

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Convex - Convex

Difference of convex (DC) program

Page 25: Loss-based Learning  with Weak Supervision

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Learning

Recap
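
The recap above fixes the three ingredients of a latent SVM: a joint feature map Ψ, a linear score wTΨ(x,y,h), and prediction by maximising the score jointly over y and h. Below is a minimal sketch of that prediction rule for small, enumerable output and latent spaces; the names psi, labels and latent_space are illustrative placeholders, not part of the tutorial.

```python
import numpy as np

def predict_latent_svm(w, x, labels, latent_space, psi):
    """Latent SVM prediction: maximise the score w . psi(x, y, h)
    jointly over the output y and the latent variable h."""
    best = (None, None, -np.inf)
    for y in labels:
        for h in latent_space:
            score = float(np.dot(w, psi(x, y, h)))
            if score > best[2]:
                best = (y, h, score)
    return best  # (y(w), h(w), best score)
```

In real applications the argmax is a combinatorial problem (e.g., over poselet configurations or bounding boxes) and is computed with problem-specific inference rather than exhaustive enumeration.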

Page 26: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Page 27: Loss-based Learning  with Weak Supervision

Learning Latent SVM

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Difference of convex (DC) program

Page 28: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Linear upper-bound of concave part

Page 29: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Optimize the convex upper bound

Page 30: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Linear upper-bound of concave part

Page 31: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Until Convergence

Page 32: Loss-based Learning  with Weak Supervision

Concave-Convex Procedure

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y) ]  +  ( - maxhi wTΨ(xi,yi,hi) )

Linear upper bound?

Page 33: Loss-based Learning  with Weak Supervision

Linear Upper Bound

Current estimate = wt

hi* = argmaxhi wtTΨ(xi,yi,hi)

-wTΨ(xi,yi,hi*)  ≥  - maxhi wTΨ(xi,yi,hi)

Page 34: Loss-based Learning  with Weak Supervision

CCCP for Latent SVM

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence
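
As a rough illustration of the CCCP loop on this slide, the sketch below alternates between imputing hi* with the current parameters and solving the resulting convex subproblem. solve_structural_svm stands in for any ε-optimal solver of the structural-SVM step (e.g., a cutting-plane method); it, psi, delta, labels and latent_space are hypothetical placeholders, not the tutorial's code.

```python
import numpy as np

def cccp_latent_svm(data, labels, latent_space, psi, delta,
                    solve_structural_svm, dim, C=1.0, max_iter=20, tol=1e-4):
    """CCCP for latent SVM: impute latent variables with the current w,
    then solve the convex upper bound; `data` is a list of (x_i, y_i)."""
    w = np.zeros(dim)                      # initial estimate w0
    prev_obj = np.inf
    for _ in range(max_iter):
        # 1. Impute latent variables: h_i* = argmax_h w . psi(x_i, y_i, h)
        imputed = [max(latent_space, key=lambda h: np.dot(w, psi(x_i, y_i, h)))
                   for x_i, y_i in data]
        # 2. Solve the convex upper bound (a structural SVM with fixed h_i*)
        w, obj = solve_structural_svm(data, imputed, labels, latent_space,
                                      psi, delta, C)
        # 3. Stop when the upper-bound objective no longer decreases
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return w
```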

Page 35: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Page 36: Loss-based Learning  with Weak Supervision

Action Classification

Input x

Output y = “Using Computer”

PASCAL VOC 2011

80/20 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train Input xi Output yi

Page 37: Loss-based Learning  with Weak Supervision

• 0-1 loss function

• Poselet-based feature vector

• 4 seeds for random initialization

• Code + Data

• Train/Test scripts with hyperparameter settings

Setup

http://www.centrale-ponts.fr/tutorials/cvpr2013/

Page 38: Loss-based Learning  with Weak Supervision

Objective

Page 39: Loss-based Learning  with Weak Supervision

Train Error

Page 40: Loss-based Learning  with Weak Supervision

Test Error

Page 41: Loss-based Learning  with Weak Supervision

Time

Page 42: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Page 43: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence

Overfitting in initial iterations

Page 44: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε'-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Anneal the tolerance: ε' = ε'/K

Repeat until convergence and ε' = ε
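
A small sketch of the annealing schedule described above, assuming hypothetical helpers impute_latent and solve_convex_step for the two CCCP updates: each outer round is solved only to a loose tolerance ε', which is divided by K until it reaches the final tolerance ε.

```python
def cccp_annealed_tolerance(impute_latent, solve_convex_step, w0,
                            eps_final=1e-3, eps_start=1.0, K=10.0, max_rounds=100):
    """Tolerance annealing around CCCP (a sketch): early rounds are solved
    loosely, later rounds to the target tolerance eps_final."""
    w, eps = w0, eps_start
    for _ in range(max_rounds):
        h_star = impute_latent(w)                 # h_i* under the current w
        w, converged = solve_convex_step(w, h_star, eps)
        if converged and eps <= eps_final:        # repeat until convergence and eps' = eps
            break
        eps = max(eps / K, eps_final)             # anneal the tolerance
    return w
```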

Page 45: Loss-based Learning  with Weak Supervision

Objective

Page 46: Loss-based Learning  with Weak Supervision

Objective

Page 47: Loss-based Learning  with Weak Supervision

Train Error

Page 48: Loss-based Learning  with Weak Supervision

Train Error

Page 49: Loss-based Learning  with Weak Supervision

Test Error

Page 50: Loss-based Learning  with Weak Supervision

Test Error

Page 51: Loss-based Learning  with Weak Supervision

Time

Page 52: Loss-based Learning  with Weak Supervision

Time

Page 53: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Page 54: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence

Overfitting in initial iterations

Page 55: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C'∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Anneal the regularization: C' = C' x K

Repeat until convergence and C' = C

Page 56: Loss-based Learning  with Weak Supervision

Objective

Page 57: Loss-based Learning  with Weak Supervision

Objective

Page 58: Loss-based Learning  with Weak Supervision

Train Error

Page 59: Loss-based Learning  with Weak Supervision

Train Error

Page 60: Loss-based Learning  with Weak Supervision

Test Error

Page 61: Loss-based Learning  with Weak Supervision

Test Error

Page 62: Loss-based Learning  with Weak Supervision

Time

Page 63: Loss-based Learning  with Weak Supervision

Time

Page 64: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Kumar, Packer and Koller, NIPS 2010

Page 65: Loss-based Learning  with Weak Supervision

1 + 1 = 2

1/3 + 1/6 = 1/2

e^{iπ} + 1 = 0

Math is for losers !!

FAILURE … BAD LOCAL MINIMUM

CCCP for Human Learning

Page 66: Loss-based Learning  with Weak Supervision

Euler was a Genius!!

SUCCESS … GOOD LOCAL MINIMUM

1 + 1 = 2

1/3 + 1/6 = 1/2

e^{iπ} + 1 = 0

Self-Paced Learning

Page 67: Loss-based Learning  with Weak Supervision

Start with “easy” examples, then consider “hard” ones

Easy vs. Hard

Expensive

Easy for human Easy for machine

Self-Paced Learning

Simultaneously estimate easiness and parameters. Easiness is a property of data sets, not single instances.

Page 68: Loss-based Learning  with Weak Supervision

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

CCCP for Latent SVM

Page 69: Loss-based Learning  with Weak Supervision

min ||w||2 + C∑i ξi

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

Self-Paced Learning

Page 70: Loss-based Learning  with Weak Supervision

min ||w||2 + C∑i vi ξi

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

vi ∈ {0,1}

Trivial Solution

Self-Paced Learning

Page 71: Loss-based Learning  with Weak Supervision

vi ∈ {0,1}

min ||w||2 + C∑i vi ξi - ∑i vi/K

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

Large K, Medium K, Small K

Self-Paced Learning

Page 72: Loss-based Learning  with Weak Supervision

vi ∈ [0,1]

min ||w||2 + C∑i vi ξi - ∑i vi/K

wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y,h) - ξi

Large K, Medium K, Small K

Biconvex Problem

Alternating Convex Search

Self-Paced Learning

Page 73: Loss-based Learning  with Weak Supervision

SPL for Latent SVM

Start with an initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i vi ξi - ∑i vi/K
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi
Decrease K by a constant factor
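
For fixed w the v-update of the biconvex objective above has a closed form: vi = 1 exactly when C·ξi < 1/K, so only examples with small slack (the currently "easy" ones) are kept. The sketch below alternates this update with a weighted latent-SVM solve and decreases K so that harder examples enter gradually; compute_slacks and solve_weighted_latent_svm are hypothetical placeholders for the inner CCCP machinery, and the annealing factor is illustrative.

```python
import numpy as np

def select_easy_examples(slacks, C, K):
    """Closed-form v update for fixed w: v_i = 1 if C * xi_i < 1/K, else 0."""
    return (C * np.asarray(slacks) < 1.0 / K).astype(float)

def self_paced_learning(initial_w, compute_slacks, solve_weighted_latent_svm,
                        C=1.0, K=100.0, anneal=1.3, K_min=1e-3, n_rounds=20):
    """Self-paced learning sketch: alternate the v update with a weighted
    latent-SVM solve, decreasing K to admit harder examples over time."""
    w = initial_w
    for _ in range(n_rounds):
        v = select_easy_examples(compute_slacks(w), C, K)   # fix w, solve v
        w = solve_weighted_latent_svm(w, v)                 # fix v, solve w
        K = max(K / anneal, K_min)                          # decrease K
    return w
```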

Page 74: Loss-based Learning  with Weak Supervision

Objective

Page 75: Loss-based Learning  with Weak Supervision

Objective

Page 76: Loss-based Learning  with Weak Supervision

Train Error

Page 77: Loss-based Learning  with Weak Supervision

Train Error

Page 78: Loss-based Learning  with Weak Supervision

Test Error

Page 79: Loss-based Learning  with Weak Supervision

Test Error

Page 80: Loss-based Learning  with Weak Supervision

Time

Page 81: Loss-based Learning  with Weak Supervision

Time

Page 82: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice
– Annealing the Tolerance
– Annealing the Regularization
– Self-Paced Learning
– Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Behl, Jawahar and Kumar, In Preparation

Page 83: Loss-based Learning  with Weak Supervision

Ranking

Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6

Average Precision = 1

Page 84: Loss-based Learning  with Weak Supervision

Ranking

Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6

Average Precision = 1, Accuracy = 1
Average Precision = 0.92, Accuracy = 0.67
Average Precision = 0.81

Page 85: Loss-based Learning  with Weak Supervision

Ranking

During testing, AP is frequently used

During training, a surrogate loss is used

Contradictory to loss-based learning

Optimize AP directly
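
For reference, average precision is the mean of precision@k over the ranks k at which relevant items appear. The snippet below computes it for a binary relevance list; the second example is one ranking of three relevant items among six that reproduces the ≈0.92 value quoted on the earlier slide (an illustrative choice of ranking, not read off the slide images).

```python
def average_precision(ranked_labels):
    """AP of a ranked list: ranked_labels[i] is 1 if the item at rank i+1
    is relevant, 0 otherwise; AP averages precision@k over relevant ranks."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

# Three relevant items among six:
print(average_precision([1, 1, 1, 0, 0, 0]))   # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))   # 0.9166..., ~0.92
```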

Page 86: Loss-based Learning  with Weak Supervision

Results

Statistically significant improvement

Page 87: Loss-based Learning  with Weak Supervision

Speed – Proximal Regularization

Start with a good initial estimate w0

Update
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi + Ct ||w - wt||2
wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi,y) - ξi

Repeat until convergence

Page 88: Loss-based Learning  with Weak Supervision

Speed – Cascades

Weiss and Taskar, AISTATS 2010; Sapp, Toshev and Taskar, ECCV 2010

Page 89: Loss-based Learning  with Weak Supervision

Accuracy – (Self) Pacing

Pacing the sample complexity – NIPS 2010

Pacing the model complexity

Pacing the problem complexity

Page 90: Loss-based Learning  with Weak Supervision

Building Accurate Systems

Model

Inference

Learning

85%

5%

10%

Learning cannot provide huge gains without a good model

Inference cannot provide huge gains without a good model

Page 91: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions
– Latent Variable Dependent Loss
– Max-Margin Min-Entropy Models

Outline – Annotation Mismatch

Yu and Joachims, ICML 2009

Page 92: Loss-based Learning  with Weak Supervision

Latent Variable Dependent Loss

Δ(yi, yi(w), hi(w)) = wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w), hi(w)) - wTΨ(xi,yi(w),hi(w))

Page 93: Loss-based Learning  with Weak Supervision

Latent Variable Dependent Loss

Δ(yi, yi(w), hi(w)) ≤ wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w), hi(w)) - maxhi wTΨ(xi,yi,hi)

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Page 94: Loss-based Learning  with Weak Supervision

Latent Variable Dependent Loss

minw ||w||2 + C Σi ξi

maxy,h [ wTΨ(xi,y,h) + Δ(yi,y,h) ] - maxhi wTΨ(xi,yi,hi) ≤ ξi

Page 95: Loss-based Learning  with Weak Supervision

Optimizing Precision@k

Input X = {xi, i = 1, …, n}

Annotation Y = {yi, i = 1, …, n} ∈ {-1,+1}n

Latent H = ranking

Δ(Y*, Y, H) = 1 - Precision@k
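
A small sketch of this latent-variable dependent loss, assuming the latent ranking H is given as a list of item indices ordered best to worst; the function name and argument layout are illustrative.

```python
def precision_at_k_loss(true_labels, ranking, k):
    """Delta(Y*, Y, H) = 1 - Precision@k: fraction of the top-k items in the
    latent ranking H whose ground-truth annotation in Y* is not +1."""
    top_k = ranking[:k]
    relevant = sum(1 for i in top_k if true_labels[i] == +1)
    return 1.0 - relevant / k

# Items 0 and 2 are positive; the ranking places item 0 first, item 1 second.
print(precision_at_k_loss([+1, -1, +1, -1], ranking=[0, 1, 2, 3], k=2))  # 0.5
```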

Page 96: Loss-based Learning  with Weak Supervision

• Latent SVM

• Optimization

• Practice

• Extensions
– Latent Variable Dependent Loss
– Max-Margin Min-Entropy (M3E) Models

Outline – Annotation Mismatch

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

Page 97: Loss-based Learning  with Weak Supervision

Running vs. Jumping Classification

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h):
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Page 98: Loss-based Learning  with Weak Supervision

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h):
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y2,h):
0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

Only maximum score used

No other useful cue?

Uncertainty in h

Running vs. Jumping Classification

Page 99: Loss-based Learning  with Weak Supervision

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy { Hα(Pw(h|y,x)) – log Pw(y|x) }

Partition Function

Marginalized Probability

Rényi Entropy

Rényi Entropy of Generalized Distribution Gα(y;x,w)

M3E

Page 100: Loss-based Learning  with Weak Supervision

Gα(y;x,w) = [1/(1-α)] log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1. Shannon Entropy of Generalized Distribution

[ - Σh Pw(y,h|x) log(Pw(y,h|x)) ] / [ Σh Pw(y,h|x) ]

Rényi Entropy

Page 101: Loss-based Learning  with Weak Supervision

Gα(y;x,w) = [1/(1-α)] log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution

- maxh log(Pw(y,h|x))

Rényi Entropy

Page 102: Loss-based Learning  with Weak Supervision

Gα(y;x,w) = [1/(1-α)] log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = Infinity. Minimum Entropy of Generalized Distribution

- maxh wTΨ(x,y,h)

Same prediction as latent SVM

Rényi Entropy
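
A quick numerical sketch of the Rényi entropy of the generalized distribution, consistent with the two limits on these slides: α → 1 recovers the Shannon-style expression and large α approaches -log maxh Pw(y,h|x). Zero entries are dropped in the example because they contribute nothing but would break the α = 1 formula; the values 0.24 and 0.01 are the nonzero scores shown for y2 on the earlier slide.

```python
import numpy as np

def renyi_entropy_generalized(p, alpha):
    """G_alpha = 1/(1-alpha) * log( sum_h p_h^alpha / sum_h p_h ) for the
    generalized (unnormalised over h) distribution p_h = Pw(y, h | x)."""
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        # alpha -> 1 limit: -sum p log p / sum p (generalized Shannon entropy)
        return float(-np.sum(p * np.log(p)) / np.sum(p))
    return float(np.log(np.sum(p ** alpha) / np.sum(p)) / (1.0 - alpha))

p = [0.24, 0.01]   # nonzero entries of Pw(y2, h | x) from the earlier slide
print(renyi_entropy_generalized(p, alpha=1.0))      # ~1.55
print(renyi_entropy_generalized(p, alpha=1000.0))   # ~1.43, close to -log(0.24)
print(-np.log(0.24))                                # 1.4271...
```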

Page 103: Loss-based Learning  with Weak Supervision

Training data {(xi,yi), i = 1,2,…,n}

Highly non-convex in w

Cannot regularize w to prevent overfitting

w* = argminw Σi Δ(yi,yi(w))

Learning M3E

Page 104: Loss-based Learning  with Weak Supervision

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = Gα(yi(w);xi,w) + Δ(yi,yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + Δ(yi,yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi,y) - Gα(y;xi,w) }

Learning M3E

Page 105: Loss-based Learning  with Weak Supervision

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) – Gα(y;xi,w) ≤ ξi

When α tends to infinity, M3E = Latent SVM

Other values can give better results

Learning M3E

Page 106: Loss-based Learning  with Weak Supervision

Motif + Markov Background Model. Yu and Joachims, 2009

Motif Finding Results

Page 107: Loss-based Learning  with Weak Supervision

Part II

Page 108: Loss-based Learning  with Weak Supervision

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Detection

Mismatch between output and available annotations

Exact value of latent variable is important

Desired output during test time is (y,h)

Page 109: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 110: Loss-based Learning  with Weak Supervision

Weakly Supervised Data

Input x

Output y ∈ {0,1,…,C}

Hidden h

x

y = 0

h

Page 111: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

x

y = 0

h

Page 112: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,0,h) = [ Φ(x,h) ; 0 ; … ; 0 ]

x

y = 0

h

Page 113: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,1,h) = [ 0 ; Φ(x,h) ; 0 ; … ; 0 ]

x

y = 0

h

Page 114: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,C,h) = [ 0 ; … ; 0 ; Φ(x,h) ]

x

y = 0

h

Page 115: Loss-based Learning  with Weak Supervision

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

Score f : Ψ(x,y,h) → (-∞, +∞)

Optimize score over all possible y and h

x

y = 0

h

Page 116: Loss-based Learning  with Weak Supervision

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Linear Model

Parameters

Page 117: Loss-based Learning  with Weak Supervision

Minimizing General Loss

minw Σi Δ(yi,hi,yi(w),hi(w))   (Supervised Samples)

+ Σi Δ'(yi,yi(w),hi(w))   (Weakly Supervised Samples)

Unknown latent variable values

Page 118: Loss-based Learning  with Weak Supervision

Minimizing General Loss

minw Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Page 119: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Kumar, Packer and Koller, ICML 2012

Page 120: Loss-based Learning  with Weak Supervision

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Page 121: Loss-based Learning  with Weak Supervision

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Page 122: Loss-based Learning  with Weak Supervision

Solution

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Diagram: Pθ(hi|yi,xi), a distribution over hi

Page 123: Loss-based Learning  with Weak Supervision

Solution: Use two different distributions for the two different tasks

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Page 124: Loss-based Learning  with Weak Supervision

The Ideal Case: No latent variable uncertainty, correct prediction

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi,hi(w)); Pθ(hi|yi,xi) over hi, concentrated on hi(w)

Page 125: Loss-based Learning  with Weak Supervision

In Practice: Restrictions in the representation power of models

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Page 126: Loss-based Learning  with Weak Supervision

Our Framework: Minimize the dissimilarity between the two distributions

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

User-defined dissimilarity measure

Page 127: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 128: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Hi(w,θ) - β Σh,h' Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi)

Page 129: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

Page 130: Loss-based Learning  with Weak Supervision

Our Framework: Minimize Rao's Dissimilarity Coefficient

Diagram: Pw(yi,hi|xi) over (yi,hi), with prediction (yi(w),hi(w)); Pθ(hi|yi,xi) over hi

minw,θ Σi [ Hi(w,θ) - β Hi(θ,θ) ]
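
A minimal sketch of the two terms of this objective for a discrete latent space, assuming a user-supplied loss Δ; the toy 0/1 loss, the distribution values and the function names are illustrative, not taken from the tutorial.

```python
import numpy as np

def H_w_theta(delta, y_i, y_pred, h_pred, p_theta):
    """H_i(w,theta) = sum_h Delta(y_i, h, y_i(w), h_i(w)) * P_theta(h | y_i, x_i):
    expected loss of the prediction under the imputation distribution."""
    return sum(p_theta[h] * delta(y_i, h, y_pred, h_pred)
               for h in range(len(p_theta)))

def H_theta_theta(delta, y_i, p_theta):
    """H_i(theta,theta) = sum_{h,h'} Delta(y_i, h, y_i, h') P_theta(h) P_theta(h'):
    the self-dissimilarity (diversity) term of Rao's coefficient."""
    n = len(p_theta)
    return sum(p_theta[h] * p_theta[hp] * delta(y_i, h, y_i, hp)
               for h in range(n) for hp in range(n))

# Toy example: 0/1 loss on (y, h), hypothetical imputation distribution.
delta = lambda y, h, y2, h2: float(y != y2 or h != h2)
p_theta = np.array([0.7, 0.2, 0.1])
obj_i = H_w_theta(delta, 0, 0, 0, p_theta) - 0.5 * H_theta_theta(delta, 0, p_theta)
print(obj_i)   # one term of sum_i [H_i(w,theta) - beta * H_i(theta,theta)], beta = 0.5
```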

Page 131: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 132: Loss-based Learning  with Weak Supervision

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence

End

Fix w and optimize θ

Fix θ and optimize w

Page 133: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

Diagram: Pθ(hi|yi,xi) over hi, with hi(w) marked

Page 134: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

Diagram: Pθ(hi|yi,xi) over hi, with hi(w) marked

Page 135: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

Diagram: Pθ(hi|yi,xi) over hi

Page 136: Loss-based Learning  with Weak Supervision

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

Diagram: Pθ(hi|yi,xi) over hi

Stochastic subgradient descent

Page 137: Loss-based Learning  with Weak Supervision

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

Δ independent of h, implies latent SVM

Concave-Convex Procedure (CCCP)

Page 138: Loss-based Learning  with Weak Supervision

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 139: Loss-based Learning  with Weak Supervision

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train Input xi Output yi

Page 140: Loss-based Learning  with Weak Supervision

Results – 0/1 Loss

Bar chart: Average Test Loss per fold (Fold 1 to Fold 5), LSVM vs. Our method; y-axis from 0 to 1.2

Statistically Significant

Page 141: Loss-based Learning  with Weak Supervision

Results – Overlap Loss

Bar chart: Average Test Loss per fold (Fold 1 to Fold 5), LSVM vs. Our method; y-axis from 0.62 to 0.74

Statistically Significant

Page 142: Loss-based Learning  with Weak Supervision

Questions?

http://www.centrale-ponts.fr/personnel/pawan