Modeling Latent Variable Uncertainty for Loss-based Learning. Daphne Koller (Stanford University), Ben Packer (Stanford University), M. Pawan Kumar (École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France)


Dec 27, 2015

Transcript
Page 1

Modeling Latent Variable Uncertainty for Loss-based Learning

Daphne Koller, Stanford University

Ben Packer, Stanford University

M. Pawan Kumar, École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Page 2

Aim: Accurate learning with weakly supervised data

Training data: Input xi, Output yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Object Detection

Input x

Output y = “Deer”

Latent Variable h

Page 3

(y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h))

Aim: Accurate learning with weakly supervised data

Feature Ψ(x,y,h) (e.g. HOG)

Input x

Output y = “Deer”

Prediction

Function f : Ψ(x,y,h) → (-∞, +∞)

Latent Variable h
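As a concrete illustration of the prediction rule above, here is a minimal Python sketch. The label set, the candidate boxes (the latent variable h), and the feature map psi are invented toy stand-ins, and f is taken to be a linear score w·Ψ, as in the later slides.

```python
import numpy as np

labels = ["bison", "deer", "elephant"]        # candidate outputs y
boxes = [(0, 0, 50, 50), (10, 10, 80, 80)]    # candidate latent boxes h

rng = np.random.default_rng(0)
w = rng.normal(size=8)  # parameters of a linear scoring function f

def psi(x, y, h):
    """Toy joint feature map Psi(x, y, h) (a stand-in for HOG features)."""
    v = np.zeros(8)
    v[labels.index(y)] = 1.0          # which output is being scored
    v[3:7] = np.array(h) / 100.0      # box coordinates, crudely scaled
    v[7] = x                          # a scalar "image" feature for illustration
    return v

def predict(x):
    # (y(f), h(f)) = argmax_{y,h} f(Psi(x, y, h)), here f = w . Psi
    scores = {(y, h): w @ psi(x, y, h) for y in labels for h in boxes}
    return max(scores, key=scores.get)

y_hat, h_hat = predict(0.5)
```

The key point is simply that prediction maximizes jointly over the output and the latent variable, so both come out of the same argmax.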

Page 4

f* = argmin_f Objective(f)

Aim: Accurate learning with weakly supervised data

Feature Ψ(x,y,h) (e.g. HOG)

Input x

Output y = “Deer”

Function f : Ψ(x,y,h) → (-∞, +∞)

Learning

Latent Variable h

Page 5

Aim: Find a suitable objective function to learn f*

Feature Ψ(x,y,h) (e.g. HOG)

Input x

Output y = “Deer”

Function f : Ψ(x,y,h) → (-∞, +∞)

Learning

Encourages accurate prediction

User-specified criterion for accuracy

f* = argmin_f Objective(f)

Latent Variable h

Page 6

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Page 7

Latent SVM

Linear function parameterized by w

Prediction: (y(w), h(w)) = argmax_{y,h} wᵀΨ(x,y,h)

Learning: min_w Σi Δ(yi, yi(w), hi(w))

✔ Loss based learning

✖ Loss independent of true (unknown) latent variable

✖ Doesn’t model uncertainty in latent variables

User-defined loss
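The Latent SVM objective above can be sketched in a few lines. Everything here (the 0/1 loss, the toy data, the fixed predictor) is illustrative, not the paper's implementation; the point is that the loss is evaluated only at the predicted latent variable hi(w), and the true latent variable never appears.

```python
def zero_one_loss(y_true, y_pred, h_pred):
    # A user-defined loss; this one ignores the latent variable entirely.
    return 0.0 if y_true == y_pred else 1.0

def lsvm_training_loss(data, predict):
    """data: list of (x_i, y_i); predict: x -> (y(w), h(w)) for a fixed w."""
    total = 0.0
    for x_i, y_i in data:
        y_pred, h_pred = predict(x_i)       # (y_i(w), h_i(w))
        total += zero_one_loss(y_i, y_pred, h_pred)
    return total

# Hypothetical predictor that always outputs ("deer", box 0):
predict = lambda x: ("deer", 0)
data = [(0, "deer"), (1, "bison")]
loss = lsvm_training_loss(data, predict)  # one mistake -> loss 1.0
```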

Page 8

Expectation Maximization

Joint probability: Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z

Prediction: (y(θ), h(θ)) = argmax_{y,h} Pθ(y,h|x)

Page 9

Expectation Maximization

Joint probability: Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z

Prediction: (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)

Learning: max_θ Σi log Pθ(yi|xi)

Page 10

Expectation Maximization

Joint probability: Pθ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z

Prediction: (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)

Learning: max_θ Σi log (Σh Pθ(yi,h|xi))

✔ Models uncertainty in latent variables

✖ Doesn’t model accuracy of latent variable prediction

✖ No user-defined loss function
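A minimal numeric sketch of this log-linear model and the marginal likelihood that EM maximizes. The score table stands in for θᵀΨ(x,y,h) on a small discrete problem; all sizes and values are made up.

```python
import numpy as np

n_y, n_h = 3, 4  # number of outputs y and latent values h (toy sizes)

def joint(theta_scores):
    """theta_scores[y, h] = theta^T Psi(x, y, h); returns P_theta(y, h | x)."""
    z = np.exp(theta_scores)
    return z / z.sum()  # normalize by the partition function Z

def marginal_log_lik(theta_scores, y):
    # log P_theta(y | x) = log sum_h P_theta(y, h | x)
    p = joint(theta_scores)
    return np.log(p[y, :].sum())

rng = np.random.default_rng(1)
scores = rng.normal(size=(n_y, n_h))
p = joint(scores)                       # a proper distribution over (y, h)
ll = marginal_log_lik(scores, y=0)      # the quantity EM increases
```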

Page 11

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Page 12

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Page 13

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Page 14

Solution

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

[Figure: distribution Pθ(hi|yi,xi) over the latent variables hi]

Page 15

Solution: Use two different distributions for the two different tasks

[Figure: Pθ(hi|yi,xi) models uncertainty over hi; Pw(yi,hi|xi) is a delta distribution at the prediction (yi(w), hi(w))]

Page 16

The Ideal Case: No latent variable uncertainty, correct prediction

[Figure: Pθ(hi|yi,xi) over hi, and the delta distribution Pw(yi,hi|xi) at the prediction (yi(w), hi(w))]

Page 17

The Ideal Case: No latent variable uncertainty, correct prediction

[Figure: Pθ(hi|yi,xi) puts all of its mass on a single latent value hi(w)]

Page 18

The Ideal Case: No latent variable uncertainty, correct prediction

[Figure: the prediction equals (yi, hi(w)) and matches the peak of Pθ(hi|yi,xi)]

Page 19

In Practice: Restrictions in the representation power of models

[Figure: Pθ(hi|yi,xi) is spread over several hi, while Pw(yi,hi|xi) remains a delta at (yi(w), hi(w))]

Page 20

Our Framework: Minimize the dissimilarity between the two distributions

[Figure: Pθ(hi|yi,xi) over hi, and the delta distribution Pw(yi,hi|xi) at (yi(w), hi(w))]

User-defined dissimilarity measure

Page 21

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) over hi, and the delta distribution Pw(yi,hi|xi) at (yi(w), hi(w))]

Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 22

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) over hi, and the delta distribution Pw(yi,hi|xi) at (yi(w), hi(w))]

Hi(w,θ) - β Σh,h' Δ(yi,h,yi,h') Pθ(h|yi,xi) Pθ(h'|yi,xi), where Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 23

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) over hi, and the delta distribution Pw(yi,hi|xi) at (yi(w), hi(w))]

Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

The last term vanishes, since Δ(y,h,y,h) = 0.

Page 24

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) over hi, and the delta distribution Pw(yi,hi|xi) at (yi(w), hi(w))]

min_{w,θ} Σi Hi(w,θ) - β Hi(θ,θ)
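The two diversity terms in this objective are plain expectations, so they can be checked numerically. In the sketch below, the distribution Pθ, the loss Δ, and the value of β are all invented toy choices for illustration.

```python
import numpy as np

# H_i(w, theta)     = sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h | y_i, x_i)
# H_i(theta, theta) = sum_{h,h'} Delta(y_i, h, y_i, h') P_theta(h|..) P_theta(h'|..)

n_h = 4
p_theta = np.array([0.1, 0.6, 0.2, 0.1])  # toy P_theta(h | y_i, x_i)
h_pred, y_correct = 1, True               # prediction (y_i(w), h_i(w)), y_i(w) = y_i

def delta(y_match, h, h_prime):
    # Illustrative loss: 0/1 on the output, plus 0.5 for latent disagreement.
    return (0.0 if y_match else 1.0) + (0.0 if h == h_prime else 0.5)

H_w_theta = sum(delta(y_correct, h, h_pred) * p_theta[h] for h in range(n_h))
H_theta_theta = sum(
    delta(True, h, hp) * p_theta[h] * p_theta[hp]
    for h in range(n_h) for hp in range(n_h)
)

beta = 0.5
objective_i = H_w_theta - beta * H_theta_theta  # per-sample term of the objective
```

Note that the self-diversity Hi(θ,θ) is small when Pθ is concentrated, so subtracting it rewards confident latent-variable distributions.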

Page 25

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Page 26

Optimization

min_{w,θ} Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence:

  Fix w and optimize θ

  Fix θ and optimize w

End
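The alternating scheme above can be sketched generically. The quadratic objective below is only a stand-in with closed-form block updates, chosen so convergence is easy to see; it is not the paper's actual objective.

```python
def objective(w, theta):
    # Illustrative smooth objective over two parameter blocks.
    return (w - theta) ** 2 + 0.1 * theta ** 2

def optimize_theta(w):
    # argmin_theta (w - theta)^2 + 0.1 theta^2  ->  theta = w / 1.1
    return w / 1.1

def optimize_w(theta):
    # argmin_w (w - theta)^2  ->  w = theta
    return theta

w, theta = 1.0, 0.0              # initialize to w0 and theta0
prev = objective(w, theta)
for _ in range(100):             # "repeat until convergence"
    theta = optimize_theta(w)    # fix w, optimize theta
    w = optimize_w(theta)        # fix theta, optimize w
    cur = objective(w, theta)
    if prev - cur < 1e-12:
        break
    prev = cur
```

Each block update can only decrease the objective, which is why the alternating loop converges (to a local optimum in general).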

Page 27

Optimization of θ

min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

[Figure: Pθ(hi|yi,xi) over hi, with the predicted latent value hi(w) marked]



Page 30

Optimization of θ

min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

[Figure: Pθ(hi|yi,xi) over hi]

Stochastic subgradient descent

Page 31

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

Observation: When Δ is independent of the true h, our framework is equivalent to Latent SVM

Concave-Convex Procedure (CCCP)
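The observation above is easy to verify numerically: when Δ ignores the true latent variable h, the expectation under Pθ collapses to the Latent SVM loss, because the probabilities sum to one. Toy values throughout.

```python
import numpy as np

p_theta = np.array([0.25, 0.5, 0.25])  # toy P_theta(h | y_i, x_i)

def delta(h_true, y_pred_correct, h_pred):
    # Deliberately ignores h_true: depends only on the prediction.
    return 0.0 if y_pred_correct else 1.0

# Expected loss sum_h Delta(y_i, h, y_i(w), h_i(w)) P_theta(h | y_i, x_i)
# for an incorrect output prediction:
expected_loss = sum(delta(h, False, 0) * p_theta[h] for h in range(3))

# The Latent SVM loss Delta(y_i, y_i(w), h_i(w)) for the same prediction:
lsvm_loss = delta(None, False, 0)
```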

Page 32

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Page 33

Object Detection

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Input x

Output y = “Deer”

Latent Variable h

Mammals Dataset

60/40 Train/Test Split

5 Folds

Training data: Input xi, Output yi

Page 34

Results – 0/1 Loss

[Bar chart: average test loss on Folds 1-5, LSVM vs. our method]

Statistically Significant

Page 35

Results – Overlap Loss

[Bar chart: average test loss on Folds 1-5, LSVM vs. our method]

Page 36

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Training data: Input xi, Output yi

Page 37

Results – 0/1 Loss

[Bar chart: average test loss on Folds 1-5, LSVM vs. our method]

Statistically Significant

Page 38

Results – Overlap Loss

[Bar chart: average test loss on Folds 1-5, LSVM vs. our method]

Statistically Significant

Page 39

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Page 40

[Remaining slides were deleted from this transcript]