Semi-supervised Learning

Page 1: Semi-supervised Learning

speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/semi (v3).pdf

Page 2: Introduction

β€’ Supervised learning: π‘₯π‘Ÿ , ΰ·œπ‘¦π‘Ÿ π‘Ÿ=1𝑅

β€’ E.g.π‘₯π‘Ÿ: image, ΰ·œπ‘¦π‘Ÿ: class labels

β€’ Semi-supervised learning: π‘₯π‘Ÿ , ΰ·œπ‘¦π‘Ÿ π‘Ÿ=1𝑅 , π‘₯𝑒 𝑒=𝑅

𝑅+π‘ˆ

β€’ A set of unlabeled data, usually U >> R

β€’ Transductive learning: unlabeled data is the testing data

β€’ Inductive learning: unlabeled data is not the testing data

β€’ Why semi-supervised learning?

β€’ Collecting data is easy, but collecting β€œlabelled” data is expensive

β€’ We do semi-supervised learning in our lives

Page 3: Why does semi-supervised learning help?

(Figure: a few labelled cat and dog images, together with a large collection of unlabeled cat and dog images.)

Page 4: Why does semi-supervised learning help?

The distribution of the unlabeled data tells us something.

Usually with some assumptions

Who knows?

Page 5: Outline

Semi-supervised Learning for Generative Model

Low-density Separation Assumption

Smoothness Assumption

Better Representation

Page 6: Semi-supervised Learning for Generative Model

Page 7: Supervised Generative Model

β€’ Given labelled training examples π‘₯π‘Ÿ ∈ 𝐢1, 𝐢2β€’ looking for most likely prior probability P(Ci) and class-

dependent probability P(x|Ci)

β€’ P(x|Ci) is a Gaussian parameterized by πœ‡π‘– and Ξ£

πœ‡1, Ξ£

πœ‡2, Ξ£

𝑃 𝐢1|π‘₯ =𝑃 π‘₯|𝐢1 𝑃 𝐢1

𝑃 π‘₯|𝐢1 𝑃 𝐢1 + 𝑃 π‘₯|𝐢2 𝑃 𝐢2

With 𝑃 𝐢1 , 𝑃 𝐢2 ,πœ‡1,πœ‡2, Ξ£

Decision Boundary
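To make the closed-form fit concrete, here is a minimal NumPy sketch (not from the slides): it estimates $P(C_i)$, $\mu_i$, and a shared $\Sigma$ from toy labelled data by maximum likelihood and then evaluates the posterior $P(C_1|x)$. The toy data and variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy labelled data: rows are feature vectors, labels are 0 (C1) or 1 (C2).
X = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2],   # class C1
              [4.0, 4.5], [4.2, 3.9], [3.8, 4.1]])  # class C2
y = np.array([0, 0, 0, 1, 1, 1])

# Closed-form maximum likelihood estimates.
priors = np.array([np.mean(y == c) for c in (0, 1)])        # P(C1), P(C2)
means  = np.array([X[y == c].mean(axis=0) for c in (0, 1)]) # mu1, mu2
# Shared covariance: prior-weighted average of the per-class covariances.
cov = sum(priors[c] * np.cov(X[y == c].T, bias=True) for c in (0, 1))

def posterior_c1(x):
    """P(C1 | x) from Bayes' rule with Gaussian class-conditionals."""
    p1 = multivariate_normal.pdf(x, means[0], cov) * priors[0]
    p2 = multivariate_normal.pdf(x, means[1], cov) * priors[1]
    return p1 / (p1 + p2)

print(posterior_c1([2.0, 2.5]))  # close to 1: the point lies near class C1
```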

Page 8: Semi-supervised Generative Model

• Given labelled training examples $x^r \in C_1, C_2$

β€’ looking for most likely prior probability P(Ci) and class-dependent probability P(x|Ci)

β€’ P(x|Ci) is a Gaussian parameterized by πœ‡π‘– and Ξ£

The unlabeled data $x^u$ help re-estimate $P(C_1)$, $P(C_2)$, $\mu_1$, $\mu_2$, $\Sigma$, which changes the decision boundary.

(Figure: the re-estimated Gaussians $(\mu_1, \Sigma)$ and $(\mu_2, \Sigma)$ and the shifted decision boundary.)

Page 9: Semi-supervised Generative Model

• Initialization: $\theta = \{P(C_1), P(C_2), \mu_1, \mu_2, \Sigma\}$

• Step 1 (the E-step): compute the posterior probability $P_\theta(C_1|x^u)$ of the unlabeled data, which depends on the current model $\theta$.

• Step 2 (the M-step): update the model:

$$P(C_1) = \frac{N_1 + \sum_{x^u} P(C_1|x^u)}{N} \qquad \mu_1 = \frac{1}{N_1}\sum_{x^r \in C_1} x^r + \frac{1}{\sum_{x^u} P(C_1|x^u)}\sum_{x^u} P(C_1|x^u)\,x^u \qquad \cdots$$

($N$: total number of examples; $N_1$: number of labelled examples belonging to $C_1$.)

• Back to step 1.

The algorithm eventually converges, but the initialization influences the result.
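Below is a minimal sketch of this iterative procedure for 1-D data, using standard EM-style updates with soft counts from the posteriors; the toy data, the fixed number of iterations, and the single scalar variance are illustrative simplifications rather than part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: a few labelled points per class plus many unlabeled points.
x_lab = np.array([0.0, 0.5, 1.0, 5.0, 5.5, 6.0])
y_lab = np.array([0, 0, 0, 1, 1, 1])            # 0 = C1, 1 = C2
x_unl = np.concatenate([rng.normal(0.5, 1.0, 50), rng.normal(5.5, 1.0, 50)])

N  = len(x_lab) + len(x_unl)                    # total number of examples
N1 = np.sum(y_lab == 0)                         # labelled examples in C1
N2 = np.sum(y_lab == 1)

# Initialization: theta = {P(C1), P(C2), mu1, mu2, var} from the labelled data only.
p1, p2 = N1 / len(x_lab), N2 / len(x_lab)
mu1, mu2 = x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()
var = x_lab.var()

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # Step 1 (E): posterior of the unlabeled data under the current model.
    g1, g2 = p1 * gauss(x_unl, mu1, var), p2 * gauss(x_unl, mu2, var)
    post1 = g1 / (g1 + g2)                      # P_theta(C1 | x^u)
    post2 = 1.0 - post1
    # Step 2 (M): update the model with labelled counts plus soft unlabeled counts.
    p1 = (N1 + post1.sum()) / N
    p2 = (N2 + post2.sum()) / N
    mu1 = (x_lab[y_lab == 0].sum() + (post1 * x_unl).sum()) / (N1 + post1.sum())
    mu2 = (x_lab[y_lab == 1].sum() + (post2 * x_unl).sum()) / (N2 + post2.sum())
    var = (np.sum((x_lab - np.where(y_lab == 0, mu1, mu2)) ** 2)
           + np.sum(post1 * (x_unl - mu1) ** 2 + post2 * (x_unl - mu2) ** 2)) / N

print(p1, p2, mu1, mu2, var)
```

With a reasonable initialization the estimates move toward the two clusters; a poor initialization can converge to a worse solution, matching the note above about initialization.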

Page 10: Why?

β€’ Maximum likelihood with labelled data

β€’ Maximum likelihood with labelled + unlabeled data

π‘™π‘œπ‘”πΏ πœƒ =

π‘₯π‘Ÿ

π‘™π‘œπ‘”π‘ƒπœƒ π‘₯π‘Ÿ , ΰ·œπ‘¦π‘Ÿ

π‘™π‘œπ‘”πΏ πœƒ =

π‘₯π‘Ÿ

π‘™π‘œπ‘”π‘ƒπœƒ π‘₯π‘Ÿ , ΰ·œπ‘¦π‘Ÿ +

π‘₯𝑒

π‘™π‘œπ‘”π‘ƒπœƒ π‘₯𝑒

π‘ƒπœƒ π‘₯𝑒 = π‘ƒπœƒ π‘₯𝑒|𝐢1 𝑃 𝐢1 + π‘ƒπœƒ π‘₯𝑒|𝐢2 𝑃 𝐢2

(π‘₯𝑒 can come from either C1 and C2)

Closed-form solution

Solved iteratively

πœƒ = 𝑃 𝐢1 ,𝑃 𝐢2 ,πœ‡1,πœ‡2, Ξ£

= π‘ƒπœƒ π‘₯π‘Ÿ| ΰ·œπ‘¦π‘Ÿ 𝑃 ΰ·œπ‘¦π‘Ÿπ‘ƒπœƒ π‘₯π‘Ÿ , ΰ·œπ‘¦π‘Ÿ

Page 11: Semi-supervised Learning: Low-density Separation

ιžι»‘ε³η™½β€œBlack-or-white”

Page 12: Self-training

β€’ Given: labelled data set = π‘₯π‘Ÿ , ΰ·œπ‘¦π‘Ÿ π‘Ÿ=1𝑅 , unlabeled data set

= π‘₯𝑒 𝑒=𝑙𝑅+π‘ˆ

β€’ Repeat:

β€’ Train model π‘“βˆ— from labelled data set

β€’ Apply π‘“βˆ— to the unlabeled data set

β€’ Obtain π‘₯𝑒, 𝑦𝑒 𝑒=𝑙𝑅+π‘ˆ

β€’ Remove a set of data from unlabeled data set, and add them into the labeled data set

• The procedure is independent of the model used.

• How to choose which data to move remains an open question.

• Regression? (Using the model's own real-valued predictions as new targets would not change the model.)

• You can also assign a weight to each pseudo-labelled example.
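A minimal sketch of the self-training loop follows, using a logistic-regression classifier and a confidence threshold as the data-selection rule; both choices are illustrative, since the slides leave the model and the selection rule open.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unl, threshold=0.9, max_rounds=10):
    """Repeatedly train on labelled data and pseudo-label confident unlabeled points."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_rounds):
        model = LogisticRegression().fit(X_lab, y_lab)          # train f* on labelled set
        if len(X_unl) == 0:
            break
        probs = model.predict_proba(X_unl)                      # apply f* to unlabeled set
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)  # hard pseudo-labels
        picked = conf >= threshold                              # selection rule (open choice)
        if not picked.any():
            break
        # Move the selected examples from the unlabeled set to the labelled set.
        X_lab = np.vstack([X_lab, X_unl[picked]])
        y_lab = np.concatenate([y_lab, pseudo[picked]])
        X_unl = X_unl[~picked]
    return model

# Illustrative usage with two random blobs.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(4, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
clf = self_training(X_lab, y_lab, X_unl)
```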

Page 13: Self-training

β€’ Similar to semi-supervised learning for generative model

β€’ Hard label v.s. Soft label

Consider using a neural network: train network parameters $\theta^*$ from the labelled data, then feed $x^u$ into the network. Suppose the output is 70% class 1 and 30% class 2.

• Hard label: the new target for $x^u$ is $[1, 0]$. (It looks like class 1, so treat it as class 1.)

• Soft label: the new target for $x^u$ is $[0.7, 0.3]$. This doesn't work: the target is exactly the network's current output, so retraining on it changes nothing.
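The difference between the two targets in code, assuming the 70%/30% output above:

```python
import numpy as np

output = np.array([0.7, 0.3])             # network's predicted distribution for x^u

soft_target = output                      # soft label: identical to the current output,
                                          # so training on it gives the network nothing new
hard_target = np.eye(2)[output.argmax()]  # hard label: [1, 0], "it looks like class 1,
                                          # then it is class 1"
print(soft_target, hard_target)
```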

Page 14: Entropy-based Regularization

πœƒβˆ—π‘₯𝑒 𝑦𝑒

Distribution

𝑦𝑒

1 2 3 4 5

𝑦𝑒

1 2 3 4 5

𝑦𝑒

1 2 3 4 5

Good!

Good!

Bad!

𝐸 𝑦𝑒 = βˆ’

π‘š=1

5

π‘¦π‘šπ‘’ 𝑙𝑛 π‘¦π‘š

𝑒

Entropy of 𝑦𝑒 : Evaluate how concentrate the distribution 𝑦𝑒 is

𝐸 𝑦𝑒 = 0

𝐸 𝑦𝑒 = 0

𝐸 𝑦𝑒

= 𝑙𝑛5

= βˆ’π‘™π‘›1

5

As small as possible

𝐿 =

π‘₯π‘Ÿ

𝐢 π‘¦π‘Ÿ , ΰ·œπ‘¦π‘Ÿ

+πœ†

π‘₯𝑒

𝐸 𝑦𝑒

labelled data

unlabeled data
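A minimal NumPy sketch of this loss, with cross-entropy on the labelled predictions and the entropy term on the unlabeled predictions; $\lambda$ and the toy distributions are illustrative.

```python
import numpy as np

def entropy(y):                       # E(y^u) = -sum_m y_m ln y_m
    return -np.sum(y * np.log(y + 1e-12), axis=-1)

def cross_entropy(y, y_hat):          # C(y^r, y_hat^r) with one-hot y_hat^r
    return -np.sum(y_hat * np.log(y + 1e-12), axis=-1)

# Predicted distributions over 5 classes (rows sum to 1).
y_lab = np.array([[0.8, 0.1, 0.05, 0.03, 0.02]])      # labelled example
y_hat = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])         # its one-hot label
y_unl = np.array([[0.2, 0.2, 0.2, 0.2, 0.2],          # uniform: E = ln 5 (bad)
                  [0.96, 0.01, 0.01, 0.01, 0.01]])    # concentrated: E close to 0 (good)

lam = 0.5                                             # weight of the unlabeled term
L = cross_entropy(y_lab, y_hat).sum() + lam * entropy(y_unl).sum()
print(L)
```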

Page 15: Outlook: Semi-supervised SVM

Find a boundary that can provide the largest margin and least error

Enumerate all possible labels for the unlabeled data

Thorsten Joachims, "Transductive Inference for Text Classification using Support Vector Machines", ICML, 1999.
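A brute-force sketch of the enumeration idea (feasible only for a handful of unlabeled points): try every labeling, fit a linear SVM on the labelled plus pseudo-labelled data, and keep the labeling whose SVM objective (margin term plus errors) is best. The scoring and the sklearn calls are an illustrative simplification of the transductive SVM in the reference above.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def tsvm_brute_force(X_lab, y_lab, X_unl, C=1.0):
    """Try every labeling of the unlabeled data; keep the one whose linear SVM
    has the smallest objective (largest margin, fewest errors)."""
    X_all = np.vstack([X_lab, X_unl])
    best = None
    for labels in itertools.product([0, 1], repeat=len(X_unl)):
        y_all = np.concatenate([y_lab, labels])
        if len(set(y_all)) < 2:                       # SVC needs both classes present
            continue
        svm = SVC(kernel="linear", C=C).fit(X_all, y_all)
        w = svm.coef_[0]
        margins = (2 * y_all - 1) * svm.decision_function(X_all)
        objective = 0.5 * w @ w + C * np.maximum(0, 1 - margins).sum()
        if best is None or objective < best[0]:
            best = (objective, labels, svm)
    return best

rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [3.0, 3.0]]); y_lab = np.array([0, 1])
X_unl = rng.normal([[0, 0], [3, 3], [1.5, 1.5]], 0.3)   # 3 points: only 2^3 labelings
obj, labels, svm = tsvm_brute_force(X_lab, y_lab, X_unl)
```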

Page 16: Semi-supervised Learning: Smoothness Assumption

近朱者赤,近墨者黑 ("You are known by the company you keep")

Page 17: Smoothness Assumption

β€’ Assumption: β€œsimilar” π‘₯ has the same ΰ·œπ‘¦

β€’ More precisely:

β€’ x is not uniform.

β€’ If π‘₯1 and π‘₯2 are close in a high density region, ΰ·œπ‘¦1 and ΰ·œπ‘¦2 are the same.

connected by a high density path

Source of image: http://hips.seas.harvard.edu/files/pinwheel.png

π‘₯1

π‘₯2

π‘₯3

π‘₯1 and π‘₯2 have the same label

π‘₯2 and π‘₯3 have different labels

Page 18: Smoothness Assumption

β€œindirectly” similarwith stepping stones

(The example is from the tutorial slides of Xiaojin Zhu.)

Similar? Not similar?

Source of image: http://www.moehui.com/5833.html/5/

Page 19: Smoothness Assumption

β€’ Classify astronomy vs. travel articles

(The example is from the tutorial slides of Xiaojin Zhu.)

Page 20: Smoothness Assumption

β€’ Classify astronomy vs. travel articles

(The example is from the tutorial slides of Xiaojin Zhu.)

Page 21: Cluster and then Label

(Figure: labelled points from Class 1 and Class 2, plus unlabeled points, form Cluster 1, Cluster 2, and Cluster 3; Cluster 1 is labelled Class 1, while Clusters 2 and 3 are labelled Class 2.)

Then use all the data to learn a classifier as usual.
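A minimal sketch of cluster-and-then-label, assuming K-means as the clustering step and a majority vote within each cluster; the slides do not prescribe a specific clustering algorithm or final classifier, so KMeans and LogisticRegression here are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_then_label(X_lab, y_lab, X_unl, n_clusters=3):
    X_all = np.vstack([X_lab, X_unl])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)

    # Label each cluster by majority vote over the labelled points it contains.
    lab_clusters = clusters[: len(X_lab)]
    y_unl = np.empty(len(X_unl), dtype=int)
    for c in range(n_clusters):
        members = lab_clusters == c
        majority = np.bincount(y_lab[members]).argmax() if members.any() else 0
        y_unl[clusters[len(X_lab):] == c] = majority

    # Then use all the data to learn a classifier as usual.
    X_train = np.vstack([X_lab, X_unl])
    y_train = np.concatenate([y_lab, y_unl])
    return LogisticRegression().fit(X_train, y_train)
```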

Page 22: Graph-based Approach

β€’ How to know π‘₯1 and π‘₯2 are close in a high density region (connected by a high density path)

Represented the data points as a graph

E.g. Hyperlink of webpages, citation of papers

Graph representation is nature sometimes.

Sometimes you have to construct the graph yourself.

Page 23: Graph-based Approach: Graph Construction

• Define the similarity $s(x^i, x^j)$ between $x^i$ and $x^j$

β€’ Add edge:

β€’ K Nearest Neighbor

β€’ e-Neighborhood

β€’ Edge weight is proportional to s π‘₯𝑖 , π‘₯𝑗

𝑠 π‘₯𝑖 , π‘₯𝑗 = 𝑒π‘₯𝑝 βˆ’π›Ύ π‘₯𝑖 βˆ’ π‘₯𝑗2

Gaussian Radial Basis Function:

The image is from the tutorial slides of Amarnag Subramanya and Partha Pratim Talukdar.
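A minimal sketch of the graph construction: Gaussian RBF similarities, KNN edges, and edge weights proportional to the similarity. The values of $k$ and $\gamma$ are illustrative.

```python
import numpy as np

def build_graph(X, k=3, gamma=1.0):
    """Return a symmetric weight matrix W: RBF similarity on KNN edges."""
    # Pairwise squared distances and Gaussian RBF similarities.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    s = np.exp(-gamma * d2)                     # s(x_i, x_j) = exp(-gamma ||x_i - x_j||^2)

    W = np.zeros_like(s)
    for i in range(len(X)):
        # Keep an edge to the k nearest neighbours of x_i (excluding itself).
        neighbours = np.argsort(d2[i])[1 : k + 1]
        W[i, neighbours] = s[i, neighbours]     # edge weight proportional to similarity
    return np.maximum(W, W.T)                   # symmetrize: keep an edge if either end picked it
```

The symmetrization step is one common convention; keeping an edge only when both ends pick it is an equally valid choice.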

Page 24: Graph-based Approach

Propagate through the graph: the labelled data influence their neighbors.

(Figure: a labelled Class 1 example spreads its label along the graph, so the unlabeled examples connected to it also become Class 1.)
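A minimal sketch of the propagation idea: each unlabeled node repeatedly takes the weighted average of its neighbours' scores while the labelled nodes stay clamped. This particular update rule is a standard label-propagation style iteration, assumed here rather than spelled out on the slide.

```python
import numpy as np

def propagate_labels(W, y_init, labelled_mask, n_iters=100):
    """Spread labels over graph W; labelled nodes stay clamped to their labels."""
    y = y_init.astype(float).copy()               # e.g. 1.0 for Class 1, 0.0 for Class 2,
    deg = W.sum(axis=1) + 1e-12                   # and 0.5 for unlabeled nodes initially
    for _ in range(n_iters):
        y_new = (W @ y) / deg                     # each node takes the weighted average
        y_new[labelled_mask] = y_init[labelled_mask]  # of its neighbours; clamp labelled nodes
        y = y_new
    return (y > 0.5).astype(int)                  # threshold to get hard labels
```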

Page 25: Graph-based Approach

β€’ Define the smoothness of the labels on the graph

$$S = \frac{1}{2}\sum_{i,j} w_{i,j}\left(y^i - y^j\right)^2$$

A smaller $S$ means a smoother labeling.

Example graph on $x^1, x^2, x^3, x^4$ with edge weights $w_{1,2} = 2$, $w_{1,3} = 3$, $w_{2,3} = 1$, $w_{3,4} = 1$:

• Labeling $y^1 = 1$, $y^2 = 1$, $y^3 = 1$, $y^4 = 0$: $S = 0.5$

• Labeling $y^1 = 0$, $y^2 = 1$, $y^3 = 1$, $y^4 = 0$: $S = 3$

The sum runs over all pairs of data points (labelled or not).
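The two values above can be checked directly. The sketch below uses the edge weights shown in the figure (they reappear as the matrix $W$ on the next page) and sums each edge once:

```python
import numpy as np

# Example graph from the slide: w(x1,x2)=2, w(x1,x3)=3, w(x2,x3)=1, w(x3,x4)=1.
W = np.array([[0, 2, 3, 0],
              [2, 0, 1, 0],
              [3, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def smoothness(W, y):
    """S = (1/2) * sum over pairs i < j of w_ij (y_i - y_j)^2, each edge counted once."""
    s = 0.0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            s += 0.5 * W[i, j] * (y[i] - y[j]) ** 2
    return s

print(smoothness(W, np.array([1, 1, 1, 0])))  # 0.5 -> the smoother labeling
print(smoothness(W, np.array([0, 1, 1, 0])))  # 3.0 -> the less smooth labeling
```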

Page 26: Graph-based Approach

β€’ Define the smoothness of the labels on the graph

= π’šπ‘‡πΏπ’š

L: (R+U) x (R+U) matrix

Graph Laplacian

𝐿 = 𝐷 βˆ’π‘Š

y: (R+U)-dim vector

π’š = ⋯𝑦𝑖⋯𝑦𝑗⋯𝑻

π‘Š =

0 22 0

3 01 0

3 10 0

0 11 0

D =

5 00 3

0 00 0

0 00 0

5 00 1

𝑆 =1

2

𝑖,𝑗

𝑀𝑖,𝑗 𝑦𝑖 βˆ’ 𝑦𝑗2

Page 27: Graph-based Approach

β€’ Define the smoothness of the labels on the graph

= π’šπ‘‡πΏπ’šπ‘† =1

2

𝑖,𝑗

𝑀𝑖,𝑗 𝑦𝑖 βˆ’ 𝑦𝑗2 Depending on

network parameters

𝐿 =

π‘₯π‘Ÿ

𝐢 π‘¦π‘Ÿ , ΰ·œπ‘¦π‘Ÿ +πœ†π‘†

J. Weston, F. Ratle, and R. Collobert, β€œDeep learning via semi-supervised embedding,” ICML, 2008

As a regularization term

smoothsmooth

smooth
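A minimal NumPy sketch of the regularized loss: cross-entropy on the labelled outputs plus $\lambda$ times the smoothness $\mathbf{y}^T L \mathbf{y}$ computed on all outputs. The function signature, the toy graph, and $\lambda$ are illustrative placeholders for whatever the network actually produces.

```python
import numpy as np

def regularized_loss(net_outputs, y_hat_onehot, labelled_mask, W, lam=0.1):
    """Cross-entropy on labelled examples + lambda * smoothness of all outputs."""
    # Supervised term: C(y^r, y_hat^r) over the labelled data.
    probs = net_outputs[labelled_mask]
    ce = -np.sum(y_hat_onehot * np.log(probs + 1e-12))

    # Smoothness term: S = y^T L y with the graph Laplacian L = D - W,
    # computed on the class-1 output of every example, labelled or not.
    y = net_outputs[:, 0]
    L = np.diag(W.sum(axis=1)) - W
    S = y @ L @ y
    return ce + lam * S

# Illustrative usage: 3 examples (only the first is labelled), 2 classes.
outputs = np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6]])
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
loss = regularized_loss(outputs, np.array([[1.0, 0.0]]), np.array([True, False, False]), W)
```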

Page 28: Semi-supervised Learning: Better Representation

去蕪存菁,化繁為簡 ("Remove the dross and keep the essence; simplify the complex")

Page 29: Looking for Better Representation

β€’ Find the latent factors behind the observation

β€’ The latent factors (usually simpler) are better representations

(Figure: observation → better representation (latent factors).)

(Covered in the unsupervised learning part.)

Page 30: Reference

http://olivier.chapelle.cc/ssl-book/ (Chapelle, Schölkopf, and Zien, Semi-Supervised Learning, MIT Press)

Page 31: Acknowledgement

• Thanks to 劉議隆 for pointing out typos on the slides.

• Thanks to δΈε‹ƒι›„ for pointing out typos on the slides.