Semi-supervised Learning
Introduction
• Supervised learning: $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$
• E.g. $x^r$: image, $\hat{y}^r$: class label
• Semi-supervised learning: $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$ together with unlabeled data $\{x^u\}_{u=R+1}^{R+U}$
• A set of unlabeled data; usually $U \gg R$
β’ Transductive learning: unlabeled data is the testing data
β’ Inductive learning: unlabeled data is not the testing data
β’ Why semi-supervised learning?
β’ Collecting data is easy, but collecting βlabelledβ data is expensive
• We do semi-supervised learning in our daily lives, too
Why does semi-supervised learning help?
(Figure: a handful of labelled cat and dog images, plus many unlabeled images of cats and dogs.)
Why does semi-supervised learning help?
• The distribution of the unlabeled data tells us something.
• It usually works only under some assumptions; whether those assumptions hold in practice, who knows?
Outline
Semi-supervised Learning for Generative Model
Low-density Separation Assumption
Smoothness Assumption
Better Representation
Semi-supervised Learning for Generative Model
Supervised Generative Model
• Given labelled training examples $x^r \in C_1, C_2$, look for the most likely prior probability $P(C_i)$ and class-dependent probability $P(x|C_i)$
• $P(x|C_i)$ is a Gaussian parameterized by $\mu^i$ and a shared $\Sigma$
(Figure: two Gaussians $(\mu^1, \Sigma)$ and $(\mu^2, \Sigma)$ fitted to the two classes, and the resulting decision boundary.)
With $P(C_1)$, $P(C_2)$, $\mu^1$, $\mu^2$, $\Sigma$, classify by the posterior:
$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}$$
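For illustration, a minimal sketch of this model (assuming two classes with a shared covariance; scipy's multivariate_normal supplies the Gaussian density, and all function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_supervised(X1, X2):
    """Fit priors, class means, and a shared covariance from labelled data."""
    N1, N2 = len(X1), len(X2)
    P_C1, P_C2 = N1 / (N1 + N2), N2 / (N1 + N2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Shared covariance: weighted average of the per-class ML covariances
    Sigma = (N1 * np.cov(X1.T, bias=True) + N2 * np.cov(X2.T, bias=True)) / (N1 + N2)
    return P_C1, P_C2, mu1, mu2, Sigma

def posterior_C1(x, P_C1, P_C2, mu1, mu2, Sigma):
    """P(C1|x) via Bayes' rule."""
    p1 = multivariate_normal.pdf(x, mu1, Sigma) * P_C1
    p2 = multivariate_normal.pdf(x, mu2, Sigma) * P_C2
    return p1 / (p1 + p2)
```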
Semi-supervised Generative Model
• Given labelled training examples $x^r \in C_1, C_2$, look for the most likely prior probability $P(C_i)$ and class-dependent probability $P(x|C_i)$
• $P(x|C_i)$ is a Gaussian parameterized by $\mu^i$ and a shared $\Sigma$
• The unlabeled data $x^u$ help re-estimate $P(C_1)$, $P(C_2)$, $\mu^1$, $\mu^2$, $\Sigma$, which moves the decision boundary
(Figure: the Gaussians $(\mu^1, \Sigma)$ and $(\mu^2, \Sigma)$ and the decision boundary shift once the unlabeled points are taken into account.)
Semi-supervised Generative Model
• Initialization: $\theta = \{P(C_1), P(C_2), \mu^1, \mu^2, \Sigma\}$
• Step 1 (E-step): compute the posterior probability $P_\theta(C_1|x^u)$ of each unlabeled example, which depends on the current model $\theta$
• Step 2 (M-step): update the model:
$$P(C_1) = \frac{N_1 + \sum_{x^u} P(C_1|x^u)}{N} \qquad \mu^1 = \frac{\sum_{x^r \in C_1} x^r + \sum_{x^u} P(C_1|x^u)\,x^u}{N_1 + \sum_{x^u} P(C_1|x^u)} \qquad \ldots$$
($N$: total number of examples; $N_1$: number of labelled examples belonging to $C_1$.)
• Go back to Step 1. This is exactly the EM algorithm; it converges eventually, but the initialization influences the result.
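A minimal sketch of this EM loop (reusing the illustrative fit_supervised and posterior_C1 helpers sketched earlier; the two-class shared-covariance assumptions are the same):

```python
import numpy as np

def em_semisupervised(X1, X2, Xu, n_iter=100):
    """EM for the semi-supervised two-class Gaussian generative model."""
    # Initialization: theta estimated from the labelled data only
    P_C1, P_C2, mu1, mu2, Sigma = fit_supervised(X1, X2)
    N1, N2, U = len(X1), len(X2), len(Xu)
    N = N1 + N2 + U
    for _ in range(n_iter):
        # Step 1 (E): posterior of each unlabeled point under current theta
        g1 = np.array([posterior_C1(x, P_C1, P_C2, mu1, mu2, Sigma) for x in Xu])
        g2 = 1.0 - g1
        # Step 2 (M): update priors, means, and the shared covariance
        P_C1 = (N1 + g1.sum()) / N
        P_C2 = (N2 + g2.sum()) / N
        mu1 = (X1.sum(axis=0) + (g1[:, None] * Xu).sum(axis=0)) / (N1 + g1.sum())
        mu2 = (X2.sum(axis=0) + (g2[:, None] * Xu).sum(axis=0)) / (N2 + g2.sum())
        d1, d2, du1, du2 = X1 - mu1, X2 - mu2, Xu - mu1, Xu - mu2
        Sigma = (d1.T @ d1 + d2.T @ d2
                 + (g1[:, None] * du1).T @ du1
                 + (g2[:, None] * du2).T @ du2) / N
    return P_C1, P_C2, mu1, mu2, Sigma
```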
Why?
• Maximum likelihood with labelled data only has a closed-form solution:
$$\log L(\theta) = \sum_{x^r} \log P_\theta(x^r, \hat{y}^r), \qquad P_\theta(x^r, \hat{y}^r) = P_\theta(x^r|\hat{y}^r)\,P(\hat{y}^r)$$
• Maximum likelihood with labelled + unlabeled data must be solved iteratively:
$$\log L(\theta) = \sum_{x^r} \log P_\theta(x^r, \hat{y}^r) + \sum_{x^u} \log P_\theta(x^u)$$
$$P_\theta(x^u) = P_\theta(x^u|C_1)P(C_1) + P_\theta(x^u|C_2)P(C_2)$$
($x^u$ can come from either $C_1$ or $C_2$.)
where $\theta = \{P(C_1), P(C_2), \mu^1, \mu^2, \Sigma\}$.
Semi-supervised Learning: Low-density Separation
"Black-or-white" (非黑即白)
Self-training
• Given: labelled data set $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$, unlabeled data set $\{x^u\}_{u=R+1}^{R+U}$
• Repeat:
• Train model $f^*$ from the labelled data set (the procedure is independent of the model: any model works)
• Apply $f^*$ to the unlabeled data set to obtain pseudo-labels $\{(x^u, y^u)\}_{u=R+1}^{R+U}$
• Remove a subset of the pseudo-labelled data from the unlabeled set and add it to the labelled set; how to choose this subset remains an open question, and you can also give each example a confidence weight (a minimal loop is sketched below)
• Note: self-training does not work for regression, because the pseudo-target is exactly the model's own output, so retraining on it does not change the model.
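A minimal self-training loop, as a sketch (assuming a scikit-learn-style classifier with fit/predict_proba; the confidence threshold and all names are illustrative):

```python
import numpy as np

def self_training(model, X_lab, y_lab, X_unl, threshold=0.9, max_rounds=10):
    """Iteratively pseudo-label confident unlabeled points and retrain."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(max_rounds):
        if len(X_unl) == 0:
            break
        model.fit(X_lab, y_lab)                    # train f* on labelled data
        proba = model.predict_proba(X_unl)         # apply f* to unlabeled data
        conf = proba.max(axis=1)                   # confidence of each pseudo-label
        pick = conf >= threshold                   # choose a subset (heuristic!)
        if not pick.any():
            break
        X_lab = np.vstack([X_lab, X_unl[pick]])
        y_lab = np.concatenate([y_lab, proba[pick].argmax(axis=1)])  # hard labels
        X_unl = X_unl[~pick]
    return model.fit(X_lab, y_lab)
```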
Self-training
• Similar to semi-supervised learning for the generative model
• Hard label v.s. soft label: consider using a neural network $f^*$ (network parameters trained from labelled data), and suppose $f^*(x^u) = [0.7,\ 0.3]$ (70% class 1, 30% class 2):
• Hard label: the new target for $x^u$ is $[1,\ 0]$ ("it looks like class 1, so it is class 1")
• Soft label: the new target for $x^u$ is $[0.7,\ 0.3]$; this doesn't work, because the network already outputs exactly that, so there is nothing new to learn
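To make the hard/soft distinction concrete, a tiny sketch (the 0.7/0.3 values are the illustrative ones from above):

```python
import numpy as np

output = np.array([0.7, 0.3])              # f*(x^u): network output for an unlabeled point
soft_target = output                       # soft label: identical to the output, zero new signal
hard_target = np.eye(2)[output.argmax()]   # hard label: one-hot [1., 0.]
print(soft_target, hard_target)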
Entropy-based Regularization
Let $y^u = f^*_\theta(x^u)$ be the network's output distribution over (say) 5 classes.
(Figure: three example distributions over classes 1..5: one concentrated on class 1 is good, one concentrated on class 5 is good, a flat distribution is bad.)
The entropy of $y^u$ evaluates how concentrated the distribution is:
$$E(y^u) = -\sum_{m=1}^{5} y^u_m \ln y^u_m$$
• Concentrated (one-hot) distributions: $E(y^u) = 0$
• Uniform distribution: $E(y^u) = -\ln\frac{1}{5} = \ln 5$
We want $E(y^u)$ to be as small as possible, so add it to the loss:
$$L = \sum_{x^r} C(y^r, \hat{y}^r) + \lambda \sum_{x^u} E(y^u)$$
where the first term is over the labelled data and the second over the unlabeled data.
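A numpy sketch of this objective (forward computation only, no gradients; all names and the toy value of $\lambda$ are illustrative):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """E(y) = -sum_m y_m ln y_m: how concentrated a distribution is (0 = one-hot)."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def semi_supervised_loss(p_lab, y_hot, p_unl, lam=0.1):
    """Cross-entropy C on labelled outputs + lambda * entropy E on unlabeled outputs."""
    ce = -(y_hot * np.log(p_lab + 1e-12)).sum()  # summed over labelled examples and classes
    return ce + lam * entropy(p_unl).sum()       # entropy summed over unlabeled examples
```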
Outlook: Semi-supervised SVM
• Enumerate all possible labels for the unlabeled data; for each assignment, train an SVM
• Keep the boundary that provides the largest margin and the least error
• (In practice, exhaustive enumeration is intractable, so approximate search over label assignments is used.)
Thorsten Joachims, "Transductive Inference for Text Classification using Support Vector Machines", ICML, 1999.
Semi-supervised Learning: Smoothness Assumption
"You are known by the company you keep" (近朱者赤，近墨者黑)
Smoothness Assumption
• Assumption: "similar" $x$ have the same $\hat{y}$
• More precisely:
• $x$ is not uniformly distributed.
• If $x^1$ and $x^2$ are close in a high-density region (connected by a high-density path), then $\hat{y}^1$ and $\hat{y}^2$ are the same.
(Figure: a pinwheel-shaped density with three marked points, source: http://hips.seas.harvard.edu/files/pinwheel.png. $x^1$ and $x^2$ are connected by a high-density path, so they have the same label; $x^2$ and $x^3$ are not, so they have different labels.)
Smoothness Assumption
(Figure: two examples that look directly "similar? not similar?" can nevertheless be "indirectly" similar, connected with stepping stones of intermediate examples. The example is from the tutorial slides of Xiaojin Zhu; image source: http://www.moehui.com/5833.html/5/)
Smoothness Assumption
• Classify astronomy vs. travel articles: two articles that share no words directly can still be connected through a chain of documents with overlapping vocabulary.
(The example is from the tutorial slides of Xiaojin Zhu.)
Cluster and then Label
(Figure: the data form three clusters; Cluster 1 contains labelled Class 1 points and is labelled Class 1, while Clusters 2 and 3 contain labelled Class 2 points and are labelled Class 2.)
• Cluster all the data, assign each cluster the class of its labelled members, then use all the data to learn a classifier as usual (a sketch follows below).
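A minimal cluster-then-label sketch (assuming scikit-learn's KMeans; the number of clusters and all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X_lab, y_lab, X_unl, n_clusters=3):
    """Cluster all data, give each cluster its majority label, return the full training set."""
    X_all = np.vstack([X_lab, X_unl])
    cluster_of = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    y_all = np.empty(len(X_all), dtype=y_lab.dtype)
    for c in range(n_clusters):
        members = cluster_of == c
        lab_members = members[:len(X_lab)]        # labelled points in this cluster
        if lab_members.any():
            # Majority label among the labelled members
            labels, counts = np.unique(y_lab[lab_members], return_counts=True)
            y_all[members] = labels[counts.argmax()]
        else:
            y_all[members] = y_lab[0]             # fallback: arbitrary label
    return X_all, y_all                           # train a classifier on this as usual
```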
Graph-based Approach
• How do we know that $x^1$ and $x^2$ are close in a high-density region (connected by a high-density path)?
• Represent the data points as a graph. A graph representation is sometimes natural, e.g. hyperlinks between webpages or citations between papers; sometimes you have to construct the graph yourself.
Graph-based Approach: Graph Construction
• Define the similarity $s(x^i, x^j)$ between $x^i$ and $x^j$, e.g. with the Gaussian Radial Basis Function
$$s(x^i, x^j) = \exp\left(-\gamma\left\|x^i - x^j\right\|^2\right)$$
• Add edges by:
• K Nearest Neighbor
• ε-Neighborhood
• Edge weight is proportional to $s(x^i, x^j)$
(The illustration is from the tutorial slides of Amarnag Subramanya and Partha Pratim Talukdar.)
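A sketch of kNN graph construction with RBF edge weights (assuming scikit-learn's kneighbors_graph; gamma and k are illustrative):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_rbf_graph(X, k=5, gamma=1.0):
    """Symmetric kNN graph with weights W[i,j] = exp(-gamma * ||xi - xj||^2)."""
    D = kneighbors_graph(X, n_neighbors=k, mode='distance')  # sparse kNN distances
    D = D.maximum(D.T)                  # symmetrize: keep an edge if either endpoint picks it
    W = D.copy()
    W.data = np.exp(-gamma * D.data ** 2)  # RBF weight on each existing edge
    return W
```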
Graph-based Approach
• Propagate labels through the graph: the labelled data influence their neighbors, and the labels spread along the connected structure.
(Figure: once one node is labelled Class 1, the nodes connected to it through the graph gradually become Class 1 as well.)
Graph-based Approach
• Define the smoothness of the labels on the graph, for all data (no matter labelled or not):
$$S = \frac{1}{2}\sum_{i,j} w_{i,j}\left(y^i - y^j\right)^2$$
where the sum runs over all pairs $(i, j)$. Smaller $S$ means smoother.
• Worked example: a graph on $x^1, x^2, x^3, x^4$ with edge weights $w_{1,2}=2$, $w_{1,3}=3$, $w_{2,3}=1$, $w_{3,4}=1$. The labeling $(y^1, y^2, y^3, y^4) = (1, 1, 1, 0)$ disagrees only on the $w_{3,4}$ edge, giving $S = 1$; the labeling $(0, 1, 1, 0)$ disagrees on the $w_{1,2}$, $w_{1,3}$, and $w_{3,4}$ edges, giving $S = 6$. The first labeling is smoother.
Graph-based Approach
• The smoothness has a closed form:
$$S = \frac{1}{2}\sum_{i,j} w_{i,j}\left(y^i - y^j\right)^2 = \mathbf{y}^T L \mathbf{y}$$
where $\mathbf{y} = [\cdots\, y^i \,\cdots\, y^j \,\cdots]^T$ is the $(R+U)$-dimensional vector of all labels and $L = D - W$ is the $(R+U)\times(R+U)$ Graph Laplacian. For the example graph above:
$$W = \begin{bmatrix} 0 & 2 & 3 & 0 \\ 2 & 0 & 1 & 0 \\ 3 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad D = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
($D$ is the diagonal matrix of the row sums of $W$.)
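A quick numerical check of $S = \mathbf{y}^T L \mathbf{y}$ on this example graph (numpy; the two labelings are the ones from the worked example):

```python
import numpy as np

W = np.array([[0, 2, 3, 0],
              [2, 0, 1, 0],
              [3, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))      # degree matrix: row sums of W on the diagonal
L = D - W                       # Graph Laplacian

for y in (np.array([1., 1., 1., 0.]), np.array([0., 1., 1., 0.])):
    pairwise = 0.5 * sum(W[i, j] * (y[i] - y[j]) ** 2
                         for i in range(4) for j in range(4))
    print(pairwise, y @ L @ y)  # the two forms agree: 1.0, then 6.0
```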
Graph-based Approach
• Since the predicted labels $\mathbf{y}$ depend on the network parameters, the smoothness $S = \mathbf{y}^T L \mathbf{y}$ can be used as a regularization term in the loss:
$$L = \sum_{x^r} C(y^r, \hat{y}^r) + \lambda S$$
• The smoothness term does not have to be computed on the output only; it can also be applied to any hidden layer (embedding) of a deep network, encouraging smoothness at every level.
J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," ICML, 2008.
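A forward-only sketch of this regularized objective (numpy; a binary setting with one score per point is assumed, and lam is illustrative):

```python
import numpy as np

def graph_regularized_loss(p_lab, y_lab, p_all, L, lam=0.01):
    """Cross-entropy on labelled points + lam * smoothness y^T L y over all points.

    p_lab: predicted P(C1) for labelled points; y_lab: their 0/1 targets;
    p_all: predicted P(C1) for all R+U points; L: graph Laplacian.
    """
    ce = -np.sum(y_lab * np.log(p_lab + 1e-12)
                 + (1 - y_lab) * np.log(1 - p_lab + 1e-12))  # supervised term
    smooth = float(p_all @ L @ p_all)                        # S = y^T L y
    return ce + lam * smooth
```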
Semi-supervised Learning: Better Representation
"Remove the dross and keep the essence; simplify the complex" (去蕪存菁，化繁為簡)
Looking for Better Representation
• Find the latent factors behind the observation
• The latent factors (usually simpler) are better representations
(observation → better representation, i.e. the latent factors; covered in the unsupervised learning part of the course)
Reference
• Olivier Chapelle, Bernhard Schölkopf, Alexander Zien (eds.), Semi-Supervised Learning, MIT Press. http://olivier.chapelle.cc/ssl-book/
Acknowledgement
• Thanks to the students who pointed out mistakes in the slides.