ORIGINAL ARTICLE

Semi-supervised classification with privileged information

Zhiquan Qi, Yingjie Tian, Lingfeng Niu, Bo Wang

Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China

Received: 25 December 2014 / Accepted: 12 June 2015 / Published online: 30 June 2015
© Springer-Verlag Berlin Heidelberg 2015

Abstract  Privileged information, which is available only for the training examples and not for test examples, is a concept proposed by Vapnik and Vashist (Neural Netw 22(5–6):544–557, 2009). With its help, learning using privileged information (LUPI) (Neural Netw 22(5–6):544–557, 2009) can significantly accelerate the speed of learning. However, LUPI is a standard supervised learning method, while in many real-world problems there are also large amounts of unlabeled data. This motivates us to solve such problems under a semi-supervised learning framework. In this paper, we propose semi-supervised learning using privileged information (called Semi-LUPI), which exploits both the distribution information in unlabeled data and the privileged information to improve the efficiency of learning. Furthermore, we compare the relative importance of both types of information for the learning model. All experiments verify the effectiveness of the proposed method and show that Semi-LUPI obtains superior performance over traditional supervised and semi-supervised methods.

Keywords  Classification · Support vector machine · Privileged information

1 Introduction

The classical learning model is built on the training data [2]

$$\{(x_1, y_1), \ldots, (x_l, y_l)\},\quad x_i \in \mathcal{X} \subseteq \mathbb{R}^n,\ y_i \in \mathcal{Y} = \{1, -1\},\tag{1}$$

where $x_i$ denotes the $i$th training example and $y_i$ is its class label. The learner's aim is to select a suitable classifier from a given collection of functions $f(x, \alpha),\ \alpha \in \Lambda$, that minimizes the number of misclassified points [3].

However, in the human learning process, teachers play an important role. They teach students knowledge through many kinds of information, such as comments, comparisons, explanations, and logical, emotional or metaphorical reasoning. Likewise, during the machine learning process, a teacher may describe training examples with such additional information. Vapnik et al. [1, 4–6] called this kind of additional information privileged information: it is available only at the training stage and never for test samples. They then proposed a new learning model, learning using privileged information (LUPI), which has been proven, via statistical learning theory, to significantly increase the speed of learning [1, 4, 5].

Recently, semi-supervised learning has attracted an increasing amount of interest [7–11]. One important reason is that in many practical problems labeled examples are scarce while large amounts of unlabeled examples are available. Graph-based methods form a very important branch of this field: nodes in the graph are the labeled and unlabeled points, and weighted edges reflect the similarities between nodes. The initial assumption of these methods is that all points are located on a low-dimensional manifold.
…ing $\|f\|_{\mathcal{H}}^2$ by (15), introducing the 0–1 loss function and the corresponding oracle function, the formulation of Semi-LUPI can be expressed as

$$
\begin{aligned}
\min_{\alpha,\alpha^*,b,b^*}\quad & c_1\alpha^\top K\alpha + c_2\alpha^{*\top}K^*\alpha^* + \frac{1}{l}e^\top K^*\alpha^* + b^* + c_3\|f\|_M^2,\\
\text{s.t.}\quad & y_i\Big[\sum_{j=1}^{l+u}\alpha_j K(x_i,x_j)+b\Big] \ge 1-\Big[\sum_{j=1}^{l}\alpha_j^* K^*(x_i^*,x_j^*)+b^*\Big],\\
& \sum_{j=1}^{l}\alpha_j^* K^*(x_i^*,x_j^*)+b^* \ge 0,\quad i=1,\ldots,l.
\end{aligned}\tag{17}
$$
An important premise of this kind of approach is the assumption that the probability distribution of the data has the geometric structure of a Riemannian manifold $\mathcal{M}$: the labels of two points that are close in the intrinsic geometry of $P_X$ should be the same or similar. Reference [12] applied the intrinsic regularizer $\|f\|_M^2$ to encode this constraint,

$$\|f\|_M^2 = \frac{1}{(l+u)^2}\sum_{i,j=1}^{l+u} W_{i,j}\,\big(f(x_i)-f(x_j)\big)^2 = \frac{1}{(l+u)^2}\, f^\top L f,\tag{18}$$
where $L$ is the graph Laplacian. In practice, a data adjacency graph with weight matrix $W \in \mathbb{R}^{(l+u)\times(l+u)}$ is defined over the input samples, where $W_{i,j}$ represents the similarity of each pair of samples. The weight matrix $W$ may be defined by $k$ nearest neighbors or graph kernels as follows [12]:

$$W_{ij} = \begin{cases}\exp\!\big(-\|x_i-x_j\|_2^2/2\sigma^2\big), & \text{if } x_i, x_j \text{ are neighbors},\\ 0, & \text{otherwise},\end{cases}\tag{19}$$
where $\|x_i-x_j\|_2$ denotes the Euclidean norm in $\mathbb{R}^n$. Here $L = D-W$ is the graph Laplacian, $D$ is the diagonal matrix with $D_{ii} = \sum_{j=1}^{l+u}W_{ij}$, and $f = [f(x_1),\ldots,f(x_{l+u})]^\top = K\alpha$.
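To make (18) and (19) concrete, here is a minimal NumPy sketch (ours, not the authors' code; the function names, the symmetrization step, and the default $k$ and $\sigma$ are illustrative choices) that builds the $k$-nearest-neighbor adjacency $W$, the Laplacian $L = D - W$, and evaluates the manifold penalty with $f = K\alpha$:

```python
import numpy as np

def graph_laplacian(X, k=5, sigma=1.0):
    """Adjacency W of Eq. (19) via k nearest neighbors, then L = D - W."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]            # k nearest neighbors of x_i (skip x_i itself)
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                           # symmetrize the kNN graph
    D = np.diag(W.sum(axis=1))                       # D_ii = sum_j W_ij
    return D - W                                     # graph Laplacian L

def manifold_penalty(K, alpha, L):
    """||f||_M^2 = f^T L f / (l + u)^2 with f = K @ alpha, cf. Eq. (18)."""
    f = K @ alpha
    return float(f @ L @ f) / K.shape[0] ** 2
```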
When (18) is used as a penalty term in (17), it can be understood as follows: if neighboring points $x_i$ and $x_j$ have high similarity ($W_{ij}$ is large), then a difference between $f(x_i)$ and $f(x_j)$ incurs a large penalty; conversely, the smaller $|f(x_i)-f(x_j)|$ is, the smoother $f(x)$ is over the data adjacency graph. Problem (17) can thus be rewritten as the following optimization problem:
$$
\begin{aligned}
\min_{\alpha,\alpha^*,b,b^*}\quad & c_1\alpha^\top K\alpha + c_2\alpha^{*\top}K^*\alpha^* + \frac{1}{l}e^\top K^*\alpha^* + b^* + \frac{c_3}{(l+u)^2}\alpha^\top KLK\alpha,\\
\text{s.t.}\quad & y_i\Big[\sum_{j=1}^{l+u}\alpha_j K(x_i,x_j)+b\Big] \ge 1-\Big[\sum_{j=1}^{l}\alpha_j^* K^*(x_i^*,x_j^*)+b^*\Big],\\
& \sum_{j=1}^{l}\alpha_j^* K^*(x_i^*,x_j^*)+b^* \ge 0,\quad i=1,\ldots,l.
\end{aligned}\tag{20}
$$
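For concreteness, a small helper (again ours; argument names are illustrative) that evaluates the objective of (20) for candidate coefficients; note that the bias $b$ enters only the constraints of (20), not its objective:

```python
import numpy as np

def primal_objective(K, Ks, L, alpha, alpha_s, b_s, l, u, c1, c2, c3):
    """Objective of (20); K is (l+u)x(l+u), Ks is the l x l privileged Gram matrix."""
    smooth = (c3 / (l + u) ** 2) * (alpha @ K @ L @ K @ alpha)   # manifold term
    return (c1 * (alpha @ K @ alpha) + c2 * (alpha_s @ Ks @ alpha_s)
            + (Ks @ alpha_s).sum() / l + b_s + smooth)           # e^T K* alpha* / l + b*
```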
The Lagrangian corresponding to problem (20) is given by

$$
\begin{aligned}
L(\Theta) ={}& c_1\alpha^\top K\alpha + c_2\alpha^{*\top}K^*\alpha^* + \frac{1}{l}e^\top K^*\alpha^* + b^* + \frac{c_3}{(l+u)^2}\alpha^\top KLK\alpha\\
&- \sum_{i=1}^{l}\beta_i\left( y_i\Big[\sum_{j=1}^{l+u}\alpha_j K(x_i,x_j)+b\Big] - 1 + \Big[\sum_{j=1}^{l}\alpha_j^* K^*(x_i^*,x_j^*)+b^*\Big]\right)\\
&- \sum_{i=1}^{l}\eta_i\left(\sum_{j=1}^{l}\alpha_j^* K^*(x_i^*,x_j^*)+b^*\right),
\end{aligned}\tag{21}
$$
where $\Theta = \{\alpha,\alpha^*,b,b^*,\beta,\eta\}$, and $\beta = (\beta_1,\ldots,\beta_l)^\top$, $\eta = (\eta_1,\ldots,\eta_l)^\top$ are the Lagrange multipliers. The dual problem can then be formulated as
$$
\begin{aligned}
\max_{\Theta}\quad & L(\Theta)\\
\text{s.t.}\quad & \nabla_{\alpha,\alpha^*,b,b^*}L(\Theta) = 0,\\
& \beta \ge 0,\ \eta \ge 0.
\end{aligned}\tag{22}
$$
From Eq. (22), we get
$$\nabla_\alpha L = \Big(2c_1K + \frac{2c_3}{(l+u)^2}KLK\Big)\alpha - KJ^\top Y\beta = 0,\tag{23}$$

$$\nabla_{\alpha^*} L = 2c_2K^*\alpha^* + \frac{1}{l}K^*e - K^*(\beta+\eta) = 0,\tag{24}$$

$$\nabla_b L = \sum_{i=1}^{l} y_i\beta_i = 0,\tag{25}$$

$$\nabla_{b^*} L = 1 - \sum_{i=1}^{l}\beta_i - \sum_{i=1}^{l}\eta_i = 0,\tag{26}$$
where $J = [\,I\;\;0\,]$ is the $l\times(l+u)$ matrix with $I$ the $l\times l$ identity matrix, and $Y = \mathrm{diag}(y_1,\ldots,y_l)$ is the diagonal matrix of the labels. Now, substituting (23)–(26) into the dual (22), we obtain the Wolfe dual of problem (20) as follows:
$$
\begin{aligned}
\max_{\beta,\eta}\quad & \sum_{i=1}^{l}\beta_i - \frac{1}{2}\beta^\top Q\beta - \frac{1}{4c_2}\Big(\beta+\eta-\frac{1}{l}e\Big)^{\!\top} K^*\Big(\beta+\eta-\frac{1}{l}e\Big)\\
\text{s.t.}\quad & \sum_{i=1}^{l} y_i\beta_i = 0,\\
& 1-\sum_{i=1}^{l}\beta_i-\sum_{i=1}^{l}\eta_i = 0,\\
& \beta\ge 0,\ \eta\ge 0,
\end{aligned}\tag{27}
$$
where

$$Q = YJK\Big(2c_1 I + \frac{2c_3}{(l+u)^2}LK\Big)^{-1}J^\top Y.\tag{28}$$
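As a sanity check on (27) and (28), the following sketch (ours, not the authors' implementation) assembles $Q$ from (28) and maximizes the dual over $(\beta, \eta)$; for simplicity it assumes dense Gram matrices `K` and `Ks` ($K^*$) and uses SciPy's generic SLSQP solver rather than a dedicated QP routine:

```python
import numpy as np
from scipy.optimize import minimize

def build_Q(K, L, y, l, u, c1, c3):
    """Q = Y J K (2 c1 I + (2 c3/(l+u)^2) L K)^{-1} J^T Y, Eq. (28)."""
    J = np.hstack([np.eye(l), np.zeros((l, u))])   # J = [I 0], selects labeled points
    Y = np.diag(y)
    M = 2.0 * c1 * np.eye(l + u) + (2.0 * c3 / (l + u) ** 2) * (L @ K)
    return Y @ J @ K @ np.linalg.solve(M, J.T @ Y)

def solve_dual(Q, Ks, y, l, c2):
    """Maximize the Wolfe dual (27) over beta, eta >= 0 (stacked into one vector z)."""
    e = np.ones(l)

    def neg_obj(z):
        beta, eta = z[:l], z[l:]
        v = beta + eta - e / l
        return -(beta.sum() - 0.5 * beta @ Q @ beta - (v @ Ks @ v) / (4.0 * c2))

    cons = [{"type": "eq", "fun": lambda z: y @ z[:l]},       # sum_i y_i beta_i = 0
            {"type": "eq", "fun": lambda z: 1.0 - z.sum()}]   # 1 - sum(beta) - sum(eta) = 0
    res = minimize(neg_obj, x0=np.full(2 * l, 0.5 / l),
                   bounds=[(0.0, None)] * (2 * l), constraints=cons, method="SLSQP")
    return res.x[:l], res.x[l:]                               # beta*, eta*
```

At scale a dedicated QP solver would be preferable, since (27) is a standard convex QP in only $2l$ variables.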
From (27), it is easy to see that this is a standard convex quadratic programming problem, and we do not need to solve for the additional variables $\alpha^*$ and $b^*$. Finally, Semi-LUPI can be summarized as the following Algorithm 1:

Algorithm 1 Semi-LUPI
• Input the training set $T$ given by (1) and (13);
• Choose two appropriate kernels $K(\cdot,\cdot)$ and $K^*(\cdot,\cdot)$, and parameters $c_1, c_2, c_3 > 0$;
• Construct and solve the convex quadratic programming problem (27), obtaining the solution $\beta^*$, $\eta^*$;
• Select a component index $j$ such that $\beta_j^* > 0$ and $\eta_j^* > 0$, and compute $b = y_j - \sum_{i=1}^{l+u}\alpha_i^* y_i K(x_i, x_j)$, where $\alpha^* = \big(2c_1 I + \frac{2c_3}{(l+u)^2}LK\big)^{-1}J^\top Y\beta^*$;
• Construct the decision function $f(x) = \operatorname{sgn}(g(x))$, where $g(x) = \sum_{i=1}^{l+u} y_i\alpha_i^* K(x_i, x) + b$.
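Continuing the sketch above, the last two steps of Algorithm 1 might look as follows. One caveat: Algorithm 1 writes the kernel expansions with explicit $y_i$ factors, but $y_i$ is defined only for the $l$ labeled points, and the labels are already folded into $\alpha^*$ through the $J^\top Y\beta^*$ product in (23); the sketch therefore keeps them absorbed in `alpha`, which is our reading rather than the authors' code:

```python
import numpy as np

def recover_classifier(K, L, y, beta, eta, l, u, c1, c3, tol=1e-6):
    """Steps 4-5 of Algorithm 1: alpha from Eq. (23), bias b from a support index j."""
    J = np.hstack([np.eye(l), np.zeros((l, u))])
    Y = np.diag(y)
    M = 2.0 * c1 * np.eye(l + u) + (2.0 * c3 / (l + u) ** 2) * (L @ K)
    alpha = np.linalg.solve(M, J.T @ Y @ beta)                # labels absorbed via J^T Y beta
    j = int(np.flatnonzero((beta > tol) & (eta > tol))[0])    # needs beta_j > 0 and eta_j > 0
    b = y[j] - K[j] @ alpha                                   # b = y_j - sum_i alpha_i K(x_i, x_j)
    return alpha, b

def predict(K_test, alpha, b):
    """f(x) = sgn(g(x)) with g(x) = sum_i alpha_i K(x_i, x) + b; K_test is (m, l+u)."""
    return np.sign(K_test @ alpha + b)
```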
4 Other extensions of Semi-LUPI
In this section, we give some extensions of Semi-LUPI.
4.1 Mixture model of slacks
Modeling the slacks by the values of a single smooth function is not always the best choice [1]. Let us instead model the slacks by a mixture of the values of a smooth function $\phi(x_i^*) = \sum_{j=1}^{l}\alpha_j^* K^*(x_j^*, x_i^*) + b^*$ and slack variables $\xi_i$, $i = 1,\ldots,l$. The primal optimization problem (20) can then be changed to