NB – Naïve Bayes
SSL – Semi-Supervised Learning
TC – Text Classification
EM – Expectation Maximization
SVM – Support Vector Machine
As the number of training documents increases, the accuracy of Text Classification increases, but traditional classifiers need labeled data for training.
Labeled instances, however, are often difficult, expensive, or time-consuming to obtain, as they require the effort of experienced human annotators.
Meanwhile, unlabeled data may be relatively easy to collect.
Semi-Supervised Learning makes use of both labeled and unlabeled documents for classification, so how to apply it when only a few labeled documents are available is an open research problem.
In the field of machine learning, semi-supervised learning
(SSL) occupies the middle ground between supervised
learning (in which all training examples are labeled) and
unsupervised learning (in which no labeled data are given).
Interest in SSL has increased in recent years, particularly
because of application domains in which unlabeled data are
plentiful, such as images, text, and bioinformatics.
Original sentence: India and China are joining WTO.
After tokenization: { India and china are joining WTO }
After stop-word removal: { India china joining WTO }
After stemming: { India china join WTO }
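A minimal Java sketch of this preprocessing pipeline; the stop-word list and the suffix-stripping rule below are illustrative assumptions, not the exact resources used in the experiments:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {
    // Illustrative stop-word list; a real system would use a much larger one.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "are", "is", "the", "of"));

    public static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<String>();
        // Tokenization: lower-case the text and split on non-letter characters.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue; // stop-word removal
            }
            terms.add(stem(token)); // stemming
        }
        return terms;
    }

    // Toy stemmer: strips a trailing "ing" ("joining" -> "join"); a real
    // system would use a full stemmer such as Porter's.
    private static String stem(String token) {
        return token.endsWith("ing") ? token.substring(0, token.length() - 3) : token;
    }

    public static void main(String[] args) {
        // Prints [india, china, join, wto]
        System.out.println(preprocess("India and China are joining WTO."));
    }
}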
1. Term Frequency: TF_ij = n_ij / d_i, where n_ij is the number of occurrences of term j in document i and d_i is the total number of terms in document i.
TF of ‘chinese’: d1 = 2/3, d2 = 2/3, d3 = 1/2, d4 = 1/3
2. Document Frequency: DF_j = n_j / n, where n_j is the number of documents containing term j and n is the total number of documents.
DF of ‘chinese’ = 4/4 = 1
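A small Java sketch of these two formulas; the four example documents below are an assumption chosen so that they reproduce the TF and DF values shown above:

import java.util.Arrays;
import java.util.List;

public class TermStats {
    // TF_ij = n_ij / d_i : occurrences of the term in the doc over doc length.
    static double tf(List<String> doc, String term) {
        int count = 0;
        for (String w : doc) {
            if (w.equals(term)) count++;
        }
        return (double) count / doc.size();
    }

    // DF_j = n_j / n : fraction of documents that contain the term.
    static double df(List<List<String>> docs, String term) {
        int containing = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) containing++;
        }
        return (double) containing / docs.size();
    }

    public static void main(String[] args) {
        // Hypothetical documents consistent with the TF values on the slide.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("chinese", "beijing", "chinese"),   // d1: TF = 2/3
                Arrays.asList("chinese", "chinese", "shanghai"),  // d2: TF = 2/3
                Arrays.asList("chinese", "macao"),                // d3: TF = 1/2
                Arrays.asList("tokyo", "japan", "chinese"));      // d4: TF = 1/3
        System.out.println("TF in d1: " + tf(docs.get(0), "chinese")); // 0.666...
        System.out.println("DF: " + df(docs, "chinese"));              // 1.0
    }
}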
Precision = TP / (TP + FP) = 1/2
Recall = TP / (TP + FN) = 1 / (1 + 0) = 1
F1 = (2 * Precision * Recall / (Precision + Recall)) * 100%
F1 = 0.667 * 100% ≈ 66.7%
Doc | Words in Doc             | Actual Label | Predicted Label | Actual = Chinese? | Predicted = Chinese?
d10 | India India Delhi Mumbai | India        | Chinese         | N                 | Y
d11 | Chinese Beijing          | Chinese      | Chinese         | Y                 | Y

Confusion matrix for class ‘Chinese’:
            | Actual Y | Actual N
Predicted Y | TP = 1   | FP = 1
Predicted N | FN = 0   | TN = 0
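A minimal sketch of the evaluation formulas from the previous slide applied to this confusion matrix:

public class Evaluation {
    public static void main(String[] args) {
        // Confusion matrix for class "Chinese" from the table above.
        int tp = 1, fp = 1, fn = 0;

        double precision = (double) tp / (tp + fp);                 // 1/2 = 0.5
        double recall = (double) tp / (tp + fn);                    // 1/1 = 1.0
        double f1 = 2 * precision * recall / (precision + recall);  // ~0.667

        // Prints: Precision=0.50 Recall=1.00 F1=66.7%
        System.out.printf("Precision=%.2f Recall=%.2f F1=%.1f%%%n",
                precision, recall, f1 * 100);
    }
}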
Example: Training Set → (Semi-Supervised Learning) → Training Set after SSL, plus a separate Test Set.

Training Set (Doc Id, Class):
Labeled documents: D1 India, D2 China
Unlabeled documents: D3 ?, D4 ?, D5 ?, D6 ?

Training Set after SSL (Doc Id, Class):
D1 India, D2 China, D3 India, D4 China, D5 India, D6 India

Test Set (Doc Id, Class):
D7 India, D8 China, D9 China, D10 India, D11 India, D12 India
SSL methods:
Low-Density Separation (SVM)
Graph-based Methods
Co-Training (Multi-View Approach)
Generative Methods
In probability and statistics, a generative model is a model for randomly generating observable data, typically given some hidden parameters.
Generative models are used in machine learning for either modeling data directly (i.e., modeling observed draws from a probability density function), or as an intermediate step to forming a conditional probability density function. A conditional distribution can be formed from a generative model through the use of Bayes' rule.
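For example, for a document d and class c, the conditional distribution follows from the generative model's joint distribution via Bayes' rule:

P(c | d) = P(d | c) P(c) / P(d)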
In statistics, an expectation–maximization (EM)
algorithm is an iterative method for finding
maximum likelihood or maximum a posteriori
(MAP) estimates of parameters in statistical models.
Widely used for learning in the presence of
unobserved variables, e.g., missing features, class
labels.
Algorithm
N = number of labeled documents, U = number of unlabeled documents
Inputs: Collections Dl of labeled documents and Du of unlabeled documents.
Method:
Build an initial naive Bayes classifier, θ̂, from the labeled documents Dl only. Use
maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D | θ) P(θ).
Loop while the classifier parameters improve, as measured by l_c(θ | D, z)
(the complete log probability of the labeled and unlabeled data):
(E-step) Use the current classifier, θ̂, to estimate the component membership of each
unlabeled document, i.e., the probability that each mixture component (and class)
generated each document, P(c_j | d_i; θ̂).
(M-step) Re-estimate the classifier, θ̂, given the estimated component membership of
each document. Use maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D | θ) P(θ).
Output: A classifier, θ̂, that takes an unlabeled document and predicts a class label.
The algorithm first trains a classifier with only the available
labeled documents, and assigns probabilistically-weighted
class labels to each unlabeled document by using the
classifier to calculate their expectation.
It then trains a new classifier using all the documents, both
the originally labeled and the formerly unlabeled, and
iterates.
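As a concrete illustration of this loop, here is a condensed Java sketch of EM over a multinomial naive Bayes model. It is a toy, stand-alone version: the word-count representation, Laplace smoothing, fixed iteration count, and all names are illustrative assumptions, not the RapidMiner extension's actual code (the slides' algorithm loops while the complete log likelihood improves):

public class EmNaiveBayes {

    private double[] prior;      // P(c), with Laplace smoothing
    private double[][] condProb; // P(word | c), with Laplace smoothing

    // labeled[i]: word-count vector of labeled doc i over a fixed vocabulary;
    // labels[i]: its class index; unlabeled[u]: word counts of unlabeled doc u.
    public void train(int[][] labeled, int[] labels, int[][] unlabeled,
                      int numClasses, int iterations) {
        int vocab = labeled[0].length;
        // Soft class memberships of unlabeled docs; all zero before the first
        // E-step, so the initial M-step uses the labeled documents only.
        double[][] gamma = new double[unlabeled.length][numClasses];
        mStep(labeled, labels, unlabeled, gamma, numClasses, vocab);
        for (int it = 0; it < iterations; it++) {
            for (int u = 0; u < unlabeled.length; u++) {
                gamma[u] = posterior(unlabeled[u]); // E-step: P(c | d_u; theta)
            }
            mStep(labeled, labels, unlabeled, gamma, numClasses, vocab); // M-step
        }
    }

    private void mStep(int[][] labeled, int[] labels, int[][] unlabeled,
                       double[][] gamma, int numClasses, int vocab) {
        prior = new double[numClasses];
        condProb = new double[numClasses][vocab];
        for (int i = 0; i < labeled.length; i++) {   // labeled docs: weight 1
            prior[labels[i]] += 1.0;
            for (int w = 0; w < vocab; w++) condProb[labels[i]][w] += labeled[i][w];
        }
        for (int u = 0; u < unlabeled.length; u++) { // unlabeled docs: soft weights
            for (int c = 0; c < numClasses; c++) {
                prior[c] += gamma[u][c];
                for (int w = 0; w < vocab; w++) condProb[c][w] += gamma[u][c] * unlabeled[u][w];
            }
        }
        double totalWeight = 0;
        for (int c = 0; c < numClasses; c++) totalWeight += prior[c];
        for (int c = 0; c < numClasses; c++) {
            double totalWords = 0;
            for (int w = 0; w < vocab; w++) totalWords += condProb[c][w];
            for (int w = 0; w < vocab; w++) {
                condProb[c][w] = (condProb[c][w] + 1.0) / (totalWords + vocab); // Laplace
            }
            prior[c] = (prior[c] + 1.0) / (totalWeight + numClasses);           // Laplace
        }
    }

    // P(c | d) via Bayes' rule, computed in log space for numerical stability.
    public double[] posterior(int[] doc) {
        int numClasses = prior.length;
        double[] logP = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            logP[c] = Math.log(prior[c]);
            for (int w = 0; w < doc.length; w++) logP[c] += doc[w] * Math.log(condProb[c][w]);
            if (logP[c] > max) max = logP[c];
        }
        double[] p = new double[numClasses];
        double sum = 0;
        for (int c = 0; c < numClasses; c++) { p[c] = Math.exp(logP[c] - max); sum += p[c]; }
        for (int c = 0; c < numClasses; c++) p[c] /= sum;
        return p;
    }
}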
No | Tool, Technology, Language | Version used
1  | RapidMiner                 | 5.1.001
2  | Eclipse                    | Ganymede
3  | Java                       | JDK 1.6
Dataset Detail
Training – Testing split                   | Class 1: Religion | Class 2: Politics
Training Set: No. of labeled documents     | 10 to 600         | 10 to 600
Training Set: No. of unlabeled documents   | 500               | 500
Testing Set: No. of documents for testing  | 100               | 100
Implementation Setup
We have used the 20 Newsgroups dataset [11].
1) Creating an operator:

package com.rapidminer.operator.learner;

import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;

public class SemiSupervisedLearner extends Operator {
    public SemiSupervisedLearner(OperatorDescription description) {
        super(description);
    }
}
2) Adding ports to the operator for input and output:

private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
private OutputPort modelOutput = getOutputPorts().createPort("model");
3) Writing the logic implementing the Semi-Supervised algorithm:

public void doWork() throws OperatorException {
    ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
    ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
    ExampleSet testExampleSet = testExampleSetInput.getData();
    /* logic of the algorithm, i.e., call methods of the Semi-Supervised
       Learning algorithm; this produces model and exampleSet */
    modelOutput.deliver(model);
    exampleSetOutput.deliver(exampleSet);
}
package com.rapidminer.operator.learner;

import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;

public class SemiSupervisedLearner extends Operator {

    private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
    private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
    private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
    private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
    private OutputPort modelOutput = getOutputPorts().createPort("model");

    public SemiSupervisedLearner(OperatorDescription description) {
        super(description);
    }

    public void doWork() throws OperatorException {
        ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
        ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
        ExampleSet testExampleSet = testExampleSetInput.getData();
        // logic of the algorithm, i.e., call methods of the Semi-Supervised
        // Learning algorithm; this produces model and exampleSet
        modelOutput.deliver(model);
        exampleSetOutput.deliver(exampleSet);
    }
}
No  | Package                               | Classes
1   | com.rapidminer.operator.learner       | SemiSupervisedAbstractLearner
2.1 | com.rapidminer.operator.learner.bayes | SSNaiveBayes
2.2 | com.rapidminer.operator.learner.bayes | SemiSupervisedDistributionModel
3   | com.rapidminer.ExampleSet             | ExampleSetUtils
No  | Method Name | Description
1   | doWork() | Takes input ExampleSets from the input ports, calls the learn method, and supplies output to the output ports.
1   | learn() | Creates an instance of SemiSupervisedDistributionModel. Returns the learned model.
2.2 | SemiSupervisedDistributionModel() | Constructor where all variables to find prior and posterior probabilities are initialized; all methods to perform semi-supervised learning are called from here.
2.2 | update() | Finds the weight of each attribute (feature) in each class.
2.2 | updateDistributionProperties() | Finds posterior probabilities.
2.2 | performPrediction() | Predicts labels of all unlabeled documents using NB. Returns the predicted ExampleSet.
2.2 | updateAfterPrediction() | Updates the weight of each attribute (feature) in each class.
2.2 | updateDistributionPropertiesAfterPrediction() | Updates posterior probabilities.
2.2 | performTest() | Predicts class labels of all documents in the test set and calculates precision, recall, and accuracy of each class, plus average accuracy.
3   | merge() | Merges labeled and predicted unlabeled documents. Returns the merged ExampleSet.
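A hypothetical Java skeleton of how these methods might fit together inside SemiSupervisedDistributionModel; all bodies are elided and the call order merely follows the table above, so the actual extension code may differ:

import com.rapidminer.example.ExampleSet;

// Hypothetical outline only; names follow the table above.
public class SemiSupervisedDistributionModelSketch {

    public SemiSupervisedDistributionModelSketch(ExampleSet labeled, ExampleSet unlabeled) {
        // Variables for prior and posterior probabilities would be initialized here.
        update(labeled);                                      // feature weights per class
        updateDistributionProperties();                       // posterior probabilities
        ExampleSet predicted = performPrediction(unlabeled);  // NB labels for unlabeled docs
        updateAfterPrediction(predicted);                     // refresh feature weights
        updateDistributionPropertiesAfterPrediction();        // refresh posteriors
    }

    void update(ExampleSet labeled) { /* elided */ }
    void updateDistributionProperties() { /* elided */ }
    ExampleSet performPrediction(ExampleSet unlabeled) { /* elided */ return null; }
    void updateAfterPrediction(ExampleSet predicted) { /* elided */ }
    void updateDistributionPropertiesAfterPrediction() { /* elided */ }
    ExampleSet performTest(ExampleSet testSet) { /* elided */ return null; }
}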
Performing Pre-processing in RapidMiner
SemiSupervised Operator
Complete Classification Process
Analysis (Limitation)
The improvement in accuracy with SSL over the supervised NB learner is small, because some unlabeled samples are misclassified by the current classifier when the initial labeled samples are not enough [10], and these misclassified samples are directly considered for training.
Results show improvement in accuracy.

Accuracy by algorithm:
No. of Labeled Documents | NB (Naïve Bayes) | SSL (Basic EM)
20  | 0     | 22.38
40  | 47.91 | 47.91
60  | 20.00 | 44.86
80  | 40.00 | 46.38
100 | 44.44 | 40.00
200 | 48.00 | 47.38
400 | 49.96 | 46.79
600 | 45.23 | 60.28
Comparison with reference papers [2], [6], [5], [4], [3] (NS = not specified):

Dataset used: [2] 1) 20 Newsgroups, 2) WebKB, 3) Reuters; [6] Chinese text; [5] Chinese short text; [4] text documents from public forums on the Chinese internet; [3] Reuters-21578.
Uniform distribution of dataset: [2] NS; [6] Yes; [5] Yes; [4] NS; [3] No.
Training/testing split: [2] NS; [6] NS; [5] NS; [4] 3/4, 1/4; [3] 2/3, 1/3.
Parameters compared for accuracy: [2] 1) no. of labeled documents vs. accuracy, 2) no. of unlabeled documents vs. accuracy; [6] number of iterations vs. Macro F1; [5] number of iterations vs. Macro F1; [4] no. of iterations vs. accuracy; [3] feature selection methods vs. accuracy.
Measures of evaluation used: [2] accuracy; [6] 1) Macro F1, 2) new measure IR = (IS - IL) / IL; [5] NS; [4] Macro F1; [3] macro-average accuracy.
Method used for initial distribution of EM: [2] Naïve Bayesian; [6] Naïve Bayesian; [5] Naïve Bayesian; [4] Random Subspace method; [3] Naïve Bayesian.
Feature selection method used: [2] NS; [6] TF-IDF in each iteration; [5] Chi-square in each iteration; [4] NS; [3] DF * ICIF.
Uses more than one classifier: [2] No; [6] Yes; [5] No; [4] Yes; [3] No.
We have proposed an algorithm in [10] in which we consider
the votes of both Naive Bayes and Support Vector Machine
(SVM): only those unlabeled documents for which NB and
SVM predict the same label are considered in the next
iteration, and the remaining unlabeled documents are
discarded, as sketched below.
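A schematic Java sketch of this voting step; Classifier and Document are hypothetical stand-ins for the extension's actual NB/SVM models and document representation:

import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal types standing in for the real ones.
interface Classifier {
    String predict(Document d);
}

class Document {
    private String label;
    void setLabel(String label) { this.label = label; }
    String getLabel() { return label; }
}

public class AgreementFilter {
    // Keep only unlabeled documents on which NB and SVM vote the same label;
    // agreed documents join the next iteration's training set, the rest are discarded.
    static List<Document> filter(List<Document> unlabeled, Classifier nb, Classifier svm) {
        List<Document> agreed = new ArrayList<Document>();
        for (Document d : unlabeled) {
            String nbLabel = nb.predict(d);
            if (nbLabel.equals(svm.predict(d))) {
                d.setLabel(nbLabel); // both classifiers agree: trust this label
                agreed.add(d);
            }
        }
        return agreed;
    }
}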
This improved algorithm is also implemented in RapidMiner
as an extension. It gives better accuracy as compared to the
standard SSL algorithm for the same dataset [12].
Semi-Supervised Learning with EM can be effectively used to improve the performance of Text Classification when only a limited number of labeled documents is available for training, and it is implemented in RapidMiner as an extension.
Our future goals are to implement in RapidMiner other variants of the SSL algorithm proposed by different researchers [7], in order to overcome the limitations of the classic EM-based SSL algorithm, and to perform experiments on real-time datasets such as SMS and e-mail.
THANK YOU
[1] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3), pages 103-134, 2000.
[2] Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Computer Sciences TR 1530, University of Wisconsin-Madison, 2005.
[3] Wen Han, Xiao Nan-feng, "An Enhanced EM Method of Semi-supervised Classification Based on Naive Bayesian", Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2011.
[4] YueHong Cai, Qian Zhu, "Semi-Supervised Short Text Categorization based on Random Subspace", 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), pages 470-473, 2010.
[5] Xinghua Fan, Zhiyi Guo, "A Semi-supervised Text Classification Method based on Incremental EM Algorithm", WASE International Conference on Information Engineering, pages 211-214, 2010.
[6] Xinghua Fan, Zhiyi Guo, Houfeng Ma, "An Improved EM-based Semi-supervised Learning Method", International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, pages 529-532, August 2009.
[7] Purvi Rekh, Amit Thakkar, Amit Ganatra, "A Survey and Comparative Analysis of Expectation Maximization based Semi-Supervised Text Classification", International Journal of Engineering and Advanced Technology, Vol. 1, Issue 3, pages 141-146, February 2012.
[8] Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Computer Sciences TR 1530, University of Wisconsin-Madison, 2008.
[9] "Approaching Vega: The Final Descent", How to Extend RapidMiner 5.0.
[10] Purvi Rekh, Amit Thakkar, Amit Ganatra, "An Improved Expectation Maximization based Semi-Supervised Text Classification using Naïve Bayes and Support Vector Machine", CiiT International Journal of Artificial Intelligent Systems and Machine Learning, May 2012.
[11] Twenty Newsgroups Data Set: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
[12] Purvi Rekh, Amit Thakkar, "Semi-Supervised Text Classification using Naïve Bayes and Support Vector Machine", Second International Conference on Emerging Research in Computing, Information, Communication and Applications, in press with Elsevier proceedings, 2014.