NB – Naïve Bayes
SSL – Semi-Supervised Learning
TC – Text Classification
EM – Expectation Maximization
SVM – Support Vector Machine
As the number of training documents increases, the accuracy of Text Classification increases, but traditional classifiers need labeled data for training.
Labeled instances, however, are often difficult, expensive, or time-consuming to obtain, as they require the effort of experienced human annotators.
Meanwhile, unlabeled data may be relatively easy to collect.
Semi-Supervised Learning makes use of both labeled and unlabeled documents for classification, so how to apply it when only a few labeled documents are available is an open research problem.
In the field of machine learning, semi-supervised learning
(SSL) occupies the middle ground between supervised
learning (in which all training examples are labeled) and
unsupervised learning (in which no labeled data are given).
Interest in SSL has increased in recent years, particularly
because of application domains in which unlabeled data are
plentiful, such as images, text, and bioinformatics.
Original sentence: India and China are joining WTO.
After tokenization: { India and china are joining WTO }
After stop-word removal: { India china joining WTO }
After stemming: { India china join WTO }
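A minimal Java sketch of this preprocessing pipeline; the stop-word list and the suffix-stripping rule below are illustrative assumptions, not the exact resources used in the experiments:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {
    // Illustrative stop-word list; a real system would use a much larger one.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "are", "is", "the", "of"));

    public static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<String>();
        // Tokenization: lower-case the text and split on non-letter characters.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue; // stop-word removal
            }
            terms.add(stem(token)); // stemming
        }
        return terms;
    }

    // Toy stemmer: strips a trailing "ing" ("joining" -> "join"); a real
    // system would use a full stemmer such as Porter's.
    private static String stem(String token) {
        return token.endsWith("ing") ? token.substring(0, token.length() - 3) : token;
    }

    public static void main(String[] args) {
        // Prints [india, china, join, wto]
        System.out.println(preprocess("India and China are joining WTO."));
    }
}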
1. Term Frequency: TF_ij = n_ij / d_i, where n_ij is the number of occurrences of term j in document i and d_i is the total number of terms in document i.
TF of ‘chinese’: d1 = 2/3, d2 = 2/3, d3 = 1/2, d4 = 1/3
2. Document Frequency: DF_j = n_j / n, where n_j is the number of documents containing term j and n is the total number of documents.
DF of ‘chinese’ = 4/4 = 1
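A small Java sketch of these two formulas; the four example documents below are an assumption chosen so that they reproduce the TF and DF values shown above:

import java.util.Arrays;
import java.util.List;

public class TermStats {
    // TF_ij = n_ij / d_i : occurrences of the term in the doc over doc length.
    static double tf(List<String> doc, String term) {
        int count = 0;
        for (String w : doc) {
            if (w.equals(term)) count++;
        }
        return (double) count / doc.size();
    }

    // DF_j = n_j / n : fraction of documents that contain the term.
    static double df(List<List<String>> docs, String term) {
        int containing = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) containing++;
        }
        return (double) containing / docs.size();
    }

    public static void main(String[] args) {
        // Hypothetical documents consistent with the TF values on the slide.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("chinese", "beijing", "chinese"),   // d1: TF = 2/3
                Arrays.asList("chinese", "chinese", "shanghai"),  // d2: TF = 2/3
                Arrays.asList("chinese", "macao"),                // d3: TF = 1/2
                Arrays.asList("tokyo", "japan", "chinese"));      // d4: TF = 1/3
        System.out.println("TF in d1: " + tf(docs.get(0), "chinese")); // 0.666...
        System.out.println("DF: " + df(docs, "chinese"));              // 1.0
    }
}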
Precision = TP / (TP + FP) = 1/2
Recall = TP / (TP + FN) = 1 / (1 + 0) = 1
F1 = (2 * Precision * Recall / (Precision + Recall)) * 100%
F1 = 0.667 * 100% ≈ 66.7%
Doc | Words in Doc             | Actual Label | Predicted Label | Actual = Chinese? | Predicted = Chinese?
d10 | India India Delhi Mumbai | India        | Chinese         | N                 | Y
d11 | Chinese Beijing          | Chinese      | Chinese         | Y                 | Y

Confusion matrix for class ‘Chinese’:
            | Actual Y | Actual N
Predicted Y | TP = 1   | FP = 1
Predicted N | FN = 0   | TN = 0
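A minimal sketch of the evaluation formulas from the previous slide applied to this confusion matrix:

public class Evaluation {
    public static void main(String[] args) {
        // Confusion matrix for class "Chinese" from the table above.
        int tp = 1, fp = 1, fn = 0;

        double precision = (double) tp / (tp + fp);                 // 1/2 = 0.5
        double recall = (double) tp / (tp + fn);                    // 1/1 = 1.0
        double f1 = 2 * precision * recall / (precision + recall);  // ~0.667

        // Prints: Precision=0.50 Recall=1.00 F1=66.7%
        System.out.printf("Precision=%.2f Recall=%.2f F1=%.1f%%%n",
                precision, recall, f1 * 100);
    }
}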
Example: Training Set → (Semi-Supervised Learning) → Training Set after SSL, plus a separate Test Set.

Training Set (Doc Id, Class):
Labeled documents: D1 India, D2 China
Unlabeled documents: D3 ?, D4 ?, D5 ?, D6 ?

Training Set after SSL (Doc Id, Class):
D1 India, D2 China, D3 India, D4 China, D5 India, D6 India

Test Set (Doc Id, Class):
D7 India, D8 China, D9 China, D10 India, D11 India, D12 India
SSL methods:
Low-Density Separation (SVM)
Graph-based Methods
Co-Training (Multi-View Approach)
Generative Methods
In probability and statistics, a generative model is a model for randomly generating observable data, typically given some hidden parameters.
Generative models are used in machine learning for either modeling data directly (i.e., modeling observed draws from a probability density function), or as an intermediate step to forming a conditional probability density function. A conditional distribution can be formed from a generative model through the use of Bayes' rule.
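For example, for a document d and class c, the conditional distribution follows from the generative model's joint distribution via Bayes' rule:

P(c | d) = P(d | c) P(c) / P(d)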
In statistics, an expectation–maximization (EM)
algorithm is an iterative method for finding
maximum likelihood or maximum a posteriori
(MAP) estimates of parameters in statistical models.
Widely used for learning in the presence of
unobserved variables, e.g., missing features, class
labels.
Algorithm
N = number of labeled documents, U = number of unlabeled documents
Inputs: Collections Dl of labeled documents and Du of unlabeled documents.
Method:
Build an initial naive Bayes classifier, θ̂, from the labeled documents Dl only. Use
maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D | θ) P(θ).
Loop while the classifier parameters improve, as measured by l_c(θ | D, z)
(the complete log probability of the labeled and unlabeled data):
(E-step) Use the current classifier, θ̂, to estimate the component membership of each
unlabeled document, i.e., the probability that each mixture component (and class)
generated each document, P(c_j | d_i; θ̂).
(M-step) Re-estimate the classifier, θ̂, given the estimated component membership of
each document. Use maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D | θ) P(θ).
Output: A classifier, θ̂, that takes an unlabeled document and predicts a class label.
The algorithm first trains a classifier with only the available
labeled documents, and assigns probabilistically-weighted
class labels to each unlabeled document by using the
classifier to calculate their expectation.
It then trains a new classifier using all the documents, both
the originally labeled and the formerly unlabeled, and
iterates.
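As a concrete illustration of this loop, here is a condensed Java sketch of EM over a multinomial naive Bayes model. It is a toy, stand-alone version: the word-count representation, Laplace smoothing, fixed iteration count, and all names are illustrative assumptions, not the RapidMiner extension's actual code (the slides' algorithm loops while the complete log likelihood improves):

public class EmNaiveBayes {

    private double[] prior;      // P(c), with Laplace smoothing
    private double[][] condProb; // P(word | c), with Laplace smoothing

    // labeled[i]: word-count vector of labeled doc i over a fixed vocabulary;
    // labels[i]: its class index; unlabeled[u]: word counts of unlabeled doc u.
    public void train(int[][] labeled, int[] labels, int[][] unlabeled,
                      int numClasses, int iterations) {
        int vocab = labeled[0].length;
        // Soft class memberships of unlabeled docs; all zero before the first
        // E-step, so the initial M-step uses the labeled documents only.
        double[][] gamma = new double[unlabeled.length][numClasses];
        mStep(labeled, labels, unlabeled, gamma, numClasses, vocab);
        for (int it = 0; it < iterations; it++) {
            for (int u = 0; u < unlabeled.length; u++) {
                gamma[u] = posterior(unlabeled[u]); // E-step: P(c | d_u; theta)
            }
            mStep(labeled, labels, unlabeled, gamma, numClasses, vocab); // M-step
        }
    }

    private void mStep(int[][] labeled, int[] labels, int[][] unlabeled,
                       double[][] gamma, int numClasses, int vocab) {
        prior = new double[numClasses];
        condProb = new double[numClasses][vocab];
        for (int i = 0; i < labeled.length; i++) {   // labeled docs: weight 1
            prior[labels[i]] += 1.0;
            for (int w = 0; w < vocab; w++) condProb[labels[i]][w] += labeled[i][w];
        }
        for (int u = 0; u < unlabeled.length; u++) { // unlabeled docs: soft weights
            for (int c = 0; c < numClasses; c++) {
                prior[c] += gamma[u][c];
                for (int w = 0; w < vocab; w++) condProb[c][w] += gamma[u][c] * unlabeled[u][w];
            }
        }
        double totalWeight = 0;
        for (int c = 0; c < numClasses; c++) totalWeight += prior[c];
        for (int c = 0; c < numClasses; c++) {
            double totalWords = 0;
            for (int w = 0; w < vocab; w++) totalWords += condProb[c][w];
            for (int w = 0; w < vocab; w++) {
                condProb[c][w] = (condProb[c][w] + 1.0) / (totalWords + vocab); // Laplace
            }
            prior[c] = (prior[c] + 1.0) / (totalWeight + numClasses);           // Laplace
        }
    }

    // P(c | d) via Bayes' rule, computed in log space for numerical stability.
    public double[] posterior(int[] doc) {
        int numClasses = prior.length;
        double[] logP = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            logP[c] = Math.log(prior[c]);
            for (int w = 0; w < doc.length; w++) logP[c] += doc[w] * Math.log(condProb[c][w]);
            if (logP[c] > max) max = logP[c];
        }
        double[] p = new double[numClasses];
        double sum = 0;
        for (int c = 0; c < numClasses; c++) { p[c] = Math.exp(logP[c] - max); sum += p[c]; }
        for (int c = 0; c < numClasses; c++) p[c] /= sum;
        return p;
    }
}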
No | Tool, Technology, Language | Version used
1  | RapidMiner                 | 5.1.001
2  | Eclipse                    | Ganymede
3  | Java                       | JDK 1.6
Dataset Detail
Training – Testing split                   | Class 1: Religion | Class 2: Politics
Training Set: No. of labeled documents     | 10 to 600         | 10 to 600
Training Set: No. of unlabeled documents   | 500               | 500
Testing Set: No. of documents for testing  | 100               | 100
Implementation Setup
We have used the 20 Newsgroups dataset [11].
1) Creating an operator:

package com.rapidminer.operator.learner;

import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;

public class SemiSupervisedLearner extends Operator {
    public SemiSupervisedLearner(OperatorDescription description) {
        super(description);
    }
}
2) Adding ports to the operator for input and output:

private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
private OutputPort modelOutput = getOutputPorts().createPort("model");
3) Writing the logic implementing the Semi-Supervised algorithm:

public void doWork() throws OperatorException {
    ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
    ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
    ExampleSet testExampleSet = testExampleSetInput.getData();
    /* logic of the algorithm, i.e., call methods of the Semi-Supervised
       Learning algorithm; this produces model and exampleSet */
    modelOutput.deliver(model);
    exampleSetOutput.deliver(exampleSet);
}
package com.rapidminer.operator.learner;

import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;

public class SemiSupervisedLearner extends Operator {

    private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
    private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
    private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
    private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
    private OutputPort modelOutput = getOutputPorts().createPort("model");

    public SemiSupervisedLearner(OperatorDescription description) {
        super(description);
    }

    public void doWork() throws OperatorException {
        ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
        ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
        ExampleSet testExampleSet = testExampleSetInput.getData();
        // logic of the algorithm, i.e., call methods of the Semi-Supervised
        // Learning algorithm; this produces model and exampleSet
        modelOutput.deliver(model);
        exampleSetOutput.deliver(exampleSet);
    }
}
No  | Package                               | Classes
1   | com.rapidminer.operator.learner       | SemiSupervisedAbstractLearner
2.1 | com.rapidminer.operator.learner.bayes | SSNaiveBayes
2.2 | com.rapidminer.operator.learner.bayes | SemiSupervisedDistributionModel
3   | com.rapidminer.ExampleSet             | ExampleSetUtils
No  | Method Name | Description
1   | doWork() | Takes input ExampleSets from the input ports, calls the learn method, and supplies output to the output ports.
1   | learn() | Creates an instance of SemiSupervisedDistributionModel. Returns the learned model.
2.2 | SemiSupervisedDistributionModel() | Constructor where all variables to find prior and posterior probabilities are initialized; all methods to perform semi-supervised learning are called from here.
2.2 | update() | Finds the weight of each attribute (feature) in each class.
2.2 | updateDistributionProperties() | Finds posterior probabilities.
2.2 | performPrediction() | Predicts labels of all unlabeled documents using NB. Returns the predicted ExampleSet.
2.2 | updateAfterPrediction() | Updates the weight of each attribute (feature) in each class.
2.2 | updateDistributionPropertiesAfterPrediction() | Updates posterior probabilities.
2.2 | performTest() | Predicts class labels of all documents in the test set and calculates precision, recall, and accuracy of each class, plus average accuracy.
3   | merge() | Merges labeled and predicted unlabeled documents. Returns the merged ExampleSet.
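A hypothetical Java skeleton of how these methods might fit together inside SemiSupervisedDistributionModel; all bodies are elided and the call order merely follows the table above, so the actual extension code may differ:

import com.rapidminer.example.ExampleSet;

// Hypothetical outline only; names follow the table above.
public class SemiSupervisedDistributionModelSketch {

    public SemiSupervisedDistributionModelSketch(ExampleSet labeled, ExampleSet unlabeled) {
        // Variables for prior and posterior probabilities would be initialized here.
        update(labeled);                                      // feature weights per class
        updateDistributionProperties();                       // posterior probabilities
        ExampleSet predicted = performPrediction(unlabeled);  // NB labels for unlabeled docs
        updateAfterPrediction(predicted);                     // refresh feature weights
        updateDistributionPropertiesAfterPrediction();        // refresh posteriors
    }

    void update(ExampleSet labeled) { /* elided */ }
    void updateDistributionProperties() { /* elided */ }
    ExampleSet performPrediction(ExampleSet unlabeled) { /* elided */ return null; }
    void updateAfterPrediction(ExampleSet predicted) { /* elided */ }
    void updateDistributionPropertiesAfterPrediction() { /* elided */ }
    ExampleSet performTest(ExampleSet testSet) { /* elided */ return null; }
}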
Performing Pre-processing in RapidMiner
SemiSupervised Operator
Complete Classification Process
Analysis (Limitation)
The improvement in accuracy with SSL over the supervised NB learner is small, because some unlabeled samples are misclassified by the current classifier when the initial labeled samples are not enough [10], and these misclassified samples are directly considered for training.
Results show improvement in accuracy.

Accuracy by algorithm:
No. of Labeled Documents | NB (Naïve Bayes) | SSL (Basic EM)
20  | 0     | 22.38
40  | 47.91 | 47.91
60  | 20.00 | 44.86
80  | 40.00 | 46.38
100 | 44.44 | 40.00
200 | 48.00 | 47.38
400 | 49.96 | 46.79
600 | 45.23 | 60.28
Comparison with reference papers [2], [6], [5], [4], [3] (NS = not specified):

Dataset used: [2] 1) 20 Newsgroups, 2) WebKB, 3) Reuters; [6] Chinese text; [5] Chinese short text; [4] text documents from public forums on the Chinese internet; [3] Reuters-21578.
Uniform distribution of dataset: [2] NS; [6] Yes; [5] Yes; [4] NS; [3] No.
Training/testing split: [2] NS; [6] NS; [5] NS; [4] 3/4, 1/4; [3] 2/3, 1/3.
Parameters compared for accuracy: [2] 1) no. of labeled documents vs. accuracy, 2) no. of unlabeled documents vs. accuracy; [6] number of iterations vs. Macro F1; [5] number of iterations vs. Macro F1; [4] no. of iterations vs. accuracy; [3] feature selection methods vs. accuracy.
Measures of evaluation used: [2] accuracy; [6] 1) Macro F1, 2) new measure IR = (IS - IL) / IL; [5] NS; [4] Macro F1; [3] macro-average accuracy.
Method used for initial distribution of EM: [2] Naïve Bayesian; [6] Naïve Bayesian; [5] Naïve Bayesian; [4] Random Subspace method; [3] Naïve Bayesian.
Feature selection method used: [2] NS; [6] TF-IDF in each iteration; [5] Chi-square in each iteration; [4] NS; [3] DF * ICIF.
Uses more than one classifier: [2] No; [6] Yes; [5] No; [4] Yes; [3] No.
We have proposed an algorithm in [10] in which we consider
the votes of both Naive Bayes and Support Vector Machine
(SVM): only those unlabeled documents for which NB and
SVM predict the same label are considered in the next
iteration, and the remaining unlabeled documents are
discarded, as sketched below.
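A schematic Java sketch of this voting step; Classifier and Document are hypothetical stand-ins for the extension's actual NB/SVM models and document representation:

import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal types standing in for the real ones.
interface Classifier {
    String predict(Document d);
}

class Document {
    private String label;
    void setLabel(String label) { this.label = label; }
    String getLabel() { return label; }
}

public class AgreementFilter {
    // Keep only unlabeled documents on which NB and SVM vote the same label;
    // agreed documents join the next iteration's training set, the rest are discarded.
    static List<Document> filter(List<Document> unlabeled, Classifier nb, Classifier svm) {
        List<Document> agreed = new ArrayList<Document>();
        for (Document d : unlabeled) {
            String nbLabel = nb.predict(d);
            if (nbLabel.equals(svm.predict(d))) {
                d.setLabel(nbLabel); // both classifiers agree: trust this label
                agreed.add(d);
            }
        }
        return agreed;
    }
}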
This improved algorithm is also implemented in RapidMiner
as an extension. It gives better accuracy as compared to the
standard SSL algorithm for the same dataset [12].
Semi-Supervised Learning with EM can be effectively used to improve the performance of Text Classification when only a limited number of labeled documents is available for training, and it is implemented in RapidMiner as an extension.
Our future goals are to implement in RapidMiner other variants of the SSL algorithm proposed by different researchers [7], in order to overcome the limitations of the classic EM-based SSL algorithm, and to perform experiments on real-time datasets such as SMS and e-mail.
THANK YOU
[1] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3), pages 103-134, 2000.
[2] Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Computer Sciences TR 1530, University of Wisconsin-Madison, 2005.
[3] Wen Han, Xiao Nan-feng, "An Enhanced EM Method of Semi-supervised Classification Based on Naive Bayesian", Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2011.
[4] YueHong Cai, Qian Zhu, "Semi-Supervised Short Text Categorization based on Random Subspace", 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), pages 470-473, 2010.
[5] Xinghua Fan, Zhiyi Guo, "A Semi-supervised Text Classification Method based on Incremental EM Algorithm", WASE International Conference on Information Engineering, pages 211-214, 2010.
[6] Xinghua Fan, Zhiyi Guo, Houfeng Ma, "An Improved EM-based Semi-supervised Learning Method", International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, pages 529-532, August 2009.
[7] Purvi Rekh, Amit Thakkar, Amit Ganatra, "A Survey and Comparative Analysis of Expectation Maximization based Semi-Supervised Text Classification", International Journal of Engineering and Advanced Technology, Vol. 1, Issue 3, pages 141-146, February 2012.
[8] Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Computer Sciences TR 1530, University of Wisconsin-Madison, 2008.
[9] "Approaching Vega: The Final Descent", How to Extend RapidMiner 5.0.
[10] Purvi Rekh, Amit Thakkar, Amit Ganatra, "An Improved Expectation Maximization based Semi-Supervised Text Classification using Naïve Bayes and Support Vector Machine", CiiT International Journal of Artificial Intelligent Systems and Machine Learning, May 2012.
[11] Twenty Newsgroups Data Set: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
[12] Purvi Rekh, Amit Thakkar, "Semi-Supervised Text Classification using Naïve Bayes and Support Vector Machine", Second International Conference on Emerging Research in Computing, Information, Communication and Applications, in press with Elsevier proceedings, 2014.