RM World 2014: Semi supervised text classification operator

Dec 05, 2014

Transcript
Page 1: RM World 2014: Semi supervised text classification operator


Page 2: RM World 2014: Semi supervised text classification operator

NB – Naïve Bayes

SSL – Semi-Supervised Learning

TC – Text Classification

EM – Expectation Maximization

SVM – Support Vector Machine


Page 3: RM World 2014: Semi supervised text classification operator

As the number of training documents increases, the accuracy of text classification increases, but traditional classifiers require labeled data for training.

Labeled instances, however, are often difficult, expensive, or time-consuming to obtain, as they require the effort of experienced human annotators. Unlabeled data, meanwhile, may be relatively easy to collect.

Semi-supervised learning makes use of both labeled and unlabeled documents for classification. How to use semi-supervised learning when only a few labeled documents are available is therefore an open research problem.

Page 4: RM World 2014: Semi supervised text classification operator

In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no labeled data are given).

Interest in SSL has increased in recent years, particularly because of application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics.

Page 5: RM World 2014: Semi supervised text classification operator
Page 6: RM World 2014: Semi supervised text classification operator
Page 7: RM World 2014: Semi supervised text classification operator

Page 8: RM World 2014: Semi supervised text classification operator

Page 9: RM World 2014: Semi supervised text classification operator

Page 10: RM World 2014: Semi supervised text classification operator

Pre-processing example:

Original text: India and China are joining WTO.

After tokenization: { India and China are joining WTO }

After stop-word removal: { India China joining WTO }

After stemming: { India China join WTO }
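A minimal Java sketch of these three steps. The stop-word list and the "-ing" stemming rule are toy simplifications chosen for this one sentence; the actual process uses RapidMiner's text-processing operators:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PreprocessSketch {
    // Toy stop-word list; real pipelines use a much larger one.
    private static final Set<String> STOP_WORDS = Set.of("and", "are", "is", "the");

    public static void main(String[] args) {
        String text = "India and China are joining WTO.";

        // 1. Tokenization: split on whitespace and strip punctuation.
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            String clean = t.replaceAll("[^A-Za-z]", "");
            if (!clean.isEmpty()) tokens.add(clean);
        }
        System.out.println(tokens);   // [India, and, China, are, joining, WTO]

        // 2. Stop-word removal.
        tokens.removeIf(t -> STOP_WORDS.contains(t.toLowerCase()));
        System.out.println(tokens);   // [India, China, joining, WTO]

        // 3. Stemming: crude "-ing" rule, for illustration only.
        List<String> stemmed = new ArrayList<>();
        for (String t : tokens) {
            stemmed.add(t.endsWith("ing") ? t.substring(0, t.length() - 3) : t);
        }
        System.out.println(stemmed);  // [India, China, join, WTO]
    }
}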


Page 11: RM World 2014: Semi supervised text classification operator

1. Term Frequency: TF_ij = n_ij / |d_i|, where n_ij is the number of occurrences of term j in document i and |d_i| is the total number of terms in document i.

TF of 'chinese' in d1 = 2/3, in d2 = 2/3, in d3 = 1/2, in d4 = 1/3.

2. Document Frequency: DF_j = n_j / n, where n_j is the number of documents containing term j and n is the total number of documents.

DF of 'chinese' = 4/4 = 1.
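A short sketch computing TF and DF as defined above. The four documents here are assumed for illustration only, chosen so that the numbers match the slide; the actual document contents are not given on this page:

import java.util.List;

public class TfDfSketch {
    public static void main(String[] args) {
        // Hypothetical documents d1..d4, one token list per document.
        List<List<String>> docs = List.of(
                List.of("chinese", "beijing", "chinese"),    // d1
                List.of("chinese", "chinese", "shanghai"),   // d2
                List.of("chinese", "macao"),                 // d3
                List.of("tokyo", "japan", "chinese"));       // d4
        String term = "chinese";

        // TF_ij = n_ij / |d_i|: relative frequency of term j in document i.
        for (int i = 0; i < docs.size(); i++) {
            List<String> d = docs.get(i);
            long n = d.stream().filter(term::equals).count();
            System.out.printf("TF of '%s' in d%d = %d/%d%n", term, i + 1, n, d.size());
        }

        // DF_j = n_j / n: fraction of documents containing term j.
        long nj = docs.stream().filter(d -> d.contains(term)).count();
        System.out.printf("DF of '%s' = %d/%d%n", term, nj, docs.size());
    }
}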


Page 12: RM World 2014: Semi supervised text classification operator

Page 13: RM World 2014: Semi supervised text classification operator

Page 14: RM World 2014: Semi supervised text classification operator

Evaluation example:

Doc | Words in Doc                           | Actual Label | Predicted Label
d10 | India India India Delhi Mumbai Chinese | N            | Y
d11 | Chinese Beijing Chinese                | Y            | Y

Confusion matrix (rows: predicted, columns: actual):

            Actual Y | Actual N
Predicted Y | TP = 1 | FP = 1
Predicted N | FN = 0 | TN = 0

Precision = TP / (TP + FP) = 1/2

Recall = TP / (TP + FN) = 1 / (1 + 0) = 1

F1 = (2 x Precision x Recall / (Precision + Recall)) x 100% = 0.667 x 100% = 66.7%
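The same computation as a tiny sketch, using the confusion-matrix counts from the table above:

public class F1Sketch {
    public static void main(String[] args) {
        int tp = 1, fp = 1, fn = 0;                  // counts from the example above
        double precision = (double) tp / (tp + fp); // 0.5
        double recall    = (double) tp / (tp + fn); // 1.0
        double f1 = 2 * precision * recall / (precision + recall);
        System.out.printf("P = %.2f, R = %.2f, F1 = %.1f%%%n",
                precision, recall, f1 * 100);       // P = 0.50, R = 1.00, F1 = 66.7%
    }
}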

Page 15: RM World 2014: Semi supervised text classification operator

Semi-Supervised Learning turns the training set into the training set after SSL:

                      Training set     Training set after SSL   Test set
                      Doc Id | Class   Doc Id | Class           Doc Id | Class
Labeled documents     D1     | India   D1     | India           D7     | India
                      D2     | China   D2     | China           D8     | China
Unlabeled documents   D3     | ?       D3     | India           D9     | China
                      D4     | ?       D4     | China           D10    | India
                      D5     | ?       D5     | India           D11    | India
                      D6     | ?       D6     | India           D12    | India

Page 16: RM World 2014: Semi supervised text classification operator

Low-Density Separation (SVM)

Graph-based Methods

Co-Training (Multi-View Approach)

Generative Methods

Page 17: RM World 2014: Semi supervised text classification operator

In probability and statistics, a generative model is a model for randomly generating observable data, typically given some hidden parameters.

Generative models are used in machine learning either for modeling data directly (i.e., modeling observed draws from a probability density function) or as an intermediate step to forming a conditional probability density function. A conditional distribution can be formed from a generative model through the use of Bayes' rule.
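For text classification, for example, the conditional class distribution follows from the generative model P(d | c) and the class prior P(c) via Bayes' rule:

P(c | d) = P(d | c) P(c) / P(d)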


Page 18: RM World 2014: Semi supervised text classification operator

In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models.

It is widely used for learning in the presence of unobserved variables, e.g. missing features or class labels.

Page 19: RM World 2014: Semi supervised text classification operator


Page 20: RM World 2014: Semi supervised text classification operator

Algorithm

N = number of labeled documents, U = number of unlabeled documents.

Inputs: Collections D_l of labeled documents and D_u of unlabeled documents.

Method:

Build an initial naive Bayes classifier, θ̂, from the labeled documents D_l only. Use maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D | θ) P(θ).

Loop while the classifier parameters improve, as measured by the change in l_c(θ | D; z), the complete log probability of the labeled and unlabeled data:

(E-step) Use the current classifier, θ̂, to estimate the component membership of each unlabeled document, i.e. the probability that each mixture component (and class) generated each document, P(c_j | d_i; θ̂).

(M-step) Re-estimate the classifier, θ̂, given the estimated component membership of each document. Use maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D | θ) P(θ).

Output: A classifier, θ̂, that takes an unlabeled document and predicts a class label.

Page 21: RM World 2014: Semi supervised text classification operator

The algorithm first trains a classifier with only the available labeled documents, and assigns probabilistically weighted class labels to each unlabeled document by using the classifier to calculate their expectation.

It then trains a new classifier using all the documents, both the originally labeled and the formerly unlabeled, and iterates.
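A minimal sketch of this loop for multinomial naive Bayes on bag-of-words documents. All names here are illustrative, a fixed iteration count stands in for the convergence test, and Laplace smoothing stands in for the full MAP estimate; this is not the operator's actual code:

import java.util.Arrays;

public class EmNaiveBayesSketch {
    final int numClasses, vocabSize;
    double[] logPrior;       // log P(c)
    double[][] logCondProb;  // log P(w | c)

    EmNaiveBayesSketch(int numClasses, int vocabSize) {
        this.numClasses = numClasses;
        this.vocabSize = vocabSize;
    }

    // M-step: (re-)estimate parameters from soft labels gamma[i][c],
    // with Laplace smoothing in place of the full MAP estimate.
    void mStep(int[][] docs, double[][] gamma) {
        logPrior = new double[numClasses];
        logCondProb = new double[numClasses][vocabSize];
        for (int c = 0; c < numClasses; c++) {
            double docMass = 0, wordMass = 0;
            double[] counts = new double[vocabSize];
            for (int i = 0; i < docs.length; i++) {
                docMass += gamma[i][c];
                for (int w : docs[i]) { counts[w] += gamma[i][c]; wordMass += gamma[i][c]; }
            }
            logPrior[c] = Math.log((1 + docMass) / (numClasses + docs.length));
            for (int w = 0; w < vocabSize; w++)
                logCondProb[c][w] = Math.log((1 + counts[w]) / (vocabSize + wordMass));
        }
    }

    // E-step for one document: P(c | d) via Bayes' rule, computed in log space.
    double[] eStep(int[] doc) {
        double[] logp = logPrior.clone();
        for (int c = 0; c < numClasses; c++)
            for (int w : doc) logp[c] += logCondProb[c][w];
        double max = Arrays.stream(logp).max().getAsDouble(), z = 0;
        double[] post = new double[numClasses];
        for (int c = 0; c < numClasses; c++) z += post[c] = Math.exp(logp[c] - max);
        for (int c = 0; c < numClasses; c++) post[c] /= z;
        return post;
    }

    // Train on labeled documents first, then iterate E and M steps over all documents.
    void train(int[][] labeled, int[] labels, int[][] unlabeled, int iterations) {
        int n = labeled.length, u = unlabeled.length;
        int[][] all = new int[n + u][];
        double[][] gamma = new double[n + u][numClasses];
        for (int i = 0; i < n; i++) { all[i] = labeled[i]; gamma[i][labels[i]] = 1.0; }
        for (int i = 0; i < u; i++) all[n + i] = unlabeled[i];
        mStep(labeled, Arrays.copyOf(gamma, n));  // initial classifier: labeled data only
        for (int t = 0; t < iterations; t++) {    // the full algorithm stops when the log likelihood stops improving
            for (int i = n; i < n + u; i++) gamma[i] = eStep(all[i]); // E-step: soft-label unlabeled docs
            mStep(all, gamma);                                        // M-step: retrain on all docs
        }
    }
}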


Page 22: RM World 2014: Semi supervised text classification operator

Implementation Setup

No | Tool, Technology, Language | Version used
1  | RapidMiner                 | 5.1.001
2  | Eclipse                    | Ganymede
3  | Java                       | JDK 1.6

Dataset Detail

We have used the 20 Newsgroups dataset [11].

Training-Testing split                    | Class 1: Religion | Class 2: Politics
Training set: No of labeled documents     | 10 to 600         | 10 to 600
Training set: No of unlabeled documents   | 500               | 500
Testing set: No of documents for testing  | 100               | 100

Page 23: RM World 2014: Semi supervised text classification operator

1) Creating an operator:

package com.rapidminer.operator.learner;

import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;

public class SemiSupervisedLearner extends Operator {

    public SemiSupervisedLearner(OperatorDescription description) {
        super(description);
    }
}

Page 24: RM World 2014: Semi supervised text classification operator

2) Adding ports to the operator for input and output:

private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
private OutputPort modelOutput = getOutputPorts().createPort("model");

Page 25: RM World 2014: Semi supervised text classification operator

3) Writing the logic implementing the semi-supervised algorithm:

public void doWork() throws OperatorException {
    ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
    ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
    ExampleSet testExampleSet = testExampleSetInput.getData();

    /* logic of the algorithm, i.e. call the methods of the
       semi-supervised learning algorithm to produce model and exampleSet */

    modelOutput.deliver(model);
    exampleSetOutput.deliver(exampleSet);
}

Page 26: RM World 2014: Semi supervised text classification operator

package com.rapidminer.operator.learner;

import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.Model;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;

public class SemiSupervisedLearner extends Operator {

    private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
    private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
    private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
    private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
    private OutputPort modelOutput = getOutputPorts().createPort("model");

    public SemiSupervisedLearner(OperatorDescription description) {
        super(description);
    }

    public void doWork() throws OperatorException {
        ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
        ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
        ExampleSet testExampleSet = testExampleSetInput.getData();

        // logic of the algorithm: call the methods of the semi-supervised
        // learning algorithm here to produce the learned model and the
        // merged (labeled + predicted unlabeled) example set
        Model model = null;           // placeholder for the learned model
        ExampleSet exampleSet = null; // placeholder for the merged example set

        modelOutput.deliver(model);
        exampleSetOutput.deliver(exampleSet);
    }
}
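Beyond the Java class itself, a RapidMiner 5 extension also has to register the new operator in its operator-definition XML file so that it shows up in the operator tree; see [9] for the details of that step.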

Page 27: RM World 2014: Semi supervised text classification operator

No  | Package                               | Classes
1   | com.rapidminer.operator.learner       | SemiSupervisedAbstractLearner
2.1 | com.rapidminer.operator.learner.bayes | SSNaiveBayes
2.2 | com.rapidminer.operator.learner.bayes | SemiSupervisedDistributionModel
3   | com.rapidminer.ExampleSet             | ExampleSetUtils

No  | Method Name | Description
1   | doWork() | Takes input ExampleSets from the input ports, calls the learn method, and supplies output to the output ports.
1   | learn() | Creates an instance of SemiSupervisedDistributionModel. Returns the learned model.
2.2 | SemiSupervisedDistributionModel() | Constructor in which all variables for finding the prior and posterior probabilities are initialized; all methods performing semi-supervised learning are called from here.
2.2 | update() | Finds the weight of each attribute (feature) in each class.
2.2 | updateDistributionProperties() | Finds the posterior probabilities.
2.2 | performPrediction() | Predicts labels of all unlabeled documents using NB. Returns the predicted ExampleSet.
2.2 | updateAfterPrediction() | Updates the weight of each attribute (feature) in each class.
2.2 | updateDistributionPropertiesAfterPrediction() | Updates the posterior probabilities.
2.2 | performTest() | Predicts class labels of all documents in the test set and calculates precision, recall, and accuracy for each class, and the average accuracy.
3   | merge() | Merges labeled and predicted unlabeled documents. Returns the merged ExampleSet.

Page 28: RM World 2014: Semi supervised text classification operator


Page 29: RM World 2014: Semi supervised text classification operator

Performing Pre-processing in RapidMiner


Page 30: RM World 2014: Semi supervised text classification operator

SemiSupervised Operator: Complete Classification Process

Page 31: RM World 2014: Semi supervised text classification operator

Analysis (Limitation)

The improvement in accuracy with SSL over the supervised NB learner is limited because, when the initial labeled samples are not enough, some unlabeled samples are misclassified by the current classifier [10], and these misclassified samples are directly used for training.

Results show improvement in accuracy:

No of Labeled Documents | NB (Naïve Bayes) | SSL (Basic EM)
 20 |  0.00 | 22.38
 40 | 47.91 | 47.91
 60 | 20.00 | 44.86
 80 | 40.00 | 46.38
100 | 44.44 | 40.00
200 | 48.00 | 47.38
400 | 49.96 | 46.79
600 | 45.23 | 60.28

Page 32: RM World 2014: Semi supervised text classification operator

Reference papers compared:

Criterion | [2] | [6] | [5] | [4] | [3]
Dataset used | 1) 20 Newsgroups 2) WebKB 3) Reuters | Chinese text | Chinese short text | Text documents from public forums on the Chinese internet | Reuters 21578
Uniform distribution of dataset | NS | Yes | Yes | NS | No
Training/testing split | NS | NS | NS | 3/4, 1/4 | 2/3, 1/3
Parameters compared for accuracy | 1) No of labeled documents vs. accuracy 2) No of unlabeled documents vs. accuracy | Number of iterations vs. Macro F1 | Number of iterations vs. Macro F1 | No of iterations vs. accuracy | Feature selection methods vs. accuracy
Measures of evaluation used | Accuracy | 1) Macro F1 2) New measure, IR = (IS - IL) / IL | NS | Macro F1 | Macro-average accuracy
Method used for initial distribution of EM | Naïve Bayesian | Naïve Bayesian | Naïve Bayesian | Random Subspace method | Naïve Bayesian
Feature selection method used | NS | TF-IDF in each iteration | Chi-square in each iteration | NS | DF * ICIF
Uses more than one classifier | No | Yes | No | Yes | No

(NS = not specified)

Page 33: RM World 2014: Semi supervised text classification operator

We have proposed an algorithm in [10] in which we consider the votes of both Naive Bayes and the Support Vector Machine (SVM): only those unlabeled documents for which both NB and SVM predict the same label are considered in the next iteration, and the remaining unlabeled documents are discarded.
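A minimal sketch of this agreement filter, with a stand-in Classifier interface; all names are illustrative, not the extension's actual code:

import java.util.ArrayList;
import java.util.List;

public class AgreementFilterSketch {

    interface Classifier { String predict(int[] doc); }

    // Keep an unlabeled document only if NB and SVM predict the same label.
    static List<int[]> filterByAgreement(List<int[]> unlabeled,
                                         Classifier nb, Classifier svm,
                                         List<String> agreedLabels) {
        List<int[]> kept = new ArrayList<>();
        for (int[] doc : unlabeled) {
            String nbLabel = nb.predict(doc);
            if (nbLabel.equals(svm.predict(doc))) {  // both classifiers agree
                kept.add(doc);                       // use in the next iteration
                agreedLabels.add(nbLabel);
            }                                        // otherwise: discard the document
        }
        return kept;
    }
}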

This improved algorithm is also implemented in RapidMiner as an extension. It gives better accuracy than the standard SSL algorithm on the same dataset [12].

Page 34: RM World 2014: Semi supervised text classification operator

Semi-supervised learning with EM can be used effectively to improve the performance of text classification when a limited number of labeled documents is available for training, and it has been implemented in RapidMiner as an extension.

Our future goals are to implement in RapidMiner other variants of the SSL algorithm proposed by different researchers [7], in order to overcome the limitations of the classic EM-based SSL algorithm, and to perform experiments on real-time datasets such as SMS and e-mail.

Page 35: RM World 2014: Semi supervised text classification operator

THANK YOU


Page 36: RM World 2014: Semi supervised text classification operator

[1] Kamal Nigam, Andrew Kachites McCallum, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, Kluwer Academic Publishers, Boston, 2000.

[2] Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Computer Sciences TR 1530, University of Wisconsin-Madison, 2005.

[3] Wen Han, Xiao Nan-feng, "An Enhanced EM Method of Semi-supervised Classification Based on Naive Bayesian", Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), September 2011.

[4] YueHong Cai, Qian Zhu, "Semi-Supervised Short Text Categorization Based on Random Subspace", 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), pages 470-473, 2010.

[5] Xinghua Fan, Zhiyi Guo, "A Semi-supervised Text Classification Method Based on Incremental EM Algorithm", WASE International Conference on Information Engineering, pages 211-214, 2010.

[6] Xinghua Fan, Zhiyi Guo, Houfeng Ma, "An Improved EM-based Semi-supervised Learning Method", International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, pages 529-532, August 2009.

[7] Purvi Rekh, Amit Thakkar, Amit Ganatra, "A Survey and Comparative Analysis of Expectation Maximization Based Semi-Supervised Text Classification", International Journal of Engineering and Advanced Technology, Vol. 1, Issue 3, pages 141-146, February 2012.

[8] Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2008.

[9] "Approaching Vega: The Final Descent. How to Extend RapidMiner 5.0", Rapid-I white paper.

[10] Purvi Rekh, Amit Thakkar, Amit Ganatra, "An Improved Expectation Maximization Based Semi-Supervised Text Classification Using Naïve Bayes and Support Vector Machine", CiiT International Journal of Artificial Intelligent Systems and Machine Learning, May 2012.

[11] Twenty Newsgroups Data Set: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

[12] Purvi Rekh, Amit Thakkar, "Semi-Supervised Text Classification Using Naïve Bayes and Support Vector Machine", Second International Conference on Emerging Research in Computing, Information, Communication and Applications, in press with Elsevier proceedings, 2014.