Page 1: Weka

Text Classification and Clustering with WEKA

A guided example by Sergio Jiménez

Page 2: Weka

The Task

Building a model for movie reviews in English, classifying each review as positive or negative.

Page 3: Weka

Sentiment Polarity Dataset Version 2.0

1000 positive and 1000 negative movie review texts from:

Thumbs up? Sentiment Classification using Machine Learning Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79--86, 2002.

“Our data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment.”

http://www.cs.cornell.edu/people/pabo/movie-review-data/

Page 4: Weka

The Data (1/2)

Page 5: Weka

The Data (2/2)

[Histogram: # documents vs. # characters for the 1000 negative reviews]

[Histogram: # documents vs. # characters for the 1000 positive reviews]

Page 6: Weka

What is WEKA?

• “Weka is a collection of machine learning algorithms for data mining tasks”.

• “Weka contains tools for:

– data pre-processing,

– classification,

– regression,

– clustering,

– association rules,

– and visualization”

Page 7: Weka

Where to start?

Page 8: Weka

Getting WEKA

Page 9: Weka

Before Running WEKA

Increase the available memory for Java in RunWeka.ini:

Change

maxheap=256m

to

maxheap=1024m

Page 10: Weka

Running WEKA

using

“RunWeka.bat”

Page 11: Weka

Creating a .arff dataset
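A minimal sketch of what such a dataset file can look like (the relation name, attribute names, and the two example rows here are invented for illustration; the real dataset has 2000 rows):

```
@relation movie_reviews

@attribute text string
@attribute class {pos, neg}

@data
'a great movie , funny and well acted', pos
'the worst film i have ever seen', neg
```

The text goes into a single `string` attribute; the StringToWordVector filter later turns it into one numeric attribute per word.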

Page 12: Weka

Saving the .arff dataset

Page 13: Weka

From text to vectors

V = [v1, v2, v3, …, vn, class]

review1 = “great movie”

review2 = “excellent film”

review3 = “worst film ever”

review4 = “sucks”

Vocabulary: ever, excellent, film, great, movie, sucks, worst

V1 = [0, 0, 0, 1, 1, 0, 0, +]

V2 = [0, 1, 1, 0, 0, 0, 0, +]

V3 = [1, 0, 1, 0, 0, 0, 1, −]

V4 = [0, 0, 0, 0, 0, 1, 0, −]
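The mapping above can be sketched in plain Python (a toy version of what the StringToWordVector filter produces in presence mode; the variable names are invented):

```python
# The four example reviews with their class labels.
reviews = [
    ("great movie", "+"),
    ("excellent film", "+"),
    ("worst film ever", "-"),
    ("sucks", "-"),
]

# Vocabulary: all distinct words, in alphabetical order.
vocab = sorted({word for text, _ in reviews for word in text.split()})
# -> ['ever', 'excellent', 'film', 'great', 'movie', 'sucks', 'worst']

def to_vector(text, label):
    """One 0/1 presence value per vocabulary word, plus the class."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab] + [label]

vectors = [to_vector(t, c) for t, c in reviews]
for v in vectors:
    print(v)
```

Each review becomes a fixed-length row, which is exactly the tabular form WEKA's classifiers expect.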

Page 14: Weka

Converting to Vector Space Model

Edit “movie_reviews.arff” and change “class” to “class1”. Apply the filter again after the change.

Page 15: Weka

Visualize the vector data

Page 16: Weka

StringToWordVector filter options

lowerCase conversion

TF-IDF weighting

Stopwords removal using a list of words in a file

Stemming

Use frequencies instead of single presence
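The TF-IDF weighting option is easy to sketch in plain Python. The formula below, tf × log(N/df), is the textbook variant; WEKA's exact transform and normalization may differ in detail:

```python
import math

# Three toy documents, already tokenized.
docs = [
    "great movie great fun".split(),
    "excellent film".split(),
    "worst film ever".split(),
]

N = len(docs)

# Document frequency: in how many documents each word appears.
df = {}
for doc in docs:
    for word in set(doc):
        df[word] = df.get(word, 0) + 1

def tfidf(word, doc):
    tf = doc.count(word)          # raw term frequency in this document
    idf = math.log(N / df[word])  # rarer words get a larger weight
    return tf * idf

print(tfidf("great", docs[0]))  # 2 * log(3/1): frequent here, rare elsewhere
print(tfidf("film", docs[1]))   # 1 * log(3/2): appears in 2 of 3 documents
```

Words that occur in many documents (like stopwords) get an IDF near zero, which is why TF-IDF and stopword removal address a similar problem.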

Page 17: Weka

Generating datasets for experiments

dataset file name      Stopwords  Stemming  Presence or freq.
movie_reviews_1.arff              no        presence
movie_reviews_2.arff              no        frequency
movie_reviews_3.arff              yes       presence
movie_reviews_4.arff              yes       frequency
movie_reviews_5.arff   removed    no        presence
movie_reviews_6.arff   removed    no        frequency
movie_reviews_7.arff   removed    yes       presence
movie_reviews_8.arff   removed    yes       frequency

Page 18: Weka

Classifying Reviews

Click!

Select a classifier

Select the class attribute

Select the number of folds

Start!
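One of the classifiers used here, multinomial Naive Bayes (WEKA's NaiveBayesMultinomial), can be sketched in plain Python with Laplace smoothing. This is a toy reimplementation on invented training data, not WEKA's code:

```python
import math
from collections import Counter, defaultdict

train = [
    ("great movie", "+"), ("excellent film", "+"),
    ("worst film ever", "-"), ("boring bad movie", "-"),
]

# Per-class word counts and class priors.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for w in text.split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed log likelihood of the word given the class
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("great film"))   # -> +
print(predict("worst boring")) # -> -
```

Smoothing matters for text: without it, any review containing a single unseen-for-that-class word would get probability zero.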

Page 19: Weka

Results

Page 20: Weka

Results: Correctly Classified Reviews

dataset name           Stopwords  Stemming  Presence or freq.  NaiveBayes 3-fold  NaiveBayesMultinomial 3-fold
movie_reviews_1.arff              no        presence           80.65%             83.80%
movie_reviews_2.arff              no        frequency          69.30%             78.65%
movie_reviews_3.arff              yes       presence           79.40%             82.15%
movie_reviews_4.arff              yes       frequency          68.10%             79.70%
movie_reviews_5.arff   removed    no        presence           81.80%             84.35%
movie_reviews_6.arff   removed    no        frequency          69.40%             81.75%
movie_reviews_7.arff   removed    yes       presence           78.90%             82.40%
movie_reviews_8.arff   removed    yes       frequency          68.30%             80.50%

Page 21: Weka

Attribute (word) Selection

Choose an Attribute Selection Algorithm

Select the class attribute
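A common attribute-selection criterion for this setup is information gain (WEKA's InfoGainAttributeEval is one such evaluator). A toy computation for binary presence attributes, with an invented four-row dataset:

```python
import math

def entropy(pos, neg):
    """Entropy of a +/- class distribution, in bits."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def info_gain(rows, attr_index):
    """rows: list of ([0/1 attribute values...], '+'/'-') pairs."""
    pos = sum(1 for _, c in rows if c == "+")
    gain = entropy(pos, len(rows) - pos)
    # Subtract the weighted entropy of each attribute-value subset.
    for value in (0, 1):
        subset = [(a, c) for a, c in rows if a[attr_index] == value]
        if subset:
            sp = sum(1 for _, c in subset if c == "+")
            gain -= len(subset) / len(rows) * entropy(sp, len(subset) - sp)
    return gain

# Attribute 0 perfectly separates the classes; attribute 1 is uninformative.
rows = [([0, 1], "+"), ([0, 0], "+"), ([1, 1], "-"), ([1, 0], "-")]
print(info_gain(rows, 0))  # 1.0 bit
print(info_gain(rows, 1))  # 0.0 bits
```

Ranking words by this score is what surfaces strongly polarized terms like those on the next slide.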

Page 22: Weka

Selected Attributes (words)

also, awful, bad, boring, both, dull, fails, pointless, poor, ridiculous, script, seagal, sometimes, stupid, deserves, effective, flaws, greatest, hilarious, memorable, overall, great, joke, lame, life, many, maybe, mess, nothing, others, perfect, performances, stupid, tale, terrible, true, visual, waste, wasted, world, worst, animation, definitely, overall, perfectly, realistic, share, solid, subtle, terrific, unlike, view, wonderfully

Page 23: Weka

Pruned movie_reviews_1.arff dataset

Page 24: Weka

Naïve Bayes with the pruned dataset

Page 25: Weka

Clustering

Correctly clustered instances: 65.25%
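The “correctly clustered instances” figure comes from WEKA's classes-to-clusters evaluation: each cluster is mapped to the majority class among its members, and accuracy is computed under that mapping. A sketch with invented cluster assignments:

```python
from collections import Counter

# Assumed example: a cluster id for each instance, plus its true class.
clusters = [0, 0, 0, 1, 1, 1, 0, 1]
classes  = ["+", "+", "-", "-", "-", "-", "+", "+"]

# Map each cluster to the majority class of its members.
majority = {}
for cid in set(clusters):
    members = [c for k, c in zip(clusters, classes) if k == cid]
    majority[cid] = Counter(members).most_common(1)[0][0]

correct = sum(majority[k] == c for k, c in zip(clusters, classes))
print(f"Correctly clustered: {100 * correct / len(classes):.2f}%")  # 75.00%
```

Since the clusterer never sees the labels, this score is usually well below the supervised results on the previous slides.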

Page 26: Weka

Other results

Results of Pang et al. (2002) with version 1.0 of the dataset (700 positive and 700 negative reviews)

Page 27: Weka

Thanks