SIMS 290-2: Applied Natural Language Processing, Marti Hearst, October 30, 2006. Some slides by Preslav Nakov and Eibe Frank.

Dec 22, 2015


SIMS 290-2: Applied Natural Language Processing

Marti Hearst
October 30, 2006

Some slides by Preslav Nakov and Eibe Frank 

 


Today

The Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/

Source: originally collected by Ken Lang

Content and structure:
• approximately 20,000 newsgroup documents
  – 19,997 originally
  – 18,828 without duplicates
• partitioned evenly across 20 different newsgroups
• we are only using a subset (6 newsgroups)

Some categories are strongly related (and thus hard to discriminate):
• computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
• rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
• sci.crypt, sci.electronics, sci.med, sci.space
• misc.forsale
• talk.politics.misc, talk.politics.guns, talk.politics.mideast
• talk.religion.misc, alt.atheism, soc.religion.christian

Sample Posting: “talk.politics.guns”

From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status

In article <[email protected]>, [email protected] (Larry Cipriani) writes:

> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.

"Why Hillary! Your government smells so... FRESH!"
--

[email protected] --If you believe that I speak for my company,
OR [email protected] write today for my special Investors' Packet...

Parts of the posting that need special handling during feature extraction:
• from
• subject
• reply (the quoted text introduced by “… writes:”)
• signature
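A minimal sketch of the special handling mentioned above, assuming the usual newsgroup layout: header lines, a blank line, then a body in which quoted lines start with ">" and the signature follows a line containing only "--". The function name `split_posting` and the sample text are made up for illustration; this is not the course's code.

```python
def split_posting(text):
    """Split a raw posting into headers, own text, quoted text, and signature."""
    head, _, rest = text.partition("\n\n")

    # Header lines look like "Name: value".
    headers = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    lines = rest.splitlines()

    # Assumption: the signature starts after the first line that is just "--".
    sig_start = len(lines)
    for i, line in enumerate(lines):
        if line.rstrip() == "--":
            sig_start = i
            break
    signature = lines[sig_start + 1:]

    body, quoted = [], []
    for line in lines[:sig_start]:
        stripped = line.strip()
        if stripped.startswith(">"):
            quoted.append(stripped.lstrip("> "))   # text being replied to
        elif stripped.endswith("writes:"):
            continue                               # attribution line of the reply
        elif stripped:
            body.append(stripped)

    return {"headers": headers, "body": " ".join(body),
            "quoted": quoted, "signature": signature}


SAMPLE = """From: someone@example.com (A. Poster)
Subject: Re: test subject

In article <123@example.com>, other@example.com (B. Writer) writes:
> quoted line one
> quoted line two

My own reply text.
--
someone@example.com  signature line
"""

parts = split_posting(SAMPLE)
print(parts["headers"]["subject"])
print(parts["body"])
```

Whether the quoted text and the subject line should contribute features at all is a design choice; the sketch only separates the parts so that each can be treated differently.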


The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

WEKA: The Bird
(Slide adapted from Eibe Frank's)

Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand

WEKA: The Software
(Slide by Eibe Frank)

• Waikato Environment for Knowledge Analysis
• Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
  – Released under the GPL
• Support for the whole process of experimental data mining
  – Preparation of input data
  – Statistical evaluation of learning schemes
  – Visualization of input data and the result of learning
• Used for education, research and applications
• Complements “Data Mining” by Witten & Frank

Main Features
(Slide by Eibe Frank)

• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 15 attribute/subset evaluators + 10 search algorithms for feature selection
• 3 algorithms for finding association rules
• 3 graphical user interfaces
  – “The Explorer” (exploratory data analysis)
  – “The Experimenter” (experimental environment)
  – “The KnowledgeFlow” (new process model inspired interface)

Projects based on WEKA
(Slide by Eibe Frank)

Incorporate/wrap WEKA:
• GRB Tool Shed - a tool to aid gamma ray burst research
• YALE - facility for large scale ML experiments
• GATE - NLP workbench with a WEKA interface
• Judge - document clustering and classification

Extend/modify WEKA:
• BioWeka - extension library for knowledge discovery in biology
• WekaMetal - meta learning extension to WEKA
• Weka-Parallel - parallel processing for WEKA
• Grid Weka - grid computing using WEKA
• Weka-CG - computational genetics tool library

The WEKA Project Today (2006)
(Slide by Eibe Frank)

• Funding for the next two years
• Goal of the project remains the same
• People
  – 6 staff
  – 2 postdocs
  – 3 PhD students
  – 3 MSc students
  – 2 research programmers

WEKA: The Software Toolkit
(Slide adapted from Eibe Frank's)

• Machine learning/data mining software in Java
• GNU License
• Used for research, education and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
  – data pre-processing tools
  – learning algorithms
  – evaluation methods
  – graphical interface (incl. data visualization)
  – environment for comparing learning algorithms

http://www.cs.waikato.ac.nz/ml/weka
http://sourceforge.net/projects/weka/

WEKA: Terminology

Some synonyms/explanations for the terms used by WEKA, which may differ from what we use:
• Attribute: feature
• Relation: collection of examples
• Instance: example (a single row of the relation)
• Class: category

WEKA GUI Chooser
(Slide adapted from Eibe Frank's)

java -Xmx1000M -jar weka.jar

Our Toy Example
(Slide adapted from Eibe Frank's)

We demonstrate WEKA on a simple example:
• 3 categories from “Newsgroups”:
  – misc.forsale
  – rec.sport.hockey
  – comp.graphics
• 20 documents per category
• features:
  – words converted to lowercase
  – frequency 2 or more required
  – stopwords removed
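The three feature rules of the toy example can be sketched as follows. This is an illustration only: the stoplist is a tiny made-up one, "frequency 2 or more" is assumed to mean the total count over the whole collection, and `build_vocabulary` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (an assumption, not the one used in class).
STOPWORDS = {"the", "a", "of", "and", "to", "is", "for", "in"}


def build_vocabulary(documents, min_count=2, stopwords=STOPWORDS):
    """Lowercase the words, drop stopwords, keep words seen min_count+ times."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in stopwords)
    return sorted(w for w, c in counts.items() if c >= min_count)


docs = [
    "Hockey season: the playoffs start",
    "For sale: hockey skates",
    "Graphics card for sale",
]
vocab = build_vocabulary(docs)
print(vocab)
```

With these three made-up documents only "hockey" and "sale" occur twice, so they are the only words kept as features.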

Explorer: Pre-Processing The Data
(Slide adapted from Eibe Frank's)

• WEKA can import data from:
  – files: ARFF, CSV, C4.5, binary
  – URL
  – SQL database (using JDBC)
• Pre-processing tools (filters) are used for:
  – discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

The Preprocessing Tab

[Screenshot of the Explorer window; callouts:]
• Preprocessing
• Classification
• Statistical attribute selection
• Filter selection
• Manual attribute selection
• List of attributes (last: class variable)
• Statistics about the values of the selected attribute
• Frequency and categories for the selected attribute

Explorer: Building “Classifiers”
(Slide adapted from Eibe Frank's)

• Classifiers in WEKA are models for:
  – classification (predict a nominal class)
  – regression (predict a numerical quantity)
• Learning algorithms:
  – Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.
• Meta-classifiers:
  – cannot be used alone
  – always combined with a learning algorithm
  – examples: boosting, bagging etc.

The Classification Tab

[Screenshot of the Explorer window; callouts:]
• Choice of classifier
• The attribute whose value is to be predicted from the values of the remaining ones. Default is the last attribute. Here (in our toy example) it is named “class”.
• Cross-validation: split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one
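The cross-validation callout can be sketched in a few lines. This is a bare-bones illustration working on example indices only: it neither shuffles nor stratifies by class, which real tools typically do, and `k_fold_splits` is a made-up name.

```python
def k_fold_splits(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 60 examples, as in the toy example (3 categories x 20 documents).
splits = list(k_fold_splits(60, k=10))
for train, test in splits[:2]:
    print(len(train), "training examples,", len(test), "test examples")
```

Every example ends up in the test fold exactly once, so the ten test results together cover the whole data set.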

Choosing a classifier

[Series of screenshots; callouts:]
• displays synopsis and options
• False: Gaussian
• True: kernels (better)
• numerical to nominal conversion by discretization
• outputs additional information

[Screenshot of the classifier output; callouts:]
• accuracy
• different/easy class
• all other numbers can be obtained from it

Confusion matrix

Contains information about the actual and the predicted classification:

               predicted
               –      +
  true   –     a      b
         +     c      d

All measures can be derived from it:
• accuracy: (a+d)/(a+b+c+d)
• recall: d/(c+d) => R
• precision: d/(b+d) => P
• F-measure: 2PR/(P+R)
• false positive (FP) rate: b/(a+b)
• true negative (TN) rate: a/(a+b)
• false negative (FN) rate: c/(c+d)

These extend for more than 2 classes: see previous lecture slides for details
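The measures listed on this slide translate directly into code. The sketch below uses the same cell names a, b, c, d (rows are the true class, columns the predicted class); the counts in the example are invented purely to exercise the formulas.

```python
def binary_metrics(a, b, c, d):
    """Measures derived from a 2x2 confusion matrix.

    a: true -, predicted -    b: true -, predicted +
    c: true +, predicted -    d: true +, predicted +
    No guard against empty rows or columns (division by zero) is included.
    """
    precision = d / (b + d)
    recall = d / (c + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": b / (a + b),
        "tn_rate": a / (a + b),
        "fn_rate": c / (c + d),
    }


# Invented counts, for illustration only.
m = binary_metrics(a=50, b=10, c=5, d=35)
for name, value in m.items():
    print(name, round(value, 3))
```

Note that the FP rate and the TN rate always sum to 1, since both are fractions of the true negatives row.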

Predictions Output

Outputs the probability distribution for each example.

Probability distribution for a wrong example: predicted 1 instead of 3.

Naïve Bayes makes incorrect conditional independence assumptions and typically is over-confident in its prediction regardless of whether it is correct or not.

Error Visualization

[Screenshots of the error plot]
• Little squares designate errors
• Axes show example number


Running on Test Set

Explorer: Attribute Selection
(Slide adapted from Eibe Frank's)

• Find which attributes are the most predictive ones
• Two parts:
  – search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  – evaluation method: information gain, chi-squared, etc.
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
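As an illustration of one evaluation method combined with the "ranking" search, here is a small sketch that scores each attribute by information gain and sorts the attributes by that score. It is written from the textbook definition, not taken from WEKA, and the tiny data set is invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the expected entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Invented toy data: 1 means the word occurs in the document.
labels = ["hockey", "hockey", "forsale", "forsale"]
features = {"puck": [1, 1, 0, 0], "the": [1, 0, 1, 0]}

ranking = sorted(features,
                 key=lambda name: information_gain(features[name], labels),
                 reverse=True)
print(ranking)
```

Here "puck" separates the two classes perfectly (gain of 1 bit) while "the" tells us nothing (gain of 0), so "puck" is ranked first.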

Individual Features Ranking

[Series of screenshots of the ranked features; callouts:]
• misc.forsale
• comp.graphics
• rec.sport.hockey
• ???
• random number seed

Saving the Selected Features

All we can do from this tab is to save the buffer in a text file. Not very useful...

But we can also perform feature selection during the pre-processing step (see the following slides).

Features Selection on Preprocessing

[Series of screenshots of feature selection applied as a pre-processing step]
• Before: 679 attributes: 678 + 1 (for the class)
• After: just 22 attributes remain: 21 + 1 (for the class)

Run Naïve Bayes With the 21 Features

[Screenshot of the classifier output; callouts:]
• 21 attributes
• higher accuracy

(AGAIN) Naïve Bayes With All Features (repeated slide)

[Screenshot of the classifier output; callouts:]
• ALL 679 attributes
• accuracy
• different/easy class

Some Important Algorithms

WEKA has weird naming for some algorithms. Here are some translations:
• Naïve Bayes: weka.classifiers.bayes.NaiveBayes
• Perceptron: weka.classifiers.functions.VotedPerceptron
• Decision tree: weka.classifiers.trees.J48
• Support vector machines: weka.classifiers.functions.SMO
• k nearest neighbor: weka.classifiers.lazy.IBk

Some of these are more sophisticated versions of the classic algorithms:
• e.g. the classic Naïve Bayes seems to be missing
• A good alternative is the Multinomial Naïve Bayes model
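The translations above can be kept in a small lookup table, for instance when building a WEKA command line from Python. The sketch below only assembles the argument list and does not run anything; the file name `train.arff` and the helper `weka_command` are hypothetical, and `-t` (training file) and `-x` (number of cross-validation folds) are WEKA's general classifier options.

```python
# Common algorithm names mapped to the WEKA class names from the slide.
WEKA_CLASSIFIERS = {
    "naive bayes": "weka.classifiers.bayes.NaiveBayes",
    "perceptron": "weka.classifiers.functions.VotedPerceptron",
    "decision tree": "weka.classifiers.trees.J48",
    "svm": "weka.classifiers.functions.SMO",
    "knn": "weka.classifiers.lazy.IBk",
}


def weka_command(algorithm, train_arff, folds=10, jar="weka.jar"):
    """Build (but do not run) a command that cross-validates one classifier."""
    return ["java", "-cp", jar, WEKA_CLASSIFIERS[algorithm],
            "-t", train_arff, "-x", str(folds)]


cmd = weka_command("decision tree", "train.arff")
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where Java and `weka.jar` are available.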


The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

Performing Experiments
(Slide adapted from Eibe Frank's)

• Experimenter makes it easy to compare the performance of different learning schemes
• Problems:
  – classification
  – regression
• Results: written into file or database
• Evaluation options:
  – cross-validation
  – learning curve
  – hold-out
• Can also iterate over different parameter settings
• Significance-testing built in!
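To give a feel for what significance testing over paired results involves, here is a sketch of the plain paired t statistic computed from per-fold accuracies of two classifiers. This is the textbook statistic, offered as an assumption-laden illustration; the Experimenter's built-in test may use a different (corrected) variant, and the accuracy values below are invented.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (same folds, two classifiers).

    Raises ZeroDivisionError if all per-fold differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Invented per-fold accuracies, for illustration only.
classifier_a = [0.90, 0.85, 0.88, 0.92, 0.87]
classifier_b = [0.80, 0.83, 0.79, 0.85, 0.81]

t = paired_t_statistic(classifier_a, classifier_b)
print("t =", round(t, 2), "with", len(classifier_a) - 1, "degrees of freedom")
```

The statistic is then compared against a t distribution with n - 1 degrees of freedom; a large positive value suggests the first classifier is reliably better on these folds.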

Experiments Setup

[Series of Experimenter screenshots; callouts:]
• CSV file: can be opened in Excel
• datasets
• algorithms

Results (accuracy):
• SVM is the best
• Decision tree is the worst
• SVM is statistically better than Naïve Bayes
• Decision tree is statistically worse than Naïve Bayes

Experiments: Excel

Results are output into a CSV file, which can be read in Excel!


The Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA File Format: ARFF
(Slide adapted from Eibe Frank's)

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

Callouts:
• Numerical attribute: declared as numeric (e.g. age)
• Nominal attribute: possible values listed in braces (e.g. sex)
• Missing value: written as ?
• Other attribute types: String, Date
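A dense ARFF file like the one above is easy to generate from Python. The following is a minimal sketch with a hypothetical helper `to_arff`; it uses a subset of the slide's attributes, writes `None` as the missing-value marker `?`, and does not quote values that contain spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Render a dense ARFF file as a string.

    attributes: list of (name, type) pairs, where type is either a string
    such as "numeric" or a list of nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ", ".join(kind) + "}"      # nominal attribute
        lines.append("@attribute %s %s" % (name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "heart-disease-simplified",
    [("age", "numeric"),
     ("sex", ["female", "male"]),
     ("cholesterol", "numeric"),
     ("class", ["present", "not_present"])],
    [(63, "male", 233, "not_present"),
     (38, "female", None, "not_present")],
)
print(arff_text)
```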

WEKA File Format: Sparse ARFF

• Value 0 is not represented explicitly
• Same header (i.e. @relation and @attribute tags)
• The @data section is different

Instead of

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

we have

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

This is especially useful for textual data (why?)
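The conversion from a dense row to a sparse one can be sketched as below, using the two rows from this slide. Attribute indices start at 0, zero values are skipped, and values containing a space are quoted; `to_sparse_row` is a made-up helper name.

```python
def to_sparse_row(values):
    """Convert one dense data row into a sparse ARFF data line."""
    parts = []
    for index, value in enumerate(values):
        if value == 0:
            continue                      # zeros are left implicit
        if isinstance(value, str) and " " in value:
            text = '"%s"' % value         # quote values that contain spaces
        else:
            text = str(value)
        parts.append("%d %s" % (index, text))
    return "{" + ", ".join(parts) + "}"


row_a = to_sparse_row([0, "X", 0, "Y", "class A"])
row_b = to_sparse_row([0, 0, "W", 0, "class B"])
print(row_a)
print(row_b)
```

For a bag-of-words representation most entries in each row are 0, so the sparse line stores only the handful of words that actually occur in the document.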

Python Interface to WEKA

• This is just to get you started
• Assumes the newsgroups collection
• Extracts simple features
  – currently just single word features
  – uses a simple tokenizer which removes punctuation
  – uses a stoplist
  – lowercases the words
• Includes filtering code
  – currently eliminates numbers
• Features are weighted by frequency within document
• Produces a sparse ARFF file to be used by WEKA
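The steps listed above might look roughly like the following. This is a sketch under stated assumptions and not the course's actual code: the stoplist is a tiny made-up one, the names `extract_features` and `sparse_data_line` are hypothetical, and the class attribute is assumed to come last.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (an assumption).
STOPLIST = {"the", "a", "an", "of", "to", "and", "is", "for"}


def extract_features(text, stoplist=STOPLIST):
    """Tokenize, lowercase, drop stopwords and numbers, count frequencies."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # punctuation is dropped
    words = [t.lower() for t in tokens]
    words = [w for w in words if w not in stoplist and not w.isdigit()]
    return Counter(words)


def sparse_data_line(counts, vocabulary, label):
    """One sparse ARFF data line; the class is the last attribute."""
    parts = ["%d %d" % (i, counts[w])
             for i, w in enumerate(vocabulary) if counts[w] > 0]
    parts.append("%d %s" % (len(vocabulary), label))
    return "{" + ", ".join(parts) + "}"


counts = extract_features("For sale: 2 hockey sticks, the sticks are new!")
line = sparse_data_line(counts, ["hockey", "new", "puck", "sticks"], "misc.forsale")
print(line)
```

Words outside the chosen vocabulary (here "sale" and "are") are simply ignored when the data line is written.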

Python Interface to WEKA

Allows you to specify:
• which directory to read files from
• which newsgroups to use
• the number of documents for training each newsgroup
• the number of features to retain

Python Interface to WEKA

Things to (optionally) add or change:
• an option to not use stopwords
• an option to retain capitalization
• regular expression pattern a feature should match
• other non-word-based features
• morphological normalization
• a minimum threshold for the number of times a term occurs before it can be counted as a feature
• tf.idf weighting on terms
• your idea goes here

Python Interface to WEKA

TF.IDF: t_ij · log(N / n_i)
• TF
  – t_ij: frequency of term i in document j
  – this is how features are currently weighted
• IDF: log(N / n_i)
  – n_i: number of documents containing term i
  – N: total number of documents
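The TF.IDF weighting defined above can be sketched directly from the formula. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `tf_idf` and the three small documents are made up for illustration.

```python
import math


def tf_idf(doc_term_counts):
    """Weight each term count t_ij by log(N / n_i).

    doc_term_counts: one dict per document, mapping term -> frequency.
    """
    n_docs = len(doc_term_counts)                     # N
    doc_freq = {}                                     # n_i for each term
    for counts in doc_term_counts:
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in counts.items()}
        for counts in doc_term_counts
    ]


docs = [{"hockey": 3, "sale": 1}, {"sale": 2}, {"graphics": 4, "sale": 1}]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "sale" here, gets weight 0 because log(N / n_i) = log(1) = 0; rare terms are boosted.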

Python Weka Code

[Series of slides showing the Python code listing; the code itself is not included in this transcript]

ARFF file

[Two slides showing the generated ARFF file; the file contents are not included in this transcript]

Assignment

• Due November 13.
• Work individually on this one.
• Objective is to use the training set to get the best features and learning model you can.
• FREEZE this.
• Then run one time only on the test set.
• This is a realistic way to see how well your algorithm does on unseen data.


Next Time

Machine learning algorithms