Transcript
Page 1:

Text classification: In Search of a Representation

Stan Matwin

School of Information Technology and Engineering
University of Ottawa
stan@site.uottawa.ca

Page 2:

Outline

Supervised learning = classification
ML/DM at U of O
Classical approach
Attempt at a linguistic representation
N-grams – how to get them?
Labelling and co-learning
Next steps? …

Page 3:

Supervised learning (classification)

Given:
a set of training instances T = {⟨e, t⟩}, where each t is a class label: one of the classes C1, …, Ck
a concept with k classes C1, …, Ck (but the definition of the concept is NOT known)

Find: a description for each class which will perform well in determining (predicting) class membership for unseen instances
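A minimal sketch of this task in Python; scikit-learn and the toy data are assumptions of the sketch, not tools or examples from the talk:

```python
# Training instances are <e, t> pairs: attribute vectors e with class labels t
# drawn from C1, ..., Ck. The learner builds a description of each class and
# predicts the class of unseen instances.
from sklearn.tree import DecisionTreeClassifier

X_train = [[1, 0, 3], [0, 2, 1], [1, 1, 0], [0, 3, 2]]   # attribute vectors e
y_train = ["C1", "C2", "C1", "C2"]                        # class labels t

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)              # find a description for each class

X_unseen = [[1, 0, 2], [0, 3, 1]]
print(clf.predict(X_unseen))           # predict class membership for unseen instances
```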

Page 4:

Classification

Prevalent practice: examples are represented as vectors of values of attributes

Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy

Page 5:

ML/DM at U of O

Learning from imbalanced classes: applications in remote sensing

A relational, rather than propositional, representation: learning the maintainability concept

Learning in the presence of background knowledge: Bayesian belief networks and how to get them; application to distributed databases

Page 6:

Why text classification?

Automatic file saving
Internet filters
Recommenders
Information extraction
…

Page 7:

Bag of words

Text classification: standard approach

1. Remove stop words and markup
2. Remaining words are all attributes
3. A document becomes a vector of <word, frequency> pairs

4. Train a boolean classifier for each class

5. Evaluate the results on an unseen sample
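A hedged sketch of these five steps in Python, with scikit-learn as a stand-in (the talk names no library; documents and classes below are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

train_docs = ["the machine learning workshop", "wheat and grain exports rose",
              "neural networks learn representations", "corn prices fell sharply"]
train_labels = ["AI", "commodities", "AI", "commodities"]
test_docs = ["grain exports and corn prices", "learning with neural networks"]

# steps 1-3: drop stop words, keep the remaining words as attributes,
# and turn each document into a <word, frequency> vector
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# step 4: one boolean (in-class / not-in-class) classifier per class
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, train_labels)

# step 5: evaluate on an unseen sample
print(clf.predict(X_test))
```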

Page 8:

Text classification: tools

RIPPER
A “covering” learner
Works well with large sets of binary features

Naïve Bayes
Efficient (no search)
Simple to program
Gives “degree of belief”
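For the Naïve Bayes option, a short sketch of the “degree of belief” it gives; scikit-learn's MultinomialNB and the toy documents are assumptions of the sketch, not the talk's implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["machine learning and search", "grain and wheat markets",
        "learning algorithms for text", "corn and grain exports"]
labels = ["AI", "commodities", "AI", "commodities"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, labels)       # no search: just counting word frequencies

query = vec.transform(["learning about grain markets"])
for cls, p in zip(nb.classes_, nb.predict_proba(query)[0]):
    print(f"P({cls} | doc) = {p:.2f}")    # posterior = degree of belief in each class
```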

Page 9:

“Prior art”

Yang: best results using k-NN: 82.3% micro-averaged accuracy

Joachims' results using Support Vector Machines + unlabelled data

SVMs are insensitive to high dimensionality and sparseness of examples

Page 10:

SVM in Text classification

SVM: maximum separation margin

Transductive SVM: maximum margin over the test set

Training with 17 examples in the 10 most frequent categories gives test performance of 60% on 3000+ test cases available during training

Page 11:

Problem 1: aggressive feature selection

Word frequencies by category:

        “Machine”   “Learning”   “Machine Learning”
AI         50%          75%            50%
EP          4%          75%             0%
MT         80%           5%             0%

RIPPER (B.O.W.): machine & learning = AI

FLIPPER (Cohen): machine & learning & near & after = AI

RIPPER (Phrases): “machine learning” = AI
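A small sketch of the phrase idea: keeping two-word phrases as attributes makes a rule such as “machine learning” = AI expressible. CountVectorizer and the toy documents are assumptions of the sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning of rules", "the learning organisation in practice",
        "machine translation of learning materials"]

vec = CountVectorizer(ngram_range=(1, 2))     # unigrams plus two-word phrases
X = vec.fit_transform(docs)
print("machine learning" in vec.vocabulary_)  # True: the phrase is now an attribute
```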

Page 12:

Problem 2: semantic relationships are missed

Figure: knife, gun, dagger, sword, rifle, slingshot – all related to “weapon”

Semantically related words may be sparsely distributed through many documents

A statistical learner may be able to pick up these correlations

A rule-based learner is disadvantaged

Page 13:

Proposed solution (Sam Scott)

Get noun phrases and/or key phrases (Extractor) and add to the feature list

Add hypernyms

Page 14:

Hypernyms - WordNet

“synset” => SYNONYM
“is a” => HYPERNYM
“instance of” => HYPONYM

Figure: fragment of the WordNet hierarchy – “weapon” is a hypernym (“is a”) of “gun” and “knife”; the synset “pistol, revolver” is an instance of “gun”.
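A sketch of looking up hypernyms; the talk names only WordNet itself, so NLTK's WordNet interface and the example words are assumptions of the sketch:

```python
# Requires: pip install nltk   and   nltk.download("wordnet")
from nltk.corpus import wordnet as wn

for word in ["gun", "knife"]:
    synset = wn.synsets(word, pos=wn.NOUN)[0]          # first noun sense
    hypernyms = synset.hypernyms()                      # "is a" links one level up
    print(word, "->", [h.lemma_names() for h in hypernyms])
# Adding such hypernyms (e.g. "weapon") as extra features lets documents that
# mention different weapons share an attribute.
```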

Page 15:

Evaluation (Lewis)

• Vary the “loss ratio” parameter
• For each parameter value:
  • Learn a hypothesis for each class (binary classification)
  • Micro-average the confusion matrices (add component-wise)
  • Compute precision and recall
• Interpolate (or extrapolate) to find the point where micro-averaged precision and recall are equal
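A hedged sketch of the micro-averaged break-even computation; the per-class counts and the two parameter settings below are invented for illustration:

```python
def micro_pr(per_class_counts):
    """per_class_counts: list of (tp, fp, fn), one tuple per binary classifier."""
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    return tp / (tp + fp), tp / (tp + fn)      # micro-averaged precision, recall

# one (precision, recall) point per setting of the "loss ratio" parameter
(p1, r1) = micro_pr([(80, 10, 40), (60, 5, 30)])     # conservative setting
(p2, r2) = micro_pr([(110, 45, 10), (85, 40, 5)])    # liberal setting

# interpolate along the segment between the two points to where precision == recall
t = (p1 - r1) / ((p1 - r1) - (p2 - r2))
breakeven = p1 + t * (p2 - p1)
print(f"micro-averaged break-even = {breakeven:.3f}")
```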

Page 16:

Results

No gain over BW in alternative representations

But…

Comprehensibility…

Micro-averaged b.e.   Reuters   DigiTrad
BW                     .821      .359
BWS                    .810      .360
NP                     .827      .357
NPS                    .819      .356
KP                     .817      .288e
KPS                    .816      .297e
H0                     .741e     .283
H1                     .734e     .281
NPW                    .823      N/A

Page 17:

Combining classifiers

Comparable to best known results (Yang)

                  Reuters                       DigiTrad
#   representations            b.e.    representations            b.e.
1   NP                         .827    BWS                        .360
3   BW, NP, NPS                .845    BW, BWS, NP                .404e
5   BW, NP, NPS, KP, KPS       .849    BW, BWS, NP, KPS, KP       .422e

Page 18:

Other possibilities

Using hypernyms with a small training set (avoids ambiguous words)

Use Bayes+Ripper in a cascade scheme (Gama)

Other representations (see the following slides: collocations, n-grams, grammar induction)

Page 19:

Collocations

Do not need to be noun phrases, just pairs of words possibly separated by stop words

Only the well-discriminating ones are chosen

These are added to the bag of words, and…

Ripper
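A hedged sketch of collocation extraction: word pairs that co-occur within a small window (so stop words may sit between them), filtered to the pairs that discriminate between classes. The window size, filter, and documents are invented:

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "in"}

def pairs(doc, window=3):
    words = doc.lower().split()
    out = set()
    for i, w in enumerate(words):
        if w in STOP:
            continue
        for v in words[i + 1:i + 1 + window]:
            if v not in STOP:
                out.add((w, v))     # nearest non-stop-word to the right
                break
    return out

docs = [("machine learning of rules", "AI"),
        ("rules of the machine shop", "MT"),
        ("machine learning in practice", "AI")]

by_class = {"AI": Counter(), "MT": Counter()}
for text, cls in docs:
    by_class[cls].update(pairs(text))

# keep pairs that appear only in one class: these get added to the bag of words
good = [p for p in by_class["AI"] if by_class["MT"][p] == 0]
print(good)    # e.g. [('machine', 'learning'), ...]
```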

Page 20:

N-grams

N-grams are substrings of a given length
Good results on Reuters [Mladenic, Grobelnik] with Bayes; we try RIPPER

A different task: classifying text files
  Attachments
  Audio/video
  Coded

From n-grams to relational features

Page 21:

How to get good n-grams?

We use Ziv-Lempel for frequent substring detection (.gz!)

Example: Ziv-Lempel dictionary built from the string “abababaa” (trie figure)
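A sketch of the dictionary-building idea: an LZ78-style parse (a simpler relative of the compression behind .gz) turns repeated substrings into dictionary phrases that can serve as candidate n-grams. Pure Python; the example string is the one from the slide:

```python
def lz_phrases(text):
    """Return the LZ78 dictionary phrases of `text`, in order of discovery."""
    phrases, seen, current = [], set(), ""
    for ch in text:
        current += ch
        if current not in seen:          # new phrase: add it to the dictionary
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                          # leftover suffix at the end of the input
        phrases.append(current)
    return phrases

print(lz_phrases("abababaa"))            # ['a', 'b', 'ab', 'aba', 'a']
```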

Page 22:

N-grams

Counting
Pruning: substring occurrence ratio < acceptance threshold

Building relations: string A almost always precedes string B

Feeding into relational learner (FOIL)
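A hedged sketch of the counting, pruning, and relation-building steps; the acceptance threshold, the “almost always” cutoff, and the strings are invented, and the resulting ground facts are the kind of input a relational learner such as FOIL expects:

```python
from collections import Counter

docs = ["abababaa", "abbaab", "bababa"]
candidates = ["ab", "ba", "aa", "bb", "aba"]

# counting + pruning: keep substrings whose occurrence ratio across documents
# reaches the acceptance threshold
THRESHOLD = 0.6
counts = Counter(s for s in candidates for d in docs if s in d)
kept = [s for s in candidates if counts[s] / len(docs) >= THRESHOLD]

# building relations: precedes(A, B) holds if A almost always occurs before B
def precedes(a, b, ratio=0.9):
    pos = [(d.find(a), d.find(b)) for d in docs if a in d and b in d]
    return pos and sum(i < j for i, j in pos) / len(pos) >= ratio

facts = [("precedes", a, b) for a in kept for b in kept if a != b and precedes(a, b)]
print(kept, facts)    # ground facts like these would be fed to FOIL
```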

Page 23:

Using grammar induction (text files)

Idea: detect patterns of substrings
Patterns are regular languages
Methods of automata induction: a recognizer for each class of files
We use a modified version of RPNI2 [Dupont, Miclet]

Page 24:

What’s new…

Work with marked up text (Word, Web)

XML with semantic tags: mixed blessing for DM/TM

Co-learning
Text mining

Page 25:

Co-learning

How to use unlabelled data? Or: how to limit the number of examples that need to be labelled?

Two classifiers and two redundantly sufficient representations

Train both, run both on test set, add best predictions to training set
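A hedged sketch of such a co-training loop (after Blum & Mitchell): two classifiers, each trained on its own redundantly sufficient view, repeatedly hand their most confident predictions on unlabelled examples to the shared training set. The base learner (Naïve Bayes via scikit-learn), the views, and all counts are assumptions of the sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def cotrain(view_a, view_b, labels, rounds=5, per_round=1):
    """labels: one label per labelled example; the rest form the unlabelled pool."""
    n_labelled = len(labels)
    vec_a, vec_b = CountVectorizer(), CountVectorizer()
    Xa, Xb = vec_a.fit_transform(view_a), vec_b.fit_transform(view_b)
    labelled = list(range(n_labelled))
    pool = list(range(n_labelled, len(view_a)))
    y = list(labels) + [None] * len(pool)
    for _ in range(rounds):
        clf_a = MultinomialNB().fit(Xa[labelled], [y[i] for i in labelled])
        clf_b = MultinomialNB().fit(Xb[labelled], [y[i] for i in labelled])
        for clf, X in ((clf_a, Xa), (clf_b, Xb)):
            if not pool:
                break
            proba = clf.predict_proba(X[pool])
            best = np.argsort(proba.max(axis=1))[-per_round:]    # most confident
            for i in sorted(best, reverse=True):
                idx = pool.pop(i)
                labelled.append(idx)
                y[idx] = clf.classes_[proba[i].argmax()]         # adopt the prediction
    return clf_a, clf_b

# toy usage, echoing Mitchell's web-page task: view A = page text, view B = anchor text;
# the first four examples are labelled, the last two form the unlabelled pool.
pages   = ["course syllabus and homework", "research project funding",
           "lecture notes and exam", "grant proposal for students",
           "homework due before the exam", "project results paper"]
anchors = ["cs101 course page", "lab project page", "course home",
           "research lab", "my course", "our project"]
cotrain(pages, anchors, labels=["course", "project", "course", "project"])
```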

Page 26:

Co-learning

Training set grows as… each learner predicts independently, due to redundant sufficiency (different representations)

Would it also work with our learners if we used Bayes?

Would work for classifying emails

Page 27:

Co-learning

Mitchell experimented with the task of classifying web pages (profs, students, courses, projects) – a supervised learning task

Used:
  Anchor text
  Page contents

Error rate halved (from 11% to 5%)

Page 28:

Cog-sci?

Co-learning seems to be cognitively justified

Model: students learning in groups (pairs)

What other social learning mechanisms could provide models for supervised learning?

Page 29:

Conclusion

A practical task that needs a solution
No satisfactory solution so far
Fruitful ground for research