Importance of Semantic Representation:
Dataless Classification
Ming-Wei Chang Lev Ratinov Dan Roth Vivek Srikumar
University of Illinois, Urbana-Champaign
Slide 2
Text Categorization
Classify the following sentence:
Syd Millar was the chairman of the International Rugby Board in 2003.
Pick a label:
Class1 vs. Class2
Traditionally, we need annotated data to train a classifier
Slide 3
Text Categorization
Humans don’t seem to need labeled data
Syd Millar was the chairman of the International Rugby Board in 2003.
Pick a label:
Sports vs. Finance
Label names carry a lot of information!
Slide 4
Text Categorization
Do we really always need labeled data?
Slide 5
Contributions
We can often go quite far without annotated data … if we “know” the meaning of text
This works for text categorization … and is consistent across different domains
Slide 6
Outline
Semantic Representation
On-the-fly Classification
Datasets
Exploiting unlabeled data
Robustness to different domains
Slide 7
Outline
Semantic Representation
On-the-fly Classification
Datasets
Exploiting unlabeled data
Robustness to different domains
Slide 8
Semantic Representation
One common representation is Bag of Words
Text is a vector in the space of words
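A minimal sketch of this mapping, assuming a fixed toy vocabulary (the function name, vocabulary, and example values below are illustrative, not from the slides):

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent text as a count vector over a fixed word vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["rugby", "chairman", "board", "bank", "stock"]   # toy vocabulary
doc = "Syd Millar was the chairman of the International Rugby Board in 2003."
print(bag_of_words(doc, vocab))                           # -> [1, 1, 1, 0, 0]
```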
Slide 9
Semantic Representation
Explicit Semantic Analysis [Gabrilovich & Markovitch, 2006, 2007]
Text is a vector in the space of concepts
Concepts are defined by Wikipedia articles
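A minimal sketch of mapping text into the concept space, assuming a precomputed word-to-concept weight index built from Wikipedia; the tiny `concept_weights` dictionary and its numbers are illustrative stand-ins:

```python
from collections import defaultdict

# Illustrative stand-in for the Wikipedia-derived index:
# word -> {Wikipedia article title: weight of the word in that article}
concept_weights = {
    "monetary": {"Monetary policy": 2.0, "Central bank": 1.5},
    "policy":   {"Monetary policy": 1.5, "Foreign policy": 1.0},
}

def esa_vector(text):
    """Represent text as a weighted vector over Wikipedia concepts."""
    vector = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in concept_weights.get(word, {}).items():
            vector[concept] += weight
    return dict(vector)

print(esa_vector("Monetary Policy"))
# -> {'Monetary policy': 3.5, 'Central bank': 1.5, 'Foreign policy': 1.0}
```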
Slide 10
Explicit Semantic Analysis: Example
Input text: "Monetary Policy"
ESA representation (Wikipedia article titles): International Monetary Fund, Monetary policy, Economic and Monetary Union, Hong Kong Monetary Authority, Monetarism, Central bank
Input text: "Apple iPod"
ESA representation (Wikipedia article titles): iPod mini, iPod photo, iPod nano, Apple Computer, iPod shuffle, iTunes
Slide 11
Semantic Representation
Two semantic representations
Bag of words
ESA
Slide 12
Outline
Semantic Representation
On-the-fly Classification
Datasets
Exploiting unlabeled data
Robustness to different domains
Slide 13
Traditional Text Categorization
Diagram: a labeled corpus of Sports and Finance documents is mapped into a semantic space, and a classifier is trained.
Slide 14
Dataless Classification
Diagram: instead of a labeled corpus, we have only the label names Sports and Finance.
What can we do using just the labels?
Slide 15
But labels are text too!
Slide 16
Dataless Classification
Diagram: the label names (Sports, Finance) and a new unlabeled document are mapped into the same semantic space.
Slide 17
What is Dataless Classification?
Humans don’t need training for classification
Annotated training data not always needed
Look for the meaning of words
Slide 19
On-the-fly Classification
Diagram: the label names (Sports, Finance) and a new unlabeled document in the semantic space; the document is assigned the closest label.
Slide 20
On-the-fly Classification
No training data needed
We know the meaning of label names
Pick the label that is closest in meaning to the document (nearest neighbor)
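A minimal sketch of this nearest-neighbor decision, assuming a `represent` function (e.g., the ESA or bag-of-words sketches above) that maps text to a sparse dictionary of dimension weights; the cosine-similarity scoring and names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[dim] for dim, weight in u.items() if dim in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def on_the_fly_classify(document, label_names, represent):
    """Assign the label whose name is closest in meaning to the document."""
    doc_vec = represent(document)
    return max(label_names, key=lambda name: cosine(doc_vec, represent(name)))

# e.g. on_the_fly_classify("Syd Millar chaired the International Rugby Board.",
#                          ["Sports", "Finance"], esa_vector)
```

With bag of words, a label name such as "Sports" rarely shares literal words with a document, which is why the concept-based ESA representation matters for this comparison.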
Slide 21
On-the-fly Classification
Diagram: new label names (Hockey, Baseball) and a new unlabeled document mapped into the semantic space.
Slide 22
On-the-fly Classification
No need to even know the labels beforehand
Compare with traditional classification: annotated training data is needed for each label
Slide 23
Outline
Semantic Representation
On-the-fly Classification
Datasets
Exploiting unlabeled data
Robustness to different domains
Slide 24
Dataset 1: Twenty Newsgroups
Posts to newsgroups
Newsgroups have descriptive names
sci.electronics = Science, Electronics
rec.motorcycles = Motorcycles
Slide 25
Dataset 2: Yahoo Answers
Posts to Yahoo! Answers
Posts are categorized into a two-level hierarchy: 20 top-level categories, 280 categories in total at the second level
e.g., Arts and Humanities → Theater & Acting; Sports → Rugby League
Slide 26
Experiments
20 Newsgroups: 10 binary problems (from [Raina et al., '06])
Religion vs. Politics.guns
Motorcycles vs. MS Windows
Yahoo! Answers: 20 binary problems
Health → Diet & Fitness vs. Health → Allergies
Consumer Electronics → DVRs vs. Pets → Rodents
Slide 27
Results: On-the-fly classification
Dataset      Supervised Baseline   Bag of Words   ESA
Newsgroups   71.7                  65.7           85.3
Yahoo!       84.3                  66.8           88.6
Supervised baseline: Naïve Bayes classifier (uses annotated data, ignores label names)
Bag of Words and ESA: nearest neighbor to the label name (uses labels, no annotated data)
Slide 28
Outline
Semantic Representation
On-the-fly Classification
Datasets
Exploiting unlabeled data
Robustness to different domains
Slide 29
Using Unlabeled Data
Knowing the data collection helps
We can learn specific biases of the dataset
Potential for semi-supervised learning
Slide 30
Bootstrapping
Each label name is a "labeled" document: one "example" in word or concept space
Train an initial classifier (same as the on-the-fly classifier)
Loop: classify all documents with the current classifier; retrain the classifier on the highly confident predictions
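A minimal sketch of this bootstrapping loop, assuming generic `train` and `predict_with_confidence` callables; the interface, threshold, and round count are illustrative, not the authors' exact procedure:

```python
def bootstrap(label_names, unlabeled_docs, represent, train,
              predict_with_confidence, threshold=0.9, rounds=5):
    """Self-train starting from the label names as the only 'labeled' examples."""
    # Each label name is one "labeled" document in word or concept space.
    labeled = [(represent(name), name) for name in label_names]
    classifier = train(labeled)               # same as the on-the-fly classifier
    for _ in range(rounds):
        confident = []
        for doc in unlabeled_docs:
            label, confidence = predict_with_confidence(classifier, represent(doc))
            if confidence >= threshold:
                confident.append((represent(doc), label))
        # Retrain on the label names plus the highly confident predictions.
        classifier = train(labeled + confident)
    return classifier
```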
Slide 31
Co-training
Words and concepts are two independent "views"
Each view is a teacher for the other
[Blum & Mitchell ‘98]
Slide 32
Co-training
Train initial classifiers in the word space and the concept space
Loop: classify documents with the current classifiers; retrain with the highly confident predictions of both classifiers
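A minimal sketch of the co-training loop over the two views, assuming the same classifier interface as in the bootstrapping sketch; names and thresholds are illustrative:

```python
def co_train(label_names, unlabeled_docs, bow_repr, esa_repr, train,
             predict_with_confidence, threshold=0.9, rounds=5):
    """Words and concepts are two views; each view teaches the other."""
    # Initial "training data" in each view: just the label names themselves.
    word_data = [(bow_repr(name), name) for name in label_names]
    concept_data = [(esa_repr(name), name) for name in label_names]
    word_clf, concept_clf = train(word_data), train(concept_data)
    pool = list(unlabeled_docs)
    for _ in range(rounds):
        still_unlabeled = []
        for doc in pool:
            # Ask both views; accept whichever prediction is confident enough.
            for clf, view in ((word_clf, bow_repr), (concept_clf, esa_repr)):
                label, conf = predict_with_confidence(clf, view(doc))
                if conf >= threshold:
                    # A confident view labels the document for BOTH views.
                    word_data.append((bow_repr(doc), label))
                    concept_data.append((esa_repr(doc), label))
                    break
            else:
                still_unlabeled.append(doc)
        pool = still_unlabeled
        word_clf, concept_clf = train(word_data), train(concept_data)
    return word_clf, concept_clf
```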
Slide 33
Using unlabeled data
Three approaches
Bootstrapping with labels using Bag of Words
Bootstrapping with labels using ESA
Co-training
Slide 34
More Results
No annotated data
Co-training using just labels does as well as supervision with 100 examples
Slide 35
Outline
Semantic Representation
On-the-fly Classification
Datasets
Exploiting unlabeled data
Robustness to different domains
Slide 36
Domain Adaptation
Classifiers trained on one domain and tested on another
Performance usually decreases across domains
Slide 37
But the label names are the same
Label names don't depend on the domain
Label names are robust across domains
On-the-fly classifiers are domain independent
Slide 38
Example: Baseball vs. Hockey
Slide 39
Conclusion
Sometimes, label names tell us more about a class than annotated examples
The standard learning practice of treating labels as unique identifiers loses information
The right semantic representation helps. What is the right one?