Advanced Multimedia Text Classification Tamara Berg
Transcript
Page 1: Advanced Multimedia Text Classification Tamara Berg.

Advanced Multimedia

Text Classification
Tamara Berg

Page 2: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 3: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 4: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Today!

Page 5: Advanced Multimedia Text Classification Tamara Berg.

What does categorization/classification mean?

Page 6: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 7: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 8: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 9: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 10: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 11: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 12: Advanced Multimedia Text Classification Tamara Berg.

Slide from Dan Klein

Page 13: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 14: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 15: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 16: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

• Machine Learning - how to select a model on the basis of data / experience

– Learning parameters (e.g. probabilities)
– Learning structure (e.g. dependencies)
– Learning hidden concepts (e.g. clustering)

Page 17: Advanced Multimedia Text Classification Tamara Berg.
Page 18: Advanced Multimedia Text Classification Tamara Berg.

Classifiers

• Today we’ll talk about 2 simple kinds of classifiers:
  – Nearest Neighbor Classifier
  – Naïve Bayes Classifier

Page 19: Advanced Multimedia Text Classification Tamara Berg.

Classifiers

• Today we’ll talk about 2 simple kinds of classifiers:
  – Nearest Neighbor Classifier
  – Naïve Bayes Classifier

Page 20: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors

Page 21: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors: Represent documents as a "bag of words"

Page 22: Advanced Multimedia Text Classification Tamara Berg.

Example

• Doc1 = "the quick brown fox jumped"
• Doc2 = "brown quick jumped fox the"

Page 23: Advanced Multimedia Text Classification Tamara Berg.

Example

• Doc1 = "the quick brown fox jumped"
• Doc2 = "brown quick jumped fox the"

Would a bag of words model represent these two documents differently?
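A minimal Python sketch (my own, not from the slides) illustrating the point: once word order is discarded, these two documents reduce to the same bag of counts.

```python
from collections import Counter

doc1 = "the quick brown fox jumped"
doc2 = "brown quick jumped fox the"

# A bag of words keeps only term counts and discards word order
bag1 = Counter(doc1.split())
bag2 = Counter(doc2.split())

print(bag1 == bag2)  # True: both documents have the same bag-of-words representation
```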

Page 24: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors
Documents are represented as "bags of words"
Represented as vectors when used computationally

• Each vector holds a place for every term in the collection
• Therefore, most vectors are sparse

Slide from Mitch Marcus

Page 25: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors
Documents are represented as "bags of words"
Represented as vectors when used computationally

• Each vector holds a place for every term in the collection
• Therefore, most vectors are sparse

Slide from Mitch Marcus

Lexicon – the vocabulary set that you consider to be valid words in your documents.

Usually stemmed (e.g. running->run)

Page 26: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors: One location for each word.

[Table: term counts for documents A–I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.]

"Nova" occurs 10 times in text A
"Galaxy" occurs 5 times in text A
"Heat" occurs 3 times in text A
(Blank means 0 occurrences.)

Slide from Mitch Marcus

Page 27: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors

[Table: term counts for document ids A–I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.]

Slide from Mitch Marcus

Page 28: Advanced Multimedia Text Classification Tamara Berg.

Vector Space Model

Documents are represented as vectors in term space
• Terms are usually stems
• Documents represented by vectors of terms

Queries represented the same as documents

Slide from Mitch Marcus

Page 29: Advanced Multimedia Text Classification Tamara Berg.

Vector Space Model

Documents are represented as vectors in term space
• Terms are usually stems
• Documents represented by vectors of terms

Queries represented the same as documents.
A vector distance measure between the query and documents is used to rank retrieved documents.
• Query and document similarity is based on length and direction of their vectors
• Terms in a vector can be "weighted" in many ways

Slide from Mitch Marcus

Page 30: Advanced Multimedia Text Classification Tamara Berg.

Document Vectors

[Table: term counts for document ids A–I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.]

Slide from Mitch Marcus

Page 31: Advanced Multimedia Text Classification Tamara Berg.

Similarity between documents

A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]

Page 32: Advanced Multimedia Text Classification Tamara Berg.

Similarity between documents

A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]

Treat the vectors as binary: similarity = number of words in common.

Sb(A,G) = ?
Sb(A,E) = ?
Sb(G,E) = ?

Which pair of documents is the most similar?
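A small Python sketch (my own illustration, not from the slides) of this binary overlap measure, applied to the three vectors above:

```python
def binary_similarity(x, y):
    """Sb: number of terms that occur (count > 0) in both documents."""
    return sum(1 for a, b in zip(x, y) if a > 0 and b > 0)

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(binary_similarity(A, G))  # 2 (nova and heat in common)
print(binary_similarity(A, E))  # 0
print(binary_similarity(G, E))  # 1 (role in common)
```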

Page 33: Advanced Multimedia Text Classification Tamara Berg.

Similarity between documents

A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]

Sum of Squared Distances: SSD(X, Y) = Σ_{i=1..n} (X_i − Y_i)²

SSD(A,G) = ?
SSD(A,E) = ?
SSD(G,E) = ?

Which pair of documents is the most similar?
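A corresponding Python sketch (mine, not from the slides) computing SSD for the same vectors; note that a smaller SSD means the documents are more similar:

```python
def ssd(x, y):
    """Sum of squared distances between two term-count vectors (lower = more similar)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(ssd(A, G))  # 147
print(ssd(A, E))  # 334
print(ssd(G, E))  # 175
```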

Page 34: Advanced Multimedia Text Classification Tamara Berg.

Similarity between documents

A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]

Angle between vectors: cos(θ) = (a · b) / (|a| |b|)

Dot product: a · b = Σ_{i=1..n} a_i b_i

Length (Euclidean norm): |a| = sqrt(a_1² + a_2² + ... + a_n²)
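A short Python sketch (my own illustration) of cosine similarity built from the dot product and Euclidean norms above:

```python
import math

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(round(cosine_similarity(A, G), 3))  # 0.493
print(round(cosine_similarity(A, E), 3))  # 0.0
print(round(cosine_similarity(G, E), 3))  # 0.511
```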

Page 35: Advanced Multimedia Text Classification Tamara Berg.

Some words give more information than others

• Does the fact that two documents both contain the word “the” tell us anything? How about “and”?

• Stop words: extremely common words (e.g. "the", "and", "of") that carry little information and are often removed.

Page 36: Advanced Multimedia Text Classification Tamara Berg.

Assigning Weights to Terms
• Binary weights
• Raw term frequency
• tf x idf

• Want to weight terms highly if they are:
  – frequent in relevant documents … BUT
  – infrequent in the collection as a whole

Slide from Mitch Marcus

Page 37: Advanced Multimedia Text Classification Tamara Berg.

TF x IDF Weights

tf x idf measure:
• Term Frequency (tf) – how often a term appears in a document
• Inverse Document Frequency (idf) – a way to deal with terms that are frequent across many documents

Goal: Assign a tf * idf weight to each term in each document

Slide from Mitch Marcus

Page 38: Advanced Multimedia Text Classification Tamara Berg.

TF x IDF Calculation

w_ik = tf_ik × log(N / n_k)

T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in C

N = total number of documents in the collection C

n_k = the number of documents in C that contain T_k

idf_k = log(N / n_k)

Slide from Mitch Marcus
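A brief Python sketch (my illustration; the toy documents are hypothetical) of computing these tf x idf weights over a tiny collection:

```python
import math
from collections import Counter

# Hypothetical toy collection C
docs = [
    "the quick brown fox jumped",
    "the lazy dog slept",
    "the quick dog jumped",
]

N = len(docs)                                     # total number of documents in C
term_counts = [Counter(d.split()) for d in docs]  # tf_ik for each document

# n_k: number of documents that contain term k
doc_freq = Counter()
for counts in term_counts:
    doc_freq.update(counts.keys())

def tf_idf(i, term):
    """w_ik = tf_ik * log(N / n_k)."""
    return term_counts[i][term] * math.log10(N / doc_freq[term])

print(tf_idf(0, "brown"))  # rare term -> positive weight
print(tf_idf(0, "the"))    # occurs in every document -> log(3/3) = 0
```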

Page 39: Advanced Multimedia Text Classification Tamara Berg.

TF x IDF Calculation

w_ik = tf_ik × log(N / n_k)

T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in C

N = total number of documents in the collection C

n_k = the number of documents in C that contain T_k

idf_k = log(N / n_k)

Slide from Mitch Marcus

Page 40: Advanced Multimedia Text Classification Tamara Berg.

TF x IDF Calculation

w_ik = tf_ik × log(N / n_k)

T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in C

N = total number of documents in the collection C

n_k = the number of documents in C that contain T_k

idf_k = log(N / n_k)

Slide from Mitch Marcus

Page 41: Advanced Multimedia Text Classification Tamara Berg.

Inverse Document Frequency

IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

n_k = 1:      idf = log(10000/1)     = 4
n_k = 20:     idf = log(10000/20)    = 2.698
n_k = 5000:   idf = log(10000/5000)  = 0.301
n_k = 10000:  idf = log(10000/10000) = 0

Slide from Mitch Marcus
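A quick check (my own snippet, assuming log base 10, which is what the slide's numbers correspond to) of those idf values:

```python
import math

# idf values for a collection of N = 10000 documents, log base 10
N = 10000
for n_k in (1, 20, 5000, 10000):
    print(n_k, round(math.log10(N / n_k), 2))  # 4.0, 2.7, 0.3, 0.0
```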

Page 42: Advanced Multimedia Text Classification Tamara Berg.

TF x IDF Normalization

w_ik = tf_ik × log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² × [log(N / n_k)]² )

Normalize the term weights (so longer documents are not unfairly given more weight)
• The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.

Slide from Mitch Marcus
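A short Python sketch (mine; the counts and collection size are hypothetical) of this normalization, dividing each tf x idf weight by the Euclidean norm of the document's weight vector:

```python
import math

def normalized_tfidf(tf_counts, doc_freq, N):
    """Cosine-normalized tf x idf weights for one document.

    tf_counts: term -> raw count in this document
    doc_freq:  term -> number of documents in the collection containing the term (n_k)
    N:         total number of documents in the collection
    """
    raw = {t: tf * math.log10(N / doc_freq[t]) for t, tf in tf_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

# Hypothetical document counts in a 10-document collection
weights = normalized_tfidf({"nova": 3, "galaxy": 1}, {"nova": 2, "galaxy": 8}, N=10)
print(weights)
print(sum(w * w for w in weights.values()))  # squared weights now sum to 1.0
```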

Page 43: Advanced Multimedia Text Classification Tamara Berg.

TF x IDF Normalization

w_ik = tf_ik × log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² × [log(N / n_k)]² )

Normalize the term weights (so longer documents are not unfairly given more weight)
• The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.

Slide from Mitch Marcus

What’s this?

Page 44: Advanced Multimedia Text Classification Tamara Berg.

Pair-wise Document Similarity

D1 = (w_11, w_12, ..., w_1n)
D2 = (w_21, w_22, ..., w_2n)

sim(D1, D2) = Σ_{i=1..n} (w_1i × w_2i) / ( sqrt(Σ_{i=1..n} (w_1i)²) × sqrt(Σ_{i=1..n} (w_2i)²) )

Documents now represented as vectors of TFxIDF weights

Similarity can be computed as usual on these new weight vectors (e.g. cos(θ) here)

Slide from Mitch Marcus

[Table: example tf x idf weight vectors for documents A–D over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.]

Page 45: Advanced Multimedia Text Classification Tamara Berg.

Pair-wise Document Similarity

D1 = (w_11, w_12, ..., w_1n)
D2 = (w_21, w_22, ..., w_2n)

sim(D1, D2) = Σ_{i=1..n} (w_1i × w_2i) / ( sqrt(Σ_{i=1..n} (w_1i)²) × sqrt(Σ_{i=1..n} (w_2i)²) )

Documents now represented as vectors of TFxIDF weights

Similarity can be computed as usual on these new weight vectors (e.g. cos(θ) here)

Slide from Mitch Marcus

Page 46: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?

Page 47: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Query document – which class should you label it with?

Page 48: Advanced Multimedia Text Classification Tamara Berg.

Classification by Nearest Neighbor

Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find most similar doc)

Slide from Min-Yen Kan

Page 49: Advanced Multimedia Text Classification Tamara Berg.

Classification by kNN

Classify the test document as the majority class of the k documents "nearest" to the query document.

Slide from Min-Yen Kan
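A compact Python sketch (my own illustration; the training vectors and labels are hypothetical) of kNN classification using cosine similarity over document vectors:

```python
import math
from collections import Counter

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(query, training, k=3):
    """Return the majority class among the k training documents most similar to the query."""
    nearest = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training documents (term-weight vectors) with class labels
training = [
    ([10, 5, 3, 0], "astronomy"),
    ([9, 4, 0, 1], "astronomy"),
    ([0, 1, 8, 9], "film"),
    ([1, 0, 7, 8], "film"),
]
print(knn_classify([8, 3, 1, 0], training, k=3))  # "astronomy"
```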

Page 50: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 51: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 52: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 53: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 54: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Page 55: Advanced Multimedia Text Classification Tamara Berg.
Page 56: Advanced Multimedia Text Classification Tamara Berg.

Slide from Min-Yen Kan

Classification by kNN

What are the features? What's the training data? Testing data? Parameters?

Page 57: Advanced Multimedia Text Classification Tamara Berg.

kNN demo

http://archive.ics.uci.edu/ml/datasets/Spambase

Page 58: Advanced Multimedia Text Classification Tamara Berg.

Classifiers

• Today we’ll talk about 2 simple kinds of classifiers:
  – Nearest Neighbor Classifier
  – Naïve Bayes Classifier

Nearest neighbor treats all features equally whether or not they matter for the classification task (even tfxIDF weights are independent of document class). Naïve Bayes lets us learn which features are most indicative for classification.

Page 59: Advanced Multimedia Text Classification Tamara Berg.
Page 60: Advanced Multimedia Text Classification Tamara Berg.

Random Variables

A random variable is some aspect of the world about which we (may) have uncertainty.

Random variables can be:
• Binary (e.g. {true, false}, {spam, ham})
• Take on a discrete set of values (e.g. {Spring, Summer, Fall, Winter})
• Or be continuous (e.g. [0, 1])

Page 61: Advanced Multimedia Text Classification Tamara Berg.
Page 62: Advanced Multimedia Text Classification Tamara Berg.
Page 63: Advanced Multimedia Text Classification Tamara Berg.
Page 64: Advanced Multimedia Text Classification Tamara Berg.
Page 65: Advanced Multimedia Text Classification Tamara Berg.
Page 66: Advanced Multimedia Text Classification Tamara Berg.
Page 67: Advanced Multimedia Text Classification Tamara Berg.
Page 68: Advanced Multimedia Text Classification Tamara Berg.

Arrows encode conditional independence: Toothache is independent of Catch given Cavity.

Page 69: Advanced Multimedia Text Classification Tamara Berg.
Page 70: Advanced Multimedia Text Classification Tamara Berg.
Page 71: Advanced Multimedia Text Classification Tamara Berg.

Y = class (e.g. spam/ham)

F_1, F_2, ..., F_n = features (e.g. words)

Page 72: Advanced Multimedia Text Classification Tamara Berg.
Page 73: Advanced Multimedia Text Classification Tamara Berg.
Page 74: Advanced Multimedia Text Classification Tamara Berg.
Page 75: Advanced Multimedia Text Classification Tamara Berg.
Page 76: Advanced Multimedia Text Classification Tamara Berg.

Percentage of documents in training set labeled as spam/ham

Page 77: Advanced Multimedia Text Classification Tamara Berg.

In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words).

Page 78: Advanced Multimedia Text Classification Tamara Berg.

In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words).
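A minimal Python sketch (my illustration; the training documents are made up, and the add-one smoothing is my addition, not from the slides) of estimating these Naïve Bayes parameters from labeled documents:

```python
from collections import Counter

# Hypothetical labeled training documents
training = [
    ("buy cheap pills now", "spam"),
    ("cheap pills cheap", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch meeting today", "ham"),
]

class_counts = Counter(label for _, label in training)
word_counts = {c: Counter() for c in class_counts}
for text, label in training:
    word_counts[label].update(text.split())

# P(C): percentage of training documents labeled with each class
prior = {c: n / len(training) for c, n in class_counts.items()}

# P(w | C): occurrence fraction of each word within the documents of a class,
# with add-one smoothing over the vocabulary so unseen words don't get probability 0
vocab = {w for counts in word_counts.values() for w in counts}
likelihood = {
    c: {w: (counts[w] + 1) / (sum(counts.values()) + len(vocab)) for w in vocab}
    for c, counts in word_counts.items()
}

print(prior)                                            # {'spam': 0.5, 'ham': 0.5}
print(likelihood["spam"]["cheap"], likelihood["ham"]["cheap"])
```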

Page 79: Advanced Multimedia Text Classification Tamara Berg.

Classification

The class that maximizes:

C_MAP = argmax_C P(C) × Π_i P(W_i | C)

How would you do this?

Page 80: Advanced Multimedia Text Classification Tamara Berg.

In Practice:

C_MAP = argmax_C [ log P(C) + Σ_i log P(W_i | C) ]
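A short Python sketch (my own; the parameter values are hypothetical) of classifying with this log-space rule:

```python
import math

# Hypothetical, already-estimated parameters (e.g. from the sketch above)
prior = {"spam": 0.5, "ham": 0.5}
likelihood = {
    "spam": {"cheap": 0.3, "pills": 0.2, "meeting": 0.05},
    "ham":  {"cheap": 0.05, "pills": 0.05, "meeting": 0.3},
}

def naive_bayes_classify(text):
    """Return argmax_C [ log P(C) + sum_i log P(W_i | C) ], skipping words unseen in training."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for word in text.split():
            if word in likelihood[c]:
                score += math.log(likelihood[c][word])
        scores[c] = score
    return max(scores, key=scores.get)

print(naive_bayes_classify("cheap pills"))  # "spam"
```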

Page 81: Advanced Multimedia Text Classification Tamara Berg.

Naïve Bayes

• Is this a “bag of words” model?

Page 82: Advanced Multimedia Text Classification Tamara Berg.

Naïve Bayes is Not So Naïve

• First and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model – predict whether the recipient of mail will actually respond to the advertisement (750,000 records).
• Robust to irrelevant features: irrelevant features cancel each other out without affecting results.
• Very good in domains with many equally important features.
• A good, dependable baseline for text classification (but not the best)!
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem.
• Very fast: learning requires just one pass over the data; testing is linear in the number of attributes and the document collection size.
• Low storage requirements.

Slide from Mitch Marcus