1
Categorization/Classification
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of classes:
C = {c1, c2, …, cJ}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a classification function whose domain is X and whose range is C.
• We want to know how to build classification functions (“classifiers”).
2
More Text Classification Examples: Many search engine functionalities use classification
Assign labels to each document or web-page:
• Labels are most often topics such as Yahoo-categories
e.g., "finance," "sports," "news>world>asia>business"
• Labels may be opinion on a person/product
e.g., "like", "hate", "neutral"
• Labels may be domain-specific
3
Manual classification
– Very accurate when job is done by experts
– Consistent when the problem size and team is small
– Difficult and expensive to scale
• Means we need automatic classification methods for big problems
4
Classification Methods
• Supervised learning of a document-label assignment function
– Many systems partly rely on machine learning (MSN, Verity, Yahoo!, …)
• k-Nearest Neighbors (simple, powerful)
• Naive Bayes (simple, common method)
• … plus many other methods
• No free lunch: requires hand-classified training data
• Note that many commercial systems use a mixture of methods
5
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
6
Classification—A Two-Step Process
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set, otherwise over-fitting will occur
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
7
Process (1): Model Construction
Training Data

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classification Algorithms

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classifier (Model)
8
Process (2): Using the Model in Prediction
Classifier
Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
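A minimal Python sketch of these two steps, using the rule induced on the slide above; the helper and variable names are illustrative, not part of the original slides:

```python
# Rule learned in step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    """Apply the learned model (the rule above) to one tuple."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the labeled test set
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print("accuracy:", correct / len(test_set))  # 3/4: Merlisa is misclassified as 'yes'

# Step 2b: classify unseen data
print("Jeff tenured?", classify("Professor", 4))  # -> 'yes'
```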
9
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
10
The goal of the course
• Study supervised learning, specifically for text and hypertext documents
• Text
– Has a very large number of potential features, many of which are irrelevant.
• If the vector space model is used, each term is a potential feature.
– The number of distinct class labels is much larger than in structured learning scenarios.
11
Topics included in the course
• Evaluating text classifiers
• Classifiers
– NN learners
– Bayesian learners
– Hypertext classification
• Feature selection methods
12
Evaluating text classifiers
• Accuracy
– The ability to predict the correct class labels
– This is based on comparing the classifier-assigned labels with human-assigned labels
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Simplicity, speed, and scalability for document insertion, deletion and modification
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
13
Benchmarks
• Reuters
– Labeled documents: 10,700
– Number of terms: 30,000
– Number of categories: 135
• 20NG
– Labeled documents: 18,800
– Number of terms: 94,000
– Number of categories: 20
• WebKB
– Labeled documents: 8,300
– Number of categories: 7
14
Measures of accuracy
• Each document is associated with a subset of classes
– To avoid searching over the power set of class labels, many systems create a two-class problem for every class
• Two-way ensemble or one-vs.-rest technique
– Ensemble classifiers are evaluated on the basis of recall and precision
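A small sketch (with assumed helper names, not from the slides) of the one-vs.-rest reduction: each class gets its own two-class problem, whose classifier can then be scored with recall and precision:

```python
# One-vs.-rest reduction: for each class c, build the binary problem
# "does the document belong to c or not?". Names here are illustrative.
def one_vs_rest_problems(docs, labels_per_doc, classes):
    """Yield (class, [(doc, 0/1 label), ...]) pairs, one binary problem per class."""
    for c in classes:
        binary = [(doc, int(c in labels)) for doc, labels in zip(docs, labels_per_doc)]
        yield c, binary
```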
15
Classifier Accuracy Measures
             (guess) ~C1      (guess) C1
(true) ~C1   True negative    False positive
(true) C1    False negative   True positive
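From this contingency table, the standard definitions (not spelled out on the slide) are:

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]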
16
A combined measure: F
• Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):
• People usually use balanced F1 measure
– i.e., with β = 1 or α = ½ (β² = (1 − α)/α)

\[
F = \frac{1}{\alpha\,\frac{1}{P} + (1-\alpha)\,\frac{1}{R}}
  = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}
\]
17
F: Example
• precision?
• recall?
• F1?
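The example counts did not survive in this transcript; with hypothetical counts TP = 40, FP = 10, FN = 20, the computation would run:

\[
P = \frac{40}{40+10} = 0.8, \qquad
R = \frac{40}{40+20} \approx 0.67, \qquad
F_1 = \frac{2PR}{P+R} \approx 0.73
\]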
18
F: Why harmonic mean?
• The simple (arithmetic) mean is 50% for a "return-everything" search engine, which is too high.
• Desideratum: Punish really bad performance on either precision or recall.
– Taking the minimum achieves this.
– But the minimum is not smooth and is hard to weight.
– F (harmonic mean) is a kind of smooth minimum.
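As an illustrative pair of numbers (not from the slide): a "return-everything" engine with R = 1.0 and, say, P = 0.01 gets an arithmetic mean of about 0.5, but F1 = 2·0.01·1.0/1.01 ≈ 0.02, so the harmonic mean stays close to the worse of the two values.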
19
Nearest Neighbor Learner
• Basic idea
– Similar documents are expected to be assigned the same class label.
• Vector space model and cosine measure for similarity let us formalize the idea.
20
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space
• The nearest neighbors are defined in terms of cosine similarity
• k-NN returns the most common value among the k training examples nearest to xq
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: positive (+) and negative (−) training points around a query point xq, illustrating the 1-NN decision regions]
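A compact sketch of the algorithm (vector operations via NumPy; the function and variable names are illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_classify(x_q, train_vectors, train_labels, k=3):
    """Return the most common label among the k training documents
    most cosine-similar to the query vector x_q."""
    norms = np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(x_q)
    sims = train_vectors @ x_q / np.where(norms == 0, 1, norms)  # cosine similarities
    top_k = np.argsort(-sims)[:k]                                # indices of k nearest neighbors
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]
```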
21
Discussion on the k-NN Algorithm
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to their distance to the query xq
• Give greater weight to closer neighbors
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
– To overcome it, eliminate the least relevant attributes
\[
w \equiv \frac{1}{d(x_q, x_i)^2}
\]
22
Bayesian Methods
• Learning and classification methods based on probability theory.
• Bayes' theorem plays a critical role in probabilistic learning and classification.
• Build a generative model that approximates how data is produced.
• Uses prior probability of each category given no information about an item.
• Categorization produces a posterior probability distribution over the possible categories given a description of an item.
23
Bayes’ Rule
\[
P(C, D) = P(C \mid D)\,P(D) = P(D \mid C)\,P(C)
\]
\[
P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)}
\]
24
Naive Bayes Classifiers
Task: Classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, …, xn⟩, into one of the classes cj ∈ C.

\[
c_{MAP} = \operatorname*{argmax}_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
        = \operatorname*{argmax}_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}
        = \operatorname*{argmax}_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)
\]
25
Naïve Bayes Assumption
• P(cj)
– Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, …, xn | cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).
26
[Figure: naive Bayes graphical model — class node Flu with conditionally independent feature nodes X1…X5: runny nose, sinus, cough, fever, muscle-ache]
The Naïve Bayes Classifier
• Conditional Independence Assumption: features detect term presence and are independent of each other given the class:
• This model is appropriate for binary variables
– Multivariate Bernoulli model

\[
P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdot \ldots \cdot P(X_5 \mid C)
\]
27
Learning the Model
• First attempt: maximum likelihood estimates
– simply use the frequencies in the data

Classification
• Multinomial vs Multivariate Bernoulli
• Multinomial model is almost always more effective in text applications!
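A minimal multinomial Naive Bayes sketch using the frequency estimates described above; add-one smoothing is assumed here to avoid zero probabilities (the slides covering smoothing are missing from this transcript), and the function names are illustrative:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Estimate P(c) and P(term | c) from term frequencies (with add-one smoothing)."""
    classes = set(labels)
    vocab = {t for doc in docs for t in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    term_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)
    cond = {}
    for c in classes:
        total = sum(term_counts[c].values()) + len(vocab)
        cond[c] = {t: (term_counts[c][t] + 1) / total for t in vocab}
    return prior, cond, vocab

def classify_nb(doc, prior, cond, vocab):
    """Pick argmax_c of log P(c) + sum_i log P(x_i | c), ignoring unseen terms."""
    scores = {c: math.log(prior[c]) + sum(math.log(cond[c][t]) for t in doc if t in vocab)
              for c in prior}
    return max(scores, key=scores.get)
```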
47
Naive Bayes is Not So Naive
• Naïve Bayes: First and Second place in KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
Goal: Financial services industry direct mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement – 750,000 records.
• Robust to Irrelevant Features
Irrelevant features cancel each other without affecting results
Decision trees, in contrast, can heavily suffer from this.
• Very good in domains with many equally important features
Decision trees suffer from fragmentation in such cases – especially if there is little data
• A good dependable baseline for text classification (but not the best)!
• Optimal if the Independence Assumptions hold: If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• Very Fast: Learning with one pass of counting over the data; testing linear in the number of attributes, and document collection size
• Low Storage requirements
48
Hypertext classification
• Search engines assign heuristic weights to terms that occur in specific HTML tags
• Paying special attention to tags can help with supervised learning as well
49
Hypertext classification
• It is important to distinguish between the two occurrences of the word "surfing"
– resume.publication.title.surfing
– resume.hobbies.item.surfing
• Relations provide a uniform way to codify hypertextual features.
– Ex: contains-text(resume.hobbies.item, wind-surfing)
– Ex: links-to(source, destination)
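One simple way to realize this idea (illustrative, not from the slides) is to prefix each term with the tag path it occurs under, so the two occurrences of "surfing" become distinct features:

```python
def path_features(tag_path, terms):
    """Encode terms together with the HTML/XML tag path they occur under."""
    return [f"{tag_path}.{t}" for t in terms]

path_features("resume.publication.title", ["surfing"])   # ['resume.publication.title.surfing']
path_features("resume.hobbies.item", ["wind", "surfing"])  # distinct from the occurrence above
```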
50
Rule Induction
51
Rule Induction
• The outer loop learns new rules one at a time, removing positive examples covered by any rule generated thus far.
– When a new empty rule is initialized, its free variables can be bound in all possible ways
• The inner loop adds conjunctive literals to the new rule until no negative example is covered by the new rule.
– A heuristic is to pick a literal that rapidly increases the ratio of surviving positive to negative bindings (see the sketch below).
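A schematic sketch of this two-loop (sequential covering) structure; `candidate_literals` and `covers` are assumed helpers, and the scoring heuristic is a simplified stand-in for the bindings ratio mentioned above:

```python
def learn_rules(positives, negatives, candidate_literals, covers):
    """Outer loop: learn rules one at a time until every positive example is covered.
    Inner loop: add literals to the current rule until it covers no negative example."""
    def score(rule, pos, neg):
        # heuristic: favor literals that keep positives covered and drop negatives
        p = sum(covers(rule, e) for e in pos)
        n = sum(covers(rule, e) for e in neg)
        return p / (n + 1)

    rules, remaining_pos = [], list(positives)
    while remaining_pos:
        rule, surviving_neg = [], list(negatives)
        while surviving_neg:  # inner loop: grow the rule
            best = max(candidate_literals(rule),
                       key=lambda lit: score(rule + [lit], remaining_pos, surviving_neg))
            rule.append(best)
            surviving_neg = [e for e in surviving_neg if covers(rule, e)]
        rules.append(rule)
        remaining_pos = [e for e in remaining_pos if not covers(rule, e)]
    return rules
```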
52
Feature Selection: Why?
• Text collections have a large number of features
– 10,000 – 1,000,000 unique words … and more
• May make using a particular classifier feasible
– Some classifiers can't deal with 100,000s of features
• Reduces training time
– Training time for some methods is quadratic or worse in the number of features
• Can improve generalization (performance)
– Eliminates noise features
– Avoids overfitting
53
Feature selection: how?
• An easy one
– Ignore terms that are "too frequent" or "too rare" according to an empirically chosen threshold.
• General idea:
– Hypothesis testing statistics:
• Are we confident that the value of one categorical variable is associated with the value of another?
• Chi-square test
54
χ² statistic (CHI)
• χ² is interested in (fo – fe)²/fe summed over all table entries: is the observed number what you'd expect given the marginals?

observed: fo, expected: fe (in parentheses)

               Term = jaguar   Term ≠ jaguar
Class = auto   2 (0.25)        500 (502)
Class ≠ auto   3 (4.75)        9500 (9498)

\[
\chi^2(j, a) = \sum \frac{(O - E)^2}{E}
  = \frac{(2 - 0.25)^2}{0.25} + \frac{(3 - 4.75)^2}{4.75}
  + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498}
  = 12.9 \quad (p < .001)
\]

• The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).
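The arithmetic can be checked directly; the numbers below are taken from the table above:

```python
# chi-square for the jaguar/auto table: sum of (observed - expected)^2 / expected
observed = [2, 3, 500, 9500]
expected = [0.25, 4.75, 502, 9498]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 1))  # ~12.9, above the 10.83 critical value for p = .001
```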
55
χ² statistic (CHI)
There is a simpler formula for 2×2 χ²:

A = #(t, c)      B = #(t, ¬c)
C = #(¬t, c)     D = #(¬t, ¬c)
N = A + B + C + D
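The formula itself did not survive in this transcript; for a 2×2 table with the cell counts defined above, the usual closed form is:

\[
\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}
\]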
56
Feature Selection
• Chi-square
– Statistical foundation
– May select very slightly informative frequent terms