Transcript

Feature Selection
Advanced Statistical Methods in NLP

Ling 572, January 24, 2012

Roadmap

Feature representations:
  Features in attribute-value matrices
  Motivation: text classification

Managing features:
  General approaches
  Feature selection techniques
  Feature scoring measures
  Alternative feature weighting

Chi-squared feature selection


Representing Input: Attribute-Value Matrix

            f1         f2        ...    fm      Label
            Currency   Country          Date
x1 = Doc1   1          1         0.3    0       Spam
x2 = Doc2   1          1         1.75   1       Spam
...
xn = Doc4   0          0         0      2       NotSpam

Choosing features:
• Define features – e.g. with feature templates
• Instantiate features
• Perform dimensionality reduction

Weighting features: increase/decrease feature importance
• Global feature weighting: weight a whole column
• Local feature weighting: weight an individual cell, conditioned on the instance
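
As a concrete sketch of how word features might be instantiated into such a matrix (the documents, feature values, and labels below are made up for illustration):

```python
def build_matrix(docs, labels):
    """Minimal sketch: instantiate one binary feature per word seen in
    training and fill the attribute-value matrix row by row."""
    vocab = sorted({t for doc in docs for t in doc})   # feature instantiation
    rows = [[1 if f in doc else 0 for f in vocab] for doc in docs]
    return vocab, rows, labels

docs = [["send", "currency", "now"], ["meeting", "agenda"]]
vocab, rows, labels = build_matrix(docs, ["Spam", "NotSpam"])
print(vocab)   # the feature columns f1..fm
print(rows)    # one row per document xi
```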


Feature Selection Example

Task: Text classification

Feature template definition: Word – just one template

Feature instantiation: Words from training (and test?) data

Feature selection: Stopword removal – remove top K (~100) highest-frequency words
  Words like: the, a, have, is, to, for, …

Feature weighting: Apply tf*idf feature weighting
  tf = term frequency; idf = inverse document frequency
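
A minimal sketch of the stopword-removal step above, dropping the K most frequent words; counting raw corpus frequency (rather than document frequency) is an assumption here:

```python
from collections import Counter

def remove_top_k_stopwords(docs, k=100):
    """Drop the K highest-frequency words from tokenized documents.
    K=100 mirrors the slide's ~100 suggestion."""
    freq = Counter(t for doc in docs for t in doc)
    stopwords = {t for t, _ in freq.most_common(k)}
    return [[t for t in doc if t not in stopwords] for doc in docs]
```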


The Curse of Dimensionality

Think of the instances as vectors of features: # of features = # of dimensions

Number of features potentially enormous: # of words in a corpus continues to increase with corpus size

High dimensionality is problematic:
  Leads to data sparseness
    Hard to create a valid model
    Hard to predict and generalize – think kNN
  Leads to high computational cost
  Leads to difficulty with estimation/learning
    More dimensions means more samples needed to learn the model


Breaking the Curse

Dimensionality reduction:
  Produce a representation with fewer dimensions, but with comparable performance

More formally, given an original feature set r, create a new set r' with |r'| < |r| and comparable performance

Functionally, many ML algorithms do not scale well:
  Expensive: training cost, testing cost
  Poor prediction: overfitting, sparseness


Dimensionality Reduction

Given an initial feature set r, create a feature set r' s.t. |r'| < |r|

Approaches:
  r': same for all classes (aka global), vs.
  r': different for each class (aka local)

  Feature selection/filtering, vs.
  Feature mapping (aka extraction)


Feature Selection

Feature selection: r' is a subset of r
How can we pick features?

Extrinsic ‘wrapper’ approaches:
  For each subset of features:
    Build, evaluate a classifier for some task
  Pick the subset of features with the best performance

Intrinsic ‘filtering’ methods:
  Use some intrinsic (statistical?) measure
  Pick the features with the highest scores
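
To make the cost of the wrapper idea concrete, a minimal sketch; train_and_evaluate is a hypothetical user-supplied function (train a classifier on the subset, return held-out accuracy), not something from the course:

```python
from itertools import combinations

def wrapper_select(features, train_and_evaluate):
    """Exhaustive wrapper selection: evaluate every non-empty feature subset.
    Requires on the order of 2^|r| training/testing runs, which is why this
    is intractable for realistic feature sets."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = train_and_evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```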


Feature Selection

Wrapper approach:
  Pros: Easy to understand and implement; clear relationship between selected features and task performance
  Cons: Computationally intractable: 2^|r| * (training + testing); specific to the task and classifier; ad hoc

Filtering approach:
  Pros: theoretical basis; less task- and classifier-specific
  Cons: Doesn't always boost task performance


Feature Mapping

Feature mapping (extraction) approaches:
  Features in r' represent combinations/transformations of features in r
  Example: many words are near-synonyms, but treated as unrelated
    Map them to a new concept representing all of them:
    big, large, huge, gigantic, enormous -> the concept of ‘bigness’

Examples:
  Term classes: e.g. class-based n-grams
    Derived from term clusters
  Dimensions in Latent Semantic Analysis (LSA/LSI)
    Result of Singular Value Decomposition (SVD) on the term-document matrix
    Produces the ‘closest’ rank-r' approximation of the original matrix
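
A minimal numpy sketch of the SVD step behind LSA; the toy matrix is made up, and a real setting would use a large (often tf*idf-weighted) term-document matrix:

```python
import numpy as np

def lsa_reduce(X, k):
    """Reduce a (documents x terms) matrix X to k latent dimensions via
    truncated SVD; keeping the top-k singular values gives the closest
    rank-k approximation of X in the least-squares sense."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]          # documents in the k-dimensional space

X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)
print(lsa_reduce(X, k=2).shape)      # (4, 2)
```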


Feature Mapping

Pros:
  Data-driven
  Theoretical basis – guarantees on matrix similarity
  Not bound by the initial feature space

Cons:
  Some ad-hoc factors: e.g. # of dimensions
  Resulting feature space can be hard to interpret


Feature Filtering

Filtering approaches:
  Apply some scoring method to features to rank their informativeness or importance w.r.t. some class
  Fairly fast and classifier-independent

Many different measures:
  Mutual information
  Information gain
  Chi-squared
  etc.
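
A minimal sketch of the score-and-rank idea (the scores below are hypothetical, not computed from real data):

```python
def filter_select(scores, k):
    """Filtering: given precomputed feature scores (e.g. MI, information
    gain, or chi-squared values), keep the top-k features. Unlike the
    wrapper approach, no classifier is trained."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

scores = {"viagra": 38.2, "free": 12.7, "meeting": 0.4, "the": 0.01}
print(filter_select(scores, k=2))    # ['viagra', 'free']
```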

Feature Scoring Measures


Basic Notation, Distributions

Assume a binary representation of terms and classes:
  tk: term in T; ci: class in C

P(tk): proportion of documents in which tk appears
P(ci): proportion of documents of class ci

Binary, so P(!tk) = 1 - P(tk) and P(!ci) = 1 - P(ci)

Setting Up

        !ci    ci
!tk      a      b
 tk      c      d
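
Here a, b, c, d are presumably document counts, so N = a + b + c + d. The per-cell probability estimates revealed step by step on the original slides are not in the transcript; under this reading, the standard maximum-likelihood estimates would be:

```latex
P(t_k) \approx \frac{c+d}{N}, \qquad
P(c_i) \approx \frac{b+d}{N}, \qquad
P(t_k, c_i) \approx \frac{d}{N}, \qquad
P(t_k \mid c_i) \approx \frac{d}{b+d}
```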


Feature Selection Functions

Question: What makes a good feature?

Perspective: the best features are those that are most DIFFERENTLY distributed across classes

I.e. the best features are those that most effectively differentiate between classes


Term Selection Functions: DF

Document frequency (DF): the number of documents in which tk appears

Applying DF: remove terms with DF below some threshold

Intuition: very rare terms won't help with categorization, or are not useful globally

Pros: easy to implement, scalable
Cons: ad hoc; low-DF terms may be ‘topical’
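
A minimal sketch of DF thresholding (the documents and threshold are made up for illustration):

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep terms whose document frequency is at least min_df."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each term once per document
    return {t for t, n in df.items() if n >= min_df}

docs = [["cheap", "meds", "now"], ["meeting", "now"], ["cheap", "now"]]
print(sorted(df_filter(docs, min_df=2)))   # ['cheap', 'now']
```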


Term Selection Functions: MI

Pointwise Mutual Information (MI)

MI = 0 if t and c are independent

Issue: can be heavily influenced by marginal probabilities
  Problem when comparing terms of differing frequencies
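
The PMI formula itself is an image on the original slide; its standard form, rewritten in terms of the 2x2 cells above (an assumption, not transcribed from the slide):

```latex
MI(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}
             \approx \log \frac{d \cdot N}{(c+d)(b+d)}
```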

Term Selection Functions: IG

Information Gain
Intuition: Transmitting Y, how many bits can we save if we know X?
IG(Y, Xi) = H(Y) - H(Y | Xi)

Information Gain: Derivation

From F. Xia, ‘11
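
The derivation itself is not reproduced in this transcript; for reference, the information-gain score as usually written for term selection (e.g. following Yang and Pedersen, 1997; this is an assumption about what the slide shows):

```latex
IG(t_k) = -\sum_i P(c_i)\log P(c_i)
          + P(t_k)\sum_i P(c_i \mid t_k)\log P(c_i \mid t_k)
          + P(\bar{t}_k)\sum_i P(c_i \mid \bar{t}_k)\log P(c_i \mid \bar{t}_k)
```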


More Feature Selection

GSS coefficient:

NGL coefficient: N = # of docs

Chi-square:

From F. Xia, '11
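
The three formulas are images on the original slides; their usual definitions in terms of the 2x2 cells (a sketch following standard treatments such as Sebastiani 2002, not transcribed from the slides):

```latex
GSS(t_k, c_i) = \frac{ad - bc}{N^2}
\qquad
NGL(t_k, c_i) = \frac{\sqrt{N}\,(ad - bc)}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
\qquad
\chi^2(t_k, c_i) = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
```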


More Term Selection

Relevancy score:

Odds Ratio:

From F. Xia, '11
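
These formulas are also images on the slides; only the odds ratio is sketched here, in its usual form (an assumption, not transcribed):

```latex
OR(t_k, c_i) = \frac{P(t_k \mid c_i)\,\bigl(1 - P(t_k \mid \bar{c}_i)\bigr)}
                    {\bigl(1 - P(t_k \mid c_i)\bigr)\,P(t_k \mid \bar{c}_i)}
             \approx \frac{a\,d}{b\,c}
```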

Global Selection

Previous measures compute class-specific selection scores.

What if you want to filter across ALL classes? Compute an aggregate measure across classes:

Sum:
Average:
Max:

From F. Xia, ‘11
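
The aggregation formulas are images on the slides; the standard ways to turn a per-class score f(tk, ci) into a single global score are roughly as follows (the average is often weighted by P(ci), as in Yang and Pedersen, 1997):

```latex
f_{sum}(t_k) = \sum_{i=1}^{|C|} f(t_k, c_i), \qquad
f_{avg}(t_k) = \sum_{i=1}^{|C|} P(c_i)\, f(t_k, c_i), \qquad
f_{max}(t_k) = \max_{i} f(t_k, c_i)
```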

What's the Best?

Answer: it depends on ...
  Classifiers
  Type of data
  ...

According to (Yang and Pedersen, 1997), on text classification tasks using kNN:
  {OR, NGL, GSS} > {X^2_max, IG_sum} > {#_avg} >> {MI}

From F. Xia, ‘11

Feature Weighting

For text classification, typical weights include:

Binary: weights in {0, 1}

Term frequency (tf): # of occurrences of tk in document di

Inverse document frequency (idf): dfk = # of docs in which tk appears; N = # of docs
  idf = log(N / (1 + dfk))

tfidf = tf * idf
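
A minimal sketch of tf*idf weighting using the idf formulation above (the example documents are made up):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: tf*idf} dict per tokenized document, with
    idf = log(N / (1 + df_k)) as on the slide."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    idf = {t: math.log(N / (1 + n)) for t, n in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["free", "money", "free"], ["meeting", "money"], ["meeting", "notes"]]
print(tfidf_weights(docs)[0])   # {'free': ~0.81, 'money': 0.0}
```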

Chi Square

Tests for the presence/absence of a relation between random variables
  Bivariate analysis: tests 2 random variables
Can test the strength of a relationship
(Strictly speaking) doesn't test the direction

Chi Square Example

Can gender predict shoe choice?
  A: male/female (the features)
  B: shoe choice (the classes: {sandal, sneaker, ...})

         sandal  sneaker  leather shoe  boot  other
Male        6      17          13         9     5
Female     13       5           7        16     9

Due to F. Xia


Comparing Distributions

Observed distribution (O):

         sandal  sneaker  leather shoe  boot  other
Male        6      17          13         9     5
Female     13       5           7        16     9

Expected distribution (E):

         sandal  sneaker  leather shoe  boot  other  Total
Male       9.5     11          10       12.5    7     50
Female     9.5     11          10       12.5    7     50
Total      19      22          20        25    14    100

Due to F. Xia

Computing Chi Square

Expected value for a cell = row_total * column_total / table_total

X^2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026

Calculating X^2

Tabulate the contingency table of observed values: O

Compute row and column totals

Compute the table of expected values, given the row/column totals
  Assuming no association

Compute X^2
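
A small sketch that follows these steps for the shoe example above; it reproduces the slide's statistic (14.027 to three decimals; the slide reports 14.026):

```python
observed = [
    [6, 17, 13, 9, 5],     # Male
    [13, 5, 7, 16, 9],     # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

# expected cell = row_total * column_total / table_total (no-association assumption)
expected = [[r * c / N for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print(round(chi2, 3))           # 14.027
```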

For a 2x2 Table

O:
        !ci    ci
!tk      a      b
 tk      c      d

E:
        !ci            ci             Total
!tk   (a+b)(a+c)/N   (a+b)(b+d)/N     a+b
 tk   (c+d)(a+c)/N   (c+d)(b+d)/N     c+d
Total     a+c            b+d           N
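
Plugging these expected values into X^2 = sum over cells of (O - E)^2 / E and simplifying gives the closed form commonly used for chi-squared term selection (the same expression sketched earlier):

```latex
\chi^2(t_k, c_i) = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
```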

X^2 Test

Test whether the random variables are independent
  Null hypothesis: the 2 R.V.s are independent

Compute the X^2 statistic

Compute the degrees of freedom: df = (# rows - 1)(# cols - 1)
  Shoe example: df = (2-1)(5-1) = 4

Test the probability of the X^2 statistic value against a X^2 table
  If the probability is low – below some significance level – we can reject the null hypothesis
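
To check the shoe example end to end, a sketch assuming SciPy is available; scipy.stats.chi2_contingency returns the statistic, p-value, degrees of freedom, and expected table:

```python
from scipy.stats import chi2_contingency

observed = [[6, 17, 13, 9, 5],
            [13, 5, 7, 16, 9]]

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), dof, round(p, 4))   # 14.027 4 0.0072 -> reject at the 0.05 level
```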

Requirements for the X^2 Test

Events assumed independent and drawn from the same distribution
Outcomes must be mutually exclusive
Use raw frequencies, not percentages
Sufficient (expected) values per cell: > 5
