CS546: Machine Learning and Natural Language

Discriminative vs Generative Classifiers

This lecture is based on the (Ng & Jordan, 02) paper; some slides are based on Tom Mitchell's slides


Outline

Reminder: Naive Bayes and Logistic Regression (MaxEnt)

Asymptotic analysis: what is better if you have an infinite dataset?

Non-asymptotic analysis: what is the rate of convergence of the parameters? More importantly: the convergence of the expected error

Empirical evaluation

Why this lecture?

A nice and simple application of the large-deviation bounds we considered before

We will specifically analyze NB vs logistic regression, but we "hope" it generalizes to other models (e.g., models for sequence labeling or parsing)



Discriminative vs Generative

Training classifiers involves estimating f: X → Y, or P(Y|X)

Discriminative classifiers (conditional models): assume some functional form for P(Y|X) and estimate the parameters of P(Y|X) directly from the training data

Generative classifiers (joint models): assume some functional form for P(X|Y) and P(Y), estimate the parameters of P(X|Y) and P(Y) directly from the training data, and use Bayes rule to calculate P(Y|X = x_i)
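For reference, the Bayes-rule step mentioned above is just the standard identity (written out here explicitly, not a formula from the slide):

$P(Y = y \mid X = x_i) = \dfrac{P(X = x_i \mid Y = y)\, P(Y = y)}{\sum_{y'} P(X = x_i \mid Y = y')\, P(Y = y')}$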



Naive Bayes


Example: assume Y is boolean and X = <x1, x2, …, xn>, where the xi are binary

Generative model: Naive Bayes

Classify a new example x based on the ratio P(Y = 1 | x) / P(Y = 0 | x)

You can do it in log-scale (compare the log-odds)

In the slide's formulas, s indicates the size of a set and l is the smoothing parameter
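To make this concrete, here is a minimal Python sketch of binary-feature Naive Bayes with additive smoothing (my own illustration, not code from the lecture; the name l mirrors the slide's smoothing parameter, everything else is an arbitrary choice):

import numpy as np

def train_nb(X, y, l=1.0):
    # Estimate Naive Bayes parameters for binary features X (m x n) and labels y in {0, 1}.
    # l is the additive (Laplace) smoothing parameter.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    prior_pos = n_pos / len(y)                              # estimate of P(Y = 1)
    p_pos = (X[y == 1].sum(axis=0) + l) / (n_pos + 2 * l)   # estimates of P(x_i = 1 | Y = 1)
    p_neg = (X[y == 0].sum(axis=0) + l) / (n_neg + 2 * l)   # estimates of P(x_i = 1 | Y = 0)
    return prior_pos, p_pos, p_neg

def log_odds(x, prior_pos, p_pos, p_neg):
    # log P(Y = 1 | x) - log P(Y = 0 | x); classify as 1 iff this is > 0.
    x = np.asarray(x, dtype=float)
    s = np.log(prior_pos) - np.log(1 - prior_pos)
    s += np.sum(x * (np.log(p_pos) - np.log(p_neg)))
    s += np.sum((1 - x) * (np.log(1 - p_pos) - np.log(1 - p_neg)))
    return s

Classification then reduces to checking the sign of this log-odds, which is exactly the log-scale ratio test the slide describes.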


Naive Bayes vs Logistic Regression


Generative model: Naive Bayes

Classify a new example x based on the ratio P(Y = 1 | x) / P(Y = 0 | x)

Logistic regression: model P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b))) directly

Recall: both classifiers are linear
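To see why both are linear (a standard derivation, written out here since the slide's formula is an image): for binary features, the Naive Bayes log-odds is

$\log \dfrac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = \log \dfrac{P(Y=1)}{P(Y=0)} + \sum_{i=1}^{n} \left[ x_i \log \dfrac{P(x_i=1 \mid Y=1)}{P(x_i=1 \mid Y=0)} + (1 - x_i) \log \dfrac{P(x_i=0 \mid Y=1)}{P(x_i=0 \mid Y=0)} \right] = w \cdot x + b$

which has the same functional form as the logistic-regression log-odds w·x + b; the two classifiers differ only in how w and b are estimated.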


What is the difference asymptotically?

Notation: let $\varepsilon(h_{A,m})$ denote the error of the hypothesis learned via algorithm A from m examples. If the Naive Bayes model assumptions are true: $\varepsilon(h_{Gen,\infty}) = \varepsilon(h_{Dis,\infty})$. Otherwise: $\varepsilon(h_{Gen,\infty}) \geq \varepsilon(h_{Dis,\infty})$

The logistic regression estimator is consistent:

• $\varepsilon(h_{Dis,m})$ converges to $\inf_{h \in H} \varepsilon(h)$

• H is the class of all linear classifiers

• Therefore, it is asymptotically at least as good as the linear classifier selected by the NB algorithm


Rate of convergence: logistic regression

Converges to the best linear classifier with a sample complexity on the order of n examples

This follows from Vapnik's structural risk bound (the VC-dimension of n-dimensional linear separators is n + 1)
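A sketch of the bound behind this claim (the standard VC-style uniform-convergence statement; the exact constants and log factors in the paper may differ): with probability at least $1 - \delta$,

$\varepsilon(h_{Dis,m}) \leq \varepsilon(h_{Dis,\infty}) + O\!\left( \sqrt{ \tfrac{n}{m} \log \tfrac{m}{n} + \tfrac{1}{m} \log \tfrac{1}{\delta} } \right)$

so the excess error over the best linear classifier becomes small once m grows linearly in n (up to logarithmic factors).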


Rate of convergence: Naive Bayes

We will proceed in 2 stages: first, consider how fast the parameters converge to their optimal values (we do not actually care about this by itself)

What we do care about: derive how this corresponds to the convergence of the error to the asymptotic error

The authors also consider a continuous case (where the input is continuous), but it is not very interesting for NLP

However, similar techniques apply


Convergence of Parameters
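Roughly, the lemma proved on the following slides states (my paraphrase of the discrete-case result in the paper; the exact constants on the slide may differ): for any $\epsilon, \delta > 0$, with $m = O\!\left(\tfrac{1}{\epsilon^2} \log \tfrac{n}{\delta}\right)$ training examples, with probability at least $1 - \delta$ every Naive Bayes parameter estimate (the class prior and each $P(x_i \mid y)$) is within $\epsilon$ of its asymptotic value.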



Recall: Chernoff Bound

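The slide's formula (an image) is presumably a form of the Chernoff bound for an empirical frequency; a standard additive (Hoeffding-style) version is: if $\hat{p}$ is the empirical frequency of an event with true probability p, estimated from m i.i.d. samples, then

$P\left( |\hat{p} - p| > \epsilon \right) \leq 2 \exp(-2 m \epsilon^2)$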


Recall: Union Bound

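Again, the standard statement (my phrasing, since the slide's formula is an image): for any events $A_1, \dots, A_k$,

$P(A_1 \cup \dots \cup A_k) \leq \sum_{i=1}^{k} P(A_i)$

so if each "bad" event has probability at most $\delta_0$, the probability that any of them occurs is at most $k \delta_0$.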


Proof of Lemma (no smoothing for simplicity)


By the Chernoff bound, with probability at least $1 - \delta_0$: the fraction of positive examples will be within $\epsilon$ of $P(y = 1)$

Therefore we have at least $(P(y=1) - \epsilon)\,m$ positive and $(P(y=0) - \epsilon)\,m$ negative examples

By the Chernoff bound, for every feature and class label (2n cases), the corresponding parameter estimate is within $\epsilon$ of its asymptotic value with probability at least $1 - \delta_0$

We have one "bad" event with probability at most $\delta_0$ and 2n "bad" events with probabilities at most $\delta_0$ each; their joint probability is not greater than the sum, $(2n + 1)\,\delta_0$

Set this sum to $\delta$, solve for m, and you get $m = O\!\left(\tfrac{1}{\epsilon^2} \log \tfrac{n}{\delta}\right)$
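A small simulation (entirely my own illustration; the sample size and the true parameters are arbitrary choices) shows the scaling the lemma predicts: at a fixed m, the worst deviation over all n parameter estimates grows only like $\sqrt{\log n}$, which is why a number of samples logarithmic in n suffices:

import numpy as np

rng = np.random.default_rng(0)
m = 2000                                        # fixed number of training examples
for n in [10, 100, 1000, 10000]:                # growing number of features
    p_true = rng.uniform(0.2, 0.8, size=n)      # true P(x_i = 1 | y), chosen arbitrarily
    X = rng.random((m, n)) < p_true             # m samples of n independent binary features
    p_hat = X.mean(axis=0)                      # empirical parameter estimates
    worst = np.abs(p_hat - p_true).max()        # worst-case deviation over all n parameters
    print(f"n = {n:6d}   max |p_hat - p_true| = {worst:.4f}")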


Implications

With a number of samples logarithmic in n (not linear, as for logistic regression!) the parameters of $h_{Gen,m}$ approach the parameters of $h_{Gen,\infty}$

Are we done? Not really: this does not automatically imply that the error $\varepsilon(h_{Gen,m})$ approaches $\varepsilon(h_{Gen,\infty})$ with the same rate


Implications

We need to show that $h_{Gen,m}$ and $h_{Gen,\infty}$ "often" agree if their parameters are close

We compare the log-scores given by the two models and check how often they fall on different sides of the decision threshold. I.e.:
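Writing this out in my own notation (the slide's equation is an image): the log-score of the model estimated from m examples is

$l_m(x) = \log \dfrac{\hat{P}(y=1)}{\hat{P}(y=0)} + \sum_{i=1}^{n} \log \dfrac{\hat{P}(x_i \mid y=1)}{\hat{P}(x_i \mid y=0)}$

and $l_\infty(x)$ is the same expression with the asymptotic parameters; $h_{Gen,m}$ and $h_{Gen,\infty}$ can disagree on x only if $l_m(x)$ and $l_\infty(x)$ have different signs.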


Convergence of Classifiers


G defines the fraction of points very close to the decision boundary

What is this fraction? See later
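In symbols (my paraphrase of the quantity the slide calls G): $G(\tau) = P_x\left( |l_\infty(x)| \leq \tau \right)$, the probability mass of examples whose asymptotic log-score lies within $\tau$ of the decision threshold; these are the only points on which a small parameter perturbation can flip the prediction.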


Proof of Theorem (sketch)

By the lemma, with high probability, the parameters of $h_{Gen,m}$ are within $\epsilon$ of those of $h_{Gen,\infty}$

It implies that every term in the log-score sum $l_m(x)$ is also within a small amount (proportional to $\epsilon$) of the corresponding term in $l_\infty(x)$, and hence the two log-scores differ by at most roughly $(n + 1)\epsilon$ (up to constants)

Let $\tau$ be this bound. So $h_{Gen,m}$ and $h_{Gen,\infty}$ can have different predictions only if $|l_\infty(x)| \leq \tau$

The probability of this event is $G(\tau)$


Convergence of Classifiers


G -- What is this fraction? This is somewhat more difficult



What to do with this theorem

This is easy to prove; no proof here, just the intuition: a fraction of the terms in the log-score sum have large expectation, and therefore the sum also has large expectation


What to do with this theorem

But this is weaker than what we need: we have that the expectation is "large", whereas we need that the probability of small values is low

What about the Chebyshev inequality?

The terms are not independent ... How to deal with it?
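For reference, the Chebyshev inequality (standard statement, my phrasing): for a random variable Z with finite variance,

$P\left( |Z - \mathbb{E}[Z]| \geq t \right) \leq \dfrac{\mathrm{Var}(Z)}{t^2}$

Applying it to the log-score sum requires controlling the variance of the sum, which is where the dependence between the terms becomes the issue.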


Corollary from the theorem

Is this condition realistic? Yes (e.g., we can show it holds under rather realistic assumptions)


Empirical Evaluation (UCI dataset)

Dashed line is logistic regression, solid line is Naive Bayes


Empirical Evaluation (UCI dataset)

Dashed line is logistic regression, solid line is Naive Bayes


Empirical Evaluation (UCI dataset)

Dashed line is logistic regression, solid line is Naive Bayes


Summary

Logistic regression has lower asymptotic error

... But Naive Bayes needs less data to approach its asymptotic error


First Assignment

I am still checking it; I will let you know by/on Friday

Note though: do not perform multiple tests (model selection) on the final test set! It is a form of cheating


Term Project / Substitution

This Friday I will distribute the first phase -- due after the Spring break

I will be away for the next 2 weeks:

The first week (Mar 9 – Mar 15): I will be slow to respond to email. I will be substituted for this week by:

Active Learning (Kevin Small)

Indirect Supervision (Alex Klementiev)

Presentation by Ryan Cunningham on Friday

The week Mar 16 – Mar 23: no lectures; work on the project, send questions if needed