Statistical Methods for NLP Statistical Inference Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(27)

Statistical Methods for NLP - cl.lingfil.uu.senivre/master/StatMetLecture2.pdf · Y is finite and numerical. Statistical Methods for NLP 3(27) Stochastic Variables Frequency Functions

Jul 08, 2020


Statistical Methods for NLP

Statistical Inference

Joakim Nivre

Uppsala University
Department of Linguistics and Philology

joakim.nivre@lingfil.uu.se

Statistical Methods for NLP 1(27)


Stochastic Variables


• A stochastic variable (or random variable) $X$ is a function from a sample space $\Omega$ (the domain of $X$) to a value space $\Omega_X$ (the range of $X$).

• Examples:
  1. The part-of-speech of an arbitrary word from a corpus is a stochastic variable $X$ with $\Omega = \{w \mid w \text{ is a corpus token}\}$ and $\Omega_X = \{\text{noun}, \text{verb}, \text{adjective}, \ldots\}$.
  2. The sum of two dice is a stochastic variable $Y$ with $\Omega = \{(x, y) \mid 1 \le x, y \le 6\}$ and $\Omega_Y = \{z \mid 2 \le z \le 12\}$.


Types of Variables

• If $\Omega_X$ is a subset of the set of real numbers, then $X$ is said to be numerical; otherwise it is categorical.

• If $\Omega_X$ is finite or countably infinite, then $X$ is said to be discrete.

• Examples:
  1. The part-of-speech $X$ of an arbitrary word from a corpus is a discrete, categorical variable, since $\Omega_X$ is finite and not numerical.
  2. The sum $Y$ of two dice is a discrete, numerical variable, since $\Omega_Y$ is finite and numerical.


Frequency Functions

• The probability $P(X = x)$ of variable $X$ assuming value $x$ is given by the frequency function $f_X$:

$$f_X(x) = P(X = x)$$

• For discrete variables, this is equivalent to summing the probabilities of all outcomes in $\Omega$ that are mapped to $x$ by $X$:

$$f_X(x) = P(\{u \in \Omega \mid X(u) = x\}) = \sum_{u : X(u) = x} P(u)$$

• Example: The probability of sampling a noun from a corpus is the sum of the probabilities of sampling each noun.
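
The summation over outcomes can be made concrete with the two-dice variable $Y$ from the earlier example. The following is a minimal sketch, assuming fair dice (so every outcome has probability 1/36); the names `omega`, `p_outcome` and `f_Y` are illustrative and not part of the lecture.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of faces, assumed equally likely (fair dice).
omega = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(omega))  # P(u) = 1/36 for every outcome u


def Y(u):
    """The stochastic variable Y: maps an outcome (x, y) to the sum x + y."""
    return u[0] + u[1]


# f_Y(y) = sum of P(u) over all outcomes u with Y(u) = y
f_Y = defaultdict(Fraction)
for u in omega:
    f_Y[Y(u)] += p_outcome

for y in sorted(f_Y):
    print(y, f_Y[y])
```

Exact fractions are used instead of floats so that the values can be compared directly with hand calculations.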


Expectation

• Let $X$ be a discrete numerical variable with value space $\Omega_X$. The expectation of $X$, $E[X]$, is defined as follows:

$$E[X] = \sum_{x \in \Omega_X} x \cdot f_X(x)$$

• Example: The expectation of the sum $Y$ of two dice:

$$E[Y] = \sum_{y=2}^{12} y \cdot f_Y(y) = \frac{252}{36} = 7$$
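
The dice example can be checked numerically. This is a small sketch under the assumption of fair dice; the helper name `expectation` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Frequency function of the sum of two dice (fair dice assumed).
f_Y = {}
for x, y in product(range(1, 7), repeat=2):
    f_Y[x + y] = f_Y.get(x + y, Fraction(0)) + Fraction(1, 36)


def expectation(f):
    """E[X] = sum over all values x of x * f_X(x)."""
    return sum(x * p for x, p in f.items())


e_y = expectation(f_Y)
print(e_y)  # 7
```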


Variance

• Let $X$ be a discrete stochastic variable with expectation $\mu$. The variance of $X$, $\mathrm{Var}[X]$, is defined as follows:

$$\mathrm{Var}[X] = E[(X - \mu)^2] = \sum_{x \in \Omega_X} (x - \mu)^2 \cdot f_X(x)$$

• Example: The variance of the sum $Y$ of two dice:

$$\mathrm{Var}[Y] = \sum_{y=2}^{12} (y - 7)^2 \cdot f_Y(y) = \frac{210}{36} \approx 5.83$$

• If $X$ is a variable with variance $\sigma^2$, then $\sigma = \sqrt{\sigma^2}$ is the standard deviation of $X$.
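
As a quick check of the variance example, the sketch below recomputes $\mathrm{Var}[Y]$ and the standard deviation for two dice. Fair dice are assumed, and the function name `variance` is illustrative.

```python
import math
from fractions import Fraction
from itertools import product

# Frequency function of the sum of two dice (fair dice assumed).
f_Y = {}
for x, y in product(range(1, 7), repeat=2):
    f_Y[x + y] = f_Y.get(x + y, Fraction(0)) + Fraction(1, 36)


def variance(f):
    """Var[X] = sum over x of (x - mu)^2 * f_X(x), with mu = E[X]."""
    mu = sum(x * p for x, p in f.items())
    return sum((x - mu) ** 2 * p for x, p in f.items())


var_y = variance(f_Y)
std_y = math.sqrt(var_y)  # standard deviation sigma
print(var_y, round(std_y, 2))  # 35/6 2.42
```

The fraction 35/6 is the reduced form of 210/36.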


Entropy

• Let $X$ be a discrete stochastic variable. The entropy of $X$, $H[X]$, is defined as follows:

$$H[X] = E[-\log_2 f_X(X)] = -\sum_{x \in \Omega_X} f_X(x) \log_2 f_X(x)$$

• Example: The entropy of the sum $Y$ of two dice:

$$H[Y] = -\sum_{y=2}^{12} f_Y(y) \log_2 f_Y(y) \approx 3.27$$
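
The entropy of the dice sum can be recomputed in a few lines. This is a sketch assuming fair dice; `entropy` is a hypothetical helper name, and values with zero probability are skipped by convention.

```python
import math
from itertools import product

# Frequency function of the sum of two dice (fair dice assumed).
f_Y = {}
for x, y in product(range(1, 7), repeat=2):
    f_Y[x + y] = f_Y.get(x + y, 0.0) + 1 / 36


def entropy(f):
    """H[X] = -sum over x of f_X(x) * log2 f_X(x), measured in bits."""
    return -sum(p * math.log2(p) for p in f.values() if p > 0)


h_y = entropy(f_Y)
print(round(h_y, 2))  # 3.27
```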


More on Entropy

• The entropy of a variable $X$ can be interpreted as the expected amount of information (measured in bits) gained when learning the value of $X$, where the information carried by a single value $x$ is:

$$I_X(x) = -\log_2 f_X(x)$$

• Given a finite value space $\Omega_X$ of size $n$, entropy is maximized if $f_X(x) = \frac{1}{n}$ for all $x \in \Omega_X$.

• Example: Entropy of the outcome $Z$ of an 11-sided die (2–12):

$$H[Z] = -\sum_{z=2}^{12} \frac{1}{11} \log_2 \frac{1}{11} \approx 3.46$$
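
The value for the 11-sided die follows directly from the definition, because all eleven terms of the sum are identical. As a worked step, using only the symbols already introduced:

```latex
\begin{aligned}
H[Z] &= -\sum_{z=2}^{12} \frac{1}{11} \log_2 \frac{1}{11}
      = -11 \cdot \frac{1}{11} \log_2 \frac{1}{11}
      = \log_2 11 \approx 3.46
\end{aligned}
```

The same argument gives $\log_2 n$ for any uniform distribution over $n$ values. The sum of two dice ranges over the same eleven values but is not uniform, which is why its entropy (about 3.27 bits) is lower.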


Joint and Conditional Probability

• Let $X$ and $Y$ be stochastic variables with sample spaces $\Omega_1$ and $\Omega_2$ and value spaces $\Omega_X$ and $\Omega_Y$, respectively.
  1. The joint probability of $X$ and $Y$ is given by their joint probability function $f_{(X,Y)}$:

$$f_{(X,Y)}(x, y) = P(X = x, Y = y) = P(\{(u, v) \in \Omega_1 \times \Omega_2 \mid X(u) = x, Y(v) = y\})$$

  2. The conditional probability of $X$ given $Y$ is given by the conditional probability function $f_{X|Y}$:

$$f_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$$


Independence

• Stochastic variables $X$ and $Y$ (defined on the same underlying sample space) are independent if and only if the following holds for all $x \in \Omega_X$ and $y \in \Omega_Y$:

$$P(X = x, Y = y) = P(X = x)\,P(Y = y)$$

• Corollary: If $X$ and $Y$ are independent variables, then the following conditions hold (for all $x \in \Omega_X$ and $y \in \Omega_Y$):
  1. $P(X = x \mid Y = y) = P(X = x)$
  2. $P(Y = y \mid X = x) = P(Y = y)$
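
The product condition can be tested exhaustively on a small sample space. The sketch below reuses the two-dice setting (fair dice assumed) and checks two pairs of variables; all function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice, every outcome assumed equally likely.
omega = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(omega))


def prob(event):
    """Probability of the set of outcomes u for which event(u) is true."""
    return sum((p_outcome for u in omega if event(u)), Fraction(0))


def independent(X, Y):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair of values."""
    xs = {X(u) for u in omega}
    ys = {Y(u) for u in omega}
    return all(
        prob(lambda u: X(u) == x and Y(u) == y)
        == prob(lambda u: X(u) == x) * prob(lambda u: Y(u) == y)
        for x in xs
        for y in ys
    )


def first(u):
    return u[0]


def second(u):
    return u[1]


def total(u):
    return u[0] + u[1]


print(independent(first, second))  # True
print(independent(first, total))   # False
```

The first die and the second die are independent, whereas the first die and the sum are not: knowing that the first die shows 1 rules out a sum of 12.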


Part-of-Speech Bigrams 1

• Let $(X_1, X_2)$ be the parts-of-speech of an arbitrary bigram and let the following probabilities be given:
  1. $P(X_1 = \text{noun}) = P(X_2 = \text{noun}) = 0.2$
  2. $P(X_1 = \text{adj}) = P(X_2 = \text{adj}) = 0.05$
  3. $P(X_1 = \text{det} \mid X_2 = \text{noun}) = 0.3$
  4. $P(X_1 = \text{det} \mid X_2 = \text{adj}) = 0.6$
  5. $P(X_1 = \text{det} \mid X_2 \notin \{\text{noun}, \text{adj}\}) = 0$

• Question: What is $P(X_2 = \text{noun} \mid X_1 = \text{det})$?


Part-of-Speech Bigrams 2

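• Using Bayes' law:

$$P(X_2 = \text{noun} \mid X_1 = \text{det}) = \frac{P(X_2 = \text{noun}) \cdot P(X_1 = \text{det} \mid X_2 = \text{noun})}{P(X_1 = \text{det})}$$

• Using the law of total probability to expand the denominator:

$$\frac{P(X_2 = \text{noun}) \cdot P(X_1 = \text{det} \mid X_2 = \text{noun})}{P(X_1 = \text{det} \mid X_2 = \text{noun}) \cdot P(X_2 = \text{noun}) + P(X_1 = \text{det} \mid X_2 = \text{adj}) \cdot P(X_2 = \text{adj})}$$

• Putting in the numbers:

$$\frac{0.2 \cdot 0.3}{0.3 \cdot 0.2 + 0.6 \cdot 0.05} = \frac{0.06}{0.09} \approx 0.67$$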
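
The same calculation can be written as a short script. This is a sketch using only the probabilities given in the exercise; the dictionary names are illustrative.

```python
# Given probabilities from the bigram exercise.
p_x2 = {"noun": 0.2, "adj": 0.05}            # P(X2 = tag)
p_det_given_x2 = {"noun": 0.3, "adj": 0.6}   # P(X1 = det | X2 = tag)

# Law of total probability. Tags other than noun and adj contribute 0,
# because P(X1 = det | X2 = tag) = 0 for them.
p_det = sum(p_det_given_x2[tag] * p_x2[tag] for tag in p_x2)

# Bayes' law: P(X2 = noun | X1 = det)
posterior = p_det_given_x2["noun"] * p_x2["noun"] / p_det

print(round(p_det, 2), round(posterior, 2))  # 0.09 0.67
```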


Part-of-Speech Bigrams 3

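• Consider:
  1. $P(X_1 = \text{det}) = 0.09$
  2. $P(X_2 = \text{noun}) = 0.2$
  3. $P(X_1 = \text{det}, X_2 = \text{noun}) = P(X_1 = \text{det}) \cdot P(X_2 = \text{noun} \mid X_1 = \text{det}) = 0.06$
  4. $P(X_1 = \text{det}) \cdot P(X_2 = \text{noun}) = 0.09 \cdot 0.2 = 0.018$
  5. $0.018 \neq 0.06$

• Conclusion: $X_1$ and $X_2$ are not independent variables.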


Statistical Inference


• Statistical inference is the science of making predictions or inferences from finite sets of observations (samples) to (potentially infinite) sets of new observations (populations).

• Two main kinds of statistical inference:
  1. Estimation: Use samples and sample variables to predict population variables.
  2. Hypothesis testing: Use samples and sample variables to test hypotheses about population variables.

• Note: In statistical modeling, we often talk about models instead of populations.


Sampling

• Let $X$ be a stochastic variable.
  1. A vector $(X_1, \ldots, X_n)$ of independent variables $X_i$ with the same distribution as $X$ is said to be a random sample of $X$.
  2. A value vector $(x_1, \ldots, x_n)$ such that $X_1 = x_1, \ldots, X_n = x_n$ in a particular experiment is called a statistical material.

• Example:
  • Consider a corpus $C$ consisting of words $(w_1, \ldots, w_n)$.
  • Can we regard $C$ as a statistical material resulting from a sample $(W_1, \ldots, W_n)$ of the word variable $W$?
  • Why (not)?


Sample Variables

• Given a random sample of a variable $X$, we can define new stochastic variables that are functions of the sample, called sample variables:
  1. The sample mean:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

  2. The sample variance:

$$s_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$

• These variables are called sample variables to distinguish them from the expectation $\mu$ and (true) variance $\sigma^2$ of $X$, which are called population variables or model parameters.
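
Both sample variables are easy to compute by hand or with Python's standard library. The sketch below uses a made-up list of sentence lengths (purely illustrative data) and compares the direct formulas with `statistics.mean` and `statistics.variance`, which use the same $n - 1$ denominator.

```python
import math
import statistics

# Hypothetical sentence lengths in words; illustrative data only.
x = [12, 25, 17, 9, 31, 22]
n = len(x)

# Sample mean: (1/n) * sum of x_i
sample_mean = sum(x) / n

# Sample variance: 1/(n - 1) * sum of (x_i - mean)^2
sample_var = sum((xi - sample_mean) ** 2 for xi in x) / (n - 1)

# The standard library implements the same definitions.
assert math.isclose(sample_mean, statistics.mean(x))
assert math.isclose(sample_var, statistics.variance(x))

print(sample_mean, sample_var)
```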


Estimation

• Two kinds of estimation:
  1. Point estimation: Use a sample variable $f(X_1, \ldots, X_n)$ to estimate a parameter $\phi$.
  2. Interval estimation: Use sample variables $f_1(X_1, \ldots, X_n)$ and $f_2(X_1, \ldots, X_n)$ to construct an interval such that $P(f_1(X_1, \ldots, X_n) < \phi < f_2(X_1, \ldots, X_n)) = p$, where $p$ is the confidence level adopted.


Maximum Likelihood Estimation (MLE)

• Given a statistical material $x_1, \ldots, x_n$ and a set of parameters $\theta$, the likelihood function $L$ is:

$$L(x_1, \ldots, x_n, \theta) = \prod_{i=1}^{n} P_\theta(x_i)$$

where $P_\theta(x_i)$ is the probability that the variable $X_i$ assumes the value $x_i$ given a set of values for the parameters in $\theta$.

• Maximum likelihood estimation means choosing $\theta$ so that the likelihood function is maximized:

$$\max_{\theta} L(x_1, \ldots, x_n, \theta)$$
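
To illustrate the maximization, the sketch below assumes a simple Bernoulli model (a binary variable with a single parameter $\theta = P(X = 1)$) and searches a grid of candidate values. The model, the data and the grid are assumptions made for illustration; the logarithm of the likelihood is maximized instead of the likelihood itself, which yields the same $\theta$ because the logarithm is monotone.

```python
import math


def log_likelihood(sample, theta):
    """log L = sum_i log P_theta(x_i) for a Bernoulli model, P(X = 1) = theta."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in sample)


# Hypothetical statistical material: 1 = token is a noun, 0 = it is not.
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Candidate parameter values 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda theta: log_likelihood(sample, theta))

print(theta_hat)  # 0.3
```

The grid maximum coincides with the relative frequency of ones in the sample (3 out of 10).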


MLE: Example 1

• Given a random sample $(X_1, \ldots, X_n)$ of a numerical variable $X$, the sample mean $\bar{X}_n$ is a maximum likelihood estimate of the expectation $E[X]$.

• The average sentence length $X$ in a certain type of text can be estimated with the mean sentence length in a representative sample:

$$\hat{E}[X] = \bar{X}_n$$


MLE: Example 2

• Given a random sample $(X_1, \ldots, X_n)$ of a categorical variable $X$, the relative frequency of the value $x$, $f_n(x)$, is a maximum likelihood estimate of the probability $P(X = x)$.

• The probability of an arbitrary word from a text being a noun can be estimated with the relative frequency of nouns in a suitable corpus of texts:

$$\hat{P}(\text{noun}) = f_n(\text{noun})$$


The Rationale of MLE

• We want to choose the most probable model given the data:

$$P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\,P(\theta)}{P(x_1, \ldots, x_n)}$$

$$\arg\max_{\theta} P(\theta \mid x_1, \ldots, x_n) = \arg\max_{\theta} P(x_1, \ldots, x_n \mid \theta)\,P(\theta)$$

• If we assume a uniform distribution for $P(\theta)$, then

$$\arg\max_{\theta} P(\theta \mid x_1, \ldots, x_n) = \arg\max_{\theta} P(x_1, \ldots, x_n \mid \theta)$$

• The status of $P(\theta)$ is controversial in statistical theory (Bayesians vs. Frequentists).


MLE and Smoothing

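• MLE is usually a good solution to the estimation problem if the statistical material is large enough.

• For language data, MLE is often suboptimal because of sparse data and requires smoothing (or regularization).

• Example: Additive smoothing:

$$P_{\text{add}}(X = x) = \frac{f_n(x) + m}{n + m \cdot |\Omega_X|}$$

where $m$ is a constant (usually $m \le 1$). For the estimates to sum to 1, $f_n(x)$ must be read here as the number of occurrences of $x$ in the sample (an absolute frequency).

• Note: $P_{\text{add}}(X = x) \neq 0$.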
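
Additive smoothing takes only a few lines to implement. The sketch below assumes that the numerator uses the count of each value in the sample and that the value space is supplied explicitly; the tag sequence and the function name `additive_smoothing` are made up for illustration.

```python
from collections import Counter


def additive_smoothing(sample, value_space, m=1.0):
    """P_add(x) = (count(x) + m) / (n + m * |value_space|) for every x."""
    counts = Counter(sample)
    n = len(sample)
    denominator = n + m * len(value_space)
    return {x: (counts[x] + m) / denominator for x in value_space}


# Hypothetical tag sequence; "adj" is in the value space but never observed.
tags = ["det", "noun", "verb", "det", "noun", "noun"]
value_space = ["noun", "verb", "adj", "det"]

p_add = additive_smoothing(tags, value_space, m=0.5)
print(p_add["noun"], p_add["adj"])  # 0.4375 0.0625
```

The unseen value `adj` receives a small non-zero probability, whereas its relative frequency in the sample would be 0.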


Interval Estimation

• In general, we can derive a 95% confidence interval for our maximum likelihood estimate $\hat{\mu}$ of a mean as follows:

$$\hat{\mu} \pm 1.96 \sqrt{\frac{\sigma^2}{n}}$$

• Examples:
  1. Sentence length: $E[X] = \bar{X}_n \pm 1.96 \sqrt{\frac{\sigma^2}{n}}$
  2. Noun probability: $P(\text{noun}) = f_n(\text{noun}) \pm 1.96 \sqrt{\frac{\sigma^2}{n}}$

• Note:
  • The interval grows with increasing variance $\sigma^2$.
  • The interval shrinks with increasing sample size $n$.
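
The interval formula and the two notes can be tried out numerically. This is a minimal sketch with invented numbers, assuming the variance is known; `confidence_interval_95` is a hypothetical helper name.

```python
import math


def confidence_interval_95(estimate, variance, n):
    """Return (low, high) for estimate +/- 1.96 * sqrt(variance / n)."""
    half_width = 1.96 * math.sqrt(variance / n)
    return estimate - half_width, estimate + half_width


# Hypothetical numbers: mean sentence length 18.0, known variance 100.0.
low, high = confidence_interval_95(estimate=18.0, variance=100.0, n=400)
print(round(low, 2), round(high, 2))  # 17.02 18.98

# Quadrupling the sample size halves the width of the interval.
low_big, high_big = confidence_interval_95(estimate=18.0, variance=100.0, n=1600)
print(round(high_big - low_big, 2))  # 0.98
```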


More on Interval Estimation

• Where does the number 1.96 come from?

• Assumptions:
  1. The estimate has a normal distribution (a reasonable approximation for large $n$).
  2. The true variance $\sigma^2$ is known (usually not the case).
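
Under the normality assumption, 1.96 is the point below which 97.5% of the standard normal distribution lies, so that 95% of the probability mass falls between $-1.96$ and $+1.96$. A short check using `statistics.NormalDist` from the Python standard library (available from Python 3.8):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# 97.5th percentile: leaves 2.5% in each tail, 95% in the middle.
z = standard_normal.inv_cdf(0.975)
print(round(z, 2))  # 1.96

# Probability mass between -z and +z.
coverage = standard_normal.cdf(z) - standard_normal.cdf(-z)
print(round(coverage, 2))  # 0.95
```

Other confidence levels are obtained the same way, for example `inv_cdf(0.995)` for a 99% interval.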


Hypothesis Testing

• When are differences in sample variables $f(x_1, \ldots, x_m)$ and $f(y_1, \ldots, y_n)$ significant?
  • Do they reflect differences in population variables $\xi$ and $\upsilon$?
  • Null hypothesis $H_0$: $\xi = \upsilon$

• Test procedure:
  • Choose a test statistic $f(X, Y)$ whose distribution is known when $H_0$ is true.
  • Calculate the probability $p$ of observing a value at least as extreme as $f(x, y)$ given $H_0$.
  • If $p < \alpha$, reject $H_0$ at significance level $\alpha$.


Example: Z-test

• Data:
  • Mean sentence length in 50 novels from 1950s: $\bar{X}_1 = 19.3$.
  • Mean sentence length in 50 novels from 2000s: $\bar{X}_2 = 16.4$.
  • $X$ is normally distributed with variance $\sigma^2 = 134.2$.

• Test statistic:

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma^2}{n_1 + n_2}}} = \frac{19.3 - 16.9}{\sqrt{\frac{134.2}{50 + 50}}} = 2.28$$

• Probability calculation:
  • $Z = 2.28$ corresponds to $p = P(|Z| \ge 2.28 \mid H_0) = 0.0226$.
  • Reject $H_0$ at $\alpha = 0.05$ (but not $\alpha = 0.01$).
  • Note: This does not mean that $P(H_1) > 0.95$.
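
The step from the test statistic to the probability can be reproduced with the standard normal distribution. The sketch below takes the value $Z = 2.28$ stated in the example as given and assumes a two-sided test; `two_sided_p_value` is a hypothetical helper name.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_sided_p_value(2.28)
print(round(p, 4))  # 0.0226

# Decision at the two significance levels mentioned in the example.
for alpha in (0.05, 0.01):
    print(alpha, "reject H0" if p < alpha else "do not reject H0")
```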


Tips and Tricks

• What to do in real life?
  • When variance is not known: use sample variance and a t-test (instead of Z-test):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

  • When distribution is not normal: use a non-parametric test.
  • The special case of proportions:
    • If $X$ is binary with $P(X = 1) = p$, then $\mathrm{Var}[X] = p(1 - p)$:

$$p = f_n(1) \pm \sqrt{\frac{p(1 - p)}{n}} \qquad Z = \frac{p_1 - p_2}{\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}}$$
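
The two test statistics above translate directly into code. This is a sketch with invented inputs; the function names are illustrative, and turning the statistics into p-values (which for the t-test requires the t distribution and its degrees of freedom) is left out.

```python
import math


def t_statistic(mean1, var1, n1, mean2, var2, n2):
    """t = (mean1 - mean2) / sqrt(s1^2 / n1 + s2^2 / n2), with sample variances."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


def two_proportion_z(p1, n1, p2, n2):
    """Z = (p1 - p2) / sqrt(p1(1 - p1) / n1 + p2(1 - p2) / n2)."""
    return (p1 - p2) / math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical sample means and sample variances for two groups of texts.
t = t_statistic(mean1=10.0, var1=4.0, n1=16, mean2=8.0, var2=9.0, n2=36)

# Hypothetical noun proportions in two corpora of 1000 tokens each.
z = two_proportion_z(p1=0.25, n1=1000, p2=0.20, n2=1000)

print(round(t, 2), round(z, 2))
```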
