
Statistical techniques in NLP

Vasileios Hatzivassiloglou

University of Texas at Dallas


Learning

• Central to statistical NLP

• In most cases, supervised methods are used, with a separate training set

• Unsupervised methods (clustering) recalculate the entire model on new data


Parameterized models

• Assume that the observed (training) data D is described by a given distribution

• This distribution, possibly with some parameters θ, is our model M

• We want to maximize the likelihood function, P(D|M) or P(D|θ)


Maximum likelihood estimation

• Find the θ that maximizes P(D|θ), i.e., solve ∂P(D|θ)/∂θ = 0

• Example: Binomial distribution: P(D|m) = C(N,D) m^D (1-m)^(N-D)

• Setting ∂P/∂m = C(N,D) [D m^(D-1) (1-m)^(N-D) - (N-D) m^D (1-m)^(N-D-1)] = 0 gives D(1-m) - (N-D)m = 0

• Therefore, m = D/N
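As a quick illustration (not from the slides), the sketch below checks numerically that the binomial likelihood peaks at m = D/N, using made-up counts N and D:

```python
# Minimal sketch: numerically confirm that the binomial likelihood
# P(D|m) = C(N,D) m^D (1-m)^(N-D) peaks at m = D/N.
from math import comb

N, D = 100, 23          # hypothetical counts: 23 successes out of 100 trials

def likelihood(m):
    return comb(N, D) * m**D * (1 - m)**(N - D)

# Grid search over candidate parameter values.
candidates = [i / 1000 for i in range(1, 1000)]
best_m = max(candidates, key=likelihood)
print(best_m, D / N)    # both are 0.23
```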


Smoothing

• MLE assigns zero probability to unseen events

• Example: trigrams in part of speech tagging (23% unseen)

• Solution: smoothing (small probabilities for unseen data)
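A minimal sketch of one simple smoothing scheme, add-one (Laplace) smoothing, on a toy set of part-of-speech trigrams; the tags and counts are invented for illustration, and the slides do not prescribe this particular scheme:

```python
# Minimal sketch: add-one (Laplace) smoothing so that unseen part-of-speech
# trigrams get a small, non-zero probability instead of the MLE's zero.
from collections import Counter

tag_trigrams = [("DT", "JJ", "NN"), ("DT", "NN", "VB"), ("DT", "JJ", "NN")]
counts = Counter(tag_trigrams)
context_counts = Counter((t1, t2) for (t1, t2, _) in tag_trigrams)
tagset = {"DT", "JJ", "NN", "VB"}
V = len(tagset)

def smoothed_prob(t1, t2, t3):
    # MLE would be counts / context_counts; add-one smoothing adds 1 to every
    # numerator and V to every denominator.
    return (counts[(t1, t2, t3)] + 1) / (context_counts[(t1, t2)] + V)

print(smoothed_prob("DT", "JJ", "NN"))   # seen trigram
print(smoothed_prob("DT", "JJ", "VB"))   # unseen trigram, still > 0
```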


Bayesian learning

• It is often impossible to solve ∂P/∂θ = 0

• Bayes decision rule: choose the θ that maximizes P(θ|D) (minimum error rate)

• But it may be hard to calculate P(θ|D)

• Use Bayes' rule: P(θ|D) = P(D|θ) P(θ) / P(D)

• Naïve Bayes: P(D|θ) = Π_{i=1..N} P(w_i|θ)
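A minimal Naïve Bayes sketch on invented toy documents, illustrating the decision rule above (choose the class maximizing the prior times the product of per-word likelihoods), with add-one smoothing assumed for unseen words:

```python
# Minimal Naïve Bayes sketch: choose the class that maximizes
# P(class) * prod_i P(w_i | class), computed in log space to avoid underflow.
import math
from collections import Counter

docs = [("money deposit bank", "finance"),
        ("river bank water", "nature"),
        ("bank loan money", "finance")]

classes = {label for _, label in docs}
word_counts = {c: Counter() for c in classes}
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in classes for w in word_counts[c]}

def classify(text):
    scores = {}
    for c in classes:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / len(docs))          # log prior
        for w in text.split():
            # add-one smoothing for words unseen with this class
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("bank money"))   # expected: "finance"
```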


Examples

• Gale et al. 1992: 90% sense disambiguation accuracy (choose between "bank/money" and "bank/river")

• Hanks and Rooth 1990: prepositional phrase attachment
  – He ate pasta with cheese
  – He ate pasta with a fork

• Both rely on observable features (nearby words, the verb)


Markov models

• A stochastic process follows a sequence of states over time with some transition probabilities p_ij

• If the process is stationary and with limited memory, we have a Markov chain

• The model can be visible, or with hidden states (HMM)
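A minimal sketch of a visible Markov chain with hypothetical transition probabilities p_ij over part-of-speech states, sampling a state sequence:

```python
# Minimal sketch: a visible Markov chain with toy transition probabilities.
import random

p = {  # p[i][j] = probability of moving from state i to state j
    "DT": {"JJ": 0.4, "NN": 0.6},
    "JJ": {"JJ": 0.2, "NN": 0.8},
    "NN": {"VB": 0.7, "NN": 0.3},
    "VB": {"DT": 0.5, "NN": 0.5},
}

def sample_sequence(start, length):
    state, states = start, [start]
    for _ in range(length - 1):
        nxt, probs = zip(*p[state].items())
        state = random.choices(nxt, weights=probs)[0]
        states.append(state)
    return states

print(sample_sequence("DT", 6))   # e.g. ['DT', 'JJ', 'NN', 'VB', 'NN', 'NN']
```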


Example: N-gram language models

• Result for a word depends only on the word and a limited number of neighbors

• Part-of-speech tagging: maximize P(t_1..n | w_1..n)

• With Bayes' rule, the chain rule, and independence assumptions: P(t_1..n | w_1..n) ∝ P(t_1) Π_{i=1..n} P(w_i|t_i) P(t_i|t_{i-1})

• Use HMM for automatically adjusting back-off smoothing
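A minimal sketch that scores one candidate tag sequence with the factored model above; the tiny probability tables are invented for illustration:

```python
# Minimal sketch: score a candidate tag sequence t_1..t_n for a sentence
# w_1..w_n using P(t_1) * prod_i P(w_i | t_i) * P(t_i | t_{i-1}).
import math

initial = {"DT": 0.6, "NN": 0.4}
transition = {("DT", "NN"): 0.8, ("NN", "VB"): 0.7, ("DT", "JJ"): 0.2}
emission = {("DT", "the"): 0.9, ("NN", "dog"): 0.3, ("VB", "runs"): 0.2}

def log_score(words, tags):
    score = math.log(initial[tags[0]]) + math.log(emission[(tags[0], words[0])])
    for i in range(1, len(words)):
        score += math.log(transition[(tags[i - 1], tags[i])])
        score += math.log(emission[(tags[i], words[i])])
    return score

print(log_score(["the", "dog", "runs"], ["DT", "NN", "VB"]))
```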


Example: Speech recognition

• Need to find correct sequence of words given aural signal

• Language model (N-gram) accounts for dependencies between words

• Acoustic model maps from visible (phonemes) to hidden (words) level

• An HMM combines both

• The Viterbi algorithm will find the optimal solution
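A minimal Viterbi sketch on a standard toy HMM (not the slides' acoustic or language models), recovering the most likely hidden state sequence for an observation sequence:

```python
# Minimal Viterbi sketch: dynamic programming over hidden states.
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best path ending in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                             key=lambda x: x[1])
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.insert(0, state)
    return path

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
```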


Expectation-Maximization (EM)

• In general, we can iteratively estimate complex models with hidden parameters

• Define a quality function Q as the conditional likelihood of the model on all parameters

• Estimate Q from an initial choice for θ

• Choose the new θ that maximizes Q
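A minimal EM sketch on the classic two-coin toy problem (not from the slides): which of two coins produced each sequence of flips is hidden, and θ = (p_A, p_B) is re-estimated by alternating the two steps above.

```python
# Minimal EM sketch: estimate the head probabilities of two coins when the
# coin identity behind each trial is hidden.
from math import comb

# each trial: (number of heads, number of flips)
trials = [(9, 10), (8, 10), (2, 10), (1, 10), (7, 10)]
p_a, p_b = 0.6, 0.5                       # initial choice for theta

for _ in range(20):                       # iterate E and M steps
    heads_a = flips_a = heads_b = flips_b = 0.0
    for h, n in trials:
        like_a = comb(n, h) * p_a**h * (1 - p_a)**(n - h)
        like_b = comb(n, h) * p_b**h * (1 - p_b)**(n - h)
        w_a = like_a / (like_a + like_b)  # E-step: responsibility of coin A
        w_b = 1 - w_a
        heads_a += w_a * h; flips_a += w_a * n
        heads_b += w_b * h; flips_b += w_b * n
    p_a, p_b = heads_a / flips_a, heads_b / flips_b   # M-step: re-estimate

print(round(p_a, 3), round(p_b, 3))   # p_a drifts to the high-heads coin, p_b to the low one
```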


Example: PCFG parsing

• Probabilistic context-free grammars

• The likelihood of each rule (e.g., VP → V or VP → V NP) is a basic parameter

• The combined probability of the entire tree gives the quality function

• The inside-outside algorithm (the PCFG counterpart of forward-backward) gives the solution

• Lexicalization (Collins, 1996, 1997)
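A minimal sketch of the basic quantity involved: the probability of a parse tree as the product of the probabilities of the rules it uses, on an invented toy grammar:

```python
# Minimal sketch: a parse tree's probability under a PCFG is the product of
# its rule probabilities; comparing trees by this score underlies PCFG parsing.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V",)): 0.4,
    ("VP", ("V", "NP")): 0.6,
    ("NP", ("pasta",)): 0.3,
    ("NP", ("he",)): 0.2,
    ("V", ("ate",)): 1.0,
}

# A tree is (label, [children]); leaves are plain strings.
tree = ("S", [("NP", ["he"]),
              ("VP", [("V", ["ate"]), ("NP", ["pasta"])])])

def tree_prob(node):
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = rule_prob[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob

print(tree_prob(tree))   # 1.0 * 0.2 * 0.6 * 1.0 * 0.3 = 0.036
```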


Example: Machine Translation

• The noisy channel model (Brown et al., 1991)
  – Input in one language (e.g., English) is garbled into another (e.g., French)
  – Estimate the probabilities of each word or phrase generating words or phrases in the other language, and how many of them (fertility)

• A similar approach: transliteration (Knight, 1998)


Linear regression

• Predict output as a linear combination of input variables

• Choose weights that minimize the sum of residual square error (least squares)

• Can be computed efficiently via a matrix decomposition and inversion

R = w_0 + Σ_i w_i V_i
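A minimal sketch of the least-squares fit on synthetic data, using numpy (assumed available) rather than writing out the matrix decomposition explicitly:

```python
# Minimal sketch: fit R = w_0 + sum_i w_i V_i by least squares with numpy.
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(100, 3))                 # 100 observations, 3 input variables
true_w = np.array([2.0, -1.0, 0.5])
R = 1.5 + V @ true_w + rng.normal(scale=0.1, size=100)   # intercept w_0 = 1.5

X = np.hstack([np.ones((100, 1)), V])         # prepend a column of 1s for w_0
w, *_ = np.linalg.lstsq(X, R, rcond=None)     # minimizes the residual sum of squares
print(w)                                      # approximately [1.5, 2.0, -1.0, 0.5]
```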


Log-linear regression

• Ideal output is 0 or 1

• Because the distribution changes from normal to binomial, a transformed LS fit is not accurate

• Solution: Use an intermediate predictor η, with R = e^η / (1 + e^η)

• Can be approximated with iteratively reweighted least squares
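A minimal sketch of log-linear (logistic) regression fitted by iteratively reweighted least squares on synthetic data; numpy is assumed available, and the loop has no safeguards (e.g., against perfectly separable data):

```python
# Minimal sketch: logistic regression via iteratively reweighted least squares.
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_w = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ true_w)))).astype(float)

w = np.zeros(X.shape[1])
for _ in range(25):                       # IRLS iterations
    eta = X @ w                           # intermediate linear predictor
    p = 1 / (1 + np.exp(-eta))            # predicted probability R
    W = p * (1 - p)                       # weights from the binomial variance
    z = eta + (y - p) / W                 # working response
    # weighted least-squares step: solve (X' W X) w = X' W z
    XtW = X.T * W
    w = np.linalg.solve(XtW @ X, XtW @ z)

print(w)   # roughly recovers [-0.5, 2.0, -1.0]
```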


Examples

• Text categorization for information retrieval (Yang, 1998)

• Many types of sentence/word classification– cue words (Passonneau and Litman, 1993)– prosodic features (Pan and McKeown, 1999)


Singular-value decomposition

• A technique for reducing dimensionality; data points are projected

• Given matrix A (n×m), find matrices T (n×k), S (k×k), and D (k×m) so that their product is (approximately) A

• S holds the top k singular values of A

• Projection is achieved by multiplying T^T and A

• Application: Latent Semantic Indexing
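A minimal sketch of a truncated SVD on a toy term-document matrix with numpy (assumed available), showing the T, S, D factors and the T^T projection:

```python
# Minimal sketch: truncated SVD as used in Latent Semantic Indexing.
import numpy as np

A = np.array([[2., 0., 1.],      # rows: terms, columns: documents (n=4, m=3)
              [1., 1., 0.],
              [0., 2., 0.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
T, S, D = U[:, :k], np.diag(s[:k]), Vt[:k, :]   # T (n x k), S (k x k), D (k x m)

print(np.allclose(T @ S @ D, A))   # False: the rank-k product only approximates A
docs_k = T.T @ A                   # project documents into the k-dimensional space
print(docs_k.shape)                # (2, 3)
```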


Methods without an explicit probability model

• Use empirical techniques to directly provide output without calculating a model

• Decision trees: Each node is associated with a decision on one of the input features

• The tree is built incrementally by choosing features with the most discriminatory power
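A minimal sketch of choosing the most discriminatory feature by information gain, on invented word-classification features (the slides do not commit to a particular splitting criterion):

```python
# Minimal sketch: pick the feature with the largest information gain as the
# root decision of a tree.
import math
from collections import Counter

# each example: ({feature: value}, label)
data = [({"cap": True,  "prev_is_det": False}, "Name"),
        ({"cap": True,  "prev_is_det": True},  "Noun"),
        ({"cap": False, "prev_is_det": True},  "Noun"),
        ({"cap": False, "prev_is_det": False}, "Other")]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(feature):
    labels = [y for _, y in data]
    gain = entropy(labels)
    for value in {x[feature] for x, _ in data}:
        subset = [y for x, y in data if x[feature] == value]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

best = max(data[0][0], key=information_gain)
print(best, information_gain(best))   # "prev_is_det" separates the labels best here
```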


Variations on decision trees

• Shrinking to prevent over-training

• Decision lists (Yarowsky 1997) use only the top feature for accent restoration


Rule induction

• Similar to decision trees, but the rules are allowed to vary and contain different operators

• Examples: RIPPER (Cohen 1996), transformation-based learning (Brill 1996), genetic algorithms (Siegel 1998)


Methods without explicit model

• k-Nearest Neighbor classification

• Neural networks

• Genetic algorithms


Support vector machines

• Find hyperplane that maximizes distance from support vectors

• Non-linear transformation: From original space to separable space via kernel function

• Text categorization (Joachims, 1997), OCR (Burges and Vapnik, 1996), Speech recognition (Schmidt, 1996)
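A minimal text-categorization sketch with a linear SVM; it assumes scikit-learn is installed, uses invented toy documents, and is not the setup of the cited papers:

```python
# Minimal sketch: linear SVM text categorization with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["stock market falls", "team wins the match",
               "shares and bonds rally", "player scores a goal"]
train_labels = ["finance", "sports", "finance", "sports"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)    # map text to a feature space
clf = LinearSVC()                            # linear maximum-margin classifier
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["shares fall in the stock market"])))
# most likely ['finance'] on this toy data
```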


Classification issues

• Two or many classes

• Classifier confidence, probability of membership in each class

• Training / test set distributions

• Balance of training data across classes


When to use each method?

• Probabilistic models depend on distributional assumptions

• Linear models (and SVD) assume a normal data distribution, and generalized linear models a Poisson, binomial, or negative binomial

• Markov models capture limited dependencies

• Rule-based models allow for multi-way classification more easily than linear/log-linear ones

• For many applications, it is important to get a confidence estimate