Statistical Methods

Page 1: Statistical Methods

Statistical Methods

Traditional grammars may be “brittle”
Statistical methods are built on formal theories
Vary in complexity from simple trigrams to conditional random fields
Can be used for language identification, text classification, information retrieval, and information extraction

Page 2: Statistical Methods

N-Grams

Text is composed of characters (or words, or phonemes)
An N-gram is a sequence of n consecutive characters (or words, ...): unigram, bigram, trigram
Technically, it is a Markov chain of order n-1: P(c_i | c_{1:i-1}) = P(c_i | c_{i-n+1:i-1})
Calculate N-gram probabilities by counting over a large corpus
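As a concrete illustration, a minimal Python sketch (not from the slides) that estimates character-trigram probabilities by counting over a corpus string; the toy corpus below stands in for a real, large one.

from collections import Counter

def char_ngram_model(text, n=3):
    # Count n-grams and their (n-1)-character contexts
    ngram_counts = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    context_counts = Counter(text[i:i+n-1] for i in range(len(text) - n + 1))
    def prob(context, c):
        # P(c | context) = count(context + c) / count(context)
        if context_counts[context] == 0:
            return 0.0
        return ngram_counts[context + c] / context_counts[context]
    return prob

prob = char_ngram_model("the cat sat on the mat. the dog sat on the log.", n=3)
print(prob("th", "e"))   # P('e' | 'th'), which is 1.0 in this tiny corpus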

Page 3: Statistical Methods

Example – Language Identification

Use P(c_i | c_{i-2:i-1}, l), where l ranges over languages
About 100,000 characters of each language are needed
l* = argmax_l P(l | c_{1:N}) = argmax_l P(l) ∏_i P(c_i | c_{i-2:i-1}, l)
Learn the model from a corpus; P(l), the prior probability of a given language, can be estimated
Other examples: spelling correction, genre classification, and named-entity recognition
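A rough Python sketch of the character-trigram approach above; the two toy "corpora", the uniform prior P(l), and the add-one smoothing (so unseen trigrams do not zero out a language) are assumptions, not details from the slides.

import math
from collections import Counter

def train_char_trigrams(text):
    # Count character trigrams and their two-character contexts
    tri = Counter(text[i:i+3] for i in range(len(text) - 2))
    ctx = Counter(text[i:i+2] for i in range(len(text) - 2))
    return tri, ctx

def log_likelihood(text, model, alpha=1.0, vocab=128):
    # Sum of log P(c_i | c_{i-2:i-1}, l) with add-alpha smoothing
    tri, ctx = model
    ll = 0.0
    for i in range(2, len(text)):
        context, trigram = text[i-2:i], text[i-2:i+1]
        ll += math.log((tri[trigram] + alpha) / (ctx[context] + alpha * vocab))
    return ll

# Hypothetical, toy-sized training corpora
corpora = {"english": "the cat sat on the mat " * 100,
           "german":  "der hund sitzt auf der matte " * 100}
models = {l: train_char_trigrams(t) for l, t in corpora.items()}
priors = {l: 1.0 / len(corpora) for l in corpora}   # assume uniform P(l)

def identify(text):
    # argmax over languages of log P(l) + sum_i log P(c_i | c_{i-2:i-1}, l)
    return max(models, key=lambda l: math.log(priors[l]) + log_likelihood(text, models[l]))

print(identify("the dog sat on the mat"))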

Page 4: Statistical Methods

Smoothing

Problem: what if a particular n-gram does not appear in the training corpus?
Its probability would be 0, but it should be a small, positive number
Smoothing: adjusting the probability of low-frequency counts
Laplace: use 1/(n+2) instead of 0 (after n observations)
Backoff model: back off to (n-1)-grams
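A tiny worked example of the 1/(n+2) rule (Laplace's rule of succession for a two-outcome event), just to make the numbers concrete:

def laplace_estimate(successes, n):
    # Laplace's rule for a Boolean event: (successes + 1) / (n + 2).
    # An event never observed in n trials gets probability 1/(n+2) instead of 0.
    return (successes + 1) / (n + 2)

print(laplace_estimate(0, 100))   # unseen event: 1/102, about 0.0098
print(laplace_estimate(30, 100))  # observed 30 times: 31/102, about 0.304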

Page 5: Statistical Methods

Model Evaluation

Use cross-validation (split the corpus into training and evaluation sets)
Need a metric for evaluation
Can use perplexity to describe the probability of a sequence
Perplexity(c_{1:N}) = P(c_{1:N})^(−1/N)
Can be thought of as the reciprocal of the probability, normalized by the sequence length
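A short sketch of computing perplexity from per-symbol model probabilities, done in log space to avoid numerical underflow; the probability list is a toy stand-in for a real n-gram model.

import math

def perplexity(symbol_probs):
    # Perplexity of a sequence given P(c_i | context) for each symbol:
    # P(c_{1:N})^(-1/N), computed in log space
    n = len(symbol_probs)
    log_prob = sum(math.log(p) for p in symbol_probs)
    return math.exp(-log_prob / n)

# A model that assigns probability 1/4 to every symbol has perplexity 4
print(perplexity([0.25] * 20))   # 4.0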

Page 6: Statistical Methods

N-gram Word Models

Can be used for text classification
Example: spam vs. ham
Problem: out-of-vocabulary words
Trick: during training, use <UNK> the first time a word is seen, then after that use the word regularly; when an unknown word is seen later, treat it as <UNK>
Calculate probabilities from a corpus, then randomly generate phrases
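A minimal sketch of the <UNK> trick described above; the helper names are illustrative.

def apply_unk_training(tokens):
    # Replace the first occurrence of each word with <UNK>;
    # later occurrences are kept, so <UNK> absorbs rare and unseen words
    seen = set()
    out = []
    for w in tokens:
        if w in seen:
            out.append(w)
        else:
            out.append("<UNK>")
            seen.add(w)
    return out

def apply_unk_test(tokens, vocabulary):
    # At test time, map any word outside the training vocabulary to <UNK>
    return [w if w in vocabulary else "<UNK>" for w in tokens]

train = "the spam filter saw the spam".split()
print(apply_unk_training(train))
# ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'the', 'spam']
print(apply_unk_test("the unknown filter".split(), set(train)))
# ['the', '<UNK>', 'filter']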

Page 7: Statistical Methods

Example – Spam Detection

Text classification problem
Train models for P(message | spam) and P(message | ham) using n-grams
Calculate P(message | spam) P(spam) and P(message | ham) P(ham) and take whichever is greater
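A unigram Python sketch of this comparison (essentially a naive Bayes classifier); the training messages are toy data, and details such as whitespace tokenization and add-one smoothing are assumptions rather than anything specified in the slides.

import math
from collections import Counter

def train(messages):
    # Unigram counts for one class
    return Counter(w for m in messages for w in m.split())

spam_msgs = ["buy cheap pills now", "cheap pills cheap offer"]
ham_msgs  = ["meeting at noon", "see you at the meeting"]

spam_counts, ham_counts = train(spam_msgs), train(ham_msgs)
vocab = set(spam_counts) | set(ham_counts)
p_spam = len(spam_msgs) / (len(spam_msgs) + len(ham_msgs))

def log_score(message, counts, prior):
    # log P(class) + sum of log P(word | class), add-one smoothed
    # (one extra slot in the denominator for unseen words)
    total = sum(counts.values())
    score = math.log(prior)
    for w in message.split():
        score += math.log((counts[w] + 1) / (total + len(vocab) + 1))
    return score

def classify(message):
    s = log_score(message, spam_counts, p_spam)
    h = log_score(message, ham_counts, 1 - p_spam)
    return "spam" if s > h else "ham"

print(classify("cheap pills"))      # spam
print(classify("meeting at noon"))  # ham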

Page 8: Statistical Methods

Spam Detection – Other Methods

Represent the message as a set of feature/value pairs
Apply a classification algorithm to the feature vector
Strongly depends on the features chosen
Data compression: compression algorithms such as LZW look for commonly recurring sequences and replace later copies with pointers to earlier ones
Append the new message to the collection of spam messages and compress; do the same for ham; whichever yields the smaller compressed size determines the class
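A small sketch of the compression idea using Python's built-in zlib (which is DEFLATE rather than LZW, so that substitution and the far-too-small corpora are assumptions of this illustration):

import zlib

spam_corpus = " ".join(["buy cheap pills now", "cheap pills cheap offer"])
ham_corpus  = " ".join(["meeting at noon", "see you at the meeting"])

def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8")))

def classify_by_compression(message):
    # Whichever corpus grows less when the message is appended shares
    # more repeated sequences with it, i.e., "explains" it better
    spam_growth = compressed_size(spam_corpus + " " + message) - compressed_size(spam_corpus)
    ham_growth  = compressed_size(ham_corpus + " " + message) - compressed_size(ham_corpus)
    return "spam" if spam_growth < ham_growth else "ham"

print(classify_by_compression("cheap pills now"))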

Page 9: Statistical Methods

Information Retrieval

Think WWW and search engines
Characterized by: a corpus of documents, queries posed in some query language, a result set, and a presentation of the result set (some ordering)
Methods: simple Boolean keyword models, IR scoring functions, the PageRank algorithm, the HITS algorithm

Page 10: Statistical Methods

IR Scoring Function - BM25

Okapi project (Robertson et al.). Three factors:
How frequently the word appears in the document (TF)
The inverse document frequency (IDF), which downweights words that appear in many documents
The length of the document
|d_j| is the length of the document, L is the average document length, and k and b are tuned parameters

BM25(d_j, q_{1:N}) = Σ_{i=1}^{N} IDF(q_i) · TF(q_i, d_j) · (k + 1) / (TF(q_i, d_j) + k · (1 − b + b · |d_j| / L))

Page 11: Statistical Methods

BM25 cont'd.

IDF(q_i) = log( (N − DF(q_i) + 0.5) / (DF(q_i) + 0.5) )

where N is the number of documents in the corpus and DF(q_i) is the number of those documents that contain the word q_i
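Putting the two formulas together, a compact Python sketch of scoring one document against a query; the values k = 2.0 and b = 0.75 are common default choices, not values given in the slides, and the corpus is a toy one.

import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k=2.0, b=0.75):
    # BM25(d, q_{1:N}) = sum_i IDF(q_i) * TF(q_i, d) * (k+1) /
    #                    (TF(q_i, d) + k * (1 - b + b * |d| / L))
    # doc_freq[w] = number of documents containing w; n_docs = corpus size
    score = 0.0
    doc_len = len(doc_terms)
    for q in query_terms:
        tf = doc_terms.count(q)
        df = doc_freq.get(q, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_len))
    return score

# Toy corpus of three documents
docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["cat", "videos", "online"]]
df = {}
for d in docs:
    for w in set(d):
        df[w] = df.get(w, 0) + 1
avg = sum(len(d) for d in docs) / len(docs)
print(bm25_score(["sat"], docs[0], df, len(docs), avg))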

Page 12: Statistical Methods

Precision and Recall

Precision measures the proportion of the documents in the result set that are actually relevant, e.g., if the result set contains 30 relevant documents and 10 non-relevant documents, precision is 30/40 = .75
Recall is the proportion of relevant documents that are in the result set, e.g., if 30 relevant documents are in the result set out of a possible 50, recall is 30/50 = .60
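The same arithmetic as two trivial helper functions:

def precision(relevant_retrieved, retrieved_total):
    # Fraction of retrieved documents that are relevant
    return relevant_retrieved / retrieved_total

def recall(relevant_retrieved, relevant_total):
    # Fraction of all relevant documents that were retrieved
    return relevant_retrieved / relevant_total

print(precision(30, 40))   # 0.75
print(recall(30, 50))      # 0.6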

Page 13: Statistical Methods

IR Refinement

Pivoted document length normalization: longer documents tend to be favored, so instead of raw document length, use a different normalization function that can be tuned
Use word stems
Use synonyms
Look at metadata

Page 14: Statistical Methods

PageRank Algorithm (Google)

Count the links that point to the page
Weight links from “high-quality sites” higher; this minimizes the effect of creating lots of pages that point to the chosen page

PR(p) = (1 − d) / N + d · Σ_i PR(x_i) / C(x_i)

where PR(p) is the PageRank of p, N is the total number of pages in the corpus, d is a damping factor, x_i ranges over the pages that link to p, and C(x_i) is the count of the total number of out-links on page x_i
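A minimal iterative PageRank sketch on a hand-made three-page link graph; d = 0.85 is the usual damping-factor choice and is assumed here rather than taken from the slides.

def pagerank(links, d=0.85, iterations=50):
    # links[p] = list of pages that p links to
    # PR(p) = (1 - d)/N + d * sum over in-links x of PR(x)/C(x)
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[x] / len(links[x]) for x in pages if p in links[x])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# Toy web: A and C link to B, B links back to A
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))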

Page 15: Statistical Methods

Information Extraction

Ability to answer questions
Possibilities range from simple template matching to full-blown language understanding systems
May be domain-specific or general
Used as a DB front-end or for WWW searching
Examples: AskMSR, IBM's Watson, Wolfram Alpha, Siri

Page 16: Statistical Methods

Template Matching

Simple template matching (Weizenbaum's Eliza)
Regular-expression matching: finite state automata (a toy example is sketched below)
Relational extraction methods, e.g. FASTUS: processing is done in stages (tokenization, complex-word handling, basic-group handling, complex-phrase handling, structure merging); each stage uses an FSA
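A toy regular-expression template in Python's re module for pulling a speaker and a time out of an announcement-style sentence; the pattern and field names are purely illustrative and far simpler than FASTUS.

import re

# Illustrative template: "<Firstname Lastname> will speak at <h:mm am/pm>"
pattern = re.compile(
    r"(?P<speaker>[A-Z][a-z]+ [A-Z][a-z]+) will speak at (?P<time>\d{1,2}:\d{2} ?[ap]m)")

text = "Alice Smith will speak at 3:30 pm in Room 120."
match = pattern.search(text)
if match:
    print(match.group("speaker"), "|", match.group("time"))   # Alice Smith | 3:30 pm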

Page 17: Statistical Methods

Stochastic Methods for NLP

Probabilistic Context-Free Parsers
Probabilistic Lexicalized Context-Free Parsers
Hidden Markov Models – Viterbi Algorithm
Statistical Decision-Tree Models

Page 18: Statistical Methods

Markov Chain

Discrete random process: the system is in various states and we move from state to state.
The probability of moving to a particular next state (a transition) depends solely on the current state and not on previous states (the Markov property).
May be modeled by a finite state machine with probabilities on the edges.
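A minimal sketch of such a chain as a transition table sampled step by step; the two weather states and their probabilities are invented for illustration.

import random

# Transition probabilities: P(next state | current state)
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(start, steps):
    # Walk the chain; each step depends only on the current state (Markov property)
    state, path = start, [start]
    for _ in range(steps):
        nxt = random.choices(list(transitions[state]),
                             weights=list(transitions[state].values()))[0]
        path.append(nxt)
        state = nxt
    return path

print(simulate("sunny", 10))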

Page 19: Statistical Methods

Hidden Markov Model

Each state (or transition) may produce an output.
The outputs are visible to the viewer, but the underlying Markov model is not.
The problem is often to infer the path through the model given a sequence of outputs.
The probabilities associated with the transitions are known a priori.
There may be more than one start state. The probability of each start state may also be known.

Page 20: Statistical Methods

Uses of HMM

Parts of speech (POS) tagging
Speech recognition
Handwriting recognition
Machine translation
Cryptanalysis
Many other non-NLP applications

Page 21: Statistical Methods

Viterbi Algorithm

Used to find the most likely sequence of states (the Viterbi path) in an HMM that leads to a given sequence of observed events.
Runs in time proportional to (number of observations) * (number of states)^2.
Can be modified if the state depends on the last n states (instead of just the last state); this takes time (number of observations) * (number of states)^n.

Page 22: Statistical Methods

Viterbi Algorithm - Assumptions

The system at any given time is in one particular state.
There are a finite number of states.
Transitions have an associated incremental metric.
Events are cumulative over a path, i.e., additive in some sense.

Page 23: Statistical Methods

Viterbi Algorithm - Code

See http://en.wikipedia.org/wiki/Viterbi_algorithm.
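For reference, a compact Python sketch of the algorithm (this is not the Wikipedia listing; the rain/umbrella HMM below is a toy example with invented probabilities):

def viterbi(observations, states, start_p, trans_p, emit_p):
    # Most likely state sequence for the observations.
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max((V[t-1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                             for p in states)
            V[t][s], back[t][s] = prob, prev
    # Reconstruct the Viterbi path by following the back-pointers
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ("rain", "dry")
start_p = {"rain": 0.5, "dry": 0.5}
trans_p = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
emit_p  = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
           "dry":  {"umbrella": 0.2, "no_umbrella": 0.8}}
print(viterbi(["umbrella", "umbrella", "no_umbrella"], states, start_p, trans_p, emit_p))
# ['rain', 'rain', 'dry']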

Page 24: Statistical Methods

Example - Using HMMs

Using HMMs to parse seminar announcements
Look for different features: speaker, date, etc.
Could use one big HMM for all features or separate HMMs for each feature
Advantages: resistant to noise, can be trained from data, easily updated
Can be used to generate output as well as to parse

Page 25: Statistical Methods

Example: HMM for speaker recog.

Page 26: Statistical Methods

Conditional Random Fields

An HMM models the full joint probability of observations and hidden states, which is more work than is needed
Instead, model the conditional probability of the hidden attributes given the observations
Given a text e_{1:N}, find the hidden state sequence X_{1:N} that maximizes P(X_{1:N} | e_{1:N})
A Conditional Random Field (CRF) does this
Linear-chain CRF: the variables form a temporal sequence

Page 27: Statistical Methods

Automated Template Construction

Start with examples of the desired output, e.g., author-title pairs
Match them over a large corpus, noting the order and the prefix, suffix, and intermediate text
Generate templates from the matches
Sensitive to noise

Page 28: Statistical Methods

Types of Grammars - Chomsky

Recursively Enumerable: unrestricted rules
Context-Sensitive: the right-hand side must contain at least as many symbols as the left-hand side
Context-Free: the left-hand side contains a single symbol
Regular Expression: the left-hand side is a single non-terminal; the right-hand side is a terminal symbol optionally followed by a non-terminal symbol

Page 29: Statistical Methods

Probabilistic CFG

1. sent <- np, vp.      p(sent) = p(r1) * p(np) * p(vp).
2. np <- noun.          p(np) = p(r2) * p(noun).
...
9. noun <- dog.         p(noun) = p(dog).

The probabilities are taken from a particular corpus of text.
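A small Python sketch of how such a grammar assigns a probability to a parse by multiplying the probabilities of the rules used; the grammar fragment and the numbers below are invented for illustration, not estimated from a corpus.

# Hypothetical rule probabilities, as if estimated from a corpus
rule_prob = {
    ("sent", ("np", "vp")):     1.0,
    ("np",   ("noun",)):        0.4,
    ("np",   ("det", "noun")):  0.6,
    ("vp",   ("verb",)):        0.3,
    ("vp",   ("verb", "np")):   0.7,
    ("noun", ("dog",)):         0.1,
    ("verb", ("barks",)):       0.05,
}

def tree_prob(tree):
    # Probability of a parse tree = product of the probabilities of the rules used.
    # A tree is (symbol, [subtrees]); a leaf is just a word string.
    if isinstance(tree, str):
        return 1.0
    symbol, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(symbol, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# "dog barks" parsed as sent -> np(noun(dog)) vp(verb(barks))
parse = ("sent", [("np", [("noun", ["dog"])]), ("vp", [("verb", ["barks"])])])
print(tree_prob(parse))   # 1.0 * 0.4 * 0.1 * 0.3 * 0.05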

Page 30: Statistical Methods

Probabilistic Lexicalized CFG

1. sent <- np(noun), vp(verb).   p(sent) = p(r1) * p(np) * p(vp) * p(verb|noun).
2. np <- noun.                   p(np) = p(r2) * p(noun).
...
9. noun <- dog.                  p(noun) = p(dog).

Note that we've introduced the probability of a particular verb given a particular noun.