Dec 22, 2015
Outline
Applications:
• Spelling correction
Formal Representation:
• Weighted FSTs
Algorithms:
• Bayesian Inference (Noisy channel model)
• Methods to determine weights
  – Hand-coded
  – Corpus-based estimation
• Dynamic Programming
  – Shortest path
Detecting and Correcting Spelling Errors
Sources of lexical/spelling errors
• Speech: lexical access and recognition errors (more later)
• Text: typing and cognitive
• OCR: recognition errors
Applications:
• Spell checking
• Hand-writing recognition of zip codes, signatures, Graffiti
Issues:
• Correcting non-words in isolation (dg for dog, but why not dig?)
• Some errors are themselves valid words
  – Homophone substitution: “parents love there children”; “Lets order a desert after dinner”
  – These require correcting words in context
Patterns of Error
Human typists make different types of errors from OCR systems -- why?
Error classification I: performance-based:
• Insertion: catt
• Deletion: ct
• Substitution: car
• Transposition: cta
Error classification II: cognitive
• People don’t know how to spell (nucular/nuclear; potatoe/potato)
• Homonymous errors (their/there)
Probability: Refresher
Population: 10 Princeton students
– 4 vegetarians
– 3 CS majors
Questions:
– What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
– That a rcs is a CS major? p(c) = 0.3
– That a rcs is a vegetarian and a CS major? p(c,v) = 0.2
– That a vegetarian is a CS major? p(c|v) = 0.5
– That a CS major is a vegetarian? p(v|c) = 0.66
– That a non-CS major is a vegetarian? p(v|c’) = ??
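These numbers can be checked mechanically. A minimal sketch; the count of students who are both a vegetarian and a CS major (2) is implied by p(c,v) = 0.2 over 10 students:

```python
# Sanity check of the refresher numbers: 10 students, 4 vegetarians,
# 3 CS majors, 2 who are both (implied by p(c,v) = 0.2).
total, veg, cs, both = 10, 4, 3, 2

p_v = veg / total            # p(v) = 0.4
p_c = cs / total             # p(c) = 0.3
p_cv = both / total          # p(c,v) = 0.2
p_c_given_v = p_cv / p_v     # p(c|v) = 0.5
p_v_given_c = p_cv / p_c     # p(v|c) = 2/3, i.e. about 0.66
# Non-CS vegetarians: 4 - 2 = 2 of the 10 - 3 = 7 non-CS majors
p_v_given_not_c = (veg - both) / (total - cs)

print(p_c_given_v, round(p_v_given_c, 2), round(p_v_given_not_c, 2))
```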
Bayes Rule and Noisy Channel model
• We know the joint probabilities
  – p(c,v) = p(c) p(v|c) (chain rule)
  – p(v,c) = p(c,v) = p(v) p(c|v)
• So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c):

  p(c|v) = p(c) p(v|c) / p(v)

• “Noisy channel” metaphor: the channel corrupts the input; the task is to recover the original.
  – think cell-phone conversations!!
  – Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O):

  w* = argmax_{w ∈ V} P(w|O) = argmax_{w ∈ V} P(O|w) P(w)

  where P(O|w) is the channel model and P(w) is the source model.
How do we use this model to correct spelling errors?
• Simplifying assumptions
  – We only have to correct non-word errors
  – Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, or transposition)
• Generate and Test Method (Kernighan et al. 1990)
  – Generate a word using one of the insertion, deletion, substitution, or transposition operations
  – Test if the resulting word is in the dictionary
• Example:
Observation | Correction | Correct letter | Error letter | Position | Type of error
caat        | cat        | -              | a            | 2        | insertion
caat        | carat      | r              | -            | 3        | deletion
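The generate step can be sketched directly in Python; the four-word dictionary below is a toy stand-in for a real lexicon:

```python
def one_edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution, or
    transposition away from `word` (the generate step)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + ch + r[1:] for l, r in splits if r for ch in alphabet]
    inserts = [l + ch + r for l, r in splits for ch in alphabet]
    return set(deletes + transposes + substitutions + inserts)

# Test step: keep only candidates found in a (toy) dictionary.
dictionary = {"cat", "carat", "coat", "cart"}
print(sorted(one_edit_candidates("caat") & dictionary))
```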
How do we decide which correction is most likely?
Validate the generated word in a dictionary.
• But there may be multiple valid words, how to rank them?
• Rank them based on a scoring function
  – P(w | typo) = P(typo | w) * P(w)
  – Note there could be other scoring functions
• Propose n-best solutions
Estimate the likelihood P(typo|w) and the prior P(w)
• count events from a corpus to estimate these probabilities
• Labeled versus Unlabeled corpus
• For spelling correction, what do we need?
  – Word occurrence information (an unlabeled corpus)
  – A corpus of labeled spelling errors
  – Approximate word replacement probabilities by local letter replacement probabilities: a confusion matrix on letters
Cat vs Carat
Estimating the prior: Suppose we look at the occurrences of cat and carat in a large (50M word) AP news corpus
• cat occurs 6500 times, so p(cat) = .00013
• carat occurs 3000 times, so p(carat) = .00006
Estimating the likelihood: Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’, using a corrections corpus of 50K corrections (p(typo|word))
• suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion after ‘a’ occurs 7500 times (p(-r) = .15)
Scoring function: p(word|typo) = p(typo|word) * p(word)
• p(cat|caat) = p(+a) * p(cat) = .1 * .00013 = .000013
• p(carat|caat) = p(-r) * p(carat) = .15 * .00006 = .000009
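Putting the two estimates together, the scoring arithmetic above can be reproduced as:

```python
# Noisy-channel scoring with the corpus counts quoted above
# (50M-word AP corpus for priors, 50K-correction corpus for the channel).
CORPUS = 50_000_000
CORRECTIONS = 50_000

p_cat = 6500 / CORPUS          # 0.00013
p_carat = 3000 / CORPUS        # 0.00006
p_ins_a = 5000 / CORRECTIONS   # p(+a) = 0.1
p_del_r = 7500 / CORRECTIONS   # p(-r) = 0.15

score_cat = p_ins_a * p_cat      # p(caat|cat) * p(cat)
score_carat = p_del_r * p_carat  # p(caat|carat) * p(carat)

best = "cat" if score_cat > score_carat else "carat"
print(score_cat, score_carat, best)
```

So despite carat requiring only a deletion with the higher channel probability, cat wins on the strength of its prior.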
Encoding One-Error Correction as WFSTs
Let Σ = {c,a,r,t};
One-edit model:
Dictionary model:
One-Error spelling correction:
• Input ● Edit ● Dictionary
[Figure: WFST diagrams over Σ. The dictionary model is an acceptor for cat and carat (c → a → t, with an optional r → a detour). The one-edit model chains identity arcs (c:c, a:a, r:r, t:t) around at most one pass through an Ins state (ε:c, ε:a, ε:r, ε:t), a Del state (c:ε, a:ε, r:ε, t:ε), or a Sub state (c:a, c:r, c:t, a:c, a:t, …).]
Issues
What if there are no instances of carat in corpus?
• Smoothing algorithms
Estimate of P(typo|word) may not be accurate
• Training probabilities on typo/word pairs
What if there is more than one error per word?
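One of the smoothing algorithms mentioned above, add-one (Laplace) smoothing, in sketch form; the 100K vocabulary size is a made-up figure for illustration:

```python
# A minimal add-one (Laplace) smoothing sketch: unseen words get a
# small nonzero prior instead of probability zero.
def laplace_prior(word, counts, corpus_size, vocab_size):
    return (counts.get(word, 0) + 1) / (corpus_size + vocab_size)

counts = {"cat": 6500, "carat": 3000}
# Hypothetical sizes: 50M-word corpus, 100K-word vocabulary.
p_unseen = laplace_prior("karat", counts, 50_000_000, 100_000)
p_cat = laplace_prior("cat", counts, 50_000_000, 100_000)
print(p_unseen > 0, p_cat > p_unseen)
```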
Minimum Edit Distance
How can we measure how different one word is from another word?
• How many operations will it take to transform one word into another?
caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
• Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1)
• Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent
Computing Levenshtein Distance

• Recurrence:

  d[i,j] = min { d[i-1,j] + del(s_i),
                 d[i-1,j-1] + sub(s_i,t_j),
                 d[i,j-1] + ins(t_j) }

  Lev(s,t) = d[|s|,|t|]
• Dynamic Programming algorithm
  – The solution for a problem is a function of the solutions of its subproblems
  – d[i,j] contains the distance between the prefixes s_1..i and t_1..j
  – d[i,j] is computed by combining the distances of shorter prefixes via the insertion, deletion, and substitution operations
  – The optimal sequence of edit operations is recovered by storing back-pointers
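A direct implementation of the dynamic program, with per-operation costs as parameters so corpus-trained weights could be plugged in:

```python
def levenshtein(s, t, ins=1, dele=1, sub=1):
    """Minimum edit distance by dynamic programming: d[i][j] holds the
    distance between the prefixes s[:i] and t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i * dele                         # delete all of s[:i]
    for j in range(1, len(t) + 1):
        d[0][j] = j * ins                          # insert all of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(s)][len(t)]

print(levenshtein("caat", "cat"))    # one deletion
print(levenshtein("caat", "carat"))  # one insertion
```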
Edit Distance Matrix
NB: errors
Cost = 1 for insertions and deletions; cost = 2 for substitutions. Recompute the matrix with insertions = deletions = substitutions = 1.
Levenshtein Distance with WFSTs
Let Σ = {c,a,r,t};
Edit model:
The two sentences to be compared are encoded as FSTs.
Levenstein distance between two sentences:
• Dist(s1,s2) = s1 ● Edit ● s2
[Figure: the edit-model WFST, with identity arcs (c:c, a:a, r:r, t:t) and Ins (ε:c, ε:a, ε:r, ε:t), Del (c:ε, a:ε, r:ε, t:ε), and Sub (c:a, c:r, c:t, a:c, a:t, …) arcs.]
Spelling Correction with WFSTs
Dictionary: FST representation of words
Isolated word spelling correction:
• AllCorrections(w) = w ● Edit ● Dictionary
• BestCorrection(w) = Bestpath(w ● Edit ● Dictionary)
Spelling correction in context: “parents love there children”
• S = w1, w2, … wn
• Spelling correction of wi
  – Generate possible edits for wi
  – Pick the edit that fits best in context
• Use an n-gram language model (LM) to rank the alternatives
  – “love there” vs “love their”; “there children” vs “their children”
• SentenceCorrection(S) = F(S) ● Edit ● LM
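A toy sketch of ranking the alternatives in context with a bigram model; the counts below are invented purely for illustration:

```python
# Hypothetical bigram counts standing in for a corpus-trained LM.
bigram_counts = {
    ("love", "their"): 120, ("love", "there"): 3,
    ("their", "children"): 900, ("there", "children"): 5,
}

def score(words):
    # Unnormalized product of bigram counts, add-one to avoid zeros.
    p = 1
    for a, b in zip(words, words[1:]):
        p *= bigram_counts.get((a, b), 0) + 1
    return p

candidates = [["parents", "love", w, "children"] for w in ("there", "their")]
best = max(candidates, key=score)
print(" ".join(best))
```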
• Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
Can humans understand ‘what is meant’ as opposed to ‘what is said/written’?
How?
http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
Summary
We can apply probabilistic modeling to NL problems like spell-checking
• Noisy channel model, Bayesian method
• Training priors and likelihoods on a corpus
Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
• e.g., the Minimum Edit Distance algorithm
A number of Speech and Language tasks can be cast in this framework:
• Generate alternatives using a generator
• Select the best / rank the alternatives using a model
• If the generator and the model are encodable as FSTs
  – Decoding becomes composition followed by a search for the best path
Word Classes and Tagging
Words can be grouped into classes based on a number of criteria.
• Application-independent criteria
  – Syntactic class (Nouns, Verbs, Adjectives, …)
  – Proper names (people names, country names, …)
  – Dates, currencies
• Application-specific criteria
  – Product names (Ajax, Slurpee, Lexmark 3100)
  – Service names (7-cents plan, GoldPass)
Tagging: Categorizing words of a sentence into one of the classes.
Syntactic Classes in English: Open Class Words
Nouns:
• Defined semantically: words for people, places, things
• Defined syntactically: words that take determiners
• Count nouns: nouns that can be counted
  – One book, two computers, a hundred men
• Mass nouns: nouns that represent homogeneous groups; can occur without articles
  – snow, salt, milk, water, hair
• Proper nouns; common nouns
Verbs: words for actions and processes• Hit, love, run, fly, differ, go
Adjectives: words for describing qualities and properties (modifiers) of objects• White, black, old, young, good, bad
Adverbs: words for describing modifiers of actions
• Unfortunately, John walked home extremely slowly yesterday
• Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)
Syntactic Classes in English: Closed Class Words
Closed Class words:
• fixed set for a language
• Typically high frequency words
Prepositions: relational words for describing relations among objects and events
• In, on, before, by
• Particles: looked up, throw out
Articles/Determiners: definite versus indefinite
• Indefinite: a, an
• Definite: the
Conjunctions: used to join two phrases, clauses, sentences.
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: that, since, because
Pronouns: shorthand to refer to objects and events.
• Personal pronouns: he, she, it, they, us
• Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s
• Wh-pronouns: whose, what, who, whom, whomever
Auxiliary verbs: used to mark tense, aspect, polarity, mood, of an action
• Tense: past, present, future
• Aspect: completed or on-going
• Polarity: negation
• Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, may, might, must, should)
• Copula: “be” connects a subject to a predicate (John is a teacher)
Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry), greetings (hello, goodbye).
Tagset
Tagset: the set of tags to use; depends on the application.
• Basic tags; tags with some morphology
• Composition of a number of subtags
  – Agglutinative languages
Popular tagsets for English
• Penn Treebank tagset: 45 tags
• CLAWS tagset: 61 tags
• C7 tagset: 146 tags
How do we decide how many tags to use?
• Application utility
• Ease of disambiguation
• Annotation consistency
  – The “IN” tag in the Penn Treebank tagset covers both subordinating conjunctions and prepositions
  – The “TO” tag represents both the preposition “to” and the infinitival marker (“to read”)
Supertags: fold syntactic information into the tagset
• of the order of 1000 tags
Tagging: Disambiguating Words
Three different models
• ENGTWOL model (Karlsson et al. 1995)
• Transformation-based model (Brill 1995)
• Hidden Markov Model tagger
ENGTWOL tagger
• Constraint-based tagger
• 1,100 hand-written constraints to rule out invalid combinations of tags
  – Use of probabilistic constraints and syntactic information
Transformation-based model
• Start with the most likely assignment
• Make note of the context when the most likely assignment is wrong
• Induce a transformation rule that corrects the most likely assignment to the correct tag in that context
• Rules can be seen as: change tag α to β in the context δ __ γ
• Compilable into an FST
Again, the Noisy Channel Model
Input to channel: a part-of-speech sequence T
• Output from channel: a word sequence W
• Decoding task: find T’ = argmax_T P(T|W)
• Using Bayes Rule:

  T’ = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W)

• And since P(W) doesn’t change across hypothetical T’s:

  T’ = argmax_T P(W|T) P(T)

• P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability

Source → Noisy Channel → Decoder
Stochastic Tagging: Markov Assumption
• The tagging model is approximated using Markov assumptions.
  – T’ = argmax_T P(T) * P(W|T)
  – Markov (first-order) assumption: P(T) ≈ Π_i P(t_i|t_i-1)
  – Independence assumption: P(W|T) ≈ Π_i P(w_i|t_i)
  – Thus: T’ = argmax_T Π_i P(w_i|t_i) * P(t_i|t_i-1)
• The probability distributions are estimated from an annotated corpus.
  – Maximum Likelihood Estimate
    • P(w|t) = count(w,t)/count(t)
    • P(ti|ti-1) = count(ti-1,ti)/count(ti-1)
    • Don’t forget to smooth the counts!!
  – There are other means of estimating these probabilities.
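The MLE formulas can be sketched over a toy annotated corpus (the sentences are invented, and no smoothing is applied):

```python
from collections import Counter

# Maximum-likelihood estimates of HMM parameters from a toy tagged
# corpus: one list of (word, tag) pairs per sentence.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")],
]

emit, trans, tags = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "BOS"
    for word, tag in sent:
        emit[(word, tag)] += 1
        trans[(tag, prev)] += 1
        tags[tag] += 1
        prev = tag
    # (in practice, also count the transition into EOS and smooth)

def p_emit(w, t):       # P(w|t) = count(w,t)/count(t)
    return emit[(w, t)] / tags[t]

def p_trans(t, prev):   # P(t_i|t_i-1) = count(t_i-1,t_i)/count(t_i-1)
    total = sum(c for (tag, p), c in trans.items() if p == prev)
    return trans[(t, prev)] / total

print(p_emit("dog", "NN"), p_trans("NN", "DT"))
```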
Best Path Search
Search for the best path pervades many Speech and NLP problems.
• ASR: best path through a composition of acoustic, pronunciation, and language models
• Tagging: best path through a composition of lexicon and contextual models
• Edit distance: best path through a search space set up by insertion, deletion, and substitution operations
In general:
• Decisions/operations create a weighted search space
• Search for the best sequence of decisions
Dynamic programming solution
• Sometimes only the score is relevant
• Most often the path (sequence of states; a derivation) is relevant
Multi-stage decision problems
[Figure: tagging lattice for “The dog runs .” with states BOS → DT → {NN, VB} → {NNS, VBZ} → . → EOS]

Emission probabilities:
• P(the|DT) = 0.999
• P(dog|NN) = 0.99
• P(dog|VB) = 0.01
• P(runs|NNS) = 0.63
• P(runs|VBZ) = 0.37
• P(.|.) = 0.999

Transition probabilities:
• P(DT|BOS) = 1
• P(NN|DT) = 0.9
• P(VB|DT) = 0.1
• P(NNS|NN) = 0.3
• P(VBZ|NN) = 0.7
• P(NNS|VB) = 0.7
• P(VBZ|VB) = 0.3
• P(.|NNS) = 0.3
• P(.|VBZ) = 0.7
• P(EOS|.) = 1
Multi-stage decision problems
Find the state sequence through this space that maximizes Π_i P(w_i|t_i) * P(t_i|t_i-1)
cost(BOS, EOS) = 1 * cost(DT, EOS)
cost(DT, EOS) = max{ P(the|DT) * P(NN|DT) * cost(NN, EOS),
                     P(the|DT) * P(VB|DT) * cost(VB, EOS) }
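This maximization over stages is Viterbi decoding; a sketch over the lattice's emission and transition probabilities:

```python
# Viterbi decoding of "The dog runs ." using the probabilities
# from the lattice above.
emit = {("the", "DT"): 0.999, ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
        ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37, (".", "."): 0.999}
trans = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
         ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
         ("NNS", "."): 0.3, ("VBZ", "."): 0.7, (".", "EOS"): 1.0}

def viterbi(words, stages):
    # best[t] = (probability, tag sequence) of the best path ending in t
    best = {"BOS": (1.0, [])}
    for word, cands in zip(words, stages):
        new = {}
        for t in cands:
            p, seq = max(
                (bp * trans.get((prev, t), 0) * emit.get((word, t), 0),
                 bseq + [t])
                for prev, (bp, bseq) in best.items())
            new[t] = (p, seq)
        best = new
    # Close off with the transition into EOS.
    return max((bp * trans.get((t, "EOS"), 0), seq)
               for t, (bp, seq) in best.items())

prob, path = viterbi(["the", "dog", "runs", "."],
                     [["DT"], ["NN", "VB"], ["NNS", "VBZ"], ["."]])
print(path, round(prob, 4))
```

With these numbers the decoder picks DT NN VBZ for “The dog runs”, since the VBZ route (0.7 * 0.37) beats the NNS route (0.3 * 0.63) out of NN.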
Two ways of reasoning
Forward approach (backward reasoning)
• Compute the best way to get from a state to the goal state.
Backward approach (forward reasoning)
• Compute the best way to get from the source state to a state.
A combination of these two approaches is used in unsupervised training of HMMs.• Forward-backward algorithm (Appendix D)