Dec 22, 2015
Outline
Applications:
• Spelling correction
Formal Representation:
• Weighted FSTs
Algorithms:
• Bayesian Inference (Noisy channel model)
• Methods to determine weights
  – Hand-coded
  – Corpus-based estimation
• Dynamic Programming
  – Shortest path
Detecting and Correcting Spelling Errors
Sources of lexical/spelling errors
• Speech: lexical access and recognition errors (more later)
• Text: typing and cognitive
• OCR: recognition errors
Applications:
• Spell checking
• Hand-writing recognition of zip codes, signatures, Graffiti
Issues:
• Correcting non-words in isolation (dg for dog, but why not dig?)
• Some errors are themselves valid words
  – Homophone substitution: “parents love there children”; “Lets order a desert after dinner”
  – These require correcting words in context
Patterns of Error
Human typists make different types of errors from OCR systems -- why?
Error classification I: performance-based:
• Insertion: catt
• Deletion: ct
• Substitution: car
• Transposition: cta
Error classification II: cognitive
• People don’t know how to spell (nucular/nuclear; potatoe/potato)
• Homonymous errors (their/there)
Probability: Refresher
Population: 10 Princeton students
– 4 vegetarians
– 3 CS majors
Questions:
– What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
– That a rcs is a CS major? p(c) = 0.3
– That a rcs is a vegetarian and a CS major? p(c,v) = 0.2
– That a vegetarian is a CS major? p(c|v) = 0.5
– That a CS major is a vegetarian? p(v|c) = 0.66
– That a non-CS major is a vegetarian? p(v|c’) = ??
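These numbers can be checked mechanically. A minimal sketch; the count of students who are both a vegetarian and a CS major (2) is implied by p(c,v) = 0.2 over 10 students:

```python
# Sanity check of the refresher numbers: 10 students, 4 vegetarians,
# 3 CS majors, 2 who are both (implied by p(c,v) = 0.2).
total, veg, cs, both = 10, 4, 3, 2

p_v = veg / total            # p(v) = 0.4
p_c = cs / total             # p(c) = 0.3
p_cv = both / total          # p(c,v) = 0.2
p_c_given_v = p_cv / p_v     # p(c|v) = 0.5
p_v_given_c = p_cv / p_c     # p(v|c) = 2/3, i.e. about 0.66
# Non-CS vegetarians: 4 - 2 = 2 of the 10 - 3 = 7 non-CS majors
p_v_given_not_c = (veg - both) / (total - cs)

print(p_c_given_v, round(p_v_given_c, 2), round(p_v_given_not_c, 2))
```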
Bayes Rule and Noisy Channel model
• We know the joint probabilities
  – p(c,v) = p(c) p(v|c) (chain rule)
  – p(v,c) = p(c,v) = p(v) p(c|v)
• So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c):

  p(c|v) = p(c) p(v|c) / p(v)

• “Noisy channel” metaphor: the channel corrupts the input; the task is to recover the original.
  – think cell-phone conversations!!
  – Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O):

  w* = argmax_{w ∈ V} P(w|O) = argmax_{w ∈ V} P(O|w) P(w)

  where P(O|w) is the channel model and P(w) is the source model.
How do we use this model to correct spelling errors?
• Simplifying assumptions
  – We only have to correct non-word errors
  – Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, or transposition)
• Generate and Test Method (Kernighan et al. 1990)
  – Generate a word using one of the insertion, deletion, substitution, or transposition operations
  – Test if the resulting word is in the dictionary
• Example:
Observation | Correction | Correct letter | Error letter | Position | Type of error
caat        | cat        | -              | a            | 2        | insertion
caat        | carat      | r              | -            | 3        | deletion
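The generate step can be sketched directly in Python; the four-word dictionary below is a toy stand-in for a real lexicon:

```python
def one_edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution, or
    transposition away from `word` (the generate step)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + ch + r[1:] for l, r in splits if r for ch in alphabet]
    inserts = [l + ch + r for l, r in splits for ch in alphabet]
    return set(deletes + transposes + substitutions + inserts)

# Test step: keep only candidates found in a (toy) dictionary.
dictionary = {"cat", "carat", "coat", "cart"}
print(sorted(one_edit_candidates("caat") & dictionary))
```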
How do we decide which correction is most likely?
Validate the generated word in a dictionary.
• But there may be multiple valid words, how to rank them?
• Rank them based on a scoring function
  – P(w | typo) = P(typo | w) * P(w)
  – Note there could be other scoring functions
• Propose n-best solutions
Estimate the likelihood P(typo|w) and the prior P(w)
• count events from a corpus to estimate these probabilities
• Labeled versus Unlabeled corpus
• For spelling correction, what do we need?
  – Word occurrence information (an unlabeled corpus)
  – A corpus of labeled spelling errors
  – Approximate word replacement probabilities by local letter replacement probabilities: a confusion matrix on letters
Cat vs Carat
Estimating the prior: Suppose we look at the occurrences of cat and carat in a large (50M word) AP news corpus
• cat occurs 6500 times, so p(cat) = .00013
• carat occurs 3000 times, so p(carat) = .00006
Estimating the likelihood: Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’, using a corrections corpus of 50K corrections (p(typo|word))
• suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion after ‘a’ occurs 7500 times (p(-r) = .15)
Scoring function: p(word|typo) = p(typo|word) * p(word)
• p(cat|caat) = p(+a) * p(cat) = .1 * .00013 = .000013
• p(carat|caat) = p(-r) * p(carat) = .15 * .00006 = .000009
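Putting the two estimates together, the scoring arithmetic above can be reproduced as:

```python
# Noisy-channel scoring with the corpus counts quoted above
# (50M-word AP corpus for priors, 50K-correction corpus for the channel).
CORPUS = 50_000_000
CORRECTIONS = 50_000

p_cat = 6500 / CORPUS          # 0.00013
p_carat = 3000 / CORPUS        # 0.00006
p_ins_a = 5000 / CORRECTIONS   # p(+a) = 0.1
p_del_r = 7500 / CORRECTIONS   # p(-r) = 0.15

score_cat = p_ins_a * p_cat      # p(caat|cat) * p(cat)
score_carat = p_del_r * p_carat  # p(caat|carat) * p(carat)

best = "cat" if score_cat > score_carat else "carat"
print(score_cat, score_carat, best)
```

So despite carat requiring only a deletion with the higher channel probability, cat wins on the strength of its prior.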
Encoding One-Error Correction as WFSTs
Let Σ = {c,a,r,t};
One-edit model:
Dictionary model:
One-Error spelling correction:
• Input ● Edit ● Dictionary
[Figure: WFST diagrams over Σ. The dictionary model is an acceptor for cat and carat (c → a → t, with an optional r → a detour). The one-edit model chains identity arcs (c:c, a:a, r:r, t:t) around at most one pass through an Ins state (ε:c, ε:a, ε:r, ε:t), a Del state (c:ε, a:ε, r:ε, t:ε), or a Sub state (c:a, c:r, c:t, a:c, a:t, …).]
Issues
What if there are no instances of carat in corpus?
• Smoothing algorithms
Estimate of P(typo|word) may not be accurate
• Training probabilities on typo/word pairs
What if there is more than one error per word?
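One of the smoothing algorithms mentioned above, add-one (Laplace) smoothing, in sketch form; the 100K vocabulary size is a made-up figure for illustration:

```python
# A minimal add-one (Laplace) smoothing sketch: unseen words get a
# small nonzero prior instead of probability zero.
def laplace_prior(word, counts, corpus_size, vocab_size):
    return (counts.get(word, 0) + 1) / (corpus_size + vocab_size)

counts = {"cat": 6500, "carat": 3000}
# Hypothetical sizes: 50M-word corpus, 100K-word vocabulary.
p_unseen = laplace_prior("karat", counts, 50_000_000, 100_000)
p_cat = laplace_prior("cat", counts, 50_000_000, 100_000)
print(p_unseen > 0, p_cat > p_unseen)
```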
Minimum Edit Distance
How can we measure how different one word is from another word?
• How many operations will it take to transform one word into another?
caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
• Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1)
• Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent
Computing Levenshtein Distance

• Recurrence:

  d[i,j] = min { d[i-1,j] + del(s_i),
                 d[i-1,j-1] + sub(s_i,t_j),
                 d[i,j-1] + ins(t_j) }

  Lev(s,t) = d[|s|,|t|]
• Dynamic Programming algorithm
  – The solution for a problem is a function of the solutions of its subproblems
  – d[i,j] contains the distance between the prefixes s_1..i and t_1..j
  – d[i,j] is computed by combining the distances of shorter prefixes via the insertion, deletion, and substitution operations
  – The optimal sequence of edit operations is recovered by storing back-pointers
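A direct implementation of the dynamic program, with per-operation costs as parameters so corpus-trained weights could be plugged in:

```python
def levenshtein(s, t, ins=1, dele=1, sub=1):
    """Minimum edit distance by dynamic programming: d[i][j] holds the
    distance between the prefixes s[:i] and t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i * dele                         # delete all of s[:i]
    for j in range(1, len(t) + 1):
        d[0][j] = j * ins                          # insert all of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(s)][len(t)]

print(levenshtein("caat", "cat"))    # one deletion
print(levenshtein("caat", "carat"))  # one insertion
```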
Edit Distance Matrix
NB: errors
Cost = 1 for insertions and deletions; cost = 2 for substitutions. Recompute the matrix with insertions = deletions = substitutions = 1.
Levenshtein Distance with WFSTs
Let Σ = {c,a,r,t};
Edit model:
The two sentences to be compared are encoded as FSTs.
Levenstein distance between two sentences:
• Dist(s1,s2) = s1 ● Edit ● s2
[Figure: the edit-model WFST, with identity arcs (c:c, a:a, r:r, t:t) and Ins (ε:c, ε:a, ε:r, ε:t), Del (c:ε, a:ε, r:ε, t:ε), and Sub (c:a, c:r, c:t, a:c, a:t, …) arcs.]
Spelling Correction with WFSTs
Dictionary: FST representation of words
Isolated word spelling correction:
• AllCorrections(w) = w ● Edit ● Dictionary
• BestCorrection(w) = Bestpath(w ● Edit ● Dictionary)
Spelling correction in context: “parents love there children”
• S = w1, w2, … wn
• Spelling correction of wi
  – Generate possible edits for wi
  – Pick the edit that fits best in context
• Use an n-gram language model (LM) to rank the alternatives
  – “love there” vs “love their”; “there children” vs “their children”
• SentenceCorrection(S) = F(S) ● Edit ● LM
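A toy sketch of ranking the alternatives in context with a bigram model; the counts below are invented purely for illustration:

```python
# Hypothetical bigram counts standing in for a corpus-trained LM.
bigram_counts = {
    ("love", "their"): 120, ("love", "there"): 3,
    ("their", "children"): 900, ("there", "children"): 5,
}

def score(words):
    # Unnormalized product of bigram counts, add-one to avoid zeros.
    p = 1
    for a, b in zip(words, words[1:]):
        p *= bigram_counts.get((a, b), 0) + 1
    return p

candidates = [["parents", "love", w, "children"] for w in ("there", "their")]
best = max(candidates, key=score)
print(" ".join(best))
```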
• Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
Can humans understand ‘what is meant’ as opposed to ‘what is said/written’?
How?
http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
Summary
We can apply probabilistic modeling to NL problems like spell-checking
• Noisy channel model, Bayesian method
• Training priors and likelihoods on a corpus
Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
• e.g., the Minimum Edit Distance algorithm
A number of Speech and Language tasks can be cast in this framework:
• Generate alternatives using a generator
• Select the best / rank the alternatives using a model
• If the generator and the model are encodable as FSTs
  – Decoding becomes composition followed by a search for the best path
Word Classes and Tagging
Words can be grouped into classes based on a number of criteria.
• Application-independent criteria
  – Syntactic class (Nouns, Verbs, Adjectives, …)
  – Proper names (people names, country names, …)
  – Dates, currencies
• Application-specific criteria
  – Product names (Ajax, Slurpee, Lexmark 3100)
  – Service names (7-cents plan, GoldPass)
Tagging: Categorizing words of a sentence into one of the classes.
Syntactic Classes in English: Open Class Words
Nouns:
• Defined semantically: words for people, places, things
• Defined syntactically: words that take determiners
• Count nouns: nouns that can be counted
  – One book, two computers, a hundred men
• Mass nouns: nouns that represent homogeneous groups; can occur without articles
  – snow, salt, milk, water, hair
• Proper nouns; common nouns
Verbs: words for actions and processes• Hit, love, run, fly, differ, go
Adjectives: words for describing qualities and properties (modifiers) of objects• White, black, old, young, good, bad
Adverbs: words for describing modifiers of actions
• Unfortunately, John walked home extremely slowly yesterday
• Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)
Syntactic Classes in English: Closed Class Words
Closed Class words:
• fixed set for a language
• Typically high frequency words
Prepositions: relational words for describing relations among objects and events
• In, on, before, by
• Particles: looked up, throw out
Articles/Determiners: definite versus indefinite
• Indefinite: a, an
• Definite: the
Conjunctions: used to join two phrases, clauses, sentences.
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: that, since, because
Pronouns: shorthand to refer to objects and events.
• Personal pronouns: he, she, it, they, us
• Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s
• Wh-pronouns: whose, what, who, whom, whomever
Auxiliary verbs: used to mark tense, aspect, polarity, mood, of an action
• Tense: past, present, future
• Aspect: completed or on-going
• Polarity: negation
• Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, may, might, must, should)
• Copula: “be” connects a subject to a predicate (John is a teacher)
Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry), greetings (hello, goodbye).
Tagset
Tagset: the set of tags to use; depends on the application.
• Basic tags; tags with some morphology
• Composition of a number of subtags
  – Agglutinative languages
Popular tagsets for English
• Penn Treebank tagset: 45 tags
• CLAWS tagset: 61 tags
• C7 tagset: 146 tags
How do we decide how many tags to use?
• Application utility
• Ease of disambiguation
• Annotation consistency
  – The “IN” tag in the Penn Treebank tagset covers both subordinating conjunctions and prepositions
  – The “TO” tag represents both the preposition “to” and the infinitival marker (“to read”)
Supertags: fold syntactic information into the tagset
• of the order of 1000 tags
Tagging: Disambiguating Words
Three different models
• ENGTWOL model (Karlsson et al. 1995)
• Transformation-based model (Brill 1995)
• Hidden Markov Model tagger
ENGTWOL tagger
• Constraint-based tagger
• 1,100 hand-written constraints to rule out invalid combinations of tags
  – Use of probabilistic constraints and syntactic information
Transformation-based model
• Start with the most likely assignment
• Make note of the context when the most likely assignment is wrong
• Induce a transformation rule that corrects the most likely assignment to the correct tag in that context
• Rules can be seen as: change tag α to β in the context δ __ γ
• Compilable into an FST
Again, the Noisy Channel Model
Input to channel: a part-of-speech sequence T
• Output from channel: a word sequence W
• Decoding task: find T’ = argmax_T P(T|W)
• Using Bayes Rule:

  T’ = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W)

• And since P(W) doesn’t change across hypothetical T’s:

  T’ = argmax_T P(W|T) P(T)

• P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability

Source → Noisy Channel → Decoder
Stochastic Tagging: Markov Assumption
• The tagging model is approximated using Markov assumptions.
  – T’ = argmax_T P(T) * P(W|T)
  – Markov (first-order) assumption: P(T) ≈ Π_i P(t_i|t_i-1)
  – Independence assumption: P(W|T) ≈ Π_i P(w_i|t_i)
  – Thus: T’ = argmax_T Π_i P(w_i|t_i) * P(t_i|t_i-1)
• The probability distributions are estimated from an annotated corpus.
  – Maximum Likelihood Estimate
    • P(w|t) = count(w,t)/count(t)
    • P(ti|ti-1) = count(ti-1,ti)/count(ti-1)
    • Don’t forget to smooth the counts!!
  – There are other means of estimating these probabilities.
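The MLE formulas can be sketched over a toy annotated corpus (the sentences are invented, and no smoothing is applied):

```python
from collections import Counter

# Maximum-likelihood estimates of HMM parameters from a toy tagged
# corpus: one list of (word, tag) pairs per sentence.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")],
]

emit, trans, tags = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "BOS"
    for word, tag in sent:
        emit[(word, tag)] += 1
        trans[(tag, prev)] += 1
        tags[tag] += 1
        prev = tag
    # (in practice, also count the transition into EOS and smooth)

def p_emit(w, t):       # P(w|t) = count(w,t)/count(t)
    return emit[(w, t)] / tags[t]

def p_trans(t, prev):   # P(t_i|t_i-1) = count(t_i-1,t_i)/count(t_i-1)
    total = sum(c for (tag, p), c in trans.items() if p == prev)
    return trans[(t, prev)] / total

print(p_emit("dog", "NN"), p_trans("NN", "DT"))
```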
Best Path Search
Search for the best path pervades many Speech and NLP problems.
• ASR: best path through a composition of acoustic, pronunciation, and language models
• Tagging: best path through a composition of lexicon and contextual models
• Edit distance: best path through a search space set up by insertion, deletion, and substitution operations
In general:
• Decisions/operations create a weighted search space
• Search for the best sequence of decisions
Dynamic programming solution
• Sometimes only the score is relevant
• Most often the path (sequence of states; a derivation) is relevant
Multi-stage decision problems
[Figure: tagging lattice for “The dog runs .” with states BOS → DT → {NN, VB} → {NNS, VBZ} → . → EOS]

Emission probabilities:
• P(the|DT) = 0.999
• P(dog|NN) = 0.99
• P(dog|VB) = 0.01
• P(runs|NNS) = 0.63
• P(runs|VBZ) = 0.37
• P(.|.) = 0.999

Transition probabilities:
• P(DT|BOS) = 1
• P(NN|DT) = 0.9
• P(VB|DT) = 0.1
• P(NNS|NN) = 0.3
• P(VBZ|NN) = 0.7
• P(NNS|VB) = 0.7
• P(VBZ|VB) = 0.3
• P(.|NNS) = 0.3
• P(.|VBZ) = 0.7
• P(EOS|.) = 1
Multi-stage decision problems
Find the state sequence through this space that maximizes Π_i P(w_i|t_i) * P(t_i|t_i-1)
cost(BOS, EOS) = 1 * cost(DT, EOS)
cost(DT, EOS) = max{ P(the|DT) * P(NN|DT) * cost(NN, EOS),
                     P(the|DT) * P(VB|DT) * cost(VB, EOS) }
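This maximization over stages is Viterbi decoding; a sketch over the lattice's emission and transition probabilities:

```python
# Viterbi decoding of "The dog runs ." using the probabilities
# from the lattice above.
emit = {("the", "DT"): 0.999, ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
        ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37, (".", "."): 0.999}
trans = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
         ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
         ("NNS", "."): 0.3, ("VBZ", "."): 0.7, (".", "EOS"): 1.0}

def viterbi(words, stages):
    # best[t] = (probability, tag sequence) of the best path ending in t
    best = {"BOS": (1.0, [])}
    for word, cands in zip(words, stages):
        new = {}
        for t in cands:
            p, seq = max(
                (bp * trans.get((prev, t), 0) * emit.get((word, t), 0),
                 bseq + [t])
                for prev, (bp, bseq) in best.items())
            new[t] = (p, seq)
        best = new
    # Close off with the transition into EOS.
    return max((bp * trans.get((t, "EOS"), 0), seq)
               for t, (bp, seq) in best.items())

prob, path = viterbi(["the", "dog", "runs", "."],
                     [["DT"], ["NN", "VB"], ["NNS", "VBZ"], ["."]])
print(path, round(prob, 4))
```

With these numbers the decoder picks DT NN VBZ for “The dog runs”, since the VBZ route (0.7 * 0.37) beats the NNS route (0.3 * 0.63) out of NN.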
Two ways of reasoning
Forward approach (backward reasoning)
• Compute the best way to get from a state to the goal state.
Backward approach (forward reasoning)
• Compute the best way to get from the source state to a state.
A combination of these two approaches is used in unsupervised training of HMMs.• Forward-backward algorithm (Appendix D)