
CS 4705: Probabilistic Approaches to Pronunciation and Spelling

Transcript
Page 1:

CS 4705

Probabilistic Approaches to Pronunciation and Spelling

Page 2:

Spoken and Written Word (Lexical) Errors

• Variation vs. error

• Word formation errors:
– I go to Columbia Universary.
– Easy enoughly.
– words of rule formation

• Lexical access problems:
– Turn to the right (left)
– “I called my mother on the television and did not understand the door. It was too breakfast, but they came from far to near. My mother is not too old for me to be young.” (Wernicke’s aphasia)

Page 3:

• Can humans understand ‘what is meant’ as opposed to ‘what is said/written’?

– Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.

• How do we do this?

Page 4:

Detecting and Correcting Spelling Errors

• Applications:
– Spell checking in MS Word

– OCR scanning errors

– Handwriting recognition of zip codes, signatures, Graffiti

• Issues:
– Correct non-words (dg for dog, but frplc?)

– Correct “wrong” words in context (their for there, words of rule formation)

Page 5:

Patterns of Error

• Human typists make different types of errors from OCR systems -- why?

• Error classification I: performance-based
– Insertion: catt

– Deletion: ct

– Substitution: car

– Transposition: cta

• Error classification II: cognitive
– People don’t know how to spell (nucular/nuclear)

– Homonymous errors (their/there)

Page 6:

How do we decide if a (legal) word is an error?

• How likely is a word to occur?
– They met there friends in Mozambique.

• The Noisy Channel Model

– Input to channel: true (typed or spoken) word w

– Output from channel: an observation O

– Decoding task: find ŵ = argmax_{w ∈ V} P(w|O)

[Figure: Source → Noisy Channel → Decoder]

Page 7:

Bayesian Inference

• Population: 10 Columbia students
– 4 vegetarians
– 3 CS majors

– What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = .4

– That a rcs is a CS major? p(c) = .3

– That a rcs is a vegetarian CS major? p(c,v) = .2

Page 8:

Bayesian Inference

• Population: 10 Columbia students
– 4 vegetarians, 3 CS majors

– Probability that a rcs is a vegetarian? p(v) = .4
– That a rcs who is a vegetarian is also a CS major? p(c|v) = .5
– That a rcs is a vegetarian (and) CS major? p(c,v) = .2

Page 9:

Bayesian Inference

• Population: 10 Columbia students
– 4 vegetarians, 3 CS majors

– Probability that a rcs is a vegetarian? p(v) = .4

– That a rcs who is a vegetarian is also a CS major? p(c|v) = .5

– That a rcs is a vegetarian and a CS major? p(c,v) = .2 = p(v) p(c|v) (.4 * .5)

Page 10:

Bayesian Inference

• Population: 10 Columbia students
– 4 vegetarians, 3 CS majors

– Probability that a rcs is a CS major? p(c) = .3

– That a rcs who is a CS major is also a vegetarian? p(v|c) = .66 (i.e., 2/3)

– That a rcs is a vegetarian CS major? p(c,v) = .2 = p(c) p(v|c) (.3 * .66 ≈ .2)

Page 11:

Bayes Rule

• So, we know the joint probabilities– p(c,v) = p(c) p(v|c)

– p(v,c) = p(v) p(c|v)

– p(c,v) = p(v,c)

• Using these equations, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c)
– p(v) p(c|v) = p(c) p(v|c)

p(c|v) = p(v|c) p(c) / p(v)

Page 12:

Returning to Spelling...

– Channel Input: w; Output: O

– Decoding: hypothesis ŵ = argmax_{w ∈ V} P(w|O)

– or, by Bayes Rule...

– ŵ = argmax_{w ∈ V} P(O|w) P(w) / P(O)

– and, since P(O) is the same for every lexicon entry we consider, we can ignore it as a constant, so...

– ŵ = argmax_{w ∈ V} P(O|w) P(w) (given that w was intended, how likely are we to see O?)

[Figure: Source → Noisy Channel → Decoder]

Page 13:

How could we use this model to correct spelling errors?

• Simplifying assumptions
– We only have to correct non-word errors

– Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition)

• From O, generate a list of candidates differing by one step and appearing in the lexicon, e.g. (a candidate-generation sketch follows the table)

Error  Corr   Corr letter  Error letter  Pos  Type
caat   cat    -            a             2    ins
caat   carat  r            -             3    del
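To make the candidate-generation step concrete, here is a minimal Python sketch, modeled on the well-known one-edit candidate generator from Peter Norvig's spelling-corrector essay (the toy LEXICON and the function names are illustrative assumptions, not part of the lecture):

import string

# Toy lexicon -- an assumption for illustration; a real system loads a dictionary.
LEXICON = {"cat", "carat", "cart", "coat", "chat"}

def one_step_candidates(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

def candidates(typo):
    """Candidates one step from the typo that appear in the lexicon."""
    return one_step_candidates(typo) & LEXICON

print(candidates("caat"))  # includes 'cat' (delete an 'a') and 'carat' (insert 'r')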

Page 14:

How do we decide which correction is most likely?

• We want to find the lexicon entry w that maximizes P(typo|w) P(w)

• How do we estimate the likelihood P(typo|w) and the prior P(w)?

• First, find some corpora
– Different corpora are needed for different purposes
– Some need to be labeled; others do not
– For spelling correction, what do we need?
  • Word occurrence information (unlabeled)
  • A corpus of labeled spelling errors

Page 15:

Cat vs Carat

• Suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus
– cat occurs 6500 times, so p(cat) = .00013
– carat occurs 3000 times, so p(carat) = .00006

• Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’, using a corrections corpus of 50K corrections (p(typo|word))
– suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion occurs 7500 times (p(-r) = .15)

Page 16:

• Then p(word|typo) ∝ p(typo|word) * p(word)
– p(cat|caat) ∝ p(+a) * p(cat) = .1 * .00013 = .000013
– p(carat|caat) ∝ p(-r) * p(carat) = .15 * .00006 = .000009
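The same arithmetic as a short Python sketch (all counts are the ones assumed on these slides):

# Noisy channel scoring of caat -> {cat, carat} with the slides' counts.
CORPUS_WORDS = 50_000_000    # 50M-word AP news corpus
CORRECTIONS = 50_000         # 50K-correction error corpus

p_cat = 6500 / CORPUS_WORDS      # .00013
p_carat = 3000 / CORPUS_WORDS    # .00006
p_ins_a = 5000 / CORRECTIONS     # p(+a) = .1
p_del_r = 7500 / CORRECTIONS     # p(-r) = .15

# score(w) is proportional to p(typo|w) * p(w); p(typo) cancels in the argmax
scores = {"cat": p_ins_a * p_cat, "carat": p_del_r * p_carat}
print(max(scores, key=scores.get))  # 'cat' (.000013 beats .000009)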

• Issues:
– What if there are no instances of carat in the corpus?
  • Smoothing algorithms
– The estimate of P(typo|word) may not be accurate
  • Train probabilities on typo/word pairs
– What if there is more than one error per word?

Page 17:

A More General Approach: Minimum Edit Distance

• How can we measure how different one word is from another?
– How many operations will it take to transform one word into another?

caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)

– Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins = del = subst = 1)

– Alternative: weight each operation by training on a corpus of spelling errors to see which operations are most frequent

Page 18:

– Alternative: count substitutions as 2 (1 insertion and 1 deletion)

– Alternative: Damerau-Levenshtein Distance includes transpositions as a single operation (e.g. cta --> cat)

• Code and demo for simple Levenshtein MED
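For instance, a minimal dynamic-programming implementation in Python (a sketch, not the course's own demo; the function name and parameterized costs are my own, with the substitution cost settable so the subst=2 variant on the next slides can be tried too):

def min_edit_distance(source, target, ins=1, dele=1, sub=1):
    """Levenshtein minimum edit distance via dynamic programming.

    dist[i][j] = cost of turning the first i chars of source
    into the first j chars of target.
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):    # delete all of source's prefix
        dist[i][0] = dist[i - 1][0] + dele
    for j in range(1, m + 1):    # insert all of target's prefix
        dist[0][j] = dist[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else sub
            dist[i][j] = min(dist[i - 1][j] + dele,           # deletion
                             dist[i][j - 1] + ins,            # insertion
                             dist[i - 1][j - 1] + sub_cost)   # substitution/match
    return dist[n][m]

print(min_edit_distance("caat", "cat"))             # 1
print(min_edit_distance("intention", "execution"))  # 5 (with ins=del=subst=1)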

Page 19:

MED Calculation is an Example of Dynamic Programming

• Decompose a problem into its subproblems
– e.g., fp --> firep is a subproblem of fplc --> fireplace

– Intuition: an optimal solution for the subproblem will be part of an optimal solution for the problem

– Solve any subproblem only once: store all solutions

– Recursive algorithm

• Often: work backwards from the desired goal state to the initial state

Page 20:

• For MED, create an edit-distance matrix:
– each cell c[x,y] represents the distance between the first x chars of the target t and the first y chars of the source s (i.e., the x-length prefix of t compared to the y-length prefix of s)

– this distance is the minimum cost of the insertion, deletion, and substitution operations on the previously considered substrings of the source and target (the recurrence is written out below)
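Written out, the cell recurrence is the standard one (shown here with the subst=2 costs used on the next slides; matching characters substitute for free):

c[x,0] = x,  c[0,y] = y

c[x,y] = min( c[x-1,y] + 1,
              c[x,y-1] + 1,
              c[x-1,y-1] + (0 if t_x = s_y, else 2) )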

Page 21:

Edit Distance Matrix, Subst=2 -- Wrong! (NB: errors)

[Matrix figure not recovered; a corrected matrix appears on the next page]

Page 22:

NB: Subst x for x cost is 0, not 2

Edit-distance matrix for source “execution” (columns) and target “intention” (rows, read bottom-up), subst = 2:

n   9   8   9  10  11  12  11  10   9   8
o   8   7   8   9  10  11  10   9   8   9
i   7   6   7   8   9  10   9   8   9  10
t   6   5   6   7   8   9   8   9  10  11
n   5   4   5   6   7   8   9  10  11  10
e   4   3   4   5   6   7   8   9  10   9
t   3   4   5   6   7   8   7   8   9   8
n   2   3   4   5   6   7   8   7   8   7
i   1   2   3   4   5   6   7   6   7   8
#   0   1   2   3   4   5   6   7   8   9

    #   e   x   e   c   u   t   i   o   n
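As a sanity check, the Levenshtein sketch from page 18, run with sub=2, reproduces the top-right corner of this matrix: min_edit_distance("intention", "execution", sub=2) returns 8.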

Page 23:

Summary

• We can apply probabilistic modeling to NL problems like spell-checking
– Noisy channel model, Bayesian method
– Training priors and likelihoods on a corpus

• Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
– e.g. the MED algorithm

• Apply similar methods to modeling pronunciation variation

Page 24:

– Allophonic variation + register/style (lexical) variation

butter/tub, going to/gonna

– Pronunciation phenomena can be seen as insertions/deletions/substitutions too, with somewhat different ways of computing the likelihoods

• Measuring ASR accuracy over words (WER)

• Next time: Ch 6
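For reference (a standard definition, not specific to these slides): WER is itself a word-level minimum edit distance. Align the recognizer's hypothesis against the reference transcript using MED over words; then

WER = (S + D + I) / N

where S, D, and I count the substituted, deleted, and inserted words in the alignment, and N is the number of words in the reference.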