LECTURE 5: FINITE-STATE METHODS AND STATISTICAL NLP
Mark Granroth-Wilding

LAST LECTURE
• Finite-state automata (FSAs)
• Finite-state transducers (FSTs) to produce output
• Intro to morphology: word-internal structure
• Now: application of FSTs to morphological analysis (and generation)
1
FSTs FOR MORPHOLOGY
• Can encode morphological rules as FSTs
• Example: Finnish vowel harmony
• A reminder:
  • Back vowels: a, o, u
  • Front vowels: ä, ö, y
  • Middle vowels: e, i
• Words contain back+middle or front+middle vowels
• Never: back+front
• Affixes come in back and front forms: e.g. na/nä
2
FSA FOR FINNISH VOWEL HARMONY
Accepts only valid combinations of front/back/middle
[FSA diagram: states 1 (start), 2, 3]
  1 → 2 on a, o, u
  1 → 3 on ä, ö, y
  1 → 1 on other
  2 → 2 on C, a, o, u, e, i
  3 → 3 on C, ä, ö, y, e, i
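The acceptor above can be simulated directly. A minimal sketch in Python (function and variable names are my own; `pääsa` is a made-up invalid string for illustration):

```python
BACK = set("aou")
FRONT = set("äöy")

def accepts(word):
    """Simulate the 3-state FSA: reject words mixing back and front vowels."""
    state = 1  # 1: no harmony class fixed yet; 2: back word; 3: front word
    for ch in word:
        if state == 1:
            if ch in BACK:
                state = 2
            elif ch in FRONT:
                state = 3
            # consonants and middle vowels (e, i) loop in state 1
        elif state == 2 and ch in FRONT:
            return False  # no transition defined: back + front is invalid
        elif state == 3 and ch in BACK:
            return False
    return True

print(accepts("punainen"))  # True: back + middle vowels
print(accepts("päässä"))    # True: front vowels only
print(accepts("pääsa"))     # False: mixes front and back
```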
3
FST FOR FINNISH VOWEL HARMONY
• Can be used in NLG system
• Affixes added by other process
• Generic form: harmony left till later (now)
A→a/ä   O→o/ö   U→u/y
punaise+ssA → punaisessa
pää+ssA → päässä
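The same realization rules can be sketched procedurally. This is not the FST itself, just an equivalent character-by-character sweep; names are invented, and the assumption that stems with only neutral vowels take front forms is mine (it matches standard Finnish, but is not stated on the slide):

```python
BACK = {"A": "a", "O": "o", "U": "u"}
FRONT = {"A": "ä", "O": "ö", "U": "y"}

def realize(form):
    """Realize archiphonemes A, O, U as back or front vowels,
    following the harmony class of the vowels seen so far."""
    harmony = "front"  # assumption: neutral-vowel-only stems take front forms
    out = []
    for ch in form:
        if ch == "+":
            continue  # drop the morpheme boundary marker
        if ch in "aou":
            harmony = "back"
        elif ch in "äöy":
            harmony = "front"
        elif ch in "AOU":
            ch = BACK[ch] if harmony == "back" else FRONT[ch]
        out.append(ch)
    return "".join(out)

print(realize("punaise+ssA"))  # punaisessa
print(realize("pää+ssA"))      # päässä
```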
4
FST FOR FINNISH VOWEL HARMONY
[FST diagram: states 1 (start), 2, 3]
  1 → 2 on a:a, o:o, u:u, A:a, O:o, U:u
  1 → 3 on ä:ä, ö:ö, y:y, A:ä, O:ö, U:y
  1 → 1 on other:other
  2 → 2 on C:C, a:a, o:o, u:u, e:e, i:i, A:a, O:o, U:u
  3 → 3 on C:C, ä:ä, ö:ö, y:y, e:e, i:i, A:ä, O:ö, U:y

Input:  p u n a i s e s s A
Output: p u n a i s e s s a
5
FST FOR FINNISH VOWEL HARMONY
(Same FST, applied to a front-harmony word:)

Input:  p ä ä s s A
Output: p ä ä s s ä
5
OTHER USES OF FSTs
• Morphological generation
• Text preprocessing, tokenization, NER, . . .
• Simple dialogue systems: flow of possible questions, responses, . . .
• Dialogue models: tracking dialogue state, user knowledge
More on dialogue in lecture 11
6
PRACTICAL FSTs
• Vowel harmony: very simple example, not even complete
• Real morphology much more complex
• Huge, complicated FSTs
• Divide into smaller components: compose
• Limited ambiguity: few possible analyses per word
• Better predictions if enough examples of wi−3 wi−2 wi−1 in training data
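The "divide into smaller components: compose" idea can be illustrated in spirit by chaining simple string transductions, each standing in for one small transducer. Real toolkits compose actual FSTs; this sketch (all names invented) only mimics the pipeline structure:

```python
def strip_boundaries(s):
    """Stage 1: remove morpheme boundary markers."""
    return s.replace("+", "")

def lowercase(s):
    """Stage 2: normalize case."""
    return s.lower()

def compose(*stages):
    """Run transduction stages in sequence, like FST composition."""
    def run(s):
        for stage in stages:
            s = stage(s)
        return s
    return run

pipeline = compose(strip_boundaries, lowercase)
print(pipeline("Punaise+ssa"))  # punaisessa
```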
20
REMINDER: ZIPF’S LAW
[Plot: word frequency (y, 0–10000) against frequency rank (x, 0–100)]
  Most frequent token: ","
  Next: "the"
  'Long tail' of rare words
21
ZIPF’S LAW
[Plot: n-gram frequency against rank, for ranks 0–7000]
• A few n-grams are very common
• Many n-grams are very rare (long tail)
For n-gram models:
• Always some unseen n-grams
• Diminishing returns of more data
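The long tail and the unseen-n-gram problem show up even on a toy corpus. A small sketch (the sentences and all names are made up for illustration):

```python
from collections import Counter

text = ("the dog ate the bone the dog chewed the bone "
        "the cat ate the fish").split()

unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

# A few items are very common; many bigram types occur only once (the tail)
print(unigrams.most_common(3))
singletons = sum(1 for c in bigrams.values() if c == 1)
print(f"{singletons}/{len(bigrams)} bigram types occur only once")

# An n-gram unseen in training simply gets count zero
print(bigrams[("dog", "slept")])  # 0
```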
22
DATA SPARSITY
• How to assign probability to unseen words?
p(phogen|ti−2, ti−1)
• What predictions to make in unseen context?
p(ti |phogen, cridget)
• General problem: models must be robust to things not seen in training data
1. Do we have any information to inform the decision?
2. If no, what then?
23
DATA SPARSITY
Robust to things not seen in training data
1. Do we have any information to inform the decision?
2. If no, what then?
Some solutions for n-gram models:
1. Backoff: unseen/rare n-gram, try using (n − 1)-gram
2. Smoothing: reserve some probability for things never seen before
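Both solutions can be sketched for a bigram model. The backoff variant here is "stupid backoff" with a fixed discount, and add-one (Laplace) smoothing stands in for smoothing generally; the corpus and constant `alpha` are invented for illustration:

```python
from collections import Counter

tokens = "the dog ate the bone the dog chewed the bone".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def p_backoff(word, prev, alpha=0.4):
    """Stupid backoff: use the bigram if seen, else a discounted unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / len(tokens)

def p_laplace(word, prev):
    """Add-one (Laplace) smoothing: every bigram gets at least count 1."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_backoff("bone", "the"))  # seen bigram: plain relative frequency
print(p_backoff("fish", "the"))  # unseen word: backoff still gives 0 here,
                                 # since the unigram is unseen too
print(p_laplace("fish", "the"))  # smoothing gives a nonzero probability
```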
24
LANGUAGE MODELLING APPLICATION
One example: speech recognition
[Diagram: speech input; LM scores two candidate transcriptions]
  “Finally a small settlement loomed ahead.”    log prob: −221.6
  “Final ya smalls set all mens loom da head.”  log prob: −4840.1
25
LANGUAGE MODELLING ACCURACY?
• Remember: LM is prob dist over sentences
p(W )
• Typically modelled as dist over words given earlier words
p(wi |w0, . . . ,wi−1)
• Two LMs produce different predictions:
  p(wi | w0, . . . , wi−1, LM0)    p(wi | w0, . . . , wi−1, LM1)
• Could compare using accuracy
argmaxw p(w | w0, . . . , wi−1) =? wi
• Large vocabulary: top prediction rarely correct
26
EVALUATING A LANGUAGE MODEL
• Observed word should get high probability
• but others are not wrong
• Compare probabilities of observed words
• Perplexity: measure mean probability of observed words
• How surprised model is on average by each word
2^( −(1/m) log2 p(w0, . . . , wm−1) )
• For LM whose context is previous words:
2^( −(1/m) Σ_{i=0..m−1} log2 p(wi | w0, . . . , wi−1) )
• Lower is better
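Perplexity for the simplest case, a unigram LM whose word probabilities are fixed, can be computed directly from the definition above (the probability table is made up for illustration):

```python
import math

# Toy unigram LM probabilities (invented for illustration)
p = {"the": 0.3, "dog": 0.1, "ate": 0.05, "bone": 0.05}

def perplexity(words):
    """2 to the power of the mean negative log2 probability per word."""
    log_sum = sum(math.log2(p[w]) for w in words)
    return 2 ** (-log_sum / len(words))

print(perplexity(["the", "dog"]))   # common words: lower perplexity
print(perplexity(["ate", "bone"]))  # rarer words: higher perplexity (20.0)
```

Note that for a unigram model the perplexity is just the geometric mean of the inverse word probabilities, which is why `["ate", "bone"]` gives exactly 1/0.05 = 20.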
27
EXAMPLE LM PERPLEXITIES
Evaluation on Penn Treebank texts (890k toks)
Model                                    Perplexity
AWD-LSTM, mixture of softmaxes (2018)    54.44
LSTM (2014)                              82.7
Smoothed 5-gram (1996)                   141.2
Evaluation on WikiText-2 (Wikipedia text, 2M toks)
Model                                    Perplexity
AWD-LSTM, mixture of softmaxes (2018)    61.45
Variational LSTM (2016)                  87.0
28
PERPLEXITY
• What does 54.44 PPL mean?
• Is it good?
• Depends on:
  • test corpus: how hard is it to predict?
  • training corpus: how big / representative?
• Compare models on same training + test corpus
29
POS TAGGING
Reminder
• Distinguish syntactic function of words in broad classes
For the present [Noun], we are . . .   vs.   The present [Adjective] situation . . .
• NLP subtask: part-of-speech (POS) tagging
• Shallow syntactic analysis: no explicit structure
• Includes some disambiguation important to meaning
• Useful practical first analysis step
30
POS TAGS
Some parts of speech:
POS          Example
Noun         The dog ate the bone
Verb         The dog ate the bone
Adjective    The big dog ate the tasty bone
Adverb       The dog ate the bone quickly
Pronoun      He ate my bone
Determiner   The dog ate that bone
Typically make slightly more fine-grained distinctions.
Some words have one possible POS; most have several.
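The "one possible POS vs. several" point can be made concrete with the simplest possible tagger: a lexicon lookup that picks each word's most frequent tag and ignores context entirely. The mini-lexicon and default below are invented for illustration:

```python
# Tiny hand-made lexicon: word -> possible POS tags, most frequent first
LEXICON = {
    "the": ["Determiner"],
    "dog": ["Noun", "Verb"],   # ambiguous: "to dog someone"
    "ate": ["Verb"],
    "bone": ["Noun", "Verb"],  # ambiguous: "to bone a fish"
    "big": ["Adjective"],
}

def tag(sentence):
    """Baseline tagger: each word's most frequent tag, ignoring context.
    Unknown words default to Noun (an arbitrary choice for this sketch)."""
    return [(w, LEXICON.get(w, ["Noun"])[0]) for w in sentence.split()]

print(tag("the big dog ate the bone"))
```

Such a baseline gets many ambiguous words right just because one tag dominates, which is exactly why context-sensitive taggers are evaluated against it.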
31
EXERCISE: POS-TAGGING AMBIGUITY
In small groups
POS                        Example
Noun                       The dog ate the bone
Verb                       The dog ate the bone
Adjective                  The big dog ate the bone
Adverb                     He ate the bone quickly
Pronoun                    He ate my bone
Determiner                 The dog ate that bone
Preposition                He chewed with his teeth
Coordinating conjunction   He chewed and he growled
Example
Return now to your quarters and I will send you word of the outcome
• POS tag this sentence, using the POSs above
• What other POSs could each word take (in other contexts)?
32
AMBIGUITY OF INTERPRETATION
What is the mean temperature in Kumpula?
SELECT day_mean FROM daily_forecast
WHERE station = 'Helsinki Kumpula'
AND date = '2019-05-21';

SELECT day_mean FROM weekly_forecast
WHERE station = 'Helsinki Kumpula'
AND week = 'w22';

SELECT MEAN(day_temp)
FROM weather_history
WHERE station = 'Helsinki Kumpula'
AND year = '2019';
. . .?
• Many forms of ambiguity
• Every level/step of analysis
• The big challenge of NLP
33
SOME AMBIGUITIES
Easily misinterpreted headlines [1]:
• Ban on Nude Dancing on Governor’s Desk
• Juvenile Court to Try Shooting Defendant
  (“Try”: same POS tag, different meanings; “Shooting”: different POS tags — 2 interpretations)
• Kids Make Nutritious Snacks
Several types of ambiguity – what?
Can POS-tagging ambiguity explain these?
[1] Thanks to Dan Klein & Roger Levy
34
AMBIGUITY
Juvenile Court to Try Shooting Defendant
• Two interpretations by human reader
• NLP system should output both
• Most syntactic parsers will produce dozens of structures!
• Majority have no intelligible interpretation
35
LOCAL AMBIGUITY
Ambiguity often resolved by immediate context
Turning to Doggo, Myles extended his left palm.
• Most ambiguities resolved in this way for humans
• Simple context sometimes enough
• Sometimes require sophisticated reasoning/knowledge:
Time flies like an arrow.
Fruit flies like a banana.
36
GARDEN-PATH SENTENCES
• Start encourages one interpretation
• Continuation forces re-analysis
Thanks to @lanegreen
37
[Second garden-path example image]
Thanks to @AlexBledsoe
37
CORPORA
Corpus (pl. corpora)
The body of written or spoken material upon which a linguistic analysis is based
• Why do we need corpora?
  • Test linguistic hypotheses
  • Evaluate tools: annotated/labelled corpus
  • Train statistical models  ← Statistical NLP: today
• Most often: collection of text
• Also: speech, video, numeric data, . . . , combinations

• Used long-studied linguistic knowledge, but:
  • Lots of rules
  • Complex interactions
  • Narrow domain
  • Hard to handle varied (“incorrect”) and changing language
• Statistics from data can help
39
STATISTICAL MODELS
Help model ambiguity / uncertainty, as in POS tagging
• Multiple interpretations
• Weights/confidences derived from data
• Combinatorial effects → influence of context
• Express uncertainty in output
40
STATISTICAL MODELS
• Statistics over previous analyses can help estimate confidences
• Often use probabilistic models
• Local ambiguity: probabilities/confidences
• Multiple hypotheses about meaning/structure
• Update hypotheses as larger units are combined
Statistical models ≈ machine learning (ML)
41
STATISTICAL MODELS
• Can try to learn everything from data
• Practical and theoretical difficulties
• Some success in recent work
Advanced statisticalNLP: lecture 13
• Mostly: focussed statistical modelling of sub-tasks
• Supervised & unsupervised learning
• Collect statistics (learn) from corpora• annotated (supervised) / raw data (unsupervised)
42
POS-TAGGING AMBIGUITY
Example
Return now to your quarters and I will send you word of the outcome.