LECTURE 5: FINITE-STATE METHODS AND STATISTICAL NLP
Mark Granroth-Wilding

LAST LECTURE
• Finite-state automata (FSAs)
• Finite-state transducers (FSTs) to produce output
• Intro to morphology: word-internal structure
• Now: application of FSTs to morphological analysis (and generation)
1
FSTs FOR MORPHOLOGY
• Can encode morphological rules as FSTs
• Example: Finnish vowel harmony
• A reminder:
  • Back vowels: a, o, u
  • Front vowels: ä, ö, y
  • Middle vowels: e, i
• Words contain back+middle or front+middle vowels
• Never: back+front
• Affixes come in back and front forms: e.g. na/nä
2
FSA FOR FINNISH VOWEL HARMONY
Accepts only valid combinations of front/back/middle
[FSA diagram: states 1 (start), 2, 3]
  1 → 2 on a, o, u
  1 → 3 on ä, ö, y
  1 → 1 on other
  2 → 2 on C, a, o, u, e, i
  3 → 3 on C, ä, ö, y, e, i
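The acceptor above can be simulated directly. A minimal sketch in Python (function and variable names are my own; `pääsa` is a made-up invalid string for illustration):

```python
BACK = set("aou")
FRONT = set("äöy")

def accepts(word):
    """Simulate the 3-state FSA: reject words mixing back and front vowels."""
    state = 1  # 1: no harmony class fixed yet; 2: back word; 3: front word
    for ch in word:
        if state == 1:
            if ch in BACK:
                state = 2
            elif ch in FRONT:
                state = 3
            # consonants and middle vowels (e, i) loop in state 1
        elif state == 2 and ch in FRONT:
            return False  # no transition defined: back + front is invalid
        elif state == 3 and ch in BACK:
            return False
    return True

print(accepts("punainen"))  # True: back + middle vowels
print(accepts("päässä"))    # True: front vowels only
print(accepts("pääsa"))     # False: mixes front and back
```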
3
FST FOR FINNISH VOWEL HARMONY
• Can be used in NLG system
• Affixes added by other process
• Generic form: harmony left till later (now)
A→a/ä   O→o/ö   U→u/y
punaise+ssA → punaisessa
pää+ssA → päässä
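The same realization rules can be sketched procedurally. This is not the FST itself, just an equivalent character-by-character sweep; names are invented, and the assumption that stems with only neutral vowels take front forms is mine (it matches standard Finnish, but is not stated on the slide):

```python
BACK = {"A": "a", "O": "o", "U": "u"}
FRONT = {"A": "ä", "O": "ö", "U": "y"}

def realize(form):
    """Realize archiphonemes A, O, U as back or front vowels,
    following the harmony class of the vowels seen so far."""
    harmony = "front"  # assumption: neutral-vowel-only stems take front forms
    out = []
    for ch in form:
        if ch == "+":
            continue  # drop the morpheme boundary marker
        if ch in "aou":
            harmony = "back"
        elif ch in "äöy":
            harmony = "front"
        elif ch in "AOU":
            ch = BACK[ch] if harmony == "back" else FRONT[ch]
        out.append(ch)
    return "".join(out)

print(realize("punaise+ssA"))  # punaisessa
print(realize("pää+ssA"))      # päässä
```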
4
FST FOR FINNISH VOWEL HARMONY
[FST diagram: states 1 (start), 2, 3]
  1 → 2 on a:a, o:o, u:u, A:a, O:o, U:u
  1 → 3 on ä:ä, ö:ö, y:y, A:ä, O:ö, U:y
  1 → 1 on other:other
  2 → 2 on C:C, a:a, o:o, u:u, e:e, i:i, A:a, O:o, U:u
  3 → 3 on C:C, ä:ä, ö:ö, y:y, e:e, i:i, A:ä, O:ö, U:y

Input:  p u n a i s e s s A
Output: p u n a i s e s s a
5
FST FOR FINNISH VOWEL HARMONY
(Same FST, applied to a front-harmony word:)

Input:  p ä ä s s A
Output: p ä ä s s ä
5
OTHER USES OF FSTs
• Morphological generation
• Text preprocessing, tokenization, NER, . . .
• Simple dialogue systems: flow of possible questions, responses, . . .
• Dialogue models: tracking dialogue state, user knowledge
More on dialogue in lecture 11
6
PRACTICAL FSTs
• Vowel harmony: very simple example, not even complete
• Real morphology much more complex
• Huge, complicated FSTs
• Divide into smaller components: compose
• Limited ambiguity: few possible analyses per word
• Better predictions if enough examples of wi−3 wi−2 wi−1 in training data
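The "divide into smaller components: compose" idea can be illustrated in spirit by chaining simple string transductions, each standing in for one small transducer. Real toolkits compose actual FSTs; this sketch (all names invented) only mimics the pipeline structure:

```python
def strip_boundaries(s):
    """Stage 1: remove morpheme boundary markers."""
    return s.replace("+", "")

def lowercase(s):
    """Stage 2: normalize case."""
    return s.lower()

def compose(*stages):
    """Run transduction stages in sequence, like FST composition."""
    def run(s):
        for stage in stages:
            s = stage(s)
        return s
    return run

pipeline = compose(strip_boundaries, lowercase)
print(pipeline("Punaise+ssa"))  # punaisessa
```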
20
REMINDER: ZIPF’S LAW
[Plot: word frequency (y, 0–10000) against frequency rank (x, 0–100)]
  Most frequent token: ","
  Next: "the"
  'Long tail' of rare words
21
ZIPF’S LAW
[Plot: n-gram frequency against rank, for ranks 0–7000]
• A few n-grams are very common
• Many n-grams are very rare (long tail)
For n-gram models:
• Always some unseen n-grams
• Diminishing returns of more data
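The long tail and the unseen-n-gram problem show up even on a toy corpus. A small sketch (the sentences and all names are made up for illustration):

```python
from collections import Counter

text = ("the dog ate the bone the dog chewed the bone "
        "the cat ate the fish").split()

unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

# A few items are very common; many bigram types occur only once (the tail)
print(unigrams.most_common(3))
singletons = sum(1 for c in bigrams.values() if c == 1)
print(f"{singletons}/{len(bigrams)} bigram types occur only once")

# An n-gram unseen in training simply gets count zero
print(bigrams[("dog", "slept")])  # 0
```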
22
DATA SPARSITY
• How to assign probability to unseen words?
p(phogen|ti−2, ti−1)
• What predictions to make in unseen context?
p(ti |phogen, cridget)
• General problem: models must be robust to things not seen in training data
1. Do we have any information to inform the decision?
2. If no, what then?
23
DATA SPARSITY
Robust to things not seen in training data
1. Do we have any information to inform the decision?
2. If no, what then?
Some solutions for n-gram models:
1. Backoff: unseen/rare n-gram, try using (n − 1)-gram
2. Smoothing: reserve some probability for things never seen before
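Both solutions can be sketched for a bigram model. The backoff variant here is "stupid backoff" with a fixed discount, and add-one (Laplace) smoothing stands in for smoothing generally; the corpus and constant `alpha` are invented for illustration:

```python
from collections import Counter

tokens = "the dog ate the bone the dog chewed the bone".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def p_backoff(word, prev, alpha=0.4):
    """Stupid backoff: use the bigram if seen, else a discounted unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / len(tokens)

def p_laplace(word, prev):
    """Add-one (Laplace) smoothing: every bigram gets at least count 1."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_backoff("bone", "the"))  # seen bigram: plain relative frequency
print(p_backoff("fish", "the"))  # unseen word: backoff still gives 0 here,
                                 # since the unigram is unseen too
print(p_laplace("fish", "the"))  # smoothing gives a nonzero probability
```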
24
LANGUAGE MODELLING APPLICATION
One example: speech recognition
[Diagram: speech input; LM scores two candidate transcriptions]
  “Finally a small settlement loomed ahead.”    log prob: −221.6
  “Final ya smalls set all mens loom da head.”  log prob: −4840.1
25
LANGUAGE MODELLING ACCURACY?
• Remember: LM is prob dist over sentences
p(W )
• Typically modelled as dist over words given earlier words
p(wi |w0, . . . ,wi−1)
• Two LMs produce different predictions:
  p(wi | w0, . . . , wi−1, LM0)    p(wi | w0, . . . , wi−1, LM1)
• Could compare using accuracy
argmaxw p(w | w0, . . . , wi−1) =? wi
• Large vocabulary: top prediction rarely correct
26
EVALUATING A LANGUAGE MODEL
• Observed word should get high probability
• but others are not wrong
• Compare probabilities of observed words
• Perplexity: measure mean probability of observed words
• How surprised model is on average by each word
2^( −(1/m) log2 p(w0, . . . , wm−1) )
• For LM whose context is previous words:
2^( −(1/m) Σ_{i=0..m−1} log2 p(wi | w0, . . . , wi−1) )
• Lower is better
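Perplexity for the simplest case, a unigram LM whose word probabilities are fixed, can be computed directly from the definition above (the probability table is made up for illustration):

```python
import math

# Toy unigram LM probabilities (invented for illustration)
p = {"the": 0.3, "dog": 0.1, "ate": 0.05, "bone": 0.05}

def perplexity(words):
    """2 to the power of the mean negative log2 probability per word."""
    log_sum = sum(math.log2(p[w]) for w in words)
    return 2 ** (-log_sum / len(words))

print(perplexity(["the", "dog"]))   # common words: lower perplexity
print(perplexity(["ate", "bone"]))  # rarer words: higher perplexity (20.0)
```

Note that for a unigram model the perplexity is just the geometric mean of the inverse word probabilities, which is why `["ate", "bone"]` gives exactly 1/0.05 = 20.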
27
EXAMPLE LM PERPLEXITIES
Evaluation on Penn Treebank texts (890k toks)
Model                                    Perplexity
AWD-LSTM, mixture of softmaxes (2018)    54.44
LSTM (2014)                              82.7
Smoothed 5-gram (1996)                   141.2
Evaluation on WikiText-2 (Wikipedia text, 2M toks)
Model                                    Perplexity
AWD-LSTM, mixture of softmaxes (2018)    61.45
Variational LSTM (2016)                  87.0
28
PERPLEXITY
• What does 54.44 PPL mean?
• Is it good?
• Depends on:
  • test corpus: how hard is it to predict?
  • training corpus: how big / representative?
• Compare models on same training + test corpus
29
POS TAGGING
Reminder
• Distinguish syntactic function of words in broad classes
For the present [Noun], we are . . .   vs.   The present [Adjective] situation . . .
• NLP subtask: part-of-speech (POS) tagging
• Shallow syntactic analysis: no explicit structure
• Includes some disambiguation important to meaning
• Useful practical first analysis step
30
POS TAGS
Some parts of speech:
POS          Example
Noun         The dog ate the bone
Verb         The dog ate the bone
Adjective    The big dog ate the tasty bone
Adverb       The dog ate the bone quickly
Pronoun      He ate my bone
Determiner   The dog ate that bone
Typically make slightly more fine-grained distinctions.
Some words have one possible POS; most have several.
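The "one possible POS vs. several" point can be made concrete with the simplest possible tagger: a lexicon lookup that picks each word's most frequent tag and ignores context entirely. The mini-lexicon and default below are invented for illustration:

```python
# Tiny hand-made lexicon: word -> possible POS tags, most frequent first
LEXICON = {
    "the": ["Determiner"],
    "dog": ["Noun", "Verb"],   # ambiguous: "to dog someone"
    "ate": ["Verb"],
    "bone": ["Noun", "Verb"],  # ambiguous: "to bone a fish"
    "big": ["Adjective"],
}

def tag(sentence):
    """Baseline tagger: each word's most frequent tag, ignoring context.
    Unknown words default to Noun (an arbitrary choice for this sketch)."""
    return [(w, LEXICON.get(w, ["Noun"])[0]) for w in sentence.split()]

print(tag("the big dog ate the bone"))
```

Such a baseline gets many ambiguous words right just because one tag dominates, which is exactly why context-sensitive taggers are evaluated against it.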
31
EXERCISE: POS-TAGGING AMBIGUITY
In small groups
POS                        Example
Noun                       The dog ate the bone
Verb                       The dog ate the bone
Adjective                  The big dog ate the bone
Adverb                     He ate the bone quickly
Pronoun                    He ate my bone
Determiner                 The dog ate that bone
Preposition                He chewed with his teeth
Coordinating conjunction   He chewed and he growled
Example
Return now to your quarters and I will send you word of the outcome
• POS tag this sentence, using the POSs above
• What other POSs could each word take (in other contexts)?
32
AMBIGUITY OF INTERPRETATION
What is the mean temperature in Kumpula?
SELECT day_mean FROM daily_forecast
WHERE station = 'Helsinki Kumpula'
AND date = '2019-05-21';

SELECT day_mean FROM weekly_forecast
WHERE station = 'Helsinki Kumpula'
AND week = 'w22';

SELECT MEAN(day_temp)
FROM weather_history
WHERE station = 'Helsinki Kumpula'
AND year = '2019';
. . .?
• Many forms of ambiguity
• Every level/step of analysis
• The big challenge of NLP
33
SOME AMBIGUITIES
Easily misinterpreted headlines [1]:
• Ban on Nude Dancing on Governor’s Desk
• Juvenile Court to Try Shooting Defendant
  (“Try”: same POS tag, different meanings; “Shooting”: different POS tags — 2 interpretations)
• Kids Make Nutritious Snacks
Several types of ambiguity – what?
Can POS-tagging ambiguity explain these?
[1] Thanks to Dan Klein & Roger Levy
34
AMBIGUITY
Juvenile Court to Try Shooting Defendant
• Two interpretations by human reader
• NLP system should output both
• Most syntactic parsers will produce dozens of structures!
• Majority have no intelligible interpretation
35
LOCAL AMBIGUITY
Ambiguity often resolved by immediate context
Turning to Doggo, Myles extended his left palm.
• Most ambiguities resolved in this way for humans
• Simple context sometimes enough
• Sometimes require sophisticated reasoning/knowledge:
Time flies like an arrow.
Fruit flies like a banana.
36
GARDEN-PATH SENTENCES
• Start encourages one interpretation
• Continuation forces re-analysis
Thanks to @lanegreen
37
[Second garden-path example image]
Thanks to @AlexBledsoe
37
CORPORA
Corpus (pl. corpora)
The body of written or spoken material upon which a linguistic analysis is based
• Why do we need corpora?
  • Test linguistic hypotheses
  • Evaluate tools: annotated/labelled corpus
  • Train statistical models  ← Statistical NLP: today
• Most often: collection of text
• Also: speech, video, numeric data, . . . , combinations

• Used long-studied linguistic knowledge, but:
  • Lots of rules
  • Complex interactions
  • Narrow domain
  • Hard to handle varied (“incorrect”) and changing language
• Statistics from data can help
39
STATISTICAL MODELS
Help model ambiguity / uncertainty, as in POS tagging
• Multiple interpretations
• Weights/confidences derived from data
• Combinatorial effects → influence of context
• Express uncertainty in output
40
STATISTICAL MODELS
• Statistics over previous analyses can help estimate confidences
• Often use probabilistic models
• Local ambiguity: probabilities/confidences
• Multiple hypotheses about meaning/structure
• Update hypotheses as larger units are combined
Statistical models ≈ machine learning (ML)
41
STATISTICAL MODELS
• Can try to learn everything from data
• Practical and theoretical difficulties
• Some success in recent work
Advanced statisticalNLP: lecture 13
• Mostly: focussed statistical modelling of sub-tasks
• Supervised & unsupervised learning
• Collect statistics (learn) from corpora• annotated (supervised) / raw data (unsupervised)
42
POS-TAGGING AMBIGUITY
Example
Return now to your quarters and I will send you word of the outcome.