Finite-State Methods in Natural Language Processing
Lauri Karttunen, LSA 2005 Summer Institute, July 20, 2005

Page 1: Finite-State Methods in Natural Language Processing

Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 20, 2005

Page 2: Finite-State Methods in Natural Language Processing

Course Outline

July 18: Intro to computational morphology, XFST

Readings
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)

July 20: Regular expressions, More on XFST

Readings
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”

Page 3: Finite-State Methods in Natural Language Processing

July 25: Concatenative morphotactics, Constraining non-local dependencies

Readings
Chapter 4. “The LEXC Language”
Chapter 5. “Flag Diacritics”

July 27: Non-concatenative morphotactics (reduplication, interdigitation)

Readings
Chapter 8. “Non-Concatenative Morphotactics”

Page 4: Finite-State Methods in Natural Language Processing

August 1: Realizational morphology

Readings
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3: Optimality theory

Readings
Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.

Page 5: Finite-State Methods in Natural Language Processing

Scripting xfst

xfst -l myscript
    Start XFST, execute myscript, wait for more commands from the command line.

xfst -f myscript
    Execute myscript and exit.

xfst -e "echo Welcome" \
     -e "regex a b c;" \
     -e "save foo" \
     -stop
    Execute the commands in the given order. The commands must be on the same line. The -stop at the end is required to make xfst quit.

Page 6: Finite-State Methods in Natural Language Processing

Numeral Script

# This script constructs the language of English
# numerals from "one" to "ninety-nine".
# This is a comment.

# From "one" through "nine":

define OneToNine [{one} | {two} | {three} | {four} | {five} |
                  {six} | {seven} | {eight} | {nine}];

# It is convenient to define a set of prefixes that
# can be followed either by "teen" or by "ty".

define TeenTyStem [{thir} | {fif} | {six} | {seven} | {eigh} | {nine}] ;

Page 7: Finite-State Methods in Natural Language Processing

Numeral Script (Continued)

# From "ten" to "nineteen"
define Teens [{ten} | {eleven} | {twelve} |
              [TeenTyStem | {four}] {teen}];

# Let's define stems that can be followed by "ty".
define TyStem [TeenTyStem | {twen} | {for}];

# TyStem is followed either by "ty" or by "ty-"
# and a number from OneToNine.

define Tens [TyStem [{ty} | {ty-} OneToNine]];

define OneToNinetyNine [ OneToNine | Teens | Tens ];

push OneToNinetyNine
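The way the script builds its language by union and concatenation can be mirrored with ordinary sets of strings. The following is a minimal Python sketch, not xfst; the helper `concat` and the variable names are assumptions chosen to follow the script's definitions.

```python
from itertools import product

def concat(*languages):
    """Concatenate finite languages: every way of picking one string from each."""
    return {"".join(parts) for parts in product(*languages)}

# Same definitions as the xfst script, with | as set union.
one_to_nine = {"one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"}
teen_ty_stem = {"thir", "fif", "six", "seven", "eigh", "nine"}
teens = {"ten", "eleven", "twelve"} | concat(teen_ty_stem | {"four"}, {"teen"})
ty_stem = teen_ty_stem | {"twen", "for"}
tens = concat(ty_stem, {"ty"}) | concat(ty_stem, {"ty-"}, one_to_nine)
one_to_ninety_nine = one_to_nine | teens | tens

print(len(one_to_ninety_nine))
print("forty-two" in one_to_ninety_nine, "fourty" in one_to_ninety_nine)
```

Keeping "four" out of TeenTyStem and "for" out of the teens is what prevents the misspelling "fourty" from entering the language.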

Page 8: Finite-State Methods in Natural Language Processing

Number to Numeral

Generation: 105 → hundred five, hundred and five, one hundred and five

Analysis: hundred five → 105

Page 9: Finite-State Methods in Natural Language Processing

NumberToNumeral script

# This script constructs a transducer that relates the
# English numerals "one", "two", ..., "ninety-nine",
# to the corresponding numbers "1", "2", ..., "99".

define OneToNine [1:{one} | 2:{two} | 3:{three} |
                  4:{four} | 5:{five} | 6:{six} |
                  7:{seven} | 8:{eight} | 9:{nine}];

define TeenTyStem [3:{thir} | 5:{fif} | 6:{six} |
                   7:{seven} | 8:{eigh} | 9:{nine}];

define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
                   [TeenTyStem | 4:{four}] 0:{teen}]];

Page 10: Finite-State Methods in Natural Language Processing

NumberToNumeral (Continued)

define TyStem [2:{twen} | TeenTyStem | 4:{for}];

# TyStem is followed either by "ty" paired with a zero
# or by "ty-" mapped to an epsilon and followed by a
# number. Note that {0} means zero and not epsilon.

define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];

define OneToNinetyNine [ OneToNine | Teens | Tens ];

push OneToNinetyNine
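The transducer can be pictured as a finite set of (number, numeral) pairs built by the same unions and concatenations. This is a Python sketch under that assumption, not xfst; `cat`, `down`, and `up` are illustrative names, and the empty string stands in for the epsilon written as bare 0 in the script.

```python
def cat(*relations):
    """Concatenate relations: upper sides and lower sides are joined in parallel."""
    result = {("", "")}
    for rel in relations:
        result = {(u + u2, l + l2) for (u, l) in result for (u2, l2) in rel}
    return result

one_to_nine = {("1", "one"), ("2", "two"), ("3", "three"), ("4", "four"),
               ("5", "five"), ("6", "six"), ("7", "seven"), ("8", "eight"),
               ("9", "nine")}
teen_ty_stem = {("3", "thir"), ("5", "fif"), ("6", "six"),
                ("7", "seven"), ("8", "eigh"), ("9", "nine")}
# 1:0 pairs the digit 1 with epsilon; {0}:{ten} pairs a literal zero with "ten".
teens = cat({("1", "")},
            {("0", "ten"), ("1", "eleven"), ("2", "twelve")}
            | cat(teen_ty_stem | {("4", "four")}, {("", "teen")}))
ty_stem = {("2", "twen")} | teen_ty_stem | {("4", "for")}
tens = cat(ty_stem, {("0", "ty")} | cat({("", "ty-")}, one_to_nine))
relation = one_to_nine | teens | tens

def down(number):
    """Generation: number to numeral(s)."""
    return sorted(l for (u, l) in relation if u == number)

def up(numeral):
    """Analysis: numeral to number(s)."""
    return sorted(u for (u, l) in relation if l == numeral)

print(down("42"), up("thirteen"))
```

A real transducer does not enumerate its pairs; the set is only practical here because the relation is finite and small.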

Page 11: Finite-State Methods in Natural Language Processing

Xerox RE Operators

$         containment
=>        restriction
-> @->    replacement

These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Page 12: Finite-State Methods in Natural Language Processing

Containment

[Network diagram for $a, with arcs labeled a and ?]

$a

Equivalent expression: [?* a ?*]
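As a rough analogy, the expansion [?* a ?*] corresponds to the ordinary regular expression `.*a.*` matched against the whole string. A minimal Python sketch, assuming single-character symbols; `contains_a` is an illustrative name.

```python
import re

# [?* a ?*]: any symbols, then an a, then any symbols.
CONTAINS_A = re.compile(r".*a.*", re.DOTALL)

def contains_a(s):
    """True if s belongs to the language $a."""
    return CONTAINS_A.fullmatch(s) is not None

print(contains_a("banana"), contains_a("xyz"))
```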

Page 13: Finite-State Methods in Natural Language Processing

Restriction

[Network diagram for a => b _ c, with arcs labeled ?, a, b, and c]

a => b _ c

“Any a must be preceded by b and followed by c.”

Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
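The two conjuncts of the equivalent expression can be read as two checks on every occurrence of a. A minimal Python sketch, assuming single-character symbols; the function name is illustrative.

```python
def satisfies_restriction(s):
    """True if s is in the language a => b _ c."""
    for i, ch in enumerate(s):
        if ch != "a":
            continue
        # ~[~[?* b] a ?*]: no a whose prefix fails to end in b.
        if not s[:i].endswith("b"):
            return False
        # ~[?* a ~[c ?*]]: no a whose suffix fails to start with c.
        if not s[i + 1:].startswith("c"):
            return False
    return True

print(satisfies_restriction("xbacx"), satisfies_restriction("ba"))
```

Strings with no a at all, including the empty string, satisfy the restriction vacuously.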

Page 14: Finite-State Methods in Natural Language Processing

Replacement

[Network diagram for a b -> b a, with arcs labeled a:b, b:a, a, b, and ?]

a b -> b a

“Replace ‘ab’ by ‘ba’.”

Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
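For this particular rule the relation is a function, because two instances of "ab" can never overlap. A minimal Python sketch that uses `str.replace` as a stand-in for the transducer; `swap_ab` is an illustrative name.

```python
def swap_ab(s):
    """Apply a b -> b a downward: every 'ab' in the input becomes 'ba'."""
    # str.replace scans the input once, left to right, and never
    # re-examines its own output, like the equivalent expression above.
    return s.replace("ab", "ba")

print(swap_ab("abab"), swap_ab("aab"))
```

Note that "aab" maps to "aba" even though the output again contains "ab": the rule is stated over the input string and is not applied iteratively.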

Page 15: Finite-State Methods in Natural Language Processing

Marking

[Network diagram for the marking rule, with arcs labeled 0:[, 0:], ?, a, e, i, o, and u]

a|e|i|o|u -> %[ ... %]

p o t a t o
p[o]t[a]t[o]
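The "..." in a marking rule stands for the matched string itself, much like a whole-match backreference in a substitution. A minimal Python sketch of the same effect; `mark_vowels` is an illustrative name.

```python
import re

def mark_vowels(s):
    """Wrap every vowel in square brackets, keeping the vowel itself."""
    # \g<0> plays the role of "..." in the xfst rule.
    return re.sub(r"[aeiou]", r"[\g<0>]", s)

print(mark_vowels("potato"))  # p[o]t[a]t[o]
```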

Page 16: Finite-State Methods in Natural Language Processing

Multiple Results

a b | b | b a | a b a -> x

(a) b (a) -> x

applied to “aba”

Four factorizations of the input string:

a [b] a → a x a
a [b a] → a x
[a b] a → x a
[a b a] → x
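The four outputs can be enumerated mechanically: split the input into replaced pieces that match the pattern and unreplaced gaps that contain no match at all. The following Python sketch assumes that reading of the unconditional -> operator; `all_replacements` is an illustrative name.

```python
import re

PATTERN = re.compile(r"a?ba?")  # (a) b (a)

def all_replacements(s, replacement="x"):
    """Return every output of the obligatory, undirected replacement."""
    results = set()

    def walk(pos, out):
        for i in range(pos, len(s) + 1):
            gap = s[pos:i]
            if PATTERN.search(gap):
                break  # an unreplaced gap may not contain a match
            if i == len(s):
                results.add(out + gap)
            for j in range(i + 1, len(s) + 1):
                if PATTERN.fullmatch(s[i:j]):
                    walk(j, out + gap + replacement)

    walk(0, "")
    return results

print(sorted(all_replacements("aba")))  # ['ax', 'axa', 'x', 'xa']
```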

Page 17: Finite-State Methods in Natural Language Processing

Directed Replace Operators

Directed replace operators guarantee a unique result by constraining the factorization of the input string by:

Direction of the match (rightward or leftward)
Length (longest or shortest)

Page 18: Finite-State Methods in Natural Language Processing

@-> Left-to-right, Longest-match Replacement

(a) b (a) @-> x

applied to “aba”

Of the four factorizations (a x a, a x, x a, x), only the leftmost, longest match is kept:

[a b a] → x
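A directed replacement can be emulated by scanning from the left and, at each position, trying the longest candidate first. A minimal Python sketch under that assumption; `replace_longest` is an illustrative name, not an xfst command.

```python
import re

PATTERN = re.compile(r"a?ba?")  # (a) b (a)

def replace_longest(s, replacement="x"):
    """Left-to-right, longest-match replacement in the spirit of @->."""
    out = []
    i = 0
    while i < len(s):
        # Try the longest candidate starting at i first.
        for j in range(len(s), i, -1):
            if PATTERN.fullmatch(s, i, j):
                out.append(replacement)
                i = j
                break
        else:
            out.append(s[i])  # no match starts here: copy one symbol
            i += 1
    return "".join(out)

print(replace_longest("aba"))  # x
```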

Page 19: Finite-State Methods in Natural Language Processing

Conditional Replacement

The relation that replaces A by B between L and R leaving everything else unchanged.

A -> B    Replacement
L _ R     Context

Sources of complexity:

Replacements and contexts may overlap

Alternative ways of interpreting “between left and right”:
A -> B || L _ R    both contexts on the input
A -> B // L _ R    left context on the output
A -> B \\ L _ R    right context on the output

Page 20: Finite-State Methods in Natural Language Processing

Vowel shortening after a long vowel

Slovak: left context on the input side
V %: -> V || V %: C* _

v o l + a: v + a: m e:
v o l + a: v + a m e
“we call often”

Gidabal: left context on the output side
V %: -> V // V %: C* _

g u n u: m + ba: + d a: ng + b e: +
g u n u: m + ba + d a: ng + b e +
“is certainly right on the stump”

Page 21: Finite-State Methods in Natural Language Processing

Shortening script

define V [ a | e | i | o | u | a ];
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];

define SlovakShortening %: -> 0 || V %: C* V _ ;

define GidabalShortening %: -> 0 // V %: C* V _ ;

push SlovakShortening
down vola:va:me:
vola:vame

push GidabalShortening
down gunu:mba:da:ngbe:
gunu:mbada:ngbe
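The difference between || and // comes down to which string the left context is checked against. The Python sketch below emulates both shortening rules with a single left-to-right scan; `shorten` and its `context_side` parameter are illustrative assumptions, not part of xfst.

```python
import re

V = "aeiou"
C = "bcdfghjklmnpqrstvxyz"
# Left context V %: C* V, anchored at the position of the length mark.
LEFT = re.compile(f"[{V}]:[{C}]*[{V}]$")

def shorten(word, context_side):
    """Delete ':' after a long vowel; context_side is 'input' (||) or 'output' (//)."""
    out = []
    for i, ch in enumerate(word):
        if ch == ":":
            left = word[:i] if context_side == "input" else "".join(out)
            if LEFT.search(left):
                continue  # drop the length mark
        out.append(ch)
    return "".join(out)

print(shorten("vola:va:me:", "input"))         # vola:vame
print(shorten("gunu:mba:da:ngbe:", "output"))  # gunu:mbada:ngbe
```

Applying the input-side version to the Gidabal word would shorten every long vowel after the first, which is why the output-side context is needed to get the alternating pattern.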

Page 22: Finite-State Methods in Natural Language Processing

Palatalization and Vowel Raising

Palatalization: tim --> cim

Vowel Raising: memi --> mimi

Interaction: temi --> cimi, tememi --> cimimi

Page 23: Finite-State Methods in Natural Language Processing

Vowel Raising & Palatalization

define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];

define Raising e -> i \\ _ C* i ;
define Palatalization t -> c || _ i;

regex Raising .o. Palatalization;

down memi
mimi
down tim
cim
down temi
cimi
down tememi
cimimi

t e m e m i
t i m i m i
c i m i m i
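Because the right context of Raising is on the output side, the rule behaves as if it worked through the word from right to left, so a raised vowel can trigger raising further to the left. A Python sketch of the cascade under that reading; the function names are illustrative.

```python
import re

C = "bcdfghjklmnpqrstvxyz"
RIGHT = re.compile(f"[{C}]*i")  # right context C* i

def raise_vowels(word):
    """e -> i \\\\ _ C* i : right context checked on the output side."""
    out = []  # output symbols, collected from right to left
    for ch in reversed(word):
        if ch == "e" and RIGHT.match("".join(reversed(out))):
            ch = "i"
        out.append(ch)
    return "".join(reversed(out))

def palatalize(word):
    """t -> c || _ i"""
    return re.sub(r"t(?=i)", "c", word)

def down(word):
    """Raising .o. Palatalization, applied downward."""
    return palatalize(raise_vowels(word))

for w in ["memi", "tim", "temi", "tememi"]:
    print(w, "-->", down(w))
```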

Page 24: Finite-State Methods in Natural Language Processing

Making a lexical transducer

[Diagram: the Lexicon Regular Expression (morphotactics) and the Rules Regular Expressions (alternations) go through the Compiler, giving a Lexicon FST and Rule FSTs; composition combines them into the Lexical Transducer (a single FST).]

Page 25: Finite-State Methods in Natural Language Processing

Finnish Gradation Script

define Stems [ {tukka} | {kakku} | {pappi} | {tippa} | {katto} | {juttu} |
               {tikka} | {huppu} | {rotta} | {nahka} | {lika} | {maku} |
               {rako} | {tuke} | {halko} | {jalka} | {virka} | {lanka} |
               {linko} | {puku} | {suku} | {tiuku} | {raaka} | {ripa} |
               {sopu} | {tapa} | {kampa} | {rumpu} | {sampe} | {sota} |
               {pata} | {kita} | {rinta} | {kanto} | {ranta} | {ilta} |
               {kulta} | {parta} | {kerta} ];

define Case [ "+Part":a | "+Gen":n ];

define Finnish [Stems Case];

Page 26: Finite-State Methods in Natural Language Processing

Auxiliary definitions

define V [a | e | i | o | u | y | ä | ö];

define C [b | c | d | f | g | h | j | k | l | m | n |
          p | q | r | s | t | v | w | x | z];

define Coda [ C [C | .#.] ];

define ClosedSyll [V Coda] ;

Page 27: Finite-State Methods in Natural Language Processing

Weak form of k

define WeakK k -> ' || V a _ a Coda, V u _ u Coda
.o.
k -> j || r _ e Coda
.o.
k -> v || u _ u Coda
.o.
k -> g || n _ V Coda
.o.
k -> 0 || \[s|h] _ V Coda ;   # kiskon 'rail',
                              # nahkan 'skin'

Page 28: Finite-State Methods in Natural Language Processing

Weak form of p

define WeakP p -> m || m _ V Coda
.o.
p -> v || \[s|p] _ V Coda   # piispan 'bishop'
.o.
p -> 0 || p _ V Coda;

Page 29: Finite-State Methods in Natural Language Processing

Weak form of t

define WeakT t -> n || n _ V Coda
.o.
t -> l || l _ V Coda
.o.
t -> r || r _ V Coda
.o.
t -> d || \[s|t] _ V Coda   # koston 'revenge'
.o.
t -> 0 || t _ V Coda ;
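The WeakT cascade can be approximated by a sequence of substitutions, one per composed rule, with the contexts written as lookbehind and lookahead assertions. This Python sketch assumes single-character symbols and already-inflected surface-like inputs such as "rintan"; `weak_t` is an illustrative name.

```python
import re

V = "aeiouyäö"
C = "bcdfghjklmnpqrstvwxz"
# Right context "V Coda": a vowel, a consonant, then a consonant or the word end.
CLOSED = f"(?=[{V}][{C}](?:[{C}]|$))"

# One (pattern, replacement) pair per rule, in composition order.
STEPS = [
    (f"(?<=n)t{CLOSED}", "n"),
    (f"(?<=l)t{CLOSED}", "l"),
    (f"(?<=r)t{CLOSED}", "r"),
    (f"(?<=[^st])t{CLOSED}", "d"),  # \[s|t] _ : any symbol except s or t
    (f"(?<=t)t{CLOSED}", ""),       # geminate tt loses one t
]

def weak_t(word):
    """Apply the five t-gradation rules in order."""
    for pattern, replacement in STEPS:
        word = re.sub(pattern, replacement, word)
    return word

for w in ["rintan", "iltan", "partan", "sotan", "katton", "sotaa"]:
    print(w, "-->", weak_t(w))
```

The partitive "sotaa" is left alone because its final syllable is open, so the V Coda context never matches.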

Page 30: Finite-State Methods in Natural Language Processing

Putting it all together

define Gradation WeakK .o. WeakP .o. WeakT;

regex Finnish .o. Gradation;

print lower-words

echo *** Size of Finnish .o. Gradation
print size
echo *** Size of Finnish
push Finnish
print size
echo *** Size of Gradation
push Gradation
print size

Page 31: Finite-State Methods in Natural Language Processing

Syllabification

define C [ b | c | d | f ...
define V [ a | e | i | o | u ];

[C* V+ C*] @-> ... "-" || _ [C V]

“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”

s t r u k t u r a l i s m i
s t r u k - t u - r a - l i s - m i
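The syllabification rule combines marking, longest match, and a right context, and for this pattern a greedy regular expression with a lookahead gives the same segmentation. In the Python sketch below the consonant set is an assumption, since the slide elides the full definition of C; `syllabify` is an illustrative name.

```python
import re

V = "aeiou"
C = "bcdfghjklmnpqrstvwxyz"  # assumed: the slide truncates the definition of C

# [C* V+ C*] before a C V sequence; the lookahead is the right context.
SYLLABLE = re.compile(f"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")

def syllabify(word):
    """Insert a hyphen after each matched C* V+ C* stretch."""
    return SYLLABLE.sub(r"\g<0>-", word)

print(syllabify("strukturalismi"))  # struk-tu-ra-lis-mi
```

The final syllable gets no hyphen because nothing follows it to satisfy the C V context.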