CSA3050: Natural Language Algorithms Finite State Devices
Mar 18, 2016
October 2005 CSA3180 NLP 2
Sources
• Blackburn & Striegnitz Ch. 2
Part I
Parsers and Transducers
Parsers vs. Recognisers
• Recognizers tell us whether a given input is accepted by some finite state automaton.
• Often we would like to have an explanation of why it was accepted.
• Parsers give us that kind of explanation.
• What form does it take?
Finite State Parser
• The output of a finite state parser is a sequence of nodes and arcs. If we gave the input [h,a,h,a,!] to a parser for our first laughing automaton, it should give us [1,h,2,a,3,h,2,a,3,!,4].
• The standard technique in Prolog for turning a recognizer into a parser is to add one or more extra arguments to keep track of the structure that was found.
Base Case
Recogniser
recognize1(Node,[ ]) :- final(Node).
Parser
parse1(Node,[ ],[Node]) :- final(Node).
Recursive Case
Recogniser
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).
Parser
parse1(Node1, String, [Node1,Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    parse1(Node2, NewString, Path).
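The base and recursive clauses can be tried out on the laughing automaton. The sketch below is not taken from the slides in full: the arcs are an assumption chosen to be consistent with the path [1,h,2,a,3,h,2,a,3,!,4] quoted earlier, and traverse1/3 is assumed to simply consume one symbol that matches the arc label.

```prolog
% Assumed laughing automaton: 1 -h-> 2 -a-> 3 -!-> 4, with 3 -h-> 2 as the loop.
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, !).
arc(3, 2, h).

% Assumed definition: consume the first symbol if it equals the arc label.
traverse1(Label, [Label|Symbols], Symbols).

% Base case: input used up and we are in a final node.
parse1(Node, [], [Node]) :- final(Node).
% Recursive case: record the current node and the label, then continue.
parse1(Node1, String, [Node1,Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    parse1(Node2, NewString, Path).

testparse1(Symbols, Parse) :-
    initial(Node),
    parse1(Node, Symbols, Parse).
```

With these assumed arcs, the query ?- testparse1([h,a,h,a,!], Parse). should return Parse = [1,h,2,a,3,h,2,a,3,!,4].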
Words as Labels
• So far we have only considered transitions with single-character labels.
• More complex labels are possible – e.g. words comprising several characters.
• We can construct an FSA recognizing English noun phrases that can be built from the words:
the, a, wizard, witch, broomstick, hermione, harry, ron, with, fast.
FSA for Noun Phrases
FSA for NPs in Prolog
initial(1).
final(3).
arc(1,2,a).
arc(1,2,the).
arc(2,2,brave).
arc(2,2,fast).
arc(2,3,witch).
arc(2,3,wizard).
arc(2,3,broomstick).
arc(2,3,rat).
arc(1,3,harry).
arc(1,3,ron).
arc(1,3,hermione).
arc(3,1,with).
Parsing a Noun Phrase
testparse1(Symbols, Parse) :-
    initial(Node),
    parse1(Node, Symbols, Parse).
?- testparse1([the,fast,wizard], Z).
Z = [1, the, 2, fast, 2, wizard, 3]
Rewriting Categories
• It is also possible to obtain a more abstract parse, e.g.
?- testparse2([the,fast,wizard], Z).
Z = [1, det, 2, adj, 2, cn, 3]
• What changes are required to obtain this behaviour?
1. Changes to the FSA
% FSA
initial(1).
final(3).
arc(1,2,det).
arc(2,2,adj).
arc(2,3,cn).
arc(1,3,pn).
arc(3,1,prep).
% Lexicon
lex(a,det).
lex(the,det).
lex(fast,adj).
lex(brave,adj).
lex(witch,cn).
lex(wizard,cn).
lex(broomstick,cn).
lex(rat,cn).
lex(harry,pn).
lex(hermione,pn).
lex(ron,pn).
lex(with,prep).
Changes to the Parser
Parse1
parse1(Node1, String, [Node1,Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    parse1(Node2, NewString, Path).
Parse2
parse2(Node1, String, [Node1,Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse2(Label, String, NewString),
    parse2(Node2, NewString, Path).
traverse2(Cat, [Word|S], S) :-
    lex(Word, Cat).
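Putting the category-labelled FSA, the lexicon and parse2 together gives a small program that can be loaded and queried. This is a sketch assembled from the clauses on the slides; the base case for parse2 and the wrapper testparse2/2 are assumed by analogy with parse1 and testparse1.

```prolog
% FSA over categories
initial(1).
final(3).
arc(1, 2, det).
arc(2, 2, adj).
arc(2, 3, cn).
arc(1, 3, pn).
arc(3, 1, prep).

% Lexicon: word, category
lex(a, det).          lex(the, det).
lex(fast, adj).       lex(brave, adj).
lex(witch, cn).       lex(wizard, cn).
lex(broomstick, cn).  lex(rat, cn).
lex(harry, pn).       lex(hermione, pn).
lex(ron, pn).         lex(with, prep).

% Assumed base case, mirroring parse1.
parse2(Node, [], [Node]) :- final(Node).
parse2(Node1, String, [Node1,Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse2(Label, String, NewString),
    parse2(Node2, NewString, Path).

% An arc labelled Cat consumes any word whose lexical category is Cat.
traverse2(Cat, [Word|S], S) :-
    lex(Word, Cat).

% Assumed wrapper, mirroring testparse1.
testparse2(Symbols, Parse) :-
    initial(Node),
    parse2(Node, Symbols, Parse).
```

For example, ?- testparse2([harry,with,the,fast,broomstick], Z). should give Z = [1,pn,3,prep,1,det,2,adj,2,cn,3]: the arc labels in the path are now categories rather than words.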
Handling Jumps
traverse3('#', String, String).
traverse3(Cat, [Word|Words], Words) :-
    lex(Word, Cat).
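To see a jump arc at work, the sketch below adds a hypothetical arc(1,2,'#') to the noun phrase FSA, which makes the determiner optional. That arc, the reduced lexicon, and the predicates parse3/3 and testparse3/2 are assumptions made purely for illustration; only traverse3/3 comes from the slide.

```prolog
initial(1).
final(3).
arc(1, 2, det).
arc(1, 2, '#').     % hypothetical jump arc: skip the determiner
arc(2, 2, adj).
arc(2, 3, cn).

lex(the, det).
lex(fast, adj).
lex(witch, cn).

% A jump consumes no input; any other label consumes one word of that category.
traverse3('#', String, String).
traverse3(Cat, [Word|Words], Words) :-
    lex(Word, Cat).

% Assumed parser, built like parse2 but calling traverse3.
parse3(Node, [], [Node]) :- final(Node).
parse3(Node1, String, [Node1,Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse3(Label, String, NewString),
    parse3(Node2, NewString, Path).

testparse3(Symbols, Parse) :-
    initial(Node),
    parse3(Node, Symbols, Parse).
```

Here ?- testparse3([fast,witch], Z). should give Z = [1,#,2,adj,2,cn,3], with the jump recorded in the path. Note that a cycle made only of jump arcs would let this parser loop forever, so the sketch avoids one.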
Finite State Transducers
• A finite state transducer is essentially a finite state automaton that works on two (or more) tapes.
• The most common way to think about transducers is as a kind of “translating machine” which works by reading from one tape and writing onto the other.
A Translator from a to b
• initial state: arrowhead
• final state: double circle
• a:b: read a from the first tape and write b to the second tape
[Figure: a single state 1, both initial and final, with a loop arc labelled a:b]
Prolog Representation
:- op(250,xfx,:).
initial(1).
final(1).
arc(1,1,a:b).
Modes of Operation
• generation mode: It writes a string of as on one tape and a string of bs on the other tape. Both strings have the same length.
• recognition mode: It accepts when the word on the first tape consists of exactly as many as as the word on the second tape consists of bs.
• translation mode (left to right): It reads as from the first tape and writes a b for every a that it reads onto the second tape.
• translation mode (right to left): It reads bs from the second tape and writes an a for every b that it reads onto the first tape.
Part II
Computational Morphology
Morphology
• Morphemes: the smallest units in a word that bear some meaning, such as rabbit and s.
• Morphology studies how morphemes combine to form words that are legal in some language.
• Two kinds of morphology
– Inflectional
– Derivational
Inflectional/Derivational Morphology
• Inflectional
– +s plural, +ed past
– category preserving
– productive: always applies (esp. new words, e.g. fax)
– systematic: same semantic effect
• Derivational
– +ment
– category changing: escape+ment
– not completely productive: detractment*
– not completely systematic: apartment
Example: English Noun Inflections
            Regular          Irregular
Singular    cat   church     mouse  ox
Plural      cats  churches   mice   oxen
Morphological Parsing
Input word: cats → Morphological Parser → Output analysis: cat N PL
• Output is a string of morphemes
• lexeme, other meaningful morphemes
• Reversibility?
Morphological Parsing
• The goal of morphological parsing is to find out what morphemes a given word is built from.
cats → cat N PL
mice → mouse N PL
foxes → fox N PL
Morphological Analysis with FSTs
• Basic idea is to write FSTs that map the surface form of a word to a description of the morphemes that constitute that word or vice versa.
• Example: wizard+s to wizard+PL or kiss+ed to kiss+PAST.
Plural Nouns in English
• Regular forms
– add an s, as in wizard+s
– add es, as in witch+es
• Handled with morpho-phonological rules that insert an e whenever the morpheme preceding the s ends in s, x, ch or another fricative.
• Irregular forms
– mouse/mice
– automaton/automata
• Handled on a case-by-case basis
• Require a transducer that translates wizard+s into wizard+PL, witch+es into witch+PL, mice into mouse+PL and automata into automaton+PL.
2 Steps
1. Split the word up into its possible components, using + to indicate possible morpheme boundaries.
cats → cat + s
foxes → fox + s
mice → mouse + s
2. Look up the categories of the stems and the meaning of the affixes, using a lexicon of stems and affixes.
cat + s → cat N PL
fox + s → fox N PL
mouse + s → mouse N PL
Step 1
• Transducer may or may not insert a ‘+’ (morpheme boundary) if the word ends in ‘s’.
• If the word ends in ses, xes, or zes, it may delete the ‘e’ when inserting the morpheme boundary, e.g. churches → church + s
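The two bullets above describe a nondeterministic relation between surface and intermediate forms. The sketch below states that relation directly over character lists rather than as a transducer; the predicate names split/2 and split_word/2 are hypothetical, and only the ses/xes/zes condition from the second bullet is implemented, so a word such as churches does not get its ‘e’ deleted here.

```prolog
:- use_module(library(lists)).

% split(+Surface, -Intermediate): both are lists of characters.
% Option 1: insert no boundary at all.
split(Word, Word).
% Option 2: the word ends in s, so insert + before that s.
split(Word, Inter) :-
    append(Stem, [s], Word),
    Stem \== [],
    append(Stem, ['+', s], Inter).
% Option 3: the word ends in ses, xes or zes, so drop the e as well.
split(Word, Inter) :-
    append(Stem, [e, s], Word),
    last(Stem, C),
    memberchk(C, [s, x, z]),
    append(Stem, ['+', s], Inter).

% Convenience wrapper working on atoms.
split_word(Surface, Intermediate) :-
    atom_chars(Surface, Cs),
    split(Cs, Is),
    atom_chars(Intermediate, Is).
```

On backtracking, ?- split_word(foxes, I). should enumerate foxes, 'foxe+s' and 'fox+s'. Step 1 deliberately overgenerates; it is left to Step 2 to discard candidates whose stem is not in the lexicon.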
Transducer for Step 1: Surface → Intermediate
[Figure: transducer diagram not recovered]
Prolog Representation
• The transducer specifications we have seen translate easily into Prolog format, except for the <other> transition.
arc(1,3,z:z).
arc(1,3,s:s).
arc(1,3,x:x).
arc(1,2,#:+).
arc(3,1,<other>).
arc(1,1,<other>).
One Way to Handle <other> arcs
arc(1,3,z:z).
arc(1,3,s:s).
arc(1,3,x:x).
arc(1,2,#:+).
arc(3,1,a:a).
arc(3,1,b:b).
arc(3,1,c:c).
: etc
: etc
arc(3,1,y:y).
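Listing every letter by hand is one option; another is to let a rule stand in for the whole family of <other> arcs. The sketch below is an alternative, not the method shown on the slide. The helper other/1 is hypothetical and assumes the alphabet is the lowercase letters a to z; the code also relies on : already being an infix operator, as it is in SWI-Prolog, and writes the + label as (+) so that it parses as a plain atom.

```prolog
:- use_module(library(lists)).

% other(?C): hypothetical helper, any lowercase letter except z, s and x.
other(C) :-
    atom_chars(abcdefghijklmnopqrstuvwxyz, Letters),
    member(C, Letters),
    \+ memberchk(C, [z, s, x]).

arc(1, 3, z:z).
arc(1, 3, s:s).
arc(1, 3, x:x).
arc(1, 2, '#':(+)).
% One rule per <other> transition instead of 23 separate facts.
arc(3, 1, C:C) :- other(C).
arc(1, 1, C:C) :- other(C).
```

A query such as ?- arc(3,1,y:y). should succeed, while ?- arc(3,1,s:s). should fail, which matches the intent of the enumerated version.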
Transducer for Step 2: Intermediate → Morphemes
Possible inputs to the transducer are:
• Regular noun stem: cat
• Regular noun stem + s: cat+s
• Singular irregular noun stem: mouse
• Plural irregular noun stem: mice
2. Intermediate → Morphemes Transducer
Handling Stems
cat/cat
mice/mouse
Completed Stage 2
Joining Stages 1 and 2
• If the two transducers run in a cascade (i.e. we let the second transducer run on the output of the first one), we can do a morphological parse of (some) English noun phrases.
• We can also change the direction of translation (in translation mode).
• This transducer can also be used for generating a surface form from an underlying form.
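A cascade can be sketched as two relations chained by a shared intermediate value. The code below is a simplified stand-in for the two transducers, written over atoms rather than tapes. All predicate names (stage1/2, stage2/2, analyse/2, stem/2, irregular_plural/2) and the toy lexicon are assumptions, the n/sg/pl tags are lowercase stand-ins for N, SG and PL, and only the analysis direction (surface to morphemes) is supported.

```prolog
:- use_module(library(lists)).

% Stage 1 (surface -> intermediate): optionally split off a plural s.
stage1(Word, Word).
stage1(Word, Stem+s) :-
    atom_concat(Stem, s, Word),
    Stem \== ''.
stage1(Word, Stem+s) :-
    atom_concat(Stem, es, Word),
    sub_atom(Stem, _, 1, 0, Last),      % Last is the final character of Stem
    memberchk(Last, [s, x, z]).

% Toy lexicon (assumed).
stem(cat, reg).
stem(fox, reg).
stem(mouse, irreg).
irregular_plural(mice, mouse).

% Stage 2 (intermediate -> morphemes): look stems and affixes up.
stage2(Stem+s, [Stem, n, pl]) :- stem(Stem, reg).
stage2(Stem, [Stem, n, sg]) :- atom(Stem), stem(Stem, _).
stage2(Plural, [Stem, n, pl]) :- atom(Plural), irregular_plural(Plural, Stem).

% The cascade: stage 2 runs on the output of stage 1.
analyse(Word, Analysis) :-
    stage1(Word, Intermediate),
    stage2(Intermediate, Analysis).
```

With this lexicon, ?- analyse(foxes, A). should give only A = [fox,n,pl]: stage 1 also proposes foxes and foxe+s, but stage 2 rejects them because neither stem is known.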
Combining Rules
• Consider the word “berries”.
• Two rules are involved
– berry + s
– y → ie under certain circumstances.
• Combinations of such rules can be handled in two ways
– Cascade, i.e. sequentially
– Parallel
• Algorithms exist for combining transducers together in series or in parallel.
• Such algorithms involve computations over regular relations.
3 Related Frameworks
REGULAR LANGUAGES
REGULAR EXPRESSIONS
FSA
Concatenation over FS Automata
[Figure: concatenation of two automata; arc labels a, b, c, d]
REGULAR RELATIONS
REGULAR RELATIONS
AUGMENTED REGULAR EXPRESSIONS
FINITE STATE TRANSDUCERS
Putting it all together
execution of FSTi takes place in parallel
Kaplan and Kay: FSTi are aligned but separate
The Xerox View: FSTi intersected together
Summary
• Morphological processing can be handled by finite state machinery
• Finite State Transducers are formally very similar to Finite State Automata.
• They are formally equivalent to regular relations, i.e. sets of pairings of sentences of regular languages.
Exercises
• Change the representation of automata to allow them to be given names.
• Make the corresponding changes to the transducer.
• Write a predicate which allows two named automata to be composed, i.e. the output of one becomes the input of the other.
Simple Transducer in Prolog
transduce1(Node, [ ], [ ]) :- final(Node).
transduce1(Node1, Tape1, Tape2) :-
    arc(Node1, Node2, Label),
    traverse1(Label, Tape1, NewTape1, Tape2, NewTape2),
    transduce1(Node2, NewTape1, NewTape2).
Traverse for FST
traverse1(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).
testtrans1(Tape1, Tape2) :-
    initial(Node),
    transduce1(Node, Tape1, Tape2).
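Loaded together with the a-to-b transducer, these clauses show the modes of operation directly, depending on which tape is instantiated in the query. The sketch below combines only clauses that appear on the slides; it leaves out the op/3 directive on the assumption that : is already an infix operator, as it is in SWI-Prolog.

```prolog
% The translator from a to b.
initial(1).
final(1).
arc(1, 1, a:b).

% Stop when both tapes are used up in a final node.
transduce1(Node, [], []) :- final(Node).
transduce1(Node1, Tape1, Tape2) :-
    arc(Node1, Node2, Label),
    traverse1(Label, Tape1, NewTape1, Tape2, NewTape2),
    transduce1(Node2, NewTape1, NewTape2).

% Consume L1 from the first tape and L2 from the second.
traverse1(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).

testtrans1(Tape1, Tape2) :-
    initial(Node),
    transduce1(Node, Tape1, Tape2).
```

Translation left to right: ?- testtrans1([a,a,a], T). should give T = [b,b,b]. Translation right to left: ?- testtrans1(X, [b,b]). should give X = [a,a]. Recognition: ?- testtrans1([a], [b,b]). should fail. Generation: ?- testtrans1(X, Y). enumerates pairs of equal length, starting with X = [], Y = [].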
Transducers and Jumps
• Transducers can make jumps going from one state to another without doing anything on either one or on both of the tapes.
• So, transitions of the form a:# or #:a or #:# are possible.
Handling Jumps: 4 cases
• Jump on both tapes.
• Jump on the first but not on the second tape.
• Jump on the second but not on the first tape.
• Jump on neither tape (this is what traverse1 does).
4 Corresponding Clauses
traverse2('#':'#', Tape1, Tape1, Tape2, Tape2).
traverse2('#':L2, Tape1, Tape1, [L2|RestTape2], RestTape2).
traverse2(L1:'#', [L1|RestTape1], RestTape1, Tape2, Tape2).
traverse2(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).
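A small transducer with one jump shows the four clauses in use. The automaton below is hypothetical: it copies each a and then, on a '#':x arc, writes an x to the second tape without reading anything from the first. The predicates transduce2/3 and testtrans2/2 are assumed by analogy with transduce1 and testtrans1.

```prolog
initial(1).
final(2).
arc(1, 1, a:a).
arc(1, 2, '#':x).   % hypothetical: jump on tape 1, write x on tape 2

traverse2('#':'#', Tape1, Tape1, Tape2, Tape2).
traverse2('#':L2, Tape1, Tape1, [L2|RestTape2], RestTape2).
traverse2(L1:'#', [L1|RestTape1], RestTape1, Tape2, Tape2).
traverse2(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).

% Same shape as transduce1, but calling traverse2.
transduce2(Node, [], []) :- final(Node).
transduce2(Node1, Tape1, Tape2) :-
    arc(Node1, Node2, Label),
    traverse2(Label, Tape1, NewTape1, Tape2, NewTape2),
    transduce2(Node2, NewTape1, NewTape2).

testtrans2(Tape1, Tape2) :-
    initial(Node),
    transduce2(Node, Tape1, Tape2).
```

With the first tape given, ?- testtrans2([a,a], T). should yield T = [a,a,x]. One caveat of this encoding: the last clause also matches a label containing '#', so when a tape is left unbound, backtracking can produce answers with a literal '#' on it (for instance ?- testtrans2(X, [a,x]). gives X = [a] first and then X = [a,#]).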
FST in Prolog
lex(wizard:wizard,'STEM-REG1').
lex(witch:witch,'STEM-REG2').
lex(automaton:automaton,'IRREG-SG').
lex(automata:'automaton-PL','IRREG-PL').
lex(mouse:mouse,'IRREG-SG').
lex(mice:'mouse-PL','IRREG-PL').
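The slides do not show how these lexicon entries are consumed, so the following is only one plausible reading: arcs are labelled with a category, and traversing such an arc reads the first half of a matching lex pair from tape 1 and writes the second half to tape 2. The arcs and the predicates traverse_lex/5, transduce_lex/3 and testtrans_lex/2 are hypothetical.

```prolog
initial(1).
final(2).
% Hypothetical arcs: one step from 1 to 2 for each stem category.
arc(1, 2, 'STEM-REG1').
arc(1, 2, 'STEM-REG2').
arc(1, 2, 'IRREG-SG').
arc(1, 2, 'IRREG-PL').

lex(wizard:wizard, 'STEM-REG1').
lex(witch:witch, 'STEM-REG2').
lex(automaton:automaton, 'IRREG-SG').
lex(automata:'automaton-PL', 'IRREG-PL').
lex(mouse:mouse, 'IRREG-SG').
lex(mice:'mouse-PL', 'IRREG-PL').

% An arc labelled Cat relates W1 on tape 1 to W2 on tape 2
% whenever the lexicon lists W1:W2 under Cat.
traverse_lex(Cat, [W1|Rest1], Rest1, [W2|Rest2], Rest2) :-
    lex(W1:W2, Cat).

transduce_lex(Node, [], []) :- final(Node).
transduce_lex(Node1, Tape1, Tape2) :-
    arc(Node1, Node2, Cat),
    traverse_lex(Cat, Tape1, NewTape1, Tape2, NewTape2),
    transduce_lex(Node2, NewTape1, NewTape2).

testtrans_lex(Tape1, Tape2) :-
    initial(Node),
    transduce_lex(Node, Tape1, Tape2).
```

Under these assumptions, ?- testtrans_lex([mice], T). should give T = ['mouse-PL'], and running it the other way, ?- testtrans_lex(X, ['automaton-PL'])., should give X = [automata].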