Natural Language Processing >> Morphology <<
Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Science, Darmstadt, Germany
https://www.fbi.h-da.de/organisation/personen/harriehausen-muehlbauer-bettina.html
[email protected]
winter / fall 2015/2016
Morphemes
morpheme = smallest unit in a language that carries meaning
• lexemes (man, house, dog, ...)
• inflectional affixes (dog-s, want-ed, ...)
• other affixes (pre-/in-/suff-): unwanted, atypical, antipathetic, ...
esp. in technical language (-itis = "inflammation", gastro- = "stomach" ... gastroenteritis)
definition
- Natural Language Systems -
Harriehausen
5
morphemes
free morphemes: stand alone and carry lexical and morphological meaning (e.g. house = singular, neuter, nominative; case/number/gender)
bound morphemes: form a legal wordform only in combination with another morpheme; they cannot stand alone. Various combinations exist:
bound + free: e.g. un-happy
all bound: e.g. gastro-enter-itis (Greek; German: Magen-Darm-Entzündung, i.e. inflammation of the stomach and intestine)
morphemes
inflectional morphemes: create wordforms of a lexeme and carry morphological (grammatical) meaning (e.g. dog-s, laugh-ed, go-ing)
derivational morphemes: create new words (lexemes) and carry lexical meaning, often changing the word class (happily, intellectually, instruction, instructor, insulator, the pounding, limpness, blindness, ...)
Question: which string (~morpheme) do we include in our dictionary ?
• full form dictionary vs.
• base form dictionary (lemmas)
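The two dictionary designs can be contrasted in a short sketch; the mini-lexicons and the suffix list below are hypothetical illustrations, not real resources:

```python
# Full-form dictionary: every inflected wordform is stored explicitly.
FULL_FORM = {"dog", "dogs", "laugh", "laughs", "laughed"}

# Base-form dictionary: only lemmas are stored; lookup strips endings.
BASE_FORM = {"dog", "laugh"}
SUFFIXES = ["s", "ed", "ing"]  # a few inflectional endings, for illustration

def in_full_form(word):
    return word in FULL_FORM

def in_base_form(word):
    """Accept a word if it is a lemma, or a lemma plus a known suffix."""
    if word in BASE_FORM:
        return True
    return any(word.endswith(suf) and word[:-len(suf)] in BASE_FORM
               for suf in SUFFIXES)
```

The trade-off: the full-form dictionary is larger but lookup is a plain membership test; the base-form dictionary is compact but needs morphological analysis at lookup time.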
1 morphemes
2 compounds / concatenation / decompounding
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)
content
Definition: a compound is a lexeme that consists of more than one stem. Compounding or composition is the word formation process that creates compound lexemes (= compounds). There is no clear upper limit on the number of roots allowed in English compounds; it usually doesn't exceed 3 morphemes, but this is clearly a stylistic issue. Some compounds are written as one word: blackbird. Some are written with hyphens: mother-in-law. Most are written as separate words: smoke screen. Typically it is not spelling but stress and word-internal sound rules that distinguish compounds from non-compounds: compare white house with White House.
compounds / concatenation
Question: What do we put into our dictionary ?
Compounding follows rules, e.g. in chemical nomenclature (http://www.chem.qmul.ac.uk/iupac/).
Substitutive nomenclature
This naming method generally follows established IUPAC organic nomenclature. E.g.: hydrides of the main group elements (groups 13–17) are given -ane base names, e.g. borane (BH3), oxidane (H2O), phosphane (PH3). The compound PCl3 would be named substitutively as trichlorophosphane.
Additive nomenclature
This naming method has been developed principally for coordination compounds. An example of its application is: [CoCl(NH3)5]Cl2 = pentaamminechloridocobalt(III) chloride
Example of a chemical compound (components of phane parent names): bicyclo[8.6.0]hexadecaphane
• The prefix "bicyclo" indicates that there are two rings (bi-cyclo).
• The bridge descriptor describes the ring structure in terms of a sixteen-membered main ring [8 + 6 + 2 (the bridgehead nodes)] with a bridge consisting of a bond, i.e., zero nodes, which divides the main ring into an eight-membered and a ten-membered ring.
• The numerical term "hexadeca" denotes the presence of sixteen skeletal nodes, and
• the term "phane" indicates that at least one node represents a multiatomic (cyclic) structural unit.
gastr- ancient Greek γαστήρ (gastēr), γαστρ- = stomach, belly
-o- linking element joining the two body parts (linguistically)
enter- ancient Greek ἔντερον (énteron) = intestine
-itis = inflammation
supra- = above
- ologist = person studying a certain body part
formation of compounds: synthesis and agglutination
Compound formation rules vary widely across language types.
Examples of formation processes (usually linked to the language type):
• synthesis (typically with synthetic languages, i.e. languages with a high morpheme-per-word ratio), e.g.
German: Kapitänspatent = Kapitän (sea captain) + Patent (license), joined by an -s- (originally a genitive case suffix); "patent of a sea captain"
Latin: paterfamilias = pater (father) + familias (genitive of the lexeme familia (family)); "father of a family"
compounds / concatenation
formation of compounds:
It can get more difficult (German -> English): Aufsichtsratsmitgliederversammlung =>
Auf = on
sicht + s = view + linking "Fuge-s"
rat + s = council + genitive -s
mit = with
glied + er = link + plural
ver = "completion"
samml (stem = sammeln) = collect
ung = "noun"
on-view-council-with-link-collect?? = "meeting of members of the supervisory board"
compounds / concatenation
Notice: "with" and "link" form a derivation that is the German word for "member"; "completion", "collect" and "noun" form a derivation that means "meeting"
formation of compounds: synthesis and agglutination
• agglutination (usually with agglutinative languages, which tend to create very long words with derivational morphemes), e.g.
German: Farbfernsehgerät = color television set; Funkfernbedienung = radio remote control; Donaudampfschifffahrtsgesellschaftskapitänsmütze = Danube steamboat shipping company captain's hat
Finnish: hätä-uloskäytävä = emergency exit; Lentokone-suihku-turbiini-moottori-apu-mekaanikko-aliupseeri-oppilas = airplane jet turbine engine auxiliary mechanic non-commissioned officer student
Swedish: rörelseuppskattningssökintervallsinställningar = motion estimation search range settings
compounds / concatenation
Samples for long compounds in German
• die Armbrust
• die Mehrzweckhalle
• das Mehrzweckkirschentkerngerät
• die Gemeindegrundsteuerveranlagung
• die Nummernschildbedruckungsmaschine
• der Mehrkornroggenvollkornbrotmehlzulieferer
• der Schifffahrtskapitänsmützenmaterialhersteller
• die Verkehrsinfrastrukturfinanzierungsgesellschaft
• die Feuerwehrrettungshubschraubernotlandeplatzaufseherin
• der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker
• das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
• die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
Finnish:
sanakirja 'dictionary': sana 'word' + kirja 'book'
tietokone 'computer': tieto 'knowledge, data' + kone 'machine'
keskiviikko 'Wednesday': keski 'middle' + viikko 'week'
maailma 'world': maa 'land' + ilma 'air'
rautatieasema 'railway station': rauta 'iron' + tie 'road' + asema 'station'
suihkuturbiiniapumekaanikkoaliupseerioppilas: 'jet engine assistant mechanic NCO student'
atomiydinenergiareaktorigeneraattorilauhduttajaturbiiniratasvaihde: some part of a nuclear plant
Korean:
안팎 anpak 'inside and outside': 안 an 'inside' + 밖 bak 'outside'
Spanish:
ciempiés 'centipede': cien 'hundred' + pies 'feet'
ferrocarril 'railway': ferro 'iron' + carril 'lane'
paraguas 'umbrella': para 'to stop, stops' + aguas '(the) water'
Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)
In each of these cases, the syntactic class of the compound is the same
as the syntactic class (= part-of-speech) of the final element of the compound.
Rule:
• Germanic languages (e.g. English, German) are left-branching (the modifiers come before the head): schoolteacher = teacher of a school, bluebird = bird of blue color
• Romance languages (e.g. French, Spanish) are usually right-branching, i.e. compounds are often formed by left-hand heads with prepositional components inserted before the modifier: chemin-de-fer = railway (lit. 'road of iron'), moulin à vent = windmill (lit. 'mill (that works) by means of wind')
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
formation of compounds and their structure:
B. Adjectives: hardworking
The internal structure may be complex: hard + work + ing -> hardwork + ing OR hard + working
-ing is typically the aspect-suffix that gets added to the verb (root):
e.g. play-ing, laugh-ing, ask-ing,…
As a rule, we can form other wordforms (inflections, due to different tenses) from those roots, following the same inflectional pattern, i.e. verbal root + tense-marking-suffix, or insertion of modal verb:
Simple Present: He play-s. He laugh-s. He ask-s.
Simple Past: They play-ed. They laugh-ed. They ask-ed.
Simple Future: I will play. I will laugh. I will ask.
compounds / concatenation
formation of compounds and their structure:
B. Adjectives: hardworking
The internal structure may be complex: hard + work + ing -> hardwork + ing OR hard + working
* He hardworks. * They hardworked. * I will hardwork.
-> hardwork + ing
i.e. hardwork is not a verb by itself (see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
Semantic classification: it is common to classify compounds into 4 types:
• endocentric (description: A+B denotes a special kind of B)
• exocentric
• copulative
• appositional
Endocentric compounds consist of a head and modifiers, which restrict the head's meaning. Endocentric compounds tend to be of the same part of speech (word class) as their head. Examples:
- doghouse, where house is the head and dog is the modifier, i.e. a house intended for a dog
- darkroom, where dark modifies room, i.e. a type of room (usually used in photography)
semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric
• exocentric (description: (one) whose B is A)
• copulative
• appositional
Exocentric compounds have an unexpressed semantic head (e.g. a person, a plant, an animal, ...), and their meaning is often not transparent from their constituent parts. Examples:
• white-collar is neither a kind of collar nor a white thing; the collar's colour is a metaphor for socioeconomic status
• red-neck only indirectly refers to a neck, but refers to a working person (e.g. a farmer)
• skinhead may refer to a bald head but also refers to a certain group of people
• paleface: native American Indians call the White Man a paleface
semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric
• exocentric
• copulative (description: A+B denotes 'the sum' of what A and B denote)
• appositional
Copulative compounds are compounds which have two semantic heads. Examples:
- bittersweet: having both tastes
- sleepwalk: sleeping while walking OR walking in your sleep
semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric
• exocentric
• copulative
• appositional (description: A and B provide different descriptions for the same referent; the meaning can be characterized as 'AS WELL AS')
Appositional compounds refer to lexemes that have two (contrary) attributes which classify the compound. Examples:
- actor-director: an actor who also plays the role of the director
- maidservant: a maid who is also a servant OR a servant who is also a maid
- player-coach: someone who is a player as well as a coach
semantics of compounds (ambiguities)
When, in Germanic languages (e.g. German, English), compound words are formed by prepending a descriptive word to the main word, the semantic relation between the components may be ambiguous. This is a problem for decompounding or translation. -> the orange bowl problem
Can you please bring me the orange bowl?
• bowl filled with oranges
• bowl having the shape of an orange
• bowl with an orange pattern
• bowl of orange colour
• bowl that was formerly / usually filled with oranges
compounding - decompounding
German:
Staubecken: Stau-becken = a reservoir; Staub-ecken = dusty corners
Wachstube: Wach-stube = die Stube einer Wache (the room of a guard); Wachs-tube = eine Tube, in der Wachs aufbewahrt wird (a tube filled with wax)
Gelbrand: Gelb-rand = gelber Rand (a yellow border); Gel-brand = Brand eines Gels (burning of a gel)
Tonerkennung: Toner-kennung = die Kennung eines Toners (the identifier of a toner); Ton-erkennung = das Erkennen von Tönen (the identification of tones)
Lachen: Lache-n = mehrere Pfützen (multiple puddles of water); Lachen = eine menschliche Lautäußerung wie Gelächter (laughter)
Druckerzeugnis: Druck-erzeugnis = Gedrucktes (printed matter); Drucker-zeugnis = Zeugnis für einen Drucker (certificate for a printer)
beinhalten: bein-halten vs. be-inhalten (imagine: Beinhalten ...)
Abteilungen: Abtei-lungen vs. Abteil-ungen
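Finding such alternative splits can be mechanized; a sketch of dictionary-based decompounding, with a hypothetical mini-lexicon covering just the Staubecken and Wachstube examples:

```python
# Sketch: naive dictionary-based decompounding.
# The mini-lexicon below is a hypothetical illustration, not a real resource.
LEXICON = {"stau", "staub", "becken", "ecken", "wach", "wachs", "stube", "tube"}

def splits(word, lexicon=LEXICON):
    """Return every segmentation of `word` into lexicon entries."""
    word = word.lower()
    if not word:                       # fully consumed: one valid (empty) split
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:            # try every known prefix ...
            for rest in splits(word[i:], lexicon):
                results.append([head] + rest)   # ... and recurse on the rest
    return results
```

A real system would then rank the competing splits, e.g. with corpus frequencies or context, since the dictionary alone cannot decide between Stau-becken and Staub-ecken.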
compounding - decompounding
context or stress (in spoken language) is needed for disambiguation
Stress makes a difference: a green 'house vs. a 'greenhouse; the white 'house vs. the 'White House
(problems with) concatenation
Summary
Structural as well as semantic challenges with compounds:
• ambiguities in meaning (orange bowl)
• ambiguities in hyphenation points (Staubecken)
• not all morphemes can form a compound (e.g. *sheepchops)
compounds -> MWE -> idiomatic phrases
In addition to the compounds that fit one of the four descriptions (endocentric, exocentric, copulative, appositional), i.e. that keep the original lexical meaning of at least one of their components, we need to consider "multiple morpheme strings / multi-word expressions (MWE)" (fixed phrases) that have "lost" the original lexical meaning of their components. Such MWEs are called idiomatic phrases or idioms.
increasing the formal complexity = increasing the idiomatic rigidity:
• compounding: combination of lexical meanings: carseat, houseboat, cellar door, ...
• compounding: not a combination of the lexical meanings: starfish, paperback, ladybug, ...
• depending on the context: bite the dust, lose face, kick the bucket, ...
Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system.
EMOJIs vs. hieroglyphs
back to: spell aid – spell checking
spell aid – spell checking
spell checking algorithms are based on the following types of mistakes (statistics!):
• phonetic similarities (ph – f: telephone – telefone)
• deletion of doubled letters (mouuse – mouse)
• wrong order / transposition (from – form; mouse – muose)
• substitution of neighbouring letters on the keyboard (miuse – mouse)
• insertion of missing letters (vowels in between consonants ...) (telephne)
• typos occur towards the end of a word (assumption: first letter is correct)
• segmentation / decomposition into substrings (horses+hoe – horse+shoe)
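The listed error types correspond to single edit operations, so a checker can generate all strings one edit away and intersect them with the dictionary. A sketch in the spirit of this approach, with a hypothetical mini-dictionary:

```python
# Sketch: candidate generation mirroring the error types above
# (deletion, transposition, substitution, insertion).
import string

# Hypothetical mini-dictionary, for illustration only.
DICTIONARY = {"mouse", "form", "from", "telephone"}

def edits1(word):
    """All strings exactly one edit operation away from `word`."""
    letters = string.ascii_lowercase
    pieces = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = {a + b[1:] for a, b in pieces if b}                        # mouuse -> mouse
    transposes = {a + b[1] + b[0] + b[2:] for a, b in pieces if len(b) > 1} # muose  -> mouse
    replaces   = {a + c + b[1:] for a, b in pieces if b for c in letters}   # miuse  -> mouse
    inserts    = {a + c + b for a, b in pieces for c in letters}            # telephne -> telephone
    return deletes | transposes | replaces | inserts

def candidates(word):
    """Dictionary words reachable with one edit, as correction candidates."""
    return sorted(edits1(word) & DICTIONARY)
```

A production system would additionally rank the candidates, e.g. by the error statistics mentioned above.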
BTW: gap-filling & interpolation also work on the sentence level.
spell aid – spell checking
How does spell checking work (w.r.t. grammar checking) ?
Various degrees of „intelligence“:
System A : no match found in the dictionary -> mark entry as incorrect
System B: no match found in the dictionary. Initiate a rudimentary parse (left-right-search). Try to identify the wordclass, i.e. limit possibilities and continue a sentential analysis. e.g. the ...man (statistics: DET + ADJ + NOUN); n-gram
System C: no match found in the dictionary. Initiate a segmentation of the word to identify the wordclass, e.g. look for typical endings (-ly = adverb / capital letters = proper noun, ...). This way new wordcreations can be identified (e.g. any word ending in -ness = noun); n-gram
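System C's ending-based guessing can be sketched directly; the suffix table and the tag names below are illustrative assumptions, not a real tagset:

```python
# Sketch of "System C": guess the word class of an out-of-dictionary word
# from typical endings. Suffix rules and tags are hypothetical examples.
SUFFIX_RULES = [
    ("ness", "NOUN"),    # e.g. limpness, blindness -> any -ness word is a noun
    ("ly",   "ADVERB"),  # e.g. happily
    ("ing",  "VERB"),    # e.g. pounding (also gerund/noun in context)
]

def guess_word_class(word):
    """Return a word-class guess for an unknown word."""
    if word[0].isupper():            # capital letter -> proper noun heuristic
        return "PROPER NOUN"
    for suffix, word_class in SUFFIX_RULES:
        if word.endswith(suffix):
            return word_class
    return "UNKNOWN"
```

This way even new word creations (e.g. any invented word ending in -ness) get a usable class for the sentential analysis.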
n-grams / language models (statistical language processing)
An n-gram is a substring of n items from a given string.
A complete string of words: w1 … wn (often abbreviated as w1^n)
In NLP, the items in question can be phonemes, syllables, letters, words or any
substring. This depends on the application.
An n-gram of size 1 is a "unigram";
size 2 is a "bigram" ;
size 3 is a "trigram"; etc. …
size n is an "n-gram ".
n-grams / language models (statistical language processing)
Example: "he reads a book"
For a sequence of words, the trigrams would be: "# he reads", "he reads a", "reads a book", and "a book #".
For sequences of characters, the trigrams that can be generated from "hello world" are "hel", "ell", "llo", "lo ", "o w", " wo", "wor", etc.
In practice, we often
• collapse whitespace to a single space
• remove punctuation
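The trigram examples above can be reproduced with a few lines; "#" is used as the boundary marker, as on the slide:

```python
# Sketch: extracting word and character n-grams.
def ngrams(items, n):
    """All contiguous subsequences of length n."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# word trigrams, with "#" marking the sentence boundaries
words = ["#"] + "he reads a book".split() + ["#"]
word_trigrams = ngrams(words, 3)

# character trigrams from "hello world"
char_trigrams = ["".join(t) for t in ngrams(list("hello world"), 3)]
```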
n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus: (http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-n-gram-are-belong-to-you.html)
File sizes: approx. 24 GB compressed (gzip'ed) text files
fourgrams:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
n-grams / language models (statistical language processing)
In an n-gram analysis, we compute the probability of the occurrence of x (e.g. a letter or word) AFTER a certain sequence, i.e. the conditional probability of x is always given on the basis of the PREVIOUS word/character.
Example: for "ex_", in English the probabilities might be: a = 0.4, b = 0.00001, c = ..., where all probabilities sum to 1.
n-grams / language models (statistical language processing)
The theory behind it: A statistical language model assigns a probability to a sequence of n words P(w1, …, wn) by means of a probability distribution. Each word (or character) depends on the last n-1 words. More concisely, an n-gram model predicts xi based on xi-(n-1), …, xi-1. In probability terms, this is P(xi | xi-(n-1), …, xi-1). This is also called an (n-1)-order Markov model.
In speech recognition, sequences of phonemes are often modeled using an n-gram distribution.
n-grams / language models (statistical language processing)
In an n-gram model, the probability P(w1, …, wm) of observing the sentence w1, …, wm can be approximated as
P(w1, …, wm) ≈ ∏i P(wi | wi-(n-1), …, wi-1)
It is assumed that the probability of observing the i-th word wi in the context history of the preceding i-1 words can be approximated by the probability of observing it in the shortened context history of the preceding n-1 words.
In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as
P(I, saw, the, red, house) ≈ P(I | <s>) · P(saw | I) · P(the | saw) · P(red | the) · P(house | red)
whereas in a trigram (n=3) language model, the approximation is
P(I, saw, the, red, house) ≈ P(I | <s>, <s>) · P(saw | <s>, I) · P(the | I, saw) · P(red | saw, the) · P(house | the, red)
(<s> marks the sentence start.)
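A bigram model of this kind can be estimated from counts with a few lines; the two-sentence corpus below is a toy illustration, with `<s>`/`</s>` as boundary markers:

```python
# Sketch: maximum-likelihood bigram probabilities from a toy corpus.
from collections import Counter

corpus = [["<s>", "i", "saw", "the", "red", "house", "</s>"],
          ["<s>", "i", "saw", "the", "dog", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(words):
    """Bigram approximation of the sentence probability."""
    words = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p
```

Real models additionally smooth the counts, since any unseen bigram would otherwise give the whole sentence probability zero.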
• In order to figure out whether something is an incorrect word, the machine has to match the string (= a sequence of symbols: any sequence of alphanumeric characters, spaces, tabs, punctuation) to an entry in the dictionary
• other matches: e.g. information retrieval in web search engines (Google, AltaVista, ...)
• the standard notation for characterizing text sequences = regular expressions
• regular expressions are supported by many languages and tools, e.g. Perl, grep (Global Regular Expression Print)
• formally, regular expressions are algebraic notations for characterizing a set of strings
• regular expression search requires a pattern that we want to search for (and a corpus of text to search through) (text mining !)
Example: Search for the pattern “linguistics”.
• You also want to find documents with "Linguistics" and "LINGUISTICS". (remember: the computer does EXACTLY what you tell it to ...)
• The regular expression /linguistics/ matches any string in any document containing exactly the substring “linguistics”
• Regular expressions are case sensitive
• samples (Jurafsky, p. 23)
regular expression    example pattern matched
/woodchucks/          "interesting links to woodchucks and lemurs"
/a/                   "Mary Ann stopped by Mona's"
/Claire says,/        ""Dagmar, my gift please," Claire says,"
/song/                "all our pretty songs"
/!/                   ""You've left the burglar behind again!" said Nori"
regular expressions (Jurafsky, section 2.1)
linguistics - Linguistics - LINGUISTICS
to search for alternative characters "l" and/or "L" we use square brackets: [lL]

regular expression    match                          sample pattern
/[lL]inguistics/      Linguistics or linguistics     "computational linguistics is fun"
/[1234567890]/        any digit                      "this is Linguistics 5981"
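The bracket notation carries over directly to, e.g., Python's `re` module (which drops the surrounding slashes):

```python
# Sketch: character classes with Python's re module.
import re

text = "this is Linguistics 5981"

# [lL] matches either "l" or "L"
m = re.search(r"[lL]inguistics", text)

# [0-9] matches any single digit
digits = re.findall(r"[0-9]", text)
```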
regular expressions (Jurafsky, section 2.1)
to search for a character in a range we use the dash: [-]

regular expression    match                  sample pattern
/[A-Z]/               any uppercase letter   "this is Linguistics 5981"
/[0-9]/               any single digit       "this is Linguistics 5981"
/[1234567890]/        any single digit       "this is Linguistics 5981"
regular expressions (Jurafsky, section 2.1)
to search for negation, i.e. a character that we do NOT want to find, we use the caret: [^]

regular expression    match                     sample pattern
/[^A-Z]/              not an uppercase letter   "this is Linguistics 5981"
/[^Ll]/               neither L nor l           "this is Linguistics 5981"
/[^\.]/               not a period              "this is Linguistics 5981"

Special characters:
\*    an asterisk        "L*I*N*G*U*I*S*T*I*C*S"
\.    a period           "Dr. Doolittle"
\?    a question mark    "Is this Linguistics 5981 ?"
\n    a newline
\t    a tab
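Negated classes and escaped special characters behave the same way in, e.g., Python's `re` module; note that an unescaped "." is the wildcard, while "\." matches only a literal period:

```python
# Sketch: negated classes and escaped special characters.
import re

# [^A-Z] matches any character that is NOT an uppercase letter
non_upper = re.findall(r"[^A-Z]", "Ab1")

# \. matches a literal period ...
literal_dot = re.search(r"Dr\.", "Dr. Doolittle")

# ... whereas an unescaped "." matches any character
wildcard_dot = re.search(r"Dr.", "Drx Doolittle")
```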
regular expressions (Jurafsky, section 2.1)
to search for optional characters we use the question mark: [?]
Regular expression match sample pattern
/colou?r/ colour or color beautiful colour
to search for any number of a certain character we use the Kleene star: [*]
Regular expression match
/a*/ any string of zero or more “a”s
/aa*/ at least one a but also any number of “a”s
regular expressions (Jurafsky, section 2.1)
Regular expression match
/[ab]*/ zero or more “a”s or “b”s
/[0-9][0-9]*/ any integer (= a string of digits)
To look for at least one character of a type we use the Kleene “+”:
Regular expression match
/[0-9]+/ a sequence of digits
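The difference between the Kleene star and the Kleene plus shows up clearly in, e.g., Python's `re` module: "a*" happily matches the empty string, while "aa*" (equivalently "a+") needs at least one "a":

```python
# Sketch: Kleene star (*) and Kleene plus (+).
import re

m_star = re.match(r"a*", "bbb")    # zero or more "a"s: matches the empty string
m_plus = re.match(r"aa*", "aaab")  # at least one "a", then any number of "a"s

# [0-9]+ matches maximal runs of digits
ints = re.findall(r"[0-9]+", "room 42, floor 7")
```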
regular expressions (Jurafsky, section 2.1)
The "." is a very special character -> the so-called wildcard

regular expression    match                                        sample pattern
/b.ll/                "b", then any single character, then "ll"    ball, bell, bull, bill
Will the search find “Bill” ?
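The question can be checked empirically: since regular expressions are case sensitive, the literal lowercase "b" in /b.ll/ does not match the uppercase "B" of "Bill" (the wildcard only covers the second character):

```python
# Sketch: /b.ll/ matches ball, bell, bull, bill - but not "Bill",
# because the leading "b" is a case-sensitive literal.
import re

words = ["ball", "bell", "bull", "bill", "Bill"]
matched = [w for w in words if re.fullmatch(r"b.ll", w)]
```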
regular expressions (Jurafsky, section 2.1)
Anchors (start of line: "^", end of line: "$")

regular expression    match                                      sample pattern
/^Linguistics/        "Linguistics" at the beginning of a line   "Linguistics is fun."
/linguistics\.$/      "linguistics." at the end of a line        "We like linguistics."

Anchors (word boundary: "\b", non-boundary: "\B")

regular expression    match                         sample pattern
/\bthe\b/             "the" as a word on its own    "This is the place."
/\Bthe\B/             "the" inside another word     "This is my mother."
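The anchor examples can be verified with, e.g., Python's `re` module; note the raw strings, since "\b" in an ordinary Python string would be a backspace character:

```python
# Sketch: line anchors and word boundaries.
import re

start = re.search(r"^Linguistics", "Linguistics is fun.")     # start of line
end = re.search(r"linguistics\.$", "We like linguistics.")    # end of line
alone = re.search(r"\bthe\b", "This is the place.")           # "the" as a word
inside = re.search(r"\Bthe\B", "This is my mother.")          # "the" in "mother"
```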
regular expressions (Jurafsky, section 2.1)
More on alternative characters: the pipe symbol: “|” (disjunction)
Regular expression match sample pattern
/colou?r/ colour or color beautiful colour
/progra(m|mme)/ program or programme linguistics program
regular expressions (Jurafsky, section 2.1)
What does the following expression match ?
/student [0-9]+ */
Will it match “student 1 student 2 student 3” ?
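The question can be answered empirically: the pattern ends in "one space, then zero or more spaces", and nothing in it can repeat the "student N" part, so a single match covers only one "student N" group; over the whole string we get three separate matches:

```python
# Sketch: what /student [0-9]+ */ actually matches.
import re

s = "student 1 student 2 student 3"
matches = re.findall(r"student [0-9]+ *", s)
# three matches, each covering one "student N" plus trailing spaces
```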
regular expressions (Jurafsky, section 2.1)
Perl expressions are also used for string substitution: (used in ELIZA)
s/man/men/ man -> men
Perl expressions are also used for string repetition via memory:
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1 ?/
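The same ELIZA-style substitution can be written with, e.g., Python's `re.sub`; `\1` in the replacement copies whatever the `(depressed|sad)` group captured:

```python
# Sketch: ELIZA-style substitution with a backreference to group 1.
import re

reply = re.sub(r".* YOU ARE (depressed|sad) .*",
               r"I AM SORRY TO HEAR YOU ARE \1",
               "I THINK YOU ARE sad TODAY")
```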
regular expressions (Jurafsky, section 2.1)
1 morphemes
2 compounds / concatenation
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)
content
The regular expression is more than just a convenient metalanguage for text searching.
• First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe and look at in this lecture. Any regular expression can be implemented as a finite-state automaton*. Symmetrically, any finite-state automaton can be described with a regular expression.
• Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the following figure:
Finite State Automata (FSA)
[Figure] The relationship between finite state automata, regular expressions, and regular languages* (* as suggested by Martin Kay in: Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark, pp. 2-10. ACL.)
Finite State Automata (FSA)
Definition: A finite-state machine (FSM) or finite-state automaton (plural: automata) (FSA), or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition… In linguistics, they are used to describe simple parts of the grammars of natural languages.
Finite State Automata (FSA)
Finite-State Language Processing, Emmanuel Roche and Yves Schabes (eds.)
Examples:
• Introduction to finite-state automata for regular expressions
• Mapping from regular expressions to automata
examples
Finite State Automata (FSA)
Using a FSA to recognize sheeptalk
After a while, with the parrot's help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said.
Hugh Lofting, The Story of Doctor Doolittle
Finite State Automata (FSA)
Using a FSA to recognize sheeptalk
Sheep language can be defined as any string from the following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
....
Finite State Automata (FSA)
The regular expression for this kind of sheeptalk is
/baa+!/
All regular expressions can be represented as finite-state automata (FSA):
Finite State Automata (FSA)
a finite-state automaton (FSA) for the regular expression /baa+!/
states: q0 (start state), q1, q2, q3, q4 (final state / accepting state)
transitions: q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4, with a self-loop q3 --a--> q3 for the additional "a"s
Finite State Automata (FSA)
function D-RECOGNIZE(tape,machine) returns accept or reject
index <- Beginning of tape
current-state <- Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else return reject
elseif transition-table[current-state,tape[index]] is empty then
return reject
else
current-state <- transition-table[current-state,tape[index]]
index <- index + 1
end
An algorithm for deterministic recognition of FSAs
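The pseudocode translates almost line by line into Python; the transition table below encodes the sheeptalk FSA for /baa+!/, with integer state names following the figure:

```python
# Sketch: deterministic FSA recognition (D-RECOGNIZE) in Python.
def d_recognize(tape, transition_table, start_state, accept_states):
    """Return True iff the automaton accepts the whole input string."""
    current_state = start_state
    for symbol in tape:
        key = (current_state, symbol)
        if key not in transition_table:   # empty transition-table cell -> reject
            return False
        current_state = transition_table[key]
    # end of input reached: accept iff we stopped in an accepting state
    return current_state in accept_states

# transition table for /baa+!/: q0 -b-> q1 -a-> q2 -a-> q3 (-a-> q3) -!-> q4
SHEEPTALK = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
```

For example, `d_recognize("baaaa!", SHEEPTALK, 0, {4})` accepts, while "ba!" is rejected because the automaton requires at least two "a"s before the "!".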