Natural Language Processing >> Morphology <<

Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Science, Darmstadt, Germany
https://www.fbi.h-da.de/organisation/personen/harriehausen-muehlbauer-bettina.html
winter / fall 2015/2016
41.4268

Mar 06, 2018


- Natural Language Systems -

Harriehausen

2

1 morphemes

2 compounds / concatenation

3 idiomatic phrases

4 multiple word entries (MWE)

5 spell aid

6 regular expressions

7 Finite State Automata (FSA)

content


Morphemes

morpheme = smallest possible item in a language that carries meaning

• lexeme (man, house, dog,...)
• inflectional affixes (dog-s, want-ed,...)
• other affixes (pre-/in-/suff-): unwanted, atypical, antipathetic,...
  esp. in technical language (-itis = "inflammation", gastro = stomach ... gastroenteritis)

definition



morphemes

free morphemes: stand-alone, carry lexical and morphological meaning (e.g. house = sing, neuter, nominative; case/number/gender)

bound morphemes: legal word form only in combination with another morpheme, i.e. not stand-alone; carry lexical or morphological meaning. Various combinations exist:

bound + free: e.g. un-happy

all bound: e.g. gastro-enter-itis (from Greek; German: Magen-Darm-Entzündung = inflammation of stomach and intestines)


morphemes

inflectional morphemes: create word forms of the same lexeme and carry morphological meaning (e.g. dogs, laughed, going)

derivational morphemes: create new words (lexemes) from existing ones, often changing the word class (happily, intellectually, instruction, instructor, insulator, the pounding, limpness, blindness...)

Question: which string (~morpheme) do we include in our dictionary ?

• full form dictionary vs.

• base form dictionary (lemmas)
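
The two dictionary designs can be contrasted with a minimal sketch. The entries, the feature labels and the three suffix rules below are hypothetical toy data chosen for illustration; they are not taken from the slides, and a real analyzer would need many more rules (irregular forms, spelling changes, and so on).

```python
# Toy data (assumed for illustration, not from the lecture).
FULL_FORMS = {
    "dog": ("dog", "noun, singular"),
    "dogs": ("dog", "noun, plural"),
    "laugh": ("laugh", "verb, base form"),
    "laughed": ("laugh", "verb, past"),
}

BASE_FORMS = {"dog": "noun", "laugh": "verb", "go": "verb"}

# (suffix, feature) pairs, tried in this order; a deliberately tiny rule set.
SUFFIX_RULES = [
    ("ing", "progressive"),
    ("ed", "past"),
    ("s", "plural or 3rd person singular"),
]


def lookup_full_form(word):
    """Full-form dictionary: every inflected form is its own entry."""
    return FULL_FORMS.get(word)


def lookup_base_form(word):
    """Base-form dictionary: strip a known suffix, then look up the lemma."""
    if word in BASE_FORMS:
        return word, "base form"
    for suffix, feature in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in BASE_FORMS:
                return stem, feature
    return None
```

The full-form table answers only for forms that were listed explicitly, while the base-form table covers any form its rules can reduce to a stored lemma; the price is that the rules have to be written and can be wrong.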


1 morphemes

2 compounds / concatenation / decompounding

3 idiomatic phrases

4 multiple word entries (MWE)

5 spell aid

6 regular expressions

7 Finite State Automata (FSA)

content


Definition: a compound is a lexeme that consists of more than one stem. Compounding or composition is the word formation process that creates compound lexemes (= compounds).

There is no clear upper limit on the number of roots allowed in English compounds. It usually doesn't exceed 3 morphemes, but it is clearly a stylistic issue.

• Some compounds are written as one word: blackbird.
• Some are written with hyphens: mother-in-law.
• Most are written as separate words: smoke screen.

Typically not spelling, but stress and word-internal sound rules distinguish compounds from non-compounds: compare white house with White House.

compounds / concatenation

Question: What do we put into our dictionary ?


Compounding follows rules, e.g. in the naming of chemical compounds (http://www.chem.qmul.ac.uk/iupac/).

Substitutive nomenclature: this naming method generally follows established IUPAC organic nomenclature. E.g.: hydrides of the main group elements (groups 13–17) are given -ane base names, e.g. borane (BH3), oxidane (H2O), phosphane (PH3). The compound PCl3 would be named substitutively as trichlorophosphane.

Additive nomenclature: this naming method has been developed principally for coordination compounds. An example of its application is: [CoCl(NH3)5]Cl2 = pentaamminechloridocobalt(III) chloride

compounds / concatenation


Example of a chemical compound: components of phane parent names

bicyclo[8.6.0]hexadecaphane

• The prefix "bicyclo" indicates that there are two rings (bi-cyclo).
• The bridge descriptor [8.6.0] describes the ring structure in terms of a sixteen-membered main ring [8 + 6 + 2 (the bridgehead nodes)] with a bridge consisting of a bond, i.e. zero nodes, which divides the main ring into an eight-membered and a ten-membered ring.
• The numerical term "hexadeca" denotes the presence of sixteen skeletal nodes.
• The term "phane" indicates that at least one node represents a multiatomic (cyclic) structural unit.

[http://www.chem.qmul.ac.uk/iupac/phane/PhI2.html]


Example of a medical compound

Medical compounds are usually composed of a prefix + root + suffix, where none of the components can be used stand-alone.

nephritis: inflammation of the kidney
supra-renal: situated above the kidneys
nephrologist: a kidney doctor
gastroenteritis: inflammation of stomach and intestines

nephr- / ren- = kidney; 2 roots: Greek νεφρός (nephr(os)), Latin ren(es)
gastr- = stomach, belly; ancient Greek γαστήρ (gastēr), γαστρ-
-o- = linking element between 2 body parts (linguistically)
enter- = intestine; ancient Greek ἔντερον (énteron)
-itis = inflammation
supra- = above
-ologist = person studying a certain body part
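
The morpheme list above can be used directly as a small lexicon for taking medical terms apart. The following is a minimal sketch, not a full analyzer: the lexicon holds only morphemes named on this slide, the function names `segment` and `gloss` are made up for this example, and preferring longer morphemes is an assumption made so that "ologist" wins over the linking "o".

```python
# Lexicon restricted to morphemes listed on the slide.
MORPHEMES = {
    "nephr": "kidney",
    "gastr": "stomach, belly",
    "enter": "intestine",
    "o": "linking element",
    "itis": "inflammation",
    "ologist": "person studying a certain body part",
}


def segment(term, lexicon=MORPHEMES):
    """Return the first complete segmentation of term, or None if there is none."""
    if not term:
        return []
    # Try longer morphemes first (assumption: 'ologist' before 'o').
    for morpheme in sorted(lexicon, key=len, reverse=True):
        if term.startswith(morpheme):
            rest = segment(term[len(morpheme):], lexicon)
            if rest is not None:
                return [morpheme] + rest
    return None


def gloss(term):
    """Pair every morpheme of term with its meaning from the lexicon."""
    parts = segment(term.lower())
    if parts is None:
        return None
    return [(part, MORPHEMES[part]) for part in parts]
```

For example, `segment("gastroenteritis")` yields the parts gastr, o, enter, itis, which mirrors the decomposition on the slide; a word built from unknown material simply returns `None`.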


formation of compounds: synthesis and agglutination

Compound formation rules vary widely across language types.

Examples of formation processes (usually linked to the language type):

• synthesis (typically with synthetic languages, i.e. languages with a high morpheme-per-word ratio), e.g.

German: Kapitänspatent = Kapitän (sea captain) + Patent (license) joined by an -s- (originally a genitive case suffix); "patent of a sea captain"

Latin: paterfamilias = pater (father) + familias (genitive of the lexeme familia (family)); "father of a family"

compounds / concatenation


formation of compounds:

It can get more difficult (German -> English): Aufsichtsratsmitgliederversammlung =>

Auf = on
sicht + s = view + "Fugen-s" (linking s)
rat + s = council + "genitive-s"
mit = with
glied + er = link + "plural"
ver = "completion"
samml (stem of sammeln) = collect
ung = "noun"

On-view-council-with-link-collect ??? = "meeting of members of the supervisory board"

compounds / concatenation

Notice: "with" and "link" form a derivation that is the German word for "member"; "completion", "collect" and "noun" form a derivation that means "meeting"


formation of compounds: synthesis and agglutination

• agglutination (usually with agglutinative languages, which tend to create very long words with derivational morphemes), e.g.

German:
Farbfernsehgerät = color television set
Funkfernbedienung = radio remote control
Donaudampfschifffahrtsgesellschaftskapitänsmütze = Danube steamboat shipping company captain's hat

Finnish:
hätä-uloskäytävä = emergency exit
Lentokone-suihku-turbiini-moottori-apu-mekaanikko-aliupseeri-oppilas = airplane jet turbine engine auxiliary mechanic non-commissioned officer student

Swedish:
rörelseuppskattningssökintervallsinställningar = motion estimation search range settings

compounds / concatenation


Samples for long compounds in German

• die Armbrust

• die Mehrzweckhalle

• das Mehrzweckkirschentkerngerät

• die Gemeindegrundsteuerveranlagung

• die Nummernschildbedruckungsmaschine

• der Mehrkornroggenvollkornbrotmehlzulieferer

• der Schifffahrtskapitänsmützenmaterialhersteller

• die Verkehrsinfrastrukturfinanzierungsgesellschaft

• die Feuerwehrrettungshubschraubernotlandeplatzaufseherin

• der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker

• das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

• die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

Wolkenkratzer 'skyscraper': wolken 'clouds' + kratzer 'scraper'
Eisenbahn 'railway': Eisen 'iron' + bahn 'track'
Kraftfahrzeug 'automobile': Kraft 'power' + fahren/fahr 'drive' + zeug 'machinery'
Stacheldraht 'barbed wire': stachel 'barb/barbed' + draht 'wire'
Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz: literally, cattle marking and beef labeling supervision duties delegation law


Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)

Chinese (Cantonese Jyutping):
學生 'student': 學 learn + 生 grow
太空 'universe': 太 great + 空 emptiness
摩天樓 'skyscraper': 摩 touch + 天 sky + 樓 building (with more than 1 floor)
打印機 'printer': 打 strike + 印 stamp/print + 機 machine
百科全書 'encyclopaedia': 百 100 + 科 (branch of) study + 全 entire/complete + 書 book

Dutch:
Arbeidsongeschiktheidsverzekering 'disability insurance': arbeid 'labour' + ongeschiktheid 'inaptitude' + verzekering 'insurance'
Rioolwaterzuiveringsinstallatie 'wastewater treatment plant': riool 'sewer' + water 'water' + zuivering 'cleaning' + installatie 'installation'
Verjaardagskalender 'birthday calendar': verjaardag 'birthday' + kalender 'calendar'
Klantenservicemedewerker 'customer service representative': klanten 'customers' + service 'service' + medewerker 'worker'
Universiteitsbibliotheek 'university library': universiteit 'university' + bibliotheek 'library'
Doorgroeimogelijkheden 'possibilities for advancement': door 'through' + groei 'grow' + mogelijkheden 'possibilities'


Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)

Finnish:
sanakirja 'dictionary': sana 'word' + kirja 'book'
tietokone 'computer': tieto 'knowledge, data' + kone 'machine'
keskiviikko 'Wednesday': keski 'middle' + viikko 'week'
maailma 'world': maa 'land' + ilma 'air'
rautatieasema 'railway station': rauta 'iron' + tie 'road' + asema 'station'
suihkuturbiiniapumekaanikkoaliupseerioppilas: 'jet engine assistant mechanic NCO student'
atomiydinenergiareaktorigeneraattorilauhduttajaturbiiniratasvaihde: some part of a nuclear plant

Korean:
안팎 anpak 'inside and outside': 안 an 'inside' + 밖 bak 'outside'

Spanish:
Ciempiés 'centipede': cien 'hundred' + pies 'feet'
Ferrocarril 'railway': ferro 'iron' + carril 'lane'
Paraguas 'umbrella': para 'to stop, stops' + aguas '(the) water'


Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)

Icelandic:
járnbraut 'railway': járn 'iron' + braut 'path' or 'way'
farartæki 'vehicle': farar 'journey' + tæki 'apparatus'
alfræðiorðabók 'encyclopædia': al 'everything' + fræði 'study' or 'knowledge' + orða 'words' + bók 'book'
símtal 'telephone conversation': sím 'telephone' + tal 'dialogue'

Italian:
Millepiedi 'centipede': mille 'thousand' + piedi 'feet'
Ferrovia 'railway': ferro 'iron' + via 'way'
Tergicristallo 'windscreen wiper': tergere 'to wash' + cristallo 'crystal, glass'

Japanese:
目覚まし(時計) mezamashi(dokei) 'alarm clock': 目 me 'eye' + 覚まし samashi (-zamashi) 'awakening (someone)' (+ 時計 tokei (-dokei) 'clock')
お好み焼き okonomiyaki: お好み okonomi 'preference' + 焼き yaki 'cooking'
日帰り higaeri 'day trip': 日 hi 'day' + 帰り kaeri (-gaeri) 'returning (home)'
国会議事堂 kokkaigijidō 'national diet building': 国会 kokkai 'national diet' + 議事 giji 'proceedings' + 堂 dō 'hall'


formation of compounds and their structure:

Most compounds are 2-root-compounds, but they come with a number of different structures: Nouns - Adjectives - Verbs

A. Nouns (see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)

Noun-Noun       Adjective-Noun   Preposition-Noun   Verb-Noun
apron string    high school      overdose           swearword
hubcap          smallpox         underdog           whetstone
bedroom         poorhouse        uptown             scrubwoman
schoolteacher   bluebird         afterthought       rattlesnake

In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.


formation of compounds and their structure:

In each of these cases, the syntactic class of the compound is the same

as the syntactic class (= part-of-speech) of the final element of the compound.

Rule:

• Germanic languages (e.g. English, German) are left-branching (the modifiers come before the head): schoolteacher = teacher of a school, bluebird = bird of blue color

• Romance languages (e.g. French, Spanish) are usually right-branching, i.e. compounds are often formed by left-hand heads with prepositional components inserted before the modifier: chemin-de-fer = railway (lit. 'road of iron'), moulin à vent = windmill (lit. 'mill (that works) by means of wind')

Noun-Noun       Adjective-Noun   Preposition-Noun   Verb-Noun
schoolteacher   bluebird         afterthought       rattlesnake
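
The rule that the compound inherits its part of speech from its head can be sketched in a few lines. This is only an illustration under assumptions: the compound is taken to be already segmented, the small `POS` table is hypothetical, and the flag `head_final` is a name invented here to switch between the Germanic pattern (head last) and the Romance pattern (head first).

```python
# Hypothetical part-of-speech table for the parts of the slide examples.
POS = {
    "school": "noun", "teacher": "noun",
    "blue": "adjective", "bird": "noun",
    "after": "preposition", "thought": "noun",
    "rattle": "verb", "snake": "noun",
    "chemin": "noun", "de": "preposition", "fer": "noun",
}


def head_of(parts, head_final=True):
    """Return the head: the last part (Germanic) or the first part (Romance)."""
    return parts[-1] if head_final else parts[0]


def compound_pos(parts, head_final=True):
    """The compound takes the part of speech of its head."""
    return POS.get(head_of(parts, head_final))
```

With these toy entries, `compound_pos(["rattle", "snake"])` gives "noun" although the first part is a verb, and `head_of(["chemin", "de", "fer"], head_final=False)` picks "chemin" as the head of the French example.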


formation of compounds and their structure:

B. Adjectives

(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)

Noun-Adjective   Adjective-Adjective   Preposition-Adjective
headstrong       white-hot             overwide
skin-deep        widespread            ingrown
nationwide       bittersweet           underripe
earthbound       hardworking           above-mentioned

In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.


formation of compounds and their structure:

B. Adjectives: hardworking

The internal structure may be complex: hard + work + ing -> hardwork + ing OR hard + working

-ing is typically the aspect suffix that gets added to the verb (root): e.g. play-ing, laugh-ing, ask-ing,…

As a rule, we can form other word forms (inflections for different tenses) from those roots, following the same inflectional pattern, i.e. verbal root + tense-marking suffix, or insertion of a modal verb:

Simple Present: He play-s. He laugh-s. He ask-s.
Simple Past: They play-ed. They laugh-ed. They ask-ed.
Simple Future: I will play. I will laugh. I will ask.

compounds / concatenation


formation of compounds and their structure:

B. Adjectives: hardworking

The internal structure may be complex: hard + work + ing -> hardwork + ing OR hard + working

* He hardworks.
* They hardworked.
* I will hardwork.

(* marks an ungrammatical form)

-> not hardwork + ing, but hard + working, i.e. hardwork is not a verb by itself (see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)

compounds / concatenation


formation of compounds and their structure:

B. Adjectives: hardworking, internal structure as a tree (bracket notation):

[Adj [Adv hard] [Adj [verb work] [suffix -ing]]]

(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)


formation of compounds and their structure:

C. Verbs

(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)

In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.

compounds / concatenation

Noun-Verb      Adjective-Verb   Preposition-Verb   Verb-Verb
spoonfeed      dry-clean        outlive            sleepwalk
aircondition   whitewash        overdo
window-shop    broadcast        uproot


semantics of compounds

Semantic classification: it is common to classify compounds into 4 types:

• endocentric (description: A+B denotes a special kind of B)
• exocentric
• copulative
• appositional

Endocentric compounds consist of a head, which carries the basic meaning, and modifiers, which restrict this meaning. Endocentric compounds tend to be of the same part of speech (word class) as their head.

Examples:
- doghouse, where house is the head and dog is the modifier; i.e. a house intended for a dog
- darkroom, where dark modifies room; i.e. a type of room (usually used in photography)


semantics of compounds

Semantic classification: it is common to classify compounds into 4 types:

• endocentric
• exocentric (description: (one) whose B is A)
• copulative
• appositional

Exocentric compounds have an unexpressed semantic head (e.g. a person, a plant, an animal...), and their meaning is often not transparent from their constituent parts.

Examples:
- white-collar is neither a kind of collar nor a white thing; the collar's colour is a metaphor for socioeconomic status
- red-neck refers only indirectly to a neck; it denotes a working person (e.g. a farmer)
- skinhead may refer to a bald head, but also refers to a certain group of people
- paleface does not denote a face, but a white person (a term attributed to Native Americans)


semantics of compounds

Semantic classification: it is common to classify compounds into 4 types:

• endocentric
• exocentric
• copulative (description: A+B denotes 'the sum' of what A and B denote)
• appositional

Copulative compounds are compounds which have two semantic heads.

Examples:
- bittersweet; having both tastes
- sleepwalk; sleeping while walking OR walking in your sleep


semantics of compounds

Semantic classification: it is common to classify compounds into 4 types:

• endocentric
• exocentric
• copulative
• appositional (description: A and B provide different descriptions for the same referent; the meaning can be characterized as 'AS WELL AS')

Appositional compounds refer to lexemes that have two (contrary) attributes which classify the compound.

Examples:
- actor-director; an actor who is also the director
- maidservant; a maid who is also a servant OR a servant who is also a maid
- player-coach; someone who is a player as well as a coach


semantics of compounds (ambiguities)

When, in Germanic languages (e.g. German, English), compound words are formed by placing a descriptive word in front of the main word, the semantic relation between the components may be ambiguous. This is a problem for decompounding or translation. -> the orange bowl problem


Can you please bring me the orange bowl?

• bowl filled with oranges?
• bowl having the shape of an orange?
• bowl with an orange pattern?
• bowl of orange colour?
• bowl that was formerly / usually filled with oranges?

semantics of compounds (ambiguities)


compounding - decompounding

decompounding follows principles / rules:

FANO rule: „the analysis is unambiguous, when a morpheme is not the beginning of another morpheme“

(= principle of longest match)

e.g. but / butter

(Orthographic) Ambiguities in segmentation :

horseshoe: horses – hoe (?) vs. horse-shoe

(the FANO rule would lead to the incorrect/unlikely segmentation)
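
Both points can be tried out in a minimal sketch. The six-word lexicon is a toy assumption, and the function names are invented for this example: `prefix_conflicts` lists the morpheme pairs that violate the condition "no morpheme is the beginning of another morpheme", and `longest_match` segments greedily from left to right, always taking the longest known morpheme.

```python
# Toy lexicon (assumed for illustration).
LEXICON = {"horse", "horses", "shoe", "hoe", "but", "butter"}


def prefix_conflicts(lexicon):
    """Pairs (a, b) where morpheme a is the beginning of morpheme b."""
    words = sorted(lexicon)
    return [(a, b) for a in words for b in words if a != b and b.startswith(a)]


def longest_match(word, lexicon):
    """Greedy left-to-right segmentation; None if no morpheme fits at some point."""
    parts = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # stuck: nothing in the lexicon matches here
    return parts
```

With this lexicon the conflicts are but/butter and horse/horses, and `longest_match("horseshoe", LEXICON)` returns horses + hoe, the unlikely reading mentioned above. Greedy matching never revisits a choice, which is why a recursive search over all alternatives is needed.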

Segmentation has to be done recursively in order to find all possibilities:


compounding - decompounding

English: petshopping: pet-shopping vs. pets-hopping

Martine Adda-Decker. “A corpus-based decompounding algorithm for German lexical

modeling in LVCSR”. EUROSPEECH 2003.

https://perso.limsi.fr/madda/publications/PDF/ES031038.pdf
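
Recursive segmentation that finds all possibilities can be sketched as follows. This is a plain exhaustive search over a toy lexicon (an assumption for illustration); it is not the corpus-based algorithm of the paper cited above, and `all_segmentations` is a name chosen here.

```python
# Toy lexicon (assumed for illustration).
LEXICON = {"pet", "pets", "shopping", "hopping", "horse", "horses", "shoe", "hoe"}


def all_segmentations(word, lexicon):
    """Return every way to split word into a sequence of lexicon entries."""
    if not word:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Segment the remainder recursively and prepend the current head.
            for tail in all_segmentations(word[end:], lexicon):
                results.append([head] + tail)
    return results
```

For "petshopping" the search returns both pet + shopping and pets + hopping; choosing between them needs additional knowledge such as context, frequency or stress. The running time can grow exponentially on long words, so a practical version would cache the results per suffix.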


compounding - decompounding

German:

Staubecken: Stau-becken = a reservoir vs. Staub-ecken = dusty corners
Wachstube: Wach-stube = the room of a guard vs. Wachs-tube = a tube filled with wax
Gelbrand: Gelb-rand = a yellow border vs. Gel-brand = burning of a gel
Tonerkennung: Toner-kennung = the identifier of a toner vs. Ton-erkennung = the identification of tones
Lachen: Lache-n = multiple puddles of water vs. Lachen = laughter
Druckerzeugnis: Druck-erzeugnis = printed matter vs. Drucker-zeugnis = certificate for a printer
beinhalten: bein-halten vs. be-inhalten (imagine: Beinhalten…)
Abteilungen: Abtei-lungen vs. Abteil-ungen


compounding - decompounding

context or stress (in spoken language) is needed for disambiguation

Stress makes a difference:

a green ´house vs. a ´greenhouse
the white ´house vs. the ´White House


(problems with )concatenation

Summary

Structural as well as semantic challenges with compounds:

• ambiguities in meaning (orange bowl)

• ambiguities in hyphenation points (Staubecken)

• not all morphemes can form a compound (sheepchops)


compounds -> MWE -> idiomatic phrases

In addition to the compounds that fit one of the four descriptions (endocentric, exocentric, copulative, appositional), i.e. that stick to the original lexical meaning of at least one of their components, we need to consider "multiple morpheme strings / multi word expressions (MWE)" (fixed phrases) that have "lost" the original lexical meaning of their components. Those MWE are called idiomatic phrases or idioms.

increasing the formal complexity = increasing the idiomatic rigidity:

• compounding: combination of lexical meanings: carseat, houseboat, cellar door,...

• compounding: not a combination of the lexical meanings: starfish, paperback, ladybug,...

• depending on the context: bite the dust, lose face, kick the bucket,...


1 morphemes

2 compounds / concatenation

3 idiomatic phrases

4 multiple word entries (MWE)

5 spell aid

6 regular expressions

7 Finite State Automata (FSA)

content


idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/englisch)

• Out of the blue

• To be on Cloud Nine

• A leopard cannot change its spots

• Head over heels

• Fair Play

• As cool as a cucumber

• The early bird catches the worm

• As fit as a fiddle

• Beat about the bush

• The Big Apple

• The apple of my eye

• Wet behind the ears

• A bird in the hand is worth two in the bush

• It's raining cats and dogs
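
Since the meaning of such phrases cannot be computed from their parts, one common option is to store them as whole entries and to match them as units in a token stream. The following is a minimal sketch under assumptions: the lexicon holds four phrases from the list above, matching is exact and case-insensitive, and the function name `mark_idioms` is invented here. Inflected or interrupted variants are not handled.

```python
# A few idioms from the list above, stored as token tuples.
IDIOMS = {
    ("out", "of", "the", "blue"),
    ("head", "over", "heels"),
    ("the", "big", "apple"),
    ("raining", "cats", "and", "dogs"),
}
MAX_LEN = max(len(idiom) for idiom in IDIOMS)


def mark_idioms(tokens):
    """Group tokens so that each known idiom becomes one multi-word unit."""
    lowered = [token.lower() for token in tokens]
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(lowered[i:i + n]) in IDIOMS:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            units.append(tokens[i])  # no idiom starts here
            i += 1
    return units
```

Applied to the tokens of "It came out of the blue", the function keeps "It" and "came" as single words and returns "out of the blue" as one unit, which can then be looked up as a single dictionary entry.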


idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)

• Wie bei Hempels unterm Sofa

• Schmetterlinge im Bauch

• Jemanden übers Ohr hauen

• Ein Bäuerchen machen

• Mit jemandem durch dick und dünn gehen

• Seine Pappenheimer kennen

• Jemandem die Würmer aus der Nase ziehen

• Die Arschkarte ziehen

• Mit jemandem Pferde stehlen können

• Sich aus dem Staub machen

• Hummeln im Hintern haben

• Im siebten Himmel sein

• Viele Wege führen nach Rom

• Mit einem lachenden und einem weinenden Auge

• Nah am Wasser gebaut haben

• Da ist der Bär los

• Nachtigall, ick hör dir trapsen

• Mein lieber Scholli!


• Jemandem einen Denkzettel verpassen

• Sich auf den Schlips getreten fühlen

• Alles für die Katz

• Wo drückt denn der Schuh?

• Gegen den Strich gehen

• Den Faden verlieren

• Etwas ausbaden müssen

• Einen Stein im Brett haben

• Bahnhof verstehen

• Der springende Punkt

• Der Sündenbock sein

• Einen Ohrwurm haben

• Das ist doch zum Mäusemelken!

• Schmiere stehen

• Den Teufel an die Wand malen

• Auf dem Holzweg sein

• Eselsbrücke

• In der Kreide stehen


• Die Ohren steif halten

• Auf Vordermann bringen

• Um die Ecke bringen

• Hals- und Beinbruch

• Auf dem Kerbholz haben

• Eine Schlappe einstecken

• Frosch im Hals

• Es zieht wie Hechtsuppe

• Jemandem einen Bärendienst erweisen

• Damoklesschwert

• Tomaten auf den Augen haben

• Jemandem raucht der Kopf

• Für 'n Appel und 'n Ei

• Etwas an die große Glocke hängen

• Das ist Jacke wie Hose

• Etwas aus dem Ärmel schütteln

• Ein X für ein U vormachen

• Jemandem nicht das Wasser reichen können


• Alles im grünen Bereich

• Die Hand ins Feuer legen

• Das kann kein Schwein lesen!

• Auf Draht sein

• Sein blaues Wunder erleben

• Der hat es faustdick hinter den Ohren

• Mein Name ist Hase, ich weiß von nichts

• Aus dem Stegreif

• Der Groschen ist gefallen

• Einen Vogel haben

• Den Kürzeren ziehen

• Bis in die Puppen

• Etwas hinter die Ohren schreiben

• Ins Fettnäpfchen treten

• Beleidigte Leberwurst

• Jemanden auf dem Kieker haben

• Ich verstehe immer nur Bahnhof!

• Die Katze im Sack kaufen


• Bekannt wie ein bunter Hund

• Den Kopf in den Sand stecken

• Mit dem ist nicht gut Kirschen essen

• Aller guten Dinge sind drei

• Lampenfieber

• Das kommt mir spanisch vor

• Schwein haben

• Das hast du dir selbst eingebrockt

• Seinen Senf dazugeben

• Jemandem ist eine Laus über die Leber gelaufen

• Kalte Füße bekommen

• Im Stich lassen

• Schwedische Gardinen

• Alles in Butter

• Geld auf den Kopf hauen

• Das Handtuch werfen

• Sich mit fremden Federn schmücken


idiomatic phrases – and their morpho-syntax

Idiomatic expressions are extremely rigid, in that morpho-syntactic modifications are not allowed (without a change in meaning) :

GERMAN

Singular - Plural

• Bekannt wie ein bunter Hund

• ??? Bekannt wie bunte Hunde.

• * Bekannt wie 2 bunte Hunde.

adjectival modification

• Den Kopf in den Sand stecken.

• Den Kopf in den weichen Sand stecken.


ENGLISH

Adjectival modification:

• to be on cloud nine -> * to be on cloud eight

Singular – Plural:

• The early bird gets the worm. -> ? The early birds get the worm.

• It's raining cats and dogs. -> * It's raining 2 cats and 3 dogs.

Neither adjectival modification nor change of subject:

• He kicked the bucket.

• * He kicked the green bucket.

• * It kicked the bucket.


multiple word entries (MWE)

We have already looked at the semantics / meaning of compounds and idioms.

But what about the relationship within the MWE ?


Problems: the relationships among the components change

the „Schnitzel“ problem

• Schweineschnitzel / -steak: made of pork

• Pfefferschnitzel / -steak: garnished / spiced with pepper

• Wienerschnitzel: a certain recipe

• Soyaschnitzel: made of soy

• Rückensteak, Lendensteak, Ribeyesteak: body part

• Minutenschnitzel / -steak: time / length of cooking

• Jäger Schnitzel: a certain recipe

• Zigeuner Schnitzel: a certain recipe

• Tiefkühl-Schnitzel: status (frozen)


Even though the individual lexical meanings remain untouched in the compound, the relationships between the components vary tremendously!


the 3 main relationships (default ?) between parts of a compound word: (the role of global knowledge in decompounding)

compound: meaning (grouped by relationship)

1 is-a / is-part-of / genitive

• doorknob: knob of the door

• carseat: seat of the car

2 made from / material

• glassdoor: door made of glass

• nutbread: ≠ bread of the nut

3 used for

• waterglass: glass filled with water

• oiltruck: truck that carries oil, ≠ truck made of oil


spell aid

in NLP, decompounding algorithms are essential for spell-checking / spell aid :

How do we define a lexical error in NLP terms ?

An error is a string that cannot be found in / matched with a dictionary entry.

It is not necessarily an incorrect word (esp. neologisms).
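The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the tiny `LEXICON`, the function names and the naive "try every split point" decompounding strategy are assumptions made for this example, not the algorithm used in any particular spell checker.

```python
def split_compound(word, lexicon):
    """Return lexicon entries that concatenate to `word`, or None if there are none."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try every split point; the head must be a known word, the tail is split recursively.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if rest is not None:
                return [head] + rest
    return None


def is_lexical_error(word, lexicon):
    """A string is flagged only if neither direct lookup nor decompounding succeeds."""
    return split_compound(word, lexicon) is None


# Hypothetical mini dictionary, for illustration only.
LEXICON = {"car", "seat", "house", "boat", "door", "knob"}

print(split_compound("carseat", LEXICON))      # ['car', 'seat']
print(is_lexical_error("houseboat", LEXICON))  # False: compound of known words
print(is_lexical_error("abthurt", LEXICON))    # True: flagged, although it may be a neologism
```

Note that the last call flags a string merely because it is unknown to the dictionary, which is exactly why a flagged string is not necessarily an incorrect word.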


spell aid – neologism (definition):

A neologism is a new term, word or phrase that may or may not be in the process of entering common use, but has not yet been accepted into mainstream language, i.e. it has NOT entered written dictionaries (yet).

For a long time neologisms were mainly seen as pathological or deviating: Webster’s Third New International Dictionary (1966) describes a neologism as „a meaningless word coined by a psychotic“.

http://www.neologisms.us/

a-er

aagram

aagram string

aangram

Aazymurgy

abasure

abberateur

abbrantcooty

abbrhyme

abched

abilliant

abomasum

abrabro

abrickity

abthurt


spell aid - neologisms http://www.wortwarte.de/

New words of 24.10.2015

Today we serve you 13 new words:

•Abgas-Testverfahren, das

•Ad-Blue-Dosierung, die

•Bierkasten-Curling, das

•Codeshare-Dienst, der

•Codeshare-Verbindung, die

•Employer-Branding-Wettbewerb, der

•Flüssigbatterie, die

•Meteorolügner, der

•Schoolbike, das

•Smarthaus-Markt, der

•Speedabteilung, die

•Zeitgeisttrinker, der

•Zielgesicht, das


spell aid – chat language (acronyms)

AFAIK -- As Far As I Know

AFK -- Away From Keyboard

ASAP -- As Soon As Possible

BAS -- Big A** Smile

BBL -- Be Back Later

BBN -- Bye Bye Now

BBS -- Be Back Soon

BEG -- Big Evil Grin

BF -- Boyfriend

BIBO -- Beer In, Beer Out

BRB -- Be Right Back

BTW -- By The Way

BWL -- Bursting With Laughter

C&G -- Chuckle and Grin

CICO -- Coffee In, Coffee Out

CID -- Crying In Disgrace

CP -- Chat Post (a chat message)

CRBT -- Crying Real Big Tears

CSG -- Chuckle Snicker Grin

CYA -- See You (Seeya)

CYAL8R -- See You Later (Seeyalata)

DLTBBB -- Don't Let The Bed Bugs Bite

EG -- Evil Grin

EMSG -- Email Message

FC -- Fingers Crossed

FTBOMH -- From The Bottom Of My Heart

FYI -- For Your Information

See: http://www.chatdefinitions.com/


spell aid – acronyms & chat language

German (http://abkuerzungen.woxikon.de/): 25803 abbreviations – 50070 meanings

• WM, EU, ADAC (final stress)

• NATO, UNO (2 syllables)

• BaföG, Azubi (mix of syllable & single letter)

• IEEE

• LAN (1 word – ambiguous)

• Laser – Light Amplification (by) Stimulated Emission (of) Radiation

• AIDS – Acquired Immune Deficiency Syndrome

• Ufo – Unbekanntes Flugobjekt (unidentified flying object)

• GAU – Größter Anzunehmender Unfall (maximum credible accident)

• LG, VG, HDGDL,...


spell aid – chat language (symbols)

:-| -- Ambivalent

o:-) -- Angelic

>:-( -- Angry

|-I -- Asleep

(::()::) -- Bandaid

:-{} -- Blowing a Kiss

\-o -- Bored

:-c -- Bummed Out

|C| -- Can of Coke

|P| -- Can of Pepsi

:( ) -- Can't Stop Talking

:*) -- Clowning

:' -- Crying

:'-) -- Crying with Joy

:'-( -- Crying Sadly

:-9 -- Delicious, Yummy

:-> -- Devilish

;-> -- Devilish Wink

:P -- Disgusted (sticking out tongue)

:*) -- Drunk

:-6 -- Exhausted, Wiped Out

:( -- Frown

\~/ -- Full Glass

\_/ -- Glass (drink)

^5 -- High Five

See: http://www.chatdefinitions.com/


spell aid – EMOJIs

… language ???

Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system.


EMOJIs vs. hieroglyphs


back to: spell aid – spell checking


spell aid – spell checking

Spell-checking algorithms are based on the following types of mistakes (statistics!):

• phonetic similarities (ph – f: telephone – telefone)

• deletion of duplicated letters (mouuse – mouse)

• wrong order (from – form; mouse – muose)

• substitution of neighbouring letters on the keyboard (miuse – mouse)

• insertion of missing letters (vowels in between consonants...) (telephne)

• typos tend to occur towards the end of a word (assumption: the first letter is correct)

• segmentation / decomposition into substrings (horses‘hoe – horse‘shoe)
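Several of the mistake types listed above (duplicated letters, wrong order, substituted letters, missing letters) correspond to single edit operations on a string. The following minimal sketch generates all strings one edit away from an unknown word and keeps those found in a dictionary. The function names and the small `LEXICON` are hypothetical, and the approach (edit-distance-1 candidates) is a common textbook technique swapped in for illustration; it does not cover phonetic errors such as telefone, which needs two edits.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by one deletion, transposition, substitution or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]                          # mouuse -> mouse
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]  # muose -> mouse
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]   # miuse -> mouse
    inserts = [a + c + b for a, b in splits for c in alphabet]             # telephne -> telephone
    return set(deletes + transposes + replaces + inserts)


def suggest(word, lexicon):
    """Return the word itself if known, otherwise all known words one edit away."""
    if word in lexicon:
        return {word}
    return edits1(word) & lexicon


# Hypothetical mini dictionary, for illustration only.
LEXICON = {"mouse", "form", "from", "telephone", "horseshoe"}

for typo in ["mouuse", "muose", "miuse", "telephne", "horeshoe", "telefone"]:
    print(typo, "->", suggest(typo, LEXICON))
```

A real-word error such as "form" typed instead of "from" is not detected by this lookup at all, because "form" is itself a dictionary entry.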



spell aid – spell checking

• include missing letters

www.dositey.com/language/spelling/Mislet3.htm


BTW - gap-filling & interpolation also works on sentence level.


spell aid – spell checking

How does spell checking work (w.r.t. grammar checking)?

Various degrees of „intelligence“:

System A: no match found in the dictionary -> mark the entry as incorrect.

System B: no match found in the dictionary. Initiate a rudimentary parse (left-to-right search). Try to identify the word class, i.e. limit the possibilities, and continue the sentential analysis, e.g. the ...man (statistics: DET + ADJ + NOUN); n-gram

System C: no match found in the dictionary. Initiate a segmentation of the word to identify the word class, e.g. look for typical endings (-ly = adverb / capital letters = proper noun, ...). This way new word creations can be identified (e.g. any word ending in -ness = noun); n-gram
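The idea behind System C can be sketched as a small rule table. The rules below are only the two endings and the capitalisation cue named in the text; the function name, the rule list and the `sentence_initial` flag are assumptions for this illustration, and a real system would use far more cues.

```python
# Suffix -> word class, taken from the examples in the text (-ness = noun, -ly = adverb).
SUFFIX_RULES = [("ness", "NOUN"), ("ly", "ADVERB")]


def guess_word_class(token, sentence_initial=False):
    """Guess the word class of an unknown token from its shape; None if no rule applies."""
    # A capital letter only hints at a proper noun when the token does not start a sentence.
    if token[:1].isupper() and not sentence_initial:
        return "PROPER_NOUN"
    lower = token.lower()
    for suffix, word_class in SUFFIX_RULES:
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return word_class
    return None


print(guess_word_class("googleness"))   # NOUN
print(guess_word_class("abilliantly"))  # ADVERB
print(guess_word_class("Darmstadt"))    # PROPER_NOUN
print(guess_word_class("abthurt"))      # None
```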


n-grams / language models (statistical language processing)

An n-gram is a substring of n items from a given string.

A complete string of words: $w_1 \dots w_n$ or $w_1^n$

In NLP, the items in question can be phonemes, syllables, letters, words or any substring. This depends on the application.

An n-gram of size 1 is a "unigram"; size 2 is a "bigram"; size 3 is a "trigram"; etc. … size n is an "n-gram".


Example: "he reads a book"

For a sequence of words, the trigrams would be: "# he reads", "he reads a", "reads a book", and "a book #".

For sequences of characters, the trigrams that can be generated from "hello world" are "hel", "ell", "llo", "lo ", "o w", " wo", "wor" etc.

In practice, we often

• collapse whitespace to a single space

• remove punctuation
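The two examples and the two normalisation steps above can be reproduced with a short sketch. The function names are made up for this illustration; "#" is used as the boundary marker as in the word-trigram example.

```python
import re


def normalize(text):
    """Remove punctuation and collapse runs of whitespace to a single space."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def word_ngrams(text, n=3, boundary="#"):
    """Word n-grams, with one boundary marker before and after the sentence."""
    tokens = [boundary] + normalize(text).split() + [boundary]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Character n-grams over the normalised string (spaces count as characters)."""
    s = normalize(text)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


print(word_ngrams("he reads a book"))
# ['# he reads', 'he reads a', 'reads a book', 'a book #']
print(char_ngrams("hello world"))
# ['hel', 'ell', 'llo', 'lo ', 'o w', ' wo', 'wor', 'orl', 'rld']
```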


Example of an n-gram count from the GOOGLE n-gram corpus: (http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-n-gram-are-belong-to-you.html)

File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of sentences: 95,119,665,584

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663


trigrams (with counts):

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247


fourgrams (with counts):

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40


In an n-gram analysis, we compute the probability of the occurrence of x (e.g. a letter or word) AFTER a certain sequence, i.e. the conditional probability of x is always given on the basis of the PREVIOUS words/characters.

Example: for ex_

In English, the probabilities for the next letter are

a = 0.4
b = 0.00001
c = 0,……

(all probabilities sum to 1)
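Such conditional probabilities are estimated from counts: how often does each character follow the context, divided by how often the context is followed by anything. The sketch below does this for the context "ex" on a made-up toy string; the function name and the corpus are assumptions, and the resulting numbers are unrelated to the real English values quoted above.

```python
from collections import Counter


def next_char_distribution(corpus, context):
    """Relative frequency of each character that directly follows `context` in `corpus`."""
    counts = Counter()
    start = corpus.find(context)
    while start != -1:
        nxt = start + len(context)
        if nxt < len(corpus):
            counts[corpus[nxt]] += 1
        start = corpus.find(context, start + 1)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()} if total else {}


# Toy corpus, invented for illustration only.
corpus = "example exact expert exam extra"
dist = next_char_distribution(corpus, "ex")
print(dist)                # 'a' follows "ex" in 3 of 5 cases, 'p' and 't' once each
print(sum(dist.values()))  # the probabilities sum to 1 (up to rounding)
```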


The theory behind it:

A statistical language model assigns a probability $P(w_1, \dots, w_n)$ to a sequence of n words by means of a probability distribution.

Each word (or character) is assumed to depend only on the preceding n-1 words (or characters). More concisely, an n-gram model predicts $x_i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$. In probability terms, this is

$$P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$$

This is also called an (n-1)-order Markov model.

In speech recognition, sequences of phonemes are often modeled using an n-gram distribution.


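In an n-gram model, the probability $P(w_1, \dots, w_m)$ of observing the sentence $w_1, \dots, w_m$ can be approximated:

$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$

It is assumed that the probability of observing the i-th word $w_i$ in the context history of the preceding i-1 words can be approximated by the probability of observing it in the shortened context history of the preceding n-1 words.

In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as (with # marking the sentence boundary):

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \#)\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{saw})\, P(\text{red} \mid \text{the})\, P(\text{house} \mid \text{red})\, P(\# \mid \text{house})$$

Whereas in a trigram (n=3) language model, the approximation is:

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \#, \#)\, P(\text{saw} \mid \#, \text{I})\, P(\text{the} \mid \text{I, saw})\, P(\text{red} \mid \text{saw, the})\, P(\text{house} \mid \text{the, red})\, P(\# \mid \text{red, house})$$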
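A bigram approximation of this kind can be computed from raw counts, with each factor estimated as count(previous word, word) / count(previous word). The following is a minimal sketch under stated assumptions: the three training sentences are invented, the function names are hypothetical, "#" is used as the boundary marker, and no smoothing is applied, so any unseen bigram makes the whole sentence probability zero.

```python
from collections import Counter


def train_bigram(sentences, boundary="#"):
    """Count context words and bigrams over boundary-padded, lower-cased sentences."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = [boundary] + s.lower().split() + [boundary]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams, boundary="#"):
    """Product of the maximum-likelihood estimates P(w_i | w_{i-1}); no smoothing."""
    tokens = [boundary] + sentence.lower().split() + [boundary]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p


# Invented toy corpus, for illustration only.
corpus = ["I saw the red house", "I saw the dog", "the red house is old"]
contexts, bigrams = train_bigram(corpus)

print(sentence_probability("I saw the red house", contexts, bigrams))  # 2/3 * 1 * 1 * 2/3 * 1 * 1/2 = 2/9
print(sentence_probability("I saw the cat", contexts, bigrams))        # 0.0, "the cat" was never seen
```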


source: http://de.wikipedia.org/wiki/Buchstabenh%C3%A4ufigkeit

single characters (German) (statistical language processing)

= analysis/distribution of letters


single characters (across languages) (statistical language processing)

letter German English French Spanish Italian Swedish

a 6,51 % 8,167 % 7,636 % 12,53 % 11,74 % 9,3 %

b 1,89 % 1,492 % 0,901 % 1,42 % 0,92 % 1,3 %

c 3,06 % 2,782 % 3,260 % 4,68 % 4,5 % 1,3 %

d 5,08 % 4,253 % 3,669 % 5,86 % 3,73 % 4,5 %

e 17,40 % 12,702 % 14,715 % 13,68 % 11,79 % 9,9 %

f 1,66 % 2,228 % 1,066 % 0,69 % 0,95 % 2,0 %

g 3,01 % 2,015 % 0,866 % 1,01 % 1,64 % 3,3 %

h 4,76 % 6,094 % 0,737 % 0,70 % 1,54 % 2,1 %

i 7,55 % 6,966 % 7,529 % 6,25 % 11,28 % 5,1 %

j 0,27 % 0,153 % 0,545 % 0,44 % 0,00 % 0,7%

k 1,21 % 0,772 % 0,049 % 0,00 % 0,00 % 3,2 %

l 3,44 % 4,025 % 5,456 % 4,97 % 6,51 % 5,2 %

m 2,53 % 2,406 % 2,968 % 3,15 % 2,51 % 3,5 %

n 9,78 % 6,749 % 7,095 % 6,71 % 6,88 % 8,8 %

o 2,51 % 7,507 % 5,378 % 8,68 % 9,83 % 4,1 %

p 0,79 % 1,929 % 3,021 % 2,51 % 3,05 % 1,7 %

q 0,02 % 0,095 % 1,362 % 0,88 % 0,51 % 0,007 %

r 7,00 % 5,987 % 6,553 % 6,87 % 6,37 % 8,3 %

s 7,27 % 6,327 % 7,948 % 7,98 % 4,98 % 6,3 %

ß 0,31 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

t 6,15 % 9,056 % 7,244 % 4,63 % 5,62 % 8,7 %

u 4,35 % 2,758 % 6,311 % 3,93 % 3,01 % 1,8 %

v 0,67 % 0,978 % 1,628 % 0,90 % 2,10 % 2,4 %

w 1,89 % 2,360 % 0,114 % 0,02 % 0,00 % 0,03 %

x 0,03 % 0,150 % 0,387 % 0,22 % 0,00 % 0,1 %

y 0,04 % 1,974 % 0,308 % 0,90 % 0,00 % 0,6 %

z 1,13 % 0,074 % 0,136 % 0,52 % 0,49 % 0,02 %

source: https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages


single characters (across languages) (statistical language processing)

letter German English French Spanish Italian Swedish

à 0,00 % 0,00 % 0,486 % 0,00 % see a 0,00 %

ç 0,00 % 0,00 % 0,085 % 0,00 % 0,00 % 0,00 %

è 0,00 % 0,00 % 0,271 % 0,00 % see e 0,00 %

é 0,01 % 0,00 % 1,904 % 0,00 % see e 0,00 %

ê 0,00 % 0,00 % 0,225 % 0,00 % 0,00 % 0,00 %

ë 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ì 0,00 % 0,00 % 0,00 % 0,00 % see i 0,00 %

î 0,00 % 0,00 % 0,045 % 0,00 % 0,00 % 0,00 %

ï 0,00 % 0,01 % 0,005 % 0,00 % 0,00 % 0,00 %

ò 0,00 % 0,00 % 0,00 % 0,00 % see o 0,00 %

ó 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ù 0,00 % 0,00 % 0,058 % 0,00 % see u 0,00 %

ą 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ć 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ĉ 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ę 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ĝ 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ĥ 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ĵ 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ł 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

ń 0,00 % 0,00 % 0,00 % 0,00 % 0,00 % 0,00 %

œ 0,00 % 0,00 % 0,018 % 0,00 % 0,00 % 0,00 %

source: https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages
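Tables like the one above are produced by counting letters in a large corpus and dividing by the total number of letters. A minimal sketch, assuming a hypothetical helper name and a one-sentence sample instead of a real corpus:

```python
from collections import Counter


def letter_frequencies(text):
    """Relative frequency (in percent) of each letter, most frequent first."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: 100 * n / total for ch, n in counts.most_common()}


# Sample sentence chosen for illustration; real tables are based on large corpora.
sample = "Der springende Punkt"
for letter, percent in letter_frequencies(sample).items():
    print(f"{letter} {percent:.2f} %")
```

Because `str.isalpha` also accepts letters such as ß, à or é, the same function works unchanged for the accented rows of the table.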


source: http://de.wikipedia.org/wiki/Buchstabenh%C3%A4ufigkeit

single characters (German) (statistical language processing)

= bi-gram analysis


regular expressions (Jurafsky, section 2.1)

• In order to figure out whether something is an incorrect word, the machine has to match the string (= a sequence of symbols, i.e. any sequence of alphanumeric characters: letters, numbers, spaces, tabs, punctuation) to an entry in the dictionary

• other matches: e.g. information retrieval in web search engines (Google, AltaVista,…)

• the standard notation for characterizing text sequences = regular expressions

• regular expressions are written in (regular expression) languages and tools: e.g. Perl, grep (Global Regular Expression Print)

• formally, regular expressions are algebraic notations for characterizing a set of strings

• regular expression search requires a pattern that we want to search for (and a corpus of text to search through) (text mining!)


Example: Search for the pattern “linguistics”.

• You also want to find documents with “Linguistics” and “LINGUISTICS”. (Remember: the computer does EXACTLY what you tell it to do…)

• The regular expression /linguistics/ matches any string in any document containing exactly the substring “linguistics”

• Regular expressions are case sensitive, so /linguistics/ alone finds neither “Linguistics” nor “LINGUISTICS”

• samples (Jurafsky, p. 23)

regular expression    example pattern matched

/woodchucks/    “interesting links to woodchucks and lemurs”

/a/    “Mary Ann stopped by Mona’s”

/Claire says,/    ““Dagmar, my gift please,” Claire says,”

/song/    “all our pretty songs”

/!/    ““You’ve left the burglar behind again!” said Nori”
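The case sensitivity of a literal pattern can be checked with Python's standard `re` module (Python writes patterns as strings, without the surrounding slashes). The helper name `matching` and the three test strings are assumptions for this illustration.

```python
import re

docs = ["linguistics", "Linguistics", "LINGUISTICS"]


def matching(pattern, docs, flags=0):
    """Return the documents in which `pattern` is found somewhere."""
    return [d for d in docs if re.search(pattern, d, flags)]


print(matching(r"linguistics", docs))                 # ['linguistics']
print(matching(r"linguistics", docs, re.IGNORECASE))  # all three spellings
```

Ignoring case with a flag is one option; the character classes introduced next are the other.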


linguistics - Linguistics - LINGUISTICS

To search for alternative characters “l” and/or “L” we use square brackets: [lL] (no spaces inside the brackets: a space would itself become one of the alternatives)

Regular expression    match    sample pattern

/[lL]inguistics/    Linguistics or linguistics    “computational linguistics is fun”

/[1234567890]/    any digit    “this is Linguistics 5981”


To search for a character in a range we use the dash inside square brackets: -

Regular expression    match    sample pattern

/[A-Z]/    any uppercase letter    “this is Linguistics 5981”

/[0-9]/    any single digit    “this is Linguistics 5981”

/[1234567890]/    any single digit    “this is Linguistics 5981”
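The bracket and range notation can be tried out on the sample string from the table. This sketch uses Python's `re.findall`, which returns every non-overlapping match.

```python
import re

text = "this is Linguistics 5981"

print(re.findall(r"[lL]inguistics", text))  # ['Linguistics']
print(re.findall(r"[A-Z]", text))           # ['L']
print(re.findall(r"[0-9]", text))           # ['5', '9', '8', '1']
# Listing the digits explicitly is equivalent to the range [0-9].
print(re.findall(r"[1234567890]", text))    # ['5', '9', '8', '1']
```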


To search for negation, i.e. a character that we do NOT want to find, we use the caret as the first character inside square brackets: [^ ]

Regular expression    match    sample pattern

/[^A-Z]/    not an uppercase letter    “this is Linguistics 5981”

/[^Ll]/    neither L nor l    “this is Linguistics 5981”

/[^\.]/    not a period    “this is Linguistics 5981”

Special characters:

\*    an asterisk    “L*I*N*G*U*I*S*T*I*C*S”

\.    a period    “Dr.Doolittle”

\?    a question mark    “Is this Linguistics 5981 ?”

\n    a newline

\t    a tab
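Negated classes and escaped special characters behave as follows in Python's `re` module. The short test strings are chosen for this illustration; the last two lines contrast the escaped period with the unescaped one.

```python
import re

# Negated character classes: match any single character NOT listed.
print(re.search(r"[^A-Z]", "Linguistics 5981").group())  # 'i', the first non-uppercase character
print(re.findall(r"[^Ll]", "Lull"))                       # ['u']
print(re.findall(r"[^.]", "Dr."))                         # ['D', 'r']

# Escaped special characters are matched literally.
print(re.findall(r"\*", "L*I*N"))                         # ['*', '*']
print(re.findall(r"\?", "Is this Linguistics 5981 ?"))    # ['?']
print(re.findall(r"\.", "Dr.Doolittle"))                  # ['.']
print(re.findall(r".", "Dr."))                            # ['D', 'r', '.'], unescaped: any character
```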


To search for an optional character, we use the question mark: ?

Regular expression    Match              Sample pattern
/colou?r/             colour or color    “beautiful colour”

To search for any number of occurrences of a certain character, we use the Kleene star: *

Regular expression    Match
/a*/                  any string of zero or more “a”s
/aa*/                 one or more “a”s (one “a” followed by zero or more “a”s)

Regular expression    Match
/[ab]*/               zero or more “a”s or “b”s
/[0-9][0-9]*/         any integer (= a non-empty string of digits)

To look for at least one character of a type, we use the Kleene “+”:

Regular expression    Match
/[0-9]+/              a sequence of one or more digits
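The three repetition operators can be compared side by side. This is a small sketch in Python (an assumption; the helper name `matches` is hypothetical) that uses re.fullmatch so that the whole string, not just a part of it, has to fit the pattern.

```python
import re

def matches(pattern, string):
    """Return True if the whole string fits the pattern."""
    return re.fullmatch(pattern, string) is not None

# "?": the preceding character is optional.
spellings = re.findall(r"colou?r", "color and colour")

# "*": zero or more repetitions, so even the empty string fits /a*/.
empty_fits_star = matches(r"a*", "")

# /aa*/ needs at least one "a"; it behaves like /a+/.
empty_fits_aa_star = matches(r"aa*", "")
same_as_plus = all(
    matches(r"aa*", s) == matches(r"a+", s) for s in ["", "a", "aaa", "ab"]
)

# Character classes can be repeated in the same way.
ab_string = matches(r"[ab]*", "abba")
integer_star = matches(r"[0-9][0-9]*", "5981")
integer_plus = matches(r"[0-9]+", "5981")

print(spellings, empty_fits_star, empty_fits_aa_star, same_as_plus)
```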

The “.” is a very special character: the so-called wildcard, which matches any single character.

Regular expression    Match                             Sample patterns
/b.ll/                any character between b and ll    ball, bell, bull, bill

Will the search find “Bill”?
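One way to explore this question is to run the pattern on a sample string. The sketch below assumes Python's re module, where matching is case-sensitive by default and “.” matches any character except a newline; the sample string is made up for illustration.

```python
import re

sample = "ball bell bull bill Bill b9ll"

# "." stands for exactly one character of any kind, including digits.
case_sensitive = re.findall(r"b.ll", sample)

# Matching is case-sensitive by default, so "Bill" is skipped above.
# Two possible ways to include it:
with_class = re.findall(r"[bB].ll", sample)
ignoring_case = re.findall(r"b.ll", sample, flags=re.IGNORECASE)

print(case_sensitive)
print(with_class)
print(ignoring_case)
```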

Anchors (start of line: “^”, end of line: “$”)

Regular expression    Match                                       Sample pattern
/^Linguistics/        “Linguistics” at the beginning of a line    “Linguistics is fun.”
/linguistics\.$/      “linguistics.” at the end of a line         “We like linguistics.”

Anchors (word boundary: “\b”, non-boundary: “\B”)

Regular expression    Match                            Sample pattern
/\bthe\b/             “the” alone                      “This is the place.”
/\Bthe\B/             “the” included inside a word     “This is my mother.”
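The anchor examples can be checked with a short Python sketch (an assumption; the helper `found` is hypothetical). Raw strings (r"...") are used because in an ordinary Python string "\b" would be read as a backspace character.

```python
import re

def found(pattern, string):
    """Return True if the pattern occurs somewhere in the string."""
    return re.search(pattern, string) is not None

# Line anchors
starts = found(r"^Linguistics", "Linguistics is fun.")
not_at_start = found(r"^Linguistics", "We like Linguistics.")
ends = found(r"linguistics\.$", "We like linguistics.")

# Word-boundary anchors
the_alone = found(r"\bthe\b", "This is the place.")
the_alone_in_mother = found(r"\bthe\b", "This is my mother.")
the_inside = found(r"\Bthe\B", "This is my mother.")
the_inside_in_place = found(r"\Bthe\B", "This is the place.")

print(starts, not_at_start, ends)
print(the_alone, the_alone_in_mother, the_inside, the_inside_in_place)
```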

More on alternative characters: the pipe symbol “|” (disjunction)

Regular expression    Match                   Sample pattern
/colou?r/             colour or color         “beautiful colour”
/progra(m|mme)/       program or programme    “linguistics program”
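A detail worth trying out is the order of the alternatives. In backtracking engines such as Perl's or Python's (Python is assumed here), the alternatives are tried from left to right, so a search can stop at “program” even inside “programme”. A minimal sketch:

```python
import re

pattern = r"progra(m|mme)"

# Searching: the first alternative "m" already succeeds inside "programme".
both_forms = [m.group() for m in re.finditer(pattern, "program, programme")]

# Putting the longer alternative first returns the longer form.
longest_first = [
    m.group() for m in re.finditer(r"progra(mme|m)", "program, programme")
]

# When the whole string must fit, the engine backtracks to the second alternative.
whole_word = re.fullmatch(pattern, "programme") is not None

print(both_forms, longest_first, whole_word)
```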

What does the following expression match?

/student [0-9]+ */

Will it match “student 1 student 2 student 3”?
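One way to investigate the question is to look at what the pattern actually consumes. In the sketch below (Python assumed), the star applies only to the space directly before it, so the pattern finds three separate pieces; covering the whole string in one match needs parentheses around the repeated unit.

```python
import re

text = "student 1 student 2 student 3"

# The "*" only repeats the preceding space, not the whole expression.
pieces = re.findall(r"student [0-9]+ *", text)

# As a single match, the pattern therefore does not cover the whole string.
whole = re.fullmatch(r"student [0-9]+ *", text)

# Grouping the unit and repeating the group does cover it.
grouped = re.fullmatch(r"(student [0-9]+ *)+", text)

print(pieces)
print(whole is None, grouped is not None)
```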

Perl regular expressions are also used for string substitution (as in ELIZA):

s/man/men/    man -> men

Perl regular expressions can also repeat a matched string via memory (the number operator):

s/(linguistics)/wonderful \1/    linguistics -> wonderful linguistics

ELIZA

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1 ?/
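The Perl substitutions above have a close counterpart in Python's re.sub, which is used here purely as an illustration (the slides use Perl's s/.../.../ operator). The function name `eliza_reply` and the sample inputs are hypothetical.

```python
import re

# s/man/men/
plural = re.sub(r"man", "men", "man")

# s/(linguistics)/wonderful \1/ : \1 re-inserts what the first group matched.
praise = re.sub(r"(linguistics)", r"wonderful \1", "we like linguistics")

def eliza_reply(user_input):
    """Apply the first ELIZA rule from the slide; return None if it does not fire."""
    pattern = r".* YOU ARE (depressed|sad) .*"
    if re.search(pattern, user_input) is None:
        return None
    return re.sub(pattern, r"I AM SORRY TO HEAR YOU ARE \1", user_input)

reply = eliza_reply("I THINK YOU ARE sad TODAY")

print(plural)
print(praise)
print(reply)
```

The pattern only fires when there is a space before “YOU ARE” and after the adjective, exactly as written in the rule.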

The regular expression is more than just a convenient metalanguage for text searching.

• First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe and look at in this lecture. Any regular expression can be implemented as a finite-state automaton*. Symmetrically, any finite-state automaton can be described with a regular expression.

• Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the following figure:

Finite State Automata (FSA)

[Figure: diagram connecting three nodes: regular expressions, finite automata, and regular languages]

The relationship between finite-state automata, regular expressions, and regular languages*

* as suggested by Martin Kay in: Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark, pp. 2-10. ACL.

Definition: A finite-state machine (FSM) or finite-state automaton (plural: automata) (FSA), or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition… In linguistics, they are used to describe simple parts of the grammars of natural languages.

Finite-State Language Processing by Emmanuel Roche (ed.), Yves Schabes (ed.)

Examples:

• Introduction to finite-state automata for regular expressions

• Mapping from regular expressions to automata

Using an FSA to recognize sheeptalk

After a while, with the parrot’s help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said.

Hugh Lofting, The Story of Doctor Dolittle

Sheep language can be defined as any string from the following (infinite) set:

baa!

baaa!

baaaa!

baaaaa!

baaaaaa!

....

The regular expression for this kind of sheeptalk is

/baa+!/
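As a quick check of this pattern, the following sketch (Python assumed; the function name `is_sheeptalk` is hypothetical) tests whether a complete utterance belongs to the sheep language.

```python
import re

SHEEPTALK = re.compile(r"baa+!")

def is_sheeptalk(utterance):
    """True if the whole utterance is b, then at least two a's, then !."""
    return SHEEPTALK.fullmatch(utterance) is not None

for candidate in ["baa!", "baaaaaa!", "ba!", "baa", "abaa!"]:
    print(candidate, is_sheeptalk(candidate))
```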

All regular expressions can be represented as finite-state automata (FSA):

A finite-state automaton (FSA) for the regular expression /baa+!/

q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4
                              q3 --a--> q3 (loop)

start state: q0
final state / accepting state: q4

function D-RECOGNIZE(tape, machine) returns accept or reject
    index <- Beginning of tape
    current-state <- Initial state of machine
    loop
        if End of input has been reached then
            if current-state is an accept state then
                return accept
            else
                return reject
        elseif transition-table[current-state, tape[index]] is empty then
            return reject
        else
            current-state <- transition-table[current-state, tape[index]]
            index <- index + 1
    end

An algorithm for deterministic recognition of FSAs

conditions trigger a transition of the states
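The D-RECOGNIZE pseudocode can be sketched in Python as follows. This is an illustrative translation, not the lecture's own implementation: the dictionary encoding of the transition table, the integer state names, and the function name `d_recognize` are assumptions. A missing dictionary key plays the role of an empty table cell.

```python
# Transition table for the sheeptalk automaton /baa+!/.
# Keys are (state, input symbol); values are the next state.
SHEEPTALK_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any further "a" stays in state 3
    (3, "!"): 4,
}
ACCEPT_STATES = {4}

def d_recognize(tape, table, accept_states, start=0):
    """Deterministically recognize `tape`; return True (accept) or False (reject)."""
    state = start
    for symbol in tape:
        if (state, symbol) not in table:
            return False          # empty table cell: reject immediately
        state = table[(state, symbol)]
    # End of input: accept only if we stopped in an accept state.
    return state in accept_states

for tape in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(tape, d_recognize(tape, SHEEPTALK_TABLE, ACCEPT_STATES))
```

The for loop takes over the role of the index variable in the pseudocode: each iteration reads one symbol from the tape and either follows a transition or rejects.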

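Regular expressions can be represented as FSAs:

[Figure: the FSA for /baa+!/ (states q0 to q4) extended with a fail state qf. The transitions on “b”, “a”, and “!” that are not defined in the original automaton lead to qf. The figure also carries the labels “c” and “?”.]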

witch, witches, wizard, wizards

FSA … on word level

FSA … on phrase level

FSA … on sentence level
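For the word level, the four example words can be accepted by a letter-by-letter automaton. The slide's own figure is not preserved in this transcript, so the following is only one plausible sketch (in Python): it builds a trie-shaped deterministic automaton in which shared prefixes such as “wi” share states. The function names are hypothetical.

```python
def build_word_fsa(words):
    """Build a letter-by-letter automaton (a trie) accepting exactly `words`."""
    transitions = {}      # (state, letter) -> next state
    accepting = set()
    next_state = 1        # state 0 is the start state
    for word in words:
        state = 0
        for letter in word:
            if (state, letter) not in transitions:
                transitions[(state, letter)] = next_state
                next_state += 1
            state = transitions[(state, letter)]
        accepting.add(state)   # the state reached at the end of a word accepts
    return transitions, accepting

def accepts(word, transitions, accepting):
    """Follow one transition per letter; reject on a missing transition."""
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False
    return state in accepting

TRANSITIONS, ACCEPTING = build_word_fsa(["witch", "witches", "wizard", "wizards"])

for candidate in ["witch", "witches", "wizards", "wit", "witchs"]:
    print(candidate, accepts(candidate, TRANSITIONS, ACCEPTING))
```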