Top Banner
LIMSI-CNRS | 1/72 Fondements du TAL Delphine Bernhard Morphology Contains slides adapted from Pierre Zweigenbaum
72

Morphology - Limsi

Feb 09, 2022

Download

Documents

dariahiddleston
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Morphology - Limsi

LIMSI-CNRS | 1/72

Fondements du TAL

Delphine Bernhard

Morphology

Contains slides adapted from Pierre Zweigenbaum

Page 2: Morphology - Limsi

LIMSI-CNRS | 2/72

Outline

1. Word segmentation

2. Linguistic morphologya) Morphemesb) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

Page 3: Morphology - Limsi

LIMSI-CNRS | 3/72

Levels of linguistic structure

(c) David Groome, 2006

Our focus today

Page 4: Morphology - Limsi

LIMSI-CNRS | 4/72

Outline

1. Word segmentation

2. Linguistic morphologya) Morphemesb) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

Page 5: Morphology - Limsi

LIMSI-CNRS | 5/72

Can you read this (fast!)?

wikipédiaestunprojetdencyclopédiecollectiveétabliesurinternetuniversellemultilingueetfonctionnantsurleprincipeduwikiwikipédiaapourobjectifdoffriruncontenulibrementréutilisableneutreetvérifiablequechacunpeutéditeretaméliorer

Wikipédia est un projet d’encyclopédie collective établie sur Internet, universelle, multilingue et fonctionnant sur le principe du wiki. Wikipédia a pour objectif d’offrir un contenu librement réutilisable, neutre et vérifiable, que chacun peut éditer et améliorer.

Page 6: Morphology - Limsi

LIMSI-CNRS | 6/72

It's only words ...

... but what are they exactly and how can we automatically recognise them?

In speech, there are no obvious breaks So how do babies learn words? According to (Saffran et al.,

1996) they use distributional cues and statistical regularities in speech

Page 7: Morphology - Limsi

LIMSI-CNRS | 7/72

How do we recognise words in speech? (Bauer, 1988)

There are no gaps between words in speech: Menbecomeoldbuttheyneverbecomegood

Thanks to our knowledge of language, we recognise certain strings of sounds/letters: e.g. we can recognise men in the previous sequence because

it also comes up in sequences like:MenareconservativeafterdinnerMenlosetheirtempersindefendingtheirtaste.Afterfortymenhavemarriedtheirhabits.

Page 8: Morphology - Limsi

LIMSI-CNRS | 8/72

Learning to read is difficult for humans

Reading disabilities: Dyslexia: inability to decode, or break down, words into

phonemes Comprehension difficulties

The invention of writing and reading is recent

Contrarily to speech or vision, it is anunnatural process that has to be learned: brains are not wired to read!

Page 9: Morphology - Limsi

LIMSI-CNRS | 9/72

For computers: characters and strings

Control characters: End of line: \n Tabulation: \t

Encodings: ASCII: English alphabet Latin 1, ISO-8859-1: Western European Languages ISO-8859-15: Similar to ISO-8859-1, but replaces some less

common symbols with €, Œ or œ Windows-1252, Cp1252: superset of ISO 8859-1 (includes €,

Œ and œ) UTF-8: can represent every character in the Unicode

character set, backward-compatible with ASCII

Page 10: Morphology - Limsi

LIMSI-CNRS | 10/72

Practical definition of wordsand sentences

Bauer (1988): A word is a unit which, in print, is bounded by spaces on both

sides. We will call this an orthographic word.

Kučera and Francis (1967): A graphic word is a string of contiguous alphanumeric

characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks

Grefenstette and Tapanainen (1994): Sentences end with punctuation.

Page 11: Morphology - Limsi

LIMSI-CNRS | 11/72

What are the "words" and sentences here?

Pacific Lumber Co. was trying to figure out the safest way to bring the activists down.

He doesn't need us.

For additional information see also http://www.limsi.fr

New York is situated on the east coast of the United States.

c’est-à-dire les pommes de terre des U.S.A.

Page 12: Morphology - Limsi

LIMSI-CNRS | 12/72

Tokenisation

Tokenisation: process which divides the input text into word tokens: punctuation marks, word-like units, numbers, etc.

A system which splits texts into word tokens is called a tokeniser

A very simple example: Input text:

John likes Mary and Mary likes John. Tokens:

{"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}

Page 13: Morphology - Limsi

LIMSI-CNRS | 13/72

Problems of tokenisation

Numeric expressions:The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.

How many words are there in 4.53 +/- 0.15%? 1 3 9 (“four point five three, plus or minus fifteen percent”) not a word

The answer depends on the application at hand

Page 14: Morphology - Limsi

LIMSI-CNRS | 14/72

Problems of tokenisation

Boundaries

For "simple" words:SpacesPunctuation

Multiword expressions: several units, one wordpomme de terreparce que

Contracted forms: one unit, several wordsaux (à les), des (de les)

Page 15: Morphology - Limsi

LIMSI-CNRS | 15/72

Outline

1. Word segmentation

2. Linguistic morphologya) What is morphology?b) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

Page 16: Morphology - Limsi

LIMSI-CNRS | 16/72

Morphology

Words can be further decomposed into smaller units:

pneumonoultramicroscopicsilicovolcanoconiosis

lungextreme

microscopicsilicium

volcanodust

disease

lung disease caused by the inhalation of very fine silica dust found in volcanoes

Page 17: Morphology - Limsi

LIMSI-CNRS | 17/72

What is morphology?

Morphology is the branch of linguistics which studies word forms and word formation

Word formation processes Inflection Derivation Composition / Compounding

Page 18: Morphology - Limsi

LIMSI-CNRS | 18/72

Words vs. lexemes vs. lemmas

A lexeme refers to the set of word forms which correspond to the same dictionary entry

small, smaller, smallest → SMALLknife, knives → KNIFE

A lemma is the canonical form of a lexemeSMALL

In the following, capital letters are used to indicate lemmas

Page 19: Morphology - Limsi

LIMSI-CNRS | 19/72

Inflection

Inflection is the process of forming different grammatical forms of a single lexememontrer → montreracheval → chevaux

The grammatical category of the word form remains the same

Page 20: Morphology - Limsi

LIMSI-CNRS | 20/72

Word formation

Word formation is the process of creating new lexemes from existing ones: Derivation: combines bases and affixes Compounding: combines lexemes

Page 21: Morphology - Limsi

LIMSI-CNRS | 21/72

Derivation

Derivation involves the creation of one lexeme from another

re- + create → RECREATEre- is a derivational prefix

recreate + s → recreates-s is an inflectional suffix, it provides another word-form of the lexeme RECREATE!

Derivation might induce a change of the grammatical category

be- + witch → BEWITCH: changes a noun into a verb

Page 22: Morphology - Limsi

LIMSI-CNRS | 22/72

Compounding

A compound involves the creation of one lexeme from two or more other lexemes

popcorn = a kind of corn which popshot dog = a kind of food (opaque compound)

Compounding is particularly frequent in French medical language appendice + ectomie → appendicectomie

Page 23: Morphology - Limsi

LIMSI-CNRS | 23/72

Non concatenative phenomena

Root-and-pattern morphology (e.g. Arabic, Hebrew) the root consists of consonants only (3 by default)

ktb = to write the pattern is a combination of vowels (possibly consonants

too) with slots for the root consonantskaatab = he corresponded

Apophony: vowel changes within a root Ablaut: sing, sang, sung Umlaut: Buch, Bücher

Page 24: Morphology - Limsi

LIMSI-CNRS | 24/72

Outline

1. Word segmentation

2. Linguistic morphologya) Morphemesb) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

Page 25: Morphology - Limsi

LIMSI-CNRS | 25/72

Morphological normalisation

Morphological normalisation consists in identifying a single canonical representative for morphologically related word-forms

Methods: Stemming Lemmatisation

Page 26: Morphology - Limsi

LIMSI-CNRS | 26/72

Stemming

Stemming is an algorithmic approach to strip off the endings of words

Objective: group words belonging to the same morphological family by transforming them into a similar stemmed representation

Stemming does not distinguish between inflection and derivation

The stems obtained do not necessarily correspond to a genuine word form

The best known stemming algorithms have been developed by Lovins (1968) and Porter (1980)

Page 27: Morphology - Limsi

LIMSI-CNRS | 27/72

Algorithmic stemming method

1) Desuffixing: removal of predefined word endingssitting → sitt

2) Recoding: transform the endings of the previously obtained stems using transformation rulessitt → sit

These 2 phases can be performed successively (Lovins) or simultaneously (Porter)

Page 28: Morphology - Limsi

LIMSI-CNRS | 28/72

Porter's stemmer

Based on a limited set of general cascaded transformational rules:

-ational → -ate : relational → relate

Variants exist for many languages: English, French, Spanish, Portuguese, Italian, Romanian, German Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish

Fast Accurate enough for some applications, e.g.

Information Retrieval Available at http://snowball.tartarus.org/

Page 29: Morphology - Limsi

LIMSI-CNRS | 29/72

Steps in Porter stemming (excerpts)

Step 1a SSES → SS

caresses → caress Step 1b (m>0) EED → EE

feed → feed, agreed → agree Step 1c (*v*) Y → I

happy → happi, sky → sky Step 2 (m>0) ATIONAL → ATE

relational → relate

Step 3 (m>0) ICATE → IC

triplicate → triplic Step 4 (m>1) AL →

revival → reviv Step 5a (m>1) E →

probate → probat Step 5b (m > 1 and *d and *L) → single

letter controll → control

Page 30: Morphology - Limsi

LIMSI-CNRS | 30/72

Porter's stemmer

Original Word Stemmed Wordvision visionvisible visiblvisibility visiblvisionary visionarivisioner visionvisual visual

Page 31: Morphology - Limsi

LIMSI-CNRS | 31/72

Comparison of three stemmers

© 2008 Cambridge University Press, Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze

Page 32: Morphology - Limsi

LIMSI-CNRS | 32/72

Stemming errors

Under-stemming:adhere → adheradhesion → adhes

Over-stemming: appendicitis → append append → append

Page 33: Morphology - Limsi

LIMSI-CNRS | 33/72

Ambiguity

I saw the saw

Preterite form of the verb

SEE

Singular formof the noun

SAW≠

Homographs: words which have the same spelling but different meanings

Such cases cannot be properly dealt with with stemming only, the word's grammatical category has to be identified

Page 34: Morphology - Limsi

LIMSI-CNRS | 34/72

Lemmatisation

Lemmatisation consists in mapping word forms to their lemma (base form):

sing, sang, sung → sing

Lemmatisation only handles inflection, not derivation

In order to disambiguate ambiguous cases, lemmatisation is usually combined with part-of-speech tagging

Additional morphological information is usually provided with the lemma (more about this later in the presentation)

Page 35: Morphology - Limsi

LIMSI-CNRS | 35/72

Outline

1. Word segmentation

2. Linguistic morphologya) Morphemesb) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

Page 36: Morphology - Limsi

LIMSI-CNRS | 36/72

Morphological analysis

Aim: split a word into its constituent morphemes : foxes → fox + es get morpho-syntactic information : part-of-speech (POS),

tense, number, person, voice, gender, etc.

Morphological analysis can be perfomed: manually, the analyses are then stored in lexical databases automatically:based on some manually-written rules and lexicons in an unsupervised manner, using no external resources

Page 37: Morphology - Limsi

LIMSI-CNRS | 37/72

Lexical databases: contents

word entries + information surface form, lemma syntactic properties category, POS (Part Of Speech) features: masculine, feminine, etc.

semantic properties semantic relations: synonym, antonym, hypernym semantic type: person, event, object

Page 38: Morphology - Limsi

LIMSI-CNRS | 38/72

CELEX

CELEX is a lexical database which is available for English, Dutch and German

Page 39: Morphology - Limsi

LIMSI-CNRS | 39/72

Morphalou

http://www.cnrtl.fr/lexiques/morphalou/

Page 40: Morphology - Limsi

LIMSI-CNRS | 40/72

Prolex

http://www.cnrtl.fr/lexiques/prolex/

Page 41: Morphology - Limsi

LIMSI-CNRS | 41/72

French Verbs (Dubois & Dubois-Charlier)

Page 42: Morphology - Limsi

LIMSI-CNRS | 42/72

Unsupervised Segmentation

Unsupervised morphological segmentation consists in automatically breaking down words into their constituent morphemes

Only input dataset: list of words (no language-specific rules or lexicons)

Scientific goals: Learn of the phenomena underlying word construction in natural

languages Discover approaches suitable for a wide range of languages Advance machine learning methodology

See the Morpho Challenge websitehttp://www.cis.hut.fi/morphochallenge2009/

Page 43: Morphology - Limsi

LIMSI-CNRS | 43/72

Segmentation by analogy

(Lepage, 1998)

Page 44: Morphology - Limsi

LIMSI-CNRS | 44/72

Application of the analogy principle

fahre

fahren

schlafe

X?

Page 45: Morphology - Limsi

LIMSI-CNRS | 45/72

Segmentation by compression

Minimum Description Length and bayesian inference(Goldsmith, 2001; Creutz & Lagus, 2005)

Page 46: Morphology - Limsi

LIMSI-CNRS | 46/72

Harris (1955):Segmentation by successor counts

At the end of a morpheme (or word) almost any sound can follow:

design + #, design + ation, design + ing, design + ed, ...

However, within morphemes, the choice is more restricted:

desig + n

Basic algorithm: At each position in an utterance, count the number of

different sounds which can possibly follow Peaks in this count indicate morpheme boundaries

Page 47: Morphology - Limsi

LIMSI-CNRS | 47/72

Segmentation of He's quicker

Utterance = He's quicker (hiyzkwikәr) Successors of h: His ship's in? Humans act like simians. ...

Successors of hi: Hip-high in water. Hidden meanings were discovered. ...

Page 48: Morphology - Limsi

LIMSI-CNRS | 48/72

Successor counts

h - i - y - z - k - w - i - k - ә - r -0

5

10

15

20

25

30

35

Sounds

Su

cce

ssor

cou

nts

Page 49: Morphology - Limsi

LIMSI-CNRS | 49/72

Morphological parsing

Aim: break down a word into component morphemes and build a structured representation of the analysis

Example:cats → cat +N +PL

Our focus: finite-state morphological parsing

lemma features

Page 50: Morphology - Limsi

LIMSI-CNRS | 50/72

Finite state automata

A finite state automaton (FSA) recognises a set of strings

An FSA is represented as directed graph: vertices (nodes) represent states directed links between nodes represent transitions

Page 51: Morphology - Limsi

LIMSI-CNRS | 51/72

Sheep Talk FSA

The language of sheep includes to following utterances: baa!, baaa!, baaaa!, baaaaa!, etc.

Regular expression for this language: baa+! FSA that can accept this language:

q0

q2

q3

q4b a a !

a

q1

Page 52: Morphology - Limsi

LIMSI-CNRS | 52/72

Formal definition of an FSA

Q = q0 q

1 q

2 ... q

N-1a finite set of N states

Σ a finite input alphabet of symbols

q0

the start state

F the set of final states δ(q,i) the transition function

For the sheep talk automaton: Q = {q0, q

1, q

2, q

3, q

4},

Σ = {a, b, !}, F = {q4}

Page 53: Morphology - Limsi

LIMSI-CNRS | 53/72

Deterministic vs. non-deterministic FSA

Deterministic FSA for sheep talk

Non-deterministic FSA for sheep talk

q0

q2

q3

q4b a a !

a

q1

q0

q2 q3

q4b a a !

a

q1

Page 54: Morphology - Limsi

LIMSI-CNRS | 54/72

Morphological parsers

Components: lexicon: list of lemmas and affixes morphotactics: word grammar which accounts for

morpheme ordering orthographic rules: model the changes that occur when two

morphemes combinecity + s → cities

Morphological parsers can be implemented as finite-state transducers

Page 55: Morphology - Limsi

LIMSI-CNRS | 55/72

Finite State Transducers

State 0: start state State 1: cat has been recognised as a +N (possible end state) State 2: catch has been recognised as a +V (possible end state) State 3: cats has been recognised as +N +PL (possible end

state)

catch:V

cat:N

s:PL

0

1

2

3

Finite-state transducers map between one representation and another

Page 56: Morphology - Limsi

LIMSI-CNRS | 56/72

Two-level morphology

(Koskenniemi, 1984)

Surface level: words as they are pronounced or written

Lexical level: concatenation of morphemesLexical level: c a t +N +PLSurface level: c a t s

The mapping between the surface and the lexical level is constrained by rules

Page 57: Morphology - Limsi

LIMSI-CNRS | 57/72

Two-level rules

Example rule (Trost, 2003):

+:e ⇐ { s x z [ {s c} h ] } : _s

left context right contextsurface

level

lexicallevel

Application of the rule:# d i s h + s #| | | | | 1 | |0 d i s h e s 0

Spelling rule:

e-insertion

Page 58: Morphology - Limsi

LIMSI-CNRS | 58/72

PC-KIMMO

Demo: http://languagelink.let.uu.nl/~lion

Page 59: Morphology - Limsi

LIMSI-CNRS | 59/72

PC-Kimmo: POS Ambiguity

1:

Word:

[ cat: Word

head: [ pos: V

vform: BASE ]

root: `walk

root_pos:V

clitic:-

drvstem:- ]

2:

Word:

[ cat: Word

head: [ agr: [ 3sg: + ]

number:SG

pos: N

proper:-

verbal:- ]

root: `walk

root_pos:N

clitic:-

drvstem:- ]

Page 60: Morphology - Limsi

LIMSI-CNRS | 60/72

Inflectional Analysis for French: Flemm

Developed by F. Namer http://www.cnrtl.fr/outils/flemm/

Input : word + POS (as provided by the TreeTagger or the Brill tagger) renouent VER:pres renouer

Output: lemma + morpho-syntactic features renouent VER(pres):Vmip3p--1 renouer

|| renouent VER(pres):Vmsp3p--1 renouer Verbe au présent de l'indicatif ou du subjonctif à la troisième personne du pluriel, 1er

groupe

Page 61: Morphology - Limsi

LIMSI-CNRS | 61/72

Inflectional Analysis for French: Flemm

Analyse linguistique : le cas de -èrent en général, -èrent marque les verbes du 3ème groupe au

passé simple : céd-èrent quelquefois, la terminaison est plus courte et -èrent

marque le présent : légifèr-ent très rarement, terminaison ambiguë : lac-èrent et lacèr-

ent Règles et exceptions : le cas de -èrent les partitions ambiguës sont lexicalisées car rares la règle étant le désuffixage sur le suffixe le plus long, les

verbes correspondant au suffixe -ent tels que légifèr- sont lexicalisés

autres cas (e.g. céd-) : désuffixage régulier sur -èrent.

Page 62: Morphology - Limsi

LIMSI-CNRS | 62/72

Derivational Analysis: DériF

Developed by F. Namerhttp://www.cnrtl.fr/outils/DeriF/

Input: form/POS sympathique/ADJ

Output: analysis [ [ sympathie NOM] ique ADJ] (sympathique/ADJ,

sympathie/NOM) " En rapport avec le(s) sympathie"

Page 63: Morphology - Limsi

LIMSI-CNRS | 63/72

Derivational Analysis: DériF

Word formation rules déXiser V [dé [X N] +iser V] Xable A [[X (er) V] +able A] inX A [in [X A] A]

Sequence of decompositions impensable/ADJ in + pensable/ADJ décomposable/ADJ décomposer/VERBE + able/ADJ

Ambiguous analyses implantable/ADJ implanter/VERBE + able/ADJ

im + plantable/ADJ

Produces a gloss : " ( lequel - Que l') on peut implanter" // " Non plantable"

Page 64: Morphology - Limsi

LIMSI-CNRS | 64/72

Analysis of neoclassical compounds: DériF

acrodynie/N Hierarchical decomposition: [ [ acr N* ] [ odyn N* ] ie NOM ]

Definition (gloss): "douleur (du -- liée au) extrémité "

Semantic type:Type = maladie

Lexical and semantic relations with other lexemes:eql:acr/algie, eql:acr/algo, eql:acr/algés, eql:apex/algie,

eql:apex/algo, eql:apex/algés, eql:apex/odyn see:acr/ite, see:apex/ite

Page 65: Morphology - Limsi

LIMSI-CNRS | 65/72

Outline

1. Word segmentation

2. Linguistic morphologya) Morphemesb) Morphological processes

3. Computational morphologya) Normalisation: stemming, lemmatisationb) Analysis: lexical databases, unsupervised

segmentation, rule-based analysis and parsing

4. Applications

Page 66: Morphology - Limsi

LIMSI-CNRS | 66/72

Information RetrievalStemming

Stemming is frequently used in Information Retrieval: Stemming is applied at indexing time User queries are analysed likewise Stems in the user query are matched against stems in

documents

It reduces the number of terms to index

It improves recall (number of documents which are retrieved)

Page 67: Morphology - Limsi

LIMSI-CNRS | 67/72

Information RetrievalMorphological Query Expansion

Morphological variants of a word can be used to perform query expansion

The original word forms are indexed

Query terms are expanded with their morphological variants at retrieval time (Moreau et al., 2007)

Original query: Ineffectiveness of U.S. embargoes or sanctions

Expanded query: ineffectiveness ineffective effectiveness effective ineffectively embargoes embargo embargoed embargoing sanctioning sanction sanctioned sanctions sanctionable

Page 68: Morphology - Limsi

LIMSI-CNRS | 68/72

Text-To-Speech Systems

Aim: take text, in standard spelling, and synthesise a spoken version of the text

Problems Proper names (places, persons) Out of vocabulary words (words unknown to the system)

Solutions from morphology hothouse = hot + house and not hoth + ouse

Page 69: Morphology - Limsi

LIMSI-CNRS | 69/72

Machine Translation

Aim: translate a text from one language into another language

Problems: A word in one language may correspond to two or more

words in another language Out of vocabulary words

How can morphological analysis help? compounds: Aktionsplan (de) → action plan (en) inflection: va, aller (fr) → go (en)

Page 70: Morphology - Limsi

LIMSI-CNRS | 70/72

Meditate on this...

"Maybe in order to understand mankind, we have to look at the word itself. Mankind. Basically, it's made up of two separate words – 'mank' and 'ind'. What do these words mean? It's a mystery, and that's why so is mankind."

Jack Handey (Deep Thoughts)

Page 71: Morphology - Limsi

LIMSI-CNRS | 71/72

Relevant Literature

Creutz, M. & Lagus, K. (2005), Inducing the Morphological Lexicon of a Natural Language from Unannotated Text, in 'Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05)', pp. 106-113.

Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a Natural Language', Computational Linguistics 27(2), 153-198.

Harris, Z. (1955), 'From phoneme to morpheme', Language 31(2), 190-222.

Koskenniemi, K. (1984), A general computational model for word-form recognition and production, in 'Proceedings of the 22nd annual meeting on Association for Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 178--181.

Lepage, Y. (1998), Solving analogies on words: an algorithm, in 'Proceedings of the 17th international conference on Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 728-734.

Page 72: Morphology - Limsi

LIMSI-CNRS | 72/72

Relevant Literature

Moreau, F.; Claveau, V. & Sébillot, P. (2007), 'Automatic morphological query expansion using analogy-based machine learning.', in Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), Roma, Italy, April 2007.

Trost, H. (2003), The Oxford Handbook of Computational Linguistics, Oxford University Press, chapter Morphology, pp. 25--47.

Saffran, J. R.; Newport, E. L. & Aslin, R. N. (1996), 'Word Segmentation: The Role of Distributional Cues', Journal of Memory and Language 35(4), 606-621.