Language Data Resources (Zdroje jazykových dat): Word senses, Sense-tagged corpora

Dec 18, 2015

Gabriel Terry

Language Data Resources

Word senses
Sense-tagged corpora

• Lev V. Ščerba: And indeed, every sufficiently complex word must actually become the subject of a scientific monograph; therefore it is hard to expect in the near future the completion of a good dictionary.

Word sense disambiguation

• The different meanings of polysemous words are known as “senses”, and the process of deciding which sense is being used in a particular context is known as “word sense disambiguation” (WSD)

Lexical Acquisition Bottleneck

• In NLP, many systems do not perform in practice as well as they could with adequate dictionary resources, because of the cost of producing, adapting, and maintaining these resources

• Solutions
– Reusing existing dictionaries and ontologies as lexicons
– Deriving disambiguation information directly from corpora

Usefulness of WSD

• NLP tools:
– Systems – carry out some task of “interest for its own sake” (e.g. MT, IR); applications potentially interesting for non-linguists
– Components – interesting for linguists and language engineers; e.g. WSD

Early approaches

• Preference semantics – 1970s
– Selectional constraints (e.g. ANIMATE for the subject of “to drink”)

• Word experts – 1980s
– Hand-crafted disambiguators constructed for each word separately
– Limited applicability

• Polaroid words
– Gradual disambiguation (grammar, parser, lexicon, semantic interpreter, knowledge representation language)
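
A selectional constraint can be pictured as a filter over candidate senses. The sketch below is only an illustration: the toy lexicon, the feature names other than ANIMATE, and the function name are all assumptions, not part of any system mentioned on the slide.

```python
# Toy sense inventory (hypothetical): each sense of a noun carries semantic features.
SENSES = {
    "crane": {
        "crane/bird": {"ANIMATE"},
        "crane/machine": {"INANIMATE"},
    },
}

# The verb "drink" prefers an ANIMATE subject, as in the slide's example.
SUBJECT_PREFERENCE = {"drink": "ANIMATE"}


def preferred_subject_senses(verb, noun):
    """Return the senses of `noun` that satisfy the verb's subject constraint."""
    senses = SENSES.get(noun, {})
    required = SUBJECT_PREFERENCE.get(verb)
    if required is None:
        return sorted(senses)  # no constraint known: keep every sense
    return sorted(s for s, features in senses.items() if required in features)


print(preferred_subject_senses("drink", "crane"))  # ['crane/bird']
```

A verb with no recorded preference leaves all senses in place, which reflects why such constraints alone rarely resolve every ambiguity.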

Dictionary Based Approaches

• Since the 1980s, dictionary publishers have produced “Machine Readable Dictionaries” (now called machine-tractable dictionaries)

• Wider polysemy than in the systems described so far
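
One common way to exploit dictionary definitions is to pick the sense whose gloss shares the most words with the context. The slides do not name a specific algorithm; the sketch below is a simplified gloss-overlap heuristic in the spirit of Lesk's method, and the glosses, stopword list, and function names are invented for illustration.

```python
# Hypothetical mini-dictionary: glosses are invented, not taken from a real dictionary.
GLOSSES = {
    "ball": {
        "ball/object": "a round object that is thrown or kicked in a game",
        "ball/dance": "a formal social event at which people dance",
    }
}

STOPWORDS = {"a", "an", "the", "of", "or", "in", "to", "at", "and", "for", "that", "is"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def gloss_overlap_sense(word, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier-listed sense
            best, best_overlap = sense, overlap
    return best


print(gloss_overlap_sense("ball", "she kicked the ball in the game"))  # ball/object
```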

Two claims about sense distribution

• One sense per discourse
– There is a very strong tendency for multiple uses of a word to share the same sense in a well-written discourse

• One sense per collocation
– With high probability, an ambiguous word has only one sense in a given collocation
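
Both claims can be turned into simple heuristics. The following is a minimal sketch under stated assumptions: the sense labels, the collocation table, and the function names are hypothetical, and real systems would estimate such tables from data.

```python
from collections import Counter

# Hypothetical collocation table: (word, neighbouring word) -> sense label.
COLLOCATION_SENSES = {
    ("plant", "manufacturing"): "plant/factory",
    ("plant", "life"): "plant/living",
}


def sense_from_collocation(word, neighbours):
    """One sense per collocation: return the sense of the first known collocate."""
    for neighbour in neighbours:
        sense = COLLOCATION_SENSES.get((word, neighbour))
        if sense is not None:
            return sense
    return None


def fill_by_discourse(tags):
    """One sense per discourse: give undecided occurrences (None) of a word
    the majority sense found elsewhere in the same discourse."""
    decided = [t for t in tags if t is not None]
    if not decided:
        return list(tags)
    majority = Counter(decided).most_common(1)[0][0]
    return [t if t is not None else majority for t in tags]


print(fill_by_discourse(["plant/factory", None, "plant/factory", None]))
```

Here `tags` holds one entry per occurrence of a single ambiguous word within one document; already decided occurrences are left untouched.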

Taxonomy of WSD Algorithms

• Knowledge based
• Corpus based
– Tagged corpora
– Untagged corpora (bananaelephant)
• Hybrid approaches

Word Senses and Lexicons

Sense tagging = attaching senses from some lexicon to words in text

Sense-enumerative dictionary

Deficiencies of dictionaries

• Omissions and oversights
• Coverage of names
• Ghost words – Dord = density (D or d)
• Differentiating senses (P. Hanks: A serious problem for computer applications is that dictionaries compiled for human users focus on giving lists of meanings for each entry, without saying much about how one meaning may be distinguished from another in text)

Two levels of sense distinction

• Homography
– Two senses of a word are homographic when there is no obvious semantic relation between them (e.g. a ball: a dance or a rounded object)
– Risk of amateur etymology

• Polysemy

Distinguishing senses

• P. Hanks: No generally agreed criteria exist for what counts as a sense, or for how to distinguish one sense from another

• Zeugma: Arthur and his driving license expired last Thursday.

• Polysemy vs. vagueness (e.g. mountain)

The Bank Model

• Assumption A – Words have a finite set of clearly distinct, well-defined senses

• Assumption B – Native speakers of … know instantly and effortlessly which meaning applies in a given situation

• Criticism of the bank model: Kilgarriff (“I don’t believe in word senses”), Pustejovsky (Generative Lexicon), and many others…

NLP Lexicons

• Longman Dictionary of Contemporary English (LDOCE) – three-level embedded structure for sense distinctions (homographs, senses, optional subsenses)
• Roget’s Thesaurus
• Cambridge International Dictionary of English
• COBUILD English Language Dictionary
• WordNet
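
The three-level structure described for LDOCE (homographs, senses, optional subsenses) maps naturally onto nested records. The sketch below only illustrates that shape: the class names, the sample entry, and the identifier scheme are assumptions and do not reproduce LDOCE's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    number: int
    definition: str
    subsenses: list = field(default_factory=list)  # optional third level


@dataclass
class Homograph:
    number: int
    part_of_speech: str
    senses: list = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    homographs: list = field(default_factory=list)


def sense_ids(entry):
    """Enumerate every distinguishable sense as headword.homograph.sense[subsense]."""
    ids = []
    for hom in entry.homographs:
        for sense in hom.senses:
            base = f"{entry.headword}.{hom.number}.{sense.number}"
            ids.append(base)
            for i, _ in enumerate(sense.subsenses):
                ids.append(base + chr(ord("a") + i))  # a, b, c, ...
    return ids


# Invented sample entry, for illustration only.
ball = Entry("ball", [
    Homograph(1, "n", [Sense(1, "a round object", ["used in games", "any round mass"])]),
    Homograph(2, "n", [Sense(1, "a formal dance")]),
])
print(sense_ids(ball))  # ['ball.1.1', 'ball.1.1a', 'ball.1.1b', 'ball.2.1']
```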

Thesaurus

Ontology

• There is little agreement on what an ontology is… In general, an ontology can be described as an inventory of the objects, processes, etc. in a domain, as well as a specification of (some of) the relations that hold among them.

• Aristotle: genus (the category to which something belongs) and differentiae (properties that uniquely distinguish the category members from their parent and from one another)

• Nodes (concepts) in the hierarchy related by subsumption
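
Subsumption in such a hierarchy amounts to walking from a concept up through its ancestors. A minimal sketch, assuming a single-parent hierarchy; the concept names and the `PARENT` table are invented for illustration.

```python
# Hypothetical single-inheritance hierarchy: child concept -> parent concept.
PARENT = {
    "dog": "mammal",
    "mammal": "animal",
    "animal": "entity",
    "oak": "tree",
    "tree": "plant",
    "plant": "entity",
}


def subsumes(general, specific):
    """Return True if `general` is `specific` itself or one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)  # None once the root has been passed
    return False


print(subsumes("animal", "dog"))  # True
print(subsumes("plant", "dog"))   # False
```

Real ontologies often allow several parents per node, in which case the walk becomes a graph search rather than a simple loop.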

Ontologies in different traditions

• Philosophical
• Cognitive
• Artificial intelligence
• Lexical semantics
• Lexicography
• Information science

Princeton WordNet

• Lexical semantic network structured around the notion of synsets

• Synset: a group of literals of the same part of speech that are mutually interchangeable in a certain context (“set of synonyms”)
• http://www.cogsci.princeton.edu/~wn/w3wn.html
• Inspired by psycholinguistic theories of human lexical memory
• Broad coverage, rich lexical information, freely available
• Too fine-grained for practical NLP tasks
• Relations between two synsets: hyponymy, hyperonymy, meronymy …
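
To make the synset notion concrete, here is a small sketch of a synset store in plain Python. It does not use the WordNet database or any WordNet API; the identifiers, the second and third synsets, and the relation shown are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    synset_id: str            # hypothetical identifier scheme
    pos: str                  # part of speech shared by all literals
    literals: list
    relations: dict = field(default_factory=dict)  # relation name -> synset ids


SYNSETS = [
    Synset("v-remark-1", "v", ["note", "observe", "make a remark", "remark"],
           {"hyperonym": ["v-say-1"]}),
    Synset("v-say-1", "v", ["say", "state", "tell"]),
    Synset("v-watch-1", "v", ["observe", "watch"]),
]


def synsets_for(literal, pos):
    """A polysemous literal belongs to several synsets, one per sense."""
    return [s.synset_id for s in SYNSETS if s.pos == pos and literal in s.literals]


print(synsets_for("observe", "v"))  # ['v-remark-1', 'v-watch-1']
print(synsets_for("remark", "v"))   # ['v-remark-1']
```

Sense tagging against such a resource then means choosing one of the returned synset identifiers for each occurrence of the literal.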

EuroWordNet (i)

• Multilingual database containing several monolingual wordnets structured along the same lines as the Princeton WordNet 1.5
• English, Dutch, German, Spanish, French, Italian, Czech, Estonian
• Inter-Lingual-Index
• http://www.hum.uva.nl/~ewn

EuroWordNet (ii)

Princeton WordNet 1.5: note, observe, make a remark, remark
EuroWordNet (Czech): prohodit, poznamenat, připomenout
EuroWordNet (German): anmerken, bemerken
…
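
The role of the Inter-Lingual-Index can be sketched as a shared record to which the synsets of each language are linked. The record key, the language codes, and the function name below are assumptions; the literals are the ones listed on this slide.

```python
# Hypothetical ILI store: one record per concept, linking literals in each language.
ILI = {
    "ili-remark": {
        "en": ["note", "observe", "make a remark", "remark"],
        "cs": ["prohodit", "poznamenat", "připomenout"],
        "de": ["anmerken", "bemerken"],
    },
}


def equivalents_via_ili(literal, source, target):
    """Collect target-language literals linked to the same ILI record as `literal`."""
    results = []
    for record in ILI.values():
        if literal in record.get(source, []):
            results.extend(record.get(target, []))
    return results


print(equivalents_via_ili("remark", "en", "de"))  # ['anmerken', 'bemerken']
```

Because every language links to the index rather than to each other, adding a new wordnet requires only one set of links instead of one per language pair.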

Sense tagged corpora

• “interest” corpus – about 2,000 sentences containing the word “interest”

• SENSEVAL
– http://www.senseval.org
– WSD evaluation exercise, first run in 1998

• SEMCOR
– http://multisemcor.itc.it/semcor.php
– Subset of the English Brown corpus, about 700,000 words
– More than 200,000 words sense-tagged according to Princeton WordNet 1.6
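
A sense-tagged corpus can be used directly to build a most-frequent-sense baseline, a common point of comparison for WSD systems. The sketch below assumes the corpus has already been reduced to (word, sense) pairs; the sample data and sense labels are invented and do not come from any of the corpora above.

```python
from collections import Counter, defaultdict

# Invented sample of sense-tagged tokens: (word, sense label).
TAGGED_TOKENS = [
    ("interest", "interest/money"),
    ("interest", "interest/money"),
    ("interest", "interest/attention"),
    ("bank", "bank/institution"),
]


def most_frequent_senses(tagged_tokens):
    """Map each word to the sense it carries most often in the tagged data."""
    counts = defaultdict(Counter)
    for word, sense in tagged_tokens:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


baseline = most_frequent_senses(TAGGED_TOKENS)
print(baseline["interest"])  # interest/money
```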

Final remarks

• Similarity of POS tagging and sense tagging
• Mapping lexical resources