Programming language is not an island: Word Sense Alignment of Lexical-Semantic Resources
Iryna Gurevych
Joint work with: Judith Eckle-Kohler, Kostadin Cholakov, Silvana Hartmann, Michael Matuschek, Christian M. Meyer
12.09.2014 | Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de/data/uby
Iryna Gurevych, "Programming language is not an island: word sense alignment in lexical-semantic
Lexical-semantic resources play a key role in automatic text processing. In recent years, community-built resources such as Wikipedia and Wiktionary have become an attractive alternative to classical expert-built resources such as WordNet, especially for languages with few resources. Recent large-scale projects, for example YAGO, BabelNet, and UBY, aim to combine multiple lexical-semantic resources within a single system. In my talk, I will present word sense alignment as a task that is critical for combining lexical-semantic resources and for exploiting their complementary strengths. In word sense alignment, the sense of a term (for example, Java as a programming language) must be linked to the synonymous senses in multiple resources and separated from other senses of the same word (for example, Java as an island). The talk covers two approaches to this task, one based on text similarity and one based on graphs, together with their evaluation on pairs of lexical-semantic resources with different properties. Finally, examples of using aligned lexical-semantic resources in automatic text processing are given.
Transcript
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Text Analysis Needs Lexical-Semantic Knowledge
Lexical resource
NLP application
Which lexical resource to choose?
Resources are Largely Different
Different coverage of words/word senses
Different types of information
Encyclopedic vs. linguistic knowledge
Syntactic vs. semantic knowledge
…
Resource integration can significantly influence the performance of your system!
Instead of choosing only one (best performing):
Why not combine multiple resources and benefit from all their knowledge?
Overlap of Lexical Entries
Roget’s Thesaurus (62,797), Wiktionary (364,663), WordNet (149,502)
Common vocabulary is rather small (28,650).
Each resource contains a lot of “unique” words.
[Venn diagram of the three vocabularies; further region counts shown: 25,541, 56,240, 163,027, 67,868]
Overlap of Lexical Entries
Surprisingly small overlap
[Diagram labels: slang, dialect, natural sciences, computer science, neologisms, humanities, social sciences, biological taxonomy, named entities, math]
Our Main Goal: Sense Alignments for Many Resources
Definitions of "to sing" (and German "singen") as they appear in different resources:
1. To sing: To produce musical or harmonious sounds with one’s voice.
2. To sing: To express audibly by means of a harmonious vocalization.
3. To sing: To confess under interrogation.

1. singen: Mit der Stimme harmonische Töne erzeugen. (German: "to produce harmonious tones with the voice")

1. To sing: Produce tones with the voice
2. To sing: divulge confidential information or secrets

1. To sing: To produce harmonious sounds with one's voice.
Prior Work on Linked Lexical Resources (LLR)
Meaning Multilingual Central Repository, Atserias et al. (2004)
YAGO, Suchanek et al. (2007)
SemLink, Palmer (2009)
Universal WordNet (UWN), de Melo and Weikum (2009)
eXtended WordFrameNet, Laparra and Rigau (2010)
BabelNet, Navigli and Ponzetto (2010)
NULEX, McFate and Forbus (2011)
UBY, Gurevych et al. (2012)
… many more, e.g. on the Semantic Web
Potential of Linked Lexical Resources
Increased coverage and the enriched sense representation
Linking FrameNet, VerbNet, and WordNet for semantic parsing (Shi and Mihalcea, 2005)
Linking VerbNet, FrameNet and PropBank for semantic role labeling (Palmer, 2009)
Linking WordNet and Wikipedia for word sense disambiguation (Navigli and Ponzetto, 2010)
Linking WordNet and Wiktionary for measuring verb similarity (Meyer and Gurevych, 2012)
Linking OmegaWiki and Wiktionary for mining translations (McCrae and Cimiano, 2013)
The Challenge: Heterogeneity of Resources
Different coverage: missing entities in one of the resources
Different granularity: entities are defined at different levels
Different perspectives: entities are defined for a different purpose
(Euzenat/Shvaiko, 2007)
Lemma Alignment
Wiktionary
WordNet
Content integration at the lemma level is easy, but…
Word Sense Alignment
…integration at the sense level is hard!
Word Sense Alignment
plant in Wiktionary
(botany) An organism of the kingdom Plantae […]
(proscribed as biologically inaccurate) Any creature that grows on soil or similar surfaces, including plants and fungi.
A factory or other industrial or institutional building or facility.
(snooker) A play in which the cue ball knocks one (usually red) ball onto another […]
plant in WordNet
buildings for carrying on industrial labor
(botany) a living organism lacking the power of locomotion
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
The Alignment Process
Matching takes resource 1 (r), resource 2 (r‘), an initial alignment A (possibly empty), parameters p, and knowledge k, and produces the output alignment A‘:
A‘ = f(r, r‘, A, p, k)
Can be generalized for multiple resources („multi-alignment“):
A‘ = f(r1,…,rn, A, p, k)
(Euzenat/Shvaiko, 2007)
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Construction of aligned lexical resources
Sense Alignment
Niemann & Gurevych, IWCS 2011
Meyer & Gurevych, IJCNLP 2011
Matuschek & Gurevych, TACL 2013
Matuschek & Gurevych, COLING 2014
Hartmann & Gurevych, ACL 2013
Miller & Gurevych, LREC 2014
Topics (color-coded on the slide):
Graph-based alignment
Resource-independent alignment
Text similarity-based alignment
Exploitation of existing LR alignments to produce new ones
What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. Christian M. Meyer and Iryna Gurevych. In: Proceedings of IJCNLP, pp. 883-892, November 2011.
Automatic Word Sense Alignment
Case Study: Aligning Wiktionary and WordNet
Enriched sense representations
Increased coverage
Aligning Wiktionary and WordNet
A two-step approach:
1. Candidate extraction
2. Candidate disambiguation
WordNet synset: {plant, works, industrial plant}
Wiktionary senses: plant (factory), plant (organism), plant (person), works (machine), works (factory), bird (animal), to fly (move), reddish (color), …
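The candidate extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each resource is available as a plain dictionary, and the function name `extract_candidates` as well as all identifiers such as `wn:plant-works` are made up for this example. Every Wiktionary sense whose lemma occurs in a WordNet synset becomes a candidate for that synset.

```python
from collections import defaultdict


def extract_candidates(synsets, wiktionary_senses):
    """Pair each synset with every Wiktionary sense that shares one of its lemmas.

    synsets:           dict synset_id -> set of lemmas (hypothetical toy format)
    wiktionary_senses: dict sense_id  -> lemma         (hypothetical toy format)
    """
    senses_by_lemma = defaultdict(set)
    for sense_id, lemma in wiktionary_senses.items():
        senses_by_lemma[lemma].add(sense_id)

    candidates = {}
    for synset_id, lemmas in synsets.items():
        matched = set()
        for lemma in lemmas:
            matched |= senses_by_lemma.get(lemma, set())
        candidates[synset_id] = matched
    return candidates


# Toy data modeled on the slide; all identifiers are invented.
synsets = {"wn:plant-works": {"plant", "works", "industrial plant"}}
wiktionary_senses = {
    "wkt:plant-factory": "plant",
    "wkt:plant-organism": "plant",
    "wkt:plant-person": "plant",
    "wkt:works-machine": "works",
    "wkt:works-factory": "works",
    "wkt:bird-animal": "bird",
    "wkt:fly-move": "to fly",
    "wkt:reddish-color": "reddish",
}

candidates = extract_candidates(synsets, wiktionary_senses)
print(sorted(candidates["wn:plant-works"]))
```

Senses of unrelated lemmas (bird, to fly, reddish) never become candidates, so the later disambiguation step only has to decide among senses of plant and works.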
Bag of Words Representation
[Diagram: each WordNet synset and each Wiktionary sense is turned into a bag-of-words. Labels in the diagram: synset, hypernyms, hyponyms; sense definition, lemma, usage examples, synonyms, hyper- & hyponyms]
Synsets are represented by synonyms, gloss, examples
Candidate Disambiguation
The two bag-of-words representations are compared with a semantic relatedness measure, which yields a score s:
COS: Cosine similarity
PPR: Personalized PageRank
s < threshold: No alignment!
s ≥ threshold: Align this pair of WordNet synset and Wiktionary sense!
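The COS variant of this decision can be sketched as below. It is only an illustration under stated assumptions: the tokenizer, the helper names `bag_of_words`, `cosine`, and `should_align`, and the threshold value 0.25 are all chosen for this example; the slides do not give the threshold that was actually used. Personalized PageRank is not shown.

```python
import math
import re
from collections import Counter


def bag_of_words(*texts):
    """Lowercase, tokenize on word characters, and count the tokens."""
    words = []
    for text in texts:
        words.extend(re.findall(r"\w+", text.lower()))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def should_align(synset_bow, sense_bow, threshold):
    """Align the pair only if the score reaches the threshold."""
    return cosine(synset_bow, sense_bow) >= threshold


# Synset: synonyms plus gloss; Wiktionary senses: lemma plus definition.
synset = bag_of_words(
    "plant works industrial plant", "buildings for carrying on industrial labor"
)
factory = bag_of_words(
    "plant", "A factory or other industrial or institutional building or facility."
)
organism = bag_of_words("plant", "An organism of the kingdom Plantae")

THRESHOLD = 0.25  # arbitrary value for this toy example, not from the slides

for name, sense in [("factory", factory), ("organism", organism)]:
    print(name, round(cosine(synset, sense), 3), should_align(synset, sense, THRESHOLD))
```

In practice the threshold would be tuned on annotated sense pairs, since it directly trades precision against recall.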
Evaluation Dataset
Dataset creation:
No previous alignments = no other evaluation datasets
We created a new dataset with 2,423 sense pairs
10 human raters (students/researchers from CS, math, linguistics)
Annotate each pair as “same meaning” or “different meaning”
Dataset reliability:
Inter-rater agreement: AO = .93, κ = .70
Removing two biased raters: AO = .94, κ = .74
Gold standard:
Majority vote of the 8 raters, additional tie breaker
Evaluation Results
Method     A      P      R      F1
RAND       .662   .212   .594   .313
MFS        .802   .329   .508   .399
COS only   .901   .598   .703   .646
PPR only   .915   .684   .636   .659
COS&PPR    .914   .674   .649   .661
RAND: Random baseline
MFS: Baseline that always aligns the first sense (≈ most frequent sense)
Our approach significantly outperforms the baseline (at 1% level)
COS highest recall; PPR highest precision; COS&PPR highest F1
Significant difference of PPR, COS&PPR over COS (at 1% level)
No significant difference between PPR and COS&PPR
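As a quick consistency check, the F1 column can be recomputed from the reported precision and recall, assuming F1 is the usual harmonic mean of P and R. The numbers are copied from the table; the function name `f1` is chosen for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given on the slide
results = {
    "RAND": (0.212, 0.594, 0.313),
    "MFS": (0.329, 0.508, 0.399),
    "COS only": (0.598, 0.703, 0.646),
    "PPR only": (0.684, 0.636, 0.659),
    "COS&PPR": (0.674, 0.649, 0.661),
}

for method, (p, r, reported) in results.items():
    print(f"{method}: computed F1 = {f1(p, r):.3f}, reported F1 = {reported:.3f}")
```

Small deviations in the last digit are expected, because P and R are themselves rounded to three decimals.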
Error Analysis
110 false negatives: “same meaning, but was not aligned”
Very different wording
“good discernment” vs. “ability to notice what others might miss”
Similar senses but slightly below threshold
“plants of the genus Centaurea” vs. “common weeds of the genus Centaurea”
Pointing to another entry rather than a content-based gloss
pacification: “the process of pacifying”
Error Analysis
98 false positives: “different meaning, but have been aligned”
Similar wording, but refer to different concepts
“a computer that provides client stations with access to files and printers as shared resources to a computer network” vs. “any computer attached to a network”
High relatedness, but generic- versus domain-specific vocabulary
“any computer attached to a network” vs. “any organization that provides resources and facilities for a function or event”
Increased Coverage: Parts of Speech
                  Wiktionary AND WordNet   Additionally in Wiktionary   Additionally in WordNet
Nouns             34,464                   158,085                      47,651
Verbs             8,252                    29,119                       5,515
Adj./Adv.         14,236                   60,977                       7,541
Other POS         0                        16,778                       0
Inflected Forms   0                        106,328                      0
Our alignment: 56,970 sense pairs
Final resource contains 488,988 word senses
Substantial increase in the coverage of senses
Wiktionary is not restricted to nouns/verbs/adjectives: proverbs, idioms, collocations, particles, determiners, inflected forms, etc.
Increased Coverage: Domains
                  Wiktionary AND WordNet   Additionally in Wiktionary   Additionally in WordNet
Biology           4,465                    4,067                        12,869
Chemistry         2,561                    8,260                        2,268
Engineering       1,108                    940                          1,080
Geology           2,287                    2,898                        2,479
Humanities        4,949                    2,700                        5,060
IT                439                      3,032                        557
Linguistics       1,249                    1,011                        1,576
Math              615                      2,747                        483
Medicine          3,613                    3,728                        3,058
Military          574                      426                          585
Physics           1,246                    2,835                        1,252
Religion          733                      1,154                        781
Social Sciences   3,745                    2,907                        4,458
Sport             905                      2,821                        807
Enriched Sense Representation
Synonyms
Gloss
Example sentence
Subsumption hierarchy
Synset organization
…
Pronunciation
Etymology
Syntactic knowledge
Quotations
Related terms
Translations
…
Selected Conclusions
Aligned Wiktionary – WordNet is characterized by:
(1) Increased coverage
Different parts of speech, not only nouns
e.g. humanities and social sciences from WordNet
e.g. technical domains and leisure from Wiktionary
(2) Enriched sense representation
Pronunciation, etymology, related terms, translations, etc.
Novel evaluation dataset annotated by 10 human raters
Later work obtains better results with resource-structure-based and hybrid techniques (Matuschek & Gurevych, TACL ‘13)
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Michael Matuschek and Iryna Gurevych: Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment, in: Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013
Similarity-Based Approaches Suffer From…
Different vocabulary employed by definitions
Example: English noun eye/discernment, e.g.,
she has an eye for fresh talent
he has an artist's eye
good discernment (either visually or as if visually)
ability to notice what others might miss
low semantic relatedness score…
Linking Sense Definitions
Java
1. An island of Indonesia separated from Borneo by the Java Sea, an arm of the western Pacific Ocean. Center of an early Hindu Javanese civilization, Java was converted to Islam before the arrival of the Europeans (mainly the Dutch) in the late 16th century.
2. A trademark used for a programming language designed to develop applications, especially ones for the Internet, that can operate on different platforms.
3. Brewed coffee.
Intuition of Graph Topology
Word senses of Java: Java1, Java2, Java3
Linking Monosemous Lexemes
Monosemous lexeme: programming language (programminglanguage1)
Word senses of Java: Java1, Java2, Java3
How to Disambiguate
Given the lemma Ruby with the definition: Ruby is a programming language. Others are Java, C, and Perl.
Where in the graph is it located?
Which sense of Java should be picked?
Use monosemous lexemes as a “reference node”!
Intuition of Graph Topology
Word senses of Ruby: Ruby1
Monosemous lexeme: programming language (programminglanguage1)
Word senses of Java: Java1, Java2, Java3
Disambiguate by Reference Nodes
This disambiguation would be plausible as related senses are in the same region. Goal: Find plausible disambiguation for all senses
Dijkstra-WSA
Graph-based word sense alignment approach
Key ideas:
Represent lexical resources as graphs
Rely on trivial alignments as “reference nodes” and “bridges”
Use Dijkstra’s shortest path algorithm to find alignments
Steps:
1. Graph construction
2. Computing Sense Alignments
(Matuschek/Gurevych, 2013)
Step 1: Graph Construction
Represent each lexical resource as an undirected graph L = (V, E) with
the set of nodes V representing senses or synsets
the set of edges E ⊆ V × V representing some kind of (semantic) similarity between a pair of nodes
There is an edge connecting sense S1 and sense S2 if
There exists a semantic relation between S1 and S2
A lexeme W2 occurs in the sense definition of S1, and W2 is monosemous (S2 being its only sense)
S1 and S2 share the same syntactic behavior
Other possibilities / all of the above
(Matuschek/Gurevych, 2013)
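A minimal sketch of this graph construction, covering only the first two edge types (semantic relations and monosemous linking). The data layout, the function name `build_graph`, the whole-phrase regular-expression matching, and the made-up definition of programminglanguage1 are assumptions for illustration; the original system may tokenize and match lexemes differently.

```python
import re
from collections import defaultdict


def build_graph(definitions, lemma_of, relations=()):
    """Build an undirected sense graph as an adjacency dict.

    definitions: dict sense_id -> definition text
    lemma_of:    dict sense_id -> lemma
    relations:   iterable of (sense_id, sense_id) semantic relations
    """
    graph = defaultdict(set)
    for sense in definitions:
        graph[sense]  # touch the key so isolated senses still appear as nodes

    def add_edge(u, v):
        if u != v:
            graph[u].add(v)
            graph[v].add(u)

    # Edge type 1: existing semantic relations.
    for u, v in relations:
        add_edge(u, v)

    # Edge type 2: a monosemous lexeme occurs in a sense definition.
    senses_by_lemma = defaultdict(list)
    for sense, lemma in lemma_of.items():
        senses_by_lemma[lemma.lower()].append(sense)
    monosemous = {l: s[0] for l, s in senses_by_lemma.items() if len(s) == 1}

    for sense, text in definitions.items():
        lowered = text.lower()
        for lemma, target in monosemous.items():
            if re.search(r"\b" + re.escape(lemma) + r"\b", lowered):
                add_edge(sense, target)
    return graph


definitions = {
    "Java1": "An island of Indonesia separated from Borneo by the Java Sea.",
    "Java2": "A trademark used for a programming language designed to develop applications.",
    "Java3": "Brewed coffee.",
    "Ruby1": "Ruby is a programming language. Others are Java, C, and Perl.",
    "programminglanguage1": "A formal language for writing computer programs.",  # made up
}
lemma_of = {
    "Java1": "Java",
    "Java2": "Java",
    "Java3": "Java",
    "Ruby1": "Ruby",
    "programminglanguage1": "programming language",
}

graph = build_graph(definitions, lemma_of)
print(sorted(graph["programminglanguage1"]))
```

In this toy graph Java2 and Ruby1 end up in the same region because both link to the monosemous node programminglanguage1, while Java1 and Java3 stay unconnected.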
Step 1: Graph Construction
Graph of resource 1: edges representing some kind of (semantic) similarity between nodes
(Matuschek/Gurevych, 2013)
Step 2: Computing Sense Alignments
a) Create trivial alignments between the resources:
Trivial = lexeme is unique/monosemous in both resources
Example: programming language
Precision: >0.95
b) Identify alignment candidates
For example: nodes representing the same lemma
c) For all nodes still unaligned, find shortest paths to the candidate nodes in the other graph
Trivial alignments serve as “bridges” between the graphs
Align the node pair with the shortest path
(Matuschek/Gurevych, 2013)
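Steps a) to c) can be sketched as follows. This is a simplified illustration, not the published implementation: all edges, including the bridges created by trivial alignments, are assumed to cost 1; node identifiers are prefixed with a made-up resource tag so that the two graphs can be merged; and the names `shortest_distances`, `align`, and `undirected` as well as the default path bound of 6 are choices made for this example.

```python
import heapq


def undirected(edges):
    """Turn a list of (u, v) pairs into a symmetric adjacency dict."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_distances(graph, source, max_len):
    """Dijkstra with unit edge weights, ignoring paths longer than max_len."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph.get(u, ()):
            nd = d + 1
            if nd <= max_len and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def align(graph1, graph2, trivial, source, candidates, max_len=6):
    """Return the candidate closest to source, or None if none is reachable."""
    merged = {}
    for g in (graph1, graph2):
        for u, neighbours in g.items():
            merged.setdefault(u, set()).update(neighbours)
    for a, b in trivial:  # trivial alignments act as bridges between the graphs
        merged.setdefault(a, set()).add(b)
        merged.setdefault(b, set()).add(a)
    dist = shortest_distances(merged, source, max_len)
    reachable = [(dist[c], c) for c in candidates if c in dist]
    return min(reachable)[1] if reachable else None


graph1 = undirected([("A:Java2", "A:programminglanguage1")])
graph2 = undirected([
    ("B:Java2", "B:programminglanguage1"),
    ("B:Java3", "B:coffee1"),
])
trivial = [("A:programminglanguage1", "B:programminglanguage1")]
candidates = ["B:Java1", "B:Java2", "B:Java3"]

print(align(graph1, graph2, trivial, "A:Java2", candidates))
```

Here only B:Java2 is reachable from A:Java2 (via the programming language bridge), so it is selected; the other candidates have infinite distance. Lowering `max_len` trades recall for runtime.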
Step 2: Computing Sense Alignments
[Figure: graph of resource 1 and graph of resource 2]
Step 2a: Create Trivial Alignments
Step 2b: Identify Alignment Candidates
Step 2c: Shortest Paths to the Candidates
[Figure: path lengths to the three candidates: 3, 5, ∞]
Step 2c: Align the Nodes
Parameter Choices
Restricting the alignment partners
Stop when the first candidate is found (1:1 alignment)
Keep going and align everything you can reach (1:n alignment)
Graph construction
Use semantic relations, monosemous linking, or both
Prune relations to highly frequent monosemous lexemes (e.g., there is)
Limiting to rare lexemes avoids “explosion” of edges
Rare = only appearing in 1 / N of the definitions (e.g., N = 200)
Computing sense alignments
Path length L: unbounded L yields unmanageable runtime!
Best F1 score between 5 and 8, depending on the resource pair
(Matuschek/Gurevych, 2013)
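The frequency filter for monosemous linking might look like the sketch below. It reads "rare" as "occurring in at most 1/N of all definitions", which is an interpretation of the slide rather than a confirmed detail, and it only handles single-token lexemes with whitespace tokenization. The function name `rare_lexemes` and the toy data are invented for the example.

```python
def rare_lexemes(definitions, lexemes, n=200):
    """Keep lexemes that occur in at most len(definitions) / n definitions."""
    limit = len(definitions) / n
    counts = {lexeme: 0 for lexeme in lexemes}
    for text in definitions:
        tokens = set(text.lower().split())
        for lexeme in lexemes:
            if lexeme in tokens:
                counts[lexeme] += 1
    return {lexeme for lexeme, count in counts.items() if count <= limit}


definitions = [
    "java is an island",
    "ruby is a gemstone",
    "a plant is an organism",
    "brewed coffee",
]

# With 4 definitions and n=2, a lexeme may occur in at most 2 definitions.
print(sorted(rare_lexemes(definitions, {"is", "organism", "gemstone"}, n=2)))
```

The frequent token "is" is dropped, so it cannot connect nearly every pair of senses and blow up the number of edges.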
Hybrid Approach
Main issue of Dijkstra-WSA
Low recall due to missing edges / sparse graph
Hybrid approach
Try to align using the graph first
Parameterized for high precision
Align those with no match using a similarity-based approach
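The backoff logic of the hybrid approach is simple enough to sketch in a few lines. The two aligners are passed in as functions that return a target sense or None; these callables and the name `hybrid_align` are placeholders for this example, not part of any published API.

```python
def hybrid_align(senses, graph_align, similarity_align):
    """Use the graph-based aligner first and back off to similarity on a miss."""
    alignments = {}
    for sense in senses:
        target = graph_align(sense)  # parameterized for high precision
        if target is None:
            target = similarity_align(sense)  # backoff for unmatched senses
        if target is not None:
            alignments[sense] = target
    return alignments


# Toy aligners backed by dictionaries; a missing key means "no alignment found".
graph_based = {"a": "x"}.get
similarity_based = {"a": "other", "b": "y"}.get

print(hybrid_align(["a", "b", "c"], graph_based, similarity_based))
```

Because the graph-based result always takes priority, the similarity-based aligner can only add alignments and never overrides a graph-based decision.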
Evaluation Datasets
WordNet – Wikipedia (1,815 sense pairs)
WordNet – Wiktionary (2,423 sense pairs)
GermaNet – Wiktionary (45,636 sense pairs)
FrameNet – Wiktionary (2,789 sense pairs)
WordNet – OmegaWiki (683 sense pairs)
Wiktionary – OmegaWiki (586 sense pairs)
Wiktionary – Wikipedia German (31,808 sense pairs)
Wiktionary – Wikipedia English (367 sense pairs)
Datasets Display Different Properties
WordNet, OmegaWiki, Wikipedia: sense definitions and semantic relations
Wiktionary: no disambiguated semantic relations => sparse graphs
GermaNet: very few sense definitions
Evaluation
[Result charts, one per slide, compare the following configurations]
Random baseline
1:1 1st
Similarity-based (SB)
Semantic Relations (SR)
Linking Monosemes (LM)
SR + LM
SR + SB
LM + SB
SR + LM + SB
Human performance
Hybrid
(Matuschek/Gurevych, 2013)
Analysis
Dijkstra-WSA ≥ gloss similarity for densely linked LSRs
Generic alignment approach is valid
But: low recall for sparse LSRs (English Wiktionary, OmegaWiki)
Dijkstra-WSA + similarity-based backoff outperforms previous work on all datasets
The two notions of similarity are complementary
Could they be combined in a smarter way?
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Michael Matuschek and Iryna Gurevych: High Performance Word Sense Alignment by Joint Modeling of Sense Distance and Gloss Similarity, in: Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014). Dublin, Ireland.
Joint Usage of Features
Similarity- and graph-based approaches both have weaknesses
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Linked Lexical Resources
LLRs
Gurevych et al., EACL 2012
Eckle-Kohler et al., LREC 2012
Eckle-Kohler & Gurevych, EACL 2012
Eckle-Kohler et al., LMF 2013
Eckle-Kohler et al., SWJ 2014
Topics (color-coded on the slide):
Large-scale unified LR based on LMF
Standardizing heterogeneous LRs
Standardized format for subcat frames
Language independence of lexicon models
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth: UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF, in: Proceedings of the 13th Conference of the European chapter of the Association for Computational Linguistics (EACL), April 2012.
UBY: Linking Lexical Resources
[Diagram of the resources combined in UBY; labels include Web 2.0 and IMSLex-Subcat]
Two main characteristics:
- Word Sense Alignments
- Standardized Representation
Heterogeneity of Lexical Resources
Complementary information types
Different terminology
Incompatible Data formats
Unified Lexical Resource UBY
Unified lexicon model
Extensible
Preserves variety of lexical information
Structure Integration in UBY
(Eckle-Kohler et al. 2012)
Sense Alignments
Enable semantic interoperability between LSRs in UBY:
Senses linked by SenseAxis class (over 1,000,000 instances)
English alignments, e.g. WordNet-Wikipedia
German alignments, e.g. GermaNet-Wiktionary
Cross-lingual alignments, e.g. WordNet-OmegaWiki DE
Available Alignments
Wikipedia English—WordNet 83,192
Wiktionary English—WordNet 138,282
GermaNet—Wiktionary German 32,850
FrameNet—Wiktionary English 12,340
Wiktionary English—OmegaWiki English 34,509
WordNet—OmegaWiki German 27,529
Wiktionary German—Wikipedia German 21,872
Wiktionary English—Wikipedia English 66,050
WordNet—VerbNet 40,716
FrameNet—VerbNet 17,529
Wikipedia English—OmegaWiki English 3,960
Wikipedia German—OmegaWiki German 1,097
Wikipedia English—Wikipedia German 463,311
OmegaWiki English—OmegaWiki German 58,785
Resource Integration Workflow in UBY
[Diagram: human users and machines access each resource through its own API: JWNL, FN API, JWPL, JWKTL]
Step 1. Structure Integration
[Diagram: human users and machines access all resources in UBY through the UBY API]
Step 2. Content Integration
[Diagram: human users and machines access UBY through the UBY-API]
UBY Web UI – Textual View
Textual view: lists senses across all resources, displays sense details, and supports sense comparisons.
UBY Web UI – Visual View
Visual view: lets users explore the sense alignments.
UBY Java API
The UBY API is open source at Google Code: http://code.google.com/p/uby/
Getting Started:
1. Download a UBY database dump
2. Import the dump into a MySQL database
3. Start using the UBY API
The UBY API is work in progress!
Many API methods need to be added – consider contributing!
http://uby.ukp.informatik.tu-darmstadt.de/uby/UBY
UBY – Data and Tools
http://code.google.com/p/uby/
https://uby.ukp.informatik.tu-darmstadt.de/webui/
Web Interface
Open Source API (JAVA)
Database Dumps UBY
Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Approaches to Word Sense Alignment
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
Utilizing Linked Lexical Resources
Utilizing LLRs
Cholakov et al., EACL 2014
Matuschek et al., KONVENS 2014
Matuschek et al., TC3 2013
Hartmann & Gurevych, ACL 2013
Hartmann et al., 2014 (in preparation)
Topics (color-coded on the slide):
Sense annotation/disambiguation
Machine/computer-assisted translation
Semantic role labelling
Cross-language transfer of lexical-semantic resources
Michael Matuschek, Tristan Miller and Iryna Gurevych: A Language-independent Sense Clustering Approach for Enhanced WSD, in: Proceedings of the 12th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2014), to appear
Michael Matuschek, Christian M. Meyer and Iryna Gurevych: Multilingual Knowledge in Aligned Wiktionary and OmegaWiki for Translation Applications, in: Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-118, July 2013
Suggested Further Lines of Research
1. Linked lexical resources (LLRs)
Aligning the resources with a) corpus information, b) world knowledge
Advanced visual analytic techniques for navigating the resources
2. Construction of aligned lexical resources
Sense clustering and different sense granularities
Multiple resource alignment and sense alignment curation
3. Utilizing LLR for language processing
Unified deep learning framework utilizing linked resources
Distant supervision applied to semantic role labeling
Word sense disambiguation and lexical substitution for German