RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria Cognate or False Friend? Ask the Web! Svetlin Nakov, Sofia University "St. Kliment Ohridski" Preslav Nakov, University of California, Berkeley Elena Paskaleva, Bulgarian Academy of Sciences A Workshop on Acquisition and Management of Multilingual Lexicons
Nakov S., Nakov P., Paskaleva E., Cognate or False Friend? Ask the Web!, Proceedings of the International Workshop "Acquisition and Management of Multilingual Lexicons", part of the International Conference RANLP 2007, pp. 55-62, ISBN 978-954-452-004-5, Borovets, Bulgaria, 30 September 2007
Introduction
Cognates and false friends
Cognates are pairs of words in different languages that sound similar and are translations of each other
False friends are pairs of words in two languages that sound similar but differ in their meanings
The problem
Design an algorithm that can distinguish between cognates and false friends
Cognates and False Friends
Examples of cognates
ден in Bulgarian = день in Russian (day)
idea in English = идея in Bulgarian (idea)
Examples of false friends
майка in Bulgarian (mother) ≠ майка in Russian (vest)
prost in German (cheers) ≠ прост in Bulgarian (stupid)
gift in German (poison) ≠ gift in English (present)
The Paper in One Slide
Measuring semantic similarity
Analyze the words' local contexts
Use the Web as a corpus
Similar contexts → similar words
Context translation → cross-lingual similarity
Evaluation on 200 pairs of words
100 cognates and 100 false friends
11pt average precision: 95.84%
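The reported figure can be made concrete with a small sketch. This is one common definition of 11-point interpolated average precision (the paper's exact evaluation script is not shown); `ranked_labels` is a hypothetical ranking of test pairs, 1 for cognate, 0 for false friend:

```python
def avg_precision_11pt(ranked_labels):
    """11-point interpolated average precision for a ranked list of
    binary relevance labels (1 = cognate, 0 = false friend)."""
    total_relevant = sum(ranked_labels)
    points = []                 # (recall, precision) after each rank
    hits = 0
    for rank, rel in enumerate(ranked_labels, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))
    # interpolated precision at recall levels 0.0, 0.1, ..., 1.0
    interpolated = []
    for level in (k / 10 for k in range(11)):
        candidates = [prec for rec, prec in points if rec >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

# A perfect ranking puts all cognates before all false friends:
print(avg_precision_11pt([1, 1, 1, 0, 0, 0]))  # 1.0
```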
Contextual Web Similarity
What is a local context?
A few words before and after the target word
The words in the local context of a given word are semantically related to it
Need to exclude the stop words: prepositions, pronouns, conjunctions, etc.
Stop words appear in all contexts
Need for a sufficiently big corpus
Example excerpt:
Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
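The extraction step can be sketched as follows; the stop list here is a tiny illustrative set, not the one used in the paper, and snippets are assumed to have already been fetched from a search engine:

```python
import re
from collections import Counter

# Minimal illustrative stop list (prepositions, conjunctions, etc.).
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "from", "our", "by", "in"}

def context_words(snippets, target, window=3):
    """Count the words within `window` positions of `target` in each
    snippet, skipping stop words."""
    counts = Counter()
    for snippet in snippets:
        words = re.findall(r"[a-z]+", snippet.lower())
        for i, w in enumerate(words):
            if w == target:
                neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                counts.update(n for n in neighbors if n not in STOP_WORDS)
    return counts

snippet = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique.")
print(context_words([snippet], "flowers"))
```

For the excerpt above, the context of "flowers" keeps "delivery", "fresh", "roses" and "unique", while "of" and "and" are filtered out as stop words.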
Contextual Web Similarity
Web as a corpus
The Web can be used as a corpus for extracting the local context of a given word
The Web is the largest possible corpus
Contains big corpora in any language
Searching for a word in Google returns up to 1,000 text excerpts
The target word is returned along with its local context: a few words before and after it
The target language can be specified
Contextual Web Similarity
Web as a corpus
Example: Google query for "flower"
Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.
Contextual Web Similarity
Measuring semantic similarity
Given two words, their local contexts are extracted from the Web
A local context is a set of words and their frequencies
Semantic similarity is measured as the similarity between these local contexts
Local contexts are represented as frequency vectors over a given set of words
The cosine between the frequency vectors in Euclidean space is calculated
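A minimal sketch of the cosine computation over sparse frequency vectors, using a few of the frequencies from the example slides:

```python
import math
from collections import Counter

def cosine(ctx1, ctx2):
    """Cosine between two context-frequency vectors, represented sparsely
    as word -> frequency mappings; missing words count as zero."""
    dot = sum(freq * ctx2.get(word, 0) for word, freq in ctx1.items())
    norm1 = math.sqrt(sum(f * f for f in ctx1.values()))
    norm2 = math.sqrt(sum(f * f for f in ctx2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

flower = Counter({"fresh": 217, "order": 204, "rose": 183, "delivery": 165})
computer = Counter({"Internet": 291, "PC": 286, "order": 185, "new": 174})
print(cosine(flower, computer))   # low: only "order" is shared
```

Unrelated words share few context words, so their vectors are nearly orthogonal and the cosine is close to 0; a word compared with itself yields 1.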
Contextual Web Similarity
Example of context word frequencies

word: flower

  word      count
  fresh     217
  order     204
  rose      183
  delivery  165
  gift      124
  welcome   98
  red       87
  ...       ...

word: computer

  word        count
  Internet    291
  PC          286
  technology  252
  order       185
  new         174
  Web         159
  site        146
  ...         ...
Contextual Web Similarity
Example of frequency vectors

Similarity = cosine(v1, v2)

v1: flower

  #     word       freq.
  0     alias      3
  1     alligator  2
  2     amateur    0
  3     apple      5
  ...   ...        ...
  4999  zap        0
  5000  zoo        6

v2: computer

  #     word       freq.
  0     alias      7
  1     alligator  0
  2     amateur    8
  3     apple      133
  ...   ...        ...
  4999  zap        3
  5000  zoo        0
Cross-Lingual Similarity
We are given two words in different languages L1 and L2
We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}
Measuring cross-lingual similarity:
1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
2. We translate the context C1 into L2 using the glossary G, obtaining C1*
3. We measure the similarity between C1* and C2
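The steps above can be sketched as follows. The glossary and contexts are toy values invented for illustration, and `cosine` is the same measure as on the contextual-similarity slides:

```python
import math
from collections import Counter

def cosine(ctx1, ctx2):
    """Cosine between two sparse context-frequency vectors."""
    dot = sum(freq * ctx2.get(word, 0) for word, freq in ctx1.items())
    norm1 = math.sqrt(sum(f * f for f in ctx1.values()))
    norm2 = math.sqrt(sum(f * f for f in ctx2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def cross_lingual_similarity(ctx1, ctx2, glossary):
    """Translate the L1 context word-by-word via the glossary (step 2),
    dropping words it does not cover, then compare in L2 (step 3)."""
    ctx1_star = Counter()
    for word, freq in ctx1.items():
        if word in glossary:
            ctx1_star[glossary[word]] += freq
    return cosine(ctx1_star, ctx2)

# Toy Bulgarian -> Russian glossary and contexts for the cognate ден/день:
glossary = {"нощ": "ночь", "седмица": "неделя"}
ctx_bg = Counter({"нощ": 40, "седмица": 25, "утре": 10})
ctx_ru = Counter({"ночь": 38, "неделя": 22})
print(cross_lingual_similarity(ctx_bg, ctx_ru, glossary))
```

Cognates end up with similar translated contexts and hence a high cosine; false friends do not.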
Reverse Context Lookup
Local contexts extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.
Internet terms appear on any Web page
Such words are not likely to be genuinely associated with the target word
Will the word "flowers" appear in the local contexts of "send", "online" and "here"?
Reverse Context Lookup
If two words are semantically related, each should appear in the local context of the other
Let #(x, y) = the number of occurrences of x in the local context of y
For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows:
p(w, wc) = min{ #(w, wc), #(wc, w) }
We use p(w, wc) as vector coordinates when measuring semantic similarity
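A minimal sketch of the definition, with `occurrences` as a hypothetical table of #(x, y) counts extracted from the Web:

```python
def association_strength(occurrences, w, wc):
    """p(w, wc) = min{ #(w, wc), #(wc, w) }, where occurrences[(x, y)] is
    the number of times x appears in the local context of y."""
    return min(occurrences.get((w, wc), 0), occurrences.get((wc, w), 0))

occurrences = {
    ("online", "flower"): 120,  # parasite word: frequent in flower's context...
    ("flower", "online"): 2,    # ...but flower is rare in online's context
    ("fresh", "flower"): 80,
    ("flower", "fresh"): 75,
}
print(association_strength(occurrences, "flower", "online"))  # 2
print(association_strength(occurrences, "flower", "fresh"))   # 75
```

Taking the minimum of the two directions suppresses parasite words, which co-occur asymmetrically, while genuinely related words score high in both directions.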
Web Similarity Using Seed Words
An adaptation of the Fung & Yee '98 algorithm*
We have a bilingual glossary G: L1 → L2 of translation pairs and target words w1, w2
We search in Google for co-occurrences of the target words with the glossary entries
We compare the co-occurrence vectors:
for each {p, q} ∈ G, compare
  max( google#("w1 p"), google#("p w1") )
with
  max( google#("w2 q"), google#("q w2") )
* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998
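Building one side's co-occurrence vector can be sketched as follows; `hit_count` is a hypothetical stand-in for the number of Google hits of an exact phrase:

```python
def cooccurrence_vector(word, seeds, hit_count):
    """One coordinate per glossary seed: the larger hit count of the two
    phrase orders, max(#("word seed"), #("seed word"))."""
    return [max(hit_count(f"{word} {seed}"), hit_count(f"{seed} {word}"))
            for seed in seeds]

# Toy hit counts instead of live Google queries:
counts = {"flower rose": 50, "rose flower": 30, "flower gift": 10}
hit_count = lambda phrase: counts.get(phrase, 0)
print(cooccurrence_vector("flower", ["rose", "gift"], hit_count))  # [50, 10]
```

The vector for w1 over the L1 sides of the glossary entries is then compared with the vector for w2 over the corresponding L2 sides, e.g. with the cosine measure.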
Evaluation Data Set
We use 200 Bulgarian/Russian pairs of words:
100 cognates and 100 false friends
Manually assembled by a linguist
Manually checked in several large monolingual and bilingual dictionaries
Limited to nouns only
Experiments
We tested a few modifications of our contextual Web similarity algorithm:
Use of TF.IDF weighting
Preserving the stop words
Use of lemmatization of the context words
Use of different context sizes (2, 3, 4 and 5)
Use of a small and a large bilingual glossary
We compared it with the seed words algorithm
We also compared with traditional orthographic similarity measures: LCSR and MEDR
Experiments
BASELINE: random
MEDR: minimum edit distance ratio
LCSR: longest common subsequence ratio
SEED: the "seed words" algorithm
WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop words filtering, no lemmatization, no reverse context lookup, no TF.IDF-weighting
NO-STOP: WEB3 without stop words removal
WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5
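Minimal sketches of the two orthographic baselines, assuming both are normalized by the longer word's length (MEDR as 1 minus the normalized Levenshtein distance; the paper may normalize slightly differently):

```python
def lcsr(a, b):
    """Longest common subsequence ratio: |LCS(a, b)| / max(|a|, |b|)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)] / max(len(a), len(b))

def medr(a, b):
    """Minimum edit distance ratio: 1 - levenshtein(a, b) / max(|a|, |b|)."""
    dp = list(range(len(b) + 1))          # single-row Levenshtein DP
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j-1] + 1, prev + (ca != cb))
    return 1 - dp[len(b)] / max(len(a), len(b))

# The cognate pair ден/день from the examples slide:
print(lcsr("ден", "день"), medr("ден", "день"))  # 0.75 0.75
```

Both measures score orthographically close pairs near 1 regardless of meaning, which is why they cannot separate false friends from cognates on their own.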