RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria Cognate or False Friend? Ask the Web! Svetlin Nakov, Sofia University "St. Kliment Ohridski" Preslav Nakov, University of California, Berkeley Elena Paskaleva, Bulgarian Academy of Sciences A Workshop on Acquisition and Management of Multilingual Lexicons
Nakov S., Nakov P., Paskaleva E., Cognate or False Friend? Ask the Web!, Proceedings of the International Workshop "Acquisition and Management of Multilingual Lexicons", part of the International Conference RANLP 2007, pp. 55-62, ISBN 978-954-452-004-5, Borovets, Bulgaria, 30 September 2007
Introduction
Cognates and false friends
Cognates are pairs of words in different languages that sound similar and are translations of each other
False friends are pairs of words in two languages that sound similar but differ in their meanings
The problem
Design an algorithm that can distinguish between cognates and false friends
Cognates and False Friends
Examples of cognates
ден in Bulgarian = день in Russian (day)
idea in English = идея in Bulgarian (idea)
Examples of false friends
майка in Bulgarian (mother) ≠ майка in Russian (vest)
prost in German (cheers) ≠ прост in Bulgarian (stupid)
gift in German (poison) ≠ gift in English (present)
The Paper in One Slide
Measuring semantic similarity
Analyze the words' local contexts
Use the Web as a corpus
Similar contexts → similar words
Context translation → cross-lingual similarity
Evaluation on 200 pairs of words
100 cognates and 100 false friends
11pt average precision: 95.84%
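The reported figure can be made concrete with a small sketch. This is one common definition of 11-point interpolated average precision (the paper's exact evaluation script is not shown); `ranked_labels` is a hypothetical ranking of test pairs, 1 for cognate, 0 for false friend:

```python
def avg_precision_11pt(ranked_labels):
    """11-point interpolated average precision for a ranked list of
    binary relevance labels (1 = cognate, 0 = false friend)."""
    total_relevant = sum(ranked_labels)
    points = []                 # (recall, precision) after each rank
    hits = 0
    for rank, rel in enumerate(ranked_labels, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))
    # interpolated precision at recall levels 0.0, 0.1, ..., 1.0
    interpolated = []
    for level in (k / 10 for k in range(11)):
        candidates = [prec for rec, prec in points if rec >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

# A perfect ranking puts all cognates before all false friends:
print(avg_precision_11pt([1, 1, 1, 0, 0, 0]))  # 1.0
```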
Contextual Web Similarity
What is a local context?
A few words before and after the target word
The words in the local context of a given word are semantically related to it
Need to exclude the stop words: prepositions, pronouns, conjunctions, etc.
Stop words appear in all contexts
Need for a sufficiently big corpus
Example excerpt:
Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
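The extraction step can be sketched as follows; the stop list here is a tiny illustrative set, not the one used in the paper, and snippets are assumed to have already been fetched from a search engine:

```python
import re
from collections import Counter

# Minimal illustrative stop list (prepositions, conjunctions, etc.).
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "from", "our", "by", "in"}

def context_words(snippets, target, window=3):
    """Count the words within `window` positions of `target` in each
    snippet, skipping stop words."""
    counts = Counter()
    for snippet in snippets:
        words = re.findall(r"[a-z]+", snippet.lower())
        for i, w in enumerate(words):
            if w == target:
                neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                counts.update(n for n in neighbors if n not in STOP_WORDS)
    return counts

snippet = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique.")
print(context_words([snippet], "flowers"))
```

For the excerpt above, the context of "flowers" keeps "delivery", "fresh", "roses" and "unique", while "of" and "and" are filtered out as stop words.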
Contextual Web Similarity
Web as a corpus
The Web can be used as a corpus for extracting the local context of a given word
The Web is the largest possible corpus
Contains big corpora in any language
Searching for a word in Google returns up to 1,000 text excerpts
The target word is returned along with its local context: a few words before and after it
The target language can be specified
Contextual Web Similarity
Web as a corpus
Example: Google query for "flower"
Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.
Contextual Web Similarity
Measuring semantic similarity
Given two words, their local contexts are extracted from the Web
A local context is a set of words and their frequencies
Semantic similarity is measured as the similarity between these local contexts
Local contexts are represented as frequency vectors over a given set of words
The cosine between the frequency vectors in Euclidean space is calculated
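A minimal sketch of the cosine computation over sparse frequency vectors, using a few of the frequencies from the example slides:

```python
import math
from collections import Counter

def cosine(ctx1, ctx2):
    """Cosine between two context-frequency vectors, represented sparsely
    as word -> frequency mappings; missing words count as zero."""
    dot = sum(freq * ctx2.get(word, 0) for word, freq in ctx1.items())
    norm1 = math.sqrt(sum(f * f for f in ctx1.values()))
    norm2 = math.sqrt(sum(f * f for f in ctx2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

flower = Counter({"fresh": 217, "order": 204, "rose": 183, "delivery": 165})
computer = Counter({"Internet": 291, "PC": 286, "order": 185, "new": 174})
print(cosine(flower, computer))   # low: only "order" is shared
```

Unrelated words share few context words, so their vectors are nearly orthogonal and the cosine is close to 0; a word compared with itself yields 1.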
Contextual Web Similarity
Example of context word frequencies

word: flower

  word      count
  fresh     217
  order     204
  rose      183
  delivery  165
  gift      124
  welcome   98
  red       87
  ...       ...

word: computer

  word        count
  Internet    291
  PC          286
  technology  252
  order       185
  new         174
  Web         159
  site        146
  ...         ...
Contextual Web Similarity
Example of frequency vectors

Similarity = cosine(v1, v2)

v1: flower

  #     word       freq.
  0     alias      3
  1     alligator  2
  2     amateur    0
  3     apple      5
  ...   ...        ...
  4999  zap        0
  5000  zoo        6

v2: computer

  #     word       freq.
  0     alias      7
  1     alligator  0
  2     amateur    8
  3     apple      133
  ...   ...        ...
  4999  zap        3
  5000  zoo        0
Cross-Lingual Similarity
We are given two words in different languages L1 and L2
We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}
Measuring cross-lingual similarity:
1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
2. We translate the context C1 into L2 using the glossary G, obtaining C1*
3. We measure the similarity between C1* and C2
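The steps above can be sketched as follows. The glossary and contexts are toy values invented for illustration, and `cosine` is the same measure as on the contextual-similarity slides:

```python
import math
from collections import Counter

def cosine(ctx1, ctx2):
    """Cosine between two sparse context-frequency vectors."""
    dot = sum(freq * ctx2.get(word, 0) for word, freq in ctx1.items())
    norm1 = math.sqrt(sum(f * f for f in ctx1.values()))
    norm2 = math.sqrt(sum(f * f for f in ctx2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def cross_lingual_similarity(ctx1, ctx2, glossary):
    """Translate the L1 context word-by-word via the glossary (step 2),
    dropping words it does not cover, then compare in L2 (step 3)."""
    ctx1_star = Counter()
    for word, freq in ctx1.items():
        if word in glossary:
            ctx1_star[glossary[word]] += freq
    return cosine(ctx1_star, ctx2)

# Toy Bulgarian -> Russian glossary and contexts for the cognate ден/день:
glossary = {"нощ": "ночь", "седмица": "неделя"}
ctx_bg = Counter({"нощ": 40, "седмица": 25, "утре": 10})
ctx_ru = Counter({"ночь": 38, "неделя": 22})
print(cross_lingual_similarity(ctx_bg, ctx_ru, glossary))
```

Cognates end up with similar translated contexts and hence a high cosine; false friends do not.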
Reverse Context Lookup
Local contexts extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.
Internet terms appear on any Web page
Such words are not likely to be genuinely associated with the target word
Will the word "flowers" appear in the local contexts of "send", "online" and "here"?
Reverse Context Lookup
If two words are semantically related, each should appear in the local context of the other
Let #(x, y) = the number of occurrences of x in the local context of y
For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows:
p(w, wc) = min{ #(w, wc), #(wc, w) }
We use p(w, wc) as vector coordinates when measuring semantic similarity
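A minimal sketch of the definition, with `occurrences` as a hypothetical table of #(x, y) counts extracted from the Web:

```python
def association_strength(occurrences, w, wc):
    """p(w, wc) = min{ #(w, wc), #(wc, w) }, where occurrences[(x, y)] is
    the number of times x appears in the local context of y."""
    return min(occurrences.get((w, wc), 0), occurrences.get((wc, w), 0))

occurrences = {
    ("online", "flower"): 120,  # parasite word: frequent in flower's context...
    ("flower", "online"): 2,    # ...but flower is rare in online's context
    ("fresh", "flower"): 80,
    ("flower", "fresh"): 75,
}
print(association_strength(occurrences, "flower", "online"))  # 2
print(association_strength(occurrences, "flower", "fresh"))   # 75
```

Taking the minimum of the two directions suppresses parasite words, which co-occur asymmetrically, while genuinely related words score high in both directions.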
Web Similarity Using Seed Words
An adaptation of the Fung & Yee '98 algorithm*
We have a bilingual glossary G: L1 → L2 of translation pairs and target words w1, w2
We search in Google for co-occurrences of the target words with the glossary entries
We compare the co-occurrence vectors:
for each {p, q} ∈ G, compare
  max( google#("w1 p"), google#("p w1") )
with
  max( google#("w2 q"), google#("q w2") )
* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998
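Building one side's co-occurrence vector can be sketched as follows; `hit_count` is a hypothetical stand-in for the number of Google hits of an exact phrase:

```python
def cooccurrence_vector(word, seeds, hit_count):
    """One coordinate per glossary seed: the larger hit count of the two
    phrase orders, max(#("word seed"), #("seed word"))."""
    return [max(hit_count(f"{word} {seed}"), hit_count(f"{seed} {word}"))
            for seed in seeds]

# Toy hit counts instead of live Google queries:
counts = {"flower rose": 50, "rose flower": 30, "flower gift": 10}
hit_count = lambda phrase: counts.get(phrase, 0)
print(cooccurrence_vector("flower", ["rose", "gift"], hit_count))  # [50, 10]
```

The vector for w1 over the L1 sides of the glossary entries is then compared with the vector for w2 over the corresponding L2 sides, e.g. with the cosine measure.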
Evaluation Data Set
We use 200 Bulgarian/Russian pairs of words:
100 cognates and 100 false friends
Manually assembled by a linguist
Manually checked in several large monolingual and bilingual dictionaries
Limited to nouns only
Experiments
We tested a few modifications of our contextual Web similarity algorithm:
Use of TF.IDF weighting
Preserving the stop words
Use of lemmatization of the context words
Use of different context sizes (2, 3, 4 and 5)
Use of a small and a large bilingual glossary
We compared it with the seed words algorithm
We also compared with traditional orthographic similarity measures: LCSR and MEDR
Experiments
BASELINE: random
MEDR: minimum edit distance ratio
LCSR: longest common subsequence ratio
SEED: the "seed words" algorithm
WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop words filtering, no lemmatization, no reverse context lookup, no TF.IDF-weighting
NO-STOP: WEB3 without stop words removal
WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5
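Minimal sketches of the two orthographic baselines, assuming both are normalized by the longer word's length (MEDR as 1 minus the normalized Levenshtein distance; the paper may normalize slightly differently):

```python
def lcsr(a, b):
    """Longest common subsequence ratio: |LCS(a, b)| / max(|a|, |b|)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)] / max(len(a), len(b))

def medr(a, b):
    """Minimum edit distance ratio: 1 - levenshtein(a, b) / max(|a|, |b|)."""
    dp = list(range(len(b) + 1))          # single-row Levenshtein DP
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j-1] + 1, prev + (ca != cb))
    return 1 - dp[len(b)] / max(len(a), len(b))

# The cognate pair ден/день from the examples slide:
print(lcsr("ден", "день"), medr("ден", "день"))  # 0.75 0.75
```

Both measures score orthographically close pairs near 1 regardless of meaning, which is why they cannot separate false friends from cognates on their own.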