
Free construction of a free dictionary of synonyms using computer science Viggo Kann and Magnus Rosell KTH, Stockholm Talk given by Viggo at Amherst College.

Page 1: Free construction of a free dictionary of synonyms using computer science Viggo Kann and Magnus Rosell KTH, Stockholm Talk given by Viggo at Amherst College.

Free construction of a free dictionary of synonyms

using computer science

Viggo Kann and Magnus Rosell

KTH, Stockholm

Talk given by Viggo at Amherst College November 11, 2006


Examples of English synonyms

Smith: A Dictionary of Synonymous Words in the English Language [1889]

CLASS. Order. Rank. Degree. Classification. Grade.

Webster’s Dictionary of Synonyms [1942]

classify. Alphabetize, pigeonhole, assort, sort. Ana. Order, arrange, systematize, methodize, marshal.


Goals

To construct a Swedish dictionary of synonyms as a list of synonymous pairs

I don’t want to work a lot

I don’t want to pay anyone to work

The resulting list should be free


Ideas

Automatically construct a large set of word pairs that might be synonyms

Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs


More ideas

Use the Lexin online Swedish-English dictionary website, which had 9 million lookups each month (now 17 million)

Users visit Lexin to translate words, and are thus probably motivated to help me

Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not


My plan

1. Construct lots of possible synonyms

2. Sort out bad synonym pairs automatically

3. Ask lots of users if the rest of the pairs are good synonyms

4. Analyze the gradings done by the users and decide which pairs to keep


Step 1: Construct lots of possible synonyms

If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish

$\{(w,v) : \exists y : y \in SE(w) \land v \in ES(y)\}$ or $\{(w,v) : \exists y : y \in SE(w) \land y \in SE(v)\}$

616 000 word pairs were generated
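
As a rough illustration of Step 1, the two pair-generation rules on this slide (translate to English and back, or share an English translation) might be sketched as follows. The dictionary entries are toy data, the function name is hypothetical, and treating pairs as unordered and dropping (w, w) are assumptions the slide does not state.

```python
def candidate_pairs(se, es):
    """Return unordered candidate synonym pairs.

    se maps a Swedish word to a set of English translations (SE),
    es maps an English word to a set of Swedish translations (ES).
    """
    pairs = set()

    # Rule 1: round trip, w -> y in SE(w) -> v in ES(y).
    for w, translations in se.items():
        for y in translations:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym
                    pairs.add(frozenset((w, v)))

    # Rule 2: shared translation, y in SE(w) and y in SE(v).
    by_english = {}
    for w, translations in se.items():
        for y in translations:
            by_english.setdefault(y, set()).add(w)
    for words in by_english.values():
        for w in words:
            for v in words:
                if w != v:
                    pairs.add(frozenset((w, v)))

    return pairs


# Toy dictionaries, invented for this example only.
SE = {"klass": {"class", "grade"}, "rang": {"rank", "grade"}, "sort": {"kind"}}
ES = {"class": {"klass"}, "grade": {"klass", "årskurs"}, "kind": {"sort", "slag"}}
PAIRS = candidate_pairs(SE, ES)
```

With real dictionaries this kind of construction over-generates heavily, which is why the later filtering steps are needed.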


Step 2: Sort out bad synonym pairs automatically

Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space

Keep pairs that have small enough distance in the vector space
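
A minimal sketch of this filtering step, assuming that "small enough distance" means a cosine at or above some threshold. The function names and the 2-dimensional toy vectors are hypothetical, and dropping pairs whose words have no vector is an assumption, not something the slides state.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_close_pairs(pairs, vectors, threshold):
    """Keep the pairs whose word vectors have cosine >= threshold."""
    kept = []
    for w, v in pairs:
        # Assumption: a pair is dropped if either word lacks a vector.
        if w in vectors and v in vectors:
            if cosine(vectors[w], vectors[v]) >= threshold:
                kept.append((w, v))
    return kept


# Toy vectors, invented for this example only.
VECTORS = {"klass": [1.0, 0.0], "rang": [1.0, 1.0], "uppdrag": [0.0, 1.0]}
CANDIDATES = [("klass", "rang"), ("klass", "uppdrag"), ("klass", "stadga")]
KEPT = keep_close_pairs(CANDIDATES, VECTORS, threshold=0.1)
```

The threshold 0.1 is used here only as an example value.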


Random Indexing

Each word w is assigned a random label vector L_w with on the order of a thousand elements

For each word w, construct a context vector C_w by adding the random label vectors for the words appearing in the context of each occurrence of w in a large corpus


Random Indexing settings

Context: 4 words to the left and 4 to the right

Stop words were removed

Dimensionality: 1800

5 corpora from different domains were used, for example newspapers and medical texts
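
The construction of label and context vectors with these settings could look roughly like the sketch below. The window size and dimensionality come from the slide; the sparse +1/-1 form of the label vectors, the number of nonzero entries, removing stop words before windowing, and all names are assumptions made for illustration.

```python
import random

DIM = 1800     # dimensionality, from the slide
WINDOW = 4     # context words on each side, from the slide
NONZERO = 8    # assumed number of +1/-1 entries per label vector


def label_vector(rng):
    """Random sparse label vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def context_vectors(tokens, stop_words=frozenset(), seed=0):
    """Return (labels, contexts) for a tokenized corpus."""
    rng = random.Random(seed)
    # Assumption: stop words are removed before the window is applied.
    tokens = [t for t in tokens if t not in stop_words]

    labels = {}
    for t in tokens:
        if t not in labels:
            labels[t] = label_vector(rng)

    contexts = {t: [0] * DIM for t in labels}
    for i, w in enumerate(tokens):
        lo = max(0, i - WINDOW)
        hi = min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Add the neighbour's label vector to w's context vector.
            label = labels[tokens[j]]
            ctx = contexts[w]
            for k in range(DIM):
                ctx[k] += label[k]
    return labels, contexts


# Toy corpus, invented for this example only.
TOKENS = "the cat sat on the mat".split()
LABELS, CONTEXTS = context_vectors(TOKENS, stop_words={"the", "on"})
```

Words that occur in similar contexts accumulate similar context vectors, so the cosine between two context vectors can serve as the distance measure.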


Number of pairs for different cos thresholds (435 000 of 616 000 pairs occurred in corpus)

[Chart: number of pairs (0 to 450 000) against cos threshold (-0.05 to 0.4); the threshold 0.1 is marked with an asterisk]


Step 3: Ask lots of users if the rest of the pairs are good synonyms

When a user has sent a word to the Lexin dictionary he receives the translation followed by a question like:

Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5 where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'


After answering, the user may:

grade a new randomly chosen word pair

look up a word in the synonym dictionary

suggest a new synonymous word pair

download the synonym dictionary in XML


Step 4: Analyzing the gradings done by the users

1.2 million gradings were made in less than 2 months

Grading statistics were analyzed on several occasions

Some users sent comments


Keeping the users happy!

Many users said that there were too many bad pairs

Lots of pairs were graded 0 (not at all synonyms) by all users. After some weeks 25 000 such pairs were removed. Later 60 000 more pairs were removed, improving the quality of the remaining pairs considerably.
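
Finding the pairs that every user graded 0 might be sketched like this. The data layout, the function name, and the `min_gradings` cutoff are assumptions; the slides do not say how many gradings a pair needed before it was removed.

```python
def unanimously_rejected(gradings, min_gradings=3):
    """Return the pairs that all graders gave a 0.

    gradings maps a word pair to its list of numeric grades (0 to 5),
    with "don't know" answers already left out.
    min_gradings is an assumed safeguard against removing a pair
    on the strength of one or two answers.
    """
    return {
        pair
        for pair, grades in gradings.items()
        if len(grades) >= min_gradings and all(g == 0 for g in grades)
    }


# Toy gradings, invented for this example only.
GRADINGS = {
    ("klass", "uppdrag"): [0, 0, 0],
    ("klass", "rang"): [5, 4, 5],
    ("klass", "stadga"): [0, 0],
}
REJECTED = unanimously_rejected(GRADINGS, min_gradings=3)
```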


User gradings first two months

[Bar chart: share of gradings (0% to 60%) for each answer: 0, 1, 2, 3, 4, 5, don't know]


More interesting gradings 2006

[Bar chart: share of gradings (0% to 60%) for each answer: 0, 1, 2, 3, 4, 5, don't know; series: 2005 and 2006]


Distribution of mean gradings of word pairs after two months

[Chart: share of word pairs (0% to 40%) by mean grading, from 0 to 5 in steps of 0.5; 217 000 pairs]


Distribution of mean gradings of word pairs 2006

[Chart: share of word pairs (0% to 40%) by mean grading, from 0 to 5 in steps of 0.5; series: 2005 and 2006]


Analysis of the pairs graded 0

Distance (cosine) in RI space

[Chart: share of pairs (0% to 50%) by cosine, from 0.1 to 0.6; series: pairs graded 0 and all pairs]


Some statistics (November 2006)

2.5 M user gradings done

67 000 pairs (graded ≥ 2) in dictionary

90 000 pairs suggested by users

50 000 unique pairs suggested

14 000 of them have been accepted


Example: Synonyms of klass (class)

5: rang (grade), rank (rank), slag (kind)

4: kategori (category), stånd (social class), årskurs (grade)

3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), ordning (order), skikt (layer), sort (sort), standard (standard), stil (style)

2: storleksordning (magnitude), typ (type)

1: poäng (point), stadga (stability)

0: uppdrag (mission), utbilda (educate)
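
A listing like this one could be produced by grouping a word's synonyms by grade, as in the sketch below. It assumes the displayed grade is the mean user grading rounded to the nearest integer, which the slides do not confirm; the mean values and all names are invented for illustration.

```python
from collections import defaultdict


def synonyms_by_grade(word, mean_grades):
    """Group the synonyms of word by rounded mean grade, best first.

    mean_grades maps a frozenset of two words to the pair's mean grade.
    """
    groups = defaultdict(list)
    for pair, mean in mean_grades.items():
        if word in pair and len(pair) == 2:
            (other,) = pair - {word}
            # Assumption: round half up to an integer grade 0 to 5.
            groups[int(mean + 0.5)].append(other)
    return {g: sorted(ws) for g, ws in sorted(groups.items(), reverse=True)}


# Invented mean grades; only the words come from the slide.
MEANS = {
    frozenset(("klass", "rang")): 4.6,
    frozenset(("klass", "kategori")): 4.1,
    frozenset(("klass", "typ")): 2.2,
    frozenset(("klass", "uppdrag")): 0.1,
    frozenset(("sort", "slag")): 4.0,
}
LISTING = synonyms_by_grade("klass", MEANS)
```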


How to prevent abuse?

Many gradings of a word pair are needed before it’s considered to be good

The pair to be graded is randomly picked from a very large list

Word pairs suggested by users are spell checked before they are added to the very large list
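
Two of these safeguards, random selection of the pair to grade and a minimum number of gradings before a pair counts as good, might be sketched as follows. The mean-grade cutoff of 2 follows the "graded ≥ 2" figure in the statistics; the value of `min_gradings`, the use of `None` for "don't know", and the function names are assumptions.

```python
import random


def pick_pair(candidates, rng=random):
    """Pick the next pair to grade uniformly at random from the list."""
    return rng.choice(candidates)


def is_good(grades, min_gradings=5, min_mean=2.0):
    """Decide whether a pair has enough good gradings to be kept.

    grades holds integers 0 to 5, or None for a "don't know" answer.
    min_gradings is an assumed value; the slides only say "many".
    """
    numeric = [g for g in grades if g is not None]
    if len(numeric) < min_gradings:
        return False
    return sum(numeric) / len(numeric) >= min_mean


# Usage with toy data, invented for this example only.
NEXT_PAIR = pick_pair([("klass", "rang")], random.Random(0))
```

Because the pair is drawn at random from a very large list, a single user has little control over which pair they grade, and a few gradings alone cannot push a pair into the dictionary.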


People's definition of synonymy

Exact meaning of 'synonym' wasn’t defined

Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair

The produced dictionary will use the people's own definition of synonymy

Hopefully this is exactly what they want!


The people’s synonym dictionary on the web

http://lexin.nada.kth.se/cgi-bin/synlex


Lessons learned

The list of suggested synonyms should be huge

Try to improve the quality of the list automatically as much as possible. Random indexing is useful for this; also try tagging and using other dictionaries

Use the 0 answers early to remove bad pairs that only irritate the users