Integrating Semantic Dictionaries for English, French and Bulgarian into the NooJ System for the Purposes of Information Retrieval. Svetla Koeva, Max Silberztein. 8th INTEX / NooJ Workshop, 30 May, 2005
Page 1:

Integrating Semantic Dictionaries for English, French and Bulgarian

into the NooJ System for the Purposes of Information Retrieval

Svetla Koeva, Max Silberztein

8th INTEX / NooJ Workshop,

30 May, 2005

Page 2:

Main research goals

• To provide a sufficient methodology for implementing natural language semantic relations in the NooJ system:

– to create specialized semantic dictionaries for English, French and Bulgarian based on WordNet semantic relations;

– to provide a complete formalization of the inflection of simple and compound words included in the WordNet structure.

Page 3:

History

• The integration of semantic relations into the INTEX system was initially proposed at the sixth INTEX workshop.

• Later the idea was developed further in the joint research RILA project "Information retrieval based on semantic relations":

– LASELDI, Université de Franche-Comté;

– Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.

Page 4:

Language resources

• Bulgarian grammatical dictionary (BGD) – over 83 000 lemmas and 1 100 000 word forms;

• English WordNet 2.0 – 115 424 synonymous sets;

• Bulgarian WordNet (BalkaNet project) – 22 867 synonymous sets;

• French WordNet (EuroWordNet project) – 33 512 synonymous sets;

• English dictionary – over 30 000 lemmas (not inflected);

• French dictionary – extracted with INTEX.

Page 5:

Implementation tasks

• To transform the format of the BGD into the NooJ standard;

• To create semantic dictionaries for Bulgarian and English;

• To associate lemmas from the Bulgarian semantic dictionaries with the corresponding inflection types;

• To add missing lemmas and inflection types in BGD, if any;

• To create extensive dictionaries and corresponding inflection types for compounds.

Page 6:

BGD – Information structure design

• Category information – 6 classes: Noun, Verb, Adjective, Pronoun, Numeral, Others (Adverb, Preposition, Conjunction, Particle, Interjection);

• Paradigmatic information – Personal, Transitive, Perfective, Common, …;

• Grammatical information – Inflection, Conjugation, Sound alternations, ….

Page 7:

BGD – Grammatical subclasses

• Nouns – 22 subclasses with respect to their Type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender;

• Verbs – 32 subclasses with respect to Transitivity, Perfectiveness, and Personality;

• Adjectives – 2 subclasses;

• Pronouns – 26 subclasses with respect to their Type and Possessor;

• Numerals – 6 subclasses.

Page 8:

BGD – Grammatical types

• Noun – Number, Definiteness, Counting form, Case, Optional forms – 266 types;

• Verb – Person, Number, Tense, Mood, Voice, Participles, Gender, Definiteness – 257 types;

• Adjective – Gender, Number, Definiteness – 30 types;

• Pronoun – Gender, Person, Number, Definiteness, Case, Clitic, Possessing – 28 types;

• Numeral – Gender, Number, Definiteness, Approximate form, Male form – 20 types.

Page 9:

BGD – Dictionary format

Lemma entries:

а, ЧА, 0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4

Inflection type ПРИ, 7:

sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'

Page 10:

Transforming BGD

[Diagram labels: Perl script; Dictionary; Grammatical types; Transliteration of labels]

Page 11:

NooJ dictionary

абсол`ютен, ПРИ, 7 → абсолютен,A+FLX=A-7
`август, С+М, 10 → август,N+M+FLX=N_M-10
авиокомп`ания, С+Ж, 1 → авиокомпания,N+F+FLX=N_F-1
австр`ийски, ПРИ, 3 → австрийски,A+FLX=A-3
автоб`ус, С+М, 11 → автобус,N+M+FLX=N_M-11
автомат`ичен, ПРИ, 7 → автоматичен,A+FLX=A-7
адрес`ирам, Г+Н+Т, 4 → адресирам,V+IT+FLX=V_IT-4
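
The entry mapping on this slide can be sketched in Python (the slides mention a Perl script, which is not reproduced here). This is only an illustration: the names `LABELS` and `bgd_to_nooj` are hypothetical, the label table covers only the labels visible on the slides, and the rule for building the FLX name (category with `+` replaced by `_`, then the type number) is inferred from the examples rather than stated by the authors.

```python
# Hypothetical transliteration table, limited to labels shown on the slides.
LABELS = {"ПРИ": "A", "С+М": "N+M", "С+Ж": "N+F", "Г+Н+Т": "V+IT"}


def bgd_to_nooj(entry: str) -> str:
    """Convert one BGD line 'lemma, labels, type' into a NooJ-style entry."""
    lemma, labels, inflection_type = (field.strip() for field in entry.split(","))
    lemma = lemma.replace("`", "")              # drop the stress mark
    category = LABELS[labels]                   # transliterate the labels
    paradigm = category.replace("+", "_") + "-" + inflection_type
    return f"{lemma},{category}+FLX={paradigm}"


if __name__ == "__main__":
    for line in ["`август, С+М, 10", "адрес`ирам,Г+Н+Т,4"]:
        print(bgd_to_nooj(line))
```

Labels outside the table raise a `KeyError`, which makes gaps in the transliteration table visible instead of silently producing wrong entries.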

Page 12:

NooJ formal descriptions

sm0, Ok, '' → A-7 = <E>/sm0 +
smh, Ok, '2RCия' → <L2><S><R>ия<S1>/smh +
sml, Ok, '2RCият' → <L2><S><R>ият<S1>/sml +
sf0, Ok, '2RCа' → <L2><S><R>а<S1>/sf0 +
sfd, Ok, '2RCата' → <L2><S><R>ата<S1>/sfd +
sn0, Ok, '2RCо' → <L2><S><R>о<S1>/sn0 +
snd, Ok, '2RCото' → <L2><S><R>ото<S1>/snd +
p0, Ok, '2RCи' → <L2><S><R>и<S1>/p0 +
pd, Ok, '2RCите' → <L2><S><R>ите<S1>/pd;
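
The rewriting of BGD rules into a NooJ paradigm could look roughly like the following Python sketch. It handles only the two patterns that appear on this slide (the empty rule and rules of the form `2RC` plus an ending); the function names are hypothetical, and how other BGD rule codes should be translated is not covered by the slides, so the sketch refuses them.

```python
import re


def bgd_rule_to_nooj(form_tag: str, rule: str) -> str:
    """Rewrite one BGD rule as a NooJ operator sequence with its form tag."""
    if rule == "":
        return f"<E>/{form_tag}"                # empty rule: the lemma itself
    match = re.fullmatch(r"2RC(.+)", rule)
    if match is None:
        # Assumption: other rule codes exist in BGD but are not shown here.
        raise ValueError(f"pattern not covered by this sketch: {rule!r}")
    return f"<L2><S><R>{match.group(1)}<S1>/{form_tag}"


def paradigm(name: str, rules: list[tuple[str, str]]) -> str:
    """Join the rewritten rules into one 'NAME = a + b + ...;' description."""
    body = " + ".join(bgd_rule_to_nooj(tag, rule) for tag, rule in rules)
    return f"{name} = {body};"


if __name__ == "__main__":
    print(paradigm("A-7", [("sm0", ""), ("smh", "2RCия"), ("pd", "2RCите")]))
```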

Page 13:

WordNet semantic relations

ILR                 POS/POS        EW2.0     BulNet
HYPERONYMY          N/N V/V        94 844    15 838
NEAR ANTONYMY       N/N A/A V/V    7 642     1 847
PART MERONYMY       N/N            8 636     1 241
MEMBER MERONYMY     N/N            12 205    841
PORTION MERONYMY    N/N            787       107
SUBEVENT            V/V            409       162
CAUSES              V/V            439       104
SIMILAR TO          A/A V/V        22 196    1 479
VERB GROUP          V/V            1 748     848
ALSO SEE            A/A V/V        3 240     895

Page 14:

Other relations

ILR                POS/POS            EW2.0     BulNet
BE IN STATE        A/N                1 296     591
BG DERIVATIVE      N/V                36 630    6 469
DERIVED            A/N                6 809     1 071
PARTICIPLE         A/V                401       56
REGION DOMAIN      N/N V/N A/N B/N    1 280     4
USAGE DOMAIN       N/N V/N A/N B/N    983       22
CATEGORY DOMAIN    N/N V/N A/N B/N    6 166     638

Page 15:

Selected relations

• Synonymy (reflexive, symmetric, and transitive relation of equivalence);

• Hypernymy (inverse, asymmetric, and transitive relation between synonym sets);

• Meronymy (inverse, asymmetric, and transitive relation between synonym sets):

– Part meronymy;

– Member meronymy;

– Portion meronymy.
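
Because hypernymy and meronymy are transitive and have an inverse, a retrieval system can follow them over several steps in either direction. The Python sketch below illustrates this under stated assumptions: the synset labels (`oak`, `tree`, `plant`) are invented toy data, not entries from the dictionaries described here, and the function names are hypothetical.

```python
def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build the inverse relation (e.g. hyponymy from hypernymy)."""
    inverse: dict[str, list[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            inverse.setdefault(target, []).append(source)
    return inverse


def closure(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Follow a transitive relation from `start`, in discovery order."""
    found: list[str] = []
    stack = list(edges.get(start, []))
    while stack:
        current = stack.pop()
        if current not in found:          # guards against cycles in the data
            found.append(current)
            stack.extend(edges.get(current, []))
    return found


# Toy hypernymy data (invented): each synset points to its hypernyms.
HYPERNYM_OF = {"oak": ["tree"], "tree": ["plant"]}

if __name__ == "__main__":
    print(closure("oak", HYPERNYM_OF))            # more general synsets
    print(closure("plant", invert(HYPERNYM_OF)))  # more specific synsets
```

Synonymy needs no traversal of this kind: as an equivalence relation it simply groups literals into one synset.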

Page 16:

Selected relations

• Similar to (symmetric relation between similar adjectival synsets);

• Verb group (symmetric relation between semantically related verb synsets);

• Also see (symmetric relation between verb or adjective synsets that are close in meaning);

• Category domain (asymmetric extralinguistic relation between synsets denoting a concept and the sphere of knowledge it belongs to).

Page 17:

DELAF semantic dictionaries

• These dictionaries consist of pairs of literals defined for the corresponding semantic relation:

– car,automobile.N

– auto,automobile.N

• All possible combinations between literals in the given synsets are listed:

– car,automobile.N

– cars,automobile.N

– auto,automobile.N

– autos,automobile.N
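
The listing of all combinations can be sketched as follows in Python. The function name `delaf_pairs` and the input layout (a mapping from each literal to its inflected forms) are assumptions made for illustration; the output format `form,target.POS` follows the examples on this slide.

```python
def delaf_pairs(forms_by_literal: dict[str, list[str]], target: str, pos: str) -> list[str]:
    """Pair every inflected form of every literal with the target literal."""
    return [
        f"{form},{target}.{pos}"
        for forms in forms_by_literal.values()
        for form in forms
    ]


if __name__ == "__main__":
    synset_forms = {"car": ["car", "cars"], "auto": ["auto", "autos"]}
    for line in delaf_pairs(synset_forms, "automobile", "N"):
        print(line)
```

The number of lines grows with the product of literals and inflected forms, which is one reason a more compact representation is attractive for large synsets.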

Page 18:

NooJ Semantic dictionaries

Synonymy relation

‘a plant consisting of buildings with facilities for manufacturing’

фабрика,N+FLX=ENG20-03196165-n
предприятие,N+FLX=ENG20-03196165-n

factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n
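
Since every literal of a synset carries the same identifier in its FLX field, synonyms (including cross-lingual ones) can be recovered by grouping entries on that identifier. The Python sketch below shows the idea on a subset of the entries above; the function names are hypothetical and the parsing assumes the simple `literal,N+FLX=ID` layout shown on the slide.

```python
from collections import defaultdict

ENTRIES = """\
фабрика,N+FLX=ENG20-03196165-n
factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n
"""


def index_by_synset(text: str) -> dict[str, list[str]]:
    """Group literals by the synset identifier stored after 'FLX='."""
    index: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        literal, info = line.split(",", 1)
        synset_id = info.split("FLX=", 1)[1]
        index[synset_id].append(literal)
    return dict(index)


def synonyms(word: str, index: dict[str, list[str]]) -> list[str]:
    """Return every other literal that shares a synset with `word`."""
    return [
        literal
        for literals in index.values()
        if word in literals
        for literal in literals
        if literal != word
    ]


if __name__ == "__main__":
    print(synonyms("mill", index_by_synset(ENTRIES)))
```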

Page 19:

NooJ Semantic dictionaries

Hypernymy relation

‘the organized action of making of goods and services for sale’

производство,N+FLX=ENG20-00859333-n
промишленост,N+FLX=ENG20-00859333-n
индустрия,N+FLX=ENG20-00859333-n

production,N+FLX=ENG20-00859333-n
industry,N+FLX=ENG20-00859333-n
manufacture,N+FLX=ENG20-00859333-n

Page 20:

Inflecting WordNet

<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam (to remove)<SENSE>…</SENSE><LNOTEGR>ГНТ12</LNOTEGR></LITERAL>
  </SYNONYM>
  <ILR>...<TYPE>...</TYPE></ILR>
  <DEF>remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract</DEF>
  <BCS>...</BCS>
</SYNSET>

Page 21:

NooJ Semantic descriptions

‘the organized action of making of goods and services for sale’

ENG20-00859333-n = <E>/Hs0 + то/Hsd + <L1>а<S1>/Hp0 + <L1>ата<S1>/Hpd
 + <L9>мишленост<S9>/Ss0 + <L9>мишлеността<S9>/Ssd + <L9>мишлености<S9>/Sp0 + <L9>мишленостите<S9>/Spd
 + <B12>индустрия/Ss0 + <B12>индустрията/Ssd + <B12>индустрии/Sp0 + <B12>индустриите/Spd;

ENG20-00859333-n = <E>/Hs
 + <B10>industry/Ss + <B10>industries/Sp0
 + <B10>manufacture/Ss + <B10>manufactures/Sp;
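
To see how such a description turns the head literal into the forms of its synonyms, the operators can be simulated with a small cursor-based interpreter. The Python sketch below assumes the usual reading of NooJ's morphological operators (`<E>` empty string, `<B>` backspace, `<L>`/`<R>` move the cursor, `<S>` delete the character at the cursor, each with an optional repeat count); it is an illustration, not NooJ's implementation, and `apply_ops` is a hypothetical name.

```python
import re

# One token is either an operator such as <B12> or a run of literal text.
TOKEN = re.compile(r"<([A-Z])(\d*)>|([^<]+)")


def apply_ops(lemma: str, command: str) -> str:
    """Apply an operator sequence to `lemma`, starting at its end."""
    chars = list(lemma)
    pos = len(chars)                       # cursor starts after the last char
    for op, count, text in TOKEN.findall(command):
        n = int(count) if count else 1
        if text:                           # literal text is inserted at the cursor
            chars[pos:pos] = list(text)
            pos += len(text)
        elif op == "E":                    # empty string: nothing to do
            continue
        elif op == "B":                    # backspace n characters
            start = max(0, pos - n)
            del chars[start:pos]
            pos = start
        elif op == "L":                    # move cursor left
            pos = max(0, pos - n)
        elif op == "R":                    # move cursor right
            pos = min(len(chars), pos + n)
        elif op == "S":                    # delete n characters at the cursor
            del chars[pos:pos + n]
        else:
            raise ValueError(f"operator not covered by this sketch: <{op}>")
    return "".join(chars)


if __name__ == "__main__":
    for command in ["<E>", "то", "<L9>мишленост<S9>", "<B12>индустрия"]:
        print(apply_ops("производство", command))
```

With `производство` as the head literal, `<L9>мишленост<S9>` rewrites it to `промишленост` and `<B12>индустрия` replaces it entirely, which matches the three Bulgarian literals of this synset.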

Page 22:

After the nice solutions

• Lemmas which are not included in the BGD:

– Classification of lemmas into existing inflection types;

– Formal description of new inflection types;

– Literals in Latin;

– Validating WordNet.

• Semantic ambiguity - literals with two inflectional descriptions in BGD;

• Compound words:

– Formal description of inflection types;

– Compounds classification.

Page 23:

NooJ Compound semantic descriptions

ENG20-04182583-n = <E>/Ss0 + <P>та/Ssd + <B>и<P><B>(и/p0 + ите/pd)
 + <B7>завод<P><B2>ен/Ss0 + <B7>завод<P><B2>ния/Ssh + <B7>завод<P><B2>ният/Ssl + <B7>заводи<P><B2>ни/Sа0 + <B7>заводи<P><B2>ните/Sа0
 + <B7>рафинерия/Ss0 + <B7>рафинерия<P>та/Ssd + <B7>рафинерии<P><B>и/Sp0 + <B7>рафинерии<P><B>ите/Spd;

Page 27:

Applications of the Semantic Dictionaries

• Information retrieval by means of semantic equivalence with synonymy dictionaries;

• Information retrieval by means of semantic specification with hyperonymy and meronymy dictionaries;

• Information retrieval by means of similarity;

• Information retrieval by means of thematic domain affiliations;

• Validation of the WordNet structure with respect to its completeness and consistency.
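
Retrieval by semantic equivalence amounts to expanding a query word with the other literals of its synset before matching. The Python sketch below illustrates this outside NooJ: the synset contents are taken from the synonymy example earlier in the deck, while the documents, function names and whole-word matching strategy are assumptions made for the illustration. Inflected forms are deliberately not handled here; in the approach described in these slides that is the job of the inflectional descriptions.

```python
import re

SYNSETS = {
    "ENG20-03196165-n": ["factory", "mill", "manufacturing plant", "manufactory"],
}


def expand(query: str, synsets: dict[str, list[str]]) -> set[str]:
    """Return the query plus every literal that shares a synset with it."""
    terms = {query}
    for literals in synsets.values():
        if query in literals:
            terms.update(literals)
    return terms


def retrieve(query: str, documents: list[str], synsets: dict[str, list[str]]) -> list[str]:
    """Return the documents containing the query or one of its synonyms."""
    terms = expand(query, synsets)
    hits = []
    for document in documents:
        lowered = document.lower()
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
            hits.append(document)
    return hits


if __name__ == "__main__":
    docs = [  # invented example sentences
        "The mill closed in 1990.",
        "A new manufacturing plant opened.",
        "The river is wide.",
    ]
    print(retrieve("factory", docs, SYNSETS))
```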

Page 28:

Future directions

• Extensions and enhancements of the semantic dictionaries by means of:

– Extension of the dictionaries' coverage;

– Addition of other semantic relations;

– Inclusion of additional information in the entries.

• Integration of multilingual semantic extraction with NooJ using the Inter-Lingual-Index relation.