Word sense disambiguation using lexicon nets

General Word Sense Disambiguation System applied to Romanian and English Languages - SenDiS -

Alin Ştefănescu, Oana Șoica, Andrei Mincă & SenDiS team

June 27, 2013

Sectoral Operational Programme "Increase of Economic Competitiveness", "Investments for your future"

Project co-financed by the European Regional Development Fund

Feb 23, 2016

Transcript
Page 1: Project co-financed by the European Regional Development Fund

Sectoral Operational Programme "Increase of Economic Competitiveness", "Investments for your future"

Project co-financed by the European Regional Development Fund

General Word Sense Disambiguation System applied to Romanian and English Languages - SenDiS -

Alin Ştefănescu, Oana Șoica, Andrei Mincă & SenDiS team

June 27, 2013

Word sense disambiguation using lexicon nets

Page 2: Introduction

Alin Ştefănescu

Page 3: The ambiguous hen

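„Găina cea nouă ne ouă nouă nouă ouă.“ (Romanian: "The new hen lays nine eggs for us." The form "nouă" appears with three meanings, "new", "to us" and "nine", and "ouă" is both the verb "lays" and the noun "eggs".)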

Image from aliexpress.com

Page 4: Natural Language Processing (NLP)

NLP develops systems that allow computers to communicate with people using everyday language.

An important area: natural language understanding

Subproblem: word sense disambiguation

Page 5: NLP @ SOFTWIN Research

NLP is an active research area at Softwin Research

biometrics is the other active area

previously, antivirus research in the same R&D department led to the creation of an award-winning, internationally certified internet security and antivirus software

Page 6: NLP @ Softwin Research - SenDiS project

SenDiS project at Softwin Research: „A general Word Sense Disambiguation System applied to Romanian and English languages“

2010-2013

co-financed through Sectoral Operational Programme “Increase of Economic Competitiveness” (POS-CCE)

team of 7-10 computer scientists and linguists

method: use of structured linguistic knowledge encoded with Softwin‘s GRAALAN formalism

previous projects: PALIROM & LINCOR (with collaborators from UB, ILIR, UPB etc)

Page 7: NLP system - GRAALAN

1. Linguistic theoretical background

2. GRAALAN Grammar Abstract Language

3. Linguistic tools

4. Linguistic knowledge bases

5. Linguistic applications


SenDiS builds upon and further develops the NLP system GRAALAN at Softwin Research

Page 8: Word Sense Disambiguation (WSD)

identify the meaning of words in context in a computational manner

very difficult problem

three main approaches:

supervised disambiguation

unsupervised disambiguation

knowledge-based disambiguation


“Tower of Babel” by Brueghel

Page 9: Dealing with ambiguity

GRAALAN knowledge bases can encode several types of ambiguities:

multiword expression (MWE) ambiguity

morphologic ambiguity (synthetic & analytic)

lexical ambiguity (synthetic & analytic)

morphemic ambiguity

syntactic ambiguity

Page 10: Lesk Algorithm - basic idea

a simple and intuitive knowledge-based WSD approach

computes the word overlap between the sense definitions (glosses) of the target words in a context

For a two-word context (w1, w2), with S1 in Senses(w1) and S2 in Senses(w2):

scoreLesk(S1, S2) = | gloss(S1) ∩ gloss(S2) |

another variant, less computationally intensive, computes the word overlap between a word's sense definition and the other words in its context

For a sense S of a word w:

scoreLeskVar(S) = | context(w) ∩ gloss(S) |
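The two Lesk scores on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the tokenizer, the small stop-word list and the English toy glosses for "pine" and "cone" are made up for the example and are not taken from the SenDiS lexicon.

```python
# Ad-hoc stop-word list (an assumption, not part of the slide).
STOP_WORDS = {"a", "of", "or", "to", "with", "which", "this"}


def tokens(text):
    """Split text into a set of lower-cased content words."""
    words = (w.strip('.,;:()"').lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def score_lesk(gloss1, gloss2):
    """scoreLesk(S1, S2) = |gloss(S1) ∩ gloss(S2)|."""
    return len(tokens(gloss1) & tokens(gloss2))


def score_lesk_var(context_words, gloss):
    """scoreLeskVar(S) = |context(w) ∩ gloss(S)|."""
    return len(tokens(" ".join(context_words)) & tokens(gloss))


def best_sense_pair(glosses1, glosses2):
    """Return (i, j, score) for the pair of senses with the largest overlap."""
    pairs = [(i, j, score_lesk(g1, g2))
             for i, g1 in enumerate(glosses1)
             for j, g2 in enumerate(glosses2)]
    return max(pairs, key=lambda p: p[2])


# Made-up toy glosses for the two-word context ("pine", "cone").
PINE = ["kind of evergreen tree with needle-shaped leaves",
        "to waste away through sorrow or illness"]
CONE = ["solid body which narrows to a point",
        "something of this shape, solid or hollow",
        "fruit of certain evergreen trees"]

print(best_sense_pair(PINE, CONE))                       # (0, 2, 1)
print(score_lesk_var(["evergreen", "forest"], CONE[2]))  # 1
```

The pairwise score has to visit every combination of senses, while the variant compares each sense of one word only with the raw context, which is why the slide calls it less computationally intensive.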

Page 11: Our approach: Lesk Algorithm extended

[Figure: the text is a sequence of words W1, W2, ..., Wn. Each word has senses S1, ..., Sk, and each sense definition is itself a sequence of words W1, W2, ..., Wm. Legend: sense definition; annotated/WSD-selected definition; link to a lexicon entry/sense; link to an annotated lexicon entry/sense; link to a non-annotated lexicon entry/sense.]

Our approach extends the reasoning of the Lesk algorithm: every annotated sense is extended with its definition, whose words in turn have disambiguated senses that are extended with their own definitions, and so on.
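A minimal sketch of this extension idea, assuming a lexicon in which each gloss word may carry a link to an already disambiguated sense. The sense ids, the toy English lexicon, the `depth` limit and the function names are all hypothetical; the slides do not describe how SenDiS actually stores or traverses its annotated definitions.

```python
# Hypothetical toy lexicon: sense id -> gloss as (word, linked sense id or None).
LEXICON = {
    "bank#0": [("institution", "institution#0"), ("money", "money#0")],
    "bank#1": [("slope", None), ("river", "river#0")],
    "institution#0": [("organization", None), ("finance", None)],
    "money#0": [("currency", None), ("finance", None)],
    "river#0": [("stream", None), ("water", "water#0")],
    "water#0": [("liquid", None)],
}


def expanded_gloss(sense, lexicon, depth):
    """Collect the gloss words of a sense plus the glosses reachable
    through annotated links, following at most `depth` links."""
    words, seen, frontier = set(), {sense}, [sense]
    for _ in range(depth + 1):
        next_frontier = []
        for current in frontier:
            for word, link in lexicon.get(current, []):
                words.add(word)
                if link is not None and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return words


def disambiguate(context_words, candidate_senses, lexicon, depth):
    """Pick the candidate whose expanded gloss overlaps the context most."""
    context = set(context_words)
    return max(candidate_senses,
               key=lambda s: len(context & expanded_gloss(s, lexicon, depth)))


print(sorted(expanded_gloss("bank#1", LEXICON, depth=0)))
print(sorted(expanded_gloss("bank#1", LEXICON, depth=1)))
print(disambiguate(["water", "boat"], ["bank#0", "bank#1"], LEXICON, depth=1))
```

With `depth=0` the context word "water" matches neither sense of "bank"; following one level of annotated links brings in the gloss of "river#0", so the overlap becomes non-empty for "bank#1".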

Page 12: Lesk Algorithm extended - example

Generic example (Principle):

<lemma>… =

Sense 1 : <word> <word> <word> <word>

Sense 2 : <word> <word> <word> <word>

Sense 3 : <word> <word> <word> <word>

<lemma>… =

Sense 1 : <word> <word> <word> <word>

Sense 2 : <word> <word> <word> <word>

Sense 3 : <word> <word> <word> <word>

<lemma>… =

Sense 1 : <word> <word> <word> <word>

Sense 2 : <word> <word> <word> <word>

Sense 3 : <word> <word> <word> <word>

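Page 13: Lesk Algorithm extended - example

Romanian example (English translations added in square brackets):

"radio" =

"0" : "Aparat de receptie radiofonica; radioreceptor." [Radio reception device; radio receiver.]

"1" : "Instalatie de transmitere a sunetelor prin unde electromagnetice, cuprinzând aparatele de emisiune şi pe cele de receptie." [Installation for transmitting sounds by electromagnetic waves, comprising the transmitting devices and the receiving ones.]

"aparat" =

"0" : "Sistem de piese care serveste pentru o operatie mecanica, tehnica, stiintifica etc." [System of parts that serves for a mechanical, technical, scientific etc. operation.]

"1" : "Sistem tehnic care transforma o forma de energie în alta." [Technical system that transforms one form of energy into another.]

"2" : "Ansamblu de organe anatomice care servesc la îndeplinirea unei functiuni fundamentale." [Set of anatomical organs that serve to carry out a fundamental function.]

"3" : "Totalitatea serviciilor sau a personalului care asigura bunul mers al unei institutii sau al unui domeniu de activitate." [All the services or staff that ensure the smooth running of an institution or of a field of activity.]

"4" : "Ansamblul mijloacelor care servesc pentru un anumit scop." [The set of means that serve a particular purpose.]

"receptie" =

"0" : "Operatie de luare în primire a unui material sau a unei lucrari, pe baza verificarii lor cantitative şi calitative." [Operation of taking delivery of a material or a piece of work, based on their quantitative and qualitative verification.]

"1" : "Serviciu într-o întreprindere hoteliera care are evidenta persoanelor aflate în hotel, face repartizarea în camere a solicitatorilor etc." [Service in a hotel that keeps records of the persons staying in the hotel, assigns rooms to applicants, etc.]

"2" : "(Tehn) Primire a unei anumite forme de energie pentru a o transforma în alta forma de energie." [(Tech.) Receiving a certain form of energy in order to transform it into another form of energy.]

"3" : "Reuniune, banchet cu caracter, festiv (În cercurile oficiale)." [Gathering, banquet of a festive character (in official circles).]

"4" : "Primire, întâmpinare (cu caracter ceremonios) a unui oaspete." [Reception, (ceremonious) welcoming of a guest.]

"radiofonic" =

"0" : "Care aparţine radiofoniei, privitor la radiofonie, care utilizeaza radiofonia." [Belonging to radio broadcasting, relating to radio broadcasting, using radio broadcasting.]

"radioreceptor" =

"0" : "Aparat folosit pentru receptionarea undelor radiofonice (prin antene), pentru transformarea lor în semnale sonore şi transmiterea lor prin intermediul difuzoarelor; radio." [Device used for receiving radio waves (through antennas), for transforming them into sound signals and transmitting them through loudspeakers; radio.]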
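As an illustration on the entries above, the pairwise Lesk overlap can be computed for "aparat" and "receptie", the two words that occur in sense "0" of "radio". This is a sketch, not the SenDiS procedure: the tokenizer and the Romanian stop-word list are assumptions made for the example, and no lemmatization is applied, so only identical word forms count as overlap.

```python
# Assumed ad-hoc Romanian stop-word list (not part of the slides).
STOP_WORDS = {"de", "care", "o", "în", "a", "al", "la", "pe", "cu", "sau",
              "şi", "un", "unei", "unui", "pentru", "etc", "lor", "are"}

# Glosses copied from the example above, indexed by sense number.
APARAT = [
    "Sistem de piese care serveste pentru o operatie mecanica, tehnica, stiintifica etc.",
    "Sistem tehnic care transforma o forma de energie în alta.",
    "Ansamblu de organe anatomice care servesc la îndeplinirea unei functiuni fundamentale.",
    "Totalitatea serviciilor sau a personalului care asigura bunul mers al unei institutii sau al unui domeniu de activitate.",
    "Ansamblul mijloacelor care servesc pentru un anumit scop.",
]
RECEPTIE = [
    "Operatie de luare în primire a unui material sau a unei lucrari, pe baza verificarii lor cantitative şi calitative.",
    "Serviciu într-o întreprindere hoteliera care are evidenta persoanelor aflate în hotel, face repartizarea în camere a solicitatorilor etc.",
    "(Tehn) Primire a unei anumite forme de energie pentru a o transforma în alta forma de energie.",
    "Reuniune, banchet cu caracter, festiv (În cercurile oficiale).",
    "Primire, întâmpinare (cu caracter ceremonios) a unui oaspete.",
]


def tokens(text):
    """Split a gloss into a set of lower-cased content words."""
    words = (w.strip('.,;:()"').lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def overlap(gloss1, gloss2):
    """Shared content words of two glosses."""
    return tokens(gloss1) & tokens(gloss2)


# Score every (aparat sense, receptie sense) pair and keep the best one.
scores = [(i, j, len(overlap(g1, g2)))
          for i, g1 in enumerate(APARAT)
          for j, g2 in enumerate(RECEPTIE)]
best = max(scores, key=lambda s: s[2])

print(best)                                     # (1, 2, 4)
print(sorted(overlap(APARAT[1], RECEPTIE[2])))  # ['alta', 'energie', 'forma', 'transforma']
```

On this data the largest overlap is between sense "1" of "aparat" and sense "2" of "receptie", which share the words "transforma", "forma", "energie" and "alta".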

Page 14: WSD using a specific lexicon network

[Figure: a LARGE lexicon net. Nodes (labelled "C-S" in the figure) stand for word senses and are connected by the “gloss tagged” relation, read in the two directions "defines" and "defined by".]
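One way to picture such a net is as a directed graph built from sense-tagged glosses, with a "defined by" edge from a sense to every sense tagged in its gloss and a "defines" edge in the opposite direction. The sketch below is an assumption about the data structure only: the sense ids `s1` to `s5` are placeholders, and the breadth-first distance is just one possible relatedness measure, not one the slides attribute to SenDiS.

```python
from collections import defaultdict, deque

# Hypothetical gloss-tagged data: sense id -> sense ids tagged in its gloss.
TAGGED_GLOSSES = {
    "s1": ["s2", "s3"],
    "s2": ["s4"],
    "s3": ["s4"],
    "s4": [],
    "s5": [],
}


def build_net(tagged_glosses):
    """Return the 'defined by' edges and their inverse, the 'defines' edges."""
    defined_by = {s: set(links) for s, links in tagged_glosses.items()}
    defines = defaultdict(set)
    for sense, links in defined_by.items():
        for target in links:
            defines[target].add(sense)
    return defined_by, dict(defines)


def distance(a, b, defined_by, defines):
    """Edges on a shortest path between two senses, ignoring edge direction;
    None when the senses are not connected."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in defined_by.get(node, set()) | defines.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


defined_by, defines = build_net(TAGGED_GLOSSES)
print(sorted(defines["s4"]))                      # ['s2', 's3']
print(distance("s1", "s4", defined_by, defines))  # 2
print(distance("s1", "s5", defined_by, defines))  # None
```

Keeping both edge directions makes it cheap to ask which senses a definition relies on and which definitions rely on a given sense.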

Page 15: SenDiS - workflow