An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí

Apr 01, 2015


Jarrett Gillett
Page 1: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.

An Unsupervised WSD Algorithm for a NLP System

Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí

Page 2: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


INDEX

Introduction
Architecture for the NLP System
WSD Method
Evaluation
Conclusions
Future Work

Page 3: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Introduction

Natural Language Processing (NLP) techniques are necessary for current information systems.

One problem of natural language is ambiguity (phonological, morphological, syntactic, semantic or pragmatic).

The resolution of lexical ambiguity is necessary for certain NLP applications: Machine Translation, Information Retrieval, Information Extraction, etc.

Page 4: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Introduction

Word Sense Disambiguation (WSD) is an intermediate task that attempts to resolve the lexical ambiguity problem by assigning each word its appropriate meaning.

WSD uses two information sources:
Context.
External knowledge sources.

WSD approaches:
Knowledge-driven.
Data-driven.

Page 5: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Introduction

WSD method characteristics:
Knowledge-driven.
Unsupervised.
Information sources: EuroWordNet (EWN) and a large untagged corpus.

Sense assignment uses paradigmatic information.

Easily adaptable to other languages.

Page 6: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Architecture for the NLP System

Figure: architecture of the system, from input to output.

INPUT: untagged text.
POS-analyser (MACO): extracts all possible POS-tags.
POS-tagger (RELAX): selects only one morphosyntactic category.
Shallow parser (TACAT): identifies the sentence's constituents.
WSD module: uses the corpus and the sense discriminators from EWN (sets of nouns derived from the lexical-semantic relations of EWN).
OUTPUT: text annotated with POS-tags, chunks and noun senses.

Page 7: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

It operates on paradigmatic information.

It extracts paradigmatic information for an ambiguous occurrence and maps it to the paradigmatic information in the lexicon.

It rests on the assumption that semantically similar words can substitute for each other in the same context and, conversely, that words that can commute in a context are likely to be semantically close.

Page 8: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

It searches for syntactic patterns in a POS-tagged corpus (the EFE News Agency corpus, over 70M words).

To identify patterns, it follows a structural criterion, using a list of basic patterns and search schemes.

Each syntactic pattern is identified at the lemma and POS levels.

Page 9: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Syntactic patterns: X-R-Y
X and Y are lexical content units (nouns, adjectives, verbs and adverbs).
R is a relational element (function words: prepositions, conjunctions, ).
The pattern expresses a syntactic relation between X and Y.

Examples:
grano (noun) de (preposition) azúcar (noun)
pasaje (noun) subterráneo (adjective)
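As a rough illustration, an X-R-Y pattern could be held in a small record that exposes both the lemma level and the POS level. The class name `Pattern`, the one-letter tags (including `P` for preposition) and the use of `None` for a pattern without a relational element are assumptions made for this sketch, not the authors' data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical token representation: (lemma, one-letter POS tag).
Token = Tuple[str, str]


@dataclass(frozen=True)
class Pattern:
    x: Token            # lexical content unit
    r: Optional[Token]  # relational element; None when no R is present (assumption)
    y: Token            # lexical content unit

    def _units(self):
        # Keep only the units that are actually present.
        return [tok for tok in (self.x, self.r, self.y) if tok is not None]

    def lemma_signature(self) -> str:
        """Lemma-level view of the pattern."""
        return " ".join(lemma for lemma, _ in self._units())

    def pos_signature(self) -> str:
        """POS-level view of the pattern."""
        return " ".join(tag for _, tag in self._units())


# The two examples from the slide, encoded with the assumed tags.
grano_de_azucar = Pattern(("grano", "N"), ("de", "P"), ("azúcar", "N"))
pasaje_subterraneo = Pattern(("pasaje", "N"), None, ("subterráneo", "A"))
```

With this encoding, `grano_de_azucar.pos_signature()` gives `"N P N"` and `pasaje_subterraneo.pos_signature()` gives `"N A"`.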

Page 10: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Definition of basic patterns:
N, N
N C N
N P N
N A
N V
A N
V N

Conjunctions = {y, e, o, u}

N: Noun
R: Adverb
A: Adjective
V: Participle Verb
C*: Conjunction
D: Determiner

Page 11: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Each basic pattern has discontinuous realisations in texts.

We pre-establish morphosyntactic schemes for the search of patterns; e.g.:

N (((R) R) A/V) , ((D) D) (((R) R) A/V) N
N (((R) R) A/V) C* ((D) D) (((R) R) A/V) N
N (((R) R) A/V) P ((D) D) (((R) R) A/V) N
N ((R) R) A (C* ((R) R) A/V)
N ((R) R) V (C* ((R) R) A/V)
(A/V C* ((D) D) (((R) R)) A N
(A/V C* ((D) D) (((R) R)) V N

The units between brackets are optional; those separated by a slash are alternatives for a position.
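A search scheme of this kind can be read as a regular expression over the sequence of POS tags. The sketch below encodes only the scheme N ((R) R) A (C* ((R) R) A/V), under the assumptions that each token is reduced to a single tag letter and that the letter `C` stands for the C* class; the function name `find_scheme` is illustrative.

```python
import re

# N ((R) R) A (C* ((R) R) A/V), one letter per token:
#   ((R) R)  -> zero, one or two adverbs     -> R{0,2}
#   A/V      -> adjective or participle verb -> [AV]
#   ( ... )  -> optional coordinated part    -> (?: ... )?
SCHEME_N_A = re.compile(r"NR{0,2}A(?:CR{0,2}[AV])?")


def find_scheme(tags: str):
    """Return the (start, end) token spans where the scheme matches."""
    return [m.span() for m in SCHEME_N_A.finditer(tags)]


# "Coronas danesas y suecas" tagged as N A C A: the whole sequence matches.
spans = find_scheme("NACA")
```

Because the match positions are token indices, the matched span can be mapped straight back to the lemmas of the sentence.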

Page 12: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

For each search scheme, we define decomposition rules in order to extract the basic patterns.

Example:

N A C* A → N A, N A

Coronas danesas y suecas → Corona danesa, Corona sueca

Each unit of the sequence is also considered at the lemma level.
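The decomposition step can be pictured as a lookup from a matched tag sequence to the positions that make up each basic pattern. This is a minimal sketch: the table `DECOMPOSITION_RULES`, its index-based format and the function `decompose` are hypothetical, and the word forms are passed in as given rather than lemmatised.

```python
# Hypothetical rule table: tag sequence -> token positions of each basic pattern.
DECOMPOSITION_RULES = {
    ("N", "A", "C", "A"): [(0, 1), (0, 3)],     # N A C* A -> N A, N A
    ("N", "A", "C", "N"): [(0, 1), (0, 2, 3)],  # N A C* N -> N A, N C N
}


def decompose(tokens):
    """tokens: list of (form, tag). Returns the basic patterns as tuples of forms."""
    signature = tuple(tag for _, tag in tokens)
    rules = DECOMPOSITION_RULES.get(signature, [])
    return [tuple(tokens[i][0] for i in positions) for positions in rules]


coronas = [("corona", "N"), ("danesa", "A"), ("y", "C"), ("sueca", "A")]
basic_patterns = decompose(coronas)
```

For the example above, `basic_patterns` holds the two N A patterns `("corona", "danesa")` and `("corona", "sueca")`; an unknown tag sequence yields an empty list.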

Page 13: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Information is extracted from two sources:
Corpus (paradigmatic information).
Sentences (syntagmatic information).

Paradigmatic information is extracted by exploiting the syntactic patterns.

Example:

Paradigmatic relations: obra, concierto, pieza

Syntagmatic relations: para órgano
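Extracting paradigmatic information from the corpus amounts to fixing part of a pattern and collecting the words that fill the remaining slot. The sketch below assumes the corpus has already been reduced to (X, R, Y) lemma triples; `slot_fillers` and the toy corpus are illustrative, not data from the EFE corpus.

```python
from collections import Counter


def slot_fillers(corpus_patterns, r, y):
    """Count the lemmas X attested in patterns X-R-Y for a fixed R and Y."""
    return Counter(x for (x, r2, y2) in corpus_patterns if (r2, y2) == (r, y))


# Toy corpus of lemma triples (invented for illustration).
toy_corpus = [
    ("obra", "para", "órgano"),
    ("concierto", "para", "órgano"),
    ("pieza", "para", "órgano"),
    ("concierto", "para", "órgano"),
    ("grano", "de", "azúcar"),
]

fillers = slot_fillers(toy_corpus, "para", "órgano")
```

Here `fillers` contains obra, concierto and pieza, the words that stand in a paradigmatic relation within the slot before "para órgano".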

Page 14: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Sense discriminators obtained from EWN:

Selection of all nouns related to each sense along the different lexical-semantic relations.

Elimination of the elements common to different senses.

The result is a set of disjoint noun sets, one for each sense of a word.
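The elimination step can be sketched as a filter that keeps, for each sense, only the nouns that occur under no other sense. The function name and the input format (a dictionary from sense to related nouns) are assumptions, and the toy data, including the shared noun "elemento", is invented for illustration rather than taken from EWN.

```python
from collections import Counter


def sense_discriminators(related_nouns):
    """related_nouns: dict sense -> set of nouns. Drops nouns shared by several senses."""
    # Count in how many senses each noun appears.
    counts = Counter(noun for nouns in related_nouns.values() for noun in set(nouns))
    # Keep only the nouns that belong to exactly one sense.
    return {
        sense: {noun for noun in nouns if counts[noun] == 1}
        for sense, nouns in related_nouns.items()
    }


toy_related = {
    "sense_a": {"flor", "semilla", "elemento"},
    "sense_b": {"teclado", "pedal", "elemento"},
}
discriminators = sense_discriminators(toy_related)
```

After filtering, the sets for the two toy senses no longer overlap, so a noun found in one of them points to a single sense.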

Page 15: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Commutative test:

Hypothesis: If two words can commute in a given context, they are likely to be semantically close.

Application: If the ambiguous word can be substituted with a sense discriminator inside a syntactic pattern, then it has the sense corresponding to that discriminator.

The algorithm operates with words from a sense-untagged corpus.

Page 16: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

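Commutative Test Algorithm

Figure: flowchart of the Commutative Test Algorithm. Labels: X – R – Y; __ – R – Y; Xk – R – Y; Xk; Corpus; sense discriminator sets SD1, ..., SDi0, ..., SDn with elements dij, di0j, dnj; YES: X_i0 – R – Y; NO: X_? – R – Y.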
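One possible reading of the commutative test, as a hedged sketch: each sense discriminator is substituted for the ambiguous word in the pattern, and a sense is proposed when the resulting pattern is attested in the corpus. The function below only handles an ambiguous word in the X position; its name, the triple format and the toy corpus are assumptions, not the authors' implementation.

```python
def commutative_test(pattern, corpus_patterns, discriminators):
    """Return {sense: discriminators that can replace X in the pattern}.

    pattern: (x, r, y) lemma triple with the ambiguous word in X.
    corpus_patterns: iterable of attested (x, r, y) triples.
    discriminators: dict sense -> set of nouns.
    """
    _, r, y = pattern
    attested = set(corpus_patterns)
    hits = {}
    for sense, nouns in discriminators.items():
        # A discriminator "commutes" if d-R-Y is attested in the corpus.
        found = {d for d in nouns if (d, r, y) in attested}
        if found:
            hits[sense] = found
    return hits


# Toy data (invented): r is None for a pattern without a relational element.
toy_corpus = [("riñón", None, "dañado"), ("árbol", None, "dañado")]
toy_sd = {3: {"músculo", "riñón", "ojo"}, 4: {"teclado", "pedal"}}
proposed = commutative_test(("órgano", None, "dañado"), toy_corpus, toy_sd)
```

With this toy data only sense 3 is proposed, because "riñón" is the only discriminator attested in the slot. An empty result corresponds to the NO branch, where the sense stays undetermined.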

Page 17: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

The WSD module has two heuristics:

H1: Commutative Test Algorithm applied to the paradigmatic information (the nouns obtained by substituting the ambiguous occurrence in the pattern).

H2: Commutative Test Algorithm applied to the syntagmatic information (the nouns obtained from the sentence).

The two heuristics act as voters for the sense assignment.
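The slides do not say how the two votes are combined, so the following is only one plausible scheme: each heuristic contributes the senses it proposes, the sense with the most votes wins, and a tie or an empty ballot leaves the occurrence untagged. The function `vote` and this tie policy are assumptions.

```python
from collections import Counter


def vote(h1_senses, h2_senses):
    """Combine the senses proposed by H1 and H2 (each a set of sense ids).

    Returns the winning sense, or None when nothing is proposed or the
    top two senses are tied (assumed policy, not stated in the source).
    """
    tally = Counter(h1_senses) + Counter(h2_senses)
    if not tally:
        return None
    ranked = tally.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


agreed = vote({3}, {3})
only_h1 = vote({3}, set())
conflict = vote({3}, {4})
```

Declining to answer on a tie trades coverage for precision.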

Page 18: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


WSD method

Example: Los enormes y continuados progresos científicos y técnicos de la Medicina actual han logrado hacer descender espectacularmente la mortalidad infantil, erradicar multitud de enfermedades hasta hace poco mortales, sustituir mediante trasplante o implantación órganos dañados o partes del cuerpo inutilizadas y alargar las expectativas de vida.

1. Input text POS-tagging.

2. Syntactic patterns identification.
2.1. Use of search schemes.
2.2. Use of decomposition rules.

Scheme: órganos dañados o partes (N A C N)
Decomposition rules: N A, N C N
Final result: órgano dañado, órgano o parte

3. Extraction of information.
3.1. From corpus: mediador, terreno, chófer, árbol, cabeza, planeta, parte, incremento, totalidad, guerrilla, programa, mitad, país, temporada, artículo, tercio
3.2. From sentence: progreso, científico, mortalidad, multitud, enfermedad, mortal, trasplante, implantación, órgano, parte, cuerpo, expectativa, vida

4. Extraction of Sense Discriminators.

Sense discriminator sets:
Sense 1: órgano vegetal, espora, flor, pera, manzana, bellota, hinojo, semilla, poro, píleo, carpóforo, ...
Sense 2: agencia, unidad administrativa, banco central, servicio secreto, seguridad social, FBI, ...
Sense 3: parte del cuerpo, trozo, músculo, riñón, oreja, ojo, glándula, lóbulo, tórax, dedo, articulación, rasgo, facción, ...
Sense 4: instrumento de viento, instrumento musical, mecanismo, aparato, teclado, pedal, corneta, ...
Sense 5: periódico, publicación, medio de comunicación, método, serie, serial, número, ejemplar, ...

5. Commutative Test.

Heuristic 1: S1 ∩ SD1 = ∅, S1 ∩ SD2 = ∅, S1 ∩ SD3 ≠ ∅, S1 ∩ SD4 = ∅, S1 ∩ SD5 = ∅
Heuristic 2: S2 ∩ SD1 = ∅, S2 ∩ SD2 = ∅, S2 ∩ SD3 ≠ ∅, S2 ∩ SD4 = ∅, S2 ∩ SD5 = ∅

6. Final sense assignment.

órgano#3: A fully differentiated structural and functional unit in an animal that is specialized for some particular function.

Page 19: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Evaluation

The WSD method was tested with the Spanish Lexical Sample task of Senseval-2.

For the evaluation, we selected all 17 nouns of this task.

We used the two heuristics H1 & H2.

Page 20: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Evaluation

Results obtained:

          Precision  Recall  Coverage
H1        0.54       0.11    0.21
H2        0.59       0.04    0.07
H1 + H2   0.56       0.15    0.27
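The three columns are linked. Under the usual Senseval-style definitions (precision = correct / attempted, coverage = attempted / total, recall = correct / total), recall is the product of precision and coverage. The short check below applies that identity to the reported figures, allowing for rounding to two decimals; the function name and the tolerance are choices made for this sketch.

```python
def recall_from(precision: float, coverage: float) -> float:
    """recall = (correct / attempted) * (attempted / total) = correct / total."""
    return precision * coverage


# (precision, recall, coverage) as reported in the results table.
results = {
    "H1": (0.54, 0.11, 0.21),
    "H2": (0.59, 0.04, 0.07),
    "H1 + H2": (0.56, 0.15, 0.27),
}

for name, (precision, recall, coverage) in results.items():
    # The reported recall should agree with precision * coverage up to rounding.
    assert abs(recall_from(precision, coverage) - recall) < 0.01, name
```

The low recall values therefore follow mainly from the low coverage: the method answers for only a fraction of the occurrences, while its precision on those answers is above 0.5.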

Page 21: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Evaluation

In Senseval-2, the values for the individual words reached the following levels:

Precision = 51.4% - 71.2%
Recall = 50.3% - 71.2%
Coverage = 98% - 100%

Page 22: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Conclusions

This WSD method can be used as a module in an NLP system to prepare an input text for a real application.

It is independent of any corpus tagging at the syntactic or semantic level.

It requires only a minimal preprocessing phase (POS-tagging) of the input text and of the search corpus.

Page 23: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Future work

Study of different possibilities to improve the WSD process.

Application of new algorithms to the information associated with the ambiguous occurrence.

Combination with other data-driven WSD methods.

Page 24: An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí.


Thank you!!