Finding Domain Terms using Wikipedia

Post on 09-Jan-2016

Jorge Vivaldi Palatresi
Applied Linguistics Institute, Universitat Pompeu Fabra
jorge.vivaldi@upf.edu

Horacio Rodríguez Hontoria
TALP Research Center, Universitat Politècnica de Catalunya
horacio@lsi.upc.es


Outline

• Introduction
• Related approaches
• Methodology
• Evaluation
• Conclusions and future work

Introduction

• Problem: to automatically extract terminological units from specialized texts

• Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest


Related approaches

• Magnini et al., 2000
• Montoyo et al., 2001
• Missikoff et al., 2002
• Vivaldi, Rodríguez, 2002
• Vivaldi, Rodríguez, 2004
• Bernardini et al., 2006
• Cui et al., 2008

Graph structure of Wikipedia

[Figure: the WP categories (nodes A to G) form a graph linked to the WP pages (P1, P2, P3). Further resources shown: redirection table, disambiguation pages, interwiki links, external links, infoboxes.]

Methodology: overview

[Figure: pipeline diagram with the labels: domain, WP, Categories, Pages, top categories, domain categories, domain pages, final domain term set, filtering (twice), bootstrapping.]

Main steps:

1) Find the domain name in WP as a category.
2) Look for all the subcategories and pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
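Step 3 (extracting all descendants while avoiding loops) can be illustrated with a breadth-first traversal that remembers which categories it has already visited. This is only a minimal sketch: the function name `collect_descendants`, the dictionary-based graph, and the toy data are assumptions made for illustration, not the authors' implementation.

```python
from collections import deque


def collect_descendants(subcats, root):
    """Return every category reachable from `root`, visiting each one once.

    `subcats` maps a category name to the list of its direct subcategories
    (a hypothetical in-memory stand-in for the WP category graph).
    """
    seen = {root}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            if child not in seen:  # loop guard: never expand a category twice
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical toy graph containing a loop (C lists A as a subcategory).
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["A", "D"], "D": []}
descendants = collect_descendants(toy_graph, "A")
print(sorted(descendants))
```

The visited set is what keeps the traversal finite when the category graph contains cycles.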

Methodology: filtering

• Category level

• Page level

Methodology: filtering

• Category level

[Figure: a category C below the top category of the domain (CatSet1). Its direct super-categories are split into those in CatSet1, those not in CatSet1, and neutral ones; these groups determine the category score.]

Methodology: filtering

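• Page level

[Figure: a page P that belongs to a category C below the top category of the domain (CatSet2). The categories of P are split into those in CatSet2, those not in CatSet2, and neutral ones; these groups determine the page score.]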

Methodology: category filtering

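CatSet1: set of descendant categories

\[
\forall c \in CatSet_1:\quad
\begin{cases}
n_1 = \#\{\, a : a \text{ is a direct supercategory of } c,\ a \in CatSet_1 \,\} \\
n_2 = \#\{\, a : a \text{ is a direct supercategory of } c,\ a \notin CatSet_1 \,\} \\
\text{accept } c \text{ if } (n_1 > n_2)
\end{cases}
\]

CatSet2: set of filtered descendant categories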
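The category-level test can be sketched as a comparison between the number of direct super-categories inside and outside CatSet1. Several points are assumptions made for illustration: the strict "greater than" comparison, always keeping the top category, leaving neutral super-categories out of both counts, and the names `filter_categories`, `supercats`, and `neutral`.

```python
def filter_categories(cat_set1, supercats, top, neutral=frozenset()):
    """Return CatSet2: the categories of CatSet1 that pass the test.

    cat_set1  -- set of descendant categories of the domain
    supercats -- maps a category to its direct super-categories
    top       -- top category of the domain (kept unconditionally, an assumption)
    neutral   -- super-categories ignored in both counts (an assumption)
    """
    cat_set2 = {top}
    for c in cat_set1:
        if c == top:
            continue
        parents = [a for a in supercats.get(c, ()) if a not in neutral]
        n1 = sum(1 for a in parents if a in cat_set1)      # inside CatSet1
        n2 = sum(1 for a in parents if a not in cat_set1)  # outside CatSet1
        if n1 > n2:
            cat_set2.add(c)
    return cat_set2


# Hypothetical toy data, not taken from Wikipedia.
cat_set1 = {"chemistry", "electrochemistry", "cheeses"}
supercats = {
    "electrochemistry": ["chemistry"],
    "cheeses": ["chemistry", "dairy products", "foods"],
}
cat_set2 = filter_categories(cat_set1, supercats, top="chemistry")
print(sorted(cat_set2))
```

In this toy data "cheeses" has one super-category inside the set and two outside, so it is dropped.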

Methodology: page filtering

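termCats(dtc): set of categories assigned to a dtc

\[
PathToDomainCat(c) =
\begin{cases}
1 & \text{if } c \in CatSet_2 \\
0 & \text{if } c \notin CatSet_2
\end{cases}
\]

\[
WPDC(dtc) = \frac{\sum_{c \in termCats(dtc)} PathToDomainCat(c)}{\sum_{c \in termCats(dtc)} \bigl(1 - PathToDomainCat(c)\bigr)}
\]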

Additional category filtering using pages scores:

catTerm: set of pages associated with a category

- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number of elements with a negative score.

- MicroLoose: same as MicroStrict, but with a greater-or-equal test.

- Macro: instead of counting the pages with positive/negative scores, we use the components of those scores.
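The MicroStrict and MicroLoose tests differ only in how a tie is treated, which a short sketch makes concrete. It assumes that each page in catTerm carries a signed numeric score; the function name `micro_accept` and the sample scores are hypothetical.

```python
def micro_accept(page_scores, strict=True):
    """Accept a category from the signs of the scores of its pages.

    strict=True  -> MicroStrict: positives must outnumber negatives
    strict=False -> MicroLoose: a tie is also accepted
    """
    positive = sum(1 for s in page_scores if s > 0)
    negative = sum(1 for s in page_scores if s < 0)
    return positive > negative if strict else positive >= negative


# Hypothetical scores for the pages of one category: two positive, two negative.
scores = [0.5, -0.2, 0.0, -0.1, 0.3]
strict_result = micro_accept(scores, strict=True)
loose_result = micro_accept(scores, strict=False)
print(strict_result, loose_result)
```

With a tie between positive and negative pages, only the loose variant accepts the category.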

Page filtering example: “semantics” (in Computing domain)

[Figure: fragment of the category graph with the nodes Computing, theoretical computer science, software, software engineering, and formal methods, and the page semantics.]

Categories of semantics: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}

WPCD(semantics) = 0.25
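The value 0.25 is consistent with reading the page score as the ratio between the categories of the page that belong to the domain and those that do not: of the five categories of "semantics", only "theoretical computer science" lies in the Computing domain, giving 1/4. The sketch below checks that arithmetic. The ratio reading, the function name `domain_coefficient`, and the contents of `computing_cats` (taken from the labels in the figure) are assumptions, not a confirmed definition.

```python
def domain_coefficient(term_cats, domain_cats):
    """Ratio of in-domain to out-of-domain categories of a page (assumed reading)."""
    inside = sum(1 for c in term_cats if c in domain_cats)
    outside = len(term_cats) - inside
    if outside == 0:
        # Every category is in the domain: the ratio is unbounded.
        return float("inf") if inside else 0.0
    return inside / outside


# Categories of the page "semantics", as listed on the slide.
semantics_cats = {
    "linguistics",
    "philosophy of language",
    "semiotics",
    "theoretical computer science",
    "philosophical logic",
}
# Assumed subset of the filtered Computing categories (hypothetical).
computing_cats = {
    "theoretical computer science",
    "software",
    "software engineering",
    "formal methods",
}
score = domain_coefficient(semantics_cats, computing_cats)
print(score)
```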

Category filtering example using pages score: “chemistry”

#  DTC                                  MicroStrict   MicroLoose   Macro      Vote  Result
                                        ok    ko      ok    ko     ok    ko
1  electroquímica (electrochemistry)    13     5      16     2     36    12   +3    Accept
2  quesos (cheeses)                      0     8       6     2      8    12   -1    Reject
3  óxidos de carbono (carbon oxides)     1     1       2     0      4     3   +2    Accept
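One reading that reproduces the Vote column is that each of the three criteria contributes +1 when ok exceeds ko, -1 when ko exceeds ok, and 0 on a tie, and that a category is accepted when the total is positive. The sketch below applies that reading to the three rows of the table. The aggregation rule and the acceptance threshold are assumptions inferred from the figures shown; the names `sign` and `vote` are hypothetical.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def vote(pairs):
    """Sum the per-criterion votes for a list of (ok, ko) pairs."""
    return sum(sign(ok - ko) for ok, ko in pairs)


# (ok, ko) pairs for MicroStrict, MicroLoose and Macro, copied from the table.
rows = {
    "electroquímica": [(13, 5), (16, 2), (36, 12)],
    "quesos": [(0, 8), (6, 2), (8, 12)],
    "óxidos de carbono": [(1, 1), (2, 0), (4, 3)],
}
votes = {name: vote(pairs) for name, pairs in rows.items()}
results = {name: "Accept" if v > 0 else "Reject" for name, v in votes.items()}
for name in rows:
    print(name, votes[name], results[name])
```

Under this reading the tie in the third row (1 ok, 1 ko) contributes nothing, which is why its total is +2 rather than +3 or +1.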

Evaluation

• Partial evaluation: “chemistry” and “astronomy”
  – Test against Magnini et al., 2000 (WordNet 1.6)
  – Low coverage: 25% for Chemistry and 15% for Astronomy

• Full evaluation: “Medicine”
  – Test against SNOMED-CT Spanish Edition (2009)
  – Wide coverage of the clinical domain: 800K terms

Partial evaluation

Domain                           Chemistry           Astronomy
Language                         EN        ES        EN        ES
Initial categories               188374    2070      188816    44631
Categories after pruning         1334      557       790       143
Iteration #1
  Categories                     49        43        5         6
  Precision [%]                  93.9      62.8      0         16.7
  Pages found (loose)            833       1038      284       119
  Pages found (strict)           580       700       284       81
  Page prec. [%] (loose)         61.3      52.6      34.8      31.9
  Page prec. [%] (strict)        62.7      56.6      37.2      27.2

[Figure: Chemistry. Precision (y-axis, 50 to 70) per iteration (x-axis, 1 to 6); series: EN-loose, EN-strict, ES-loose, ES-strict.]

[Figure: Astronomy. Precision (y-axis, 20 to 50) per iteration (x-axis, 1 to 6); series: EN-loose, EN-strict, ES-loose, ES-strict.]

Full evaluation

Evaluation using                 WN        SNOMED-CT
Initial categories               2431
Categories after pruning         839
Iteration #1
  Categories                     174       394
  Precision [%]                  27.6      54
  Pages (loose)                  2091      4182
  Pages (strict)                 1724      3492
  Page prec. [%] (loose)         21.0      58
  Page prec. [%] (strict)        23.2      62

[Figure: Medicina (Medicine). Precision (y-axis, 10 to 70) per iteration (x-axis, 1 to 6); legend: ES-loose-WN, ES-loose-WN, ES-loose-SNOMED, ES-strict-SNOMED.]

Validation issues

Accepts: whisky, cigar, udder, fire

Rejects: oral cancer, renal colic, phoniatrics, surgical instruments


Conclusions

• Good results when evaluated against a specialised resource

• Term list filtering must be improved (e.g., by eliminating proper names)


Future work

• Apply this method to other languages/domains

• Improve filtering using in/out links of selected pages

• Improve filtering by also using the page content

• Use this WP knowledge to improve a term extractor
