Unsupervised Ontology Acquisition from plain texts: The OntoGain method Efthymios Drymonas Kalliopi Zervanou Euripides G.M. Petrakis Intelligent Systems.

Post on 17-Dec-2015


Unsupervised Ontology Acquisition from plain texts: The OntoGain method

Efthymios Drymonas, Kalliopi Zervanou, Euripides G.M. Petrakis

Intelligent Systems Laboratory, http://www.intelligence.tuc.gr

Technical University of Crete (TUC), Chania, Greece

OntoGain

A platform for unsupervised ontology acquisition from text

Application independent

Ontology of multi-word term concepts

Adjusts existing methods for taxonomy and relation acquisition to handle multi-word concepts

Outputs the ontology in OWL

Good results on medical and computer science corpora


Why multi-word term concepts?

Account for the majority of terminological expressions

Convey classificatory information, expressed as modifiers

e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”

Lead to a more expressive and compact ontology lexicon


Ontology Learning Steps

Concept Extraction: C/NC-value

Taxonomy Induction: Clustering, Formal Concept Analysis

Non-taxonomic Relations: Association Rules, Probabilistic algorithm


The C/NC-Value method [Frantzi et al., 2000]

Identifies multi-word term phrases denoting domain concepts

Noun phrases are extracted first, using the pattern: ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun

C-Value: term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms

NC-Value: uses context information (valid terms tend to appear in specific contexts and co-occur with other terms)
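The noun-phrase filter above can be read as a regular expression over part-of-speech tags. The sketch below is only an illustration of that reading: the one-letter tag encoding (A, N, P) and the function name `is_candidate` are assumptions made here, not part of OntoGain.

```python
import re

# Assumed one-letter encoding of POS tags (not from the slides):
# A = adjective, N = noun, P = preposition.
# Pattern: ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
NP_FILTER = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def is_candidate(tags):
    """Return True if the whole tag sequence matches the noun-phrase filter."""
    return NP_FILTER.fullmatch("".join(tags)) is not None

# "carotid artery disease", tagged here as adjective noun noun
print(is_candidate(["A", "N", "N"]))
# A sequence ending in a preposition is rejected
print(is_candidate(["N", "P"]))
```

Candidate phrases that pass the filter would then be scored statistically.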

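C-Value: Statistical Part

For a candidate term a:

f(a): total frequency of occurrence of a

f(b): frequency of a as part of a longer term b

T_a: the set of longer terms containing a; P(T_a): the number of these longer terms

|a|: the length of the candidate string (in words)

$$
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested} \\
\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
$$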
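As a minimal sketch of the C-value score defined above, the function below takes a candidate term, a frequency table, and the list of longer candidate terms that contain it. The data layout (word tuples, a plain dict) and the toy frequencies are assumptions for illustration only.

```python
import math

def c_value(term, freq, longer_terms):
    """C-value of a candidate term.

    term         -- tuple of words, e.g. ("artery", "disease")
    freq         -- dict mapping each candidate term to its corpus frequency f(.)
    longer_terms -- candidate terms that contain `term` (the set T_a)
    """
    weight = math.log2(len(term))          # log2 |a|
    if not longer_terms:                   # a is not nested
        return weight * freq[term]
    nested = sum(freq[b] for b in longer_terms) / len(longer_terms)
    return weight * (freq[term] - nested)

# Toy frequencies, invented for this example.
freq = {
    ("artery", "disease"): 10,
    ("carotid", "artery", "disease"): 4,
}
print(c_value(("artery", "disease"), freq, [("carotid", "artery", "disease")]))  # 6.0
```

Because of the log2 |a| factor, a single-word candidate always scores 0 under this formula, which fits the method's focus on multi-word terms.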

Concept Extraction

C/NC-Value sample results

output term                    C/NC-value
web page                       1740.11
information retrieval          1274.14
search engine                  1103.99
machine learning               727.70
computer science               723.82
experimental result            655.125
text mining                    645.57
natural language processing    582.83
world wide web                 557.33
large number                   530.67
artificial intelligence        515.73
relevant document              468.22
similarity measure             464.64
information extraction         443.29
knowledge discovery            435.79


Ontology Learning Steps

Preprocessing, Concept Extraction, Taxonomy Induction, Non-taxonomic Relations


Taxonomy Induction

Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms

Two methods in OntoGain: agglomerative clustering and Formal Concept Analysis (FCA)

Agglomerative Clustering

Proceeds bottom-up: at each step, the most similar clusters are merged

Initially each term is considered a cluster

Similarity between all pairs of clusters is computed

The most similar clusters are merged as long as they share terms with common heads

Similarity: group average for clusters, Dice-like formula for terms
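The clustering procedure above can be sketched roughly as follows. Several details are assumptions made for this example and are not specified in the slides: term similarity is the Dice coefficient over word sets, the head of a term is its last word, input terms are distinct, and merging stops once no two clusters share a head.

```python
from itertools import combinations

def dice(t1, t2):
    """Dice coefficient over the word sets of two terms."""
    a, b = set(t1.split()), set(t2.split())
    return 2 * len(a & b) / (len(a) + len(b))

def head(term):
    return term.split()[-1]  # assumption: the head is the last word

def group_average(c1, c2):
    """Average pairwise term similarity between two clusters."""
    return sum(dice(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def share_head(c1, c2):
    return bool({head(t) for t in c1} & {head(t) for t in c2})

def cluster(terms):
    """Bottom-up merging; returns the final clusters and the merge history."""
    clusters = [frozenset([t]) for t in terms]
    merges = []
    while True:
        candidates = [(group_average(a, b), a, b)
                      for a, b in combinations(clusters, 2) if share_head(a, b)]
        if not candidates:
            break
        _, a, b = max(candidates, key=lambda c: c[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return clusters, merges

final, history = cluster(["hierarchical method", "agglomerative method",
                          "agglomerative clustering method", "web page"])
print(final)
```

The merge history, read from first to last, gives the bottom-up hierarchy; here the three "method" terms end up in one cluster and "web page" stays on its own.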


Formal Concept Analysis (FCA) [Ganter et al., 1999]

FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)

Finds common attributes (verbs) between objects and forms object clusters that share common attributes

Formal concepts are connected by the sub-concept relationship:

$$(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \;(\iff A_1 \supseteq A_2)$$

FCA Example

Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)

Verbs (columns): submit, test, describe, print, compute, search

Html form                  * * *
Hierarchical clustering    * *
Text retrieval             *
Root node                  * * * *
Single cluster             * * *
Web page                   * *

(The column positions of the marks are not preserved in this transcript.)
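To make the notion of a formal concept concrete, here is a small brute-force sketch that enumerates all formal concepts of a term/verb context. The placement of the marks in `CONTEXT` is an assumption: it was chosen to match the number of marks per row and the two formal concepts quoted on the FCA Taxonomy slide, and is not recovered from the original matrix. The enumeration is exponential and only meant for toy inputs.

```python
from itertools import combinations

# Hypothetical context: which verbs occur with which terms (assumed placement).
CONTEXT = {
    "html form": {"submit", "print", "search"},
    "hierarchical clustering": {"compute", "search"},
    "text retrieval": {"search"},
    "root node": {"test", "describe", "compute", "search"},
    "single cluster": {"describe", "compute", "search"},
    "web page": {"print", "search"},
}

def formal_concepts(context):
    """Return all (extent, intent) pairs of the given context."""
    all_attributes = frozenset().union(*context.values())
    intents = {all_attributes}
    # Every intent is an intersection of the attribute sets of some objects.
    for r in range(1, len(context) + 1):
        for objs in combinations(context, r):
            intents.add(frozenset.intersection(
                *(frozenset(context[o]) for o in objs)))
    concepts = []
    for intent in intents:
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return concepts

def is_subconcept(c1, c2):
    """(O1, A1) <= (O2, A2) iff O1 is a subset of O2."""
    return c1[0] <= c2[0]

for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts with `is_subconcept` yields the concept lattice from which the taxonomy is read off.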

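FCA Taxonomy

Formal concepts:

({hierarchical clustering, root node, single cluster}, {compute, search})

({html form, web page}, {print, search})

Not all dependencies (c, v) between a concept c and a verb v are interesting; each is weighted by the conditional probability below and compared against a threshold t:

$$P(c \mid v) = \frac{f(c, v)}{f(v)}$$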
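A minimal sketch of this filtering step, estimating P(c|v) = f(c, v) / f(v) from a list of observed (concept, verb) pairs. The strict comparison against t, the function name, and the toy counts are assumptions for illustration.

```python
from collections import Counter

def filter_dependencies(pairs, t):
    """Keep (concept, verb) pairs whose P(c|v) = f(c, v) / f(v) exceeds t."""
    pair_freq = Counter(pairs)
    verb_freq = Counter(v for _, v in pairs)
    kept = {}
    for (c, v), n in pair_freq.items():
        p = n / verb_freq[v]
        if p > t:  # assumption: strict threshold
            kept[(c, v)] = p
    return kept

# Toy observations, invented for this example.
pairs = ([("web page", "print")] * 3 + [("html form", "print")]
         + [("root node", "compute")] * 2)
print(filter_dependencies(pairs, 0.5))
# {('web page', 'print'): 0.75, ('root node', 'compute'): 1.0}
```

Only the surviving pairs would be marked in the term/verb matrix that FCA takes as input.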

Non-Taxonomic Relations extraction phase

Concept Extraction, Taxonomy Induction, Non-Taxonomic Relations

Non-Taxonomic Relations

Concepts are also characterized by attributes and relations to other concepts in the hierarchy

Typically expressed by a verb relating a pair of concepts

Two approaches: association rules and a probabilistic algorithm

Association Rules [Agrawal et al., 1993]

Introduced to predict the purchase behavior of customers

Extract terms connected by a subject-verb-object relation

Enhance with general terms from the taxonomy

Eliminate redundant relations: those with predictive accuracy < t
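The rule-mining step can be illustrated with the usual support and confidence measures over subject-verb-object triples. This is a sketch under stated assumptions: treating rule confidence as the "predictive accuracy" being thresholded is a guess, the enrichment with more general taxonomy terms is omitted, and the triples and their counts are invented.

```python
from collections import Counter

def mine_rules(triples, min_support, min_confidence):
    """Return (domain, range, support, confidence) for rules domain => range."""
    n = len(triples)
    pair_count = Counter((s, o) for s, _, o in triples)
    subj_count = Counter(s for s, _, _ in triples)
    rules = []
    for (s, o), k in pair_count.items():
        support = k / n                  # share of all triples containing both
        confidence = k / subj_count[s]   # share of the domain's triples
        if support >= min_support and confidence >= min_confidence:
            rules.append((s, o, support, confidence))
    return rules

# Toy triples (subject, verb, object); the counts are made up.
triples = ([("search engine", "retrieve", "web page")] * 3
           + [("search engine", "use", "similarity measure")])
print(mine_rules(triples, min_support=0.5, min_confidence=0.6))
# [('search engine', 'web page', 0.75, 0.75)]
```

The verb of the retained triples would then serve as the label of the relation.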

Association Rules: Example

Domain                          Range                           Label
chiasmal syndrome               pituitary disproportion         cause by
medial collateral ligament      surgical treatment              need
blood transfusion               antibiotic prophylaxis          result
lipid peroxidation              cardiopulmonary bypass          lead to
prostate specific antigen       prostatectomy                   follow
chronic fatigue syndrome        cardiac function                yield
right ventricular infarction    radionuclide ventriculography   analyze by
creatinine clearance            arteriovenous hemofiltration    achieve
cardioplegic solution           superoxide dismutase            give
bacterial translocation         antibiotic prophylaxis          decrease
accurate diagnosis              clinical suspicion              depend
ultrasound examination          clinical suspicion              give
total body oxygen consumption   epidural analgesia              attenuate by
coronary arteriography          physician                       perform by


Probabilistic approach [Cimiano et al., 2006]

Collect verbal relations from the corpus

Find the most general relation with respect to the verb, using frequency of occurrence:

Suffer_from(man, head_ache)

Suffer_from(woman, stomach_ache)

Suffer_from(patient, ache)

Select relationships satisfying a conditional probability measure

Associations > t become accepted


Evaluation

Relevance judgments are provided by humans

Measures: precision and recall

We examined the 200 top-ranked concepts and their respective relations in 500 lines

Results on the OhsuMed and Computer Science corpora
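For reference, precision and recall against human relevance judgments can be computed as in the sketch below. The function name and the toy term lists are illustrative assumptions, not the evaluation data of the study.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted items against human-judged relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of 4 extracted terms are judged relevant, out of 5 relevant terms.
extracted = ["web page", "large number", "search engine", "text mining"]
relevant = ["web page", "search engine", "text mining",
            "machine learning", "information retrieval"]
print(precision_recall(extracted, relevant))  # (0.75, 0.6)
```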


Results


Processing Layer          Method                    Precision (OhsuMed)   Recall (OhsuMed)   Precision (Comp. Science)   Recall (Comp. Science)
Concept Extraction        C/NC-Value                89.7%                 91.4%              86.7%                       89.6%
Taxonomic Relations       Formal Concept Analysis   47.1%                 41.6%              44.2%                       48.6%
Taxonomic Relations       Hierarchical Clustering   71.2%                 67.3%              71.3%                       62.7%
Non-Taxonomic Relations   Association Rules         71.8%                 67.7%              72.8%                       61.7%
Non-Taxonomic Relations   Probabilistic             62.7%                 55.9%              61.6%                       49.4%

Comparison with Text2Onto [Cimiano & Völker, 2005]

Text2Onto produces huge lists of plain single-word terms, and relations lacking semantic meaning

Text2Onto cannot work with big texts

Text2Onto cannot export results in OWL

Conclusions

OntoGain:

Multi-word term concepts

Exports ontology in OWL

Domain independent

Results:

C/NC-Value yields good results

Clustering outperforms FCA

Association Rules perform better than Verbal Expressions


Future Work

Explore more methods / combinations, e.g., clustering, FCA

Hearst patterns for discovering additional relation types (Part-of)

Discover attributes and cardinality constraints

Incorporate term similarity information from WordNet, MeSH

Resolve term ambiguities


Thank you!

Questions?


Preprocessing

Tokenization, POS tagging, shallow parsing (OpenNLP suite)

Lemmatization (WordNet Java Library)

Applied to all steps of OntoGain

Shallow parsing is used in relation acquisition for the detection of verbal dependencies


Terms sharing a head tend to be similar (e.g. hierarchical method and agglomerative method are both methods)

Nested terms are related to each other (e.g. agglomerative clustering method and clustering method should be associated)
