Unsupervised Ontology Acquisition from plain texts: The OntoGain method Efthymios Drymonas Kalliopi Zervanou Euripides G.M. Petrakis Intelligent Systems.
Post on 17-Dec-2015
Unsupervised Ontology Acquisition from plain texts: The OntoGain method
Efthymios Drymonas, Kalliopi Zervanou, Euripides G.M. Petrakis
Intelligent Systems Laboratory, http://www.intelligence.tuc.gr
Technical University of Crete (TUC), Chania, Greece
OntoGain
A platform for unsupervised ontology acquisition from text
Application independent
Ontology of multi-word term concepts
Adjusts existing methods for taxonomy and relation acquisition to handle multi-word concepts
Outputs ontology in OWL
Good results on medical and computer science corpora
Why multi-word term concepts?
Majority of terminological expressions
Convey classificatory information, expressed as modifiers
e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”
Leads to a more expressive and compact ontology lexicon
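The modifier example above can be sketched in a few lines of Python. This is only an illustration, assuming head-final English noun phrases where the last word is the head; the function name `broader_terms` is hypothetical and not part of OntoGain.

```python
def broader_terms(term: str) -> list[str]:
    """Return the chain of broader terms obtained by dropping leading modifiers."""
    words = term.split()
    # Assumption: the last word is the head, so each dropped modifier
    # yields a more general term.
    return [" ".join(words[i:]) for i in range(1, len(words))]


print(broader_terms("carotid artery disease"))  # ['artery disease', 'disease']
```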
Ontology Learning Steps
Concept Extraction: C/NC-value
Taxonomy Induction: Clustering, Formal Concept Analysis
Non-taxonomic Relations: Association Rules, Probabilistic algorithm
The C/NC-Value method [Frantzi et al., 2000]
Identifies multi-word term phrases denoting domain concepts
Noun phrases are extracted first:
((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
C-Value: term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms
NC-Value: uses context information (valid terms tend to appear in specific contexts and co-occur with other terms)
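The noun-phrase filter above can be checked against a sequence of POS tags with an ordinary regular expression. The sketch below assumes Penn Treebank tags (JJ, NN, NNS, IN) and maps each tag to a single letter; the names `TAG_LETTER`, `NP_FILTER` and `is_candidate` are hypothetical.

```python
import re

# One letter per POS tag: A = adjective, N = noun, P = preposition.
# Assumption: Penn Treebank tags; any other tag maps to "x" and blocks a match.
TAG_LETTER = {"JJ": "A", "NN": "N", "NNS": "N", "IN": "P"}

# ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
NP_FILTER = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def is_candidate(tags):
    """True if the whole tag sequence matches the noun-phrase filter."""
    letters = "".join(TAG_LETTER.get(tag, "x") for tag in tags)
    return NP_FILTER.fullmatch(letters) is not None


print(is_candidate(["JJ", "NN"]))        # adjective + noun
print(is_candidate(["NN", "IN", "NN"]))  # noun + preposition + noun
print(is_candidate(["VB", "NN"]))        # verb + noun
```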
C-Value: Statistical Part
For candidate term a:
f(a): total frequency of occurrence
f(b): frequency of a as part of longer terms
P(T_a): number of these longer terms
|a|: the length of the candidate string

$$
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & a \text{ is not nested} \\
\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
$$
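A minimal Python sketch of the C-value formula follows. It assumes that |a| is counted in words and that a term is nested when it appears as a contiguous word sequence inside a longer candidate; the frequency table is invented for illustration, and the helper names are hypothetical.

```python
import math


def contains(longer_words, words):
    """True if `words` occurs as a contiguous sub-sequence of `longer_words`."""
    n = len(words)
    return any(longer_words[i:i + n] == words
               for i in range(len(longer_words) - n + 1))


def c_value(term, freq):
    """C-value of `term`, where `freq` maps candidate terms to frequencies."""
    words = term.split()
    weight = math.log2(len(words))  # log2 |a|, with |a| counted in words
    # T_a: longer candidate terms that contain `term`.
    longer = [b for b in freq if b != term and contains(b.split(), words)]
    if not longer:  # `term` is not nested
        return weight * freq[term]
    nested_freq = sum(freq[b] for b in longer)
    return weight * (freq[term] - nested_freq / len(longer))


# Hypothetical frequencies, not taken from the OntoGain corpora.
freq = {
    "information retrieval": 10,
    "information retrieval system": 4,
    "text information retrieval": 2,
}
print(c_value("information retrieval", freq))  # 7.0
```

Note that with this weighting a single-word candidate always scores 0, because log2(1) = 0.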
Concept Extraction
C/NC-Value sample results
output term                   c-nc value
web page                      1740.11
information retrieval         1274.14
search engine                 1103.99
machine learning              727.70
computer science              723.82
experimental result           655.125
text mining                   645.57
natural language processing   582.83
world wide web                557.33
large number                  530.67
artificial intelligence       515.73
relevant document             468.22
similarity measure            464.64
information extraction        443.29
knowledge discovery           435.79
Ontology Learning Steps
Preprocessing
Concept Extraction
Taxonomy Induction
Non-taxonomic Relations
Taxonomy Induction
Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms
Two methods in OntoGain:
Agglomerative clustering
Formal Concept Analysis (FCA)
Agglomerative Clustering
Proceeds bottom-up: at each step, the most similar clusters are merged
Initially each term is considered a cluster
Similarity between all pairs of clusters is computed
The most similar clusters are merged as long as they share terms with common heads
Group average for clusters, Dice-like formula for terms
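The clustering steps above can be sketched as follows. The slides do not give the exact Dice-like formula, so this sketch assumes the plain Dice coefficient over the word sets of two terms, and it reads the merging condition as "merge only clusters that contain terms with the same head". All function names are hypothetical.

```python
from itertools import combinations


def head(term):
    """Assumption: the head of a term is its last word."""
    return term.split()[-1]


def dice(a, b):
    """Dice coefficient over the word sets of two terms (assumed formula)."""
    wa, wb = set(a.split()), set(b.split())
    return 2 * len(wa & wb) / (len(wa) + len(wb))


def group_average(c1, c2):
    """Average pairwise term similarity between two clusters."""
    return sum(dice(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def share_head(c1, c2):
    return bool({head(t) for t in c1} & {head(t) for t in c2})


def cluster(terms):
    """Bottom-up clustering; returns the final clusters and the merge history."""
    clusters = [[t] for t in terms]  # initially each term is a cluster
    merges = []
    while True:
        candidates = [
            (group_average(c1, c2), i, j)
            for (i, c1), (j, c2) in combinations(enumerate(clusters), 2)
            if share_head(c1, c2)
        ]
        if not candidates:  # no remaining pair shares a head
            break
        _, i, j = max(candidates)  # most similar pair
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


terms = ["hierarchical method", "agglomerative method",
         "clustering method", "web page"]
clusters, merges = cluster(terms)
print(clusters)
```

The merge history is what would give the hierarchy: each merged cluster becomes the parent of the two clusters it was built from.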
Formal Concept Analysis (FCA) [Ganter et al., 1999]
FCA relies on the idea that the objects (terms) are associated with their attributes (verbs)
Finds common attributes (verbs) between objects and forms object clusters that share common attributes
Formal concepts are connected with the sub-concept relationship
$$(O_1, A_1) \le (O_2, A_2) \;\Leftrightarrow\; O_1 \subseteq O_2 \;(\Leftrightarrow A_1 \supseteq A_2)$$
FCA Example
Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)
submit test describe print compute search
Html form * * *
Hierarchical clustering * *
Text retrieval *
Root node * * * *
Single cluster * * *
Web page * *
FCA Taxonomy
Formal concepts:
({hierarchical clustering, root node, single cluster}, {compute, search})
({html form, web page}, {print, search})
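A brute-force way to enumerate formal concepts is to close every subset of attributes, as in the sketch below. The context reuses the terms and verbs of the example, but the individual term-verb assignments are assumptions chosen so that the two concepts listed above come out; only those two concepts are taken from the slides. The function names are hypothetical.

```python
from itertools import combinations

# Assumed term -> verbs context; the exact cells are illustrative.
context = {
    "html form": {"submit", "print", "search"},
    "web page": {"print", "search"},
    "hierarchical clustering": {"compute", "search"},
    "root node": {"compute", "search", "describe", "test"},
    "single cluster": {"compute", "search", "test"},
    "text retrieval": {"describe"},
}


def formal_concepts(context):
    """Return all (extent, intent) pairs of the context as frozensets."""
    attributes = set().union(*context.values())
    concepts = set()
    for size in range(len(attributes) + 1):
        for attrs in combinations(sorted(attributes), size):
            # Extent: objects that have every attribute in `attrs`.
            extent = frozenset(o for o, verbs in context.items()
                               if set(attrs) <= verbs)
            # Intent: attributes shared by every object in the extent.
            if extent:
                intent = frozenset(
                    set.intersection(*(context[o] for o in extent)))
            else:
                intent = frozenset(attributes)
            concepts.add((extent, intent))
    return concepts


def is_subconcept(c1, c2):
    """(O1, A1) <= (O2, A2) iff O1 is a subset of O2."""
    return c1[0] <= c2[0]


concepts = formal_concepts(context)
print(len(concepts))
```

Enumerating all attribute subsets is exponential, so this only suits toy contexts; it is meant to show the definition, not an efficient lattice algorithm.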
Not all dependencies (c, v) are interesting:
$$P(c \mid v) = \frac{f(c, v)}{f(v)} > t$$
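The conditional probability filter can be sketched with two counters. The observations and the threshold value below are invented for illustration, and `interesting_pairs` is a hypothetical name.

```python
from collections import Counter


def interesting_pairs(pairs, t):
    """Keep (concept, verb) pairs whose P(c | v) = f(c, v) / f(v) exceeds t."""
    pair_freq = Counter(pairs)                # f(c, v)
    verb_freq = Counter(v for _, v in pairs)  # f(v)
    scores = {(c, v): n / verb_freq[v] for (c, v), n in pair_freq.items()}
    return {pair: p for pair, p in scores.items() if p > t}


# Hypothetical (concept, verb) observations.
observed = ([("web page", "print")] * 3
            + [("html form", "print")]
            + [("root node", "compute")] * 2)
kept = interesting_pairs(observed, t=0.5)
print(kept)
```

Here ("html form", "print") is dropped because it accounts for only one of the four occurrences of "print".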
Non-Taxonomic Relations extraction phase
Concept Extraction
Taxonomy Induction
Non-Taxonomic Relations
Non-Taxonomic Relations
Concepts are also characterized by attributes and relations to other concepts in the hierarchy
Typically expressed by a verb relating a pair of concepts
Two approaches:
Association rules
Probabilistic
Association Rules [Agrawal et al., 1993]
Introduced to predict the purchase behavior of customers
Extract terms connected with some relation subject-verb-object
Enhance with general terms from the taxonomy
Eliminate redundant relations: predictive accuracy < t
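The pruning step can be illustrated with a small sketch. The slides do not define predictive accuracy, so the code below swaps in plain rule confidence (the share of a subject's triples that point to a given object) as a stand-in; the taxonomy enhancement step is left out, and the triple counts are invented. The function names are hypothetical.

```python
from collections import Counter


def rule_confidence(triples):
    """Confidence of each rule subject => object over (subject, verb, object) triples."""
    subject_count = Counter(s for s, _, _ in triples)
    pair_count = Counter((s, o) for s, _, o in triples)
    return {(s, o): n / subject_count[s] for (s, o), n in pair_count.items()}


def prune(triples, t):
    """Drop rules whose score falls below the threshold t."""
    return {rule: c for rule, c in rule_confidence(triples).items() if c >= t}


# Hypothetical counts; only the term names come from the slides.
triples = ([("blood transfusion", "result", "antibiotic prophylaxis")] * 3
           + [("blood transfusion", "need", "surgical treatment")])
print(prune(triples, t=0.5))
```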
Association Rules: Example
Domain                         Range                          Label
chiasmal syndrome              pituitary disproportion        cause by
medial collateral ligament     surgical treatment             need
blood transfusion              antibiotic prophylaxis         result
lipid peroxidation             cardiopulmonary bypass         lead to
prostate specific antigen      prostatectomy                  follow
chronic fatigue syndrome       cardiac function               yield
right ventricular infarction   radionuclide ventriculography  analyze by
creatinine clearance           arteriovenous hemofiltration   achieve
cardioplegic solution          superoxide dismutase           give
bacterial translocation        antibiotic prophylaxis         decrease
accurate diagnosis             clinical suspicion             depend
ultrasound examination         clinical suspicion             give
total body oxygen consumption  epidural analgesia             attenuate by
coronary arteriography         physician                      perform by
Probabilistic approach [Cimiano et al., 2006]
Collect verbal relations from the corpus
Find the most general relation with respect to the verb, using frequency of occurrence:
Suffer_from(man, head_ache)
Suffer_from(woman, stomach_ache)
Suffer_from(patient, ache)
Select relationships satisfying a conditional probability measure
Associations > t become accepted
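The generalization step in the Suffer_from example can be sketched by lifting each argument to its ancestors in the taxonomy and counting the resulting pairs. The `parent` map below is an assumed toy taxonomy, and `most_general_relation` is a hypothetical name; the conditional probability check is not included.

```python
from collections import Counter
from itertools import product


def ancestors(concept, parent):
    """The concept followed by its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def most_general_relation(instances, parent):
    """Most frequent (domain, range) pair after lifting arguments to ancestors."""
    counts = Counter()
    for domain, range_ in instances:
        counts.update(product(ancestors(domain, parent),
                              ancestors(range_, parent)))
    return counts.most_common(1)[0]


# Assumed toy taxonomy (child -> parent) for the example in the slides.
parent = {"man": "patient", "woman": "patient",
          "head_ache": "ache", "stomach_ache": "ache"}
suffer_from = [("man", "head_ache"), ("woman", "stomach_ache")]
print(most_general_relation(suffer_from, parent))  # (('patient', 'ache'), 2)
```

Raw frequency can never rank a concept above its own ancestor, so on a deeper taxonomy this count alone would drift toward the root; the conditional probability threshold is what keeps the selected relation specific.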
Evaluation
Relevance judgments are provided by humans
Precision and recall
We examined the 200 top-ranked concepts and their respective relations in 500 lines
Results from the OhsuMed and Computer Science corpora
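For reference, precision and recall against human relevance judgments can be computed as in the sketch below. The two term sets are invented for illustration and are not the evaluation data; `precision_recall` is a hypothetical name.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted items against human-judged relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and relevance judgments.
extracted = {"web page", "search engine", "large number", "text mining"}
relevant = {"web page", "search engine", "text mining",
            "machine learning", "information retrieval"}
print(precision_recall(extracted, relevant))  # (0.75, 0.6)
```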
Results
Processing Layer         Method                   Precision (OhsuMed)  Recall (OhsuMed)  Precision (Comp. Science)  Recall (Comp. Science)
Concept Extraction       C/NC-Value               89.7%                91.4%             86.7%                      89.6%
Taxonomic Relations      Formal Concept Analysis  47.1%                41.6%             44.2%                      48.6%
Taxonomic Relations      Hierarchical Clustering  71.2%                67.3%             71.3%                      62.7%
Non-Taxonomic Relations  Association Rules        71.8%                67.7%             72.8%                      61.7%
Non-Taxonomic Relations  Probabilistic            62.7%                55.9%             61.6%                      49.4%
Comparison with Text2Onto [Cimiano & Völker, 2005]
Huge lists of plain single-word terms, and relations lacking semantic meaning
Text2Onto cannot work with big texts
Cannot export results in OWL
Conclusions
OntoGain
Multi-word term concepts
Exports ontology in OWL
Domain independent
Results
C/NC-Value yields good results
Clustering outperforms FCA
Association Rules perform better than Verbal Expressions
Future Work
Explore more methods / combinations, e.g., clustering, FCA
Hearst patterns for discovering additional relation types (Part-of)
Discover attributes and cardinality constraints
Incorporate term similarity information from WordNet, MeSH
Resolve term ambiguities
Thank you!
Questions?
Preprocessing
Tokenization, POS tagging, shallow parsing (OpenNLP suite)
Lemmatization (WordNet Java Library)
Apply to all steps of OntoGain
Shallow parsing is used in relations acquisition for the detection of verbal dependencies
Terms sharing a head tend to be similar
e.g. hierarchical method and agglomerative method are both methods
Nested terms are related to each other
e.g. agglomerative clustering method and clustering method should be associated