Ontology Learning from Text
Ehsan Asgarian
Definition of Ontology
‘A formal, explicit specification of a shared conceptualization’
• formal: must be machine understandable
• explicit: types of concepts and constraints must be clearly defined
• shared: not private to some individual, but accepted by a group
• conceptualization: an abstract model of some phenomenon in the world, formed by identifying the relevant concepts of that phenomenon
Or simply: a data model describing a domain.
Main elements of an ontology
• Hierarchy of concepts (is-a relations)
• Object property (relation), e.g. wasWrittenBy, linking a domain concept to a range concept
• Datatype property (attribute), e.g. hasTitle, linking a domain concept to a datatype range such as xsd:string
Applications of Ontologies
Knowledge representation and knowledge management systems
Intelligent query-answering systems
Information retrieval and extraction
Semantic Web
• Web pages annotated with ontologies
• User queries for Web pages analysed at
knowledge level and answered by inferencing on
ontological knowledge
Definition of Ontology Learning
The application of a set of methods and techniques used for building an ontology from scratch
Uses distributed and heterogeneous knowledge and information sources
Allows a reduction in the time and effort needed in the ontology development process
Ontology Learning (Construction)
Manual construction
• Corpus is not necessary
• Small scale
Automatic or semiautomatic construction
• Domain specific corpus
• Good domain knowledge coverage
Ontology Learning methods from…
Unstructured sources
• Involves NLP techniques, morphological and syntactic
analysis, etc.
Semi-structured sources
• Elicit an ontology from sources that have some predefined
structure, such as XML Schema
Structured data
• Extracting concepts and relations from knowledge contained
in structured data, such as databases
Ontology Learning ‘Layer Cake’
Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
Relations: cure (domain:Doctor, range:Disease)
Taxonomy (Concept hierarchies): is_a (Doctor, Person)
Concepts: Disease:=<I, E, L>
Synonyms: {disease, illness}
Terms: disease, illness, hospital
Subtasks in ontology learning
Extract the relevant domain terminology and synonyms from a
text collection
Discover concepts which can be regarded as abstractions of
human thought
Derive a concept hierarchy organizing these concepts
Extend an existing concept hierarchy with new concepts
Learn non-taxonomic relations between concepts
Populate the ontology with instances of relations and concepts
Discover other axiomatic relationships or rules involving
concepts and relations
Sample (partial) Ontology –
Electronic Voting Domain
Concepts: person, voter, worker, poll watcher, location, county, precinct, vote, ballot, machine, voting machine, manufacturer, etc.
Attributes: name of person, model of machine, etc.
Taxonomical relations:
• Voter is a person; precinct is a location; voting
machine is a machine, etc.
Non-hierarchical relations:
• Voter cast ballot; voter trust machine; county
adopt machine; equipment miscount ballot, etc.
ConceptNet — a practical commonsense reasoning
Open Mind Common Sense (OMCS) is an artificial intelligence
project based at the Massachusetts Institute of Technology (MIT)
Media Lab whose goal is to build and utilize a large
commonsense knowledge base from the contributions of many
thousands of people across the Web.
ConceptNet is a multilingual knowledge base, representing
words and phrases that people use and the common-sense
relationships between them.
Since its founding in 1999, it has
accumulated more than a million
English facts from over 15,000
contributors in addition to knowledge
bases in other languages.
ConceptNet — a practical commonsense reasoning
The knowledge base is a semantic network presently consisting
of over 1.6 million assertions of commonsense knowledge
encompassing the spatial, physical, social, temporal, and
psychological aspects of everyday life.
It is built from nodes representing concepts, in the form of words
or short phrases of natural language, and labeled relationships
between them. These are the kinds of things computers need to
know to search for information better, answer questions, and
understand people's goals.
ConceptNet is generated automatically from the 700 000
sentences of the Open Mind Common Sense Project — a World
Wide Web based collaboration with over 14 000 authors.
Challenges in Text Processing
Unstructured texts
Ambiguity in English text
• Multiple senses of a word
• Multiple parts of speech, e.g., “like” can occur in 8 PoS:
  • Verb: “Fruit flies like a banana”
  • Noun: “We may not see its like again”
  • Adjective: “People of like tastes agree”
  • Adverb: “The rate is more like 12 percent”
  • Preposition: “Time flies like an arrow”
  • etc.
Lack of closed domain of lexical categories
Noisy texts
Requirement of very large training text sets
Lack of standards in text processing
Part 1 Terms Extraction
Axioms & Rules
Relations
Taxonomy (Concept hierarchies)
Concepts
Synonyms
Terms: disease, illness, hospital
Terms
Linguistic realizations of domain-specific concepts
Are the basis of the ontology learning process
Term extraction implies:
• Linguistic processing: part-of-speech tagging, morphological analysis, etc.
• Statistical processing: compares the distribution of terms between corpora
Terms Extraction: Process
Run a Part-Of-Speech (POS) tagger over the domain
corpus
Identify possible terms by constructing patterns, such
as: Adj-Noun, Noun-Noun, Adj-Noun-Noun, …
Ignore names
Keep only the terms relevant to the text by applying
statistical metrics
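The pattern-matching step above can be sketched as follows. This is a minimal illustration, not the method of any particular tool: it assumes the corpus is already POS-tagged, and the tag names (JJ, NN, NNP) and the toy sentence are assumptions. Names are ignored simply because proper-noun tags appear in no pattern.

```python
from collections import Counter

# Candidate-term patterns from the slide, written as tag sequences.
# The Penn Treebank style tag names are an assumption.
PATTERNS = [("JJ", "NN", "NN"), ("JJ", "NN"), ("NN", "NN")]


def candidate_terms(tagged, patterns=PATTERNS):
    """Return multi-word candidates whose tag sequence matches a pattern."""
    found = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            tags = tuple(tag for _, tag in window)
            if tags == pattern:
                found.append(" ".join(word.lower() for word, _ in window))
    return found


# Hand-tagged toy sentence; a real pipeline would get this from a POS tagger.
tagged_sentence = [
    ("The", "DT"), ("electronic", "JJ"), ("voting", "NN"), ("machine", "NN"),
    ("miscounted", "VBD"), ("the", "DT"), ("ballot", "NN"), ("in", "IN"),
    ("Ohio", "NNP"),
]

terms = candidate_terms(tagged_sentence)
term_counts = Counter(terms)  # raw counts, input for the statistical filter
print(terms)
```

The counts would then be passed to a statistical metric to keep only the domain-relevant candidates.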
Linguistic Analysis: an example
Levels of linguistic analysis, from shallow to deep:
• Tokenization (incl. Named-Entity Rec.): [table] [2005-06-01] [John Smith]
• Part of Speech & Semantic Tagging: [table N:ARTIFACT] [table N:furniture]
• Morphological Analysis (stemming): [work~ing V]
• Phrase Recognition: [[the] [large] [table] NP] [[in] [the] [corner] PP]
• Dependency Structure (Phrases): [[the SPEC] [large MOD] [table HEAD] NP]
• Dependency Structure (S): [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ] S]
• Discourse Analysis: [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ:X1] …] … [[It SUBJ:X1] [was PRED] still available …]
Statistical Analysis
Statistical metrics used in term extraction:
• Chi-square: $\chi^2 = \sum \frac{(obs - exp)^2}{exp}$
• Term weighting (TFIDF): $tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$
• Mutual Information: $mi(x, y) = \frac{P(x, y)}{P(x)\,P(y)}$
TFIDF
$tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$
tf(w): term frequency (number of occurrences of the word in a document)
df(w): document frequency (number of documents containing the word)
N: number of all documents
tfidf(w): relative importance of the word in the document
The most popular weighting scheme:
• The word is more important if it appears several times in a document
• The word is more important if it appears in fewer documents
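A small worked example may make the TFIDF definition concrete. The sketch below assumes documents are lists of tokens and uses the natural logarithm (the slide does not fix the base); the toy corpus is invented for illustration.

```python
import math


def tfidf(word, document, corpus):
    """tfidf(w) = tf(w) * log(N / df(w)) for one tokenized document."""
    tf = document.count(word)                      # occurrences in this document
    df = sum(1 for doc in corpus if word in doc)   # documents containing the word
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Invented toy corpus of three tokenized documents.
corpus = [
    ["voter", "cast", "ballot", "ballot"],
    ["voter", "trust", "machine"],
    ["county", "adopt", "machine"],
]

print(tfidf("ballot", corpus[0], corpus))  # frequent here, rare elsewhere
print(tfidf("voter", corpus[0], corpus))   # appears once, in two documents
```

In this corpus "ballot" scores higher than "voter" in the first document, because it occurs twice there and in no other document.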
Part 2 Synonyms
Axioms & Rules
Relations
Taxonomy (Concept hierarchies)
Concepts
Synonyms: {disease, illness}
Terms
Synonyms
Identification of terms that share
semantics, i.e., potentially refer to the
same concept
Methods for extracting synonyms
• Based on WordNet
• Latent Semantic Indexing (LSI)
WordNet
A lexical database for the English language
Nouns, verbs, adjectives & adverbs are grouped into sets of
synonyms (synsets)
Synsets are interlinked by means of conceptual-semantic
and lexical relations
Adapting WordNet to specific domain
Partition the set of synonymy relations defined in WordNet into
three classes:
• Relations irrelevant in the specific domain
• Relations that are relevant but incorrect in the specific
domain
• Relations that are relevant and correct in the specific
domain
Remove relations from the first two classes and include
relations from the third class
Rank the remaining sets according to their frequency in the corpus
Latent Semantic Indexing (LSI)
LSI is an NLP technique for analyzing relationships
between a set of documents and the terms they contain
Uses a term-document matrix which describes the
occurrences of terms in documents – Vector Space Model
Example: doc1 doc2
database X
computer X X
access X
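The term-document matrix can be sketched in a few lines. This covers only the vector space model that LSI starts from; LSI itself additionally applies a truncated singular value decomposition to the matrix, which is not shown here. The documents are hypothetical and are not the ones from the slide.

```python
import math


def term_document_matrix(docs):
    """Map each term to a row of counts, one column per document."""
    vocabulary = sorted({term for doc in docs for term in doc})
    return {term: [doc.count(term) for doc in docs] for term in vocabulary}


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy documents, already tokenized.
docs = [
    ["database", "computer", "access"],
    ["computer", "network"],
    ["database", "access", "query"],
]

matrix = term_document_matrix(docs)
print(matrix["database"])
print(cosine(matrix["database"], matrix["access"]))
print(cosine(matrix["database"], matrix["network"]))
```

Terms with similar rows, such as "database" and "access" here, occur in the same documents and are therefore candidates for sharing semantics.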
Part 3 Concepts
Axioms & Rules
Relations
Taxonomy (Concept hierarchies)
Concepts: Disease:=<I, E, L>
Synonyms
Terms
Concepts
Intension, Extension, Lexicon
A term may indicate a concept if we can define its:
• Intension: (in)formal definition of the set of objects that this concept describes
  Example: a disease is an impairment of health or a condition of abnormal functioning
• Extension: a set of objects that the definition of this concept describes (the name of the nearest common ancestor)
  Example: influenza, cancer, heart disease
• Lexical realizations: the term itself and its multilingual synonyms
  Example: disease, illness, maladie
Part 4 Taxonomy Induction
Axioms & Rules
Relations
Taxonomy (Concept hierarchies): is_a (Doctor, Person)
Concepts
Synonyms
Terms
Concept Hierarchy Extraction
Basic methods used for taxonomy extraction:
• With the use of WordNet
• Lexico-syntactic patterns
• Machine Readable Dictionaries
• Co-occurrence analysis
• Unsupervised hierarchical clustering techniques
• Linguistic approaches
Taxonomy Extraction with WordNet
Given two terms t1 and t2, check if they stand in a
hypernym relation with regard to WordNet
Normalize the number of hypernym paths by dividing
by the number of senses of t1:
$isa(t_1, t_2) = \min\left(\frac{|paths(senses(t_1), senses(t_2))|}{|senses(t_1)|}, 1\right)$
path: a sequence of edges connecting the two synsets
Example:
• 4 different hypernym paths between synsets ‘country’ and ‘region’
• ‘country’ has 5 senses
• value of isa (country, region) = 4/5 = 0.8
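The normalization can be checked with a tiny function. In this sketch the two counts are passed in directly; in a real system they would come from a WordNet lookup, which is left out here.

```python
def isa_score(num_hypernym_paths, num_senses_t1):
    """isa(t1, t2) = min(|paths| / |senses(t1)|, 1)."""
    if num_senses_t1 <= 0:
        raise ValueError("t1 must have at least one sense")
    return min(num_hypernym_paths / num_senses_t1, 1)


print(isa_score(4, 5))  # the 'country' / 'region' example: 0.8
print(isa_score(7, 5))  # more paths than senses: capped at 1
```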
Lexico-syntactic patterns - Hearst
Aim: the acquisition of hyponym lexical relations from text
Uses a set of predefined lexico-syntactic patterns which
• occur frequently and in many text genres
• indicate the relation of interest
• can be recognized with little or no pre-encoded knowledge
Main idea: match these patterns in texts to retrieve
is_a relations
Precision with respect to WordNet: 55.45%
Lexico-syntactic patterns - Hearst
• NP0 such as {NP1, NP2, …, (and | or)} NPn
  ‘Vehicles such as cars, trucks and bikes…’
  is_a (car, vehicle), is_a (truck, vehicle), is_a (bike, vehicle)
• such NP as {NP,} * { (or | and) } NP
  ‘Such fruits as oranges, nectarines or apples…’
  is_a (orange, fruit), is_a (nectarine, fruit), is_a (apple, fruit)
• NP {, NP} * { , } { or | and } other NP
  ‘Swimming, running, or/and other activities…’
  is_a (swimming, activity), is_a (running, activity)
• NP { , } including {NP, } * { or | and } NP
  ‘Injuries, including broken bones, wounds and bruises…’
  is_a (broken bone, injury), is_a (wound, injury), is_a (bruise, injury)
• NP { , } especially {NP, } * { or | and } NP
  ‘Publications, especially papers and books…’
  is_a (paper, publication), is_a (book, publication)
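The ‘such as’ pattern can be approximated with a regular expression. The sketch below is deliberately simplified and makes several assumptions: every NP is a single word, there is no comma before ‘and’/‘or’, and plurals are reduced by stripping a final ‘s’. A faithful implementation would match over noun-phrase chunks instead of raw text.

```python
import re

# NP0 such as NP1, NP2, ... (and|or) NPn, with every NP simplified to one word.
SUCH_AS = re.compile(
    r"(\w+) such as ((?:\w+, )*\w+(?: (?:and|or) \w+)?)", re.IGNORECASE
)


def singular(noun):
    """Very naive singularization: drop a trailing 's'."""
    return noun[:-1] if noun.endswith("s") else noun


def hearst_such_as(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = singular(match.group(1).lower())
        hyponyms = re.split(r", | and | or ", match.group(2))
        pairs.extend((singular(h.lower()), hypernym) for h in hyponyms)
    return pairs


pairs = hearst_such_as("Vehicles such as cars, trucks and bikes are taxed.")
print(pairs)
```

Each returned pair (hyponym, hypernym) corresponds to one is_a relation.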
Machine Readable Dictionaries
A method for extracting taxonomies which goes back
to the 1980s
Main idea: exploit the regularity of dictionary entries to
find a suitable hypernym for the defined word
Example:
spring: “the season between winter and summer and in which
leaves and flowers appear”
is_a (spring, season)
MRDs: Exceptions
The hypernym can be preceded by an expression such as ‘a kind of’,
‘a sort of’, or ‘a type of’
• The problem is solved by keeping an exception list with words such as
‘kind’, ‘sort’, ‘type’ and taking the head of the NP following the
preposition ‘of’
• Example: hornbeam: “a type of tree with a hard wood, sometimes used in hedges”
  gives is_a (hornbeam, tree)
The word can be defined in terms of a part-of or membership relation
• Example: republican: “a member of a political party advocating republicanism”
  gives part_of (republican, political party), not is_a (republican, political party)
Co-occurrence analysis
Document-based subsumption
A term t1 is more specific than a term t2 if
t2 also appears in all the documents in which t1
appears.
Term x subsumes term y iff P(x | y) = 1, where
$P(x \mid y) = \frac{n(x, y)}{n(y)}$
n(x, y): the number of documents in which x and y co-occur
n(y): the number of documents that contain y
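Document-based subsumption is easy to sketch once documents are represented as sets of terms. The example reads the condition as P(x | y) = 1, that is, x occurs in every document that contains y; the documents themselves are invented.

```python
def cond_prob(x, y, docs):
    """P(x | y) = n(x, y) / n(y) over a list of documents (sets of terms)."""
    n_y = sum(1 for doc in docs if y in doc)
    n_xy = sum(1 for doc in docs if x in doc and y in doc)
    return n_xy / n_y if n_y else 0.0


def subsumes(x, y, docs):
    """x subsumes y if x occurs in every document that contains y."""
    return cond_prob(x, y, docs) == 1.0


# Invented documents, each reduced to the set of terms it contains.
docs = [
    {"machine", "voting machine", "ballot"},
    {"machine", "voting machine"},
    {"machine", "manufacturer"},
]

print(subsumes("machine", "voting machine", docs))  # the general term subsumes
print(subsumes("voting machine", "machine", docs))  # but not the other way round
```

On noisy corpora a threshold slightly below 1 is a common relaxation; that variant is not part of the definition above.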
Unsupervised hierarchical
clustering techniques
Unsupervised hierarchical clustering techniques
known from machine learning research
• very noisy as they highly depend on the frequency and
behavior of the terms in the text collection under consideration
• learn concepts at the same time since they also group terms
(the most related to each other)
• can be regarded as abstractions over words and thus, to
some extent, as concepts
It is unclear which specific relation actually holds
between the involved words.
Example: Semantic_relatedness (cut, knife)
Linguistic Approaches
Modifiers typically restrict or narrow down the meaning
of the modified noun.
Syntactic structure analysis and dependency analysis:
words and modifiers in syntactic structures (noun/verb/
prepositional/… phrases) are analyzed to discover
potential terms and relations, e.g. with the head-modifier principle:
the head of a term assumes the hypernym role
In dependency analysis, grammatical relations, such as
subject, object, adjunct, and complement, are used for
determining more complex relations
Example: is_a (international credit card, credit card)
Extending Concept Hierarchy
with new Concepts
…by adding a new concept at an appropriate position in the existing taxonomy
Supervised methods:
• classifiers need to be trained which predict membership for every
concept in the existing concept hierarchy
• they need a considerable amount of training data for each concept
• such approaches typically do not scale to arbitrarily large ontologies
Unsupervised approaches:
• assume a similarity function which computes a measure of fit between
the new concept and the concepts existing in the ontology.
• rely on an appropriate contextual representation of the different
concepts on the basis of which similarity can be computed.
• the hierarchical structure of the ontology needs to be considered and
somehow integrated into the similarity measure
Part 5 Relations (non-taxonomic)
Axioms & Rules
Relations: cure (domain:Doctor, range:Disease)
Taxonomy (Concept hierarchies)
Concepts
Synonyms
Terms
Extracting relations (the interactions
between concepts) & attributes
Specific relations
• Part-of
• Qualia (Formal, Constitutive, Telic, Agentive)
General relations
• Exploiting linguistic structure
Attributes
Learning attributes: Introduction
Attributes: relations with a datatype as range
Typically expressed in texts using the preposition ‘of’, the verb ‘have’ or
genitive constructs, e.g. ‘the color of the car’, ‘the car’s color’, ‘every
car has a color’
Values of attributes are expressed using copula constructs,
adjectives or expressions specific to the attribute in question, e.g.,
• ‘the car is red’ (copula + value)
• ‘the red car’ (adjective)
• ‘the baby weighs 3 kg’ (specific expressions)
Classification of attributes
To systematize the learning process attributes are classified according to their range
An approach to learning attributes
Tokenize & part-of-speech tag the corpus
Apply the following patterns to extract adjective/noun pairs:
• (\w+{DET})? (\w+{NN})+ is{VBZ} \w+{JJ}
• (\w+{DET})? \w+{JJ} (\w+{NN})+
Tags: JJ = adjective, DET = determiner, NN = noun, VBZ = verb, 3rd person singular present
These pairs are weighted using conditional probability:
$P(a \mid n) = \frac{f(n, a)}{f(n)}$
f(n, a): joint frequency of adjective a and noun n
f(n): the frequency of noun n
For each of the adjectives we look up the corresponding
attributes in WordNet
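The two patterns and the weighting step can be sketched as below. Several simplifications are assumptions of this example: the corpus is a string tagged as word{TAG}, each noun is a single token, and f(n) is counted over the extracted pairs only, not over the whole corpus. The WordNet lookup is not shown.

```python
import re
from collections import Counter

# The slide's two patterns over text tagged as word{TAG}; braces are escaped
# and the noun is simplified to a single token (an assumption).
COPULA = re.compile(r"(?:\w+\{DET\} )?(\w+)\{NN\} is\{VBZ\} (\w+)\{JJ\}")
PRENOMINAL = re.compile(r"(?:\w+\{DET\} )?(\w+)\{JJ\} (\w+)\{NN\}")


def adjective_noun_pairs(tagged_text):
    """Return (noun, adjective) pairs matched by either pattern."""
    pairs = [(n, a) for n, a in COPULA.findall(tagged_text)]
    pairs += [(n, a) for a, n in PRENOMINAL.findall(tagged_text)]
    return pairs


def conditional_weights(pairs):
    """Weight each pair by f(n, a) / f(n), counted over the extracted pairs."""
    joint = Counter(pairs)
    noun_freq = Counter(noun for noun, _ in pairs)
    return {pair: count / noun_freq[pair[0]] for pair, count in joint.items()}


# Hand-tagged toy corpus.
tagged = ("the{DET} car{NN} is{VBZ} red{JJ} . "
          "a{DET} red{JJ} car{NN} . "
          "the{DET} fast{JJ} car{NN} .")
weights = conditional_weights(adjective_noun_pairs(tagged))
print(weights)
```

Here ‘red’ is seen with ‘car’ twice and ‘fast’ once, so the pair (car, red) receives the higher weight.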
“meronymy” / “part-of” relations
Given a “seed” word, find parts of that word in a large corpus of text
Patterns:
• whole NN[-PL] ’s POS part NN[-PL], e.g. …building’s basement…
• part NN[-PL] of PREP {the|a} DET mods [JJ|NN]* whole NN, e.g. …basement of a building…
Format: type_of_word TAG type_of_word TAG…
NN = Noun, NN-PL = Plural Noun, PREP = Preposition, POS = Possessive, JJ = Adjective
Accuracy: 55%
Qualia structures
The meaning of a lexical element is described in terms of four roles:
• Constitutive: physical properties of an object (e.g., weight, material, parts)
• Agentive: typically a verb denoting an action which brings the object into existence
• Formal: normally consists of typing information about the object (e.g., hypernym)
• Telic: the purpose or function of an object, given either by a verb or by a nominal
Example: qualia structure for knife
• Formal: artifact_tool
• Constitutive: blade, handle, …
• Telic: cut_act
• Agentive: make_act
Qualia Structures: Learning Approach
aim: to automatically learn qualia
structures from the WWW
Based on the idea of matching certain
lexico-syntactic patterns conveying a
standard relation
Clues: search engine queries
indicating the relation of
interest
Calculate the weight of a
candidate qualia element e for
the term t using Jaccard
coefficient:
Qualia Structures: Learning Process
Word → Generate Clues → Download Google Abstracts → POS-tagging → Matching regular expressions → Statistical Weighting → Weighted QS
Jaccard coefficient:
$Jaccard(e, t) = \frac{GoogleHits(e \wedge t)}{GoogleHits(e) + GoogleHits(t) - GoogleHits(e \wedge t)}$
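The statistical weighting step reduces to the Jaccard coefficient over three hit counts. In this sketch the counts are plain arguments with made-up values; querying a search engine is outside its scope.

```python
def jaccard_weight(hits_e, hits_t, hits_both):
    """Jaccard coefficient: joint hits divided by the hits for either word."""
    union = hits_e + hits_t - hits_both
    return hits_both / union if union > 0 else 0.0


# Made-up hit counts for a candidate qualia element e and a term t.
print(jaccard_weight(hits_e=1200, hits_t=800, hits_both=400))
```

The weight lies between 0 (the element and the term never co-occur) and 1 (they always occur together).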
Relations by syntactic analysis
SubjToClass_PredToSlot_DObjToRange
Maps a subject to the domain, the predicate or verb to a slot or
relation and the object to its range.
Example (OntoLT):
‘The player kicked the ball to the net’
relation: kick (domain: player, range: ball)
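The mapping rule can be written as a single function over an already parsed sentence. This is only an illustration of the rule's logic and not OntoLT's actual interface: the dictionary keys and the hand-written analysis are assumptions, and a real system would obtain the lemmatized roles from a dependency parser.

```python
def subj_pred_dobj_to_relation(parse):
    """Map subject -> domain, predicate -> relation, direct object -> range."""
    return {
        "relation": parse["pred"],
        "domain": parse["subj"],
        "range": parse["dobj"],
    }


# Hand-written analysis of 'The player kicked the ball to the net'.
parse = {"subj": "player", "pred": "kick", "dobj": "ball"}
relation = subj_pred_dobj_to_relation(parse)
print("{relation} (domain: {domain}, range: {range})".format(**relation))
```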
Relations by linguistic theory
The subcategorization frame of a word is the number
and kinds of other words that it selects when appearing
in a sentence.
E.g. identify verbs in text as indicators of a relation
between their arguments (object properties)
Example: ‘Joe wrote a letter’
relation: write (subject: Joe, object: letter)
Selectional restrictions for the verb “write”: subject: Person, object: written-communication
Part 6 Axioms & Rules
Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
Relations
Taxonomy (Concept hierarchies)
Concepts
Synonyms
Terms
DIRT
Discovery of Inference Rules from Text
an unsupervised method for discovering inference rules
from text, such as
• X is author of Y ≈ X wrote Y
• X caused Y ≈ Y is blamed on X
• X manufactures Y ≈ X’s Y factory
Is based on the Distributional Hypothesis:
words that occur in the same contexts tend to be similar
DIRT: Distributional Hypothesis
The Distributional Hypothesis is applied to
dependency trees
If two paths tend to link the same sets of
words, their meanings are hypothesized to be
similar
DIRT: Dependency trees
The inference rules
discovered by DIRT are
between paths in
dependency trees
The dependency trees are generated by the Minipar
parser
Minipar represents its grammar as a network where
nodes represent grammatical categories and links
represent syntactic relationships
[Table: a subset of the dependency relations in Minipar output]
DIRT: Dependency trees
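“John found a solution to the problem”
Dependency links (head → modifier, with relation label):
• found → John (subj)
• found → solution (obj)
• solution → a (det)
• solution → to (mod)
• to → problem (pcomp)
• problem → the (det)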
Links represent dependency relationships
Direction: from the head to the modifier
Labels represent types of dependency relations
Each link between two words represents a direct
semantic relationship
Path between “John” and “problem”
N:subj:V←find→V:obj:N→solution→N:to:N
meaning “X finds solution to Y”
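How such a path is read off the tree can be sketched with a small hand-built structure. The assumptions here: the tree is written out manually, not produced by Minipar; the preposition ‘to’ has already been folded into a direct solution-to-problem link; determiners are omitted; and ASCII arrows (<- and ->) stand in for the arrows of the path notation. The function assumes neither word is an ancestor of the other.

```python
# child -> (head, relation label); hand-built for
# 'John found a solution to the problem'.
HEAD = {"John": ("found", "subj"), "solution": ("found", "obj"),
        "problem": ("solution", "to")}
POS = {"John": "N", "found": "V", "solution": "N", "problem": "N"}
LEMMA = {"found": "find", "solution": "solution"}


def chain_to_root(word):
    """Return the list of words from `word` up to the root of the tree."""
    chain = [word]
    while chain[-1] in HEAD:
        chain.append(HEAD[chain[-1]][0])
    return chain


def dirt_path(x, y):
    """Format the dependency path between slot fillers x and y."""
    up, down = chain_to_root(x), chain_to_root(y)
    top = next(word for word in up if word in down)   # lowest common ancestor
    up = up[:up.index(top) + 1]
    down = down[:down.index(top) + 1][::-1]           # from top down to y
    parts = []
    for child, head in zip(up, up[1:]):               # climb from x to top
        label = HEAD[child][1]
        parts.append(f"{POS[child]}:{label}:{POS[head]}<-{LEMMA[head]}")
    text = "<-".join(parts)
    for head, child in zip(down, down[1:]):           # descend from top to y
        label = HEAD[child][1]
        text += f"->{POS[head]}:{label}:{POS[child]}"
        if child != y:                                # inner words stay in the path
            text += f"->{LEMMA[child]}"
    return text


print(dirt_path("John", "problem"))
```

The two end words are left out of the path string because they are the slot fillers X and Y.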
DIRT: Paths in Dependency Trees
Transformation rule: connect the prepositional complement directly to the words
modified by the preposition
Each link between two words represents a direct semantic relationship
A path represents indirect semantic relationships between two content words
Evaluating Ontology Learning Techniques
1) Task-based evaluation (improve quality): the first
approach evaluates the adequacy of ontologies in the
context of other applications.
2) Corpus-based evaluation: the second approach uses
domain-specific data sources to determine to what
extent the ontologies are able to cover the
corresponding domain.
3) Criteria-based evaluation: the third approach
assesses ontologies by determining how well they
adhere to a set of criteria.
Task-based evaluation
Measures how well an ontology meets the requirements
of the system that uses it.
• Example: an ontology designed to improve the performance of
document retrieval can be judged by whether the retrieved documents
are more relevant when the ontology is used
• Example: the use of ontological relations in the context of speech
recognition, with the output compared with a gold standard
generated by humans
Corpus-based evaluation
methods for evaluating the ‘fit’ between an ontology and
the domain knowledge in the form of text corpora.
In this approach, natural language processing (e.g.,
latent semantic analysis, clustering) or information
extraction (e.g., named-entity recognition) techniques
are used to analyze the content of the corpus and
identify terms.
Criteria-based evaluation
Example criterion: the average number of terms that were aggregated to
form a concept in an ontology. This criterion reflects the intuition that
the more variants of a term are used to form a concept, the more fully
encompassing or complete the concept is.
Other measurements
Evaluation approaches can also be distinguished by the
layers of an ontology:
• term,
• concept,
• relation
Evaluations can be performed to assess the :
• correctness at the terminology layer,
• coverage at the conceptual layer,
• wellness at the taxonomy layer,
• adequacy of the non-taxonomic relations.