Computer Science, building 42.1
Roskilde University
Universitetsvej 1
P.O. Box 260
DK-4000 Roskilde
Denmark
Phone: +45 4674 2000
Fax: +45 4674 3072
www.dat.ruc.dk
Ontology-based Information Retrieval
PhD Defense
Henrik Bulskov
Henrik Bulskov, November 3rd, 2006
Research Question
• How do we recognize concepts in information objects and queries, represent these in the information retrieval system, and use the knowledge about relations between concepts captured by ontologies in the querying process?
How to:
• Describe: Recognize and map information in documents and queries into the ontologies,
• Compare: Improve the retrieval process by use of similarity measures derived from knowledge about relations between concepts in ontologies, and
• Retrieve: Introduce ontological indexing and ontological similarity into a realistic information retrieval scenario.
Outline
• Introduction
• Ontologies
• Information Retrieval
• Knowledge Representation
• Ontological Indexing
• Instantiated Ontologies
• Ontological Similarity
• Query Evaluation
• Prototype
• Conclusion and Further Work
Ontologies
• In Philosophy: “A science or study of being”, “First philosophy”
• In Knowledge Engineering: “A formal, explicit specification of a shared conceptualization”
Information Retrieval Models
• Boolean Model
• Vector Model
• Probabilistic Model
• Fuzzy Retrieval Model
• Fuzzy Sets
classical set: d = {t1, t2, t3}
fuzzy set: dfuzzy = {1.0/t1, 0.5/t2, 0.1/t3, 0.0/t4}
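The difference between the two set notions can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding and the helper names `membership`, `support` and `as_fuzzy` are assumptions made here, not part of the slides.

```python
# Classical (crisp) set: a term either belongs to the document or not.
d = {"t1", "t2", "t3"}

# Fuzzy set: each term has a membership degree in [0, 1].
# The degrees are the ones from the slide; t4 has degree 0.0, i.e. it is absent.
d_fuzzy = {"t1": 1.0, "t2": 0.5, "t3": 0.1, "t4": 0.0}


def membership(fuzzy_set, term):
    """Membership degree of term; terms that are not listed have degree 0."""
    return fuzzy_set.get(term, 0.0)


def support(fuzzy_set):
    """Crisp set of the terms with non-zero membership."""
    return {t for t, mu in fuzzy_set.items() if mu > 0.0}


def as_fuzzy(crisp_set):
    """View a classical set as a fuzzy set where every member has degree 1.0."""
    return {t: 1.0 for t in crisp_set}
```

Here `support(d_fuzzy)` gives back the classical set `d`, which shows that the fuzzy description only adds degrees to the same terms.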
Knowledge Representation
• A knowledge representation is a surrogate. Most of the things that we want to represent cannot be stored in a computer, e.g. bicycles, birthdays, motherhood, etc.
• A knowledge representation is a set of ontological commitments. Representations are imperfect approximations of the world, each attending to some things and ignoring others.
• A knowledge representation is a fragmentary theory of intelligent reasoning. To be able to reason about the things represented, the representation should also describe their behavior and intentions.
• A knowledge representation is a medium for efficient computation. A representation also gives guidance on how to organize information so that reasoning can be carried out efficiently.
“What is a knowledge representation?” [Davis et al., 1993]
A Graphical Representation
• We aim at reasoning by means of a “nearness” principle, where increased nearness entails increased degree of similarity
• Can be based on any formalism that has a suitable network or graphical representation encompassing the semantics expressed
Representation Formalisms
• Semantic Networks
• Frames
• Description Logics
• Lattice Algebra
• OntoLog: a lattice algebra with “attribution” (by means of the Peirce product)
OntoLog - A Lattice-algebraic Approach
• Compound concepts are built from
  • Atomic concepts of the ontology, and
  • Attribution: attribute features using semantic relations like
    • WRT: with respect to
    • CHR: characterized by (property ascription)
    • CBY: caused by
    • PNT: patient of act or process
    • LOC: location, position
    • ...
• Concept Examples:
  • “The black cat”: cat[CHR: black]
  • “The noise caused by the black dog”: noise[CBY: dog[CHR: black]]
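One way to hold such compound concepts in a program is a small recursive data structure. The sketch below is an assumption about a convenient encoding, not the OntoLog implementation; the names `Concept`, `atom` and `attribute` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A compound concept: an atomic head plus (relation, Concept) attributes."""
    head: str
    attributes: tuple = ()  # tuple of (relation, Concept) pairs

    def __str__(self):
        if not self.attributes:
            return self.head
        inner = ", ".join(f"{rel}: {value}" for rel, value in self.attributes)
        return f"{self.head}[{inner}]"


def atom(name):
    """An atomic concept of the ontology."""
    return Concept(name)


def attribute(concept, relation, value):
    """Return a new concept with one more attribute attached."""
    return Concept(concept.head, concept.attributes + ((relation, value),))


# The two examples from the slide.
black_cat = attribute(atom("cat"), "CHR", atom("black"))
noise = attribute(atom("noise"), "CBY", attribute(atom("dog"), "CHR", atom("black")))
```

Printing `black_cat` and `noise` reproduces the bracket notation of the two examples, since attribute values are concepts themselves and the rendering recurses.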
Ontological Resource
• WordNet
• A large lexical database organized in terms of meanings.
• Nouns, Adjectives, Adverbs, and Verbs
• Synonym words are grouped into synsets:
{car, auto, automobile, machine, motorcar}
{food, nutrient}
{police, police force, constabulary, law}
• Number of words, synsets, and senses
Ontological Indexing
• Describing Content
  • Conventional approach: bag of keywords
  • Aim: Moving from keywords to concepts
• Description Properties
  • Fidelity: Ability to represent the content
  • Exhaustivity: Degree of recognized concepts
  • Specificity: Generic level of the concepts
  • Level of abstraction: Complexity of descriptions
Descriptions Example
“physical well-being caused by a balanced diet”
{{well-being[CHR:physical]}, {diet[CHR:balanced]}}
well-being[CHR:physical, CBY:diet[CHR:balanced]]
{{“physical”, “well-being”}, {“balanced”, “diet”}}
{“physical”, “well-being”, “balanced”, “diet”}
Simple Natural Language Processing
• Part-of-speech tagging
Example: “The black book on the small table”
The/DT black/JJ book/NN on/IN the/DT small/JJ table/NN
• Noun phrase recognition
<PP [NP The/DT black/JJ book/NN] on/IN [NP the/DT small/JJ table/NN]>
• Extracting Descriptions
book[CHR:black, LOC:table[CHR:small]]
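The three steps can be sketched on already tagged input. This is a deliberately naive illustration, assuming noun phrases of the form DT? JJ* NN joined by prepositions; the function names and the mapping of “on” to LOC in `PREPOSITION_RELATION` are assumptions made for this example only.

```python
# Hypothetical mapping from prepositions to semantic relations.
PREPOSITION_RELATION = {"on": "LOC"}


def chunk_noun_phrases(tagged):
    """Split tagged tokens into noun phrases (head noun, adjectives)
    and the prepositions found between them."""
    phrases, preps, adjectives = [], [], []
    for word, tag in tagged:
        if tag == "JJ":
            adjectives.append(word)
        elif tag == "NN":
            phrases.append((word, adjectives))
            adjectives = []
        elif tag == "IN":
            preps.append(word)
        # Determiners (DT) carry no content here and are skipped.
    return phrases, preps


def describe(tagged):
    """Build a nested description: adjectives become CHR attributes and each
    later noun phrase is attached to the preceding one via its preposition."""
    phrases, preps = chunk_noun_phrases(tagged)
    description = None
    for i in range(len(phrases) - 1, -1, -1):  # build from the right
        head, adjectives = phrases[i]
        parts = [f"CHR:{a}" for a in adjectives]
        if description is not None:
            parts.append(f"{PREPOSITION_RELATION[preps[i]]}:{description}")
        description = f"{head}[{', '.join(parts)}]" if parts else head
    return description


tagged = [("The", "DT"), ("black", "JJ"), ("book", "NN"), ("on", "IN"),
          ("the", "DT"), ("small", "JJ"), ("table", "NN")]
```

Calling `describe(tagged)` on the example sentence yields the description shown on the slide. A real system would need a proper tagger and chunker, and a less naive choice of relation per preposition.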
Word Sense Disambiguation
• Guessing (about 45% correct senses)
• Frequencies (increased from 45% to 69%)
• Selectional Restrictions
∀e,x,y (Drinking(e) ∧ Agent(e,x) ∧ Theme(e,y)) ⇒ ISA(y, DrinkableThing)
Example: “He drank gin”
{gin} (strong liquor flavored with juniper berries)
{snare, gin, noose} (a trap for birds or small mammals; … a slip noose)
{cotton gin, gin} (a machine that separates the seeds … cotton fibers)
{gin, gin rummy, knock rummy} (a form of rummy in which a player …)
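A selectional restriction can be checked mechanically once the senses are placed in an ISA hierarchy. The sketch below uses a toy hierarchy invented for this example; the hypernyms assigned to the four senses and the names `ISA`, `THEME_RESTRICTION` and `admissible_senses` are assumptions, not WordNet data.

```python
# Toy ISA hierarchy (child -> parent); all names are illustrative.
ISA = {
    "liquor": "beverage",
    "beverage": "DrinkableThing",
    "trap": "device",
    "machine": "device",
    "card_game": "game",
}

# Candidate senses of "gin", each mapped to a hypernym in the toy hierarchy.
GIN_SENSES = {
    "{gin}": "liquor",
    "{snare, gin, noose}": "trap",
    "{cotton gin, gin}": "machine",
    "{gin, gin rummy, knock rummy}": "card_game",
}

# Restriction a verb imposes on its Theme argument.
THEME_RESTRICTION = {"drink": "DrinkableThing"}


def isa(concept, ancestor):
    """True if ancestor is concept itself or reachable by following ISA links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ISA.get(concept)
    return False


def admissible_senses(verb, senses):
    """Keep only the senses that satisfy the verb's restriction on its Theme."""
    required = THEME_RESTRICTION[verb]
    return [s for s, hypernym in senses.items() if isa(hypernym, required)]
```

For “He drank gin”, `admissible_senses("drink", GIN_SENSES)` leaves only the liquor sense, because it is the only one below DrinkableThing in this toy hierarchy.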
Relational Connections
• Noun phrases
Example:
“Physical well-being caused by a balanced diet”
well-being[CHR:physical, CBY:diet[CHR:balanced]]
• Relations derived from verbs: the CBY relation between “well-being” and “diet”
• Relations inside simple NPs: the CHR relation between “diet” and “balanced”
Relational Connections, cont.
• Premodifier
Examples: “black book” and “criminal lawyer”
book[CHR:black] is correct, but lawyer[CHR:criminal] is not.
Use knowledge from ontologies to differentiate between descriptive and relational adjectives, and use the WRT relation for the latter:
lawyer[WRT:criminal]
• Preposition
“The black book on the small table”
book[CHR:black, LOC:table[CHR:small]]
“The meeting on Monday”
meeting[TMP:Monday]
General Ontologies
• Modeling of concepts in a generative ontology based on different conceptualizations and dictionaries.
• We assume a set of atomic concepts A and a set of semantic relations R
• The set of well-formed terms L of the OntoLog language is recursively defined as follows:
if x isa y then x ≤ y
if x[…] ≤ y[…] then also
  x[…, r:z] ≤ y[…], and
  x[…, r:z] ≤ y[…, r:z]
if x ≤ y then also
  z[…, r:x] ≤ z[…, r:y]
Definition: O = (L, ≤, R)
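The ordering rules can be read as a recursive subsumption test. The sketch below is one possible reading, under the simplifying assumption that a term has at most one value per relation; the toy `ISA` links and the names `term`, `atom_leq` and `leq` are hypothetical.

```python
# Toy ISA links between atomic concepts (child -> parent); illustrative only.
ISA = {"poodle": "dog", "dog": "animal", "cat": "animal", "black": "color"}


def atom_leq(x, y):
    """x ≤ y for atomic concepts: y is x itself or an ISA ancestor of x."""
    while x is not None:
        if x == y:
            return True
        x = ISA.get(x)
    return False


def term(head, **attributes):
    """Build a term head[r1: t1, ...]; attribute values are terms themselves."""
    return (head, attributes)


def leq(a, b):
    """a ≤ b: the head of a specializes the head of b, and every attribute
    of b is matched in a by the same relation with a value that is ≤ b's value.
    Extra attributes on a only make it more specific."""
    (head_a, attrs_a), (head_b, attrs_b) = a, b
    if not atom_leq(head_a, head_b):
        return False
    return all(
        rel in attrs_a and leq(attrs_a[rel], value_b)
        for rel, value_b in attrs_b.items()
    )
```

For instance, `leq(term("dog", CHR=term("black")), term("dog"))` holds because adding an attribute specializes a term, while the reverse comparison does not hold.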
Ontological Similarity
• We aim at reasoning by means of a “nearness” principle, where increased nearness entails increased degree of similarity
• Properties
  • Commonality, Difference, Identity
• Generalization
  • sim(animal,poodle) ≠ sim(poodle,animal)
• Depth
• Multiple-Paths
Ontological Similarity Measures
• Knowledge-based
  • Shortest Path
• Corpus-based
  • Co-occurrences
• Integrated
  • Information Content
    • probability of encountering an instance: p(c) = freq(c)/N
  • Shared Nodes (on instantiated ontologies)
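The probability p(c) = freq(c)/N is usually turned into an information content value IC(c) = -log p(c), following Resnik. The sketch below assumes that convention, and also assumes that freq(c) counts the occurrences of c and of every concept below it; the counts and the toy hierarchy are made up for illustration.

```python
import math

# Hypothetical corpus counts of direct occurrences per concept.
direct_counts = {"animal": 2, "dog": 5, "poodle": 3, "cat": 10}
# Toy ISA links (child -> parent).
ISA = {"poodle": "dog", "dog": "animal", "cat": "animal"}


def concept_frequencies(counts, isa):
    """freq(c): occurrences of c plus occurrences of everything below c."""
    freq = dict.fromkeys(counts, 0)
    for concept, n in counts.items():
        node = concept
        while node is not None:  # propagate the count to all ancestors
            freq[node] = freq.get(node, 0) + n
            node = isa.get(node)
    return freq


freq = concept_frequencies(direct_counts, ISA)
N = sum(direct_counts.values())


def p(c):
    """Probability of encountering an instance of concept c."""
    return freq[c] / N


def information_content(c):
    """IC(c) = -log p(c): rare, specific concepts carry more information."""
    return -math.log(p(c))
```

With these counts the top concept has probability 1 and information content 0, and the values grow as concepts become more specific.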
Shared Nodes
• A simplified “all-possible-paths” approach where shared nodes are nodes that are upwards reachable from both concepts, and where similarity is dependent on the number of shared nodes
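A minimal sketch of the shared nodes idea on a toy graph follows. The graph, the function names and the normalization by the number of nodes reachable from the first concept are assumptions made for this example; the slides only state that similarity depends on the number of shared nodes.

```python
# Toy instantiated ontology: node -> nodes one step up.
# Both ISA and CHR edges are followed; all names are illustrative.
UP = {
    "dog[CHR:gray]": ["dog", "gray"],
    "cat[CHR:gray]": ["cat", "gray"],
    "dog[CHR:large]": ["dog", "large"],
    "dog": ["animal"],
    "cat": ["animal"],
    "gray": ["color"],
    "large": ["size"],
}


def upward_reachable(x):
    """All nodes reachable from x by going upwards, including x itself."""
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(UP.get(node, []))
    return seen


def shared_nodes(x, y):
    """Nodes that are upwards reachable from both concepts."""
    return upward_reachable(x) & upward_reachable(y)


def sim(x, y):
    """Share of x's upward-reachable nodes that y also reaches (asymmetric)."""
    return len(shared_nodes(x, y)) / len(upward_reachable(x))
```

On this toy graph the gray dog shares three nodes with the gray cat (animal, gray, color) but only two with the large dog (dog, animal), so the plain count ranks the gray cat as more similar.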
Shared Nodes, cont.
sim(dog[CHR:gray], cat[CHR:gray]) > sim(dog[CHR:gray], dog[CHR:large])
Counterintuitive:
• Concept inclusion (ISA) should have higher importance than the characterized-by (CHR) property
Weighted Shared Nodes
• Not all nodes are equally important
sim(dog[CHR:gray], cat[CHR:gray]) > sim(dog[CHR:gray], dog[CHR:large])
Solution: attach weights in [0,1] to relations, so that the set of nodes upwards reachable from x becomes a fuzzy set
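The effect of relation weights can be sketched as follows. Everything specific here is an assumption made for illustration: the weights (ISA 1.0, CHR 0.3), the rule that a node's membership is the best product of weights along any upward path, and the use of min and sum for fuzzy intersection and cardinality.

```python
# Hypothetical relation weights: ISA counts fully, CHR only partly.
WEIGHT = {"ISA": 1.0, "CHR": 0.3}

# Toy instantiated ontology: node -> list of (relation, node one step up).
UP = {
    "dog[CHR:gray]": [("ISA", "dog"), ("CHR", "gray")],
    "cat[CHR:gray]": [("ISA", "cat"), ("CHR", "gray")],
    "dog[CHR:large]": [("ISA", "dog"), ("CHR", "large")],
    "dog": [("ISA", "animal")],
    "cat": [("ISA", "animal")],
    "gray": [("ISA", "color")],
    "large": [("ISA", "size")],
}


def fuzzy_upward(x):
    """Fuzzy set of nodes reachable upwards from x. Membership is the best
    product of relation weights along any path; x itself has degree 1.0."""
    degree = {x: 1.0}
    stack = [x]
    while stack:
        node = stack.pop()
        for relation, parent in UP.get(node, []):
            mu = degree[node] * WEIGHT[relation]
            if mu > degree.get(parent, 0.0):
                degree[parent] = mu
                stack.append(parent)
    return degree


def sim(x, y):
    """Fuzzy cardinality of the shared nodes relative to that of x's nodes."""
    ax, ay = fuzzy_upward(x), fuzzy_upward(y)
    shared = sum(min(mu, ay.get(node, 0.0)) for node, mu in ax.items())
    return shared / sum(ax.values())
```

With these weights the nodes reached through CHR contribute less, and on this toy graph the gray dog comes out closer to the large dog than to the gray cat.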
Query Evaluation - Simple Fuzzy Retrieval
• Assign to each pair (di, cj) a value which defines the relevance of cj to di
• Compute the relevance of documents to a given query as
\[ RSV_Q(d) = \frac{|Q \cap d|}{|Q|} = \frac{\sum_{c \in Q} \min(\mu_Q(c), \mu_d(c))}{\sum_{c \in Q} \mu_Q(c)} \]
d1 = {1/c1 + 1/c2 + 0/c3}
d2 = {1/c1 + 0/c2 + 1/c3}
d3 = {0/c1 + 0/c2 + 1/c3}
and Q = {c1, c2}
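The retrieval status value can be computed directly for the three example documents. The sketch assumes the usual fuzzy reading of the slide: the relevance of d to Q is the cardinality of the intersection of Q and d divided by the cardinality of Q, with min as intersection and the sum of membership degrees as cardinality. The names `rsv` and `ranking` are chosen here for illustration.

```python
# Documents as fuzzy sets over concepts (degrees from the slide).
documents = {
    "d1": {"c1": 1.0, "c2": 1.0, "c3": 0.0},
    "d2": {"c1": 1.0, "c2": 0.0, "c3": 1.0},
    "d3": {"c1": 0.0, "c2": 0.0, "c3": 1.0},
}
# The crisp query Q = {c1, c2}, written as a fuzzy set with degrees 1.0.
query = {"c1": 1.0, "c2": 1.0}


def rsv(query, document):
    """Retrieval status value: |Q ∩ d| / |Q|, with min as intersection
    and the sum of membership degrees as cardinality."""
    overlap = sum(min(mu_q, document.get(c, 0.0)) for c, mu_q in query.items())
    return overlap / sum(query.values())


# Rank the documents by decreasing relevance to the query.
ranking = sorted(documents, key=lambda name: rsv(query, documents[name]), reverse=True)
```

For this query d1 matches both concepts, d2 matches one, and d3 matches none, so the values are 1.0, 0.5 and 0.0 and the ranking is d1, d2, d3.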
Query Evaluation - Hierarchical Aggregation
<q1(d),
  <q2(d), q3(d),
    <q4(d), q5(d), q6(d) : M3 : K3>
  : M2 : K2>
: M1 : K1>
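[Figure: aggregation tree for the expression above. The root (M1 : K1) has children q1(d) and (M2 : K2); (M2 : K2) has children q2(d), q3(d) and (M3 : K3); (M3 : K3) has children q4(d), q5(d) and q6(d).]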