A Semantic Web Search and Metadata Engine Roi Adadi David Ben-David
Semantic Web Document (SWD)
  ◦ A web page that serializes an RDF graph.
  ◦ Uses one of the recommended RDF syntax languages, i.e. RDF/XML, N-TRIPLE or N3.
Semantic Web Term (SWT)
  ◦ An RDF resource that represents an instance of rdfs:Class or rdf:Property, and can be universally referenced by its URI reference (URIref).
Semantic Web Ontology (SWO)
  ◦ An SWD is considered to be an SWO when a significant proportion of the statements it makes define new SWTs.
Semantic Web Database (SWDB)
  ◦ An SWD that does not define or extend a significant number of terms.
  ◦ Introduces individuals and makes assertions about them.
  ◦ Makes assertions about individuals defined in other SWDs.
Glossary
<rdf:RDF> …
  <rdfs:Class rdf:ID="Department" />
  <rdfs:Class rdf:ID="Course" />
  <rdf:Property rdf:ID="name">
    <rdfs:domain>
      <owl:Class>
        <owl:unionOf rdf:parseType="Collection">
          <rdfs:Class rdf:about="#Department" />
          <rdfs:Class rdf:about="#Course" />
        </owl:unionOf>
      </owl:Class>
    </rdfs:domain>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
  </rdf:Property>
  <rdf:Property rdf:ID="number">
    <rdfs:domain rdf:resource="#Course" />
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
  </rdf:Property>
  <rdf:Property rdf:ID="department">
    <rdfs:domain rdf:resource="#Course" />
    <rdfs:range rdf:resource="#Department" />
  </rdf:Property>
  <rdf:Property rdf:ID="creditPts">
    <rdfs:domain rdf:resource="#Course" />
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal" />
  </rdf:Property>
  <Department rdf:ID="dept_cs">
    <name>Computer Science</name>
  </Department>
  <Course rdf:ID="cs236703">
    <name>Object Oriented Programming</name>
    <department rdf:resource="#dept_cs" />
    <creditPts>3.0</creditPts>
  </Course>
  …
</rdf:RDF>
[Figure callouts: the document as a whole is labeled SWD; six callouts inside it are labeled SWT.]
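As a rough illustration of how the SWTs defined by a document like the sample above could be listed, the sketch below parses a trimmed copy of it with Python's standard xml.etree.ElementTree and collects the rdf:ID of every top-level rdfs:Class and rdf:Property. The example.org default namespace, the SAMPLE string and the defined_terms helper are assumptions made for illustration; this is not how Swoogle itself is implemented.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Trimmed, hypothetical copy of the sample document.
# The default namespace (example.org) is an assumption.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}"
         xmlns="http://example.org/univ#">
  <rdfs:Class rdf:ID="Department"/>
  <rdfs:Class rdf:ID="Course"/>
  <rdf:Property rdf:ID="name"/>
  <rdf:Property rdf:ID="number"/>
  <rdf:Property rdf:ID="department"/>
  <rdf:Property rdf:ID="creditPts"/>
  <Department rdf:ID="dept_cs"><name>Computer Science</name></Department>
  <Course rdf:ID="cs236703"><name>Object Oriented Programming</name></Course>
</rdf:RDF>"""


def defined_terms(xml_text):
    """Return the rdf:ID of each top-level rdfs:Class or rdf:Property node."""
    root = ET.fromstring(xml_text)
    term_tags = {f"{{{RDFS}}}Class", f"{{{RDF}}}Property"}
    return [el.get(f"{{{RDF}}}ID") for el in root if el.tag in term_tags]


terms = defined_terms(SAMPLE)
print(terms)
```

The two individuals (dept_cs and cs236703) are skipped because their element names are neither rdfs:Class nor rdf:Property.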
SWO
Class Document
Class Organization
Property mbox
FOAF: http://xmlns.com/foaf/spec/index.rdf
Contains 12 classes and 51 properties (in 466 triples)
(No individuals)
SWDB
Name statement
Nick Name statement
FOAF description for Tim Finin
www.cs.umbc.edu/~finin//foaf.rdf
Defines three individuals and makes statements about them
(No classes or properties)
Current form of the Semantic Web
  ◦ A web of Semantic Web Documents (SWDs).
Navigating the Semantic Web is difficult
  ◦ Paucity of explicit hyperlinks (beyond namespaces in URIrefs).
  ◦ Relations such as rdfs:seeAlso and owl:imports are rare.
There is a need for a search engine customized for SWDs
  ◦ Find and analyze SWDs on the web.
  ◦ Suggest a measure of SWD importance (ranking).
Motivation
Semantic Web researchers
  ◦ Search for SWTs and SWOs for publishing their knowledge.
Software agents
  ◦ Search SWDs for external knowledge.
  ◦ Retrieve SWOs to fully understand SWTs.
Who needs it?
Find the most popular ontology to publish a personal profile
Conventional web navigation and ranking models are not suitable for the Semantic Web.
They do not differentiate SWDs from other web pages.
They do not parse or use the internal structure of SWDs or the external semantic links among SWDs
  ◦ Designed to work with natural language and unstructured text.
Why not just use Google?
The FOAF ontology is not among the top 10 Google search results for “person ontology”
Finding appropriate ontologies
  ◦ Qualified search (terms + types).
  ◦ Ontologies are sorted by their popularity.
Finding instance data
  ◦ Querying SWDs with constraints on the classes and properties they use.
  ◦ Helps to integrate Semantic Web data on the web.
Characterizing the Semantic Web
  ◦ Structural properties.
Swoogle Objectives
Ontology-Based Annotation Systems
  ◦ SHOE, Ontobroker, webKB, QuizRDF, CREAM, …
  ◦ Annotate online documents.
  ◦ Documents are indexed based on the annotations, not on the entire document.
  ◦ Use their own ontologies, which might not suit some SWDs.
Related Work
Ontology Repositories
  ◦ DAML Ontology Library, SemWebCentral, Schema Web, …
  ◦ Collect ontologies (simply store the entire RDF document).
  ◦ Do not automatically discover SWDs but rather require people to submit URLs.
  ◦ Constitute a small portion of the Semantic Web.
Related Work – cont.
Semantic Web Browsers
  ◦ W3C’s Ontaria
      Searchable and browsable directory of RDF documents developed by the W3C.
  ◦ Does not automatically discover SWDs.
  ◦ Stores the full RDF graphs.
  ◦ Indexes individuals of well-known classes
      e.g. foaf:Person, rss:Item
Related Work – cont.
Experiments show: Swoogle outperforms them all!
Crawler-based indexing and retrieval system for the Semantic Web.
Discovers Semantic Web documents.
Computes relations between documents.
Stores and reasons over extracted metadata.
  ◦ The system is designed to scale up to handle tens of millions of documents.
Enables rich query constraints on semantic relations.
Swoogle
Swoogle Architecture
Collects candidate URLs to find and cache SWDs
  ◦ Submitted URLs.
  ◦ A Web crawler.
  ◦ A customized meta-crawler (using conventional search engines).
  ◦ SwoogleBot Semantic Web crawler.
Analyzes SWDs to produce new candidates.
Swoogle Architecture – Discovery
So far, Swoogle has found over 1.7M SWDs with more than 1G triples!
Analyzes the discovered SWDs.
Generates the bulk of Swoogle’s metadata about the Semantic Web
  ◦ Characterizes features associated with SWDs and SWTs.
  ◦ Tracks relations among SWDs and SWTs.
Swoogle Architecture – Indexing
How do SWDs use/define/populate a given SWT?
How are two SWTs associated?
…
Analyzes the generated metadata.
  ◦ Classification of SWOs and SWDBs.
Hosts the modular ranking mechanisms.
  ◦ OntologyRank.
Swoogle Architecture – Analysis
Provides search services to software agents and users, allowing them to access metadata and navigate the Semantic Web
  ◦ Swoogle Search – searches SWDs using constraints on URLs, SWTs being used or defined, etc.
  ◦ Ontology Dictionary – searches ontologies at the term level and offers more navigational paths.
Swoogle Architecture – Services
SWD metadata is collected to make SWD search more efficient and effective.
It is derived from the content of SWDs as well as the relations among SWDs.
Three categories of metadata:
  ◦ Basic metadata
  ◦ Relations among SWDs
  ◦ Analytical results
SWD Metadata
Language Features – properties describing the syntactic or semantic features of an SWD.
  ◦ Encoding – syntactic encoding of an SWD.
      “RDF/XML”, “N-TRIPLE” and “N3”.
  ◦ Language – the language used by an SWD.
      “OWL”, “DAML+OIL”, “RDFS” and “RDF”.
  ◦ OWL Species – the language species of an SWD written in OWL.
      “OWL-LITE”, “OWL-DL” and “OWL-FULL”.
Basic Metadata
RDF Statistics – properties summarizing the node distribution of the RDF graph of an SWD.
  ◦ How an SWD defines new classes, properties and individuals.
  ◦ Let foo be an SWD and let C(foo), P(foo), I(foo) be the sets of classes, properties and individuals defined in foo, respectively. The ontology-ratio R(foo) is calculated by:
$$R(foo) = \frac{|C(foo)| + |P(foo)|}{|C(foo)| + |P(foo)| + |I(foo)|}$$
  ◦ R(foo) ranges from 0 to 1, where 0 implies that foo is a pure SWDB and 1 implies that foo is a pure SWO.
Basic Metadata – cont.
Example: for the sample RDF document shown earlier (call it A):
$$C(A) = \{Department, Course\}$$
$$P(A) = \{name, number, department, creditPts\}$$
$$I(A) = \{dept\_cs, cs236703\}$$
$$R(A) = \frac{2 + 4}{2 + 4 + 2} = 0.75$$
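The ontology-ratio for the sample document can be checked with a few lines of Python. This is only a sketch: the ontology_ratio function and its error handling for an empty document are assumptions for illustration, not Swoogle's code.

```python
def ontology_ratio(classes, properties, individuals):
    """R = (|C| + |P|) / (|C| + |P| + |I|) for the terms an SWD defines."""
    defined_terms = len(classes) + len(properties)
    total = defined_terms + len(individuals)
    if total == 0:
        # Assumption: the ratio is undefined for a document that defines nothing.
        raise ValueError("document defines no classes, properties or individuals")
    return defined_terms / total


# Sets taken from the sample document A.
C = {"Department", "Course"}
P = {"name", "number", "department", "creditPts"}
I = {"dept_cs", "cs236703"}

ratio = ontology_ratio(C, P, I)
print(ratio)  # 0.75
```

A document with only individuals yields 0 (a pure SWDB), and one with only classes and properties yields 1 (a pure SWO).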
Ontology Annotations – properties that describe an SWD as an ontology.
  ◦ The SWD has an instance of owl:Ontology.
  ◦ Swoogle records the following properties:
      label (rdfs:label)
      comment (rdfs:comment)
      versionInfo (owl:versionInfo/daml:versionInfo)
Basic Metadata – cont.
Capturing and analyzing relations at the RDF node level is hard.
Swoogle generalizes RDF node-level relations and focuses on SWD-level relations.
Swoogle captures the following SWD-level relations:
  ◦ TM/IN – an SWD uses terms defined by other SWDs.
  ◦ IM – an ontology imports another ontology.
  ◦ EX – an ontology extends another ontology.
  ◦ PV – an ontology is a prior version of another.
  ◦ CPV – an ontology is a prior version of another and is compatible with it.
  ◦ IPV – an ontology is a prior version of another and is incompatible with it.
Relations Among SWDs
Inter-Ontology relations
Indicators of inter-ontology relation
OntologyRank is inspired by Google’s PageRank algorithm.
Underlying random surfing model:
  ◦ The surfer jumps to a random URL.
  ◦ With probability d, the surfer randomly chooses a link to follow.
  ◦ With probability 1-d, the surfer jumps to another random URL.
Ranking SWDs
Given a document A, A’s PageRank is computed by:
$$PR(A) = PR_{direct}(A) + PR_{link}(A)$$
$$PR_{direct}(A) = (1 - d)$$
$$PR_{link}(A) = d \cdot \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
where $T_1, \ldots, T_n$ are web documents that link to A; C(T) is the total number of outlinks of T; and d is a damping factor, typically set to 0.85.
PageRank
PageRank
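$$PR_{direct}(A) = 1 - d = 0.15$$
$$PR(T_1) = 15, \quad PR(T_2) = 8, \quad PR(T_3) = 18$$
$$C(T_1) = 5, \quad C(T_2) = 4, \quad C(T_3) = 6$$
$$PR_{link}(A) = d \cdot \left( \frac{15}{5} + \frac{8}{4} + \frac{18}{6} \right) = 0.85 \cdot 8 = 6.8$$
$$PR(A) = 0.15 + 6.8 = 6.95$$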
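The arithmetic of this example can be reproduced with a short Python sketch. With the values on the slide, the terms PR(T)/C(T) are 3, 2 and 3, so their sum is 8, the link part is 0.85 · 8 = 6.8 and the total is 6.95. The page_rank function and its argument layout are assumptions for illustration; it evaluates a single step of the formula rather than iterating over a whole graph.

```python
def page_rank(inlinks, d=0.85):
    """One evaluation of PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    inlinks is a list of (PR(T), C(T)) pairs, one per document T linking to A.
    """
    pr_direct = 1 - d
    pr_link = d * sum(pr / outlinks for pr, outlinks in inlinks)
    return pr_direct + pr_link


# (PR(T), C(T)) for T1, T2 and T3 from the example.
pr_a = page_rank([(15, 5), (8, 4), (18, 6)])
print(round(pr_a, 2))  # 6.95
```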
The graph formed by SWDs has a richer set of relations.
  ◦ The edges have explicit semantics.
Users can navigate the Semantic Web within or across the web and the RDF graph through 7 groups of navigational paths.
The SW Navigation Model
The SW Navigation Model
The semantics of links lead to a non-uniform probability of following a particular outgoing link.
Given SWDs A and B, Swoogle classifies inter-SWD links into four categories:
  ◦ imports(A,B) – A imports all content of B.
  ◦ uses-term(A,B) – A uses some of the terms defined by B (without importing B).
  ◦ extends(A,B) – A extends the definitions of terms defined by B.
  ◦ asserts(A,B) – A makes assertions about the individuals defined by B.
Each category is assigned a different weight, which represents the probability of following that kind of link.
OntologyRank
Given an SWD a, Swoogle computes its raw rank by:
$$rawPR(a) = (1 - d) + d \cdot \sum_{x \in L(a)} rawPR(x) \frac{f(x, a)}{f(x)}$$
$$f(x, a) = \sum_{l \in links(x, a)} weight(l)$$
$$f(x) = \sum_{a' \in T(x)} f(x, a')$$
where L(a) is the set of SWDs that link to a, and T(x) is the set of SWDs that x links to.
OntologyRank – cont.
Then, Swoogle computes the rank for SWDBs and SWOs by:
$$PR_{SWDB}(a) = rawPR(a)$$
$$PR_{SWO}(a) = \sum_{x \in TC(a)} rawPR(x)$$
where TC(a) is the transitive closure of SWOs imported by a.
OntologyRank – cont.
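The two OntologyRank steps can be sketched in Python as follows. Several things here are assumptions and not taken from the slides: the numeric link weights, the use of a fixed number of fixed-point iterations, the tiny three-document graph, and the choice to include a itself in its import closure.

```python
# Hypothetical weights per link category; the slides do not give the real values.
WEIGHTS = {"imports": 4.0, "extends": 3.0, "uses-term": 2.0, "asserts": 1.0}


def raw_rank(links, d=0.85, iterations=50):
    """Iterate rawPR(a) = (1 - d) + d * sum_x rawPR(x) * f(x, a) / f(x).

    links is a list of (source, target, category) tuples.
    """
    nodes = sorted({n for s, t, _ in links for n in (s, t)})
    # f[x][a]: summed weight of all links from x to a.
    f = {x: {} for x in nodes}
    for s, t, category in links:
        f[s][t] = f[s].get(t, 0.0) + WEIGHTS[category]
    # f_total[x]: summed weight of all links leaving x.
    f_total = {x: sum(f[x].values()) for x in nodes}

    rank = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        rank = {
            a: (1 - d) + d * sum(
                rank[x] * f[x][a] / f_total[x] for x in nodes if a in f[x]
            )
            for a in nodes
        }
    return rank


def imported_closure(a, imports):
    """Return a plus every SWO reachable from a through imports edges.

    Including a itself is an assumption of this sketch.
    """
    seen, stack = {a}, [a]
    while stack:
        for nxt in imports.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def swo_rank(a, rank, imports):
    """PR_SWO(a): sum of rawPR over the import closure of a."""
    return sum(rank[x] for x in imported_closure(a, imports))


# Hypothetical graph: A imports B and uses terms of C; C extends B.
links = [("A", "B", "imports"), ("A", "C", "uses-term"), ("C", "B", "extends")]
rank = raw_rank(links)
print({n: round(r, 6) for n, r in rank.items()})
print(round(swo_rank("A", rank, {"A": ["B"]}), 6))
```

Because each outgoing link is weighted by its category, B receives two thirds of A's rank through the imports link, while C receives only one third through the uses-term link.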
The problem of indexing and searching SWDs
  ◦ Significant semantic information is encoded in marked-up documents.
  ◦ Reasoning over a large collection of documents can be expensive.
Traditional information retrieval techniques
  ◦ Faster (coarse view of the text).
  ◦ Can quickly retrieve a set of SWDs based on similarities of the source text alone.
Indexing and Retrieval of SWDs
SWDs are not entirely markup.
  ◦ Search should be applied to both the structured and unstructured components of the document.
We may want SWDs to be available to commonly used search engines
  ◦ Documents must be transformed into a form that a standard IR engine can understand and manipulate.
Well-researched methods exist for ranking matches, computing similarities between documents and employing relevance feedback.
Applying IR Techniques
Look at a document as a collection of either tokens or N-Grams.
URIrefs of classes, properties and individuals correspond to words in natural languages.
Apply the following process to an SWD
  ◦ Reduce it to triples.
  ◦ Extract URIrefs (with duplicates).
  ◦ Discard URIrefs of blank nodes.
  ◦ Hash each URI to a token.
  ◦ Index the document.
Applying IR Techniques
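The five-step process above can be sketched in Python. Everything concrete here is an assumption for illustration: triples are given as tuples of strings in an N-Triples-like convention (blank nodes start with "_:", literals start with a double quote), SHA-1 stands in for the unspecified hash, and the example.org document is invented.

```python
import hashlib
from collections import Counter, defaultdict


def is_uriref(node):
    """Treat anything that is neither a blank node nor a quoted literal as a URIref."""
    return not node.startswith("_:") and not node.startswith('"')


def token_for(uriref):
    """Hash a URIref to a short token (SHA-1 prefix; the hash choice is an assumption)."""
    return hashlib.sha1(uriref.encode("utf-8")).hexdigest()[:12]


def tokens_of(triples):
    """Extract URIrefs from triples (keeping duplicates) and hash each one."""
    return [token_for(node) for triple in triples for node in triple if is_uriref(node)]


def build_index(documents):
    """Build an inverted index: token -> {document URL: occurrence count}."""
    index = defaultdict(dict)
    for url, triples in documents.items():
        for token, count in Counter(tokens_of(triples)).items():
            index[token][url] = count
    return index


# Hypothetical document, already reduced to triples.
NS = "http://example.org/univ#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
docs = {
    "http://example.org/univ.rdf": [
        (NS + "cs236703", RDF_TYPE, NS + "Course"),
        (NS + "cs236703", NS + "name", '"Object Oriented Programming"'),
        ("_:b0", RDF_TYPE, NS + "Course"),
    ]
}
index = build_index(docs)
print(index[token_for(NS + "Course")])
```

Keeping duplicates lets the index record how often each term occurs in a document, which a standard IR engine can use as a term frequency.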
Swoogle indexes by either N-Grams or URIrefs
Matching “time” to:
http://foo.com/timeont.owl#timeInterval
http://foo.com/timeont.owl#calendarClockInterval
http://purl.org/upper/temporal/t13.owl#timeThing
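A small Python sketch shows why N-Gram indexing lets the query “time” reach these URIrefs when whole-URIref tokens would not. The ngrams and ngram_match helpers, the choice of character trigrams and the extra example.org URIref are assumptions for illustration only.

```python
def ngrams(text, n=3):
    """Return the set of lowercase character n-grams of text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_match(query, uriref, n=3):
    """True when every n-gram of the query also occurs in the URIref."""
    grams = ngrams(query, n)
    return bool(grams) and grams <= ngrams(uriref, n)


URIREFS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#calendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]

# N-Gram matching finds all three URIrefs.
hits = [u for u in URIREFS if ngram_match("time", u)]
# Matching on whole URIrefs finds none, since no URIref equals "time".
exact = [u for u in URIREFS if u == "time"]
# A hypothetical unrelated URIref is not matched.
miss = ngram_match("time", "http://example.org/geo.owl#Region")
print(len(hits), len(exact), miss)
```

The second URIref matches only through the “timeont” part of its namespace, which shows that N-Gram matching trades precision for recall.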
Swoogle Demo…