ISP 433/533 Week 11 XML Retrieval
Structured Information
• Traditional IR
– Unit of information: terms and documents
– No structure
• Need more granularity
• Document has structure
– E.g. title, sections, footnotes, etc.
• A markup language is a mechanism to identify structures in a document
– Data + Metadata
Extensible Markup Language (XML)
• Markup (tags – not a fixed set)
• Content
• Nested, named trees with attributes
<?xml version="1.0" encoding="UTF-8"?>
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>
Elements
• Delimited by angle brackets
• Identify the nature of the content they surround
• Elements can be nested within another element
– A tree structure
• Elements may have attributes
– E.g. <div class="preface">
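To make the tree view concrete, here is a minimal Python sketch that parses a shortened copy of the bookinfo example with the standard library's xml.etree.ElementTree and lists every element with its nesting depth. The helper name outline is made up for illustration, and the books elided by "...." in the slide are simply left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookinfo example (the elided books are omitted).
BOOKINFO = """
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bookinfo>
"""


def outline(element, depth=0):
    """Return one indented line per element: tag, attributes, own text."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"{element.tag} {element.attrib} {text!r}"]
    for child in element:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOKINFO)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Attributes are exposed as a dictionary on the element.
preface = ET.fromstring('<div class="preface"/>')
print(preface.attrib)
```

Each book is a subtree of bookinfo, and the attribute of the div element from the slide shows up as a key/value pair.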
Unit of Retrieval
• Traditional IR
– Document
• XML IR
– Element or fragment of an element
Example Retrieval Units
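[Figure: tree of an example XML document. The root element document (attribute class="H.3.3") has an author ("John Smith"), a title, and chapter elements that contain headings and sections. Text fragments shown in the figure: "XML Retrieval", "Introduction", "This...", "Syntax", "Examples", "XML Query Lang. XQL", "We describe syntax of XQL". Five regions of the tree are marked 1 to 5 as example retrieval units.]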
Requirements for XML Retrieval
• Basic needs for XML retrieval
– Query both data and metadata
– Express the query in a user-friendly way
– Return proper document fragments
– Rank the results according to their relevance
INEX
The initiative for evaluating XML retrieval
– International, coordinated effort to promote evaluation procedures for content-based XML retrieval
– Provides a large test collection of XML documents (12,000 articles in IEEE CS publications since 1995)
– Introduces both content-only (CO) and content-and-structure (CAS) topics
– Designed to be a long-term initiative with workshops held on a yearly basis (currently in the second year)
INEX CO Topic example
<Title>
  <cw>semantic web</cw>
</Title>
<Description>
  Research and business opportunities and challenges in developing and deploying the concept of the Semantic Web and the associated idea of web services.
</Description>
<Narrative>
  To be relevant, a document/component must either discuss the technical issues and opportunities associated with the semantic web, or it must discuss the business challenges, especially the question of viable business models for web services.
</Narrative>
<Keywords>
  semantic web, ontologies, SOAP, UDDI, RDF…
</Keywords>
INEX CAS Topic example
<Title>
  <te>//fig, //p, //ip1</te>
  <cw>Corba architecture</cw> <ce>//fgc</ce>
  <cw>Figure Corba Architecture</cw> <ce>//p, //ip1</ce>
</Title>
<Description>
  Find figures that describe the Corba architecture and the paragraphs that refer to those figures.
</Description>
<Narrative>
  To be relevant a figure must describe the standard Corba architecture or a system architecture that relies heavily on Corba… Retrieved components would ideally contain both the figure and the paragraph referring to it.
</Narrative>
<Keywords>
  CORBA Object Request Broker Architecture …
</Keywords>
An Inverted Index for XML
Example document (the numbers on the left are the word positions 1 to 23 of the tags and terms on each line):
1      <section>
2-7      <title> Information Retrieval Using RDBMS </title>
8        <section>
9-13       <title> Beyond Simple Translation </title>
14         <section>
15-20        <title> Extension of IR Features </title>
21         </section>
22       </section>
23     </section>
Element index
<section>: (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) …
<title>: (1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) …
Text index
"information": (1, 3, 2) …
"retrieval": (1, 4, 2) …
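One way to read the postings on this slide is (document number, start:end position, nesting depth) for elements and (document number, position, nesting depth) for terms. The slide does not spell this out, so treat that reading as an assumption. Under it, the following Python sketch rebuilds both indexes from the example document; build_indexes and contains are hypothetical names, and element postings are stored as 4-tuples (doc, start, end, depth) instead of the start:end notation.

```python
import re
from collections import defaultdict

DOC = """<section> <title> Information Retrieval Using RDBMS </title>
<section> <title> Beyond Simple Translation </title>
<section> <title> Extension of IR Features </title> </section> </section> </section>"""


def build_indexes(doc_id, xml_text):
    """Assign one position per tag or word and build both inverted indexes."""
    element_index = defaultdict(list)  # tag  -> [(doc, start, end, depth)]
    text_index = defaultdict(list)     # term -> [(doc, position, depth)]
    open_elements = []                 # stack of (tag, start position)
    tokens = re.findall(r"</?\w+>|\w+", xml_text)
    for position, token in enumerate(tokens, start=1):
        if token.startswith("</"):
            tag, start = open_elements.pop()
            # Depth of an element = number of elements still open around it.
            element_index[tag].append((doc_id, start, position, len(open_elements)))
        elif token.startswith("<"):
            open_elements.append((token[1:-1], position))
        else:
            # Depth of a term = number of elements currently open.
            text_index[token.lower()].append((doc_id, position, len(open_elements)))
    for postings in element_index.values():
        postings.sort()
    return element_index, text_index


def contains(element_posting, term_posting):
    """True if the term occurrence lies inside the element occurrence."""
    doc, start, end, _depth = element_posting
    term_doc, position, _term_depth = term_posting
    return doc == term_doc and start < position < end


element_index, text_index = build_indexes(1, DOC)
print(element_index["section"])
print(element_index["title"])
print(text_index["information"], text_index["retrieval"])
```

With postings of this shape, a containment test such as "retrieval occurs inside a title" reduces to comparing positions, which is what contains does.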
XPath
• XPath is a non-XML language for identifying particular parts of XML documents
– Picking nodes and sets of nodes
• Similar to Unix file system path expressions
– "/people/person/name/first_name"
– "*" wildcard
– ".." parent
– "." context node
– "//" descendants
– "@" attribute
– "[]" predicate, specifies a condition
XPath Example
chapter/heading
[Figure: the example document tree from "Example Retrieval Units" (document, author, title, chapter, heading and section nodes).]
XPath Example
chapter//heading
[Figure: the example document tree from "Example Retrieval Units" (document, author, title, chapter, heading and section nodes).]
XPath Example
//chapter[heading]
[Figure: the example document tree from "Example Retrieval Units" (document, author, title, chapter, heading and section nodes).]
XPath Example
/document[@class="H.3.3" and author="John Smith"]
[Figure: the example document tree from "Example Retrieval Units" (document, author, title, chapter, heading and section nodes).]
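The four expressions from the XPath Example slides can be tried out with Python's xml.etree.ElementTree, which implements only a subset of XPath. The sketch below makes several assumptions: the document is loosely modelled on the figure, paths are written relative to the root element because ElementTree rejects absolute paths on elements, and the last expression uses two chained predicates against a hypothetical wrapper element because ElementTree has no "and" operator.

```python
import xml.etree.ElementTree as ET

# Assumed layout, loosely modelled on the example figure.
DOCUMENT = """
<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
  </chapter>
  <chapter>
    <heading>XML Query Language XQL</heading>
    <section><heading>Examples</heading></section>
    <section><heading>Syntax</heading>We describe syntax of XQL</section>
  </chapter>
</document>
"""

root = ET.fromstring(DOCUMENT)  # root is the <document> element


def texts(elements):
    """Return the leading text of each element."""
    return [element.text for element in elements]


# chapter/heading: headings that are direct children of a chapter
direct = texts(root.findall("chapter/heading"))

# chapter//heading: headings at any depth below a chapter
nested = texts(root.findall("chapter//heading"))

# //chapter[heading]: chapters that have a heading child
chapters = root.findall(".//chapter[heading]")

# document with class "H.3.3" and an author child "John Smith";
# the wrapper is a hypothetical parent so the root can be matched by name.
wrapper = ET.Element("wrapper")
wrapper.append(root)
matches = wrapper.findall("document[@class='H.3.3'][author='John Smith']")

print(direct)
print(nested)
print(len(chapters), len(matches))
```

The difference between "/" and "//" shows up in the first two results: the single slash stops at the chapter's own headings, the double slash also reaches the section headings.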
More XPath Examples
• //@id/..
– All the elements that have attribute “id”
• //middle_initial/../first_name
– All the first_name elements that are siblings of middle_initial elements
• //person[profession="physicist"]
– All person elements that have a profession child element with the value “physicist”
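A small sketch of these three expressions in Python, again using xml.etree.ElementTree and its XPath subset. The sample people data is made up for illustration, and //@id/.. is rewritten as .//*[@id] because ElementTree cannot select attribute nodes.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; only the tag names come from the slide.
PEOPLE = """
<people>
  <person id="p1">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
  </person>
  <person>
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>mathematician</profession>
  </person>
</people>
"""

root = ET.fromstring(PEOPLE)

# //@id/.. : elements that carry an id attribute
with_id = root.findall(".//*[@id]")

# //middle_initial/../first_name : first_name siblings of middle_initial
siblings = root.findall(".//middle_initial/../first_name")

# //person[profession="physicist"] : persons with that profession child
physicists = root.findall(".//person[profession='physicist']")

print([element.tag for element in with_id])
print([element.text for element in siblings])
print([person.findtext("name/last_name") for person in physicists])
```

Only the first person has a middle initial, so only that person's first_name is returned by the sibling query.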
XQuery
• A language to query data that is similar to XML in structure
– Nested, named trees with attributes
• Based on XPath
FOR/LET PathExpression
WHERE AdditionalSelectionCriteria
RETURN ResultConstruction
XQuery Example
• Find the name(s) of customers who have ordered the part whose part_id is "xx"
FOR $c IN customers
FOR $o IN orders
WHERE $c.cust_id=$o.cust_id AND $o.part_id="xx"
RETURN $c.name
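The FOR/WHERE/RETURN pattern of this query maps closely onto a nested comprehension. The Python sketch below mirrors it clause by clause; the customer and order records are invented sample data, and customers_who_ordered is a hypothetical helper, so this shows the shape of the join and is not an XQuery implementation.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the field names come from the slide.
CUSTOMERS = ET.fromstring("""
<customers>
  <customer><cust_id>1</cust_id><name>Ada</name></customer>
  <customer><cust_id>2</cust_id><name>Grace</name></customer>
</customers>
""")

ORDERS = ET.fromstring("""
<orders>
  <order><cust_id>1</cust_id><part_id>xx</part_id></order>
  <order><cust_id>2</cust_id><part_id>yy</part_id></order>
</orders>
""")


def customers_who_ordered(part_id):
    """Names of customers with at least one order for the given part."""
    return [
        c.findtext("name")                                  # RETURN $c.name
        for c in CUSTOMERS.iter("customer")                 # FOR $c IN customers
        for o in ORDERS.iter("order")                       # FOR $o IN orders
        if c.findtext("cust_id") == o.findtext("cust_id")   # WHERE $c.cust_id=$o.cust_id
        and o.findtext("part_id") == part_id                #   AND $o.part_id="xx"
    ]


print(customers_who_ordered("xx"))
```

As in the query, a customer with several matching orders would appear once per order; removing duplicates is left out to keep the correspondence visible.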
Another XQuery Example
• Find titles and prices of books by ‘Meyer’ or ‘Smith’
FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN
<result>
  <title> $b/title </title>
  <price> $b/price </price>
</result>
One Document Structure
• Previous XQuery works
[Figure: bookinfo tree with two book children. The first book has title "Just Lost", authors "Mercy Meyer" and "Gina Meyer", and price $5.75; the second has title "Brown Hedi" and price $13.95.]
Another Document Structure
• Same XQuery doesn’t work
[Figure: bookinfo tree organized by author. Each author element has a name ("Dr. Meyer", "M. Brown") and book children; the books shown are "Goodnight Moon" (title only), "One Fish Two Fish" ($12.50) and "Cat in the Hat" ($14.95).]
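The dependence on document structure can be reproduced with a few lines of Python. The sketch below runs the same path-based lookup (book elements filtered by their author children) against both layouts. The two documents are reconstructions of the figures; in particular, which books belong to which author in the second layout is an assumption, and titles_and_prices is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Layout 1: author elements are children of book.
BOOK_CENTRIC = """
<bookinfo>
  <book>
    <title>Just Lost</title>
    <author>Mercy Meyer</author>
    <author>Gina Meyer</author>
    <price>5.75</price>
  </book>
  <book><title>Brown Hedi</title><price>13.95</price></book>
</bookinfo>
"""

# Layout 2: book elements are children of author (assignment assumed).
AUTHOR_CENTRIC = """
<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>12.50</price></book>
    <book><title>Cat in the Hat</title><price>14.95</price></book>
  </author>
  <author>
    <name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>
"""


def titles_and_prices(xml_text, names=("Meyer", "Smith")):
    """Mimic: FOR $b IN //book WHERE $b/author contains a name RETURN title, price."""
    root = ET.fromstring(xml_text)
    results = []
    for book in root.iter("book"):
        authors = [author.text or "" for author in book.findall("author")]
        if any(name in author for author in authors for name in names):
            results.append((book.findtext("title"), book.findtext("price")))
    return results


print(titles_and_prices(BOOK_CENTRIC))    # finds the Meyer book
print(titles_and_prices(AUTHOR_CENTRIC))  # finds nothing: book has no author child
```

The second call returns an empty list even though Dr. Meyer's books are in the data, because the path book/author does not exist in that layout.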
Problem with XQuery
• Requires knowledge of document structure
• Dependent on document structure
• Difficult for naive users
• Need extensions to solve the problem
• Still in active research
Don’t know the tags?
• Integrating with full-text keyword search
• Automatically identifying tag names
• Translating query terms to tag names
• Query expansion
Don’t know the structure?
• Schema-free XQuery
– Automatically identifying the minimal, meaningful set of nodes that can provide the answer
[Figure: bookinfo tree with two book children. The first book has title "Just Lost", two name elements ("Mercy Meyer", "Gina Meyer") and price $5.75; the second has title "Brown Bear" and price $13.95.]
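To give a feel for querying without knowing the structure, here is a deliberately simplified Python sketch: for every element whose text contains a keyword, it climbs to the nearest ancestor that also contains a title and returns the titles found there. This nearest-common-ancestor heuristic is a stand-in chosen for illustration, not the Schema-Free XQuery algorithm itself; the two documents are reconstructions of the figures (the author-centric book assignment is assumed) and titles_for is a hypothetical name.

```python
import xml.etree.ElementTree as ET

BOOK_CENTRIC = """
<bookinfo>
  <book>
    <title>Just Lost</title>
    <name>Mercy Meyer</name>
    <name>Gina Meyer</name>
    <price>5.75</price>
  </book>
  <book><title>Brown Bear</title><price>13.95</price></book>
</bookinfo>
"""

AUTHOR_CENTRIC = """
<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>12.50</price></book>
    <book><title>Cat in the Hat</title><price>14.95</price></book>
  </author>
  <author>
    <name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>
"""


def titles_for(xml_text, keyword):
    """Titles in the smallest subtree that holds both the keyword and a title."""
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    titles = []
    for node in root.iter():
        if keyword not in (node.text or ""):
            continue
        scope = node
        # Climb until the subtree below scope contains at least one title.
        while scope is not None and scope.find(".//title") is None:
            scope = parent_of.get(scope)
        if scope is None:
            continue
        for title in scope.iter("title"):
            if title.text not in titles:
                titles.append(title.text)
    return titles


print(titles_for(BOOK_CENTRIC, "Meyer"))
print(titles_for(AUTHOR_CENTRIC, "Meyer"))
```

The same call works on both layouts because it never names the path between the keyword and the title; it only relies on them being close together in the tree.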
Querying XML with Natural Language
• Translate natural language query to Schema-free XQuery
• NaLIX demo
Relevance Scoring
• Query: articles about “search engine”
[Figure: a chapter tree with title, section and p (paragraph) nodes. The title reads "Search and retrieval"; the two paragraphs read "... search engine ... retrieval of semantic information ..." and "... information retrieval ... search engine ...". Further sections are elided.]
TermJoin
• User-defined score function generates the score based on term occurrences and other information
• They are then joined
[Figure: the chapter tree from "Relevance Scoring" (chapter, title, section and p nodes), with a score attached to each node. The scores shown are 1, 2, 2, 4 and 5.]
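One possible reading of the scored tree is that each element's score is the number of query-term occurrences in its own text plus the scores of its children. The Python sketch below implements that reading only. The score function, the tree layout (title under chapter, both paragraphs under one section) and the name score_elements are assumptions made for illustration, and the sketch does not reproduce the TermJoin algorithm itself.

```python
import xml.etree.ElementTree as ET

# Assumed layout; the paragraph texts are the fragments shown in the figure.
CHAPTER = """
<chapter>
  <title>Search and retrieval</title>
  <section>
    <p>... search engine ... retrieval of semantic information ...</p>
    <p>... information retrieval ... search engine ...</p>
  </section>
</chapter>
"""


def score_elements(root, query_terms):
    """Map each element path to its term-occurrence score, summed bottom-up."""
    scores = {}

    def visit(element, path):
        # Own score: occurrences of the query terms in the element's leading text.
        words = (element.text or "").lower().split()
        score = sum(words.count(term) for term in query_terms)
        # Add the children's scores; the index is the child's position (1-based).
        for index, child in enumerate(element, start=1):
            score += visit(child, f"{path}/{child.tag}[{index}]")
        scores[path] = score
        return score

    visit(root, root.tag)
    return scores


scores = score_elements(ET.fromstring(CHAPTER), ["search", "engine"])
for element_path, score in scores.items():
    print(element_path, score)
```

Under this assumed layout the totals come out as 1 for the title, 2 for each paragraph, 4 for the section and 5 for the chapter, the same values that appear in the figure.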