
Lecture 21: XML Retrieval

Page 1: Lecture 21: XML Retrieval


Prof. Ray Larson University of California, Berkeley

School of Information
Tuesday and Thursday 10:30 am - 12:00 pm

Spring 2007
http://courses.ischool.berkeley.edu/i240/s07

Principles of Information Retrieval

Lecture 21: XML Retrieval

Page 2: Lecture 21: XML Retrieval


Mini-TREC

• Proposed Schedule
– February 15 – Database and previous queries
– February 27 – Report on system acquisition and setup
– March 8 – New queries for testing…
– April 19 – Results due (next Thursday)
– April 24 or 26 – Results and system rankings
– May 8 – Group reports and discussion

Page 3: Lecture 21: XML Retrieval


Announcement

• No Class on Tuesday (April 17th)

Page 4: Lecture 21: XML Retrieval


Today

• Review
– Geographic Information Retrieval
– GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.

• XML and Structured Element Retrieval
– INEX
– Approaches to XML retrieval

Credit for some of the slides in this lecture goes to Marti Hearst

Page 5: Lecture 21: XML Retrieval


Today

• Review
– Geographic Information Retrieval
– GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.

• Web Crawling and Search Issues
– Web Crawling
– Web Search Engines and Algorithms

Credit for some of the slides in this lecture goes to Marti Hearst

Page 6: Lecture 21: XML Retrieval


Introduction

• What is Geographic Information Retrieval?
– GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research, with the addition of spatially and geographically oriented indexing and retrieval.
– It combines aspects of DBMS research, user interface research, GIS research, and information retrieval research.

Page 7: Lecture 21: XML Retrieval


Example: Results display from CheshireGeo:

http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html

Page 8: Lecture 21: XML Retrieval


Other convex, conservative approximations (after Brinkhoff et al., 1993b), presented in order of increasing quality; the number in parentheses is the number of parameters needed to store the representation:

1) Minimum bounding circle (3)
2) MBR: minimum aligned bounding rectangle (4)
3) Minimum bounding ellipse (5)
4) Rotated minimum bounding rectangle (5)
5) 4-corner convex polygon (8)
6) Convex hull (varies)

Page 9: Lecture 21: XML Retrieval


Our Research Questions

• Spatial Ranking
– How effectively can the spatial similarity between a query region and a document region be evaluated and ranked, based on the overlap of the geometric approximations for these regions?

• Geometric Approximations & Spatial Ranking
– How do different geometric approximations affect the rankings?
• MBRs: the most popular approximation
• Convex hulls: the highest-quality convex approximation

Page 10: Lecture 21: XML Retrieval


Spatial Ranking: Methods for computing spatial similarity

Page 11: Lecture 21: XML Retrieval


Probabilistic Models: Logistic Regression attributes

• X1 = area of overlap(query region, candidate GIO) / area of query region

• X2 = area of overlap(query region, candidate GIO) / area of candidate GIO

• X3 = 1 − |fraction of the overlap region that is onshore − fraction of the candidate GIO that is onshore|

• Where: the range for all variables is 0 (not similar) to 1 (same)
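As an illustration of how these three attributes could be computed (a minimal sketch, not the actual Cheshire implementation; it assumes the shapely geometry library and precomputed onshore fractions):

from shapely.geometry import Polygon

def spatial_lr_features(query, candidate, overlap_onshore, candidate_onshore):
    """Compute the three logistic regression attributes for spatial ranking.

    query, candidate: shapely Polygons approximating the query region and the
    candidate GIO (e.g. MBRs or convex hulls). The onshore fractions are
    assumed precomputed from a coastline layer. All values fall in [0, 1].
    """
    overlap = query.intersection(candidate).area
    x1 = overlap / query.area       # overlap relative to the query region
    x2 = overlap / candidate.area   # overlap relative to the candidate GIO
    x3 = 1 - abs(overlap_onshore - candidate_onshore)  # shorefactor term
    return x1, x2, x3

# Example: two 4x4 squares overlapping in a 2x2 corner
q = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
c = Polygon([(2, 2), (6, 2), (6, 6), (2, 6)])
print(spatial_lr_features(q, c, 1.0, 0.75))  # (0.25, 0.25, 0.75)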

Page 12: Lecture 21: XML Retrieval


CA Named Places in the Test Collection – complex polygons

Counties Cities

National Parks

National Forests

Water QCB Regions

Bioregions

Page 13: Lecture 21: XML Retrieval


CA Counties – Geometric Approximations

MBRs vs. Convex Hulls

Ave. false area of approximation: MBRs 94.61%; Convex Hulls 26.73%

Page 14: Lecture 21: XML Retrieval


CA User Defined Areas (UDAs) in the Test Collection

Page 15: Lecture 21: XML Retrieval


Test Collection Query Regions: CA Counties

42 of 58 counties referenced in the test collection metadata

• 10 counties randomly selected as query regions to train LR model

• 32 counties used as query regions to test model

Page 16: Lecture 21: XML Retrieval


LR model

• X1 = area of overlap(query region, candidate GIO) / area of query region

• X2 = area of overlap(query region, candidate GIO) / area of candidate GIO

• Where: the range for all variables is 0 (not similar) to 1 (same)

Page 17: Lecture 21: XML Retrieval


Some of our Results

Mean Average Query Precision: the average of the precision values after each new relevant document is observed in a ranked list.

For metadata indexed by CA named place regions:

For all metadata in the test collection:

These results suggest:
• Convex hulls perform better than MBRs – an expected result, given that the convex hull is a higher-quality approximation

• A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls

• This is interesting: since any approximation other than the MBR comes at considerable expense, it suggests that exploring new ranking methods based on the MBR is a good way to go

Page 18: Lecture 21: XML Retrieval


Some of our Results

Mean Average Query Precision: the average of the precision values after each new relevant document is observed in a ranked list.

For metadata indexed by CA named place regions:

For all metadata in the test collection:

BUT:

The inclusion of UDA indexed metadata reduces precision.

This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa

Page 19: Lecture 21: XML Retrieval


Shorefactor Model

• X1 = area of overlap(query region, candidate GIO) / area of query region

• X2 = area of overlap(query region, candidate GIO) / area of candidate GIO

• X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore)

– Where: Range for all variables is 0 (not similar) to 1 (same)

Page 20: Lecture 21: XML Retrieval


Some of our Results, with Shorefactor

These results suggest:

• Adding the Shorefactor variable improves the model (LR 2), especially for MBRs

• The improvement is not as dramatic for convex hull approximations, because the problem that Shorefactor addresses is not that significant when areas are represented by convex hulls

For all metadata in the test collection. Mean Average Query Precision: the average of the precision values after each new relevant document is observed in a ranked list.

Page 21: Lecture 21: XML Retrieval


Results for All Data – MBRs

[Figure: precision-recall curves comparing the Hill, Walker, Beard, LR 1, and LR 2 ranking methods; precision (0.7–1.0) plotted against recall (0–1).]

Page 22: Lecture 21: XML Retrieval


Results for All Data - Convex Hull

[Figure: precision-recall curves comparing the Hill, Walker, Beard, LR 1, and LR 2 ranking methods; precision (0.7–1.0) plotted against recall (0–1).]

Page 23: Lecture 21: XML Retrieval


XML Retrieval

• The following slides are adapted from presentations at INEX 2003-2005 and at the INEX Element Retrieval Workshop in Glasgow 2005, with some new additions for general context, etc.

Page 24: Lecture 21: XML Retrieval


INEX Organization

• Organized by:
– University of Duisburg-Essen, Germany (Norbert Fuhr, Saadia Malik, and others)
– Queen Mary University of London, UK (Mounia Lalmas, Gabriella Kazai, and others)

• Supported by:
– DELOS Network of Excellence in Digital Libraries (EU)
– IEEE Computer Society
– University of Duisburg-Essen

Page 25: Lecture 21: XML Retrieval


XML Retrieval Issues

• Using Structure?

• Specification of Queries

• How to evaluate?

Page 26: Lecture 21: XML Retrieval


Cheshire SGML/XML Support

• Underlying native format for all data is SGML or XML

• The DTD defines the database contents

• Full SGML/XML parsing

• SGML/XML Format Configuration Files define the database location and indexes

• Various format conversions and utilities available for Z39.50 support (MARC, GRS-1)

Page 27: Lecture 21: XML Retrieval


SGML/XML Support

• Configuration files for the server are SGML/XML:
– They include elements describing all of the data files and indexes for the database.
– They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

Page 28: Lecture 21: XML Retrieval


Indexing

• Any SGML/XML tagged field or attribute can be indexed:
– B-Tree and Hash access via Berkeley DB (Sleepycat)
– Stemming, keyword, exact keys, and "special keys"
– Mapping from any Z39.50 attribute combination to a specific index
– Underlying postings information includes term frequency for probabilistic searching

• Component extraction with separate component indexes

Page 29: Lecture 21: XML Retrieval


XML Element Extraction

• A new search "ElementSetName", XML_ELEMENT_, has been added

• Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request

• The matching elements are extracted from the records matching the search and delivered in a simple format.

Page 30: Lecture 21: XML Retrieval


XML Extraction

% zselect sherlock
{Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}}
{<RESULT_DATA DOCID="1"><ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"><Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245></ITEM></RESULT_DATA>} … etc. …

Page 31: Lecture 21: XML Retrieval


TREC3 Logistic Regression

The probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from:

log O(R|Q,C) = b0 + Σ_{i=1..6} b_i·X_i

P(R|Q,C) = e^{log O(R|Q,C)} / (1 + e^{log O(R|Q,C)})

for the six X attribute measures shown on the next slide.

Page 32: Lecture 21: XML Retrieval


TREC3 Logistic Regression

log O(R|C,Q) = b0
  + b1 · (1/|Qc|) Σ_{j=1..|Qc|} log qtf_j                (Average Absolute Query Frequency)
  + b2 · √|Q|                                            (Query Length)
  + b3 · (1/|Qc|) Σ_{j=1..|Qc|} log tf_j                 (Average Absolute Component Frequency)
  + b4 · √cl                                             (Document Length)
  + b5 · (1/|Qc|) Σ_{j=1..|Qc|} log((N − n_tj)/n_tj)     (Average Inverse Component Frequency)
  + b6 · log |Qc|                                        (Number of Terms in both query and Component)

where Qc is the set of terms the query and the component have in common, qtf_j and tf_j are the query and component frequencies of term j, N is the number of components, and n_tj is the number of components containing term j.
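A minimal sketch of this scoring function in Python (an illustration of the formula above, not the actual Cheshire code); the coefficient vector b would come from a table like the one on the "New LR Coefficients" slide, and the probability is recovered from the log odds as on the previous slide:

import math

def trec3_log_odds(b, qtf, tf, n_t, N, ql, cl):
    """TREC3-style log odds for one query/component pair.

    b   : coefficients [b0, ..., b6]
    qtf : query frequencies of the matching terms
    tf  : component frequencies of the same terms
    n_t : number of components containing each matching term (each < N)
    N   : number of components in the collection
    ql  : query length; cl : component length
    """
    M = len(qtf)  # terms in common between query and component
    if M == 0:
        return b[0]
    x1 = sum(math.log(q) for q in qtf) / M            # avg absolute query frequency
    x2 = math.sqrt(ql)                                # query length
    x3 = sum(math.log(t) for t in tf) / M             # avg absolute component frequency
    x4 = math.sqrt(cl)                                # component (document) length
    x5 = sum(math.log((N - n) / n) for n in n_t) / M  # avg inverse component frequency
    x6 = math.log(M)                                  # matching terms, logged
    return b[0] + b[1]*x1 + b[2]*x2 + b[3]*x3 + b[4]*x4 + b[5]*x5 + b[6]*x6

def probability_of_relevance(log_odds):
    """P(R|Q,C) = e^logO / (1 + e^logO)."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))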

Page 33: Lecture 21: XML Retrieval


Okapi BM25

The document score is the sum, over the query terms, of:

Σ_{T∈Q} w(1) · ((k1 + 1)·tf / (K + tf)) · ((k3 + 1)·qtf / (k3 + qtf))

• Where:
• Q is a query containing terms T
• K is k1·((1 − b) + b·dl/avdl)
• k1, b and k3 are parameters, usually 1.2, 0.75 and 7–1000
• tf is the frequency of the term in a specific document
• qtf is the frequency of the term in a topic from which Q was derived
• dl and avdl are the document length and the average document length, measured in some convenient unit
• w(1) is the Robertson-Sparck Jones weight:

w(1) = log [ ((r + 0.5)/(R − r + 0.5)) / ((n − r + 0.5)/(N − n − R + r + 0.5)) ]

(N documents, n containing the term; R known relevant documents, r of them containing the term)
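For concreteness, a small sketch of the per-term BM25 contribution (standard Okapi, written out from the formula above; not tied to any particular system):

import math

def rsj_weight(r, R, n, N):
    """Robertson-Sparck Jones weight w(1), with the usual 0.5 smoothing:
    N documents, n containing the term; R relevant docs, r of them with the term."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def bm25_term_score(w1, tf, qtf, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """Contribution of one query term T; the document score is the sum over T in Q."""
    K = k1 * ((1 - b) + b * dl / avdl)
    return w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))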

Page 34: Lecture 21: XML Retrieval


Combining Boolean and Probabilistic Search Elements

• Two original approaches:
– Boolean approach
– Non-probabilistic "Fusion Search": a set-merger approach that is a weighted merger of document scores from separate Boolean and probabilistic queries

P(R|Q,D) = P(R|Q_bool, D) · P(R|Q_prob, D)

P(R|Q_bool, D) = 1 if the Boolean evaluation is successful for D, 0 otherwise
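In effect, the Boolean query acts as a 0/1 filter on the probabilistic score, as in this one-line sketch:

def boolean_prob_score(passes_boolean, p_prob):
    """P(R|Q,D) = P(R|Qbool,D) * P(R|Qprob,D), with the Boolean factor 1 or 0."""
    return p_prob if passes_boolean else 0.0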

Page 35: Lecture 21: XML Retrieval


INEX '04 Fusion Search

• Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets
– Major components merged are Articles, Body, Sections, subsections, and paragraphs

[Diagram: several subqueries each produce component query results, which are fused/merged into a final ranked list]

Page 36: Lecture 21: XML Retrieval


Merging and Ranking Operators

• Extends the capabilities of merging to include merger operations in queries, like Boolean operators (see the sketch after this list)

• Fuzzy logic operators (not used for INEX):
– !FUZZY_AND
– !FUZZY_OR
– !FUZZY_NOT

• Containment operators: restrict components to those from or with a particular parent:
– !RESTRICT_FROM
– !RESTRICT_TO

• Merge operators:
– !MERGE_SUM
– !MERGE_MEAN
– !MERGE_NORM
– !MERGE_CMBZ
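One plausible reading of the merge operators over {docid: score} result sets is sketched below (illustrative only; the exact Cheshire semantics may differ). MERGE_CMBZ follows the CombMNZ idea of boosting documents found by several subqueries:

def _normalize(results):
    """Scale scores into [0, 1] so differently scaled result sets can be merged."""
    if not results:
        return {}
    hi, lo = max(results.values()), min(results.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in results.items()}

def merge_sum(a, b):
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in a.keys() | b.keys()}

def merge_mean(a, b):
    return {d: (a.get(d, 0.0) + b.get(d, 0.0)) / 2.0 for d in a.keys() | b.keys()}

def merge_norm(a, b):
    """Normalized score summation."""
    return merge_sum(_normalize(a), _normalize(b))

def merge_cmbz(a, b):
    """Normalized summation, boosted by how many result sets found the doc."""
    na, nb = _normalize(a), _normalize(b)
    return {d: (na.get(d, 0.0) + nb.get(d, 0.0)) * ((d in na) + (d in nb))
            for d in na.keys() | nb.keys()}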

Page 37: Lecture 21: XML Retrieval


New LR Coefficients

Index        b0       b1      b2       b3      b4       b5      b6
Base         -3.700   1.269   -0.310   0.679   -0.021   0.223   4.010
topic        -7.758   5.670   -3.427   1.787   -0.030   1.952   5.880
topicshort   -6.364   2.739   -1.443   1.228   -0.020   1.280   3.837
abstract     -5.892   2.318   -1.364   0.860   -0.013   1.052   3.600
alltitles    -5.243   2.319   -1.361   1.415   -0.037   1.180   3.696
sec words    -6.392   2.125   -1.648   1.106   -0.075   1.174   3.632
para words   -8.632   1.258   -1.654   1.485   -0.084   1.143   4.004

Estimates using INEX '03 relevance assessments, where:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component

Page 38: Lecture 21: XML Retrieval


INEX CO Runs

• Three official runs and one later run, all title-only:
– Fusion: combines Okapi and LR using the MERGE_CMBZ operator
– NewParms (LR): using only LR with the new parameters
– Feedback: an attempt at blind relevance feedback
– PostFusion: fusion of the new LR coefficients and Okapi

Page 39: Lecture 21: XML Retrieval


Query Generation - CO

• #162 TITLE = Text and Index Compression Algorithms

• QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms})

• @+ is Okapi, @ is LR
• !MERGE_CMBZ is a normalized score summation and enhancement

Page 40: Lecture 21: XML Retrieval


INEX CO Runs

Avg Prec      Generalized   Strict
FUSION        0.0642        0.0923
NEWPARMS      0.0582        0.0853
FDBK          0.0415        0.0390
POSTFUS       0.0690        0.0952

Page 41: Lecture 21: XML Retrieval


INEX VCAS Runs

• Two official runs:
– FUSVCAS: element fusion using LR and various operators for path restriction
– NEWVCAS: using the new LR coefficients for each appropriate index and various operators for path restriction

Page 42: Lecture 21: XML Retrieval


Query Generation - VCAS

• #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]

• Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles}))

• Target elements: sec|ss1|ss2|ss3

Page 43: Lecture 21: XML Retrieval


VCAS Results

Avg Prec      Generalized   Strict
FUSVCAS       0.0321        0.0601
NEWVCAS       0.0270        0.0569

Page 44: Lecture 21: XML Retrieval


Heterogeneous Track

• Approach using Cheshire's Virtual Database options:
– Primarily a version of distributed IR
– Each collection indexed separately
– Search via Z39.50 distributed queries
– Z39.50 attribute mapping used to map query indexes to appropriate elements in a given collection
– Only LR used; collection results merged using the probability of relevance for each collection result

Page 45: Lecture 21: XML Retrieval


INEX 2005 Approach

• Used only Logistic regression methods

• “TREC3” with Pivot

• “TREC2” with Pivot

• “TREC2” with Blind Feedback

• Used post-processing for specific tasks

Page 46: Lecture 21: XML Retrieval


Logistic Regression

The probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from:

log O(R|Q,C) = b0 + Σ_{i=1..m} b_i·X_i

P(R|Q,C) = e^{log O(R|Q,C)} / (1 + e^{log O(R|Q,C)})

for some set of m statistical measures, X_i, derived from the collection and query.

Page 47: Lecture 21: XML Retrieval


TREC2 Algorithm

log O(R|C,Q) = c0
  + c1 · (1/√(|Qc| + 1)) Σ_{i=1..|Qc|} qtf_i/(ql + 35)        (term frequency in the Query)
  + c2 · (1/√(|Qc| + 1)) Σ_{i=1..|Qc|} log(tf_i/(cl + 80))    (term frequency in the Document)
  + c3 · (1/√(|Qc| + 1)) Σ_{i=1..|Qc|} log(ctf_i/N_t)         (term frequency in the Collection)
  + c4 · |Qc|                                                 (Matching Terms)

where Qc is the set of terms the query and the component have in common, ql and cl are the query and component lengths, ctf_i is the collection frequency of term i, and N_t is the collection size in term occurrences.
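A sketch of the TREC2 scoring in Python (reconstructed from the slide's layout, so the exact normalizations, in particular the 1/√(|Qc|+1) factor, are assumptions):

import math

def trec2_log_odds(c, qtf, tf, ctf, ql, cl, N_t):
    """TREC2-style log odds for one query/component pair.

    c            : coefficients [c0, ..., c4]
    qtf, tf, ctf : query, document/component, and collection frequencies
                   of the matching terms
    ql, cl       : query and component lengths
    N_t          : total number of term occurrences in the collection
    """
    M = len(qtf)  # |Qc|: terms in common between query and component
    if M == 0:
        return c[0]
    norm = 1.0 / math.sqrt(M + 1)
    x1 = norm * sum(q / (ql + 35.0) for q in qtf)           # query term frequency
    x2 = norm * sum(math.log(t / (cl + 80.0)) for t in tf)  # document term frequency
    x3 = norm * sum(math.log(cf / N_t) for cf in ctf)       # collection term frequency
    return c[0] + c[1]*x1 + c[2]*x2 + c[3]*x3 + c[4]*M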

Page 48: Lecture 21: XML Retrieval


Blind Feedback

• Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model. For each term t:

                        Relevant     Not relevant          Total
  Term in document      Rt           Nt − Rt               Nt
  Term not in document  R − Rt       (N − Nt) − (R − Rt)   N − Nt
  Total                 R            N − R                 N

Page 49: Lecture 21: XML Retrieval


Blind Feedback

• Top x new terms taken from top y documents
– For each term in the top y assumed-relevant set, compute:

termwt = log [ (Rt · (N − Nt − R + Rt)) / ((R − Rt) · (Nt − Rt)) ]

– Terms are ranked by termwt and the top x are selected for inclusion in the query
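A small sketch of the selection step (the formula is the slide's unsmoothed termwt; real implementations usually add 0.5 to each count to avoid zero divisions):

import math

def termwt(Rt, R, Nt, N):
    """Robertson/Sparck Jones term weight; assumes Rt < R and Rt < Nt."""
    return math.log((Rt * (N - Nt - R + Rt)) / ((R - Rt) * (Nt - Rt)))

def select_feedback_terms(term_stats, R, N, x):
    """term_stats: {term: (Rt, Nt)} gathered from the top y assumed-relevant docs.
    Rank by termwt and keep the top x terms for query expansion."""
    ranked = sorted(term_stats.items(),
                    key=lambda kv: termwt(kv[1][0], R, kv[1][1], N),
                    reverse=True)
    return [t for t, _ in ranked[:x]]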

Page 50: Lecture 21: XML Retrieval


Pivot method

• Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod)

• Used 0.50 as pivot for all cases

• For TREC3 and TREC2 runs all component results weighted by article-level results for the matching article

P(R|Q,C_new) = X · P(R|Q,C_comp) + (1 − X) · P(R|Q,C_article)

where X is the "pivot value", with 0 ≤ X ≤ 1.
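As a one-function sketch of the combination (X = 0.5 for all the runs described above):

def pivot_combine(p_comp, p_article, X=0.5):
    """P(R|Q,Cnew) = X * P(R|Q,Ccomp) + (1 - X) * P(R|Q,Carticle), 0 <= X <= 1."""
    return X * p_comp + (1.0 - X) * p_article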

Page 51: Lecture 21: XML Retrieval


Adhoc Component Fusion Search

• Merge multiple ranked component types
– Major components merged are Article Body, Sections, paragraphs, and figures

[Diagram: several subqueries each produce component query results, which are fused/merged into a raw ranked list]

Page 52: Lecture 21: XML Retrieval


TREC3 Logistic Regression

The probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from the log odds:

log O(R|Q,D) = b0 + Σ_{i=1..n} b_i·X_i

Page 53: Lecture 21: XML Retrieval


TREC3 Logistic Regression attributes

The six attribute measures, computed over the M terms that a query and a component have in common (N is the number of components in the collection; n_t is the number of components containing term t):

X1 = (1/M) Σ_t log QAF_t      (Average Absolute Query Frequency)
X2 = √QL                      (Query Length)
X3 = (1/M) Σ_t log DAF_t      (Average Absolute Component Frequency)
X4 = √DL                      (Document Length)
X5 = (1/M) Σ_t IDF_t          (Average Inverse Component Frequency)
     where IDF_t = log((N − n_t)/n_t)   (Inverse Component Frequency)
X6 = log M                    (Number of Terms in common between query and Component, logged)

Page 54: Lecture 21: XML Retrieval


TREC3 LR Coefficients

Index        b0       b1      b2       b3      b4       b5      b6
Base         -3.700   1.269   -0.310   0.679   -0.021   0.223   4.010
topic        -7.758   5.670   -3.427   1.787   -0.030   1.952   5.880
topicshort   -6.364   2.739   -1.443   1.228   -0.020   1.280   3.837
abstract     -5.892   2.318   -1.364   0.860   -0.013   1.052   3.600
alltitles    -5.243   2.319   -1.361   1.415   -0.037   1.180   3.696
sec words    -6.392   2.125   -1.648   1.106   -0.075   1.174   3.632
para words   -8.632   1.258   -1.654   1.485   -0.084   1.143   4.004

Estimates using INEX '03 relevance assessments, where:
b1 = Average Absolute Query Frequency
b2 = Query Length
b3 = Average Absolute Component Frequency
b4 = Document Length
b5 = Average Inverse Component Frequency
b6 = Number of Terms in common between query and Component

Page 55: Lecture 21: XML Retrieval


CO.Focused

• Generalized & Strict

Page 56: Lecture 21: XML Retrieval


COS.Focused

• Generalized & Strict

Page 57: Lecture 21: XML Retrieval


CO.Thorough

• Generalized & Strict

Page 58: Lecture 21: XML Retrieval


COS.Thorough

• Generalized & Strict

Page 59: Lecture 21: XML Retrieval


CAS

• Generalized & Strict

Page 60: Lecture 21: XML Retrieval


Heterogeneous Element Retrieval Overview

• The Problem

• Issues with Element Retrieval and Heterogeneous Retrieval

• Possible approaches:
– XPointer
– Generic metadata systems (e.g., Dublin Core)
– Other metadata systems

Page 61: Lecture 21: XML Retrieval


The Problem

• The Adhoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles)

• In “real-world” environments, XML retrieval must deal with different DTDs, different genres of data and widely varying topical content

Page 62: Lecture 21: XML Retrieval


The Heterogeneous Track

• Research Questions (2004):
– For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
– What methods can be used to map structural criteria onto other DTDs?
– Should mappings focus on element names only, or also deal with element content or semantics?
– What are appropriate evaluation criteria for heterogeneous collections?

Page 63: Lecture 21: XML Retrieval


INEX 2004 Het Collection Tags

Collection     Author tag           Title tag          Abstract tag
INEX (IEEE)    fm/au                fm/tig/atl         fm/abs
Berkeley       Fld100, Fld700       Fld245             Fld500 (rarely)
compuscience   author               title              abstract
bibdbpub       author, altauthor    title              abstract
dblp           author, editor       title, booktitle   None
hcibib         author               title              abstract
qmulcspub      AUTHOR, EDITOR       TITLE              ABSTRACT
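Read operationally, the table is a per-collection mapping from abstract access points to concrete element names; a sketch (the tag names come from the table, the mapping structure is illustrative):

ELEMENT_MAP = {
    "inex-ieee":    {"author": ["fm/au"], "title": ["fm/tig/atl"], "abstract": ["fm/abs"]},
    "berkeley":     {"author": ["Fld100", "Fld700"], "title": ["Fld245"],
                     "abstract": ["Fld500"]},   # Fld500 holds an abstract only rarely
    "compuscience": {"author": ["author"], "title": ["title"], "abstract": ["abstract"]},
    "bibdbpub":     {"author": ["author", "altauthor"], "title": ["title"],
                     "abstract": ["abstract"]},
    "dblp":         {"author": ["author", "editor"], "title": ["title", "booktitle"],
                     "abstract": []},           # dblp has no abstract element
    "hcibib":       {"author": ["author"], "title": ["title"], "abstract": ["abstract"]},
    "qmulcspub":    {"author": ["AUTHOR", "EDITOR"], "title": ["TITLE"],
                     "abstract": ["ABSTRACT"]},
}

def tags_for(collection, access_point):
    """Resolve an abstract access point (e.g. 'title') to one collection's tags."""
    return ELEMENT_MAP[collection].get(access_point, [])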

Page 64: Lecture 21: XML Retrieval


Issues with Element Retrieval for Heterogeneous Retrieval

• Conceptual issues (user view):
– Actually specifying structural elements for retrieval requires that the user know the structure of the items to be retrieved
– As the number of DTDs or schemas increases, this task becomes more complex, both for specification and for understanding
– For "real world" XML retrieval, effectively specifying structure requires omniscience on the part of the user
– The collection itself must be specified in some way (can the user know all of the collections?)
– Users of INEX can't produce correct specifications for even one DTD…

Page 65: Lecture 21: XML Retrieval


Issues with Element Retrieval for Heterogeneous Retrieval

• Practical issues (programmer's view):
– Most of the same problems as the user view
– As seen in earlier papers today, the system must provide an interface that the user can understand, but that maps to the complexities of the DTD(s)
– Once again, as the number of DTDs or schemas increases, specifying the mappings becomes increasingly complex
– For "real world" XML retrieval, effectively specifying structure requires omniscience on the part of the programmer, to provide exhaustive mappings of the document elements to be retrieved

• As Roelof noted earlier today, this can rapidly become a system with too many options for a user to understand or use

Page 66: Lecture 21: XML Retrieval


Postulate of Impotence

• In summation, we might suggest another "Postulate of Impotence" like those suggested by Swanson:
– You can have either heterogeneous retrieval or precise element specifications in queries, but you cannot have both simultaneously

Page 67: Lecture 21: XML Retrieval


Possible Approaches

• Generalized structure
– Parent/child, as in XPath/XPointer
– What about flat structures (like most collections in the Het track)?

• Abstract query elements
– Use semantic representations in queries rather than structural representations (e.g., "Title" instead of //fm/tig/atl)
– What semantic representations can/should be used?

Page 68: Lecture 21: XML Retrieval


XPointer

• Can specify collection-level identification
– Basically a URN attached to an XPath

• Can also specify various string-matching constraints on XPath

• Might be useful in INEX Het Track for specifying relevance judgements

• But, it doesn’t address (or worsens) the larger problem of dealing with large numbers of heterogeneous structures

Page 69: Lecture 21: XML Retrieval


Abstract Data Elements

• The idea is to remove the requirement of precise and explicit specification of structural elements, replacing it with abstract and implied specifications

• Used in other heterogeneous retrieval systems:
– Z39.50/SRW (attribute sets and element sets)
– Dublin Core (limited set of elements for search or retrieval)

Page 70: Lecture 21: XML Retrieval


Dublin Core

• Simple metadata for describing internet resources

• For “Document-Like Objects”

• 15 Elements (in base DC)

Page 71: Lecture 21: XML Retrieval


Dublin Core Elements

• Title

• Creator

• Subject

• Description

• Publisher

• Other Contributors

• Date

• Resource Type

• Format

• Resource Identifier

• Source

• Language

• Relation

• Coverage

• Rights Management

Page 72: Lecture 21: XML Retrieval


Issues in Dublin Core

• Lack of guidance on what to put into each element

• How to structure or organize at the element level?

• How to ensure consistency across descriptions for the same persons, places, things, etc.?