RMIT University at INEX 2004 Heterogeneous Track Experiments


Jovan Pehcevski

Email: [email protected]

School of Computer Science and Information Technology, RMIT University,

Melbourne, Australia.

Overview

Research questions
Collection statistics
Topics
Retrieval systems
  Zettair (using two similarity measures)
  Hybrid (Zettair with eXist, using two retrieval heuristics)
Runs: all automatic, title-only runs
  #1: Zettair (Okapi BM25)
  #2: Zettair (Pivoted Cosine)
  #3: Hybrid (MpE heuristic)
  #4: Hybrid (PME heuristic)
Results
  Efficiency
  Effectiveness (for the IEEE collection)

Final thoughts

Research Questions

The goal of the Heterogeneous track at INEX 2004 is to set up a test collection (a heterogeneous XML document collection, suitable retrieval topics, and relevance assessments that correspond to these topics) and to explore new retrieval challenges

Our group at RMIT focuses on answering the following questions:

For CO queries, what methods are feasible for determining elements that would be reasonable answers?

Should the data be organised (and indexed) as a single heterogeneous collection, or is it better to treat this collection as a set of homogeneous sub-collections?

Methods that can be used to map structural criteria from one DTD to another are NOT considered in this work

Heterogeneous collection

The heterogeneous XML collection at INEX 2004 consists of the following sub-collections:

QMULDCSDBPub - Publications database of the QMUL Department of Computer Science
BibDBPub - BibTeX converted to XML by the IS group at the University of Duisburg-Essen
HCIBIB - Human-Computer Interaction Resources, bibliography from www.hcibib.org
Berkeley - library catalog records of books in the area of computer and information science from Berkeley
DBLP - from the Digital Bibliography & Library Project in Trier
CompuScience - from the Computer Science database of FIZ Karlsruhe
IEEE - IEEE Computer Society publications from 1995 to 2002

Collection statistics

We analyse and pre-process each sub-collection to determine the concept of a Document

Collection      Size (MB)  Number of Documents  Document (tag name)
QMULDCSDBPub    1.2        2024                 DOCUMENT
BibDBPub        2.4        3465                 entry
HCIBIB          32.4       26399                entry
Berkeley        34.4       12800                USMARC
DBLP            239.3      501102               article | book | phdthesis | mastersthesis | proceedings | inproceedings | incollection | www
CompuScience    338.6      250987               article | book | inbook | dissertation | proceedings | inproceedings | incollection | techreport | misc
IEEE            494.5      12107                article
Het Collection  1142.8     808884               (all the distinct tags above)

Topics

Four types of retrieval topics are considered for the Heterogeneous track at INEX 2004:

CO (Content-Only) - plain queries, with no structural constraints or target elements (10 topics)
  Example: XML information retrieval

BCAS (Basic Content-And-Structure) - queries using single structural and content-based constraints to enable synonym matches (1 topic)
  Example: //article[about(., XML information retrieval)]

CCAS (Complex Content-And-Structure) - queries using complex structural and content-based constraints to enable a wide range of path transformations and partial mappings (13 topics)
  Example: //article[about(.//sec, XML information retrieval)]

ECCAS (Extended Complex Content-And-Structure) - queries that attach a probability (likelihood) to each structural constraint (0 topics)
  Example: //article(0.8)[about(.//sec(0.5), XML information retrieval)]

CCAS Topic Example

<inex_topic topic_id="3" query_type="CCAS">
<title> //article[about(.//abs, Web usage mining) or about(.//sec, "Web mining" traversal navigation patterns)]</title>
<content_description> We are looking for documents that describe capturing and mining Web usage, in particular the traversal and navigation patterns; motivations include Web site redesign and maintenance.</content_description>
<structure_description> Article is a tag identifying a document, which can also be represented as a book tag, an inproceedings (or incollection) tag, an entry tag, etc. Abs is a tag identifying abstract of a document, which can be represented as an abstract tag, an abs tag, etc. Sec is a tag identifying an informative document component, such as section or paragraph. It can also be represented as sec, ss1, ss2, p, ip1 or other similar tags. </structure_description>
<narrative> To be relevant, a document must describe methods for capturing and analysing web usage, in particular traversal and navigation patterns. The motivation is using Web usage mining for site reconfiguration and maintenance, as well as providing recommendations to the user. Methods that are not explicitly applied to the Web but could apply are still relevant. Capturing browsing actions for pre-fetching is not relevant.</narrative>
<keywords> Web usage mining, Web log analysis, browsing pattern, navigation pattern, traversal pattern, Web statistics, Web design, Web maintenance, user recommendations </keywords>
</inex_topic>

Retrieval Systems

Our runs use two systems

Zettair - a compact and fast full-text search engine
Hybrid - a modular system using the best retrieval features from Zettair and eXist (a native XML database), and a top-up module to identify the appropriate units of retrieval

Unconstrained, plain text queries are used by each retrieval system. For each topic, the structural constraints and the target element are removed. Terms from the <title> are used to formulate the queries
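
One way to strip the structural constraints and target element from a topic title can be sketched as follows. The slides do not say how the titles were actually processed, so the helper `title_to_query` and its regular expression are hypothetical; the sketch simply keeps the text of each about() clause and drops phrase quotes.

```python
import re

# Hypothetical helper: captures the content part of each about(path, content) clause.
ABOUT_RE = re.compile(r'about\(\s*[^,]*,\s*([^)]*)\)')

def title_to_query(title):
    """Return the plain-text terms of a CO or CAS topic title."""
    title = title.strip()
    clauses = ABOUT_RE.findall(title)
    # A CO title has no about() clauses, so it is already a plain query.
    text = " ".join(clauses) if clauses else title
    # Drop phrase quotes and collapse runs of whitespace.
    text = text.replace('"', " ")
    return " ".join(text.split())

topic_3 = ('//article[about(.//abs, Web usage mining) or '
           'about(.//sec, "Web mining" traversal navigation patterns)]')
print(title_to_query(topic_3))
```

For the CCAS topic shown earlier this yields the terms of both about() clauses as one unconstrained query; repeated terms are kept as they appear.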

The systems use two different strategies to index the terms in the heterogeneous XML collection

Zettair

From zetta (10^21) and IR
A scalable, fast search engine server

Supports ranked, simple Boolean, and phrase queries
Indexes HTML, XML, plain text, and TREC-formatted documents
Usable as a C and Python library
Native support for TREC experiments (not yet for INEX)
Documented. Includes easy-to-follow examples

BSD license
Emphasis on simplicity and efficiency

One executable does everything
Under continued development

Ported to Mac OS X, FreeBSD, MS Windows, Linux, Solaris
Available from www.seg.rmit.edu.au/zettair

Zettair Indexing

With Zettair, the seven homogeneous XML collections are indexed as a single heterogeneous XML collection
Single-pass, sort-merge scheme
Document-ordered, word position inverted indexes
Efficient, variable-byte index compression
Indexed the HET collection (1.14 GB) in under 5 minutes on a single AUD$2000 Intel P4 machine

Throughput: 230 MB/minute
Fast configurable parser. Handles badly-formed HTML:

Validates each tag by matching < with > within a character
HTML comments are not indexed but are validated
Entity references translated
No support for internationalised text
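
The variable-byte index compression mentioned above can be illustrated with a minimal sketch. The byte layout below (7 data bits per byte, low-order group first, high bit set on the final byte) is one common convention and an assumption here; the slides do not describe Zettair's actual on-disk format. Document numbers are stored as gaps so that most values fit in a single byte.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first.
    The high bit marks the last byte of a value (illustrative convention)."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode_all(data):
    """Decode a byte string produced by concatenating vbyte_encode outputs."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # final byte of this value
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return values

def compress_postings(doc_ids):
    """Store a sorted list of document numbers as variable-byte coded gaps."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decompress_postings(data):
    """Rebuild the document numbers by summing the decoded gaps."""
    ids, total = [], 0
    for gap in vbyte_decode_all(data):
        total += gap
        ids.append(total)
    return ids

postings = [3, 130, 131, 20000]
packed = compress_postings(postings)
print(len(packed), decompress_postings(packed))
```

Small gaps take one byte each, which is why document-ordered lists compress well under this scheme.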

Zettair Querying

B-tree vocabulary bulk-loaded at index construction time

For a 1.14 GB collection, average query time is 10 milliseconds (without explicit caching or other optimisations)

Single-threaded, blocking I/O, and relatively unoptimised

Provides query-biased summaries of documents (see Tombros and Sanderson, “Advantages of query biased summaries in information retrieval”, SIGIR 1998)

Supports Pivoted Cosine and Okapi BM25 similarity measures
Working on further measures
Measures can be manipulated externally

Zettair Querying…

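The Pivoted Cosine similarity measure is:

$$\mathrm{sim}(Q, D) = \frac{1}{W_D \times W_Q} \times \sum_{t \in Q} \left(1 + \log_e f_{d,t}\right) \times \log_e\left(1 + \frac{N}{f_t}\right)$$

where:

$$W_D = (1.0 - s) + s \times \frac{W_d}{W_{AL}}$$

and:

$$W_Q = \sqrt{\sum_{t \in Q} \left[\log_e\left(1 + \frac{N}{f_t}\right)\right]^2}$$

$W_d$ = document length
$W_{AL}$ = average document length
$s$ = 0.25 (the slope)
$N$ = number of docs in collection
$f_t$ = collection frequency (# of docs that t occurs in)
$f_{d,t}$ = within-document frequency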
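
A minimal Python sketch of the Pivoted Cosine measure, using the symbols defined on this slide (document length W_d, average document length W_AL, slope s = 0.25, N documents, f_t and f_d,t). It is an illustration only, not Zettair's code; the function name and argument layout are hypothetical, and skipping query terms that do not occur in the document is an assumption made so that the logarithm of f_d,t is always defined.

```python
import math

def pivoted_cosine(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freq, s=0.25):
    """Score one document against a query.

    query_terms: iterable of query terms
    doc_tf:      dict term -> f_{d,t} (within-document frequency)
    doc_freq:    dict term -> f_t (number of documents containing the term)
    """
    # Pivoted document length normalisation: W_D = (1 - s) + s * W_d / W_AL
    w_doc = (1.0 - s) + s * doc_len / avg_doc_len
    # Per-term collection weight: log_e(1 + N / f_t)
    weights = {t: math.log(1.0 + n_docs / doc_freq[t])
               for t in set(query_terms) if doc_freq.get(t, 0) > 0}
    # Query normalisation: W_Q = sqrt(sum of squared term weights)
    w_query = math.sqrt(sum(w * w for w in weights.values()))
    if w_query == 0.0:
        return 0.0
    total = 0.0
    for term, weight in weights.items():
        f_dt = doc_tf.get(term, 0)
        if f_dt > 0:  # assumption: absent terms contribute nothing
            total += (1.0 + math.log(f_dt)) * weight
    return total / (w_doc * w_query)

average = pivoted_cosine(["xml"], {"xml": 1}, 100, 100, 1000, {"xml": 10})
longer = pivoted_cosine(["xml"], {"xml": 1}, 200, 100, 1000, {"xml": 10})
print(average, longer)
```

With the same term statistics, the document twice the average length scores lower, which is the effect of the slope s.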

Zettair Querying…

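The Okapi BM25 similarity measure is:

$$\mathrm{sim}(Q, D) = \sum_{t \in Q} w_t \times \frac{(k_1 + 1)\, f_{d,t}}{K + f_{d,t}} \times \frac{(k_3 + 1)\, f_{q,t}}{k_3 + f_{q,t}}$$

where:

$$w_t = \log_e\left(\frac{N - f_t + 0.5}{f_t + 0.5}\right)$$

and:

$$K = k_1 \times \left((1 - b) + b \times \frac{W_d}{W_{AL}}\right)$$

$W_d$ = document length
$W_{AL}$ = average document length
$k_1$ = 1.2
$k_3$ = 1000 (effectively infinite)
$b$ = 0.75
$N$ = number of docs in collection
$f_{q,t}$ = query-term frequency
$f_{d,t}$ = within-document frequency
$f_t$ = collection frequency (# of docs that t occurs in)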
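
A minimal Python sketch of the Okapi BM25 measure with the parameter values given on this slide (k1 = 1.2, k3 = 1000, b = 0.75). It is an illustration rather than Zettair's implementation; the function name and argument layout are hypothetical.

```python
import math

def bm25(query_tf, doc_tf, doc_len, avg_doc_len, n_docs, doc_freq,
         k1=1.2, k3=1000.0, b=0.75):
    """Score one document against a query.

    query_tf: dict term -> f_{q,t} (query-term frequency)
    doc_tf:   dict term -> f_{d,t} (within-document frequency)
    doc_freq: dict term -> f_t (number of documents containing the term)
    """
    # Length normalisation: K = k1 * ((1 - b) + b * W_d / W_AL)
    big_k = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, f_qt in query_tf.items():
        f_dt = doc_tf.get(term, 0)
        f_t = doc_freq.get(term, 0)
        if f_dt == 0 or f_t == 0:
            continue  # the term adds nothing when it is absent from the document
        w_t = math.log((n_docs - f_t + 0.5) / (f_t + 0.5))
        doc_part = (k1 + 1.0) * f_dt / (big_k + f_dt)
        query_part = (k3 + 1.0) * f_qt / (k3 + f_qt)
        score += w_t * doc_part * query_part
    return score

score = bm25({"xml": 1}, {"xml": 1}, 100, 100, 1000, {"xml": 10})
print(score)
```

For an average-length document and a term that occurs once in both the query and the document, the two frequency factors equal 1 and the score reduces to the term weight w_t.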

Hybrid

Utilising best features from Zettair and eXist

With eXist, the seven homogeneous XML collections are indexed separately, but queries can span across the XML collections

The Hybrid system uses a “fetch and browse” approach, where heterogeneous Documents are first retrieved and ranked by Zettair (the fetch phase), and the most specific elements from the highly ranked Documents are then extracted by eXist (the browse phase)

The system also uses a retrieval module that identifies and ranks Coherent Retrieval Elements (CREs) (more on next slides)

Coherent Retrieval Elements

Definition:

A Coherent Retrieval Element (CRE) is an element that contains at least two matching elements (extracted by eXist), or at least two other Coherent Retrieval Elements, or a combination of a matching element and a Coherent Retrieval Element.

In plain words:

The list of matching elements, extracted by eXist, is a document-ordered list (see Table 1 on the next slide). The list is processed by considering a pair of elements, starting from the first element down to the last. In each step, a CRE is identified as the most specific ancestor of the two matching elements that constitute this pair.
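
The pairwise step described above can be sketched in Python as follows. The example paths are made up for illustration, the function names are hypothetical, and two points are assumptions: pairs are formed from consecutive elements of the list, and the ancestor is taken to be a proper ancestor of both elements.

```python
def steps(xpath):
    """Split '/article[1]/bdy[1]' into ['article[1]', 'bdy[1]']."""
    return [step for step in xpath.split("/") if step]

def most_specific_ancestor(path_a, path_b):
    """Deepest element that is a proper ancestor of both paths, or None."""
    a, b = steps(path_a), steps(path_b)
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    # Keep the result a proper ancestor when one path contains the other.
    common = common[:min(len(a), len(b)) - 1]
    return "/" + "/".join(common) if common else None

def coherent_retrieval_elements(matching_paths):
    """Identify CREs for one document from its document-ordered matching elements."""
    cres = []
    for first, second in zip(matching_paths, matching_paths[1:]):
        cre = most_specific_ancestor(first, second)
        if cre is not None and cre not in cres:
            cres.append(cre)
    return cres

# Made-up matching elements for a single article, in document order.
matching = [
    "/article[1]/bdy[1]/sec[2]/ss1[1]/p[1]",
    "/article[1]/bdy[1]/sec[2]/ss1[1]/p[3]",
    "/article[1]/bdy[1]/sec[4]/p[2]",
    "/article[1]/bm[1]/app[1]/sec[2]/p[1]",
]
print(coherent_retrieval_elements(matching))
```

Each pair contributes the deepest element that contains both of its members, so paragraphs in the same subsection yield that subsection, while paragraphs in the body and the back matter yield the whole article.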

Matching Elements

Table 1. eXist list of matching elements

Matching versus CREs

Figure 1. Matching versus Coherent Retrieval Elements

Ranking the CREs

To determine the final ranks of CREs, the retrieval module uses a combination of the following heuristics:

The number of times a CRE appears in the absolute path of each extracted element in the eXist list of matching elements - more matches (M) or fewer matches (m)
The length of the absolute path of the CRE, taken from the root element - longer path (P) or shorter path (p)
The ordering of the XPath sequence in the absolute path of the CRE - nearer to beginning (B) or nearer to end (E)

For the INEX 2003 test set, MpE yields the best performance, although PME is more suitable for some metrics

Ranking the CREs…

Article         Answer element                    Matches  Length  Sequence
ic/1999/w4095   /article[1]                       12       1       1
ic/1999/w4095   /article[1]/bdy[1]                9        2       11
ic/1999/w4095   /article[1]/bdy[1]/sec[4]         4        3       114
ic/1999/w4095   /article[1]/bm[1]/app[1]          3        3       111
ic/1999/w4095   /article[1]/bdy[1]/sec[2]/ss1[1]  2        4       1121
ic/1999/w4095   /article[1]/bm[1]/app[1]/sec[2]   2        4       1112

Table 2. Ranked list of Coherent Retrieval elements (using the MpE heuristic)

Ranking the CREs…

Table 3. Ranked list of Coherent Retrieval elements (using the PME heuristic)

Article         Answer element                    Matches  Length  Sequence
ic/1999/w4095   /article[1]/bdy[1]/sec[2]/ss1[1]  2        4       1121
ic/1999/w4095   /article[1]/bm[1]/app[1]/sec[2]   2        4       1112
ic/1999/w4095   /article[1]/bm[1]/app[1]          3        3       111
ic/1999/w4095   /article[1]/bdy[1]                9        2       11
ic/1999/w4095   /article[1]                       12       1       1
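
The heuristic combinations can be read as multi-key sorts, with the first letter as the most significant criterion. The sketch below uses the answer elements and match counts listed in Table 2; path length and sequence are derived from each path. How the sequence values are compared is an assumption: treating E as "larger sequence first" reproduces the order of the tied elements in Tables 2 and 3. All names in the code are hypothetical.

```python
import re

# (answer element, number of matches), copied from Table 2.
CRES = [
    ("/article[1]", 12),
    ("/article[1]/bdy[1]", 9),
    ("/article[1]/bdy[1]/sec[4]", 4),
    ("/article[1]/bm[1]/app[1]", 3),
    ("/article[1]/bdy[1]/sec[2]/ss1[1]", 2),
    ("/article[1]/bm[1]/app[1]/sec[2]", 2),
]

def matches(cre):
    return cre[1]

def path_length(cre):
    # Number of steps in the absolute path.
    return len(re.findall(r"/[^/]+", cre[0]))

def sequence(cre):
    # Positional indices along the path, e.g. (1, 1, 2, 1) for "1121".
    return tuple(int(n) for n in re.findall(r"\[(\d+)\]", cre[0]))

# Heuristic letter -> (sort key, descending?). The E/B reading is an assumption.
CRITERIA = {
    "M": (matches, True), "m": (matches, False),
    "P": (path_length, True), "p": (path_length, False),
    "E": (sequence, True), "B": (sequence, False),
}

def rank_cres(cres, heuristic):
    """Rank CREs by a combination such as 'MpE' or 'PME'."""
    ranked = list(cres)
    # Stable sorts applied from the least to the most significant criterion.
    for letter in reversed(heuristic):
        key, descending = CRITERIA[letter]
        ranked.sort(key=key, reverse=descending)
    return ranked

for name in ("MpE", "PME"):
    print(name, [path for path, _ in rank_cres(CRES, name)])
```

Under MpE the whole article comes first because it has the most matches, whereas PME favours the longest paths and therefore ranks the most specific elements first.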

Runs

Four runs: automatic, title-only

Zettair_BM25, using Zettair with the Okapi BM25 similarity measure
Zettair_PCosine, using Zettair with the Pivoted Cosine similarity measure
Hybrid_MpE, using the hybrid system with the MpE heuristic combination
Hybrid_PME, using the hybrid system with the PME heuristic combination

The two hybrid runs use Zettair with the Pivoted Cosine similarity measure

We use each of the above runs in each topic category (except ECCAS), resulting in 12 runs in total*

* Our official INEX 2004 submission had 9 runs, since Hybrid_MpE was not initially considered

Efficiency Results

The following efficiency results apply for Zettair only

HET collection indexed on a single AUD$2000 Intel P4 machine
808884 documents, 1.14 GB of text
5 minutes to index, at 230 MB/minute
10 milliseconds per query to search (on average)

No stopping or stemming
Limited accumulators with “continue” strategy

Interesting statistics:
Full text index size, with full word positions, was 38.4% of the collection size (438.5 MB)
Distinct terms: 1.94 million
Term occurrences: 1.06 billion

Efficiency Results…

Detailed statistics (per collection):

Collection      Size (MB)  Index size (MB)  Index time (sec)  Distinct terms
QMULDCSDBPub    1.2        0.73             0.43              8816
BibDBPub        2.4        1.2              0.67              19200
HCIBIB          32.4       16.8             5.26              115321
Berkeley        34.4       9.2              4.49              126761
DBLP            239.3      128.1            65.98             1021698
CompuScience    338.6      128.8            89.95             389266
IEEE            494.5      162              102.63            694894
Het Collection  1142.8     438.5            282.12            1935988
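
The headline efficiency figures can be checked against the Het Collection row with a few lines of arithmetic. This is only a worked calculation on the reported numbers; the variable names are illustrative.

```python
# Het Collection row: collection size, index size and indexing time.
size_mb, index_mb, index_sec = 1142.8, 438.5, 282.12

# Index size as a fraction of the collection size.
index_ratio = index_mb / size_mb

# Indexing throughput in MB per minute, from the measured time in seconds.
throughput = size_mb / index_sec * 60

# Throughput if the indexing time is rounded up to a full 5 minutes.
throughput_5min = size_mb / 5

print(f"index/collection: {index_ratio:.1%}")
print(f"throughput: {throughput:.0f} MB/minute (measured), "
      f"{throughput_5min:.0f} MB/minute (at 5 minutes)")
```

The ratio matches the reported 38.4%. The measured time of 282.12 seconds gives a rate somewhat above 230 MB/minute, so the quoted throughput appears to correspond to the rounded 5-minute figure.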

Effectiveness Results

The following results consider the IEEE collection only

Run              CO Topics          CCAS Topics
                 MAP      P@10      MAP      P@10
Zettair_BM25     0.0123   0.0875    0.0771   0.1500
Zettair_PCosine  0.0122   0.0500    0.0887   0.1667
Hybrid_MpE       0.0420   0.0875    0.1251   0.1167
Hybrid_PME       0.0227   0.0625    0.0484   0.0500
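
For reference, the two measures in the table can be sketched in their standard textbook form. Binary relevance is assumed here; how the graded INEX assessments were mapped to relevant and non-relevant answers is not stated on these slides, so this is not necessarily the exact evaluation procedure behind the numbers above.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k answers that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant answer retrieved,
    divided by the total number of relevant answers."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked answers, set of relevant answers), one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up example: two of three relevant answers retrieved, at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
print(precision_at_k(ranked, relevant), average_precision(ranked, relevant))
```

P@10 only looks at the first ten answers, while MAP rewards placing every relevant answer high in the ranking, which is why the two measures can prefer different runs.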

Effectiveness Results…

Quantitative, rather than qualitative analysis for the IEEE collection (although we will perform a detailed qualitative, query-and-run oriented analysis once Het relevance assessments are ready)

With P@10 for the IEEE collection, the hybrid runs are (on average) NOT substantially better than the full text runs

CO topics
  Okapi better than Pivoted Cosine
  MpE heuristic better than PME heuristic
  Hybrid_MpE is best, although with P@10 Zettair_BM25 is competitive

CCAS topics
  Pivoted Cosine better than Okapi
  MpE heuristic (again) better than PME heuristic
  Hybrid_MpE is best (with MAP), but Zettair_PCosine is best (with P@10)

With P@10, for either CO or CCAS topic type, the best Zettair run is equal to or better than the best Hybrid run

Final Thoughts

Four very different runs, exploring different similarity measures and retrieval heuristics (Okapi BM25 versus Pivoted Cosine, MpE heuristic versus PME heuristic)

Surprises in the results
Plain full-text search engine very competitive
More evaluation and follow up after INEX 2004

Research questions

For CO queries, what methods are feasible for determining elements that would be reasonable answers?

The MpE heuristic in the CRE module appears to be a feasible method

Should the data be organised (and indexed) as a single heterogeneous collection, or is it better to treat this collection as a set of homogeneous sub-collections?

Indexing the data as a single heterogeneous collection appears to be both an efficient and an effective choice
