Searching in an XML Corpus Using Content and Structure. INEX 2003, Germany. Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza, Jonathan Mamou, Yehoshua Sagiv, Benjamin Sznajder, Efrat Twito. The Hebrew University of Jerusalem

Dec 21, 2015


Searching in an XML Corpus Using Content and Structure

INEX 2003, Germany

Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza, Jonathan Mamou, Yehoshua Sagiv, Benjamin Sznajder, Efrat Twito

The Hebrew University of Jerusalem


Approach

• IR techniques were extended to the context of an XML corpus
  – The granularity of retrieval is refined: fragments of a document (and not necessarily the whole document) are considered as potential results
  – The additional information provided by the structure of the document, and of the query, is exploited when retrieving results


Approach (cont’d)

• An extensible system was built
  – E.g., new ranking techniques can be added easily
• The system was implemented in a short time
  – E.g., topics are translated into XSL stylesheets
• Programming language: Java
• Operating system: Windows XP


Topic

• Only the title of the topic, denoted T, is used for retrieval
• We denote
  – T+ the list of terms in T that are preceded by a + sign
  – T- the list of terms that are preceded by a - sign
  – To the list of optional terms
• We have implemented our retrieval system only for CO and SCAS topics (not VCAS)
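
The split of a title into T+, T- and To can be illustrated with a minimal sketch. It assumes a title made of whitespace-separated terms and ignores quoted phrases and the structural part of CAS titles; the function name `split_title_terms` is hypothetical and not taken from the system.

```python
def split_title_terms(title):
    """Partition the terms of a topic title T into (T+, T-, To).

    Assumption: terms are separated by whitespace, and a leading '+' or '-'
    marks a required or unwanted term. Everything else is optional.
    """
    t_plus, t_minus, t_opt = [], [], []
    for token in title.split():
        if token.startswith("+") and len(token) > 1:
            t_plus.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            t_minus.append(token[1:])
        else:
            t_opt.append(token)
    return t_plus, t_minus, t_opt


t_plus, t_minus, t_opt = split_title_terms("+xml retrieval -database ranking")
print(t_plus, t_minus, t_opt)
```

In the real system the terms would also be stemmed and stripped of stopwords before use, as described in the preprocessing step.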

[Figure: system architecture diagram. Components shown: Topic, Topic Processor, Indexer, Indices, IEEE Digital Library, Filter, Relevant documents, Extractor, Relevant fragments, Rankers 1 to 5, Fragments augmented with ranking scores, Merger, Result.]


Preprocess

• XML documents and topics
  – Terms are stemmed (using the Porter stemmer)
  – Stopwords are eliminated
• Indices are built



Index

• Inverted Keyword Index
  – Associates each term with the list of documents (IDs) containing it
• Keyword-Distance Index
  – Stores information about the distance between two terms over all the sentences in all the documents of the corpus
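
As an illustration of the inverted keyword index, here is a minimal in-memory sketch. It assumes each document is already reduced to a list of preprocessed terms; the name `build_inverted_index` and the dictionary layout are assumptions, not the system's actual data structures.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it.

    docs: dict of document ID -> list of preprocessed terms (assumed input).
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


docs = {
    "d1": ["xml", "retriev", "rank"],
    "d2": ["xml", "index"],
    "d3": ["databas", "index"],
}
index = build_inverted_index(docs)
print(sorted(index["xml"]), sorted(index["index"]))
```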


Index

• Tag Index
  – Associates each tag with a weight, according to the “importance” of its content
  – E.g., the information provided by the front matter is more important than the information provided by a subsection
• Inverse Document Frequency Index
  – Associates each term with its IDF, classical in IR
  – IDF is derived from the inverse of the fraction of documents in the corpus containing the term, so rarer terms get a higher IDF
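
The slides do not give the exact IDF formula. The sketch below assumes the common variant idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t; the function name and input layout are illustrative only.

```python
import math


def inverse_document_frequency(docs):
    """Compute idf(t) = log(N / df(t)) for every term in the corpus.

    docs: dict of document ID -> list of preprocessed terms (assumed input).
    The log(N / df) variant is an assumption; the slides only say "classical".
    """
    n_docs = len(docs)
    doc_freq = {}
    for terms in docs.values():
        for term in set(terms):  # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


docs = {"d1": ["xml", "rank"], "d2": ["xml", "index"], "d3": ["xml"]}
idf = inverse_document_frequency(docs)
print(idf)
```

With this variant, a term that occurs in every document gets an IDF of 0, and the value grows as the term becomes rarer.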



Filter

• Documents not containing all the terms of T+ are considered irrelevant

• Documents containing all the terms of T+ are extracted from the corpus
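
One way to realize this filter is to intersect the posting sets of the T+ terms in the inverted keyword index. The sketch below is an assumption about how such a filter could look, not the system's code; in particular, letting every document pass when T+ is empty is a choice made here for illustration.

```python
def filter_documents(inverted_index, t_plus, all_doc_ids):
    """Return the IDs of documents that contain every term of T+.

    inverted_index: dict of term -> set of document IDs (assumed layout).
    If T+ is empty, every document passes (an assumption of this sketch).
    """
    candidates = set(all_doc_ids)
    for term in t_plus:
        candidates &= inverted_index.get(term, set())
    return candidates


inverted_index = {"xml": {"d1", "d2"}, "index": {"d2", "d3"}}
print(sorted(filter_documents(inverted_index, ["xml", "index"], ["d1", "d2", "d3"])))
```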



Relevant Fragments

• Relevant fragments are extracted from each document that passed the filtering
• Relevant fragments
  – CAS: determined by the topic title
  – CO: the system determines potentially relevant fragments
    • whole document
    • front matter
    • abstract
    • any section
    • any subsection
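
For CO topics, the list of candidate fragments can be sketched as follows. Note that the system itself extracts fragments with generated XSL stylesheets; this sketch swaps in Python's `xml.etree.ElementTree` purely for illustration. The tag names `fm`, `abs`, `sec` and `ss1` are assumptions about the corpus markup.

```python
import xml.etree.ElementTree as ET

# Assumed tag names for front matter, abstract, section and subsection.
CANDIDATE_TAGS = {"fm", "abs", "sec", "ss1"}


def candidate_fragments(xml_text):
    """Return the whole document plus every element with a candidate tag."""
    root = ET.fromstring(xml_text)
    fragments = [root]  # the whole document is always a candidate
    for element in root.iter():  # document order, starting at the root
        if element is not root and element.tag in CANDIDATE_TAGS:
            fragments.append(element)
    return fragments


sample = (
    "<article><fm><abs>a</abs></fm>"
    "<bdy><sec><ss1>x</ss1></sec><sec>y</sec></bdy></article>"
)
print([fragment.tag for fragment in candidate_fragments(sample)])
```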


Extracting Relevant Fragments from a Document

• An XPath processor is not suitable, since the syntax of CAS topics is more general than that of XPath
• The relevant fragments are extracted by means of an XSL stylesheet that is generated from T
  – For CAS topics, the stylesheet also checks that the returned fragments satisfy the predicates of the title
• The implementation of the translator of topics to XSL stylesheets is fast



An Overview of the Ranking Process

• n different rankers give scores based on the structure and the content of the fragments
  – In our implementation, 5 rankers
  – For some rankers, the weights of tags are incorporated into the score
  – Each ranker gives scores to all the fragments returned by the extractor
  – For each result, the scores of all the relevant fragments are aggregated


Word-Number Ranker

• This ranker counts the number of terms from T- and To appearing in the fragment
• The score
  – increases with the number of terms from To in the fragment
  – decreases with the number of terms from T- in the fragment
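
The slides do not state the exact formula of this ranker. A minimal sketch, assuming the simplest score with the described behavior (distinct To terms present minus distinct T- terms present), could look like this; the function name is hypothetical.

```python
def word_number_score(fragment_terms, t_minus, t_opt):
    """Count distinct To terms in the fragment minus distinct T- terms.

    The plain difference is an assumption of this sketch; the slides only
    say that the score rises with To hits and falls with T- hits.
    """
    present = set(fragment_terms)
    optional_hits = sum(1 for term in set(t_opt) if term in present)
    unwanted_hits = sum(1 for term in set(t_minus) if term in present)
    return optional_hits - unwanted_hits


print(word_number_score(["xml", "index", "rank"], ["rank"], ["xml", "index", "query"]))
```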


IDF Ranker

• We measure the “rarity” of a term using the classical formula of IDF
• The score
  – increases with the number of rare terms from To in the fragment
  – decreases with the number of rare terms from T- in the fragment
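
A hedged sketch of such a ranker: sum the IDF of the To terms found in the fragment and subtract the IDF of the T- terms found. The summation is an assumption chosen to match the behavior described above, and `idf` is assumed to be a precomputed term-to-IDF mapping.

```python
def idf_score(fragment_terms, t_minus, t_opt, idf):
    """Reward rare To terms and penalize rare T- terms found in a fragment.

    idf: dict of term -> IDF value (assumed precomputed by the indexer).
    Summing IDF values is an assumption, not the system's stated formula.
    """
    present = set(fragment_terms)
    gain = sum(idf.get(term, 0.0) for term in set(t_opt) if term in present)
    penalty = sum(idf.get(term, 0.0) for term in set(t_minus) if term in present)
    return gain - penalty


idf = {"xml": 0.5, "rank": 2.0, "sql": 1.0}
print(idf_score(["xml", "rank", "sql"], ["sql"], ["xml", "rank"], idf))
```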


TFIDF Ranker

• It is an extension of the Vector Space Model to XML documents
• TF counts the number of occurrences of a term in the fragment (and not the whole document)
  – Each occurrence is multiplied by the weight of its tag
• TFIDF = TF * IDF
• The score of a fragment is computed by
  – adding the TFIDF of terms from T+ and To
  – subtracting the TFIDF of terms from T-
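
The computation above can be sketched as follows. The fragment is assumed to be given as a list of (term, tag) occurrences, and `tag_weights` and `idf` are assumed lookups built by the indexer; the default weight of 1.0 for unknown tags and the tag names in the example are choices made for this illustration.

```python
def tfidf_score(occurrences, t_plus, t_minus, t_opt, idf, tag_weights):
    """Tag-weighted TF times IDF, added for T+ and To, subtracted for T-.

    occurrences: list of (term, tag) pairs found in the fragment (assumed).
    Tags missing from tag_weights get weight 1.0 (an assumption).
    """
    tf = {}
    for term, tag in occurrences:
        tf[term] = tf.get(term, 0.0) + tag_weights.get(tag, 1.0)

    def tfidf(term):
        return tf.get(term, 0.0) * idf.get(term, 0.0)

    positive = sum(tfidf(term) for term in set(t_plus) | set(t_opt))
    negative = sum(tfidf(term) for term in set(t_minus))
    return positive - negative


occurrences = [("xml", "atl"), ("xml", "p"), ("rank", "p"), ("sql", "p")]
score = tfidf_score(
    occurrences,
    t_plus=["xml"], t_minus=["sql"], t_opt=["rank"],
    idf={"xml": 1.0, "rank": 2.0, "sql": 0.5},
    tag_weights={"atl": 2.0, "p": 1.0},
)
print(score)
```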


Proximity Ranker

• This ranker is based on the correlation between pairs of words from T+ and To appearing in a single phrase in a sliding window containing 5 terms
• Such a pair is called a lexical affinity (LA)
• The score of a fragment is computed by counting the number of LAs
• The score is increased when an LA appears under “important” tags
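
Counting lexical affinities can be sketched as below. This version assumes the fragment is given as a list of sentences, each a list of terms, and counts every pair of distinct query terms whose positions fall within the same 5-term window of one sentence. The tag-based boost is left out, and the exact counting rule is an assumption.

```python
def count_lexical_affinities(sentences, query_terms, window=5):
    """Count pairs of distinct query terms co-occurring within `window` terms.

    sentences: list of lists of terms, so pairs never span two sentences.
    query_terms: the terms of T+ and To (assumed already preprocessed).
    """
    query = set(query_terms)
    count = 0
    for sentence in sentences:
        for i, first in enumerate(sentence):
            if first not in query:
                continue
            # Positions i+1 .. i+window-1 share a window of `window` terms with i.
            for j in range(i + 1, min(i + window, len(sentence))):
                second = sentence[j]
                if second in query and second != first:
                    count += 1
    return count


sentences = [["xml", "a", "b", "c", "d", "retrieval"], ["retrieval", "xml"]]
print(count_lexical_affinities(sentences, ["xml", "retrieval"]))
```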


Similarity Ranker

• Idea: If two terms appear frequently in the same sentence in the corpus, they should be considered as related
• It is a sort of blind query refinement
• The score of a fragment
  – is based on the distance between the terms of the query and the terms of the fragment
  – increases when the pair appears under “important” tags



Merger

• The scores of the various rankers are merged into a single rank
• The main problem is how to determine the relative weight of each ranker
• The scores of the 5 rankers are lexicographically sorted as follows
  – An order among the rankers is determined
  – A tuple of the 5 scores is created for each result
  – The tuples are lexicographically sorted
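
The lexicographic merge can be sketched with Python's tuple ordering. The ranker names, the dictionary layout, and the convention that a higher score is better are assumptions made for this illustration.

```python
def merge_lexicographically(results, ranker_order):
    """Sort results by their score tuples, compared ranker by ranker.

    results: dict of result ID -> dict of ranker name -> score (assumed).
    ranker_order: ranker names from most to least significant.
    Higher scores are assumed to be better, so the sort is descending.
    """
    def score_tuple(item):
        _, scores = item
        return tuple(scores[name] for name in ranker_order)

    ordered = sorted(results.items(), key=score_tuple, reverse=True)
    return [result_id for result_id, _ in ordered]


results = {
    "f1": {"word_number": 2, "idf": 1.0},
    "f2": {"word_number": 2, "idf": 3.0},
    "f3": {"word_number": 1, "idf": 9.0},
}
print(merge_lexicographically(results, ["word_number", "idf"]))
print(merge_lexicographically(results, ["idf", "word_number"]))
```

Changing `ranker_order` changes which ranker dominates: later rankers only break ties left by earlier ones.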


Merger (cont’d)

• Our submitted results use different orderings of the rankers
• E.g.,
  – Word Number
  – IDF
  – Similarity
  – Proximity
  – TFIDF


Conclusion

• Our system builds and uses indices
• It combines different rankers
• The rankers use both the content and the structure
• The system is extensible
  – The implementation uses configuration files
  – New rankers can be added easily
  – The system can be easily adapted to changes in the formal syntax of queries


Future Works

• We still need to experiment thoroughly with the system
  – Modify the merger by using a single formula to combine the scores of the different rankers
  – How to determine the relative weight of each ranker?
  – Add and modify rankers


Thank You.

Questions?