XRANK: Ranked Keyword Search over XML Documents. Ece AKSU, Gökay Burak AKKUŞ.
Post on 04-Jan-2016
Gökay Burak AKKUŞ
Ece AKSU
XRANK
XRANK: Ranked Keyword Search over XML Documents
Ece AKSU, Gökay Burak AKKUŞ
This Paper...
Describes the architecture, implementation and evaluation of the XRANK system
The contributions of the paper are:
(a) the problem definition and system architecture
(b) an algorithm for computing the ranking of XML elements
(c) new inverted list index structures and associated query processing algorithms
(d) an experimental evaluation of XRANK
Overview
Problem: Efficiently producing ranked results for keyword search queries over hierarchical XML documents.
New challenges:
1. Results can be deeply nested XML elements.
2. Ranking is at the granularity of an XML element (not the document).
3. Keyword proximity is more complex.
Overview - 2
This paper presents the XRANK system to handle these features of XML keyword search.
XRANK offers both space & performance benefits
XRANK generalizes a hyperlink based HTML search engine such as Google.
XRANK can be used to query both HTML and XML documents.
Keyword Search Querying - 1
Keyword search querying
Advantage: simple.
  Users do not have to learn a complex query language.
  Users can issue queries without any prior knowledge about the structure of the underlying data.
Consequence: the interface is flexible.
  Queries may not always be precise and can return a large number of query results.
Keyword Search Querying - 2
An important requirement for keyword search is to rank the query results so that the most relevant results appear first.
Certain limitations of the HTML data model make such systems ineffective in many domains:
  HTML is a presentation language.
  HTML cannot capture much semantics.
Keyword Search Querying - 3
The XML data model addresses this limitation by allowing for extensible element tags. (Example: Figure.1)
Querying XML Documents
One approach is a sophisticated query language such as XQuery.
  Effective in some cases.
  Users have to learn a complex query language and understand the schema of the underlying XML.
An alternative approach is XRANK.
  Retain the simple keyword search query interface.
  Exploit XML's tagged and nested structure during query processing.
New Challenges
Keyword searching over XML introduces many new challenges.
1. The result of the keyword search query can be a deeply nested XML element.
  Return the 'deepest' node.
2. Ranking is not solely based on hyperlinks.
  The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks).
New Challenges
3. The notion of proximity among keywords is more complex.
  In HTML, proximity among keywords translates directly to the distance between keywords in a document.
  For XML there is a 2-dimensional proximity metric:
    Keyword distance
    Ancestor distance
XML Data Model
XML is a hierarchical format for data representation and exchange.
An XML document consists of:
  a root element with nested sub-elements,
  attributes and values.
XML supports intra-document and inter-document references.
XML Data Model-2
Intra-document references are represented using IDREFs.
Inter-document references are represented using XLink.
Both IDREFs and XLinks are referred to as hyperlinks!
Definitions
A collection of hyperlinked XML documents can be defined as a directed graph $G = (N, CE, HE)$:
  $N$: the set of nodes, $N = NE \cup NV$
  $NE$: the set of elements
  $NV$: the set of values
  $CE$: the set of containment edges relating nodes
  $HE$: the set of hyperlink edges relating nodes
Definitions - 2
The edge $(u, v) \in CE$ iff v is a value/nested sub-element of u.
The edge $(u, v) \in HE$ iff u contains a hyperlink reference to v.
An element u is a sub-element of an element v if $(v, u) \in CE$.
An element u is the parent of node v if $(u, v) \in CE$.
The predicate contains*(v, k) is true if the node v directly or indirectly contains the keyword k.
Keyword Query Results
There are two possible semantics for keyword search queries:
  Conjunctive keyword query semantics: elements that contain all of the query keywords are returned.
  Disjunctive keyword query semantics: elements that contain at least one of the query keywords are returned.
This paper focuses on conjunctive keyword query semantics.
Keyword Query Results - 2
Let $Q = \{k_1, \ldots, k_n\}$.
$R_0 = \{\, v \mid v \in NE \wedge \forall k \in Q\ (\mathit{contains}^*(v, k)) \,\}$
  The set of elements that directly or indirectly contain all of the query keywords.
$\mathit{Result}(Q) = \{\, v \mid \forall k \in Q\ \exists c \in N\ ((v, c) \in CE \wedge c \notin R_0 \wedge \mathit{contains}^*(c, k)) \,\}$
  Ensures that only the most specific results are returned.
  Ensures that an element that has multiple independent occurrences of the query keywords is returned.
CE are considered for the result set; HE are considered for ranking.
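The two set definitions on this slide can be tried out on a toy document. The sketch below is a minimal illustration, assuming the exclusion reading of the second definition (a child c only counts if it is not itself in R0); the `children` dictionary, its element names, and the helper names are hypothetical and not part of the slides.

```python
from typing import Dict, List, Set

# Hypothetical toy document: element -> child nodes (containment edges CE).
# Strings that are not keys of `children` are value nodes.
children: Dict[str, List[str]] = {
    "workshop": ["title", "paper1", "paper2"],
    "title": ["XML and IR"],
    "paper1": ["p1title", "p1body"],
    "p1title": ["XQL and Proximal Nodes"],
    "p1body": ["XQL query language"],
    "paper2": ["p2title"],
    "p2title": ["Querying XML"],
}


def contains_star(node: str, k: str) -> bool:
    """contains*(node, k): node directly or indirectly contains keyword k."""
    if node not in children:  # value node: compare against its words
        return k.lower() in node.lower().split()
    return any(contains_star(c, k) for c in children[node])


def r0(query: List[str]) -> Set[str]:
    """Elements that directly or indirectly contain all query keywords."""
    return {v for v in children if all(contains_star(v, k) for k in query)}


def result(query: List[str]) -> Set[str]:
    """Elements where every keyword is reachable through a child outside R0."""
    base = r0(query)
    return {
        v
        for v in children
        if all(
            any(c not in base and contains_star(c, k) for c in children[v])
            for k in query
        )
    }


print(result(["xql", "language"]))  # {'p1body'}
print(result(["xql", "xml"]))       # {'workshop'}
```

For the first query, `paper1` and `workshop` are in R0 but are not returned, because their only source of the keyword "language" is a child that is itself in R0.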
Keyword Query Results - 3
XML elements provide more context information.
This also poses interesting user-interface challenges.
  One solution is to allow the user to navigate up to the ancestors of the query result.
  Another solution is to predefine a set of "answer nodes" AN.
    This may require knowledge of the domain and the underlying XML schema.
XRANK supports both.
Ranking Keyword Query Results
Desired Properties of Ranking Function:
1) Result specificity: more specific results should be ranked higher than less specific results. This is one dimension of result proximity.
2) Keyword proximity: another dimension of result proximity.
3) Hyperlink awareness: the ranking should use the hyperlinked structure of XML documents.
Ranking Function: Definition
ElemRank is defined at the granularity of an element and takes the nested structure of XML into account.
Similar to Google's PageRank.
Let $Q = (k_1, k_2, \ldots, k_n)$, $R = \mathit{Result}(Q)$, and let $v_1 \in R$ be a result element.
First define the ranking of $v_1$ with respect to one query keyword $k_i$, $r(v_1, k_i)$, before defining the overall rank, $\mathit{rank}(v_1, Q)$.
Ranking with respect to one keyword
There exists a sub-element/value node $v_2$ of $v_1$ such that $v_2 \notin R_0$ and $\mathit{contains}^*(v_2, k_i)$.
There is a sequence of containment edges in CE of the form $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$ such that $v_{t+1}$ is a value node that directly contains the keyword $k_i$.
Ranking with respect to one keyword
r(v1, ki) does not depend on the ElemRank of the result node v1, except when v1 = vt, for 2 reasons:
1. Less specific results indeed get lower ranks.
2. r(v1, ki) is in fact related to ElemRank(v1) due to certain properties of containment edges.
For multiple occurrences of ki in v1, the combined rank aggregates the per-occurrence ranks with a function f, where
f = max
Overall Ranking
The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v1, k1, k2, …, kn).
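The pieces of the ranking function can be put together in a short sketch. The per-occurrence formula is not shown in this transcript; the code assumes the form r(v1, ki) = ElemRank(vt) × decay^(t-1), which is how the XRANK paper is usually summarized, and treats the decay value and the proximity measure as given inputs. All function and parameter names here are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

DECAY = 0.75  # assumed decay factor in (0, 1); the slides do not give a value


def keyword_rank(elem_rank_vt: float, t: int, decay: float = DECAY) -> float:
    """Rank for one keyword occurrence: ElemRank(v_t) scaled by decay^(t-1).

    t is the number of containment edges from the result v_1 down to v_t,
    the element that directly contains the keyword, plus one.
    """
    return elem_rank_vt * decay ** (t - 1)


def overall_rank(
    occurrences: Dict[str, List[Tuple[float, int]]],
    proximity: float,
    f: Callable[[Sequence[float]], float] = max,
) -> float:
    """Sum of per-keyword ranks multiplied by the keyword proximity measure.

    occurrences maps each query keyword to (ElemRank(v_t), t) pairs,
    one pair per occurrence; f combines multiple occurrences (f = max).
    """
    per_keyword = [
        f([keyword_rank(e, t) for e, t in occs]) for occs in occurrences.values()
    ]
    return sum(per_keyword) * proximity


example = {"xql": [(0.4, 1), (0.8, 2)], "language": [(0.2, 1)]}
print(overall_rank(example, proximity=0.5))  # roughly 0.4
```

With these numbers the deeper occurrence of "xql" wins the max (0.8 × 0.75 = 0.6 against 0.4), so the sum is 0.8 before the proximity factor is applied.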
XRANK System Architecture
XRANK System Architecture-2
ElemRank Computation Module
  Computes the ElemRanks of XML elements.
  ElemRanks are combined with ancestor information to generate an index structure called HDIL.
The Query Evaluator Module
  Evaluates queries using HDIL.
  Returns ranked results.
ElemRank Computational Module
ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML docs.
The PageRank function is the sum of 2 probabilities (damping factor d = 0.85):
  Visiting v at random, weighted by 1 - d.
  Visiting v by navigating hyperlinks from other pages, weighted by d.
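For reference, the usual form of PageRank can be written as below. The symbols $N_d$ (total number of documents) and $N_h(u)$ (number of outgoing hyperlinks of document u) are notation assumed here, not taken from the slide.

```latex
p(v) = \frac{1 - d}{N_d} + d \times \sum_{(u, v) \in HE} \frac{p(u)}{N_h(u)}
```

The first term is the probability of reaching v by a random jump and the second is the probability of reaching it by following a hyperlink.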
ElemRank Computational Module
PageRank is unidirectional.
Forward ElemRank propagation (from an element to its sub-elements)
  Paper --> section
Reverse ElemRank propagation (from an element to its parent)
  Paper --> workshop
Refinements of PageRank
Bi-directional transfer of ElemRanks
Discrimination between containment and hyperlink edges
Aggregate ElemRanks for reverse containment relationships
Bi-directional Transfer of ElemRanks
A simple solution is to add reverse containment edges.
Drawback: it does not distinguish between containment and hyperlink edges.
Discrimination between containment and hyperlink edges
It weights forward and reverse containment relationships similarly.
Aggregate ElemRanks for reverse containment relationships
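The three refinements can be combined into one iterative computation. The sketch below is an illustration for a single document, not the paper's exact formula: it assumes separate weights d1 (hyperlinks), d2 (forward containment) and d3 (reverse containment), splits rank among hyperlink targets and among children, and passes rank to the parent without splitting (the aggregation). The weight values and all names are assumptions.

```python
from typing import Dict, List


def elem_rank(
    children: Dict[str, List[str]],
    links: Dict[str, List[str]],
    d1: float = 0.35,
    d2: float = 0.25,
    d3: float = 0.25,
    iterations: int = 50,
) -> Dict[str, float]:
    """Iteratively compute ElemRank-like scores for one document.

    children: element -> sub-elements (every element must appear as a key).
    links:    element -> elements it references through hyperlinks.
    """
    nodes = list(children)
    parent = {c: p for p, cs in children.items() for c in cs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-jump term, shared equally by all elements.
        new = {v: (1.0 - d1 - d2 - d3) / n for v in nodes}
        for u in nodes:
            for v in links.get(u, []):   # hyperlink edges, split among targets
                new[v] += d1 * rank[u] / len(links[u])
            for v in children[u]:        # forward containment, split among children
                new[v] += d2 * rank[u] / len(children[u])
            if u in parent:              # reverse containment, aggregated (not split)
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank


# Hypothetical document: a root with three children, where c references a.
doc = {"root": ["a", "b", "c"], "a": [], "b": [], "c": []}
ranks = elem_rank(doc, links={"c": ["a"]})
print(ranks["a"] > ranks["b"])  # True: a also receives rank through the hyperlink
```

Because each element hands on at most d1 + d2 + d3 < 1 of its rank per iteration, the scores settle after a modest number of iterations in this toy setting.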
XRANK System
Efficiently Evaluating XML Keyword Search Queries
Efficiently Evaluating XML Keyword Search Queries
Naïve Approach
Dewey Inverted List (DIL)
Ranked Dewey Inverted List (RDIL)
Hybrid Dewey Inverted List (HDIL)
Naïve Approach
Main difference between XML and HTML keyword search: the granularity of query results.
  XML keyword search returns elements.
  HTML keyword search returns documents.
One way to do XML keyword search:
  Treat each element as a document.
Problems of Naïve Approach
Space Overhead
Spurious Query Results
Inaccurate Ranking of Results
Space Overhead
An inverted list contains, for each keyword, the list of documents that contain the keyword.
For XML documents, it contains the list of elements.
This causes a large space overhead, because each inverted list contains:
  (1) the XML element that directly contains the keyword,
  (2) all of (1)'s ancestors, redundantly.
Spurious Query Results
The naïve approach ignores ancestor-descendant relationships.
  All elements are treated as independent documents.
Results will not correspond to the desired semantics for XML keyword search.
Inaccurate Ranking of Results
Existing approaches do not take result specificity into account when ranking results.
Dewey Inverted List (DIL)
The naïve approach has a drawback:
  It decouples the representation of ancestors and descendants.
Dewey encoding of element IDs jointly captures ancestor and descendant information.
DIL
An interesting feature:
  The ID of an ancestor is a prefix of the ID of a descendant.
Ancestor-descendant relationships are implicitly captured in the Dewey ID.
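The prefix property is easy to check in code. In this minimal sketch a Dewey ID is modelled as a tuple of integers; the representation and the function names are assumptions made for illustration.

```python
from typing import Tuple

DeweyID = Tuple[int, ...]


def is_ancestor(a: DeweyID, d: DeweyID) -> bool:
    """True if a is a proper ancestor of d, i.e. a is a strict prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a


def longest_common_prefix(x: DeweyID, y: DeweyID) -> DeweyID:
    """Dewey ID of the deepest common ancestor-or-self of x and y."""
    n = 0
    for p, q in zip(x, y):
        if p != q:
            break
        n += 1
    return x[:n]


print(is_ancestor((5, 0, 3), (5, 0, 3, 0, 1)))                  # True
print(longest_common_prefix((5, 0, 3, 0, 0), (5, 0, 3, 0, 1)))  # (5, 0, 3, 0)
```

No tree traversal or parent pointer is needed: both questions are answered from the IDs alone.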
DIL Data Structure
The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.
For multiple documents:
  The first component of each Dewey ID is the document ID.
DIL Data Structure -2
An entry in DIL also contains:
  The ElemRank of the corresponding XML element.
  The list of all positions where the keyword k appears in that element.
Entries are sorted by Dewey ID.
The size of DIL is smaller than that of the naïve approach.
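A small sketch can make the entry layout concrete. It builds a DIL-like index for one document with Python's standard `xml.etree.ElementTree`, assuming whitespace tokenization, positions counted in words from the start of the document, and a placeholder ElemRank of 1.0 when none is supplied. The function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

DeweyID = Tuple[int, ...]
Entry = Tuple[DeweyID, float, List[int]]  # (Dewey ID, ElemRank, positions)


def build_dil(
    xml_text: str,
    doc_id: int,
    elem_rank: Optional[Dict[DeweyID, float]] = None,
) -> Dict[str, List[Entry]]:
    """Index only the elements that directly contain each keyword."""
    root = ET.fromstring(xml_text)
    index: Dict[str, Dict[DeweyID, List[int]]] = defaultdict(dict)
    position = 0

    def add_words(text: Optional[str], dewey: DeweyID) -> None:
        nonlocal position
        for word in (text or "").lower().split():
            index[word].setdefault(dewey, []).append(position)
            position += 1

    def visit(elem: ET.Element, dewey: DeweyID) -> None:
        add_words(elem.text, dewey)
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))
            add_words(child.tail, dewey)  # text after a child belongs to the parent

    visit(root, (doc_id,))  # the first Dewey component is the document ID
    ranks = elem_rank or {}
    return {
        k: [(d, ranks.get(d, 1.0), pos) for d, pos in sorted(entries.items())]
        for k, entries in index.items()
    }


xml = (
    "<workshop><title>XML and IR</title>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<body>XQL query language</body></paper></workshop>"
)
dil = build_dil(xml, doc_id=5)
print(dil["xql"])  # [((5, 1, 0), 1.0, [3]), ((5, 1, 1), 1.0, [7])]
```

Note that the `paper` and `workshop` elements do not appear in the list for "xql"; their relationship to the keyword is recoverable from the Dewey ID prefixes.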
DIL Query Processing
An algorithm that works in a single pass over the query keyword inverted lists.
The key idea:
  Merge the query keyword inverted lists.
  Simultaneously compute the longest common prefix of the Dewey IDs in different lists.
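The single-pass idea can be sketched with a stack that mirrors the current root-to-element path. This is a simplified illustration that returns only the result Dewey IDs and leaves out ranks and position lists; the stack layout, the `in_r0` flags, and the function name are assumptions rather than the paper's exact data structure.

```python
import heapq
from typing import Dict, List, Set, Tuple

DeweyID = Tuple[int, ...]


def dil_query(lists: Dict[str, List[DeweyID]]) -> List[DeweyID]:
    """Most specific elements containing all keywords, in one merged pass.

    lists maps each query keyword to its Dewey IDs, sorted by Dewey ID.
    """
    n = len(lists)
    tagged = [[(d, i) for d in ids] for i, ids in enumerate(lists.values())]
    path: List[int] = []        # Dewey components of the current path
    seen: List[Set[int]] = []   # keywords found so far below each path element
    in_r0: List[bool] = []      # True if some descendant already has all keywords
    results: List[DeweyID] = []

    def pop() -> None:
        found, blocked = seen.pop(), in_r0.pop()
        dewey = tuple(path)
        path.pop()
        has_all = len(found) == n
        if has_all:
            results.append(dewey)
        if seen:
            if has_all or blocked:
                in_r0[-1] = True   # parent must not count this subtree
            else:
                seen[-1] |= found  # pass partial matches up to the parent

    for dewey, kw in heapq.merge(*tagged):
        # Longest common prefix between the current path and the next Dewey ID.
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(path) > lcp:
            pop()
        for comp in dewey[lcp:]:
            path.append(comp)
            seen.append(set())
            in_r0.append(False)
        seen[-1].add(kw)
    while path:
        pop()
    return results


print(dil_query({"xql": [(5, 1, 0), (5, 1, 1)], "language": [(5, 1, 1)]}))  # [(5, 1, 1)]
print(dil_query({"xql": [(5, 1, 0)], "xml": [(5, 0)]}))                     # [(5,)]
```

Each inverted list entry is pushed and popped once, which is what keeps the evaluation to a single scan of the lists.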
Ranked Dewey Inverted List (RDIL)
“If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results.”
RDIL -2
One solution:
  Order the inverted lists by ElemRank instead of by Dewey ID.
  Higher ranked results will appear first in the inverted list.
  Threshold Algorithm.
RDIL Data Structure
RDIL is similar to DIL except that:
Inverted lists are ordered by ElemRank,
Each inverted list has a B+-tree index of the Dewey ID field.
RDIL Query Processing
Consider an entry retrieved from the inverted list of keyword ki.
The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword ki.
To determine a query result the longest prefix of d that also contains the other query keywords needs to be determined.
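The prefix probe can be illustrated with a sorted list and binary search standing in for the B+-tree on the Dewey ID field. This is a sketch of the lookup only, assuming tuple-based Dewey IDs; it does not implement the Threshold Algorithm or the ranking, and the function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

DeweyID = Tuple[int, ...]


def has_descendant_or_self(sorted_ids: List[DeweyID], prefix: DeweyID) -> bool:
    """True if some ID in the sorted list starts with `prefix`."""
    # IDs sharing a prefix are contiguous, starting at the first ID >= prefix.
    i = bisect_left(sorted_ids, prefix)
    return i < len(sorted_ids) and sorted_ids[i][: len(prefix)] == prefix


def longest_containing_prefix(
    d: DeweyID, other_lists: Sequence[List[DeweyID]]
) -> DeweyID:
    """Longest prefix of d under which every other keyword also occurs."""
    for n in range(len(d), 0, -1):
        prefix = d[:n]
        if all(has_descendant_or_self(ids, prefix) for ids in other_lists):
            return prefix
    return ()  # no common ancestor contains all the keywords


print(longest_containing_prefix((5, 1, 0), [[(5, 0), (5, 1, 1)]]))  # (5, 1)
```

Each probe costs a logarithmic lookup per keyword, so a top-ranked entry can be resolved without scanning the other inverted lists.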
Hybrid Dewey Inverted List (HDIL)
In many cases RDIL is likely to perform well.
It may perform worse than DIL for queries whose keywords are not correlated.
HDIL -2
The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document.
Since the number of results is small:
  RDIL has to scan most (or all) of the inverted lists to produce the output.
Can we combine the benefits of DIL and RDIL without replicating the entire inverted list index?
HDIL Query Processing
An adaptive strategy:
  Periodically monitor performance.
  Calculate:
    Time spent so far: t
    The number of results above the threshold: r
    Estimated time remaining for RDIL = (m - r) * t / r, where m is the desired number of query results
  If the estimated time is more than the expected time for DIL, then switch to DIL.
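The switching rule is plain arithmetic and can be sketched directly. The function names are hypothetical, and treating r = 0 as an infinite estimate is an assumption added here to avoid a division by zero; the slide does not say how that case is handled.

```python
def estimated_rdil_remaining(t: float, r: int, m: int) -> float:
    """Estimated time RDIL still needs: (m - r) * t / r.

    t: time spent so far, r: results found above the threshold,
    m: desired number of query results.
    """
    if r == 0:
        return float("inf")  # assumption: no progress yet means no finite estimate
    return max(m - r, 0) * t / r


def should_switch_to_dil(t: float, r: int, m: int, expected_dil_time: float) -> bool:
    """Switch when RDIL's estimated remaining time exceeds DIL's expected time."""
    return estimated_rdil_remaining(t, r, m) > expected_dil_time


print(estimated_rdil_remaining(t=2.0, r=4, m=10))                    # 3.0
print(should_switch_to_dil(t=2.0, r=4, m=10, expected_dil_time=2.5))  # True
```

In the example, 4 of 10 results took 2.0 time units, so the remaining 6 are estimated at 3.0 units, which is more than the 2.5 units a DIL scan is expected to take.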
Experimental Evaluation
Experimental Setup
Quality and Ranking Function
Space requirements
Query Performance, evaluated by varying:
  (1) the number of query keywords;
  (2) the correlation between the keywords;
  (3) the desired number of query results;
  (4) the selectivity of the keywords.