Page 1: IR-ranking

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Vagelis Hristidis, Florida International University

Gerhard Weikum, MPI Informatik

30th VLDB Conference, Toronto, Canada, 2004

Presented by Abhishek Jamloki, CSE@UB

Page 2: IR-ranking

Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query: SELECT * FROM D WHERE City=Seattle AND View=Waterfront

Page 3: IR-ranking

Consider a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

A query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

specified attributes: X ={X1, …, Xs}

unspecified attributes: Y = A – X

Let S be the answer set of Q

How to rank tuples in S and return top-k tuples to the user?
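
For concreteness, a minimal Python sketch of this query model, reusing the realtor schema from Page 2 (all names are illustrative, not from the paper):

    # Attribute set A of the realtor table from Page 2.
    ATTRIBUTES = {"TID", "Price", "City", "Bedrooms", "Bathrooms", "LivingArea",
                  "SchoolDistrict", "View", "Pool", "Garage", "BoatDock"}

    # Query Q: SELECT * FROM D WHERE City=Seattle AND View=Waterfront
    specified = {"City": "Seattle", "View": "Waterfront"}  # X = {X1, ..., Xs}
    X = set(specified)
    Y = ATTRIBUTES - X - {"TID"}                           # unspecified: Y = A - X

    def answer_set(D, specified):
        """S: all tuples of D (as dicts) that satisfy every specified predicate."""
        return [t for t in D if all(t[a] == v for a, v in specified.items())]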

Page 4: IR-ranking

IR treatment: query reformulation and automatic ranking

Correlations are ignored in the high-dimensional spaces of IR

• Automated ranking function proposed, based on:
1) A global score of the unspecified attributes
2) A conditional score (strength of correlation between specified and unspecified attributes)

• Automatic estimation using workload and data analysis

Page 5: IR-ranking
Page 6: IR-ranking

• Bayes’ Rule: p(a | b) = p(b | a) p(a) / p(b)

• Product Rule: p(a, b | c) = p(a | c) p(b | a, c)

Document t, query Q; R: relevant document set; R̄ = D − R: irrelevant document set

Score(t) ∝ p(R | t) / p(R̄ | t) = [p(t | R) p(R) / p(t)] / [p(t | R̄) p(R̄) / p(t)] ∝ p(t | R) / p(t | R̄)

Page 7: IR-ranking

Each tuple t is treated as a document. Partition t into two parts:

t(X): contains the specified attributes

t(Y): contains the unspecified attributes

Replace t with X and Y; replace R̄ with D

Page 8: IR-ranking
Page 9: IR-ranking

Comprehensive dependency models have unacceptable preprocessing and query-processing costs.

Choose a middle ground: given a query Q and a tuple t, the X (and Y) values are assumed to be independent within themselves, though dependencies between the X and Y values are allowed

p(X | C) = ∏_{x ∈ X} p(x | C)

p(Y | C) = ∏_{y ∈ Y} p(y | C)

Page 10: IR-ranking
Page 11: IR-ranking

Workload W: a collection of ranking queries that have been executed on our system in the past.

Represented as a set of “tuples”, where each tuple represents a query and is a vector containing the corresponding values of the specified attributes.

We approximate R as all query “tuples” in W that also request X (this approximation is novel to this paper)

Properties of the set of relevant tuples R can be obtained by examining only the subset of the workload containing queries that also request X

Substitute p(y | R) with p(y | X, W)
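
Putting Pages 6–11 together with the limited-independence assumption of Page 9, the score factors into the four atomic quantities listed on the next slide. A hedged reconstruction of the resulting ranking function (the precise form is derived in the paper):

Score(t) ∝ ∏_{x ∈ X} [p(x | W) / p(x | D)] · ∏_{y ∈ Y} [p(y | W) / p(y | D)] · ∏_{x ∈ X} ∏_{y ∈ Y} [p(x | y, W) / p(x | y, D)]

The first product is identical for every tuple in the answer set S (all answers share the same X values), so the ranking is determined by the second (global) part and the third (conditional) part.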

Page 12: IR-ranking
Page 13: IR-ranking

p(y | W): the relative frequency of each distinct value y in the workload

p(y | D): the relative frequency of each distinct value y in the database (similar to the IDF concept in IR)

p(x | y, W): the confidence of the pair-wise association rule y ⇒ x in the workload, that is, (# of tuples in W that contain both x and y) / (# of tuples in W that contain y)

p(x | y, D): (# of tuples in D that contain both x and y) / (# of tuples in D that contain y)

Stored as auxiliary tables in the intermediate knowledge representation layer
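
Page 15’s Atomic Probabilities Module computes and indexes these quantities; below is a minimal counting sketch of how they could be estimated (a hedged illustration, not the paper’s implementation; all names are invented):

    from collections import Counter
    from itertools import combinations

    def atomic_probabilities(rows):
        """Estimate p(v | rows) and p(x | y, rows) by counting.
        `rows` is either the workload W or the database D, as a list of
        dicts; values are kept as (attribute, value) pairs so identical
        values in different columns stay distinct."""
        n = len(rows)
        single = Counter()  # occurrences of each (attr, val)
        pair = Counter()    # co-occurrences of each unordered value pair
        for t in rows:
            vals = set(t.items())
            single.update(vals)
            pair.update(frozenset(c) for c in combinations(sorted(vals), 2))
        p = {v: c / n for v, c in single.items()}      # p(v | rows)
        def cond(x, y):                                # p(x | y, rows)
            return pair[frozenset((x, y))] / single[y] if single[y] else 0.0
        return p, cond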

Page 14: IR-ranking

p(y | W) {AttName, AttVal, Prob}

◦ B+Tree index on (AttName, AttVal)

p(y | D) {AttName, AttVal, Prob}

◦ B+Tree index on (AttName, AttVal)

p(x | y,W) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

◦ B+Tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

p(x | y,D) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}

◦ B+Tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
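
A minimal sketch of this layer in SQLite (illustrative table and file names; the paper’s experiments used Microsoft SQL Server 2000, and SQLite’s ordinary B-tree indexes stand in for the B+Tree indexes above):

    import sqlite3

    conn = sqlite3.connect("ranking_aux.db")  # hypothetical file name
    conn.executescript("""
    CREATE TABLE prob_w (AttName TEXT, AttVal TEXT, Prob REAL);
    CREATE INDEX ix_w ON prob_w (AttName, AttVal);

    CREATE TABLE prob_d (AttName TEXT, AttVal TEXT, Prob REAL);
    CREATE INDEX ix_d ON prob_d (AttName, AttVal);

    CREATE TABLE cond_w (AttNameLeft TEXT, AttValLeft TEXT,
                         AttNameRight TEXT, AttValRight TEXT, Prob REAL);
    CREATE INDEX ix_cw ON cond_w (AttNameLeft, AttValLeft,
                                  AttNameRight, AttValRight);

    CREATE TABLE cond_d (AttNameLeft TEXT, AttValLeft TEXT,
                         AttNameRight TEXT, AttValRight TEXT, Prob REAL);
    CREATE INDEX ix_cd ON cond_d (AttNameLeft, AttValLeft,
                                  AttNameRight, AttValRight);
    """)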

Page 15: IR-ranking

Preprocessing: the Atomic Probabilities Module computes and indexes the quantities p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values x and y

Execution:
1) Select the tuples that satisfy the query
2) Scan and compute the score for each result tuple
3) Return the top-k tuples
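
A hedged sketch of the Scan path, reusing the counting helpers sketched after Page 13 and the ranking function reconstructed after Page 11 (all names are illustrative; smoothing of values unseen in the workload is omitted, so such values simply zero out a score):

    import heapq
    from math import prod  # Python 3.8+

    def scan_rank(D, specified, p_w, p_d, cond_w, cond_d, k=10):
        """Filter D on the query, score each result tuple, return top-k."""
        X = set(specified.items())
        scored = []
        for tid, t in enumerate(D):
            if not X <= set(t.items()):
                continue                    # tuple does not satisfy Q
            Y = set(t.items()) - X          # unspecified attribute values
            g = prod(p_w.get(y, 0.0) / p_d[y] for y in Y)     # global part
            c = prod(cond_w(x, y) / cond_d(x, y)              # conditional part
                     for x in X for y in Y)
            scored.append((g * c, tid))
        return heapq.nlargest(k, scored)    # [(score, tid), ...]

Here p_w, cond_w and p_d, cond_d would come from atomic_probabilities(W) and atomic_probabilities(D), respectively.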

Page 16: IR-ranking

Trade-off between pre-processing and query processing: pre-compute ranked lists of the tuples for all possible “atomic” queries. At query time, given an actual query that specifies a set of values X, “merge” the ranked lists corresponding to each x in X to compute the final top-k tuples.

We should be able to perform the merge without scanning the entire ranked lists.

The Threshold Algorithm (TA) can be used for this purpose. A feasible adaptation of TA should keep the number of sorted streams small; naively, the number of sorted streams would depend on the number of attributes in the database.

Page 17: IR-ranking

At query time we perform a TA-like merge of several ranked lists (i.e., sorted streams).

The required number of sorted streams depends only on the number of specified attribute values in the query, not on the total number of attributes in the database.

Such a merge is possible only because of the specific functional form of our ranking function, which results from the limited independence assumptions.

Page 18: IR-ranking

Index Module: takes as input the association rules and the database and, for every distinct value x, creates two lists Cx and Gx, each containing the tuple-ids of all data tuples that contain x, ordered in specific ways.

Conditional list Cx: consists of pairs of the form <TID, CondScore>, ordered by descending CondScore, where TID is the tuple-id of a tuple t that contains x

Global list Gx: consists of pairs of the form <TID, GlobScore>, ordered by descending GlobScore, where TID is the tuple-id of a tuple t that contains x

Page 19: IR-ranking

At query time we retrieve and multiply the scores of t in the lists Cx1, …, Cxs and in one of Gx1, …, Gxs. This requires only s+1 multiplications and results in a score that is proportional to the actual score. Two kinds of efficient access operations are needed:

First, given a value x, it should be possible to perform a GetNextTID operation on the lists Cx and Gx in constant time; tuple-ids in the lists should be efficiently retrievable one by one in order of decreasing score. This corresponds to the sorted-stream access of TA.

Second, it should be possible to perform random access on the lists; that is, given a TID, the corresponding score (CondScore or GlobScore) should be retrievable in constant time.
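
A hedged sketch of such a TA-style merge over these lists, assuming non-negative scores so the product aggregate is monotone as TA requires (class and function names are illustrative):

    import heapq

    class RankedList:
        """One Cx or Gx list: sorted access plus O(1) random access."""
        def __init__(self, pairs):              # pairs: [(tid, score), ...]
            self.order = sorted(pairs, key=lambda p: -p[1])
            self.by_tid = dict(pairs)
            self.pos = 0
        def get_next(self):                     # GetNextTID (sorted access)
            if self.pos == len(self.order):
                return None
            self.pos += 1
            return self.order[self.pos - 1]
        def score(self, tid):                   # random access
            return self.by_tid.get(tid, 0.0)

    def ta_merge(streams, k=10):
        """Merge s+1 streams (Cx1..Cxs and one Gx); stop once the k-th
        best aggregate score reaches the threshold of last-seen scores."""
        best = {}                               # tid -> product of scores
        frontier = [float("inf")] * len(streams)
        while True:
            progressed = False
            for i, st in enumerate(streams):
                nxt = st.get_next()
                if nxt is None:
                    continue                    # stream exhausted
                progressed = True
                tid, sc = nxt
                frontier[i] = sc
                if tid not in best:             # random-access other streams
                    agg = 1.0
                    for other in streams:
                        agg *= other.score(tid)
                    best[tid] = agg
            top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
            threshold = 1.0
            for sc in frontier:
                threshold *= sc
            if not progressed or (len(top) == k and top[-1][1] >= threshold):
                return top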

Page 20: IR-ranking

These lists are stored as database tables:

CondList Cx: {AttName, AttVal, TID, CondScore}
◦ B+Tree index on (AttName, AttVal, CondScore)

GlobList Gx: {AttName, AttVal, TID, GlobScore}
◦ B+Tree index on (AttName, AttVal, GlobScore)

Page 21: IR-ranking
Page 22: IR-ranking
Page 23: IR-ranking

Space consumed by the lists is O(mn) bytes (m is the number of attributes and n the number of tuples of the database table)

We can store only a subset of the lists at preprocessing time, at the expense of an increase in the query processing time.

Determining which lists to retain/omit at preprocessing time is done by analyzing the workload.

Store the conditional lists Cx and the corresponding global lists Gx only for those attribute values x that occur most frequently in the workload

Probe the intermediate knowledge representation layer at query time to compute the missing information
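
A minimal sketch of this workload-frequency heuristic (illustrative names):

    from collections import Counter

    def values_to_materialize(workload, budget):
        """Keep Cx/Gx lists only for the `budget` attribute values that
        occur most frequently in the workload queries."""
        freq = Counter(v for query in workload for v in query.items())
        return {v for v, _ in freq.most_common(budget)}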

Page 24: IR-ranking

The following datasets were used:

MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)

Internet Movie Database (http://www.imdb.com)

Software and hardware:

Microsoft SQL Server 2000 RDBMS

P4 2.8-GHz PC, 1 GB RAM

C#, connected to the RDBMS through DAO

Page 25: IR-ranking

Evaluated using two ranking methods:
1) Conditional
2) Global

Several hundred workload queries were collected for both datasets, and the ranking algorithm was trained on this workload

Page 26: IR-ranking

For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
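
One plausible reading of this measure as code, a simple top-k overlap fraction (the paper may define the comparison differently):

    def overlap_fraction(user_marked, returned, k=10):
        """Fraction of the user's k relevant tuples that also appear in
        the algorithm's top-k answer."""
        return len(set(user_marked[:k]) & set(returned[:k])) / k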

Page 27: IR-ranking

Users were given the Top-5 results of the two ranking methods for 5 queries (different from the previous survey), and were asked to choose which rankings they preferred

Page 28: IR-ranking

Compared the performance of the various implementations of the Conditional algorithm: List Merge, its space-saving variant, and Scan

Datasets used:

Page 29: IR-ranking
Page 30: IR-ranking
Page 31: IR-ranking
Page 32: IR-ranking
Page 33: IR-ranking

A completely automated approach for the Many-Answers Problem that leverages data and workload statistics and correlations

Probabilistic IR models were adapted for structured data.

Experiments demonstrate the efficiency as well as the quality of the ranking system

Page 34: IR-ranking

Many relational databases contain text columns in addition to numeric and categorical columns. Can correlations between text and non-text data be leveraged in a meaningful way for ranking?

Comprehensive quality benchmarks for database ranking need to be established