Probabilistic Ranking of Database Query Results. Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik. Presented by Raghunath Ravi, Sivaramakrishnan Subramani, CSE@UTA.

Feb 24, 2016

Page 1: Probabilistic Ranking of Database Query Results

1

Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by Raghunath Ravi
Sivaramakrishnan Subramani
CSE@UTA

Page 2: Probabilistic Ranking of Database Query Results

2

Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems

Page 3: Probabilistic Ranking of Database Query Results

3

Motivation
Many-answers problem
Two alternative solutions:
◦ Query reformulation
◦ Automatic ranking
Apply probabilistic model in IR to DB tuple ranking

Page 4: Probabilistic Ranking of Database Query Results

4

Example – Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year

Query: City = 'Seattle' AND Waterfront = TRUE

Too Many Results!

Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable

Page 5: Probabilistic Ranking of Database Query Results

5

Rank According to Unspecified Attributes
Score of a Result Tuple t depends on
Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
◦ E.g., Waterfront → BoatDock
◦ E.g., Many Bedrooms → Good School District

Page 7: Probabilistic Ranking of Database Query Results

7

Key Problems
Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function.
◦ Use Probabilistic Information Retrieval (PIR).
How to Calculate the Global and Conditional Scores.
◦ Use Query Workload and Data.

Page 9: Probabilistic Ranking of Database Query Results

9

System Architecture

Page 11: Probabilistic Ranking of Database Query Results

11

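PIR Review
Bayes' Rule: $p(a \mid b) = \dfrac{p(b \mid a)\, p(a)}{p(b)}$
Product Rule: $p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$

Document (Tuple) t, Query Q
R: Relevant Documents
$\bar{R} = D - R$: Irrelevant Documents

$$Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$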
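
The proportionality in the score can be checked numerically. The sketch below is illustrative only: the function name `odds_of_relevance` and the probability values are made up, and it assumes just two classes (relevant, irrelevant). It computes the posterior odds p(R|t)/p(R̄|t) with Bayes' rule and compares them with the likelihood ratio p(t|R)/p(t|R̄) scaled by the constant p(R)/p(R̄).

```python
def odds_of_relevance(p_t_given_r, p_t_given_not_r, p_r):
    """Posterior odds p(R|t) / p(not R|t), computed via Bayes' rule."""
    # Total probability of observing t (hypothetical two-class setting).
    p_t = p_t_given_r * p_r + p_t_given_not_r * (1.0 - p_r)
    p_r_given_t = p_t_given_r * p_r / p_t
    p_not_r_given_t = p_t_given_not_r * (1.0 - p_r) / p_t
    return p_r_given_t / p_not_r_given_t


# Made-up numbers: two tuples, same prior p(R) = 0.5.
odds_a = odds_of_relevance(0.20, 0.05, 0.5)   # likelihood ratio 0.20 / 0.05
odds_b = odds_of_relevance(0.10, 0.05, 0.5)   # likelihood ratio 0.10 / 0.05
prior_odds = 0.5 / (1.0 - 0.5)                # p(R) / p(not R), same for all tuples
```

Because `prior_odds` is the same for every tuple, ordering tuples by the likelihood ratio gives the same ranking as ordering them by the posterior odds.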

Vagelis Hristidis
Let's see how by adapting PIR techniques to our problem we can create a ranking function.
Page 12: Probabilistic Ranking of Database Query Results

12

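Adaptation of PIR to DB
Tuple t is considered as a document
Partition t into t(X) and t(Y): the values of the attributes specified in the query (X) and of the unspecified attributes (Y)
t(X) and t(Y) are written as X and Y
Starting from the initial scoring function, derive step by step until the final ranking function is obtained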

Page 13: Probabilistic Ranking of Database Query Results

13

Preliminary Derivation
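
A sketch of the steps this slide covers, written out from the PIR score and the X/Y partition. Two assumptions are made here and should be checked against the paper: the irrelevant set $\bar{R}$ is approximated by the whole database D (R is small relative to D), and every relevant tuple satisfies the query, so conditioning on R already fixes X.

```latex
\begin{align*}
Score(t) &\propto \frac{p(t \mid R)}{p(t \mid \bar{R})}
          \approx \frac{p(t \mid R)}{p(t \mid D)}
          = \frac{p(X, Y \mid R)}{p(X, Y \mid D)} \\
         &= \frac{p(X \mid R)\, p(Y \mid X, R)}{p(X \mid D)\, p(Y \mid X, D)}
          && \text{(product rule)} \\
         &\propto \frac{p(Y \mid X, R)}{p(Y \mid X, D)}
          && \text{($X$ is the same for every result tuple)} \\
         &= \frac{p(Y \mid R)}{p(Y \mid X, D)}
          && \text{(relevant tuples already satisfy $X$)}
\end{align*}
```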

Page 14: Probabilistic Ranking of Database Query Results

14

Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$

Page 15: Probabilistic Ranking of Database Query Results

15

Continuing Derivation
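
A sketch of how the derivation can continue, assuming the preceding step ends at $Score(t) \propto p(Y \mid R) / p(Y \mid X, D)$ and applying the limited independence assumptions together with Bayes' rule. The last line estimates R from the workload W; it is an assumption about the intended final form, based on the four atomic quantities the Scan Algorithm pre-computes.

```latex
\begin{align*}
Score(t) &\propto \frac{p(Y \mid R)}{p(Y \mid X, D)}
          = \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid X, D)}
          && \text{(independence within $Y$)} \\
p(y \mid X, D) &= \frac{p(X \mid y, D)\, p(y \mid D)}{p(X \mid D)}
          = \frac{p(y \mid D)}{p(X \mid D)} \prod_{x \in X} p(x \mid y, D)
          && \text{(Bayes' rule, independence within $X$)} \\
Score(t) &\propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)}
          \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}
          && \text{($p(X \mid D)$ is constant)} \\
Score(t) &\propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
          \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
          && \text{(taking $p(y \mid R) = p(y \mid X, W)$)}
\end{align*}
```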

Page 16: Probabilistic Ranking of Database Query Results

16

Pre-computing Atomic Probabilities in Ranking Function

$$Score(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$$

Use Workload
◦ $p(y \mid W)$: Relative frequency in W
◦ $p(x \mid y, W)$: (# of tuples in W that contain x, y) / (# of tuples in W that contain y)

Use Data
◦ $p(y \mid D)$: Relative frequency in D
◦ $p(x \mid y, D)$: (# of tuples in D that contain x, y) / (# of tuples in D that contain y)
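
A minimal sketch of how these relative frequencies could be computed, assuming D and W are both available as lists of attribute-to-value dictionaries (a workload query is treated as a partial tuple holding only the attributes it specifies). The function name `atomic_probabilities` and the sample rows are hypothetical. The conditional quantity is computed as a conditional relative frequency, count(x, y) / count(y).

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Return (p_y, p_x_given_y) estimated from a list of attribute->value dicts.

    p_y[y]              = fraction of rows containing value y
    p_x_given_y[(x, y)] = rows containing both x and y / rows containing y
    Values are represented as (attribute, value) pairs.
    """
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)
        # Ordered pairs (x, y) of distinct values that co-occur in this row.
        pair.update(permutations(items, 2))
    n = len(rows)
    p_y = {y: count / n for y, count in single.items()}
    p_x_given_y = {(x, y): count / single[y] for (x, y), count in pair.items()}
    return p_y, p_x_given_y


# Hypothetical data table D; the same function would be applied to the workload W.
D = [
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True, "BoatDock": False},
    {"City": "Seattle", "Waterfront": False, "BoatDock": False},
    {"City": "Kirkland", "Waterfront": False, "BoatDock": False},
]
p_y_D, p_x_given_y_D = atomic_probabilities(D)
```

Values that never occur get no entry, so a real system would need some smoothing before dividing by these estimates; the sketch leaves that out.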

Page 18: Probabilistic Ranking of Database Query Results

18

Architecture of Ranking Systems

Page 19: Probabilistic Ranking of Database Query Results

19

Scan Algorithm

Preprocessing - Atomic Probabilities Module
◦ Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

Execution
◦ Select Tuples that Satisfy the Query
◦ Scan and Compute Score for Each Result-Tuple
◦ Return Top-K Tuples
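
The execution steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the ranking function is the product of P(y | W) / P(y | D) over unspecified values y and of P(x | y, W) / P(x | y, D) over pairs of specified x and unspecified y, that the four tables are plain dictionaries keyed by (attribute, value) pairs, and that a missing entry falls back to a small constant `eps`. All names and numbers are hypothetical.

```python
import heapq


def score(t, query, p_w, p_d, p_xy_w, p_xy_d, eps=1e-6):
    """Score tuple t using the attributes the query leaves unspecified."""
    xs = [(a, v) for a, v in t.items() if a in query]       # specified values X
    ys = [(a, v) for a, v in t.items() if a not in query]   # unspecified values Y
    s = 1.0
    for y in ys:
        s *= p_w.get(y, eps) / p_d.get(y, eps)               # global part
        for x in xs:
            s *= p_xy_w.get((x, y), eps) / p_xy_d.get((x, y), eps)  # conditional part
    return s


def scan_top_k(table, query, k, p_w, p_d, p_xy_w, p_xy_d):
    """Select tuples satisfying the query, score each one, return the k best."""
    matches = [t for t in table if all(t.get(a) == v for a, v in query.items())]
    return heapq.nlargest(
        k, matches, key=lambda t: score(t, query, p_w, p_d, p_xy_w, p_xy_d)
    )


# Hypothetical table and pre-computed atomic probabilities.
table = [
    {"City": "Seattle", "BoatDock": False},
    {"City": "Seattle", "BoatDock": True},
    {"City": "Kirkland", "BoatDock": True},
]
seattle, dock, no_dock = ("City", "Seattle"), ("BoatDock", True), ("BoatDock", False)
p_w = {dock: 0.6, no_dock: 0.1}      # docks are requested often in the workload
p_d = {dock: 0.25, no_dock: 0.75}    # but are rare in the data
p_xy_w = {(seattle, dock): 0.5, (seattle, no_dock): 0.5}
p_xy_d = {(seattle, dock): 0.5, (seattle, no_dock): 0.5}

top = scan_top_k(table, {"City": "Seattle"}, 1, p_w, p_d, p_xy_w, p_xy_d)
```

Every matching tuple is scored, so the cost grows with the size of the answer set.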

Page 20: Probabilistic Ranking of Database Query Results

20

Beyond Scan Algorithm
Scan algorithm is Inefficient
◦ Many tuples in the answer set

Another extreme
◦ Pre-compute top-K tuples for all possible queries
◦ Still infeasible in practice

Trade-off solution
◦ Pre-compute ranked lists of tuples for all possible atomic queries
◦ At query time, merge ranked lists to get top-K tuples

Page 21: Probabilistic Ranking of Database Query Results

21

Output from Index Module
CondList Cx
◦ {AttName, AttVal, TID, CondScore}
◦ B+ tree index on (AttName, AttVal, CondScore)

GlobList Gx
◦ {AttName, AttVal, TID, GlobScore}
◦ B+ tree index on (AttName, AttVal, GlobScore)

Page 22: Probabilistic Ranking of Database Query Results

22

Index Module

Page 23: Probabilistic Ranking of Database Query Results

23

Preprocessing Component
Preprocessing
◦ For Each Distinct Value x of Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows
◦ For Each Tuple t Containing x, Calculate its CondScore and GlobScore and add them to Cx and Gx respectively
◦ Sort Cx, Gx by decreasing scores

Execution
◦ Query Q: X1=x1 AND … AND Xs=xs
◦ Execute Threshold Algorithm [Fag01] on the following lists: Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
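
A minimal sketch of the merge step, assuming a textbook form of the Threshold Algorithm: round-robin sorted access, random access through per-list dictionaries, and a product of positive scores as the monotone aggregation function. The function name `threshold_top_k`, the list contents, and the choice of product are assumptions for illustration, not the authors' implementation. A tuple that is missing from any list does not satisfy the query and is skipped.

```python
import heapq
from math import prod


def threshold_top_k(lists, k):
    """Top-k (tid, score) by product of per-list scores.

    lists: ranked lists, each a list of (tid, score) sorted by decreasing score.
    Only tids present in every list are candidates.
    """
    if not lists or k <= 0:
        return []
    lookup = [dict(lst) for lst in lists]   # random access: tid -> score
    seen = set()
    top = []                                # min-heap of (total, tid), size <= k
    # Once the shortest list is exhausted, every candidate has been seen.
    for depth in range(min(len(lst) for lst in lists)):
        last = []
        for lst in lists:
            tid, s = lst[depth]             # sorted access
            last.append(s)
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in table for table in lookup):
                total = prod(table[tid] for table in lookup)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        # No unseen tuple can score above the product of the last-seen scores.
        threshold = prod(last)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(((tid, total) for total, tid in top), key=lambda p: -p[1])


# Hypothetical lists for the query City = 'Seattle' AND Waterfront = TRUE.
c_seattle = [("t1", 0.9), ("t2", 0.5), ("t3", 0.2)]
c_waterfront = [("t2", 0.8), ("t1", 0.4), ("t3", 0.1)]
g_shortest = [("t3", 0.9), ("t1", 0.6), ("t2", 0.3)]
result = threshold_top_k([c_seattle, c_waterfront, g_shortest], 2)
```

On this input the loop stops after the second round of sorted accesses, before reading the last entry of each list.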

Page 24: Probabilistic Ranking of Database Query Results

24

List Merge Algorithm

Page 26: Probabilistic Ranking of Database Query Results

26

Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)

Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, Connected to RDBMS through DAO

Page 27: Probabilistic Ranking of Database Query Results

27

Quality Experiments
Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Page 28: Probabilistic Ranking of Database Query Results

28

Quality Experiment - Average Precision

For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
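
One way to turn this comparison into a number, assuming "match" means plain set overlap between the user's marked tuples and an algorithm's top-10, averaged over users. The function names and tuple ids below are hypothetical.

```python
def precision_at_k(marked, returned, k=10):
    """Fraction of the algorithm's top-k tuples that the user also marked."""
    return len(set(marked) & set(returned[:k])) / k


def mean_precision(marked_per_user, returned, k=10):
    """Average of precision_at_k over all users for one query."""
    scores = [precision_at_k(marked, returned, k) for marked in marked_per_user]
    return sum(scores) / len(scores)


# Hypothetical tuple ids: the algorithm returns tuples 1..10 for query Qi.
returned = list(range(1, 11))
user_a = [1, 2, 3, 4, 5, 6, 7, 21, 22, 23]    # 7 of the 10 marked tuples overlap
user_b = [1, 2, 3, 4, 5, 24, 25, 26, 27, 28]  # 5 of the 10 marked tuples overlap
avg = mean_precision([user_a, user_b], returned)
```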

Page 29: Probabilistic Ranking of Database Query Results

29

Quality Experiment - Fraction of Users Preferring Each Algorithm

5 new queries
Users were given the top-5 results

Page 30: Probabilistic Ranking of Database Query Results

30

Performance Experiments

Datasets

Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432

Compare 2 Algorithms:
◦ Scan algorithm
◦ List Merge algorithm

Page 31: Probabilistic Ranking of Database Query Results

31

Performance Experiments – Pre-computation Time

Page 32: Probabilistic Ranking of Database Query Results

32

Performance Experiments – Execution Time

Page 34: Probabilistic Ranking of Database Query Results

34

Conclusions – Future Work
Conclusions
◦ Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations
◦ Based on PIR

Drawbacks
◦ Multiple-table queries
◦ Non-categorical attributes

Future Work
◦ Empty-Answer Problem
◦ Handle Plain Text Attributes

Page 35: Probabilistic Ranking of Database Query Results

35

Questions?