Page 1: INF 2914 Information Retrieval and Web Search

Lecture 10: Query Processing
These slides are adapted from Stanford's class CS276 / LING 286 Information Retrieval and Web Mining

Page 3: Abstract Formulation

Ingredients:
• D: document collection
• Q: query space
• f: D x Q → R: relevance scoring function
• For every q in Q, f induces a ranking (partial order) ≤q on D

Functions of an IR system:
• Preprocess D and create an index I
• Given q in Q, use I to produce a permutation (ranking) on D

Page 4: Document Representation

T = { t1, …, tk }: a "token space" (a.k.a. "feature space" or "term space")
• Ex: all words in English
• Ex: phrases, URLs, …

A document: a real vector d in R^k
• di: "weight" of token ti in d
• Ex: di = normalized # of occurrences of ti in d
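A minimal sketch (illustrative, not from the slides) of this representation, with di taken as the normalized number of occurrences of token ti in d:

```python
from collections import Counter

def tf_vector(tokens, token_space):
    """Represent a document as a vector over a fixed token space,
    with d_i = normalized number of occurrences of token t_i."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in token_space]

token_space = ["java", "coffee", "island", "the"]       # toy token space
doc = "the java island exports java coffee".split()
print(tf_vector(doc, token_space))  # [0.333..., 0.166..., 0.166..., 0.166...]
```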

Page 5: Classic IR (Relevance) Models

• The Boolean model
• The Vector Space Model (VSM)

Page 6: The Boolean Model

A document: a boolean vector d in {0,1}^k
• di = 1 iff ti belongs to d

A query: a boolean formula q over tokens, q: {0,1}^k → {0,1}
• Ex: "Michael Jordan" AND (NOT basketball)
• Ex: +"Michael Jordan" –basketball

Relevance scoring function: f(d,q) = q(d)
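A small sketch (illustrative example, not from the slides) of the Boolean model for a +required / –forbidden style query, with documents represented as token sets:

```python
def matches(doc_tokens, required, forbidden):
    """Boolean model: f(d, q) = q(d) is 1 iff every required token is
    present in d and no forbidden token is."""
    d = set(doc_tokens)
    return required <= d and not (forbidden & d)

d1 = {"michael", "jordan", "statistics"}
d2 = {"michael", "jordan", "basketball"}
# query: +"michael jordan" -basketball  (the phrase is treated as two tokens here)
print(matches(d1, {"michael", "jordan"}, {"basketball"}))  # True
print(matches(d2, {"michael", "jordan"}, {"basketball"}))  # False
```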

Page 7: The Boolean Model: Pros & Cons

Advantages:
• Simplicity for users

Disadvantages:
• Relevance scoring is too coarse

Page 8: The Vector Space Model (VSM)

A document: a real vector d in R^k
• di = weight of ti in d (usually the TF-IDF score)

A query: a real vector q in R^k
• qi = weight of ti in q

Relevance scoring function: f(d,q) = sim(d,q), the "similarity" between d and q

Page 9: Popular Similarity Measures

• L1 or L2 distance: d and q are first normalized to have unit norm
• Cosine similarity: sim(d,q) = (d · q) / (||d|| ||q||)

[Figure: the vectors d and q, their difference d – q, and the angle between d and q]
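A short sketch of the cosine similarity sim(d, q) = (d · q) / (||d|| ||q||), assuming plain Python lists as vectors:

```python
import math

def cosine(d, q):
    """sim(d, q) = (d . q) / (||d|| * ||q||)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine([1.0, 2.0, 0.0], [1.0, 1.0, 1.0]))  # ~0.775
```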

Page 10: TF-IDF Score: Motivation

Motivating principle: a term ti is relevant to a document d if:
• ti occurs many times in d relative to other terms that occur in d
• ti occurs many times in d relative to its number of occurrences in other documents

Examples:
• 10 out of 100 terms in d are "java"
• 10 out of 10,000 terms in d are "java"
• 10 out of 100 terms in d are "the"

Page 11: TF-IDF Score: Definition

• n(d,ti) = # of occurrences of ti in d
• N = Σi n(d,ti) (# of tokens in d)
• Di = # of documents containing ti
• D = # of documents in the collection

TF(d,ti): "Term Frequency"
• Ex: TF(d,ti) = n(d,ti) / N
• Ex: TF(d,ti) = n(d,ti) / maxj { n(d,tj) }

IDF(ti): "Inverse Document Frequency"
• Ex: IDF(ti) = log(D / Di)

TFIDF(d,ti) = TF(d,ti) × IDF(ti)
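A minimal sketch of this definition, using TF(d,t) = n(d,t)/N and IDF(t) = log(D/Dt) on a toy collection (the names and data are illustrative):

```python
import math
from collections import Counter

def tfidf(doc_tokens, term, collection):
    """TFIDF(d, t) = TF(d, t) * IDF(t), with TF = n(d,t)/N and IDF = log(D/D_t)."""
    counts = Counter(doc_tokens)
    tf = counts[term] / sum(counts.values())          # n(d,t) / N
    D = len(collection)                               # # of documents
    D_t = sum(1 for d in collection if term in d)     # # of documents containing t
    idf = math.log(D / D_t) if D_t else 0.0
    return tf * idf

docs = [["java", "code", "java"], ["coffee", "java"], ["the", "cat"], ["the", "dog"]]
print(tfidf(docs[0], "java", docs))  # high: frequent in d, appears in few documents
print(tfidf(docs[0], "the", docs))   # 0.0: "the" does not occur in this d
```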

Page 12: VSM: Pros & Cons

Advantages:
• Better granularity in relevance scoring
• Good performance in practice
• Efficient implementations

Disadvantages:
• Assumes term independence

Page 13: Retrieval Evaluation

Notations:
• D: document collection
• Dq: documents in D that are "relevant" to query q (ex: f(d,q) is above some threshold)
• Lq: list of results on query q

Recall: |Dq ∩ Lq| / |Dq|
Precision: |Dq ∩ Lq| / |Lq|

Page 14: Precision & Recall: Example

List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
Relevant docs: d123, d56, d9, d25, d3

Recall(A) = 80%    Precision(A) = 40%
Recall(B) = 100%   Precision(B) = 50%
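The numbers above can be reproduced with a few lines of Python (a sketch; the docIDs are written as strings):

```python
def precision_recall(results, relevant):
    """Precision = |relevant ∩ results| / |results|, Recall = |relevant ∩ results| / |relevant|."""
    hits = len(set(results) & set(relevant))
    return hits / len(results), hits / len(relevant)

list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]
relevant = ["d123", "d56", "d9", "d25", "d3"]

print(precision_recall(list_a, relevant))  # (0.4, 0.8)
print(precision_recall(list_b, relevant))  # (0.5, 1.0)
```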

Page 15: Precision@k and Recall@k

Notations:
• Dq: documents in D that are "relevant" to q
• Lq,k: top k results on the list

Recall@k: |Dq ∩ Lq,k| / |Dq|
Precision@k: |Dq ∩ Lq,k| / k
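A small sketch of both measures, assuming the ranked result list and the relevant set are plain Python collections:

```python
def precision_at_k(results, relevant, k):
    """Precision@k = |Dq ∩ Lq,k| / k."""
    return len(set(results[:k]) & set(relevant)) / k

def recall_at_k(results, relevant, k):
    """Recall@k = |Dq ∩ Lq,k| / |Dq|."""
    return len(set(results[:k]) & set(relevant)) / len(relevant)

list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
relevant = ["d123", "d56", "d9", "d25", "d3"]
print(precision_at_k(list_a, relevant, 3))  # 0.666... (d123 and d56 are in the top 3)
print(recall_at_k(list_a, relevant, 3))     # 0.4
```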

Page 16: Precision@k: Example

(Same lists A and B, and the same relevant documents, as in the example on Page 14.)

[Chart: precision@k for k = 1 to 10, plotted for List A and List B]

Page 17: Recall@k: Example

(Same lists A and B, and the same relevant documents, as in the example on Page 14.)

[Chart: recall@k for k = 1 to 10, plotted for List A and List B]

Page 18: "Interpolated" Precision

Notations:
• Dq: documents in D that are "relevant" to q
• r: a recall level (e.g., 20%)
• k(r): first k such that recall@k >= r

Interpolated precision at recall level r = max { precision@k : k >= k(r) }
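A self-contained sketch of this definition (it recomputes precision@k and recall@k internally; the example reuses List A from Page 14):

```python
def interpolated_precision(results, relevant, r):
    """Interpolated precision at recall level r:
    max { precision@k : k >= k(r) }, with k(r) = first k such that recall@k >= r."""
    rel = set(relevant)
    def precision_at(k): return len(set(results[:k]) & rel) / k
    def recall_at(k): return len(set(results[:k]) & rel) / len(rel)
    n = len(results)
    ks = [k for k in range(1, n + 1) if recall_at(k) >= r]
    if not ks:
        return 0.0                      # recall level r is never reached
    return max(precision_at(k) for k in range(ks[0], n + 1))

list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
relevant = ["d123", "d56", "d9", "d25", "d3"]
print(interpolated_precision(list_a, relevant, 0.6))  # 0.5 (recall reaches 60% at k = 6)
```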

Page 19: Precision vs. Recall: Example

(Same lists A and B, and the same relevant documents, as in the example on Page 14.)

[Chart: interpolated precision vs. recall (0% to 100%), plotted for List A and List B]

Page 20: Top-k Query Processing

"Optimal aggregation algorithms for middleware", Ronald Fagin, Amnon Lotem, and Moni Naor

Based on the presentation of Wesley Sebrechts and Joost Voordouw; modified by Vagelis Hristidis

Page 21: Why Top-k Query Processing?

• Multimedia brings fuzzy data: attribute values are graded, typically in [0,1]
• No clear boundary between "answer" / "no answer"
• A query in a multimedia database means combining graded attributes
• Attributes are combined by an aggregation function
• The aggregation function gives the overall grade of an object
• Return the k objects with the highest overall grade

Page 22: Top-k Query Processing

Top-k query processing = finding the k objects that have the highest overall grades

• How? Which algorithms?
  • Fagin's Algorithm (FA)
  • Threshold Algorithm (TA)
• Which is the best algorithm?

Keep in mind: the database system serves as middleware
• multimedia (objects) may be kept in different subsystems, e.g. photoDB, videoDB, search engine
• take into account the limitations of these subsystems

Page 23: Example

• Simple database model
• Simple query
• Explaining Fagin's Algorithm (FA)
• Finding the top-k with FA
• Explaining the Threshold Algorithm (TA)
• Finding the top-k with TA

Page 24: Example – Simple Database Model

Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

ObjectID | Attribute 1 | Attribute 2
a        | 0.9         | 0.85
b        | 0.8         | 0.7
c        | 0.72        | 0.2
d        | 0.6         | 0.9
...      | ...         | ...

(The collection has N objects; each attribute has its own list, sorted by grade.)

Page 25: Example – Simple Query

Find the top 2 (k = 2) objects for the following 'query' executed on the middleware:

A1 & A2 (e.g.: color=red & shape=round)

Aggregation function:
• a function that gives objects an overall grade based on their attribute grades
• examples: min, max functions
• Monotonicity!

A1 & A2 as a 'query' to the middleware results in the middleware combining the grades of A1 and A2 by min(A1, A2)

Page 26: Example – Fagin's Algorithm

STEP 1
• Read attributes from every sorted list (sorted access in parallel)
• Stop when k objects have been seen in common from all lists

Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

ID | A1   | A2   | Min(A1,A2)
a  | 0.9  | 0.85 |
d  |      | 0.9  |
b  | 0.8  | 0.7  |
c  | 0.72 |      |

Page 27: Example – Fagin's Algorithm

STEP 2
• Random access to find the missing grades

ID | A1   | A2   | Min(A1,A2)
a  | 0.9  | 0.85 |
d  | 0.6  | 0.9  |
b  | 0.8  | 0.7  |
c  | 0.72 | 0.2  |

Page 28: Example – Fagin's Algorithm

STEP 3
• Compute the grades of the seen objects
• Return the k highest-graded objects

ID | A1   | A2   | Min(A1,A2)
a  | 0.9  | 0.85 | 0.85
d  | 0.6  | 0.9  | 0.6
b  | 0.8  | 0.7  | 0.7
c  | 0.72 | 0.2  | 0.2

Top 2 returned: a (0.85) and b (0.7)
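A compact sketch of FA for m sorted lists with the min aggregation function, following the three steps above; this is an illustration, not the paper's exact pseudocode:

```python
def fagin(sorted_lists, grades, k, agg=min):
    """Fagin's Algorithm.
    sorted_lists: one list of (object, grade) pairs per attribute, sorted descending.
    grades: random-access map object -> tuple of all attribute grades."""
    seen = {}                                   # object -> set of lists it was seen in
    m = len(sorted_lists)
    depth = 0
    # Step 1: sorted access in parallel until k objects are seen in ALL lists.
    while sum(1 for s in seen.values() if len(s) == m) < k:
        for i, lst in enumerate(sorted_lists):
            obj, _ = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # Steps 2 & 3: random access for missing grades, aggregate, return the top k.
    scored = [(agg(grades[obj]), obj) for obj in seen]
    return sorted(scored, reverse=True)[:k]

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
grades = {"a": (0.9, 0.85), "b": (0.8, 0.7), "c": (0.72, 0.2), "d": (0.6, 0.9)}
print(fagin([L1, L2], grades, k=2))   # [(0.85, 'a'), (0.7, 'b')]
```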

Page 29: New Idea: Threshold Algorithm (TA)

• Read all grades of an object once it is seen by a sorted access
• No need to wait until the lists give k common objects
• Do sorted access (and the corresponding random accesses) until you have seen the top k answers

How do we know that the grades of the seen objects are higher than the grades of unseen objects?
• Predict the maximum possible grade of unseen objects: the threshold value

L1 (seen: a 0.9, b 0.8, c 0.72; possibly unseen: f 0.65, d 0.6, ...)
L2 (seen: d 0.9, a 0.85, b 0.7; possibly unseen: f 0.6, c 0.2, ...)

Threshold value: T = min(0.72, 0.7) = 0.7 (no unseen object, such as f, can score above T)

Page 30: Example – Threshold Algorithm

Step 1: parallel sorted access to each list

Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

For each object seen:
• get all its grades by random access
• determine Min(A1, A2)
• amongst the 2 highest seen? keep it in the buffer

ID | A1  | A2   | Min(A1,A2)
a  | 0.9 | 0.85 | 0.85
d  | 0.6 | 0.9  | 0.6

Page 31: Example – Threshold Algorithm

Step 2: determine the threshold value based on the objects currently seen under sorted access: T = min(L1, L2)

T = min(0.9, 0.9) = 0.9

ID | A1  | A2   | Min(A1,A2)
a  | 0.9 | 0.85 | 0.85
d  | 0.6 | 0.9  | 0.6

2 objects with overall grade ≥ threshold value? stop;
else go to the next entry position in the sorted lists and repeat step 1

Page 32: Example – Threshold Algorithm

Step 1 (again): parallel sorted access to each list

For each object seen:
• get all its grades by random access
• determine Min(A1, A2)
• amongst the 2 highest seen? keep it in the buffer

ID | A1  | A2   | Min(A1,A2)
a  | 0.9 | 0.85 | 0.85
d  | 0.6 | 0.9  | 0.6
b  | 0.8 | 0.7  | 0.7

Page 33: Example – Threshold Algorithm

Step 2 (again): determine the threshold value based on the objects currently seen: T = min(L1, L2)

T = min(0.8, 0.85) = 0.8

ID | A1  | A2   | Min(A1,A2)
a  | 0.9 | 0.85 | 0.85
b  | 0.8 | 0.7  | 0.7

2 objects with overall grade ≥ threshold value? stop;
else go to the next entry position in the sorted lists and repeat step 1

Page 34: Example – Threshold Algorithm

Situation at the stopping condition:

T = min(0.72, 0.7) = 0.7

ID | A1  | A2   | Min(A1,A2)
a  | 0.9 | 0.85 | 0.85
b  | 0.8 | 0.7  | 0.7

Both buffered objects have an overall grade ≥ T, so TA stops and returns a and b.
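A compact sketch of TA on the same data, again illustrative rather than the paper's exact pseudocode:

```python
import heapq

def threshold_algorithm(sorted_lists, grades, k, agg=min):
    """Threshold Algorithm: sorted access in parallel; random access for every
    newly seen object; stop when k buffered grades reach the threshold T."""
    buffer = {}                                   # object -> overall grade (top k only)
    for depth in range(len(sorted_lists[0])):
        row = [lst[depth] for lst in sorted_lists]
        # random access: all grades of each object seen at this depth
        for obj, _ in row:
            if obj not in buffer:
                buffer[obj] = agg(grades[obj])
        # keep only the k highest overall grades in the buffer
        buffer = dict(heapq.nlargest(k, buffer.items(), key=lambda kv: kv[1]))
        # threshold: aggregate of the grades at the current depth in each list
        T = agg(g for _, g in row)
        if len(buffer) >= k and all(g >= T for g in buffer.values()):
            return sorted(buffer.items(), key=lambda kv: -kv[1])
    return sorted(buffer.items(), key=lambda kv: -kv[1])[:k]

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
grades = {"a": (0.9, 0.85), "b": (0.8, 0.7), "c": (0.72, 0.2), "d": (0.6, 0.9)}
print(threshold_algorithm([L1, L2], grades, k=2))  # [('a', 0.85), ('b', 0.7)]
```

With k = 2 this stops after the third round of sorted accesses, when both buffered grades (0.85 and 0.7) reach the threshold T = 0.7, as in the slides above.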

Page 35: Comparison of Fagin's and Threshold Algorithm

• TA sees fewer objects than FA
• TA stops at least as early as FA
• When we have seen k objects in common in FA, their grades are higher than or equal to the threshold in TA
• TA may perform more random accesses than FA
  • in TA, (m-1) random accesses for each object
  • in FA, random accesses are done at the end, only for missing grades
• TA requires only bounded buffer space (k)
  • at the expense of more random seeks
  • FA makes use of unbounded buffers

Page 36: The Best Algorithm

Which algorithm is the best?
• Define "best"
  • middleware cost
  • concept of instance optimality
• Consider:
  • wild guesses
  • aggregation function characteristics: monotone, strictly monotone, strict
  • database restrictions: distinctness property

Page 37: The Best Algorithm: Concept of Optimality

middleware cost = cost for processing data in the subsystems = (sorted-access cost) + (random-access cost)

• A = a class of algorithms; A ∈ A denotes an algorithm
• D = a class of databases (legal inputs to the algorithms); D ∈ D denotes a database
• Cost(A, D) = middleware cost when running algorithm A over database D

Algorithm B is instance optimal over A and D if:
B ∈ A and Cost(B, D) = O(Cost(A, D)) for every A ∈ A and every D ∈ D

which means that:
Cost(B, D) ≤ c · Cost(A, D) + c' for every A ∈ A and every D ∈ D,
where the constant c is called the optimality ratio.

Page 38: The Best Algorithm: Instance Optimality & Wild Guesses

Intuitively: B instance optimal = always the best algorithm in A = always optimal

In reality, how "always" is "always"? We will exclude algorithms that make wild guesses.

Wild guess = a random access on an object not previously encountered by sorted access
• in practice not possible: the database needs to know the object ID to do a random access
• if wild guesses are allowed in A, then no algorithm can be instance optimal
• wild guesses can find the top-k objects with k·m random accesses (k = # objects, m = # lists)

Page 39: The Best Algorithm: Aggregation Functions

An aggregation function t combines the object's attribute grades into the object's overall grade:
x1, ..., xm → t(x1, ..., xm)

• Monotone: t(x1,...,xm) ≤ t(x'1,...,x'm) if xi ≤ x'i for every i
• Strictly monotone: t(x1,...,xm) < t(x'1,...,x'm) if xi < x'i for every i
• Strict: t(x1,...,xm) = 1 precisely when xi = 1 for every i

Page 40: The Best Algorithm: Database Restrictions

Distinctness property: a database has no (sorted) attribute list in which two objects have the same grade

Page 41: Fagin's Algorithm

• Database with N objects, each with m attributes
• Orderings of the lists are independent
• FA finds the top k with middleware cost O(N^((m-1)/m) · k^(1/m))
• FA is optimal with high probability in the worst case for strict monotone aggregation functions

Page 42: Threshold Algorithm

TA is instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses)
= optimal in a much stronger sense than Fagin's Algorithm

For a strict monotone aggregation function:
• optimality ratio = m + m(m-1)·cR/cS = best possible (m = # attributes)
• if random access is not possible (cR = 0): optimality ratio = m
• if sorted access is not possible (cS = 0): optimality ratio = infinite, and TA is not instance optimal

TA is instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property
• optimality ratio = c·m² with c = max { cR/cS, cS/cR }

Page 43: Optimized Query Execution in Large Search Engines with Global Page Ordering

Xiaohui Long and Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201

Page 44: Talk Outline

• intro: query processing in search engines
• related work: query execution and pruning techniques
• algorithmic techniques
• experimental evaluation: single and multiple nodes
• concluding remarks

The Problem: "how to optimize query throughput in large search engines, when the ranking function is a combination of term-based ranking and a global ordering such as Pagerank"

Page 45: Query Processing in Parallel Search Engines

[Figure: "Cluster with global index organization": a query integrator connected over a LAN to nodes that each hold pages and an index; the integrator broadcasts each query and combines the results]

• local index: every node stores and indexes a subset of the pages
• every query is broadcast to all nodes by the query integrator (QI)
• every node supplies its top-10, and the QI computes the global top-10
• note: we don't really need the top-10 from all nodes, maybe only the top-2
• low-cost cluster architecture (usually with additional replication)

Page 46: Related Work on Top-k Queries

• IR: optimized evaluation of cosine measures (since the 1980s)
• DB: top-k queries for multimedia databases (Fagin 1996)
  • does not consider combinations of term-based and global scores
• Brin/Page 1998: fancy lists in Google

Related Work (IR)
• basic idea: "presort entries in each inverted list by contribution to cosine"
• also process inverted lists from shortest to longest list
• various schemes, either reliable or probabilistic
• most closely related: Persin/Zobel/Sacks-Davis 1993/96; Anh/Moffat 1998; Anh/de Kretser/Moffat 2001
• typical assumptions: many keywords per query, OR semantics

Page 47: Related Work (DB) (Fagin 1996 and others)

• motivation: searching multimedia objects by several criteria
• typical assumptions: few attributes, OR semantics, random access
• FA (Fagin's Algorithm), TA (Threshold Algorithm), others
• formal bounds for m independent lists: O(N^((m-1)/m) · k^(1/m))
• term-based ranking: presort each list by contribution to the cosine

Page 48: Related Work (Google) (Brin/Page 1998)

• "fancy lists" optimization in Google
• create an extra, shorter inverted list for "fancy matches" (matches that occur in the URL, anchor text, title, bold face, etc.)
• note: fancy matches can be modeled by higher weights in the term-based vector space model
• no details given or numbers published

[Figure: inverted lists for "chair" and "table", each split into a fancy list followed by the rest of the list with other matches]

Page 49: Results of our Paper

• pruning techniques for query execution in large search engines
• focus on a combination of a term-based score and a global score (such as Pagerank)
• techniques combine previous approaches such as fancy lists and presorting of lists by term scores
• experimental evaluation on 120 million pages
• very significant savings with almost no impact on the results
• it's good to have a global ordering!

Page 50: Algorithms

• exhaustive algorithm: no pruning, traverse the entire lists
• first-m: a naïve algorithm with lists sorted by Pagerank; stop after m elements in the intersection have been found (see the sketch below)
• fancy first-m: use fancy and non-fancy lists, each sorted by Pagerank, and stop after m elements have been found
• reliable pruning: stop when the top-k results have been found
• fancy last-m: stop when at most m elements are unresolved
• single-node and parallel case, with optimizations
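A rough sketch of the plain first-m idea for a two-term query; the data layout (docID order = Pagerank order) and the score function are assumptions for illustration:

```python
def first_m(list1, list2, score, m, k=10):
    """first-m sketch: both inverted lists are sorted by docID, where docID equals
    the Pagerank rank. Intersect them with a merge, stop as soon as m docs in the
    intersection have been found, then return the k highest-scoring of those docs."""
    results, i, j = [], 0, 0
    while i < len(list1) and j < len(list2) and len(results) < m:
        d1, e1 = list1[i]
        d2, e2 = list2[j]
        if d1 == d2:
            results.append((score(d1, e1, e2), d1))
            i, j = i + 1, j + 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return sorted(results, reverse=True)[:k]

# toy lists of (docID, term frequency); docID order = Pagerank order
chair = [(1, 3), (4, 1), (7, 2), (9, 5), (12, 1)]
table = [(2, 2), (4, 4), (9, 1), (12, 3), (15, 2)]
score = lambda doc, tf1, tf2: tf1 + tf2          # stand-in for cosine + log(PR)
print(first_m(chair, table, score, m=2))         # stops after docs 4 and 9
```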

Page 51: Experimental Setup

• 120 million pages on 16 machines (1.8 TB uncompressed)
• P-4 1.7 GHz with 2 x 80 GB Seagate Barracuda IDE disks
• compressed index based on Berkeley DB (using the mg compression macros)
• queries from an Excite query trace from December 1999; queries with 2 terms in the following
• local index organization with a query integrator
• first results for one node (7.5 million pages), then 16
• note: we do not need the top-10 from every node; this motivates top-1 and top-4 schemes and precision at 1 and 4
• ranking by cosine + log(PR), with normalization

Page 52: A Naïve Approach: first-m

• sort inverted lists by Pagerank (docID = rank due to Pagerank)
• exhaustive: top-10
• first-m: return the 10 highest-scoring among the first 10/100/1000 pages in the intersection

Page 53: first-m (ctd.)

[Chart: loose/strict precision, relative to the "correct" ranking by cosine + log(PR)]

• for first-10, about 45% of the returned top-10 results belong in the correct top-10
• for first-1000, about 85% of the returned top-10 results belong in the correct top-10
• for first-100, about 80% of queries return the correct top-1 result
• for first-1000, about 70% of queries return all correct top-10 results

[Chart: average cost per query in terms of disk blocks]

Page 54: How Can We Do Better?

(1) Use better stopping criteria?
• reliable pruning: stop when we are sure
• probabilistic pruning: stop when almost sure
• these do not work well for a Pagerank-sorted index

(2) Reorganize the index structure?
• sort lists by term score (cosine) instead of Pagerank - does not do any better than sorting by Pagerank only
• sort lists by term score + 0.5·log(PR) (or some combination of these) - some problems with normalization and dependence on the # of keywords
• generalized fancy lists:
  - for each list, put the entries with the highest term value in the fancy list
  - sort both lists by Pagerank (docID)
  - note: anything that does well in 2 out of 3 scores is found soon
  - deterministic or probabilistic pruning, or first-k

[Figure: inverted lists for "chair" and "table", each split into a fancy list and the rest of the list with cosine < x (resp. cosine < y)]

Page 55: Results for Generalized Fancy Lists

[Chart: loose vs. strict precision for various sizes of the fancy lists]

• MUCH better precision than without fancy lists!
• for first-1000, we always get the correct top-1 in these runs

Page 56: Costs of Fancy Lists

• cost similar to first-m without fancy lists, plus the additional cost of reading the fancy lists
• cost increases slightly with the size of the fancy list
• slight inefficiency: fancy-list items are not removed from the other list
• note: we do not consider savings due to caching

Page 57: Reliable Pruning

• always gives the "correct" result
• top-4 can be computed reliably with ~20% of the original cost
• with 16 nodes, the top-4 from each node suffices with 99% probability to get the top-10

Page 58: Results for 16 Nodes

• first-30 returns the correct top-10 for almost 98% of all queries

Page 59: Throughput and Latency for 16 Nodes

• top-10 queries on 16 machines with 120 million pages
• up to 10 queries/sec with reliable pruning
• up to 20 queries/sec with the first-30 scheme
• note: reliable pruning was not implemented in a purely incremental manner

Page 60: Current and Future Work

• results for 3+ terms and an incremental query integrator
• need to do a precision/recall study
• need to engineer the ranking function and reevaluate
• how to include term distance within documents
• impact of caching at the lower level
• working on a publicly available engine prototype
• tons of loose ends and open questions