Transcript
Page 1: Slides

1

Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 2

March 26, 2006

http://www.ee.technion.ac.il/courses/049011

Page 2: Slides

2

Information Retrieval

Page 3: Slides

3

Information Retrieval Setting

[Diagram: a user with an "information need" ("I want information about Michael Jordan, the machine learning expert") poses the query +"Michael Jordan" -basketball to the IR system. The system retrieves documents from the collection and returns a ranked list:
1. Michael I. Jordan's homepage
2. NBA.com
3. Michael Jordan on TV
The user gives feedback ("No. 1 is good, the rest are bad"), and the system returns a revised ranked list:
1. Michael I. Jordan's homepage
2. M.I. Jordan's pubs
3. Graphical Models]

Page 4: Slides

4

Information Retrieval vs. Data Retrieval

Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
Ex: Get documents about Michael Jordan, the machine learning expert.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Ex: SELECT doc FROM corpus
    WHERE (doc.text CONTAINS "Michael Jordan")
    AND NOT (doc.text CONTAINS "basketball").

Page 5: Slides

5

Information Retrieval vs. Data Retrieval

                   | Information Retrieval      | Data Retrieval
Data               | Free text, unstructured    | Database tables, structured
Queries            | Keywords, natural language | SQL, relational algebras
Results            | Approximate matches        | Exact matches
Results            | Ordered by relevance       | Unordered
Accessibility      | Non-expert humans          | Knowledgeable users or automatic processes

Page 6: Slides

6

Information Retrieval Systems

[Diagram: architecture of an IR system. Raw docs from the corpus are fed to a text processor; the text processor passes tokenized docs to the indexer, which writes postings into the index. The user's query goes to a query processor, which turns it into a system query; the system query is run against the index to produce retrieved docs, and a ranking procedure orders them into the ranked retrieved docs returned to the user.]

Page 7: Slides

7

Search Engines

[Diagram: architecture of a search engine; the same components as the IR system above (text processor, indexer, index, query processor, ranking procedure), with the Web in place of the corpus. A crawler fetches raw docs from the Web into a repository, and a global analyzer operates over the repository and feeds into ranking.]

Page 8: Slides

8

Classical IR vs. Web IR

                   | Classical IR    | Web IR
Volume             | Large           | Huge
Data quality       | Clean, no dups  | Noisy, dups
Data change rate   | Infrequent      | In flux
Data accessibility | Accessible      | Partially accessible
Format diversity   | Homogeneous     | Widely diverse
Documents          | Text            | Hypertext
# of matches       | Small           | Large
IR techniques      | Content-based   | Link-based

Page 9: Slides

9

Outline

Abstract formulation
Models for relevance ranking
Retrieval evaluation
Query languages
Text processing
Indexing and searching

Page 10: Slides

10

Abstract Formulation

Ingredients:
  D: document collection
  Q: query space
  f: D × Q → R: relevance scoring function
  For every q in Q, f induces a ranking (partial order) ≤q on D

Functions of an IR system:
  Preprocess D and create an index I
  Given q in Q, use I to produce a permutation π on D

Goals:
  Accuracy: π should be "close" to ≤q
  Compactness: index should be compact
  Response time: answers should be given quickly

Page 11: Slides

11

Document Representation

T = { t1, …, tk }: a "token space" (a.k.a. "feature space" or "term space")
  Ex: all words in English
  Ex: phrases, URLs, …

A document: a real vector d in R^k
  di: "weight" of token ti in d
  Ex: di = normalized # of occurrences of ti in d

Page 12: Slides

12

Classic IR (Relevance) Models

The Boolean model
The Vector Space Model (VSM)

Page 13: Slides

13

The Boolean Model

A document: a boolean vector d in {0,1}^k
  di = 1 iff ti belongs to d

A query: a boolean formula q over tokens
  q: {0,1}^k → {0,1}
  Ex: "Michael Jordan" AND (NOT basketball)
  Ex: +"Michael Jordan" -basketball

Relevance scoring function: f(d,q) = q(d)
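A minimal sketch of Boolean scoring under this model: documents become 0/1 vectors over a toy token space, and the query +"Michael Jordan" -basketball is expressed as a Boolean function evaluated on that vector. The token list and the sample documents are illustrative assumptions, not from the slides.

```python
# Boolean model sketch: a document is a 0/1 vector over the token space,
# and a query is a Boolean formula q evaluated on that vector.

TOKENS = ["michael", "jordan", "basketball", "graphical", "models"]  # toy token space

def to_boolean_vector(text):
    """d_i = 1 iff token t_i occurs in the document."""
    words = set(text.lower().split())
    return [1 if t in words else 0 for t in TOKENS]

def query(d):
    """+"Michael Jordan" -basketball, written as a Boolean formula q(d)."""
    i = {t: k for k, t in enumerate(TOKENS)}
    return bool(d[i["michael"]] and d[i["jordan"]]) and not d[i["basketball"]]

d1 = to_boolean_vector("Michael Jordan is a professor of graphical models")
d2 = to_boolean_vector("Michael Jordan the basketball legend")
print(query(d1), query(d2))  # True False -> f(d1,q) = 1, f(d2,q) = 0
```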

Page 14: Slides

14

The Boolean Model: Pros & Cons

Advantages:
  Simplicity for users

Disadvantages:
  Relevance scoring is too coarse

Page 15: Slides

15

The Vector Space Model (VSM)

A document: a real vector d in R^k
  di = weight of ti in d (usually TF-IDF score)

A query: a real vector q in R^k
  qi = weight of ti in q

Relevance scoring function: f(d,q) = sim(d,q)
  "similarity" between d and q

Page 16: Slides

16

Popular Similarity Measures

L1 or L2 distance
  d, q are first normalized to have unit norm

Cosine similarity

[Diagram: vectors d and q with the difference d - q (for the distance measures), and the angle between d and q (for cosine similarity).]
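A small sketch of the two similarity measures on this slide, using plain Python lists as vectors; the example weights are made up for illustration.

```python
import math

def cosine_similarity(d, q):
    """sim(d,q) = <d,q> / (||d|| * ||q||)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q)))

def l2_distance_normalized(d, q):
    """L2 distance after normalizing d and q to unit norm."""
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(x * x for x in q))
    return math.sqrt(sum((di / nd - qi / nq) ** 2 for di, qi in zip(d, q)))

d = [0.5, 0.8, 0.0]   # toy TF-IDF weights for a document
q = [1.0, 0.0, 0.0]   # toy query vector
print(cosine_similarity(d, q), l2_distance_normalized(d, q))
```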

Page 17: Slides

17

TF-IDF Score: Motivation

Motivating principle: a term ti is relevant to a document d if:
  ti occurs many times in d relative to other terms that occur in d
  ti occurs many times in d relative to its number of occurrences in other documents

Examples:
  10 out of 100 terms in d are "java"
  10 out of 10,000 terms in d are "java"
  10 out of 100 terms in d are "the"

Page 18: Slides

18

TF-IDF Score: Definition

n(d,ti) = # of occurrences of ti in d
N = Σi n(d,ti) (# of tokens in d)
Di = # of documents containing ti
D = # of documents in the collection

TF(d,ti): "Term Frequency"
  Ex: TF(d,ti) = n(d,ti) / N
  Ex: TF(d,ti) = n(d,ti) / maxj { n(d,tj) }

IDF(ti): "Inverse Document Frequency"
  Ex: IDF(ti) = log(D / Di)

TFIDF(d,ti) = TF(d,ti) x IDF(ti)
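A minimal sketch of the score as defined above, using TF(d,t) = n(d,t)/N and IDF(t) = log(D/Dt); the toy corpus is made up for illustration.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of tokenized documents. Returns a {term: TF-IDF} map per document."""
    D = len(corpus)
    # D_t = number of documents containing term t
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)          # n(d, t)
        N = len(doc)                   # total number of tokens in d
        scores.append({t: (n / N) * math.log(D / doc_freq[t]) for t, n in counts.items()})
    return scores

corpus = [
    "michael jordan author of graphical models".split(),
    "nba legend michael jordan liked models".split(),
]
# Terms unique to one document get positive weight; terms in both documents
# ("michael", "jordan", "models") get IDF = log(2/2) = 0.
print(tfidf(corpus))
```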

Page 19: Slides

19

VSM: Pros & Cons

Advantages:
  Better granularity in relevance scoring
  Good performance in practice
  Efficient implementations

Disadvantages:
  Assumes term independence

Page 20: Slides

20

Retrieval Evaluation

Notations:
  D: document collection
  Dq: documents in D that are "relevant" to query q
    Ex: f(d,q) is above some threshold
  Lq: list of results on query q

[Diagram: Venn diagram of D, Lq, Dq and their intersection Lq ∩ Dq.]

Recall: |Lq ∩ Dq| / |Dq|

Precision: |Lq ∩ Dq| / |Lq|
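A direct sketch of these two measures; the relevant set and List A are taken from the example on the next slide.

```python
def recall(retrieved, relevant):
    """|L_q ∩ D_q| / |D_q|"""
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    """|L_q ∩ D_q| / |L_q|"""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

relevant = ["d123", "d56", "d9", "d25", "d3"]
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
print(recall(list_a, relevant), precision(list_a, relevant))  # 0.8, 0.4
```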

Page 21: Slides

21

Recall & Precision: Example

Relevant docs: d123, d56, d9, d25, d3

List A: 1. d123  2. d84  3. d56  4. d6  5. d8  6. d9  7. d511  8. d129  9. d187  10. d25
List B: 1. d81  2. d74  3. d56  4. d123  5. d511  6. d25  7. d9  8. d129  9. d3  10. d5

Recall(A) = 80%, Precision(A) = 40%
Recall(B) = 100%, Precision(B) = 50%

Page 22: Slides

22

Precision@k and Recall@k

Notations:
  Dq: documents in D that are "relevant" to q
  Lq,k: top k results on the list

Recall@k: |Lq,k ∩ Dq| / |Dq|

Precision@k: |Lq,k ∩ Dq| / |Lq,k|
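A sketch of the truncated measures; List B and the relevant set are from the earlier example.

```python
def precision_at_k(retrieved, relevant, k):
    """|L_{q,k} ∩ D_q| / k"""
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    """|L_{q,k} ∩ D_q| / |D_q|"""
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))

relevant = ["d123", "d56", "d9", "d25", "d3"]
list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]
print([round(precision_at_k(list_b, relevant, k), 2) for k in range(1, 11)])
print([round(recall_at_k(list_b, relevant, k), 2) for k in range(1, 11)])
```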

Page 23: Slides

23

Precision@k: Example

[Plot: precision@k for k = 1, …, 10 on List A and List B from the previous example (y-axis: precision@k, 0%-100%; x-axis: k).]

Page 24: Slides

24

Recall@k: Example

[Plot: recall@k for k = 1, …, 10 on List A and List B from the previous example (y-axis: recall@k, 0%-100%; x-axis: k).]

Page 25: Slides

25

“Interpolated” Precision

Notations:
  Dq: documents in D that are "relevant" to q
  r: a recall level (e.g., 20%)
  k(r): first k so that recall@k >= r

Interpolated precision @ recall level r = max { precision@k : k >= k(r) }
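A short sketch of the definition above. Since recall@k is non-decreasing in k, taking the maximum over all k with recall@k >= r is the same as taking it over all k >= k(r). The example data reuses the earlier lists.

```python
def interpolated_precision(retrieved, relevant, r):
    """max { precision@k : recall@k >= r }, equivalent to max over k >= k(r)."""
    relevant = set(relevant)
    hits, best = 0, 0.0
    for k, doc in enumerate(retrieved, start=1):
        hits += doc in relevant
        if hits / len(relevant) >= r:      # recall@k >= r
            best = max(best, hits / k)     # precision@k
    return best

relevant = ["d123", "d56", "d9", "d25", "d3"]
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
print(interpolated_precision(list_a, relevant, 0.2))  # 1.0: d123 alone already gives 20% recall
```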

Page 26: Slides

26

Precision vs. Recall: Example

[Plot: interpolated precision vs. recall (0%-100%) for List A and List B from the previous example.]

Page 27: Slides

27

Query Languages: Keyword-Based

Single-word queries
  Ex: Michael Jordan machine learning
Context queries
  Phrases. Ex: "Michael Jordan" "machine learning"
  Proximity. Ex: "Michael Jordan" at distance of at most 10 words from "machine learning"
Boolean queries
  Ex: +"Michael Jordan" -basketball
Natural language queries
  Ex: "Get me pages about Michael Jordan, the machine learning expert."

Page 28: Slides

28

Query Languages: Pattern Matching

Prefixes
  Ex: prefix:comput
Suffixes
  Ex: suffix:net
Regular Expressions
  Ex: [0-9]+th world-wide web conference

Page 29: Slides

29

Text Processing

Lexical analysis & tokenization
  Split text into words, downcase letters, filter out punctuation marks, digits, hyphens
Stopword elimination
  Better retrieval accuracy, more compact index
  Ex: "to be or not to be"
Stemming
  Ex: "computer", "computing", "computation" -> comput
Index term selection
  Keywords vs. full text

Page 30: Slides

30

Inverted Index

d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).

d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).

Vocabulary and postings (term: list of (doc, position) pairs):

author: (d1,4)
berkeley: (d1,13)
date: (d2,9)
famous: (d2,2)
graphical: (d1,6)
jordan: (d1,2), (d2,6)
legend: (d2,4)
like: (d2,7)
michael: (d1,1), (d2,5)
model: (d1,7), (d2,10)
nba: (d2,3)
professor: (d1,10)
uc: (d1,12)
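A small sketch of building such a positional index from raw text. The tokenization here is simplistic (lowercase, letters only) and omits the stopword removal and stemming used on the slide, so the resulting vocabulary differs slightly; it only illustrates the term -> postings structure.

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = re.findall(r"[a-z]+", docs[doc_id].lower())
        for pos, term in enumerate(tokens, start=1):
            index[term].append((doc_id, pos))
    return dict(index)

docs = {
    "d1": 'Michael Jordan, the author of "graphical models", is a professor at U.C. Berkeley.',
    "d2": "The famous NBA legend Michael Jordan liked to date models.",
}
index = build_index(docs)
print(index["jordan"])   # [('d1', 2), ('d2', 6)]
```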

Page 31: Slides

31

Inverted Index Structure

Vocabulary File: one entry per term (term1, term2, …); usually fits in main memory.

Postings File: one postings list per term (postings list 1, postings list 2, …); stored on disk.

[Diagram: each vocabulary entry points to its term's postings list in the postings file.]

Page 32: Slides

32

Searching an Inverted Index

Given: t1, t2: query terms; L1, L2: corresponding posting lists.

Need: a ranked list of the docs in the intersection of L1 and L2.

Solution 1: If L1 and L2 are comparable in size, "merge" L1 and L2 to find the docs in their intersection, and then order them by rank. (Running time: O(|L1| + |L2|).)

Solution 2: If L1 is considerably shorter than L2, binary-search each posting of L1 in L2 to find the intersection, and then order them by rank. (Running time: O(|L1| x log |L2|).)
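Sketches of the two solutions, assuming each posting list is a sorted list of doc ids (ranking of the intersection is omitted).

```python
from bisect import bisect_left

def intersect_merge(l1, l2):
    """Solution 1: linear merge of two sorted posting lists, O(|L1| + |L2|)."""
    i = j = 0
    out = []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i]); i += 1; j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_binary_search(short, long):
    """Solution 2: binary-search each posting of the short list in the long one,
    O(|L1| x log |L2|)."""
    out = []
    for doc in short:
        k = bisect_left(long, doc)
        if k < len(long) and long[k] == doc:
            out.append(doc)
    return out

l1 = [3, 7, 42]                       # short posting list (sorted doc ids)
l2 = [1, 3, 5, 7, 9, 11, 42, 100]     # long posting list
print(intersect_merge(l1, l2), intersect_binary_search(l1, l2))  # [3, 7, 42] twice
```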

Page 33: Slides

33

Search Optimization

Improvement:

Order docs in posting lists by static rank (e.g., PageRank).

Then, can output top matches, without scanning the whole lists.

Page 34: Slides

34

Index Construction

Given a stream of documents, store (did, tid, pos) triplets in a file
Sort and group the file by tid
Extract posting lists
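An in-memory sketch of this triplet-sort construction; a real system would write the triplets to a file and sort it externally on disk rather than in memory.

```python
from itertools import groupby
from operator import itemgetter

def construct_index(doc_stream):
    """doc_stream yields (did, [tokens]). Emit (did, tid, pos) triplets,
    sort by tid, and group them into posting lists."""
    triplets = []
    vocab = {}                                   # term -> tid
    for did, tokens in doc_stream:
        for pos, term in enumerate(tokens, start=1):
            tid = vocab.setdefault(term, len(vocab))
            triplets.append((did, tid, pos))
    # In practice the triplet file is sorted externally; here we sort in memory.
    triplets.sort(key=itemgetter(1, 0, 2))       # by tid, then did, then pos
    postings = {tid: [(did, pos) for did, _, pos in group]
                for tid, group in groupby(triplets, key=itemgetter(1))}
    return vocab, postings

vocab, postings = construct_index([(1, ["michael", "jordan"]),
                                   (2, ["michael", "models"])])
print(postings[vocab["michael"]])   # [(1, 1), (2, 1)]
```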

Page 35: Slides

35

Index Maintenance

Naïve updates of an inverted index can be very costly
  Require random access
  A single change may cause many insertions/deletions

Batch updates: two indices
  Main index (created in batch, large, compressed)
  "Stop-press" index (incremental, small, uncompressed)

Page 36: Slides

36

Index Maintenance

If a page d is inserted/deleted, the “signed” postings (did,tid,pos,I/D) are added to the stop-press index.

Given a query term t, fetch its list Lt from the main index, and two lists Lt,+ and Lt,- from the stop-press index.

Result is: (Lt ∪ Lt,+) \ Lt,-

When the stop-press index grows too large, it is merged into the main index.
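A sketch of combining the main-index postings with the stop-press lists at query time, treating postings as sets of doc ids for simplicity (real postings also carry positions).

```python
def query_with_stop_press(main_list, added, deleted):
    """Result for term t: (L_t ∪ L_{t,+}) \ L_{t,-}: main-index postings,
    plus postings inserted since the last rebuild, minus deleted postings."""
    return sorted((set(main_list) | set(added)) - set(deleted))

L_t       = [3, 7, 42]    # postings for t in the main index
L_t_plus  = [11]          # "I" (insert) signed postings from the stop-press index
L_t_minus = [7]           # "D" (delete) signed postings from the stop-press index
print(query_with_stop_press(L_t, L_t_plus, L_t_minus))   # [3, 11, 42]
```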

Page 37: Slides

37

Index Compression

Delta compression: store the gap to the previous doc id instead of the doc id itself
  Saves a lot for popular terms
  Doesn't save much for rare terms (but these don't take much space anyway)

michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), …
becomes
michael: (1000007,5), (2,12), (4,77), (22,88), …
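A minimal sketch of gap (delta) encoding and decoding over the doc ids alone (positions are left out).

```python
def gap_encode(doc_ids):
    """Store the first doc id and then the gaps between consecutive ids."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Recover the original doc ids by prefix-summing the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [1000007, 1000009, 1000013, 1000035]
print(gap_encode(ids))                      # [1000007, 2, 4, 22] -- as in the example above
print(gap_decode(gap_encode(ids)) == ids)   # True
```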

Page 38: Slides

38

Variable Length Encodings

How to encode gaps succinctly?

Option 1: Fixed-length binary encoding.
  Effective when all gap lengths are equally likely
  No savings over storing doc ids.

Option 2: Unary encoding.
  Gap x is encoded by x-1 1's followed by a 0
  Effective when large gaps are very rare (Pr(x) = 1/2^x)

Option 3: Gamma encoding.
  Gap x is encoded by (lx, bx), where bx is the binary encoding of x and lx is the length of bx, encoded in unary.
  Encoding length: about 2 log(x).
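A sketch of the gamma code exactly as described above: the length of x's binary encoding in unary, followed by the binary encoding itself. (The classic Elias gamma code additionally drops the leading 1 bit of the binary part, since it is always 1; that variant saves one bit per code word.)

```python
def unary(x):
    """Unary code: x-1 ones followed by a zero (Option 2 above)."""
    return "1" * (x - 1) + "0"

def gamma_encode(x):
    """Gamma code: length of bin(x) in unary, then bin(x). About 2*log2(x) bits."""
    beta = bin(x)[2:]                # binary encoding of x
    return unary(len(beta)) + beta

def gamma_decode(bits):
    """Inverse of gamma_encode for a single code word."""
    length = bits.index("0") + 1     # the unary prefix gives the binary length
    return int(bits[length:length + length], 2)

for x in [1, 5, 13, 24]:
    code = gamma_encode(x)
    assert gamma_decode(code) == x
    print(x, code)                   # e.g. 5 -> 110101, 13 -> 11101101
```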

Page 39: Slides

39

End of Lecture 2