Information Retrieval Deepak Kumar

Information Retrieval - cs.brynmawr.edu · Information Retrieval Searching within a document collection for a particular information need. • Traditional vs. web IR • • Query.

Jun 29, 2020


Information Retrieval

Deepak Kumar


Information Retrieval

Searching within a document collection for a particular information need.

• Traditional vs. web IR


Query


Search Engines…

Altavista, Ask, Baidu, Bing, Blekko, ChaCha, Dogpile, Daum, DuckDuckGo, Entireweb, Excite, Faroo, Info.com, Gigablast, Google, Go, Hakia, HotBot, Leapfish, Lycos, Monster Crawler, Naver, Omgili, Dmoz, Scrub The Web, Spezify, Stinky Teddy, Stumpdedia, Teoma, WebCrawler, Yahoo! Search, Yandex


Matching & Ranking

query: muddy waters

[Figure: query -> matching -> matched pages (“hits”) -> ranking -> ranked pages (1., 2., 3.)]


Index


Inverted Index

• A mapping from content (words) to location.

• Example:

1: the cat sat on the mat

2: the dog stood on the mat

3: the cat stood while a dog sat

a      3
cat    1 3
dog    2 3
mat    1 2
on     1 2
sat    1 3
stood  2 3
the    1 2 3
while  3

Every word in every web page is indexed!
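The index above can be built in a few lines. The following is a minimal sketch, assuming pages are plain strings split on whitespace; the names `PAGES` and `build_index` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# The three example pages, keyed by page number.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}


def build_index(pages):
    """Map each word to the sorted list of page numbers that contain it."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in text.split():
            index[word].add(page_no)
    # Sort the words and the page numbers so the result reads like the slide.
    return {word: sorted(nums) for word, nums in sorted(index.items())}


INDEX = build_index(PAGES)

if __name__ == "__main__":
    for word, nums in INDEX.items():
        print(word, *nums)
```

A real search engine would also normalize case and punctuation; that step is left out here because the example pages do not need it.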

Searching

query: cat

hits
1: the cat sat on the mat
3: the cat stood while a dog sat

query: dog

hits
2: the dog stood on the mat
3: the cat stood while a dog sat

query: cat dog

hits
3: the cat stood while a dog sat
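A multi-word query such as cat dog can be answered by intersecting the page lists of the individual words. This is a hedged sketch: the `search` function and the hard-coded `INDEX` dictionary are assumptions for illustration, with the entries copied from the example index.

```python
# Word -> pages, copied from the example inverted index.
INDEX = {
    "a": [3], "cat": [1, 3], "dog": [2, 3], "mat": [1, 2], "on": [1, 2],
    "sat": [1, 3], "stood": [2, 3], "the": [1, 2, 3], "while": [3],
}


def search(index, query):
    """Return the pages that contain every word of the query (an AND query)."""
    words = query.split()
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        # Keep only the pages that also contain this word.
        hits &= set(index.get(word, []))
    return sorted(hits)


if __name__ == "__main__":
    for query in ("cat", "dog", "cat dog"):
        print(query, "->", search(INDEX, query))
```

A word that is missing from the index yields an empty page list, so any query containing it has no hits.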

query: cat the sat ???

Phrase Queries

query: “cat sat”

hits (pages that contain both words)
1: the cat sat on the mat
3: the cat stood while a dog sat

How to tell if two words occur next to each other? EFFICIENTLY???

Inverted Index with Location

Each entry is page-position.

a      3-5
cat    1-2  3-2
dog    2-2  3-6
mat    1-6  2-6
on     1-4  2-4
sat    1-3  3-7
stood  2-3  3-3
the    1-1  1-5  2-1  2-5  3-1
while  3-4
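Recording the location only requires counting words while scanning each page. A minimal sketch, assuming whitespace tokenization and positions that start at 1 as in the example; `build_positional_index` is a hypothetical name.

```python
from collections import defaultdict

# The three example pages, keyed by page number.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}


def build_positional_index(pages):
    """Map each word to a list of (page, position) pairs, positions from 1."""
    index = defaultdict(list)
    for page_no in sorted(pages):
        for position, word in enumerate(pages[page_no].split(), start=1):
            index[word].append((page_no, position))
    return dict(index)


POSITIONS = build_positional_index(PAGES)

if __name__ == "__main__":
    for word in sorted(POSITIONS):
        print(word, " ".join(f"{p}-{i}" for p, i in POSITIONS[word]))
```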

query: “cat sat”

cat: 1-2, 3-2
sat: 1-3, 3-7

Adjacent locations: cat 1-2, sat 1-3

hits
1: the cat sat on the mat
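With locations in the index, a phrase query reduces to checking for consecutive positions on the same page. The sketch below is one possible way to do this, not necessarily how any particular search engine does it; `phrase_search` and the hard-coded `POSITIONS` dictionary are assumptions for illustration.

```python
# Word -> (page, position) pairs, copied from the example index with location.
POSITIONS = {
    "a": [(3, 5)],
    "cat": [(1, 2), (3, 2)],
    "dog": [(2, 2), (3, 6)],
    "mat": [(1, 6), (2, 6)],
    "on": [(1, 4), (2, 4)],
    "sat": [(1, 3), (3, 7)],
    "stood": [(2, 3), (3, 3)],
    "the": [(1, 1), (1, 5), (2, 1), (2, 5), (3, 1)],
    "while": [(3, 4)],
}


def phrase_search(positions, phrase):
    """Return pages where the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return []
    # Sets give constant-time membership tests for the later words.
    later = [set(positions.get(word, [])) for word in words[1:]]
    hits = set()
    for page, pos in positions.get(words[0], []):
        # The k-th following word must sit exactly k positions further on.
        if all((page, pos + k) in locs for k, locs in enumerate(later, start=1)):
            hits.add(page)
    return sorted(hits)


if __name__ == "__main__":
    print(phrase_search(POSITIONS, "cat sat"))
```

For “cat sat”, only page 1 has cat at position 2 followed by sat at position 3; on page 3 the two words are at positions 2 and 7.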

NEAR* Queries

query: cat NEAR dog

cat: 3-2
dog: 3-6

hits
3: the cat stood while a dog sat

*NEAR: distance <= 5

Useful in ranking!
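A NEAR query can reuse the same locations, comparing positions instead of requiring them to be consecutive. A minimal sketch, assuming distance is the absolute difference of word positions on the same page; `near_search` and `max_distance` are illustrative names.

```python
# Word -> (page, position) pairs, copied from the example index with location.
POSITIONS = {
    "cat": [(1, 2), (3, 2)],
    "dog": [(2, 2), (3, 6)],
    "mat": [(1, 6), (2, 6)],
    "sat": [(1, 3), (3, 7)],
}


def near_search(positions, first, second, max_distance=5):
    """Return pages where the two words occur within max_distance positions."""
    hits = set()
    for page_a, pos_a in positions.get(first, []):
        for page_b, pos_b in positions.get(second, []):
            if page_a == page_b and abs(pos_a - pos_b) <= max_distance:
                hits.add(page_a)
    return sorted(hits)


if __name__ == "__main__":
    print(near_search(POSITIONS, "cat", "dog"))
```

On page 3, cat is at position 2 and dog at position 6, a distance of 4, so the page qualifies under the limit of 5.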

Matching & Ranking

query: muddy waters

[Figure: query -> matching -> matched pages (“hits”) -> ranking -> ranked pages (1., 2., 3.)]

Ranking & Relevance

1: By far the most common cause of malaria is being bitten by an infected mosquito, but there are also other ways to contract the disease.

2: Our cause was not helped by the poor health of the troops, many of whom were suffering from malaria and other tropical diseases.

also     1-19
…
cause    1-6   2-2
…
malaria  1-8   2-19
…
whom     2-15

query: malaria cause

Nearness can resolve the ranking!
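Both pages match the query, so matching alone cannot order them. One simple way to use nearness, sketched here as an assumption rather than as the method any real engine uses, is to rank pages by the smallest gap between the two query words; `min_gaps` and `rank_by_nearness` are hypothetical names.

```python
# Locations of the two query words, copied from the example index.
POSITIONS = {
    "cause": [(1, 6), (2, 2)],
    "malaria": [(1, 8), (2, 19)],
}


def min_gaps(positions, first, second):
    """Return {page: smallest distance between the two words on that page}."""
    best = {}
    for page_a, pos_a in positions.get(first, []):
        for page_b, pos_b in positions.get(second, []):
            if page_a == page_b:
                gap = abs(pos_a - pos_b)
                best[page_a] = min(gap, best.get(page_a, gap))
    return best


def rank_by_nearness(positions, first, second):
    """Order matching pages so the page with the closest word pair comes first."""
    gaps = min_gaps(positions, first, second)
    return sorted(gaps, key=lambda page: (gaps[page], page))


if __name__ == "__main__":
    print(min_gaps(POSITIONS, "malaria", "cause"))
    print(rank_by_nearness(POSITIONS, "malaria", "cause"))
```

Page 1 has the words 2 positions apart and page 2 has them 17 apart, so page 1 ranks first.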

Using Metadata

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>CS380: Science of Information (Course Page)</title></head>
<body><P><CENTER><h3>Bryn Mawr College<BR CLEAR="ALL">
<B><FONT SIZE="+2">CS 380: Recent Advances in Computer Science<br>Topic: Science of Information</FONT></B><BR CLEAR="ALL">
<B><FONT SIZE="+2">Fall 2012</FONT></B><br>BMC Class Number: 1214<BR CLEAR="ALL">
<B><FONT SIZE="+2">Course Materials</FONT></B></h3></CENTER>…

Metadata

1: my cat
   the cat sat on the mat

2: my dog
   the dog stood on the mat

3: my pets
   the cat stood while a dog sat

1: <title>my cat </title> <body>the cat sat on the mat </body>

2: <title>my dog </title><body>the dog stood on the mat</body>

3: <title>my pets </title><body>the cat stood while a dog sat</body>

a         3-10
cat       1-3   1-7   3-7
dog       2-3   2-7   3-11
mat       1-11  2-11
my        1-2   2-2   3-2
on        1-9   2-9
pets      3-3
sat       1-8   3-12
stood     2-8   3-8
the       1-6   1-10  2-6   2-10  3-6
while     3-9
<body>    1-5   2-5   3-5
</body>   1-12  2-12  3-13
<title>   1-1   2-1   3-1
</title>  1-4   2-4   3-4
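The index above treats each tag as one more word with its own position. A minimal sketch of that idea, assuming a small regular expression is enough to split the toy pages into tags and words (real HTML needs a proper parser); `TOKEN` and `build_positional_index` are illustrative names.

```python
import re
from collections import defaultdict

# The three example pages with their markup, keyed by page number.
PAGES = {
    1: "<title>my cat </title> <body>the cat sat on the mat </body>",
    2: "<title>my dog </title><body>the dog stood on the mat</body>",
    3: "<title>my pets </title><body>the cat stood while a dog sat</body>",
}

# A token is either a whole tag such as <title> or </body>, or a plain word.
TOKEN = re.compile(r"</?\w+>|[^\s<>]+")


def build_positional_index(pages):
    """Map each token (word or tag) to (page, position) pairs, positions from 1."""
    index = defaultdict(list)
    for page_no in sorted(pages):
        for position, token in enumerate(TOKEN.findall(pages[page_no]), start=1):
            index[token].append((page_no, position))
    return dict(index)


POSITIONS = build_positional_index(PAGES)

if __name__ == "__main__":
    for token in sorted(POSITIONS):
        print(token, " ".join(f"{p}-{i}" for p, i in POSITIONS[token]))
```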

Structure Queries

query: intitle: dog

hits
2: <title>my dog </title><body>the dog stood on the mat</body>
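Because the tags are indexed with positions, an intitle: query only has to check whether a word falls between <title> and </title> on the same page. A hedged sketch, assuming each page has exactly one title; `intitle_search` is a hypothetical name, and the entries are copied from the example index.

```python
# Token -> (page, position) pairs, copied from the example index with tags.
POSITIONS = {
    "cat": [(1, 3), (1, 7), (3, 7)],
    "dog": [(2, 3), (2, 7), (3, 11)],
    "pets": [(3, 3)],
    "<title>": [(1, 1), (2, 1), (3, 1)],
    "</title>": [(1, 4), (2, 4), (3, 4)],
}


def intitle_search(positions, word):
    """Return pages where the word occurs between <title> and </title>."""
    # Assumption: one title per page, so page -> position is a plain dict.
    opens = dict(positions.get("<title>", []))
    closes = dict(positions.get("</title>", []))
    hits = set()
    for page, pos in positions.get(word, []):
        if page in opens and page in closes and opens[page] < pos < closes[page]:
            hits.add(page)
    return sorted(hits)


if __name__ == "__main__":
    print(intitle_search(POSITIONS, "dog"))
```

The word dog occurs on pages 2 and 3, but only on page 2 does a position (3) lie between the title tags at positions 1 and 4.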

Web Information Retrieval

• Search Engines
• Queries
    phrase queries
    structure queries (NEAR, intitle:, …)

• Matching
• Inverted Index
    page number
    location

• Ranking & Relevance
• Metadata

Efficient matching is only one half of the story.

The other grand challenge is how to rank the matching pages.


References

• Google’s PageRank and Beyond, Amy N. Langville and Carl D. Meyer, Princeton University Press, 2006.

• Nine Algorithms That Changed The Future, John MacCormick, Princeton University Press, 2012.

• Learning Computing with Robots, Deepak Kumar, IPRE 2011.