Transcript
Page 1: Information Retrieval

Information Retrieval

CSE 8337 (Part II)

Spring 2011

Some material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/

Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
http://informationretrieval.org

Page 2: Information Retrieval

CSE 8337 Spring 2011 2

CSE 8337 Outline
• Introduction
• Text Processing
• Indexes
• Boolean Queries
• Web Searching/Crawling
• Vector Space Model
• Matching
• Evaluation
• Feedback/Expansion

Page 3: Information Retrieval

CSE 8337 Spring 2011 3

Web Searching TOC
• Web Overview
• Searching
• Ranking
• Crawling

Page 4: Information Retrieval

CSE 8337 Spring 2011 4

Web Overview
• Size
  • >11.5 billion pages (2005)
  • Grows at more than 1 million pages a day
  • Google indexes over 3 billion documents
• Diverse types of data
• http://www.worldwidewebsize.com/

Page 5: Information Retrieval

CSE 8337 Spring 2011 5

Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  • Profiles
  • Registration information
  • Cookies

Page 6: Information Retrieval

CSE 8337 Spring 2011 6

Zipf’s Law Applied to Web
• Distribution of frequency of occurrence of words in text.
• “Frequency of the i-th most frequent word is 1/i^q times that of the most frequent word.”
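A tiny sketch of what the law predicts: given the count of the most frequent word, the i-th most frequent word should occur roughly 1/i^q as often. The toy sentence and the choice q = 1 are assumptions for illustration only.

from collections import Counter

def zipf_table(text, q=1.0, top=5):
    """Compare observed counts of the top words with the Zipf prediction f(1)/i^q."""
    counts = [c for _, c in Counter(text.lower().split()).most_common(top)]
    f1 = counts[0]
    for i, observed in enumerate(counts, start=1):
        print(f"rank {i}: observed {observed}, Zipf predicts {f1 / i ** q:.1f}")

zipf_table("to be or not to be that is the question to be is to do")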

Page 7: Information Retrieval

CSE 8337 Spring 2011 7

Heaps’ Law Applied to Web
• Measures size of vocabulary in a text of size n: O(n^b)
• b normally less than 1
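Under Heaps’ law V(n) = K · n^b, two (text size, vocabulary size) measurements are enough to estimate the exponent b. The sample figures below are made up for illustration.

import math

def estimate_heaps_b(n1, v1, n2, v2):
    """Estimate the Heaps' law exponent b from two (text size, vocabulary size) samples,
    assuming V(n) = K * n^b."""
    return math.log(v2 / v1) / math.log(n2 / n1)

# Hypothetical measurements: 10k tokens -> 3,000 distinct terms, 1M tokens -> 38,000 distinct terms.
print(estimate_heaps_b(10_000, 3_000, 1_000_000, 38_000))  # roughly 0.55, i.e., b < 1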

Page 8: Information Retrieval

CSE 8337 Spring 2011 8

Web search basics

[Diagram: the user issues a search that is answered from the indexes (plus separate ad indexes); the indexer builds those indexes from pages collected by a web spider crawling the Web. Inset: an example Google results page for the query “miele” (Results 1 - 10 of about 7,310,000, 0.12 seconds) showing organic results for miele.com, miele.co.uk, miele.de, and miele.at, plus Sponsored Links from appliance and vacuum retailers.]

Page 9: Information Retrieval

CSE 8337 Spring 2011 9

How far do people look for results?

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Page 10: Information Retrieval

CSE 8337 Spring 2011 10

Users’ empirical evaluation of results
• Quality of pages varies widely
  • Relevance is not enough
  • Other desirable qualities (non IR!!)
    • Content: trustworthy, diverse, non-duplicated, well maintained
    • Web readability: display correctly & fast
    • No annoyances: pop-ups, etc.
• Precision vs. recall
  • On the web, recall seldom matters
  • What matters: Precision at 1? Precision above the fold?
  • Comprehensiveness – must be able to deal with obscure queries
  • Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate

Page 11: Information Retrieval

CSE 8337 Spring 2011 11

Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – simple, no clutter, error tolerant
• Trust – results are objective
• Coverage of topics for polysemic queries
• Pre/post-process tools provided
  • Mitigate user errors (auto spell check, search assist, …)
  • Explicit: search within results, more like this, refine, …
  • Anticipative: related searches
• Deal with idiosyncrasies
  • Web-specific vocabulary: impact on stemming, spell-check, etc.
  • Web addresses typed in the search box
  • …

Page 12: Information Retrieval

CSE 8337 Spring 2011 12

Simplest forms
• First generation engines relied heavily on tf/idf
  • The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
• SEOs (Search Engine Optimizers) responded with dense repetitions of chosen terms
  • e.g., maui resort maui resort maui resort
  • Often, the repetitions would be in the same color as the background of the web page
    • Repeated terms got indexed by crawlers
    • But not visible to humans on browsers
• Pure word density cannot be trusted as an IR signal

Page 13: Information Retrieval

CSE 8337 Spring 2011 13

Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• Raw term frequency is not what we want:
  • A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
  • But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.

Page 14: Information Retrieval

CSE 8337 Spring 2011 14

Log-frequency weighting
• The log frequency weight of term t in d is

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

  score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

• The score is 0 if none of the query terms is present in the document.
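A minimal sketch of the log-frequency score above. The whitespace tokenizer and the example strings are assumptions, not part of the slides.

import math
from collections import Counter

def log_tf_score(query, doc):
    """score(q, d) = sum over t in q ∩ d of (1 + log10 tf_{t,d})."""
    tf = Counter(doc.lower().split())
    score = 0.0
    for t in set(query.lower().split()):
        if tf[t] > 0:
            score += 1 + math.log10(tf[t])
    return score

print(log_tf_score("maui resort", "maui resort maui resort maui resort"))  # about 2.95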

Page 15: Information Retrieval

CSE 8337 Spring 2011 15

Document frequency
• Rare terms are more informative than frequent terms
  • Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.

Page 16: Information Retrieval

CSE 8337 Spring 2011 16

Document frequency, continued

Consider a query term that is frequent in the collection (e.g., high, increase, line)

For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.

We will use document frequency (df) to capture this in the score.

df (≤ N) is the number of documents that contain the term

Page 17: Information Retrieval

CSE 8337 Spring 2011 17

idf weight
• df_t is the document frequency of t: the number of documents that contain t
  • df_t is a measure of the informativeness of t
• We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

• We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
• It will turn out the base of the log is immaterial.

Page 18: Information Retrieval

CSE 8337 Spring 2011 18

idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
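A small check of the idf formula against the table above (N = 1 million); the code simply reproduces the idf_t column.

import math

def idf(N, df):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(N, df))   # reproduces 6, 4, 3, 2, 1, 0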

Page 19: Information Retrieval

CSE 8337 Spring 2011 19

Collection vs. Document frequency
• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
• Example:

Word        Collection frequency    Document frequency
insurance   10440                   3997
try         10422                   8760

• Which word is a better search term (and should get a higher weight)?

Page 20: Information Retrieval

CSE 8337 Spring 2011 20

tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

• Best known weighting scheme in information retrieval
  • Note: the “-” in tf-idf is a hyphen, not a minus sign!
  • Alternative names: tf.idf, tf x idf, tfidf, tf/idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
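A minimal sketch combining the tf and idf weights over a tiny in-memory collection. The three toy documents and the whitespace tokenizer are assumptions made for illustration.

import math
from collections import Counter

def tf_idf_weights(docs):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t) for every term of every document."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf})
    return weights

docs = ["maui resort maui", "resort insurance", "insurance try try"]
print(tf_idf_weights(docs)[0])   # weights for the first document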

Page 21: Information Retrieval

CSE 8337 Spring 2011 21

Search engine optimization (Spam)
• Motives
  • Commercial, political, religious, lobbies
  • Promotion funded by advertising budget
• Operators
  • Search Engine Optimizers for lobbies, companies
  • Web masters
  • Hosting services
• Forums
  • E.g., Web master world (www.webmasterworld.com)
    • Search engine specific tricks
    • Discussions about academic papers

Page 22: Information Retrieval

CSE 8337 Spring 2011 22

Cloaking
• Serve fake content to the search engine spider
• DNS cloaking: switch IP address. Impersonate.
• How do you identify a spider?

[Diagram: the server asks “Is this a search engine spider?” – if yes, it serves the SPAM page; if no, it serves the real document. This is cloaking.]

Page 23: Information Retrieval

CSE 8337 Spring 2011 23

More spam techniques
• Doorway pages
  • Pages optimized for a single keyword that re-direct to the real target page
• Link spamming
  • Mutual admiration societies, hidden links, awards – more on these later
  • Domain flooding: numerous domains that point or re-direct to a target page
• Robots
  • Fake query stream – rank checking programs

Page 24: Information Retrieval

CSE 8337 Spring 2011 24

The war against spam
• Quality signals – prefer authoritative pages based on:
  • Votes from authors (linkage signals)
  • Votes from users (usage signals)
• Policing of URL submissions
  • Anti-robot test
• Limits on meta-keywords
• Robust link analysis
  • Ignore statistically implausible linkage (or text)
  • Use link analysis to detect spammers (guilt by association)
• Spam recognition by machine learning
  • Training set based on known spam
• Family friendly filters
  • Linguistic analysis, general classification techniques, etc.
  • For images: flesh tone detectors, source text analysis, etc.
• Editorial intervention
  • Blacklists
  • Top queries audited
  • Complaints addressed
  • Suspect pattern detection

Page 25: Information Retrieval

CSE 8337 Spring 2011 25

More on spam
• Web search engines have policies on SEO practices they tolerate/block
  • http://help.yahoo.com/l/us/yahoo/search/basics/basics-18.html
  • http://www.google.com/intl/en/webmasters/
• Adversarial IR: the unending (technical) battle between SEOs and web search engines
• Research: http://airweb.cse.lehigh.edu

Page 26: Information Retrieval

CSE 8337 Spring 2011 26

Ranking
• Order documents based on relevance to query (similarity measure)
• Ranking has to be performed without accessing the text, just the index
• About ranking algorithms, all information is “top secret”; it is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries

Page 27: Information Retrieval

CSE 8337 Spring 2011 27

Ranking
• Some of the new ranking algorithms also use hyperlink information
• Important difference between the Web and normal IR databases: the number of hyperlinks that point to a page provides a measure of its popularity and quality.
• Links in common between pages often indicate a relationship between those pages.

Page 28: Information Retrieval

CSE 8337 Spring 2011 28

Ranking
• Three examples of ranking techniques based on link analysis:
  • WebQuery
  • HITS (Hub/Authority pages)
  • PageRank

Page 29: Information Retrieval

CSE 8337 Spring 2011 29

WebQuery

WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is

http://www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html

Page 30: Information Retrieval

CSE 8337 Spring 2011 30

HITS
• Kleinberg’s ranking scheme depends on the query and considers the set of pages S that point to, or are pointed to by, pages in the answer
• Pages that have many links pointing to them in S are called authorities
• Pages that have many outgoing links are called hubs
• Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities

Page 31: Information Retrieval

CSE 8337 Spring 2011 31

Ranking

  H(p) = Σ_{u ∈ S, p → u} A(u)

  A(p) = Σ_{v ∈ S, v → p} H(v)
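A minimal sketch of the hub/authority iteration these equations describe, assuming the base set S is given as an adjacency dict mapping each page to the pages it links to. The node names are hypothetical; real HITS implementations normalize differently and test for convergence.

def hits(links, iterations=20):
    """H(p) = sum of A(u) over pages u that p points to;
    A(p) = sum of H(v) over pages v that point to p."""
    pages = set(links) | {u for vs in links.values() for u in vs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, ())) for p in pages}
        hub = {p: sum(auth[u] for u in links.get(p, ())) for p in pages}
        # normalize so the scores do not blow up
        a_norm = sum(auth.values()) or 1.0
        h_norm = sum(hub.values()) or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, auth

hub, auth = hits({"a": ["b", "c"], "b": ["c"], "d": ["c"]})
print(auth)  # "c" should come out as the strongest authority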

Page 32: Information Retrieval

CSE 8337 Spring 2011 32

PageRank
• Used in Google
• PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q
• This process can be modeled with a Markov chain, from which the stationary probability of being on each page can be computed
• Let C(a) be the number of outgoing links of page a and suppose that page a is pointed to by pages p1 to pn

Page 33: Information Retrieval

CSE 8337 Spring 2011 33

PageRank (cont’d)

  PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)

• PR(i): PageRank of a page i which points to target page p.
• Ni: number of links coming out of page i.
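A minimal power-iteration sketch of the random-surfer model described above (jump with probability q, follow a link with probability 1 - q). The three-page graph and q = 0.15 are assumptions, and the damped form used below plays the role of the slide's normalization constant c.

def pagerank(links, q=0.15, iterations=50):
    """PR(p) = q/N + (1 - q) * sum of PR(i)/C(i) over pages i that link to p."""
    pages = set(links) | {u for vs in links.values() for u in vs}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[i] / len(links[i]) for i in pages if p in links.get(i, ()))
            new[p] = q / N + (1 - q) * incoming
        pr = new
    return pr

print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))  # a symmetric cycle gives equal ranks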

Page 34: Information Retrieval

CSE 8337 Spring 2011 34

Conclusion
• Nowadays search engines use, basically, Boolean or Vector models and their variations
• Link analysis techniques seem to be the “next generation” of search engines
• Indexes: compression and distributed architecture are keys

Page 35: Information Retrieval

CSE 8337 Spring 2011 35

Crawlers
• Robot (spider) traverses the hypertext structure in the Web.
  • Collects information from visited pages
  • Used to construct indexes for search engines
• Traditional Crawler – visits entire Web (?) and replaces index
• Periodic Crawler – visits portions of the Web and updates subset of index
• Incremental Crawler – selectively searches the Web and incrementally modifies index
• Focused Crawler – visits pages related to a particular subject

Page 36: Information Retrieval

CSE 8337 Spring 2011 36

Crawling the Web
• The order in which the URLs are traversed is important
  • Using a breadth-first policy, we first look at all the pages linked by the current page, and so on. This matches well Web sites that are structured by related topics. On the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests
  • In the depth-first case, we follow the first link of a page and do the same on that page until we cannot go deeper, returning recursively
• Good ordering schemes can make a difference if crawling better pages first (PageRank)

Page 37: Information Retrieval

CSE 8337 Spring 2011 37

Crawling the Web
• Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed
• Crawlers can also have problems with HTML pages that use frames or image maps. In addition, dynamically generated pages cannot be indexed, and neither can password-protected pages

Page 38: Information Retrieval

CSE 8337 Spring 2011 38

Focused Crawler
• Only visit links from a page if that page is determined to be relevant.
• Components:
  • Classifier, which assigns a relevance score to each page based on the crawl topic.
  • Distiller, to identify hub pages.
  • Crawler, which visits pages based on classifier and distiller scores.
• Classifier also determines how useful outgoing links are
• Hub pages contain links to many relevant pages. Must be visited even if they do not have a high relevance score.

Page 39: Information Retrieval

CSE 8337 Spring 2011 39

Focused Crawler

Page 40: Information Retrieval

CSE 8337 Spring 2011 40

Basic crawler operation
• Begin with known “seed” pages
• Fetch and parse them
  • Extract URLs they point to
  • Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat (sketched below)
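A minimal, single-threaded sketch of this loop using only the Python standard library. The seed URLs, page limit, and fixed politeness delay are assumptions; a real crawler adds robots.txt checks, content-duplicate elimination, URL filters, and distribution, all of which are covered in the following slides.

import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seeds, max_pages=10, delay=1.0):
    frontier = deque(seeds)   # queue of URLs still to fetch
    seen = set(seeds)         # URLs already queued (duplicate URL elimination)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue          # skip pages that fail to fetch or parse
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # normalize relative URLs
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)     # crude politeness: pause between requests
    return seen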

Page 41: Information Retrieval

CSE 8337 Spring 2011 41

Crawling picture

Web

URLs crawledand parsed

URLs frontier

Unseen Web

Seedpages

Page 42: Information Retrieval

CSE 8337 Spring 2011 42

Simple picture – complications
• Web crawling isn’t feasible with one machine
  • All of the above steps must be distributed
• Even non-malicious pages pose challenges
  • Latency/bandwidth to remote servers vary
  • Webmasters’ stipulations: how “deep” should you crawl a site’s URL hierarchy?
  • Site mirrors and duplicate pages
• Malicious pages
  • Spam pages
  • Spider traps
• Politeness – don’t hit a server too often

Page 43: Information Retrieval

CSE 8337 Spring 2011 43

What any crawler must do
• Be Polite: respect implicit and explicit politeness considerations
  • Only crawl allowed pages
  • Respect robots.txt (more on this shortly)
• Be Robust: be immune to spider traps and other malicious behavior from web servers

Page 44: Information Retrieval

CSE 8337 Spring 2011 44

What any crawler should do
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources

Page 45: Information Retrieval

CSE 8337 Spring 2011 45

What any crawler should do
• Fetch pages of “higher quality” first
• Continuous operation: continue fetching fresh copies of a previously fetched page
• Extensible: adapt to new data formats, protocols

Page 46: Information Retrieval

CSE 8337 Spring 2011 46

Updated crawling picture

[Diagram: as before, seed pages and URLs crawled and parsed feed the URL frontier at the edge of the unseen Web, but now multiple crawling threads pull from the frontier.]

Page 47: Information Retrieval

CSE 8337 Spring 2011 47

URL frontier
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must try to keep all crawling threads busy

Page 48: Information Retrieval

CSE 8337 Spring 2011 48

Explicit and implicit politeness
• Explicit politeness: specifications from webmasters on what portions of a site can be crawled
  • robots.txt
• Implicit politeness: even with no specification, avoid hitting any site too often

Page 49: Information Retrieval

CSE 8337 Spring 2011 49

Robots.txt
• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
  • www.robotstxt.org/wc/norobots.html
• Website announces its request on what can(not) be crawled
  • For a URL, create a file URL/robots.txt
  • This file specifies access restrictions

Page 50: Information Retrieval

CSE 8337 Spring 2011 50

Robots.txt example
• No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine”:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
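A small sketch of how a crawler might check these rules with Python’s standard urllib.robotparser; the example.com URLs and the "somebot" user agent are hypothetical.

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("somebot", "http://example.com/yoursite/temp/page.html"))       # False
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))  # True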

Page 51: Information Retrieval

CSE 8337 Spring 2011 51

Processing steps in crawling
• Pick a URL from the frontier (which one?)
• Fetch the document at the URL
• Parse the URL
  • Extract links from it to other docs (URLs)
• Check if the URL has content already seen
  • If not, add to indexes
• For each extracted URL
  • Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  • Check if it is already in the frontier (duplicate URL elimination)

Page 52: Information Retrieval

CSE 8337 Spring 2011 52

Basic crawl architecture

[Diagram: the URL frontier feeds a Fetch module (with DNS resolution) that retrieves pages from the WWW; fetched pages are Parsed, checked by a “Content seen?” test against document fingerprints (Doc FPs), passed through a URL filter and robots filters, then through duplicate URL elimination against the URL set; surviving URLs go back into the URL frontier.]

Page 53: Information Retrieval

CSE 8337 Spring 2011 53

Parsing: URL normalization
• When a fetched document is parsed, some of the extracted links are relative URLs
• E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• During parsing, must normalize (expand) such relative URLs
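The expansion step can be done with the standard library’s urljoin, as in this small example using the slide’s URLs.

from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
relative = "/wiki/Wikipedia:General_disclaimer"

print(urljoin(base, relative))
# http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer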

Page 54: Information Retrieval

CSE 8337 Spring 2011 54

Content seen?
• Duplication is widespread on the web
• If the page just fetched is already in the index, do not further process it
• This is verified using document fingerprints or shingles
  • http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
  • http://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/duplicateDocs_corrected.pdf
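A minimal sketch of the shingle idea: hash every k-word window of a page and compare the resulting sets, treating a high overlap as “content already seen.” The window size k, the example strings, and the use of Python’s built-in hash are assumptions; production systems use more careful fingerprinting.

def shingles(text, k=4):
    """Set of hashes of all k-word windows ("shingles") in the text."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + k])) for i in range(max(1, len(words) - k + 1))}

def resemblance(a, b, k=4):
    """Jaccard overlap of the two shingle sets: near 1.0 means near-duplicate pages."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

print(resemblance("a rose is a rose is a rose", "a rose is a rose is a flower"))  # 0.75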

Page 55: Information Retrieval

CSE 8337 Spring 2011 55

Filters and robots.txt
• Filters – regular expressions for URLs to be crawled or not
• Once a robots.txt file is fetched from a site, need not fetch it repeatedly
  • Doing so burns bandwidth, hits the web server
  • Cache robots.txt files

Page 56: Information Retrieval

CSE 8337 Spring 2011 56

Distributing the crawler
• Run multiple crawl threads, under different processes – potentially at different nodes
  • Geographically distributed nodes
• Partition hosts being crawled into nodes
  • Hash used for partition
• How do these nodes communicate?

Page 57: Information Retrieval

CSE 8337 Spring 2011 57

URL frontier: two main considerations
• Politeness: do not hit a web server too frequently
• Freshness: crawl some pages more often than others
  • E.g., pages (such as news sites) whose content changes often
• These goals may conflict with each other. (E.g., a simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)

Page 58: Information Retrieval

CSE 8337 Spring 2011 58

Politeness – challenges
• Even if we restrict only one thread to fetch from a host, it can hit that host repeatedly
• Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host

Page 59: Information Retrieval

CSE 8337 Spring 2011 59

URL frontier: Mercator scheme

[Diagram: incoming URLs pass through a Prioritizer into K front queues; a biased front queue selector / back queue router moves URLs into B back queues (a single host per back queue); a back queue selector hands URLs to the crawl thread requesting a URL.]

Page 60: Information Retrieval

CSE 8337 Spring 2011 60

Mercator URL frontier
• URLs flow in from the top into the frontier
• Front queues manage prioritization
• Back queues enforce politeness
• Each queue is FIFO
• http://users.cis.fiu.edu/~lusec001/presentations/mercator_join.pdf

Page 61: Information Retrieval

CSE 8337 Spring 2011 61

Front queues

[Diagram: the Prioritizer feeds front queues 1 … K, which are drained by the biased front queue selector / back queue router.]

Page 62: Information Retrieval

CSE 8337 Spring 2011 62

Front queues
• Prioritizer assigns to each URL an integer priority between 1 and K
  • Appends the URL to the corresponding queue
• Heuristics for assigning priority
  • Refresh rate sampled from previous crawls
  • Application-specific (e.g., “crawl news sites more often”)

Page 63: Information Retrieval

CSE 8337 Spring 2011 63

Biased front queue selector
• When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
• This choice can be round robin biased to queues of higher priority, or some more sophisticated variant
  • Can be randomized

Page 64: Information Retrieval

CSE 8337 Spring 2011 64

Back queues

[Diagram: the biased front queue selector / back queue router feeds back queues 1 … B, which are drained by the back queue selector.]

Page 65: Information Retrieval

CSE 8337 Spring 2011 65

Back queue invariants
• Each back queue is kept non-empty while the crawl is in progress
• Each back queue only contains URLs from a single host
• Maintain a table from hosts to back queues:

Host name    Back queue
…            3
…            1
…            B

Page 66: Information Retrieval

CSE 8337 Spring 2011 66

Back queue heap
• One entry for each back queue
• The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again
• This earliest time is determined from
  • Last access to that host
  • Any time buffer heuristic we choose

Page 67: Information Retrieval

CSE 8337 Spring 2011 67

Back queue processing
• A crawler thread seeking a URL to crawl:
  • Extracts the root of the heap
  • Fetches the URL at the head of the corresponding back queue q (looked up from the table)
  • Checks if queue q is now empty – if so, pulls a URL v from the front queues
    • If there’s already a back queue for v’s host, append v to that queue and pull another URL from the front queues; repeat
    • Else add v to q
  • When q is non-empty, create a heap entry for it
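A simplified sketch of this back-queue logic, assuming the front queues have already been merged into a single prioritized iterator of URLs. The class and method names are hypothetical, the politeness gap is fixed rather than adaptive, and an empty back queue is deleted and replaced rather than reused, which departs slightly from the exact Mercator bookkeeping.

import heapq
import time
from collections import deque
from urllib.parse import urlparse

class BackQueues:
    """Politeness layer: one queue per host plus a heap of (next allowed time, host)."""
    def __init__(self, front_urls, gap=2.0):
        self.front = iter(front_urls)   # prioritized URLs coming from the front queues
        self.queues = {}                # host -> deque of URLs (single host per back queue)
        self.heap = []                  # heap entries: (earliest next-hit time, host)
        self.gap = gap                  # seconds to wait between hits to the same host

    def _host(self, url):
        return urlparse(url).netloc

    def _refill(self):
        """Pull from the front queues until we add a URL for a brand-new host."""
        for url in self.front:
            host = self._host(url)
            if host in self.queues:
                self.queues[host].append(url)   # existing back queue: keep pulling
            else:
                self.queues[host] = deque([url])
                heapq.heappush(self.heap, (time.time(), host))
                return

    def next_url(self):
        if not self.heap:
            self._refill()
        if not self.heap:
            return None                          # front queues exhausted
        ready_at, host = heapq.heappop(self.heap)   # root of the heap
        time.sleep(max(0.0, ready_at - time.time()))
        url = self.queues[host].popleft()
        if self.queues[host]:
            heapq.heappush(self.heap, (time.time() + self.gap, host))
        else:
            del self.queues[host]
            self._refill()                       # assign a new host to keep queues non-empty
        return url

Here next_url blocks until the host’s politeness gap has elapsed; Mercator instead keeps roughly three times as many back queues as crawler threads so that some queue is usually ready, as the next slide notes.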

Page 68: Information Retrieval

CSE 8337 Spring 2011 68

Number of back queues B
• Keep all threads busy while respecting politeness
• Mercator recommendation: three times as many back queues as crawler threads