Information Retrieval CSE 8337 (Part II) Spring 2011 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze http://informationretrieval.org
Transcript
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data: profiles, registration information, cookies
CSE 8337 Spring 2011 6
Zipf’s Law Applied to Web
Distribution of frequency of occurrence of words in text: the frequency of the i-th most frequent word is 1/i^q times that of the most frequent word.
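A minimal sketch of what this predicts, assuming the classic exponent q = 1 (the top count of 1,000,000 is an illustrative number, not data from the slides):

```python
# Zipf's law sketch: the i-th most frequent word occurs about
# 1/i**q times as often as the most frequent word.
def zipf_frequency(f1, i, q=1.0):
    """Predicted count of the i-th most frequent word, given the top count f1."""
    return f1 / (i ** q)

# If the most frequent word occurs 1,000,000 times, ranks 1..5 get roughly:
predicted = [round(zipf_frequency(1_000_000, i)) for i in range(1, 6)]
print(predicted)  # [1000000, 500000, 333333, 250000, 200000]
```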
Heaps’ Law Applied to Web
Measures the size of the vocabulary in a text of size n: O(n^b), where b is normally less than 1.
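A quick sketch of the sublinear growth this implies; the constants K and b below are illustrative assumptions, not values from the slides:

```python
# Heaps' law sketch: vocabulary size V(n) = K * n**b with b < 1,
# so doubling the text size far less than doubles the vocabulary.
def heaps_vocabulary(n, K=10.0, b=0.5):
    """Estimated vocabulary size for a text of n tokens (K, b illustrative)."""
    return K * n ** b

v1 = heaps_vocabulary(1_000_000)   # ~10,000 distinct terms
v2 = heaps_vocabulary(2_000_000)   # ~14,142 -- not 20,000
print(round(v1), round(v2))
```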
Web search basics
[Diagram: the Web, a search engine, and ad indexes. Example results page for the query miele: "Results 1 - 10 of about 7,310,000 for miele (0.12 seconds)", showing organic results (www.miele.com, www.miele.co.uk, www.miele.de, www.miele.at) alongside Sponsored Links from appliance and vacuum retailers.]
Users’ empirical evaluation of results
Quality of pages varies widely; relevance is not enough.
Other desirable qualities (non-IR!):
Content: trustworthy, diverse, non-duplicated, well maintained
Web readability: displays correctly and fast
No annoyances: pop-ups, etc.
Precision vs. recall: on the web, recall seldom matters.
What matters: precision at 1? Precision above the fold? Comprehensiveness (must be able to deal with obscure queries).
Recall matters when the number of matches is very small.
User perceptions may be unscientific, but are significant over a large aggregate.
Users’ empirical evaluation of engines
Relevance and validity of results
UI: simple, no clutter, error tolerant
Trust: results are objective
Coverage of topics for polysemic queries
Pre/post-processing tools provided:
Mitigate user errors (automatic spell check, search assist, …)
Explicit: search within results, more like this, refine, ...
Anticipative: related searches
Deal with idiosyncrasies:
Web-specific vocabulary and its impact on stemming, spell-check, etc.
Web addresses typed in the search box …
Simplest forms
First-generation engines relied heavily on tf/idf: the top-ranked pages for the query maui resort were the ones containing the most occurrences of maui and resort.
SEOs (Search Engine Optimizers) responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort.
Often the repetitions would be in the same color as the background of the web page: the repeated terms got indexed by crawlers but were not visible to humans in browsers.
Pure word density cannot be trusted as an IR signal.
Term frequency tf
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
Raw term frequency is not what we want: a document with 10 occurrences of the term is more relevant than a document with one occurrence of the term, but not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
Log-frequency weighting
The log-frequency weight of term t in d is
w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d:
score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})
The score is 0 if none of the query terms is present in the document.
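The weighting and scoring above can be sketched directly (a minimal illustration; whitespace tokenization is an assumed simplification):

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_terms):
    """Sum log-frequency weights over terms appearing in both query and document."""
    tf = Counter(doc_terms)
    return sum(log_tf_weight(tf[t]) for t in set(query_terms) if tf[t] > 0)

doc = "maui resort maui resort maui".split()
print(score(["maui", "resort"], doc))  # (1 + log10 3) + (1 + log10 2) ~ 2.778
```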
Document frequency
Rare terms are more informative than frequent terms (recall stop words).
Consider a term in the query that is rare in the collection (e.g., arachnocentric): a document containing this term is very likely to be relevant to the query arachnocentric.
→ We want a high weight for rare terms like arachnocentric.
Document frequency, continued
Consider a query term that is frequent in the collection (e.g., high, increase, line).
For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
We will use document frequency (df) to capture this in the score.
df_t (≤ N) is the number of documents that contain the term.
idf weight
df_t is the document frequency of t: the number of documents that contain t.
df_t is an inverse measure of the informativeness of t.
We define the idf (inverse document frequency) of t by
idf_t = log10(N / df_t)
We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
It will turn out that the base of the log is immaterial.
idf example, suppose N = 1 million

term         df_t        idf_t
calpurnia    1           6
animal       100         4
sunday       1,000       3
fly          10,000      2
under        100,000     1
the          1,000,000   0

There is one idf value for each term t in a collection.
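These values follow directly from idf_t = log10(N/df_t) with N = 1,000,000; a quick check:

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {idf(N, df):g}")  # 6, 4, 3, 2, 1, 0
```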
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example:

Word        Collection frequency    Document frequency
insurance   10440                   3997
try         10422                   8760

Which word is a better search term (and should get a higher weight)?
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
Best-known weighting scheme in information retrieval.
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf, tfidf, tf/idf.
The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.
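Combining the tf and idf pieces as above (the counts in the example call are hypothetical, chosen so the arithmetic is easy to follow):

```python
import math

def tf_idf(tf, df, N):
    """w = (1 + log10(tf)) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in a document, appearing in 1,000 of 1,000,000 docs:
print(tf_idf(tf=10, df=1_000, N=1_000_000))  # (1 + 1) * 3 = 6.0
```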
Search engine optimization (Spam)
Motives: commercial, political, religious, lobbies; promotion funded by an advertising budget.
Operators: Search Engine Optimizers for lobbies and companies, web masters, hosting services.
Forums, e.g., Webmaster World (www.webmasterworld.com): search-engine-specific tricks, discussions about academic papers.
Given a query, consider the set of pages S that point to or are pointed to by pages in the answer.
Pages that have many links pointing to them in S are called authorities.
Pages that have many outgoing links are called hubs.
Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities.
Ranking
H(p) = Σ_{u ∈ S : p → u} A(u)
A(p) = Σ_{v ∈ S : v → p} H(v)
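These mutually recursive hub/authority updates can be iterated to a fixed point with normalization; a minimal sketch on a tiny assumed graph:

```python
# Hub/authority iteration sketch: A(p) sums H over pages linking to p,
# H(p) sums A over pages p links to, normalizing each round.
def hits(links, iterations=50):
    """links: dict page -> list of pages it points to."""
    pages = set(links) | {v for outs in links.values() for v in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        norm = sum(auth.values()) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(hub.values()) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth

# Tiny assumed graph: a and b both point to c, so c emerges as the authority.
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
print(max(auth, key=auth.get))  # c
```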
PageRank
Used in Google. PageRank simulates a user navigating randomly in the Web, who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 − q.
This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed.
Let C(a) be the number of outgoing links of page a, and suppose that page a is pointed to by pages p1 to pn.
PageRank (cont’d)
PR(p) = c (PR(p1)/C(p1) + … + PR(pn)/C(pn))
PR(pi): PageRank of page pi, which points to target page p.
C(pi): number of links going out of page pi.
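A minimal power-iteration sketch of this random-surfer model on a tiny assumed graph (the jump probability q = 0.15 is an assumed, commonly used value):

```python
def pagerank(links, q=0.15, iterations=100):
    """links: dict page -> list of outgoing links. Random jump with
    probability q, random outgoing hyperlink with probability 1 - q."""
    pages = set(links) | {v for outs in links.values() for v in outs}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Sum PR(a)/C(a) over pages a that link to p.
            incoming = sum(pr[a] / len(links[a]) for a in pages
                           if p in links.get(a, []))
            new[p] = q / N + (1 - q) * incoming
        pr = new
    return pr

# Tiny assumed graph: a and b link to c, c links back to a.
pr = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(pr, key=pr.get))  # c
```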
Conclusion
Nowadays search engines basically use Boolean or vector models and their variations.
Link-analysis techniques seem to be the “next generation” of search engines.
Indexes: compression and distributed architecture are key.
Crawlers
A robot (spider) traverses the hypertext structure of the Web, collecting information from visited pages, which is used to construct indexes for search engines.
Traditional crawler: visits the entire Web (?) and replaces the index.
Periodic crawler: visits portions of the Web and updates a subset of the index.
Incremental crawler: selectively searches the Web and incrementally modifies the index.
Focused crawler: visits pages related to a particular subject.
Crawling the Web
The order in which the URLs are traversed is important.
Using a breadth-first policy, we first look at all the pages linked from the current page, and so on. This matches well Web sites that are structured by related topics; on the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests.
In the depth-first case, we follow the first link of a page and do the same on that page until we cannot go deeper, returning recursively.
Good ordering schemes can make a difference if crawling better pages first (PageRank).
Crawling the Web
Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed.
Crawlers can also have problems with HTML pages that use frames or image maps. In addition, dynamically generated pages cannot be indexed, nor can password-protected pages.
Focused Crawler
Only visit links from a page if that page is determined to be relevant.
Components:
Classifier: assigns a relevance score to each page based on the crawl topic.
Distiller: identifies hub pages.
The crawler visits pages based on classifier and distiller scores.
The classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages, and must be visited even if they do not have a high relevance score.
Focused Crawler
Basic crawler operation
Begin with known “seed” pages.
Fetch and parse them; extract the URLs they point to; place the extracted URLs on a queue.
Fetch each URL on the queue and repeat.
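The loop above can be sketched as follows (URL fetching and link extraction are stubbed out as an assumption; a real crawler would fetch over HTTP and parse the HTML):

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: fetch_links(url) -> list of URLs found on that
    page (supplied by the caller; stubbed below for illustration)."""
    frontier = deque(seeds)            # queue of URLs to fetch
    crawled = set()                    # URLs already fetched and parsed
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        for link in fetch_links(url):  # fetch page, extract its URLs
            if link not in crawled:
                frontier.append(link)
    return crawled

# Tiny assumed link graph standing in for real fetches:
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
print(sorted(crawl(["a"], lambda u: graph.get(u, []))))  # ['a', 'b', 'c']
```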
Crawling picture
[Diagram: seed pages start the crawl; URLs crawled and parsed feed new links into the URLs frontier, beyond which lies the unseen Web.]
Simple picture – complications
Web crawling isn’t feasible with one machine: all of the above steps must be distributed.
Even non-malicious pages pose challenges:
Latency/bandwidth to remote servers vary.
Webmasters’ stipulations: how “deep” should you crawl a site’s URL hierarchy?
Site mirrors and duplicate pages.
Malicious pages: spam pages, spider traps.
Politeness: don’t hit a server too often.
What any crawler must do
Be polite: respect implicit and explicit politeness considerations; only crawl allowed pages; respect robots.txt (more on this shortly).
Be robust: be immune to spider traps and other malicious behavior from web servers.
What any crawler should do
Be capable of distributed operation: designed to run on multiple distributed machines.
Be scalable: designed to increase the crawl rate by adding more machines.
Performance/efficiency: permit full use of available processing and network resources.
What any crawler should do
Fetch pages of “higher quality” first.
Continuous operation: continue fetching fresh copies of previously fetched pages.
Extensible: adapt to new data formats and protocols.
Updated crawling picture
[Diagram: multiple crawling threads pull URLs from the URL frontier; URLs crawled and parsed feed new links back into the frontier; seed pages start the process, with the unseen Web beyond.]
URL frontier
Can include multiple pages from the same host.
Must avoid trying to fetch them all at the same time.
Must try to keep all crawling threads busy.
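One simple way to reconcile these constraints (a sketch under assumed timing; real frontiers are more elaborate): keep one queue per host and release a host’s next URL only after a minimum delay, so threads stay busy on other hosts in the meantime.

```python
import time
from collections import defaultdict, deque

class Frontier:
    """URL frontier sketch: per-host queues plus a minimum per-host delay,
    so no single host is hit too often while threads stay busy elsewhere."""
    def __init__(self, min_delay=2.0):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.next_ok = {}                  # host -> earliest next-fetch time
        self.min_delay = min_delay

    def add(self, host, url):
        self.queues[host].append(url)

    def get(self):
        """Return a (host, url) whose host is currently allowed, else None."""
        now = time.monotonic()
        for host, q in self.queues.items():
            if q and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.min_delay
                return host, q.popleft()
        return None

f = Frontier(min_delay=2.0)
f.add("example.com", "http://example.com/a")
f.add("example.com", "http://example.com/b")
f.add("example.org", "http://example.org/x")
print(f.get())  # ('example.com', 'http://example.com/a')
print(f.get())  # ('example.org', ...) -- example.com is on cooldown
```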
Explicit and implicit politeness
Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt).
Implicit politeness: even with no specification, avoid hitting any site too often.
Robots.txt
Protocol for giving spiders (“robots”) limited access to a website; originally from 1994.
www.robotstxt.org/wc/norobots.html
A website announces its request on what can(not) be crawled: for a URL, create a file URL/robots.txt.
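Python’s standard library can parse this format; a minimal sketch with an assumed robots.txt body (parsed from a string here rather than fetched over the network):

```python
from urllib.robotparser import RobotFileParser

# An assumed robots.txt body for illustration:
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # normally rp.set_url(...) + rp.read() would fetch URL/robots.txt

print(rp.can_fetch("*", "http://example.com/index.html"))  # True
print(rp.can_fetch("*", "http://example.com/private/x"))   # False
```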