
Indexing the Web – A Challenge for Supercomputing

Monika Henzinger


The Web

• 2-10 billion pages – doubling every 8 months
• 260 million users per month [Nielsen/NetRatings]
• 80% of them issue searches [Jupiter Media Metrix]
• How can they find what they need:

Smart algorithms + parallelism =


Let’s first talk about the smart algorithms …


Classic Information Retrieval

• Input: Document collection
• Goal: Retrieve documents or text with information content that is relevant to user’s information need
• Two aspects:
  1. Processing the collection
  2. Processing queries (searching)


Determining query results

• Ranking is a function of query term frequency within the document and across all documents
• This works because of the following assumptions in classical IR:
  – Queries are long and well specified, e.g. “What is the impact of the Falklands war on Anglo-Argentinean relations”
  – Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic
  – The vocabulary is small and relatively well understood
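A ranking that depends on term frequency within a document and across all documents can be sketched as TF-IDF scoring. The slides do not name a specific weighting, so the formula below (term frequency normalized by document length, times log(N/df)) and the whitespace tokenizer are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_rank(query, docs):
    """Return document ids ordered by a simple TF-IDF score (illustrative weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if not tokens or df[term] == 0:
                continue  # empty document or term absent from the collection
            idf = math.log(n_docs / df[term])        # rare terms weigh more
            score += (tf[term] / len(tokens)) * idf  # frequent-in-document terms weigh more
        scored.append((score, doc_id))
    # Highest score first; ties are broken by the lower document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

With long, well-specified queries and coherent documents, a score of this kind tends to separate on-topic from off-topic documents reasonably well, which is the assumption the slide lists.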


IR on the Web

• Input: The publicly accessible Web
  – Static (files: text, audio, …)
  – Dynamically generated on request: mostly database access
• Goal: Retrieve high quality pages that are relevant to user’s need
• Two aspects:
  1. Gathering and processing the collection
  2. Processing queries (searching)


What’s different about the Web?

(1) Pages:
• Bulk: >2B
• Vocabulary size: tens to hundreds of millions of words
• Lack of stability: Estimates: 23%/day, 38%/week [CG’99]
• Diversity
  – Type of documents: Text, pictures, audio, scripts, …
  – Quality: Lots of misinformation …
  – Language: 100+
• Duplication
  – Syntactic: 30% (near) duplicates
  – Semantic: ??
• Non-running text: many home pages, bookmarks, ...


Lack of stability

[Cho2000] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999


Misinformation (Spam)

http://www.Safari-iafrica.com/diana.htm

<META name="keywords" content="diana,di,princess,princess diana,princess di,princess DIANA,PRINCES di,princess of wales,princess of wales, death,condolences, royal, royal family,british,spencer,harry,william,charles,prince,william,prince harry,prince charles,queen elizabeth,">

Other tricks: Keyword hiding, Link spam, Cloaking, Doorways, DNS cloaking, Domain hijacking, …


The big challenge

Meet the user needs given the heterogeneity of Web pages


What’s different about the Web?

(2) Users:
• Bulk: > 93.5 million unique users [3/2002, MediaMetrics USA data]
• Diversity
  – Topics of interest: Arts, computers, …
  – Knowledge & needs
  – Language: 100+
• Sub-optimal queries
  – Short: 2.35 terms avg
  – Sub-optimal syntax: ~80% without operators *
• Search behavior
  – Few results studied: 85% of users look only at top 10 results *
  – Query modification: 78% of queries are not modified *

* [SHMM’98]


The bigger challenge

Meet the user needs given the heterogeneity of Web pages and the sub-optimal queries.


The bright side: Web advantages vs. classic IR

Collection/tools
• Redundancy
• Hyperlinks
• Statistics
  – Easy to gather
  – Large sample sizes
• Interactivity (give hints to the user)

User
• Many tools available
• Interactivity (refine the query if needed)


Determining query results on the web

• Ranking based on query term frequency within the document and across all documents does not work well:
  – Misinformation
  – Variety in page quality
  – Huge vocabulary
  – Short queries


Google’s approach

• Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  → Quality of a page is related to its in-degree
• Recursion: Quality of a page is related to
  – its in-degree, and to
  – the quality of pages linking to it
  → PageRank [BP’98]


Definition of PageRank [BP’98]

• Consider the following infinite random walk (surf):
  – Initially the surfer is at a random page
  – At each step, the surfer proceeds
    • to a randomly chosen web page with probability d
    • to a randomly chosen successor of the current page with probability 1-d
• The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
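The random walk can be approximated by direct simulation. The sketch below is not from the slides: the function name, the default d = 0.15, the finite step count, and the handling of pages without successors (treated as a forced random jump) are all assumptions.

```python
import random
from collections import Counter

def simulate_surfer(successors, d=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps a random surfer spends on each page.

    successors maps each page to the list of pages it links to.
    """
    rng = random.Random(seed)
    pages = sorted(successors)
    current = rng.choice(pages)          # initially the surfer is at a random page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        out = successors[current]
        if rng.random() < d or not out:
            # Jump to a random page with probability d
            # (also used for dead ends, which is an assumption of this sketch).
            current = rng.choice(pages)
        else:
            # Otherwise follow a randomly chosen link of the current page.
            current = rng.choice(out)
    return {page: visits[page] / steps for page in pages}
```

A finite simulation only approximates the limit in the definition; more steps give a closer estimate.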


PageRank (cont.)

Said differently:
• PageRank = stationary probability for this Markov chain, i.e.

$$\mathrm{PageRank}(u) = \frac{d}{n} + (1-d) \sum_{(v,u) \in E} \frac{\mathrm{PageRank}(v)}{\mathrm{outdegree}(v)}$$

where n is the total number of nodes in the graph
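One common way to compute this stationary probability is power iteration: start from the uniform distribution and repeatedly apply the update "d/n plus (1-d) times the sum, over pages v linking to u, of PageRank(v)/outdegree(v)". The sketch below assumes a small in-memory graph; the fixed iteration count and the treatment of pages without out-links (their rank is spread evenly over all pages) are choices made here, not taken from the slides.

```python
def pagerank(successors, d=0.15, iterations=100):
    """Power iteration for PageRank(u) = d/n + (1-d) * sum over links (v,u) of PageRank(v)/outdegree(v)."""
    pages = sorted(successors)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}       # start from the uniform distribution
    for _ in range(iterations):
        new_rank = {page: d / n for page in pages}  # random-jump term
        for v in pages:
            out = successors[v]
            if out:
                share = (1 - d) * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share            # v passes an equal share to each successor
            else:
                # Dead end: spread v's rank evenly (an assumption of this sketch).
                for u in pages:
                    new_rank[u] += (1 - d) * rank[v] / n
        rank = new_rank
    return rank
```

For the three-page graph `{"A": ["P"], "B": ["P"], "P": ["A"]}`, page B has no in-links, so its rank settles at d/n, while P, which receives links from both A and B, ends up with the largest value.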


PageRank (cont.)

[Figure: pages A and B each link to page P]

PageRank of P is

β ∗ (1/4th the PageRank of A + 1/3rd the PageRank of B) + (1 − β)


PageRank advantages

• Query-independent
• Summarizes the “web opinion” of the page importance
• Highly spam-resistant
• Patented


Now let’s talk about parallelism …


Search engine components

• Crawler (Spider): collects the documents
• Indexer: processes and represents the data
• Query handler: processes user queries


Crawler

[Figure: Crawling process. The list of links to explore is fed by “Add URL” and by expired pages from the index. Loop: get link from list → fetch page → index page and parse links → add to queue.]
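The crawling loop can be sketched as a queue-driven traversal. This is a single-threaded illustration only: `fetch` and `parse_links` are hypothetical callables supplied by the caller (a real crawler would perform HTTP requests and HTML parsing there), and `max_pages` is an assumed safety limit.

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, max_pages=1000):
    """Breadth-first crawl: get link from list, fetch page, index it, parse links, add to queue."""
    frontier = deque(seed_urls)   # list of links to explore
    seen = set(seed_urls)         # avoids re-crawling the same URL
    index = {}                    # url -> page content (stand-in for a real index)
    while frontier and len(index) < max_pages:
        url = frontier.popleft()  # get link from list
        page = fetch(url)         # fetch page
        if page is None:
            continue              # unreachable or missing page
        index[url] = page         # index page
        for link in parse_links(page):
            if link not in seen:  # add newly discovered links to the queue
                seen.add(link)
                frontier.append(link)
    return index
```

For a quick experiment, a dictionary can stand in for the Web: with `web = {"a": "b c", "b": "a c", "c": ""}`, calling `crawl(["a"], web.get, str.split)` visits the three pages reachable from "a".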


Issues with Parallelization

• Crawling order
• Avoid re-crawling
  – Session-ids
• Should not overload any server or connection
  – virtual hosting
• Avoid infinite spaces
• Content types (Google: 23 different types)
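Two of these issues lend themselves to a short sketch: stripping session ids so the same page is not crawled again under a different URL, and assigning every host to exactly one crawler worker so that a single place can rate-limit requests to that host. The parameter names in `SESSION_PARAMS` and both function names are hypothetical; real sites use many other session-id conventions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that carry session ids.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url):
    """Normalize a URL and drop session-id parameters so duplicates map to one key."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key.lower() not in SESSION_PARAMS]
    # Lower-case scheme and host, default the path to "/", drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

def worker_for(url, num_workers):
    """Map a URL to a worker by hashing its host, so one worker owns each host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Hashing the host name does not fully address virtual hosting, where many host names share one machine; a crawler that wants to protect the underlying server would probably need to key on the resolved IP address instead.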



• Crawler (Spider): collects the documents ✓
• Indexer: processes and represents the data



Indexer

• Inverted index data structure: Consider all documents concatenated into one huge document
  – For each word keep an ordered array of all positions in document, compressed

  Word 1 | 1st position | ... | last position
  ...    | ...

• Indexing = a huge parallel sort
• Issues
  – Redirects
  – Anchor text
  – Incremental updates
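The positional inverted index can be sketched on a single machine as follows. The slides only say the position arrays are compressed; storing gaps between consecutive positions (delta encoding) is one common approach and is an assumption here, as are the function names and the whitespace tokenizer.

```python
from collections import defaultdict

def delta_encode(sorted_positions):
    """Store each position as the gap to the previous one; small gaps compress well."""
    gaps, previous = [], 0
    for position in sorted_positions:
        gaps.append(position - previous)
        previous = position
    return gaps

def delta_decode(gaps):
    """Recover absolute positions from the gap list."""
    positions, total = [], 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions

def build_index(docs):
    """Treat all documents as one concatenated token stream and index word positions."""
    positions = defaultdict(list)
    doc_starts = []               # global position at which each document begins
    offset = 0
    for doc in docs:
        doc_starts.append(offset)
        for token in doc.lower().split():
            positions[token].append(offset)   # appended in increasing order
            offset += 1
    index = {word: delta_encode(plist) for word, plist in positions.items()}
    return index, doc_starts
```

Keeping `doc_starts` makes it possible to map a global position back to the document that contains it, for example with a binary search over the start offsets.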



• Crawler (Spider): collects the documents ✓
• Indexer: processes and represents the data ✓



Google’s query handler

• Over 150 million queries per day
• Sub-second response time
• Powered by more than 10,000 Linux-based systems (over 10 teraflops)


Issues

• Scalability with:
  – traffic growth
  – web data growth
• Hardware faults


Scalability (Data)

• Size of web is growing exponentially
• No matter how big your machine, it’s going to be too small
• Solution: distribute index across multiple machines (“sharding”)
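Distributing the index by document can be sketched as below: each shard indexes a subset of the documents, a query is sent to every shard, and the partial results are merged. The class names, the `doc_id % num_shards` assignment, and scoring by raw term count are illustrative assumptions; in a real deployment each shard would live on its own machine.

```python
import heapq
from collections import Counter

class Shard:
    """Holds the term counts for one subset of the documents."""
    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = Counter(text.lower().split())

    def search(self, term, k):
        # Local top-k by term count within this shard.
        hits = [(tf[term], doc_id) for doc_id, tf in self.docs.items() if tf[term] > 0]
        return heapq.nlargest(k, hits)

class ShardedIndex:
    """Routes each document to one shard and merges per-shard results at query time."""
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term, k=10):
        term = term.lower()
        # Each shard answers independently; only its top k need to be merged.
        partial = [hit for shard in self.shards for hit in shard.search(term, k)]
        return [doc_id for score, doc_id in heapq.nlargest(k, partial)]
```

Because every shard returns at most k candidates, the merge step stays small no matter how many documents each shard holds.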


Scalability (Traffic)

• Replicate everything
• Index is read-only, so no consistency problems
• Search is embarrassingly parallel, so linear speedup


Reliability / Fault Tolerance

• PCs are unreliable, especially if you have thousands
• But they are cheap and fast
• Strategy:
  – Again: Replication is your friend
    • Failure only reduces capacity
    • Anyway needed for scalability
  – Make it reliable in software
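"Reliable in software" can be illustrated with a front end that spreads queries over identical read-only replicas and skips any replica that fails. This is a minimal sketch: the `ReplicaSet` class is hypothetical, replicas are modeled as plain callables, and a failure is modeled as a `ConnectionError`.

```python
import itertools

class ReplicaSet:
    """Round-robin over replicas; a failed replica is skipped, reducing capacity only."""
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._next = itertools.cycle(range(len(self.replicas)))

    def query(self, request):
        # Try each replica at most once per request, starting after the last one used.
        for _ in range(len(self.replicas)):
            replica = self.replicas[next(self._next)]
            try:
                return replica(request)
            except ConnectionError:
                continue  # this replica is down; try the next one
        raise RuntimeError("all replicas failed")
```

Since the index is read-only, any replica can answer any query, so no coordination between replicas is needed for failover.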


New developments at Google

• Google Web APIs
• Distributed computing: part of folding@home project at Stanford for protein analysis
• Voice search (+1 650 318 0165)
• …


Google Web APIs (beta)

• SOAP interface to Google:
  – Searches
  – Spelling requests
  – Cached pages
• Client examples in Java, Perl, .NET
• Open developer program (http://www.google.com/apis/)
• Examples:
  – Professor verifier
  – Google velocity indicator


Where Our Users Are...