“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS 542 - 072.

“The Anatomy of a Large-Scale HypertextualWeb Search Engine”

Presented by

Ahmed Khaled Al-ShantoutICS 542 - 072

Outline• Paper Objective• Introduction & History• Related Work• Design Goals• Google Search Engine Features• Google Architecture• Results & Performance• Conclusion• References

Paper Objective

• To describe the anatomy of a large scale web search engine = Google

Introduction & History• Why the name Google? From googol = 10100 = very large scale

• Web is different from a normal search engine.– Web is vast and is growing exponentially

– Web is heterogonous – images, HTML, files … etc.

– IR on small and well controlled homogenous collections is much easier.

• Human Maintained Lists can’t keep up– Yahoo! Is a human maintained list and so – subjective, slow to

improve, expensive to build and maintain.

• Google make use of the hypertext info to get better results

Related Work

• WWWW (World Wide Web Worm) – 1994 – first web search engine - indexed about 110,000 web pages – handled about 1500 queries/day.

• In 1997, the top SE claimed to index about 2 million web documents - handled about 20 million queries/day.

• In 2000, it was expected to index more than a billion documents – with more than 100 million queries/day.

• Current S.E problems– Subjective– If automated S.E then it returns low quality results– Advertisers can mislead automated S.E

Solution = Google

• Problem statement !

• Must handle the problem in very efficient way– Storage requirements– Efficient processing of the indexing system– Handle a huge number of queries/second– Produce a high quality results

Design Goals• Deliver results that have very high precision even at the

expense of recall– Using hypertextual info can improve the search quality – such as font size,

links, titles, anchor text …. Etc– In 1997, only 4 commercial S.E. was able to return themselves in the top

ten results !

• Make search engine technology transparent, i.e. advertising shouldn’t bias results

• Bring search engine technology into academic environment in order to support novel research activities on large web data sets

• Make system easy to use for most people, e.g. users shouldn’t have to specify more than a couple of words

Google Search Engine Features

Two main features to increase result precision:• Uses link structure of web (PageRank)• Uses text surrounding hyperlinks to improve

accurate document retrieval

Other features include:• Takes into account word proximity in documents• Uses font size, word position, etc. to weight word• Storage of full raw html pages

PageRank• What does it mean for a web page to have a high rank?

– Many pages point to it – so it is an important one-– Some important pages point to it such as Yahoo!

• PR(A) = (1-d) + d [PR(T1)/C(T1) + PR(T2/C(T2) + … + PR(Tn/C(Tn)].

• D is called the damping factor - used to prevent misleading the system to get a higher ranking.

• Page A has T1…Tn pages which point to A.• C(T1) is the number of links going out of page T1.• Not all links are treated the same• PageRank is calculated using simple iterative algorithms

Anchor Text• The text of the link. Chapter1

• Objective to return non-textual objects like files, databases, images … etc – which can not be indexed by a text-based S.E

• Also, it return non crawled pages

PageTextLinkTarget Page

Dr. Wasfi’s Page

Chapter1www…PDF file

http://www.ccse.kfupm.edu.sa/~wasfi/ics542/chapter1

Google Architecture

• Most of Google was built using C and C++ for efficiency.

• Works on Solaris and Linux.

Repository

BarrelsLexicon

SearcherPageRank

Sorter

CrawlerStore ServerURL Server

URL Resolver

Links

Doc Index

Indexer

Anchors

Crawling the Web

• To fetch URL and gather web pages into the store.• It is a challenging task. Needs to be fast to keep up to date info• Has to interact with the outside world – web servers, name servers …

etc• To scale well distributed crawling• Each crawler keep 300 open connections• Up to 100 web pages per second using 4 crawler• DNS cache to improve performance• A connection can be in one of these states - looking up DNS,

connecting to host, sending request, and receiving response

Major Data Structure

• Repository: Contains the full html page compressed using zlib standard.

• Document Index: Keeps information about each document. ordered by docID.

• Hit Lists: Corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information.

Forwarded Index

• Stored in the barrels.• Each barrel hold a set of wordID• If a document has a word in that barrel, the docID

is recorded in that barrel.

docIDWordID =1 #Hits =2Hit list

WordID =2 #Hits =3Hit list

Inverted Index

• Same as forwarded index expect that it was processed by the sorter.

WordIDDocs#

WordIDDocs#

WordIDDocs# docID# hitsHit list

docID# hitsHit list

docID# hitsHit list

docID# hitsHit list

Lexicon

Results & Performance

• Quality of the results is the most important metric in search engines

• Authors claim that Google outperform major commercial search engines

• Example

Storage Requirements

• Scale well.

• Utilize the storage efficiently

• Use compression

System Performance

• Major operations are crawling, indexing and sorting

• 9 days to get 26 million pages.

• Average 48.5 pages/second

• Indexer – 54 pages/second

• The whole sorting operation takes 24 hours

Search Performance

• Was not the most important issue in their design that time

• The response time for a query was between 1 to 10 second for all queries – mainly Disk IO time –

• Did not have any query cashing, subindices on common terms – for optimization -

Search Performance

Repository

BarrelsLexicon

SearcherPageRank

Sorter

CrawlerStore ServerURL Server

URL Resolver

Links

Doc Index

Indexer

Anchors

Conclusion & Future Work

• The most concern of Google design is to be a scalable web search engine.

• And to provide high quality results

– Page ranking

– Anchor text

Future Work

• To improve search efficiency– Cash the query– Smart disk allocation– Subindices

• Updates – old and new pages – • Add Boolean operators, negation, and stemming• relevance feedback and clustering• user context• result summarization• PageRank can be personalized by increasing the weight of a

user’s home page or bookmarks

References

• S. Brin,L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)

Q & A

Thanks for your Attention

“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS 542 - 072.

Documents

web pages

google slide

web documents

large scale web search

normal search engine

search quality

link structure of web

raw html pages slide