“The Anatomy of a Large-Scale Hypertextual Web Search Engine” Presented by Ahmed Khaled Al-Shantout ICS 542 - 072

Dec 19, 2015


“The Anatomy of a Large-Scale Hypertextual Web Search Engine”

Presented by

Ahmed Khaled Al-Shantout
ICS 542 - 072


Outline
• Paper Objective
• Introduction & History
• Related Work
• Design Goals
• Google Search Engine Features
• Google Architecture
• Results & Performance
• Conclusion
• References


Paper Objective

• To describe the anatomy of a large-scale web search engine: Google


Introduction & History
• Why the name Google? It comes from googol = 10^100, reflecting the very large scale of the web.
• Searching the web is different from traditional information retrieval (IR).
– The web is vast and is growing exponentially
– The web is heterogeneous: images, HTML, files, etc.
– IR on small, well-controlled, homogeneous collections is much easier.
• Human-maintained lists can’t keep up
– Yahoo! is a human-maintained list, so it is subjective, slow to improve, and expensive to build and maintain.
• Google makes use of hypertext information to get better results


Related Work

• WWWW (World Wide Web Worm), 1994: one of the first web search engines; it indexed about 110,000 web pages and handled about 1,500 queries/day.

• In 1997, the top search engines claimed to index from 2 million to 100 million web documents; AltaVista claimed to handle about 20 million queries/day.

• By 2000, a comprehensive index was expected to contain more than a billion documents, with hundreds of millions of queries/day.

• Problems with current search engines
– Human-maintained lists are subjective
– Automated search engines return too many low-quality results
– Advertisers can mislead automated search engines


Solution = Google

• Problem statement !

• The system must handle the problem very efficiently
– Storage requirements
– Efficient processing by the indexing system
– Handling a huge number of queries per second
– Producing high-quality results


Design Goals
• Deliver results that have very high precision, even at the expense of recall
– Using hypertextual information such as font size, links, titles, and anchor text can improve search quality
– In 1997, only one of the top four commercial search engines returned itself in the top ten results for its own name!

• Make search engine technology transparent, i.e. advertising shouldn’t bias results

• Bring search engine technology into academic environment in order to support novel research activities on large web data sets

• Make system easy to use for most people, e.g. users shouldn’t have to specify more than a couple of words
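
The precision-over-recall goal above can be made concrete with a small calculation. This is a minimal sketch using the standard definitions (precision = relevant returned / all returned, recall = relevant returned / all relevant); the document IDs are made up for illustration and do not come from the paper.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 8 documents are truly relevant.
returned = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"]
p, r = precision_recall(returned, relevant)
print(p, r)  # 0.75 0.375
```

A short, mostly relevant result list scores high on precision even though it misses many relevant documents, which is the trade-off the design goal accepts.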


Google Search Engine Features

Two main features to increase result precision:
• Uses the link structure of the web (PageRank)
• Uses the text of hyperlinks (anchor text) to improve retrieval accuracy

Other features include:
• Takes word proximity in documents into account
• Uses font size, word position, etc. to weight words
• Stores the full raw HTML of pages


PageRank
• What does it mean for a web page to have a high rank?
– Many pages point to it, so it is an important one
– Some important pages point to it, such as Yahoo!
• PR(A) = (1 - d) + d [PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn)]
• d is the damping factor, between 0 and 1 (the paper usually sets it to 0.85); it also helps prevent misleading the system to get a higher ranking.
• T1…Tn are the pages that point to page A.
• C(T1) is the number of links going out of page T1.
• Not all links are treated the same: a link from a page with a high PageRank counts for more
• PageRank is calculated using a simple iterative algorithm
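
The iterative calculation can be sketched in a few lines. This is a minimal illustration of the formula PR(A) = (1 - d) + d · Σ PR(Ti)/C(Ti), not the authors' implementation; the three-page graph, the function name, and the fixed iteration count are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Assumes every page that appears has at least one outgoing link.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}

    # incoming[p] lists the pages T1..Tn that point to p.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for t in targets:
            incoming[t].append(src)

    for _ in range(iterations):
        # One pass of PR(A) = (1 - d) + d * sum(PR(T) / C(T)).
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
            for p in pages
        }
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up ranked above B because it receives B's whole vote plus half of A's, while B receives only half of A's.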


Anchor Text
• The text of the link, e.g. “Chapter1”
• It makes it possible to return non-textual objects such as files, databases, and images, which cannot be indexed by a text-based search engine
• It also makes it possible to return pages that have not been crawled

[Figure: a page (Dr. Wasfi’s Page) contains a link with the text “Chapter1” whose target (www…) is a PDF file]
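
The idea of crediting anchor text to the page a link points to can be sketched as follows. This is an illustrative toy, not the paper's data structure: the function name, the tuple layout, and the example.edu URLs are all hypothetical.

```python
from collections import defaultdict


def index_anchor_text(links):
    """links: iterable of (source_url, anchor_text, target_url) tuples.

    Each word of the anchor text is indexed under the TARGET of the link,
    so the target can be found even if its own content was never parsed.
    """
    index = defaultdict(set)
    for _source, text, target in links:
        for word in text.lower().split():
            index[word].add(target)
    return index


# Hypothetical links; the PDF itself is never fetched or parsed here.
links = [
    ("http://example.edu/course.html", "Chapter1", "http://example.edu/ch1.pdf"),
    ("http://example.edu/course.html", "course photo", "http://example.edu/photo.jpg"),
]
anchor_index = index_anchor_text(links)
print(sorted(anchor_index["chapter1"]))
```

A query for "chapter1" can then return the PDF, and a query for "photo" can return the image, although neither target contains indexable text.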


Google Architecture

• Most of Google was built using C and C++ for efficiency.

• Works on Solaris and Linux.


[Figure: high-level Google architecture. Components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Barrels, Lexicon, Sorter, PageRank, Searcher]


Crawling the Web

• Fetches URLs and gathers web pages into the store.
• It is a challenging task: it needs to be fast to keep information up to date
• Has to interact with the outside world: web servers, name servers, etc.
• Uses distributed crawling to scale well
• Each crawler keeps about 300 connections open at once
• Up to 100 web pages per second using 4 crawlers
• Each crawler keeps a DNS cache to improve performance
• A connection can be in one of these states: looking up DNS, connecting to host, sending request, and receiving response
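
The four connection states listed above can be modelled as a small state machine. This is only a sketch of the idea: the enum, the transition table, and the strictly linear progression are assumptions for illustration, and no network I/O is performed.

```python
from enum import Enum, auto


class ConnState(Enum):
    LOOKING_UP_DNS = auto()
    CONNECTING_TO_HOST = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()


# Assumed linear progression; None marks a finished connection.
NEXT_STATE = {
    ConnState.LOOKING_UP_DNS: ConnState.CONNECTING_TO_HOST,
    ConnState.CONNECTING_TO_HOST: ConnState.SENDING_REQUEST,
    ConnState.SENDING_REQUEST: ConnState.RECEIVING_RESPONSE,
    ConnState.RECEIVING_RESPONSE: None,
}


def lifecycle(start=ConnState.LOOKING_UP_DNS):
    """Return the sequence of states one connection passes through."""
    seen = []
    state = start
    while state is not None:
        seen.append(state)
        state = NEXT_STATE[state]
    return seen


print([s.name for s in lifecycle()])
```

A crawler that keeps hundreds of connections open would hold one such state per connection and advance each one as its I/O completes; a DNS cache hit would let a connection start directly at CONNECTING_TO_HOST.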


Major Data Structures

• Repository: Contains the full HTML of every page, compressed using zlib.

• Document Index: Keeps information about each document, ordered by docID.

• Hit Lists: A hit list is the list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
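
A single hit can be stored very compactly. The paper describes a plain hit as a capitalization bit, 3 bits of font size, and 12 bits of word position; the sketch below packs those fields into one 16-bit integer. The field order, the function names, and the clamping of large positions are assumptions made for illustration, and the paper's special hit types are ignored.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit cap | 3 bits font | 12 bits position."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp to the 12-bit maximum
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Inverse of pack_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_hit(True, 3, 42)
print(hex(hit), unpack_hit(hit))
```

A hit list is then just a sequence of such integers for one word in one document, which keeps the index small.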


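Forward Index

• Stored in the barrels.
• Each barrel holds a range of wordIDs
• If a document contains a word that falls into a barrel, the docID is recorded in that barrel, followed by the wordIDs and their hit lists.

Layout of one forward-index entry:
docID | wordID = 1 | #hits = 2 | hit list
      | wordID = 2 | #hits = 3 | hit list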
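
The barrel layout above can be sketched with ordinary dictionaries. This is a toy model, not the paper's on-disk format: the number of barrels, the width of each wordID range, and the function names are hypothetical.

```python
from collections import defaultdict

NUM_BARRELS = 4          # hypothetical
WORDS_PER_BARREL = 1000  # hypothetical width of each barrel's wordID range


def barrel_for(word_id):
    """Pick the barrel whose wordID range contains word_id."""
    return min(word_id // WORDS_PER_BARREL, NUM_BARRELS - 1)


def add_document(barrels, doc_id, hits_by_word):
    """hits_by_word maps wordID -> hit list for one document.

    Each barrel records, under the docID, the (wordID, #hits, hit list)
    entries for the words that fall into that barrel's range.
    """
    for word_id, hits in hits_by_word.items():
        barrels[barrel_for(word_id)][doc_id].append((word_id, len(hits), hits))


barrels = [defaultdict(list) for _ in range(NUM_BARRELS)]
add_document(barrels, 7, {1: [3, 9], 2: [4, 5, 11], 2500: [1]})
print(dict(barrels[0]))
print(dict(barrels[2]))
```

Document 7 shows up in barrel 0 with two word entries (matching the layout in the slide) and again in barrel 2 for the word with ID 2500.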


Inverted Index

• Uses the same barrels as the forward index, except that they have been processed by the sorter.

[Figure: each Lexicon entry (wordID, #docs) points to that word’s list of entries in the inverted barrels; each entry holds docID, #hits, and the hit list]
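
Turning a forward index into an inverted one amounts to regrouping the same entries by wordID instead of docID. The sketch below does this in memory with plain dictionaries; the function name, the input layout, and the sample data are assumptions for illustration and say nothing about how the real sorter works on disk.

```python
from collections import defaultdict


def invert(forward):
    """forward maps docID -> list of (wordID, hit list).

    Returns (lexicon, inverted) where inverted maps wordID -> list of
    (docID, #hits, hit list) and lexicon maps wordID -> number of docs.
    """
    postings = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, hits in forward[doc_id]:
            postings[word_id].append((doc_id, len(hits), hits))
    inverted = dict(sorted(postings.items()))  # ordered by wordID
    lexicon = {word_id: len(docs) for word_id, docs in inverted.items()}
    return lexicon, inverted


# Hypothetical forward index with two documents and two words.
forward = {2: [(1, [5]), (2, [1, 8])], 1: [(2, [4]), (1, [3, 9])]}
lexicon, inverted = invert(forward)
print(lexicon)
print(inverted[1])
```

Looking up a query word now takes one lexicon lookup followed by a scan of that word's document list, instead of a scan over every document.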


Results & Performance

• Quality of the results is the most important metric in search engines

• The authors claim that Google outperforms major commercial search engines

• Example


Storage Requirements

• Scale well.

• Utilize the storage efficiently

• Use compression
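
Since the repository stores pages compressed with zlib, the effect of compression is easy to try with Python's standard zlib module. This is a small demonstration only; the sample page is synthetic and highly repetitive, so its compression ratio is not representative of real web pages.

```python
import zlib

# Synthetic, repetitive HTML used purely for illustration.
html = ("<html><body>" + "<p>hello web</p>" * 200 + "</body></html>").encode("utf-8")

compressed = zlib.compress(html)
restored = zlib.decompress(compressed)

ratio = len(compressed) / len(html)
print(len(html), len(compressed), f"{ratio:.3f}")
```

Decompression returns the original bytes exactly, so the full raw page remains available for re-indexing.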


System Performance

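• Major operations are crawling, indexing, and sorting

• Crawling took roughly 9 days to download 26 million pages.

• The last 11 million pages were downloaded in 63 hours, an average of 48.5 pages/second

• Indexer: about 54 pages/second

• The whole sorting operation takes about 24 hours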
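
The crawl figures are worth checking by hand, because 48.5 pages/second is not 26 million pages divided by 9 days. In the paper, that rate refers to the last 11 million pages, which were downloaded in 63 hours. A quick calculation under those numbers:

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Whole crawl: 26 million pages in roughly 9 days.
overall_rate = 26_000_000 / (9 * SECONDS_PER_DAY)

# Final stretch reported in the paper: 11 million pages in 63 hours.
final_rate = 11_000_000 / (63 * SECONDS_PER_HOUR)

print(f"{overall_rate:.1f} pages/second overall")       # 33.4
print(f"{final_rate:.1f} pages/second at full speed")   # 48.5
```

The overall average is lower because, according to the paper, the crawler was not running at full speed for the whole 9 days.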


Search Performance

• Was not the most important issue in their design at that time

• The response time was between 1 and 10 seconds for most queries, dominated by disk I/O time

• Did not yet have optimizations such as query caching or subindices on common terms


Conclusion & Future Work

• The main concern of Google’s design is to be a scalable web search engine

• and to provide high-quality results, using

– PageRank

– Anchor text


Future Work

• To improve search efficiency
– Cache queries
– Smart disk allocation
– Subindices
• Updates: deciding which old pages to recrawl and which new pages to crawl
• Add Boolean operators, negation, and stemming
• Relevance feedback and clustering
• User context
• Result summarization
• PageRank can be personalized by increasing the weight of a user’s home page or bookmarks
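
Query caching, the first efficiency item above, can be illustrated with Python's functools.lru_cache. This is a generic sketch of the technique, not anything described in the paper: slow_search is a hypothetical stand-in for a search that goes to disk, and the cache size is arbitrary.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the slow path really runs


def slow_search(query):
    """Hypothetical stand-in for a disk-bound search."""
    CALLS["count"] += 1
    return tuple(sorted(query.lower().split()))


@lru_cache(maxsize=1024)
def cached_search(query):
    """Answer repeated queries from memory instead of calling slow_search."""
    return slow_search(query)


first = cached_search("web search")
second = cached_search("web search")  # served from the cache
print(first, CALLS["count"], cached_search.cache_info().hits)
```

Because popular queries repeat often, even a small cache like this one can spare a large share of the disk reads that dominate response time.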


References

• S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)


Q & A

Thanks for your Attention