Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005
Slide 4.1
Chapter 4 : Searching the Web
• The mechanics of a typical search.
• Search engines as information gatekeepers.
• The search engine wars.
• Statistics from search engine logs.
• The architecture of a search engine.
• The search index.
• The query engine.
• Crawling the web.
Slide 4.2
Mechanics of a Typical Search
Figure 4.1 : Query submitted to Google
Slide 4.3
Mechanics of a Typical Search
Figure 4.2 : Google results for the query
Slide 4.4
Mechanics of a Typical Search
Figure 4.3: Category of first result
Slide 4.5
Mechanics of a Typical Search
Figure 4.4 : Result for phrase query
Slide 4.6
Search Engines as Information Gatekeepers
• Search engines are becoming the primary entry point for discovering web pages.
• Ranking of web pages influences which pages users will view.
• Exclusion of a site from search engines will cut off the site from its intended audience.
• The privacy policy of a search engine is important.
Slide 4.7
Search Engine Wars
• The battle for domination of the web search space is heating up!
• The competition is good news for users!
• The way in which advertising is combined with search results is crucial!
• There are serious implications if one search engine manages to dominate the space!
Slide 4.8
Google
• The verb “google” has become synonymous with searching for information on the web.
• Has raised the bar on search quality.
• Has been the most popular search engine in the last few years.
• Had a very successful IPO in August 2004.
• Is innovative and dynamic.
Slide 4.9
Yahoo!
• Synonymous with the dot-com boom, probably the best known brand on the web.
• Started off as a web directory service.
• Has very strong advertising and e-commerce partnerships.
• Acquired leading search engine technology in 2003.
Slide 4.10
MSN Search
• Synonymous with PC software.
• Remember its victory in the browser wars with Netscape.
• Developed its own search engine technology only recently, officially launched in Feb. 2005.
• May link web search into its next version of Windows.
Slide 4.11
Others
• Ask Jeeves
  – Specialises in natural language question answering.
  – Search driven by Teoma.
• Looksmart
  – Has its own directory service.
  – Search driven by Wisenut.
• …
Slide 4.12
Statistics from search engine logs
Statistic                           AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
Average terms per query             2.35               2.30               2.60
Average queries per session         2.02               2.80               2.30
Average result pages viewed         1.39               1.55               1.70
Usage of advanced search features   20.4%              1.0%               10.0%
Slide 4.13
Experiment with search engine query syntax
• The default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits.
• “+chess” in a query means the user insists that “chess” be present in all hits.
• “computer OR chess” means that at least one of the two keywords must be present in every hit.
• “"computer chess"” (the query enclosed in double quotes) means that the exact phrase “computer chess” must be present in all hits.
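The four rules above can be sketched as a small matcher. This is only an illustration under simplifying assumptions: the function names `tokenize` and `matches` are hypothetical, a query uses one operator at a time, and real search engines parse far richer syntax.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(query, page_text):
    """Return True if page_text satisfies a query in the simplified syntax."""
    words = tokenize(page_text)
    # A query enclosed in double quotes is a phrase query:
    # the words must appear next to each other, in order.
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        phrase = tokenize(query)
        n = len(phrase)
        return n > 0 and any(
            words[i:i + n] == phrase for i in range(len(words) - n + 1)
        )
    terms = query.split()
    # "a OR b": at least one of the keywords must be present.
    if "OR" in terms:
        keywords = [t.lower() for t in terms if t != "OR"]
        return any(k in words for k in keywords)
    # Default is AND: every keyword must be present.
    # A leading "+" is redundant under AND, so it is simply stripped.
    keywords = [t.lstrip("+").lower() for t in terms]
    return all(k in words for k in keywords)
```

For example, `matches("computer chess", "A computer can play chess")` holds because both keywords occur, while the phrase query `matches('"computer chess"', "A computer can play chess")` does not, because the two words are not adjacent.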
Slide 4.14
The most popular search keywords
AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
sex                free               free
applet             sex                sex
porno              download           pictures
mp3                software           new
chat               uk                 nude
Slide 4.15
Architecture of a Search Engine
Slide 4.16
Search Index - Inverted File
• Also store position of word in web page and information on HTML structure.
Slide 4.17
The query engine
• The interface between the search index, the user and the web.
• Algorithmic details of commercial search engines kept as trade secrets.
• First step is retrieval of potential results from the index.
• Second step is the ranking of the results based on their “relevance” to the query.
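The two steps can be sketched as follows. Since the ranking algorithms of commercial engines are trade secrets, the score used here (total occurrences of the query terms in a page) is only a stand-in assumption, and the function names are hypothetical. A real engine would read term counts from its index rather than re-scanning page text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def search(query, pages):
    """Return page ids matching every query term, best-scoring first."""
    terms = tokenize(query)
    counts = {pid: Counter(tokenize(text)) for pid, text in pages.items()}

    # Step 1: retrieval. Keep pages that contain every query term.
    candidates = [
        pid for pid, c in counts.items()
        if terms and all(c[t] > 0 for t in terms)
    ]

    # Step 2: ranking. Toy relevance score: total query-term occurrences.
    def score(pid):
        return sum(counts[pid][t] for t in terms)

    # Sort by descending score; ties are broken by page id for determinism.
    return sorted(candidates, key=lambda pid: (-score(pid), pid))
```

Keeping retrieval and ranking separate means the relevance function can be swapped without changing how candidates are found.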
Slide 4.18
Portal User Interface (see also yahoo.com)
Slide 4.19
Crawling the Web
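As a rough illustration of the general idea, a crawler can be sketched as a breadth-first traversal of the link graph starting from a set of seed pages. Everything in this sketch is an assumption made for illustration: the names `crawl` and `fetch_links`, the page limit, and the toy in-memory web that stands in for real HTTP fetching. Practical concerns such as politeness, robots.txt and revisit scheduling are left out.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seeds; return them in visit order.

    fetch_links(url) is a caller-supplied function returning the URLs
    linked from the page at url.
    """
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical toy web: each page maps to the pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: toy_web.get(url, []))
```

The `seen` set ensures each page is queued once even when several pages link to it, and `max_pages` bounds the crawl.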
Slide 4.20
Delivering a global search service
• See: Web Search for a Planet: The Google Cluster Architecture (IEEE Micro, 2003).