Search Engines: The players and the field The mechanics of a typical search. The search engine wars. Statistics from search engine logs. The architecture of a search engine. The query engine.
Page 1: Search engine basics

Search Engines: The players and the field

The mechanics of a typical search.

The search engine wars.

Statistics from search engine logs.

The architecture of a search engine.

The query engine.

Page 2: Search engine basics

Mechanics of a typical search

Page 3: Search engine basics

Results & ads returned ranked

Page 4: Search engine basics

Category of first result

Page 5: Search engine basics

Result for phrase query

Page 6: Search engine basics

Search on the Web

Corpus: The publicly accessible Web: static + dynamic

Goal: Retrieve high quality results relevant to the user’s need (not docs!)

Need:

Informational – want to learn about something (e.g. “Low hemoglobin”)

Navigational – want to go to that page (e.g. “United Airlines”)

Transactional – want to do something (web-mediated)

    Access a service (e.g. “Tampere weather”)

    Downloads (e.g. “Mars surface images”)

    Shop (e.g. “Nikon CoolPix”)

Gray areas

    Find a good hub (e.g. “Car rental Finland”)

    Exploratory search, “see what’s there” (e.g. “Abortion morality”)

Page 7: Search engine basics

Search Engines as Info Gatekeepers

Search engines are becoming the primary entry point for discovering web pages.

Ranking of web pages influences which pages users will view.

Exclusion of a site from search engines will cut off the site from its intended audience.

The privacy policy of a search engine is important.

Introna & Nissenbaum: Defining the Web: The Politics of Search Engines

Hindman et al: Googlearchy: How a few Heavily-Linked Sites Dominate Politics on the Web

Page 8: Search engine basics

Search Engine Wars

The battle for domination of the web search space is heating up!

The competition is good news for users!

Crucial: advertising is combined with search results!

What if one of the search engines manages to dominate the space?

Page 9: Search engine basics

Yahoo!

Synonymous with the dot-com boom, probably the best known brand on the web.

Started off as a web directory service in 1994, acquired leading search engine technology in 2003.

Has very strong advertising and e-commerce partners

Page 10: Search engine basics

Lycos!

One of the pioneers of the field

Introduced innovations that inspired the creation of Google

Page 11: Search engine basics

Google

Verb “google” has become synonymous with searching for information on the web.

Has raised the bar on search quality

Has been the most popular search engine in the last few years.

Had a very successful IPO in August 2004.

Is innovative and dynamic.

Has restored the glamour that CS lost in the dot-com bust

Page 12: Search engine basics

Live Search (was: MSN Search)

Synonymous with PC software.

Remember its victory in the browser wars with Netscape.

Developed its own search engine technology only recently, officially launched in Feb. 2005.

May link web search into its next version of Windows.

Page 13: Search engine basics

Ask Jeeves

Specialises in natural language question answering.

Search driven by Teoma.

Page 14: Search engine basics

Cuil

The latest kid on the block

Claims to have indexed 120B pages!

So far, it does not rank!

Page 15: Search engine basics

Experiment with query syntax

Default is AND, e.g. “computer chess” normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits.

“+chess” in a query means the user insists that “chess” be present in all hits.

“computer OR chess” means that at least one of the keywords must be present in every hit.

“"computer chess"” (in quotes) means that the phrase “computer chess” must be present in all hits.
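The difference between the AND, OR and phrase interpretations can be tried on a toy collection. The following is a minimal sketch, not any engine's actual query parser: the four documents and the helper names (`tokens`, `search_and`, `search_or`, `search_phrase`) are assumptions made for illustration, and queries are passed as already-split terms rather than parsed from a query string.

```python
import re

# Hypothetical toy collection; the texts are made up for illustration.
DOCS = {
    1: "computer chess programs beat grandmasters",
    2: "chess openings for beginners",
    3: "buying a new computer",
    4: "the chess computer deep blue",
}

def tokens(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"\w+", text.lower())

def search_and(terms):
    """Default AND: every term must occur in a hit."""
    return sorted(d for d, t in DOCS.items()
                  if all(w in tokens(t) for w in terms))

def search_or(terms):
    """OR: at least one term must occur in a hit."""
    return sorted(d for d, t in DOCS.items()
                  if any(w in tokens(t) for w in terms))

def search_phrase(phrase):
    """Phrase query: the words must occur adjacent and in order."""
    words = tokens(phrase)
    hits = []
    for d, t in DOCS.items():
        toks = tokens(t)
        windows = (toks[i:i + len(words)]
                   for i in range(len(toks) - len(words) + 1))
        if any(w == words for w in windows):
            hits.append(d)
    return sorted(hits)

print(search_and(["computer", "chess"]))   # [1, 4]
print(search_or(["computer", "chess"]))    # [1, 2, 3, 4]
print(search_phrase("computer chess"))     # [1]
```

Document 4 contains both words, so it matches the AND query, but the words appear in the order “chess computer”, so it does not match the phrase query.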

Page 16: Search engine basics

Statistics from search engine logs

Statistic (year)                     AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
Average terms per query              2.35               2.30               2.60
Average queries per session          2.02               2.80               2.30
Average result pages viewed          1.39               1.55               1.70
Usage of advanced search features    20.4%              1.0%               10.0%

Page 17: Search engine basics

The most popular search keywords

AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
sex                free               free
applet             sex                sex
porno              download           pictures
mp3                software           new
chat               uk                 nude

Page 18: Search engine basics

Web search users

Ill-defined queries

    Short length

    Imprecise terms

    Sub-optimal syntax (80% of queries without operator)

    Low effort in defining queries

Wide variance in

    Needs

    Expectations

    Knowledge

    Bandwidth

Specific behavior

    85% look over one result screen only (mostly above the fold)

    78% of queries are not modified (1 query/session)

    Follow links – “the scent of information” ...

Page 19: Search engine basics

Query Distribution

Power law: few popular broad queries, many rare specific queries
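To get a feel for what a power law implies, the sketch below assumes a Zipf-like model in which the frequency of the query at popularity rank r is proportional to 1/r. The slide only states that the distribution is a power law; the exponent of 1, the figure of 100,000 distinct queries, and the function name `zipf_shares` are all assumptions chosen for illustration.

```python
def zipf_shares(num_queries, exponent=1.0):
    """Share of traffic per query rank when frequency is proportional to rank**-exponent."""
    weights = [rank ** -exponent for rank in range(1, num_queries + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed: 100,000 distinct queries, exponent 1 (illustrative values only).
shares = zipf_shares(100_000)
head = sum(shares[:100])       # the 100 most popular queries
tail = sum(shares[10_000:])    # the 90,000 rarest queries

print(f"top 100 queries:       {head:.1%} of all traffic")
print(f"rarest 90,000 queries: {tail:.1%} of all traffic")
```

Under these assumptions a tiny head of popular queries carries a large share of the traffic, while the long tail of rare queries still adds up to a share too big to ignore.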

Page 20: Search engine basics

How far do people look for results?

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Page 21: Search engine basics

Architecture of a Search Engine

[Figure: search engine architecture. Components shown: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User. The figure includes a sample results page for the query “miele” (“Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)”) with four web results (www.miele.com, www.miele.co.uk, www.miele.de, www.miele.at) and three sponsored links.]

Page 22: Search engine basics

Rate of web content change

720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999 [Cho00]

Mathematically, what does this seem to be?

What does this suggest for crawling policy?

Page 23: Search engine basics

Diversity

Languages/Encodings

Hundreds of languages, W3C encodings: 55 (Jul01) [W3C01]

Home pages (1997): English 82%, Next 15: 13% [Babe97]

Google (mid 2001): English: 53%, JGCFSKRIP: 30%

Document & query topic

Popular Query Topics (from 1 million Google queries, Apr 2000)

Subtopic                   Share    Top-level topic   Share
Regional: Europe           1.8%     Business          7.2%
…                          …        …                 …
Business: Industries       2.3%     Recreation        7.3%
Computers: Internet        3.2%     Adult             8%
Computers: Software        3.4%     Society           8.7%
Adult: Image Galleries     4.4%     Regional          10.3%
Regional: North America    5.3%     Computers         13.8%
Arts: Music                6.1%     Arts              14.6%

Page 24: Search engine basics

Search Index - Inverted File

Also store position of word in web page (“offset”) and information on HTML structure.

Frequency
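An inverted file that records word offsets can be sketched in a few lines. This is a minimal illustration, not how any production index is laid out: the two pages, the tokenisation rule, and the function name `build_inverted_index` are assumptions, and information on HTML structure is not modelled. The frequency of a term in a page is simply the length of its offset list.

```python
import re
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to {page_id: [word offsets of the term in that page]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for offset, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(page_id, []).append(offset)
    return index

# Hypothetical pages, made up for illustration.
pages = {
    "p1": "chess is a game; computer chess is a research topic",
    "p2": "a computer plays chess",
}
index = build_inverted_index(pages)

print(index["chess"])              # {'p1': [0, 5], 'p2': [3]}
print(len(index["chess"]["p1"]))   # frequency of "chess" in p1: 2
```

Storing offsets rather than plain counts is what makes phrase queries possible: two terms form a phrase in a page when their offsets differ by exactly one.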

Page 25: Search engine basics

The query engine

The interface between the search index, the user and the web.

Algorithmic details of commercial search engines are kept as trade secrets.

First step is retrieval of potential results from the index.

Second step is the ranking of the results based on their “relevance” to the query.
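The two steps can be sketched as separate functions. Since real ranking algorithms are trade secrets, the score used here (summed term frequency) is only a stand-in, and the pages, the per-page term counts standing in for the search index, and the names `retrieve`, `rank` and `answer` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical pages, made up for illustration.
PAGES = {
    "p1": "miele vacuum cleaners and miele dishwashers",
    "p2": "vacuum cleaners with free shipping",
    "p3": "miele appliances vacuum",
}

def term_counts(text):
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))

# Per-page term counts stand in for the search index.
INDEX = {page_id: term_counts(text) for page_id, text in PAGES.items()}

def retrieve(query_terms):
    """Step 1: candidate pages that contain every query term."""
    return [p for p, counts in INDEX.items()
            if all(counts[t] > 0 for t in query_terms)]

def rank(candidates, query_terms):
    """Step 2: order candidates by a toy relevance score (summed term frequency)."""
    def score(page_id):
        return sum(INDEX[page_id][t] for t in query_terms)
    # Higher score first; ties broken by page id for a stable order.
    return sorted(candidates, key=lambda p: (-score(p), p))

def answer(query):
    terms = re.findall(r"\w+", query.lower())
    return rank(retrieve(terms), terms)

print(answer("miele vacuum"))   # ['p1', 'p3']
```

Keeping retrieval and ranking apart mirrors the slide: the first step only decides which pages are eligible, and the scoring function in the second step can be swapped out without touching the index lookup.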

Page 26: Search engine basics

Portal User Interface

Page 27: Search engine basics

Crawling the Web

Mode of crawl: BFS (breadth-first search)

Frequency of crawl: important

robots.txt gives explicit directions on what not to crawl

Parallel machines crawl all the time
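A breadth-first crawl that honours robots.txt can be sketched without touching the network. In this illustration the link graph, the robots.txt contents, the user-agent string `toy-crawler` and the function name `bfs_crawl` are all assumptions; the only library used is Python's standard `urllib.robotparser`. Crawl frequency and parallel crawling are not modelled, and a single robots.txt is assumed to cover every URL.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical link graph standing in for the Web (no network access).
LINKS = {
    "http://example.com/": ["http://example.com/a",
                            "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

# Hypothetical robots.txt for the site above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def bfs_crawl(seed, links, robots_lines, agent="toy-crawler"):
    """Visit pages in breadth-first order, skipping URLs that robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # disallowed: do not fetch, so its links are never seen
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order

print(bfs_crawl("http://example.com/", LINKS, ROBOTS_TXT.splitlines()))
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

The queue gives the breadth-first order: pages one link away from the seed are visited before pages two links away. Because the disallowed page is never fetched, the page it links to is never discovered in this run.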