Internet Resources Discovery (IRD) Search Engines Quality
Page 1

Internet Resources Discovery (IRD)

Search Engines Quality

Page 2

Problem: Quality/Reliability

Tsunami? (the flood of Web content makes quality hard to guarantee)

Page 3

Search Engines Generations

• 1st Generation - Basic SEs

• 2nd Generation - Meta SEs

• 3rd Generation - Popularity SEs

Page 4

1st Generation SEs

• Basic data about websites, against which queries are executed.

• Directories with basic indices, both general and specialized.

• Website ranking based on page content.

Page 5

Vector Space Model

• Documents and queries are represented as vectors.

• Vector features are words in the document or query, after stemming and removing stop words.

• Vectors are weighted to emphasize important terms.

• The query vector is compared to each document vector; the documents closest to the query are considered most similar and are returned (see the sketch below).
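A minimal sketch of this pipeline in Python. The stop-word list and the raw term-frequency weighting are simplifying assumptions; a real system would also stem terms and use smarter weights (e.g., TF-IDF):

import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "is", "and", "for", "it"}  # tiny illustrative list

def to_vector(text):
    # Tokenize and drop stop words; a real system would also stem each term.
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(terms)  # raw term frequency as the weight

def cosine(v1, v2):
    # Similarity = cosine of the angle between the two weighted vectors.
    dot = sum(w * v2[t] for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

query = to_vector("information retrieval search")
doc = to_vector("information retrieval helps us search the web")
print(cosine(query, doc))  # documents closest to the query are returned first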

Page 6

Example of Computing Scores

Document (d), an information-retrieval abstract:

"Information retrieval abstract. Meant to show how results are evaluated for all kinds of queries. There are two measures, recall and precision, and they change if the evaluation method changes. Information retrieval is important! It is used a lot for search engines that store and retrieve a lot of information, to help us search the World Wide Web."

Document-related term weights w(t,d):

term (t)      weight (w)
Information   3
Retrieval     3
Search        2

Page 7

Example of Computing Scores

Query vector:

term          weight
Information   100
Retrieval     100
Search        10

Document vector:

term          weight
Information   3
Retrieval     3
Search        2

Result vector (query weight * document weight, term by term):

term          weight
Information   300
Retrieval     300
Search        20

Score = 300 + 300 + 20 = 620
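The score here is simply the dot product of the query and document weight vectors. A quick check in Python, using the weights from the slide:

query = {"information": 100, "retrieval": 100, "search": 10}
doc = {"information": 3, "retrieval": 3, "search": 2}

# Multiply matching term weights, then sum: 300 + 300 + 20 = 620.
score = sum(weight * doc.get(term, 0) for term, weight in query.items())
print(score)  # 620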

Page 8

AltaVista's Search Ranking

• Prominence: how close the keywords are to the start of the page or of a sentence (appearing in the title, a heading, or the bottom of the page also matters).

• Proximity: how close keywords are to each other.

• Density and Frequency (see the sketch below):
  – density: the proportion (%) of keywords relative to the rest of the text.
  – frequency: the number of times the keywords occur within the text.
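AltaVista's exact formula was never published; this is only a rough sketch, under assumed definitions, of how the density and frequency factors could be computed:

def keyword_stats(text, keyword):
    words = text.lower().split()
    frequency = words.count(keyword.lower())            # raw occurrence count
    density = frequency / len(words) if words else 0.0  # share of all words
    return frequency, density

freq, dens = keyword_stats("search engines rank pages by search quality", "search")
print(freq, f"{dens:.0%}")  # 2 29%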

Page 9

2nd Generation SEs

• Uses several SEs in parallel.

• The results are filtered, ranked, and presented to the user as a unified list.

• The ranking combines the number of sources each page appeared in with its rank in each source (see the fusion sketch below).
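A sketch of one way to implement that combination; the engine names and the 1/position scoring rule are illustrative assumptions, not any particular meta SE's actual formula:

# Ranked result lists from three engines, best first (hypothetical data).
results = {
    "lycos":    ["a.com", "b.com", "c.com"],
    "excite":   ["b.com", "a.com"],
    "infoseek": ["b.com", "d.com", "a.com"],
}

scores = {}
for ranking in results.values():
    for position, url in enumerate(ranking):
        # A page earns points from every engine that lists it,
        # with higher placement worth more.
        scores[url] = scores.get(url, 0.0) + 1.0 / (position + 1)

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)  # ['b.com', 'a.com', 'd.com', 'c.com']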

Page 10

Meta SE is a Meta-Service

• It doesn't use an index/database of its own.

• It uses other external search services that provide the information necessary to fulfill user queries.

Page 11

Meta Search Engine

MetaCrawler

[Diagram: MetaCrawler forwards each query to underlying engines: Yahoo, WebCrawler, Open Text, Lycos, InfoSeek, Inktomi, Galaxy, and Excite.]

Page 12

Premises of MetaCrawler

• No single search is sufficient.

• Users have difficulty expressing their queries.

• Low-quality references can be detected.

Page 13

Search Service - Motivation

1. The number and variety of SEs.
2. Each SE provides an incomplete snapshot of the Web.
3. Users are forced to try and retry their queries across different SEs.
4. Each SE has its own interface.
5. Irrelevant, outdated, or unavailable responses.
6. There is no time for intelligence.
7. Each query is independent.
8. No individual customization.
9. The results are not homogenized.

Page 14

Problems

• No advanced search options.

• Using the lowest common denominator.

• Sponsored results from the SEs are not highlighted.

Page 15

3rd Generation SEs

• Emphasis on many and varied services.

• Higher quality.

• Faster search.

• Mainly uses external, "out of page" information.

• Better ranking methods.

Page 16

Google

• Ranks websites according to the number of links from other pages (see the PageRank sketch below).

• Increases ranking based on the page characteristics (keywords).

• Disadvantage: new pages will not appear in the results at first, because it takes time for other pages to link to them ("sandboxing").
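This link-counting idea is PageRank. A minimal power-iteration sketch on a toy four-page graph; the graph is made up, and 0.85 is the commonly cited damping factor:

# Toy link graph: page -> pages it links to (every page here has outlinks).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = list(links)
d = 0.85  # damping factor
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilize
    rank = {
        p: (1 - d) / len(pages)
           + d * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        for p in pages
    }

print(sorted(rank, key=rank.get, reverse=True))  # 'c' ranks first: most inbound links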

Page 17

AskJeeves (Teoma)

• Tries to direct the searcher to exactly the page that answers the question.

• When it cannot find something suitable in its own resources, it directs users to other sites via additional SEs.

• Uses a natural-language interface.

Page 18

DirectHit (1)

• Allows users, rather than search engines or directory editors, to organize search results.

• For a given query, saves the websites that users chose from the results page.

• Over time, learns the popular pages for each query.

Page 19

DirectHit (2)

• Calculates click popularity and stickiness (see the sketch after this list).

• Click popularity is a measure of the number of clicks received by each site in the results page.

• Stickiness is a measure of the amount of time a user spends at a site. It's calculated according to the time that elapses between each of the user's clicks on the search engine's results page.

• Gives more weight to clicks on sites ranked low in the results, since top-ranked sites attract clicks regardless of quality.
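A sketch of how these two signals might be derived from a results-page click log; the log format and the dwell-time proxy for stickiness are assumptions for illustration:

# Click log: (site, seconds until the user's next click back on the results page).
clicks = [
    ("a.com", 120), ("b.com", 5), ("a.com", 90),
    ("c.com", 300), ("b.com", 8),
]

popularity, dwell = {}, {}
for site, seconds in clicks:
    popularity[site] = popularity.get(site, 0) + 1  # click popularity
    dwell.setdefault(site, []).append(seconds)

# Stickiness: average time before the user comes back and clicks again.
stickiness = {site: sum(t) / len(t) for site, t in dwell.items()}
print(popularity)   # {'a.com': 2, 'b.com': 2, 'c.com': 1}
print(stickiness)   # {'a.com': 105.0, 'b.com': 6.5, 'c.com': 300.0}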

Page 20

Problems

• New websites will not get a high ranking, because most searchers visit only a small number of results (usually the first three).

• Spamming:

– Programs that can search for a certain keyword, find a company's site, and click on it.

– After remaining on the site for a specified amount of time, the program will go back and repeat the process.

Page 21

Some Evaluation Techniques

• Who wrote the page?
  – Use the info: operator or look at the "about us" page.
  – Check how popular/authoritative the author is.

• When was it written?

• Other indicators:
  – Why was the page written?
  – References/bibliography.
  – Links to other resources.

Page 22

Tools for Checking Quality

• Toolbars (Google, Alexa)

• Backward links (Google)

• PRSearch.net: http://www.prsearch.net/inbx.php

• Internet SEs FAQ: http://www.internet-search-engines-faq.com/find-page-rank.shtml

Page 23

Google and Alexa Toolbars

Page 24

Example: Google’s PageRank

Page 25

Alexa Toolbar Info

Page 26

References

• http://www.searchengines.com/ranking_factors.html
• http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
• http://gateway.lib.ohio-state.edu/tutor/les1/index.html
• http://www2.widener.edu/Wolfgram-Memorial-Library/webevaluation/webeval.htm