www.sims.monash.edu.au IMS5016/3616 Information Access Lecture 4 Information Seeking – the Internet

Dec 22, 2015

Page 1:

www.sims.monash.edu.au

IMS5016/3616 Information Access

Lecture 4

Information Seeking – the Internet

Page 2:

What information on the Web?

Page 3:

Reading

Henninger, Maureen [2003]. The hidden web: finding quality information on the net. Sydney: UNSW Press. pp. 94-119

[available electronically through unit site]

Page 4:

Outline

• Less-structured documents
• What is being sought on the Web
• Word searching
• Search engines
• Why full-text searching is good
• Why full-text searching is poor
• Phrases and other proximities
• Ranking algorithms

Page 5:

Less-structured documents

• Databases are structured
• Relational databases are very highly structured
• Documents are structured
• Is the structure reflected in Web documents?
• Nothing is unstructured

Page 6:

Structure in Web documents

• Look at a source document:
  – http://www.health.gov.au/internet/wcms/Publishing.nsf/Content/health-avian_influenza-index.htm

• What are the structural elements? [class discussion]

Page 7:

Structural elements of a Web page

• OK, I cheated – this is a particularly well-structured site, but it has:

  – Words
  – Paragraphs
  – Lists
  – Images
  – Metadata
  – etc. etc.
  – Required HTML structural elements – Head, Body and Identifier [big structure]
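
One way to see these structural elements for yourself is to count the tags in a page. The sketch below uses Python's standard `html.parser` module; the sample page, the `StructureSurvey` class and the `survey` function are made up for illustration and are not taken from the health.gov.au site.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureSurvey(HTMLParser):
    """Counts start tags and collects <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.metadata[attributes["name"]] = attributes["content"]


def survey(html_text):
    """Parse html_text and return (tag counts, metadata dictionary)."""
    parser = StructureSurvey()
    parser.feed(html_text)
    parser.close()
    return parser.tag_counts, parser.metadata


# Hypothetical page, invented for this example.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Avian influenza</title>
<meta name="DC.Subject" content="avian influenza; public health">
</head>
<body>
<h1>Avian influenza</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<ul><li>One</li><li>Two</li></ul>
<img src="bird.png" alt="A bird">
</body>
</html>"""

tag_counts, metadata = survey(SAMPLE_PAGE)
print(tag_counts["p"], tag_counts["li"], tag_counts["img"])
print(metadata)
```

Even this small page shows paragraphs, a list, an image and metadata nested inside the required head and body.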

Page 8:

What is being sought on the Web

• Text
• Images [static and moving]
• Sound/music
• Entertainment
• In general, Information?

Page 9:

Image searching

• Relies on complex algorithms. E.g.
  – Google analyzes
    > the text on the page adjacent to the image,
    > the image caption and
    > dozens of other factors to determine the image content.
  – Google also uses sophisticated algorithms to
    > remove duplicates and
    > ensure that the highest quality images are presented first. [Google Help pages]

Page 10:

Music

• Somewhat problematic
  – Identification
  – Retrieval
  – Quality
  – Rights management
  – Streaming audio

Page 11:

Entertainment

• Generally relies on knowing where to look, or
• At least having a starting point
• A process based as much on serendipity as anything else.

Page 12:

Text Information

• Data with structure
  – There’s not all that much data on the Web
  – There may not be that much information
  – Lots of opinion and commentary
• Words are the way by which much information is sought, using Search Engines

Page 13:

Search Engines

• Three parts to the search engine:
  – A robot/knowbot/harvester/agent piece of software, used to seek out and download web pages, usually by following links
  – A database of web pages and the indexes whereby it can be searched
  – A front-end application – the bit the user sees.
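
The three parts can be sketched as three small functions. This is a toy under stated assumptions, not how any real engine is built: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`), and the names `harvest`, `build_index` and `search` are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("avian influenza information for the public", ["a.example/faq"]),
    "a.example/faq": ("influenza questions and answers", ["a.example/home", "b.example/news"]),
    "b.example/news": ("news about search engines", []),
}


def harvest(start_url, web):
    """Part 1, the robot: follow links from start_url and collect page text."""
    collected, queue = {}, [start_url]
    while queue:
        url = queue.pop(0)
        if url in collected or url not in web:
            continue
        text, links = web[url]
        collected[url] = text
        queue.extend(links)
    return collected


def build_index(pages):
    """Part 2, the database: map each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Part 3, the front end: return URLs that contain every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


pages = harvest("a.example/home", TOY_WEB)
index = build_index(pages)
print(search(index, "influenza"))
print(search(index, "avian influenza"))
```

Note that the front end never touches the Web itself; it only consults the index built from pages the robot has already downloaded.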

Page 14:

Searcher Robot

• A small application – won’t interfere with the operation of the site

• The site is typically back-loaded to the search engine database

• Links [to a certain level?] are followed to seek more and more sites

• More than one copy of the harvester working?
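
Following links "to a certain level" can be sketched as a breadth-first walk with a depth limit. The link graph `LINKS` and the function `crawl` are hypothetical; a real harvester would fetch pages over the network and would normally respect each site's crawling rules.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["about", "news"],
    "about": ["staff"],
    "news": ["story1", "home"],
    "staff": ["cv"],
    "story1": [],
    "cv": [],
}


def crawl(start, links, max_depth):
    """Breadth-first crawl that stops following links past max_depth."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # page is recorded, but its links are not followed
        for target in links.get(page, []):
            if target not in depth_of:  # skip pages already seen
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


print(crawl("home", LINKS, max_depth=2))
```

With a limit of 2, the page "cv" (three links away from "home") is never reached, which is one reason deep pages can be missing from a search engine's database.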

Page 15:

Size of a Search Engine

• This is really big. More than 10,000 servers
• Officially, Google says that it processes more than 150 million searches a day, but the true number is probably much higher.
• According to Nielsen/NetRatings, 67.6 million people worldwide visited Google an average of 6.2 times in December [2002].

• Analysts guess that [2002] revenue was between $60 million and $300 million.

Page 16:

Google Query Process

• Here’s the original paper by the founders of Google about how it all works:

http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

• This site from Google shows their model of what goes on with a query:

http://www.google.com/press/query.html

Page 17:

Other Search Engines

• In general these rely on hits of a word on each page of a site.

• The ranking algorithm is the tricky bit [read: “trade secret”]

• The advantages and disadvantages of full-text searching have been known for ages

Page 18:

Full-text searching

• Every word [except stopwords?] in the document is indexed.

• Sometimes the word order position is recorded.

• The number of times the word occurs in the document is usually recorded.
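
The three points above describe what is usually called a positional inverted index. A minimal sketch follows, assuming a simple lowercase tokenizer; the documents, the stopword set and the function names are illustrative only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_positional_index(documents, stopwords=frozenset()):
    """Return {word: {doc_id: [positions]}}; len(positions) is the word count."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in stopwords:
                continue  # position numbering still counts the skipped word
            index[word].setdefault(doc_id, []).append(position)
    return index


# Hypothetical two-document collection.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat and the cat ran",
}

index = build_positional_index(DOCS, stopwords={"the", "and"})
print(index["cat"])
print(len(index["cat"]["d2"]))
```

Storing positions costs space, but it is what makes phrase and proximity searching possible later on.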

Page 19:

Advantages of full-text indexing

• Generally the source text is electronic [so it’s easy to load to computer].

• Author has control of the terms that are used to identify her work.

• The index is cheap to build, as it is created by a computer, not people.

• Users usually search naively for “keywords” rather than controlled terms, so the index supports their behaviour

Page 20:

Disadvantages of full-text searching

• Lack of control
• Any added metadata is merely abused
• Poor contextual information
• English is especially rich/verbose/ambiguous

Page 21:

Metadata

• Structure
• Control [sometimes]
• Predictability
• Human intervention generally necessary
• Expensive
• Not yet standardised

Page 22:

Why not [just] use metadata

• Web sites [and some other electronic documents] are volatile – why index something that will be quickly gone?

• Adding metadata is expensive [it needs humans?]

• Metadata standards are either obscure [but specific] or too general to be really useful [Dublin Core?].

Page 23:

Boolean logic – a problem

• Boolean logic as an approach to searching is eminently well suited to digital [binary] machines but

• As humans in a pluralist environment we are accustomed to shades of grey, and the subtle shades of meaning that gives us.
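
The fit between Boolean logic and machines is easy to see in code: each operator is a single set operation. The postings below are hypothetical document numbers, invented for the example.

```python
# Hypothetical postings: word -> set of document ids containing it.
POSTINGS = {
    "avian": {1, 2, 5},
    "influenza": {2, 3, 5},
    "vaccine": {3, 5, 6},
}
ALL_DOCS = {1, 2, 3, 4, 5, 6}

both = POSTINGS["avian"] & POSTINGS["influenza"]        # avian AND influenza
either = POSTINGS["avian"] | POSTINGS["influenza"]      # avian OR influenza
without = POSTINGS["influenza"] - POSTINGS["vaccine"]   # influenza NOT vaccine
neither = ALL_DOCS - either                             # NOT (avian OR influenza)

print(sorted(both), sorted(either), sorted(without), sorted(neither))
```

Every document is either in a result set or out of it. There is no way to say "mostly about influenza", which is the shades-of-grey problem.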

Page 24:

George Boole and his Logic

Page 25:

Stopwords

• A list of words that don’t provide discrimination of one document from another.

• “a”, “and”, “of”, “the”, “or”, “that” etc.
• The list is often derived from the occurrences in the document body.
• The first 20 account for more than 40% of words in documents.
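
Deriving a stopword list from the document body amounts to counting words and taking the most frequent. The sketch below does this for a tiny invented sample, so the share it reports will not match the 40% figure, which refers to real collections; the function name is an assumption for the example.

```python
import re
from collections import Counter


def top_words_and_coverage(texts, how_many):
    """Return the most frequent words and the share of all tokens they cover."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(how_many)
    coverage = sum(count for _, count in top) / total
    return [word for word, _ in top], coverage


# Hypothetical sample; a real list would be built from the whole collection.
SAMPLE = [
    "the cat sat on the mat and the dog sat on the rug",
    "a cat and a dog and a bird",
]

words, coverage = top_words_and_coverage(SAMPLE, how_many=3)
print(words, round(coverage, 2))
```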

Page 26:

A Problem?

“To be or not to be”

A test for any search engine that excludes stopwords from its indices
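
The problem is easy to reproduce. The stopword set below is an assumption (a slightly longer list than the examples given earlier); with it, nothing of the query is left to look up.

```python
# Assumed stopword list, for illustration only.
STOPWORDS = {"a", "and", "be", "not", "of", "or", "that", "the", "to"}


def indexable_terms(query, stopwords):
    """Return the query words that survive stopword removal."""
    return [word for word in query.lower().split() if word not in stopwords]


print(indexable_terms("To be or not to be", STOPWORDS))    # → []
print(indexable_terms("the quality of mercy", STOPWORDS))  # → ['quality', 'mercy']
```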

Page 27:

Phrases and other proximities

• To search for a pair of adjacent words it is necessary to know that there is a document in which both words occur, and the word order in that document
• Adjacent words can then be retrieved
• What about adjacent words in different sentences?
• What about phrases containing stop words?
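
A phrase search over recorded word positions might look like the sketch below. It assumes a tokenizer that throws punctuation away, which is exactly why the third call matches across a sentence boundary. Function names and sample sentences are invented for the example.

```python
import re


def positions_by_word(text):
    """Map each word to the list of positions at which it occurs."""
    positions = {}
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        positions.setdefault(word, []).append(position)
    return positions


def contains_phrase(text, phrase):
    """True if the words of phrase occur consecutively, in order, in text."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return False
    positions = positions_by_word(text)
    if any(word not in positions for word in words):
        return False
    # A phrase starts at p if word i is found at position p + i for every i.
    return any(
        all(start + offset in positions[word] for offset, word in enumerate(words))
        for start in positions[words[0]]
    )


print(contains_phrase("Search engines rank web pages", "web pages"))
print(contains_phrase("Pages on the web", "web pages"))
print(contains_phrase("It was the web. Pages were few.", "web pages"))
```

The first call finds the phrase, the second does not (right words, wrong order), and the third reports a match even though the two words sit in different sentences.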

Page 28:

Ranking

Page 29:

Ranking is the secret

The basics of how the search engines work are pretty similar.

The exact algorithms used to achieve ranking are proprietary and secret.

The ranking process is what gives each engine its advantage.

Page 30:

Ranking by “older” search engines

• Boolean “and” before Boolean “or” [before a non-Boolean “some”?]

• Word count
• Balanced or total word count
• Comparative word frequency
• Word proximity

Page 31:

Boolean “and” before Boolean “or”

Most full-text Web search tools will return items that contain all the “keyword” search words ranked ahead of items containing fewer of the search terms, finally reaching a point where items contain only one of the terms.

Other database search tools are much more rigid in adhering to “and” or “or” statements from users.
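
The "all terms first, then fewer" behaviour can be sketched by scoring each document on how many distinct query terms it contains. This is an illustration of the idea only, not any particular engine's method; the documents and the function name are hypothetical.

```python
def rank_by_terms_matched(documents, query):
    """Order documents by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        matched = len(terms & set(text.lower().split()))
        if matched:
            scored.append((matched, doc_id))
    # Most terms matched first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Hypothetical collection.
DOCS = {
    "d1": "avian influenza outbreak",
    "d2": "influenza vaccine",
    "d3": "avian influenza vaccine trial",
    "d4": "bird watching",
}

print(rank_by_terms_matched(DOCS, "avian influenza vaccine"))
```

The document matching all three terms (the "and" case) comes first, the partial matches (the "or" case) follow, and a document matching nothing is left out.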

Page 32:

Word Count

• The number of hits for either each or all of the words may be used to rank the items.

Page 33:

Balanced or Total word count

• The ranking may be based on
  – An even distribution of the search term occurrences
  – A distribution that is to be expected, depending on word count frequencies for the search terms across the entire database
  – The inverse of that [because the “unexpected” is unusual, or specific]
  – The total count of occurrences of all search terms in the retrieved document.
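
Two of these options can be compared side by side. The sketch below computes the total count and, as one reading of the "inverse" option, a score that weights each term by how rare it is across the collection (the weighting commonly known as TF-IDF). Treating the "inverse" option as TF-IDF is an assumption made for this example, as are the documents and names.

```python
import math
from collections import Counter


def score_documents(documents, query):
    """Return {doc_id: (total_count, weighted_score)} for the query terms."""
    terms = query.lower().split()
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Number of documents containing each term, across the whole collection.
    doc_freq = {t: sum(1 for c in counts.values() if c[t] > 0) for t in terms}
    scores = {}
    for doc_id, c in counts.items():
        total = sum(c[t] for t in terms)
        # Rare terms get a large weight; a term found in every document gets 0.
        weighted = sum(
            c[t] * math.log(n_docs / doc_freq[t]) for t in terms if doc_freq[t]
        )
        scores[doc_id] = (total, weighted)
    return scores


# Hypothetical collection: "web" is everywhere, "knowbot" is rare.
DOCS = {
    "d1": "web web web search",
    "d2": "web knowbot",
    "d3": "web page",
}

scores = score_documents(DOCS, "web knowbot")
for doc_id, (total, weighted) in scores.items():
    print(doc_id, total, round(weighted, 3))
```

By total count d1 ranks first, because it repeats the common word; by the weighted score d2 ranks first, because it contains the unusual, more specific term.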

Page 34:

Word proximity

Searchers may enter a pair of [or more] search terms because they are looking for a phrase, so the ranking is based on word proximity, highest ranking given to immediate adjacency [i.e. as a phrase], with subsequent documents ranked by the proximity of some or all the search terms.
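
A rough sketch of proximity ranking for two terms: measure the smallest gap, in words, between them in each document and sort by that gap. The function names and documents are invented for the example, and the measure deliberately ignores word order.

```python
import re


def min_gap(text, first, second):
    """Smallest distance in words between any occurrence of the two terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_at = [i for i, w in enumerate(words) if w == first]
    second_at = [i for i, w in enumerate(words) if w == second]
    if not first_at or not second_at:
        return None  # at least one term is missing from this document
    return min(abs(a - b) for a in first_at for b in second_at)


def rank_by_proximity(documents, first, second):
    """Documents containing both terms, closest pair first."""
    gaps = {doc_id: min_gap(text, first, second) for doc_id, text in documents.items()}
    found = [(gap, doc_id) for doc_id, gap in gaps.items() if gap is not None]
    return sorted(found)


# Hypothetical collection.
DOCS = {
    "d1": "the search engine ranks pages",
    "d2": "search for a rotary engine",
    "d3": "an engine of growth",
}

print(rank_by_proximity(DOCS, "search", "engine"))
```

A gap of 1 means the two words are immediately adjacent, so d1 (where they form a phrase) outranks d2.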

Page 35:

Ranking by Google

• Citation analysis by another name
• A popularity poll, with high ranked Google sites having more votes [more weighting] than others.

• Is this a positive feedback system? How does a new site get a high ranking?
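
The weighted popularity poll can be sketched as a simplified PageRank calculation in the spirit of the founders' paper. This is a toy under stated assumptions: the link graph is invented, every link target must itself appear as a key, and Google's production ranking is secret and certainly uses more than this. The damping value 0.85 is the one suggested in the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch; links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its current rank evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph.
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "new": ["hub"],  # links out, but nothing links to it
}

ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this sketch the page "new" never rises above the baseline score, because nobody links to it. That is the positive-feedback worry in miniature: votes come from pages that are already known.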

Page 36:

Clustering

• Attempts to group retrieved items by some common elements.

• Linked to notions of how portals might work

• Could require AI techniques

Page 37:

Personal/Desktop searchers

• Google has released version 1.0 of their desktop searcher. Download from http://desktop.google.com/?promo=mp-gds-v1-1 [Is this ver 1.1 already?]

• Another candidate is Copernic. Download ver 1.2 from http://www.copernic.com/en/products/desktop-search/download.html

• Each is free, quite powerful, very useful.

Page 38:

A question?

• What does ranking, say beyond 300 sites [or 100, or 50?], mean to users? Google, for example, has an option to limit outcomes to 100 sites. Perhaps 100 sites is the number most users regard as their maximum search depth.

Page 39:

Summary

• Less-structured documents
• What is being sought on the Web
• Word searching
• Search engines
• Why full-text searching is good
• Why full-text searching is poor
• Phrases and other proximities
• Ranking algorithms