LIS618 lecture 6 Thomas Krichel 2004-03-13
LIS618 lecture 6 Thomas Krichel 2004-03-13. Structure Google –news –interfaces to non-web sources Usenet ODP relational databases OpenURL file sharing.

Mar 27, 2015

LIS618 lecture 6

Thomas Krichel

2004-03-13

Structure

• Google
  – news
  – interfaces to non-web sources

• Usenet

• ODP

• relational databases

• OpenURL

• file sharing

Google news

• Is a gathering of top stories from news sites.

• The entire page is built by computer. Which stories make it to the top depends on
  – how prominently the stories appear on news sites
  – which sites the stories appear on
  – when the articles were published
  – how many articles cover the same story

• Note the side bar with stories from different topic sections.

special syntax for news I

• source: gives news from one source only
  – example “source:cnn” works
  – examples “source:bbc”, “source:nytimes”, “source:"new york times"” don’t seem to get anywhere.

• location: gives a location. Can be a two-letter state code or a country
  – “location:ny”
  – “location:russia”

special syntax for news II

• allintitle: searches for words in the title of the article (not of the page)
  – example: “allintitle: dead injured”

• allintext: searches for words in the text
  – example: “allintext: saarland government”

• allinurl: searches in article URLs
  – example: “allinurl:bbc Wales”

Google interfaces to 3rd party data

• Google Groups is an interface to Usenet news.

• Google Directory is an interface to the Open Directory Project.

• In both cases Google is dependent on the quality of the underlying data source.

Usenet news

• Usenet is a collection of user-submitted notes on various subjects that are posted to servers on a worldwide network. Each subject collection of posted notes is known as a newsgroup.

• A newsgroup is a discussion about a particular subject consisting of notes written to a networked site and distributed through Usenet.

• Newsgroups are hierarchical. Hierarchical levels are separated by dots, for example comp.text.tex.

• alt, news, info, biz, rec, comp, sci, humanities, soc, misc, talk are classic world-wide top-level hierarchies.
  – alt stands for “alternative”; a popular joke expands it to “anarchists, lunatics and terrorists”.
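The dotted naming scheme can be taken apart mechanically. A minimal sketch, assuming two hypothetical helper names (`hierarchy_levels`, `enclosing_hierarchies`) that are not part of any news software:

```python
def hierarchy_levels(newsgroup):
    """Split a newsgroup name into its hierarchical levels."""
    return newsgroup.split(".")


def enclosing_hierarchies(newsgroup):
    """List every enclosing hierarchy, from the top level down to the group itself."""
    levels = hierarchy_levels(newsgroup)
    return [".".join(levels[:i]) for i in range(1, len(levels) + 1)]


if __name__ == "__main__":
    print(hierarchy_levels("comp.text.tex"))
    print(enclosing_hierarchies("rec.music.classical.recordings"))
```

The first element of the split is the top-level hierarchy, such as comp or rec.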

Usenet history

• The idea of network news was born in 1979 when two graduate students, Tom Truscott and Jim Ellis, thought of using UUCP to connect machines for the purpose of information exchange among users. They set up a small network of three machines in North Carolina.

• UUCP is “UNIX to UNIX copy”, a protocol used to copy files between machines running some flavor of UNIX, without the need for the IP protocol. Usenet is thus older than the TCP/IP-based Internet.

decline of Usenet

• essentially open to all (peer-to-peer system)

• used by spammers for
  – posting
  – gathering addresses

• steady decline of quality of contributions

• steady decline of quantity of contributions

Usenet worth checking out

• independent reviews of products, often written by experts.

• Example: interpretation of Beethoven sonatas by Wilhelm Kempff.

• Sorting by date reveals that the newsgroup rec.music.classical.recordings is still active. On a good day, you will find no finer guide to records.

special syntax for Google Groups

• group: limits postings to a certain group

• intitle: limits to titles of postings

• author: searches for author name or email address

• Mixing syntaxes works well.

• Example: “intitle:kempff group:rec.music.classical.recordings”

the open directory project

• “The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors.”

• They claim a historical precedent in the Oxford English Dictionary, which was also compiled with the help of volunteers.

• Formerly known as “GnuHoo”, then “NewHoo”, then acquired by Netscape and called “dmoz”.

dmoz.org

• dmoz is maintained by volunteer “net-citizens”. No special qualifications are required, but the editors are claimed to be experts.

• There are about 30,000 volunteers (they claim).

• Powers the core directory services for the Web's largest and most popular search engines and portals
  – Netscape Search
  – AOL Search
  – Google
  – Lycos
  – HotBot
  – DirectHit

• Headquarters run by Netscape

Appearance of ODP

• If Google finds a relevant category it put

• A Google response is a list of results.

• Each result has
  – title
  – snippet
  – URL

• Some results optionally have a category attached. Following such categories is a winner if your information need is broad.

full-text databases

• These databases have an emphasis on providing full-text information in a web environment.

• Their particular strength is the aggregation of material from a range of publishers.

• This especially concerns scholarly publishing, where the source material is distributed among a large number of sources.

Access

• Some of the access is arranged via the Brooklyn LIU campus.

• http://rpaserver.liu.edu/rpa/webauth.exe

• The ones we can play with are
  – ProQuest, http://rpaserver.liu.edu/rpa/webauth.exe?rs=pq&lb=b
  – EBSCOhost, http://rpaserver.liu.edu/rpa/webauth.exe?rs=eh&lb=b

• The databases have some full-text, but not a lot.

Proquest

• Go into the database selection, delete everything and then use the research library.

• We can search for Paul Levine. It appears that
  – not all articles have full text
  – there is no distinction between different Paul Levines

• Otherwise it appears straightforward to use.

aggregators

• ProQuest and EBSCO work as aggregators. They put different scholarly journals together in one database, so you don’t have to deal with publishers’ different interfaces.

• Publishers are reluctant to join and impose moving-wall embargoes on full-text release.

• So you cannot always access the full text via them. But your library may have the text somewhere.

the library as aggregator

• Typically, a library buys holdings from a publisher, as well as cross-publisher abstracting and indexing data.

• When users find a reference in the abstracting and indexing data and want to access the full text, they are stuck.

• Herbert Van de Sompel has been working on this problem.

special effects (SFX)

• Herbert’s idea was to equip the interface with a special effects button.

• When users press the button, the interface would transmit metadata such as
  – author name
  – journal name
  – title
  – date

  to a special database, called a resolver.

resolver

• The resolver examines the metadata and makes a decision on what to show to the user.
  – if the journal is subscribed to and the date is recent, it may formulate a query to the publisher’s database and fetch the record and/or full text there.
  – if the journal is not held, suggest ILL
  – etc…
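The resolver's decision can be pictured as a small rule table. The following is only a sketch under stated assumptions: the `Citation` class, the `SUBSCRIPTIONS` table, the journal name and the service labels are all made up for illustration, and a real resolver is configured by the library rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Metadata the interface sends to the resolver."""
    author: str
    journal: str
    title: str
    year: int


# Hypothetical local configuration: journal name -> first year covered
# by the library's electronic subscription.
SUBSCRIPTIONS = {"Journal of Examples": 1997}


def resolve(citation, subscriptions=SUBSCRIPTIONS):
    """Return the list of services to offer the user for this citation."""
    first_year = subscriptions.get(citation.journal)
    if first_year is None:
        # Journal not held at all: suggest interlibrary loan (ILL).
        return ["interlibrary loan request"]
    if citation.year >= first_year:
        # Subscribed and recent enough: point at the publisher's copy.
        return ["full text at publisher", "record at publisher"]
    # Subscribed, but the article predates the electronic coverage.
    return ["check print holdings", "interlibrary loan request"]


if __name__ == "__main__":
    print(resolve(Citation("Levine, Paul", "Journal of Examples", "A title", 2003)))
```

The function returns a menu of services, not the answer to one specific query, which is the point the slides make about resolvers.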

configuring the resolver

• Librarians, who know the local setting, will configure the server so that users are given the extended services appropriate to the local circumstances.

• Note that what is returned is a set of extended services, not the response to a specific query.

Bison Futé model

• This refers to further work by Herbert to generalize the idea.
  – On a web page, you find a link. It has been made by the provider of the web page.
  – But this link may not be appropriate. There may be better technology that allows you to move in the same direction but with your own link.
  – In other words, we talk about context-sensitive linking.

OpenURL

• This is now a draft standard with NISO to standardize the special effects request.

• The OpenURL is a transport architecture for context objects.

• Context objects unite descriptions of
  – the reference found
  – the context in which it was found
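In practice a context object usually travels as key/value pairs appended to the address of the local resolver. A minimal sketch, with loud assumptions: the resolver address and the referrer identifier are invented, and the key names follow the key/encoded-value form as I recall it from the version later published as ANSI/NISO Z39.88-2004, so check them against the standard before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; each library configures its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"


def make_openurl(author_last, journal, article_title, year, referrer_id):
    """Encode a reference and the place it was found as a resolver URL."""
    context_object = {
        "url_ver": "Z39.88-2004",                       # assumed version tag
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # referent is a journal article
        "rft.aulast": author_last,                      # the reference found ...
        "rft.jtitle": journal,
        "rft.atitle": article_title,
        "rft.date": str(year),
        "rfr_id": referrer_id,                          # ... and where it was found
    }
    return RESOLVER_BASE + "?" + urlencode(context_object)


if __name__ == "__main__":
    print(make_openurl("Levine", "Journal of Examples", "A title", 2003,
                       "info:sid/example.org:abstracts"))
```

The same citation yields a different link at every library, because only `RESOLVER_BASE` changes; that is what makes the linking context-sensitive.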

implications for information retrieval

• The implications for the library world are already important.
  – many library software systems already implement OpenURL and provide resolvers

• But the impact could be wider and could cover a whole new structure for the web, replacing static links with on-the-fly dynamic ones.

Databases

• Databases are collections of data with some organization to them.

• The classic example is the relational database.

• But not all databases need to be relational databases.

Relational databases

• A relational database is a set of tables. There may be relations between the tables.

• Each table has a number of records. Each record has a number of fields.

• When the database is being set up, we fix
  – the size of each field
  – relationships between tables

Example: Movie database

ID | title | director | date
M1 | Gone with the wind | F. Ford Coppola | 1963
M2 | Room with a view | Coppola, F Ford | 1985
M3 | High Noon | Woody Allan | 1974
M4 | Star Wars | Steve Spielberg | 1993
M5 | Alien | Allen, Woody | 1987
M6 | Blowing in the Wind | Spielberg, Steven | 1962

• Single table

• No relations between tables, of course

Problem with this database

• I made up all the data. It is just for illustration.

• Names are recorded inconsistently. There is no way to find films by Woody Allen without having to go through all spelling variations.

• Mistakes are difficult to correct. We have to wade through all records, a masochist’s pleasure.

Better movie database

ID | title | director | year
M1 | Gone with the wind | D1 | 1963
M2 | Room with a view | D1 | 1985
M3 | High Noon | D2 | 1974
M4 | Star Wars | D3 | 1993
M5 | Alien | D2 | 1987
M6 | Blowing in the Wind | D3 | 1962

ID | director name | birth year
D1 | Ford Coppola, Francis | 1942
D2 | Allen, Woody | 1957
D3 | Spielberg, Steven | 1942

Relational database

• We have a one-to-many relationship between directors and films
  – Each film has one director
  – Each director may have directed many films

• Here it becomes possible for the computer, and then the user
  – To know which films have been directed by Woody Allen
  – To find which films have been directed by a director born in 1942
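The two questions above can be tried out with Python's built-in `sqlite3` module. This is a sketch using the made-up data from the slides; the table and column names (`movies`, `directors`, `birth_year`) and the function names are my own choices, and the director is spelled "Allen, Woody" in this example.

```python
import sqlite3


def build_movie_db():
    """Create the two-table movie database in memory and return the connection."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE directors (
            id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
        CREATE TABLE movies (
            id TEXT PRIMARY KEY, title TEXT,
            director TEXT REFERENCES directors(id),  -- one director per film
            year INTEGER);
    """)
    db.executemany("INSERT INTO directors VALUES (?, ?, ?)", [
        ("D1", "Ford Coppola, Francis", 1942),
        ("D2", "Allen, Woody", 1957),
        ("D3", "Spielberg, Steven", 1942),
    ])
    db.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", [
        ("M1", "Gone with the wind", "D1", 1963),
        ("M2", "Room with a view", "D1", 1985),
        ("M3", "High Noon", "D2", 1974),
        ("M4", "Star Wars", "D3", 1993),
        ("M5", "Alien", "D2", 1987),
        ("M6", "Blowing in the Wind", "D3", 1962),
    ])
    return db


def films_by_director(db, name):
    """Titles of all films whose director record carries this name."""
    rows = db.execute(
        "SELECT m.title FROM movies m JOIN directors d ON m.director = d.id "
        "WHERE d.name = ? ORDER BY m.id", (name,))
    return [title for (title,) in rows]


def films_by_director_birth_year(db, birth_year):
    """Titles of all films directed by someone born in the given year."""
    rows = db.execute(
        "SELECT m.title FROM movies m JOIN directors d ON m.director = d.id "
        "WHERE d.birth_year = ? ORDER BY m.id", (birth_year,))
    return [title for (title,) in rows]


if __name__ == "__main__":
    db = build_movie_db()
    print(films_by_director(db, "Allen, Woody"))
    print(films_by_director_birth_year(db, 1942))
```

Because each name is stored once in `directors`, correcting a spelling means changing one record instead of wading through every film.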

Many-to-many relationships

• Each film has one director, but many actors star in it. The relationship between actors and films is a many-to-many relationship.

• Here are a few actors

ID | sex | actor name | birth year
A1 | f | Brigitte Bardot | 1972
A2 | m | George Clooney | 1927
A3 | f | Marilyn Monroe | 1934

Actor/Movie table

actor id | movie id
A1 | M4
A2 | M3
A3 | M2
A1 | M5
A1 | M3
A2 | M6
A3 | M4
… as many lines as required

SQL

• Once we have the relational database, we can ask sophisticated questions:
  – Which director has had the most female actors working for him?
  – In which years were films shot that starred actors born between 1926 and 1935?

• Such questions can be encoded in a language known as “structured query language” or SQL. All relational database vendors implement a dialect of SQL.
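Both questions can be written as SQL joins across the four tables. A sketch in SQLite's dialect, run through Python's `sqlite3` module; it uses only the made-up rows shown on the slides, the table name `actor_movie` and the column names are my own, and "most female actors" is read here as the number of distinct actresses.

```python
import sqlite3

SCHEMA_AND_DATA = """
CREATE TABLE directors (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movies (id TEXT PRIMARY KEY, title TEXT, director TEXT, year INTEGER);
CREATE TABLE actors (id TEXT PRIMARY KEY, sex TEXT, name TEXT, birth_year INTEGER);
-- The link table stores the many-to-many relationship, one pair per row.
CREATE TABLE actor_movie (actor_id TEXT, movie_id TEXT);

INSERT INTO directors VALUES
  ('D1', 'Ford Coppola, Francis', 1942),
  ('D2', 'Allen, Woody', 1957),
  ('D3', 'Spielberg, Steven', 1942);
INSERT INTO movies VALUES
  ('M1', 'Gone with the wind', 'D1', 1963),
  ('M2', 'Room with a view', 'D1', 1985),
  ('M3', 'High Noon', 'D2', 1974),
  ('M4', 'Star Wars', 'D3', 1993),
  ('M5', 'Alien', 'D2', 1987),
  ('M6', 'Blowing in the Wind', 'D3', 1962);
INSERT INTO actors VALUES
  ('A1', 'f', 'Brigitte Bardot', 1972),
  ('A2', 'm', 'George Clooney', 1927),
  ('A3', 'f', 'Marilyn Monroe', 1934);
INSERT INTO actor_movie VALUES
  ('A1', 'M4'), ('A2', 'M3'), ('A3', 'M2'), ('A1', 'M5'),
  ('A1', 'M3'), ('A2', 'M6'), ('A3', 'M4');
"""


def open_db():
    """Build the example database in memory."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA_AND_DATA)
    return db


def director_with_most_actresses(db):
    """Return (director name, number of distinct female actors)."""
    return db.execute("""
        SELECT d.name, COUNT(DISTINCT a.id) AS actresses
        FROM directors d
        JOIN movies m ON m.director = d.id
        JOIN actor_movie am ON am.movie_id = m.id
        JOIN actors a ON a.id = am.actor_id
        WHERE a.sex = 'f'
        GROUP BY d.id, d.name
        ORDER BY actresses DESC
        LIMIT 1
    """).fetchone()


def years_with_actors_born_between(db, first, last):
    """Years in which films starring actors born in [first, last] were shot."""
    rows = db.execute("""
        SELECT DISTINCT m.year
        FROM movies m
        JOIN actor_movie am ON am.movie_id = m.id
        JOIN actors a ON a.id = am.actor_id
        WHERE a.birth_year BETWEEN ? AND ?
        ORDER BY m.year
    """, (first, last))
    return [year for (year,) in rows]


if __name__ == "__main__":
    db = open_db()
    print(director_with_most_actresses(db))
    print(years_with_actors_born_between(db, 1926, 1935))
```

Neither query needs a change to the tables; the join conditions alone follow the relationships fixed when the database was set up.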

importance of relational databases

• Relational databases dominate the world of structured information. Examples
  – employment and payroll in a company
  – stock management
  – e-commerce

• There are quite easy ways to get relational databases to work with web interfaces. Some are freely available. The most common one is the LAMP (Linux Apache MySQL PHP) architecture.

relational databases in libraries

• A 2004 enquiry on the LITA revealed that what many respondents regretted most was not having learned more about relational databases in library school.

• But there are problems with relational databases in libraries
  – Slow on very large databases (such as catalogs)
  – Library data has nasty ad-hoc relationships, e.g.
    • Translation of the first edition of a book
    • CD supplement that comes with the print version

    These are difficult to deal with in a system where all relations and fields have to be set up at the start and cannot be changed easily later.

off-web Internet information retrieval

• Under this heading, I principally think about activities known as ‘file-sharing’.

• They concern the (mostly illegal) exchange of files between users. Such files may encode
  – music
  – films

• There is a lot of it going on, but we are not sure how much.

Napster

• Napster was the first prominent file-sharing service.

• Napster ran a central server. You connected to that server and announced what files you had to share.

• Every search was conducted on the dataset assembled at the central server.

• Connections to download files were done between peer machines only.
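The division of labour described above, a central index plus direct transfers between peers, can be sketched as follows. `CentralIndex` and its methods are invented for illustration and say nothing about how Napster's actual server was written.

```python
class CentralIndex:
    """Toy stand-in for the central server: it stores file names, never files."""

    def __init__(self):
        self.files_by_peer = {}  # peer address -> set of shared file names

    def announce(self, peer, file_names):
        """A peer connects and announces what it has to share."""
        self.files_by_peer[peer] = set(file_names)

    def disconnect(self, peer):
        """Forget a peer that has left."""
        self.files_by_peer.pop(peer, None)

    def search(self, word):
        """Search the central dataset; return (peer, file name) pairs."""
        word = word.lower()
        return sorted(
            (peer, name)
            for peer, names in self.files_by_peer.items()
            for name in names
            if word in name.lower()
        )


if __name__ == "__main__":
    index = CentralIndex()
    index.announce("10.0.0.1", ["kempff_sonata.mp3", "notes.txt"])
    index.announce("10.0.0.2", ["Kempff_live.mp3"])
    # The result only says where a file lives; the download itself
    # would then go directly from one of these peers to the searcher.
    print(index.search("kempff"))
```

Everything depends on the one `CentralIndex` object: if it goes away, no peer can find any other peer's files.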

end of Napster

• Napster argued that, since it was only involved in collecting information about the files available, it was legal.

• Napster never shared any illegal file.

• The courts thought otherwise.

• It was shut down.

• The Napster network died without a central machine.

• To enable true piracy, we need a truly distributed system.

gnutella protocol

• This protocol underlies much of the current file-sharing activity on the Internet.

• It enables a peer-to-peer network between machines.

• To connect to a gnutella network, you need the IP address of one single machine that is already part of the network.

connection to the gnutella network

• Once you establish a connection to the first servent (a machine that acts as both server and client), you announce your presence.

• The first servent will pass on that message to all the servents that it is connected to, and so on.

• This quickly adds up to a lot of traffic!

time to live

• Every gnutella message has a time to live (TTL). It is decremented every time the message passes through a servent.

• The TTL is usually quite small. It can be arbitrarily reduced by servents.

• Therefore you only talk to servents that are close to you. But your software will determine which servents to try to contact first. That usually depends on previous query results.
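How the TTL limits the reach of a message can be simulated on a toy network. This is a simplified sketch, not the gnutella wire protocol: the function name `flood`, the dictionary describing the network, and the rule that a servent ignores a message it has already seen are modelling assumptions.

```python
from collections import deque


def flood(neighbours, start, ttl):
    """Return the set of servents reached by a message sent from `start`.

    `neighbours` maps each servent to the servents it is connected to.
    Every servent that receives the message decrements the TTL and only
    passes the message on while the TTL is still above zero.
    """
    if ttl <= 0:
        return set()
    reached = set()
    seen = {start}
    queue = deque()
    for peer in neighbours[start]:
        if peer not in seen:
            seen.add(peer)
            queue.append((peer, ttl))
    while queue:
        servent, remaining = queue.popleft()
        reached.add(servent)
        remaining -= 1              # decremented at every servent
        if remaining <= 0:
            continue                # the message expires here
        for peer in neighbours[servent]:
            if peer not in seen:    # ignore messages already seen
                seen.add(peer)
                queue.append((peer, remaining))
    return reached


if __name__ == "__main__":
    # A chain of five servents: a - b - c - d - e
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
    print(sorted(flood(chain, "a", 2)))
    print(sorted(flood(chain, "a", 7)))
```

With a small TTL only nearby servents hear the message, which is why the choice of the first servents to contact matters so much.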

searches

• When you do a search, it is passed on from servent to servent through the p2p network.

• Servents have their own rules for how to respond to queries.
  – Most of the time search strings are matched against a file name.
  – Some may try to match against the directory name.
  – Some general queries may be rejected.
  – Some result sets may be truncated.
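One possible set of local rules is sketched below. It is purely illustrative: the function `answer_query` and the thresholds `min_length` and `max_results` are made-up policy knobs, and this particular servent matches file names only, not directory names.

```python
def answer_query(query, shared_paths, min_length=3, max_results=2):
    """Decide which shared files one servent returns for a search string."""
    words = query.lower().split()
    # Reject queries that are too general to be worth answering.
    if not words or all(len(word) < min_length for word in words):
        return []
    hits = []
    for path in shared_paths:
        # Match against the file name only, ignoring the directory part.
        file_name = path.rsplit("/", 1)[-1].lower()
        if all(word in file_name for word in words):
            hits.append(path)
    # Truncate large result sets.
    return hits[:max_results]


if __name__ == "__main__":
    shared = [
        "music/kempff/sonata_14.mp3",
        "music/kempff/sonata_21.mp3",
        "music/kempff/sonata_32.mp3",
        "films/alien.avi",
    ]
    print(answer_query("sonata", shared))   # truncated to two hits
    print(answer_query("kempff", shared))   # only in a directory name
    print(answer_query("a", shared))        # too general
```

Since every servent applies its own version of such rules, the same search can return different results depending on which servents it happens to reach.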

downloading

• If you see a file that you would like to have, you can try to download it.

• To implement downloads the servents use HTTP. Thus everyone who is connected to a file-sharing network runs a web server!

• However, there usually is a tight limit on how many downloads a servent will accept.

• Modern servents have the ability to download a file from several servents at the same time.

ease of infringement

• Clearly all the traffic on gnutella, with current technology, can be observed.

• But the infringement is so massive that it appears difficult to clamp down on.

• The ease of infringement is technological.

• The RIAA has sued. The lawsuits reach only the very tip of the iceberg, in the hope of dissuading others.

http://openlib.org/home/krichel

Thank you for your attention!