
Mar 28, 2015


Chloe Bentley
Page 1: The WebWatch Project About WebWatch The WebWatch project is funded by BLRIC (British Library Research and Innovation Centre) The WebWatch project involves.

The WebWatch Project

About WebWatch

• The WebWatch project is funded by BLRIC (British Library Research and Innovation Centre)

• The WebWatch project involves the development and use of web robot software for monitoring use of web technologies

• Papers, reports, articles and presentations of the findings are produced by the WebWatch project

UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union. UKOLN also receives support from the University of Bath where it is based.

A WebWatch Trawl

A simple model of how the WebWatch robot trawls communities is shown below.

[Figure: model of a WebWatch trawl]

1. Input file of URLs

2. WebWatch robot reads input file and retrieves resources (Resource A, Resource B, etc.)

3. Summary file

4. Analysis and statistical programs produce reports (for example, Report for UK Universities)

Resource A, B, etc. could be individual pages or entire websites.
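
The trawl model above can be sketched in a few lines of Python. This is only an illustration of the input file, robot and summary file stages, not the WebWatch robot itself (which the project wrote in Perl). The function names, the summary columns and the injectable fetch function are all assumptions made for the example.

```python
import csv
from urllib.request import urlopen


def fetch_over_http(url):
    """Retrieve one resource over HTTP (hypothetical default fetcher)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def trawl(urls, fetch=fetch_over_http):
    """Read an iterable of URL lines and return one summary row per resource."""
    summary = []
    for line in urls:
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comment lines in the input file
        try:
            body = fetch(url)
            summary.append({"url": url, "ok": True, "size": len(body)})
        except OSError:
            # urllib's URLError is a subclass of OSError
            summary.append({"url": url, "ok": False, "size": 0})
    return summary


def write_summary(rows, out):
    """Write the summary rows as CSV for later analysis programs."""
    writer = csv.DictWriter(out, fieldnames=["url", "ok", "size"])
    writer.writeheader()
    writer.writerows(rows)
```

Passing the fetch function as a parameter keeps the trawl loop testable without network access; a statistics package can then read the CSV summary.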


WebWatch Trawl of UK University Entry Pages

Background

The WebWatch project carried out a trawl of UK University entry points on 24 October 1997.

The trawl was repeated on 31 July 1998.

Web Servers

The most popular web server was Apache. This has grown in popularity, with a decline in the CERN, NCSA and other smaller servers.

Microsoft's IIS server has also grown in popularity, perhaps indicating growth in use of Windows NT.

Size of Entry Points

The file sizes of the HTML resources (including frame sets) and images (but excluding background images) were analysed.

Four pages were less than 5 Kb.

The largest page was 193 Kb.

The largest pages contained animated GIF images.

         Apache  Netscape  Microsoft  NCSA  CERN  Other
Oct-97   31%     15%       8%         21%   13%   12%
Jul-98   42%     17%       13%        9%    9%    0%
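
Percentages like those in the table can be produced by tallying the Server header returned by each entry point. The sketch below shows one way to do this; the function names are hypothetical, and grouping by the product token before the "/" is an assumption about how the figures might be produced, not the project's documented method.

```python
from collections import Counter


def server_product(server_header):
    """Return the product name from a Server header, e.g. 'Apache/1.2b8' -> 'Apache'."""
    tokens = server_header.strip().split()
    if not tokens:
        return "Unknown"
    return tokens[0].split("/")[0] or "Unknown"


def server_shares(server_headers):
    """Return the percentage share of each server product, rounded to whole numbers."""
    counts = Counter(server_product(h) for h in server_headers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total) for name, n in counts.items()}
```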


Web Technologies

An analysis of some of the technologies used in UK University entry points is given below.

Java and JavaScript

The robot found that none of the institutions trawled made use of Java.

Subsequently it was found that one institution did use Java. This institution used the Robot Exclusion Protocol to stop robots from trawling the site, so the robot could not detect it.

JavaScript

In October 1997 22 institutions used client-side scripting, such as JavaScript.

By July 1998 38 institutions were using JavaScript.

[Figure: The University of Northumbria at Newcastle is one of about 38 institutions which use JavaScript. JavaScript is used to display picture fragments when the cursor moves over a menu option.]

[Figure: Liverpool University is probably the only university entry page using Java. Java provides this scrolling news facility.]


Metadata

In October 1997 54 institutions used "Alta Vista" type metadata on their main entry point. By July 1998 the metadata was used on 74 entry points.

In contrast Dublin Core metadata was used on only 2 pages on both occasions.

Cachability

Interest in cache-friendly web resources has grown since the introduction of network charging on 1 August 1998.

Over 50% of institutional HTML resources were found to be cachable, with only 1% not cachable. Further analysis is needed for the other resources.

<META NAME="description" CONTENT="Mailbase is a national mailing list centre for UK HE">
<META NAME="keywords" CONTENT="mail, listserve">

<META NAME="DC.Title" CONTENT="The Mailbase Home Page">
<META NAME="DC.Creator" CONTENT="John Smith">

Possible Use of Alta Vista and Dublin Core Metadata
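
A robot can tell the two metadata styles apart by looking at the NAME attribute of each META element. The following is a minimal sketch using Python's standard html.parser module. Treating "description" and "keywords" as "Alta Vista" type metadata and a "DC." prefix as Dublin Core follows the examples above; the class and function names are assumptions.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the NAME attribute of every META element in a page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def classify_metadata(html):
    """Report which metadata styles appear in an HTML page."""
    parser = MetaCollector()
    parser.feed(html)
    names = [n.lower() for n in parser.names]
    return {
        "alta_vista": any(n in ("description", "keywords") for n in names),
        "dublin_core": any(n.startswith("dc.") for n in names),
    }
```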

% telnet www.ukoln.ac.uk 80
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Fri, 28 Aug 1998 16:22:51 GMT
Server: Apache/1.2b8
Content-Type: text/html

Telnet can be used to analyse HTTP headers, including caching information

A WebWatch service is being developed to provide a web-interface to the telnet command, to give more helpful information.


URL: http://www.ukoln.ac.uk/

This resource uses HTTP/1.1.
The resource is cachable.
The resource was last updated on …

Possible Interface
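
The kind of report shown in the possible interface could be generated from the raw response headers roughly as follows. This is a sketch only: the source does not state which rules WebWatch uses to decide cachability, so the heuristic here (Cache-Control, Pragma, Last-Modified and Expires headers) is an assumption, as are the function names and the "undetermined" wording.

```python
def parse_response_head(raw):
    """Split a raw HTTP response head into (version, status code, headers)."""
    lines = raw.replace("\r\n", "\n").split("\n")
    version, code, *_reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, _sep, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), headers


def cachability(headers):
    """Assumed heuristic: returns 'cachable', 'not cachable' or 'undetermined'."""
    cache_control = headers.get("cache-control", "").lower()
    pragma = headers.get("pragma", "").lower()
    if any(d in cache_control for d in ("no-cache", "no-store", "private")):
        return "not cachable"
    if "no-cache" in pragma:
        return "not cachable"
    if "last-modified" in headers or "expires" in headers or "max-age" in cache_control:
        return "cachable"
    return "undetermined"


def describe(url, raw):
    """Build the lines of a human-readable report for one resource."""
    version, _code, headers = parse_response_head(raw)
    report = [f"URL: {url}", f"This resource uses {version}."]
    verdict = cachability(headers)
    if verdict == "undetermined":
        report.append("Cachability could not be determined from these headers.")
    else:
        report.append(f"The resource is {verdict}.")
    if "last-modified" in headers:
        report.append(f"The resource was last updated on {headers['last-modified']}.")
    return report
```

The three-way verdict mirrors the survey result above, where some resources were cachable, a few were not, and the rest needed further analysis.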


Frames

In July 1998 the following 19 sites used frames, compared with 12 in October 1997:

• Essex
• Bretton Coll.
• UCE
• Royal College of Music
• Keele
• King Alfred's Coll.
• Middlesex
• Nottingham Trent
• Portsmouth
• Ravensbourne Coll.
• Teesside
• Birkbeck Coll.
• UMIST
• Uni. Coll. of St Martin
• Thames Valley
• Queen Margaret Coll.
• Westhill
• Scottish Agricultural Coll.
• Kent Institute of Art and Design

"Splash Screens"

In July 1998 5 sites used client-side requests to provide redirects or "splash screens".

UMIST is an example of a framed website


De Montfort University displays a screen with a yellow background. After 8 seconds a new screen is displayed.


"Splash screens" are created by <META HTTP-EQUIV="refresh" CONTENT="n; URL=xxx.html">

Liverpool University also uses frames but this was not detected by the robot due to their use of the Robot Exclusion Protocol.
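
Both features can be detected from the HTML of an entry page. The sketch below, again using Python's html.parser, flags FRAMESET or FRAME elements and extracts the delay and target from a META refresh of the form shown above. The class name and the returned tuple are assumptions for illustration, and a page blocked by the Robot Exclusion Protocol would of course never reach this code.

```python
from html.parser import HTMLParser


class EntryPageScanner(HTMLParser):
    """Detect frames and META refresh "splash screens" in an entry page."""

    def __init__(self):
        super().__init__()
        self.uses_frames = False
        self.refresh = None  # (delay in seconds, target URL or None)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("frameset", "frame"):
            self.uses_frames = True
        elif tag == "meta":
            if (attributes.get("http-equiv") or "").lower() != "refresh":
                return
            content = attributes.get("content") or ""
            delay, _sep, rest = content.partition(";")
            target = rest.strip()
            if target.lower().startswith("url="):
                target = target[4:].strip()
            try:
                seconds = int(delay.strip())
            except ValueError:
                return  # malformed delay: ignore this element
            self.refresh = (seconds, target or None)


def scan_entry_page(html):
    """Return a scanner whose attributes describe the page."""
    scanner = EntryPageScanner()
    scanner.feed(html)
    return scanner
```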


[Figure: histogram of the number of hyperlink elements per page (x-axis: hyperlink elements per page, 0 to 75; y-axis: count, 0 to 40). Std. Dev = 12.04, Mean = 19.3, N = 148]


Hyperlinking Issues

The WebWatch trawls revealed some interesting hyperlinking issues, which are described below.

Numbers of Hyperlinks

The histogram of the numbers of hyperlinks from institutional entry points shows an approximately normal distribution.

Six sites were found to have fewer than 5 links.

One site contained over 75 links.

Limitations of Survey

The analyses do not give a completely accurate view for a variety of reasons:

• The address of one of the sites with a small number of links was incorrectly given in the input file list (obtained from HESA).

• The analysis did not exclude duplicate links.

• Sites containing "splash screens" were reported as having a small number of links, although arguably the links on the second screen should also be included.
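
The effect of duplicate links on the counts can be illustrated with a short sketch that reports both the raw and the de-duplicated number of links on a page, plus the mean and standard deviation across pages. Which elements the WebWatch analysis counted as hyperlinks is not stated, so counting A and AREA elements with an HREF attribute is an assumption, as are the function names.

```python
from html.parser import HTMLParser
from statistics import mean, stdev


class LinkCollector(HTMLParser):
    """Collect the HREF of every A and AREA element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def count_links(html):
    """Return (total links, distinct links) for one page."""
    collector = LinkCollector()
    collector.feed(html)
    return len(collector.links), len(set(collector.links))


def summarise(counts):
    """Return (mean, sample standard deviation) of per-page link counts."""
    return mean(counts), stdev(counts)
```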

Discussion

Many Links:

• Provide useful "short cuts" for experienced users

• Can minimise numbers of levels to navigate

Few Links:

• Can be confusing for new users

• Can cause accessibility problems (e.g. for the visually impaired)

What is your view?



Trends in UK University Entry Points

Trawls of UK University Entry Points

The WebWatch project has surveyed UK University web site entry points on three occasions: 24 October 1997, 31 July 1998 and 25 November 1998.

A summary of significant trends is given below.

Metadata Usage

Use of Dublin Core (DC) metadata grew during the summer of 1998 from 2 sites to 11. DC metadata is still dwarfed by "Alta Vista" style metadata.

"Splash Screens"

The number of entry points using "splash screens" has increased from 5 (Oct 97), to 7 (Jul 98) to 10 (Nov 98).

[Figure: chart with axis labelled "Growth (Kb)"]

Server Usage

The Apache and Microsoft web servers are both growing in popularity, at the expense of the CERN and Netscape servers, and a number of more specialist servers.

Size Of Entry Points

Trends in the sizes (HTML plus embedded images) have been analysed. The majority of entry points have not changed in size significantly, although one or two have grown (~100Kb) or decreased in size (~50Kb) substantially.


WebWatch Services

HTTP-info Service

A web form is available which can be used to obtain the HTTP headers sent when a resource is accessed.

This service can be useful for getting information, such as the name of the server software, HTTP version information, etc.

Doc-info Service

A web form is available which can be used to obtain information on web resources.

The Doc-info service is integrated with the HTTP-info service, enabling the HTTP headers of all objects contained in a resource to be analysed.

WebWatch provides access to various tools and utilities which have been developed to support its work. These services can be accessed using a Web browser at the address <URL: http://www.ukoln.ac.uk/web-focus/webwatch/services/ >.


WebWatch Technologies

Technologies

The WebWatch project has made use of the following technologies:

• The Harvest indexing and analysis suite

• Perl for developing the WebWatch robot

• Locally-developed indexing and analysis software

• A series of Unix Perl utilities for analysing and filtering the data

• Excel, Minitab and SPSS for statistical analysis

Trawling Software

The Harvest software was used originally.

Harvest is widely used within the research community for indexing resources. For example the ACDC project uses Harvest to provide a distributed index of UK.AC web resources.

Unfortunately as Harvest was designed for indexing, it is limited in its ability to audit and monitor web technologies.

The current version of the WebWatch robot uses Perl.

ACDC uses Harvest. See <URL: http://acdc.hensa.ac.uk/>


Restricting Access

Why Restrict Access?

Administrators may wish to restrict access by automated robot software to web resources for a variety of reasons:

• To prevent resources from being indexed

• To minimise load on the web server

• To minimise network load

Robot Exclusion Protocol

The Robot Exclusion Protocol is a set of rules which robot software should obey.

A robots.txt file located in the root of the web server can contain information on:

• Areas which robots should not access

• Particular robots which are not allowed access

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

Typical robots.txt File
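
A robot can check rules like these before each request. The sketch below uses Python's standard urllib.robotparser module to apply the typical file above. The robot name "ExampleRobot" and the host www.example.ac.uk are hypothetical, and this is not the code behind the WebWatch robot or its checker service.

```python
from urllib.robotparser import RobotFileParser

# The typical robots.txt file shown above
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/", "/images/logo.gif", "/cgi-bin/search"):
        url = "http://www.example.ac.uk" + path
        print(path, allowed(ROBOTS_TXT, "ExampleRobot", url))
```

In a real trawl the parser would normally be created once per host and reused, rather than rebuilt for every URL.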

Issues

Some issues to be aware of:

• Prohibiting robots will mean that web resources will not be found on search engines such as Alta Vista

• Restricting access to the main search engine robots may mean that valuable new services cannot access the resources

• The existence of a small robots.txt file can have performance benefits

• It may be desirable to restrict access to certain areas, such as cgi-bin and images directories.

WebWatch Hosts A robots.txt Checker Service


WebWatch Recommendations

Recommendations

The final WebWatch report makes a number of recommendations, based on its trawls, including advice for Information Providers, Web Administrators and Robot Software Developers.

Information Providers

Directory Structure

Directory structures can provide a form of metadata about a resource. It is recommended that information providers make consistent use of directories.

Metadata

The use of "Alta Vista" type metadata is recommended for use on key entry points.

Frames

Frames can prevent indexing robots from accessing resources. If frames are used, there should be an alternative route to resources for robots.


System Administrators

The robots.txt File

Web system administrators should ensure that web servers contain a robots.txt file. This may be used to restrict access to robots.

HTTP/1.1

Web system administrators should ensure that their server software supports HTTP/1.1.

Analysis of Robot Usage

Web system administrators should periodically check log files for access by robot software.


Software Developers

Memory Leaks

Memory leaks can cause problems, especially when accessing large numbers of resources. Robot software should include checkpoints, to facilitate restarts.

User-Agent Negotiation

Robot developers should be aware of server use of "User-Agent Negotiation" which may provide different information to robots and browsers.
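
The checkpoint recommendation can be illustrated with a small sketch in which the robot records the URLs it has finished in a file, so that a restarted run skips them. The JSON file format, the function names and the checkpoint interval are assumptions for the example; the final report may describe a different mechanism.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of URLs already processed, or an empty set on first run."""
    try:
        with open(path, encoding="utf-8") as handle:
            return set(json.load(handle))
    except FileNotFoundError:
        return set()


def save_checkpoint(path, done):
    """Write the checkpoint via a temporary file so a crash cannot truncate it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(sorted(done), handle)
    os.replace(tmp_path, path)


def trawl_with_checkpoints(urls, process, path, every=100):
    """Process each URL once, saving progress after every `every` completions."""
    done = load_checkpoint(path)
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        process(url)
        done.add(url)
        if len(done) % every == 0:
            save_checkpoint(path, done)
    save_checkpoint(path, done)
    return done
```

Restarting the process at each checkpoint is also one way to limit the effect of a memory leak, since memory is released when the process exits.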


Further Information

Further recommendations are included in the final WebWatch report.

The report is available at <URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/final/ >.


Finding Out More About WebWatch

Ariadne

Occasional WebWatch reports are published in the online version of the Ariadne magazine.

See:

<URL: http://www.ariadne.ac.uk/issue12/web-focus/ >

<URL: http://www.ariadne.ac.uk/issue15/robots/ >

WebWatch Staff

The WebWatch Officer is Ian Peacock (email [email protected]).

Ian's responsibilities include software development, running the robot trawls, analysing the data and producing reports.

The WebWatch project is managed by Brian Kelly (email [email protected]).


Publications

The following WebWatch articles have been published:

• "Robot Seeks Public Library Web Sites" in LA Record, Dec 1997 Vol 99 (12)

• "Academic and Public Library Web Sites" in Library Technology, Aug 1998

• "WebWatching Academic Library Web Sites" in Library Technology, Jun 1998

• "WebWatching Public Library Web Site Entry Points" in Library Technology, Apr 1998

• "Public Library Domain Names" in Library Technology, Feb 1998

• "How is My Web Community Doing? Monitoring Trends In Web Service Provision" in Journal Of Documentation, Vol. 55 No. 1 Jan 1999

The final WebWatch report can be obtained from <URL:http://www.ukoln.ac.uk/web-focus/webwatch/reports/final/>
