Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html
1 of 110 4/17/2007 4:12 PM
Hypertext Transfer Protocol
April 17, 2007
Harvard University Division of Continuing Education
HyperText Transfer Protocol
GET /
HTTP is a stateless protocol. Cookies provide a mechanism to "maintain state".
Maintaining State with Cookies
HTTP State Management Mechanism http://www.ics.uci.edu/pub/ietf/http/rfc2109.txt
Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/
Persistent Client State HTTP Cookies http://www.netscape.com/newsref/std/cookie_spec.html
minerva% telnet www.npr.org 80
Trying 216.35.221.77...
Connected to www.npr.org.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.npr.org

HTTP/1.1 200 OK
Date: Tue, 10 Apr 2007 20:07:33 GMT
Server: Apache
Set-Cookie: Apache=140.247.197.240.289451144786054516; path=/
Cache-Control: max-age=0
Expires: Tue, 10 Apr 2007 20:07:33 GMT
Transfer-Encoding: chunked
Content-Type: text/html

76c
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>NPR - National Public Radio - News, Arts, World, US.</title>
<!-- content removed -->
</html>
Cookie Example
Server returns cookie to HTTP client ("Set-Cookie" response header)
HTTP client returns cookie to server ("Cookie" request header)

minerva% telnet 140.247.197.240 80
Trying 140.247.197.240...
Connected to 140.247.197.240.
Escape character is '^]'.
GET /http/cookie.cgi?name=David%20P.%20Heitmeyer HTTP/1.1
Host: cscie12.dce.harvard.edu
Connection: close

HTTP/1.1 200 OK
Connection: close
Date: Wed, 13 Apr 2005 18:05:04 GMT
Server: Apache/2.0.49 (Fedora)
Content-Type: text/html; charset=ISO-8859-1
Client-Date: Wed, 13 Apr 2005 18:05:04 GMT
Client-Peer: 140.247.197.240:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
Set-Cookie: YourName=David%20P.%20Heitmeyer; \
 domain=cscie12.dce.harvard.edu; \
 path=/http/; \
 expires=Fri, 13-May-2005 18:05:04 GMT

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"><head><title>For
</head><body>
<h1>Hello, David P. Heitmeyer</h1>
</body></html>
Connection closed by foreign host.
Cookie Example: Returning a Cookie
Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi
minerva% telnet 140.247.197.240 80
Trying 140.247.197.240...
Connected to 140.247.197.240.
Escape character is '^]'.
GET /http/cookie.cgi HTTP/1.1
Cookie: YourName=David%20P.%20Heitmeyer
Host: cscie12.dce.harvard.edu
Connection: close

HTTP/1.1 200 OK
Connection: close
Date: Wed, 13 Apr 2005 18:11:40 GMT
Server: Apache/2.0.49 (Fedora)
Content-Type: text/html; charset=ISO-8859-1
Client-Date: Wed, 13 Apr 2005 18:11:40 GMT
Client-Peer: 140.247.197.240:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head><title>Form</title></head><body>
<h1>Hello, David P. Heitmeyer</h1>
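The two telnet sessions can be mirrored in a few lines of Python. This is only a sketch of the header round trip, not the course's cookie.cgi script: the function names are hypothetical, and it uses the standard library's http.cookies module to build a "Set-Cookie" value and to read the "Cookie" header that comes back.

```python
from http.cookies import SimpleCookie
from urllib.parse import quote, unquote


def build_set_cookie(name, value, path="/http/"):
    """Return a Set-Cookie header value with a URL-encoded cookie value."""
    cookie = SimpleCookie()
    cookie[name] = quote(value)       # "David P. Heitmeyer" -> "David%20P.%20Heitmeyer"
    cookie[name]["path"] = path
    return cookie[name].OutputString()


def read_cookie(cookie_header, name):
    """Return the decoded value of `name` from a Cookie request header, or None."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(name)
    return unquote(morsel.value) if morsel is not None else None


# Server side: what goes out in the response.
set_cookie_value = build_set_cookie("YourName", "David P. Heitmeyer")

# Server side on the next request: what the browser sent back.
returned_name = read_cookie("YourName=David%20P.%20Heitmeyer", "YourName")
```

The value is URL-encoded before it is stored because spaces are not legal in a cookie value, which matches the %20 sequences visible in the sessions above.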
Your Cookies
The Firefox Web Developer toolbar has a "Cookies" section.
Mozilla Cookie Manager
Cookies and Session IDs
A UserID or SessionID (a long character/number string that is uniquely assigned) is often stored in a cookie. The SessionID is used as the key or identifier when storing information about the user or session.
For example, a user logs in to a site. If the username and password match, the server sets a cookie ("Set-Cookie") in the browser that contains a session id; the server also makes an entry in the website database that maps the session id to the username. When the cookie is returned, the session id is read and the username is looked up in the database.
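The login flow just described can be sketched as follows. Everything here is illustrative: the in-memory dictionaries stand in for the website database, the credentials are made up, and a real site would store password hashes rather than plaintext.

```python
import secrets

# Hypothetical stand-ins for the website database.
users = {"jharvard": "crimson"}   # username -> password (plaintext only for illustration)
sessions = {}                     # session id -> username


def log_in(username, password):
    """Return a new session id if the credentials match, else None."""
    if username not in users or users[username] != password:
        return None
    session_id = secrets.token_hex(16)   # long, hard-to-guess identifier
    sessions[session_id] = username      # the server-side mapping
    return session_id                    # would be sent via Set-Cookie


def username_for(session_id):
    """Look up the user when the cookie comes back on a later request."""
    return sessions.get(session_id)
```

Only the session id travels in the cookie; the username and any other data stay on the server, keyed by that id.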
Google Cookie Example
Using Google's "Preference" page and setting:
Search Language preference to: English, French, German
SafeSearch Filtering: Strict Filtering
Number of Results: 50
The Cookie name is: PREF
The Value is: ID=bb504f37cd318aa9:FF=1:LR=lang_en|lang_fr|lang_de:LD=en:NR=50:TM=1113416195:LM=111
This cookie contains a session id as well as the values of certain preferences in a colon-separated data structure.
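To make the colon-separated structure concrete, here is a small sketch that splits such a value into its fields. The function name is hypothetical, and the sample string uses only the leading fields of the value shown above (the trailing TM and LM fields are left out because the value is cut off in the handout).

```python
def parse_pref(value):
    """Split a colon-separated list of KEY=VALUE pairs into a dict."""
    fields = {}
    for part in value.split(":"):
        key, _, val = part.partition("=")
        fields[key] = val
    return fields


pref = parse_pref("ID=bb504f37cd318aa9:FF=1:LR=lang_en|lang_fr|lang_de:LD=en:NR=50")
languages = pref["LR"].split("|")   # the three search languages
results_per_page = int(pref["NR"])  # the "Number of Results" preference
```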
Cookies and Ad Tracking
Method: POST
Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi
WebDAV: an extension of HTTP
Web-based Distributed Authoring and Versioning
WebDAV Resources http://www.webdav.org/
From the WebDAV Resources:
WebDAV stands for "Web-based Distributed Authoring and Versioning". It is a set of extensions to the HTTP protocol which allows users to collaboratively edit and manage files on remote web servers.
HTTP Resources
W3C HTTP http://www.w3.org/Protocols/
HTTP Pocket Reference http://www.oreilly.com/catalog/httppr/ by Clinton Wong (O'Reilly).
Illustrated Guide to HTTP http://www.manning.com/hethmon/ by Paul Hethmon (Manning Publications; ISBN 0138582262); see sample chapters and resources online.
Other Readings:
W3C Recommendations Reduce 'World Wide Wait' http://www.w3.org/Protocols/NL-PerfNote.html
Apache Week: HTTP version 1.1 http://www.apacheweek.com/features/http11
WebTechniques: HTTP 1.1: What's in it for Me? http://www.webtechniques.com/archives/1997/08/webm/
Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/
Apache HTTP Server
Apache Software Foundation
Apache HTTP Server Project
Apache 1.3
Apache 2.x
Apache Modules
PHP
Perl
Python
many, many others
Apache: The Most Widely Used Web Server on the Public Internet
Netcraft Web Server Survey
In this Unit: Configuring Apache with .htaccess files
Apache Configuration Overview
Server Configuration (httpd.conf)
Unless you are the server administrator, you generally will not have access to this file. On the DCE systems, you do not have read or write access to it. The server configuration is read at server start or restart.
Per Directory (.htaccess)
Certain configuration directives for Apache can be placed within per-directory .htaccess files. An .htaccess file is read on a per-request basis.
Scope of .htaccess files
Directives within .htaccess files apply to the directory that contains the .htaccess file and all its descendants.
Directives within the file /home/e12/htdocs/.htaccess would apply to all files within and "under" the public_html directory for the user cscie12.
Directives within the file /home/e12/htdocs/assignments/.htaccess would apply to all files within and "under" the public_html/assignments directory for the user cscie12.
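The scoping rule can be illustrated with a short sketch that lists every .htaccess location that would apply to a given file, from the top directory down to the file's own directory. The function name and the sample file path are hypothetical, and the sketch is simplified: it starts at a chosen top directory and does not check whether each file exists or whether the server configuration permits it.

```python
from pathlib import PurePosixPath


def applicable_htaccess(top_dir, requested_file):
    """List .htaccess paths from top_dir down to the requested file's directory."""
    top = PurePosixPath(top_dir)
    directory = PurePosixPath(requested_file).parent
    relative = directory.relative_to(top)   # raises ValueError if outside top_dir
    paths = [top / ".htaccess"]
    current = top
    for part in relative.parts:
        current = current / part
        paths.append(current / ".htaccess")
    return [str(p) for p in paths]


scopes = applicable_htaccess(
    "/home/e12/htdocs",
    "/home/e12/htdocs/assignments/a1/index.html",   # hypothetical file
)
```

Files listed later are deeper in the tree, so their directives are read after those of their ancestors.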
Problems You Will Have with .htaccess files
Internal Server Error
Can't "see" the file
Incorrect Permissions
Problems You will encounter when using .htaccess files
500 Internal Server Error
If you begin seeing 500 Internal Server Error responses from the server after you have created or edited an .htaccess file, the most likely cause is incorrect permissions and/or an error in the directive syntax.
Permissions on the .htaccess file are not set correctly. Just like HTML and image files, the server must be able to read the .htaccess file. The simplest way to allow that is to make your .htaccess file readable by "other".
Syntax Error. An error in the syntax of a directive in the .htaccess file will result in a 500 Internal Server Error. In addition, correct usage of a directive that is not allowed in the .htaccess file will result in a 500 status code. Whether or not a directive is allowed depends upon the server configuration file (httpd.conf; AllowOverride) and the directive itself.
minerva% pwd
/home/courses/j/h/jharvard/public_html
minerva% ls -l .htaccess
-rw------- 1 jharvard founder 349 Nov 27 00:03 .htaccess
minerva% chmod o+r .htaccess
minerva% ls -l ~/public_html/.htaccess
-rw----r-- 1 jharvard founder 349 Nov 27 00:03 .htaccess
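The same check and fix can be scripted. The sketch below is an assumption-laden equivalent of "ls -l" followed by "chmod o+r" on a Unix system; the function names are hypothetical.

```python
import os
import stat


def readable_by_other(path):
    """True if the 'other' read bit is set, as in -rw----r--."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def make_readable_by_other(path):
    """Equivalent of `chmod o+r path`: add the read bit for 'other' only."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IROTH)
```

Calling make_readable_by_other on a file with mode 600 leaves it at 604, matching the before and after listings above.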
Problems You will encounter when using .htaccess files
You can't "see" your .htaccess file.
HTTP
The web server is typically configured to deny requests for .htaccess files. For example, the file corresponding to the URL http://cscie12.dce.harvard.edu/.htaccess exists and is readable by the Web server, but if we try to follow the link, we get a 403 Forbidden response.
UNIX
The ls command will not list files or directories that begin with a '.' (dot). In order to see the .htaccess file when you do a directory listing, use the -a (all) option: ls -a
SFTP
Sometimes your SFTP program will hide the "dot" files unless explicitly told to show them.
Apache Configuration Sections
Configuration directives can be limited by using "sections", such as
Rewrite
mod_rewrite http://www.apache.org/docs/2.0/mod/mod_rewrite.html
A Users Guide to URL Rewriting with the Apache Webserver http://www.engelschall.com/pw/apache/rewriteguide/
Rewrite uses regular expressions to match on a pattern and rewrite to a new location. For example, the Derek Bok Center site used to be a "user" account and had the "~bok_cen/" base. When moved to its own virtual host, all of the "~bok_cen" requests could be rewritten to the new site with a single rewrite rule.
Old URL: http://www.fas.harvard.edu/~bok_cen/tf/resources.html
(.*) matches on: /tf/resources.html
New URL: http://bokcenter.fas.harvard.edu/tf/resources.html
# rewrite for Bok Center
RewriteRule ^/~?bok_cen(.*) http://bokcenter.fas.harvard.edu$1 [R=301]
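The pattern in the rule can be tried out with Python's re module, whose syntax agrees with Apache's for this expression. This is only an illustration of what the regular expression captures, not of how mod_rewrite is implemented; the function name is hypothetical.

```python
import re

# Same pattern as the RewriteRule: optional "~", then "bok_cen", then capture the rest.
PATTERN = re.compile(r"^/~?bok_cen(.*)")


def rewrite(path):
    """Return the redirect target for a matching path, or None if the rule does not apply."""
    match = PATTERN.match(path)
    if match is None:
        return None
    # "$1" in the rule is the first captured group.
    return "http://bokcenter.fas.harvard.edu" + match.group(1)


new_url = rewrite("/~bok_cen/tf/resources.html")
```

The "~?" makes the tilde optional, so both /~bok_cen/... and /bok_cen/... are redirected by the single rule.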
Examples of Rewrite Uses
Provide a standard mechanism to access course Web sites within Harvard College.
For example, Chemistry 7 has a catalog number of 5118, so the course Web site can be reached through:
http://www.courses.harvard.edu/5118
The "real" location of the site is:
http://my.harvard.edu/icb/icb.do?course=fas-chem7
HASCS Site Restructure
Dozens of rewrite directives were put in place when the HASCS site was restructured so that links to documents within the previous site would get redirected to the appropriate page in the new site.
Rewrite: Can be conditional
Rewrite rules can be conditional (match against almost any environment variable).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Lynx
RewriteRule ^(index.html)?$ text/ [R=302]
Here we match on host and user-agent to deliver an error page explaining that the browser is not supported.
# rewrite rule to catch IE Mac browsers since
# the PIN Service does not support them as of 10/16/2006
RewriteCond %{HTTP_HOST} "^login.icommons.harvard.edu$"
RewriteCond %{HTTP_USER_AGENT} "MSIE 5.*\; Mac_PowerPC"
RewriteRule ^/pinproxy.* /pin_error_ie_mac.html [R,L]
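The logic of the IE Mac rule can be sketched in Python: consecutive RewriteCond lines are combined with AND by default, and the RewriteRule fires only when all of them match. The function name and the sample user-agent string are assumptions for illustration.

```python
import re


def pin_redirect(host, user_agent, path):
    """Return the error page if every condition and the rule pattern match, else None."""
    conditions_met = (
        re.search(r"^login.icommons.harvard.edu$", host) is not None
        and re.search(r"MSIE 5.*; Mac_PowerPC", user_agent) is not None
    )
    if conditions_met and re.search(r"^/pinproxy.*", path) is not None:
        return "/pin_error_ie_mac.html"
    return None


# Hypothetical user-agent string of the kind the rule is meant to catch.
ie_mac = "Mozilla/4.0 (compatible; MSIE 5.23; Mac_PowerPC)"
target = pin_redirect("login.icommons.harvard.edu", ie_mac, "/pinproxy/login")
```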
An aside: Text-only sites and "link"
Meta-information can be used to describe alternate content.
W3C Web Content Accessibility Guidelines: alternate pages http://www.w3.org/TR/WAI-WEBCONTENT-TECHS/#alt-pages
In ~cscie12/public_html/index2.html
Lynx view of index2.html provides the text-only version as a
Meta Refresh
Note: redirection may also be achieved on some browsers by using the http-equiv attribute of the <meta> element. More information and examples are provided at http://www.fas.harvard.edu/~web/tutorial/meta/refresh/ . The recommended method is to do it at the server level.
<!-- in head -->
<!-- will redirect in 10 seconds -->
<meta http-equiv="Refresh" content="10; URL=http://www.harvard.edu/"/>
Directory Index and Listings
Note: Remember the difference between a directory having rwx-----x and rwx---r-x permissions?
DirectoryIndex http://www.apache.org/docs/2.0/mod/mod_dir.html
Would you prefer main.html or overview.html to be the default files returned when a directory is requested?
mod_autoindex http://www.apache.org/docs/2.0/mod/mod_autoindex.html
Provides for automatic indexing of a directory.
From the Apache mod_expires documentation:
This module controls the setting of the Expires HTTP header in server responses. The expiration date can set to be relative to either the time the source file was last modified, or to the time of the client access.
The Expires HTTP header is an instruction to the client about the document's validity and persistence. If cached, the document may be fetched from the cache rather than from the source until this time has passed. After that, the cache copy is considered "expired" and invalid, and a new copy must be obtained from the source.
ExpiresActive On

ExpiresByType text/html A3600
# HTML expires in 1 hour

ExpiresByType image/gif A2592000
# GIF expires in 30 days

ExpiresByType image/jpeg A2592000
# JPEG expires in 30 days

ExpiresByType image/png A2592000
# PNG expires in 30 days

# types not specified
ExpiresDefault "now plus 1 day"
# expires in 1 day
Or, expire based upon modification time of document:
ExpiresActive On
ExpiresByType text/html M86400
# HTML expires 1 day after it was last modified
ExpiresDefault M86400
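In the directives above, the letter selects the base time (A for the time of client access, M for the file's last modification) and the number is an offset in seconds. The sketch below computes the resulting Expires value under that reading; the function name and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(code, access_time, modified_time):
    """Compute an Expires header value from a code such as 'A3600' or 'M86400'."""
    if code[0] == "A":
        base = access_time        # relative to the time of the request
    elif code[0] == "M":
        base = modified_time      # relative to the file's modification time
    else:
        raise ValueError("code must start with 'A' or 'M'")
    return format_datetime(base + timedelta(seconds=int(code[1:])), usegmt=True)


accessed = datetime(2007, 4, 10, 20, 7, 33, tzinfo=timezone.utc)
modified = datetime(2007, 4, 9, 20, 7, 33, tzinfo=timezone.utc)

html_by_access = expires_header("A3600", accessed, modified)
html_by_mtime = expires_header("M86400", accessed, modified)
```

With the M form, a file that has not changed for longer than the offset is already expired on every request, so it suits documents that are updated on a regular schedule.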
Do not cache
If you do not want your page cached, set these HTTP response headers:
Cache-control: no-cache
Pragma: no-cache
Expires: <set to now>
In .htaccess in Apache, this would translate to:
ExpiresDefault "now"
Header set Pragma "no-cache"
The optional headers module allows for the customization of HTTP response headers. Headers can be merged, replaced or removed. The server will always add a "Server" and "Date" header to the HTTP response.
Header set Author "David P. Heitmeyer"
CookieTracking on
CookieStyle RFC2965
CookieName MyCookie
CookieExpires "1 month 3 days 2 hours"
CookieDomain .dce.harvard.edu
WWW Access Control
You can implement access control on all or part of your Web site so that:
users must provide a username and password (Basic Authentication);
users' computers must be within a particular domain
Basic Authentication: Warning
Basic Authentication alone does not provide the security and privacy to adequately protect truly confidential or personal information.
Basic Authentication is analogous to simply "closing a door" to parts of your Web site. It will prevent casual or polite users from "opening the door", but will not prevent someone mildly determined from walking in.
Two issues that contribute to the lack of security and privacy are:
the content is transmitted over the network in plaintext
the usernames and passwords (submitted with each HTTP request) are transmitted over the network in plaintext
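To see why the credentials count as plaintext: with Basic Authentication the browser sends "username:password" in an Authorization header, encoded with base64, which is an encoding and not encryption. The sketch below shows that anyone who sees the header can reverse it; the function names are hypothetical and the "guest" credentials are only sample values.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value a browser would send."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def decode_basic_auth(header_value):
    """Recover (username, password) from an Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic Authorization header")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("guest", "guest")
recovered = decode_basic_auth(header)
```

No key or secret is needed to decode the header, which is why Basic Authentication needs an encrypted connection to offer real protection.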
Implementing Access Control
To implement access control, you must create a file named '.htaccess' that contains the proper configuration instructions. You may also need to create a ".htpasswd" file using the utility "htpasswd" and a ".htgroup" file.
htpasswd file
.htpasswd
This file contains usernames and encrypted passwords (username:enc_passwd). It is created and managed with the utility "htpasswd", which can be run from the command line.
minerva% which htpasswd
/usr/bin/htpasswd
minerva% htpasswd
Usage: htpasswd [-c] passwordfile username
The -c flag creates a new file.
This file should not lie within your public_html. It should reside at the root level of your home directory (for example, /home/courses/j/h/jharvard/.htpasswd).
This file needs to be readable by the Web Server.
Sample content:
minerva% more ~e12/.htpasswd.demo
guest:79WeSn3vYGsKQ
guest2:wGcgIYLtHNIpM
guest3:j9VzpSX/C8Kr2
guest4:CjHmW1PWNFwXM
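The username:enc_passwd layout is simple enough to read with a few lines of code. This sketch only parses the file format; it does not verify passwords, and the function name is hypothetical.

```python
def parse_htpasswd(text):
    """Map each username to its stored password hash."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        username, _, hashed = line.partition(":")
        entries[username] = hashed
    return entries


sample = """guest:79WeSn3vYGsKQ
guest2:wGcgIYLtHNIpM
guest3:j9VzpSX/C8Kr2
guest4:CjHmW1PWNFwXM
"""
accounts = parse_htpasswd(sample)
```

Note that all four sample users have the same password yet different stored strings, because the first two characters of each entry act as a salt.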
htgroup file
.htgroup
This file contains group definitions (group_name:member1 member2 ...).
This file should not lie within your public_html. It should reside at the root level of your home directory (for example, /home/courses/j/h/jharvard/.htgroup).
This file needs to be readable by the Web Server.
Access Control Examples
For the examples given, the user "cscie12" is used. You should substitute your username and home directory appropriately.
The following .htpasswd.demo and .htgroup.demo files are used:
/home/e12/.htpasswd.demo
The .htpasswd.demo was generated by using the utility "htpasswd"
minerva% htpasswd
Usage: htpasswd [-c] passwordfile username
The -c flag creates a new file.
minerva% htpasswd -c /home/e12/.htpasswd.demo guest
Adding password for guest
New password: *****
Re-type password: *****
Password for "guest" (and all other entries) is "guest". Entries for guest2, guest3, and guest4 are created without the "-c" flag, since the .htpasswd.demo file already exists.
Contents of file:
.htgroup.demo
Contents of file:
Access Control Example 3
Only members of a particular group are allowed access
Contents of .htaccess file:
Contents of .htgroup.demo file:
Demonstration of Example 3
Only members of the group "VIP" (as defined by /home/e12/.htgroup.demo) are authorized (guest and guest4):
guest:guest
guest4:guest
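A group check of this kind can be sketched as follows. The group line used here is an assumption reconstructed from the description (a group "VIP" containing guest and guest4) in the group_name:member1 member2 format; it is not the actual content of .htgroup.demo, and the function names are hypothetical.

```python
def parse_htgroup(text):
    """Map each group name to the set of its members."""
    groups = {}
    for line in text.splitlines():
        name, separator, members = line.partition(":")
        if separator:
            groups[name.strip()] = set(members.split())
    return groups


def in_group(groups, group_name, username):
    """True if the authenticated user belongs to the required group."""
    return username in groups.get(group_name, set())


# Assumed file content, reconstructed from the example's description.
groups = parse_htgroup("VIP: guest guest4")
```

A user must still pass the password check first; group membership only decides whether an authenticated user is authorized.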
Access Control Example 4
Only certain computers are allowed access
Contents of sample .htaccess file:
order deny,allow
deny from all
allow from 140.247
allow from 128.103
allow from .harvard.edu
Demonstration of Example 4
Computers that are on the Harvard network (computers with hostnames ending in .harvard.edu or with IP addresses beginning with 128.103 or 140.247) will have access; others will be denied.
Access Control Example 5
Only certain computers are denied access
Contents of sample .htaccess file:
order allow,deny
allow from all
deny from .fas.harvard.edu
Demonstration of Example 5
Connections from within the domain 'fas.harvard.edu' will be denied.
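The difference between the two orderings in Examples 4 and 5 can be sketched in code. With "deny,allow" a matching allow overrides a deny and unmatched clients are let in; with "allow,deny" a client must match an allow and no deny. This is a simplified model with hypothetical function names: it handles only "all", leading-dot domain suffixes, and partial IP addresses, and it assumes the client's hostname is already known.

```python
def matches(spec, ip, hostname):
    """True if a single allow/deny value applies to this client."""
    if spec == "all":
        return True
    if spec.startswith("."):
        return hostname.endswith(spec)             # domain suffix such as .harvard.edu
    return ip == spec or ip.startswith(spec + ".")  # partial IP such as 140.247


def is_allowed(order, allow, deny, ip, hostname=""):
    """Evaluate allow/deny lists under the given order."""
    allowed = any(matches(spec, ip, hostname) for spec in allow)
    denied = any(matches(spec, ip, hostname) for spec in deny)
    if order == "deny,allow":
        return allowed or not denied    # allow wins; default is to permit
    if order == "allow,deny":
        return allowed and not denied   # deny wins; default is to refuse
    raise ValueError("unknown order")


example4 = dict(order="deny,allow", allow=["140.247", "128.103", ".harvard.edu"], deny=["all"])
example5 = dict(order="allow,deny", allow=["all"], deny=[".fas.harvard.edu"])
```

For instance, is_allowed(ip="140.247.197.240", **example4) is True, while a client at 216.35.221.77 with hostname www.npr.org is refused.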
Access Control Example 6
Certain computers are allowed in; others must provide a username and password
Contents of sample .htaccess file:
order deny,allow
deny from all
allow from .yale.edu
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
AuthName "Basic Authentication Tutorial 6"
require valid-user
satisfy any
Demonstration of Example 6
Connections from within ".yale.edu" will be allowed; others must provide a valid username and password.
Access Control Example 7
Only certain computers are allowed in and users must provide a valid username and password.
Contents of sample .htaccess file:
order deny,allow
deny from all
allow from .harvard.edu
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
AuthName "Basic Authentication Tutorial 7"
require valid-user
satisfy all
Demonstration of Example 7
With satisfy all, a connection must come from within ".harvard.edu" and the user must also provide a valid username and password.
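The only difference between Examples 6 and 7 is the satisfy line, which decides how the host check and the user check are combined. A minimal sketch, with a hypothetical function name:

```python
def access_granted(satisfy, host_ok, user_ok):
    """Combine the host-based check and the username/password check."""
    if satisfy == "any":
        return host_ok or user_ok     # Example 6: either one is enough
    if satisfy == "all":
        return host_ok and user_ok    # Example 7: both are required
    raise ValueError("satisfy must be 'any' or 'all'")


# An off-campus user with a valid password:
example6_result = access_granted("any", host_ok=False, user_ok=True)
example7_result = access_granted("all", host_ok=False, user_ok=True)
```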
Requiring SSL (https://)
SSL (Secure Sockets Layer) is a protocol that encrypts data between the client and the server. https is HTTP over SSL. More details in our last lecture on Security and Privacy.
Details about enabling .htaccess and allowed directives
Context: can these directives be in .htaccess files?
AllowOverride: is the server configured to allow this group of directives to be overridden in this location?
Is the required module loaded?
Legal Directives I: Context
Certain Apache directives are legal within .htaccess files. Some are not. See the Apache Documentation for details. Specifically, look at the Context line that is given for the directive in question.
Apache Core Features http://www.apache.org/docs/2.0/mod/core.html
Apache Module List http://www.apache.org/docs/2.0/mod/standard Apache Directives http://www.apache.org/docs/2.0/mod/directives.html
The following is an excerpt from the Apache HTTP Server Version 1.3 documentation
ErrorDocument directive
Syntax: ErrorDocument error-code document
Context: server config, virtual host, directory, .htaccess
Status: core
Override: FileInfo
Compatibility: The directory and .htaccess contexts are only available in Apache 1.1 and later.
Also, the "a" indicator on the Apache Quick Reference Card indicates that the directive is valid within an .htaccess file.
Legal Directives II: AllowOverride
Users are allowed to override certain aspects of the main server configuration. The main server configuration file (httpd.conf) contains an AllowOverride directive that determines which directives within .htaccess files Apache will process. The Override line that is given for each directive in the Apache documentation indicates which configuration directive must be active in order to use that directive with an .htaccess file.
For the FAS system, the main server configuration file has the following directive in place for users' public_html directories:
The following is an excerpt from the Apache HTTP Server Version 1.3 documentation
ErrorDocument directive
Syntax: ErrorDocument error-code document
Context: server config, virtual host, directory, .htaccess
Status: core
Override: FileInfo
Compatibility: The directory and .htaccess contexts are only available in Apache 1.1 and later.
Legal Directives III: Apache Modules
Apache is distributed with several modules. These modules may or may not be active within the Apache server with which you are working. The Core features will always be available.
For example, if the Rewrite Module (mod_rewrite) has not been activated, none of the Rewrite directives will be available to use.
Refer to the Status and Module lines in the documentation for each directive and to the documentation for the specific Apache installation you are using.
Apache Modules
On the Apache (Apache/2.0.51) minerva.dce.harvard.edu web server, the following Apache modules are active:
SEO: Search Engine Optimization
Make your site ready for search engines
well-formed (and hopefully valid) HTML/XHTML.
use mark-up language for headings and lists
titles that stand on their own
"meta" keywords and description
An example using O'Reilly OnLamp.com
In "head" element of page:
<meta name="keywords" content="ONLamp.com,O'Reilly Network,oreillynet,
oreillynet.com,O'Reilly,OREILLY,o'reilly network,o'reilly,
onlamp.com,lamp,lampp,linux,apache,mysql,perl,python,
php,linux,bsd,web development,server development reference,
technical information,open source" />

<meta name="description" content="Welcome to ONLamp.com,
the high performance web development site from the O'Reilly Network
offering comprehensive Lamp developer information and resources.
O'Reilly Network's ONLamp site features original articles,
news and commentary." />
Firefox as a Web Development Tool
Web Developer Extension
Firefox Extension - Live HTTP Headers
Firefox Extension - Firebug
xurl and churl
xurl. A simple Perl script that extracts the links for a single page. Adapted from The Perl Cookbook.
minerva% xurl URL
churl. A simple Perl script that will check the links for a single page. Adapted from The Perl Cookbook.
minerva% churl URL
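The idea behind xurl can be sketched in Python with the standard library. This is not the Perl script from the course, only an illustration of the same technique: parse the page, collect href and src attributes, and resolve them against the page's URL. The class and function names are hypothetical, and the HTML is supplied as a string so that the example needs no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.add(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the sorted, de-duplicated links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return sorted(parser.links)


page = '<a href="/2006-07/courses/">Courses</a> <img src="images/go2.jpg"/>'
links = extract_links(page, "http://www.extension.harvard.edu/2006-07/")
```

A checker in the style of churl would then request each extracted URL and report its status code, skipping schemes it cannot fetch such as mailto: and javascript:.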
churl
minerva$ churl http://www.extension.harvard.edu/
http://www.extension.harvard.edu/:
200 OK http://dceweb.harvard.edu/prod/sowxcrq.taf?school=EXT
200 OK http://dceweb.harvard.edu/prod/sswckce.taf?wgrp=EXT
200 OK http://my.extension.harvard.edu/
200 OK http://www.extension.harvard.edu/2006-07/courses/
200 OK http://www.extension.harvard.edu/2006-07/courses/DistanceEd/courses/
200 OK http://www.extension.harvard.edu/2006-07/courses/citations.jsp
200 OK http://www.extension.harvard.edu/2006-07/forms/
200 OK http://www.extension.harvard.edu/2006-07/help/directory.jsp;jsessionid=KPLBEAHDCC
200 OK http://www.extension.harvard.edu/2006-07/images/go2.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/home.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/profiles/default-5.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/snaps.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/veri.gif
200 OK http://www.extension.harvard.edu/2006-07/news/;jsessionid=KPLBEAHDCCIK
200 OK http://www.extension.harvard.edu/2006-07/news/chaisson.jsp;jsessionid=KPLBEAHDCCI
200 OK http://www.extension.harvard.edu/2006-07/news/creatures.jsp;jsessionid=KPLBEAHDCC
200 OK http://www.extension.harvard.edu/2006-07/news/earthday.jsp;jsessionid=KPLBEAHDCCI
200 OK http://www.extension.harvard.edu/2006-07/news/gittleman.jsp;jsessionid=KPLBEAHDCC
200 OK http://www.extension.harvard.edu/2006-07/news/retirement.jsp;jsessionid=KPLBEAHDC
200 OK http://www.extension.harvard.edu/2006-07/news/volunteers.jsp;jsessionid=KPLBEAHDC
200 OK http://www.extension.harvard.edu/2006-07/overview/
200 OK http://www.extension.harvard.edu/2006-07/overview/tradition.jsp
200 OK http://www.extension.harvard.edu/2006-07/overview/video/
200 OK http://www.extension.harvard.edu/2006-07/overview/welcome.jsp
200 OK http://www.extension.harvard.edu/2006-07/programs/
200 OK http://www.extension.harvard.edu/2006-07/programs/default.jsp#cert
200 OK http://www.extension.harvard.edu/2006-07/programs/info.jsp
200 OK http://www.extension.harvard.edu/2006-07/register/
200 OK http://www.extension.harvard.edu/2006-07/register/financial/
200 OK http://www.extension.harvard.edu/2006-07/register/financial/finaid.jsp
200 OK http://www.extension.harvard.edu/2006-07/register/guidelines/calendar/
200 OK http://www.extension.harvard.edu/2006-07/register/guidelines/international.jsp
200 OK http://www.extension.harvard.edu/2006-07/register/policies/transcripts.jsp
200 OK http://www.extension.harvard.edu/2006-07/stylesheets/home-print.css
200 OK http://www.extension.harvard.edu/2006-07/stylesheets/home-screen.css
200 OK http://www.extension.harvard.edu/DistanceEd/
200 OK http://www.extension.harvard.edu/chooser;jsessionid=KPLBEAHDCCIK
200 OK http://www.google-analytics.com/urchin.js
SKIP https://dceweb.harvard.edu/prod/gowlogn3.taf
SKIP javascript:popUp('/2006-07/snapshots/')
SKIP javascript:popUp2('/2006-07/profiles/default.jsp?n=5')
SKIP mailto:[email protected]
Page Weight
Page weight of http://www.harvard.edu/
Firefox Web Developer Tool Bar
minerva% timefetch -h
Usage: /usr/local/bin/timefetch [dhFjrvXz] [-f host] [-a attempts] [-b broken_images]
       [-s size] [-T text] [-t timeout] http://url [http://url ... ]
/usr/local/bin/timefetch -h for help message

  -a  Number of attempts for the initial page fetch.
  -b  Minimum number of broken images to trigger alarm.
  -d  Debug: view all kinds of marginally useful output.
  -f  Force host: before doing recursive downloads, munge each URL
      and replace the host in the URL with some other host.
  -h  Help: print this help message.
  -j  Java: download java applets as well.
  -F  No frames: If the page is a frameset, do *not* fetch the frames.
      Default is to fetch them.
  -r  Recursive: download all images and calculate cumulative time.
  -s  Minimum size for the entire document (in kilobytes).
  -t  Timeout value for HTTP requests.
  -T  HTML text to scan for (such as "</html>"). Not case sensitive.
  -v  Verbose: print out URLs as they are downloaded.
  -X  Don't exit on errors, just try to continue.
  -z  Exit immediately on errors in fetching the main page.

  NOTE: This program always downloads embedded frames and prints
  a cumulative total for frames and framesets, even if you did not
  specify a recursive download.
timefetch examples
timefetch needs updating (it does not parse CSS, so it will not fetch images referenced from CSS), but it can still be a useful tool.
timefetch reports the actual download time (often not useful) and the total kilobytes downloaded (often useful). Warning: timefetch does not execute JavaScript, nor does it fetch images referenced from CSS.
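The "total kilobytes" idea can be sketched in a few lines of Python. This is not timefetch itself, only an illustration under stated assumptions: the names `ResourceCollector`, `page_weight`, and `size_of` are invented here, and `size_of` is a caller-supplied function that returns the size in bytes of a URL (for example, from a download or a Content-Length header). Like timefetch, the sketch ignores images referenced from CSS and anything loaded by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets named in the HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # Only <link rel="stylesheet"> counts as a fetched resource here.
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def page_weight(html, base_url, size_of):
    """Return total bytes: the HTML itself plus each distinct resource.

    size_of is a hypothetical callable mapping a URL to its size in bytes.
    """
    collector = ResourceCollector(base_url)
    collector.feed(html)
    distinct = sorted(set(collector.resources))  # a browser fetches each URL once
    return len(html.encode("utf-8")) + sum(size_of(url) for url in distinct)
```

An image that appears twice on the page is counted once, on the assumption that the browser reuses its cached copy.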
Web Robots
Robots, Spiders, Crawlers
As they "spider" a site, the robots can perform various actions, such as:
Gathering content for search engines or a website mirror
Validating, checking, or processing content
Spidering Behavior: an example with Lynx
After lynx is done, here's a look at the files we have:
the "lnkNNNNNNNN.dat" files contain the text dump of the pages lynx retrieved
the "traverse.dat" files contain the list of links that lynx retrieved
the "reject.dat" files contain a list of URLs that lynx did not fetch (because they are outside the "realm" specified on the command line)
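The split between traversed and rejected URLs can be illustrated with a short Python sketch. It assumes the "realm" is simply a URL prefix; that is a simplification for illustration, not a statement of lynx's exact rule, and `split_by_realm` is a name invented here.

```python
def split_by_realm(links, realm_prefix):
    """Sort links into those inside the realm and those outside it.

    Assumption: a link is "in the realm" when it starts with realm_prefix.
    """
    traverse, reject = [], []
    for url in links:
        if url.startswith(realm_prefix):
            traverse.append(url)
        else:
            reject.append(url)
    return traverse, reject
```

Links in the first list would be followed; links in the second would only be recorded.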
Link Checking Robots
Check the links on a single page, or on an entire site.
If the robot will follow the links found in a document, it does a GET request for that document; otherwise it should do a HEAD request.
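A minimal Python sketch of the GET-versus-HEAD choice, using the standard `urllib.request` module. The function name `build_check_request` and the `follow_links` flag are invented for this illustration; none of the link checkers listed in this handout is claimed to work this way internally.

```python
import urllib.request


def build_check_request(url, follow_links):
    """Build (but do not send) the request a link checker might use.

    GET is needed when the body will be parsed for more links;
    HEAD is enough to learn the status of a link that will not be followed.
    """
    method = "GET" if follow_links else "HEAD"
    return urllib.request.Request(url, method=method)
```

Passing the returned object to `urllib.request.urlopen` would send the request; the status of the response then tells whether the link is good.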
Examples of Link Checking Robots
churl
checklink
webbot
checkbot
webcheck
lynx
checklink
W3C Link Checker http://validator.w3.org/docs/checklink
Use Online http://validator.w3.org/checklink
Use command line:
Perl, Free
minerva% checklink URL
checklink
minerva% checklink --help
W3C checklink version 3.6.2.26 (c) 1999-2004 W3C
Usage: checklink <options> <uris>
Options:
  -s/--summary                Result summary only.
  -b/--broken                 Show only the broken links, not the redirects.
  -e/--directory              Hide directory redirects, for example
                              http://www.w3.org/TR -> http://www.w3.org/TR/
  -r/--recursive              Check the documents linked from the first one.
  -D/--depth n                Check the documents linked from the first one
                              to depth n (implies --recursive).
  -l/--location uri           Scope of the documents checked in recursive mode.
                              By default, for example for
                              http://www.w3.org/TR/html4/Overview.html
                              it would be http://www.w3.org/TR/html4/
  -n/--noacclanguage          Do not send an Accept-Language header.
  -L/--languages              Languages accepted (default: *).
  -q/--quiet                  No output if no errors are found. Implies -s.
  -v/--verbose                Verbose mode.
  -i/--indicator              Show progress while parsing.
  -u/--user username          Specify a username for authentication.
  -p/--password password      Specify a password.
  --hide-same-realm           Hide 401's that are in the same realm as the
                              document checked.
  -t/--timeout value          Timeout for HTTP requests.
  -d/--domain domain          Regular expression describing the domain to
                              which the authentication information will be
                              sent.
  --masquerade "base1 base2"  Masquerade base URI base1 as base2. See manual
                              page for more information.
  -y/--proxy proxy            Specify an HTTP proxy server.
  -h/--html                   HTML output.
  -?/--help                   Show this message.
  -V/--version                Output version information.

See "perldoc Net::FTP" for information about various environment variables
affecting FTP connections and "perldoc Net::NNTP" for setting a default
NNTP server for news: URIs.

The W3C_CHECKLINK_CFG environment variable can be used to set the
configuration file to use. See details in the full manual page, it can
be displayed with:
  perldoc /usr/local/bin/checklink

More documentation at: http://www.w3.org/2000/07/checklink
Please send bug reports and comments to the www-validator mailing list:
  [email protected] (with 'checklink' in the subject)
  Archives are at: http://lists.w3.org/Archives/Public/www-validator/
webbot
webbot is part of the W3C Libwww package. http://www.w3.org/Robot/
minerva% webbot

W3C OpenSource Software
-----------------------

  Webbot version 5.4.0
  using the W3C libwww library version 5.4.0.

  See "http://www.w3.org/Robot/User/CommandLine" for help
  See "http://www.w3.org/Robot/User/" for user information
  See "http://www.w3.org/Robot/" for general information

  Please send feedback to the <[email protected]> mailing list,
  see "http://www.w3.org/Library/#Forums" for details
webbot example
minerva% webbot -img \
> -depth 99 \
> -prefix http://cscie12.dce.harvard.edu/lecture_notes/20060131/ \
> -include http://cscie12.dce.harvard.edu/lecture_notes/20060131/ \
> -404 404.log \
> -l clf.log \
> -referer referer.log \
> -reject reject.log \
> http://cscie12.dce.harvard.edu/lecture_notes/20060131/
...content removed...
Robot....... Received element 0, attribute 5 with anchor 0x8073700
Robot....... Found `http://cscie12.dce.harvard.edu/' -
............ Already checked
Robot....... Received element 0, attribute 5 with anchor 0x8073688
Robot....... Found `http://cscie12.dce.harvard.edu/lecture_notes/20060131/slide1.html' -
............ Already checked
Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/slide0.html
             2 outstanding requests
Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/images/verit
             1 outstanding request
Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/images/KUSea
             Everything is finished...

Accessed 62 documents in 2.61 seconds (23.79 requests pr sec)
  Did a GET on 53 document(s) and downloaded 182K bytes of document bodies (71396.8
  Did a HEAD on 9 document(s) with a total of 49K bytes

Raw Log files:
  Logged 62 entries in general log file `clf.log'
  Logged 61 entries in referer log file `referer.log'
  Logged 51 entries in rejected log file `reject.log'
  Logged 0 entries in not found log file `404.log'
checkbot
Checkbot http://degraaff.org/checkbot/
minerva% checkbot
Checkbot Example Output
Checkbot Example
minerva% checkbot --help
Checkbot 1.75 command line options:

  --cookies           Accept cookies from the server
  --debug             Debugging mode: No pauses, stop after 25 links.
  --mailto address    Mail brief synopsis to address when done.
  --noproxy domains   Do not proxy requests to given domains.
  --verbose           Verbose mode: display many messages about progress.
  --url url           Start URL
  --match match       Check pages only if URL matches `match'
                      If no match is given, the start URL is used as a match
  --exclude exclude   Exclude pages if the URL matches 'exclude'
  --filter regexp     Run regexp on each URL found
  --ignore ignore     Ignore URLs matching 'ignore'
  --suppress file     Use contents of 'file' to suppress errors in output
  --file file         Write results to file, default is checkbot.html
  --note note         Include Note (e.g. URL to report) along with Mail message.
  --proxy URL         URL of proxy server for HTTP and FTP requests.
  --internal-only     Only check internal links, skip checking external links.
  --sleep seconds     Sleep this many seconds between requests (default 0)
  --style url         Reference the style sheet at this URL.
  --timeout seconds   Timeout for http requests in seconds (default 120)
  --interval seconds  Maximum time interval between updates (default 10800)
  --dontwarn codes    Do not write warnings for these HTTP response codes
  --enable-virtual    Use only virtual names, not IP numbers for servers
  --language          Specify 2-letter language code for language negotiation

Options --match, --exclude, and --ignore can take a perl regular expression
as their argument

Use 'perldoc checkbot' for more verbose documentation.
Checkbot WWW page : http://degraaff.org/checkbot/
Mail bugs and problems: [email protected]
For the Programmer: Writing Your Own
The Perl modules LWP and WWW::Robot make writing robots almost trivial.
Examples in Perl Cookbook, published by O'Reilly
Perl and LWP by Sean Burke, published by O'Reilly
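To show how little code a basic robot needs, here is a minimal sketch in Python rather than Perl. It does not use or imitate the LWP or WWW::Robot APIs. The names `LinkExtractor`, `crawl`, `prefix`, and `fetch` are invented for this illustration, and `fetch` is a caller-supplied function that returns the HTML text of a URL, or None if the page could not be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, prefix, fetch, max_pages=100):
    """Breadth-first crawl of pages whose URL starts with prefix.

    fetch is a hypothetical callable: fetch(url) -> HTML text, or None.
    Returns the URLs visited, in the order they were fetched.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link: a link checker would report it here
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(prefix) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

For a real site, `fetch` could be built on `urllib.request.urlopen`; passing a dictionary's `get` method makes the robot easy to try without touching the network.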
Other Webmaster Tools
Checking HTML Pages
Web Site Mirroring
Document Version Control
Monitor HTTP Server Performance
Checking HTML Pages
HTML tidy, http://tidy.sourceforge.net/
W3C HTML Validation, http://validator.w3.org/
W3C CSS Validation, http://jigsaw.w3.org/css-validator/
WebXact (WAI and Section 508 Compliance), http://webxact.watchfire.com/ Watchfire.
Web Site Mirroring
GNU wget http://www.gnu.org/software/wget/wget.html
w3mir http://langfeldt.net/w3mir/
GNU wget
GNU wget http://www.gnu.org/software/wget/wget.html
minerva% wget --help
w3mir
w3mir http://langfeldt.net/w3mir/
w3mir is an all-purpose HTTP copying and mirroring tool. The main focus of w3mir is to create and maintain a browseable copy of one, or several, remote WWW site(s). Used to the max, w3mir can retrieve the contents of several related sites and leave the mirror browseable via a local web server, or from a file system, such as directly from a CD-ROM.
HTTP Server Stress Test
ApacheBench (ab) is part of the Apache HTTP Server Distribution http://www.apache.org/httpd.html
Time
IP address / Hostname
Username (if under Authentication)
Request
User-Agent
Referrer URL
Response Status
Bytes returned
Possible Data:
The contents of a specified environment variable
Filename
The request protocol
The contents of specified HTTP request headers
The contents of specified HTTP response headers
Remote logname (from identd, if supplied)
The request method
The canonical Port of the server serving the request.
The process ID of the child that serviced the request.
The query string
First line of request
The time taken to serve the request.
The URL path requested.
The canonical ServerName of the server serving the request.
The server name according to the UseCanonicalName setting.
Log Formats
Common Log Format (CLF): host ident auth_user date request status bytes
User-Agent Log: date user-agent
Referer Log: date referrer-url request-url
Combined Log Format: host ident authuser date request status bytes referrer user-agent
Custom Log Formats in Apache
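The field lists for the Common and Combined Log Formats map naturally onto a regular expression. The Python sketch below assumes the usual Apache layout (date in square brackets; request, referrer, and user-agent in double quotes) and does not handle escaped quotes inside a field. The names `LOG_LINE` and `parse_log_line` and the sample line are invented for this illustration.

```python
import re

# host ident authuser [date] "request" status bytes, optionally followed by
# "referrer" "user-agent" (the two extra fields of the Combined Log Format).
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# Invented sample line, for illustration only.
sample = ('192.0.2.10 - - [10/Apr/2007:20:07:33 +0000] "GET / HTTP/1.1" '
          '200 1900 "http://www.example.org/" "Mozilla/5.0"')
record = parse_log_line(sample)
print(record["host"], record["status"], record["referrer"])
```

For a plain CLF line the `referrer` and `user_agent` entries are None, so the same function serves both formats.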
Web Server Logs: Two perspectives
Server Administrator
Content Provider
Web Server Logs: What we would like to know and what we can know
1. What is the busiest time?
2. How long do they stay?
3. How long did it take to fulfill a request?
4. How many requests were there for a specific resource?
5. Where are the users coming from?
6. What browsers are people using?
7. What pages have they been to?
8. How many were looking versus buying?
9. What requests resulted in errors (status 404, etc)?
10. Where do the users go when they leave the site?
11. Do they come back?
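Some of these questions, such as the busiest time, requests per resource, and requests that resulted in errors, can be read directly off the log fields. A minimal Python sketch, assuming each log entry has already been parsed into a dict with `date`, `request`, and `status` strings in Apache's usual layout; the function name `summarize` and the sample records are invented for this illustration.

```python
from collections import Counter


def summarize(records):
    """Count requests per hour, per resource, and per (error status, resource)."""
    by_hour = Counter()
    by_resource = Counter()
    errors = Counter()
    for rec in records:
        # "10/Apr/2007:20:07:33 +0000" -> hour "20"
        hour = rec["date"].split(":")[1]
        by_hour[hour] += 1
        # "GET /index.html HTTP/1.1" -> "/index.html"
        parts = rec["request"].split()
        path = parts[1] if len(parts) > 1 else rec["request"]
        by_resource[path] += 1
        if rec["status"].startswith(("4", "5")):
            errors[(rec["status"], path)] += 1
    return by_hour, by_resource, errors


# Invented sample records, for illustration only.
records = [
    {"date": "10/Apr/2007:20:07:33 +0000", "request": "GET / HTTP/1.1", "status": "200"},
    {"date": "10/Apr/2007:20:15:02 +0000", "request": "GET /old.html HTTP/1.1", "status": "404"},
    {"date": "10/Apr/2007:21:00:10 +0000", "request": "GET / HTTP/1.1", "status": "200"},
]
by_hour, by_resource, errors = summarize(records)
print(by_hour.most_common(1), by_resource["/"], dict(errors))
```

Questions about visits, such as how long users stay or whether they come back, need more than per-request counts, because each log line records a single request.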
Complicating Issues
HTTP is a stateless protocol
Local Cache
Proxy Cache
Proxy Servers
Shared Computers
Log Rotation
approximately 200 to 250 bytes per line (request)
For example, 1,000,000 requests per day (about 12 requests per second), at 250 bytes per line:
log grows at about 2.8 KB per second
238 MB for 1,000,000 requests
compressed (gzip'ed) logs are 7 to 10% of original size!
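The figures above can be reproduced with a few lines of Python. The sketch assumes 250 bytes per line and binary units (1 KB = 1024 bytes, 1 MB = 1024 KB), which is what the 238 MB figure implies; the function name `log_growth` is invented for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def log_growth(requests_per_day, bytes_per_line, compression_ratio):
    """Estimate log growth; compression_ratio is compressed size / original size."""
    bytes_per_day = requests_per_day * bytes_per_line
    return {
        "requests_per_second": requests_per_day / SECONDS_PER_DAY,
        "kb_per_second": bytes_per_day / SECONDS_PER_DAY / 1024,
        "mb_per_day": bytes_per_day / 1024 ** 2,
        "compressed_mb_per_day": bytes_per_day * compression_ratio / 1024 ** 2,
    }


estimate = log_growth(1_000_000, 250, 0.10)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the estimate is about 11.6 requests per second, 2.8 KB per second, and 238.4 MB per day, or roughly 24 MB per day once gzip'ed at 10% of the original size.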
Tools for Log Analysis
Analog http://www.analog.cx/
Stephen Turner
UNIX, Windows, MacOS, others
Free!!
Report Magic
WebTrends Log Analyzer http://www.webtrends.com/