The World Wide Web

04/22/23 CS403 Introduction 1

Modified by Linda Kenney, 2/4/08

Using the Web, it’s possible for anyone to publish their own Web pages on a host running a Web server and have those pages available to any Internet user with a Web browser.


Hypertext

The Web was invented in 1990, but it was based on the concept of hypertext, which had been around for decades.

The basic idea of hypertext is to take the passive cross-references that are common in printed text and make them active.

When reading a book, a cross-reference passively informs the reader where to turn for additional info and the reader must manually perform the actions necessary to obtain that additional info if it is desired.

Examples?


Hypertext

On a computer, it’s easy to make cross-references active.

You notify the reader that additional info is available, but let the computer take the actions necessary to obtain that info if the reader desires it.

Such an active cross-reference is called a hyperlink (or just “link”) and text that contains such links is called hypertext.

This concept is fundamental to the Web.


Web presentations

Most Web pages do not exist in isolation.

The vast majority of them are grouped together into collections of pages with a common purpose or theme.

Such a collection of Web pages is called a Web presentation or Web site.

Typically, all the pages within a given presentation are under the editorial control of a single individual or organization.


Web presentations (cont.)

A given Web page is likely to contain several links to other pages.

Often, those links will lead to other resources within the same presentation. These links are called “local links” or “links to local resources”.

Some of those links may lead to other resources which are part of a different presentation. These links are called “remote links” or “links to remote resources”.
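
The distinction between local and remote links can be sketched in a few lines of Python. This is only an illustration: it assumes that "same presentation" can be approximated by "same hostname", which is not always true, and the function name `classify_link` is made up for this example.

```python
from urllib.parse import urljoin, urlsplit


def classify_link(page_url: str, href: str) -> str:
    """Return "local" if the link stays on the page's host, else "remote".

    Assumption: a presentation is approximated by a single hostname.
    """
    # Resolve relative links (e.g. "catalog/prod1.html") against the page's URL.
    target = urljoin(page_url, href)
    same_host = urlsplit(target).hostname == urlsplit(page_url).hostname
    return "local" if same_host else "remote"
```

For a page at http://www.sample.com/products/index.html, a link written as catalog/prod1.html would be classified as local, while a link to a page on a different host would be classified as remote.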


Clients and servers on the Web

Like most Internet services, the Web is based on the client/server model.

A Web browser is just a specific example of a client program.


Clients and servers on the Web (cont.)

The browser can’t accomplish much without the cooperation of a server.

A Web server is a program that makes files available to Web browsers upon request.

In general, the files a Web server makes available contain Web pages and the images, sounds, videos and other media that supplement them.

And all the files a Web server has access to are generally stored in the secondary storage of the host on which the server runs.


Hypertext Transfer Protocol

Hypertext Transfer Protocol (HTTP) is the protocol that Web browsers and Web servers use to communicate with one another.

As a protocol, it carefully defines the range of possibilities, determining precisely what a browser may say to a server and when.

It also dictates what servers can say to browsers and when.


Hypertext Transfer Protocol

Browser to server: “I need the file page.html”

Server to browser: “Here is the file page.html”


HTTP requests and responses

When “speaking” HTTP, a Web browser generally sends an HTTP GET request to the Web server on a specific host requesting a specific resource.

When it receives an HTTP GET request from a browser, a Web server, in turn, sends some sort of HTTP response back to the browser.

Note that HTTP requests and responses rely on TCP (Transmission Control Protocol) and IP (Internet Protocol) to get across the Internet. (see p. 72-74)

In other words, HTTP is layered on top of TCP and IP.

Browser to server: HTTP GET request for /page.html

Server to browser (success):

HTTP response
Status code: 200
Content-type: text/html
Content-length: 4370
[contents of /page.html]

Server to browser (failure):

HTTP response
Status code: 404 Not Found
Content-type: text/html
Content-length: 1634
[contents of error status page]
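
To make the structure of a response concrete, here is a minimal sketch of how a program might pull the status code and headers out of the raw bytes of an HTTP response. The function name `parse_response_head` is invented for this example, and the sketch assumes a well-formed HTTP/1.x response whose status line includes a reason phrase such as "Not Found".

```python
def parse_response_head(raw: bytes):
    """Split a raw HTTP response into (status_code, reason, headers, body)."""
    # A blank line (CRLF CRLF) separates the status line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.1 404 Not Found
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body
```

Given a response that starts with "HTTP/1.1 404 Not Found", this returns the number 404, the text "Not Found", a dictionary of headers such as content-type and content-length, and the bytes of the error page.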


The server’s responsibilities

When it receives an HTTP GET request, a Web server must prepare an appropriate HTTP response message.

The request will specify the file being requested.

The server must first locate the requested file within the file system of its host.

If the file cannot be located, the server sends back a ‘404 Not Found’ response message.


The server’s responsibilities (cont.)

Having found the file, however, the server must also verify that the file permissions allow it to access the file.

If the server is not able to access the file, it will typically return a ‘403 Forbidden’ response message.

If the requested file is located and accessible, the server generates a ‘200 OK’ response message that includes the contents of the file as well as a variety of headers that provide information about the file, such as its type, size and last modified date.
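
The decision the server makes can be sketched as follows. This is a simplified illustration, not how any particular Web server is implemented: `choose_status` and `document_root` are hypothetical names, and a real server would also guard against pathnames that try to escape the document root.

```python
import os


def choose_status(document_root: str, pathname: str) -> str:
    """Pick the status a server would report for a requested pathname."""
    # Map the requested pathname onto the host's file system
    # (document_root is an assumed directory holding the site's files).
    file_path = os.path.join(document_root, pathname.lstrip("/"))
    if not os.path.isfile(file_path):
        return "404 Not Found"      # the file could not be located
    if not os.access(file_path, os.R_OK):
        return "403 Forbidden"      # located, but permissions deny access
    return "200 OK"                 # located and accessible
```

The order of the checks mirrors the order described above: locate the file first, then verify permissions, and only then report success.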


Locating files

A typical host stores thousands of files, all of which must be uniquely identified.

It’s impractical to give 100,000 files unique names.

Instead, a host uses a file system consisting of a hierarchy of directories to create uniquely identified locations in which files may be stored.


Locating files (cont.)

Each location can be uniquely identified by the sequence of steps necessary to reach it from the top of the hierarchy.

The list of steps needed to reach a location from the top of the hierarchy is called the absolute path to that location, and every location has a unique absolute path.


Locating files (cont.)

All items in a given location must have unique names.

So each item in the hierarchy can be uniquely identified by combining its absolute path with its filename to form an absolute pathname.
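
As a small illustration, the following sketch builds an absolute pathname from a list of steps and a filename. It assumes Unix-style paths, where steps are separated by a slash, and the directory names in the usage note are invented.

```python
import posixpath


def absolute_pathname(steps, filename):
    """Combine the steps from the top of the hierarchy with a filename."""
    # The absolute path starts at the top of the hierarchy ("/").
    absolute_path = "/" + "/".join(steps)
    # Absolute path + filename = absolute pathname.
    return posixpath.join(absolute_path, filename)
```

For example, the steps products and catalog combined with the filename prod1.html give the absolute pathname /products/catalog/prod1.html.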


Uniform Resource Locators

Before a browser can request a resource, it needs to know where it can find that resource and what type of server will be providing it.

To find a specific resource, the browser must be told not only the name of the file containing that resource, but also what host it is on and where it is in the file system of that host.

All the information needed to find a specific resource, out of the billions available on the Web, is contained in that resource’s Uniform Resource Locator (URL).

Every resource available on the Web is identified by a unique URL that contains all the information necessary for a browser to retrieve that resource.


Uniform Resource Locators (cont.)

The browser always does the same thing with the URL: it requests the resource and renders it on the screen.

In computer science, we use the term render to refer to the process of producing an image by interpreting some data.

A browser renders a Web resource by determining what to display on the screen based upon what it finds in the HTTP response that contains the contents of that resource.


The anatomy of a URL

Consider a typical URL:

http://www.sample.com/products/catalog/prod1.html

A URL typically begins with the protocol to use when accessing the resource.

The remainder of the URL is the identifier that tells the browser how to locate the resource.

The identifier starts with a hostname that uniquely identifies the host on which the resource is stored.

The rest of the identifier is the pathname that uniquely locates the resource in that host’s file system.

The pathname consists of a path and a file name.
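
The parts named above can be picked out of the sample URL with Python's standard urllib.parse module. This is a sketch for illustration; the function name `url_parts` and the dictionary keys are chosen here to match the slide's vocabulary and are not standard terms in the library.

```python
import posixpath
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Break a URL into protocol, hostname, pathname, path and filename."""
    parts = urlsplit(url)
    # The pathname consists of a path and a file name.
    path, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "pathname": parts.path,
        "path": path,
        "filename": filename,
    }
```

For the sample URL, this gives the protocol http, the hostname www.sample.com, the pathname /products/catalog/prod1.html, the path /products/catalog and the filename prod1.html.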


The Web step-by-step – step 1

The process of displaying a Web resource begins when the browser is given the URL of that resource by the user.

The browser examines that URL to find out what it needs to do next.

The first part (ex: http://) tells the browser what protocol to use, and indirectly what type of server to contact.

The identifier tells the browser where the resource is located.

The hostname in the identifier tells the browser which host is running the server responsible for the resource.

The pathname in the identifier tells the browser precisely where the desired resource is stored in that host’s file system.

Using this information, the browser composes an HTTP GET request message.

The GET request contains the pathname of the desired resource as well as the hostname of the server’s host and various other information.
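
Step 1 can be sketched as a function that turns a URL into the text of a GET request. This is a bare-bones illustration: real browsers send many more headers, the function name `compose_get_request` is made up, and the "Connection: close" header is an assumption added so the server closes the connection after one response. Query strings are ignored for simplicity.

```python
from urllib.parse import urlsplit


def compose_get_request(url: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request for the given URL."""
    parts = urlsplit(url)
    pathname = parts.path or "/"   # an empty pathname means the top-level page
    lines = [
        f"GET {pathname} HTTP/1.1",   # the pathname of the desired resource
        f"Host: {parts.hostname}",    # the hostname of the server's host
        "Connection: close",          # assumption: one request per connection
        "",                           # blank line marks the end of the request
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

For the sample URL http://www.sample.com/products/catalog/prod1.html, the first line of the request is "GET /products/catalog/prod1.html HTTP/1.1" and the second is "Host: www.sample.com".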


The Web step-by-step – step 2

The HTTP GET request must be sent to the appropriate server.

Since it must arrive in its entirety at a specific host, the request gets sent over the Internet using TCP and IP.

To establish a TCP connection with the server, the browser needs to know the IP address of the host running the server.

To get the IP address of the server’s host, the browser resolves the hostname in the URL’s identifier using DNS.

Using the IP address of the server’s host, the browser establishes a connection with the server.

The HTTP GET request message is sent to the server over this connection.

Since the request message is small, it takes little time to send.
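
The two sub-steps, resolving the hostname and then opening a TCP connection, can be sketched with Python's standard socket module. The function name `send_request` is invented for this example, and the sketch assumes the server closes the connection once it has sent its whole response (for instance because the request asked for "Connection: close").

```python
import socket


def send_request(hostname: str, port: int, request: bytes,
                 timeout: float = 5.0) -> bytes:
    """Resolve a hostname, connect over TCP, send a request, return the reply."""
    # DNS step: translate the hostname into an IP address.
    ip_address = socket.gethostbyname(hostname)
    # TCP step: establish a connection to the server at that address.
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Web servers conventionally listen on port 80 for HTTP, so a browser-like call would pass the hostname from the URL, port 80 and the bytes of the GET request.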


The Web step-by-step – step 3

When a Web server receives an HTTP GET request, it composes an HTTP response.

Using the pathname specified in the request, the server attempts to locate the file containing the resource within the file system of its host.

Once the resource’s file has been located, the server verifies that it has permission to access that file.


The Web step-by-step – step 3 (cont.)

If the server is able to locate and access the file, the HTTP response will indicate success.

The response will also indicate the date and time at which the file was last modified, the type of resource the file contains and how big it is.

And the server will include the contents of the resource’s file in the response message.

Note that this means the size of the response message is primarily determined by the size of the resource being requested.

If the server is unable to locate or access the file, the HTTP response will indicate the nature of the problem.

The response may also contain some content for the browser to use in lieu of the requested resource.
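
The descriptive information in a successful response (last modified date, type and size) can be gathered from the file system. The sketch below is an assumption-laden illustration: `describe_file` is a hypothetical helper, the MIME type is passed in by the caller rather than detected, and the header names follow common HTTP usage.

```python
import os
from email.utils import formatdate


def describe_file(file_path: str, mime_type: str) -> dict:
    """Build the headers that describe a file in an HTTP response."""
    info = os.stat(file_path)
    return {
        # When the file was last modified, in the date format HTTP uses.
        "Last-Modified": formatdate(info.st_mtime, usegmt=True),
        # The type of resource the file contains.
        "Content-Type": mime_type,
        # How big the file is, in bytes.
        "Content-Length": str(info.st_size),
    }
```

Because the contents of the file follow these headers, the Content-Length value is also a good estimate of how large the whole response message will be.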


The Web step-by-step – step 4

The server must now send the response back to the requesting browser.

It gets the IP address for the browser from the packet that carried the HTTP request.

Because they typically contain the contents of the requested resource, HTTP response messages tend to be significantly larger than HTTP request messages.

To minimize the time a user must wait to receive a requested resource, it’s up to the creator of that resource to minimize the size of the file(s) containing the resource(s).


The Web step-by-step – step 5

Upon receiving an HTTP response message, the browser is responsible for rendering the resource it contains.

Many resources will be Web pages, which are written in Extensible Hypertext Markup Language (XHTML).

Rendering a Web page involves interpreting the XHTML to determine what the page should look like.

Other resources, however, will be other forms of media such as images, sounds and video.

Rendering multimedia resources involves interpreting the data those resources contain and producing the image, sound or video that data represents.

Browsers therefore need to understand a range of resource types.


The Web step-by-step – step 5 (cont.)

It’s also useful to note at this stage that even though a Web page may appear to contain images, sounds and videos, each of those resources must be stored separately in its own file.

And each of those resources must therefore be retrieved from a server with a separate HTTP transaction.

So, the time it takes to retrieve a Web page is the sum of the time it takes to retrieve all of its parts.
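
To see how many separate transactions a page implies, the following sketch counts the embedded resources a page refers to. It is a simplification: it only looks at the src attribute of a few tags (img, audio, video, embed), ignores style sheets and scripts, and counts a resource twice if the page uses it twice. The names `ResourceCounter` and `transactions_needed` are invented for this example.

```python
from html.parser import HTMLParser


class ResourceCounter(HTMLParser):
    """Collect the URLs of the images and media a page embeds."""

    EMBEDDING_TAGS = {"img", "audio", "video", "embed"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        if tag in self.EMBEDDING_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.resources.append(src)


def transactions_needed(page_html: str) -> int:
    """One transaction for the page itself, plus one per embedded resource."""
    counter = ResourceCounter()
    counter.feed(page_html)
    return 1 + len(counter.resources)
```

A page containing two images and one embedded video would need four transactions under these assumptions: one for the page and three for its parts.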


The browser lends a hand

Browsers can play a role in minimizing the time the user must wait for a page to load.

A user often revisits the same resources repeatedly.

So, what you want is for the browser to save the resource so that you can return to it without having to request it from the server again.


The browser cache

As a browser receives each requested resource, it stores a copy of that resource in a special place called the browser cache.

Along with the contents of the resource it stores the current date and time and the URL used to retrieve the resource.

Each time a resource is requested, the browser checks to see if that resource is already stored in its cache.

If it’s not, then the browser goes about retrieving the resource as we’ve already described.
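
The idea of the cache can be sketched with a dictionary keyed by URL. This is a toy model, not how any real browser stores its cache: the names `BrowserCache` and `get_resource` are invented, the `fetch` argument stands in for the full request process described earlier, and the sketch assumes a cached copy is always reused without checking whether it is out of date.

```python
import time


class BrowserCache:
    """A toy cache that remembers each resource by the URL used to get it."""

    def __init__(self):
        self._entries = {}

    def store(self, url, content):
        # Keep the contents together with the date and time they were stored.
        self._entries[url] = (content, time.time())

    def lookup(self, url):
        entry = self._entries.get(url)
        return None if entry is None else entry[0]


def get_resource(url, cache, fetch):
    """Return a resource, asking the server only if it is not cached."""
    content = cache.lookup(url)
    if content is None:
        content = fetch(url)        # not cached: retrieve it from the server
        cache.store(url, content)   # remember it for next time
    return content
```

Calling `get_resource` twice with the same URL results in only one call to `fetch`; the second request is answered from the cache.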


When things go wrong…

Although it often goes off without a hitch, there are places in an HTTP transaction where problems can occur.

Knowing what might go wrong can help us make sense of otherwise cryptic or confusing error messages we may get from our browser.

Of course, different browsers and servers are free to use different error messages as they see fit, so the wording may differ.


When things go wrong… (cont.)

If the hostname in the URL cannot be resolved to an IP address using DNS, there’s no way to establish the necessary TCP connection to the server.

In this case, we’ll get an error to the effect of “Unable to locate server”.


When things go wrong… (cont.)

The hostname may resolve but the TCP connection may not be able to be established for a variety of other reasons.

In this case, we’ll get an error to the effect of “No response”.


When things go wrong… (cont.)

If we’re able to get a TCP connection and send an HTTP request to the server, there’s no guarantee it will be successful.

If the server is unable to locate the requested file, we’ll get an error to the effect of “Not found”.

If the server locates the file but does not have permission to access it, we’ll get an error to the effect of “Forbidden” or “Access denied”.
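
The four kinds of failure described on these slides can be related to what a program sees when it tries to fetch a resource. The sketch below is illustrative only: the function names are made up, the message wording follows the slides rather than any real browser, and it assumes Python's socket module, where a failed hostname lookup raises socket.gaierror.

```python
import socket


def explain_failure(error: Exception) -> str:
    """Translate a connection-stage exception into a slide-style message."""
    if isinstance(error, socket.gaierror):
        # The hostname could not be resolved to an IP address using DNS.
        return "Unable to locate server"
    if isinstance(error, (ConnectionError, TimeoutError, socket.timeout)):
        # The hostname resolved, but no TCP connection could be established.
        return "No response"
    return "Unknown problem"


def explain_status(code: int) -> str:
    """Translate an HTTP status code into a slide-style message."""
    messages = {200: "OK", 403: "Forbidden", 404: "Not found"}
    return messages.get(code, "Other status")
```

The first function covers problems that occur before any HTTP response arrives; the second covers problems the server reports in its response.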


…And how to fix it

Understanding the root cause of an error can often help you devise a solution to the problem.


…And how to fix it (cont.)

If you get an “Unable to locate server” error, you know there’s a problem with the hostname in the URL.

Double-check your typing of the hostname.

Make sure your network connection is still working.

Ensure that your DNS server is functioning in general.


…And how to fix it (cont.)

If you get a “No response” error, you know the hostname is okay but the server is not able to respond.

Often, there’s nothing you can do about this yourself.

However, since this is often a temporary problem, try again a little later.



If you get a “Not found” error, you know there’s a problem with the pathname in the URL.

Again, double-check your typing, paying attention to case.

Try eliminating steps from the pathname one at a time, moving from right to left.
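
The "eliminate steps from right to left" technique can be written out as a short function that lists the URLs to try, in order. The function name `shorter_urls` is invented for this sketch, and it assumes each shortened URL should end in a slash so that it refers to a directory.

```python
from urllib.parse import urlsplit, urlunsplit


def shorter_urls(url: str) -> list:
    """List the URLs obtained by removing pathname steps from right to left."""
    parts = urlsplit(url)
    steps = [step for step in parts.path.split("/") if step]
    candidates = []
    while steps:
        steps.pop()   # drop the rightmost step
        path = "/" + "/".join(steps) + ("/" if steps else "")
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates
```

For http://www.sample.com/products/catalog/prod1.html, the candidates are http://www.sample.com/products/catalog/, then http://www.sample.com/products/, then http://www.sample.com/.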



If you get a “Forbidden” error, the problem is with the permissions on the file containing the requested resource.

If the file belongs to you, simply adjust the permissions.

Otherwise, there’s little you can do about this problem yourself except contact the owner of the resource.


Resource types

As we’ve seen, the Web consists of a variety of resource types.

In each HTTP response, the server includes an indicator of the resource’s type so the browser knows how to render it.

Since servers and browsers must agree on the meaning of this type info, it needs to be standardized.


Resource types (cont.)

The standard used for this purpose is called Multipurpose Internet Mail Extensions (MIME).

As you can tell from its name, MIME was originally designed for use with e-mail.

A MIME type consists of an indicator of the general resource type (text, image, audio, etc.) followed by a / followed by an indicator of the specific resource type (html, jpeg, mpeg, etc.).

For example, XHTML files are assigned a MIME type of text/html.

JPEG image files are assigned a MIME type of image/jpeg.

MP3 sound files are assigned a MIME type of audio/mpeg.
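
The two-part structure of a MIME type is easy to take apart in code. This is a minimal sketch with an invented function name, `split_mime_type`; it ignores optional extras that can follow a MIME type in practice, such as character set parameters.

```python
def split_mime_type(mime_type: str):
    """Split a MIME type into its general and specific parts."""
    general, _, specific = mime_type.partition("/")
    if not general or not specific:
        raise ValueError(f"not a MIME type: {mime_type!r}")
    return general, specific
```

Splitting text/html gives the general type text and the specific type html; splitting audio/mpeg gives audio and mpeg.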


Filename extensions

The server needs to know the type of each resource for which it is responsible.

Otherwise, it wouldn’t know what MIME type to list in the HTTP response message.

Servers are set up to use the extension of the resource’s filename to determine its type.

A filename extension is part of the actual filename, but it comes at the end and starts with a dot.

Examples?

The server is configured to associate certain filename extensions with specific MIME types.
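
A server's extension-to-type configuration can be pictured as a simple lookup table. The table below is a small hand-written sample, not the configuration of any real server, and the fallback type application/octet-stream (generic binary data) is an assumption made for this sketch. Python's standard library offers a fuller version of this idea in its mimetypes module.

```python
import posixpath

# Hypothetical sample configuration: filename extension -> MIME type.
EXTENSION_TO_MIME = {
    ".html": "text/html",
    ".htm": "text/html",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".css": "text/css",
    ".mp3": "audio/mpeg",
}


def mime_type_for(filename: str, default: str = "application/octet-stream") -> str:
    """Look up the MIME type for a file based on its filename extension."""
    # The extension comes at the end of the filename and starts with a dot.
    _, extension = posixpath.splitext(filename)
    return EXTENSION_TO_MIME.get(extension.lower(), default)
```

A file named prod1.html is reported as text/html, while a file with an extension the table does not list falls back to the default type.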


Filename extensions (cont.)

For this reason, it’s important to name all of the files containing your Web resources with appropriate filename extensions.

We’ll generally use only a small number of resource types in this course.

XHTML files are given .html (or .htm) extensions.

JPEG images are given .jpg (or .jpeg) extensions.

GIF images are given .gif extensions.

CSS files are given .css extensions.


What Browsers Understand

A browser understands HTTP, the protocol for retrieving Web pages.

Most browsers also understand protocols for other Web services like file transfer, instant messaging, e-mail and network news.

A browser understands XHTML and HTML and can interpret them in order to render Web pages.

Many also understand other popular languages like CSS, JavaScript and XML.


What Browsers Understand (cont.)

Most browsers understand common image file formats like JPEG and GIF and can render images stored in these formats.

Some also understand image file formats like BMP and PNG.

Many browsers understand other forms of media as well.

Flash presentations are used for interactive animations.

MP3 is a file format commonly used for storing sounds and music.

MPEG and AVI are common file formats for storing video.


What Browsers Understand (cont.)

A good browser is designed to provide the functionality most Web users are likely to need.

Since people use the Web in many different ways, most browsers are designed to accept two different types of add-ons that extend their capabilities.


Add-Ons: Helpers and Plug-Ins (p. 76-83)

An application is a program you run on your computer to accomplish specific tasks.

You can obtain applications from retail software stores or the Internet.

A browser often uses other applications to view the Web.

You can customize what applications your browser uses.


Helpers

A helper application is an application a browser can launch. It can be any application on your computer.

Examples?

When your browser encounters a file that requires special handling, it looks for an appropriate helper application and opens the file in that application.


Plug-Ins

A browser plug-in is an application that expands the capabilities of a Web browser.

When you install a plug-in, you extend the capabilities of your browser to handle a file type that it wasn’t originally designed to handle.

Any file requiring that plug-in will be displayed inside the browser window, with the plug-in working as if it were a part of your browser.


Plug-Ins (cont.)

Plug-ins support everything from audio to animation to documents.

Plug-ins increase your browser’s memory requirements and launch time.

You can find Web pages to help you locate plug-ins for your browser.


Common plug-ins and helper applications:


Key terms

Absolute path, Absolute pathname, Browser cache, Browsing, Conceptual network, File system, Filename extension, Helper app, Hostname, HTTP, HTTP GET request, HTTP HEAD request, HTTP response, Hyperlink, Hypermedia, Hypertext, Identifier

Link, Local link, MIME, MIME type, Pathname, Permissions, Plug-in, Remote link, Render, Scheme, URL, Web browser, Web presentation, Web server, Web site, World Wide Web, XHTML


Some information used from:

Web 101 by Lehnert and Kopec
