lecture-chap2-app-2


2: Application Layer 1

Chapter 2: Application layer

• Principles of network applications
  • Architecture: client-server or P2P
  • Services that an application needs
• Important application-level protocols
  • FTP, SMTP, P2P, …
• Programming network applications
  • socket API
• Web stuff
  • HTTP, DNS, Web searching


Web and HTTP

First some jargon
• Web page consists of objects
• Object can be HTML file, JPEG image, Java applet, audio file, …
• Web page consists of base HTML file which includes several referenced objects
• Each object is addressable by a URL
• Example URL: www.someschool.edu/someDept/pic.gif
  • host name: www.someschool.edu
  • path name: /someDept/pic.gif
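The host name / path name split of the example URL can be reproduced with Python's standard `urllib.parse`. This is only a small sketch; the helper name `split_url` is made up for illustration.

```python
from urllib.parse import urlsplit

def split_url(url: str) -> tuple[str, str]:
    """Return (host name, path name) for a URL that may lack a scheme."""
    # urlsplit only recognises the host part when the URL contains "//".
    parts = urlsplit(url if "//" in url else "//" + url)
    return parts.netloc, parts.path

host, path = split_url("www.someschool.edu/someDept/pic.gif")
print("host name:", host)
print("path name:", path)
```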


HTTP overview

HTTP: hypertext transfer protocol
• Web’s application layer protocol
• client/server model
  • client: browser that requests, receives, “displays” Web objects
  • server: Web server sends objects in response to requests
• HTTP 1.0: RFC 1945
• HTTP 1.1: RFC 2068

[Figure: a PC running Explorer and a Mac running Navigator each send HTTP requests to, and receive HTTP responses from, a server running the Apache Web server]


HTTP overview (continued)

Uses TCP:
• client initiates TCP connection (creates socket) to server, port 80
• server accepts TCP connection from client
• HTTP messages (application-layer protocol messages) exchanged between browser (HTTP client) and Web server (HTTP server)
• TCP connection closed

HTTP is “stateless”
• server maintains no information about past client requests

Aside: protocols that maintain “state” are complex!
• past history (state) must be maintained
• if server/client crashes, their views of “state” may be inconsistent, must be reconciled



HTTP connections

Nonpersistent HTTP
• At most one object is sent over a TCP connection.
• HTTP/1.0 uses nonpersistent HTTP

Persistent HTTP
• Multiple objects can be sent over single TCP connection between client and server.
• HTTP/1.1 uses persistent connections in default mode


Nonpersistent HTTP

Suppose user enters URL www.someSchool.edu/someDepartment/home.index (contains text, references to 10 jpeg images)

1a. HTTP client initiates TCP connection to HTTP server (process) at www.someSchool.edu on port 80

1b. HTTP server at host www.someSchool.edu, waiting for TCP connection at port 80, “accepts” connection, notifying client

2. HTTP client sends HTTP request message (containing URL) into TCP connection socket. Message indicates that client wants object someDepartment/home.index

3. HTTP server receives request message, forms response message containing requested object, and sends message into its socket


Nonpersistent HTTP (cont.)

4. HTTP server closes TCP connection.

5. HTTP client receives response message containing html file, displays html. Parsing html file, finds 10 referenced jpeg objects

6. Steps 1-5 repeated for each of 10 jpeg objects


Non-Persistent HTTP: Response time

Definition of RTT: time for a small packet to travel from client to server and back.

Response time:
• one RTT to initiate TCP connection
• one RTT for HTTP request and first few bytes of HTTP response to return
• file transmission time

total = 2RTT + transmit time

[Figure: client/server timeline showing initiate TCP connection (RTT), request file (RTT), time to transmit file, file received]
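As a rough check of the formula total = 2RTT + transmit time, here is a small sketch. The function name and the sample numbers (100 ms RTT, 100,000-bit file, 1 Mbps link) are assumptions chosen for illustration, not values from the slide.

```python
def nonpersistent_response_time(rtt_s: float, file_bits: float, link_bps: float) -> float:
    """Response time for one object over nonpersistent HTTP, in seconds."""
    transmit_time = file_bits / link_bps  # time to push the file onto the link
    # one RTT for the TCP handshake, one RTT for request + first response bytes
    return 2 * rtt_s + transmit_time

print(nonpersistent_response_time(rtt_s=0.1, file_bits=100_000, link_bps=1_000_000))
```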



Persistent HTTP

Nonpersistent HTTP issues:
• requires 2 RTTs per object
• OS overhead for each TCP connection
• browsers often open parallel TCP connections to fetch referenced objects

Persistent HTTP
• server leaves connection open after sending response
• subsequent HTTP messages between same client/server sent over open connection
• client sends requests as soon as it encounters a referenced object
• as little as one RTT for all the referenced objects
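To compare the two modes, the sketch below counts RTTs for a base page plus its referenced objects, neglecting transmission time. It assumes serial (no parallel) connections in the nonpersistent case and the best case of one RTT for all referenced objects in the persistent case; the function names and the 100 ms RTT are illustrative.

```python
def nonpersistent_serial_ms(rtt_ms: int, n_objects: int) -> int:
    # base HTML file plus each referenced object costs 2 RTTs
    return 2 * rtt_ms * (1 + n_objects)

def persistent_best_case_ms(rtt_ms: int, n_objects: int) -> int:
    # 2 RTTs for the base file (TCP setup + request),
    # then as little as one RTT for all the referenced objects
    return 2 * rtt_ms + (rtt_ms if n_objects > 0 else 0)

print(nonpersistent_serial_ms(100, 10))   # base page + 10 jpeg objects
print(persistent_best_case_ms(100, 10))
```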


HTTP request message

Two types of HTTP messages: request, response

HTTP request message:
• ASCII (human-readable format)

    GET /somedir/page.html HTTP/1.1
    Host: www.someschool.edu
    User-agent: Mozilla/4.0
    Connection: close
    Accept-language: fr
    (extra carriage return, line feed)

• first line: request line (GET, POST, HEAD commands)
• following lines: header lines
• carriage return, line feed on a line by itself indicates end of message
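The layout of the request message (request line, header lines, then an empty line) can be made concrete with a small parser. This is a sketch under simplifying assumptions (well-formed ASCII input, one header per line); `parse_request` is a hypothetical helper, not part of any library.

```python
def parse_request(raw: str):
    """Split an HTTP request into (method, path, version, headers, body)."""
    # The extra carriage return + line feed separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, path, version = lines[0].split(" ")  # request line
    headers = {}
    for line in lines[1:]:                       # header lines
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body

raw = ("GET /somedir/page.html HTTP/1.1\r\n"
       "Host: www.someschool.edu\r\n"
       "User-agent: Mozilla/4.0\r\n"
       "Connection: close\r\n"
       "Accept-language:fr\r\n"
       "\r\n")
print(parse_request(raw))
```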


HTTP request message: general format


Uploading form input

Post method:
• Web page often includes form input
• Input is uploaded to server in entity body

URL method:
• Uses GET method
• Input is uploaded in URL field of request line:

    www.somesite.com/animalsearch?monkeys&banana
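For the URL method, the form input travels after the `?` in the URL. A minimal sketch of how a server might pull it out, using the standard `urllib.parse`; the helper name `form_input_from_url` is an assumption for illustration.

```python
from urllib.parse import urlsplit

def form_input_from_url(url: str) -> list[str]:
    """Return the '&'-separated input items carried in the URL's query part."""
    query = urlsplit("//" + url).query  # "//" lets urlsplit see the host
    return query.split("&") if query else []

print(form_input_from_url("www.somesite.com/animalsearch?monkeys&banana"))
```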



Method types

HTTP/1.0
• GET
• POST
• HEAD
  • asks server to leave requested object out of response

HTTP/1.1
• GET, POST, HEAD
• PUT
  • uploads file in entity body to path specified in URL field
• DELETE
  • deletes file specified in the URL field


HTTP response message

    HTTP/1.1 200 OK
    Connection: close
    Date: Thu, 06 Aug 1998 12:00:15 GMT
    Server: Apache/1.3.0 (Unix)
    Last-Modified: Mon, 22 Jun 1998 …...
    Content-Length: 6821
    Content-Type: text/html

    data data data data data ...

• first line: status line (protocol, status code, status phrase)
• following lines: header lines
• after the empty line: data, e.g., requested HTML file


HTTP response message


HTTP response status codes

In first line in server->client response message. A few sample codes:

• 200 OK
  • request succeeded, requested object later in this message
• 301 Moved Permanently
  • requested object moved, new location specified later in this message (Location:)
• 400 Bad Request
  • request message not understood by server
• 404 Not Found
  • requested document not found on this server
• 505 HTTP Version Not Supported
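A client reads the status code from the first line of the response. The sketch below splits a status line into protocol, status code, and status phrase; `parse_status_line` and the `MEANINGS` table are illustrative names, and the table only repeats the sample codes listed on this slide.

```python
MEANINGS = {
    200: "request succeeded",
    301: "requested object moved",
    400: "request message not understood by server",
    404: "requested document not found on this server",
    505: "HTTP version not supported",
}

def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split 'HTTP/1.1 200 OK' into (protocol, status code, status phrase)."""
    protocol, code, phrase = line.split(" ", 2)  # phrase may contain spaces
    return protocol, int(code), phrase

protocol, code, phrase = parse_status_line("HTTP/1.1 301 Moved Permanently")
print(protocol, code, phrase, "->", MEANINGS.get(code, "unknown code"))
```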



Trying out HTTP (client side) for yourself

1. Telnet to your favorite Web server:

    telnet cis.poly.edu 80

   Opens TCP connection to port 80 (default HTTP server port) at cis.poly.edu. Anything typed in is sent to port 80 at cis.poly.edu

2. Type in a GET HTTP request:

    GET /~ross/ HTTP/1.1
    Host: cis.poly.edu

   By typing this in (hit carriage return twice), you send this minimal (but complete) GET request to HTTP server

3. Look at response message sent by HTTP server!
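The same experiment can be scripted with a plain TCP socket instead of telnet. This is a sketch, not a full HTTP client: `minimal_get` and `fetch` are hypothetical helper names, the timeout is an arbitrary choice, and the host from the slide may no longer answer, so the call is left commented out.

```python
import socket

def minimal_get(host: str, path: str) -> bytes:
    """Build the minimal GET request: request line, Host header, empty line."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")

def fetch(host: str, path: str, port: int = 80, timeout: float = 5.0) -> bytes:
    """Open a TCP connection, send the request, and read until close or timeout."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(minimal_get(host, path))
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break  # HTTP/1.1 servers may keep the connection open
            if not data:
                break  # server closed the connection
            chunks.append(data)
    return b"".join(chunks)

# Example (needs network access):
# print(fetch("cis.poly.edu", "/~ross/").decode(errors="replace"))
```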


User-server state: cookies

Many major Web sites use cookies

Four components:
1) cookie header line of HTTP response message
2) cookie header line in HTTP request message
3) cookie file kept on user’s host, managed by user’s browser
4) back-end database at Web site

Example:
• Susan always accesses the Internet from the same PC
• visits specific e-commerce site for first time
• when initial HTTP request arrives at site, site creates:
  • unique ID
  • entry in backend database for ID


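Cookies: keeping “state” (cont.)

[Figure: client/server message exchange]
• client’s cookie file initially holds: ebay 8734
• client sends usual http request msg; Amazon server creates ID 1678 for user and creates entry in backend database
• server sends usual http response with Set-cookie: 1678; cookie file now holds: ebay 8734, amazon 1678
• client sends usual http request msg with cookie: 1678; server takes cookie-specific action (accessing backend database) and sends usual http response msg
• one week later: client sends usual http request msg with cookie: 1678; server again takes cookie-specific action and sends usual http response msg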


Cookies (continued)

What cookies can bring:
• authorization
• shopping carts
• recommendations
• user session state (Web e-mail)

Aside: cookies and privacy
• cookies permit sites to learn a lot about you
• you may supply name and e-mail to sites

How to keep “state”:
• protocol endpoints: maintain state at sender/receiver over multiple transactions
• cookies: http messages carry state
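The four cookie components can be mimicked in a few lines. This is a toy simulation, not real HTTP: the `Site` and `Browser` classes are made up for illustration, and the IDs 8734 and 1678 are taken from the example figure.

```python
import itertools

class Site:
    """Web site with a back-end database keyed by cookie ID."""
    def __init__(self, first_id: int = 1678):
        self._ids = itertools.count(first_id)
        self.database = {}  # ID -> list of requested paths

    def handle(self, path, cookie=None):
        """Process one request; return the response header lines as a dict."""
        headers = {}
        if cookie not in self.database:       # first visit: create ID and entry
            cookie = next(self._ids)
            self.database[cookie] = []
            headers["Set-cookie"] = str(cookie)
        self.database[cookie].append(path)    # cookie-specific action
        return headers

class Browser:
    """Keeps the cookie file on the user's host."""
    def __init__(self):
        self.cookie_file = {}  # site name -> ID

    def visit(self, name, site, path):
        headers = site.handle(path, self.cookie_file.get(name))
        if "Set-cookie" in headers:
            self.cookie_file[name] = int(headers["Set-cookie"])

amazon = Site()
browser = Browser()
browser.cookie_file["ebay"] = 8734
browser.visit("amazon", amazon, "/")       # first visit: server sets the cookie
browser.visit("amazon", amazon, "/cart")   # later visit: request carries cookie 1678
print(browser.cookie_file, amazon.database)
```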



Web caches (proxy server)

• user sets browser: Web accesses via cache
• browser sends all HTTP requests to cache
  • object in cache: cache returns object
  • else cache requests object from origin server, then returns object to client

Goal: satisfy client request without involving origin server

[Figure: two clients exchange HTTP requests and responses with a proxy server, which in turn exchanges HTTP requests and responses with two origin servers]
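The cache's decision rule (return the object if cached, else ask the origin server and keep a copy) fits in a few lines. A minimal in-memory sketch: the `WebCache` class and the stand-in `origin` function are hypothetical, and there is no expiry or revalidation here.

```python
class WebCache:
    """Toy proxy cache: serves stored objects, otherwise asks the origin server."""
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # function: url -> object
        self._store = {}
        self.origin_requests = 0

    def get(self, url):
        if url not in self._store:       # miss: act as a client to the origin
            self.origin_requests += 1
            self._store[url] = self._fetch(url)
        return self._store[url]          # hit (or freshly stored object)

def origin(url):
    # stand-in for a real origin server
    return f"<object at {url}>"

cache = WebCache(origin)
cache.get("www.someschool.edu/someDept/pic.gif")
cache.get("www.someschool.edu/someDept/pic.gif")
print(cache.origin_requests)
```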


More about Web caching

• cache acts as both client and server
• typically cache is installed by ISP (university, company, residential ISP)

Why Web caching?
• reduce response time for client request
• reduce traffic on an institution’s access link
• Internet dense with caches: enables “poor” content providers to effectively deliver content (but so does P2P file sharing)


Caching example

Assumptions
• average object size = 100,000 bits
• avg. request rate from institution’s browsers to origin servers = 15/sec
• delay from institutional router to any origin server and back to router = 2 sec

Consequences
• utilization on LAN = 15%
• utilization on access link = 100%
• total delay = Internet delay + access delay + LAN delay = 2 sec + minutes + milliseconds

[Figure: institutional network (10 Mbps LAN, institutional cache) connected by a 1.5 Mbps access link to the public Internet and origin servers]
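The utilization figures follow directly from the assumptions: offered traffic is 15 requests/sec × 100,000 bits = 1.5 Mbps, which is then divided by each link's capacity. A small sketch of that arithmetic (the function name is illustrative):

```python
def utilization(requests_per_sec: float, object_bits: float, link_bps: float) -> float:
    """Fraction of the link capacity used by the offered traffic."""
    return requests_per_sec * object_bits / link_bps

lan = utilization(15, 100_000, 10_000_000)      # 10 Mbps LAN
access = utilization(15, 100_000, 1_500_000)    # 1.5 Mbps access link
upgraded = utilization(15, 100_000, 10_000_000) # access link upgraded to 10 Mbps
print(f"LAN {lan:.0%}, access link {access:.0%}, upgraded access link {upgraded:.0%}")
```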


Caching example (cont)

Possible solution
• increase bandwidth of access link to, say, 10 Mbps

Consequences
• utilization on LAN = 15%
• utilization on access link = 15%
• total delay = Internet delay + access delay + LAN delay = 2 sec + msecs + msecs
• often a costly upgrade

[Figure: institutional network (institutional cache) connected by a 10 Mbps access link to the public Internet and origin servers]



Caching example (cont)

Possible solution: install cache
• suppose hit rate is 0.4

Consequence
• 40% requests will be satisfied almost immediately
• 60% requests satisfied by origin server
• utilization of access link reduced to 60%, resulting in negligible delays (say 10 msec)
• total avg delay = Internet delay + access delay + LAN delay = .6*(2.01) secs + .4*milliseconds < 1.4 secs

[Figure: institutional network (institutional cache) connected by a 1.5 Mbps access link to the public Internet and origin servers]
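The average delay with a cache is a weighted sum over hits and misses. The sketch below plugs in the slide's numbers; the 10 msec figure for a cache hit is an assumption standing in for the unspecified "milliseconds" term.

```python
def average_delay(hit_rate: float, miss_delay_s: float, hit_delay_s: float) -> float:
    """Expected delay: misses go to the origin server, hits are served locally."""
    return (1 - hit_rate) * miss_delay_s + hit_rate * hit_delay_s

# miss: 2 sec Internet delay + 10 msec access delay; hit: assumed 10 msec on the LAN
avg = average_delay(hit_rate=0.4, miss_delay_s=2.01, hit_delay_s=0.01)
print(f"{avg:.3f} secs")
```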


Conditional GET

Goal: don’t send object if cache has up-to-date cached version
• cache: specify date of cached copy in HTTP request
    If-modified-since: <date>
• server: response contains no object if cached copy is up-to-date:
    HTTP/1.0 304 Not Modified

[Figure: two cache/server exchanges]
• object not modified: cache sends HTTP request msg with If-modified-since: <date>; server replies with HTTP response HTTP/1.0 304 Not Modified
• object modified: cache sends HTTP request msg with If-modified-since: <date>; server replies with HTTP response HTTP/1.0 200 OK <data>
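The server-side decision for a conditional GET can be sketched as a comparison of two dates. This is an illustration only: `conditional_get` is a hypothetical function, and the dates are invented for the example.

```python
from datetime import datetime, timezone

def conditional_get(last_modified, if_modified_since, body: bytes):
    """Return (status line, body) for a request with an If-modified-since date."""
    if if_modified_since is not None and last_modified <= if_modified_since:
        return "HTTP/1.0 304 Not Modified", b""  # cached copy is up to date
    return "HTTP/1.0 200 OK", body               # send the (newer) object

modified = datetime(1998, 6, 22, tzinfo=timezone.utc)
cached_on = datetime(1998, 8, 6, tzinfo=timezone.utc)
print(conditional_get(modified, cached_on, b"<data>"))
```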






Web stuff
• DNS


DNS: Domain Name System

People: many identifiers:
• SSN, name, passport #

Internet hosts, routers:
• IP address (32 bit) - used for addressing datagrams
• “name”, e.g., www.yahoo.com - used by humans

Q: how to map between IP addresses and names?

Domain Name System:
• distributed database implemented in hierarchy of many name servers
• application-layer protocol: hosts, routers, name servers communicate to resolve names (address/name translation)
  • note: core Internet function, implemented as application-layer protocol
  • complexity at network’s “edge”



DNS

Why not centralize DNS?
• single point of failure
• traffic volume
• distant centralized database
• maintenance

doesn’t scale!

DNS services
• Hostname to IP address translation
• Host aliasing
  • Canonical and alias names
• Mail server aliasing
• Load distribution
  • Replicated Web servers: set of IP addresses for one canonical name


Distributed, Hierarchical Database

[Figure: tree of DNS servers]
• Root DNS servers
  • com DNS servers: yahoo.com DNS servers, amazon.com DNS servers
  • org DNS servers: pbs.org DNS servers
  • edu DNS servers: poly.edu DNS servers, umass.edu DNS servers

Client wants IP for www.amazon.com; 1st approx:
• Client queries a root server to find com DNS server
• Client queries com DNS server to get amazon.com DNS server
• Client queries amazon.com DNS server to get IP address for www.amazon.com
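The three-step walk down the hierarchy can be simulated with nested dictionaries standing in for the servers. Everything in this sketch is made up for illustration: the server labels and the address 192.0.2.10 (from a range reserved for documentation) are not real DNS data.

```python
# Each "server" is a dictionary mapping what it knows to the next hop.
ROOT = {"com": "com-tld", "org": "org-tld", "edu": "edu-tld"}
TLD = {
    "com-tld": {"amazon.com": "amazon-dns", "yahoo.com": "yahoo-dns"},
    "org-tld": {"pbs.org": "pbs-dns"},
    "edu-tld": {"poly.edu": "poly-dns", "umass.edu": "umass-dns"},
}
AUTHORITATIVE = {"amazon-dns": {"www.amazon.com": "192.0.2.10"}}

def resolve(hostname: str) -> str:
    labels = hostname.split(".")
    tld_server = ROOT[labels[-1]]                          # 1. ask a root server
    auth_server = TLD[tld_server][".".join(labels[-2:])]   # 2. ask the com server
    return AUTHORITATIVE[auth_server][hostname]            # 3. ask amazon.com's server

print(resolve("www.amazon.com"))
```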


DNS: Root name servers
• contacted by local name server that can not resolve name
• root name server:
  • contacts authoritative name server if name mapping not known
  • gets mapping
  • returns mapping to local name server

13 root name servers worldwide:
• a Verisign, Dulles, VA
• b USC-ISI Marina del Rey, CA
• c Cogent, Herndon, VA (also Los Angeles)
• d U Maryland College Park, MD
• e NASA Mt View, CA
• f Internet Software C. Palo Alto, CA (and 17 other locations)
• g US DoD Vienna, VA
• h ARL Aberdeen, MD
• i Autonomica, Stockholm (plus 3 other locations)
• j Verisign, (11 locations)
• k RIPE London (also Amsterdam, Frankfurt)
• l ICANN Los Angeles, CA
• m WIDE Tokyo


TLD and Authoritative Servers

Top-level domain (TLD) servers:
• responsible for com, org, net, edu, etc., and all top-level country domains uk, fr, ca, jp
• Network Solutions maintains servers for com TLD
• Educause for edu TLD

Authoritative DNS servers:
• organization’s DNS servers, providing authoritative hostname to IP mappings for organization’s servers (e.g., Web, mail)
• can be maintained by organization or service provider



Local Name Server

• Does not strictly belong to hierarchy
• Each ISP (residential ISP, company, university) has one.
  • Also called “default name server”
• When a host makes a DNS query, query is sent to its local DNS server
  • Acts as a proxy, forwards query into hierarchy.


DNS name resolution example

Host at cis.poly.edu wants IP address for gaia.cs.umass.edu

iterated query:
• contacted server replies with name of server to contact
• “I don’t know this name, but ask this server”

[Figure: requesting host cis.poly.edu, local DNS server dns.poly.edu, root DNS server, TLD DNS server, authoritative DNS server dns.cs.umass.edu, and gaia.cs.umass.edu, with the query and reply messages numbered 1-8]


DNS name resolution example

recursive query:
• puts burden of name resolution on contacted name server
• heavy load?

[Figure: requesting host cis.poly.edu, local DNS server dns.poly.edu, root DNS server, TLD DNS server, authoritative DNS server dns.cs.umass.edu, and gaia.cs.umass.edu, with the query and reply messages numbered 1-8]


DNS: caching and updating records

• once (any) name server learns mapping, it caches mapping
  • cache entries timeout (disappear) after some time
  • TLD servers typically cached in local name servers
    • Thus root name servers not often visited
• update/notify mechanisms under design by IETF
  • RFC 2136
  • http://www.ietf.org/html.charters/dnsind-charter.html
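The caching behaviour (entries that disappear after some time) can be sketched as a dictionary with expiry times. `DnsCache` is a hypothetical class; the clock is passed in so the timeout can be demonstrated without waiting, and the address is a documentation-range placeholder.

```python
class DnsCache:
    """Name -> value cache whose entries time out after their TTL."""
    def __init__(self, clock):
        self._clock = clock     # function returning the current time in seconds
        self._entries = {}      # name -> (value, expiry time)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires = entry
        if self._clock() >= expires:   # entry has timed out: drop it
            del self._entries[name]
            return None
        return value

now = [0]                               # fake clock for the demonstration
cache = DnsCache(lambda: now[0])
cache.put("gaia.cs.umass.edu", "192.0.2.7", ttl=3600)
print(cache.get("gaia.cs.umass.edu"))   # still cached
now[0] = 3600
print(cache.get("gaia.cs.umass.edu"))   # timed out
```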


DNS records

DNS: distributed db storing resource records (RR)

RR format: (name, value, type, ttl)

• Type=A
  • name is hostname
  • value is IP address
  • E.g.: (dns.umass.edu, 128.119.40.111, A)
• Type=NS
  • name is domain (e.g. foo.com)
  • value is hostname of authoritative name server for this domain
  • E.g.: (umass.edu, dns.umass.edu, NS)
• Type=CNAME
  • name is alias name for some “canonical” (the real) name
  • www.ibm.com is really servereast.backup2.ibm.com
  • value is canonical name
  • E.g.: (www.ibm.com, servereast.backup2.ibm.com, CNAME)
• Type=MX
  • value is name of mailserver associated with name
  • E.g.: (foo.com, mail.bar.foo.com, MX)
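Resource records are naturally modelled as tuples, and a lookup often chains record types (MX or CNAME first, then A). A sketch using the slide's example names; the A-record addresses and the TTL of 3600 are invented placeholders, and `rtype` is used as the field name for the record type.

```python
from typing import NamedTuple, Optional

class RR(NamedTuple):
    name: str
    value: str
    rtype: str
    ttl: int

RECORDS = [
    RR("foo.com", "mail.bar.foo.com", "MX", 3600),
    RR("mail.bar.foo.com", "192.0.2.25", "A", 3600),          # address assumed
    RR("www.ibm.com", "servereast.backup2.ibm.com", "CNAME", 3600),
    RR("servereast.backup2.ibm.com", "192.0.2.80", "A", 3600),  # address assumed
]

def lookup(name: str, rtype: str) -> list[str]:
    return [rr.value for rr in RECORDS if rr.name == name and rr.rtype == rtype]

def address_of(name: str) -> Optional[str]:
    for canonical in lookup(name, "CNAME"):   # alias: follow to the canonical name
        return address_of(canonical)
    found = lookup(name, "A")
    return found[0] if found else None

def mail_server_address(domain: str) -> Optional[str]:
    for mail_host in lookup(domain, "MX"):    # MX gives a name, A gives its address
        return address_of(mail_host)
    return None

print(address_of("www.ibm.com"), mail_server_address("foo.com"))
```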


Example

Domain              Type   TTL    Answer
hotmail.com.        MX     3600   mx2.hotmail.com.
hotmail.com.        MX     3600   mx3.hotmail.com.
hotmail.com.        MX     3600   mx4.hotmail.com.
hotmail.com.        MX     3600   mx1.hotmail.com.
mx2.hotmail.com.    A      3600   65.54.244.168
mx2.hotmail.com.    A      3600   65.54.244.40
mx2.hotmail.com.    A      3600   65.54.190.50
mx2.hotmail.com.    A      3600   65.54.245.40
mx3.hotmail.com.    A      3600   65.54.244.200
mx3.hotmail.com.    A      3600   64.4.50.179
mx3.hotmail.com.    A      3600   65.54.244.72
mx3.hotmail.com.    A      3600   65.54.245.72
mx4.hotmail.com.    A      3600   65.54.244.104
mx4.hotmail.com.    A      3600   65.54.244.232
mx4.hotmail.com.    A      3600   65.54.245.104
mx4.hotmail.com.    A      3600   65.54.190.179
mx1.hotmail.com.    A      3600   65.54.244.8
mx1.hotmail.com.    A      3600   64.4.50.50
mx1.hotmail.com.    A      3600   65.54.245.8
mx1.hotmail.com.    A      3600   65.54.244.136


DNS protocol, messages

DNS protocol: query and reply messages, both with same message format

msg header
• identification: 16 bit # for query, reply to query uses same #
• flags:
  • query or reply
  • recursion desired
  • recursion available
  • reply is authoritative
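The identification and flags fields are the first two 16-bit words of the message header. The sketch below packs and unpacks just those four bytes with the standard `struct` module; the bit positions follow RFC 1035, while the function names are illustrative and the rest of the header (the count fields) is left out.

```python
import struct

# Flag bit masks within the 16-bit flags word (RFC 1035 layout)
QR = 0x8000  # query (0) or reply (1)
AA = 0x0400  # reply is authoritative
RD = 0x0100  # recursion desired
RA = 0x0080  # recursion available

def pack_header(ident, *, reply=False, authoritative=False,
                recursion_desired=False, recursion_available=False) -> bytes:
    flags = 0
    flags |= QR if reply else 0
    flags |= AA if authoritative else 0
    flags |= RD if recursion_desired else 0
    flags |= RA if recursion_available else 0
    return struct.pack("!HH", ident, flags)   # network byte order, two 16-bit fields

def unpack_header(data: bytes):
    ident, flags = struct.unpack("!HH", data[:4])
    return ident, {
        "reply": bool(flags & QR),
        "authoritative": bool(flags & AA),
        "recursion_desired": bool(flags & RD),
        "recursion_available": bool(flags & RA),
    }

query = pack_header(0x1234, recursion_desired=True)
print(query.hex(), unpack_header(query))
```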



DNS protocol, messages

[Figure: DNS message format, with the sections after the header annotated as follows]
• name, type fields for a query
• RRs in response to query
• records for authoritative servers
• additional “helpful” info that may be used


Inserting records into DNS

Example: just created startup “Network Utopia”
• Register name networkutopia.com at a registrar (e.g., Network Solutions)
  • Need to provide registrar with names and IP addresses of your authoritative name server (primary and secondary)
  • Registrar inserts two RRs into the com TLD server:
      (networkutopia.com, dns1.networkutopia.com, NS)
      (dns1.networkutopia.com, 212.212.212.1, A)
• Put in authoritative server Type A record for www.networkutopia.com and Type MX record for mail.networkutopia.com
• How do people get the IP address of your Web site?


Exercise

Suppose within your Web browser, you click on a link to obtain a Web page. The IP address for the associated URL is not cached in your local host. Suppose that n DNS servers should be visited before your host receives the IP address. The successive visits incur an RTT of RTT1, RTT2, …, RTTn. Suppose that the base HTML file associated with the link references three very small objects (small pictures) on the same server. Let RTT0 denote the RTT between the local host and the server containing the objects. Neglecting transmission times, how much time elapses with

a) Non-persistent HTTP with no parallel TCP connections?
b) Non-persistent HTTP with parallel connections?
c) Persistent HTTP with pipelining?






Web stuff
• Web searching



How Search Engines Work

• Gather the contents of all web pages (using a program called a crawler or spider)
• Organize the contents of the pages in a way that allows efficient retrieval (indexing)
• Take in a query, determine which pages match, and show the results (ranking and display of results)


Standard Web Search Engine Architecture

[Figure: crawler machines crawl the web; check for duplicates, store the documents; create an inverted index from the docIDs; the inverted index is held on the search engine servers]


Standard Web Search Engine Architecture

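[Figure: same architecture as before (crawler machines crawl the web; check for duplicates, store the documents; create an inverted index from the DocIds; inverted index held on search engine servers), now with a user query arriving at the search engine servers, which show results to the user]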


More detailed architecture, from “Anatomy of a Large-Scale Hypertext Web Search Engine”, Brin & Page, 1998.
http://dbpubs.stanford.edu:8090/pub/1998-8


Slide adapted from Lew & Davis

Spiders or crawlers

How to find web pages to visit and copy?
• Can start with a list of domain names, visit the home pages there.
• Look at the hyperlinks on the home page, and follow those links to more pages.
  • Use HTTP commands to GET the pages
• Keep a list of URLs visited, and those still to be visited.
• Each time the program loads in a new HTML page, add the links in that page to the list to be crawled.
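The visited list and to-be-visited list translate directly into a breadth-first loop. This sketch crawls a toy in-memory "web" rather than issuing real HTTP GETs; `crawl`, `get_links`, and the page names are assumptions for illustration.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first; get_links(url) returns the links found on a page."""
    to_visit = deque(dict.fromkeys(seeds))  # URLs still to be visited (deduplicated)
    seen = set(to_visit)                    # everything ever queued
    visited = []                            # URLs visited, in order
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)                 # a real crawler would GET the page here
        for link in get_links(url):
            if link not in seen:            # avoids cycles and repeat visits
                seen.add(link)
                to_visit.append(link)
    return visited

# Toy web: page -> hyperlinks on that page
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["a"]}
print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```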


Spider behaviour varies

• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered


Four Laws of Crawling

• A Crawler must show identification
• A Crawler must obey the robots exclusion standard
  • http://www.robotstxt.org/wc/norobots.html
• A Crawler must not hog resources
• A Crawler must report errors


Lots of tricky aspects

• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
  • The “hidden” web
  • E.g., schedule.berkeley.edu
    • You don’t see the course schedules until you run a query.
• The web is HUGE



The Internet Is Enormous

Image from http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.html


“Freshness”

• Need to keep checking pages
  • Pages change (25%, 7% large changes)
    • At different frequencies
    • Who is the fastest changing?
  • Pages are removed
• Many search engines cache the pages (store a copy on their own servers)


What really gets crawled?

• A small fraction of the Web that search engines know about; no search engine is exhaustive
• Not the “live” Web, but the search engine’s index
• Not the “Deep Web”


ii. Index (the database)

Record information about each page
• List of words
  • In the title?
  • How far down in the page?
  • Was the word in boldface?
• URLs of pages pointing to this one
• Anchor text on pages pointing to this one
  • The anchor text summarizes what the website is about.
  • <a href=http://web.njit… > CS 656 </a>



Inverted Index

How to store the words for fast lookup

Basic steps:
• Make a “dictionary” of all the words in all of the web pages
• For each word, list all the documents it occurs in.
• Often omit very common words
  • “stop words”
• Sometimes stem the words
  • (also called morphological analysis)
  • cats -> cat
  • running -> run
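The basic steps can be sketched in a few lines. The stop-word list and the stemmer here are deliberately crude assumptions (just enough to turn "cats" into "cat" and "running" into "run"); real engines use far more careful morphological analysis.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}   # assumed, tiny stop-word list

def stem(word: str) -> str:
    """Very crude stemmer: strips 'ing' and a plural 's'."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]                   # "runn" -> "run"
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]                       # "cats" -> "cat"
    return word

def build_index(docs: dict) -> dict:
    """Map each (stemmed, non-stop) word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[stem(word)].add(doc_id)
    return index

docs = {1: "The cats are running", 2: "A cat in the hat", 3: "Running shoes"}
index = build_index(docs)
print(sorted(index["cat"]), sorted(index["run"]))
```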


Inverted Index Example

Image from http://developer.apple.com/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html


Inverted Index

• In reality, this index is HUGE
• Need to store the contents across many machines
• Need to do optimization tricks to make lookup fast.


Query Serving Architecture

• Index divided into segments each served by a node
• Each row of nodes replicated for query load
• Query integrator distributes query and merges results
• Front end creates an HTML page with the query results

[Figure: the query “travel” passes through a load balancer to one of the front ends FE1, FE2, …, FE8, then to a query integrator QI1, QI2, …, QI8, which sends it to a row of index nodes Node1,1, Node1,2, Node1,3, …, Node1,N]



iii. Results ranking

Search engine receives a query, then
• Looks up the words in the index, retrieves many documents, then
• Rank orders the pages and extracts “snippets” or summaries containing query words.

Most web search engines assume the user wants all of the words (Boolean AND, not OR).
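With an inverted index, the Boolean AND interpretation amounts to intersecting the document sets of the query words. A minimal sketch; the tiny index literal and the function name `search_and` are made up for illustration.

```python
def search_and(index: dict, query: str) -> set:
    """Return the IDs of documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# word -> set of document IDs (assumed toy data)
INDEX = {"cheap": {1, 2, 3}, "travel": {2, 3, 4}, "tokyo": {3}}
print(search_and(INDEX, "cheap travel"))
print(search_and(INDEX, "cheap travel tokyo"))
```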

These are complex and highly guarded algorithms unique to each search engine.


Some ranking criteria

For a given candidate result page, use:
• Number of matching query words in the page
• Proximity of matching words to one another
• Location of terms within the page
• Location of terms within tags e.g. <title>, <h1>, link text, body text
• Anchor text on pages pointing to this one
• Frequency of terms on the page and in general
• Link analysis of which pages point to this one
• (Sometimes) Click-through analysis: how often the page is clicked on
• How “fresh” is the page

Complex formulae combine these together.


Measuring Importance of Linking

PageRank Algorithm
• Idea: important pages are pointed to by other important pages
• Method:
  • Each link from one page to another is counted as a “vote” for the destination page
  • But the importance of the starting page also influences the importance of the destination page.
  • And those pages’ scores, in turn, depend on those linking to them.

Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188

Measuring Importance of Linking

• Example: each page starts with 100 points.
• Each page’s score is recalculated by adding up the score from each incoming link.
  • This is the score of the linking page divided by the number of outgoing links it has.
  • E.g., the page in green has 2 outgoing links and so its “points” are shared evenly by the 2 pages it links to.

Keep repeating the score updates until no more changes.
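The score-passing procedure can be written out directly. This sketch implements only the simplified rule described here (no damping factor, unlike the full PageRank formulation) and assumes every page has at least one outgoing link and links only to pages in the graph; the three-page graph is invented, and without damping some graphs never settle, hence the round limit.

```python
def iterate_scores(links: dict, start: float = 100.0,
                   max_rounds: int = 200, tol: float = 1e-9) -> dict:
    """links maps each page to the list of pages it links to."""
    scores = {page: start for page in links}        # each page starts with 100 points
    for _ in range(max_rounds):
        new = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            share = scores[page] / len(outgoing)     # points shared evenly over links
            for target in outgoing:
                new[target] += share                 # add up score from incoming links
        if all(abs(new[p] - scores[p]) < tol for p in links):
            return new                               # no more changes
        scores = new
    return scores

# A has 2 outgoing links, so its points are split between B and C.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(iterate_scores(graph).items()):
    print(page, round(score, 3))
```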

Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188



Search Engine Information

• www.searchenginewatch.com
• www.searchenginejournal.com
• www.searchengineshowdown.com
• http://battellemedia.com
• http://jeremy.zawodny.com/blog/


Acknowledgement

Slides about web searching are adapted from the slides authored by Dr. Marti Hearst.
