2: Application Layer 1
Chapter 2: Application layer
Principles of network applications
• Architecture: client-server or P2P
• Services that an application needs
Important application-level protocols
• FTP, SMTP, P2P, ...
Programming network applications
• socket API
Web stuff
• HTTP, DNS, Web searching
2: Application Layer 2
Web and HTTP
First some jargon
• Web page consists of objects
• Object can be HTML file, JPEG image, Java applet, audio file, ...
• Web page consists of base HTML file which includes several referenced objects
• Each object is addressable by a URL
• Example URL: www.someschool.edu/someDept/pic.gif
  • host name: www.someschool.edu
  • path name: /someDept/pic.gif
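The host name / path name split above can be sketched in a few lines of Python. This is only an illustration using the standard library's urllib.parse; the helper name split_url and the choice to prepend "http://" when no scheme is given are assumptions, not part of the slides.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (host name, path name) for a URL."""
    # urlsplit only recognises the host when a scheme is present, so a
    # slide-style URL without one gets "http://" prepended (an assumption).
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return parts.hostname, parts.path


host, path = split_url("www.someschool.edu/someDept/pic.gif")
print(host)  # www.someschool.edu
print(path)  # /someDept/pic.gif
```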
2: Application Layer 3
HTTP overview
HTTP: hypertext transfer protocol
• Web’s application layer protocol
• client/server model
  • client: browser that requests, receives, “displays” Web objects
  • server: Web server sends objects in response to requests
• HTTP 1.0: RFC 1945
• HTTP 1.1: RFC 2068
[Figure: a PC running Explorer and a Mac running Navigator each send an HTTP request to a server running the Apache Web server, which returns an HTTP response to each.]
2: Application Layer 4
HTTP overview (continued)
Uses TCP:
• client initiates TCP connection (creates socket) to server, port 80
• server accepts TCP connection from client
• HTTP messages (application-layer protocol messages) exchanged between browser (HTTP client) and Web server (HTTP server)
• TCP connection closed
HTTP is “stateless”
• server maintains no information about past client requests
Aside: protocols that maintain “state” are complex!
• past history (state) must be maintained
• if server/client crashes, their views of “state” may be inconsistent, must be reconciled
2: Application Layer 5
HTTP connections
Nonpersistent HTTP
• At most one object is sent over a TCP connection.
• HTTP/1.0 uses nonpersistent HTTP
Persistent HTTP
• Multiple objects can be sent over a single TCP connection between client and server.
• HTTP/1.1 uses persistent connections in default mode
2: Application Layer 6
Nonpersistent HTTP
Suppose user enters URL www.someSchool.edu/someDepartment/home.index (contains text, references to 10 jpeg images)
1a. HTTP client initiates TCP connection to HTTP server (process) at www.someSchool.edu on port 80
1b. HTTP server at host www.someSchool.edu waiting for TCP connection at port 80. “accepts” connection, notifying client
2. HTTP client sends HTTP request message (containing URL) into TCP connection socket. Message indicates that client wants object someDepartment/home.index
3. HTTP server receives request message, forms response message containing requested object, and sends message into its socket
2: Application Layer 7
Nonpersistent HTTP (cont.)
4. HTTP server closes TCP connection.
5. HTTP client receives response message containing html file, displays html. Parsing html file, finds 10 referenced jpeg objects
6. Steps 1-5 repeated for each of 10 jpeg objects
2: Application Layer 8
Non-Persistent HTTP: Response time
Definition of RTT: time for a small packet to travel from client to server and back.
Response time:
• one RTT to initiate TCP connection
• one RTT for HTTP request and first few bytes of HTTP response to return
• file transmission time
total = 2RTT + transmit time
[Figure: client/server timeline. Initiate TCP connection (one RTT), request file (one RTT), time to transmit file, file received.]
2: Application Layer 9
Persistent HTTP
Nonpersistent HTTP issues:
• requires 2 RTTs per object
• OS overhead for each TCP connection
• browsers often open parallel TCP connections to fetch referenced objects
Persistent HTTP
• server leaves connection open after sending response
• subsequent HTTP messages between same client/server sent over open connection
• client sends requests as soon as it encounters a referenced object
• as little as one RTT for all the referenced objects
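To make the RTT counting concrete, here is a rough sketch comparing the two modes. The function names are hypothetical, connections are assumed to be opened one after another (no parallel connections), and the persistent case assumes pipelined requests on the connection already opened for the base file.

```python
def nonpersistent_time(rtt, n_objects, transmit=0.0):
    """Serial nonpersistent HTTP: the base file and every referenced
    object each cost 2 RTT plus one transmission time."""
    return (1 + n_objects) * (2 * rtt + transmit)


def persistent_pipelined_time(rtt, n_objects, transmit=0.0):
    """Persistent HTTP with pipelining: 2 RTT for the base file, then a
    single further RTT covers all referenced objects."""
    base = 2 * rtt + transmit
    referenced = (rtt + n_objects * transmit) if n_objects else 0.0
    return base + referenced


# Example values (assumed): RTT = 100 ms, 10 referenced images,
# transmission times neglected.
print(f"{nonpersistent_time(0.1, 10):.1f} s")         # 2.2 s
print(f"{persistent_pipelined_time(0.1, 10):.1f} s")  # 0.3 s
```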
2: Application Layer 10
HTTP request message
two types of HTTP messages: request, response
HTTP request message: ASCII (human-readable format)

    GET /somedir/page.html HTTP/1.1       <- request line (GET, POST, HEAD commands)
    Host: www.someschool.edu              <- header lines
    User-agent: Mozilla/4.0
    Connection: close
    Accept-language: fr
    (extra carriage return, line feed)    <- carriage return, line feed indicates end of message
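A minimal sketch of how such a request could be assembled as bytes. The helper name build_request is hypothetical; the point is only that every line ends in CRLF and one extra CRLF closes the header lines.

```python
def build_request(method, path, headers):
    """Assemble an HTTP/1.1 request message with no entity body."""
    lines = [f"{method} {path} HTTP/1.1"]                    # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Each line ends with CRLF; the extra CRLF (an empty line) marks the end.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


message = build_request(
    "GET",
    "/somedir/page.html",
    {
        "Host": "www.someschool.edu",
        "User-agent": "Mozilla/4.0",
        "Connection": "close",
        "Accept-language": "fr",
    },
)
print(message.decode("ascii"))
```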
2: Application Layer 11
HTTP request message: general format
2: Application Layer 12
Uploading form input
Post method:
• Web page often includes form input
• Input is uploaded to server in entity body
URL method:
• Uses GET method
• Input is uploaded in URL field of request line:
    www.somesite.com/animalsearch?monkeys&banana
2: Application Layer 13
Method types
HTTP/1.0
• GET
• POST
• HEAD: asks server to leave requested object out of response
HTTP/1.1
• GET, POST, HEAD
• PUT: uploads file in entity body to path specified in URL field
• DELETE: deletes file specified in the URL field
2: Application Layer 14
HTTP response message
    HTTP/1.1 200 OK                            <- status line (protocol, status code, status phrase)
    Connection: close                          <- header lines
    Date: Thu, 06 Aug 1998 12:00:15 GMT
    Server: Apache/1.3.0 (Unix)
    Last-Modified: Mon, 22 Jun 1998 …...
    Content-Length: 6821
    Content-Type: text/html

    data data data data data ...               <- data, e.g., requested HTML file
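The three parts of a response (status line, header lines, data) can be pulled apart as sketched below. The function name parse_response and the short sample message are made up for illustration, and the sketch assumes the whole response is already in memory.

```python
def parse_response(raw):
    """Split a raw HTTP response into status line fields, headers and data."""
    head, _, body = raw.partition(b"\r\n\r\n")     # blank line ends the headers
    lines = head.decode("ascii").split("\r\n")
    version, code, phrase = lines[0].split(" ", 2)  # status line
    headers = {}
    for line in lines[1:]:                          # header lines
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), phrase, headers, body


sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Connection: close\r\n"
    b"Date: Thu, 06 Aug 1998 12:00:15 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"data data data"
)
version, code, phrase, headers, body = parse_response(sample)
print(version, code, phrase)  # HTTP/1.1 200 OK
print(headers["date"])        # Thu, 06 Aug 1998 12:00:15 GMT
```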
2: Application Layer 15
HTTP response message
2: Application Layer 16
HTTP response status codes
In first line of server->client response message. A few sample codes:
• 200 OK: request succeeded, requested object later in this message
• 301 Moved Permanently: requested object moved, new location specified later in this message (Location:)
• 400 Bad Request: request message not understood by server
• 404 Not Found: requested document not found on this server
• 505 HTTP Version Not Supported
2: Application Layer 17
Trying out HTTP (client side) for yourself
1. Telnet to your favorite Web server:
     telnet cis.poly.edu 80
   Opens TCP connection to port 80 (default HTTP server port) at cis.poly.edu. Anything typed in is sent to port 80 at cis.poly.edu.
2. Type in a GET HTTP request:
     GET /~ross/ HTTP/1.1
     Host: cis.poly.edu
   By typing this in (hit carriage return twice), you send this minimal (but complete) GET request to the HTTP server.
3. Look at the response message sent by the HTTP server!
2: Application Layer 18
User-server state: cookies
Many major Web sites use cookies
Four components:
1) cookie header line of HTTP response message
2) cookie header line in HTTP request message
3) cookie file kept on user’s host, managed by user’s browser
4) back-end database at Web site
Example:
• Susan always accesses the Internet from the same PC
• she visits a specific e-commerce site for the first time
• when the initial HTTP request arrives at the site, the site creates:
  • a unique ID
  • an entry in its backend database for the ID
2: Application Layer 19
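Cookies: keeping “state” (cont.)
[Figure: message exchange between client and Amazon server]
• client’s cookie file initially holds: ebay 8734
• client sends usual http request msg; Amazon server creates ID 1678 for user and creates an entry in its backend database
• server returns usual http response with Set-cookie: 1678; client’s cookie file now holds: ebay 8734, amazon 1678
• client sends usual http request msg with cookie: 1678; server takes cookie-specific action (accesses backend database) and returns usual http response msg
• one week later: client sends usual http request msg with cookie: 1678; server again takes cookie-specific action (accesses backend database) and returns usual http response msg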
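The exchange above can be mimicked with two small classes. This is a toy model only: the class names Site and Browser, the "visits" counter kept in the backend database, and the use of plain dicts for headers are all assumptions made for illustration.

```python
import itertools


class Site:
    """Toy Web site that hands out IDs and keeps per-ID state."""

    def __init__(self, first_id):
        self._ids = itertools.count(first_id)
        self.backend = {}  # backend database: ID -> per-user state

    def handle(self, request_headers):
        response_headers = {}
        cookie = request_headers.get("Cookie")
        if cookie is None or cookie not in self.backend:
            # First visit: create a unique ID and a database entry for it.
            cookie = str(next(self._ids))
            self.backend[cookie] = {"visits": 0}
            response_headers["Set-cookie"] = cookie
        # Cookie-specific action: here, just count the visit.
        self.backend[cookie]["visits"] += 1
        return response_headers


class Browser:
    """Toy browser that manages the cookie file on the user's host."""

    def __init__(self):
        self.cookie_file = {}  # site name -> ID

    def visit(self, name, site):
        headers = {}
        if name in self.cookie_file:
            headers["Cookie"] = self.cookie_file[name]
        response = site.handle(headers)
        if "Set-cookie" in response:
            self.cookie_file[name] = response["Set-cookie"]
        return response


browser = Browser()
browser.cookie_file["ebay"] = "8734"
amazon = Site(first_id=1678)
first = browser.visit("amazon", amazon)   # response carries Set-cookie: 1678
second = browser.visit("amazon", amazon)  # request carries cookie: 1678
print(browser.cookie_file)
```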
2: Application Layer 20
Cookies (continued)
What cookies can bring:
• authorization
• shopping carts
• recommendations
• user session state (Web e-mail)
Cookies and privacy (aside):
• cookies permit sites to learn a lot about you
• you may supply name and e-mail to sites
How to keep “state”:
• protocol endpoints: maintain state at sender/receiver over multiple transactions
• cookies: http messages carry state
2: Application Layer 21
Web caches (proxy server)
Goal: satisfy client request without involving origin server
• user sets browser: Web accesses via cache
• browser sends all HTTP requests to cache
  • object in cache: cache returns object
  • else cache requests object from origin server, then returns object to client
[Figure: two clients exchange HTTP requests and responses with a proxy server, which in turn exchanges HTTP requests and responses with two origin servers.]
2: Application Layer 22
More about Web caching
• cache acts as both client and server
• typically cache is installed by ISP (university, company, residential ISP)
Why Web caching?
• reduce response time for client request
• reduce traffic on an institution’s access link
• Internet dense with caches: enables “poor” content providers to effectively deliver content (but so does P2P file sharing)
2: Application Layer 23
Caching example
Assumptions
• average object size = 100,000 bits
• avg. request rate from institution’s browsers to origin servers = 15/sec
• delay from institutional router to any origin server and back to router = 2 sec
Consequences
• utilization on LAN = 15%
• utilization on access link = 100%
• total delay = Internet delay + access delay + LAN delay = 2 sec + minutes + milliseconds
[Figure: origin servers on the public Internet, connected by a 1.5 Mbps access link to an institutional network with a 10 Mbps LAN and an institutional cache.]
2: Application Layer 24
Caching example (cont)
Possible solution
• increase bandwidth of access link to, say, 10 Mbps
Consequences
• utilization on LAN = 15%
• utilization on access link = 15%
• total delay = Internet delay + access delay + LAN delay = 2 sec + msecs + msecs
• often a costly upgrade
[Figure: origin servers on the public Internet, connected by a 10 Mbps access link to an institutional network with a 10 Mbps LAN and an institutional cache.]
2: Application Layer 25
Caching example (cont)
Possible solution: install cache
• suppose hit rate is 0.4
Consequence
• 40% requests will be satisfied almost immediately
• 60% requests satisfied by origin server
• utilization of access link reduced to 60%, resulting in negligible delays (say 10 msec)
• total avg delay = Internet delay + access delay + LAN delay = 0.6*(2.01) secs + 0.4*(milliseconds) < 1.4 secs
[Figure: origin servers on the public Internet, connected by a 1.5 Mbps access link to an institutional network with a 10 Mbps LAN and an institutional cache.]
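The utilization and delay figures in the three caching scenarios follow from simple arithmetic, sketched here. The constant and function names are hypothetical, and the 10 ms delay for a cache hit is an assumed stand-in for the slide's "milliseconds".

```python
OBJECT_BITS = 100_000   # average object size
REQUESTS_PER_SEC = 15   # average request rate to origin servers


def utilization(link_bps, miss_rate=1.0):
    """Fraction of a link's capacity used by requests that traverse it."""
    return miss_rate * REQUESTS_PER_SEC * OBJECT_BITS / link_bps


def avg_delay_with_cache(hit_rate, miss_delay=2.01, hit_delay=0.01):
    """Average delay when a fraction hit_rate is served from the cache.
    hit_delay = 10 ms is an assumption standing in for 'milliseconds'."""
    return (1 - hit_rate) * miss_delay + hit_rate * hit_delay


print(f"LAN, 10 Mbps:            {utilization(10_000_000):.2f}")      # 0.15
print(f"access link, 1.5 Mbps:   {utilization(1_500_000):.2f}")       # 1.00
print(f"access link, 10 Mbps:    {utilization(10_000_000):.2f}")      # 0.15
print(f"access link with cache:  {utilization(1_500_000, 0.6):.2f}")  # 0.60
print(f"avg delay with cache:    {avg_delay_with_cache(0.4):.2f} s")  # 1.21 s
```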
2: Application Layer 26
Conditional GET
Goal: don’t send object if cache has up-to-date cached version
• cache: specify date of cached copy in HTTP request
    If-modified-since: <date>
• server: response contains no object if cached copy is up-to-date:
    HTTP/1.0 304 Not Modified
[Figure: two cache/server exchanges. Object not modified: cache sends HTTP request msg with If-modified-since: <date>; server returns HTTP response HTTP/1.0 304 Not Modified. Object modified: cache sends HTTP request msg with If-modified-since: <date>; server returns HTTP response HTTP/1.0 200 OK with <data>.]
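The server-side decision in a conditional GET comes down to one date comparison, roughly as follows. The function name conditional_get and the use of datetime objects instead of HTTP date strings are simplifying assumptions.

```python
from datetime import datetime, timezone


def conditional_get(last_modified, if_modified_since, body):
    """Return (status line, data) for a GET that may carry If-modified-since."""
    if if_modified_since is not None and last_modified <= if_modified_since:
        # Cached copy is up to date: send no object.
        return "HTTP/1.0 304 Not Modified", b""
    # Object changed since the cached copy (or no date given): send it.
    return "HTTP/1.0 200 OK", body


modified = datetime(1998, 6, 22, tzinfo=timezone.utc)
cached_later = datetime(1998, 8, 6, tzinfo=timezone.utc)
cached_earlier = datetime(1998, 1, 1, tzinfo=timezone.utc)

print(conditional_get(modified, cached_later, b"<data>"))
print(conditional_get(modified, cached_earlier, b"<data>"))
```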
2: Application Layer 27
Chapter 2: Application layer
Principles of network applications
• Architecture: client-server or P2P
• Services that an application needs
Important application-level protocols
• FTP, SMTP, P2P, ...
Programming network applications
• socket API
Web stuff
• DNS
2: Application Layer 28
DNS: Domain Name System
People: many identifiers:
• SSN, name, passport #
Internet hosts, routers:
• IP address (32 bit), used for addressing datagrams
• “name”, e.g., www.yahoo.com, used by humans
Q: map between IP addresses and name?
Domain Name System:
• distributed database implemented in hierarchy of many name servers
• application-layer protocol: hosts, routers, name servers communicate to resolve names (address/name translation)
  • note: core Internet function, implemented as application-layer protocol
  • complexity at network’s “edge”
2: Application Layer 29
DNS
Why not centralize DNS?
• single point of failure
• traffic volume
• distant centralized database
• maintenance
doesn’t scale!
DNS services
• Hostname to IP address translation
• Host aliasing
  • Canonical and alias names
• Mail server aliasing
• Load distribution
  • Replicated Web servers: set of IP addresses for one canonical name
2: Application Layer 30
Distributed, Hierarchical Database
[Figure: tree of DNS servers]
Root DNS servers
• com DNS servers: yahoo.com DNS servers, amazon.com DNS servers
• org DNS servers: pbs.org DNS servers
• edu DNS servers: poly.edu DNS servers, umass.edu DNS servers
Client wants IP for www.amazon.com; 1st approx:
• Client queries a root server to find com DNS server
• Client queries com DNS server to get amazon.com DNS server
• Client queries amazon.com DNS server to get IP address for www.amazon.com
2: Application Layer 31
DNS: Root name servers
• contacted by local name server that cannot resolve name
• root name server:
  • contacts authoritative name server if name mapping not known
  • gets mapping
  • returns mapping to local name server
13 root name servers worldwide
• a Verisign, Dulles, VA
• b USC-ISI Marina del Rey, CA
• c Cogent, Herndon, VA (also Los Angeles)
• d U Maryland College Park, MD
• e NASA Mt View, CA
• f Internet Software C. Palo Alto, CA (and 17 other locations)
• g US DoD Vienna, VA
• h ARL Aberdeen, MD
• i Autonomica, Stockholm (plus 3 other locations)
• j Verisign (11 locations)
• k RIPE London (also Amsterdam, Frankfurt)
• l ICANN Los Angeles, CA
• m WIDE Tokyo
2: Application Layer 32
TLD and Authoritative Servers
Top-level domain (TLD) servers:
• responsible for com, org, net, edu, etc., and all top-level country domains uk, fr, ca, jp
• Network Solutions maintains servers for com TLD
• Educause for edu TLD
Authoritative DNS servers:
• organization’s DNS servers, providing authoritative hostname to IP mappings for organization’s servers (e.g., Web, mail)
• can be maintained by organization or service provider
2: Application Layer 33
Local Name Server
• Does not strictly belong to hierarchy
• Each ISP (residential ISP, company, university) has one.
  • Also called “default name server”
• When a host makes a DNS query, query is sent to its local DNS server
  • Acts as a proxy, forwards query into hierarchy.
2: Application Layer 34
DNS name resolution example
Host at cis.poly.edu wants IP address for gaia.cs.umass.edu
iterated query:
• contacted server replies with name of server to contact
• “I don’t know this name, but ask this server”
[Figure: messages numbered 1-8 among the requesting host cis.poly.edu, the local DNS server dns.poly.edu, the root DNS server, the TLD DNS server, and the authoritative DNS server dns.cs.umass.edu for gaia.cs.umass.edu.]
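The “ask this server instead” behaviour of iterated queries can be sketched with a toy table of servers. Everything in SERVERS is invented for illustration (the labels "root" and "tld-edu" and the address 192.0.2.10, taken from a documentation-only range); the loop plays the role of the local DNS server.

```python
# Hypothetical toy data: each server either knows the answer or refers the
# asker to another server responsible for a suffix of the name.
SERVERS = {
    "root": {"referrals": {"edu": "tld-edu"}},
    "tld-edu": {"referrals": {"umass.edu": "dns.cs.umass.edu"}},
    "dns.cs.umass.edu": {"records": {"gaia.cs.umass.edu": "192.0.2.10"}},
}


def query(server, name):
    """One iterated query: return ("answer", ip) or ("referral", next_server)."""
    data = SERVERS[server]
    if name in data.get("records", {}):
        return "answer", data["records"][name]
    for suffix, next_server in data.get("referrals", {}).items():
        if name == suffix or name.endswith("." + suffix):
            return "referral", next_server
    raise LookupError(f"{server} cannot help with {name}")


def resolve_iteratively(name, start="root"):
    """Follow referrals from server to server until one returns the address."""
    server, contacted = start, []
    while True:
        contacted.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, contacted
        server = value  # "I don't know this name, but ask this server"


ip, contacted = resolve_iteratively("gaia.cs.umass.edu")
print(ip, contacted)
```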
2: Application Layer 35
DNS name resolution example
recursive query:
• puts burden of name resolution on contacted name server
• heavy load?
[Figure: messages numbered 1-8 among the requesting host cis.poly.edu, the local DNS server dns.poly.edu, the root DNS server, the TLD DNS server, and the authoritative DNS server dns.cs.umass.edu for gaia.cs.umass.edu.]
2: Application Layer 36
DNS: caching and updating records
• once (any) name server learns mapping, it caches mapping
  • cache entries timeout (disappear) after some time
  • TLD servers typically cached in local name servers
    • Thus root name servers not often visited
• update/notify mechanisms under design by IETF
DNS protocol, messages
DNS protocol: query and reply messages, both with same message format
msg header
• identification: 16 bit # for query, reply to query uses same #
• flags:
  • query or reply
  • recursion desired
  • recursion available
  • reply is authoritative
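As a sketch of the header just described, the identification and flag bits can be packed with Python's struct module. The bit positions follow the standard DNS header layout (RFC 1035), which also carries four 16-bit counts after the flags; the helper names are hypothetical.

```python
import struct

# Flag bits within the 16-bit flags field (standard DNS header layout).
QR_REPLY = 0x8000             # 0 = query, 1 = reply
AUTHORITATIVE = 0x0400        # reply is authoritative
RECURSION_DESIRED = 0x0100
RECURSION_AVAILABLE = 0x0080


def pack_header(ident, flags, qdcount=1, ancount=0, nscount=0, arcount=0):
    """Build the 12-byte header: identification, flags, four record counts."""
    return struct.pack("!HHHHHH", ident, flags, qdcount, ancount, nscount, arcount)


def describe_header(header):
    """Decode the identification and the four flags listed on the slide."""
    ident, flags = struct.unpack("!HH", header[:4])
    return {
        "identification": ident,
        "is_reply": bool(flags & QR_REPLY),
        "recursion_desired": bool(flags & RECURSION_DESIRED),
        "recursion_available": bool(flags & RECURSION_AVAILABLE),
        "authoritative": bool(flags & AUTHORITATIVE),
    }


query_header = pack_header(0x1234, RECURSION_DESIRED)
reply_header = pack_header(
    0x1234, QR_REPLY | AUTHORITATIVE | RECURSION_DESIRED, ancount=1
)
print(describe_header(query_header))
print(describe_header(reply_header))  # same identification as the query
```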
2: Application Layer 41
DNS protocol, messages
[Figure: DNS message format, with these annotations on its sections]
• name, type fields for a query
• RRs in response to query
• records for authoritative servers
• additional “helpful” info that may be used
2: Application Layer 42
Inserting records into DNS
Example: just created startup “Network Utopia”
• Register name networkutopia.com at a registrar (e.g., Network Solutions)
  • Need to provide registrar with names and IP addresses of your authoritative name server (primary and secondary)
  • Registrar inserts two RRs into the com TLD server:
      (networkutopia.com, dns1.networkutopia.com, NS)
      (dns1.networkutopia.com, 212.212.212.1, A)
• Put in authoritative server Type A record for www.networkutopia.com and Type MX record for mail.networkutopia.com
• How do people get the IP address of your Web site?
2: Application Layer 43
Exercise
Suppose within your Web browser, you click on a link to obtain a Web page. The IP address for the associated URL is not cached in your local host. Suppose that n DNS servers should be visited before your host receives the IP address. The successive visits incur an RTT of RTT1, RTT2, …, RTTn. Suppose that the base HTML file associated with the link references three very small objects (small pictures) on the same server. Let RTT0 denote the RTT between the local host and the server containing the objects. Neglecting transmission times, how much time elapses with
a) Non-persistent HTTP with no parallel TCP connections?
b) Non-persistent HTTP with parallel connections?
c) Persistent HTTP with pipelining?
2: Application Layer 44
Chapter 2: Application layer
Principles of network applications
• Architecture: client-server or P2P
• Services that an application needs
Important application-level protocols
• FTP, SMTP, P2P, ...
Programming network applications
• socket API
Web stuff
• Web searching
2: Application Layer 45
How Search Engines Work
• Gather the contents of all web pages (using a program called a crawler or spider)
• Organize the contents of the pages in a way that allows efficient retrieval (indexing)
• Take in a query, determine which pages match, and show the results (ranking and display of results)
2: Application Layer 46
Standard Web Search Engine Architecture
[Figure: crawler machines crawl the web, check for duplicates and store the documents; an inverted index is created from the stored documents (by docID) and held on the search engine servers.]
2: Application Layer 47
Standard Web Search Engine Architecture
[Figure: crawler machines crawl the web, check for duplicates and store the documents; an inverted index is created from the stored documents (by docID) and held on the search engine servers, which take in the user query and show results to the user.]
2: Application Layer 48
More detailed architecture, from “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Brin & Page, 1998. http://dbpubs.stanford.edu:8090/pub/1998-8
Slide adapted from Lew & Davis
2: Application Layer
Spiders or crawlers
How to find web pages to visit and copy?
• Can start with a list of domain names, visit the home pages there.
• Look at the hyperlinks on the home page, and follow those links to more pages.
  • Use HTTP commands to GET the pages
• Keep a list of URLs visited, and those still to be visited.
• Each time the program loads in a new HTML page, add the links in that page to the list to be crawled.
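The visited / still-to-visit bookkeeping described above is sketched below on a toy, in-memory "web". TOY_WEB and its example.* URLs are invented for illustration; a real crawler would fetch each page with an HTTP GET and extract its links, and would also have to respect the politeness rules discussed later.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/", "b.example/news"],
    "b.example/news": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: keep a list of visited URLs and a queue to visit."""
    to_visit = deque(seeds)
    seen = set(seeds)  # URLs already queued or visited, to avoid cycles
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in fetch_links(url):  # links found in the newly loaded page
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```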
Slide adapted from Lew & Davis
2: Application Layer
Spider behaviour varies
• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered
2: Application Layer 51
Four Laws of Crawling
• A Crawler must show identification
• A Crawler must obey the robots exclusion standard
    http://www.robotstxt.org/wc/norobots.html
• A Crawler must not hog resources
• A Crawler must report errors
2: Application Layer 52
Lots of tricky aspects
• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
  • The “hidden” web
  • E.g., schedule.berkeley.edu
    • You don’t see the course schedules until you run a query.
• The web is HUGE
2: Application Layer 53
The Internet Is Enormous
Image from http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.html
2: Application Layer 54
“Freshness”
Need to keep checking pages
• Pages change (25%, 7% large changes)
  • At different frequencies
  • Who is the fastest changing?
• Pages are removed
Many search engines cache the pages (store a copy on their own servers)
Slide adapted from Lew & Davis
2: Application Layer
What really gets crawled?
• A small fraction of the Web that search engines know about; no search engine is exhaustive
• Not the “live” Web, but the search engine’s index
• Not the “Deep Web”
Slide adapted from Lew & Davis
2: Application Layer
ii. Index (the database)
Record information about each page
• List of words
  • In the title?
  • How far down in the page?
  • Was the word in boldface?
• URLs of pages pointing to this one
• Anchor text on pages pointing to this one
  • The anchor text summarizes what the website is about.
  • <a href=http://web.njit… > CS 656 </a>
2: Application Layer 57
Inverted Index
How to store the words for fast lookup
Basic steps:
• Make a “dictionary” of all the words in all of the web pages
• For each word, list all the documents it occurs in.
• Often omit very common words
  • “stop words”
• Sometimes stem the words
  • (also called morphological analysis)
  • cats -> cat
  • running -> run
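The basic steps above fit in a few lines. This is a toy sketch: the stop-word list and the two-entry stemming table are assumptions chosen to match the examples on the slide, not a real stemmer.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "and", "of", "is"}  # assumed tiny stop-word list
STEMS = {"cats": "cat", "running": "run"}     # assumed toy stemming table


def build_inverted_index(docs):
    """Map each (stemmed, non-stop) word to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # omit very common words
            index[STEMS.get(word, word)].add(doc_id)
    return dict(index)


docs = {
    1: "the cats and the dog",
    2: "a dog is running",
    3: "cat running",
}
index = build_inverted_index(docs)
print(index)
```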
2: Application Layer 58
Inverted Index Example
Image from http://developer.apple.com/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html
2: Application Layer 59
Inverted Index
• In reality, this index is HUGE
• Need to store the contents across many machines
• Need to do optimization tricks to make lookup fast.
2: Application Layer 60
Query Serving Architecture
• Index divided into segments, each served by a node
• Each row of nodes replicated for query load
• Query integrator distributes query and merges results
• Front end creates an HTML page with the query results
[Figure: a load balancer in front of front ends FE1, FE2, ... FE8 and query integrators QI1, QI2, ... QI8; the query “travel” is passed down to a grid of index nodes Node1,1 ... Node4,N (4 rows of N nodes).]
Slide adapted from Lew & Davis
2: Application Layer
iii. Results ranking
Search engine receives a query, then
• Looks up the words in the index, retrieves many documents, then
• Rank orders the pages and extracts “snippets” or summaries containing query words.
  • Most web search engines assume the user wants all of the words (Boolean AND, not OR).
• These are complex and highly guarded algorithms unique to each search engine.
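The Boolean AND lookup step (before any ranking) amounts to intersecting posting lists, as in this sketch. The sample index and the function name search_all_words are invented for illustration; real engines use far more elaborate data structures and then rank the matches.

```python
def search_all_words(index, query):
    """Return the doc IDs that contain every word of the query (Boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


# Hypothetical inverted index: word -> set of doc IDs.
sample_index = {
    "cheap": {1, 2, 4},
    "flights": {2, 3, 4},
    "travel": {2, 4, 5},
}
print(search_all_words(sample_index, "cheap flights travel"))  # {2, 4}
print(search_all_words(sample_index, "cheap hotels"))          # set()
```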
Slide adapted from Lew & Davis
2: Application Layer
Some ranking criteria
For a given candidate result page, use:
• Number of matching query words in the page
• Proximity of matching words to one another
• Location of terms within the page
• Location of terms within tags e.g. <title>, <h1>, link text, body text
• Anchor text on pages pointing to this one
• Frequency of terms on the page and in general
• Link analysis of which pages point to this one
• (Sometimes) Click-through analysis: how often the page is clicked on
• How “fresh” is the page
Complex formulae combine these together.
2: Application Layer 63
Measuring Importance of Linking
PageRank Algorithm
• Idea: important pages are pointed to by other important pages
• Method:
  • Each link from one page to another is counted as a “vote” for the destination page
  • But the importance of the starting page also influences the importance of the destination page.
  • And those pages’ scores, in turn, depend on those linking to them.
Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188
2: Application Layer 64
Measuring Importance of Linking
Example: each page starts with 100 points.
• Each page’s score is recalculated by adding up the score from each incoming link.
  • This is the score of the linking page divided by the number of outgoing links it has.
  • E.g., the page in green has 2 outgoing links and so its “points” are shared evenly by the 2 pages it links to.
• Keep repeating the score updates until no more changes.
Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188
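The point-sharing procedure in the example can be sketched as a short loop. This implements only the simplified scheme described on the slide (no damping factor, unlike the published PageRank algorithm); the three-page link graph and the function name are made up, and "no more changes" is approximated by a small tolerance.

```python
def simple_link_scores(links, start=100.0, tol=1e-9, max_rounds=1000):
    """Repeatedly share each page's score evenly among the pages it links to.

    links maps every page to the list of pages it links to.
    """
    scores = {page: start for page in links}
    for _ in range(max_rounds):
        new = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # a page with no outgoing links passes on nothing
            share = scores[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        if max(abs(new[p] - scores[p]) for p in links) < tol:
            return new  # scores stopped changing
        scores = new
    return scores


# Hypothetical graph: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
final = simple_link_scores(toy_links)
print({page: round(score, 3) for page, score in final.items()})
```

On this toy graph the scores settle at roughly 120, 60 and 120 points for A, B and C: C collects votes from both other pages and passes its whole score on to A.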