Basic WWW Technologies & Mathematic Background (Chap 2 & 1, Baldi) Wen-Hsiang Lu (盧文祥) Department of Computer Science and Information Engineering, National Cheng Kung University 2006/10/5

Dec 21, 2015


Basic WWW Technologies & Mathematic Background

(Chap 2 & 1, Baldi)

Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2006/10/5


World Wide Web

• The World Wide Web (Web) is a network of information resources.

• The Web relies on three mechanisms to make these resources available:
1. A uniform naming scheme for locating resources on the Web (e.g., URIs).
2. Protocols for access to named resources over the Web (e.g., HTTP).
3. Hypertext for easy navigation among resources (e.g., HTML).


Internet vs. Web

• Internet:
– Internet is a more general term
– Includes the physical aspect of the underlying networks and mechanisms such as email, FTP, HTTP…

• Web:
– Associated with information stored on the Internet
– Also refers to a broader class of networks, e.g., the Web of English Literature

• Both the Internet and the Web are networks


Essential Components of WWW

• Resources (HTML, HyperText Markup Language)
– Conceptual mappings to concrete or abstract entities, which do not change in the short term
– Tagging support for structuring and laying out documents

• Resource identifiers (hyperlinks):
– Strings of characters that represent generalized addresses, which may contain instructions for accessing the identified resource
– http://www.google.com/ is used to identify the Google homepage

• Transfer protocols (HTTP, Hypertext Transfer Protocol)
– Conventions that regulate the communication between a browser (web user agent) and a server


Standard Generalized Markup Language (SGML)

• Based on GML (generalized markup language), developed by IBM in the 1960s

• An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document
– Markup: extra information characterizing the structure of a document

• Gave birth to the extensible markup language (XML), W3C recommendation in 1998


SGML Components

• SGML documents have three parts:
– Declaration: specifies which characters and delimiters may appear in the application
– DTD (Document Type Definition) / style sheet: defines the syntax of markup constructs
– Document instance: the actual text (with the tags) of the document

• More information can be found at http://www.W3.Org/markup/SGML


HTML Background

• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.

• The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.

• HTML standards are maintained by the W3C: http://www.w3.org/MarkUp/


HTML Functionalities

• HTML gives authors the means to:
– Publish online documents with headings, text, tables, lists, photos, etc.
– Include spreadsheets, video clips, sound clips, and other applications directly in their documents
– Link information via hypertext links, at the click of a button
– Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.


Sample Webpage


Sample Webpage: HTML Structure

<HTML>
  <HEAD>
    <TITLE>The title of the webpage</TITLE>
  </HEAD>
  <BODY>
    <P>Body of the webpage
  </BODY>
</HTML>


HTML Structure

• An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)

• The title of the document appears in the head (along with other information about the document)

• The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>


HTML Hyperlink

• <a href="relations/alumni">alumni</a>

• A link is a connection from one Web resource to another

• It has two ends, called anchors, and a direction

• A link starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)
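To make the source/destination idea concrete, here is a minimal sketch that pulls the anchors out of an HTML fragment using only the Python standard library. The class name `AnchorCollector`, the helper `extract_links`, and the base URL `http://www.example.edu/` are illustrative assumptions, not part of the lecture.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (destination URI, anchor text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url   # URI of the page holding the source anchors
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve a relative destination against the source page.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical base URL; the fragment is the one shown on the slide.
links = extract_links('<a href="relations/alumni">alumni</a>', "http://www.example.edu/")
print(links)
```

The relative destination `relations/alumni` only becomes a full URI once it is resolved against the page that contains the link, which is why the sketch carries a base URL around.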


Resource Identifiers

• Uniform Resource Identifiers (URIs) include two overlapping subsets of identifiers:
– URL: Uniform Resource Locator
– URN: Uniform Resource Name


Introduction to URIs

• Every resource available on the Web has an address that may be encoded by a URI

• URIs typically consist of three pieces:
– The naming scheme of the mechanism used to access the resource (HTTP, FTP)
– The name of the machine hosting the resource
– The name of the resource itself, given as a path


URI Example

• http://www.w3.org/TR

• There is a document available via the HTTP protocol

• Residing on the machines hosting www.w3.org

• Accessible via the path "/TR"
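The three pieces of this example URI can be separated programmatically. A small sketch using the standard library's `urllib.parse.urlsplit` (the variable names are only illustrative):

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.w3.org/TR")

scheme = parts.scheme     # naming scheme / access mechanism
host = parts.hostname     # machine hosting the resource
path = parts.path         # name of the resource, given as a path

print(scheme, host, path)
```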


Protocols

• Describe how messages are encoded and exchanged

• Different Layering Architectures

• ISO OSI 7-Layer Architecture

• TCP/IP 4-Layer Architecture


ISO OSI Layering Architecture


TCP/IP Layering Architecture


TCP/IP Layering Architecture

• A simplified model that provides an end-to-end reliable connection

• The network layer
– Hosts drop packets into this layer, which routes them toward the destination
– Promises only best-effort delivery ("try my best")

• The transport layer
– Reliable byte-oriented stream


Hypertext Transfer Protocol (HTTP)

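• An application-layer protocol, carried over a connection-oriented transport (TCP), used to carry WWW traffic between a browser and a server

• One of the application-layer protocols supported by the Internet (it is not itself a transport-layer protocol; it runs on top of TCP)

• HTTP communication is established via a TCP connection, by default to server port 80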
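The relationship between HTTP and TCP can be seen by writing the request by hand. The following is a minimal sketch, not a full client: `build_get_request` and `fetch` are hypothetical helper names, and the sketch ignores redirects, chunked encoding, and HTTPS.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",   # ask the server to close the TCP connection afterwards
        "",
        "",                    # the blank line terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path="/", port=80, timeout=10.0):
    """Open a TCP connection to the server port and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:       # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_get_request("www.w3.org", "/TR").decode("ascii"))
```

Calling `fetch("www.w3.org", "/TR")` would send these bytes over a real TCP connection; it is left uncalled here because it needs network access.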


GET Method in HTTP


Form


Form

<HTML>
<Form action="http://140.116.246.174/cgi-bin/meshdb.cgi" method="post">
[1] Median Eminence (multiple selections allowed):
1. <input type="checkbox" name="Median Eminence" value="分泌"> Secretion (分泌)
2. <input type="checkbox" name="Median Eminence" value="一般"> General (一般)
3. <input type="checkbox" name="Median Eminence" value="王錫崗"> 王錫崗
4. <input type="checkbox" name="Median Eminence" value="垂體"> Pituitary (垂體)
Other: <input type="text" name="Median Eminence">
<input type="submit" value="Confirm">
</Form>
</HTML>
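When such a form is submitted with method=post, the ticked boxes and the text field travel in the request body as name=value pairs that all share the same name. The sketch below shows that encoding with the standard library. The particular selection is made up, and UTF-8 is an assumption: a browser encodes the values in the character set of the page, which may differ.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical submission: two boxes ticked plus free text in the "Other" field.
fields = [
    ("Median Eminence", "分泌"),
    ("Median Eminence", "垂體"),
    ("Median Eminence", "free text"),
]

body = urlencode(fields)    # application/x-www-form-urlencoded, UTF-8 assumed
decoded = parse_qs(body)    # what a server-side script would recover

print(body)
print(decoded)
```

Because every input uses the same name, the server receives a list of values for that one name rather than a single value.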


CGI processing


CGI (Common Gateway Interface)

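[Figure: the Web browser sends a service request to the Web server, which passes it to a CGI program backed by a database; the output of the service processing returns to the browser as the service response.]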
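The request/response cycle in the figure can be sketched as a tiny CGI-style handler. Under the CGI convention, the server passes request metadata in environment variables (`REQUEST_METHOD`, `QUERY_STRING`, `CONTENT_LENGTH`) and the POST body on standard input, and the program writes headers, a blank line, and the body to standard output. The function name `handle_request` and the demo input are assumptions for illustration; the database step is omitted.

```python
import io
from urllib.parse import parse_qs


def handle_request(environ, stdin):
    """Build a CGI response (headers, blank line, body) for one request."""
    if environ.get("REQUEST_METHOD") == "POST":
        # The body is URL-encoded ASCII, so character count equals byte count here.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    params = parse_qs(raw)
    lines = [f"{name} = {', '.join(values)}" for name, values in sorted(params.items())]
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + "\n".join(lines) + "\n"


# Simulated POST request; a real script would pass os.environ and sys.stdin instead.
demo_body = "name=Median+Eminence&choice=1&choice=3"
demo = handle_request(
    {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(demo_body))},
    io.StringIO(demo_body),
)
print(demo)
```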


HTTP Request Processing


GNU Wget


CGI: Get query search-results from Google using Wget


Homework (1)

• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user.

• Homework: Develop a meta-search engine that responds to user queries with combined search results from a few search engines.
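The merging step can be approached in many ways. As one simple possibility (not a required or complete solution), the sketch below interleaves the ranked lists round-robin and drops duplicate URLs. The function name `merge_results` and the sample lists are hypothetical; fetching results from real engines is left out.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave several ranked URL lists round-robin, dropping duplicates."""
    merged, seen = [], set()
    # zip_longest yields the rank-1 results of every engine, then rank-2, and so on.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up result lists standing in for two engines.
engine_a = ["a", "b", "c"]
engine_b = ["b", "d"]
print(merge_results(engine_a, engine_b))
```

Round-robin treats every engine as equally trustworthy; a score-based merge would be a natural refinement.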


Domain Name System

• DNS (Domain Name System): maps domain names to IP addresses

• IPv4:
– IPv4 was initially deployed on January 1, 1983 and is still the most commonly used version.
– 32-bit address, written as four decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255.

• IPv6:
– Revision of IPv4 with 128-bit addresses
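The dotted notation is just a readable spelling of one 32-bit number. A short sketch of the conversion, cross-checked against the standard `ipaddress` module (the helper name `dotted_to_int` is an assumption):

```python
import ipaddress


def dotted_to_int(address):
    """Convert a dotted-decimal IPv4 string to its 32-bit integer value."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet   # each number fills the next 8 bits
    return value


print(dotted_to_int("255.255.255.255") == 2**32 - 1)
print(dotted_to_int("140.116.246.174") == int(ipaddress.IPv4Address("140.116.246.174")))
print(ipaddress.IPv6Address("::1").max_prefixlen)   # address width of IPv6 in bits
```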


Top Level Domains (TLD)

• Top-level domain names: .com, .edu, .gov, and ISO 3166 country codes such as .de, .fr, .it

• There are three types of top-level domains:
– Generic domains were created for use by the Internet public
– Country code domains were created to be used by individual countries
– The .arpa (Address and Routing Parameter Area) domain is designated to be used exclusively for Internet-infrastructure purposes


Server Log Files

• Server Transfer Log: transactions between a browser and server are logged
– IP address and the time of the request
– Method of the request (GET, HEAD, POST…)
– Status code, a response from the server
– Size in bytes of the transaction

• Referrer Log: where the request originated

• Agent Log: browser software (or spider) making the request

• Error Log: requests that resulted in errors (e.g., 404)
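As an illustration of the transfer-log fields listed above, the sketch below parses one line in the widely used Common Log Format. The slides do not say which format the server uses, so the format, the regular expression, and the sample line are all assumptions.

```python
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the fields of one Common Log Format line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size column means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Made-up example line (192.0.2.1 is an address reserved for documentation).
sample = '192.0.2.1 - - [05/Oct/2006:10:15:32 +0800] "GET /index.html HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```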


Server Log Analysis

• Most and least visited web pages

• Entry and exit pages

• Referrals from other sites or search engines

• Which search keywords were used

• How many clicks/page views a page received

• Error reports, like broken links
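Several of these questions reduce to counting. A minimal sketch with `collections.Counter`, using made-up (path, status) pairs in place of a parsed log:

```python
from collections import Counter

# Made-up (path, status) pairs, as they might come out of a transfer log.
requests = [
    ("/index.html", 200),
    ("/papers.html", 200),
    ("/index.html", 200),
    ("/old-page.html", 404),
    ("/index.html", 304),
    ("/old-page.html", 404),
]

# Page views: requests that did not end in an error status.
page_views = Counter(path for path, status in requests if status < 400)
# Broken links: paths that produced "404 Not Found".
broken = Counter(path for path, status in requests if status == 404)

most_visited = page_views.most_common(1)[0]
least_visited = min(page_views.items(), key=lambda item: item[1])

print(most_visited, least_visited, dict(broken))
```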


Server Log Analysis


Search Engines

• According to Pew Internet & American Life Project Report (2002), search engines are the most popular way to locate information online

• About 33 million U.S. Internet users query search engines on a typical day.

• More than 80% have used search engines

• Search Engines are measured by coverage and recency


Web Crawler

• A crawler is a program that picks up a page and follows all the links on that page

• Crawler = Spider

• Types of crawler:
– Breadth First
– Depth First


Breadth First Crawlers

• Use breadth-first search (BFS) algorithm

• Get all links from the starting page, and add them to a queue

• Pick the first link from the queue, get all links on that page, and add them to the queue

• Repeat the above step until the queue is empty
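The steps above can be sketched on a small in-memory link graph, so that no network access is needed. The graph `LINKS`, the function name `bfs_crawl`, and the `seen` set are assumptions added for illustration; the set is not mentioned on the slide, but without it a crawler would loop forever on pages that link back to each other.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["A"],
    "E": [],
}


def bfs_crawl(start, get_links):
    """Visit pages in breadth-first order, fetching each page once."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()          # pick the first link from the queue
        order.append(page)
        for link in get_links(page):    # get all links on that page
            if link not in seen:
                seen.add(link)
                queue.append(link)      # add them to the back of the queue
    return order


print(bfs_crawl("A", lambda page: LINKS.get(page, [])))
```

In a real crawler, `get_links` would download the page and extract its anchors.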


Breadth First Crawlers


Depth First Crawlers

• Use depth first search (DFS) algorithm

• Get the first unvisited link from the start page

• Visit that link and get its first unvisited link

• Repeat the above step until there are no unvisited links

• Go to the next unvisited link in the previous level and repeat the 2nd step
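A recursive sketch of the same procedure on a small in-memory link graph. The graph `LINKS` and the function name `dfs_crawl` are illustrative assumptions, and recursion is only one way to write it: on a deep site an explicit stack would avoid Python's recursion limit.

```python
# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["A"],
    "E": [],
}


def dfs_crawl(start, get_links):
    """Visit pages in depth-first order, fetching each page once."""
    order = []
    seen = set()

    def visit(page):
        seen.add(page)
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                visit(link)     # follow the first unvisited link as deep as possible
        # Returning here backs up to the previous level and tries its next link.

    visit(start)
    return order


print(dfs_crawl("A", lambda page: LINKS.get(page, [])))
```

On this graph the crawl goes A, B, D before backing up to reach C and E, whereas a breadth-first crawl would visit B and C before D.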


Depth First Crawlers


Coverage

• Overlap analysis is used for estimating the size of the indexable web

• $W$: set of webpages

• $W_a$, $W_b$: pages crawled by two independent engines $a$ and $b$

• $P(W_a)$, $P(W_b)$: probabilities that a page was crawled by $a$ or $b$
– $P(W_a) = |W_a| / |W|$
– $P(W_b) = |W_b| / |W|$


Overlap Analysis

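• $P(W_a \cap W_b \mid W_b) = \dfrac{P(W_a \cap W_b)}{P(W_b)} = \dfrac{|W_a \cap W_b|}{|W_b|}$

• If $a$ and $b$ are independent:
– $P(W_a \cap W_b) = P(W_a)\,P(W_b)$
– $P(W_a \cap W_b \mid W_b) = \dfrac{P(W_a)\,P(W_b)}{P(W_b)} = \dfrac{|W_a|}{|W|} \cdot \dfrac{|W_b| / |W|}{|W_b| / |W|} = \dfrac{|W_a|}{|W|} = P(W_a)$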


Overlap Analysis

• Using $|W| = |W_a| / P(W_a)$, with $P(W_a)$ estimated by the overlap fraction $|W_a \cap W_b| / |W_b|$, the researchers found:
– The Web had at least 320 million pages in 1997
– 60% of the Web was covered by six major engines
– Maximum coverage of a single engine was 1/3 of the Web
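The estimate can be traced through with a few lines of arithmetic. The numbers below are invented to keep the arithmetic easy to follow and are not the figures from the study; the function name `estimate_web_size` is likewise an assumption.

```python
def estimate_web_size(size_a, size_b, overlap):
    """Estimate |W| from |Wa|, |Wb| and |Wa ∩ Wb|, assuming independent crawls."""
    if overlap <= 0:
        raise ValueError("the estimate needs a non-empty overlap")
    p_a = overlap / size_b      # P(Wa) estimated by |Wa ∩ Wb| / |Wb|
    return size_a / p_a         # |W| = |Wa| / P(Wa)


# Made-up index sizes, in millions of pages.
web_size = estimate_web_size(size_a=100, size_b=50, overlap=20)
coverage_a = 100 / web_size     # fraction of the Web covered by engine a

print(web_size, coverage_a)
```

If the two crawls are positively correlated (both favour popular pages, say), the overlap is larger than independence predicts and the result underestimates the Web, which is consistent with the "at least" in the first finding.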


How to Improve the Coverage?

• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user.

• Any suggestions?

• Homework: Develop a meta-search engine that responds to user queries with combined search results from a few search engines.


Probability

• Model uncertainty: make inferences about events given observed data

• An event e: a proposition or statement about the world at large
– "the number of Web pages in existence on 1 January 2003 was greater than five billion"

• A probability P(e): can be viewed as a number that reflects our uncertainty about whether e is true or false in the real world, given whatever information we have available.


Learning from a Bayesian Perspective

• A conditional probability P(e | D): represents the degree of belief (Bayesian interpretation of probability), where D is the background information (data) on which our belief is based.

• Bayesian approach: treats probability as a dynamic entity that is updated as more data arrive

– Prior probability: P(e) is your belief in the event e before you see any data

– Posterior probability: P(e | D) reflects your updated belief in event e given the observed data D

– Likelihood: P(D | e) is the probability of the data under the assumption that e is true

$$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$

• How to model P(D | e)?
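A small numeric sketch of how prior, likelihood, and posterior fit together for a single event e. The function name and the numbers are illustrative assumptions; P(D) is expanded over the two cases "e is true" and "e is false".

```python
def posterior(prior, likelihood, likelihood_if_false):
    """P(e|D) from P(e), P(D|e) and P(D|not e) via Bayes' theorem."""
    # P(D) = P(D|e) P(e) + P(D|not e) P(not e)
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Made-up numbers: an undecided prior, and data four times likelier if e is true.
print(posterior(prior=0.5, likelihood=0.8, likelihood_if_false=0.2))
```

With these numbers the belief in e rises from 0.5 before seeing the data to about 0.8 afterwards.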


Standard Probabilistic Distribution

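• Discrete distributions
– Binomial: $P(X = k \mid n, p) = \binom{n}{k}\, p^k (1-p)^{n-k}$
– Multinomial: $P(X_1 = k_1, \ldots, X_m = k_m) = \dfrac{n!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m}$
– Geometric: $P(X = k) = (1-p)^{k-1}\, p$
– Poisson: $P(X = k \mid \lambda) = \dfrac{e^{-\lambda} \lambda^k}{k!}$

• Continuous distributions
– Gaussian: $N(x \mid \mu, \sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / (2\sigma^2)}$
– Exponential: $f(x \mid \lambda) = \lambda e^{-\lambda x}$
– Gamma: $f(x \mid \alpha, \lambda) = \dfrac{\lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$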


Learning from a Bayesian Perspective (cont.)

$$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$

• Take logarithms for easier operations:

$$\log P(e \mid D) = \log P(D \mid e) + \log P(e) - \log P(D)$$

• Obtain more data $D_2$ (second data set):

$$P(e \mid D, D_2) = \frac{P(D_2 \mid e, D)\,P(e \mid D)}{P(D_2 \mid D)}$$
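The sketch below works in log space and shows that updating on D and then on D2 gives the same posterior as updating on all the data at once. The coin setting is a made-up example: e is "the coin lands heads with probability 0.8", the alternative is "the coin is fair", and flips are assumed independent given either hypothesis, so that P(D2 | e, D) = P(D2 | e).

```python
import math


def log_likelihood(flips, p_heads):
    """log P(D | hypothesis) for independent flips ('H'/'T')."""
    return sum(math.log(p_heads if f == "H" else 1.0 - p_heads) for f in flips)


def update(prior, flips, p_if_true=0.8, p_if_false=0.5):
    """Posterior P(e | flips); prior must lie strictly between 0 and 1."""
    log_true = log_likelihood(flips, p_if_true) + math.log(prior)
    log_false = log_likelihood(flips, p_if_false) + math.log(1.0 - prior)
    # log P(D) = log(exp(log_true) + exp(log_false)), computed stably.
    top = max(log_true, log_false)
    log_evidence = top + math.log(math.exp(log_true - top) + math.exp(log_false - top))
    # log P(e|D) = log P(D|e) + log P(e) - log P(D)
    return math.exp(log_true - log_evidence)


d1, d2 = "HHTH", "HHH"
step_by_step = update(update(0.5, d1), d2)   # posterior after D is the prior for D2
all_at_once = update(0.5, d1 + d2)

print(step_by_step, all_at_once)
```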


Parameter Estimation from Data

• Maximum a posteriori (MAP)
– The objective of parameter estimation is to find or approximate the best set of parameters for a model, i.e., to find the set of parameters $\theta$ maximizing the posterior $P(\theta \mid D)$, or $\log P(\theta \mid D)$. This is called maximum a posteriori (MAP) estimation.

– To deal with positive quantities, we can minimize $-\log P(\theta \mid D)$:

$$\mathcal{E}(\theta) = -\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \log P(D)$$

– $P(D)$ plays the role of a normalizing constant and is thus irrelevant for the optimization, i.e., the minimization of

$$\mathcal{E}(\theta) = -\log P(D \mid \theta) - \log P(\theta)$$

– If the prior $P(\theta)$ is uniform over the sample space, then the problem reduces to finding the maximum of $P(D \mid \theta)$, or $\log P(D \mid \theta)$. This is known as maximum likelihood (ML) estimation.

– Simpler ML estimation procedure, i.e., the minimization of

$$\mathcal{E}(\theta) = -\log P(D \mid \theta)$$
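To see the difference between the two objectives, the sketch below minimizes them over a grid of candidate parameters for a coin with unknown heads-probability θ. The coin model, the grid search, and the prior peaked at 0.5 are all assumptions chosen for illustration; they do not come from the slides.

```python
import math


def neg_log_likelihood(theta, heads, tails):
    """-log P(D | theta) for a coin with heads-probability theta."""
    return -(heads * math.log(theta) + tails * math.log(1.0 - theta))


def estimate(heads, tails, log_prior, grid_size=999):
    """Minimise -log P(D|theta) - log P(theta) over a grid of theta values."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]   # 0.001 ... 0.999
    return min(grid, key=lambda t: neg_log_likelihood(t, heads, tails) - log_prior(t))


def uniform(theta):
    return 0.0   # constant log-prior: the MAP objective reduces to the ML objective


def favours_fair(theta):
    # Hypothetical prior peaked at 0.5, left unnormalised because
    # additive constants do not move the minimum.
    return 10.0 * math.log(theta * (1.0 - theta))


ml_estimate = estimate(7, 3, uniform)          # data: 7 heads, 3 tails
map_estimate = estimate(7, 3, favours_fair)

print(ml_estimate, map_estimate)
```

The ML estimate follows the observed frequency of heads, while the MAP estimate is pulled from it toward the value the prior favours.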


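WMMKS Lab

Basic Formula

$$P(x) = \sum_h P(x, h)$$

$$P(x \mid y) = \sum_h P(x, h \mid y)$$

$$P(x \mid y) = \sum_h P(h \mid y)\,P(x \mid y, h)$$

$$P(x, h \mid y) = P(h \mid y)\,P(x \mid y, h)$$

$$P(x \mid y) = \sum_h P(h \mid y)\,P(x \mid h)$$

(The last form holds when $x$ is conditionally independent of $y$ given $h$.)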