
1

Web Basics

Slides adapted from

–Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan

–CS345A, Winter 2009: Data Mining. Stanford University, Anand Rajaraman, Jeffrey D. Ullman

2

Web search

• The Web is so large that finding the needle in the haystack is hard.

• Solutions

– Classification

– Early search engines

– Modern search engines

– Semantic web

– …

3

Early solutions to web search

• Classification of web pages

– Yahoo

– Mostly done by humans. Difficult to scale.

• Paid search ranking: GOTO

– Your search ranking depended on how much you paid

– Auction for keywords: casino was expensive!

• Ranking pages by their relevance to the query

– Early keyword-based engines ca. 1995-1997

– Altavista, Excite, Infoseek, Inktomi, Lycos

– Decide how queries match pages, mostly based on the vector space model

– Most queries match a large number of pages: which page is more authoritative?

4

Ranking of web pages by popularity

• Originated from graph theory and social network analysis

• Jon Kleinberg at IBM developed HITS (Hyperlink-Induced Topic Search) in 1998

• Larry Page and Sergey Brin developed the PageRank algorithm in 1998

– Blew away all early engines save Inktomi

– Great user experience in search of a business model

5

Web search overall picture

[Figure: web search architecture. Labeled components: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User; further labels "links" and "queries". An example result page for the query "miele" shows web results (www.miele.com, www.miele.co.uk, www.miele.de, www.miele.at) alongside sponsored links.]

Sec. 19.4.1

6

Key components in web search

• Links and graph: The web is a hyperlinked document collection, a graph.

• Queries: Web queries are different, more varied and there are a lot of them. How many?

– $10^8$ every day, approaching $10^9$

• Users: Users are different, more varied and there are a lot of them. How many?

– $10^9$

• Documents: Documents are different, more varied and there are a lot of them. How many?

– $10^{11}$. Indexed: $10^{10}$

• Context: Context is more important on the web than in many other IR applications.

• Ads and spam


7

Web as graph

• Web Graph

– Node: web page

– Edge: hyperlink


8

Why web graph

• Example of a large, dynamic and distributed graph

• Possibly similar to other complex graphs in social, biological and other systems

• Reflects how humans organize information (relevance, ranking) and their societies

• Efficient navigation algorithms

• Study behavior of users as they traverse the web graph (e-commerce)


9

In-degree and out-degree

• In-degree: number of in-coming edges of a node

• Out-degree: number of out-going edges of a node

• E.g.,

– Node 8 has in-degree 3 and out-degree 0

– Node 2 has in-degree 2 and out-degree 4

• Degree distribution
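The two definitions above can be checked on a small edge list. The following is a minimal sketch; the toy graph is hypothetical and is not the graph drawn on the slide.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # one more out-going edge for the source
        in_deg[dst] += 1    # one more in-coming edge for the target
    return in_deg, out_deg

# Hypothetical toy graph: each pair is (source page, target page).
edges = [(1, 2), (3, 2), (2, 4), (2, 5), (4, 5)]
in_deg, out_deg = degrees(edges)
print(in_deg[2], out_deg[2])  # 2 2
print(in_deg[5], out_deg[5])  # 2 0
```

A `Counter` returns 0 for nodes it has never seen, so pages with no out-going links need no special handling.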


10

Degree distribution

• The degree distribution p(i) is the fraction of the nodes that have degree i, i.e.

$$p(i) = \frac{\text{number of vertices having degree } i}{\text{total number of vertices}}$$

• The degree of the Web graph obeys a power-law distribution

$$p(i) = A\, i^{-\alpha}$$

– Taking logarithms gives a straight line on a log-log plot:

$$\log(p(i)) = \log(A) - \alpha \log(i)$$

$$y = \log(A) - \alpha x$$

• A study at the University of Notre Dame reported

– α = 2.45 for the out-degree distribution

– α = 2.1 for the in-degree distribution

• Random graphs have a Poisson degree distribution
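A rough way to see the straight-line relation y = log(A) − αx is to compute p(i) from a list of node degrees and fit a least-squares line through the points (log i, log p(i)). The sketch below assumes a plain Python list of degrees; the function names and the sample degree sequence are illustrative only, and a least-squares fit on a log-log plot is a quick check rather than a rigorous estimator of α.

```python
import math
from collections import Counter

def degree_distribution(degree_list):
    """p(i): fraction of nodes whose degree is i."""
    counts = Counter(degree_list)
    n = len(degree_list)
    return {i: c / n for i, c in counts.items()}

def fit_log_log(p):
    """Least-squares line y = log(A) - alpha * x through (log i, log p(i)).
    Returns (A, alpha). Degree 0 is skipped because log(0) is undefined."""
    pts = [(math.log(i), math.log(pi)) for i, pi in p.items() if i > 0 and pi > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope

# Hypothetical degree sequence, one entry per node.
degree_list = [1, 1, 1, 1, 2, 2, 3, 1, 1, 2, 4, 1]
p = degree_distribution(degree_list)
A, alpha = fit_log_log(p)
print(p)
print(A, alpha)
```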


11

Power law plotted

• 500 random numbers are generated, following a power law with xmin = 1, alpha = 2

• Subplots C and D are produced using equal bin size (bin size = 5)

• To remove the noise in the tail of subplot (D), we need to use logarithmic bin sizes

• Subplot (F) shows a straight line as desired.

• You can download the MATLAB program to experiment with power laws
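The experiment described above can also be sketched in Python. This is not the MATLAB program mentioned on the slide: it assumes a continuous power law, draws samples by inverse-transform sampling, and uses bins whose width doubles each time (base 2 is an arbitrary choice).

```python
import random

def sample_power_law(n, xmin=1.0, alpha=2.0, rng=random):
    """Draw n values with density proportional to x**(-alpha) for x >= xmin,
    using inverse-transform sampling: x = xmin * (1 - u)**(-1 / (alpha - 1))."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def log_binned_density(values, xmin=1.0, base=2.0):
    """Histogram with bins [xmin*base**k, xmin*base**(k+1)).
    Returns (bin_centre, density) pairs, where density = count / (n * bin_width)
    and bin_centre is the geometric mean of the bin edges."""
    counts = {}
    for v in values:
        k = 0
        while v >= xmin * base ** (k + 1):
            k += 1
        counts[k] = counts.get(k, 0) + 1
    n = len(values)
    out = []
    for k in sorted(counts):
        lo, hi = xmin * base ** k, xmin * base ** (k + 1)
        out.append(((lo * hi) ** 0.5, counts[k] / (n * (hi - lo))))
    return out

random.seed(0)
xs = sample_power_law(500)   # 500 numbers, xmin = 1, alpha = 2, as on the slide
for centre, density in log_binned_density(xs):
    print(f"{centre:12.2f}  {density:.6f}")
```

Dividing each count by its bin width is what keeps the wide tail bins comparable with the narrow bins near xmin; plotted on log-log axes, the printed pairs should fall close to a straight line.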


12

Power law of web graph in 1999

• Note that the in/out distributions are slightly different

• Out-degree may be better fitted by Mandelbrot law

• What about the current web?

– ClueWeb data consist of 4 billion web pages.


13

Scale-free networks

• A network is scale-free if its degree distribution follows a power law

– Mathematical model behind it: preferential attachment

• Many networks obey a power law

– Internet at the router and inter-domain level

– Citation network/co-author network

– Collaboration network of actors

– Networks associated with metabolic pathways

– Networks formed by interacting genes and proteins

– Web graph

– Microblogs such as Twitter

– Semantic web


14

Other graph properties

– Distance from A to B: the length of the shortest path connecting A to B

– Distance from node 0 to node 9: 1

– Length: the average of the distances between all pairs of nodes

– Diameter: the maximum of the distances

– Strongly connected: for any ordered pair of nodes, there is a directed path from the first to the second

– Clustering coefficient

– Betweenness
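Distance, length and diameter as defined above can be computed with breadth-first search. The sketch below assumes the graph is stored as an adjacency dictionary and, as a simplification, averages only over ordered pairs that are actually connected by a path; the toy graph is hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def length_and_diameter(adj):
    """Average and maximum distance over ordered pairs connected by a path."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    dists = [d for s in nodes for t, d in bfs_distances(adj, s).items() if t != s]
    return sum(dists) / len(dists), max(dists)

# Hypothetical toy directed graph: the 4-cycle 0 -> 1 -> 2 -> 3 -> 0.
adj = {0: [1], 1: [2], 2: [3], 3: [0]}
print(bfs_distances(adj, 0)[3])   # 3
print(length_and_diameter(adj))   # (2.0, 3)
```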


15

Small world

• It is a ‘small world’

– Millions of people, yet separated by “six degrees” of acquaintance relationships

– Popularized by Milgram’s famous experiment (1967)

• Mathematically

– The diameter of the graph is small (on the order of log N) compared with its overall size N

– This is for a fixed average degree: the diameter of a complete graph never grows (it is always 1), but its average degree is not fixed

– This property also holds in random graphs


16

Bow tie structure of Web

• Study of 200 million nodes & 1.5 billion links

– SCC: Strongly connected component (SCC) in the center

– Up Stream: Lots of pages that link to other pages, but don’t get linked to (IN)

– Down stream: Lots of pages that get linked to, but don’t link (OUT)

– Tendrils, tubes, islands

• Small-world property not applicable to entire web

– Some parts unreachable

– Others have long paths

• Power-law connectivity holds though

– Page in-degree (alpha = 2.1), out-degree (alpha = 2.72)
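The bow-tie regions can be recovered from reachability alone. A minimal sketch, assuming we already know one page inside the central SCC (here the hypothetical node "a"): the SCC is whatever is reachable both forwards and backwards from it, IN is the rest of the backward set, and OUT is the rest of the forward set. Tendrils, tubes and islands are lumped together as OTHER.

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(edges, core_node):
    """Split nodes into SCC / IN / OUT / OTHER relative to the strongly
    connected component that contains core_node."""
    fwd, bwd, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
        nodes.update((u, v))
    down = reachable(fwd, core_node)   # pages the core can reach
    up = reachable(bwd, core_node)     # pages that can reach the core
    scc = down & up
    return {"SCC": scc, "IN": up - scc, "OUT": down - scc,
            "OTHER": nodes - up - down}

# Hypothetical toy graph: a core cycle, one upstream page, one downstream
# page, and a disconnected island.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("in1", "a"), ("c", "out1"), ("x", "y")]
parts = bow_tie(edges, "a")
print(parts)
```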


17

Empirical numbers for bow-tie

• Maximal diameter

– 28 for SCC, 500 for entire graph

• Probability of a path between any 2 nodes

– ~1 quarter (0.24)

• Average length

– 16 (directed path exists), 7 (undirected)

• Shortest directed path between 2 nodes in SCC: 16-20 links on average


18

Component properties

• Each component is roughly the same size

– ~50 million nodes

• Tendrils not connected to SCC

– But reachable from IN and can reach OUT

• Tubes: directed paths IN->Tendrils->OUT

• Disconnected components

– Diameter/length is infinite


19

Where we are in web graph

• Distribution of incoming and outgoing connections

• Power law, scale free network

• Small world, diameter and length of the graph

• Web site and distribution of pages per site

• Size of the graph


20

Web site

• Simple estimates suggest billions of nodes

• The distribution of site sizes, measured by the number of pages, follows a power law

– Note that degree distribution also follows power law

• Observed over several orders of magnitude with an exponent α in the 1.6-1.9 range


21

Web Size

• The web keeps growing.

• But growth is no longer exponential?

• Who cares?

– Media, and consequently the user

– Engine design

– Engine crawl policy. Impact on recall.

• What is size?

– Number of web servers/web sites?

– Number of pages?

– Terabytes of data available?

– Size of search engine index?


22

Difficulties in defining the web size

• Some servers are seldom connected.

– Example: Your laptop running a web server

– Is it part of the web?

• The “dynamic” web is infinite.

– Soft 404: www.yahoo.com/<anything> is a valid page

– Dynamic content, e.g., weather forecast, calendar

– Any sum of two numbers is its own dynamic page on Google. Example: “2+4”

• Deep web content

– E.g., all the articles in nytimes.

• Duplicates

– Static web contains syntactic duplication, mostly due to mirroring (~30%)

Sec. 19.5

23

Two sizes (web and search engine index)

• The (relative) sizes of search engines

– The notion of a page being indexed is still reasonably well defined.

– Already there are problems

– Document extension: e.g., engines index pages not yet crawled, by indexing anchor text.

– Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)

[Figure labels: anchor text; bottom of a doc]

24

“Search engine index contains N pages”: Issues

• Can I claim a page is in the index if I only index the first 4000 bytes?

– Usually long documents are not fully indexed. Bottom parts are ignored.

• Can I claim a page is in the index if I only index anchor text pointing to the page?

– E.g., the Apple web site may not contain the keyword ‘computer’, but many anchor texts pointing to Apple contain ‘computer’.

– Hence when people search for ‘computer’, the Apple page may be returned

• There used to be (and still are?) billions of pages that are only indexed by anchor text.


25

Size of search engine

• The statically indexable web is whatever search engines index.

• Large index is not everything

– Different engines have different preferences

– max url depth, max count/host, anti-spam rules, priority rules, etc.

– Different engines index different things under the same URL:

– Frames (e.g., some frames are navigational, should be indexed in a different way)

– meta-keywords, e.g., put more weight on the title

– document restrictions, document extensions, ...


Estimate index size by queries

• Basic idea: send two random queries, then count the number of documents each returns and the number of duplicates between them

• The size can be estimated by MLE (maximum likelihood estimation):

$$N = \frac{n_1 n_2}{d}$$

• $n_i$ is the number of matches of query $i$, where $i = 1, 2$, and $d$ is the number of duplicates between the two sets of matches.

• This is called the capture-recapture method, borrowed from ecology.

• The model can be extended to multiple queries (multiple capture-recapture)

• It is unbiased if the data are homogeneous, i.e., every document has an equal probability of being matched (and returned).
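The estimator N = n1·n2/d is easy to try on simulated data. The sketch below stands in for a real search engine with a hypothetical "index" of 10,000 document ids, and each "query" returns a uniform random subset, which is exactly the homogeneity assumption stated above; real queries would not behave this cleanly.

```python
import random

def capture_recapture(sample1, sample2):
    """Estimate population size as n1 * n2 / d, where d is the overlap."""
    s1, s2 = set(sample1), set(sample2)
    d = len(s1 & s2)
    if d == 0:
        raise ValueError("no duplicates between the two samples; estimate undefined")
    return len(s1) * len(s2) / d

# Small hand-checkable case: n1 = 4, n2 = 4, d = 2, so N = 16 / 2.
print(capture_recapture([1, 2, 3, 4], [3, 4, 5, 6]))  # 8.0

# Simulated index: two hypothetical queries, each matching 1,000 random documents.
random.seed(1)
index = range(10_000)
q1 = random.sample(index, 1_000)
q2 = random.sample(index, 1_000)
estimate = capture_recapture(q1, q2)
print(round(estimate))
```

With only about 100 expected duplicates, the simulated estimate can be off by several percent; larger samples or more queries tighten it.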

27

Biases induced by random query

• Query bias: Large documents have a higher probability of being captured by queries

– Solution 1: produce a uniform sample by some sampling method, e.g., rejection sampling: reject large documents with some probability

– Solution 2: modify the estimator

• Ranking bias: The search engine ranks the matched documents and returns only the top-k documents.

– Try to use queries whose number of matches is commensurate with k.

• Operational problems

– Time-outs, failures, engine inconsistencies, index modification.
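Solution 1 can be illustrated with a small simulation. The sketch assumes a simple bias model that is not stated on the slide: a document is captured with probability proportional to its size. Under that assumption, keeping a captured document with probability min_size / size cancels the bias. The document names and sizes are hypothetical.

```python
import random

def rejection_sample(docs, size_of, min_size, rng=random):
    """Keep each captured document with probability min_size / size.
    This cancels a capture probability proportional to size (assumed model)."""
    return [d for d in docs if rng.random() < min_size / size_of(d)]

random.seed(0)
sizes = {"small": 1, "large": 10}   # hypothetical document sizes
# Biased capture: the large document is ten times as likely to be matched.
captured = random.choices(list(sizes), weights=list(sizes.values()), k=20_000)
kept = rejection_sample(captured, sizes.get, min_size=1)

share_before = captured.count("large") / len(captured)
share_after = kept.count("large") / len(kept)
print(round(share_before, 2), round(share_after, 2))
```

Before rejection the large document dominates the sample; afterwards the two documents appear in roughly equal shares, at the cost of discarding most captures.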


28

Random IP addresses

• Generate random IP addresses

• Find a web server at the given address

– If there’s one

• Collect all pages from server

– From this, choose a page at random


29

Random IP addresses

• Ignored: empty or authorization required or excluded

• [Lawr99] Estimated from observing 2500 servers

– 2.8 million IP addresses running crawlable web servers

– 16 million total servers

– 800 million pages

– Also estimated use of metadata descriptors:

– Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3%

• OCLC using IP sampling found 8.7 M hosts in 2001

• Netcraft [Netc02] accessed 37.2 million hosts in July 2002


Question: estimate social network size

• Some microblog services use a (random) number as the account ID

• E.g., http://weibo.com/2125720833

• Thus we can obtain a random sample and estimate the size, average degree, etc.
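One way to turn random account numbers into a size estimate is to probe uniformly random IDs and scale the hit rate up to the whole ID space. The sketch below is a simulation only: it assumes IDs are spread uniformly over a known range, and the `exists` callback is a hypothetical stand-in for a lookup against the service.

```python
import random

def estimate_population(id_space, exists, probes, rng=random):
    """Probe `probes` uniformly random ids in range(id_space) and scale the
    observed hit rate up to the whole id space."""
    hits = sum(1 for _ in range(probes) if exists(rng.randrange(id_space)))
    return id_space * hits / probes

# Simulated service: 300,000 accounts scattered over 1,000,000 possible ids.
random.seed(0)
accounts = set(random.sample(range(1_000_000), 300_000))
est = estimate_population(1_000_000, accounts.__contains__, probes=5_000)
print(round(est))
```

The same sample of valid accounts could then be used to estimate other quantities, such as the average degree, by averaging over the sampled accounts.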

30




31

Advantages & disadvantages

• Advantages

– Clean statistics

– Independent of crawling strategies

• Disadvantages

– Doesn’t deal with duplication

– Many hosts might share one IP, or not accept requests

– No guarantee all pages are linked to root page.

– E.g., employee pages

– Power law for # pages/hosts generates bias towards sites with few pages.

– But bias can be accurately quantified if the underlying distribution is understood

– Potentially influenced by spamming (multiple IPs for same server to avoid IP block)


32

Random walks

• View the Web as a directed graph

• Build a random walk on this graph

– Start from one or more seed pages

– Follow the links randomly

– There are several strategies to select the link

– Better to follow the less ‘important’ link with higher probability

– Includes various “jump” rules back to visited sites

– Mimic the behavior of a web surfer

– Avoid being stuck in spider traps

– Converges to a stationary distribution (cf. PageRank and Markov chains)

– Time to convergence may be long

• Sample from stationary distribution of walk
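A walk with a jump rule can be sketched on a toy graph. This is one possible variant, not a specific published sampler: links are chosen uniformly, and with probability `jump_prob` (0.15 is an arbitrary choice) or at a dead end the walk jumps back to a uniformly chosen page it has already visited. The graph, including the self-linking spider trap "t", is hypothetical.

```python
import random
from collections import Counter

def random_walk(adj, seed, steps, jump_prob=0.15, rng=random):
    """Walk for `steps` steps and return how often each page was visited.
    At a dead end, or with probability jump_prob, jump to a random page
    that has already been visited; otherwise follow a random out-link."""
    visits = Counter()
    visited = [seed]     # pages discovered so far
    seen = {seed}
    node = seed
    for _ in range(steps):
        visits[node] += 1
        links = adj.get(node, [])
        if not links or rng.random() < jump_prob:
            node = rng.choice(visited)
        else:
            node = rng.choice(links)
        if node not in seen:
            seen.add(node)
            visited.append(node)
    return visits

# "d" is a dead end and "t" links only to itself (a spider trap).
adj = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a", "t"], "d": [], "t": ["t"]}
random.seed(0)
visits = random_walk(adj, "a", steps=10_000)
total = sum(visits.values())
for page, count in visits.most_common():
    print(page, round(count / total, 3))
```

Without the jump rule the walk would stay in "t" forever once it arrived; with it, the visit frequencies give a rough picture of where the walk spends its time.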


33

Advantages & disadvantages

• Advantages

– “Statistically clean” method, at least in theory

– Could work even for infinite web (assuming convergence) under certain metrics.

• Disadvantages

– The web may not be (and in fact is not) connected

– Isolated components cannot be sampled if the seeds are not in those components

– List of seeds is a problem.

– Pages do not all have the same probability of being sampled.

– Subject to link spamming


34

The Web document collection

• Architecture

– No design/co-ordination

– Distributed content creation, linking, democratization of publishing

• Content

– includes truth, lies, obsolete information, contradictions...

• Structure

– Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)…

• Scale

– much larger than previous text collections … but corporate records are catching up

• Growth

– slowed down from initial “volume doubling every few months” but still expanding

• Semantics

– Mostly no semantic descriptions

• Dynamic

– Content can be dynamically generated

35

Documents

• Dynamically generated content (deep web)

– Dynamic pages are generated from scratch when the user requests them, usually from underlying data in a database.

– Example: current status of flight LH 454

– Most (truly) dynamic content is ignored by web spiders.

– It’s too much to index it all.

– Actually, a lot of “static” content is also assembled on the fly (asp, php etc.: headers, date, ads etc)

36



37

Users

• Use short queries (average < 3 words)

• Rarely use operators

• Don’t want to spend a lot of time on composing a query

• Only look at the first couple of results

• Want a simple UI, not a search engine start page overloaded with graphics

• Extreme variability in terms of user needs, user expectations, experience, knowledge, . . .

– Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class

• One interface for hugely divergent needs


38

User’s evaluation on search engines

• Classic IR relevance (as measured by F, or precision and recall) can also be used for web IR.

• Equally important: Trust, duplicate elimination, readability, loads fast, no pop-ups

• On the web, precision is more important than recall.

– Precision at 1, precision at 10, precision on the first 2-3 pages

– But there is a subset of queries where recall matters.
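Precision at k is simply the fraction of the top k results that are relevant. A minimal sketch, with hypothetical document ids and relevance judgments:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall(ranked, relevant):
    """Fraction of all relevant documents that appear anywhere in the ranking."""
    return sum(1 for doc in relevant if doc in ranked) / len(relevant)

# Hypothetical ranking returned for one query, and the set judged relevant.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d4", "d9", "d42"}

print(precision_at_k(ranked, relevant, 1))   # 1.0
print(precision_at_k(ranked, relevant, 5))   # 0.6
print(precision_at_k(ranked, relevant, 10))  # 0.4
print(recall(ranked, relevant))              # 0.8
```

Precision at 1 looks only at the top hit, which is why it suits the way most web users read result pages; recall needs the full set of relevant documents, which is rarely known on the web.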


39

Users’ empirical evaluation of engines

• Relevance and validity of results

• UI – Simple, no clutter, error tolerant

• Trust – Results are objective

• Coverage of topics for polysemic queries

• Pre/Post process tools provided

– Mitigate user errors (auto spell check, search assist,…)

– Explicit: Search within results, more like this, refine ...

– Anticipative: related searches

• Deal with idiosyncrasies

– Web specific vocabulary

– Impact on stemming, spell-check, etc

– Web addresses typed in the search box

• “The first, the last, the best and the worst …”


40

Queries

• Queries have a power law distribution

– Power law again!

• Same here: a few very frequent queries, a large number of very rare queries

• Examples of rare queries: search for names, towns, books etc


41

Types of queries

• Informational user needs: I need information on something. (~40% / 65%)

– “web service”, “information retrieval”

• Navigational user needs: I want to go to this web site. (~25% / 15%)

– “hotmail”, “myspace”, “United Airlines”

• Transactional user needs: I want to make a transaction. (~35% / 20%)

– Buy something: “MacBook Air”

– Download something: “Acrobat Reader”

– Chat with someone: “live soccer chat”

• Gray areas

– Find a good hub

– Exploratory search “see what’s there”

• Difficult problem: How can the search engine tell what the user need or intent for a particular query is?


42

How far do people look for results?

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)


http://www.iprospect.com/

43

