Page 1

Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 6

April 13, 2005

http://www.ee.technion.ac.il/courses/049011

Page 2

Web Structure II: Bipartite Cores and Bow-Tie Structure

Page 3

Outline

- Bipartite cores
- The copying model
- Bow-tie structure of the web

Page 4

Web as a Social Network

Small-world network:
- Low (average) diameter
- High clustering coefficient: many of a node v’s neighbors are neighbors of each other

Reason: the web is built of communities.

Page 5

Cyber Communities

Cyber community:
- A group of people sharing a common interest
- The web pages authored/cited by these people

Examples:
- Israeli student organizations in the United States
- Large automobile manufacturers
- Oil spills off the coast of Japan
- Britney Spears fans

Page 6

Structure of Cyber Communities [Kumar et al, 1999]

Hubs: resource pages about the community’s shared interest. Examples:
- Directory of Israeli Student Organizations in the US
- Yahoo! Autos
- Oil spills near Japan: bookmarks
- Donna’s Britney Spears links

Authorities: central pages on the community’s shared interest. Examples:
- ISO: Stanford’s Israeli Student Organization
- Mazda.com
- Britney Spears: The official site

Page 7

Dense Bipartite Subgraphs

Hubs:
- Cite many authorities
- Have overlapping citations

Authorities:
- Cited by many hubs
- Frequently co-cited

Therefore: a cyber community is characterized by a dense directed bipartite subgraph, with hubs on one side and authorities on the other.

Page 8

Bipartite Cores

(i,j)-bipartite core (H’,A’):
- H’: a subset of H of size i
- A’: a subset of A of size j
- The subgraph induced on (H’,A’) is a complete bipartite graph: every hub in H’ links to every authority in A’

Hypothesis: “most” dense bipartite subgraphs of the web contain cores.

Therefore: bipartite cores are footprints of cyber communities.

Page 9

Finding Cyber Communities

Bipartite cores can be found efficiently from a crawl:
- A few one-pass scans of the data
- A few sorts

The web is rife with cyber communities:
- About 200K disjoint (3,*)-cores in a 1996 crawl
- The crawl had ~200M pages
- A random graph of this size is not likely to have even a single (3,3)-core!
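As a rough illustration only (the names and the in-memory representation are mine; the actual trawling algorithm of Kumar et al runs as disk-based scans and sorts), here is a Python sketch of the pruning idea behind core extraction: a page that points to fewer than j pages can never be the hub of an (i,j)-core, and a page cited by fewer than i pages can never be an authority, so such pages can be deleted repeatedly until a fixed point is reached.

    from collections import defaultdict

    def prune_for_cores(edges, i, j):
        """Repeatedly delete nodes that cannot belong to any (i,j)-core:
        the hub side needs out-degree >= j, the authority side in-degree >= i."""
        out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
        for u, v in edges:
            out_nbrs[u].add(v)
            in_nbrs[v].add(u)
        changed = True
        while changed:
            changed = False
            for u in [u for u in out_nbrs if len(out_nbrs[u]) < j]:
                for v in out_nbrs.pop(u):      # u cannot be a hub: drop its links
                    in_nbrs[v].discard(u)
                changed = True
            for v in [v for v in in_nbrs if len(in_nbrs[v]) < i]:
                for u in in_nbrs.pop(v):       # v cannot be an authority
                    out_nbrs[u].discard(v)
                changed = True
        return out_nbrs, in_nbrs               # the surviving, much smaller graph

What survives the pruning is small enough that candidate cores can be enumerated directly.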

Page 10

The Copying Model [Kleinberg et al 1999] [Kumar et al 2000]

Initialization: a single node.

Evolution: at every step, a new node v is added to the graph; v connects to d out-neighbors, chosen as follows:
- Prototype selection: v chooses a random node u from the graph.
- Bernoulli copying: for each i = 1,…,d, v tosses a coin with heads probability α. If the coin is heads, v connects to a random node; if the coin is tails, v connects to the i-th out-neighbor of u.
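A minimal Python sketch of this evolution (the d self-loops on the initial node are my own bootstrapping choice, since the slides do not say how the first node's out-links are defined):

    import random

    def copying_model(n, d, alpha, rng=random.Random(0)):
        """Simulate the copying model; returns out[v] = v's d out-neighbors."""
        out = {0: [0] * d}                # initial node, bootstrapped with self-loops
        for v in range(1, n):
            u = rng.randrange(v)          # prototype: uniform over existing nodes
            links = []
            for i in range(d):
                if rng.random() < alpha:  # heads: a uniformly random node
                    links.append(rng.randrange(v))
                else:                     # tails: copy u's i-th out-link
                    links.append(out[u][i])
            out[v] = links
        return out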

Page 11

The Copying Model: Motivation

- When a new page is created, its author has some “topic” in mind.
- The author chooses links from a “prototype” u about the topic.
- The author introduces his own spin on the topic by linking to new “random” pages.

Page 12

The Copying Model: Degree Distribution

If α = 0, then the i-th neighbor of v is u with probability indeg(u) / Σ_w indeg(w):
- Identical to the preferential attachment model.
- In the limit, the fraction of pages with in-degree k is proportional to 1/k^2.

For arbitrary α, the fraction of pages with in-degree k is proportional to 1/k^((2-α)/(1-α)) (similar analysis).
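Reusing the copying_model sketch above, one can check the predicted exponent empirically (the parameter values here are my own illustrative choices):

    from collections import Counter

    out = copying_model(n=200_000, d=7, alpha=0.5)
    indeg = Counter(w for links in out.values() for w in links)
    hist = Counter(indeg.values())                 # k -> # of pages with in-degree k
    print("predicted exponent:", (2 - 0.5) / (1 - 0.5))   # (2-alpha)/(1-alpha) = 3.0
    for k in (1, 2, 4, 8, 16):
        print(k, hist.get(k, 0))                   # counts should fall roughly like k**-3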

Page 13

Erdős-Rényi Random Graph: Bipartite Cores

G(n,p) with p = d/n.

Fix any A, B in G(n,p) with |A| = i, |B| = j.

Probability that A, B form a complete bipartite graph: p^(i·j) = (d/n)^(i·j).

Number of such pairs A, B: C(n,i) · C(n,j) ≤ n^(i+j).

Expected number of (i,j)-bipartite cores: at most n^(i+j) · (d/n)^(i·j), which already vanishes for i = j = 3 when d is a constant.
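Plugging in numbers comparable to the crawl mentioned earlier (my own parameter choices) makes the contrast with the observed 200K cores stark:

    from math import comb

    n, d, i, j = 200_000_000, 10, 3, 3
    expected = comb(n, i) * comb(n, j) * (d / n) ** (i * j)
    print(f"{expected:.1e}")   # ~3.5e-18: not even one (3,3)-core expected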

Page 14

The Copying Model: Bipartite Cores

Consider the graph after n steps.

Theorem: for any i < log n, the expected number of (i,d)-bipartite cores is Ω(n/c^i) (c is a constant defined on the next slide).

Definition: v is a duplicator of u if it copies all of its neighbors from the prototype u.

Observation: if v1,…,vi are duplicators of u, then v1,…,vi together with their d common neighbors form an (i,d)-bipartite core.

Page 15

The Copying Model: Bipartite Cores (cont.)

Lemma: w.h.p., almost all of the first O(n/c^i) nodes added to the graph have at least i duplicators.

Probability that a new node v is a duplicator (all d coins come up tails): (1-α)^d.

Define: c = 2^(1/(1-α)^d).

Let u be any node born at some step t = O(n/c^i). The probability that a node v born at step t' > t chooses u as its prototype is 1/(t'-1).

Page 16

The Copying Model: Bipartite Cores (cont.)

Split steps t+1,…,n into O(log(n/t)) “epochs”: (t,2t], (2t,4t], (4t,8t], …, (n/2,n].

Probability that at least one node in the first epoch chooses u as a prototype: at least 1 - (1 - 1/(2t))^t ≥ 1 - e^(-1/2), a constant.

The same holds for each of the other epochs, so the expected number of duplicators of u is at least Ω(log(n/t) · (1-α)^d) = Ω(i) for t = O(n/c^i), by the choice of c.

The number of duplicators is sharply concentrated about its mean.
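A back-of-the-envelope check of the lemma's arithmetic (the parameter values are mine, and constants are ignored, as in the slide): the number of epochs times the duplication probability is exactly i, so the expected number of duplicators grows linearly with i.

    from math import exp, log2

    alpha, d, i, n = 0.5, 3, 2, 10**9
    p_dup = (1 - alpha) ** d                 # P(new node is a duplicator) = 0.125
    c = 2 ** (1 / p_dup)                     # c = 2^(1/(1-alpha)^d) = 256
    t = n / c**i                             # a node born this early...
    epochs = log2(n / t)                     # ...sees i/(1-alpha)^d = 16 epochs
    print(epochs * (1 - exp(-0.5)) * p_dup)  # ~0.39*i expected duplicators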

Page 17

Bow-Tie Structure of the Web [Broder et al 2000]

Page 18

Random Sampling of Web Pages

Page 19

Outline

- Problem definition
- Random sampling of web pages according to their PageRank
- Uniform sampling of web pages (Henzinger et al)
- Uniform sampling of web pages (Bar-Yossef et al)

Page 20

Random Sampling of Web Pages

W = a snapshot of the “indexable web”; consider only “static” HTML web pages.

π = a probability distribution over W.

Goal: design an efficient algorithm for generating samples from W distributed according to π.

Our focus:
- π = PageRank
- π = Uniform

Page 21

Random Sampling of Web Pages: Motivation

Compute statistics about the web:
- Ex: What fraction of web pages belong to .il?
- Ex: What fraction of web pages are written in Chinese?
- Ex: What fraction of hyperlinks are advertisements?

Compare coverage of search engines:
- Ex: Is Google larger than MSN?
- Ex: What is the overlap between Google and Yahoo?

Data mining of the web:
- Ex: How frequently do computer science pages cite biology pages?
- Ex: How are pages distributed by topic?

Page 22

Random Sampling of Web Pages: Challenges

Naïve solution: crawl, index, sample. But:
- Crawls cannot get complete coverage
- The web is constantly changing
- Crawling is slow and expensive

Our goals:
- Accuracy: generate samples from a snapshot of the entire indexable web
- Speed: samples should be generated quickly
- Low cost: the sampling procedure should run on a desktop PC

Page 23

A Random Walk Approach

- Design a random walk on W whose stationary distribution is π. P = the random walk’s probability transition matrix; we need πP = π.
- Run the random walk for sufficiently many steps. Recall: for any initial distribution q, qP^t → π as t → ∞. Mixing time: the number of steps required to get close to the limit.
- Use the reached node as a sample.
- Repeat for as many samples as needed.
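In code, the approach is just the following loop (a schematic sketch; `step` stands for whatever transition rule realizes the matrix P):

    import random

    def random_walk_sample(start, step, mix_steps, rng=random.Random()):
        """Run the walk for mix_steps transitions and return the final node;
        past the mixing time it is distributed ~ the stationary distribution."""
        v = start
        for _ in range(mix_steps):
            v = step(v, rng)
        return v

The rest of the lecture is essentially about choosing `step` (so the walk converges to π) and `mix_steps` (the mixing time).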

Page 24

A Random Walk Approach: Advantages & Issues

Advantages:
- Accuracy: the random walk can potentially visit every page on the web
- Speed: no need to scan the whole web
- Low cost: no need for large storage or multiple processors

Issues:
- How to design the random walk so it converges to π?
- How to analyze the mixing time of the random walk?

Page 25

PageRank Sampling [Henzinger et al 1999]

Use the “random surfer” random walk: start at some initial node v0. When visiting a page v:
- Toss a coin with heads probability ε (the PageRank reset parameter)
- If the coin is heads, go to a uniformly chosen page
- If the coin is tails, go to a random out-neighbor of v

Limit distribution: PageRank. Mixing time: fast (we will see later).
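One surfer step in Python (a sketch under assumptions: the link structure is a dict `out_links`, and `all_pages` is a list of all page ids, which is precisely what is not available in practice, as the next slide discusses; the dead-end check is my addition):

    import random

    def surfer_step(v, out_links, all_pages, eps, rng=random.Random()):
        """One step of the PageRank random surfer."""
        if rng.random() < eps or not out_links[v]:   # reset coin (or a dead end)
            return rng.choice(all_pages)             # jump to a uniform page
        return rng.choice(out_links[v])              # follow a random out-link

Plugged into the random_walk_sample loop above, this yields samples distributed approximately according to PageRank.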

Page 26

PageRank Sampling: Reality

Problem: how to pick a page uniformly at random?

Solutions:
- Jump to a random page from the history of the walk. Creates a bias towards dense web sites.
- Pick a random host from the hosts in the walk’s history, and jump to a random page from the pages visited on that host. Not converging to PageRank anymore, but experiments indicate it is still fine.

Page 27

Uniform Sampling via PageRank Sampling [Henzinger et al 2000]

Sampling algorithm:
1. Use the previous random walk to generate a sample w according to the PageRank distribution.
2. Toss a coin with heads probability proportional to 1/PR(w), e.g. c/(|W|·PR(w)) for a constant c ≤ |W|·min_w PR(w).
3. If the coin is heads, output w as a sample.
4. If the coin is tails, go to step 1.

Analysis: each round outputs any fixed page w with probability c/|W|, i.e., uniformly; the overall acceptance probability per round is c, so about 1/c rounds are needed per sample.
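A schematic sketch of this accept/reject loop (here `pagerank_sample` stands for the walk above and `pr_estimate` for whatever estimate of PR(w) is available; both are assumptions, and the next slide explains why `pr_estimate` is the hard part):

    import random

    def uniform_sample(pagerank_sample, pr_estimate, W_size, c, rng=random.Random()):
        """Rejection sampling: PageRank-distributed samples -> uniform samples.
        Requires c <= W_size * min_w PR(w), so the coin probability is <= 1."""
        while True:
            w = pagerank_sample()                         # w ~ PageRank
            if rng.random() < c / (W_size * pr_estimate(w)):
                return w                                  # accepted: w ~ Uniform(W)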

Page 28

Uniform Sampling via PageRank Sampling: Reality

How to estimate PR(w)?
- Use the random walk itself: VR(w) = visit ratio of w (# of times w was visited by the walk, divided by the length of the walk). The approximation is very crude.
- Use the subgraph spanned by the visited nodes to compute PageRank. Biased towards the neighborhood of the initial page.
- Use Google.

Page 29

Uniform Sampling by RW on Regular Graphs [Bar-Yossef et al 2000]

Fact: a random walk on an undirected, connected, non-bipartite, regular graph converges to the uniform distribution.

Proof: let P be the random walk’s probability transition matrix.
- P is stochastic, so 1 is a right eigenvector with eigenvalue 1: P1 = 1.
- The graph is connected, so the RW is irreducible.
- The graph is non-bipartite, so the RW is aperiodic.
- Hence the RW is ergodic, and thus has a stationary distribution π: a left eigenvector of P with eigenvalue 1: πP = π.

Page 30

Random Walks on Regular Graphs

Proof (cont.): let d be the graph’s degree and A its adjacency matrix (symmetric, because the graph is undirected).
- P = (1/d)·A, hence P is symmetric too.
- Its left eigenvectors and right eigenvectors are therefore the same, so π = (1/n)·1: the uniform distribution.
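A minimal numerical illustration of the fact, using an odd cycle (2-regular, connected, non-bipartite):

    import numpy as np

    n = 5                                      # a 5-cycle
    A = np.zeros((n, n))
    for v in range(n):
        A[v, (v - 1) % n] = A[v, (v + 1) % n] = 1
    P = A / 2                                  # P = (1/d) A with d = 2
    q = np.array([1.0, 0, 0, 0, 0])            # start concentrated at node 0
    for _ in range(200):
        q = q @ P                              # iterate q <- qP
    print(q)                                   # -> [0.2 0.2 0.2 0.2 0.2]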

Page 31

Web as a Regular Graph

Problems:
- The web is not connected
- The web is directed
- The web is non-regular

Solutions:
- Focus on the indexable web, which is connected
- Ignore the directions of links
- Add a weighted self-loop to each node: weight(w) = deg_max - deg(w), so all pages then have degree deg_max. An overestimate of deg_max doesn’t hurt.
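One step of the resulting degree-regularized walk, as a sketch (assuming the undirected neighbor lists are available as a dict `nbrs`):

    import random

    def regular_step(v, nbrs, deg_max, rng=random.Random()):
        """Each of v's deg(v) real edges is taken with probability 1/deg_max;
        the weighted self-loop absorbs the remaining probability mass."""
        if rng.random() < len(nbrs[v]) / deg_max:
            return rng.choice(nbrs[v])   # follow a real (undirected) edge
        return v                         # self-loop: stay at v

This also shows why an overestimate of deg_max is harmless: it only makes the self-loop heavier, i.e., the walk lazier, without changing the stationary distribution.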

Page 32

Mixing Time Analysis

Theorem: the mixing time of the random walk is O(log(|W|) / (1 - λ2)), where 1 - λ2 is the spectral gap of P (λ2 is the second-largest eigenvalue).

Experiment (over a large web crawl):
- 1 - λ2 ≈ 1/100,000
- log(|W|) ≈ 34

Hence: mixing time ≈ 3.4 million steps. But self-loop steps are free, and only about 1 in 30,000 steps is not a self-loop step (deg_max ≈ 300,000, deg_avg ≈ 10). Actual mixing time: ≈ 115 steps!
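The arithmetic, spelled out (numbers from the slide):

    gap = 1 / 100_000                  # spectral gap 1 - lambda_2
    logW = 34                          # log |W|
    mix = logW / gap                   # total walk steps
    real_frac = 10 / 300_000           # deg_avg / deg_max = non-self-loop fraction
    print(mix, mix * real_frac)        # 3400000.0 total, ~113 real steps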

Page 33

Random Walks on Regular Graphs: Reality

How to get incoming links?
- Search engines: potential bias towards the search engine’s index; do not provide the full list of in-links; costly communication.
- The random walk’s history: important for avoiding dead ends; requires storage.

How to estimate deg(w)?
- Solution: run the random walk on the subgraph of W spanned by the available links.
- The subgraph may no longer have the good mixing-time properties.

Page 34

Top 20 Internet Domains (Summer 2003)

[Bar chart; image not preserved. Domains shown: .com, .org, .net, .edu, .de, .uk, .au, .us, .es, .jp, .ca, .nl, .it, .ch, .pl, .il, .nz, .gov, .info, .mx. The largest bar is 51.15%; the other legible values are 10.36%, 9.19%, 5.57%, 4.15%, 3.01%, and 0.61%.]

Page 35

Search Engine Coverage (Summer 2000)

[Bar chart; image not preserved. Coverage per engine, reading the bars in order: Google 68%, AltaVista 54%, Fast 50%, Lycos 50%, HotBot 48%, Go 38%.]

Page 36

End of Lecture 6