Estimating Clustering Coefficients and Size of Social Networks via Random Walk Stephen J. Hardiman* Capital Fund Management France Liran Katzir Advanced Technology Labs Microsoft Research, Israel *Research was conducted while the author was unaffiliated

May 12, 2020

Page 1:

Estimating Clustering Coefficients and Size of Social Networks via Random Walk

Stephen J. Hardiman*

Capital Fund Management

France

Liran Katzir

Advanced Technology Labs Microsoft Research, Israel

*Research was conducted while the author was unaffiliated

Page 2:

Motivation: Social Networks

Facebook Twitter Qzone Google+

Sina Weibo

Habbo Renren

LinkedIn Vkontakte

Bebo

Tagged Orkut

Netlog

Friendster hi5

Flixster

MyLife Classmates.com

Sonico.com

Plaxo

Page 3:

Motivation: External access

[Figure: example graph on nodes v1 to v9]

Social Analytics

The online social network

Disk Space

Communication

Privacy

Page 4:

Task: Estimate parameters

Business development / advertisement / market size.

Predicting Social Products’ Potential.

Global Clustering Coefficient

Network Average CC

Number of Registered Users

Page 5:

Global Clustering Coefficient

$\text{Global CC} = \dfrac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$

[Figure: example graph on nodes v1 to v9, with one triangle and one connected triplet highlighted]
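The definition can be checked by brute force on a small graph. The sketch below is only illustrative: the 5-node `GRAPH` and the function name `global_cc` are assumptions made here, not the graph drawn on the slide. It counts every connected triplet by its center node and notes that each triangle closes exactly three of them.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 5-node graph (not the slide's graph): one triangle b-c-d
# plus two pendant nodes a and e.
GRAPH = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def global_cc(adj):
    """Return 3 * triangles / connected triplets as an exact fraction."""
    triplets = 0  # paths u - center - w
    closed = 0    # triplets whose end nodes are adjacent; equals 3 * triangles
    for center, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            triplets += 1
            if w in adj[u]:
                closed += 1
    return Fraction(closed, triplets)


print(global_cc(GRAPH))  # 3/7
```

Here the single triangle is counted once from each of its three corners, and the graph has seven connected triplets in total.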

Page 6:

Global Clustering Coefficient

Exact: [Alon et al, 1997]

Estimation - input is read at least once:

• Random Access: [Avron, 2010]

• Streaming Model: [Buriol et al, 2006]

Estimation - sampling:

• Random Access: [Schank et al, 2005]

• External Access: This work.

Page 7:

Local Clustering Coefficient

$C_i = \dfrac{\#\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$

$d_i$: degree of node i

[Figure: example graph on nodes v1 to v9]

$d_1 = 1, \quad d_9 = 2, \quad d_2 = 3, \quad C_2 = 1/3$

Network Average CC = average local CC
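A minimal sketch of the local coefficient and its network average, again on a hypothetical 5-node graph rather than the one on the slide. The names `local_cc` and `network_average_cc` are made up for illustration, and returning 0 for nodes of degree below 2 is an assumed convention (the formula divides by zero there).

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 5-node graph (not the slide's graph).
GRAPH = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def local_cc(adj, v):
    """C_i = links among v's neighbors / (d_i * (d_i - 1) / 2)."""
    d = len(adj[v])
    if d < 2:
        return Fraction(0)  # assumed convention for degree 0 or 1
    links = sum(1 for u, w in combinations(sorted(adj[v]), 2) if w in adj[u])
    return Fraction(links, d * (d - 1) // 2)


def network_average_cc(adj):
    """Average of the local coefficients over all nodes."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)


print(local_cc(GRAPH, "b"))       # 1/3
print(network_average_cc(GRAPH))  # 1/3
```

Node b has three neighbors with one link among them, which gives 1/3, the same value the slide reports for its own node v2.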

Page 8:

Network Average CC

Exact: Naïve.

Estimation - input is read at least once:

• Streaming Model: [Becchetti et al, 2010]

Estimation - sampling:

• Random Access: [Schank et al, 2005]

• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work - Improved accuracy.

Page 9:

Number of Registered Users

Exact: trivial

Estimation - sampling:

• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work - Improved accuracy.

Page 10:

Random Walk

[Figure: example graph on nodes v1 to v9, each node labeled with its stationary probability; the labels are 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22]

Sampled Nodes: v1 v2 v3 v4 v5

Stationary Distribution $= \dfrac{d_i}{\sum_{j} d_j}$
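The degree-proportional stationary distribution can be verified directly: pushing it through one step of the simple random walk must return the same distribution. The sketch below does this exactly with fractions. The 5-node `GRAPH`, the helper names, and the seeded `random.Random` generator are all assumptions for illustration.

```python
import random
from fractions import Fraction

# Hypothetical 5-node graph (not the slide's graph), as adjacency lists.
GRAPH = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}


def stationary(adj):
    """Probability of each node: its degree divided by the sum of degrees."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {v: Fraction(len(nbrs), total) for v, nbrs in adj.items()}


def one_step(dist, adj):
    """Distribution after one walk step: move to a uniformly chosen neighbor."""
    out = {v: Fraction(0) for v in adj}
    for v, p in dist.items():
        for u in adj[v]:
            out[u] += p / len(adj[v])
    return out


def random_walk(adj, start, steps, rng):
    """Return the list of visited nodes, including the start node."""
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


pi = stationary(GRAPH)
print(one_step(pi, GRAPH) == pi)  # True
print(random_walk(GRAPH, "a", 5, random.Random(0)))
```

Only a node's neighbor list is needed at each step, which is the kind of access an external crawler has.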

Page 11:

Random Walk - Summary

[Figure: example graph on nodes v1 to v9. Legend: Visible Nodes, Invisible Nodes, Sampled Nodes, Visible Edges, Invisible Edges]

Page 12:

Global CC Algorithm

1. $\Psi_g$: the average of (degree - 1) over the sampled nodes.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.

2. $\Phi_g$: the average of $\phi_k d_k$ over the sampled nodes.

The estimated global clustering coefficient:

$\hat{c}_g = \dfrac{\Phi_g}{\Psi_g}$

$\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle

Page 13:

Global CC Example

[Figure: example graph showing nodes v1 to v7 and the walk]

$\phi_2 = 0, \quad \phi_3 = 1, \quad \phi_4 = 0$

$\Phi_g = \dfrac{1}{3}(0 + 2 + 0) = \dfrac{2}{3}$

$\Psi_g = \dfrac{1}{5}(0 + 2 + 1 + 3 + 1) = \dfrac{7}{5}$

$\hat{c}_g = \dfrac{2}{3} \cdot \dfrac{5}{7} = \dfrac{10}{21} \approx 0.476$

$c_g = \dfrac{9}{23} \approx 0.39$
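The arithmetic of this example can be reproduced in a few lines. The sketch rests on stated assumptions: the slide does not list its edges, so the 7-node `GRAPH` below is a hypothetical graph chosen so that the walk v1, v2, v3, v4, v5 sees degrees 1, 3, 2, 4, 2 and only v2, v3, v4 form a triangle. The walk order itself is also assumed. Following the example's 1/3 and 1/5 factors, the triangle term is averaged over the r - 2 interior samples and the degree term over all r samples.

```python
from fractions import Fraction

# Hypothetical edge set (an assumption; the slide's edges are not recoverable).
GRAPH = {
    "v1": {"v2"},
    "v2": {"v1", "v3", "v4"},
    "v3": {"v2", "v4"},
    "v4": {"v2", "v3", "v5", "v6"},
    "v5": {"v4", "v7"},
    "v6": {"v4"},
    "v7": {"v5"},
}


def estimate_global_cc(adj, walk):
    """Ratio of the two sample averages used by the global CC estimator."""
    r = len(walk)
    # Psi_g: average of (degree - 1) over all r samples.
    psi = Fraction(sum(len(adj[x]) - 1 for x in walk), r)
    # Phi_g: average of phi_k * d_k over the interior samples, where phi_k = 1
    # when the previous and next nodes of the walk are adjacent.
    inner = range(1, r - 1)
    phi_sum = sum(
        len(adj[walk[k]]) for k in inner if walk[k + 1] in adj[walk[k - 1]]
    )
    phi = Fraction(phi_sum, len(inner))
    return phi / psi


walk = ["v1", "v2", "v3", "v4", "v5"]
print(estimate_global_cc(GRAPH, walk))  # 10/21
```

When the walk steps back to the node it came from, the previous and next nodes coincide, so the adjacency test fails and the term contributes 0, as long as the graph has no self-loops.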

Page 14:

Expectation of $\phi_k$

$l_i$: the number of triangles containing $v_i$.

$d_i$: the degree of node $v_i$.

$n$: the number of nodes.

$D = \sum_{i=1}^{n} d_i$

Total expectation:

$E[\phi_k d_k] = \sum_{i=1}^{n} \dfrac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i]$

Given $x_k = v_i$ there are $d_i \cdot d_i$ combinations of previous and next node, of which $2 l_i$ yield $\phi_k = 1$:

$E[\phi_k d_k] = \sum_{i=1}^{n} \dfrac{d_i}{D} \cdot \dfrac{2 l_i}{d_i d_i} \cdot d_i = \sum_{i=1}^{n} \dfrac{2 l_i}{D}$

Page 15:

Global CC Proof

$l_i$: the number of triangles containing $v_i$.

$d_i$: the degree of node $v_i$.

$n$: the number of nodes.

$D = \sum_{i=1}^{n} d_i$

$E[\Phi_g] = E[\phi_k d_k] = \dfrac{2}{D} \sum_{i=1}^{n} l_i$

$E[\Psi_g] = \dfrac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$

By concentration bounds on both $\Phi_g$ and $\Psi_g$:

$\hat{c}_g = \dfrac{\Phi_g}{\Psi_g} \cong \dfrac{E[\Phi_g]}{E[\Psi_g]} = \dfrac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$
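The final equality is a counting identity. As a short worked step, writing $T$ for the number of triangles and $W$ for the number of connected triplets (both symbols are introduced here and do not appear on the slide): every triangle is counted once at each of its three nodes, and every node $v_i$ is the center of $d_i(d_i-1)/2$ connected triplets.

```latex
\[
\sum_{i=1}^{n} l_i = 3T,
\qquad
\sum_{i=1}^{n} d_i (d_i - 1) = 2W
\]
\[
\frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)}
= \frac{2 \cdot 3T}{2W}
= \frac{3T}{W}
= c_g
\]
```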

Page 16:

Guarantees

For any $\epsilon \le \frac{1}{8}$ and $\delta \le 1$, we have

$\mathrm{Prob}\left[(1 - \epsilon)\, c_g \le \hat{c}_g \le (1 + \epsilon)\, c_g\right] \ge 1 - \delta$

when the number of samples, $r$, satisfies

$r \ge r_g = O(\text{mixing time}(\epsilon))$

Page 17:

Network Average CC Algorithm

1. $\Psi_l$: the average of 1/degree over the sampled nodes.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.

2. $\Phi_l$: the average of $\phi_k \dfrac{1}{d_k - 1}$ over the sampled nodes.

The estimated network average CC:

$\hat{c}_l = \dfrac{\Phi_l}{\Psi_l}$

Page 18:

Evaluations

Network        n (size)     D/n     c_l      c_g
DBLP           977,987      8.457   0.7231   0.1868
Orkut          3,072,448    76.28   0.1704   0.0413
Flickr         2,173,370    20.92   0.3616   0.1076
Live Journal   4,843,953    17.69   0.3508   0.1179

DBLP facts: Paper with most co-authors: has 119 listed authors. Most prolific author: Vincent Poor with 798 entries.

Page 19:

Global CC

Relative improvement ranges between 300% and 500% depending on the network.

[Figure: DBLP Network. x-axis: Percentage of mined nodes (0 to 2); y-axis: Relative estimation value (0 to 3.5). Series: Gjoka et al*, Ribeiro et al*, This work]

Page 20:

Network Average CC

Relative improvement ranges between 50% and 400% depending on the network.

[Figure: Orkut Network. x-axis: Percentage of mined nodes (0 to 2); y-axis: Relative estimation value (0 to 2.5). Series: Ribeiro et al, Gjoka et al, Random walk]

Page 21:

Conclusions

1. New external access estimator for Global Clustering Coefficient.

2. Improved estimator for Network Average Clustering Coefficient.

3. Improved estimator for number of registered users.

Page 22:

Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir

Yahoo! Labs, Haifa, Israel

Edo Liberty

Yahoo! Labs, Haifa, Israel

Oren Somekh

Yahoo! Labs, Haifa, Israel

Page 23:

The Birthday “Paradox”

The expected number of collisions in a list of r i.i.d. uniform samples from a set of n elements is $\dfrac{r(r-1)}{2n}$.

A collision is a pair of identical samples.

Example: Samples: X = (d, b, b, a, b, e). Total 3 collisions: (x2, x3), (x2, x5), and (x3, x5).
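Counting collisions does not require comparing all pairs: a value seen c times contributes c(c-1)/2 pairs. The sketch below uses that shortcut on the example list and also evaluates the expected-collision formula. Both function names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def count_collisions(samples):
    """Number of unordered pairs of positions holding identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def expected_collisions(r, n):
    """r(r-1)/(2n): expected collisions among r uniform samples from n items."""
    return Fraction(r * (r - 1), 2 * n)


print(count_collisions(["d", "b", "b", "a", "b", "e"]))  # 3
print(expected_collisions(23, 365))                      # 253/365
```

The second line is the classic birthday setting: 23 people and 365 days give roughly 0.69 expected shared-birthday pairs.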

Page 24:

Cardinality estimation uniform

Needs $r = O(\sqrt{n})$ samples to converge. Used by [Ye et al, 2010] to estimate the size.

When C collisions are observed

$n \cong \dfrac{r(r-1)}{2C}$
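Inverting the expected-collision formula gives the estimator directly. A minimal sketch, with a made-up function name and an assumed choice to raise an error when no collision has been seen (the formula is undefined for C = 0):

```python
from collections import Counter
from fractions import Fraction


def estimate_size_uniform(samples):
    """Estimate the population size as r(r-1) / (2C) from uniform samples."""
    r = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        # Assumed behavior: the estimate is undefined without any collision.
        raise ValueError("no collisions observed; draw more samples")
    return Fraction(r * (r - 1), 2 * collisions)


print(estimate_size_uniform(["d", "b", "b", "a", "b", "e"]))  # 5
```

With 6 samples and 3 collisions the estimate is 6 * 5 / (2 * 3) = 5.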

Page 25:

Stationary distribution sampling

[Figure: example graph on nodes v1 to v9, each node labeled with its stationary probability; the labels are 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22]

Sampled Nodes: v5 v2 v5 v4 v2

Stationary Distribution $= \dfrac{d_i}{\sum_{j} d_j}$

Page 26:

Cardinality estimation stationary

Needs $r = O(\sqrt[4]{n}\, \log n)$ samples to converge when $d_i \sim \mathrm{zipf}(n, 2)$.

When C collisions are observed

$n \cong \dfrac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C}$

Page 27:

Example:

[Figure: example graph on nodes v1 to v9]

Samples: v5 v2 v5 v4 v2

$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14$

$\sum \dfrac{1}{d_x} = \dfrac{1}{2} + \dfrac{1}{3} + \dfrac{1}{2} + \dfrac{1}{4} + \dfrac{1}{3} = \dfrac{23}{12}$

$\hat{n} = \dfrac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$
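The example can be replayed in code. The sketch below takes the sampled nodes and their degrees from the example (d5 = 2, d2 = 3, d4 = 4). The function name and the `degree` lookup table are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def estimate_size_stationary(samples, degree):
    """(sum of d_x) * (sum of 1/d_x) / (2C) for degree-biased samples."""
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        # Assumed behavior: the estimate is undefined without any collision.
        raise ValueError("no collisions observed; draw more samples")
    sum_d = sum(degree[x] for x in samples)
    sum_inv = sum(Fraction(1, degree[x]) for x in samples)
    return sum_d * sum_inv / (2 * collisions)


samples = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v2": 3, "v4": 4, "v5": 2}
estimate = estimate_size_stationary(samples, degree)
print(estimate, float(estimate))
```

The two collisions are the repeated v5 and the repeated v2, so the denominator is 2 * 2 and the exact result is 161/24, about 6.7.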

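Page 28:

Global CC Proof

$d_i$: the degree of node $v_i$.

$n$: the number of nodes.

$D = \sum_{i=1}^{n} d_i$

For a single sample $x$ drawn from the stationary distribution:

$E[d_x] = \sum_{i=1}^{n} \dfrac{d_i}{D}\, d_i$

$E\left[\dfrac{1}{d_x}\right] = \sum_{i=1}^{n} \dfrac{d_i}{D} \cdot \dfrac{1}{d_i} = \dfrac{n}{D}$

For a single pair of samples, the probability of a collision:

$E[C] = \sum_{i=1}^{n} \dfrac{d_i}{D} \cdot \dfrac{d_i}{D}$

By concentration bounds on the numerator and on $C$:

$\hat{n} = \dfrac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C} \cong \dfrac{E\left[\sum d_x\right] E\left[\sum \frac{1}{d_x}\right]}{2 E[C]} \cong \dfrac{\left(\sum_{i} \frac{d_i}{D} d_i\right) \frac{n}{D}}{\sum_{i} \frac{d_i}{D} \cdot \frac{d_i}{D}} = n$

The per-sample and per-pair expectations are enough for the last step: with $r$ samples the numerator picks up a factor $r^2$ and $2E[C]$ picks up $r(r-1)$, and these cancel up to $r/(r-1)$.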

Page 29:

Improvements

1. Using all samples (Hardiman et al 2009).

2. Using Conditional Monte Carlo (This work).

Page 30:

All Samples

Restrict computation to indexes m steps apart, $I = \{(k, l) \mid k - l \ge m\}$

A collision is only considered within $I$: $\Phi = \left|\{\, x_k = x_l \mid (k, l) \in I \,\}\right|$

Ratio of degrees is similarly defined:

$\Psi = \sum_{(k, l) \in I} \dfrac{d_{x_k}}{d_{x_l}}$

Page 31:

Conditional Monte Carlo

A collision between $x_k$ and $x_l$ is replaced by the conditional collision in steps k+1 and l+1 respectively.

$E\left[\mathbf{1}_{x_{k+1} = x_{l+1}} \mid x_k, x_l\right] = \dfrac{\text{Common Neighbors}}{d_{x_k} d_{x_l}}$

Page 32:

Conditional Monte Carlo

• The pair $(v_4, v_7)$ is not a collision, but it contributes $\dfrac{1}{12}$ to the collision counter.

[Figure: example graph on nodes v1 to v9]
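The 1/12 contribution follows from the conditional expectation: the next steps from two nodes coincide only if both walks move to the same common neighbor. The sketch below uses a hypothetical edge set in which v4 has degree 4, v7 has degree 3, and they share exactly one neighbor. The slide's actual edges are not recoverable, so this graph is an assumption.

```python
from fractions import Fraction

# Hypothetical edges (an assumption): v4 and v7 share only the neighbor v5.
GRAPH = {
    "v2": {"v4"},
    "v3": {"v4"},
    "v4": {"v2", "v3", "v5", "v8"},
    "v5": {"v4", "v7"},
    "v6": {"v7"},
    "v7": {"v5", "v6", "v9"},
    "v8": {"v4"},
    "v9": {"v7"},
}


def conditional_collision(adj, u, w):
    """Probability that one walk step from u and one from w land on the same node."""
    common = len(adj[u] & adj[w])
    return Fraction(common, len(adj[u]) * len(adj[w]))


print(conditional_collision(GRAPH, "v4", "v7"))  # 1/12
```

Replacing the 0/1 collision indicator with this fraction lets every sampled pair add information to the counter, not only the pairs that actually coincide.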

Page 33:

Size Estimation

[Figure: DBLP Network. x-axis: Percentage of mined nodes (0.5 to 2.5); y-axis: Relative estimation value (0 to 2.5). Series: Prior art, This work]

Page 34:

Thanks