Data Mining - Volinsky - 2011 - Columbia University Topic 13 Network Models 1 Credits: C. Faloutsos and J. Leskovec Tutorial E. Kolaczyk Notes

Jan 11, 2016


Data Mining - Volinsky - 2011 - Columbia University

Topic 13

Network Models

Credits:

C. Faloutsos and J. Leskovec Tutorial

E. Kolaczyk Notes


Social Networks

• Network: a collection of inter-connected things
  – Analyzing network data is also called “graph mining”
• Data consisting of nodes and edges
  – Note: different from “graphical models” (graphical representation of dependence among random variables)
• Edges represent:
  – Relationship between nodes
  – Behavior observed between nodes
  – High similarity between nodes
• Edges are typically weighted
• Nodes and edges can both have attributes associated with them
• Can be directed or undirected
  – Directed: phone calls, emails
  – Undirected: collaboration, physical networks, friendship
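The definitions above can be sketched in code. This is a minimal illustration, assuming a nested-dictionary representation; the helper `add_edge`, the node names, and the weights are made up for the example and do not come from the slides.

```python
from collections import defaultdict

def add_edge(graph, src, dst, weight=1.0, directed=True):
    """Add a weighted edge; undirected edges are stored in both directions."""
    graph[src][dst] = graph[src].get(dst, 0.0) + weight
    graph[dst]  # touch dst so nodes with no outgoing edges still appear
    if not directed:
        graph[dst][src] = graph[dst].get(src, 0.0) + weight

# Directed example: phone calls, weight = number of calls (made-up data).
calls = defaultdict(dict)
add_edge(calls, "ann", "bob", weight=3)
add_edge(calls, "bob", "cat", weight=1)
add_edge(calls, "ann", "bob", weight=2)  # repeated calls accumulate

# Undirected example: friendship (made-up data).
friends = defaultdict(dict)
add_edge(friends, "ann", "bob", directed=False)
```

Storing the weight on the edge keeps "behavior observed between nodes" (such as call counts) next to the relationship itself; node attributes could live in a separate dictionary keyed by node.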


Examples


Networks are everywhere!


Layout

• Layout matters!
  – Especially with directed graphs


Facebook “Friend Wheel”


LinkedIn


LinkedIn community from LinkedIn Labs


Networks: A Matter of Scale


Measurements on networks: Nodes and Edges

• Node degree (node)
  – The number of edges coming in and out of a node is its degree
  – If directed, in-degree and out-degree can differ
• Centrality (node):
  – How “central” is a given node?
  – Degree centrality: how many edges does it have?
  – Shortest-path (betweenness) centrality: how many shortest paths pass through it?
  – Centrality = importance
• Centrality (edge):
  – How central is an edge?
  – Similar “shortest path” definition
  – Does removing it create more clusters?
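The node measurements above can be sketched as follows. This is an illustrative sketch, not code from the course: `degrees` and `betweenness` are hypothetical helper names, and the shortest-path centrality uses a Brandes-style accumulation on an unweighted, undirected graph.

```python
from collections import deque

def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of directed (u, v) edges."""
    indeg, outdeg = {}, {}
    for u, v in edges:
        for n in (u, v):
            indeg.setdefault(n, 0)
            outdeg.setdefault(n, 0)
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def betweenness(adj):
    """Shortest-path centrality for an undirected, unweighted graph
    given as {node: set_of_neighbors}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # each undirected pair was counted once from each endpoint
    return {v: c / 2 for v, c in score.items()}

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
centrality = betweenness(path)
indeg, outdeg = degrees([("a", "b"), ("a", "c")])
```

On the three-node path, only the middle node lies on a shortest path between two other nodes, so it is the only node with a non-zero score.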


Measurements on networks (graph)

• Degree distribution
  – The distribution of all node degrees characterizes the graph
    • Normal or highly skewed?
• Density and clustering coefficient (graph):
  – How “dense” is the graph?
    • Given n nodes, how many possible edges? (n(n-1)/2 if undirected)
    • Density = #edges / #possible edges
  – Clustering coefficient: how likely is it that your friends are friends with each other?
    • Count: how many triangles
• Diameter (graph)
  – The largest shortest path between any pair of nodes
• Shortest paths (graph)
  – Histogram of shortest-path lengths
• Connectivity (graph)
  – Fully connected?
  – Connected components
  – For directed graphs: strongly connected components
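The graph-level measurements above can be computed directly on a small example. This is a sketch under stated assumptions: the graph is undirected and unweighted, the function names are hypothetical, and the clustering coefficient is the global version (fraction of neighbor pairs that are themselves connected).

```python
from collections import deque
from itertools import combinations

def density(adj):
    """#edges / #possible edges for an undirected graph {node: set_of_neighbors}."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    possible = n * (n - 1) / 2
    return m / possible if possible else 0.0

def clustering_coefficient(adj):
    """Fraction of pairs of a node's friends that are friends with each other."""
    closed = total = 0
    for nb in adj.values():
        for a, b in combinations(nb, 2):
            total += 1
            if b in adj[a]:
                closed += 1
    return closed / total if total else 0.0

def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter(adj):
    """Largest shortest path over all pairs that are connected."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def connected_components(adj):
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp = set(bfs_distances(adj, s))
            seen |= comp
            comps.append(comp)
    return comps

# A triangle a-b-c with one extra node d attached to c (made-up example).
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

For this example there are 4 of 6 possible edges, 3 of the 5 neighbor pairs are closed, and the farthest pair (d and a, or d and b) is 2 hops apart.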


Models on networks

• Random (Erdős-Rényi)
  – Each possible edge occurs independently with probability p
  – Degree distribution approximately follows a Poisson distribution
• Exponential random graph (p*) models
  – Statistical model: extension of Erdős-Rényi
  – Defines a probability distribution over graph properties
• Preferential attachment
  – Generative model: each new node creates m links (m based on a Poisson distribution)
  – Links attach to existing nodes with probability proportional to the degree of that node
  – Rich get richer
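Two of the generative models above can be sketched in a few lines. This is an illustrative simplification, not the course's code: the function names are hypothetical, the preferential-attachment version uses a fixed m per new node rather than a Poisson draw, and it starts from a small fully connected seed graph (an assumption made so every node has some degree to attach to).

```python
import random

def erdos_renyi(n, p, seed=None):
    """Each of the n(n-1)/2 possible undirected edges appears with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def preferential_attachment(n, m, seed=None):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their degree (n > m >= 1)."""
    rng = random.Random(seed)
    # seed graph: a clique on m + 1 nodes
    adj = {v: set(range(m + 1)) - {v} for v in range(m + 1)}
    # a node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` is a draw proportional to degree
    stubs = [v for v, nb in adj.items() for _ in nb]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend([new, t])
    return adj

random_graph = erdos_renyi(200, 0.02, seed=0)
rich_get_richer = preferential_attachment(200, 2, seed=0)
```

Comparing the largest degree in the two graphs is a quick way to see the "rich get richer" effect: the preferential-attachment graph tends to grow a few hubs, while the random graph's degrees stay close to the mean.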


Real-world networks

• Degree distributions in real-world networks are heavily skewed to the right
  – Preferential attachment produces this kind of distribution
• Long tail of values above the mean
  – Large mean, small median, small diameter
• Leads to a “power law”
  – Let k = degree and p_k = the number of nodes that have that degree
  – Power law: p_k is proportional to k^(-γ) for some exponent γ > 0
  – So a plot of log k vs. log p_k should be linear
• Many real-world data sets follow a power law:
  – Online sales
  – Word length distributions
  – Number of friends on Facebook!
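The log-log linearity check described above can be sketched as an ordinary least-squares line through the points (log k, log p_k). The helper names are hypothetical, the data are synthetic, and a straight-line fit on a log-log plot is only a rough diagnostic, not a rigorous test for a power law.

```python
import math
from collections import Counter

def degree_counts(degrees):
    """Map each degree k >= 1 to p_k, the number of nodes with that degree."""
    return Counter(k for k in degrees if k > 0)

def loglog_slope(pk):
    """Least-squares slope and intercept of log p_k against log k."""
    xs = [math.log(k) for k in pk]
    ys = [math.log(c) for c in pk.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic check: counts that follow p_k = 1000 * k^-2 exactly,
# so the fitted slope should recover the exponent -2.
exact = {k: 1000.0 * k ** -2 for k in range(1, 50)}
slope, intercept = loglog_slope(exact)
```

On real degree data the points scatter around the line, especially in the sparse tail, so the fitted slope should be read as a summary of the plot rather than a precise estimate of the exponent.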


More Power Law


Erdos-Renyi vs. Power-law

From Leskovec & Faloutsos

Small World

• Real-world data sets tend to have power-law distributions
• They also tend to have a “small world” property
  – Everyone is reachable via a small number of edges
  – Small diameters
• Stanley Milgram experiment, 1967
  – People were given a letter and asked to forward it to one friend
  – Source: random residents of Omaha
  – Target: a stockbroker in Boston
  – Completed chains averaged 6 hops
  – Hence “six degrees of separation”


Small World Networks

• Watts and Strogatz [1998] introduced the small-world network model
• Navigable Social Networks [Kleinberg 2000]
  – Showed how small-world networks can be created
    • Put n people on a k-dimensional grid
    • Connect each to its immediate neighbors
    • Add one long-range link per person
  – Everyone will be connected via a short path
  – This is the way the real world works!
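The grid-plus-shortcuts construction above can be tried on a small example. This is a sketch under simplifying assumptions that are not in the slides: the grid is one-dimensional (a ring), each long-range link goes to a uniformly random node rather than following a distance-dependent rule, and the function names are hypothetical.

```python
import random
from collections import deque

def ring_lattice(n):
    """1-dimensional grid with wrap-around: each node links to its two neighbors."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}

def add_long_range_links(adj, seed=None):
    """Return a copy of adj in which every node gets one extra random link."""
    rng = random.Random(seed)
    nodes = list(adj)
    out = {v: set(nb) for v, nb in adj.items()}
    for v in nodes:
        w = rng.choice(nodes)
        while w == v:
            w = rng.choice(nodes)
        out[v].add(w)
        out[w].add(v)
    return out

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

lattice = ring_lattice(200)
small_world = add_long_range_links(lattice, seed=1)
before = average_path_length(lattice)
after = average_path_length(small_world)
```

On the plain ring the average distance grows linearly with n (about n/4), while the handful of shortcuts makes `after` much smaller than `before`, which is the small-world effect in miniature.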


Small World Networks

• Another look


Sampling Networks

• How do you sample from a massive network?
• Simplest method: induced subgraph
  – Randomly sample nodes and keep the edges between them
  – Not so great!

Yellow nodes are randomly sampled, but the sample doesn’t have the same graph properties!
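The induced-subgraph method can be sketched as below, along with one way to see why it distorts graph properties. The function name and the ring-shaped test network are assumptions made for the illustration.

```python
import random

def induced_subgraph_sample(adj, k, seed=None):
    """Sample k nodes uniformly at random and keep only edges among them."""
    rng = random.Random(seed)
    keep = set(rng.sample(list(adj), k))
    return {v: adj[v] & keep for v in keep}

# Made-up network: a ring of 1000 nodes, where every node has degree 2.
ring = {v: {(v - 1) % 1000, (v + 1) % 1000} for v in range(1000)}
sample = induced_subgraph_sample(ring, k=100, seed=0)
mean_degree = sum(len(nb) for nb in sample.values()) / len(sample)
```

Every node in the full ring has degree 2, but in a 10% sample most of a node's neighbors are missing, so the sampled mean degree falls well below 2 and the sample breaks into many small pieces.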


Sampling Networks

• Snowball sampling:
  – Pick a random sample of nodes, then follow their “tree” of connections for a set number of “hops”
  – Still not perfect, but better
• Other ideas abound, but there is little agreement
• Great area for research!
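Snowball sampling as described above amounts to a breadth-first search from random seed nodes that stops after a fixed number of hops. The sketch below assumes an undirected graph stored as {node: set_of_neighbors}; the function and parameter names are hypothetical.

```python
import random
from collections import deque

def snowball_sample(adj, n_seeds, hops, seed=None):
    """Pick n_seeds random nodes, then keep everything within `hops` of them."""
    rng = random.Random(seed)
    seeds = rng.sample(list(adj), n_seeds)
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        if depth[v] == hops:
            continue  # do not expand past the hop limit
        for w in adj[v]:
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
    keep = set(depth)
    return {v: adj[v] & keep for v in keep}

# Made-up network: a ring of 100 nodes.
ring = {v: {(v - 1) % 100, (v + 1) % 100} for v in range(100)}
sample = snowball_sample(ring, n_seeds=1, hops=3, seed=0)
```

Because neighbors are kept together, local structure around the seeds survives, which is why this tends to preserve graph properties better than sampling nodes independently.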


Network Problems of Interest

• Link prediction:
  – Can we use existing network data to infer links where they don’t exist?
    • Links in the future?
    • Missing data
  – Simple methods
    • Look for many common neighbors
  – Complex methods
    • Stochastic blockmodels
    • Similar to using SVD to “fill in” a matrix
    • Agarwal and Pregibon ’04
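The simple method above (look for many common neighbors) can be sketched as a ranking of currently unlinked pairs. The function name and the example graph are made up for illustration.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Rank every non-adjacent pair by its number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    # highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranked = common_neighbor_scores(graph)
# ranked[0] is (("a", "d"), 2): a and d share the neighbors b and c
```

The top-ranked pairs are the candidate links, either as predictions of future edges or as guesses at edges missing from the data.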


Network Problems of Interest

• Graph matching / similarity
  – Fraud (“repetitive debtors”)
  – Citation de-noising
  – Need a metric to define the difference between graphs
• Collective inference
  – What can you learn about someone from their network?
    • Fraud (“guilt by association”)
    • Viral marketing
• Following example courtesy of Sofus Macskassy

A Relational Neighbor Classifier (wvRN)

[Figure sequence, Sofus A. Macskassy slides 22-24: network diagram in which nodes marked “?” are to be classified]

Collective wvRN

• Classify all entities in the network simultaneously, because (if done well) inferences about neighbors can reduce statistical bias (cf. Jensen et al. KDD-04)

[Figure sequence, Sofus A. Macskassy slides 25-30: network diagram with many nodes marked “?”]
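The idea behind collective wvRN can be sketched as follows: each unlabeled node's class probability is the weighted average of its neighbors' probabilities, and all unlabeled nodes are updated together until the estimates settle. This is a minimal two-class illustration with assumptions that are not in the slides: the function name, the plain synchronous update, the fixed iteration count, and the example chain are all made up.

```python
def wvrn_collective(adj, known, n_iter=50):
    """Weighted-vote relational neighbor estimate of P(class 1) per node.

    adj:   {node: {neighbor: edge_weight}}
    known: {node: P(class 1)} for labeled nodes (1.0 or 0.0)
    """
    # unlabeled nodes start at the class prior among the labeled nodes
    prior = sum(known.values()) / len(known)
    prob = {v: known.get(v, prior) for v in adj}
    for _ in range(n_iter):
        new = dict(prob)
        for v in adj:
            if v in known:
                continue  # labeled nodes keep their labels
            total = sum(adj[v].values())
            if total > 0:
                # weighted vote of the neighbors' current estimates
                new[v] = sum(w * prob[u] for u, w in adj[v].items()) / total
        prob = new  # update every unlabeled node simultaneously
    return prob

# Made-up chain: a (class 1) - x - y - b (class 0), all weights 1.
chain = {
    "a": {"x": 1.0},
    "x": {"a": 1.0, "y": 1.0},
    "y": {"x": 1.0, "b": 1.0},
    "b": {"y": 1.0},
}
estimates = wvrn_collective(chain, known={"a": 1.0, "b": 0.0})
```

In the chain, x sits next to the class-1 node and y next to the class-0 node, so their estimates settle near 2/3 and 1/3: each unlabeled node's estimate depends on the other's, which is what makes the inference collective.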


Network Problems of Interest

• Diffusion
  – Information or virus diffusion
• Community detection
  – Subgroups have a higher edge density within the subgroup
  – Can remove edges with high centrality to try to find communities
• Understanding of social networks
  – Facebook
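The edge-removal idea for community detection can be sketched as one splitting step: repeatedly delete the edge with the highest shortest-path centrality until the graph falls into more pieces. This is an illustrative sketch in the style of the Girvan-Newman method, which the slides do not name; the function names and the example graph are assumptions, and the graph is taken to be undirected, unweighted, and free of self-loops.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths using each edge (Brandes-style accumulation)."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return {e: c / 2 for e, c in score.items()}  # each pair was seen twice

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-centrality edges until the number of components grows."""
    g = {v: set(nb) for v, nb in adj.items()}
    start = len(components(g))
    while len(components(g)) == start:
        scores = edge_betweenness(g)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        g[u].discard(v)
        g[v].discard(u)
    return components(g)

# Made-up example: two triangles joined by the single bridge 3-4.
two_triangles = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
                 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities = split_once(two_triangles)
```

Every shortest path between the two triangles crosses the bridge, so it has the highest centrality and is removed first, leaving the two dense subgroups as separate communities.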


References

• Leskovec / Faloutsos tutorial (mostly part 1)
• Eric Kolaczyk notes and book
• Watts and Strogatz: “Collective dynamics of ‘small-world’ networks”, Nature 393, pp. 440-442
• Networks, M. E. J. Newman (book)
• Linked: How Everything Is Connected to Everything Else and What It Means, Albert-László Barabási
• Enron data
• Tools
  – Graphviz.org for visualization
  – igraph (R package)