Data Mining - Volinsky - 2011 - Columbia University
Topic 13: Network Models
Credits: C. Faloutsos and J. Leskovec Tutorial; E. Kolaczyk Notes


Social Networks

• Network: a collection of inter-connected things
  – Analyzing network data is also called “graph mining”
• Data consisting of nodes and edges
  – Note: different from “graphical models” (graphical representation of dependence among random variables)
• Edges represent:
  – Relationship between nodes
  – Behavior observed between nodes
  – High similarity between nodes
• Edges are typically weighted
• Nodes and edges can both have attributes
• Can be directed or undirected
  – Directed: phone calls, emails
  – Undirected: collaboration, physical networks, friendship
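One way to hold such data in memory is an adjacency map. The sketch below is illustrative only: the `Graph` class, its method names, and the sample people are assumptions made for this example, not something taken from the slides.

```python
from collections import defaultdict

class Graph:
    """Tiny adjacency-map graph: adj[u][v] holds the attribute dict of edge u-v."""

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = defaultdict(dict)   # node -> {neighbour: edge attributes}
        self.node_attrs = {}           # node -> attribute dict

    def add_node(self, u, **attrs):
        self.node_attrs.setdefault(u, {}).update(attrs)
        self.adj[u]                    # touch the entry so isolated nodes are listed

    def add_edge(self, u, v, weight=1.0, **attrs):
        self.add_node(u)
        self.add_node(v)
        self.adj[u][v] = {"weight": weight, **attrs}
        if not self.directed:
            # Undirected: the same edge is visible from both endpoints
            self.adj[v][u] = self.adj[u][v]

# Directed, weighted example: alice called bob three times
calls = Graph(directed=True)
calls.add_node("alice", city="NYC")
calls.add_edge("alice", "bob", weight=3, kind="phone")

# Undirected example: friendship is mutual
friends = Graph(directed=False)
friends.add_edge("alice", "bob")
```

In the directed graph the edge is stored only under its source, while in the undirected graph it is reachable from either endpoint.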


Examples


Networks are everywhere!


Layout

• Layout matters!
  – Especially with directed graphs



Facebook “Friend Wheel”

LinkedIn


LinkedIn community from LinkedIn Labs

http://inmaps.linkedinlabs.com/share/Chris_Volinsky/140181036374613851084790009915258618160

Networks: A Matter of Scale


Measurements on networks: Nodes and Edges

• Node degree (node)
  – The number of edges coming in and out of a node is its degree
  – If directed, in-degree and out-degree can differ
• Centrality (node)
  – How “central” is a given node?
  – Betweenness version: how many shortest paths pass through it? (Degree centrality uses the node's degree instead.)
  – Centrality = importance
• Centrality (edge)
  – How central is an edge?
  – Similar “shortest path” definition
  – Does removing it create more clusters?
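A minimal sketch of these node measures, assuming an unweighted graph. The function names and toy data are made up for illustration; the shortest-path count uses Brandes' accumulation scheme, which is one standard way to compute betweenness and is not named in the slides.

```python
from collections import Counter, deque

def in_out_degree(edges):
    """edges: iterable of directed (source, target) pairs."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def betweenness(adj):
    """Shortest-path betweenness for an undirected, unweighted graph.
    adj maps each node to the set of its neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: every pair was counted once from each endpoint
    return {v: b / 2 for v, b in score.items()}

# Directed toy data: who called whom
indeg, outdeg = in_out_degree([("a", "b"), ("a", "c"), ("b", "c")])

# Undirected star: one hub connected to three leaves
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
degree = {u: len(nbrs) for u, nbrs in star.items()}
central = betweenness(star)
```

In the star, every shortest path between two leaves passes through the hub, so the hub gets all of the betweenness and the leaves get none.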


Measurements on networks (graph)

• Degree distribution
  – The distribution of all node degrees characterizes the graph
    • Normal or highly skewed?
• Clustering coefficient (graph)
  – How “dense” is the graph?
    • Given n nodes, how many possible edges?
    • Density = #Edges / #Possible edges
  – How likely is it that your friends are friends with each other?
    • Count: how many triangles
• Diameter (graph)
  – Largest shortest path
• Shortest paths (graph)
  – Histogram of shortest-path lengths
• Connectivity (graph)
  – Fully connected?
  – Connected components
  – For directed graphs: strongly connected components
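The graph-level measures can be sketched in a few lines for a small undirected graph. Everything here is an illustrative assumption: the function names, the toy graph, and the choice of the triangle-based (global) clustering coefficient, which is only one of several definitions in use.

```python
from collections import deque
from itertools import combinations

def density(adj):
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # each undirected edge is stored twice
    return m / (n * (n - 1) / 2)                      # n(n-1)/2 possible edges

def clustering_coefficient(adj):
    """Fraction of 'two friends of the same node' pairs that are themselves linked."""
    closed = triples = 0
    for u, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            triples += 1
            closed += w in adj[v]
    return closed / triples if triples else 0.0

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_components(adj):
    seen, parts = set(), []
    for u in adj:
        if u not in seen:
            part = set(bfs_distances(adj, u))
            seen |= part
            parts.append(part)
    return parts

def diameter(adj):
    """Largest shortest-path length; assumes the graph is connected."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)

# Toy graph: a triangle a-b-c with one extra node d hanging off c
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
toy_density = density(toy)
toy_clustering = clustering_coefficient(toy)
toy_diameter = diameter(toy)
toy_parts = connected_components(toy)
```

For the toy graph, 4 of the 6 possible edges exist, and 3 of the 5 "friend pairs" are closed into a triangle.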


Models on networks

• Random (Erdos-Renyi)
  – Each possible edge occurs independently with probability p
  – Degree distribution follows (approximately) a Poisson distribution
• Exponential (p*) models
  – Statistical model: extension of Erdos-Renyi
  – Defines a probability distribution over graphs in terms of graph properties
• Preferential attachment
  – Generative model: each new node creates m links (m based on a Poisson distribution)
  – It attaches to existing nodes with probability proportional to the degree of that node
  – Rich get richer
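The two generative models can be sketched as follows. This is a simplified illustration, not the slides' exact specification: the function names are made up, and the preferential-attachment sketch uses a fixed m per new node rather than drawing it from a Poisson distribution.

```python
import random

def erdos_renyi(n, p, rng):
    """G(n, p): include each of the n(n-1)/2 possible edges independently with probability p."""
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def preferential_attachment(n, m, rng):
    """Grow the graph one node at a time; each new node links to m distinct
    existing nodes chosen with probability proportional to their degree."""
    # Seed: a complete graph on m + 1 nodes, so every node starts with nonzero degree
    adj = {u: set() for u in range(m + 1)}
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
    # 'stubs' lists each node once per incident edge, so a uniform draw
    # from it picks a node with probability proportional to its degree
    stubs = [u for u in adj for _ in adj[u]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend([new, t])
    return adj

rng = random.Random(0)
er = erdos_renyi(500, 0.01, rng)
pa = preferential_attachment(500, 3, rng)
print("max degree, Erdos-Renyi:", max(len(nbrs) for nbrs in er.values()))
print("max degree, preferential attachment:", max(len(nbrs) for nbrs in pa.values()))
```

Comparing the largest degree in each graph is a quick way to see the "rich get richer" effect: early nodes in the preferential-attachment graph keep collecting links.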


Real-world networks

• Degree distributions in real-world networks are heavily skewed to the right
  – The preferential attachment model fits this pattern
• Long tail of values above the mean
  – Large mean, small median, small diameter
• Leads to a “power law”
  – Let k = degree and pk = the number of nodes that have that degree
  – A plot of log k vs. log pk should be linear
• Many real-world data sets follow a power law:
  – Online sales
  – Word length distributions
  – Number of friends on Facebook!
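Why the log-log plot is linear: if pk is proportional to k^(-γ), then log pk = c - γ log k, a straight line with slope -γ. The sketch below estimates that slope by ordinary least squares. The function name and the synthetic counts are assumptions for illustration, and a least-squares fit on log-log axes is only a rough diagnostic, not a rigorous power-law fit.

```python
import math

def loglog_slope(pk):
    """Least-squares slope of log(p_k) against log(k).
    pk maps degree k (>= 1) to the number of nodes with that degree."""
    pts = [(math.log(k), math.log(c)) for k, c in pk.items() if k >= 1 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    return sxy / sxx

# Synthetic counts that follow p_k = 10000 * k^-2.5 exactly (not real data)
pk = {k: 10000 * k ** -2.5 for k in range(1, 51)}
slope = loglog_slope(pk)   # recovers the exponent: slope is about -2.5
```

With real degree counts, `pk` could be built with `collections.Counter` over the list of node degrees.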


More Power Law


Erdos-Renyi vs. Power-law

From Leskovec & Faloutsos

Small World

• Real-world data sets tend to have power-law distributions
• They also tend to have a “small world” property
  – Everyone is reachable via a small number of edges
  – Small diameters
• Stanley Milgram experiment, 1967
  – People were given a letter and asked to forward it to one friend
  – Source: random residents of Omaha
  – Target: a stockbroker in Boston
  – Completed chains averaged 6 hops; hence, “six degrees of separation”



Small World Networks

• Watts and Strogatz [1998] introduced the small-world model
• Navigable Social Networks [Kleinberg 2000]
  – Showed how small-world networks are created
    • Put n people on a k-dimensional grid
    • Connect each to its immediate neighbors
    • Add one long-range link per person
    • Everyone will be connected via a short path
    • This is the way the real world works!
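The grid construction can be tried out on a small example. The sketch below is a simplification under stated assumptions: it uses a 2-dimensional grid and picks each long-range contact uniformly at random, whereas Kleinberg's model makes the choice depend on grid distance. All function names are made up for illustration.

```python
import random
from collections import deque

def grid_graph(side):
    """2-D grid: person (r, c) is linked to the neighbours directly above, below, left and right."""
    adj = {(r, c): set() for r in range(side) for c in range(side)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].add(nb)
                adj[nb].add((r, c))
    return adj

def add_long_range_links(adj, rng):
    """Give every person one extra link to a uniformly random other person (a simplification)."""
    nodes = list(adj)
    out = {u: set(nbrs) for u, nbrs in adj.items()}
    for u in nodes:
        v = rng.choice(nodes)
        while v == u:
            v = rng.choice(nodes)
        out[u].add(v)
        out[v].add(u)
    return out

def mean_path_length(adj):
    """Average shortest-path length over all reachable ordered pairs (BFS from every node)."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(1)
grid = grid_graph(15)                        # 225 people
small_world = add_long_range_links(grid, rng)
before = mean_path_length(grid)
after = mean_path_length(small_world)
print(f"mean path length: grid only {before:.2f}, with long-range links {after:.2f}")
```

Adding links can never lengthen a shortest path, so the second number is the smaller one; how much smaller depends on the random draw.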

Small World Networks

• Another look


Sampling Networks

• How do you sample from a massive network?
• Simplest method: induced subgraph
  – Randomly sample nodes and keep the edges between them
  – Not so great!


Yellow nodes are randomly sampled, but the sample does not have the same graph properties as the full network!
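A small experiment makes the problem concrete. The sketch below assumes a toy network (a ring in which every node has exactly two neighbours) and a made-up function name; it is an illustration, not a recommended sampler.

```python
import random

def induced_subgraph_sample(adj, k, rng):
    """Pick k nodes uniformly at random and keep only edges with both endpoints in the sample."""
    chosen = set(rng.sample(sorted(adj), k))
    return {u: adj[u] & chosen for u in chosen}

# Toy network: a ring of 1000 nodes, so every node has degree 2
n = 1000
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

sample = induced_subgraph_sample(ring, 100, random.Random(0))
mean_degree = sum(len(nbrs) for nbrs in sample.values()) / len(sample)
print("mean degree in the full ring: 2.0, in the sample:", mean_degree)
```

Because a sampled node's neighbours are rarely sampled too, most edges are lost and the sample looks far sparser than the network it came from.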

Sampling Networks

• Snowball sampling:
  – Pick a random sample of nodes and then follow their “tree” for a set number of “hops”

Still not perfect, but better.

Other ideas abound, but there is little agreement.

Great area for research!
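Snowball sampling amounts to a breadth-first search that stops after a fixed number of hops. The sketch below is illustrative; the function name, its parameters, and the ring-shaped toy network are assumptions.

```python
import random
from collections import deque

def snowball_sample(adj, n_seeds, hops, rng):
    """Start from random seed nodes and add every node reachable within `hops` edges."""
    seeds = rng.sample(sorted(adj), n_seeds)
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if depth[u] == hops:
            continue                      # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    chosen = set(depth)
    return {u: adj[u] & chosen for u in chosen}

# Toy network: a ring of 1000 nodes, so every node has degree 2
n = 1000
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

snow = snowball_sample(ring, n_seeds=1, hops=3, rng=random.Random(0))
snow_mean_degree = sum(len(nbrs) for nbrs in snow.values()) / len(snow)
```

One seed and three hops on the ring yields the seed plus three nodes on each side. Only the two boundary nodes lose an edge, so local structure is preserved much better than with uniform node sampling.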

Network Problems of Interest


• Link prediction:
  – Can we use existing network data to infer links where they don't exist?
    • Links in the future?
    • Missing data
  – Simple methods
    • Look for many common neighbors
  – Complex methods
    • Stochastic blockmodels
    • Similar to using SVD to “fill in” a matrix
    • Agarwal and Pregibon ’04
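The simple method can be sketched directly: score each unlinked pair by how many neighbours the two nodes share, and treat the highest-scoring pairs as candidate links. The function name and the five-person network are made up for illustration.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Rank every unlinked pair by the number of neighbours the two nodes share."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    return sorted(scores.items(), key=lambda kv: -kv[1])

adj = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "cat", "eve"},
    "eve": {"dan"},
}
ranked = common_neighbor_scores(adj)
# bob and dan are not linked but share two neighbours (ann and cat),
# so that pair comes out on top
```

This loops over all pairs, which is fine for a toy graph but too slow for a massive network; there one would only score pairs that are two hops apart.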


• Graph matching / similarity
  – Fraud (“repetitive debtors”)
  – Citation de-noising
  – Need a metric to define the difference between graphs
• Collective inference
  – What can you learn about someone from their network?
    • Fraud (“guilt by association”)
    • Viral marketing
• The following example is courtesy of Sofus Macskassy


A Relational Neighbor Classifier (wvRN)

[Figure: example network in which nodes with unknown labels are marked “?”. Slide credit: Sofus A. Macskassy]

Classify all entities in the network simultaneously, because (if done well) inferences about neighbors can reduce statistical bias (cf. Jensen et al., KDD-04).

Collective wvRN

[Figure sequence: the same network over successive rounds of collective inference; nodes that are still unlabeled are marked “?”]
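The slides do not spell out the update rule, so the following is only a sketch under assumptions: each unlabeled node's class probability is set to the weighted average of its neighbours' current probabilities (the weighted-vote idea behind wvRN), and all unlabeled nodes are updated together for a fixed number of rounds. The function name, the binary labels, and the four-node example are hypothetical.

```python
def wvrn(adj, known, n_iter=50):
    """adj: node -> {neighbour: edge weight}; known: node -> 0 or 1 for labeled nodes.
    Returns node -> estimated probability of class 1."""
    prior = sum(known.values()) / len(known)           # class-1 share among labeled nodes
    prob = {u: float(known.get(u, prior)) for u in adj}
    for _ in range(n_iter):
        new = dict(prob)
        for u in adj:
            if u in known:
                continue                                # labeled nodes stay fixed
            total = sum(adj[u].values())
            if total:
                # Weighted vote: average the neighbours' current estimates
                new[u] = sum(w * prob[v] for v, w in adj[u].items()) / total
        prob = new                                      # update all nodes at once
    return prob

# Chain a - b - c - d with unit weights; a is known to be class 1, d class 0
chain = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
prob = wvrn(chain, known={"a": 1, "d": 0})
```

On the chain, the estimates settle near 2/3 for b and 1/3 for c: each unlabeled node leans toward the labeled node it is closer to.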


• Diffusion
  – Information or virus diffusion
• Community detection
  – Subgroups have a higher density within the subgroup
  – Can remove edges with high centrality to try to find communities
• Understanding of social networks
  – Facebook


http://www.telegraph.co.uk/technology/facebook/8906693/Facebook-cuts-six-degrees-of-separation-to-four.html
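The edge-removal idea for community detection can be sketched as follows, in the spirit of the Girvan-Newman approach (a name the slides do not use). The sketch assumes an undirected, unweighted graph without self-loops; the function names and the six-node example are made up for illustration.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path count crossing each edge (Brandes-style accumulation, unweighted)."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return score   # every path is counted from both ends; the ranking is unaffected

def components(adj):
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in part:
                    part.add(v)
                    stack.append(v)
        seen |= part
        parts.append(part)
    return parts

def split_once(adj):
    """Remove the most central edge repeatedly until the graph gains a component."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}     # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break                                       # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles (1-2-3 and 4-5-6) joined by the single bridge 3-4
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
parts = split_once(g)
```

Every shortest path between the two triangles crosses the bridge 3-4, so it has the highest edge centrality and is removed first, leaving the two triangles as communities.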

References

• Leskovec / Faloutsos tutorial (mostly part 1): http://cs.stanford.edu/people/jure/talks/www08tutorial/
• Eric Kolaczyk notes and book: http://www.samsi.info/sites/default/files/Kolaczyk-CN.pdf
• Watts and Strogatz: “Collective dynamics of ‘small-world’ networks”, Nature 393, pp. 440-442
• Networks, M. E. J. Newman (book)
• Linked: How Everything Is Connected to Everything Else and What It Means, Albert Barabasi
• Enron data
• Tools
  – Graphviz.org for visualization
  – igraph (R package)
