Jan 11, 2016
Data Mining - Volinsky - 2011 - Columbia University
Topic 13
Network Models
Credits:
C. Faloutsos and J. Leskovec Tutorial
E. Kolaczyk Notes
Social Networks
• Network: a collection of interconnected things
  – Analysis of networks is also called “graph mining”
• Data consisting of nodes and edges
  – Note: different from “graphical models” (a graphical representation of the dependence among random variables)
• Edges represent:
  – Relationship between nodes
  – Behavior observed between nodes
  – High similarity between nodes
• Edges are typically weighted
• Nodes and edges can both have associated attributes
• Can be directed or undirected
  – Directed: phone calls, emails
  – Undirected: collaboration, physical networks, friendship
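One common way to hold this kind of data is an adjacency map from each node to its weighted neighbors. The sketch below is only an illustration: the function name `build_graph` and the example names and counts are hypothetical, not taken from the lecture.

```python
def build_graph(edges, directed=False):
    """Return an adjacency map: node -> {neighbor: weight}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})  # make sure v appears even with no outgoing edges
        if not directed:
            adj[v][u] = w
    return adj

# Hypothetical phone-call counts (directed: caller -> callee).
calls = [("ann", "bob", 3), ("bob", "ann", 1), ("ann", "cat", 2)]
call_graph = build_graph(calls, directed=True)

# Hypothetical co-authorship counts (undirected).
papers = [("ann", "bob", 4), ("bob", "cat", 1)]
coauthor_graph = build_graph(papers)
```

In the directed graph, "cat" receives calls but makes none, so its neighbor map is empty. In the undirected graph every edge is stored in both directions.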
Examples
Networks are everywhere!
Layout
• Layout matters!
  – Especially with directed graphs
Facebook “Friend Wheel”
LinkedIn community from LinkedIn Labs
Networks: A Matter of Scale
Measurements on networks: Nodes and Edges
• Node degree (node)
  – The number of edges coming in and out of a node is its degree
  – If directed, in-degree and out-degree can differ
• Centrality (node)
  – How ‘central’ is a given node?
  – Degree centrality: how many edges does it have?
  – Betweenness centrality: how many shortest paths pass through it?
  – Centrality = importance
• Centrality (edge)
  – How central is an edge?
  – Similar ‘shortest path’ definition
  – Does removing it create more clusters?
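As a rough sketch of these node measurements, the code below counts in- and out-degrees and computes shortest-path (betweenness) centrality with Brandes' algorithm for unweighted graphs. The function names and the small example graphs are hypothetical. Scores are counted over ordered pairs of nodes, so for an undirected graph each pair is counted twice.

```python
from collections import deque

def degrees(adj):
    """In- and out-degree per node; every node must be a key of adj."""
    out_deg = {u: len(nbrs) for u, nbrs in adj.items()}
    in_deg = {u: 0 for u in adj}
    for nbrs in adj.values():
        for v in nbrs:
            in_deg[v] += 1
    return in_deg, out_deg

def betweenness(adj):
    """Number of shortest paths through each node (Brandes, unweighted)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Directed example: a -> b, a -> c, b -> a.
directed = {"a": {"b", "c"}, "b": {"a"}, "c": set()}
in_deg, out_deg = degrees(directed)

# Undirected star: every leaf-to-leaf shortest path passes through "hub".
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
centrality = betweenness(star)
```

In the star, the hub lies on all six ordered leaf-to-leaf paths and the leaves lie on none, which matches the intuition that centrality means importance.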
Measurements on networks (graph)
• Degree distribution
  – The distribution of all node degrees characterizes the graph
  – Normal or highly skewed?
• Density (graph)
  – How “dense” is the graph?
  – Given n nodes, how many possible edges? n(n-1)/2 if undirected
  – Density = #edges / #possible edges
• Clustering coefficient (graph)
  – How likely is it that your friends are friends with each other?
  – Count: how many triangles?
• Diameter (graph)
  – Longest shortest path between any pair of nodes
• Shortest paths (graph)
  – Histogram of shortest-path lengths
• Connectivity (graph)
  – Fully connected?
  – Connected components
  – For directed graphs: strongly connected components
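The graph-level measurements above can be sketched for a small undirected graph as follows. All function names are hypothetical, the graph is stored as node -> set of neighbors, and the diameter function assumes the graph is connected.

```python
from collections import deque
from itertools import combinations

def density(adj):
    """Edges present divided by edges possible (undirected)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # each edge is stored twice
    return m / (n * (n - 1) / 2)

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

def bfs_distances(adj, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def connected_components(adj):
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))
            seen |= comp
            components.append(comp)
    return components

def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

# Triangle a-b-c with a pendant node d attached to c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

Here 4 of the 6 possible edges are present. Node "a" has clustering 1.0 because its two friends are friends, while node "c" has clustering 1/3 because only one of its three neighbor pairs is connected.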
Models on networks
• Random (Erdős-Rényi)
  – Each possible edge occurs independently with probability p
  – Degree distribution is binomial, approximately Poisson for large sparse graphs
• Exponential random graph (p*) models
  – Statistical model: extension of Erdős-Rényi
  – Defines a probability distribution over graphs in terms of graph properties
• Preferential attachment
  – Generative model: new nodes create m links (based on Poisson)
  – Links attach to existing nodes with probability proportional to the degree of that node
  – Rich get richer
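The two generative models can be sketched as below. This is a simplified illustration with hypothetical function names. In particular, the preferential-attachment sketch uses a fixed m for every new node instead of a Poisson draw, and it starts from a small complete graph so that every node has nonzero degree.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Each of the n(n-1)/2 possible edges is included with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def preferential_attachment(n, m, rng):
    """Each new node links to m distinct existing nodes, chosen with
    probability proportional to their current degree."""
    # Seed graph: complete graph on m + 1 nodes (an assumption of this sketch).
    adj = {v: set() for v in range(m + 1)}
    for u, v in combinations(range(m + 1), 2):
        adj[u].add(v)
        adj[v].add(u)
    # A node appears in `stubs` once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    stubs = [v for v in adj for _ in adj[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend([new, t])
    return adj

rng = random.Random(0)
er_graph = erdos_renyi(200, 0.02, rng)
pa_graph = preferential_attachment(200, 2, rng)
```

Comparing `max(len(nbrs) for nbrs in g.values())` for the two graphs is a quick way to see the "rich get richer" effect: the preferential-attachment graph typically grows a few very high-degree hubs, while Erdős-Rényi degrees stay close to the mean.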
Real-world networks
• Degree distributions in real-world networks are heavily skewed to the right
  – Preferential attachment reproduces this skew
• Long tail of values above the mean
  – Large mean, small median, small diameter
• Leads to a “power law”
  – Let k = degree and p_k = the number of nodes that have that degree
  – Power law: p_k is proportional to k^(-γ) for some exponent γ > 0, so log p_k = c - γ log k
  – A plot of log k vs. log p_k should be linear
• Many real-world data sets follow a power law:
  – Online sales
  – Word length distributions
  – Number of friends on Facebook!
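If p_k is proportional to k^(-γ) for some exponent γ, then log p_k is a linear function of log k with slope -γ, which is why the log-log plot should look like a straight line. The sketch below estimates that slope by ordinary least squares. The function names and the synthetic counts are hypothetical, and a least-squares fit on log-log counts is only a crude diagnostic, not a rigorous test for a power law.

```python
import math
from collections import Counter

def degree_counts(adj):
    """Map each degree k to p_k, the number of nodes with that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

def loglog_slope(counts):
    """Least-squares slope of log p_k against log k (degree 0 is skipped)."""
    pts = [(math.log(k), math.log(c)) for k, c in counts.items() if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic counts that follow p_k = 3600 * k^(-2) exactly.
synthetic = {k: 3600 / k ** 2 for k in range(1, 7)}
slope = loglog_slope(synthetic)  # close to -2, i.e. γ is about 2
```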
More Power Law
Erdos-Renyi vs. Power-law
From Leskovec & Faloutsos
Small World
• Real-world data sets tend to have power-law degree distributions
• They also tend to have a “small world” property
  – Everyone is reachable via a small number of edges
  – Small diameters
• Stanley Milgram experiment, 1967
  – People were given a letter and asked to forward it to one friend
  – Source: random residents of Omaha
  – Target: a stockbroker in Boston
  – Completed chains averaged 6 hops
  – Hence “six degrees of separation”
Small World Networks
• Watts and Strogatz [1998] introduced the small-world network model
• Navigable Social Networks [Kleinberg 2000]
  – Showed how small-world networks can be created
    • Put n people on a k-dimensional grid
    • Connect each to its immediate neighbors
    • Add one long-range link per person
    • Everyone will be connected via a short path
• This is the way the real world works!!!
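The construction above can be tried out on a one-dimensional grid (a ring). This is a simplified sketch with hypothetical function names: the long-range target is chosen uniformly at random, which is an assumption of this example and not necessarily the distribution used in the cited work.

```python
import random
from collections import deque

def ring_lattice(n):
    """n people on a ring, each connected to its two immediate neighbors."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}

def add_long_range_links(adj, rng):
    """Give every person one extra link to a uniformly random other person."""
    n = len(adj)
    for v in range(n):
        u = rng.randrange(n)
        if u != v:
            adj[v].add(u)
            adj[u].add(v)
    return adj

def mean_distance_from(adj, source):
    """Average number of hops from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values()) / (len(dist) - 1)

plain = ring_lattice(100)
small_world = add_long_range_links(ring_lattice(100), random.Random(1))

plain_mean = mean_distance_from(plain, 0)        # 2500 / 99, about 25 hops
shortcut_mean = mean_distance_from(small_world, 0)
```

Adding links can never lengthen a shortest path, so `shortcut_mean` is at most `plain_mean`. In practice a single random shortcut per node usually cuts the average path length to a few hops.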
Small World Networks
• Another look
Sampling Networks
• How do you sample from a massive network?
• Simplest method: induced subgraph
  – Randomly sample nodes and keep the edges between them
  – Not so great!
Yellow nodes are randomly sampled, but the sample doesn’t have the same graph properties as the full network!
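A minimal sketch of induced-subgraph sampling, with hypothetical function names, shows why it can distort graph properties: on a star-shaped network, a sample that misses the hub keeps no edges at all.

```python
import random

def induced_subgraph(adj, nodes):
    """Keep the chosen nodes and only the edges that join two of them."""
    keep = set(nodes)
    return {v: adj[v] & keep for v in keep}

def sample_induced(adj, k, rng):
    """Induced subgraph on k nodes drawn uniformly at random."""
    return induced_subgraph(adj, rng.sample(sorted(adj), k))

# Star network: node 0 is connected to nodes 1..9.
star = {0: set(range(1, 10))}
star.update({leaf: {0} for leaf in range(1, 10)})

leaves_only = induced_subgraph(star, [1, 2, 3])   # no edges survive
random_sample = sample_induced(star, 4, random.Random(0))
```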
Sampling Networks
• Snowball sampling:
  – Pick a random sample of nodes and then follow their ‘tree’ of neighbors for a set number of ‘hops’
Still not perfect but better
Other ideas abound but little agreement
Great area for research!
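Snowball sampling can be sketched as a breadth-first expansion from the seed nodes. The function name and the example path graph are hypothetical. Here the seeds are passed in explicitly; in practice they would be drawn at random.

```python
def snowball_sample(adj, seeds, hops):
    """Start from the seed nodes and add all neighbors, hop by hop."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Neighbors of the current frontier that are not yet in the sample.
        frontier = {w for v in frontier for w in adj[v]} - sampled
        sampled |= frontier
    # Return the subgraph induced on everything that was reached.
    return {v: adj[v] & sampled for v in sampled}

# Path graph 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
sample = snowball_sample(path, seeds=[0], hops=2)
```

Because every sampled node other than a seed was reached through an edge, the sample keeps local structure that a purely random node sample tends to lose.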
Network Problems of Interest
• Link prediction:
  – Can we use existing network data to infer links where they don’t exist?
    • Links in the future?
    • Missing data
  – Simple methods
    • Look for many common neighbors
  – Complex methods
    • Stochastic blockmodels
    • Similar to using SVD to ‘fill in’ a matrix
    • Agarwal and Pregibon ’04
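The simple common-neighbors method can be sketched as follows: score every pair of nodes that is not yet linked by the number of neighbors the two share, and treat the highest-scoring pairs as the most likely missing or future links. The function name and the example graph are hypothetical.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Rank unlinked node pairs by how many neighbors they share."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

g = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
ranked = common_neighbor_scores(g)
```

In this example "a" and "b" are not linked but share two neighbors, so the pair ranks at the top of the candidate list.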
Network Problems of Interest
• Graph matching / similarity
  – Fraud (‘repetitive debtors’)
  – Citation de-noising
  – Need a metric to define the difference between graphs
• Collective inference
  – What can you learn about someone from their network?
    • Fraud (‘guilt by association’)
    • Viral marketing
• The following example is courtesy of Sofus Macskassy
A Relational Neighbor Classifier (wvRN)
[Figures, Sofus A. Macskassy slides 22-24: network diagrams with nodes marked “?”]
Collective wvRN
Classify all entities in the network simultaneously, because (if done well) inferences about neighbors can reduce statistical bias (cf. Jensen et al. KDD-04)
[Figures, Sofus A. Macskassy slides 25-30: network diagrams with nodes marked “?”]
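The slides do not spell out the classifier, so the following is only a sketch under stated assumptions. It assumes wvRN estimates the probability that a node belongs to a class as the edge-weighted average of its neighbors' class probabilities, and that collective inference repeats this update for all unlabeled nodes at once until the estimates settle. The function name, the plain synchronous update (with no damping), and the example network are all hypothetical.

```python
def collective_wvrn(adj, known, iterations=50):
    """adj: node -> {neighbor: weight}; known: node -> 0 or 1.
    Returns an estimate of P(label = 1) for every node."""
    # Unlabeled nodes start at the class prior among the labeled nodes.
    prior = sum(known.values()) / len(known)
    prob = {v: float(known.get(v, prior)) for v in adj}
    for _ in range(iterations):
        updated = dict(prob)
        for v in adj:
            if v in known:
                continue  # labeled nodes keep their observed label
            total = sum(adj[v].values())
            if total > 0:
                # Weighted vote of the neighbors' current estimates.
                updated[v] = sum(w * prob[u] for u, w in adj[v].items()) / total
        prob = updated
    return prob

# Hypothetical fraud example: 1 = known fraud, 0 = known legitimate.
network = {
    "f1": {"x": 1.0},
    "f2": {"x": 1.0},
    "x": {"f1": 1.0, "f2": 1.0, "y": 1.0},
    "y": {"x": 1.0, "ok": 1.0},
    "ok": {"y": 1.0},
}
labels = {"f1": 1, "f2": 1, "ok": 0}
estimates = collective_wvrn(network, labels)
```

In this example the estimates settle near 0.8 for "x" and 0.4 for "y". The two unlabeled nodes influence each other, which is the sense in which all entities are classified simultaneously.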
Network Problems of Interest
• Diffusion
  – Information or virus diffusion
• Community detection
  – Subgroups have a higher density of edges within the subgroup
  – Can remove edges with high centrality to try to find communities
• Understanding of social networks
  – Facebook
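The idea of removing high-centrality edges to find communities can be sketched as follows: repeatedly delete the edge that carries the most shortest paths until the graph falls apart into the desired number of pieces. This is a minimal illustration with hypothetical function names. It assumes an unweighted, undirected graph without self-loops, and it recomputes edge centrality from scratch after every removal, which is slow on large graphs.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path count per edge (Brandes-style, unweighted)."""
    score = {}
    for s in adj:
        dist, sigma = {s: 0}, {s: 1.0}
        preds, order = {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return score

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def split_communities(adj, target):
    """Remove the most central edge until there are `target` components."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    while len(components(adj)) < target:
        scores = edge_betweenness(adj)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by a single bridge edge c - d.
g = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = split_communities(g, 2)
```

Every shortest path between the two triangles crosses the bridge, so it is removed first and the two dense subgroups are recovered.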
References
• Leskovec / Faloutsos tutorial (mostly part 1)
• Eric Kolaczyk: notes and book
• Watts and Strogatz: “Collective dynamics of ‘small-world’ networks”, Nature 393, pp. 440-442
• M. E. J. Newman: Networks (book)
• Albert-László Barabási: Linked: How Everything Is Connected to Everything Else and What It Means
• Enron data
• Tools
  – Graphviz.org for visualization
  – igraph (R package)