CMU SCS
I2.2 Large Scale Information Network Processing INARC
Overview
Goal: scalable algorithms to find patterns and anomalies on graphs
1. Mining Large Graphs: Algorithms, Inference, and Discoveries
2. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
3. Patterns on the Connected Components of Terabyte-Scale Graphs
PI: Christos Faloutsos (CMU)
Students: Leman Akoglu, Polo Chau, U Kang
Mining Large Graphs: Algorithms, Inference, and Discoveries
U Kang
Duen Horng Chau
Christos Faloutsos
School of Computer Science, Carnegie Mellon University
Outline
Problem Definition
Proposed Method
Experiment
Conclusion
Motivation
Inference on graphs: "guilt by association"
Adult sites tend to be connected to adult sites, while educational (.edu) sites tend to be connected to educational ones.
Given labels (adult or edu) on a subset of the nodes, infer the labels of the remaining unlabeled nodes in the graph.
Tool: Belief Propagation (BP)
[Figure: red nodes connected to red nodes; blue nodes connected to blue nodes]
Belief Propagation
[Equations: belief computation (node belief from prior prob and messages from neighbors); message computation (message from node i to node j from prior prob, propagation matrix, and ~messages from neighbors)]
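The labels on this slide match the standard sum-product form of belief propagation. Written out under that assumption, the two computations look as follows. The symbols are the conventional ones and are not recovered from the slide: \phi_i is the prior of node i, \psi_{ij} is the propagation matrix, N(i) is the set of neighbors of i, and k is a normalizing constant.

```latex
% Belief computation: prior times all incoming messages
b_i(x_i) = k \, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)

% Message computation: message from node i to node j
m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i) \, \psi_{ij}(x_i, x_j)
              \prod_{n \in N(i) \setminus j} m_{ni}(x_i)
```

The product in the message computation runs over the neighbors of i other than j, which is likely what the "~messages from neighbors" label refers to.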
A Challenge in BP
Scalability!
Existing works assume that all the nodes (and/or edges) of the input graph fit in memory.
Problem: what if the graph is too large to fit in memory?
Challenge: scaling up the inference algorithm to very large graphs whose nodes do not fit in memory
Problem Definition
How can we scale up the BP algorithm to very large graphs?
Goal
Scalability: to billions of nodes and edges
Efficiency: fast algorithm
Main Idea
Our approach: use Hadoop to scale up BP
Challenge: how can we formulate BP using a simple, efficient operation supported by Hadoop?
Main Idea
Key observation: BP message update equation = local message exchange
[Figure: example graph with nodes 0, 1, 2, 3, 4 and messages m01, m10, m12, m21, m13, m31, m24, m42]
A message is updated from its neighboring messages.
For example, m12 is updated from m01 and m31.
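The local-exchange observation can be checked on the example graph with a minimal sketch. It assumes the edge list 0-1, 1-2, 1-3, 2-4, read off the message names in the figure; the function names are illustrative only.

```python
# Undirected example graph; the edge list is inferred from the message names on the slide.
EDGES = [(0, 1), (1, 2), (1, 3), (2, 4)]

def neighbors(edges):
    """Build an adjacency map {node: set of neighbors} from undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def message_dependencies(edges):
    """For every directed message (i, j), list the messages (k, i) it is updated from.

    A message i -> j reads the messages arriving at i from every neighbor k except j.
    """
    adj = neighbors(edges)
    deps = {}
    for i in adj:
        for j in adj[i]:
            deps[(i, j)] = sorted((k, i) for k in adj[i] if k != j)
    return deps

deps = message_dependencies(EDGES)
print(deps[(1, 2)])  # the messages that m12 is updated from
```

Each message depends only on messages that arrive at its source node, so every update needs purely local information.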
Main Idea
BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G.
Nodes in L(G) are edges in G.
Two nodes in L(G) are connected if they are adjacent in G.
Proposed: HA-LFP algorithm
BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G:
$m' = L(G) \times_G m$
where m' is the new message vector, m is the old message vector, L(G) is the line graph of G, and $\times_G$ is the generalized matrix-vector multiplication.
Multiply repeatedly until convergence.
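A single-machine sketch of the "multiply repeatedly until convergence" loop follows. It is not the Hadoop implementation. It assumes standard sum-product updates, two states, and hypothetical values for the propagation matrix `PSI` and the prior `PRIOR`; the graph is the five-node example used earlier in the deck.

```python
# Hypothetical toy setup: 2 states (0 = "red", 1 = "blue"), one labeled node.
EDGES = [(0, 1), (1, 2), (1, 3), (2, 4)]
PSI = [[0.9, 0.1], [0.1, 0.9]]   # assumed propagation matrix (neighbors tend to agree)
PRIOR = {0: [0.9, 0.1]}          # assumed prior: node 0 is probably "red"
UNIFORM = [0.5, 0.5]             # prior for every unlabeled node

def line_graph(edges):
    """Directed line graph: node (i, j) has in-neighbors (k, i) for every k != j."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {(i, j): [(k, i) for k in adj[i] if k != j] for i in adj for j in adj[i]}

def multiply(lg, m):
    """One generalized matrix-vector multiplication: new messages from old ones."""
    new_m = {}
    for (i, j), in_nbrs in lg.items():
        # Combine: element-wise product of the prior of i and its incoming messages.
        prod = list(PRIOR.get(i, UNIFORM))
        for e in in_nbrs:
            prod = [p * q for p, q in zip(prod, m[e])]
        # Push through the propagation matrix, then normalize to sum to 1.
        out = [sum(prod[s] * PSI[s][t] for s in range(2)) for t in range(2)]
        total = sum(out)
        new_m[(i, j)] = [x / total for x in out]
    return new_m

def run(edges, max_iter=50, tol=1e-9):
    """Multiply repeatedly until the largest change in any message is below tol."""
    lg = line_graph(edges)
    m = {e: list(UNIFORM) for e in lg}
    for _ in range(max_iter):
        new_m = multiply(lg, m)
        diff = max(abs(a - b) for e in lg for a, b in zip(new_m[e], m[e]))
        m = new_m
        if diff < tol:
            break
    return m

messages = run(EDGES)
print(messages[(1, 2)])  # the message from node 1 to node 2 leans toward "red"
```

The ordinary multiply-and-add of a matrix-vector product is replaced here by "multiply incoming messages, apply the propagation matrix, normalize", which is the sense in which the multiplication is generalized.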
Complexity
One iteration of HA-LFP on L(G) = one matrix-vector multiplication on G
Time: O((V + E) / M)
Space: O(V + E)
V: # of nodes
E: # of edges
M: # of machines
Questions
Q1: How fast is HA-LFP?
Q2: How does HA-LFP scale up?
Q3: How can we find 'good' and 'bad' sites in a web graph?
Running Time
Q1: How fast is HA-LFP?
[Figure: running time, 10 iterations]
Scale Up
Q2: How does HA-LFP scale up?
Linear in the number of machines and edges
Advantages of HA-LFP
Scalability: the only solution when the node information cannot fit in memory; near-linear scale-up
Running time: faster than the single-machine version for large graphs
Fault tolerance
Analysis of Web Graph
Q3: How can we find 'good' and 'bad' sites in a web graph?
Pages whose goodness scores are < 0.9 are likely to be adult pages.
Conclusion
HA-LFP: Belief Propagation for billion-scale graphs on Hadoop
Near-linear scalability in the number of machines and edges
Many applications: finding 'good' and 'bad' web sites, fraud detection, …
Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
U Kang
Brendan Meeder
Christos Faloutsos
School of Computer Science, Carnegie Mellon University
Outline
Problem Definition
Proposed Method
Experiment
Conclusion
Problem Definition
Eigensolver: computes the top-k eigenvalues and eigenvectors
Applications: SVD, triangle counting, spectral clustering, …
Existing eigensolvers can handle up to millions of nodes.
How can we scale up eigensolvers to billion-scale graphs?
Expensive operations: on Hadoop, for scalability
Inexpensive operations: on a single machine, for accuracy
Block encoding: apply block encoding, and then do matrix-vector multiplication
Exploiting skewness in matrix-matrix multiplication: for the case when one matrix is very large and the other is very small
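The slide does not spell out the block encoding. One plausible reading, sketched below under that assumption, is that matrix entries are grouped into fixed-size square blocks and the vector into matching sub-vectors, so each unit of work handles a whole block instead of a single element. The function names, the block size, and the toy matrix are hypothetical.

```python
def to_blocks(entries, b):
    """Group sparse matrix entries {(row, col): value} into b x b blocks.

    Returns {(block_row, block_col): {(r, c): value}} with r, c local to the block.
    """
    blocks = {}
    for (row, col), value in entries.items():
        key = (row // b, col // b)
        blocks.setdefault(key, {})[(row % b, col % b)] = value
    return blocks

def block_multiply(blocks, vector, b):
    """Multiply a block-encoded square matrix by a dense vector, one block at a time."""
    result = [0.0] * len(vector)
    for (br, bc), block in blocks.items():
        sub = vector[bc * b:(bc + 1) * b]  # the vector block this matrix block needs
        for (r, c), value in block.items():
            result[br * b + r] += value * sub[c]
    return result

# Hypothetical 4 x 4 sparse matrix and vector, encoded with 2 x 2 blocks.
A = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 3): 3.0, (3, 3): 4.0}
x = [1.0, 2.0, 3.0, 4.0]
y = block_multiply(to_blocks(A, 2), x, 2)
print(y)
```

In a distributed setting, the benefit would come from moving and joining one record per block rather than one record per nonzero element.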
Application of HEigen
Triangle counting
Real social networks have a lot of triangles: friends of friends are friends.
But: triangles are expensive to compute (3-way join; several approximation algorithms)
Q: Can we do that quickly?
A: Yes!
#triangles = $\frac{1}{6} \sum_i \lambda_i^3$
(and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]
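The formula rests on the identity that the sum of the cubed eigenvalues of the adjacency matrix A equals trace(A^3), the number of closed walks of length 3. Each triangle is counted 6 times: 3 starting nodes times 2 directions. The sketch below checks this on a small hypothetical graph. It computes trace(A^3) directly rather than calling an eigensolver, so it illustrates the identity, not the top-k approximation.

```python
from itertools import combinations

def adjacency(n, edges):
    """Dense 0/1 adjacency matrix of an undirected simple graph."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1
    return a

def matmul(x, y):
    """Plain square matrix product."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangles_by_trace(a):
    """trace(A^3) / 6, which equals (1/6) * sum of cubed eigenvalues."""
    a3 = matmul(matmul(a, a), a)
    return sum(a3[i][i] for i in range(len(a))) // 6

def triangles_brute_force(a):
    """Count node triples that are pairwise connected."""
    n = len(a)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if a[i][j] and a[j][k] and a[i][k])

# Hypothetical graph: a 4-clique on nodes 0..3 plus a tail edge 3-4.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
A = adjacency(5, EDGES)
print(triangles_by_trace(A), triangles_brute_force(A))
```

On a billion-scale graph neither A^3 nor the full spectrum is affordable, which is why keeping only the top few eigenvalues matters.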
Questions
Q1: How does HEigen scale up?
Q2: Which Matrix-Matrix multiplication algorithm runs the fastest?
Q3: How can we find anomalous sites in a web graph?
Running Time
Q1: How does HEigen scale up?
HEigen-BLOCK is faster than the PLAIN version.
Linear in the number of machines and edges
Scale Up
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Cache-based MM runs the fastest!
Results
Q3: How can we find anomalous sites in a web graph?
Triangle counting on the Twitter social network [Twitter 2009; ~3 billion edges]
• U.S. politicians: moderate number of triangles vs. degree
• Adult sites: very large number of triangles vs. degree
Conclusion
HEigen: eigensolver for billion-scale graphs on Hadoop
Near-linear scalability in the number of machines and edges
Cache-based matrix-matrix multiplication: fastest!
Anomalies in triangle counts
Many applications: triangle counting, SVD, …
Patterns on the Connected Components of Terabyte-Scale Graphs
U Kang*
Mary McGlohon*†
Leman Akoglu*
Christos Faloutsos*
(*) SCS, Carnegie Mellon University
(†) Google
Outline
Problem Definition
Static Patterns
Evolution Patterns
Model
Conclusion
Problem Definition
A large graph is composed of many connected components.
Q1: static patterns?
Q2: evolution patterns?
Q3: model?
[Figure: Count vs. Size. YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes]
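The Count vs. Size view can be reproduced on a toy graph with a minimal single-machine sketch. It uses union-find, which is an assumption made here for illustration; a terabyte-scale graph would need a distributed algorithm instead. The toy edge list is hypothetical.

```python
from collections import Counter

def component_sizes(num_nodes, edges):
    """Sizes of connected components via union-find with path halving."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    # Count how many nodes share each root.
    return list(Counter(find(x) for x in range(num_nodes)).values())

def size_count(num_nodes, edges):
    """Map component size -> number of components of that size."""
    return dict(Counter(component_sizes(num_nodes, edges)))

# Hypothetical toy graph: one component of 4 nodes, two of 2 nodes, one isolated node.
EDGES = [(0, 1), (1, 2), (2, 3), (4, 5), (6, 7)]
dist = size_count(9, EDGES)
print(dist)
```

Plotting the keys of `dist` against its values gives the size-versus-count distribution.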
Q1: Static Patterns
What are the regularities in the connected components of a static graph?
What do they look like?
Do the GCC and the other connected components look similar?
Chain? Clique?
Idea: use 'density' and 'radius' to find patterns
Density of Connected Component
What is a good metric for the density of a connected component?
A candidate: |E| / |V| ("average degree")
Problem: it increases over time
[Figure: number of edges vs. number of nodes]
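A small sketch shows how the candidate metric depends on component size. It compares the two extreme shapes mentioned earlier, a chain and a clique. This is an illustration added here, not an experiment from the deck.

```python
def density(num_nodes, num_edges):
    """Candidate density metric from the slide: |E| / |V|."""
    return num_edges / num_nodes

def chain_edges(n):
    """Number of edges in a chain (path) on n nodes."""
    return n - 1

def clique_edges(n):
    """Number of edges in a clique on n nodes: every pair is connected."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, density(n, chain_edges(n)), density(n, clique_edges(n)))
```

For a chain the ratio stays just below 1 at every size, while for a clique it equals (n - 1) / 2 and grows with n. The number therefore mixes the shape of a component with its size.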
Density of Connected Component
We want a metric that can measure the ‘intrinsic’ density of a component