CMU SCS I2.2 Large Scale Information Network Processing INARC 1 Overview Goal: scalable algorithms to find patterns and anomalies on graphs 1. Mining Large.

CMU SCS

I2.2 Large Scale Information Network ProcessingINARC 1

Overview

Goal: scalable algorithms to find patterns and anomalies on graphs

1. Mining Large Graphs: Algorithms, Inference, and Discoveries

2. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation

3. Patterns on the Connected Components of Terabyte-Scale Graphs

PI: Christos Faloutsos (CMU) Students: Leman Akoglu, Polo Chau, U Kang

CMU SCS


Mining Large Graphs:Algorithms, Inference, and

Discoveries

U Kang

Duen HorngChau

ChristosFaloutsos

School of Computer ScienceCarnegie Mellon University

CMU SCS


Outline

Problem Definition Proposed Method Experiment Conclusion

CMU SCS


Motivation

Inference on graph: “guilt by association” Adult sites tend to be connected to adult sites, while

edu. sites are connected to educational ones Given labels(adult or edu) on a subset of the nodes,

infer the labels of other unlabeled nodes on graph Tool: Belief Propagation(BP)

red nodes connected

to red nodes

blue nodes connected

to blue nodes

CMU SCS

I2.2 Large Scale Information Network ProcessingINARC

Prior prob

Messages from neighbors

Node belief

Propagation matrix

~Messages from neighbors

Messsage from node i to node j

Message computation

Belief computation

Prior prob

Belief Propagation

5

CMU SCS


A Challenge in BP

Scalability!

Existing works assume that all the nodes (and/or edges) of the input graph fit in memory Problem: what if the graph is too large to fit in

memory? Challenge: Scaling up the inference algorithm

for very large graphs whose nodes do not fit in memory

6

CMU SCS


Problem Definition

How can we scale up the BP algorithm to very large graphs?

Goal Scalability: to billions of nodes and edges Efficiency: fast algorithm

7

CMU SCS


Outline


CMU SCS


Main Idea

Our approach Use Hadoop to scale-up BP

Challenge How can we formulate BP using a simple, efficient

operation supported by Hadoop?

9

CMU SCS


Main Idea

Key observation BP message update equation = local message

exchange

10

0 1 2

3

4

m13m31

m01

m10

m12

m21

m24

m42

A message is updated from its neighboring messages.

For example, m12 is updatedfrom m01 and m31

CMU SCS


BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G Nodes in L(G) are edges in G Two nodes in L(G) are connected if they are adjacent in G

Main Idea

11

CMU SCS


BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G

Proposed: HA-LFP algorithm

12

mGLm G )('New message vector

Old message vector

Line graph of G

Generalized m-v multiplication

Multiply repeatedly until convergence

CMU SCS


Complexity

OneIteration of

HA-LFP on L(G)

OneMatrix Vector

Multiplication on G=

Time : O((V+E) / M)Space: O(V + E)

V : # of nodesE : # of nodesM : # of machines

13

CMU SCS


Outline


CMU SCS

I2.2 Large Scale Information Network ProcessingINARC15

Questions

Q1: How fast is HA-LFP?

Q2: How does HA-LFP scale-up?

Q3: How can we find `good’ and `bad’ sites in a web graph?

CMU SCS


Running TimeQ1: How fast is HA-LFP?

[10 iteration]

16

CMU SCS


Scale UpQ2: How does HA-LFP scale-up?

Linear on the number of machines, edges

17

CMU SCS


Advantage of HA-LFP

Scalability The only solution when the node information

cannot fit in memory. Near-linear scale up

Running Time Faster than the single-machine, for large graphs

Fault Tolerance

18

CMU SCS


Analysis of Web Graph

Q3: How can we find `good’ and `bad’ sites in a web graph?

Pages whose goodness scores < 0.9are likely to be adult pages

19

CMU SCS


Outline


CMU SCS


Conclusion

HA-LFP Belief Propgation for billion-scale graphs on Hadoop Near-linear scalability on # of machines, edges

Many applications Finding `good’ and `bad’ web sites Fraud detection …

CMU SCS


Spectral Analysis of Billion-Scale Graphs:

Discoveries and Implementation

U Kang

BrendanMeeder

ChristosFaloutsos

School of Computer ScienceCarnegie Mellon University

CMU SCS


Outline


CMU SCS


Problem Definition

Eigensolver Computes top-k eigenvalues and eigenvectors Application:

SVD, triangle counting, spectral clustering, …

Existing eigensolver Can handle up to millions of nodes

How can we scale up eigensolvers to billion-scale graphs?

CMU SCS


Outline


CMU SCS


Main Idea

HEigen algorithm (Hadoop Eigen-solver) Selective parallelize ‘Lanczos’ algorithm

Expensive operation: on Hadoop for scalability Inexpensive operation: on a single-machine for accuracy

Block encoding Block encoding, and then do matrix-vector multiplication

Exploiting skewness in matrix-matrix mult. In matrix-matrix multiplication when a matrix is very large

and the other is very small

26

CMU SCS


Application of HEigen

Triangle Counting Real social networks have a lot of triangles

Friends of friends are friends

But: triangles are expensive to compute (3-way join; several approx. algos)

Q: Can we do that quickly? A: Yes!

#triangles = 1/6 Sum ( λi3 )

(and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis

ICDM 2008]

CMU SCS


Outline


CMU SCS

I2.2 Large Scale Information Network ProcessingINARC29

Questions

Q1: How does HEigen scale-up?

Q2: Which Matrix-Matrix multiplication algorithm runs the fastest?

Q3: How can we find anomalous sites in a web graph?

CMU SCS


Running TimeQ1: How does HEigen scale-up?

Heigen-BLOCK is faster than PLAIN ver.Linear on the number of machines, edges

CMU SCS


Scale Up

Cache-based MM runs the fastest!

Q2: Which Matrix-Matrix multiplication algorithm runs the fastest?

CMU SCS


Results

Triangle counting on Twitter social network

[Twitter 2009; ~ 3 billion edges]

• U.S. politicians: moderate number of triangles vs. degree• Adult sites: very large number of triangles vs. degree

Q3: How can we find anomalous sites in a web graph?

CMU SCS


Outline


CMU SCS


Conclusion

HEigen Eigensolver for billion-scale graphs on Hadoop Near-linear scalability on # of machines, edges Cache-based Matrix-Matrix multiplication: fastest! Anomalies in triangle counts

Many applications Triangle counting SVD …

CMU SCS


Patterns on the Connected Components of

Terabyte-Scale Graphs

U Kang*

MaryMcGlohon*†

LemanAkoglu*

ChristosFaloutsos*

(*) SCS, Carnegie Mellon University(†) Google

CMU SCS


Outline

Problem Definition Static Patterns Evolution Patterns Model Conclusion

CMU SCS


A large graph is composed of many connected components

37

Problem Definition

Q2: evolution patterns?

Q3: model?

Size

Q1: static patterns?

Count

YahooWeb graph|V| = 1.4 billion|E| = 6.7 billion

120 GBytes

CMU SCS


Outline


CMU SCS


Q1: Static Patterns

What are the regularities in the connected components of a static graph? How do they look like? Do the GCC and the other connected

components look similar?

Chain?Clique?

Idea: use ‘density’ and ‘radius’ to find patterns

CMU SCS


Density of Connected Component

What is a good metric for the density of a connected component? A candidate: |E| / |V| (“average degree”) Problem: it increases over time

40

Number of Nodes

Number of Edges

CMU SCS



We want a metric that can measure the ‘intrinsic’ density of a component

Proposed: Graph Fractal Dimension(GFD) log |E| / log |V|

41

[Leskovec+ KDD05]Number of NodesNumber of Nodes

Number of Edges

Number of Edges

CMU SCS



Graph Fractal Dimension(GFD) log |E| / log |V|

42

Chain: GFD ~1

Star: GFD ~1

Bipartite Core: 1 < GFD < 2

Clique: GFD ~2

CMU SCS



43

What are the GFDs of connected components in a large, real graph?

CMU SCS



GFDs of CCs in YahooWeb graph

GFDs of CCs are slightly denser than the tree44

Slope=1.08

GFDs of CCs are constant on average

Number of Nodes Number of Nodes

Number of Edges

Number of Edges

CMU SCS


Radius of Connected Component

45

Q1.1: What does the GCC look like?Q1.2: What do the rest CC’s look like? ( What are the GFDs?)

CMU SCS


Radius of Connected Component

What are the patterns of radii in connected components?

A1.2: Chain-like disconnected components46

Slope=1.38

Core

Chain

Average Radius

A1.1: GCC looks like a ‘kite’

Max.Radius

Avg.

Max.

CMU SCS


Outline


CMU SCS


Q2: Evolution Patterns

How do the connected components evolve? Do largest connected components grow with

the same rate? How often does a newcomer join the

disconnected components?newcomer

??

CMU SCS


Gelling Point

Gelling Point [McGlohon+ KDD08] Diameter starts to shrink

49

CMU SCS


Growth of Connected Component

GFDs of Top 3 CC’s over time

50

Before “gelling point”: GFDs of Top 3 CC’s stay constant, “tree” like.

After “deviation point”: GFD of GCC takes off, becomes denser.

CMU SCS


‘Rebel’ Probability

What are the chances that a newcomer doesn’t belong to GCC? (“rebel” prob.)

51

newcomer

?

GCC DCs

CMU SCS



What are the chances that a newcomer doesn’t belong to GCC? (“rebel” prob.)

52

rebelP

rebelP

newcomer

d: degree of a newcomer

s: size (|V|) of DC

But, how exactly?

CMU SCS


‘Rebel’ Prob. power of |V| in dc


53

‘Rebel’ Prob. exponential to the degree

drebel sP d: degree of a newcomer

s: size (|V|) of DC

CMU SCS


Outline


CMU SCS


Q3: Model

How can we explain the static and the evolution patterns by a generative model?

Modeling Goals (G1) Constant GFDs (G2) ERP (Exponential Rebel Probability) (G3) Disconnected Components

CMU SCS


CommunityConnection Model

CommunityConnection model Defines a behavior of a new node joining the

network 1. Chooses a host to link to. 2. Visits the neighbors

Repeat the two processes!

56

CMU SCS



How does the CommunityConnection model match reality?

57

CMU SCS



Results

(G1) Constant GFDs58

Number of Nodes

Number of Edges

Number of Nodes

Number of Edges

<Real Graph> <Our Model>

CMU SCS



Results

(G2) ERP (Exponential Rebel Probability)

(G3) Disconnected Components59

Degree log(|V| in DC)

log(RebelProb.)

log(RebelProb.)

<Our Model>

<Real Graph>

CMU SCS


Outline


CMU SCS


Conclusion

Patterns in the Connected Components Goal 1 : Static Patterns

Chain-like disconnected components ‘Kite’-like GCC

Goal 2 : Evolution Patterns Constant, low GFD(“density”) until the gelling point ERP (Exponential Rebel Probability)

Goal 3 : Model CommunityConnection Model (matches reality)

CMU SCS


Hadoop/PEGASUS

Degree Distr.

Pagerank

Diameter

Conn. Comp

Eigensolver

Belief Propagation

Clustering, …

Future Plan

62

CMU SCS


Thank you!http://www.cs.cmu.edu/~pegasus

CMU SCS I2.2 Large Scale Information Network Processing INARC 1 Overview Goal: scalable algorithms to find patterns and anomalies on graphs 1. Mining Large.

Documents

cmu scs i2

mining large graphs

scale graphs pi

red nodes blue nodes

original graph g nodes

billions of nodes

unlabeled nodes

u kang slide