Graph-based Proximity Measures
Practical Graph Mining with R
Feb 20, 2016

Transcript
Page 1: Graph-based  Proximity Measures

Graph-based Proximity Measures

Practical Graph Mining with R

Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty

Department of Computer Science, North Carolina State University

Page 2: Graph-based  Proximity Measures

Outline

• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor

Page 3: Graph-based  Proximity Measures

Similarity and Dissimilarity

• Similarity
– Numerical measure of how alike two data objects are
– Higher when objects are more alike
– Often falls in the range [0, 1]
– Examples: Cosine, Jaccard, Tanimoto

• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

• Proximity refers to either a similarity or a dissimilarity

Src: "Introduction to Data Mining" by Vipin Kumar et al.

Page 4: Graph-based  Proximity Measures

Distance Metric

• Distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies:

1. Positive definiteness: d(p, q) >= 0 for all p and q, and d(p, q) = 0 only if p = q.
2. Symmetry: d(p, q) = d(q, p) for all p and q.
3. Triangle inequality: d(p, r) <= d(p, q) + d(q, r) for all points p, q, and r.

• Examples:
– Euclidean distance
– Minkowski distance
– Mahalanobis distance

Src: "Introduction to Data Mining" by Vipin Kumar et al.

Page 5: Graph-based  Proximity Measures

Is this a distance metric?

For p = (p_1, p_2, ..., p_d) and q = (q_1, q_2, ..., q_d):

d(p, q) = \max_{1 \le j \le d} (p_j - q_j)   (Not: Symmetric)

d(p, q) = \sum_{j=1}^{d} (p_j - q_j)^2   (Not: Triangle Inequality)

d(p, q) = \min_{1 \le j \le d} |p_j - q_j|   (Not: Positive definite)

Page 6: Graph-based  Proximity Measures

Distance: Euclidean, Minkowski, Mahalanobis

For p = (p_1, p_2, ..., p_d) and q = (q_1, q_2, ..., q_d):

Euclidean:

d(p, q) = \sqrt{ \sum_{j=1}^{d} (p_j - q_j)^2 }

Minkowski (the L_r-norm):

d_r(p, q) = \left( \sum_{j=1}^{d} |p_j - q_j|^r \right)^{1/r}

– r = 1: city block (Manhattan) distance, the L_1-norm
– r = 2: Euclidean distance, the L_2-norm

Mahalanobis (with covariance matrix \Sigma):

d(p, q) = \sqrt{ (p - q)^T \Sigma^{-1} (p - q) }
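A quick numeric check of the Minkowski formula (a standalone Python sketch for illustration; the chapter's own code is in R), showing that r = 1 and r = 2 recover the Manhattan and Euclidean distances:

```python
def minkowski(p, q, r):
    """Minkowski distance d_r(p, q) = (sum_j |p_j - q_j|^r)^(1/r)."""
    return sum(abs(pj - qj) ** r for pj, qj in zip(p, q)) ** (1.0 / r)

p, q = (0, 2), (2, 0)
print(minkowski(p, q, 1))  # r = 1, Manhattan distance: 4.0
print(minkowski(p, q, 2))  # r = 2, Euclidean distance: sqrt(8), about 2.828
```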

Page 7: Graph-based  Proximity Measures

Euclidean Distance

d(p, q) = \sqrt{ \sum_{j=1}^{d} (p_j - q_j)^2 }

Standardization is necessary if scales differ. Ex: p = (age, salary).

For p = (p_1, p_2, ..., p_d):

Mean of attributes: \bar{p} = \frac{1}{d} \sum_{k=1}^{d} p_k

Standard deviation of attributes: s_p = \sqrt{ \frac{1}{d-1} \sum_{k=1}^{d} (p_k - \bar{p})^2 }

Standardized/Normalized Vector:

p_{new} = \left( \frac{p_1 - \bar{p}}{s_p}, \frac{p_2 - \bar{p}}{s_p}, ..., \frac{p_d - \bar{p}}{s_p} \right)

so that \bar{p}_{new} = 0 and s_{p_{new}} = 1.
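The standardization step can be sketched in Python (illustrative only; the example values are made up). Note that `statistics.stdev` uses the same 1/(d-1) sample definition as s_p above:

```python
import statistics

def standardize(p):
    """Shift by the attribute mean and scale by the sample standard deviation s_p."""
    p_bar = statistics.mean(p)
    s_p = statistics.stdev(p)  # uses the 1/(d-1) definition, matching s_p
    return [(pk - p_bar) / s_p for pk in p]

p_new = standardize([20, 30, 40, 50, 60])
print(statistics.mean(p_new))   # 0 (up to floating-point rounding)
print(statistics.stdev(p_new))  # 1 (up to floating-point rounding)
```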

Page 8: Graph-based  Proximity Measures

Distance Matrix

Input Data Table P (file name: points.dat):

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

(Figure: scatter plot of p1–p4 in the x–y plane.)

Using the Euclidean distance

d(p, q) = \sqrt{ \sum_{j=1}^{d} (p_j - q_j)^2 }

the output Distance Matrix D is:

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

R code:

• P = as.matrix(read.table(file="points.dat"));
• D = dist(P[, 2:3], method="euclidean");
• L1 = dist(P[, 2:3], method="minkowski", p=1);
• help(dist)

Src: “Introduction to Data Mining” by Vipin Kumar et al
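The same distance matrix can be reproduced in a few lines (Python here purely for illustration; the R route via dist() above is the book's approach):

```python
import math

# The four points from the input table points.dat
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    return math.sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))

names = ["p1", "p2", "p3", "p4"]
D = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]
for name, row in zip(names, D):
    print(name, row)  # first row: p1 [0.0, 2.828, 3.162, 5.099]
```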

Page 9: Graph-based  Proximity Measures

Covariance of Two Vectors, cov(p, q)

For p = (p_1, ..., p_d) and q = (q_1, ..., q_d), with mean of attributes \bar{p} = \frac{1}{d} \sum_{k=1}^{d} p_k (and \bar{q} defined likewise):

One definition:

cov(p, q) = s_{pq} = \frac{1}{d-1} \sum_{k=1}^{d} (p_k - \bar{p})(q_k - \bar{q})

Or a better definition:

cov(p, q) = E[(p - E(p))(q - E(q))^T]

where E is the expected value of a random variable.
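The first (sample) definition translates directly into code; a short illustrative Python sketch with made-up vectors:

```python
def cov(p, q):
    """Sample covariance s_pq = (1/(d-1)) * sum_k (p_k - p_bar)(q_k - q_bar)."""
    d = len(p)
    p_bar, q_bar = sum(p) / d, sum(q) / d
    return sum((pk - p_bar) * (qk - q_bar) for pk, qk in zip(p, q)) / (d - 1)

print(cov([1, 2, 3, 4], [2, 4, 6, 8]))  # 10/3: q varies together with p
```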

Page 10: Graph-based  Proximity Measures

Covariance, or Dispersion, Matrix

N points in d-dimensional space:

P_1 = (p_{11}, p_{12}, ..., p_{1d})
.....
P_N = (p_{N1}, p_{N2}, ..., p_{Nd})

The covariance, or dispersion, matrix:

\Sigma(P_1, P_2, ..., P_N) =
\begin{pmatrix}
cov(P_1, P_1) & cov(P_1, P_2) & \cdots & cov(P_1, P_N) \\
cov(P_2, P_1) & cov(P_2, P_2) & \cdots & cov(P_2, P_N) \\
\vdots & \vdots & \ddots & \vdots \\
cov(P_N, P_1) & cov(P_N, P_2) & \cdots & cov(P_N, P_N)
\end{pmatrix}

The inverse, \Sigma^{-1}, is the concentration matrix or precision matrix.

Page 11: Graph-based  Proximity Measures

Common Properties of a Similarity

• Similarities also have some well-known properties:
– s(p, q) = 1 (or maximum similarity) only if p = q
– s(p, q) = s(q, p) for all p and q (symmetry)

where s(p, q) is the similarity between points (data objects) p and q.

Src: “Introduction to Data Mining” by Vipin Kumar et al

Page 12: Graph-based  Proximity Measures

Similarity Between Binary Vectors

• Suppose p and q have only binary attributes
• Compute similarities using the following quantities:
– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients:

SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attributes = M11 / (M01 + M10 + M11)

Src: “Introduction to Data Mining” by Vipin Kumar et al

Page 13: Graph-based  Proximity Measures

SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
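The worked example can be verified with a short sketch (illustrative Python):

```python
def smc_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    m00 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 0))
    m01 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 1))
    m10 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 0))
    m11 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 1))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jac = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jac

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(p, q))  # (0.7, 0.0)
```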

Page 14: Graph-based  Proximity Measures

Cosine Similarity

• If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150

Src: “Introduction to Data Mining” by Vipin Kumar et al
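A minimal sketch of the same computation (illustrative Python):

```python
import math

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```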

Page 15: Graph-based  Proximity Measures

Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes:

EJ(p, q) = (p · q) / (||p||^2 + ||q||^2 - p · q)

– Reduces to Jaccard for binary attributes

Src: “Introduction to Data Mining” by Vipin Kumar et al

Page 16: Graph-based  Proximity Measures

Correlation (Pearson Correlation)

• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q, and then take their dot product:

p'_k = (p_k - mean(p)) / std(p)

q'_k = (q_k - mean(q)) / std(q)

correlation(p, q) = p' · q'

Src: "Introduction to Data Mining" by Vipin Kumar et al.
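As a sketch (illustrative Python; note that the dot product of the standardized vectors is divided by d - 1 here, folding in the sample-size normalization, so the result lands in [-1, 1]):

```python
import math

def pearson(p, q):
    """Standardize p and q, then take their dot product (divided by d - 1)."""
    d = len(p)
    def standardized(v):
        mean = sum(v) / d
        std = math.sqrt(sum((x - mean) ** 2 for x in v) / (d - 1))
        return [(x - mean) / std for x in v]
    ps, qs = standardized(p), standardized(q)
    return sum(a * b for a, b in zip(ps, qs)) / (d - 1)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0: perfect positive linear relationship
print(pearson([1, 2, 3], [6, 4, 2]))  # -1.0: perfect negative linear relationship
```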

Page 17: Graph-based  Proximity Measures


Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.

Src: “Introduction to Data Mining” by Vipin Kumar et al

Page 18: Graph-based  Proximity Measures


General Approach for Combining Similarities

• Sometimes attributes are of many different types, but an overall similarity is needed.

Src: “Introduction to Data Mining” by Vipin Kumar et al

Page 19: Graph-based  Proximity Measures

Using Weights to Combine Similarities

• May not want to treat all attributes the same
– Use weights w_k that are between 0 and 1 and sum to 1

Src: “Introduction to Data Mining” by Vipin Kumar et al
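A sketch of the weighted combination (illustrative Python; the weights and per-attribute similarities below are made-up numbers), where the overall similarity is the weighted sum of the per-attribute similarities:

```python
def combined_similarity(sims, weights):
    """Weighted combination sum_k w_k * s_k, with weights in [0, 1] summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, sims))

# Hypothetical per-attribute similarities and weights (first attribute counts most)
print(combined_similarity([0.9, 0.5, 0.2], [0.5, 0.3, 0.2]))
# 0.9*0.5 + 0.5*0.3 + 0.2*0.2 = 0.64
```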

Page 20: Graph-based  Proximity Measures

Graph-Based Proximity Measures

Within-graph proximity measures:

• Hyperlink-Induced Topic Search (HITS)
• The Neumann Kernel
• Shared Nearest Neighbor (SNN)

Page 21: Graph-based  Proximity Measures

Outline

• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor

Page 22: Graph-based  Proximity Measures

Neumann Kernels: Agenda

• Introduction
• Co-citation and Bibliographic Coupling
• Document and Term Correlation
• Diffusion/Decay Factors
• Relationship to HITS
• Strengths and Weaknesses

Page 23: Graph-based  Proximity Measures

Neumann Kernels (NK)

• Named for von Neumann
• Generalization of HITS
• Input: undirected or directed graph
• Output: within-graph proximity measure, balancing:
– Importance
– Relatedness

Page 24: Graph-based  Proximity Measures

NK: Citation Graph

• Input: Graph
– Vertices n1…n8 (articles)
– Graph is directed
– Edges indicate citations

• Citation matrix C can be formed:
– If an edge exists between two vertices, the corresponding matrix cell is 1; otherwise it is 0

(Figure: citation graph on vertices n1–n8.)

Page 25: Graph-based  Proximity Measures

NK: Co-citation Graph

• Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph
• In the graph above, n1 and n2 are connected because both are referenced by the same node n5 in the citation graph
• Co-citation matrix: CC = C^T C

(Figure: co-citation graph on vertices n1–n8.)

Page 26: Graph-based  Proximity Measures

NK: Bibliographic Coupling Graph

• Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references
• In the graph above, n5 and n6 are connected because both reference the same node n2 in the citation graph
• Bibliographic coupling matrix: BC = C C^T

(Figure: bibliographic coupling graph on vertices n1–n8.)
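The two matrix formulas can be checked on a tiny, hypothetical citation matrix (plain Python for illustration; the papers and citations here are made up, not the n1–n8 graph from the slides):

```python
# Hypothetical 4-paper citation matrix: C[i][j] = 1 when paper i cites paper j.
# Paper 2 cites papers 0 and 1; paper 3 cites paper 1 (made-up data).
C = [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [1, 1, 0, 0],
     [0, 1, 0, 0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

CC = matmul(transpose(C), C)  # co-citation: CC[a][b] = # of papers citing both a and b
BC = matmul(C, transpose(C))  # coupling: BC[a][b] = # of references a and b share
print(CC[0][1])  # 1 (papers 0 and 1 are co-cited, by paper 2)
print(BC[2][3])  # 1 (papers 2 and 3 share one reference, paper 1)
```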

Page 27: Graph-based  Proximity Measures

NK: Document and Term Correlation

Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g. the frequency of the given term in the document).

Example:
D1: "I like this book"
D2: "We wrote this book"

Term-Document Matrix X:

        D1  D2
I        1   0
like     1   0
this     1   1
book     1   1
We       0   1
wrote    0   1

Page 28: Graph-based  Proximity Measures

NK: Document and Term Correlation (2)

Document correlation matrix: a matrix in which the rows and the columns represent documents, and the entries represent the semantic similarity between two documents.

Example (same D1 and D2 as above):

Document Correlation Matrix K = X^T X:

     D1  D2
D1    4   2
D2    2   4

Page 29: Graph-based  Proximity Measures

NK: Document and Term Correlation (3)

Term correlation matrix: a matrix in which the rows and the columns represent terms, and the entries represent the semantic similarity between two terms.

Example (same D1 and D2 as above):

Term Correlation Matrix T = X X^T (a 6 × 6 matrix over the terms I, like, this, book, We, wrote)
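A sketch in Python computing both K = X^T X and T = X X^T for the two-document example (the six terms are ordered I, like, this, book, We, wrote):

```python
# Term-document matrix X for D1 = "I like this book", D2 = "We wrote this book".
X = [[1, 0],   # I
     [1, 0],   # like
     [1, 1],   # this
     [1, 1],   # book
     [0, 1],   # We
     [0, 1]]   # wrote

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Xt = [list(col) for col in zip(*X)]
K = matmul(Xt, X)  # 2x2 document correlation matrix X^T X
T = matmul(X, Xt)  # 6x6 term correlation matrix X X^T
print(K)         # [[4, 2], [2, 4]]: each document has 4 terms, sharing 2
print(T[2][3])   # 2: "this" and "book" co-occur in both documents
```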

Page 30: Graph-based  Proximity Measures

Neumann Kernel Block Diagram

• Input: Graph
• Output: two matrices of dimensions n × n, called K_γ and T_γ
• Diffusion/Decay Factor γ: a tunable parameter that controls the balance between relatedness and importance

Page 31: Graph-based  Proximity Measures

NK: Diffusion Factor - Equation & Effect

The Neumann Kernel defines two matrices incorporating a diffusion factor γ:

K_\gamma = K \sum_{n=0}^{\infty} \gamma^n K^n = K (I - \gamma K)^{-1}

T_\gamma = T \sum_{n=0}^{\infty} \gamma^n T^n = T (I - \gamma T)^{-1}

which simplifies with our definitions of K = X^T X and T = X X^T.

When γ = 0: K_γ = K and T_γ = T (pure relatedness).

When γ approaches its maximum value: the rankings converge to HITS-style importance.
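To see the series numerically, a small Python sketch (illustrative; it uses the 2x2 matrix K = X^T X = [[4, 2], [2, 4]] from the two-document example, whose largest eigenvalue is 6, so any γ < 1/6 guarantees convergence) sums the geometric series and compares it with the closed form:

```python
# Check that K * sum_n (gamma K)^n converges to the closed form K (I - gamma K)^{-1}.
K = [[4.0, 2.0], [2.0, 4.0]]
gamma = 0.1  # below 1/6, so the series converges

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

K_gamma = [[0.0, 0.0], [0.0, 0.0]]
term = [row[:] for row in K]                 # gamma^0 * K^1
for _ in range(60):
    for i in range(2):
        for j in range(2):
            K_gamma[i][j] += term[i][j]
    term = [[gamma * x for x in row] for row in matmul(term, K)]  # next series term

print([[round(x, 4) for x in row] for row in K_gamma])
# The closed form K (I - 0.1 K)^{-1} works out to [[8.75, 6.25], [6.25, 8.75]].
```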

Page 32: Graph-based  Proximity Measures

NK: Diffusion Factor - Terminology

• Indegree: the indegree δ-(v) of vertex v is the number of edges leading to vertex v. Example: δ-(B) = 1.
• Outdegree: the outdegree δ+(v) of vertex v is the number of edges leading away from vertex v. Example: δ+(A) = 3.
• Maximal indegree: the maximal indegree Δ- of the graph is the maximum over the indegrees of all vertices of the graph. Example: Δ-(G) = 2.
• Maximal outdegree: the maximal outdegree Δ+ of the graph is the maximum over the outdegrees of all vertices of the graph. Example: Δ+(G) = 3.

(Figure: example graph G on vertices A, B, C, D.)

Page 33: Graph-based  Proximity Measures

NK: Diffusion Factor - Algorithm

Page 34: Graph-based  Proximity Measures

NK: Choice of Diffusion Factor and its Effects on the Neumann Algorithm

• The Neumann Kernel outputs relatedness between documents and between terms when γ = 0
• Similarly, when γ is larger, the Kernel output matches HITS

Page 35: Graph-based  Proximity Measures

Comparing NK, HITS, and Co-citation/Bibliographic Coupling

• HITS authority ranking for the graph below: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
• Calculating the Neumann Kernel with γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
• For higher values of γ, the Neumann Kernel converges to HITS

(Figure: citation graph on vertices n1–n8.)

Page 36: Graph-based  Proximity Measures

Strengths and Weaknesses

Strengths:
• Generalization of HITS
• Merges relatedness and importance
• Useful in many graph applications

Weaknesses:
• Topic drift
• No penalty for loops in the adjacency matrix

Page 37: Graph-based  Proximity Measures

Outline

• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor

Page 38: Graph-based  Proximity Measures

Shared Nearest Neighbor (SNN)

• An indirect approach to similarity
• Uses a k-Nearest Neighbor graph to determine the similarity between nodes
• If two vertices share at least k neighbors, they can be considered similar to one another even if a direct link does not exist

Page 39: Graph-based  Proximity Measures

SNN - Agenda

Page 40: Graph-based  Proximity Measures

SNN – Understanding Proximity

• What makes a node a neighbor of another node is based on the definition of proximity
• Definition: the closeness between a set of objects
• Proximity can measure the extent to which two nodes belong to the same cluster
• Proximity is a subtle notion whose definition can depend on the specific application

Page 41: Graph-based  Proximity Measures

SNN - Proximity Graphs

• A graph obtained by connecting two points in a set of points by an edge if the two points are, in some sense, close to each other

Page 42: Graph-based  Proximity Measures

SNN – Proximity Graphs (continued)

Various types of proximity graphs: cyclic, linear, and radial. (Figures omitted.)

Page 43: Graph-based  Proximity Measures

SNN – Proximity Graphs (continued)

Other types of proximity graphs:
• Minimum spanning tree
• Relative neighbor graph
• Gabriel graph
• Nearest neighbor graph (Voronoi diagram)

Page 44: Graph-based  Proximity Measures

SNN – Proximity Graphs (continued)

Represents neighbor relationships between objects

Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason

Using a proximity graph increases the scale range over which good segmentations are possible

Can be formulated with respect to many metrics

Page 45: Graph-based  Proximity Measures

SNN – Kth Nearest Neighbor (k-NN) Graph

Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure

Has applications in cluster analysis and outlier detection

Page 46: Graph-based  Proximity Measures

SNN – Shared Nearest Neighbor Graph

• An SNN graph is a special type of KNN graph.

• If an edge exists between two vertices, then they both belong to each other’s k-neighborhood

In the figure to the left, each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar when the SNN parameter k = 4.

Page 47: Graph-based  Proximity Measures

SNN – The Algorithm

Input: G: an undirected graph
Input: k: a natural number (number of shared neighbors)

for i = 1 to N(G) do
  for j = i + 1 to N(G) do
    counter = 0
    for m = 1 to N(G) do
      if vertex i and vertex j both have an edge with vertex m then
        counter++
      end if
    end for
    if counter >= k then
      Connect an edge between vertex i and vertex j in the SNN graph
    end if
  end for
end for
return SNN graph
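The pseudocode translates line for line into Python (a sketch; the chapter's actual implementation is the SNN function from the R ProximityMeasure package). Running it on the 6-vertex adjacency matrix used in the R example that follows, with k = 2, reproduces the expected edges:

```python
def snn_graph(adj, k):
    """Connect i and j in the SNN graph when they share at least k neighbors."""
    n = len(adj)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            counter = sum(1 for m in range(n) if adj[i][m] and adj[j][m])
            if counter >= k:
                edges.append((i, j))
    return edges

# Adjacency matrix of the 6-vertex graph (vertices A..F) from the R example
adj = [[0, 1, 0, 0, 1, 0],
       [1, 0, 1, 1, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0]]
labels = "ABCDEF"
for i, j in snn_graph(adj, 2):
    print(labels[i], "--", labels[j])  # A -- D, B -- D, B -- E, C -- E
```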

Page 48: Graph-based  Proximity Measures

SNN – Time Complexity

The number of vertices of graph G can be defined as n. The "for loops" i and m each iterate once for each vertex in graph G (n times), and "for loop" j iterates at most n - 1 times (O(n)). Cumulatively, the three nested O(n) loops result in a total running time of O(n^3).

Page 49: Graph-based  Proximity Measures

SNN – R Code Example

• library("igraph")
• library("ProximityMeasure")
• data = c(0, 1, 0, 0, 1, 0,
           1, 0, 1, 1, 1, 0,
           0, 1, 0, 1, 0, 0,
           0, 1, 1, 0, 1, 1,
           1, 1, 0, 1, 0, 0,
           0, 0, 0, 1, 0, 0)
• mat = matrix(data, 6, 6)
• G = graph.adjacency(mat, mode=c("directed"), weighted=NULL)
• V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
• tkplot(G)
• SNN(mat, 2)

(Figure: plotted graph on vertices A–F.)

Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E

Page 50: Graph-based  Proximity Measures

SNN – Outlier/Anomaly Detection

• Outlier/Anomaly: something that deviates from what is standard, normal, or expected
• Outlier/Anomaly Detection: detecting patterns in a given data set that do not conform to an established normal behavior

(Figure: scatter plot with an outlying point.)

Page 51: Graph-based  Proximity Measures

SNN - Strengths

• Ability to handle noise and outliers
• Ability to handle clusters of different sizes and shapes
• Very good at handling clusters of varying densities

Page 52: Graph-based  Proximity Measures

SNN - Weaknesses

• Does not take into account the weight of the link between the nodes in a nearest neighbor graph
• A low similarity among nodes of the same cluster in a graph can cause it to find nearest neighbors that are not in the same cluster

Page 53: Graph-based  Proximity Measures

Time Complexity Comparison

Run time:
• HITS: O(k * n^2.376)
• Neumann Kernel: O(n^2.376)
• Shared Nearest Neighbor: O(n^3)

Conclusion: Neumann Kernel <= HITS < SNN