Graph-based Proximity Measures
Practical Graph Mining with R
Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty
Department of Computer Science, North Carolina State University
Feb 20, 2016
Outline

• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Similarity and Dissimilarity

• Similarity
– Numerical measure of how alike two data objects are
– Higher when objects are more alike
– Often falls in the range [0, 1]
– Examples: Cosine, Jaccard, Tanimoto
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Src: “Introduction to Data Mining” by Vipin Kumar et al
Distance Metric

• Distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies:
1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
2. Symmetry: d(p, q) = d(q, p) for all p and q.
3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
• Examples:
– Euclidean distance
– Minkowski distance
– Mahalanobis distance
Src: “Introduction to Data Mining” by Vipin Kumar et al
Is this a distance metric?

Let p = (p_1, p_2, ..., p_d) and q = (q_1, q_2, ..., q_d).

• d(p, q) = max_{1 ≤ j ≤ d} (p_j, q_j)   [Not: positive definite]
• d(p, q) = max_{1 ≤ j ≤ d} (p_j − q_j)   [Not: symmetric]
• d(p, q) = Σ_{j=1}^{d} (p_j − q_j)²   [Not: triangle inequality]
• d(p, q) = min_{1 ≤ j ≤ d} |p_j − q_j|   [Not: positive definite; it is 0 whenever any single attribute matches]
Distance: Euclidean, Minkowski, Mahalanobis

Let p = (p_1, p_2, ..., p_d) and q = (q_1, q_2, ..., q_d).

Euclidean:
d(p, q) = sqrt( Σ_{j=1}^{d} (p_j − q_j)² )

Minkowski (L_r-norm):
d_r(p, q) = ( Σ_{j=1}^{d} |p_j − q_j|^r )^{1/r}
– r = 1: city block distance (Manhattan distance), L_1-norm
– r = 2: Euclidean distance, L_2-norm

Mahalanobis (Σ is the covariance matrix):
d(p, q) = sqrt( (p − q)ᵀ Σ⁻¹ (p − q) )
Euclidean Distance

d(p, q) = sqrt( Σ_{j=1}^{d} (p_j − q_j)² )

Standardization is necessary if scales differ. Ex: p = (age, salary).

For p = (p_1, p_2, ..., p_d):

Mean of attributes: p̄ = (1/d) Σ_{k=1}^{d} p_k

Standard deviation of attributes: s_p = sqrt( (1/(d−1)) Σ_{k=1}^{d} (p_k − p̄)² )

Standardized/normalized vector:
p^new = ( (p_1 − p̄)/s_p, (p_2 − p̄)/s_p, ..., (p_d − p̄)/s_p )

so that the mean of p^new is 0 and its standard deviation is 1.
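A quick R sketch of this standardization; the record p = (age, salary) and its values are made up for illustration:

p <- c(35, 52000)            # p = (age, salary), illustrative values
p_bar <- mean(p)             # mean of p's attributes
s_p <- sd(p)                 # standard deviation; sd() uses the 1/(d-1) definition above
p_new <- (p - p_bar) / s_p   # standardized/normalized vector
mean(p_new)                  # 0
sd(p_new)                    # 1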
Distance Matrix

Euclidean distance: d(p, q) = sqrt( Σ_{j=1}^{d} (p_j − q_j)² )

Input Data Table: P (file name: points.dat)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

[Figure: points p1–p4 plotted in the x–y plane]

R code (columns 2 and 3 of P hold the x and y coordinates):

P <- read.table(file = "points.dat", header = TRUE)
D <- dist(P[, 2:3], method = "euclidean")          # Euclidean distance matrix
L1 <- dist(P[, 2:3], method = "minkowski", p = 1)  # L1 (Manhattan) distances
help(dist)

Output Distance Matrix: D

     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
Src: “Introduction to Data Mining” by Vipin Kumar et al
Covariance of Two Vectors, cov(p, q)

Let p = (p_1, p_2, ..., p_d) and q = (q_1, q_2, ..., q_d).

Mean of attributes: p̄ = (1/d) Σ_{k=1}^{d} p_k

One definition:
cov(p, q) = s_pq = (1/(d−1)) Σ_{k=1}^{d} (p_k − p̄)(q_k − q̄)

Or a better definition:
cov(p, q) = E[ (p − E(p)) (q − E(q))ᵀ ]

where E is the expected value of a random variable.
Covariance, or Dispersion, Matrix Σ

N points in d-dimensional space:

P_1 = (p_11, p_12, ..., p_1d)
...
P_N = (p_N1, p_N2, ..., p_Nd)

The covariance, or dispersion, matrix:

                  | cov(P_1, P_1)  cov(P_1, P_2)  ...  cov(P_1, P_N) |
Σ(P_1, ..., P_N) = | cov(P_2, P_1)  cov(P_2, P_2)  ...  cov(P_2, P_N) |
                  | ...            ...            ...  ...           |
                  | cov(P_N, P_1)  cov(P_N, P_2)  ...  cov(P_N, P_N) |

The inverse, Σ⁻¹, is the concentration matrix or precision matrix.
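A minimal R sketch connecting these definitions to the Mahalanobis distance defined earlier; the four 2-D points are illustrative. Note that R's cov() computes covariances between the columns (attributes) of its argument:

P <- matrix(c(0, 2,
              2, 0,
              3, 1,
              5, 1), ncol = 2, byrow = TRUE)   # rows = points, columns = attributes
S <- cov(P)                  # covariance (dispersion) matrix
S_inv <- solve(S)            # concentration / precision matrix
v <- P[1, ] - P[2, ]         # difference of the first two points
sqrt(t(v) %*% S_inv %*% v)   # Mahalanobis distance between them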
Common Properties of a Similarity

• Similarities also have some well-known properties:
– s(p, q) = 1 (or maximum similarity) only if p = q.
– s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Similarity Between Binary Vectors

• Suppose p and q have only binary attributes
• Compute similarities using the following quantities:
– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
Src: “Introduction to Data Mining” by Vipin Kumar et al
SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
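A short R sketch that reproduces both coefficients for the vectors above:

p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
M11 <- sum(p == 1 & q == 1)   # attributes where both are 1
M00 <- sum(p == 0 & q == 0)   # attributes where both are 0
M10 <- sum(p == 1 & q == 0)
M01 <- sum(p == 0 & q == 1)
SMC <- (M11 + M00) / (M01 + M10 + M11 + M00)   # 0.7
J <- M11 / (M01 + M10 + M11)                   # 0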
Cosine Similarity

• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
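The same computation as a short R sketch:

d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
cosine <- sum(d1 * d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))   # dot product over norms
cosine   # 0.3150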
Src: “Introduction to Data Mining” by Vipin Kumar et al
Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes:
T(p, q) = (p · q) / ( ||p||² + ||q||² − p · q )
– Reduces to Jaccard for binary attributes
Src: “Introduction to Data Mining” by Vipin Kumar et al
Correlation (Pearson Correlation)

• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q, then take their dot product:

p'_k = (p_k − mean(p)) / std(p)
q'_k = (q_k − mean(q)) / std(q)

correlation(p, q) = p' · q'

Src: “Introduction to Data Mining” by Vipin Kumar et al
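In R, cor() computes the Pearson correlation directly; a quick sketch with made-up vectors:

p <- c(3, 6, 0, 3, 6)
q <- c(1, 2, 0, 1, 2)
cor(p, q, method = "pearson")   # 1 here, since q is a linear function of p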
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
Src: “Introduction to Data Mining” by Vipin Kumar et al
General Approach for Combining Similarities

• Sometimes attributes are of many different types, but an overall similarity is needed. The standard approach:
1. For the kth attribute, compute a similarity s_k(p, q) in the range [0, 1].
2. Define an indicator variable δ_k: δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute; otherwise δ_k = 1.
3. Compute the overall similarity:
similarity(p, q) = ( Σ_{k=1}^{d} δ_k s_k ) / ( Σ_{k=1}^{d} δ_k )
Src: “Introduction to Data Mining” by Vipin Kumar et al
Using Weights to Combine Similarities

• May not want to treat all attributes the same
– Use weights w_k that are between 0 and 1 and sum to 1; the combined similarity becomes:
similarity(p, q) = ( Σ_{k=1}^{d} w_k δ_k s_k ) / ( Σ_{k=1}^{d} δ_k )
Src: “Introduction to Data Mining” by Vipin Kumar et al
Graph-Based Proximity Measures

Within-graph proximity measures:
• Hyperlink-Induced Topic Search (HITS)
• The Neumann Kernel
• Shared Nearest Neighbor (SNN)
Outline

• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Neumann Kernels: Agenda

• Introduction
• Co-citation and Bibliographic Coupling
• Document and Term Correlation
• Diffusion/Decay Factors
• Relationship to HITS
• Strengths and Weaknesses
Neumann Kernels (NK)

• Also known as the von Neumann kernel
• A generalization of HITS
• Input: undirected or directed graph
• Output: within-graph proximity measure, capturing both importance and relatedness
NK: Citation Graph

[Figure: directed citation graph on articles n1–n8]

• Input: directed graph
– Vertices n1, ..., n8 are articles
– Edges indicate citations
• A citation matrix C can be formed:
– If an edge between two vertices exists, the corresponding matrix cell is 1; otherwise it is 0
NK: Co-citation Graph

[Figure: co-citation graph on articles n1–n8]

• Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph
• In the graph above, n1 and n2 are connected because both are referenced by the same node, n5, in the citation graph
• Co-citation matrix: CC = CᵀC
NK: Bibliographic Coupling Graph

[Figure: bibliographic coupling graph on articles n1–n8]

• Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references
• In the graph above, n5 and n6 are connected because both reference the same node, n2, in the citation graph
• Bibliographic coupling matrix: BC = CCᵀ (written here as BC to distinguish it from the co-citation matrix CC = CᵀC)
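A minimal R sketch of both constructions. The citation edges below are assumptions made for illustration (the original figure is not recoverable from the text), not the book's example:

n <- 8
articles <- paste0("n", 1:n)
C <- matrix(0, n, n, dimnames = list(articles, articles))   # C[i, j] = 1 if i cites j
C["n5", c("n1", "n2")] <- 1    # assumed citations
C["n6", c("n2", "n3")] <- 1
C["n7", c("n3", "n4")] <- 1
C["n8", c("n3", "n4")] <- 1
CC <- t(C) %*% C   # co-citation: (i, j) counts articles that cite both i and j
BC <- C %*% t(C)   # bibliographic coupling: (i, j) counts references shared by i and j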
NK: Document and Term Correlation

Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document).

Example:
D1: “I like this book”
D2: “We wrote this book”

Term-Document Matrix X:

        D1  D2
I        1   0
like     1   0
this     1   1
book     1   1
we       0   1
wrote    0   1
NK: Document and Term Correlation (2)

Document correlation matrix: a matrix in which the rows and the columns represent documents, and the entries represent the semantic similarity between two documents.

Example (D1: “I like this book”, D2: “We wrote this book”):

Document Correlation Matrix K = XᵀX
NK: Document and Term Correlation (3)

Term correlation matrix: a matrix in which the rows and the columns represent terms, and the entries represent the semantic similarity between two terms.

Example (D1: “I like this book”, D2: “We wrote this book”):

Term Correlation Matrix T = XXᵀ
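A short R sketch computing both correlation matrices from the term-document matrix above:

terms <- c("I", "like", "this", "book", "we", "wrote")
X <- matrix(c(1, 0,
              1, 0,
              1, 1,
              1, 1,
              0, 1,
              0, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(terms, c("D1", "D2")))
K <- t(X) %*% X   # 2 x 2 document correlation matrix, t(X) X
T <- X %*% t(X)   # 6 x 6 term correlation matrix, X t(X) (note: shadows R's alias T for TRUE)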
Neumann Kernel Block Diagram

• Input: graph
• Output: two matrices of dimensions n × n, called K_γ and T_γ
• Diffusion/decay factor γ: a tunable parameter that controls the balance between relatedness and importance
NK: Diffusion Factor - Equation & Effect

The Neumann Kernel defines two matrices incorporating a diffusion factor γ:

K_γ = K Σ_{n=0}^{∞} (γK)^n = K (I − γK)⁻¹
T_γ = T Σ_{n=0}^{∞} (γT)^n = T (I − γT)⁻¹

This simplifies with our definitions of K = XᵀX and T = XXᵀ.

When γ = 0: K_γ = K and T_γ = T, so the kernel reduces to pure relatedness (document and term correlation).
When γ approaches its maximum allowed value: the rankings induced by K_γ and T_γ converge to the HITS importance rankings.
NK: Diffusion Factor - Terminology

• Indegree: the indegree δ⁻(v) of vertex v is the number of edges leading to vertex v. Ex: δ⁻(B) = 1
• Outdegree: the outdegree δ⁺(v) of vertex v is the number of edges leading away from vertex v. Ex: δ⁺(A) = 3
• Maximal indegree: the maximal indegree Δ⁻ of the graph is the maximum over the indegrees of all vertices of the graph. Ex: Δ⁻(G) = 2
• Maximal outdegree: the maximal outdegree Δ⁺ of the graph is the maximum over the outdegrees of all vertices of the graph. Ex: Δ⁺(G) = 3

[Figure: example directed graph on vertices A, B, C, D]
NK: Diffusion Factor - Algorithm
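The algorithm itself was presented as a figure; below is a minimal R sketch of the computation under the K_γ = K(I − γK)⁻¹ formulation above. The function name and the bound on γ are stated as assumptions, not as the book's API:

# Sketch: Neumann Kernel from a citation/adjacency matrix C and diffusion factor gamma.
# Assumes gamma is small enough that (I - gamma*K) is invertible
# (gamma less than 1 / largest eigenvalue of K).
neumann_kernel <- function(C, gamma) {
  K <- t(C) %*% C                                    # document correlation
  T <- C %*% t(C)                                    # term correlation
  K_gamma <- K %*% solve(diag(nrow(K)) - gamma * K)  # K (I - gamma K)^(-1)
  T_gamma <- T %*% solve(diag(nrow(T)) - gamma * T)  # T (I - gamma T)^(-1)
  list(K_gamma = K_gamma, T_gamma = T_gamma)
}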
NK: Choice of Diffusion Factor and Its Effect on the Neumann Algorithm

• The Neumann Kernel outputs relatedness between documents and between terms when γ = 0
• When γ is larger, the kernel output matches HITS

[Figure: citation graph on articles n1–n8]

• HITS authority ranking for the graph above: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
• Computing the Neumann Kernel for γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
• For higher values of γ, the Neumann Kernel converges to HITS
Comparing NK, HITS, and Co-citation/Bibliographic Coupling
Strengths and Weaknesses

Strengths:
• Generalization of HITS
• Merges relatedness and importance
• Useful in many graph applications

Weaknesses:
• Topic drift
• No penalty for loops in the adjacency matrix
Outline

• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Shared Nearest Neighbor (SNN)

• An indirect approach to similarity
• Builds on a k-nearest-neighbor graph to determine the similarity between nodes
• If two vertices have more than k neighbors in common, they can be considered similar to one another even if a direct link between them does not exist
SNN - Agenda

• Understanding Proximity
• Proximity Graphs
• k-Nearest Neighbor (k-NN) Graph
• Shared Nearest Neighbor Graph
• The Algorithm and Its Time Complexity
• R Code Example
• Outlier/Anomaly Detection
• Strengths and Weaknesses
SNN – Understanding Proximity

• What makes a node a neighbor of another node is based on the definition of proximity
• Definition: the closeness between a set of objects
• Proximity can measure the extent to which two nodes belong to the same cluster
• Proximity is a subtle notion whose definition can depend on the specific application
SNN – Proximity Graphs

• A graph obtained by connecting two points in a set of points by an edge if the two points are, in some sense, close to each other
SNN – Proximity Graphs (continued)

[Figure: various types of proximity graphs: cyclic, linear, and radial]
SNN – Proximity Graphs (continued)

Other types of proximity graphs:
• Minimum spanning tree
• Relative neighbor graph
• Gabriel graph
• Nearest neighbor graph (Voronoi diagram)
SNN – Proximity Graphs (continued)

• Represent neighbor relationships between objects
• Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason
• Using a proximity graph increases the scale range over which good segmentations are possible
• Can be formulated with respect to many metrics
SNN – k-Nearest Neighbor (k-NN) Graph

• Connects each vertex to its k nearest neighbors (see the sketch below)
• Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure
• Has applications in cluster analysis and outlier detection
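A small R sketch of building a k-NN adjacency matrix from points, assuming Euclidean distance; the points and the value of k are illustrative:

points <- matrix(c(0, 2,
                   2, 0,
                   3, 1,
                   5, 1), ncol = 2, byrow = TRUE)
k <- 2
D <- as.matrix(dist(points))   # pairwise Euclidean distances
n <- nrow(D)
knn <- matrix(0, n, n)
for (i in 1:n) {
  nbrs <- order(D[i, ])[2:(k + 1)]   # k nearest neighbors of i, excluding i itself
  knn[i, nbrs] <- 1
}
knn   # knn[i, j] = 1 if j is among the k nearest neighbors of i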
SNN – Shared Nearest Neighbor Graph

• An SNN graph is a special type of k-NN graph
• If an edge exists between two vertices, then they both belong to each other's k-neighborhood

[Figure] Each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar when the SNN parameter is k = 4.
SNN – The Algorithm

Input: G: an undirected graph
Input: k: a natural number (number of shared neighbors)

for i = 1 to N(G) do
  for j = i + 1 to N(G) do
    counter = 0
    for m = 1 to N(G) do
      if vertex i and vertex j both have an edge with vertex m then
        counter++
      end if
    end for
    if counter ≥ k then
      Connect an edge between vertex i and vertex j in the SNN graph.
    end if
  end for
end for
return SNN graph
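A direct R translation of this pseudocode, operating on a symmetric 0/1 adjacency matrix; this is a plain sketch, not the ProximityMeasure package's SNN function:

snn_graph <- function(adj, k) {
  n <- nrow(adj)
  snn <- matrix(0, n, n)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      shared <- sum(adj[i, ] == 1 & adj[j, ] == 1)   # neighbors common to i and j
      if (shared >= k) {
        snn[i, j] <- 1   # connect i and j in the SNN graph
        snn[j, i] <- 1
      }
    }
  }
  snn
}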
SNN – Time Complexity

The number of vertices of graph G is defined as n:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n

The "for loops" i and k iterate once for each vertex in graph G (n times); the "for loop" j iterates at most n − 1 times (O(n)). Cumulatively this results in a total running time of O(n³).
SNN – R Code Example

library(igraph)
library(ProximityMeasure)
data <- c(0, 1, 0, 0, 1, 0,
          1, 0, 1, 1, 1, 0,
          0, 1, 0, 1, 0, 0,
          0, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 0, 0,
          0, 0, 0, 1, 0, 0)
mat <- matrix(data, 6, 6)
G <- graph.adjacency(mat, mode = c("directed"), weighted = NULL)
V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
tkplot(G)
SNN(mat, 2)

[Figure: plotted graph on vertices A–F]

Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E
SNN – Outlier/Anomaly Detection

• Outlier/Anomaly: something that deviates from what is standard, normal, or expected
• Outlier/Anomaly detection: detecting patterns in a given data set that do not conform to an established normal behavior

[Figure: scatter plot with a single point labeled as an outlier/anomaly]
SNN - Strengths

• Ability to handle noise and outliers
• Ability to handle clusters of different sizes and shapes
• Very good at handling clusters of varying densities
SNN - Weaknesses

• Does not take into account the weight of the link between the nodes in a nearest neighbor graph
• A low similarity among nodes of the same cluster in a graph can cause it to find nearest neighbors that are not in the same cluster
Time Complexity Comparison

Algorithm                  Run time
HITS                       O(k · n^2.376)
Neumann Kernel             O(n^2.376)
Shared Nearest Neighbor    O(n³)

Conclusion: Neumann Kernel <= HITS < SNN