Predicting and Identifying Missing Node Information in Social Networks

Ron Eyal, Bar-Ilan University
Avi Rosenfeld, Jerusalem College of Technology
Sigal Sina, Bar-Ilan University
Sarit Kraus, Bar-Ilan University
In recent years, social networks have surged in popularity. One key aspect of social network research is identifying important missing information which is not explicitly represented in the network, or is not visible to all. To date, this line of research has typically focused on finding the connections that are missing between nodes, a challenge typically termed the Link Prediction Problem.
This paper introduces the Missing Node Identification problem, where missing members in the social network structure must be identified. In this problem, indications of missing nodes are assumed to exist. Given these indications and a partial network, we must assess which indications originate from the same missing node and determine the full network structure.
Towards solving this problem, we present the MISC algorithm (Missing node Identification by Spectral Clustering), an approach based on a spectral clustering algorithm, combined with nodes' pairwise affinity measures adopted from link prediction research. We evaluate the performance of our approach in different problem settings and scenarios, using real-life data from Facebook. The results show that our approach yields beneficial results and can be effective in solving the Missing Node Identification problem. In addition, this paper presents R-MISC, which uses a sparse matrix representation, efficient algorithms for calculating the nodes' pairwise affinity and a proprietary dimension reduction technique, to enable scaling the MISC algorithm to large networks of more than 100,000 nodes. Last, we consider problem settings where some of the indications are unknown. Two algorithms are suggested for this problem: Speculative MISC, based on MISC, and Missing Link Completion, based on classical link prediction literature. We show that Speculative MISC outperforms Missing Link Completion.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data mining
General Terms: Algorithms, Performance, Theory
Additional Key Words and Phrases: Social networks, spectral clustering, missing nodes
ACM Reference Format:
Ron Eyal, Avi Rosenfeld, Sigal Sina and Sarit Kraus, 2013. Predicting and Identifying Missing Node Information in Social Networks. ACM Trans. Embedd. Comput. Syst. 0, 0, Article 0 (2013), 26 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
Social networks enable people to share information and interact with each other. These networks are typically formally represented as graphs in which nodes represent people and edges represent some type of connection between these people [Liben-Nowell and Kleinberg 2007], such as friendship or common interests. These networks have become a key Internet application, with examples including popular websites such as Facebook, Twitter and LinkedIn.
Because of their ubiquity and importance, scientists in both academia and industry have focused on various aspects of social networks. One important factor that is often studied is the structure of these networks [Clauset et al. 2008; Eslami et al. 2011; Fortunato 2010; Freno et al.; Gomez-Rodriguez et al. 2012; Gong et al. 2011; Kim and Leskovec 2011; Leroy et al. 2010; Liben-Nowell and Kleinberg 2007; Lin et al. 2012; Porter et al. 2009; Sadikov et al. 2011]. Previously, a Link Prediction Problem [Clauset et al. 2008; Liben-Nowell and Kleinberg 2007] was defined as attempting to locate which connections (edges) will soon exist between nodes. In this problem setting, the nodes of the network are known, and unknown links are derived from existing network information, including complete node information. In contrast, we consider a new Missing Node Identification problem which attempts to locate and identify missing nodes within the network. This problem is significantly more difficult than the previously studied Link Prediction Problem as neither the nodes nor their edges are known with certainty.

This research is based on work supported in part by MAFAT. Sarit Kraus is also affiliated with UMIACS. Preliminary results were published in the AAAI 2011 paper entitled 'Identifying Missing Node Information in Social Networks'.
Authors' addresses: R. Eyal, S. Sina and S. Kraus, Computer Science Department, Bar Ilan University, Ramat-Gan, Israel 92500; A. Rosenfeld, Department of Industrial Engineering, Jerusalem College of Technology, Jerusalem, Israel 91160.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2013 ACM 1539-9087/2013/-ART0 $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000
ACM Transactions on Embedded Computing Systems, Vol. 0, No. 0, Article 0, Publication date: 2013.
To understand the importance of the missing node identification problem we introduce, consider the following example. A hypothetical company, Social Games Inc., is running an online gaming service within Facebook. Many Facebook members are subscribers of this company's services, yet it would like to expand its customer base. As a service provider, Social Games maintains a network of users, which is a subset of the group of Facebook users, and the links between these users. The Facebook users who are not members of the service are not visible to its systems. Social Games Inc. would like to discover these Facebook nodes and try to lure them into joining its service. The company thus faces the missing node identification problem. By solving this problem, Social Games Inc. could improve its advertising techniques and target the specific users who have not yet subscribed to its service.
The above example illustrates just one possible application of the missing node identification problem. In addition to commercial motivation, solving this problem could also be useful for personal interests and entertainment applications. This research direction has also particularly interested the security community. For example, missing nodes in some networks might represent missing persons who are sought after by family members wishing to know their full family tree, or people wanted by the police as suspects in a crime. As a result, solving the missing node identification problem can be of considerable importance.
We focus on a specific variation of the missing node problem where the missing nodes requiring identification are "friends" of known nodes. An unrecognized friend is associated with a "placeholder" node to indicate the existence of this missing friend. Thus, a given missing node may be associated with several "placeholder" nodes, one for each friend of this missing node. We assume that tools such as image recognition software or automated text analysis can be used to aid in generating placeholder nodes. For example, a known user might have many pictures with the same unknown person, or another user might constantly blog about a family member who is currently not a member of the network. Image recognition or text mining tools can be employed on this and all nodes in the social network in order to obtain indications of the existence of a set of missing nodes. Placeholders can then be used to indicate where these missing nodes exist. However, it is likely that many of these placeholders are in fact the same person. Thus, our focus is on identifying the missing nodes, and we frame the problem as: given a set of placeholders, which of these placeholders in fact represent the same person? Considering the assumption that placeholders are received from external data mining modules, this work is mainly focused on relatively small networks with a small number of missing nodes. Nonetheless, our proposed algorithm easily scales to networks of more than a hundred thousand nodes.
In this paper, we present a general method, entitled MISC (Missing node Identification by Spectral Clustering), for solving this problem. This method relies on a spectral clustering algorithm previously considered only for other problems [Almog et al. 2008; Ng et al. 2001]. One key issue in applying the general spectral clustering algorithm is defining a measure for identifying similar nodes to be clustered together. Towards solving this issue, we present five measures for judging node similarity, also known as affinity. One of these measures is the Gaussian Distance measure, typically used over Euclidean spaces in spectral clustering [Ng et al. 2001], while the other four measures are non-Euclidean measures which have been adapted from the related Link Prediction Problem [Liben-Nowell and Kleinberg 2007]. We found that the latter measures are especially useful in solving the missing node identification problem.
We begin this study in the next section by providing background on the spectral clustering algorithm and relevant social network research that relates to our solution for the Missing Node Identification problem. Then, in Section 3, we formally define the Missing Node Identification problem. In Section 4, we detail how the MISC algorithm can be applied, including how different node affinity measures inspired by the Link Prediction Problem can be applied within our solution. Section 5 presents our evaluation dataset and methodology, and Section 6 provides an empirical analysis of MISC's performance when the different affinity measures are used. Following this analysis, we provide two extensions to our base algorithm. In Section 7 we describe how our base algorithm can be extended to handle large networks of hundreds of thousands of nodes. In Section 8 we describe the comparison between the MISC algorithm and KronEM [Kim and Leskovec 2011], a state-of-the-art algorithm for the prediction of missing nodes. In Section 9 we suggest methods for modifying our base approach to address situations where information about the number of missing nodes is partial or unknown, or not all of the placeholders are known. Under the problem setting of unknown placeholders, we show that the MISC algorithm, based on spectral clustering, outperforms 'Missing Link Completion', an algorithm based on classical solutions to the Link Prediction Problem. Section 10 concludes and provides directions for future research.
2. RELATED WORK
In solving the Missing Node Identification problem, this research aims to use variations of two existing research areas: spectral clustering algorithms and metrics built for the Link Prediction Problem. The spectral clustering algorithm of Jordan, Ng and Weiss [Ng et al. 2001] is a well-documented and accepted algorithm, with applications in many fields including statistics, computer science, biology, social sciences and psychology [von Luxburg 2007]. While spectral clustering algorithms have been applied to many areas, their use in the Missing Node Identification problem has not been previously considered, and they cannot be directly applied from previous works.
The main idea behind the spectral clustering framework is to embed a set of data points which should be clustered in a graph structure in which the weights of the edges represent the affinity between each pair of points. Two points which are "similar" will have a larger edge weight between them than two points which are "dissimilar". The similarity, or affinity, function is used to calculate an affinity matrix which describes the pairwise similarity between points. The affinity matrix can be thought of as a weighted adjacency matrix, and therefore it is equivalent to an affinity graph. After generating the affinity graph, an approximation of the min-cut algorithm is employed in order to split the graph into partitions which maximize the similarity between intra-partition points and minimize the similarity between points in different partitions. This is done by manipulating the affinity matrix representing the affinity graph and solving an eigenvector problem.
While the general framework of this algorithm is to embed points in a Euclidean space as an affinity graph, our approach is to construct the affinity graph directly from the social network graph structure. The key challenge in applying the spectral clustering algorithm to the Missing Node Identification problem is how to compute the level of similarity between nodes in the social network while constructing the affinity matrix. Towards defining this measure, we consider adopting measures developed for a related problem, the Link Prediction Problem. In the Link Prediction Problem there is a set of known nodes, and the goal is to discover which connections, or edges, will be made between nodes [Clauset et al. 2008; Liben-Nowell and Kleinberg 2007]. In contrast, in the missing node problem, even the nodes themselves are not known, making the problem significantly more difficult. Nonetheless, we propose a solution where measures originally used to solve the Link Prediction Problem are utilized to form the affinity matrix in the first step of solving this problem as well.
Various methods have been proposed to solve the Link Prediction Problem. Approaches typically attempt to derive which edges are missing by using measures to predict link similarity based on the overall structure of the network. However, these approaches differ as to which computation is best suited for predicting link similarity. For example, Liben-Nowell and Kleinberg [Liben-Nowell and Kleinberg 2007] demonstrated that measures such as the shortest path between nodes and different measures relying on the number of common neighbors can be useful. They also considered variations of these measures, such as using an adaptation of Adamic and Adar's measure of the similarity between webpages [Adamic and Adar 2003] and Katz's calculation for shortest path information, which weights short paths more heavily [Katz 1953] than the simpler shortest path information. After formally describing the missing node identification problem, we detail how the spectral clustering algorithm can be combined with these link prediction methods in order to effectively solve the missing node identification problem.
Many other studies have researched problems of missing information in social networks. Guimerà and Sales-Pardo [Guimerà and Sales-Pardo 2009] propose a method which performs well for detecting missing links as well as spurious links in complex networks. The method is based on a stochastic block model, where the nodes of the network are partitioned into different blocks, and the probability of two nodes being connected depends only on the blocks to which they belong. Some studies focus on understanding the propagation of different phenomena through social networks as a diffusion process. These phenomena include viruses and infectious diseases, information, opinions and ideas, trends, advertisements, news and more. Gomez-Rodriguez et al. [Gomez-Rodriguez et al. 2012] attempted to infer a network structure from observations of a diffusion process. Specifically, they observed the times when nodes get infected by a specific contagion, and attempted to reconstruct the network over which the contagion propagates. The reconstruction is done through the edges of the network, while the nodes are known in advance. Eslami et al. [Eslami et al. 2011] studied the same problem. They modeled the diffusion process as a Markov random walk and proposed an algorithm called DNE to discover the most probable diffusion links.
Sadikov et al. [Sadikov et al. 2011] also studied the problem of diffusion of data in a partially observed social network. In their study they proposed a method for estimating the properties of an information cascade, the nodes and edges over which a contagion spreads through the network, when only part of the cascade is observed. While this study takes into account missing nodes and edges from the cascade, the proposed method estimates cumulative properties of the true cascade and does not produce a prediction of the cascade itself. These properties include the number of nodes, number of edges, number of isolated nodes, number of weakly connected components and average node degree.
Other works attempted to infer missing link information from the structure of the network or information about known nodes within the network. For example, Lin et al. [Lin et al. 2012] proposed a method for community detection, based on graph clustering, in networks with incomplete information. In these networks, the links within a few local regions are known, but links from the rest of the network are missing. The graph clustering is performed using an iterative algorithm named DSHRINK. Gong et al. [Gong et al. 2011] proposed a model to jointly infer missing links and missing node attributes by representing the social network as an augmented graph where attributes are also represented by nodes. They showed that link prediction accuracy can be improved when first inferring missing node attributes. Freno et al. [Freno et al.] proposed a supervised learning method which uses both the graph structure and node attributes to recommend missing links. A preference score which measures the affinity between pairs of nodes is defined based on the feature vectors of each pair of nodes. Their algorithm learns the similarity function over feature vectors of the graph structure. Kossinets [Kossinets 2003] assessed the effect of missing data on various networks and suggested that nodes may be missing, in addition to missing links. In that work, the effects of missing data on network-level statistics were measured, and it was empirically shown that missing data causes errors in estimating these parameters. While advocating its importance, that work does not offer a definitive statistical treatment to overcome the problem of missing data.
One can divide studies of the missing links problem into two groups: unsupervised methods and supervised ones. Unsupervised methods include works such as those done by Liben-Nowell and Kleinberg [Liben-Nowell and Kleinberg 2007], Katz [Katz 1953], and Gong et al. [Gong et al. 2011], which do not require the input data to be labeled in any way. Instead, these methods learn by inferring information based on the specific structure of the input network. The second group of methods requires some element of tagged data to be supplied in a training phase, where the data is typically explicitly labeled so that a classifier can be created. These methods include the works of Freno et al. [Freno et al.] and Gong et al. [Gong et al. 2011], who use a binary classifier in order to predict the missing links. Although supervised methods often yield better results, they require an additional step of manually tagging the input data, something that requires additional resources and time. Additionally, this approach assumes some similarity between the training dataset and the test dataset. In our work, we intentionally avoided supervised methods, as we wished to present a method free of these steps and assumptions, and thus opted to create an unsupervised method.
Most similar to our paper is the recent work by Kim and Leskovec [Kim and Leskovec 2011], which also tackled the issue of missing nodes in a network. This work deals with situations where only part of the network is observed, and the unobserved part must be inferred. The proposed algorithm, called KronEM, uses an Expectation Maximization approach, where the observed part is used to fit a Kronecker graph model of the network structure. The model is used to estimate the missing part of the network, and the model parameters are then reestimated using the updated network. This process is repeated in an iterative manner until convergence is reached. The result is a graph which serves as a prediction of the full network. This research differs from ours in several aspects. First, while the KronEM prediction is based on link probabilities provided by the EM framework, our algorithm is based on a clustering method and graph partitioning. Secondly, our approach is based on the existence of missing node indications obtained from data mining modules such as image recognition. When these indications exist, our algorithm can be directly used to predict the original graph. As a result, while KronEM is well suited for large-scale networks with many missing nodes, our algorithm may be effective in local regions of the network, with a small number of missing nodes, where the data mining can be employed. The results of experiments performed in this study show that our proposed algorithm, MISC, can achieve better prediction quality than KronEM, thanks to the use of indications of missing nodes.
3. FORMAL DEFINITION OF THE MISSING NODE IDENTIFICATION PROBLEM
We formally define the new Missing Node Identification problem as follows. Assume that there is a social network G = ⟨V, E⟩ in which e = ⟨v, u⟩ ∈ E represents an interaction between v ∈ V and u ∈ V. Some of the nodes in the network are missing and are not known to the system. We denote the set of missing nodes Vm ⊂ V, and assume that the number of missing nodes is given as N = |Vm|. We denote the rest of the nodes as known, i.e., Vk = V \ Vm, and the set of known edges is Ek = {⟨v, u⟩ | v, u ∈ Vk ∧ ⟨v, u⟩ ∈ E}, i.e., only the edges between known nodes.
Towards identifying the missing nodes, we focus on a part of the network, Ga = ⟨Va, Ea⟩, that we define as being available for the identification of missing nodes. In this network, each of the missing nodes is replaced by a set of placeholders. Formally, we define a set Vp of placeholders and a set Ep of the associated edges. For each missing node v ∈ Vm and for each edge ⟨v, u⟩ ∈ E, u ∈ Vk, a placeholder is created. That is, we add a placeholder v′ for v to Vp, and for the original edge ⟨v, u⟩ we add a placeholder edge ⟨v′, u⟩ to Ep. We denote the origin of v′ ∈ Vp with o(v′). Putting all of these components together, Va = Vk ∪ Vp and Ea = Ek ∪ Ep. In this setting, both the known nodes and the placeholders, along with their associated edges, are defined as being the portion of the network, Ga, that is available for the missing node identification process. The problem is that for a given missing node v there may be many placeholders in Vp. The challenge is to determine which of the placeholders are associated with v. This will allow us to reconstruct the original social network G.
Formally, we define the missing node identification problem as follows: Given a known network Gk = ⟨Vk, Ek⟩, an available network Ga = ⟨Va, Ea⟩ and the number of missing nodes N, divide the nodes of Va \ Vk into N disjoint sets Va1, . . . , VaN such that each Vai ⊆ Vp contains exactly the placeholders of vi ∈ Vm.
To better understand this formalization, consider the following example. Assume Alice is a Facebook member, and thus is one of the available nodes in Va. She has many known social nodes (people), Vk, within the network, but her cousin Bob is not a member of Facebook. Bob is represented by a missing node w ∈ Vm. From analysis of text in her profile we might find phrases like "my long lost cousin Bob", indicating the existence of the missing node representing Bob. Alternately, from examining profile pictures of Alice's friends, an image recognition software package might identify pictures of an unknown male person, resulting in another indication of the existence of a missing node. Each indication is represented by a placeholder node v′ ∈ Vp and a link ⟨u, v′⟩ ∈ Ep, where u is the known node (e.g., Alice) which contains the indication. By solving the missing node identification problem we aim to identify which of these indications point to the same missing node, representing Bob, and which represent other missing nodes.

Fig. 1: Illustration of the missing node identification problem. The whole graph represents the original network, G = ⟨V, E⟩. Nodes 1 and 5 are missing nodes. The sub-graph containing the nodes which appear inside the cloud is the known network, Gk = ⟨Vk, Ek⟩.

Fig. 2: Illustration of the available network obtained by adding placeholders for the missing nodes. The nodes which appear outside of the cloud are the placeholders. The sub-graph containing the nodes which appear inside the cloud is still the known network, Gk = ⟨Vk, Ek⟩. The whole graph represents the available network, Ga = ⟨Va, Ea⟩, which includes the known network and the placeholders.
4. THE MISC ALGORITHM
The main challenge in solving the missing node problem is discovering how to correctly identify the set of Vp placeholders as the set of Vm missing nodes. As we are limited to general structural knowledge about the portion of the network Ga that is available, solving this problem is far from trivial. Our key approach is to use clustering, and specifically spectral clustering, to group Vp into N clusters in order to identify the missing nodes. The challenge in using any clustering algorithm, including spectral clustering, is how to provide it with a well-defined value that quantifies distances between both the known portion of the network and the placeholders representing the missing nodes. To better understand this difficulty, this section first briefly introduces spectral clustering and focuses on the challenges in applying it to the missing node identification problem. Our novel contribution is that we apply a series of affinity measures, primarily based on literature from the related link prediction problem, which can be used to solve the missing node problem. These affinity measures are critical in providing a non-discrete measure between both known nodes and placeholders such that the clustering algorithm can operate on that input. Finally, we present our novel MISC algorithm for solving the missing node identification problem based on combining both the spectral clustering algorithm and these affinity measures.
4.1. Spectral Clustering
Spectral clustering is a general algorithm used to cluster data samples using a certain predefined similarity, or affinity, measure between them. The algorithm creates clusters which maximize the similarity between points in each cluster and minimize the similarity between points in different clusters. The algorithm accepts as its input a set of M sample coordinates in a multi-dimensional space, described by a matrix S. The algorithm also requires the number of clusters, K, to be known and given as input. In its original form, the algorithm constructs an affinity matrix which describes the affinity between each pair of samples based on the Euclidean distance in the multi-dimensional space. This affinity matrix is equivalent to an affinity graph, where each data sample is represented by a node and the affinity between two samples is represented by the weight of the edge between the two matching nodes. Spectral clustering solves the minimum cut problem by partitioning the affinity graph in a manner that maximizes the affinity within each partition and minimizes the affinity between partitions. Since finding this partitioning is NP-hard, spectral clustering finds an approximate solution to this problem. While the reader is encouraged to review the algorithm in its entirety [Ng et al. 2001], a brief description of this algorithm is presented below in Algorithm 1.
ALGORITHM 1: Spectral Clustering
Input: S - data samples in a multi-dimensional space; K - number of clusters; σ - standard deviation for the Gaussian distance function
Output: C ∈ N^{1×|S|} - a vector indicating a cluster index for each sample in S
1: Define si to be the coordinates of every sample i in the multi-dimensional space and calculate an affinity matrix A ∈ R^{M×M}, where M is the number of samples. A defines the affinity between all pairs of samples (i, j) using the Gaussian distance function: Aij = exp(−||si − sj||² / 2σ²).
2: Define D to be the diagonal matrix whose (i, i) element is the sum of A's i-th row, and construct the matrix L = D^{−1/2} A D^{−1/2}.
3: Find x1, x2, ..., xK, the K largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix X = [x1 x2 ... xK] ∈ R^{M×K} by stacking the eigenvectors in columns.
4: Form the matrix Y from X by renormalizing each of X's rows to have unit length.
5: Treating each row of Y as a point in R^K, cluster them into K clusters via k-means or any other clustering algorithm.
6: Assign the original sample i to cluster j if and only if row i of the matrix Y was assigned to cluster j.
The key to the success of Algorithm 1 is constructing the affinity matrix in the first step of the algorithm. This matrix must represent the pairwise affinity as accurately as possible according to the given problem setting. Note that this algorithm assumes data samples residing in some Euclidean space and thus defines affinity between samples accordingly. However, in the case of social network graphs, it is very difficult to embed the nodes of the graph in a Euclidean space as defined by the original algorithm. This is because it is unclear whether such a space can be defined which represents the actual distance between nodes in the graph.
To understand this difficulty, consider the triangle inequality, which is one of the basic characteristics of a Euclidean space. This inequality may not hold for many reasonable distance metrics between nodes. For instance, consider a distance metric that incorporates the number of common neighbors between two nodes. While nodes a and b may not have any common neighbors, causing their distance to be infinite, they may both have common neighbors with node c. The common neighbors measure for this example does not obey the triangle inequality, because under this measure d(a, b) > d(a, c) + d(c, b). In general, defining a space which obeys the triangle inequality is difficult because it requires an examination of more than two nodes at once when defining their distances. A distance metric which does obey this inequality is the shortest path length between two nodes, which we define later. Nonetheless, as we have found, this measure does not necessarily yield the best results for missing node identification. In order to bypass the Euclidean space difficulty, we have decided to alter the first step of the algorithm and to define direct, graph-based measures for building the affinity matrix A, rather than using the Euclidean space. While applying this algorithm to missing node identification, we need to address how this measure can be calculated to represent affinity between nodes in a social network. Due to the complexity of social networks, and the need to compute this measure in a multi-dimensional space, calculating this measure is far from trivial. We consider several different methods, most of which have been proven to be useful for solving the Link Prediction Problem in social networks [Adamic and Adar 2003; Katz 1953; Liben-Nowell and Kleinberg 2007]. We empirically compare these methods with the Euclidean method described in the original algorithm. These methods are discussed in detail in the next section.
4.2. Incorporating Link Prediction Measures
The key novelty within this paper is that we propose that affinity measures be constructed based on general graph measures or previously developed measures for the related Link Prediction Problem [Adamic and Adar 2003; Katz 1953; Liben-Nowell and Kleinberg 2007]. Specifically, we discuss how five such measures can potentially be applied to the spectral clustering algorithm [Ng et al. 2001]:
(1) Gaussian Distance: Define D_ij to be the length of the shortest path between nodes i and j. Define D_i to be the vector of the lengths of the shortest paths from node i to all other nodes. Calculate A_ij as in step 1 of the original spectral clustering algorithm: A_ij = exp(−||D_i − D_j||^2 / 2σ^2).
(2) Inverse Squared Shortest Path (ISSP): A_ij = 1 / (D_ij)^2, where D_ij is the length of the shortest path between nodes i and j.
(3) Relative Common Neighbors: A_ij = |Γ(i) ∩ Γ(j)| / min(|Γ(i)|, |Γ(j)|), where Γ(i) is defined as the group of neighbors of node i in the network graph.
(4) Adamic / Adar: A_ij = Σ_{k ∈ Γ(i) ∩ Γ(j)} 1 / log(|Γ(k)|).
(5) Katz Beta: A_ij = Σ_{k=1}^{∞} β^k · (number of paths between nodes i and j of length exactly k).
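As a concrete illustration, the two neighborhood-based measures above can be computed directly from an adjacency structure. The following is a minimal sketch; the toy graph and function names are illustrative, not the paper's implementation:

```python
import math

# Toy undirected graph as an adjacency dict (illustrative example data).
G = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def relative_common_neighbors(G, i, j):
    # |Γ(i) ∩ Γ(j)| / min(|Γ(i)|, |Γ(j)|)
    common = G[i] & G[j]
    return len(common) / min(len(G[i]), len(G[j]))

def adamic_adar(G, i, j):
    # Sum over common neighbors k of 1 / log(|Γ(k)|).
    # Assumes every common neighbor has degree > 1, so log(|Γ(k)|) != 0.
    return sum(1.0 / math.log(len(G[k])) for k in G[i] & G[j])

print(relative_common_neighbors(G, 2, 4))  # 1.0: {1, 3} common, min degree 2
print(adamic_adar(G, 2, 4))                # 2 / ln(3), both common neighbors have degree 3
```
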
All five of these measures present possible similarity values for creating the affinity matrix, A_ij, in the spectral clustering algorithm. The Gaussian Distance is based on the standard distance measure often used in spectral clustering [von Luxburg 2007]. The Inverse Squared Shortest Path (ISSP) measure is based on the length of the shortest path between two points, i and j, here representing two nodes in the social network. This measure presents an alternative to the original Gaussian Distance used to measure Euclidean distances, and is not directly inspired by the Link Prediction Problem. In contrast, the next three measures were directly inspired by this literature. The Relative Common Neighbors measure checks the number of neighbor nodes that i and j have in common (|Γ(i) ∩ Γ(j)|). We divide by min(|Γ(i)|, |Γ(j)|) in order to avoid biases towards nodes with a very large number of neighbors. Similarly, the Adamic / Adar measure also incorporates the common neighbor measure (Γ(i) ∩ Γ(j)), checking the overall connectivity of each common neighbor to other nodes in the graph and giving more weight to common neighbors who are less connected. Since the nodes that act as placeholders for missing nodes only have one neighbor each, the common neighbors and the Adamic/Adar measures do not represent these nodes well. Therefore, for these measures only, we also consider them to be connected to their neighbors' neighbors. Last, the Katz measure [Katz 1953] directly sums the number of paths between i and j, using a parameter β which is exponentially damped by the length of each path. Similar to the common neighbor measure, this measure also stresses shorter paths more heavily. Finally, for evaluating these algorithms, we consider a baseline Random Assignment algorithm that assigns each placeholder to a random cluster; it represents the most naive of assignment algorithms.
4.3. Solving the Missing Node Identification Problem

Generally, our solution for solving the missing node problem is based on the following three steps:

(1) Cluster the placeholder nodes, using a spectral clustering approach, thus creating disjoint groups of placeholders which have a high probability of representing the same missing node.
(2) Unite all of the placeholders from each cluster into one node, attempting to predict the missing node that is most likely represented by the placeholders in that cluster.
(3) Connect the predicted node to all neighbors of the placeholders that were in the matching cluster, in the hope of creating a graph that is as similar as possible to the original graph G.
Clustering the placeholders is the crucial part of this approach. When this clustering is performed perfectly, the placeholders in each cluster will all originate from the same missing node. As a result, the predicted node created from each cluster will match a specific missing node from the original network graph from which the missing nodes were removed. In this case the predicted graph will be identical to the original graph. When clustering is not fully successful, some clusters will contain placeholders which originate from different missing nodes, the predicted nodes will not exactly match the missing nodes, and the predicted graph will be less similar to the original graph.
Fig. 3: Correct clustering of the placeholders for missing nodes 1 and 5. The placeholders in each cluster are united into one node which represents a missing node.
Specifically, Algorithm 2, MISC, presents how this solution is accomplished. This algorithm accepts the known and available parts of the network graph, as described in the problem definition. We also assume at this point that the number of missing nodes, N, is given. N is used to define the number of clusters that the spectral clustering algorithm will create. The final input is α, a procedure for calculating the pairwise affinity of nodes in the graph. An example of such a procedure could be one that implements the calculation of one of the affinity measures adopted from the Link Prediction Problem.
ALGORITHM 2: - MISC (Missing node Identification by Spectral Clustering)
Input: Gk = ⟨Vk, Ek⟩ - the known part of the network
Ga = ⟨Va, Ea⟩ - the available part of the network
N - the number of missing nodes
α : G(V,E) → R^{|V|×|V|} - a procedure for calculating the affinity matrix of nodes in a graph
Output: C ∈ N^{|Va\Vk|} - a vector indicating the cluster index of each placeholder node; Ĝ = (V̂, Ê) - prediction of the full network graph

1: A ∈ R^{|Va|×|Va|} ← α(Ga) - calculate the affinity matrix of the available nodes in the graph
2: Perform steps 2-4 of the Spectral Clustering Algorithm (Algorithm 1) to calculate Y, using N as the input K to Algorithm 1
3: Y′ ← {Y_i | v_i ∉ Vk} - keep only the rows of Y which match the placeholder nodes
4: C ← k_means(Y′, N) - cluster the rows which match the placeholder nodes into N clusters
5: V̂ ← Vk, Ê ← Ek - initialize the output graph to contain the known network
6: For each cluster c ∈ C create a new node v_c ∈ V̂
7: For each placeholder v in cluster c and edge (u, v) ∈ Ea, create an edge (u, v_c) ∈ Ê
8: Return C, Ĝ = (V̂, Ê)
The first two steps of this algorithm are based on the spectral clustering algorithm and measures described in the previous two sections. This algorithm first calculates the affinity matrix of the available nodes using the given procedure α (step 1). Note that we use this algorithm with any one of the five affinity measures described in the previous section. Next, steps 2-4 of the original spectral clustering algorithm are followed in order to calculate the matrix Y (step 2). This is a transformation that spectral clustering performs in order to transform the data points into a vector space in which k-means clustering can be employed. The number of missing nodes, N, is an input parameter in MISC. It is used as the number of clusters for the spectral clustering algorithm, which is marked as K. Even though we only need to cluster the placeholders, the matrix Y in the spectral clustering algorithm contains all of the M data samples as coordinates in an N dimensional space. Spectral clustering performs k-means clustering on the samples in this space. In the context of MISC, each row of Y corresponds to a node in Ga embedded in an N dimensional space.
Step 3 of the algorithm is specific to our problem. Here, the rows of Y which correspond to the known nodes in Vk are removed. As opposed to general clustering problems where all the data must be clustered in an unsupervised manner, in our case most of the nodes are known and the challenge is to find a correct clustering for the placeholders. Therefore, for the sake of clustering and unifying only the placeholders, only the rows corresponding to placeholders in Y are kept.
On the other hand, the affinity between the known nodes and the placeholders contains important information which should be utilized. For this reason, all the nodes are embedded in the affinity matrix, and all the columns in Y remain. Notice that in this manner, the information obtained from the known nodes in the embedding process is still present in the matrix Y′ in the form of the coordinates matching the placeholders (see step 3 of Algorithm 2).
In step 4 of Algorithm 2, the remaining rows, which correspond to placeholders, are clustered using k-means clustering. In this step also, N is used as the number of clusters K. This is because every cluster is used to predict a missing node in the following steps. According to spectral clustering theory [Ng et al. 2001], k-means clustering is expected to achieve better results when employed using the coordinates depicted in Y or Y′ than when applied to the original data.
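To make the embedding and row selection concrete, the following is a minimal sketch in the spirit of the Ng et al. normalization. The tiny example graph, the `embed` helper, and the choice of which rows play the placeholder role are all illustrative assumptions, not the paper's code:

```python
import numpy as np

def embed(A, N):
    """Steps 2-4 of spectral clustering (sketch): normalized affinity,
    top-N eigenvectors, row-normalized embedding Y."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0                             # guard isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A @ D_inv_sqrt             # normalized affinity matrix
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    X = vecs[:, -N:]                            # N largest eigenvectors
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms                            # rows of Y, unit length

# Two obvious groups: nodes {0, 1} and {2, 3}; nodes 1 and 3 stand in
# for placeholders in this toy setting.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
Y = embed(A, 2)
Y_ph = Y[[1, 3]]                                # keep only "placeholder" rows
# Rows from different components end up orthogonal, hence far apart.
print(np.linalg.norm(Y_ph[0] - Y_ph[1]))        # ≈ sqrt(2)
```

In this disconnected example, nodes of the same component map to the same embedded point, which is exactly why k-means on the rows of Y (or Y′) separates the groups cleanly.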
In steps 5-8 the predicted graph is created by uniting all of the placeholders in each cluster into a new node and connecting this node to the known nodes which were neighbors of the original placeholders in the corresponding cluster. Recall that each cluster represents a predicted missing node. The placeholders in that cluster each represent a link of the predicted missing node, and for this reason the links are created between the new node and the neighbors of the placeholders in the corresponding cluster. Another advantage of removing the known nodes in step 3 is clear at this point: each resulting cluster contains only placeholders, and can therefore represent a missing node from the original graph.
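Steps 5-8 of Algorithm 2 amount to a simple graph rewrite, sketched below. The data layout (edge sets, a `cluster_of` mapping as produced by k-means) is an illustrative assumption:

```python
# Sketch of MISC steps 5-8: unite the placeholders in each cluster into one
# predicted node and connect it to the placeholders' known neighbors.
# All names (unite_placeholders, cluster_of, ...) are illustrative.

def unite_placeholders(known_nodes, known_edges, placeholder_edges, cluster_of):
    """cluster_of maps each placeholder to its cluster index."""
    V_hat = set(known_nodes)
    E_hat = set(known_edges)
    for ph, c in cluster_of.items():
        new_node = ('predicted', c)       # one new node per cluster
        V_hat.add(new_node)
        for (u, v) in placeholder_edges:  # edges stored as (known, placeholder)
            if v == ph:
                E_hat.add((u, new_node))  # reconnect the placeholder's neighbor
    return V_hat, E_hat

known_nodes = {1, 2, 3}
known_edges = {(1, 2), (2, 3)}
# Placeholders 'a' and 'b', each attached to exactly one known node.
placeholder_edges = {(1, 'a'), (3, 'b')}
cluster_of = {'a': 0, 'b': 0}             # both assigned to the same cluster
V_hat, E_hat = unite_placeholders(known_nodes, known_edges,
                                  placeholder_edges, cluster_of)
print(V_hat)  # the known nodes plus one predicted node ('predicted', 0)
print(E_hat)  # includes (1, ('predicted', 0)) and (3, ('predicted', 0))
```
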
5. DATASET DESCRIPTION AND EVALUATION METHODOLOGY

To empirically study the Missing Node Identification problem and assess the proposed solutions, we must be able to simulate the problem setting using real world data. Within this section we first describe the dataset utilized and the methods of synthesizing the problem. We then discuss the evaluation measures and methodology used to compare the proposed solutions empirically.
5.1. Dataset Description

For an empirical evaluation of our method we use a previously developed social network dataset, the Facebook MHRW dataset [Gjoka et al. 2010]. This dataset contains structural information sampled from Facebook, including over 900,000 nodes and the links between them. For each node certain social characteristics are stored, such as shared academic networks (e.g. all people from Harvard University), corporate networks (e.g. workers in AIG), geographical networks (e.g. members from Idaho), or networks of people who share similar interests (e.g. love of chocolate). All nodes are anonymized as numbers without any indication of their true identity.
The main challenge we had to address in using a dataset of this size was processing the data within a tractable period and overcoming memory constraints. In order to create more tractable datasets, we considered two methods of creating subsets of the Facebook data [Gjoka et al. 2010]. In the first method, we create a subset based on naturally occurring similarities between nodes according to users' network membership characteristics within the social network. Each subset is created by sampling all the nodes in a specific user network and the links between these nodes. Nodes with only one link or no links at all are removed. The advantage of this method of creating the subsets is that there is a higher chance of affiliation between the nodes in the user network as compared to random nodes selected from the entire social network. However, the disadvantage is that the nodes which make up the user network may not be completely connected, for example if the user network is comprised of several disconnected components. In fact, the subgraph of nodes that is part of a specific user network may be very sparse. Another disadvantage is that the user networks in the dataset are limited in size and therefore large subgraphs cannot be created from them. In contrast, for the second method of creating a subset, we begin with the entire dataset, and extract a subset based on a BFS walk starting from a random node in the dataset. Here no previous information about the social network is necessary, but the BFS generated subset may not accurately represent the actual topology of the entire network.
In order to synthesize the missing node problem within these two subsets, we randomly mark N nodes as the set of missing nodes, Vm. We then remove these nodes from the network, and replace each link (v, u) between v ∈ Vm and u ∈ Vk with a placeholder node v′ ∈ Vp and a link (v′, u) ∈ Ep. The resulting network Ga is the available network used as input to our proposed MISC algorithm for solving the Missing Node Identification problem.
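The synthesis procedure above (remove N random nodes, replace each of their links with a degree-one placeholder) can be sketched as follows; the function name and data layout are illustrative assumptions:

```python
import random

def synthesize_missing(nodes, edges, n_missing, seed=0):
    """Remove n_missing random nodes; replace each link between a missing
    node and a known node with a fresh degree-one placeholder (a sketch
    of the setup in Section 5.1)."""
    rng = random.Random(seed)
    missing = set(rng.sample(sorted(nodes), n_missing))
    known = nodes - missing
    kept_edges = {(u, v) for (u, v) in edges
                  if u not in missing and v not in missing}
    ph_nodes, ph_edges = set(), set()
    idx = 0
    for (u, v) in edges:
        for m, k in ((u, v), (v, u)):       # check both endpoints
            if m in missing and k in known:
                ph = f'ph{idx}'; idx += 1   # one placeholder per lost link
                ph_nodes.add(ph)
                ph_edges.add((k, ph))
    return known, kept_edges, ph_nodes, ph_edges

nodes = {1, 2, 3, 4}
edges = {(1, 2), (2, 3), (3, 4), (1, 4)}    # a 4-cycle: every node has degree 2
known, kept, phs, ph_edges = synthesize_missing(nodes, edges, 1)
print(len(phs))  # 2: one placeholder per link of the removed node
```
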
5.2. Evaluation Measures

We considered two types of evaluation measures to measure the effectiveness of the solutions presented. One measure is a straightforward similarity measure that compares the output graph of the MISC algorithm, Ĝ = (V̂, Ê), to the original network graph, G, from which the missing nodes were removed.
Within the first measure, we quantify the similarity of two graphs based on the accepted Graph Edit Distance (GED) measure [Bunke and Messmer 1993; Kostakis et al. 2011]. The GED is defined as the minimal number of edit operations required to transform one graph into the other. An edit operation is an addition or deletion of a node or an edge. Since finding the optimal edit distance is NP-Hard, we use a previously developed simulated annealing method [Kostakis et al. 2011] to find an approximation of the Graph Edit Distance. The main advantage of this method of evaluation is that it is independent of the method used to predict Ĝ, making it very robust. For instance, GED can be used to compare two methods which might create a different number of clusters from each other, if the number of clusters is unknown in advance. GED can also be used to compare methods that cluster a different number of placeholders, for example, if not all of the placeholders are given. In fact, it can be used to compare any two methods, as long as they both produce a predicted graph.
A disadvantage of computing the GED lies in its computational cost. This cost is high because the GED measure is based on finding an optimal matching of the nodes in Ĝ to the nodes in G which minimizes the edit distance. Even if we take advantage of the fact that most of the nodes are known nodes from Vk whose matching is known, there are still N unknown nodes to be aligned: the missing nodes in G to the nodes generated from each cluster in Ĝ. This allows for N! possibilities. Therefore, a heuristic search method must be used, as we do, which may lead to over-estimations of the edit distance.
Due to this expense, a purity measure, which can be computed simply, can be used instead. The purity measure attempts to assess the quality of the placeholders' clustering. This can be done since we know in advance the correct clustering of the placeholders. Under the correct clustering, each cluster would group together the placeholders originating from the same missing node. A high quality algorithm would produce a clustering which is most similar to the true clustering of the placeholders. The clustering quality is tested using the purity measure which is often used to evaluate clustering algorithms [Strehl and Ghosh 2003]. This measure is calculated in the following manner:

(1) Classify each cluster according to the true classification of the majority of samples in that cluster. In our case, we classify each cluster according to the most frequent true original node v ∈ Vm of the placeholder nodes in that cluster.
(2) Count the number of correctly classified samples in all clusters and divide by the number of samples. In our case the number of samples (nodes) that are classified is |Vp|.
Formally, in this problem setting, purity is defined as:

purity(C) = (1/|Vp|) Σ_k max_{v_j ∈ Vm} |c_k ∩ {v′ ∈ Vp | o(v′) = v_j}|,

where c_k is defined as the set of placeholders which were assigned to cluster k. Note that as the number of missing nodes increases, correct clustering becomes more difficult, as there are more possible original nodes for each placeholder. As our initial results show, the purity indeed decreases as the number of missing nodes increases (see Figures 5 and 6). This evaluation method works well because it is easy to calculate and it reflects the quality of the algorithm well when comparing clustering algorithms which create the same number of clusters. Its disadvantage is that it can be biased depending on the number of clusters: if the same data is divided into a large number of clusters it is easier to achieve higher purity. Therefore this method is used only when the number of clusters is known in advance, and there is no advantage to selecting a large number of clusters.
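The two-step purity computation above fits in a few lines. The cluster and origin structures below are illustrative; `origin` plays the role of the mapping o(v′) from placeholders to their true missing nodes:

```python
def purity(clusters, origin):
    """clusters: list of sets of placeholders;
    origin: placeholder -> true originating missing node."""
    total = sum(len(c) for c in clusters)          # |Vp|
    correct = 0
    for c in clusters:
        counts = {}
        for ph in c:                               # tally true origins in cluster
            counts[origin[ph]] = counts.get(origin[ph], 0) + 1
        correct += max(counts.values())            # majority-origin count
    return correct / total

origin = {'p1': 'A', 'p2': 'A', 'p3': 'B', 'p4': 'B'}
perfect = [{'p1', 'p2'}, {'p3', 'p4'}]             # matches the true clustering
mixed = [{'p1', 'p3'}, {'p2', 'p4'}]               # each cluster split between A and B
print(purity(perfect, origin))  # 1.0
print(purity(mixed, origin))    # 0.5
```
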
5.3. Evaluation Methodology

In this subsection we describe our evaluation methodology and the experimental setup (see flowchart in Figure 4). The flow begins with the full Facebook MHRW dataset, containing over 900,000 nodes. This dataset is sampled by running BFS from a random node or by selecting a specific user network, as described in Subsection 5.1. When running BFS, the sampling is repeated ten times, each starting from a different random node. In other words, ten different networks are created for each size, in order to avoid randomization errors and biases. In the following step, N nodes are randomly selected as missing nodes. This is repeated several times for each one of the networks, depending on the experiment, resulting in different random instances of the problem setting for each combination of graph size and N. From each instance the randomly selected missing nodes are removed. Each missing node is replaced with a placeholder for each of its links to remaining nodes in the network. The resulting graph is fed into the MISC algorithm, which produces a clustering of the placeholders. By uniting the placeholders in each cluster into a new node and connecting the new node to the neighbors of the placeholders in the corresponding cluster, MISC creates a predicted network. This predicted network is supposed to resemble as closely as possible the structure of the original network from which the random nodes were removed. The clustering produced by MISC is evaluated using the purity measure described above, and the average purity value achieved over all of the instances is reported. The Graph Edit Distance is also evaluated by comparing the predicted network to the original network; this result is likewise averaged over the different instances and reported. In each experiment report we indicate the number of iterations performed. Nearly all points in the graphs of the relevant figures below are the average of at least 100 runs of the algorithm on randomly generated configurations. However, for highly time consuming experiments, such as the comparison with the KronEM algorithm, only 40 iterations of the experiment were run. Nevertheless, the results were significant even with this number of iterations.
Fig. 4: A figure explaining the evaluation methodology for the
experiments within this work.
6. COMPARING AFFINITY MEASURES

To assess the quality of our algorithm we ran experiments on the subgraphs obtained from the Facebook dataset. Each experiment consisted of synthesizing the missing node problem as described in Section 5, and running the spectral clustering algorithm using each one of the proposed affinity measures. We measured the purity achieved when using each of the affinity measures and compared the results to the purity of a random clustering, where each placeholder was randomly assigned to a cluster. Each experiment described in this section was repeated over 150 times, each time with randomly selected missing nodes. The results displayed indicate the average over all experiments.
The results obtained from these experiments clearly show that the MISC algorithm obtained significantly better results than the random clustering, regardless of which affinity measure was used, demonstrating the success of this method in solving the missing node identification problem. We also compared the relative success of the different affinity measures described in Section 4.2. We find that Inverse Squared Shortest Path, Relative Common Neighbors and Adamic / Adar performed slightly better than the Gaussian Distance and Katz Beta measures on this dataset. Figure 5 shows the purity achieved when using each of the affinity measures with the spectral clustering algorithm, on fifteen subgraphs containing 2,000 nodes each, obtained by BFS from a random node in the full dataset. Each point in the figure represents the average purity from the experiments described above. MISC performed much better than the random clustering with each one of the affinity measures. When comparing the affinity measures, it seems that the Inverse Squared Shortest Path was slightly better than the others. The same experiments were performed on subgraphs of user networks taken from the full Facebook dataset. The following networks were selected:
(1) Network 362 - containing 3,189 nodes and 3,325 undirected links
(2) Network 477 - containing 2,791 nodes and 3,589 undirected links
(3) Network 491 - containing 2,874 nodes and 3,143 undirected links
Fig. 5: Purity achieved when using each of the affinity
measures.
It is important to note that these subgraphs are very sparse, as each node has very few links on average. As a result, removing random nodes can cause the graph to become disconnected. Given this, it is interesting to see the performance of the spectral clustering algorithm on these subgraphs, shown in Figure 6. We can see that spectral clustering is also effective on these subgraphs, achieving much higher purity than the random clustering. Among the affinity measures, Inverse Squared Shortest Path (ISSP) performed the worst in this case and Adamic/Adar performed the best. This may be due to the fact that disconnected components lead to paths of infinite length, and ISSP is more sensitive to that.
Fig. 6: Purity achieved when using each of the affinity measures
on user network graphs.
As the key contribution of this paper lies in its use of affinity measures to solve the missing node problem, we considered whether a second clustering algorithm, k-means, could be used instead of spectral clustering. Overall, we found that the k-means algorithm also produced similar results, yet the spectral clustering algorithm overall performed significantly better. To evaluate this claim, we used 10 samples of 10,000 node networks. From each network, we randomly removed a set of missing nodes of sizes 30, 50, 100, 150 and 200. Then, we considered how the k-means
clustering algorithm could be used instead. Within both the spectral clustering and k-means algorithms, affinity matrices are needed to provide the input for the clustering algorithms. However, we considered two variations for which nodes to process the affinity measures within the k-means algorithm. Within the first variation, we considered both the known network nodes and the placeholder nodes. Within the second variation, we only processed the placeholder nodes. We randomly generated 8 networks, and for each network setting generated 10 sets of missing nodes. Thus, each datapoint in Figure 7 is averaged over 80 different experiments. As can be seen in this figure, we considered three different affinity measures: Relative Common Neighbors (far left of the figure), Adamic Adar (middle of the figure) and Katz Beta 0.05 (far right of the figure). Note that for all three affinity measures, the spectral clustering algorithm performed better than both of the k-means variations. We also checked for normal distribution through the Shapiro-Wilk test. We found that the results for the k-means algorithms, with the exception of the 200 missing nodes case for the Adamic Adar and Katz Beta affinity measures, were normally distributed. We then performed the t-test to validate the results' significance. We found that the spectral clustering algorithm performed significantly better than both of the k-means variations. In all cases the p-values were much lower than the 0.05 significance threshold. The largest p-value was observed in the case of the Relative Common Neighbors affinity measure, where the p-values between the spectral clustering and both k-means variations were still much less than 0.0001.
Fig. 7: Comparing the purity achieved by the spectral clustering algorithm (SC datapoints in the figure) versus k-means clustering on the network and the missing nodes (k-means datapoints in the figure) and k-means clustering on only the placeholders (k-means on PHs datapoints in the figure).
7. SCALING TO LARGE NETWORKS

Using a straightforward implementation of our proposed algorithm, we have been able to analyze network graphs of up to 5,000 nodes within a tractable time. This is typically not enough to deal with recent datasets of social networks, which can reach hundreds of thousands of nodes or more. Processing these datasets imposes serious memory and time constraints, and therefore efficient algorithms need to be developed and used for this task. In this section we propose two approaches to address this challenge by enabling efficient calculation of the affinity matrix and of the spectral clustering algorithm. We accomplish this through two novel extensions of the base MISC algorithm: adding Sparse Matrix Representation and Dimension Reduction.
7.1. Sparse Matrix Representation and Algorithms

Social networks generally have a sparse nature, i.e. each node is connected to only a small fraction of the nodes. Most of the possible links do not exist, and therefore |E| ≪ |V|^2. A commonly used representation for a graph G = ⟨V,E⟩ is a matrix G^{|V|×|V|} where G_{i,j} = 1 if (i, j) ∈ E, and G_{i,j} = 0 otherwise. When representing a social network as an adjacency matrix, most of the entries of the matrix will be zero. Using sparse matrix representations can significantly reduce the memory load and help deal with a much larger amount of data. It is possible to maintain only the non-zero values in order to preserve memory and efficiently perform calculations. The sparse property can also be utilized to improve the time complexity of the affinity matrix calculation. However, the affinity matrix itself may not be sparse in some cases, depending on the affinity measure used. For instance, if the Inverse Squared Shortest Path measure is used, and the graph is fully connected, then each pair of nodes has a non-zero affinity value. In this section we detail the algorithms for calculating the affinity matrix and analyze them in terms of space and time complexity, detailing how the Katz Beta affinity can be calculated while exploiting sparseness. While we also considered how to modify the Relative Common Neighbors and Adamic/Adar affinity measures, the application of the general approach presented in this section to them is straightforward. Nonetheless, we include how these measures can be modified in Appendix A so that the results presented in this paper can be replicated.
ALGORITHM 3: - Katz Beta Affinity
Input: Ga = ⟨Va, Ea⟩ - the available part of the network
K - the maximal path length to consider
β - the damping factor
Output: A ∈ R^{|Va|×|Va|} - affinity matrix indicating the pairwise affinity

1: P ← Ga - initialize P as the adjacency matrix of Ga
2: A1 ← β · P - initialize A as the adjacency matrix of Ga, damped by β
3: for k = 2 to K do
4:   P ← P · Ga - count the number of paths of length k by matrix multiplication
5:   Ak ← Ak−1 + β^k · P - accumulate the number of paths, damped by β^k
6: end for
7: Return AK
Algorithm 3 describes the Katz Beta affinity calculation. As before, we assume that the given network graph, from which the affinity matrix is calculated, is Ga = (Va, Ea). We assume that the matrix is given in a sparse representation, i.e. a list of neighbors is given for each node. Let d be the maximal degree of a node in the graph. In step 1, the matrix P, which contains the number of paths between each pair of nodes, is initialized. At this point it contains only the number of paths of length 1, and it is therefore equal to the adjacency matrix which defines Ga. In step 2, the affinity matrix is also initialized, at this point taking into account paths of length 1, damped by β. In steps 3-6 we iteratively increase the maximal path length taken into account until we reach paths of length K. In step 4, P is updated in each iteration k to Ga^k, indicating the number of paths of length k between each pair of nodes. P is then accumulated into Ak after damping by β^k, in step 5.
Multiplication of sparse n × n matrices can be done in O(mn), where m is the number of non-zero elements. In our case, n = |Va| and m = O(|Va| · d) for the matrix Ga. In P, the number of non-zero elements grows in each iteration by a factor of up to d. Therefore, in iteration k, step 4 requires O(|Va|^2 d^k) operations. The time complexity of steps 3-6, and of the entire procedure, is then O(Σ_{k=1}^{K} |Va|^2 d^k) = O(|Va|^2 d^K).
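Algorithm 3 translates almost line-for-line into a sparse-matrix sketch. The use of SciPy's CSR format and the toy path graph below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def katz_beta_affinity(adj, K, beta):
    """Algorithm 3 sketch: accumulate beta^k * (#paths of length k), k = 1..K."""
    P = adj.copy()                  # step 1: paths of length 1
    A = beta * P                    # step 2: damped by beta
    for k in range(2, K + 1):       # steps 3-6
        P = P @ adj                 # step 4: paths of length k
        A = A + (beta ** k) * P     # step 5: accumulate, damped by beta^k
    return A

# 3-node path graph 0-1-2 (illustrative example).
adj = csr_matrix(np.array([[0, 1, 0],
                           [1, 0, 1],
                           [0, 1, 0]], dtype=float))
A = katz_beta_affinity(adj, K=2, beta=0.1)
# A[0,2] = 0.1^2 * 1 = 0.01: one path of length 2 (0-1-2), no direct edge.
print(A.toarray())
```

Because `P` and `adj` stay in sparse format, each multiplication only touches non-zero entries, which is exactly the source of the O(mn) cost per step discussed above.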
7.2. Dimension Reduction

In the first step of the original MISC algorithm described above, we calculate the affinity matrix A ∈ R^{|Va|×|Va|}, representing the pairwise affinity of nodes in Va. This matrix is then fed into the spectral clustering algorithm. Each row of A corresponds to one of the nodes, represented in a vector space. Each column corresponds to a single dimension in the space in which these nodes reside. In steps 2-4 of spectral clustering (Algorithm 1), the nodes are embedded in a Euclidean space R^N. By selecting the N largest eigenvectors of L in step 3, MISC reduces the number of dimensions given to the k-means algorithm in step 5 from |Va| to N. However, steps 2 and 3 still remain computationally expensive, dealing with a large matrix with |Va| rows and columns.
We propose a variation of MISC that includes dimension reduction, R-MISC, which can reduce the inherent complexity within the MISC algorithm, allowing us to tractably solve the missing node problem in much larger networks. R-MISC reduces the size of the affinity matrix, greatly decreasing the computational complexity of steps 2 and 3 within the original MISC algorithm. Intuitively, when the number of placeholders is relatively small compared to the total size of the network, many network nodes are "distant" from all of the placeholders in the sense that they have no affinity, or only a very small affinity, with the placeholders. Since only the placeholders are clustered in our problem setting, we propose that the dimensions representing these nodes can be removed from the affinity matrix with only a small loss in performance of the spectral clustering algorithm, while significantly reducing the time and the memory used by the spectral clustering component of the MISC algorithm. Note that while these nodes have a very small affinity with the placeholders, they may have a high affinity with other nodes which are not removed. Thus, removing them may still affect the result of the spectral clustering. Yet, as our results show, the effect is minimal.
To test this method, we conducted experiments comparing the performance of R-MISC relative to MISC. After calculating the affinity matrix common to both algorithms, within R-MISC we remove the rows and columns that correspond to nodes that have zero affinity with all of the placeholders, resulting in a much smaller, square affinity matrix which is then used in the following steps of the original MISC algorithm.
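The row/column pruning step of R-MISC can be sketched in a few lines of NumPy. The dense toy matrix and helper name below are illustrative; in practice the affinity matrix would be held in sparse form:

```python
import numpy as np

def reduce_affinity(A, placeholder_idx):
    """R-MISC reduction sketch: drop rows/columns of nodes with zero
    affinity to every placeholder; always keep the placeholders themselves."""
    ph_cols = A[:, placeholder_idx]              # affinity of each node to placeholders
    keep = np.where(ph_cols.sum(axis=1) > 0)[0]  # nodes touching some placeholder
    keep = np.union1d(keep, placeholder_idx)     # placeholders are never dropped
    return A[np.ix_(keep, keep)], keep

# 4 nodes; node 3 is the placeholder; node 2 has zero affinity to it.
A = np.array([[0.0, 0.5, 0.0, 0.2],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0]])
A_red, kept = reduce_affinity(A, np.array([3]))
print(kept)        # nodes 1 and 2 are dropped; 0 and the placeholder remain
print(A_red.shape) # a much smaller square matrix
```

Note that node 1, which has affinity only with other known nodes, is also dropped; as discussed above, this may slightly perturb the embedding, but the results show the effect is minimal.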
Fig. 8: The effect of dimension reduction on 10,000 node graphs in terms of time (left) and purity (right), given different numbers of missing nodes (X-axis on both sides).
Fig. 9: The effect of dimension reduction on 50,000 node graphs in terms of time (left) and purity (right), given different numbers of missing nodes (X-axis on both sides).
We then studied the impact of employing dimension reduction within different network sizes and for different numbers of missing nodes within those networks. First, we studied how this approach is impacted by network size. In Figures 8 and 9 we present results from networks constructed with 10,000 and 50,000 nodes, respectively. We also studied additional networks with 20,000, 30,000 and 40,000 nodes and found R-MISC to be equally effective in these intermediate sizes as well. Second, we studied how R-MISC would perform for different numbers of missing nodes. We studied networks with N = 10, 50, 100, 200, 300 and 500 missing nodes. Note that these values constitute the x-axis within Figures 8 and 9.
Last, we studied the impact that the network size and different missing node sizes had on performance. In doing so, we compared both the accuracy, as measured by purity, and the time, measured in seconds, for both the original MISC algorithm and the R-MISC variation with dimension reduction. The left side of Figures 8 and 9 presents the results from the time experiments, and the right side presents the corresponding purity measures. As expected, the dimension reduction significantly reduces the time needed to run the spectral clustering component of the algorithm, while only having a small impact on R-MISC's performance. These graphs present results from the Relative Common Neighbors affinity measure; however, similar results were obtained when the other affinity measures were used. In all cases, the results clearly show that despite the overhead of performing the dimension reduction, significant time is saved, and this method allows us to scale R-MISC to much larger networks within a tractable time. Note that as these experiments only pertain to the time used by the spectral clustering step of the algorithm, the time element of these results does not depend on the affinity measure used. Thus, the time results are the same for all affinity measures.
Fig. 10: The impact of network size on the MISC and R-MISC algorithms for networks with 100 (left) and 500 (right) missing nodes.
Fig. 11: Purity achieved by the R-MISC algorithm when using different affinity measures on 100,000 node graphs, using dimension reduction.
We also considered how scalable the R-MISC algorithm would be by considering networks with up to 100,000 nodes. We again considered networks with 100, 200, 300, 400 and 500 missing nodes. Figure 10 presents the results from the experiments with 100 and 500 missing nodes. Note from these results that only the R-MISC algorithm was able to tractably solve the networks with 100,000 nodes. Furthermore, the time required by the R-MISC algorithm increases relatively slowly with the network size. For example, within the 500 missing node problem, the time required for the R-MISC algorithm to solve for 500 missing nodes within a 30,000 node network is 504.30 seconds, while the time to solve for 500 nodes within 40,000, 50,000 and 100,000 node networks is 602.99, 683.77, and 702.83 seconds respectively. Thus, we conclude from these results that R-MISC is a scalable algorithm.
Figure 11 displays the purity achieved by using three different affinity measures in the R-MISC algorithm, on graphs containing 100,000 nodes. The results are compared to a random clustering. The ability to analyze such large graphs is achieved due to dimension reduction and sparse matrix representation. The results of the tested measures are consistent with results from experiments on smaller graphs, with Adamic/Adar and Relative Common Neighbors performing slightly better than Katz Beta.
8. COMPARING MISC TO KRONEM
Recall from Section III that the Missing Node Problem is relatively new, with very few algorithms available to compare with MISC. Nonetheless, we did perform comparisons between MISC and the only other known algorithm suitable for the missing node problem, KronEM [Kim and Leskovec 2011]. KronEM accepts a known graph and the number of missing nodes, and outputs a graph which predicts the missing nodes and their links. This algorithm is not based on the existence of placeholders and therefore does not use them. Another key difference between MISC and KronEM is that KronEM assumes that the number of nodes in the full graph is a power of two.
Nonetheless, to facilitate a fair comparison between MISC and KronEM, we generated 10 networks, each containing 2^11 = 2048 nodes sampled from the full Facebook dataset. From each network, we randomly removed N missing nodes. The selected values for N were 11, 21, 31, 41 and 50, which are, respectively, approximately 0.5%, 1%, 1.5%, 2% and 2.5% of the network. The experiment was repeated four times for each selected value of N in each network, resulting in 40 iterations per missing node percentage. Each resulting network was fed to the KronEM and MISC algorithms.
The output of each algorithm was compared to the original network graph using Graph Edit Distance, since KronEM cannot be evaluated by Purity, as it is not a clustering based algorithm. As a baseline algorithm, we also measured the GED obtained by the random clustering approach in this set of experiments. An important observation is that MISC only predicts the existence of links between the predicted nodes and the neighbors of the original missing nodes. KronEM, on the other hand, might predict an erroneous link between a predicted node and any other node in the network which was not connected to the missing node. This is due to the fact that KronEM does not take advantage of the placeholders. Therefore, in order to fairly compare the two, when calculating the GED, we only considered edit operations relating to links between the predicted nodes and the neighbors of the missing nodes. In addition, even though KronEM predicts a directed graph, we treated each predicted link as undirected, since this slightly improved the results of KronEM.
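The restricted Graph Edit Distance used in this comparison can be sketched roughly as follows, under the simplifying assumption that each predicted node has already been matched to the missing node it represents; all names are illustrative:

```python
def restricted_ged(true_edges, predicted_edges, neighbor_set):
    """Edge-edit count restricted to links touching known neighbors of
    the missing nodes, assuming predicted nodes are already matched to
    the missing nodes they represent. Edges are frozensets of two node
    ids, so direction is ignored, matching the undirected treatment of
    KronEM's output. Names are illustrative.
    """
    # Keep only edges with at least one endpoint among the neighbors.
    true_r = {e for e in true_edges if e & neighbor_set}
    pred_r = {e for e in predicted_edges if e & neighbor_set}
    # One edit (insertion or deletion) per edge in exactly one graph.
    return len(true_r ^ pred_r)
```

The symmetric difference counts one edit operation per edge that appears in exactly one of the two restricted edge sets.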
8.1. KronEM Configuration
The KronEM algorithm accepts the known network, without the placeholders, and the number of missing nodes. Additional parameters are the initial initiator matrix of the Kronecker graph, the set of parameters for the Gibbs sampling and the number of iterations [Kim and Leskovec 2011].
The authors recommended using at least 3 kinds of initial initiator matrices, which represent different relationships between the nodes. Accordingly, we ran the KronEM algorithm with the 3 recommended variations of the initial initiator matrix, which represent the following relationships: (a) Homophily - love of the same; (b) Heterophily - love of the different; and (c) Core-periphery - links are most likely to form between members of the core and least likely to form between members of the periphery. For all 3 variations we used the same parameters¹. Each iteration took several hours to run using the C++ implementation we received from the authors of [Kim and Leskovec 2011].
8.2. MISC Configuration
For this comparison, we used MISC with three different affinity measures: Adamic/Adar, Katz Beta 0.05 and Relative Common Neighbors. Even though the tested network size was only 2048 nodes, we still used the dimension reduction and sparse network representation features of MISC, since they were shown to speed up the calculation with only a slight impact on the predictive performance. Each iteration of the algorithm ran for less than 10 seconds in Matlab².
8.3. Results
The results show that both the MISC and KronEM algorithms are significantly better than the random clustering baseline. Among the three variations of KronEM, the results were almost identical, yet the Homophily variation produced slightly better results as the missing node percentage grew.
Among the three affinity measures for the MISC algorithm, there was also a small difference, yet Relative Common Neighbors performed slightly better than Adamic/Adar and Katz Beta 0.05. All three MISC variations performed significantly better than KronEM with the Homophily initiator matrix. The best obtained results from each algorithm are displayed in Figure 12. Table 1 displays the mean result and standard deviation achieved by each algorithm over the 40 iterations over each missing node percentage and four random draws of missing nodes from each of the 10 networks.
¹In running the KronEM algorithm we set the algorithm parameters to the values described in the ReadMe file in their algorithm's distribution. As such, the EM parameter which controls the number of iterations was set to 50, the l parameter controlling the gradient learning rate was 1e-5, the mns parameter controlling the minimum gradient step was 1e-4, and the mxs parameter controlling the maximum gradient step was 1e-2. The tests were run on Windows Web Server 2008 with a 64 bit operating system, 4 GB RAM and an Intel dual core 2.27 GHz CPU.
²The MISC experiment ran on an Intel Core i5, 2.67 GHz CPU with 4 GB of RAM, running Matlab 2010a and a Windows 7 OS.
[Figure 12 plot: Graph Edit Distance (y-axis) vs. Number of Missing Nodes (x-axis), comparing KronEM Homophily and MISC Relative Common Neighbors.]
Fig. 12: Comparison of MISC to KronEM on 2,048 node networks using Graph Edit Distance (lower is better).
To determine the significance of these results, we first used the Shapiro-Wilk test to determine if the data was normally distributed. We found that the results of many of the networks that were studied were in fact not normally distributed. As a result, we used the Wilcoxon Signed-Rank Test to determine whether the differences in Table 1 were statistically significant. Based on this test, we found that the differences were significant in all cases.
We conclude that while KronEM can be an excellent solution when a large part of the network is unknown and there are no indications of missing nodes, MISC can perform better when the placeholders are given and when trying to identify a small number of missing nodes.
9. DEALING WITH PARTIAL AND INACCURATE PLACEHOLDER INFORMATION
The algorithms and experiments described in this work have made assumptions regarding the information that is available on the social network. For instance, the number of missing nodes was assumed to be given as an input to the algorithm. In addition, it was assumed that all the placeholders can be identified with complete certainty. When a node was removed from the original graph in the experiments described above, a placeholder was attached to each one of that node's links. If we assume that indications of missing nodes in the network are obtained from a data mining module, such as image recognition and text processing, it is likely that this information would be partial and noisy. For instance, the number of missing nodes may not be known, not all placeholders may be known, and there may be false alarms indicating placeholders which do not actually exist. In this section we present methods aimed at addressing these issues.
Specifically, in this section we consider three different types of placeholder uncertainty. In the first type of case, we do not know the exact number of missing nodes; in this case, we must estimate this value using the methods we present. We consider a second case where insufficient placeholders exist to correctly identify all missing nodes. Last, we consider the case where extra placeholders exist. In this case, we assume that the actual missing nodes are found within the set of placeholders, but extraneous information exists that indicates additional placeholders which do not correspond to actual nodes within the network.
For the first two categories of problems (unknown number of missing nodes and missing placeholders) we use evaluations based on Graph Edit Distance, as the actual number of missing nodes is unknown based on the data. In these problems, the purity measure is not appropriate, as the actual number of missing nodes is needed to calculate it. In contrast, Graph Edit Distance is based on calculating the number of transformations between the original and predicted networks, something that can be calculated even without knowing the number of missing nodes. In the last type of uncertainty, where there are extra placeholders, we again assume that the number of placeholders is known, allowing us to consider the purity evaluation measure as well.
9.1. Estimating the Number of Missing Nodes
In many real-life scenarios, a partial network may be available while the number of missing nodes is unknown. In the MISC algorithm, the number of clusters is set to equal the number of missing nodes, N. Recall that the placeholders in each cluster are united into one predicted node which is connected to all of the neighbors of these placeholders. Each
Table I: Comparison of MISC and KronEM Algorithms

Missing Node %   Algorithm               Mean GED   Std.
0.5%             KronEM Core-Periphery     36.32    11.98
                 KronEM Heterophily        36.18    12.16
                 KronEM Homophily          36.03    11.16
                 MISC Adamic/Adar          32.55    15.44
                 MISC Katz Beta 0.05       31.03    16.00
                 MISC CommNeighbors        31.52    16.47
1%               KronEM Core-Periphery     69.72    19.57
                 KronEM Heterophily        70.20    20.82
                 KronEM Homophily          70.08    19.39
                 MISC Adamic/Adar          67.80    26.83
                 MISC Katz Beta 0.05       66.45    28.76
                 MISC CommNeighbors        64.50    27.57
1.5%             KronEM Core-Periphery    101.82    23.68
                 KronEM Heterophily       102.63    27.08
                 KronEM Homophily         100.70    23.92
                 MISC Adamic/Adar          99.68    33.47
                 MISC Katz Beta 0.05       97.65    33.15
                 MISC CommNeighbors        95.60    32.99
2%               KronEM Core-Periphery    133.20    28.08
                 KronEM Heterophily       134.75    29.71
                 KronEM Homophily         130.33    26.74
                 MISC Adamic/Adar         127.67    37.42
                 MISC Katz Beta 0.05      127.05    37.26
                 MISC CommNeighbors       123.62    35.92
2.5%             KronEM Core-Periphery    162.98    30.98
                 KronEM Heterophily       167.25    36.26
                 KronEM Homophily         158.60    29.39
                 MISC Adamic/Adar         154.73    37.42
                 MISC Katz Beta 0.05      155.23    40.00
                 MISC CommNeighbors       150.35    39.66
Table II: Studying the Accuracy of the Estimation for the Number of Missing Nodes

Missing Nodes                    10      20      30      50      70     100   Total
Quadratic Degree Estimation  0.1850  0.1725  0.1308  0.1115  0.0921  0.0728  0.1275
Mean Degree Estimation       0.1850  0.1725  0.1300  0.1125  0.0921  0.0743  0.1277
predicted node is supposed to resemble one of the missing nodes. When the number of missing nodes is unknown, it must be estimated so that the predicted graph is as similar as possible to the original graph.
Our general approach is to estimate the number of missing nodes based on the general structure of social networks. We assume that a set ratio exists between nodes within the network and their neighbors. This relationship has been previously studied within the missing link and node literature [Gomez-Rodriguez et al. 2012; Lin et al. 2012; Kim and Leskovec 2011]. We considered two methods, quadratic and mean estimations. Within the quadratic estimation we assumed that the number of edges in the network follows a quadratic relation of the number of the network's nodes. This was also observed in the work of Kim and Leskovec [Kim and Leskovec 2011]. Accordingly, we first calculated this ratio according to the known part of the network, i.e. |Ek| = a · |Vk|², and we can then calculate the estimated number of clusters, N̂, as the solution of the equation |Ek ∪ Ep| = a · (|Vk| + N̂)². The estimated N̂ is rounded to the nearest integer and used in the MISC algorithm. Within the mean estimation we assigned d as the average degree of a node in Va. The expected number of clusters is then |Vp|/d. The estimated number of clusters, N̂, is rounded to the nearest integer and used in the MISC algorithm. This estimation causes |Vp| nodes to be clustered into N̂ clusters, and the average number of placeholders per cluster is then |Vp|/N̂. As a result, the predicted nodes have the same average degree as the nodes in the available graph Va. This is due to the fact that each placeholder has one link, and each
Fig. 13: Graph Edit Distance achieved when N is estimated and when N is known in advance. Tested on 10,000 node graphs.
predicted node will have the links of all the placeholders which were clustered into the relevant cluster representing that missing node.
We first studied how accurate these estimations were in predicting the actual number of missing nodes within the network. To do so, we ran an experiment using 10 samples of 10,000 node networks. From each network, we randomly removed a set of missing nodes of sizes 10, 20, 30, 50, 70 and 100. We calculated the two estimations according to the remaining networks and the placeholders. We then calculated the mean absolute error between each estimation and the actual number of missing nodes, and repeated this test 4 times. The results in Table II show the mean absolute error for these 40 runs. Note that both estimation methods yielded similar results, and that as the number of missing nodes increases, the error is reduced. This implies that both estimations actually become more precise as more nodes are missing. One possible explanation for this is that as the number of missing nodes increases, small variations within localized portions of the network become less significant, making these estimations more accurate. Next, to measure the performance of using only an estimate of the number of missing nodes, we compared the predictive performance of the MISC algorithm when the actual number of clusters is given to the performance obtained when using an estimate. In this case, the Graph Edit Distance measure was used, since each instance of the experiment may result in a different number of clusters, depending on the selected subset of the network and the randomly chosen missing nodes. As both the mean and quadratic estimations yield similar results, Figure 13 displays the results of these experiments when using the mean estimate. While there is some visible degradation of the results achieved when N was unknown, the algorithm was still able to achieve beneficial results. For instance, the proposed algorithm which estimated the value of N still performed better than a random clustering algorithm which was given the correct value in advance, especially as N increased. A possible reason for some degradation in the results might be a large variance in the degrees of the missing nodes, causing a large variance in |Vp|, because each neighbor of a missing node is connected to a placeholder in the experiment. Thus, when a node with a large degree was randomly selected for removal, this led to an overestimate of N. The estimation of N is expected to be much more precise if the subset of the network is selected in such a way that the variance in the nodes' degrees is smaller.
9.2. Addressing Missing Placeholders
Many real-life scenarios are likely in which some of the placeholders are missing, for example, if the indications for some of the placeholders were not obtained. Under the assumption that a data mining module provides us with the indications of the existence of placeholders, this module may provide incomplete results. This mistaken information will not only include false negatives from the non-detection of placeholders, but false positives may also exist in the form of false alarms. To assess how robust MISC is with incomplete placeholder information, we measured how this partial information would affect the missing nodes' identification. For this purpose we conducted a set of experiments where the percentage of known placeholders ranged from 10% to 100% of the total placeholders that are created when removing the missing nodes. Figure 14 displays the Graph Edit Distance achieved by the algorithm for each percentage of known placeholders. The affinity measures used in these experiments were Katz Beta and Adamic/Adar. The results
Fig. 14: Graph Edit Distance between the original graph and the graph predicted by the MISC algorithm when only some of the placeholders are known.
clearly show that the performance decreases when fewer placeholders are known, resulting in a higher Graph Edit Distance between the original graph G and the predicted graph Ĝ. Each point in the graph represents the average Graph Edit Distance achieved from 100 experiments. This scenario raises questions regarding the MISC algorithm's ability to address missing information within the placeholders. First, when some placeholders are unknown, the resulting graph predicted by the spectral clustering algorithm would lack the edges between each unknown placeholder and its neighbor. In addition, adapting the spectral clustering algorithm to deal with probabilistic placeholders is not trivial. A possible approach is to run a Link Prediction algorithm in order to complete the edges which are missed due to the unknown placeholders. To formulate this new problem we alter the input of the original missing node identification problem. Recall that Vp is the group of all of the placeholders generated from the missing nodes and their edges. We define the following new groups: V_p^k, E_p^k, the group of known placeholders and their associated edges, and V_p^u, E_p^u, the group of unknown placeholders and their associated edges, such that Vp = V_p^k ∪ V_p^u and V_p^k ∩ V_p^u = ϕ. For each missing node v ∈ Vm, and for each edge ⟨v, u⟩ ∈ E, u ∈ Vk, we add a placeholder v′ either to V_p^k or to V_p^u, and an edge ⟨v′, u⟩ to E_p^k or to E_p^u accordingly. The available network graph, Ga = ⟨Va, Ea⟩, now consists of Va = Vk ∪ V_p^k and Ea = Ek ∪ E_p^k. In addition, we define the indicator function I(v) which returns the value 1 for each known node v ∈ Vk if it is connected to an unknown placeholder, and otherwise returns 0.
I(v) = 1 if ∃u ∈ V_p^u : ⟨v, u⟩ ∈ E_p^u
       0 otherwise
The value of I(v) is of course unknown to the system. Instead, we model the data mining module's knowledge as S(v), a noisy view of I(v) with additive random noise X(v), where X is an unknown random variable.

S(v) = I(v) + X(v)
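One way to simulate this indication model is sketched below. Drawing X(v) from a Gaussian is an illustrative assumption, since the text leaves the distribution of X unspecified; all names are hypothetical:

```python
import random

def noisy_indications(known_nodes, connected_to_unknown, noise_std=0.1):
    """Simulate S(v) = I(v) + X(v) for each known node v.

    connected_to_unknown is the set of known nodes adjacent to some
    unknown placeholder, so I(v) = 1 exactly for these nodes. X(v) is
    drawn as zero-mean Gaussian noise, an illustrative choice.
    """
    S = {}
    for v in known_nodes:
        i_v = 1 if v in connected_to_unknown else 0
        S[v] = i_v + random.gauss(0.0, noise_std)
    return S
```

With noise_std set to 0, S(v) reduces to the indicator I(v) itself.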
The formal problem definition for this problem setting is then: given a known network Gk = ⟨Vk, Ek⟩, an available network Ga = ⟨Va, Ea⟩, the value of S(v) for each v ∈ Vk and the number of missing nodes N, divide the nodes of Va\Vk into N disjoint sets Vv1, ..., VvN such that Vvi ⊆ Vp are all the placeholders of vi ∈ Vm, and connect each set to additional nodes in Vk such that the resulting graph has a minimal graph edit distance (GED) from the original network G. We propose two possible algorithms for solving this problem: Missing Link Completion and Speculative MISC. Both algorithms accept a threshold T, which indicates a minimal value of S(v) to be taken into account as a valid indication of a placeholder. The Missing Link Completion algorithm at first ignores S(v), and runs the original MISC algorithm only on the available graph. Then, in order to complete links that are missing from the predicted graph due to not knowing all the placeholders, a Link Prediction algorithm is used. This algorithm simply connects each node that has a value of S(v) which is higher than T to the new node with which it has the strongest affinity. Algorithm 6 details the pseudo code of the Missing Link Completion algorithm. The Speculative MISC algorithm uses a different approach regarding S(v). Under this approach, a new placeholder is added and connected to every node v whose indication value S(v) is greater than T. Next, the regular MISC algorithm (Algorithm 2) is used on the new graph to predict the original graph. To compare these two algorithms, we conducted experiments where only 50% of the placeholders are known. For each node v ∈ Vk we calculated S(v) as described above, indicating if that node
ALGORITHM 4: Missing Link Completion
Input:
  Gk = ⟨Vk, Ek⟩ {the known part of the network}
  Ga = ⟨Va, Ea⟩ {the available part of the network}
  S(v), ∀v ∈ Vk {the available indications about the unknown placeholders}
  N {the number of missing nodes}
  α : G(V, E) → R^(|V|×|V|) {a function for calculating the affinity matrix of nodes in a graph}
  T {the minimal value of S(v) to be taken into account as a valid indication of a placeholder}
Output: Ĝ = ⟨V̂, Ê⟩ - prediction of the original network G
1: Ĝ = ⟨V̂, Ê⟩ ← MISC(Gk, Ga, N, α) {Apply the MISC algorithm while ignoring S(v)}
2: Â = α(Ĝ) {Calculate the pairwise affinity of the nodes in the predicted graph Ĝ}
   {Iterate over all the known nodes with a value of S(v) higher than the threshold}
3: for each v ∈ Vk : S(v) > T do
4:   u = argmax_{w ∈ V̂\Vk} Â(v, w) {find the new node u with the strongest affinity to v}