Predicting and Identifying Missing Node Information in Social Networks

Ron Eyal, Bar-Ilan University
Avi Rosenfeld, Jerusalem College of Technology
Sigal Sina, Bar-Ilan University
Sarit Kraus, Bar-Ilan University
In recent years, social networks have surged in popularity. One key aspect of social network research is identifying important missing information which is not explicitly represented in the network, or is not visible to all. To date, this line of research has typically focused on finding the connections that are missing between nodes, a challenge typically termed the Link Prediction Problem.
This paper introduces the Missing Node Identification problem, where missing members in the social network structure must be identified. In this problem, indications of missing nodes are assumed to exist. Given these indications and a partial network, we must assess which indications originate from the same missing node and determine the full network structure.
Towards solving this problem, we present the MISC algorithm (Missing node Identification by Spectral Clustering), an approach based on a spectral clustering algorithm, combined with nodes' pairwise affinity measures adopted from link prediction research. We evaluate the performance of our approach in different problem settings and scenarios, using real-life data from Facebook. The results show that our approach yields beneficial results and can be effective in solving the Missing Node Identification problem. In addition, this paper presents R-MISC, which uses a sparse matrix representation, efficient algorithms for calculating the nodes' pairwise affinity and a proprietary dimension reduction technique, to enable scaling the MISC algorithm to large networks of more than 100,000 nodes. Last, we consider problem settings where some of the indications are unknown. Two algorithms are suggested for this problem: Speculative MISC, based on MISC, and Missing Link Completion, based on classical link prediction literature. We show that Speculative MISC outperforms Missing Link Completion.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data mining
General Terms: Algorithms, Performance, Theory
Additional Key Words and Phrases: Social networks, spectral clustering, missing nodes
ACM Reference Format:
Ron Eyal, Avi Rosenfeld, Sigal Sina and Sarit Kraus, 2013. Predicting and Identifying Missing Node Information in Social Networks. ACM Trans. Embedd. Comput. Syst. 0, 0, Article 0 (2013), 26 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
Social networks enable people to share information and interact with each other. These networks are typically formally represented as graphs in which nodes represent people and edges represent some type of connection between these people [Liben-Nowell and Kleinberg 2007], such as friendship or common interests. These networks have become a key Internet application, with examples including popular websites such as Facebook, Twitter and LinkedIn.
Because of their ubiquity and importance, scientists in both academia and industry have focused on various aspects of social networks. One important factor that is often studied is the structure of these networks [Clauset et al. 2008; Eslami et al. 2011; Fortunato 2010; Freno et al.; Gomez-Rodriguez et al. 2012; Gong et al. 2011; Kim and Leskovec 2011; Leroy et al. 2010; Liben-Nowell and Kleinberg 2007; Lin et al. 2012; Porter et al. 2009; Sadikov et al. 2011]. Previously, a Link Prediction Problem [Clauset et al. 2008; Liben-Nowell and Kleinberg 2007] was defined as attempting to locate which connections (edges) will soon exist between nodes. In this problem setting, the nodes of the network are known, and unknown links are derived from existing network information, including complete node information. In contrast, we consider a new Missing Node Identification problem which attempts to locate and identify missing nodes within the network. This problem is significantly more difficult than the previously studied Link Prediction Problem as neither the nodes nor their edges are known with certainty.

This research is based on work supported in part by MAFAT. Sarit Kraus is also affiliated with UMIACS. Preliminary results were published in the AAAI 2011 paper entitled 'Identifying Missing Node Information in Social Networks'.
Authors' addresses: R. Eyal, S. Sina and S. Kraus, Computer Science Department, Bar Ilan University, Ramat-Gan, Israel 92500; A. Rosenfeld, Department of Industrial Engineering, Jerusalem College of Technology, Jerusalem, Israel 91160.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2013 ACM 1539-9087/2013/-ART0 $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000
ACM Transactions on Embedded Computing Systems, Vol. 0, No. 0, Article 0, Publication date: 2013.
To understand the importance of the missing node identification problem we introduce, consider the following example. A hypothetical company, Social Games Inc., is running an online gaming service within Facebook. Many Facebook members are subscribers of this company's services, yet it would like to expand its customer base. As a service provider, Social Games maintains a network of users, which is a subset of the group of Facebook users, and the links between these users. The Facebook users who are not members of the service are not visible to its systems. Social Games Inc. would like to discover these Facebook nodes and try to lure them into joining its service. The company thus faces the missing node identification problem. By solving this problem, Social Games Inc. could improve its advertising techniques and target the specific users who have not yet subscribed to its service.
The above example illustrates just one possible application of the missing node identification problem. In addition to commercial motivation, solving this problem could also be useful for personal interests and entertainment applications. This research direction has also particularly interested the security community. For example, missing nodes in some networks might represent missing persons who are sought after by family members wishing to know their full family tree, or people wanted by the police as suspects in a crime. As a result, solving the missing node identification problem can be of considerable importance.
We focus on a specific variation of the missing node problem where the missing nodes requiring identification are "friends" of known nodes. An unrecognized friend is associated with a "placeholder" node to indicate the existence of this missing friend. Thus, a given missing node may be associated with several "placeholder" nodes, one for each friend of this missing node. We assume that tools such as image recognition software or automated text analysis can be used to aid in generating placeholder nodes. For example, a known user might have many pictures with the same unknown person, or another user might constantly blog about a family member who is currently not a member of the network. Image recognition or text mining tools can be employed on this and all nodes in the social network in order to obtain indications of the existence of a set of missing nodes. Placeholders can then be used to indicate where these missing nodes exist. However, it is likely that many of these placeholders are in fact the same person. Thus, our focus is on identifying the missing nodes, and we frame the problem as: given a set of placeholders, which of these placeholders in fact represent the same person? Considering the assumption that placeholders are received from external data mining modules, this work is mainly focused on relatively small networks with a small number of missing nodes. Nonetheless, our proposed algorithm easily scales to networks of more than a hundred thousand nodes.
In this paper, we present a general method, entitled MISC (Missing node Identification by Spectral Clustering), for solving this problem. This method relies on a spectral clustering algorithm previously considered only for other problems [Almog et al. 2008; Ng et al. 2001]. One key issue in applying the general spectral clustering algorithm is defining a measure for identifying similar nodes to be clustered together. Towards solving this issue, we present five measures for judging node similarity, also known as affinity. One of these measures is the Gaussian Distance measure, typically used over Euclidean spaces in spectral clustering [Ng et al. 2001], while the other four measures are non-Euclidean measures which have been adapted from the related Link Prediction Problem [Liben-Nowell and Kleinberg 2007]. We found that the latter measures are especially useful in solving the missing node identification problem.
We begin this study in the next section by providing background on the spectral clustering algorithm and relevant social network research that relates to our solution for the Missing Node Identification problem. Then, in Section 3, we formally define the Missing Node Identification problem. In Section 4, we detail how the MISC algorithm can be applied, including how different node affinity measures inspired by the Link Prediction Problem can be applied within our solution. Section 5 presents our evaluation dataset and methodology, and Section 6 provides an empirical analysis of MISC's performance when the different affinity measures are used. Following this analysis, we provide two extensions to our base algorithm. In Section 7 we describe how our base algorithm can be extended to handle large networks of hundreds of thousands of nodes. In Section 8 we describe the comparison between the MISC algorithm and KronEM [Kim and Leskovec 2011], a state-of-the-art algorithm for the prediction of missing nodes. In Section 9 we suggest methods for modifying our base approach to address situations where information about the number of missing nodes is partial or unknown, or not all of the placeholders are known. Under the problem setting of unknown placeholders, we show that the MISC algorithm, based on spectral clustering, outperforms 'Missing Link Completion', an algorithm based on classical solutions to the Link Prediction Problem. Section 10 concludes and provides directions for future research.
2. RELATED WORK
In solving the Missing Node Identification problem, this research aims to use variations of two existing research areas: spectral clustering algorithms and metrics built for the Link Prediction Problem. The spectral clustering algorithm of Jordan, Ng and Weiss [Ng et al. 2001] is a well-documented and accepted algorithm, with applications in many fields including statistics, computer science, biology, social sciences and psychology [von Luxburg 2007]. While spectral clustering algorithms have been applied to many areas, their use in the Missing Node Identification problem has not been previously considered, and they cannot be directly applied from previous works.
The main idea behind the spectral clustering framework is to embed a set of data points which should be clustered in a graph structure in which the weights of the edges represent the affinity between each pair of points. Two points which are "similar" will have a larger edge weight between them than two points which are "dissimilar". The similarity, or affinity, function is used to calculate an affinity matrix which describes the pairwise similarity between points. The affinity matrix can be thought of as a weighted adjacency matrix, and therefore it is equivalent to an affinity graph. After generating the affinity graph, an approximation of the min-cut algorithm is employed in order to split the graph into partitions which maximize the similarity between intra-partition points and minimize the similarity between points in different partitions. This is done by manipulating the affinity matrix representing the affinity graph and solving an eigenvector problem.
While the general framework of this algorithm is to embed points in a Euclidean space as an affinity graph, our approach is to construct the affinity graph directly from the social network graph structure. The key challenge in applying the spectral clustering algorithm to the Missing Node Identification problem is how to compute the level of similarity between nodes in the social network while constructing the affinity matrix. Towards defining this measure, we consider adopting measures developed for a related problem, the Link Prediction Problem. In the Link Prediction Problem there is a set of known nodes, and the goal is to discover which connections, or edges, will be made between nodes [Clauset et al. 2008; Liben-Nowell and Kleinberg 2007]. In contrast, in the missing node problem, even the nodes themselves are not known, making the problem significantly more difficult. Nonetheless, we propose a solution where measures originally used to solve the Link Prediction Problem are utilized to form the affinity matrix in the first step of solving this problem as well.
Various methods have been proposed to solve the Link Prediction Problem. Approaches typically attempt to derive which edges are missing by using measures to predict link similarity based on the overall structure of the network. However, these approaches differ as to which computation is best suited for predicting link similarity. For example, Liben-Nowell and Kleinberg [Liben-Nowell and Kleinberg 2007] demonstrated that measures such as the shortest path between nodes and different measures relying on the number of common neighbors can be useful. They also considered variations of these measures, such as using an adaptation of Adamic and Adar's measure of the similarity between webpages [Adamic and Adar 2003] and Katz's calculation for shortest path information, which weights short paths more heavily [Katz 1953] than the simpler shortest path information. After formally describing the missing node identification problem, we detail how the spectral clustering algorithm can be combined with these link prediction methods in order to effectively solve the missing node identification problem.
Many other studies have researched problems of missing information in social networks. Guimerà and Sales-Pardo [Guimerà and Sales-Pardo 2009] propose a method which performs well for detecting missing links as well as spurious links in complex networks. The method is based on a stochastic block model, where the nodes of the network are partitioned into different blocks, and the probability of two nodes being connected depends only on the blocks to which they belong. Some studies focus on understanding the propagation of different phenomena through social networks as a diffusion process. These phenomena include viruses and infectious diseases, information, opinions and ideas, trends, advertisements, news and more. Gomez-Rodriguez et al. [Gomez-Rodriguez et al. 2012] attempted to infer a network structure from observations of a diffusion process. Specifically, they observed the times when nodes get infected by a specific contagion, and attempted to reconstruct the network over which the contagion propagates. The reconstruction is done through the edges of the network, while the nodes are known in advance. Eslami et al. [Eslami et al. 2011] studied the same problem. They modeled the diffusion process as a Markov random walk and proposed an algorithm called DNE to discover the most probable diffusion links.
Sadikov et al. [Sadikov et al. 2011] also studied the problem of diffusion of data in a partially observed social network. In their study they proposed a method for estimating the properties of an information cascade, the nodes and edges over which a contagion spreads through the network, when only part of the cascade is observed. While this study takes into account missing nodes and edges from the cascade, the proposed method estimates cumulative properties of the true cascade and does not produce a prediction of the cascade itself. These properties include the number of nodes, number of edges, number of isolated nodes, number of weakly connected components and average node degree.
Other works attempted to infer missing link information from the structure of the network or information about known nodes within the network. For example, Lin et al. [Lin et al. 2012] proposed a method for community detection, based on graph clustering, in networks with incomplete information. In these networks, the links within a few local regions are known, but links from the rest of the network are missing. The graph clustering is performed using an iterative algorithm named DSHRINK. Gong et al. [Gong et al. 2011] proposed a model to jointly infer missing links and missing node attributes by representing the social network as an augmented graph where attributes are also represented by nodes. They showed that link prediction accuracy can be improved when first inferring missing node attributes. Freno et al. [Freno et al.] proposed a supervised learning method which uses both the graph structure and node attributes to recommend missing links. A preference score which measures the affinity between pairs of nodes is defined based on the feature vectors of each pair of nodes. Their algorithm learns the similarity function over feature vectors of the graph structure. Kossinets [Kossinets 2003] assessed the effect of missing data on various networks and suggested that nodes may be missing, in addition to missing links. In that work, the effects of missing data on network-level statistics were measured, and it was empirically shown that missing data causes errors in estimating these parameters. While advocating its importance, that work does not offer a definitive statistical treatment to overcome the problem of missing data.
One can divide studies of the missing links problem into two groups: unsupervised methods and supervised ones. Unsupervised methods include works such as those done by Liben-Nowell and Kleinberg [Liben-Nowell and Kleinberg 2007], Katz [Katz 1953], and Gong et al. [Gong et al. 2011], which do not require the input data to be labeled in any way. Instead, these methods learn by inferring information based on the specific structure of the input network. The second group of methods requires some element of tagged data to be supplied in a training phase, where the data is typically explicitly labeled so that a classifier can be created. These methods include the works of Freno et al. [Freno et al.] and Gong et al. [Gong et al. 2011], who use a binary classifier in order to predict the missing links. Although supervised methods often yield better results, they require an additional step of manually tagging the input data, something that requires additional resources and time. Additionally, this approach assumes some similarity between the training dataset and the test dataset. In our work, we intentionally avoided supervised methods, as we wished to present a method free of these steps and assumptions, and thus opted to create an unsupervised method.
Most similar to our paper is the recent work by Kim and Leskovec [Kim and Leskovec 2011], which also tackled the issue of missing nodes in a network. This work deals with situations where only part of the network is observed, and the unobserved part must be inferred. The proposed algorithm, called KronEM, uses an Expectation Maximization approach, where the observed part is used to fit a Kronecker graph model of the network structure. The model is used to estimate the missing part of the network, and the model parameters are then reestimated using the updated network. This process is repeated in an iterative manner until convergence is reached. The result is a graph which serves as a prediction of the full network. This research differs from ours in several aspects. First, while the KronEM prediction is based on link probabilities provided by the EM framework, our algorithm is based on a clustering method and graph partitioning. Secondly, our approach is based on the existence of missing node indications obtained from data mining modules such as image recognition. When these indications exist, our algorithm can be directly used to predict the original graph. As a result, while KronEM is well suited for large-scale networks with many missing nodes, our algorithm may be effective in local regions of the network, with a small number of missing nodes, where the data mining can be employed. The results of experiments performed in this study show that our proposed algorithm, MISC, can achieve better prediction quality than KronEM, thanks to the use of indications of missing nodes.
3. FORMAL DEFINITION OF THE MISSING NODE IDENTIFICATION PROBLEM
We formally define the new Missing Node Identification problem as follows. Assume that there is a social network G = ⟨V, E⟩ in which e = ⟨v, u⟩ ∈ E represents an interaction between v ∈ V and u ∈ V. Some of the nodes in the network are missing and are not known to the system. We denote the set of missing nodes Vm ⊂ V, and assume that the number of missing nodes is given as N = |Vm|. We denote the rest of the nodes as known, i.e., Vk = V \ Vm, and the set of known edges is Ek = {⟨v, u⟩ | v, u ∈ Vk ∧ ⟨v, u⟩ ∈ E}, i.e., only the edges between known nodes.
Towards identifying the missing nodes, we focus on a part of the network, Ga = ⟨Va, Ea⟩, that we define as being available for the identification of missing nodes. In this network, each of the missing nodes is replaced by a set of placeholders. Formally, we define a set Vp of placeholders and a set Ep of the associated edges. For each missing node v ∈ Vm and for each edge ⟨v, u⟩ ∈ E, u ∈ Vk, a placeholder is created. That is, we add a placeholder v′ for v to Vp, and for the original edge ⟨v, u⟩ we add a placeholder edge ⟨v′, u⟩ to Ep. We denote the origin of v′ ∈ Vp with o(v′). Putting all of these components together, Va = Vk ∪ Vp and Ea = Ek ∪ Ep. In this setting, both the known nodes and the placeholders, along with their associated edges, are defined as being the portion of the network, Ga, that is available for the missing node identification process. The problem is that for a given missing node v there may be many placeholders in Vp. The challenge is to determine which of the placeholders are associated with v. This will allow us to reconstruct the original social network G.
Formally, we define the missing node identification problem as follows: Given a known network Gk = ⟨Vk, Ek⟩, an available network Ga = ⟨Va, Ea⟩ and the number of missing nodes N, divide the nodes of Va \ Vk into N disjoint sets Va1, . . . , VaN such that each Vai ⊆ Vp contains exactly the placeholders of vi ∈ Vm.
To better understand this formalization, consider the following example. Assume Alice is a Facebook member, and thus is one of the available nodes in Va. She has many known social nodes (people), Vk, within the network, but her cousin Bob is not a member of Facebook. Bob is represented by a missing node w ∈ Vm. From analysis of text in her profile we might find phrases like "my long lost cousin Bob", indicating the existence of the missing node representing Bob. Alternately, from examining profile pictures of Alice's friends, an image recognition software package might identify pictures of an unknown male person, resulting in another indication of the existence of a missing node. Each indication is represented by a placeholder node v′ ∈ Vp and a link ⟨u, v′⟩ ∈ Ep, where u is the known node (e.g., Alice) which contains the indication. By solving the missing node identification problem we aim to identify which of these indications point to the same missing node, representing Bob, and which represent other missing nodes.

Fig. 1: Illustration of the missing node identification problem. The whole graph represents the original network, G = ⟨V, E⟩. Nodes 1 and 5 are missing nodes. The sub-graph containing the nodes which appear inside the cloud is the known network, Gk = ⟨Vk, Ek⟩.

Fig. 2: Illustration of the available network obtained by adding placeholders for the missing nodes. The nodes which appear outside of the cloud are the placeholders. The sub-graph containing the nodes which appear inside the cloud is still the known network, Gk = ⟨Vk, Ek⟩. The whole graph represents the available network, Ga = ⟨Va, Ea⟩, which includes the known network and the placeholders.
4. THE MISC ALGORITHM
The main challenge in solving the missing node problem is discovering how to correctly identify the set of Vp placeholders as the set of Vm missing nodes. As we are limited to general structural knowledge about the portion of the network Ga that is available, solving this problem is far from trivial. Our key approach is to use clustering, and specifically spectral clustering, to group Vp into N clusters in order to identify the missing nodes. The challenge in using any clustering algorithm, including spectral clustering, is how to provide it with a well-defined value that quantifies distances between both the known portion of the network and the placeholders representing the missing nodes. To better understand this difficulty, this section first briefly introduces spectral clustering and focuses on the challenges in applying it to the missing node identification problem. Our novel contribution is that we apply a series of affinity measures, primarily based on literature from the related link prediction problem, which can be used to solve the missing node problem. These affinity measures are critical in providing a non-discrete measure between both known nodes and placeholders such that the clustering algorithm can operate on that input. Finally, we present our novel MISC algorithm for solving the missing node identification problem based on combining both the spectral clustering algorithm and these affinity measures.
4.1. Spectral Clustering
Spectral clustering is a general algorithm used to cluster data samples using a certain predefined similarity, or affinity, measure between them. The algorithm creates clusters which maximize the similarity between points in each cluster and minimize the similarity between points in different clusters. The algorithm accepts as its input a set of M sample coordinates in a multi-dimensional space, described by a matrix S. The algorithm also requires the number of clusters, K, to be known and given as input. In its original form, the algorithm constructs an affinity matrix which describes the affinity between each pair of samples based on the Euclidean distance in the multi-dimensional space. This affinity matrix is equivalent to an affinity graph, where each data sample is represented by a node and the affinity between two samples is represented by the weight of the edge between the two matching nodes. Spectral clustering solves the minimum cut problem by partitioning the affinity graph in a manner that maximizes the affinity within each partition and minimizes the affinity between partitions. Since finding this partitioning is NP-hard, spectral clustering finds an approximate solution to this problem. While the reader is encouraged to review the algorithm in its entirety [Ng et al. 2001], a brief description of this algorithm is presented below in Algorithm 1.
ALGORITHM 1: Spectral Clustering
Input: S - data samples in a multi-dimensional space; K - number of clusters; σ - standard deviation for the Gaussian distance function
Output: C ∈ N^{1×|S|} - a vector indicating a cluster index for each sample in S
1: Define si to be the coordinates of every sample i in the multi-dimensional space and calculate an affinity matrix A ∈ R^{M×M}, where M is the number of samples. A defines the affinity between all pairs of samples (i, j) using the Gaussian distance function: Aij = exp(−||si − sj||² / 2σ²).
2: Define D to be the diagonal matrix whose (i, i) element is the sum of A's i-th row, and construct the matrix L = D^{−1/2} A D^{−1/2}.
3: Find x1, x2, ..., xK, the K largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix X = [x1 x2 ... xK] ∈ R^{M×K} by stacking the eigenvectors in columns.
4: Form the matrix Y from X by renormalizing each of X's rows to have unit length.
5: Treating each row of Y as a point in R^K, cluster them into K clusters via k-means or any other clustering algorithm.
6: Assign the original sample i to cluster j if and only if row i of the matrix Y was assigned to cluster j.
The key to the success of Algorithm 1 is constructing the affinity matrix in the first step of the algorithm. This matrix must represent the pairwise affinity as accurately as possible according to the given problem setting. Note that this algorithm assumes data samples residing in some Euclidean space and thus defines affinity between samples accordingly. However, in the case of social network graphs, it is very difficult to embed the nodes of the graph in a Euclidean space as defined by the original algorithm. This is because it is unclear whether such a space can be defined which represents the actual distance between nodes in the graph.
To understand this difficulty, consider the triangle inequality, which is one of the basic characteristics of a Euclidean space. This inequality may not hold for many reasonable distance metrics between nodes. For instance, consider a distance metric that incorporates the number of common neighbors between two nodes. While nodes a and b may not have any common neighbors, causing their distance to be infinite, they may both have common neighbors with node c. The common neighbors measure for this example does not obey the triangle inequality, because under this measure d(a, b) > d(a, c) + d(c, b). In general, defining a space which obeys the triangle inequality is difficult because it requires an examination of more than two nodes at once when defining their distances. A distance metric which does obey this inequality is the shortest path length between two nodes, which we define later. Nonetheless, as we have found, this measure does not necessarily yield the best results for missing node identification. In order to bypass the Euclidean space difficulty, we have decided to alter the first step of the algorithm and to define direct, graph-based measures for building the affinity matrix A, rather than using the Euclidean space. While applying this algorithm to missing node identification, we need to address how this measure can be calculated to represent affinity between nodes in a social network. Due to the complexity of social networks, and the need to compute this measure in a multi-dimensional space, calculating this measure is far from trivial. We consider several different methods, most of which have been proven to be useful for solving the Link Prediction Problem in social networks [Adamic and Adar 2003; Katz 1953; Liben-Nowell and Kleinberg 2007]. We empirically compare these methods with the Euclidean method described in the original algorithm. These methods are discussed in detail in the next section.
4.2. Incorporating Link Prediction Measures
The key novelty within this paper is that we propose that affinity measures be constructed based on general graph measures or previously developed measures for the related Link Prediction Problem [Adamic and Adar 2003; Katz 1953; Liben-Nowell and Kleinberg 2007]. Specifically, we discuss how five such measures can potentially be applied to the spectral clustering algorithm [Ng et al. 2001]:
(1) Gaussian Distance: Define D_ij to be the length of the shortest path between nodes i and j. Define D_i to be the vector of the lengths of the shortest paths from node i to all other nodes. Calculate A_ij as in step 1 of the original spectral clustering algorithm: A_ij = exp(−||D_i − D_j||^2 / 2σ^2).
(2) Inverse Squared Shortest Path (ISSP): A_ij = 1 / (D_ij)^2, where D_ij is the length of the shortest path between nodes i and j.
(3) Relative Common Neighbors: A_ij = |Γ(i) ∩ Γ(j)| / min(|Γ(i)|, |Γ(j)|), where Γ(i) is defined as the group of neighbors of node i in the network graph.
(4) Adamic / Adar: A_ij = Σ_{k ∈ Γ(i) ∩ Γ(j)} 1 / log(|Γ(k)|).
(5) Katz Beta: A_ij = Σ_{k=1}^{∞} β^k · (number of paths between nodes i and j of length exactly k).
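As a concrete illustration, the two neighborhood-based measures above can be computed directly from an adjacency structure. The following is a minimal sketch; the toy graph and function names are illustrative, not the paper's implementation:

```python
import math

# Toy undirected graph as an adjacency dict (illustrative example data).
G = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def relative_common_neighbors(G, i, j):
    # |Γ(i) ∩ Γ(j)| / min(|Γ(i)|, |Γ(j)|)
    common = G[i] & G[j]
    return len(common) / min(len(G[i]), len(G[j]))

def adamic_adar(G, i, j):
    # Sum over common neighbors k of 1 / log(|Γ(k)|).
    # Assumes every common neighbor has degree > 1, so log(|Γ(k)|) != 0.
    return sum(1.0 / math.log(len(G[k])) for k in G[i] & G[j])

print(relative_common_neighbors(G, 2, 4))  # 1.0: {1, 3} common, min degree 2
print(adamic_adar(G, 2, 4))                # 2 / ln(3), both common neighbors have degree 3
```
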
All five of these measures present possible similarity values for creating the affinity matrix, A_ij, in the spectral clustering algorithm. The Gaussian Distance is based on the standard distance measure often used in spectral clustering [von Luxburg 2007]. The Inverse Squared Shortest Path (ISSP) measure is based on the length of the shortest path between two points, i and j, here representing two nodes in the social network. This measure presents an alternative to the original Gaussian Distance used to measure Euclidean distances, and is not directly inspired by the Link Prediction Problem. In contrast, the next three measures were directly inspired by this literature. The Relative Common Neighbors measure checks the number of neighbor nodes that i and j have in common (|Γ(i) ∩ Γ(j)|). We divide by min(|Γ(i)|, |Γ(j)|) in order to avoid biases towards nodes with a very large number of neighbors. Similarly, the Adamic / Adar measure also incorporates the common neighbor measure (Γ(i) ∩ Γ(j)), checking the overall connectivity of each common neighbor to other nodes in the graph and giving more weight to common neighbors who are less connected. Since the nodes that act as placeholders for missing nodes only have one neighbor each, the common neighbors and the Adamic/Adar measures do not represent these nodes well. Therefore, for these measures only, we also consider them to be connected to their neighbors' neighbors. Last, the Katz measure [Katz 1953] directly sums the number of paths between i and j, using a parameter β which is exponentially damped by the length of each path. Similar to the common neighbor measure, this measure also stresses shorter paths more heavily. Finally, for evaluating these algorithms, we consider a baseline Random Assignment algorithm that assigns each placeholder to a random cluster; it represents the most naive of assignment algorithms.
4.3. Solving the Missing Node Identification Problem

Generally, our solution for solving the missing node problem is based on the following three steps:

(1) Cluster the placeholder nodes, using a spectral clustering approach, thus creating disjoint groups of placeholders which have a high probability of representing the same missing node.
(2) Unite all of the placeholders from each cluster into one node, attempting to predict the missing node that is most likely represented by the placeholders in that cluster.
(3) Connect the predicted node to all neighbors of the placeholders that were in the matching cluster, in the hope of creating a graph that is as similar as possible to the original graph G.
Clustering the placeholders is the crucial part of this approach. When this clustering is performed perfectly, the placeholders in each cluster will all originate from the same missing node. As a result, the predicted node created from each cluster will match a specific missing node from the original network graph from which the missing nodes were removed. In this case the predicted graph will be identical to the original graph. When clustering is not fully successful, some clusters will contain placeholders which originate from different missing nodes, the predicted nodes will not exactly match the missing nodes, and the predicted graph will be less similar to the original graph.
Fig. 3: Correct clustering of the placeholders for missing nodes 1 and 5. The placeholders in each cluster are united into one node which represents a missing node.
Specifically, Algorithm 2, MISC, presents how this solution is accomplished. This algorithm accepts the known and available parts of the network graph, as described in the problem definition. We also assume at this point that the number of missing nodes, N, is given. N is used to define the number of clusters that the spectral clustering algorithm will create. The final input is α, a procedure for calculating the pairwise affinity of nodes in the graph. An example of such a procedure could be one that implements the calculation of one of the affinity measures adopted from the Link Prediction Problem.
ALGORITHM 2: - MISC (Missing node Identification by Spectral Clustering)
Input: Gk = ⟨Vk, Ek⟩ - the known part of the network
Ga = ⟨Va, Ea⟩ - the available part of the network
N - the number of missing nodes
α : G(V,E) → R^{|V|×|V|} - a procedure for calculating the affinity matrix of nodes in a graph
Output: C ∈ N^{|Va\Vk|} - a vector indicating the cluster index of each placeholder node; Ĝ = (V̂, Ê) - prediction of the full network graph

1: A ∈ R^{|Va|×|Va|} ← α(Ga) - calculate the affinity matrix of the available nodes in the graph
2: Perform steps 2-4 of the Spectral Clustering Algorithm (Algorithm 1) to calculate Y, using N as the input K to Algorithm 1
3: Y′ ← {Y_i | v_i ∉ Vk} - keep only the rows of Y which match the placeholder nodes
4: C ← k_means(Y′, N) - cluster the rows which match the placeholder nodes into N clusters
5: V̂ ← Vk, Ê ← Ek - initialize the output graph to contain the known network
6: For each cluster c ∈ C create a new node v_c ∈ V̂
7: For each placeholder v in cluster c and edge (u, v) ∈ Ea, create an edge (u, v_c) ∈ Ê
8: Return C, Ĝ = (V̂, Ê)
The first two steps of this algorithm are based on the spectral clustering algorithm and measures described in the previous two sections. This algorithm first calculates the affinity matrix of the available nodes using the given procedure α (step 1). Note that we use this algorithm with any one of the five affinity measures described in the previous section. Next, steps 2-4 of the original spectral clustering algorithm are followed in order to calculate the matrix Y (step 2). This is a transformation that spectral clustering performs in order to transform the data points into a vector space in which k-means clustering can be employed. The number of missing nodes, N, is an input parameter in MISC. It is used as the number of clusters for the spectral clustering algorithm, which is marked as K. Even though we only need to cluster the placeholders, the matrix Y in the spectral clustering algorithm contains all of the M data samples as coordinates in an N dimensional space. Spectral clustering performs k-means clustering on the samples in this space. In the context of MISC, each row of Y corresponds to a node in Ga embedded in an N dimensional space.
Step 3 of the algorithm is specific to our problem. Here, the rows of Y which correspond to the known nodes in Vk are removed. As opposed to general clustering problems where all the data must be clustered in an unsupervised manner, in our case most of the nodes are known and the challenge is to find a correct clustering for the placeholders. Therefore, for the sake of clustering and unifying only the placeholders, only the rows corresponding to placeholders in Y are kept.
On the other hand, the affinity between the known nodes and the placeholders contains important information which should be utilized. For this reason, all the nodes are embedded in the affinity matrix, and all the columns in Y remain. Notice that in this manner, the information obtained from the known nodes in the embedding process is still present in the matrix Y′ in the form of the coordinates matching the placeholders (see step 3 of Algorithm 2).
In step 4 of Algorithm 2, the remaining rows, which correspond to placeholders, are clustered using k-means clustering. In this step also, N is used as the number of clusters K. This is because every cluster is used to predict a missing node in the following steps. According to spectral clustering theory [Ng et al. 2001], k-means clustering is expected to achieve better results when employed using the coordinates depicted in Y or Y′ than when applied to the original data.
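To make the embedding and row selection concrete, the following is a minimal sketch in the spirit of the Ng et al. normalization. The tiny example graph, the `embed` helper, and the choice of which rows play the placeholder role are all illustrative assumptions, not the paper's code:

```python
import numpy as np

def embed(A, N):
    """Steps 2-4 of spectral clustering (sketch): normalized affinity,
    top-N eigenvectors, row-normalized embedding Y."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0                             # guard isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A @ D_inv_sqrt             # normalized affinity matrix
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    X = vecs[:, -N:]                            # N largest eigenvectors
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms                            # rows of Y, unit length

# Two obvious groups: nodes {0, 1} and {2, 3}; nodes 1 and 3 stand in
# for placeholders in this toy setting.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
Y = embed(A, 2)
Y_ph = Y[[1, 3]]                                # keep only "placeholder" rows
# Rows from different components end up orthogonal, hence far apart.
print(np.linalg.norm(Y_ph[0] - Y_ph[1]))        # ≈ sqrt(2)
```

In this disconnected example, nodes of the same component map to the same embedded point, which is exactly why k-means on the rows of Y (or Y′) separates the groups cleanly.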
In steps 5-8 the predicted graph is created by uniting all of the placeholders in each cluster into a new node and connecting this node to the known nodes which were neighbors of the original placeholders in the corresponding cluster. Recall that each cluster represents a predicted missing node. The placeholders in that cluster each represent a link of the predicted missing node, and for this reason the links are created between the new node and the neighbors of the placeholders in the corresponding cluster. Another advantage of removing the known nodes in step 3 is clear at this point: each resulting cluster contains only placeholders, and can therefore represent a missing node from the original graph.
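Steps 5-8 of Algorithm 2 amount to a simple graph rewrite, sketched below. The data layout (edge sets, a `cluster_of` mapping as produced by k-means) is an illustrative assumption:

```python
# Sketch of MISC steps 5-8: unite the placeholders in each cluster into one
# predicted node and connect it to the placeholders' known neighbors.
# All names (unite_placeholders, cluster_of, ...) are illustrative.

def unite_placeholders(known_nodes, known_edges, placeholder_edges, cluster_of):
    """cluster_of maps each placeholder to its cluster index."""
    V_hat = set(known_nodes)
    E_hat = set(known_edges)
    for ph, c in cluster_of.items():
        new_node = ('predicted', c)       # one new node per cluster
        V_hat.add(new_node)
        for (u, v) in placeholder_edges:  # edges stored as (known, placeholder)
            if v == ph:
                E_hat.add((u, new_node))  # reconnect the placeholder's neighbor
    return V_hat, E_hat

known_nodes = {1, 2, 3}
known_edges = {(1, 2), (2, 3)}
# Placeholders 'a' and 'b', each attached to exactly one known node.
placeholder_edges = {(1, 'a'), (3, 'b')}
cluster_of = {'a': 0, 'b': 0}             # both assigned to the same cluster
V_hat, E_hat = unite_placeholders(known_nodes, known_edges,
                                  placeholder_edges, cluster_of)
print(V_hat)  # the known nodes plus one predicted node ('predicted', 0)
print(E_hat)  # includes (1, ('predicted', 0)) and (3, ('predicted', 0))
```
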
5. DATASET DESCRIPTION AND EVALUATION METHODOLOGY

To empirically study the Missing Node Identification problem and assess the proposed solutions, we must be able to simulate the problem setting using real world data. Within this section we first describe the dataset utilized and the methods of synthesizing the problem. We then discuss the evaluation measures and methodology used to compare the proposed solutions empirically.
5.1. Dataset Description

For an empirical evaluation of our method we use a previously developed social network dataset, the Facebook MHRW dataset [Gjoka et al. 2010]. This dataset contains structural information sampled from Facebook, including over 900,000 nodes and the links between them. For each node certain social characteristics are stored, such as shared academic networks (e.g. all people from Harvard University), corporate networks (e.g. workers in AIG), geographical networks (e.g. members from Idaho), or networks of people who share similar interests (e.g. love of chocolate). All nodes are anonymized as numbers without any indication of their true identity.
The main challenge we had to address in using a dataset of this size was processing the data within a tractable period and overcoming memory constraints. In order to create more tractable datasets, we considered two methods of creating subsets of the Facebook data [Gjoka et al. 2010]. In the first method, we create a subset based on naturally occurring similarities between nodes according to users' network membership characteristics within the social network. Each subset is created by sampling all the nodes in a specific user network and the links between these nodes. Nodes with only one link or no links at all are removed. The advantage of this method of creating the subsets is that there is a higher chance of affiliation between the nodes in the user network as compared to random nodes selected from the entire social network. However, the disadvantage is that the nodes which make up the user network may not be completely connected, for example if the user network is comprised of several disconnected components. In fact, the subgraph of nodes that is part of a specific user network may be very sparse. Another disadvantage is that the user networks in the dataset are limited in size and therefore large subgraphs cannot be created from them. In contrast, for the second method of creating a subset, we begin with the entire dataset, and extract a subset based on a BFS walk starting from a random node in the dataset. Here no previous information about the social network is necessary, but the BFS generated subset may not accurately represent the actual topology of the entire network.
In order to synthesize the missing node problem within these two subsets, we randomly mark N nodes as the set of missing nodes, Vm. We then remove these nodes from the network, and replace each link (v, u) between v ∈ Vm and u ∈ Vk with a placeholder node v′ ∈ Vp and a link (v′, u) ∈ Ep. The resulting network Ga is the available network used as input to our proposed MISC algorithm for solving the Missing Node Identification problem.
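The synthesis procedure above (remove N random nodes, replace each of their links with a degree-one placeholder) can be sketched as follows; the function name and data layout are illustrative assumptions:

```python
import random

def synthesize_missing(nodes, edges, n_missing, seed=0):
    """Remove n_missing random nodes; replace each link between a missing
    node and a known node with a fresh degree-one placeholder (a sketch
    of the setup in Section 5.1)."""
    rng = random.Random(seed)
    missing = set(rng.sample(sorted(nodes), n_missing))
    known = nodes - missing
    kept_edges = {(u, v) for (u, v) in edges
                  if u not in missing and v not in missing}
    ph_nodes, ph_edges = set(), set()
    idx = 0
    for (u, v) in edges:
        for m, k in ((u, v), (v, u)):       # check both endpoints
            if m in missing and k in known:
                ph = f'ph{idx}'; idx += 1   # one placeholder per lost link
                ph_nodes.add(ph)
                ph_edges.add((k, ph))
    return known, kept_edges, ph_nodes, ph_edges

nodes = {1, 2, 3, 4}
edges = {(1, 2), (2, 3), (3, 4), (1, 4)}    # a 4-cycle: every node has degree 2
known, kept, phs, ph_edges = synthesize_missing(nodes, edges, 1)
print(len(phs))  # 2: one placeholder per link of the removed node
```
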
5.2. Evaluation Measures

We considered two types of evaluation measures to measure the effectiveness of the solutions presented. One measure is a straightforward similarity measure that compares the output graph of the MISC algorithm, Ĝ = (V̂, Ê), to the original network graph, G, from which the missing nodes were removed.
Within the first measure, we quantify the similarity of two graphs based on the accepted Graph Edit Distance (GED) measure [Bunke and Messmer 1993; Kostakis et al. 2011]. The GED is defined as the minimal number of edit operations required to transform one graph into the other. An edit operation is an addition or deletion of a node or an edge. Since finding the optimal edit distance is NP-Hard, we use a previously developed simulated annealing method [Kostakis et al. 2011] to find an approximation of the Graph Edit Distance. The main advantage of this method of evaluation is that it is independent of the method used to predict Ĝ, making it very robust. For instance, GED can be used to compare two methods which might create a different number of clusters from each other, if the number of clusters is unknown in advance. GED can also be used to compare methods that cluster a different number of placeholders, for example, if not all of the placeholders are given. In fact, it can be used to compare any two methods, as long as they both produce a predicted graph.
A disadvantage of computing the GED lies in its computational cost. This cost is high because the GED measure is based on finding an optimal matching of the nodes in Ĝ to the nodes in G which minimizes the edit distance. Even if we take advantage of the fact that most of the nodes are known nodes from Vk whose matching is known, there are still N unknown nodes to be aligned: the missing nodes in G to the nodes generated from each cluster in Ĝ. This allows for N! possibilities. Therefore, a heuristic search method must be used, as we do, which may lead to over-estimations of the edit distance.
Due to this expense, a purity measure, which can be computed simply, can be used instead. The purity measure attempts to assess the quality of the placeholders' clustering. This can be done since we know in advance the correct clustering of the placeholders. Under the correct clustering, each cluster would group together the placeholders originating from the same missing node. A high quality algorithm would produce a clustering which is most similar to the true clustering of the placeholders. The clustering quality is tested using the purity measure which is often used to evaluate clustering algorithms [Strehl and Ghosh 2003]. This measure is calculated in the following manner:

(1) Classify each cluster according to the true classification of the majority of samples in that cluster. In our case, we classify each cluster according to the most frequent true original node v ∈ Vm of the placeholder nodes in that cluster.
(2) Count the number of correctly classified samples in all clusters and divide by the number of samples. In our case the number of samples (nodes) that are classified is |Vp|.
Formally, in this problem setting, purity is defined as:

purity(C) = (1/|Vp|) Σ_k max_{v_j ∈ Vm} |c_k ∩ {v′ ∈ Vp | o(v′) = v_j}|,

where c_k is defined as the set of placeholders which were assigned to cluster k. Note that as the number of missing nodes increases, correct clustering becomes more difficult, as there are more possible original nodes for each placeholder. As our initial results show, the purity indeed decreases as the number of missing nodes increases (see Figures 5 and 6). This evaluation method works well because it is easy to calculate and it reflects the quality of the algorithm well when comparing clustering algorithms which create the same number of clusters. Its disadvantage is that it can be biased depending on the number of clusters: if the same data is divided into a large number of clusters it is easier to achieve higher purity. Therefore this method is used only when the number of clusters is known in advance, and there is no advantage to selecting a large number of clusters.
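The two-step purity computation above fits in a few lines. The cluster and origin structures below are illustrative; `origin` plays the role of the mapping o(v′) from placeholders to their true missing nodes:

```python
def purity(clusters, origin):
    """clusters: list of sets of placeholders;
    origin: placeholder -> true originating missing node."""
    total = sum(len(c) for c in clusters)          # |Vp|
    correct = 0
    for c in clusters:
        counts = {}
        for ph in c:                               # tally true origins in cluster
            counts[origin[ph]] = counts.get(origin[ph], 0) + 1
        correct += max(counts.values())            # majority-origin count
    return correct / total

origin = {'p1': 'A', 'p2': 'A', 'p3': 'B', 'p4': 'B'}
perfect = [{'p1', 'p2'}, {'p3', 'p4'}]             # matches the true clustering
mixed = [{'p1', 'p3'}, {'p2', 'p4'}]               # each cluster split between A and B
print(purity(perfect, origin))  # 1.0
print(purity(mixed, origin))    # 0.5
```
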
5.3. Evaluation Methodology

In this subsection we describe our evaluation methodology and the experimental setup (see flowchart in Figure 4). The flow begins with the full Facebook MHRW dataset, containing over 900,000 nodes. This dataset is sampled by running BFS from a random node or by selecting a specific user network, as described in Subsection 5.1. When running BFS, the sampling is repeated ten times, each starting from a different random node. In other words, ten different networks are created for each size, in order to avoid randomization errors and biases. In the following step, N nodes are randomly selected as missing nodes. This is repeated several times for each one of the networks, depending on the experiment, resulting in different random instances of the problem setting for each combination of graph size and N. From each instance the randomly selected missing nodes are removed. Each missing node is replaced with a placeholder for each of its links to remaining nodes in the network. The resulting graph is fed into the MISC algorithm, which produces a clustering of the placeholders. By uniting the placeholders in each cluster into a new node and connecting the new node to the neighbors of the placeholders in the corresponding cluster, MISC creates a predicted network. This predicted network is supposed to resemble as closely as possible the structure of the original network from which the random nodes were removed. The clustering produced by MISC is evaluated using the purity measure described above, and the average purity value achieved over all of the instances is reported. The Graph Edit Distance is also evaluated by comparing the predicted network to the original network; this result is likewise averaged over the different instances and reported. In each experiment report we indicate the number of iterations performed. Nearly all points in the graphs of the relevant figures below are the average of at least 100 runs of the algorithm on randomly generated configurations. However, for highly time consuming experiments, such as the comparison with the KronEM algorithm, only 40 iterations of the experiment were run. Nevertheless, the results were significant even with this number of iterations.
Fig. 4: A figure explaining the evaluation methodology for the
experiments within this work.
6. COMPARING AFFINITY MEASURES

To assess the quality of our algorithm we ran experiments on the subgraphs obtained from the Facebook dataset. Each experiment consisted of synthesizing the missing node problem as described in Section 5, and running the spectral clustering algorithm using each one of the proposed affinity measures. We measured the purity achieved when using each of the affinity measures and compared the results to the purity of a random clustering, where each placeholder was randomly assigned to a cluster. Each experiment described in this section was repeated over 150 times, each time with randomly selected missing nodes. The results displayed indicate the average over all experiments.
The results obtained from these experiments clearly show that the MISC algorithm obtained significantly better results than the random clustering, regardless of which affinity measure was used, demonstrating the success of this method in solving the missing node identification problem. We also compared the relative success of the different affinity measures described in Section 4.2. We find that Inverse Squared Shortest Path, Relative Common Neighbors and Adamic / Adar performed slightly better than the Gaussian Distance and Katz Beta measures on this dataset. Figure 5 shows the purity achieved when using each of the affinity measures with the spectral clustering algorithm, on fifteen subgraphs containing 2,000 nodes each, obtained by BFS from a random node in the full dataset. Each point in the figure represents the average purity from the experiments described above. MISC performed much better than the random clustering with each one of the affinity measures. When comparing the affinity measures, it seems that the Inverse Squared Shortest Path was slightly better than the others. The same experiments were performed on subgraphs of user networks taken from the full Facebook dataset. The following networks were selected:
(1) Network 362 - containing 3,189 nodes and 3,325 undirected links
(2) Network 477 - containing 2,791 nodes and 3,589 undirected links
(3) Network 491 - containing 2,874 nodes and 3,143 undirected links
Fig. 5: Purity achieved when using each of the affinity
measures.
It is important to note that these subgraphs are very sparse, as each node has very few links on average. As a result, removing random nodes can cause the graph to become disconnected. Given this, it is interesting to see the performance of the spectral clustering algorithm on these subgraphs, shown in Figure 6. We can see that spectral clustering is also effective on these subgraphs, achieving much higher purity than the random clustering. Among the affinity measures, Inverse Squared Shortest Path (ISSP) performed the worst in this case and Adamic/Adar performed the best. This may be due to the fact that disconnected components lead to paths of infinite length, and ISSP is more sensitive to that.
Fig. 6: Purity achieved when using each of the affinity measures
on user network graphs.
As the key contribution of this paper lies in its use of affinity measures to solve the missing node problem, we considered whether a second clustering algorithm, k-means, could be used instead of spectral clustering. Overall, we found that the k-means algorithm also produced similar results, yet the spectral clustering algorithm overall performed significantly better. To evaluate this claim, we used 10 samples of 10,000 node networks. From each network, we randomly removed a set of missing nodes of sizes 30, 50, 100, 150 and 200. Then, we considered how the k-means
clustering algorithm could be used instead. Within both the spectral clustering and k-means algorithms, affinity matrices are needed to provide the input for the clustering algorithms. However, we considered two variations for which nodes to process the affinity measures within the k-means algorithm. Within the first variation, we considered both the known network nodes and the placeholder nodes. Within the second variation, we only processed the placeholder nodes. We randomly generated 8 networks, and for each network setting generated 10 sets of missing nodes. Thus, each datapoint in Figure 7 is averaged over 80 different experiments. As can be seen in this figure, we considered three different affinity measures: Relative Common Neighbors (far left of the figure), Adamic Adar (middle of the figure) and Katz Beta 0.05 (far right of the figure). Note that for all three affinity measures, the spectral clustering algorithm performed better than both of the k-means variations. We also checked for normal distribution through the Shapiro-Wilk test. We found that the results for the k-means algorithms, with the exception of the 200 missing nodes case for the Adamic Adar and Katz Beta affinity measures, were normally distributed. We then performed the t-test to validate the results' significance. We found that the spectral clustering algorithm performed significantly better than both of the k-means variations. In all cases the p-values were much lower than the 0.05 significance threshold. The largest p-value was observed in the case of the Relative Common Neighbors affinity measure, where the p-values between the spectral clustering and both k-means variations were still much less than 0.0001.
Fig. 7: Comparing the purity achieved by the spectral clustering algorithm (SC datapoints in the figure) versus k-means clustering on the network and the missing nodes (k-means datapoints in the figure) and k-means clustering on only the placeholders (k-means on PHs datapoints in the figure).
7. SCALING TO LARGE NETWORKS

Using a straightforward implementation of our proposed algorithm, we have been able to analyze network graphs of up to 5,000 nodes within a tractable time. This is typically not enough to deal with recent datasets of social networks, which can reach hundreds of thousands of nodes or more. Processing these datasets imposes serious memory and time constraints, and therefore efficient algorithms need to be developed and used for this task. In this section we propose two approaches to address this challenge by enabling efficient calculation of the affinity matrix and of the spectral clustering algorithm. We accomplish this through two novel extensions of the base MISC algorithm: adding Sparse Matrix Representation and Dimension Reduction.
7.1. Sparse Matrix Representation and Algorithms

Social networks generally have a sparse nature, i.e. each node is connected to only a small fraction of the nodes. Most of the possible links do not exist, and therefore |E| ≪ |V|^2. A commonly used representation for a graph G = ⟨V,E⟩ is a matrix G^{|V|×|V|} where G_{i,j} = 1 if (i, j) ∈ E, and G_{i,j} = 0 otherwise. When representing a social network as an adjacency matrix, most of the entries of the matrix will be zero. Using sparse matrix representations can significantly reduce the memory load and help deal with a much larger amount of data. It is possible to maintain only the non-zero values in order to preserve memory and efficiently perform calculations. The sparse property can also be utilized to improve the time complexity of the affinity matrix calculation. However, the affinity matrix itself may not be sparse in some cases, depending on the affinity measure used. For instance, if the Inverse Squared Shortest Path measure is used, and the graph is fully connected, then each pair of nodes has a non-zero affinity value. In this section we detail the algorithms for calculating the affinity matrix and analyze them in terms of space and time complexity, detailing how the Katz Beta affinity can be calculated while exploiting sparseness. While we also considered how to modify the Relative Common Neighbors and Adamic/Adar affinity measures, the application of the general approach presented in this section to them is straightforward. Nonetheless, we include how these measures can be modified in Appendix A so that the results presented in this paper can be replicated.
ALGORITHM 3: - Katz Beta Affinity
Input: Ga = ⟨Va, Ea⟩ - the available part of the network
K - the maximal path length to consider
β - the damping factor
Output: A ∈ R^{|Va|×|Va|} - affinity matrix indicating the pairwise affinity

1: P ← Ga - initialize P as the adjacency matrix of Ga
2: A1 ← β · P - initialize A as the adjacency matrix of Ga, damped by β
3: for k = 2 to K do
4:   P ← P · Ga - count the number of paths of length k by matrix multiplication
5:   Ak ← Ak−1 + β^k · P - accumulate the number of paths, damped by β^k
6: end for
7: Return AK
Algorithm 3 describes the Katz Beta affinity calculation. As before, we assume that the given network graph, from which the affinity matrix is calculated, is Ga = (Va, Ea). We assume that the matrix is given in a sparse representation, i.e. a list of neighbors is given for each node. Let d be the maximal degree of a node in the graph. In step 1, the matrix P, which contains the number of paths between each pair of nodes, is initialized. At this point it contains only the number of paths of length 1, and it is therefore equal to the adjacency matrix which defines Ga. In step 2, the affinity matrix is also initialized, at this point taking into account paths of length 1, damped by β. In steps 3-6 we iteratively increase the maximal path length taken into account until we reach paths of length K. In step 4, P is updated in each iteration k to Ga^k, indicating the number of paths of length k between each pair of nodes. P is then accumulated into Ak after damping by β^k, in step 5.
Multiplication of sparse n × n matrices can be done in O(mn), where m is the number of non-zero elements. In our case, n = |Va| and m = O(|Va| · d) for the matrix Ga. In P, the number of non-zero elements grows in each iteration by a factor of up to d. Therefore, in iteration k, step 4 requires O(|Va|^2 d^k) operations. The time complexity of steps 3-6, and of the entire procedure, is then O(Σ_{k=1}^{K} |Va|^2 d^k) = O(|Va|^2 d^K).
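Algorithm 3 translates almost line-for-line into a sparse-matrix sketch. The use of SciPy's CSR format and the toy path graph below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def katz_beta_affinity(adj, K, beta):
    """Algorithm 3 sketch: accumulate beta^k * (#paths of length k), k = 1..K."""
    P = adj.copy()                  # step 1: paths of length 1
    A = beta * P                    # step 2: damped by beta
    for k in range(2, K + 1):       # steps 3-6
        P = P @ adj                 # step 4: paths of length k
        A = A + (beta ** k) * P     # step 5: accumulate, damped by beta^k
    return A

# 3-node path graph 0-1-2 (illustrative example).
adj = csr_matrix(np.array([[0, 1, 0],
                           [1, 0, 1],
                           [0, 1, 0]], dtype=float))
A = katz_beta_affinity(adj, K=2, beta=0.1)
# A[0,2] = 0.1^2 * 1 = 0.01: one path of length 2 (0-1-2), no direct edge.
print(A.toarray())
```

Because `P` and `adj` stay in sparse format, each multiplication only touches non-zero entries, which is exactly the source of the O(mn) cost per step discussed above.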
7.2. Dimension Reduction

In the first step of the original MISC algorithm described above, we calculate the affinity matrix A ∈ R^{|Va|×|Va|}, representing the pairwise affinity of nodes in Va. This matrix is then fed into the spectral clustering algorithm. Each row of A corresponds to one of the nodes, represented in a vector space. Each column corresponds to a single dimension in the space in which these nodes reside. In steps 2-4 of spectral clustering (Algorithm 1), the nodes are embedded in a Euclidean space R^N. By selecting the N largest eigenvectors of L in step 3, MISC reduces the number of dimensions given to the k-means algorithm in step 5 from |Va| to N. However, steps 2 and 3 still remain computationally expensive, dealing with a large matrix with |Va| rows and columns.
We propose a variation of MISC that includes dimension reduction, R-MISC, which can reduce the inherent complexity within the MISC algorithm, allowing us to tractably solve the missing node problem in much larger networks. R-MISC reduces the size of the affinity matrix, greatly decreasing the computational complexity of steps 2 and 3 within the original MISC algorithm. Intuitively, when the number of placeholders is relatively small compared to the total size of the network, many network nodes are "distant" from all of the placeholders in the sense that they have no affinity, or only a very small affinity, with the placeholders. Since only the placeholders are clustered in our problem setting, we propose that the dimensions representing these nodes can be removed from the affinity matrix with only a small loss in performance of the spectral clustering algorithm, while significantly reducing the time and the memory used by the spectral clustering component of the MISC algorithm. Note that while these nodes have a very small affinity with the placeholders, they may have a high affinity with other nodes which are not removed. Thus, removing them may still affect the result of the spectral clustering. Yet, as our results show, the effect is minimal.
To test this method, we conducted experiments comparing the performance of R-MISC relative to MISC. After calculating the affinity matrix common to both algorithms, within R-MISC we remove the rows and columns that correspond to nodes that have zero affinity with all of the placeholders, resulting in a much smaller, square affinity matrix which is then used in the following steps of the original MISC algorithm.
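The row/column pruning step of R-MISC can be sketched in a few lines of NumPy. The dense toy matrix and helper name below are illustrative; in practice the affinity matrix would be held in sparse form:

```python
import numpy as np

def reduce_affinity(A, placeholder_idx):
    """R-MISC reduction sketch: drop rows/columns of nodes with zero
    affinity to every placeholder; always keep the placeholders themselves."""
    ph_cols = A[:, placeholder_idx]              # affinity of each node to placeholders
    keep = np.where(ph_cols.sum(axis=1) > 0)[0]  # nodes touching some placeholder
    keep = np.union1d(keep, placeholder_idx)     # placeholders are never dropped
    return A[np.ix_(keep, keep)], keep

# 4 nodes; node 3 is the placeholder; node 2 has zero affinity to it.
A = np.array([[0.0, 0.5, 0.0, 0.2],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0]])
A_red, kept = reduce_affinity(A, np.array([3]))
print(kept)        # nodes 1 and 2 are dropped; 0 and the placeholder remain
print(A_red.shape) # a much smaller square matrix
```

Note that node 1, which has affinity only with other known nodes, is also dropped; as discussed above, this may slightly perturb the embedding, but the results show the effect is minimal.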
Fig. 8: The effect of dimension reduction on 10,000 node graphs in terms of time (left) and purity (right), given different numbers of missing nodes (X-axis on both sides).
Fig. 9: The effect of dimension reduction on 50,000 node graphs in terms of time (left) and purity (right), given different numbers of missing nodes (X-axis on both sides).
We then studied the impact of employing dimension reduction within different network sizes and for different numbers of missing nodes within those networks. First, we studied how this approach is impacted by network size. In Figures 8 and 9 we present results from networks constructed with 10,000 and 50,000 nodes, respectively. We also studied additional networks with 20,000, 30,000 and 40,000 nodes and found R-MISC to be equally effective in these intermediate sizes as well. Second, we studied how R-MISC would perform for different numbers of missing nodes. We studied networks with N = 10, 50, 100, 200, 300 and 500 missing nodes. Note that these values constitute the x-axis within Figures 8 and 9.
Last, we studied the impact that the network size and different missing node sizes had on performance. In doing so, we compared both the accuracy, as measured by purity, and the time, measured in seconds, for both the original MISC algorithm and the R-MISC variation with dimension reduction. The left side of Figures 8 and 9 presents the results from the time experiments, and the right side presents the corresponding purity measures. As expected, the dimension reduction significantly reduces the time needed to run the spectral clustering component of the algorithm, while only having a small impact on R-MISC's performance. These graphs present results from the Relative Common Neighbors affinity measure; however, similar results were obtained when the other affinity measures were used. In all cases, the results clearly show that despite the overhead of performing the dimension reduction, significant time is saved, and this method allows us to scale R-MISC to much larger networks within a tractable time. Note that as these experiments only pertain to the time used by the spectral clustering step of the algorithm, the time element of these results does not depend on the affinity measure used. Thus, the time results are the same for all affinity measures.
Fig. 10: The impact of network size on the MISC and R-MISC algorithms for networks with 100 (left) and 500 (right) missing nodes.
Fig. 11: Purity achieved by the R-MISC algorithm when using different affinity measures on 100,000 node graphs, using dimension reduction.
We also considered how scalable the R-MISC algorithm would be by considering networks with up to 100,000 nodes. We again considered networks with 100, 200, 300, 400 and 500 missing nodes. Figure 10 presents the results from the experiments with 100 and 500 missing nodes. Note from these results that only the R-MISC algorithm was able to tractably solve the networks with 100,000 nodes. Furthermore, the time required by the R-MISC algorithm increases relatively slowly with the network size. For example, within the 500 missing node problem, the time required for the R-MISC algorithm to solve for 500 missing nodes within a 30,000 node network is 504.30 seconds, while the time to solve for 500 nodes within 40,000, 50,000 and 100,000 node networks is 602.99, 683.77, and 702.83 seconds respectively. Thus, we conclude from these results that R-MISC is a scalable algorithm.
Figure 11 displays the purity achieved by using three different affinity measures in the R-MISC algorithm, on graphs containing 100,000 nodes. The results are compared to a random clustering. The ability to analyze such large graphs is achieved due to dimension reduction and sparse matrix representation. The results of the tested measures are consistent with results from experiments on smaller graphs, with Adamic/Adar and Relative Common Neighbors performing slightly better than Katz Beta.
8. COMPARING MISC TO KRONEM
Recall from Section III that the Missing Node Problem is relatively new, with very few algorithms available to compare with MISC. Nonetheless, we did perform comparisons between MISC and the only other known algorithm suitable for the missing node problem, KronEM [Kim and Leskovec 2011]. KronEM accepts a known graph and the number of missing nodes, and outputs a graph which predicts the missing nodes and their links. This algorithm is not based on the existence of placeholders and therefore does not use them. Another key difference between MISC and KronEM is that KronEM assumes that the number of nodes in the full graph is a power of two.
Nonetheless, to facilitate a fair comparison between MISC and KronEM, we generated 10 networks, each containing 2^11 = 2048 nodes sampled from the full Facebook dataset. From each network, we randomly removed N missing nodes. The selected values for N were 11, 21, 31, 41 and 50, which are, respectively, approximately 0.5%, 1%, 1.5%, 2% and 2.5% of the network. The experiment was repeated four times for each selected value of N in each network, resulting in 40 iterations per missing node percentage. Each resulting network was fed to the KronEM and MISC algorithms.
The output of each algorithm was compared to the original network graph using Graph Edit Distance, since KronEM cannot be evaluated by Purity, as it is not a clustering based algorithm. As a baseline algorithm, we also measured the GED obtained by the random clustering approach in this set of experiments. An important observation is that MISC only predicts the existence of links between the predicted nodes and the neighbors of the original missing nodes. KronEM, on the other hand, might predict an erroneous link between a predicted node and any other node in the network which was not connected to the missing node. This is due to the fact that KronEM does not take advantage of the placeholders. Therefore, in order to fairly compare the two, when calculating the GED, we only considered edit operations relating to links between the predicted nodes and the neighbors of the missing nodes. In addition, even though KronEM predicts a directed graph, we treated each predicted link as undirected, since this slightly improved the results of KronEM.
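The restricted Graph Edit Distance used in this comparison can be sketched roughly as follows, under the simplifying assumption that each predicted node has already been matched to the missing node it represents; all names are illustrative:

```python
def restricted_ged(true_edges, predicted_edges, neighbor_set):
    """Edge-edit count restricted to links touching known neighbors of
    the missing nodes, assuming predicted nodes are already matched to
    the missing nodes they represent. Edges are frozensets of two node
    ids, so direction is ignored, matching the undirected treatment of
    KronEM's output. Names are illustrative.
    """
    # Keep only edges with at least one endpoint among the neighbors.
    true_r = {e for e in true_edges if e & neighbor_set}
    pred_r = {e for e in predicted_edges if e & neighbor_set}
    # One edit (insertion or deletion) per edge in exactly one graph.
    return len(true_r ^ pred_r)
```

The symmetric difference counts one edit operation per edge that appears in exactly one of the two restricted edge sets.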
8.1. KronEM Configuration
The KronEM algorithm accepts the known network, without the placeholders, and the number of missing nodes. Additional parameters are the initial initiator matrix of the Kronecker graph, the set of parameters for the Gibbs sampling and the number of iterations [Kim and Leskovec 2011].
The authors recommended using at least 3 kinds of initial initiator matrices, which represent different relationships between the nodes. Accordingly, we ran the KronEM algorithm with the 3 recommended variations of the initial initiator matrix, which represent the following relationships: (a) Homophily - love of the same; (b) Heterophily - love of the different; and (c) Core-periphery - links are most likely to form between members of the core and least likely to form between members of the periphery. For all 3 variations we used the same parameters¹. Each iteration took several hours to run using the C++ implementation we received from the authors of [Kim and Leskovec 2011].
8.2. MISC Configuration
For this comparison, we used MISC with three different affinity measures: Adamic/Adar, Katz Beta 0.05 and Relative Common Neighbors. Even though the tested network size was only 2048 nodes, we still used the dimension reduction and sparse network representation features of MISC, since they were shown to speed up the calculation with only a slight impact on the predictive performance. Each iteration of the algorithm ran for less than 10 seconds in Matlab².
8.3. Results
The results show that both the MISC and KronEM algorithms are significantly better than the random clustering baseline. Among the three variations of KronEM, the results were almost identical, yet the Homophily variation produced slightly better results as the missing node percentage grew.
Among the three affinity measures for the MISC algorithm, there was also a small difference, yet Relative Common Neighbors performed slightly better than Adamic/Adar and Katz Beta 0.05. All three MISC variations performed significantly better than KronEM with the Homophily initiator matrix. The best obtained results from each algorithm are displayed in Figure 12. Table 1 displays the mean result and standard deviation achieved by each algorithm over the 40 iterations over each missing node percentage and four random draws of missing nodes from each of the 10 networks.
¹In running the KronEM algorithm we set the algorithm parameters to the values described in the ReadMe file in their algorithm's distribution. As such, the EM parameter which controls the number of iterations was set to 50, the l parameter controlling the gradient learning rate was 1e-5, the mns parameter controlling the minimum gradient step was 1e-4, and the mxs parameter controlling the maximum gradient step was 1e-2. The tests were run on Windows Web Server 2008 with a 64 bit operating system, 4 GB RAM and an Intel dual core 2.27 GHz CPU.
²The MISC experiment ran on an Intel Core i5, 2.67 GHz CPU with 4 GB of RAM, running Matlab 2010a and a Windows 7 OS.
[Figure 12 plot: Graph Edit Distance (y-axis) vs. Number of Missing Nodes (x-axis), comparing KronEM Homophily and MISC Relative Common Neighbors.]
Fig. 12: Comparison of MISC to KronEM on 2,048 node networks using Graph Edit Distance (lower is better).
To determine the significance of these results, we first used the Shapiro-Wilk test to determine if the data was normally distributed. We found that the results of many of the networks that were studied were in fact not normally distributed. As a result, we used the Wilcoxon Signed-Rank Test to determine whether the differences in Table 1 were statistically significant. Based on this test, we found that the differences were significant in all cases.
We conclude that while KronEM can be an excellent solution when a large part of the network is unknown and there are no indications of missing nodes, MISC can perform better when the placeholders are given and when trying to identify a small number of missing nodes.
9. DEALING WITH PARTIAL AND INACCURATE PLACEHOLDER INFORMATION
The algorithms and experiments described in this work have made assumptions regarding the information that is available on the social network. For instance, the number of missing nodes was assumed to be given as an input to the algorithm. In addition, it was assumed that all the placeholders can be identified with complete certainty. When a node was removed from the original graph in the experiments described above, a placeholder was attached to each one of that node's links. If we assume that indications of missing nodes in the network are obtained from a data mining module, such as image recognition and text processing, it is likely that this information would be partial and noisy. For instance, the number of missing nodes may not be known, not all placeholders may be known, and there may be false alarms indicating placeholders which do not actually exist. In this section we present methods aimed at addressing these issues.
Specifically, in this section we consider three different types of placeholder uncertainty. In the first type of case, we do not know the exact number of missing nodes; in this case, we must estimate this value using the methods we present. We consider a second case where insufficient placeholders exist to correctly identify all missing nodes. Last, we consider the case where extra placeholders exist. In this case, we assume that the actual missing nodes are found within the set of placeholders, but extraneous information exists that indicates additional placeholders which do not correspond to actual nodes within the network.
For the first two categories of problems (unknown number of missing nodes and missing placeholders) we use evaluations based on Graph Edit Distance, as the actual number of missing nodes is unknown based on the data. In these problems, the purity measure is not appropriate, as the actual number of missing nodes is needed to calculate it. In contrast, Graph Edit Distance is based on calculating the number of transformations between the original and predicted networks, something that can be calculated even without knowing the number of missing nodes. In the last type of uncertainty, where there are extra placeholders, we again assume that the number of placeholders is known, allowing us to consider the purity evaluation measure as well.
9.1. Estimating the Number of Missing Nodes
In many real-life scenarios, a partial network may be available while the number of missing nodes is unknown. In the MISC algorithm, the number of clusters is set to equal the number of missing nodes, N. Recall that the placeholders in each cluster are united into one predicted node which is connected to all of the neighbors of these placeholders. Each
Table I: Comparison of MISC and KronEM Algorithms

Missing Node %   Algorithm               Mean GED   Std.
0.5%             KronEM Core-Periphery     36.32    11.98
                 KronEM Heterophily        36.18    12.16
                 KronEM Homophily          36.03    11.16
                 MISC Adamic/Adar          32.55    15.44
                 MISC Katz Beta 0.05       31.03    16.00
                 MISC CommNeighbors        31.52    16.47
1%               KronEM Core-Periphery     69.72    19.57
                 KronEM Heterophily        70.20    20.82
                 KronEM Homophily          70.08    19.39
                 MISC Adamic/Adar          67.80    26.83
                 MISC Katz Beta 0.05       66.45    28.76
                 MISC CommNeighbors        64.50    27.57
1.5%             KronEM Core-Periphery    101.82    23.68
                 KronEM Heterophily       102.63    27.08
                 KronEM Homophily         100.70    23.92
                 MISC Adamic/Adar          99.68    33.47
                 MISC Katz Beta 0.05       97.65    33.15
                 MISC CommNeighbors        95.60    32.99
2%               KronEM Core-Periphery    133.20    28.08
                 KronEM Heterophily       134.75    29.71
                 KronEM Homophily         130.33    26.74
                 MISC Adamic/Adar         127.67    37.42
                 MISC Katz Beta 0.05      127.05    37.26
                 MISC CommNeighbors       123.62    35.92
2.5%             KronEM Core-Periphery    162.98    30.98
                 KronEM Heterophily       167.25    36.26
                 KronEM Homophily         158.60    29.39
                 MISC Adamic/Adar         154.73    37.42
                 MISC Katz Beta 0.05      155.23    40.00
                 MISC CommNeighbors       150.35    39.66
Table II: Studying the Accuracy of the Estimation for the Number of Missing Nodes

Missing Nodes                    10      20      30      50      70     100   Total
Quadratic Degree Estimation  0.1850  0.1725  0.1308  0.1115  0.0921  0.0728  0.1275
Mean Degree Estimation       0.1850  0.1725  0.1300  0.1125  0.0921  0.0743  0.1277
predicted node is supposed to resemble one of the missing nodes. When the number of missing nodes is unknown, it must be estimated so that the predicted graph is as similar as possible to the original graph.
Our general approach is to estimate the number of missing nodes based on the general structure of social networks. We assume that a set ratio exists between nodes within the network and their neighbors. This relationship has been previously studied within the missing link and node literature [Gomez-Rodriguez et al. 2012; Lin et al. 2012; Kim and Leskovec 2011]. We considered two methods, quadratic and mean estimations. Within the quadratic estimation we assumed that the number of edges in the network follows a quadratic relation of the number of the network's nodes. This was also observed in the work of Kim and Leskovec [Kim and Leskovec 2011]. Accordingly, we first calculated this ratio according to the known part of the network, i.e. |Ek| = a · |Vk|², and we can then calculate the estimated number of clusters, N̂, as the solution of the equation |Ek ∪ Ep| = a · (|Vk| + N̂)². The estimated N̂ is rounded to the nearest integer and used in the MISC algorithm. Within the mean estimation we assigned d as the average degree of a node in Va. The expected number of clusters is then |Vp|/d. The estimated number of clusters, N̂, is rounded to the nearest integer and used in the MISC algorithm. This estimation causes |Vp| nodes to be clustered into N̂ clusters, and the average number of placeholders per cluster is then |Vp|/N̂. As a result, the predicted nodes have the same average degree as the nodes in the available graph Va. This is due to the fact that each placeholder has one link, and each
Fig. 13: Graph Edit Distance achieved when N is estimated and when N is known in advance. Tested on 10,000 node graphs.
predicted node will have the links of all the placeholders which were clustered into the relevant cluster representing that missing node.
We first studied how accurate these estimations were in predicting the actual number of missing nodes within the network. To do so, we ran an experiment using 10 samples of 10,000 node networks. From each network, we randomly removed a set of missing nodes of sizes 10, 20, 30, 50, 70 and 100. We calculated the two estimations according to the remaining networks and the placeholders. We then calculated the mean absolute error between each estimation and the actual number of missing nodes, and repeated this test 4 times. The results in Table II show the mean absolute error for these 40 runs. Note that both estimation methods yielded similar results, and that as the number of missing nodes increases, the error is reduced. This implies that both estimations actually become more precise as more nodes are missing. One possible explanation for this is that as the number of missing nodes increases, small variations within localized portions of the network become less significant, making these estimations more accurate. Next, to measure the performance of using only an estimate of the number of missing nodes, we compared the predictive performance of the MISC algorithm when the actual number of clusters is given to the performance obtained when using an estimate. In this case, the Graph Edit Distance measure was used, since each instance of the experiment may result in a different number of clusters, depending on the selected subset of the network and the randomly chosen missing nodes. As both the mean and quadratic estimations yield similar results, Figure 13 displays the results of these experiments when using the mean estimate. While there is some visible degradation of the results achieved when N was unknown, the algorithm was still able to achieve beneficial results. For instance, the proposed algorithm which estimated the value of N still performed better than a random clustering algorithm which was given the correct value in advance, especially as N increased. A possible reason for some degradation in the results might be a large variance in the degrees of the missing nodes, causing a large variance in |Vp|, because each neighbor of a missing node is connected to a placeholder in the experiment. Thus, when a node with a large degree was randomly selected for removal, this led to an overestimate of N. The estimation of N is expected to be much more precise if the subset of the network is selected in such a way that the variance in the nodes' degrees is smaller.
9.2. Addressing Missing Placeholders
Many real-life scenarios are likely in which some of the placeholders are missing, for example, if the indications for some of the placeholders were not obtained. Under the assumption that a data mining module provides us with the indications of the existence of placeholders, this module may provide incomplete results. This mistaken information will not only include false negatives from the non-detection of placeholders, but false positives may also exist in the form of false alarms. To assess how robust MISC is with incomplete placeholder information, we measured how this partial information would affect the missing nodes' identification. For this purpose we conducted a set of experiments where the percentage of known placeholders ranged from 10% to 100% of the total placeholders that are created when removing the missing nodes. Figure 14 displays the Graph Edit Distance achieved by the algorithm for each percentage of known placeholders. The affinity measures used in these experiments were Katz Beta and Adamic/Adar. The results
Fig. 14: Graph Edit Distance between the original graph and the graph predicted by the MISC algorithm when only some of the placeholders are known.
clearly show that the performance decreases when fewer placeholders are known, resulting in a higher Graph Edit Distance between the original graph G and the predicted graph Ĝ. Each point in the graph represents the average Graph Edit Distance achieved from 100 experiments. This scenario raises questions regarding the MISC algorithm's ability to address missing information within the placeholders. First, when some placeholders are unknown, the resulting graph predicted by the spectral clustering algorithm would lack the edges between each unknown placeholder and its neighbor. In addition, adapting the spectral clustering algorithm to deal with probabilistic placeholders is not trivial. A possible approach is to run a Link Prediction algorithm in order to complete the edges which are missed due to the unknown placeholders. To formulate this new problem we alter the input of the original missing node identification problem. Recall that Vp is the group of all of the placeholders generated from the missing nodes and their edges. We define the following new groups: V_p^k, E_p^k, the group of known placeholders and their associated edges, and V_p^u, E_p^u, the group of unknown placeholders and their associated edges, such that Vp = V_p^k ∪ V_p^u and V_p^k ∩ V_p^u = ϕ. For each missing node v ∈ Vm, and for each edge ⟨v, u⟩ ∈ E, u ∈ Vk, we add a placeholder v′ either to V_p^k or to V_p^u, and an edge ⟨v′, u⟩ to E_p^k or to E_p^u accordingly. The available network graph, Ga = ⟨Va, Ea⟩, now consists of Va = Vk ∪ V_p^k and Ea = Ek ∪ E_p^k. In addition, we define the indicator function I(v) which returns the value 1 for each known node v ∈ Vk if it is connected to an unknown placeholder, and otherwise returns 0.
I(v) = 1 if ∃u ∈ V_p^u : ⟨v, u⟩ ∈ E_p^u
       0 otherwise
The value of I(v) is of course unknown to the system. Instead, we model the data mining module's knowledge as S(v), a noisy view of I(v) with additive random noise X(v), where X is an unknown random variable.

S(v) = I(v) + X(v)
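One way to simulate this indication model is sketched below. Drawing X(v) from a Gaussian is an illustrative assumption, since the text leaves the distribution of X unspecified; all names are hypothetical:

```python
import random

def noisy_indications(known_nodes, connected_to_unknown, noise_std=0.1):
    """Simulate S(v) = I(v) + X(v) for each known node v.

    connected_to_unknown is the set of known nodes adjacent to some
    unknown placeholder, so I(v) = 1 exactly for these nodes. X(v) is
    drawn as zero-mean Gaussian noise, an illustrative choice.
    """
    S = {}
    for v in known_nodes:
        i_v = 1 if v in connected_to_unknown else 0
        S[v] = i_v + random.gauss(0.0, noise_std)
    return S
```

With noise_std set to 0, S(v) reduces to the indicator I(v) itself.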
The formal problem definition for this problem setting is then: given a known network Gk = ⟨Vk, Ek⟩, an available network Ga = ⟨Va, Ea⟩, the value of S(v) for each v ∈ Vk and the number of missing nodes N, divide the nodes of Va\Vk into N disjoint sets Vv1, ..., VvN such that Vvi ⊆ Vp are all the placeholders of vi ∈ Vm, and connect each set to additional nodes in Vk such that the resulting graph has a minimal graph edit distance (GED) from the original network G. We propose two possible algorithms for solving this problem: Missing Link Completion and Speculative MISC. Both algorithms accept a threshold T, which indicates a minimal value of S(v) to be taken into account as a valid indication of a placeholder. The Missing Link Completion algorithm at first ignores S(v), and runs the original MISC algorithm only on the available graph. Then, in order to complete links that are missing from the predicted graph due to not knowing all the placeholders, a Link Prediction algorithm is used. This algorithm simply connects each node that has a value of S(v) which is higher than T to the new node with which it has the strongest affinity. Algorithm 6 details the pseudo code of the Missing Link Completion algorithm. The Speculative MISC algorithm uses a different approach regarding S(v). Under this approach, a new placeholder is added and connected to every node v whose indication value S(v) is greater than T. Next, the regular MISC algorithm (Algorithm 2) is used on the new graph to predict the original graph. To compare these two algorithms, we conducted experiments where only 50% of the placeholders are known. For each node v ∈ Vk we calculated S(v) as described above, indicating if that node
ALGORITHM 4: Missing Link Completion
Input:
  Gk = ⟨Vk, Ek⟩ {the known part of the network}
  Ga = ⟨Va, Ea⟩ {the available part of the network}
  S(v), ∀v ∈ Vk {the available indications about the unknown placeholders}
  N {the number of missing nodes}
  α : G(V, E) → R^(|V|×|V|) {a function for calculating the affinity matrix of nodes in a graph}
  T {the minimal value of S(v) to be taken into account as a valid indication of a placeholder}
Output: Ĝ = ⟨V̂, Ê⟩ - prediction of the original network G
1: Ĝ = ⟨V̂, Ê⟩ ← MISC(Gk, Ga, N, α) {Apply the MISC algorithm while ignoring S(v)}
2: Â = α(Ĝ) {Calculate the pairwise affinity of the nodes in the predicted graph Ĝ}
   {Iterate over all the known nodes with a value of S(v) higher than the threshold}
3: for each v ∈ Vk : S(v) > T do
4:   u = argmax_{w ∈ V̂\Vk} Â(v, w) {find the new node u with the strongest affinity to v}