
Node Embedding with Adaptive Similarities for Scalable Learning over Graphs

Dimitris Berberidis, Student Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE

Abstract—Node embedding is the task of extracting informative and descriptive features over the nodes of a graph. The importance of

node embedding for graph analytics as well as learning tasks, such as node classification, link prediction, and community detection,

has led to a growing interest and a number of recent advances. Nonetheless, node embedding faces several major challenges.

Practical embedding methods have to deal with real-world graphs that arise from different domains, with inherently diverse underlying

processes as well as similarity structures and metrics. On the other hand, similar to principal component analysis in feature vector

spaces, node embedding is an inherently unsupervised task. Lacking metadata for validation, practical schemes motivate

standardization and limited use of tunable hyperparameters. Finally, node embedding methods must be scalable in order to cope with

large-scale real-world graphs of networks with ever-increasing size. The present work puts forth an adaptive node embedding

framework that adjusts the embedding process to a given underlying graph, in a fully unsupervised manner. This is achieved by

leveraging the notion of a tunable node similarity matrix that assigns weights on multihop paths. The design of multihop similarities

ensures that the resultant embeddings also inherit interpretable spectral properties. The proposed model is thoroughly investigated,

interpreted, and numerically evaluated using stochastic block models. Moreover, an unsupervised algorithm is developed for training

the model parameters efficiently. Extensive node classification, link prediction, and clustering experiments are carried out on many

real-world graphs from various domains, along with comparisons with state-of-the-art scalable and unsupervised node embedding

alternatives. The proposed method enjoys superior performance in many cases, while also yielding interpretable information on the

underlying graph structure.

Index Terms—SVD, SVM, unsupervised, multiscale, random walks, spectral


1 INTRODUCTION

UNSUPERVISED node embedding is an exciting field in which a significant amount of progress has been made in recent years [15]. The task consists of mapping each node of a graph to a vector in a low-dimensional Euclidean space. The main goal is to extract features that can be utilized downstream in order to perform a variety of unsupervised or (semi-)supervised learning tasks, such as node classification, link prediction, or clustering [16]. Ideally, the embedded nodal vectors should convey at least as much information as the original graph. Nevertheless, an appropriate embedding can boost the performance of certain learning tasks because it allows one to work with the more "friendly" and intuitive Euclidean representation, and deploy mature and widely implemented feature-based algorithms such as (kernel) support vector machines (SVMs), logistic regression, and K-means.

Early embedding works mostly focused on a structure-preserving dimensionality reduction of feature vectors (instead of nodes); see for instance [22], [23], [24], [25], [26]. In this context, graphs are constructed from pairwise feature vector relations and are treated as representations of the manifold that data lie on; embedded vectors are then generated so that they preserve the corresponding pairwise proximities on the manifold. More recently, nodal vector embedding of a graph has attracted considerable attention in different fields, and is often posed as the factorization of a properly defined node similarity matrix [27], [28], [29], [30], [31], [32], [33], [34]. Efforts in this direction mostly focus on designing meaningful similarity metrics to factorize. While some methods (e.g., [27], [29]) maintain scalability by factorizing similarity matrices in an implicit manner (without explicitly forming them), others such as [30], [31] form and/or factorize dense similarity matrices that scale poorly to large graphs. Another line of work opts to gradually fit pairs of embedded vectors to existing edges using stochastic optimization tools [35], [37]. Such approaches are naturally scalable and entail a high degree of locality. Recently, stochastic edge-fitting has been generalized to implicitly accommodate long-range node similarities [36]. Meanwhile, other works have approached node embeddings using random-walk-based tools and concepts originating from natural language processing [38], [39], [40]; see also related works on embedding of knowledge graphs [41], [42], [50]. Methods that rely on graph convolutional neural networks and autoencoders have also been proposed for node embedding [45], [46], [47]. Moreover, a gamut of related embedding tasks are gaining traction, such as embedding based on structural roles of nodes [43], [44], supervised embeddings for classification [11], and inductive embedding methods that utilize multiple graphs [6].

The authors are with the Department of Electrical and Computer Engineering, and Digital Technology Center, University of Minnesota, Minneapolis, MN 55455 USA. E-mail: {bermp001, georgios}@umn.edu.

Manuscript received 3 Dec. 2018; revised 1 June 2019; accepted 16 July 2019. Date of publication 29 July 2019; date of current version 11 Jan. 2021. (Corresponding author: Dimitris Berberidis.) Recommended for acceptance by F. Rusu. Digital Object Identifier no. 10.1109/TKDE.2019.2931542

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 2, FEBRUARY 2021 637

1041-4347 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.


We identify the following challenges that need to be addressed in order to design embedding methods that are applicable in practice:

• Diversity. Since graphs that arise from different domains are generally characterized by a diverse set of properties, there may not be a "one-size-fits-all" node embedding approach.

• No supervision. At the same time, node embedding may need to be performed in a fully unsupervised manner, that is, without extra information (node attributes, labels, or ground-truth communities) to guide the parameter tuning process with cross-validation.

• Scalability. While some real-world networks are of moderate size, others may contain massive numbers of nodes and edges. Specifically, graphs encountered in social networks, transportation networks, knowledge graphs, and others typically scale to millions of nodes and tens of millions of edges. Thus, strict computational constraints must be accounted for in the design of node embedding methods.

In response to these challenges, we propose a scalable node embedding framework that is based on factorizing an adaptive node similarity matrix. The first challenge is addressed by utilizing a large family of node similarity metrics, parametrized by placing different weights on node proximities of different orders; see also our precursor work [20]. Experiments indicate that the proposed model for similarity metrics is expressive enough to describe real-world graphs from diverse domains and with different structures. To address the second challenge (lack of supervision), we put forth a self-supervised parameter learning scheme based on predicting randomly removed edges. Finally, we accommodate scalability by constraining the parametrization of similarity matrices such that the proximity order parameters carry over to the embedded vectors in a smooth manner. This allows for learning proximity order parameters directly on the feature vectors. Consequently, dense similarity matrices do not need to be explicitly formed and factorized, thus endowing the proposed method with the desired level of scalability.

The rest of the paper is organized as follows. Section 2 introduces the problem and the proposed similarity model. Section 3 presents a numerical study on model properties, while Section 4 deals with learning the model parameters in an unsupervised manner. Finally, Section 5 discusses related methods, and Section 6 contains experiments on real graphs, comparisons with competing alternatives, and interpretation of the results. While notation is defined wherever it is introduced, we also summarize the most important symbols that appear throughout the paper in Table 1.

2 PROBLEM STATEMENT AND MODELING

Given an undirected graph $\mathcal{G} := \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ is the set of $N$ nodes, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges, the task of node embedding boils down to determining $f(\cdot) : \mathcal{V} \to \mathbb{R}^d$, where $d \ll N$. In other words, a function is sought to map every node of $\mathcal{G}$ to a vector in the $d$-dimensional Euclidean space. Typically, the embedding is low dimensional with $d$ much smaller than the number of nodes. Given $f(\cdot)$, the low-dimensional vector representation of each node $v_i$ is

$$\mathbf{e}_i = f(v_i) \quad \forall v_i \in \mathcal{V}.$$

Since the number of nodes is finite, instead of finding a general $f(\cdot)$ (induction), one may pose the embedding task in its most general form as the following minimization problem over the embedded vectors

$$\{\mathbf{e}_i^*\}_{i=1}^N = \arg\min_{\{\mathbf{e}_i\}_{i=1}^N} \sum_{v_i, v_j \in \mathcal{V}} \ell\big(s_G(v_i, v_j),\, s_E(\mathbf{e}_i, \mathbf{e}_j)\big), \tag{1}$$

where $\ell(\cdot,\cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss function; $s_G(\cdot,\cdot) : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is a similarity metric over pairs of graph nodes; and $s_E(\cdot,\cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a similarity metric over pairs of vectors in the $d$-dimensional Euclidean space.

In line with (1), node embedding can be viewed as the design of nodal vectors $\{\mathbf{e}_i\}_{i=1}^N$ that successfully "encode" a certain notion of pairwise similarities among graph nodes.

2.1 Embedding as Matrix Factorization

Starting from the generalized framework in (1), one may arrive at concrete approaches by specifying choices of $s_G(\cdot,\cdot)$, $s_E(\cdot,\cdot)$, and $\ell(\cdot,\cdot)$. To start, suppose that the node similarity metric is symmetric; that is, $s_G(v_i, v_j) = s_G(v_j, v_i)\ \forall v_i, v_j \in \mathcal{V}$. Furthermore, let the loss function be quadratic

$$\ell(x, x') = (x - x')^2,$$

and the nodal vector similarity be the inner product

$$s_E(\mathbf{e}_i, \mathbf{e}_j) = \mathbf{e}_i^\top \mathbf{e}_j.$$

Using these specifications, (1) reduces to the following symmetric matrix factorization problem

$$\mathbf{E}^* = \arg\min_{\mathbf{E} \in \mathbb{R}^{N \times d}} \|\mathbf{S}_G - \mathbf{E}\mathbf{E}^\top\|_F^2, \tag{2}$$

TABLE 1
Important Notation

$\mathcal{V}$ : Set of nodes
$\mathcal{E}$ : Set of edges
$\mathbf{A}$ : $N \times N$ adjacency matrix
$\mathbf{D}$ : $\mathrm{diag}(\mathbf{1}^\top \mathbf{A})$ diagonal degree matrix
$\mathbf{E}$ : $N \times d$ matrix of embeddings
$\mathbf{e}_i$ : Embedding vector of node $v_i$
$s_G(\cdot,\cdot)$ : Node-to-node similarity
$s_k(\cdot,\cdot)$ : $k$-hop node-to-node similarity
$s_E(\cdot,\cdot)$ : Embedding-to-embedding similarity
$\ell(\cdot,\cdot)$ : Distance (loss) between similarities
$\mathbf{S}_G$ : Final node similarity matrix
$\mathbf{S}$ : Basic sparse (single-hop) and symmetric node similarity matrix
$\theta_k$ : Coefficient of $k$-hop paths
$\boldsymbol{\theta}$ : $[\theta_1, \ldots, \theta_K]^\top$ vector of coefficients
$\mathcal{S}_K$ : $K$-dimensional probability simplex
$\mathcal{S}^+$ : Set of sampled positive edges
$\mathcal{S}^-$ : Set of all sampled negative edges
$\mathcal{S}$ : $\mathcal{S}^+ \cup \mathcal{S}^-$ all sampled edges
$N_s$ : Number of sampled edges
$\boldsymbol{\theta}^*_{\mathcal{S}}$ : Optimal coefficients that fit sample $\mathcal{S}$
$T_s$ : Number of different edge samples


where $\mathbf{S}_G \in \mathbb{R}^{N \times N}$ is the symmetric similarity matrix with $[\mathbf{S}_G]_{i,j} = [\mathbf{S}_G]_{j,i} = s_G(v_i, v_j)$, and matrix $\mathbf{E} := [\mathbf{e}_1 \ldots \mathbf{e}_N]^\top$ concatenates all node embeddings as rows. A well-known analytical solution to (2) relies on the singular value decomposition (SVD) of the similarity matrix, that is $\mathbf{S}_G = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, where $\mathbf{U}$ and $\mathbf{V}$ are the $N \times N$ unitary matrices formed by the left and right singular vectors, and $\boldsymbol{\Sigma}$ is diagonal with non-negative singular values sorted in decreasing order; in our case, $\mathbf{U} = \mathbf{V}$ since $\mathbf{S}_G$ is symmetric. Given the SVD of $\mathbf{S}_G$, the low-rank ($d \ll N$) solution to (2) is $\mathbf{E}^* = \mathbf{U}_d \boldsymbol{\Sigma}_d^{1/2}$, where $\boldsymbol{\Sigma}_d$ contains the $d$ largest singular values, and $\mathbf{U}_d$ the corresponding singular vectors [19]. Matrices $\mathbf{U}_d$ and $\boldsymbol{\Sigma}_d$ can be obtained directly using the reduced-complexity scheme known as truncated SVD.
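As a minimal sketch of the closed-form solution to (2), the snippet below computes $\mathbf{E}^* = \mathbf{U}_d \boldsymbol{\Sigma}_d^{1/2}$ with NumPy. The helper name `factorize_similarity` and the toy matrix are illustrative assumptions; the sketch further assumes a positive semidefinite $\mathbf{S}_G$ (so that eigenvalues and singular values coincide) and uses a dense eigendecomposition in place of the truncated SVD one would use on a large graph.

```python
import numpy as np

def factorize_similarity(S_G, d):
    """Return E (N x d) such that E @ E.T is the best rank-d fit of a symmetric PSD S_G."""
    # eigh returns eigenvalues in ascending order; for a PSD matrix they
    # coincide with the singular values, so we keep the d largest ones.
    eigvals, eigvecs = np.linalg.eigh(S_G)
    order = np.argsort(eigvals)[::-1][:d]
    sigma_d = np.clip(eigvals[order], 0.0, None)  # guard against round-off below zero
    U_d = eigvecs[:, order]
    return U_d * np.sqrt(sigma_d)  # scales column m of U_d by sqrt(sigma_m)

# Toy example (hypothetical data): a rank-2 PSD similarity matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 2))
S_G = B @ B.T
E = factorize_similarity(S_G, d=2)
residual = np.linalg.norm(S_G - E @ E.T)  # Frobenius norm of the fit error
```

Because the toy matrix has rank 2, a two-dimensional embedding reproduces it up to round-off.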

If in addition $\mathbf{S}_G$ is sparse, (2) can be solved even more efficiently, with complexity that scales with the number of edges. One such example with sparse similarities is when $\mathbf{S}_G = \mathbf{A}$, where $\mathbf{A}$ is the graph adjacency matrix. Embeddings generally gain scalability by avoiding the explicit construction of a dense $\mathbf{S}_G$. In fact, simply storing $\mathbf{S}_G$ in the working memory becomes prohibitive even for graphs of moderate sizes (say $N > 10^5$).

In the ensuing section, we will design a family of dense similarity matrices that (among other properties) can be decomposed implicitly, at a cost that scales with the sparsity of the input.

2.2 Multihop Graph Node Similarities

Having reduced the node embedding problem to the one in (2), it remains to specify the graph similarity metric that gives rise to $\mathbf{S}_G$. Towards this end, and in order to maintain expressibility, we will design a parametric model for $\mathbf{S}_G$, with each pairwise node similarity metric expressed as

$$s_G(v_i, v_j; \boldsymbol{\theta}) = \sum_{k=1}^{K} \theta_k s_k(v_i, v_j), \quad \text{s.t. } \boldsymbol{\theta} \in \mathcal{S}_K, \tag{3}$$

where $\mathcal{S}_K := \{\boldsymbol{\theta} \in \mathbb{R}^K : \boldsymbol{\theta} \ge \mathbf{0},\ \boldsymbol{\theta}^\top \mathbf{1} = 1\}$ is the $K$-dimensional probability simplex, and $s_k(v_i, v_j)$ is a similarity metric that depends on all $k$-hop paths of possibly repeated nodes that start from $v_i$ and end at $v_j$ (or vice-versa). Thus, $s_G(\cdot,\cdot; \boldsymbol{\theta})$ contains all $k$-hop interactions between two nodes, each weighted by a non-negative importance score $\theta_k$ with $k = 1, \ldots, K$.

Let $\mathbf{S}$ be any similarity matrix that is characterized by the same sparsity pattern as the adjacency matrix, that is

$$S_{i,j} = \begin{cases} s_{i,j}, & (i,j) \in \mathcal{E} \\ 0, & (i,j) \notin \mathcal{E}, \end{cases} \tag{4}$$

where $\{s_{i,j}\}$'s denote the generic non-negative values of entries that correspond to edges of $\mathcal{G}$. Maintaining the same sparsity pattern as $\mathbf{A}$ allows for the $(i,j)$ entry of $\mathbf{S}^k$ to be interpreted as a measure of influence between $v_i$ and $v_j$ that depends on all $k$-hop paths that connect them; that is, $[\mathbf{S}^k]_{i,j} = s_k(v_i, v_j)$. For instance, selecting $\mathbf{S} = \mathbf{A}$ is equivalent to using the $k$-step similarity $s_k(v_i, v_j) = |\{k\text{-length paths connecting } v_i \text{ to } v_j\}|$ [12]. Likewise, if $\mathbf{S} = \mathbf{A}\mathbf{D}^{-1}$ where $\mathbf{D} = \mathrm{diag}(\mathbf{1}^\top \mathbf{A})$, then $s_k(v_i, v_j)$ can be interpreted as the probability that a random walk starting from $v_j$ lands on $v_i$ after exactly $k$ steps, e.g., [31]. Thus, for a properly selected $\mathbf{S}$ with entries as in (4), tunable multihop similarity metrics in (3) can be collected as entries of the power series matrix

$$\mathbf{S}_G(\boldsymbol{\theta}) = \sum_{k=1}^{K} \theta_k \mathbf{S}^k, \quad \text{s.t. } \boldsymbol{\theta} \in \mathcal{S}_K. \tag{5}$$

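Substituting (5) into (2) yields the tunable embeddings $\mathbf{E}^*(\boldsymbol{\theta})$ that depend on the choice of parameters $\boldsymbol{\theta}$. From the eigen-decomposition $\mathbf{S} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top$, and given that $\mathbf{U}^\top \mathbf{U} = \mathbf{I}$, we readily arrive at

$$\mathbf{S}^k = \mathbf{U}\boldsymbol{\Sigma}^k \mathbf{U}^\top, \tag{6}$$

and after plugging (6) into (5), we obtain

$$\mathbf{S}_G(\boldsymbol{\theta}) = \mathbf{U}\left(\sum_{k=1}^{K} \theta_k \boldsymbol{\Sigma}^k\right)\mathbf{U}^\top, \quad \text{s.t. } \boldsymbol{\theta} \in \mathcal{S}_K. \tag{7}$$

Furthermore, the truncated singular pairs of $\mathbf{S}_G(\boldsymbol{\theta})$ conveniently follow from those of $\mathbf{S}$, and they have to be computed only once. Specifically, the truncated singular vectors and singular values are $\mathbf{U}_d(\boldsymbol{\theta}) = \mathbf{U}_d$ and $\boldsymbol{\Sigma}_d(\boldsymbol{\theta}) = \sum_{k=1}^{K} \theta_k \boldsymbol{\Sigma}_d^k$, respectively. Thus, if $\mathbf{S} \in \mathrm{Sym}_N$, the solution to (2) with $\mathbf{S}_G$ parametrized by $\boldsymbol{\theta}$ is simply given as

$$\mathbf{E}^*(\boldsymbol{\theta}) = \mathbf{U}_d \sqrt{\boldsymbol{\Sigma}_d(\boldsymbol{\theta})}. \tag{8}$$

Note that this holds only for non-negative parameters $\theta_k \ge 0\ \forall k$. If $\theta_k < 0$ for at least one $k \in \{1, \ldots, K\}$, then the diagonal entries of $\boldsymbol{\Sigma}_d(\boldsymbol{\theta})$ cannot be guaranteed to be non-negative and sorted in decreasing order, which would cause $(\mathbf{U}_d(\boldsymbol{\theta}), \boldsymbol{\Sigma}_d(\boldsymbol{\theta}))$ to not be a valid SVD pair.

Having narrowed down $\mathbf{S}_G$ to belong to the parametrized family in (5), we proceed to select an appropriate sparsity-preserving $\mathbf{S}$ in order to obtain a solid model.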
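The shortcut in (7) and (8) can be checked numerically. The sketch below compares the explicit power series (5) against an embedding built from a single decomposition of $\mathbf{S}$. The function names, the 5-node toy graph, and the choice of a positive semidefinite $\mathbf{S}$ (so that eigenvalues equal singular values) are assumptions made for illustration only.

```python
import numpy as np

def multihop_similarity(S, theta):
    """Explicit power series sum_k theta_k S^k, k = 1..K (dense; for checking only)."""
    S_G = np.zeros_like(S)
    S_pow = np.eye(S.shape[0])
    for theta_k in theta:
        S_pow = S_pow @ S           # S^k
        S_G += theta_k * S_pow
    return S_G

def multihop_embedding(S, theta, d):
    """U_d sqrt(Sigma_d(theta)), computed from one decomposition of S."""
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:d]
    sigma_d, U_d = eigvals[order], eigvecs[:, order]
    powers = np.arange(1, len(theta) + 1)
    # Sigma_d(theta): sum_k theta_k * sigma^k for each retained value
    sigma_theta = (sigma_d[:, None] ** powers[None, :]) @ np.asarray(theta)
    return U_d * np.sqrt(np.clip(sigma_theta, 0.0, None))

# Toy graph (assumption): a triangle 0-1-2 with a tail 2-3-4.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=0)
# A symmetric PSD similarity matrix built from A (assumed choice for this demo).
S = 0.5 * (np.eye(5) + A / np.sqrt(np.outer(deg, deg)))
theta = np.array([0.2, 0.3, 0.5])   # a point on the probability simplex

E_full = multihop_embedding(S, theta, d=5)
gap = np.linalg.norm(multihop_similarity(S, theta) - E_full @ E_full.T)
E_low = multihop_embedding(S, theta, d=2)
```

With $d = N$ the embedding reproduces $\mathbf{S}_G(\boldsymbol{\theta})$ up to round-off; changing `theta` only rescales the columns of `U_d`, so no new decomposition is needed.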

2.3 Spectral Multihop Embeddings

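While any symmetric $\mathbf{S}$ that obeys (4) can be used for constructing multihop similarities (cf. (5)), judicious designs of $\mathbf{S}$ can effect certain desirable properties. Bearing this in mind, consider the following identity

$$\mathbf{S} \in \mathcal{P}_N^+ \Rightarrow \mathbf{S} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top, \tag{9}$$

where $\mathcal{P}_N^+$ denotes the space of $N \times N$ symmetric positive definite (SPD) matrices, and $\boldsymbol{\Lambda}$ is the diagonal matrix that contains the eigenvalues of $\mathbf{S}$ sorted in decreasing order. For SPD matrices as in (9), the SVD is identical to the eigenvalue decomposition (EVD). Thus, if $\mathbf{S} \in \mathcal{P}_N^+$, the solution to (2) is also given as (cf. (8))

$$\mathbf{E}^*(\boldsymbol{\theta}) = \mathbf{U}_d \sqrt{\boldsymbol{\Lambda}_d(\boldsymbol{\theta})}, \tag{10}$$

where $\mathbf{U}_d$ are also the first $d$ eigenvectors of $\mathbf{S}$, and $\boldsymbol{\Lambda}_d(\boldsymbol{\theta}) = \sum_{k=1}^{K} \theta_k \boldsymbol{\Lambda}_d^k$ is the $K$th order polynomial of its eigenvalues defined by $\boldsymbol{\theta}$.

Consider now specifying $\mathbf{S}$ as

$$\mathbf{S} = \frac{1}{2}\left(\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right). \tag{11}$$

Recalling that $\lambda_i\left(\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right) \in [-1, 1]\ \forall i$, and after shifting by the identity and scaling, we deduce that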


$\lambda_i(\mathbf{S}) \in [0, 1]\ \forall i$; hence, matrix $\mathbf{S}$ in (11) is SPD. It can also be readily verified that the first $d$ eigenvectors of $\mathbf{S}$ coincide with the eigenvectors corresponding to the $d$ smallest eigenvalues of the symmetric normalized Laplacian matrix

$$\mathbf{L}_{sym} := \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}. \tag{12}$$

These smallest eigenvalues are known to contain useful information on cluster structures of different resolution levels, a key property that has been successfully employed by spectral clustering [17]. Intuitively, assigning weight $\theta_k$ to $k$-hop paths in the node similarity of (5) is equivalent to shrinking the coordinates of the $d$-dimensional spectral node embeddings (rows of $\mathbf{U}_d$) according to $\boldsymbol{\Lambda}_d(\boldsymbol{\theta})$. Interestingly, assigning large weights to longer paths ($K \gg 1$) is equivalent to quickly shrinking the coordinates that correspond to small eigenvalues and capture the fine-grained structures and local relations, which leads to a coarse, high-level cluster description of the graph.
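The eigenvalue range of (11) and its link to (12) can be verified on a small example. The following is a sketch under the assumption of a toy graph without isolated nodes; the helper name `normalized_similarity` is hypothetical.

```python
import numpy as np

def normalized_similarity(A):
    """S = 0.5 * (I + D^{-1/2} A D^{-1/2}); assumes every node has degree >= 1."""
    deg = A.sum(axis=0)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return 0.5 * (np.eye(A.shape[0]) + A_norm)

# Toy graph (assumption): a triangle 0-1-2 with a tail 2-3-4.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
S = normalized_similarity(A)
L_sym = 2.0 * (np.eye(5) - S)          # equals I - D^{-1/2} A D^{-1/2}

eig_S = np.linalg.eigvalsh(S)          # ascending order
eig_L = np.linalg.eigvalsh(L_sym)      # ascending order
```

Since $\mathbf{S} = \mathbf{I} - \tfrac{1}{2}\mathbf{L}_{sym}$, each eigenvalue of $\mathbf{S}$ equals $1 - \lambda/2$ for an eigenvalue $\lambda$ of $\mathbf{L}_{sym}$ with the same eigenvector, so the largest eigenvalues of $\mathbf{S}$ pair with the smallest ones of $\mathbf{L}_{sym}$.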

2.4 Relation to Random Walks

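Apart from the spectral embedding interpretation discussed in the last section, using powers of (11) to capture multihop similarities also admits an interesting random walk interpretation. We begin by expressing the $k$th power of $\mathbf{S}$ as

$$\mathbf{S}^k = \frac{1}{2^k}\left(\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)^k = \sum_{t=0}^{k} a_t(k) \left(\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)^t, \tag{13}$$

where the sequence

$$a_t(k) := \begin{cases} \dfrac{1}{2^k}\dbinom{k}{t}, & 0 \le t \le k \\ 0, & \text{else} \end{cases} \tag{14}$$

can be interpreted as nonzero weights that $\mathbf{S}^k$ assigns to all paths with the number of hops up to $k$ (see Fig. 1).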
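For a concrete feel of the weights in (14), the short sketch below evaluates $a_t(k)$ using only the Python standard library. The function name `hop_weights` is an illustrative assumption.

```python
from math import comb

def hop_weights(k):
    """Weights a_t(k) = C(k, t) / 2^k for t = 0..k, as in (14)."""
    return [comb(k, t) / 2 ** k for t in range(k + 1)]

weights_4 = hop_weights(4)   # [0.0625, 0.25, 0.375, 0.25, 0.0625]
total_10 = sum(hop_weights(10))
```

The weights form a binomial profile that sums to one and peaks around $t = k/2$, which is the bell shape that the "wavelet"-type description refers to.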

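Using (13) and (14), the multihop similarity in (5) becomes

$$\mathbf{S}_G(\boldsymbol{\theta}) = \sum_{t=0}^{K} c_t(\boldsymbol{\theta}) \left(\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)^t = \mathbf{D}^{-1/2}\left(\sum_{t=0}^{K} c_t(\boldsymbol{\theta}) \mathbf{P}^t\right)\mathbf{D}^{1/2}, \tag{15}$$

where

$$c_t(\boldsymbol{\theta}) := \sum_{k=1}^{K} \theta_k a_t(k), \tag{16}$$

and $\mathbf{P} = \mathbf{A}\mathbf{D}^{-1}$ is the probability transition matrix of a simple random walk defined over $\mathcal{G}$; that is, $P_{i,j}$ is the probability that a random walker positioned on node (state) $j$ transitions to node $i$ in one step. Thus, the $k$-hop similarity function defined in (3) is expressed as

$$s_G(v_i, v_j; \boldsymbol{\theta}) = \sqrt{\frac{d_j}{d_i}} \sum_{t=0}^{K} c_t(\boldsymbol{\theta}) \Pr\{X_t = v_i \mid X_0 = v_j\}, \tag{17}$$

where $\Pr\{X_t = v_i \mid X_0 = v_j\} := [\mathbf{P}^t]_{ij}$ is the probability that a random walk starting from $v_j$ lands on $v_i$ after $t$ steps.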

Interestingly, $\mathbf{S}_G(\boldsymbol{\theta})$ does not weigh landing probabilities of different lengths independently. Instead, it accumulates the latter as weighted combinations (cf. (16)) in a basis of "wavelet"-type functions of different resolution (see Fig. 1).
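The equivalence between the power series (5) and the random walk form (15) can be confirmed numerically. The sketch below is an illustration on an assumed toy graph; `walk_coefficients` is a hypothetical helper that evaluates (16).

```python
import numpy as np
from math import comb

def walk_coefficients(theta):
    """c_t(theta) = sum_k theta_k a_t(k) for t = 0..K, with a_t(k) = C(k, t) / 2^k when t <= k."""
    K = len(theta)
    return np.array([
        sum(theta[k - 1] * comb(k, t) / 2 ** k for k in range(max(t, 1), K + 1))
        for t in range(K + 1)
    ])

# Toy graph (assumption): a triangle 0-1-2 with a tail 2-3-4.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=0)
P = A / deg[None, :]                                   # P = A D^{-1}, columns sum to one
S = 0.5 * (np.eye(5) + A / np.sqrt(np.outer(deg, deg)))
theta = np.array([0.5, 0.3, 0.2])
c = walk_coefficients(theta)

# Left-hand side: sum_k theta_k S^k, as in (5).
lhs = sum(theta[k - 1] * np.linalg.matrix_power(S, k) for k in range(1, 4))
# Right-hand side: D^{-1/2} (sum_t c_t P^t) D^{1/2}, as in (15).
walks = sum(c[t] * np.linalg.matrix_power(P, t) for t in range(4))
rhs = np.diag(deg ** -0.5) @ walks @ np.diag(deg ** 0.5)
```

The coefficients $c_t(\boldsymbol{\theta})$ also sum to one whenever $\boldsymbol{\theta}$ lies on the simplex, because each sequence $a_t(k)$ does.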

Having established links to spectral clustering and random walks, our novel $\mathbf{S}_G(\boldsymbol{\theta})$ is well motivated as a family of node similarity matrices. Nevertheless, before devising an algorithm for learning $\boldsymbol{\theta}$ and testing it on real graphs, we will evaluate how well the basis $\{\mathbf{S}^k\}_{k=1}^{K}$, on which $\mathbf{S}_G(\boldsymbol{\theta})$ is built, can capture underlying node similarities.

3 MODEL EXPRESSIVENESS

This section introduces a performance metric that quantifies how well a node similarity matrix derived from the graph itself matches the "true" underlying similarity structure between nodes. The discussion is followed by numerical evaluation of the performance of different similarity matrices (including the one in (13)) on graphs that are generated according to the stochastic block model [2].

To begin, suppose that for a given set of nodes, an adjacency matrix $\mathbf{A}$ is generated as

$$\mathbf{A} \sim f_A(\mathbf{A}),$$

where $f_A(\mathbf{A})$ is a probability density function defined over the space of all possible adjacency matrices. Let the "true" underlying similarity between nodes $v_i$ and $v_j$ be

$$s^*(v_i, v_j) := \Pr\{(i,j) \in \mathcal{E}\} = \mathbb{E}_{f_A}\left[A_{i,j}\right],$$

which is the probability that the two nodes are connected. The "true" similarity matrix is thus given as the expected adjacency matrix

$$\mathbf{S}^* := \mathbb{E}_{f_A}[\mathbf{A}].$$

We define the quality-of-match (QoM) between the underlying $\mathbf{S}^*$ and any similarity $\hat{\mathbf{S}} = F(\mathbf{A})$ estimated from the adjacency matrix as

$$\mathrm{QoM} := \mathbb{E}_{f_A}\left[\mathrm{PC}\left(\mathbf{S}^*, F(\mathbf{A})\right)\right], \tag{18}$$

Fig. 1. Matrix $\mathbf{S}^k$ is equivalent to applying "wavelet"-type weights $a_t(k)$ over walks with hops $\le k$.


where

$$\mathrm{PC}(\mathbf{X}_1, \mathbf{X}_2) := \frac{\left(\mathrm{vec}(\mathbf{X}_1)\right)^\top \mathrm{vec}(\mathbf{X}_2)}{\|\mathbf{X}_1\|_F \|\mathbf{X}_2\|_F}, \tag{19}$$

is the Pearson correlation between two matrices $\mathbf{X}_1$ and $\mathbf{X}_2$, with $\mathrm{vec}(\mathbf{X})$ denoting matrix vectorization. The latter is used for appropriate rescaling of the "true" similarity matrix in order for the comparison with $\mathbf{S}_G$ to be meaningful. Intuitively, (18) measures how well the estimated node similarities in $\hat{\mathbf{S}}$ are expected to match the pattern of true underlying similarities in $\mathbf{S}^*$, when edges are generated according to the known $f_A(\cdot)$.
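The matching score in (19) takes only a few lines of NumPy. The sketch below implements the formula exactly as written, a normalized inner product of the vectorized matrices; the function name `pc` and the small test matrices are illustrative assumptions.

```python
import numpy as np

def pc(X1, X2):
    """vec(X1)^T vec(X2) / (||X1||_F ||X2||_F), as in (19)."""
    # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
    return float(np.sum(X1 * X2) / (np.linalg.norm(X1) * np.linalg.norm(X2)))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
score_scaled = pc(X, 3.0 * X)                      # invariant to positive rescaling
score_disjoint = pc(np.array([[1.0, 0.0], [0.0, 0.0]]),
                    np.array([[0.0, 1.0], [0.0, 0.0]]))
```

The score is unaffected by a positive rescaling of either argument, which is why it can compare similarity matrices that live on different scales. An estimate of the QoM in (18) would average `pc(S_star, F(A))` over independently drawn adjacency matrices.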

3.1 Numerical Experiments and Observations

We numerically evaluate the QoM achieved by different similarity matrices, on a set of $N$ nodes whose interconnections are generated according to a stochastic block model (SBM). For this set of experiments, we divided the nodes into three clusters of equal size

$$\mathcal{C}_l = \{i : (l-1)N/3 < i \le lN/3\}, \quad l \in \{1, 2, 3\},$$

with inter- and intra-connection probabilities

$$\Pr\{(i,j) \in \mathcal{E}\} = \begin{cases} p, & (i,j) \text{ in the same } \mathcal{C}_l \\ cq, & i \in \mathcal{C}_1 \text{ and } j \in \mathcal{C}_3 \\ q, & \text{else,} \end{cases} \tag{20}$$

where $p$ is the probability of connection when two nodes belong to the same cluster, and $c < 1$ introduces asymmetry and a hierarchical clustering organization (see Fig. 2, top left), by making two of the clusters less likely to connect; we have related Python scripts available.$^1$ The SBM probability matrix [2] is given as

$$\mathbf{W}_{sbm} = \begin{bmatrix} p & q & cq \\ q & p & q \\ cq & q & p \end{bmatrix}, \tag{21}$$

and the underlying similarity can be expressed as

$$\mathbf{S}^* = \mathbb{E}[\mathbf{A}] = \mathbf{W}_{sbm} \otimes \left(\mathbf{1}_{N/3}\mathbf{1}_{N/3}^\top\right) - \mathrm{diag}(p\mathbf{1}_N), \tag{22}$$

where $\otimes$ denotes the Kronecker product.

For each experiment, we set $N = 150$ and generated a graph according to (20). We then compared the QoM between (22) and the $k$th power of the proposed (11), the $k$th power of the adjacency ($\mathbf{A}^k$), as well as each of the following well known similarity metrics:

• $\hat{\mathbf{S}}_{PPR} := (1-\alpha)(\mathbf{I} - \alpha\mathbf{A}\mathbf{D}^{-1})^{-1}$: the steady state probability that a random walk restarting at $v_j$ with probability $1-\alpha$ at every step is located at $v_i$. Essentially a personalized PageRank (PPR) computed for every node of the graph, inheriting the properties of the celebrated centrality measure [7], [8], [9].

• $\hat{\mathbf{S}}_{KATZ} := (1-\beta)(\mathbf{I} - \beta\mathbf{A})^{-1}\mathbf{A}$: the Katz index [12], an exponentially weighted summation over paths of all possible hops between two nodes.

• $\hat{\mathbf{S}}_{NEIGH} := \mathbf{A}^2$: the number of common neighbors that every pair of nodes shares.

Fig. 2. Depiction of groundtruth and estimated similarity matrices, as yielded from an instance of the numerical experiments described in Section 3.1.

1. https://github.com/DimBer/ASE-project/tree/master/sim_tests


• $\hat{\mathbf{S}}_{AA} := \mathbf{A}\mathbf{D}^{-1}\mathbf{A}$: Adamic-Adar [4] is a variant of common neighbors where each set of neighbors is weighted inversely proportional to its cardinality.
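The four baseline metrics listed above can be computed directly from their closed forms on a small graph. The following sketch follows the formulas as stated in the list; the function name, the toy graph, and the values of `alpha` and `beta` are assumptions, and the dense inverses are only practical for small $N$. It also assumes no isolated nodes and a `beta` smaller than the reciprocal of the largest eigenvalue of the adjacency matrix, so that the Katz inverse exists.

```python
import numpy as np

def baseline_similarities(A, alpha, beta):
    """Dense versions of the PPR, Katz, common-neighbors, and Adamic-Adar matrices."""
    N = A.shape[0]
    I = np.eye(N)
    D_inv = np.diag(1.0 / A.sum(axis=0))   # assumes every node has degree >= 1
    return {
        "PPR": (1 - alpha) * np.linalg.inv(I - alpha * A @ D_inv),
        "KATZ": (1 - beta) * np.linalg.inv(I - beta * A) @ A,
        "NEIGH": A @ A,
        "AA": A @ D_inv @ A,
    }

# Toy graph (assumption): a triangle 0-1-2 with a tail 2-3-4.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
sims = baseline_similarities(A, alpha=0.85, beta=0.1)
```

In this toy graph, nodes 0 and 1 share exactly one neighbor (node 2, of degree 3), so the common-neighbors entry is 1 and the Adamic-Adar entry is 1/3.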

The resulting QoM was averaged over 200 experiments.

Parameters $\alpha$ in $\hat{\mathbf{S}}_{PPR}$ and $\beta$ in $\hat{\mathbf{S}}_{KATZ}$ were tuned to maximize the performance of the metrics. Fig. 3 depicts QoM as a function of $k$, for three different scenarios.
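The SBM setup in (20)-(22) can be reproduced with a short NumPy sketch. This is not the authors' script; the function names are hypothetical, and since the value of $c$ is not stated here, `c = 0.5` is an assumed placeholder.

```python
import numpy as np

def sbm_expected_adjacency(N, p, q, c):
    """S* = W_sbm kron (1 1^T) - diag(p 1), as in (22); N must be divisible by 3."""
    W = np.array([[p, q, c * q],
                  [q, p, q],
                  [c * q, q, p]])
    block = np.ones((N // 3, N // 3))
    return np.kron(W, block) - p * np.eye(N)

def sample_adjacency(S_star, rng):
    """Draw a symmetric 0/1 adjacency matrix with independent edges of probability S_star[i, j]."""
    N = S_star.shape[0]
    upper = np.triu(rng.random((N, N)) < S_star, k=1)   # one coin flip per node pair
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(0)
S_star = sbm_expected_adjacency(N=150, p=0.3, q=0.1, c=0.5)  # c is an assumed value
A = sample_adjacency(S_star, rng)
```

Averaging a matching score between `S_star` and a similarity computed from many such draws of `A` gives a Monte Carlo estimate of the QoM in (18).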

In the first scenario (Fig. 3a), with graphs being dense and clustered ($p = 0.3$, $q = 0.1$), the proposed $\mathbf{S}^k$ improves sharply in the first few steps, reaching maximum QoM after 4 or 5 steps, and gradually decreases as $k$ continues to increase. The $k$th order proximities that are given as entries of $\mathbf{A}^k$ follow a similar trend; however, their QoM peaks after only 2 or 3 steps and declines fast for larger $k$. The matrix plots of a randomly selected experiment depicted in Fig. 2 can aid in understanding the underlying mechanism that gives rise to this highly step-dependent behavior. Specifically, $\mathbf{S}^1$ (bottom left), which has the same sparsity pattern as the adjacency, is a poor match to the dense block structure of $\mathbf{S}^*$. On the other side of the spectrum, $\mathbf{S}^{15}$ (bottom right) is too "flat" and also a poor similarity metric. Meanwhile, taking $k = 6$ promotes enough mixing without "dissipating." As a result, $\mathbf{S}^6$ (bottom center) visibly matches the structure of $\mathbf{S}^*$. Interestingly, for $k \in [4, 10]$ the proposed $\mathbf{S}^k$ surpasses in QoM all other similarity metrics that were tested. Nevertheless, the simple 2-hop Adamic-Adar and common-neighbors similarities perform reasonably well by exploiting the relatively dense structure of the graphs.

Results were markedly different in the second scenario shown in Fig. 3b. Here, graphs were generated with the same clustering structure but significantly sparser, with edge probability parameters $p = 0.15$ and $q = 0.05$. For sparser graphs, $\mathbf{A}^k$ and $\mathbf{S}^k$ require more steps to reach peak QoM (4 and 9, respectively). Similarly, PPR, which relies on long paths, performs much better than the short-reaching Adamic-Adar. This behavior is intuitively reasonable because the sparser a graph is, the longer the paths that need to be explored around each node become, in order for the latter to "gauge" its position on the graph.

Finally, a third scenario (Fig. 3c) was examined, where each graph was generated without a clustering structure ($p = q = 0.1$ and $c = 1$); essentially an Erdős-Rényi graph. For this degenerate case that is of no real practical interest, all pairs of nodes are equally similar; this type of similarity requires infinitely long paths to be described.

In a nutshell, the presented numerical study hints at the following two facts. First, $\mathbf{S}^k$ can successfully model similarities that are based on grouping nodes in arbitrary and multilevel sets with variable degrees of homophily and heterophily. Second, the performance of $\mathbf{S}^k$ varies significantly with $k$. Moreover, the way that $k$ affects performance may also vary from graph to graph, depending on the underlying properties, which suggests viewing this dependence as a graph "signature"; this view is also validated by the real graphs in Section 6. Thus, a principled means of specifying $\mathbf{S}_G(\boldsymbol{\theta})$ by learning the parameters that match this graph "signature" in an unsupervised mode is highly motivated.

4 UNSUPERVISED SIMILARITY LEARNING

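We have arrived at the point where, for a given graph, it is prudent to select a specific $\boldsymbol{\theta} \in \mathcal{S}_K$ without supervision. Following the discussion in Section 3, it would be ideal to fit $\mathbf{S}_G(\boldsymbol{\theta})$ to a true $\mathbf{S}^*$ by minimizing an expected cost

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta} \in \mathcal{S}_K} \mathbb{E}_{f_A}\left[\ell\left(\mathbf{S}^*, \mathbf{S}_G(\mathbf{A}; \boldsymbol{\theta})\right)\right]. \tag{23}$$

Unfortunately, we only have one realization $\mathbf{A}$ of $f_A(\cdot)$, which means that without prior knowledge, the best approximation of $\mathbf{S}^*$ that we can obtain is the adjacency matrix itself, that is $\mathbf{S}^* \approx \mathbf{A}$. Using this approximation yields

$$\min_{\boldsymbol{\theta} \in \mathcal{S}_K} \ell\left(\mathbf{A}, \mathbf{S}_G(\mathbf{A}; \boldsymbol{\theta})\right). \tag{24}$$

While straightforward, (24) yields embeddings with limited generalization capability. Simply put, regardless of the choice of $\ell(\cdot)$, solving (24) amounts to predicting a set of edges by tuning a similarity metric that is generated by the same set of edges.

To mitigate overfitting but also promote generalization of the similarity metric and of the resulting embeddings, we explore the following idea. Suppose we are given a pair $\mathbf{A}_1, \mathbf{A}_2$ of adjacency matrices both drawn independently from $f_A(\cdot)$. In this case, we would be able to use one as approximation of $\mathbf{S}^* \approx \mathbf{A}_1$, and the other to form the multihop similarity matrix $\mathbf{S}_G(\mathbf{A}_2; \boldsymbol{\theta})$; parameters $\boldsymbol{\theta}$ can then be learned by solving

$$\min_{\boldsymbol{\theta} \in \mathcal{S}_K} \ell\left(\mathbf{A}_1, \mathbf{S}_G(\mathbf{A}_2; \boldsymbol{\theta})\right). \tag{25}$$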

Fig. 3. Quality of match between true SBM similarity and various estimates, as yielded from experiments of Section 3.1.


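Since separate samples are not available, we approximate the aforementioned process by randomly extracting part of $\mathbf{A}$ and approaching (25) as

$$\min_{\boldsymbol{\theta} \in \mathcal{S}_K} \ell_{\mathcal{S}}\left(\mathbf{A}, \mathbf{S}_G(\mathbf{A} \odot \mathbf{S}^c; \boldsymbol{\theta})\right), \tag{26}$$

where $\mathcal{S} \subseteq \{1, \ldots, N\}^2$ is a subset of all possible pairs of nodes with $|\mathcal{S}| = N_s$, and $\mathbf{S}^c$ is an $N \times N$ binary selection matrix with $S^c_{i,j} = 0$ if $\{i,j\} \in \mathcal{S}$, and $S^c_{i,j} = 1$ otherwise. Furthermore, $\ell_{\mathcal{S}}(\cdot,\cdot)$ in (26) denotes cost $\ell(\cdot,\cdot)$ applied selectively only to entries of the matrix variables that belong to $\mathcal{S}$. Here, $\mathcal{S}$ is such that $\mathcal{S} = \mathcal{S}^+ \cup \mathcal{S}^-$, with $\mathcal{S}^+ \subseteq \mathcal{E}$ being a subset of the edges and $\mathcal{S}^- \subseteq \{1, \ldots, N\}^2 \setminus \mathcal{E}$ a subset of node index tuples that are not connected (non-edges). To balance the influence of existing and non-existing edges, we use subsets of equal cardinality, that is $|\mathcal{S}^+| = |\mathcal{S}^-| = N_s/2$.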

To arrive at a practical method from the unsupervised similarity learning framework (26), it remains to specify two modular sub-systems: one responsible for sampling edges, and one specifying $\ell(\cdot,\cdot)$ to find $\boldsymbol{\theta}^*$ by solving (26).

4.1 Edge Sampling

The choice of the sampling scheme for $\mathcal{S}$ plays an important role in the overall performance of the proposed adaptive embedding framework. Ideally, edge sampling should take into account the following criteria.

• Sample $\mathcal{S}^+$ should be representative of the graph;
• Edge removal should inflict minimal perturbation;
• Edge removal should avoid isolating nodes; and
• Sampling scheme should be simple and scalable.

Aiming at a 'sweet spot' of these objectives, we populate $\mathcal{S}^+$ by sampling edges according to the following procedure: first, a node $v_1$ is sampled uniformly at random from $\mathcal{V}$; then, a second node $v_2$ is sampled uniformly from the neighborhood set $\mathcal{N}_{\mathcal{G}}(v_1)$ of $v_1$. The selected edge is removed only if both adjacent nodes have degree greater than one. Non-edges $\mathcal{S}^-$ are obtained by uniform sampling without replacement over $\{1, \ldots, N\}^2 \setminus \mathcal{E}$. The overall procedure is summarized in Algorithm 2. For $N_s \ll N$, sampling probabilities remain approximately unchanged despite the removals, since the probability of selecting the same node is relatively small. Thus, one may approximate $\Pr\{e_t = (i,j)\} \approx \Pr\{e_0 = (i,j)\}$, and assuming for simplicity that $d_i > 1\ \forall i$, it follows that

$$\begin{aligned} \Pr\{e_0 = (i,j)\} &= \Pr\{v_1 = i, v_2 = j\} + \Pr\{v_1 = j, v_2 = i\} \\ &= \Pr\{v_2 = i \mid v_1 = j\}\Pr\{v_1 = j\} + \Pr\{v_2 = j \mid v_1 = i\}\Pr\{v_1 = i\} \\ &= \frac{1}{d_j}\frac{1}{N} + \frac{1}{d_i}\frac{1}{N} \propto \frac{d_i + d_j}{d_i d_j}, \end{aligned} \tag{27}$$

meaning that edge $e = (i,j)$ is removed with probability that is inversely proportional to the harmonic mean of the degrees of the nodes that it connects. As shown in [14], the perturbation that the removal of edge $e = (i,j)$ inflicts on the spectrum of an undirected graph is proportional to $d_i d_j$; that is, removing edges that connect high-degree nodes leads to higher perturbation. Thus, Algorithm 2 tends to inflict minimal perturbation by sampling with probability that is inversely proportional to $d_i d_j$ for $d_i, d_j \gg 1$; this is because the denominator of (27) dominates its numerator for large degrees. On the other hand, for smaller $d_i$ and $d_j$, the numerator ensures relatively high probabilities for moderate-degree nodes. The combination of the two effects yields edge samples that are fairly representative of the graph, while inflicting low perturbation when removed.
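The closed form in (27) is easy to sanity-check: summed over all edges of a graph whose nodes all have degree greater than one, the first-draw probabilities must add up to one. The sketch below does this with plain Python on an assumed 5-node toy graph; `edge_removal_probability` is a hypothetical helper.

```python
def edge_removal_probability(adj, i, j):
    """Pr{e_0 = (i, j)} = (1/N) * (1/d_i + 1/d_j), as in (27), for an existing edge (i, j)."""
    N = len(adj)
    return (1.0 / len(adj[i]) + 1.0 / len(adj[j])) / N

# Toy graph (assumption): every node has degree greater than one.
adj = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {0, 3}}
edges = {frozenset((u, v)) for u in adj for v in adj[u]}

probs = {tuple(sorted(e)): edge_removal_probability(adj, *sorted(e)) for e in edges}
total = sum(probs.values())
```

Here the edge (0, 2) joins two degree-3 nodes and is drawn less often than the edge (3, 4), which joins two degree-2 nodes, in agreement with the discussion above.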

4.2 Parameter Training

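Subsequently, for a given sample $\mathcal{S}$, we can obtain the corresponding optimal parameters as (cf. (26))

$$\boldsymbol{\theta}^*_{\mathcal{S}} = \arg\min_{\boldsymbol{\theta} \in \mathcal{S}_K} \sum_{(i,j) \in \mathcal{S}} \ell\left(A_{i,j},\, s_{\mathcal{G}^-}(v_i, v_j; \boldsymbol{\theta})\right), \tag{28}$$

where $\mathcal{G}^- := (\mathcal{V}, \mathcal{E} \setminus \mathcal{S}^+)$ is the original graph with the randomly sampled subset $\mathcal{S}^+$ of edges removed.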

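Algorithm 1. ADAPTIVE SIMILARITY EMBEDDING

Input: $\mathcal{G}$    Output: $\mathbf{E}$
// Training phase
$\boldsymbol{\Theta} = \emptyset$
while $|\boldsymbol{\Theta}| < T_s$ do
    $\mathcal{G}^-, \mathcal{S}^+, \mathcal{S}^-$ = SAMPLE EDGES($\mathcal{G}$)
    $\boldsymbol{\theta}^*_{\mathcal{S}}$ = TRAIN PARAMETERS($\mathcal{G}^-, \mathcal{S}^+, \mathcal{S}^-$)
    $\boldsymbol{\Theta} = \boldsymbol{\Theta} \cup \boldsymbol{\theta}^*_{\mathcal{S}}$
end while
$\boldsymbol{\theta}^* = T_s^{-1} \sum_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \boldsymbol{\theta}$
// Embedding phase
$\mathbf{S} = \frac{1}{2}\left(\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)$
$\mathbf{S} = \mathbf{U}_d \boldsymbol{\Sigma}_d \mathbf{U}_d^\top$
$\boldsymbol{\Sigma}_d(\boldsymbol{\theta}^*) = \sum_{k=1}^{K} \theta_k^* \boldsymbol{\Sigma}_d^k$
return $\mathbf{E} = \mathbf{U}_d \sqrt{\boldsymbol{\Sigma}_d(\boldsymbol{\theta}^*)}$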

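Algorithm 2. SAMPLE EDGES

Input: $G$
Output: $G^{-}, \mathcal{S}^{+}, \mathcal{S}^{-}$
// Sample edges
$\mathcal{S}^{+} = \emptyset$, $G^{-} = G$
while $|\mathcal{S}^{+}| < N_s/2$ do
    Sample $v_1 \sim \mathrm{Unif}(\mathcal{V})$
    if $|\mathcal{N}_{G^{-}}(v_1)| > 1$ then
        Sample $v_2 \sim \mathrm{Unif}(\mathcal{N}_{G^{-}}(v_1))$
        if $|\mathcal{N}_{G^{-}}(v_2)| > 1$ then
            $\mathcal{S}^{+} = \mathcal{S}^{+} \cup (v_1, v_2)$
            $G^{-} = G^{-} \setminus (v_1, v_2)$
        end if
    end if
end while
// Sample non-edges
$\mathcal{S}^{-} = \emptyset$
while $|\mathcal{S}^{-}| < N_s/2$ do
    Sample $(v_1, v_2) \sim \mathrm{Unif}(\mathcal{V} \times \mathcal{V})$
    if $(v_1, v_2) \notin \mathcal{E}$ then
        $\mathcal{S}^{-} = \mathcal{S}^{-} \cup (v_1, v_2)$
    end if
end while
return $G^{-}$, $\mathcal{S}^{+}$, $\mathcal{S}^{-}$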
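The sampling procedure of Algorithm 2 could be sketched in Python as follows. The representation (a dict mapping each node to its neighbor set), the function name, and the `seed` argument are assumptions made for illustration. The sketch also skips self-pairs when sampling non-edges, which the pseudocode does not state explicitly, and, like the pseudocode, it does not terminate if the graph cannot supply $N_s/2$ removable edges or non-edges.

```python
import random

def sample_edges(adj, n_s, seed=0):
    """Sketch of Algorithm 2; adj maps node -> set of neighbors (undirected)."""
    rng = random.Random(seed)
    g_minus = {v: set(nbrs) for v, nbrs in adj.items()}  # working copy G^-
    nodes = sorted(g_minus)
    s_plus, s_minus = set(), set()

    # Sample edges: remove (v1, v2) only if both endpoints keep degree >= 1.
    while len(s_plus) < n_s // 2:
        v1 = rng.choice(nodes)
        if len(g_minus[v1]) > 1:
            v2 = rng.choice(sorted(g_minus[v1]))
            if len(g_minus[v2]) > 1:
                s_plus.add((min(v1, v2), max(v1, v2)))
                g_minus[v1].discard(v2)
                g_minus[v2].discard(v1)

    # Sample non-edges of the ORIGINAL graph; the set removes duplicates.
    while len(s_minus) < n_s // 2:
        v1, v2 = rng.choice(nodes), rng.choice(nodes)
        if v1 != v2 and v2 not in adj[v1]:
            s_minus.add((min(v1, v2), max(v1, v2)))

    return g_minus, s_plus, s_minus

# Usage on a 12-node ring (toy example).
ring = {v: {(v - 1) % 12, (v + 1) % 12} for v in range(12)}
g_minus, s_plus, s_minus = sample_edges(ring, n_s=4, seed=1)
```

Because an edge is removed only when both endpoints have degree greater than one, no node of $G^{-}$ becomes isolated, which keeps the normalized matrix $\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ well defined.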

Interestingly, one way that (28) could be solved is by explicitly computing the entries of $\mathbf{S}_G(\boldsymbol{\theta})$ that are in $\mathcal{S}$. This would require performing $K$ sparse matrix-vector products to obtain every column of $\mathbf{S}^k$ for $k \in \{1,\ldots,K\}$, for all the columns that contain sampled entries. In the worst case, if all nodes in the tuples of $\mathcal{S}$ correspond to different columns of $\mathbf{S}_G(\boldsymbol{\theta})$, two random walks are required for every tuple, for a total of $2N_s$ random walks. This requires $\mathcal{O}(N_s K |\mathcal{E}|)$ computations, and $\mathcal{O}(N_s N)$ memory if they are to be performed concurrently or in matrix form. Since $K$ will typically be in the order of tens, these requirements will be affordable if $N_s$ is relatively small. Nevertheless, they quickly become cumbersome for $N_s \gg K$, which may be necessary to estimate the $K$-dimensional $\boldsymbol{\theta}$.

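Instead, we will rely on the fact that the proposed embeddings are smooth and differentiable with respect to $\boldsymbol{\theta}$ (cf. (10)), to develop a solution that allows for selecting arbitrarily large $N_s$, using the approximation

$$
\begin{aligned}
s_{G^{-}}(v_i, v_j; \boldsymbol{\theta}) &\approx s_{\mathcal{E}}\big(\mathbf{e}^{-}_i(\boldsymbol{\theta}), \mathbf{e}^{-}_j(\boldsymbol{\theta})\big) = \big(\mathbf{e}^{-}_i(\boldsymbol{\theta})\big)^{\top} \mathbf{e}^{-}_j(\boldsymbol{\theta}) \\
&= \Big(\sqrt{\boldsymbol{\Sigma}^{-}_d(\boldsymbol{\theta})}\, \mathbf{u}^{-}_i\Big)^{\top} \sqrt{\boldsymbol{\Sigma}^{-}_d(\boldsymbol{\theta})}\, \mathbf{u}^{-}_j \\
&= \big(\mathbf{u}^{-}_i\big)^{\top} \boldsymbol{\Sigma}^{-}_d(\boldsymbol{\theta})\, \mathbf{u}^{-}_j \\
&= \mathbf{x}^{\top}_{i,j} \boldsymbol{\theta},
\end{aligned}
\tag{29}
$$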

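where

$$
\mathbf{x}_{i,j} = \big(\mathbf{u}^{-}_i \odot \mathbf{u}^{-}_j\big)^{\top} \boldsymbol{\Sigma}^{K}_d, \tag{30}
$$

and

$$
\boldsymbol{\Sigma}^{K}_d =
\begin{bmatrix}
\sigma_1 & \sigma_1^2 & \cdots & \sigma_1^K \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d-1} & \sigma_{d-1}^2 & \cdots & \sigma_{d-1}^K \\
\sigma_d & \sigma_d^2 & \cdots & \sigma_d^K
\end{bmatrix}.
$$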
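The identity behind (29) and (30), namely that the similarity is linear in $\boldsymbol{\theta}$ once the features $\mathbf{x}_{i,j}$ are formed, can be checked numerically. The sketch below uses made-up singular values and embedding rows and assumes that the product of $\mathbf{u}^{-}_i$ and $\mathbf{u}^{-}_j$ in (30) is entrywise (Hadamard); all names are hypothetical.

```python
import random

def features(u_i, u_j, sigma, K):
    """x_{i,j} as in (30): entry k equals sum_m u_i[m] * u_j[m] * sigma[m]**k."""
    return [sum(a * b * s ** k for a, b, s in zip(u_i, u_j, sigma))
            for k in range(1, K + 1)]

def similarity_direct(u_i, u_j, sigma, theta):
    """u_i^T Sigma_d(theta) u_j, with Sigma_d(theta) = sum_k theta_k Sigma_d^k."""
    weights = [sum(t * s ** k for k, t in enumerate(theta, start=1))
               for s in sigma]
    return sum(a * b * w for a, b, w in zip(u_i, u_j, weights))

rng = random.Random(0)
d, K = 4, 3
sigma = sorted((rng.random() for _ in range(d)), reverse=True)  # toy spectrum
u_i = [rng.uniform(-1, 1) for _ in range(d)]                    # toy row i of U_d
u_j = [rng.uniform(-1, 1) for _ in range(d)]                    # toy row j of U_d
theta = [0.2, 0.5, 0.3]                                         # a point on the simplex

x_ij = features(u_i, u_j, sigma, K)
lhs = sum(x * t for x, t in zip(x_ij, theta))        # x_{i,j}^T theta
rhs = similarity_direct(u_i, u_j, sigma, theta)      # u_i^T Sigma_d(theta) u_j
print(abs(lhs - rhs))
```

The practical consequence is that the features need to be computed only once per sampled pair; every evaluation of the similarity for a new $\boldsymbol{\theta}$ is then a $K$-dimensional inner product.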

Conveniently, the $\{\mathbf{x}_{i,j}\}$'s act as features over every possible pair of nodes, which, when linearly combined with weights $\boldsymbol{\theta}$ to produce similarities, allow us to approach (28) using well-understood learning and optimization tools. Among the various loss functions, one may fit the removed edges$^{2}$ using the hinge loss

$$
\ell(y, f) := \max(0, \epsilon - y f), \tag{31}
$$

which is suitable for real-world graphs thanks to its robustness properties [13]; note that target variables here are defined as $y_{i,j} = 2A_{i,j} - 1$ so that $y_{i,j} \in \{-1, 1\}$. We can then equivalently express (28) as

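$$
\boldsymbol{\theta}^{*}_{\mathcal{S}} = \arg\min_{\boldsymbol{\theta} \in \mathcal{S}^{K}} \sum_{(i,j) \in \mathcal{S}} \max\big(0, \epsilon - y_{i,j}\, \mathbf{x}^{\top}_{i,j} \boldsymbol{\theta}\big) + \lambda \|\boldsymbol{\theta}\|_2^2, \tag{32}
$$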

where $\lambda \geq 0$ is the regularization parameter of the $\ell_2$ regularization typically used to improve the robustness and generalization capability of SVMs [13]. To solve our variant of simplex-constrained SVMs (cf. (32)), we employ the projected-gradient descent approach [3] that we describe in Algorithm 4, where SIMPLEXPROJ($\cdot$) is a subroutine that implements projections onto $\mathcal{S}^{K}$; the latter can be performed with $\mathcal{O}(K \log K)$ complexity as described in [21]. The overall parameter learning procedure for a given sample is summarized in Algorithm 3.

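Algorithm 3. TRAIN PARAMETERS

Input: $G$, $\mathcal{S}^{+}$, $\mathcal{S}^{-}$
Output: $\boldsymbol{\theta}^{*}_{\mathcal{S}}$
$\mathbf{S} = \frac{1}{2}\left(\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\right)$
$\mathbf{S} = \mathbf{U}_d \boldsymbol{\Sigma}_d \mathbf{U}_d^{T}$
$\mathcal{S} = \mathcal{S}^{+} \cup \mathcal{S}^{-}$
Form $\mathcal{X}_{\mathcal{S}} = \{\mathbf{x}_{(i,j)}\}_{(i,j) \in \mathcal{S}}$ as in (30)
return $\boldsymbol{\theta}^{*}_{\mathcal{S}}$ = SIMPLEXSVM($\mathcal{X}_{\mathcal{S}}, \mathcal{S}^{+}, \mathcal{S}^{-}$)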

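Algorithm 4. SIMPLEXSVM

Input: $\mathcal{X}, \mathcal{S}^{+}, \mathcal{S}^{-}$
Output: $\boldsymbol{\theta}^{*}$
$\boldsymbol{\theta}_0 = \frac{1}{K}\mathbf{1}$; $t = 1$
while $\|\boldsymbol{\theta}_t - \boldsymbol{\theta}_{t-1}\|_{\infty} \geq \mathrm{tol}$ do
    $t = t + 1$; $\eta_t = a/\sqrt{t}$
    $\mathcal{S}^{+}_a = \{e \in \mathcal{S}^{+} \mid \mathbf{x}_e^{T} \boldsymbol{\theta}_{t-1} \leq \epsilon\}$
    $\mathcal{S}^{-}_a = \{e \in \mathcal{S}^{-} \mid \mathbf{x}_e^{T} \boldsymbol{\theta}_{t-1} \geq -\epsilon\}$
    $\mathbf{g}_t = \sum_{e \in \mathcal{S}^{-}_a} \mathbf{x}_e - \sum_{e \in \mathcal{S}^{+}_a} \mathbf{x}_e$
    $\mathbf{z}_t = (1 - 2\eta_t \lambda)\boldsymbol{\theta}_{t-1} - \frac{\eta_t}{N_s} \mathbf{g}_t$
    $\boldsymbol{\theta}_t$ = SIMPLEXPROJ($\mathbf{z}_t$)
end while
return $\boldsymbol{\theta}_t$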
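A minimal Python sketch of the projected-subgradient iteration in Algorithm 4 is given below. Several choices are assumptions made for illustration rather than details taken from the paper: the simplex projection is the standard sort-based $\mathcal{O}(K \log K)$ routine (not necessarily the method of [21]), the default values of `eps`, `a`, `tol` and `max_iter` are arbitrary, convergence is checked at the end of each iteration with the max-norm, and all function names are hypothetical.

```python
import math

def simplex_proj(z):
    """Euclidean projection of z onto the probability simplex (sort-based)."""
    u = sorted(z, reverse=True)
    css, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        css += u_k
        t = (css - 1.0) / k
        if u_k - t > 0:          # the largest k passing this test fixes tau
            tau = t
    return [max(z_i - tau, 0.0) for z_i in z]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def simplex_svm(x_pos, x_neg, lam=1.0, eps=1.0, a=1.0, tol=1e-6, max_iter=1000):
    """Projected subgradient descent for a hinge loss with theta on the simplex."""
    K = len((x_pos or x_neg)[0])
    n_s = len(x_pos) + len(x_neg)
    theta = [1.0 / K] * K                     # uniform initialization
    for t in range(1, max_iter + 1):
        eta = a / math.sqrt(t)                # diminishing step size
        g = [0.0] * K
        for x in x_neg:                       # active non-edges (y = -1)
            if dot(x, theta) >= -eps:
                g = [g_k + x_k for g_k, x_k in zip(g, x)]
        for x in x_pos:                       # active removed edges (y = +1)
            if dot(x, theta) <= eps:
                g = [g_k - x_k for g_k, x_k in zip(g, x)]
        z = [(1.0 - 2.0 * eta * lam) * th - (eta / n_s) * g_k
             for th, g_k in zip(theta, g)]
        new_theta = simplex_proj(z)
        diff = max(abs(p - q) for p, q in zip(new_theta, theta))
        theta = new_theta
        if diff < tol:
            break
    return theta

# Toy features: edges score high on hop 1, non-edges score high on hop 2.
x_pos = [[1.0, 0.0]] * 5
x_neg = [[0.0, 1.0]] * 5
theta_hat = simplex_svm(x_pos, x_neg, lam=0.1)
```

In this toy problem the first feature separates edges from non-edges, so the learned weights concentrate on the first hop, mirroring how a graph dominated by one-hop interactions would yield $\theta_1 \approx 1$.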

In general, if runtime or computational resources allow, the sampling and training process described in the last two sections can be repeated $T_s$ times to obtain different $\{\boldsymbol{\theta}^{*}_{\mathcal{S}}\}$'s, which can then be averaged in order to reduce their variance. In practice, this may not be necessary if $N_s$ is large enough, which will yield a near-deterministic $\boldsymbol{\theta}$. The overall proposed adaptive-similarity embedding (ASE) framework is summarized in Algorithm 1.

4.3 Complexity

The computational complexity of ASE is dominated by the cost of performing the truncated SVD of $\mathbf{S}$ in the training as well as testing phases of Algorithm 1. Relying on the sparsity ($|\mathcal{E}| \ll N^2$) and symmetry of $\mathbf{S}$, the Lanczos algorithm followed by EVD of a tridiagonal matrix yields the truncated SVD in a very efficient manner. Provided that $d \ll N$, the decomposition can be achieved in $\mathcal{O}(|\mathcal{E}| d)$ time and using $\mathcal{O}(N d)$ memory. Therefore, for the $T_s \geq 1$ training rounds and a single embedding round of Algorithm 1, the overall complexity is $\mathcal{O}((T_s + 1)|\mathcal{E}| d)$.

5 RELATED WORK

Two recent embedding methods also pursue similarity matrices that combine walks of different lengths [12], [38]. Most relevant to the proposed ASE is the “Arbitrary-Order Proximity Preserved Network Embedding” [12] approach, where a method is proposed for obtaining the SVD of a polynomial of the adjacency matrix without having to recompute the singular vectors.

2. In our implementation, we also provide learning mechanisms based on least-squares, logistic regression, as well as finding the best single k. Due to space constraints, though, we only present and report results of the SVM-based approach.


Compared to [12], we put forth the following contributions. First, we introduce a family of multihop similarities whose decomposition leads to embeddings that inherit the rich information contained in the spectral embeddings (cf. Section 2.3). An equally important contribution in terms of modeling is that our embeddings can be differentiated with respect to (wrt) weights $\boldsymbol{\theta}$ (cf. (29), (30), (31), and (32)), whereas the embeddings in [12] are non-differentiable wrt the weights. Hence, [12] can only proceed in a “forward” fashion given some order proximity weights $\boldsymbol{\theta}$, whereas our approach allows for “navigating” the space of possible similarity functions $s(v_i, v_j; \boldsymbol{\theta})$ in a smooth fashion, meaning that $\boldsymbol{\theta}$ can be learned with simple optimization on well-defined fitting models such as logistic regression or SVMs (cf. (32)). This leads to the third main contribution, which is a means of learning “personalized” $\boldsymbol{\theta}$ (cf. Section 4) in an unsupervised fashion, meaning without downstream information such as node or edge labels/attributes that can guide cross-validation in high-dimensional discretized parameter grids.

The second related embedding method presented in [38] builds on the concept of graph attention mechanisms to place weights on lengths of truncated random walks. These mechanisms are used to build a similarity matrix containing co-occurrence probabilities. The matrix is jointly decomposed by maximizing a graph-likelihood function. The model in [38] is a generalization of the ones implicitly adopted by [39] and [40], building on similar tools and concepts that emerge from natural language processing. Different from [39], [40] and the proposed ASE, [38] explicitly constructs and factorizes a dense $N \times N$ similarity matrix. The detailed procedure incurs complexity that is cubic wrt $N$, and becomes at best quadratic after model approximations, meaning that [38] scales rather poorly beyond small graphs.

6 EXPERIMENTAL EVALUATION

The present section reports extensive experimental results on a variety of real-world networks. The aim of the presented tests is twofold. First, to determine and quantify the quality of the proposed ASE embeddings for different downstream learning tasks. Second, to analyze and interpret the resulting embedding parameters for different networks.

Datasets. In our experiments, we used the following real-world networks (see also Table 2).

• ca-AstroPh. The Astro Physics collaboration network is from the e-print arXiv and covers scientific collaborations between authors of papers submitted to the Astro Physics category [52]. If an author i co-authored a paper with author j, the graph contains an undirected edge from i to j. If the paper is co-authored by k authors, this generates a completely connected (sub)graph on k nodes.

• ca-CondMat. Condensed Matter Physics collaboration network from arXiv [52].

• CoCit. A co-citation network of papers citing other papers extracted by [36]; labels represent conferences in which papers were published.

• com-DBLP. Computer science research bibliography collaboration network [52].

• com-Amazon. Network collected by crawling the Amazon website [52]. It is based on the “Customers Who Bought This Item Also Bought” feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains an undirected edge from i to j.

• vk2016-17. VK is a Russian all-encompassing social network. In [36], two snapshots of the network were extracted in November 2016 and May 2017, to obtain information about link appearance.

• email-Enron. Enron email communication network covering all the email communication within a dataset of around half a million emails [52].

• PPI (H. Sapiens). Subgraph of the protein-protein interaction network for Homo Sapiens. The subgraph corresponds to the graph induced by nodes for which labels (representing biological states) were obtained from the hallmark gene sets [40].

• Wikipedia. This is a co-occurrence network of words appearing in the first million bytes of the Wikipedia dump. The labels represent the Part-of-Speech (POS) tags inferred using the Stanford POS-Tagger [40].

• BlogCatalog. A network of social relationships of the bloggers listed on the BlogCatalog website. The labels represent blogger interests inferred through the meta-data provided by the bloggers.

Methods. Experiments were run using the following unsupervised and scalable embedding methods.

• ASE. Our proposed adaptive similarity embedding. Based on observations made in Section 3, and to retain optimization stability, we set the maximum number of steps to $K = 10$. We also use the default SVM regularizer ($\lambda = 1$). To have a single learning round with learned parameters having small enough variance, we sampled with $N_s/2 = 1{,}000$. We made our implementation of ASE freely available.$^{3}$

• VERSE [36]. This is a scalable framework for generating node embeddings according to a similarity function by minimizing a KL-divergence objective via stochastic optimization. We used the default version with similarity (PPR with $\alpha = 0.85$), as suggested and implemented by the authors.$^{4}$

• Deepwalk [39]. This approach learns an embedding by sampling random walks from each node, and applying word2vec-based learning on those walks. We use the default parameters proposed in [39], i.e., walk length $t = 80$, number of walks per node $\gamma = 80$, window size $w = 10$, and the scalable C++ implementation$^{5}$ provided in [36].

• HOPE [29]. This SVD-based approach approximates high-order proximities and leverages directed edges. We report the results obtained with the default parameters, i.e., Katz similarity as the similarity measure with $\beta$ inversely proportional to the spectral radius.

• AROPE [12]. An approach for fast computation of thin SVD of different polynomials of $\mathbf{A}$. We used the official Python implementation$^{6}$ to produce the embeddings. We selected the polynomial (hyper)parameters of AROPE using a set of validation edges that was sampled similarly to ASE (Algorithm 2). We consider proximity orders in the range [1,10], and perform grid search over the different proximity weights as suggested in [12].

• LINE [35]. This approach learns a $d$-dimensional embedding in two steps, both using adjacency similarity. First, it learns $d/2$ dimensions using first-order proximity; then, it learns another $d/2$ features using second-order proximity. Last, the two halves are normalized and concatenated. We obtained a copy of the code,$^{7}$ and ran experiments with $T = 10^{10}$ samples (although $T = 10^{9}$ yielded the same accuracy for smaller graphs), and $s = 5$ negative samples, as described in the paper.

• Spectral. This approach relies on the first $d$ eigenvectors of $\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$. The baseline was developed for clustering [17], and has also been run as a benchmark for node embeddings [40]. In our case, spectral embedding is of particular interest since it can be obtained by column-wise normalization of the embeddings generated by the proposed method.

TABLE 2
Network Characteristics

Graph               |V|        |E|          |Y|    Density
PPI (H. Sapiens)    3,890      76,584       50     10^-2
Wikipedia           4,733      184,182      40     1.6 × 10^-2
BlogCatalog         10,312     333,983      39     6.2 × 10^-3
ca-CondMat          23,133     93,497       -      3.5 × 10^-4
ca-AstroPh          18,772     198,110      -      1.1 × 10^-3
email-Enron         36,692     183,831      -      2.7 × 10^-4
CoCit               44,312     195,362      15     2 × 10^-4
vk2016-17           78,593     2,680,542    -      8.7 × 10^-4
com-Amazon          334,863    925,872      -      1.7 × 10^-5
com-DBLP            317,080    1,049,866    -      2.1 × 10^-5

3. https://github.com/DimBer/ASE-project
4. https://github.com/xgfs/verse

We excluded comparisons with Node2vec [40] because it uses cross-validation on node labels for hyperparameter selection. Thus, comparing Node2vec to methods such as LINE, Deepwalk, HOPE, VERSE, and ASE, which all operate with fixed hyperparameters in a fully unsupervised manner, would be unfair. We also excluded comparisons with GraRep [31] and M-NMF [30] due to their limited scalability ($\mathcal{O}(N^2 d)$ computational and $\mathcal{O}(N^2)$ memory complexity).

Evaluation Methodology. Our experiment setting follows the one in [36]. All methods are set to embed nodes to dimension $d = 100$. Using the resulting embeddings as feature vectors, we evaluated their performance in terms of node classification and link prediction accuracy, and clustering quality. All experiments were repeated 10 times, and the averaged results are reported.

Interpretation of Results. One interesting aspect of the proposed ASE method is that the inferred parameters $\boldsymbol{\theta}^{*}$ from the first phase of Algorithm 1 can be used to characterize the underlying similarity structure of the graph, and the way nodes “interact” over different path lengths (short, medium, and long range). The “strength” of interactions is inferred by how uniform the coefficients of $\boldsymbol{\theta}^{*}$ are, and depends on the value of $\lambda$. Since the default value was $\lambda = 1$ for all graphs, the results can be interpreted as relative interaction strengths between them. The resulting $\{\boldsymbol{\theta}^{*}\}$'s for all graphs are listed in Table 3.

It can be immediately observed that the type of node interactions varies significantly across different graphs, with similar behavior for graphs that belong to the same domain. Specifically, ca-CondMat, ca-AstroPh, and CoCit, which belong to the citation/co-authorship domain, all show relatively strong interactions of short range. BlogCatalog shows very strong short-range similarities of only one-hop neighborhood interactions among bloggers. On the other hand, the Wikipedia word co-occurrence network shows a strong tendency for long-range interactions, while other graphs, such as the PPI protein interaction network, stay on the medium range.

Node Classification. Graphs with labeled nodes are frequently used to measure the ability of embedding methods to produce features suitable for classification. For each experiment, nodes were randomly split into a training set and a test set. Similar to other works, and to cope with multi-label targets, we fed the training features and labels into the one-vs-the-rest configuration of the logistic regression classifier provided by the sklearn Python library. In the testing phase, we sorted the predicted class probabilities for each node in decreasing order, and extracted the top-$k_i$ ranking labels, where $k_i$ is the true number of labels of node $v_i$. We then computed the Micro- and Macro-averaged F1 scores [10] of the predicted labels.
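The testing-phase protocol described above (keep the top-$k_i$ labels per node, then compute Micro and Macro F1) can be sketched as follows. The predicted probabilities are assumed to be given, for instance by a one-vs-rest classifier; the function names, the toy numbers, and the convention that a class with no true or predicted instances contributes an F1 of zero to the Macro average are assumptions made for illustration.

```python
def top_k_predictions(probs, true_labels):
    """Keep, for each node, its k_i most probable labels (k_i = number of true labels)."""
    preds = []
    for p, truth in zip(probs, true_labels):
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        preds.append(set(ranked[:len(truth)]))
    return preds

def micro_macro_f1(preds, true_labels, n_classes):
    """Micro- and Macro-averaged F1 from per-class TP/FP/FN counts."""
    tp, fp, fn = [0] * n_classes, [0] * n_classes, [0] * n_classes
    for pred, truth in zip(preds, true_labels):
        for c in pred & truth:
            tp[c] += 1
        for c in pred - truth:
            fp[c] += 1
        for c in truth - pred:
            fn[c] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0   # assumed convention for empty classes

    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(n_classes)) / n_classes
    return micro, macro

# Toy example: 3 test nodes, 3 classes; node 1 carries two labels.
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.4, 0.1]]
true_labels = [{0}, {1, 2}, {1}]
preds = top_k_predictions(probs, true_labels)
micro, macro = micro_macro_f1(preds, true_labels, n_classes=3)
```

Micro F1 pools the counts over all classes and is therefore dominated by frequent labels, whereas Macro F1 weighs every class equally; this is why the two scores can move in different directions for the same embedding.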

TABLE 3
Inferred Parameters and Interpretation

Graph               θ1     θ2     θ3     θ4     θ5     θ6     θ7     θ8     θ9     θ10    range     strength
PPI (H. Sapiens)    0.00   0.14   0.31   0.29   0.21   0.04   0.00   0.00   0.00   0.00   medium    medium
Wikipedia           0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.01   0.37   0.62   long      strong
BlogCatalog         1.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   short     very strong
ca-CondMat          0.55   0.33   0.12   0.00   0.00   0.00   0.00   0.00   0.00   0.00   short     strong
ca-AstroPh          0.76   0.24   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   short     strong
email-Enron         0.24   0.25   0.18   0.14   0.10   0.06   0.02   0.00   0.00   0.00   medium    weak
CoCit               0.61   0.33   0.06   0.00   0.00   0.00   0.00   0.00   0.00   0.00   short     strong
vk2016-17           0.71   0.29   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   short     strong
com-Amazon          0.10   0.10   0.10   0.10   0.09   0.09   0.09   0.09   0.09   0.09   short     very weak
com-DBLP            0.11   0.10   0.10   0.09   0.09   0.09   0.09   0.09   0.09   0.08   short     very weak

5. https://github.com/xgfs/deepwalk-c
6. https://github.com/ZW-ZHANG/AROPE
7. https://github.com/tangjianpku/LINE



Apart from comparisons with alternative embedding methods, node classification can reveal whether available node labels (metadata) are distributed in a manner that matches the node relations/interactions that are inferred by ASE. To reveal this information, we obtain embeddings for every $k \in \{1,\ldots,10\}$ by ignoring the training phase and “forcing” $\boldsymbol{\theta}^{*} = \mathbf{e}_k$ (i.e., 1 at the $k$th entry and 0 elsewhere) in Algorithm 1, and then using each embedding for classification with 10 percent labeling rate. Fig. 4 plots Micro and Macro F1 for all labeled graphs as a function of $k$, while red shade is placed on the hops where the unsupervised ASE parameters $\boldsymbol{\theta}^{*}$ are non-zero (cf. Table 3). As seen in Fig. 4, the accuracy on the four labeled graphs evolves with $k$ in a markedly different manner. Nevertheless, ASE identifies the trends and tends to assign non-zero weights to hops that yield a desirable trade-off between Micro and Macro F1. This is rather remarkable, bearing in mind that ASE does not use labels for training or validation, so that $\boldsymbol{\theta}^{*}$ depends only on the graph.

We also compared the classification accuracy of ASE embeddings with those of the alternative embedding approaches, with results plotted in Fig. 5. The plots for some method-graph pairs are not discernible when values are too low. While the relative performance of any given method varies from graph to graph, ASE adapts to each graph and yields consistently reliable embeddings, with accuracy that in most cases reaches or surpasses that of state-of-the-art methods, especially in terms of Macro F1. The two exceptions are the Macro F1 in CoCit, and Micro F1 in Wikipedia, where VERSE and HOPE, respectively, are more accurate. Interestingly, HOPE achieving high Micro F1 and low Macro F1 in Wikipedia is in agreement with the findings in Fig. 4, combined with the fact that HOPE focuses on longer paths.

Fig. 4. Micro and Macro F1 scores for the four labeled graphs, when the “pure” $k$-order $\mathbf{S}^k$ is used for embedding, given as a function of $k$. Red shade denotes the corresponding $k$'s where ASE assigned non-zero $\theta_k$'s; see also Table 3.

Fig. 5. Micro (upper row) and Macro (lower row) F1 scores that different embeddings + logistic regression yield on labeled graphs, as a function of the labeling rate (percentage of training data).

TABLE 4
Link Prediction Accuracy on vk2016-17

VERSE   ASE    LINE   Deepwalk   AROPE   HOPE   Spectral
0.79    0.75   0.74   0.69       0.65    0.62   0.60

Link Prediction. Link prediction is the task of estimating the probability that a link between two unconnected nodes will appear in the future. We repeat the experiment performed in [36] on the vk2016-17 social network. For every possible edge, we build a feature vector as the Hadamard product between the embedded vectors of its two adjacent nodes. Using the two time instances of vk2016-17, we predict whether a new friendship link appears between November 2016 and May 2017, using 50 percent of the new links for training and 50 percent for testing. To train the binary logistic regression classifier, we also randomly sample non-existing edges as negative examples. The link prediction accuracy for different embeddings is reported in Table 4. While for this experiment ASE does not reach the accuracy of VERSE, it provides the second most accurate link prediction, far surpassing the also SVD-based HOPE and spectral embeddings.
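The edge features used for link prediction can be illustrated with a short sketch: each candidate pair is represented by the entrywise (Hadamard) product of the two node embeddings, with label 1 for new links and 0 for sampled non-links. The toy embeddings, the pair lists, and the helper names are hypothetical; the resulting `X` and `y` would then be passed to any binary classifier, such as logistic regression.

```python
def hadamard_edge_feature(emb, u, v):
    """Feature of the pair (u, v): entrywise product of the two embeddings."""
    return [a * b for a, b in zip(emb[u], emb[v])]

def build_link_dataset(emb, new_links, non_links):
    """Stack positive (new links) and negative (non-links) examples."""
    X = [hadamard_edge_feature(emb, u, v) for u, v in new_links]
    X += [hadamard_edge_feature(emb, u, v) for u, v in non_links]
    y = [1] * len(new_links) + [0] * len(non_links)
    return X, y

# Toy 2-dimensional embeddings for four nodes (assumed values).
emb = {
    0: [1.0, 0.5],
    1: [0.8, 0.4],
    2: [-0.6, 0.9],
    3: [-0.7, 1.0],
}
new_links = [(0, 1), (2, 3)]       # pairs that became connected
non_links = [(0, 2), (1, 3)]       # sampled pairs that stayed unconnected
X, y = build_link_dataset(emb, new_links, non_links)
```

The Hadamard product is symmetric in its two arguments, so the feature of an undirected pair does not depend on the order in which its endpoints are listed.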

Node Clustering. Finally, the embedded vectors were used to cluster the nodes into different communities, using the sklearn library K-means with the default K-means++ initialization [18]. We evaluate the quality of node clustering with conductance, a well-known metric for measuring the goodness of a community [5]; conductance is minimized for large, well-connected communities that are also well separated from the rest of the graph. Each plot in Fig. 6 gives the average conductance across communities, as a function of the total number of clusters. Results indicate that the proposed ASE as well as the spectral clustering benchmark yield much lower conductance compared to other embeddings. Apparently, since ASE builds on the same basis of eigenvectors used by normalized spectral clustering, it inherits the property of the latter to approximately minimize the normalized-cut metric [17], which is very similar to conductance. A closer look at the resulting clusters reveals that clustering based on VERSE, Deepwalk, LINE, and HOPE splits graphs into very large communities of roughly equal size, cutting a large number of edges in the process. This is an indication that these methods are subject to a resolution limit, which is the inability to detect well-separated communities that are below a certain size [1]. On the other hand, Spectral and the proposed ASE separate the graph into a large-core component and many smaller well-separated communities, a structure that many large-scale information networks have been observed to have [5]. Indeed, the conductance gap is smaller for BlogCatalog, which is relatively small and has less pronounced communities.
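For reference, the conductance of a community can be computed as in the sketch below, which uses the common definition (number of cut edges divided by the smaller of the two volumes, where volume is the sum of degrees). The graph representation, the function names, and the convention of returning 1.0 for degenerate clusters with zero volume are assumptions made for illustration.

```python
def conductance(adj, cluster):
    """cut(C, V \\ C) / min(vol(C), vol(V \\ C)) for an undirected graph."""
    cluster = set(cluster)
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    vol_in = sum(len(adj[u]) for u in cluster)
    vol_out = sum(len(adj[u]) for u in adj if u not in cluster)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 1.0   # assumed convention for empty volume

def average_conductance(adj, labels):
    """Mean conductance over the clusters induced by a node -> label map."""
    clusters = {}
    for node, label in labels.items():
        clusters.setdefault(label, set()).add(node)
    return sum(conductance(adj, c) for c in clusters.values()) / len(clusters)

# Toy graph: two triangles joined by a single edge.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
phi = conductance(adj, {0, 1, 2})
phi_avg = average_conductance(adj, labels)
```

Splitting the toy graph along the bridge cuts a single edge against a volume of seven on each side, which is the kind of well-separated structure that low conductance rewards.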

Parameter Sensitivity. Fig. 7 presents results obtained by varying the ASE parameters and measuring the embedding runtime for PPI as well as the classification Micro F1 accuracy with 10 percent labeling rate. The aim is to assess the sensitivity of ASE wrt its basic parameters. The plot on the left shows how increasing $\lambda$ (cf. (32)) may decrease accuracy by forcing the entries of $\boldsymbol{\theta}^{*}$ to be close to uniform, thus losing the benefits of graph-specific adaptation. Regarding the number of sampled edges $N_s$, results (middle plot) indicate relative robustness of ASE embeddings, given a minimum number of samples. As expected, sampling a large number of edges may cause noticeable perturbation on the graph (even using the minimally-perturbing Algorithm 2); this may be causing a slight decrease in accuracy. Sensitivity is also measured wrt $K$ (i.e., the maximum walk length considered in the optimization). As expected, the accuracy increases sharply with $K$ for the first few steps, and then plateaus as higher order coefficients of PPI take zero values (cf. Table 3) and do not affect the results. Finally, the plot on the right depicts accuracy across a range of embedding dimensions $d$.

Fig. 6. Average conductance of different embeddings used by kmeans for clustering, as a function of number of clusters.

Fig. 7. Sensitivity (F-1 Micro on left axes, and Runtime on right axes) of ASE on PPI graphs wrt various parameters.

Runtime. Finally, we compared different embedding methods in terms of runtime. Results for all graphs are reported in Fig. 8. All experiments were run on a personal workstation with a quad-core i5 processor and 16 GB of RAM. For our proposed ASE, we provide a light-weight yet highly portable implementation$^{8}$ that uses the SVDLIBC library [51] for sparse SVD. We also developed a more scalable implementation$^{9}$ that relies on (and requires installation of) the SLEPc package [49]; this scalable version can perform large-scale sparse SVD on multiple processes and distributed memory environments using the message-passing interface (MPI) [48]. We used the high-performance implementation for the five larger graphs, and the portable one for the five smaller ones. Evidently, ASE and HOPE, which are SVD-based, are orders of magnitude faster than VERSE, Deepwalk, and LINE. The main factor that slows the latter down seems to be the large number of stochastic optimization iterations that these methods must perform to reach accurate embeddings. Nevertheless, it should be noted that sampling-based methods enjoy nearly full parallelization and could thus benefit more from highly multi-threaded environments. On the other hand, methods that rely on SVD (and EVD) can greatly benefit from decades of research on how to efficiently perform these decompositions, and a suite of stable and highly optimized software tools.

7 CONCLUSIONS AND FUTURE WORK

We presented a scalable node embedding framework that is based on factorizing an adaptive node similarity matrix. The model is carefully studied, interpreted, and numerically evaluated using stochastic block models, with an algorithmic scheme proposed for training the model parameters efficiently and without supervision.

The novel framework opens up several interesting future research directions. For instance, one can explore larger families of node similarity metrics that can be learned using the graph. Furthermore, it would be interesting to assess the performance of different randomized edge sampling methods, and generalize the notion of adaptive-similarity to heterogeneous and multi-layered graph embedding, as well as to edge embedding.

ACKNOWLEDGMENTS

This work was supported by NSF 1901134, 171141, 1514056,and 1500713.

Fig. 8. Runtime of various embedding methods across different graphs.

8. https://github.com/DimBer/ASE-project/tree/master/portable
9. https://github.com/DimBer/ASE-project/tree/master/slepc_based

REFERENCES

[1] S. Fortunato and M. Barthelemy, “Resolution limit in community detection,” Proc. Nat. Acad. Sci. United States America, vol. 104, no. 1, pp. 36–41, 2007.

[2] Y. Zhao, E. Levina, and J. Zhu, “Consistency of community detection in networks under degree-corrected stochastic block models,” The Ann. Statist., vol. 40, no. 4, pp. 2266–2292, 2012.

[3] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA: Athena Scientific, 1999.

[4] L. A. Adamic and E. Adar, “Friends and neighbors on the web,” Social Netw., vol. 25, no. 3, pp. 211–230, 2003.

[5] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,” Internet Math., vol. 6, no. 1, pp. 29–123, 2009.

[6] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 1024–1034.

[7] S. Brin and L. Page, “Reprint of: The anatomy of a large-scale hypertextual web search engine,” Comput. Netw., vol. 56, no. 18, pp. 3825–3833, 2012.

[8] D. F. Gleich, “Pagerank beyond the web,” SIAM Rev., vol. 57, no. 3, pp. 321–363, 2015.

[9] I. M. Kloumann, J. Ugander, and J. Kleinberg, “Block models and personalized pagerank,” Proc. Nat. Acad. Sci. United States America, vol. 114, no. 1, pp. 33–38, 2017.

[10] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. Cambridge, MA, USA: Cambridge Univ. Press, 2008.

[11] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, vol. 48, pp. 40–48.

[12] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, and W. Zhu, “Arbitrary-order proximity preserved network embedding,” in Proc. Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 2778–2786.

[13] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. Workshop Comput. Learn. Theory, 1992, pp. 144–152.

[14] A. Milanese, J. Sun, and T. Nishikawa, “Approximating spectral impact of structural perturbations in large networks,” Phys. Rev. E, vol. 81, no. 4, 2010, Art. no. 046112.


[15] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey ofgraph embedding: Problems, techniques and applications,” IEEETrans. Knowl. Data Eng., vol. 30, no. 9, pp. 1616–1637, Sep. 2018.

[16] P. Goyal, and E. Ferrara, “Graph embedding techniques, applica-tions, and performance: A survey,” Knowl.-Based Syst., vol. 151,pp. 78–94, 2018.

[17] U. Von Luxburg, “A tutorial on spectral clustering,” Statist. Com-put., vol. 17, no. 4, pp. 395–416, 2007.

[18] D. Arthur, and S. Vassilvitskii, “k-means++: The advantages ofcareful seeding,” in Proc. 18th Annu. ACM-SIAM Symp. Discr.Algorithms, 2007, pp. 1027–1035.

[19] G. H. Golub, and C. Reinsch, “Singular value decomposition andleast squares solutions,” Numerische Math., vol. 14, no. 5, pp. 403–420, 1970.

[20] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis,“Adaptive diffusions for scalable learning over graphs,” IEEETrans. Signal Process., vol. 67, no. 5, pp. 1307–1321, 2018.

[21] L. Condat, “Fast projection onto the simplex and the ‘1 ball,”Math. Program., vol. 158, no. 1/2, pp. 575–585, 2016.

[22] Y. Han and Y. Shen, “Partially supervised graph embedding forpositive unlabeled feature selection,” in Proc. Int. Joint Conf. Artif.Intell., 2016, pp. 1548–1554.

[23] T. Hofmann and J. M. Buhmann, “Multidimensional scaling anddata clustering,” in Proc. Int. Conf. Neural Inf. Process. Syst., 1994,pp. 459–466.

[24] M. Balasubramanian and E. L. Schwartz, “The isomap algorithmand topological stability,” Sci., vol. 295, no. 5552, 2002, Art. no. 7.

[25] X. He and P. Niyogi, “Locality preserving projections,” in Proc.Int. Conf. Neural Inf. Process. Syst., 2003, pp. 153–160.

[26] S. T. Roweis, and L. K. Saul, “Nonlinear dimensionality reductionby locally linear embedding,” Sci., vol. 290, no. 5500, pp. 2323–2326,2000.

[27] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski,and A. J. Smola, “Distributed large-scale natural graphfactorization,” in Proc. World Wide Web Conf., 2013, pp. 37–48.

[28] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, “Networkrepresentation learning with rich text information,” in Proc. Int.Joint Conf. Artif. Intell., 2015, pp. 2111–2117.

[29] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transi-tivity preserving graph embedding,” in Proc. Int. Conf. Knowl. Dis-covery Data Mining, 2016, pp. 1105–1114.

[30] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Networkembedding as matrix factorization: Unifying DeepWalk, LINE,PTE, and node2vec,” in Proc. Int. Conf. Web Search Data Mining,2018, pp. 459–467.

[31] S. Cao, W. Lu, and Q. Xu, “GraRep: Learning graph representa-tions with global structural information,” in Proc. Int. Conf. Inf.Knowl. Manage., 2015, pp. 891–900.

[32] B. Shaw, and T. Jebara, “Structure preserving embedding,” inProc. Int. Conf. Mach. Learn., 2009, pp. 937–944.

[33] Y. Zhao, Z. Liu, and M. Sun, “Representation learning for measur-ing entity relatedness with rich information,” in Proc. Int. JointConf. Artif. Intell., 2015, pp. 1412–1418.

[34] Y. Koren, R. M. Bell, and C. Volinsky, “Matrix factorization techni-ques for recommender systems,” IEEE Comput., vol. 42, no. 8,pp. 30–37, Aug. 2009.

[35] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE:Large-scale information network embedding,” in Proc. World WideWeb Conf., 2015, pp. 1067–1077.

[36] A. Tsitsulin, D. Mottin, P. Karras, and E. Muller, “VERSE: Versa-tile graph embeddings from similarity measures,” in Proc. WorldWide Web Conf., 2018, pp. 539–548.

[37] J. Tang, M. Qu, and Q. Mei, “PTE: Predictive text embeddingthrough large-scale heterogeneous text networks,” in Proc. Int.Conf. Knowl. Discovery Data Mining, 2015, pp. 1165–1174.

[38] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi, “Watchyour step: Learning graph embeddings through attention,” arXiv:1710.09599, 2017.

[39] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learningof social representations,” in Proc. ACM SIGKDD Int. Conf. Knowl.Discovery Data Mining, 2014, pp. 701–710.

[40] A. Grover and J. Leskovec, “node2vec: Scalable feature learningfor networks,” in Proc. ACM SIGKDD Int. Conf. Knowl. DiscoveryData Mining, 2016, pp. 855–864.

[41] A. Bordes, N. Usunier, A. Garcia-Duran, J.Weston, andO. Yakhnenko,“Translating embeddings for modeling multirelational data,” in Proc.Int. Conf. Neural Inf. Process. Syst., 2013, pp. 2787–2795.

[42] R. Xie, Z. Liu, and M. Sun, “Representation learning of knowledgegraphs with hierarchical types,” in Proc. Int. Joint Conf. Artif.Intell., 2016, pp. 2965–2971.

[43] C. Donnat,M. Zitnik, D.Hallac, and J. Leskovec, “Learning structuralnode embeddings via diffusion wavelets,” in Proc. 24th ACMSIGKDD Int. Conf. Know.DiscoveryDataMining, 2018, pp. 1320–1329.

[44] L. F. R. Ribeiro, P. H. P. Saverese, and D. R. Figueiredo, “struc2vec:Learning node representations from structural identity,” in Proc.ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2017,pp. 385–394.

[45] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 1225–1234.

[46] F. Tian, B. Gao, Q. Cui, E. Chen, and T. Liu, “Learning deep representations for graph clustering,” in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 1293–1299.

[47] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 1145–1152.

[48] B. Barker, “Message passing interface (MPI),” Workshop: High Perform. Comput. Stampede, vol. 256, 2015.

[49] V. Hernandez, J. E. Roman, and V. Vidal, “SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems,” ACM Trans. Math. Softw., vol. 31, no. 3, pp. 351–362, 2005.

[50] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2D knowledge graph embeddings,” in Proc. AAAI Conf., Feb. 2–7, 2018, pp. 1811–1818.

[51] [Online]. Available: https://tedlab.mit.edu/~ dr/SVDLIBC/, Accessed: 2019.

[52] [Online]. Available: https://snap.stanford.edu/data/index.html, Accessed: 2019.

Dimitris Berberidis (S’15) received the diploma degree in electrical and computer engineering (ECE) from the University of Patras, Patras, Greece, in 2012; and the MSc as well as PhD degrees in ECE from the University of Minnesota, Minneapolis, MN. His research interests lie in the areas of statistical signal processing, focusing on sketching and tracking of large-scale processes, and in machine learning, focusing on the development of algorithms for scalable learning over graphs, including semi-supervised classification and node embedding. He is a student member of the IEEE.

Georgios B. Giannakis (F’97) received the diploma degree in electrical engr. from the National Technical University of Athens, Greece, in 1981, the MSc degree in electrical engineering, in 1983, the MSc degree in mathematics, in 1986, and the PhD in electrical engineering, in 1986, from the University of Southern California (USC). From 1982 to 1986 he was with USC. He was with the University of Virginia from 1987 to 1998, and since 1999 he has been a professor with the University of Minnesota, where he holds an Endowed chair in Wireless Telecommunications, a University of Minnesota McKnight Presidential chair in ECE, and serves as director of the Digital Technology Center. His general interests span the areas of communications, networking and statistical learning - subjects on which he has published more than 450 journal papers, 750 conference papers, 25 book chapters, two edited books, and two research monographs (h-index 142). Current research focuses on learning from big data, wireless cognitive radios, and network science with applications to social, brain, and power networks with renewables. He is the (co-) inventor of 32 patents issued, and the (co-) recipient of nine best journal paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000), from EURASIP (2005), a Young Faculty Teaching Award, the G. W. Taylor Award for Distinguished Research from the University of Minnesota, and the IEEE Fourier Technical Field Award (inaugural recipient in 2015). He is a fellow of EURASIP, and has served the IEEE in a number of posts, including that of a distinguished lecturer for the IEEE-SP Society. He is a fellow of the IEEE.

