Embedding Graphs for Shortest-Path Distance Predictions

Zhuowei Zhao
ORCID 0000-0002-6891-6432

Submitted in total fulfilment of the requirements of the degree of
Master of Philosophy

School of Computing and Information Systems
THE UNIVERSITY OF MELBOURNE

February 2020
All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Abstract
Graphs are an important data structure used in an abundance of real-world appli-
cations including navigation systems, social networks, and web search engines, to
name but a few. We study a classic graph problem – computing graph shortest-path
distances. This problem has many applications, such as finding nearest neighbors for
place of interest (POI) recommendation or social network friendship recommendation. To
compute a shortest-path distance, traditional approaches traverse the graph to find the
shortest path and return the path length. These approaches lack time efficiency over large
graphs. In the applications above, the distances may be needed first (e.g., to rank POIs),
while the actual shortest paths may be computed later (e.g., after a POI has been chosen).
Thus, an alternative approach precomputes and stores the distances, and answers dis-
tance queries with simple lookups. This approach, however, falls short in the space cost
– O(n²) in the worst case for n vertices, even with various optimizations.
To address these limitations, we take an embedding based approach to predict the
shortest-path distance between two vertices using their embeddings without comput-
ing their path online or storing their distance offline. Graph embedding is an emerging
technique for graph analysis that has yielded strong performance in applications such
as node classification, link prediction, graph reconstruction, and more. We propose a
representation learning approach to learn a k-dimensional (k ≪ n) embedding for ev-
ery vertex. This embedding preserves the distance information of the vertex to the other
vertices. We then train a multi-layer perceptron (MLP) to predict the distance between
two vertices given their embeddings. We thus achieve fast distance predictions with-
out a high space cost (i.e., only O(kn)). Experimental results on road network graphs,
social network graphs, and web document graphs confirm these advantages, while our
approach also produces distance predictions that are up to 97% more accurate than those
by the state-of-the-art approaches.
Our embeddings are not limited to distance predictions. We further study their
applicability on other graph problems such as link prediction and graph reconstruction.
Experimental results show that our embeddings are highly effective in these tasks.
Declaration
This is to certify that
1. the thesis comprises only my original work towards the MPhil,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 50,000 words in length, exclusive of tables, figures, bibliogra-
phies and appendices.
Zhuowei Zhao, February 2020
Acknowledgements
First of all, I would like to express my deepest gratitude to my supervisors, Dr. Jianzhong
Qi and Prof. Rui Zhang for their continuous support during my MPhil study. They have
guided me with their rich knowledge. Their passion for research has deeply encouraged
me. Without their support, this thesis would not have been possible.
I am deeply grateful to Prof. Wei Wang (The University of New South Wales) who
provided invaluable discussions and insightful feedback to my research.
I also sincerely thank my Advisory Committee Chair, Dr. Sean Maynard. He has kept
a close watch on my progress and given me generous support during my MPhil study. With-
out his insightful feedback and constructive comments, my progress would not have
been as smooth.
Then, I would like to thank The University of Melbourne and the School of Computing
and Information Systems for providing a supportive research environment and rich re-
sources for my MPhil study.
Last but not least, I would like to thank all my fellow research students with whom
I share an office or work on various occasions, for their support in research, life, and all
the pleasant memories, including Xinting Huang, Jiabo He, Yixin Su, Shiquan Yang, Yi-
List of Tables

3.1 Node2vec Based Distance Prediction Errors
3.2 Node2vec Based Distance Prediction Errors on Different Networks
3.3 Auto-encoder Based Distance Prediction Errors on Different Networks
3.4 Geodnn Based Distance Prediction Errors on Different Networks
4.1 Comparing Huber Loss with Reverse Huber Loss on Road Networks
4.2 Comparing Huber Loss with Reverse Huber Loss on Social Networks
4.3 Using MnSE and MnCE as the Loss Function on MB Dataset
4.4 Performance of Ensembling Models on MB
5.1 Datasets
5.2 Mean Absolute and Mean Relative Errors on Road Networks (Smaller Graphs)
5.3 Mean Absolute and Mean Relative Errors on Social Networks and Web Page Graph (Smaller Graphs)
5.4 Max Absolute and Max Relative Errors on Road Networks (Smaller Graphs)
5.5 Max Absolute and Max Relative Errors on Social Networks and Web Graph (Smaller Graphs)
5.6 Preprocessing and Query Times on Road Networks (Smaller Graphs)
5.7 Preprocessing and Query Times on Social Networks and Web Page Graph (Smaller Graphs)
5.8 Mean Absolute and Mean Relative Errors on Road Networks (Larger Graphs)
5.9 Mean Absolute and Mean Relative Errors on Social Networks (Larger Graphs)
5.10 Max Absolute and Max Relative Errors on Road Networks (Larger Graphs)
5.11 Max Absolute and Max Relative Errors on Social Networks (Larger Graphs)
5.12 Preprocessing and Query Times on Road Networks (Larger Graphs)
5.13 Preprocessing and Query Times on Social Networks (Larger Graphs)
5.14 Effectiveness of Embedding Learning (DG)
5.15 Impact of MLP Structure: Landmark-dg + MLP (DG)
5.16 Impact of MLP Structure: Vdist2vec (DG)
5.17 Impact of Number of Center Vertices (SH)
5.18 Impact of Loss Function (DG and MB)
5.19 Impact of Loss Function (SU and EPA)
5.20 Impact of Loss Function (FBTV and FBPOL)
5.21 Graph Reconstruction and Link Prediction Performance Measured in MAP
5.22 Graph Reconstruction and Link Prediction Processing Time
List of Abbreviations and Symbols
CH Contraction Hierarchies
DNN Deep Neural Network
MAP Mean Average Precision
MnCE Mean Cube Error
MLP Multilayer Perceptron
MnAE Mean Absolute Error
MnRE Mean Relative Error
MnSE Mean Square Error
MxAE Max Absolute Error
MxRE Max Relative Error
NN Nearest Neighbors
NDCG Normalized Discounted Cumulative Gain
POI Place of Interest
PCA Principal Component Analysis
PT Preprocessing Time
QT Query Time
REVHL Reverse Huber Loss
v A vertex
u A vertex
l A landmark vertex
G A graph
E An edge set
V A vertex set
L A landmark set
La A label set
La(v) Label of vertex v
d(v, u) Distance between v and u
d̂(v, u) Estimated distance between v and u
dq(v, u, La) Distance between v and u computed by the label set La
pv,u Shortest path from v to u
vi An embedding vector for vi
V An embedding matrix for V
L Training loss
Chapter 1
Introduction
1.1 Background
Graphs were first introduced by Leonhard Euler in 1735 [47] to solve a mathematical
problem called the Königsberg bridge problem, which asks whether one can traverse a
number of islands connected by bridges while crossing each bridge exactly once. Euler
modeled the problem by representing the islands as vertices and the bridges as edges of
a graph. Since then, graphs have become an important mathematical tool in many
disciplines including computer science, chemistry, linguistics, geography, and many more. In computer
science, graphs are an essential data structure and are commonly used to model trans-
portation networks, social networks, and web page link structures, just to name but a few.
Figure 1.1 gives an example, where Figure 1.1a is a social network graph, Figure 1.1b is a
web page graph, and Figure 1.1c is a transport network graph.
Figure 1.1: Examples of graphs in real life — (a) social network graph; (b) web page graph; (c) transportation network graph
In graph theory, a basic problem is to compute the distance between two vertices,
which may be used to model the travel cost between two places of interest (POIs), the social
closeness of two individuals, the relevance of two web pages, etc. Figure 1.2 shows the
distance between two vertices on a road network graph. We can see that such a distance
is not necessarily the Euclidean distance between the two vertices. The vertex distances
are fundamental for recommending POIs to tourists, suggesting friends to social network
users, or ranking web pages for search engines. In these applications, there may be mil-
lions of vertices and users who issue distance queries. For example, the Florida road
network [1] has over a million vertices; Google Maps has over a billion active users [5];
there are more than a billion active websites [8]; and Facebook has over 2 billion so-
cial network users [7]. Answering distance queries under such settings poses significant
challenges in both space and time costs.
Figure 1.2: Vertex distance on a road network graph
In this thesis, we revisit the problem of computing the distance between two vertices
in a graph. Here, the distance refers to the length of the graph shortest path between the
two vertices. We use distance for brevity when the context is clear. Figure 1.3a shows an
abstracted example of the problem, where v1, v2, ..., v5 are the vertices, and the numbers
on the edges are the edge weights. Consider vertices v1 and v5. Their distance is the
length of path v1 → v4 → v5, which is 4. Our aim is to answer queries on such distances
(approximately) with a high efficiency.
1.2 Research Gap
A traditional approach uses graph shortest path algorithms to compute the shortest path
between two vertices, along which the path length (i.e., the distance) is computed. Di-
jkstra’s single-source shortest-path (SSSP) algorithm [32] and the Floyd-Warshall all-pair
shortest-path (APSP) algorithm [36] are simple and effective algorithms for this purpose.
However, these methods have high computational complexity when running online
over large graphs, i.e., O(m + n log n) and O(n³), where m and n are the
numbers of edges and vertices, respectively. More recent algorithms such as contraction
hierarchies (CH) [44] reduce the time cost via preprocessing the graphs to add shortcut
edges (i.e., shortest paths between some vertices). These algorithms focus on computing
the shortest paths rather than the distances.
Figure 1.3: Graph shortest-path distance problem — (a) a graph example over vertices v1, ..., v5 with weighted edges; (b) distance labeling; (c) landmark labeling.

(b) Distance labeling (distance from each row vertex to each column vertex):

         v1   v2   v3   v4   v5
    v1    0    3    3    1    4
    v2    3    0    6    4    6
    v3    3    6    0    4    5
    v4    1    4    4    0    3
    v5    4    6    5    3    0

(c) Landmark labeling (distances to the landmarks v2 (l1) and v3 (l2)):

         v2 (l1)   v3 (l2)
    v1      3         3
    v2      0         6
    v3      6         0
    v4      4         4
    v5      6         5

In applications such as those mentioned above, the distances may be needed first
while the actual shortest-paths may be computed later. Meanwhile, the distances do not
update frequently or do not need to support real-time updates. For example, to rank
POIs for recommendation, we may just need the distances to the POIs, while the shortest
path can be computed after a POI has been chosen by the user. Also, the POI locations
do not change often. Similarly, to recommend friends for a user, we may need distances
in a social network graph that represent her social closeness to other users, but not the
shortest paths. Therefore, generating the recommendations can be done offline without
requiring real-time updates of the social network graph. Such applications are targeted
in this study.
Under such application contexts, studies (e.g., [14, 29, 58]) preprocess a graph and
build new data structures to enable fast distance queries without online shortest path
computations. Distance labeling is commonly used in these studies. The basic idea is to
precompute a vector of (distance) values for each vertex as its distance label. At query time,
only the distance labels of the two query vertices are examined to derive their distance,
which is simpler than shortest path computation. In an extreme case, the distance label
of every vertex consists of its distances to all other vertices (cf. Figure 1.3b). A distance
query is answered by a simple lookup in O(1) time, but this requires O(n²) space to store
all the distance labels. Various labeling approaches (e.g., 2-hop labeling [29] and highway
labeling [58]) are proposed to reduce the distance label size.
Hub labeling [29] is a representative labeling approach. It labels every vertex with its
distances to vertices on its shortest paths to all the other vertices. The vertices used for
labeling are called hubs. The hubs are chosen such that there exists exactly one hub on the
shortest path of every pair of vertices. Every vertex only stores its distances to the hubs
on its shortest paths to the other vertices. Since some of these shortest paths may share
the same hub, hub labeling can produce distance labels with smaller sizes. At query
time, the distance labels of the two query vertices are scanned to find their shared hub,
which must be a vertex on their shortest path. The distances to this hub are summed up
and returned as the query answer. This approach has been shown to be query efficient,
but its worst-case space cost is still O(n²) [57].
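A minimal sketch of the hub labeling query is given below (an illustrative Python fragment, not the implementation of the cited work), assuming the labels have already been precomputed and that labels[v] maps each hub of v to its distance from v.

```python
# Hub labeling query sketch; labels[v] maps each hub h of v to d(v, h).
def hub_query(labels, vi, vj):
    common_hubs = labels[vi].keys() & labels[vj].keys()
    # With exact hub labels, some common hub lies on the shortest path of
    # (vi, vj), so the smallest distance sum equals the true distance.
    return min(labels[vi][h] + labels[vj][h] for h in common_hubs)

# Example (hypothetical labels): if v1 and v5 share the hub v4 with
# d(v1, v4) = 1 and d(v5, v4) = 3, the query returns 1 + 3 = 4.
print(hub_query({'v1': {'v4': 1}, 'v5': {'v4': 3}}, 'v1', 'v5'))  # 4
```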
To avoid the O(n2) space cost, approximate techniques are proposed [25, 92], among
which landmark labeling [49, 77, 90] is a representative approach. The landmark label-
ing approach chooses a subset of k (k ≪ n) vertices as the landmarks. Every vertex vi
stores its distances to these landmarks as its distance label, i.e., a k-dimensional vector
〈d(vi, l1), d(vi, l2), . . . , d(vi, lk)〉, where l1, l2, . . . , lk ∈ L represent the landmarks and d(·)
represents the distance. At query time, the distance labels of the two query vertices vi and
vj are scanned, where the distances to the same landmark are summed up. The smallest
distance sum, i.e., min{d(vi, l) + d(vj, l) | l ∈ L}, is returned as the query answer (for undi-
rected graphs). In Figure 1.3, v2 and v3 are chosen as the landmarks (denoted by l1 and l2,
respectively), and the distance labels are shown in Figure 1.3c. The distance between v1
and v5 is computed as min{d(v1, l1) + d(v5, l1), d(v1, l2) + d(v5, l2)} = min{3 + 6, 3 + 5} = 8,
which is twice as large as the actual distance between v1 and v5 (i.e., 4). As the example
shows, even though landmark labeling reduces the space cost to O(kn), it may not return
the exact distance between vi and vj when their shortest path does not pass any land-
mark. How the landmarks are chosen plays a critical role in the algorithm accuracy. Since
finding the k optimal landmarks is NP-hard [77], heuristics are proposed [38, 77, 89] such
as choosing the vertices that are on more shortest paths as the landmarks.
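To make the query procedure concrete, here is a minimal sketch (illustrative Python, assuming the labels have been precomputed, e.g., by one Dijkstra run per landmark); it reproduces the example from Figure 1.3c.

```python
# Landmark labeling query sketch for an undirected graph;
# labels[v] is the k-dimensional label <d(v, l1), ..., d(v, lk)>.
def landmark_query(labels, vi, vj):
    # Triangle-inequality upper bound: min over landmarks of d(vi, l) + d(vj, l).
    return min(di + dj for di, dj in zip(labels[vi], labels[vj]))

# Labels from Figure 1.3c, with landmarks l1 = v2 and l2 = v3:
labels = {'v1': [3, 3], 'v5': [6, 5]}
print(landmark_query(labels, 'v1', 'v5'))  # 8, vs. the true distance of 4
```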
1.3 Contributions of the Thesis
To avoid the limitations in landmark selection and to preserve more distance information
in the distance labels, in this study, we propose a representation learning based approach
to learn an embedding for every vertex as its distance label. Our idea is motivated by the
recent advances in learning graph embeddings. Studies [22, 23, 48] show that vertices
can be mapped into a latent space where their structural similarity (e.g., the number of
common neighboring vertices) can be computed. This motivates us to map the vertices
into a latent space to compute their spatial similarity, i.e., shortest-path distances.
Our learned embeddings do not rely on any particular landmarks, nor do they discriminate
against vertices whose shortest paths do not pass through any landmark. Thus, our embeddings may
yield more accurate distance predictions for such vertices, while we retain a low space
cost. These will be verified by an experimental study on real-world graphs (Chapter 5).
To learn the vertex embeddings, we first adopt existing representation learning mod-
els, including an auto-encoder model [17] and the node2vec model [48]. Given the em-
beddings of two vertices learned by these models, we train a multilayer perceptron (MLP)
to predict the distance between the two vertices. We observe that the vertex embed-
dings learned by these models suffer in the distance prediction accuracy. For the auto-
encoder, its learned embeddings tend to encode the average distances between the ver-
tices,1 which do not help predict the distance of two specific vertices. Node2vec encodes
the local neighborhood information rather than the global distances. Neither model receives
direct training signals from the distance prediction of two vertices when learning the embeddings
for the two vertices.
To overcome these limitations, we further propose a distance preserving vertex to vector
(vdist2vec) model for vertex embedding. Our vdist2vec model learns vertex embeddings
jointly with training an MLP to make distance predictions based on such embeddings.
This way, the vertex embeddings are guided by signals from distance predictions, which
can better preserve the distance information.

¹Auto-encoders tend to learn to reconstruct the average of all training instances when used without pre-training [52].
Our vdist2vec model aims to learn an n × k matrix V where each row is
the embedding of a vertex (recall that n is the number of vertices and k is the embedding
dimensionality). This matrix is randomly initialized. When training the vdist2vec model,
we use two n-dimensional one-hot vectors to represent two vertices vi and vj for which
the distance is to be predicted. These two vectors are multiplied by V separately, which
fetches the two k-dimensional vectors vi and vj (i.e., the embeddings) corresponding to
vi and vj in V. Vectors vi and vj are then concatenated into a 2k-dimensional vector and
fed into an MLP to predict the distance between vi and vj . The optimization goal here is
to minimize the difference between the predicted distance and the actual vertex distance.
The prediction errors are propagated back to update vi, vj, and the MLP.
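As a small illustration (a toy numpy sketch with made-up sizes, standing in for the actual training framework), multiplying a one-hot vector by V is exactly a row fetch:

```python
import numpy as np

n, k = 5, 2                    # toy sizes: 5 vertices, 2-dimensional embeddings
V = np.random.rand(n, k)       # the embedding matrix, randomly initialized
h1 = np.eye(n)[0]              # one-hot vector for v1
v1 = h1 @ V                    # multiplying h1 by V ...
assert np.allclose(v1, V[0])   # ... fetches exactly v1's embedding row
```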
Once our model is trained, when a distance query comes with two query vertices vi
and vj , we just need to fetch vi and vj from V and feed them into the trained MLP to
predict the distance between vi and vj .
In summary, our study makes the following contributions:
• We propose a learning based approach to predict vertex distances without the need
of choosing a particular set of landmarks for distance labeling. Our approach has
an O(k) distance prediction time cost and an O(kn) space cost, where k is a small
constant denoting the vertex embedding dimensionality.
• We adopt existing representation learning techniques and study their limitations.
To address those limitations, we further propose to learn vertex embeddings while
jointly training an MLP to predict vertex distances based on such embeddings. Our
model is simple and efficient, since it is based on one-hot vectors and an MLP.
Our model is also highly accurate, since the embeddings are guided by distance
predictions directly.
• To further optimize the performance of our model, we propose a novel loss function
and an ensembling based network structure that guide the model learning to suit
the characteristics of the underlying data. We also discuss how to scale our model to
larger graphs, to handle graph updates, and to extend to other graph applications.
• We perform experiments on real road networks, social networks, and web page
graphs. The experimental results confirm the superiority of our proposed approaches.
Compared with state-of-the-art approximate distance prediction approaches, our
approach reduces both the mean and the maximum distance prediction errors, and
the advantage is up to 97%.
• To examine the general applicability of our model, we further perform link predic-
tion and graph reconstruction experiments on social networks. The results show
that our distance guided embeddings are also effective in these applications.
1.4 Outline of the Thesis
The rest of the thesis is organized as follows.
1. In Chapter 2, we review the related work on shortest-path distance computation
models and graph embedding models. We discuss both exact distance computa-
tion models and approximate distance computation models. We describe how each
model works and analyze their advantages and limitations. In addition, we discuss
applying graph embedding methods to the shortest-path distance problem as well as
other graph applications.
2. In Chapter 3, we formulate our problem and present a two-stage solution frame-
work. This framework allows us to adapt existing representation learning tech-
niques to learn vertex embeddings for vertex distance predictions.
3. In Chapter 4, we further propose a single-stage solution. We describe our proposed
model in detail, including the model structure, the loss function, the model opti-
mizations, and how to scale our model to large graphs and to handle graph updates.
We also discuss how to adapt our model to other graph applications such as link
prediction and graph reconstruction.
4. In Chapter 5, we present experimental results under various settings and examine
the impact of embedding dimensionality, MLP structure, graph updates, and loss
function. We also show the effectiveness of the proposed model in applications
such as POI recommendations, link prediction, and graph reconstruction.
5. In Chapter 6, we conclude the thesis with a discussion on the future work.
Chapter 2
Related Work
In this chapter, we discuss four lines of related studies: exact shortest-path distance com-
putation, approximate shortest-path distance computation, graph embedding, and graph
embedding applications.
2.1 Exact Distance Computation
To compute shortest-path distances, the first step is to compute the shortest paths. Two
classic shortest-path algorithms are Dijkstra’s algorithm [32] and the Floyd-Warshall algo-
rithm [36]. Dijkstra’s algorithm [32] is a single-source shortest path (SSSP) algorithm that
computes the shortest-paths from a given source vertex to all the other vertices in a
graph. The Floyd-Warshall algorithm [36] is an all-pair shortest path (APSP) algorithm that com-
putes the shortest paths between all vertex pairs in a graph. These two algorithms
have O(m + n log n) [37] and O(n³) time costs, where m and n are the numbers of edges
and vertices, respectively. More recent algorithms such as contraction hierarchies [44] re-
duce the time costs via adding shortcut edges (i.e., shortest paths between some vertices).
Once a shortest path is computed, the corresponding distance can be derived by simply
summing up the edge weights on the path. For efficient distance query processing, these
algorithms may be run to precompute the distance between every pair of vertices. Then,
a distance query can be answered by a simple lookup in O(1) time. Such an approach,
however, has a high space cost, i.e., O(n²).
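For reference, a minimal sketch of Dijkstra's algorithm is shown below (an illustrative binary-heap version, whose cost is O((m + n) log n) rather than the Fibonacci-heap bound cited above); running it from every vertex is one way to fill the precomputed distance table.

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest-path distances on non-negative edge weights.
    graph: adjacency lists {u: [(v, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# A toy fragment consistent with Figure 1.3a: d(v1, v5) = 4 via v1 -> v4 -> v5.
g = {'v1': [('v4', 1)], 'v4': [('v1', 1), ('v5', 3)], 'v5': [('v4', 3)]}
print(dijkstra(g, 'v1')['v5'])  # 4
```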
To reduce the space cost while retaining a high query time efficiency, a stream of
studies [14, 24, 29, 43, 58] precompute a distance label for every vertex vi. The distance
label of vi contains the distances to a subset of vertices Vi = {vi1, vi2, . . . , vik} ⊆ V in the
form of 〈(vi1, d(vi, vi1)), (vi2, d(vi, vi2)), . . . , (vik, d(vi, vik))〉. Here, V represents the full
vertex set of a given graph, and k may vary for different vertices. The distance of two
query vertices vi and vj is derived from their distance labels as:

    dq(vi, vj, La) = min{d(vi, v) + d(vj, v) | v ∈ Vi ∩ Vj}
Graph compression approaches are typically based on grouping similar vertices. Graph
embeddings, as a graph representation, may be used in graph compression if we can
reconstruct the graph from the embeddings with a high accuracy.
Link prediction aims to find missing links or predict future links in a graph based
on the observed graph structure. It has many applications. For example, in social net-
work graphs, link prediction can be used to find potential relationships for friend recom-
mendation and advertising. To achieve this goal, one way is to compute vertices’ simi-
larity and predict probable links among them [12, 62]. Other methods such as maximum
likelihood methods [20, 28] and probabilistic methods [99, 51, 42] solve the problem from
a statistical viewpoint. Graph embedding maps vertices into a latent space so that their
similarity can be computed. For example, we can use Euclidean distance of the vertex
embeddings to describe their similarity.
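As a sketch of this idea (toy, hand-made embeddings; not tied to any particular embedding model), candidate links can be ranked by the Euclidean distance between vertex embeddings:

```python
import numpy as np

# Toy 2-dimensional embeddings; in practice they come from a trained model.
emb = {'a': np.array([0.1, 0.9]),
       'b': np.array([0.2, 0.8]),
       'c': np.array([0.9, 0.1])}

def link_score(u, v):
    # Smaller embedding distance -> higher link likelihood.
    return -np.linalg.norm(emb[u] - emb[v])

candidates = [('a', 'b'), ('a', 'c')]
print(max(candidates, key=lambda pair: link_score(*pair)))  # ('a', 'b')
```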
Visualizing a graph in a proper way helps viewers gain information about a graph
conveniently and quickly. It has many applications in different fields where graph data
are used [91, 40, 60, 31]. Embedding representations can be fed into a dimensionality
reduction model such as Principal Component Analysis (PCA) [73] and then be visual-
ized in a Euclidean space. The Euclidean distances between the vertices can then reveal
their hidden relationships clearly.
2.5 Summary
In this chapter, we reviewed methods for shortest-path distance computation, including
landmark based methods, distance labeling methods and graph embedding methods.
By comparing these methods, we obtain a clearer view about their advantages and lim-
itations. Distance labeling methods may have a high accuracy, while their space cost is
O(n²) in the worst case. Landmark based methods have a linear space cost (O(kn)) but
potentially a lower accuracy. Graph embedding methods also have a linear space cost
(in the embedding dimensionality), and they have advantages in query speed (i.e., parallel
vector processing). However, it is challenging to keep both the global and local distance
information of a graph in the embedding vectors. These motivate us to develop a learn-
ing based approach that retains the advantages of graph embeddings in query efficiency
while overcoming the challenges in preserving the distance information.
Chapter 3
Adapted Two-Stage Models
This chapter presents our problem solutions by adapting existing representation learning
techniques. We start with basic concepts and a problem definition in Section 3.1. We then
present a two-stage framework for adapting representation learning techniques to solve
our problem in Section 3.2. We adapt existing representation learning models to make
distance predictions and show their limitations in Section 3.3.
3.1 Problem Formulation
We consider a graph G = 〈V,E〉, where V is a set of vertices and E is a set of edges. An
edge ei,j ∈ E represents a connection between two vertices vi and vj ∈ V . Each edge ei,j
is associated with a weight denoted by ei,j .w, which represents the cost (i.e., distance) to
travel across the edge. For simplicity, in what follows, our discussions assume undirected
edges, i.e., one can travel from both directions on ei,j with the same cost ei,j .w, although
our proposed techniques work for both directed and undirected edges.
Given two vertices vi and vj in G, a path pi,j between vi and vj consists of a sequence
of vertices vi → v1 → v2 → ... → vx → vj starting from vi and ending at vj , such that
there is an edge between any two adjacent vertices in the sequence. The length of pi,j ,
denoted by |pi,j |, is the sum of the weights of the edges between adjacent vertices in pi,j :
    |pi,j| = ei,1.w + e1,2.w + ... + ex,j.w    (3.1)
Among all the paths between vi and vj, we are interested in the one with the smallest
length, i.e., the shortest path. Let such a path be p*i,j. The length of this path is the
(shortest-path) distance between vi and vj, denoted by d(vi, vj):

    d(vi, vj) = |p*i,j|    (3.2)

Figure 3.1: Solution framework — (a) vertex representation learning: a representation learning network maps every vertex of G to a k-dimensional vector; (b) distance predictor training (distance prediction): an MLP takes two vertex vectors vi and vj and outputs the predicted distance d̂i,j.
Given the concepts above, the shortest-path distance query is defined as follows.
Definition 1 (Shortest-path distance query) Given two query vertices vi and vj that belong
to a graph G, a shortest-path distance query returns the shortest-path distance between vi and
vj , i.e., d(vi, vj).
Our aim is to provide an approximate answer for a shortest-path distance query with
a high accuracy and efficiency.
3.2 A Two-Stage Solution Framework
We take a learning based approach to answer shortest-path distance queries. Given a
graph G, we first take a two-stage procedure that allows us to adapt existing representa-
tion learning techniques to answer shortest-path distance queries:
1. Representation learning. We preprocess G by mapping each vertex vi ∈ V to a k-
dimensional vector representation vi ∈ Rk (cf. Figure 3.1a).1 The goal of this stage
is to learn vertex representations that preserve the graph distances between the
vertices, i.e., vertices that have small distances inG should also have small distances
for their learned vector representations, and vice versa.
2. Distance predictor training. We train a multi-layer perceptron (MLP) using the learned
vectors vi and vj of every pair of vertices vi and vj in V as the input and the
distance d(vi, vj) as the target output (cf. Figure 3.1b). We use the mean square error
as the default loss function Ld to optimize the MLP parameters:

    Ld = E_P[(d(vi, vj) − d̂i,j)²]    (3.3)

Here, d̂i,j denotes the predicted distance between vi and vj, and P denotes a dis-
tribution over V × V . In the simplest case, P is just the full set of V × V , i.e., to
optimize for every pair of vertices in G.
At query time, given two query vertices vi and vj , their learned representations vi
and vj are fetched and then fed into the MLP trained in the stage above. The input order
of the vectors reflects the travel direction in graph. For example, if we travel from vj to
vi instead, vj will be put in front of vi when being fed into MLP. For directed graphs,
this can distinguish the distance difference when the travel direction changes. This also
applies to our Vdist2vec model proposed in Chapter 4. The output of the MLP is returned
as the distance query answer (cf. Figure 3.1b).
Our two-stage model framework takes advantage of the recent advances in neural
networks and representation learning to avoid online graph traversals. Since neural net-
work inference (i.e., predictions) can be done efficiently, our solution can offer query
answers with a high efficiency.
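The following is a compact sketch of the second stage (PyTorch is an assumed framework choice here, and the layer sizes are illustrative rather than the thesis configuration): the stage-1 embeddings are fixed inputs, and only the MLP is trained with the MnSE loss of Equation 3.3.

```python
import torch
import torch.nn as nn

k = 16                                 # assumed embedding dimensionality
mlp = nn.Sequential(nn.Linear(2 * k, 64), nn.ReLU(), nn.Linear(64, 1))
optim = torch.optim.Adam(mlp.parameters())
loss_fn = nn.MSELoss()                 # the default loss Ld (Equation 3.3)

# vi_batch, vj_batch: stage-1 embeddings of sampled vertex pairs (fixed here);
# d_batch: the corresponding ground-truth shortest-path distances.
vi_batch, vj_batch = torch.rand(32, k), torch.rand(32, k)
d_batch = torch.rand(32, 1)

pred = mlp(torch.cat([vi_batch, vj_batch], dim=-1))  # input order = travel direction
loss = loss_fn(pred, d_batch)
loss.backward()                        # in the two-stage setup, only the MLP is updated
optim.step()
```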
Our solution offers approximate query answers, the accuracy of which is determined
by the quality of the learned vertex vectors. In what follows, we focus on the represen-
tation learning stage to obtain high-quality vectors that well preserve the vertex distances.

¹For directed graphs, we learn two embeddings for each vertex vi, one for vi as the source vertex and the other for vi as the destination vertex.
However, not all graphs contain vertex coordinates, e.g., social network graphs do not, which limits
the applicability of this approach.
Figure 3.3: A road network example
Another simple way to obtain vertex representations is to use their distance labels
such as the landmark labels as described in Section 2.2. Then, we take advantage of the
capability of the MLP to learn a non-linear function to predict the shortest-path distance
based on such distance labels, rather than simply scanning the labels and summing up the
distances to the landmarks. We also use this approach as a baseline model in Chapter 5.
Table 3.4: Geodnn Based Distance Prediction Errors on Different Networks

           Mean Absolute Error   Mean Relative Error   Max Absolute Error   Max Relative Error
    DG     1,566                 0.092                 41,376               262
    MB     95                    0.097                 3,563                92
    SU     442                   0.108                 14,916               162
    FBPOL  N/A                   N/A                   N/A                  N/A
    FBTV   N/A                   N/A                   N/A                  N/A
    EPA    N/A                   N/A                   N/A                  N/A
Table 3.4 shows the performance of the geodnn approach on road networks (DG, MB, and SU).
As geodnn predicts the distances between two vertices based on their Euclidean distance,
it works well on road networks with few detours in shortest-path traversal. For example,
geodnn has a good performance on MB, which is a grid-shaped road network where
vertices are located neatly on grid lines.
3.4 Summary
In this chapter, we presented a two-stage solution framework and adapted existing ver-
tex representations for this framework, including node2vec, auto-encoder, geo-coordinates
(longitudes and latitudes), and landmark labels. We described each representation with
examples and analyzed their advantages and disadvantages. The two-stage solution
framework verifies the feasibility of a learning based solution for graph shortest-path dis-
tance predictions, while it also has limitations in that it separates representation learning
from distance predictions. This leads to sub-optimal vertex representations and distance
prediction accuracy. We propose a single-stage model to address this limitation in the
next chapter.
Chapter 4
Proposed Single-Stage Model
In Chapter 3, we described two-stage models where embedding learning and dis-
tance prediction are disconnected. To further improve the embedding quality, in this
chapter, we propose a single-stage model called vdist2vec that learns embeddings which
are guided directly by the distance prediction.
We detail the vdist2vec model structure in Section 4.1. We then discuss the choice of
loss functions for the model in Section 4.2. To further enhance the distance prediction
accuracy, we design a variant of the vdist2vec model using the ensembling technique in
Section 4.3. We scale vdist2vec to large graphs in Section 4.4. We cover update handling
in Section 4.5 and algorithm costs in Section 4.6.
4.1 Vdist2vec
Our vdist2vec model connects vertex representation learning with distance prediction to
form a single neural network. The model takes two vertices vi and vj as the input, learns
their representations, and predicts their distance as the targeted output. This structure
enables the distance signals from the output layer of the distance prediction network to be
propagated back to the representation learning network. Thus, the vertex representations
Part of the content of this chapter is published in:
1. Jianzhong Qi, Wei Wang, Rui Zhang, and Zhuowei Zhao. A Learning Based Approach to Predict Shortest-Path Distances. International Conference on Extending Database Technology (EDBT), 2020. (CORE Ranking: A, accepted in December 2019)*
* The authors are ordered alphabetically.
can be learned to better preserve the distance information.
Our model structure is illustrated by Figure 4.1. In the model, the input vertices vi and
vj are each represented as a size-|V | one-hot vector. The one-hot vector of vi (vj), denoted
by hi (hj), has a 1 in the i-th (j-th) dimension and 0's in all other dimensions. The next layer
is an embedding layer, which is used for representation learning. This layer has k nodes,
and its weight matrix is a |V |×k (2|V |×k for directed graphs) matrix that will be used as
the vertex vectors for all vertices, denoted by V = [v1T ,v2
T , ...,v|V|T ]T . Multiplying hi
(hj) by V yields vi (vj), i.e.,
vi = hiV (4.1)
Vectors vi and vj are then fed into a distance prediction network to predict the distance
between vi and vj . Recall that the distance prediction network is an MLP where the
default loss function Ld is the mean square error on the actual vertex distances and the
predicted distances (cf. Equation 3.3).
Figure 4.1: Vdist2vec model structure — the one-hot vectors hi (for vi) and hj (for vj) in a 2|V|-dimensional one-hot layer feed a shared k-dimensional embedding layer; the resulting embeddings vi and vj form the MLP input layer, which is fully connected to the MLP hidden and output layers producing d̂i,j.
At training time, the vertex representation matrix V is randomly initialized. The
corresponding vertex pairs’ vectors will then be concatenated and fed into the network
in batches to train the MLP. The training loss Ld will be propagated back to optimize
the MLP and the vertex representations in V. The optimization goal is to minimize the
errors between the exact and predicted distances of vertex pairs. At query time,
the vertex vectors vi and vj of the query vertices vi and vj are fetched from V, and the
MLP trained as part of vdist2vec is used to make a distance prediction.
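Below is a minimal sketch of this architecture (PyTorch is an assumed framework choice; the hidden size is illustrative, not the thesis configuration). The one-hot multiplication hi V is realized as an embedding lookup, which is mathematically equivalent, and the whole network is trained end to end.

```python
import torch
import torch.nn as nn

class VDist2Vec(nn.Module):
    def __init__(self, n_vertices, k, hidden=100):
        super().__init__()
        # The matrix V; an index lookup here equals multiplying a one-hot h_i by V.
        self.embed = nn.Embedding(n_vertices, k)
        self.mlp = nn.Sequential(nn.Linear(2 * k, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, i, j):
        vi, vj = self.embed(i), self.embed(j)         # fetch rows of V
        return self.mlp(torch.cat([vi, vj], dim=-1))  # predicted distance

model = VDist2Vec(n_vertices=5, k=2)
optim = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()                                # the default loss Ld

# One training step on a toy batch of vertex pairs with ground-truth distances
# taken from Figure 1.3b: d(v1, v4) = 1 and d(v1, v5) = 4.
i, j = torch.tensor([0, 0]), torch.tensor([3, 4])
d = torch.tensor([[1.0], [4.0]])
loss = loss_fn(model(i, j), d)
loss.backward()      # the loss updates both the MLP and the embeddings in V
optim.step()
```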
Our vdist2vec model can be adapted for other distance prediction problems on graphs.
For example, we can use the resistance distance [64] to replace the shortest-path distance
in the vdist2vec output to train a model for predicting resistance distances. In addition,
the distance to be predicted between two vertices is not limited to a scalar. A vector
output can represent multiple distance values such as the top-k shortest-path distances.
To adapt our vdist2vec model for top-k shortest-path distances, we can simply use a
size-k vector as the model output while keeping the other parts of our model unchanged.
In real graph database systems such as Neo4j [97], the vertex embeddings
learned by our vdist2vec model can be stored together with the vertices in the graph.
This helps the system achieve faster shortest-path distance queries and easier
vertex visualization.
4.2 Loss Function
Mean square error (MnSE) is one of the most commonly used loss functions in machine
learning models, and it is also our default loss function Ld. This error is defined by
Equation 4.2.

    MnSE = (1/n) Σ_{i=1}^{n} (d(vi, vj) − d̂i,j)²    (4.2)
Recall that d(vi, vj) and d̂i,j are the ground truth and predicted distances between vi and
vj , respectively.
Another commonly used loss function is mean absolute error (MnAE), which is defined
by Equation 4.3.
    MnAE = (1/n) Σ_{i=1}^{n} |d(vi, vj) − d̂i,j|    (4.3)
These two error measurements have a somewhat similar effect in model learning. How-
ever, MnSE is more sensitive to the variance of the error than MnAE. For example, given
a set of ground truth values D = {1, 1, 1, 1} and two sets of prediction values from two
different models, D1 = {1, 2, 3, 4} and D2 = {1, 1, 4, 4}, the absolute prediction errors of
the two models are E1 = {0, 1, 2, 3} and E2 = {0, 0, 3, 3}, respectively. Both models share
the same MnAE (i.e., 1.5), while the second model has a larger MnSE (i.e., 4.5 vs. 3.5), as
E2 has a larger variance.
Figure 4.2: Error distribution of our model (DG)
To examine the impact of loss functions, we analyse the error distribution of our
model. Figure 4.2 is an example of the error distribution of the vdist2vec model on the
DG graph dataset (detailed in Chapter 5) with MnSE as the loss function (the error distri-
bution is similar when using MnAE as the loss function). We see that a very small portion
(e.g., less than 1%) of the vertex pairs have much larger prediction errors (see the spike
to the right of the figure) than the other vertex pairs. Using MnSE as the loss function
gives higher weight to these larger error values than to the smaller error values. This
is good for controlling the maximum prediction errors but may suffer in the mean pre-
diction errors. Next, we optimize the loss function to reduce the mean prediction errors.
4.2.1 Reducing Mean Errors
As discussed above, applying MnSE as the loss function emphasizes the large errors
which only come from around 1% of all vertex pairs. To reduce the mean errors, we should guide
our model to focus more on the rest of the vertex pairs (which have smaller errors).
Our idea originates from a loss function named the Huber loss that combines MnSE
and MnAE [55, 41, 100]. Its basic idea is to use MnAE when the error is larger than a
parameter δ, and to use MnSE otherwise. Equation 4.4 defines this loss function.
    HLδ(a) = (1/2) a²,            for |a| ≤ δ
    HLδ(a) = δ (|a| − (1/2) δ),   otherwise    (4.4)
By applying Huber loss, errors below δ are weighted the same as MnSE, while errors
above are weighted less (multiplied by δ which is smaller than the error itself).
Inspired by the Huber Loss, we propose a reverse Huber loss (REVHL) function that
is defined in Equation 4.5. As the equation shows, the errors below δ are now weighted
more heavily, as each error is multiplied by δ, which is larger than the error itself.
    Lδ(a) = δ |a|,               for |a| ≤ δ
    Lδ(a) = (1/2)(a² + δ²),      otherwise    (4.5)
We also adapted the equation for the case where the error exceeds δ to make the
overall loss function continuously differentiable, enabling it to be used in model training.
To show that REVHL is continuously differentiable, first, both δ|a| and (1/2)(a² + δ²) are
continuous and continuously differentiable themselves. Further, when a = δ,

    Lδ(δ) = δ²,    lim_{x→0} Lδ(δ + x) = (1/2)((δ + x)² + δ²) = (1/2)(δ² + δ²) = δ²

    L′δ(δ) = δ,    lim_{x→0} L′δ(δ + x) = lim_{x→0} (δ + x) = δ
Thus, REVHL is also continuously differentiable at a = δ.
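In code, REVHL is a one-liner (a sketch; PyTorch is an assumed framework choice, and δ is passed in as a parameter):

```python
import torch

def reverse_huber(pred, target, delta):
    # Equation 4.5: scaled-linear below delta, quadratic above; the two
    # branches meet with equal value and slope at |a| = delta (shown above).
    a = torch.abs(pred - target)
    return torch.where(a <= delta, delta * a,
                       0.5 * (a ** 2 + delta ** 2)).mean()
```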
Our REVHL yields trained models with lower prediction errors as shown in Table 4.1
and Table 4.2, where MnRE denotes the mean relative error.
To apply REVHL in our model, we need to select δ. A suitable δ for our model should
separate the few vertex pairs with much larger errors from the rest of the vertex pairs.
During training, the errors become smaller, so a δ that is suitable for earlier iterations
may no longer be suitable for later ones.
Table 4.1: Comparing Huber Loss with Reverse Huber Loss on Road Networks
The geodnn approach only works on road networks as it makes predictions based on
the geo-coordinates of the vertices. Its performance relies on how far away the shortest
paths deviate from the straight lines between the vertices. It is the second best baseline
approach on MB, which is a small grid-shaped road network with few detours. It drops to
the third on DG and SU, which are larger road networks that cover rivers and have larger
detours.
The node2vec approach focuses on embedding the neighborhood of the vertices. It
works better on graphs with small diameters where the vertices are all near each other.
For example, FBPOL has a small diameter of 14, for which node2vec is the second best
approach. When the graph diameter becomes larger (e.g., 96 km for DG), node2vec in-
curs larger errors since the neighborhood becomes less relevant to the distance between
vertices far away.
The auto-encoder tends to generate embeddings that preserve the average distances
between the vertices. This leads to an unsatisfactory prediction performance in general,
as evidenced by the large mean errors reported. On the other hand, this property may
help avoid large maximum errors, e.g., the auto-encoder is the best baseline on FBPOL
in terms of the maximum errors (cf. Table 5.5). Its mean accuracy, in contrast, is better on graphs with
larger diameters where the vertex distances may have a larger variance, e.g., DG and
SU. This is because a larger variance on the distances may offer a stronger signal for the
auto-encoder to learn different embeddings for different vertices, rather than the same
average distance.
Tables 5.4 and 5.5 show the MxAE and the MxRE of the models. Our vdist2vec and
vdist2vec-S models also outperform the baselines on these two measures (except on
FBPOL where auto-encoder is equally good in MxRE). The vdist2vec model reduces the
MxAE by up to 92% (376 vs. 4,724 for vdist2vec and landmark-bt on MB) and the MxRE
by up to 96% (63 vs. 1,549 for vdist2vec and landmark-bt on SU), while the performance
of vdist2vec-S is even stronger. This again verifies the capability of our models to learn
and preserve the vertex distance information. Note that, similar to what has been ob-
served over MnAE and MnRE, a larger MxAE does not mean a larger MxRE either, e.g.,
on DG, geodnn and vdist2vec have similar MxRE but geodnn has a much larger MxAE.
This is because MxAE and MxRE are usually observed from different pairs of vertices –
MxAE tends to come from vertices far away, while MxRE tends to come from vertices
with a very small distance (e.g., 1). Comparing Tables 5.2 and 5.3 with Tables 5.4 and 5.5,
we find that, in general, the baseline methods do not yield low mean and maximum er-
rors at the same time. For example, landmark-bt is close to vdist2vec on FBTV in terms
of the mean errors, while its maximum errors are much larger than those of vdist2vec on
the same dataset. Similarly, auto-encoder is close to vdist2vec on FBPOL in terms of the
maximum errors, but its mean errors are more than 7 times larger than those of vdist2vec
on the same dataset. This further highlights the advantage of vdist2vec and vdist2vec-S,
which can achieve low mean and maximum errors at the same time.
Tables 5.6 and 5.7 show the preprocessing (model training) time PT and distance pre-
diction (query) time QT. In terms of PT, the landmark approaches are much faster. Their
precomputation procedures are deterministic and much simpler than the training pro-
cedures of the learning based models which involve multiple iterations of numeric op-
timization on the neural networks. For learning based models, the main parameter that
affects the preprocessing time is the number of embedding dimensions. As geodnn has the
lowest number of embedding dimensions (2), it has the smallest preprocessing cost. The
node2vec model has a fixed number of dimensions (128, as suggested to be the best for
the shortest-path distance problem [78]). Our models vdist2vec and vdist2vec-S have
more complex distance prediction networks and hence longer training times.
In terms of QT, the learning based approaches are at the same (or smaller) magnitude
as the landmark approaches. This is because distance prediction in the learning based
approaches is a simple forward propagation procedure, which can be easily parallelized
and take full advantage of the computation power of the GPU. The geodnn model is
the fastest, as its input layer only has four dimensions (i.e., the geo-coordinates of the two query vertices).
The other three learning based approaches node2vec, auto-encoder, and vdist2vec have
very similar MLP structures and input sizes which are larger than that of geodnn. Thus,
their QTs are similar and larger than that of geodnn. Note that the QT of node2vec differs
slightly from those of auto-encoder and vdist2vec. This is because node2vec has a con-
stant embedding dimensionality k = 128 which is suggested to be optimal [78], while the
embedding dimensionality of auto-encoder and vdist2vec varies with the number of
vertices (i.e., k = 2% × |V|). The QT of vdist2vec-S is longer than that of vdist2vec, again due to
its slightly more complex structure.
5.2.3 Performance on Larger Graphs
Tables 5.8 to 5.13 show the model performance on the larger graphs, i.e., FL, NY, SH, and
POK. These graphs cannot be processed in full by vdist2vec under our hardware con-
straints. Following the procedure described in Section 4.4, for each of the road networks
FL, NY, and SH, vdist2vec clusters the vertices into 0.1% × |V| clusters using the k-means algo-
rithm and learns embeddings (and the MLP) for the cluster center vertices. The model
then randomly samples 100,000 pairs of vertices and uses their geo-coordinates and dis-
tances to learn the offset coefficients λ1 and λ2. The learned embeddings of the center
vertices and coefficients will enable vdist2vec to predict the distance between any pair
of vertices in the graph. For the social network POK, vdist2vec uses the 0.1% × |V| vertices
with the largest degrees as the center vertices. It learns vertex embeddings to predict dis-
tances between these center vertices and the rest of the vertices. These embeddings are
then used to predict the distance between any two vertices.
Table 5.8: Mean Absolute and Mean Relative Errors on Road Networks (Larger Graphs)

                        FL                  NY                  SH
                        MnAE      MnRE      MnAE      MnRE      MnAE     MnRE
    baseline
      landmark-bt       OT        OT        24,851    0.167     6,144    0.554
      landmark-dg       67,104    0.134     20,483    0.164     4,407    0.403
      geodnn            363,661   0.317     207,694   0.862     14,842   0.990
      node2vec          OT        OT        217,400   0.703     19,465   1.276
      auto-encoder      OT        OT        OT        OT        OT       OT
Among the baseline models, the landmark approaches are run on the full graphs
which may not complete in time. We terminate the algorithms after 48 hours and denote
this as “OT” in the tables. For geodnn, we randomly sample 100,000 pairs of vertices and
use their geo-coordinates and distances to train the MLP for distance prediction. For
node2vec and auto-encoder, their vertex embeddings need to be learned for all vertices,
which may also run overtime. For the datasets where they can learn the embeddings in
time, we also randomly sample 100,000 pairs of vertices to train the MLP.
Table 5.10: Max Absolute and Max Relative Errors on Road Networks (Larger Graphs)

                        FL                    NY                   SH
                        MxAE        MxRE      MxAE        MxRE     MxAE      MxRE
    baseline
      landmark-bt       OT          OT        764,169     36       127,433   1,787
      landmark-dg       2,380,151   84        571,447     67       88,038    595
      geodnn            3,085,059   82        982,265     84       69,382    77
      node2vec          OT          OT        1,087,000   65       93,433    688
      auto-encoder      OT          OT        OT          OT       OT        OT
Table 5.11: Max Absolute and Max Relative Errors on Social Networks (Larger Graphs)

                        POK
                        MxAE   MxRE
    baseline
      landmark-bt       OT     OT
      landmark-dg       6      6
      geodnn            N/A    N/A
      node2vec          OT     OT
      auto-encoder      OT     OT
    proposed
      vdist2vec         5      5
      vdist2vec-S       5      5
For testing, since it will take too long to test all pairs of vertices, following the strategy
of previous studies [49, 77], we randomly sample 100,000 pairs of vertices (different from
those sampled for training) and test all the approaches on them. We use k = 50% ×
0.1% × |V| for the SH dataset and k = 5% × 0.1% × |V| for the other three datasets, as SH is
considerably smaller than the other three datasets.
As Tables 5.8 to 5.11 show, our vdist2vec and vdist2vec-S models also produce smaller
distance prediction errors than the baselines on the larger graphs, even when our em-
beddings and MLP are not trained on every pair of vertices. For vdist2vec, the reduc-
tions achieved in MnAE, MnRE, MxAE, MxRE are up to 92% (17,278 vs. 217,400 for
vdist2vec and node2vec on NY), 94% (0.056 vs. 0.862 for vdist2vec and geodnn on NY),
80% (604,489 vs. 3,085,059 for vdist2vec and geodnn on FL), and 99% (18 vs. 1,787 for
vdist2vec and landmark-bt on SH), respectively. The improvement of vdist2vec-S over
Table 5.12: Preprocessing and Query Times on Road Networks (Larger Graphs)

                        FL                   NY                    SH
                        PT      QT           PT      QT            PT     QT
    baseline
      landmark-bt       OT      OT           39.6h   11.712µs      14.7h  8.423µs
      landmark-dg       29.3s   54.492µs     9.4s    16.325µs      1.5s   13.584µs
      geodnn            0.1h    0.444µs      0.1h    0.458µs       0.1h   0.432µs
      node2vec          OT      OT           26.3h   0.751µs       2.8h   0.781µs
      auto-encoder      OT      OT           OT      OT            OT     OT