Page 1
THE SHORTEST PATH IS NOT ALWAYS A STRAIGHT LINE
leveraging semi-metricity in large-scale graph analysis
Vasiliki Kalavri ([email protected]), KTH Royal Institute of Technology
Tiago Simas ([email protected]), Telefonica Research
Dionysios Logothetis ([email protected]), Facebook
Page 2
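Weighted graphs capture relationship strength
[Figure: example graph of users Alice, Bob, and Max with edges weighted by number of likes (42 and 3)]
• Edge weights: distance, similarity, social proximity, rating, preference
• Applications: influential nodes, optimal propagation paths, communities, recommendations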
Page 3
Sparsification techniques reduce the graph size and still give exact or good approximate results
G → G' such that f(G) ~ f(G')
Page 4
THE METRIC BACKBONE
Reduces the graph size while maintaining relevant structure
The minimum subgraph of a weighted graph that preserves the shortest paths of the original graph
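[Figure: example graph on vertices A, B, C, D, E with edge weights 2, 3, 10, 4, 2, 1 (left) and its metric backbone, which keeps only the edges with weights 2, 3, 2, 1 (right)]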
Page 6
WHAT CAN WE USE IT FOR?
• Exact computations
  • any algorithm that depends on the shortest paths
  • reachability, connectivity
  • betweenness centrality, closeness centrality
• Approximation
  • PageRank, random walks
  • eigenvector centrality
  • community detection, clustering
Improves community detection modularity and recommender systems accuracy
Page 7
IMPACT ON LARGE-SCALE SYSTEMS
• Graph Databases
  • fewer edges => smaller path search space
• Batch Graph Processing
  • CPU and memory requirements depend on #messages
  • #messages proportional to #edges
  • fewer edges => improved analysis performance
• Graph Compression
  • fewer edges => storage reduction
Page 12
SEMI-METRICITY
In a weighted graph, an edge is semi-metric if there exists a shorter indirect path between its endpoints
[Figure: example graph on vertices A, B, C, D, E with edge weights 2, 3, 10, 4, 2, 1]
CE is 1st-order semi-metric: C-D-E is a shorter 2-hop path
AD is 2nd-order semi-metric: A-B-C-D is a shorter 3-hop path
AB, BC, CD, DE are metric
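The definition can be checked directly on the example graph. The sketch below is illustrative only: the slides give the six weights but the figure layout is lost, so the assignment of weights to edges in `EDGES` is an assumption chosen to agree with the labels stated on the slide. The function names are hypothetical.

```python
import heapq

# ASSUMPTION: which edge carries which weight is not recoverable from the
# slides; this assignment is consistent with the labels they state.
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
         ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10}

def build_adjacency(edges):
    """Turn an undirected weighted edge list into a dict-of-dicts."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def shortest_distance(adj, source, target):
    """Plain Dijkstra; returns the weighted shortest-path distance."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in adj[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return float("inf")

def classify_edges(edges):
    """Label an edge semi-metric if some path is strictly shorter than it."""
    adj = build_adjacency(edges)
    return {
        (u, v): "semi-metric" if shortest_distance(adj, u, v) < w else "metric"
        for (u, v), w in edges.items()
    }

if __name__ == "__main__":
    for edge, label in classify_edges(EDGES).items():
        print(edge, label)
```

Running one shortest-path search per edge is exactly the cost that the rest of the talk tries to avoid; the sketch only makes the definition concrete.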
Page 13
BACKBONE ALGORITHM
Page 15
BACKBONE CALCULATION
• Calculating the backbone:
  • find all semi-metric edges: 1 BFS per edge?
  • compute APSP and store O(N²) paths
Can we calculate or approximate the backbone without solving APSP?
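For reference, the APSP baseline can be sketched as follows. This is a minimal illustration using Floyd–Warshall, not the method proposed in the talk; the edge weights in `EDGES` are an assumed assignment of the weights shown in the example figure, and `backbone_via_apsp` is a hypothetical name.

```python
from itertools import product

# ASSUMPTION: weight-to-edge assignment chosen to match the slide labels.
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
         ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10}

def backbone_via_apsp(edges):
    """Baseline: solve all-pairs shortest paths, then keep metric edges."""
    nodes = sorted({n for edge in edges for n in edge})
    inf = float("inf")
    # The distance table has one entry per ordered vertex pair: O(N^2) memory.
    dist = {(a, b): (0 if a == b else inf)
            for a, b in product(nodes, repeat=2)}
    for (u, v), w in edges.items():
        dist[u, v] = dist[v, u] = min(dist[u, v], w)
    # Floyd-Warshall: k (the intermediate vertex) is the outermost loop.
    for k, i, j in product(nodes, repeat=3):
        through_k = dist[i, k] + dist[k, j]
        if through_k < dist[i, j]:
            dist[i, j] = through_k
    # An edge is metric iff its weight equals the shortest-path distance.
    backbone = {e: w for e, w in edges.items() if w == dist[e]}
    return backbone, dist

if __name__ == "__main__":
    backbone, dist = backbone_via_apsp(EDGES)
    print(sorted(backbone))
    print(len(dist), "stored distances")
```

The quadratic distance table is what makes this approach impractical for graphs with millions of vertices.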
Page 17
ORDER OF SEMI-METRICITY
Most semi-metric edges are 1st-order semi-metric
Page 19
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles
   Scalable & practical for large graphs
Page 22
EXAMPLE
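Phase 1
[Figure: the example graph after Phase 1; the weight-4 edge CE has been removed as 1st-order semi-metric, leaving the edges with weights 2, 3, 10, 2, 1]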
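A minimal single-machine sketch of the triangle check in Phase 1: an edge (u, v) is 1st-order semi-metric if some common neighbor x gives a 2-hop path u-x-v that is strictly shorter. The weight assignment in `EDGES` is an assumption consistent with the slides, and the function names are hypothetical.

```python
# ASSUMPTION: weight-to-edge assignment chosen to match the slide labels.
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
         ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10}

def build_adjacency(edges):
    """Turn an undirected weighted edge list into a dict-of-dicts."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def first_order_semi_metric(edges):
    """Return edges that are beaten by a 2-hop path through a triangle."""
    adj = build_adjacency(edges)
    found = set()
    for (u, v), w in edges.items():
        # Only vertices adjacent to both endpoints close a triangle.
        common = adj[u].keys() & adj[v].keys()
        if any(adj[u][x] + adj[x][v] < w for x in common):
            found.add((u, v))
    return found

if __name__ == "__main__":
    print(first_order_semi_metric(EDGES))
```

On the assumed weights only CE is reported: AD has no common neighbor between its endpoints, so it cannot be detected by looking at triangles alone.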
Page 25
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles
   Scalable & practical for large graphs
   Most semi-metric edges have been removed
2. Identify metric edges in 2-hop paths
Page 29
EXAMPLE
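Phase 2
[Figure: the example graph after Phase 1; four edges are labeled M (metric) and one edge is still unlabeled (?)]
The lowest-weight edge of every vertex is metric
[Figure: edge (u, v) and surrounding edges, with weights 2, 4, 2, 1]
any indirect path from u to v would have larger weight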
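The lowest-weight rule can be sketched as below. It covers only that rule, which may be a subset of what Phase 2 does in the actual implementation, and it relies on strictly positive weights: any indirect path from u starts with an edge at least as heavy as the lightest one and then adds more weight. `AFTER_PHASE_1` uses an assumed weight assignment with edge CE already removed; all names are hypothetical.

```python
# ASSUMPTION: weight-to-edge assignment chosen to match the slide labels;
# the 1st-order semi-metric edge CE has already been removed by Phase 1.
AFTER_PHASE_1 = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
                 ("D", "E"): 2, ("A", "D"): 10}

def build_adjacency(edges):
    """Turn an undirected weighted edge list into a dict-of-dicts."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def lowest_weight_edges(edges):
    """Label as metric every edge that is the lightest at one endpoint."""
    adj = build_adjacency(edges)
    metric = set()
    for u, nbrs in adj.items():
        lowest = min(nbrs.values())
        for v, w in nbrs.items():
            if w == lowest:  # ties are all metric when weights are positive
                metric.add(tuple(sorted((u, v))))
    return metric

if __name__ == "__main__":
    metric = lowest_weight_edges(AFTER_PHASE_1)
    unlabeled = set(AFTER_PHASE_1) - metric
    print("metric:", sorted(metric))
    print("unlabeled:", sorted(unlabeled))
```

With the assumed weights, four edges are labeled metric and only AD stays unlabeled, which agrees with the four M labels and the single question mark in the figure.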
Page 32
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles!
   Scalable & practical for large graphs!
   Most semi-metric edges have been removed
2. Identify metric edges in 2-hop paths
3. Run a BFS for remaining unlabeled edges.
   1%-9% edges
Page 35
EXAMPLE
Phase 3
[Figure: the example graph with four edges labeled M (metric); a BFS is run for the remaining unlabeled edge]
BFS: explore paths with shorter distances only
If the BFS arrives at the target, the edge is semi-metric
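One way to read the Phase 3 rule is as a search from one endpoint that discards any partial path whose length already reaches the weight of the edge under test. The sketch below follows that reading on a single machine; it assumes strictly positive weights and the same assumed weight assignment as the example, and `is_semi_metric_bfs` is a hypothetical name.

```python
from collections import deque

# ASSUMPTION: weight-to-edge assignment chosen to match the slide labels;
# the 1st-order semi-metric edge CE has already been removed by Phase 1.
AFTER_PHASE_1 = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
                 ("D", "E"): 2, ("A", "D"): 10}

def build_adjacency(edges):
    """Turn an undirected weighted edge list into a dict-of-dicts."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def is_semi_metric_bfs(adj, u, v):
    """Search from u, exploring only paths shorter than the edge (u, v)."""
    limit = adj[u][v]
    best = {u: 0}          # best known distance from u to each visited vertex
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for nbr, w in adj[node].items():
            if node == u and nbr == v:
                continue   # do not use the direct edge itself
            d = best[node] + w
            if d >= limit:
                continue   # prune: this path is not shorter than the edge
            if nbr == v:
                return True  # reached the target on a shorter path
            if d < best.get(nbr, float("inf")):
                best[nbr] = d
                queue.append(nbr)
    return False

if __name__ == "__main__":
    adj = build_adjacency(AFTER_PHASE_1)
    print("AD semi-metric:", is_semi_metric_bfs(adj, "A", "D"))
```

The pruning keeps each search local when the edge under test is light, which is why it is applied only to the few edges that the first two phases leave unlabeled.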
Page 36
EXAMPLE
[Figure: the example graph reduced to its metric backbone, with edge weights 2, 3, 2, 1]
Metric Backbone
Page 37
DISTRIBUTED IMPLEMENTATION
code available: http://grafos.ml/okapi.html#analytics
Implementation in the vertex-centric model
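To give a feel for how Phase 1 might map onto the vertex-centric model, the sketch below simulates two supersteps in plain Python. It is not the Okapi code and does not use the Giraph API: the message format, the function name, and the weight assignment in `EDGES` are all assumptions made for illustration.

```python
# ASSUMPTION: weight-to-edge assignment chosen to match the slide labels.
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
         ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10}

def build_adjacency(edges):
    """Turn an undirected weighted edge list into a dict-of-dicts."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def vertex_centric_phase1(adj):
    """Simulate a two-superstep triangle check; returns (edges, #messages)."""
    # Superstep 0: every vertex sends its weighted neighbor list
    # to each of its neighbors (one message per edge direction).
    inbox = {vertex: [] for vertex in adj}
    messages = 0
    for sender, nbrs in adj.items():
        for receiver in nbrs:
            inbox[receiver].append((sender, dict(nbrs)))
            messages += 1
    # Superstep 1: vertex u looks for neighbors v that it shares with the
    # sender x, and compares the 2-hop path u-x-v against the edge (u, v).
    semi_metric = set()
    for u, received in inbox.items():
        for x, x_nbrs in received:
            for v, w_xv in x_nbrs.items():
                if v != u and v in adj[u] and adj[u][x] + w_xv < adj[u][v]:
                    semi_metric.add(tuple(sorted((u, v))))
    return semi_metric, messages

if __name__ == "__main__":
    found, sent = vertex_centric_phase1(build_adjacency(EDGES))
    print(found, sent)
```

The message count in this simulation is twice the number of edges, which matches the earlier point that the number of messages is proportional to the number of edges.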
Page 39
EVALUATION GOALS
• How does our algorithm compare to APSP?
• Are large, real-world graphs semi-metric?
• Can we improve graph analysis performance?
Page 43
COMPARISON TO APSP
Computing APSP in Giraph
• multiple SSSPs: in the order of months for million-edge graphs
• multiple MSSPs, i.e. SSSPs from several sources in parallel: in the order of days for million-edge graphs
Our algorithm is 120-180x faster than SSSP and 11-14x faster than MSSP: order of hours for million-edge graphs
Page 51
ALGORITHM PHASES
Phase 1: very fast and scalable; removes up to 90% of semi-metric edges
Phase 2: moderately fast; labels up to 60% of the unlabeled edges
Phase 3: slow; labels 1%-9% of the total edges
Phase 1 is the fastest and most useful phase
Page 54
PHASE 1 SCALABILITY
almost linear scalability
<200s on a billion-edge graph
Page 56
SEMI-METRICITY IN REAL GRAPHS
Graph         |V|     |E|     metric                    semi-metricity
Facebook      190M    49.9B   custom                    26.5%
Twitter       40M     1.5B    jaccard                   39%
Tuenti        12M     685M    jaccard                   59%
Livejournal   4.8M    34M     jaccard                   40%
NotreDame     0.3M    1.5M    jaccard, adamic           45%-29%
DBLP          318K    1M      jaccard, adamic           23%-9%
Twitter-ego   81K     1.7M    jaccard, adamic           57%-39%
Movielens     1.6K    1.9M    jaccard                   88%
Facebook      1K      143K    #messages, message size   78%-77%
US-Airports   0.5K    6K      #passengers               72%
C-Elegans     0.3K    2.3K    #connections              17%
% 1st-order semi-metric edges => reduction in memory and communication
Page 57
QUERY SPEEDUP ON NEO4J
6.7x speedup
Page 58
APACHE GIRAPH SPEEDUP
Including the time to calculate the backbone
4x speedup
Page 59
APACHE GIRAPH SPEEDUP
6x speedup
Page 60
COMMUNICATION REDUCTION
Up to 70% for highly semi-metric graphs
Page 61
BEST PRACTICES
When to use the backbone?
• semi-metric weighting schemes, e.g. neighborhood similarity
• we can amortize the overhead: e.g. many algorithms on the same graph, multiple distance queries
• lossy compression is ok
When not to use the backbone?
• for metric weighting schemes
• we need to run one-off analysis
• we need lossless compression
Page 62
RECAP: MAIN CONTRIBUTIONS
• An algorithm for computing the metric backbone without solving APSP
• An open-source distributed implementation
• Graph query and graph analytics speedup on Neo4j and Apache Giraph