Making Pull-Based Graph Processing Performant
Samuel Grossman, Heiner Litz, and Christos Kozyrakis. 2018. Making Pull-Based Graph Processing Performant. In PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 24–28, 2018, Vienna, Austria. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3178487.3178506
and initializes the frontier to contain just a single vertex. It
otherwise behaves the same way as Connected Components,
all the way down to the use of minimization as its aggrega-
tion operator [58]. The only effect of a difference in frontier
fullness is biasing the execution towards either push or pull.
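The way frontier fullness biases the engine choice can be sketched as a simple heuristic (the function name and the 5% threshold below are illustrative assumptions, not Grazelle's actual policy):

```python
def choose_engine(frontier_edges, total_edges, threshold=0.05):
    """Pick push or pull based on how full the frontier is.

    frontier_edges: number of edges incident to frontier vertices.
    total_edges: number of edges in the whole graph.
    """
    # A full frontier favors pull (iterate over all destinations and
    # read from sources); a sparse frontier favors push (iterate over
    # the few active sources and write to destinations).
    if frontier_edges > threshold * total_edges:
        return "pull"
    return "push"
```

Because Breadth-First Search starts from a single vertex while Connected Components starts with every vertex active, the same aggregation operator ends up running mostly on the push engine in one case and mostly on the pull engine in the other.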
6.1 Effectiveness of Scheduler Awareness
Scheduler awareness eliminates write conflicts and reduces
the number of write operations a pull engine needs to per-
form. We expect it to be at its most beneficial when writes
and conflicts are common. We therefore begin our analysis
with a detailed look at PageRank, followed by some insights
as to its impact on Connected Components. Because Breadth-
First Search performs one write per vertex and writes do not
conflict, it is unaffected by scheduler awareness.
We evaluate the effectiveness of scheduler awareness by
comparing two pull engine configurations: one parallelized
using a scheduler-aware interface (Listings 3, 4, 5, and 6) and
one parallelized using a traditional interface (Listing 2 with
the inner for changed to parallel_for and appropriate
atomics added). With the traditional interface, the proba-
bility of write conflicts depends on the number of threads
used, the degree of vertices processed, and the scheduler
granularity (i.e. the number of edges per chunk). Conversely,
the scheduler-aware interface must perform a merge oper-
ation at the end of the nested loop. The incurred overhead
depends on the scheduler granularity (i.e. the number of
chunks created).
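The contrast between the two interfaces can be sketched in simplified serial form (function names and data layout here are illustrative assumptions, not Grazelle's code). The traditional interface needs an atomic for every edge because a chunk boundary can split one destination's in-edges across two threads; the scheduler-aware interface accumulates per-chunk partial sums without synchronization and pays only a per-chunk merge:

```python
def pull_traditional(chunks, src_values, out):
    # Traditional parallel_for over edge chunks: every accumulation
    # must be atomic in a real parallel run, since two chunks may
    # target the same destination vertex.
    for chunk in chunks:                 # each chunk runs on its own thread
        for src, dst in chunk:
            out[dst] = out.get(dst, 0.0) + src_values[src]  # atomic in reality

def pull_scheduler_aware(chunks, src_values, out):
    # Each chunk accumulates into private storage with no atomics...
    partials = []
    for chunk in chunks:                 # each chunk runs on its own thread
        acc = {}
        for src, dst in chunk:
            acc[dst] = acc.get(dst, 0.0) + src_values[src]
        partials.append(acc)
    # ...and a merge step combines the partials at the end of the nested
    # loop. Only vertices whose in-edge lists straddle a chunk boundary
    # receive contributions from more than one partial.
    for acc in partials:
        for dst, partial_sum in acc.items():
            out[dst] = out.get(dst, 0.0) + partial_sum
```

The merge cost grows with the number of chunks, which is why the overhead of the scheduler-aware interface depends on scheduler granularity while its edge-processing loop does not.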
Figure 5 quantifies the performance impact of scheduler
awareness for PageRank using the 6 input graphs with a
fixed scheduler granularity of 1,000 edge vectors per chunk,
which roughly approximates the default maximum chunk
size Cilk Plus would use for any of these graphs [32]. For
reference, we also show results for the traditional approach
without synchronization, even though it leads to incorrect
output. Scheduler awareness is clearly beneficial across the
board, irrespective of graph dimensions. The largest benefit
is for uk-2007: the write conflicts with the traditional in-
terface are sufficiently prevalent that scheduler awareness
improves performance by 50×. For consistently low-degree graphs such as dimacs-usa, the speedup drops to as low as 15%; in these cases, scheduler awareness removes all synchronization but does not significantly reduce the actual number of writes performed.
Figure 6 quantifies the sensitivity of PageRank perfor-
mance to chunk size for three representative graphs. Per-
formance with the traditional interface is often strongly de-
pendent on the chunk size, particularly for scale-free graphs
with frequent high-degree vertices. The ideal chunk size is
graph-dependent, and simply switching to a large chunk size
is undesirable because doing so can lead to load imbalance.
Conversely, performance with the scheduler-aware interface
is largely insensitive to chunk size.
Figure 7 illustrates how scheduler awareness improves
multi-core scalability for the same graphs as in Figure 6 by
showing performance as we increase the number of active
physical cores and NUMA nodes. In each test involving mul-
tiple NUMA nodes, the number of active physical cores per
node is kept equal. The chunk size is selected for each graph
based on its result in Figure 6, with the goal of picking a
granularity that produces similar performance between the
two interfaces. All values are normalized to the performance
result of the traditional interface with a single thread. As ex-
pected, scheduler awareness is most effective for graphs with
greater numbers of vertices having high in-degree. In fact,
without scheduler awareness the performance of PageRank
on uk-2007 barely scales with increasing thread count. Nev-
ertheless, even low-degree graphs can benefit from scheduler
awareness, as reflected in the results for dimacs-usa.

We turn our attention now to Connected Components,
which has lower write intensity than PageRank. To isolate
the impact of the reduced write intensity, we present re-
sults for two versions: one implemented as described and
a modified version with higher write intensity. The latter
unconditionally writes values to vertex properties, even if
the value to be written is equal to the value already present.
Due to space limitations, Figure 8 presents only end-to-end
performance results using Grazelle’s default scheduling gran-
ularity (§5) on a single socket with all 28 logical cores active.
Despite its reduced write intensity, Connected Com-
ponents clearly benefits from scheduler awareness.
cit-Patents, for instance, exhibits a speedup of 40%.
Scheduler awareness is unsurprisingly more effective with
the modified version, resulting in a speedup of up to 2.4×.

In the worst case, there is a 3% slowdown for uk-2007 in
Figure 8b. This occurs because Grazelle’s default scheduling
granularity is coarse for this graph (approximately 1 million
vectors per chunk), meaning that the number of write
conflicts is quite small with the traditional interface.
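The distinction between the standard and modified Connected Components versions can be sketched as a simplified serial label-minimization step (names are illustrative): the standard version writes a vertex property only when the label actually changes, while the write-intense version writes unconditionally.

```python
def cc_pull_step(in_neighbors, labels, write_intense=False):
    """One pull-based label-minimization step.

    Returns the new labels and the number of vertex-property writes,
    which is the quantity the write-intense variant inflates.
    """
    new_labels = dict(labels)
    writes = 0
    for v, nbrs in in_neighbors.items():
        candidate = min([labels[u] for u in nbrs] + [labels[v]])
        if write_intense or candidate != labels[v]:
            new_labels[v] = candidate   # the memory write being counted
            writes += 1
    return new_labels, writes
```

Both variants converge to the same labels; only the number of writes, and hence the opportunity for scheduler awareness to help, differs.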
6.2 Effectiveness of Vector-Sparse
Vector-Sparse adds padding to the very compact Compressed-
Sparse layout in order to better support vectorization. In
other words, it trades off some compactness for performance.
Both the compactness loss and the performance gain depend
on the average packing efficiency of the edge vectors. Pack-
ing efficiency is the percentage of valid bits set per vector.
For a 4-element vector, it ranges from 25% (only one edge
is valid) to 100% (all four edges are valid). Figure 9a shows
the average edge vector packing efficiency across all 6 of
our real-world datasets. Figure 9b shows the same for a total
of 30 synthetic graphs generated with the R-MAT genera-
tor [11] included in X-Stream [55]. We show results with 4-,
8-, and 16-element vectors (256-, 512-, and 1024-bit vectors)
to evaluate the effectiveness of Vector-Sparse with current
and future processors. Many real-world graphs, including
both twitter-2010 and uk-2007, have an average degree
of at least 25, which leads to an average packing efficiency
of well over 90% for 4-element vectors and close to that
even for 8-element vectors. With 4-element vectors, packing
efficiency is at least 75% in nearly all cases, suggesting po-
tentially large benefits from vectorization. Unsurprisingly,
packing efficiency drops with wider vectors.
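Under the assumption that each edge vector holds in-edges of a single destination vertex (so only the last vector for a vertex can be partially full), average packing efficiency follows directly from the in-degree distribution. The sketch below is illustrative, not Grazelle's code:

```python
def avg_packing_efficiency(in_degrees, width=4):
    """Fraction of vector lanes holding valid edges, averaged over all
    edge vectors, when a vertex of in-degree d occupies ceil(d / width)
    vectors."""
    total_vectors = sum(-(-d // width) for d in in_degrees if d > 0)  # ceil
    total_edges = sum(d for d in in_degrees if d > 0)
    return total_edges / (total_vectors * width)
```

For example, under this model a graph whose vertices all have in-degree 1 yields 25% efficiency with 4-element vectors, while uniform in-degree 25 yields 25/28 ≈ 89%; skewed real-world degree distributions push the average higher still.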
Figure 5. Performance impact of scheduler awareness on PageRank with a scheduling granularity of 1,000 edge vectors per chunk, across the six input graphs (C, D, L, T, F, U). T = Traditional; T-NA = Traditional, Nonatomic; SA = Scheduler-Aware. (a) Execution time relative to the traditional interface (lower is better). (b) Execution time profile (work, merge, write, idle) for each scheduler interface.
Figure 6. Sensitivity of PageRank performance to chunk size (vectors per chunk) for (a) dimacs-usa, (b) twitter-2010, and (c) uk-2007. Horizontal axis is logarithmic. Granularities for uk-2007 are 10× those of the other graphs. Baselines are traditional-interface results for the smallest shown granularities. Lower is better.
Figure 7. Multi-core scaling of different scheduler interfaces with PageRank, from 0 to 56 physical cores (14 per socket), for (a) dimacs-usa at granularity 5,000, (b) twitter-2010 at granularity 5,000, and (c) uk-2007 at granularity 50,000. Represented as performance relative to that of the traditional interface run with a single thread. Higher is better.
We evaluate performance gains from vectorization with
4-element AVX vectors by comparing vectorized implemen-
tations of each phase with non-vectorized implementations
of the same. Our non-vectorized implementation of the Ver-
tex phase targets only a single vertex per iteration. In the
Edge phase, we disable vectorization by replacing vectorized
code, such as the vgatherqpd instruction, with versions that
process a single edge at a time.
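The contrast between the two Edge-Pull variants can be emulated in scalar code (an illustrative sketch; the real implementation uses AVX instructions such as vgatherqpd, and the edge-vector layout assumed here is a simplification):

```python
WIDTH = 4  # lanes per edge vector (256-bit AVX with 64-bit elements)

def edge_pull_scalar(edges, src_values, out):
    # Non-vectorized baseline: one edge processed per loop iteration.
    for src, dst in edges:
        out[dst] = out.get(dst, 0.0) + src_values[src]

def edge_pull_vector(edge_vectors, src_values, out):
    # Emulated vector loop: gather up to WIDTH source values at once
    # (the role vgatherqpd plays in hardware), mask out invalid lanes,
    # reduce across lanes, and accumulate once per vector.
    for dst, lanes in edge_vectors:   # lanes: WIDTH (src, valid) pairs
        gathered = [src_values[s] if valid else 0.0 for s, valid in lanes]
        out[dst] = out.get(dst, 0.0) + sum(gathered)
```

Both variants produce identical results; the hardware version wins because the gather, mask, and reduction happen in a handful of wide instructions rather than one loop iteration per edge.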
We show results separated by Grazelle phase when run-
ning PageRank (Figure 10a) and as end-to-end speedups
across the three applications (Figure 10b). Edge-Pull is clearly
the most responsive to vectorization, showing speedups of
approximately 2× irrespective of input. Edge-Push and Ver-
tex are largely unresponsive to vectorization, the former due
to the lack of AVX atomic-update-scatter instructions and
the latter because of memory bandwidth saturation. PageR-
ank benefits the most from vectorization because Grazelle
exclusively selects Edge-Pull for its execution. Benefits for
other applications depend on the extent to which they use
Edge-Pull, which in turn depends on the frontier size.
Figure 8. Performance impact of scheduler awareness on Connected Components with Grazelle's default scheduler granularity, for (a) the write-intense version and (b) the standard version, across the six input graphs (C, D, L, T, F, U). Shown as execution time relative to the traditional interface. Lower is better.
6.3 Comparison with Existing Graph Frameworks
We compare Grazelle to Ligra version 1.5, a July 2015 snapshot of Polymer, GraphMat version 1.0, and in-memory X-Stream version 1.0. Ligra's pull engine is the state of the art for a CPU-based implementation; Polymer is a NUMA-aware derivative of Ligra; and GraphMat has previously been cited as the best-performing framework [23, 61]. X-Stream is
unique in that it is an edge-centric framework: it creates
cache-sized streaming partitions from an unordered list of
edges and performs in-memory shuffle operations to ex-
change messages between them [55].
Per-application results are shown in Figures 11, 12, and 13.
Lower execution time is better. PageRank results are shown
individually for the push-based and pull-based engines of
Grazelle and Ligra; Polymer’s implementation exclusively
uses a push-based engine, and GraphMat does not contain
a pull-based engine. Connected Components and Breadth-
First Search results are shown for both Ligra and Ligra-Dense, a modified version of Ligra that maintains engine switching
functionality but uses only a dense frontier representation.
One of Ligra’s key optimizations is the use of both sparse and
dense frontier representations, a feature not implemented
in Grazelle, so we include Ligra-Dense results to facilitate a
fairer comparison.
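The representational trade-off behind Ligra-Dense can be sketched as follows (illustrative classes, not Ligra's actual code): a dense frontier is a per-vertex flag array whose traversal cost is O(|V|) no matter how few vertices are active, while a sparse frontier is an explicit vertex list whose traversal cost tracks the frontier size.

```python
class DenseFrontier:
    # One flag per vertex: cheap membership updates, but iterating
    # always scans all |V| flags.
    def __init__(self, num_vertices):
        self.flags = [False] * num_vertices

    def add(self, v):
        self.flags[v] = True

    def __iter__(self):
        return (v for v, active in enumerate(self.flags) if active)


class SparseFrontier:
    # Explicit list of active vertices: iteration cost is proportional
    # to the frontier size, which wins when the frontier is small.
    def __init__(self, num_vertices):
        self.members = []

    def add(self, v):
        self.members.append(v)

    def __iter__(self):
        return iter(self.members)
```

Ligra switches between the two depending on frontier fullness; Ligra-Dense always uses the flag-array form, matching Grazelle's dense-only frontier.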
With the exception of the small cit-Patents graph, Grazelle outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2×, 4.6×, 4.7×, and 66.8×, respectively, even in the presence of frontier optimizations.
Acknowledgements
We thank our anonymous reviewers and our shepherd,
Michelle Goodstein, for their feedback and assistance in
improving our paper. This work is supported by the Na-
tional Science Foundation (grant number SHF-1408911), the
Stanford Platform Lab, Samsung, and Huawei.
References
[1] Manuel Arenaz, Juan Touriño, and Ramón Doallo. 2004. An Inspector-
Executor Algorithm for Irregular Assignment Parallelization. In
ISPA '04. Springer Berlin Heidelberg, 4–15. https://doi.org/10.1007/978-3-540-30566-8_4
[2] Scott Beamer, Krste Asanović, and David A. Patterson. 2011. Searching for a parent instead of fighting over children: A fast breadth-first search implementation for Graph500. Technical Report. EECS Department, University of California, Berkeley.
[3] Scott Beamer, Krste Asanović, and David A. Patterson. 2012. Direction-
optimizing Breadth-First Search. In SC ’12. IEEE Computer Society,
1–10. https://dx.doi.org/10.1109/SC.2012.50
[4] Scott Beamer, Krste Asanović, and David A. Patterson. 2015. Locality
Exists in Graph Processing: Workload Characterization on an Ivy
Bridge Server. In IISWC ’15. IEEE, 56–65. https://dx.doi.org/10.1109/IISWC.2015.12
[5] Nathan Bell and Michael Garland. 2009. Implementing Sparse Matrix-
Vector Multiplication on Throughput-Oriented Processors. In SC '09. ACM, 18:1–18:11. https://dx.doi.org/10.1145/1654059.1654078
[6] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011.
Layered Label Propagation: A MultiResolution Coordinate-Free Order-
ing for Compressing Social Networks. In WWW ’11. ACM, 587–596.
[7] Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph Framework I:
Compression Techniques. In WWW ’04. ACM, 595–601.
[8] Aydın Buluç, Jeremy Fineman, Matteo Frigo, John Gilbert, and Charles
Leiserson. 2009. Parallel Sparse Matrix-Vector and Matrix-Transpose-
Vector Multiplication using Compressed Sparse Blocks. In SPAA '09. ACM, 233–244. https://doi.org/10.1145/1583991.1584053
[9] Aydın Buluç, Samuel Williams, Leonid Oliker, and James Demmel.
2011. Reduced-Bandwidth Multithreaded Algorithms for Sparse
Matrix-Vector Multiplication. In IPDPS ’11. IEEE, 721–733. https://doi.org/10.1109/IPDPS.2011.73
[10] Wei Cao, Lu Yao, Zongzhe Li, Yongxian Wang, and Zhenghua Wang.
2010. Implementing Sparse Matrix-Vector Multiplication using CUDA
based on a Hybrid Sparse Matrix Format. In ICCASM ’10. IEEE, V11–161–V11–165. https://dx.doi.org/10.1109/ICCASM.2010.5623237
[11] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004.
R-MAT: A Recursive Model for Graph Mining. In SDM ’04. SIAM.
[39] James LaGrone, Ayodunni Aribuki, Cody Addison, and Barbara Chap-
man. 2011. A Runtime Implementation of OpenMP Tasks. In IWOMP’11. Springer Berlin Heidelberg, 165–178. https://doi.org/10.1007/978-3-642-21487-5_13
[40] Daniel Langr and Pavel Tvrdík. 2016. Evaluation Criteria for Sparse Matrix Storage Formats. IEEE Transactions on Parallel and Distributed Systems 27, 2 (February 2016), 428–440. https://dx.doi.org/10.1109/TPDS.2015.2401575
[41] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. (2014).
[42] Lingda Li, Robel Geda, Ari B. Hayes, Yanhao Chen, Pranav Chaudhari, Eddy Z. Zhang, and Mario Szegedy. 2017. A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1, 1 (June 2017), 14:1–14:21. https://doi.org/10.1145/3084451
[43] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey.
2013. Efficient Sparse Matrix-Vector Multiplication on x86-Based
Many-Core Processors. In ICS ’13. ACM, 273–282. https://dx.doi.org/10.1145/2464996.2465013
[44] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo
Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A
Framework for Machine Learning and Data Mining in the Cloud. Proc.VLDB Endowment 5, 8 (April 2012), 716–727. https://dx.doi.org/10.14778/2212351.2212354
[45] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and
Jonathan Berry. 2007. Challenges in Parallel Graph Processing. ParallelProcessing Letters 17, 1 (March 2007), 5–20. http://www.worldscientific.com/doi/abs/10.1142/S0129626407002843
[46] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehn-
ert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel:
A System for Large-scale Graph Processing. In SIGMOD ’10. ACM,
135–146. https://dx.doi.org/10.1145/1807167.1807184
[47] María J. Martín, David E. Singh, Juan Touriño, and Francisco F. Rivera.
2002. Exploiting Locality in the Run-Time Parallelization of Irregular
Loops. In ICPP ’02. IEEE, 27–34. https://doi.org/10.1109/ICPP.2002.1040856
[48] Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. 2010.
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU
Architectures. In HiPEAC ’10. Springer Berlin Heidelberg, 111–125.
https://doi.org/10.1007/978-3-642-11515-8_10
[49] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A Light-
weight Infrastructure for Graph Analytics. In SOSP ’13. ACM, 456–471.
https://dx.doi.org/10.1145/2517349.2522739
[50] OpenMP ARB. 2016. OpenMP. http://www.openmp.org/. (2016).
[51] Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Li-
dong Zhou, and Maya Haridasan. 2012. Managing Large Graphs on
Multi-cores with Graph Awareness. In USENIX ATC ’12. USENIX, 41–52. http://dl.acm.org/citation.cfm?id=2342821.2342825
[52] Lawrence Rauchwerger and David A. Padua. 1999. The LRPD Test:
Speculative Run-Time Parallelization of Loops with Privatization and
Reduction Parallelization. IEEE Transactions on Parallel and DistributedSystems 10, 2 (February 1999), 160–180. https://doi.org/10.1109/71.752782
[53] Yutao Zhong, Maksim Orlovich, Xipeng Shen, and Chen Ding. 2004. Array Regrouping and Structure Splitting Using Whole-Program Reference Affinity. In PLDI '04. ACM, 255–266. https://dx.doi.org/10.1145/996841.996872
[54] Amitabha Roy, Laurent Bindschaedler, Jasmina Malicevic, and Willy
Zwaenepoel. 2015. Chaos: Scale-out Graph Processing from Sec-
ondary Storage. In SOSP ’15. ACM, 410–424. https://dx.doi.org/10.1145/2815400.2815408
[55] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream:
Edge-centric Graph Processing Using Streaming Partitions. In SOSP’13. ACM, 472–488. https://dx.doi.org/10.1145/2517349.2522740
[56] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. 1991. A Simple
Load Balancing Scheme for Task Allocation in Parallel Machines. In
SPAA '91. ACM, 237–245. https://dx.doi.org/10.1145/113379.113401
[57] Semih Salihoglu and Jennifer Widom. 2013. GPS: A Graph Processing
System. In SSDBM ’13. ACM, 22:1–22:12. https://dx.doi.org/10.1145/2484838.2484843
[58] Julian Shun and Guy E. Blelloch. 2013. Ligra: A Lightweight Graph
Processing Framework for Shared Memory. In PPoPP ’13. ACM, 135–
146. https://dx.doi.org/10.1145/2442516.2442530
[59] Michelle M. Strout, Larry Carter, and Jeanne Ferrante. 2001. Reschedul-
ing for Locality in Sparse Matrix Computations. In ICCS ’01. SpringerBerlin Heidelberg, 137–146. https://doi.org/10.1007/3-540-45545-0_23
[60] Jiawen Sun, Hans Vandierendonck, and Dimitrios S. Nikolopoulos.
2017. Accelerating Graph Analytics by Utilising the Memory Locality
of Graph Partitioning. In ICPP ’17. IEEE, 181–190. https://doi.org/10.1109/ICPP.2017.27
[61] Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Sub-
ramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi,
Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High Performance
Graph Analytics Made Productive. Proc. VLDB Endowment 8, 11 (July2015), 1214–1225. https://dx.doi.org/10.14778/2809974.2809983
[62] Leslie G. Valiant. 1990. A Bridging Model for Parallel Computation. Communications of the ACM 33, 8 (August 1990), 103–111.
[65] Ming Wu, Fan Yang, Jilong Xue, Wencong Xiao, Youshan Miao, Lan Wei, Haoxiang Lin, Yafei Dai, and Lidong Zhou. 2015. GraM: Scaling Graph Computation to the Trillions. In SoCC '15. ACM, 408–421. https://dx.doi.org/10.1145/2806777.2806849
[66] Chenning Xie, Rong Chen, Haibing Guan, Binyu Zang, and Haibo
Chen. 2015. SYNC or ASYNC: Time to Fuse for Distributed Graph-
Parallel Computation. In PPoPP ’15. ACM, 194–204. https://dx.doi.org/10.1145/2688500.2688508
[67] Kaiyuan Zhang, Rong Chen, and Haibo Chen. 2015. NUMA-aware
Graph-structured Analytics. In PPoPP ’15. ACM, 183–193. https://dx.doi.org/10.1145/2688500.2688507
[68] Mingxing Zhang, Yongwei Wu, Kang Chen, Xuehai Qian, Xue Li, and
Weimin Zheng. 2016. Exploring the Hidden Dimension in Graph
Processing. In OSDI ’16. USENIX, 285–300. https://www.usenix.org/node/199311
[69] Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E.
Priebe, and Alexander S. Szalay. 2015. FlashGraph: Processing Billion-
Node Graphs on an Array of Commodity SSDs. In FAST ’15. USENIX,45–58. https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
[70] Yutao Zhong, Xipeng Shen, and Chen Ding. 2009. Program Locality
Analysis using Reuse Distance. ACM Transactions on ProgrammingLanguages and Systems 31, 6 (August 2009), 20:1–20:39. https://dx.doi.org/10.1145/1552309.1552310
[71] Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph:
Large-Scale Graph Processing on a Single Machine Using 2-Level
Hierarchical Partitioning. In ATC ’15. USENIX, 375–386. https://www.usenix.org/node/190490
• fig10a-edgepull-base, fig10a-edgepull-vec, fig10a-edgepush-base, fig10a-edgepush-vec, fig10a-vertex-base, fig10a-vertex-vec: Configures Grazelle to run the per-phase vectorization performance tests (Figure 10a); "edgepull", "edgepush", and "vertex" respectively identify the phase of execution, and "base" and "vec" respectively identify the baseline and vectorized implementations.
• fig10b-pr-base, fig10b-pr-vec, fig10b-cc-base, fig10b-cc-vec, fig10b-bfs-base, fig10b-bfs-vec: Configures Grazelle to run the end-to-end application