High-Performance Partition-based and Broadcast-based Spatial Join on GPU-Accelerated Clusters

Simin You
Dept. of Computer Science, CUNY Graduate Center, New York, NY, USA
[email protected]

Jianting Zhang
Dept. of Computer Science, The City College of New York, New York, NY, USA
[email protected]

Le Gruenwald
Dept. of Computer Science, The University of Oklahoma, Norman, OK, USA
[email protected]

Abstract—The rapidly growing volumes of spatial data have brought significant challenges to developing high-performance spatial data processing techniques in parallel and distributed computing environments. Spatial joins are important data management techniques for gaining insights from large-scale geospatial data. While several distributed spatial join techniques based on symmetric spatial partitions have been implemented on top of existing Big Data systems, they are not capable of natively exploiting the massively data-parallel computing power provided by modern commodity Graphics Processing Units (GPUs). In this study, we have extended our distributed spatial join framework, originally designed for broadcast-based spatial joins, to partition-based spatial joins. Different from broadcast-based spatial joins, which require one input side of a spatial join to be a point dataset and the other side to be sufficiently small for broadcast, the new extension supports non-point spatial data on both sides of a spatial join and allows both sides to be large in volume while still benefiting from native parallel hardware acceleration. We empirically evaluate the performance of the proposed partition-based spatial join prototype system on both a workstation and Amazon EC2 GPU-accelerated clusters and demonstrate its high performance compared with the state-of-the-art. Our experiment results also empirically quantify the tradeoffs between partition-based and broadcast-based spatial joins using real data.

Keywords—Spatial Join, Partition-based, Broadcast-based, GPU, Distributed Computing

I. INTRODUCTION

Advances in sensing, modeling and navigation technologies and newly emerging applications, such as satellite imagery for Earth observation, environmental modeling for climate change studies and GPS data for location-dependent services, have generated large volumes of geospatial data. Very often multiple spatial datasets need to be joined to derive new information to support decision making. For example, for each pickup location of a taxi trip record, a spatial join can find the census block that it falls within. Time-varying statistics on the taxi trips originating and ending at the census blocks can potentially reveal travel and traffic patterns that are useful for city and traffic planning. As another example, for each polygon boundary of a US Census Bureau TIGER record, a polyline intersection based spatial join can find the river network (or linearwater) segments that it intersects. While traditional Spatial Databases and Geographical Information Systems (GIS) have provided decent support for small datasets, their performance is not acceptable when data volumes are large. It is thus desirable to use Cloud computing to speed up spatial join query processing on computer clusters.
As spatial joins are typically both data intensive and computing intensive, and Cloud computing facilities are increasingly equipped with modern multi-core CPUs, many-core Graphics Processing Units (GPUs) and large memory capacity^i, new Cloud computing techniques that are capable of effectively utilizing modern parallel and distributed platforms are both technically preferable and practically useful. Several pioneering Cloud-based spatial data management systems, such as HadoopGIS [1] and SpatialHadoop [2], have been developed on top of the Hadoop platform and have achieved impressive scalability. More recent developments, such as SpatialSpark [3], ISP-MC+ and ISP-GPU [4], are built on top of in-memory systems, such as Apache Spark [5] and Cloudera Impala [6], respectively, with demonstrable efficiency and scalability. We refer to Section II for more discussion on the distributed spatial join techniques and the respective research prototype systems.

Different from HadoopGIS and SpatialHadoop, which perform spatial partitioning before globally and locally joining spatial data items (i.e., partition-based spatial join), ISP-MC+ and ISP-GPU are largely designed for broadcast-based spatial join, where the dataset on one side (assuming the right side) of a spatial join is broadcast to the partitions of the other side (assuming the left side), which is a point dataset (not necessarily spatially partitioned), for local joins. While SpatialSpark supports both broadcast-based and partition-based spatial join (in-memory), when both sides of a spatial join are large in volume, broadcast-based spatial joins require significantly more memory capacity and are more prone to failures (e.g., due to out-of-memory issues). This makes partition-based spatial join on SpatialSpark a more robust choice. We note that partition-based spatial joins typically require reorganizing data according to partitions either on external storage (HDFS for HadoopGIS and SpatialHadoop) or in memory (SpatialSpark) through additional steps. Our previous work has demonstrated that SpatialSpark is significantly faster than HadoopGIS and SpatialHadoop for partition-based spatial joins [7]. ISP-MC+ and ISP-GPU, which have exploited native multi-core CPU and GPU parallel computing power, are additionally
[...]
memory, given the quite different cache configurations on
typical GPUs and CPUs.
As balanced workload and coalesced global
memory access are crucial in exploiting the parallel
computing power on GPUs, we have developed an efficient
data-parallel design on GPUs to achieve high performance.
First, we minimize unbalanced workload by applying
parallelization at line segment level rather than at polyline
level. Second, we maximize coalesced global memory
accesses by laying out line segments from the same
polyline consecutively on GPU memory and letting each
GPU thread process a line segment. Figure 5 lists the
kernel of the data parallel design of polyline intersection
test on GPUs and more details are explained below.
In our design, each polyline pair intersection test is assigned to a GPU computing block, utilizing the GPU hardware scheduling capability to avoid the unbalanced workload created by variable polyline sizes. Within a computing block, all threads are used to check line segment intersections in parallel. Since each thread performs an intersection test on two line segments, where each
segment has exactly two endpoints, the workload within a
computing block is perfectly balanced. The actual
implementation for real polyline data is a little more
complex as a polyline may contain multiple linestrings
which may not be contiguous in the polyline vertex array. The
problem can be solved by recording the offsets of the
linestrings in the vertex array of a polyline and using the
offsets to locate vertices of line segments of the linestrings
to be tested.
Lines 1-3 of the kernel in Figure 5 retrieve the positions of the non-contiguous linestrings, followed by two loops in Lines 5 and 6. For each pair of linestrings, all threads of a block retrieve line segments in pairs and test them for intersection (Lines 10-17). We designate a shared
variable, intersected, to indicate whether there is any pair
of line segments intersected for the polyline pair. Once a
segment pair intersects, the intersected variable is set to
true and becomes visible to all threads within the thread
block. The whole thread block then immediately terminates
(Lines 18-20). When the thread block returns, the GPU
hardware scheduler can schedule another polyline pair on a
new thread block. Since there is no synchronization among
thread blocks, there will be no penalty even though
unbalanced workloads are assigned to blocks.
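To make the design concrete, the following is a minimal CUDA sketch of such a kernel. It is our own illustration, not the exact code of Figure 5: the names are hypothetical, each polyline is treated as a single linestring (the real kernel additionally loops over linestring offsets as described above), and collinear touching cases are omitted from the segment test.

#include <cuda_runtime.h>

// Orientation helper: cross product of (p - o) and (q - o).
__device__ float cross2(float2 o, float2 p, float2 q) {
    return (p.x - o.x) * (q.y - o.y) - (p.y - o.y) * (q.x - o.x);
}

// Proper segment-segment intersection test (collinear/touching cases
// omitted for brevity).
__device__ bool segments_intersect(float2 a, float2 b, float2 c, float2 d) {
    float d1 = cross2(c, d, a), d2 = cross2(c, d, b);
    float d3 = cross2(a, b, c), d4 = cross2(a, b, d);
    return ((d1 > 0.f) != (d2 > 0.f)) && ((d3 > 0.f) != (d4 > 0.f));
}

// One thread block per candidate polyline pair; threads stride over the
// segment-pair space and exit early once any intersection is found.
__global__ void polyline_pair_kernel(const float2 *v1, const int *off1,
                                     const float2 *v2, const int *off2,
                                     const int2 *pairs, bool *out) {
    __shared__ volatile bool intersected;   // visible to the whole block
    if (threadIdx.x == 0) intersected = false;
    __syncthreads();

    int2 p = pairs[blockIdx.x];
    int s1 = off1[p.x], n1 = off1[p.x + 1] - s1 - 1;  // #segments, left
    int s2 = off2[p.y], n2 = off2[p.y + 1] - s2 - 1;  // #segments, right

    // Consecutive threads read consecutive segments laid out consecutively
    // in GPU memory, giving coalesced global memory accesses.
    for (long long k = threadIdx.x; k < (long long)n1 * n2 && !intersected;
         k += blockDim.x) {
        int i = s1 + (int)(k % n1), j = s2 + (int)(k / n1);
        if (segments_intersect(v1[i], v1[i + 1], v2[j], v2[j + 1]))
            intersected = true;             // ends every thread's loop
    }
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = intersected;
}

Launching one block per candidate pair, e.g., polyline_pair_kernel<<<num_pairs, 256>>>(...), leaves the balancing of variable-size pairs to the hardware block scheduler, as described above.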
Figure 6 illustrates the design of polyline
intersection for both multi-core CPUs and GPUs. After the
filter phase, candidate pairs are generated based on MBRs
of polylines. As we mentioned previously, a pair of
polylines can be assigned either to a CPU thread (multi-
core CPU implementation) for iterative processing by
looping through line segments or to a GPU thread block
(GPU implementation) for parallel processing. While GPUs
typically have hardware schedulers to automatically
schedule multiple thread blocks on a GPU, explicit
parallelization on polylines across multiple CPU cores is
needed. While we use OpenMP with dynamic scheduling
for this purpose in this study, other parallel libraries on
multi-core CPUs, such as Intel TBB, may achieve better
performance by utilizing more complex load balancing
algorithms. In both multi-core CPU and GPU
implementations, we have exploited native parallel
programming tools to achieve higher performance based on a
shared-memory parallel computing model. This is different
from executing Mapper functions in Hadoop where each
Mapper function is assigned to a CPU core and no resource
sharing is allowed among CPU cores.
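As an illustration, the multi-core CPU side needs only a few lines of OpenMP. The sketch below uses assumed names; polylines_intersect stands in for the sequential segment-looping test described above and is not the actual API of our module.

#include <cstddef>
#include <utility>
#include <vector>

// Assumed helper from the local geometry module: sequentially loops over
// all segment pairs of two polylines, returning early on the first hit.
bool polylines_intersect(int left_id, int right_id);

// Each candidate pair from the filter phase is refined by one CPU thread;
// schedule(dynamic) lets OpenMP rebalance the variable per-pair workload.
void refine_pairs(const std::vector<std::pair<int, int>> &cand_pairs,
                  std::vector<char> &results) {  // char rather than bool to
                                                 // avoid racy packed-bit writes
    #pragma omp parallel for schedule(dynamic)
    for (long k = 0; k < (long)cand_pairs.size(); ++k)
        results[k] = polylines_intersect(cand_pairs[k].first,
                                         cand_pairs[k].second);
}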
V. EXPERIMENTS AND RESULTS
A. Experiment Setup
In order to conduct the performance study, we have prepared real-world datasets for two experiments, which are also publicly accessible to facilitate independent evaluations
on different parallel and/or distributed platforms. The first
experiment is designed to evaluate point-in-polygon test
based spatial join, which uses pickup locations from New York City taxi trip data in 2013^x (referred to as taxi) and New York City census blocks^xi (referred to as nycb). The second
experiment is designed to evaluate polyline intersection
based spatial join using two datasets provided by
SpatialHadoop^xii, namely TIGER edge and USGS
linearwater. For all four datasets, only geometries are used from the original datasets for experiment purposes, and their specifications are listed in Table 1. We also apply
appropriate preprocessing on the datasets for running on
different systems. For SpatialHadoop, we use its R-tree
indexing module and leave other parameters at their defaults. For
our system, all the datasets are partitioned using Sort-Tile partitioning (256 tiles for taxi, 16 tiles for nycb, 256 tiles for both edge and linearwater) for partition-based spatial joins, as sketched below. Note that the datasets do not need pre-partitioning for broadcast-based spatial joins.
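For reference, a minimal STR-style sketch of Sort-Tile partitioning is given below (our own formulation with illustrative names; the actual partitioning module may differ in details): records are sorted by the x-centers of their MBRs and cut into vertical slabs, then each slab is sorted by y-center and cut into tiles.

#include <algorithm>
#include <cmath>
#include <vector>

struct MBR { float xmin, ymin, xmax, ymax; };

// STR-style Sort-Tile partitioning sketch: returns a tile id per record.
std::vector<int> sort_tile_partition(const std::vector<MBR> &recs, int tiles) {
    int s = (int)std::ceil(std::sqrt((double)tiles)); // slabs per axis
    size_t n = recs.size();
    std::vector<size_t> idx(n);
    for (size_t i = 0; i < n; ++i) idx[i] = i;

    auto cx = [&](size_t i) { return 0.5f * (recs[i].xmin + recs[i].xmax); };
    auto cy = [&](size_t i) { return 0.5f * (recs[i].ymin + recs[i].ymax); };

    // Sort all records by x-center and cut into s vertical slabs.
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return cx(a) < cx(b); });

    std::vector<int> tile(n);
    size_t slab = (n + s - 1) / s;
    for (int v = 0; v < s; ++v) {
        size_t lo = v * slab, hi = std::min(n, lo + slab);
        if (lo >= n) break;
        // Within each slab, sort by y-center and cut into s tiles.
        std::sort(idx.begin() + lo, idx.begin() + hi,
                  [&](size_t a, size_t b) { return cy(a) < cy(b); });
        size_t cell = (hi - lo + s - 1) / s;
        for (size_t k = lo; k < hi; ++k)
            tile[idx[k]] = v * s + (int)((k - lo) / cell);
    }
    return tile;
}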
We have prepared several hardware configurations
for experiment purposes.

[Figure 6: Data-Parallel Polyline Intersection Design. Candidate pairs generated by spatial filtering based on MBRs are assigned either to a CPU thread, which loops through segment pairs one segment per iteration, or to a GPU block, which tests segment pairs in parallel, checks at each iteration, and terminates as soon as any segment pair intersects, setting the result for the pair and scheduling the next pair.]

The first configuration is a single-node cluster with a workstation that has dual 8-core CPUs
at 2.6 GHz (16 physical cores in total) and 128 GB
memory. The large memory capacity makes it possible to
experiment spatial joins that require significant amount of
memory. The workstation is also equipped with an Nvidia
GTX Titan GPU with 2,688 cores and 6GB memory.
Another configuration is a 10-node Amazon EC2 cluster, in which each node is a g2.2xlarge instance with 8 vCPUs and 15 GB memory; it is used for scalability tests.
Each EC2 instance has an Nvidia GPU with 1,568 cores
and 4GB memory. We vary the number of nodes for
scalability tests and term the configurations EC2-X, where
X denotes the number of nodes in the cluster. Both clusters
are installed with Cloudera CDH-5.2.0 to run
SpatialHadoop (version 2.3) and SpatialSpark (with Spark
version 1.1).
Table 1 Experiment Dataset Sizes and Volumes

Dataset      # of Records  Size
Taxi         169,720,892   6.9 GB
Nycb         38,839        19 MB
Linearwater  5,857,442     8.4 GB
Edge         72,729,686    23.8 GB
B. Results of Polyline Intersection Performance on
Standalone Machines
We first evaluate our polyline intersection designs using
edge and linearwater datasets on both multi-core CPUs and
GPUs on our workstation and a g2.2xlarge instance without
involving distributed computing infrastructure. As the
polyline intersection time dominates the end-to-end time in
this experiment, the performance can be used to evaluate the efficiency of the proposed polyline intersection technique on both multi-core CPUs and GPUs. The results
are plotted in Figure 7, where CPU-Thread and GPU-Block refer to the implementations of the proposed design, i.e., assigning a matched polyline pair to a CPU thread and a GPU computing block, respectively. Note that the data transfer time between CPUs and GPUs is included when reporting GPU performance.
For GPU-Block, the approximately 50% higher
performance on the workstation than on the single EC2
instance shown in Figure 7 represents a combined effect of
about 75% more GPU cores and comparable memory
bandwidth when comparing the GPU on the workstation
and the EC2 instance. For CPU-Thread, the 2.4X better
performance on the workstation than that on the EC2
instance reflect the facts that the workstation has 16 CPU
cores while the EC2 instance has 8 virtualized CPUs, in
addition to Cloud virtualization overheads. While the GPU
is only able to achieve 20% higher performance than CPUs
on our high-end workstation, the results show 2.6X
speedup on the EC2 instance where both the CPU and the GPU
are less powerful. Note that the reported low GPU speedup
on the workstation represents the high efficiency of our polyline intersection test technique on both CPUs and GPUs. While it is not our intention to compare our data-parallel polyline intersection test implementations with those that have been implemented in GEOS and JTS, we have observed orders of magnitude of speedups. As
reported in the next subsection, the high efficiency of the geometry API is actually the key reason that our system significantly outperforms SpatialHadoop, which uses JTS for its geometry APIs, in partition-based spatial joins.
[Figure 7: Polyline Intersection Performance (in seconds) in the edge-linearwater experiment]
C. Results of Distributed Partition-Based Spatial
Joins
The end-to-end runtimes (in seconds) for the two
experiments (taxi-nycb and edge-linearwater) under the
four configurations (WS, EC2-10, EC2-8 and EC2-6) on
the three systems (SpatialHadoop, LDE-MC+ and LDE-
GPU) are listed in Table 2. The workstation (denoted as
WS) here is configured as a single-node cluster and is
subject to distributed infrastructure overheads. LDE-
MC+ and LDE-GPU denote the proposed distributed
computing system using multi-core CPUs and GPUs,
respectively. The runtimes of the three systems include spatial join times only; the indexing time for the two input datasets is excluded. The taxi-nycb experiment uses point-in-polygon test based spatial join and the edge-linearwater experiment uses polyline intersection based spatial join.
From Table 2 we can see that, compared with SpatialHadoop, the LDE implementations on both multi-
core CPUs and GPUs are at least an order of magnitude
faster for all configurations. The efficiency is due to several
factors. First, the specialized LDE framework is a C++
based implementation which can be more efficient than
general purpose JVM based frameworks such as Hadoop
(on which SpatialHadoop is based). The in-memory processing of LDE is also an important factor, whereas Hadoop is mainly a disk-based system. With in-memory processing, intermediate results do not need to be written to external disks, which is very expensive. Second, as mentioned earlier, the
dedicated local parallel spatial join module can fully exploit
parallel and SIMD computing power within a single
computing node. Our data-parallel designs in the module,
including both spatial filter and refinement steps, can effectively utilize the current generation of hardware, including
multi-core CPUs and GPUs. From a scalability perspective,
the LDE engine has achieved reasonable scalability. When
the number of EC2 instances is increased from 6 to 10
(1.67X), the speedups vary from 1.39X to 1.64X (e.g., in Table 2, LDE-GPU on edge-linearwater drops from 135 s on EC2-6 to 97 s on EC2-10, a 1.39X speedup, while LDE-MC+ drops from 360 s to 219 s, a 1.64X speedup). The GPU
implementations can further achieve 2-3X speedups over
the multi-core CPU implementations, which is desirable for clusters equipped with low-profile CPUs.
Table 2 Partition-based Spatial Join Runtimes (s)

Experiment        System         WS    EC2-10  EC2-8  EC2-6
taxi-nycb         SpatialHadoop  1950  1282    1315   2099
                  LDE-MC+        191   39      50     63
                  LDE-GPU        111   19      23     30
edge-linearwater  SpatialHadoop  9887  3886    5613   6915
                  LDE-MC+        554   219     260    360
                  LDE-GPU        437   97      114    135
D. Results of Broadcast-Based Distributed Spatial
Joins
In addition to comparing the performance of the partition-
based spatial join among SpatialHadoop, LDE-MC+ and
LDE-GPU in both taxi-nycb and edge-linearwater
experiments, we have also compared the performance of
broadcast-based spatial join with partition-based spatial
join using the taxi-nycb experiment. The edge-linearwater experiment cannot use broadcast-based join due to the memory constraint discussed earlier. From the results presented in Table 3, it can be seen that the LDE framework outperforms all other systems, including ISP, which is expected due to its lightweight infrastructure overhead by design.
Table 3 Broadcast-based Spatial Join Runtimes (s)

Experiment  System        WS   EC2-10  EC2-8  EC2-6
taxi-nycb   SpatialSpark  355  101     108    144
            ISP-MC+       130  36      44     54
            ISP-GPU       96   21      27     34
            LDE-MC+       119  22      25     31
            LDE-GPU       50   12      15     16
Comparing the runtimes in Table 3 with Table 2
for the same taxi-nycb experiment, we can observe that the
broadcast-based spatial join is much faster (up to 2X) than
partition-based spatial join using the LDE engine, even
without including the overhead of preprocessing in
partition-based spatial join. The results support our discussions in Section III. This suggests that broadcast-based spatial join should be preferred wherever possible.
When native parallelization tools are not available, the
broadcast-based spatial join implemented in SpatialSpark
can be an attractive alternative, which outperforms
SpatialHadoop by 5.5X under the WS configuration and 12-14.5X under the three EC2 configurations.
VI. CONCLUSIONS AND FUTURE WORK
In this study, we have designed and implemented partition-
based spatial join on top of our lightweight distributed
processing engine. By integrating distributed processing
and parallel spatial join techniques on GPUs within a single
node, our proposed system can perform large-scale spatial
join effectively and achieve much higher performance than
the state-of-the-art. Experiments comparing the
performance of partition-based and broadcast-based spatial
joins suggest that broadcast-based spatial join techniques
can be more efficient when joining a point dataset and a
relatively small spatial dataset that is suitable for broadcast.
As for future work, we plan to further improve the single-node local parallel spatial join module by adding more spatial operators with efficient data-parallel designs. We
also plan to develop a scheduling optimizer for the system
that can perform selectivity estimation to help dynamic
scheduling to achieve higher performance.
ACKNOWLEDGEMENT
This work is supported through NSF Grants IIS-1302423
and IIS-1302439.
REFERENCES

1. A. Aji, F. Wang, et al. (2013). Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. In VLDB, 6(11), pages 1009-1020.
2. A. Eldawy and M. F. Mokbel (2015). SpatialHadoop: A MapReduce Framework for Spatial Data. In Proc. IEEE ICDE'15.
3. S. You, J. Zhang and L. Gruenwald (2015). Large-Scale Spatial Join Query Processing in Cloud. In Proc. IEEE CloudDM'15.
4. S. You, J. Zhang and L. Gruenwald (2015). Scalable and Efficient Spatial Data Management on Multi-Core CPU and GPU Clusters: A Preliminary Implementation based on Impala. In Proc. IEEE HardBD'15.
5. M. Zaharia, M. Chowdhury, et al. (2010). Spark: Cluster Computing with Working Sets. In Proc. HotCloud.
6. M. Kornacker, et al. (2015). Impala: A modern, open-source SQL engine for Hadoop. In Proc. CIDR'15.
7. S. You, J. Zhang and L. Gruenwald (2015). Spatial Join Query Processing in Cloud: Analyzing Design Choices and Performance Comparisons. To appear in Proc. IEEE HPC4BD. Online at http://www-cs.ccny.cuny.edu/~jzhang/papers/sjc_compare_tr.pdf
8. J. Zhang, S. You and L. Gruenwald (2015). A Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing. In Proc. IEEE Big Data Congress'15.
9. E. H. Jacox and H. Samet (2007). Spatial Join Techniques. ACM Trans. Database Syst., vol. 32, no. 1, Article #7.
10. A. Aji, G. Teodoro and F. Wang (2014). Haggis: turbocharge a MapReduce based spatial data warehousing system with GPU engine. In Proc. ACM BigSpatial'14.
11. H. Vo, A. Aji and F. Wang (2014). SATO: a spatial data partitioning framework for scalable query processing. In Proc. ACM GIS'14.
12. J. Zhang and S. You (2012). Speeding up large-scale point-in-polygon test based spatial join on GPUs. In Proc. ACM BigSpatial, 23-32.
i. http://aws.amazon.com/ec2/instance-types/
ii. http://trac.osgeo.org/geos/
iii. http://www.vividsolutions.com/jts/JTSHome.htm
iv. http://hadoop.apache.org/docs/r1.2.1/streaming.html
v. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
vi. http://openmp.org/wp/
vii. https://www.threadingbuildingblocks.org/
viii. http://www.nvidia.com/object/cuda_home_new.html
ix. https://en.wikipedia.org/wiki/Well-known_text
x. http://chriswhong.com/open-data/foil_nyc_taxi/
xi. http://www.nyc.gov/html/dcp/html/bytes/applbyte.shtml
xii. http://spatialhadoop.cs.umn.edu/datasets.html