International Journal of Information Technology Vol. 22 No. 1 2016

GPU SQL Query Accelerator

Keh Kok Yong, Hong Hoe Ong
Accelerative Technology Lab, MIMOS Berhad, Kuala Lumpur, Malaysia
[email protected], [email protected]

Vooi Voon Yap
Department of Electronic Engineering, Universiti Tunku Abdul Rahman, Perak, Malaysia
[email protected]

Abstract

The world is rapidly filling with connected sensors and devices that use geo-location capabilities to update their locations. Data analytics industries are finding ways to store this data and to turn the raw data into valuable information for business intelligence services. This has inadvertently created a flood of granular data about our world. Crucially, this data flood has outpaced traditional compute capabilities to process and analyze it. It thus reveals potential economic benefits and has become a compelling new research area requiring sophisticated mechanisms and technologies to meet the demand. Over the past decade, there have been attempts to use accelerators along with multicore CPUs to boost large-scale data computation. We propose an emerging SQL-like query accelerator, Mi-Galactica. In addition, we extend our system by offloading geo-spatial computation to GPU devices. Query operations execute in parallel, drawing support from high-performance, energy-efficient NVIDIA Tesla technology. Our results show a significant speedup.

Keywords: Geospatial, Graphics Processing Units, Database Query Processing, Big Data, Cloud
A wealth of research on query-related parallel algorithms has led to the development of GPU databases. Red Fox executes relational operators in a parallel manner on the GPU [20]. The works in [21], [22] investigate GPU acceleration of indexing, scan and search operations; [23] examines the important computational building blocks of aggregation; and [24] focuses on optimizing GPU sort. These studies have significantly raised awareness of using GPUs in big data analytics businesses. It is our belief that GPUs can benefit query processing and be widely deployed for big data analytics in database systems. Such a GPU query accelerator has to be carefully designed with parallel data structures and must harmonize processing between the CPU and the GPU.
III. Graphics Processing Unit
In this section, we first discuss the background of GPUs and introduce NVIDIA's Kepler architecture. Next, we describe how threads and blocks work in the Kepler architecture. Finally, we discuss its memory hierarchy.
A. Background
GPUs first gained popularity with the rise of 3D gaming in the mid-1990s. The demand for ever more powerful and energy-efficient GPUs has been increasing ever since. The growing computational power of GPUs has attracted many researchers to use them for more general-purpose computing. NVIDIA realized the potential of GPUs for general computing and released CUDA (Compute Unified Device Architecture) in 2006 so that the research community could leverage the power of the large number of streaming processors in GPUs. GPUs nowadays power a large range of industries, from supercomputers to embedded systems.
The latest NVIDIA GPU architecture, Tesla Maxwell, was introduced in Q3 2015; these new cards target the deep learning sector. This paper, however, is based on the Kepler architecture, which includes many improvements over its predecessor, Fermi. With this architecture, a single GPU die can contain up to 2880 CUDA cores. The Kepler architecture also introduced new features such as Dynamic Parallelism, Hyper-Q, the Grid Management Unit, and NVIDIA GPUDirect. It contains enhanced memory subsystems offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation. The principal design goal of the Kepler architecture, improved power efficiency, has been met through these new features.
B. Grids, Blocks, Threads and Warps
The CUDA programming model introduces the concepts of threads, blocks, and grids, which run GPU code called kernels. These threads, blocks, and grids then run on multiple SMXs (streaming multiprocessors) in the GPU in groups of warps. Figure 1 shows examples of threads, blocks, and grids. From a programmer's perspective, one only needs to handle the thread, block, and grid assignments and the kernel programming, while the hardware manages how all the threads, blocks, and grids are mapped onto the SMXs and warps.
In CUDA, all the threads in the same grid execute the same kernel function, but each thread mostly handles different data. This type of programming model is known as Single Instruction Multiple Data (SIMD). With the Kepler architecture, a block can consist of up to 1024 threads in total, arranged across the x, y, and z dimensions, and the maximum number of blocks in the x dimension of a grid can go up to 2^31 - 1.
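As an illustration of this model (our own minimal sketch, not code from the paper), each thread computes a global index from its block and thread coordinates and processes one array element:

```cuda
#include <cuda_runtime.h>

// Minimal sketch: each thread derives a unique global index from its
// block and thread coordinates and scales one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid size
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```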
Previously, on the Fermi architecture, once a kernel had been launched, its dimensions could not be changed. The Kepler architecture allows the programmer to launch another set of grids and blocks from within a kernel, which enables a more flexible programming model. This feature is called Dynamic Parallelism.
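As a hedged sketch of Dynamic Parallelism (the kernel names are our own; this requires a GK110-class device of compute capability 3.5 or higher and compilation with -rdc=true):

```cuda
// Sketch: a parent kernel launches a child grid directly on the device,
// with no round-trip to the CPU.
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void parent(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Device-side launch: grid dimensions are chosen at run time.
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```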
Keh Kok Yong, Hong Hoe Ong and Vooi Voon Yap
A warp is the unit of thread scheduling in the SMXs. Once a block is assigned to an SMX, it is divided into warps, each consisting of 32 threads in the Kepler architecture. Every thread in a warp runs in parallel, executing the same line of code. To increase the efficiency of the warps, we should avoid branch divergence as much as possible. Branch divergence occurs when threads inside a warp branch into different execution paths.
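To illustrate divergence (our own example, not the paper's), the first kernel below splits a warp across two paths, while the second replaces the branch with arithmetic that typically compiles to a predicated select:

```cuda
// Divergent: odd and even threads in the same warp take different paths,
// so the warp executes both branches serially.
__global__ void divergent(int *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) out[i] = i * 2;
    else            out[i] = i * 3;
}

// Converged: a data-dependent value replaces the control-flow split, so
// all 32 threads of the warp execute the same instructions.
__global__ void converged(int *out) {
    int i = threadIdx.x;
    int m = (i % 2 == 0) ? 2 : 3;  // usually a predicated select, not a branch
    out[i] = i * m;
}
```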
Figure 1: Thread, Block and Grids (panels (i)-(iii) show threads, blocks, and Grids 0 and 1)
Figure 2: Hierarchy of GPU memory (Level 1: registers; Level 2: Shared Memory, L1 cache and read-only data cache shared by 192 CUDA cores; Level 3: L2 cache; Level 4: DRAM)
C. Memory Hierarchy
There are four levels of memory hierarchy in NVIDIA GPUs, as shown in Figure 2. The first level is register memory. Registers are local to the CUDA cores and have a total size of 64 KB; they are the fastest memory among all the memory types in the SMX. The second level consists of the Shared Memory, the L1 cache and the read-only data cache. These memories are located very near the SMX core and are shared among the 192 CUDA cores in the SMX. The Shared Memory is usually used for communication among the different threads of a block. The third level is the L2 cache, and finally, the fourth level is the DRAM, which serves as the main storage on the GPU and is used to send and read data in bulk from the CPU's memory.
In the Fermi architecture, the Shared Memory and L1 cache can be configured as 48 KB of Shared Memory with 16 KB of L1 cache, or vice versa. The Kepler architecture allows additional flexibility by permitting a 32 KB / 32 KB split between Shared Memory and L1 cache. The read-only data cache is also new in the Kepler architecture. Previously, programmers would use the texture unit to cache read-only data, but this method had many limitations. The benefit of the read-only data cache is that it takes its load footprint off the Shared/L1 cache path, and it supports full-speed unaligned memory access patterns.
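As a hedged sketch of using these Kepler features (our own example, using standard CUDA runtime calls): loads can be routed through the read-only data cache with the __ldg() intrinsic (or const __restrict__ pointers), and the 32 KB / 32 KB split can be requested per kernel:

```cuda
#include <cuda_runtime.h>

// Sketch: read input through the read-only data cache (GK110, CC >= 3.5).
__global__ void addOne(const float * __restrict__ in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]) + 1.0f;  // load via the read-only cache path
}

void configure() {
    // Request the equal Shared Memory / L1 split introduced by Kepler.
    cudaFuncSetCacheConfig(addOne, cudaFuncCachePreferEqual);
}
```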
IV. Implementation
A. Overview of Mi-Galactica
Mi-Galactica is a SQL-like query accelerator. The system is formed from four major components: the Connector, Preprocessor, Scheduler and Query Engine, as shown in Figure 3. The Connector enables Mi-Galactica to communicate with PostgreSQL and MySQL; it performs frontend application interaction, data extraction and data interchange, and it also supports processing of comma-separated values (CSV) files. The Scheduler is an internal task engine for managing user workloads. The Query Engine carries out the stages of query analysis, performing the basic parsing and positioning operations and then producing an execution query plan. The plan is further adjusted by analyzing and tracing parallelizable points and rearranging the execution order of clause objects. The Mi-Galactica execution engine then performs the accelerated query execution on either the CPU or the GPU. Source data in the database, meanwhile, is transformed and output to a parallel, columnar storage system. These components are designed to run on energy-efficient commodity GPU accelerators, giving the system the power to tackle big data challenges.
Mi-Galactica adopts lessons from previous studies [17], [18], [19], [20] and [21] on query co-processing of heterogeneous workloads. Figure 4 shows the architectural design for coupled CPU-GPU architectures. The system is designed to support plug-ins for acceleration components, enabling customization. This makes it easier for developers to add new features, improves productivity, and also reduces the size of the application. Plug-in functionality is implemented using shared libraries, installed in a location prescribed by Mi-Galactica.
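The paper does not show the plug-in interface; as a hedged sketch of the shared-library approach (the entry-point name and signature below are hypothetical), a host loader could resolve plug-ins with POSIX dlopen/dlsym:

```cuda
#include <dlfcn.h>
#include <cstdio>

// Hypothetical plug-in entry point; the real Mi-Galactica interface is
// not described in the paper.
typedef int (*plugin_init_fn)(void);

int load_plugin(const char *path) {
    void *handle = dlopen(path, RTLD_NOW);     // open the shared library
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return -1; }
    plugin_init_fn init =
        (plugin_init_fn)dlsym(handle, "plugin_init");  // resolve the symbol
    return init ? init() : -1;
}
```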
Figure 3: Mi-Galactica Four Major Components
Figure 4: Mi-Galactica Architecture
Figure 5: GPU Columnar File System
Source data in the database requires a preprocessing stage that converts it into a parallelizable file structure, GPU FS (File System), stored in a column-based orientation, as shown in Figure 5. Data can thus be accessed independently and computed in parallel, maximizing CPU multithreaded processing. Each column is segmented into multiple files, and the size of each segment is customizable, so the CPU and GPU have a sufficient amount of memory to compute larger data sets. Furthermore, this allows the GPU to process each column in a segment independently. Note, however, that a change to the data in the database does not automatically trigger an update of the preprocessed data; it needs to be re-created, or complemented when only new data is added. The CudaSet is a parallel file structure that improves parallel geo-spatial processing jobs on the GPU. It is not a legacy array-of-structures (AoS) design, which loses bandwidth and wastes L2 cache memory; instead, Mi-Galactica uses the CudaSet representation to arrange data in a structure-of-arrays (SoA) access pattern. This gains high throughput by coalescing memory accesses on the GPU, which is critical for memory-bound kernel functions. The required elements of the structure can be loaded individually, with no data interleaving, as shown in Figure 6. Thus, high global memory performance is achieved.
Figure 6: SOA CudaSet Structure
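To illustrate the access-pattern difference (a sketch with hypothetical field names, not the actual CudaSet layout), compare an AoS record with the SoA arrangement:

```cuda
// Array of Structures (AoS): the fields of one point are adjacent, so a
// warp reading only 'lat' strides through memory and wastes bandwidth.
struct PointAoS { float lat; float lon; int id; };

// Structure of Arrays (SoA), in the spirit of CudaSet: each column is
// contiguous, so 32 threads reading lat[i] form coalesced transactions.
struct PointsSoA {
    float *lat;  // all latitudes, contiguous
    float *lon;  // all longitudes, contiguous
    int   *id;   // all ids, contiguous
};

__global__ void shiftLat(PointsSoA p, float delta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p.lat[i] += delta;  // consecutive threads touch consecutive addresses
}
```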
The overhead of data transfer becomes an important factor; it is a bottleneck for fetching data into GPU computation. Mi-Galactica uses compression to alleviate this performance issue: it compresses the data into a smaller size, which reduces I/O operations, and offloads the task to the GPU. Data processing is restructured with a co-processing scheme suited to the given database architecture. Two compression schemes are implemented on the GPU. First, a scheme for the integer data type, based on Zukowski's PFOR-Delta [25], stores differences instead of actual values: only the difference between subsequent values is stored, and a bit-packing mechanism further optimizes this by using just enough bits to store each element. Second, a string compression scheme for character and text data types is based on the Lempel-Ziv (LZ77) compression algorithm [26] with dynamic representation and expression matching. It performs fine-grained parallel redundancy encoding and decoding of data with a flexible representation. The key to its efficiency is fast retrieval of the compressed data on the CPU, with the lookup process offloaded to the GPU.
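As a simplified sketch of the delta step (our own code; PFOR-Delta additionally applies frame-of-reference and bit packing, which are omitted here):

```cuda
// Delta encoding: keep the first value, store only differences after it.
void deltaEncode(const int *in, int *out, int n) {
    out[0] = in[0];                  // base value
    for (int i = 1; i < n; ++i)
        out[i] = in[i] - in[i - 1];  // small differences pack into few bits
}

// Decoding is an inclusive prefix sum, which parallelizes well on the GPU
// (e.g. with thrust::inclusive_scan).
void deltaDecode(int *data, int n) {
    for (int i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```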
The query engine comprises both CPU and GPU phases. The CPU phases are in charge of parsing clauses into objects: they identify the required data sources, translate the operations into low-level instruction sets, and then arrange the execution sequence and dispatch it for execution. The SQL query parser is implemented using a combination of Bison and Flex. There can be both CPU- and GPU-related
workloads; however, the CPU handles initializing GPU contexts, preparing input data, launching GPU kernel functions, materializing query results, and controlling the steps of query progress. The GPU phases execute specific optimized kernel functions, mostly aggregate and compute-intensive ones, in which thousands of cores process the data at once. The GPU-accelerated operations include select, sort, projection, joins and basic aggregation. The engine utilizes a mixture of an in-house accelerated parallel processing library, Mi-AccLib6, and open-source libraries such as NVIDIA Thrust7 and CUDPP8.
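As a hedged illustration of the kind of aggregation these libraries enable (our own example, not Mi-Galactica code), a GROUP BY/SUM can be expressed with Thrust's sort_by_key and reduce_by_key:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sketch: SELECT key, SUM(val) ... GROUP BY key, on the GPU with Thrust.
int main() {
    int h_keys[] = {2, 1, 2, 1, 3};
    int h_vals[] = {10, 20, 30, 40, 50};
    thrust::device_vector<int> keys(h_keys, h_keys + 5);
    thrust::device_vector<int> vals(h_vals, h_vals + 5);

    // Bring equal keys together, then reduce each run of equal keys.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    thrust::device_vector<int> out_keys(5), out_sums(5);
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          out_keys.begin(), out_sums.begin());
    // out_keys = {1, 2, 3}, out_sums = {60, 40, 50}
    return 0;
}
```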
The scheduler is responsible for managing received queries. Queue processing is implemented across a pool of worker threads on the CPU, which controls the concurrency level and the intensity of resource contention on the CPU. A resource monitor collects the current usage status of the GPU devices; the scheduler then uses this information to assign tasks to available GPUs. At this stage, the CPU performs an important role in concurrent queueing: data can safely be added by one thread and joined or removed by another without corruption. In addition, the scheduler maintains an optimal concurrency level and workload on the GPUs, and a data-swapping mechanism maximizes the effective utilization of GPU device memory. Through these processes, resource utilization improves. The implementation uses a mixture of APIs (Application Program Interfaces) from the Boost9 libraries.
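The paper does not detail the queue itself; as a hedged sketch of a thread-safe work queue in the spirit described (using standard C++ primitives rather than the specific Boost APIs):

```cuda
#include <mutex>
#include <queue>
#include <condition_variable>

// Sketch: one thread can enqueue queries while workers dequeue them,
// without corrupting the underlying container.
template <typename Task>
class WorkQueue {
    std::queue<Task> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Task t) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(t)); }
        cv_.notify_one();                        // wake one worker
    }
    Task pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });  // block until work
        Task t = std::move(q_.front());
        q_.pop();
        return t;
    }
};
```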
Mi-Galactica optimizes concurrency through a pipelining mechanism that overlaps data transfer over the PCI-e bus with arithmetic computation. CUDA streams execute asynchronously: work is queued and control returns to the CPU immediately. A pinned memory mechanism is often adopted in certain query implementations; it uses the Direct Memory Access (DMA) engine, which can achieve a higher percentage of peak bandwidth. Hyper-Q in Kepler allows multiple CPU threads to submit work to the same GPU concurrently through independent hardware work queues.
6 MIMOS accelerated library, consisting of high-speed multi-algorithm search engines for text processing, a data security engine and video analytics engines, http://atl.mimos.my/
7 Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity, https://developer.nvidia.com/thrust
8 CUDPP is the CUDA Data Parallel Primitives Library, http://cudpp.github.io/
9 Boost is a set of libraries for the C++ programming language that provide support for tasks and structures, http://www.boost.org/
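Returning to the pipelining mechanism above, a minimal sketch (our own example) shows how pinned host buffers and two CUDA streams let transfers and kernels overlap:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 22, half = N / 2;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, N * sizeof(float));  // pinned memory enables async DMA
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; ++i) {
        float *h = h_buf + i * half, *d = d_buf + i * half;
        // Each stream copies its chunk in, computes, and copies it back;
        // one stream's transfers overlap the other stream's kernel.
        cudaMemcpyAsync(d, h, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(d, 2.0f, half);
        cudaMemcpyAsync(h, d, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();  // drain both pipelines
    return 0;
}
```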