International Journal of Information Technology Vol. 22 No. 1 2016

GPU SQL Query Accelerator

Keh Kok Yong, Hong Hoe Ong
Accelerative Technology Lab, MIMOS Berhad, Kuala Lumpur, Malaysia
[email protected], [email protected]

Vooi Voon Yap
Department of Electronic Engineering, Universiti Tunku Abdul Rahman, Perak, Malaysia
[email protected]

Abstract

The world is rapidly filling with connected sensors and devices that use geo-location capabilities to update their locations. Data analytics industries are finding ways to store this data and to turn the raw data into valuable information for business intelligence services. This has inadvertently created a flood of granular data about our world. Crucially, this data flood has outpaced traditional compute capabilities to process and analyze it. It thus reveals potential economic benefits and has become a compelling new research area requiring sophisticated mechanisms and technologies to meet the demand. Over the past decade, there have been attempts to use accelerators along with multicore CPUs to boost large-scale data computation. We propose an emerging SQL-like query accelerator, Mi-Galactica. In addition, we extend our system by offloading geo-spatial computation to GPU devices. Query operations execute in parallel, drawing support from high-performance, energy-efficient NVIDIA Tesla technology. Our results show a significant speedup.

Keywords: Geospatial, Graphics Processing Units, Database Query Processing, Big Data, Cloud
A wealth of research on query-related parallel algorithms has led to the development of GPU databases. Red Fox executes relational operators in a parallel manner on the GPU [20]. The works in [21], [22] investigate GPU acceleration of indexing, scan and search operations; [23] examines the important computational building blocks of aggregation; and [24] focuses on optimizing GPU sort. These studies have significantly raised awareness of using GPUs in big data analytics businesses. It is our belief that GPUs can benefit query processing and be widely deployed for big data analytics in database systems. Such a GPU query accelerator has to be carefully designed with parallel data structures and must harmonize processing between the CPU and the GPU.
III. Graphics Processing Unit
In this section, we first discuss the background of GPUs and introduce NVIDIA's Kepler architecture. Next, we describe how threads and blocks work in the Kepler architecture. Finally, we discuss its memory hierarchy.
A. Background
GPUs first gained popularity with the rise of 3D gaming in the mid-1990s. The demand for ever more powerful and energy-efficient GPUs has been increasing ever since. The growing computational power of GPUs has attracted many researchers to use them for more general-purpose computing. NVIDIA realized the potential of GPUs for general computing and released CUDA (Compute Unified Device Architecture) in 2006 so that the research community could leverage the power of the large number of streaming processors in GPUs. GPUs nowadays power a large range of industries, from supercomputers to embedded systems.
The latest NVIDIA GPU architecture, Tesla Maxwell, was introduced in Q3 2015; these new cards target the deep learning sector. This paper, however, is based on the Kepler architecture, which includes many improvements over its predecessor, Fermi. With this architecture, a single GPU die can contain up to 2880 CUDA cores. The Kepler architecture also introduced new features such as Dynamic Parallelism, Hyper-Q, the Grid Management Unit, and NVIDIA GPUDirect. It contains enhanced memory subsystems offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation. The principal design goal of the Kepler architecture, improved power efficiency, has been met through these new features.
B. Grids, Blocks, Threads and Warps
The CUDA programming model introduces the concepts of threads, blocks, and grids, which run GPU code called kernels. These threads, blocks, and grids then run on multiple SMXs (streaming multiprocessors) in the GPU in groups of warps. Figure 1 shows examples of threads, blocks, and grids. From a programmer's perspective, one only needs to handle the thread, block, and grid assignments and the kernel programming, while the hardware manages how all the threads, blocks, and grids are mapped onto the SMXs and warps.
In CUDA, all the threads in the same grid execute the same kernel function, but each thread mostly handles different data. This type of programming model is known as Single Instruction Multiple Data (SIMD). With the Kepler architecture, a block can consist of up to 1024 threads in total, arranged across the x, y, and z dimensions, and the maximum number of blocks in the x dimension of a grid can go up to 2^31 - 1.
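As an illustration of this model (our own minimal sketch, not code from the paper), each thread computes a global index from its block and thread coordinates and processes one array element:

```cuda
#include <cuda_runtime.h>

// Minimal sketch: each thread derives a unique global index from its
// block and thread coordinates and scales one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid size
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```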
Previously, on the Fermi architecture, once a kernel had been launched, its dimensions could not be changed. The Kepler architecture allows the programmer to launch another set of grids and blocks from within a kernel, which enables a more flexible programming model. This feature is called Dynamic Parallelism.
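As a hedged sketch of Dynamic Parallelism (the kernel names are our own; this requires a GK110-class device of compute capability 3.5 or higher and compilation with -rdc=true):

```cuda
// Sketch: a parent kernel launches a child grid directly on the device,
// with no round-trip to the CPU.
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void parent(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Device-side launch: grid dimensions are chosen at run time.
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```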
Keh Kok Yong, Hong Hoe Ong and Vooi Voon Yap
A warp is the unit of thread scheduling in the SMXs. Once a block is assigned to an SMX, it is divided into warps, each consisting of 32 threads in the Kepler architecture. Every thread in a warp runs in parallel, executing the same line of code. To increase the efficiency of the warps, we should avoid branch divergence as much as possible. Branch divergence occurs when threads inside a warp branch into different execution paths.
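To illustrate divergence (our own example, not the paper's), the first kernel below splits a warp across two paths, while the second replaces the branch with arithmetic that typically compiles to a predicated select:

```cuda
// Divergent: odd and even threads in the same warp take different paths,
// so the warp executes both branches serially.
__global__ void divergent(int *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) out[i] = i * 2;
    else            out[i] = i * 3;
}

// Converged: a data-dependent value replaces the control-flow split, so
// all 32 threads of the warp execute the same instructions.
__global__ void converged(int *out) {
    int i = threadIdx.x;
    int m = (i % 2 == 0) ? 2 : 3;  // usually a predicated select, not a branch
    out[i] = i * m;
}
```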
Figure 1: Thread, Block and Grids (panels (i)-(iii) show threads, blocks, and Grids 0 and 1)
Figure 2: Hierarchy of GPU memory (Level 1: registers; Level 2: Shared Memory, L1 cache and read-only data cache shared by 192 CUDA cores; Level 3: L2 cache; Level 4: DRAM)
C. Memory Hierarchy
There are four levels of memory hierarchy in NVIDIA GPUs, as shown in Figure 2. The first level is register memory. Registers are local to the CUDA cores and have a total size of 64 KB; they are the fastest memory among all the memory types in the SMX. The second level consists of the Shared Memory, the L1 cache and the read-only data cache. These memories are located very near the SMX core and are shared among the 192 CUDA cores in the SMX. The Shared Memory is usually used for communication among the different threads of a block. The third level is the L2 cache, and finally, the fourth level is the DRAM, which serves as the main storage on the GPU and is used to send and read data in bulk from the CPU's memory.
In the Fermi architecture, the Shared Memory and L1 cache can be configured as 48 KB of Shared Memory with 16 KB of L1 cache, or vice versa. The Kepler architecture allows additional flexibility by permitting a 32 KB / 32 KB split between Shared Memory and L1 cache. The read-only data cache is also new in the Kepler architecture. Previously, programmers would use the texture unit to cache read-only data, but this method had many limitations. The benefit of the read-only data cache is that it takes its load footprint off the Shared/L1 cache path, and it supports full-speed unaligned memory access patterns.
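As a hedged sketch of using these Kepler features (our own example, using standard CUDA runtime calls): loads can be routed through the read-only data cache with the __ldg() intrinsic (or const __restrict__ pointers), and the 32 KB / 32 KB split can be requested per kernel:

```cuda
#include <cuda_runtime.h>

// Sketch: read input through the read-only data cache (GK110, CC >= 3.5).
__global__ void addOne(const float * __restrict__ in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]) + 1.0f;  // load via the read-only cache path
}

void configure() {
    // Request the equal Shared Memory / L1 split introduced by Kepler.
    cudaFuncSetCacheConfig(addOne, cudaFuncCachePreferEqual);
}
```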
IV. Implementation
A. Overview of Mi-Galactica
Mi-Galactica is a SQL-like query accelerator. The system is formed from four major components: the Connector, Preprocessor, Scheduler and Query Engine, as shown in Figure 3. The Connector enables Mi-Galactica to communicate with PostgreSQL and MySQL; it performs frontend application interaction, data extraction and data interchange, and it also supports processing of comma-separated values (CSV) files. The Scheduler is an internal task engine for managing user workloads. The Query Engine carries out the stages of query analysis, performing the basic parsing and positioning operations and then producing an execution query plan. The plan is further adjusted by analyzing and tracing parallelizable points and rearranging the execution order of clause objects. The Mi-Galactica execution engine then performs the accelerated query execution on either the CPU or the GPU. Source data in the database, meanwhile, is transformed and output to a parallel, columnar storage system. These components are designed to run on energy-efficient commodity GPU accelerators, giving the system the power to tackle big data challenges.
Mi-Galactica adopts lessons from previous studies [17], [18], [19], [20] and [21] on query co-processing of heterogeneous workloads. Figure 4 shows the architectural design for coupled CPU-GPU architectures. The system is designed to support plug-ins for acceleration components, enabling customization. This makes it easier for developers to add new features, improves productivity, and also reduces the size of the application. Plug-in functionality is implemented using shared libraries, installed in a location prescribed by Mi-Galactica.
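The paper does not show the plug-in interface; as a hedged sketch of the shared-library approach (the entry-point name and signature below are hypothetical), a host loader could resolve plug-ins with POSIX dlopen/dlsym:

```cuda
#include <dlfcn.h>
#include <cstdio>

// Hypothetical plug-in entry point; the real Mi-Galactica interface is
// not described in the paper.
typedef int (*plugin_init_fn)(void);

int load_plugin(const char *path) {
    void *handle = dlopen(path, RTLD_NOW);     // open the shared library
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return -1; }
    plugin_init_fn init =
        (plugin_init_fn)dlsym(handle, "plugin_init");  // resolve the symbol
    return init ? init() : -1;
}
```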
Figure 3: Mi-Galactica Four Major Components
Figure 4: Mi-Galactica Architecture
Figure 5: GPU Columnar File System
Source data in the database requires a preprocessing stage that converts it into a parallelizable file structure, GPU FS (File System), stored in a column-based orientation, as shown in Figure 5. Data can thus be accessed independently and computed in parallel, maximizing CPU multithreaded processing. Each column is segmented into multiple files, and the size of each segment is customizable, so the CPU and GPU have a sufficient amount of memory to compute larger data sets. Furthermore, this allows the GPU to process each column in a segment independently. Note, however, that a change to the data in the database does not automatically trigger an update of the preprocessed data; it needs to be re-created, or complemented when only new data is added. The CudaSet is a parallel file structure that improves parallel geo-spatial processing jobs on the GPU. It is not a legacy array-of-structures (AoS) design, which loses bandwidth and wastes L2 cache memory; instead, Mi-Galactica uses the CudaSet representation to arrange data in a structure-of-arrays (SoA) access pattern. This gains high throughput by coalescing memory accesses on the GPU, which is critical for memory-bound kernel functions. The required elements of the structure can be loaded individually, with no data interleaving, as shown in Figure 6. Thus, high global memory performance is achieved.
Figure 6: SOA CudaSet Structure
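To illustrate the access-pattern difference (a sketch with hypothetical field names, not the actual CudaSet layout), compare an AoS record with the SoA arrangement:

```cuda
// Array of Structures (AoS): the fields of one point are adjacent, so a
// warp reading only 'lat' strides through memory and wastes bandwidth.
struct PointAoS { float lat; float lon; int id; };

// Structure of Arrays (SoA), in the spirit of CudaSet: each column is
// contiguous, so 32 threads reading lat[i] form coalesced transactions.
struct PointsSoA {
    float *lat;  // all latitudes, contiguous
    float *lon;  // all longitudes, contiguous
    int   *id;   // all ids, contiguous
};

__global__ void shiftLat(PointsSoA p, float delta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p.lat[i] += delta;  // consecutive threads touch consecutive addresses
}
```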
The overhead of data transfer becomes an important factor; it is a bottleneck for fetching data into GPU computation. Mi-Galactica uses compression to alleviate this performance issue: it compresses the data into a smaller size, which reduces I/O operations, and offloads the task to the GPU. Data processing is restructured with a co-processing scheme suited to the given database architecture. Two compression schemes are implemented on the GPU. First, a scheme for the integer data type, based on Zukowski's PFOR-Delta [25], stores differences instead of actual values: only the difference between subsequent values is stored, and a bit-packing mechanism further optimizes this by using just enough bits to store each element. Second, a string compression scheme for character and text data types is based on the Lempel-Ziv (LZ77) compression algorithm [26] with dynamic representation and expression matching. It performs fine-grained parallel redundancy encoding and decoding of data with a flexible representation. The key to its efficiency is fast retrieval of the compressed data on the CPU, with the lookup process offloaded to the GPU.
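As a simplified sketch of the delta step (our own code; PFOR-Delta additionally applies frame-of-reference and bit packing, which are omitted here):

```cuda
// Delta encoding: keep the first value, store only differences after it.
void deltaEncode(const int *in, int *out, int n) {
    out[0] = in[0];                  // base value
    for (int i = 1; i < n; ++i)
        out[i] = in[i] - in[i - 1];  // small differences pack into few bits
}

// Decoding is an inclusive prefix sum, which parallelizes well on the GPU
// (e.g. with thrust::inclusive_scan).
void deltaDecode(int *data, int n) {
    for (int i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```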
The query engine comprises both CPU and GPU phases. The CPU phases are in charge of parsing clauses into objects: they identify the required data sources, translate the operations into low-level instruction sets, and then arrange the execution sequence and dispatch it for execution. The SQL query parser is implemented using a combination of Bison and Flex. There can be both CPU- and GPU-related
workloads; however, the CPU handles initializing GPU contexts, preparing input data, launching GPU kernel functions, materializing query results, and controlling the steps of query progress. The GPU phases execute specific optimized kernel functions, mostly aggregate and compute-intensive ones, in which thousands of cores process the data at once. The GPU-accelerated operations include select, sort, projection, joins and basic aggregation. The engine utilizes a mixture of an in-house accelerated parallel processing library, Mi-AccLib6, and open-source libraries such as NVIDIA Thrust7 and CUDPP8.
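As a hedged illustration of the kind of aggregation these libraries enable (our own example, not Mi-Galactica code), a GROUP BY/SUM can be expressed with Thrust's sort_by_key and reduce_by_key:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sketch: SELECT key, SUM(val) ... GROUP BY key, on the GPU with Thrust.
int main() {
    int h_keys[] = {2, 1, 2, 1, 3};
    int h_vals[] = {10, 20, 30, 40, 50};
    thrust::device_vector<int> keys(h_keys, h_keys + 5);
    thrust::device_vector<int> vals(h_vals, h_vals + 5);

    // Bring equal keys together, then reduce each run of equal keys.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    thrust::device_vector<int> out_keys(5), out_sums(5);
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          out_keys.begin(), out_sums.begin());
    // out_keys = {1, 2, 3}, out_sums = {60, 40, 50}
    return 0;
}
```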
The scheduler is responsible for managing received queries. Queue processing is implemented across a pool of worker threads on the CPU, which controls the concurrency level and the intensity of resource contention on the CPU. A resource monitor collects the current usage status of the GPU devices; the scheduler then uses this information to assign tasks to available GPUs. At this stage, the CPU performs an important role in concurrent queueing: data can safely be added by one thread and joined or removed by another without corruption. In addition, the scheduler maintains an optimal concurrency level and workload on the GPUs, and a data-swapping mechanism maximizes the effective utilization of GPU device memory. Through these processes, resource utilization improves. The implementation uses a mixture of APIs (Application Program Interfaces) from the Boost9 libraries.
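The paper does not detail the queue itself; as a hedged sketch of a thread-safe work queue in the spirit described (using standard C++ primitives rather than the specific Boost APIs):

```cuda
#include <mutex>
#include <queue>
#include <condition_variable>

// Sketch: one thread can enqueue queries while workers dequeue them,
// without corrupting the underlying container.
template <typename Task>
class WorkQueue {
    std::queue<Task> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Task t) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(t)); }
        cv_.notify_one();                        // wake one worker
    }
    Task pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });  // block until work
        Task t = std::move(q_.front());
        q_.pop();
        return t;
    }
};
```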
Mi-Galactica optimizes concurrency through a pipelining mechanism that overlaps data transfer over the PCI-e bus with arithmetic computation. CUDA streams execute asynchronously: work is queued and control returns to the CPU immediately. A pinned memory mechanism is often adopted in certain query implementations; it uses the Direct Memory Access (DMA) engine, which can achieve a higher percentage of peak bandwidth. Hyper-Q in Kepler allows multiple CPU threads to submit work to the same GPU concurrently through independent hardware work queues.
6 MIMOS accelerated library, consisting of high-speed multi-algorithm search engines for text processing, a data security engine and video analytics engines, http://atl.mimos.my/
7 Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity, https://developer.nvidia.com/thrust
8 CUDPP is the CUDA Data Parallel Primitives Library, http://cudpp.github.io/
9 Boost is a set of libraries for the C++ programming language that provide support for tasks and structures, http://www.boost.org/
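Returning to the pipelining mechanism above, a minimal sketch (our own example) shows how pinned host buffers and two CUDA streams let transfers and kernels overlap:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 22, half = N / 2;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, N * sizeof(float));  // pinned memory enables async DMA
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; ++i) {
        float *h = h_buf + i * half, *d = d_buf + i * half;
        // Each stream copies its chunk in, computes, and copies it back;
        // one stream's transfers overlap the other stream's kernel.
        cudaMemcpyAsync(d, h, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(d, 2.0f, half);
        cudaMemcpyAsync(h, d, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();  // drain both pipelines
    return 0;
}
```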