Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

By Jiong He, Mian Lu, Bingsheng He

Presented by Mohamed Ragab Moawad


GPU : Graphics Processing Unit

CPU : Central Processing Unit

The GPU complements the CPU, which performs general-purpose processing, by handling graphics calculations more efficiently.

It can also accelerate video transcoding, image processing, and other complex computations through GPGPU (general-purpose computing on graphics processing units), e.g. with CUDA (Compute Unified Device Architecture).

Umm.. GPU??

THEN AND NOW

Intel ASCI Red/9632 : 2,379 GFLOPS

- Fastest Supercomputer (World), 1999

PARAM PADMA : 1,000 GFLOPS

- Fastest Supercomputer (India), 2003

NVIDIA GTX 780 Ti : 5,046 GFLOPS

- Fastest GPU (General), 2013

GFLOPS : Giga FLoating-point Operations Per Second. A measure of computer performance.

                     GTX 680 (2GB)    GT 430 (4GB)
Memory               2GB              4GB
Price                Rs. 41,195       Rs. 4,520
Computation power    3,090 GFLOPS     269 GFLOPS

Memory is the boss? NO!

• A GPU is tailored for highly parallel operation, while a CPU executes programs serially.

• GPUs have significantly faster and more advanced memory interfaces, as they need to move far more data than CPUs.

• The CPU is optimized for sequential code performance.

• The GPU is specialized for compute-intensive, highly parallel computation.

• The GPU has evolved into a highly parallel, multithreaded, many-core processor with very high computational horsepower and very high memory bandwidth.

GPU VS CPU

There are two CPU-GPU architectures: discrete and coupled.

CPU-GPU Architectures

DISCRETE CPU-GPU ARCH.

• The older model.

• The GPU is usually connected to the CPU through a PCI-e bus.

PCI-e

PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCI-e, is a high-speed serial computer expansion bus standard designed to replace the older PCI, PCI-X, and AGP bus standards. PCI-e has numerous improvements over these older standards, including higher maximum system bus throughput, lower I/O pin count, smaller physical footprint, and better performance scaling for bus devices.

PROBLEM!

• The relatively low bandwidth and high latency of the PCI-e bus are usually the bottleneck.

Many hardware vendors have therefore attempted to resolve this overhead with new architectures, such as the coupled CPU-GPU architecture.

COUPLED CPU-GPU ARCH.

• The CPU and the GPU are integrated into a single chip, avoiding the costly data transfer over the PCI-e bus.

• Examples: AMD APU, Intel Ivy Bridge (2012).

These new heterogeneous architectures potentially open up new optimization opportunities for GPU query co-processing.

Query co-processing can be classified by the granularity of its parallelism:

1. Fine-grained

2. Coarse-grained

3. Embarrassingly parallel

FINE-GRAINED, COARSE-GRAINED, AND EMBARRASSING PARALLELISM

Applications are often classified according to how often their subtasks need to synchronize or communicate with each other.

An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second; and it is embarrassingly parallel if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.

SO...

- On the discrete CPU-GPU architecture, coarse-grained co-processing is preferred in order to reduce data transfer over the PCI-e bus. Moreover, the GPU and the CPU have their own memory controllers and caches.

- On the coupled CPU-GPU architecture, fine-grained co-processing becomes feasible.

OPENCL

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs) and graphics processing units (GPUs).

The advantage of OpenCL is that the same OpenCL code can run on both the CPU and the GPU without modification.

Previous studies have shown that implementations with OpenCL achieve performance very close to those written in native languages such as CUDA on the GPU and OpenMP on the CPU.

OpenCL can be used to give an application access to a graphics processing unit for non-graphical computing (general-purpose computing on graphics processing units).

GPGPU

General-purpose computing on graphics processing units (GPGPU, rarely GPGP or GP²U) is the utilization of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

OpenCL is the currently dominant open general-purpose GPU computing language. The dominant proprietary framework is Nvidia's CUDA.

HASH JOIN CO-PROCESSING

On the coupled architecture, co-processing should be fine-grained and should schedule the workloads carefully between the CPU and the GPU.

Moreover, we need to consider memory-specific optimizations for the shared cache architecture and the memory systems exposed by OpenCL.

ARCHITECTURE-AWARE HASH JOINS

Hash joins are considered the most efficient join algorithm for main-memory databases.

Two main types of hash joins:

1. Simple hash join (SHJ).

2. Partitioned hash join (PHJ).

FINE-GRAINED STEPS IN HASH JOINS

A hash join operator works on two input relations, R and S. We assume that |R| < |S|. A typical hash join algorithm has three phases: partition, build, and probe. The partition phase is optional, and the simple hash join does not have a partition phase.

In SHJ, the build phase constructs an in-memory hash table for R. Then, in the probe phase, each tuple in S is looked up in the hash table for matching entries. Both the build and the probe phases are divided into four steps, b1 to b4 and p1 to p4, respectively.

A hash table consists of an array of bucket headers. Each bucket header contains two fields: the total number of tuples within that bucket and a pointer to a key list. The key list contains all the unique keys with the same hash value, each of which links to a rid list storing the IDs of all tuples with the same key.
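The layout described above can be sketched in Python as follows. This is only an illustration: the class names `KeyNode` and `BucketHeader`, the `key % 4` bucket choice, and the sample contents are assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyNode:
    """One unique key in a bucket, linking to its rid list."""
    key: int
    rids: List[int] = field(default_factory=list)


@dataclass
class BucketHeader:
    """Bucket header: tuple count plus the key list."""
    count: int = 0
    key_list: List[KeyNode] = field(default_factory=list)


def new_hash_table(num_buckets: int) -> List[BucketHeader]:
    # The hash table is an array of bucket headers.
    return [BucketHeader() for _ in range(num_buckets)]


table = new_hash_table(4)
# Hypothetical content: keys 1 and 5 both map to bucket 1 (key % 4),
# and key 5 occurs in two tuples (rids 2 and 7).
table[1].key_list.append(KeyNode(key=1, rids=[0]))
table[1].key_list.append(KeyNode(key=5, rids=[2, 7]))
table[1].count = 3
```

The count in the header equals the total length of the rid lists in that bucket, so an empty bucket can be detected from the header alone.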

SHJ ALGORITHM

Algorithm 1: Fine-grained steps in SHJ

/* build */
for each tuple in R do
    (b1) compute hash bucket number;
    (b2) visit the hash bucket header;
    (b3) visit the hash key lists and create a key header if necessary;
    (b4) insert the record id into the rid list;

/* probe */
for each tuple in S do
    (p1) compute hash bucket number;
    (p2) visit the hash bucket header;
    (p3) visit the hash key lists;
    (p4) visit the matching build tuple to compare keys and produce output tuple;
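The steps of Algorithm 1 can be followed in a small single-threaded Python sketch. It assumes relations are lists of (key, rid) pairs and uses Python's built-in `hash` as the hash function; the function name `simple_hash_join` and the bucket representation are illustrative choices, not the paper's OpenCL implementation.

```python
def simple_hash_join(R, S, num_buckets=8):
    """R and S are lists of (key, rid) pairs; returns (r_rid, s_rid) matches."""
    # Each bucket is [tuple_count, key_list]; key_list holds [key, rid_list].
    table = [[0, []] for _ in range(num_buckets)]

    # build
    for key, rid in R:
        b = hash(key) % num_buckets        # (b1) compute hash bucket number
        bucket = table[b]                  # (b2) visit the bucket header
        for entry in bucket[1]:            # (b3) visit the key list
            if entry[0] == key:
                break
        else:
            entry = [key, []]              # create a key header if necessary
            bucket[1].append(entry)
        entry[1].append(rid)               # (b4) insert rid into the rid list
        bucket[0] += 1

    # probe
    out = []
    for key, s_rid in S:
        b = hash(key) % num_buckets        # (p1) compute hash bucket number
        bucket = table[b]                  # (p2) visit the bucket header
        if bucket[0] == 0:
            continue                       # empty bucket, no match possible
        for k, rids in bucket[1]:          # (p3) visit the key list
            if k == key:                   # (p4) compare keys, emit output
                out.extend((r_rid, s_rid) for r_rid in rids)
                break
    return out
```

For example, `simple_hash_join([(1, 0), (2, 1), (1, 2)], [(1, 10), (3, 11), (2, 12)])` pairs the S tuple with rid 10 against both R tuples that carry key 1.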

PHJ ALGORITHM

Main procedure for PHJ:

/* Partitioning: perform multiple passes if necessary */
Partition (R);
Partition (S);

/* Apply SHJ on each partition pair */
for each partition pair Ri and Si do
    Apply SHJ on Ri and Si;

Procedure: Partition (R)

for each tuple in R do
    (n1) compute partition number;
    (n2) visit the partition header;
    (n3) insert the <key, rid> into partition;
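A minimal Python sketch of the PHJ procedure follows. It makes several simplifying assumptions: a single partitioning pass, Python's built-in `hash` for the partition number, and a dict-based build and probe (`join_pair`, a hypothetical helper) standing in for SHJ on each partition pair.

```python
from collections import defaultdict


def partition(rel, num_partitions):
    """Split a list of (key, rid) pairs into num_partitions lists."""
    parts = [[] for _ in range(num_partitions)]
    for key, rid in rel:
        p = hash(key) % num_partitions     # (n1) compute partition number
        part = parts[p]                    # (n2) visit the partition header
        part.append((key, rid))            # (n3) insert <key, rid>
    return parts


def join_pair(Ri, Si):
    # Stand-in for SHJ on one partition pair (assumption: a dict replaces
    # the bucket-header hash table to keep the sketch short).
    index = defaultdict(list)
    for key, rid in Ri:
        index[key].append(rid)
    return [(r, s) for key, s in Si for r in index.get(key, [])]


def partitioned_hash_join(R, S, num_partitions=4):
    R_parts = partition(R, num_partitions)
    S_parts = partition(S, num_partitions)
    out = []
    # Tuples with equal keys land in the same partition index,
    # so only matching partition pairs need to be joined.
    for Ri, Si in zip(R_parts, S_parts):
        out.extend(join_pair(Ri, Si))
    return out
```

Because both relations are partitioned with the same function, each pair Ri, Si can be joined independently of the others.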

REVISITING CO-PROCESSING MECHANISMS

Off-loading (OL):

OL was proposed to off-load heavy operators such as joins to the GPU while the other operators in the query remain on the CPU.

The basic idea of OL on a step series is that the GPU is designed as a powerful, massively parallel query co-processor, and a step is evaluated entirely by either the GPU or the CPU.

Query processing continues on the CPU until the off-loaded computation completes on the GPU, and vice versa. That is, given a step series s1, ..., sn, we need to decide whether each si is performed on the CPU or the GPU.
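The per-step decision in OL can be illustrated with a toy Python sketch. The decision rule (pick the processor with the lower estimated cost), the assumption that steps run one after another, and all cost numbers are hypothetical; the slides do not state how the decision is made.

```python
def offload_plan(cpu_cost, gpu_cost):
    """Assign each step entirely to the processor with the lower cost."""
    plan = {}
    for step in cpu_cost:
        plan[step] = "GPU" if gpu_cost[step] < cpu_cost[step] else "CPU"
    return plan


def plan_time(plan, cpu_cost, gpu_cost):
    # Simplifying assumption: steps run one after another, so the total
    # time is the sum of the chosen per-step costs.
    return sum(gpu_cost[s] if dev == "GPU" else cpu_cost[s]
               for s, dev in plan.items())


# Hypothetical cost estimates for the build steps (arbitrary units).
cpu_cost = {"b1": 4.0, "b2": 3.0, "b3": 2.0, "b4": 3.0}
gpu_cost = {"b1": 1.0, "b2": 2.0, "b3": 5.0, "b4": 2.5}

plan = offload_plan(cpu_cost, gpu_cost)
total = plan_time(plan, cpu_cost, gpu_cost)
```

With these made-up numbers, b3 stays on the CPU and the other steps go to the GPU; while one processor works on a step, the other is idle.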

REVISITING CO-PROCESSING MECHANISMS

Data dividing (DD):

Problem: OL could under-utilize the CPU while the off-loaded computations are being executed on the GPU, and vice versa.

Moreover: as the performance gap between the GPU and the CPU on the coupled architecture is smaller than that on discrete architectures, we need to keep both the CPU and the GPU busy to further improve performance.

So: we can model the CPU and the GPU as two independent processors, and the problem is to schedule the workload between them. This problem has its roots in parallel query processing. One of the most commonly used schemes is to partition the input data among processors, perform parallel query processing on the individual processors, and merge the partial results from the individual processors into the final result. We adopt this scheme as the data-dividing co-processing scheme (DD) on the coupled architecture.
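The partition, process, and merge pattern of DD can be sketched in Python for the probe phase. Two worker threads stand in for the GPU and the CPU, and the split ratio `gpu_ratio`, the function names, and the sample data are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def probe(index, S_part):
    """Probe a prebuilt key -> rid-list index with a slice of S."""
    return [(r, s) for key, s in S_part for r in index.get(key, [])]


def data_dividing_probe(index, S, gpu_ratio=0.5):
    # Partition the input: the first gpu_ratio share goes to the "GPU"
    # worker, the remainder to the "CPU" worker.
    cut = int(len(S) * gpu_ratio)
    gpu_part, cpu_part = S[:cut], S[cut:]
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_future = pool.submit(probe, index, gpu_part)
        cpu_future = pool.submit(probe, index, cpu_part)
        # Merge the partial results into the final result.
        return gpu_future.result() + cpu_future.result()


# Hypothetical build-side index and probe relation of (key, rid) pairs.
index = {1: [0, 2], 2: [1]}
S = [(1, 10), (3, 11), (2, 12), (1, 13)]
result = data_dividing_probe(index, S, gpu_ratio=0.5)
```

The same ratio applies to the whole phase here, which is what distinguishes this scheme from a per-step division of work.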

PIPELINED EXECUTION (PL)

To address the limitations of OL and DD, we consider fine-grained workload scheduling between the CPU and the GPU so that we can capture their performance differences in processing the same workload.

For example, the GPU is much more efficient than the CPU on b1 and p1, whereas b3 and p3 are more efficient on the CPU. Meanwhile, we should keep both processors busy.

Therefore, we leverage the concept of pipelined execution and develop an adaptive fine-grained co-processing scheme that maximizes the efficiency of co-processing on the coupled architecture.
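One way to picture fine-grained scheduling is to give every step its own share of tuples for the GPU. The Python sketch below is a toy cost model, not the paper's scheme: it assumes per-tuple costs are known, that both processors work concurrently within a step, and that steps follow each other without overlap. All names and numbers are hypothetical.

```python
def step_time(n, r, cpu_unit, gpu_unit):
    """Time of one step when a fraction r of n tuples goes to the GPU."""
    # Both processors work at the same time; the slower side decides.
    return max(r * n * gpu_unit, (1 - r) * n * cpu_unit)


def balanced_ratio(cpu_unit, gpu_unit):
    # Ratio r that equalizes both sides: r * gpu_unit == (1 - r) * cpu_unit
    return cpu_unit / (cpu_unit + gpu_unit)


# Hypothetical per-tuple costs: the GPU is faster on b1, the CPU on b3.
cpu_unit = {"b1": 4.0, "b2": 3.0, "b3": 2.0, "b4": 3.0}
gpu_unit = {"b1": 1.0, "b2": 3.0, "b3": 6.0, "b4": 3.0}
n = 1000

ratios = {s: balanced_ratio(cpu_unit[s], gpu_unit[s]) for s in cpu_unit}
pl_time = sum(step_time(n, ratios[s], cpu_unit[s], gpu_unit[s])
              for s in cpu_unit)
cpu_only = sum(step_time(n, 0.0, cpu_unit[s], gpu_unit[s]) for s in cpu_unit)
gpu_only = sum(step_time(n, 1.0, cpu_unit[s], gpu_unit[s]) for s in cpu_unit)
```

Under this model most of b1 goes to the GPU and most of b3 to the CPU, and neither processor sits idle during a step.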

EVALUATIONS ON DISCRETE ARCHITECTURES

END
