
SEOUL | Oct.7, 2016

Tran Minh Quan and Won-Ki Jeong

High-performance Visual Computing Laboratory, NVIDIA GPU Research Center

http://hvcl.unist.ac.kr

VARIOUS OPTIMIZATION STRATEGIES FOR IMPLEMENTING FAST DISCRETE WAVELET TRANSFORMS ON GPUS


AGENDA

Introduction

Related work

Background

GPU Optimization Strategies

Results and Discussions

Conclusions


INTRODUCTION

DISCRETE WAVELET TRANSFORM (DWT)

DWT has been actively studied in the image processing domain

[Figure: original signal and its Level 1, Level 2, and Level 3 decompositions]

DISCRETE WAVELET TRANSFORM (DWT)

DWT can be accelerated by using special hardware (FPGAs, MIC, GPUs)

NVIDIA CUDA and OpenCL


RELATED WORK

IMAGE VIEWPOINT

Direct filtering, separable filtering, transpose filtering, …

[Figure: filtering diagrams; only the convolution (*) and transpose (T) symbols survive]

Technically, they map well onto the CUDA programming model

LITERATURE SUMMARY

Recent GPU-based Discrete Wavelet Transform

[2009] J. Matela, “GPU-Based DWT acceleration for JPEG2000,” 2D filtered + shared memory

[2009] J. Franco, G. Bernabé, J. Fernández, and M. Acacio, “A parallel implementation of the 2D wavelet transform using CUDA,”

[2010] J. Franco, G. Bernabé, J. Fernández, and M. Ujaldón, “Parallel 3D fast wavelet transform on many-core GPUs and multicore CPUs,” 1D filtering + 2D transpose

[2009] W. van der Laan, J. B. T. M. Roerdink, and A. Jalba, “Accelerating wavelet-based video coding on graphics hardware using CUDA,” 1D lifting scheme + shared memory

[2011] W. van der Laan, A. Jalba, and J. B. T. M. Roerdink, “Accelerating wavelet lifting on graphics hardware using CUDA,” 1D lifting scheme + shared memory + sliding windows

[2011] C. Song, Y. Li, and B. Huang, “A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images,” 1D lifting scheme + shared memory

[2014] C. Song, Y. Li, J. Guo, and J. Lei, “Block-based two-dimensional wavelet transform running on graphics processing unit,” 2D lifting scheme + shared memory

[2015] P. Enfedaque, F. Auli-Llinas, and J. Moure, “Implementation of the DWT in a GPU through a register-based strategy,” 2D lifting scheme + register-based + warp shuffles


BACKGROUND: DISCRETE WAVELET TRANSFORM

WAVELET LIFTING SCHEMES

Haar wavelet family

     Haar   CDF 5/3   CDF 9/7
α    1.0    0.50      1.58613434
β    0.5    0.25      0.05298012
γ    N/A    N/A       0.88291108
δ    N/A    N/A       0.44350685

WAVELET LIFTING SCHEMES

CDF 5/3 wavelet family


WAVELET LIFTING SCHEMES

CDF 9/7 wavelet family


[Figure: CDF 9/7 lifting diagram with a halo region on each side]
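As a rough illustration of how the tabulated coefficients are used, the following host-side C++ sketch applies lifting steps to a 1D signal. It is not the authors' code. The table lists magnitudes only, so the signs in `kCdf53` and `kCdf97` are assumptions (the commonly used convention), the final scaling step of CDF 9/7 is omitted because the table does not give it, and the names `mirror`, `lift_step`, `lift_forward`, and `lift_inverse` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reflect an out-of-range index back into [0, n-1] (symmetric extension, n >= 2).
inline int mirror(int i, int n) {
    if (i < 0) return -i;
    if (i >= n) return 2 * (n - 1) - i;
    return i;
}

// One lifting step: every sample of the given parity (0 = even, 1 = odd)
// is updated from its two neighbours of the opposite parity.
inline void lift_step(std::vector<float>& s, int parity, float c) {
    const int n = static_cast<int>(s.size());
    for (int i = parity; i < n; i += 2)
        s[i] += c * (s[mirror(i - 1, n)] + s[mirror(i + 1, n)]);
}

// Forward transform: steps alternate between odd samples (predict)
// and even samples (update), starting with the odd ones.
inline void lift_forward(std::vector<float>& s, const std::vector<float>& coeffs) {
    for (std::size_t k = 0; k < coeffs.size(); ++k)
        lift_step(s, (k % 2 == 0) ? 1 : 0, coeffs[k]);
}

// Inverse transform: the same steps in reverse order with negated coefficients.
inline void lift_inverse(std::vector<float>& s, const std::vector<float>& coeffs) {
    for (std::size_t k = coeffs.size(); k-- > 0;)
        lift_step(s, (k % 2 == 0) ? 1 : 0, -coeffs[k]);
}

// Magnitudes from the table; signs are an assumption.
const std::vector<float> kCdf53 = {-0.50f, 0.25f};
const std::vector<float> kCdf97 = {-1.58613434f, -0.05298012f, 0.88291108f, 0.44350685f};
```

Because each step only reads samples of the opposite parity, it can run in place and is undone exactly by subtracting the same term, which is what makes lifting attractive for GPU kernels with limited on-chip memory.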

MIXED-BAND DISCRETE WAVELET TRANSFORM

3D viewpoint in memory layout

[Figure: original signal and Level 1, Level 2, and Level 3 results for the mixed-band wavelet and the conventional wavelet]

MIXED-BAND DISCRETE WAVELET TRANSFORM

2D viewpoint in memory layout

[Figure: original signal and Level 1, Level 2, and Level 3 results for the mixed-band wavelet and the conventional wavelet]

MIXED-BAND DISCRETE WAVELET TRANSFORM

1D viewpoint in memory layout

1) Less memory usage due to in-place filtering,

2) Fused multi-level transformation in a single kernel call, without synchronizing through global memory,

3) Reduced index computation overhead for shuffling the locations of wavelet coefficients.
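The three points above can be made concrete with a small host-side C++ sketch of an in-place, multi-level 1D Haar transform in which coefficients never move: level l simply works on samples spaced 2^l apart. This is only an illustration under assumptions, not the authors' kernel. The function names are hypothetical, and the lifting signs (detail = odd minus even, then even plus half the detail) are one common reading of the Haar column α = 1.0, β = 0.5.

```cpp
#include <cassert>
#include <vector>

// In-place multi-level Haar lifting on a 1D signal (length a power of two).
// Coefficients stay where they are computed: level l touches samples
// spaced (1 << l) apart, so no reshuffling pass is needed between levels.
inline void haar_mixed_band_forward(std::vector<float>& s, int levels) {
    assert(levels >= 0);
    const int n = static_cast<int>(s.size());
    for (int l = 0; l < levels; ++l) {
        const int stride = 1 << l;
        for (int i = 0; i + stride < n; i += 2 * stride) {
            s[i + stride] -= s[i];         // predict (alpha = 1.0): detail
            s[i] += 0.5f * s[i + stride];  // update  (beta  = 0.5): average
        }
    }
}

// Inverse: undo the levels from coarsest to finest, steps in reverse order.
inline void haar_mixed_band_inverse(std::vector<float>& s, int levels) {
    assert(levels >= 0);
    const int n = static_cast<int>(s.size());
    for (int l = levels - 1; l >= 0; --l) {
        const int stride = 1 << l;
        for (int i = 0; i + stride < n; i += 2 * stride) {
            s[i] -= 0.5f * s[i + stride];
            s[i + stride] += s[i];
        }
    }
}
```

For the input {1, 3, 5, 7} and two levels, the result is {4, 2, 4, 2}: the overall average sits at index 0, the level-2 detail at index 2, and the level-1 details at the odd indices.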

BIT ROTATION PERMUTATION

Conversion between conventional and mixed-band DWT

10/13/2016

////////////////////////////////////////////////////////////////
// Find the log2 or position of most significant bit
__device__ __forceinline__ int msb(const int x)
{
    int ret = sizeof(unsigned int) * 8 - 1;
    return x ? ret - __clz(x) : ret;
}

////////////////////////////////////////////////////////////////
// Rotate right
__device__ __forceinline__ int ror(const int number, const int dim)
{
    int length = msb(dim);
    return (((number & 1) << (length - 1)) | (number >> 1));
}

__clz(x) counts the number of consecutive leading zero bits, starting at the most significant bit (bit 31) of x.
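To see what the rotation does to indices, here is a host-side C++ sketch that mirrors `msb` and `ror` without the `__clz` intrinsic and applies the permutation to a whole signal. The names `msb_host`, `ror_host`, and `mixed_to_conventional` are hypothetical, and unlike the device `msb`, this version assumes a positive argument. It only shows the single-rotation case; how the rotation is composed across several levels is not shown on the slide.

```cpp
#include <cassert>
#include <vector>

// Host stand-in for the device msb(): position of the most significant set bit.
// Assumes x > 0 (the device version returns 31 for x == 0).
inline int msb_host(unsigned x) {
    assert(x > 0);
    int pos = 0;
    while (x >>= 1) ++pos;
    return pos;
}

// Rotate the low log2(dim) bits of `number` right by one position.
// dim is assumed to be a power of two, at least 2.
inline int ror_host(int number, int dim) {
    const int length = msb_host(static_cast<unsigned>(dim));
    return ((number & 1) << (length - 1)) | (number >> 1);
}

// Scatter an interleaved (mixed-band, one level) signal into the
// conventional "approximations first, details second" order.
inline std::vector<float> mixed_to_conventional(const std::vector<float>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size());
    for (int i = 0; i < n; ++i) out[ror_host(i, n)] = in[i];
    return out;
}
```

For dim = 8 the mapping is 0→0, 1→4, 2→1, 3→5, 4→2, 5→6, 6→3, 7→7: even indices land in the first half and odd indices in the second half, using only a shift, a mask, and an OR per index.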

GPU OPTIMIZATION STRATEGIES

USING SHARED MEMORY

USING SHARED MEMORY

Data copy from global memory to shared memory (forward transform)

USING SHARED MEMORY

Data copy from global memory to shared memory

////////////////////////////////////////////////////////////////////////////////////////////////////
__global__
void __cdf53_smem_encode(const float* __restrict__ src, float* __restrict__ dst,
                         int dimx, int dimy, int ldx, int ldy, int level=1)
{
    extern __shared__ float sharedMem[];
    int  opened_index_1d, closed_index_1d, shift_index_1d, reading_index_1d, writing_index_1d;
    int2 opened_index_2d, closed_index_2d, shift_index_2d, reading_index_2d, writing_index_2d;
    int2 numBits = make_int2(logf(dimx)/logf(2.0f), logf(dimy)/logf(2.0f));

    // Multi pass reading here
    #pragma unroll
    for(int batch=0; batch<NUMREADINGS; batch++)
    {
        opened_index_1d  = threadIdx.y * blockDim.x + threadIdx.x + NUMTHREADS*batch;
        opened_index_2d  = make_int2(opened_index_1d % OPENEDDIMX,
                                     opened_index_1d / OPENEDDIMX);
        reading_index_2d = make_int2(blockIdx.x * CLOSEDDIMX + opened_index_2d.x - HALO,
                                     blockIdx.y * CLOSEDDIMY + opened_index_2d.y - HALO);
        reading_index_1d = reading_index_2d.y * dimx + reading_index_2d.x;

        sharedMem[at(opened_index_2d.x, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)]
            = src[at(clamp_mirror(reading_index_2d.x, 0, dimx-1),
                     clamp_mirror(reading_index_2d.y, 0, dimy-1),
                     dimx, dimy)];
    }
    __syncthreads();
    // Lifting
    // Multi pass writing here
}
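The kernel relies on two helpers, `at` and `clamp_mirror`, whose definitions do not appear on the slides. The host-side C++ sketch below gives one plausible reading of them and reproduces the halo load for a single block on the CPU. Everything here is an assumption made for illustration: the helper bodies, the `load_tile` function, and the relation "opened tile = closed tile + 2 × halo" inferred from the names `OPENEDDIMX`, `CLOSEDDIMX`, and `HALO`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: row-major linear index (dimy unused, kept for signature parity).
inline int at(int x, int y, int dimx, int /*dimy*/) { return y * dimx + x; }

// Hypothetical helper: reflect idx into [lo, hi] without repeating the edge sample.
// A single reflection is only valid for small overshoots, which the assert checks.
inline int clamp_mirror(int idx, int lo, int hi) {
    int r = idx;
    if (idx < lo) r = 2 * lo - idx;
    if (idx > hi) r = 2 * hi - idx;
    assert(r >= lo && r <= hi);
    return r;
}

// Copy the "opened" tile (closed tile plus a halo on every side) of block
// (bx, by) from a dimx x dimy image into a local buffer, mirroring at borders.
inline std::vector<float> load_tile(const std::vector<float>& src, int dimx, int dimy,
                                    int bx, int by, int closed, int halo) {
    const int opened = closed + 2 * halo;
    std::vector<float> tile(static_cast<std::size_t>(opened) * opened);
    for (int ty = 0; ty < opened; ++ty)
        for (int tx = 0; tx < opened; ++tx) {
            const int gx = clamp_mirror(bx * closed + tx - halo, 0, dimx - 1);
            const int gy = clamp_mirror(by * closed + ty - halo, 0, dimy - 1);
            tile[at(tx, ty, opened, opened)] = src[at(gx, gy, dimx, dimy)];
        }
    return tile;
}
```

On the GPU the two nested loops are replaced by the batched, thread-strided loop over `NUMREADINGS`, so that a block with fewer threads than tile elements still fills the whole opened tile.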

USING SHARED MEMORY (CONT'D)

Lifting in the shared memory

// Lifting along x
if((threadIdx.x&1)==0) // lifting with alpha
{
    sharedMem[at(opened_index_2d.x + 1, opened_index_2d.y, ...)]
        += a*(sharedMem[at(opened_index_2d.x + 0, opened_index_2d.y, ...)]
            + sharedMem[at(opened_index_2d.x + 2, opened_index_2d.y, ...)]);
}
__syncthreads();

if((threadIdx.x&1)==0) // lifting with beta
{
    sharedMem[at(opened_index_2d.x + 0, opened_index_2d.y, ...)]
        += b*(sharedMem[at(opened_index_2d.x - 1, opened_index_2d.y, ...)]
            + sharedMem[at(opened_index_2d.x + 1, opened_index_2d.y, ...)]);
}
__syncthreads();

// Lifting along y
...

////////////////////////////////////////////////////////////////////////////////////////////////////
__global__
void __cdf53_smem_encode(const float* __restrict__ src, float* __restrict__ dst,
                         int dimx, int dimy, int ldx, int ldy, int level=1)
{
    // Multi pass reading here
    // Lifting
    // Multi pass writing here
    #pragma unroll
    for(int batch=0; batch<NUMWRITINGS; batch++)
    {
        closed_index_1d  = threadIdx.y * blockDim.x + threadIdx.x + NUMTHREADS*batch;
        closed_index_2d  = make_int2(closed_index_1d % CLOSEDDIMX,
                                     closed_index_1d / CLOSEDDIMX);
        writing_index_2d = make_int2(blockIdx.x * CLOSEDDIMX + closed_index_2d.x,  // block stride
                                     blockIdx.y * CLOSEDDIMY + closed_index_2d.y); // block stride
        writing_index_1d = writing_index_2d.y * dimx + writing_index_2d.x;

        if ((writing_index_2d.y < dimy) && (writing_index_2d.x < dimx)
         && (writing_index_2d.y >= 0)   && (writing_index_2d.x >= 0)
         && (closed_index_2d.y < CLOSEDDIMY) && (closed_index_2d.x < CLOSEDDIMX))
        {
            float result = sharedMem[at(closed_index_2d.x + HALO, closed_index_2d.y + HALO,
                                        OPENEDDIMX, OPENEDDIMY)];
            dst[at(ror(clamp_mirror(writing_index_2d.x, 0, dimx-1), dimx),
                   ror(clamp_mirror(writing_index_2d.y, 0, dimy-1), dimy),
                   dimx, dimy)] = result;
        }
    }
}

USING SHARED MEMORY

Intermediate results

GPU OPTIMIZATION STRATEGIES

USING REGISTERS

USING REGISTERS

Lifting in the shared memory with registers

USING REGISTERS (CONT'D)

Lifting in the shared memory with registers

// Lifting along x
float regs[6];
regs[0] = sharedMem[at(opened_index_2d.x - 2, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)];
regs[1] = sharedMem[at(opened_index_2d.x - 1, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)];
regs[2] = sharedMem[at(opened_index_2d.x + 0, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)];
regs[3] = sharedMem[at(opened_index_2d.x + 1, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)];
regs[4] = sharedMem[at(opened_index_2d.x + 2, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)];
// Lifting with alpha
regs[1] += a*(regs[0]+regs[2]);
regs[3] += a*(regs[2]+regs[4]);
// Lifting with beta
regs[2] += b*(regs[1]+regs[3]);
__syncthreads();
if((threadIdx.x&1)==0)
{
    sharedMem[at(opened_index_2d.x + 0, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)] = regs[2];
    sharedMem[at(opened_index_2d.x + 1, opened_index_2d.y, OPENEDDIMX, OPENEDDIMY)] = regs[3];
}
__syncthreads();

// Lifting along y
...
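The register version trades a little redundant arithmetic for fewer shared-memory round trips: each thread recomputes the alpha step for both odd neighbours of its even sample, so the beta step needs no intermediate synchronization. A minimal host-side C++ sketch of that five-sample window follows; the struct and function names are hypothetical, and the coefficient signs used in the usage note (a = -0.5, b = 0.25 for CDF 5/3) are an assumption.

```cpp
#include <cassert>

// Result of lifting one even/odd pair.
struct LiftedPair {
    float even;  // approximation coefficient at position x
    float odd;   // detail coefficient at position x + 1
};

// w holds the five samples x-2 .. x+2 around an even position x,
// playing the role of regs[0..4] in the kernel.
inline LiftedPair lift_window_cdf53(const float w[5], float a, float b) {
    float r[5] = {w[0], w[1], w[2], w[3], w[4]};
    r[1] += a * (r[0] + r[2]);  // alpha step for the left odd neighbour
    r[3] += a * (r[2] + r[4]);  // alpha step for the right odd neighbour
    r[2] += b * (r[1] + r[3]);  // beta step for the even sample
    return {r[2], r[3]};
}
```

For the window {2, 0, 4, 8, 6} with a = -0.5 and b = 0.25, the two odd samples become -3 and 3, so the even sample stays at 4 and the returned pair is (4, 3).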

USING REGISTERS (CONT'D)

Intermediate results

GPU OPTIMIZATION STRATEGIES

EXPLOITING INSTRUCTION LEVEL PARALLELISM (ILP)

EXPLOITING INSTRUCTION LEVEL PARALLELISM (ILP)

Data copy from global memory to shared memory (forward transform)


EXPLOITING INSTRUCTION LEVEL PARALLELISM (ILP)

Each thread covers a {2, 2} region, which means 4× more memory transactions and 2× more lifting computations per thread.

int2 eIndex = make_int2(threadIdx.x<<1, threadIdx.y<<1); // 2*threadIdx.x, 2*threadIdx.y

// Lifting along x
#pragma unroll
for(p=0; p<2; p++)
{
    sharedMem[at3(eIndex.x + 1, eIndex.y+p, OPENEDDIMX, OPENEDDIMY)]
        += a*(sharedMem[at3(eIndex.x + 0, eIndex.y+p, OPENEDDIMX, OPENEDDIMY)]
            + sharedMem[at2(eIndex.x + 2, eIndex.y+p, OPENEDDIMX, OPENEDDIMY)]);

    sharedMem[at3(eIndex.x + 0, eIndex.y+p, OPENEDDIMX, OPENEDDIMY)]
        += b*(sharedMem[at2(eIndex.x - 1, eIndex.y+p, OPENEDDIMX, OPENEDDIMY)]
            + sharedMem[at3(eIndex.x + 1, eIndex.y+p, OPENEDDIMX, OPENEDDIMY)]);
}
__syncthreads();

// Lifting along y
#pragma unroll
for(p=0; p<2; p++)
    sharedMem[at3(eIndex.x+p, eIndex.y + 1, OPENEDDIMX, OPENEDDIMY)]
        += a*(sharedMem[at3(eIndex.x+p, eIndex.y + 0, OPENEDDIMX, OPENEDDIMY)]
            + sharedMem[at2(eIndex.x+p, eIndex.y + 2, OPENEDDIMX, OPENEDDIMY)]);
__syncthreads();

#pragma unroll
for(p=0; p<2; p++)
    sharedMem[at3(eIndex.x+p, eIndex.y + 0, OPENEDDIMX, OPENEDDIMY)]
        += b*(sharedMem[at2(eIndex.x+p, eIndex.y - 1, OPENEDDIMX, OPENEDDIMY)]
            + sharedMem[at3(eIndex.x+p, eIndex.y + 1, OPENEDDIMX, OPENEDDIMY)]);
__syncthreads();
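As a CPU reference for what one tile goes through, the sketch below runs a single-level 2D CDF 5/3 lifting in place, iterating over 2×2 cells in the same way the kernel assigns one cell per thread. It is a simplified illustration rather than the authors' kernel: every lifting pass is a separate sweep (standing in for a barrier), tile borders are mirrored instead of relying on a halo, the coefficient signs are assumed, and `mirror`, `for_each_cell`, and `lift2d_cdf53` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Reflect an out-of-range index back into [0, n-1] (symmetric extension).
inline int mirror(int i, int n) {
    if (i < 0) return -i;
    if (i >= n) return 2 * (n - 1) - i;
    return i;
}

// Visit every 2x2 cell; (ex, ey) is its even corner, p selects the row or
// column inside the cell, mirroring the kernel's unrolled "p" loop.
template <class F>
inline void for_each_cell(int W, int H, F f) {
    for (int ey = 0; ey < H; ey += 2)
        for (int ex = 0; ex < W; ex += 2)
            for (int p = 0; p < 2; ++p) f(ex, ey, p);
}

// In-place single-level 2D CDF 5/3 lifting on a W x H row-major tile.
inline void lift2d_cdf53(std::vector<float>& t, int W, int H) {
    assert(W >= 2 && H >= 2 && W % 2 == 0 && H % 2 == 0);
    assert(static_cast<int>(t.size()) == W * H);
    const float a = -0.5f, b = 0.25f;  // assumed signs
    auto px = [&](int x, int y) -> float& { return t[y * W + mirror(x, W)]; };
    auto py = [&](int x, int y) -> float& { return t[mirror(y, H) * W + x]; };
    // Each sweep below corresponds to one barrier-separated phase.
    for_each_cell(W, H, [&](int ex, int ey, int p) {  // x, alpha
        px(ex + 1, ey + p) += a * (px(ex, ey + p) + px(ex + 2, ey + p)); });
    for_each_cell(W, H, [&](int ex, int ey, int p) {  // x, beta
        px(ex, ey + p) += b * (px(ex - 1, ey + p) + px(ex + 1, ey + p)); });
    for_each_cell(W, H, [&](int ex, int ey, int p) {  // y, alpha
        py(ex + p, ey + 1) += a * (py(ex + p, ey) + py(ex + p, ey + 2)); });
    for_each_cell(W, H, [&](int ex, int ey, int p) {  // y, beta
        py(ex + p, ey) += b * (py(ex + p, ey - 1) + py(ex + p, ey + 1)); });
}
```

A constant tile is a quick sanity check: all detail coefficients vanish and only the samples at (even, even) positions keep the original value.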

EXPLOITING INSTRUCTION LEVEL PARALLELISM (ILP)

Intermediate results

GPU OPTIMIZATION STRATEGIES

EXPLOITING WARP SHUFFLES ON KEPLER GPUS

EXPLOITING WARP SHUFFLES

Data copy from global memory (texture bound) to registers

////////////////////////////////////////////////////////////////////////////////////////////////////
__global__
void __cdf53_smem_encode(const float* __restrict__ src, float* __restrict__ dst,
                         int dimx, int dimy, int ldx, int ldy, int level=1)
{
    //...
    //==================================================================
    // Multi pass reading here
    xIndex   = make_int2((thread1D&31)<<1, (thread1D>>5)<<0);
    yIndex   = make_int2((thread1D>>4)<<0, (thread1D&15)<<1);
    index_2d = make_int2(__mul24(blockIdx.x,CLOSEDDIMX)+xIndex.x,
                         __mul24(blockIdx.y,CLOSEDDIMY)+xIndex.y);
    //==================================================================
    register float r[WORKPERTHREADSY][WORKPERTHREADSX];
    // Read from global memory to registers
    #pragma unroll
    for(y=0; y<1; y++)
    {
        #pragma unroll
        for(x=0; x<2; x++)
        {
            xp = index_2d.x+x - HALO;
            yp = index_2d.y+y - HALO;
            xp = clamp_mirror(xp, 0, dimx-1);
            yp = clamp_mirror(yp, 0, dimy-1);
            r[y][x] = tex2D(texSrc, xp, yp);
        }
    }
    // Lifting
    // Multi pass writing here
}

Binding read-only global memory to a texture can make reads faster

EXPLOITING WARP SHUFFLES

Warp shuffle instructions

int __shfl(int var, int srcLane, int width=warpSize);
int __shfl_up(int var, unsigned int delta, int width=warpSize);
int __shfl_down(int var, unsigned int delta, int width=warpSize);
int __shfl_xor(int var, int laneMask, int width=warpSize);
// and many more

__shfl (Indexed any to any)

__shfl_up (Shift right to nth neighbor)

__shfl_down (Shift left to nth neighbor)

__shfl_xor (Butterfly exchange XOR)
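The data movement of these instructions can be emulated on the host by treating a warp as an array with one value per lane. The C++ sketch below is such an emulation for a full-width warp; the type `Warp` and the function names are hypothetical, and real shuffles are hardware instructions, not loops. Lanes whose source would fall outside the warp keep their own value, which matches the documented behaviour of `__shfl_up` and `__shfl_down`.

```cpp
#include <array>
#include <cassert>

constexpr int kWarp = 32;
using Warp = std::array<float, kWarp>;  // one value per lane

// Emulates __shfl_up(var, delta): lane i reads lane i - delta;
// the lowest `delta` lanes keep their own value.
inline Warp shfl_up(const Warp& v, int delta) {
    assert(delta >= 0 && delta < kWarp);
    Warp out = v;
    for (int lane = delta; lane < kWarp; ++lane) out[lane] = v[lane - delta];
    return out;
}

// Emulates __shfl_down(var, delta): lane i reads lane i + delta;
// the highest `delta` lanes keep their own value.
inline Warp shfl_down(const Warp& v, int delta) {
    assert(delta >= 0 && delta < kWarp);
    Warp out = v;
    for (int lane = 0; lane + delta < kWarp; ++lane) out[lane] = v[lane + delta];
    return out;
}

// Emulates __shfl_xor(var, mask): lane i exchanges with lane i ^ mask.
inline Warp shfl_xor(const Warp& v, int mask) {
    assert(mask >= 0 && mask < kWarp);
    Warp out{};
    for (int lane = 0; lane < kWarp; ++lane) out[lane] = v[lane ^ mask];
    return out;
}
```

Note that the signatures on the slide are the pre-CUDA 9 forms; newer toolkits deprecate them in favour of the `_sync` variants such as `__shfl_down_sync`, which take an additional lane mask as the first argument.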

EXPLOITING WARP SHUFFLES: FULL-SHUFFLE

Lifting on registers using warp shuffles

///////////////////////////////////////////////////////////////////////////////////
__global__
void __cdf53_smem_encode(const float* __restrict__ src, float* __restrict__ dst,
                         int dimx, int dimy, int ldx, int ldy, int level=1)
{
    //...

    // Lifting along x
    for(y=0; y<1; y++)
    {
        r[y][1] += a*(r[y][0] + __shfl_down(r[y][0], 1));
        r[y][0] += b*(r[y][1] + __shfl_up  (r[y][1], 1));
    }

    // Write temporarily to shared memory
    #pragma unroll
    for(y=0; y<1; y++)
    {
        #pragma unroll
        for(x=0; x<2; x++)
        {
            sharedMem[at3(xIndex.x+x, xIndex.y+y, OPENEDDIMX+2, OPENEDDIMY)] = r[y][x];
        }
    }
    __syncthreads();
    //...

    // Lifting along y: reload the tile column-wise
    #pragma unroll
    for(y=0; y<2; y++)
    {
        #pragma unroll
        for(x=0; x<1; x++)
            r[y][x] = sharedMem[at3(yIndex.x+x, yIndex.y+y, OPENEDDIMX+2, OPENEDDIMY)];
    }

    #pragma unroll
    for(x=0; x<1; x++)
    {
        r[1][x] += a*(r[0][x] + __shfl_down(r[0][x], 1));
        r[0][x] += b*(r[1][x] + __shfl_up  (r[1][x], 1));
    }
    __syncthreads();

    #pragma unroll
    for(y=0; y<2; y++)
    {
        #pragma unroll
        for(x=0; x<1; x++)
            sharedMem[at3(yIndex.x+x, yIndex.y+y, OPENEDDIMX+2, OPENEDDIMY)] = r[y][x];
    }
    __syncthreads();

    // Write from shared memory to global memory
    //...
}
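The x-direction lifting above can be checked on the CPU by emulating one warp that owns 64 consecutive samples, two per lane. The following C++ sketch does that and compares against a plain sequential lifting; it is an illustration under assumptions, not the authors' code. The function names are hypothetical, and the coefficient signs used in the usage note (a = -0.5, b = 0.25) are assumed. The emulated shuffles let edge lanes keep their own value, which happens to coincide with symmetric extension at both ends of the 64-sample row.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

constexpr int kLanes = 32;

// Lift 64 samples with one emulated warp: lane l owns s[2l] (r0) and s[2l+1] (r1).
inline void warp_lift_cdf53(std::vector<float>& s, float a, float b) {
    assert(static_cast<int>(s.size()) == 2 * kLanes);
    std::array<float, kLanes> r0{}, r1{}, nb{};
    for (int l = 0; l < kLanes; ++l) { r0[l] = s[2 * l]; r1[l] = s[2 * l + 1]; }
    // nb = __shfl_down(r0, 1): value from lane l + 1, last lane keeps its own.
    for (int l = 0; l < kLanes; ++l) nb[l] = (l + 1 < kLanes) ? r0[l + 1] : r0[l];
    for (int l = 0; l < kLanes; ++l) r1[l] += a * (r0[l] + nb[l]);
    // nb = __shfl_up(r1, 1): value from lane l - 1, first lane keeps its own.
    for (int l = 0; l < kLanes; ++l) nb[l] = (l >= 1) ? r1[l - 1] : r1[l];
    for (int l = 0; l < kLanes; ++l) r0[l] += b * (r1[l] + nb[l]);
    for (int l = 0; l < kLanes; ++l) { s[2 * l] = r0[l]; s[2 * l + 1] = r1[l]; }
}

// Sequential reference with symmetric extension at both ends (even length).
inline void seq_lift_cdf53(std::vector<float>& s, float a, float b) {
    const int n = static_cast<int>(s.size());
    assert(n >= 2 && n % 2 == 0);
    for (int i = 1; i < n; i += 2) {
        const float right = (i + 1 < n) ? s[i + 1] : s[i - 1];
        s[i] += a * (s[i - 1] + right);
    }
    for (int i = 0; i < n; i += 2) {
        const float left = (i > 0) ? s[i - 1] : s[i + 1];
        s[i] += b * (left + s[i + 1]);
    }
}
```

In the actual kernel the tile carries a halo, so the values produced by the edge lanes would normally be discarded rather than written out; the agreement with the sequential reference here only confirms that the shuffle pattern fetches the right neighbours.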

EXPLOITING WARP SHUFFLES: FULL-SHUFFLE

Intermediate results

GPU OPTIMIZATION STRATEGIES

COMBINING ALL: HYBRID APPROACH

COMBINING ALL: HYBRID APPROACH

SEMI-SHUFFLE

Lifting on registers using warp shuffles and ILP

FULL-REGISTER - NO SHARED MEMORY

Lifting on registers using warp shuffles and ILP

RESULTS

Comparison of different hybrid configurations

Different block and tile configurations of hybrid approach.

Running times (in msecs) of hybrid for CDF 9/7 DWT on an NVIDIA Kepler GPU, image size 1024×1024

Running times (in msecs) of hybrid for CDF 9/7 DWT on an NVIDIA Kepler GPU, image size 1920×1080

RESULTS

Comparison of various strategies

Running times (in msecs) of various strategies, on NVIDIA Kepler GPUs.

CDF 5/3 DWT

Haar DWT

RESULTS

Comparison of various strategies

Running times (in msecs) of various strategies, on NVIDIA Kepler GPUs.

CDF 5/3 DWT CDF 9/7 DWT

GPU OPTIMIZATION STRATEGIES

FUSED MULTI-LEVEL HAAR DWT

FUSED MULTI-LEVEL HAAR DWT

Single kernel launch

FUSED MULTI-LEVEL HAAR DWT

Single kernel launch

[Figure: Haar 2x2 and Haar 8x8 tiles, with encode_8Kernel and decode_8Kernel]


RESULTS AND DISCUSSIONS

RESULTS

Comparison with CPU implementations

Running times (in msecs) of CPU implementations (MATLAB and GSL) and various optimization strategies for CDF 9/7 DWT on an NVIDIA Kepler GPU (including data transfer time).

25× and 65× speedups compared to GSL and MATLAB, respectively

RESULTS

Comparison with other GPU implementations

Running times (in msecs) of other GPU DWT methods and the proposed methods for CDF 9/7 DWT

RESULTS

Comparison on multi-level GPU DWT

Running times (in msecs) of Enfedaque et al. and hybrid for multi-level CDF 9/7 DWT, on an NVIDIA Kepler GPU.

Running times (in msecs) of hybrid and mb-hybrid for multi-level Haar DWT, on an NVIDIA Kepler GPU.

2.7× speedup compared to hybrid


CONCLUSION

CONCLUSION

Contributions

• Various optimization strategies for 2D DWT on GPUs have been introduced:

• leveraging fast on-chip memories (shared memory and registers),

• warp shuffle instructions,

• thread- and instruction-level parallelism.

• Unlike other state-of-the-art GPU DWTs, hybrid parallelism that exploits both ILP and TLP together yields the best performance.

• The mixed-band layout of the Haar DWT outperformed the conventional DWT, especially for multi-level transformation.

RELATED PUBLICATIONS

T. M. Quan and W.-K. Jeong, “A fast Mixed-Band lifting wavelet transform on the GPU,” in IEEE International Conference on Image Processing, 2014, pp. 1238-1242.

T. M. Quan and W.-K. Jeong, “A fast discrete wavelet transform using hybrid parallelism on GPUs,” IEEE Transactions on Parallel and Distributed Systems, published online.

ACKNOWLEDGMENTS

• This work was partly supported by

• Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development),

• the R&D program of MOTIE/KEIT (No. 10054548, Development of Suspended Heterogeneous Nanostructure-based Hazardous Gas Microsensor System),

• Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2014R1A1A2058773).

• The authors would like to thank NVIDIA for their hardware support via the NVIDIA GPU Research Center Program.


THANK YOU

Do you have any questions?