Split Primitive on the GPU
Split Primitive on the GPU. Split Primitive Split can be defined as performing :: append(x,List[category(x)]) for each x, List holds elements of same.

Dec 17, 2015

Vivien Barton
Split Primitive on the GPU

Split Primitive

Split can be defined as performing append(x, List[category(x)]) for each element x, so that each List holds the elements of the same category together.

Split Sequential Algorithm

I. Count the number of elements falling into each bin
– for each element x of list L do
• histogram[category(x)]++ [Possible clashes on a category]

II. Find the starting index for each bin (prefix sum), with startIndex[0] = 0
– for each category m > 0 do
• startIndex[m] = startIndex[m-1] + histogram[m-1]

III. Assign each element to the output [initialize localIndex[m] = 0 for every category m]
– for each element x of list L do
• itemIndex = localIndex[category(x)]++ [Possible clashes on a category]
• globalIndex = startIndex[category(x)]
• outArray[globalIndex + itemIndex] = x
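The three steps above can be sketched in plain C++ as follows. This is a minimal host-side sketch, not the authors' code: the function name `splitSequential` is hypothetical, and `category` is assumed to be a lookup table indexed by element value, mirroring `category[curElement]` in the kernels of this deck.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential split: groups the elements of `in` by category.
// `category` is an assumed lookup table (element value -> bin index).
std::vector<int> splitSequential(const std::vector<int>& in,
                                 const std::vector<int>& category,
                                 std::size_t numBins) {
    // Step I: count the number of elements falling into each bin.
    std::vector<std::size_t> histogram(numBins, 0);
    for (int x : in) {
        assert(static_cast<std::size_t>(category[x]) < numBins);
        ++histogram[category[x]];
    }

    // Step II: an exclusive prefix sum gives the start index of each bin.
    std::vector<std::size_t> startIndex(numBins, 0);
    for (std::size_t m = 1; m < numBins; ++m)
        startIndex[m] = startIndex[m - 1] + histogram[m - 1];

    // Step III: place each element at its bin's start plus a running offset.
    std::vector<std::size_t> localIndex(numBins, 0);
    std::vector<int> out(in.size());
    for (int x : in) {
        const std::size_t c = category[x];
        const std::size_t itemIndex = localIndex[c]++;
        out[startIndex[c] + itemIndex] = x;
    }
    return out;
}
```

The two increments (`histogram[...]++` and `localIndex[...]++`) are the places where parallel threads would clash on a category.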

Split Operation in Parallel

• To parallelize the split algorithm above, we need a clash-free method for building the histogram on the GPU
• This can be achieved on a parallel machine using one of the following two methods
– Personal histograms for each processor, followed by merging the histograms
– Atomic operations on the histogram array(s)
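The first method (personal histograms followed by a merge) can be illustrated with a small C++ sketch. It is an assumption-laden illustration: the "processors" are simulated by an ordinary loop, the name `histogramByMerging` is hypothetical, and the input is taken to be a list of category indices. The point is only that no two processors ever write the same counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds one private histogram per simulated processor, then merges them.
// Processor p handles a contiguous chunk of `cats`.
std::vector<unsigned> histogramByMerging(const std::vector<unsigned>& cats,
                                         std::size_t numBins,
                                         std::size_t numProcs) {
    assert(numProcs > 0);
    std::vector<std::vector<unsigned>> personal(
        numProcs, std::vector<unsigned>(numBins, 0));
    const std::size_t chunk = (cats.size() + numProcs - 1) / numProcs;

    // Counting phase: each processor touches only its own histogram.
    for (std::size_t p = 0; p < numProcs; ++p) {
        const std::size_t begin = p * chunk;
        const std::size_t end = std::min(cats.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i) {
            assert(cats[i] < numBins);
            ++personal[p][cats[i]];
        }
    }

    // Merge phase: sum the personal histograms bin by bin.
    std::vector<unsigned> merged(numBins, 0);
    for (std::size_t p = 0; p < numProcs; ++p)
        for (std::size_t b = 0; b < numBins; ++b)
            merged[b] += personal[p][b];
    return merged;
}
```

The cost of this approach is memory: it needs numProcs × numBins counters, which is what limits the per-thread variant on the GPU.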

Global Memory Atomic Split
• Code :

__global__ void globalHist ( unsigned int *histogram, int *gArray, int *category )
{
    int curElement;
    int curCategory;

    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        // Each block covers blockDim.x * ELEMENTS_PER_THREAD consecutive elements
        curElement = gArray[blockIdx.x * blockDim.x * ELEMENTS_PER_THREAD + ( i * blockDim.x ) + threadIdx.x];
        curCategory = category[curElement];
        // atomicAdd keeps counting past any limit; atomicInc wraps to 0 at its bound
        atomicAdd(&histogram[curCategory], 1u);
    }
}

• Global memory is too slow to access
• Single histogram in global memory (number of clashes is data dependent)
• Overuse of shared memory limits the maximum number of categories to 64

Non-Atomic Approach (He et al.)
• A histogram for each thread
• Combine all the histograms to get the final histogram

__global__ void nonAtomicHistogram ( int *gArray, int *category, unsigned int *tHistGlobal )
{
    int curElement, curCategory;
    const int tx = threadIdx.x;
    // One private histogram of NUMBINS counters per thread, so no clashes occur
    __shared__ unsigned int tHist[NUMBINS * NUMTHREADS];

    for ( int i = 0; i < NUMBINS; i++ )
        tHist[tx * NUMBINS + i] = 0;

    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
        curCategory = category[curElement];
        tHist[tx * NUMBINS + curCategory]++;
    }

    // Write the per-thread histograms to global memory, bin-major, for merging
    for ( int i = 0; i < NUMBINS; i++ )
        tHistGlobal[i * NUMBLOCKS * NUMTHREADS + blockIdx.x * NUMTHREADS + threadIdx.x] = tHist[tx * NUMBINS + i];
}

Shared Memory Atomic

• Global Atomic does not use the fast shared memory available
• Non-Atomic approach overuses the shared memory
• Incorporating atomic operations on fast shared memory may perform better than the above two approaches
• Shared Memory Atomic can be performed using one of the following techniques
– H/W Atomic Operations
– Clash Serial Atomic Operations
– Thread Serial Atomic Operations

SM Atomic :: H/W Atomic
• Latest GPUs (G2xx and later) support atomic operations on the shared memory

__global__ void histkernel ( unsigned int *blockHists, int *gArray, unsigned int *category )
{
    extern __shared__ unsigned int s_Hist[];   // NUMBINS counters, sized at kernel launch
    unsigned int curElement, curCategory;

    // Clear the block-level histogram
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        s_Hist[pos] = 0;
    __syncthreads();

    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
        curCategory = category[curElement];
        atomicAdd(&s_Hist[curCategory], 1u);   // hardware atomic on shared memory
    }
    __syncthreads();

    // Store this block's histogram to global memory, bin-major
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        blockHists[blockIdx.x + gridDim.x * pos] = s_Hist[pos];
}

SM Atomic :: Thread Serial

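• Threads can be serialized within a ‘warp’ in order to avoid clashes.

………….
for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
    curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
    curCategory = category[curElement];

    // The lanes of a warp take turns, so only one of them updates at a time.
    // Assumption: s_Hist is this warp's own histogram, since threads of
    // different warps are not serialized against each other here.
    for ( int lane = 0; lane < WARPSIZE; lane++ )
        if ( threadIdx.x % WARPSIZE == lane )
            s_Hist[curCategory]++;
}
………….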

SM Atomic :: Clash Serial
• Each thread writes to the common histogram of the block until it succeeds.
• A thread is tagged by its thread ID in order to find out whether it successfully updated the histogram

// Main
for ( int pos = globalTid; pos < NUMELEMENTS; pos += numThreads ) {
    unsigned int curElement = gArray[pos];
    unsigned int curCategory = category[curElement];
    addData256(s_Hist, curCategory, threadTag);
}

// Clash serializing function for a warp
__device__ void addData256 ( volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag )
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;   // drop the previous writer's tag (top 5 bits)
        count = threadTag | ( count + 1 );         // incremented count, tagged with this thread
        s_WarpHist[data] = count;                  // on a clash, only one thread's write survives
    } while ( s_WarpHist[data] != count );         // retry if another thread's write won
}

Comparison of Histogram Methods for 16 Million Elements

Split using Shared Atomic

• Shared Atomic used to build Block-level histograms

• Parallel Prefix Sum used to compute starting index

• Split is performed by each block for the same set of elements it used in Step 1
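The three stages above can be mimicked on the host to see how the pieces fit together. The following C++ sketch makes several assumptions: blocks are simulated by a loop, each block owns a contiguous chunk of the input, the block histograms are stored bin-major (as in `blockHists[blockIdx.x + gridDim.x * pos]`), and the name `blockSplitOrder` is hypothetical. It returns, for each output position, the index of the input element placed there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// cats[i] is the category of input element i.
// Returns order[k] = index of the input element written to output slot k.
std::vector<std::size_t> blockSplitOrder(const std::vector<unsigned>& cats,
                                         std::size_t numBins,
                                         std::size_t numBlocks) {
    assert(numBlocks > 0);
    const std::size_t n = cats.size();
    const std::size_t chunk = (n + numBlocks - 1) / numBlocks;

    // Step 1: block-level histograms, laid out as hist[bin * numBlocks + block].
    std::vector<std::size_t> hist(numBins * numBlocks, 0);
    for (std::size_t b = 0; b < numBlocks; ++b)
        for (std::size_t i = b * chunk; i < std::min(n, (b + 1) * chunk); ++i) {
            assert(cats[i] < numBins);
            ++hist[cats[i] * numBlocks + b];
        }

    // Step 2: exclusive prefix sum over the whole table gives the output
    // offset at which each (bin, block) pair starts writing.
    std::vector<std::size_t> start(hist.size(), 0);
    for (std::size_t k = 1; k < hist.size(); ++k)
        start[k] = start[k - 1] + hist[k - 1];

    // Step 3: each block re-reads the same chunk and scatters its elements.
    std::vector<std::size_t> order(n);
    for (std::size_t b = 0; b < numBlocks; ++b)
        for (std::size_t i = b * chunk; i < std::min(n, (b + 1) * chunk); ++i)
            order[start[cats[i] * numBlocks + b]++] = i;
    return order;
}
```

Because the prefix sum runs bin-major, all of bin 0 (across every block) precedes bin 1, and within a bin the blocks write in block order.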

Comparison of Split Methods

• Global Atomic suffers for a low number of categories
• Non-Atomic can handle a maximum of 64 categories in one pass (multiple passes for more categories)
• Shared Atomic performs better than the other 2 GPU methods and the CPU for a wide range of categories
• Shared memory limits the maximum number of bins to 2048 (for power-of-2 bin counts)

Multi Level Split

• Bins beyond 2K are broken into sub-bins
• A hierarchy of bins is created and the split is performed at each level for the different sub-bins
• The number of splits to be performed grows exponentially
• With 2 levels we can perform a split for up to 4 million bins (2048 × 2048)

[ 8 bits | 8 bits | 8 bits | 8 bits ]
32-bit bin broken into 4 sub-bins of 8 bits
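A two-level split of this kind can be sketched on the host as follows. This is only an illustration under stated assumptions: the bin index is taken to be the low 2 × `bitsPerLevel` bits of the key, the first level splits on the upper half and the second level splits every resulting bin on the lower half, and the names `splitRange` and `twoLevelSplit` are hypothetical. The function returns how many splits were performed, which shows the growth with the number of first-level bins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits keys[begin, end) in place on `bits` bits starting at bit `shift`.
// Returns the bin boundaries: bin d occupies [bounds[d], bounds[d + 1]).
std::vector<std::size_t> splitRange(std::vector<std::uint32_t>& keys,
                                    std::size_t begin, std::size_t end,
                                    unsigned shift, unsigned bits) {
    assert(bits > 0 && bits < 32 && shift < 32);
    assert(begin <= end && end <= keys.size());
    const std::size_t numBins = std::size_t{1} << bits;
    const std::uint32_t mask = static_cast<std::uint32_t>(numBins - 1);

    // Histogram stored one slot to the right, then turned into boundaries.
    std::vector<std::size_t> bounds(numBins + 1, 0);
    for (std::size_t i = begin; i < end; ++i)
        ++bounds[((keys[i] >> shift) & mask) + 1];
    bounds[0] = begin;
    for (std::size_t d = 0; d < numBins; ++d)
        bounds[d + 1] += bounds[d];

    // Scatter from a copy of the range back into the original array.
    const std::vector<std::uint32_t> tmp(keys.begin() + begin, keys.begin() + end);
    std::vector<std::size_t> cursor(bounds.begin(), bounds.end() - 1);
    for (std::uint32_t k : tmp)
        keys[cursor[(k >> shift) & mask]++] = k;
    return bounds;
}

// Level 1 splits on the upper sub-bin, level 2 splits each level-1 bin on
// the lower sub-bin. Returns the number of splits: 1 + 2^bitsPerLevel.
std::size_t twoLevelSplit(std::vector<std::uint32_t>& keys, unsigned bitsPerLevel) {
    const std::vector<std::size_t> top =
        splitRange(keys, 0, keys.size(), bitsPerLevel, bitsPerLevel);
    std::size_t splits = 1;
    for (std::size_t d = 0; d + 1 < top.size(); ++d) {
        splitRange(keys, top[d], top[d + 1], 0, bitsPerLevel);
        ++splits;
    }
    return splits;
}
```

With 11 bits per level (2048 bins), two levels cover 2^22 = 4,194,304 bins, at the price of one second-level split per first-level bin.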

Results for Bins up to 4 Million

Multi Level Split performed on GTX280. Bins from 4K to 512K are handled with 2 passes; the results for 1M and 2M bins with 1M elements are computed using 3 passes for better performance.

MLS :: Right to Left
• Using an iterative approach requires a constant number of splits at each level
• Highly scalable due to its iterative nature; the ideal number of bins can be chosen for best performance
• Dividing the bins from right to left requires preserving the order of elements from the previous pass
• The complete list of elements is re-arranged at each level
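The right-to-left scheme can be sketched in C++ as a sequence of order-preserving splits over the whole list, one per 8-bit sub-bin, starting from the least significant bits. This is a host-side sketch under assumptions: the names `stableSplit8` and `splitRightToLeft` are hypothetical, and the sequential scatter loop stands in for whatever ordered insertion the GPU version uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One split of the complete list on the 8 bits starting at bit `shift`.
// Elements that fall into the same bin keep their relative input order.
std::vector<std::uint32_t> stableSplit8(const std::vector<std::uint32_t>& in,
                                        unsigned shift) {
    assert(shift < 32);
    std::size_t start[257] = {0};
    for (std::uint32_t k : in)
        ++start[((k >> shift) & 0xFFu) + 1];          // histogram, offset by one
    for (std::size_t d = 0; d < 256; ++d)
        start[d + 1] += start[d];                     // exclusive prefix sum

    std::vector<std::uint32_t> out(in.size());
    for (std::uint32_t k : in)                        // in-order scatter
        out[start[(k >> shift) & 0xFFu]++] = k;
    return out;
}

// Right-to-left multi-level split on the low `keyBits` bits of each key.
// Exactly one split per level, regardless of the number of bins.
std::vector<std::uint32_t> splitRightToLeft(std::vector<std::uint32_t> keys,
                                            unsigned keyBits) {
    assert(keyBits % 8 == 0 && keyBits <= 32);
    for (unsigned shift = 0; shift < keyBits; shift += 8)
        keys = stableSplit8(keys, shift);
    return keys;
}
```

Calling `splitRightToLeft(keys, 32)` orders the list on all 32 bits in four passes, while a smaller `keyBits` orders it on only the low bits. If a pass did not preserve the order left by the previous pass, the earlier levels would be undone.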

Ordered Atomic
• Atomic operations perform safe reads/writes by serializing the clashes, but do not guarantee the required order of operations
• Ordered atomic serializes the clashes in a fixed order provided by the user
• In case of a clash at higher levels in Right-to-Left Split, elements should be inserted in the order of their existing position in the list

Split on 4 Billion bins
• Right-to-Left split can be used for splitting integers into 4 billion bins ( sorting? )
• Integers can be sorted to the desired number of bits ( keys can be 8, 16, 24 or 32 bits long, and 64 bits too )

SplitSort Comparison with other GPU Sorting Implementations

Sorting 64 Bit numbers on the GPU

Conclusion
• Various histogram methods implemented on shared memory
• Split operation now handles millions and billions of bins using the Left-to-Right and Right-to-Left methods of Multi-Level Split
• Shared memory split operation is faster and more scalable than the previous implementation (He et al.)
• Fastest sorting achieved by extending split to billions of bins
• Variable bit-length sorting is helpful with keys of varying size ( bit-length )