FAST MAP PROJECTION ON CUDA

Yanwei Zhao

Institute of Computing Technology

Chinese Academy of Sciences

July 29, 2011

Outline

Map Projection

Establishes the relationship between two different coordinate systems:
  geographical coordinates → planar Cartesian map-space coordinates
Involves complicated and time-consuming arithmetic operations.
A fast answer with the desired accuracy is preferred over a slow exact answer.
It needs to be accelerated for interactive GIS scenarios.

GPGPU (general-purpose computing on graphics processing units)

GPGPU is a young area of research.
Advantages of the GPU:
  Flexibility
  Processing power
  Low cost
GPGPU targets applications other than 3D graphics: the GPU accelerates the critical path of the application.

CUDA (Compute Unified Device Architecture)

NVIDIA's parallel computing architecture
A C-based programming language and development toolkit
Advantages:
  The programmer can focus on the important issues rather than on an unfamiliar language.
  Efficient parallel code can be written without going through graphics APIs.

The characteristics of map projection

A huge number of coordinates to handle
Complex arithmetic operations
The requirement of a real-time response

Our proposal

Use the new CUDA technology on the GPU.
Take the Universal Transverse Mercator (UTM) projection as an example.
Performance:
  6x to 8x speedup (including data transfer time)
  70x to 90x speedup (excluding data transfer time)
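To give a feel for the per-coordinate arithmetic, here is a minimal Python sketch of a forward transverse Mercator projection with UTM-style constants. It is not the authors' kernel: it assumes a spherical Earth (real UTM uses an ellipsoid and a longer series expansion), handles only the northern hemisphere (no false northing), and the names `project_point` and `EARTH_RADIUS_M` are hypothetical. The point it illustrates is that each coordinate is projected independently of all others, which is what makes the problem data-parallel.

```python
import math

EARTH_RADIUS_M = 6371000.0   # assumption: spherical Earth, mean radius
K0 = 0.9996                  # UTM scale factor on the central meridian
FALSE_EASTING_M = 500000.0   # UTM false easting


def central_meridian_deg(zone):
    """Central meridian of a UTM zone (zones 1..60, each 6 degrees wide)."""
    return zone * 6 - 183


def project_point(lon_deg, lat_deg, zone):
    """Spherical transverse Mercator forward projection of one coordinate."""
    lam = math.radians(lon_deg - central_meridian_deg(zone))
    phi = math.radians(lat_deg)
    b = math.cos(phi) * math.sin(lam)
    x = 0.5 * K0 * EARTH_RADIUS_M * math.log((1 + b) / (1 - b))
    y = K0 * EARTH_RADIUS_M * math.atan2(math.tan(phi), math.cos(lam))
    return x + FALSE_EASTING_M, y


def project_feature(coords, zone):
    """Serial loop over one feature; every iteration is independent."""
    return [project_point(lon, lat, zone) for lon, lat in coords]
```

A point on the central meridian of its zone lands exactly on the false easting, which gives a quick sanity check for the formula.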

Algorithm framework

CPU:
  1. Open the shapefile.
  2. Read the coordinates of all features.
  3. Copy the data from the CPU to GPU global memory.
GPU:
  4. Execute the kernel function (blocks 0 to m, each with threads 0 to n).
CPU:
  5. Copy the result from the GPU to the CPU.
  6. Free the device memory.
  7. Save or display the result.

Task assignment for the kernel: striped partitioning or matrix distribution.

Striped partitioning

Define the number of blocks and threads: Block_num, Thread_num
CUDA built-in variables: gridDim, blockDim
Number of geographic features: fn
Number of features handled by each block: fn/gridDim.x

[Figure: blocks 0 to m mapped onto features 0, 1, ..., m, m+1, m+2, ..., 2m (the relationship between blocks and features); threads 0 to n mapped onto coordinates 0 to n of a feature (the relationship between threads and coordinates).]

Striped partitioning

Outer loop: blocks and features
  Block → Feature[i]:       i = blockIdx.x * (fn / gridDim.x)    (1)
  Block → next Feature[k]:  k = i + fn / gridDim.x               (2)
Inner loop: threads and coordinates
  Thread → coord[j]:        j = threadIdx.x
  Thread → next coord[k]:   k = j + Thread_num
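The two loops can be emulated on the CPU to see which (feature, coordinate) pairs each thread would touch. The Python sketch below is an illustration under stated assumptions, not the authors' kernel: it reads formula (2) as the exclusive end of a block's contiguous run of features, and it lets the last block pick up any remainder when fn is not divisible by the number of blocks. The function name `striped_assignment` is hypothetical.

```python
def striped_assignment(feature_sizes, block_num, thread_num):
    """Return {(block, thread): [(feature, coord), ...]} for a striped split.

    feature_sizes[i] is the number of coordinates in feature i.
    """
    fn = len(feature_sizes)
    per_block = fn // block_num              # fn / gridDim.x, integer division
    work = {}
    for block in range(block_num):           # plays the role of blockIdx.x
        first = block * per_block            # formula (1)
        end = first + per_block              # formula (2), read as exclusive end
        if block == block_num - 1:
            end = fn                         # assumption: last block takes the remainder
        for thread in range(thread_num):     # plays the role of threadIdx.x
            items = []
            for feature in range(first, end):
                # j = threadIdx.x, then j = j + Thread_num until the feature ends
                for coord in range(thread, feature_sizes[feature], thread_num):
                    items.append((feature, coord))
            work[(block, thread)] = items
    return work
```

Within one feature, neighbouring threads handle neighbouring coordinates, and each thread then jumps ahead by Thread_num, so every coordinate is visited exactly once.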

Matrix distribution

Define the number of blocks and threads: grid(br, bc), block(tr, tc)

Each block runs k features, where:

$$k = \frac{\mathit{fn} + \mathrm{gridDim.x} \cdot \mathrm{gridDim.y} - 1}{\mathrm{gridDim.x} \cdot \mathrm{gridDim.y}} \qquad (1)$$

Feature[i]:

$$i \ge \mathrm{blockIdx.y} \cdot \mathrm{gridDim.x} \cdot k \qquad (2)$$

$$i < \mathrm{blockIdx.y} \cdot \mathrm{gridDim.x} \cdot k + k \qquad (3)$$

Matrix distribution

Each thread runs s coordinates, where:

$$s = \frac{\mathrm{feature}[i].\mathrm{size} + \mathrm{blockDim.x} \cdot \mathrm{blockDim.y} - 1}{\mathrm{blockDim.x} \cdot \mathrm{blockDim.y}} \qquad (1)$$

coord[j]:

$$j \ge \mathrm{threadIdx.y} \cdot \mathrm{blockDim.x} \cdot s$$

$$j < \mathrm{threadIdx.y} \cdot \mathrm{blockDim.x} \cdot s + s$$
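The chunk sizes k and s are both ceiling divisions, so the same helper can sketch how features are spread over a 2D grid of blocks and how coordinates are spread over a 2D block of threads. The Python below is a hedged CPU-side illustration: the names `ceil_div` and `matrix_ranges` are hypothetical, and it assumes the usual row-major linear index y * dim_x + x so that every block (or thread) receives a distinct range. The slide's bounds, as far as they can be read, show only the y index and the x dimension, so that linearisation is an assumption rather than the authors' stated formula.

```python
def ceil_div(n, d):
    """Integer ceiling of n / d, written as (n + d - 1) / d on the slide."""
    return (n + d - 1) // d


def matrix_ranges(total, dim_x, dim_y):
    """Map each (x, y) index to a half-open range [start, stop) of items.

    For blocks: total = fn, dim_x = gridDim.x, dim_y = gridDim.y.
    For threads: total = feature size, dim_x = blockDim.x, dim_y = blockDim.y.
    """
    chunk = ceil_div(total, dim_x * dim_y)   # k for blocks, s for threads
    ranges = {}
    for y in range(dim_y):
        for x in range(dim_x):
            linear = y * dim_x + x           # assumption: row-major linear id
            start = min(linear * chunk, total)
            stop = min(start + chunk, total)
            ranges[(x, y)] = (start, stop)
    return ranges
```

Clamping to `total` matters because the ceiling division can leave the last ranges short or empty when the item count is not a multiple of the grid or block size.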

Experiment environment

Hardware:
  CPU: Intel Core 2 Duo E8500 at 3.18 GHz with 2 GB of main memory
  GPU: NVIDIA GeForce 9800 GTX+ graphics card with 512 MB of memory, 128 CUDA cores and 16 multiprocessors
Software:
  Microsoft Windows XP Pro SP2
  Microsoft Visual Studio 2005
  NVIDIA driver 2.2, CUDA SDK 2.2 and CUDA Toolkit 2.2

The degree of data parallelism

[Chart: total CPU time, split into initialization and file-reading time and serial projection time]

Map projection can achieve more than 90 percent parallelism.
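One way to read the 90 percent figure is through Amdahl's law, which the slides do not name explicitly: if a fraction p of the total run time can be parallelised and that part is accelerated by a factor a, the overall speedup is 1 / ((1 - p) + p / a). The short Python sketch below is only a worked illustration; p = 0.9 is taken from the slide, while the acceleration factor passed in is an arbitrary example value.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# With 90 percent of the time parallelisable, the overall speedup can
# never exceed 1 / (1 - 0.9) = 10x, however fast the parallel part becomes.
upper_bound = amdahl_speedup(0.9, float("inf"))
example = amdahl_speedup(0.9, 80.0)   # 80x is an illustrative value only
```

With p = 0.9 the bound is 10x, so initialization and file reading quickly become the limiting cost once the projection itself is fast.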

Comparing with CPU

Configuration: Block_num = 64, Thread_num = 512
Total time = map projection time + data transfer time
Considering the total time, the speedup is 6x to 8x.
Comparing only the map projection time, the speedup is 70x to 90x.
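The gap between the two speedup ranges indicates how much of the GPU's total time goes to data transfer. The arithmetic can be sketched as follows, assuming both speedups are measured against the same CPU projection time (normalised to 1 here); the function name `transfer_share` is hypothetical, and 7x and 80x are illustrative mid-range picks, not measured values.

```python
def transfer_share(total_speedup, kernel_speedup):
    """Fraction of the GPU total time spent on data transfer.

    CPU time is normalised to 1, so GPU total time is 1 / total_speedup
    and GPU projection (kernel) time is 1 / kernel_speedup.
    """
    gpu_total = 1.0 / total_speedup
    gpu_kernel = 1.0 / kernel_speedup
    return (gpu_total - gpu_kernel) / gpu_total


share = transfer_share(7.0, 80.0)   # illustrative mid-range values
```

Under these assumptions roughly nine tenths of the GPU-side time is transfer rather than computation, which is why the end-to-end gain is an order of magnitude smaller than the kernel-only gain.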

The performance of different task assignments

Striped partitioning: Block_num = 64, Thread_num = 512
Matrix distribution: dim_grid(32,32) = 32*32 blocks, dim_block(256,256) = 256*256 threads

Striped: 6x to 8x
Matrix: 4x to 6x

The performance of different task assignments

[Figure: global-memory access patterns. Striped: blocks 0 to m-1, each with threads 0 to n-1, map onto consecutive global-memory elements 0, 1, ..., n-1, n, n+1, ..., 2n, ..., mn, ..., mn+n. Matrix: Grid 0 holds blocks B(0,0) to B(m,m), each block holds threads t(0,0) to t(n,n), and one row of the layout spans blockDim.x*gridDim.x elements of global memory.]

Striped: all threads in a block access consecutive memory.
Matrix: it can only ensure that each row of threads in a block handles consecutive data.
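The difference between the two access patterns can be checked by listing the global-memory indices that the threads of one block touch. The Python sketch below is an illustration only: it assumes element b * Thread_num + t for thread t of block b in the striped case and a row-major 2D layout with a row pitch of blockDim.x * gridDim.x in the matrix case, which matches the labels in the figure but is not taken from the authors' code. All function names are hypothetical.

```python
def striped_addresses(block, thread_num):
    """Indices read by one block: thread t reads element block*thread_num + t."""
    return [block * thread_num + t for t in range(thread_num)]


def matrix_addresses(block_x, block_y, block_dim_x, block_dim_y, grid_dim_x):
    """Indices read by one 2D block, one list per row of threads."""
    pitch = block_dim_x * grid_dim_x   # elements in one full row of the grid
    rows = []
    for ty in range(block_dim_y):
        row_start = (block_y * block_dim_y + ty) * pitch
        rows.append([row_start + block_x * block_dim_x + tx
                     for tx in range(block_dim_x)])
    return rows


def is_consecutive(addresses):
    """True when every index is exactly one more than the previous one."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))
```

In the striped layout the whole block forms one consecutive run, whereas in the matrix layout each row of threads is consecutive but successive rows are separated by the pitch, which is the pattern the slide offers as the reason for the lower matrix speedup.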

Conclusion and future work

We implemented a fast map projection method on CUDA-enabled GPUs.
It achieves a high speedup compared with the CPU-based method.
The power of modern GPUs can considerably speed up computation in the field of geoscience, for example:
  DEM-based spatial interpolation
  Raster-based spatial analysis
Future work: GPU implementations of other GIS applications.

Thank you! Q & A

Yanwei Zhao

Institute of Computing Technology

Contact: zhaoyanwei@ict.ac.cn
