Transcript
Page 1:

GPU Computing and CUDA

Cijo Thomas, Janith Kaiprath Valiyalappil

CS566 Parallel Programming, Spring '13


Page 2:

Introduction

GPU vs CPU

A GPU has hundreds of cores, compared to 4-8 cores for a CPU.
CPU: executes a single thread very quickly.
GPU: executes many concurrent threads, each relatively slowly; traditionally excels at embarrassingly parallel tasks.
GPU and CPU have complementary properties.

Page 3:

GPGPU
Solve general-purpose problems using the GPU.
Core idea is to map data-parallel algorithms onto equivalent graphics concepts.
Had to make heavy use of graphics APIs; traditionally a cumbersome task.
Never gained prominence among developers.
Until...

Page 4:

CUDA - Introduction
Compute Unified Device Architecture.
Released in 2006 by NVIDIA.
Easy programming of the GPU using a C extension.
Transparently scales, harnessing the ever-growing power of NVIDIA GPUs.
Programs are portable to newer GPU releases.

Page 5:

CUDA - Architecture
Scalable array of multi-threaded Streaming Multiprocessors (SMs).
Each SM consists of multiple Streaming Processors (SPs).
Inter-thread communication using shared memory.
CUDA terms: Host = CPU, Device = GPU.
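Not part of the original slides: a minimal host-side sketch that queries how many SMs the current GPU has, using the CUDA runtime call cudaGetDeviceProperties; the program and its output format are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0 (the default GPU)
    printf("GPU: %s\n", prop.name);
    printf("SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}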

Page 6:

CUDA - Architecture (Cntd.)

[Nickolls,ACM,2008]

Page 7:

CUDA - Thread Hierarchy
Threads are grouped into thread blocks, and execute concurrently on a single SM.
Thread blocks are grouped into grids, and are executed independently and in parallel.
SIMT: Single Instruction, Multiple Thread.
Thread creation, management, scheduling and execution occur in groups of 32 threads called warps.
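To make the hierarchy concrete, here is a minimal kernel sketch (not from the slides; the kernel name and launch configuration are illustrative) showing how a thread combines blockIdx, blockDim and threadIdx into one global index.

// Each thread handles one element; the guard covers the last, partially full block.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index within the grid
    if (i < n)
        data[i] *= alpha;
}

// Example launch: a grid of ceil(n/256) blocks, 256 threads (8 warps of 32) per block.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);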

Page 8:

CUDA - Thread Hierarchy

[Nickolls,ACM,2008]

Page 9:

CUDA - Memory Hierarchy
Each thread has its own local memory, apart from register and stack space (physically located off-chip in device memory).
Next in the hierarchy is low-latency shared memory between threads in a thread block.
Then there is high-latency global shared memory.
All of the above memories are physically and logically separate from system memory.
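A small kernel sketch of these memory spaces (illustrative, not from the slides): per-thread values live in registers, the __shared__ tile is the per-block low-latency memory, and the data pointer refers to off-chip global device memory. It assumes a launch with 256 threads per block and an array length that is a multiple of 256.

__global__ void reverseWithinBlock(float *data) {
    __shared__ float tile[256];                      // shared memory: visible to one thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 'i' and 'v' live in per-thread registers
    float v = data[i];                               // read from off-chip global device memory
    tile[threadIdx.x] = v;                           // stage the value on-chip
    __syncthreads();                                 // make the whole tile visible to the block
    data[i] = tile[blockDim.x - 1 - threadIdx.x];    // write a neighbour's value back to global memory
}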

Page 10:

CUDA - Memory Hierarchy

[Source: Nvidia]

Page 11:

CUDA - Memory Operations
cudaMalloc and cudaFree are used for allocating and releasing memory on the device.
cudaMemcpy is used to transfer data in two directions:
a) host to device memory - cudaMemcpyHostToDevice
b) device to host memory - cudaMemcpyDeviceToHost
Device memory here refers to global shared memory, not per-thread-block shared memory.
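A hedged host-side sketch of these calls (error handling omitted; the helper name copyRoundTrip is illustrative):

#include <cuda_runtime.h>

void copyRoundTrip(float *h_data, int n) {
    size_t bytes = n * sizeof(float);
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);                          // allocate global device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read and write d_data here ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);                                            // release the device memory
}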

Page 12:

CUDA - Programming Model
CUDA programs are heterogeneous CPU+GPU co-processing systems.
Use the CPU core for serial portions, the GPU for parallel portions.
A CUDA kernel can be a simple function or a program on its own.
The GPU needs thousands of threads for full efficiency.
CUDA threads are extremely lightweight, with little or no overhead in creation/switching.

Page 13:

CUDA - Program Structure

1. Allocate memory on the device (GPU).
2. Copy data from system memory into device memory.
3. Invoke the CUDA kernel, which processes the data.
4. Copy results back from device memory to system memory.
An end-to-end sketch of these steps follows below.
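A minimal end-to-end sketch of this four-step structure, using the classic vector-add kernel (the names vecAdd, h_a/d_a etc. are illustrative, not taken from the slides; error checking is omitted):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);                          // 1. allocate device memory
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);      // 2. copy inputs to the device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);       // 3. invoke the kernel
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);      // 4. copy the result back
    printf("c[0] = %f\n", h_c[0]);                            // expect 3.000000
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}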

Page 14:

CUDA Programming

[Kirk,2010]

Page 15:

CUDA Example

[Kirk,2010]

Page 16:

Reduction in CUDA

[Nickolls,ACM,2008]
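The reduction figure from Nickolls et al. is not reproduced in this transcript; below is a minimal shared-memory tree-reduction kernel in the same spirit (illustrative, assumes a launch with 256 threads per block, a power of two):

__global__ void reduceSum(const float *in, float *blockSums, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;             // each thread loads one element
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // halve the active threads each step
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];  // one partial sum per thread block
}
// The host (or a second kernel launch) then sums the per-block partial results.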

Page 17:

Reduction in CUDA (Cntd.)

[Nickolls,ACM,2008]

Page 18:

CUDA Compilation

[Kirk,2010]

Page 19:

CUDA 5 – Latest CUDA Release
CUDA 5 is the latest release of CUDA, released in October 2012.
Kepler architecture vs. Fermi architecture.

Page 20:

CUDA 5 – Dynamic Parallelism
A GPU thread can launch parallel GPU kernels (see the sketch below).

[Harris, GPU Tech Conf,2012]
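A hedged sketch of what dynamic parallelism looks like in code (kernel names are illustrative; requires a compute-capability 3.5+ GPU and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parentKernel(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // The GPU thread launches a whole child grid itself, with no CPU round trip.
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();   // device-side wait for the child grid to finish
    }
}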

Page 21:

CUDA 5 – Dynamic Parallelism (Cntd.)

[Harris, GPU Tech Conf,2012]

Page 22:

CUDA 5 – Dynamic Parallelism (Cntd.) - Advantages
Recursive parallel algorithms.
More efficient - the GPU is kept more occupied.
Simplifies the CPU/GPU divide.
Library calls can be made from a kernel.

Page 23:

CUDA 5 Features - GPU Object Linking

[Harris, GPU Tech Conf,2012]

Page 24:

CUDA 5 Features - RDMA
RDMA: Remote Direct Memory Access between any GPUs in a cluster.

[Harris, GPU Tech Conf,2012]

Page 25:

CUDA – Supporting Works: CUDA-Lite

A source-to-source translation tool that relieves the programmer from handling the memory hierarchy.

[Ueng, LCPC , 2008]

Page 26:

CUDA – Supporting Works

m-CUDA

Makes the CUDA architecture run on regular multi-core CPU systems.

Demonstrates the effectiveness of the CUDA model on non-GPU systems as well.

[Buck,SC08,2008]

Page 27:

Alternate Developments
CUDA is not as simple as it sounds; people have questioned the future of CUDA.
CUDA has a strong reputation for performance, but at the expense of ease of programming.
Alternatives like XMT have been developed, challenging CUDA.
XMT: a many-core general-purpose parallel architecture.

[Caragea, HotPar 2010]

Page 28:

CUDA Achievements
375 million CUDA-capable GPUs sold by NVIDIA.
1 million toolkit downloads; >120,000 active developers.
Active research community.
New domains like big-data analytics:
Shazam - top 5 music app in the Apple App Store.
SalesForce.com - real-time Twitter data analysis, and many more...

Source: NVIDIA

Page 29:

CUDA Achievements

[Nickolls,IEEE,2010]

Page 30:

Conclusion & Some Thoughts
CUDA is promising, but it only supports NVIDIA GPUs.
OpenCL and AMD Brook are not mainstream yet.
Automatic extraction of parallelism.
Automatic conversion of existing code bases in popular models, e.g. Java threads.
More support for higher-level languages.

Page 31:

References

[Buck,SC08,2008]: Massimiliano Fatica (NVIDIA), Patrick LeGresley (NVIDIA), Ian Buck (NVIDIA), John Stone (University of Illinois at Urbana-Champaign), Jim Phillips (University of Illinois at Urbana-Champaign), Scott Morton (Hess Corporation), Paulius Micikevicius (NVIDIA), "High Performance Computing with CUDA", Nov. 2008.

[Ueng, LCPC, 2008]: Sain-Zee Ueng, Melvin Lathara, Sara S., Wen-mei W. Hwu, "CUDA-Lite: Reducing GPU Programming Complexity", International Workshop, LCPC 2008, Edmonton, Canada, July 31 - August 2, 2008.

[Nickolls,IEEE,2010]: J. Nickolls, "The GPU Computing Era", IEEE Micro, 2010.

[Harris, GPU Tech Conf, 2012]: Mark Harris, "CUDA 5 and Beyond", GPU Technology Conference, 2012.

[Nickolls,ACM,2008]: John Nickolls, Ian Buck, Michael Garland, Kevin Skadron, "Scalable Parallel Programming with CUDA", ACM Queue - GPU Computing, Vol. 6, Issue 2, April 2008.

[Kirk,2010]: David B. Kirk, Wen-mei W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach", 2010.

[Caragea, HotPar 2010]: G. C. Caragea, F. Keceli, A. Tzannes, U. Vishkin, Proc. HotPar, 2010.

Page 32:

Thank You!