EPFL CS-206 – Spring 2015 Lec.11 - 1
CS-206 Concurrency – Lecture 11
Data Parallel Computing
Spring 2015
Prof. Babak Falsafi
parsa.epfl.ch/courses/cs206/
Adapted from slides originally developed by Andreas Di Blas, Babak Falsafi, Simon Green, David Kirk, Andreas Moshovos, David Patterson and Waqar Saleem
EPFL Copyright 2015
EPFL CS-206 – Spring 2015 Lec.11 - 5
Recall: Forms of Parallelism
• Throughput parallelism
  – Perform many (identical) sequential tasks at the same time
  – E.g., Google search, ATM (bank) transactions
• Task parallelism
  – Perform tasks that are functionally different in parallel
  – E.g., iPhoto (face recognition with slide show)
• Pipeline parallelism
  – Perform tasks that are different in a particular order
  – E.g., speech (signal, phonemes, words, conversation)
• Data parallelism
  – Perform the same task on different data
  – E.g., graphics, data analytics
} Reduce time for one job
EPFL CS-206 – Spring 2015 Lec.11 - 7
Example: Image Processing/Graphics
int a[N];  // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;
EPFL CS-206 – Spring 2015 Lec.11 - 8
Example: Speech Recognition (e.g., Siri)
• Signal processing: same algorithm run on each sample
• Neural network: propagate values across neurons
EPFL CS-206 – Spring 2015 Lec.11 - 9
Signal Processing: Data Parallel Transforms
Example: Discrete Fourier Transform (DFT), size 4

transform (a matrix) × sampled signal (a vector)

DFT_4  =  [ 1   1   1   1
            1   i  -1  -i
            1  -1   1  -1
            1  -i  -1   i ]

(The slide shows DFT_4 factored into a product of sparse matrices.)
Matrix operations are embarrassingly data parallel!
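To make this concrete, here is a minimal C sketch of a matrix-vector product: every output element y[i] is an independent dot product, so all of them can be computed in parallel (function and variable names are illustrative):

/* A minimal sketch: y = M * x for an n x n matrix. Each y[i] is an
   independent dot product, so the outer loop is data parallel. */
void matvec(int n, const float *M, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {      /* parallelizable across i */
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += M[i * n + k] * x[k];
        y[i] = sum;
    }
}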
EPFL CS-206 – Spring 2015 Lec.11 - 10
A network of neurons
[Figure: one neuron j with inputs x_1 … x_n, weights w_1j … w_nj, and bias b_j; the network runs from the signal through hidden layers to phonemes.]
Each neuron computes:
    y_j = sigmoid( Σ_{k=0}^{n} w_kj · x_k + b_j )
EPFL CS-206 – Spring 2015 Lec.11 - 11
Data Parallel Computation on Neurons
float nron[N];  // N is large (pseudo-code: one record per neuron)
for (i = 0; i < N; i++)                     // for each neuron nron[i]
    for (j = 0; j < nron[i].outputs; j++)   // for each of its outputs
        nron[i].y[j] = sigmoid( Σ_{k=0}^{nron[i].inputs} nron[i].w_kj · nron[i].x_k + nron[i].b_j );
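A compilable C version of this pseudo-code is sketched below; the Neuron struct layout, field names, and sizes are assumptions for illustration, not from the slides:

#include <math.h>

#define MAX_IO 16                       /* illustrative capacity */

typedef struct {                        /* hypothetical record per neuron */
    int   inputs, outputs;
    float x[MAX_IO];                    /* input values                          */
    float w[MAX_IO][MAX_IO];            /* w[k][j]: weight, input k -> output j  */
    float b[MAX_IO];                    /* per-output bias                       */
    float y[MAX_IO];                    /* output values                         */
} Neuron;

static float sigmoid(float v) { return 1.0f / (1.0f + expf(-v)); }

void forward(Neuron *nron, int N)       /* N is large */
{
    for (int i = 0; i < N; i++)                      /* each neuron: independent */
        for (int j = 0; j < nron[i].outputs; j++) {  /* each output of neuron i  */
            float sum = nron[i].b[j];
            for (int k = 0; k < nron[i].inputs; k++)
                sum += nron[i].w[k][j] * nron[i].x[k];
            nron[i].y[j] = sigmoid(sum);
        }
}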
EPFL CS-206 – Spring 2015 Lec.11 - 12
Example: Data Analytics
• Google processes 20 PB a day
• The Wayback Machine has 3 PB + 100 TB/month
• Facebook has 2.5 PB of user data + 15 TB/day
• eBay has 6.5 PB of user data + 50 TB/day
• CERN's Large Hadron Collider generates 15 PB a year
How do we aggregate this data?
EPFL CS-206 – Spring 2015 Lec.11 - 13
MapReduce in Data Analytics
• It's about aggregating statistics over data
• Divide up the data among servers
• Compute the stats (independently)
• Then aggregate/reduce
• Example: CloudSuite classification benchmark
  – 10's of GB of web pages
  – Rank pages based on word occurrence (popularity)
  – Look for celebrities
  – It's an embarrassingly (data) parallel problem!
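As a toy illustration of that map-then-reduce shape, the sketch below counts how often one word occurs across chunks of pages, each chunk processed independently. This is a single-process C sketch; a real deployment runs the chunks on different servers, and all names and data here are illustrative:

#include <stdio.h>
#include <string.h>

#define CHUNKS 4                         /* stand-ins for the servers */

/* "map": count occurrences of word in one chunk of pages */
static int map_count(const char **pages, int n, const char *word)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        for (const char *p = pages[i]; (p = strstr(p, word)) != NULL; p++)
            count++;
    return count;
}

int main(void)
{
    const char *pages[] = { "gaga gaga bieber", "bieber fan page",
                            "gaga tour dates",  "news about gaga" };
    int partial[CHUNKS];

    /* map phase: each chunk is computed independently (in parallel) */
    for (int c = 0; c < CHUNKS; c++)
        partial[c] = map_count(&pages[c], 1, "gaga");

    /* reduce phase: aggregate the per-chunk statistics */
    int total = 0;
    for (int c = 0; c < CHUNKS; c++)
        total += partial[c];
    printf("gaga: %d\n", total);         /* prints: gaga: 4 */
    return 0;
}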
EPFL CS-206 – Spring 2015 Lec.11 - 14
MapReduce from Google: Data Parallel Computing on Volume Servers
[Figure: map tasks run in parallel over the input; values are aggregated by key; reduce tasks emit the totals, e.g., Gaga: 5004, Bieber: 2513.]
This Course: Data Parallel Processor Architecture
1. Vector Processors
  – Pipelined execution
  – SIMD: single instruction, multiple data
  – Example: modern ISA extensions
2. Graphics Processing Units (GPUs)
  – Dense grid of ALUs
  – SIMT: single instruction, multiple threads
  – Integrated vs. discrete
EPFL CS-206 – Spring 2015 Lec.11 - 16
Recall: MIPS Processor (Instruction Cycle)
• Instructions are fetched from the instruction cache and decoded
• Operands are fetched from the register file
• Execute is the ALU (arithmetic logic unit)
• Memory access to the data cache
• Write results back to the register file
[Figure: five-stage pipeline: IF (Instruction Fetch) → ID (Instruction Decode / Operand Fetch) → EXE (Execute) → MEM (Memory Access) → WB (Write Back Result).]
EPFL CS-206 – Spring 2015 Lec.11 - 17
Recall: MIPS Pipeline (Instruction Cycle)
[Figure: the same five-stage pipeline: IF → ID → EXE → MEM → WB.]
int a[N];  // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;
EPFL CS-206 – Spring 2015 Lec.11 - 18
Fader loop in assembly
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;
• The loop iterates N times (once for each array element)
• The same exact operation for each element
• Assume a 32-bit "mul"
; a[] -> $2,
; fade -> $3,
; &a[N] -> $4,
; $5 is a temp
loop:
lw $5, 0($2)
mul $5, $3, $5
sw $5, 0($2)
addi $2, $2, 4
bne $2, $4, loop
EPFL CS-206 – Spring 2015 Lec.11 - 19
Vector Processor: One instruction, multiple data
[Figure: the five-stage pipeline IF → ID → EXE → MEM → WB, but with four parallel EXE lanes.]
EPFL CS-206 – Spring 2015 Lec.11 - 20
Vector Processing
SCALAR (1 operation):
    mul r3, r1, r2        ; r3 = r1 * r2
VECTOR (N operations):
    mul.v v3, v1, v2      ; v3[i] = v1[i] * v2[i], over the whole vector length
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
EPFL CS-206 – Spring 2015 Lec.11 - 21
Example vector instructions
Each vector register holds multiple scalar values
• In our example, a vector register V has 4 scalars. So:
• mul.v v1, v2, v1: element-wise vector multiply, v1[i] = v2[i] * v1[i]
• mul.sv v1, r1, v1: multiplies scalar r1 into all elements of v1
• lw.v v1, 0(r1): loads vector v1 from address r1
• sw.v v1, 0(r1): stores vector v1 at address r1
Operation & Instruction Count (from F. Quintana, U. Barcelona.)
Vector reduces ops by 1.2X, instructions by 20X
EPFL CS-206 – Spring 2015 Lec.11 - 28
Automatic Code Vectorization
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;
The compiler can detect vector operations:
• Inspect the code
• Vectorize automatically
But what about:
for (i = 0; i < N; i++)
    a[i] = a[b[i]] * fade;
b[i] unknown at compile time!
EPFL CS-206 – Spring 2015 Lec.11 - 30
x86 architecture SIMD support
• Both current AMD and Intel x86 processors have ISA and microarchitecture support for SIMD operations
• ISA SIMD support
  – MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX
  – See the flags field in /proc/cpuinfo
  – SSE (Streaming SIMD Extensions): ISA extensions to x86
  – SIMD/vector operations
• Microarchitecture support
  – Many functional units
  – 8 128-bit vector registers: XMM0, XMM1, …, XMM7
EPFL CS-206 – Spring 2015 Lec.11 - 31
SSE programming
• Vector registers support three data types:
  – integer (16 bytes, 8 shorts, 4 ints, 2 long long ints, 1 dqword)
  – single-precision floating point (4 floats)
  – double-precision floating point (2 doubles)
EPFL CS-206 – Spring 2015 Lec.11 - 32
SSE instructions
• Arithmetic instructions
  – ADD, SUB, MUL, DIV, SQRT, MAX, MIN, RCP, etc.
  – PD: two doubles, PS: four floats, SS: scalar
  – ADDPS adds four floats; ADDSS is a scalar add
• Logical instructions
  – AND, OR, XOR, ANDN, etc.
  – ANDPS: bitwise AND of operands
  – ANDNPS: bitwise AND NOT of operands
• Comparison instructions
  – CMPPS, CMPSS: compare operands and return all 1's or all 0's
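As an illustration, the lecture's fade loop over floats written with SSE intrinsics: a minimal sketch assuming N is a multiple of 4 and a[] is 16-byte aligned (the function and variable names are illustrative):

#include <xmmintrin.h>                   /* SSE intrinsics */

void fade_sse(float *a, float fade, int N)
{
    __m128 vfade = _mm_set1_ps(fade);    /* broadcast fade into all 4 lanes */
    for (int i = 0; i < N; i += 4) {
        __m128 v = _mm_load_ps(&a[i]);   /* load 4 floats (aligned)         */
        v = _mm_mul_ps(v, vfade);        /* MULPS: 4 multiplies at once     */
        _mm_store_ps(&a[i], v);          /* store 4 floats                  */
    }
}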
EPFL CS-206 – Spring 2015 Lec.11 - 33
SIMD extensions in ARM: NEON
• 32 × 64-bit registers (also usable as 16 × 128-bit registers)
• Registers are treated as vectors of the same data type
• Data types: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision float
• Instructions perform the same operation in all lanes
[Figure: source registers Dn and Dm split into lanes; the operation is applied per lane; results land in destination register Dd.]
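The same fade loop sketched with NEON intrinsics, under the same multiple-of-4 assumption (names illustrative):

#include <arm_neon.h>                      /* NEON intrinsics */

void fade_neon(float *a, float fade, int N)
{
    for (int i = 0; i < N; i += 4) {
        float32x4_t v = vld1q_f32(&a[i]);  /* load 4 floats               */
        v = vmulq_n_f32(v, fade);          /* multiply every lane by fade */
        vst1q_f32(&a[i], v);               /* store 4 floats              */
    }
}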
EPFL CS-206 – Spring 2015 Lec.11 - 34
This Course: Data Parallel Processor Architecture
1. Vector Processors
  – Pipelined execution
  – SIMD: single instruction, multiple data
  – Example: modern ISA extensions
2. Graphics Processing Units (GPUs)
  – Dense grid of ALUs
  – SIMT: single instruction, multiple threads
  – Integrated vs. discrete
EPFL CS-206 – Spring 2015 Lec.11 - 35
CPU vs. GPU
CPU:
• Tens of cores
• Mostly control logic
• Large caches
• Regular threads (e.g., Java)
GPU:
• Thousands of tiny cores
• Mostly ALUs
• Little cache
• Special threads (e.g., CUDA)
EPFL CS-206 – Spring 2015 Lec.11 - 36
GPUs are highly concurrent!
[Figure: GFLOPS/sec over time for GPUs vs. CPUs, showing a growing performance gap.]
EPFL CS-206 – Spring 2015 Lec.11 - 37
Integrated vs. Discrete GPU
Integrated (e.g., AMD):
• Shared cache hierarchy
• One memory
Discrete (e.g., nVidia):
• Specialized GPU memory
• Must move data back and forth
[Figure: CPU with its memory connected over an I/O bus to a GPU with its own GPU memory.]
EPFL CS-206 – Spring 2015 Lec.11 - 38
This course: Discrete GPU
[Figure: the same diagram, highlighting the discrete GPU with its own GPU memory across the I/O bus.]
EPFL CS-206 – Spring 2015 Lec.11 - 39
Warning! CPU/GPU connection is a bottleneck
[Figure: GPU to GPU memory at about 300 GB/s; CPU to memory at about 30 GB/s; CPU to GPU over the I/O bus at about 3 GB/s.]
EPFL CS-206 – Spring 2015 Lec.11 - 40
Sequential Execution Model / SISD
int a[N];  // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;
[Figure: a single flow of control (thread) moving down through time, one instruction at a time; optimizations are possible at the machine level.]
EPFL CS-206 – Spring 2015 Lec.11 - 41
Data Parallel Execution Model / SIMD
int a[N];  // N is large
for all elements do in parallel
    a[i] = a[i] * fade;
[Figure: all elements processed in lockstep as time advances.]
EPFL CS-206 – Spring 2015 Lec.11 - 42
Single Program Multiple Data / SPMD
int a[N];  // N is large
for all elements do in parallel
    if (a[i] > threshold) a[i] *= fade;
• Code is statically identical across all threads
• The execution path may differ
• This is the model used in today's graphics processors
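A minimal CUDA sketch of this SPMD behavior: every thread runs the same code, but the branch outcome depends on that thread's data (kernel and parameter names are illustrative):

__global__ void fade_if(int *a, int fade, int threshold, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && a[i] > threshold)   /* data-dependent path per thread */
        a[i] = a[i] * fade;
}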
EPFL CS-206 – Spring 2015 Lec.11 - 43
Killer app? 3D Graphics
Example apps:
• Games
• Engineering/CAD
Computation:
• Start with triangles (points in 3D space)
• Transform (move, rotate, scale)
• Paint / texture mapping
• Rasterize: convert into pixels
• Lighting & hidden-surface elimination
Bottom line:
• Tons of independent calculations
• Lots of identical calculations
EPFL CS-206 – Spring 2015 Lec.11 - 44
Target Applications
int a[N];  // N is large
for all elements of an array
    a[i] = a[i] * fade;
• Lots of independent computations
  – CUDA threads need not be completely independent
(The loop body is the kernel; each element is handled by a thread.)
EPFL CS-206 – Spring 2015 Lec.11 - 45
Programmer’s View of the GPU
• GPU: a compute device that:
  – is a coprocessor to the CPU (the host)
  – has its own DRAM (device memory)
  – runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
EPFL CS-206 – Spring 2015 Lec.11 - 46
GPU vs. CPU Threads
• GPU threads are extremely lightweight
  – Little creation overhead (unlike Java), e.g., ~microseconds
  – All done in hardware
• A GPU needs 1000s of threads for full efficiency
  – A multi-core CPU needs only a few
EPFL CS-206 – Spring 2015 Lec.11 - 47
GPU threads help in two ways!
• Parallelize computation
• Overlap memory access with computation
[Figure: CPU and memory beside GPU and GPU memory; loads (… = a[i]) and stores (a[i] = …) overlap with the … * fade computation.]
EPFL CS-206 – Spring 2015 Lec.11 - 48
Execution Timeline
[Timeline across CPU/host and GPU/device:]
1. Copy to GPU memory
2. Launch GPU kernel
2'. Synchronize with GPU
3. Copy from GPU memory
EPFL CS-206 – Spring 2015 Lec.11 - 49
Programmer's view
• First create data in CPU memory
[Figure: CPU with its memory, GPU with its GPU memory.]
EPFL CS-206 – Spring 2015 Lec.11 - 50
Programmer's view
• Then copy to GPU
[Figure: data moving from CPU memory to GPU memory.]
EPFL CS-206 – Spring 2015 Lec.11 - 51
Programmer's view
• GPU starts computation → runs a kernel
• CPU can also continue
[Figure: the kernel running on the GPU.]
EPFL CS-206 – Spring 2015 Lec.11 - 52
Programmer's view
• CPU and GPU synchronize
[Figure: synchronization point between CPU and GPU.]
EPFL CS-206 – Spring 2015 Lec.11 - 53
Programmer's view
• Copy results back to CPU
[Figure: data moving from GPU memory back to CPU memory.]
EPFL CS-206 – Spring 2015 Lec.11 - 54
Programming Languages
• CUDA
  – nVidia
  – Has the market lead
• OpenCL
  – Many vendors, including nVidia
  – CUDA superset
  – Targets many different devices, e.g., CPUs + programmable accelerators
  – Fairly new
• Both are evolving
EPFL CS-206 – Spring 2015 Lec.11 - 55
Computation partitioning
• Think of computation as a series of loops
• Think of data as an array
for (i = 0; i < big_number; i++)
    a[i] = some function
for (i = 0; i < big_number; i++)
    a[i] = some other function
for (i = 0; i < big_number; i++)
    a[i] = some other function
} Kernels (each loop becomes a kernel)
EPFL CS-206 – Spring 2015 Lec.11 - 56
What is the kernel here?
EPFL CS-206 – Spring 2015 Lec.11 - 57
My first CUDA Program

// GPU (device) code:
__global__ void fadepic(int *a, int fade, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        a[i] = a[i] * fade;
}

// CPU (host) code:
int main()
{
    int h[N];
    int *d;
    cudaMalloc((void **) &d, SIZE);
    …
    cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);
    fadepic<<<n_blocks, block_size>>>(d, 10, N);  // fade is an int, so pass 10, not 10.0
    cudaDeviceSynchronize();
    cudaMemcpy(h, d, SIZE, cudaMemcpyDeviceToHost);
    CUDA_SAFE_CALL(cudaFree(d));
}
EPFL CS-206 – Spring 2015 Lec.11 - 58
Per Kernel Computation Partitioning
• Threads within a block can communicate/synchronize
  – They run on the same core
• Threads across blocks can't communicate
  – They shouldn't touch each other's data (undefined behavior)
[Figure: a grid of blocks; each block is a group of threads.]
EPFL CS-206 – Spring 2015 Lec.11 - 59
Per Kernel Computation Partitioning
• One thread can process multiple data elements
• Other mappings are possible and often desirable
  – We will talk about this later
[Figure: blocks of threads, each thread covering several data elements.]
EPFL CS-206 – Spring 2015 Lec.11 - 60
Fade example
• Each thread will process one pixel
for all elements do in parallel
    a[i] = a[i] * fade;
EPFL CS-206 – Spring 2015 Lec.11 - 61
Code Skeleton
• CPU:
  – Initialize image from file
  – Allocate buffer on GPU
  – Copy image to buffer
  – Launch GPU kernel (which reads and writes the buffer)
  – Copy buffer back to CPU
  – Write image to a file
• GPU (the program for one thread; it processes one pixel):
    {
        int v = a[x][y];
        v = v * fade;
        a[x][y] = v;
    }
EPFL CS-206 – Spring 2015 Lec.11 - 63
Which thread computes which pixel?
[Figure: the image as a grid of blocks, gridDim.x × gridDim.y blocks of blockDim.x × blockDim.y threads each; threadIdx.x and threadIdx.y locate a thread within its block.]
EPFL CS-206 – Spring 2015 Lec.11 - 64
gridDim
• gridDim.x = 7, gridDim.y = 6
• How many blocks per dimension?
EPFL CS-206 – Spring 2015 Lec.11 - 65
blockIdx
• blockIdx = the coordinates of a block in the grid
• E.g., blockIdx.x = 2, blockIdx.y = 3
• E.g., blockIdx.x = 5, blockIdx.y = 1
(The origin (0,0) is at a corner of the grid.)
EPFL CS-206 – Spring 2015 Lec.11 - 66
blockDim
• blockDim.x = 7, blockDim.y = 7
• How many threads in a block per dimension?
EPFL CS-206 – Spring 2015 Lec.11 - 67
threadIdx
• threadIdx = the coordinates of a thread in the block
• E.g., threadIdx.x = 2, threadIdx.y = 3
• E.g., threadIdx.x = 5, threadIdx.y = 4
(The origin (0,0) is at a corner of the block.)
EPFL CS-206 – Spring 2015 Lec.11 - 68
Which thread computes which pixel?
[Figure: the same grid/block diagram as before.]
x = blockIdx.x * blockDim.x + threadIdx.x
y = blockIdx.y * blockDim.y + threadIdx.y
EPFL CS-206 – Spring 2015 Lec.11 - 69
GPU Kernel pseudo-code
__global__ void fade(int *a, int fade, int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int offset = y * (blockDim.x * gridDim.x) + x;  // offset within the unidimensional array
    int v = a[offset];
    v = v * fade;
    a[offset] = v;
}
EPFL CS-206 – Spring 2015 Lec.11 - 70
GPU Kernel pseudo-code w/ limits
__global__ void fade(int *a, int fade, int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int offset = y * (blockDim.x * gridDim.x) + x;
    if (offset >= N) return;  // valid offsets are 0 .. N-1
    int v = a[offset];
    v = v * fade;
    a[offset] = v;
}
EPFL CS-206 – Spring 2015 Lec.11 - 71
Grids of Blocks of Threads
• Cores and caches are clustered on chip for fast connectivity
• The hardware partitions naturally into grids
[Figure: grids of blocks of threads scheduled over time.]
EPFL CS-206 – Spring 2015 Lec.11 - 72
Programmer's view: Memory Model
[Figure: the CUDA memory model (summarized in the table below).]
EPFL CS-206 – Spring 2015 Lec.11 - 73
[Figure: the device runs Grid 1, a 3×2 arrangement of blocks, Block (0,0) … Block (2,1); Block (1,1) is expanded into a 5×3 arrangement of threads, Thread (0,0) … Thread (4,2).]
Grids of Thread Blocks: Dimension Limits
• Grid of blocks: 1D, 2D, or 3D
  – Max x, y, and z: 2^32 - 1
  – Machine dependent
• Block of threads: 1D, 2D, or 3D
  – Max number of threads per block: 1024
  – Max x: 1024, max y: 1024, max z: 64
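A hypothetical launch configuration within these limits, e.g. covering a width × height image with the 2D fade kernel shown earlier (width, height, and the argument names are illustrative):

dim3 block(16, 16);                           /* 256 threads, well under the 1024 limit */
dim3 grid((width  + block.x - 1) / block.x,   /* round up so the grid covers            */
          (height + block.y - 1) / block.y);  /* every pixel                            */
fade<<<grid, block>>>(d_a, fade_factor, width * height);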
EPFL CS-206 – Spring 2015 Lec.11 - 74
Thread Batching
• A kernel is executed as a grid of thread blocks
• Threads in a block cooperate
  – Synchronize their execution
  – Efficiently share data in block-local memory
• Threads across blocks cannot cooperate
(A sketch of in-block cooperation follows the figure below.)
[Figure: the host launches Kernel 1 on Grid 1 (a 3×2 arrangement of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5×3 arrangement of threads.]
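A minimal sketch of such in-block cooperation: threads stage data in block-local (shared) memory and synchronize before reading each other's values (assumes blockDim.x == 256; names illustrative):

__global__ void shift_in_block(int *a)
{
    __shared__ int s[256];                    /* block-local memory          */
    int t = threadIdx.x;
    s[t] = a[blockIdx.x * blockDim.x + t];    /* each thread writes one slot */
    __syncthreads();                          /* wait until all writes land  */
    int left = (t == 0) ? 0 : s[t - 1];       /* safely read a neighbor      */
    a[blockIdx.x * blockDim.x + t] = left;
}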
EPFL CS-206 – Spring 2015 Lec.11 - 75
Thread Coordination Overview
• Race-free access to data
  – Only across threads within the same block
  – No communication across blocks
EPFL CS-206 – Spring 2015 Lec.11 - 76
Programmer’s view: Memory Model: Thread vs. Host
[Figure: memory spaces as seen by a thread vs. the host; arrows show whether read and/or write is possible.]
EPFL CS-206 – Spring 2015 Lec.11 - 77
Memory Model Summary
Memory   | Location | Access | Scope
---------|----------|--------|------------------------
Local    | off-chip | R/W    | one thread
Shared   | on-chip  | R/W    | all threads in a block
Global   | off-chip | R/W    | all threads + host
Constant | off-chip | RO     | all threads + host
Texture  | off-chip | RO     | all threads + host
Surface  | off-chip | R/W    | all threads + host
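For orientation, a sketch of where the table's rows appear in CUDA source (the declarations and names are illustrative):

__device__   int   g_data[1024];   /* global: off-chip, R/W, all threads + host     */
__constant__ float c_fade;         /* constant: off-chip, RO on device, set by host */

__global__ void k(void)
{
    __shared__ int s_buf[256];     /* shared: on-chip, R/W, all threads in a block  */
    int t = threadIdx.x;           /* locals/registers: private to one thread       */
    s_buf[t % 256] = t;
    __syncthreads();
    g_data[t % 1024] = s_buf[t % 256] + (int) c_fade;
}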
EPFL CS-206 – Spring 2015 Lec.11 - 78
Memory Model: Global, Constant, and Texture Memories
• Global memory
  – Communicating R/W data between host and device
  – Contents visible to all threads
  – May be cached (machine dependent)
• Texture and constant memories
  – Constants initialized by the host
  – Contents visible to all threads
  – May be cached (machine dependent)
EPFL CS-206 – Spring 2015 Lec.11 - 79
Execution Model: Ordering
• Execution order is undefined
• Do not assume or rely on:
  – block 0 executing before block 1
  – thread 10 executing before thread 20
  – or any other ordering, even if you can observe it
• Future implementations may break such orderings
• Ordering is not part of the CUDA definition
• Why? More flexible hardware options
EPFL CS-206 – Spring 2015 Lec.11 - 80
Reasoning about CUDA call ordering
• Access the GPU via cuda…() calls and kernel invocations
  – cudaMalloc, cudaMemcpy
• Asynchronous from the CPU's perspective
  – The CPU places a request in a "CUDA" queue
  – Requests are handled in order
EPFL CS-206 – Spring 2015 Lec.11 - 81
Execution Model Summary (for your reference)
• Grid of blocks of threads
  – 1D/2D/3D grid of blocks of 1D/2D/3D threads
  – Threads and blocks have IDs
• Block execution order is undefined
• Threads in the same block can share data fast
• Across blocks, threads:
  – cannot cooperate
  – communicate (slowly) through global memory
• Blocks do not migrate: they execute on the same processor
• Several blocks may run on the same core
EPFL CS-206 – Spring 2015 Lec.11 - 82
CUDA API: Example
int a[N];
for (i = 0; i < N; i++)
    a[i] = a[i] + x;
1. Allocate CPU data structure
2. Initialize data on CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define execution configuration
6. Run kernel
7. CPU synchronizes with GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory
EPFL CS-206 – Spring 2015 Lec.11 - 83
1. Allocate CPU data structure
float *ha;

main (int argc, char *argv[])
{
int N = atoi (argv[1]);
ha = (float *) malloc (sizeof (float) * N);
...
}
EPFL CS-206 – Spring 2015 Lec.11 - 84
2. Initialize CPU data (dummy)
float *ha;
int i;
for (i = 0; i < N; i++)
ha[i] = i;
EPFL CS-206 – Spring 2015 Lec.11 - 85
3. Allocate GPU data structure

float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);
• Notice: no assignment side
  – NOT: da = cudaMalloc (…)
• The assignment is done internally:
  – That's why we pass &da
• Space is allocated in global memory on the GPU
EPFL CS-206 – Spring 2015 Lec.11 - 86
GPU Memory Allocation
• The host manages GPU memory allocation:
  – cudaMalloc (void **ptr, size_t nbytes)
    Must explicitly cast to (void **):
    cudaMalloc ((void **) &da, sizeof (float) * N);
  – cudaFree (void *ptr);
    cudaFree (da);
  – cudaMemset (void *ptr, int value, size_t nbytes);
    cudaMemset (da, 0, N * sizeof (int));
• Check the CUDA Reference Manual
EPFL CS-206 – Spring 2015 Lec.11 - 87
4. Copy Initialized CPU data to GPU
float *da;
float *ha;
cudaMemcpy ((void *) da,          // destination
            (void *) ha,          // source
            sizeof (float) * N,   // number of bytes
            cudaMemcpyHostToDevice);  // direction
EPFL CS-206 – Spring 2015 Lec.11 - 88
Host/Device Data Transfers
The host initiates all transfers:
• cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
• Asynchronous from the CPU's perspective
  – The CPU thread continues
• In-order processing with other CUDA requests
• enum cudaMemcpyKind