ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 4, 2013, Streams.pptx
Page-Locked Memory and CUDA Streams
These notes introduce the use of multiple CUDA streams to overlap memory transfers with kernel computations.
We first need to introduce page-locked memory, as streams need page-locked memory.
These materials come from Chapter 10 of “CUDA by Example” by Jason Sanders and Edward Kandrot.
Page-locked host memory (also called “pinned” host memory)
Page-locked memory is not paged in and out of main memory by the OS; it remains resident in physical memory.
• Host memory can be mapped to device address space (Compute capability > 1.0)
• Memory bandwidth is higher
• Uses real addresses rather than virtual addresses
• Does not need intermediate copy buffering
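The first bullet (mapping host memory into the device address space) can be illustrated with a minimal sketch. The kernel `doubleAll` and the helper `runMappedExample` are hypothetical names introduced here, and the sketch assumes a device that supports mapped page-locked memory. It uses `cudaHostAlloc` with the `cudaHostAllocMapped` flag and `cudaHostGetDevicePointer` to obtain a device-side pointer to the same buffer.

```cuda
#include <cuda_runtime.h>

// Kernel that doubles each element. The pointer it receives refers to
// page-locked HOST memory that has been mapped into the device address space.
__global__ void doubleAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper (not from the slides): returns true if every element
// of a mapped, page-locked buffer was doubled by the kernel.
bool runMappedExample(int n) {
    float *hostPtr = nullptr;
    float *devPtr = nullptr;

    // Request support for mapped pinned allocations. Older CUDA versions
    // need this before the context is created; the result is ignored here
    // and any non-sticky error it leaves behind is cleared.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    (void)cudaGetLastError();

    if (cudaHostAlloc((void **)&hostPtr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; ++i) hostPtr[i] = static_cast<float>(i);

    // Obtain the device-side address of the same physical buffer.
    if (cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0) != cudaSuccess) {
        cudaFreeHost(hostPtr);
        return false;
    }

    doubleAll<<<(n + 255) / 256, 256>>>(devPtr, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // No cudaMemcpy was issued: the results are already in host memory.
    for (int i = 0; ok && i < n; ++i) {
        ok = (hostPtr[i] == 2.0f * static_cast<float>(i));
    }
    cudaFreeHost(hostPtr);
    return ok;
}
```

A call such as `runMappedExample(1 << 20)` exercises the whole path. Whether mapped access is faster than an explicit copy depends on the access pattern and the hardware, so this is a sketch of the mechanism rather than a performance recommendation.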
Questions
What is paging?
What are real and virtual addresses?
A process is stored as one or more distributed pages
One process (application)
Paging and virtual memory recap
Main memory
Hard drive (disk)
Page
Real address – the actual physical address of the location
Virtual address – the address allocated to a process by the paging/virtual memory mechanism, which allows the pages to reside anywhere in physical memory
Real-virtual address translation is done by a lookup table, partly in hardware (translation lookaside buffer, TLB) for recently used pages and partly in software
Page – a block of memory used with virtual memory. Pages are transferred to and from disk to make space.
Paging
RA = 0, VA = 45, say
RA = 2, VA = 46, say
More information can be found in undergraduate Computer Architecture and Operating Systems courses.
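The real/virtual address translation recapped above can be sketched as a toy page-table lookup in host code. Everything here is illustrative: the 4 KiB page size, the names `kPageSize`, `kPageTable` and `translate`, and the reading of the figure's RA/VA values as page numbers are assumptions, and a real system uses multi-level tables and a hardware TLB.

```cuda
#include <cstdint>
#include <unordered_map>

// Assumed page size; 4 KiB is common but not universal.
constexpr std::uint64_t kPageSize = 4096;

// Toy page table: virtual page number -> real (physical) page number.
// The two entries mirror the figure labels: VA 45 -> RA 0, VA 46 -> RA 2.
const std::unordered_map<std::uint64_t, std::uint64_t> kPageTable = {
    {45, 0},
    {46, 2},
};

// Translate a virtual byte address into a real byte address.
// Returns false if the page is not resident; a real OS would take a
// page fault here and bring the page in from disk.
bool translate(std::uint64_t virtualAddr, std::uint64_t *realAddr) {
    const std::uint64_t page = virtualAddr / kPageSize;    // which page
    const std::uint64_t offset = virtualAddr % kPageSize;  // position inside it
    const auto it = kPageTable.find(page);
    if (it == kPageTable.end()) return false;
    *realAddr = it->second * kPageSize + offset;
    return true;
}
```

Only the page number is translated; the offset within the page is carried over unchanged. Page-locked memory corresponds to entries that the OS promises never to evict, so the real address stays valid for the lifetime of the allocation.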
Note on using page-locked memory
Using page-locked memory reduces the memory available to the OS for paging, so it must be allocated with care.
Allocating page-locked memory
cudaMallocHost (void ** ptr, size_t size)
Allocates page-locked host memory that is accessible to the device.
cudaHostAlloc (void ** ptr, size_t size, unsigned int flags)
Allocates page-locked host memory that is accessible to the device. The flags argument selects further options, such as cudaHostAllocPortable, cudaHostAllocMapped and cudaHostAllocWriteCombined.
Notes: “The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().” http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_g9f93d9600f4504e0d637ceb43c91ebad.html
Freeing page-locked memory
cudaFreeHost (void * ptr)
“Frees the memory space pointed to by ptr, which must have been returned by a previous call to cudaMallocHost() or cudaHostAlloc().”
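The allocation and release calls can be combined as in the following minimal sketch. The helper name `tryPinnedAlloc` is hypothetical; the sketch checks the return code because page-locked memory is a limited resource and the allocation may fail.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper (not from the slides): allocate `bytes` of page-locked
// host memory with either API, then release it. Returns true on success.
bool tryPinnedAlloc(size_t bytes, bool useHostAlloc) {
    void *ptr = nullptr;

    cudaError_t err = useHostAlloc
        ? cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault)  // flags select options
        : cudaMallocHost(&ptr, bytes);

    if (err != cudaSuccess) {
        // Pinned memory cannot be paged out, so large requests may be refused.
        fprintf(stderr, "pinned allocation failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // ... ptr would be used here as the host side of cudaMemcpy() calls ...

    // Memory from either call must be released with cudaFreeHost(), not free().
    return cudaFreeHost(ptr) == cudaSuccess;
}
```

For example, `tryPinnedAlloc(1 << 20, false)` requests 1 MiB through `cudaMallocHost`, and passing `true` goes through `cudaHostAlloc` with the default flag.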
Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                1026.7

Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                1108.1

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                84097.6

[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------

Using NVIDIA bandwidthTest
Coit-grid07
bandwidthTest Starting...
Running on...
Device 0: Tesla C2050
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                4773.7

Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                4060.4

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                84254.9

[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
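A rough sketch of how a host-to-device figure of this kind might be measured is given below. It is not the source of NVIDIA's bandwidthTest; the helper name `hostToDeviceMBps` is hypothetical, MB is taken to mean 2^20 bytes (an assumption), and a single timed copy is noisier than the averaged measurements a real benchmark would use.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: time one host-to-device copy of `bytes` and return
// the bandwidth in MB/s (MB = 2^20 bytes). Returns a negative value on error.
// `pinned` selects cudaMallocHost() instead of malloc() for the host buffer.
float hostToDeviceMBps(size_t bytes, bool pinned) {
    void *host = nullptr;
    void *dev = nullptr;

    if (pinned) {
        if (cudaMallocHost(&host, bytes) != cudaSuccess) return -1.0f;
    } else {
        host = malloc(bytes);
        if (host == nullptr) return -1.0f;
    }
    memset(host, 0, bytes);  // touch every page before timing

    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        if (pinned) cudaFreeHost(host); else free(host);
        return -1.0f;
    }

    // CUDA events time the copy on the GPU's own clock.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    if (pinned) cudaFreeHost(host); else free(host);

    if (ms <= 0.0f) return -1.0f;
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}
```

Calling `hostToDeviceMBps(33554432, false)` and `hostToDeviceMBps(33554432, true)` uses the same transfer size as the listings above and lets the pageable and page-locked cases be compared on a given machine.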
CUDA Streams
A CUDA Stream is a sequence of operations (commands) that are executed in order.
Multiple CUDA streams can be created and executed together and interleaved, although the “program order” is always maintained within each stream.
Streams provide a mechanism to overlap memory transfers and computation in different streams for increased performance, if sufficient resources are available.
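The idea can be sketched with two streams, each carrying its own copy-in, kernel and copy-out. This is a minimal illustration under stated assumptions, not the example from the book: the kernel `addOne` and the helper `runTwoStreams` are hypothetical names, and whether the two streams actually overlap depends on the device.

```cuda
#include <cuda_runtime.h>

// Trivial kernel: add one to each element of a chunk.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: processes 2*n ints as two chunks, one per stream.
// Returns true if every element came back incremented by one.
bool runTwoStreams(int n) {
    const int kStreams = 2;
    const size_t chunkBytes = n * sizeof(int);
    int *host = nullptr;
    int *dev[kStreams] = {nullptr, nullptr};
    cudaStream_t stream[kStreams];

    // Asynchronous copies need page-locked host memory.
    if (cudaMallocHost((void **)&host, kStreams * chunkBytes) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < kStreams * n; ++i) host[i] = i;

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&dev[s], chunkBytes);
    }

    // Within a stream the three operations run in the order issued.
    // Operations in different streams may run concurrently.
    for (int s = 0; s < kStreams; ++s) {
        int *chunk = host + s * n;
        cudaMemcpyAsync(dev[s], chunk, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        addOne<<<(n + 255) / 256, 256, 0, stream[s]>>>(dev[s], n);
        cudaMemcpyAsync(chunk, dev[s], chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    bool ok = true;
    for (int s = 0; s < kStreams; ++s) {
        // Wait for all work in this stream before touching its results.
        const bool synced = (cudaStreamSynchronize(stream[s]) == cudaSuccess);
        ok = ok && synced;
        cudaStreamDestroy(stream[s]);
        cudaFree(dev[s]);
    }
    for (int i = 0; ok && i < kStreams * n; ++i) ok = (host[i] == i + 1);

    cudaFreeHost(host);
    return ok;
}
```

The stream argument appears as the last parameter of `cudaMemcpyAsync` and as the fourth launch parameter of the kernel. A call such as `runTwoStreams(1 << 20)` checks correctness only; measuring the benefit would need timing with and without streams.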
Creating a stream
A stream is created by creating a stream object and associating it with a series of CUDA commands, which then become the stream. CUDA commands take a stream pointer as an argument: