Parallelize Computer Vision by GPGPU Computing
(Computer Vision Parallelization by GPGPU)
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系)
[email protected] http://www.ykwang.tw
2014/07/17
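for (i = height*2/3; i < height; i++)
    for (j = 0; j < width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);

OpenMP / CUDA (SPMD): fork (threads), join (barrier)
[Figure: image indexed by i and j; loop iterations i = 0 to 11 shown in three groups (i = 0 to 3, i = 4 to 7, i = 8 to 11), labeled subdomain 1, subdomain 2, subdomain 3]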
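The loop above covers only the last third of the image rows, which suggests a row-wise domain decomposition. A minimal C sketch of that idea follows. The names `remove_noise`, `process_subdomain`, and `process_image` are hypothetical, and the clamp inside `remove_noise` is only a placeholder for the slide's `RemoveNoise`, whose real body is not given. The OpenMP pragma is guarded so that the code also compiles and runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the slide's RemoveNoise(): clamps to [16, 235]. */
static unsigned char remove_noise(unsigned char p) {
    if (p < 16) return 16;
    if (p > 235) return 235;
    return p;
}

/* Process rows [row_begin, row_end) of a row-major image: one subdomain. */
static void process_subdomain(const unsigned char *src, unsigned char *dst,
                              int width, int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; i++)
        for (int j = 0; j < width; j++)
            dst[(size_t)i * width + j] = remove_noise(src[(size_t)i * width + j]);
}

/* Split the image into `parts` horizontal subdomains. With OpenMP enabled,
   the subdomains are forked across threads and joined at the implicit
   barrier at the end of the loop. */
void process_image(const unsigned char *src, unsigned char *dst,
                   int width, int height, int parts) {
    assert(src != NULL && dst != NULL);
    assert(width > 0 && height > 0 && parts > 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int s = 0; s < parts; s++) {
        int row_begin = (int)((long)height * s / parts);
        int row_end = (int)((long)height * (s + 1) / parts);
        process_subdomain(src, dst, width, row_begin, row_end);
    }
}
```

With `parts = 3`, the third subdomain runs from `height*2/3` to `height`, which matches the bounds of the loop on the slide.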
GPGPU Programming: SIMT model
❖ CPU (“host”) program often written in C or C++
❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect
GPGPU Programming Techniques
CUDA
OpenCL
C++ AMP
Renderscript
GPGPU Programming Techniques
CUDA
CUDA
❖ CUDA: Compute Unified Device Architecture
❖ Parallel programming for nVidia's GPGPU
❖ Use C/C++ language
• Java, Fortran, and Matlab can also be used
❖ When executing CUDA programs, the GPU operates as a coprocessor to the main CPU
CUDA Hardware Environment: CPU+GPU
❖ CPU
• Organizes, interprets, and communicates information
❖ GPU
• Handles the core processing on large quantities of parallel information
• Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU
[Figure: CPU and GPU connected by PCI-E]
CUDA Software Stack
Processing Flow on CUDA
1. Allocate device memory
2. Copy processing data (main memory to memory for GPU)
3. Instruct the processing (CPU to GPU)
4. Execute in parallel in each core
5. Copy the result (memory for GPU to main memory)
6. Release device memory
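The six steps of this flow can be walked through in plain C. The sketch below is a host-only emulation, not CUDA code: `malloc`, `memcpy`, and `free` stand in for `cudaMalloc`, `cudaMemcpy`, and `cudaFree`, and a loop over thread indices stands in for the kernel launch. The names `invert_kernel` and `run_flow` are hypothetical, and the pixel inversion is just a placeholder workload.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* "Kernel": the work of one thread, identified by its index idx.
   Hypothetical workload: invert one 8-bit pixel. */
static void invert_kernel(unsigned char *data, int n, int idx) {
    if (idx < n)
        data[idx] = (unsigned char)(255 - data[idx]);
}

/* Host-only emulation of the six-step flow.
   Returns 0 on success, -1 if the buffer cannot be allocated. */
int run_flow(unsigned char *host_data, int n) {
    assert(host_data != NULL && n > 0);

    /* 1. Allocate device memory (cudaMalloc in CUDA; malloc here). */
    unsigned char *dev = malloc((size_t)n);
    if (dev == NULL)
        return -1;

    /* 2. Copy processing data, host to device (cudaMemcpy in CUDA). */
    memcpy(dev, host_data, (size_t)n);

    /* 3 and 4. Instruct the processing and execute in each core.
       In CUDA this is a kernel launch; here every thread index runs in turn. */
    for (int idx = 0; idx < n; idx++)
        invert_kernel(dev, n, idx);

    /* 5. Copy the result, device to host (cudaMemcpy in CUDA). */
    memcpy(host_data, dev, (size_t)n);

    /* 6. Release device memory (cudaFree in CUDA). */
    free(dev);
    return 0;
}
```

The kernel body is written as sequential code for a single index, in line with the SIMT model described earlier in the deck, where the same code runs once per thread.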
Programming with Memory Hierarchy
❖ Locality principle
• Temporal locality
• Spatial locality
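A small C sketch can make the two kinds of locality concrete. It assumes an 8-bit image stored in row-major order; the function names `sum_row_major` and `sum_column_major` are hypothetical. Both functions return the same sum, but the first walks adjacent addresses (spatial locality) while the second jumps by `width` bytes on each step. In both, the accumulator `sum` is reused on every iteration (temporal locality).

```c
#include <assert.h>
#include <stddef.h>

/* Row-major traversal: consecutive iterations touch adjacent addresses. */
long sum_row_major(const unsigned char *img, int width, int height) {
    assert(img != NULL && width > 0 && height > 0);
    long sum = 0;
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            sum += img[(size_t)i * width + j];
    return sum;
}

/* Column-major traversal of the same buffer: each step strides by `width`,
   so neighboring accesses land far apart in memory. */
long sum_column_major(const unsigned char *img, int width, int height) {
    assert(img != NULL && width > 0 && height > 0);
    long sum = 0;
    for (int j = 0; j < width; j++)
        for (int i = 0; i < height; i++)
            sum += img[(size_t)i * width + j];
    return sum;
}
```

On large images the row-major version typically makes better use of caches, although the actual difference depends on the hardware and is not measured here.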
Offline Compiler Flow
Renderscript Compiler: libbcc
Renderscript Project Framework
Example - Hello World (1/8)
Example - Hello World (2/8): HelloWorld.java
Example - Hello World (3/8): HelloWorld.java
Example - Hello World (4/8): HelloWorldView.java
Example - Hello World (5/8): HelloWorldView.java
Example - Hello World (6/8): HelloWorldRS.java
Example - Hello World (7/8): HelloWorldRS.java
Example - Hello World (7/8): ScriptC_helloworld.java
Example - Hello World (7/8): ScriptC_helloworld.java
Example - Hello World (8/8): HelloWorld.rs
Comparison (1/2)
❖ Renderscript vs. Native (NDK) vs. Java (SDK)
• OS: Honeycomb v3.2 (CPU only)
Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." in Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC). 2012.
Comparison (2/2)
❖ OpenCL & CUDA
• Sobel filter with (CMw) and without (CMw/o) constant memory
OpenCL's portability does not fundamentally affect its performance
Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A comprehensive performance comparison of CUDA and OpenCL." in Proc. International Conference Parallel Processing (ICPP), 2011.
• No iterative process; suitable for parallelization
• Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]
MSRCR Algorithm
• R_MSRCR_i(x, y): the MSRCR output
• I_i(x, y): the original image distribution in the ith spectral band
• F_k(x, y): the kth Gaussian surround function
• *: the convolution operation
• w_k: the weight
• C_i(x, y): the color restoration factor in the ith spectral band
• N: the number of spectral bands
• β: the gain constant
• α: controls the strength of the nonlinearity
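For reference, the definitions above match the MSRCR formulation as it is commonly written in the literature following Rahman et al. The symbol names used below, and the number of scales K, are assumptions taken from that conventional notation and are not confirmed from the slide.

```latex
R_{MSRCR_i}(x,y) = C_i(x,y) \sum_{k=1}^{K} w_k
    \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}

C_i(x,y) = \beta \left\{ \log\left[ \alpha I_i(x,y) \right]
    - \log\left[ \sum_{n=1}^{N} I_n(x,y) \right] \right\}
```

Each output pixel depends only on the input image and on fixed Gaussian surrounds, with no iteration between pixels, which is why the computation maps well onto per-pixel GPU threads.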
The Method
[Figure: pipeline divided between CPU and GPGPU. Stages shown: Gaussian blur, log-domain processing, normalization, histogram stretching. Data is copied from CPU to GPGPU and from GPGPU back to CPU.]
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
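As an illustration of one stage in this pipeline, the sketch below shows a simple histogram stretch in C. It assumes plain min-max linear stretching on 8-bit pixels, which may differ from the stretching used in the cited papers. The function name `stretch_histogram` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Linearly stretch pixel values in place so that the darkest pixel maps
   to 0 and the brightest to 255. Assumption: plain min-max stretching. */
void stretch_histogram(unsigned char *pixels, size_t n) {
    assert(pixels != NULL && n > 0);

    /* Pass 1: find the minimum and maximum (a reduction). */
    unsigned char lo = pixels[0], hi = pixels[0];
    for (size_t i = 1; i < n; i++) {
        if (pixels[i] < lo) lo = pixels[i];
        if (pixels[i] > hi) hi = pixels[i];
    }
    if (hi == lo)
        return;  /* flat image: nothing to stretch */

    /* Pass 2: rescale each pixel independently, rounding to nearest. */
    unsigned int range = (unsigned int)(hi - lo);
    for (size_t i = 0; i < n; i++) {
        unsigned int shifted = (unsigned int)(pixels[i] - lo);
        pixels[i] = (unsigned char)((shifted * 255u + range / 2u) / range);
    }
}
```

The second pass touches each pixel independently, so it is a natural candidate for one thread per pixel. The first pass is a reduction and needs a different parallel pattern.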
RenderScript Intrinsic Example (1/2)
RenderScript Intrinsic Example (2/2)
Blur Intrinsic Performance Analysis
Performance of RenderScript Intrinsics
❖ On new Nexus 7
❖ Relative to equivalent multithreaded C implementations
RenderScript Image Processing Benchmarks (1/2)
❖ CPU only on a Galaxy Nexus device
RenderScript Image Processing Benchmarks (2/2)
Acceleration of Retinex Using RenderScript
❖ This paper presents an implementation of rsRetinex, an optimized Retinex algorithm that uses Renderscript.
❖ The experimental results show that rsRetinex could gain up to a fivefold speedup across different image resolutions.
Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image Processing on Android Device Using Renderscript." in Proc. The 8th International Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
Mobile GPGPU List
GPU | Adoption | OpenCL/CUDA | OpenCV | Renderscript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | 1.2 (302~420) | OCL module | Android 4.0 and later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 and later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 and later (K1 only)
AnandTech PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
Nvidia CEO sees future in cars and gaming, 2014/5/19, CNet.
7. Summary
Parallel Programming
❖ C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖ Java
• OpenCL, RenderScript
❖ Note that OpenCL and RenderScript are
• Not efficient in parallelization
• Efficient in CV algorithmic design
OpenCV Acceleration (1/2)
❖ Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖ Ver. 3.0 (2014/6)
• Transparent API for GPGPU acceleration
OpenCV Acceleration (2/2)
OpenCL 2.0
❖ Released in 2013
❖ SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory management
❖ Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto itself; no round trip to the host is required
❖ Disadvantage
• Requires strong hardware support
• Not well supported in current GPGPUs
CUDA still Dominant in the Near Future
❖ However, we have to manually parallelize the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
  • Filtering, frequency analysis, compression, feature extraction, recognition, ...
• Theory, tools and methodology of parallel computing
  • Communication, synchronization, resource management, load balancing, debugging, ...
GPUs for Multimedia
[Figure: example applications with reported speedups; labels listed in the order they appear]
Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
10X
CUDA JPEG Decoder
10X
DivideFrame GPU Decoder
Hyperspectral Image Compression on NVIDIA GPUs
10X
GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
26X
PowerDirector7 Ultra
3.5X
GPUs for Computer Vision (1/2)
• 87X: CUDA SURF - A Real-time Implementation for SURF (TU Darmstadt)
• 26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
• 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100X: Fast Optical Flow on GPU at Video Rate for Full HD Resolution (Onera)
• 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates on GPU (NEC Labs, Berkeley, Purdue)
• 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
GPUs for Computer Vision (2/2)
• 20X: GPU for Surveillance
• 13X: Fast Human Detection with Cascaded Ensembles
• 109X: Fast Sliding-Window Object Detection
• 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 10X: Real-time Visual Tracker by Stream Processing
• 45X: A GPU Accelerated Evolutionary Computer Vision System
• 3X: Canny Edge Detection
• 300X: Audience Measurement - Real-time Video Analysis for Counting People, Face Detection and Tracking
The Embedded Vision Alliance
Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features." Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors to reconfigurable logic: a case study." Computers, IEEE Transactions on 59.4 (2010): 433-448.
Readings (2/2)
❖ "Designing Visionary Mobile Apps Using the Tegra Android Development Pack," http://bit.ly/1jvwbgV
❖ "Getting Started With GPU-Accelerated Computer Vision Using OpenCV and CUDA," http://bit.ly/1oMwJEG
❖ "The open standard for parallel programming of heterogeneous systems," https://www.khronos.org/opencl/
acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with OpenCV." Communications of the ACM 55.6 (2012): 61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated framework for image processing and computer vision." Advances in Visual Computing. Springer Berlin Heidelberg, 2008. 430-439.
CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer, https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." First Asia-Pacific Programming Languages and Compilers Workshop. 2012.
❖ "High Performance Apps Development with RenderScript," 12th Kandroid Conference, 2013.
Parallel Computing with GPGPU
❖ Programming Massively Parallel Processors: A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html