Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu
2nd International Workshop on GPUs and Scientific Applications
Galveston Island, TX
Iterative Statistical Applications
• Consist of iterative computation and communication steps
• Growing set of applications
  – Clustering, data mining, machine learning & dimension reduction applications
  – Driven by data deluge & emerging computation fields
[Figure: iteration cycle of Compute, Communication, Reduce/barrier, then New Iteration]
Iterative Statistical Applications
• Data intensive
• Larger loop-invariant data
• Smaller loop-variant delta between iterations
  – Result of an iteration
  – Broadcast to all the workers of the next iteration
• High ratio of memory accesses to floating point operations
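The compute / reduce / new-iteration cycle and the split between loop-invariant and loop-variant data can be sketched with a deliberately small example. The following is a minimal host-side C sketch, not the kernels from this work: it assumes a one-dimensional K-means clustering, and the names `compute_step`, `reduce_step`, `kmeans_1d` and `MAX_K` are hypothetical. The `points` array plays the role of the large loop-invariant data, while the much smaller `centroids` array is the loop-variant result carried into the next iteration.

```c
#include <assert.h>
#include <string.h>

#define MAX_K 16  /* hypothetical upper bound on the number of centroids */

/* Small helper so the sketch does not need to link against libm. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Compute step: scan the loop-invariant points against the current
 * loop-variant centroids and accumulate per-cluster sums and counts. */
static void compute_step(const float *points, int n,
                         const float *centroids, int k,
                         float *sums, int *counts)
{
    memset(sums, 0, (size_t)k * sizeof(float));
    memset(counts, 0, (size_t)k * sizeof(int));
    for (int i = 0; i < n; i++) {
        int best = 0;
        float best_dist = abs_f(points[i] - centroids[0]);
        for (int c = 1; c < k; c++) {
            float d = abs_f(points[i] - centroids[c]);
            if (d < best_dist) { best_dist = d; best = c; }
        }
        sums[best] += points[i];
        counts[best]++;
    }
}

/* Reduce step: turn the accumulated sums into new centroids.
 * Returns the largest centroid movement, used as a convergence test. */
static float reduce_step(float *centroids, int k,
                         const float *sums, const int *counts)
{
    float max_shift = 0.0f;
    for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue;  /* leave empty clusters in place */
        float updated = sums[c] / (float)counts[c];
        float shift = abs_f(updated - centroids[c]);
        if (shift > max_shift) max_shift = shift;
        centroids[c] = updated;
    }
    return max_shift;
}

/* Driver: `points` is never modified; only `centroids` changes between
 * iterations. Returns the number of iterations that were executed. */
static int kmeans_1d(const float *points, int n,
                     float *centroids, int k,
                     int max_iters, float tol)
{
    assert(n > 0 && k > 0 && k <= MAX_K);
    float sums[MAX_K];
    int counts[MAX_K];
    int iter = 0;
    while (iter < max_iters) {
        compute_step(points, n, centroids, k, sums, counts);
        float shift = reduce_step(centroids, k, sums, counts);
        iter++;
        if (shift <= tol) break;  /* converged: no new iteration needed */
    }
    return iter;
}
```

Calling `kmeans_1d` on the six points 1, 2, 3, 10, 11, 12 with initial centroids 1 and 2 moves the centroids to 2 and 11. In a GPU setting the analogous idea would be to transfer `points` to the device once and send only the centroids on each iteration.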
Motivation
• Important set of applications
• Increasing power and availability of GPGPU computing
• Cloud Computing
  – Iterative MapReduce technologies
  – GPGPU computing in clouds
• Reusing loop-invariant data
• Leveraging local memory
• Optimizing data layout
• Sharing work between CPU & GPU
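As one illustration of what a data layout optimization can look like, the sketch below rearranges a point-major array into a dimension-major array on the host. The slides do not say which layouts were actually evaluated, so this is an assumption: it shows a common transformation that lets neighbouring work items read neighbouring addresses. The helper names (`index_point_major`, `index_dim_major`, `to_dim_major`, `neighbour_stride`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by both index functions below. */
typedef size_t (*index_fn)(size_t i, size_t d, size_t n, size_t dims);

/* Point-major layout: all dimensions of point i are stored together. */
static size_t index_point_major(size_t i, size_t d, size_t n, size_t dims)
{
    (void)n;
    return i * dims + d;
}

/* Dimension-major layout: dimension d of all points is stored together. */
static size_t index_dim_major(size_t i, size_t d, size_t n, size_t dims)
{
    (void)dims;
    return d * n + i;
}

/* One-time host-side rearrangement, done before the first iteration
 * because the point data is loop-invariant. */
static void to_dim_major(const float *src, float *dst, size_t n, size_t dims)
{
    assert(src != dst);  /* this sketch is not an in-place transpose */
    for (size_t i = 0; i < n; i++)
        for (size_t d = 0; d < dims; d++)
            dst[index_dim_major(i, d, n, dims)] =
                src[index_point_major(i, d, n, dims)];
}

/* Distance, in elements, between the addresses that work items i and i+1
 * touch when both read dimension 0 of their own point. */
static size_t neighbour_stride(index_fn index, size_t n, size_t dims)
{
    return index(1, 0, n, dims) - index(0, 0, n, dims);
}
```

With the point-major layout the stride between neighbouring work items equals `dims`, whereas the dimension-major layout gives a stride of 1. Whether that translates into a speedup depends on the device and the kernel, so it would need to be measured rather than assumed.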
OpenCL experience
• Flexible programming environment
• Support for work group level synchronization primitives
• Lack of debugging support
• Lack of dynamic memory allocation
• More of a compilation target than a user programming environment?
Future Work
• Extending kernels to distributed environments
• Comparing with CUDA implementations
• Exploring more aggressive CPU/GPU sharing
• Studying more application kernels
• Data reuse in the pipeline
Acknowledgements
• This work was started as a class project for CSCI-B649: Parallel Architectures (Spring 2010) at IU School of Informatics and Computing.
• Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02.
• We thank Sueng-Hee Bae, BingJing Zang, Li Hui and the Salsa group (http://salsahpc.indiana.edu/) for the algorithmic insights.