“SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!”
Bogazici University, Istanbul, Turkey
Presented by: Dr. Abu Asaduzzaman, Assistant Professor in Computer Architecture and Director of CAPPLab
Department of Electrical Engineering and Computer Science (EECS)
Wichita State University (WSU), USA
June 2, 2014
Dr. Zaman 2
“SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!”
Outline
■ Introduction
  Single-Core to Multicore Architectures
■ Performance Improvement
  Simultaneous Multithreading (SMT)
  (SMT enabled) Multicore CPU with GPUs
Introduction

Moore’s law
■ The number of transistors on integrated circuits doubles approximately every 18 months.
Introduction
Amdahl’s law vs. Gustafson’s law
■ Amdahl’s law: The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.
■ Gustafson’s law: Computations involving arbitrarily large data sets can be efficiently parallelized.
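The contrast can be made concrete with a small sketch in C; the function names and the example values (95% parallel fraction, 8 processors) are illustrative, not from the slides:

```c
/* Amdahl's law: for a fixed problem size, speedup on n processors is
   capped by the serial fraction (1 - p) of the program. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Gustafson's law: if the problem size grows with n, the scaled
   speedup keeps increasing with the processor count. */
double gustafson_speedup(double p, int n)
{
    return (1.0 - p) + p * (double)n;
}
```

For p = 0.95 and n = 8, Amdahl’s law predicts only about 5.9x speedup while Gustafson’s law predicts 7.65x; the gap widens as n grows.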
Introduction
Law of diminishing returns
■ In all productive processes, adding more of one factor of production while holding all others constant will at some point yield lower per-unit returns.
Introduction
Koomey’s law
■ The number of computations per joule of energy dissipated has been doubling approximately every 1.57 years. This trend has been remarkably stable since the 1950s.
Introduction
Single-Core to Multicore Architecture
■ History of Computing
  The word “computer” appears in 1613 (this is not the beginning)
  Von Neumann architecture (1945) – shared data/instruction memory
  Harvard architecture (1944) – separate data memory and instruction memory
■ Single-Core Processors
  In most modern processors: split CL1 (I1, D1), unified CL2, …
  Intel Pentium 4, AMD Athlon Classic, …
■ Popular Programming Languages
  C, …
(Single-Core to) Multicore Architecture
Courtesy: Jernej Barbič, Carnegie Mellon University
[Figure: Input → Process/Store → Output; multi-tasking via time sharing (juggling!); cache not shown]
Introduction

Single-Core “Core”
[Figure: a single core]
Courtesy: Jernej Barbič, Carnegie Mellon University
■ CAPPLab
  “People First”
  Resources
  Research Grants/Activities
■ Discussion
Parallel/Concurrent Computing
Parallel Processing – It is not fun!
Let’s play a game: paying the lunch bill together.
Started with $30; spent $29 ($27 + $2). Where did $1 go?

Friend | Before Eating | After Paying | Total Spent
A      | $10           | $1           | $9
B      | $10           | $1           | $9
C      | $10           | $1           | $9
Total  | $30           | $3           | $27

(Total bill: $25; $5 returned; $2 left as tip; each friend took $1 back.)

The “missing” dollar is an accounting trick: the $27 spent already includes the $2 tip ($25 bill + $2 tip), so adding the tip again double-counts it. The correct check is $27 spent + $3 returned = $30.
SMT enabled Multicore CPU with Manycore GPU for Ultimate Performance!
Performance Improvement
Simultaneous Multithreading (SMT)
■ Thread
  A running program (or code segment) is a process; a process spawns processes/threads
■ Simultaneous Multithreading (SMT)
  Multiple threads running in a single processor at the same time
  Multiple threads running in multiple processors at the same time
■ Multicore Programming Language Support
  OpenMP, Open MPI, CUDA, … with C
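A minimal sketch of generating and managing threads with OpenMP in C (the function name is illustrative; compiled without -fopenmp the pragma is simply ignored and the loop runs serially with the same result):

```c
/* Sum 0..n-1. OpenMP splits the loop iterations across a team of
   threads; the reduction clause gives each thread a private partial
   sum and combines them at the end. */
long parallel_sum(long n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

Because every iteration is independent, the result is identical whether one thread or many execute the loop; only the wall-clock time changes.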
Performance Improvement
Simultaneous Multithreading (SMT)
■ Example
■ Generating/Managing Multiple Threads
  OpenMP, Open MPI, … with C
A GPU card with 16 streaming multiprocessors (SMs)
Inside each SM:
• 32 cores
• 64 KB shared memory
• 32K 32-bit registers
• 2 schedulers
• 4 special function units
■ CUDA GPGPU Programming Platform
Performance Improvement
CPU-GPU Technology
■ Tasks/data exchange mechanism
  Serial computations – CPU
  Parallel computations – GPU
Performance Improvement
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel on the GPU in 4 steps
(Step 1) CPU allocates GPU memory and copies data to the GPU
CUDA API: cudaMalloc(), cudaMemcpy()
Performance Improvement
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel on the GPU in 4 steps
(Step 2) CPU sends function parameters and instructions to the GPU
CUDA API: myFunc<<<Blocks, Threads>>>(parameters)
Performance Improvement
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel on the GPU in 4 steps
(Step 3) GPU executes the instructions, scheduled in warps
(Step 4) Results are copied back to host memory (RAM) using cudaMemcpy()
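The four steps can be sketched with a hypothetical element-wise vector-addition kernel; the names vecAdd, h_*, and d_* are illustrative, not from the slides:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: one GPU thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    float *d_a, *d_b, *d_c;

    // Step 1: allocate device memory and copy the inputs to the GPU
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel with a <<<blocks, threads>>> configuration
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Step 3: the GPU schedules the launched threads in warps of 32
    //         and executes them on the SMs

    // Step 4: copy the results back to host memory (RAM)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Building and running this requires an NVIDIA GPU and the nvcc compiler.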
Performance Improvement
Case Study 1 (data-independent computation without GPU/CUDA)
■ Matrix Multiplication
[Figures: matrices and systems used]
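Matrix multiplication is data-independent: every element of the product depends only on the inputs, never on another output element, so the loop iterations can run in any order or in parallel. A minimal C sketch (the dimension and function name are illustrative; the OpenMP pragma is one way to spread the outer loop across cores):

```c
#define N 64

/* Naive matrix multiplication C = A x B. Each C[i][j] reads only
   row i of A and column j of B, so all iterations are independent
   and the outer loop can safely run across threads. */
void matmul(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```

Without -fopenmp the pragma is ignored and the same code serves as the sequential baseline used for comparison.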
Performance Improvement
Case Study 1 (data-independent computation without GPU/CUDA)
■ Matrix Multiplication
[Figures: execution time and power consumption results]
Performance Improvement
Case Study 2 (data-dependent computation without GPU/CUDA)
■ Heat Transfer on 2D Surface
[Figures: execution time and power consumption results]
Performance Improvement
Case Study 3 (data-dependent computation with GPU/CUDA)
■ Fast Effective Lightning Strike Simulation
  The lack of lightning strike protection for composite materials limits their use in many applications.
Performance Improvement
Case Study 3 (data-dependent computation with GPU/CUDA)
■ Fast Effective Lightning Strike Simulation
■ Laplace’s Equation
■ Simulation
  CPU only
  CPU/GPU without shared memory
  CPU/GPU with shared memory
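The CPU-only baseline of a finite-difference Laplace solver can be sketched as a Jacobi sweep (grid size and function name are illustrative); the CUDA variants assign one thread per grid point, and the shared-memory version additionally tiles the grid so neighbor reads come from fast on-chip memory:

```c
#define NX 64
#define NY 64

/* One Jacobi iteration of the finite-difference Laplace equation:
   each interior point is replaced by the average of its four
   neighbors, while boundary values are held fixed. The computation
   is data-dependent across iterations: every sweep needs the full
   result of the previous one. */
void jacobi_step(double in[NX][NY], double out[NX][NY])
{
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                              + in[i][j-1] + in[i][j+1]);
}
```

Repeating the sweep (swapping the in/out grids each time) converges toward the steady-state potential field used in the lightning-strike simulation.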
Performance Improvement
Case Study 4 (MATLAB vs. GPU/CUDA)
■ Different simulation models
  Traditional sequential program
  CUDA program (no shared memory)
  CUDA program (with shared memory)
  Traditional sequential MATLAB
  Parallel MATLAB
■ CUDA/C parallel programming of the finite-difference Laplace’s equation solver demonstrates up to 257x speedup and 97% energy savings over a parallel MATLAB implementation while solving a 4K x 4K problem with reasonable accuracy.
Identify More Challenges
■ Sequential data-independent problems
■ Students
  Kishore Konda Chidella, PhD Student
  Mark P Allen, MS Student
  Chok M. Yip, MS Student
  Deepthi Gummadi, MS Student
■ Collaborators
  Mr. John Metrow, Director of WSU HiPeCC
  Dr. Larry Bergman, NASA Jet Propulsion Laboratory (JPL)
  Dr. Nurxat Nuraje, Massachusetts Institute of Technology (MIT)
  Mr. M. Rahman, Georgia Institute of Technology (Georgia Tech)
  Dr. Henry Neeman, University of Oklahoma (OU)
■ Hardware
  2 CUDA PCs – CPU: Xeon E5506, …
  Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3, Kepler card) via remote access to WSU HiPeCC
  2 CUDA-enabled laptops
  More …
■ Software
  CUDA, OpenMP, and Open MPI (C/C++ support)
  MATLAB, VisualSim, CodeWarrior, more (as may be needed)
WSU CAPPLab
Scholarly Activities
■ WSU became a “CUDA Teaching Center” for 2012-13
  Grants from NSF, NVIDIA, M2SYS, Wiktronics
  Teaching Computer Architecture and Parallel Programming
■ Publications
  Journal: 21 published; 3 under preparation
  Conference: 57 published; 2 under review; 6 under preparation
  Book chapter: 1 published; 1 under preparation
■ Outreach
  USD 259 Wichita Public Schools
  Wichita area technical and community colleges
  Open to collaborate
WSU CAPPLab
Research Grants/Activities
■ Grants
  WSU: ORCA
  NSF: KS NSF EPSCoR First Award
  M2SYS-WSU Biometric Cloud Computing Research Grant
  Teaching (Hardware/Financial) Award from NVIDIA
  Teaching (Hardware/Financial) Award from Xilinx
■ Proposals
  NSF: CAREER (working/pending)
  NASA: EPSCoR (working/pending)
  U.S.: Army, Air Force, DoD, DoE
  Industry: Wiktronics LLC, NetApp Inc, M2SYS Technology
Bogazici University; Istanbul, Turkey; 2014
“SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!”