COMP 635: Seminar on Heterogeneous Processors
Vivek Sarkar
Department of Computer Science, Rice University
[email protected]
August 27, 2007
www.cs.rice.edu/~vsarkar/comp635
COMP 635, Fall 2007 (V.Sarkar)

Course Goals
• Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum
• Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)
• Discuss research challenges in advancing the state of the art of software for heterogeneous processors
• Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or understanding strengths and weaknesses of heterogeneous processors w.r.t. their research areas
Course Logistics
• Schedule exception (conference week) — no class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
• Time & Place
— Default: Mondays, 3:30pm - 4:30pm, DH 2014
— Exception: time & place for the 9/20 (Thurs) lecture TBD
— 30 minutes reserved after lecture for discussion (optional)
• Office Hours (DH 3131)
— 11am - 12noon, Fridays from 8/31/07 to 12/7/07
• OWL-Space repository: COMP 635 F07
• Grading
— Satisfactory/unsatisfactory grade for students taking the seminar for credit
– Others should register officially as auditors, if possible
— For a satisfactory grade, you need to:
1. Attend at least 50% of lectures
2. Submit a 4-page project/study report by 12/7/07 (the report can be prepared in a group - just plan on 4 pages/person in that case)
— Optional in-class presentation of project/study report on 12/3/07
Course Content
• Introduction to Heterogeneous Processors and their Programming Models (1 lecture)
• Cell Processor and Cell SDK (2 lectures)
• Nvidia GPU and CUDA programming environment (2 lectures)
• DRC FPGA Coprocessor Module and Celoxica Programming Environment (1 lecture)
• Clearspeed Accelerator and SDK (1 lecture)
• Imagine Stream Processor (1 lecture)
• Microsoft Accelerator Library (1 lecture)
• Vector and SIMD processors -- a historical perspective (1 lecture)
• Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)
• Student presentations (1 lecture)
COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models
• Storage Models: Shared vs. Local vs. Partitioned Memories
• Hybrid combinations of the above
Only a limited subset of these models is in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers
Spectrum of Programmers for Heterogeneous Processors
• Application-level Users
— Plug & play experience by using ISV frameworks such as MATLAB, Mathematica, etc.
• Library-level Programmers
— Portable library interface that works across homogeneous and heterogeneous processors
• Language-level Programmers
— Portable programming language that works across homogeneous and heterogeneous processors
— Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!
• SDK-level Programmers
— C-based compilers and tools that are specific to a given heterogeneous processor
Focus of this course
Cell Broadband Engine (BE)
Cell Performance
Cell Temperature Distribution
Power and heat are key constraints
Code Partitioning for Cell
[Figure: flow graph and call graph view of code partitioning; the outlined loop is cloned and compiled separately for the PPE and the SPE]
• Outlining: extract parallel loop into a separate procedure
• Cloning: make separate copies for PPE and SPE, including clones of all procedures called from the loop
• Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes
• Reference: “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006
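The outlining and cloning steps above can be sketched in plain C. The names below are illustrative, not from the Cell SDK; a real compiler (such as the one described in the Eichenberger et al. reference) performs this transformation automatically:

```c
#include <stddef.h>

/* Before outlining: the parallel loop sits inline in the caller. */
void scale_inline(float *a, float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] *= s;
}

/* After outlining: the loop is extracted into a separate procedure
 * taking an explicit argument record.  The compiler can then clone
 * this procedure (and every procedure it calls) and compile one
 * copy for the PPE and one for the SPE. */
typedef struct {
    float *a;
    float  s;
    size_t lo, hi;   /* iteration range assigned to this element */
} scale_args_t;

void scale_outlined(scale_args_t *args) {
    for (size_t i = args->lo; i < args->hi; i++)
        args->a[i] *= args->s;
}
```

The coordination step (signal registers and mailbox queues) is omitted here; in the real SDK it surrounds the call to the outlined procedure on both sides.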
Why GPUs?
• A quiet revolution and potential build-up
— Calculation: 367 GFLOPS vs. 32 GFLOPS
— Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
— Until last year, programmed through graphics API
— GPU in every PC and workstation - massive volume and potential impact
Sample GPU Applications
Application | Description | Source | Kernel | % time
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
LBM | SPEC ‘06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
H.264 | SPEC ‘06 version, change in guess vector | 34,811 | 194 | 35%
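As a concrete anchor for the table above, SAXPY computes y = a·x + y over single-precision vectors. A minimal sequential C version is shown below; on a GPU, the loop iterations are partitioned across thousands of threads:

```c
#include <stddef.h>

/* Single-precision a*x plus y: y[i] = a * x[i] + y[i].
 * On a GPU, each iteration becomes one thread's (or part of one
 * thread's) work; the sequential loop disappears entirely. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```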
Performance of Sample Kernels and Applications
• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel
Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt
FPGAs: Basics of FPGA Offload
Source: “Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications” by Doug Johnson, [email protected], gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/14_Doug_Johnson.pdf
FPGA Acceleration Examples
ClearSpeed Multi-Threaded Array Processor (MTAP)
• Hardware multi-threading for latency tolerance
• Asynchronous, overlapped I/O
• Poly execution unit contains 96 Processor Elements (PEs), or cores
• Array of PEs operates in a synchronous manner, i.e., each PE executes the same instruction on its data
Source: “Accelerating HPC Applications with ClearSpeed” by Daniel Kliger, [email protected], www.cse.scitech.ac.uk/disco/mew17/talks/ClearSpeed%20Daresbury%20MEW%202006.pdf
• Baseline system: 16GB memory per node
– Single server: 34 GFLOPS
– Four node cluster: 136 GFLOPS
– Power consumption: 1,940 Watts
– Benchmark runtime: 48.4 minutes
• ClearSpeed Accelerated System
— Add two Advance accelerator boards per node (25W per board!)
– Single server: 90.1 GFLOPS
– Four node cluster: 364.2 GFLOPS
– Power consumption: 2,140 Watts
– Benchmark runtime: 18.4 minutes
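A quick back-of-the-envelope check of these figures, assuming the quoted power numbers cover the whole four-node cluster, shows the accelerated system more than doubles performance-per-watt:

```c
#include <stdio.h>

/* Efficiency arithmetic for the four-node cluster figures above.
 * Assumption: the quoted power numbers apply to the whole cluster. */
double gflops_per_watt(double gflops, double watts) {
    return gflops / watts;
}

void compare_efficiency(void) {
    double base  = gflops_per_watt(136.0, 1940.0);   /* baseline    */
    double accel = gflops_per_watt(364.2, 2140.0);   /* accelerated */
    printf("baseline:        %.3f GFLOPS/W\n", base);
    printf("accelerated:     %.3f GFLOPS/W\n", accel);
    printf("efficiency gain: %.2fx\n", accel / base);
}
```

The point of the slide survives the arithmetic: adding 200W of accelerator boards (8 boards at 25W) nearly triples delivered GFLOPS.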
ClearSpeed’s CSXL acceleration library
The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.
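Because the interception happens at the standard BLAS entry points, application code is unchanged. For reference, a Level 3 DGEMM call computes C = alpha·A·B + beta·C; the sketch below shows only the semantics for the row-major, non-transposed case, not the accelerated implementation:

```c
#include <stddef.h>

/* Reference semantics of BLAS DGEMM for row-major, non-transposed
 * matrices: C = alpha*A*B + beta*C, with A (m x k), B (k x n),
 * C (m x n).  A tuned library (or CSXL's intercepted version)
 * computes the same result far faster. */
void dgemm_ref(size_t m, size_t n, size_t k, double alpha,
               const double *A, const double *B,
               double beta, double *C) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i*k + p] * B[p*n + j];
            C[i*n + j] = alpha * acc + beta * C[i*n + j];
        }
}
```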
Imagine Stream Processor
Transforming Memory Accesses to Communication for Scalability
Software challenge: deliver productivity of shared memory model, combined with scalability of communication model
Opportunity for new languages to reduce compiler effort and broaden applicability
Code Partitioning for Heterogeneous Processors
• Factors to consider when extracting a region of code for execution on an accelerator
— Matching operations in the code region with primitives in the accelerator (includes instruction selection and FPGA synthesis)
— Establishing coherence between main and local memories
— Obeying local memory size constraints
— Volume of data to be communicated
— Granularity of region relative to overhead of thread creation
— Structural constraints of task/thread being extracted
— Cloning of code that needs to be executed on multiple elements
— Coordination with rest of the program (coroutine vs. macro-dataflow models)
— . . .
Reading List for Next Lecture (Sep 10th)
1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006, http://researchweb.watson.ibm.com/journal/sj/451/eichenberger.pdf
2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al., PPoPP 2007 Best Paper, March 2007, http://portal.acm.org/ft_gateway.cfm?id=1229445&type=pdf&coll=portal&dl=ACM&CFID=14018324&CFTOKEN=91433508
Announcement: Kickoff Meeting for Habanero Multicore Software Research Project
Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and is synergistic with the expertise we have in various CS groups at Rice including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups. A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH 3076. Cookies will be served!