Vivek Sarkar
Department of Computer Science
Rice University
September 10, 2007

COMP 635: Seminar on Heterogeneous Processors
Lecture 2: Introduction to the Cell Processor
www.cs.rice.edu/~vsarkar/comp635

COMP 635, Fall 2007 (V.Sarkar)

Announcements
• Class dates (REMINDER)
— 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
— No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007)
— No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
— Time & Place
– Default: Mondays, 3:30pm - 4:30pm, DH 1075
– Exception: 9/20 (Thurs) lecture, 3:30pm - 4:30pm, DH 3076
– 30 minutes reserved after each lecture for discussion (optional)
— Office Hours (DH 3131)
– 11am - 12noon, Fridays from 8/31/07 to 12/7/07
• Volunteers needed to lead discussion of papers in next lecture (9/20)
1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006
2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al, PPoPP 2007 Best Paper, March 2007.
• CELL HACK-A-THON II, Austin, September 22 - 25
— See http://www.hpc-consortium.net/events/200709/ for details
— Contact me if you’re interested in attending so as to work on a class project
Acknowledgments
• MIT 6.189 IAP 2007, Jan 2007, Lecture 2, “Introduction to the Cell Processor”, Michael Perrone, http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• Georgia Tech, Sony/Toshiba/IBM Workshop on Software and Applications for the Cell/B.E. processor, June 18-19, 2007, http://sti.cc.gatech.edu/program.html
— Code and Data Partitioning for the Local Stores on the Cell/B.E. processor, Kevin O'Brien, Kathryn O'Brien, Zehra Sura, Tao Zhang and Tong Chen, http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf
• U. Penn. Systems Seminar on “Cell Processor,” Diana Palsetia, 11/21/2006, www.cis.upenn.edu/~palsetia/cellproc.ppt
Outline
• Cell Processor and Software Environment
— http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• Code and Data Partitioning for the Local Stores on the Cell/BE processor
— http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf
• Yuan Zhao -- Experiences with compiling for Cell
Cell History
• IBM, SCEI/Sony, Toshiba Alliance formed in 2000
• Design Center opened in March 2001
— Based in Austin, Texas
• Single Cell BE operational Spring 2004
• 2-way SMP operational Summer 2004
• February 7, 2005: First technical disclosures
• October 6, 2005: Mercury Announces Cell Blade
• November 9, 2005: Open Source SDK & Simulator Published
• November 14, 2005: Mercury Announces Turismo Cell Offering
• February 8, 2006: IBM Announced Cell Blade
Cell Chip
Cell Features
• Heterogeneous multicore system architecture
— Power Processor Element for control tasks
— Synergistic Processor Elements for data-intensive processing
Memory Flow Controller Commands

DMA Commands
Put - Transfer from Local Store to EA space
Puts - Transfer and Start SPU execution
Putr - Put Result - (Arch. Scarf into L2)
Putl - Put using DMA List in Local Store
Putrl - Put Result using DMA List in LS (Arch)
Get - Transfer from EA Space to Local Store
Gets - Transfer and Start SPU execution
Getl - Get using DMA List in Local Store
Sndsig - Send Signal to SPU

Command Modifiers: <f,b>
f: Embedded Tag Specific Fence
   Command will not start until all previous commands in same tag group have completed
b: Embedded Tag Specific Barrier
   Command and all subsequent commands in same tag group will not start until previous commands in same tag group have completed

Command Parameters
LSA - Local Store Address (32 bit)
EA - Effective Address (32 or 64 bit)
TS - Transfer Size (16 bytes to 16K bytes)
LS - DMA List Size (8 bytes to 16K bytes)
TG - Tag Group (5 bit)
CL - Cache Management / Bandwidth Class

SL1 Cache Management Commands
sdcrt - Data cache region touch (DMA Get hint)
sdcrtst - Data cache region touch for store (DMA Put hint)
sdcrz - Data cache region zero
sdcrs - Data cache region store
sdcrf - Data cache region flush

getllar - DMA 128 bytes from EA to LS and set Reservation
putllc - Conditionally DMA 128 bytes from LS to EA
putlluc - Unconditionally DMA 128 bytes from LS to EA

barrier - all previous commands complete before subsequent commands are started
mfcsync - Results of all previous commands in Tag group are remotely visible
mfceieio - Results of all preceding Puts commands in same group visible with respect to succeeding Get commands
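
One consequence of the Transfer Size parameter listed above is that a large transfer has to be issued as several DMA commands, each no larger than 16K bytes. The following is a minimal host-side sketch of that splitting, not real MFC code: `plan_dma_chunks`, `MAX_DMA_BYTES` and `MIN_DMA_BYTES` are hypothetical names, and the sketch assumes the total size is a multiple of 16 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DMA_BYTES 16384u  /* upper bound on TS from the parameter list (16K bytes) */
#define MIN_DMA_BYTES 16u     /* lower bound on TS from the parameter list */

/* Hypothetical helper: fill chunk_sizes[] with the size of each DMA command
 * needed to move `total` bytes, and return the number of commands.
 * Returns 0 if more than `max_chunks` commands would be needed. */
size_t plan_dma_chunks(size_t total, size_t *chunk_sizes, size_t max_chunks)
{
    assert(total % MIN_DMA_BYTES == 0);  /* assumption of this sketch */
    size_t n = 0;
    while (total > 0) {
        size_t this_chunk = total < MAX_DMA_BYTES ? total : MAX_DMA_BYTES;
        if (n == max_chunks)
            return 0;                    /* plan does not fit in the caller's array */
        chunk_sizes[n++] = this_chunk;
        total -= this_chunk;
    }
    return n;
}
```

On real hardware each planned chunk would become one Get or Put command, typically placed in the same tag group so that completion can be awaited once for the whole transfer.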
PPE Structure
Power Processor Element (PPE):
— General purpose, 64-bit RISC processor (Power/PowerPC binary compatible)
CELL Software Design Considerations
• Four Levels of Parallelism
— Blade Level: Two Cell processors per blade
— Chip Level: 9 cores run independent tasks
— Instruction level: Dual issue pipelines on each SPE
— Register level: Native SIMD on SPE and PPE VMX
• 256KB local store per SPE: data + code + stack
• Communication
— DMA and Bus bandwidth
– DMA granularity: 128 bytes
– DMA bandwidth among LS and System memory
— Traffic control
– Exploit computational complexity and data locality to lower data traffic
• Other optimization considerations
— PPE SIMD, system bottleneck, load balance
Programming the cell is challenging
Issues• Dividing program among different cores
• Creating instructions in a different language for the 8 SPEs than for the PowerPC core.

• Need to think in terms of SIMD nature of dataflow to get maximum performance from SPUs

• SPU local store needs to perform coherent DMA access for accessing system memory
Cell Software Environment
[Figure: layered diagram of the Cell software environment. Labels: Programmer Experience; End-User Experience; Development Environment; Execution Environment; Development Tools Stack; Code Dev Tools; Debug Tools; Performance Tools; Miscellaneous Tools; Samples, Workloads, Demos; SPE Management Lib; Application Libs; Linux PPC64 with Cell Extensions; Verification Hypervisor; Hardware or System Level Simulator; Standards: Language extensions, ABI]
Manually compiling and binding a Cell BE program
Copyright: IBM
Shared Memory Processor
• CBE can be explicitly programmed as a shared-memory multiprocessor using two different instruction sets
• The SPEs and the PPE can be programmed to fully inter-operate in a cache-coherent Shared-Memory Multiprocessor Model
— Cache-coherent DMA operations for SPEs
— DMA operations use effective address common to all PPE and SPEs
— SPE shared-memory store instructions are replaced
– A store from the register file to the LS
– DMA operation from LS to shared memory
— SPE shared-memory load instructions are replaced
– DMA operation from shared memory to LS
– A load from LS to register file
• Of course … a compiler could provide much of this functionality.
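
The two-step replacement of shared-memory loads and stores described above can be sketched on a host machine as follows. This is only an illustration under stated assumptions: `system_memory` and `local_store` are ordinary arrays standing in for system memory and an SPE local store, `dma_get`/`dma_put` are hypothetical helpers implemented with `memcpy` rather than real MFC commands, and `shared_load`/`shared_store` are made-up names.

```c
#include <assert.h>
#include <string.h>

/* Host-side stand-ins for this sketch only. */
float system_memory[1024];     /* plays the role of shared system memory */
static float local_store[4];   /* small staging area in the "local store" */

/* Simulated DMA: memcpy in place of MFC Get/Put commands. */
static void dma_get(float *ls, const float *ea, size_t bytes) { memcpy(ls, ea, bytes); }
static void dma_put(float *ea, const float *ls, size_t bytes) { memcpy(ea, ls, bytes); }

/* A shared-memory store becomes: register -> LS, then DMA LS -> shared memory. */
void shared_store(size_t index, float value)
{
    assert(index < 1024);
    local_store[0] = value;
    dma_put(&system_memory[index], &local_store[0], sizeof(float));
}

/* A shared-memory load becomes: DMA shared memory -> LS, then LS -> register. */
float shared_load(size_t index)
{
    assert(index < 1024);
    dma_get(&local_store[0], &system_memory[index], sizeof(float));
    return local_store[0];
}
```

Issuing one transfer per scalar access, as this sketch does, is exactly the costly pattern that the later slides on buffering and software caching try to avoid.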
Compiling a single source file for the Cell (w/o buffers)

Single source:
    foo1 ();
    #pragma omp parallel for
    for (i=0; i < N; i++) A[i] = x * B[i];
    foo2 ();

After outlining (main program):
    foo1 ();
    Runtime distribution of work: invoke foo3, for i=[0,N)
    Runtime barrier
    foo2 ();

Outlined function foo3 (LB,UB):
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

Clone foo3_SPU (LB,UB):
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

In SPE code: A, B, and x are shared
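
The outlining step in this figure can be sketched in plain C. The code below is a sequential stand-in, not the compiler's actual output: `foo3` follows the figure, while `chunk_bounds` and `run_parallel_for` are hypothetical names for one possible static work distribution, and the workers run one after another instead of on separate SPEs.

```c
#include <assert.h>

/* Outlined loop body, as in the figure: works on the half-open range [LB, UB). */
void foo3(float *A, const float *B, float x, int LB, int UB)
{
    for (int i = LB; i < UB; i++)
        A[i] = x * B[i];
}

/* Hypothetical static partitioning: worker `id` of `nworkers` receives a
 * contiguous block of [0, N); the first N % nworkers workers get one extra. */
void chunk_bounds(int N, int nworkers, int id, int *LB, int *UB)
{
    assert(nworkers > 0 && id >= 0 && id < nworkers);
    int base = N / nworkers, extra = N % nworkers;
    *LB = id * base + (id < extra ? id : extra);
    *UB = *LB + base + (id < extra ? 1 : 0);
}

/* Sequential stand-in for "runtime distribution of work: invoke foo3, for i=[0,N)". */
void run_parallel_for(float *A, const float *B, float x, int N, int nworkers)
{
    for (int id = 0; id < nworkers; id++) {
        int LB, UB;
        chunk_bounds(N, nworkers, id, &LB, &UB);
        foo3(A, B, x, LB, UB);
    }
    /* A real runtime would place a barrier here before foo2() runs. */
}
```

Because each worker receives only its own (LB, UB) pair, the same outlined body can be cloned for a different instruction set without changing the caller.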
Compiling a single source file for the Cell (w/ buffers)

Single source:
    foo1 ();
    #pragma omp parallel for
    for (i=0; i < N; i++) A[i] = x * B[i];
    foo2 ();

After outlining (main program):
    foo1 ();
    Runtime distribution of work: invoke foo3 and foo3_SPU, for i=[0,N)
    Runtime barrier
    foo2 ();

Outlined function foo3 (LB,UB):
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

Clone foo3_SPU (LB,UB):
    /** buffers A´[M], B´[M] **/
    for ( k=LB; k < UB; k+=M) {
        DMA M elements of B into B´
        for (j=0; j<M; j++) {
            A´[j] = cache_lookup(x) * B´[j];
        }
        DMA M elements of A out of A´
    }
    Runtime barrier
Data Partitioning
• Single Source assumption: all data lives in System Memory
• Naïve implementation: every load and store requires a DMA operation
—Too costly (~700 cycles per load or store)
—MP will require locking on every reference
• What can be done to make this acceptable?
Prefetching
• Example:
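Original Code

for(i=0;i<100000;i++)
    a[i]=b[i]+c[i];

Blocked, with prefetch

for(i=0;i<100000;i+=100) {
    dma_get(b’,b[i],400);      /* fetch 100 elements (400 bytes) of b */
    dma_get(c’,c[i],400);      /* fetch 100 elements (400 bytes) of c */
    for(ii=0;ii<100;ii++)
        a’[ii]=b’[ii]+c’[ii];
    dma_put(a[i],a’,400);      /* write the 100 results back */
}

Software Pipelined Prefetch

dma_get(b’,b[0],400);          /* prologue: fetch the first block */
dma_get(c’,c[0],400);
for(i=0;i<99900;i+=100) {
    dma_get(b”,b[i+100],400);  /* start fetching the next block */
    dma_get(c”,c[i+100],400);
    for(ii=0;ii<100;ii++)      /* compute on the current block */
        a’[ii]=b’[ii]+c’[ii];
    dma_put(a[i],a’,400);
    swap(a’,a”);
    swap(b’,b”);
    swap(c’,c”);
}
/* epilogue: after the final swap, b’ and c’ hold the last block */
for(ii=0;ii<100;ii++)
    a’[ii]=b’[ii]+c’[ii];
dma_put(a[i],a’,400);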
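
The software-pipelined version can be checked on an ordinary host with a small simulation. The sketch below is not SPE code: `dma_get` and `dma_put` are hypothetical helpers built on `memcpy`, so the transfers complete immediately instead of overlapping with computation, and `vector_add_double_buffered` and `BLOCK` are names chosen for this example. It assumes `n` is a positive multiple of the block size.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 100  /* elements per transfer: 400 bytes of 4-byte ints, as in the slide */

/* Simulated DMA: memcpy stands in for asynchronous MFC get/put commands. */
static void dma_get(int *ls, const int *ea, size_t bytes) { memcpy(ls, ea, bytes); }
static void dma_put(int *ea, const int *ls, size_t bytes) { memcpy(ea, ls, bytes); }

/* Computes a[i] = b[i] + c[i] for i in [0, n), one block at a time,
 * fetching block k+1 before computing on block k. */
void vector_add_double_buffered(int *a, const int *b, const int *c, int n)
{
    assert(n > 0 && n % BLOCK == 0);   /* assumption of this sketch */
    int a_buf[2][BLOCK], b_buf[2][BLOCK], c_buf[2][BLOCK];
    int cur = 0;

    /* Prologue: fetch the first block. */
    dma_get(b_buf[cur], &b[0], sizeof b_buf[cur]);
    dma_get(c_buf[cur], &c[0], sizeof c_buf[cur]);

    for (int i = 0; i < n; i += BLOCK) {
        int next = 1 - cur;
        if (i + BLOCK < n) {           /* prefetch the next block, if any */
            dma_get(b_buf[next], &b[i + BLOCK], sizeof b_buf[next]);
            dma_get(c_buf[next], &c[i + BLOCK], sizeof c_buf[next]);
        }
        for (int ii = 0; ii < BLOCK; ii++)
            a_buf[cur][ii] = b_buf[cur][ii] + c_buf[cur][ii];
        dma_put(&a[i], a_buf[cur], sizeof a_buf[cur]);
        cur = next;                    /* swap the roles of the two buffer sets */
    }
}
```

This sketch guards the prefetch with a bounds check instead of peeling the last iteration into an epilogue; both forms process the same blocks. On real hardware the code would also have to wait for each transfer's tag group before touching the buffer.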
Irregular Accesses
for(i=0;i<100000;i++)
    a[i]=b[i]+c[i]*d[f(i)];

• b and c can be prefetched, but d has an irregular access pattern, thus we cannot predict what elements of d are required

• we seem to be thrown back on the naïve implementation: d[f(i)] must be fetched on each iteration, with a consequent large slowdown of the loop

• observation: it’s as if every access to d incurred a cache miss

What do we do about this?
Software Caching
Original Code

for(i=0;i<100000;i++)
    = … d[f(i)];

Code with explicit Cache Lookup

for(i=0;i<100000;i++) {
    t=cache_lookup(d[f(i)]);
    = … t;
}
inline vector cache_lookup(addr) {
if (cache_directory[addr&key_mask] != (addr&tag_mask))
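
Since the body of cache_lookup is cut off in this transcript, the following is an independent, hedged sketch of what a direct-mapped software cache lookup can look like, not the slide's implementation. It runs on a host: `system_memory` is an ordinary array, `memcpy` stands in for the DMA get on a miss, and the line size, line count, `cache_valid` and `cache_misses` are assumptions made for this example. It handles reads only and performs no write-back.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128u   /* assumed line size, matching the 128-byte DMA granularity */
#define NUM_LINES  16u    /* assumed number of lines in the software cache */
#define MEM_BYTES  8192u  /* size of the simulated system memory */

uint8_t system_memory[MEM_BYTES];                   /* stand-in for system memory */
static uint8_t  cache_data[NUM_LINES][LINE_BYTES];  /* lines held in the "local store" */
static uint32_t cache_directory[NUM_LINES];         /* line base address held in each slot */
static int      cache_valid[NUM_LINES];
static unsigned miss_count;

/* Returns a pointer to the cached copy of the byte at effective address addr,
 * fetching the enclosing line first if the directory entry does not match. */
uint8_t *cache_lookup(uint32_t addr)
{
    assert(addr < MEM_BYTES);
    uint32_t line_base = addr & ~(LINE_BYTES - 1u);
    uint32_t slot = (addr / LINE_BYTES) % NUM_LINES;
    if (!cache_valid[slot] || cache_directory[slot] != line_base) {
        /* Miss: bring in the whole line (memcpy stands in for a DMA get). */
        memcpy(cache_data[slot], &system_memory[line_base], LINE_BYTES);
        cache_directory[slot] = line_base;
        cache_valid[slot] = 1;
        miss_count++;
    }
    return &cache_data[slot][addr & (LINE_BYTES - 1u)];
}

unsigned cache_misses(void) { return miss_count; }
```

Two accesses to the same 128-byte line cost one transfer, which is the saving an irregular reference such as d[f(i)] can get when its accesses show some locality.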
• There is no coherence problem for this static buffer in the loop
• Runtime coherence maintenance is needed only
— At the entry of loop: DMA read and check whether the software controlled cache has updated data
— At the exit of loop:
– Write-through: update the hit cache line and DMA write
– Write-back: put the static buffer content into cache
• Pros/Cons
— Requires local data dependence info, which may be more likely to be available
— The structure of software controlled cache remains unchanged
– References are put into static buffer in a loop only when there is no data dependence between the reference and any other reference accessed by software controlled cache or another static buffer in the loop.
— The coherence maintenance can be overlapped with DMA operations
— Candidates for static buffer may be lost if the data dependence information is too conservative
Global Coherence Avoidance Analysis
• Runtime coherence maintenance can be avoided by compiler analysis
— At entry: if there is no updated cache line for this static buffer
— At exit: if there is no cache line for this static buffer already in cache that will be referenced later
• How the compiler predicts cache contents
— No lines in cache after flush
— If data is carefully aligned or padded, compiler can assume different variables will never be in the same cache line
— Cannot predict the replacement. A line will be assumed to stay in cache until flush
Optimization with Flushes
• When runtime coherence maintenance is needed by the previous analysis, it may be profitable to insert extra cache flushes to avoid the coherence maintenance

• A flush can target one variable, or several flushes can be combined into a flush-all

• The previous analysis can provide information about the possible insertion points for flush
— Move in the control flow graph to reduce the overhead
— Similar to the algorithm of partial redundancy elimination
— Branch profiling may help
SPU Code Partition Manager Overview
[Figure, shown in two steps: the SPU Processor holds the Partition Manager (long resident) and the active Partition m, whose code contains a call to Partition n; Main Memory holds Partition 1, Partition 2, …, Partition n, …]
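
The figure labels suggest that a call into a non-resident partition goes through the resident manager, which brings the target partition into the local store first. The sketch below illustrates that idea only; it is an assumption drawn from the labels, not the actual partition manager. `partition_manager_call`, `load_partition`, `partition_loads` and `current_partition` are hypothetical names, and the load is simulated by updating a counter instead of copying code.

```c
#include <assert.h>

#define NO_PARTITION (-1)

static int active_partition = NO_PARTITION;  /* partition currently resident in the local store */
static unsigned load_count;                  /* how many partition loads have happened */

/* Hypothetical stand-in for the DMA that copies partition n from main memory. */
static void load_partition(int n)
{
    active_partition = n;
    load_count++;
}

/* Invoked for every cross-partition call: make the callee's partition
 * resident if it is not already, after which the call can proceed. */
void partition_manager_call(int target_partition)
{
    assert(target_partition >= 0);
    if (active_partition != target_partition)
        load_partition(target_partition);
    /* A real manager would now branch to the callee inside the loaded partition. */
}

unsigned partition_loads(void)   { return load_count; }
int      current_partition(void) { return active_partition; }
```

Calls that stay within the resident partition cost no load in this model, which is why the partitioning algorithm tries to keep frequently interacting functions together.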
OVERLAY command effect: Binary View
[Figure: Program Binary Image with a Header followed by Partition 1, Partition 2, …, Partition n, …; offset labels: 0x1000, 0x2000, 0x3000]

OVERLAY command effect: Execution View
[Figure: Program Memory Image with a Header and Partition 1, Partition 2, …, Partition n; virtual address: 0x1000]
Call Graph Partitioning Algorithm
• Build an affinity graph based on the global call graph.
— Each global call graph node becomes a node in the affinity graph and costs some memory
— Each call graph edge becomes an edge in the affinity graph
• Each call graph edge is weighted.
— Estimated through static program analysis
— Profiling
• Apply maximum spanning tree algorithm to the affinity graph.
— Process edges by the order of the weight
— If merging the two nodes of the edge does not exceed the memory limitation, then merge, and so on.
• Each (merged) node left is a program partition.
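
The steps above can be sketched as a greedy merge over edges sorted by decreasing weight, in the style of Kruskal's algorithm with a size check. This is one possible reading of the slide, not the authors' implementation: `Edge`, `partition_call_graph`, `MAX_NODES` and the union-find bookkeeping are all assumptions introduced for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 64

typedef struct { int u, v; int weight; } Edge;  /* call edge with its estimated weight */

static int parent[MAX_NODES];     /* union-find parent links */
static int part_size[MAX_NODES];  /* memory cost of the partition rooted at each node */

static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heaviest edges first. */
static int by_weight_desc(const void *a, const void *b)
{
    const Edge *ea = a, *eb = b;
    return (eb->weight > ea->weight) - (eb->weight < ea->weight);
}

/* Merges call graph nodes along the heaviest edges while the combined size
 * stays within mem_limit. Writes a representative node id per node into
 * partition_of[] and returns the number of partitions. Sorts edges in place. */
int partition_call_graph(int n_nodes, const int *node_size,
                         Edge *edges, int n_edges,
                         int mem_limit, int *partition_of)
{
    assert(n_nodes > 0 && n_nodes <= MAX_NODES);
    for (int i = 0; i < n_nodes; i++) {
        parent[i] = i;
        part_size[i] = node_size[i];
    }
    qsort(edges, (size_t)n_edges, sizeof *edges, by_weight_desc);

    int n_parts = n_nodes;
    for (int e = 0; e < n_edges; e++) {
        int ru = find_root(edges[e].u), rv = find_root(edges[e].v);
        if (ru != rv && part_size[ru] + part_size[rv] <= mem_limit) {
            parent[rv] = ru;                 /* merge the two partitions */
            part_size[ru] += part_size[rv];
            n_parts--;
        }
    }
    for (int i = 0; i < n_nodes; i++)
        partition_of[i] = find_root(i);
    return n_parts;
}
```

For instance, four functions of size 300 chained by edges of weight 100, 50 and 10 under a limit of 1000 end up as one partition of size 900 plus one of size 300, because adding the fourth function would exceed the limit.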
An Example of Call Graph Partitioning
Assume Memory Limitation is 1000
[Figure sequence (five steps): call graph nodes are merged step by step. The graph layout did not survive extraction. Numeric labels in the first step: 300 (six times), 100, 1000, 200, 10 and 50 (twice); intermediate steps show merged labels 600 and 900; the final step shows 900, 900 and 150.]
Optimizations
• Profiling to get accurate call edge frequencies
— Especially with the presence of a lot of indirect calls through function pointers
• Get the accurate function code size
— Currently estimated
— Conservative, very rough
• Leaf function duplication
— Some leaf functions are referenced by two non-coalescable partitions
— May be beneficial to duplicate the function
• Double Buffering
— Rely on good prefetching to be beneficial
— Prefetching is a difficult problem
Performance Normalized to one SPU
• NAS and SPEC OMP benchmarks, speedups against 1 SPE
• Scalability generally very good
— IS and equake not good due to non-parallelized loops
[Figure: speedup (y-axis, 1 to 8) versus number of SPUs (x-axis: 1SPU, 2SPU, 4SPU, 8SPU) for CG, EP, FT, IS, MG, equake and swim]