Porting Reconstruction Algorithms to the Cell Broadband Engine
S. Gorbunov 1, U. Kebschull 1, I. Kisel 1,2, V. Lindenstruth 1, W.F.J. Müller 2
1 Kirchhoff Institute for Physics, University of Heidelberg, Germany
2 Gesellschaft für Schwerionenforschung mbH, Darmstadt, Germany
ACAT08, Erice, Sicily, November 3-7, 2008
Processing time per event increased from 1 sec to 5 min for double-sided strip detectors with 85% of fake space points!
Modern (Pentium, AMD, PowerPC, …) CPUs have vector units operating on 2 double-precision (d.p.) or 4 single-precision (s.p.) scalars in one go!
SIMD = Single Instruction Multiple Data
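To make "4 s.p. scalars in one go" concrete, here is a minimal sketch that is not taken from the talk. It assumes a GCC or Clang compiler, whose `vector_size` extension provides a portable 4-float vector type; the names `float4`, `saxpy_scalar` and `saxpy_simd` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension (an assumption of this sketch, not from the talk):
// four single-precision floats packed into one 16-byte value.
typedef float float4 __attribute__((vector_size(16)));

// Scalar reference: y[i] = a * x[i] + y[i], one element per step.
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// SIMD version: the same arithmetic applied to four elements per step.
// Requires n to be a multiple of 4 to keep the sketch short.
void saxpy_simd(float a, const float* x, float* y, std::size_t n) {
    assert(n % 4 == 0);
    const float4 va = {a, a, a, a};
    for (std::size_t i = 0; i < n; i += 4) {
        float4 vx, vy;
        std::memcpy(&vx, x + i, sizeof vx);  // load 4 floats (no alignment assumed)
        std::memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;                   // element-wise multiply and add on all 4 lanes
        std::memcpy(y + i, &vy, sizeof vy);  // store 4 floats
    }
}
```

Both functions should produce the same array; the SIMD variant simply needs a quarter of the loop iterations.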
[Figure labels: true, fake]
06 November 2008, ACAT08, Erice
Ivan Kisel, GSI/KIP
Kalman Filter for Track Fit 2/2
arbitrarily large errors
non-homogeneous magnetic field as large map
multiple scattering in material
small errors
weight for update
not enough accuracy in single precision
Kalman Filter: Conventional and Square-Root
The square-root KF provides the same precision as the conventional KF, but is roughly 30% slower.
Kalman Filter Instability in Single Precision
Illustrative example: 2D fit of a straight line, no multiple scattering (MS), 15 detectors
Conclusion for single precision:
• Use the square-root version of KF, or
• Use double precision for the critical part of KF, or
• Use a proper initialization in the conventional KF
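The loss of accuracy can be reproduced with a toy that is much simpler than the 2D fit of the talk: a single conventional covariance update for a scalar state, starting from a very large initial error. This is only a sketch under that assumption; the function name `update_covariance` and the values used below are illustrative and do not come from the slides.

```cpp
#include <cassert>

// One conventional Kalman covariance update for a scalar state
// (toy model, not the track fit from the talk):
//   K  = C / (C + V)    gain, i.e. the weight for the update
//   C' = (1 - K) * C    updated covariance
// C is the prior variance, V the measurement variance.
template <typename T>
T update_covariance(T C, T V) {
    assert(V > T(0));
    const T K = C / (C + V);
    // For C >> V the gain K is very close to 1, so (1 - K) cancels
    // almost all significant digits: this is the critical step.
    return (T(1) - K) * C;
}
```

With a prior variance of 1e10 and a measurement variance of 1, the exact result is close to 1. In IEEE-754 single precision the sum C + V typically rounds back to C, the gain becomes exactly 1, and the updated covariance collapses to 0; `update_covariance<double>(1e10, 1.0)` stays close to 1. Evaluating only this step in double precision, or starting from a smaller initial variance, avoids the collapse in this toy.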
Platform: Red Hat (Fedora Core 4); Cell Simulator (PPE, SPE); 3. Cell Blade
Data Types: SSE2, SSE2, AltiVec, Specialized SIMD
Use headers to overload +, -, *, / operators --> the source code is unchanged!
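The header idea can be sketched as follows. The type `Fvec4` and the helpers `broadcast` and `lane` are hypothetical names, and the operators are written with plain loops so that the sketch compiles anywhere; a platform header for SSE2, AltiVec or the SPE would presumably implement the same operators with the corresponding intrinsics. The two template functions stand in for algorithm code and are illustrative only.

```cpp
#include <cassert>

// Hypothetical stand-in for a platform-specific SIMD header: a 4-float
// vector with overloaded arithmetic. A real header would wrap the native
// vector type and intrinsics instead of a plain array.
struct Fvec4 {
    float v[4];
};

// Fill all four lanes with the same scalar.
inline Fvec4 broadcast(float s) {
    Fvec4 r = {{s, s, s, s}};
    return r;
}

// Read one lane (0..3).
inline float lane(const Fvec4& a, int i) {
    assert(i >= 0 && i < 4);
    return a.v[i];
}

inline Fvec4 operator+(const Fvec4& a, const Fvec4& b) {
    Fvec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}
inline Fvec4 operator-(const Fvec4& a, const Fvec4& b) {
    Fvec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] - b.v[i];
    return r;
}
inline Fvec4 operator*(const Fvec4& a, const Fvec4& b) {
    Fvec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}
inline Fvec4 operator/(const Fvec4& a, const Fvec4& b) {
    Fvec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] / b.v[i];
    return r;
}

// Algorithm code written once against a generic type T.
// Straight-line extrapolation: x(z + dz) = x + tx * dz.
template <typename T>
T extrapolate(const T& x, const T& tx, const T& dz) {
    return x + tx * dz;
}

// Weighted mean of two values x1, x2 with weights w1, w2.
template <typename T>
T weighted_mean(const T& x1, const T& w1, const T& x2, const T& w2) {
    return (x1 * w1 + x2 * w2) / (w1 + w2);
}
```

The same `extrapolate` and `weighted_mean` source compiles for `float` (one track at a time) and for `Fvec4` (four tracks at a time); only the header that defines the vector type changes between platforms.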
SPE Statistics
Timing profile!
No need to check the assembler code!
Kalman Filter Track Fit on Intel Xeon, AMD Opteron and Cell
Motivated, but not restricted by Cell!
lxg1411@GSI
eh102@KIP
blade11bc4@IBM
• 2 Intel Xeon Processors with Hyper-Threading enabled and 512 kB cache at 2.66 GHz;
• 2 Dual Core AMD Opteron Processors 265 with 1024 kB cache at 1.8 GHz;
• 2 Cell Broadband Engines with 256 kB local store at 2.4 GHz.
[Figure labels: Intel P4, Cell]
10000x faster!
Summary
• Think about using SIMD units in the near future (many-cores, TF/s, …)
• Use single-precision floating point if possible
• In critical parts use double precision if necessary
• Avoid accessing main memory, no maps, no look-up tables
• New parallel languages appear: Ct, CUDA, …
• Keep portability of the code (Intel, AMD, Cell, GPGPU, …)
• Try the auto-vectorization option of the compilers