Aurora/PetaQCD/QPACE Meeting, Regensburg University, April 14-15, 2010
Key Computation Issues
Large volume of data (disk / memory / network)
Significant number of solver iterations due to numerical intractability
Redundant memory accesses arising from interleaved data dependencies
Use of double precision because of the accuracy requirement (hardware penalty)
Misaligned data (inherent to specific data structures)
Exacerbates cache misses (depending on cache size)
Becomes a serious problem when considering accelerators
Leads to « false sharing » with the shared-memory paradigm (Posix, OpenMP)
Padding is one solution, but it would dramatically increase the memory requirement (see the sketch after this list)
Memory/computation trade-off in data organization (e.g. gauge field replication)
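A minimal sketch of the padding trade-off (illustrative only; the 128-byte line size, structure layout and names are assumptions, not taken from tmLQCD):

#define CACHE_LINE  128
#define NUM_THREADS 4

/* One accumulator per thread, padded to a full cache line so that two
 * threads never write to the same line (no « false sharing »); the padding
 * itself is the extra memory cost mentioned above. */
typedef struct {
    double sum;                                /* useful payload            */
    char   pad[CACHE_LINE - sizeof(double)];   /* filler up to a full line  */
} __attribute__((aligned(CACHE_LINE))) padded_acc_t;

static padded_acc_t acc[NUM_THREADS];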
Why the CELL Processor?
Highest computing power in a single « computing node »
Fast memory access
Asynchronism between data transfers and computation
Issues with the CELL Processor?
Data alignment (both for calculations and transfers)
Heavy use of list DMA
Small size of the Local Store (SPU local memory)
Resource sharing on the Dual Cell Based Blade
Integration into an existing standard framework
What we have done
Implementation of each critical kernel on the CELL processor
SIMD versions of the basic operators (a sketch follows this list)
Appropriate DMA mechanism (efficient list DMA and double buffering)
Merging of consecutive operations into a single operator (latency hiding & memory reuse)
Aggregation of all these implementations into a single, standalone library
Effective integration into the tmLQCD package
Successful tests (QS20 and QS22)
A single SPU thread holds the whole set of routines
The SPU thread remains « permanently » active during a working session
Data re-alignment
Routine call replacement (invoke the CELL versions in place of the native ones)
This should be the way to commit the work back to tmLQCD (external library and an « IsCELL » switch)
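A hedged sketch of what such a basic SPU SIMD operator can look like (an axpy-style kernel in double precision; the function name and data layout are assumptions, not the actual library routine). Each vector double carries two reals processed by one fused multiply-add:

#include <spu_intrinsics.h>

/* y[i] = a * x[i] + y[i], two double-precision reals per vector register. */
void axpy_spu(vector double *y, const vector double *x, double a, int nvec)
{
    vector double va = spu_splats(a);          /* broadcast the scalar a */
    for (int i = 0; i < nvec; i++)
        y[i] = spu_madd(va, x[i], y[i]);       /* fused multiply-add     */
}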
Global Organization
Task partitioning, distribution, and synchronization are done by the PPU
Each SPE operates on its portion of the data through a typical loop of the form
(DMA get + SIMD Computation + DMA put)
The SPE, always active, switches to the appropriate operation on each request
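A minimal sketch of this organization on the SPE side (the command codes, control-block layout and kernel dispatch are assumptions, not the actual library interface): the PPU posts a request in the SPE mailbox; the SPE fetches its data portion, computes, writes the result back, and waits for the next request.

#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define TAG      1
#define CMD_EXIT 0

typedef struct {                       /* control block written by the PPU   */
    unsigned long long ea_in, ea_out;  /* effective addresses of the portion */
    unsigned int       size;           /* bytes, multiple of 16               */
    unsigned int       pad[3];         /* pad to a DMA-able 32-byte size      */
} cb_t;

static volatile cb_t cb __attribute__((aligned(128)));
static volatile char buf[16384] __attribute__((aligned(128)));

/* Placeholder for the dispatch to the SIMD operator selected by cmd. */
static void run_kernel(unsigned int cmd, volatile char *ls, unsigned int size)
{
    (void)cmd; (void)ls; (void)size;
}

int main(unsigned long long speid, unsigned long long ea_cb, unsigned long long envp)
{
    unsigned int cmd;
    (void)speid; (void)envp;
    while ((cmd = spu_read_in_mbox()) != CMD_EXIT) {   /* wait for a request  */
        mfc_get(&cb, ea_cb, sizeof(cb_t), TAG, 0, 0);  /* fetch control block */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        mfc_get(buf, cb.ea_in, cb.size, TAG, 0, 0);    /* DMA get             */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        run_kernel(cmd, buf, cb.size);                 /* SIMD computation    */

        mfc_put(buf, cb.ea_out, cb.size, TAG, 0, 0);   /* DMA put             */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        spu_write_out_mbox(cmd);                       /* signal completion   */
    }
    return 0;
}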
Optimal list DMA organization for the Wilson-Dirac Operator
The computation of the Wilson-Dirac action for a set of K contiguous spinors requires fetching 8K neighbouring spinors (example below with a 32x16³ lattice and even-odd preconditioning)
S[0]  P[2048]  P[63488]  P[128]  P[1920]  P[8]   P[120]  P[0]  P[7]
S[1]  P[2049]  P[63489]  P[129]  P[1921]  P[9]   P[121]  P[1]  P[0]
S[2]  P[2050]  P[63490]  P[130]  P[1922]  P[10]  P[122]  P[2]  P[1]
S[3]  P[2051]  P[63491]  P[131]  P[1923]  P[11]  P[123]  P[3]  P[2]
A direct list DMA to get this « spinor matrix » involves 8x4 = 32 DMA items
A list DMA to get its « transpose » involves 7 + 1 + 1 = 9 DMA items
In general, our list DMA has 8 + cK items instead of 8K (bin packing)
No impact on SPU performance, thanks to the uniform access time to the Local Store
Significant improvement in global performance and scalability
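A hedged sketch of such a list-DMA gather (buffer sizes, names and packing logic are assumptions; the actual packing in the library is more elaborate): each mfc_list_element_t describes one contiguous run of spinors in main memory, and a single mfc_getl gathers all runs into the Local Store.

#include <spu_mfcio.h>

#define SPINOR_BYTES 192   /* 4 spin x 3 color x 2 (complex) x 8 bytes (double) */
#define MAX_RUNS      16

static mfc_list_element_t dma_list[MAX_RUNS] __attribute__((aligned(16)));
static char ls_buf[24576] __attribute__((aligned(128)));   /* gathered spinors */

/* Build one list element per contiguous run of count[i] spinors located at
 * the 32-bit effective-address offset eal[i] (each run below 16 KB), then
 * issue a single list DMA and wait for its completion. */
void gather_runs(unsigned long long ea_base,
                 const unsigned int *eal, const unsigned int *count,
                 unsigned int nruns, unsigned int tag)
{
    unsigned int i;
    for (i = 0; i < nruns; i++) {
        dma_list[i].notify = 0;
        dma_list[i].size   = count[i] * SPINOR_BYTES;   /* bytes in this run */
        dma_list[i].eal    = eal[i];                    /* low 32 bits of EA */
    }
    mfc_getl(ls_buf, ea_base, dma_list,
             nruns * sizeof(mfc_list_element_t), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}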
Performance results
We consider a 32x16³ lattice and the CELL-accelerated version of tmLQCD
QS20
#SPE Time(s) Speedup GFlops
1 0.109 1.00 0.95
2 0.054 2.00 1.92
3 0.036 3.00 2.89
4 0.027 3.99 3.85
5 0.022 4.98 4.73
6 0.018 5.96 5.78
7 0.015 6.93 6.94
8 0.013 7.88 8.01
QS22
#SPE Time(s) Speedup GFlops
1 0.0374 1.00 2.76
2 0.0195 1.91 5.31
3 0.0134 2.79 7.76
4 0.0105 3.56 9.90
5 0.0090 4.15 11.56
6 0.0081 4.61 12.84
7 0.0076 4.92 13.88
8 0.0075 5.75 14.02
INTEL i7 quadcore 2.83 GHz (time in seconds)
          Without SSE   With SSE
1 core    0.0820        0.0400
4 cores   0.0370        0.0280
Solver           INTEL i7 (SSE + 4 cores)   CELL QS20 (8 SPEs)   CELL QS22 (8 SPEs)
GCR (57 iters)   11.05 s                    3.78 s               2.04 s
CG (685 iters)   89.54 s                    42.25 s              22.78 s
Comments
We observed a factor of 2 between the QS20 and the QS22
We observed a factor of 4 between the QS22 and the Intel i7 quadcore at 2.83 GHz
Good scalability on the QS20
Scalability on the QS22 degrades beyond 4 SPEs (probably a binding issue on the Dual Cell Based Blade, which should be easy to fix)
Fixing this scalability issue on the QS22 would double the current performance
Ways for improvement
Implement the « non GAUGE COPY » version (significant memory reduction / packing)
Outstanding performance expected
Explore the SU(3) reconstruction approach at the SPE level (memory and bandwidth savings; see the sketch after this list)
Have the PPU participate in the calculations (makes sense in double precision)
Try to scale up to the 16 SPEs of the QS22 Dual Cell Based Blade
Experiment with a cluster of CELL processors
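A hedged sketch of the SU(3) reconstruction idea (12-real-number compression; function and type names are illustrative, shown in plain C): only the first two rows of each gauge link are transferred, and the third row is rebuilt on the SPE as the complex-conjugated cross product of the first two, trading a few extra flops for one third of the gauge-field bandwidth.

#include <complex.h>

typedef double complex su3_row[3];

/* Rebuild the third row of an SU(3) matrix from the first two:
 * row3 = conj(row1 x row2). */
void su3_reconstruct_row3(const su3_row a, const su3_row b, su3_row c)
{
    c[0] = conj(a[1] * b[2] - a[2] * b[1]);
    c[1] = conj(a[2] * b[0] - a[0] * b[2]);
    c[2] = conj(a[0] * b[1] - a[1] * b[0]);
}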
END