with acknowledgements to: The Multicore Solutions Group @ IBM T.J. Watson, NY, USA
Alexander van Amesfoort @ TUD
Rob van Nieuwpoort @ VU/ASTRON
Parallel Applications for Multi-core Processors
Ana Lucia Vârbănescu, TU Delft / Vrije Universiteit Amsterdam
►Views on parallel applications… and multiple conclusions
One introduction
The history: STI Cell/B.E.
► Sony: main processor for PS3
► Toshiba: signal processing and video streaming
► IBM: high performance computing
The architecture
► 1 x PPE: 64-bit PowerPC; L1: 32 KB I$ + 32 KB D$; L2: 512 KB
► 8 x SPE cores: Local Store: 256 KB; 128 x 128-bit vector registers
► Hybrid memory model: PPE: Rd/Wr; SPEs: async DMA
The Programming
► Thread-based model, with push/pull data flow
  Thread scheduling by user
  Memory transfers are explicit
► Five layers of parallelism to be exploited:
  Task parallelism (MPMD)
  Data parallelism (SPMD)
  Data streaming parallelism (DMA double buffering)
  Vector parallelism (SIMD – up to 16-way)
  Pipeline parallelism (dual-pipelined SPEs)
Sweep3D application
► Part of the ASCI benchmark
► Solves a three-dimensional particle transport problem
► It is a 3D wavefront computation
IPDPS 2007: Fabrizio Petrini, Gordon Fossum, Juan Fernández, Ana Lucia Varbanescu, Michael Kistler, Michael Perrone: Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine
Sweep3D computation

SUBROUTINE sweep()
  DO iq=1,8                     ! Octant loop
    DO m=1,6/mmi                ! Angle pipelining loop
      DO k=1,kt/mk              ! K-plane loop
        DO jkm=1,jt+mk-1+mmi-1  ! JK-diagonals with MMI pipelining
          DO il=1,ndiag         ! I-lines on this diagonal
            IF (.NOT. do_fixups) THEN
              DO i=1,it         ! Solve Sn equation
              ENDDO
            ELSE
              DO i=1,it         ! Solve Sn equation with fixups
              ENDDO
            ENDIF
          ENDDO                 ! I-lines on this diagonal
        ENDDO                   ! JK-diagonals with MMI
[Figure: one common concept-detection flow, repeated for each feature: CH, CC, EH, TX, CD]
MarCell – Porting
1. Detect & isolate kernels to be ported
2. Replace kernels with C++ stubs
3. Implement the data transfers and move kernels on SPEs
4. Iteratively optimize SPE code
ICPP 2007: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu, An Effective Strategy for Porting C++ Applications on Cell.
Experiments
► Run on a PlayStation3
  1 Cell processor, 6 SPEs available
  3.2 GHz, 256 MB RAM
► Double-checked with a Cell blade (QS20)
  2 Cell processors, 16 SPEs available
  3.2 GHz, 1 GB RAM
► Mapping and scheduling: high-level parallelization
  • Essential for “seeing” the influence of kernel optimizations
  • Platform-oriented MPI-inheritance may not be good enough
► Context switches are expensive
► Static scheduling can be replaced with dynamic (PPE-based) scheduling
Radioastronomy
► Very large radiotelescopes: LOFAR, ASKAP, SKA, etc.
► Radioastronomy features:
  Very large data sets
  Off-line (files) and on-line (streaming) processing
  Simple computation kernels
  Time constraints
  • Due to streaming
  • Due to storage capability
► Radioastronomy data processing is ongoing research
  Multi-core processors are a challenging solution
Getting the sky image
► The signal path from the antenna to the sky image
► We focus on imaging
Data imaging
► Two phases for building a sky image:
  Imaging: takes the measured visibilities and creates the dirty image
  Deconvolution: “cleans” the dirty image into a sky model
► The more iterations, the better the model
  But more iterations = more measured visibilities
Gridding/Degridding
[Figure: gridding maps the sampled data (visibilities) on the (u,v)-tracks onto a regular grid covering all baselines; degridding is the reverse mapping]

V(b(ti)) = data read at time ti on baseline b; Dj(b(ti)) contributes to a certain region in the final grid.
► Both gridding and degridding are performed by convolution
The code
forall (j = 0..Nfreq-1; i = 0..Nsamples-1)   // for all samples
    // the kernel position in C
    compute cindex = C_Offset((u,v,w)[i], freq[j]);
    // the grid region to fill
    compute gindex = G_Offset((u,v,w)[i], freq[j]);
    // for all points in the chosen region
    for (x = 0; x < M; x++)                   // sweep the convolution kernel
        if (gridding)   G[gindex+x] += C[cindex+x] * V[i,j];
        if (degridding) V'[i,j]     += G[gindex+x] * C[cindex+x];
► All operations are performed with complex numbers !
The computation
► Computation/iteration: M * (4 ADD + 4 MUL) = 8 * M FLOPs
► Memory transfers/iteration: RD: 2 * M * 8 B; WR: M * 8 B
► Arithmetic intensity [FLOPs/byte]: 1/3 => memory-intensive app!
► Two consecutive data points “hit” different regions in C/G => dynamic!

[Figure: per-sample pipeline, repeated for samples x baselines x frequency_channels, streaming from HDD/memory: Read (u,v,w)(t,b), V(t,b,f) -> Compute C_ind, G_ind -> for k = 1..m x m: Read SC[k], SG[k] -> Compute SG[k] + D x SC[k] -> Write SG[k] to G]
The data
► Memory footprint:
  C: 4 MB ~ 100 MB
  V: 3.5 GB for 990 baselines x 1 sample/s x 16 freq. channels
  G: 4 MB
► For each data point: convolution kernel from 15 x 15 up to 129 x 129
Data distribution
► “Round-robin”
► “Chunks”
► Queues

[Figure: twelve samples distributed over the SPEs round-robin, in contiguous chunks, and via work queues]
Parallelization
[Figure: the same pipeline mapped onto the Cell: the PPE reads (u,v,w)(t,b) and V(t,b,f) from HDD/memory and computes C_ind, G_ind; each SPE, via DMA, reads SC[k] and SG[k], computes SG[k] + D x SC[k], and writes SG[k] to a localG, for k = 1..m x m; finally, each localG is added to the finalG]

► A master-worker model
  “Scheduling” decisions on the PPE
  SPEs concerned only with computation
Optimizations
► Exploit data locality
  PPE: fill the queues in a “smart” way
  SPEs: avoid unnecessary DMA
► Tune queue sizes
► Increase queue-filling speed
  2 or 4 threads on the PPE
► Sort queues
  By g_ind and/or c_ind
Experiments set-up
► Collection of 990 baselines
  1 baseline
  Multiple baselines
► Run gridding and degridding for:
  5 different support sizes
  Different core/thread configurations
► Report: execution time per operation (i.e., per gridding and per degridding):
  Texec/op = Texec / (NSamples x NFreqChans x KernelSize x #Cores)
Results – overall evolution
Lessons from Gridding
► SPE kernels have to be as regular as possible
► Dynamic scheduling works on the PPE side
►Views on parallel applications… and multiple conclusions
Other platforms
► General-purpose multi-cores
  Easier to program (SMP machines)
  Homogeneous
  Complex, traditional, multi-threaded cores
► GPUs
  Hierarchical cores
  Harder to program (more parallelism)
  Complex memory architecture
  Less predictable
A Comparison
► Different strategies are required for each platform
  Core-specific optimizations are the most important for GPP
  Dynamic job/data allocation is essential for Cell/B.E.
  Memory management for high data parallelism is critical for GPU
Efficiency (case-study)
► We have tried the most “natural” programming model for each platform
► The parallelization effort:
  GPP: 4 days
  • A master-worker model may improve performance here as well
  Cell/B.E.: 3-4 months
  • Very good performance, complex solution
►Views on parallel applications… and multiple conclusions
A view from Berkeley
A view from Holland
Overall …
► Cell/B.E. is NOT hard to program, unless …
  … high performance or high productivity is required
► Still in the case-studies phase
  Everything can run on the Cell … but how, and when?!
► Optimizations
  Low-level: may be delegated to a compiler (difficult)
  High-level: must be user-assisted
► Programming models
  Offer partial solutions, but none seems complete
  Various approaches, with limited loss in efficiency
… but …
► Cell/B.E. is NOT the only option
  Choosing a multi-core platform is *highly* application-dependent
  Efficiency is essential, more so than performance
► In-core optimizations pay off for *all* platforms
  They are roughly predictable, too
► Higher-level optimizations make the difference
  Data management and distribution
  Task scheduling
  Isolation and proper implementation of dynamic behavior (e.g., scheduling)
Take-home messages [1/2]
► It’s not that multi-core processors are difficult to program; rather, applications are difficult to parallelize.
►There is no silver bullet for all applications to perform great on the Cell/B.E., but there are common practices for getting there.
Take-home messages [2/2]
►The application design, implementation, and optimization principles for the Cell/B.E. hold for most multi-core platforms.
► Applications must have a massive influence on next-generation multi-core processor design.