System-Level Heterogeneity
with Intel® Xeon PhiTM Processors
Estela Suarez
Jülich Supercomputing Centre
This project has received funding from the European Union's Seventh Framework Programme for research, technological
development and demonstration under grant agreements 287530 (DEEP) and 610476 (DEEP-ER).
Collaborative R&D in DEEP & DEEP-ER
European Union Exascale projects
– 20 partners
– Total budget: 28.3 M€
– EU funding: 14.5 M€
– Combined term: 5 years

– Need for low latency, spatial application structures → 3D Torus direct-connected network (EXTOLL)
– Network bridging and KNC control → Booster Interface layer
Both parts
– Efficiency needs use of liquid cooling & dense packaging
DEEP Prototype Systems
Eurotech Aurora Prototype
DEEP-ER – ISC 2016, Intel Booth – 21.06.2016

Cluster part
– 128 dual-socket Intel Xeon E5-2403 nodes
– QDR InfiniBand™
– Eurotech Aurora liquid cooling & packaging

Booster part
– 384 Intel Xeon Phi 7120X nodes
– FPGA implementation of EXTOLL interconnect
– 24 Booster Interface nodes with Intel Xeon processors
– Eurotech Aurora liquid cooling & packaging
DEEP/DEEP-ER Programming Model

ParaStation global MPI layer
– Expert-level programming
– Efficient communication across the whole system
– Dynamic process spawning and control in both directions

Task-based OmpSs programming model
– Pragma-based, emphasises ease of use
– Efficient communication across the whole system
– Dynamic spawning of massively parallel tasks in both parts
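As a rough illustration of the task-based model, the sketch below shows how an offloaded task might be annotated. It is not compilable with a plain C compiler: the `device(mpi)`/`onto()` clauses and the `deep_booster_alloc`/`deep_booster_free` calls follow the OmpSs Offload user guide and require the Mercurium/Nanos++ toolchain; the function and variable names are invented for illustration.

```c
#include <mpi.h>

MPI_Comm booster_comm;  /* inter-communicator to the Booster (illustrative) */

/* Task to be spawned onto rank 0 of the Booster communicator */
#pragma omp target device(mpi) onto(booster_comm, 0) copy_deps
#pragma omp task in([n] input) out([n] result)
void solve_block(double *input, double *result, int n);

void run(int n, double *input, double *result)
{
    /* Reserve 4 Booster (Xeon Phi) nodes, 1 process per node */
    deep_booster_alloc(MPI_COMM_WORLD, 4, 1, &booster_comm);

    solve_block(input, result, n);  /* executed on the Booster side */
    #pragma omp taskwait            /* wait for the offloaded task  */

    deep_booster_free(&booster_comm);
}
```

The point of the model is visible here: the call site stays an ordinary function call, while the pragmas and the allocation API decide where the massively parallel task actually runs.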
Massively Parallel Tasks in OmpSs
Published in: Sainz, F., Bellón, J., Beltran, V., Labarta, J., "Collective Offload for Heterogeneous Clusters", IEEE 22nd International Conference on High Performance Computing (HiPC), 2015.
Figure 7: FWI hierarchical MPI architecture — Rank 0 acts as master over 256 ranks grouped into 16 slaves (slave0: ranks 0–15, slave1: ranks 16–31, …, slave15: ranks 240–255); each slave spawns 64 workers, 1024 in total (wk0–wk1023).
Figure 8: Scalability of the FWI application on up to 1024 nodes (16 cores each) — speed-up of OmpSs Offload and OmpSs Offload (no I/O) against the ideal curve.
VI. Conclusions and Future Work
This paper presents the OmpSs Offload model that was originally developed to ease the porting of complex applications to the highly heterogeneous cluster architecture proposed in the DEEP Exascale project. The OmpSs Offload model has completely fulfilled its design goals, combining the ease of use of Intel Offload with the flexibility, performance and scalability of the native MPI_Comm_spawn API. Moreover, our approach is fully integrated with the rest of the features provided by OmpSs, such as support for OpenMP codes and CUDA or OpenCL kernels. Although it was originally conceived for heterogeneous clusters, we have also successfully used it to develop hierarchical MPI applications such as FWI. We think that these hierarchical MPI architectures will play an important role in exploiting future Exascale systems. Hence, tools such as OmpSs Offload will be essential for designing such architectures and helping with their implementation in complex and large applications.
As future work, we plan to integrate our allocation API with a resource manager/job scheduler to avoid the need to reserve all required resources before the program is launched. We also plan to investigate the potential of OmpSs Offload to improve the malleability of existing MPI applications, as well as the implications of using this offload model from the resilience point of view.
Measurements for BSC FWI (full waveform inversion) code
From DEEP to DEEP-ER
– Simplified interconnect
– On-node NVM
– Self-booting nodes
– Network-attached memory
DEEP-ER Scalable I/O
Leverage presence of fast local NVM storage:
– Scalable caching of read/write data close to the requesting node
– Prefetching stages read data into caches
– Write-back scheme saves data to permanent storage
– Synchronous (done) and asynchronous (work in progress) versions/APIs
DEEP-ER Resiliency Scheme
BSC Full Waveform Inversion Results
Using 60 cores per Xeon Phi coprocessor node with 180 threads
[Chart: Impact of different optimizations of the wave propagator on Xeon Phi, plotting Gflop/s and speed-up per optimization step.]
INRIA MAXW-DGTD Results
[Charts: speed-up over the initial version and parallel efficiency vs. number of cores (16–1024), before and after optimization.]
Improvements applied:
• Non-blocking communication
• Renumbering scheme
• Vectorisation and locality
Performance improvement of up to 3.3×; parallel efficiency is now almost perfect.
Setup:
– Human head test case
– DEEP Cluster
– Mesh: 1.8 million cells
– 16 processes per node
– Pure MPI
– P1 approximation
*Leger, R., Alvarez Mallon, D., Duran, A., Lanteri, S., "Assessing the DEEP-ER Cluster/Booster Architecture with a finite-element type solver for bioelectromagnetics", submitted to PARCO2015, Contribution ID: 25. www.parco2015.org