Page 1: A leap forward with UTK’s Cray XC30

A leap forward with UTK’s Cray XC30

Mark R. Fahey
NICS Deputy Director
Industrial and Systems Engineering
University of Tennessee, Knoxville

ISC14, June 24, 2014

Page 2: A leap forward with UTK’s Cray XC30


Outline

· Overview
· Architecture
· Deployment
· Results
  – MPI Benchmarks
  – Rank placement
  – Applications
· Conclusions

Page 3: A leap forward with UTK’s Cray XC30


Overview

· To meet the ever-increasing computational science research needs at the University of Tennessee, Knoxville (UTK) and its collaborating institutions, the Joint Institute for Computational Sciences (JICS) acquired a new Cray XC30 supercomputer, called Darter
· Peak performance of 248.9 Teraflops
· The Cray XC30 combines Intel Xeon processors, the Cray Aries interconnect in a Dragonfly network topology, and enhancements to the Cray operating system and programming environments
· Also configured with an integrated storage solution (Sonexion)

Page 4: A leap forward with UTK’s Cray XC30


Darter architecture

Component                    Value
Processor                    Intel Xeon E5
Cores/processor              8
Hyperthreads/core            2
Core frequency               2.6 GHz
Sockets/node                 2
Memory/node                  32 GB
Memory bandwidth/node        25.6 GB/s
Total number of nodes        748 *
Total number of processors   1,496 *
Total number of cores        11,968 *
NIC                          Aries
Topology                     Dragonfly

· 4 cabinets
· 248.9 TF/s peak
· 2 external login nodes
· 5 service blades

* Recent I/O upgrades have reduced the number of compute nodes.
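
As a sanity check, the quoted 248.9 TF/s peak follows directly from the table if one assumes 8 double-precision flops per core per cycle (AVX on this generation of Xeon E5; the flops-per-cycle figure is our assumption, not stated on the slide). A minimal sketch:

```c
/* Sanity check of the quoted peak performance. The 8 DP flops per core
 * per cycle is an assumption (AVX-era Xeon E5); all other numbers come
 * from the architecture table above. */
#include <stdio.h>

int main(void)
{
    const int nodes = 748, sockets_per_node = 2, cores_per_socket = 8;
    const double ghz = 2.6, flops_per_cycle = 8.0;

    double peak_tf = nodes * sockets_per_node * cores_per_socket
                   * ghz * flops_per_cycle / 1000.0;
    printf("peak = %.1f TF/s\n", peak_tf);  /* prints 248.9 TF/s */
    return 0;
}
```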

Page 5: A leap forward with UTK’s Cray XC30


Schedule

Stage                            Date
Hardware arrived                 March 19, 2013
System booted                    March 20, 2013
Hardware diagnostics completed   March 21, 2013
Acceptance testing began         March 22, 2013
Acceptance testing completed     March 28, 2013
Early user access                April 2, 2013

• The system was deployed, accepted, and placed into production in only two weeks.

• The Spring Experiment for the Center for Analysis and Prediction of Storms (CAPS) largely drove the accelerated timeline, as the experiment was scheduled to start in mid-April.

• HPC system administrators and computational scientists worked around the clock to meet the April 2nd deliverable.

• Scientists from CAPS immediately began functionality testing as they worked to port their scientific models to this new computing architecture.

• Cray provided remote access to the system at their Chippewa Falls location, so that much of the preliminary work could be accomplished prior to delivery.

Page 6: A leap forward with UTK’s Cray XC30


Acceptance Testing - Darter

· The functionality and performance phases were combined into a single set of tests that were run continuously for 12 hours. The stability phase was run for a total of 72 hours, which included the 12 hours of combined functionality-performance testing.

· A series of benchmarks, including HPL, MPI-Stream, G-FFTE, the Intel MPI Benchmarks (IMB), and IOR, was used to test the system’s processors, memory, interconnect, and filesystem for proper performance.

· Scientific applications (PARATEC and WRF) and the HPCC benchmarks were run on the system to test for proper system operation. These applications and benchmarks, along with IMB, were run continuously during the stability phase of the application acceptance test.
  – Over 700 tests were executed, with a pass percentage of over 99.8%.
  – A single job failure was observed during stability testing, due to a node failure.
  – Other job failures not represented in these statistics (ten total) were due to issues unrelated to system or application operation.

Page 7: A leap forward with UTK’s Cray XC30


Acceptance Testing: Performance

· HPL
  – Metric: using at least 90% of the compute cores, achieve >= 65% of the theoretical peak of the processors used
  – Observed: 200.057 TF on 744 nodes, ~80.8% of peak

· MPI-Stream
  – Metric: test on at least 90% of the nodes, achieving 50% of peak
  – Observed: ran on 675 nodes (over 90% of all compute nodes) and measured an aggregate of 52,308.3 GB/s, or 77.49 GB/s per node, which is 75.7% of peak

· G-FFTE
  – Metric: using 8,192 cores, achieve at least 1.5 TF
  – Observed: 1.99 TF on 8,192 cores

· IOR, file per task
  – Metric: 7 GB/s
  – Observed: write rate 10.32 GB/s; read rate 4.89 GB/s; average of the write and read rates 7.61 GB/s

· IOR, no more than 5 shared files
  – Metric: 4.5 GB/s
  – Observed: write rate 6.85 GB/s; read rate 6.10 GB/s
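
Two of the observed figures can be cross-checked with quick arithmetic. A minimal sketch, reusing the assumed 20.8 GF/core (2.6 GHz x 8 DP flops per cycle, our assumption) for the HPL row and a simple mean for the IOR file-per-task row:

```c
/* Cross-check of two table entries; the 20.8 GF/core figure rests on
 * an assumed 8 DP flops per core per cycle. */
#include <stdio.h>

int main(void)
{
    /* HPL: theoretical peak of the 744 nodes (16 cores each) actually used. */
    double peak_tf = 744 * 16 * 20.8 / 1000.0;              /* ~247.6 TF */
    printf("HPL efficiency: %.1f%%\n", 100.0 * 200.057 / peak_tf);  /* ~80.8% */

    /* IOR (file per task): mean of the write and read rates. */
    printf("IOR average: %.3f GB/s\n", (10.32 + 4.89) / 2.0);       /* ~7.61 */
    return 0;
}
```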

Page 8: A leap forward with UTK’s Cray XC30


Acceptance Testing – Results

· Significant improvements in application performance, due to the improvements in processor, memory, and interconnect performance compared to previous Cray generations.
  – The Intel Xeon processors provide a factor-of-two improvement in both peak performance and the High Performance Linpack (HPL) benchmark on a per-core basis over the AMD Opteron processors used in Kraken (a Cray XT5).
  – Improvements in memory bandwidth provide another three-fold increase in performance over the Cray XT5. A set of six applications, including WRF, was run on both systems, showing average per-core performance improvements by factors between 1 and 4.
  – WRF showed the most improvement (nearly 4x), while all but one of the remaining applications demonstrated per-core performance improvements nearly equal to or greater than 2x.

· Considering only acquisition and operating costs, the total cost of ownership (TCO) for the Cray XC30 is roughly 2.5 times that of the Cray XT5 on a per-core basis. The performance improvements are typically greater than this factor. Of course, this reported TCO for the XC30 will decrease over time as acquisition costs are amortized, so the performance gains over the XT5 will increase relative to the TCO.

Page 9: A leap forward with UTK’s Cray XC30


MPI Benchmarks

· The XC30 is the first Cray line that uses the Dragonfly network topology.
  – This hierarchical network topology has obvious advantages with respect to local communication.
  – However, the trade-offs between bandwidth and latency are less apparent for different communication patterns as they extend further down the hierarchy.
  – To gain some insight into XC30 network performance, JICS staff ran several experiments using the Intel MPI Benchmarks (IMB).

· To provide flexibility in running different MPI benchmarks and experiments, a custom code was used that creates MPI communicators (a minimal sketch follows).
  – This code wraps IMB and passes the created MPI communicators to IMB, which runs the benchmarks on them.
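
The wrapper itself is not reproduced in the talk. Below is a minimal sketch of the idea under an assumed pairwise grouping policy: build sub-communicators with MPI_Comm_split and time a Sendrecv on each, the way custom communicators can be handed to a benchmark such as IMB. All names and parameters here are illustrative.

```c
/* Minimal sketch (not the actual JICS wrapper): split MPI_COMM_WORLD
 * into sub-communicators and time a Sendrecv exchange on each, in the
 * spirit of passing custom communicators to IMB. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Example policy: group ranks pairwise (color = rank / 2), the kind
     * of placement-controlled communicator one might hand to IMB. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &sub);

    int srank, ssize;
    MPI_Comm_rank(sub, &srank);
    MPI_Comm_size(sub, &ssize);

    enum { NBYTES = 1 << 20, REPS = 100 };
    static char sendbuf[NBYTES], recvbuf[NBYTES];
    memset(sendbuf, 1, NBYTES);

    if (ssize == 2) {                     /* pairwise Sendrecv timing */
        int peer = 1 - srank;
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Sendrecv(sendbuf, NBYTES, MPI_CHAR, peer, 0,
                         recvbuf, NBYTES, MPI_CHAR, peer, 0,
                         sub, MPI_STATUS_IGNORE);
        double dt = (MPI_Wtime() - t0) / REPS;
        if (srank == 0)  /* per-direction bandwidth, roughly */
            printf("rank %d: %.1f MB/s\n", rank, NBYTES / dt / 1e6);
    }

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```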

· The next slide shows only the point-to-point benchmarks; more information is in the paper.

Page 10: A leap forward with UTK’s Cray XC30


Bandwidth and latency of two typical point-to-point MPI communication patterns, Sendrecv and Exchange

For both Sendrecv and Exchange, communication latency remains more or less constant for message sizes up to 1 KB. A possible explanation is that 1,024 bytes is the default cut-off point for the short-message algorithm in Cray’s communication libraries. For both communication patterns, bandwidth continues to increase with message size; even at the largest message sizes used in these benchmarks, the available bandwidth of the network links was not saturated.

Page 11: A leap forward with UTK’s Cray XC30


Rank placement

· One premise of the XC30 system is that parallel job efficiency is independent of a job’s placement on the system
  – In other words, applications “can be placed anywhere on the system without having significant performance degradation”
  – This premise is supported by at least two features of the XC30 interconnect: the high global bandwidth and the extensive use of adaptive routing to minimize the number of connections and avoid network congestion

· JICS tested this premise using two applications on Darter
  – One application is a domain-decomposed hydrodynamics solver in which every time step requires several ghost exchanges and global time-step reductions via MPI_Allreduce, following the evolution of a Sedov-Taylor blast wave test problem (this communication pattern is sketched below)
  – The second application is an FFT-based Poisson equation solver called PSPFFT, in which every step requires a three-dimensional transpose via MPI_Alltoall on multiple MPI sub-communicators
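
Neither solver’s source appears in the talk; the following is a hypothetical 1-D sketch of the communication pattern the first application is described as using: ghost (halo) exchanges with neighbors, followed by a global time-step reduction via MPI_Allreduce. The real solver is multidimensional, and all names here are illustrative.

```c
/* Hedged sketch of one hydrodynamics time step's communication:
 * halo exchange with left/right neighbors, then a global MPI_Allreduce
 * to pick the minimum stable time step across all ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 1024  /* local zones per rank (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N + 2];                 /* local zones plus two ghost zones */
    for (int i = 1; i <= N; i++) u[i] = rank + i * 1e-6;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Ghost exchange: send edge zones, receive into ghost zones. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Global time-step reduction: every rank proposes a local dt;
     * the run advances with the global minimum. */
    double local_dt = 1.0 / (rank + 1), dt;
    MPI_Allreduce(&local_dt, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0) printf("global dt = %g\n", dt);
    MPI_Finalize();
    return 0;
}
```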

Page 12: A leap forward with UTK’s Cray XC30


Rank placement (2)

· The relative time difference as a function of “average node distance” is shown
  – Average node distance increases as nodes are physically further apart and varies linearly with node ID; it is independent of network distance, since topological information is ignored (one plausible reading of this metric is sketched below)

· The figures show that for small job sizes, job placement hardly affected job run time. For larger job sizes, the data suggest that there are benefits when a job occupies nodes that are, on average, further apart in the network
  – This result is somewhat counter-intuitive; we speculate that with nodes that are less densely packed together, network traffic can take different routes through this topology and, via the dynamic routing algorithm of the interconnect, better avoid congestion
  – However, further work is needed to definitively explain this behavior. A more rigorous and controlled study of the dynamic routing behavior awaits a mechanism to access this algorithm, which Cray does not yet provide
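
The talk does not give a formula for “average node distance.” The sketch below implements one plausible reading, the mean pairwise absolute difference of the node IDs a job occupies, purely for illustration; the study’s exact definition may differ.

```c
/* Hedged sketch: one plausible reading of "average node distance" --
 * the mean pairwise absolute difference of a job's node IDs. */
#include <stdio.h>
#include <stdlib.h>

double avg_node_distance(const int *ids, int n)
{
    long long sum = 0;
    long long pairs = (long long)n * (n - 1) / 2;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            sum += llabs((long long)ids[i] - ids[j]);
    return pairs ? (double)sum / pairs : 0.0;
}

int main(void)
{
    int packed[]    = { 100, 101, 102, 103 };  /* contiguous allocation */
    int scattered[] = { 100, 250, 400, 700 };  /* spread across the machine */
    printf("packed:    %.1f\n", avg_node_distance(packed, 4));
    printf("scattered: %.1f\n", avg_node_distance(scattered, 4));
    return 0;
}
```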

Page 13: A leap forward with UTK’s Cray XC30


Extreme Climate Events

· Effects on sensitive ecosystems will become more pronounced and observable as greenhouse gas levels increase
  – Understanding the linkages among climate change on the global scale, shifting weather patterns and air quality on the regional scale, and extreme events on the basin scale will be key to characterizing the ecological impacts of climate change
  – In particular, many countries in Asia (e.g., China, South Asia, and Southeast Asia) are encountering more serious greenhouse gas and other pollutant (SO2, NOx, and PM2.5) emission problems due to rapid urbanization and economic growth

· The global climate model Community Earth System Model (CESM), the regional climate model Weather Research and Forecasting (WRF), and the regional chemistry model Community Multiscale Air Quality Modeling System (CMAQ) have been adopted to investigate the dynamical downscaling technique at fine horizontal resolution
  – With high resolution, it is easier and more reasonable to examine the frequency and intensity of extreme events and air quality in different regions and cities

2-3x runtime speedup

The figure shows the comparison of average surface temperature between the present and future cases. The average temperature for most of eastern China increases by the end of the 2050s, especially in the southern and eastern parts. Further analysis of the intensity and duration of extreme weather events, such as heat waves and extreme precipitation, will be performed and evaluated statistically for the areas of interest.

Page 14: A leap forward with UTK’s Cray XC30


Fusion Microturbulence - GYRO

· Simulation of fusion microturbulence
· Computes the turbulent radial transport of particles and energy in tokamak plasmas
· Solves the 5-D coupled time-dependent nonlinear gyrokinetic-Maxwell equations with gyrokinetic ions and electrons
· Can operate as a flux-tube (local) code or as a global code, with electrostatic or electromagnetic fluctuations
· Propagates the system forward in time using either a 4th-order explicit Runge-Kutta (RK) integrator or a 2nd-order implicit-explicit RK integrator with a 4th-order explicit Eulerian algorithm (a generic RK4 step is sketched below)
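
GYRO’s actual integrator is not shown in the talk; as a reference point, here is the textbook 4th-order explicit Runge-Kutta step named in the last bullet, applied to a toy decay equation standing in for the far more complex gyrokinetic right-hand side.

```c
/* Illustrative sketch only: a classic 4th-order explicit Runge-Kutta
 * step, the style of integrator GYRO can use to advance its system. */
#include <math.h>
#include <stdio.h>

typedef double (*rhs_fn)(double t, double y);

double rk4_step(rhs_fn f, double t, double y, double h)
{
    double k1 = f(t, y);
    double k2 = f(t + 0.5 * h, y + 0.5 * h * k1);
    double k3 = f(t + 0.5 * h, y + 0.5 * h * k2);
    double k4 = f(t + h, y + h * k3);
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
}

/* Toy right-hand side: dy/dt = -y, exact solution exp(-t). */
static double decay(double t, double y) { (void)t; return -y; }

int main(void)
{
    double t = 0.0, y = 1.0, h = 0.1;
    for (int i = 0; i < 10; i++, t += h)
        y = rk4_step(decay, t, y, h);
    /* RK4 matches the exact solution to ~4e-7 at this step size. */
    printf("y(1.0) = %.7f, exact = %.7f\n", y, exp(-1.0));
    return 0;
}
```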

Page 15: A leap forward with UTK’s Cray XC30


Applications – storm prediction

· The Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma ran Storm-Scale Ensemble Forecasts (SSEF)
  – supporting the National Oceanic and Atmospheric Administration (NOAA) Hazardous Weather Testbed (HWT) 2013 Spring Experiment
  – a yearly high-profile experiment that investigates the use of convection-allowing model forecasts as guidance for the prediction of hazardous convective weather

· The 2013 CAPS SSEF runs lasted seven weeks, between April 22 and June 7, 2013, overlapping with the HWT 2013 Spring Experiment from May 6 to June 7, 2013
  – A thirty-member storm-scale ensemble forecast of 48 hours, at 4-km horizontal grid spacing over the entire continental United States (CONUS), was produced daily
  – Three state-of-the-science numerical weather prediction (NWP) models were used:
    · Advanced Research version of the Weather Research and Forecasting model (WRF-ARW)
    · Advanced Regional Prediction System (ARPS)
    · Navy COAMPS model system
  – Each ensemble member forecast has unique initial-condition and lateral boundary condition perturbations and model physics options
  – Data from over 140 WSR-88D Doppler weather radars, together with conventional observations over the CONUS, were assimilated in real time into each member using the ARPS 3DVAR and Complex Cloud Analysis package
  – Ensemble forecast products, including probabilistic severe weather guidance, tornadic weather potential and intensity, and flash flood and damaging wind guidance, were made available to HWT participants 12-36 hours in advance

20-hour forecast initiated from 00 UTC May 20, 2013, valid at 20 UTC May 20, at the time of the Moore tornado touchdown

3x runtime speedup

Page 16: A leap forward with UTK’s Cray XC30


Applications – light water reactor modeling

· The Consortium for Advanced Simulation of Light Water Reactors (CASL) was established as the first U.S. Department of Energy (DOE) Innovation Hub, created to accelerate the application of advanced modeling and simulation (M&S) to the analysis of nuclear reactors.
· CASL applies existing M&S capabilities and develops advanced capabilities to create a usable environment for predictive simulation of light water reactors (LWRs).
· The CASL vision is to predict the performance of nuclear reactors through comprehensive, science-based modeling and simulation technology that is deployed and applied broadly throughout the nuclear energy industry to enhance safety, reliability, and economics.
· CASL used Darter as a development platform for a physics component known as MPACT, being developed primarily by staff and students at the University of Michigan.
  – MPACT is one of two components being developed to simulate the behavior of neutrons within reactor cores.
  – MPACT performed multiple simulations that used over 3,000 cores for roughly 3.5 hours.
  – A result is depicted at right.
· One of the most significant milestones CASL has delivered, simulation of the Watts Bar Unit 1 zero-power physics test, was completed successfully in early June 2013.

Thermal Neutron Flux Distribution Generated by MPACT on Darter

2x runtime speedup compared to XE6

Page 17: A leap forward with UTK’s Cray XC30


Applications – astrophysics

· The UTK-ORNL Astrophysics Theory Group uses their multidimensional radiation hydrodynamics code, CHIMERA, to make substantial strides in furthering our understanding of the explosion mechanism in core-collapse supernovae.
  – CHIMERA has been used for supernova modeling on previous generations of Cray machines and was easily ported to run on Darter.

· Core-collapse supernovae are the violent deaths of massive stars and are among the most powerful explosions in the universe
  – Core-collapse supernovae are an important link in our chain of origins from the Big Bang to the present.
  – During their lifetimes, massive stars are the dominant sites where the elements in the periodic table between oxygen and iron are synthesized, in the process known as stellar nucleosynthesis.
  – Explosion events then disperse these elements to the rest of the universe, making life as we know it possible.

· Nucleosynthesis does not cease during the explosion event of core-collapse supernovae.
  – There is growing evidence that core-collapse supernovae are responsible for producing half the elements heavier than iron, in a process known as r-process nucleosynthesis.

Page 18: A leap forward with UTK’s Cray XC30


Applications – astrophysics

· The UTK-ORNL Astro Group is evolving a two-dimensional core-collapse supernova model from the inner core of a 15-solar-mass, non-rotating progenitor on a polar axisymmetric grid with 720 radial zones, 240 angular zones, and a 150-species nuclear reaction network.
  – Early results are promising and show quantitative differences with simulations from the same model with an α-network.
  – Qualitative comparisons, however, must be postponed until the explosion energy asymptotes and tracer particle fates are determined, marking the “completion” of the model.
  – Prior experience has shown that this is typically achieved after roughly 1.5 seconds of evolution.

After 10 days of computation on 1,920 cores, the model has evolved 650 ms in total, and the average shock radius has reached approximately 880 km.

Entropy plot of the model with the 150-species nuclear network at 300 ms after core bounce, where some of the characteristic behavior of the explosion, such as the SASI [3], is apparent.

3x runtime speedup compared to XE6

Page 19: A leap forward with UTK’s Cray XC30


Conclusions

· UTK acquired a Cray XC30 supercomputer, called Darter, with a peak performance of 248.9 Teraflops
· Deployed in late March 2013 on a very aggressive production timeline
· Science productivity began immediately
  – Even supported nightly operational projects (storm prediction)
  – Most applications have seen a 2-3x speedup with just a simple recompile (no optimization)
· Application teams have benefited greatly from
  – the consistency of the Cray multigenerational software stack
  – quick ports and immediate speedups due to processor, interconnect, and compiler improvements

Page 20: A leap forward with UTK’s Cray XC30


Acknowledgments

· We would like to thank
  – Joshua Fu for the extreme climate events highlight
  – Ming Xue for the contribution of the CAPS highlight
  – John Turner for the contribution of the CASL highlight
  – James Austin for the contribution of the Astrophysics highlight

· This material is based upon work performed using computational resources supported by the University of Tennessee and Oak Ridge National Laboratory Joint Institute for Computational Sciences

Page 21: A leap forward with UTK’s Cray XC30


Thank you

· Questions?
  – [email protected]

Page 22: A leap forward with UTK’s Cray XC30


Extra

Page 23: A leap forward with UTK’s Cray XC30


Acceptance Testing

· All production JICS resources are put through a defined set of acceptance tests to determine that their functionality and performance baselines meet expected parameters.
· Based on the size and expected use of the system, varying criteria are defined as components of the acceptance test.
· In general, the acceptance tests are divided into hardware and application tests. The hardware tests include initial diagnostics and testing that confirm the delivered hardware and firmware are correctly configured and functioning.
· Additionally, the system is configured to use the center’s infrastructure, including the necessary file systems, user authentication, and network.
· The application tests are further subdivided into functionality, performance, and stability phases.
· At the end of each phase, the application acceptance tests are run continuously on the system for a specified amount of time.
· This “continuous system integration” step ensures that all parts of the system are tested under conditions similar to production utilization.

