Page 1: WRF Performance and Scaling Assessment

WRF Performance and Scaling Assessment

Christopher Kruse
Mentor: Davide Del Vento

Page 2: WRF Performance and Scaling Assessment

Research Interests

Atmospheric Waves

Mountain Meteorology

High-resolution simulation!

– High resolution is required with complex terrain!

– Requires lots of computational power!

Page 3: WRF Performance and Scaling Assessment

Internship Research

Main Goals:

– Document the scaling performance of WRF over nearly the entire Yellowstone (YS) supercomputer

– Investigate the performance of WRF with different configurations, compilers, and MPI implementations

• Hybrid parallelization and scaling
• Domain decomposition
• MPI implementations (IBM Parallel Environment and Intel MPI Library)
• Hyperthreading

Page 4: WRF Performance and Scaling Assessment

The Weather Research and Forecasting (WRF) Model

Open source

Extensible software architecture

Large user base

– 20,000 users in 130 countries

Used in operational, research, and educational settings

www.vapor.ucar.edu

Page 5: WRF Performance and Scaling Assessment

Previous Scaling Assessments

WRF has been benchmarked on many systems

There are a number of ways to quantify model performance

– E.g., Gflop/s, simulation speed, time steps completed per unit time

WRF scales well to 8K cores

Standard workloads are “small”

– 2.5-km CONUS: ≈ 63 million grid points

http://www.mmm.ucar.edu/wrf/WG2/benchv3/

Page 6: WRF Performance and Scaling Assessment

How can a workload be small?

As patches get smaller, the ratio of perimeter points to interior points increases

– A surface-to-volume ratio argument

The ratio of MPI communication time to compute time therefore increases, and performance decreases

Ran a high-resolution Hurricane Katrina simulation

– ≈ 371 million grid points

[Figure: the logical domain decomposed into patches (one per MPI task) and tiles within a patch; inter-processor communication occurs across patch perimeters]
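Stated compactly for a square patch of n × n horizontal points, a sketch of the argument (halo width and constant factors ignored):

$$\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \;\propto\; \frac{\text{perimeter points}}{\text{interior points}} \;=\; \frac{4n-4}{(n-2)^{2}} \;\approx\; \frac{4}{n} \quad \text{for large } n,$$

so halving the patch side roughly doubles the relative cost of communication.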

Page 7: WRF Performance and Scaling Assessment

Hybrid Parallelization

Tested hybrid parallelization at a relatively “small” core count

– Hybrid = distributed-memory (MPI) plus shared-memory (OpenMP) parallelism

Cores were not oversubscribed

Any number of threads per task greater than one decreased performance!

256 Nodes = 4096 cores

Page 8: WRF Performance and Scaling Assessment

Hybrid Parallelization

Tested MPI task affinity/process binding on Yellowstone

Significantly increases performance for hybrid jobs with # threads > 1

Fastest and most stable runs with 2 MPI tasks × 8 OpenMP threads per node

256 Nodes = 4096 cores

MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
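As a concrete illustration, a minimal sketch of the run environment for that 2 × 8 configuration on Yellowstone's 16-core nodes; the launcher name and job geometry are assumptions of this sketch, not a record of the exact scripts used:

```sh
# Hybrid WRF run: 2 MPI tasks x 8 OpenMP threads per 16-core node.
export OMP_NUM_THREADS=8

# IBM Parallel Environment affinity: bind each MPI task to a block of
# $OMP_NUM_THREADS cores so its threads stay on those cores.
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS

# Launch WRF under LSF/IBM PE (mpirun.lsf assumed as the launcher here).
mpirun.lsf ./wrf.exe
```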

Page 9: WRF Performance and Scaling Assessment

Scaling Assessment

Was a computer hog

– Frequently “flooded the queues”

Scaled the 1-km Katrina workload from 512 to 65,536 cores!

– ≈ 91% of Yellowstone!

Intel Compiler performed best

Issues with seg faults using PGI at core counts < 16K

Linear scaling through 16K cores

Page 10: WRF Performance and Scaling Assessment

Per-Core Scaling

Why does WRF scaling cease to be linear above 16K cores?

– That perimeter-to-area ratio

These data indicate scaling becomes non-linear for # grid points per core ≲ 700

– This drop-off in performance occurs at nearly the same point

Page 11: WRF Performance and Scaling Assessment

Per-Core Scaling

Why does WRF scaling cease to be linear above 16K cores?

– That perimeter-to-area ratio

These data indicate scaling becomes non-linear for # grid points per core ≲ 700

– This drop-off in performance occurs at nearly the same point

The drop-off in scaling at extreme core counts is due to increased communication within the application, not to Yellowstone itself

Page 12: WRF Performance and Scaling Assessment

Total Times

• WRF runs broken down into:

– Total time

– Initialization time – reading and processing input

– Compute time – computation and MPI communication

– Write time – time to write one time step (≈29 GB) to disk

• Disk I/O quickly dominates total run time… (see the timing-extraction sketch below)
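These per-step compute and write times can be read directly from WRF's own log output. A minimal sketch, assuming WRF's default "Timing for main" / "Timing for Writing" lines in the lead task's rsl.out.0000 file (file name and line format are the usual WRF defaults, not something specific to these runs):

```sh
# Per-step compute time: lines look like
#   "Timing for main: time 2005-08-29_00:00:30 on domain 1: 12.34 elapsed seconds"
grep "Timing for main" rsl.out.0000 | awk '{print $(NF-2)}'

# Time spent writing each history file:
grep "Timing for Writing" rsl.out.0000 | awk '{print $(NF-2)}'
```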

Page 13: WRF Performance and Scaling Assessment

I/O Discussion

• Tried quilting (asynchronous I/O; see the namelist sketch below)

– Works great when it works

• Tried splitting output and input files

– For IBM PE at 8K cores, decreased the time to write from ≈2700 s to 0.2 s

– No supported post-processing utilities available for the split files

• Support also exists for PNetCDF and PHDF5

– Did not test
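For reference, a minimal sketch of how these two options are typically switched on in namelist.input; the task counts and the assumption that the history stream starts at io_form_history = 2 (single netCDF) are illustrative, not the exact settings benchmarked here:

```sh
# Split history output: adding 100 to the netCDF io_form (2 -> 102) makes each
# MPI task write its own file (note the caveat above: standard post-processing
# tools do not read these split files).
cp namelist.input namelist.input.bak
sed -i 's/io_form_history *= *2/io_form_history = 102/' namelist.input

# Quilting (asynchronous output): print an &namelist_quilt fragment to merge into
# namelist.input by hand; the dedicated I/O tasks come on top of the compute tasks.
cat <<'EOF'
 &namelist_quilt
   nio_tasks_per_group = 4,
   nio_groups          = 1,
 /
EOF
```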

Page 14: WRF Performance and Scaling Assessment

Domain Decomposition

Communication affects performance

Domain decomposition can be specified

– E.g., patches can be made very rectangular or close to square

Holding area constant, the perimeter of a rectangle is minimized when its aspect ratio is 1 (i.e., it is square)

A square patch minimizes the required communication, so square is best, right?
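For reference, a one-line sketch of the argument behind that expectation, with x the patch width and A its fixed area:

$$P(x) = 2\left(x + \frac{A}{x}\right), \qquad \frac{dP}{dx} = 2\left(1 - \frac{A}{x^{2}}\right) = 0 \;\Rightarrow\; x = \sqrt{A},$$

i.e., the minimum-perimeter rectangle of a given area is the square.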

Page 15: WRF Performance and Scaling Assessment

Domain Decomposition

Communication affects performance

Domain decomposition can be specified

– E.g., patches can be made very rectangular or close to square

Holding area constant, the perimeter of a rectangle is minimized when its aspect ratio is 1 (i.e., it is square)

A square patch minimizes the required communication, so square is best, right? Wrong…

Page 16: WRF Performance and Scaling Assessment

Domain Decomposition (not finished)

• 1-km Katrina workload

• Chose to run on 2520 cores, 15 per node

– Many factor-pair permutations

• Best performance when patches were nearly 10× longer east-west than north-south (see the sketch below)
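WRF accepts an explicit decomposition through the nproc_x and nproc_y variables of the &domains namelist record; a minimal sketch follows, with illustrative task counts (14 × 180 = 2520) rather than the exact pair that performed best in these tests:

```sh
# Print an &domains fragment to merge by hand into namelist.input.
# nproc_x * nproc_y must equal the number of MPI compute tasks.
# Fewer divisions in x (west-east) than in y (south-north) yields patches that
# are elongated west-east, as favored by the results above.
cat <<'EOF'
 &domains
   nproc_x = 14,
   nproc_y = 180,
 /
EOF
```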

Page 17: WRF Performance and Scaling Assessment

Domain Decomposition

• Repeated the decomposition test with a completely different simulation

• Again, square was not best

• In this case, the choice of decomposition could decrease performance by 50%!

Page 18: WRF Performance and Scaling Assessment

Conclusions

WRF integration is efficient and scales well up to 64K cores

Initialization and writing output (i.e., disk I/O) with default I/O settings “anti-scale”

– Significantly limits realistic output frequencies at high core counts!

The choice of domain decomposition can significantly reduce performance when the domain is highly decomposed

– i.e., few grid points assigned to each core

Page 19: WRF Performance and Scaling Assessment

Acknowledgements

Davide Del Vento - NCAR
Raffaele Montuoro - Texas A&M
Scott McMillan - Intel Corporation
Mark Lubin - Intel Corporation

I could not have gathered and presented all my measurements without the work these people had completed before me.

Page 20: WRF Performance and Scaling Assessment

Questions?

Thanks

Page 21: WRF Performance and Scaling Assessment

References

Barker, D., et al., 2012: The Weather Research and Forecasting model's community variational/ensemble data assimilation system. Bull. Amer. Meteor. Soc., 93, 831-843.

Intel, 2013a: Intel(R) Composer XE 2013. http://software.intel.com/en-us/intel-composer-xe.

Intel, 2013b: Intel(R) MPI Library. http://www.intel.com/go/mpi.

Michalakes, J., et al., 2008: WRF nature run. J. of Physics: Conference Series, 125, doi:10.1088/1742-6596/125/1/012022.

MPI, 2012: MPI: A Message-Passing Interface standard version 3.0. http://www.mpi-forum.org/docs/mpi3.0/mpi30-report.pdf.

NCAR, 2013: NCAR-Wyoming Supercomputing Center. http://www2.cisl.ucar.edu/resources/yellowstone.

OpenMP Architecture Review Board, 2011: OpenMP application program interface version 3.1. http://www.openmp.org/mpdocuments/OpenMP3.1.pdf.

Page 22: WRF Performance and Scaling Assessment

References

Patricola, C. M., P. Chang, R. Saravanan, and R. Montuoro, 2012: The effect of atmosphere-ocean-wave interactions and model resolution on Hurricane Katrina in a coupled regional climate model. Geophys. Res. Abs., 14, http://meetingorganizer.copernicus.org/EGU2012/EGU2012-11855.pdf.

Quintero, D., A. Chaudhri, F. Dong, J. Higino, P. Mayes, K. Sacilotto de Souza, W. Moschetta, and X. T. Xu, 2013: IBM Parallel Environment (PE) Developer Edition. http://www.redbooks.ibm.com/redbooks/pdfs/sg248075.pdf.

Saha, S., et al., 2010: The NCEP climate forecast system reanalysis. Bull. Amer. Meteor. Soc., 91, 1015-1057.

Skamarock, W. C., et al., 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp. [Available online at http://www.mmm.ucar.edu/wrf/users/docs/arw_v3.pdf.]

Page 23: WRF Performance and Scaling Assessment

Additional – Katrina 3-km Decomposition

• Did decomposition testing with the 3-km workload

• Two curves apparent – the lower one suspected to be due to a bad node?

• Otherwise, the top curve's maxima were predicted by the previous decomposition tests

Page 24: WRF Performance and Scaling Assessment

Additional – Scaling

• Similar scaling analysis as before, but performance vs. total grid points per patch/core

– Takes differing vertical levels between runs into account

• Comparable performance curves noted between runs

• Deviation from linearity at about 20,000 grid points per core (see the quick check below)

• Plotting against total grid points rather than horizontal grid points may be advantageous, as runs with more vertical levels require more computation in each patch, which might allow the workload to scale to higher core counts than a run with fewer vertical levels
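A quick back-of-the-envelope check that this ~20,000-points-per-core threshold is consistent with the earlier scaling results; it assumes the ≈371-million total grid point count quoted for the Katrina workload applies here:

```sh
# Total grid points per core at the core counts where scaling departs from linear.
echo $(( 371000000 / 16384 ))   # ~22,600 points/core at 16K cores (near the threshold)
echo $(( 371000000 / 65536 ))   # ~5,700 points/core at 64K cores (well below it)
```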

Page 25: WRF Performance and Scaling Assessment

Additional – Hybrid Parallelization Decomposition

• This plot supplements the previous domain decomposition plots

• The two optimal task/thread decompositions are MPI-only and 5×3 (# tasks × # threads)

Page 26: WRF Performance and Scaling Assessment

Additional – Hybrid Parallelization Decomposition with hyperthreading

• Same as plot in slide 8, but with hyperthreading

• NOTE: a typical MPI-only speed for this workload and core count is about 6.5

• This plot indicates some speed up (~8%) by using hyperthreading

4096 cores

Page 27: WRF Performance and Scaling Assessment

Additional – Hybrid Parallelization Decomposition at 128 Nodes

• While the 2×8 (tasks × threads) decomposition was optimal at 256 nodes (best combination of performance and run-to-run stability), the 4×4 decomposition was optimal at 128 nodes

2048 cores

Page 28: WRF Performance and Scaling Assessment

Additional – Hybrid Parallelization Decomposition w/ hyperthreading

• Same as plot in previous slide but with hyperthreading

• The highest-performance curve differs from the data with hyperthreading on 256 nodes (slide 25)

2048 cores