Performance Analysis, Profiling and Optimization of Weather Research and Forecasting (WRF) model
Negin Sobhani 1,2, Davide Del Vento 2, David Gill 2, Sam Elliot 3,2, and Srinath Vadlamani 4
1 University of Iowa
2 National Center for Atmospheric Research (NCAR)
3 University of Colorado at Boulder
4 ParaTools Inc.
What makes WRF expensive?
[Figure: Time (%), with the Thompson et al. 2008 microphysics scheme (mp_physics = 8)]
But is this case representative of how strongly the dynamics affect performance?
Microphysics options summary

Scheme                    mp_physics   Simulation speed   # of variables   Timesteps/s
Kessler                   1            2493.6             3                13.8
Purdue Lin et al.         2            2043.8             6                11.3
WSM-3                     3            2263.8             3                12.5
WSM-5                     4            2012.3             5                11.2
Ferrier (current NAM)     5            2451.2             5                13.6
WSM-6                     6            1859.5             6                10.3
Goddard 6-class           7            1929.9             6                10.6
Thompson et al.           8            1739.8             7                9.7
Milbrandt-Yau 2-moment    9            1189.3             13               6.6
Morrison 2-moment         10           1475.5             10               8.2
WDM-5                     14           1478.6             8                8.2
WDM-6                     16           1358.8             9                7.5
Thompson microphysics is among the more expensive microphysics schemes, and it is widely used.
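To make the comparison concrete, the short sketch below ranks the schemes by the timesteps/s column of the table above and expresses each one as a cost multiple of the fastest scheme. The numbers are copied from the table; the variable names are illustrative only.

```python
# (scheme, mp_physics, timesteps per second), copied from the summary table
schemes = [
    ("Kessler", 1, 13.8),
    ("Purdue Lin et al.", 2, 11.3),
    ("WSM-3", 3, 12.5),
    ("WSM-5", 4, 11.2),
    ("Ferrier", 5, 13.6),
    ("WSM-6", 6, 10.3),
    ("Goddard 6-class", 7, 10.6),
    ("Thompson et al.", 8, 9.7),
    ("Milbrandt-Yau 2-moment", 9, 6.6),
    ("Morrison 2-moment", 10, 8.2),
    ("WDM-5", 14, 8.2),
    ("WDM-6", 16, 7.5),
]

# The fastest scheme serves as the baseline for relative cost.
baseline = max(rate for _, _, rate in schemes)

# Sort slowest first, so the most expensive schemes come at the top.
ranked = sorted(schemes, key=lambda s: s[2])

for name, option, rate in ranked:
    print(f"{name:<24} mp_physics={option:<3} {baseline / rate:.2f}x the cost of the fastest")
```

By this measure Thompson sits in the slower half of the list, while the two-moment schemes are the most expensive.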
TAU tools
TAU (Tuning and Analysis Utilities) is a program and performance analysis tool framework for high-performance parallel and distributed computing.
TAU can automatically instrument source code, using a package called PDT, for routines, loops, I/O, memory, phases, etc.
TAU uses wallclock time and PAPI hardware-counter metrics for profiling and tracing.
Using TAU/PAPI for the Advection Module
1- PDT instrumentation for module_advect_em
2- Manually instrumented code for higher granularity of desired loops
TAU/PAPI variables analyzed:
• Time
• L1 and L2 Data Cache Misses (DCM)
• Conditional branch instructions mispredicted
• Floating point instructions and operations
• Single and double precision vector/SIMD instructions
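Manual instrumentation amounts to bracketing a region of interest with start and stop timer calls and accumulating the result per region. The sketch below illustrates that idea only; it is not TAU's API, it measures wallclock time rather than PAPI counters, and the names `region`, `totals`, `calls`, and `flux_loop` are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # accumulated wallclock seconds per region
calls = defaultdict(int)     # number of times each region was entered


@contextmanager
def region(name):
    """Time the enclosed block and add it to the named region's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - start
        calls[name] += 1


# Bracket only the loop of interest, not the whole routine.
for _ in range(3):
    with region("flux_loop"):
        sum(i * i for i in range(10_000))

for name in totals:
    print(f"{name}: {calls[name]} calls, {totals[name]:.6f} s")
```

Instrumenting individual loops this way gives finer granularity than routine-level profiles, at the price of editing the source by hand.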
Identified Hotspots
1- Positive Definite Advection Loop (32 lines)
• High time
• High cache misses (both L1 and L2)
• High branch misprediction
2- x, y, z flux 5 advection equation loops
• High time
• High cache misses
• Repeated through the code for different advection schemes
Moisture transport in ARW
• Until recently, many weather models did not conserve moisture because of numerical challenges in advection schemes, which led to a high bias in precipitation.
• WRF-ARW is conservative, but not all of the advection schemes are.
• This introduces new mass into the system.
Figure from Skamarock and Dudhia 2012
Advection schemes can introduce both positive and negative errors, particularly at sharp gradients.
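The effect can be reproduced with a toy problem. The sketch below advects a square pulse on a periodic 1-D grid with a Lax-Wendroff scheme, which is a stand-in chosen for brevity and is not one of the WRF advection schemes. The grid size, Courant number, and step count are arbitrary assumptions. The scheme itself conserves mass, but it produces undershoots and overshoots at the pulse edges, and naively clipping the negative values adds mass.

```python
def lax_wendroff_step(q, c):
    """Advance a periodic 1-D field q by one step at Courant number c (flux form)."""
    n = len(q)
    # Flux through the right face of cell i, already scaled by dt/dx.
    flux = [
        c * q[i] + 0.5 * c * (1.0 - c) * (q[(i + 1) % n] - q[i])
        for i in range(n)
    ]
    # flux[i - 1] wraps to the last face when i == 0 (periodic domain).
    return [q[i] - (flux[i] - flux[i - 1]) for i in range(n)]


n = 100
q0 = [1.0 if 40 <= i < 60 else 0.0 for i in range(n)]  # square pulse: sharp gradients

q = q0
for _ in range(20):
    q = lax_wendroff_step(q, 0.5)

# Naive fix: set negative values to zero. This is what adds mass.
clipped = [max(v, 0.0) for v in q]

print(f"min = {min(q):.3f}, max = {max(q):.3f}")
print(f"mass: initial = {sum(q0):.3f}, advected = {sum(q):.3f}, clipped = {sum(clipped):.3f}")
```

A positive definite scheme avoids this by limiting the fluxes so that negative values never appear, rather than clipping them afterwards.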
Advection options in WRF
Explicit IFs to remove negative values and overshoots
Explicit IFs to remove oscillations
• The large number of explicit IFs causes a high rate of branch mispredictions
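One common way to reduce data-dependent branches is to express a limiter with min/max instead of explicit IFs. The sketch below shows only the shape of that rewrite on a hypothetical clamp to the range [0, 1]; it is not the WRF limiter. In Python both forms still branch inside the interpreter, whereas in compiled Fortran or C a compiler can often map the min/max form to branch-free instructions.

```python
def clamp_with_ifs(values, lo=0.0, hi=1.0):
    """Remove negative values and overshoots with explicit IFs."""
    out = []
    for v in values:
        if v < lo:
            v = lo
        elif v > hi:
            v = hi
        out.append(v)
    return out


def clamp_with_minmax(values, lo=0.0, hi=1.0):
    """Same result, written as a min/max expression with no explicit IF."""
    return [min(max(v, lo), hi) for v in values]


sample = [-0.2, 0.0, 0.4, 1.0, 1.3]
print(clamp_with_ifs(sample))
print(clamp_with_minmax(sample))
```

Both functions return the same values, so the rewrite changes only how the limit is expressed, not the numerical result.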
1- Optimizing the positive definite advection module
Test 1: WRF-only case
Test 2: WRF-Chem case

Case       Advected variables                                                        Maximum performance increase
WRF        Moisture                                                                  13%
WRF-Chem   Moisture, tracers, species, scalars, chemical concentration, particles    21% *

This hotspot has potential for optimization, which could provide a significant improvement in performance.
* The performance increase is expected to be significantly higher for dust-only and particle-only WRF-Chem cases.
The effect of optimization of advection equations
2- Flux 5 advection equations
• High time and high L1 and L2 data cache misses
• This loop is repeated throughout the code for the x, y and z directions
• A very similar loop is repeated for all the advection schemes
Test 1: WRF 4 km benchmark with TAU instrumentation
• 58% of the time spent in advection is in these flux equation loops
• Many L1 data cache misses per iteration
• Many L2 data cache misses per iteration
This hotspot has potential for optimization, which could provide a significant improvement in performance.
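The slides do not reproduce the loop body. As an assumption, the sketch below takes "flux 5" to be the fifth-order upwind-biased operator commonly documented for ARW: a sixth-order centered face value plus an upwind correction whose sign follows the advecting velocity. The function and argument names are illustrative. The point of the sketch is the six-point stencil: every face value reads six neighbouring cells, which is consistent with the loop being sensitive to cache behaviour.

```python
import math


def flux6(q_im3, q_im2, q_im1, q_i, q_ip1, q_ip2):
    """Sixth-order centered value at the face between cells i-1 and i."""
    return (37.0 * (q_i + q_im1) - 8.0 * (q_ip1 + q_im2) + (q_ip2 + q_im3)) / 60.0


def flux5(q_im3, q_im2, q_im1, q_i, q_ip1, q_ip2, ua):
    """Fifth-order face value: flux6 minus an upwind correction signed by velocity ua."""
    correction = ((q_ip2 - q_im3) - 5.0 * (q_ip1 - q_im2) + 10.0 * (q_i - q_im1)) / 60.0
    return flux6(q_im3, q_im2, q_im1, q_i, q_ip1, q_ip2) - math.copysign(1.0, ua) * correction


def face_values(q, ua):
    """Face values for a periodic 1-D field; the flux is ua times each value."""
    n = len(q)
    return [
        flux5(q[(i - 3) % n], q[(i - 2) % n], q[(i - 1) % n],
              q[i], q[(i + 1) % n], q[(i + 2) % n], ua)
        for i in range(n)
    ]


# A linear field q = x sampled at x = -3..2 gives the exact face value at x = -0.5.
print(flux5(-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 1.0))
```

Because the same stencil appears once per direction and once per advection scheme, an improvement to this loop would apply in many places.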
Conclusion
WRF shows good MPI scalability depending on the workload
Thread binding should be used to improve the performance of WRF hybrid runs
Intel VTune Amplifier and TAU tools were used for performance analysis of the WRF code
Dynamics is identified as the most expensive part of ARW
We identified the hotspots of the advection module and estimated the amount of performance increase from modifying these parts of the WRF code
Ongoing and Future Work
Performance improvement of the advection module
• Analysis of hardware counters to fix branch mispredictions and cache misses
• Advection module vectorization for Intel Xeon Phi coprocessors
• Reducing memory footprint by decreasing the number of temporary variables
• Exploring performance optimization with different compiler flags
Acknowledgements
Davide Del Vento, Rich Loft, Srinath Vadlamani, Dave Gill, Greg Carmichael, and all SIParCS admins and staff
Microphysics Schemes
• Provides atmospheric heat and moisture tendencies
• Includes water vapor, cloud and precipitation processes
• Microphysical rates
• Surface rainfall
Mielikainen et al. 2014
Runge-Kutta loop (steps 1, 2, and 3)
  (i) advection, p-grad, buoyancy using
  (ii) physics if step 1, save for steps 2 and 3
  (iii) mixing, other non-RK dynamics, save…
  (iv) assemble dynamics tendencies
  Acoustic step loop
    (i) advance U,V, then W,
    (ii) time-average U,V,W
  End acoustic loop
  Advance scalars using time-averaged U,V,W
End Runge-Kutta loop
Other physics (currently microphysics)