Performance Analysis, Profiling, and Optimization of the Weather Research and Forecasting (WRF) Model
Negin Sobhani (1,2), Davide Del Vento (2), David Gill (2), Sam Elliot (3,2), and Srinath Vadlamani (4)
(1) University of Iowa, (2) National Center for Atmospheric Research (NCAR), (3) University of Colorado at Boulder, (4) ParaTools, Inc.
Outline
Introduction
WRF MPI scalability
Hybrid parallelization
Profiling WRF
  Intel VTune Amplifier XE
  TAU tools
Identifying hotspots and suggested areas for improvement
The Weather Research & Forecasting (WRF) Model
Numerical weather prediction system
Designed for both operational forecasting and atmospheric research
Community model with a large user base: more than 30,000 users in 150 countries
Figure from WRF-ARW Technical Note
TACC Stampede Supercomputer
Aggregate peak performance: ~10 PFLOPS (PF)
6400+ Dell PowerEdge (C8220z) server nodes
Two Intel Xeon E5 (Sandy Bridge) processors and one Intel Xeon Phi coprocessor (MIC architecture) per compute node
Each compute node has 32 GB of "host" memory, with an additional 8 GB of memory on the Xeon Phi coprocessor card
2.2 PF from the Xeon E5 processors and 7.4 PF from the Xeon Phi coprocessors
Figures from tacc.utexas.edu
Hurricane Sandy Benchmark
1. Coarse resolution: 40 km (50 × 50 grid), time step 180 s
2. Fine resolution: 4 km (500 × 500 grid), time step 20 s
Time period for both simulations: a 54-hour forecast, from 2012 Oct 27 12:00 UTC through 2012 Oct 29 18:00 UTC
60 vertical layers
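A rough cost comparison of the two configurations (a back-of-the-envelope estimate from the numbers above, ignoring I/O and parallel overheads): the fine run has 100 times as many horizontal grid points and takes 9 times as many time steps, so

\[ \frac{\text{cost}_{4\,\mathrm{km}}}{\text{cost}_{40\,\mathrm{km}}} \approx \frac{500 \times 500}{50 \times 50} \times \frac{180\,\mathrm{s}}{20\,\mathrm{s}} = 100 \times 9 = 900. \]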
Scalability Assessment (MPI Only)
[Figure: simulation speed vs. number of MPI tasks for the 500 × 500 horizontal grid, with the compute-bound and MPI-bound regimes marked.]
Simulation speed is the duration of simulated time per unit of wall-clock time.
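Stated as a formula (restating the definition above):

\[ \text{simulation speed} = \frac{\text{simulated time}}{\text{wall-clock time}} \]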
Scalability Assessment (MPI Only)
• Allinea Performance Reports
• Separate netCDF output file per MPI task (io_form_history = 102 in the WRF namelist; excerpt below)
• 87% of total time spent in computation (compute-bound regime) vs. 79% of total time spent in MPI (MPI-bound regime)
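A minimal namelist excerpt showing the split-output setting named above (all other entries omitted); io_form_history = 102 makes each MPI task write its own history file instead of sharing a single netCDF file:

  &time_control
    io_form_history = 102   ! split netCDF output: one wrfout file per MPI task
  /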
Domain Decomposition (MPI only)
Per tile: computation scales with the tile area, while MPI communication scales with the tile perimeter, so refining the decomposition by 2 in each direction cuts computation to 1/4 but communication only to 1/2 (made explicit below).
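The estimate behind those ratios (a standard halo-exchange argument, not a measurement from this study): for an $n \times n$ tile with a fixed-width halo,

\[ \text{computation} \propto n^2, \qquad \text{communication} \propto n, \]

so halving the tile edge gives

\[ \frac{(n/2)^2}{n^2} = \frac{1}{4} \ \text{(computation)}, \qquad \frac{n/2}{n} = \frac{1}{2} \ \text{(communication)}, \]

which is why runs become MPI bound as the decomposition gets finer.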
AVX compiler flag
• AVX (Intel® Advanced Vector Extensions) is a 256-bit instruction set extension
• Enables more aggressive optimization
• Not working with the Intel 15 compiler
• This issue has been reported to Intel
Hybrid Parallelization
Hybrid: distributed plus shared memory parallelism (dmpar+smpar)
As the number of threads increases, the performance decreases
The cores were never oversubscribed
Binding increases the performance significantly (e.g., via I_MPI_PROCESSOR_LIST = p1,p2 or TACC's tacc_affinity script)
Intel VTune Amplifier XE
Intel's profiling and performance analysis tool
Profiling includes stack sampling, thread profiling, and hardware event sampling
Collects performance statistics for different parts of the code
What makes WRF expensive?
Physics configuration used:
Longwave radiation scheme: RRTMG (ra_lw_physics = 4)
Shortwave radiation scheme: CAM (ra_sw_physics = 3)
Microphysics scheme: Thompson et al. 2008 (mp_physics = 8)
[Figure: breakdown of run time (%) by model component.]
But is this case representative of the significant effect of the dynamics on performance?
Microphysics options summary
Scheme                   mp_physics   Simulation speed   # of variables   Timesteps/s
Kessler                       1            2493.6              3             13.8
Purdue Lin et al.             2            2043.8              6             11.3
WSM-3                         3            2263.8              3             12.5
WSM-5                         4            2012.3              5             11.2
Ferrier (current NAM)         5            2451.2              5             13.6
WSM-6                         6            1859.5              6             10.3
Goddard 6-class               7            1929.9              6             10.6
Thompson et al.               8            1739.8              7              9.7
Milbrandt-Yau 2-moment        9            1189.3             13              6.6
Morrison 2-moment            10            1475.5             10              8.2
WDM-5                        14            1478.6              8              8.2
WDM-6                        16            1358.8              9              7.5
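The two speed columns are mutually consistent if these runs used the coarse benchmark's 180 s time step (an inference from the numbers, not stated on the slide):

\[ \text{timesteps/s} = \frac{\text{simulation speed}}{\Delta t}, \qquad \frac{2493.6}{180\,\mathrm{s}} \approx 13.9 \approx 13.8 \ \text{(Kessler)}. \]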
Thompson microphysics is among the most expensive microphysics schemes, and it is widely used.
TAU tools
TAU (Tuning and Analysis Utilities) is a program and performance analysis tool framework for high-performance parallel and distributed computing
TAU can automatically instrument source code, using a package called PDT, for routines, loops, I/O, memory, phases, etc.
TAU uses wall-clock time and PAPI metrics to read hardware counters for profiling and tracing
Using TAU/PAPI for the Advection Module
1. PDT instrumentation for module_advect_em
2. Manually instrumented code for higher granularity of the desired loops (sketch below, after the metric list)
TAU/PAPI variables analyzed:
Time
L1 and L2 data cache misses (DCM)
Conditional branch instructions mispredicted
Floating-point instructions and operations
Single- and double-precision vector/SIMD instructions
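A minimal sketch of the manual-instrumentation pattern, assuming TAU's documented Fortran API (TAU_PROFILE_TIMER/START/STOP) and compilation through a TAU wrapper such as tau_f90.sh; the loop body is a stand-in, not WRF code:

  program tau_loop_demo
    implicit none
    integer, save :: profiler(2) = 0   ! TAU timer handle
    real :: q(100000)
    integer :: i
    call TAU_PROFILE_INIT()
    call TAU_PROFILE_SET_NODE(0)       ! serial demo; use the MPI rank in a real run
    call TAU_PROFILE_TIMER(profiler, 'positive-definite clip loop')
    call TAU_PROFILE_START(profiler)   ! time only the loop of interest
    do i = 1, size(q)
       q(i) = max(real(i) - 50000.0, 0.0)
    end do
    call TAU_PROFILE_STOP(profiler)
    print *, 'checksum:', sum(q)
  end program tau_loop_demo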
Identified Hotspots
1. Positive-definite advection loop (32 lines)
   High time; high cache misses (both L1 and L2); high branch misprediction
2. x, y, z flux5 advection equation loops
   High time; high cache misses; repeated throughout the code for the different advection schemes
Moisture transport in ARW
• Until recently, many weather models did not conserve moisture because of the numerical challenges in advection schemes, leading to a high bias in precipitation.
• The WRF-ARW scheme is conservative, but not all of the advection schemes are.
• This introduces new mass into the system.
Figure from Skamarock and Dudhia 2012
Advection schemes can introduce both positive and negative errors, particularly at sharp gradients.
Advection options in WRF
Explicit IFs to remove negative values and overshoots; explicit IFs to remove oscillations
• The high number of explicit IFs causes high branch mispredictions (see the sketch below)
[Figure: advected fields for moist_adv_opt = 0, 1, and 2, from Skamarock and Dudhia 2012]
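An illustrative sketch (not the actual WRF source) of the pattern and a branchless alternative: the explicit IF mispredicts whenever the sign of the field is unpredictable, while the MAX intrinsic compiles to a conditional move or vector max with no branch:

  program clip_demo
    implicit none
    integer, parameter :: n = 1000000
    real :: q1(n), q2(n)
    integer :: i
    call random_number(q1)
    q1 = q1 - 0.5                 ! synthetic field with mixed signs
    q2 = q1

    do i = 1, n                   ! branchy: explicit IF clips negatives
       if (q1(i) < 0.0) q1(i) = 0.0
    end do

    do i = 1, n                   ! branchless: same result via MAX
       q2(i) = max(q2(i), 0.0)
    end do

    print *, 'patterns agree:', all(q1 == q2)
  end program clip_demo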
The effect of optimizing the advection module
1. Optimizing the positive-definite advection module
Test 1: WRF-only case; Test 2: WRF-Chem case

Case       Advected variables                                                         Maximum performance increase
WRF        Moisture                                                                   13%
WRF-Chem   Moisture, tracers, species, scalars, chemical concentrations, particles    21% *

This hotspot has the potential to be optimized and provides a significant improvement in performance.
* The performance increase will be even significantly higher for dust- and particle-only WRF-Chem cases.
Identified Hotspots
1. Positive-definite advection loop
   High time; high cache misses (both L1 and L2); high branch misprediction
2. x, y, z flux5 advection equation loops
   High time; high cache misses; repeated throughout the code for the different advection schemes
The effect of optimizing the advection equations
2. Flux5 advection equations
• High time and high L1 and L2 data cache misses
• This loop is repeated throughout the code for the x, y, and z directions
• A very similar loop is repeated for all the advection schemes
Test 1: WRF 4-km benchmark with TAU instrumentation
• 58% of the time spent in advection is in these flux-equation loops
• Many L1 data cache misses per iteration
• Many L2 data cache misses per iteration
This hotspot has the potential to be optimized and provides a significant improvement in performance (flux sketch below).
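A self-contained sketch of one such interface-flux loop, using the 5th-order upwind flux of Wicker and Skamarock (2002) that ARW employs; this is an illustration of the stencil shape, not code copied from module_advect_em. The six-cell stencil per interface is what drives the L1/L2 cache traffic seen in the profiles:

  program flux5_demo
    implicit none
    integer, parameter :: n = 16
    real :: q(-2:n+3), u(1:n+1), fqx(1:n+1)
    integer :: i
    do i = 1, n
       q(i) = real(i)                     ! synthetic linear field
    end do
    q(-2:0) = q(1); q(n+1:n+3) = q(n)     ! crude boundary fill for the demo
    u = 1.0                               ! uniform velocity at interfaces
    do i = 1, n + 1                       ! flux at the interface between cells i-1 and i
       fqx(i) = u(i) * (                                                        &
            (37.0*(q(i)+q(i-1)) - 8.0*(q(i+1)+q(i-2)) + (q(i+2)+q(i-3)))/60.0   &
            - sign(1.0, u(i)) * ((q(i+2)-q(i-3)) - 5.0*(q(i+1)-q(i-2))          &
                                 + 10.0*(q(i)-q(i-1)))/60.0 )
    end do
    print *, 'interior flux (expect i - 0.5 for a linear field):', fqx(8)
  end program flux5_demo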
Conclusion
WRF shows good MPI scalability, depending on the workload
Thread binding should be used to improve the performance of WRF hybrid runs
Intel VTune Amplifier and TAU were used for performance analysis of the WRF code
Dynamics is identified as the most expensive part of ARW
We identified the hotspots of the advection module and estimated the performance increase from modifying these parts of the WRF code
Ongoing and Future Work
Performance improvement of the advection module:
• Analysis of hardware counters to fix branch mispredictions and cache misses
• Advection module vectorization for Intel Xeon Phi coprocessors
• Reducing the memory footprint by decreasing the number of temporary variables
• Exploring performance optimization with different compiler flags
Acknowledgements
Davide Del Vento, Rich Loft, Srinath Vadlamani, Dave Gill, Greg Carmichael, and all SIParCS admins and staff
Microphysics Schemes
Provides atmospheric heat and moisture tendencies
Includes water vapor, cloud, and precipitation processes
Microphysical rates
Surface rainfall
Mielikainen et al. 2014
WRF Model Integration Procedure

Begin time step
  Runge-Kutta loop (steps 1, 2, and 3)
    (i) advection, p-grad, buoyancy using φ^t, φ*, φ**
    (ii) physics: if step 1, save for steps 2 and 3
    (iii) mixing, other non-RK dynamics, save...
    (iv) assemble dynamics tendencies
    Acoustic step loop
      (i) advance U, V, then W
      (ii) time-average U, V, W
    End acoustic loop
    Advance scalars using time-averaged U, V, W
  End Runge-Kutta loop
  Other physics (currently microphysics)
End time step