Page 1:

WRF Performance Optimization Targeting Intel Multicore and Manycore Architectures

Samm Elliott
Mentor – Davide Del Vento

Page 2:

The WRF Model

The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs.

Used by over 30,000 scientists around the world.

Any optimizations that can be made will have a significant impact on the WRF community as a whole.

Page 3:

Stampede Supercomputer

6,400 Nodes (Dell PowerEdge C8220)

Two CPUs Per Node (Intel Xeon E5-2680 Processors)

32 GB Memory Per Node

1-2 Coprocessors Per Node (Intel Xeon Phi SE10P)

Page 4:

Intel Xeon vs. Xeon Phi Architecture

                    Xeon E5-2680 CPU    Xeon Phi SE10P Coprocessor
Cores               8                   61
Hyperthreading      2-Way               4-Way (2 hardware threads per core)
Vector Registers    256-Bit             512-Bit
Clock Speed         2.7 GHz             1.1 GHz
L1 Cache            32 KB               32 KB
L2 Cache            256 KB              512 KB
L3 Cache            20 MB Shared        None
Main Memory         32 GB               8 GB

Page 5:

Intro: WRF Gets Good Performance on Xeon Phi!

Memory Limitations

Page 6:

Standard MPI Implementation Strong Scaling

[Strong-scaling plot: the compute-bound region has high efficiency but slow time to solution; the MPI-bound region has low efficiency but fast time to solution.]
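As a reference for reading the scaling plots, strong-scaling efficiency can be computed directly from step timings. A minimal Python sketch, with placeholder timings (not measurements from this study):

    # Hypothetical WRF timings (seconds) at several node counts; strong scaling
    # means a fixed problem size while the node count increases.
    timings = {1: 3600.0, 2: 1850.0, 4: 960.0, 8: 540.0, 16: 380.0}  # placeholders

    base_nodes = min(timings)
    base_time = timings[base_nodes]

    for nodes, t in sorted(timings.items()):
        speedup = base_time / t
        efficiency = speedup / (nodes / base_nodes)
        print(f"{nodes:3d} nodes: time {t:7.1f} s, speedup {speedup:5.2f}, "
              f"efficiency {efficiency:5.2f}")
    # High efficiency at low node counts = compute bound (slow time to solution);
    # efficiency falls off at high node counts as the run becomes MPI bound.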

Page 7:

MPI Tile Decomposition

Page 8:

Hybrid Tile Decomposition

[Figure: the domain is decomposed into MPI tiles, each of which is subdivided into OpenMP tiles.]

WRF does NOT use loop-level OpenMP parallelization
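A conceptual sketch (in Python, for illustration only; WRF itself is Fortran) of what tile-level parallelism means: the MPI patch is cut into OpenMP tiles and each thread is handed whole tiles rather than loop iterations. Patch and tile sizes below are hypothetical.

    # Conceptual sketch of tile-level OpenMP parallelism: each thread works on
    # whole tiles, not individual loop iterations.
    def split_1d(n, parts):
        # Split n rows/columns into `parts` nearly equal contiguous chunks.
        base, rem = divmod(n, parts)
        return [base + (1 if p < rem else 0) for p in range(parts)]

    def make_tiles(rows, cols, tiles_y, tiles_x):
        # All (tile_height, tile_width) pairs for a tiles_y x tiles_x tiling.
        return [(ny, nx) for ny in split_1d(rows, tiles_y)
                         for nx in split_1d(cols, tiles_x)]

    def assign_tiles(tiles, num_threads):
        # Static assignment of whole tiles to threads, in order.
        work = {t: [] for t in range(num_threads)}
        for i, tile in enumerate(tiles):
            work[i % num_threads].append(tile)
        return work

    tiles = make_tiles(rows=64, cols=64, tiles_y=2, tiles_x=4)
    print(assign_tiles(tiles, num_threads=8))   # one 32 x 16 tile per thread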

Page 9:

MPI vs. Hybrid Parallelization

What do we expect?

[Figure: MPI-only vs. hybrid decomposition. The hybrid case has fewer halo layers to communicate (white arrows), and therefore less MPI overhead, so it is expected to be faster.]
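A rough way to see why fewer MPI ranks means less halo communication is to count the grid cells lying on internal subdomain boundaries. The domain size and rank layouts below are hypothetical, assuming a simple 2-D block decomposition:

    # Rough count of halo cells exchanged per step (per halo layer) for an
    # nx x ny domain block-decomposed over a px x py grid of MPI ranks.
    # Both sides of every internal boundary send a row/column of cells.
    def halo_cells(nx, ny, px, py):
        vertical_cuts = (px - 1) * ny    # cells along internal vertical edges
        horizontal_cuts = (py - 1) * nx  # cells along internal horizontal edges
        return 2 * (vertical_cuts + horizontal_cuts)

    # Same 16-core node: 16 ranks (MPI only) vs. 2 ranks x 8 threads (hybrid).
    print("MPI only:", halo_cells(400, 400, px=4, py=4))   # 4800 cells
    print("Hybrid:  ", halo_cells(400, 400, px=2, py=1))   # 800 cells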

Page 10:

MPI vs. Hybrid Strong Scaling

Hybrid parallelization of WRF is consistently better than strict MPI, and significantly better in the MPI-bound region.

Page 11:

Core Binding

Using processor domain binding allows all OpenMP threads within an MPI task to share L3 cache.

[Figure: two socket layouts. Without binding, ranks 0 and 1 have threads spread across sockets 0 and 1; with processor-domain binding, rank 0 is confined to socket 0 and rank 1 to socket 1.]

Page 12:

pNetCDF

Using separate output files (io_form_history=102) makes output write times negligible – very worthwhile!

[Plot: output write times for serial, pNetCDF, and separate output files.]

Page 13:

Host Node Optimization Summary

Hybrid parallelization of WRF consistently gives better performance than strict MPI, and is much faster than strict MPI in the MPI-bound region.

Using separate output files requires post-processing but effectively eliminates any time spent writing history.

Process/thread binding is critical for hybrid WRF.

Take memory limitations into consideration for hybrid WRF and set the environment variable OMP_STACKSIZE to avoid memory issues.

Page 14:

Xeon Phi I/O

[Plot: Xeon Phi output write times for serial, pNetCDF, and separate output files.]

Page 15:

How Does Varying the Number of MPI Tasks per Xeon Phi Affect Performance?

These results suggest that there are issues with WRF's OpenMP strategies.

More MPI Overhead

Less MPI Overhead

Why are we seeing better performance with more MPI overhead?

Total Number of Cores = 240
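The sweep keeps the product of MPI tasks and OpenMP threads per task fixed at the 240 cores used per Xeon Phi; the candidate combinations are simply the integer factorizations of 240. A quick enumeration (the exact combinations run in this study are not reproduced here):

    # Enumerate MPI-task / OpenMP-thread combinations with tasks * threads = 240.
    total = 240
    for tasks in range(1, total + 1):
        if total % tasks == 0:
            print(f"{tasks:3d} MPI tasks x {total // tasks:3d} OpenMP threads")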

Page 16:

Hybrid Tile Decomposition

[Figure: MPI tiles subdivided into OpenMP tiles (repeated from page 8).]

WRF does NOT use loop-level OpenMP parallelization

Page 17:

OpenMP Imbalancing

Two types of imbalancing:

1. Number of OpenMP tiles > number of OpenMP threads

2. OpenMP tiles are different sizes

Default tiling issue: occurs when the number of threads is equal to any multiple of the number of MPI tile rows.

Page 18:

OpenMP Imbalance Type 1:

Example – 10 x 10 grid run with 2 MPI tasks and 8 OpenMP threads per MPI task

[Figure: OpenMP tiling within MPI rank 0 and MPI rank 1.]

Threads #1 and #2 compute 2 OpenMP tiles each (twice as much work as the other threads, plus context-switch overhead).
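The arithmetic behind this example, reconstructed under the assumption (consistent with the figure) that the default tiling produces 10 OpenMP tiles per MPI patch for 8 threads:

    # Type 1 imbalance: more OpenMP tiles than threads.
    num_tiles, num_threads = 10, 8
    tiles_per_thread = [0] * num_threads
    for tile in range(num_tiles):
        tiles_per_thread[tile % num_threads] += 1
    print(tiles_per_thread)   # [2, 2, 1, 1, 1, 1, 1, 1]
    # The first two threads do twice the work of the rest (plus the overhead of
    # switching between tiles), so every other thread ends up waiting for them.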

Page 19:

OpenMP Imbalance Type 2:

Example – 4 MPI Tasks, 4 OpenMP Threads Per Task

[Figure: OpenMP tiling within MPI ranks 0-3.]

Thread #4 computes twice as many grid cells as all other threads!

Page 20:

OpenMP Good Balancing

Example: 8 by 8 grid run with 2 MPI tasks and 8 OpenMP threads per task

[Figure: OpenMP tiling within MPI rank 0 and MPI rank 1.]

The logical tiling would be a 2 by 4 OpenMP tiling, but WRF uses a 3 by 4 tiling, creating an unnecessary imbalance.

I was able to resolve this issue, and the fix will be included in the next version of WRF.
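A small sketch of the balance arithmetic for this example, assuming each rank's patch is 4 x 8 cells (the 8 x 8 grid split over 2 MPI tasks) and tiles are handed to the 8 threads in order; the exact WRF assignment may differ, but the 2 by 4 vs. 3 by 4 contrast is the point:

    # Cells per thread for the balanced 2 x 4 tiling vs. the default 3 x 4 tiling.
    def split_1d(n, parts):
        base, rem = divmod(n, parts)
        return [base + (1 if p < rem else 0) for p in range(parts)]

    def cells_per_thread(rows, cols, tiles_y, tiles_x, threads):
        tile_cells = [ny * nx for ny in split_1d(rows, tiles_y)
                              for nx in split_1d(cols, tiles_x)]
        per_thread = [0] * threads
        for i, c in enumerate(tile_cells):
            per_thread[i % threads] += c
        return per_thread

    print("2 x 4 tiling:", cells_per_thread(4, 8, 2, 4, threads=8))  # all equal
    print("3 x 4 tiling:", cells_per_thread(4, 8, 3, 4, threads=8))  # imbalanced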

Page 21:

WRF OpenMP Tile Strategy

Page 22:

What is an Optimal WRF Case For Xeon Phi?


Xeon Phi Approaches Xeon Performance for Large Workload/Core

Xeon Phi – Initial Scaling

Page 23:

Scaling Gridsize

Xeon Phi exceeds Xeon performance for > 30,000 horizontal gridpoints per CPU/coprocessor

Xeon Hits Performance Limit Far Before Xeon Phi

Consistent Balancing for Symmetric Runs

Xeon Phi Hits Memory Limits
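A quick check of whether a planned run lands in the Phi-favorable regime, using the roughly 30,000 horizontal gridpoints per CPU/coprocessor crossover read off this plot; the domain size and device count below are hypothetical:

    # Horizontal gridpoints per device for a planned run.
    we_points, sn_points = 425, 300   # west-east x south-north gridpoints
    devices = 4                       # CPUs + coprocessors sharing the domain
    per_device = we_points * sn_points / devices
    print(f"{per_device:,.0f} horizontal gridpoints per device")
    print("Phi-favorable regime" if per_device > 30_000 else "Xeon-favorable regime")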

Page 24:

Xeon Phi Optimization Summary: The Good

Xeon Phi can be more efficient than host CPUs in the extreme high-efficiency/slow-time-to-solution region.

For highly efficient workloads, due to low MPI overhead and constant efficiency, it is possible to have well-balanced symmetric CPU-coprocessor WRF runs that are significantly more efficient than running on either homogeneous architecture.

Page 25:

Xeon Phi Optimization Summary: The Bad

WRF strong-scaling performance is significantly worse than in host-node CPU runs.

WRF tiling strategies are not well optimized for manycore architectures.

WRF/Fortran's array allocation strategies result in much larger memory requirements and limit workloads that would otherwise have high efficiency on Xeon Phi.

Page 26:

Xeon Phi Optimization Summary: The Ugly

Although Xeon Phi could be used for highly efficient WRF simulations, finding the correct:

1. problem sizes,

2. task-thread ratios,

3. tile decompositions, and

4. workload per core,

5. while taking memory limitations into consideration,

makes Xeon Phi extremely impractical for the vast majority of WRF users.

Page 27:

Why is it Important to Continue Researching WRF on Xeon Phi?

Core counts will continue to increase in future HPC architectures; this will require better hybrid strategies for our models.

Xeon Phi is very representative of these architectures and is a tool for exposing various issues that otherwise may not be noticed until further down the road.

Page 28:

Future Work

Create better MPI+OpenMP tile strategies for low-workload-per-core simulations

Better understand performance issues with heap allocation and overall memory access patterns

Assess any other concerns that may hinder WRF performance on future multicore/manycore architectures

Page 29:

Acknowledgements

Davide Del Vento – Mentor

Dave Gill – WRF Help

Srinath Vadlamani – Xeon Phi/Profiling

Mimi Hughes – WRF

XSEDE/Stampede – Project Supercomputer Usage

SIParCS/Rich Loft – Internship Opportunity

Thank you all for giving me so much help throughout this process. I am extremely thankful for my experience here at NCAR!

Page 30:

Questions?

Page 31:

Memory Issues For Small Core Counts

Ensure OMP_STACKSIZE is set correctly (the default may be set much lower than necessary).

Memory use by WRF variables is multiplied by the number of threads per MPI task; this is not much of an issue for a small number of threads per task.
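A back-of-envelope sketch of how per-thread replication inflates node memory; every size below is a made-up placeholder, and only the structure (a term multiplied by the threads per task) comes from the point above:

    # Rough node memory when some working arrays are replicated per OpenMP thread.
    GiB = 1024**3
    ranks_per_node = 2
    threads_per_task = 30
    shared_per_rank = 1.5 * GiB      # domain arrays allocated once per rank
    private_per_thread = 0.15 * GiB  # thread-private scratch, incl. OMP_STACKSIZE

    node_total = ranks_per_node * (shared_per_rank
                                   + threads_per_task * private_per_thread)
    print(f"Estimated node memory: {node_total / GiB:.1f} GiB "
          f"(the Xeon Phi SE10P has only 8 GB)")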

Page 32:

Memory Issues For Small Core Counts

Memory use by WRF variables is multiplied by the number of threads per MPI task.

Temporary solution: force heap array allocation (-heap-arrays).

Extremely slow in compute-bound regions ("speeddown" proportional to the number of threads per task).

Potential culprits: cache coherency? Repeated temporary array allocation?

Page 33:

Forcing Heap Array Allocation