Performance Analysis and Optimization of the Weather Research and Forecasting Model (WRF) on Intel® Multicore and Manycore Architectures
Samuel Elliott (Advisor: Davide Del Vento)
National Center for Atmospheric Research / University of Colorado, Boulder

Introduction

The Weather Research and Forecasting Model (WRF) is a widely used mesoscale numerical weather prediction system serving both research and operational forecasting needs. General performance portability is essential for any such community code to take advantage of ever-changing HPC architectures as we move into the exascale era of computing. Efficient shared-memory models are critical because core counts on shared-memory systems will continue to increase. Intel's many integrated core (MIC) architecture is highly representative of such systems, so performance optimization of WRF on MIC architectures contributes directly to the overall performance portability of the model. In addition to performance portability for future systems, it is necessary to understand how the model behaves on current systems in order to make the best use of existing HPC resources.

Goals

1. Generalize optimizations and best practices to educate current WRF users on how to best utilize HPC resources.
2. Identify performance bottlenecks and issues with WRF hybrid parallelization on both Xeon CPUs and Xeon Phi coprocessors.

Benchmarking

All benchmarks in this study were run on the Texas Advanced Computing Center's Stampede supercomputer. Each node pairs two Intel Xeon E5-2680 CPUs with Intel Xeon Phi SE10P coprocessors.

                     Xeon E5-2680 CPU     Xeon Phi SE10P Coprocessor
  Cores              8                    61
  Hyperthreading     2-way                4-way
  Vector registers   256-bit              512-bit
  Clock speed        2.7 GHz              1.1 GHz
  L1 cache           32 KB                32 KB
  L2 cache           256 KB               512 KB
  L3 cache           20 MB (shared)       none
  Main memory        32 GB                8 GB

Hybrid WRF Scaling and Task/Thread Binding

The majority of WRF users initially write off hybrid parallelization because of poor first results, which are typically due to the lack of task/thread binding. Threads spawned across sockets do not share L3 cache, so issues such as false sharing become more prevalent. The performance results below show how critical thread binding within MPI ranks is (first plot). By using a hybrid implementation we eliminate a significant portion of the MPI communication while still utilizing the same number of cores; the second plot shows how this allows WRF to scale more efficiently to larger numbers of nodes. A sketch of typical runtime binding settings follows the plots.

[Diagram: thread placement for MPI ranks 0 and 1 across Sockets 0 and 1, unbound vs. bound to a socket.]
[Plot: WRF Performance, Binding Threads to Socket: performance (timesteps/s) vs. number of nodes for 2, 4, and 8 PPN, bound vs. unbound.]
[Plot: WRF Performance, Hybrid vs. MPI Scaling: performance (timesteps/s) vs. number of nodes for pure MPI and 2/4/8 PPN hybrid runs, annotated with compute-bound vs. MPI-bound regimes and the high-efficiency/slow-time-to-solution vs. low-efficiency/quick-time-to-solution tradeoff.]
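The following job-script fragment is a minimal sketch of how such binding might be requested, assuming Intel MPI, the Intel OpenMP runtime, and TACC's ibrun launcher; the variable names are standard Intel/OpenMP controls, but the values (and the binding mechanism itself) are illustrative and should be checked against your system's and MPI implementation's documentation.

    # Illustrative only: 2 MPI ranks per node, 8 OpenMP threads per rank,
    # with each rank's threads confined to a single socket.
    export OMP_NUM_THREADS=8        # OpenMP threads per MPI rank
    export OMP_STACKSIZE=512M       # per-thread stack size; choose based on node memory limits
    export KMP_AFFINITY=compact     # Intel OpenMP: place a rank's threads on adjacent hardware contexts
    export I_MPI_PIN_DOMAIN=socket  # Intel MPI: pin each rank (and its threads) within one socket

    ibrun ./wrf.exe                 # TACC's MPI launcher on Stampede

With other MPI implementations (for example MVAPICH2) the pinning variables differ, and launchers such as mpirun generally provide their own per-rank binding options.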
WRF OpenMP Tile Decomposition

WRF parallelization is done by breaking the grid into two-dimensional horizontal patches, which are distributed to the MPI tasks. Each OpenMP thread then solves and updates a subset of its task's patch, which we call an OpenMP tile. We found that WRF's patching and tiling strategies often cause load imbalance, which we categorized as follows:

1. Number of patch rows > number of OpenMP threads. Each thread executes a single tile, but since the rows in a patch may not divide evenly among the tiles, some tiles consist of more rows than others (up to a 2x imbalance between threads). For example, a 7-row patch decomposed for 6 threads yields five 1-row tiles and one 2-row tile.

2. Number of patch rows < number of OpenMP threads. WRF decomposes the tiles two-dimensionally, each tile being a contiguous subset of a single row. The number of tiles is then greater than the number of threads, so some threads execute two tiles while others execute only one (always a 2x imbalance between threads).

3. Number of OpenMP threads is a multiple of the number of patch rows. This case should be optimal for thread balancing, but we found that WRF's default strategy over-decomposes the problem, causing unnecessary imbalance. We have since changed the tiling algorithm to fix this issue; the fix will be included in the next WRF release.

These issues are much more prevalent on Xeon Phi, where far more threads are available (up to 244) than on the Xeon CPUs (up to 8 on Stampede). The first plot below shows a case whose performance is hindered by the first and second issues above and demonstrates the correlation between execution time and the maximum workload per thread. The second plot shows the performance benefit in the third case after changing the default tiling algorithm. A namelist sketch for controlling the tile count directly follows the plots.

[Plot: Thread Imbalance/Performance Relation: compute time per timestep and maximum gridpoints per thread vs. ranks per Xeon Phi.]
[Plot: WRF Default Tiling Strategy Optimization: performance (timesteps/s) vs. ranks per Xeon Phi for the new default vs. the previous default tiling.]
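Independent of the algorithm change above, users can override the default tile count from the namelist. The fragment below is a minimal sketch, assuming a WRF version whose &domains record includes the numtiles entry (tile_sz_x and tile_sz_y allow explicit tile dimensions instead) and a run with 61 OpenMP threads per Xeon Phi rank; verify these entries against the Registry or documentation of your WRF version.

    &domains
     numtiles = 61
    /

Setting numtiles to the number of OpenMP threads per rank (or a multiple of it) gives each thread a whole number of tiles and avoids the over-decomposition described in case 3 above.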
I/O Considerations

WRF I/O, counting time spent in pure I/O as well as the MPI communication and data formatting that result directly from I/O reads and writes, can quickly dominate a simulation's runtime. This is especially significant for larger problem sizes and larger node counts. The plot below shows history write times using serial I/O, PnetCDF parallel I/O, and the namelist option io_form_history = 102, which writes a separate output file for each MPI rank. Using PnetCDF significantly reduces I/O time, and although the third option requires post-processing (very cheap relative to the I/O-related costs incurred during the simulation), writing to separate output files made history writes negligible during the simulation.

[Plot: Write Time Per History Interval (750x750x60 grid size) vs. number of nodes (4 to 64) for serial I/O, PnetCDF parallel I/O, and separate output files per rank.]
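Both non-serial options are selected through io_form_history in the &time_control namelist record. The fragment below is a minimal sketch rather than a complete namelist; io_form_history = 102 is the option quoted in this poster, while io_form_history = 11 is the standard WRF value for PnetCDF output (the code must be built with PnetCDF support), so confirm both against the documentation for your WRF build.

    &time_control
     io_form_history = 102
    /

With 102, the per-rank history files must be stitched back together in post-processing before standard tools can read them; replacing 102 with 11 selects PnetCDF parallel I/O, and the default value of 2 selects serial netCDF.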
Xeon Phi Scaling

Although initial strong scaling of WRF on Xeon Phi shows poor results (first plot below), performance approaches that of the host CPUs for simulations with extremely high workloads per core. To understand this better, we scaled up the grid size while running on a constant number of nodes. For either architecture we expect performance per grid point to be low initially, because of insufficient workload per core, and to level off at a maximum once the workload per core is sufficient. We find that this maximum is reached at much larger workloads per core on Xeon than on Xeon Phi, and beyond approximately 60,000 horizontal gridpoints per processor it actually becomes more efficient to use Xeon Phi than the host Xeon CPUs (second plot below).

[Plot: Xeon vs. Xeon Phi Strong Scaling Performance: performance (timesteps/s) vs. number of nodes; Xeon Phi approaches Xeon performance for large workloads per core.]
[Plot: Scaling Problem Size: performance per grid point (millions of gridpoints computed per second) vs. horizontal grid dimension for Xeon and Xeon Phi; Xeon Phi exceeds Xeon performance above roughly 60,000 horizontal gridpoints per CPU/coprocessor, balancing remains consistent for symmetric runs, and Xeon Phi eventually hits its memory limits.]

Symmetric Xeon + Xeon Phi Performance

For large workloads per core, low MPI overhead and constant efficiency make it possible to run well-balanced symmetric CPU plus coprocessor WRF simulations that are more efficient than running on either homogeneous architecture. The results below demonstrate a case where more than a 1.5x speedup is achieved by running symmetrically on the same number of nodes.

[Plot: Symmetric Xeon + Xeon Phi Performance: performance (timesteps/s) for Xeon CPUs only, Xeon Phi coprocessors only, and symmetric Xeon + Xeon Phi.]

Conclusions: WRF Best Practices

• Hybrid WRF parallelization gives consistent performance benefits, especially for MPI-bound workloads.
• Ensure task/thread binding is used, via runtime scripts or environment settings (the details are specific to each system and MPI implementation).
• Ensure that the environment variable OMP_STACKSIZE is set appropriately for the memory limitations of your system.
• Parallel I/O with PnetCDF significantly decreases time spent in I/O-related processes.
• Using separate output files for each MPI task (namelist option io_form_history = 102) makes writing history negligible. This option requires post-processing but is very worthwhile, since I/O overhead dominates runtimes at larger core counts.
• Understand the tradeoff between quick time to solution with low efficiency and high efficiency with slow time to solution.
• Only use Xeon Phi if you are strongly focused on efficient utilization of HPC allocations on systems such as Stampede.
• Using optimal WRF I/O options is critical for utilization of Xeon Phi.
• Efficient utilization of Xeon Phi is limited to a very small window of workloads.
• For specific cases, symmetric execution can be used for efficient utilization of HPC resources.

Future Work

• Further develop methods for better patching and tiling strategies.
• Extend these studies to other systems, including the Knights Landing architecture, which is more representative of future manycore HPC systems.
• Assess whether other shared-memory models, such as OpenMP task constructs, will further WRF's performance portability on current and future HPC architectures.
• Better understand performance issues associated with memory allocation that would otherwise dramatically decrease memory utilization in hybrid WRF simulations.

Acknowledgements

I would like to first thank my advisor Davide Del Vento (NCAR), as well as Dave Gill (NCAR), John Michalakes (NCAR), Indraneil Gokhale (Intel), and Negin Sobhani (NCAR), for their guidance and discussions. I would also like to thank NCAR, the SIParCS internship program, and XSEDE for providing the opportunity, community, and resources to work on this project.

References

Michalakes, J., J. Dudhia, D. Gill, J. Klemp, and W. Skamarock. Design of a next-generation weather research and forecast model. In Proceedings of the Eighth Workshop on the Use of Parallel Processors in Meteorology, European Centre for Medium-Range Weather Forecasts. World Scientific, Singapore, 1999.

Skamarock, W. C., J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, W. Wang, and J. G. Powers. A Description of the Advanced Research WRF Version 2. NCAR Technical Note NCAR/TN-468+STR. National Center for Atmospheric Research, Boulder, CO: Mesoscale and Microscale Meteorology Division, 2005.