Page 1:

Application of Mixed-Mode Programming in a Real-World Scientific Code

Case Study:

PRACE Autumn School in HPC Programming Techniques, 25-28 November 2014, Athens, Greece

Nikos Tryfonidis, Aristotle University of Thessaloniki

Page 2:

What is this all about?

You are given a scientific code, parallelized with MPI.

Question: any possible performance benefits from a mixed-mode implementation?

We will go through the steps of preparing, implementing and evaluating the addition of threads to the code.

Page 3:

The Code

MG: a general-purpose computational fluid dynamics code (~20,000 lines), written in C and parallelized with MPI, using a communication library written by the author.

Developed by Mantis Numerics and provided by Prof. Sam Falle (director of the company and author of the code).

MG has been used professionally for research in astrophysics and for simulations of liquid CO2 in pipelines, non-ideal detonations, groundwater flow, etc.

Page 4:

Outline

1. Preparation: Code description, initial benchmarks.

2. Implementation: Introduction of threads into the code. Application of some interesting OpenMP concepts:

- Parallelizing linked-list traversals
- OpenMP Tasks
- Avoiding race conditions

3. Results - Conclusion

Page 5:

Preparation: Code Description

Pages 6-8:

Make It Hybrid: How do you start?

Step 1: Inspection of the code, discussion with the author.

Step 2: Run some initial benchmarks to get an idea of the program's (pure MPI) runtime and scaling.

Step 3: Use profiling to gain some insight into the code's hotspots/bottlenecks.

Page 9:

What does the code do?

Computational domain: Consists of cells (yellow boxes) and joins (arrows).

The code performs computational work by looping through all cells and joins.

(Diagram, 1D example: 1st cell, 2nd cell, …, last cell, linked by the 1st join, 2nd join, …, last join.)
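For orientation, a hypothetical C sketch of such a cell/join layout (all names and fields here are illustrative assumptions; the slides do not show MG's actual data structures):

/* Hypothetical sketch of a cell/join layout (illustrative only). */
#define NVAR 5                    /* e.g. density, 3 momenta, energy */

struct join;                      /* forward declaration */

struct cell {
    double u[NVAR];               /* solution variables in this cell */
    struct join *left, *right;    /* joins to neighbouring cells (1D) */
    struct cell *next;            /* next cell in traversal order */
};

struct join {
    struct cell *a, *b;           /* the two cells this join connects */
    double flux[NVAR];            /* flux computed across the join */
};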

Page 10:

…and in parallel?

Cells are distributed to all MPI Processes, using a 1D decomposition (each Process gets a contiguous group of cells and joins).

(Diagram: Proc. 1 and Proc. 2 exchanging halo cells across their shared boundary.)

Page 11:

Structure of (the relevant part of) the code

Computational hotspot of the code: the "step" function (~500 lines).

"step" determines the stable time step and then advances the solution over that time step.

It consists mainly of halo communication and multiple separate loops over cells and joins.

Pages 12-13:

Basic Structure of the "step" Function

1st-order step:
- Halo communication (calls to MPI)
- Loops through cells and joins (computational work)

2nd-order step:
- Halo communication (calls to MPI)
- Multiple loops through cells, halo cells and joins (heavier computational work)

Page 14:

Preparation: Initial Benchmarks and Profiling

Page 15:

Initial Benchmarks

Initial benchmarks were run, using a test case suggested by the code author.

A 3D computational domain was used. Various domain sizes were tested (100³, 200³ and 300³ cells), for 10 computational steps.

Representative performance results will be shown here.

Page 16:

Initial Benchmarks: Execution Time

Figure 1: Execution time (in seconds) versus number of MPI Processes (size: 300³)

Page 17:

Initial Benchmarks: Speedup

Figure 2: Speedup versus number of MPI Processes (all sizes).

Page 18:

Initial Profiling

Profiling of the code was done using CrayPAT.

Four profiling runs were performed, with different numbers of processors (2, 4, 128, 256) and a grid size of 200³ cells.

The most relevant result of the profiling runs, for the purpose of this presentation, is the percentage of time spent in MPI functions.

Page 19:

Initial Profiling: % MPI Time

Figure 3: Percentage of time spent in MPI communication, for 2, 4, 128 and 256 processors (200³ cells)

Page 20:

Initial Benchmarks and Profiling Results

The performance of the code is seriously affected as the number of processors increases.

Performance actually becomes worse after a certain point.

Profiling shows that MPI communication dominates the runtime for high processor counts.

Pages 21-22:

Mixed-Mode: Why It May Work Here

A smaller number of MPI Processes means:
- Fewer calls to MPI.
- Cheaper MPI collective communications. MG uses a lot of these (written in the communication library).
- Fewer halo cells (less data communicated, less memory required).

Note: A simple 1D decomposition of the domain requires more halo cells per MPI Process than 2D or 3D domain decompositions. Mixed-mode, requiring fewer halo cells, helps here.
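To make the halo argument concrete, a back-of-the-envelope example with illustrative numbers (not from the slides): on a 200³ grid, each boundary between two slabs of the 1D decomposition carries 2 × 200² = 80,000 halo cells. With 64 MPI Processes there are 63 such boundaries; running the same 64 PEs as 16 MPI Processes × 4 threads leaves only 15 boundaries, cutting the total halo volume (data communicated and memory used) by roughly a factor of four.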

Page 23:

Mixed-Mode: Why It May Not Work So Well Here

Adding OpenMP code may require additional synchronization for threads (barriers, critical regions, etc.), which is bad for performance!

With only one thread (the master) used for communication, we will not be using the system's maximum bandwidth potential.

Page 24:

Implementation: The Actual Work

Page 25:

So What’s The Plan?

Page 26:

And What’s Wrong With It?

All loops in the "step" function are linked-list traversals!

Linked-list example (pseudocode):

pointer = first cell
while (pointer != NULL) {
    /* do work on current pointer / cell */
    pointer = next cell
}

Pages 27-29:

And What's Wrong With Linked-List Traversals?

Linked-list traversals use a while loop. Iterations continue until the final element of the linked list is reached.

In other words: the next element that the loop will work on is not known until the end of the current iteration. There are no well-defined loop boundaries!

We can't use simple OpenMP "parallel for" directives to parallelize these loops!

Page 30:

Plan B?

Page 31:

Implementation: Manual Parallelization of Linked-List Traversals

Page 32:

And How Do We Parallelize This?

The straightforward way to parallelize a linked-list traversal: transform the while loop into a for loop.

The for loop can then be parallelized with an OpenMP "parallel for" directive!

Page 33:

Manual Parallelization of Linked-List Traversals

1. Count the number of cells (one loop needed).

2. Allocate an array of pointers of the appropriate size.

3. Point an array element at every cell (one loop needed).

4. Rewrite the original while loop as a for loop.

Pages 34-35:

Manual Parallelization of Linked-List Traversals: Pseudocode and Extra Code

BEFORE:

pointer = first cell
while (pointer != NULL) {
    /* do work on current pointer / cell */
    pointer = next cell
}

AFTER:

/* extra code: count the cells */
counter = 0
pointer = first cell
while (pointer != NULL) {
    counter += 1
    pointer = next cell
}

/* extra code: allocate pointer array (size of counter)
   and point an element at every cell */
allocate pointer array (size of counter)
pointer = first cell
for (i = 0; i < counter; i++) {
    pointer_array[i] = pointer
    pointer = next cell
}

/* the traversal itself is now a for loop */
for (i = 0; i < counter; i++) {
    pointer = pointer_array[i]
    /* do work */
}

Pages 36-37:

Manual Parallelization of Linked-List Traversals: Adding OpenMP

After verifying that the code still produces correct results, we are ready to introduce OpenMP to the for loops we wrote.

As in plain OpenMP, we must pay attention to:
- The data scope of the variables.
- Data dependencies that may lead to race conditions.

Page 38:

Manual Parallelization of Linked-List Traversals: Adding OpenMP

#pragma omp parallel shared(cptr_ptr, ...)  \
        private(t, cptr, ...)               \
        firstprivate(cptr_counter, ...)     \
        default(none)
{
    #pragma omp for schedule(type, chunk)
    for (t = 0; t < cptr_counter; t++) {
        cptr = cptr_ptr[t];
        /* Do Work */
        /* ( . . . ) */
    }
}

Page 39:

Manual Parallelization Performance Tests

After introducing OpenMP to the code and verifying correctness, performance tests were run in order to evaluate its performance as a plain OpenMP code.

Tests were run for different problem sizes, using different numbers of threads (1, 2, 4, 8).

Page 40:

Manual Parallelization Performance Results: Execution Time

Figure 4: Execution time versus number of threads, for second-order step loops (size: 200³ cells)

Page 41:

Manual Parallelization Performance Results: Speedup

Figure 5: Speedup versus number of threads, for second-order step loops (size: 200³ cells)

Pages 42-43:

Manual Parallelization Performance Results: Thoughts

Almost ideal speedup for up to 4 threads. With 8 threads, the two heaviest loops continue to show decent speedup.

Similar results for the smaller problem size (100³ cells), only with less speedup.

In mixed mode, the cells will be distributed among processes: it will be interesting to see if we still get speedup there.

Page 44:

Implementation: Parallelization of Linked-List Traversals Using OpenMP Tasks

Page 45:

Alternative Parallelization Method for Linked Lists: OpenMP Tasks

OpenMP Tasks: a feature introduced with OpenMP 3.0.

The Task construct basically wraps up a block of code and its corresponding data, and schedules it for execution by a thread.

OpenMP Tasks allow the parallelization of a wider variety of loops, making OpenMP more flexible.

Page 46:

What Can Tasks Do For Us Here?

The Task construct is the right tool for parallelizing a “while” loop with OpenMP.

Each iteration of the “while” loop can be a Task.

Using Tasks is an elegant method for our case, leading to cleaner code with minimal additions.

Page 47:

OpenMP Tasks for Linked-List Traversals: Pseudocode

BEFORE:

pointer = first cell
while (pointer != NULL) {
    /* do work on current pointer / cell */
    pointer = next cell
}

AFTER:

#pragma omp parallel
{
    #pragma omp single
    {
        pointer = first cell
        while (pointer != NULL) {
            #pragma omp task
            {
                /* do work on current pointer / cell */
            }
            pointer = next cell
        }
    }
}
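For comparison, a matching self-contained C sketch of the Task version (same illustrative cell struct as before; note that the list pointer is firstprivate in each task by default, so every task keeps its own copy of the current cell):

#include <stdio.h>
#include <stdlib.h>

struct cell {
    double value;            /* stand-in for the real cell data */
    struct cell *next;
};

int main(void) {
    /* build a small linked list of cells */
    struct cell *first = NULL;
    for (int i = 0; i < 1000; i++) {
        struct cell *c = malloc(sizeof *c);
        c->value = (double)i;
        c->next = first;
        first = c;
    }

    #pragma omp parallel
    {
        /* one thread traverses the list and creates the tasks */
        #pragma omp single
        {
            for (struct cell *p = first; p != NULL; p = p->next) {
                #pragma omp task  /* p is firstprivate here by default */
                {
                    p->value *= 2.0;   /* stand-in for the real work */
                }
            }
        }   /* implicit barrier: all tasks complete before we leave */
    }

    printf("first cell after work: %f\n", first->value);
    while (first != NULL) {
        struct cell *nx = first->next;
        free(first);
        first = nx;
    }
    return 0;
}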

Page 48:

OpenMP Tasks for Linked-List Traversals

Using OpenMP Tasks, we were able to parallelize the linked list traversal by just adding OpenMP directives!

Fewer additions to the code, elegant method.

Usual OpenMP work still applies: data scope and dependencies need to be resolved.

Page 49:

How Did Tasks Perform, Though? Execution Time (…)

Figure 6: Execution time versus number of threads, for second-order step loops, using Tasks (size: 200³ cells)

Page 50:

How Did Tasks Perform, Though? Speedup (…)

Figure 7: Speedup versus number of threads, for second-order step loops, using Tasks (size: 200³ cells)

Page 51:

Why So Bad?

Figure 8: OpenMP Task creation and dispatch overhead versus number of threads¹.

1. J. M. Bull, F. Reid, N. McDonnell, "A Microbenchmark Suite for OpenMP Tasks," in Proceedings of the 8th International Workshop on OpenMP (IWOMP 2012), Rome, Italy, June 11-13, 2012.

Page 52:

Why So Bad?

For the current code, performance tests show that creating and dispatching the Tasks takes roughly as much time as completing them, even with one thread.

With more threads it gets much worse (remember the logarithmic axis in the previous graph).

Page 53:

OpenMP Tasks: Conclusion

The problem: a very large number of Tasks, each not heavy enough to justify the huge overheads.

Despite being elegant and clear, OpenMP Tasks are clearly not the way to go.

Could try different strategies (e.g. grouping Tasks together), but that would cancel the benefits of Tasks (elegance and clarity).

Page 54:

And The Winner Is…

Manual parallelization of linked-list traversals will be used for our mixed-mode MPI+OpenMP implementation of this particular code.

It may be ugly and inelegant, but it can get things done.

In defense of Tasks: If the code had been written with the intent of using OpenMP Tasks, things could have been different.

Page 55:

Implementation: Avoiding Race Conditions Without Losing the Race

Page 56:

Avoiding Race Conditions

The additional synchronization required by OpenMP can prove very harmful to the performance of the mixed-mode code.

While race conditions need to be avoided at all costs, this must be done in the least expensive way possible.

Pages 57-58:

Race Condition Example: Find Maximum

At a certain point, the code needs to find the maximum value of an array.

While trivial in serial, with OpenMP this is a race condition waiting to happen.

Part of the loop to be parallelized with OpenMP:

for (i = 0; i < n; i++) {
    if (a[i] > max) {
        max = a[i];
    }
}

What happens if (when) two or more threads try to write to "max" at the same time?

Two ways to tackle this:
1. Critical Regions
2. Manually (temporary shared arrays)

Page 59:

Race Condition Example: Using Critical Regions

With a Critical Region we can easily avoid the race condition.

However, Critical Regions are very bad for performance.

Question: include the whole loop in the critical region, or just the update?

for (i = 0; i < n; i++) {
    #pragma omp critical
    if (a[i] > max) {
        max = a[i];
    }
}

Now only one thread at a time can be inside the critical block.

Pages 60-62:

Avoiding Critical Regions With Temporary Shared Arrays

Data (shared array, 4 threads): 1 5 1 2 2 4 3 1 9 2 7 6 3 4 8 5

Each thread finds the maximum of its own chunk of the data and writes it to its corresponding element of a temporary shared array:

Temporary shared array: 5 9 7 8 (Thread 0, Thread 1, Thread 2, Thread 3)

Finally, a single thread picks out the total maximum: 9.
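A minimal C sketch of the temporary-shared-array technique (illustrative; a production version would pad the array entries to avoid false sharing between threads):

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 256   /* assumes at most this many threads */

/* find the maximum of a[0..n-1] without a critical region */
double find_max(const double *a, int n) {
    double temp[MAX_THREADS];     /* temporary shared array */
    int nthreads = 1;

    #pragma omp parallel shared(a, n, temp, nthreads)
    {
        int tid = omp_get_thread_num();
        double local_max = a[0];

        #pragma omp single
        nthreads = omp_get_num_threads();

        /* each thread finds the maximum of its own chunk... */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            if (a[i] > local_max)
                local_max = a[i];

        /* ...and writes it to its own element of the shared array */
        temp[tid] = local_max;
    }

    /* a single thread picks out the total maximum */
    double max = temp[0];
    for (int t = 1; t < nthreads; t++)
        if (temp[t] > max)
            max = temp[t];
    return max;
}

int main(void) {
    double a[16] = {1, 5, 1, 2, 2, 4, 3, 1, 9, 2, 7, 6, 3, 4, 8, 5};
    printf("max = %f\n", find_max(a, 16));   /* prints 9.000000 */
    return 0;
}

Modern OpenMP (3.1 and later) can also do this directly with a reduction(max: ...) clause on the loop, but the temporary array shows what happens underneath.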

Page 63:

Critical Region vs Temporary Shared Array

Benchmarks were carried out, measuring execution time for the “find maximum” loop only.

Three cases were tested:
1. Critical Region, with the single "find max" instruction inside.
2. Critical Region, with the whole "find max" loop inside.
3. Temporary shared arrays.

Page 64:

Critical Region vs Temporary Shared Array: Time

Figure 9: Execution time versus number of threads (size: 200³ cells).

Page 65:

Critical Region vs Temporary Shared Array: Speedup

Figure 10: Speedup versus number of threads (size: 200³ cells).

Page 66:

Critical Region vs Temporary Shared Array: Results

The temporary array method is clearly the winner.

However: additional code is needed for this method, and smaller problem sizes give smaller performance gains with more threads (nothing we can do about that, though).

Page 67:

Results: Mixed-Mode Performance

Page 68:

Mixed-Mode Performance Tests

The code was tested in mixed-mode with 2, 4 and 8 threads per MPI Process.

Same variation in problem size as before (100³, 200³, 300³ cells).

Representative results will be shown here.
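The slides do not say how these runs were launched; on a Cray-style system (CrayPAT was used for profiling), a hybrid job is typically started by setting OMP_NUM_THREADS to the threads-per-process count and requesting correspondingly fewer MPI ranks from the launcher, e.g. OMP_NUM_THREADS=4 with aprun -n 16 -d 4 for 64 PEs. The exact flags here are an assumption, not from the slides.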

Page 69:

Mixed-Mode: Execution Time (size: 200³ cells)

Figure 11: Time versus number of threads, 2 threads per MPI Proc.

Page 70:

Mixed-Mode: Execution Time (size: 200³ cells)

Figure 12: Time versus number of threads, 4 threads per MPI Proc.

Page 71:

Mixed-Mode: Execution Time (size: 200³ cells)

Figure 13: Time versus number of threads, 8 threads per MPI Proc.

Page 72:

Mixed-Mode: Speedup (size: 100³ cells)

Figure 14: Speedup versus number of threads, all combinations

Page 73:

Mixed-Mode: Speedup (size: 200³ cells)

Figure 15: Speedup versus number of threads, all combinations

Page 74:

Mixed-Mode: Speedup (size: 300³ cells)

Figure 16: Speedup versus number of threads, all combinations

Pages 75-77:

Mixed-Mode Performance: Results

Mixed-mode outperforms the original MPI-only implementation for the higher processor counts tested.

MPI-only performs better than (or almost the same as) mixed-mode for the lower processor counts tested.

Mixed-mode with 4 threads per MPI Process is the best choice for the problem sizes tested.

Page 78:

Mixed-Mode versus pure MPI: Memory Usage

Figure 17: Memory usage versus number of PEs, 8 threads per MPI Process (200³ cells)

Page 79:

Conclusion: Was Mixed-Mode Any Good Here?

Page 80:

Conclusion: Mixed-Mode Versus Pure MPI

For the problem sizes and processor counts tested: mixed-mode performed better than, or on par with, pure MPI.

At higher processor counts, mixed-mode manages to achieve speedup where pure MPI slows down.

Mixed-mode required significantly less memory.

Pages 81-82:

So, To Answer Our Very First Question…

Are there any possible performance benefits from a mixed-mode implementation for this code?

Answer: Yes. For larger numbers of processors (> 256), a mixed-mode implementation of this code:
- Provides speedup instead of slowdown.
- Uses less memory.

Page 83:

Thank You!