Efficient Parallelization of MATLAB Stencil Applications for Multi-Core Clusters
Source: hpc.pnl.gov/conf/wolfhpc/2016/talks/spazier.pdf

Transcript

Page 1

Efficient Parallelization of MATLAB Stencil Applications for Multi-Core Clusters

Johannes Spazier, Steffen Christgau, Bettina Schnor

University of Potsdam, Germany

WOLFHPC 2016, Salt Lake City, USA

November 13, 2016

Page 2

Outline

1 Introduction

2 Message Passing Interface

3 Hybrid Programming

4 Conclusion

Page 3

Outline

1 Introduction

2 Message Passing Interface

3 Hybrid Programming

4 Conclusion

Page 4

Introduction

MATLAB

proven high-level language for scientific computing

significantly reduced implementation effort

enables fast prototyping of mathematical models

well-suited for stencil applications

drawback: slow execution due to the interpreter

no out-of-the-box parallelization

⇒ insufficient performance for large data sets

Page 5

Introduction

StencilPaC Overview

MATLAB to parallel C compiler

[Toolchain diagram: MATLAB source → StencilPaC compiler (steered by command-line options) → generated C code → C compiler (gcc) with MPI headers and libraries → final executable]

Page 6

Introduction

StencilPaC Overview

automatic parallelization for matrix operations

B(X, Y) = M1(X1, Y1) ◦ ... ◦ Mn(Xn, Yn)

support for different architectures: shared and distributed memory systems, accelerators

built on common programming APIs: OpenMP, MPI, and OpenACC

[Figure: result matrix B over index ranges X and Y computed element-wise as M1 over (X1, Y1) ⊗ ... ⊗ Mn over (Xn, Yn)]

Page 7

Introduction

Applications

two grid-based stencil applications

domain update over multiple iterations

manual reference implementations in C/C++

EasyWave

tsunami simulation developed at the German Research Center for Geosciences

access pattern: 5-point stencil

Cellular Automaton

idealized model for biological systems

access pattern: 9-point stencil (Moore neighborhood)

Page 8

Introduction

StencilPaC Overview

generated C code is much faster than MATLAB for both applications

improvements of more than

7 times with sequential code

21 times on an 8-core shared memory system

187 times with an NVIDIA Tesla K40m

for the memory-bound tsunami simulation EasyWave

even better results for the Cellular Automaton

Page 9

Introduction

StencilPaC Overview

distributed systems are the most challenging:

automatic partitioning of matrices

generic handling of communication between processes

partial computation

a low runtime overhead is essential

previous work focused on the MPI one-sided API

today: concepts of and comparison with

two-sided communication

hybrid programming

Page 10

Outline

1 Introduction

2 Message Passing Interface

3 Hybrid Programming

4 Conclusion

Page 11

Message Passing Interface

Principles

degree of parallelization is given by the number of processes

distribute matrices evenly among the processes

one-dimensional domain decomposition (block of columns; see the partitioning sketch below)

compute local parts in parallel

set up communication at runtime

provide appropriate ghost zones

[Figure: matrix with index ranges X and Y split column-wise among processes 0, 1, and 2; each local block is extended by a left and a right ghost zone (Lghost, Rghost)]
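
Such an even block-column partition can be derived from the rank and the process count alone. The following is a minimal sketch, not StencilPaC's runtime code; the helper name local_cols and the way the remainder columns are spread over the first ranks are assumptions.

/* Sketch: even 1-D block-column decomposition of ny columns over nprocs
 * ranks; the first (ny % nprocs) ranks receive one extra column. */
#include <mpi.h>
#include <stdio.h>

static void local_cols(int ny, int rank, int nprocs, int *first, int *count)
{
    int base = ny / nprocs, rest = ny % nprocs;
    *count = base + (rank < rest ? 1 : 0);
    *first = rank * base + (rank < rest ? rank : rest);
}

int main(int argc, char **argv)
{
    int rank, nprocs, first, count, ny = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    local_cols(ny, rank, nprocs, &first, &count);
    /* the local block would be allocated with one extra ghost column on
     * each side to hold the neighbours' boundary columns */
    printf("rank %d owns columns %d..%d\n", rank, first, first + count - 1);
    MPI_Finalize();
    return 0;
}

With the bounds known, each rank allocates its block plus the ghost zones and fills the ghosts via the communication discussed next.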

Page 12

Message Passing Interface

Distributed Computation

choose base matrix B

B(X, Y) = M1(X1, Y1) ◦ ... ◦ Mn(Xn, Yn)

compute local part of B

for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) )) {
    for (k = 0; k < length(Y); k++) {
      B( X(j), Y(k) ) = M1( X1(j), Y1(k) ) ◦ ... ◦ Mn( Xn(j), Yn(k) );
    }
  }
}

Page 13

Message Passing Interface

1. One-sided Communication

direct access to remote memory with MPI_Get

ghost zones can be fetched without involving other processes

target ranks are calculated from the equally sized partitioning

coarse-grained synchronization with MPI_Win_fence

⇒ simple API for generic data exchange

⇒ less administration at runtime

⇒ expensive synchronization

Page 14

One-sided Communication

Generic Data Exchange

B(X, Y) = M1(X1, Y1) ◦ ... ◦ Mn(Xn, Yn)

MPI_Win_fence( 0, Mi.win );

for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) )) {
    if (!is_local( M1, X1(j) ))
      MPI_Get( M1.win, X1(j), ... );
    ...
    if (!is_local( Mn, Xn(j) ))
      MPI_Get( Mn.win, Xn(j), ... );
  }
}

MPI_Win_fence( 0, Mi.win );
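
For orientation, a self-contained sketch of this pattern reduced to one ghost value per side: each rank exposes its local array as an MPI window and pulls the neighbouring boundary values into its ghost cells between two fences. The 1-D array u, the fixed left/right neighbours, and the displacements are assumptions of the sketch; the generated code instead derives target ranks and displacements from the equally sized partitioning at runtime.

/* Sketch: ghost-cell exchange with one-sided MPI. Interior values live in
 * u[1..N]; u[0] and u[N+1] are ghost cells filled from the neighbours. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    static double u[N + 2];
    MPI_Win win;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* expose the whole local array, including the ghost cells */
    MPI_Win_create(u, sizeof(u), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open access epoch */
    if (rank > 0)                          /* left ghost <- last interior value of rank-1 */
        MPI_Get(&u[0], 1, MPI_DOUBLE, rank - 1, N, 1, MPI_DOUBLE, win);
    if (rank < nprocs - 1)                 /* right ghost <- first interior value of rank+1 */
        MPI_Get(&u[N + 1], 1, MPI_DOUBLE, rank + 1, 1, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* close epoch: ghosts are now valid */

    /* ... stencil update on u[1..N] would follow here ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}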

Page 15

Message Passing Interface

2. Two-sided Communication

exchange ghost zones via messages

both sender and receiver are involved

send operations must also be provided

use non-blocking operations to avoid deadlocks (MPI_Isend and MPI_Irecv)

synchronize with MPI_Waitall

⇒ pair-wise synchronization

⇒ additional administration required

Page 16

Two-sided Communication

Generic Data Exchange

B(X, Y) = M1(X1, Y1) ◦ ... ◦ Mn(Xn, Yn)

for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) ))
    if (!is_local( M1, X1(j) ))
      MPI_Irecv( M1.vdata, X1(j), ... );
  if (is_local( M1, X1(j) ))
    if (!is_local( B, X(j) ))
      MPI_Isend( M1.vdata, X1(j), ... );
  /* Repeat for other matrices. */
}

MPI_Waitall( Mi.reqnr, Mi.requests, ... );
Mi.reqnr = 0;
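
A matching self-contained sketch of the two-sided variant, under the same assumptions as the one-sided example: neighbouring ranks post MPI_Irecv/MPI_Isend pairs for the boundary values, and a single MPI_Waitall completes the pending requests, so only ranks that actually communicate synchronize with each other.

/* Sketch: ghost-cell exchange with non-blocking two-sided MPI.
 * Interior values live in u[1..N]; u[0] and u[N+1] are ghost cells. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    static double u[N + 2];
    MPI_Request req[4];
    int rank, nprocs, nreq = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank > 0) {                        /* exchange with the left neighbour */
        MPI_Irecv(&u[0], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(&u[1], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
    }
    if (rank < nprocs - 1) {               /* exchange with the right neighbour */
        MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(&u[N], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);   /* pair-wise synchronization */

    /* ... stencil update on u[1..N] would follow here ... */

    MPI_Finalize();
    return 0;
}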

Page 17

Evaluation: One- vs. two-sided

EasyWave

[Plot: runtime in seconds (log scale) vs. number of processes (1 to 96) for EasyWave Manual MPI, EasyWave Auto MPI One-Sided, and EasyWave Auto MPI Two-Sided]

Platform

12 dual-socket nodes

4-core Intel Xeon CPUs

InfiniBand Network

Open MPI 1.8.2 and GCC 4.9.1

Results

generated codes can keep up with the hand-written one

similar scaling of all versions

two-sided is 13% faster than one-sided

Page 18

Evaluation: One- vs. two-sided

Cellular Automaton

[Plot data: runtime in seconds vs. number of processes]

Processes:                    1        2        4       8      16      32     64     96
CA Auto Char MPI One-Sided    362.79   183.01   93.13   45.81   23.08   11.78   6.31   4.27
CA Auto Char MPI Two-Sided    363.65   183.14   92.91   45.85   22.98   11.61   5.97   4.00

Results

overall adequate scaling

speedup of at least 85 on 96 cores

6% improvement with two-sided version

Page 19

Evaluation: One- vs. two-sided

Summary

similar findings for both applications

satisfactory scaling even for larger core counts

runtime of the hand-written codes is almost reached

two-sided MPI implementation performs better than one-sided

despite higher runtime overhead

benefiting from fine-grained synchronization

Page 20

Outline

1 Introduction

2 Message Passing Interface

3 Hybrid Programming

4 Conclusion

Page 21

Hybrid Programming

Two-sided MPI + OpenMP

each MPI process spawns multiple threads

work is divided statically among these threads

simple combination leads to serious load imbalances

process distribution and thread partitioning interfere

amount of computational work varies

#pragma omp parallel for private(k)
for (j = 0; j < length(X); j++)
  if (is_local( B, X(j) ))
    for (k = 0; k < length(Y); k++)
      B( X(j), Y(k) ) = ...

[Figure: two processes owning columns 1..n and n+1..2n; with static partitioning of the global index range, only some threads of each process receive local columns, so the actual work (about 1000 and 1101 column updates) is unevenly distributed]

Page 22

Hybrid Programming

1. Dynamic Scheduling

use dynamic scheduling of threads

chunks are assigned at runtime

better sharing of computational work expected

easy implementation (see the sketch below) with

#pragma omp parallel for schedule(dynamic)

additional runtime overhead

[Figure: same two-process decomposition; with dynamic scheduling the local columns of each process (about 1000 and 1101 column updates) are spread across all of its threads]
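
A minimal sketch of such a dynamically scheduled loop; the function name, the chunk size of 16, the column-major layout, and the dummy update are illustrative assumptions, not generated output.

/* Sketch: dynamic scheduling over the global column range. Threads that
 * draw columns owned by another rank only pay for the cheap ownership test
 * and immediately grab the next chunk, so the real work spreads evenly
 * over the threads of each process. */
void update_local(double *local, int nx, int ny, int first, int count)
{
    int j, k;
    #pragma omp parallel for schedule(dynamic, 16) private(k)
    for (j = 0; j < ny; j++) {                   /* global column index */
        if (j < first || j >= first + count)
            continue;                            /* column owned by another rank */
        for (k = 0; k < nx; k++)
            local[(j - first) * nx + k] += 1.0;  /* stand-in for the stencil update */
    }
}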

Page 23

Hybrid Programming

2. Intersection Approach

optimization for special matrix access

based on MATLAB’s range index

start:step:end = [start, start+step, ..., end]

local portion can be determined in advance (see the sketch below)

no locality check at runtime anymore

enables static thread scheduling again

[Figure: intersecting the global index range with the local matrix portion yields the local index range]
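
One way this intersection could be computed: a sketch that treats indices as plain integers and the local block as the half-open interval [first, first + count); the function name and the rounding details are assumptions, not taken from the talk.

/* Sketch: intersect a range start:step:end (step > 0) with the local block
 * [first, first + count). On success, *lstart is the first range element
 * inside the block and *lcount the number of such elements, so the compute
 * loop can run over exactly the local portion with static scheduling and
 * without any per-iteration ownership test. */
int intersect_range(int start, int step, int end,
                    int first, int count, int *lstart, int *lcount)
{
    int lo = first, hi = first + count - 1;      /* local block, inclusive bounds */
    int k0, kmax;

    *lcount = 0;
    if (step <= 0 || count <= 0 || end < lo || start > hi)
        return 0;
    /* smallest k with start + k*step >= lo: round the offset up to a step */
    k0 = (start >= lo) ? 0 : (lo - start + step - 1) / step;
    /* largest k with start + k*step <= min(end, hi) */
    kmax = ((end < hi ? end : hi) - start) / step;
    if (k0 > kmax)
        return 0;
    *lstart = start + k0 * step;
    *lcount = kmax - k0 + 1;
    return 1;
}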

Page 24

Evaluation: Hybrid Programming

Cellular Automaton

[Plot: runtime in seconds (log scale) vs. number of processes (1 to 96) for CA Auto Char MPI Two-Sided, CA Auto Char Hybrid Two-Sided Dynamic, and CA Auto Char Hybrid Two-Sided Intersect]

Results

pure MPI version performs best

similar results with hybrid intersection

overhead of 14% with dynamic scheduling

Page 25

Evaluation: Hybrid Programming

EasyWave

[Plot: runtime in seconds (log scale) vs. number of processes (1 to 96) for EasyWave Auto MPI Two-Sided, EasyWave Auto Hybrid Two-Sided Dynamic, and EasyWave Auto Hybrid Two-Sided Intersect]

Results

pure MPI better than hybrid intersection

dynamic approach outperforms other versions

improvement of 24%

better load balancing for memory-bound applications

Page 26

Outline

1 Introduction

2 Message Passing Interface

3 Hybrid Programming

4 Conclusion

Page 27

Conclusion

Successful extension of StencilPaC for multi-core clusters.

Message Passing Interface

well suited for automatic parallelization

slightly better results with two-sided communication

speedups of up to 91 on 96 cores

Hybrid Programming

benefit of hybrid versions depends on application demands

dynamic scheduling can reduce load imbalances: improvement of 24% on a memory-bound simulation

intersection approach does not show any benefit

Page 28

Conclusion

Future Work

examine a wider range of applications

consider additional platforms

use other MPI implementations (e.g. Open MPI 2.0)

compare different architectures and network types

deeper analysis of observed effects in hybrid programming

evaluate possible use of generated C code on FPGAs

Page 29

Questions?

Thanks for your attention.
