Transcript
Page 1: Title

Improving the Performance of the Scaled Matrix/Vector Multiplication with Vector Addition in Tau3P, an Electromagnetic Solver

Michael M. Wolf, University of Illinois, Urbana-Champaign; Ali Pinar and Esmond G. Ng, Lawrence Berkeley National Laboratory

September 8, 2004

Page 2: Outline

• Motivation
• Brief Description of Tau3P
• Tau3P Performance
• MPI 2-Sided Communication
• Basic Algorithm
• Implementation
• Results
• Conclusions
• Future Work

Page 3: Challenges in E&M Modeling of Accelerators

• Accurate modeling essential for modern accelerator design
  – Reduces design cost
  – Reduces design cycle
• Conformal meshes (unstructured grid)
• Large, complex electromagnetic structures
  – Hundreds of millions of DOFs
• Small beam size
  – Large number of mesh points
  – Long run time
• Parallel computing needed (time and storage)

Page 4: Next Linear Collider (NLC)

Cell-to-cell variation of order microns to suppress short-range wakes by detuning

Page 5: End-to-end NLC Structure Simulation

• NLC X-band structure showing damage in the structure cells after high-power test
• Theoretical understanding of the underlying processes is lacking, so realistic simulation is needed

Page 6: Parallel Time-Domain Field Solver – Tau3P

[Figure: coupler matching (incident, reflected, transmitted waves), wakefield calculations, rise-time effects]

Page 7: Parallel Time-Domain Field Solver – Tau3P

• Follows evolution of E and H fields inside accelerator cavity
• DSI method on non-orthogonal meshes
• Electric fields on primary grid, magnetic fields on embedded dual grid
• Leapfrog time advancement (FDTD for orthogonal grids)
• α, β are constants proportional to dt; AH, AE are matrices

The DSI formulation yields

$$\oint \vec{E}\cdot d\vec{s} = -\frac{\partial}{\partial t}\iint \vec{B}\cdot d\vec{A}, \qquad \oint \vec{H}\cdot d\vec{s} = \frac{\partial}{\partial t}\iint \vec{D}\cdot d\vec{A} + \iint \vec{j}\cdot d\vec{A}$$

which leads to the leapfrog update equations

$$\vec{e} = \vec{e} + \alpha\, A_H \cdot \vec{h}, \qquad \vec{h} = \vec{h} + \beta\, A_E \cdot \vec{e}$$

Page 8: Tau3P Implementation

[Figures: example of a distributed mesh; typical distributed matrix]

• Very sparse matrices
  – 4-20 nonzeros per row
• 2 coupled matrices (AH, AE)
• Nonsymmetric (rectangular)

Page 9: Parallel Performance of Tau3P (ParMETIS)

[Figure: parallel speedup plot; measured speedups of roughly 2.0, 2.8, 3.9, 4.4, 5.7, 6.8, and 9.2]

• 257K hexahedrons
• 11.4 million non-zeros

Page 10: Communication in Tau3P (ParMETIS Partitioning)

[Figure: communication vs. computation]

Page 11: Improving Performance of Tau3P

• Performance greatly improved by better mesh partitioning
  – Previous work by Wolf, Folwell, Devine, and Pinar
• Possible improvements in the scaled matrix/vector multiplication with vector addition algorithm
  – Different MPI communication methods
  – Different algorithm stage orderings
  – Thread algorithm stages

Page 12: MPI 2-Sided Communication

  Mode          Blocking      Nonblocking   Nonblocking, Persistent
  Standard      MPI_Send      MPI_Isend     MPI_Send_init
  Buffered      MPI_Bsend     MPI_Ibsend    MPI_Bsend_init
  Synchronous   MPI_Ssend     MPI_Issend    MPI_Ssend_init
  Ready         MPI_Rsend     MPI_Irsend    MPI_Rsend_init
  Receive       MPI_Recv      MPI_Irecv     MPI_Recv_init

  Blocking, Combined: MPI_Sendrecv

Page 13: Blocking vs. Nonblocking Communication

• Blocking
  – Resources can be safely used after return of call
  – MPI_Recv does not return until the message is received
  – Send behavior depends on mode
• Nonblocking
  – Resources cannot be safely used after return
  – MPI_Irecv returns immediately
  – Enables overlapping of communication with other operations
  – Additional overhead required
  – Used with MPI_Wait, MPI_Wait{all,any,some}, MPI_Test*
• Blocking sends can be used with nonblocking receives and vice versa (see the sketch below)
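As an illustration of the nonblocking pattern above (not Tau3P's actual code), the following C sketch posts an MPI_Irecv and MPI_Isend for one neighbor, overlaps unrelated local work, and only then waits before touching either buffer. The function name, neighbor rank, and buffer sizes are illustrative.

    #include <mpi.h>

    /* Exchange n doubles with one neighbor while overlapping local work. */
    void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                              int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Post the receive first so the incoming message has a landing spot. */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        /* ... local computation that touches neither buffer goes here ... */

        /* Buffers are only safe to reuse/read after the waits complete. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }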

Page 14: Buffered Communication Mode

• MPI_Bsend, MPI_Ibsend
• A user-defined buffer is explicitly attached using MPI_Buffer_attach (see the sketch below)
• Send posting/completion independent of receive posting

[Timeline diagram: send posted and completed on the sending process; data movement to the receiving process occurs whenever the receive is posted]
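A minimal C sketch of a buffered-mode send, assuming a single outstanding message of n doubles; the helper name and buffer-sizing policy are illustrative, not taken from Tau3P.

    #include <mpi.h>
    #include <stdlib.h>

    /* One buffered-mode send of n doubles; the attached buffer is sized as
     * the packed message size plus MPI_BSEND_OVERHEAD. */
    void buffered_send(double *sendbuf, int n, int neighbor, MPI_Comm comm)
    {
        int size;
        MPI_Pack_size(n, MPI_DOUBLE, comm, &size);
        size += MPI_BSEND_OVERHEAD;

        void *buf = malloc(size);
        MPI_Buffer_attach(buf, size);          /* attach user-defined buffer */

        /* Completes once the message is copied into the attached buffer,
         * independent of whether the receive has been posted. */
        MPI_Bsend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm);

        /* Detach blocks until all buffered messages have been delivered. */
        MPI_Buffer_detach(&buf, &size);
        free(buf);
    }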

Page 15: Synchronous Communication Mode

• MPI_Ssend, MPI_Issend
• Send can be posted independent of receive posting
• Send completion requires receive posting

[Timeline diagram: send posted on the sending process; send completion and data movement only after the receive is posted]
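A small C sketch of the synchronous mode, assuming a single neighbor; because completion of MPI_Issend implies the matching receive has been posted, the final wait doubles as a handshake with the receiver. Names are illustrative.

    #include <mpi.h>

    /* Synchronous-mode send: completion implies the matching receive exists. */
    void sync_send(double *sendbuf, int n, int neighbor, MPI_Comm comm)
    {
        MPI_Request req;

        MPI_Issend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &req);

        /* ... work that does not modify sendbuf ... */

        /* Returns only after the receiver has posted (and matched) its receive. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }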

Page 16: Ready Communication Mode

• MPI_Rsend, MPI_Irsend
• Send posting requires the receive to be already posted

[Timeline diagram: receive posted first on the receiving process; send posted, data moved, and send completed afterwards]
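A hedged C sketch of ready mode, assuming the simplest possible guarantee that receives are posted before any send: a barrier after all MPI_Irecv calls. This heavy-handed synchronization is only for illustration; it is not how Tau3P necessarily enforces the requirement.

    #include <mpi.h>

    /* Ready-mode exchange: MPI_Rsend is only legal once the matching receive
     * is already posted on the other side. */
    void ready_exchange(double *sendbuf, double *recvbuf, int n,
                        int neighbor, MPI_Comm comm)
    {
        MPI_Request recv_req;

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &recv_req);
        MPI_Barrier(comm);                 /* every rank has posted its receive */

        MPI_Rsend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm);
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    }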

Page 17: Standard Mode Send

• MPI_Send, MPI_Isend
• Behavior is implementation dependent
• Can act either as buffered (system buffer) or synchronous

[Timeline diagrams: buffered-like behavior (send completes once the data is copied to a system buffer) OR synchronous-like behavior (send completes only after the receive is posted)]

Page 18: Persistent Communication

• Used when a communication function with the same arguments is repeatedly called
• Binds the list of communication arguments to a persistent communication request
• Potentially can reduce communication overhead
• Nonblocking
• Argument list is bound using MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init, MPI_Rsend_init
• Request is initiated using MPI_Start (see the sketch below)
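A short C sketch of persistent communication for a repeated neighbor exchange, assuming one send/receive pair with a single neighbor; the loop structure, step count, and names are illustrative rather than Tau3P's implementation.

    #include <mpi.h>

    /* Bind the argument lists once, then restart the same pair every step. */
    void persistent_loop(double *sendbuf, double *recvbuf, int n,
                         int neighbor, int nsteps, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Recv_init(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Send_init(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        for (int step = 0; step < nsteps; ++step) {
            MPI_Startall(2, reqs);                 /* reuse bound arguments */
            /* ... local computation ... */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }

        MPI_Request_free(&reqs[0]);
        MPI_Request_free(&reqs[1]);
    }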

Page 19: Basic Algorithm

• Scaled matrix/vector multiplication with vector addition:

$$\vec{e} = \vec{e} + \alpha\, A_H \cdot \vec{h}$$

• Row partitioning of the matrix and vector so that all nonlocal operations are due to the matrix/vector multiplication
• 3 main stages of the algorithm (see the sketch after this list)
  – Multiplication and summation with local nonzeros
  – Communication of remote vector elements corresponding to nonlocal nonzeros
  – Remote multiplication and summation with nonlocal nonzeros
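A C sketch of the computational stages above, assuming the locally owned rows are split into a "local" CSR piece (columns owned by this process) and a "remote" CSR piece (columns owned elsewhere, whose vector values are gathered into a ghost array by the communication stage). The data structure and function names are illustrative, not Tau3P's.

    /* Split-CSR sketch of e = e + alpha * A_H * h. */
    typedef struct {
        int     nrows;   /* number of locally owned rows      */
        int    *ptr;     /* row pointers, length nrows + 1    */
        int    *col;     /* column indices                    */
        double *val;     /* nonzero values                    */
    } csr_t;

    void scaled_matvec_add(double alpha,
                           const csr_t *A_local,  const double *h_local,
                           const csr_t *A_remote, const double *h_ghost,
                           double *e)
    {
        /* Stage: multiplication and summation with local nonzeros */
        for (int i = 0; i < A_local->nrows; ++i)
            for (int k = A_local->ptr[i]; k < A_local->ptr[i + 1]; ++k)
                e[i] += alpha * A_local->val[k] * h_local[A_local->col[k]];

        /* (Communication stage fills h_ghost with remote vector elements.) */

        /* Stage: multiplication and summation with nonlocal nonzeros */
        for (int i = 0; i < A_remote->nrows; ++i)
            for (int k = A_remote->ptr[i]; k < A_remote->ptr[i + 1]; ++k)
                e[i] += alpha * A_remote->val[k] * h_ghost[A_remote->col[k]];
    }

Splitting the nonzeros this way is what allows the local stage to be reordered around, or overlapped with, the communication stage in the orderings studied below.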

Page 20: Implementation

• 42 different algorithms implemented
  – 3 blocking send/recv algorithms
  – 3 nonblocking send/recv algorithms
  – 36 blocking send/nonblocking recv algorithms
    • 6 different orderings
    • Standard, buffered, synchronous, persistent standard, persistent buffered, persistent synchronous

Page 21: Blocking Send/Blocking Receive Algorithms

  Ordering   Stage 1   Stage 2   Stage 3
  1          LC        Comm/RC
  2          LC        Comm      RC
  3          Comm      LC        RC

Sends: MPI_Send, MPI_Sendrecv
Recvs: MPI_Recv, MPI_Sendrecv

Page 22: Nonblocking Send/Nonblocking Receive Algorithms

  Ordering   Stage 1     Stage 2   Stage 3   Stage 4   Stage 5
  4          Recv        Send      LC        Wait      RC
  5          Send/Recv   LC        Wait      RC
  6          Recv        Send      Wait      LC        RC

Sends: MPI_Isend
Recvs: MPI_Irecv

Page 23: Blocking Send/Nonblocking Receive

  Ordering   Stage 1   Stage 2          Stage 3         Stage 4   Stage 5
  7          Recv      Send             Wait            LC        RC
  8          Recv      Send             LC              Wait      RC
  9          Recv      LC               Send/Waits      RC
  10         Recv      Sends/Waits      LC              RC
  11         Recv      LC               Send/Waits/RC
  12         Recv      Sends/Waits/RC   LC

Sends: MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init
Recvs: MPI_Irecv
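A hypothetical C sketch of an ordering like 11 as reconstructed in the table above (Recv, then LC, then combined Send/Waits/RC): post all nonblocking receives, do the local computation, then interleave blocking sends and waits with the remote computation as each neighbor's ghost data arrives. The buffer and neighbor arrays are illustrative placeholders, not Tau3P's data structures.

    #include <mpi.h>
    #include <stdlib.h>

    void ordering11_sketch(double **sendbufs, double **recvbufs,
                           const int *counts, const int *neighbors,
                           int nnbr, MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(nnbr * sizeof *reqs);

        /* Stage 1: Recv -- post receives for the remote vector elements */
        for (int p = 0; p < nnbr; ++p)
            MPI_Irecv(recvbufs[p], counts[p], MPI_DOUBLE, neighbors[p], 0,
                      comm, &reqs[p]);

        /* Stage 2: LC -- multiplication/summation with local nonzeros */
        /* ... local part of e = e + alpha * A_H * h ... */

        /* Stage 3: Send/Waits/RC -- combined communication and remote work */
        for (int p = 0; p < nnbr; ++p)
            MPI_Send(sendbufs[p], counts[p], MPI_DOUBLE, neighbors[p], 0, comm);

        for (int done = 0; done < nnbr; ++done) {
            int p;
            MPI_Waitany(nnbr, reqs, &p, MPI_STATUS_IGNORE);
            /* ... multiply/sum with nonzeros that reference neighbor p ... */
        }

        free(reqs);
    }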

Page 24: Problem Setup

• Attempted to keep work per processor invariant
• Built nine 3D rectangular meshes using Cubit
  – External dimensions of the meshes the same
  – Each mesh used for a particular number of processors (2, 4, 8, 16, 32, 64, 128, 256, 512)
  – Number of elements of each mesh controlled so that each matrix would have approximately 150,000 nonzeros per process
• RCB 1D partitioning used
  – Meshes built to keep the number of neighboring processors to a minimum
• All runs on the IBM SP at NERSC (Seaborg)

Page 25: Blocking Communication

  Method   Stage 1   Stage 2   Stage 3
  orig     LC        Comm/RC
  2        LC        Comm      RC
  3        Comm      LC        RC

Page 26: Nonblocking Communication

  Ordering   S1          S2     S3     S4     S5
  Ord. 4     Recv        Send   LC     Wait   RC
  Ord. 5     Send/Recv   LC     Wait   RC
  Ord. 6     Recv        Send   Wait   LC     RC

Page 27: Blocking Send/Nonblocking Recv: Standard

Page 28: Orderings for Blocking Send/Nonblocking Recv

  Ordering   Stage 1   Stage 2          Stage 3         Stage 4   Stage 5
  7          Recv      Send             Wait            LC        RC
  8          Recv      Send             LC              Wait      RC
  9          Recv      LC               Send/Waits      RC
  10         Recv      Sends/Waits      LC              RC
  11         Recv      LC               Send/Waits/RC
  12         Recv      Sends/Waits/RC   LC

• Orderings 7-10 performed significantly worse than 11-12 for all MPI modes
• Subsequent graphs show only the best algorithms

Page 29: Blocking Send/Nonblocking Recv: Standard

Page 30: Blocking Send/Nonblocking Recv: Buffered

Page 31: Blocking Send/Nonblocking Recv: Synchronous

Page 32: Blocking Send/Nonblocking Recv: Persistent Standard

Page 33: Blocking Send/Nonblocking Recv: Persistent Buffered

Page 34: Blocking Send/Nonblocking Recv: Persistent Synchronous

Page 35: Ordering 11

[Figure: "Ordering 11 - All"; time vs. # processors (up to 512) for the orig, ST, BU, SY, PST, PBU, and PSY variants]

Page 36: Ordering 12

[Figure: "Ordering 12 - All"; time vs. # processors (up to 512) for the orig, ST, BU, SY, PST, PBU, and PSY variants]

Page 37: Best 5

[Figure: "Best 5"; time vs. # processors for the orig, 11ST, 12ST, 11SY, and 12SY variants]

Page 38: Observations/Conclusions

• Slight variations of the algorithm can give very different performance
• The algorithm (with Tau3P data) is very sensitive to stage ordering
• Combined communication/remote computation stages are very beneficial
• Standard and synchronous modes perform well
• Buffered modes are costly
• Persistent communication is costly
• Some factors could be machine dependent

Page 39: Future Work

• Threaded algorithms
  – Preliminary results not good
• Visualization of simulations
• Real accelerator structure
• Scalability of fixed-size problem

Page 40: Acknowledgements

• LBNL
  – Ali Pinar, Esmond Ng
• SLAC (ACD)
  – Adam Guetz, Cho Ng, Nate Folwell, Kwok Ko
• MPI
  – Prof. Heath's CS 554 class notes
  – Using MPI, Gropp, Lusk, Skjellum
  – MPI: The Complete Reference, Snir, et al.
• Work supported by U.S. DOE CSGF contract DE-FG02-97ER25308