1
In-Situ Model Checking of MPI Parallel Programs
Ganesh Gopalakrishnan
Joint work with Salman Pervez, Michael DeLisi
Sarvani Vakkalanka, Subodh Sharma, Yu Yang, Robert Palmer, Mike Kirby, Guodong Li
(http://www.cs.utah.edu/formal_verification)
School of Computing, University of Utah
Supported by: Microsoft HPC Institutes, NSF CNS 0509379
MPI is the de-facto standard for programming cluster machines
Our focus: Eliminate Concurrency Bugs from HPC Programs !
(BlueGene/L - Image courtesy of IBM / LLNL) (Image courtesy of Steve Parker, CSAFE, Utah)
3
Reason for our interest in MPI verification
Widely felt need
– MPI is used on expensive machines for critical simulations
Potential for wider impact
– MPI is a success as a standard
– What’s good for MPI may be good for OpenMP, CUDA, Shmem, …
Working in a less crowded but important area
– Funding in HW verification is decreasing
  » We are still continuing two efforts: verifying hierarchical cache coherence protocols, and refining cache coherence protocol models down to HW implementations
– SW verification for “threading / shared memory” is crowded
  » Whereas HPC offers LIBRARY-BASED concurrent software creation as an unexplored challenge!
4
A highly simplistic view of MPI
Many MPI programs compute something like f ∘ g ∘ h (x) in a distributed manner
(think of maps on separate data domains, and later combinations thereof)
Compute h(x) on P1
Start g() on P2
Fire up f on P1
Use sends, receives, barriers, etc., to maximize computational speed
This view may help compare it against PThread programs, for instance
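To make this view concrete, here is a small illustrative sketch (not taken from the slides; the kernels h and g, the ranks, and the tag are made-up stand-ins) in which P1 computes h(x), ships it to P2 for g, and is then free to start on f:

/* Illustrative pipeline sketch only; run with at least 3 ranks (e.g., mpirun -np 3).
 * h() and g() are hypothetical stand-ins for real computational kernels. */
#include <mpi.h>

static double h(double x) { return x * x; }
static double g(double y) { return y + 1.0; }

int main(int argc, char **argv) {
    int rank;
    double x = 3.0, hx = 0.0, gy = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        hx = h(x);                                           /* compute h(x) on P1 */
        MPI_Send(&hx, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);  /* hand h(x) to P2 */
        /* ... P1 is now free to fire up f on other data ... */
    } else if (rank == 2) {
        MPI_Recv(&hx, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        gy = g(hx);                                          /* start g() on P2 */
        (void)gy;
    }
    MPI_Barrier(MPI_COMM_WORLD);  /* synchronize before combining results */
    MPI_Finalize();
    return 0;
}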
5
Some high-level features of MPI
Organized as a large library (API)
– Over 300 functions in MPI-2 (was 128 in MPI-1)
– Most MPI programs use about a dozen
– Usually a different dozen for each program
6
MPI programming and optimization
MPI includes message passing, shared memory, and I/O
– We consider C++ MPI programs, largely focusing on message passing
MPI programs are usually written by hand
– Automated generation has been proposed and still seems attractive
Source-to-source optimizations of MPI programs are attractive
– Break up communications and overlap them with computations (ASPHALT)
Many important MPI programs do evolve
– Re-tuning after porting to a new cluster, etc.
Correctness expectations vary
– Some are throw-away programs; others are long-lasting libraries
– Code correctness – not model fidelity – is our emphasis
7
Why MPI is Complex: Collision of features
– Send
– Receive
– Send / Receive
– Send / Receive / Replace
– Broadcast
– Barrier
– Reduce
– Rendezvous mode
– Blocking mode
– Non-blocking mode
– Reliance on system buffering
– User-attached buffering
– Restarts/Cancels of MPI Operations
– Non Wildcard receives
– Wildcard receives
– Tag matching
– Communication spaces
An MPI program is an interesting (and legal) combination of elements from these spaces
8
Shared memory “escape” features of MPI
MPI has shared memory (called “one-sided” communication)
Nodes open a shared region through a “collective” call
One process manages the region (the “owner”)
– Ensures serial access to the window
Within a lock/unlock epoch, a process does puts/gets
– There are more functions, such as “accumulate”, besides puts / gets
The puts/gets are not program-ordered!
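A minimal sketch of this style of code, purely for orientation (the window size, ranks, and lock arguments are assumptions, not taken from the slides):

/* Hypothetical one-sided sketch: every rank exposes one int as a window;
 * rank 0 is treated as the owner, and rank 1 locks the window, does a put
 * and a get, and unlocks.  MPI does not order the put and the get within
 * the lock/unlock epoch. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, buf = 0, local = 42, got = 0;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Collective call: all ranks participate in creating the window */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    if (rank == 1) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);        /* access epoch on rank 0 */
        MPI_Put(&local, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        MPI_Get(&got,   1, MPI_INT, 0, 0, 1, MPI_INT, win);
        /* Note: the put and the get above are NOT guaranteed to occur in program order */
        MPI_Win_unlock(0, win);
    }
    MPI_Win_free(&win);   /* collective; also acts as a synchronization point */
    MPI_Finalize();
    return 0;
}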
9
A Simple Example of Msg Passing MPI
Programmer expectation: integration of a region
// Add up the integrals calculated by each process
// (p is the number of processes; dest is presumably 0, the root)
if (my_rank == 0) {
    total = integral;
    for (source = 0; source < p; source++) {
        // Note: the loop starts at 0, so rank 0 also posts a receive
        // from itself -- a message it never sends
        MPI_Recv(&integral, 1, MPI_FLOAT, source,
                 tag, MPI_COMM_WORLD, &status);
        total = total + integral;
    }
} else {
    MPI_Send(&integral, 1, MPI_FLOAT, dest,
             tag, MPI_COMM_WORLD);
}
10
A Simple Example of Msg Passing MPI
Bug! A mismatched send/recv causes deadlock
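One plausible fix, sketched here under the assumption that rank 0 is meant to gather only from ranks 1..p-1 and that the senders use dest = 0, is to start the receive loop at 1 so that rank 0 never waits on a message from itself:

// Corrected sketch: rank 0 no longer posts a receive from itself
if (my_rank == 0) {
    total = integral;                      // rank 0's own contribution
    for (source = 1; source < p; source++) {
        MPI_Recv(&integral, 1, MPI_FLOAT, source,
                 tag, MPI_COMM_WORLD, &status);
        total = total + integral;
    }
} else {
    MPI_Send(&integral, 1, MPI_FLOAT, 0 /* dest = root */,
             tag, MPI_COMM_WORLD);
}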
Runtime Considerations
Does the system provide buffering? What progress engine does MPI have? How does it schedule?
The MPI run-time; there is no separate thread for it…
• Does the system provide buffering?
  • If not, rendezvous behavior is enforced!
• When does the runtime actually process events?
  • Whenever an MPI operation is issued
  • Whenever an operation that “pokes” the progress engine is issued
12
Differences between MPI and Shared Memory / Thread Parallel programs
Processes with local state communicate by copying
– versus threads sharing global state and the heap
– with synchronization using locks, signals, notifies, waits
Not much dynamic process creation
– whereas PThread programs may spawn children dynamically
Control / data dependencies are well confined (often to rank variables and such)
– versus pervasive decoding of “data” (e.g., through heap storage)
Simple aliasing
– versus aliasing relations that may flow considerably across pointer chains and across procedure calls
13
Conventional debugging of MPI
Inspection
– Difficult to carry out on MPI programs (low-level notation)
Simulation based
– Run the given program with manually selected inputs
– Can give poor coverage in practice
Simulation with runtime heuristics to find bugs
– Marmot: timeout-based deadlock detection, random executions
– Intel Trace Collector: similar checks, with data checking
– TotalView: better trace viewing – still no “model checking” (?)
– We don’t know if any formal coverage metrics are offered
14
What should one verify?
The overall computation achieves some “f ∘ g ∘ h”
Symbolic execution of MPI programs may work (Siegel et al.)
Symbolic execution has its limits
– Finding out the “plumbing” of f, g, and h is non-trivial for optimized MPI programs
So why not look for reactive bugs introduced in the process of erecting the plumbing?
A common concern: “my code hangs” (the second pattern is sketched after this list):
– ISends without a wait / test
– Assuming that the system provides buffering for Sends
– Wildcard-receive non-determinism is unexpected
– Incorrect collective semantics assumed (e.g., for barriers)
ISP currently checks for deadlocks (not all processes reach MPI_Finalize). In the future, we may check local assertions.
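To make the buffering pitfall concrete, here is a small illustrative example (not from the slides; the message size and tag are made up) that completes only if the MPI library buffers the sends, and hangs if the library falls back to rendezvous behavior:

/* Head-to-head sends: each rank calls MPI_Send before MPI_Recv.
 * If the library eagerly buffers the message this "works"; if it enforces
 * rendezvous behavior (typical for large messages), both sends block
 * forever and the program deadlocks.  Run with exactly 2 ranks. */
#include <mpi.h>

#define N (1 << 20)   /* large enough that eager buffering is unlikely */
static int out[N], in[N];

int main(int argc, char **argv) {
    int rank, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;
    MPI_Send(out, N, MPI_INT, other, 0, MPI_COMM_WORLD);   /* may block forever */
    MPI_Recv(in,  N, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}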
15
Static Analysis for violated usages of the API
Model Checking for Concurrency Bugs
Instrumentation and Trace Checking
Static Analysis to Support Model Checking
– Loop transformations
– Strength reduction of code
– …
But… who gives us the formal models to check!?
What approaches are cost-effective?
Some candidate approaches to MPI verification
16
Will look at C++ MPI programs
– Gotta do C++, alas; C won’t do
Not ask the user to hand-build Promela / Zing models
– Do In-Situ Model Checking – run the actual code
May need to simplify code before running
– OK, so complementary static analysis methods needed
LOTS of interleavings that do not matter!
– Process memory is not shared!
When can we commute two actions?
– Need a formal basis for Partial Order Reduction
  » Need formal semantics for MPI
  » Need to formulate “independence”
  » Need a viable model-checking approach
Our initial choices … and consequences
17
POR
With 3 processes, the size of an interleaved state space is p^s = 27
Partial-order reduction explores representative sequences from each equivalence class
Delays the execution of independent transitions
In this example, it is possible to “get away” with 7 states (one interleaving)
• To keep things simple, the scheduler works in “phases” (sketched below)
• When Pi…Pj have been “let go” in one phase of the scheduler, no other Pk is “let go” till Pi…Pj have reported back
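A schematic sketch of this phase discipline (illustrative only; the real ISP scheduler drives remote MPI processes over sockets, whereas this toy just records the bookkeeping):

/* Toy model of the phase discipline: release a subset of processes,
 * then wait for every released process to report back before any
 * other process is released. */
#include <stdio.h>

static void let_go(int p)       { printf("phase: let go P%d\n", p); }
static void await_report(int p) { printf("phase: P%d reported back\n", p); }

int main(void) {
    int phase_set[] = {0, 1};                       /* processes chosen for this phase */
    int n = sizeof(phase_set) / sizeof(phase_set[0]);
    for (int i = 0; i < n; i++) let_go(phase_set[i]);
    for (int i = 0; i < n; i++) await_report(phase_set[i]);
    /* Only now may the scheduler start a new phase and release P2 */
    let_go(2);
    await_report(2);
    return 0;
}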
45
Simple 1-sided Example… will show advancing computation by Blue marching
• If we now allow P0’s PMPI_Win_unlock to issue, it may zip through the progress engine and miss P1’s PMPI_Win_unlock
• But we HAVE to allow P0 to launch, or else P1 won’t get access to the window!
46
Simple 1-sided Example… will show advancing computation by Blue marching
• So P1 will likely be stuck in the progress engine
• But P0 next enters the progress engine only at the Barrier
• But we don’t schedule P0 till P1 has reported back
• But P1 won’t report back (it is stuck inside the progress engine)
47
Simple 1-sided Example… will show advancing computation by Blue marching
Solution: when P0 comes to the scheduler, we do not give it a ‘go-ahead’; so it keeps poking the progress engine; this causes P1 to come back to the scheduler; then we let P0’s PMPI_Win_unlock issue
49
P0’s code to handle MPI_Win_unlock (in general, this is how every MPI_SomeFunc is structured…)
MPI_Win_unlock(arg1, arg2, ..., argN) {
    /* Tell the ISP scheduler which operation this process wants to issue */
    sendToSocket(pID, Win_unlock, arg1, ..., argN);
    /* Until the scheduler grants a go-ahead, keep poking the MPI
       progress engine with an innocuous call */
    while (recvFromSocket(pID) != go_ahead)
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, ...);
    /* Go-ahead received: issue the real operation */
    return PMPI_Win_unlock(arg1, arg2, ..., argN);
}
An innocuous Progress-Engine “Poker”
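Note that the wrapper forwards to PMPI_Win_unlock – the entry point defined by MPI’s standard profiling (PMPI) interface – so the instrumented MPI_ call can report to the scheduler, poke the progress engine while waiting, and only then invoke the library’s real implementation.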
50
Assessment of the solution to the forward-progress problem
• Solutions may be MPI-library specific
• This is OK so long as we know exactly how the progress engine of the MPI library works
• This needs to be advertised by MPI library designers
• Better still: if they can provide more “hooks”, ISP can be made more successful
51
So how well does ISP work?
• Trapezoidal integration deadlock
  • Found in seconds
  • Total of 33 interleavings in 9 seconds after the fix
  • 8.4 seconds spent restarting the MPI system
• Monte-Carlo computation of Pi
  • Found three deadlocks we did not know about, in seconds
  • No modeling effort whatsoever
  • After fixing, took 3,427 interleavings taking 15.5 minutes
  • About 15 minutes spent restarting the MPI system
• Byte-Range Locking using 1-sided operations
  • The deadlock was found by us in previous work
  • Found again by ISP in 62 interleavings
  • After the fix, 11,000 interleavings… no end in sight
52
How to improve the performance of ISP?
• Minimize restart overhead
  • Maybe we don’t need to reset all data before restarting
• Implemented a Chinese-Postman-like tour
  • A “collective goto” to the initial state, issued just before MPI_Finalize (see the sketch below)
• Trapezoidal now finishes in 0.3 seconds (was 9 seconds before)
• Monte-Carlo finishes in 63 seconds (was 15 minutes)
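As a rough illustration of what such a “collective goto” could look like – this is a guess at the mechanism, not ISP’s actual implementation, and scheduler_wants_another_run is a made-up stand-in for the scheduler’s decision – each process saves a resume point right after MPI_Init and jumps back to it instead of tearing the MPI job down and restarting:

/* Hypothetical sketch of a "collective goto" back to the initial state. */
#include <mpi.h>
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf restart_point;

/* Made-up stand-in: would ask the ISP scheduler whether another
 * interleaving remains to be explored. */
static bool scheduler_wants_another_run(void) { return false; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    setjmp(restart_point);            /* execution resumes here on a "restart" */

    /* ... run the instrumented program under one interleaving ... */

    MPI_Barrier(MPI_COMM_WORLD);      /* all processes reach this point together */
    if (scheduler_wants_another_run())
        longjmp(restart_point, 1);    /* collective goto: skip the costly mpirun restart */

    MPI_Finalize();
    return 0;
}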
53
Vary the DPOR dependence matrix during the search
Eliminate computations that don’t affect control
– Static analysis to remove blocks that won’t deadlock
Loop peeling transformations
– Many MPI calls are within loops
– Do not interleave all of them (let some happen without trapping)
– This sampling should not confuse our scheduler
Insert barriers to confine the search
– Analysis to infer concurrent cuts (incomparable clock vectors)
Other ideas to improve ISP (TBD)
54
Summary: Motivations for the In-Situ DPOR (ISP) Approach
Building verification models of MPI programs is not straightforward
– The bug may be in code that looks innocuous
– The bug may be in the MPI library function itself
– The final production code may be hand-tuned
Complementary approaches are possible
– Model check using other tools to weed out concurrency errors
– Use static analysis to detect bugs
– Use automated synthesis to guarantee correctness by construction
55
Related work on FV for MPI programs
Main related work is that by Siegel and Avrunin
Provide synchronous-channel theorems for blocking and non-blocking MPI constructs
– Deadlocks are caught iff they are caught using synchronous channels
Provide a state-machine model for MPI calls
– Have built a tool called MPI_Spin that uses C extensions to Promela to encode the MPI state machine
Provide a symbolic execution approach to check the computational results of MPI programs
Define the “Urgent Algorithm,” a static POR algorithm
– Schedules processes in a canonical order
– Schedules sends when receives are posted – the synchronous-channel effect
– Wildcard receives are handled through over-approximation
56
Initial implementation
– Rajeev Thakur (Argonne) proposed the ISP idea – instrument + play
– Salman Pervez’s MS thesis – wrote our first ISP
– Robert Palmer provided lots of help / inspiration
– We have a EuroPVM/MPI 2007 paper coauthored by Salman Pervez, Robert Palmer, myself, Mike Kirby – and Rajeev Thakur, Bill Gropp of Argonne National Lab
– Salman moves to Purdue for a PhD; Sarvani takes over ISP
Sarvani’s ISP implementation
– A TOTAL rewrite
– Modular for experimentation
– Have collected lots of data
Subodh is looking into static analysis support for ISP
– Make it easier to do ISP on a given program
Credits for ISP Algorithm
57
Quick demo
58
Overview of Distributed DPOR in Inspect (a tool for PThread Verification – SPIN 07)
59
We first built a sequential DPOR explorer for C / Pthreads programs, called “Inspect”
[Workflow diagram: a multithreaded C/C++ program is instrumented (using a thread-library wrapper) and compiled into an executable; at run time, threads 1…n exchange request/permit messages with the Inspect scheduler.]
60
[Diagram: workers a and b and a central load balancer exchanging messages labeled “request unloading”, “idle node id”, “work description”, and “report result”.]
We then devised a work-distribution scheme…
61
Speedup on aget
62
Speedup on bbuf
63
DPOR-based search is quite promising
– Tool built for MPI exploration is ISP
– Tool built for PThreads exploration is Inspect
– Distributed Inspect helps obtain linear speed-up
– Distributed ISP is within easy reach
– More understanding of Forward Progress and other implementation issues
– More examples
– Any properties beyond deadlocks?
Will improve efficiency of search
Will couple static analysis with DPOR to improve performance
Handling concurrent software written using large APIs remains an important challenge to meet
– Need more people to be working on this