1
In-Situ Model Checking of MPI Parallel Programs
Ganesh Gopalakrishnan
Joint work with Salman Pervez, Michael DeLisi
Sarvani Vakkalanka, Subodh Sharma, Yu Yang, Robert Palmer, Mike Kirby, Guodong Li
(http://www.cs.utah.edu/formal_verification)
School of Computing, University of Utah
Supported by: Microsoft HPC Institutes, NSF CNS 0509379
MPI is the de-facto standard for programming cluster machines
Our focus: Eliminate Concurrency Bugs from HPC Programs !
(BlueGene/L - Image courtesy of IBM / LLNL) (Image courtesy of Steve Parker, CSAFE, Utah)
3
Reason for our interest in MPI verification
Widely felt need
– MPI is used on expensive machines for critical simulations
Potential for wider impact
– MPI is a success as a standard
– What’s good for MPI may be good for OpenMP, CUDA, Shmem, …
Working in a less crowded but important area
– Funding in HW verification is decreasing
  » We are still continuing two efforts: verifying hierarchical cache coherence protocols, and refining cache coherence protocol models down to HW implementations
– SW verification for “threading / shared memory” is crowded
  » Whereas HPC offers LIBRARY-BASED concurrent software creation as an unexplored challenge!
4
A highly simplistic view of MPI
Many MPI programs compute something like f ∘ g ∘ h (x) in a distributed manner
(think of maps on separate data domains, and later combinations thereof)
Compute h(x) on P1
Start g() on P2
Fire up f on P1
Use sends, receives, barriers, etc., to maximize computational speed
This view may help compare it against PThread programs, for instance
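To make this view concrete, here is a small illustrative sketch (not taken from the slides; the kernels h and g, the ranks, and the tag are made-up stand-ins) in which P1 computes h(x), ships it to P2 for g, and is then free to start on f:

/* Illustrative pipeline sketch only; run with at least 3 ranks (e.g., mpirun -np 3).
 * h() and g() are hypothetical stand-ins for real computational kernels. */
#include <mpi.h>

static double h(double x) { return x * x; }
static double g(double y) { return y + 1.0; }

int main(int argc, char **argv) {
    int rank;
    double x = 3.0, hx = 0.0, gy = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        hx = h(x);                                           /* compute h(x) on P1 */
        MPI_Send(&hx, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);  /* hand h(x) to P2 */
        /* ... P1 is now free to fire up f on other data ... */
    } else if (rank == 2) {
        MPI_Recv(&hx, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        gy = g(hx);                                          /* start g() on P2 */
        (void)gy;
    }
    MPI_Barrier(MPI_COMM_WORLD);  /* synchronize before combining results */
    MPI_Finalize();
    return 0;
}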
5
Some high-level features of MPI
Organized as a large library (API)
– Over 300 functions in MPI-2 (was 128 in MPI-1)
– Most MPI programs use about a dozen
– Usually a different dozen for each program
6
MPI programming and optimization
MPI includes message passing, shared memory, and I/O
– We consider C++ MPI programs, largely focusing on message passing
MPI programs are usually written by hand
– Automated generation has been proposed and still seems attractive
Source-to-source optimizations of MPI programs are attractive
– Break up communications and overlap them with computations (ASPHALT)
Many important MPI programs do evolve
– Re-tuning after porting to a new cluster, etc.
Correctness expectations vary
– Some are throw-away programs; others are long-lasting libraries
– Code correctness – not model fidelity – is our emphasis
7
Why MPI is Complex: Collision of features
– Send
– Receive
– Send / Receive
– Send / Receive / Replace
– Broadcast
– Barrier
– Reduce
– Rendezvous mode
– Blocking mode
– Non-blocking mode
– Reliance on system buffering
– User-attached buffering
– Restarts/Cancels of MPI Operations
– Non Wildcard receives
– Wildcard receives
– Tag matching
– Communication spaces
An MPI program is an interesting (and legal) combination of elements from these spaces
8
Shared memory “escape” features of MPI
MPI has shared memory (called “one-sided” communication)
Nodes open a shared region through a “collective” call
One process manages the region (the “owner”)
– Ensures serial access to the window
Within a lock/unlock epoch, a process does puts/gets
– There are more functions, such as “accumulate”, besides puts / gets
The puts/gets are not program-ordered!
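A minimal sketch of this style of code, purely for orientation (the window size, ranks, and lock arguments are assumptions, not taken from the slides):

/* Hypothetical one-sided sketch: every rank exposes one int as a window;
 * rank 0 is treated as the owner, and rank 1 locks the window, does a put
 * and a get, and unlocks.  MPI does not order the put and the get within
 * the lock/unlock epoch. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, buf = 0, local = 42, got = 0;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Collective call: all ranks participate in creating the window */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    if (rank == 1) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);        /* access epoch on rank 0 */
        MPI_Put(&local, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        MPI_Get(&got,   1, MPI_INT, 0, 0, 1, MPI_INT, win);
        /* Note: the put and the get above are NOT guaranteed to occur in program order */
        MPI_Win_unlock(0, win);
    }
    MPI_Win_free(&win);   /* collective; also acts as a synchronization point */
    MPI_Finalize();
    return 0;
}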
9
A Simple Example of Msg Passing MPI
Programmer expectation: integration of a region
// Add up the integrals calculated by each process
// (p is the number of processes; dest is presumably 0, the root)
if (my_rank == 0) {
    total = integral;
    for (source = 0; source < p; source++) {
        // Note: the loop starts at 0, so rank 0 also posts a receive
        // from itself -- a message it never sends
        MPI_Recv(&integral, 1, MPI_FLOAT, source,
                 tag, MPI_COMM_WORLD, &status);
        total = total + integral;
    }
} else {
    MPI_Send(&integral, 1, MPI_FLOAT, dest,
             tag, MPI_COMM_WORLD);
}
10
A Simple Example of Msg Passing MPI
Bug! A mismatched send/recv causes deadlock
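One plausible fix, sketched here under the assumption that rank 0 is meant to gather only from ranks 1..p-1 and that the senders use dest = 0, is to start the receive loop at 1 so that rank 0 never waits on a message from itself:

// Corrected sketch: rank 0 no longer posts a receive from itself
if (my_rank == 0) {
    total = integral;                      // rank 0's own contribution
    for (source = 1; source < p; source++) {
        MPI_Recv(&integral, 1, MPI_FLOAT, source,
                 tag, MPI_COMM_WORLD, &status);
        total = total + integral;
    }
} else {
    MPI_Send(&integral, 1, MPI_FLOAT, 0 /* dest = root */,
             tag, MPI_COMM_WORLD);
}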
Runtime Considerations
Does the system provide buffering? What progress engine does MPI have? How does it schedule?
The MPI run-time; there is no separate thread for it…
• Does the system provide buffering?
  • If not, rendezvous behavior is enforced!
• When does the runtime actually process events?
  • Whenever an MPI operation is issued
  • Whenever an operation that “pokes” the progress engine is issued
12
Differences between MPI and Shared Memory / Thread Parallel programs
Processes with local state communicate by copying
– versus threads sharing global state and the heap
– with synchronization using locks, signals, notifies, waits
Not much dynamic process creation
– whereas PThread programs may spawn children dynamically
Control / data dependencies are well confined (often to rank variables and such)
– versus pervasive decoding of “data” (e.g., through heap storage)
Simple aliasing
– versus aliasing relations that may flow considerably across pointer chains and across procedure calls
13
Conventional debugging of MPI
Inspection
– Difficult to carry out on MPI programs (low-level notation)
Simulation based
– Run the given program with manually selected inputs
– Can give poor coverage in practice
Simulation with runtime heuristics to find bugs
– Marmot: timeout-based deadlock detection, random executions
– Intel Trace Collector: similar checks, with data checking
– TotalView: better trace viewing – still no “model checking” (?)
– We don’t know if any formal coverage metrics are offered
14
What should one verify?
The overall computation achieves some “f ∘ g ∘ h”
Symbolic execution of MPI programs may work (Siegel et al.)
Symbolic execution has its limits
– Finding out the “plumbing” of f, g, and h is non-trivial for optimized MPI programs
So why not look for reactive bugs introduced in the process of erecting the plumbing?
A common concern: “my code hangs” (the second pattern is sketched after this list):
– ISends without a wait / test
– Assuming that the system provides buffering for Sends
– Wildcard-receive non-determinism is unexpected
– Incorrect collective semantics assumed (e.g., for barriers)
ISP currently checks for deadlocks (not all processes reach MPI_Finalize). In the future, we may check local assertions.
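To make the buffering pitfall concrete, here is a small illustrative example (not from the slides; the message size and tag are made up) that completes only if the MPI library buffers the sends, and hangs if the library falls back to rendezvous behavior:

/* Head-to-head sends: each rank calls MPI_Send before MPI_Recv.
 * If the library eagerly buffers the message this "works"; if it enforces
 * rendezvous behavior (typical for large messages), both sends block
 * forever and the program deadlocks.  Run with exactly 2 ranks. */
#include <mpi.h>

#define N (1 << 20)   /* large enough that eager buffering is unlikely */
static int out[N], in[N];

int main(int argc, char **argv) {
    int rank, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;
    MPI_Send(out, N, MPI_INT, other, 0, MPI_COMM_WORLD);   /* may block forever */
    MPI_Recv(in,  N, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}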
15
Static Analysis for violated usages of the API
Model Checking for Concurrency Bugs
Instrumentation and Trace Checking
Static Analysis to Support Model Checking
– Loop transformations
– Strength reduction of code
– …
But… who gives us the formal models to check!?
What approaches are cost-effective?
Some candidate approaches to MPI verification
16
Will look at C++ MPI programs
– Gotta do C++, alas; C won’t do
Not ask the user to hand-build Promela / Zing models
– Do In-Situ Model Checking – run the actual code
May need to simplify code before running
– OK, so complementary static analysis methods needed
LOTS of interleavings that do not matter!
– Process memory is not shared!
When can we commute two actions?
– Need a formal basis for Partial Order Reduction
  » Need formal semantics for MPI
  » Need to formulate “independence”
  » Need a viable model-checking approach
Our initial choices … and consequences
17
POR
With 3 processes, the size of an interleaved state space is p^s = 27
Partial-order reduction explores representative sequences from each equivalence class
Delays the execution of independent transitions
In this example, it is possible to “get away” with 7 states (one interleaving)
• To keep things simple, the scheduler works in “phases” (sketched below)
• When Pi…Pj have been “let go” in one phase of the scheduler, no other Pk is “let go” till Pi…Pj have reported back
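A schematic sketch of this phase discipline (illustrative only; the real ISP scheduler drives remote MPI processes over sockets, whereas this toy just records the bookkeeping):

/* Toy model of the phase discipline: release a subset of processes,
 * then wait for every released process to report back before any
 * other process is released. */
#include <stdio.h>

static void let_go(int p)       { printf("phase: let go P%d\n", p); }
static void await_report(int p) { printf("phase: P%d reported back\n", p); }

int main(void) {
    int phase_set[] = {0, 1};                       /* processes chosen for this phase */
    int n = sizeof(phase_set) / sizeof(phase_set[0]);
    for (int i = 0; i < n; i++) let_go(phase_set[i]);
    for (int i = 0; i < n; i++) await_report(phase_set[i]);
    /* Only now may the scheduler start a new phase and release P2 */
    let_go(2);
    await_report(2);
    return 0;
}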
45
Simple 1-sided Example… will show advancing computation by Blue marching
• If we now allow P0’s PMPI_Win_unlock to issue, it may zip through the progress engine and miss P1’s PMPI_Win_unlock
• But we HAVE to allow P0 to launch, or else P1 won’t get access to the window!
46
Simple 1-sided Example… will show advancing computation by Blue marching
• So P1 will likely be stuck in the progress engine
• But P0 next enters the progress engine only at the Barrier
• But we don’t schedule P0 till P1 has reported back
• But P1 won’t report back (it is stuck inside the progress engine)
47
Simple 1-sided Example… will show advancing computation by Blue marching
Solution: when P0 comes to the scheduler, we do not give it a ‘go-ahead’; so it keeps poking the progress engine; this causes P1 to come back to the scheduler; then we let P0’s PMPI_Win_unlock issue
49
P0’s code to handle MPI_Win_unlock (in general, this is how every MPI_SomeFunc is structured…)
MPI_Win_unlock(arg1, arg2, ..., argN) {
    /* Tell the ISP scheduler which operation this process wants to issue */
    sendToSocket(pID, Win_unlock, arg1, ..., argN);
    /* Until the scheduler grants a go-ahead, keep poking the MPI
       progress engine with an innocuous call */
    while (recvFromSocket(pID) != go_ahead)
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, ...);
    /* Go-ahead received: issue the real operation */
    return PMPI_Win_unlock(arg1, arg2, ..., argN);
}
An innocuous Progress-Engine “Poker”
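Note that the wrapper forwards to PMPI_Win_unlock – the entry point defined by MPI’s standard profiling (PMPI) interface – so the instrumented MPI_ call can report to the scheduler, poke the progress engine while waiting, and only then invoke the library’s real implementation.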
50
Assessment of the solution to the forward-progress problem
• Solutions may be MPI-library specific
• This is OK so long as we know exactly how the progress engine of the MPI library works
• This needs to be advertised by MPI library designers
• Better still: if they can provide more “hooks”, ISP can be made more successful
51
So how well does ISP work?
• Trapezoidal integration deadlock
  • Found in seconds
  • Total of 33 interleavings in 9 seconds after the fix
  • 8.4 seconds spent restarting the MPI system
• Monte-Carlo computation of Pi
  • Found three deadlocks we did not know about, in seconds
  • No modeling effort whatsoever
  • After fixing, took 3,427 interleavings taking 15.5 minutes
  • About 15 minutes spent restarting the MPI system
• Byte-Range Locking using 1-sided operations
  • The deadlock was found by us in previous work
  • Found again by ISP in 62 interleavings
  • After the fix, 11,000 interleavings… no end in sight
52
How to improve the performance of ISP?
• Minimize restart overhead
  • Maybe we don’t need to reset all data before restarting
• Implemented a Chinese-Postman-like tour
  • A “collective goto” to the initial state, issued just before MPI_Finalize (see the sketch below)
• Trapezoidal now finishes in 0.3 seconds (was 9 seconds before)
• Monte-Carlo finishes in 63 seconds (was 15 minutes)
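As a rough illustration of what such a “collective goto” could look like – this is a guess at the mechanism, not ISP’s actual implementation, and scheduler_wants_another_run is a made-up stand-in for the scheduler’s decision – each process saves a resume point right after MPI_Init and jumps back to it instead of tearing the MPI job down and restarting:

/* Hypothetical sketch of a "collective goto" back to the initial state. */
#include <mpi.h>
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf restart_point;

/* Made-up stand-in: would ask the ISP scheduler whether another
 * interleaving remains to be explored. */
static bool scheduler_wants_another_run(void) { return false; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    setjmp(restart_point);            /* execution resumes here on a "restart" */

    /* ... run the instrumented program under one interleaving ... */

    MPI_Barrier(MPI_COMM_WORLD);      /* all processes reach this point together */
    if (scheduler_wants_another_run())
        longjmp(restart_point, 1);    /* collective goto: skip the costly mpirun restart */

    MPI_Finalize();
    return 0;
}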
53
Vary the DPOR dependence matrix during the search
Eliminate computations that don’t affect control
– Static analysis to remove blocks that won’t deadlock
Loop peeling transformations
– Many MPI calls are within loops
– Do not interleave all of them (let some happen without trapping)
– This sampling should not confuse our scheduler
Insert barriers to confine the search
– Analysis to infer concurrent cuts (incomparable clock vectors)
Other ideas to improve ISP (TBD)
54
Summary: Motivations for the In-Situ DPOR (ISP) Approach
Building verification models of MPI programs is not straightforward
– The bug may be in code that looks innocuous
– The bug may be in the MPI library function itself
– The final production code may be hand-tuned
Complementary approaches are possible
– Model check using other tools to weed out concurrency errors
– Use static analysis to detect bugs
– Use automated synthesis to guarantee correctness by construction
55
Related work on FV for MPI programs
Main related work is that by Siegel and Avrunin
Provide synchronous-channel theorems for blocking and non-blocking MPI constructs
– Deadlocks are caught iff they are caught using synchronous channels
Provide a state-machine model for MPI calls
– Have built a tool called MPI_Spin that uses C extensions to Promela to encode the MPI state machine
Provide a symbolic execution approach to check the computational results of MPI programs
Define the “Urgent Algorithm,” a static POR algorithm
– Schedules processes in a canonical order
– Schedules sends when receives are posted – the synchronous-channel effect
– Wildcard receives are handled through over-approximation
56
Initial implementation
– Rajeev Thakur (Argonne) proposed the ISP idea – instrument + play
– Salman Pervez’s MS thesis – wrote our first ISP
– Robert Palmer provided lots of help / inspiration
– We have a EuroPVM/MPI 2007 paper coauthored by Salman Pervez, Robert Palmer, myself, Mike Kirby – and Rajeev Thakur, Bill Gropp of Argonne National Lab
– Salman moves to Purdue for a PhD; Sarvani takes over ISP
Sarvani’s ISP implementation
– A TOTAL rewrite
– Modular for experimentation
– Have collected lots of data
Subodh is looking into static analysis support for ISP
– Make it easier to do ISP on a given program
Credits for ISP Algorithm
57
Quick demo
58
Overview of Distributed DPOR in Inspect (a tool for PThread Verification – SPIN 07)
59
We first built a sequential DPOR explorer for C / Pthreads programs, called “Inspect”
[Workflow diagram: a multithreaded C/C++ program is instrumented (using a thread-library wrapper) and compiled into an executable; at run time, threads 1…n exchange request/permit messages with the Inspect scheduler.]
60
[Diagram: workers a and b and a central load balancer exchanging messages labeled “request unloading”, “idle node id”, “work description”, and “report result”.]
We then devised a work-distribution scheme…
61
Speedup on aget
62
Speedup on bbuf
63
DPOR-based search is quite promising
– Tool built for MPI exploration is ISP
– Tool built for PThreads exploration is Inspect
– Distributed Inspect helps obtain linear speed-up
– Distributed ISP is within easy reach
– More understanding of Forward Progress and other implementation issues
– More examples
– Any properties beyond deadlocks?
Will improve efficiency of search
Will couple static analysis with DPOR to improve performance
Handling concurrent software written using large APIs remains an important challenge to meet
– Need more people to be working on this