Formal Verification of Shared Memory Systems During their Design Ganesh Gopalakrishnan Department of Computer Science University of Utah ganesh.

Formal Verification of Shared Memory Systems

During their Design

Ganesh Gopalakrishnan

Department of Computer Science

University of Utah

http://www.cs.utah.edu/~ganesh

06/21/99 Ganesh, Utah Verifier group -- SAS talk ('99 visit)

2

FM and shared-memory system design

• Processor speed increasing at 55% per year - memory speeds at 7%

• Mismatch exacerbated by shared memory multiprocessors

• Complex protocols employed to hide memory latencies

• Need for formal verification techniques that can be employed during design


3

Our Project: Utah Verifier


4

A Shared Memory Multiprocessor (a “shared memory system”)

[Figure: CPUs and Memory modules connected through an Interconnect]


5

Classification: Symmetric Multi-Processors (SMP)

[Figure: three CPUs with caches (CPU$) and Memory attached to a coherent snooping bus]

Potential bugs in complex bus designs:

• Deadlocks, lack of forward progress

• Lack of coherency

• Incorrect shared memory consistency model


6

2. Distributed Shared Memory (DSM) systems

[Figure: SMP nodes, each containing several CPUs, Memory, and a DC, connected by a high-speed network]

Problems due to complex DSM protocols:

• Deadlocks, lack of forward progress, …

• Incorrect shared memory consistency models


7

Formal Methods for Shared Memory System Design

Verification    Provably-correct Synthesis

Theorem-proving

Model-checking    Protocol

Low-level concerns (e.g. deadlocks, progress,...)

Higher-level concerns (e.g. shared memory consistency models)

Finite-state Reachability


8

Results of the UV group

• New Partial Order reduction algorithm

– Realized in verifier called PV

– Outperforms SPIN “10 to 1” on most examples

– Selective state-caching is available “for free”

• A DSM Protocol synthesis algorithm

– Safety of synthesis proved correct using PVS

– Derives realistic (hand-quality) DSM protocols

– Incorporates a scalable buffer-reservation scheme

• Verifying Formal Memory Models


9

Protocol Refinement


10

Motivations

• Distributed directory-based coherence protocols are difficult to understand and debug

• low-level requests / acks / nacks don’t reveal *what* is being implemented

• transient states are introduced and handled in an ad-hoc way

• buffer allocation is not tied to desired high-level properties (e.g. progress)

• verification is tedious


11

Example of problems due to “unexpected msgs”

[Figure: CacheCtrlr and DirectoryCtrlr exchange Req and Ack; then Another Req arrives: ? ? ?]

Usually don’t know what to say... saying nothing causes deadlock!


12

Our approach

• Based on synthesis

• Transient states introduced automatically

• Buffer allocation is tied to desired high-level properties (e.g. progress)

• Verification becomes much easier

• Synthesized protocols seem efficient


13

Overview of Synthesis Method

[Figure: CacheCtrlr (states I, E) and Dir Ctrlr (states F, E), shown twice, with Req and (N)ack messages between them]


14

Model-checking Efficiency

Protocol    N    states / time (low level)    states / time (high level)

Mig 2 23,164 / 2.8 54 / 0.14 235 / 0.48 965 / 0.5

Inv 2 193,389 / 19.23 546 / 0.64 18,686 / 18.4


15

An Illustration: Migratory Protocol (i)

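[Figure: state machines of the migratory protocol]

Process ‘h’ transitions: r(i)?req, r(i)!gr(data), r(j)?req, r(o)!inv, r(o)?LR(data), r(j)!gr(data), r(o)?ID(data)

Process ‘r(i)’ transitions: h!req, h?gr(data), h?inv, h!LR(data), h!ID(data), rw, evict

State labels: I, V, V1, V2, F, E, I1, I2, I3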


16

An Illustration: Migratory Protocol (ii)

[Figure: the migratory protocol state machines for processes ‘h’ and ‘r(i)’, with the same states and transitions as in part (i)]


17

A Generic Example

P Q R

Q!a R!b

P?x

Q!c

R?y


18

Async Implementation of Example (i)

P Q R

Q!a R!b

P?x

R?y

Q!c

1 msg buffer location for Ack/Nack

R!!b Q!!a


19

Async Implementation of Example (ii)

P Q R

Q!a R!b

P?x

R?y

Q!c

R!!b Q!!a Q!!c P!!ack

Progress Buffer


20

Organization of Protocol - per Cache Line

Remote Nodes

Home Node

- Remote nodes (cache ctrlrs) communicate w. home directory controller only

- If Remote and Home requests cross in medium,
  . Remote request treated as Nack by Home
  . Home request is dropped by Remote

- Pt-to-pt order-preserving error-free communication


21

General Nature of Communication States

(Remote)

h!msg    T    h?m1    h?m2

(Home)

T    r(i)?m1

r(j)!m2


22

Summary: Remote node rules

Statement | Buffer | Action
H!m | empty | Req; goto trans.
H!m | req | Del req; req; goto trans.
H?m | req | Ack / Nack
trans. | ack | success
trans. | nack | retry
trans. | req | Ignore req
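The remote-node rules can be read as a lookup from (statement, buffer contents) to an action. A minimal sketch of that reading follows; the names `REMOTE_RULES` and `remote_action` are hypothetical, and the action strings simply copy the slide's abbreviations rather than implement them.

```python
# Hypothetical encoding of the remote-node rule table from the slide.
# Keys are (statement, buffer contents); values are the slide's action text.
REMOTE_RULES = {
    ("H!m", "empty"): "Req; goto trans.",
    ("H!m", "req"): "Del req; req; goto trans.",
    ("H?m", "req"): "Ack / Nack",
    ("trans.", "ack"): "success",
    ("trans.", "nack"): "retry",
    ("trans.", "req"): "Ignore req",
}


def remote_action(statement, buffer):
    """Return the action for a (statement, buffer) pair, or None if no rule is listed."""
    return REMOTE_RULES.get((statement, buffer))


print(remote_action("trans.", "nack"))
```

A pair with no row in the table (for example `("H?m", "empty")`) returns `None`, which makes the missing cases explicit.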


23

Summary: Home node (i)

stmt | buffer has | action
r(i)?msg | msg from r(i) | ack the msg
r(i)!msg | | reserve progress, response buffers. Req and go trans.
trans | ack from r(i) | done
trans | nack | go back


24

Summary: Home node (ii)

condition | action
trans. req from r(i) | “implicit” nack
trans. req from r(j), buff has space | add to buffer
trans. req x from r(j), progress buff is free, and r(j)?x in comm. state | add to progress buffer
trans. req from r(j), progress buff is full or r(j)?x not in comm. state | nack the request
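The four transient-state rules for the home node can be sketched as one decision function. This is an illustration only: the function and parameter names are hypothetical, and checking ordinary buffer space before the progress buffer is an assumption, since the slide lists the rows without stating a priority.

```python
def home_transient_action(sender, awaited, buffer_has_space,
                          progress_free, in_comm_state):
    """Decide what the home node does with a request received in a transient state.

    sender: the remote whose request arrived (r(i) or r(j) on the slide)
    awaited: the remote r(i) whose ack/nack the home is waiting for
    in_comm_state: whether the communication state has a guard r(j)?x
    The order of the checks is assumed, not taken from the slide.
    """
    if sender == awaited:
        return "implicit nack"
    if buffer_has_space:
        return "add to buffer"
    if progress_free and in_comm_state:
        return "add to progress buffer"
    return "nack the request"


print(home_transient_action("r2", "r1", False, True, True))
```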


25

Status of Work

• Correctness of Protocol Synthesis Proved in PVS

• Write-invalidate protocol also synthesized

• Offers a general synthesis method for protocols (not necessarily for DSM)

– Related work: Buckley and Silberschatz, Chandra et al., Park and Dill, Gribomont, ...


26

Verifying Conformance to Formal Memory Models


27

FM and shared-memory system design

• Shared-memory systems are complex!

• Designers need a “safety net” when exploring optimizations: formal verification

• We focus on verifying that a (finite-state model of a) shared memory system provides the required memory model (mainly Sequential Consistency)

– E.g. Verify a Cache Coherence Protocol for SC

• Our approach: finite-state reachability analysis


28

Importance of Memory Models -- An Example

Peterson’s algorithm for mutex under a memory model called “TSO”:

P1:

A = 1 ;
turn = 2 ;
while (B /\ turn==2 ) ;

..CS..

P2:

B = 1 ;
turn = 1 ;
while (A /\ turn==1 ) ;

..CS..

Init A=B=0

P1: w(A,1); r(B,0);

P2: w(B,1); r(A,0);

Must Specify Synchronization Routines and the Shared Memory Consistency Model(s) under which they work!
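A minimal sketch of why Peterson’s algorithm breaks under TSO, assuming the usual operational picture of one FIFO store buffer per CPU. The class `TsoMachine` and its method names are hypothetical, written only for this illustration.

```python
from collections import deque


class TsoMachine:
    """Toy TSO model: writes wait in a per-CPU FIFO before reaching memory."""

    def __init__(self, init):
        self.memory = dict(init)
        self.buffers = {}  # cpu name -> deque of (addr, value)

    def write(self, cpu, addr, value):
        self.buffers.setdefault(cpu, deque()).append((addr, value))

    def read(self, cpu, addr):
        # A CPU sees its own newest buffered write first, otherwise memory.
        for a, v in reversed(self.buffers.get(cpu, ())):
            if a == addr:
                return v
        return self.memory[addr]

    def flush_one(self, cpu):
        addr, value = self.buffers[cpu].popleft()
        self.memory[addr] = value


def peterson_prefix_under_tso():
    m = TsoMachine({"A": 0, "B": 0, "turn": 0})
    m.write("P1", "A", 1)
    m.write("P1", "turn", 2)
    m.write("P2", "B", 1)
    m.write("P2", "turn", 1)
    b_seen_by_p1 = m.read("P1", "B")  # P2's write to B is still buffered
    a_seen_by_p2 = m.read("P2", "A")  # P1's write to A is still buffered
    return a_seen_by_p2, b_seen_by_p1


print(peterson_prefix_under_tso())
```

With both flags read as 0, both `while` conditions are false and both processes enter the critical section.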


29

Impact on CPU design -- Do Read-Speculation Right!

[Figure: CPU1 and CPU2 attached to a bus with MEM]

CPU1:
wr(a,2) - Miss
rd(b, 0) - Speculate
Snoop wr(a) - Spec OK

CPU2:
wr(b,3) - Miss
rd(a, 0) - Speculate
Snoop wr(a) - Spec not OK: reissue rd(a, 2)

Bus order: ..wr(a,2);.. wr(b,3)..

Without reissue, results are inconsistent with SC


30

Basis for our work: ARCHTEST (Collier)

• Multi-threaded C programs

• Used to debug actual multiprocessor machines

– unavailable at design-time

• Based on the theory of graph-sets

– used in our work also

• Our CAV’98 work: adapt Collier’s tests for model-checking

– incomplete

• This work: a complete verification method (sound too!)


31

What is a shared memory model? Captured by the set of all executions of a concurrent program!

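[Figure: two copies of a two-CPU shared-memory system (Memory, CPU, CPU), labeled SC and TSO, showing Execution #1 and Execution #2]

Init A=B=0

w(A,1); r(B,0);   ||   w(B,1); r(A,0);

w(A,1); r(B,0);   ||   w(B,1); r(A,1);

The execution in which both reads return 0 is allowed under TSO but not under SC.

TSO allows more executions than SC (hence “weaker”)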
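For a program this small, the set of SC executions can be enumerated by brute force. The sketch below tries every interleaving that respects program order and collects the values the two reads can return. The helper names are hypothetical and this is not the verification method of the talk.

```python
def interleavings(threads):
    """Yield every merge of the per-thread op lists that keeps each thread's order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def sc_outcomes(threads, init):
    """Return the set of read-value tuples (ordered by thread, then position) under SC."""
    tagged = [[(tid, idx, op) for idx, op in enumerate(t)]
              for tid, t in enumerate(threads)]
    outcomes = set()
    for run in interleavings(tagged):
        mem, reads = dict(init), {}
        for tid, idx, (kind, addr, val) in run:
            if kind == "w":
                mem[addr] = val
            else:
                reads[(tid, idx)] = mem[addr]
        outcomes.add(tuple(reads[k] for k in sorted(reads)))
    return outcomes


# CPU 1: w(A,1); r(B,?)    CPU 2: w(B,1); r(A,?)
litmus = [[("w", "A", 1), ("r", "B", None)],
          [("w", "B", 1), ("r", "A", None)]]
outcomes = sc_outcomes(litmus, {"A": 0, "B": 0})
print(sorted(outcomes))
```

Each tuple is (value of r(B), value of r(A)). The pair (0, 0) never appears, while (0, 1) does.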


32

An Operational Definition of SC and TSO

[Figure: SC: cpu1 and cpu2 reach a single Memory through a MUX. TSO: cpu1 and cpu2 each have a fifo in front of Memory]


33

How are allowed executions specified?

As constraints on events generated by the execution!

Constraints are expressed in terms of ordering rules:

RO - Read Ordering
ROA - RO over the same address
WOS - Write Ordering by Storage
POS - Program Ordering by Storage
CMP - Computational Ordering
WA - Write Atomicity

Ordering rules specify constraints on EVENTS

Memory Model = “Collier Cocktail!” - e.g. (CMP, RO, WOS)

34

CPU_i

STORE_i

CPU_j

STORE_j

R1(a,0) ;W2(b,1) ;R5(d,2) ;

R3(c,0) ;W4(d,2) ;W6(d,3) ;

R3(c,T) W4(d,2) W6(d,3)

W2(b,1)

RO(i)

part of POS(j)

R1(a,T) W2(b,1) R5(d,2)

W4(d,2) W6(d,3) WOS(i)

WOS(j)

Definition of POS (and also RO and WOS)

PO includes RR, RW, WR, and WW orders

View these events first as an unordered set which is subsequently ordered by the arcs


35

CPU_i

STORE_i

CPU_j

STORE_j

W2(b,1) R5(d,2)

W4(d,2) W6(d,3)

W4(d,2) W6(d,3)

W2(b,1)

One CMP order: cmp1(i,d)

Another: cmp2(i,d)

cmp1(j,d)

cmp2(j,d)

Definition of CMP (defined per CPU per address)


36

Assumptions in defining CMP... and in the rest of this talk

• We are interested in more than SC

– We would like to set up a general framework for defining and verifying memory models

– Assume that RO is obeyed by every memory model of interest to us

• We Assume

– Projectability,

– Data Independence

– Unambiguous executions


37

Assume Projectability, Data Independence, and consider only Unambiguous executions

Projectible: Executions projected onto subsets of addresses result in executions

Data independent: Replacing all data values d in an execution with f(d) for some function f results in an execution

Unambiguous: Same datum never written twice (so we can uniquely trace source of data!)

[Figure: CPU_i / STORE_i and CPU_j / STORE_j, with the programs R1(a,T) ; W2(b,1) ; R5(d,2) ; and R3(c,T) ; W4(d,2) ; W6(d,3) ;]


38

CPU_i

STORE_i

CPU_j

STORE_j

R1(d,T)

R2(d,2)

W4(d,2)

W4(d,2)

W2(d,4)

Definition of CMP for CPU i for address d

W4(d,2) R2(d,2)

W2(d,4)

R1(d,T)

R3(d,2) R3(d,2)

W3(d,5)

W3(d,5)

ROA

W2(d,4)R4(d,5)

W3(d,5) R4(d,5)

ROA

CMP includes ROA ; also is an implied edge


39

Initially a = 0

R1(a,1) ; W2(a,1) ;

..no writes to a..

CPU_i

STORE_i

CPU_j

STORE_j

Even this execution is possible under (CMP,RO,WOS)

Let’s study (CMP, RO, WOS) - a useful drosophila!


40

An execution satisfying (CMP, RO, WOS)

R1(a,T) ;W2(b,1) ;R5(d,2) ;

R3(c,T) ;W4(d,2) ;W6(d,3) ;

CPU_i

STORE_i

CPU_j

STORE_j

R3(c,T) W2(b,1)

W4(d,2) W6(d,3)

WOS(j)

CMP(j,d)

R1(a,T) W2(b,1) R5(d,2)

W4(d,2) W6(d,3) WOS(i)

CMP(i,d)

RO

Execution satisfies (CMP, RO, WOS) as there are no cycles created by adding their arcs!


41

An execution that violates (CMP,RO,WOS)

wr(A,2) ; wr(A,3) ;

CPU_i

STORE_i

CPU_j

STORE_j

rd(A,3) ; rd(A,2) ;

wr(A,2) rd(A,2)

wr(A,3) rd(A,3)

ROA WOS
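Checking an execution against a set of ordering rules amounts to adding the arcs and looking for a cycle. The sketch below does this with a depth-first search. The edge list for the violating execution is an assumed reading of the slide: the per-store copies of each event are collapsed into one node, and the last CMP arc (the read of 2 must precede the overwrite by 3) is taken to be the implied edge.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (src, dst) pairs, using DFS colouring."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            # A GREY successor is on the current DFS path, so we found a cycle.
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


violating = [
    ("wr(A,2)", "wr(A,3)"),  # WOS: writes in program order
    ("rd(A,3)", "rd(A,2)"),  # ROA: reads of A in program order
    ("wr(A,3)", "rd(A,3)"),  # CMP: a write precedes the read that returns it
    ("wr(A,2)", "rd(A,2)"),  # CMP
    ("rd(A,2)", "wr(A,3)"),  # CMP (assumed implied edge): 2 is read before 3 overwrites it
]
print(has_cycle(violating))
```

The cycle is wr(A,3), rd(A,3), rd(A,2), wr(A,3). Dropping the assumed implied edge leaves the graph acyclic.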


42

Verification Techniques for Memory Models

• Consider all possible executions

– involving all possible addresses A

– and all possible data D

– for all possible concurrent programs P

• Introduce the arcs due to the ordering rules

• Look for cycles

Impractical!

• So, look for ways to limit A, D, and P


43

Our approach

• Assume address projectability (or “projectability”) and data independence

• Prove limited address theorems (helps limit A)

• Characterize all violating executions { E_i } over A

• Come up with finite-state abstractions for each E_i

– using data independence to limit D, and

– using non-determinism

to arrive at a finite number of test automata aut_i

• Explore state-space of each aut_i || memory-system

• Look for entry into error-states


44

Use of data abstraction & non-determinism

Suppose E_i are:

P1: A := 1; A := 2; A := 3; .... A := k

P2: X1 := A; X2 := A; X3 := A; .... Xk := A

Look for some i, j s.t. j < i /\ X(j) > X(i)

Then a_i are:

[Figure: test automata for P1 (transitions wr(0), wr(1), wr(1)) and P2 (transitions rd(0), rd(0), rd(1), rd(1)), with an Error state]

- Achieves the effect of k = infinity

- Considers all interleavings
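One way to picture the abstraction: data independence lets every value collapse to 0 (“old”) or 1 (“new”), and the reader automaton enters its error state if it reads a 0 after a 1, that is, if a later read returns an older value than an earlier read. The sketch below is an illustration under that assumption, not the automata on the slide. The function names are hypothetical, and the non-deterministic choice of where “old” ends is simulated by trying every threshold.

```python
def read_monitor(abstract_reads):
    """Two-state reader automaton over abstract values 0/1; 'error' on a 0 after a 1."""
    state = "seen0"
    for value in abstract_reads:
        if state == "seen0" and value == 1:
            state = "seen1"
        elif state == "seen1" and value == 0:
            return "error"
    return state


def abstract(values, threshold):
    """Data abstraction: map v to 0 if v <= threshold, else to 1."""
    return [0 if v <= threshold else 1 for v in values]


def violates_read_order(concrete_reads):
    """Try every threshold, standing in for the automaton's non-deterministic choice."""
    return any(read_monitor(abstract(concrete_reads, t)) == "error"
               for t in set(concrete_reads))


print(violates_read_order([1, 3, 2]))
```

The same two-state monitor works for any k because it never stores concrete values.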


45

Limited Address Theorem for (CMP,RO,WOS)

Two addresses suffice!


46

PowerPoint proof of the limited address theorem for (CMP,RO,WOS)

[Figure: three animation steps over events R1, R2 (from P1), W1, W2 (from P2), and W3, W4 (from P3); RO and WOS arcs are drawn first, then CMP arcs are added]

Involves two addrs!


47

Exhaustive characterization of violations of (CMP, RO, WOS) over one address, “a”

(1)

P_i: ... rd(a, v) ...

P_j: ... ... ...

v is not the initial value T of a, and a is not written anywhere

(2)

P_i: ... rd(a,v1) ... rd(a,v2) ...

P_j: ... wr(a,v2) ... wr(a,v1) ...

P_i and P_j could be the same process

(3)

P_i: ... rd(a,v) ... rd(a,T) ...

P_j: ... wr(a,v) ...


48

Test automata for 1-address (CMP,RO,WOS) violations

Error states: E1, E2


49

Exhaustive characterization of two-address violations of (CMP, RO, WOS)

(1)

All one-address violations involving only address A or only address B

(2)

P_i: ... rd(B,v2) ... rd(A,v1) ...

P_j: ... wr(A,v1) ... wr(B,v2) ...

[Figure: events R1(P1), R3(P1), W3(P3), W4(P3) connected by RO, WOS, and CMP arcs]


50

Test automata for 2-address (CMP,RO,WOS) violations



51

Limited Address Theorem for (CMP,POS)

2 addresses suffice


52

1-address (CMP,POS) verification



53

2-address (CMP,POS) verification



54

CPU_1

STORE_1

CPU_2

STORE_2

w(A,1);r(B,0);

w(B,1);r(A,1);

w(A,1) r(B,0) w(B,1)

w(A,1)r(A,1)

w(B,1)

• Write Atomicity

• POS

• CMP

Memory

CPU2 CPU1

w(A,1);r(B,0);

w(B,1);r(A,1);

SC

SC = (CMP, POS, WA)


55

Memory

CPU CPU

w(A,1);r(A,1);w(A,2);r(A,2);

r(A,2);r(A,1);

Init A=0

Definition of WA - by showing what is not WA!


56

The limited-address theorem for SC = (CMP, POS, WA)

• In an N-processor system, N addresses are

– sufficient:

• IF concurrent program P using M > N addresses shows a violation

• THEN there exists a subset A of N addresses

• such that P projected onto A yields concurrent program P’ that also shows a violation.

PowerPoint proof to follow

– and necessary:

Wr(A,1) Rd(B,0)

Wr(B,1) Rd(C,0)

Wr(C,2) Rd(A,0)


57

PowerPoint proof of the limited address theorem for SC = (CMP, POS, WA)

- Suppose C is the cycle containing the smallest number of events that involves more than N <pos edges.

- Then two <pos edges connect events generated by the same processor, say `g’, and observed by `a’ and `b’.

- If a=b, we can eliminate one of these POS edges

- if a <> b, consider g <> a, and possibly equal to b.

- a0 and a1 are writes. Find corresp events in `b’.

[Figure: events a0, a1 and b2, b3 connected by Pos(g) edges, shown twice; the second copy adds event b0 and a wa edge, giving one linearization]


58

All N-address (CMP, POS, WA) violations:

(1) (CMP, POS) violations

(2) Two processors “see” two writes w1 and w2 in different orders


59

Complete test for SC for 1-address programs

Error states:

- < P14, Q41 >

- { P41a, P41b } x { Q14a, Q14b }


60

Complete test for SC for 2-address programs

Error states:

- < P14, Q41 >

- { P41a, P41b } x { Q14a, Q14b }


61

Case Studies

Runway/PA system model

– Bus based design

– An aggressive split transaction protocol

– Out-of-order (speculative) completion of transactions on Runway for high-performance

• not modeled in current experiments

– In-order completion of instructions in PA for sequential consistency


62

SC verification of the HP/Runway model

Test | Spin | PV
PO-1 | 56K | 2794
PO-2 | > 5M/DNF | 11M
SC-1 | 499K | 7880
SC-2a | > 5M/DNF | 5.9M
SC-2b | > 4M/DNF | 574K


63

Conclusions

• Promising

– Violations caught very quickly

– Need to try larger examples

• Currently studying weaker memory models

• Future work:

– Combatting state-explosion

• Symmetries

• Better automata

– Integrate into design cycle of CPUs

• Support performance optimizations and verification regressions


64

Related Work

• Graf (CAV’94)

– for more than SC (hence unsound for SC)

– properties depend on design

• Alur, McMillan, Peled (LICS’96)

– undecidable if data can be compared

• Nalumasu, Ghughal, Mokkedem, Gopalakrishnan (CAV’98)

– incomplete

• Henzinger, Qadeer, Rajamani (CAV’99)

– needs invariants

– invariants depend on design

– assumes address-symmetry

• Collier (‘80s)

– not available at design-time
