
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures

Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader


Challenges of Design Verification

• Contemporary hardware designs require millions of lines of RTL code
  – More lines of code written for verification than for the implementation itself
• Tradeoff between performance and design complexity
  – Speculative execution, shared caches, instruction reordering
  – Performance wins out

GTC 2016, San Jose, CA 2


Performance vs. Design Complexity

• Programmer burden
  – Requires correct usage of synchronization
• Time to market
  – Earlier remediation of bugs is less costly
  – Re-spins on tapeout are expensive
• Significant time spent on verification
  – Verification problems are often NP-complete

Memory Consistency Models

• Contract between SW and HW regarding the semantics of memory operations
• Classic example: Sequential Consistency (SC)
  – All processors observe the same ordering of operations serviced by memory
  – Too strict for modern optimizations/architectures
• Nomenclature
  – ST[A] → 1: “Wrote a value of 1 to location A”
  – LD[B] ← 2: “Read a value of 2 from location B”

ARM Idiosyncrasies

• Our focus: ARMv8
• Speculative execution is allowed
• Threads can reorder reads and writes
  – Assuming no dependency exists
• Writes are not guaranteed to be simultaneously visible to other cores

Problem Setup

• Given an inst. trace from a simulator, RTL, or silicon
1. Construct an initial graph
  – Vertices represent load, store, and barrier insts
  – Edges represent memory ordering
    • Based on architectural rules
2. Iteratively infer additional edges to the graph
  – Based on existing relationships
3. Check for cycles
  – If one exists: contradiction!

[Figure: example instruction graph for CPU 0 and CPU 1 with vertices ST[B] → 90, ST[B] → 92, LD[A] ← 2, LD[B] ← 92 (twice), and LD[B] ← 93]
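Step 3 of the problem setup, the cycle check, can be sketched as follows. The slides do not say which cycle-detection method is used; this sketch swaps in Kahn's topological-sort algorithm, and the function name, the integer vertex ids, and the edge-list format are assumptions made for illustration.

```python
from collections import deque


def has_cycle(num_vertices, edges):
    """Return True if the directed graph contains a cycle.

    Assumed (hypothetical) format: vertices are 0..num_vertices-1 and
    edges is an iterable of (src, dst) pairs meaning "src is ordered
    before dst".
    """
    successors = [[] for _ in range(num_vertices)]
    in_degree = [0] * num_vertices
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Kahn's algorithm: repeatedly remove vertices with no incoming edges.
    ready = deque(v for v in range(num_vertices) if in_degree[v] == 0)
    removed = 0
    while ready:
        v = ready.popleft()
        removed += 1
        for w in successors[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                ready.append(w)

    # Any vertex that could not be removed lies on or behind a cycle.
    return removed != num_vertices
```

A cycle means some operation would have to be ordered before itself, which is the contradiction the slide refers to.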

TSOtool

• Hangal et al., ISCA ’04
  – Designed for SPARC, but portable to ARM
• Each store writes a unique value to memory
  – Easily map a load to the store that wrote its data
• Tradeoff between accuracy and runtime
  – Polynomial time, but false positives are possible (a passing check does not prove consistency)
  – If a cycle is found, a bug indeed exists
  – If no cycles are found, execution appears consistent
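Because each store writes a unique value, the load-to-store mapping reduces to a dictionary lookup. The following is a minimal sketch, not TSOtool's code: the `(kind, location, value)` trace format and the function name are hypothetical, and uniqueness is assumed to hold per location.

```python
def map_loads_to_stores(trace):
    """Map each load's index to the index of the store that wrote its data.

    Assumed (hypothetical) trace format: a list of (kind, location, value)
    tuples with kind "ST" or "LD". Loads whose value matches no store
    (for example, reads of initial memory contents) are left unmapped.
    """
    writer = {}
    for index, (kind, location, value) in enumerate(trace):
        if kind == "ST":
            if (location, value) in writer:
                raise ValueError("store values must be unique per location")
            writer[(location, value)] = index

    reads_from = {}
    for index, (kind, location, value) in enumerate(trace):
        if kind == "LD" and (location, value) in writer:
            reads_from[index] = writer[(location, value)]
    return reads_from
```

For the trace `[("ST", "A", 1), ("ST", "A", 2), ("LD", "A", 2)]`, the load at index 2 maps to the store at index 1.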

Need for Scalability

• Must run many tests to maximize coverage
  – Stress different portions of the memory subsystem
• Longer tests put supporting logic in more interesting states
  – Many instructions are required to build history in an LRU cache, for instance
• Using a CPU cluster does not suffice
  – The results of one set of tests dictate the structure of the ensuing tests
  – Faster tests help with interactivity!
• Solution: Efficient algorithms and parallelism


Inferred Edge Insertions (Rule 6)

• S can reach X
• X does not load data from S
• S comes before the node that stored X’s data

[Figure: vertices W: ST[A] → 2, X: LD[A] ← 2, and S: ST[A] → 1; the inferred edge runs from S to W]


Inferred Edge Insertions (Rule 7)

• S can reach X
• Loads read data from S, not X
• Loads came before X

[Figure: vertices S: ST[A] → 1, X: ST[A] → 2, L: LD[A] ← 1, and M: LD[A] ← 1; the inferred edges run from L and M to X]

Initial Algorithm for Inferring Edges

for_each(store vertex S)
{
    for_each(vertex X reachable from S) // Getting this set is expensive!
    {
        if(location[S] == location[X])
        {
            if((type[X] == LD) && (data[S] != data[X]))
            {
                // Rule 6: add an edge from S to W, the store that X read from
            }
            else if(type[X] == ST)
            {
                for_each(load vertex L that reads data from S)
                {
                    // Rule 7: add an edge from L to X
                }
            } // End if instruction type is store
        } // End if location
    } // End for each reachable vertex
} // End for each store
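One pass of the initial algorithm can be written as runnable code. This is a sketch under stated assumptions rather than the authors' implementation: vertices are assumed to be a dictionary of `name: (kind, location, value)`, reachability is computed by a plain depth-first search, and the helper names are hypothetical.

```python
def reachable_from(start, successors):
    """Return the set of vertices reachable from start by a depth-first search."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        for w in successors.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def infer_edges_once(vertices, edges):
    """One pass of Rule 6 and Rule 7; returns only the edges not already present.

    Assumed (hypothetical) format: vertices is {name: (kind, location, value)}
    with kind "ST" or "LD"; edges is a collection of (src, dst) pairs.
    """
    edges = set(edges)
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    # Unique store values let us find the writer of each load's data.
    writer = {(loc, val): v for v, (kind, loc, val) in vertices.items() if kind == "ST"}
    readers = {}
    for v, (kind, loc, val) in vertices.items():
        if kind == "LD" and (loc, val) in writer:
            readers.setdefault(writer[(loc, val)], []).append(v)

    new_edges = set()
    for s, (kind_s, loc_s, val_s) in vertices.items():
        if kind_s != "ST":
            continue
        for x in reachable_from(s, successors):
            kind_x, loc_x, val_x = vertices[x]
            if x == s or loc_x != loc_s:
                continue
            if kind_x == "LD" and val_x != val_s:
                w = writer.get((loc_x, val_x))
                if w is not None:
                    new_edges.add((s, w))        # Rule 6: S before X's writer W
            elif kind_x == "ST":
                for load in readers.get(s, ()):
                    new_edges.add((load, x))     # Rule 7: readers of S before X
    return new_edges - edges
```

On the Rule 6 example (S reaches X, and X reads from W) this returns the single edge from S to W. On the Rule 7 example it returns the edges from L and M to X.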

Virtual Processors (vprocs)

• Split instructions from physical to virtual processors
• Each vproc is sequentially consistent
  – Program order ↔ Memory order

[Figure: CPU 0 executes ST[B] → 91, ST[A] → 1, LD[A] ← 2, ST[B] → 92; the instructions are split across VPROC 0, VPROC 1, and VPROC 2, with ST[A] → 1 and LD[A] ← 2 grouped in one vproc and ST[B] → 91 and ST[B] → 92 grouped in another]

Reverse Time Vector Clocks (RTVC)

• Consider the RTVC of ST[B] → 90
  – Purple: ST[B] → 92
  – Blue: NULL
  – Green: LD[B] ← 92
  – Orange: LD[B] ← 92
• Track the earliest successor from each vertex to each vproc
  – Captures transitivity
• Complexity of inferring edges: $O(n^2 p^2 d_{\max})$

[Figure: the instruction graph for CPU 0 and CPU 1 with vertices ST[B] → 90, ST[B] → 92, LD[A] ← 2, LD[B] ← 92 (twice), and LD[B] ← 93, color-coded purple, blue, green, and orange]

Updating RTVCs

• Computing RTVCs once is fast
  – Process vertices in the reverse order of a topological sort
  – Check neighbors directly, then their RTVCs
• Every time a new edge is inserted, RTVC values need to change
  – # of edge insertions ≈ $m$
• TSOtool implements both vprocs and RTVCs
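The one-shot RTVC computation described above (reverse topological order, neighbors first and then their clocks) might look like the sketch below. The data layout is an assumption: each vproc is given as a list of vertex names in program order, consecutive entries are treated as implicitly ordered, and a clock entry holds the position of the earliest reachable vertex in that vproc, or `None`. The function name is hypothetical.

```python
from collections import deque


def compute_rtvcs(vprocs, edges):
    """Return {vertex: clock}, where clock[p] is the position of the earliest
    vertex in vproc p reachable from that vertex, or None if there is none.

    Assumed (hypothetical) format: vprocs is a list of lists of vertex names
    in program order; edges is an iterable of extra (src, dst) ordering edges.
    """
    vproc_of, pos_of, successors = {}, {}, {}
    for p, chain in enumerate(vprocs):
        for k, v in enumerate(chain):
            vproc_of[v], pos_of[v] = p, k
            # Implicit ordering inside a vproc: each vertex precedes the next.
            successors[v] = [chain[k + 1]] if k + 1 < len(chain) else []
    for src, dst in edges:
        successors[src].append(dst)

    # Topological order via Kahn's algorithm.
    in_degree = {v: 0 for v in successors}
    for v in successors:
        for w in successors[v]:
            in_degree[w] += 1
    ready = deque(v for v, d in in_degree.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in successors[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                ready.append(w)
    if len(order) != len(successors):
        raise ValueError("graph has a cycle; RTVCs are undefined")

    rtvc = {}
    for v in reversed(order):            # successors are finished before v
        clock = [None] * len(vprocs)
        for w in successors[v]:
            candidate = list(rtvc[w])    # everything w reaches ...
            candidate[vproc_of[w]] = pos_of[w]   # ... plus w itself
            for q, c in enumerate(candidate):
                if c is not None and (clock[q] is None or c < clock[q]):
                    clock[q] = c
        rtvc[v] = clock
    return rtvc
```

Each clock has one entry per vproc, so following reachability from a vertex costs a number of lookups proportional to the number of vprocs instead of a traversal of every reachable vertex.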

Facilitating Parallelism

• Repeatedly updating RTVCs is expensive
  – For $k$ edge insertions, RTVC updates take $O(kpn)$ time
    • $k = O(n^2)$, but usually is a small multiple of $n$
• Idea: Update RTVCs once per iteration rather than per edge insertion
  – For $i$ iterations, RTVC updates take $O(ipn)$ time
    • $i \ll k$ (less than 10 for all test cases)
  – Less communication between threads
• Complexity of inferring edges: $O(n^2 p)$
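The control flow of the once-per-iteration update can be sketched as a fixed-point loop. Only the loop structure is taken from the slide. The two callables `compute_clocks` and `infer_with_clocks` are hypothetical placeholders for the RTVC pass and the edge-inference pass, and `max_iterations` is an added safety bound.

```python
def infer_to_fixed_point(edges, compute_clocks, infer_with_clocks, max_iterations=1000):
    """Alternate one clock computation with one batch of edge inference.

    compute_clocks(edges) and infer_with_clocks(edges, clocks) are
    hypothetical callables supplied by the caller. Clocks are not refreshed
    inside an iteration, so they may be stale for edges found in that batch.
    """
    edges = set(edges)
    iterations = 0
    while iterations < max_iterations:
        clocks = compute_clocks(edges)                   # once per iteration
        batch = set(infer_with_clocks(edges, clocks)) - edges
        iterations += 1
        if not batch:                                    # fixed point reached
            break
        edges |= batch                                   # insert the whole batch
    return edges, iterations
```

The returned iteration count corresponds to $i$ in the bound above. Because a batch is computed from a single snapshot of the clocks, the work inside one iteration has no dependencies between stores.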

Correctness

• Inferred edges found by our approach will not necessarily be the same as the edges found by TSOtool
  – Might not infer an edge that TSOtool does
    • RTVC for TSOtool can change mid-iteration
  – Might infer an edge that TSOtool does not
    • Our approach will have “stale” RTVC values
• Both approaches make forward progress
  – Number of edges monotonically increases
• Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1]
• If TSOtool finds a cycle, we will also find a cycle [Thm 2]

Parallel Implementations

• OpenMP
  – Each thread keeps its own partition of added edges
  – After each iteration of inferring edges, reduce
• CUDA
  – Assign threads to each store instruction
  – Threads independently traverse the vprocs of this store
  – Atomically add edges to a preallocated array in global memory
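The per-thread partition followed by a reduction can be illustrated in a few lines. This is neither OpenMP nor the authors' code: it uses Python's `concurrent.futures` only to show the data flow, and `infer_for_store` is a hypothetical callable that returns the edges inferred for one store.

```python
from concurrent.futures import ThreadPoolExecutor


def infer_parallel(stores, infer_for_store, num_workers=4):
    """Infer edges for all stores using private per-worker edge sets.

    infer_for_store(store) is a hypothetical callable returning an iterable
    of (src, dst) edges. Workers never share a container, so no locking is
    needed until the final reduction.
    """
    chunks = [stores[w::num_workers] for w in range(num_workers)]

    def work(chunk):
        local_edges = set()              # this worker's private partition
        for store in chunk:
            local_edges.update(infer_for_store(store))
        return local_edges

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partitions = list(pool.map(work, chunks))

    merged = set()                       # reduction after the iteration
    for partition in partitions:
        merged |= partition
    return merged
```

Python threads will not reproduce the reported speedups; the sketch only shows why private partitions remove contention during inference. The CUDA variant described above instead appends to one preallocated array with atomic operations.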

Experimental Setup

• Intel Core i7-2600K CPU
  – Quad core, 3.4 GHz, 8 MB LLC, 16 GB DRAM
• NVIDIA GeForce GTX Titan
  – 14 SMs, 837 MHz base clock, 6 GB DRAM
• ARM system under test
  – Cortex-A57, quad core
• Instruction graphs range from $n = 2^{18}$ to $n = 2^{22}$ vertices, $n \approx m$
  – Sparse, high-diameter, low-degree
  – Tests vary by their distribution of LD/ST/DMB instructions, # of vprocs, and inst dependencies

Importance of Scaling

• 512K instructions per core
• 2M total instructions

Speedup over TSOtool (Application)

Graph Size       # of tests   Lazy RTVC   OMP 2    OMP 4     GPU
64K*4 = 256K     27           5.64x       7.62x    9.43x     10.79x
128K*4 = 512K    27           5.31x       7.12x    8.90x     10.76x
256K*4 = 1M      23           6.30x       9.05x    12.13x    15.47x
512K*4 = 2M      10           3.68x       6.41x    10.81x    24.55x
1M*4 = 4M        2            3.05x       5.58x    9.97x     37.64x

• GPU is always best; scales much better to larger tests
• Extreme case: 9 hours using TSOtool → under 10 minutes using our GPU approach
• Avg. parallel speedups over our improved sequential approach:
  – 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)

Summary

• Relaxing the updates to RTVCs led to a better sequential approach and facilitated parallel implementations
  – Tradeoff between redundant work and parallelism
• Faster execution leads to interactive bug-finding
• The GPU scales well to larger problem instances
  – Helpful for corner case bugs that slip through pre-silicon verification
• For the twelve largest test cases, our GPU implementation achieves a 26.36x average application speedup

Acknowledgments

• Shankar Govindaraju and Tom Hart, for their help in understanding NVIDIA’s implementation of TSOtool for ARM

Questions

“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.” – Albert Einstein

Backup


Sequential Consistency Examples

• Valid
    P1: ST[x] → 1
    P2: LD[x] ← 1, LD[x] ← 2
    P3: LD[x] ← 1, LD[x] ← 2
    P4: ST[x] → 2
  – ST[x] → 1 handled before ST[x] → 2
• Invalid
    P1: ST[x] → 1
    P2: LD[x] ← 1, LD[x] ← 2
    P3: LD[x] ← 2, LD[x] ← 1
    P4: ST[x] → 2
  – Writes propagate to P2 and P3 in a different order
    • Valid for weaker memory models

[Figure: both examples are drawn against a time axis running from t=0 to t=2]
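The difference between the two examples can be checked mechanically: under SC, no two processors may observe the two writes to x in opposite orders. The sketch below tests only that pairwise condition, which is enough to separate these two examples but is not a full SC checker. The function name and the list-of-sequences input are assumptions.

```python
def observers_agree(observations):
    """Return True if no two observers see a pair of values in opposite orders.

    Assumed (hypothetical) format: observations is a list with one sequence
    per processor, holding the values it read from a single location in
    program order. This is a necessary condition for SC, not a sufficient one.
    """
    seen_before = set()
    for sequence in observations:
        for i, earlier in enumerate(sequence):
            for later in sequence[i + 1:]:
                if earlier != later:
                    seen_before.add((earlier, later))
    # A contradiction is a pair observed as (a, b) by one reader and (b, a) by another.
    return not any((b, a) in seen_before for (a, b) in seen_before)
```

With P2 and P3 both reading 1 then 2, the check passes. With P3 reading 2 then 1, it fails.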

Weaker Models

• SC is intuitive, but is too strict
  – Prevents common compiler/arch. optimizations
• Commercial products use weaker models
  – x86: Total Store Order (TSO)
  – Power/ARM: Relaxed Memory Ordering (RMO)
• Weaker models allow for greater optimization opportunities
  – Cost: More complicated semantics

Initial Algorithm: Weaknesses

• Expensive to compute
  – $O(n^3)$, assuming edges can be inserted in $O(1)$ time
  – Repeated iteratively until a fixed point is reached
• Requires the transitive closure of the graph
  – Expensive to store
  – Capturing $n^2$ relationships (does vertex $i$ reach vertex $j$?)
• Adds lots of redundant edges
  – Should leverage transitivity when possible

[Figure: three vertices A, B, and C]
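A quick calculation shows why storing the transitive closure is expensive at the graph sizes used in the experiments (up to roughly 4M vertices). The sketch assumes a dense reachability matrix with one bit per ordered pair of vertices; that representation and the function name are illustrative assumptions, not something the slides specify.

```python
def closure_storage_bytes(num_vertices, bits_per_pair=1):
    """Bytes needed for a dense reachability matrix over num_vertices vertices.

    Assumption: one entry per ordered pair (i, j), bits_per_pair bits each.
    """
    total_bits = num_vertices * num_vertices * bits_per_pair
    return (total_bits + 7) // 8         # round up to whole bytes


largest = 2 ** 22                        # about 4M vertices
print(closure_storage_bytes(largest) / 2 ** 40, "TiB")
```

Under this assumption a graph with $2^{22}$ vertices needs $2^{44}$ bits, or 2 TiB, and even one with $2^{18}$ vertices needs 8 GiB, more than the 6 GB of DRAM on the GPU used in the experiments.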


Reverse Time Vector Clocks (RTVCs)

• vprocs provide implicit orderings
• Reverse Time Vector Clock
  – Track the earliest successor from each vertex to each vproc
• Bounds the number of reachable edges to be inspected by $p$, the number of vprocs
  – No need to compute or store the transitive closure!

[Figure: vertices ST[B] → 91, ST[B] → 92, and ST[A] → 1]

Reverse Time Vector Clocks (RTVCs)

• Track the earliest successor from each vertex to each vproc
  – Captures transitivity
• Traverse vprocs rather than the graph itself
  – No need to check every reachable vertex
• Bounds the number of reachable edges to be inspected by $p$, the number of vprocs
  – No need to compute or store the transitive closure!

Superfluous work?

• Our approach tends to add more edges than TSOtool, some of which are redundant
  – Worst case: 36% additional edges
  – The redundancy is well worth the performance benefits

Test Info

n = |V|      m = |E|      TSOtool Inferred   Iterations   ST/LD/BAR (%)
2,097,963    3,799,254    4,487,224          5            76/24/0
2,098,219    3,686,624    4,411,887          4            79/21/0
1,977,832    4,453,340    5,179,108          5            46/53/1
2,097,741    3,875,831    4,635,852          7            77/23/0
1,936,321    5,109,990    5,236,671          5            44/54/2
2,098,321    2,491,062    4,257,077          6            80/20/0
2,097,809    4,321,793    4,404,753          7            78/21/1
1,871,831    3,660,617    4,861,044          6            44/54/2
2,097,809    4,434,120    4,418,555          5            80/20/0
4,195,405    6,934,725    9,338,902          7            76/23/1
4,194,961    7,960,567    8,963,281          6            78/22/0

Speedup over TSOtool (Inferring edges)

Graph Size       # of tests   Lazy RTVC   OMP 2     OMP 4     GPU
64K*4 = 256K     27           15.09x      29.31x    53.45x    57.90x
128K*4 = 512K    27           16.41x      31.49x    57.34x    76.98x
256K*4 = 1M      23           14.51x      27.98x    51.68x    72.32x
512K*4 = 2M      10           4.01x       7.52x     14.19x    42.90x
1M*4 = 4M        2            3.08x       5.70x     10.39x    45.16x

• Number of tests decreases with test size because of industrial time constraints
  – Motivation for this work
• Avg. parallel speedups over our improved sequential approach:
  – 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)


Importance of Scaling

• 128K instructions per core
• 512K total instructions

Importance of Scaling

• 256K instructions per core
• 1M total instructions