DACOTA: Post-silicon validation of the memory subsystem in multi-core designs
Andrew DeOrio, Ilya Wagner, Valeria Bertacco
Advanced Computer Architecture Laboratory
University of Michigan, Ann Arbor
HPCA 2009
2/22
Multi-core Designs
• Many simple processors
• Communicate through interconnect network
Examples: Intel Polaris, Tilera TILE64
3/22
Complex Multi-core: Memory Subsystem
• Cache coherence: the ordering of operations to a single cache line
• Memory consistency: controls the ordering of operations among different memory addresses
[Figure: cores 0 … N-1, each with a private L1 cache, connected through an interconnect to a shared L2 cache]
The memory subsystem is hard to verify
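The consistency definition above can be illustrated with a classic store-buffering litmus test (a hypothetical example, not from the talk; all names below are illustrative). Enumerating every sequentially consistent interleaving shows that the outcome r0 == r1 == 0 is impossible under SC, so observing it on silicon would flag a memory ordering violation:

```python
from itertools import permutations

# Store-buffering litmus test:
#   Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
# Under sequential consistency every execution is some interleaving of the
# two program orders, so r0 == r1 == 0 is impossible; hardware that lets
# stores drain from a write buffer after later loads can produce it.

T0 = [("st", "x"), ("ld", "y", "r0")]
T1 = [("st", "y"), ("ld", "x", "r1")]

def sc_outcomes():
    """Enumerate all SC interleavings and collect the final (r0, r1) pairs."""
    outcomes = set()
    # Tag each op with its thread, then try every interleaving that
    # preserves both program orders.
    ops = [(0, op) for op in T0] + [(1, op) for op in T1]
    for order in permutations(range(4)):
        per_thread = {0: [], 1: []}
        for idx in order:
            per_thread[ops[idx][0]].append(idx)
        if any(p != sorted(p) for p in per_thread.values()):
            continue  # violates some thread's program order
        mem, regs = {"x": 0, "y": 0}, {}
        for idx in order:
            _, op = ops[idx]
            if op[0] == "st":
                mem[op[1]] = 1
            else:
                regs[op[2]] = mem[op[1]]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

print(sc_outcomes())  # (0, 0) never appears under SC
```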
4/22
The Verification Landscape
Pre-Silicon:
• Slow: ~Hz
• Stimuli generators
• Random testers
• Formal verification
• bugs exposed: 98%; effort: 70%
Post-Silicon:
• Fast: at-speed
• Early HW prototypes
• Hard-to-find bugs
• Relatively new technology
  – Ad-hoc
• bugs exposed: 2%; effort: 30%
Runtime:
• Fast: at-speed
• Research ideas
  – Austin, Malik, Sorin
• Microcode patching
  – Intel, AMD
• bugs exposed: <1%; effort: 0%
[Figure: logic simulation of RTL driven by stimuli, e.g. "module dut; assign x = ~y | z; always @ * begin … end"]
5/22
Post-Silicon Validation Today
[Figure: a test generator drives both a silicon prototype and simulation servers; the prototype's final state is compared for equality against the simulation final state. Simulation sits on the critical path.]
Simulation is the bottleneck of the validation process
6/22
Escaped Bugs in the Memory Subsystem
• 10% of the bugs that made it to product are related to the memory subsystem
Excerpt from Specification Update [Nov. 2007]:
bug AW38 – No Fix – "Instruction fetch may cause a livelock during snoops of the L1 data cache"
Memory-related bugs are hard to find
7/22
Post-Silicon Design Goals
• High coverage
  – Enable self-detection of memory ordering errors
  – Coherence and consistency errors
• Low area impact
• No performance impact after shipment
[Figure: a test generator drives the prototype, whose final state is self-checked on chip, removing simulation from the critical path]
8/22
DACOTA: Data Coloring for Consistency Testing and Analysis
Post-silicon validation for the memory subsystem
• Logging
  – stores ordering info
  – uses cache storage temporarily
• Checking
  – starts when storage fills
  – distributed algorithm on individual cores
[Figure: cores 0 … N-1 with private L1 caches, interconnect, shared L2; execution timeline alternates benchmark execution with check time]
9/22
Low Overhead Logging Architecture
• DACOTA controller augments cache controller logic
• Reconfigures a portion of cache for activity log
[Figure: each core's cache is extended with DACOTA control logic and per-line data access vectors; cores 0 … N-1 connect through the interconnect to the L2 cache]
10/22
Low Overhead Logging Architecture
• Attach access vector to each cache line
  – Tracks the order of memory accesses to one line
  – One entry for each core; each entry stores a sequence ID
• Allocate space for activity log
  – Stores a sequence of access vectors in program order
[Figure: example access-vector updates for 1. core 0 store, 2. core 1 store, 3. core 0 store, each stamping a fresh sequence ID into the issuing core's entry]
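The logging scheme above can be sketched in a few lines, under assumed simplifications (the names and the exact counter discipline are illustrative, not taken from the DACOTA hardware): each cache line keeps a counter plus an access vector with one entry per core; every access bumps the line's counter, stamps it into the issuing core's entry, and appends a snapshot of the vector to that core's activity log in program order.

```python
NUM_CORES = 2

class CacheLine:
    def __init__(self, addr):
        self.addr = addr
        self.counter = 0                       # per-line access counter
        self.access_vector = [0] * NUM_CORES   # one sequence ID per core

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.log = []   # activity log: (addr, op, access-vector snapshot)

    def access(self, line, op):
        # Stamp a fresh sequence ID into this core's entry, then log
        # a snapshot of the whole vector in program order.
        line.counter += 1
        line.access_vector[self.cid] = line.counter
        self.log.append((line.addr, op, tuple(line.access_vector)))

cores = [Core(i) for i in range(NUM_CORES)]
line = CacheLine(0xA)
cores[0].access(line, "st")   # vector: [1, 0]
cores[1].access(line, "st")   # vector: [1, 2]
cores[0].access(line, "st")   # vector: [3, 2]
```

The snapshots let the checker later reconstruct, per line, which core touched it in which order, without any extra dedicated storage beyond the reconfigured cache space.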
11/22
Checking Algorithm – On Site
• Compares activity logs from L1 caches
• Distributed algorithm runs on cores
1. Aggregate logs
2. Construct graph (protocol specific)
   • many protocols supported: SC, TSO, processor consistency, weak consistency
3. Search graph for cycles, indicating ordering violation
[Figure: per-core activity logs (ST A1, ST B1, ST A2, LD B1, … / ST C1, ST D1, ST C2, ST E1, ST B2, …) aggregated into a single ordering graph]
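Step 3 amounts to a cycle search on a directed graph: any cycle means the observed orderings cannot be serialized, i.e. a coherence or consistency violation. A minimal sketch (graph representation and node names are illustrative, not the paper's data structures):

```python
def has_cycle(graph):
    """Return True if the directed graph (adjacency dict) contains a cycle.

    Iterative DFS with three-color marking: a back edge to a GRAY
    (on-stack) node closes a cycle.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if color[nxt] == GRAY:
                    return True          # back edge: cycle found
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK      # fully explored
                stack.pop()
    return False

# Hypothetical graphs over load/store nodes:
ok  = {"ST_A": ["ST_B"], "ST_B": [], "LD_A": [], "LD_B": []}
bad = {"ST_A": ["ST_B"], "ST_B": ["LD_B"],
       "LD_B": ["LD_A"], "LD_A": ["ST_A"]}
print(has_cycle(ok), has_cycle(bad))  # False True
```

An iterative DFS matters here because on-chip, post-silicon log sizes (hundreds to thousands of entries per core) could overflow a recursive stack.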
12/22
Example – Sequential Consistency
Issue Order:
[C1] store to address 0xC
[C0] load from address 0xC
[C1] load from address 0xB
[C0] store to address 0xA
[C0] store to address 0xB
[C1] load from address 0xA
Actual Order:
[C1] store to address 0xC
[C0] load from address 0xC
[C1] load from address 0xA
[C0] store to address 0xA
[C0] store to address 0xB
[C1] load from address 0xB
[Figure: per-core cache contents with tag, data and log columns; each line's access vector is stamped as the accesses occur]
13/22
Example – Sequential Consistency
[Figure: graph built from the activity logs, with program order edges and address reference edges; a cycle among ST 0xA, ST 0xB, LD 0xB and LD 0xA indicates a violation]
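The two edge types in this example can be derived mechanically from the aggregated logs. A hypothetical sketch (the log-entry format `(core, op, addr, seq)` and the function name are assumptions, not the paper's): program-order edges connect consecutive entries from the same core, and address-reference edges connect accesses to the same address in sequence-ID order.

```python
def build_edges(log):
    """Derive program-order and address-reference edges between log indices."""
    edges = set()
    # Program-order edges: consecutive entries from the same core.
    last = {}
    for i, (core, _op, _addr, _seq) in enumerate(log):
        if core in last:
            edges.add((last[core], i))
        last[core] = i
    # Address-reference edges: accesses to the same address, ordered by
    # the per-line sequence IDs recorded in the access vectors.
    by_addr = {}
    for i, (_core, _op, addr, seq) in enumerate(log):
        by_addr.setdefault(addr, []).append((seq, i))
    for entries in by_addr.values():
        entries.sort()
        for (_, a), (_, b) in zip(entries, entries[1:]):
            edges.add((a, b))
    return edges

# C0 stores to 0xA then 0xB; C1 loads 0xB then 0xA.
log = [
    (0, "st", 0xA, 1),
    (0, "st", 0xB, 1),
    (1, "ld", 0xB, 2),
    (1, "ld", 0xA, 2),
]
print(sorted(build_edges(log)))
```

The resulting edge set is then handed to the cycle search of step 3; in an erroneous execution the stale-value ordering reverses one address edge and closes a cycle.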
14/22
Experimental Setup
• Implemented checkers in the GEMS simulator
• Created buggy versions of cache controllers
• TSO consistency model
[Figure: 16 cores on a mesh network with directory-based MOESI cache coherence and a 4MB L2 cache]
15/22
Experimental Setup
• Bugs inspired by bugs found in processor errata
• Injected one at a time

Bug             | Description                                            | Cycles to Expose
shared-store    | store to a shared line may not invalidate other caches | 0.3M
invisible-store | store message may not reach all cores                  | 1.3M
store-alloc1    | store allocation in any core may not occur properly    | 1.9M
store-alloc2    | store allocation in one core may not occur properly    | 2.3M
reorder1        | invalid store reordering (all cores)                   | 1.4M
reorder2        | invalid store reordering (one core)                    | 2.8M
reorder3        | invalid store reordering (single address, all cores)   | 2.9M
reorder4        | invalid store reordering (single address, one core)    | 5.6M

• Testbenches
  – Directed random stimulus: memory intensive
  – SPLASH2 benchmarks
16/22
Performance Impact – Random
[Chart: performance overhead (%) on random tests, split into computation and communication overhead; y-axis 0–120%, one benchmark off-scale at 299%; average shown]
17/22
Performance Impact – SPLASH2
[Chart: performance overhead (%) on SPLASH2 benchmarks, split into computation and communication overhead; y-axis 0–50%; average shown]
Pre-Silicon: 100,000,000 %
Traditional Post-Silicon: 10,000 %
DACOTA Post-Silicon: 60 %
100x more tests!
18/22
Area Impact
• Implemented DACOTA in Verilog
• 0.01% area overhead in OpenSPARC T1
Area Overhead – Storage:
DACOTA: 544 B
Chen, et al., 2008: 617,472 B
Meixner, et al., 2006: 940,032 B
Performance overhead:
Pre-Silicon: 100,000,000 %
Traditional Post-Silicon: 10,000 %
DACOTA Post-Silicon: 60 %
Runtime: 0 %
19/22
Communication Overhead
[Charts: overhead due to communication (%) vs. core activity log entries (64–2048). Random benchmarks (large_1000_shared, barrier, locks, small_0_shared): 0–350%; SPLASH2 benchmarks (radix, lu, cholesky, fft): 0–40%; averages shown]
20/22
Checking Algorithm Overhead
[Charts: overhead due to checking algorithm (%) vs. core activity log entries (64–2048). SPLASH2 benchmarks (radix, lu, cholesky, fft): 0–120%; random benchmarks (large_1000_shared, barrier, locks, small_0_shared): 0–700%; averages shown, with an ideal trade-off marked]
21/22
Related Work
Pre-Silicon: Dill, et al., 1992; Abts, et al., 1993; Pong, et al., 1997; German, et al., 2003
• Formal verification possible for abstract protocol
• Insufficient for implementation
Post-Silicon: Josephson, et al., 2006; Paniccia, et al., 1998; Whetsel, et al., 1991; Tsang, et al., 2000
• Post-Si testing
DeOrio, et al., 2008
• Post-Si verification
• Verifies coherence, but not consistency
Runtime: Meixner, et al., 2006; Chen, et al., 2008
• Effective for protection against transient faults
• Problematic for functional errors
• High area overhead
22/22
Conclusions
• DACOTA is an on-chip post-silicon debugging solution
for detecting errors in memory ordering
– Enables self-detection of memory ordering errors
• Effective at catching bugs
– 100x more coverage than traditional post-silicon
• Very low area overhead
– 0.01% area overhead on OpenSPARC T1
• No performance impact to end user
– Disable on shipment