Offline Symbolic Analysis for Multi-Processor Execution Replay

Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, Zijiang James Yang*, and Cristiano L. Pereira‡
University of Michigan, Ann Arbor†; Western Michigan University*; Intel, Inc‡

Mar 23, 2016

- 1 -

Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, Zijiang James Yang*, and Cristiano L. Pereira‡

University of Michigan, Ann Arbor †

Western Michigan University *

Intel, Inc ‡

Offline Symbolic Analysis for Multi-Processor Execution Replay

- 2 -

Overview

Goal: Deterministic replay for multi-threaded programs
• Debug non-deterministic bugs

Sources of non-determinism
• Program input (interrupt, I/O, DMA, etc.)
• Shared-memory dependencies

Program Input
• Past Solutions: Log I/O, signals, DMA, etc.
• Our Solution: BugNet [ISCA'05], log loads (cache miss data)

Shared Memory Dependency
• Past Solutions: Monitor memory operations (software is slow, hardware is complex)
• Our Solution: SAT constraint solver, determine offline before replay

- 3 -

Deterministic Replay Uses

Recorder (Remote Site OR In-house) → Replayer (Developer Site)

• Reproduce non-deterministic bugs
• Debugging: step backward in time
• Dynamic Program Analysis: memory leaks, data races, dangling pointers

- 4 -

Traditional Record-N-Replay Systems

[Figure: timelines of Thread 1, Thread 2, and Thread 3 with one Write and two Reads marked]

• Checkpoint Memory and Register State
• Log non-deterministic program input: Interrupts, I/O values, DMA, etc.
• Log shared memory dependencies

- 5 -

Recording Shared Memory Dependency

Problem
• Need to monitor every memory operation

Software-based Replay System
• PinSEL (UCSD/Intel)
• iDNA (Microsoft)

Hardware-based Replay System
• FDR/ReRun (Wisconsin)
• Strata (UCSD)
• DeLorean (UIUC)

x100 x10

Complex hardware

- 6 -

Hardware Complexity

Hardware-based solution
• Detect shared memory dependencies by monitoring cache coherence messages
• Transitive optimization to reduce log size

Complexity
• Requires changes to coherence sub-system
• Complex to design and verify
• 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al. ICCD’06]

[Figure: W(a) W(b) / W(b) R(a)]

- 7 -

New Direction to Hardware-based Solution

Complexity-effective solution
• Do NOT record shared-memory dependencies at all
• Infer dependencies offline before replay using Satisfiability Modulo Theory (SMT) solver

- 8 -

Our Approach

[Figure: thread timelines with one Write and two Reads marked]

Traditional step → Our approach
• Checkpoint Memory and Registers → Checkpoint Registers
• Log non-deterministic program input (Interrupts, I/O values, DMA, etc.) → BugNet [ISCA’05] load-based hardware recorder
• Log shared memory dependency → Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline

- 9 -

Roadmap

• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion

- 10 -

BugNet [Narayanasamy et al, ISCA’05]

Insight
• Recording initial register state and values of loads is sufficient for deterministic replay
• Implicitly captures the program input from I/O, DMA, interrupts, etc.
• Input and output of other instructions are reproduced during replay

Optimization
• Record a load only if it is the first access to a memory location

Our modification
• Recording data fetched on cache miss captures first loads
• Any first access to a location would result in a cache miss
• May unnecessarily record data due to store misses, but that is OK
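The first-access idea above can be sketched in a few lines of Python. This is a simplified illustration, not the BugNet hardware: it assumes an unbounded cache (so a location misses only on its first access), and the names `record`, `replay`, and the trace format are hypothetical. Store values are carried in the trace as a stand-in for values that a real replayer would recompute by re-executing instructions.

```python
def record(trace, initial_memory):
    """Log (memory_op_count, value) for the first access to each location.

    trace: list of (op, addr, value); value is None for loads.
    Assumption: an unbounded cache, so only first accesses miss.
    """
    memory = dict(initial_memory)
    seen = set()
    log = []
    for count, (op, addr, value) in enumerate(trace, start=1):
        if addr not in seen:
            # First access: log the data fetched on the miss.
            # For a store miss this is the old value, before the update.
            log.append((count, memory[addr]))
            seen.add(addr)
        if op == "store":
            memory[addr] = value
    return log


def replay(trace, log):
    """Re-execute the trace using only the log, without the initial memory."""
    logged = dict(log)
    memory = {}
    loaded_values = []
    for count, (op, addr, value) in enumerate(trace, start=1):
        if count in logged:
            memory[addr] = logged[count]  # restore data fetched on the miss
        if op == "load":
            loaded_values.append(memory[addr])
        else:
            memory[addr] = value  # new value is reproduced deterministically
    return loaded_values


TRACE = [
    ("load", "A", None),
    ("load", "A", None),
    ("load", "B", None),
    ("store", "C", 1),
    ("load", "C", None),
]
LOG = record(TRACE, {"A": 0, "B": 5, "C": 0})
print(LOG)
print(replay(TRACE, LOG))
```

The second load of A produces no log entry, and the final load of C is served by the replayed store rather than by the log.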

- 11 -

Recording Cache Miss Data (First Loads)

Checkpoint
• Register Values
• Program Counter

Execution (in time order) and resulting log file entries
• Load A = 0: first load, cache miss, logged as (cnt1, 0)
• Load A = 0: not a first load, nothing logged
• Load B = 5: first load, cache miss, logged as (cnt2, 5)
• Store C = 1: cache miss, logged as (cnt3, 0)

Record cache misses
• (Memory count, Data)
• Implicitly capture first loads

On a store miss
• Record old value – data before store update
• New value – data after store update – can be reproduced deterministically

Deterministic Replay
• Input and output (including address) of all instructions are replayed

- 12 -

BugNet Extension

Self-modifying code
• Consider instruction read as a load; so instructions are logged

Full system Replay
• Continue logging in kernel mode
• See the paper for details on context switches, page faults, etc.

- 13 -

Roadmap

• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion

- 14 -

BugNet for Multithreaded Programs

Insight
• BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread
⇒ Recording cache miss data is sufficient for multithreaded programs
⇒ No additional hardware support required for recording dependencies

Reason
• A load dependent on a remote write causes a cache miss to ensure coherence
⇒ BugNet implicitly records load values dependent on remote writes

Effect
• Can replay each thread in isolation (independent of other threads) using BugNet logs

- 15 -

Replaying Each Thread Independently

Proc 1
• Load A=0 (cache miss)
• Load A=0
• Load A=1 (cache miss: cache block invalidated)

Proc 2
• Store A=1 (invalidation sent to Proc 1)

Proc 1 LOG: (1st, 0), (3rd, 1)
Proc 2 LOG: (1st, 0)

Cache Coherence
• Invalidate cache block to gain exclusive permission

Log cache miss data
• Implicitly records loads dependent on remote writes
• No change to coherence mechanism

Replay each thread
• independent of others
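The two-processor example can be mimicked with a toy invalidation-based model. This is only a sketch under simplifying assumptions (write-through memory, unbounded private caches, a store always invalidating remote copies); the function name `record_with_coherence` and the schedule format are hypothetical and not taken from the paper.

```python
def record_with_coherence(schedule, initial_memory, num_procs=2):
    """Return one cache-miss log per processor.

    schedule: global interleaving as a list of (proc, op, addr, value).
    Each log entry is (per-processor memory op count, data fetched on miss).
    """
    memory = dict(initial_memory)
    caches = [set() for _ in range(num_procs)]  # addresses valid in each cache
    counts = [0] * num_procs
    logs = [[] for _ in range(num_procs)]
    for proc, op, addr, value in schedule:
        counts[proc] += 1
        if addr not in caches[proc]:
            # Cache miss: log the fetched data (old value for a store miss).
            logs[proc].append((counts[proc], memory[addr]))
            caches[proc].add(addr)
        if op == "store":
            # Gain exclusive permission by invalidating remote copies.
            for other in range(num_procs):
                if other != proc:
                    caches[other].discard(addr)
            memory[addr] = value
    return logs


SCHEDULE = [
    (0, "load", "A", None),   # Proc 1: miss
    (0, "load", "A", None),   # Proc 1: hit
    (1, "store", "A", 1),     # Proc 2: miss, invalidates Proc 1's copy
    (0, "load", "A", None),   # Proc 1: miss again, sees the remote write
]
print(record_with_coherence(SCHEDULE, {"A": 0}))
```

The third load on the first processor misses only because of the remote store, so the value it depends on lands in that processor's own log without any dependency being recorded.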

- 16 -

Shared Memory Dependency

[Figure: Thread 1 and Thread 2 each execute a sequence of loads and stores to locations A, B, and C, each marked with its old value and new value; Final State: A, B, C]

SMT Solver resolves shared memory dependency

Billion instructions
• Offline analysis would not scale

We need to bound search space

- 17 -

Roadmap

• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion

- 18 -

Encoding Ordering Constraints

[Figure: accesses x1, x2 on Proc 1 and x3, x4, x5 on Proc 2 to location x, each marked with its old value and new value, followed by xFinal]

Program Order Constraint (Assume Sequential Consistency)
Proc 1: X1 < X2 AND
Proc 2: X3 < X4 < X5 AND

Load-Store Constraint (M→old == M→prev→new)
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
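To make the two constraint families concrete, the sketch below enumerates total orders by brute force instead of calling an SMT solver. The old and new values assigned to X1 through X5 are hypothetical, chosen so that they produce exactly the constraints listed above; the rule checked is the load-store rule M→old == M→prev→new for a single location.

```python
from itertools import permutations

# Hypothetical accesses to one location x: name -> (old value, new value).
# Loads leave the value unchanged; stores replace it.
ACCESSES = {
    "X1": (0, 0),  # load on Proc 1
    "X2": (1, 1),  # load on Proc 1
    "X3": (0, 1),  # store on Proc 2
    "X4": (1, 2),  # store on Proc 2
    "X5": (2, 1),  # store on Proc 2
}
PROGRAM_ORDER = [["X1", "X2"], ["X3", "X4", "X5"]]
INITIAL_VALUE, FINAL_VALUE = 0, 1


def respects_program_order(order):
    """Each processor's accesses keep their original relative order."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[a] < pos[b]
               for seq in PROGRAM_ORDER
               for a, b in zip(seq, seq[1:]))


def respects_load_store(order):
    """Every access must observe the new value left by its predecessor."""
    value = INITIAL_VALUE
    for name in order:
        old, new = ACCESSES[name]
        if old != value:
            return False
        value = new
    return value == FINAL_VALUE


def valid_total_orders():
    return [order for order in permutations(ACCESSES)
            if respects_program_order(order) and respects_load_store(order)]


for order in valid_total_orders():
    print(" < ".join(order))
```

With these values, X2 is valid either directly after X3 or after X5, which corresponds to the disjunction (X3 < X2 < X4 OR X5 < X2). Enumeration is only workable for a handful of accesses, which is why a solver and a bounded search space are needed in practice.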

- 19 -

Multiple Memory Locations

[Figure: accesses y1, x1, x2, y2 on Proc 1 and x3, x4, x5, y3 on Proc 2 to locations x and y, each marked with its old value and new value, followed by xFinal and yFinal]

Program Order Constraints (Assume Sequential Consistency)
Proc 1: Y1 < X1 < X2 < Y2 AND
Proc 2: X3 < X4 < X5 < Y3 AND

Load-Store Constraints (M→old == M→prev→new)
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
:
Y1: Y1 < Y2 AND
Y2: Y1 < Y2 < Y3 AND
:

- 20 -

Satisfiability-Modulo-Theory (SMT) Solver

Input to SMT Solver: Ordering Constraints
(Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ :

Output of SMT Solver: Total Order over x1, x2, x3, x4, x5, y1, y2, y3

SMT solver
• Find one valid total order from multiple solutions
• All solutions could be produced, if needed

- 21 -

Replay Guarantees

• The replayed execution has the same final register and memory states

• Each thread has exactly the same sequence of instructions along with input and output

• Reconstructed shared memory dependencies obey program order and load-store semantics

- 22 -

Roadmap

• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion

- 23 -

Bounding Search Space

[Figure: Proc 1 and Proc 2 timelines cut every N cycles at recorded memory operation counts (cnt 1, cnt 2, cnt 3, cnt 4) into Strata Region 1, Strata Region 2, and Strata Region 3; each region has an Initial State and a Final State]

Record “Strata hints”
• Each processor periodically records memory operation count
• Strata regions have a global order

SMT solver analyzes
• One region at a time
• Start from the last region
• Final state of a region = Initial state of the following region
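A rough sketch of how recorded counts could cut per-processor access sequences into regions is shown below. The data layout (a list of hints, each mapping a processor to its cumulative memory operation count) and the helper names are assumptions made for illustration, not the paper's log format.

```python
def split_into_regions(per_proc_ops, hints):
    """Cut each processor's access list at the recorded counts.

    per_proc_ops: {proc: [access, ...]} in program order.
    hints: list of {proc: cumulative memory op count}, in global order.
    Returns a list of regions, each {proc: [accesses in that region]}.
    """
    regions = []
    start = {proc: 0 for proc in per_proc_ops}
    for hint in hints:
        regions.append({proc: ops[start[proc]:hint[proc]]
                        for proc, ops in per_proc_ops.items()})
        start = dict(hint)
    # Accesses after the last hint form the final region.
    regions.append({proc: ops[start[proc]:]
                    for proc, ops in per_proc_ops.items()})
    return regions


def analysis_order(regions):
    """Region indices in the order they are analyzed: last region first."""
    return list(range(len(regions) - 1, -1, -1))


OPS = {1: ["a1", "a2", "a3", "a4"], 2: ["b1", "b2", "b3"]}
HINTS = [{1: 2, 2: 1}, {1: 3, 2: 3}]
REGIONS = split_into_regions(OPS, HINTS)
for index in analysis_order(REGIONS):
    print(index, REGIONS[index])
```

Because regions are globally ordered, each one can be handed to the solver as a small independent problem, with the number of accesses per problem bounded by the hint interval.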

- 24 -

Strata Hints

Cycle-bound
• After N cycles, each core records its memory operation count
• No communication is required between cores

Problem
• The size of a Strata region is not based on the number of shared memory dependencies
• Can we bound based on number of shared memory dependencies?

Downgrade-bound
• Count coherence downgrade requests
• Requires communication between cores, but reduces offline analysis overhead
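The downgrade-bound scheme can be pictured with the small sketch below. It assumes a single global downgrade counter and that a hint is taken immediately after the memory operation that triggers the threshold; both are assumptions for illustration, as are the function name and event format.

```python
def downgrade_bound_hints(events, num_cores, threshold):
    """Emit a Strata hint after every `threshold` coherence downgrades.

    events: one (core, caused_downgrade) pair per memory operation,
            in global order.
    Returns a list of hints; each hint is a tuple with every core's
    memory operation count at that moment.
    """
    counts = [0] * num_cores
    downgrades = 0
    hints = []
    for core, caused_downgrade in events:
        counts[core] += 1
        if caused_downgrade:
            downgrades += 1
            if downgrades == threshold:
                # All cores record their counts (needs communication).
                hints.append(tuple(counts))
                downgrades = 0
    return hints


EVENTS = [
    (0, False), (1, True), (0, False), (0, True),
    (1, False), (1, True), (0, True),
]
print(downgrade_bound_hints(EVENTS, num_cores=2, threshold=2))
```

Since a downgrade indicates data moving between cores, each region under this scheme contains a bounded number of potential shared memory dependencies regardless of how many cycles it spans.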

- 25 -

Filtering Local & Read-only Accesses

[Figure: one Strata Region with Thread 1 and Thread 2; accesses shown are Load A, Store B, Load B, Store B, Store A, and six Load C]

Filter
• Local accesses: no shared-memory dependency
• Read-only accesses: any total order is valid

Effectiveness
• < 1% of memory accesses remain to be analyzed
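The two filters can be sketched as a single pass over the accesses of one region. The assignment of accesses to threads in the example is hypothetical, since the slide layout does not show it, and `filter_accesses` is an illustrative name.

```python
from collections import defaultdict


def filter_accesses(region):
    """Keep only accesses that still need ordering analysis.

    region: list of (thread, op, addr) for one Strata region.
    Dropped: locations touched by a single thread (local) and
    locations that are never stored to in the region (read-only).
    """
    threads_by_addr = defaultdict(set)
    written = set()
    for thread, op, addr in region:
        threads_by_addr[addr].add(thread)
        if op == "store":
            written.add(addr)
    return [(thread, op, addr) for thread, op, addr in region
            if len(threads_by_addr[addr]) > 1 and addr in written]


REGION = [
    (1, "load", "A"), (1, "store", "B"), (2, "load", "B"),
    (2, "store", "B"), (1, "store", "A"),
    (1, "load", "C"), (2, "load", "C"), (1, "load", "C"), (2, "load", "C"),
]
print(filter_accesses(REGION))
```

In this example A is local to one thread and C is read-only, so only the three accesses to B remain for the solver.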

- 26 -

Roadmap

• Motivation
• Record & Replay
• Offline Symbolic Analysis
• Evaluation
  • Strata Hint Size
  • Offline Symbolic Analysis Overhead
• Conclusion

- 27 -

Evaluation

• Simics + cycle accurate simulator
  • Simulate multi-processor execution (2, 4, 8, 16 cores)
  • Fast-forward up to known synchronization points
  • Trace collected for 500 million instructions
• Benchmarks
  • SPLASH2: barnes, fmm, ocean
  • Parsec 2.0: blackscholes, bodytrack, x264
  • SPEComp: wupwise, swim
  • Apache
  • MySQL
• Yices SMT constraint solver [Dutertre and de Moura CAV’06]

- 28 -

Strata Hints Size vs. Offline Analysis Overhead

• Downgrade-bound scheme is effective

[Figure: Strata log size (MB/sec, axis from 2.7 to 3.3) and offline analysis time (secs per sec of program execution, log axis from 100 to 1,000,000) for Cycle-bound (10,000), Downgrade-bound (25), and Downgrade-bound (10); annotations: 10%, x100]

• Offline analysis overhead is one-time cost (not for every replay)

- 29 -

Strata hints vs. ReRun log

• Strata hints are 4x smaller than ReRun log
• Significant reduction in hardware complexity

[Figure: Strata log size (MB/sec, log axis from 1 to 100) for barnes, fmm, ocean, blackscholes, bodytrack, x264, wupwise, swim, apache, mysql, and average; series: Proposed System, Downgrade-bound (d10.c10000), and ReRun [Hower and Hill, ISCA’08], labeled "Rerun (henkins)"; annotation: x4]

- 30 -

Recording Performance, etc.

• Cache Miss Data Log
  • 290 Mbytes / one second of program execution
• Recording Performance
  • On average, 0.35% slowdown in IPC
• Scalability results can be found in the paper

- 31 -

Conclusion

• Deterministic replay for multi-threaded programs is critical

• We proposed a complexity-effective solution
  • Use BugNet: Record cache miss data
  • No need to record shared memory dependencies
  • Determine shared memory dependency using SMT constraint solver offline

• Result
  • < 1% recording overhead
  • Efficient log size (4x smaller than state-of-the-art scheme ReRun)
  • Can analyze one second of 8-threaded program in less than 1000 seconds
  • One-time offline analysis cost (not for every replay)

- 32 -

Thank you