National Sun Yat-sen University Embedded System Laboratory Quick Detection of Difficult Bugs for Effective Post-Silicon Validation Presenter : Cheng-Ta Wu David Lin1, Ted Hong1, Farzan Fallah1, Nagib Hakim3, Subhasish Mitra1, 2 1 Department of EE and 2 Department of CS Stanford University, Stanford, CA, USA 3 Intel Corporation Santa Clara, CA, USA DAC’12, June 3–7, 2012, San Francisco, CA, USA 1

Jan 04, 2016


Rosanna Craig
Page 1: Quick Detection of Difficult Bugs for Effective Post-Silicon Validation

National Sun Yat-sen University Embedded System Laboratory

Presenter: Cheng-Ta Wu

David Lin1, Ted Hong1, Farzan Fallah1, Nagib Hakim3, Subhasish Mitra1,2
1 Department of EE and 2 Department of CS, Stanford University, Stanford, CA, USA
3 Intel Corporation, Santa Clara, CA, USA
DAC’12, June 3–7, 2012, San Francisco, CA, USA

Page 2: Abstract

We present a new technique for systematically creating post-silicon validation tests that quickly detect bugs in processor cores and uncore components (cache controllers, memory controllers, on-chip networks) of multi-core System on Chips (SoCs). Such quick detection is essential because long error detection latency, the time elapsed between the occurrence of an error due to a bug and its manifestation as an observable failure, severely limits the effectiveness of existing post-silicon validation approaches. In addition, we provide a list of realistic bug scenarios abstracted from “difficult” bugs that occurred in commercial multi-core SoCs.

Our results for an OpenSPARC T2-like multi-core SoC demonstrate:
1. Error detection latencies of “typical” post-silicon validation tests can be very long, up to billions of clock cycles, especially for bugs in uncore components.
2. Our new technique shortens error detection latencies by several orders of magnitude to only a few hundred cycles for most bug scenarios.
3. Our new technique enables 2-fold increase in bug coverage.

An important feature of our technique is its software-only implementation without any hardware modification. Hence, it is readily applicable to existing designs.

Page 3: What’s the Problem

Typical post-silicon validation tests:
- Error detection latencies are very long.
- Tracing far back through the execution history to localize a bug is difficult.
- Expected output values are checked too late.

This paper presents a new Proactive Load and Check (PLC) technique to shorten bug detection latencies.
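To make the latency metric concrete, here is a minimal sketch of the definition given in the abstract (cycles elapsed between the occurrence of an error and its manifestation as an observable failure). The `ErrorEvent` type and the cycle counts are hypothetical illustrations, not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One observed error; both fields are hypothetical cycle counts."""
    error_cycle: int      # cycle at which the bug first corrupts state
    detection_cycle: int  # cycle at which a check or failure exposes it


def detection_latency(event: ErrorEvent) -> int:
    """Cycles elapsed between error occurrence and observable failure."""
    if event.detection_cycle < event.error_cycle:
        raise ValueError("an error cannot be detected before it occurs")
    return event.detection_cycle - event.error_cycle


# Illustrative numbers only, not measurements from the paper.
end_of_test_check = ErrorEvent(error_cycle=1_000, detection_cycle=2_000_001_000)
in_test_check = ErrorEvent(error_cycle=1_000, detection_cycle=1_300)

print(detection_latency(end_of_test_check))  # 2000000000
print(detection_latency(in_test_check))      # 300
```

The longer the latency, the more execution history has to be searched to localize the bug.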

Page 4: Related Work

[Hong 10] QED: Quick Error Detection Tests for Effective Post-Silicon Validation

This paper extends QED with the PLC transformation.

Page 5: Proactive Load and Check (PLC) Transformation

Step 1: Initialization
Transform the existing validation tests into new tests with PLC.
- Use the EDDI-V (“Error Detection by Duplicated Instructions for Validation”) transformation.
- Perform loads from selected variables.
- Insert self-consistency checks on those variables.
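The idea of duplicating instructions and checking the two copies against each other can be sketched as follows. This is a high-level model under stated assumptions, not the paper's implementation: memory is a Python dict, the variable names `total` and `total_dup` are hypothetical, and the `corrupt_at` parameter exists only to simulate a bug.

```python
class SelfCheckError(Exception):
    """Raised when an original value and its duplicate disagree."""


def original_test(mem: dict, data: list) -> None:
    """Untransformed test: accumulate a sum into mem['total']."""
    mem["total"] = 0
    for x in data:
        mem["total"] = mem["total"] + x


def transformed_test(mem: dict, data: list, corrupt_at=None) -> None:
    """Same test after an EDDI-V-style duplication (names are hypothetical).

    Every store is repeated on a shadow variable, and a check that loads
    both copies follows each duplicated pair.
    """
    mem["total"] = 0
    mem["total_dup"] = 0
    for i, x in enumerate(data):
        mem["total"] = mem["total"] + x          # original instruction
        mem["total_dup"] = mem["total_dup"] + x  # duplicated instruction
        if i == corrupt_at:
            mem["total"] ^= 0x4                  # simulate a bug corrupting one copy
        if mem["total"] != mem["total_dup"]:     # self-consistency check
            raise SelfCheckError(f"mismatch after element {i}")
```

Running `transformed_test({}, [1, 2, 3, 4], corrupt_at=2)` raises `SelfCheckError` at the iteration where the corruption happens, rather than at the end of the test.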

Page 6: Proactive Load and Check (PLC) Transformation (cont.)

- Create PLC_List, a list of <original variable pointer, EDDI-V variable pointer> pairs.
- Protect the listed variables against race conditions.
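A PLC_List and the check performed on it might look roughly like the sketch below. Several things here are assumptions made for illustration: dict keys stand in for pointers, a single `threading.Lock` stands in for whatever race protection the real transformation uses, and the class and method names (`PLCList`, `register`, `store`, `plc_operation`) are hypothetical.

```python
import threading


class PLCMismatch(Exception):
    """Raised when an original variable and its EDDI-V copy differ."""


class PLCList:
    """Pairs of <original variable, EDDI-V variable>, modeled as dict keys."""

    def __init__(self) -> None:
        self._pairs = []               # list of (original_key, duplicate_key)
        self._lock = threading.Lock()  # assumed protection against races

    def register(self, original_key: str, duplicate_key: str) -> None:
        with self._lock:
            self._pairs.append((original_key, duplicate_key))

    def store(self, mem: dict, original_key: str, duplicate_key: str, value) -> None:
        """Update both copies atomically so a concurrent check never sees
        one copy updated and the other stale."""
        with self._lock:
            mem[original_key] = value
            mem[duplicate_key] = value

    def plc_operation(self, mem: dict) -> int:
        """Load every listed pair and compare; return the number of pairs checked."""
        with self._lock:
            for original_key, duplicate_key in self._pairs:
                if mem[original_key] != mem[duplicate_key]:
                    raise PLCMismatch(f"{original_key} != {duplicate_key}")
            return len(self._pairs)
```

Without the protection, a check running in one thread could observe a pair halfway through an update in another thread and report a false mismatch.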

Page 7: Proactive Load and Check (PLC) Transformation (cont.)

Step 2: PLC Operation Insertion
The PLC transformation inserts PLC operations in each thread on each processor core.

PLC_inst_min
- Used to minimize possible intrusiveness due to PLC operations.
- The minimum number of instructions in the same thread that must execute before a PLC operation is inserted.
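One simple way to honor PLC_inst_min is sketched below. It assumes a thread's instructions are a flat list and inserts a PLC operation as soon as the minimum count is reached; the actual transformation may choose insertion points differently. The `PLC_OP` marker and the function name are hypothetical.

```python
PLC_OP = "PLC_OP"  # hypothetical marker for an inserted PLC operation


def insert_plc_operations(thread_instructions: list, plc_inst_min: int) -> list:
    """Return a copy of one thread's instruction list with PLC operations added.

    At least `plc_inst_min` original instructions separate consecutive
    PLC operations, which bounds how intrusive the inserted checks are.
    """
    if plc_inst_min < 1:
        raise ValueError("plc_inst_min must be at least 1")
    transformed = []
    since_last_plc = 0
    for inst in thread_instructions:
        transformed.append(inst)
        since_last_plc += 1
        if since_last_plc >= plc_inst_min:
            transformed.append(PLC_OP)
            since_last_plc = 0
    return transformed


print(insert_plc_operations(["i0", "i1", "i2", "i3", "i4"], 2))
# ['i0', 'i1', 'PLC_OP', 'i2', 'i3', 'PLC_OP', 'i4']
```

A larger PLC_inst_min means fewer checks and less perturbation of the original test; a smaller one means more frequent checks.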

Page 8: Experiment

Environment:
- 8 processor cores, 64 threads.
- Private split L1 data and instruction caches.
- Crossbar-based interconnects.
- 8-way banked L2 cache using a directory-based cache coherence protocol.
- 4 memory controllers.

Page 9: Results

Benchmarks:
- SPLASH-2: FFT, LU.
- A proprietary industrial post-silicon validation test targeting memory bugs.

OERT (Original Equivalent RunTime tests)

Page 10: Contribution

- Several orders of magnitude improvement in error detection latencies.
- The error detection latencies of PLC tests are within a few hundred cycles for most bug scenarios.
- 2-fold improvement in the coverage of bug scenarios.

Page 11: Post-silicon validation

Post-silicon validation involves three activities:
- detecting a problem by applying proper stimuli;
- localizing the problem to a small region inside the chip;
- fixing the problem through software patches, circuit editing, or silicon re-spin.

The effort to localize the problem from an observed failure often dominates the cost of post-silicon validation.

Page 12: Bug Scenarios

The bug scenarios were obtained by analyzing “difficult” bugs (from proprietary bug databases) that occurred in recent commercial multi-core SoCs (OpenSPARC T2-like).

These bug scenarios are considered “difficult” because of the very long debug times indicated in the bug reports.

Each bug scenario is decomposed into a bug activation criterion and a bug effect.

Page 13: Bug activation criterion

The condition that must be satisfied to activate a bug.

Criteria 1-4 correspond to cache controller bugs.

Criterion 5 corresponds to bugs inside cache/memory controllers and on-chip networks.

Criteria 6-8 correspond to processor core bugs.

Page 14: Bug effect

Defined as the incorrect behavior resulting from bug activation.

Effects A-E correspond to cache controller bugs.

Effect F corresponds to memory controller bugs.

Effect G corresponds to interconnection network bugs.

Effects H-J correspond to bugs inside processor cores.

Page 15: Example of Bug Scenarios

Create families of bug scenarios by adjusting integer parameters X and Y in Tables 1a and 1b. For example, pairing bug activation criterion 2, for X=10, with bug effect A produces the following bug scenario:
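The pairing of an activation criterion with an effect can be modeled as a small data structure. The sketch below only encodes what the slides state (criterion numbers 1-8, effect letters A-J, and the components they correspond to). The wording of each criterion and effect lives in Tables 1a and 1b, which are not reproduced here, so the code carries labels only. The `BugScenario` type, the function name, and any X values other than 10 are hypothetical.

```python
from dataclasses import dataclass

# Component classes as listed on the activation-criterion and bug-effect slides.
CRITERION_COMPONENT = {
    **{n: "cache controller" for n in (1, 2, 3, 4)},
    5: "cache/memory controller and on-chip network",
    **{n: "processor core" for n in (6, 7, 8)},
}
EFFECT_COMPONENT = {
    **{e: "cache controller" for e in "ABCDE"},
    "F": "memory controller",
    "G": "interconnection network",
    **{e: "processor core" for e in "HIJ"},
}


@dataclass(frozen=True)
class BugScenario:
    """A bug activation criterion (with its parameter X) paired with a bug effect."""
    criterion: int
    x: int
    effect: str

    def label(self) -> str:
        return f"criterion {self.criterion} (X={self.x}) + effect {self.effect}"


def scenario_family(criterion: int, x_values: list, effect: str) -> list:
    """Build one family of scenarios by varying X for a fixed criterion/effect pair."""
    if criterion not in CRITERION_COMPONENT or effect not in EFFECT_COMPONENT:
        raise ValueError("unknown criterion or effect")
    return [BugScenario(criterion, x, effect) for x in x_values]


# X=10 is the value used on the slide; 100 is an illustrative extra value.
for scenario in scenario_family(2, [10, 100], "A"):
    print(scenario.label())
```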