A Mechanism for Online Diagnosis of Hard Faults in Microprocessors Fred A. Bower, Daniel J. Sorin, and Sule Ozev

Dec 14, 2015

Brett Staker
Page 1

A Mechanism for Online Diagnosis of Hard Faults in Microprocessors

Fred A. Bower, Daniel J. Sorin, and Sule Ozev

Page 2

overview

Motivation

Current Techniques

Proposed Mechanism for Online Fault Diagnosis

Results

Challenges

Conclusion

Page 3

background

Hard Faults
• Electromigration
• Gate Oxide Breakdown

Transient Faults
• Single Event Upset

Page 4

motivation

Process Scaling

Page 5

current fault handling techniques

DIVA

Redundancy

Page 6

DIVA

UTILIZE REDUNDANCY

error detection and correction

hybrid approach

Page 7

online diagnosis

Track Units

DIVA ERROR

error_count++

If (error_count > threshold)

YES: Deconfigure unit

NO: No Action
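
The flow on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware design: the class name `OnlineDiagnoser`, the method `on_checker_error`, the unit names, and the threshold value are all hypothetical, and the sketch assumes that "Track Units" means every unit used by an erroring instruction has its counter incremented.

```python
from collections import Counter


class OnlineDiagnoser:
    """Toy model of the slide's flow: count checker-detected errors per unit."""

    def __init__(self, threshold):
        self.threshold = threshold      # hypothetical value, not given on the slides
        self.error_count = Counter()    # one error counter per tracked unit
        self.deconfigured = set()       # units that have been switched off

    def on_checker_error(self, units_used):
        """Handle one checker-detected error.

        units_used names the units the erroring instruction passed through.
        Returns the list of units deconfigured as a result of this error.
        """
        newly_deconfigured = []
        for unit in units_used:
            if unit in self.deconfigured:
                continue
            self.error_count[unit] += 1                     # error_count++
            if self.error_count[unit] > self.threshold:     # YES branch
                self.deconfigured.add(unit)
                newly_deconfigured.append(unit)
            # NO branch: no action
        return newly_deconfigured


diagnoser = OnlineDiagnoser(threshold=2)
first = diagnoser.on_checker_error(["alu0", "rob3"])
second = diagnoser.on_checker_error(["alu0", "rob5"])
third = diagnoser.on_checker_error(["alu0", "rs1"])
```

In this run only `alu0` appears in all three erroring instructions, so its counter is the only one to exceed the threshold and it is the only unit deconfigured.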

Page 8

Field Deconfigurable Units (FDU)

Units that can be turned off in case of a fault

[Figure labels: ALU, DIVA Checker, Reorder Buffer, Reservation Station]

Page 9

deconfiguring mechanism

• Deconfigure entries in circular buffer
• Deconfigure entries in tabular structure
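
The slide does not say how an entry in a circular buffer is deconfigured. One plausible approach, sketched below purely as an assumption, is to keep a per-entry enable bit and have the head and tail pointers skip disabled entries. The class `SkippingRing` and its methods are hypothetical names, and for simplicity the sketch only allows deconfiguration while the buffer is empty (for example, right after a pipeline flush).

```python
class SkippingRing:
    """Circular buffer whose pointers skip deconfigured entries (illustrative only)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.enabled = [True] * size   # per-entry enable bit (assumed mechanism)
        self.head = 0                  # index of the oldest entry
        self.tail = 0                  # index of the next free entry
        self.count = 0

    def capacity(self):
        return sum(self.enabled)

    def _advance(self, index):
        """Move to the next enabled entry, wrapping around the end."""
        size = len(self.slots)
        index = (index + 1) % size
        while not self.enabled[index]:
            index = (index + 1) % size
        return index

    def deconfigure(self, index):
        if self.count != 0:
            raise RuntimeError("this sketch only deconfigures an empty buffer")
        if self.enabled[index] and self.capacity() == 1:
            raise RuntimeError("cannot deconfigure the last working entry")
        self.enabled[index] = False
        if not self.enabled[self.head]:
            self.head = self._advance(self.head)
        self.tail = self.head          # buffer is empty, so both pointers coincide

    def push(self, item):
        if self.count == self.capacity():
            raise OverflowError("buffer full")
        self.slots[self.tail] = item
        self.tail = self._advance(self.tail)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        item = self.slots[self.head]
        self.slots[self.head] = None
        self.head = self._advance(self.head)
        self.count -= 1
        return item


ring = SkippingRing(4)
ring.deconfigure(1)                    # entry 1 is diagnosed as faulty
for name in ("a", "b", "c"):
    ring.push(name)
layout = list(ring.slots)              # entry 1 stays unused
drained = [ring.pop() for _ in range(3)]
```

The buffer keeps its FIFO order with one fewer usable entry, which matches the general idea of trading a little capacity for continued correct operation.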

Page 10

analysis

• Hard fault diagnosis latency

• Performance impact of losing component to hard fault

• DIVA: 6% of an Alpha 21264 core

• Error counters (~1227 bits total)

• Instruction resource usage (19 wires in total)

• Deconfiguration logic

• Can be reduced using coarse granularity

Page 11

challenges

Error count threshold
• Related to resource usage
• Heavily used resources have higher counters
• Pipeline flushes before threshold is reached
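
A small worked example may help show why the threshold is tied to resource usage. The sketch below is an assumption-laden toy, not the authors' model: it supposes that every instruction touching a hypothetical faulty unit is flagged by the checker and that every unit the flagged instruction used gets blamed. The function name, the trace, and the unit names are invented for illustration.

```python
from collections import Counter


def tally_checker_errors(trace, faulty_unit):
    """Return per-unit error counts for a trace of instructions.

    trace: iterable of tuples naming the units each instruction used.
    faulty_unit: hypothetical unit with a hard fault; every instruction
    that uses it is assumed to be flagged by the checker.
    """
    counts = Counter()
    for units_used in trace:
        if faulty_unit in units_used:
            counts.update(units_used)   # blame every unit the instruction used
    return counts


trace = [
    ("alu0", "rs0"),
    ("alu1", "rs0"),
    ("alu0", "rs1"),
    ("alu1", "rs0"),
    ("alu1", "rs2"),
]
counts = tally_checker_errors(trace, faulty_unit="alu1")
```

Here the faulty `alu1` ends with a count of 3, but the fault-free and heavily used `rs0` also reaches 2 simply because it was paired with `alu1` often. A threshold set too low would deconfigure `rs0` as well.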

Page 13

challenges

Transient faults

Independent resource usage

[Figure labels: ERROR, HARD FAULT, TRANSIENT FAULT; units A, B, C, D, E, F; Desired vs. Observed; DIVA CHECKER]

Page 14

limitations

• Certain structures cannot be protected
  • Register File
  • Issue logic
  • Common Data Bus (CDB)

• Transient fault → False Deconfiguration
  • Possibly masked by error counter

• Faults in the error counter or deconfiguration logic
  • Periodically test counters
  • Permanently configure or deconfigure FDU upon error

• Window of vulnerability
  • DIVA produces errors until counter saturates

Page 15

conclusion

• As transistors shrink, hard fault rate increases

• Current reliability mechanisms
  • Redundancy (TMR)
  • Thread-level redundancy
  • Pre-shipment testing and deconfiguration
  • Low cost solutions such as DIVA

• Online diagnosis
  • Low cost and hardware overhead
  • Use FDUs along with DIVA to diagnose faults dynamically
  • Increase yield → Binned to a lower performance bin

Page 16

discussion

What are the advantages of this hybrid scheme over using just a DIVA checker?

As process technology gets smaller, can this mechanism significantly increase the lifetime of the processor?

As transistors shrink, the number of cores will increase. Can this mechanism still be used, as opposed to turning off a faulty core?

How can we extend this mechanism to cover the issue logic, singleton resources, and the CDB?

Page 17

citations

images

• Electron Migration. Digital image. Wikimedia.org. Wikimedia, 6 Mar. 2007. Web.

<http://upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Leiterbahn_ausfallort_elektromigration.jpg/220px-Leiterbahn_ausfallort_elektromigration.jpg>.

• Gate Oxide Breakdown. Digital image. Attopsemi Technology. Attopsemi Technology, n.d. Web. <http://www.attopsemi.com/tec3.htm>.

• Sawant, Minal. Single Event Upset. Digital image. COTS. Microsemi, Jan. 2012. Web. <http://www.cotsjournalonline.com/articles/view/102279>.

• Sawant, Minal. Soft Error Rate. Digital image. CCCP. University of Michigan, 11 May 2012. Web. <http://cccp.eecs.umich.edu/research/reliability.php>.

• Carr, Robert. Simultaneous Multithreading. Digital image. Prezi. Prezi, 31 Oct. 2013. Web. <http://prezi.com/tegbbfk34l57/question-2/>.

• Wong, William. Out of Order Pipeline. Digital image. Electronic Design. Electronic Design, 19 Oct. 2011. Web. <http://electronicdesign.com/microcontrollers/little-core-shares-big-core-architecture>.

• Mark Brehob, EECS 470 Lecture Slides

• Fred A. Bower, Daniel J. Sorin, and Sule Ozev. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. In Proc. of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’05), 2005.

• T.M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proc. of the 32nd Annual IEEE/ACM Int’l Symposium on Microarchitecture, pages 196-207, Nov. 1999.

papers