1. Introduction

1. Faults and their manifestation (4)

2. Analysis of faults (12)

3. Classification of tests (5)

4. Fault coverage requirements (3)

5. Test economics (4)

1.1 Faults and their manifestation

Definition of the terms Failure, Error and Fault

Failure: A system failure is present when the service of the system differs from the expected service

A failure is caused by an error

Error: There is an error in the system when its state differs from the state required to deliver the expected service

An error is caused by a fault

Fault: A fault is present when there is a physical difference between the correct system and the current system

1.2 Faults and their manifestation: Example

Example: A car cannot be used due to a flat tire

Failure: The car cannot be driven due to a flat tire

I.e., the service differs from the expected service

The failure is caused by an error

Error: The air pressure has an erroneous state

An error is caused by a fault

Fault: A puncture, causing an erroneous air-pressure-state

I.e., the puncture is the difference between the correct system and the current system

Note: A fault may not immediately result in a failure; e.g., a slowly leaking tire

1.3 Fault manifestation

According to the way faults manifest themselves in time, they can be divided into permanent and non-permanent faults

Permanent fault: Affects the system’s functional behavior permanently

Permanent faults are also referred to as solid or hard faults

Examples: Broken wires, functional design errors, etc.

Non-permanent fault: Affects the system’s functional behavior only part of the time

1.4 Non-permanent faults

Non-permanent faults are only present part of the time

• They occur at random moments and affect the system behavior for finite periods of time

• Therefore, their detection and localization are difficult

They fall into two groups:

• Transient faults

– Caused by environmental conditions

– Also referred to as soft errors

Examples: cosmic rays, α-particles, temperature, pressure, vibration

• Intermittent faults

– Caused by non-environmental conditions

Examples: Loose connections, deteriorating or aging components

2.1 Analysis of faults

The following topics explain this subject

• Analyze the frequency of occurrence of faults

• Analyze system failure rate over its life time

• Show failure rates of series and parallel systems

• Explain physical and electrical causes of faults

These are referred to as failure mechanisms

2.2 Frequency of occurrence of faults (1)

Can be explained using reliability theory

The point in time t at which a fault occurs can be considered a random variable u

The probability of a failure before time t, F(t), is the unreliability of the system:

$$F(t) = P(u \le t)$$

The reliability of a system, R(t), is the probability of a correctly functioning system at time t:

$$R(t) = 1 - F(t)$$

or alternatively:

$$R(t) = \frac{\#\text{ of components surviving at time } t}{\#\text{ of components at time } 0}$$

It is assumed that:

F(0) = 0: Initially the system will be operable

F(∞) = 1: Ultimately the system will fail

F(t) + R(t) = 1: System is either operable or failing

2.3 Frequency of occurrence of faults (2)

The derivative of F(t), f(t), is called the failure probability density function:

$$f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}$$

Hence:

$$F(t) = \int_0^t f(\tau)\,d\tau \quad\text{and}\quad R(t) = \int_t^\infty f(\tau)\,d\tau$$

The failure rate, z(t), is defined as the conditional probability per unit time that the system fails during the period (t, t+Δt), given that the system was operational at time t:

$$z(t) = \lim_{\Delta t \to 0} \frac{F(t+\Delta t) - F(t)}{\Delta t} \cdot \frac{1}{R(t)} = \frac{dF(t)}{dt} \cdot \frac{1}{R(t)} = \frac{f(t)}{R(t)}$$

Alternatively, z(t) can be expressed as follows:

$$z(t) = \frac{\#\text{ of components failing per unit time at time } t}{\#\text{ of surviving components at time } t}$$
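The counting forms of R(t) and z(t) can be tried on a small data set. The following is a minimal sketch, assuming survivor counts sampled at equally spaced times; the function names and the counts are made up for illustration and are not part of the slides.

```python
def empirical_reliability(survivors):
    """R(t_k) = (# surviving at t_k) / (# of components at time 0)."""
    n0 = survivors[0]
    return [s / n0 for s in survivors]


def empirical_failure_rate(survivors, dt):
    """z(t_k) = (# failing per unit time after t_k) / (# surviving at t_k)."""
    rates = []
    for now, later in zip(survivors, survivors[1:]):
        failing_per_unit_time = (now - later) / dt
        rates.append(failing_per_unit_time / now)
    return rates


# Hypothetical survivor counts, sampled once per time unit (made-up data).
survivors = [1000, 900, 810, 729]
R = empirical_reliability(survivors)
F = [1 - r for r in R]                      # F(t) = 1 - R(t)
z = empirical_failure_rate(survivors, dt=1.0)
print(R)
print(z)
```

In this made-up data set 10% of the survivors fail in every interval, so the estimated failure rate is the same in each interval even though R(t) keeps dropping.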

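2.4 Frequency of occurrence of faults (3)

R(t) can be expressed in terms of z(t) as follows:

$$\int_0^t z(\tau)\,d\tau = \int_0^t \frac{f(\tau)}{R(\tau)}\,d\tau = -\int_{R(0)}^{R(t)} \frac{dR}{R} = -\ln\frac{R(t)}{R(0)}$$

or,

$$R(t) = R(0)\, e^{-\int_0^t z(\tau)\,d\tau}$$

The average lifetime of a system, θ, can be expressed as the mathematical expectation of t:

$$\theta = \int_0^\infty t \cdot f(t)\,dt$$

For a non-maintained system, θ is called the Mean Time To Failure, MTTF. Using partial integration, and assuming $\lim_{T\to\infty} T \cdot R(T) = 0$:

$$MTTF = \theta = \lim_{T\to\infty}\left\{ \left. -t \cdot R(t) \right|_0^T + \int_0^T R(t)\,dt \right\} = \int_0^\infty R(t)\,dt$$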
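The relation between R(t) and z(t) can be checked numerically. The sketch below assumes R(0) = 1 and uses a hypothetical, linearly increasing failure rate z(t) = a·t; the function names and the value of a are illustrative only.

```python
import math


def reliability_from_failure_rate(z, t, steps=10_000):
    """R(t) = R(0) * exp(-integral_0^t z(tau) dtau), with R(0) = 1 assumed."""
    h = t / steps
    # Trapezoidal rule for the integral of z over [0, t].
    integral = 0.5 * (z(0.0) + z(t))
    for k in range(1, steps):
        integral += z(k * h)
    integral *= h
    return math.exp(-integral)


def wear_out_rate(t, a=0.02):
    """Hypothetical linearly increasing failure rate z(t) = a * t."""
    return a * t


t = 10.0
numeric = reliability_from_failure_rate(wear_out_rate, t)
closed_form = math.exp(-0.02 * t**2 / 2)    # exp(-a * t^2 / 2) for z(t) = a * t
print(numeric, closed_form)
```

The two printed values should agree closely, since the trapezoidal rule introduces no discretization error for a linear integrand.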

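2.5 Frequency of occurrence of faults (4)

Given a system with the following reliability:

$$R(t) = e^{-\lambda t}$$

The failure rate, z(t), of that system is computed below, and has a constant value:

$$z(t) = \frac{f(t)}{R(t)} = \frac{dF(t)}{dt} \Big/ R(t) = \frac{d(1 - e^{-\lambda t})}{dt} \Big/ e^{-\lambda t} = \lambda e^{-\lambda t} / e^{-\lambda t} = \lambda$$

Assuming failures occur randomly with a constant rate λ, the MTTF can be expressed as:

$$MTTF = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$$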
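The result MTTF = 1/λ for a constant failure rate can be confirmed by integrating R(t) numerically. This is a rough sketch; the value of λ and the finite integration horizon are assumptions chosen for illustration.

```python
import math


def mttf_numeric(R, horizon, steps=100_000):
    """Approximate MTTF = integral_0^inf R(t) dt by a trapezoidal sum up to `horizon`."""
    h = horizon / steps
    total = 0.5 * (R(0.0) + R(horizon))
    for k in range(1, steps):
        total += R(k * h)
    return total * h


lam = 0.001                                  # hypothetical failure rate, per hour


def R_exp(t):
    """R(t) = exp(-lambda * t) for the constant failure rate above."""
    return math.exp(-lam * t)


# The horizon must be long enough that R(horizon) is negligible.
estimate = mttf_numeric(R_exp, horizon=20 / lam)
print(estimate, 1 / lam)
```

With these assumed numbers the estimate should land very close to 1/λ = 1000 hours.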

2.6 Frequency of occurrence of faults (5)

Example: R(t) & F(t) of the Dutch male population (over the years 1976–1980)

[Figure: R(t) & F(t) of Dutch male population]

Note: # of people > 100 yrs old too small

[Figure: z(t) and f(t) of Dutch male population]

Note: Increase of z(t) & f(t) between ages 18 and 20 due to driving accidents

Note: Infant mortality rate

2.7 Failure rate over product lifetime (1)

A well-known graphical representation of the failure rate, z(t), is the bathtub curve. It consists of three regions:

1. Infant mortality

Failures in this region are termed infant mortalities. They are attributed to poor quality due to variations in the production process

2. Working life; Constant failure rate: z(t) = λ

Failures are considered to occur randomly in time

3. Wear out; Increasing failure rate

This represents the end-of-life period of a system

A system should therefore be shipped after it has passed the infant mortality period, in order to reduce the number of field returns.

[Figure: z(t) of Dutch males]

2.8 Failure rate over product lifetime (2)

Shipping a system after the infant mortality period can be done by:

1. Aging the system for that period (this can be several months)

2. Aging the system under stress

– This accelerates the aging process

An important stress condition is increased temperature: Burn-In

The accelerating effect of temperature follows Arrhenius’ equation:

$$\lambda_{T_2} = \lambda_{T_1} \cdot e^{E_a (1/T_1 - 1/T_2)/k}$$

• T1 and T2 are absolute temperatures (in kelvin, K)

• $\lambda_{T_1}$ and $\lambda_{T_2}$ are the failure rates at T1 and T2, respectively

• Ea is the activation energy; a constant expressed in electron-volts, eV

• k is Boltzmann’s constant: k = 8.617 × 10^-5 eV/K

The equation shows that the failure rate is exponentially dependent on the temperature

2.9 Failure rate over product lifetime (3)

Example of use of Arrhenius’ equation

Assume Burn-In takes place at 150 °C = 423 K; i.e., T2 = 423

Note: Room temperature is taken as 30 °C = 303 K; i.e., T1 = 303

Given that the Ea for the targeted failure mechanism is: Ea = 0.6 eV

Then the acceleration factor is:

$$\lambda_{T_2} / \lambda_{T_1} = e^{0.6\,(1/303 - 1/423)/(8.617 \cdot 10^{-5})} = 678$$

This means that the 150 °C temperature stress reduces the aging time by a factor of 678.

Failure mechanism                Ea: Activation energy
Corrosion of metallization       0.3 – 0.6 eV
Electrolytic corrosion           0.8 – 1.0 eV
Electromigration                 0.4 – 0.8 eV
Bonding (purple plague)          1.0 – 2.2 eV
Ionic contamination              0.5 – 1.0 eV
Alloying (contact migration)     1.7 – 1.8 eV

Note: Every failure mechanism has its typical Ea value
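The burn-in example can be reproduced with a few lines of code. This is a minimal sketch of Arrhenius' equation as stated on the slides; the function and constant names are chosen here for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5   # Boltzmann's constant in eV/K, as given on the slide


def acceleration_factor(ea_ev, t1_kelvin, t2_kelvin):
    """Ratio lambda_T2 / lambda_T1 from Arrhenius' equation."""
    return math.exp(ea_ev * (1 / t1_kelvin - 1 / t2_kelvin) / BOLTZMANN_EV_PER_K)


# Values from the example: Ea = 0.6 eV, T1 = 303 K, T2 = 423 K.
af = acceleration_factor(ea_ev=0.6, t1_kelvin=303, t2_kelvin=423)
print(round(af))   # 678
```

Because Ea sits in the exponent, the same temperature step gives very different acceleration factors for the different failure mechanisms listed in the table.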

2.10 Failure rates of series and parallel systems

A series system is a system of which all components have to be operational in order for the system to be operational

Consider that the system consists of n components with reliability Ri(t), then the reliability of the system, Rs(t), is:

$$R_s(t) = \prod_{i=1}^{n} R_i(t)$$

It can be shown that:

$$z_s(t) = \sum_{i=1}^{n} z_i(t)$$

A parallel system is a system which is operational as long as at least one of its n components is operational. The unreliability is:

$$F_p(t) = \prod_{i=1}^{n} F_i(t)$$

The reliability is:

$$R_p(t) = 1 - \prod_{i=1}^{n} F_i(t)$$
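The series and parallel formulas translate directly into code. The sketch below assumes the component failures are independent, which the product formulas rely on; the function names and the component reliabilities are hypothetical.

```python
import math


def series_reliability(reliabilities):
    """R_s = product of R_i: every component must work."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """R_p = 1 - product of F_i, with F_i = 1 - R_i: one working component suffices."""
    return 1 - math.prod(1 - r for r in reliabilities)


def series_failure_rate(failure_rates):
    """z_s = sum of z_i."""
    return sum(failure_rates)


components = [0.9, 0.9, 0.9]   # hypothetical component reliabilities at some time t
print(series_reliability(components))     # about 0.729
print(parallel_reliability(components))   # about 0.999
```

With the same three components, the series arrangement is less reliable than any single component, while the parallel arrangement is more reliable than any of them.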

2.11 Failure mechanisms

Failure mechanisms describe the physical and electrical causes for faults. They can be divided into 3 classes:

1. Electrical stress

Poor design leading to electrical overstress, or careless handling causing static damage

2. Intrinsic failure mechanisms

Inherent to the semiconductor material itself.

Examples: Crystal defects, dislocations and processing defects

3. Extrinsic failure mechanisms

Originate in the packaging and interconnection process

Examples: Poor bonding, corrosion, etc.

2.12 Failure mechanisms

Failure mechanism class

Electrical stress

• Electrical overstress
• Electrostatic discharge

Intrinsic failure mechanisms

• Gate oxide breakdown
• Ionic contamination
• Surface charge spreading
• Charge effects
  – Slow trapping
  – Hot electrons
  – Secondary slow trapping
• Piping
• Dislocations

Extrinsic failure mechanisms

• Packaging
• Metallization
  – Corrosion
  – Electromigration
  – Contact migration
  – Microcracks
• Bonding (purple plague)
• Die attachment failure
• Particle contamination
• Radiation
  – External
  – Intrinsic

3.1 Classification of tests

A test is a procedure which allows one to distinguish between good and bad parts

Tests can be classified according to:

1. The technology they are designed for

2. The parameters they measure

3. The purpose for which the test results are used

4. The test application method

3.2 Technology aspects

The type of test depends heavily on the technology of the circuit to be tested:

1. Analog tests

The domain of input and output signal values is analog; i.e., they can take on any value within a given range (Ex.: a range of 0 – 5 V)

Analog tests aim at determining the values of analog parameters such as voltage and current levels, frequency response, bandwidth, etc. The generation of the input stimuli and the measurement of the responses are inherently imprecise. Therefore, a range of values is used to determine the operational correctness

2. Digital tests

The input and output signals are digital (0 or 1); hence, precise. These tests are called logical or digital tests.

3. Mixed signal tests

The domain of either the input or the output values is analog, while the other is digital. Typically used for testing digital-to-analog and analog-to-digital converters

3.3 Measured parameter aspects

The nature of the measured parameter can be:

1. Logical: Logical tests aim at detecting faults causing a change in the logical behavior of the system (a 0 is expected, while a 1 is measured)

2. Electrical: Electrical tests measure the values of electrical parameters (voltage and current levels) as well as their behavior over time; they can be divided into Parametric and Dynamic tests

Parametric tests are concerned with the external behavior of the circuit

Ex.: Voltage & current levels & delays on the input & output pins

– DC parametric tests are concerned with the time-independent properties of the input and output values

IDDQ tests are a special class of DC parametric tests; they are concerned with the leakage currents during the quiescent state of the circuit

– AC parametric tests are concerned with the time-dependent properties of the input and output values

Dynamic tests aim at faults which are time-dependent and internal to the chip

3.4 Purpose of test results

The most obvious use of the test results is to distinguish between good and bad parts. This can be done with a test which detects faults. In case of repair, a test capable of locating faults is required.

Testing can be done during normal use of the system; this is referred to as concurrent testing. For example, parity checking is a simple form of concurrent testing. Alternatively, non-concurrent tests cannot be performed during normal use of the system, because they do not preserve the application data. However, they usually have a higher fault detection capability.

Design-for-Testability (DFT) includes extra circuitry on the to-be-tested chip; it allows non-concurrent tests to be performed faster and/or with a higher fault coverage.

Built-in-Self Test (BIST) includes extra circuitry on the to-be-tested chip, to the extent that the complete test function can be performed on chip, without external tester support.
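Parity checking, mentioned above as a simple form of concurrent testing, can be illustrated in a few lines. This is a sketch using even parity over a hypothetical 4-bit data word; the helper names are made up for this example.

```python
def add_parity(data_bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return data_bits + [sum(data_bits) % 2]


def parity_ok(word):
    """Check performed on every read, i.e. concurrently with normal use."""
    return sum(word) % 2 == 0


stored = add_parity([1, 0, 1, 1])      # hypothetical 4-bit data word
assert parity_ok(stored)

corrupted = stored.copy()
corrupted[2] ^= 1                      # single bit flip, e.g. from a transient fault
print(parity_ok(corrupted))            # False: the error is detected

corrupted[0] ^= 1                      # a second flip restores even parity
print(parity_ok(corrupted))            # True: double-bit errors escape detection
```

The check leaves the stored data untouched, which is what makes it usable during normal operation; the second case shows the limited fault detection capability of such a simple scheme.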

3.5 Test application methods

Tests can also be classified according to the way the test stimuli are applied and the test responses are evaluated

• External test: Automatic Test Equipment ‘ATE’ is used to apply the test stimuli and evaluate the test responses

At the board level the stimuli can be applied:

– Via the regular board connectors

Allows for a simple interface with the ATE and for at-speed testing. However, not all circuits are easy to reach. Manual test program design is required; the resulting tests are called functional tests

– Via a special fixture (set of connectors)

That way each component’s pins become accessible. Structural tests, which can be generated automatically, can now be used.

• Internal test (BIST)

The ATE function is completely integrated on the to-be-tested chip. This requires extra silicon area; however, no ATE is required and the chip can be tested at speed.

4.1 Fault coverage requirements (1)

Given a chip with potential defects, the question can be raised of how extensive the tests have to be.

This question can be answered in terms of the chip’s defect level and the yield of the fabrication process.

• Defect Level ‘DL’ is the fraction of the parts passing all tests that is defective

– Values for DL are usually expressed in Parts Per Million ‘PPM’

• Process Yield ‘Y’ is the fraction of the manufactured parts that is fault free. The exact value is hard to establish. Therefore, Y is approximated as follows:

$$Y = \frac{\#\text{ of not-defective parts}}{\text{total } \#\text{ of parts}}$$

• Fault Coverage ‘FC’ is a measure of the quality of a test. It is defined as:

$$FC = \frac{\text{actual } \#\text{ of detected faults}}{\text{total } \#\text{ of faults}}$$

In practice it is impossible to have a complete test (FC=1), because of:

1. Imperfect fault modeling: An actual fault may not correspond with a modeled fault

2. Data dependency of faults (e.g., the carry function in an ALU)

3. Testability limitations (e.g., ATE pin and/or speed limitations)

4.2 Fault coverage requirements (2)

Because tests may not be complete, a defective chip may pass the tests.

Assume that a chip has exactly n possible Stuck-At Faults ‘SAFs’

– A SA0 fault causes a 0 value on a line; a SA1 fault causes a 1 value

Let m be the number of faults detected by the test (m ≤ n)

Assume that the probability of a fault is independent of the occurrence of another fault (i.e., there is no fault clustering) and that all faults are equally likely with probability p

Assume that: A is the event that a part is free of defects, and B is the event that a part has been tested for m defects while none were found. Then:

• The Fault Coverage of a test is defined as:

$$FC = m/n$$

• The Process Yield is defined as:

$$Y = (1-p)^n = P(A)$$

• The probabilities of B and of A and B together are:

$$P(B) = (1-p)^m, \qquad P(A \cap B) = P(A) = (1-p)^n$$

$$P(A \mid B) = P(A \cap B)/P(B) = (1-p)^n/(1-p)^m = (1-p)^{n(1-m/n)} = Y^{(1-FC)}$$

• DL can now be expressed as:

$$DL = 1 - P(A \mid B) = 1 - Y^{(1-FC)}$$
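The derivation of DL can be cross-checked with a small Monte Carlo experiment under the same assumptions (n equally likely, independent faults, of which the test covers m). The sketch below is illustrative only; the fault counts, trial count and seed are arbitrary choices.

```python
import random


def simulate_defect_level(n, m, p, trials, seed=0):
    """Estimate DL = P(part is defective | part passed a test covering m of n faults)."""
    rng = random.Random(seed)
    passed = 0
    passed_but_defective = 0
    for _ in range(trials):
        # Each of the n possible faults is present independently with probability p.
        faults = [rng.random() < p for _ in range(n)]
        if any(faults[:m]):
            continue                    # a covered fault is present: the test rejects the part
        passed += 1
        if any(faults[m:]):
            passed_but_defective += 1   # only uncovered faults: a test escape
    return passed_but_defective / passed


n, m = 10, 8                            # hypothetical fault counts, so FC = m / n = 0.8
yield_target = 0.5
p = 1 - yield_target ** (1 / n)         # chosen so that Y = (1 - p)^n = 0.5
estimate = simulate_defect_level(n, m, p, trials=100_000)
predicted = 1 - yield_target ** (1 - m / n)
print(estimate, predicted)
```

The simulated fraction of test escapes should be close to the value predicted by DL = 1 − Y^(1−FC), up to sampling noise.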

4.3 Fault coverage requirements (3)

DL is expressed as (see figure):

$$DL = 1 - P(A \mid B) = 1 - Y^{(1-FC)}$$

For large values of Y (i.e., a manufacturing process with a high yield), it approaches a straight line

Example: Assume a manufacturing process with Y = 0.5 and FC = 0.8, then:

$$DL = 1 - 0.5^{(1-0.8)} \approx 0.1294$$

This means that about 12.9% of the shipped parts are defective!

If a DL = 200 PPM (i.e., DL = 0.0002) is required, given Y = 0.5, then:

$$FC = 1 - \frac{\log(1-DL)}{\log Y} = 0.99971$$

This is an FC of 99.971%
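Both directions of the DL formula, from fault coverage to defect level and back, can be wrapped in two small helper functions. This is a sketch with hypothetical function names, evaluated on the numbers of the example.

```python
import math


def defect_level(process_yield, fault_coverage):
    """DL = 1 - Y^(1 - FC)."""
    return 1 - process_yield ** (1 - fault_coverage)


def required_fault_coverage(process_yield, target_dl):
    """FC = 1 - log(1 - DL) / log(Y), the inverse of defect_level."""
    return 1 - math.log(1 - target_dl) / math.log(process_yield)


dl = defect_level(0.5, 0.8)
fc = required_fault_coverage(0.5, 200e-6)   # 200 PPM
print(f"DL = {dl:.4f}")
print(f"FC = {fc:.5f}")                     # FC = 0.99971
```

The second result shows how demanding a low-PPM target is at a yield of 0.5: going from FC = 0.8 to the required coverage means leaving fewer than 3 in 10,000 faults undetected.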

5.1 Test economics

Repair cost during the product phases

A move from one product phase to the next causes the volume of parts and the test & repair cost to increase by a factor of 10. This is the rule-of-ten

Economics and liability of testing. Good tests

• reduce test & repair cost (see above rule-of-ten)
• can reduce development time & time-to-market
• can reduce field maintenance costs
• reduce personal injury and lawsuits

There is an optimum in test development cost and its contribution to profit: Too many tests require a long test development time and a high test cost

[Figure: optimum of test development cost versus profit]

5.2 Total profit

The lifetime of a product has several economic phases

• The development phase
– Product design takes place
– No income; only expenses
– Area under zero-line is development cost

• The market growth phase
– Market acceptance increases with time

• The market decline phase
– Product becomes less attractive
– Market share decreases
– Price may have to be reduced

The total profit over the lifetime of a product is the area above the zero-line (revenue) minus the area below the zero-line (development cost)

In case of a delay ‘D’ in product development, the development cost is higher, while the revenue is reduced, because the obsolescence point will not change

5.3 Product development delay cost

Assuming M is the maximum market growth, which is reached after time W, the revenue lost due to a delay D (hatched area) can be computed as follows:

• The Expected Revenue ‘ER’ is:

$$ER = \frac{1}{2} \cdot 2W \cdot M = W \cdot M$$

• The Revenue of the Delayed Product ‘RDP’ is:

$$RDP = \frac{1}{2} \cdot (2W - D) \cdot \left(\frac{W - D}{W} \cdot M\right)$$

• The Lost Revenue ‘LR’ is:

$$LR = ER - RDP = W \cdot M - \frac{2W^2 - 3DW + D^2}{2W} \cdot M = ER \cdot \frac{D(3W - D)}{2W^2}$$
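The delay-cost model can be evaluated for concrete numbers. The sketch below implements the triangular market model with a fixed obsolescence point at 2W, assuming D ≤ W; the function names and the values of M, W and D are hypothetical.

```python
def expected_revenue(m, w):
    """ER = 1/2 * 2W * M = W * M (triangle with base 2W and height M)."""
    return w * m


def delayed_revenue(m, w, d):
    """RDP = 1/2 * (2W - D) * ((W - D) / W * M)."""
    return 0.5 * (2 * w - d) * ((w - d) / w * m)


def lost_revenue(m, w, d):
    """LR = ER * D * (3W - D) / (2 * W^2)."""
    return expected_revenue(m, w) * d * (3 * w - d) / (2 * w**2)


# Hypothetical product: market peak M = 100 per month, W = 12 months, delay D = 3 months.
M, W, D = 100.0, 12.0, 3.0
print(expected_revenue(M, W))     # 1200.0
print(lost_revenue(M, W, D))      # 412.5
```

In this made-up case a delay of a quarter of the growth period costs roughly a third of the expected revenue, which illustrates why a shorter time-to-market matters so much.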

5.4 Life-cycle cost

The cost of a product over its lifetime consists of:

1. The design cost

This typically is on the order of 5% of the product cost

2. The manufacturing cost

This is the cost associated with the production and sales of the product

3. The maintenance cost

The cost associated with repair, calibration, etc.

This may be the largest cost factor

Note: Product life can be 30 years; e.g., for a telephone exchange
