Memory Diagnosability TOI

Mike Buckley
Platforms TRE
Sun Microsystems


Sun Proprietary/Confidential: Internal Use Only

Goals

• Improve Customer Satisfaction by reducing time to resolution through increased technical proficiency

• Reduce the incidence of inaccurate Onsite Action Plans and wrong parts ordered

• Replace memory dimms and other parts only when necessary

• Correctly identify proper dimm size and part number

• Accurately diagnose various memory issues

• Save SUN money $$$


Topics

• Types of memory errors

• Sun's Best Practices regarding memory errors (review)

• Dimm size and part number identification techniques

• Diagnostic tools and utilities (some new ones)

• Troubleshooting tips and techniques

• Examples of error messages and tool usage

• Known memory issues (PLL dimm chip)

• Resources


CE: Correctable (single-bit) errors

Because ECC can correct single-bit flips, single-bit errors are referred to as Correctable Errors. These are detected and corrected and generally do not impact performance.

Types of CEs:

• Intermittent

• Persistent

• Sticky Bit

UE: Uncorrectable (multi-bit) errors

Multi-bit errors are referred to as Uncorrectable Errors. These are detected, but not corrected. These will result in machine reset (panic or reboot).


Correctable Errors (CE):

When a CE is detected, the device that read the word and detected the error can correct the data read and continue on unimpeded. However, this does not address the fact that the referenced word could still be resident in memory uncorrected (i.e. a subsequent read of this word could result in another CE event). If, over time, this word in memory is never corrected, the possibility arises that another bit may flip in the same word. This would lead to a UE event, which would result in a loss of system service. To avoid this possibility, the detection of a CE causes a trap to Solaris. The Solaris error handling code logs the error and scrubs the affected memory word by writing the corrected word back into memory.


Intermittent:
Means the error was not detected on a reread of the affected memory word. "Intermittent" is not the best choice of words because it implies that this same error can be expected to manifest itself at irregular intervals. This CE is also known as a transient soft error. No DIMM with this sort of error should be considered for replacement without first examining the soft error rate (SER) of this DIMM.

Persistent:
Means the error was detected again on a re-read of the affected memory word but the scrub operation corrected it. This CE is also known as a temporary soft error. No DIMM with this sort of error should be considered for replacement without first examining the SER of this DIMM.

Sticky (aka Sticky Bit):
Means that the error still exists in memory even after the scrub operation. This CE is also known as a stuck-at hard error. No DIMM with this sort of error should be considered for replacement without first examining the SER of this DIMM.
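The three CE types differ only in what a re-read finds before and after the scrub. A minimal Python sketch of that decision follows; the function name and its two boolean inputs are hypothetical illustrations, not a Solaris interface.

```python
def classify_ce(error_on_reread: bool, error_after_scrub: bool) -> str:
    """Classify a correctable error using the definitions above.

    Both arguments are hypothetical observations, not a Solaris API:
      error_on_reread   - the error was seen again when the word was re-read
      error_after_scrub - the error was still present after the scrub
    """
    if not error_on_reread:
        return "Intermittent"   # transient soft error
    if not error_after_scrub:
        return "Persistent"     # soft error cleared by the scrub
    return "Sticky"             # stuck-at hard error


for reread, after_scrub in [(False, False), (True, False), (True, True)]:
    print(reread, after_scrub, classify_ce(reread, after_scrub))
```

Whatever the type, the replacement decision still depends on the SER of the DIMM, not on the classification alone.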


# cat messages | grep -i memory

May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info] [AFT0] errID 0x00055be6.99821550 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Persistent

May 24 16:12:40 smro97 SUNW,UltraSPARC-III+: [ID 910566 kern.info] [AFT0] errID 0x00055c2d.d4cf4320 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Sticky

May 24 16:12:40 smro97 unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module /N0/SB3/P2/B0/D2 J15500
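Messages like these can be tallied per memory module and CE type before deciding whether anything needs attention. The sketch below is one way to do that in Python. The regular expression is an assumption derived only from the sample lines on this slide, and other platforms word these messages differently.

```python
import re
from collections import Counter

# Pattern assumed from the sample messages on this slide only.
CE_LINE = re.compile(
    r"Corrected Memory Error on (?P<module>.+?) is "
    r"(?P<kind>Intermittent|Persistent|Sticky)"
)


def count_ce_by_module(lines):
    """Return a Counter keyed by (module, CE type)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            counts[(match.group("module"), match.group("kind"))] += 1
    return counts


sample = [
    "May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info] [AFT0] "
    "errID 0x00055be6.99821550 Corrected Memory Error on "
    "/N0/SB3/P2/B0/D2 J15500 is Persistent",
    "May 24 16:12:40 smro97 SUNW,UltraSPARC-III+: [ID 910566 kern.info] [AFT0] "
    "errID 0x00055c2d.d4cf4320 Corrected Memory Error on "
    "/N0/SB3/P2/B0/D2 J15500 is Sticky",
]

for (module, kind), n in count_ce_by_module(sample).items():
    print(module, kind, n)
```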


Here is an example of a memory error that does not involve a CPU module. A PCI controller was reading data from memory.

May 17 18:45:01 j2kweb06 unix: WARNING: correctable error from pci0 (upa mid 1f) during dvma read transaction

May 17 18:45:01 j2kweb06 unix: AFSR=40f10000.9f800000 AFAR=00000000.4fbc1e60,

May 17 18:45:01 j2kweb06 double word offset=4, Memory Module <U0402> port id 31.

If this is just a single event, there is nothing to worry about: there was simply a correctable ECC event on a read from memory.

The only difference from a "normal" CE event is that this one happened to be detected by the PCI controller (since it was doing the read) instead of a CPU.

That is the reason for having ECC-protected memory. The system is doing its job and functioning normally.


Uncorrectable Errors (UE):

If a UE is detected, the device that read the word and detected the error cannot correct the data and continue.

A UE will cause Solaris to panic if the UE is in kernel memory, or to kill the particular user process that contained the memory in error and then issue an orderly shutdown and reboot to protect the other processes in the domain.

Either way, whether via panic or shutdown and reboot, the customer is considerably impacted (and will likely call for support).


Memory Scrubber:

The Solaris OS runs a memory "scrubber" routine as part of its normal operation.

The time interval is 12 hours for scrubbing stale (unused, idle) memory pages.

This scrubber does not do anything special besides ensure that every memory location is accessed at least once every 12 hours.

If the access finds a CE, then the normal trap to the Solaris OS that occurs for any CE will scrub the affected memory word by writing the corrected word back into memory and log the event.

This ensures that multiple CEs do not have time to build up and form a UE at memory locations that are infrequently accessed.

Correctable memory errors reported EXACTLY every twelve hours are a result of the memory scrubber (see Infodoc 74049: “How often does the memory scrubber run?”).

The normal rules apply: a DIMM should only be replaced if it meets the criteria described in Infodoc 79928, Sun Enhanced Memory DIMM Replacement Policy.
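A quick way to recognise scrubber-driven CEs is to check whether the events on one DIMM are spaced twelve hours apart. The Python sketch below does that. The one-minute tolerance and the sample timestamps are made-up assumptions for illustration, not values from Solaris or from a real case.

```python
from datetime import datetime, timedelta

SCRUB_INTERVAL = timedelta(hours=12)


def looks_like_scrubber_cadence(timestamps, tolerance=timedelta(minutes=1)):
    """True if every consecutive pair of CE timestamps is ~12 hours apart.

    The tolerance is an assumed allowance for logging jitter.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return False
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return all(abs(gap - SCRUB_INTERVAL) <= tolerance for gap in gaps)


# Hypothetical timestamps, not taken from a real messages file.
scrubber_like = [
    datetime(2006, 5, 24, 4, 0, 5),
    datetime(2006, 5, 24, 16, 0, 5),
    datetime(2006, 5, 25, 4, 0, 6),
]
print(looks_like_scrubber_cadence(scrubber_like))
```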


BADWRITERS:

1. Sometimes multiple memory DIMMs within a system can start reporting soft errors.

Examining the messages may reveal that the same databit (or error syndrome) is in error on each DIMM.

This indicates that some other component is actually writing the bad data to RAM and consistently creating errors at the same bit address, regardless of the physical DIMM.

Recognizing this pattern, and troubleshooting further can prevent much wasted downtime and cost, and the replacing of perfectly good memory DIMMs.

2. When a DIMM is replaced and the errors persist, or return with the same data bit in error, some other component in the system is likely causing the memory errors.

Again, recognizing this possibility can head off assumptions that replacing memory will solve the problem.

3. In terms of CE/ECC, a system may only reveal errors when the failing address range is utilized by a particular application or combination of applications.

This is almost always a hardware fault. In very rare instances, bad code may generate errors that appear to be hardware.

A good first step when troubleshooting a reproducible CE memory issue is to isolate or disable the suspect memory component(s) via asr-disable, setenv disabled-memory-list X, or setenv disabled-board-list X (all under OBP), or via psradm -f X or cfgadm (under the OS).

If disabling the suspect memory components is not possible, it may be advisable (especially on lower-end machines) to swap the suspect DIMM with another DIMM in the same bank. If the problem follows the DIMM, replace it. If the problem persists in the same location, it is not a bad DIMM issue.

Note: FINDAFT is especially useful when diagnosing Bad Writer scenarios. Look for a common CPU (one implicated more often than the other CPUs) as the possible Bad Writer.
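The Bad Writer pattern in point 1 (the same data bit in error on several DIMMs) can be spotted by grouping CE events by bit. A minimal Python sketch follows. The event list is hypothetical, and the function is an illustration only, not part of FINDAFT.

```python
from collections import defaultdict


def shared_bits(events):
    """events: iterable of (module, data_bit) pairs.

    Returns {data_bit: sorted modules} for bits seen on two or more modules,
    which is the pattern that suggests a bad writer rather than bad DIMMs.
    """
    modules_by_bit = defaultdict(set)
    for module, bit in events:
        modules_by_bit[bit].add(module)
    return {
        bit: sorted(modules)
        for bit, modules in modules_by_bit.items()
        if len(modules) > 1
    }


# Hypothetical events: bit 31 fails on three different DIMMs.
events = [("U0302", 31), ("U0401", 31), ("U1303", 31), ("U0404", 7)]
print(shared_bits(events))  # {31: ['U0302', 'U0401', 'U1303']}
```

A non-empty result is a prompt for further troubleshooting of the writing component, not proof of a bad writer on its own.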


Sun's Enhanced Sparc/Solaris DIMM Replacement Policy

Note: The rules detailed in this Policy apply to the following architectures:

UltraSPARC II, UltraSPARC III, UltraSPARC IV, UltraSPARC IV+ and T1 Systems.

Replace a DIMM when:

1. POST (when run at a level which actually tests memory) fails it.

2. For systems with Predictive Self-Healing (Solaris 10 and later, except on UltraSPARC II-based platforms), when the system tells you to.

3. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports a UE or DUE, and investigation shows that the UE or DUE truly originated from memory, and not from a transfer from some CPU's cache, as determined by a qualified Sun Support specialist.

4A. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of two or more different bit positions from the same DIMM within 24 hours of each other, and all the addresses are in the same relative checkword (that is, the AFARs are all the same modulo 64).

[Note: This means at least 4 CEs; two from one bit position, with unique addresses, and two from another, also with unique addresses, and the lower 6 bits of all the addresses are the same.]


4B. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of three or more different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.

[Note: This means at least 6 CEs; two from one DRAM output signal, with unique addresses, two from another output from the same DRAM, also with unique addresses, and two more from yet another output from the same DRAM, again with unique addresses, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.]

5. For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later, when the system indicates that the page retirement limit of 0.1% of physical memory has been reached and denotes one and only one DIMM as suspect (i.e., it has accumulated 130 or more non-intermittent CEs).

If more than one DIMM is marked as suspect, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

[Note: Determining these factors is aided by the CEDIAG diagnostic tool set.]

In the unlikely event that the system indicates that the page retirement limit has been reached but no DIMM is marked as suspect, contact a Sun Support specialist for assistance in determining any necessary action.

Example:

connole 73 =>uname -a

SunOS connole 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-5_10


6. For older Solaris releases and patch levels, when Solaris reports more than 24 non-intermittent CEs in 24 hours from a single DIMM.

If more than one DIMM has experienced more than 24 non-intermittent CEs in 24 hours, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

Limitations:

Prior to Solaris 10, retired pages are returned to service whenever a system is rebooted, and will be re-retired if and when Solaris encounters CEs from them again.

POST may fail a DIMM that contained retired pages; if it does, replace the DIMM at that time.

----------------------------------------------(end of official policy)-------------------------------------
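Rules 4A and 6 are counting rules, so they can be expressed as small checks. The Python sketch below is one reading of the policy wording, not an official tool (CEDIAG is the tool the policy names). It assumes the caller has already limited the events to a single DIMM, and for rule 4A to CEs that fall within 24 hours of each other. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def meets_rule_4a(ces):
    """ces: iterable of (afar, bit_position) for ONE DIMM within 24 hours.

    True if, for some AFAR-modulo-64 value, at least two bit positions
    each show CEs at two or more different physical addresses.
    """
    addresses = defaultdict(set)        # (afar % 64, bit) -> distinct AFARs
    for afar, bit in ces:
        addresses[(afar % 64, bit)].add(afar)
    qualifying_bits = defaultdict(int)  # afar % 64 -> bits with 2+ addresses
    for (offset, _bit), seen in addresses.items():
        if len(seen) >= 2:
            qualifying_bits[offset] += 1
    return any(count >= 2 for count in qualifying_bits.values())


def exceeds_rule_6(timestamps, limit=24, window=timedelta(hours=24)):
    """timestamps: datetimes of non-intermittent CEs from ONE DIMM.

    True if more than `limit` of them fall inside any `window`-long span.
    """
    ordered = sorted(timestamps)
    start = 0
    for end, stamp in enumerate(ordered):
        while stamp - ordered[start] > window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False


# Hypothetical data: two bit positions, two addresses each, same modulo 64.
print(meets_rule_4a([(0x1000, 3), (0x2000, 3), (0x1040, 17), (0x3000, 17)]))
base = datetime(2006, 5, 24)
print(exceeds_rule_6([base + timedelta(minutes=i) for i in range(25)]))
```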

Note:

Exceptions MAY be made to the Policy in the interest of Customer Satisfaction.

Consult with your lead, backline or manager if necessary.

When making an exception, always record it in the case notes.

Example:

“Advised customer of Sun's Enhanced Memory DIMM Replacement Policy and suggested that they employ the cediag utility. Referenced Infodocs 79928 & 82264 which explain more about Sun's Enhanced Memory DIMM Replacement Policy and the recommended CEDIAG utility. Customer declined to follow recommendations and insists upon dimm replacement.”


Identifying the correct dimm size / part number

Variables that need to be known:

• dimm size

• dimm type (speed)

• dimm quantity (some dimms are always replaced in pairs, e.g. V440)

Useful utilities to identify dimm size:

• prtdiag -v (/usr/platform/sun4u/sbin/prtdiag -v)

• prtfru -x output (applies to newer machines)

• POST diagnostic output (when available)

• memconf utility (now able to be run against Explorer output)

• showfru ALOM command, displays all FRU info

Depending upon the machine platform, the prtdiag output may report only the total memory installed, the physical bank size, the logical bank size, or the actual dimm size.


Sun Microsystems machine platforms have varying memory layouts.

Some have ALL the memory dimms installed on a single, common system board (aka motherboard).

Examples: most Sun Desktop machines and E250, E450, 280R, V210 & V240

Some have half of the dimms located on the system board and half on a “Memory Riser Board”.

Specifically: Ultra 80 / Enterprise 420R / Netra t 1400/1405

Other machines use “Mezzanine Memory” modules.

Specifically: Netra t / ct 400/800 / SPARCengine CP1500

Some have multiple CPU memory boards, each comprised of CPU modules AND memory dimms.

Examples: older machines like E3500/4500/5500/6500 and newer ones like V480, V880 & V440

The examples listed above are by no means all inclusive.

When in doubt ALWAYS refer to the online Sun System Handbook.

There you may also find helpful notes regarding:

Minimum memory dimm slot population requirements

Memory dimm / bank installation order

Whether dimms must be installed as matched pairs

etc...


E250 Prtdiag output example (partial)

e250-hw 41 =>/usr/platform/sun4u/sbin/prtdiag -v

System Configuration: Sun Microsystems sun4u Sun (TM) Enterprise 250 (2 X UltraSPARC-II 296MHz)
System clock frequency: 99 MHz
Memory size: 1792 Megabytes    (total amount of memory installed in system)

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
SYS   0      0      296     2.0   US-II    2.0
SYS   1      1      296     2.0   US-II    2.0

========================= Memory =========================

       Interlv.  Socket   Size
Bank    Group     Name    (MB)  Status
----    -----    ------   ----  ------
  0     none     U0701     64     OK     (64 meg dimm)
  0     none     U0801     64     OK
  0     none     U0901     64     OK
  0     none     U1001     64     OK
  1     none     U0702    128     OK     (128 meg dimm)
  1     none     U0802    128     OK
  1     none     U0902    128     OK
  1     none     U1002    128     OK
  2     none     U0703    128     OK
  2     none     U0803    128     OK
  2     none     U0903    128     OK
  2     none     U1003    128     OK
  3     none     U0704    128     OK
  3     none     U0804    128     OK
  3     none     U0904    128     OK
  3     none     U1004    128     OK

Each dimm is shown individually.
4 Banks of memory: 3 Banks of 128 meg dimms, 1 Bank of 64 meg dimms.


System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500

System clock frequency: 100 MHz

Memory size: 12288Mb

========================= CPUs =========================

Run Ecache CPU CPU

Brd CPU Module MHz MB Impl. Mask

--- --- ------- ----- ------ ------ ----

0 0 0 400 8.0 US-II 10.0

0 1 1 400 8.0 US-II 10.0

2 4 0 400 8.0 US-II 10.0

2 5 1 400 8.0 US-II 10.0

4 8 0 400 8.0 US-II 10.0

4 9 1 400 8.0 US-II 10.0

5 10 0 400 8.0 US-II 10.0

========================= Memory =========================

Intrlv. Intrlv.

Brd Bank MB Status Condition Speed Factor With

--- ----- ---- ------- ---------- ----- ------- -------

0 0 2048 Active OK 60ns 4-way A Board 0 / Bank 0 (8 dimms per bank) 2048 / 8 = 256 meg dimms

0 1 1024 Active OK 60ns 8-way B Board 0 / Bank 1 (8 dimms per bank) 1024 / 8 = 128 meg dimms

2 0 2048 Active OK 60ns 4-way A

2 1 1024 Active OK 60ns 8-way B

4 0 2048 Active OK 60ns 4-way A

4 1 1024 Active OK 60ns 8-way B

5 0 2048 Active OK 60ns 4-way A

5 1 1024 Active OK 60ns 8-way B

E4500 prtdiag (excerpt)
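When prtdiag reports only the bank size, as on this E4500, the dimm size is the bank size divided by the number of dimms per bank. A small Python sketch of that arithmetic follows. The default of 8 dimms per bank comes from the annotations on this slide, and the function name is a hypothetical illustration.

```python
def dimm_size_mb(bank_mb: int, dimms_per_bank: int = 8) -> int:
    """Return the size of each dimm in a bank, in MB.

    dimms_per_bank defaults to 8, as annotated for the E4500 above;
    check the Sun System Handbook for other platforms.
    """
    if bank_mb % dimms_per_bank:
        raise ValueError("bank size is not a multiple of dimms per bank")
    return bank_mb // dimms_per_bank


# (board, bank, bank size in MB) rows taken from the excerpt above.
for board, bank, size in [(0, 0, 2048), (0, 1, 1024)]:
    print(f"Board {board} / Bank {bank}: {dimm_size_mb(size)} meg dimms")
```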


memconf is a perl script that reports the size of each SIMM/DIMM memory module that is installed in a Sun system. It also reports the system type and any empty memory sockets.

In verbose mode, it also reports:
 * banner name, model, and CPU/system frequencies
 * address range and bank numbers for each module

External url (for customers):
http://www.sunfreeware.com/
http://myweb.cableone.net/4schmidts/memconf.html

Usage: memconf [ -v | -D | -h ] [ explorer_dir ]
    -v            verbose mode
    -D            send results to memconf maintainer
    -h            print help
    explorer_dir  Sun Explorer output directory


# prtdiag -v

System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)

System clock frequency: 90 MHz

Memory size: 1024 Megabytes

========================= CPUs =========================

Run Ecache CPU CPU

Brd CPU Module MHz MB Impl. Mask

--- --- ------- ----- ------ ------ ----

0 0 0 360 0.2 12 9.1

========================= IO Cards =========================

Bus# Freq

Brd Type MHz Slot Name Model

--- ---- ---- ---- -------------------------------- ----------------------

0 PCI-1 33 1 ebus

0 PCI-1 33 1 network-SUNW,hme

0 PCI-1 33 2 SUNW,m64B ATY,GT-C

0 PCI-1 33 3 ide-pci1095,646.1095.646.3

No failures found in System

========================= HW Revisions =========================

ASIC Revisions:

---------------

Cheerio: ebus Rev 1

System PROM revisions:

----------------------

OBP 3.31.0 2001/07/25 20:36 POST 3.1.0 2000/06/27 13:56

Notice that the prtdiag output from this Ultra 10 shows only the TOTAL memory installed, NOT how many dimms or which size.


connole 167 =>./memconf

hostname: connole

Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)

Memory Interleave Factor = 2-way

socket DIMM1 has a 256MB DIMM

socket DIMM2 has a 256MB DIMM

socket DIMM3 has a 256MB DIMM

socket DIMM4 has a 256MB DIMM

empty sockets: None

total memory = 1024MB (1GB)

connole 168 =>./memconf -v (verbose mode)

memconf: V1.65 13-Feb-2006 http://www.4schmidts.com/unix.html

hostname: connole

banner: Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)

model: Ultra-5_10

Sun development name: Darwin/Otter (Ultra 5), Darwin/SeaLion (Ultra 10)

Solaris 9 4/04 s9s_u6wos_08a SPARC, 64-bit kernel, SunOS 5.9

1 UltraSPARC-IIi 360MHz cpu, system freq: 90MHz

CPU Units:

========================= CPUs =========================

Run Ecache CPU CPU

Brd CPU Module MHz MB Impl. Mask

--- --- ------- ----- ------ ------ ----

0 0 0 360 0.2 12 9.1

Memory Units:

Memory Interleave Factor = 2-way

socket DIMM1 has a 256MB DIMM (bank 0L, address 0x00000000-0x0fffffff, 0x20000000-0x2fffffff)

socket DIMM2 has a 256MB DIMM (bank 0H, address 0x00000000-0x0fffffff, 0x20000000-0x2fffffff)

socket DIMM3 has a 256MB DIMM (bank 1L, address 0x10000000-0x1fffffff, 0x30000000-0x3fffffff)

socket DIMM4 has a 256MB DIMM (bank 1H, address 0x10000000-0x1fffffff, 0x30000000-0x3fffffff)

empty sockets: None

total memory = 1024MB (1GB)

Ultra 10 memconf examples
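The per-socket lines in memconf output are regular enough to cross-check against the reported total. The Python sketch below parses them. The regular expression is an assumption based only on the Ultra 10 output shown here, and memconf's format may differ on other platforms or versions.

```python
import re

# Pattern assumed from the memconf output above.
SOCKET_LINE = re.compile(r"socket (?P<socket>\S+) has a (?P<size>\d+)MB DIMM")


def parse_memconf(text):
    """Return {socket name: dimm size in MB} from memconf output."""
    return {
        match.group("socket"): int(match.group("size"))
        for match in SOCKET_LINE.finditer(text)
    }


memconf_output = """\
socket DIMM1 has a 256MB DIMM
socket DIMM2 has a 256MB DIMM
socket DIMM3 has a 256MB DIMM
socket DIMM4 has a 256MB DIMM
empty sockets: None
total memory = 1024MB (1GB)
"""

dimms = parse_memconf(memconf_output)
print(dimms, "sum =", sum(dimms.values()), "MB")
```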


System Configuration: Sun Microsystems sun4u Sun Enterprise 420R (4 X UltraSPARC-II 450MHz)

System clock frequency: 113 MHz

Memory size: 4096 Megabytes (only TOTAL memory reported)

========================= CPUs =========================

Run Ecache CPU CPU

Brd CPU Module MHz MB Impl. Mask

--- --- ------- ----- ------ ------ ----

0 0 0 450 4.0 US-II 10.0

0 1 1 450 4.0 US-II 10.0

0 2 2 450 4.0 US-II 10.0

0 3 3 450 4.0 US-II 10.0

========================= IO Cards =========================

Bus Freq

Brd Type MHz Slot Name Model

--- ---- ---- ---- -------------------------------- ----------------------

0 PCI 33 0 SUNW,qfe-pci108e,1001 SUNW,pci-qfe

0 PCI 33 1 network-SUNW,hme

0 PCI 33 1 SUNW,qfe-pci108e,1001 SUNW,pci-qfe

0 PCI 33 2 fibre-channel-pci10df,f800.10df.+

0 PCI 33 2 SUNW,qfe-pci108e,1001 SUNW,pci-qfe

0 PCI 33 3 scsi-glm/disk (block) Symbios,53C875

0 PCI 33 3 scsi-glm/disk (block) Symbios,53C875

0 PCI 33 3 SUNW,qfe-pci108e,1001 SUNW,pci-qfe

0 PCI 33 4 fibre-channel-pci10df,f800.10df.+

========================= HW Revisions =========================

ASIC Revisions:

---------------

PCI: pci Rev 4

PCI: pci Rev 4

Cheerio: ebus Rev 1

System PROM revisions:

----------------------

OBP 3.31.0 2001/07/25 20:35 POST 1.2.8 2000/08/22 19:50

420R prtdiag


connole 170 =>memconf /home/mbuckley/Explorers/64834462_420R/explorer.80e8b7a9.njocsprd2-2005.12.05.17.04

hostname: njocsprd2

Sun Explorer directory: /home/mbuckley/Explorers/64834462_420R/explorer.80e8b7a9.njocsprd2-2005.12.05.17.04

Sun Enterprise 420R (4 X UltraSPARC-II 450MHz)

socket U0301 has a 256MB DIMM (individual dimm size reported)

socket U0302 has a 256MB DIMM

socket U1301 has a 256MB DIMM

socket U1302 has a 256MB DIMM

socket U0401 has a 256MB DIMM

socket U0402 has a 256MB DIMM

socket U1401 has a 256MB DIMM

socket U1402 has a 256MB DIMM

socket U0303 has a 256MB DIMM

socket U0304 has a 256MB DIMM

socket U1303 has a 256MB DIMM

socket U1304 has a 256MB DIMM

socket U0403 has a 256MB DIMM

socket U0404 has a 256MB DIMM

socket U1403 has a 256MB DIMM

socket U1404 has a 256MB DIMM

empty sockets: None

total memory = 4096MB (4GB)

WARNING: Layout of memory sockets not completely recognized on this system.

The memory configuration displayed should be correct though since this is a fully stuffed system.

This is a known bug due to Sun's 'prtconf', 'prtdiag' and 'prtfru' commands not providing enough detail for the memory layout of this

SunOS 5.8 SUNW,Ultra-80 system to be accurately determined.

This is a bug in Sun's OBP, not a bug in memconf.

The latest release (OBP 3.33.0 2003/10/07) still has this bug.

This system is using OBP 3.31.0 2001/07/25 20:35


V880 POST output (excerpt)

Probing Memory............
Probing CPU0 memory configuration
 NGDIMM#0 part# 501-5030-03 serial# 235446, 256MB + 256MB, SC#0    (512 meg dimm)
 NGDIMM#1 part# 501-5030-03 serial# 235457, 256MB + 256MB, SC#0
 NGDIMM#2 part# 501-5030-03 serial# 241586, 256MB + 256MB, SC#0
 NGDIMM#3 part# 501-5030-03 serial# 241589, 256MB + 256MB, SC#0
 NGDIMM#4 part# 501-5030-03 serial# 241581, 256MB + 256MB, SC#0
 NGDIMM#5 part# 501-5030-03 serial# 241579, 256MB + 256MB, SC#0
 NGDIMM#6 part# 501-5030-03 serial# 241573, 256MB + 256MB, SC#0
 NGDIMM#7 part# 501-5030-03 serial# 241577, 256MB + 256MB, SC#0
Probing CPU1 memory configuration
 NGDIMM#0 part# 501-5030-03 serial# 241516, 256MB + 256MB, SC#0
 NGDIMM#1 part# 501-5030-03 serial# 241522, 256MB + 256MB, SC#0
 NGDIMM#2 part# 501-5030-03 serial# 241601, 256MB + 256MB, SC#0
 NGDIMM#3 part# 501-5030-03 serial# 241507, 256MB + 256MB, SC#0
 NGDIMM#4 part# 501-5030-03 serial# 243281, 256MB + 256MB, SC#0
 NGDIMM#5 part# 501-5030-03 serial# 243486, 256MB + 256MB, SC#0
 NGDIMM#6 part# 501-5030-03 serial# 241594, 256MB + 256MB, SC#0
 NGDIMM#7 part# 501-5030-03 serial# 241588, 256MB + 256MB, SC#0

SubTool output:
Part#: 501-5030
Desc: FRU,ASSY,SDRAM,DIMM,512MB
Category: Boards
Is a FRU but has no substitutable parts.
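In this POST output each dimm is reported as two halves ("256MB + 256MB"), so the dimm size is their sum. The Python sketch below pulls the slot, part number, serial number and total size from one such line. The regular expression is an assumption based only on the V880 excerpt above.

```python
import re

# Pattern assumed from the V880 POST excerpt above.
POST_DIMM = re.compile(
    r"NGDIMM#(?P<slot>\d+) part# (?P<part>[\d-]+) serial# (?P<serial>\d+), "
    r"(?P<first>\d+)MB \+ (?P<second>\d+)MB"
)


def parse_post_dimm(line):
    """Return slot, part, serial and total size in MB, or None if no match."""
    match = POST_DIMM.search(line)
    if match is None:
        return None
    return {
        "slot": int(match.group("slot")),
        "part": match.group("part"),
        "serial": match.group("serial"),
        "size_mb": int(match.group("first")) + int(match.group("second")),
    }


line = "NGDIMM#0 part# 501-5030-03 serial# 235446, 256MB + 256MB, SC#0"
print(parse_post_dimm(line))
```

The part number without its dash level (501-5030) is what was looked up in SubTool above.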


Prtfru -x output from V880:

<Location name="dimm-slot?Label=J8001">

<Container name="dimm-module">

<ContainerData>

<Segment name="SD">

<ManR>

<UNIX_Timestamp32 value="Mon Mar 3 19:39:25 MST 2003"/>

<Fru_Description value="256 MB NG SDRAM DIMM"/>

<Manufacture_Loc value="ONYANG,KOREA"/>

<Sun_Part_No value="5015401"/>

<Sun_Serial_No value="A4663A"/>

<Vendor_Name value="Samsung"/>

<Initial_HW_Dash_Level value="03"/>

<Initial_HW_Rev_Level value="50"/>

<Fru_Shortname value="DIMM"/>

</ManR>

<Fru_Type value="256 MB DIMM"/>

<DIMM_R>

<DIMM_Speed value="75"/>

<DIMM_Size value="256"/>

</DIMM_R>

</Segment>

</ContainerData>

</Container> <!-- dimm-module -->

</Location> <!-- dimm-slot?Label=J8001 -->

SubTool output:
Part#: 501-5401
Desc: FRU,ASSY,SDRAM,DIMM,256MB,18X8MX16
Category: Boards
Is a FRU but has no substitutable parts.
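Because prtfru -x emits XML, the slot label, part number and size can be read with any XML parser. The Python sketch below uses the standard library on a trimmed copy of the excerpt above. Inserting a dash after the third digit of Sun_Part_No is an assumption that matches this example (5015401 and 501-5401) and may not hold for every part.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the prtfru -x excerpt above (only the elements used here).
PRTFRU_XML = """
<Location name="dimm-slot?Label=J8001">
  <Container name="dimm-module">
    <ContainerData>
      <Segment name="SD">
        <ManR>
          <Sun_Part_No value="5015401"/>
          <Sun_Serial_No value="A4663A"/>
        </ManR>
        <DIMM_R>
          <DIMM_Size value="256"/>
        </DIMM_R>
      </Segment>
    </ContainerData>
  </Container>
</Location>
"""


def dimm_summary(xml_text):
    """Return the slot label, dashed part number and size for one dimm."""
    root = ET.fromstring(xml_text.strip())
    part = root.find(".//Sun_Part_No").get("value")
    size = root.find(".//DIMM_Size").get("value")
    return {
        "label": root.get("name").split("Label=")[-1],
        "part": f"{part[:3]}-{part[3:]}",  # assumed 3-4 digit split
        "size_mb": int(size),
    }


print(dimm_summary(PRTFRU_XML))
```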


Showfru is a command-line prtfru -x summary script available online from: http://pts-appl-z1.holland/showfru.html

From the command line, showfru needs to be run on Solaris 10 FCS or later, where the XML perl modules are installed by default.

The Showfru script aims to provide a concise summary of FRU data from prtfru -x output. This allows quick identification of the FRUs installed and, depending on the platform, provides additional information.

NOTE: Please link to the script rather than taking a private copy.

##################################################################

Latest version 0.74 /net/cores.uk/export/hotline/hotlocal/bin/showfru

Report bugs, RFEs or if you have questions email [email protected]

Further info from http://pts-platform/twiki/bin/view/Tools/ToolPageShowfru

###################################################################

Non RoHS example: http://gmpweb.uk/~db124859/showfru/v240_mixed_dimm_sizes.html

The script only runs on Solaris 10 and above so if you are stuck on a Solaris 9 sunray use the online version: http://pts-appl-z1.holland/showfru.html

Further details and example outputs here: http://pts-platform/twiki/bin/view/Tools/ToolPageShowfru

More on RoHS: http://sunsolve2.central.sun.com/handbook_internal/Systems/common-docs/RoHS_Communication.html#meaning


$ /net/cores.uk/export/hotline/hotlocal/bin/showfru prtfru_-x.out

################################################################################

FRU part and serial number info, use -v for install date and vendor

################################################################################

MB MOTHERBOARD 375-3346 RoHS H00ORF

PS0 PS 300-1846 RoHS 005530

IFB CHASSIS 371-0796 RoHS E2JB13

PS1 PS 300-1846 RoHS 005529

MB.P0.B0.D0 1 GB

MB.P0.B0.D1 1 GB

MB.P0.B1.D0 1 GB

MB.P0.B1.D1 1 GB

################################################################################

SPD DIMM info - FRU, vendor name, vendor part and serial number

################################################################################

MB.P0.B0.D0 Infineon (formerly Siemens) 72D128320GBR6C 0403E910

MB.P0.B0.D1 Infineon (formerly Siemens) 72D128320GBR6C 0403EA10

MB.P0.B1.D0 Infineon (formerly Siemens) 72D128320GBR6C 0403EA12

MB.P0.B1.D1 Infineon (formerly Siemens) 72D128320GBR6C 0409FD27


sc> showfru

FRU_PROM at PS0.SEEPROM

Manufacturer Record

Timestamp: TUE JUL 01 19:53:52 UTC 2003

Description: P/S,SSI MPS,680W,HOT PLUG

Manufacture Location: DELTA ELECTRONICS THAILAND

Sun Part No: 3001501

Sun Serial No: T00541

Vendor: Delta Electronics

Initial HW Dash Level: 06

Initial HW Rev Level: 50

Shortname: A42_PSU

FRU_PROM at C0.P0.B0.D0.SEEPROM

Timestamp: MON JUN 02 12:00:00 UTC 2003

Description: SDRAM DDR, 512 MB

Manufacture Location:

Vendor: Samsung

Vendor Part No: M3 12L6420DT0-CA2

FRU_PROM at C0.P0.B0.D1.SEEPROM

Timestamp: MON JUN 02 12:00:00 UTC 2003

Description: SDRAM DDR, 512 MB

Manufacture Location:

Vendor: Samsung

Vendor Part No: M3 12L6420DT0-CA2


The Findaft script aims to provide a concise summary of AFT, CPU and PCI ECC errors found in the Solaris Operating System /var/adm/messages files.

This summary can then be used to assist in diagnosing a customer's hardware fault. Note: Findaft is Sun Internal only and cannot be sent to customers.

• Provides a concise summary of all CPU/Memory/PCI/ECC errors found in the messages. (Makes an ideal case note or start point for an SGR template.)

• Assists with identification of memory UE errors.

• Features highlighting of E-Cache events.

• Directs TSEs towards Best Practices, including when to do "nothing".

• Features highlighting of Datapath faults.

• Helps to identify the true root cause of errors.

• Helps to prevent misdiagnosis, which could result in "wrong" parts being replaced.


Findaft is a standalone perl script. The latest version is runnable from here: /net/cores.uk/export/hotline/hotlocal/bin/findaft

(always use latest available versions of tools)

Or downloadable from here: http://gmpweb.uk/~db124859/findaft/

Findaft is always a good starting step when troubleshooting and diagnosing memory issues. Read the docs that findaft suggests; these will usually assist diagnosis.

Reference:
Infodoc 80270: "Findaft an AFT, CPU, Memory and PCI ECC error message summary script"
http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft

Alias is available to provide tool support: [email protected]


May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 862595 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7f9abdc2

May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30

May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2b4

May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 339206 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 Corrected Memory Error on U0302 is Intermittent

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 368593 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 ECC Data Bit 31 was in error and corrected

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 748639 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa13b30

May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30

May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2ac

May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 233778 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 Corrected Memory Error on U0302 is Intermittent

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 315879 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 ECC Data Bit 31 was in error and corrected

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 712106 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa59597

May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30

May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2b4

May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 957346 kern.info] [AFT0] errID 0x0015bd43.7fa59597 Corrected Memory Error on U0302 is Intermittent

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 356654 kern.info] [AFT0] errID 0x0015bd43.7fa59597 ECC Data Bit 31 was in error and corrected

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 585299 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa993cf

May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30

May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2a8

May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 342081 kern.info] [AFT0] errID 0x0015bd43.7fa993cf Corrected Memory Error on U0302 is Persistent

May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 499012 kern.info] [AFT0] errID 0x0015bd43.7fa993cf ECC Data Bit 31 was in error and corrected
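The kind of tally that a summary tool produces from messages like these can be sketched in a few lines of Python. This is only an illustration, not findaft's actual code: the regular expression is an assumption based on the "Corrected Memory Error on ... is ..." wording in the excerpt above, and the names `CE_CLASS` and `tally_ce_classes` are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern, taken from the wording of the classification lines above,
# e.g. "... Corrected Memory Error on U0302 is Intermittent"
CE_CLASS = re.compile(r"Corrected Memory Error on (\S+) is (\w+)")


def tally_ce_classes(lines):
    """Count (memory module, classification) pairs found in syslog lines."""
    counts = Counter()
    for line in lines:
        match = CE_CLASS.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# The four classification lines from the excerpt, plus one line that must not match.
sample = [
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 862595 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7f9abdc2",
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 339206 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 Corrected Memory Error on U0302 is Intermittent",
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 233778 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 Corrected Memory Error on U0302 is Intermittent",
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 957346 kern.info] [AFT0] errID 0x0015bd43.7fa59597 Corrected Memory Error on U0302 is Intermittent",
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 342081 kern.info] [AFT0] errID 0x0015bd43.7fa993cf Corrected Memory Error on U0302 is Persistent",
]

counts = tally_ce_classes(sample)
for (module, kind), n in sorted(counts.items()):
    print(n, module, "is", kind)
```

For the four classification lines in this excerpt the sketch counts three Intermittent events and one Persistent event on U0302.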

Page 32: mem_diagI

# /net/cores.uk/export/hotline/hotlocal/bin/findaft /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages

################################################################################

This script looks for Hardware errors including all AFT and pci ECC events

Written for 108528-16/112233-01 or above. Some tests may fail on other revisions

Report bugs,RFEs or if you have questions email [email protected]

Version 2.00 homepage http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft

Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft

Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script

################################################################################

Input file /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages is 0.1 MB

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Syndrome errors CE and UE errors are included

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

10 Syndrome 0x58 Memory Module U0302

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other AFT Events

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

5 [AFT0] Corrected Memory Error detected by CPU0,

1 [AFT0] Corrected Memory Error detected by CPU1,

2 [AFT0] Corrected Memory Error detected by CPU2,

2 [AFT0] Corrected Memory Error detected by CPU3,

3 [AFT0] Sticky Softerror encountered on Memory Module U0302

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Main Memory Correctable ECC events sorted by date

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

3 May 28 U0302 is Intermittent

1 May 28 U0302 is Persistent

3 May 30 U0302 is Sticky

2 May 31 U0302 is Persistent

1 Jun 01 U0302 is Persistent

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Page 33: mem_diagI

(continued)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Panics, Reboots, Fatal errors etc

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 Jun 01 cht1ds004 SunOS Release 5.8 Version Generic_117000-03 64-bit

###############################################################################

Correctable memory errors found, use cediag to determine if a DIMM needs to be

replaced, see Infodoc 83216 for examples of the cediag rule failure messages

Infodoc 79928: Sun Enhanced Memory DIMM Replacement Policy

################################################################################

cediag -e explorer_directory/

cediag -c SunOS,cht1ds004,5.8,sparc -k 117000-03 -u 2 /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages

################################################################################

Start of Ultrasparc II CE specific checks

Unique Simms total 1

################################################################################

U0302

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Unique Syndromes total 1

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

CE Event Syndrome 0x58 Data Bit 31

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

USII CE Event type reported by each CPU

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Reporting CPU Intermittent Persistent Sticky

CPU0 3 2 0

CPU1 0 0 1 <<

CPU2 0 1 1 <<

CPU3 0 1 1 <<

Page 34: mem_diagI

CEDIAG is a memory error analysis tool, comprised of shell scripts and a few binary executables.

Currently runs on Solaris SPARC architectures only.

Reference: http://onestop/qco/dimm/tools/cediag.shtml

Internally runnable from: /net/cores.uk/export/hotline/hotlocal/bin/cediag

Usage:

# cediag -e unpacked_explorer_dir

May also be run in verbose mode to gather additional information (such as total number of memory pages retired).

Syntax example:

# /net/cores.uk/export/hotline/hotlocal/bin/cediag -v -e /explorer-directory

Customers may download from: http://sunsolve.sun.com (Diagnostic Tools)

Memory DIMM Replacement Management Tool (Download, install cediag 1.2.1)

Page 35: mem_diagI

# /net/cores.uk/export/hotline/hotlocal/bin/cediag -v -e /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/

cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC

cediag: info: cediag directory: /net/cores.uk/export/hotline/hotlocal/bin

cediag: info: Explorer directory: /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/

cediag: info: UltraSPARC Version: 2 (2)

cediag: info: OS Type: SunOS

cediag: info: OS Version: 5.8

cediag: info: Hostname: cht1ds004

cediag: info: Memory size: 524032 (8KB pages)

cediag: info: MPR (deduced) PRL pages: 497 (8KB pages) (MPR = Memory Page Retirement)

cediag: info: MPR-capable OS: true

cediag: info: KJP: 117000-03

cediag: info: MPR-aware kernel in-use: true

cediag: info: MPR enabled: true

cediag: info: MPR disabled in /etc/system: false

cediag: info: MPR force mode: n/a

cediag: info: MPR state: active

cediag: info: Rule#3 check: true

cediag: info: Rule#4 check: true

cediag: info: Rule#5 check: true

cediag: info: Rule#5 check via cestat: false

cediag: info: Rule#6 check: false

cediag: #### CE Summary prior to reboot at Jun 1 14:55:33 ###################

cediag: info: DIMM U0302 had 10 CE(s)

cediag: info: DIMM U0302 had 7 non-intermittent CE(s)

cediag: info: DIMM U0302 @ Data Bit 31 had 10 CE(s)

cediag: info: DIMM U0302 @ Data Bit 31 @ AFAR%64=48 had 10 CE(s) across 1 AFARs

Page 36: mem_diagI

cediag: info: messages files: 1 pages scheduled for retirement

cediag: info: messages files: 1 pages successfully retired

cediag: info: messages files: 0 pages scheduled for clearing

cediag: info: messages files: 0 pages successfully cleared

cediag: info: PRL deduced status: PRL reached = false

cediag: findings: 0 datapath fault message(s) found

cediag: findings: 0 UE(s) found - there is no rule#3 match

cediag: findings: 0 DIMMs with a failure pattern matching rule#4

cediag: findings: 0 DIMMs with a failure pattern matching rule#5

cediag: #### CE Summary since last detected reboot ###########################

cediag: #### last detected reboot at Jun 1 14:55:33 #########################

cediag: info: messages files: 0 pages scheduled for retirement

cediag: info: messages files: 0 pages successfully retired

cediag: info: messages files: 0 pages scheduled for clearing

cediag: info: messages files: 0 pages successfully cleared

cediag: info: PRL deduced status: PRL reached = false

cediag: findings: 0 datapath fault message(s) found

cediag: findings: 0 UE(s) found - there is no rule#3 match

cediag: findings: 0 DIMMs with a failure pattern matching rule#4

cediag: findings: 0 DIMMs with a failure pattern matching rule#5

#
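Several figures in the verbose cediag output above can be cross-checked with simple arithmetic. The sketch below is only a worked check, not part of cediag. It takes the memory size, the AFAR from the messages file, and the intermittent count from the findaft summary as inputs; the helper names are hypothetical. "AFAR%64" is read here as plain modulo arithmetic on the address, and the slides do not say why cediag groups CEs this way.

```python
PAGE_SIZE_BYTES = 8 * 1024  # cediag reports memory size in 8KB pages


def pages_to_mib(pages):
    """Convert a count of 8KB pages to mebibytes."""
    return pages * PAGE_SIZE_BYTES // (1024 * 1024)


def afar_offset(afar, span=64):
    """Remainder of the fault address modulo span, as in 'AFAR%64'."""
    return afar % span


memory_pages = 524032   # "Memory size: 524032 (8KB pages)"
afar = 0x55955A30       # low word of the AFAR in the messages excerpt
total_ces = 10          # "DIMM U0302 had 10 CE(s)"
intermittent_ces = 3    # findaft summary: "3 May 28 U0302 is Intermittent"

print(pages_to_mib(memory_pages))    # 4094
print(afar_offset(afar))             # 48
print(total_ces - intermittent_ces)  # 7
```

The results agree with the report: roughly 4 GB of memory, "AFAR%64=48", and 7 non-intermittent CEs.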

Page 37: mem_diagI

Example cediag messages when a single DIMM needs to be replaced.

Rule 4 can identify UE DIMMs before they cause an outage.

cediag: findings: 1 DIMMs with a failure pattern matching rule#4

cediag: findings: DIMM 'Slot A: J8101' matched rule#4 failure pattern

cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible

Rule 5 failures are low risk and should not cause an outage.

cediag: findings: 1 DIMMs with a failure pattern matching rule#5

cediag: findings: DIMM 'Slot B: J3101' matched rule#5 failure pattern

cediag: advice:MEDIUM: replace DIMM 'Slot B: J3101' during next maintenance period

Rule 6 applies when Solaris is not patched to a level that provides MPR. Rule 6 failures are low risk.

cediag: findings: 1 DIMMs with a failure pattern matching rule#6

cediag: findings: DIMM 'Slot C: J8200' matched rule#6 (24 in 24) failure pattern

cediag: advice:MEDIUM: replace DIMM 'Slot C: J8200' during next maintenance period
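When many cediag reports need to be triaged, the advice lines can be pulled out mechanically. The following is a minimal sketch that assumes the "cediag: advice:LEVEL: ..." layout shown in the examples above. The function name `parse_advice` and both patterns are illustrative and are not part of cediag.

```python
import re

# Assumed layout, based on the example messages on this slide.
ADVICE = re.compile(r"cediag: advice:(?P<level>[A-Z]+): (?P<text>.+)")
DIMM = re.compile(r"replace DIMM '(?P<dimm>[^']+)'")


def parse_advice(lines):
    """Return (severity, DIMM label or None) for each cediag advice line."""
    results = []
    for line in lines:
        advice = ADVICE.search(line)
        if not advice:
            continue  # findings and info lines are skipped
        dimm = DIMM.search(advice.group("text"))
        results.append((advice.group("level"), dimm.group("dimm") if dimm else None))
    return results


report = [
    "cediag: findings: 1 DIMMs with a failure pattern matching rule#4",
    "cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible",
    "cediag: advice:MEDIUM: replace DIMM 'Slot B: J3101' during next maintenance period",
    "cediag: advice:MEDIUM: replace DIMM 'Slot C: J8200' during next maintenance period",
]

for level, dimm in parse_advice(report):
    print(level, dimm)
```

Advice lines that do not name a DIMM (for example, a referral to Sun Support) are returned with `None` in place of the DIMM label.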

Page 38: mem_diagI

Example cediag messages for the more complex faults.

Uncorrectable errors (UEs) are often seen as a result of single-DIMM Rule 4 failures.

cediag: findings: 1 UE(s) found - potential rule#3 match

cediag: advice:HIGH: refer UE(s) to Sun Support [A]s [S]oon [A]s [P]ossible

Datapath fault: see Infodocs 70134 and 80288 for diagnosis of bad writers and datapath faults from Solaris messages.

cediag: findings: 4 datapath fault message(s) found

cediag: findings: 8 DIMM(s) having CEs with Esynd of 0x0010 found

cediag: advice:HIGH: possible datapath fault - refer to Sun Support ASAP

Whenever more than one DIMM fails rules 4, 5 or 6 you will get this message. Make sure you really do have multiple failures before replacing any DIMMs.

cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs

Page 39: mem_diagI

FIND_UE Utility:

Used to identify those UE errors where a single DIMM from a memory bank can be reliably identified as the cause of the fault, or at least narrow down the number of suspect DIMMs. (Enhanced algorithms are being implemented to reduce the number of suspect components.)

Identify those UEs which are likely to have been caused by FCO A0258.

Field Change Order A0258-1: Mitsubishi 256MB DIMMs (Sun p/n 501-5658) showing significantly lower than expected reliability.

Info available at: http://pts-platform/twiki/bin/view/Tools/ToolPageFindUE

Alias list is available to provide tool support: [email protected]

FindUE is a command-line syndrome decoder.

Usage: /net/cores.uk/export/hotline/hotlocal/bin/findUE messages

Page 40: mem_diagI

###################################################################

FindUE was written to assist in ECC syndrome history analysis, the script will

understand Solaris messages, console logs, msgbuf, showlogs and wfail outputs

Supported systems include USIII and USIV systems the E3000-6500s but not USIIIi.

Version 1.33 from /net/cores.uk/export/hotline/hotlocal/bin/findUE

Infodocs 75538 and 74624 have further details. If you find bugs in the script

email [email protected] for syndrome decode bugs [email protected]

##################################################################

Page 41: mem_diagI

Infodoc 80346: “Using the fin954 script to diagnose main memory versus L2SRAM errors”

The fin954 script was written by Mike Arnott in 2003. The aim was to automate the diagnosis of main memory versus L2SRAM errors on UltraSPARC III systems using the errors found in a Solaris messages file. The fin954 script implements the rules described in FIN I0954-1. These rules apply to all USIII and USIV systems including the VSP systems, SunBlade 1000/2000, 280R, Netra 20, 480/490 and the 880/890, but not the USIIIi based systems.

The latest version of fin954 is available from http://fde.aus/tools/fin954. It is a Perl script and needs to be run from the command line.

Further information is available from:

FIN I0760-2 Sun Enhanced Memory DIMM Replacement Policy.

Infodoc 52427 L2SRAM/DIMM Misdiagnosis Issues

Infodoc 75538 Sun Fire[TM] Server: Using ECC Syndrome History to Troubleshoot Uncorrectable Errors (UE) in Memory

When to use fin954:

fin954 is a special-purpose diagnosis script and as such should not be used as an initial scan of messages. If you need a general summary of all AFT events found in messages, use findaft. When cediag finds UE errors but cannot identify a single faulty DIMM, fin954 can be used to help diagnose the faulty FRU.

Example:

$ fin954 sample.4.messages.txt

============================================================================

Findings: from analysing sample.4.messages.txt

Total "Events" logged: 167

Total *significant* "Events" logged: 28

Total insignificant "Events" logged: 138 .

FRU "SB10/P1/B1 J14301 J14401 J14501 J14601" implicated as error source 22 times.

FRU "SB10/P1/B1/D0 J14301" implicated as error source 6 times.

Fin954 is a standalone script, the latest version is runnable from here:

/net/cores.uk/export/hotline/hotlocal/bin/fin954

Page 42: mem_diagI

DIMM PLL Chip Issues:

Sun has determined that a limited subset of memory DIMMs shipped in 2001 and 2002 (less than one percent of the installed base) may begin to show reduced reliability after approximately two years of operation.

This reliability issue manifests itself in the form of UEs (Uncorrectable Errors), sometimes with CEs (Correctable Errors), originating from the DIMMs.

The reliability of these DIMMs is normal for approximately the first two years of use, after which they may start to degrade below the expected level.

The root cause of this issue is related to a PLL device on the DIMMs. This sub-population of DIMMs has PLL devices with a date code range between 0049 and 0215 inclusive.
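The inclusive date code range can be expressed as a simple check. This is a minimal sketch with one important assumption that the slide does not state: that the four-digit date codes sort chronologically as numbers (for example a YYWW year/week layout), so that a plain numeric comparison is valid. The function name is hypothetical, and the sketch is not the DIMM lookup tool mentioned in this deck.

```python
PLL_FIRST_CODE = 49    # date code 0049
PLL_LAST_CODE = 215    # date code 0215


def is_suspect_pll_date_code(code):
    """True if a 4-digit PLL date code lies in 0049..0215 inclusive.

    Assumes the codes order chronologically when compared as numbers.
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected a 4-digit date code, got %r" % (code,))
    return PLL_FIRST_CODE <= int(code) <= PLL_LAST_CODE


for code in ("0048", "0049", "0130", "0215", "0216"):
    print(code, is_suspect_pll_date_code(code))
```

Both endpoints are reported as suspect because the range is inclusive; codes just outside it (0048, 0216) are not.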

No unique symptom will be experienced due to this issue, other than higher than expected UEs and CEs.

A DIMM lookup tool has been developed to assist in identifying suspect DIMMs (but sometimes manual inspection is required).

Impacted Platforms:

It has been determined that the following platforms, if shipped between 1/01/2001 and 12/31/2002, could be impacted:

SB1000, SB2000, Netra20, 280R, V480, V490, V880, V890, V1280, SF3800, SF4800, SF4810, SF6800, F12K, F15K

References:

FCO A0253-1: A sub-population of DIMMs that shipped between 2001 and 2002 on the below platforms are showing significantly lower reliability than expected.

FAQ: http://onestop/qco/plldimm/index_plldimm.shtml

Page 43: mem_diagI

How this issue may “uniquely” affect V480/V490/V880/V890 platforms:

In some cases during UE DIMM errors, incorrect memory banks can be called out masking the true location of the faulty DIMM.

Due to this bug, not even POST can help since POST is also affected by the bug and will also call out the wrong DIMM location.

The Kernel or POST reports a memory group as the source of a CE or UE error, which might cause the engineer to believe there is a defective DIMM within that group.

However, on a system experiencing BugID #5034665, the reported group is USUALLY NOT the location of the defective DIMM.

You will know that the system is experiencing this bug because there will be multiple UE error messages calling out different groups of memory on the same CPU/Memory board.

Note: When this bug is exhibited, the UE DIMM errors will be confined to a single CPU/Memory board - the false errors will not span different CPU/Memory boards, they will only be on one board.
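The symptom described above (several UE messages naming different memory groups on one CPU/Memory board) can be checked mechanically once the callouts have been extracted from the messages. The sketch below assumes the callouts are already available as (board, group) pairs. The board and group labels are hypothetical placeholders, and a match is only a hint that Infodoc 77110 should be consulted, not a diagnosis.

```python
from collections import defaultdict


def boards_with_scattered_ues(ue_callouts):
    """Return boards whose UE callouts name more than one memory group."""
    groups_by_board = defaultdict(set)
    for board, group in ue_callouts:
        groups_by_board[board].add(group)
    return sorted(board for board, groups in groups_by_board.items() if len(groups) > 1)


# Hypothetical callouts: board "A" has UEs reported against two different groups,
# board "B" has repeated UEs against the same group.
callouts = [
    ("A", "group 0"),
    ("A", "group 1"),
    ("A", "group 0"),
    ("B", "group 1"),
    ("B", "group 1"),
]

print(boards_with_scattered_ues(callouts))  # ['A']
```

Repeated UEs against a single group, as on board "B" here, do not match the pattern and point back to ordinary DIMM diagnosis.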

In light of the new information from BugID #5034665, we no longer look for one specific DIMM. Instead, we remove all DIMMs containing a specific PLL chip within a range of date codes.

References:

Infodoc 77110: Sun Fire[TM] Server (V480, V880, V880z, V490, V890): How to Troubleshoot "Dimm PLL chip failure causes CE/UEs to be called out by POST & Solaris[TM] in any dimm location" BugID #5034665

SunAlert 101667 (formerly 57757): A Limited Subset of DIMMs (less than 1%) Shipped in 2001-2002 May Have a Reliability Issue

PLL Lookup Tool: http://pts-appl-z1.holland/pll.html (used to scan Explorer outputs)

Page 44: mem_diagI

Misc Memory Resources

Memory reference site:

http://onestop/qco/dimm/

https://onestop.sfbay.sun.com/qco/dimm/index_dimm.shtml

Infodoc 70361: "Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages"

Infodoc 72846: Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], and UltraSPARC-IV[R] CPU Modules

Infodoc 72775: "How to determine if a correctable error (CE) on a memory DIMM should result in replacement of FRU"

Infodoc 70134: Diagnosis of bad writers and datapath faults from Solaris messages

Infodoc 79928: "Sun Enhanced Memory DIMM Replacement Policy"

Infodoc 82264: Memory DIMM Replacement Management Tool - cediag 1.2.1 FAQ

FIN 100271 (Formerly I0760-2) Sun Enhanced Memory DIMM Replacement Policy

Page 45: mem_diagI

Mike [email protected]

revision: H (6/22/06)

Memory Diagnosibility TOI