Memory Diagnosability TOI
Mike Buckley
Platforms TRE
Sun Microsystems
Sun Proprietary/Confidential: Internal Use Only
Goals
• Improve Customer Satisfaction by reducing time to resolution through increased technical proficiency
• Reduce the incidence of inaccurate Onsite Action Plans and wrong parts ordered
• Replace memory dimms and other parts only when necessary
• Correctly identify proper dimm size and part number
• Accurately diagnose various memory issues
• Save SUN money $$$
Topics
• Types of memory errors
• Sun's Best Practices regarding memory errors (review)
• Dimm size and part number identification techniques
• Diagnostic tools and utilities (some new ones)
• Troubleshooting tips and techniques
• Examples of error messages and tool usage
• Known memory issues (PLL dimm chip)
• Resources
CE: Correctable (single-bit) errors
Because ECC can correct single bit flips, single-bit errors are referred to as Correctable Errors. These are detected and corrected and generally do not impact performance.
Types of CEs:
• Intermittent
• Persistent
• Sticky Bit
UE: Uncorrectable (multi-bit) errors
Multi-bit errors are referred to as Uncorrectable Errors. These are detected, but not corrected. These will result in machine reset (panic or reboot).
Correctable Errors (CE):
When a CE is detected, the device that read the word and detected the error can correct the data read and continue on unimpeded. However, this does not address the fact that the referenced word could still be resident in memory uncorrected (i.e. a subsequent read of this word could result in another CE event). If, over time, this word in memory is never corrected, the possibility starts to arise that another bit may flip in the same word. This would lead to a UE event, which will result in a loss of system service. To avoid this possibility, the detection of a CE causes a trap to Solaris. The Solaris error handling code logs the error and scrubs the affected memory word by writing the corrected word back into memory.
Intermittent:
Means the error was not detected on a re-read of the affected memory word. "Intermittent" is not the best choice of words because it implies that this same error can be expected to manifest itself at irregular intervals. This CE is also known as a transient soft error. No DIMM with this sort of error should be considered for replacement without first examining the soft error rate (SER) of this DIMM.
Persistent:
Means the error was detected again on a re-read of the affected memory word but the scrub operation corrected it. This CE is also known as a temporary soft error. No DIMM with this sort of error should be considered for replacement without first examining the SER of this DIMM.
Sticky (aka Sticky Bit):
Means that the error still exists in memory even after the scrub operation. This CE is also known as a stuck-at hard error. No DIMM with this sort of error should be considered for replacement without first examining the SER of this DIMM.
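The three CE classes reduce to two questions: was the error seen again on the re-read, and was it still there after the scrub? A minimal Python sketch of that decision follows. The function and parameter names are illustrative assumptions, not Solaris kernel interfaces; the real classification happens inside the Solaris error handling code.

```python
def classify_ce(seen_on_reread: bool, seen_after_scrub: bool) -> str:
    """Classify a correctable error using the definitions on these slides.

    Both arguments are hypothetical inputs used only for illustration.
    """
    if not seen_on_reread:
        return "Intermittent"  # transient soft error: gone on the re-read
    if not seen_after_scrub:
        return "Persistent"    # temporary soft error: the scrub corrected it
    return "Sticky"            # stuck-at hard error: the scrub did not help


# Walk through the three possible outcomes.
for reread, after_scrub in [(False, False), (True, False), (True, True)]:
    print(reread, after_scrub, classify_ce(reread, after_scrub))
```

Whatever the class, the replacement decision still depends on the SER of the DIMM, not on a single event.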
# cat messages | grep -i memory
May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info][AFT0] errID 0x00055be6.99821550 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Persistent
May 24 16:12:40 smro97 SUNW,UltraSPARC-III+: [ID 910566 kern.info][AFT0] errID 0x00055c2d.d4cf4320 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Sticky
May 24 16:12:40 smro97 unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module /N0/SB3/P2/B0/D2 J15500
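Messages like these can be tallied per module and per CE class instead of being read by eye. The following is a rough Python sketch, not the findaft tool; the regular expression is an assumption based only on the message format shown on this slide, and lines that repeat the same errID and class are counted once.

```python
import re

# Pattern inferred from the sample messages on this slide (an assumption).
CE_LINE = re.compile(
    r"errID (0x[0-9a-fA-F.]+) Corrected Memory Error on (.+?) is "
    r"(Intermittent|Persistent|Sticky)\b"
)


def tally_ce_classes(lines):
    """Return {(module, ce_class): count}, counting each errID/class once."""
    seen = set()
    counts = {}
    for line in lines:
        match = CE_LINE.search(line)
        if not match:
            continue
        err_id, module, ce_class = match.groups()
        if (err_id, ce_class) in seen:
            continue  # same event logged twice
        seen.add((err_id, ce_class))
        key = (module, ce_class)
        counts[key] = counts.get(key, 0) + 1
    return counts


sample = [
    "May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info][AFT0] errID 0x00055be6.99821550 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Persistent",
    "May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info][AFT0] errID 0x00055be6.99821550 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Persistent",
    "May 24 16:12:40 smro97 SUNW,UltraSPARC-III+: [ID 910566 kern.info][AFT0] errID 0x00055c2d.d4cf4320 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Sticky",
]
for (module, ce_class), count in tally_ce_classes(sample).items():
    print(count, module, ce_class)
```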
Here is an example of a memory error that does not involve a CPU module. A PCI controller was reading data from memory.
May 17 18:45:01 j2kweb06 unix: WARNING: correctable error from pci0 (upa mid 1f) during dvma read transaction
May 17 18:45:01 j2kweb06 unix: AFSR=40f10000.9f800000 AFAR=00000000.4fbc1e60,
May 17 18:45:01 j2kweb06 double word offset=4, Memory Module <U0402> port id 31.
Basically, there was simply a correctable ECC event on a read from memory.
The only difference from the "normal" CE events is that this one happened to be detected by the PCI controller (since it was doing the read) instead of a CPU.
A single CE event is nothing to worry about: that is the reason for having ECC-protected memory.
The system is doing its job and functioning normally.
Uncorrectable Errors (UE):
If a UE is detected, the device that read the word and detected the error cannot correct the data and continue.
A UE will cause Solaris to panic if the UE is in kernel memory, or to kill the particular user process that contained the memory in error and then issue an orderly shutdown and reboot to protect the other processes in the domain.
Either way, whether via panic or shutdown and reboot, the customer is considerably impacted (and will likely call for support).
Memory Scrubber:
The Solaris OS runs a memory "scrubber" routine as part of its normal operation.
The time interval is 12 hours for scrubbing stale (unused, idle) memory pages.
This scrubber does not do anything special besides ensure that every memory location is accessed at least once every 12 hours.
If the access finds a CE, then the normal trap to the Solaris OS that occurs for any CE will scrub the affected memory word by writing the corrected word back into memory and log the event.
This ensures that multiple CEs do not have time to build up and form a UE at memory locations that are infrequently accessed.
Correctable memory errors reported EXACTLY every twelve hours are a result of the memory scrubber (see Infodoc 74049: “How often does the memory scrubber run?”).
The normal rules apply: a DIMM should only be replaced if it meets the criteria described in Infodoc 79928, Sun Enhanced Memory DIMM Replacement Policy.
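To check whether a run of CEs on one DIMM matches the scrubber's 12-hour cadence, the gaps between the event timestamps can be compared against 12 hours. This is a small Python sketch under stated assumptions: the timestamps are already parsed into datetime objects, and the one-minute tolerance is an arbitrary illustrative value, not something defined by Solaris or the Infodoc.

```python
from datetime import datetime, timedelta

SCRUB_INTERVAL = timedelta(hours=12)  # interval stated on this slide


def looks_like_scrubber(timestamps, tolerance=timedelta(minutes=1)):
    """True if every gap between consecutive events is about 12 hours.

    `tolerance` is a hypothetical allowance for logging jitter.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return False  # one event cannot show a cadence
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return all(abs(gap - SCRUB_INTERVAL) <= tolerance for gap in gaps)


# Made-up timestamps for illustration only.
first = datetime(2006, 5, 24, 4, 0, 0)
events = [first, first + timedelta(hours=12), first + timedelta(hours=24)]
print(looks_like_scrubber(events))
```

A match only explains why the errors surface on that schedule; the replacement criteria are unchanged.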
BADWRITERS:
1. Sometimes multiple memory DIMMs within a system can start reporting soft errors.
Examining the messages may reveal that the same databit (or error syndrome) is in error on each DIMM.
This indicates that some other component is actually writing the bad data to RAM and consistently creating errors at the same bit address, regardless of the physical DIMM.
Recognizing this pattern, and troubleshooting further can prevent much wasted downtime and cost, and the replacing of perfectly good memory DIMMs.
2. When a DIMM is replaced and the errors persist, or return with the same data bit in error, some other component in the system is likely causing the memory errors.
Again, recognizing this possibility can head off assumptions that replacing memory will solve the problem.
3. In terms of CE/ECC, a system may only reveal errors when the failing address range is utilized by a particular application or combination of applications.
This is almost always a hardware fault. In very rare instances, bad code may generate errors that appear to be hardware.
A good first step when troubleshooting a reproducible CE memory issue is to isolate or disable the suspect memory component(s) via asr-disable, setenv disabled-memory-list X, setenv disabled-board-list X (all under OBP), or psradm -f X or cfgadm (under the OS).
If disabling the suspect memory components is not possible, it may be advisable (especially on lower-end machines) to swap the suspect DIMM with another DIMM in the same bank. If the problem follows the DIMM, replace it. If the problems persist in the same location, it is not a bad DIMM issue.
Note: FINDAFT is especially useful when diagnosing Bad Writer scenarios. Look for a common CPU (one implicated more often than the other CPUs) as the possible Bad Writer.
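The Bad Writer pattern in point 1 (the same data bit in error on several DIMMs) is easy to miss when scrolling through messages. The Python sketch below groups CE events by data bit and reports any bit that shows up on two or more DIMMs. The event tuples are made-up illustration data, and the function name is hypothetical; extracting the DIMM and bit from real messages is left out.

```python
from collections import defaultdict


def suspect_bad_writer_bits(events):
    """events: iterable of (dimm, data_bit) pairs.

    Returns {data_bit: sorted list of DIMMs} for every bit seen in error
    on two or more different DIMMs.
    """
    dimms_by_bit = defaultdict(set)
    for dimm, bit in events:
        dimms_by_bit[bit].add(dimm)
    return {
        bit: sorted(dimms)
        for bit, dimms in dimms_by_bit.items()
        if len(dimms) >= 2
    }


# Hypothetical events: bit 31 fails on two different DIMMs.
events = [("U0302", 31), ("U0401", 31), ("U0402", 5), ("U0302", 31)]
print(suspect_bad_writer_bits(events))
```

A non-empty result is a hint to look beyond the DIMMs themselves, not proof of a Bad Writer.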
Sun's Enhanced SPARC/Solaris DIMM Replacement Policy
Note: The rules detailed in this Policy apply to the following architectures:
UltraSPARC II, UltraSPARC III, UltraSPARC IV, UltraSPARC IV+ and T1 Systems.
Replace a DIMM when:
1. POST (when run at a level which actually tests memory) fails it.
2. For systems with Predictive Self-Healing (Solaris 10 and later, except on UltraSPARC II-based platforms),
when the system tells you to.
3. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier),
whenever Solaris reports a UE or DUE, and investigation shows that the UE or DUE truly originated from memory,
and not from a transfer from some CPU's cache, as determined by a qualified Sun Support specialist.
4A. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier),
whenever Solaris reports two or more CEs from two or more different physical addresses on each of two or more
different bit positions from the same DIMM within 24 hours of each other, and all the addresses are in the same relative
checkword (that is, the AFARs are all the same modulo 64).
[Note: This means at least 4 CEs; two from one bit position, with unique addresses, and two from another,
also with unique addresses, and the lower 6 bits of all the addresses are the same.]
4B. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier),
whenever Solaris reports two or more CEs from two or more different physical addresses on each of three or more
different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond
to the same relative bit position in their respective checkwords.
[Note: This means at least 6 CEs; two from one DRAM output signal, with unique addresses, two from another output
from the same DRAM, also with unique addresses, and two more from yet another output from the same DRAM,
again with unique addresses, as long as the three outputs do not all correspond to the same relative bit position in
their respective checkwords.]
5. For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9,
patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later,
when the system indicates that the page retirement limit of 0.1% of physical memory has been reached and denotes
one and only one DIMM as suspect (i.e., it has accumulated 130 or more non-intermittent CEs).
If more than one DIMM is marked as suspect, then other possible causes of CEs have to be ruled out by
a qualified Sun Support specialist before replacing any DIMMs.
[Note: Determining these factors is aided by the CEDIAG diagnostic tool set.]
In the unlikely event that the system indicates that the page retirement limit has been reached but no DIMM is marked
as suspect, contact a Sun Support specialist for assistance in determining any necessary action.
Example:
connole 73 =>uname -a
SunOS connole 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-5_10
6. For older Solaris releases and patch levels, when Solaris reports more than 24 non-intermittent CEs in 24 hours
from a single DIMM.
If more than one DIMM has experienced more than 24 non-intermittent CEs in 24 hours, then other possible causes
of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.
Limitations:
Prior to Solaris 10, retired pages are returned to service whenever a system is rebooted, and will be re-retired if and
when Solaris encounters CEs from them again.
POST may fail a DIMM that contained retired pages; if it does, replace the DIMM at that time.
----------------------------------------------(end of official policy)-------------------------------------
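As a worked illustration of rule 6 (more than 24 non-intermittent CEs in 24 hours from a single DIMM), here is a Python sketch that slides a 24-hour window over the event times. It is an assumption-laden sketch, not an official tool: the events are assumed to be pre-filtered to one DIMM, the class strings match those in the log messages, and two events exactly 24 hours apart are treated as falling in the same window.

```python
from datetime import datetime, timedelta


def exceeds_rule_6(events, limit=24, window=timedelta(hours=24)):
    """events: iterable of (timestamp, ce_class) for ONE DIMM.

    Returns True if more than `limit` non-intermittent CEs fall inside
    any span of `window` (window boundary handling is an assumption).
    """
    times = sorted(t for t, ce_class in events if ce_class != "Intermittent")
    start = 0
    for end, current in enumerate(times):
        # Drop events that are too old to share a window with `current`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False


# Made-up data: 25 Persistent CEs, one every 30 minutes.
base = datetime(2006, 5, 28, 0, 0, 0)
burst = [(base + timedelta(minutes=30 * i), "Persistent") for i in range(25)]
print(exceeds_rule_6(burst))
```

If more than one DIMM trips this check, the policy text above applies: other causes have to be ruled out before any DIMM is replaced.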
Note:
Exceptions MAY be made to the Policy in the interest of Customer Satisfaction.
Consult with your lead, backline or manager if necessary.
When making an exception, always record it in the case notes.
Example:
“Advised customer of Sun's Enhanced Memory Dimm Replacement Policy and suggested that they employ the cediag utility.
Referenced Infodocs 79928 & 82264 which explain more about Sun's Enhanced Memory DIMM Replacement Policy and
the recommended CEDIAG utility. Customer declined to follow recommendations and insists upon dimm replacement.”
Identifying the correct dimm size / part number
Variables that need to be known:
• dimm size
• dimm type (speed)
• dimm quantity (some dimms are always replaced in pairs, e.g. V440)
Useful utilities to identify dimm size:
• prtdiag -v
• /usr/platform/sun4u/sbin/prtdiag -v
• prtfru -x output (applies to newer machines)
• POST diagnostic output (when available)
• memconf utility (now able to be run against Explorer output)
• showfru ALOM command, displays all FRU info
Depending upon the machine platform, the prtdiag output may report only the total memory installed, the physical bank size, the logical bank size, or the actual dimm size.
Sun Microsystems machine platforms have varying memory layouts.
Some have ALL the memory dimms installed on a single, common system board (aka motherboard).
Examples: most Sun Desktop machines and E250, E450, 280R, V210 & V240
Some have half of the dimms located on the system board and half on a “Memory Riser Board”.
Specifically: Ultra 80 / Enterprise 420R / Netra t 1400/1405
Other machines use “Mezzanine Memory” modules.
Specifically: Netra t / ct 400/800 / SPARCengine CP1500
Some have multiple CPU memory boards, each comprised of CPU modules AND memory dimms.
Examples: older machines like E3500/4500/5500/6500 and newer ones like V480, V880 & V440
The examples listed above are by no means all inclusive.
When in doubt ALWAYS refer to the online Sun System Handbook.
There you may also find helpful notes regarding:
Minimum memory dimm slot population requirements
Memory dimm / bank installation order
Whether dimms must be installed as matched pairs
etc...
E250 Prtdiag output example (partial)
e250-hw 41 =>/usr/platform/sun4u/sbin/prtdiag -v
System Configuration: Sun Microsystems sun4u Sun (TM) Enterprise 250 (2 X UltraSPARC-II 296MHz)
System clock frequency: 99 MHz
Memory size: 1792 Megabytes (total amount of memory installed in system)
========================= CPUs =========================
Run Ecache CPU CPU
Brd CPU Module MHz MB Impl. Mask
--- --- ------- ----- ------ ------ ----
SYS 0 0 296 2.0 US-II 2.0
SYS 1 1 296 2.0 US-II 2.0
========================= Memory =========================
Interlv. Socket Size
Bank Group Name (MB) Status
---- ----- ------ ---- ------
0 none U0701 64 OK (64 meg dimm)
0 none U0801 64 OK
0 none U0901 64 OK
0 none U1001 64 OK
1 none U0702 128 OK (128 meg dimm)
1 none U0802 128 OK
1 none U0902 128 OK
1 none U1002 128 OK
2 none U0703 128 OK
2 none U0803 128 OK
2 none U0903 128 OK
2 none U1003 128 OK
3 none U0704 128 OK
3 none U0804 128 OK
3 none U0904 128 OK
3 none U1004 128 OK
Each dimm is shown individually.
4 Banks of memory: 3 Banks of 128 meg dimms, 1 Bank of 64 meg dimms.
System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 100 MHz
Memory size: 12288Mb
========================= CPUs =========================
Run Ecache CPU CPU
Brd CPU Module MHz MB Impl. Mask
--- --- ------- ----- ------ ------ ----
0 0 0 400 8.0 US-II 10.0
0 1 1 400 8.0 US-II 10.0
2 4 0 400 8.0 US-II 10.0
2 5 1 400 8.0 US-II 10.0
4 8 0 400 8.0 US-II 10.0
4 9 1 400 8.0 US-II 10.0
5 10 0 400 8.0 US-II 10.0
========================= Memory =========================
Intrlv. Intrlv.
Brd Bank MB Status Condition Speed Factor With
--- ----- ---- ------- ---------- ----- ------- -------
0 0 2048 Active OK 60ns 4-way A Board 0 / Bank 0 (8 dimms per bank) 2048 / 8 = 256 meg dimms
0 1 1024 Active OK 60ns 8-way B Board 0 / Bank 1 (8 dimms per bank) 1024 / 8 = 128 meg dimms
2 0 2048 Active OK 60ns 4-way A
2 1 1024 Active OK 60ns 8-way B
4 0 2048 Active OK 60ns 4-way A
4 1 1024 Active OK 60ns 8-way B
5 0 2048 Active OK 60ns 4-way A
5 1 1024 Active OK 60ns 8-way B
E4500 prtdiag (excerpt)
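When prtdiag reports only the bank size, as in this E4500 excerpt, the dimm size is the bank size divided by the number of dimms per bank. The arithmetic from the slide annotations can be sketched in Python as follows. The value of 8 dimms per bank comes from the annotations above and applies to this platform only; the function name is hypothetical.

```python
DIMMS_PER_BANK = 8  # per the slide annotation for the E4500; platform specific


def dimm_size_mb(bank_mb, dimms_per_bank=DIMMS_PER_BANK):
    """Return the individual dimm size (MB) for a bank of `bank_mb` MB."""
    if bank_mb % dimms_per_bank:
        raise ValueError("bank size is not a multiple of dimms per bank")
    return bank_mb // dimms_per_bank


# (board, bank, MB) rows taken from the Memory table in the excerpt.
rows = [(0, 0, 2048), (0, 1, 1024), (2, 0, 2048), (2, 1, 1024)]
for board, bank, bank_mb in rows:
    print(f"Board {board} / Bank {bank}: {dimm_size_mb(bank_mb)} MB dimms")
```

For other platforms, check the Sun System Handbook for the number of dimms per bank before doing the division.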
memconf is a perl script that reports the size of each SIMM/DIMM memory module that is installed in a Sun system.
It also reports the system type and any empty memory sockets.
In verbose mode, it also reports:
* banner name, model, and CPU/system frequencies
* address range and bank numbers for each module
External URLs (for customers):
http://www.sunfreeware.com/
http://myweb.cableone.net/4schmidts/memconf.html
Usage: memconf [ -v | -D | -h ] [ explorer_dir ]
-v verbose mode
-D send results to memconf maintainer
-h print help
explorer_dir Sun Explorer output directory
# prtdiag -v
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
System clock frequency: 90 MHz
Memory size: 1024 Megabytes
========================= CPUs =========================
Run Ecache CPU CPU
Brd CPU Module MHz MB Impl. Mask
--- --- ------- ----- ------ ------ ----
0 0 0 360 0.2 12 9.1
========================= IO Cards =========================
Bus# Freq
Brd Type MHz Slot Name Model
--- ---- ---- ---- -------------------------------- ----------------------
0 PCI-1 33 1 ebus
0 PCI-1 33 1 network-SUNW,hme
0 PCI-1 33 2 SUNW,m64B ATY,GT-C
0 PCI-1 33 3 ide-pci1095,646.1095.646.3
No failures found in System
========================= HW Revisions =========================
ASIC Revisions:
---------------
Cheerio: ebus Rev 1
System PROM revisions:
----------------------
OBP 3.31.0 2001/07/25 20:36 POST 3.1.0 2000/06/27 13:56
Notice that the prtdiag output from this Ultra 10 shows only the TOTAL memory installed, NOT how many dimms or which size.
connole 167 =>./memconf
hostname: connole
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
Memory Interleave Factor = 2-way
socket DIMM1 has a 256MB DIMM
socket DIMM2 has a 256MB DIMM
socket DIMM3 has a 256MB DIMM
socket DIMM4 has a 256MB DIMM
empty sockets: None
total memory = 1024MB (1GB)
connole 168 =>./memconf -v (verbose mode)
memconf: V1.65 13-Feb-2006 http://www.4schmidts.com/unix.html
hostname: connole
banner: Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
model: Ultra-5_10
Sun development name: Darwin/Otter (Ultra 5), Darwin/SeaLion (Ultra 10)
Solaris 9 4/04 s9s_u6wos_08a SPARC, 64-bit kernel, SunOS 5.9
1 UltraSPARC-IIi 360MHz cpu, system freq: 90MHz
CPU Units:
========================= CPUs =========================
Run Ecache CPU CPU
Brd CPU Module MHz MB Impl. Mask
--- --- ------- ----- ------ ------ ----
0 0 0 360 0.2 12 9.1
Memory Units:
Memory Interleave Factor = 2-way
socket DIMM1 has a 256MB DIMM (bank 0L, address 0x00000000-0x0fffffff, 0x20000000-0x2fffffff)
socket DIMM2 has a 256MB DIMM (bank 0H, address 0x00000000-0x0fffffff, 0x20000000-0x2fffffff)
socket DIMM3 has a 256MB DIMM (bank 1L, address 0x10000000-0x1fffffff, 0x30000000-0x3fffffff)
socket DIMM4 has a 256MB DIMM (bank 1H, address 0x10000000-0x1fffffff, 0x30000000-0x3fffffff)
empty sockets: None
total memory = 1024MB (1GB)
Ultra 10 memconf examples
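Because memconf prints one "socket ... has a ...MB DIMM" line per module, its output is easy to post-process, for example to paste a per-socket summary into case notes. The Python sketch below is an illustration that assumes only the line format visible in the examples above; it is not part of memconf.

```python
import re

# Line format taken from the memconf examples on this slide.
SOCKET_LINE = re.compile(r"^socket (\S+) has a (\d+)MB DIMM", re.MULTILINE)


def dimm_sizes(memconf_output):
    """Return {socket_name: size_in_MB} parsed from memconf output text."""
    return {
        socket: int(size)
        for socket, size in SOCKET_LINE.findall(memconf_output)
    }


SAMPLE = """\
hostname: connole
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
Memory Interleave Factor = 2-way
socket DIMM1 has a 256MB DIMM
socket DIMM2 has a 256MB DIMM
socket DIMM3 has a 256MB DIMM
socket DIMM4 has a 256MB DIMM
empty sockets: None
total memory = 1024MB (1GB)
"""

sizes = dimm_sizes(SAMPLE)
print(sizes)
print("total:", sum(sizes.values()), "MB")
```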
System Configuration: Sun Microsystems sun4u Sun Enterprise 420R (4 X UltraSPARC-II 450MHz)
System clock frequency: 113 MHz
Memory size: 4096 Megabytes (only TOTAL memory reported)
========================= CPUs =========================
Run Ecache CPU CPU
Brd CPU Module MHz MB Impl. Mask
--- --- ------- ----- ------ ------ ----
0 0 0 450 4.0 US-II 10.0
0 1 1 450 4.0 US-II 10.0
0 2 2 450 4.0 US-II 10.0
0 3 3 450 4.0 US-II 10.0
========================= IO Cards =========================
Bus Freq
Brd Type MHz Slot Name Model
--- ---- ---- ---- -------------------------------- ----------------------
0 PCI 33 0 SUNW,qfe-pci108e,1001 SUNW,pci-qfe
0 PCI 33 1 network-SUNW,hme
0 PCI 33 1 SUNW,qfe-pci108e,1001 SUNW,pci-qfe
0 PCI 33 2 fibre-channel-pci10df,f800.10df.+
0 PCI 33 2 SUNW,qfe-pci108e,1001 SUNW,pci-qfe
0 PCI 33 3 scsi-glm/disk (block) Symbios,53C875
0 PCI 33 3 scsi-glm/disk (block) Symbios,53C875
0 PCI 33 3 SUNW,qfe-pci108e,1001 SUNW,pci-qfe
0 PCI 33 4 fibre-channel-pci10df,f800.10df.+
========================= HW Revisions =========================
ASIC Revisions:
---------------
PCI: pci Rev 4
PCI: pci Rev 4
Cheerio: ebus Rev 1
System PROM revisions:
----------------------
OBP 3.31.0 2001/07/25 20:35 POST 1.2.8 2000/08/22 19:50
420R prtdiag
connole 170 =>memconf /home/mbuckley/Explorers/64834462_420R/explorer.80e8b7a9.njocsprd2-2005.12.05.17.04
hostname: njocsprd2
Sun Explorer directory: /home/mbuckley/Explorers/64834462_420R/explorer.80e8b7a9.njocsprd2-2005.12.05.17.04
Sun Enterprise 420R (4 X UltraSPARC-II 450MHz)
socket U0301 has a 256MB DIMM (individual dimm size reported)
socket U0302 has a 256MB DIMM
socket U1301 has a 256MB DIMM
socket U1302 has a 256MB DIMM
socket U0401 has a 256MB DIMM
socket U0402 has a 256MB DIMM
socket U1401 has a 256MB DIMM
socket U1402 has a 256MB DIMM
socket U0303 has a 256MB DIMM
socket U0304 has a 256MB DIMM
socket U1303 has a 256MB DIMM
socket U1304 has a 256MB DIMM
socket U0403 has a 256MB DIMM
socket U0404 has a 256MB DIMM
socket U1403 has a 256MB DIMM
socket U1404 has a 256MB DIMM
empty sockets: None
total memory = 4096MB (4GB)
WARNING: Layout of memory sockets not completely recognized on this system.
The memory configuration displayed should be correct though since this is a fully stuffed system.
This is a known bug due to Sun's 'prtconf', 'prtdiag' and 'prtfru' commands not providing enough detail for the memory layout of this
SunOS 5.8 SUNW,Ultra-80 system to be accurately determined.
This is a bug in Sun's OBP, not a bug in memconf.
The latest release (OBP 3.33.0 2003/10/07) still has this bug.
This system is using OBP 3.31.0 2001/07/25 20:35
V880 POST output (excerpt)
Probing Memory............
Probing CPU0 memory configuration
NGDIMM#0 part# 501-5030-03 serial# 235446, 256MB + 256MB, SC#0 (512 meg dimm)
NGDIMM#1 part# 501-5030-03 serial# 235457, 256MB + 256MB, SC#0
NGDIMM#2 part# 501-5030-03 serial# 241586, 256MB + 256MB, SC#0
NGDIMM#3 part# 501-5030-03 serial# 241589, 256MB + 256MB, SC#0
NGDIMM#4 part# 501-5030-03 serial# 241581, 256MB + 256MB, SC#0
NGDIMM#5 part# 501-5030-03 serial# 241579, 256MB + 256MB, SC#0
NGDIMM#6 part# 501-5030-03 serial# 241573, 256MB + 256MB, SC#0
NGDIMM#7 part# 501-5030-03 serial# 241577, 256MB + 256MB, SC#0
Probing CPU1 memory configuration
NGDIMM#0 part# 501-5030-03 serial# 241516, 256MB + 256MB, SC#0
NGDIMM#1 part# 501-5030-03 serial# 241522, 256MB + 256MB, SC#0
NGDIMM#2 part# 501-5030-03 serial# 241601, 256MB + 256MB, SC#0
NGDIMM#3 part# 501-5030-03 serial# 241507, 256MB + 256MB, SC#0
NGDIMM#4 part# 501-5030-03 serial# 243281, 256MB + 256MB, SC#0
NGDIMM#5 part# 501-5030-03 serial# 243486, 256MB + 256MB, SC#0
NGDIMM#6 part# 501-5030-03 serial# 241594, 256MB + 256MB, SC#0
NGDIMM#7 part# 501-5030-03 serial# 241588, 256MB + 256MB, SC#0
SubTool output:
Part#: 501-5030
Desc: FRU,ASSY,SDRAM,DIMM,512MB
Category: Boards
Is a FRU but has no substitutable parts.
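The POST lines carry everything needed to identify the part: the part number (with dash level) and the two halves that add up to the dimm size. A rough Python sketch of pulling those fields out follows. The regular expression is an assumption based solely on the V880 excerpt above, and dropping the trailing dash level to get the orderable part number mirrors the SubTool lookup shown on this slide.

```python
import re

# Field layout inferred from the V880 POST excerpt (an assumption).
NGDIMM = re.compile(
    r"NGDIMM#(\d+)\s+part#\s+(\S+)\s+serial#\s+(\d+),\s+(\d+)MB\s*\+\s*(\d+)MB"
)


def parse_post_dimms(post_text):
    """Return one dict per NGDIMM entry found in POST output text."""
    dimms = []
    for slot, part, serial, half_a, half_b in NGDIMM.findall(post_text):
        dimms.append({
            "slot": int(slot),
            "part": part.rsplit("-", 1)[0],  # 501-5030-03 -> 501-5030
            "serial": serial,
            "size_mb": int(half_a) + int(half_b),
        })
    return dimms


SAMPLE = (
    "NGDIMM#0 part# 501-5030-03 serial# 235446, 256MB + 256MB, SC#0 "
    "NGDIMM#1 part# 501-5030-03 serial# 235457, 256MB + 256MB, SC#0"
)
for dimm in parse_post_dimms(SAMPLE):
    print(dimm)
```

The result should still be confirmed against SubTool or the Sun System Handbook before a part is ordered.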
Prtfru -x output from V880:
<Location name="dimm-slot?Label=J8001">
<Container name="dimm-module">
<ContainerData>
<Segment name="SD">
<ManR>
<UNIX_Timestamp32 value="Mon Mar 3 19:39:25 MST 2003"/>
<Fru_Description value="256 MB NG SDRAM DIMM"/>
<Manufacture_Loc value="ONYANG,KOREA"/>
<Sun_Part_No value="5015401"/>
<Sun_Serial_No value="A4663A"/>
<Vendor_Name value="Samsung"/>
<Initial_HW_Dash_Level value="03"/>
<Initial_HW_Rev_Level value="50"/>
<Fru_Shortname value="DIMM"/>
</ManR>
<Fru_Type value="256 MB DIMM"/>
<DIMM_R>
<DIMM_Speed value="75"/>
<DIMM_Size value="256"/>
</DIMM_R>
</Segment>
</ContainerData>
</Container> <!-- dimm-module -->
</Location> <!-- dimm-slot?Label=J8001 -->
SubTool output:
Part#: 501-5401
Desc: FRU,ASSY,SDRAM,DIMM,256MB,18X8MX16
Category: Boards
Is a FRU but has no substitutable parts.
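The prtfru -x output is XML, so the part number and size can be read with a standard XML parser rather than by eye. The Python sketch below uses xml.etree.ElementTree on a trimmed copy of the fragment above. It has only been reasoned through against this single fragment; the "NNN-NNNN" formatting of the part number is inferred from the SubTool lookup on this slide, and real prtfru -x output may nest Location elements in ways this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the V880 fragment shown above.
SAMPLE = """
<Location name="dimm-slot?Label=J8001">
  <Container name="dimm-module">
    <ContainerData>
      <Segment name="SD">
        <ManR>
          <Fru_Description value="256 MB NG SDRAM DIMM"/>
          <Sun_Part_No value="5015401"/>
          <Sun_Serial_No value="A4663A"/>
        </ManR>
        <DIMM_R>
          <DIMM_Speed value="75"/>
          <DIMM_Size value="256"/>
        </DIMM_R>
      </Segment>
    </ContainerData>
  </Container>
</Location>
"""


def dimm_summary(xml_text):
    """Return (slot_label, part_number, size_mb) for each dimm Location."""
    root = ET.fromstring(xml_text.strip())
    summary = []
    for location in root.iter("Location"):
        part = location.find(".//Sun_Part_No")
        size = location.find(".//DIMM_Size")
        if part is None or size is None:
            continue  # not a dimm slot, or data missing
        raw = part.get("value")
        label = location.get("name").split("Label=")[-1]
        summary.append((label, f"{raw[:3]}-{raw[3:]}", int(size.get("value"))))
    return summary


print(dimm_summary(SAMPLE))
```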
Showfru is a commandline prtfru -x summary script available online from: http://pts-appl-z1.holland/showfru.html
From the commandline showfru needs to be run on Solaris 10 FCS or later, where the XML perl modules are installed by default.
The Showfru script aims to provide a concise summary of FRU data from prtfru -x output. This allows quick identification of the FRUs installed; depending on the platform, additional information is also available.
NOTE: Please link to the script rather than taking a private copy.
##################################################################
Latest version 0.74 /net/cores.uk/export/hotline/hotlocal/bin/showfru
Report bugs, RFEs or if you have questions email [email protected]
Further info from http://pts-platform/twiki/bin/view/Tools/ToolPageShowfru
###################################################################
Non RoHS example: http://gmpweb.uk/~db124859/showfru/v240_mixed_dimm_sizes.html
The script only runs on Solaris 10 and above so if you are stuck on a Solaris 9 sunray use the online version: http://pts-appl-z1.holland/showfru.html
Further details and example outputs here: http://pts-platform/twiki/bin/view/Tools/ToolPageShowfru
More on ROHS: http://sunsolve2.central.sun.com/handbook_internal/Systems/common-docs/RoHS_Communication.html#meaning
$ /net/cores.uk/export/hotline/hotlocal/bin/showfru prtfru_-x.out
################################################################################
FRU part and serial number info, use -v for install date and vendor
################################################################################
MB MOTHERBOARD 375-3346 RoHS H00ORF
PS0 PS 300-1846 RoHS 005530
IFB CHASSIS 371-0796 RoHS E2JB13
PS1 PS 300-1846 RoHS 005529
MB.P0.B0.D0 1 GB
MB.P0.B0.D1 1 GB
MB.P0.B1.D0 1 GB
MB.P0.B1.D1 1 GB
################################################################################
SPD DIMM info - FRU, vendor name, vendor part and serial number
################################################################################
MB.P0.B0.D0 Infineon (formerly Siemens) 72D128320GBR6C 0403E910
MB.P0.B0.D1 Infineon (formerly Siemens) 72D128320GBR6C 0403EA10
MB.P0.B1.D0 Infineon (formerly Siemens) 72D128320GBR6C 0403EA12
MB.P0.B1.D1 Infineon (formerly Siemens) 72D128320GBR6C 0409FD27
sc> showfru
FRU_PROM at PS0.SEEPROM
Manufacturer Record
Timestamp: TUE JUL 01 19:53:52 UTC 2003
Description: P/S,SSI MPS,680W,HOT PLUG
Manufacture Location: DELTA ELECTRONICS THAILAND
Sun Part No: 3001501
Sun Serial No: T00541
Vendor: Delta Electronics
Initial HW Dash Level: 06
Initial HW Rev Level: 50
Shortname: A42_PSU
FRU_PROM at C0.P0.B0.D0.SEEPROM
Timestamp: MON JUN 02 12:00:00 UTC 2003
Description: SDRAM DDR, 512 MB
Manufacture Location:
Vendor: Samsung
Vendor Part No: M3 12L6420DT0-CA2
FRU_PROM at C0.P0.B0.D1.SEEPROM
Timestamp: MON JUN 02 12:00:00 UTC 2003
Description: SDRAM DDR, 512 MB
Manufacture Location:
Vendor: Samsung
Vendor Part No: M3 12L6420DT0-CA2
The Findaft script aims to provide a concise summary of AFT, CPU and PCI ECC errors found in the Solaris Operating System /var/adm/messages files.
This summary can then be used to assist in diagnosing a customer's hardware fault. Note: Findaft is Sun Internal only and cannot be sent to customers.
Provides a concise summary of all CPU/Memory/PCI/ECC errors found in the messages.
(Makes an ideal case note or start point for an SGR template.)
• Assists with identification of memory UE errors.
• Features highlighting of E-Cache events.
• Directs TSEs towards Best Practices, including when to do "nothing".
• Features highlighting of Datapath faults.
• Helps to identify the true root cause of errors.
• Helps to prevent misdiagnosis, which could result in "wrong" parts being replaced.
Findaft is a standalone perl script. The latest version is runnable from here:
/net/cores.uk/export/hotline/hotlocal/bin/findaft
(always use the latest available versions of tools)
Or downloadable from here:
http://gmpweb.uk/~db124859/findaft/
Findaft is always a good first step when troubleshooting and diagnosing memory issues. Read the docs that findaft suggests; these will usually assist diagnosis.
Reference:
Infodoc 80270: "Findaft an AFT, CPU, Memory and PCI ECC error message summary script"
http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft
An alias is available to provide tool support: [email protected]
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 862595 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7f9abdc2
May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2b4
May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 339206 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 Corrected Memory Error on U0302 is Intermittent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 368593 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 ECC Data Bit 31 was in error and corrected
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 748639 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa13b30
May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2ac
May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 233778 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 Corrected Memory Error on U0302 is Intermittent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 315879 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 ECC Data Bit 31 was in error and corrected
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 712106 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa59597
May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2b4
May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 957346 kern.info] [AFT0] errID 0x0015bd43.7fa59597 Corrected Memory Error on U0302 is Intermittent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 356654 kern.info] [AFT0] errID 0x0015bd43.7fa59597 ECC Data Bit 31 was in error and corrected
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 585299 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa993cf
May 28 22:43:55 cht1ds004 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2a8
May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 342081 kern.info] [AFT0] errID 0x0015bd43.7fa993cf Corrected Memory Error on U0302 is Persistent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 499012 kern.info] [AFT0] errID 0x0015bd43.7fa993cf ECC Data Bit 31 was in error and corrected
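To show the kind of counting a summary tool does over raw AFT messages like these, here is a small Python sketch that tallies the detecting CPU and the syndrome/module pair. It is not findaft and does not reproduce its logic; the two regular expressions are assumptions based only on the UltraSPARC-II message format shown above.

```python
import re
from collections import Counter

# Patterns inferred from the messages on this slide (assumptions).
DETECTED_BY = re.compile(r"Corrected Memory Error detected by (CPU\d+)")
SYNDROME = re.compile(r"Syndrome (0x[0-9a-fA-F]+) Memory Module (\S+)")


def summarize_aft(lines):
    """Return (cpu_counts, syndrome_counts) as Counter objects."""
    cpu_counts = Counter()
    syndrome_counts = Counter()
    for line in lines:
        match = DETECTED_BY.search(line)
        if match:
            cpu_counts[match.group(1)] += 1
        match = SYNDROME.search(line)
        if match:
            syndrome_counts[(match.group(1), match.group(2))] += 1
    return cpu_counts, syndrome_counts


sample = [
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 862595 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7f9abdc2",
    "May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302",
    "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 748639 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa13b30",
    "May 28 22:43:55 cht1ds004 UDBH Syndrome 0x58 Memory Module U0302",
]
cpus, syndromes = summarize_aft(sample)
print(dict(cpus))
print(dict(syndromes))
```

One syndrome on one module, seen by several CPUs, points at the DIMM; one CPU dominating across several modules is the pattern to examine for a Bad Writer.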
# /net/cores.uk/export/hotline/hotlocal/bin/findaft /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages
################################################################################
This script looks for Hardware errors including all AFT and pci ECC events
Written for 108528-16/112233-01 or above. Some tests may fail on other revisions
Report bugs,RFEs or if you have questions email [email protected]
Version 2.00 homepage http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft
Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft
Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script
################################################################################
Input file /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages is 0.1 MB
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Syndrome errors CE and UE errors are included
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10 Syndrome 0x58 Memory Module U0302
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other AFT Events
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5 [AFT0] Corrected Memory Error detected by CPU0,
1 [AFT0] Corrected Memory Error detected by CPU1,
2 [AFT0] Corrected Memory Error detected by CPU2,
2 [AFT0] Corrected Memory Error detected by CPU3,
3 [AFT0] Sticky Softerror encountered on Memory Module U0302
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Main Memory Correctable ECC events sorted by date
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 May 28 U0302 is Intermittent
1 May 28 U0302 is Persistent
3 May 30 U0302 is Sticky
2 May 31 U0302 is Persistent
1 Jun 01 U0302 is Persistent
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Panics, Reboots, Fatal errors etc
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 Jun 01 cht1ds004 SunOS Release 5.8 Version Generic_117000-03 64-bit
###############################################################################
Correctable memory errors found, use cediag to determine if a DIMM needs to be
replaced, see Infodoc 83216 for examples of the cediag rule failure messages
Infodoc 79928: Sun Enhanced Memory DIMM Replacement Policy
################################################################################
cediag -e explorer_directory/
cediag -c SunOS,cht1ds004,5.8,sparc -k 117000-03 -u 2 /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages
################################################################################
Start of Ultrasparc II CE specific checks
Unique Simms total 1
################################################################################
U0302
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Unique Syndromes total 1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CE Event Syndrome 0x58 Data Bit 31
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
USII CE Event type reported by each CPU
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Reporting CPU Intermittent Persistent Sticky
CPU0 3 2 0
CPU1 0 0 1 <<
CPU2 0 1 1 <<
CPU3 0 1 1 <<
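A table like the one above can be approximated by joining each "detected by CPUn" line to its classification line through the shared errID. The sketch below is one possible approach in Python, assuming the message wording from the earlier sample. It is not how findaft itself is implemented, and the function name is hypothetical.

```python
import re
from collections import Counter, defaultdict

# Both patterns follow the sample messages shown earlier in this section.
DETECT_RE = re.compile(
    r"\[AFT0\] Corrected Memory Error detected by (CPU\d+), errID (\S+)"
)
CLASS_RE = re.compile(
    r"\[AFT0\] errID (\S+) Corrected Memory Error on \S+ is (\w+)"
)

def ce_types_by_cpu(lines):
    """Return {cpu: Counter({ce_class: count})}, joined on errID."""
    cpu_for_err = {}
    table = defaultdict(Counter)
    for line in lines:
        detected = DETECT_RE.search(line)
        if detected:
            cpu, err_id = detected.groups()
            cpu_for_err[err_id] = cpu
            continue
        classified = CLASS_RE.search(line)
        if classified and classified.group(1) in cpu_for_err:
            err_id, ce_class = classified.groups()
            table[cpu_for_err[err_id]][ce_class] += 1
    return dict(table)

sample_events = [
    "[AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa13b30",
    "[AFT0] errID 0x0015bd43.7fa13b30 Corrected Memory Error on U0302 is Intermittent",
    "[AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa993cf",
    "[AFT0] errID 0x0015bd43.7fa993cf Corrected Memory Error on U0302 is Persistent",
]

for cpu, classes in sorted(ce_types_by_cpu(sample_events).items()):
    print(cpu, dict(classes))
```

Joining on errID rather than on line order avoids mis-attributing events when messages from several CPUs are interleaved.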
CEDIAG is a memory error analysis tool, comprised of shell scripts and a few binary executables.
Currently runs on Solaris SPARC architectures only.
Reference: http://onestop/qco/dimm/tools/cediag.shtml
Internally runnable from: /net/cores.uk/export/hotline/hotlocal/bin/cediag
Usage:
# cediag -e unpacked_explorer_dir
May also be run in verbose mode to gather additional information (such as total number of memory pages retired).
Syntax example:
# /net/cores.uk/export/hotline/hotlocal/bin/cediag -v -e /explorer-directory
Customers may download from: http://sunsolve.sun.com (Diagnostic Tools)
Memory DIMM Replacement Management Tool (Download, install cediag 1.2.1)
# /net/cores.uk/export/hotline/hotlocal/bin/cediag -v -e /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/
cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: info: cediag directory: /net/cores.uk/export/hotline/hotlocal/bin
cediag: info: Explorer directory: /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/
cediag: info: UltraSPARC Version: 2 (2)
cediag: info: OS Type: SunOS
cediag: info: OS Version: 5.8
cediag: info: Hostname: cht1ds004
cediag: info: Memory size: 524032 (8KB pages)
cediag: info: MPR (deduced) PRL pages: 497 (8KB pages) (MPR = Memory Page Retirement)
cediag: info: MPR-capable OS: true
cediag: info: KJP: 117000-03
cediag: info: MPR-aware kernel in-use: true
cediag: info: MPR enabled: true
cediag: info: MPR disabled in /etc/system: false
cediag: info: MPR force mode: n/a
cediag: info: MPR state: active
cediag: info: Rule#3 check: true
cediag: info: Rule#4 check: true
cediag: info: Rule#5 check: true
cediag: info: Rule#5 check via cestat: false
cediag: info: Rule#6 check: false
cediag: #### CE Summary prior to reboot at Jun 1 14:55:33 ###################
cediag: info: DIMM U0302 had 10 CE(s)
cediag: info: DIMM U0302 had 7 non-intermittent CE(s)
cediag: info: DIMM U0302 @ Data Bit 31 had 10 CE(s)
cediag: info: DIMM U0302 @ Data Bit 31 @ AFAR%64=48 had 10 CE(s) across 1 AFARs
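The "AFAR%64=48" figure and the memory size in the cediag output above can be checked with simple arithmetic. The sketch below uses the AFAR value from the sample messages (0x00000000.55955a30) and the 8KB page size that cediag prints. Treating the remainder as a byte offset within a 64-byte block is an interpretation made for this sketch, not something stated in the cediag output.

```python
# Values copied from the sample messages and cediag output above.
AFAR = 0x55955A30          # AFAR 0x00000000.55955a30
PAGE_SIZE = 8192           # cediag reports sizes in 8KB pages
MEMORY_PAGES = 524032      # "Memory size: 524032 (8KB pages)"

# Remainder of the fault address modulo 64, as in "AFAR%64=48".
afar_mod_64 = AFAR % 64

# Index of the 8KB page that contains the faulting address.
page_index = AFAR // PAGE_SIZE

# Total memory implied by the page count, in megabytes.
memory_mb = MEMORY_PAGES * PAGE_SIZE // (1024 * 1024)

print("AFAR % 64 =", afar_mod_64)      # 48
print("8KB page index =", hex(page_index))
print("Memory size (MB) =", memory_mb)  # 4094
```

All ten CEs in this case share one AFAR, which is consistent with the "across 1 AFARs" wording and with a single page being retired.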
cediag: info: messages files: 1 pages scheduled for retirement
cediag: info: messages files: 1 pages successfully retired
cediag: info: messages files: 0 pages scheduled for clearing
cediag: info: messages files: 0 pages successfully cleared
cediag: info: PRL deduced status: PRL reached = false
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5
cediag: #### CE Summary since last detected reboot ###########################
cediag: #### last detected reboot at Jun 1 14:55:33 #########################
cediag: info: messages files: 0 pages scheduled for retirement
cediag: info: messages files: 0 pages successfully retired
cediag: info: messages files: 0 pages scheduled for clearing
cediag: info: messages files: 0 pages successfully cleared
cediag: info: PRL deduced status: PRL reached = false
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5
#
Example cediag messages when a single DIMM needs to be replaced.
Rule 4 can identify UE DIMMs before they cause an outage.
cediag: findings: 1 DIMMs with a failure pattern matching rule#4
cediag: findings: DIMM 'Slot A: J8101' matched rule#4 failure pattern
cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible
Rule 5 failures are low risk and should not cause an outage.
cediag: findings: 1 DIMMs with a failure pattern matching rule#5
cediag: findings: DIMM 'Slot B: J3101' matched rule#5 failure pattern
cediag: advice:MEDIUM: replace DIMM 'Slot B: J3101' during next maintenance period
Rule 6 applies when Solaris is not patched to a level that provides MPR; these failures are low risk.
cediag: findings: 1 DIMMs with a failure pattern matching rule#6
cediag: findings: DIMM 'Slot C: J8200' matched rule#6 (24 in 24) failure pattern
cediag: advice:MEDIUM: replace DIMM 'Slot C: J8200' during next maintenance period
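When reviewing long cediag output, it can help to pull out only the replacement advice and list the most urgent items first. The following is a minimal Python sketch, assuming the advice line format shown in the examples above. The function name is hypothetical and the script is not part of cediag.

```python
import re

# Matches advice lines of the form shown above, e.g.
# "cediag: advice:HIGH: replace DIMM 'Slot A: J8101' ..."
ADVICE_RE = re.compile(r"cediag: advice:(HIGH|MEDIUM): replace DIMM '([^']+)'")

def replacement_advice(lines):
    """Return (severity, dimm) pairs, HIGH before MEDIUM, input order kept."""
    found = []
    for line in lines:
        match = ADVICE_RE.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    # sorted() is stable, so entries of equal severity keep their order.
    return sorted(found, key=lambda item: 0 if item[0] == "HIGH" else 1)

sample_output = [
    "cediag: findings: 1 DIMMs with a failure pattern matching rule#5",
    "cediag: advice:MEDIUM: replace DIMM 'Slot B: J3101' during next maintenance period",
    "cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible",
    "cediag: advice:MEDIUM: replace DIMM 'Slot C: J8200' during next maintenance period",
]

for severity, dimm in replacement_advice(sample_output):
    print(severity, dimm)
```

Advice lines that do not name a DIMM (for example, referrals to Sun Support) are deliberately left out, since those cases need further diagnosis before any part is ordered.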
Example cediag messages for the more complex faults.
Uncorrectable errors (UEs) are often seen as a result of single-DIMM Rule 4 failures.
cediag: findings: 1 UE(s) found - potential rule#3 match
cediag: advice:HIGH: refer UE(s) to Sun Support [A]s [S]oon [A]s [P]ossible
Datapath fault: see Infodocs 70134 and 80288 for diagnosis of bad writers and datapath faults from Solaris messages.
cediag: findings: 4 datapath fault message(s) found
cediag: findings: 8 DIMM(s) having CEs with Esynd of 0x0010 found
cediag: advice:HIGH: possible datapath fault - refer to Sun Support ASAP
Whenever more than one DIMM fails rules 4, 5 or 6 you will get this message. Make sure you really do have multiple failures before replacing any DIMMs.
cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs
FIND_UE Utility:
Used to identify those UE errors where a single DIMM from a memory bank can be reliably identified as the cause of the fault, or at least narrow down the number of suspect DIMMs. (Enhanced algorithms are being implemented to reduce the number of suspect components.)
Identify those UEs which are likely to have been caused by FCO A0258.
Field Change Order A0258-1: Mitsubishi 256MB DIMMs (Sun p/n 501-5658) showing significantly lower than expected reliability.
Info available at: http://pts-platform/twiki/bin/view/Tools/ToolPageFindUE
Alias list is available to provide tool support: [email protected]
FindUE is a command-line syndrome decoder.
Usage:
/net/cores.uk/export/hotline/hotlocal/bin/findUE messages
###################################################################
FindUE was written to assist in ECC syndrome history analysis, the script will
understand Solaris messages, console logs, msgbuf, showlogs and wfail outputs
Supported systems include USIII and USIV systems the E3000-6500s but not USIIIi.
Version 1.33 from /net/cores.uk/export/hotline/hotlocal/bin/findUE
Infodocs 75538 and 74624 have further details. If you find bugs in the script
email [email protected] for syndrome decode bugs [email protected]
##################################################################
Infodoc 80346: “Using the fin954 script to diagnose main memory versus L2SRAM errors”
The fin954 script was written by Mike Arnott in 2003. The aim was to automate the diagnosis of main memory versus L2SRAM errors on UltraSPARC III systems using the errors found in a Solaris messages file. The fin954 script implements the rules described in FIN I0954-1. These rules apply to all USIII and USIV systems, including the VSP systems, SunBlade 1000/2000, 280R, Netra 20, 480/490 and the 880/890, but not to USIIIi-based systems.
The latest version of fin954 is available from http://fde.aus/tools/fin954. It is a Perl script and must be run from the command line.
Further information is available from:
FIN I0760-2 Sun Enhanced Memory DIMM Replacement Policy.
Infodoc 52427 L2SRAM/DIMM Misdiagnosis Issues
Infodoc 75538 Sun Fire[TM] Server: Using ECC Syndrome History to Troubleshoot Uncorrectable Errors (UE) in Memory
When to use fin954:
fin954 is a special-purpose diagnosis script and should not be used for an initial scan of messages. For a general summary of all AFT events found in messages, use findaft. When cediag finds UE errors but cannot identify a single faulty DIMM, fin954 can help diagnose the faulty FRU.
Example:
$ fin954 sample.4.messages.txt
============================================================================
Findings: from analysing sample.4.messages.txt
Total "Events" logged: 167
Total *significant* "Events" logged: 28
Total insignificant "Events" logged: 138 .
FRU "SB10/P1/B1 J14301 J14401 J14501 J14601" implicated as error source 22 times.
FRU "SB10/P1/B1/D0 J14301" implicated as error source 6 times.
fin954 is a standalone script; the latest version is runnable from:
/net/cores.uk/export/hotline/hotlocal/bin/fin954
DIMM PLL Chip Issues:
Sun has determined that a limited subset of memory DIMMs shipped in 2001 and 2002 (less than one percent of the installed base) may begin to show reduced reliability after approximately two years of operation.
This reliability issue manifests itself in the form of UEs (Uncorrectable Errors), sometimes with CEs (Correctable Errors), originating from the DIMMs.
The reliability of these DIMMs is normal for approximately the first two years of use, after which they may start to degrade below the expected level.
The root cause of this issue is related to a PLL device on the DIMMs. This sub-population of DIMMs has PLL devices with a date code range between 0049 and 0215 inclusive.
No unique symptom will be experienced due to this issue, other than higher than expected UEs and CEs.
A DIMM lookup tool has been developed to assist in identifying suspect DIMMs, although manual inspection is sometimes required.
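When a PLL date code has to be read by eye, the range check described above (0049 to 0215 inclusive) can be expressed as a small helper. The sketch below is written in Python and assumes the date code is a four-digit YYWW value (two-digit year, then week of year). That format is an assumption made here and should be confirmed against the FCO before relying on it. The function names are hypothetical and this is not the DIMM lookup tool.

```python
def parse_date_code(code):
    """Split a four-digit date code into (year, week). Assumes YYWW format."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("date code must be four digits: %r" % code)
    year, week = int(code[:2]), int(code[2:])
    if not 1 <= week <= 53:
        raise ValueError("week out of range in date code: %r" % code)
    return year, week

def is_suspect_pll(code):
    """True if the date code falls between 0049 and 0215 inclusive."""
    # Tuple comparison orders by year first, then by week.
    return (0, 49) <= parse_date_code(code) <= (2, 15)

for code in ("0048", "0049", "0152", "0215", "0216"):
    print(code, is_suspect_pll(code))
```

Comparing (year, week) pairs rather than raw integers makes the intent explicit and rejects malformed codes such as a week value of 00.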
Impacted Platforms:
The following platforms could be impacted if shipped between 1/01/2001 and 12/31/2002:
SB1000, SB2000, Netra20, 280R, V480, V490, V880, V890, V1280, SF3800, SF4800, SF4810, SF6800, F12K, F15K
References:
FCO A0253-1: A sub-population of DIMMs that shipped between 2001 and 2002 on the below platforms are showing significantly lower reliability than expected.
FAQ: http://onestop/qco/plldimm/index_plldimm.shtml
How this issue may “uniquely” affect V480/V490/V880/V890 platforms:
In some cases during UE DIMM errors, incorrect memory banks can be called out masking the true location of the faulty DIMM.
POST cannot help here, because it is affected by the same bug and will also call out the wrong DIMM location.
The Kernel or POST reports a memory group as the source of a CE or UE error, which might cause the engineer to believe there is a defective DIMM within that group.
However, on a system experiencing BugID #5034665, the reported group is USUALLY NOT the location of the defective DIMM.
You will know that the system is experiencing this bug because there will be multiple UE error messages calling out different groups of memory on the same CPU/Memory board.
Note: When this bug is exhibited, the UE DIMM errors will be confined to a single CPU/Memory board - the false errors will not span different CPU/Memory boards, they will only be on one board.
In light of the information uncovered by BugID #5034665, we no longer look for one specific DIMM. Instead, we remove all DIMMs containing a PLL chip within the affected range of date codes.
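The symptom described above (UEs calling out different memory groups on the same CPU/Memory board) can be expressed as a simple check once the UE callouts have been collected. The sketch below is written in Python and assumes the callouts have already been reduced to (board, group) pairs. The board and group labels are hypothetical placeholders, and extracting them from real Solaris or POST messages is not shown.

```python
from collections import defaultdict

def boards_with_scattered_ues(ue_callouts):
    """Return boards whose UEs implicate more than one memory group.

    ue_callouts: iterable of (board, group) pairs, one per UE message.
    """
    groups_by_board = defaultdict(set)
    for board, group in ue_callouts:
        groups_by_board[board].add(group)
    return sorted(
        board for board, groups in groups_by_board.items() if len(groups) > 1
    )

# Hypothetical callouts: board "A" is reported against two different groups.
callouts = [("A", 0), ("A", 1), ("A", 0), ("B", 1)]
print(boards_with_scattered_ues(callouts))  # ['A']
```

A board flagged this way is only a candidate for the PLL issue. The date codes on its DIMMs still need to be checked as described in Infodoc 77110.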
References:
Infodoc 77110: Sun Fire[TM] Server (V480, V880, V880z, V490, V890): How to Troubleshoot "Dimm PLL chip failure causes CE/UEs to be called out by POST & Solaris[TM] in any dimm location" BugID #5034665
SunAlert 101667 (formerly 57757): A Limited Subset of DIMMs (less than 1%) Shipped in 2001-2002 May Have a Reliability Issue
PLL Lookup Tool: http://pts-appl-z1.holland/pll.html (used to scan Explorer outputs)
Misc Memory Resources
Memory reference site:
http://onestop/qco/dimm/
https://onestop.sfbay.sun.com/qco/dimm/index_dimm.shtml
Infodoc 70361: Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages
Infodoc 72846: Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], and UltraSPARC-IV[R] CPU Modules
Infodoc 72775: How to determine if a correctable error (CE) on a memory DIMM should result in replacement of FRU
Infodoc 70134: Diagnosis of bad writers and datapath faults from Solaris messages
Infodoc 79928: Sun Enhanced Memory DIMM Replacement Policy
Infodoc 82264: Memory DIMM Replacement Management Tool - cediag 1.2.1 FAQ
FIN 100271 (formerly I0760-2): Sun Enhanced Memory DIMM Replacement Policy