Sept. 2009 Bill Minor IBM The DB2 Detective – The Case of the Diligent DBA
Jun 29, 2015
Sept. 2009
Bill MinorIBM
The DB2 Detective – The Case of the Diligent DBA
2
Highlights
No matter the extent of preparation, it seems that unexpected outages popup when least expected. As a result, unplanned downtime can causesignificant impact to business operations. When misfortune strikes, there arealways two predominant questions, "What happened?" and "When will thesystem be available again?" Mysteries can be resolved by good DB2 Detective work - identifying and circumventing actions that can lead to problems. When unexpected problems do occur, it is vital to be able to act decisively and swiftly.
The goal of this session is to introduce database inspection tools and provideinsights that can be used to help monitor the integrity and maintain theconsistency of your data. I will also provide guidance for salvage, repair, recovery, and mitigation operations that are vital when crisis strikes and timeis of the essence.
3
Agenda
• Help with defining a problem
• Help with collecting and understanding information/diagnostics related to a problem
• Tools to help in the event of an outage
• Tools to help avoid outages in the first place
• Features to help avoid, control, and reduce the impact of an unexpected problem (Technology Preview or TP)
• RAS (Reliability, Availability and Serviceability)
4
In the Beginning ….
• ‘There is an outage’ !!!!!! Where do we start?
…… Take a deep breath … (DB2 911)
• Keep it simple to begin; fundamentals are essential
• Order of Investigation/ ‘Case’:
Problem DescriptionProblem Determination
Problem Resolution
5
‘Tools’ in the Detective Kit
• The problem description
• db2diag.log
• db2trc
• db2pd
• db2dart and inspect
• db2top
• db2exfmt
• admin functions
• db2cos, db2support, db2fodc
• Resiliency (mitigation/toleration/avoidance)
6
An ‘Outage’
• “Outage” – subjective terminology: Data availability:
slow access no access
Unexpected crashes
• Notes:• There is always a trade-off between how much time can be spent
investigating a problem and making system available again to users• Other factors: risk mitigation; establishing cause; development of a
workaround (in absence of a ‘fix’ obviously don’t want to hit this again)
An outage can be planned (i.e. maintenance) or unplanned (unexpectederrors, terminations, slow or unavailable access to data, …).
‘Path to DBA diligence’ Handling and avoiding unplanned outages via: investigation, problem
definition, monitoring, prevention, education, …… (hopefully something from this presentation ) …
7
What Information Needs to be Collected?
• Each problem has it own set of one or more signatures
• Problems (even those with the same source – so ‘same problem’) can manifest themselves in different ways• Systems and environments are different: hardware,
software, different users, different data (scope and volume), …
• There is a set of fundamental or initial information/diagnostics that needs to be collected for every problem .. No matter what
• Keep in mind that additional information or diagnostics could be required depending on the initial analysis investigation• Goal: To minimize the amount of effort required in
collecting diagnostics
8
The Problem Description
• VERY VERY VERY IMPORTANT (may seem obvious)• Avoid taking the wrong path • Invest in:
• Establishing “What happened?”• Getting the correct information and diagnostics
REQUIRED FUNDAMENTALS:
• What is the current state/situation? • Describe what happened and what needs to happen• Include recent problem or change history• Client/Server configuration
• What changed? No really, what changed? (Usually something has but just not apparent how, why or when)
• Include specific actions known to be performed at the time of the incident• Product version and release number (FP? Special build?)• SQL error codes• Scope: System Wide? Instance Wide? Database? Application(s)?
9
Performance Issues – What? and When?
• Identification of such an issue requires a baseline – a before and after comparison
• Typically the issue is, ‘things are slow’ (really mean ‘slower’)• Slow query
• After running RUNSTATS
• After applying a fixpack
• After migrating to a new release of DB2
• After making a database configuration change
• Performance of triggers, stored procedures, UDFs with SQL, crashes/traps during query execution
• Differentiate between overall system performance and specific actions
• Date/time correlation? Other maintenance activities?
• Different systems? (db2pd –osinfo)
10
Problem ‘Scope’
• Scope and Cause are required in order develop and execute action plan; action plan may be ‘straightforward’ i.e. restore the database
• Manifest itself in one of the following outage scenarios:• Instance trap – db2start required• Database trap • Application failure
11
Diagnostics to Collect at the Outset
• DB2 Build Level• Operating System type and level• DB2 registry variable settings• MPP? SMP? Single-processor? Database partitioned? Which partition?• Syntax of command/statement associated with issue• DB2 error received• db2diag.log file which sufficient history• Notification log• DB2 Trap files (Call stacks), DB2 dump files• History File• db2 trace• Messages file • Core file, system error logs (Unix:errpt Windows: event logs)• Log files and log file control structures• …
Fundamental
12
db2diag.log file• Where is it? $HOME/sqllib/db2dump/• What is it? ‘Supplementary’ diagnostic logging facility• How to read? Each entry:
• Time ordered entries• Database specific; Database node/partition (i.e. DPF specific)• Process/thread ID; Application specific identifiers/handles• Function name where entry was generated• Messaging and perhaps some data
2009-03-03-10.26.14.523824-300 I37414837E502 LEVEL: Error
PID : 8083 TID : 54 PROC : db2sysc 20
INSTANCE: svtdbm3 NODE : 020 DB : MASUN1
APPHDL : 0-138 APPID: *N0.svtdbm3.090303150827
AUTHID : SVTDBM3
EDUID : 54 EDUNAME: db2agntp (MASUN1) 20
FUNCTION: DB2 UDB, buffer pool services, sqlbDMSCreatePool, probe:800
MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."
DIA8312C Disk was full.
• Can be parsed with diag tool and/or admin function (examples in Backup Slides)
13
db2trc – Trace facility
• Useful for problems/issues that can be recreated
• Generates a flow and format of the code paths taken by DB2 while the trace is active• Traces all DB2 activity in an instance on a physical node (database
partition). If many logical db2 nodes, all activity is traced• Trace to memory buffer or file (slower)
• Trace to file for trace wraps
• Always include corresponding db2diag.log file with a db2 trace
Examples:
db2trc on –l 8M … trace to memory
db2trc on –f mytrace.trc … trace to file
db2trc inf …. Information returned when trace is active)
db2trc flw mytrace.trc mytrace.flw ……
db2trc fmt mystrace.trc mytrace.fmt …… Readable versions of trace data
14
db2dart (Database Analysis and Reporting Tool)
• Primarily to examine databases for architectural correctness• Check validity of meta-data structures, page headers, row headers, …• Runs against data as it exists on disk (primarily an offline tool)• Critical for investigating problems associated with data ‘corruption’• Granularity: database /db, tablespace /ts, table /t
EXAMPLES:
db2dart SAMPLE /db
db2dart SAMPLE /ts /tsi 4
db2dart SAMPLE /t /tsi 2 /oi 5
• Marking problematic objects as invalid; recovering or extracting data
• Consistency: Index key data checking
• HWM – space tablespace space management
15
Inaccessible Table – Salvaging Data• Example: db2dart SAMPLE /ddel
…Table ID or name, tablespace ID, first page, num of pages:(suffic page number with 'p' for pool relative)13 2 0 100
7 of 7 columns in the table will be dumped. Column numbers and datatypes of the columns dumped: 0 SMALLINT 1 VARCHAR() -VARIABLE LENGTH CHARACTER STRING 2 SMALLINT 3 CHAR() -FIXED LENGTH CHARACTER STRING 4 SMALLINT 5 DECIMAL 6 DECIMAL Default filename for output data file is TS2T13.DEL, do you wish to change filename used? y/nn
Filename used for output data file is TS2T13.DEL. If existing file, data will be appended to it.
16
Marking Indexes As Invalid – db2dart /MI
Example:
db2dart testdb /mi /tsi 2 /oi 4
• Time of index rebuild dictated by setting of INDEXREC database cfg variable• Usually one of RESTART (default) or ACCESS
• RESTART – database restarted
• ACCESS – when index is first accessed
• Messages written to db2diag.log file indicating index rebuild
17
Checking Database Backup Images
• Database backup images represent one of your ultimate recovery options
• Hence, checking their integrity is prudent
• db2ckbkp• Check the integrity of a database backup image• Can extract pages from the backup image to a file (-e option)
• This may be part of a problem causal analysis or repair procedure What ‘state’ is the data in the backup image?
• Tivoli Storage Manager (TSM): db2adutil -verify
18
History File
• Records information specific to the following database actions: • BACKUP, ROLLFORWARD, DROPPED TABLE, LOAD, CREATE
TABLESPACE, RENAME TABLESPACE, REORG, ALTER TABLESPACE, ARCHIVE LOG
• Dropped table recovery• CREATE or ALTER TABLESPACE option DROPPED TABLE RECOVERY• Allows recovery of dropped table data using table space-level restore and
rollforward operations
• Faster than database recovery and database available to users
19
LIST HISTORY – Dropped Table Recovery
1 Identify the table which was dropped
Op Obj Timestamp+Sequence Type Dev Earliest Log Current Log Backup ID -- --- ------------------ ---- --- ------------ ------------ -------------- D T 20090223125627 000000000000693f00020013 ---------------------------------------------------------------------------- "BMINOR "."STAFF3" resides in 1 tablespace(s):
00001 USERSPACE1 ---------------------------------------------------------------------------- Comment: DROP TABLE Start Time: 20090223125627 End Time: 20090223125627 Status: A ---------------------------------------------------------------------------- EID: 9
DDL: CREATE TABLE "BMINOR "."STAFF3" ( "ID" SMALLINT NOT NULL , "NAME" VARCHAR(9) , "DEPT" SMALLINT , "JOB" CHAR(5) , "YEARS" SMALLINT , "SALARY" DECIMAL(7,2) , "COMM" DECIMAL(7,2) ) IN "USERSPACE1" ;
----------------------------------------------------------------------------
20
Dropped Table Recovery (continued)
2 Restore a database- or table space-level backup image taken before the table was dropped
3 Create an export directory to which files containing the table data are to be written
4 Roll forward to a point in time after the table was dropped, using the RECOVER DROPPED TABLE option on the ROLLFORWARD DATABASE command (alternatively, roll forward to the end of the logs, so that updates to other tables in the table space or database are not lost)
5 Re-create the table using the CREATE TABLE statement from the recovery history file.
6 Import the table data that was exported during the rollforward operation into the table
21
Performance – Explain Plans
• Optimizer access plans change … could be legitimate, maybe not
• Compare plans, stats, configuration: before vs after
• Monitor
Example:
db2 –tvf $HOME/sqllib/misc/EXPLAIN.DDL
db2 explain all for “select * from bminor.staff where id=10”
db2exfmt -d sample -n % -e % -s % -w -l -# 0 -o explainPlan.out -g TIC
22
db2exfmt Output - Example
Access Plan:----------- Total Cost: 11.5946 Query Degree: 1
Rows RETURN ( 1) Cost I/O | 6.28 FETCH ( 2) 11.5946 1 /---+---\ 6.28 157 IXSCAN TABLE: BMINOR ( 3) STAFF 3.2496 Q1 0 | 157 INDEX: BMINOR I1_STAFF Q1
Objects Used in Access Plan:---------------------------
Schema: BMINOR Name: I1_STAFF Type: Index Time of creation: 2009-03-02-15.52.41.836445 Last statistics update: Number of columns: 1 Number of rows: 157 Width of rows: -1 Number of buffer pool pages: 1 Distinct row values: No Tablespace name: USERSPACE1 Tablespace overhead: 7.500000 Tablespace transfer rate: 0.060000 Source for statistics: Single Node Prefetch page count: 32 Container extent page count: 32 … Schema: BMINOR Name: STAFF Type: Table Time of creation: 2009-03-02-15.50.29.727941 Last statistics update: Number of columns: 7 Number of rows: 157 Width of rows: 40 Number of buffer pool pages: 1 Number of data partitions: 1 Distinct row values: No Tablespace name: USERSPACE1 Tablespace overhead: 7.500000 Tablespace transfer rate: 0.060000
23
db2pd – Monitor and Troubleshoot DB2
• Indispensable for monitoring and troubleshooting• Non-intrusively capable of obtaining various statistics, snapshot
information, internal meta-data, … from an instance• Relevant options (some of my favorites):
• -osinfo, -mempools, -memblocks, -edus (more to come), -stack, -applications, -transactions, -bufferpools, -logs, -pages, -tablespaces, -tcbstats index, -apinfo, -fvp, …
24
Achieving Understanding via SQL Routines • DB2 has a very rich and diverse set of SQL routines that can be used probe
and monitor one’s system
• For example:• PD_GET_DISG_HIST• PD_GET_LOG_MSGS• PDLOGMSGS_LAST24HOURS• DB_HISTORY• LOG_UTILIZATION• LONG_RUNNING_SQL• LOCKS_HELD• LOCKWAITS• CONTAINER UTILIZATION• TBSP_UTILIZATION• ADMIN_GET_TAB_INFO, ADMIN_GET_INDEX_INFO • APPL_PERFORMANCE• BP_HITRATIO, BP_READ_IO, BP_WRITE_IO• TOP_DYNAMIC_SQL• MON_GET_BUFFERPOOL, MON_GET_TABLE, MON_GET_TABLESPACE, ….
25
db2top
• What is it?• DB2 dynamic monitoring tool. But so are snapshots?
• Snapshot data is typically cumulative (point in time)
• db2top provides delta values in real time
• db2top integrates multiple types of snapshot data together• Diagnose performance and resource problems, optimize• Provides a unified, single-system view of a multi-partition database or
single-partition database for the purposes of providing a dynamic real-time view of a running DB2 system
• Availability: V8 FP17, V9.1 FP6, V9.5 FP2• developerworks whitepaper:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/
26
When Unexpected Problems Occur – db2cos :DB2 Call Out Script
• Invoked automatically by Database invoked by default when the database manager cannot continue processing due to a panic, trap, segmentation violation or exception
• Each default db2cos script will invoke db2pd commands to collect information in an unlatched manner
• Order of events when trap/panic occurs:• Trap file is created • Signal handler is called • db2cos script is called • An entry is logged in the administration notification log • An entry is logged in the db2diag log file • Information is generated by db2pd (operating system, the Version and Service
Level installed, the database manager and database configuration, as well as information about the state of the agents, memory pools, memory sets, memory blocks, applications, utilities, transactions, …)
27
db2support – Diagnostic Data Collection
• Collects environment data about either a client or server machine and places the files containing system data into a compressed file archive
• Collects db and dbm cfg info, reg var settings, db2level, db2diag.log, … all into one file called db2support.zip
• Collects up diagnostics generated by db2cos* scripts• db2supp_opt.zip within db2support.zip
This file contains a host of optimizer related info: query text, exfmt output (explain of query), db2look and db2batch output, catalog table info, …
• To assist with Optimizer (plan) problems
db2support . –db <dbname> -st “select * from bminor.staff”
28
db2pdcfg
• Basically a tool which can set or change flags which influence the
database system behavior for problem determination purposes • Control behaviors of cos and fodc (more to come on this one)
• Neat option: db2pdcfg –flushbp –db <dbname>• Allow faster failover from primary to secondary in HADR setups• Monitor and tune bufferpools for number of dirty pages (use
inconjunction with –pages and –bufferpools)• Sync in-memory tablespace meta-data structures with on-disk
versions• Execute before using db2dart /DHWM against active database
29
First Occurrence Data Capture (FODC)
• A facility (V9.5) to capture diagnostic information when a problem first occurs• Automatic mode: invoked by DB2 engine when outage or error condition
is detected
• Traps, panics, corruptions• Manual mode: user initiated – db2fodc
• Hangs, performance issues
• Current setting: db2pdcfg –fodc status• db2pdcfg can be used to change defaults
• Information collected and placed in subdirectory under DIAGPATH• FODC_<outagetype>_<timestamp>
• Run db2support after db2fodc to collect all the fodc information and package into one zip file that can be sent to Support
30
db2fodc ExampleExample: db2fodc –hang –db SAMPLE
2009-03-03-11.33.19.657457-300 E1683E654 LEVEL: WarningPID : 619 TID : 47350926685104PROC : db2fodcINSTANCE: bminor NODE : 000FUNCTION: DB2 UDB, RAS/PD component, pdDb2FODCMain, probe:30MESSAGE : ADM14003W FODC has been invoked by the user from db2fodc tool for symptom "hang" and diagnostic information has been recorded in directory "/home/bminor/sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318". Please look in this directory for detailed evidence about what happened and contact IBM support if necessary to diagnose the problem.
File Listing for ../sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318
6116.1.000.stack.txt 6116.17.000.stack.txt 6116.27.000.stack.txt 6116.31.000.stack.txt DB2CONFIG6116.11.000.stack.txt 6116.18.000.stack.txt 6116.28.000.stack.txt 6116.32.000.apm.bin6116.12.000.stack.txt 6116.19.000.stack.txt 6116.29.000.apm.bin 6116.32.000.dump.bin 6116.13.000.stack.txt 6116.20.000.stack.txt 6116.29.000.dump.bin 6116.32.000.stack.txt DB2PD6116.14.000.stack.txt 6116.21.000.stack.txt 6116.29.000.stack.txt 6116.33.000.apm.bin DB2SNAPS6116.15.000.dump.bin 6116.22.000.stack.txt 6116.30.000.apm.bin 6116.33.000.dump.bin DB2TRACE6116.15.000.stack.txt 6116.23.000.stack.txt 6116.30.000.dump.bin 6116.33.000.stack.txt OSCONFIG6116.16.000.apm.bin 6116.24.000.stack.txt 6116.30.000.stack.txt 6116.34.000.apm.bin OSSNAPS6116.16.000.dump.bin 6116.25.000.stack.txt 6116.31.000.apm.bin 6116.34.000.dump.bin OSTRACE6116.16.000.stack.txt 6116.26.000.stack.txt 6116.31.000.dump.bin 6116.34.000.stack.txtdb2eventlog.000 db2fodc_hang.log
31
Resiliency
• Avoiding and dealing with ‘errors’ that can result in unplanned outages
32
Online Data Consistency Checking (V9.5)
• Index to Data consistency problems can be categorized as:
Orphaned index entry:An index entry points to a deleted or non-existent record slot
Invalid index entry:An index entry points to a valid slot on a data page, but index key data does not match record data
Missing key:A data row is not indexed in at least one of the indexes on the table
Duplicate Record IDs (RIDs):A data row is indexed more than once
These are classified as logical corruptions.
33
Key-Data Checking via DB2 INSPECT (V9.5)
• New INDEXDATA cross-object checking clause to the INSPECT utility
Example:
db2 inspect check table name STAFF indexdata results keep insp.out
NOTE:
The INDEXDATA option only examines the logical inconsistency between
index and data. Therefore, it is recommended that you first run INDEX and
DATA checking separately, to rule out any physical corruption, before
running INDEXDATA checking.
34
Threaded Engine Model for Unix in V9.5
• DB2 has always had a threaded engine model on Windows platforms
• In V9.5, DB2 Unix engine has now evolved from process-based to thread-based model
• All threads within a process can share the memory allocated by, or within, that process
• Several advantages: basically performance and memory management
• From a DBA perspective:• Better administration and tuning of memory through administration• Resiliency improvements (upcoming slides)
35
DB2 Processes vs. Threads: ps -ef
• V9.1: bminor 6107356 4780132 0 13:00:53 - 0:00 db2sysc bminor 6148108 6717616 0 13:01:03 - 0:00 db2agent (SAMPLE) bminor 6180952 1458238 0 13:08:22 - 0:00 db2pclnr bminor 6193316 1458238 0 13:08:22 - 0:00 db2pclnr bminor 6660150 6107356 0 13:00:53 - 0:00 db2licc bminor 6717616 6107356 0 13:00:53 - 0:00 db2ipccmbminor 884882 1458238 0 13:08:22 - 0:00 db2loggw (SAMPLE) bminor 1237024 1458238 0 13:08:22 - 0:00 db2lfr (SAMPLE) bminor 1458238 6107356 0 13:00:53 - 0:00 db2gds bminor 1556708 6717616 0 13:01:07 - 0:14 db2stmm (SAMPLE) bminor 1757202 1458238 0 13:08:22 - 0:00 db2dlock (SAMPLE) bminor 1908750 1458238 0 13:08:22 - 0:00 db2pclnr bminor 2195682 1 0 13:08:21 - 0:00 /home4/bminor/sqllib/bin/db2bp
• V9.5:bminor 10364 10362 1 13:15 pts/19 00:00:00 db2sysc
bminor 10377 10362 1 13:15 pts/19 00:00:00 db2acd,0,0,0,1,0,0,1,0,8a18bminor 10385 1 0 13:15 pts/19 00:00:00/home/bminor/sqllib/bin/db2bp
36
Running Threads: db2pd -edus
Database Partition 0 -- Active -- Up 2 days 17:59:14
List of all EDUs for database partition 0
db2sysc PID: 1187db2wdog PID: 1185db2acd PID: 1201
EDU ID TID Kernel TID EDU Name USR SYS =========================================================================================================65 47726691608896 20490 db2agntdp (SAMPLE ) 0.000000 0.00000064 47726695803200 999 db2agent (SAMPLE) 0.240000 0.11000063 47726708386112 998 db2agent (SAMPLE) 0.720000 0.16000061 47726712580416 308 db2evmli (DB2DETAILDEADLOCK) 0.000000 0.00000060 47726716774720 307 db2wlmd (SAMPLE) 0.000000 0.00000059 47726720969024 306 db2pfchr (SAMPLE) 0.000000 0.00000057 47726737746240 304 db2pfchr (SAMPLE) 0.000000 0.00000055 47726733551936 302 db2pclnr (SAMPLE) 0.000000 0.00000054 47726729357632 301 db2pclnr (SAMPLE) 0.000000 0.00000053 47726746134848 300 db2dlock (SAMPLE) 0.000000 0.00000052 47726762912064 32767 db2lfr (SAMPLE) 0.000000 0.00000051 47726741940544 32766 db2loggw (SAMPLE) 0.020000 0.03000050 47726758717760 32765 db2loggr (SAMPLE) 0.000000 0.03000049 47726699997504 1554 db2taskd (SAMPLE) 0.000000 0.00000025 47726754523456 1365 db2stmm (SAMPLE) 8.020000 0.46000016 47726767106368 1265 db2agent (SAMPLE) 1.420000 0.81000015 47726771300672 1199 db2resync 0.000000 0.00000014 47726775494976 1196 db2ipccm 0.080000 0.03000013 47726779689280 1195 db2licc 0.000000 0.00000012 47726783883584 1194 db2thcln 0.000000 0.00000011 47726788077888 1192 db2alarm 0.020000 0.0200001 47726528031040 1191 db2sysc 5.420000 5.620000
37
Memory Protection via Storage Keys (V9.5)
• DB2 is leveraging technology which is specific to the POWER6 platform and AIX (TL5300-06 or greater)• Hardware feature called Storage Keys• Allows for memory protection functionality
• DB2 V9.5 has the capability to detect erroneous access to critical regions of it’s own memory, primarily bufferpool memory
• This is pro-active protection: • When such access is detected it is stopped before a corruption can
result• Offending UDF or Stored Procedure receives an error code
indicating that an incorrect memory access was performed
• Cannot protect against all types of bugs but it does provide an added layer of protection
38
Enablement of Memory Protection viaStorage Keys (V9.5)
• Registry variable: DB2_MEMORY_PROTECT = [ NO | YES ]
• Error DBI1301E on attempted set if:• Hardware, OS does not support storage keys• Large page support (i.e. DB2_LARGE_PAGE_MEM) is enabled
39
Higher Availability with Storage Key Support
• Thread Suspension: An additional level of protection that will increase availability of systems when abnormal termination of a DB2 Kernel Thread occurs
DB2_THREAD_SUSPENSION = [ OFF | ON ]
• What is it? It allows you to control whether a DB2 instance sustains a trap by
suspending a faulty DB2 kernel thread (a thread that has tried illegally to access memory protected with storage protection keys).
40
Error Toleration (TP)
• Goal: Reduce unplanned outages• Error toleration via:
• Trap resilience• Physical read error toleration• Logical read error toleration
• Registry variable:• DB2RESILIENCE=ON/TRUE or not defined (ie default behavior)
activates error toleration support for the above• DB2RESILIENCE=OFF/FALSE disables these capabilities.
41
Extended Trap Resilience (TP)
• Registry Variable:
DB2RESILIENCE= [ ON | OFF ] • When a DB2 server detects a trap the following actions will be taken by
DB2: First Occurrence Data Capture (FODC) will be activated Execution stack is dumped If Storage Key support is active then process thread suspension
decisions If Storage Key suspension decisions allow for thread suspension,
suspend thread If Storage Key support is not active or fails decision points to suspend
thread, then proceed with non Storage Key survival decisions (ie. Utilize this new capability)
DB2 will determine if the current environment is safe to allow recovery
42
Messaging:
2008-11-17-09.56.27.542572-300 I75439E563 LEVEL: SeverePID : 28040 TID : 47643040409920PROC : db2syscINSTANCE: marvin NODE : 000 DB : TESTDBAPPHDL : 0-21 APPID: *LOCAL.marvin.081117145537AUTHID : MARVIN EDUID : 29 EDUNAME: db2agent (TESTDB)FUNCTION: DB2 UDB, RAS/PD component, pdResilienceIsSafeToRollback,probe:800DATA #1 : String, 37 bytesTrap Sustainability Criteria CheckingDATA #2 : Hex integer, 8 bytes0x00000000000A0400DATA #3 : Boolean, 1 bytestrue
DATA #1 A generic description of the data followingDATA #2 The hex dump of the trap sustainment criteriaDATA #3 The decision on whether to attempt to sustain the trap (true/false)
43
Physical Read Error Toleration (TP)
• Physical read errors are detected when retrieving DB2 pages from the storage medium (for disk drive).
• This error could be caused by issues with the contents of the data on disk or by problems accessing the physical device on which the data resides. Currently, when DB2 detects either of these errors the DB2 instance is shutdown.
• This solution will provide enhanced toleration of physical read errors such that if such an error is encountered, rather than force the DB2 instance offline, the SQL statement or operation being executed will fail, a new SQLCODE will be returned to the application when appropriate, logging of appropriate diagnostics will occur, and the DB2 instance will remain active.
44
Toleration of Physical Read Errors
• When a DB2 server detects a physical read error there will be a maximum of two more attempts to read the page from disk and validate the DB2 page header. If at the end of the multiple attempts an error is still determined to exist then:• If DB2RESILIENCE=ON
• Sqlcode -1655E will be returned to the application• The SQL statement or DB2 operation fails• Administration message ADM6006E will be placed in the DB2 Administration log
at notify level URGENT• Administration message ADM6006E will be placed in the DB2 diagnostic log at
level URGENT• First Occurrence Data Capture (FODC) will be activated when the error is
detected and will be turned off when the sqlcode is returned to the application. This will allow collection of diagnostic data generated due to this failure detection for further analysis by DB2 Support.
• The database instance remains available• Application connection remains available for continued operations
• If DB2RESILIENCE=OFF then panic the instance (that is, revert to old behavior)
• This algorithm will aid in reducing application errors which are a result of transient file system I/O errors.
45
Logical Read Error Toleration (TP)
• Inconsistency detected by DB2 when processing data contained within a DB2 page
• Instead of hard failure (database or instance shutdown), isolate the issue and allow overall database use to continue• Problematic page is removed from the bufferpool, relevant diagnostics
are generated, notification is returned (messages to admin and db2diag.log and new sqlcode to application)
• Leveraged within the engine where it makes sense
46
Messaging
SQL1656E An error occurred while processing data. The operation could not be completed, but the
database remains accessible. Contact IBM Software Support.Explanation: The failure to complete the operation is due to a detected DB2 page inconsistency. The SQL
statement has failed or the operation has been aborted. The database remains accessible.User response: Contact IBM Software Support for instructions on what diagnostic data to collect to assist in
resolving this issue. The application can retry the operation, although it may continue to fail.sqlcode: -1656sqlstate: 58004
ADM6007C
DB2 detected an error while processing page page-number from table space tbspace-id for object object-id of object type object-type.
ExplanationDB2 was not able to complete the operation, but the database remains accessible. This may
indicate the presence of a serious problem that could result in a future outage.User responsePlease contact IBM Software Support and they will guide you through the proper corrective
actions.
47
Resources To Help You
• Developerworks (Best Practices Papers, …)
• IBM Database Magazine
• IDUG Web Site and Solutions Journal
• DB2 LUW Chat with the Lab
http://www-01.ibm.com/software/data/db2/9/labchats.html
• IBM DB2 Support home:
www.ibm.com/software/data/db2/support/db2_9
• DB2 Fixes:
http://www-01.ibm.com/software/data/db2/support/db2_9/download.html
• Conferences, RDUGs, Customer Lab Advocate, Early Adoption Programs, …