Informix健康检查13.ppt

© 2010 IBM Corporation, April 17, 2023

Informix Database Administrator Quick Guide

IBM Informix China Development Center, Informix Enablement Team

谭永贻, Technical Manager

Information Management – Informix


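Outline

Informix quick health check

Regularly scheduled database administration tasks

Runtime monitoring

Performance tuning

Making Informix database administrators productive from day one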




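CPU
– Monitor the CPU idle percentage

Memory
– Check the total available memory
– Check the memory allocated to IDS
– Check the IDS memory parameters

I/O
– Monitor system I/O bandwidth
– Check whether I/O throughput is balanced across the disks

Network
– Check information about network connections to IDS

online.log file
– Check error and warning messages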



CPU

A. top
  Tasks: 209 total, 1 running, 207 sleeping, 0 stopped, 1 zombie
  Cpu(s): 0.3%us, 0.3%sy, 0.0%ni, 50.0%id, 49.2%wa, 0.0%hi, 0.2%si, 0.0%st

B. sar 5 100
  Time         CPU  %user  %nice  %system  %iowait  %steal  %idle
  10:37:11 AM  all   0.70   0.00     0.40    49.20    0.00  49.70
  10:37:16 AM  all   0.50   0.00     0.30    49.45    0.00  49.75
  10:37:21 AM  all   0.50   0.00     0.50    49.30    0.00  49.70
  10:37:26 AM  all   0.50   0.00     0.40    49.35    0.00  49.75

Analysis:
(1) Under normal load, is the CPU idle rate below 20%?
(2) At peak load, is the CPU idle rate below 2%?
If the answer to (1) or (2) is yes, add more CPUs.
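The two questions above can be sketched as a small check. This is only an illustration: the function name is hypothetical, and treating the average of the sar %idle column as "normal load" and its minimum as "peak load" is an assumption, not something the slides prescribe.

```python
def cpu_needs_attention(normal_idle_pct, peak_idle_pct):
    """Apply the two idle-rate questions from the analysis above.

    normal_idle_pct: typical %idle under normal load
    peak_idle_pct:   lowest %idle observed at peak load
    """
    below_normal = normal_idle_pct < 20.0  # question (1)
    below_peak = peak_idle_pct < 2.0       # question (2)
    return below_normal or below_peak


# %idle column from the sar sample above
sar_idle = [49.70, 49.75, 49.70, 49.75]
average_idle = sum(sar_idle) / len(sar_idle)
print(cpu_needs_attention(average_idle, min(sar_idle)))
```

Note that this check looks only at %idle, as the slide does; the high %iowait in the sample belongs to the I/O checks later in the deck.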



Memory

A. top
  Cpu(s): 0.5%us, 0.2%sy, 0.0%ni, 49.9%id, 49.4%wa, 0.0%hi, 0.0%si, 0.0%st
  Mem: 4044192k total, 4016092k used, 28100k free, 32344k buffers
  Swap: 8193140k total, 138236k used, 8054904k free, 3544312k cached

B. onstat -g seg
  id        key       addr      size        ovhd      class  blkused  blkfree
  12943367  52564801  44000000  895340544   10925608  R      218587   2
  12976136  52564802  795dd000  334397440   3920416   V      17035    64605
  ………………………………………………………………… V ……………….
  Total:    -         -         1229737984  -         -      235622   64607

Analysis:
– How much total free memory is there?
– How much memory is allocated to IDS?
– Does IDS have more than 4 virtual memory segments?
– Understand the IDS memory parameters: BUFFERPOOL, SHMVIRTSIZE, SHMADD
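The question about virtual segments can be answered by counting the rows whose class column is V. The sketch below assumes the eight-column layout shown in the sample; the function name is hypothetical, and only the two complete sample rows are used because the remaining rows are elided in the source.

```python
def count_virtual_segments(seg_lines):
    """Count rows whose class column is V in 'onstat -g seg' style output."""
    count = 0
    for line in seg_lines:
        fields = line.split()
        # Data rows have 8 columns: id key addr size ovhd class blkused blkfree
        if len(fields) == 8 and fields[5] == "V":
            count += 1
    return count


sample = [
    "id key addr size ovhd class blkused blkfree",
    "12943367 52564801 44000000 895340544 10925608 R 218587 2",
    "12976136 52564802 795dd000 334397440 3920416 V 17035 64605",
    "Total: - - 1229737984 - - 235622 64607",
]
virtual = count_virtual_segments(sample)
print(virtual, virtual > 4)
```

A count above 4 is the point at which the slide suggests looking at the memory parameters it lists, such as SHMVIRTSIZE and SHMADD.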



I/O (1)

A. iostat 5 100
  avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
            14.29   0.00     0.29     6.38    0.00  79.04
  Device:  tps    Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
  sda      25.24  386.78      293.22      131991652  100065298
  sda1      0.00    0.01        0.00           1866         14
  sda2     24.76  380.97      287.73      130010482   98190344

B. Sysmaster SQL for I/O
  select d.name dbspace,
         fname[1,125] chunk_name,
         sum(pagesread) diskreads,
         sum(pageswritten) diskwrites,
         sum(pagesread)+sum(pageswritten) disk_rwes
  from sysmaster:syschkio c, sysmaster:syschunks k, sysmaster:sysdbspaces d
  where d.dbsnum = k.dbsnum
    and k.chknum = c.chunknum --# c.chknum
  group by 1, 2
  order by 5 desc;

  dbspace  chunk_name               diskreads  diskwrites  disk_rwes
  dbs11    /opt/dbschk/dbs11              475    23232301   23234759
  demodbs  /opt/dbschk/demodbchk         2493     9843421   91323498
  llogdbs  /opt/dbschk/llogch              66    11156850   11156916
  rootdbs  /opt/dbschk/online_root      37513       10001      47514



I/O (2)

C. onstat -D
  address   chunk/dbs  offset  page Rd  page Wr   pathname
  b2d481c0  1  1       0       475      23232301  /opt/dbschk/dbs11
  b417dce8  2  2       0       2493     9843421   /opt/dbschk/demodbchk
  b4179028  3  3       0       66       11156850  /opt/dbschk/llogch
  b4179218  4  4       0       37513    2738      /opt/dbschk/online_root

D. onstat -g iof
  AIO global files:
  gfd  pathname     bytes read  page reads  bytes write  page writes  io/s
  3    dbs11        972800      475         47579752448  23232301     916.6
  4    demodbchk    5105664     2493        20159326208  9843421      916.6
  5    llogch       135168      66          22849228800  11156850     916.6
  6    online_root  76826624    37513       5607424      2738         916.6

– Identify the chunk with the most read operations
– Identify the chunk with the most write operations



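I/O (3)

Questions:
– Does the system I/O bandwidth meet Informix's requirements?
– Is I/O traffic concentrated on a few specific disks?
– Are there unreliable (for example, abnormally slow) disks?

Analysis:
– Observe the I/O throughput of the system and of each disk
– Consider using better storage devices
– Move tables between dbspaces to achieve more balanced I/O
– Fragment tables
– Change some attached indexes to detached indexes
– Replace unreliable disks before they fail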



Network

A. onstat -g ntu | grep sqlexec | wc -l
  323

B. onstat -g ntu
  #netscb  connects  read    write   q-free  q-limits  q-exceed  alloc/max
  6/ 12    5222      112957  114819  6/ 10   135/ 10   0/ 0      10/ 10

C. onstat -g ntd
  Client Type  Calls  Accepted  Rejected  Read   Write
  sqlexec      yes    5212      123       59347  66366

Analysis: The DBA must configure enough network connections for IDS. A large Rejected count means that IDS does not have enough network connections.

Recommendation: When the Rejected count is large, modify the NETTYPE parameter to increase the number of IDS network connections.
– NETTYPE soctcp,10,350,CPU
– NETTYPE connection_type,poll_threads,c_per_t,vp_class
– poll_threads must not exceed NUMCPUVPS. When c_per_t exceeds 350, setting vp_class to NET is recommended.
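The two NETTYPE rules stated above can be checked mechanically. The sketch below is illustrative only: the function name and the finding messages are hypothetical, and the CPU VP count of 4 is an assumed value, since the slides do not give one.

```python
def review_nettype(nettype_value, num_cpu_vps):
    """Check a NETTYPE value against the two rules stated above.

    nettype_value: 'connection_type,poll_threads,c_per_t,vp_class'
    num_cpu_vps:   number of CPU virtual processors configured (assumed input)
    """
    conn_type, poll_threads, c_per_t, vp_class = [
        part.strip() for part in nettype_value.split(",")
    ]
    poll_threads = int(poll_threads)
    c_per_t = int(c_per_t)
    findings = []
    if poll_threads > num_cpu_vps:
        findings.append("poll_threads exceeds the number of CPU VPs")
    if c_per_t > 350 and vp_class.upper() != "NET":
        findings.append("c_per_t is above 350; consider vp_class NET")
    return findings


# The example value from the slide, checked against an assumed 4 CPU VPs
print(review_nettype("soctcp,10,350,CPU", num_cpu_vps=4))
```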



online.log

A. grep Error online.log
  10:05:43 SCHAPI: Error -23197 Database locale information mismatch.

B. grep Thread online.log
  13:41:16 Who: Session(654, prpsvr@hljpicc, 213218, 70000035702aff8) Thread(805, sqlexec, 700000357393c98, 4) File: rsdebug.c Line: 1067
  13:41:16 Results: Possible inconsistencies in 'prpalldb:"piccprp".prprepay'
  13:41:16 Action: Run 'oncheck -cD 6292799'

C. grep "Assert Failed" online.log
  00:57:53 Assert Failed: Unexpected virtual processor termination, pid =22, exit = 0x9
  00:57:53 Who: Session(1122, demodb@demo_no, 6238, 721214359) Thread(62340, sqlexec, 3845e36, 1)

Analysis:
– Error -23197: run "finderr -23197" to see the details.
– oncheck -cD: the data on disk has a problem; use the oncheck command to check the on-disk data.
– Assert Failed: examine the af.xxx file and contact IBM technical support.








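dbspace
– Check the root dbspace
– Monitor the free space in each dbspace
– Calculate the data growth rate of each dbspace

temp dbspace
– How many temp dbspaces should be configured
– Monitor temp dbspace free space when the system is busy

Logical log and physical log
– Monitor logical log and physical log usage
– Check the configuration of the logical log buffer and the physical log buffer

Update Statistics
– Update the statistics of specific tables as needed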



dbspaces: tables created in rootdbs

A. Sysmaster SQL
  select distinct t.dbsname database, d.name dbspace, t.tabname
  from sysmaster:sysdbstab d, sysmaster:syschunks c, sysmaster:sysextents t
  where t.chunk = c.chknum
    and c.dbsnum = d.dbsnum
    and t.dbsname not like 'sys%'
    and t.dbsname != 'onpload'
    and t.tabname not like 'sys%'
    and d.name = 'rootdbs';

  database   dbspace  tabname
  bank18030  rootdbs  customer
  htyuan     rootdbs  t100_1
  htyuan     rootdbs  t101_20

Analysis: Databases other than the system databases should not be created in rootdbs. Temporary tables, the logical logs, and the physical log should have their own dbspaces and should not be placed in rootdbs.



dbspaces: free space

A. Sysmaster SQL (dbaccess sysmaster)
  select name dbspace,
         sum(chksize) allocated,
         sum(nfree) free,
         round(((sum(chksize) - sum(nfree))/sum(chksize))*100,2) pcused
  from sysmaster:sysdbspaces d, sysmaster:syschunks c
  where d.dbsnum = c.dbsnum
  group by name
  order by 4 desc, name;

  dbspace  allocated  free    pcused%
  demodbs  2000000    28800   98.56
  dbs11    2000000    860150  56.99
  dbs12    2000000    860150  56.99

Analysis: Add space to dbspaces that are low on free space, or move tables from dbspaces with little free space to dbspaces with more free space.
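The pcused column is (allocated - free) / allocated * 100, rounded to two decimals. A minimal sketch of the same arithmetic follows, using the sample rows above; the 90% alert threshold and both function names are assumptions for illustration, not values from the slides.

```python
def percent_used(allocated_pages, free_pages):
    """Same arithmetic as the pcused column in the query above."""
    return round((allocated_pages - free_pages) / allocated_pages * 100, 2)


def dbspaces_needing_space(dbspaces, threshold_pct=90.0):
    """Return names of dbspaces whose usage exceeds threshold_pct (assumed value)."""
    flagged = [
        (percent_used(allocated, free), name)
        for name, allocated, free in dbspaces
        if percent_used(allocated, free) > threshold_pct
    ]
    # Fullest dbspace first
    return [name for _, name in sorted(flagged, reverse=True)]


sample = [
    ("demodbs", 2000000, 28800),
    ("dbs11", 2000000, 860150),
    ("dbs12", 2000000, 860150),
]
print(dbspaces_needing_space(sample))
```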



temp dbspace (1)

A. onstat -c | grep DBSPACETEMP   or   env | grep DBSPACETEMP
  DBSPACETEMP tmpdbs01,tmpdbs02,tmpdbs03,tmpdbs04

B. dbaccess sysmaster
  select d.name, d.pagesize, t.fname, t.chksize, t.nfree
  from syschunks t, sysdbspaces d
  where t.dbsnum = d.dbsnum and d.is_temp = 1
  order by 1;

  name      pagesize  fname                 chksize  nfree
  tmpdbs01  2048      /opt/dbschk/tmpchk01  1000000  147
  tmpdbs02  2048      /opt/dbschk/tmpchk02  1000000  39947
  tmpdbs03  2048      /opt/dbschk/tmpchk03  1000000  439
  tmpdbs04  2048      /opt/dbschk/tmpchk04  1000000  4947

Analysis:
– The DBSPACETEMP parameter in the onconfig file must be set to a valid value.
– Check the free space in the temp dbspaces.



temp dbspace (2)

C. onstat -d | grep TB
  address   number  flags    fchunk  nchunks  pgsize  flags  owner     name
  7b430600  5       0x42001  5       1        2048    N TB   informix  tmpdbs01
  7b430798  6       0x42001  6       1        2048    N TB   informix  tmpdbs02
  7b430930  7       0x42001  7       1        2048    N TB   informix  tmpdbs03
  7b430ac8  8       0x42001  8       1        2048    N TB   informix  tmpdbs04

D. onstat -d | grep tmp
  address   chunk/dbs  offset  size     free   bpages  flags  pathname
  7b432ac0  5  5       0       1000000  147            PO-B-  /opt/dbschk/tmpchk01
  7b432cb0  6  6       0       1000000  39947          PO-B-  /opt/dbschk/tmpchk02
  7b433028  7  7       0       1000000  439            PO-B-  /opt/dbschk/tmpchk03
  7b433218  8  8       0       1000000  4937           PO-B-  /opt/dbschk/tmpchk04

Analysis:
– Use onstat -d to view information about the temp dbspaces.
– At least 4 temp dbspaces per instance are recommended.
– A size of 2 GB per temp dbspace is recommended.
– Make sure the temp dbspaces have enough free space while IDS is running.



Logical log and physical log (1)

A. onstat -l
  Physical Logging
  Buffer  bufused  bufsize  numpages  numwrits  pages/io
  P-1     12       64       331813    6434      51.57
  phybegin  physize  phypos  phyused  %used
  2:53      999500   30664   12       0.00

  Logical Logging
  Buffer  bufused  bufsize  numrecs  numpages  numwrits  recs/pages  pages/io
  L-2     0        64       1851858  209365    159898    8.8         55.3
  …
  address   number  flags    uniqid  begin     size   used   %used
  b2d47f00  15      U-B----  566     3:250053  50000  50000  100.00
  b2d47f68  16      U---C-L  567     3:300053  50000  23015  46.03
  ...
  20 active, 20 total



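Logical log and physical log (2)

Logical log
– Check the size and number of the logical logs. For large OLTP systems, we recommend a size of 100 MB per logical log and 20 to 50 logical logs.
– Do not place the logical logs in rootdbs.
– Check the size of LOGBUFF. The LOGBUFF value should be greater than or equal to 128.
– Compute pages/io divided by LOGBUFF. If the result is around 75%, the logical log buffer is being used efficiently; if it is below 75%, the logical log buffer is too large; if it is above 75%, the logical log buffer is too small.

Physical log
– Check the size of the physical log. We recommend a physical log of 2 GB or larger. A smaller physical log triggers checkpoints more easily.
– Do not place the physical log in rootdbs.
– Check the size of PHYSBUFF. The PHYSBUFF value should be greater than or equal to 128.
– Compute pages/io divided by PHYSBUFF. If the result is around 75%, the physical log buffer is being used efficiently; if it is below 75%, the physical log buffer is too large; if it is above 75%, the physical log buffer is too small.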
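The 75% rule can be written out as a small calculation. The sketch below makes two assumptions that the slides do not state: both inputs are expressed in pages (the pages/io and bufsize columns of onstat -l), and "around 75%" is read as within 10 percentage points. The function name is hypothetical.

```python
def buffer_assessment(pages_per_io, buffer_size_pages, tolerance=0.10):
    """Classify a log buffer using the 75% rule described above.

    tolerance is an assumed band around 75%, not a value from the slides.
    """
    ratio = pages_per_io / buffer_size_pages
    if ratio < 0.75 - tolerance:
        verdict = "buffer too large"
    elif ratio > 0.75 + tolerance:
        verdict = "buffer too small"
    else:
        verdict = "used efficiently"
    return round(ratio * 100, 1), verdict


# pages/io and bufsize taken from the onstat -l sample
print("physical:", buffer_assessment(51.57, 64))
print("logical: ", buffer_assessment(55.3, 64))
```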



Update Statistics

A. Get the actual row count of each table
  dbaccess -e dbname sqlfile.sql
  select count(*) from manufact;   -- 1,000,000
  select count(*) from stock;      -- 100123
  select count(*) from customer;   -- 10,000,028

B. Get the row counts recorded in systables
  select tabname, nrows from systables where tabid > 99

  tabname   nrows
  manufact  9.0
  stock     100123.0
  customer  10000028.0

C. Compare the results of A and B to determine which tables have statistics that are out of date

Analysis:
– Update the statistics of each table at a level appropriate to its data volume: HIGH, MEDIUM, or LOW.
– Define the statistics update as a task that runs automatically on a regular schedule.
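Step C is a comparison of two row-count lists. A minimal sketch follows, using the sample figures from steps A and B; the 10% drift threshold and the function name are assumptions for illustration.

```python
def stale_tables(actual_counts, catalog_counts, max_drift=0.10):
    """Return tables whose systables.nrows differs from count(*) by more than max_drift."""
    stale = []
    for table, actual in actual_counts.items():
        recorded = catalog_counts.get(table, 0)
        if actual == 0:
            drift = 0.0 if recorded == 0 else 1.0
        else:
            drift = abs(actual - recorded) / actual
        if drift > max_drift:
            stale.append(table)
    return stale


actual = {"manufact": 1000000, "stock": 100123, "customer": 10000028}
catalog = {"manufact": 9.0, "stock": 100123.0, "customer": 10000028.0}
print(stale_tables(actual, catalog))
```

In the sample, only manufact is flagged: its catalog entry records 9 rows against an actual count of 1,000,000.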







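Runtime monitoring

Instance overview
– Buffer cache hit ratio
– Index scans and sequential scans
– Number of buffer waits
– Number of times no free buffer was found
– Rollbacks and commits
– Number of lock requests and lock waits
– Number of lock allocations
– Number of deadlocks

Tables that use many locks

Checkpoints
– Checkpoint duration

Long transactions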



Instance overview (1)

onstat -p

  dskreads  pagreads  bufreads  %cached  dskwrits  pagwrits  bufwrits  %cached
  6924024   6960444   57054963  87.87    276611    368870    2581449   89.32

  isamtot   open    start    read      write   rewrite  delete  commit  rollbk
  28117252  826169  1545943  19339955  720261  240523   170898  336161  34

  gp_read  gp_write  gp_rewrt  gp_del  gp_alloc  gp_free  gp_curs
  0        0         0         0       0         0        0

  ovlock  ovuserthread  ovbuff  usercpu  syscpu  numckpts  flushes
  1       0             0       1481.63  89.86   54        29

  bufwaits  lokwaits  lockreqs  deadlks  dltouts  ckpwaits  compress  seqscans
  207092    179       89705878  0        0        5         66122     60111

  ixda-RA  idx-RA  da-RA    RA-pgsused  lchwaits
  500      97      5730949  5727768     186142



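Instance overview (2)

%cached: the percentage of page requests for which the page was already in memory. In an OLTP environment it should be above 95%. If %cached for reads is low, increase the number of IDS buffers by changing the BUFFERPOOL parameter.

seqscans and isamtot: if seqscans / isamtot is greater than 1%, check whether IDS has too few indexes to use.

bufwaits: the number of times user threads waited for a buffer. If several user threads want to modify the same page at the same moment, one of them can use the buffer for that page and the others must wait for it. A large bufwaits value most likely means that some pages are modified very frequently. bufwaits can also be large if LRU_MIN_DIRTY is set to 0 and pages are flushed from the buffers to disk very frequently.

ovbuff: the number of times IDS could not find a free buffer. When no free buffer is available, IDS writes a dirty buffer to disk to obtain one. If ovbuff is large, for example 100000, increase the number of IDS buffers by changing the BUFFERPOOL parameter; this lowers ovbuff and shortens IDS response time.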



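Instance overview (3)

rollbk / commit: if rollbk / commit is greater than 1%, find out why so many transactions are being rolled back. In many cases the cause is poor application design.

lokwaits / lockreqs
– lokwaits is the number of times user threads waited for a lock.
– lockreqs is the number of times user threads requested a lock.
– If lokwaits / lockreqs is large, the application's multithreading design may be flawed.

ovlock: if the number of lock allocations by IDS is 15 or fewer, ovlock is 0; if it is greater than 15, ovlock is the number of lock allocations minus 15. If ovlock is not 0, consider increasing the LOCKS parameter in the onconfig file.

deadlks: each time a potential deadlock is detected and prevented, deadlks increases by 1.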
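The ratio checks on these slides can be applied directly to the counters printed by onstat -p. The sketch below uses the figures from the sample output; the function name, the dictionary layout, and the message texts are hypothetical, and only the rules with explicit thresholds are covered.

```python
def profile_findings(p):
    """Apply the 1% ratio rules and the ovlock rule described above."""
    findings = []
    if p["isamtot"] and p["seqscans"] / p["isamtot"] > 0.01:
        findings.append("seqscans/isamtot above 1%: review indexes")
    if p["commit"] and p["rollbk"] / p["commit"] > 0.01:
        findings.append("rollbk/commit above 1%: investigate rollbacks")
    if p["ovlock"] != 0:
        findings.append("ovlock is non-zero: consider raising LOCKS")
    return findings


# Counters from the onstat -p sample
sample_profile = {
    "isamtot": 28117252,
    "seqscans": 60111,
    "commit": 336161,
    "rollbk": 34,
    "ovlock": 1,
}
print(profile_findings(sample_profile))
```

For the sample, both ratios are well under 1%, so the only finding is the non-zero ovlock.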



Tables that use many locks

onstat -g ppf

  Partition Profiles
  partnum   lkrqs  lkwts  dlks  touts  isrd  iswrt  isrwt  isdel  bfrd   bfwrt  seqsc  rhitratio
  0x100123  8698   0      0     0      3163  243    242    62     12209  1219   3      100
  0x100124  4660   322    31    53     6553  3797   148    964    41278  11832  0      100
  0x100125  1366   0      0     0      652   210    148    62     3850   1138   1      100
  0x100126  1015   0      0     0      65    306    0      111    5330   1410   1      100
  0x100127  771    0      0     251    110   0      0      139    0      1380   0      100
  0x100128  506    0      0     0      72    357    0      0      1207   778    0      100

Question: Which tables use the most locks?

Analysis:
– lkrqs: number of lock requests
– lkwts: number of lock waits
– dlks: number of deadlocks
– touts: number of remote deadlock timeouts
Use partnum to find the tables with the most lock problems. Check the isolation level used for each such table and analyze the parts of the application that work with it.



Checkpoints

A. grep Checkpoint online.log
  16:28:31 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 69, Llog used 440
  16:33:32 Checkpoint Completed: duration was 60 seconds.
  16:33:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 1152, Llog used 12
  16:38:32 Checkpoint Completed: duration was 0 seconds.

B. onstat -g ckp

IDS 11.x: non-blocking checkpoints

Question: Is the checkpoint duration too long?

Analysis:
Automatic tuning (IDS 11.5)
– AUTO_CKPTS
– AUTO_LRU_TUNING
Manual tuning
– LRU_MIN_DIRTY, LRU_MAX_DIRTY
– CKPTINTVL, LRUS, CLEANERS, NUMAIOVPS
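To judge whether checkpoint durations are too long, it helps to pull the durations out of online.log and look at the longest and the average. A minimal sketch follows, assuming the "Checkpoint Completed: duration was N seconds" message format shown in the sample; the function name is hypothetical.

```python
import re

# Matches the "Checkpoint Completed" lines shown in the sample log
DURATION = re.compile(r"Checkpoint Completed:\s+duration was (\d+) seconds")


def checkpoint_durations(log_lines):
    """Return checkpoint durations in seconds, in log order."""
    durations = []
    for line in log_lines:
        match = DURATION.search(line)
        if match:
            durations.append(int(match.group(1)))
    return durations


sample_log = [
    "16:28:31 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 69, Llog used 440",
    "16:33:32 Checkpoint Completed: duration was 60 seconds.",
    "16:33:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 1152, Llog used 12",
    "16:38:32 Checkpoint Completed: duration was 0 seconds.",
]
durations = checkpoint_durations(sample_log)
print("longest:", max(durations), "average:", sum(durations) / len(durations))
```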



Long transactions (1)

A. grep "Long Transaction" online.log | wc -l
  13

B. Monitor long transactions

  onstat -
  IBM Informix 9.40.FC7 On-Line (LONGTX) -- Up 35 days 16:41:40 -- 3920896 Kbytes

  onstat -x
  1cf0a6748 A-R-- 1cd55c618 642073 119403 119405 0x1aa91e4 DIRTY 0

  onstat -u | grep 1cd55c618
  1cd55c618 --RPX-- 1880841 informix - 0 0 642073 256446 323049

  onstat -g ses 1880841



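Long transactions (2)

Analysis:
– Find the long transactions and modify the application to shorten them.
– Increase the size of each logical log and increase the number of logical logs.
– LTXHWM: if the number of logical logs spanned by a transaction, as a percentage of the total number of logical logs, is greater than or equal to LTXHWM, IDS rolls the transaction back.
– LTXEHWM: if the number of logical logs spanned by a transaction, as a percentage of the total number of logical logs, is greater than or equal to LTXEHWM, IDS rolls the transaction back and suspends other transactions.
– DYNAMIC_LOGS: while rolling back a long transaction, IDS may hang because the logical logs are used up. Dynamic logical logs prevent this. When DYNAMIC_LOGS is 2, IDS adds dynamic logical logs automatically; when it is 1, IDS pauses activity and notifies the DBA to add a logical log manually; when it is 0, IDS does not use dynamic logical logs.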
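The two high-water marks can be illustrated with a short calculation. In this sketch the function name is hypothetical, and the LTXHWM and LTXEHWM values of 50 and 60 are example settings chosen for illustration rather than values from the slides; the total of 20 logs matches the onstat -l sample.

```python
def long_transaction_action(logs_spanned, total_logs, ltxhwm_pct, ltxehwm_pct):
    """Return what IDS does for a transaction, per the LTXHWM/LTXEHWM rules above."""
    used_pct = logs_spanned * 100 / total_logs
    if used_pct >= ltxehwm_pct:
        return "rollback and suspend other transactions"
    if used_pct >= ltxhwm_pct:
        return "rollback"
    return "none"


# 20 logical logs in total; 50 and 60 are assumed example settings
for spanned in (9, 10, 12):
    print(spanned, long_transaction_action(spanned, 20, 50, 60))
```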







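Performance tuning focus areas

Which areas affect performance the most?

Tables with the most I/O

Large tables

Number of extents per table

Number of index levels

Tables with many sequential scans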



Tables with the most I/O

A. dbaccess sysmaster
  select dbsname, tabname,
         (isreads + pagreads) diskreads,
         (iswrites + pagwrites) diskwrites,
         (isreads + pagreads) + (iswrites + pagwrites) disk_rsws
  from sysmaster:sysptprof
  where tabname not like 'sys%'
    and dbsname not like 'sys%'
  order by 5 desc

  onstat -D

  dbsname  tabname   diskreads  diskwrites  disk_rsws
  demodb   customer  53793      0           53793
  demodb   orders    4397       31112       35509
  demodb   stock     42152      201         42353
  demodb   test_cn   589        11000       11589

Analysis: Find the tables with the most I/O. For these tables, consider the following measures:
– Fragment the table
– Move tables between dbspaces to achieve more balanced I/O



Large tables

A. dbaccess dbname
  select tabname, nrows, (npused*pagesize)/1024/1024 used_space_m
  from systables
  where nrows > 1000000
  order by 2 desc

  tabname          nrows      used_space_m
  orders           50000000   4069.015625
  customer         10000041   1395.099609375
  cust_calls       4318028.0  2811.21875
  test_cn          4194304.0  32.638671875
  t_fragment_test  1788486.0  1164.380859375

Analysis: If a table has more than 1 million rows, or uses more than 500 MB of space, consider converting it into a fragmented table.



Number of extents per table

A. dbaccess dbname
  select dbsname, tabname, count(*) num_of_extents, sum(pe_size) total_size
  from sysmaster:systabnames, sysmaster:sysptnext
  where partnum = pe_partnum
    and dbsname = "demodb"
    and tabname not like 'sys%'
  group by 1, 2
  having count(*) > 50
  order by 3 desc

  dbsname  tabname      num_of_extents  total_size
  demodb   cust_calls2  156             220408
  demodb   cust_calls   140             1515552
  demodb   orders_sum   132             70008

Analysis: If a table has more than 50 extents, consider dropping and re-creating it, specifying a larger extent size and next size when it is re-created. Sometimes the physical location of the extents also needs to be checked.



Number of index levels

A. dbaccess dbname
  select t.tabname, i.idxname, i.levels
  from sysindexes i, systables t
  where i.tabid = t.tabid and i.levels >= 4
  order by 3 desc;

  tabname     idxname  levels
  orders_sum  117_29   5
  orders      112_16   4
  cust_calls  111_19   4
  customer    110_13   4
  cust_calls  111_20   4
  cust_calls  111_15   4

Analysis: If an index has more than 4 levels, consider dropping and re-creating it. After the rebuild the index has fewer levels and lookups become more efficient.



Tables with many sequential scans

A. dbaccess dbname
  select p.dbsname, t.tabname, sum(p.seqscans) seqscans, max(t.nrows) nrows
  from sysmaster:sysptprof p, systables t
  where p.tabname = t.tabname
    and t.nrows > 100
    and p.seqscans > 0
    and p.dbsname not like "sys%"
    and p.tabname not like "sys%"
  group by 1, 2
  order by 3 desc;

  dbsname  tabname  seqscans  nrows
  demodb   stock    3102      10
  demodb   test_cn  132       9882121288
  demodb   state    121       45276
  demodb   items    11        1324552

Analysis: Find the tables with many sequential scans.
– If such a table has few rows, no change is needed, for example the stock table above.
– If such a table has many rows, consider adding an index to reduce the number of sequential scans on it, for example the test_cn table above.
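The triage described above can be sketched as a filter over the query result. The slides do not say what counts as "few rows", so the 1000-row cutoff below is an assumption, as is the function name.

```python
def index_candidates(scan_rows, min_rows=1000):
    """Return tables that are scanned sequentially and are large enough to matter.

    scan_rows: iterable of (tabname, seqscans, nrows) tuples
    min_rows:  assumed cutoff below which sequential scans are acceptable
    """
    return [
        tabname
        for tabname, seqscans, nrows in scan_rows
        if seqscans > 0 and nrows >= min_rows
    ]


# Rows from the sample result above
sample_scans = [
    ("stock", 3102, 10),
    ("test_cn", 132, 9882121288),
    ("state", 121, 45276),
    ("items", 11, 1324552),
]
print(index_candidates(sample_scans))
```

With this cutoff, stock is skipped despite its high scan count, which matches the slide's reasoning that scanning a 10-row table is harmless.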



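Summary: IDS health check

– Detect resource exhaustion problems
– Detect resource shortages that can be resolved by changing parameters
– Detect outdated configuration
– Monitor data growth
– Detect oversized tables and consider fragmenting them
– Over time, re-plan how data is laid out across the disks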



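Summary: monitoring and performance

Instance overview
– Buffer cache hit ratio, index and sequential scans, buffer waits, times no free buffer was found, rollbacks and commits, lock requests and lock waits, lock allocations, deadlocks

Tables that use many locks

Checkpoints
– Checkpoint duration

Long transactions

Performance tuning focus areas
– Tables with the most I/O
– Large tables
– Number of extents per table
– Number of index levels
– Tables with many sequential scans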




