Agenda:
– CPU: Power7
– Memory: AIX VMM tuning
– IO: storage considerations, AIX LVM striping, disk/Fibre Channel driver optimization
Materials may not be reproduced in whole or in part without the prior written permission of IBM.
2011 IBM Power Systems Technical University, October 10-14 | Fontainebleau Miami Beach | Miami, FL
IBM Title: Running Oracle DB on IBM AIX and Power7
Best Practices for Performance & Tuning
Session: 17.11.2011, 16:00-16:45, Room Krakau – Speaker: Frank Kraemer
• Mixed SMT modes supported within the same LPAR
– Requires use of "Resource Groups"
Presentation Notes:
Dynamic runtime SMT scheduling: the first (primary) thread of each core is dispatched first, as long as there are free physical cores or virtual processors, regardless of whether the schedo folding option is enabled (CPU folding is activated by default). "Requires use of 'Resource Groups'" is a reference to WPARs.
• Use SMT4 – gives a CPU performance boost by handling more concurrent threads in parallel.
• Disable HW prefetching – usually improves performance for database workloads on big SMP Power systems.
# dscrctl -n -b -s 1 (dynamically disables HW memory prefetch and keeps this configuration across reboots)
# dscrctl -n -b -s 0 (resets HW prefetching to its default value)
• Use Terabyte Segment aliasing – improves CPU performance by reducing SLB misses (segment address resolution).
# vmo -p -o esid_allocator=1 (the default on AIX 7)
• Use Large Pages (LP), i.e. 16 MB memory pages – improves CPU performance by reducing TLB misses (page address resolution).
# vmo -r -o lgpg_regions=xxxx -o lgpg_size=16777216 (xxxx = number of 16 MB segments you want; check large page usage with svmon -mwU oracle)
# vmo -p -o v_pinshm=1
Enable the Oracle user ID to use Large Pages:
# chuser capabilities=CAP_NUMA_ATTACH,CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
Export ORACLE_SGA_PGSZ=16M before starting Oracle (as the oracle user); see the sizing sketch below.
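A hedged sizing sketch for lgpg_regions (the 12 GB SGA is an illustrative value; size the 16 MB pool to cover your own SGA):
lgpg_regions = SGA size / 16 MB = (12 x 1024 MB) / 16 MB = 768
# vmo -r -o lgpg_regions=768 -o lgpg_size=16777216 (the -r change is applied at the next boot)
# vmo -p -o v_pinshm=1
# svmon -G (after restarting Oracle, check the 16 MB page-size figures to confirm the large pages are actually used)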
Presentation Notes:
Use of Terabyte segments avoids misses in the SLB (Segment Lookaside Buffer). On Power7 there are 12 SLB entries free for applications, so with 256 MB segments SLB misses appear above roughly 3 GB of memory allocation (per process?). AIX supports application segments that are a terabyte (TB) in size starting with AIX 6.1 TL06 / AIX 7.1; prior to that only 256 MB segments were supported. With TB segment support we nearly eliminate a huge amount of SLB misses (and therefore SLB reload overhead in the kernel). Applications using a large amount of shared memory (it was 270 GB, leading to over 1300 256 MB segments, in a customer benchmark) will incur many SLB hardware faults as the referenced data is scattered across all of these segments. This problem is alleviated with TB segment support.
• Objective: tune the VMM to protect computational pages (programs, SGA, PGA) from being paged out and force the LRUD to steal pages from the FS cache only.
[Diagram: memory layout (kernel, programs, SGA+PGA, FS cache) and paging space, in three stages: 1 - system startup, 2 - DB activity, 3 - paging activity => performance degradation.]
1 - AIX is started and applications load some computational pages into memory. As a UNIX system, AIX will try to take advantage of the free memory by using it as a file cache to reduce the I/O on the physical drives.
2 - The activity increases and the DB needs more memory, but there are no free pages available. LRUD (the AIX page stealer) starts to free pages in memory.
3 - With the default setting, LRUD will page out some computational pages instead of removing only pages from the File System Cache.
Memory : VMM Tuning for Filesystems (lru_file_repage)
This VMM tuning tip is applicable to AIX 5.2 ML4+, AIX 5.3 ML1+ and AIX 6.1+.
– lru_file_repage:
• This parameter prevents computational pages from being paged out.
• By setting lru_file_repage=0 (1 is the default value) you are telling the VMM that your preference is to steal pages from the FS cache only.
– "New" vmo tuning approach: recommendations based on VMM developer experience. Use "vmo -p -o" to set the following parameters:
lru_file_repage = 0
page_steal_method = 1 (reduces the scan:free ratio)
minfree = (maximum of either 960 or 120 x # lcpus*) / # memory pools**
maxfree = minfree + (j2_maxPageReadAhead*** x # lcpus*) / # memory pools**
* to see # lcpus, use "bindprocessor -q"
** for # memory pools, use "vmstat -v"
*** to see j2_maxPageReadAhead, use "ioo -L j2_maxPageReadAhead"
Increase minfree and recalculate maxfree if "Free Frame Waits" increases over time (vmstat -s). See the worked example below.
minperm% = from 3 to 10
maxperm% = maxclient% = from 70 to 90
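A worked example of the formulas above, assuming illustrative values of 32 logical CPUs, 4 memory pools and j2_maxPageReadAhead=128 (check your own values with bindprocessor -q, vmstat -v and ioo -L j2_maxPageReadAhead):
minfree = max(960, 120 x 32) / 4 = 3840 / 4 = 960
maxfree = 960 + (128 x 32) / 4 = 960 + 1024 = 1984
# vmo -p -o minfree=960 -o maxfree=1984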
Memory : Use AIX dynamic LPAR and Oracle dynamic memory allocation (AMM) jointly
Scenario (initial configuration):
– memory_max_size = 18 GB
– Real memory allocated to the LPAR = 12 GB
– memory_target (SGA+PGA) = 8 GB
The Oracle tuning advisor indicates that SGA+PGA need to be increased to 11 GB: memory_target can be increased dynamically to 11 GB, but real memory is only 12 GB, so it needs to be increased as well.
– Step 1 - Increase physical memory allocated to the LPAR to 15 GB
– Step 2 - Increase SGA+PGA allocated to the instance of the database to 11GB
Memory allocated to the system has been increased dynamically using AIX DLPAR; memory allocated to Oracle (SGA and PGA) has been increased on the fly.
ssh hscroot@hmc "chhwres -r mem -m <system> -o a -p <LPAR name> -q 3072"
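The Oracle side (step 2) can then be done online; a sketch using the same sqlplus here-document style as the registration example later in this deck (credentials and value are illustrative, and memory_target can only grow up to the maximum size defined at instance startup):
# su - oracle -c "sqlplus / as sysdba" <<EOF
alter system set memory_target=11G scope=both;
EOF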
2. PP (Disk Physical Partition) striping (a.k.a. spreading)
– Create a Volume Group with an 8M, 16M or 32M PPsize (the PPsize will be the "stripe size").
• AIX 5.2: choose a Big VG and change the "t factor": # mkvg -B -t <factor> -s <PPsize> ...
• AIX 5.3+: choose a Scalable Volume Group: # mkvg -S -s <PPsize> ...
– Create the LV with the "Maximum range of physical volumes" option to spread the PPs over the different hdisks in a round-robin fashion: # mklv -e x ...
jfs2 filesystem creation advice: if you create a jfs2 filesystem on a striped (or PP-spread) LV, use the INLINE logging option. It avoids a "hot spot" by creating the log inside the filesystem (which is striped) instead of using a single PP stored on one hdisk.
# crfs -a logname=INLINE ...
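Putting the PP-spreading steps together on AIX 5.3+, a minimal sketch (VG/LV names, hdisks, PP size and LP count are illustrative):
# mkvg -S -s 16M -y oradatavg hdisk2 hdisk3 hdisk4 hdisk5 (scalable VG, 16M PPsize = "stripe size")
# mklv -e x -t jfs2 -y oradatalv oradatavg 512 (PPs spread round-robin across the hdisks)
# crfs -v jfs2 -d oradatalv -m /oradata -a logname=INLINE -a agblksize=4096
# mount /oradata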
Presentation Notes:
The same applies to ASM: 1 MB stripe size, which can be changed starting with 11g.
Check if the definition of your disk subsystem is present in the ODM.
If the description shown in the output of "lsdev -Cc disk" contains the word "Other", it means that AIX does not have a correct definition of your disk device in the ODM and uses a generic device definition.
# lsdev -Cc disk
In general, a generic device definition provides far from optimal performance since it does not properly customize the hdisk device; for example, hdisks are created with queue_depth=1 (see the check below).
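A quick check, as a sketch (hdisk number and output are illustrative and shortened):
# lsdev -Cc disk
hdisk2 Available 02-08-02 Other FC SCSI Disk Drive (generic ODM definition)
# lsattr -El hdisk2 -a queue_depth
queue_depth 1 Queue DEPTH True (left at the default of 1)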
1. Contact your vendor or go to their web site to download the correct ODM definition for your storage subsystem. It will set up the hdisk devices properly, according to your hardware, for optimal performance.
2. If AIX is connected to the storage subsystem through several Fibre Channel adapters for performance, don't forget to install a multipath device driver or path control module.
1. Each AIX hdisk has a queue, whose depth is set by the queue_depth attribute. This parameter sets the number of parallel I/O requests that can be sent to the physical disk.
2. To know whether you have to increase queue_depth, use nmon (DDD in interactive mode, -d in recording mode) and monitor: service time, wait time and servQ full.
3. If the service time is < 2-3 ms, the storage is behaving well (it can handle more load); if at the same time the wait time is > 1 ms, the disk queues are full and I/Os are waiting to be queued => INCREASE the hdisk queue depth (chdev -l hdiskXX -a queue_depth=YYY).
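A hedged monitoring sketch with iostat (hdisk and value are illustrative; with -D, look at the read/write avgserv times and the sqfull counter):
# iostat -D hdisk2 5
# chdev -l hdisk2 -a queue_depth=32 -P (-P stores the change in the ODM so it is applied when the disk is reconfigured or at reboot)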
1. Each FC HBA adapter has a queue, num_cmd_elems. This queue plays the same role for the HBA as queue_depth does for the disk.
2. Rule of thumb: num_cmd_elems = (sum of the queue_depths) / number of HBAs
3. Changing num_cmd_elems: chdev -l fcsX -a num_cmd_elems=YYY
You can also change max_xfer_size=0x200000 and lg_term_dma=0x800000 with the same command.
These changes use more memory and must be made with caution; check first with "fcstat fcsX":
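What to look for in the fcstat output, as a sketch (counter values are illustrative):
# fcstat fcs0 | grep -i -e "No DMA Resource" -e "No Adapter Elements" -e "No Command Resource"
No DMA Resource Count: 0 (a growing value => increase max_xfer_size / lg_term_dma)
No Adapter Elements Count: 104848
No Command Resource Count: 13300 (a growing value => increase num_cmd_elems)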
If Oracle data is stored in a filesystem, some mount options can improve performance:
Direct IO (DIO) – introduced in AIX 4.3.
• Data is transferred directly from the disk to the application buffer, bypassing the file buffer cache and hence avoiding double caching (filesystem cache + Oracle SGA). • Emulates a raw-device implementation.
To mount a filesystem in DIO:
$ mount -o dio /data
Concurrent IO (CIO) – introduced with jfs2 in AIX 5.2 ML1
• Implicit use of DIO. • No Inode locking : Multiple threads can perform reads and writes on the same file at the same time. • Performance achieved using CIO is comparable to raw-devices.
To mount a filesystem in CIO:
$ mount -o cio /data
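To make the CIO mount permanent, the option can also be put in the /etc/filesystems stanza; a sketch (mount point and LV name are illustrative):
/data:
        dev       = /dev/oradatalv
        vfs       = jfs2
        log       = INLINE
        mount     = true
        options   = cio,rw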
Bench throughput over run duration – higher tps indicates better performance.
Presentation Notes:
CIO: JFS2 uses a read-shared, write-exclusive inode lock (multiple readers can access the file simultaneously, but when write access is made, a lock in exclusive mode must be held), i.e. write serialization at the file level. Oracle, for example, implements its own data serialization, which ensures that data inconsistencies don't occur, so it doesn't need the filesystem to implement this serialization. For such applications AIX 5.2 offers CIO. With CIO, multiple threads can simultaneously perform reads and writes on a shared file.
Benefits:
1. Avoids double caching: some data is already cached in the application layer (SGA).
2. Gives faster access to the backend disks and reduces CPU utilization.
3. Disables the inode lock to allow several threads to read and write the same file (CIO only).
IO : Benefits of CIO for Oracle
Restrictions:
1. Because data transfers bypass the AIX buffer cache, jfs2 prefetching and write-behind cannot be used. These functionalities can be handled by Oracle.
⇒ (Oracle parameter) db_file_multiblock_read_count = 8, 16, 32, ... , 128 according to workload
2. When using DIO/CIO, I/O requests made by Oracle must be aligned with the jfs2 blocksize to avoid demoted I/O (a fallback to normal I/O after a Direct I/O failure).
=> When you create a JFS2, use the "mkfs -o agblksize=XXX" option to match the FS blocksize to the application needs.
Rule : IO request = n x agblksize
Examples: if the DB blocksize is 4k or larger, then jfs2 agblksize=4096.
Redo logs are always written in 512-byte blocks, so the jfs2 agblksize must be 512.
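Following that rule, data file and redo log filesystems might be created with different agblksize values, as a sketch (LV names and mount points are illustrative):
# crfs -v jfs2 -d oradatalv -m /oradata -a agblksize=4096 -a logname=INLINE (DB blocksize 4k or larger)
# crfs -v jfs2 -d oredolv -m /oraredo -a agblksize=512 -a logname=INLINE (512-byte redo log writes)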
Presentation Notes:
Do not allocate a JFS with the large file enabled (bf) attribute. The big file attribute increases the minimum DIO transfer size from 4K to 128K, forcing Oracle to read and write a minimum of 128K bytes to exploit DIO. The CIO mount option should only be used for filesystems containing data which is intended to be accessed in concurrent mode, such as Oracle datafiles, online redo logs and control files, and should not contain libraries and executables, such as the filesystem containing the $ORACLE_HOME. 512 = OS physical block size
• Allows multiple I/O requests to be sent without having to wait until the disk subsystem has completed the physical I/O.
• Using asynchronous I/O is strongly advised, whatever the type of filesystem and mount option implemented (JFS, JFS2, CIO, DIO).
IO : Asynchronous IO (AIO)
Posix vs Legacy
Since AIX 5L V5.3, two types of AIO are available: Legacy and POSIX. For the moment, the Oracle code uses the Legacy AIO servers.
[Diagram: the application places I/O requests in the AIO queue (aioQ); aioserver kprocs pick them up and issue them to disk.]
Presentation Notes:
Use legacy AIO. DBWR_IO_SLAVES is relevant only on systems with only one database writer process (DBW0). It specifies the number of I/O server processes used by the DBW0 process. The DBW0 process and its server processes always write to disk. By default, the value is 0 and I/O server processes are not used. If you set DBWR_IO_SLAVES to a nonzero value, the number of I/O server processes used by the ARCH and LGWR processes is set to 4. However, the number of I/O server processes used by Recovery Manager is set to 4 only if asynchronous I/O is disabled (either your platform does not support asynchronous I/O or disk_asynch_io is set to false).

Typically, I/O server processes are used to simulate asynchronous I/O on platforms that do not support asynchronous I/O or that implement it inefficiently. However, you can use I/O server processes even when asynchronous I/O is being used; in that case the I/O server processes will use asynchronous I/O. I/O server processes are also useful in database environments with very large I/O throughput, even if asynchronous I/O is enabled.

Here are some suggested rules of thumb for determining what value to set the maximum number of servers to:
1. Limit the maximum number of servers to ten times the number of disks that are to be used concurrently, but not more than 80. The minimum number of servers should be set to half of this maximum number.
2. Set the maximum number of servers to 80, leave the minimum number of servers set to the default of 1, and reboot. Monitor the number of additional servers started throughout the course of a normal workload. After a 24-hour period of normal activity, set the maximum number of servers to the number of currently running aioservers + 10, and set the minimum number of servers to the number of currently running aioservers - 10. In some environments you may see more than 80 aioserver kprocs running; if so, consider the third rule of thumb.
3. Take statistics using vmstat -s before any high I/O activity begins, and again at the end. Check the iodone field. From this you can determine how many physical I/Os are being handled in a given wall-clock period. Then increase the maximum number of servers and see if you can get more iodones in the same time period.
• minservers: number of aioserver kernel processes to start (system-wide on AIX 5L).
• maxservers: maximum number of aioservers that can be running per logical CPU.
Monitoring: in Oracle's alert.log file, if maxservers is set too low: "Warning: lio_listio returned EAGAIN" "Performance degradation may be seen".
The number of aioservers in use can be monitored via "ps -k | grep aio | wc -l", "iostat -A" or nmon (option A).
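How minservers/maxservers are actually set depends on the AIX level; a hedged sketch (the values are illustrative):
AIX 5.3: # chdev -l aio0 -a minservers=20 -a maxservers=30 (maxservers is per logical CPU)
AIX 6.1+: # ioo -p -o aio_minservers=20 -o aio_maxservers=30 (restricted tunables; the defaults are usually adequate)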
With fast_path, I/Os are queued directly from the application into the LVM layer without any aioserver kproc operation.
Better performance compared to non-fast_path.
No need to tune the min and max aioservers.
No aioserver kprocs => "ps -k | grep aio | wc -l" is no longer relevant; use "iostat -A" instead.
ASYNCH: enables asynchronous I/O on file system files (default)
DIRECTIO: enables direct I/O on file system files (disables AIO)
SETALL: enables both asynchronous and direct I/O on file system files
NONE: disables both asynchronous and direct I/O on file system files
Since version 10g, Oracle opens data files located on a JFS2 filesystem with the O_CIO option if the filesystemio_options initialization parameter is set to either directIO or setall.
Advice: set this parameter to 'ASYNCH' and let the system manage CIO via the mount option (see the CIO/DIO implementation advice); a sketch follows below.
Note: set the disk_asynch_io parameter to 'true' as well.
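A sketch of setting both Oracle parameters (sqlplus here-document style as elsewhere in this deck; both are static, so the instance must be restarted for them to take effect):
# su - oracle -c "sqlplus / as sysdba" <<EOF
alter system set filesystemio_options='ASYNCH' scope=spfile;
alter system set disk_asynch_io=true scope=spfile;
EOF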
• NUMA stands for Non uniform memory access. It is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
Oracle DB NUMA support was introduced in 1998 on the first NUMA systems. It provides a memory/process model relying on specific OS features to perform better on this kind of architecture.
Since 10.2.0.4 NUMA features are enabled by default on x86 hardware.
On AIX, the NUMA support code has been ported; the default is off.
Prepare the system for Oracle NUMA optimization. The test was done on a POWER7 machine with the following CPU and memory distribution (dedicated LPAR). It has 4 domains with 8 cores (32 logical CPUs) and >27 GB of memory each. If the lssrad output shows unevenly distributed domains, fix the problem before proceeding.
• Listing SRADs (Affinity Domains):
# lssrad -va
REF1  SRAD       MEM      CPU
0
         0   27932.94    0-31
         1   31285.00    32-63
1
         2   29701.00    64-95
         3   29701.00    96-127
• We will set up 4 rsets, namely SA1/0, SA1/1, SA1/2, and SA1/3, one for each domain.
# mkrset -c 0-31 -m 0 SA1/0
# mkrset -c 32-63 -m 0 SA1/1
# mkrset -c 64-95 -m 0 SA1/2
# mkrset -c 96-127 -m 0 SA1/3
• Required Oracle user capabilities:
# lsuser -a capabilities oracle
oracle capabilities=CAP_NUMA_ATTACH,CAP_BYPASS_RAC_VMM,CAP_PROPAGATE
• Once started, Oracle sets the following parameters:
_NUMA_instance_mapping       Not specified
_NUMA_pool_size              Not specified
_db_block_numa               4
_enable_NUMA_interleave      TRUE
_enable_NUMA_optimization    FALSE
_enable_NUMA_support         TRUE
_numa_buffer_cache_stats     0
_numa_trace_level            0
_rm_numa_sched_enable        TRUE
_rm_numa_simulation_cpus     0
_rm_numa_simulation_pgs      0
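For reference, hidden parameters such as _enable_NUMA_support are normally set in the spfile before restarting the instance; a hedged sketch (the double quotes are required for underscore parameters):
# su - oracle -c "sqlplus / as sysdba" <<EOF
alter system set "_enable_NUMA_support"=true scope=spfile;
EOF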
• The following messages are found in the alert log. It finds the 4 rsets and treats them as NUMA domains.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NUMA system found and support enabled (4 domains - 32,32,32,32)
Starting up Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
• Take a look at the Oracle processes. The Oracle background processes that have multiple instantiations are automatically attached to the rsets in a round-robin fashion.
• All other Oracle background processes are also attached to an rset, mostly SA1/0. lgwr and psp0 are attached to SA1/1.
• If Oracle shadow processes are allowed to migrate across domains, the benefit of NUMA-enabling Oracle will be lost. Therefore, arrangements need to be made to affinitize the user connections.
• For network connections, multiple listeners can be arranged with each listener affinitized to a different domain. The Oracle shadow processes are children of the individual listeners and inherit the affinity from the listener.
• For local connections, the client process can be affinitized to the desired domain/rset. These connections do not go through any listener, and the shadows are children of the individual clients and inherit the affinity from the client. Example below using sqlplus as local client:
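One possible way to do this, as a sketch (rset names come from the setup above; the connect string is illustrative; execrset attaches the command, and the shadow process it spawns, to the given rset):
$ execrset SA1/0 -e sqlplus appuser/password (run as the oracle user, which has CAP_NUMA_ATTACH)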
• Database registration: the init.ora service_names parameter should be set to:
*.service_names = SA
# su - oracle -c "sqlplus sys/ibm4tem3 as sysdba" <<EOF
alter system register;
EOF
• It is apparent that the biggest boost NUMA optimization achieves is when user connections are affinitized, and the workload does not share data across domains (cases 4 and 5).
• Cases 2 and 3 have user connection affinity, but the workload shares data across domains. In this sample test, it does not perform any better than without connection affinity (case 1).
• The meaning of the _enable_NUMA_interleave parameter is not documented in detail. One would assume that it enables/disables some kind of memory interleaving. But consider test cases 2 and 3: the throughput is the same regardless of whether it is enabled, probably because these test cases have low memory locality anyway. In contrast, test case 5 gets a decent boost over case 4 by disabling interleaving. This suggests that if the workload already has high memory locality, disabling _enable_NUMA_interleave can further increase memory locality.
• Very promising feature for CPU-bound workloads
– Speed-up of up to 45% on an ISV benchmark case (p780 & 11gR2)
– Applicable on Power 740 to 795 LPARs with more than 8 cores
• Further benchmark tests in progress
– NUMA_optimization parameter
– Size of the Oracle rsets: P7 chip vs P7 node
– Compatibility with SPLPARs, AME, ...