Achieving Scalability, Performance and Availability on Linux with Oracle 9iR2-RAC
Grant McAlister, Senior Database Engineer, Amazon.com
Paper 32110
Agenda
Why Oracle on Linux and RAC
The Tests
Scaling
Performance
Availability
Choice of Interconnect
Conclusion
Why Linux
Lower Total Cost of Ownership
Near commodity hardware and support
Multiple O/S and hardware vendors
Common platform (IA-32) for entire enterprise
Unix look and feel
New enterprise kernel
No database conversions when changing Linux hardware or O/S
Why RAC on Linux

Cost
Ability to use near commodity systems (2-4 processors)
Lower level of support needed on system units
The need for availability
Young and rapidly evolving O/S
Near commodity hardware and support
The need to scale the database beyond 8 processors
The need for large amounts of memory (> 32 GBytes)
The Tests
Real life workloads
Not modified or partitioned to support RAC
Used automatic space management

Workload #1: a simple workload of small queries with little locking.
Workload #2: a typical nasty workload with many inserts, updates and select for updates, causing a lot of locking and blocking.
Workload #1 Single Instance Profile

Load Profile                  Per Second   Per Transaction
--------------------------  ------------  ---------------
Redo size:                     77,516.28         1,460.05
Logical reads:                  4,134.57            77.88
Block changes:                    462.54             8.71
Physical reads:                   155.70             2.93
Physical writes:                   27.14             0.51
User calls:                    11,012.73           207.43
Parses:                           432.50             8.15
Sorts:                            187.32             3.53
Executes:                         432.89             8.15
Transactions:                      53.09

% Blocks changed per Read: 11.19    Recursive Call %: 0.68
Rollback per transaction %: 0.82    Rows per Sort: 353.26
Top 5 Wait Events on a Single Instance

                                                   Total Wait  Avg wait  Waits
Event                        Waits     Timeouts      Time (s)      (ms)   /txn
-------------------------  ---------  ----------  -----------  --------  -----
db file sequential read      560,060           0        1,249         2    2.9
log file sync                180,813         494          676         4    0.9
log file parallel write      188,017     181,946          143         1    1.0
latch free                    87,584       6,309          141         2    0.5
db file parallel write         5,794       2,895           14         2    0.0
Workload #2 Single Instance Profile

Load Profile                  Per Second   Per Transaction
--------------------------  ------------  ---------------
Redo size:                    244,988.60         5,306.31
Logical reads:                 14,562.36           315.41
Block changes:                  1,802.47            39.04
Physical reads:                   319.45             6.92
Physical writes:                   91.52             1.98
User calls:                     2,877.06            62.32
Parses:                           457.06             9.90
Sorts:                            290.13             6.28
Executes:                         456.73             9.89
Transactions:                      46.17

% Blocks changed per Read: 12.38    Recursive Call %: 4.16
Rollback per transaction %: 0.96    Rows per Sort: 13.09

Top 5 Wait Events on a Single Instance

                                                   Total Wait  Avg wait  Waits
Event                        Waits     Timeouts      Time (s)      (ms)   /txn
-------------------------  ---------  ----------  -----------  --------  -----
db file sequential read      346,048           0        1,412         4    1.7
enqueue                          177         119          369      2087    0.0
free buffer waits                752          32          348       463    0.0
db file scattered read       141,564           0          325         2    0.7
log file sync                207,109          37          306         1    1.0
The Hardware and Software
Software
Oracle 9.2.0.1
Red Hat Advanced Server 2.1 (2.4.9-e.3)

Hardware
3 types of clusters, each with 4 nodes:
2 Pentium III Xeon processors @ 1.126GHz & 5 GBytes of RAM
2 Pentium 4 Xeon DP processors @ 2.4GHz & 4 GBytes of RAM
4 Pentium 4 Xeon MP processors @ 1.6GHz & 10 GBytes of RAM

Database files were on raw partitions
Scaling
The ability to produce higher transactional volumes when adding additional processors or additional nodes.
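Scaling can be expressed as per-node efficiency: relative transactional volume divided by node count. A minimal sketch, using the workload #1 volumes reported later in this paper (the helper function itself is illustrative, not part of the test harness):

```python
# Per-node scaling efficiency from relative transactional volumes.
# volumes[0] is the single-instance baseline (1.0).

def scaling_efficiency(volumes):
    """Return speedup divided by node count for each cluster size."""
    return [v / n for n, v in enumerate(volumes, start=1)]

workload1 = [1.0, 1.9, 2.6, 3.6]  # single instance through four nodes
for nodes, eff in enumerate(scaling_efficiency(workload1), start=1):
    print(f"{nodes} node(s): {eff:.0%} efficiency")
```

At four nodes this works out to 3.6 / 4 = 90%, the figure quoted in the conclusions.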
Scaling of Workload #1

[Chart: Relative transactional volume on 2-processor 1.126GHz nodes — Single Instance: 1.0, Two Nodes: 1.9, Three Nodes: 2.6, Four Nodes: 3.6]
Scaling of Workload #2

[Chart: Relative transactional volume for Single Instance through Four Nodes — 2 Proc @ 2.4GHz: 1.0, 1.6, 2.1, 2.7; 4 Proc @ 1.6GHz: 1.8, 3.0, 3.9, 4.9]
Some workloads scale better

[Chart: Relative transactional volume for Single Instance through Four Nodes — Workload #1: 1.0, 1.9, 2.6, 3.6; Workload #2: 1.0, 1.6, 2.1, 2.7]
Some of the differences

Top 5 Workload #1 Timed Events

Event                         Waits     Time (s)  % Total Elapsed Time
-------------------------  ----------  ---------  --------------------
CPU time                                   2,386                 33.15
global cache null to x         62,646      2,067                 28.71
db file sequential read       391,474      1,063                 14.76
buffer busy global cache       15,125        560                  7.78
log file sync                 158,560        347                  4.82

Top 5 Workload #2 Timed Events

Event                         Waits     Time (s)  % Total Elapsed Time
-------------------------  ----------  ---------  --------------------
global cache cr request     1,324,756     19,080                 27.28
buffer busy global cache       53,411     11,531                 16.49
enqueue                        38,795     11,084                 15.85
global cache null to x         88,908      6,449                  9.22
CPU time                                   5,085                  7.27
Performance
The time taken to perform a query is important
Execution time influences transactional volume
Can cause dramatic changes in the end user response time
Stock Exchange
Internet Retailer
Bank
Only you know what is reasonable for your database and application
Execution Times for Workload #1
2 Processors @ 2.4GHz

Percent increase in execution time vs. single instance:

               Update   Select   Select for Update   Insert
Two Nodes        44%      48%           59%           109%
Three Nodes      52%      49%           58%           193%
Four Nodes       53%      48%           61%           312%
Execution Times for Workload #2
4 Processors @ 1.6GHz

[Chart: Percent increase in execution time vs. single instance for Update, Select, Select for Update and Insert at Two/Three/Four Nodes; plotted values: 17%, 34%, 38%, 59%, 75%, 84%, 94%, 134%, 138%, 147%, 233%, 316%]
Some ways to improve

Make sure your database is well tuned for single instance operation
Consider using different block sizes for hot indexes
Hash partition hot tables and indexes
Partition the workload
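The block-size and hash-partitioning tips above can be sketched in SQL. All object, tablespace and file names here are hypothetical, and the partition count and sizes are illustrative only:

```sql
-- A nonstandard block size for a hot index. Requires a matching buffer
-- cache (e.g. db_16k_cache_size) to be configured on every instance.
CREATE TABLESPACE idx_16k
  DATAFILE '/dev/raw/raw42' SIZE 500M   -- hypothetical raw partition
  BLOCKSIZE 16K;

-- Hash partitioning a hot table and a local index to spread block
-- contention across partitions (and so across RAC instances).
CREATE TABLE orders (
  order_id  NUMBER,
  customer  NUMBER,
  status    VARCHAR2(10)
)
PARTITION BY HASH (order_id) PARTITIONS 8;

CREATE INDEX orders_id_ix ON orders (order_id)
  TABLESPACE idx_16k LOCAL;
```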
Availability
Minimize failures by building clusters with as few single points of failure as possible.
Setup your RAC cluster to recover from node and instance failure as quickly as possible.
Redundant RAC Configuration
Instance recovery time (seconds)

                       MTTR Target=120   MTTR Target=240   MTTR Target Not Set
Cluster Reconfigured          2                 2                   2
Recovery Started              9                10                  12
Redo Log First Pass           1                 1                  13
Redo Log Second Pass         23                56                 329
Total Time                   35                69                 356
fast_start_mttr_target is the key
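A minimal sketch of setting the parameter, using the 120-second value from the timings above (treat the value as a starting point for your own testing, not a recommendation):

```sql
-- Target instance-recovery time in seconds; Oracle adjusts checkpoint
-- activity so crash recovery should finish within roughly this bound.
ALTER SYSTEM SET fast_start_mttr_target = 120;
```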
Node failure recovery time
Recovery Time = Failure detection + Instance recovery
Failure detection = MissCount * 1 second
The MissCount parameter is found in cmcfg.ora

When MissCount = 20 and fast_start_mttr_target = 120, all workload #2 processing resumed in less than 1 minute after crashing a node.
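The formula above can be checked with a quick sketch, plugging in MissCount = 20 and the 35-second instance recovery measured for MTTR Target=120:

```python
# Rough node-failure recovery estimate, per the formula in this paper:
#   recovery time = failure detection + instance recovery
# where failure detection = MissCount * 1 second (MissCount from cmcfg.ora).

def node_recovery_seconds(miss_count, instance_recovery):
    """Estimated seconds until surviving nodes resume full processing."""
    detection = miss_count * 1  # heartbeat misses, one second apart
    return detection + instance_recovery

total = node_recovery_seconds(20, 35)
print(total)  # 55 seconds -> under one minute, as observed for workload #2
```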
Impact of a single node failure

[Chart: Transactions per second (0-3000) over time (-20 to 85 seconds), annotated with: Node failed, CM ejects node, Cluster Reconfigured, Recovery Complete]
Choice of Interconnect
1000Mbit (Gigabit) Ethernet
Latency ~ 0.07 ms
Transfer rate: 30+ MBytes per second
More expensive, but becoming common with the advent of gigabit over copper

100Mbit Ethernet
Latency ~ 0.20 ms
Transfer rate: 10 MBytes per second
Common and inexpensive
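The latency and throughput figures above can be combined into a rough per-block transfer time. A sketch, assuming an 8KB database block (the block size is my assumption, not stated in the paper):

```python
# One-way latency plus serialization time for shipping a single
# database block across the interconnect.

def block_transfer_ms(latency_ms, mbytes_per_sec, block_bytes=8192):
    """Rough time (ms) to move one block, ignoring protocol overhead."""
    serialize_ms = block_bytes / (mbytes_per_sec * 1024 * 1024) * 1000
    return latency_ms + serialize_ms

gigabit = block_transfer_ms(0.07, 30)        # ~0.33 ms
fast_ethernet = block_transfer_ms(0.20, 10)  # ~0.98 ms
```

On these assumptions Gigabit moves a block roughly three times faster, which is consistent with preferring it for the interconnect.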
100Mbit vs. Gigabit

Oracle Interconnect Latency (ms)

                                          100Mbit   Gigabit
Average receive time for CR block            15.5      10.5
Average receive time for current block      144.3     108.4
Conclusions

RAC scaled at 90% on a simple workload
RAC scaled consistently at 55+% on a complex workload
There is an impact to query performance depending on your workload
You can recover from failures in less than 1 minute
When configured correctly, a RAC cluster can scale, perform and be highly available.
Q U E S T I O N S  &  A N S W E R S