Technical Report
MetroCluster in Clustered Data ONTAP 8.3 Verification Tests Using Oracle Workloads Business Workloads Group, PSE, NetApp
April 2015 | TR-4396
Abstract
This document describes the results of functional testing of NetApp® MetroCluster™ software
on the NetApp clustered Data ONTAP® 8.3 operating system in an Oracle Database 11g R2
environment. Proper operation is verified as well as expected behavior during each of the test
cases. Specific equipment, software, and functional failover tests are included along with
results.
TABLE OF CONTENTS
1.1 Best Practices
5 Value Proposition
8 Test Case Overview and Methodology
9 Test Results
9.1 Loss of Single Oracle Node (TC-01)
9.2 Loss of Oracle Host HBA (TC-02)
9.3 Loss of Individual Disk (TC-03)
9.4 Loss of Disk Shelf (TC-04)
9.5 Loss of NetApp Storage Controller (TC-05)
9.6 Loss of Back-End Fibre Channel Switch (TC-06)
9.7 Loss of Interswitch Link (TC-07)
9.8 Maintenance Requiring Planned Switchover from Site A to Site B (TC-08)
9.9 Disaster Forcing Unplanned Manual Switchover from Site A to Site B (TC-09)
Detailed Test Cases
Data Layout
Materials List

LIST OF TABLES
Table 1) Test case summary.
Table 9) FC back-end switches.
Table 10) Materials list for testing.

LIST OF FIGURES
Figure 3) Test environment.
Figure 4) Data layout.
Figure 5) Test phases.
Figure 6) Loss of Oracle node.
Figure 7) Loss of an Oracle server host HBA.
Figure 8) Loss of an individual disk.
Figure 9) Loss of disk shelf.
Figure 10) Loss of NetApp storage controller.
Figure 11) Loss of an FC switch.
Figure 12) Loss of ISL.
Figure 13) Loss of primary site for planned maintenance.
Figure 14) Loss of primary site.
Figure 15) Aggregate and volume layouts and sizes.
Figure 16) Volume and LUN layouts for site A.
This section describes the NetApp hardware and software used in the solution.
FAS8000 Series Storage Systems
NetApp FAS8000 series storage systems combine a unified scale-out architecture with leading
data-management capabilities. They are designed to adapt quickly to changing business needs while
delivering core IT requirements for uptime, scalability, and cost efficiency. These systems offer the
following advantages:
Speed the completion of business operations. Leveraging a new high-performance, multicore architecture and self-managing flash acceleration, FAS8000 unified scale-out systems boost throughput and decrease latency to deliver consistent application performance across a broad range of SAN and NAS workloads.
Streamline IT operations. Simplified management and proven integration with cloud providers let you deploy the FAS8000 in your data center and in a hybrid cloud with confidence. Nondisruptive operations simplify long-term scaling and improve uptime by facilitating hardware repair, tech refreshes, and other updates without planned downtime.
Deliver superior total cost of ownership. Proven storage efficiency and a two-fold improvement in price/performance ratio over the previous generation reduce capacity utilization and improve long-term return on investment. NetApp FlexArray™ storage virtualization software lets you integrate existing arrays with the FAS8000, increasing consolidation and providing even greater value to your business.
Clustered Data ONTAP Operating System
NetApp clustered Data ONTAP 8.3 software delivers a unified storage platform that enables unrestricted,
secure data movement across multiple cloud environments and paves the way for software-defined data
centers, offering advanced performance, availability, and efficiency. Data ONTAP clustering capabilities
help you keep your business running nonstop.
Clustered Data ONTAP is an industry-leading storage operating system. Its single feature-rich platform
allows you to scale infrastructure without increasing IT staff. Clustered Data ONTAP provides the
following benefits:
Nondisruptive operations:
Perform storage maintenance, hardware lifecycle operations, and software upgrades without interrupting your business.
Eliminate planned and unplanned downtime.
Proven efficiency:
Reduce storage costs by using one of the most comprehensive storage efficiency offerings in the industry.
Consolidate and share the same infrastructure for workloads or tenants with different performance, capacity, and security requirements.
Seamless scalability:
Scale capacity, performance, and operations without compromise, regardless of application.
Scale SAN and NAS from terabytes to tens of petabytes without reconfiguring running applications.
MetroCluster Solution
A self-contained solution, NetApp MetroCluster high-availability (HA) and DR software lets you achieve
continuous data availability for mission-critical applications at half the cost and complexity.
A business continuity plan should cover aspects such as key personnel, facilities, crisis
communication, and reputation protection, and it should refer to the disaster recovery plan (DRP) for
IT-related infrastructure recovery or continuity.
Generically, a disaster can be classified as either logical or physical. Both categories are addressed with
HA, recovery processing, and/or DR processes.
4.1 Logical Disasters
Logical disasters include, but are not limited to, data corruption by users or technical infrastructure.
Technical infrastructure disasters can result from file system corruption, kernel panics, or even system
viruses introduced by end users or system administrators.
4.2 Physical Disasters
Physical disasters include the failure of any storage component on site A or site B that exceeds the
resiliency features of an HA pair of NetApp controllers not based on MetroCluster and that would normally
result in downtime or data loss.
In certain cases, mission-critical applications should not be stopped even in a disaster. By leveraging
Oracle RAC extended-distance clusters and NetApp storage technology, it is possible to address those
failure scenarios and provide a robust deployment for critical database environments and applications.
5 Value Proposition
Typically, mission-critical applications must be implemented with two requirements:
RPO = 0 (recovery point objective equal to zero), meaning that data loss from any type of failure is unacceptable
RTO ~= 0 (recovery time objective as close to zero as possible), meaning that the time to recover from a disaster scenario should be as close to 0 minutes as possible
The combination of Oracle RAC on extended-distance clusters with NetApp MetroCluster technology
meets these requirements by addressing the following common failures:
Any kind of Oracle Database instance crash
Switch failure
Multipathing failure
Storage controller failure
Storage or rack failure
Network failure
Local data center failure
Complete site failure
6 High-Availability Options
Multiple options exist for spanning sites with an Oracle RAC cluster. The best option depends on the
available network connectivity, the number of sites, and customer business needs. NetApp Professional
Services can offer assistance with configuration planning and, when necessary, can offer Oracle
consulting services as well.
6.1 ASM Mirroring
Automatic Storage Management (ASM) mirroring, also called ASM normal redundancy, is a frequent
choice when only a very small number of databases must be replicated. In this configuration, the Oracle
RAC nodes span sites and leverage ASM to replicate data. Storage mirroring is not required, but
scalability is limited: as the number of databases increases, the administrative burden of maintaining
many mirrored ASM disk groups becomes excessive. In these cases, customers generally prefer to mirror
data at the storage layer.
This approach can be configured with and without a tiebreaker to control the Oracle RAC cluster quorum.
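To make the mirroring mechanics concrete, the following minimal sketch shows how an ASM normal-redundancy disk group could be defined with one failure group per site so that ASM keeps a mirror copy of every extent at each site. The disk group name and device paths are illustrative assumptions, not the configuration used in these tests.

sqlplus / as sysasm <<'EOF'
-- Illustrative only: disk group name and device paths are assumptions.
-- With one failure group per site, ASM mirrors each extent across sites.
CREATE DISKGROUP DATA_MIRR NORMAL REDUNDANCY
  FAILGROUP site_a DISK '/dev/mapper/asm_sitea_01', '/dev/mapper/asm_sitea_02'
  FAILGROUP site_b DISK '/dev/mapper/asm_siteb_01', '/dev/mapper/asm_siteb_02';
EOF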
6.2 Two-Site Storage Mirroring
The configuration chosen for these tests was two-site storage mirroring because it reflects the most
common use of site-spanning Oracle RAC with MetroCluster.
As described in detail in the following section, this option establishes one of the sites as a designated
primary site and the other as the designated secondary site. This is done by first selecting one site to host
the active storage site and then configuring two Oracle Cluster Ready Services (CRSs) and voting
resources on it. The other site is a synchronous but passive replica. It does not directly serve data. It also
contains only one CRS and voting resource.
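As a hedged illustration, the standard Oracle Clusterware commands below can be used to confirm where the voting files reside and to view overall cluster resource state; the output format varies by Clusterware version.

crsctl query css votedisk   # lists the voting files and their locations
crsctl stat res -t          # tabular view of cluster resource state per node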
7 High-Level Topology
Figure 3 shows the architecture of the configuration used for our validation testing. These tests used a
two-node Oracle RAC database environment with a RAC node deployed at both site A and site B with the
following specifics:
The sites were separated by a 20km distance, and fiber spools were used for both the MetroCluster and the RAC nodes.
The RAC configuration used the FC protocol and ASM to provide access to the database.
A WAN emulator was used to simulate a 20km distance between the RAC nodes for the private interconnect and to introduce approximately 10ms of latency into the configuration.
To increase the load on the test environment, we made sure that the Oracle RAC node that was installed
in site B participated in the load generation by driving IOPS to the FAS8060 storage controllers on site A
across the network.
Table 1) Test case summary.

Test Case  Description
TC-01      Loss of a single Oracle node
TC-02      Loss of an Oracle host HBA
TC-03      Loss of an individual disk in an active data aggregate
TC-04      Loss of an entire disk shelf
TC-05      Loss of a NetApp storage controller
TC-06      Loss of a back-end FC switch on the MetroCluster cluster
TC-07      Loss of an ISL
TC-08      Sitewide maintenance requiring a planned switchover from site A to site B
TC-09      Sitewide disaster requiring an unplanned manual switchover from site A to site B
For more information about how we conducted each of these tests, see the appendix of this document.
Each test was broken into the following three phases:
1. A baseline stage, indicative of normal operations. A typical duration for this stage was 15 minutes.
2. A fault stage, during which the specific fault under test was injected and allowed to continue in this stage for 15 minutes to provide sufficient time to verify correct database behavior.
3. A recovery stage, in which the fault was corrected and database behavior was verified. When applicable, this stage generally included 30 additional minutes of run time after the fault was corrected.
Figure 5 shows the process. Before each stage of a specific test, we used the automatic workload
repository (AWR) functionality of the Oracle database to create a snapshot of the current condition of the
database. After the test was complete, we captured the data between the snapshots to understand the
impact of the specific fault on database performance and behavior. Finally, we monitored the CPU, IOPS,
and disk utilization on the storage controllers throughout the tests.
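AWR snapshots can be taken on demand with the standard DBMS_WORKLOAD_REPOSITORY package. The following is a minimal sketch of the snapshot bracketing described above, not the exact scripts used in these tests.

sqlplus / as sysdba <<'EOF'
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;  -- snapshot before the fault
EOF

# ... inject the fault under test and wait out the fault window ...

sqlplus / as sysdba <<'EOF'
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;  -- snapshot closing the window
EOF

# The standard AWR report script (@?/rdbms/admin/awrrpt.sql) then produces a
# report covering the interval between any two snapshot IDs.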
As expected, during this test we observed no impact on Oracle RAC functionality and minimal impact
on overall database performance while both RAC nodes continued to drive load to the
FAS8060 controllers on site A. The database remained operational during the 15 minutes we allowed the
test to continue in the failed state.
With the use of SyncMirror in MetroCluster, shelf failure at either site is transparent. There are two plexes,
one at each site. In normal operation, all reads are fulfilled from the local plex, and all writes are
synchronously updated on both plexes. If one plex fails, reads continue seamlessly on the remaining plex,
and writes are directed to the remaining plex. If the hardware can be powered on for recovery, the
resynchronization of the recovered plex is automatic. If the failed shelf must be replaced, the new disks
are added to the mirrored plex. Afterward, resynchronization again becomes automatic.
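Plex states can be checked from the clustered Data ONTAP CLI. In the hedged sketch below, the aggregate name is an assumption; during recovery, the same command shows the resynchronization of the rebuilt plex.

storage aggregate plex show -aggregate aggr_data_sitea
# Both plexes of a mirrored aggregate should report as online and active.
# After shelf recovery, the rebuilt plex shows as resyncing until complete.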
9.5 Loss of NetApp Storage Controller (TC-05)
This test resulted in the unplanned loss of an entire storage controller. As Figure 10 shows, this was
accomplished by powering off one of the FAS8060 storage controllers at site A. The surviving storage
controller automatically took over the workload that was initially shared evenly across both storage
controllers.
Note: The storage controller takeover and giveback process used for this test differs from the MetroCluster switchover and switchback process used in test cases TC-08 and TC-09.
For this test, we ran the workload for a total of 60 minutes and allowed the controller to remain powered
off for 15 minutes before reapplying power to correct the fault and performing a storage controller
giveback to bring both FAS8060 controllers back online at site A.
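In this test, the takeover was triggered by powering off the controller rather than by command. For reference, the following hedged sketch shows the command-driven equivalent from the clustered Data ONTAP CLI; the node name is an assumption.

storage failover show                            # verify HA partner state
storage failover takeover -ofnode sitea-node01   # planned takeover equivalent
storage failover giveback -ofnode sitea-node01   # giveback after power returns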
9.8 Maintenance Requiring Planned Switchover from Site A to Site B (TC-08)
Figure 13) Loss of primary site for planned maintenance.
For this test, we ran the workload for a total of 60 minutes. After 15 minutes, we initiated the MetroCluster
switchover command from the FAS8060 controllers on site B. After the switchover was successfully
completed, we observed that the workload was picked up by the FAS8060 controllers at site B and that
both Oracle RAC nodes continued to operate normally without interruption.
Note: The switchover was accomplished by using a single command to switch over the entire storage resource from site A to site B while preserving the configuration and identity of the LUNs. The result was that no action, rediscovery, remapping, or reconfiguration was required from the perspective of the Oracle RAC database.
We allowed the test to continue in the switched-over state for another 15 minutes and then initiated the
MetroCluster switchback process to restore site A as the primary site for the Oracle RAC database. After
successfully completing the MetroCluster switchback process, we observed the FAS8060 in site A
resuming the processing of the workload from both RAC nodes, and the database operation continued
without interruption.
During this test, we observed no problems with the operation of the Oracle RAC database.
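For reference, the following hedged sketch outlines the clustered Data ONTAP 8.3 command sequence for a negotiated switchover and switchback, issued from the site B cluster; it summarizes the documented workflow rather than transcribing our test session.

metrocluster switchover                    # negotiated switchover to site B
metrocluster operation show                # confirm the operation completed

# After maintenance on site A is finished:
metrocluster heal -phase aggregates        # resynchronize the data aggregates
metrocluster heal -phase root-aggregates   # heal the root aggregates
metrocluster switchback                    # return operations to site A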
9.9 Disaster Forcing Unplanned Manual Switchover from Site A to Site B (TC-09)
This test resulted in the unexpected complete loss of site A because of an unspecified disaster. As Figure
14 shows, this loss was accomplished by powering off both of the FAS8060 storage controllers and the
Oracle RAC node located at site A.
Our expectation was that the second Oracle RAC node running on site B would lose access to the
database LUNs hosted on the FAS8060 controllers on site A and shut down. After officially declaring the
loss of site A, we manually initiated a MetroCluster switchover from site A to site B and restarted the
database on the surviving Oracle RAC node at site B.

10 Conclusion
NetApp MetroCluster software in clustered Data ONTAP 8.3 provides native continuous availability for
business-critical applications, including Oracle. Our tests demonstrated that even under heavy
transactional workloads Oracle databases continue to function normally during a wide variety of failure
scenarios that could potentially cause downtime and data loss.
In addition, clustered Data ONTAP provides the following benefits:
Nondisruptive operations leading to zero data loss
Set-it-once simplicity
Zero change management
Lower cost and complexity than competitive solutions
Seamless integration with storage efficiency, SnapMirror®, nondisruptive operations, and virtualized
storage
Unified support for both SAN and NAS
Together, these products create a winning combination for continuous data availability.
Appendix
This appendix provides detailed information about the test cases described in this document as well as
about deployment, the network, the data layout, and the list of materials used.
Detailed Test Cases
TC-01: Loss of Single Oracle Node
Test Case Details
Test case number TC-01
Test case description No single point of failure should exist in the solution. Therefore, the loss of one of the Oracle servers in the cluster was tested. This test was accomplished by halting a host in the cluster while running a test workload.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results The loss of an Oracle RAC node causes no interruption of Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at a lower rate because of the loss of one of the RAC nodes. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Halt one of the Oracle RAC servers and allow the test to continue for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Bring the halted server back online and verify that it is placed back into the RAC environment.
7. Allow the test to continue for the remainder of the 60-minute duration.
8. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault.
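As a hedged illustration of how such a workload is typically driven, the SLOB 2 parameters below yield an approximately 90% read/10% write random workload; the session count and run time shown are assumptions, not necessarily the exact values used in these tests.

# slob.conf excerpt (illustrative values):
#   UPDATE_PCT=10    # ~10% of operations are updates, the rest are reads
#   RUN_TIME=3600    # 60-minute run
./runit.sh 32        # start 32 SLOB sessions; SLOB takes the AWR snapshots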
TC-02: Loss of Oracle Host HBA
Test Case Details
Test number TC-02
Test case description No single point of failure should exist in the solution. Therefore, the loss of one of the FC HBA connections on an Oracle server in the cluster was tested. This test was accomplished by removing the cable from the host HBA while running a test workload.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results Removal of the HBA connection from the Oracle RAC node causes no interruption of Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at prefailure levels. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Remove the cable from the FC HBA on the Oracle RAC server on site B and allow the test to continue for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Reinstall the cable.
7. Allow the test to continue for the remainder of the 60-minute duration.
8. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
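Path failover on the RAC host can be verified with standard tools. A hedged sketch follows, assuming a Linux host running DM-Multipath with the NetApp Host Utilities installed; device names vary by installation.

multipath -ll        # pulled paths show as failed/faulty while I/O continues
                     # on the surviving paths
sanlun lun show -p   # Host Utilities view of LUN paths per controller port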
TC-03: Loss of Individual Disk
Test Case Details
Test number TC-03
Test case description No single point of failure should exist in the solution. Therefore, the loss of a single disk was tested. This test was accomplished by removing a disk drive from the shelf hosting the database data files on the FAS8060 running on site A while running an active workload.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results The removal of the disk drive causes no interruption of Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at prefailure levels. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Remove one of the disks in an active aggregate and allow the test to continue for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Reinstall the disk drive.
7. Allow the test to continue for the remainder of the 60-minute duration.
8. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
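The effect of the pulled disk and the subsequent rebuild can be observed from the clustered Data ONTAP CLI with commands along the following lines; the aggregate name is an assumption.

storage disk show -container-type broken   # the removed disk, if marked failed
storage aggregate show -aggregate aggr_data_sitea -fields raidstatus
# raidstatus typically reports a reconstructing state while a spare is rebuilt
# into the RAID DP group, then returns to normal.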
TC-04: Loss of Disk Shelf
Test Case Details
Test number TC-04
Test case description No single point of failure should exist in the solution. Therefore, the loss of an entire shelf of disks was tested. This test was accomplished by turning off both power supplies on one of the disk shelves hosting the database data files on the FAS8060 running on site A while running an active workload.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results The loss of a disk shelf causes no interruption of the Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at prefailure levels. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Turn off the power supplies on the designated disk shelf and let the test continue for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Turn on the power supplies on the affected disk shelf.
7. Allow the test to continue for the remainder of the 60-minute duration.
8. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
TC-05: Loss of NetApp Storage Controller
Test Case Details
Test number TC-05
Test case description No single point of failure should exist in the solution. Therefore, the loss of one of the FAS8060 controllers serving the database on site A was tested while an active workload was running.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results The loss of one controller of an HA pair has no impact on Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at a lower rate while the halted storage controller is down and the surviving storage controller handles the entire workload.
After the storage giveback process is completed, performance returns to prefailure levels because both storage controllers are again servicing the workload. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Without warning, halt one of the controllers of the FAS8060 HA pair on site A.
5. Initiate a storage takeover by the surviving node and let the test continue for 15 minutes.
6. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
7. Reboot the halted storage controller.
8. Initiate a storage giveback operation to bring the failed node back into the storage cluster.
9. Allow the test to continue for the remainder of the 60-minute duration.
10. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
TC-06: Loss of Back-End FC Switch
Test Case Details
Test number TC-06
Test case description No single point of failure should exist in the solution. Therefore, the loss of an entire FC switch supporting the MetroCluster cluster was tested. This test was accomplished by removing the power cord from one of the Brocade 6510 switches in site A while running an active workload.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results The loss of a single MetroCluster FC switch causes no interruption of the Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at prefailure levels. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Power off one of the MetroCluster Brocade 6510 switches in site A and allow the test to run for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Power on the Brocade 6510 switch.
7. Allow the test to continue for the remainder of the 60-minute duration.
8. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
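On the surviving fabric, standard Brocade FOS commands can confirm that redundancy is intact; a hedged sketch follows.

switchshow    # local port states; ISL ports appear as E-Ports
fabricshow    # lists the switches still joined to this fabric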
TC-07: Loss of Interswitch Link
Test Case Details
Test number TC-07
Test case description No single point of failure should exist in the solution. Therefore, the loss of one of the ISLs was tested. This test was accomplished by removing an FC cable linking the Brocade 6510 switches at site A and site B while running an active workload.
Test assumptions A completely operational NetApp MetroCluster cluster has been properly installed and configured.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers
Expected results The loss of one of the ISLs between site A and site B causes no interruption of the Oracle RAC operation. During the failure period, IOPS continue to the FAS8060 at prefailure levels. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. Disconnect one of the MetroCluster ISLs and allow the test to continue for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Reconnect the affected ISL.
7. Allow the test to continue for the remainder of the 60-minute duration.
8. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
TC-08: Maintenance Requiring Planned Switchover from Site A to Site B
Test Case Details
Test number TC-08
Test case description If there is a required maintenance window for the FAS8060 storage controllers at site A, the MetroCluster switchover feature should be capable of moving the production workload to site B and presenting the Oracle RAC database LUNs from the FAS8060 storage controllers at site B, allowing the database to continue operations. To test this premise, we initiated a MetroCluster switchover and switchback from site A to site B and then back to site A after the maintenance was complete.
Test assumptions A completely operational NetApp MetroCluster cluster has been installed and configured properly.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers on site A and site B
Expected results Moving the production operations from site A to site B by using the MetroCluster switchover operations causes no interruption of the Oracle RAC operation. After the MetroCluster switchover, IOPS are directed to the FAS8060 storage controllers at site B from both RAC nodes. After the MetroCluster switchback, IOPS are again directed at the FAS8060 storage controllers on site A. No database errors are detected.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. On site B, initiate a MetroCluster switchover of production operations and let the test continue to run in switchover mode for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Heal the aggregates on site A.
7. Perform a MetroCluster switchback to return to normal operation.
8. Verify successful switchback.
9. Allow the test to continue for the remainder of the 60-minute duration.
10. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
TC-09: Disaster Forcing Unplanned Manual Switchover from Site A to Site B
Test Case Details
Test number TC-09
Test case description If an unplanned disaster at site A takes out the FAS8060 storage controllers and the Oracle RAC node at site A, the MetroCluster switchover feature should be capable of moving the production workload to site B and presenting the Oracle RAC database LUNs from the FAS8060 storage controllers at site B, allowing the database to continue operations.
To test this premise, we powered off the FAS8060 storage controllers and the Oracle RAC node located at site A to simulate a site failure. We then manually initiated a MetroCluster switchover and switchback from site A to site B and then back to site A after mitigating the disaster at site A.
Test assumptions A completely operational NetApp MetroCluster cluster has been properly installed and configured.
A completely operational Oracle RAC environment has been installed and configured.
The SLOB utility has been installed and configured to generate a workload consisting of 90% reads and 10% writes with a 100% random access pattern.
Test data or metrics to capture
AWR data as described in section 8, “Test Case Overview and Methodology”
IOPS, CPU, and disk utilization data from both NetApp FAS8060 controllers on site A and site B
Expected results As a result of the disaster, the FAS8060 at site A is lost, which ultimately causes the RAC node at site B to lose access to the database LUNs and stop running. Manually moving the production operations from site A to site B through the MetroCluster switchover operations allows the Oracle RAC database to be restarted by using the surviving database node.
After the MetroCluster switchover and restart of the database, IOPS are directed to the FAS8060 storage controllers at site B from the surviving RAC node on site B. After the disaster is repaired and the MetroCluster switchback is completed, the repaired Oracle RAC node on site A is restarted and added back into the database. IOPS are again directed at the FAS8060 storage controllers on site A from both Oracle RAC nodes.
Test Methodology
1. Initiate the defined workload by using the SLOB tool for a total of 60 minutes. SLOB generates an initial AWR snapshot.
2. Allow the workload to run for 15 minutes to establish consistent performance.
3. Initiate an AWR snapshot to capture database-level IOPS and latency information before the fault is injected.
4. On site B, initiate a MetroCluster switchover of production operations and let the test continue to run in switchover mode for 15 minutes.
5. Initiate an AWR snapshot to capture database-level IOPS and latency during the fault.
6. Heal the aggregates on site A.
7. Perform a MetroCluster switchback to return to normal operation.
8. Verify successful switchback.
9. Allow the test to continue for the remainder of the 60-minute duration.
10. SLOB creates a final AWR snapshot at the end of the test to capture database-level IOPS and latency for the period after the fault is corrected.
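Because site A is down in this scenario, a negotiated switchover is not possible and the switchover must be forced from the surviving site. A hedged sketch of the clustered Data ONTAP 8.3 commands issued from site B follows; healing and switchback then proceed as in TC-08.

metrocluster switchover -forced-on-disaster true   # forced switchover to site B
metrocluster operation show                        # confirm the operation completed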
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.
Trademark Information
NetApp, the NetApp logo, Go Further, Faster, ASUP, AutoSupport, Campaign Express, Cloud ONTAP, Customer Fitness, Data ONTAP, DataMotion, Fitness, Flash Accel, Flash Cache, Flash Pool, FlashRay, FlexArray, FlexCache, FlexClone, FlexPod, FlexScale, FlexShare, FlexVol, FPolicy, GetSuccessful, LockVault, Manage ONTAP, Mars, MetroCluster, MultiStore, NetApp Insight, OnCommand, ONTAP, ONTAPI, RAID DP, SANtricity, SecureShare, Simplicity, Simulate ONTAP, Snap Creator, SnapCopy, SnapDrive, SnapIntegrator, SnapLock, SnapManager, SnapMirror, SnapMover, SnapProtect, SnapRestore, Snapshot, SnapValidator, SnapVault, StorageGRID, Tech OnTap, Unbound Cloud, and WAFL are trademarks or registered trademarks of NetApp, Inc., in the United States and/or other countries. A current list of NetApp trademarks is available on the Web at http://www.netapp.com/us/legal/netapptmlist.aspx.
Cisco and the Cisco logo are trademarks of Cisco in the U.S. and other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. TR-4396-0415
Software derived from copyrighted NetApp material is subject to the following license and disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice. NetApp assumes no responsibility or liability arising from the use of products described herein, except as expressly agreed to in writing by NetApp. The use or purchase of this product does not convey a license under any patent rights, trademark rights, or any other intellectual property rights of NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.277-7103 (October 1988) and FAR 52-227-19 (June 1987).