
Scalable, Fault-Tolerant NAS for Oracle - The Next Generation

Kevin Closson

Chief Software Architect

Oracle Platform Solutions, Polyserve Inc

The Un-”Show Stopper”

• NAS for Oracle is not “file serving”, let me explain…

• Think of GbE NFS I/O paths from the Oracle servers to the NAS device that are totally direct, with no VLAN-style indirection.

– In these terms, NFS over GbE is just a protocol, as is FCP over FibreChannel

– The proof is in the numbers (see the arithmetic sketch after this list)

• A single dual-socket/dual-core AMD server running Oracle10gR2 can push 273MB/s of large I/Os (scattered reads, direct path read/write, etc.) over triple-bonded GbE NICs!

• Compare that to the infrastructure and HW costs of 4Gb FCP (~450MB/s, but you need 2 cards for redundancy)

– OLTP over modern NFS with GbE is not a challenging I/O profile.

• However, not all NAS devices are created equal by any means
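A back-of-the-envelope check of those numbers; this is an illustrative sketch in which the ~110MB/s of usable payload per GbE link is an assumed nominal figure, while the 273MB/s and ~450MB/s figures come from the bullets above:

```latex
\begin{aligned}
\text{Triple-bonded GbE ceiling} &\approx 3 \times 110~\text{MB/s} = 330~\text{MB/s} \\
\text{Measured Oracle throughput} &= 273~\text{MB/s} \approx 83\%~\text{of that ceiling} \\
\text{4Gb FCP} &\approx 450~\text{MB/s, but two HBAs are required for redundancy}
\end{aligned}
```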

Agenda

• Oracle on NAS

• NAS Architecture

• Proof of Concept Testing

• Special Characteristics

Oracle on NAS

Oracle on NAS

• Connectivity
  – A Fantasyland Dream Grid™ would be nearly impossible with a FibreChannel switched fabric, for instance:
    • 128 nodes == 256 HBAs and 2 switches, each with 256 ports, just for the servers; then you still have to work out the storage paths (see the arithmetic sketch after this list)

• Simplicity
  – NFS is simple. Anyone with a pulse can plug in cat-5 and mount filesystems.
  – MUCH MUCH MUCH MUCH MUCH simpler than:
    • Raw partitions for ASM
    • Raw or OCFS2 for CRS
    • Oracle Home? Local Ext3 or UFS?
    • What a mess
  – Supports shared Oracle Home and shared APPL_TOP too
  – But not simpler than a Certified Third Party Cluster Filesystem; that is a different presentation, though

• Cost
  – FC HBAs are always going to be more expensive than NICs
  – Ports on enterprise-level FC switches are very expensive
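The port arithmetic behind the 128-node example above, sketched under the usual assumption of two HBAs per node for redundancy:

```latex
\begin{aligned}
\text{HBAs} &= 128~\text{nodes} \times 2~\text{HBAs per node} = 256 \\
\text{Fabric ports for servers alone} &= 256 \quad \text{(two switches' worth, before any storage or ISL ports)}
\end{aligned}
```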

Oracle on NAS

• NFS Client Improvements
  – Direct I/O
    • open(…, O_DIRECT, …) works with the Linux NFS client, the Solaris NFS client, and likely others (see the sketch after this list)

• Oracle Improvements
  – init.ora filesystemio_options=directIO
  – No async I/O on NFS, but look at the numbers
  – Oracle runtime checks mount options
    • Caveat: it doesn’t always get it right, but at least it tries (OSDS)
  – Don’t be surprised to see Oracle offer a platform-independent NFS client

• NFS V4 will have more improvements
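As a minimal illustration of the direct I/O bullet above, here is a hedged C sketch; the datafile path is hypothetical, and in practice Oracle issues the O_DIRECT open() itself when filesystemio_options=directIO is set:

```c
/* Minimal sketch: open a file on an NFS mount with O_DIRECT, the same
 * flag Oracle uses when filesystemio_options=directIO is in effect.
 * The datafile path below is hypothetical.
 */
#define _GNU_SOURCE             /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/u04/oradata/PROD/users01.dbf";   /* hypothetical datafile */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open with O_DIRECT failed");
        return EXIT_FAILURE;
    }

    /* O_DIRECT requires aligned buffers; 4 KB alignment suits most setups. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        close(fd);
        return EXIT_FAILURE;
    }

    ssize_t n = read(fd, buf, 4096);    /* one unbuffered 4 KB read over NFS */
    printf("read %zd bytes directly (page cache bypassed)\n", n);

    free(buf);
    close(fd);
    return EXIT_SUCCESS;
}
```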

NAS Architecture

NAS Architecture

• Single-headed Filers

• Clustered Single-headed Filers

• Asymmetrical Multi-headed NAS

• Symmetrical Multi-headed NAS

Single-headed Filer Architecture

NAS Architecture: Single-headed Filer

[Diagram: Oracle database servers connect over a GigE network to a single-headed filer presenting filesystems /u01, /u02, /u03. A single one of the database servers has the same (or more) bus bandwidth as the filer.]

Oracle Servers Accessing a Single-headed Filer: I/O Bottleneck

[Diagram: multiple Oracle database servers funnel their I/O through the single filer head, which becomes the I/O bottleneck.]

Oracle Servers Accessing a Single-headed Filer: Single Point of Failure

[Diagram: the Oracle database servers are highly available through failover HA, DataGuard, RAC, etc., but the single filer presenting /u01, /u02, /u03 remains a single point of failure.]

Clustered Single-headed Filers

Architecture: Cluster of Single-headed Filers

[Diagram: two filer heads in a failover pair; one presents /u01 and /u02, the other presents /u03, with paths that become active after failover.]

Oracle Servers Accessing a Cluster of Single-headed Filers

[Diagram: the Oracle database servers mount /u01 and /u02 from one filer head and /u03 from the other; each head stands by to present the other's filesystems after failover.]

Architecture: Cluster of Single-headed Filers

[Diagram: the same clustered pair. What if /u03 I/O saturates its filer?]

Filer I/O Bottleneck. Resolution == Data Migration

[Diagram: a new filer presenting /u04 is added and some of the “hot” data is migrated from /u03 to /u04.]

Data Migration Remedies the I/O Bottleneck

[Diagram: with the “hot” data migrated to /u04, the I/O bottleneck is relieved, but the new filer presenting /u04 is a NEW single point of failure.]

Summary: Single-headed Filers

• Cluster to mitigate the S.P.O.F.
  – Clustering is a pure afterthought with filers
  – Failover times?
    • Long, really really long
  – Transparent?
    • Not in many cases
• Migrate data to mitigate I/O bottlenecks
  – What if the data “hot spot” moves with time? The dog-chasing-his-tail syndrome
• Poor modularity
• Expanded in pairs for data availability
• What’s all this talk about CNS?

Asymmetrical Multi-headed NAS Architecture

Asymmetrical Multi-headed NAS Architecture

[Diagram: Oracle database servers connect to a SAN gateway with three active NAS heads and three for failover; behind them, a FibreChannel SAN holds “pools of data”.]

Note: Some variants of this architecture support M:1 Active:Standby, but that doesn’t really change much.

Asymmetrical NAS Gateway Architecture

• Really not much different than clusters of single-headed filers:

– 1 NAS head to 1 filesystem relationship

– Migrate data to mitigate I/O contention

– Failover not transparent

• But:

– More Modular

• Not necessary to scale up by pairs

Symmetric Multi-headed NAS

HP Enterprise File Services Clustered Gateway

Symmetric vs Asymmetric

[Diagram: two three-head configurations are contrasted. In the asymmetric model, each file (/Dir1/File1, /Dir2/File2, /Dir3/File3) is reachable only through the NAS head that owns it; in the symmetric EFS-CG model, any NAS head can present any of the files for read/write.]

Enterprise File Services Clustered Gateway Component Overview

• Cluster Volume Manager
  – RAID 0
  – Expand online
• Fully Distributed, Symmetric Cluster Filesystem
  – The embedded filesystem is a fully distributed, symmetric cluster filesystem
• Virtual NFS Services
  – Filesystems are presented through Virtual NFS Services
• Modular and Scalable
  – Add NAS heads without interruption
  – All filesystems can be presented for read/write through any/all NAS heads

EFS-CG Clustered Volume Manager

• RAID 0
  – The LUNs are RAID 1, so this implements S.A.M.E. (Stripe And Mirror Everything)
• Expand online
  – Add LUNs, grow the volume
• Up to 16TB in a single volume (see the sketch after this list)
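A hypothetical capacity sketch of that volume layout; the LUN count and size are illustrative, not figures from the presentation:

```latex
\text{Volume size} = N_{\mathrm{LUNs}} \times \text{LUN size} = 16 \times 1~\text{TB (RAID 1 LUNs)} = 16~\text{TB} \quad (\text{the current single-volume limit})
```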

The EFS-CG Filesystem

• All NAS devices have embedded operating systems and file systems, but the EFS-CG is:

– Fully symmetric
  • Distributed Lock Manager
  • No metadata server or lock server
– A general-purpose clustered file system
– Standard C Library and POSIX support
– Journaled, with online recovery

• Proprietary on-disk format, but standard Linux file system semantics and system calls, including flock() and fcntl(), clusterwide (see the sketch after this list)

• Expand a single filesystem online up to 16TB; up to 254 filesystems in the current release
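A minimal C sketch of the clusterwide POSIX locking point above; the path is hypothetical, and on the EFS-CG the distributed lock manager would make the same fcntl() lock visible from every NAS head:

```c
/* Minimal sketch: take a POSIX advisory write lock with fcntl().
 * On a fully symmetric cluster filesystem such as the EFS-CG, the lock
 * is coordinated by the distributed lock manager, so a second process on
 * any node would block here until the lock is released.
 * The lock-file path below is hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/u02/shared/control.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock      */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0          /* 0 == lock the whole file  */
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* wait until the lock is granted */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return EXIT_FAILURE;
    }
    puts("lock held; critical section would run here");

    fl.l_type = F_UNLCK;                 /* release the lock */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return EXIT_SUCCESS;
}
```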

EFS-CG Filesystem Scalability

Scalability. Single Filesystem Export Using x86 Xeon-based NAS Heads (Old Numbers)

[Chart: single-filesystem NFS throughput in MegaBytes per Second (MB/s) versus cluster size (nodes): 123, 246, 493, 739, 986, 1,084, and 1,196 MB/s at 1, 2, 4, 6, 8, 9, and 10 NAS heads respectively.]

# Servers | Total Bytes (MBytes) | Time (sec.) | MBytes/sec | Gbits/sec | Scale Factor | Scaling Coefficient
        1 |               16,384 |         133 |     123.19 |      0.96 |         1.00 |                100%
        2 |               32,768 |         133 |     246.38 |      1.92 |         2.00 |                100%
        4 |               65,536 |         133 |     492.75 |      3.85 |         4.00 |                100%
        6 |               98,304 |         133 |     739.13 |      5.77 |         6.00 |                100%
        8 |              131,072 |         133 |     985.50 |      7.70 |         8.00 |                100%
        9 |              147,456 |         136 |   1,084.24 |      8.47 |         8.80 |                 98%
       10 |              163,840 |         137 |   1,195.91 |      9.34 |         9.71 |                 97%
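How the table's columns relate, worked for the 10-node row using only the table's own numbers:

```latex
\begin{aligned}
\text{MBytes/sec} &= \frac{163{,}840~\text{MB}}{137~\text{s}} \approx 1{,}195.9~\text{MB/s} \\
\text{Scale factor} &= \frac{1{,}195.91}{123.19} \approx 9.71 \\
\text{Scaling coefficient} &= \frac{9.71}{10~\text{nodes}} \approx 97\%
\end{aligned}
```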

NAS I/O Throughput (via NFS)

[Chart: the same throughput figures (123 through 1,196 MB/s) plotted against the number of NAS heads, with a line marking the approximate single-headed filer limit. HP StorageWorks Clustered File System is optimized for both READ and WRITE performance.]

Virtual NFS Services

• Specialized Virtual Host IP

• Filesystem groups are exported through VNFS

• VNFS failover and rehosting are 100% transparent to the NFS client
  – Including active file descriptors, file locks (e.g., fcntl/flock), etc.

EFS-CG Filesystems and VNFS

[Diagram: the Enterprise File Services Clustered Gateway presents /u01 through /u04 to the Oracle database servers through virtual NFS services (vnfs1, vnfs1b, vnfs2b, vnfs3b); each filesystem is exported through a VNFS that can be hosted on, and rehosted across, different NAS heads.]

EFS-CG Management Console

EFS-CG Proof of Concept

EFS-CG Proof of Concept

• Goals

– Use Oracle10g (10.2.0.1) with a single high performance filesystem for the RAC database and measure:

– Durability

– Scalability

– Virtual NFS functionality

EFS-CG Proof of Concept

• The 4 filesystems presented by the EFS-CG were:

– /u01. This filesystem contained all Oracle executables (e.g., $ORACLE_HOME)

– /u02. This filesystem contained the Oracle10gR2 clusterware files (e.g., OCR, CSS) and some datafiles and External Tables for ETL testing

– /u03. This filesystem was lower-performance space used for miscellaneous tests such as disk-to-disk backup

– /u04. This filesystem resided on a high-performance volume that spanned two storage arrays. It contained the main benchmark database

EFS-CG P.O.C. Parallel Tablespace Creation

• All datafiles created in a single exported filesystem

– Proof of multi-headed, single filesystem write scalability

EFS-CG P.O.C. Parallel Tablespace Creation

Multi-headed EFS-CG Tablespace Creation Scalability

[Chart: single-head with a single GigE path: 111 MB/s; multi-headed with dual GigE paths: 208 MB/s.]

EFS-CG P.O.C. Full Table Scan Performance

• All datafiles located in a single exported filesystem

– Proof of multi-headed, single filesystem sequential I/O scalability

EFS-CG P.O.C. Parallel Query Scan Throughput

Multi-headed EFS-CG Full Table Scan Scalability

[Chart: single-head with a single GigE path: 98 MB/s; multi-headed with dual GigE paths: 188 MB/s.]

EFS-CG P.O.C. OLTP Testing

• OLTP Database based on an Order Entry Schema and workload

• Test areas

– Physical I/O scalability under Oracle OLTP
– Long-duration testing

EFS-CG P.O.C. OLTP Workload Transaction Avg Cost

Oracle Statistic      | Average Per Transaction
SGA Logical Reads     | 33
SQL Executions        | 5
Physical I/O          | 6.9 *
Block Changes         | 8.5
User Calls            | 6
GCS/GES Messages Sent | 12

* Averages with RAC can be deceiving, be aware of CR sends

EFS-CG P.O.C. OLTP Testing

10gR2 RAC Scalability on EFS-CG

[Chart: transactions per second versus number of RHEL4-64 RAC servers: 650 (1 node), 1,246 (2 nodes), 1,773 (3 nodes), 2,276 (4 nodes).]

EFS-CG P.O.C. OLTP Testing: Physical I/O Operations

RAC OLTP I/O Scalability on EFS-CG

[Chart: random 4K I/O operations per second versus number of RHEL4-64 RAC servers: 5,214 (1 node), 8,831 (2 nodes), 11,619 (3 nodes), 13,743 (4 nodes).]
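A rough cross-check between the two OLTP charts and the per-transaction averages above (all inputs are the presentation's own numbers; the gap versus the 6.9 I/O average reflects rounding and the CR-send caveat):

```latex
\frac{13{,}743~\text{IOps}}{2{,}276~\text{TPS}} \approx 6.0~\text{physical I/Os per transaction at 4 nodes}
```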

EFS-CG Handles All OLTP I/O Types Sufficiently: No Logging Bottleneck

OLTP I/O by Type

[Chart: I/O operations per second by type: redo writes 893, datafile writes 5,593, datafile reads 8,150.]

Long Duration Stress Test

• Benchmarks do not prove durability

– Benchmarks are “sprints”

– Typically 30-60 minute measured runs (e.g., TPC-C)

• This long duration stress test was no benchmark by any means

– Ramp OLTP I/O up to roughly 10,000/sec

– Run non-stop until the aggregate I/O breaks through 10 Billion physical transfers

– 10,000 physical I/O transfers per second for every second of nearly 12 days (see the arithmetic below)
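The duration arithmetic behind that last bullet, using the presentation's own figures:

```latex
\frac{10 \times 10^{9}~\text{physical I/Os}}{10{,}000~\text{I/Os per second}} = 10^{6}~\text{seconds} \approx 11.6~\text{days}
```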


Special Characteristics

Special Characteristics

• The EFS-CG NAS Heads are Linux Servers

– Tasks can be executed directly within the EFS-CG NAS Heads at FCP speed:

– Compression

– ETL, data importing

– Backup

– etc.

Example of EFS-CG Special Functionality

• A table is exported on one of the RAC nodes

• The export file is then compressed on the EFS-CG NAS head:

– CPU comes from the NAS head instead of the database servers
  • The NAS heads are really just protocol engines; I/O DMAs are offloaded to the I/O subsystems, so there are plenty of spare cycles

– Data movement at FCP rate instead of GigE
  • Offloads the I/O fabric (the NFS paths from the servers to the EFS-CG); see the sketch below
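An illustrative sketch of the data-movement difference, assuming a hypothetical 100GB export file, a nominal ~110MB/s for a single GigE path (an assumption, not a figure from the slides), and the ~450MB/s 4Gb FCP figure quoted earlier:

```latex
\begin{aligned}
\text{Over one GigE NFS path:} \quad & \frac{100{,}000~\text{MB}}{110~\text{MB/s}} \approx 15~\text{minutes} \\
\text{On the NAS head at FCP rate:} \quad & \frac{100{,}000~\text{MB}}{450~\text{MB/s}} \approx 3.7~\text{minutes}
\end{aligned}
```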

Export a Table to NFS Mount

Compress it on the NAS Head

Questions and Answers

Backup Slide

[Diagram: multiple EFS-CG NAS heads sit between an Ethernet switch and FibreChannel switches to the SAN, with 3 GbE NFS paths (which can be triple bonded, etc.).]

EFS-CG Scales “Up” and “Out”
