Business Continuity

Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.

Business Continuity Introduction - 1

© 2007 EMC Corporation. All rights reserved.

Section 4 - Business ContinuitySection 4 - Business Continuity

Welcome to Section 4 of Storage Technology Foundations – Business Continuity.

Copyright © 2007 EMC Corporation. All rights reserved.

These materials may not be copied without EMC's written consent.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

EMC2, EMC, Navisphere, CLARiiON, and Symmetrix are registered trademarks and EMC Enterprise Storage, The Enterprise Storage Company, The EMC Effect, Connectrix, EDM, SDMS, SRDF, Timefinder, PowerPath, InfoMover, FarPoint, EMC Enterprise Storage Network, EMC Enterprise Storage Specialist, EMC Storage Logix, Universal Data Tone, E-Infostructure, Access Logix, Celerra, SnapView, and MirrorView are trademarks of EMC Corporation.

All other trademarks used herein are the property of their respective owners.



© 2007 EMC Corporation. All rights reserved. Business Continuity Introduction - 2

Section ObjectivesUpon completion of this section, you will be able to:

Describe business continuity

Describe the solutions and the supporting technologies that enable business continuity and uninterrupted data availability– Backup and Recovery– Local Replication– Remote Replication

Describe basic disaster recovery techniques

The objectives for this section are shown here. Please take a moment to read them.




In this Section …This section contains the following modules:

1. Business Continuity Overview2. Backup and Recovery3. Local Replication4. Remote Replication

Additional Information:– Apply Your Knowledge– Case Study (Student Resource Guide ONLY)

This section is comprised of the 4 modules shown here.

This section also contains Apply Your Knowledge information and Case Studies.

The Apply Your Knowledge information is presented on-line at the end of selected modules.

The Case Studies are only available in this guide. Please make sure review the Case Studies prior to taking the on-line assessment.






Business Continuity OverviewAfter completing this module, you will be able to:

Define and differentiate between Business Continuity and Disaster Recovery

Differentiate between Disaster Recovery and Disaster Restart

Define terminology such as Recovery Point Objective and Recovery Time Objective

Describe (at high level) Business Continuity Planning

Identify Single Points of Failure and describe solutions to eliminate them

Information has become a critical asset for businesses. The survival of a business depends on uninterrupted availability of the data. Steps should be taken to ensure continuous availability of data in the event of a disaster.

The objectives for this module are shown here. Please take a moment to review them.




What is Business Continuity?Business Continuity is the preparation for, response to, and recovery from an application outage that adversely affects business operations

Business Continuity Solutions address systems unavailability, degraded application performance, or unacceptable recovery strategies

Before we can talk about business continuity and solutions for business continuity, we must first define the terms.

Business Continuity is the preparation for, response to, and recovery from an application outage that adversely affects business operations

Business Continuity Solutions address systems unavailability, degraded application performance, or unacceptable recovery strategies




Lost RevenueKnow the downtime costs (per hour, day, two days...)• Number of employees

impacted (x hours out * hourly rate)

Damaged Reputation

• Customers• Suppliers• Financial markets• Banks• Business partners

Financial Performance

• Revenue recognition• Cash flow• Lost discounts (A/P)• Payment guarantees• Credit rating• Stock price

Other ExpensesTemporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses...

Why Business Continuity

• Direct loss• Compensatory payments• Lost future revenue• Billing losses• Investment losses

Lost Productivity

There are many factors that need to be considered when calculating the cost of downtime. A formula to calculate the costs of the outage should capture both the cost of lost productivity of employees and the cost of lost income from missed sales.

The Estimated average cost of 1 hour of downtime = (Employee costs per hour) *( Number of employees affected by outage) + (Average Income per hour).Employee costs per hour is simply the total salaries and benefits of all employees per week, divided by the average number of working hours per week. Average income per hour is just the total income of an institution per week, divided by average number of hours per week that an institution is open for business.




Information Availability

20 min 10 sec17 hrs 31 min0.2%99.8%

1 hr 41 min3.65 days1%99%

10 min 5 sec8 hrs 45 min0.1%99.9%

0.6 sec31.5 sec0.0001%99.9999%

6 sec5.25 min0.001%99.999%

1 min52.5 min0.01%99.99%

3hrs 22 min7.3 days2%98%

Downtime per WeekDowntime per Year% Downtime% Uptime

Information Availability ensures that applications and business units have access to information whenever it is needed. The primary components of information availability are:

Protection from data lossEnsuring data accessAppropriate data security

Since information is a major business asset, high information availability increases productivity and efficiency. Therefore, it is necessary to make this information reliable, available any time it is required, and sharable by different platforms, anywhere, at anytime. Ensuring access to this information and appropriate data security are also very important and all must be done in a cost effective manner.

Most availability limits will be defined in terms of “Nines.” This chart translates the percentage down time into amounts of downtime per year and per week. Downtime translates to lost revenue. In healthcare, Gartner group considers 99.5% of system availability as outstanding. This is the equivalent 43 hours of unplanned downtime and 50 hours of planned downtime per year.

The online window for some critical applications has moved to 99.999% of time.




Importance of Business Continuity and PlanningMillions of US Dollars per Hour in Lost Revenue

6.53.6

2.82.6

2.01.61.6

1.51.3

1.21.1

Retail brokerage

Point of sale

Energy

Credit card sales authorization

Telecommunications

Call location

Manufacturing

Financial institutions

Information technology

Insurance

Retail

Source Meta Group, 2005

This chart shows how much money each industry loses for each hour of downtime. As you can see, downtime is expensive!




Tape

B

acku

p

Perio

dic

Rep

licat

ion

Recovery Point Objective (RPO)

Wks Days Hrs Mins Secs

Recovery Point Recovery TimeRecovery Point Recovery Time

Tape

B

acku

p

Perio

dic

Rep

licat

ion

Asy

nchr

onou

s R

eplic

atio

nA

sync

hron

ous

R

eplic

atio

n

Sync

hron

ous

Rep

licat

ion

Sync

hron

ous

R

eplic

atio

n

Secs Mins Hrs Days Wks

Recovery Point Objective (RPO) is the point in time to which systems and data must be recovered after an outage. This defines the amount of data loss a business can endure. Different business units within an organization may have varying RPOs.




Recovery Time Objective (RTO)

Recovery Time includes:

Fault detection

Recovering data

Bringing apps back online

Glo

bal

Clu

ster

Wks Days Hrs Mins Secs Secs Mins Hrs Days Wks

Recovery Point Recovery TimeRecovery Point Recovery Time

Glo

bal

Clu

ster

Man

ual

Mig

ratio

nM

anua

l M

igra

tion

Tape

Res

tore

Tape

Res

tore

Recovery Time Objective (RTO) is the period of time within which systems, applications, or functions must be recovered after an outage. This defines the amount of downtime that a business can endure, and survive. Recovery time includes: fault detection, data recovery, and bringing applications back on-line.




Disaster Recovery versus Disaster RestartMost business critical applications have some level of data interdependencies

Disaster recovery– Restoring previous copy of data and applying logs to that copy to bring it to

a known point of consistency – Generally implies the use of backup technology– Data copied to tape and then shipped off-site – Requires manual intervention during the restore and recovery processes

Disaster restart – Process of restarting mirrored consistent copies of data and applications– Allows restart of all participating DBMS to a common point of consistency

utilizing automated application of recovery logs during DBMS initialization– The restart time is comparable to the length of time required for the

application to restart after a power failure

Disaster recovery is the process of restoring a previous copy of the data and applying logs or other necessary processes to that copy to bring it to a known point of consistency.

Disaster restart is the restarting of dependent write consistent copies of data and applications, utilizing the automated application of DBMS recovery logs during DBMS initialization to bring the data and application to a transactional point of consistency.

There is a fundamental difference between Disaster Recovery and Disaster Restart. Disaster recovery is the process of restoring a previous copy of the data and applying logs to that copy to bring it to a known point of consistency. Disaster restart is the restarting of mirrored consistent copies of data and applications.

Disaster recovery generally implies the use of backup technology in which data is copied to tape and then it is shipped off-site. When a disaster is declared, the remote site copies are restored and logs are applied to bring the data to a point of consistency. Once all recoveries are completed, the data is validated to ensure it is correct.

While it might seem like semantics, there is an important difference between recovery and restart. The key difference between the two is the RTO. In a recovery situation one might have to restore data from tape or disk, roll forward committed transactions, roll back uncommitted transactions, and restore to a point of application (or database) consistency. These processes will elongate the RTO. In a restart situation, the application or the database “self-heals” so to speak. As mentioned in the slide, this is very much like starting back up after a power failure.




Disruptors of Data Availability

Disaster (<1% of Occurrences)Natural or man made

Flood, fire, earthquakeContaminated building

Unplanned Occurrences (13% of Occurrences)

FailureDatabase corruptionComponent failureHuman error

Planned Occurrences (87% of Occurrences)Competing workloads

Backup, reportingData warehouse extractsApplication and data restore

Source: Gartner, Inc.

Elevated demand for increased application availability confirms the need to ensure business continuity practices are consistent with business needs.

Interruptions are classified as either planned or unplanned. Failure to address these specific outage categories seriously compromises a company’s ability to meet business goals.

Planned downtime is expected and scheduled, but it is still downtime causing data to be unavailable. Causes of planned downtime include:

New hardware installation/integration/maintenanceSoftware upgrades/patches BackupsApplication and data restoreData center disruptions from facility operations (renovations, construction, other)Refreshing a testing or development environment with production dataPorting testing/development environment over to production environment




Causes of Downtime

Human ErrorSystem Failure

Infrastructure Failure

Disaster

Today, the most critical component of an organization is information. Any disaster occurrence will affect information availability critical to run normal business operations.

In our definition of disaster, the organization’s primary systems, data, applications are damaged or destroyed. Not all unplanned disruptions constitute a disaster.




Business Continuity vs. Disaster RecoveryBusiness Continuity has a broad focus on prevention:– Predictive techniques to identify risks – Procedures to maintain business functions

Disaster Recovery focuses on the activities that occur after an adverse event to return the entity to ‘normal’functioning

Business continuity and disaster recovery are not the same. Business Continuity is a holistic approach to planning, preparing, and recovering from an adverse event. The focus is on prevention, identifying risks, and developing procedures to ensure the continuity of business function. Disaster recovery planning should be included as part of business continuity.

Objectives of Business Continuity:Facilitate uninterrupted business support despite the occurrence of problems. Create plans that identify risks and mitigate them wherever possible.Provide a road map to recover from any event.

Disaster Recovery is more about specific cures, to restore service and damaged assets after an adverse event. In our context, Disaster Recovery is the coordinated process of restoring systems, data, and infrastructure required to support key ongoing business operations.




Business Continuity Planning (BCP)Includes the following activities:

Identifying the mission or critical business functions

Collecting data on current business processes

Assessing, prioritizing, mitigating, and managing risk– Risk Analysis– Business Impact Analysis (BIA)

Designing and developing contingency plans and disaster recovery plan (DR Plan)

Training, testing, and maintenance

Business Continuity Planning (BCP) is a risk management discipline. It involves the entire business--not just IT. BCP proactively identifies vulnerabilities and risks, planning in advance how to prepare for and respond to a business disruption. A business with strong BC practices in place is better able to continue running the business through the disruption and to return to “business as usual.”

BCP actually reduces the risk and costs of an adverse event because the process often uncovers and mitigates potential problems.




Objectives

Train, Test, and

Document

Implement,

Maintain, and

Assess

Analysis

Design

Develop

Business Continuity Planning Lifecycle

The Business Continuity Planning process includes the following stages:

Objectives Determine business continuity requirements and objectives including scope and budgetTeam selection (include all areas of the business and subject matter expertise (internal/external)Create the project plan

Perform analysisCollect information on data, business processes, infrastructure supports, dependencies, frequency of useIdentify critical needs and assign recovery priorities.Create a risk analysis (areas of exposure) and mitigation strategies wherever possible.Create a Business Impact Analysis (BIA)Create a Cost/benefit analysis – identify the cost (per hour/day, etc.) to the business when data is unavailable.Evaluate Options

3. Design and Develop the BCP/StrategiesEvaluate optionsDefine roles/responsibilitiesDevelop contingency scenariosDevelop emergency response proceduresDetail recovery, resumption, and restore proceduresDesign data protection strategies and develop infrastructureImplement risk management/mitigation proceduresTrain, test, and document, implement, maintain, and assess




Business Impact Analysis (BIA)

$1800

$8000

18000

$55619

$55768

$69517

Loss p/y

$400

$16,000

$16,000

$279,098

$279,066

$279,056

Single Loss Expectancy

1.0

0.5

1.0

0.2

0.2

.25

# Event p/y

$5,000

$122,000

$80,000

$10,000

$66,456

$5,800

Est cost of mitigation

No failover for development webserver

12IT-Intranet/B2B

6

Computer room does not have sufficient UPS capacity to run on single unit

34Entire Company

5

Primary dev platforms don’t have failover

34IT-All4

Relocate net equip to a separate physical rack

15Entire Company

3

Cisco net backbone switch not redundant

15Entire Company

2

No redundant UPS for Networking/phone equip

15Entire Company

1

High Risk SPOF ItemProbability (1-5)

Impact (1 -5)

Business Area Affected

#

The Business Impact Analysis quantifies the impact that an outage will have to the business and potential costs associated with the interruption. It helps businesses channel their resources based on probability of failure and associated costs. In the example shown, the dollar values are arbitrary and are used just for illustration.




User & Application Clients

IP

Identifying Single Points of Failure

Primary Node

Earlier, we discussed the importance of mitigating potential problems. Now, let’s walk through a data storage infrastructure example to identify the single points of failure and solutions to eliminate them.




HBA Failures

HBAHBA

Host

Switch

Storage

PortPortHBA

Configure multiple HBAs, and use multi-pathing software

Protects against HBA failure

Can provide improved performance (vendor dependent)

HBA

One component that could fail is the HBA on the server. Configuring multiple HBA’s and using multi-pathing software provides path redundancy. Upon detection of a failed HBA, the software can re-drive the I/O through another available path. This eliminates the HBA from being a single point of failure




Switch/Storage Array Port Failures

HBAHBA

Host SwitchStorage

PortPortHBAHBA

PortPort

Configure multiple switches

Make the devices available via multiple storage array ports

A switch or a storage array port could also fail. As shown in this example, configuring multiple switches, and making the devices available via multiple storage array ports, provides protection against switch or storage array port failures.




Disk Failures

HBAHBA

Host SwitchStorage

PortPortHBAHBA

PortPort

Use some level of RAID

As seen earlier, using some level of RAID, such as RAID-1 or RAID-5, ensures continuous operation in the event of disk failures.




Host Failures

HBAHBA

HostSwitch

Storage

PortPortHBAHBA

PortPort

Storage

Clustering protects against production host failures

Host

HBA

HBA

Planning and configuring clusters is a complex task. At a high level:A cluster is two or more hosts with access to the same set of storage (array) devicesSimplest configuration is a two node (host) clusterOne of the nodes would be the production server while the other would be configured as a

standby. This configuration is described as Active/Passive.Participating nodes exchange “heart-beats” or “keep-alives” to inform each other about their

health.In the event of the primary node failure, cluster management software will shift the production workload to the standby server.Implementation of the cluster failover process is vendor specific.A more complex configuration would be to have both the nodes run production workload on the same set of devices. Either cluster software or application/database should then provide a locking mechanism so that the nodes do not try to update the same areas on disk simultaneously. This would be an Active/Active configuration.




Site/Storage Array Failures

HBAHBA

Host SwitchStorage

PortPortHBAHBA

PortPort

Storage

Remote replication helps protect against either entire site or storage array failures

HBA

HBA

Host

It is also possible for the site or the storage array to fail. Remote replication of data to a secondary array at a secondary site will protect against these failures.




User & Application Clients

IP

Primary Node

IP

Redundant Network

Kee

p A

live

Clustering Software

Failover Node

Redundant PathsRedundant Disks (RAID 1/RAID 5)

Redundant Site

Switches

Storage Array Storage Array

Resolving Single Points of Failure

This slide summarizes what we have seen in the previous few. It uses clustering, redundant paths, RAID protected disks, remote replication of data to a secondary site, and a redundant Local Area Network.




Business Continuity Technology SolutionsLocal Replication

Remote Replication

Backup/Restore

Business Continuity technology solutions include local replication, remote replication, and backup/restore. This module provides a very high level overview of some of these solutions. They are covered in more detail in later modules.




Local ReplicationData from the production devices is copied over to a set of target (replica) devices within the same array

After some time, the replica devices will contain identical data as those on the production devices

Subsequently copying of data can be halted. At this point-in-time, the replica devices can be used independently of the production devices

The replicas can then be used for restore operations in the event of data corruption or other events

Alternatively the data from the replica devices can be copied to tape. This off-loads the burden of backup from the production devices

Local replication technologies offer fast and convenient methods for ensuring data availability. The different technologies and the uses of replicas for BC/DR operations will be discussed in a later module in this section. Typically, local replication uses replica disk devices. This greatly speeds up the restore process, thus minimizing the RTO. Frequent point-in-time replicas also help in minimizing RPO.




Remote ReplicationData from the production devices is copied over to a set of target (replica) devices on a different array at some distance away

Target devices can be kept continuously synchronized with the production devices

In the event of a failure of the production devices, applications can continue to run from the target devices

Remote replication typically involves a pair of arrays separated by some distance. To achieve near-zero RPO and a very small RTO, production and target devices are kept synchronized at all times. Periodic local replicas of the target devices may also be taken, to protect against data corruption on the production devices. The various alternatives for remote replication are discussed later in this section.




Backup/RestoreBackup to tape has been the predominant method for ensuring data availability and business continuity

Low cost, high capacity disk drives are now being used for backup to disk. This considerably speeds up the backup and the restore process

Frequency of backup will be dictated by defined RPO/RTO requirements as well as the rate of change of data

Far from being antiquated, periodic backup is still a widely used method for preserving copies of data. In the event of data loss due to corruption or other events, data can be restored up to the last backup. Evolving technologies now permit faster backups to disks. Magnetic tape drive speeds and capacities are also continually being enhanced. The various backup paradigms and the role of backup in BC/DR planning are discussed in detail later in this section.




Module SummaryKey points covered in this module:

Importance of Business Continuity

Types of outages and their impact to businesses

Business Continuity Planning and Disaster Recovery

Definitions of RPO and RTO

Difference between Disaster Recovery and Disaster Restart

Identifying and eliminating Single Points of Failure

These are the key points covered in this module. Please take a moment to review them.




Check Your Knowledge Which concerns do business continuity solutions address?

What is the difference between RPO and RTO?

What is the difference between Disaster Recovery and Disaster Restart?

What are some of the Single Points of Failure in a typical data center environment?

How can the loss of a storage array port be mitigated?

Check your knowledge of this module by taking some time to answer the questions shown on the slide.




Apply Your Knowledge

How EMC PowerPath Improves Business Continuity

in Storage Environments

At this point, let’s apply what we’ve learned to some real world examples. In this case, we look at how EMC PowerPath improves Business Continuity in storage environments.




What is EMC PowerPath

DBMSDBMS ManagementManagementUtilsUtils

File SystemFile System

Logical Volume ManagerLogical Volume Manager

ApplicationsApplicationsOpen Systems Host

SER

VER

STO

RA

GE Interconnect

Topology

SCSISCSIDriverDriver






SCSISCSIControllerController






PowerPathPowerPath

Host Based Software

Resides between application and SCSI device driver

Provides Intelligent I/O path management

Transparent to the application

Automatic detection and recovery from host-to-array path failures

PowerPath is host-based software that resides between the application and the disk device layers. Every I/O from the host to the array must pass through the PowerPath driver software. This allows PowerPath to work in conjunction with the array and connectivity environment to provide intelligent I/O path management. This includes path failover and dynamic load balancing, while remaining transparent to any application I/O requests as it automatically detects and recovers from host-to-array path failures.

PowerPath is supported on various hosts and Operating Systems such as Sun- Solaris, IBM-AIX, HP-UX, Microsoft Windows, Linux, and Novell. Storage arrays from EMC, Hitachi, HP, and IBM are supported. The level of OS and array models supported varies between PowerPath software versions.




PowerPath FeaturesMultiple paths, for higher availability and performance

Dynamic multipath load balancing

Proactive path testing and automatic path recovery

Automatic path failover

Online path configuration and management

High-availability cluster support

PowerPath Delivers:

PowerPath maximizes application availability, optimizes performance, and automates online storage management while reducing complexity and cost, all from one powerful data path management solution. PowerPath supports the following features:

Multiple path support - PowerPath supports multiple paths between a logical device and a host. Multiple paths enables the host to access a logical device, even if a specific path is unavailable. Also, multiple paths enable sharing of the I/O workload to a given logical device.Dynamic load balancing - PowerPath is designed to use all paths at all times. PowerPath distributes I/O requests to a logical device across all available paths, rather than requiring a single path to bear the entire I/O burden.Proactive path testing and automatic path recovery - PowerPath uses a path test to ascertain the viability of a path. After a path fails, PowerPath continues testing it periodically to determine if it is fixed. If the path passes the test, PowerPath restores it to service and resumes sending I/O to it.Automatic path failover - If a path fails, PowerPath redistributes I/O traffic from that path to functioning paths.Online configuration and management - PowerPath management interfaces include a command line interface and a GUI interface on Windows.High availability cluster support - PowerPath is particularly beneficial in cluster environments, as it can prevent operational interruptions and costly downtime.




PowerPath ConfigurationAll volumes are accessible through all pathsMaximum 32 paths to a logical volume Interconnect support for – SAN– SCSI– iSCSI

Host Application(s)

HBA HBA

SD SDSD

HBA Host BusAdapter

SCSIDriver

Storage

SER

VER

STO

RA

GE Interconnect

Topology

SD

HBA

PowerPath

Without PowerPath, if a host needed access to 40 devices, and there were four Host Bus Adapters, you would most likely configure the host to present 10 unique devices to each HB-. With PowerPath, the host is given access to all 40 devices via all four HBA’s.

PowerPath supports up to 32 paths to a logical volume. The host can be connected to the array using a number of interconnect topologies such as SAN, SCSI, or iSCSI.




The PowerPath Filter Driver

Platform independent base driverApplications direct I/O to PowerPath

PowerPath directs I/O to optimal path based on current workload and path availability

When a path fails PowerPath chooses another path in the set

Host Application(s)

HBA HBA

SD SDSD

HBA Host BusAdapter

SCSIDriver

Storage

SER

VER

STO

RA

GE Interconnect

Topology

SD

HBA

PowerPath Filter Driver

The PowerPath filter driver is a platform independent driver that resides between the application and HBA driver.

The driver identifies all paths that read and write to the same device and builds a routing table for the device, called a volume path set. A volume path set is created for each shared device in the array.

PowerPath can use any path in the set to service an I/O request. If a path fails, PowerPath can redirect an I/O request from that path to any other available path in the set. This redirection is transparent to the application, which does not receive an error.




Path Fault without PowerPathIn most environments, a host will have multiple paths to the Storage SystemVolumes are spread across all available pathsEach volume has a single pathHost adapter and cable connections are single points of failureWork load not balanced among all paths

Storage

Host Application(s)

SD

HBA

SD

HBA

SD

HBA

SD

HBA Host BusAdapter

SCSIDriver

SER

VER

STO

RA

GE Interconnect

Topology

Without PowerPath, the loss of a channel (as indicated in the diagram by a red dotted line) means one or more applications may stop functioning. This can be caused by the loss of a Host Bus Adapter, Storage Array Front-end connectivity, Switch port, or a failed cable. In a standard non-PowerPath environment, these are all single points of failure. In this case, all I/O that was heading down the path highlighted in red is now lost, resulting in an application failure and the potential for data loss or corruption.




Path Fault with PowerPathIf a host adapter, cable, or channel director/Storage Processor fails, the device driver returns a timeout to PowerPath

PowerPath responds by taking the path offline and re-driving I/O through an alternate path

Subsequent I/Os use surviving path(s)

Application is unaware of failure

Host Application(s)

HBA HBA

SD SDSD

HBA Host BusAdapter

SCSIDriver

Storage

SER

VER

STO

RA

GE Interconnect

Topology

SD

HBA

PowerPath

This example depicts how PowerPath failover works. When a failure occurs, PowerPath transparently redirects the I/O down the most suitable alternate path. The PowerPath filter driver looks at the volume path set for the device, considers current workload, load balancing, and device priority settings, and chooses the best path to send the I/O down. In the example, PowerPath has three remaining paths to redirect the failed I/O, and to load balance.




SummaryKey points covered in this module:

PowerPath is server based software that provides multiple paths between the host bus adapter and the Storage Subsystem– Redundant paths eliminate host adapter, cable connection, and

channel adapters as single points of failures and increase availability– Improves performance by dynamically balancing the workload across

all available paths– Application transparent

Enhances data availability and accessibility





Backup and Recovery - 1

© 2007 EMC Corporation. All rights reserved. Backup and Recovery - 1

Backup and RecoveryUpon completion of this module, you will be able to:

Describe best practices for planning Backup and Recovery

Describe the common media and types of data that are part of a Backup and Recovery strategy

Describe the common Backup and Recovery topologies

Describe the Backup and Recovery Process

Describe Management considerations for Backup and Recovery

This module looks at backup and recovery. Backup and recovery are a major part of the planning process for Business Continuity.

The objectives for the module are shown here. Please take a moment to read them.




In this Module….This module contains the following lessons:

Planning for Backup and Recovery

Backup and Recovery Methods

Backup Architecture Topologies

Managing the Backup Process

This module contains the four lessons shown here. These lessons provide an overview of backup and recovery, including the business and technical aspects.




Lesson: Planning for Backup and RecoveryUpon completion of this lesson, you be able to:

Define Backup and Recovery

Describe common reasons for a Backup and Recovery plan

Describe the business considerations for Backup and Recovery

Define RPO and RTO

Describe the data considerations for Backup and Recovery

Describe the planning for Backup and Recovery

This lesson provides an overview of the business drivers for backup and recovery and introduces some of the common terms used when developing a backup and recovery plan.

The objectives for this lesson are shown here. Please take a moment to read them.




What is a BackupBackup is an additional copy of data that can be used for restore and recovery purposes

The Backup copy is used when the primary copy is lost or corrupted

This Backup copy can be created as a:– Simple copy (there can be one or more copies)– Mirrored copy (the copy is always updated with whatever is written to

the primary copy)

A Backup is a copy of the online data that resides on primary storage. The backup copy is created and retained for the sole purpose of recovering deleted, broken, or corrupted data on the primary disk.

The backup copy is usually retained over a period of time, depending on the type of the data, and on the type of backup. There are three derivatives for backup: disaster recovery, archival, and operational backup.

The data that is backed up may be on such media as disk or tape, depending on the backup derivative the customer is targeting. For example, backing up to disk may be more efficient than tape in operational backup environments.




Backup and Recovery StrategiesSeveral choices are available to get the data to the backup

media such as:

Copy the data

Mirror (or snapshot) then copy

Remote backup

Copy then duplicate or remote copy

Several choices are available to get the data written to the backup media. You can simply copy the data from the primary storage to the secondary storage (disk or tape), onsite. This is a simple strategy, easily implemented, but impacts the production server where the data is located, since it uses the server’s resources. This may be tolerated on some applications, but not high demand ones.To avoid an impact on the production application, and to perform serverless backups, you can mirror (or snap) a production volume. For example, you can mount it on a separate server and then copy it to the backup media (disk or tape). This option completely frees up the production server, with the added infrastructure cost associated with additional resources.Remote Backup can be used to comply with offsite requirements. A copy from the primary storage is done directly to the backup media that is sitting on another site. The backup media can be a real library, a virtual library or even a remote filesystem. You can do a copy to a first set of backup media, which will be kept onsite for operational restore requirements, and then duplicate it to another set of media for offsite purposes. To simplify the procedure, replicate it to an offsite location to remove any manual procedures associated with moving the backup media to another site.




It’s All About RecoveryBusinesses back up their data to enable its recovery in case of potential loss

Businesses also back up their data to comply with regulatory requirements

Types of backup derivatives:– Disaster Recovery– Archival– Operational

Disaster Recovery addresses the requirement to be able to restore all, or a large part of, an IT infrastructure in the event of a major disaster.

Archival is a common requirement used to preserve transaction records, email, and other business work products for regulatory compliance. The regulations could be internal, governmental, or perhaps derived from specific industry requirements.

Operational is typically the collection of data for the eventual purpose of restoring, at some point in the future, data that has become lost or corrupted.




Reasons for a Backup Plan Hardware Failures

Human Factors

Application Failures

Security Breaches

Disasters

Regulatory and Business Requirements

Reasons for a backup plan include: Physical damage to a storage element (such as a disk) that can result in data loss. People make mistakes and unhappy employees or external hackers may breach security and maliciously destroy data. Software failures can destroy or lose data and viruses can destroy data, impact data integrity, and halt key operations. Physical security breaches can destroy equipment that contains data and applications.Natural disasters and other events such as earthquakes, lightning strikes, floods, tornados, hurricanes, accidents, chemical spills, and power grid failures can cause not only the loss of data but also the loss of an entire computer facility. Offsite data storage is often justified to protect a business from these types of events.Government regulations may require certain data to be kept for extended timeframes. Corporations may establish their own extended retention policies for intellectual property to protect them against litigation. The regulations and business requirements that drive data as an archive generally require data to be retained at an offsite location.




How does Backup WorkClient/Server Relationship

Server – Directs Operation– Maintains the Backup Catalog

Client– Gathers Data for Backup (a backup client sends backup data to a

backup server or storage node).

Storage Node

Backup products vary, but they do have some common characteristics. The basic architecture of a backup system is client-server, with a backup server and some number of backup clients or agents. The backup server directs the operations and owns the backup catalog (the information about the backup). The catalog contains the table-of-contents for the data set. It also contains information about the backup session itself.

The backup server depends on the backup client to gather the data to be backed up. The backup client can be local or it can reside on another system, presumably to backup the data visible to that system. A backup server receives backup metadata from backup clients to perform its activities.

There is another component called a storage node. The storage node is the entity responsible for writing the data set to the backup device. Typically, there is a storage node packaged with the backup server and the backup device is attached directly to the backup server’s host platform. Storage nodes play an important role in backup planning as it can be used to consolidate backup servers.




How does Backup Work (continued)

DiskStorage

TapeBackup

Data SetMetadataCatalog

Backup Server& Storage Node

Servers Backup Clients

Clients

The following represents a typical Backup process:The Backup Server is policy driven, and initiates the backup processThe Backup Server sends a request to a Backup Client to "send me your meta-data"The Backup Client sends the meta-data to the Backup ServerThe Backup Server writes the meta-data to the meta-data catalog on disk The Backup Client sends the data to the Storage Node The Storage Node writes the data to the tape storage device

When all of the data has been written by the Storage Node to the tape device, the Storage Node closes the connection to the tape device, and the Backup Server writes status completion to the meta-data catalog on disk

Note: The Backup Server and the Storage Node might be hosted on the same physical machine. Some backup architectures refer to the Storage Node as the Media Server.




Business Considerations Customer business needs determine:– What are the restore requirements – RPO & RTO?– Where and when will the restores occur?– What are the most frequent restore requests?– Which data needs to be backed up?– How frequently should data be backed up?

hourly, daily, weekly, monthly– How long will it take to backup?– How many copies to create?– How long to retain backup copies?

Some important decisions that need consideration before implementing a Backup/Restore solution are shown here. Some examples include:

The Recovery Point Objective (RPO)The Recovery Time Objective (RTO)The media type to be used (disk or tape) Where and when the restore operations occur, especially if an alternative host is used to receive the restore dataWhen to perform backupThe granularity of backups, Full, Incremental or cumulativeHow long to keep the backup; For example, some backups need to be retained for 4 years, others just for 1 month Is it necessary or not to take copies of the backup?




Data Considerations: File Characteristics Location

Size

Number

Location: Many organizations have dozens of heterogeneous platforms that support a complex application. Consider a data warehouse where data from many sources is fed into the warehouse. When this scenario is viewed as “The Data Warehouse Application”, it easily fits this model. Some of the issues are:−How the backups for subsets of the data are synchronized−How these applications are restored

Size: Backing up a large amount of data that consists of a few big files may have less system overhead than backing up a large number of small files. If a file system contains millions of small files, the very nature of searching the file system structures for changed files can take hours, since the entire file structure is searched. Number: a file system containing one million files with a ten-percent daily change rate will potentially have to create 100,000 entries in the backup catalog. This brings up other issues such as:−How a massive file system search impacts the system− Search time/Media impact− Is there an impact on tape start/stop processing?




Data Considerations: Data CompressionCompressibility depends on the data type, for example:

Application binaries – do not compress well

Text – compresses well

JPEG/ZIP files – are already compressed and expand if compressed again

Many backup devices, such as tape drives, have built-in hardware compression technologies. To effectively use these technologies, it is important to understand the characteristics of the data. Some data, such as application binaries, do not compress well. Text data can compress very well, while other data, such as JPEG and ZIP files, are already compressed.




Data Considerations: Retention PeriodsOperational– Data sets on primary media (disk) up to the point where most restore

requests are satisfied, then moved to secondary storage (tape)

Disaster Recovery– Driven by the organization’s disaster recovery policy

Portable media (tapes) sent to an offsite location / vaultReplicated over to an offsite location (disk)Backed up directly to the offsite location (disk, tape or emulated tape)

Archiving– Driven by the organization’s policy– Dictated by regulatory requirements

Retention periods are the length of time that a particular version of a dataset is available to be restored.

Retention periods are driven by the type of recovery the business is trying to achieve:For operational restore, data sets could be maintained on a disk primary backup storage target for a period of time, where most restore requests are likely to be achieved, and then moved to a secondary backup storage target, such as tape, for long term offsite storage.For disaster recovery, backups must be done and moved to an offsite location. For archiving, requirements usually will be driven by the organization’s policy and regulatory conformance requirements. Tapes can be used for some applications, but for others a more robust and reliable solution, such as disks, may be more appropriate.




Lesson: Summary Topics in this lesson included:

Backup and Recovery definitions and examples

Common reasons for Backup and Recovery

The business considerations for Backup and Recovery

Recovery Point Objectives and Recovery Time Objectives

The data considerations for Backup and Recovery

The planning for Backup and Recovery

In this lesson we reviewed the business and data considerations when planning for backup and recovery, including:

What is a Backup and Recovery?

What is the Backup and Recovery process?

Business recovery needsRPO Recovery point objectivesRTO Recovery time objectives

Data characteristics Files, compression, retention




Lesson: Backup and Recovery Methods Upon completion of this lesson, you be able to:

Describe Hot and cold backups

Describe the levels of backup granularity

We’ve discussed the importance and considerations for a backup plan. This lesson provides an overview of the different methods for creating a backup set.





Database Backup MethodsHot Backup: production is not interrupted

Cold Backup: production is interrupted

Backup Agents manage the backup of different data types such as:– Structured (such as databases)– Semi-structured (such as email)– Unstructured (file systems)

Backing up databases can occur using two different methods:A hot backup, which means that the application is still up and running, with users accessing it, while backup is taking place.A cold backup, which means that the application will be shut down for the backup to take place.

Most backup applications offer various backup agents to do these kinds of operations. There are different agents for different types of data and applications.




Backup Granularity and LevelsFull Backup

Cumulative (Differential)

Incremental

Full Cumulative Incremental

The three different types of backups include: Full Backup; Incremental Backup; and Cumulative Backup.

A full backup is a backup of all data on the target volumes, regardless of any changes made to the data itself.An incremental backup contains the changes since the last backup, of any type, whichever was most recent.A cumulative backup, also known as a differential backup, is a type of incremental that contains changes made to a file since the last full backup.

The granularity and levels for backups depend on business needs, and, to some extent, technological limitations. Some backup strategies define as many as ten levels of backup. IT organizations use a combination of these to fulfill their requirements. Most use some combination of full, cumulative, and incremental backups.

<Continued>



Advantages and DisadvantagesFull Backup: Because the full backup stores all files and folders, frequent full backups result in faster and simpler restore operations. However, the amount of time it takes to run full backups often prevents you from using this backup type. Full backups are often restricted to a weekly or monthly schedule, although the increasing speed and capacity of backup media is making overnight full backups a more realistic proposition. Advantages:

Restore is the fastest Disadvantages:

Backing up is the slowest The storage space requirements are the highest (compared to incremental backups or cumulative backups)

Incremental Backup: The advantage of an incremental backup is that it takes the least amount of time to complete. However, during a restore operation, each incremental backup is processed, which could result in a lengthy restore job. The advantage of lower backup times comes with a price: increased restore time. When restoring from incremental backup, you need the most recent full backup, as well as every incremental backup you've made since the last full backup.Advantages:

Back up is the fastest The storage space requirements are the lowest

Disadvantages: Restore is the slowest

Cumulative Backup: The advantage of a cumulative backup is that it shortens restore time compared to an incremental backup. However, if you perform the cumulative backup too many times, the size of the cumulative backup might grow as large as the full backup. Restoring a cumulative backup is a faster process than restoring an incremental backup because only two sets of tapes would be required –the last full backup and the last cumulative backup.Advantages:

Restore is faster than restoring from incremental backup Backing up is faster than a full backup The storage space requirements are lower than for full backup

Disadvantages: Restore is slower than restoring from full backup Backing up is slower than incremental backup The storage space requirements are higher than for incremental backup




Files 1, 2, 3, 4, 5

ProductionProduction

Restoring an Incremental Backup

Key Features– Files that have changed since the last full or incremental backup are

backed up– Fewest amount of files to be backed up, therefore faster backup and less

storage space– Longer restore because last full and all subsequent incremental backups

must be applied

IncrementalIncremental

Tuesday

File 4


Wednesday

File 3


Thursday

File 5Files 1, 2, 3

Monday

Full BackupFull Backup

Following is an example of an incremental backup and restore.

A full backup of the business data is taken on Monday evening. Each day after that, an incremental backup is taken. These incremental backups only backup files that are new or that have changed since the last full or incremental backup.

On Tuesday, a new file is added, File 4. No other files have changed. Since File 4 is a new file added after the previous backup on Monday evening, it will be backed up Tuesday evening.

On Wednesday, there are no new files added since Tuesday, but File 3 has changed. Since File 3 was changed after the previous evening backup (Tuesday), it will be backed up Wednesday evening.

On Thursday, no files have changed but a new file has been added, File 5. Since File 5 was added after the previous evening backup, it will be backed up Thursday evening.

On Friday morning, there is a data corruption, so the data must be restored from tape. The first step is to restore the full backup from Monday evening. Then, every incremental backup that was done since the last full backup must be applied, which, in this example, means the:Tuesday, Wednesday, and Thursday incremental backups.




Restoring a Cumulative Backup

Key Features– More files to be backed up, therefore it takes more time to backup

and uses more storage space– Much faster restore because only the last full and the last cumulative

backup must be applied

Files 1, 2, 3, 4, 5, 6


CumulativeCumulative

Tuesday

File 4Files 1, 2, 3

Monday

Full BackupFull Backup CumulativeCumulative

Wednesday

Files 4, 5

CumulativeCumulative

Thursday

Files 4, 5, 6

The following is an example of cumulative backup and restore.

A full backup of the data is taken on Monday evening. Each day after that, a cumulative backup is taken. These cumulative backups backup ALL FILES that have changed since the LAST FULL BACKUP.

On Tuesday, File 4 is added. Since File 4 is a new file that has been added since the last full backup, it will be backed up Tuesday evening.

On Wednesday, File 5 is added. Now, since both File 4 and File 5 are files that have been added or changed since the last full backup, both files will be backed up Wednesday evening.

On Thursday, File 6 is added. Again, File 4, File 5, and File 6 are files that have been added or changed since the last full backup; all three files will be backed up Thursday evening.

On Friday morning, there is a corruption of the data, so the data must be restored from tape. The first step is to restore the full backup from Monday evening. Then, only the backup from Thursday evening is restored because it contains all the new/changed files from Tuesday, Wednesday, and Thursday.





Hot and Cold backups

The levels of backup granularity

This lesson provided an introduction to backup methods and levels of backup granularity.




Lesson: Backup Architecture Topologies Upon completion of this lesson, you be able to:

Describe DAS, LAN, SAN, mixed topologies

Describe backup media considerations

So far, we have discussed the importance of the backup plan and the different methods used when creating a backup set. This lesson provides an overview of the different topologies and media types that are used to support creating a backup set.





Backup Architecture TopologiesThere are 3 basic backup topologies:– Direct Attached Based Backup– LAN Based Backup– SAN Based Backup

These topologies can be integrated, forming a “mixed”topology

There are three basic topologies that are used in a backup environment: Direct Attached Based Backup, LAN Based Backup, and SAN Based Backup.

There is also a fourth topology, called “Mixed”, which is formed when mixing two or more of these topologies in a given situation.




Direct Attached Based Backups

Catalog

Backup Server

LAN

Metadata

MediaBackupStorage Node

Backup Client

Data

Here, the backup data flows directly from the host to be backed up to the tape, without utilizing the LAN. In this model, there is no centralized management and it is difficult to grow the environment.




LAN Based Backups

Backup ServerStorage Node

LAN

Metadata

Storage Node

Data

Mail ServerBackup Client

Database ServerBackup Client

MetadataData

In this model, the backup data flows from the host to be backed up to the tape through the LAN. There is centralized management, but there may be an issue with the LAN utilization since all data goes through it.




SAN Based Backups (LAN Free)

LAN

Metadata

Storage NodeBackup Client

Data

Mail Server

SAN

Backup Server

Data

Backup Device

A SAN based backup, also known as LAN free backup, is achieved when there is no backup data movement over the LAN. In this case, all backup data travels through a SAN to the destination backup device.

This type of backup still requires network connectivity from the Storage Node to the Backup Server, since metadata always has to travel through the LAN.




SAN/LAN Mixed Based Backups

LAN

Metadata

Storage Node

Data

Mail ServerBackup Client

Database ServerBackup Client

Data

SAN

Backup Server

Data

Backup Device

A SAN/LAN mixed based backup environment is achieved by using two or more of the topologies described in the previous slides. In this example, some servers are SAN based while others are LAN based.




Backup MediaTape– Traditional destination for backups– Sequential access– No protection

Disk– Random access– Protected by the storage array (RAID, hot spare, etc)

There are two common types of Backup media: Tape and Disk.




Multiple Streams on Tape Media

Multiple streams interleaved to achieve higher throughput on tape

– Keeps the tape streaming, for maximum write performance– Helps prevent tape mechanical failure– Greatly increases time to restore

TapeTape

Data fromStream 1 Data from

Stream 2 Data fromStream 3

Tape drive streaming is recommended from all vendors in order to keep the drive busy. If you do not keep the drive busy during the backup process (writing), performance suffers. Multiple streaming helps to improve performance drastically, but it generates one issue as well: the backup data becomes interleaved; thus, the recovery times are increased.




Backup to DiskBackup to disk minimizes tape in backup environments by using disk as the primary destination device– Cost benefits– No processes changes needed– Better service levels

Backup to disk aligns backup strategy to RTO and RPO

Backup to disk replaces tape and its associated devices, as the primary target for backup, with disk. Backup to disk systems offer major advantages over equivalent scale tape systems, in terms of capital costs, operating costs, support costs, and quality of service. It can be implemented fully on day 1 or over a phased approach.




Tape versus Disk – Restore Comparison

Typical Scenario:800 users, 75 MB mailbox 60 GB database

Source: EMC Engineering and EMC IT

*Total time from point of failure to return of service to e-mail users

31

0 10 20 30 40 50 60 70 80 90 100 120110Recovery Time in Minutes*

TapeBackup / Restore

DiskBackup / Restore

108Minutes

108Minutes

24Minutes

24Minutes

This example shows a typical recovery scenario using tape and disk. As you can see, recovery with disk provides much faster recovery than does recovery with tape.

This example shows a typical recovery scenario using tape and disk. As you can see, recovery with disk provides much faster recovery than recovery with tape.It is important to keep in mind that this example involves data recovery only. The time it takes to bring the application online is a separate matter. Even so, you can see in this example that the benefit was a restore roughly five times faster than it would have been with tape.




Three Backup / Restore Solutions based on RTO

Time of last image dictates the log playback timeLarger data sets extend the recovery time (ATA and tape)

*Total time from point of failure to return of service to e-mail users

0 10 20 30 40 50 60 70 80 90 100 120110Recovery Time in Minutes*

Backup on tape

Backup on ATA

108 Min.108 Min.

24 Min.24 Min.

Typical Scenario:800 users, 75 MB mailbox 60 GB DB – restore time500 MB logs – log playback

130

Local Replica / Clone

2 Min.

41 Minutes

19 Minutes

125 Minutes

17 Min.

17 Min.

17 Min.

Restore time

Log playback

The diagram shows typical recovery scenarios using different technical solutions. As seen in the slide, recovery with a Local Replica or clones provides the quickest recovery method.

It is important to note that using clones on disk enables you to be able to make more copies of your data more often. This will improve RPO (the point from which they can recover). It also improves RTO because the log files are smaller, reducing the log playback time.




Traditional Backup, Recovery, and Archive Approach

Production environment grows– Requires constant tuning and data placement to

maintain performance – Need to add more tier-1 storage

Backup environment grows– Backup windows get longer and jobs do not complete– Restores take longer– Requires more tape drives and silos to keep up with

service levels

Archive environment grows– Impact flexibility to retrieve content when requested– Requires more media, adding management cost– No investment protection for long term retention

requirements

BackupProcessBackupProcess

ArchiveProcessArchiveProcess


In a traditional approach for backup and archive, businesses take a backup of production. Typically, backup jobs use weekly full backups and nightly incremental backups. Based on business requirements, they then copy the backup jobs and eject the tapes to have them sent offsite, where they are stored for a specified amount of time.

The problem with this approach is simple; as the production environment grows, so does the backup environment.




Differences Between Backup / Recovery & Archive

Data typically maintained for analysis, value generation, or compliance

Data typically overwritten on periodic basis (e.g., monthly)

Useful for compliance and should take into account information-retention policy

Not for regulatory compliance—though some are forced to use

Typically long-term (months, years, or decades)

Typically short-term (weeks or months)

Adds operational efficiencies by moving fixed / unstructured content out of operational environment

Improves availability by enabling application to be restored to a specific point in time

Available for information retrievalUsed for recovery operations

Primary copy of informationA secondary copy of information

ArchiveBackup / Recovery

Backup/recovery and archiving support different business and goals. This slide compares and contrasts some of the differences that are significant.




New Architecture for Backup, Recovery & Archive

Understand the environment

Actively archive valuable information to tiered storage

Back up active production information to disk

Retrieve from archive or recover from backup

BackupProcessBackupProcess

ArchiveProcessArchiveProcessProductionProduction

1

3

4

2

4

The recovery process is much more important than the backup process. It is based on the appropriate recovery-point objectives (RPOs) and recovery-time objectives (RTOs). The process usually drives a decision to have a combination of technologies in place, from online local replicas, to backup to disk, to backup to tape for long-term, passive RPOs.

Archive processes are determined not only by the required retention times, but also by retrieval-time service levels and the availability requirements of the information in the archive.

For both processes, a combination of hardware and software is needed to deliver the appropriate service level. The best way to discover the appropriate service level is to classify the data and align the business applications with it.





The DAS, LAN, SAN, and Mixed topologies

Backup media considerations

This lesson provided an overview of the different topologies and media types that support creating a backup set.




Lesson: Managing the Backup Process Upon completion of this lesson, you be able to:

Describe features and functions of common Backup/Recovery applications

Describe the Backup/Recovery process management considerations

Describe the importance of the information found in Backup Reports and in the Backup Catalog

We have discussed the planning and operations of creating a backup. This lesson provides an overview of management activities and applications that help manage the backup and recovery process.





How a Typical Backup Application WorksBackup clients are grouped and associated with a Backup schedule that determines when and which backup type will occur

Groups are associated with Pools, which determine which backup media will be used

Each backup media has a unique label

Information about the backup is written to the Backup Catalog during and after it completes. The Catalog shows: – when the Backup was performed, and – which media was used (label)

Errors and other informationre is also written to a log

This slide describes how a backup application works.

Backup clients are grouped and associated with a Backup schedule that determines when and which backup type will occur.

Groups are associated with Pools, which determine which backup media will be used.

Each backup media has a unique label.

Information about the backup is written to the Backup Catalog during and after it completes. The Catalog shows when the backup was performed, and which media, or label, was used.

Errors and other information are also written to a log.




Backup Application User InterfacesThere are typically two types of user interfaces:

Command Line Interface – CLI

Graphical User Interface – GUI

There are two types of user interfaces, Command Line Interface (CLI) and Graphical User Interface (GUI).

Command Line Interface – CLIBackup administrators usually write scripts to automate common tasks, such as sending reports via email

Graphical User Interfaces – GUIControl the backup and restore processMultiple backup serversMultiple storage nodesMultiple platforms/operating systemsSingle and easy to use interface that provides the most common (if not all) administrative tasks




Managing the Backup and Restore ProcessRunning the B/R Application: Backup– The backup administrator configures it to be started, most (if not all)

of the times, automatically– Most backup products offer the ability for the backup client to initiate

their own backup (usually disabled)

Running the B/R Application: Restore– There is usually a separate GUI to manage the restore process– Information is pulled from the backup catalog when the user is

selecting the files to be restored– Once the selection is finished, the backup server starts reading from

the required backup media, and the files are sent to the backup client

Shown here are the common tasks associated with managing a Backup or Restore activity using the B/R application.

Backup:Configuring a backup to be started automatically, most (if not all) of the timeEnabling the backup client to initiate it own

Restore:There is usually a separate GUI to manage the restore processInformation is pulled from the backup catalog when the user is selecting the files to be restoredOnce the selection is finished, the backup server starts reading from the required backup media, and the files are sent to the backup client




Backup ReportsBackup products also offer reporting features

These features rely on the backup catalog and log files

Reports are meant to be easy to read and provide important information such as:– Amount of data backed up– Number of completed backups– Number of incomplete backups (failed)– Types of errors that may have occurred

Additional reports may be available, depending on the backup software product used

Backup products also offer reporting features.

These features rely on the backup catalog and log files.

Reports are meant to be easy to read and provide important information such as:Amount of data backed upNumber of completed backupsNumber of incomplete backups (failed)Types of errors that may have occurred




Importance of the Backup CatalogAs you can see, backup operations strongly rely on the backup catalog

If the catalog is lost, the backup software alone has no means to determine where to find a specific file backed up two months ago, for example

It can be reconstructed, but this usually means that all of the backup media (i.e. tapes) has to be read

It’s a good practice to protect the catalog– By replicating the file system where it resides to a remote location– By backing it up

Some backup products have built-in mechanisms to protect their catalog (such as automatic backup)

The importance of the backup catalog is described here.

As you can see, backup operations strongly rely on the backup catalog.

If the catalog is lost, the backup software alone has no means to determine where to find a specific file backed up two months ago, for example.

It can be reconstructed, but this usually means that all of the backup media has to be read.

It’s a good practice to protect the catalog. This can be done by replicating the file system where it resides to a remote location or by backing it up.

Some backup products have built-in mechanisms to protect their catalog, such as automatic backup.





The features and functions of common Backup/Recovery applications

The Backup/Recovery process management considerations

The importance of the information found in Backup Reports and in the Backup Catalog

This lesson provided an overview of Backup and Recovery management activities and tools including: The Backup Application process and user interface; Reports; and the Backup Catalog.





The best practices for planning Backup and Recovery

The common media and types of data that are part of a Backup and Recovery strategy

The common Backup and Recovery topologies

The Backup and Recovery Process

Management considerations for Backup and Recovery





Check Your KnowledgeWhat are three reasons for doing a Backup plan?

What are the three topologies that support creating a Backup set?

What are the advantages and disadvantages of using tape as the Backup media?

What are the three levels of granularity found in Backups?






EMC’s Product Implementation of a

Backup and Recovery Solution

At this point, let’s apply what we’ve learned to some real world examples. In this case, we will describe EMC’s product implementation of a Backup and Recovery solution.




Backupto disk

Disk-backup option

BasicTape backup and recovery

Advanced backupSnapshot

management

Remove riskFaster and more consistent data backup

Improve reliabilityKeep recovery copies fresh and reduce process errors

Lower total cost of ownershipCentralization and ease of use

Low SERVICE-LEVEL REQUIREMENTS High

Tiered Protection and Recovery ManagementEMC NetWorker

NetWorker’s installed base of more than 20,000 customers worldwide is a testament to the product’s market leadership.

Data-growth rates are accelerating, and the spectrum of data and systems that live in environments runs the gamut from key applications that are central to the business to other types of information that may be less important.

What is interesting is that the industry has been somewhat stuck for several years at a one-size-fits-all strategy to backup and recovery. We’re referring to a “basic” backup scenario, or traditional tape backup.

Tape backup serves a noble purpose and is working very well for some companies; it’s been EMC’s core business for some time, so EMC knows it well. But shifting market dynamics, as well as more demanding business environments, have lead to other important choices for backup.

Today, traditional tape faces the challenge of meeting service-level requirements for protection and availability of an ever-increasing quantity of enterprise data. This is why EMC has built into NetWorker key options to meet the needs of a wide range of environments. This includes the ability to use disk for backup, as well as to take advantage of advanced-backup capabilities that connect backup with array-based snapshot and replication management. These provide you with essentially the highest-possible performance levels for backup and recovery. As the value of information changes over time, you may choose any one of these, or a combination thereof, to meet your needs.




Enterprise protection– Critical applications– Heterogeneous platforms and

storage– Scalable architecture– 256-bit AES encryption and secure

authentication

Centralized management– Graphical user interface– Customizable reporting– Wizard-driven configuration

Performance– Data multiplexing– Advanced indexing– Efficient media management

Solution Features

Tape library

Basic ArchitectureHeterogeneous

clients

Backup server

Key applications

LAN

SANNAS

(NDMP)Storage

Node

NetWorker Backup and Recovery

The first key focus is on providing complete coverage. Enterprise protection means the ability to provide coverage for all the components in the environment. NetWorker provides data protection for the widest heterogeneous support of operating systems, and is integrated with leading databases and applications for complete data protection.

A single NetWorker server can be used to protect all clients and servers in the environment, or secondary servers can be employed, which EMC calls Storage Nodes, as a conduit for additional processing power or to protect large critical servers directly across a SAN without having to take data back over the network. Such LAN-free backup is standard with NetWorker.

NetWorker can easily back up environments in LAN, SAN, or WAN environments, with coverage for key storage such as NAS. As a matter of fact, NetWorker’s NAS-protection capabilities, leveraging the Network Data Management Protocol (NDMP), are unequaled.

The key here is that NetWorker can easily grow and scale as needed in the environment and provide advanced functionality, including clustering technologies, open-file protection and compatibility with tape hardware and the newclass of virtual-tape and virtual-disk libraries.

While NetWorker encompasses all these pieces in the environment, EMC has made sure there is a common set of management tools.

With NetWorker, EMC has focused on what it takes within environments both large and small to get the best performance possible, in terms of both speed and reliability. This means the inclusion of capabilities such as multiplexing to protect data as quickly as possible while making use of the backup storage’s maximum bandwidth. It also means ensuring that the way in which EMC indexes and manages the saving of data is designed to provide not only the best performance, but also stability and reliability.




Critical Application and Database Protection

Offline (Cold)

Integration with application APIsfor backup and recovery

Backup without Application Modules

Backup with NetWorker Application Modules

Shut down application

Restart application

Back up application

Application

SAVE

Application

DO

WN

TIM

E

Application

Net

Wor

ker M

OD

ULE

24x7

OPE

RA

TIO

NS

Applications can be backed up either offline or online. NetWorker by itself can back up closed applications as flat files. During an offline, or cold, backup, the application is shut down, backed up and restarted after the backup is finished.

This is fine, but during the shutdown and backup period, the application is unavailable. This is not acceptable in today’s business environments. This is why EMC has worked to integrate NetWorker with applications to provide online backup, specifically, with the use of NetWorker in conjunction with NetWorker Modules.

During an online, or hot, backup, the application is open and is backed up while open. The NetWorker Module extracts data for backup with an API; the application need not be shut down, and remains open while the backup finishes.

NetWorker supports a wide range of applications for online backup with granular-level recovery, including:

OracleMicrosoft ExchangeMicrosoft SQL ServerLotus NotesSybaseInformixIBM DB2EMC Documentum




NetWorkerUNIX/Linux

NetWorker Windows

Open Tape Format– Datastream multiplexing– Self-contained indexing– Cross-platform format

UNIX Windows Linux

– Minimize impact of tape corruption

Dynamic drive sharing– Cross-platform tape-drive sharing– On-demand device usage– Reduce hardware total cost of

ownership

Media-Management Advantages

One key advantage of NetWorker is its media-management features.

The first feature is Open Tape Format. It is NetWorker’s way of recording data to tape, specifically designed to provide several advantages:

Data can be multiplexed, or interleaved, for performance. This essentially means data can be accepted and written to the backup media as it comes in, regardless of what order it comes in, so the tape drives can keep spinning. This enables you to back up faster, but also reduces wear and tear on the tape hardware, which is more susceptible to error if it is continually stopping and starting.Tapes created by NetWorker are self-describing, so if everything else is gone except for the tape, you’ll be able to load it and understand what data is there to be restored.As the image on the right indicates, Open Tape Format allows you to move tape media between systems and servers on unlike operating systems, with Open Tape Format, a tape that began life on a UNIX-based system can easily be read on a Windows-based system. This is key not just for disaster recovery, but for the entire environment, as you go through a regular system lifecycle and adopt new platforms.Also, with Open Tape Format, NetWorker can skip bad spots on tape and continue data access. When other solutions on the market encounter any error on tape, they are unable to do anything further with the tape. Imagine if there is a bad spot 100 MB into a backup tape.Finally, NetWorker can broker tape devices on a SAN to get the best use and performance out of the hardware investment. So, instead of hard-assigning tape drives to a backup server or Storage Node, you can dynamically allocate any drive on demand.




Disk-backuptarget

NetWorker DiskBackup Option

High performance– Simultaneous-access operations– No penalty on restore versus tape

Policy-based migration of data from disk to tape

– Automated staging and cloning– Up to 50% faster– Clone backups jobs as they

complete– Reduce wear and tear on tape

drives and cartridges

Superior capability– Operational backup and recovery

for all clients, including NAS with NDMP

– Direct file access for fast recovery

Tape library

Backup-to-Disk Architecture

Heterogeneous clients

Backup server

Key applications

LAN

NAS Storage NodeSAN

The focus here is the resolution of the top pain points around traditional tape-based backup.

Performance NetWorker backup to disk allows for simultaneous-access operations to a volume, both reads (restore, staging, cloning) and writes (backups). With NetWorker, as opposed to with traditional tape-only backup, you don’t "pay a penalty on restore."

Also, cloning from disk to tape is up to 50% faster. Why? As soon as the Save Set (backup job) is complete, the cloning process can begin without the Administrator having to wait for all the backup jobs to complete. NetWorker can back up to disk and clone to tape at the same time. You don’t have to spend 12–16 hours a day running clone operations (tape-to-tape copies); in fact, you might actually be able to eliminate the clone jobs. Some NetWorker customers have seen cloning times reduced from 12–16 hours daily to three to four hours daily.

Cloning from disk to tape also augments the disaster-recovery strategy for tape. As data grows, more copies must be sent offsite. Because NetWorker backup to disk improves cloning performance, you can continue to meet the daily service-level agreements to get tapes offsite to a vaulting provider.

Taking the idea of leveraging disk even further, leads us into a discussion of NetWorker’s advanced backup capability, which also leverages disk-based technologies.




Advanced Backup - Snapshots and CDPIntegration of backup with snapshots, full-volume mirrors, and Continuous Data Protection (CDP)

Instant restore

Off-host backups

Achieve stringent recovery-time objectives (RTOs), recovery-point objectives (RPOs)

It is expected that snapshot technology for data protection will surpass backup to tape as the trend in data protection as organizations continue to focus on recovery times

Productioninformation

Recover

Backup

Productionserver

Backupserver

Snapshot 11:00 a.m.

Snapshot 5:00 p.m.

Backup snap10:00 p.m.

Disk-solution providers, like EMC, provide array-based abilities to perform snapshots and replication. These “point-in-time” copies of data allow for instant recovery of disk and data volumes. Many are likely familiar with array-based replication or snapshot capabilities.

NetWorker is engineered to take advantage of these capabilities by providing direct tie-ins with EMC offerings such as CLARiiON with SnapView, or Symmetrix with TimeFinder/Snap. This enables you to begin to meet the most stringent recovery requirements.

In a study done in the spring of 2004, the Taneja Group identified that the market intends to rely on snapshots for ensuring application-data availability and rapid recoveries. The figures represent a scale of one to five, with one as the low point, five as the high point:

Rapid application recovery (4.34)Ability to automate backup to tape (4.13)Instant backup (3.98)Roll back to point in time (3.88)Integration with backup strategy (3.87)Flexibility to leverage hardware (3.61)Multiple fulls throughout day (3.49)




Policy-based management– Administer snapshots in

NetWorker– Schedule, create, retain, and

delete snapshots by policy

Third-party integration– Leverage third-party replication

technologyArray-based (Symmetrix DMX, CLARiiON CX, etc.)Software-based (RecoverPoint)

Application recovery– Integration with Application

Modules to ensure consistent state

Exchange / SQL / Oracle / SAP

NetWorker PowerSnap Module

CLARiiON with

SnapViewTape library

Advanced BackupHeterogeneous

clients

Backup server

Key applications

LAN

SANNAS Storage

Node

In addition to traditional backup-and-recovery application modules for disk and tape, the snapshot management capability called NetWorker PowerSnap enables you to meet the demanding service-level agreement requirements in both tape and disk environments by seamlessly integrating snapshot technology and applications. NetWorker PowerSnap software works with NetWorker Modules to enable snapshot backups of applications—with consistency.

PowerSnap performs snapshot management by policy, just like standard backup policies to tape or disk. It uses these policies to determine how many snapshots to create, how long to retain the snapshots, when to do backups to tape from specified snapshots…all based on business needs that you define.

For example, snapshots might be taken every few hours, and the three most recent are retained. You can easily leverage any of those snapshots to back up to tape in an off-host fashion, i.e., with no impact to the application servers.

PowerSnap manages the full life cycle of snapshots, including creation, scheduling, backups, and expiration. This, along with its orchestration with applications, provides a comprehensive solution for complete application-data protection to help you meet the most stringent of RTOs and RPOs.




Block-level backups– Host-based snapshot– Targeted at high-density file

systems– Single-file restore– Sparse backups

High performance– Significant backup-and-restore

performance impact—up to 10 times faster

– Drive tape at rated speeds– Optional network-accelerated

serverless backup with Cisco intelligent switch

NetWorker SnapImage Module

1,000,000+ directories10,000,000+ files

Advanced Backup

If there are servers with lots of files and lots of directories, what we refer to as high-density file systems, backup and recovery are particularly challenging. With so many files, traditional backup struggles to keep up with backup windows.

NetWorker SnapImage enables block-level backup of these file systems while maintaining the abilityto restore a single file. SnapImage is intelligent enough to also support sparse backups.

Sparse files contain data with portions of empty blocks, or “zeroes.”NetWorker backs up only the non-zero blocks, thereby reducing:− Time for backup −Amount of backup-media space consumed

Sparse-file examples:− Large database files with deleted data or unused database fields− Files from image applications

With the NetWorker SnapImage Module, backup and recovery of servers with high-density file systems is significantly increased:− The time required to back up 18.8 million 1 KB files in a 100 GB file system with a block size

of 4 KB can be reduced from 31 to seven hours.− The time required to perform a Save Set restore of one million 4 KB files in a 5.36 GB internal

disk can be reduced from 72 to seven minutes.




Solution Example: Major Telecom Company

Value PropositionZero backup window for applicationsEliminated data-loss riskReduced management overhead

Business Challenge:Complex application environmentNo backup window Recovery-time objective: Restore 24 TB in two hours

Enterprise-Information Protection

Solution:NetWorker PowerSnap with Symmetrix and TimeFinder/Snap

– Server-free backup

NetWorker DiskBackup Option with CLARiiON with ATA disks

– Rapid primary-site protection

NetWorker and SRDF/S– Disaster recovery– Offsite protection

55

Disaster-Recovery Site Production Site

SymmetrixDMX

Applicationhost

NetWorker

Storage Node PowerSnap

Disaster-recovery host

CLARiiONCX

SymmetrixDMX

Storage Node

Tapelibrary SAN

SAN Tapelibrary

SRDF/S

An example of EMC NetWorker solution is shown in the slide.

EMC has worked with a large telecommunications company to meet their most demanding IT challenges:Complex application environment—Oracle, and lots of dataNo backup window Recovery-time objective: Restore 24 TB in two hours.

They chose to implement NetWorker, along with other key EMC offerings, to achieve a superior level of protection and recovery management—and confidence in the ability to recover.

Solution:NetWorker PowerSnap with Symmetrix and TimeFinder/Snap− Server-free backup and rapid recovery

NetWorker DiskBackup with CLARiiON with ATA disks− Rapid primary-site protection and recovery

NetWorker and SRDF/S− Disaster recovery, offsite protection

Here is what they have been able to achieve with the above:Zero backup time for their applicationsZero data lossSignificantly reduced management overhead

Not all environments are this complex or demanding, but NetWorker can meet any backup and recovery requirements, and can easily be upgraded to meet more stringent requirements as needed.




SummaryKey points covered in this topic:

EMC’s product implementation of a Backup and Recovery solution

In this topic, we described EMC’s product implementation of a Backup and Recovery solution.

This concludes the module.

Backup and Recovery Case Study

Business Profile: A manufacturing corporation uses tape as their primary backup storage media throughout the entire organization. Current Situation/Issue: Full backups are run every Sunday. Incremental backups are run from Monday through Saturday. There are many backup servers in the environment, backing up different groups of servers. Their e-mail and database applications have to be shut down during the backup process. The main concerns facing the corporation are: 1) Due to the de-centralized backup environment, recoverability of the backup servers is

compromised. 2) Key applications have to be shut down during the backup process. 3) Too many tapes need to be mounted in order to perform a full recover, in case of a

complete failure. The company would like to: 1) Deploy an easy-to-manage backup environment. 2) Reduce the amount of time the email and database applications need to be shutdown. 3) Reduce the number of tapes required to fully recover a server in case of failure. Proposal: Propose a backup and recovery solution to address the company’s concern. Justify how your solution will ensure that the company’s needs are met.


Business Continuity – Local Replication - 1

© 2007 EMC Corporation. All rights reserved. Business Continuity – Local Replication - 1

Local ReplicationUpon completion of this module, you will be able to:

Discuss replicas and the possible uses of replicas

Explain consistency considerations when replicating file systems and databases

Discuss host and array based replication technologies– Functionality– Differences– Considerations– Selecting the appropriate technology

In this module, we will look at what replication is, technologies used for creating local replicas, and things that need to be considered when creating replicas.





What is ReplicationReplica - An exact copy (in all details)

Replication - The process of reproducing data

Original Replica

REPLICATIONREPLICATION

Local replication is a technique for ensuring Business Continuity by making exact copies of data. With replication, data on the replica is identical to the data on the original at the point-in-time that the replica was created.

Examples: Copy a specific fileCopy all the data used by a database applicationCopy all the data in a UNIX Volume Group (including underlying logical volumes, file systems, etc.)Copy data on a storage array to a remote storage array




Possible Uses of ReplicasAlternate source for backup

Source for fast recovery

Decision support

Testing platform

Migration

Replicas can be used to address a number of Business Continuity functions:Provide an alternate source for backup to alleviate the impact on productionProvide a source for fast recovery to facilitate faster RPO and RTODecision Support activities such as reporting − For example, a company may have a requirement to generate periodic reports. Running the

reports off of the replicas greatly reduces the burden placed on the production volumes. Typically reports would need to be generated once a day or once a week, etc.

Developing and testing proposed changes to an application or an operating environment− For example, the application can be run on an alternate server using the replica volumes and

any proposed design changes can be tested Data migration −Migration can be as simple as moving applications from one server to the next, or as

complicated as migrating entire data centers from one location to another.




ConsiderationsWhat makes a replica good– Recoverability

Considerations for resuming operations with primary– Consistency/re-startability

How is this achieved by various technologies

Kinds of Replicas– Point-in-Time (PIT) = finite RPO– Continuous = zero RPO

How does the choice of replication technology tie back into RPO/RTO

Key factors to consider with replicas:What makes a replica good:−Recoverability from a failure on the production volumes. The replication technology must

allow for the restoration of data from the replicas to the production and then allow production to resume with a minimal RPO an RTO

−Consistency/re-startability is very important if data on the replicas will be accessed directly or if the replicas will be used for restore operations

Replicas can either be Point-in-Time (PIT) or continuous:− Point-in-Time (PIT) - the data on the replica is an identical image of the production at some

specific timestamp For example, a replica of a file system is created at 4:00 PM on Monday. This replica would then be referred to as the Monday 4:00 PM Point-in-Time copy Note: The RPO will be a finite value with any PIT. The RPO will map to the time when the PIT was created to the time when any kind of failure on the production occurred. If there is a failure on the production at 8:00 PM and there is a 4:00 PM PIT available, the RPO would be 4 hours (8 – 4 = 4). To minimize RPO with PITs, take periodic PITs

−Continuous replica - the data on the replica is synchronized with the production data at all times

The objective with any continuous replication is to reduce the RPO to zero




Replication of File SystemsHost

Apps

Volume Management

DBMS Mgmt Utilities

File System

Multi-pathing Software

Device Drivers

HBA HBA HBA

Operating System

Physical Volume

Buffer

Most OS file systems buffer data in the host before the data is written to the disk on which the file system resides.

For data consistency on the replica, the host buffers must be flushed prior to the creation of the PIT. If the host buffers are not flushed, the data on the replica will not contain the information that was buffered on the host.Some level of recovery will be necessary

Note: If the file system is unmounted prior to the creation of the PIT, no recovery would be needed when accessing data on the replica.




A database application may be spread out over numerous files, file systems, and devices,all of which must be replicated

Database replication can be offline or online

Replication of Database Applications

LogsData

Database replication can be offline or online:Offline – replication takes place when the database and the application are shutdownOnline – replication takes place when the database and the application are running




Database: Understanding ConsistencyDatabases/Applications maintain integrity by following the “Dependent Write I/O Principle”– Dependent Write: A write I/O that will not be issued by an application

until a prior related write I/O has completedA logical dependency, not a time dependency

– Inherent in all Database Management Systems (DBMS) e.g. Page (data) write is dependent write I/O based on a successful log write

– Applications can also use this technology– Necessary for protection against local outages

Power failures create a dependent write consistent imageA Restart transforms the dependent write consistent to transactionallyconsistent

i.e. Committed transactions will be recovered, in-flight transactions will be discarded

All logging database management systems use the concept of dependent write I/Os to maintain integrity. This is the definition of dependent write consistency. Dependent write consistency is required for the protection against local power outages, loss of local channel connectivity, or storage devices. The logical dependency between I/Os is built into database management systems, certain applications, and operating systems.




Database Replication: Transactions

Data

Log

Database Application

4 4

3 3

2 2

1 1

Buffer

Database applications require that for a transaction to be deemed complete, a series of writes have to occur in a particular order (Dependent Write I/O). These writes would be recorded on the various devices/file systems.

In this example, steps 1-4 must complete for the transaction to be deemed complete− Step 4 is dependent on Step 3 and will occur only if Step 3 is complete− Step 3 is dependent on Step 2 will occur only if Step 2 is complete− Step 2 is dependent on Step 1 will occur only if Step 1 is complete

Steps 1-4 are written to the database’s buffer and then to the physical disks




Database Replication: Consistency

Data

Log

Source Replica

Consistent

4 4

3 3

2 2

1 1

Log

Data

Note: In this example, the database is online.

At the point in time when the replica is created, all the writes to the source devices must be captured on the replica devices to ensure data consistency on the replica.

In this example, steps 1-4 on the source devices must be captured on the replica devices for the data on the replicas to be consistent.




Database Replication: Consistency

Data

Log

Source Replica

Inconsistent

Note: In this example, the database is online.

4 4

3 3

2

1

Creating a PIT for multiple devices happens quickly, but not instantaneously. Steps 1-4 which are dependent write I/Os have occurred and have been recorded successfully on the source devicesIt is possible that steps 3 and 4 were copied to the replica devices, while steps 1 and 2 were not copiedIn this case, the data on the replica is inconsistent with the data on the source. If a restart were to be performed on the replica devices, Step 4 which is available on the replica might indicate that a particular transaction is complete, but all the data associated with the transaction will be unavailable on the replica making the replica inconsistent.




Database Application

(Offline)

Database Replication: Ensuring Consistency

Data

Log

Source Replica

Consistent

Off-line Replication– If the database is offline or

shutdown and then a replica is created, the replica will be consistent

– In many cases, creating an offline replica may not be a viable due to the 24x7 nature of business

Database replication can be performed with the application offline (i.e., application is shutdown, no I/O activity) or online (i.e., while the application is up and running). If the application is offline, the replica will be consistent because there is no activity. However, consistency is an issue if the database application is replicated while it is up and running.




Online Replication– Some database applications allow

replication while the application is up and running

– The production database would have to be put in a state which would allow it to be replicated while it is active

– Some level of recovery must be performed on the replica to make the replica consistent


Data

Log

Source Replica

Inconsistent

4 4

3 3

2

1

In the situation shown, Steps 1-4 are dependent write I/Os. The replica is inconsistent because Steps 1 and 2 never made it to the replica. To make the database consistent, some level of recovery would have to be performed. In this example, it could be done by simply discarding the transaction that was represented by Steps 1-4. Many databases are capable of performing such recovery tasks.





5

Source Replica

Consistent

4 4

3 3

2 2

1 1

5

An alternative way to ensure that an online replica is consistent is to:−Hold I/O to all the devices at the same instant−Create the replica−Release the I/O

Holding I/O is similar to a power failure and most databases have the ability to restart from a power failure.

Note: While holding I/O simultaneously, one ensures that the data on the replica is identical to that on the source devices. The database application times out if I/O is held for too long.




Tracking Changes After PIT Creation

At PIT

Source = Target

Later

Source ≠ Target

Resynch

Source = Target

Changes occur on the production volume after the creation of a PIT. Changes could also occur on the target. Typically the target device re-synchronizes with the source device at some future time in orderto obtain a more recent PIT.

Note: The replication technology employed should have a mechanism to keep track of changes. This makes the re-synchronization process much faster. If the replication technology does not track changes between the source and target, every resynchronization operation has to be a full operation.




Local Replication TechnologiesHost based– Logical Volume Manager (LVM) based mirroring– File System Snapshots

Storage Array based– Full volume mirroring– Full volume: Copy on First Access– Pointer based: Copy on First Write

Replication technologies can classified by:Distance over which replication is performed - local or remoteWhere the replication is performed - host or array based−Host based - all the replication is performed by using the CPU resources of the host using

software that is running on the host. −Array based - all replication is performed on the storage array using CPU resources on the array

via the array’s operating environment.

Note: In the context of this discussion, local replication refers to replication that is performed within a data center if it is host based, and within a storage array if it is array based.




Logical Volume Manager: Review

Physical Storage

Logical Storage

LVM

Host resident software responsible for creating and controlling host level logical storage– Physical view of storage is converted to a

logical view by mapping. Logical data blocks are mapped to physical data blocks

– Logical layer resides between the physical layer (physical devices and device drivers) and the application layer (OS and applications see logical view of storage)

Usually offered as part of the operating system or as third party host softwareLVM Components:– Physical Volumes– Volume Groups– Logical Volumes

Logical Volume Managers (LVMs) introduce a logical layer between the operating system and thephysical storage. LVMs have the ability to define logical storage structures that can span multiple physical devices. The logical storage structures appear contiguous to the operating system and applications.

The fact that logical storage structures can span multiple physical devices provides flexibility and additional functionality:

Dynamic extension of file systemsHost based mirroringHost based striping

The Logical Volume Manager provides a set of operating system commands, library subroutines, and other tools that enable the creation and control of logical storage.




Volume Groups

Physical Disk

Block

Volume Group

Physical Volume 1

Physical Volume 2

Physical Volume 3

One or more Physical Volumes form a Volume Group

LVM manages Volume Groups as a single entity

Physical Volumes can be added and removed from a Volume Group as necessary

Physical Volumes are typically divided into contiguous equal-sized disk blocks

A host will always have at least one disk group for the Operating System– Application and Operating

System data maintained in separate volume groups

A Volume Group is created by grouping together one or more Physical Volumes. Physical Volumes:Can be added or removed from a Volume Group dynamically Cannot be shared between Volume Groups, the entire Physical Volume becomes part of a Volume Group

Each Physical Volume is partitioned into equal-sized data blocks. The size of a Logical Volume is based on a multiple of the equal-sized data block.

The Volume Group is handled as a single unit by the LVM. A Volume Group as a whole can be activated or deactivated A Volume Group would typically contain related information. For example, each host would have a Volume Group which holds all the OS data, while applications would be on separate Volume Groups.

Logical Volumes are created within a given Volume Group. A Logical Volume can be thought of as a virtual disk partition, while the Volume Group itself can be though of as a disk. A Volume Group can have a number of Logical Volumes.




Logical Volumes

Logical Disk Block

Volume GroupPhysical Disk

Block

Physical Volume 1 Physical Volume 2 Physical Volume 3

Logical Volume

Logical Volume

Logical Volumes (LV) form the basis of logical storage. They contain logically contiguous data blocks (or logical partitions) within the volume group. Each logical partition is mapped to at least one physical partition on a physical volume within the Volume Group. The OS treats an LV like a physical device and accesses it via device special files (character or block). A Logical Volume:

Can only belong to one Volume Group. However, a Volume Group can have multiple LVsCan span multiple physical volumesCan be made up of physical disk blocks that are not physically contiguousAppears as a series of contiguous data blocks to the OSCan contain a file system or be used directly. Note: There is a one-to-one relationship between LV and a File System

Note: Under normal circumstances, there is a one-to-one mapping between a logical and physical Partition. A one-to-many mapping between a logical and physical partition leads to mirroring of Logical Volumes.




Host Based Replication: Mirrored Logical Volumes

Host Logical Volume

Logical Volume

PhysicalVolume 1

PVID1 VGDA

PhysicalVolume 2

PVID2 VGDA

Logical Volumes may be mirrored to improve data availability. In mirrored logical volumes, every logical partition maps to 2 or more physical partitions on different physical volumes.

Logical volume mirrors may be added and removed dynamicallyA mirror can be split and data contained used independently

The advantages of mirroring a Logical Volume are high availability and load balancing during reads if the parallel policy is used. The cost of mirroring is additional CPU cycles necessary to perform two writes for every write and the longer cycle time needed to complete the writes.




Host Based Replication: File System SnapshotsMany LVM vendors will allow the creation of File System Snapshots while a File System is mounted

File System snapshots are typically easier to manage than creating mirrored logical volumes and then splitting them

Many Logical Volume Manager vendors will allow the creation of File System snapshots while a File System is mounted. File System snapshots are typically easier to manage than creating mirrored logical volumes and then splitting them.




Host (LVM) Based Replicas: DisadvantagesLVM based replicas add overhead on host CPUs

If host devices are already Storage Array devices then the added redundancy provided by LVM mirroring is unnecessary– The devices will have some RAID protection already

Host based replicas can be usually presented back to the same server

Keeping track of changes after the replica has been created

Host based replicas can be usually presented back to the same server:Using the replica from the same host for any BC operation adds an additional CPU burden on the serverReplica is useful for fast recovery if there is any logical corruption on the source at the File System levelReplica itself may become unavailable if there is a problem at the Volume Group levelIf the Server fails, then the replica and the source would be unavailable until the server is brought online or another server is given access to the Volume groupPresenting a LVM based local replica to a second host is usually not possible because the replica will still be part of the volume group which is usually accessed by one host at any given time

Keeping track of changes after the replica has been created:If changes are not tracked, all future resynchronization will be a full operationSome LVMs may offer incremental resynchronization




Replication performed by the Array Operating Environment

Replicas are on the same array

Storage Array Based Local Replication

Production Server

Business Continuity Server

Array

ReplicaSource

With storage array based local replication:Replication performed by the Array Operating Environment−Array CPU resources are used for the replication operations−Host CPU resources can be devoted to production operations instead of replication operations

Replicas are on the same array−Can be accessed by an alternate host for any BC operations

Typically array based replication is performed at a array device level.−Need to map storage components used by an application back to the specific array devices used

– then replicate those devices on the array.−A database could be laid out on over multiple physical volumes which belong. One would have

to replicate all the devices for a PIT copy of the database.




Typically Array based replication is done at a array device level– Need to map storage components used by an application/file system

back to the specific array devices used – then replicate those devices on the array

Array 1

Storage Array Based – Local Replication Example

File System 1

Volume Group 1

Logical Volume 1 Source Vol 1

Replica Vol 1

Source Vol 2

Replica Vol 2

c12t1d1 c12t1d2

In this example, File System 1 has to be replicated. File System 1 is actually built on Logical Volume 1, which in turn is a part of Volume Group 1, which is made up of two Physical Volumes c12t1d1 and c12t1d2These physical volumes are actually residing in Array 1 and are Source Vol1 and Source Vol2In order to replicate File System 1, one has to actually replicate the two Array Devices Since 2 Array Volumes have to replicated we need two Array Volumes to act as the replica volumes. In this example Replica Vol1 and Replica Vol2 are used for the replication




Array Based Local Replication: Full Volume Mirror

Source Target

Attached

Array

Read/Write Not Ready

Full volume mirroring is achieved by attaching the target device to the source device and then copying all the data from the source to the target. The target is unavailable to its host while it is attached to the source, and the synchronization occurs.

Target (Replica) device is attached to the Source device and the entire data from the source device is copied over to the target deviceDuring this attachment and synchronization period, the Target device is unavailable




Array Based Local Replication: Full Volume Mirror

Source Target

Detached - PIT

Read/Write Read/Write

Array

After the synchronization is complete, the target can be detached from the source and be made available for Business Continuity operations. The point-in-time (PIT) is determined by the time of detachment or separation of the Source and Target. For example, if the detachment time is 4:00 PM, the PIT of the replica is 4:00 PM.




Array Based Local Replication: Full Volume MirrorFor future re-synchronization to be incremental, most vendors have the ability to track changes at some level of granularity (e.g., 512 byte block, 32 KB, etc.)– Tracking is typically done with some kind of bitmap

Target device must be at least as large as the Source device– For full volume copies the minimum amount of storage required is

the same as the size of the source

For future re-synchronization to be incremental, most vendors have the ability to track changes at some level of granularity, such as 512 byte block, 32 KB, etc. Tracking is typically done with some kind of bitmap.

The target device must be at least as large as the source device. For full volume copies, the minimum amount of storage required is the same as the size of the source.




Copy on First Access (COFA)Target device is made accessible for BC tasks as soon as the replication session is started

Point-in-Time is determined by time of activation

Can be used in Copy First Access mode (deferred) or in Full Copy mode

Target device is at least as large as the Source device

Copy on First Access (COFA) provides an alternate method to create full volume copies. Unlike Full Volume mirrors, the replica is immediately available when the session is started (no waiting for full synchronization).

The PIT is determined by the time of activation of the session. Just like the full volume mirror technology, this method requires the Target devices to be at least as large as the source devices.A protection map is created for all the data on the Source device at some level of granularity (e.g., 512 byte block, 32 KB, etc.). Then the data is copied from the source to the target in the background based on the mode with which the replication session was invoked.




Write to SourceCopy on First Access Mode: Deferred Mode

Source Target


Write to Target

Read from Target

Source Target

Source Target



In the Copy on First Access mode (or the deferred mode), data is copied from the source to the target only when:

A write is issued for the first time after the PIT to a specific address on the source A read or write is issued for the first time after the PIT to a specific address on the target.

Since data is only copied when required, if the replication session is terminated, the target device only has data that was copied (not the entire contents of the source at the PIT). In this scenario, the data on the target cannot be used as it is incomplete.




Copy on First Access: Full Copy ModeOn session start, the entire contents of the Source device is copied to the Target device in the background

Most vendor implementations provide the ability to track changes: – Made to the Source or Target – Enables incremental re-synchronization

In Full Copy Mode, the target is made available immediately and all the data from the source is copied over to the target in the background.

During this process, if a data block that has not yet been copied to the target is accessed, the replication process jumps ahead and moves the required data block first. When a full copy mode session is terminated (after full synchronization), the data on the target is still usable as it is a full copy of the original data.




Array: Pointer Based Copy on First WriteTargets do not hold actual data, but hold pointers to where the data is located– Actual storage requirement for the replicas is usually a small fraction

of the size of the source volumes

A replication session is setup between the Source and Target devices and started– When the session is setup based on the specific vendors

implementation a protection map is created for all the data on the Source device at some level of granularity (e.g 512 byte block, 32 KB etc.)

– Target devices are accessible immediately when the session is started

– At the start of the session the Target device holds pointers to the data on the Source device

Unlike full volume replicas, the target devices for pointer based replicas only hold pointers to the location of the data but not the data itself. When the copy session is started, the target device holds pointers to the data on the source device. The primary advantage of pointer based copies is the reduction in storage requirement for the replicas.




Pointer Based Copy on First Write Example

Source Save Location

TargetVirtual Device

The original data block from the Source is copied to the save location, when a data block is first written to after the PIT.

Prior to a new write to the source or target device:−Data is copied from the source to a “save” location− The pointer for that specific address on the Target then points to the “save” location−Writes to the Target result in writes to the “save” location and the updating of the pointer to the

“save” locationIf a write is issued to the source for the first time after the PIT, the original data block is copied to the save location and the pointer is updated from the Source to the save location. If a write is issued to the Target for the first time after the PIT, the original data is copied from the Source to the Save location, the pointer is updated and then the new data is written to the save location.Reads from the Target are serviced by the Source device or from the save location based on the where the pointer directs the read.− Source – When data has not changed since PIT− Save Location – When data has changed since PIT

Data on the replica is a combined view of unchanged data on the Source and the save location. Hence, if the Source device becomes unavailable, the replica no longer has valid data.




Array Replicas: Tracking ChangesChanges will/can occur to the Source/Target devices after PIT has been created

How and at what level of granularity should this be tracked– Too expensive to track changes at a bit by bit level

Would require an equivalent amount of storage to keep track of which bit changed for each the source and the target

– Based on the vendor some level of granularity is chosen and a bit map is created (one for Source and one for Target)

One could choose 32 KB as the granularityFor a 1 GB device changes would be tracked for 32768, 32KB chunksIf any change is made to any bit on one 32KB chunk the whole chunk is flagged as changed in the bit map1 GB device map would only take up 32768/8/1024 = 4KB space

It is too expensive to track changes at a bit by bit level because it would require an equivalent amount of storage to keep track of which bit changed for both the Source and the Target.

Some level of granularity is chosen and a bit map is created -- one for the Source and one for the Target. The level of granularity is vendor specific.




Source

Target

Array Replicas: How Changes Are Determined

0 = unchanged = changed

Re-synch(Source to

Target)

At PIT

Target

SourceAfter PIT…

Target

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 0 1 0 1 0 0

0 0 1 1 0 0 0 1

1 0 1 1 0 1 0 1

1

Differential/incremental re-synchronization (Source to Target) or restore (Target to Source): The bitmaps for the source and target are all set to 0 at the PITAny changes to the source or target after PIT are flagged by setting appropriate bit to 1 in the bitmapWhen re-synchronization or a restore is required, a logical OR operation between the source bitmap and the target bitmap is performed. The bitmap resulting from this operation references all blocks that have been modified in either the source or the target. This enables an optimized re-synchronization or a restore operation, as it eliminates the need to copy all the blocks between the source and the target.If re-synchronization is required, then changes to the target are overwritten with the corresponding blocks from the source. In the example shown, these would be blocks 3, 4, and 8 on the target (starting from left). Changes to the source are propagated to the target. In the example shown, these would be blocks 1, 4, and 6 on the source.If restore is required, then changes to the source are overwritten with the corresponding blocks from the target. Changes to the target are propagated to the source.In either case, both the changes to the source and to the target cannot be simultaneously preserved.

The result of the logical OR operation is shown on slide. The direction of data movement will be depend on whether a re-synchronization or a restore operation is performed.

Question: In a hypothetical scenario, wherein starting with block 1, every alternate block on the source has changed, and starting with block 2, every alternate block on the target has changed, what would be the result and impact on re-synchronization or a restore?




Array Replication: Multiple PITs

06:00 A.M.

: 12 : 01 : 02 : 03 : 04 : 05 : 06 : 07 : 08 : 09 : 10 : 11 : 12 : 01 : 02 : 03 : 04 : 05 : 06 : 07 : 08 : 09 : 10 : 11 :

P.M.A.M.

12:00 P.M.

06:00 P.M.

12:00 A.M.

Source

Target Devices

Point-In-Time

Most array based replication technologies allow the Source devices to maintain replication relationships with multiple Targets.

This can also reduce RTO because the restore can be a differential restoreEach PIT could be used for a different BC activity and also as restore points

In this example, a PIT is created every six hours from the same source. If any logical or physical corruption occurs on the Source, the data can be recovered from the latest PIT and at worst, the RPO will be 6 hours.




Array Replicas: Ensuring Consistency

InconsistentConsistent

Source Replica

4 4

3 3

2 2

1 1

Source Replica

4 4

3 3

2

1

Most array based replication technologies allow the creation of consistent replicas by holding I/O to all devices simultaneously when the PIT is created.

Typically, applications are spread out over multiple devices−Could be on the same array or multiple arrays

Replication technology must ensure that the PIT is consistent for the whole application−Need mechanism to ensure that updates do not occur while PIT is created

Hold I/O to all devices simultaneously for an instant, create PIT and release I/O −Cannot hold I/O for too long, application will timeout




Mechanisms to Hold I/OHost based

Array based

What if the application straddles multiple hosts and multiple arrays

Mechanisms to hold I/O:Host based− Some host based application could be used to hold I/O to all the array devices that are to be

replicated when the PIT is created− Typically achieved at the device driver level or above before the I/O reaches the HBAs

Some vendors implement this at the multi-pathing software layerArray based− I/Os can be held for all the array devices that are to be replicated by the Array Operating

Environment in the array itself when the PIT is created

What if the application straddles multiple hosts and multiple arrays?Federated DatabasesSome array vendors are able to ensure consistency in this situation




Array Replicas: Restore/Restart ConsiderationsProduction has a failure– Logical Corruption– Physical failure of production devices– Failure of Production server

Solution– Restore data from replica to production

The restore would typically be done in an incremental manner and the Applications would be restarted even before the synchronization is complete leading to very small RTO

-----OR------– Start production on replica

Resolve issues with production while continuing operations on replicasAfter issue resolution restore latest data on replica to production

Failures can occur in many different ways:There could be a logical corruption of the data on the production devices, the devices are available but the data on them is corrupt. In this case, opt to restore the data to the production from the latest replica. Production devices may become unavailable due to physical failures (Production server down, physical drive failure etc.). In this case, start the production on the latest replica and then while the production is being done from the replicas fix the physical problems on the Production side. Once the situation has been resolved, the latest information from the replica devices can be restored back to the production volumes.

In either of these scenarios, it is a good idea to stop access to the production and replica devices, and then identify the replica that will be used for the restore or the restart operations.




Array Replicas: Restore/Restart ConsiderationsBefore a Restore– Stop all access to the Production devices and the Replica devices– Identify Replica to be used for restore

Based on RPO and Data Consistency– Perform Restore

Before starting production on Replica– Stop all access to the Production devices and the Replica devices– Identify Replica to be used for restart

Based on RPO and Data Consistency– Create a “Gold” copy of Replica

As a precaution against further failures– Start production on Replica

RTO drives choice of replication technology

Based on the type of failure, choose to either perform a restore to the production devices or shift production operations to the replica devices. In either case, the recommendation would be to stop access to the production and replica devices, then identify the replica to be used for the restore or restart operations.

The choice of replica depends on the consistency of the data on the replica and the desired RPO (e.g., a business may create PIT replicas every 2 hours; if a failure occurs, then at most only 2 hours of data would have been lost). If a replica has been written (application testing for example) to after the creation of the PIT, then this replica may not be a viable candidate for the restore or restart.

Note: RTO is a key driver in the choice of replication technology. The ability to restore or restart almost instantaneously after any failure is very important.




Array Replicas: Restore ConsiderationsFull Volume Replicas– Restores can be performed to either the original source device or to

any other device of like sizeRestores to the original source could be incremental in natureRestore to a new device would involve a full synchronization

Pointer Based Replicas– Restores can be performed to the original source or to any other

device of like size as long as the original source device is healthyTarget only has pointers

Pointers to source for data that has not been written to after PIT Pointers to the “save” location for data was written after PIT

Thus to perform a restore to an alternate volume the source must be healthy to access data that has not yet been copied over to the target

With Full Volume replicas, all the data that was on the source device when the PIT was created is available on the Replica (either with Full Volume Mirroring or Full Volume Copies). With Pointer Based Replicas and Full Volume Copies in deferred mode (COFA), access to all the data on the Replica is dependent on the health (accessibility) of the original source volumes. If the original source volume is inaccessible for any reason, pointer based or Full Volume Copy on First Access replicas are of no use in either a restore or a restart scenario.




Array Replicas: Which TechnologyFull Volume Replica– Replica is a full physical copy of the source device– Storage requirement is identical to the source device– Restore does not require a healthy source device– Activity on replica will have no performance impact on the source

device– Good for full backup, decision support, development, testing and

restore to last PIT– RPO depends on when the last PIT was created– RTO is extremely small

Full Volume replicas have a number of advantages over Pointer based (COFW) and Copy On First Access technologies.

The replica has the entire contents of the original source device from the PIT and any activity to the replica has no performance impact on the source device (there is no COFA or COFW penalty) Full Volume replicas can be used for any BC activity The only disadvantage is that the storage requirements for the replica are at least equal to that of the source devices




Array Replicas: Which Technology (continued)Pointer based - COFW– Replica contains pointers to data

Storage requirement is a fraction of the source device (lower cost)– Restore requires a healthy source device– Activity on replica will have some performance impact on source

Any first write to the source or target will require data to be copied to the save location and move pointer to save locationAny read I/O to data not in the save location will have to be serviced by the source device

– Typically recommended if the changes to the source are less than30%

– RPO depends on when the last PIT was created– RTO is extremely small

The main benefit of Pointer based copies is the lower storage requirement for the replicas. This technology is also very useful when the changes to the Source are expected to be less that 30% after the PIT has been created. Heavy activity on the Target devices may cause performance impact on the Source because any first writes to the Target require data to be copied from the source to the Save location. Also, any reads which are not in the save area have to be read from the Source device. The Source device needs to be accessible for any restart or restore operations from the Target.




Array Replicas: Which TechnologyFull Volume – COFA Replicas– Replica only has data that was accessed– Restore requires a healthy source device– Activity on replica will have some performance impact

Any first access on target will require data to be copied to target before the I/O to/from target can be satisfied

– Typically replicas created with COFA only are not as useful as replicas created with the full copy mode – Recommendation would be to use the full copy mode if the technology allows such an option

Listed here, are some considerations for using Full Volume Copy on First Access (COFA).

The COFA technology requires at least the same amount of storage as the Source. The disadvantages of the COFA penalty, and the fact that the replica would be of no use if the source volume were inaccessible, make this technology less desirable. If a Full Copy mode is available, then always use the Full Copy mode. The advantages are identical to that discussed for Full Volume replicas.




Array Replicas: Full Volume vs. Pointer Based

Full Volume Pointer Based

Required Storage 100% of Source Fraction of Source

Performance Impact None Some

RTO Very small Very small

Restore Source need not be healthy

Requires a healthy source device

Data change No limits < 30%

This table summarizes the differences between Full Volume and Pointer Base replication technologies.





Replicas and the possible use of Replicas

Consistency considerations when replicating File Systems and Databases

Host and Array based Replication Technologies– Advantages/Disadvantages– Differences– Considerations– Selecting the appropriate technology





Check Your KnowledgeWhat is a replica?

What are the possible uses of a replica?

What is consistency in the context of a database?

How can consistency be ensured when replicating a database?

Discuss one host based replication technology

What is the difference between full volume mirrors and pointer based replicas?

What are the considerations when performing restore operations for each replication technology?






EMC’s Local Replication Solutions for the Symmetrix and CLARiiON Arrays

EMC’s TimeFinder/Mirror Replication Solution

EMC’s SnapView - Snapshot Replication Solution

At this point, let’s apply what we’ve learned to some real world examples. Upon completion of thistopic you will be able to:

List EMC’s Local Replication Solutions for the Symmetrix and CLARiiON arrays;

Describe EMC’s TimeFinder/Mirror Replication Solution; and

Describe EMC’s SnapView - Snapshot Replication Solution




EMC – Local Replication SolutionsEMC Symmetrix Arrays– EMC TimeFinder/Mirror

Full volume mirroring– EMC TimeFinder/Clone

Full volume replication– EMC TimeFinder/SNAP

Pointer based replication

EMC CLARiiON Arrays– EMC SnapView Clone

Full volume replication– EMC SnapView Snapshot

Pointer based replication

All the Local Replication solutions that were discussed in this module are available on EMC Symmetrix and CLARiiON arrays.

EMC TimeFinder/Mirror and EMC TimeFinder/Clone are full volume replication solutions on the Symmetrix arrays, while EMC TimeFinder/Snap is a pointer based replication solution on the Symmetrix. EMC SnapView on the CLARiiON arrays allows full volume replication via SnapView Clone and pointer based replication via SnapView Snapshot.EMC TimeFinder/Mirror: Highly available, ultra-performance mirror images of Symmetrix volumes that can be non-disruptively split off and used as point-in-time copies for backups, restores, decision support, or contingency uses.EMC TimeFinder/Clone: Highly functional, high-performance, full volume copies of Symmetrix volumes that can be used as point-in-time copies for data warehouse refreshes, backups, online restores, and volume migrations.EMC SnapView Clone: Highly functional, high-performance, full volume copies of CLARiiON volumes that can be used as point-in-time copies for data warehouse refreshes, backups, online restores, and volume migrations.EMC TimeFinder/Snap: High function, space-saving, pointer-based copies (logical images) of Symmetrix volumes that can be used for fast and efficient disk-based restores.EMC SnapView Snapshot: High function, space-saving, pointer-based copies (logical images) of CLARiiON volumes that can be used for fast and efficient disk-based restores.

EMC TimeFinder/Mirror and EMC SnapView Snapshot are discussed in more detail on the next few slides.




EMC TimeFinder/Mirror - IntroductionArray based local replication technology for Full Volume Mirroring on EMC Symmetrix Storage Arrays– Create Full Volume Mirrors of an EMC Symmetrix device within an Array

TimeFinder/Mirror uses special Symmetrix devices called BusinessContinuance Volumes (BCV). BCVs:– Are devices dedicated for Local Replication– Can be dynamically, non-disruptively established with a Standard device.

They can be subsequently split instantly to create a PIT copy of data.

The PIT copy of data can be used in a number of ways:– Instant restore – Use BCVs as standby data for recovery– Decision Support operations– Backup – Reduce application downtime to a minimum (offline backup)– Testing

TimeFinder/Mirror is available in both Open Systems and Mainframe environments

EMC TimeFinder/Mirror is an array based local replication technology for Full Volume Mirroring on EMC Symmetrix Storage Arrays.

TimeFinder/Mirror Business Continuance Volumes (BCV) are devices dedicated to local replication. The BCVs are typically established with a standard Symmetrix device to create a Full Volume Mirror. After the data has been synchronized, the BCV can be “split” from its source device and used for any BC task. TimeFinder controls available on Open Systems and Mainframe environments.




EMC TimeFinder/Mirror – OperationsEstablish– Synchronize the Standard volume to the BCV volume– BCV is set to a Not Ready state when

establishedBCV cannot be independently addressed

– Re-synchronization is incremental– BCVs cannot be established to other BCVs– Establish operation is non-disruptive to the

Standard device– Operations to the Standard can proceed as

normal during the establish

Establish

STD BCV

Incremental Establish

BCV

The TimeFinder Establish operation is the first step in creating a TimeFinder/Mirror replica. The purpose of the establish operation is to synchronize the contents from the Standard device to the BCV. The first time a BCV is established with a standard device, a full synchronization has to be performed. Any future re-synchronization can be incremental in nature. The Symmetrix microcode can keep track of changes made to either the Standard or the BCV.

The Establish is a non-disruptive operation to the Standard device. I/O to Standard devices can proceed during establish. Applications need not be quiesced during the establish operation. The Establish operation sets a “Not Ready” status on the BCV device. Hence, all I/O to the BCV device must be stopped before the Establish operation is performed. Since BCVs are dedicated replication devices, a BCV cannot be established with another BCV.




EMC TimeFinder/Mirror – Operations Split– Time of Split is the Point-in-Time– BCV is made accessible for BC Operations– Consistency

Consistent Split– Changes tracked STD BCV

Split

The Point-in-Time of the replica is tied to the time when the Split operation is executed.

The Split operation separates the BCV from the Standard Symmetrix device and makes the BCV device available for host access through its own device address. After the split operation, changes made to the Standard or BCV devices are tracked by the Symmetrix Microcode. EMC TimeFinder/Mirror ensures consistency of data on the BCV devices via the Consistent Split option.




EMC TimeFinder/Mirror Consistent SplitEMC PowerPath

PowerPath is an EMC host based multi-pathing software

PowerPath holds I/O during TimeFinder/Mirror Split– Read and write I/O

Enginuity Consistency Assist

Symmetrix Microcode holds I/O during TimeFinder/Mirror Split– - Write I/O (subsequent reads

after first write)

Host

STDSTD BCVBCVBCVBCVSTDSTD

The TimeFinder/Mirror Consistent Split option ensures that the data on the BCVs is consistent with the data on the Standard devices. Consistent Split holds I/O across a group of devices using a single Consistent Split command, thus all the BCVs in the group are consistent point-in-time copies. It is used to create a consistent point-in-time copy of an entire system, database, or any associated set of volumes.

The holding of I/Os can be either done by the EMC PowerPath multi-pathing software or the Symmetrix Microcode (Enginuity Consistency Assist). With PowerPath-based consistent split executed by the host doing the I/O, I/O is held at the host before the split.

Enginuity Consistency Assist (ECA) based consistent split can be executed by the host doing the I/O or by a control host in an environment where there are distributed and/or related databases. I/O is held at the Symmetrix until the split operation is completed. Since I/O is held at the Symmetrix, ECA can be used to perform consistent splits on BCV pairs across multiple, heterogeneous hosts.




EMC TimeFinder/Mirror – Operations Restore– Synchronize contents of BCV volume to the

Standard volume– Restore can be full or incremental– BCV is set to a Not Ready state– I/Os to the Standard and BCVs should be

stopped before the restore is initiated

Query– Provide current status of BCV/Standard volume

pairs

Incremental Restore

STD BCVSTD

The purpose of the restore operation is to synchronize the data on the BCVs from a prior Point in Time to the Standard devices. Restore is a recovery operation, so all I/O’s to the Standard device should be stopped and the device must be taken offline prior to a restore operation. The restore sets the BCV device to a Not-Ready state, thus all I/O’s to the BCV devices must be stopped and the devices must be offline before issuing the restore command.

Operations on the Standard volumes can resume as soon as the restore operation is initiated, while the synchronization of the Standards from the BCV is still in progress.

The query operation is used to provide current status of Standard/BCV volume pairs.




Incremental establish

or

Incremental restore

StandardStandardvolumevolume

BCVBCV 4:00 a.m.

2:00 a.m.

4:00 a.m.

6:00 a.m.

StandardStandardvolumevolume

EstablishSplit

EstablishSplit

BCVBCV

BCVBCV

BCVBCV

EMC TimeFinder/Mirror Multi-BCVsStandard device keeps track of changes to multiple BCVsone after the other

Incremental establish or restore

TimeFinder/Mirror allows a given Standard device to maintain incremental relationships with multiple BCVs.

This means that different BCVs can be established and then split incrementally from a standard volume at different times of the day. For example, a BCV that was split at 4:00 a.m. can be re-established incrementally, even though another BCV was established and split at 5:00 a.m. In this way, a user can split and incrementally re-establish volumes throughout the day or night and still keep re-establish times to a minimum.

Incremental information can be retained between a STD device and multiple BCV devices, provided the BCV devices have not been paired with different STD devices.

The incremental relationship is maintained between each STD/BCV pairing by the Symmetrix Microcode.




TimeFinder/Mirror Concurrent BCVsTwo BCVs can be established concurrently with the same Standard device

Establish BCVs simultaneously or one after the other

BCVs can be split individually or simultaneously.

Simultaneous. “Concurrent Restores”, are not allowed

StandardStandardBCV1BCV1

BCV2BCV2

Concurrent BCVs is a TimeFinder/Mirror feature that allows two BCVs to be simultaneously attached to a standard volume. The BCV pair can be split, providing customers with two copies of the customer’s data. Each BCV can be mounted online and made available for processing.




EMC CLARiiON SnapView - SnapshotsSnapView allows full copies and pointer-based copies– Full copies – Clones (sometimes called BCVs)– Pointer-based copies – Snapshots

Because they are pointer-based, Snapshots– Use less space than a full copy– Require a ‘save area’ to be provisioned– May impact the performance of the LUN they are associated with

The ‘save area’ is called the ‘Reserved LUN Pool’

The Reserved LUN Pool– Consists of private LUNs (LUNs not visible to a host)– Must be provisioned before Snapshots can be made

SnapView is software that runs on the CLARiiON Storage Processors and is part of the CLARiiON Replication Software suite of products, which includes SnapView, MirrorView and SAN Copy.

SnapView can be used to make point in time (PIT) copies in 2 different ways – Clones, also called BCVs or Business Continuity Volumes, are full copies, whereas Snapshots use a pointer-based mechanism. Full copies are covered later, when we look at Symmetrix TimeFinder. SnapView Snapshots is covered here.

The generic pointer-based mechanism has been discussed in a previous section, so we’ll concentrate on SnapView.

Snapshots require a save area, called the Reserved LUN Pool. The ‘Reserved’ part of the name implies that the LUNs are reserved for use by CLARiiON software, and therefore cannot be assigned to a host. LUNs which cannot be assigned to a host are known as private LUNs in the CLARiiON environment.

To keep the number of pointers, and therefore the pointer map, at a reasonable size, SnapView divides the LUN to be snapped, called a Source LUN, into areas of 64 kB in size. Each of these areas is known as a chunk. Any change to data inside a chunk causes that chunk to be written to the Reserved LUN Pool, if it is being modified for the first time. The 64 kB copied from the Source LUN must fit into a 64 kB area in the Reserved LUN, so Reserved LUNs are also divided into chunks for tracking purposes.

The next 2 slides show more detail on the Reserved LUN Pool, and allocation of Reserved LUNs to a Source LUN.




The Reserved LUN Pool

Private LUN 5

Private LUN 6

Private LUN 7

Private LUN 8

Reserved LUN Pool

FLARE LUN 5

FLARE LUN 6

FLARE LUN 7

FLARE LUN 8

The CLARiiON storage system must be configured with a Reserved LUN Pool in order to use SnapView Snapshot features. The Reserved LUN Pool consists of 2 parts: LUNs for use by SPA and LUNs for use by SPB. Each of those parts is made up of one or more Reserved LUNs. The LUNs used are bound in the normal manner. However, they are not placed in storage groups and allocated to hosts; they are used internally by the storage system software. These are known as private LUNs because they cannot be used, or seen, by attached hosts.

Like any LUN, a Reserved LUN is owned by only one SP at any time and may be trespassed if the need arises (i.e., if an SP fails).

Just as each storage system model has a maximum number of LUNs it supports, each also has a maximum number of LUNs which may be added to the Reserved LUN Pool.

The first step in SnapView configuration usually is the assignment of LUNs to the Reserved LUN Pool. Only then are SnapView Sessions allowed to start. Remember that as snapable LUNs are added to the storage system, the LUN Pool size has to be reviewed. Changes may be made online.

LUNs used in the Reserved LUN Pool are not host-visible, though they do count towards the maximum number of LUNs allowed on a storage system.

Note: FLARE is the operating environment of the EMC CLARiiON Arrays.




Source LUNs

Reserved LUN Allocation

Snapshot 1a

Snapshot 1b

Snapshot 2a

Session 1a

Session 1b

Session 2a

Reserved LUN Pool

LUN 1

LUN 2

Private LUN 5

Private LUN 6

Private LUN 7

Private LUN 8

In this example, LUN 1 and LUN 2 have been changed to Source LUNs by the creation of one or more Snapshots on each. Three Sessions are started on those Source LUNs. Once a Session starts, the SnapView mechanism tracks changes to the LUN and Reserved LUN Pool space is required. In this example, the following occurs:

Session 1a is started on Snapshot 1a Private LUN 5 in the Reserved LUN Pool is immediately allocated to Source LUN 1, and changes made to that Source LUN are placed in Private LUN 5 A second Session, Session 1b, is started on Snapshot 1b, and changes to the Source LUN are still saved in Private LUN 5 When PL 5 fills up, SnapView allocates the next available LUN, Private LUN 6, to Source LUN 1, and the process continues Sessions 1a and 1b are now storing information in PL 6 A Session is then started on Source LUN 2, and Private LUN 7 – a new LUN, since Source LUNs cannot share a Private LUN - is allocated to it Once that LUN fills, Private LUN 8 will be allocatedIf all private LUNs have been allocated, and Session 1b causes Private LUN 6 to become full, then Session 1b is terminated by SnapView without warning. SnapView does notify the user in the SP Event Log, and, if Event Monitor is active in other ways, the Reserved LUN Pool is filling up. This notification allows ample time to correct the condition. Notification takes place when the Reserved LUN Pool is 50% full, then again at 75%, and every 5% thereafter.




SnapView TermsSnapshot– The ‘virtual LUN’ seen by a secondary host– Made up of data on the Source LUN and data in the RLP– Visible to the host (online) if associated with a Session

Session– The mechanism that tracks the changes– Maintains the pointers and the map– Represents the point in time

Activate and deactivate a Snapshot– Associate and disassociate a Session with a Snapshot

Roll back– Copy data from a (typically earlier) Session to the Source LUN

Let’s use an analogy to make the distinction easier to understand. We’ll compare this technology to CD technology.

You can own a CD player, but have no CDs. Similarly, You can own CDs, but not have a player. CDs are only useful if you can listen to them; also, you can only listen to one at a time on a player, no matter how many CDs owned.

In the same way, a Session (the CD) is a Point-in-Time copy of data on a LUN. The exact time is determined by the time at which the session starts.

The Snapshot (the CD player in our analogy) allows us to view the Session data (listen to the CD). The sequence of slides that follows demonstrates the COFW process and the rollback process.




COFW and Reads from Snapshot

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 0

Chunk 1

Chunk 2

Chunk 3

Chunk 4

Secondary Host

Primary Host

At the start of the animation the SnapView Map, Reserved LUN, and the Map in SP Memory should all be empty. Solid arrows point from Snapshot chunks to the Source LUN chunks. Dotted arrows to Source LUN and to Snapshot go from the Map area in SP Memory. Source LUN chunks are labeled Chunk 0, Chunk 1, Chunk 2, Chunk 3, and Chunk 4.Primary Host issues a write to Chunk 3 on the Source LUN. This is indicated by a dotted arrow to Chunk 3 on the Source LUN from the Primary Host. A block travels out of the Primary Host. The block waits between the Primary Host and the Source LUN. Chunk 3 is copied to the first Chunk on the Reserved LUN, and this is now labeled Chunk 3. SnapView Map and SP Memory Map are updated. The solid arrow from Snapshot to the Source LUN Chunk 3 disappears and new Solid arrow from Snapshot to the Reserved LUN Chunk 3 appears. The dotted arrow to Chunk 3 on Source LUN disappears. A dotted arrow to Reserved LUN chunk 3 appears.Next the block travels to Chunk 3 on the Source LUN. 3 changes to 3’.Another block travels from the Primary Host to Chunk 3 on the Source LUN. This is placed there and 3’ changes to 3”.





SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 0

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Secondary Host

Primary Host

Chunk 3’

Next Primary Host issues a write to Chunk 0 on the Source LUN. This is indicated by a dotted arrow to Chunk 0 on the Source LUN from the Primary Host. A block travels out of the Primary Host. The block waits between the Primary Host and the Source LUN. Chunk 0 is copied to the second Chunk on the Reserved LUN, and this is now labeled Chunk 0. SnapView Map and SP Memory Map are updated. The solid arrow from Snapshot to the Source LUN Chunk 0 disappears and new Solid arrow from Snapshot to the Reserved LUN Chunk 0 appears. The dotted arrow to Chunk 0 on Source LUN disappears. A dotted arrow to Reserved LUN chunk 0 appears.Next the block travels to Chunk 0 on the Source LUN. 0 changes to 0’.Next the Secondary Host issues a read to Chunk 4 on the Snapshot. This is indicated by a dotted arrow from Chunk 4 on the Snapshot to the Secondary Host.A block travels from Chunk 4 of the Source LUN to Chunk 4 of the Snapshot. Then the block travels on to the Secondary Host.Next the Secondary Host issues a read to Chunk 0 on the Snapshot. This is indicated by a dotted arrow from Chunk 0 on the Snapshot to the Secondary Host.A block travels from Chunk 0 of the Reserved LUN to Chunk 0 of the Snapshot. Then the block travels on to the Secondary Host.





SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 0

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Secondary Host

Primary Host

Chunk 3’’





SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Secondary Host

Primary Host

Chunk 0

Chunk 3’’

Chunk 0’




Writes to Snapshot

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Secondary Host

Primary Host

Chunk 0

Chunk 3’’

Chunk 0’

Secondary host issues a write to Chunk 0 of the Snapshot. This is indicated by a dotted arrow from Secondary host to Chunk 0 of the Snapshot.A block starts from Secondary host and waits between the Secondary host and the Snapshot.Chunk 0 on the Reserved LUN is copied over to the next Chunk in the Reserved LUN.Block travels to Chunk 0 of Snapshot and then to the original Chunk 0 on the Reserved LUN. 0 changes to 0* in the Reserved LUN.Next the Secondary host issues a write to Chunk 2 of the Snapshot. This is indicated by a dotted arrow from Secondary host to Chunk 2 of the Snapshot.A block travels from Secondary host and waits between the Secondary host and the Snapshot.Chunk 2 is copied from Source LUN to the next available Chunk in the Reserved LUN. The solid arrow from Chunk 2 of Snapshot to Chunk 2 of Source LUN disappears. Solid arrow from Chunk 2 of Snapshot to Chunk 2 in the Reserved LUN appears. Dotted arrow to Chunk 2 on the Source LUN disappears and a dotted arrow to Chunk 2 on the Reserved LUN appears. SnapView Map and the Map in SP memory are updatedChunk 2 on the Reserved LUN is copied to the next Chunk on the Reserved LUN. A dotted arrow appears.The block travels to Chunk 2 on the Snapshot and then on to the original location of Chunk 2 on the Reserved LUN. 2 on the Reserved LUN is changed to 2*.




Writes to Snapshot

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3’’

Chunk 0’

Chunk 3

Chunk 0

Secondary Host

Primary Host

Chunk 0*




Writes to Snapshot

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Chunk 0

Chunk 2

Secondary Host

Primary Host

Chunk 0*

Chunk 2*

Chunk 3’’

Chunk 0’




Rollback - Snapshot Active (preserve changes)

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Chunk 0

Chunk 2

Secondary Host

Primary Host

Chunk 0*

Chunk 2*

Chunk 3’’

Chunk 0’

SnapView rollback allows a Source LUN to be returned to its state at a previously defined point in time. When performing the rollback, you can choose to preserve or discard any changes made by the secondary host. In this first example, changes are preserved. Meaning that the state of the Source LUN at the end of the rollback process is identical to the Snapshot, as it appears now.

All chunks that are in the Reserved LUN Pool are copied over the corresponding chunks on the Source LUN. Before this process starts, it is necessary to take the Source LUN offline (we are changing the data structure without the knowledge of the host operating system, and it needs to refresh its view of that structure). If this step is not performed, data corruption could occur on the Source LUN.

Note: No changes are made to the Snapshot or to the Reserved LUN Pool when this process takes place.




Rollback - Snapshot Active (preserve changes)

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2*

Chunk 4

Chunk 3

Chunk 0

Chunk 2

Secondary Host

Primary Host

Chunk 0*

Chunk 2*

Chunk 3

Chunk 0*




Rollback - Snapshot Deactivated (discard changes)

SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2*

Chunk 4

Chunk 3

Chunk 0

Chunk 2

Secondary Host

Primary Host

Chunk 0*

Chunk 2*

Chunk 3

Chunk 0*

In this example, all changes that have been made to the Snapshot by the secondary host are discarded, and return the Source LUN to the state it was in when the session was started (the original PIT view). To do this, the Snapshot needs to be deactivated. Deactivating the Snapshot discards all changes made by the secondary host, and frees up areas of the Reserved LUN Pool which were holding those changes. It also makes the Snapshot unavailable to the secondary host.

Once the deactivation has completed, the rollback process can be started. At this point, the Source LUN needs to be taken offline. The Source LUN is then returned to its original state at the time the session was started.





SP memory

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Chunk 0

Chunk 2

Secondary Host

Primary Host

Chunk 3’’

Chunk 0’

SnapView Map

Map





SP memory

SnapView Map

Map

Source LUN

Snapshot

Reserved LUN

Chunk 1

Chunk 2

Chunk 4

Chunk 3

Chunk 0

Chunk 2

Secondary Host

Primary Host

Chunk 3

Chunk 0




SummaryKey points covered in this topic:

EMC’s Local Replication Solutions for the Symmetrix and CLARiiON Arrays

EMC’s TimeFinder/Mirror Replication Solution

EMC’s SnapView - Snapshot Replication Solution

In this topic, we listed EMC’s Local ’Replication Solutions for the Symmetrix and CLARiiON arrays; Described EMC’s TimeFinder/Mirror Replication Solution; andDescribed EMC’s SnapView -Snapshot Replication Solution




Local Replication Case Study 1

Business Profile: A Manufacturing Corporation maintains the storage of their mission critical applications on high end Storage Arrays on RAID 1 volumes. Current Situation/Issue: A Full backup of their Key Application is run once a week. The Database application takes up 1 TB of storage and has to be shutdown during the course of the full backup. The Shutdown of the database is a requirement that cannot be changed. The main concerns facing the corporation are: 1) The Backup window is too long and is negatively impacting the business (2 hours) 2) A disaster recovery test with the full backup tapes took an extremely long time (many

hours). The company would like to: 1) Reduce the backup window during which the database application is shutdown to as

small a time window as possible (less than ½ hour) 2) Ensure that the RTO from the full backup is reduced to under an hour The company’s IT group is very interested in leveraging some of the Local Replication technologies that are available on their high end array. Proposal: Propose a local replication solution to address the company’s concern. Justify how your solution will ensure that the Company’s needs are met.

Local Replication Case Study 2

Business Profile: A Manufacturing Corporation maintains the storage of their mission critical applications on high end Storage Arrays on RAID 1 volumes. Current Situation/Issue: The Company’s key database application takes up 1 TB of storage and has to be up 24x7. The main concerns facing the corporation are: 1) Logical Corruption of the database (e.g. accidental deletion of table or table space) 2) Guaranteed restore operations with a minimum RPO of 1 hour and with an RTO of

less than ½ hour 3) On occasion may have to restore to a point in time that is up to 8 hours old Additional information: Company would like to minimize the amount of storage used by the solution that will address their concerns. On average 240 GB of data changes in a 24 hour period. Customer is not concerned about physical failure of the database devices – other solutions in place already address this issue. The company’s IT group is very interested in leveraging some of the Local Replication technologies that are available on their high end array. Proposal: Propose a local replication solution to address the company’s concern. Justify how your solution will ensure that the Company’s needs are met. How much physical storage will this replication actually need?


Business Continuity – Remote Replication - 1

© 2007 EMC Corporation. All rights reserved. Business Continuity – Remote Replication - 1

Remote ReplicationAfter completing this module, you will be able to:

Explain Remote Replication Concepts– Synchronous/Asynchronous– Connectivity Options

Discuss Host and Array based Remote Replication Technologies– Functionality– Differences– Considerations– Selecting the appropriate technology

This module introduces the challenges and solutions for remote replication and describes possible implementations.





Remote Replication ConceptsReplica is available at a remote facility– Could be a few miles away or half way around the world– Backup and Vaulting are not considered remote replication

Synchronous Replication– Replica is identical to source at all times – Zero RPO

Asynchronous Replication– Replica is behind the source by a finite margin – Small RPO

Connectivity– Network infrastructure over which data is transported from source

site to remote site

The Replication concepts/considerations that were discussed for Local Replication apply to Remote Replication as well. We explore the concepts that are unique to Remote replication.

Synchronous and Asynchronous replication concepts and considerations are explained in more detail in the next few slides.

Data has to be transferred from the source site to a remote site over some network. This can be done over IP networks, over the SAN, using DWDM (Dense Wave Division Multiplexing) or SONET (Synchronous Optical Network), etc. We will discuss the various options later in the module.

The Fundamental difference between local and remote replication is that remote replicas can be at a geographically different location. For example, applications at a data center in Boston could be replicated to a data center in London. Though remote replicas can be used for various Business Continuity operations, just like local replicas, the primary driver of remote replication is disaster recovery. Because data has to be replicated over a distance, a network infrastructure is a necessity for remote replication.




Synchronous ReplicationA write has to be secured on the remote replica and the source before it is acknowledged to the host

Ensures that the source and remote replica have identical data at all times– Write ordering is maintained at all times

Replica receives writes in exactly the same order as the source

Synchronous replication provides the lowest RPO and RTO– Goal is zero RPO– RTO is as small as the time it takes to

start application on the remote site

1

3

4

2Data Write

Data Acknowledgement

Server

Disk

Disk

Synchronous – Data is committed at both the source site and the remote site before the write is acknowledged to the host. Any write to the source must be transmitted to and acknowledged by the remote before signaling a write complete to the host. Additional writes cannot occur until each preceding write has been completed and acknowledged. It ensures that data at both sites are identical at all times.




Synchronous ReplicationResponse Time Extension– Application response time will be

extended due to synchronous replication

Data must be transmitted to remote site before write can be acknowledgedTime to transmit will depend on distance and bandwidth

Bandwidth– To minimize impact on response

time, sufficient bandwidth must be provided for at all times

Rarely deployed beyond 200 km

Average

Time

WritesMB/s

Max

Applications response times are extended with any kind of Synchronous replication. This is due to the fact that any write to source must be transmitted to and acknowledged by remote before signaling write complete to the host. The response time depends on the distance between sites, available bandwidth, and the network connectivity infrastructure.

The longer the distance, the more the response time. Speed of light is finite, every 200 Km (125 miles) adds 1ms to the response time.

Insufficient bandwidth also causes response time elongation. With Synchronous replication, there is sufficient bandwidth all the time. The picture on the slide shows the amount of data that has to be replicated as a function of time. To minimize the response time elongation, ensure that the Max bandwidth is provided by the network at all times. If we assume that only the average bandwidth is provided for, then there are times during the day (the shaded section) when response times may be unduly elongated, causing applications to time out.

The distances over which Synchronous replication can be deployed really depends on an applications ability to tolerate the extension in response time. It is rarely deployed for distances greater than 200 Km (125 miles).




Asynchronous ReplicationWrite is acknowledged to host as soon as it is received by the source

Data is buffered and sent to remote– Some vendors maintain write ordering– Other vendors do not maintain write

ordering, but ensure that the replica will always be a consistent re-startable image

Finite RPO– Replica will be behind the Source by

a finite amount– Typically configurable

1

4

2

3Data Write

Data Acknowledgement

Server

Disk

Disk

Asynchronous - Data is committed at the source site and the acknowledgement is sent to the host. The data is buffered and then forwarded to the remote site as the network capabilities permit. The data at the remote site is behind the source by a finite RPO; typically the RPO would be a configurable value.

The primary benefit of Asynchronous replication is that there is no response time elongation. Asynchronous replications are typically deployed over extended distances. The response time benefit is offset by the finite RPO.




Asynchronous ReplicationResponse Time unaffected

Bandwidth– Need sufficient bandwidth on average

Buffers– Need sufficient buffers

Can be deployed over long distancesAverage

Time

WritesMB/s

Max

Extended distances can be achieved with Asynchronous replication because there is no impact on the application response time. Data is buffered and then sent to the remote site. The available bandwidth should be at least equal to the average write workload. Data is buffered during times when the bandwidth is not enough, thus sufficient buffers should be designed into the solution.

Understanding the workload of the application and the bandwidth required for the replication is as important for Asynchronous replication as Synchronous. While it is true that Asynchronous replication requires less bandwidth than Synchronous, one still has to provide bandwidth which is equal to the average write workload. Data will be buffered when the bandwidth is not enough. This buffering of data causes the RPO to become larger. Insufficient bandwidth will lead to large RPO’s which may not be acceptable.




Remote Replication TechnologiesHost based– Logical Volume Manager (LVM)

Synchronous/Asynchronous– Log Shipping

Storage Array based– Synchronous– Asynchronous– Disk Buffered - Consistent PITs

Combination of Local and Remote Replication

In the context of our discussion, Remote Replication refers to replication that is done between data centers if it is host based, and between Storage arrays if it is array based. In the latter case, the two arrays may be adjacent to each other in the same data center, or might be geographically separated.

Host based implies that all the replication is done by using the CPU resources of the host, using software that is running on the host. Array based implies that all replication is done between Storage Arrays and is handled by the Array Operating Environment.




LVM Based Remote ReplicationDuplicate Volume Groups at local and remote sites

All writes to the source Volume Group are replicated to the remote Volume Group by the LVM– Synchronous or Asynchronous

NetworkVolume Group

Physical Volume 1

Physical Volume 2

Physical Volume 3

Physical Volume 1

Physical Volume 2

Physical Volume 3

Local Site Remote Site

Volume Group

Log Log

Some LVM vendors provide remote replication at the Volume Group level.

Duplicate Volume Groups need to exist at both the local and remote sites before replication starts. This can be achieved in a number of ways

Over IPTape backup/restore etc.

All writes to the source Volume Group are replicated to the remote Volume Group by the LVM. Typically the writes are queued in a log file and sent to the remote site in the order received over a standard IP network. It can be done synchronously or asynchronously.

Synchronous – Write must be received by remote before the write is acknowledged locally to the hostAsynchronous – Write is acknowledged immediately to the local host and queued and sent in order




LVM Based Remote ReplicationIn the event of a network failure– Writes are queued in the log file– When the issue is resolved the queued writes are sent over to the

remote– The maximum size of the log file determines the length of outage

that can be withstood

In the event of a failure at the source site, production operations can be transferred to the remote site

Production work can continue at the source site if there is a network failure. The writes that need to be replicated are queued in the log file and sent over to the remote site when the network issue is resolved. If the log files fill up before the network outage is resolved, a complete resynchronization of the remote site would have to be performed. Thus, the size of the log file determines the length of network outage that can be tolerated.

In the event of a failure at the source site (e.g. server crash, site wide disaster), production operations can be resumed at the remote site with the remote replica. The exact steps that need to be performed to achieve this depends on the LVM that is in use.




LVM Based Remote ReplicationAdvantages– Different storage arrays and RAID protection can be used at the

source and remote sites– Standard IP network can be used for replication– Response time issue can be eliminated with asynchronous mode,

with extended RPO

Disadvantages– Extended network outages require large log files– CPU overhead on host

For maintaining and shipping log files

A significant advantage of using LVM based remote replication is the fact that storage arrays from different vendors can be used at the two sites. For example, at the production site, a high-end array could be used while at the remote site, a second tier array could be used. In a similar manner, the RAID protection at the two sites could be different as well.

Most of the LVM based remote replication technologies allow the use of standard IP networks that are already in place, eliminating the need for a dedicated network. Asynchronous mode supported by many LVMs eliminates the response time issue of synchronous mode while extending the RPO.

Log files need to be configured appropriately to support extended network outages. Host based replication technologies use host CPU cycles.




Host Based Log Shipping

Offered by most DB Vendors

Advantages– Minimal CPU overhead– Low bandwidth– Standby Database consistent

to last applied log

Original

Logs

Stand By

Logs

IP Network

Log Shipping is a host based replication technology for databases offered by most DB VendorsInitial State - All the relevant storage components that make up the database are replicated to a standby server (done over IP or other means) while the database is shutdownDatabase is started on the production server, as and when log switches occur the log file that was closed is sent over IP to the standby serverDatabase is started in standby mode on the standby server; when log files arrive, they are applied to the standby databaseStandby database is consistent up to the last log file that was applied

AdvantagesMinimal CPU overhead on production serverLow bandwidth (IP) requirementStandby Database consistent to last applied log−RPO can be reduced by controlling log switching

DisadvantagesNeed host based mechanism on production server to periodically ship logsNeed host based mechanism on standby server to periodically apply logs and check for consistencyIP network outage could lead to standby database falling further behind




Array Based – Remote ReplicationReplication performed by the array operating environment– Host CPU resources can be devoted to production operations

instead of replication operations– Arrays communicate with each other via dedicated channels

ESCON, Fibre Channel or Gigabit Ethernet

Replicas are on different arrays– Primarily used for DR purposes– Can also be used for other BC operations

Production Array Remote Array

DistanceSource ReplicaNetwork

DR ServerProduction Server

Replication ProcessA Write is initiated by an application/serverReceived by the source arraySource array transmits the write to the remote array via dedicated channels (ESCON, Fibre Channel or Gigabit Ethernet) over a dedicated or shared network infrastructureWrite received by the remote array

Only Writes are forwarded to the remote arrayReads are from the source devices




Array Based – Synchronous Replication

Network links

Write is received by the source array from host/server

Write is transmitted by source array to the remote array

Remote array sends acknowledgement to the source array

Source array signals write complete to host/server

Source Target

Synchronous Replication ensures that the replica and source have identical data at all times. The source array issues the write complete to the host/server only when the write has been received both at the remote array and the source array. When the write complete is sent, the replica and source are identical.

The sequence of operations is:Write is received by the source array from host/serverWrite is transmitted by source array to the remote arrayRemote array sends acknowledgement to the source arraySource array signals write complete to host/server




Array Based – Asynchronous Replication

No impact on response timeExtended distances between arraysLower bandwidth as compared to Synchronous

Network links

Write is received by the source array from host/server

Write is transmitted by source array to the remote array

Source array signals write complete to host/server

Remote array sends acknowledgement to the source array

Source Target

Applications do not suffer any response time elongation with Asynchronous replication because any write is acknowledged to the host as soon as the write is received by the source array. Asynchronous replication can be used for extended distances. Bandwidth requirements for Asynchronous will be lower than Synchronous for the same workload. Vendors ensure data consistency in different ways.

The sequence of operations is shown here:

A Write is received by the source array from the host;

The Source array signals write complete to the host;

The Write is transmitted by source array to the remote array; and then

The Remote array sends acknowledgement to the source array.




Array Based – Asynchronous ReplicationEnsuring Consistency– Maintain write ordering

Some vendors attach a time stamp and sequence number with each of the writes, then ship the writes to the remote array and apply the writes to the remote devices in the exact order based on the time stamp and sequence numbersRemote array applies the writes in the exact order they were received, just like synchronous

– Dependent write consistencySome vendors buffer the writes in the cache of the source array for a period of time (between 5 and 30 seconds)At the end of this time the current buffer is closed in a consistent manner and the buffer is switched, new writes are received in the new bufferThe closed buffer is then transmitted to the remote arrayRemote replica will contain a consistent, re-startable image on the application

The data on the remote replicas will be behind the source by a finite amount in Asynchronous replication, thus steps must be taken to ensure consistency. Some vendors achieve consistency by maintaining write ordering, wherein the remote array applies writes to the replica devices in the exact order that they were received at the source. Other vendors leverage the dependent write I/O logic that is built into most databases and applications.

Cache buffered Asynchronous replication technologies buffer writes in cache for a period of time, and then close the buffer in a consistent manner and receive new writes in a new buffer. When the buffer is open, if a particular location is written to more that once (locality of reference), only the final write is sent to the remote array. Thus, if a particular location is written to 10 times, only the last I/O is sent to the remote array when the buffer is closed. This method is different from the asynchronous technique which maintains write ordering. With write ordering 10 I/Os will be sent to the remote array as compared to the 1 I/O in the cache buffered method. Data consistency is maintained with both techniques, but the cache buffered technique would require less bandwidth if the workload has a high locality of reference (same data location written to multiple times).




Array Based – Disk Buffered Consistent PITsLocal and Remote replication technologies can be combined to create consistent PIT copies of data on remote arrays

RPO usually in the order of hours

Lower Bandwidth requirements

Extended distance solution

Disk buffered consistent PITs is a combination of Local and Remote replications technologies. The idea is to make a Local PIT replica and then create a Remote replica of the Local PIT. The advantage of disk buffered PITs is lower bandwidth requirements and the ability to replicate over extended distances. Disk buffered replication is typically used when the RPO requirements are of the order of hours or so, thus a lower bandwidth network can be used to transfer data from the Local PIT copy to the remote site. The data transfer may take a while, but the solution would be designed to meet the RPO.

Let’s take a look at a two disk buffered PIT solutions.




Extended Distance Consistent PIT

Create a Consistent PIT Local Replica on Source ArrayCreate a Remote Replica of this Local ReplicaOptionally create another replica of the Remote replica on the remote array if neededRepeat…as automation, link bandwidth, change rate permit

SOURCE REMOTE

Network Links

Remote Replica

Local Replica

Local Replica

Source

Disk buffered replication allows for the incremental resynchronization between a Local Replica which acts as a source for a Remote Replica.

Benefits include:Reduction in communication link cost and improved resynchronization time for long-distance replication implementationsThe ability to use the various replicas to provide disaster recovery testing, point-in-time backups, decision support operations, third-party software testing, and application upgrade testing or the testing of new applications.




Synchronous + Extended Distance Consistent PIT

Synchronous replication between the Source and Bunker SiteCreate consistent PIT Local Replica at bunkerCreate Remote Replica of bunker Local Replica Optionally create additional Local Replica at Target site from the Remote Replica if neededRepeat…as automation, link bandwidth, change rate permit

SOURCE REMOTEBUNKER

SyncSource

Remote Replica

Local Replica

Local Replica

Remote Replica

Network Links

Network Links

Synchronous + Extended Distance Buffered Replication benefits include:Bunker site provides a zero RPO DR ReplicaThe ability to resynchronize only changed data between the intermediate Bunker site and the final target site, reducing required network bandwidthReduction in communication link cost and improved resynchronization time for long-distance replication implementationsThe ability to use the replicas to provide disaster recovery testing, point-in-time backups, decision support operations, third-party software testing, and application upgrade testing or the testing of new applications.

The Bunker to Remote replication is identical to the solution discussed in the previous slide. The key benefit of this solution is the zero RPO DR replica at the Bunker site provided by the Synchronous replication between the Source and Bunker arrays.




Remote Replicas – Tracking ChangesRemote replicas can be used for BC Operations– Typically remote replication operations will be suspended when the

remote replicas are used for BC Operations

During BC Operations changes will/could happen to both the source and remote replicas– Most remote replication technologies have the ability to track

changes made to the source and remote replicas to allow for incremental re-synchronization

– Resuming remote replication operations will require re-synchronization between the source and replica

Tracking changes to facilitate incremental re-synchronization between the source devices and remote replicas is done via the use of bitmaps in a manner very similar to that discussed in the Local Replication lecture. Two bitmaps, one for the source and one for the replica, would be created. Some vendors may keep the information of both bitmaps at both the source and remote sites, while others may simply keep the source bitmap at the source site and the remote bitmap at the remote site. When a re-synchronization (source to replica or replica to source) is required, the source and replica bitmaps are compared and only data that was changed is synchronized.




Primary Site Failure – Operations at Remote SiteRemote replicas are typically not available for use while the replication session is in progress

In the event of a primary site failure the replicas have to be made accessible for use

Create a local replica of the remote devices at the remote site

Start operations at the Remote site– No remote protection while primary site issues are resolved

After issue resolution at Primary Site– Stop activities at remote site– Restore latest data from remote devices to source– Resume operations at Primary (Source) Site

While remote replication is in progress the remote devices will typically not be available for use. This is to ensure that the no changes are made to the remote replicas. The purpose of the remote replica is to provide a good starting point for any recovery operation.

Prior to any recovery efforts with the remote replicas, it is always a good idea to create a local replica of the remote devices. The local replica can be used as a fall back if the recovery process somehow corrupts the remote replicas.

Restarting operations at the remote site and subsequently restoring operation back to the primary site requires a tremendous amount of upfront planning. The simple statement, “Start operations at the Remote site,” would have to be planned well ahead of time to account for various failure scenarios.




Array Based – Which TechnologySynchronous– Is a must if zero RPO is required– Need sufficient bandwidth at all times– Application response time elongation will prevent extended distance

solutions (rarely above 125 miles)

Asynchronous– Extended distance solutions with minimal RPO (order of minutes)– No Response time elongation– Generally requires lower Bandwidth than synchronous– Must design with adequate cache/buffer or sidefile/logfile capacity

Disk Buffered Consistent PITs– Extended distance solution with RPO in the order of hours– Generally lower bandwidth than synchronous or asynchronous

The choice of the appropriate array based remote replication depends on specific needs.

What are the RPO requirements? What is the distance between sites? What is the primary reason for remote replication? etc.




Storage Array Based – Remote ReplicationNetwork Options– Most vendors support ESCON or Fibre Channel adapters for remote

replicationCan connect to any optical or IP networks with appropriate protocol converters for extended distances

DWDMSONETIP Networks

– Some Vendors have native Gigabit Ethernet adapters which allows the array to be connected directly to IP Networks without the need for protocol converters

A dedicated or a shared network must be in place for remote replication. Storage arrays have dedicated ESCON, Fibre Channel or Gigabit Ehternet adapters, which are used for remote replication. The network between the two arrays could be ESCON or Fibre Channel for the entire distance. Such networks would be typically used for shorter distance. For extended distances, an optical or IP network must be used. Examples of optical networks are DWDM and SONET (discussed later). Protocol converters may have to be used to connect the ESCON or Fibre Channel adapters from the arrays to these networks. Gigabit Ethernet adapters can be connected directly to the IP network.

A network is required for remote replication. Because this topic is complex, the next three slides are meant to give you an overview of the network options that are available.




Dense Wavelength Division Multiplexing (DWDM)DWDM is a technology that puts data from different sources together on an optical fiber with each signal carried on its own separate light wavelength (commonly referred to as a lambda or λ).

Up to 32 protected and 64 unprotected separate wavelengths of data can be multiplexed into a light stream transmitted on a single optical fiber.

ESCON

Fibre Channel

Gigabit Ethernet

Optical Channels

Optical Electrical Optical Lambda λ

Dense Wavelength Division Multiplexing (DWDM) multiplexes wavelengths (often referred to as lambdas or represented by the symbol λ) onto a single pair (transmit and receive paths) of optical fibers.

A key benefit of DWDM is protocol transparency. Since DWDM is an optical transmission technique, the same interface type can be used to transport any bit rate or protocol. It also allows different bit rates and protocol data streams to be mixed on the same optical fiber. DWDM alleviates the need for protocol conversion, associated complexity, and the resulting transmission latencies.




Synchronous Optical Network (SONET)SONET is Time Division Multiplexing (TDM) technology where traffic from multiple subscribers is multiplexed together and sent out onto the SONET ring as an optical signal

Synchronous Digital Hierarchy (SDH) similar to SONET but is the European standard

SONET/SDH, offers the ability to service multiple locations, its reliability/availability, automatic protection switching, and restoration

SONET

OC3 OC48

OC48

SDH

STM-1 STM-16

STM-16

Synchronous Optical Networks (SONET) is a standard for optical telecommunications transport formulated by the Exchange Carriers Standards Association (ECSA) for the American National Standards Institute (ANSI). The equivalent international standard is referred to as Synchronous Digital Hierarchy and is defined by the European Telecommunications Standards Institute (ETSI). Within Metropolitan Area Networks (MANs) today, SONET/SDH rings are used to carry both voice and data traffic over fiber.




Rated Bandwidth

2488.0OC48/STM16622.08OC12/STM4155.5OC3/STM151.8OC134E32E145T31.5T11024Gigabit Ethernet1024 or 2048Fibre Channel200EsconBandwidth Mb/sLink

The slide lists the rated bandwidth in Mb/s for standard WAN (T1, T3, E1, E3), SONET (OC1, OC3, OC12, OC48) and SDH (STM1, STM4, STM16) Links. The rated bandwidth of ESCON, Fibre Channel, and Gigabit Ethernet is also listed.





Remote Replication Concepts– Synchronous/Asynchronous– Connectivity Options

Host and Array based Remote Replication Technologies– Functionality– Differences– Considerations– Selecting the appropriate technology





Check Your KnowledgeWhat is a Remote Replica?

What are the possible uses of Remote Replicas?

What is the difference between Synchronous and Asynchronous Replication?

Discuss one host based remote replication technology?

Discuss one array based remote replication technology?

What are differences in the bandwidth requirements between the array remote replication technologies discussed in this module?






EMC’s Remote Replication Solutions for the

Symmetrix and CLARiiON Arrays

EMC’s SRDF/Synchronous Replication Solution

EMC’s MirrorView/A Replication Solution

At this point, let’s apply what we’ve learned to some real world examples. Upon completion of thistopic you will be able to:

Enumerate EMC’s Remote Replication Solutions for the Symmetrix and CLARiiON arrays;

Describe EMC’s SRDF/Synchronous Replication Solution; and

Describe EMC’s MirrorView/A Replication Solution




EMC – Remote Replication SolutionsEMC Symmetrix Arrays– EMC SRDF/Synchronous– EMC SRDF/Asynchronous– EMC SRDF/Automated Replication

EMC CLARiiON Arrays– EMC MirrorView/Synchronous– EMC MirrorView/Asynchronous

All remote replication solutions that were discussed in this module are available on EMC Symmetrix and CLARiiON Arrays.

The SRDF (Symmetrix Remote Data Facility) family of products provides Synchronous, Asynchronous and Disk Buffered remote replication solutions on the EMC Symmetrix Arrays.

The MirrorView family of products provides Synchronous and Asynchronous remote replication solutions on the EMC CLARiiON Arrays.

SRDF/Synchronous (SRDF/S): High-performance, host-independent, real-time synchronous remote replication from one Symmetrix to one or more Symmetrix systems.

MirrorView/Synchronous (MirrorView/S): Host-independent, real-time synchronous remote replication from one CLARiiON to one or more CLARiiON systems.

SRDF/Asynchronous (SRDF/A): High-performance extended distance asynchronous replication for Symmetrix arrays using a Delta Set architecture for reduced bandwidth requirements and no host performance impact. Ideal for Recovery Point Objectives of the order of minutes.

MirrorView/Asynchronous (MirrorView/A): Asynchronous remote replication on CLARiiON arrays. Designed with low-bandwidth requirements, delivers a cost-effective remote replication solution ideal for Recovery Point Objectives (RPOs) of 30 minutes or greater.

SRDF/Automated Replication: Rapid business restart over any distance with no data exposure through advanced single-hop and multi-hop configurations using combinations of TimeFinder/Mirror and SRDF on Symmetrix Arrays.




EMC SRDF/Synchronous - IntroductionArray based Synchronous Remote Replication technology for EMC Symmetrix Storage Arrays– Facility for maintaining real-time physically separate mirrors of

selected volumes

SRDF/Synchronous uses special Symmetrix devices– Source arrays have SRDF R1 devices– Target arrays have SRDF R2 devices– Data written to R1 devices are replicated to R2 devices

SRDF uses dedicated channels to send data from source to target array– ESCON, Fibre Channel or Gigabit Ethernet are supported

SRDF is available in both Open Systems and Mainframe environments

EMC SRDF/Synchronous is an Array based Synchronous Remote Replication technology for EMC Symmetrix Storage Arrays. SRDF R1 and R2 volumes are devices dedicated for Remote replication. R2 volumes are on the Target arrays, while R1 volumes are on the Source arrays. Data written to R1 volumes is replicated to R2 volumes.




SRDF Source and Target VolumesSRDF R1 and R2 Volumes can have any local RAID Protection– e.g. Volumes could have RAID-1 or RAID-5 protection

SRDF R2 volumes are in a Read Only state when remote replication is in effect– Changes cannot be made to the R2 volumes

SRDF R2 volumes are accessed under certain circumstances– Failover – Invoked when the primary volumes become unavailable– Split – Invoked when the R2 volumes need to be concurrently

accessed for BC operations

SRDF R1 and R2 volumes can have any local RAID protection. SRDF R2 volumes are in a Read Only state when remote replication is in effect. SRDF R2 volumes are accessed under certain circumstances.




Global Cache Director

DiskDirector (DD)

ChannelDirector (CD)


DiskDirector (DD)

Remote LinkDirector (RLD)


Symmetrix Containing Target (R2) Volumes

Target Host

3

2

Global Cache Director

DiskDirector (DD)



DiskDirector (DD)



Symmetrix Containing Source (R1) Volumes

Source Host

1 4

1. Write received by Symmetrix containing Source volume

Application does not receive I/O acknowledgement until data is received and acknowledged by remote Symmetrix

Write completion time is extended - No impact on Reads

Most often used in campus solutions

4. Write complete sent to host3. Target Symmetrix sends acknowledgement to Source2. Source Symmetrix sends write data to Target

SRDF/Synchronous

SRDF/Synchronous is used primarily in SRDF campus environments. In this mode of operation, Symmetrix maintains a real-time mirror image of the data between the SRDF pairs.

Data on the Source (R1) volumes and the Target (R2) volumes is always identical .

The sequence of operations is:

1. An I/O write is received from the host/server into the cache of the Source

2. The I/O is transmitted to the cache of the Target

3. A receipt acknowledgment is provided by the Target back to the cache of the Source

4. An ending status is presented to the host/server

The transmission of data to the target and the receipt of acknowledgement from the target is done via specialized hardware on the array (depicted as Remote Link Director – RLD in the picture).

De-stage of data to disk in Source and Target Symmetrix is done on a “off-priority” basis.

If a link failure occurs before acknowledgement is received from the Target Symmetrix, then the operation is re-tried down the remaining links in the RA-group. If all links fail, then I/O is acknowledged to the host and the track is flagged as invalid to the remote mirror.




SRDF Operations - FailoverPurpose – Make Target Volumes Read Write

Source Volume status is changed to Read Only

SRDF Link is suspended

After

RWRO

SourceVolume

TargetVolume

RORW

SourceVolume

TargetVolume

Before

Failover operations are performed if the SRDF R1 Volumes become unavailable and the decision is made to start operations on the R2 Devices. Failover could also be performed when DR processes are being tested or for any maintenance tasks that have to be performed at the source site.

If failing over for a Maintenance operation: For a clean, consistent, coherent point in time copy which can be used with minimal recovery on the target side, some or all of the following steps may have to be taken on the source side:− Stop All Applications (DB or whatever else is running)− Unmount file system.− Deactivate the Volume Group− A failover leads to a RO state on the source side. If a device suddenly becomes RO from a RW

state, the reaction of the host can be unpredictable if the device is in use; therefore, the suggestion to stop applications, un-mount and deactivate Volume Groups.




SRDF Operations - FailbackMakes target volume Read Only, resumes link, synchronize R2 to R1, and write enables source volume

RWRO

SourceVolume

TargetVolume

Before

After

RORW

SourceVolume

TargetVolume

sync

The main purpose of the Failback operation is to allow the resumption of operations at the primary site on the source devices. Failback is typically invoked after a failover has been performed and production tasks are being performed on the Target site on the R2 devices. Once operations can be resumed at the Primary site, the Failback operation can be invoked. Ensure that applications are properly quiesced and volume groups deactivated before failback is invoked.

When failback is invoked, the Target Volumes become Read Only, the source volumes become Read Write, and any changes that were made at the Target site while in the failed over state are propagated back to the source site.




SRDF Operations - SplitEnables read and write operations on both source and target volumes

Suspends replication

RORW

SourceVolume

TargetVolume

Before

After

RWRW

SourceVolume

TargetVolume

The SRDF Split operation is used to allow concurrent access to both the Source and Target volumes. Target volumes are made Read Write and the SRDF replication between the Source and Target is suspended.




SRDF Operations – Establish/RestoreEstablish - Resume SRDF operation retaining data from source and overwriting any changed data on target

Restore - SRDF operation retaining data on target and overwriting any changed data on source

RORW

SourceVolume

TargetVolume

Establish

RORW

SourceVolume

TargetVolume

Restore

During current operations while in a SRDF Split state, changes could occur on both the Source and Target volumes. Normal SRDF replication can be resumed by performing an establish or a restore operations.

With either establish or restore, the status of the Target volume goes to Read Only. Prior to establish or restore, all access to the target volumes must be stopped.

The Establish operation is used when changes to the Target volume should be discarded while preserving changes that were made to the Source volumes.

The Restore operation is used when changes to the Source volume should be discarded while preserving changes that were made to the Target volumes. Prior to a restore operation, all access to the source and target volumes must be stopped. The Target volumes go to Read Only state, while the data on the Source volumes are overwritten with the data on the Target volumes.




EMC CLARiiON MirrorView/A OverviewOptional storage system software for remote replication on EMC CLARiiON arrays– No host cycles used for data replication

Provides a remote image for disaster recovery– Remote image updated periodically - asynchronously– Remote image cannot be accessed by hosts while replication is

active– Snapshot of mirrored data can be host-accessible at remote site

Mirror topology (connecting primary array to secondary arrays)– Direct connect and switched FC topology supported– WAN connectivity supported using specialized hardware

MirrorView/A is optional software supported on CX-series EMC CLARiiON arrays.

The design goal of MirrorView/A is to allow speedy recovery from a disaster, but at lower cost than synchronous solutions. It allows long distance connectivity in environments where some data loss is acceptable. It accomplishes this goal by using an asynchronous interval-based update mechanism. This means that changed data is accumulated at the local side of the link, then sent to the remote side at regular, user-defined intervals. The data on the remote image is always older than the data on the local image, by up to 2 interval times. Though this leads to data loss in the event of a disaster, it is an acceptable trade-off for many customers.

Supported connection topologies include direct connect, SAN connect, and WAN connect when appropriate Fibre Channel to IP conversion devices are used.




MirrorView/A TermsPrimary storage system– Holds the local image for a given mirror

Secondary storage system– Holds the local image for a given mirror

Bidirectional mirroring– A storage system can hold local and remote images

Mirror Synchronization– Process that copies data from local image to remote image

MirrorView Fractured state– Condition when a Secondary storage system is unreachable by the

Primary storage system

The terms ‘primary storage system’ and ‘secondary storage system’ are terms relative to each mirror. Because MirrorView/A supports bidirectional mirroring, a storage system which hosts local images for one or more mirrors may also host remote images for one or more other mirrors.

The process of updating a remote image with data from the local image is called synchronization. When mirrors are operating normally, they are either in the synchronized state or synchronizing. If a failure occurs, and the remote image cannot be updated, perhaps because the link between the CLARiiONs has failed, then the mirror is in a fractured state. Once the error condition is corrected, synchronization restarts automatically.




MirrorView/A ConfigurationMirrorView/A Setup – MirrorView/A software must be loaded on both Primary and

Secondary storage system– Remote LUN must be exactly the same size as local LUN– Secondary LUN does not need to be the same RAID type as Primary– Reserved LUN Pool space must be configured

Management via Navisphere Manager and CLI

MirrorView/A software must be loaded on both CLARiiONs, regardless of whether or not the customer wants to implement bi-directional mirroring.

The remote LUN must be the same size as the local LUN, though not necessarily the same RAID type. This allows flexibility in DR environments, where the backup site need not match the performance of the primary site.

Because MirrorView/A uses SnapView Snapshots as part of its internal operation, space must be configured in the Reserved LUN Pool for data chunks copied as part of a COFW operation. SnapView Snapshots, the Reserved LUN Pool, and COFW activity were discussed in an earlier module.

MirrorView/A, like other CLARiiON software, is managed by using either Navisphere Manager if a graphical interface is desired, or Navisphere CLI for command-line management.

Hosts can not attach to a remote LUN while it is configured as a secondary (remote) mirror image. If you promote the remote image to be the primary mirror image (in other words, exchange roles of the local and remote images), as is done in a disaster recovery scenario, or if you remove the secondary LUN from the mirror, and thereby turn it into an ordinary CLARiiON LUN, then it may be accessed by a host.




DC

MirrorView/A – Initial Synchronization

B E

Host

Primary Image

Secondary Image

Snapshot

RLP

A F

Tracking DeltaMap

Transfer DeltaMap 1 1 1 1 1 1

00 00 0 0

MAP MAP

MirrorView/A makes use of bitmaps, called DeltaMaps because they track changes, to log where data has changed, and needs to be copied to the remote image. As with SnapView Snapshots, the MirrorView image is seen as consisting of 64 kB areas of data, called chunks or extents.

This animated sequence shows the initial synchronization of a MirrorView/A mirror. The Transfer DeltaMap has all its bits set, to indicate that all extents need to be copied across to the secondary. At the time the synchronization starts, a SnapView Session is started on the primary, and it will track all changes in a similar manner to that used by Incremental SAN Copy. At the end of the initial synchronization, the secondary image is a copy of what the primary looked like when the synchronization started. Any changes made to the primary since then are flagged by the Tracking DeltaMap.




A


Host

Primary Image

Secondary Image

Snapshot

RLP

Tracking DeltaMap

Transfer DeltaMap 0 1 1 1 1 1

00 00 0 0

MAP MAP

DCB EA F




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap

Transfer DeltaMap 1 1 1 1

00 00 0 0

MAP

0 0

EB

MAP




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 0

MAPMAP

0 0

1

EB’




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 0

MAPMAP

0 0

1

EB’ C




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 1

MAPMAP

0 0

1

E’B’ C

E




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 1

MAPMAP

0 0

1

E’B’ C

E

D




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 1

MAPMAP

0 0

1

E’B’ C

E

D




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 1

MAPMAP

0 0

1

E’B’ C

E

D E




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B

Tracking DeltaMap


00 00 1

MAPMAP

0 0

1

E’B’ C D E F




ADC

MirrorView/A – Update

Host

Primary Image

Secondary Image

Snapshot

RLP

A F B C D E F

Tracking DeltaMap

Transfer DeltaMap

1 00 00 1

0 0 0 0 0

MAP MAP

0

E’B’

An update cycle starts, either automatically at the prescribed time, or initiated by the user. Prior to the start of data movement to the secondary, MirrorView/A starts a SnapView Session on the secondary, to protect the original data if anything goes wrong during the update cycle.

After the update cycle completes successfully, the SnapView Session and Snapshot on the secondary side are no longer needed, and are destroyed.




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A F B C D E F

1 00 00 1

0 0 0 0 0

MAP MAP

0

E’B’

Transfer DeltaMap

Tracking DeltaMap




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A’ F B C D E F

1 00 00 1

1 0 0 0 0

MAP MAP

0

E’B’

Transfer DeltaMap

Tracking DeltaMap




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A’ F B’ C D E F

1 00 00 0

1 0 0 0 0

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A’ F B’ C D E F

1 00 00 0

1 0 0 0 0

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A’ F B’ C D E’ F

0 00 00 0

1 0 0 0 0

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’ E




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP

A’ F B’ C D E’ F

0 00 00 0

1 0 0 0 0

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

E’




ADC

MirrorView/A –Promotion (Update Failure)

Host

Primary Image

Secondary Image

Snapshot

RLP

A’ F’ B’ C D E F

1 00 00 0

1 0 0 0 1

MAP MAP

0

E’B’

Transfer DeltaMap

Tracking DeltaMap

B

Should the update cycle fail for any reason (here a primary storage system failure) and it becomes necessary to promote the secondary, then the safety Session is rolled back and the secondary image is returned to the state it was in prior to the start of the update cycle.




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP


1 00 00 0

1 0 0 0 1

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’




ADC


Host

Primary Image

Secondary Image

Snapshot

RLP


1 00 00 0

1 0 0 0 1

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’




ADC


Host

Secondary Image

Snapshot

RLP


1 00 00 0

1 0 0 0 1

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’

Promote Secondary




ADC


Host

Secondary Image

Snapshot

RLP

A’ F’ B C D E F

1 00 00 0

1 0 0 0 1

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

BE’




ADC


Host

PrimaryImage

Snapshot

RLP

A’ F’ B C D E F

1 00 00 0

1 0 0 0 1

MAP MAP

1

E’’B’

Transfer DeltaMap

Tracking DeltaMap

E’




Consistency GroupsGroup of secondary images treated as a unit

Local LUNs must all be on the same CLARiiON

Remote LUNs must all be on the same CLARiiON

Operations happen on all LUNs at the same time– Ensures a restartable image group

Consistency Groups allow all LUNs belonging to a given application, usually a database, to be treated as a single entity and managed as a whole. This helps to ensure that the remote images are consistent, i.e. all made at the same point in time. As a result, the remote images are always restartable copies of the local images, though they may contain data which is not as new as that on the primary images.

It is a requirement that all the local images of a Consistency Group be on the same CLARiiON, and that all the remote images for a Consistency Group be on the same remote CLARiiON. All information related to the Consistency Group is sent to the remote CLARiiON from the local CLARiiON.

The operations which can be performed on a Consistency Group match those which may be performed on a single mirror, and affect all mirrors in the Consistency Group. If for some reason an operation cannot be performed on one or more mirrors in the Consistency Group, then that operation fails and the images remain unchanged.




Apply Your Knowledge SummaryKey points covered in this topic:

EMC’s Remote Replication Solutions for the Symmetrixand CLARiiON Arrays

EMC’s SRDF/Synchronous Replication Solution

EMC’s MirrorView/A Replication Solution

In this topic, we enumerated EMC’s Remote Replication solutions for the Symmetrix and CLARiiONarrays; described EMC’s SRDF/Synchronous Replication and MirrorView/A Replication solutions.





Section SummaryKey Points covered in this section:

Overview of Business Continuity

The solutions and the supporting technologies that enable business continuity and uninterrupted data availability– Backup and Recovery– Local Replication– Remote Replication

Basic Disaster Recovery techniques

These are the key points covered in this section. Please take a moment to review them.

If you have not already done so, please review the Case Studies prior to taking the assessment.

This concludes the training. Please proceed to the Course Completion slide to take the Assessment.

Remote Replication Case Study

Business Profile: A Manufacturing Corporation maintains the storage of their mission critical applications on high-end Storage Arrays on RAID 1 volumes. The corporation has two data centers which are 50 miles apart. Current Situation/Issue: The corporation’s mission critical Database application takes up 1 TB of storage on a high end Storage Array. In the past year, top management has become extremely concerned because they do not have DR plans which will allow for zero RPO recovery if there is a site failure. The primary DR Site is the 2nd Data Center 50 miles away. The company would like explore remote replication scenarios which will allow for near zero RPO and a minimal RTO. The company is aware of the large costs associated with network bandwidth and would like explore other remote replication technologies in addition to the zero RPO solution. Proposal: Propose a remote replication solution to address the company’s concern. Justify how your solution will ensure that the Company’s needs are met.

Business Continuity

Documents

trademarks of emc corporation

emc software

emc effect

business continuity

emc storage logix

business continuitywelcome

business continuitydescribe

business continuitysection