Page 1: Disaster Recovery

1

Disaster Recovery

Broad Team – UCSD, UCOP, and others!

(special credit to Kris Hafner & Elazar Harel)

Presenter - Paul Weiss – Executive Director UCOP/IR&C

[email protected]

March 9-11, 2009 • Long Beach, CA • cenic09.cenic.org

Page 2: Disaster Recovery

2

Agenda

• Business view and background as to how and why

• The services portfolio

• Technical details

• Network implications

• Lessons learned, going forward

Page 3: Disaster Recovery

3

Situation as of 2Q2006

• UCSD had almost no DR plan in place
• UCOP used an IBM contract in Colorado
  – Cost: $200k/year + $600k/month if ever used
  – Insufficient gear and network reserved; cautiously estimated to cost > 50% more if updated appropriately
  – 40 hours of testing per year limit, difficult to schedule
  – RPO (Recovery Point Objective) <= 7 days
  – RTO (Recovery Time Objective) <= 3 days
  – Required UCOP personnel to activate and operate
  – Past testing indicated a decent mainframe recovery plan was in place, with limited distributed-system capability

Page 4: Disaster Recovery

4

DR Concept

• UCOP required shorter RPO & RTO

• Found a trusted partner (UCSD)
  – Willingness to be “married”
• Technical choices
• Change management – ongoing
• One “team”
• Common principles
• Use the WAN, “stupid”

Page 5: Disaster Recovery

5

Keys to Approach

– Buy enough storage; synchronize data in real or near-real time; avoid loading data during an actual DR event
– Mainframe – CBU option and buy memory
– Other servers – buy sufficient gear to have capacity available to run at either location without having to repurpose servers during an event
– Must be able to test and retest – DR is not STATIC!

The decision to do it!

Page 6: Disaster Recovery

6

Advantages of this Approach

• Costs for UCOP are comparable to the old DR plan
• Costs for UCSD are < 50% of a vendor solution
• Capability is dramatically improved
  – RTO and RPO < 1 day (and will be far less)
  – Can test as often as needed (we need it!)
  – Equipment is there and operational
  – More services can be “easily” added (and have been!) after the initial investment, and can be optimized over time
  – UC personnel “on the other side” will assist in case of disaster; the long-term goal is to recover without any personnel from the down location being immediately available

Page 7: Disaster Recovery

7

Initial Critical Success Factors

• UCOP assigned 0.5 FTE of staff dedicated to driving the effort
• One team – UCOP and UCSD
• Agree to basic principles, including $$$
• Fight scope creep
• Engage procurement personnel
• Communicate, communicate, communicate
• Test, test, test
• The WAN!

Page 8: Disaster Recovery

8

Current UCOP to UCSD DR Portfolio

– All Mainframe services (including 9 (and soon to be 10) PPS instances & UCRS)

– AYSO and all Benefits services
– Endowment and Investment Accounting System
– Active Directory
– VPN
– Email & File sharing
– Web Servers
– Banking/Treasury Systems
– Loan Programs
– Risk Services

Page 9: Disaster Recovery

9

The Picture - Part I

[Diagram showing the UCOP and UCSD sites]

Page 10: Disaster Recovery

10

Current UCSD to UCOP DR Portfolio

– All Mainframe services (including HR, financial and student transactional backend systems)

– All Web Based systems for HR/PPS, Financial, Student, Telecommunications billing, etc.

– Google search appliances
– Multi-terabyte data warehouse
– Multi-terabyte production data for all mainframe and open systems
– Dev and QA testing data and LPARs for mainframe applications
– Stand-alone systems for Intl. Student tracking, Audit, Coeus, and DARS

Page 11: Disaster Recovery

11

Future UCSD to UCOP DR Portfolio

– Portal/CMS backup for campus, business and student portals

– Single Sign-on, roles, affiliates authentication/authorization failover

– VPN
– Active Directory
– Domain controllers
– Core MTA (IronPort for now)
– BlackBerry
– Mailing lists
– Mailbox machines

Page 12: Disaster Recovery

12

The Picture - Part II

[Diagram showing the UCOP and UCSD sites]

Page 13: Disaster Recovery

13

Then it got interesting

As positive word got out, more locations and functional areas realized that DR was achievable

So…

Page 14: Disaster Recovery

14

Other DR services in place or committed to

– UC Effort Reporting System (3Q2009)
– UCOP Office of Technology Transfer Informix DB
– UCOP IdP Shibboleth Server
– UC Replacement Security Number (RSN)
– UCOP TSM Server
– UC Pathways (3Q2009)
– UCSD Med Mainframe, PPRC
– UCSB Distributed DNS Server
– UCLA Continuing Education of the Bar
– UCSD External Relations
– UCDC File Server
– Irvine Secondary DNS and Web Server
– SD Coastal Data Information Program

Page 15: Disaster Recovery

15

And a Special Case!

– UCSB mainframe load
– Four steps:

• DR from UCSB to UCOP utilizing PPRC

• Do a failover test to UCOP; if fully successful, keep production at UCOP

• DR from UCOP to UCSD - trivial

• Turn off UCSB mainframe

Page 16: Disaster Recovery

16

The Picture - Part III

[Diagram showing DR relationships among UCOP, UCSD, UCI, San Diego Coastal, UCSD Medical Center, UCSB, UCDC, UCLA CEB, and UCSD External Relations]

Page 17: Disaster Recovery

17

Services being Considered

– UCOP California Institute for Energy and Environment

– UCLA Med PPRC

And what’s next?

Broader discussions are now occurring, not just w/ UCOP, but between more and more UC players – nice “halo” effect with many leveraging the WAN!

Page 18: Disaster Recovery

18

Technical Details

• SD & OP (and SB & SDMC) purchased comparable HW• IBM SAN & Cisco SAN switches, supports global

mirroring (PPRC – Peer to Peer remote copy)• Mainframe – memory upgrade and CBU option – must have

sufficient capacity on both sides to support total load• Worked through CENIC and local network teams to set up

appropriate links for PPRC to ensure throughput• Wrote (and are writing) special monitoring tools• Setup remote tape capabilities so we don’t have to use

outside vendor for offsite storage on tape copies• You need to remember that this hardware needs to be in

normal refresh cycle just like hardware on your primary floor
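To give a concrete flavor of what such a monitoring tool might do, here is a minimal sketch in Python. It is hypothetical, not UCOP's actual tooling: the host names, ports, and heartbeat path are placeholders. It checks that DR-side services answer on their expected TCP ports and that a replicated heartbeat file is fresh enough to meet an RPO of under a day.

#!/usr/bin/env python3
"""Minimal DR monitoring sketch (illustrative only)."""
import socket
import time
from pathlib import Path
from typing import Optional

# Hypothetical DR-side endpoints to probe: (host, port, label)
DR_ENDPOINTS = [
    ("dr-mainframe.example.edu", 23, "mainframe TN3270"),
    ("dr-web.example.edu", 443, "web front end"),
    ("dr-ldap.example.edu", 389, "LDAP"),
]

HEARTBEAT_FILE = Path("/replica/heartbeat.txt")  # placeholder; touched on every successful sync
MAX_STALENESS_SECONDS = 24 * 3600                # target RPO: < 1 day

def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def heartbeat_age_seconds(path: Path) -> Optional[float]:
    """Return the age of the heartbeat file in seconds, or None if it is missing."""
    if not path.exists():
        return None
    return time.time() - path.stat().st_mtime

def main() -> None:
    # Reachability of DR-side services
    for host, port, label in DR_ENDPOINTS:
        status = "OK" if port_reachable(host, port) else "UNREACHABLE"
        print(f"{label:20s} {host}:{port:<5d} {status}")
    # Freshness of replicated data
    age = heartbeat_age_seconds(HEARTBEAT_FILE)
    if age is None:
        print(f"heartbeat {HEARTBEAT_FILE}: MISSING")
    elif age > MAX_STALENESS_SECONDS:
        print(f"heartbeat {HEARTBEAT_FILE}: STALE ({age / 3600:.1f} h old)")
    else:
        print(f"heartbeat {HEARTBEAT_FILE}: fresh ({age / 3600:.1f} h old)")

if __name__ == "__main__":
    main()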

Page 19: Disaster Recovery

19

Network concerns

• Frame size
  – For low traffic, the default end-to-end MTU of 1500 bytes works fine
  – OP/SD (more traffic) had to move to “jumbo frames” – 2300 bytes seems to work (see the path-MTU probe sketch after this list)
• On HPR today; need to move to DC
• @ OP – likely upgrade to 10 Gb, at 1 Gb now
• Must refine SLAs & due diligence
  – Acceptable catch-up (RPO issue)
  – Better understanding of traffic
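One way to verify that jumbo-frame-sized packets actually survive the path end to end is to probe the path MTU. The Python sketch below is an illustrative example under assumptions, not part of the original deployment: it assumes the Linux iputils ping (with its Don't-Fragment flag) is available, and the target host name is a placeholder. It binary-searches the largest DF payload that still gets a reply.

#!/usr/bin/env python3
"""Rough path-MTU probe (illustrative sketch; assumes Linux iputils ping)."""
import subprocess

TARGET = "dr-site.example.edu"   # placeholder DR-side host

def ping_df(host: str, payload: int) -> bool:
    """True if one ping with DF set and `payload` data bytes gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def largest_payload(host: str, lo: int = 1200, hi: int = 9000) -> int:
    """Binary search for the largest DF payload that still gets through."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if ping_df(host, mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best

if __name__ == "__main__":
    payload = largest_payload(TARGET)
    # ICMP payload + 8-byte ICMP header + 20-byte IP header ≈ usable IP MTU
    print(f"largest DF payload to {TARGET}: {payload} bytes "
          f"(≈ {payload + 28}-byte IP MTU)")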

Page 20: Disaster Recovery

20

Network Layout

[Network diagram: “UCOP DR Equipment Plan – Generation 1: No Application Clustering” (November 6, 2008). The diagram pairs UCOP production systems in the 128.48.x.x address space – the z/OS mainframe, IBM DS8100 SAN, F5 load balancers, WebSphere and Sybase application/web tiers, Citrix, Edify, Exchange, Active Directory, LDAP, Infoblox DNS, Windows file servers, and risk-management servers – with their “-dr” counterparts at UCSD in the 192.35.228.x space, and labels the synchronization method for each pair: PPRC SAN synchronization for mainframe storage, rsync for flat files (with Sybase reloaded from backups), RoboCopy or DFS for Windows file servers, Replistor for Exchange, HADR for the ERB database, and Active Directory cluster replication; Windows synchronization for some hosts was still TBD.]
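Several of the flat-file paths in the diagram are kept in sync with rsync. As a purely illustrative sketch (the paths and DR host below are placeholders, and the real jobs would add scheduling, retries, and alerting), a minimal Python driver for such a sync might look like this:

#!/usr/bin/env python3
"""Minimal rsync-driver sketch for flat-file DR sync (illustrative only)."""
import subprocess
import sys

# Hypothetical sync pairs: local production path -> DR-side host:path
SYNC_PAIRS = [
    ("/data/prod/flatfiles/", "dr-host.example.edu:/data/dr/flatfiles/"),
    ("/data/prod/reports/",   "dr-host.example.edu:/data/dr/reports/"),
]

def sync(src: str, dst: str) -> bool:
    """Run one rsync pass: archive mode, compressed, pruning extraneous DR-side files."""
    cmd = ["rsync", "-az", "--delete", "--timeout=600", src, dst]
    return subprocess.run(cmd).returncode == 0

def main() -> int:
    failures = 0
    for src, dst in SYNC_PAIRS:
        ok = sync(src, dst)
        print(f"{'OK  ' if ok else 'FAIL'} {src} -> {dst}")
        failures += 0 if ok else 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())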

Page 21: Disaster Recovery

21

Implications due to “Success”

• OP WAN connection capacity upgrade

• Change management is a lot more complicated

• Some technical “lock in”

• Insufficient documentation and test plans – even now.

• Better monitoring tools required

• Org processes can be stressed

Page 22: Disaster Recovery

22

Lessons Learned

• WAN is an underutilized/unrecognized asset

• Geography is less of an inhibitor than many believe

• This project will never be completed

• Can/should continuously optimize this over time (examples – virtualization, better sharing)

• Adding DR capability is easier after initial heavy lifting - e.g. Mainframe