Disaster Recovery
Broad Team – UCSD, UCOP, and others! (special credit to Kris Hafner & Elazar Harel)
Presenter: Paul Weiss – Executive Director, UCOP/IR&C – [email protected]
March 9-11, 2009 • Long Beach, CA • cenic09.cenic.org
Agenda
• Business view and background as to how and why
• The services portfolio
• Technical details
• Network implications
• Lessons learned, going forward
RIDING THE WAVES OF INNOVATION • cenic09.cenic.org
Situation as of 2Q2006

• UCSD had almost no DR plan in place
• UCOP used an IBM contract in Colorado
  – Cost: $200K/year, plus $600K/month if ever used
  – Had insufficient gear and network reserved; cautiously estimated the cost would be >50% more if updated appropriately
  – Limit of 40 hours of testing per year, difficult to schedule
  – RPO (Recovery Point Objective) <= 7 days
  – RTO (Recovery Time Objective) <= 3 days
  – Required UCOP personnel to activate and operate
  – Past testing indicated a decent mainframe recovery plan was in place, with limited distributed-system capability
DR Concept

• UCOP required shorter RPO & RTO
• Found a trusted partner (UCSD)
  – Willingness to be “married”
• Technical choices
• Change management – ongoing
• One “team”
• Common principles
• Use the WAN, “stupid”
Keys to Approach

– Buy enough storage; synchronize data in real or near-real time; avoid loading data during an actual DR event
– Mainframe – CBU option, and buy memory
– Other servers – buy sufficient gear to have capacity available to run at either location without having to repurpose servers during an event
– Must be able to test and retest – DR is not STATIC!

The decision to do it!
Advantages of this Approach

• Costs for UCOP are comparable to the old DR plan
• Costs for UCSD are <50% of a vendor solution
• Capability is dramatically improved
  – RTO and RPO < 1 day (and will be far less)
  – Can test as often as needed (we need it!)
  – Equipment is there and operational
  – More services can be “easily” added (and have been!) after the initial investment, and can be optimized over time
  – UC personnel “on the other side” will assist in case of disaster; the long-term goal is to recover without any personnel from the down location being immediately available
Initial Critical Success Factors

• UCOP assigned 0.5 FTE of staff dedicated to driving the effort
• One team – UCOP and UCSD
• Agree to basic principles, including $$$
• Fight scope creep
• Engage procurement personnel
• Communicate, communicate, communicate
• Test, test, test
• The WAN!
Current UCOP to UCSD DR Portfolio
– All mainframe services (including 9 – soon to be 10 – PPS instances & UCRS)
– AYSO and all Benefits services
– Endowment and Investment Accounting System
– Active Directory
– VPN
– Email & file sharing
– Web servers
– Banking/treasury systems
– Loan programs
– Risk services
The Picture - Part I

[Diagram: UCOP-to-UCSD DR replication topology between the two sites.]
Current UCSD to UCOP DR Portfolio
– All mainframe services (including HR, financial, and student transactional backend systems)
– All web-based systems for HR/PPS, financial, student, telecommunications billing, etc.
– Google Search Appliances
– Multi-terabyte data warehouse
– Multi-terabyte production data for all mainframe and open systems
– Dev and QA testing data and LPARs for mainframe applications
– Stand-alone systems for Intl. Student tracking, Audit, Coeus, and DARS
Future UCSD to UCOP DR Portfolio
– Portal/CMS backup for campus, business, and student portals
– Single sign-on, roles, affiliates authentication/authorization failover
– VPN
– Active Directory
– Domain controllers
– Core MTA (IronPort for now)
– BlackBerry
– Mailing lists
– Mailbox machines
The Picture - Part II

[Diagram: UCSD-to-UCOP DR replication topology between the two sites.]
Then it got interesting

As positive word got out, more locations and functional areas realized that DR was achievable.

So…
Other DR services in place or committed to

– UC Effort Reporting System (3Q2009)
– UCOP Office of Technology Transfer Informix DB
– UCOP IdP Shibboleth server
– UC Replacement Security Number (RSN)
– UCOP TSM server
– UC Pathways (3Q2009)
– UCSD Med mainframe, PPRC
– UCSB distributed DNS server
– UCLA Continuing Education of the Bar
– UCSD External Relations
– UCDC file server
– Irvine secondary DNS and web server
– SD Coastal Data Information Program
And a Special Case!

– UCSB mainframe load
– Four steps:
  • DR from UCSB to UCOP utilizing PPRC
  • Do a failover test to UCOP; if fully successful, keep production at UCOP
  • DR from UCOP to UCSD – trivial
  • Turn off the UCSB mainframe
The Picture - Part III

[Diagram: multi-site DR topology – UCOP and UCSD hubs, with UCI, San Diego Coastal Data Information Program, UCSDMC, UCSB, UCDC, UCLA CEB, and UCSD External Relations connected.]
Services being Considered

– UCOP California Institute for Energy and Environment
– UCLA Med PPRC

And what’s next?

Broader discussions are now occurring, not just with UCOP but between more and more UC players – a nice “halo” effect, with many leveraging the WAN!
Technical Details
• SD & OP (and SB & SDMC) purchased comparable hardware
• IBM SAN & Cisco SAN switches; supports global mirroring (PPRC – Peer-to-Peer Remote Copy)
• Mainframe – memory upgrade and CBU option – must have sufficient capacity on both sides to support the total load
• Worked through CENIC and local network teams to set up appropriate links for PPRC to ensure throughput
• Wrote (and are writing) special monitoring tools
• Set up remote tape capabilities so we don’t have to use an outside vendor for offsite storage of tape copies
• Remember that this hardware needs to be on a normal refresh cycle, just like the hardware on your primary floor
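The monitoring tools mentioned above are not described in the deck. As an illustrative sketch only (the function name, timestamps, and threshold below are assumptions, not the actual UCOP/UCSD tooling), a basic replication-lag check against an RPO target could look like:

```python
from datetime import datetime, timedelta

def rpo_breached(last_sync: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the time since the last successful
    synchronization exceeds the Recovery Point Objective."""
    return (now - last_sync) > rpo

# Hypothetical example: a 1-day RPO, with the last sync cycle
# having finished 30 hours ago -- this should raise an alert.
last_sync = datetime(2009, 3, 9, 6, 0)
now = datetime(2009, 3, 10, 12, 0)
if rpo_breached(last_sync, now, timedelta(days=1)):
    print("ALERT: replication lag exceeds RPO")
```

The same check works for any of the sync mechanisms (PPRC, rsync, RoboCopy), as long as each job records a last-success timestamp.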
19
Network concerns
• Frame size
  – For low traffic, the default end-to-end MTU of 1500 bytes works fine
  – OP/SD (more traffic) had to move to “jumbo frames” – 2300 bytes seems to work
• On HPR today; need to move to DC
• At OP – likely upgrade to 10Gb (at 1Gb now)
• Must refine SLAs & due diligence
  – Acceptable catch-up (RPO issue)
  – Better understanding of traffic
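To see why frame size matters for replication throughput, a rough back-of-the-envelope calculation helps (a sketch; header sizes assume plain IPv4 + TCP with no options, ignoring link-layer framing):

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options

def tcp_payload(mtu: int) -> int:
    """Usable TCP payload bytes per frame for a given MTU."""
    return mtu - IP_HEADER - TCP_HEADER

def frames_needed(total_bytes: int, mtu: int) -> int:
    """Frames required to move total_bytes at the given MTU."""
    return -(-total_bytes // tcp_payload(mtu))  # ceiling division

gib = 1 << 30
print(tcp_payload(1500))         # 1460 bytes per standard frame
print(tcp_payload(2300))         # 2260 bytes per jumbo frame
print(frames_needed(gib, 1500))  # moving 1 GiB at MTU 1500...
print(frames_needed(gib, 2300))  # ...vs. ~35% fewer frames at 2300
```

Fewer frames per gigabyte means less per-packet header and interrupt overhead, which is why the higher-traffic OP/SD replication path moved to jumbo frames.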
20
Network Layout
[Diagram: “UCOP DR Equipment Plan – Generation 1: No Application Clustering, November 2008.” Detailed host-level layout of the UCOP and UCSD sites: paired production and “-dr” servers (mainframes, IBM DS8100 SANs, F5 load balancers, Catalyst routers/firewalls/switches, Infoblox DNS, and web/WAS/Sybase/LDAP/Citrix/Exchange/Active Directory/file and risk-management servers), with the synchronization method labeled per service: SAN/PPRC synchronization for the mainframe and SAN, rsync for flat files (Sybase loaded from backups), RoboCopy or DFS for Windows file shares, Replistor for Exchange, and HADR sync for the ERB database.]
Implications due to “Success”
• OP WAN connection capacity upgrade
• Change management is a lot more complicated
• Some technical “lock-in”
• Insufficient documentation and test plans – even now
• Better monitoring tools required
• Org processes can be stressed
Lessons Learned
• The WAN is an underutilized/unrecognized asset
• Geography is less of an inhibitor than many believe
• This project will never be completed
• Can/should continuously optimize over time (examples – virtualization, better sharing)
• Adding DR capability is easier after the initial heavy lifting – e.g., mainframe