Data centers sometimes fail. You can build in safeguards, fail-safe mechanisms, and redundancy through backup systems, but like all engineered systems, data centers can, and sometimes do, fail. Table 1 lists some of the notable data center outages of 2011 and 2012 and shows that even the biggest brands, with access to the best technology and resources, can suffer data center outages.
WHITEPAPER
Data center outages: impact, causes, costs, and how to mitigate
Netmagic Solutions
TABLE 1: Notable Data Center Outages in 2011 and 2012

| YEAR | WHO | HOW LONG | WHAT HAPPENED | IMPACT |
|---|---|---|---|---|
| 2012 | Huffington Post, Buzzfeed, Gawker and several others | Few days | Water flooded data centers in New York after Hurricane Sandy | Several websites and other services down |
| 2012 | Twitter | Few hours | Both primary and backup systems failed | A well-publicized campaign to encourage athletes and visitors to the Olympics to tweet was affected |
| 2012 | Salesforce | 7 hours | Power failure in data center | CRM services to customers affected |
| 2011 | Bank of America | 6 days | Online banking down across U.S. | 29 million users affected |
| 2011 | Amazon Web Services | 4 days | Amazon EC2 (Elastic Compute Cloud) services went down | Users affected worldwide |
| 2011 | Intuit | 2 – 4 days | Customers lost access to applications such as TurboTax Online, QuickBooks Online, Quicken and QuickBase | Several thousand |
| 2011 | Google | 2 days | Gmail affected | 120,000 users affected |
| 2011 | Blackberry | 24 hours plus | Unavailable worldwide | Millions of users affected |
| 2011 | Yahoo | 24 hours plus | Yahoo Mail outage | Users affected worldwide |
| 2011 | Microsoft | 24 – 72 hours | Windows Live, Hotmail inboxes disappear | Users affected worldwide |
| 2011 | Verizon | 24 hours plus | Series of data outages | Several US states unable to get LTE service |
| 2011 | Netflix | 4 – 8 hours | Netflix streaming service affected | 20 million users affected |

Source: See Ref 1, Ref 2
What is inside a data center?

A data center is a configuration of server rooms, cooling units, storage, batteries, and generators. At the core of a data center are racks and racks of servers. Servers need power, and lots of it: a typical large data center occupies 50,000 square feet of space and consumes 5 MW of power. Bringing in so much power generates massive amounts of heat. This heat is carried away by cooling units that force cool air from the floor, through the racks, and into ducts above.

Data centers collect and store vast amounts of data. This data needs to be stored safely, often for several years (as in the case of financial information). The storage hardware is therefore kept in secure locations, for example, in underground mines.

Since data centers run on power and utility power can fail, every data center has batteries for backup: thousands of them, stacked up and constantly being charged. In the event of a power failure, these battery banks provide power.

But batteries can provide power for only a few minutes at most. To provide power during longer power failures and blackouts, most data centers have banks of diesel generators on standby. And since these massive diesel generators need fuel, data centers need to store thousands of liters of diesel fuel.

Causes and cost of data center outage

Information on data centers is hard to come by. Because data centers are critical pieces of IT infrastructure and store sensitive customer data, data center managers are fiercely protective of their privacy. Probably the only major surveys of data center outages and their associated costs are two studies by the Michigan-based Ponemon Institute, sponsored by Emerson Network Power. Both studies are limited to U.S. data centers but can be considered representative of the industry.
So how can businesses ensure that disruptions due to data center glitches are minimized?
First, some perspective. Using an outsourced data center is, in almost all cases, far more reliable and cost-effective for a company than building one in-house. That’s because a third-party data center is able to share the very high cost of the technology, infrastructure, and personnel that go into building the data center among multiple customers. In fact, the economies of scale are so compelling that while data centers are growing in size, they are declining in number (see Ref 3). This means that more companies are outsourcing more of their IT infrastructure to third-party data centers.
Second, it helps to know what makes up a data center in order to better understand what is involved in keeping it robust.
Data center outages – the Indian context

In the 2011 Data Center Risk Index published by hurleypalmerflatt, an engineering consultancy, and Cushman & Wakefield, a real estate consultancy, India ranked at the bottom of the 20 countries assessed for the risk associated with running a data center. The U.S., Canada, and Germany were at the top of the rankings.

On the face of it, this is a dismal ranking for a country that is at the center of the global outsourcing revolution. On closer look, though, things are not as bad as they seem. To begin with, the Data Center Risk Index is a weighted average of 11 macro and local factors covering a wide range of attributes, from the cost of energy to political instability to inflation to availability of water. Depending on their priorities and approaches to risk, individual customers will arrive at significantly different assessments of risk.

This was best highlighted during the world’s largest power blackout, when an estimated 600 million people in the northern half of India lost power for two days in July 2012. In spite of the massive disruption across several areas of the economy, from public transport to industry to hospitals, there were no reports of major disruptions in data centers anywhere in India (see Ref 4). One ostensible reason is that the bulk of the data centers are located in Mumbai and the south of India, while the blackout was in the northern half of the country. But the real reason is that India has a chronic power problem, and data centers are geared to work through intermittent, low, and no power from public utilities. Most third-party data centers have power backup for days on end; it’s just another risk to be managed.
Outage causes

The first study, National Survey on Data Center Outages, published in September 2010, surveyed 453 individuals responsible for data center operations in the U.S. Of these, 95% said they had experienced an unplanned data center outage in the last two years. Respondents averaged 2.48 complete shutdowns, with an average downtime of 107 minutes. Apart from complete shutdowns, respondents reported far more frequent partial outages at the row or rack level over the two-year period: an average of 6.8 row-based outages with an average downtime of 152 minutes, and an average of 11.2 rack-based outages with an average duration of 153 minutes.
The most frequently cited root causes of data center outage were: UPS battery failure (65%), UPS capacity exceeded (53%), human error (51%), and UPS equipment failure (49%).
The most common responses to unplanned outages were to repair, replace or purchase additional IT or infrastructure equipment, followed by contacting the equipment vendor for support.
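To get a rough feel for what the survey averages add up to, the counts and durations above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes each quoted duration is an average per event, and it treats the product of two averages as an estimate of the total, which the survey itself does not state. Row- and rack-based outages affect only part of a facility, so the sum is outage-minutes of any kind, not site-wide downtime. The names `OUTAGE_TYPES` and `estimated_minutes` are illustrative.

```python
# (average number of events in two years, average minutes per event),
# taken from the survey figures quoted above. Reading the durations as
# per-event averages is an assumption.
OUTAGE_TYPES = {
    "complete shutdown": (2.48, 107),
    "row-based outage": (6.8, 152),
    "rack-based outage": (11.2, 153),
}


def estimated_minutes(avg_events: float, avg_minutes_per_event: float) -> float:
    """Rough two-year outage minutes for one outage type."""
    return avg_events * avg_minutes_per_event


total_minutes = sum(
    estimated_minutes(events, minutes) for events, minutes in OUTAGE_TYPES.values()
)

for name, (events, minutes) in OUTAGE_TYPES.items():
    print(f"{name}: about {estimated_minutes(events, minutes):.0f} minutes in two years")
print(f"all types: about {total_minutes / 60:.0f} hours in two years")
```

Under these assumptions the partial outages, not the complete shutdowns, account for most of the outage time a typical respondent dealt with.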
TABLE 2: Data Center Resilience Tier Levels

Tier 1: Basic (99.671% availability)
- Susceptible to disruptions from both planned and unplanned activity
- Single path for power and cooling distribution, no redundant components (N)
- May or may not have a raised floor, UPS, or generator
- Takes 3 months to implement
- Annual downtime of 28.8 hours
- Must be shut down completely to perform preventive maintenance

Tier 2: Redundant Components (99.741% availability)
- Less susceptible to disruptions from both planned and unplanned activity
- Single path for power and cooling distribution, includes redundant components (N+1)
- Includes raised floor, UPS, or generator
- Takes 3 to 6 months to implement
- Annual downtime of 22.0 hours
- Maintenance of the power path and other parts of the infrastructure requires a processing shutdown

Tier 3: Concurrently Maintainable (99.982% availability)
- Enables planned activity without disrupting computer hardware operation, but unplanned events will still cause disruption
- Multiple power and cooling distribution paths, but with only one active path, includes redundant components (N+1)
- Includes raised floor and sufficient capacity and distribution to carry load on one path while performing maintenance on the other
- Takes 15 to 20 months to implement
- Annual downtime of 1.6 hours

Tier 4: Fault Tolerant (99.995% availability)
- Planned activity does not disrupt critical load, and the data center can sustain at least one worst-case unplanned event with no critical load impact
- Multiple active power and cooling distribution paths, includes redundant components (2 (N+1), i.e., 2 UPS each with (N+1) redundancy)
- Takes 15 to 20 months to implement
- Annual downtime of 0.4 hours
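The availability percentages and annual downtime figures in Table 2 are two views of the same number. A minimal sketch of the conversion follows, assuming a non-leap year of 8,760 hours. The 99.982% used for Tier 3 is the figure usually quoted for that tier and should be treated as an assumption; the function name `annual_downtime_hours` is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# Availability percentages per tier. Tier 3's value is the commonly
# quoted figure and is an assumption in this sketch.
TIER_AVAILABILITY = {
    "Tier 1": 99.671,
    "Tier 2": 99.741,
    "Tier 3": 99.982,
    "Tier 4": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours per year of unavailability for a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for tier, availability in TIER_AVAILABILITY.items():
    hours = annual_downtime_hours(availability)
    print(f"{tier}: {availability}% available, about {hours:.1f} hours down per year")
```

The arithmetic reproduces the 28.8, 1.6, and 0.4 hours quoted for Tiers 1, 3, and 4. For Tier 2 it gives roughly 22.7 hours, slightly more than the 22.0 hours in the table, so the two published Tier 2 figures do not quite agree with each other.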
Going up the levels has a significant cost impact: construction costs for Tier 3, for instance, are double those for Tier 1. So organizations need to carefully determine an appropriate tier level for their different needs. eBay, for example, started out with all its applications in a Tier 4 data center until it analyzed its needs more closely and determined that 80% of its equipment could be shifted out without loss of reliability. Search, for instance, could be in a Tier 2 center, whereas databases and network backbones needed to be in a Tier 4 center. eBay says it cut its data center Capex and Opex by half by matching applications to data center tier level (see Ref 5).
How to mitigate data center outages

Experts recommend the following to minimize data center outages and mitigate damage:

- Invest in better equipment. It’s tempting to save money by buying cheap, but the cost of hardware failure is very high.
- Provide redundancy. Relying on any single machine or a single component in the core architecture is disastrous.
- When it comes to crucial data, never assume that someone else is automatically protecting you. Have backups.
- Have your data available on multiple servers in multiple data centers. Even consider having them in different geographical regions and spread between different service providers.

Outage costs

The second Ponemon Institute study, Calculating the Cost of Data Center Outages, published in February 2011, surveyed 41 independent data centers in the U.S. that had experienced at least one complete or partial unplanned shutdown in the previous 12 months.

The survey revealed that data center outages have significant financial consequences, ranging from a minimum cost of $38,969 to a maximum of $1,017,746 per organization. The average cost of a data center outage was $505,502 per incident. (1 USD = 55 INR.)

How to evaluate data center reliability

Historically, data centers have been designed in the absence of established standards. This made it very difficult for network managers to choose technologies to build and benchmark data centers. In 2005, the Telecommunications Industry Association (TIA) published TIA-942, the first standard to specifically address data center infrastructure. TIA-942 covers site space and layout, cabling infrastructure, tiered reliability, and environmental considerations.

Of these, the tiered reliability standards are directly useful to organizations looking to evaluate data center resilience across vendors.

The TIA standards, based on a system pioneered by the New York-based Uptime Institute in the mid-nineties, prescribe architectural, security, electrical, mechanical, and telecommunications recommendations.

There are four tiers of availability, from Tier 1 to Tier 4, with Tier 4 being the most resilient. See Table 2 for a description of the tiers. Redundancy is indicated in terms of N, where N represents only the capacity the system needs, with nothing spare.
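The tier availabilities also help put a number on the advice to keep data in multiple data centers. As a hedged sketch: if two sites fail independently of each other, the service is down only when both are down at once, so the combined availability is one minus the product of the individual unavailabilities. Independence is a strong assumption (a regional event such as Hurricane Sandy can take out several nearby sites together), and the sketch ignores the time needed to fail over. The function name `combined_availability` is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def combined_availability(site_availabilities: list[float]) -> float:
    """Probability that at least one site is up, assuming independent failures.

    Each entry is an availability expressed as a fraction, e.g. 0.99741.
    """
    probability_all_down = 1.0
    for availability in site_availabilities:
        probability_all_down *= 1.0 - availability
    return 1.0 - probability_all_down


# Two independent sites, each at the 99.741% quoted for Tier 2.
two_tier2_sites = combined_availability([0.99741, 0.99741])
downtime_minutes = (1.0 - two_tier2_sites) * HOURS_PER_YEAR * 60

print(f"combined availability: {two_tier2_sites:.5%}")
print(f"expected downtime: about {downtime_minutes:.1f} minutes per year")
```

Under these assumptions, two Tier 2 sites together come out ahead of a single Tier 4 site on paper, which is one reason geographic spread appears among the mitigation recommendations. Real-world results depend on how independent the sites actually are.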
The content you have downloaded has been produced with thoughtful, original research efforts by Netmagic. Please do not duplicate or misuse it. You may quote portions of our research in your own material provided you include a proper attribution to this original source. You are free to share this content on the web with
Conclusion

Data center outages are real, and they can cause significant loss of revenue. The frequency and duration of data center outages vary with the size of the data center: outages become less frequent and shorter as data centers increase in size, so the smaller the data center, the longer and more common the outages. IT equipment failure is the most expensive root cause, and human error is the least expensive.

But the benefits of outsourcing IT infrastructure to a third-party data center far outweigh the risks. As with all engineered systems, the risk is quantifiable and manageable.
References:
Major data center outages in 2011: http://www.evolven.com/blog/2011-devastating-outages-major-brands.html