DECIPHERING THE VIRTUAL DATA CENTER
Andrew Bartley, Business Operations Analyst, EMC Corporation, [email protected]
Rich Elkins, Senior Lab Systems Engineer, EMC Corporation, [email protected]
2014 EMC Proven Professional Knowledge Sharing 2
Table of Contents
Overview
Introduction
Being Data-Driven
Big Data Enhancing Self-Service
Classes of Metrics
    Resource Capacity
    Resource Availability
    Resource Utilization
Examples and Benefits of Collecting Metrics
    Resource Capacity
    Resource Availability
    Resource Utilization
    Resource Distribution
Conclusion
Bibliography
Disclaimer: The views, processes, or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes, or
methodologies.
Overview
Data center budgets are shrinking while end users expect ever-expanding services.
Virtualization and cloud services have already improved data center efficiencies, but business
units still demand more. Meanwhile, any proposals for investments in new services are met with
increasing scrutiny; simply purchasing more resources for your virtualized environment is no
longer an option. Using traditional forecasting and budgeting techniques can make it nearly
impossible to correctly determine where to invest your limited resources.
This Knowledge Sharing article will explain how using predictive analytics can improve the
utilization of your existing data center. It will be particularly useful to any IT professional who has
experienced shrinking budgets and increased demands. The article provides generalized
instructions along with specific examples of how to analyze historical resource demands and
predict future demand. These tools will enable IT professionals to improve utilization of existing
resources.
IT and data center management will need to monitor usage metrics as they transition from
legacy IT models to newer, more dynamic models. These dynamic models must support on-
demand service and automation; on-demand services require that the right resources be
available at the right time for end user consumption. To correctly forecast and predict data
center resource demand, IT professionals must harness the power of Big Data analytics to
intelligently design their services.
New, dynamic IT models support this trend with architectures for elastic environments that
maximize efficiency by allowing the same hardware to serve more use cases. These
innovations alone are not enough to reduce operating and capital costs; a deeper
understanding of the environment is required to properly manage it and maintain
service levels.
Patterns of service consumption can be identified by organizing and adding intelligence to vast
amounts of raw data. This requires two inputs: raw data and long-term business plans. We
already collect many different forms of data in our data centers: service requests, equipment
management data, equipment logs, resource utilization, etc. Analyzing and summarizing this
raw data can show trends in consumption. By mapping these consumption trends to long-term
business plans, we can more accurately predict future data center resource demands based on
similar historical records.
While the specific applications of Big Data analytics will vary drastically based on your
business, the general concepts are similar. While a large company developing a new software
product will require certain resources for each of their developers, not every developer will be
working on the new product at the same time. Analyzing historical records can help identify
resource utilization and resource demand magnitude over time; this will help in predicting that
the right resources will be available at the right time.
It is critical to data center management that metrics be measured and analyzed to understand
how resources are consumed, why they are consumed, and when demand will rise and fall. By
understanding how specific resources are being consumed, such as compute, memory, storage,
and network utilization, data center management can provide an improved and more reliable
service. Usage statistics can be used to anticipate and mitigate spikes in demand, enforce
different levels of QoS, determine which programs the resources are being consumed for, and
where to make future resource investments. The collected metrics coupled with the ability to
analyze and plan around them enables elasticity, continuous delivery, and an improved
customer experience.
Introduction
Is your organization data-driven? Many data centers are missing great opportunities for
improvement in the services they offer by not using their own historical and real-time operational
data to manage their environments. To illustrate this point, ask yourself: Does my organization
have a Service Level Agreement (SLA)? If so, how were those service levels determined? Are
my current processes capable of meeting the SLA 99% of the time? What proof do I have that
this is (or is not) the case? The purpose of this article is not to instruct organizations to rewrite
their SLAs; crafting an effective SLA is an extremely complicated process for which many
instructional courses, tutorials, and seminars are better suited (Sloan, 2002). The purpose of
this article is to encourage challenging the status quo of your organization to use more data in
your decision making.
Written with practicality in mind, this article uses several examples to reinforce the concepts with
real-world context. This is by no means a complete list of applications for data analysis, but
rather a set of common use cases that many data centers will likely find useful. With each example,
we describe when the examples would be appropriately used, what data to collect, how to
analyze that data, and where the newly obtained information can be used to drive decisions.
Each example set will have different levels of complexity that can be used depending on your
organization’s business needs. These examples are meant to be used as learning aids; they are
not meant to be used as templates for running your organization. All organizations must choose
their own metrics to monitor based on their strategic goals and business conditions.
Being Data-Driven
You have probably felt the effects of working in an organization that is not data-driven. Without
the data necessary to make informed decisions, SLAs will be missed more frequently, projects
will more often be late, and stress will be higher than if your decisions were based on
emotionless data. “Far too often the meetings and conversations of today tend to focus on
opinions, intuition, emotions, and other non-numeric or subjective sources of information”
(Kiemele, Schmidt, & Berdine, 2000). Although non-numeric information can be useful and at
times is the only information available, understanding your business quantitatively is critical to
managing it effectively (Kiemele, Schmidt, & Berdine, 2000).
Has your data center service organization ever discussed offering a new product or service?
Without data to represent what your customers actually want, the conversation quickly turns into
an emotion-based event rather than a business-based event. In an age of ever-shrinking
budgets and constantly increasing expectations, these emotional decisions are no longer
sufficient. Data centers can no longer afford an “if we build it, they will come” approach. This
investment model is fading in favor of a “just-in-time” IT investment strategy (Verge, 2013).
Data center investments need to be based on current business demands while
still being flexible enough to recognize and react to changes rapidly. Discussions about new
services without relevant data take too long and do not guarantee that a desired service will be
produced.
Luckily, collecting quantitative data is easier now than it has ever been thanks to an increasingly
digital world. In the past, quantitatively measuring a process was time-intensive. Tracking the
throughput of a given process often required a dedicated person to manually observe
and record it. Once enough data had been collected, one or several analysts would
then need to manually compute the desired figures. Modern data center software and platforms
make data collection look effortless when compared to older data collection methods.
Big Data Enhancing Self-Service
The rise of cloud computing and increasingly dynamic technologies further removes data center
management and operational teams from the consumers of data center services and resources.
One of the technologies enabling this separation is Self-Service. CA Technologies defines Self-
Service as empowering end users to gain access to resources and services they want without
IT or data center staff intervention (CA Technologies, 2009). This enables resource consumers
to gain access to resources while reducing the visibility of how and why the resources are being
used. While the industry has widely accepted the benefits of virtualization and Self-Service
(EMC, 2012) (VMware, 2009), the abstraction of resources requires a more sophisticated set of
record keeping to ensure that resources are appropriately allocated. The only way to mitigate
this is to collect metrics that accurately reflect your Self-Service process (Wladawsky-Berger,
2013).
Self-service metrics are diverse and easy to capture. Data center management software can
determine which organizations are using data center resources, the amount of resources they
are consuming, and how long they are using the resources. Linking user and organizational
information to the resources being used, the duration of use, and the level of utilization during
that use can greatly aid in enforcing data center policies. For example, if an individual or
an organization is consuming a greater percentage of resources than the business deems
necessary, the data center could throttle usage or apply a financial fee.
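As a sketch of how such a policy check might look, the snippet below flags organizations whose share of total consumption exceeds a quota. The organization names, usage figures, and the 25% threshold are all invented for illustration.

```python
# Sketch: flag organizations whose share of consumed resources exceeds a
# policy threshold. The usage figures and the 25% quota are illustrative.
QUOTA = 0.25  # maximum share of total consumption allowed per organization

usage_by_org = {"eng": 420, "qa": 180, "sales": 60, "ops": 340}  # e.g. vCPU-hours

def over_quota(usage, quota=QUOTA):
    """Return the organizations consuming more than `quota` of the total."""
    total = sum(usage.values())
    return sorted(org for org, used in usage.items() if used / total > quota)

print(over_quota(usage_by_org))  # ['eng', 'ops']
```

A real deployment would pull the usage figures from data center management software rather than a hard-coded dictionary, and the action taken (throttling or chargeback) would be a business decision.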
Classes of Metrics
The benefit of collecting and analyzing data center metrics is in understanding the overall
operation of the data center. Without metrics, data center management cannot make informed
business decisions, which can reduce performance and drive up costs. Collecting specific
metrics provides transparency into daily data center operations, leading to data-driven
decisions that improve the overall operation of the data center. Capturing and
analyzing three metric categories is critical to maintaining a high level of service within the data
center. The three categories of metrics are:
1. Resource Capacity
2. Resource Availability
3. Resource Utilization
Collecting and understanding metrics related to these three categories will significantly enhance
customer experience by improving availability, scalability, visibility, maintainability, resiliency,
and predictability. The sections below define the three classes of metrics to collect and how they
contribute to deciphering the data center.
Resource Capacity
Resource Capacity is the most basic set of metrics to collect and analyze for any data center
manager. According to the Information Technology Infrastructure Library (ITIL), measuring
capacity supports the optimal provisioning of IT- and data center-related services in order to
meet business demands (ITIL Library). Monitoring basic data center resources such as storage,
compute, memory, network, and personnel is critical to fully understand the state of your data
center and the quality of the services you provide. Specific Resource Capacity metrics to
collect include:
Total CPU resources
CPU resources consumed
Total network bandwidth
Network bandwidth consumed
Total storage resources
Storage resources consumed
Total memory resources
Memory resources consumed
# of incidents due to capacity shortages
Duration of capacity shortages
Accuracy of previous capacity forecast
Capacity reserves during times of increased and decreased demand
Unplanned capacity adjustments
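As a minimal illustration of working with the "total" and "consumed" pairs above, the sketch below computes the remaining headroom per resource; the resource names and values are hypothetical.

```python
# Sketch: compute remaining headroom for each monitored resource from
# "total" and "consumed" capacity figures. All numbers are made up.
capacity = {
    "cpu_ghz":    {"total": 800.0,  "consumed": 560.0},
    "memory_gb":  {"total": 4096.0, "consumed": 3481.6},
    "storage_tb": {"total": 500.0,  "consumed": 275.0},
}

def headroom(cap):
    """Percent of each resource that remains unconsumed."""
    return {name: round(100 * (1 - v["consumed"] / v["total"]), 1)
            for name, v in cap.items()}

print(headroom(capacity))  # {'cpu_ghz': 30.0, 'memory_gb': 15.0, 'storage_tb': 45.0}
```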
Analyzing these metrics provides a manager with visibility as to when they need to procure
additional resources. Determining maximum capacity in relation to the level of currently
consumed capacity is critical to maintaining data center services. The visibility provided by
collecting these metrics makes justifying additional resources a much simpler process. By
understanding resource capacity within the data center, a data center manager can ensure that
appropriate resource capacity is always available to meet business requirements.
Resource Availability
Resource Availability is the method of ensuring the proper resources are available to meet
elastic customer and business demand. Resource availability ensures that a data center can
sustain services to support business needs while keeping costs economical (RMS Services).
This differs from resource capacity, which only measures whether the resources exist, not
whether they are available for use. Regularly scheduled maintenance downtime and
unscheduled incident downtime both can reduce the availability of resources in a data center.
The availability of resources is typically tied to an SLA or an agreed-upon level for a specific
duration (TeamQuest). The benefit of collecting and analyzing these metrics is in
understanding data center resource resilience. There will always be unplanned events that
consume resources or make them unavailable. It is critical to collect downtime metrics so that
management can plan for and mitigate such events, reducing risk to the data center.
Resource Utilization
The KPI Library defines Resource Utilization as the number of hours of work assigned to a
resource or group of resources as a percentage of their availability for a given time period (KPI
Library, 2009). This means that the level of resource consumption can be measured for the
duration of time the resources are assigned to a project or workload. Measuring utilization
provides transparency into how resources are utilized and reveals trends that aid in forecasting.
This is important because utilization metrics enable data center management to plan
capacity so that resources are available for demand spikes without drastically
overprovisioning environments. Utilization metrics determine key factors such as whether data
center resources are properly allocated for specific workloads or projects, enabling
management to make resource adjustments so that data center resources are properly
allocated and distributed. This can significantly reduce capital expenses and increase elasticity by
providing the correctly provisioned environments to meet resource demands at different times.
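The KPI Library definition above reduces to a simple ratio, which the sketch below expresses; the hour figures are illustrative.

```python
# Sketch of the KPI Library definition: utilization is hours of work assigned
# as a percentage of hours available over a given period.
def utilization_pct(hours_assigned, hours_available):
    """Resource utilization for one period, as a percentage."""
    return 100.0 * hours_assigned / hours_available

# Example: a resource pool available 720 hours in a month, assigned 540 hours:
print(utilization_pct(540, 720))  # 75.0
```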
Examples and Benefits of Collecting Metrics
“Over time, Big Data and advanced data science applications will enable us to take operational
decision making to a whole new level in a wide variety of disciplines.” (Wladawsky-Berger,
2013)
There are many benefits for data center management in collecting and aggregating the four
metric categories: capacity, availability, utilization, and distribution. The following section
explains high-level metrics to collect, what can be gained, and examples where applicable. It is important to
remember that the use case examples and benefits listed in this article are not exhaustive.
Resource Capacity
To effectively measure resource capacity, data center management will need to monitor
performance, consumption, and throughput on resources such as CPU, network, and storage. It
is important that appropriate resources are monitored and collected over time so that trends can
be identified and analyzed; this trend analysis will enable better capacity planning and
forecasting in the future. While the services provided by a data center will dictate which metrics
are worth collecting, aggregating, and analyzing, most data centers will want to measure several
different network, server, and virtualization metrics over time. Below is a list of common data
center metrics that might be collected (IT Process Maps, 2013):
# of Incidents due to Capacity Shortages
Resolution Time of Incidents due to Capacity Shortages
# of Capacity Adjustments
# of Unplanned Capacity Adjustments
Capacity Reserves (unit of measurements vary based on resource)
% of services and infrastructure components with capacity monitoring
For example, the business could acquire a larger user community and need to rapidly expand
their data center. This could require additional compute, memory, storage, SAN, and network
capacity that would need to be scoped to meet new business requirements. Metrics such as the
number and duration of capacity shortages can be collected and analyzed so that management
can measure whether the data center successfully increased capacity to meet business needs.
Understanding capacity metrics provides many benefits to the data center. Capacity metrics
enable data center management to measure whether they are meeting business requirements
at the most basic level. Metrics such as compute, storage, memory, and network consumption
provide increased visibility into what resources have been procured and deployed. This type of
transparency over time allows for easier justification of new resources, greater predictability,
and better capacity planning, which enable scalability to meet business demands.
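One simple way to turn such consumption trends into a forecast is a least-squares line over historical consumption. The sketch below estimates the number of months until storage is exhausted; the consumption history and the 500 TB ceiling are invented for illustration.

```python
# Sketch: fit a linear trend to monthly storage consumption and estimate
# when consumption reaches total capacity. History and ceiling are invented.
def fit_line(ys):
    """Least-squares slope/intercept for equally spaced samples 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def months_until_full(history_tb, total_tb):
    """Months from the last sample until the trend line reaches capacity."""
    slope, intercept = fit_line(history_tb)
    if slope <= 0:
        return None  # consumption flat or shrinking; no exhaustion forecast
    month_full = (total_tb - intercept) / slope
    return month_full - (len(history_tb) - 1)

history = [200, 220, 240, 260, 280]  # TB consumed in each of the last 5 months
print(months_until_full(history, 500))  # 11.0
```

A linear fit is only a first approximation; real demand is often seasonal or bursty, which is exactly why the article recommends collecting metrics over long periods before forecasting.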
Resource Availability
Measuring availability is critical to maintaining data center services. To understand whether
achieved availability meets the level agreed upon in an SLA, it is critical to monitor and
collect metrics (IT Process Maps). Data center management must test the reliability and
resiliency of their resource capacity to understand availability. The metrics to measure when
testing availability concern the accessibility of the data center services agreed upon in the
SLA, and include (IT Process Maps, 2013):
Number of service interruptions
Duration of service interruptions
Percentage of services and infrastructure being actively monitored
It is important to perform tests simulating outages so that metrics can be collected on recovery
of services. Understanding how resources such as CPU, storage, memory, and network react to
simulated outages and disruptions can provide valuable insight toward preventing downtime.
These metrics will provide transparency into the resiliency of data center resources and will help
with planning future hardening of resources to ensure their availability.
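A basic availability calculation from collected interruption metrics might look like the following sketch; the outage durations and the 99.9% target are illustrative, not recommendations.

```python
# Sketch: compute achieved availability from interruption durations and
# compare it against an SLA target. Outage minutes and target are invented.
def availability_pct(outage_minutes, period_minutes):
    """Percent of the period during which service was available."""
    return 100.0 * (period_minutes - sum(outage_minutes)) / period_minutes

MONTH_MINUTES = 30 * 24 * 60   # a 30-day month: 43,200 minutes
outages = [12, 7, 24]          # minutes lost to three service interruptions

achieved = availability_pct(outages, MONTH_MINUTES)
print(round(achieved, 2), achieved >= 99.9)
```

The number of interruptions and their durations are exactly the two availability metrics listed above, so this calculation falls directly out of the collected data.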
Figure 1: Service Running Normally
For example, a data center may have a virtualization environment with hundreds of servers
clustered together as hypervisors. The hypervisors have datastores on an array connected via a
SAN (Figure 1). The hypervisors are hosting thousands of VMs serving a wide range of
business needs. The resources for the VM requirements have been scoped to meet business
requirements and are included in the data center’s available capacity. A failure with the array
hosting the datastore or the SAN connection to the hypervisors could have a devastating effect
on the availability of virtualized resources (Figure 2). Even though the environment still has the
resource capacity to support the VMs, the resources will not be available for use due to a failure
in the datastore. Using availability metrics gained through availability testing, a data center can
determine which areas of recovery were the most challenging and plan availability
improvements so that outages are shorter or avoided altogether, reducing data
center risk.
Figure 2: Service Disruption
Understanding resource availability is critical to the success and health of the data center.
Collecting availability metrics provides transparency into how resilient resources are and what
the potential impact might be if there is an unplanned event. The visibility provided by availability
metrics helps in planning maintenance cycles to improve service and harden resources.
Without understanding and improving resource availability, all investments into additional
capacity would be exposed to unacceptable levels of risk.
Resource Utilization
Utilization can be measured in many different ways with many different tools and methods. The
simplest way to define utilization is in the formula below (Bright Hub PM, 2013):

Utilization (%) = (Hours of Work Assigned ÷ Hours Available) × 100
Measuring utilization provides visibility into overall data center capacity utilization and into how
resources are utilized on a project-by-project basis. This information enables better allocation of current
capacity to ensure elasticity and availability of capacity throughout the data center. Figure 3
shows the percentage of storage space utilized for each of four projects. Without any context,
these data may not have much value. If one of the project managers requested additional
storage space for their team, however, this graph helps inform the decision. Imagine that the
manager of Project C is requesting additional storage because members of her team are not
getting enough space to complete their nightly builds. Relative to the other projects, it is clear
that Project C has enough overall capacity allocated. Without this graph, the request for
additional storage might have been approved, causing the company to spend more money on
storage than is actually required. While this data alone does not show what is causing the
storage problems, Resource Capacity can already be ruled out as the root cause.
Figure 3: Storage Utilization Example
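A comparison like Figure 3 can be produced from per-project allocated and used figures; in the sketch below, all project names and numbers are invented for illustration.

```python
# Sketch: per-project storage utilization (used / allocated), the kind of
# comparison behind Figure 3. Project names and figures are invented.
projects = {
    "Project A": {"allocated_tb": 40, "used_tb": 30},
    "Project B": {"allocated_tb": 60, "used_tb": 21},
    "Project C": {"allocated_tb": 80, "used_tb": 36},
}

def storage_utilization(projs):
    """Percent of allocated storage each project is actually using."""
    return {name: round(100 * p["used_tb"] / p["allocated_tb"])
            for name, p in projs.items()}

print(storage_utilization(projects))
# {'Project A': 75, 'Project B': 35, 'Project C': 45}
```

With numbers like these, a request for more capacity from a project sitting well below full utilization can be questioned before money is spent, while the root cause (for example, per-user quotas within the project) is investigated.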
Understanding how different projects, groups, and workloads utilize capacity is critical to a data
center. Data on how resources are being utilized provides visibility into how and when resources
are consumed and increases predictability, aiding future capacity planning. Transparency in
resource consumption provides data that enable data center management to create and enforce
effective policies that help ensure that resources are available to business units who truly need
them. Knowledge gained from measuring utilization empowers data center management to
create policies that satisfy consumer capacity needs and keep data center capital costs down
while still meeting business requirements.
Resource Distribution
Collecting the three previously identified classes of metrics enables effective distribution of data
center resources. However, collecting these metrics does not guarantee ideal resource
distribution; rather, they are a gateway to the critical thinking and analysis required for resource
distribution decisions. To achieve distributed resources, a data center will need to collect
capacity, availability, and utilization metrics to more accurately forecast and plan. Capacity
metrics will provide transparency and measurement as to whether the optimal level of capacity
has been reached to meet business requirements. Availability metrics alert the data center to
the level of risk the capacity is exposed to and how to harden the capacity to ensure
availability. Deploying resources based on utilization metrics enables the properly provisioned
capacity to meet project needs while still ensuring sufficient elasticity for competing business
demands. Planning based on the three metrics—capacity, availability, and utilization—will help
any data center achieve optimal resource distribution.
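As a toy illustration of combining the three metric classes into a single distribution check, the sketch below flags common problems for one resource pool. The thresholds (a 90% provisioning ceiling, a 70% utilization floor, a 99.5% availability floor) are assumptions for the example, not recommendations.

```python
# Sketch: a toy health check combining capacity, utilization, and
# availability metrics for one resource pool. All thresholds are assumed.
def distribution_issues(provisioned_pct, utilized_pct, availability_pct):
    """Return a list of findings for a single resource pool."""
    issues = []
    if provisioned_pct > 90:
        issues.append("capacity: little headroom left")
    if utilized_pct < 70 * provisioned_pct / 100:
        issues.append("utilization: pool looks overprovisioned")
    if availability_pct < 99.5:
        issues.append("availability: below target")
    return issues

print(distribution_issues(provisioned_pct=85, utilized_pct=40,
                          availability_pct=99.7))
# ['utilization: pool looks overprovisioned']
```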
Figure 4 illustrates a data center that has collected metrics and used them to plan effectively,
enabling proper resource distribution throughout the data center. The level of
capacity has been scoped to meet business needs, which can be measured and validated
through capacity metrics. This is illustrated by the provisioned level of capacity never exceeding
the maximum capacity. Utilization nearly meets the provisioned resource level while not
exceeding it, indicating properly provisioned resources. Risk planning has also been performed,
which is indicated by capacity always remaining available even during service interruptions.
Figure 4: Resources Appropriately Distributed and Available
Collecting resource capacity, availability, and utilization metrics is critical to the operation of an
effective data center. By collecting and analyzing these metrics, data center management can
not only more accurately forecast data center trends, but can also determine the success of
new initiatives and justify additional resources. Without collecting specific metrics, it is
impossible to understand how data center operations are actually running.
Conclusion
Metrics provide visibility into the types of resources used, who is using them, how and when
they are being used, the duration of use, and whether there are capacity or availability
shortages. Collected over time, these metrics reveal trends that can be used to predict spikes
in demand, identify areas of vulnerability, and increase the accuracy of forecasting and service
delivery. In turn, better forecasting enables better capacity planning, creating a more efficient
data center. Analyzing the trends revealed by metric collection will help in building a highly
effective data center with robust, well-distributed capacity.
The complexity, scale, and diversity of services provided by data centers heighten the
importance of collecting meaningful metrics so that management can make data-driven
decisions. It is critical that the chosen metrics are accurate indicators of data center
performance. Absent these metrics, data center management is left making best guesses to
satisfy their planning requirements. This can be extremely problematic and can negatively
impact capital budgets and the services provided. Thus, it is imperative to measure and
understand metrics that truly represent the performance of your virtual data center.
Bibliography
Bright Hub PM. (2013, March 7). Performance Measurement of Resource Planning and Utilization - Part VI. Retrieved January 18, 2014, from Bright Hub PM: http://www.brighthubpm.com/monitoring-projects/19276-performance-measurement-of-resource-planning-and-utilization-part-vi
CA Technologies. (2009, October 29). What is "Self-Service"? From Service Management: http://blogs.ca.com/itil/2009/10/26/what-is-quot-self-service-quot/
EMC. (2012, June). An IT-as-a-Service Handbook. From http://www.emc.com/collateral/software/white-papers/h10801-stepstoitaas-wp.pdf
IT Process Maps. (n.d.). Availability Management. Retrieved January 18, 2014, from http://wiki.en.it-processmaps.com/index.php/Availability_Management
IT Process Maps. (2013, August 3). ITIL KPIs Service Design. From ITIL Process: http://wiki.en.it-processmaps.com/index.php/ITIL_KPIs_Service_Design#ITIL_KPIs_Capacity_Management
ITIL Library. (n.d.). Capacity Management. Retrieved January 18, 2014, from Open Guide: http://www.itlibrary.org/index.php?page=Capacity_Management
Kiemele, M. J., Schmidt, S. R., & Berdine, R. J. (2000). Basic Statistics: Tools for Continuous Improvement. Colorado Springs, Colorado: Air Academy Press.
KPI Library. (2009). Resource Utilization. Retrieved January 18, 2014, from Project Portfolio: http://kpilibrary.com/kpis/resource-utilization-2
RMS Services. (n.d.). ITIL Availability Management. Retrieved January 18, 2014, from ITIL Defined Services: http://www.rms.co.uk/itil-availablity-management
Sloan, J. D. (2002, November 5). Tutorial: Service Level Agreements. From Wofford College: http://webs.wofford.edu/sloanjd/netlab/oldweb/mng/slas.htm
TeamQuest. (n.d.). ITIL Availability Management. Retrieved January 18, 2014, from ITIL Information: https://www.teamquest.com/resources/itil/service-delivery/availability-management/
Verge, J. (2013, October 21). Smaller Data Centers and Markets Emerge as M&A Sweet Spot. Retrieved December 23, 2013, from Data Center Knowledge: http://www.datacenterknowledge.com/archives/2013/10/21/smaller-data-centers-and-markets-emerge-as-ma-sweet-spot/
VMware. (2009). Key Features and Benefits. From VMware: http://www.vmware.com/files/pdf/key_features_vsphere.pdf
Wladawsky-Berger, I. (2013, September 27). Data-Driven Decision Making: Promises and Limits. Retrieved December 20, 2013, from Wall Street Journal: http://blogs.wsj.com/cio/2013/09/27/data-driven-decision-making-promises-and-limits/
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.