DECIPHERING THE VIRTUAL DATA CENTER
Andrew Bartley, Business Operations Analyst, EMC Corporation, [email protected]
Rich Elkins, Senior Lab Systems Engineer, EMC Corporation, [email protected]
2014 EMC Proven Professional Knowledge Sharing 2
Table of Contents
Overview
Introduction
Being Data-Driven
Big Data Enhancing Self-Service
Classes of Metrics
    Resource Capacity
    Resource Availability
    Resource Utilization
Examples and Benefits of Collecting Metrics
    Resource Capacity
    Resource Availability
    Resource Utilization
    Resource Distribution
Conclusion
Bibliography
Disclaimer: The views, processes, or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes, or
methodologies.
Overview
Data center budgets are shrinking while end users expect ever-expanding services.
Virtualization and cloud services have already improved data center efficiencies, but business
units still demand more. Meanwhile, any proposals for investments in new services are met with
increasing scrutiny; simply purchasing more resources for your virtualized environment is no
longer an option. Using traditional forecasting and budgeting techniques can make it nearly
impossible to correctly determine where to invest your limited resources.
This Knowledge Sharing article will explain how using predictive analytics can improve the
utilization of your existing data center. It will be particularly useful to any IT professional who has
experienced shrinking budgets and increased demands. The article provides generalized
instructions along with specific examples of how to analyze historical resource demands and
predict future demand. These tools will enable IT professionals to improve utilization of existing
resources.
IT and data center management will need to monitor usage metrics as they transition from
legacy IT models to newer, more dynamic models. These dynamic models must support on-
demand service and automation; on-demand services require that the right resources be
available at the right time for end user consumption. To correctly forecast and predict data
center resource demand, IT professionals must harness the power of Big Data analytics to
intelligently design their services.
New, dynamic IT models support this trend with architectures for elastic environments that
maximize efficiency by allowing the same hardware to serve more use cases. These
innovations alone are not enough to reduce operating and capital costs; a deeper
understanding of the environment is required to properly manage it and maintain
service levels.
Patterns of service consumption can be identified by organizing and adding intelligence to vast
amounts of raw data. This requires two inputs: raw data and long-term business plans. We
already collect many different forms of data in our data centers: service requests, equipment
management data, equipment logs, resource utilization, etc. Analyzing and summarizing this
raw data can show trends in consumption. By mapping these consumption trends to long-term
business plans, we can more accurately predict future data center resource demands based on
similar historical records.
While the specific applications of Big Data analytics will vary drastically based on your
business, the general concepts are similar. While a large company developing a new software
product will require certain resources for each of their developers, not every developer will be
working on the new product at the same time. Analyzing historical records can help identify
resource utilization and resource demand magnitude over time; this will help in predicting that
the right resources will be available at the right time.
It is critical to data center management that metrics be measured and analyzed to understand
how resources are consumed, why they are consumed, and when demand will rise and fall. By
understanding how specific resources are being consumed, such as compute, memory, storage,
and network utilization, data center management can provide an improved and more reliable
service. Usage statistics can be used to anticipate and mitigate spikes in demand, enforce
different levels of QoS, determine which programs the resources are being consumed for, and
where to make future resource investments. The collected metrics coupled with the ability to
analyze and plan around them enables elasticity, continuous delivery, and an improved
customer experience.
Introduction
Is your organization data-driven? Many data centers are missing great opportunities for
improvement in the services they offer by not using their own historical and real-time operational
data to manage their environments. To illustrate this point, ask yourself: Does my organization
have a Service Level Agreement (SLA)? If so, how were those service levels determined? Are
my current processes capable of meeting the SLA 99% of the time? What proof do I have that
this is (or is not) the case? The purpose of this article is not to instruct organizations to rewrite
their SLAs; crafting an effective SLA is an extremely complicated process for which many
instructional courses, tutorials, and seminars are better suited (Sloan, 2002). The purpose of
this article is to encourage challenging the status quo of your organization to use more data in
your decision making.
Written with practicality in mind, this article uses several examples to reinforce the concepts with
real-world context. This is by no means a complete list of applications for data analysis, but
rather a set of common use cases that many data centers will likely find useful. With each example,
we describe when the examples would be appropriately used, what data to collect, how to
analyze that data, and where the newly obtained information can be used to drive decisions.
Each example set will have different levels of complexity that can be used depending on your
organization’s business needs. These examples are meant to be used as learning aids; they are
not meant to be used as templates for running your organization. All organizations must choose
their own metrics to monitor based on their strategic goals and business conditions.
Being Data-Driven
You have probably felt the effects of working in an organization that is not data-driven. Without
the data necessary to make informed decisions, SLAs will be missed more frequently, projects
will more often be late, and stress will be higher than if your decisions were based on
emotionless data. “Far too often the meetings and conversations of today tend to focus on
opinions, intuition, emotions, and other non-numeric or subjective sources of information”
(Kiemele, Schmidt, & Berdine, 2000). Although non-numeric information can be useful and at
times is the only information available, understanding your business quantitatively is critical to
managing it effectively (Kiemele, Schmidt, & Berdine, 2000).
Has your data center service organization ever discussed offering a new product or service?
Without data to represent what your customers actually want, the conversation quickly turns into
an emotion-based event rather than a business-based event. In an age of ever-shrinking
budgets and constantly increasing expectations, these emotional decisions are no longer
sufficient. Data centers can no longer afford an “if we build it, they will come” approach. This
investment model is fading in favor of a “just-in-time” IT investment strategy (Verge, 2013).
Data center investments need to be based on current business demands while
still being flexible enough to recognize and react to changes rapidly. Discussions about new
services without relevant data take too long and do not guarantee that a desired service will be
produced.
Luckily, collecting quantitative data is easier now than it has ever been thanks to an increasingly
digital world. In the past, quantitatively measuring a process was time-intensive. Tracking the
throughput of a given process often required a dedicated person to manually observe
and record it. Once enough data had been collected, one or several analysts would
then need to manually compute the desired figures. Modern data center software and platforms
make data collection look effortless when compared to older data collection methods.
Big Data Enhancing Self-Service
The rise of cloud computing and increasingly dynamic technologies further removes data center
management and operational teams from the consumers of data center services and resources.
One of the technologies enabling this separation is Self-Service. CA Technologies defines Self-
Service as empowering end users to gain access to resources and services they want without
IT or data center staff intervention (CA Technologies, 2009). This enables resource consumers
to gain access to resources while reducing the visibility of how and why the resources are being
used. While the industry has widely accepted the benefits of virtualization and Self-Service
(EMC, 2012) (VMware, 2009), the abstraction of resources requires a more sophisticated set of
record keeping to ensure that resources are appropriately allocated. The only way to mitigate
this is to collect metrics that accurately reflect your Self-Service process (Wladawsky-Berger,
2013).
Self-service metrics are diverse and easy to capture. Data center management software can
determine which organizations are using data center resources, the amount of resources they
are consuming, and how long they are using the resources. Linking user and organizational
information to the resources being used, the duration of use, and the level of utilization during
that use can greatly aid in enforcing data center policies. For example, if an individual or
an organization is consuming a greater percentage of resources than the business deems
necessary, the data center could throttle usage or apply a financial fee.
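As a sketch of how such a policy check might look, the snippet below flags organizations whose share of total consumption exceeds a quota. The organization names, usage figures, and the 25% threshold are all invented for illustration.

```python
# Sketch: flag organizations whose share of consumed resources exceeds a
# policy threshold. The usage figures and the 25% quota are illustrative.
QUOTA = 0.25  # maximum share of total consumption allowed per organization

usage_by_org = {"eng": 420, "qa": 180, "sales": 60, "ops": 340}  # e.g. vCPU-hours

def over_quota(usage, quota=QUOTA):
    """Return the organizations consuming more than `quota` of the total."""
    total = sum(usage.values())
    return sorted(org for org, used in usage.items() if used / total > quota)

print(over_quota(usage_by_org))  # ['eng', 'ops']
```

A real deployment would pull the usage figures from data center management software rather than a hard-coded dictionary, and the action taken (throttling or chargeback) would be a business decision.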
Classes of Metrics
The benefit of collecting and analyzing data center metrics is in understanding the overall
operation of the data center. Without metrics, data center management cannot make informed
business decisions, which can reduce performance and drive up costs. Collecting specific
metrics provides transparency into daily data center operations, leading to data-driven
decisions that improve the overall operation of the data center. Capturing and
analyzing three metric categories is critical to maintaining a high level of service within the data
center. The three categories of metrics are:
1. Resource Capacity
2. Resource Availability
3. Resource Utilization
Collecting and understanding metrics related to these three categories will significantly enhance
customer experience by improving availability, scalability, visibility, maintainability, resiliency,
and predictability. The sections below define the three classes of metrics to collect and how they
contribute to deciphering the data center.
Resource Capacity
Resource Capacity is the most basic set of metrics to collect and analyze for any data center
manager. According to the Information Technology Infrastructure Library (ITIL), measuring
capacity supports the optimal provisioning of IT- and data center-related services in order to
meet business demands (ITIL Library). Monitoring basic data center resources such as storage,
compute, memory, network, and personnel is critical to fully understand the state of your data
center and the quality of the services you provide. Specific Resource Capacity metrics to
collect include:
Total CPU resources
CPU resources consumed
Total network bandwidth
Network bandwidth consumed
Total storage resources
Storage resources consumed
Total memory resources
Memory resources consumed
# of incidents due to capacity shortages
Duration of capacity shortages
Accuracy of previous capacity forecast
Capacity reserves during times of increased and decreased demand
Unplanned capacity adjustments
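As a minimal illustration of working with the "total" and "consumed" pairs above, the sketch below computes the remaining headroom per resource; the resource names and values are hypothetical.

```python
# Sketch: compute remaining headroom for each monitored resource from
# "total" and "consumed" capacity figures. All numbers are made up.
capacity = {
    "cpu_ghz":    {"total": 800.0,  "consumed": 560.0},
    "memory_gb":  {"total": 4096.0, "consumed": 3481.6},
    "storage_tb": {"total": 500.0,  "consumed": 275.0},
}

def headroom(cap):
    """Percent of each resource that remains unconsumed."""
    return {name: round(100 * (1 - v["consumed"] / v["total"]), 1)
            for name, v in cap.items()}

print(headroom(capacity))  # {'cpu_ghz': 30.0, 'memory_gb': 15.0, 'storage_tb': 45.0}
```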
Analyzing these metrics provides a manager with visibility as to when they need to procure
additional resources. Determining maximum capacity in relation to the level of currently
consumed capacity is critical to maintaining data center services. The visibility provided by
collecting these metrics makes justifying additional resources a much simpler process. By
understanding resource capacity within the data center, a data center manager can ensure that
appropriate resource capacity is always available to meet business requirements.
Resource Availability
Resource Availability is the method of ensuring the proper resources are available to meet
elastic customer and business demand. Resource availability ensures that a data center can
sustain services to support business needs while keeping costs economical (RMS Services).
This differs from resource capacity, which only measures whether the resources exist, not
whether they are available for use. Regularly scheduled maintenance downtime and
unscheduled incident downtime both can reduce the availability of resources in a data center.
The availability of resources is typically tied to an SLA or an agreed-upon level for a specific
duration (TeamQuest). The benefit of collecting and analyzing these metrics is in
understanding data center resource resilience. There will always be unplanned events that
consume resources or make them unavailable. It is critical to collect downtime metrics so that
management can plan for and mitigate such events, reducing risk to the data center.
Resource Utilization
The KPI Library defines Resource Utilization as the number of hours of work assigned to a
resource or group of resources as a percentage of their availability for a given time period (KPI
Library, 2009). This means that the level of resource consumption can be measured for the
duration of time the resources are assigned to a project or workload. Measuring utilization
provides transparency into how resources are utilized and reveals trends that aid in forecasting.
This is important because utilization metrics enable data center management to plan
capacity so that resources are available for demand spikes without drastically
overprovisioning environments. Utilization metrics determine key factors such as whether data
center resources are properly allocated for specific workloads or projects, enabling
management to make resource adjustments so that data center resources are properly
allocated and distributed. This can significantly reduce capital expenses and increase elasticity by
providing the correctly provisioned environments to meet resource demands at different times.
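The KPI Library definition above reduces to a simple ratio, which the sketch below expresses; the hour figures are illustrative.

```python
# Sketch of the KPI Library definition: utilization is hours of work assigned
# as a percentage of hours available over a given period.
def utilization_pct(hours_assigned, hours_available):
    """Resource utilization for one period, as a percentage."""
    return 100.0 * hours_assigned / hours_available

# Example: a resource pool available 720 hours in a month, assigned 540 hours:
print(utilization_pct(540, 720))  # 75.0
```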
Examples and Benefits of Collecting Metrics
“Over time, Big Data and advanced data science applications will enable us to take operational
decision making to a whole new level in a wide variety of disciplines.” (Wladawsky-Berger,
2013)
There are many benefits for data center management in collecting and aggregating the four
metric categories: capacity, availability, utilization, and distribution. The following section
explains high-level metrics to collect, what can be gained, and examples where applicable. It is important to
remember that the use case examples and benefits listed in this article are not exhaustive.
Resource Capacity
To effectively measure resource capacity, data center management will need to monitor
performance, consumption, and throughput on resources such as CPU, network, and storage. It
is important that appropriate resources are monitored and collected over time so that trends can
be identified and analyzed; this trend analysis will enable better capacity planning and
forecasting in the future. While the services provided by a data center will dictate which metrics
are worth collecting, aggregating, and analyzing, most data centers will want to measure several
different network, server, and virtualization metrics over time. Below is a list of common data
center metrics that might be collected (IT Process Maps, 2013):
# of Incidents due to Capacity Shortages
Resolution Time of Incidents due to Capacity Shortages
# of Capacity Adjustments
# of Unplanned Capacity Adjustments
Capacity Reserves (unit of measurements vary based on resource)
% of services and infrastructure components with capacity monitoring
For example, the business could acquire a larger user community and need to rapidly expand
their data center. This could require additional compute, memory, storage, SAN, and network
capacity that would need to be scoped to meet new business requirements. Metrics such as the
number and duration of capacity shortages can be collected and analyzed so that management
can measure whether the data center successfully increased capacity to meet business needs.
Understanding capacity metrics provides many benefits to the data center. Capacity metrics
enable data center management to measure whether they are meeting business requirements
at the most basic level. Metrics such as compute, storage, memory, and network consumption
provide increased visibility into what resources have been procured and deployed. This type of
transparency over time allows for easier justification of new resources, greater predictability,
and better capacity planning, which enable scalability to meet business demands.
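One simple way to turn such consumption trends into a forecast is a least-squares line over historical consumption. The sketch below estimates the number of months until storage is exhausted; the consumption history and the 500 TB ceiling are invented for illustration.

```python
# Sketch: fit a linear trend to monthly storage consumption and estimate
# when consumption reaches total capacity. History and ceiling are invented.
def fit_line(ys):
    """Least-squares slope/intercept for equally spaced samples 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def months_until_full(history_tb, total_tb):
    """Months from the last sample until the trend line reaches capacity."""
    slope, intercept = fit_line(history_tb)
    if slope <= 0:
        return None  # consumption flat or shrinking; no exhaustion forecast
    month_full = (total_tb - intercept) / slope
    return month_full - (len(history_tb) - 1)

history = [200, 220, 240, 260, 280]  # TB consumed in each of the last 5 months
print(months_until_full(history, 500))  # 11.0
```

A linear fit is only a first approximation; real demand is often seasonal or bursty, which is exactly why the article recommends collecting metrics over long periods before forecasting.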
Resource Availability
Measuring availability is critical to maintaining data center services. To understand whether
achieved availability meets the level agreed upon in an SLA, it is critical to monitor and
collect metrics (IT Process Maps). Data center management must test the reliability and
resiliency of their resource capacity to understand availability. The metrics to measure when
testing availability concern the accessibility of the data center services agreed upon in the
SLA, and include (IT Process Maps, 2013):
Number of service interruptions
Duration of service interruptions
Percentage of services and infrastructure being actively monitored
It is important to perform tests simulating outages so that metrics can be collected on recovery
of services. Understanding how resources such as CPU, storage, memory, and network react to
simulated outages and disruptions can provide valuable insight toward preventing downtime.
These metrics will provide transparency into the resiliency of data center resources and will help
with planning future hardening of resources to ensure their availability.
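A basic availability calculation from collected interruption metrics might look like the following sketch; the outage durations and the 99.9% target are illustrative, not recommendations.

```python
# Sketch: compute achieved availability from interruption durations and
# compare it against an SLA target. Outage minutes and target are invented.
def availability_pct(outage_minutes, period_minutes):
    """Percent of the period during which service was available."""
    return 100.0 * (period_minutes - sum(outage_minutes)) / period_minutes

MONTH_MINUTES = 30 * 24 * 60   # a 30-day month: 43,200 minutes
outages = [12, 7, 24]          # minutes lost to three service interruptions

achieved = availability_pct(outages, MONTH_MINUTES)
print(round(achieved, 2), achieved >= 99.9)
```

The number of interruptions and their durations are exactly the two availability metrics listed above, so this calculation falls directly out of the collected data.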
Figure 1: Service Running Normally
For example, a data center may have a virtualization environment with hundreds of servers
clustered together as hypervisors. The hypervisors have datastores on an array connected via a
SAN (Figure 1). The hypervisors are hosting thousands of VMs serving a wide range of
business needs. The resources for the VM requirements have been scoped to meet business
requirements and are included in the data center’s available capacity. A failure with the array
hosting the datastore or the SAN connection to the hypervisors could have a devastating effect
on the availability of virtualized resources (Figure 2). Even though the environment still has the
resource capacity to support the VMs, the resources will not be available for use due to a failure
in the datastore. Using availability metrics gained through availability testing, a data center can
determine which areas of recovery were the most challenging and plan availability
improvements so that outages are shorter or avoided altogether, reducing data
center risk.
Figure 2: Service Disruption
Understanding resource availability is critical to the success and health of the data center.
Collecting availability metrics provides transparency into how resilient resources are and what
the potential impact might be if there is an unplanned event. The visibility provided by availability
metrics helps in planning maintenance cycles to improve service and harden resources.
Without understanding and improving resource availability, all investments into additional
capacity would be exposed to unacceptable levels of risk.
Resource Utilization
Utilization can be measured in many different ways with many different tools and methods. The
simplest way to define utilization is in the formula below (Bright Hub PM, 2013):

Utilization (%) = (Hours of Work Assigned ÷ Hours Available) × 100
Measuring utilization provides visibility into overall data center capacity utilization and into how
resources are utilized on a project-by-project basis. This information enables better allocation of current
capacity to ensure elasticity and availability of capacity throughout the data center. Figure 3
shows the percentage of storage space utilized for each of four projects. Without any context,
these data may not have much value. If one of the project managers requested additional
storage space for their team, however, this graph helps inform the decision. Imagine that the
manager of Project C is requesting additional storage because members of her team are not
getting enough space to complete their nightly builds. Relative to the other projects, it is clear
that Project C has enough overall capacity allocated. Without this graph, the request for
additional storage might have been approved, causing the company to spend more money on
storage than is actually required. While this data alone does not show what is causing the
storage problems, Resource Capacity can already be ruled out as the root cause.
Figure 3: Storage Utilization Example
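A comparison like Figure 3 can be produced from per-project allocated and used figures; in the sketch below, all project names and numbers are invented for illustration.

```python
# Sketch: per-project storage utilization (used / allocated), the kind of
# comparison behind Figure 3. Project names and figures are invented.
projects = {
    "Project A": {"allocated_tb": 40, "used_tb": 30},
    "Project B": {"allocated_tb": 60, "used_tb": 21},
    "Project C": {"allocated_tb": 80, "used_tb": 36},
}

def storage_utilization(projs):
    """Percent of allocated storage each project is actually using."""
    return {name: round(100 * p["used_tb"] / p["allocated_tb"])
            for name, p in projs.items()}

print(storage_utilization(projects))
# {'Project A': 75, 'Project B': 35, 'Project C': 45}
```

With numbers like these, a request for more capacity from a project sitting well below full utilization can be questioned before money is spent, while the root cause (for example, per-user quotas within the project) is investigated.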
Understanding how different projects, groups, and workloads utilize capacity is critical to a data
center. Data on how resources are being utilized provides visibility into how and when resources
are consumed and increases predictability, aiding future capacity planning. Transparency in
resource consumption provides data that enable data center management to create and enforce
effective policies that help ensure that resources are available to business units who truly need
them. Knowledge gained from measuring utilization empowers data center management to
create policies that satisfy consumer capacity needs and keep data center capital costs down
while still meeting business requirements.
Resource Distribution
Collecting the three previously identified classes of metrics enables effective distribution of data
center resources. However, collecting these metrics does not guarantee ideal resource
distribution; rather, they are a gateway to the critical thinking and analysis required for resource
distribution decisions. To achieve distributed resources, a data center will need to collect
capacity, availability, and utilization metrics to more accurately forecast and plan. Capacity
metrics will provide transparency and measurement as to whether the optimal level of capacity
has been reached to meet business requirements. Availability metrics alert the data center to
the level of risk the capacity is exposed to and how to harden the capacity to ensure
availability. Deploying resources based on utilization metrics enables the properly provisioned
capacity to meet project needs while still ensuring sufficient elasticity for competing business
demands. Planning based on the three metrics—capacity, availability, and utilization—will help
any data center achieve optimal resource distribution.
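As a toy illustration of combining the three metric classes into a single distribution check, the sketch below flags common problems for one resource pool. The thresholds (a 90% provisioning ceiling, a 70% utilization floor, a 99.5% availability floor) are assumptions for the example, not recommendations.

```python
# Sketch: a toy health check combining capacity, utilization, and
# availability metrics for one resource pool. All thresholds are assumed.
def distribution_issues(provisioned_pct, utilized_pct, availability_pct):
    """Return a list of findings for a single resource pool."""
    issues = []
    if provisioned_pct > 90:
        issues.append("capacity: little headroom left")
    if utilized_pct < 70 * provisioned_pct / 100:
        issues.append("utilization: pool looks overprovisioned")
    if availability_pct < 99.5:
        issues.append("availability: below target")
    return issues

print(distribution_issues(provisioned_pct=85, utilized_pct=40,
                          availability_pct=99.7))
# ['utilization: pool looks overprovisioned']
```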
Figure 4 illustrates a data center that has collected metrics and used them to plan effectively,
enabling proper resource distribution throughout the data center. The level of
capacity has been scoped to meet business needs, which can be measured and validated
through capacity metrics. This is illustrated by the provisioned level of capacity never exceeding
the maximum capacity. Utilization nearly meets the provisioned resource level while not
exceeding it, indicating properly provisioned resources. Risk planning has also been performed,
which is indicated by capacity always remaining available even during service interruptions.
Figure 4: Resources Appropriately Distributed and Available
Collecting resource capacity, availability, and utilization metrics is critical to the operation of an
effective data center. By collecting and analyzing these metrics, data center management can
not only more accurately forecast data center trends, but can also determine the success of
new initiatives and justify additional resources. Without collecting specific metrics, it is
impossible to understand how data center operations are actually running.
Conclusion
Metrics provide visibility into the types of resources used, who is using them, how and when
they are being used, the duration of use, and whether there are capacity or availability
shortages. Collected over time, these metrics reveal trends that can be used to predict spikes
in demand, identify areas of vulnerability, and increase the accuracy of forecasting and service
delivery. In turn, better forecasting enables better capacity planning, creating a more efficient
data center. Analyzing the trends revealed by metric collection will help in building a highly
effective data center with robust, well-distributed capacity.
The complexity, scale, and diversity of services provided by data centers heighten the
importance of collecting meaningful metrics so that management can make data-driven
decisions. It is critical that the chosen metrics are accurate indicators of data center
performance. Absent these metrics, data center management is left making best guesses to
satisfy their planning requirements. This can be extremely problematic and can negatively
impact capital budgets and the services provided. Thus, it is imperative to measure and
understand metrics that truly represent the performance of your virtual data center.
Bibliography
Bright Hub PM. (2013, March 7). Performance Measurement of Resource Planning and Utilization - Part VI. Retrieved January 18, 2014, from Bright Hub PM: http://www.brighthubpm.com/monitoring-projects/19276-performance-measurement-of-resource-planning-and-utilization-part-vi
CA Technologies. (2009, October 29). What is "Self-Service"? From Service Management: http://blogs.ca.com/itil/2009/10/26/what-is-quot-self-service-quot/
EMC. (2012, June). An IT-as-a-Service Handbook. From http://www.emc.com/collateral/software/white-papers/h10801-stepstoitaas-wp.pdf
IT Process Maps. (n.d.). Availability Management. Retrieved January 18, 2014, from http://wiki.en.it-processmaps.com/index.php/Availability_Management
IT Process Maps. (2013, August 3). ITIL KPIs Service Design. From ITIL Process: http://wiki.en.it-processmaps.com/index.php/ITIL_KPIs_Service_Design#ITIL_KPIs_Capacity_Management
ITIL Library. (n.d.). Capacity Management. Retrieved January 18, 2014, from Open Guide: http://www.itlibrary.org/index.php?page=Capacity_Management
Kiemele, M. J., Schmidt, S. R., & Berdine, R. J. (2000). Basic Statistics: Tools for Continuous Improvement. Colorado Springs, Colorado: Air Academy Press.
KPI Library. (2009). Resource Utilization. Retrieved January 18, 2014, from Project Portfolio: http://kpilibrary.com/kpis/resource-utilization-2
RMS Services. (n.d.). ITIL Availability Management. Retrieved January 18, 2014, from ITIL Defined Services: http://www.rms.co.uk/itil-availablity-management
Sloan, J. D. (2002, November 5). Tutorial: Service Level Agreements. From Wofford College: http://webs.wofford.edu/sloanjd/netlab/oldweb/mng/slas.htm
TeamQuest. (n.d.). ITIL Availability Management. Retrieved January 18, 2014, from ITIL Information: https://www.teamquest.com/resources/itil/service-delivery/availability-management/
Verge, J. (2013, October 21). Smaller Data Centers and Markets Emerge as M&A Sweet Spot. Retrieved December 23, 2013, from Data Center Knowledge: http://www.datacenterknowledge.com/archives/2013/10/21/smaller-data-centers-and-markets-emerge-as-ma-sweet-spot/
VMware. (2009). Key Features and Benefits. From VMware: http://www.vmware.com/files/pdf/key_features_vsphere.pdf
Wladawsky-Berger, I. (2013, September 27). Data-Driven Decision Making: Promises and Limits. Retrieved December 20, 2013, from Wall Street Journal: http://blogs.wsj.com/cio/2013/09/27/data-driven-decision-making-promises-and-limits/
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.