RAMPS © Reliability, Availability, Maintainability, Predictability, Scalability Presented by Joe Soroka For additional information visit www.totalsitesolutions.com
Mar 29, 2015
RELIABILITY
AVAILABILITY
MAINTAINABILITY
PREDICTABILITY
SCALABILITY
While budgets may be tighter the requirement for maximum uptime has not gone away
The design of your facility is only one piece of the pie that will affect your site’s uptime
It is important that we are aware of how Reliability, Availability, Maintainability, Predictability and Scalability all affect your site’s uptime
Reliability is the ability of a system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances
RELIABILITY
Reliability
What is reliability?
• Weibull
• Markov Reward modeling
Modeling
• IEEE Gold Book
• Procedures: accurate, confirmed/tested
Equipment selection
• Generator
• UPS systems
• EPO systems
• Switchgear
• Monitoring systems
Reliability
• Reliability
• Reliability modeling
• Equipment
• Commissioning
• Operations & maintenance
Reliability
• Bathtub curve of reliability
  – Infant mortality
    • Burn-in/load testing
    • Commissioning
  – Useful life
    • Proper maintenance
  – End of life
    • Identify and replace prior to entering this period
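The three regions of the bathtub curve can be illustrated with a Weibull hazard rate, one of the models named earlier in the deck. This is a minimal sketch, not a fitted model: the shape values (0.5, 1.0, 3.0) and the 50,000-hour scale are hypothetical numbers chosen only to show a falling, flat and rising failure rate.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must all be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


# Hypothetical shape parameters, one per region of the bathtub curve.
PHASES = {"infant mortality": 0.5, "useful life": 1.0, "end of life": 3.0}
ETA_HOURS = 50_000  # hypothetical characteristic life

for phase, beta in PHASES.items():
    early = weibull_hazard(1_000, beta, ETA_HOURS)
    late = weibull_hazard(40_000, beta, ETA_HOURS)
    if late < early:
        trend = "falling"
    elif late > early:
        trend = "rising"
    else:
        trend = "flat"
    print(f"{phase}: failure rate is {trend}")
```

A shape parameter below 1 gives the falling rate that burn-in and commissioning are meant to get through; a shape above 1 gives the rising rate that signals replacement before end of life.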
Reliability
• The reliability of a series system is no greater than that of its weakest component
• In a complex system you need to identify and quantify the importance of each component in the system
• A reliability block diagram is a graphical representation of the components of the system and how they are connected from a reliability standpoint
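A reliability block diagram reduces to two basic combinations: blocks in series (all must work) and blocks in parallel (at least one must work). The sketch below assumes independent components; the component names and reliability values are hypothetical and are not taken from the presentation.

```python
from math import prod


def series(*reliabilities: float) -> float:
    """All blocks must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """At least one block must work: 1 minus the chance that all fail."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical single-path chain: switchgear -> UPS -> PDU.
chain = (0.99, 0.95, 0.99)
single_path = series(*chain)
print(f"single path: {single_path:.4f} (weakest block: {min(chain):.2f})")

# The same chain with two UPS modules in parallel.
redundant_path = series(0.99, parallel(0.95, 0.95), 0.99)
print(f"redundant UPS: {redundant_path:.4f}")
```

The series result is always at or below the weakest block, which is why identifying that block and adding a parallel path around it has the largest effect.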
Reliability
• Many of the reliability design ideas share a common philosophy with those recommended for availability
• This is because there is a very close relationship between reliability and availability
• While reliability is about how long an application runs between failures, availability is about the ability of a system to tolerate failures and how much of the time it is accessible to the users
• When a system's components and services are highly reliable, they cause fewer failures from which to recover and thereby help increase availability
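One common way to make this relationship concrete is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to repair. The formula is a standard textbook relationship rather than something stated in the slides, and the hour values below are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the same repair time with a less and a more reliable system.
print(f"MTBF  1,000 h, MTTR 4 h: {availability(1_000, 4):.5f}")
print(f"MTBF 10,000 h, MTTR 4 h: {availability(10_000, 4):.5f}")
```

Raising MTBF (reliability) or lowering MTTR (maintainability) both move availability toward 1, which is why the two topics share so many design ideas.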
Reliability
Equipment
• Major manufacturers
  – Past experiences
  – Local maintenance support
  – Parts distribution centers
• Fine line between leading edge and bleeding edge
• Formal submittal review meetings
Reliability
Equipment
• Generator isolation valves
• ATS bypass
• TVSS indicators and alarms
• Lightning protection
• EPO systems
  – Wiring
  – Control relays
  – Covers
  – Diagrams
  – Testing
  – Day 2 changes
Reliability
Equipment
• Generators
  – Redundant batteries
  – Battery monitoring
  – Fuel level monitoring
  – Water jacket heater isolation valves
  – Silicone heater hoses
  – Coolant level pre-alarms, both cores
  – Water separators (Racor filters) with alarms
  – Engine diagnostic link
Reliability
Equipment
• UPS systems
  – Dual input
  – Maintenance bypass cabinet
  – Advanced monitoring
  – Battery monitoring
  – Redundant battery strings for VRLAs
  – Site-specific procedures
Reliability
Equipment
• Automatic Transfer Switches (ATS)
  – Maintenance bypass or wrap-around breakers
  – Phase sync monitoring
  – Pause Neutral/dual solenoids
  – Monitoring
• Transient Voltage Surge Suppression (TVSS)
  – Monitoring
  – Indication of operation
  – Surge counter
Reliability
Equipment
• EPO systems
  – Wiring in conduit and not in open plenum
  – Control relay coils should not be energized until activation
  – Secondary covers installed over the EPO buttons
  – Detailed and accurate schematic diagrams
  – System should be designed so it can be tested
  – System should be capable of making day 2 changes without risk
  – Part of an engineered drawing and not a cloud saying “by others”
Reliability
Equipment
• Thermal runaway
  – Increased heat density
    • Reduces time to thermal runaway
    • Increases the need for a reliable HVAC system
    • Specialized HVAC systems
    • Possibly switching from emergency to UPS power
    • Long UPS battery runtimes may be unclear
• Rack layout, equipment airflow direction
  – Cold/hot aisle
  – Enclosed hot aisles
• Rack type
  – Doors
  – Vents
  – Fans
Reliability
Equipment
• Water storage
  – Chilled water
    • In the event of a power outage or temporary chiller failure, do you have the capability to ride through?
  – Makeup water
    • How reliable is the city water supply?
    • Do you have diverse sources?
    • Water storage tanks
    • Well
    • Other water sources
Reliability
Commissioning
With each project being unique, there is a need to determine how much commissioning is appropriate for the project. Factors that influence this decision include:
• Building’s mission-criticality
• Facility’s use or purpose
• Complexity of the building’s systems
• Building type and size
• Project type, whether existing building system or retrofit, or both
• Building tenant or occupant demographics
• System reliability requirements
• Owner’s objective in commissioning the building: IAQ, system reliability and/or energy efficiency
• Project budget
Reliability
Operation and Maintenance
• Use a pilot/copilot approach
  – Commercial airplanes do not fly with just one pilot. Why would you?
• Standardize as much as possible
  – Standard procedures
  – Standard process
• Use a Computerized Maintenance Management System (CMMS)
  – Timely reports and schedules
  – Accurate information
  – Archive past performance
  – Instant access to information
Availability is the ability of a system to tolerate failures
It refers to the time that a system is available to its users
This means the process continues to be served through the failure and that, ideally, the failure is transparent to the user
AVAILABILITY
Availability
• Availability
• Design
• Resources
• Procedures
Availability
• Availability is typically expressed by the number of nines
• Downtime per year
Availability   # of nines   Downtime
90%            1 nine       36.5 days/year
99%            2 nines      3.65 days/year
99.9%          3 nines      8.76 hours/year
99.99%         4 nines      52 minutes/year
99.999%        5 nines      5 minutes/year
99.9999%       6 nines      31 seconds/year
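The downtime figures in the table follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year (8,760 hours); the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected unavailable hours per year for a given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def describe(hours: float) -> str:
    """Express a downtime in the largest unit that keeps the value at or above 1."""
    if hours >= 24:
        return f"{hours / 24:.2f} days/year"
    if hours >= 1:
        return f"{hours:.2f} hours/year"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.1f} minutes/year"
    return f"{minutes * 60:.1f} seconds/year"


for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {describe(downtime_hours_per_year(pct))}")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the step from four to five nines usually requires design changes rather than better maintenance alone.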
Availability
• Failures can be attributed to the following causes:
• Design failures
  – This class of failures takes place due to inherent design flaws in the system. In a well designed system, this class of failures should make a very small contribution to the total number of failures
• Infant mortality
  – This class of failures causes newly manufactured hardware to fail. This type of failure can be attributed to manufacturing problems like poor soldering, leaking capacitors, etc.
  – These failures should not be present in systems leaving the factory, as these faults will show up in proper factory system burn-in tests
Availability
• Random failures
  – Random failures can occur during the entire life cycle of a system. These failures can lead to system failures. Redundancy is provided to recover from this class of failure
• Wear out
  – Once a hardware module has reached the end of its useful life, degradation of component characteristics will cause hardware modules to fail. These types of faults can be weeded out by preventive maintenance and routing of hardware
Availability
Design
• Designing systems with sufficient levels of redundancy
• Eliminating single points of failure
• Availability design guidelines
  – Consult your engineer
  – TIA Standard: TIA-942
  – Uptime Institute: Tier Definition
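The effect of a redundancy level such as N+1 can be estimated with a k-out-of-n calculation: the system is up when at least k of its n units are up. This is a rough sketch that assumes identical, independent units, which real plants only approximate; the 0.99 unit availability is a hypothetical value.

```python
from math import comb


def k_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n identical, independent units are available."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    a = unit_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


UNIT = 0.99  # hypothetical availability of one UPS module

print(f"N   (2 of 2 needed): {k_of_n_availability(2, 2, UNIT):.6f}")
print(f"N+1 (2 of 3 needed): {k_of_n_availability(2, 3, UNIT):.6f}")
print(f"2N  (2 of 4 needed): {k_of_n_availability(2, 4, UNIT):.6f}")
```

The calculation says nothing about shared single points of failure such as a common bus or control system, so it complements rather than replaces the single-point-of-failure review.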
Availability
Design
• System design should have multiple paths
  – Active or passive, depending upon the site reliability requirements
  – If redundant paths need to be value-engineered (VE'd) out to meet the project budget, consider adding the breaker or valve now; later, when budget allows, add the actual feed
  – By adding the breaker or valve up front you will be able to install temporary cable or piping when an emergency arises
Availability
Design
• When performing maintenance that decreases system redundancy, move the reduction of availability away from the critical load and toward the utility as much as possible
  – For example, if you have a system-plus-system design and you are going to take the UPS out of service for maintenance, do not just open the UPS system and let the downstream dual-cord devices and static transfer switches handle the loss of redundancy
  – Place the UPS in maintenance bypass to continually feed the second source with stable power
  – Better yet, place the UPS on generators or an alternate UPS supply to avoid sending unprotected utility power to the critical load
Availability
Resources
• Technical resources
  – Operation staff
  – Response staff
  – Maintenance & repair staff
• Parts
  – Onsite spares
  – Manufacturer spares
  – Vendor spares
  – Supply houses
Availability
Resources
Operation Staff
• Operation staff
  – Whether you are using in-house or contracted staff, it is important to ensure they have the proper resources
    • Proper access to the facility
    • If using a key card system, what happens when the card readers lose power? Who has the keys?
    • Do you have all of your operation staff’s phone numbers?
      – Cell numbers and home numbers
      – Company and personal emails
Availability
Resources
Response Staff
• Emergency response
  – Types of emergency responses
    • Additional operation staff
    • Electrical, mechanical & plumbing contractors
    • General construction
    • Testing and repair firms
    • Fire and security
    • Hazardous material spill
  – List of suppliers and vendors
    • Emergency contact information
    • Alternate contact information
  – Contracts in place to execute after-hours support
  – Meet them before an emergency arises; have them at the site for lunch
Availability
Resources
Maintenance & Repair Staff
• Do you have the necessary contracts in place?
• Is there maintenance your operation staff can perform in house?
• Do you have alternate contact numbers for your maintenance providers?
• Do they have proper access to the facility?
• Do you have a second string waiting on the sidelines in case of an emergency?
Availability
Resources
Parts
• Parts and supplies
  – Define and assess critical parts
  – Stock critical parts onsite
    • Have an annual budget for spare parts that increases a little each year
  – Verify that your vendors and contractors have spare parts handy
  – Identify supply houses and suppliers that have parts you need
  – Have after-hours phone number(s) to get parts from supply houses
  – Have contracts in place and make sure they are active
Availability
Procedures
• Operation
• Maintenance
• Emergency
• Troubleshooting
Availability
Procedures
Operation
• Operation procedures
  – Have detailed procedures that are specific to your developed site
  – Procedures should be tested and verified
  – Procedures should be inventoried and updated regularly
  – Operating procedures should be placed at the point of use and not locked up in the building manager’s office
Availability
Procedures
Maintenance
• Maintenance procedures
  – Have detailed procedures for maintenance
  – Ask your maintenance provider to furnish all of the required maintenance procedures prior to performing maintenance, so you can review and comment on them
  – Use detailed procedures during your maintenance activities
  – Review procedures after the maintenance has been completed
Availability
Procedures
Emergency
• Emergency procedures
  – In case of an emergency, where are your procedures?
  – Can you access them?
  – Are they at multiple locations?
  – An emergency is not the time to try to figure out how to restore a system
  – Perform dry runs on the procedures at least once a year
  – Update and change, as required
Availability
Procedures
Troubleshooting
• Manuals
  – Available
  – Correct
• Drawings
  – Available and complete
  – As-builts
• Develop troubleshooting flow diagrams
Maintainability is defined as the probability of performing a successful repair action or preventative maintenance within a given time
In other words, maintainability measures the ease and speed with which a system can be restored to operational status
MAINTAINABILITY
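The definition of maintainability as a probability of completing a repair within a given time can be made concrete with a simple model. The sketch below assumes repair times are exponentially distributed, a common simplification that the presentation does not state; the 2-hour mean time to repair is a hypothetical value.

```python
import math


def maintainability(t_hours: float, mttr_hours: float) -> float:
    """Probability that a repair is completed within t_hours,
    assuming exponentially distributed repair times with mean mttr_hours."""
    if t_hours < 0 or mttr_hours <= 0:
        raise ValueError("t_hours must be non-negative and mttr_hours positive")
    return 1 - math.exp(-t_hours / mttr_hours)


MTTR = 2.0  # hypothetical mean time to repair, in hours

for window in (1, 2, 4, 8):
    print(f"repair done within {window} h: {maintainability(window, MTTR):.1%}")
```

Under this assumption, anything that shortens the mean repair time (access, labeling, spares on site, tested procedures) raises the probability of finishing inside a fixed maintenance window.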
Maintainability
• Design
• Equipment
• Staff
• Location
• Maintenance program
• Training
• Coordination
• Maintenance windows
Maintainability
Design
• Goals of maintainability
  – Maximize efficiency and accuracy of on-line replacement of system components
  – Facilitate and minimize troubleshooting time at each level of maintenance activity
  – Allow test, checkout, troubleshooting and repair procedures to be unit-specific and structured to aid in identification of faulty units, then sub-units
  – Reduce downtime
  – Provide easy access to malfunctioning components
  – Allow for a high degree of standardization
  – Minimize time and cost of maintenance training
  – Simplify new equipment design and shorten design time by using previously developed, standard building blocks
Maintainability
Design
• Equipment access
• Labeling
• Minimize troubleshooting time
  – Monitoring
  – Procedures
  – Standardization
  – Test and service points
Maintainability
Design
Equipment Accessibility
• Accessibility refers to the relative ease with which a system can be accessed
  – Sufficient clearance to use the tools needed to complete the tasks
  – Adequate space to permit convenient removal and replacement of components
  – Adequate visual exposure to the task area
  – Adequate safety and working clearances
  – Adequate space for required rigging equipment
  – Adequate hallway, corner and door clearances back to the loading dock
Maintainability
Design
Ease of Removal and Replacement
• Equipment rooms should be designed so that rapid, safe and easy removal and replacement of malfunctioning components can be accomplished by one technician, when possible
Maintainability
Design
Labeling
• Labeling should:
  – Identify a specific device
  – Identify the purpose or function of a specific device
  – Present critical information
  – Present safety information
  – Be legible
  – Use contrasting colors
• Ensure that your labeling is controlled to maintain its accuracy and standardization
• Periodic inspections and examinations
Maintainability
Design
Minimize Troubleshooting Time
• Comprehensive monitoring
• Procedures
• Standardization
• Test and service points
Maintainability
Design
Monitoring
• Monitoring capabilities
  – Event notification
  – Event reconstruction
  – Event mitigation
  – Determine maintenance frequencies
  – Allow for accurate and efficient communication of events
Maintainability
Design
Monitoring
• What type of monitoring system do I need?
  – No monitoring
    • Not recommended for any mission-critical facility
  – Remote Alarm Status Panel (RASP)
    • No trending or time stamping
    • Gives visual and audible notification
    • Usually for one device or system
  – Monitoring with dry contacts
    • Limited number of points
    • Limited time stamping
    • Status is either on or off
  – Serial interfaces
    • Comprehensive data
    • Data points with values rather than on/off
    • Flexible and expandable
Maintainability
Design
Procedures
• Emergency Operating Procedures (EOP)
  – Developed for failure modes
  – Readily available for use: locate at point-of-service
  – Should be developed and tested during the commissioning phase
  – Detailed: switch level
  – Update any changes discovered
• Method Operating Procedure (MOP)
  – Developed for all operations
  – Detailed: switch level
  – Have back-out procedures included
  – Use with pilot/copilot approach
  – Update any changes discovered
  – Should be developed and tested during the commissioning phase
Maintainability
Design
Procedures
• Troubleshooting procedures
  – Troubleshooting flow charts
  – Restoration procedures
• Maintenance procedures
  – Detailed procedures
  – Include measurement points for future trending
  – Used and completed during maintenance
Maintainability
Design
Procedures
• Common procedure error traps
  – In-field decisions
  – Vague instructions
  – Undefined or uncommon terms
  – Burdensome or complex instructions
  – Multiple actions
  – Inconsistent statements or actions
  – Misleading or missing critical information
  – Interfacing with external procedures
  – Lack of ownership
  – Lack of quality assurance review
Maintainability
Design
Standardization
• Standardization ensures consistency and comparability of knowledge and parts
  – Acronyms
    • Reduce confusion
  – Manufacturers
    • Reduced spare part counts
    • Familiarization with operations and maintenance
  – Layouts
    • Reduce confusion
    • Increase ease of use
  – Labeling
    • Reduce confusion
Maintainability
Design
Test and Service Points
• Test points provide a means for conveniently and safely determining the operational status of equipment and isolating malfunctions
• Test points, strategically placed, make signals available to the technician for checking, adjusting or troubleshooting
• Service points provide a means for lubricating, filling, draining, charging and similar functions
Maintainability
Design
Test and Service Points
• General principles for test and service points
  – Avoiding need for frequent testing and service
  – Standardization
  – Test and service point compatibility
  – Labeling dangerous test and service compatibility
  – Distinctively different connectors and fittings
  – Location of test, service and adjustment points
Maintainability
Equipment
• Ordering the right accessories with your equipment can make a big difference when it comes to the maintainability of your equipment
• When ordering equipment or reviewing design documents, solicit input from the operations and maintenance staff involved
• It’s much cheaper to order it right the first time than to upgrade it later in the field
Maintainability
Equipment
Generators
• Water separators for fuel
• Radiator water level
• Isolation valves on water jacket heaters
• Generator-mounted circuit breakers
• Battery cables
• Battery monitoring
• Fuel-level monitor
Maintainability
Equipment
Switchgear
• Annual infrared thermal scanning
• Protective relays
• Breaker testing
• PLC code
  – Hard copy
  – Uploadable copy
• Beware of small UPS systems
• Station batteries
• Internal cleaning
• Mimic bus
Maintainability
Equipment
Automatic Transfer Switches
• Maintenance bypass
  – Order it with a maintenance bypass or design the system to have a manually operated breaker bypass to wrap around the ATS to both sources
Maintainability
Equipment
UPS Systems
• AC filter capacitors
  – 3-5 years
• DC filter capacitors
  – 3-5 years
• Transfer circuits
  – Capture the transfer between UPS and bypass
• Procedures
  – Detailed PM procedures
  – Capture before and after readings
• Calibration/maintenance
  – Capture details
  – Don’t just do a “dust and clean” PM
Maintainability
Equipment
Batteries
• VLA (flooded)
  – Vented lead acid
  – Quarterly maintenance
• VRLA (sealed)
  – Valve-regulated lead acid
  – Semi-annual maintenance
• Float voltage
• Room temperature
• Proper maintenance
• Water as required
• Battery monitoring
• Batteries are found in
  – UPS systems
  – Generators
  – Switchgear
  – PLCs and breakers
  – Telecom equipment
Maintainability
Equipment
PDUs
• Shutdown alarms
  – Identify and understand them
• EPO circuits
  – If used, is it maintainable?
• Monitoring
  – Main
  – Sub-panels
  – Branch circuit breakers
• Snap-in vs. bolt-in breakers
  – Use bolt-in breakers only
• Transformers
  – K-rated
Maintainability
Equipment
Load Banks
• Permanently installed load banks
• Generator testing
  – Annual load test
  – Troubleshooting
• UPS system testing
  – Annual load test
  – Troubleshooting
• Paralleling gear
  – Set-up and calibration
  – Troubleshooting
Maintainability
Equipment
Water Source
• The alternate water source needs to be capable of supplying water, so that the primary water source can be removed for maintenance
• Usage metering should be on each water source
• Types of alternate water source
  – City water
  – Wells
  – Storage tanks
Maintainability
Equipment
Pumps
• Alignment
  – Will reduce wear and tear on shafts, bearings and seals
  – Reduce vibration
  – Decrease current draw
• Bearings
  – Accessible grease fittings
  – Grease as required
• Infrared thermal scanning
  – Motor problems
  – Alignment issues
Maintainability
Equipment
CRAH/CRAC
• Temperature and humidity set points
  – Should be set the same
• Humidifiers
  – Have replacements for bulbs and canisters
• Filters
  – Use a pre-filter in dirty locations
  – Make sure your dirty filter Differential Pressure (DP) switch is set correctly
• Alignment
  – Proper alignment will reduce wear on the shaft and bearings
• Bearings
  – Grease when required
  – Infrared thermal heat scan
• Refrigerant leaks can activate fire alarms
Maintainability
Staff
• Dispatched service
  – Verify your vendor’s qualifications as a company
  – Request resumes of the people performing work at your site
  – Review their technical aptitude
  – Verify your vendor’s training programs
• Onsite operation and maintenance staff
  – Verify that they are managed correctly (in-house or contracted)
  – Verify your staff’s resumes and qualifications
  – Review their technical aptitude
  – Verify training programs
Maintainability
Location
• The location of, and access to, valuable resources is important when situations arise
  – 3:00 am Sunday morning is not the time to try to locate fuses required to get your site up and running
• There are various resources you should consider before the need arises:
  – Equipment
  – Technicians
  – Parts
  – Procedures
  – Manuals
  – Drawings
Maintainability
Training
• It is important that your operation and maintenance staff is adequately sized and regularly trained
• When an emergency occurs they should have the confidence and experience to complete the task at hand
  – Available training methods:
    • Self-paced
    • Classroom
    • Web-based
    • Manufacturer’s training
    • On-the-job training
    • Procedure development
    • Training module development
    • Test beds
    • Simulators
Maintainability
Coordination
• Work activities: it is important to closely coordinate maintenance activities to maintain a reliable, efficient and safe working environment
• During outage windows we have the tendency to plan too many activities at once. Make sure you don’t have too many people working in the same space at once
Maintainability
Coordination
• Pay particular attention to planning of your maintenance activities
  – CRAC units: refrigerant leaks will activate the fire systems; make sure you disable the fire system* prior to charging a system
  – Under-floor cleaning: can activate the fire alarm system; make sure you deactivate the fire alarm system* before you start to clean under the floor
  – There are other maintenance activities and tests that could mistakenly set off the fire alarm system
*When you disable a fire alarm system, make sure you follow the procedures required by OSHA, NFPA, local authorities, your company and your insurance underwriter. This could include, but is not limited to: additional fire extinguishers, posting a fire watch, notification, special procedures, and tagging
Maintainability
Coordination
• Maintenance activities
  – If you are planning to transfer your UPS to a generator maintenance bypass to perform maintenance on the UPS, PM the generator first
  – If you are planning to perform an open transfer to the building electrical system, inspect your UPS batteries first
  – Be aware of maintenance activities on building-wide systems that can affect the data center’s:
    • Chillers
    • Pumps
    • Electrical service
Maintainability
Maintenance Windows
• Maintenance windows
• Downtime vs. reduced reliability
• Reduction in reliability
• Design system to have various maintenance capabilities
• Move away from critical loads and towards utility
“Make sure you plan your maintenance windows carefully between IT and Facilities.”
Maintainability
Maintenance Windows
• IT maintenance windows are often loaded with IT tasks and therefore are not completely available for facilities tasks
• Need to clearly define the true window for facility maintenance
  – Maintenance window is midnight to 6 am
  – IT takes an hour to shut down and an hour to start up
  – Real outage is limited to 1 am to 5 am
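The arithmetic behind the "true window" can be written down so it is agreed between IT and Facilities before the outage. A minimal sketch using the midnight-to-6 am example from the slide; the calendar date and the function name are arbitrary placeholders.

```python
from datetime import datetime, timedelta


def facility_window(start: datetime, end: datetime,
                    it_shutdown: timedelta, it_startup: timedelta):
    """Return the part of the maintenance window left for facilities work
    after IT shutdown and start-up time are taken out."""
    usable_start = start + it_shutdown
    usable_end = end - it_startup
    if usable_end <= usable_start:
        raise ValueError("IT shutdown and start-up consume the whole window")
    return usable_start, usable_end


# Slide example: window is midnight to 6 am, IT needs an hour on each side.
window_start = datetime(2015, 3, 29, 0, 0)  # placeholder date
window_end = datetime(2015, 3, 29, 6, 0)
usable_start, usable_end = facility_window(
    window_start, window_end, timedelta(hours=1), timedelta(hours=1)
)
print(f"Facilities window: {usable_start:%H:%M} to {usable_end:%H:%M} "
      f"({usable_end - usable_start} available)")
```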
Predictability is the ability to detect the onset of a system failure before it happens
Predictive analysis can be performed by:
• Reviewing PM data
• Conducting failure analysis
• Monitoring systems
• Trending
• Advanced diagnostics
PREDICTABILITY
Predictability
• Reviewing PM data
  – PM should not only be a time to complete preventative maintenance tasks, but also be used as a diagnostic tool
  – Use detailed PM guides and complete them so they can be reviewed later
  – Review your PM task list and add additional items that can be used to perform predictive analysis
  – Record before and after data. This is important to set baselines and conduct trending
Predictability
• Conducting failure analysis
  – Event occurs
  – Complete an incident report
    • Incident report should only contain facts of what happened during the event
  – Stabilize the system
  – Repair the system
    • Take accurate and specific notes
    • Take before and after readings
    • Document
Predictability
• Conduct root cause analysis
  – It is not necessary to prevent the first, or root, cause from happening
  – It is merely necessary to break the chain of events at any point so that the final failure cannot occur
• Recommendations
  – Make recommendations to prevent future failures
  – Implement those changes in the failed system and other similar systems
  – Where the fault leads back to an initial design problem, redesign is necessary
  – Where the fault leads back to equipment failure, develop ways to improve the component wear, quality and life
  – Where the fault leads back to a failure of procedures, it is necessary to either address the procedural weakness or to install a method to protect against the damage caused by the procedural failure
Predictability
• Monitoring systems
  – Install a monitoring system
  – Monitor as much as you can, as long as you do something with the points you select
  – Know what you are monitoring and what affects the points
  – Develop your point list to assist you in predictive analysis
  – Comprehensive monitoring systems will provide you with the best information
Predictability
• Trending
  – Once your monitoring system is installed, select key points to trend
  – Use your trends to develop replacement and PM intervals
  – Items you can trend:
    • Temperatures
    • Pressure
    • Flow rates
    • Usage
      – Time
      – Consumption
    • Load
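One simple way to turn a trend into a PM or replacement interval is to fit a straight line to the readings and project when it will cross an alarm limit. The sketch below uses an ordinary least-squares fit; the monthly bearing temperatures and the 60-degree limit are made-up illustration values, and a real trend may not be linear.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_limit(xs, ys, limit):
    """Periods after the last reading until the fitted line reaches the limit.
    Returns None when the trend is flat or falling."""
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - xs[-1]


# Hypothetical monthly bearing temperature readings (degrees C).
months = [0, 1, 2, 3]
temps = [40.0, 42.0, 44.0, 46.0]

remaining = periods_until_limit(months, temps, limit=60.0)
print(f"Projected months until the 60 degree limit: {remaining:.1f}")
```

Recording before and after readings at every PM, as recommended earlier, is what provides the data points this kind of projection needs.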
PredictabilityPredictability
• Advance diagnostic techniques– Infrared thermal imaging– Oil analysis– Coolant analysis– Fuel analysis– Ultrasonic analysis– Power quality testing – Battery impedance testing – Vibration testing– Motor analysis– Eddy current analysis – Laser alignment– Balancing
Predictability
• Uses for an IR camera
  – Belt tension
  – Pump alignment
  – Bearings
  – Electrical connections
  – Turbo chargers
  – Roof leaks
  – Poor insulation
  – Room seals
Predictability
• Infrared thermography
  – Is the process of developing visual images that represent variations in the IR spectrum
  – Any object that is above absolute zero emits IR energy
  – The IR band used for thermal imaging lies between roughly 2.0 and 15 microns
  – The IR spectrum falls outside the range of the human eye
  – IR cameras detect temperature changes that can indicate conditions or stressors that shorten the design life of the equipment
  – The IR camera can have many uses in a data center
Unless you are the Predator, you will need to use an IR camera
Predictability
[Infrared images: Fuse Connection, Overloaded Breaker, Loose Cable, Defective Breaker]
Predictability
[Infrared images: Pump Alignment, Water Under Roof, Tank Level, Missing Insulation]
Predictability
• Oil analysis
  – Oil analysis is used to define three basic machine conditions
    • Oil condition: lubricant viscosity, acidity, etc.
    • Lubrication system condition: have physical boundaries been violated, e.g. fuel in oil?
    • Machine condition, determined by looking for wear particles
Predictability
• Oil analysis
  – Oil condition is most easily determined by measuring the viscosity, acid number and base number
  – Additional tests can determine the presence and/or effectiveness of oil additives such as anti-wear additives, antioxidants, corrosion inhibitors and anti-foam agents
  – Component wear can be determined by measuring the amount of wear metals such as iron, copper, chromium, aluminum, lead, tin and nickel, which can identify when a particular part is wearing
  – Contamination is determined by measuring water content, specific gravity and the level of silicon. A change in specific gravity typically indicates the presence of other oil or fuel contamination
Predictability
Metal    | Engines                                                       | Gears
Iron     | Cylinder heads, rings, gears, crankshafts                     | Gears, bearings
Chrome   | Rings, liners, exhaust valves                                 | Roller bearings
Aluminum | Pistons, thrust bearings, turbo bearings, main bearings       | Pump, thrust washers
Nickel   | Valve plating, steel alloy from crankshaft, camshafts         | Steel alloy from roller bearings
Copper   | Lube coolers, main and rod bearings, bushings, turbo bearings | Bushings, thrust plates
Lead     | Main and rod bearings, bushings, lead solder                  | Bushings, grease contamination
Tin      | Piston flashing, bearing overlays, bronze alloy               | Bearing cage metal
Silver   | Wrist pin bushings, silver solder from lube coolers           | Silver solder from lube coolers
Titanium | Gas turbine bearings, hubs, turbine blades                    | N/A
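The wear-metal table above can be used as a lookup when reviewing a lab report. The sketch below flags metals that exceed a limit and lists the engine components they may point to. The function name and every ppm limit and reading shown are hypothetical placeholders; actual alarm limits come from the equipment manufacturer or the oil laboratory.

```python
# Possible engine wear sources per metal, transcribed from the table above.
ENGINE_SOURCES = {
    "iron": "cylinder heads, rings, gears, crankshafts",
    "chrome": "rings, liners, exhaust valves",
    "aluminum": "pistons, thrust bearings, turbo bearings, main bearings",
    "nickel": "valve plating, steel alloy from crankshaft, camshafts",
    "copper": "lube coolers, main and rod bearings, bushings, turbo bearings",
    "lead": "main and rod bearings, bushings, lead solder",
    "tin": "piston flashing, bearing overlays, bronze alloy",
    "silver": "wrist pin bushings, silver solder from lube coolers",
    "titanium": "gas turbine bearings, hubs, turbine blades",
}


def flag_wear_metals(sample_ppm, limits_ppm):
    """Return {metal: possible sources} for each metal above its limit.
    Metals without a configured limit are ignored."""
    flagged = {}
    for metal, ppm in sample_ppm.items():
        limit = limits_ppm.get(metal)
        if limit is not None and ppm > limit:
            flagged[metal] = ENGINE_SOURCES.get(metal, "unknown source")
    return flagged


# Hypothetical generator engine sample and limits, in ppm.
sample = {"iron": 120, "copper": 10, "lead": 8}
limits = {"iron": 100, "copper": 50, "lead": 40}
for metal, sources in flag_wear_metals(sample, limits).items():
    print(f"{metal}: check {sources}")
```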
Predictability
• Coolant analysis
  – Regular coolant testing and routine maintenance can help you achieve maximum system efficiency and save time and money through reduced downtime
  – A cooling system is subject to pitting, corrosion, cavitation, erosion and electrolysis
  – Although coolants are formulated to help prevent these problems, coolant analysis will identify whether they are present and determine whether the coolant you're using is providing adequate protection
Predictability
• Fuel analysis
  – Fuel analysis can point to solutions for filter plugging, loss of power or poor injector performance
  – Testing bulk fuel storage tanks can verify compliance with required supplier specifications
Predictability
• Ultrasonic inspection
  – Ultrasonic, or ultrasound, waves are sound waves in the 20 kHz to 100 kHz range, which cannot be heard by humans
  – Unlike IR, ultrasound travels a short distance from the source
  – Ultrasonic detectors can be used to detect component wear, fluid leaks, vacuum leaks and steam trap failures
  – Even though such a leak may not be audible to the human ear, the ultrasound will still be detectable with the appropriate tool
Predictability
• Pressure and vacuum leaks can occur in various locations
  – Compressed air
  – Heat exchangers
  – Boilers
  – Condensers
  – Tanks
  – Pipes
  – Valves
  – Steam traps
• Ultrasonic inspections can detect these small leaks
Predictability
• Mechanical systems suffer from wear through constant operation, and ultrasonic inspection can detect wear in these systems
• Mechanical applications
  – Bearings
  – Lack of lubrication
  – Pumps
  – Motors
  – Gear/gearboxes
  – Fans
  – Compressors
Predictability
• Mechanical devices are not the only devices that emit ultrasonic sound. Electrical equipment will also generate ultrasonic waves if arcing, tracking or corona is present
• Electrical applications
  – Arcing, tracking and corona
  – Switchgear
  – Transformers
  – Insulators
  – Circuit breakers
Predictability
• Power quality testing
  – Hardware and software are frequently blamed for all types of problems that may actually originate within your building’s electrical distribution system: poor power quality
  – In many cases, the number one indication that you have a power quality problem is intermittent, unexplained failures of technology equipment or processes
  – Responding service technicians may complete a work report with the words “no trouble found”
Predictability
• Impedance testing
  – An alternative to performing a full load test
  – The internal resistance of a cell can be determined by how that cell responds to a momentary load
  – The instantaneous voltage drop and the load current applied are used to calculate the resistance
  – Most cell testers can check the impedance with the battery online or offline
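The calculation described above is Ohm's law applied to the momentary load: resistance equals the instantaneous voltage drop divided by the load current. A minimal sketch follows, with a hypothetical function name and made-up cell readings.

```python
def internal_resistance_milliohms(open_circuit_v: float, loaded_v: float,
                                  load_current_a: float) -> float:
    """Estimate a cell's internal resistance from its response to a momentary
    load: R = (V_open - V_loaded) / I, returned in milliohms."""
    if load_current_a <= 0:
        raise ValueError("load current must be positive")
    return (open_circuit_v - loaded_v) / load_current_a * 1000.0


# Hypothetical reading: a cell sags from 2.25 V to 2.20 V under a 50 A pulse.
print(round(internal_resistance_milliohms(2.25, 2.20, 50.0), 3))
```

A single reading says little on its own. The value becomes useful when it is trended against the same cell's earlier readings.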
Predictability
• Vibration analysis
  – The level and frequency of the vibration of rotating machinery are not distinguishable by human touch
  – Can be used to discover and diagnose a wide range of problems related to rotating equipment
Predictability
• Vibration monitoring can detect:
  – Unbalance
  – Eccentric rotors
  – Misalignment
  – Mechanical looseness or weakness
• Types of systems that vibration analysis should be performed on:
  – Generators
  – Cooling tower fans
  – Chillers
  – Pumps
  – CRAH/CRAC
  – Air handlers
Predictability
• Tests used to perform motor analysis
  – Infrared
  – Vibration analysis
  – Surge comparison
  – Motor current signature analysis
• Motor faults or conditions that can be detected
  – Winding short circuits
  – Open coils
  – Improper torque settings
  – Other mechanical problems
Predictability
• Types of motor analysis
  – Surge comparison testing identifies insulation deterioration by applying a high-frequency transient surge to equal parts of a winding and comparing the resulting voltage waveforms
  – Motor Current Signature Analysis (MCSA) provides a non-intrusive method of detecting mechanical and electrical problems
Predictability
• Eddy current analysis
  – Detects surface and subsurface defects
  – Detects variations in alloy, heat treatments, hardness, structure and other physical metallurgical conditions
  – Should be done on chillers each year when the tubes are being cleaned
Predictability
• Alignment inspection
  – Shafts and pumps should be properly aligned; this is best accomplished with laser alignment
  – When machines are improperly aligned, added loads are placed on the bearings and couplings, which can result in early and unplanned failures
Predictability
• Balance
  – Proper balance reduces wear and tear on bearings, shafts and motors
  – Imbalance can be detected with the use of infrared cameras and vibration meters
  – Requires balancing equipment to verify and correct balancing
SCALABILITY
Scalability is a desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged without impact to operations
For example, it can refer to the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added
Scalability
• What do we want… a flexible, scalable, reliable, highly performing, and highly available computer infrastructure that adapts to a wide range of continuously evolving and challenging demands
Scalability
What does it take?
• Requirements analysis
• Basis of Design (BOD)
• Design
  – Modular approach
  – Avoid excessive equipment
  – Pay as you go
• Expansion techniques
Scalability
• Good planning and decisions are the foundation of a highly scalable facility
• At no point in the lifecycle of a mission-critical facility can you have greater impact on scalability than during the design phase
• Start with a Requirements Analysis (RA) of your data center needs
• Use the results of your RA to develop a Basis of Design (BOD)
• The RA and BOD are living documents and you need to update them as changes occur
Scalability
Requirements Analysis
• Requirements analysis
  – Growth modeling takes the hardware platform requirements and turns them into space, power and cooling requirements
  – Considers both current and future technology impacts on space, power and cooling
  – Typically done for a planning horizon of 3 or more years
  – This leads to the critical infrastructure’s BOD
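To illustrate how growth modeling might turn hardware requirements into space, power and cooling figures, here is a minimal sketch. The rack count, kW per rack, floor area per rack and 20% annual growth rate are all assumed inputs, the names are hypothetical, and the model simply treats all IT power as heat to be removed (one ton of refrigeration is about 3.517 kW).

```python
import math
from dataclasses import dataclass

KW_PER_TON = 3.517  # one ton of refrigeration removes about 3.517 kW of heat


@dataclass
class YearRequirement:
    year: int
    racks: int
    it_load_kw: float
    cooling_tons: float
    floor_area_sqft: float


def project_requirements(racks_now, kw_per_rack, sqft_per_rack,
                         annual_growth, years):
    """Project racks, IT load, cooling and floor area for each planning year,
    assuming compound growth in rack count and constant density per rack."""
    results = []
    for year in range(years + 1):
        racks = math.ceil(racks_now * (1 + annual_growth) ** year)
        it_load_kw = racks * kw_per_rack
        results.append(YearRequirement(
            year=year,
            racks=racks,
            it_load_kw=it_load_kw,
            cooling_tons=it_load_kw / KW_PER_TON,
            floor_area_sqft=racks * sqft_per_rack,
        ))
    return results


# Hypothetical site: 100 racks at 5 kW and 25 sq ft each, growing 20% a year.
for row in project_requirements(100, 5.0, 25.0, 0.20, years=3):
    print(row)
```

A real requirements analysis would also account for changing rack density, redundancy levels and cooling overhead, which this sketch leaves out.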
Scalability
Basis of Design
• Roadmap to a reliable and quality-designed site
• More often than not, the BOD is lacking in detail
• Defines the requirements of the site
• Defines the reliability, availability, maintainability, scalability and operational parameters
• Should be updated regularly
Scalability
• Designing with scalability in mind
• Scalability
  – Reduced initial cost
  – Reduced time to install equipment
  – Reduced need to purchase large systems
  – Not an advantage for fast-growing facilities
• Modular design can be more precisely matched to reflect:
  – Lower capital investment (“pay as you go” approach)
  – Budget/capital constraints
  – Controlled growth
  – Unanticipated growth
Scalability
• Equipment rooms
  – When possible, design equipment rooms with space for expansion
  – Design hallways, corridors and doors to allow access for new equipment
  – Conserve wall space for future panels and equipment
Scalability
• Switchgear
  – Expansion breakers
  – Expansion cells
  – Be aware of bussing configuration; use fully-rated bus throughout
  – Use larger frame breakers with adjustable trips
  – Have expansion in your Programmable Logic Controller (PLC)
    • Have access to programming codes
    • Have current backup
Scalability
• UPS systems
  – Size parallel cabinet and static switch for full build-out
  – If modules are upgradeable, size feeders to full build-out
  – If equipped with sync control cabinet, size for full build-out
• Remember
  – When you start to add more than 3 modules in parallel, the redundancy begins to drop
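One way to see why redundancy erodes as modules are added is to treat an N+1 UPS as a "k of n" system: the load is carried as long as at least N of the N+1 modules are up. The sketch below is an interpretation of the remark above, not the presenter's stated calculation. It assumes independent, identical modules, and the 0.99 module availability is a placeholder.

```python
from math import comb


def k_of_n_availability(k: int, n: int, module_availability: float) -> float:
    """Probability that at least k of n independent, identical modules are up."""
    a = module_availability
    return sum(comb(n, i) * a ** i * (1 - a) ** (n - i) for i in range(k, n + 1))


def n_plus_one_table(max_needed: int, module_availability: float):
    """Availability of an N+1 system for N = 1 .. max_needed modules needed."""
    return [(n, k_of_n_availability(n, n + 1, module_availability))
            for n in range(1, max_needed + 1)]


# With one spare shared across more and more modules, the figure keeps falling.
for needed, availability in n_plus_one_table(5, 0.99):
    print(f"{needed}+1 modules: {availability:.6f}")
```

The single spare module covers a growing number of units, so each added module slightly lowers the chance that enough of them are running.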
Scalability
• Critical distribution
  – Dual main input
    • Allow for the possibility of a second source to supply load during cutover or expansion activities
    • Could be used to connect temporary equipment for emergencies
    • Load bank testing
  – Spare breakers
    • Allow for additional PDUs and expected new load
    • Up-frame the breaker so that larger loads may be added
      – e.g. use 400A frame breakers with 225A rating plugs to power PDUs
Scalability
• Power Distribution Units (PDUs)
  – Typically you run out of circuits before capacity
  – Install a junction box below the floor to allow for additional power whips. Bottom plates usually do not have enough knock-outs
  – Order PDUs with additional 225A sub-feed breakers to support additional Remote Power Panels (RPPs)
  – Consider in-row PDUs to save space
Scalability
• EPO systems
  – Plan on the fact that the EPO system will have items added and removed from it
  – EPO should be an engineered device and not a cloud stating “by others”
  – System should be documented
  – Should have an Active, Test and Off mode of operation
  – Installed with isolation relays
  – Centrally located in an EPO control cabinet with room for expansion
Scalability
• Chilled water systems
  – When possible, up-size piping
  – Have additional valves installed under the floor so you can add CRAH units as needed
  – Have valves installed for additional pumps and chillers
  – Have a valve connection that can be easily hooked up to a temporary chiller
Scalability
• Monitoring systems
  – Make sure that the system is expandable
  – Some systems are not upgradeable, while others require adding another module to the communication trunk
  – Make sure you will not be locked in with an uncooperative manufacturer
  – Have access to the programming function and required passwords
Scalability
• Expansion techniques
  – Implementation of new systems while the facility is in “production” is a business reality
  – The need for hot cutover occurs more often. For safety reasons, hot cutover should be a last resort
  – With proper upfront planning, the need for hot taps and cutovers can be reduced or eliminated
UPTIME
Uptime (Ŷ) is a measure of the time a system has been “up”, running and available. The term came into use as the opposite of downtime, the time when a system is not operational
ρ = Reliability
ά = Availability
ц = Maintainability
∏ = Predictability
∑ = Scalability
Reliability (ρ) is the ability of a system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances
Availability (ά) is the ability of a system to tolerate failures
It refers to the time that a system is available to its users
This means the process continues to be served through the failure and that, ideally, the failure is transparent to the user
Maintainability (ц) is defined as the probability of performing a successful repair action or preventative maintenance within a given time
In other words, maintainability measures the ease and speed with which a system can be restored to operational status
Predictability (∏) is the ability to detect the onset of a system failure before it happens
Predictive analysis can be performed by:
  – Reviewing PM data
  – Conducting failure analysis
  – Monitoring systems
  – Trending
  – Advanced diagnostics
Scalability (∑) is a desirable property of a system which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged
For example, it can refer to the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added
UPTIME
ρ * ά * ц * ∏ * ∑ = Ŷ
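The relationship above can be read as a simple product. The slides do not say how each factor is measured, so the sketch below assumes each one is scored between 0 and 1; the function name and the sample scores are hypothetical.

```python
from math import prod


def ramps_uptime_score(reliability: float, availability: float,
                       maintainability: float, predictability: float,
                       scalability: float) -> float:
    """Multiply the five RAMPS factors, each assumed to be a score in [0, 1]."""
    factors = (reliability, availability, maintainability,
               predictability, scalability)
    for factor in factors:
        if not 0.0 <= factor <= 1.0:
            raise ValueError("each factor must be between 0 and 1")
    return prod(factors)


# A strong design cannot compensate for a weak factor elsewhere:
print(ramps_uptime_score(0.99, 0.99, 0.99, 0.99, 0.99))
print(ramps_uptime_score(0.99, 0.99, 0.99, 0.50, 0.99))
```

Because the factors multiply, the result can never exceed the weakest of the five, which is the point of looking beyond facility design alone.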
Be sure to look at more than just the design of your facility…
don’t miss a step. Use RAMPS to achieve maximum uptime!
RAMPS©
Reliability, Availability, Maintainability, Predictability, Scalability
Presented by Joe Soroka
For additional information visit www.totalsitesolutions.com