Data Center Availability: Moving from Fault Tolerance to Fault Avoidance

Tier 4 data centers host mission-critical computers and servers with backup facilities intended to keep them running independently regardless of what happens. The Uptime Institute defines the standards they must meet to achieve this.

These include fully redundant cooling, power, network links, and storage. They should also have compartmentalized security zones controlled by biometric access.

However, colocating in a Tier 4 data center is too costly for most companies. Moreover, a Tier 3 facility is likely to offer only marginally less availability. Besides, physical infrastructure is not the only factor affecting uptime.

Why the Need to Detect Equipment Failure Before It Happens

The Uptime Institute’s 2017 Executive Symposium in Las Vegas featured a webinar titled ‘Beyond Fault Tolerance: Data Center Active Fault Avoidance Strategies’. It focused on how even the best data centers “can become vulnerable during highly stressful incidents”.

The Uptime Institute’s CTO shared how deploying a range of increasingly cost-effective monitoring techniques could help detect equipment failures before they happen. In this way, a data center moves from fault tolerance (a reactive stance) to active fault avoidance.

A Systems Approach to Go About This Thoroughly

A systems approach is a management strategy emphasizing the interactive nature and interdependence of external and internal factors in an organization, according to Business Dictionary.

Consultants commonly use it to evaluate market elements which affect the profitability of a business. However, in this instance we can use it for active fault avoidance, by monitoring a data center in its totality. Here we distinguish two methods.

A Building Management System (BMS) allows control room operators to monitor systems, and obtain insights into their status.
A Building Automation System (BAS) takes this further by providing automatic, central control over ventilation, cooling, heating, lighting, and other systems.

Both applications use programmable logic controllers to monitor the state of input sensors attached to equipment, and make decisions according to their customizable programs. Control room operators can use these inputs to stay in touch with the fully integrated system.
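The controller behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular BMS or BAS product works: real programmable logic controllers are normally programmed in dedicated languages such as ladder logic, and the sensor names, limits, and action labels below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One customizable decision: if the condition holds, emit the action."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    action: str


def scan(inputs: Dict[str, float], rules: List[Rule]) -> List[str]:
    """One controller scan cycle: evaluate every rule against current inputs."""
    return [rule.action for rule in rules if rule.condition(inputs)]


# Hypothetical sensor names and limits, chosen only for illustration.
RULES = [
    Rule("hot_supply_air", lambda s: s["supply_air_c"] > 27.0, "increase_cooling"),
    Rule("ups_charge_low", lambda s: s["ups_charge_pct"] < 50.0, "alert_operator"),
]

readings = {"supply_air_c": 28.5, "ups_charge_pct": 96.0}
print(scan(readings, RULES))  # actions triggered by this set of readings
```

Keeping the rules as data rather than hard-coded branches mirrors the "customizable programs" idea: operators can adjust limits without touching the scan loop.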

Building Management Through a ‘Single Pane of Glass’

Having all the information visually available on a single control panel elevates building management for fault avoidance to new levels of oversight. There are a variety of options on the market depending on whether an off-the-shelf or customized solution is required.

A good solution enables control operators to detect emerging equipment problems before they reach fall-over stage. This puts them in a position to proactively prevent faults, as opposed to activating a back-up system when the primary one fails.
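One simple way to spot an emerging problem is to look at the trend of a reading rather than its current value. The sketch below fits a straight line through evenly spaced samples and estimates how long until a limit is reached. It assumes a roughly linear drift, which real equipment may not follow, and the function name, the 70-degree limit, and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def hours_until_limit(readings: Sequence[float], limit: float,
                      interval_hours: float = 1.0) -> Optional[float]:
    """Least-squares line through evenly spaced readings, extrapolated to
    the limit. Returns None if the trend is flat or moving away from it."""
    n = len(readings)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den  # change in reading per sample
    if slope <= 0:
        return None
    latest_fit = mean_y + slope * ((n - 1) - mean_x)  # fitted value at last sample
    if latest_fit >= limit:
        return 0.0
    return (limit - latest_fit) / slope * interval_hours


# Hypothetical hourly bearing temperatures in degrees Celsius.
samples = [60.0, 61.0, 62.0, 63.0, 64.0]
print(hours_until_limit(samples, limit=70.0))
```

An operator display could flag any asset whose estimate drops below a chosen lead time, giving the maintenance team a window to act before the backup system is ever needed.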

In future we may delegate their roles to highly advanced artificial intelligence, and thereby avoid human error caused by stressful situations.

The Role of Formalized Procedures and Training

IBM Systems believes human error is the top cause of downtime. The other primary drivers are server hardware, operating systems, and new applications. A major system outage can cripple a business while it lasts.

A BMS or BAS can help keep building systems running smoothly, although it still needs the watchful eyes of control room operators. It is therefore essential to mitigate human error through a hierarchy of procedures detailing the steps to take when a situation moves outside specification.

Ongoing training and retraining are also essential to ensure fast reaction times. This is no place for book knowledge and theory. A data center’s quality assurance consultant should create realistic scenarios that bring the procedures to life.

Those procedures must be peer-reviewed and contain step-by-step descriptions and illustrations of relevant control panels and test instruments. They must also establish safety boundaries beyond which the response team should back off to avoid harm.

A well-founded data center team should formulate these procedures during commissioning, and make sure everybody knows what they have to do when problems develop. Finally, there must be close-out reviews every time a procedure is activated to implement any lessons learned.

How to Move from a Fault Tolerance to a Fault Avoidance Culture

It is best practice to install a Building Management System or Building Automation System during data center construction, when compatible equipment with sensors can be sourced. It’s equally important to have skilled people in the building at all times who are able to respond quickly.

Moreover, a primary responder must always be present in the control room. There must be no exceptions to this rule. They also need a backup to ensure there is someone available to notify the maintenance team the instant a warning light flashes on a panel.

The best control room operators often have the Myers-Briggs ISTP personality type. They are systematic people who obtain inspiration from keen observation and problem-solving.

Therefore they are fact-oriented and detail-minded, as opposed to being high on creativity. They make ideal employees in any environment using computers and other electronic devices as tools.

Finally, it’s essential to monitor the culture to make sure you have a cohesive team whose members can work together under pressure and support each other. Existing team members should be part of the selection process when choosing a new member. You could have the best BAS or BMS in the world and still fail to avoid faults if your maintenance people are fatigued or demotivated.

Finding the Right Blend of Preventative and Predictive Maintenance

Proper care of building systems belongs at the heart of a fault avoidance program. Preventative and predictive maintenance both help extend asset life, prevent unexpected breakdowns, and save on maintenance costs:

Preventative maintenance takes place at scheduled times as per the advice of suppliers, and may not take account of local conditions.
Predictive maintenance becomes necessary when performance information is no longer within the acceptable range.
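The difference between the two regimes can be expressed as a small decision function: one trigger is the calendar, the other is the measured condition of the asset. This is a minimal sketch under simplifying assumptions (a single reading per asset and a fixed acceptable range); the `Asset` fields and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Asset:
    name: str
    last_service: date
    service_interval_days: int  # supplier-recommended schedule
    reading: float              # latest performance measurement
    low: float                  # acceptable range, lower bound
    high: float                 # acceptable range, upper bound


def maintenance_reason(asset: Asset, today: date) -> str:
    """Return 'predictive', 'preventative', or 'none' for this asset."""
    # Condition-based trigger: the reading has left the acceptable range.
    if not (asset.low <= asset.reading <= asset.high):
        return "predictive"
    # Calendar-based trigger: the scheduled interval has elapsed.
    if today - asset.last_service >= timedelta(days=asset.service_interval_days):
        return "preventative"
    return "none"


# Hypothetical chiller with a 90-day service interval and a vibration reading.
chiller = Asset("chiller-1", date(2024, 1, 1), 90, reading=7.2, low=0.0, high=6.0)
print(maintenance_reason(chiller, date(2024, 2, 1)))
```

Checking the condition first means an out-of-range reading is acted on even when the next scheduled service is still weeks away.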

A data center moving from fault tolerance to fault avoidance therefore needs to find the right mix between these two maintenance regimes. Some maintenance, such as changing fluids and replacing filters, will always be scheduled.

However, increasingly intelligent equipment sensors monitored by dedicated operators are allowing us to draw closer to that old adage ‘if it’s working, don’t fix it because there is nothing wrong’. That said, we do still need to take building systems offline at regular intervals to inspect and test them.