5 Easy Steps for Assessing Data Center Reliability

Reliable Data Center

I know data centers. As an Accredited Tier Designer from the Uptime Institute, I have operated multiple data centers, personally designed more than 1,000,000 square feet of them, and sold data center colocation space to everyone from SMBs to Fortune 10 organizations. After auditing more than 300 data centers and doing thousands of tours, I have found that each facility has its own unique story and circumstances. Disturbingly, most customers have no idea what they have on their hands or what their risks are.

Understanding How to Assess Data Center Reliability

RFPs vs Tours

It is incredible to me how many decisions get made through strenuous processes that add little insight. My favorite story is of a customer who submitted a thirty-page RFP to me before coming on the tour. The RFP had been downloaded from some data center website and contained a very particular, useless set of questions that gave no insight.

I also had another customer describe traditional RFP processes as a pointless CYA (Cover Your Ass). If you hear a bunch of N plus this and N plus that, I want to warn you now that it has little to no bearing on the overall reliability of the facility. Don’t buy the obfuscation; those are terrible ways to describe reliability. It’s worth noting that the customer in question later disclosed that he had no idea what most of the RFP was talking about and had to google it himself.

I am not saying RFPs are dead. I have seen some great ones, but you need to understand what purpose they serve. Ultimately, a good RFP should capture specific needs, availability of space, and commercial terms. Design review of the one-line diagrams is a job for consulting engineers, not for the RFP. One good example of an RFP question is door clearance, asked by a customer who used oversized cabinets. Door clearance is an essential consideration for that customer, so it is an excellent question for an RFP.

A large part of the buying process for customers is “touring the facility.” Many customers treat this as though they are going to audit our facility to ensure that we are providing a reliable service. Newsflash – these tours are incredibly polished and often designed to obfuscate or overwhelm any questions that you have. See this article on how to buy data centers to learn more about why the data center tour is crucial. An important point to remember is this – you are touring the people and the operation, not just the infrastructure.

Overall, I believe the data center tour is a fantastic opportunity to establish a human connection with your potential provider and to use that connection as a first point of leverage in negotiating a great deal on your data center.

How Capable is your Data Center?

In a perfect world, facilities would be upfront about their design and capabilities, but they are financially incentivized to win your business and may not always be. To provide clarity, the Uptime Institute has developed a set of standards that it applies through its Tier Certification process. In recent years, it has expanded its focus to training people through “people standards,” such as Accredited Tier Designer and Accredited Operations Specialist. These are great certifications to look for to get an early read on how serious a provider is.

Now the good news is that most facilities worth their salt (there are at least two or three in each city these days) are at least concurrently maintainable and are generally well-run operations. That does not mean they are without fault. If you don’t ever want to go down, what you may be looking for is an entirely fault-tolerant data center. While NO infrastructure is infallible, the gap between the two can mean the difference between having zero outages in your entire career and having a few pockmark your decision to go with a particular provider. All else equal, I think the difference in results from a fault-tolerant versus concurrently maintainable infrastructure tips the scales in favor of the provider who went the extra mile to create full fault tolerance.

The bad news is that designing fault tolerance into facilities is an INCREDIBLY tricky problem. We believe that ALL IT infrastructure should follow the design process of fault tolerance, so it would truly benefit you to think about this engineering problem and see how you can apply it to your own infrastructure in the IT stack. Instead of relying on an RFP, it is probably best to learn how these things work so that you can draw your own conclusions. The simple heuristic we developed will at least help you shake out 95% of the “fakers.”

If you desire 100% certainty, you should hire a professional for an audit or external review. The Uptime Institute does these, but you can probably also get an opinion on reliability from a consulting engineer with an ATD. Please note that ATDs cannot “Certify” a Tier Rating.

What is the Data Center Tier System?

Before we get too far, let’s quickly review the Tier Levels per the Uptime Institute.

Tier I – Basic Capacity (basic infrastructure, such as a generator, UPS, and cooling, to support the data center)

Tier II – Redundant Capacity (fully redundant capacity components, such as N+1 generators, cooling, and UPS, but no requirement for redundant distribution paths such as switchgear, distribution panels, chiller feeds, and electrical feeds)

Tier III – Concurrently Maintainable – Fully redundant capacity components; any portion of the distribution paths and any capacity component must be able to be taken down for planned maintenance in an operational environment without affecting IT load. Extreme temperatures must be considered.

Tier IV – Fault Tolerant – Any type of unplanned fault may occur at any point within the system without affecting IT load. There are other considerations, such as continuous cooling ratings and runtime ratings, but this mostly sums it up.
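
To make the layering concrete, here is a minimal sketch of how the four levels build on one another. It is my own simplification for illustration, not an Uptime Institute method: the function name classify_tier and its three yes/no inputs are assumptions, and it ignores continuous cooling, runtime, and the other considerations mentioned above.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four Tier levels as summarized in the list above."""
    I = 1    # Basic Capacity
    II = 2   # Redundant Capacity
    III = 3  # Concurrently Maintainable
    IV = 4   # Fault Tolerant


def classify_tier(redundant_components: bool,
                  concurrently_maintainable: bool,
                  fault_tolerant: bool) -> Tier:
    """Hypothetical helper: each level requires everything below it.

    A facility that fails a lower requirement cannot claim a higher
    level, no matter what else it has.
    """
    if not redundant_components:
        return Tier.I
    if not concurrently_maintainable:
        return Tier.II
    if not fault_tolerant:
        return Tier.III
    return Tier.IV


# Redundant and maintainable without downtime, but an unplanned fault
# can still affect IT load.
print(classify_tier(True, True, False).name)  # prints "III"
```

The point of the early returns is that the levels are cumulative: claiming fault tolerance without concurrent maintainability does not move a facility past Tier II in this sketch.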

(If you want to know even more about the Tier system and the Uptime Institute, please read our blog post here.)

The 5 Steps to Checking Data Center Reliability

The tiered rating system is intentionally vague because, factually speaking, many different types of designs can achieve the different Tier ratings. That breadth is good for engineers, more or less, but it does not help end users trying to evaluate a design, short of saying “show me the certification,” and certification is unfortunately not widely adopted in colocation data centers.

There is a much larger book with many considerations, but to save you that pain, we wanted to give you five questions that narrow things down as effectively as possible while you are evaluating data centers.

  1. Ask them how they identify the rating of their facility. Do they recognize it as fault-tolerant? Concurrently maintainable? Do they try to mix it by saying, “fault-tolerant electrical and concurrently maintainable chillers?” A mixed answer like that means the facility would likely score Tier III, or concurrently maintainable, in an audit. In general, the least reliable component of your facility dictates your rating (read about statistical bottlenecks if you are interested in the academic view of this).
  2. Is it on a chiller loop? Then it is probably concurrently maintainable at best, unless they have dual-coil HVAC units that provide a DX backup or a particular control scheme. If they claim they are fault-tolerant and are on a chiller loop, ask them to “prove it,” then get with a consulting engineer who holds an ATD or have them show you their Uptime Institute certifications. Tier IV fault-tolerant designs can utilize chillers, but this is rarely done in the colocation industry. You can count on a single hand how many there are in the country; there are no Houston data centers, for example, that achieve fault tolerance with chillers. BE SKEPTICAL WITH CHILLERS! Every major critical cooling failure that I have seen in recent years has been due to chillers.
  3. Does it have distinct, physically separate utility rooms and paths with entirely independent generators and UPS that don’t share any bus or electrical infrastructure? If you don’t get two sets of cables coming to your rack from two FULLY separate and physically isolated lineups (separate rooms with different paths), it PROBABLY isn’t fault-tolerant. If they claim it is, ask them to “prove it.”
  4. Does it have a single anything? Path, switchgear, UPS, generator tanks, piping, network paths, meet-me room(s)? Then it’s not fault-tolerant. As a matter of fact, it is likely Tier I or Tier II (basic capacity or redundant capacity).
  5. Do an “AX TEST” – and ask for run times and performance UNDER A FAILURE CONDITION. For example, after a single worst-case fault, you should be able to run on the remaining generators, fully loaded, for an unlimited number of hours on the hottest day of the year. What if THIS room catches fire? Is there any point in the facility you can “AX” and make the entire thing go down? A review is often best done on a one-line diagram or by walking around and discussing it. By the way, a system is not fault-tolerant if it requires manual intervention to continue operations of the facility.
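
The “AX TEST” in step 5 (and the “single anything” question in step 4) can be sketched as a small graph exercise: write down what feeds what from the one-line diagram, remove one component at a time, and see whether the rack still has a path back to a source. This is only a rough illustration under my own assumptions. The function names, component names, and both example layouts are hypothetical, and the check covers connectivity only, not capacity, runtime, cooling, or whether two components share a room.

```python
from collections import deque


def is_load_served(feeds, sources, load, removed):
    """Breadth-first search from the surviving sources to the load."""
    queue = deque(s for s in sources if s != removed)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node == load:
            return True
        for downstream in feeds.get(node, ()):
            if downstream != removed and downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return False


def ax_test(feeds, sources, load):
    """Return every component whose loss alone drops the load."""
    components = set(feeds) | set(sources)
    for targets in feeds.values():
        components.update(targets)
    components.discard(load)
    return sorted(c for c in components
                  if not is_load_served(feeds, sources, load, removed=c))


# Hypothetical layout 1: redundant UPS, but everything shares one switchgear.
shared_bus = {
    "utility": ["switchgear"],
    "generator": ["switchgear"],
    "switchgear": ["ups_a", "ups_b"],
    "ups_a": ["rack"],
    "ups_b": ["rack"],
}

# Hypothetical layout 2: two fully independent lineups feeding the rack.
two_lineups = {
    "utility_a": ["switchgear_a"], "generator_a": ["switchgear_a"],
    "switchgear_a": ["ups_a"], "ups_a": ["rack"],
    "utility_b": ["switchgear_b"], "generator_b": ["switchgear_b"],
    "switchgear_b": ["ups_b"], "ups_b": ["rack"],
}

print(ax_test(shared_bus, ["utility", "generator"], "rack"))
# prints ['switchgear']
print(ax_test(two_lineups,
              ["utility_a", "generator_a", "utility_b", "generator_b"],
              "rack"))
# prints []
```

An empty result here only says that no single component in the model takes the rack down. It does not prove fault tolerance, because physical separation, load under failure, and automatic (not manual) transfer all still have to be checked.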

Extra Considerations

There are many other considerations, such as “continuous cooling capability,” runtime, ancillary systems, EPO configurations, and plenty of other gotchas in determining fault tolerance, which is why these audits are sometimes best left to the professionals.

However, this guide is a great start that will provide far more insight into the reliability of a facility than most RFPs.

In practice, we see most data centers fall into these categories:

  1. Mixed and intermingled everything, single electrical rooms, common bus, chiller loop, etc. Usually a Tier I or Tier II Facility (Most Enterprise Facilities or Low-End Colocation Data Centers)
  2. High-End Colocation Grade Data Center – Has Chillers – Probably a Tier III Concurrently Maintainable Facility. Mostly fault-tolerant electrical, but overall concurrently maintainable.
  3. Entirely Fault-Tolerant Data Center – Probably doing a type of Hybrid DX, or has a Tier Certification from the Uptime Institute with Chillers. Entirely fault-tolerant electrical infrastructure.
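
The second category shows the weakest-link rule at work: fault-tolerant electrical plus concurrently maintainable cooling is, overall, concurrently maintainable. A minimal sketch of that rule follows. The function name overall_rating, the subsystem names, and the per-subsystem numbers are all illustrative assumptions, not an official scoring method.

```python
def overall_rating(subsystem_tiers):
    """Return (lowest tier, subsystems that hold the facility at that tier).

    subsystem_tiers maps a subsystem name to the tier level (1-4) that
    subsystem would achieve on its own.
    """
    if not subsystem_tiers:
        raise ValueError("at least one subsystem is required")
    lowest = min(subsystem_tiers.values())
    limiting = sorted(name for name, tier in subsystem_tiers.items()
                      if tier == lowest)
    return lowest, limiting


# Hypothetical high-end colocation facility on a chiller loop.
facility = {"electrical": 4, "cooling": 3, "network": 4}
print(overall_rating(facility))  # prints (3, ['cooling'])
```

Listing the limiting subsystems alongside the number is deliberate: it tells you which part of the facility to ask the provider about first.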

Are you interested in finding out if something is fault-tolerant? Which category do you believe your data center falls into? Send it to me, and I will give you my opinion for free!

At TRG, we designed our Houston Data Center to be entirely fault-tolerant, the only one in Houston. We would love to provide you with a tour so that you can learn more about it and the TRG difference. Contact us to find out more!