Data availability and uptime are now primary concerns for businesses in every industry. As more companies rely on digital systems for the vast majority of their processes, availability matters more than ever. As a result, we’re seeing far more conversations about how to achieve the best possible uptime under a service level agreement (SLA), and about which processes companies should put in place to protect themselves from the damage that unexpected downtime can cause.
Fault tolerance is one of the key talking points amongst IT professionals today, but relatively few outside of the IT sector have a good understanding of what this term really means, particularly in the context of data centers. With fault tolerance becoming increasingly important as time goes on, it’s worth taking the time to understand what is meant by the term, and how a good level of knowledge around fault tolerance could result in more reliable systems for your entire business.
What is a fault tolerant data center?
The phrase fault tolerant is often used to describe data centers. Seen as a standard of quality and a sure sign of reliability, a fault tolerant data center is one that has no single point of failure. Facilities are purpose-built to avoid such a point of failure and fully equipped with a range of technology that significantly improves the fault tolerance of the center as a whole.
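To make "no single point of failure" concrete, here is a minimal sketch that models each power delivery route as a list of components and flags anything that sits on every route. The function name, the component names, and the two layouts are hypothetical illustrations, not a description of any real facility.

```python
from typing import List, Set


def single_points_of_failure(paths: List[List[str]]) -> Set[str]:
    """Return the components that appear on every delivery path.

    If a component is on every path, its failure cuts off the load,
    which makes it a single point of failure.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common


# Hypothetical layout: two sources, but both share one UPS and one PDU.
single_feed = [
    ["utility", "ups_a", "pdu_a"],
    ["generator", "ups_a", "pdu_a"],
]

# Hypothetical layout: two fully separate routes to the load.
dual_feed = [
    ["utility_a", "ups_a", "pdu_a"],
    ["utility_b", "ups_b", "pdu_b"],
]

print(sorted(single_points_of_failure(single_feed)))  # ['pdu_a', 'ups_a']
print(sorted(single_points_of_failure(dual_feed)))    # []
```

A real assessment would also have to cover cooling, network, and control systems, but the idea is the same: any component shared by every route is a weak point.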
A high level of fault tolerance can make a real impact in terms of the reliability of a data center, but it’s not the only thing that companies need to consider. Data center downtime can also be avoided by practicing fault avoidance. The use of continuous monitoring systems, good training practices, and meticulous maintenance all come together to help prevent faults from occurring, thereby keeping downtime to a minimum.
Data centers like ours are built with fault tolerance in mind. TRG’s facility has been built to avoid any single point of failure.
Understanding the tier system
The tier system defined by the Uptime Institute has long been used to help explain the capabilities of different data centers. The system is composed of four tiers, with each one giving a clear indication of the performance of different sites. The four tiers are Tier I (Basic Capacity), Tier II (Redundant Capacity), Tier III (Concurrently Maintainable), and Tier IV (Fault Tolerant). Let’s take a closer look at what these tiers mean.
Tier I: Basic Capacity
Tier I data centers are amongst the most affordable options. While they do not provide the high levels of fault tolerance that Tier IV centers will, they are usually sufficient for the needs of companies looking for a basic level of support for existing systems. These data centers tend to include features like cooling equipment, engine generators, and an uninterruptible power supply.
Tier II: Redundant Capacity
Tier II data centers improve on the basic level of service that Tier I provides. These data centers add redundant power and cooling components, which help companies to complete maintenance tasks without disrupting systems. The redundant components also limit the chance of downtime caused by equipment failures.
Tier III: Concurrently Maintainable
Tier III data centers provide a clear benefit to companies that are always looking to expand and improve the service they offer. They are built in such a way that shutdowns are never required during maintenance tasks, and equipment can be replaced with no need for any downtime at all. This is achieved through the addition of a redundant delivery path, which is used for power and cooling, alongside all the redundant critical components of a Tier II data center.
Tier IV: Fault Tolerance
The highest level of reliability and security is provided by Tier IV data centers. Widely known as fault tolerant data centers, these facilities must have two parallel power and cooling systems. This means that, should any equipment failure or interruption occur, the center’s generators, cooling systems, double electrical rooms, and purpose-designed infrastructure will keep the risk of downtime to a minimum.
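The benefit of a second parallel system can be illustrated with some simple arithmetic. The sketch below assumes that the two paths fail independently and that either one can carry the full load. The 99% figure is a hypothetical input chosen for easy math, not a published tier number.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0 to 1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def parallel_availability(path_availability: float, paths: int) -> float:
    """Availability when any one of `paths` independent paths is sufficient."""
    return 1.0 - (1.0 - path_availability) ** paths


single_path = 0.99  # hypothetical availability of one power and cooling path
dual_path = parallel_availability(single_path, 2)

print(round(annual_downtime_hours(single_path), 2))  # 87.6
print(round(annual_downtime_hours(dual_path), 3))    # 0.876
```

In practice failures are rarely fully independent (a shared control fault or a human error can affect both paths), so the real-world gain is smaller than this idealized figure suggests.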
The Importance of Fault Avoidance
While infrastructure plays a big role in ensuring data center availability, the biggest improvements in uptime are found when facilities look beyond fault tolerance and start practicing fault avoidance. In fact, tiers can mean little in terms of data center availability without fault avoidance.
Put simply, fault avoidance aims to limit downtime considerably, with an approach that centers on prevention rather than cure. Years of experience operating data centers have taught us that downtime can be avoided altogether with the right level of monitoring, thorough maintenance, and well-trained personnel.
A 24/7 facilities team and designated Primary Alert Watcher (PAW) provide continuous monitoring, a vital part of any good fault avoidance strategy. This ensures that any issues are picked up on quickly, and an immediate response can be organized. As a result, more serious problems will be avoided, and downtime can be minimized.
A Building Management System (BMS) and a Building Automation System (BAS) are two of the most important tools when it comes to data center monitoring and practicing active fault avoidance. In simple terms, a BMS lets operators monitor systems and gather insights from them, whereas a BAS goes a step further, offering automated responses based on those insights. These automatic responses often include control over ventilation, cooling, heating, and more. Both systems also use Programmable Logic Controllers that let operators monitor individual pieces of equipment or the building in its entirety.
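The difference between monitoring and automated response can be sketched in a few lines. This is a simplified illustration, not how any particular BMS or BAS product works. The `Reading` type, the function names, the threshold value, and the action string are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    sensor: str
    celsius: float


# Hypothetical threshold for illustration only, not a recommended setpoint.
MAX_INLET_CELSIUS = 27.0


def bms_check(reading: Reading) -> Optional[str]:
    """Monitoring only: return an alert message for an operator, or None."""
    if reading.celsius > MAX_INLET_CELSIUS:
        return (
            f"ALERT {reading.sensor}: {reading.celsius:.1f} C "
            f"exceeds {MAX_INLET_CELSIUS:.1f} C"
        )
    return None


def bas_respond(reading: Reading, actions: List[str]) -> Optional[str]:
    """Monitoring plus automation: raise the same alert and queue a response."""
    alert = bms_check(reading)
    if alert is not None:
        # A real system would drive cooling equipment via its controllers;
        # here the response is only recorded as a string.
        actions.append(f"increase_cooling:{reading.sensor}")
    return alert


queued_actions: List[str] = []
print(bas_respond(Reading("row3_inlet", 29.5), queued_actions))
print(queued_actions)  # ['increase_cooling:row3_inlet']
```

In both cases the alert still reaches a person. The automated path simply starts the response before anyone has to act.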
Balancing Predictive and Preventative Maintenance
Maintenance should be a key consideration for businesses hoping to avoid downtime. In fault avoidance, having a maintenance regime is crucial for preventing incidents before they occur. There are two main types of maintenance:
- Preventative – Regularly scheduled maintenance undertaken on the advice of suppliers
- Predictive – Monitoring equipment and leveraging data to understand where the most likely point of failure will be
Practicing effective fault avoidance involves finding a healthy mix of both of these regimes.
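As a rough sketch of how the two approaches can be combined, the example below fits a straight line to a series of equipment readings to estimate when a limit will be reached (predictive), then services the equipment at whichever comes first: that estimate or the supplier’s scheduled interval (preventative). The function names, the readings, and the limit are hypothetical, and real predictive maintenance typically uses richer models than a linear trend.

```python
from typing import List, Optional


def hours_until_threshold(
    hours: List[float], values: List[float], threshold: float
) -> Optional[float]:
    """Estimate hours after the last reading until a linear trend hits `threshold`.

    Uses an ordinary least-squares line through the readings.
    Returns None when the trend is flat or falling.
    """
    n = len(hours)
    if n < 2 or n != len(values):
        raise ValueError("need at least two readings with matching timestamps")
    mean_h = sum(hours) / n
    mean_v = sum(values) / n
    denom = sum((h - mean_h) ** 2 for h in hours)
    if denom == 0:
        raise ValueError("timestamps must not all be identical")
    slope = sum((h - mean_h) * (v - mean_v) for h, v in zip(hours, values)) / denom
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_h
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])


def next_service_in_hours(scheduled_in: float, predicted_in: Optional[float]) -> float:
    """Service at the scheduled time or the predicted time, whichever is sooner."""
    if predicted_in is None:
        return scheduled_in
    return min(scheduled_in, predicted_in)


# Hypothetical vibration readings (mm/s) taken every 100 running hours.
run_hours = [0.0, 100.0, 200.0, 300.0]
vibration = [2.0, 2.5, 3.0, 3.5]

predicted = hours_until_threshold(run_hours, vibration, threshold=5.0)
print(round(predicted))  # about 300 hours until the hypothetical limit
print(round(next_service_in_hours(scheduled_in=500.0, predicted_in=predicted)))
```

Here the trend suggests servicing well before the scheduled date. With a flat trend, the supplier’s schedule would apply unchanged.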
Human error is, of course, another leading cause of downtime, which is why reducing it should be part of any good fault avoidance strategy. Businesses practicing fault avoidance will need to prioritize staff training and formalize Methods of Procedure (MOPs) to be followed in the event of an incident. These procedures should always be peer-reviewed, and they must include clear guidelines as to when team members should stop an intervention to minimize risk.
Make downtime a thing of the past with a fault tolerant data center
Our data centers are all designed with fault tolerance and fault avoidance in mind, offering everything ambitious organizations need to ensure their work is never interrupted. If you’d like to hear more about what a fault tolerant data center could do for your company, or are interested in exploring the options further, contact us.