What Makes a Data Center Fault Tolerant?

Data availability and uptime are now primary concerns for businesses in all industries. With a growing number of companies relying on digital systems for the vast majority of their processes, the focus on data availability is becoming ever more important. As a result, we’re seeing far more conversations about how to achieve the very best levels of SLA uptime, and which processes companies should put in place to protect themselves from the damage that unexpected downtime could cause.

Fault tolerance is one of the key talking points amongst IT professionals today, but relatively few outside of the IT sector have a good understanding of what this term really means, particularly in the context of data centers. With fault tolerance becoming increasingly important as time goes on, it’s worth taking the time to understand what is meant by the term, and how a good level of knowledge around fault tolerance could result in more reliable systems for your entire business.

What is a fault tolerant data center?

The phrase fault tolerant is often used to describe data centers. Seen as a standard of quality and a sure sign of reliability, a fault tolerant data center is one that has no single point of failure. Facilities are purpose-built to avoid such a point of failure and fully equipped with a range of technology that significantly improves the fault tolerance of the center as a whole.

A high level of fault tolerance can make a real difference to the reliability of a data center, but it’s not the only thing that companies need to consider. Data center downtime can also be avoided by practicing fault avoidance. Continuous monitoring systems, good training practices, and meticulous maintenance all come together to help prevent faults from occurring, thereby keeping downtime to a minimum.

Data centers like ours are built with fault tolerance in mind. TRG’s facility has been built to avoid any single point of failure.

Understanding the tier system

A tier system has long been used to help explain the capabilities of different data centers. The system is composed of four tiers, with each one giving a clear indication of the performance of different sites. The four levels of the system include Tier I (Basic Capacity), Tier II (Redundant Capacity), Tier III (Concurrently Maintainable), and Tier IV, which is the tier that denotes fault tolerance. Let’s take a closer look at what these tiers mean.

Tier I: Basic Capacity

Tier I data centers are amongst the most affordable options. While they do not provide the high levels of fault tolerance that Tier IV centers will, they are usually sufficient for the needs of companies looking for a basic level of support for existing systems. These data centers tend to include features like cooling equipment, engine generators, and an uninterruptible power supply.

Tier II: Redundant Capacity

Data centers in the Tier II bracket improve on the basic level of service that Tier I data centers provide. They add redundant power and cooling components, which help companies to complete maintenance tasks without disrupting systems. These redundant components also limit the chance of downtime caused by equipment failures.

Tier III: Concurrently Maintainable

Tier III data centers provide a clear benefit to companies that are always looking to expand and improve the service they offer. They are built in such a way that shutdowns are never required during maintenance tasks, and equipment can be replaced with no need for any downtime at all. This is achieved through the addition of a redundant delivery path, which is used for power and cooling, alongside all the redundant critical components of a Tier II data center.

Tier IV: Fault Tolerance

The highest level of reliability and security is provided by Tier IV data centers. Widely known as fault tolerant data centers, these facilities must have two parallel power and cooling systems. This means that, should any equipment failures or interruptions occur, the center’s generators, cooling systems, double electrical rooms, and purpose-designed infrastructure will keep the risk of downtime to a minimum.
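To give a rough sense of why two parallel systems matter, here is a minimal Python sketch. It assumes the two paths are identical and fail independently of each other, which is a simplification, and the 99.9% figure is purely illustrative rather than a measured value for any facility. The function names are our own and hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(path_availability: float, paths: int = 2) -> float:
    """Availability of `paths` identical parallel paths.

    The system is only down when every path is down at the same time.
    Independence between paths is an assumption of this sketch; real
    paths can share failure causes.
    """
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("at least one path is required")
    return 1.0 - (1.0 - path_availability) ** paths


def expected_downtime_hours(availability: float) -> float:
    """Expected downtime per year, in hours, for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # illustrative availability of one path
    dual = parallel_availability(single, paths=2)
    print(f"single path: {expected_downtime_hours(single):.2f} hours/year")
    print(f"two paths:   {expected_downtime_hours(dual):.4f} hours/year")
```

Under these assumptions, the second path cuts expected downtime by several orders of magnitude. In practice the gain is smaller, because real paths are never perfectly independent.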

The Importance of Fault Avoidance

While infrastructure plays a big role in ensuring data center availability, the biggest improvements in uptime are found when facilities look beyond fault tolerance and start practicing fault avoidance. In fact, tiers can mean little in terms of data center availability without fault avoidance.

Put simply, fault avoidance aims to limit downtime considerably, with an approach that centers around prevention rather than cure. Years of experience operating data centers have taught us that downtime can be avoided altogether with the right level of monitoring, thorough maintenance, and well-trained personnel.

24/7 Staff

A 24/7 facilities team and designated Primary Alert Watcher (PAW) provide continuous monitoring, a vital part of any good fault avoidance strategy. This ensures that any issues are picked up on quickly, and an immediate response can be organized. As a result, more serious problems will be avoided, and downtime can be minimized.

Monitoring

A Building Management System (BMS) and a Building Automation System (BAS) are two of the most important tools for data center monitoring and active fault avoidance. In simple terms, a BMS lets operators monitor systems and gather insights from them, whereas a BAS goes a step further, offering automated responses based on those insights. These automatic responses often include control over ventilation, cooling, heating and more. Both of these systems also use Programmable Logic Controllers that let operators monitor equipment individually or the building in its entirety.
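The difference between monitoring and automated response can be sketched in a few lines of Python. This is an illustration only, not how any particular BMS or BAS product works: the `Reading` type, the function names, the 27 C threshold and the response text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold, chosen for illustration only.
MAX_INLET_CELSIUS = 27.0


@dataclass
class Reading:
    sensor: str
    celsius: float


def bms_check(reading: Reading) -> Optional[str]:
    """Monitoring only: return an alert for an operator, or None."""
    if reading.celsius > MAX_INLET_CELSIUS:
        return (
            f"ALERT {reading.sensor}: {reading.celsius:.1f} C "
            f"exceeds {MAX_INLET_CELSIUS:.1f} C"
        )
    return None


def bas_respond(reading: Reading) -> Optional[str]:
    """Monitoring plus automation: return the action taken, or None."""
    if bms_check(reading) is None:
        return None
    # A real system would command equipment here; this sketch only
    # names the action it would take.
    return f"increase cooling near {reading.sensor}"


if __name__ == "__main__":
    hot = Reading(sensor="rack-12", celsius=30.0)
    print(bms_check(hot))
    print(bas_respond(hot))
```

The BMS-style function stops at telling a person that something is wrong. The BAS-style function builds on the same check and decides on a response without waiting for an operator.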

Balancing Predictive and Preventative Maintenance

Maintenance should be a key consideration for businesses hoping to avoid downtime. In fault avoidance, having a maintenance regime is crucial for preventing incidents before they occur. There are two main types of maintenance:

Preventative – Regularly scheduled maintenance undertaken on the advice of suppliers
Predictive – Monitoring equipment and leveraging data to understand where the most likely point of failure will be

Practicing effective fault avoidance involves finding a healthy mix of both of these regimes.
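One way to picture that mix is a check that flags equipment when either rule fires. The Python sketch below is a simplified assumption on our part: the function names, the service interval, and the idea of averaging the last few sensor readings against a limit are hypothetical stand-ins for whatever suppliers and monitoring data would actually dictate.

```python
from datetime import date, timedelta
from typing import Sequence


def preventative_due(last_service: date, interval_days: int, today: date) -> bool:
    """Scheduled rule: due once the supplier's interval has elapsed."""
    return today >= last_service + timedelta(days=interval_days)


def predictive_due(readings: Sequence[float], limit: float, window: int = 3) -> bool:
    """Data-driven rule: due when the mean of the last `window` readings
    exceeds `limit`. With too few readings there is no basis to flag."""
    if len(readings) < window:
        return False
    recent = readings[-window:]
    return sum(recent) / window > limit


def needs_maintenance(
    last_service: date,
    interval_days: int,
    today: date,
    readings: Sequence[float],
    limit: float,
) -> bool:
    """Combine both regimes: either rule is enough to schedule work."""
    return preventative_due(last_service, interval_days, today) or predictive_due(
        readings, limit
    )


if __name__ == "__main__":
    # Serviced recently, but vibration readings are trending high.
    print(
        needs_maintenance(
            last_service=date(2024, 3, 1),
            interval_days=90,
            today=date(2024, 3, 15),
            readings=[1.0, 1.1, 2.5, 2.6, 2.7],
            limit=2.0,
        )
    )
```

Here the schedule alone would not call for work yet, but the recent readings do, which is the kind of case a purely preventative regime would miss.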

Formalized Training

Human error is of course another leading cause of downtime, which is why reducing it should be part of any good fault avoidance strategy. Businesses practicing fault avoidance will need to prioritize staff training and formalize Methods of Procedure (MOPs) to be followed in the event of an incident. These procedures should always be peer-reviewed, and they must include clear guidelines as to when team members should stop any interventions to minimize risk.

The Commercial Considerations of Fault Tolerance

What is the cost of implementing fault tolerance?

When thinking about what goes into making a data center fault tolerant, it’s also important to consider the commercial practicalities. In small scale facilities, fault tolerant designs are usually delivered in a 2N capacity, meaning every critical system is duplicated, so the costs are essentially doubled. Add the costs of maintaining these systems (which will require specialist 24/7 teams to support) and it generally won’t make sense to implement fault tolerance for capacities below several megawatts (the threshold is around 4-5 MW). In these instances it is much more economically viable to use a larger colocation data center that can benefit from economies of scale.
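The doubling is easy to see if you count equipment. The Python sketch below is a simplified illustration: the function name, load and unit size are hypothetical, and N+1 (one spare unit) is included only for comparison; it is a common redundancy scheme but is not discussed above.

```python
import math


def units_required(load_mw: float, unit_capacity_mw: float, scheme: str) -> int:
    """Number of equal-sized units needed for a load under a redundancy scheme.

    N   = just enough units to carry the load
    N+1 = one spare unit on top of N (shown for comparison only)
    2N  = a complete second set of units
    """
    if load_mw <= 0 or unit_capacity_mw <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_mw / unit_capacity_mw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Hypothetical small facility: 1 MW of load served by 0.5 MW units.
    for scheme in ("N", "N+1", "2N"):
        print(scheme, units_required(1.0, 0.5, scheme))
```

Equipment count is only a proxy for cost, but it shows why 2N weighs most heavily on small facilities: every unit bought for capacity has to be bought twice.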

The fact remains, large data center providers can achieve fault tolerance at a much lower overhead cost. As a result, customers can benefit from better value for money. Economies of scale let larger facilities invest in 24/7 staffing for better continuity of service. Customers then also get the peace of mind that best-in-breed experts are working to provide these services.

Ultimately, most large scale data centers are able to provide Tier IV facilities at the same cost it takes to build a Tier II data center yourself.

Your time

Another core consideration should be your own and your team’s time, and where it is best spent. The most successful results come when a team does what it is best at and doesn’t waste time on tasks outside its own capabilities.

Implementing fault tolerance requires a high level of facilities management and critical systems design. Ask yourself if this is within your organization’s core competencies because, if not, fault tolerance could be little more than a distraction that keeps you away from your core purposes. Even worse, it could turn into a costly mistake that requires a complete rework. This is another reason why larger data center providers benefit from having the time, resources and expertise to invest into achieving fault tolerance at a high level.

What About The Cloud?

We can’t think about fault tolerant data centers without also addressing The Cloud. You’ll find the most successful organizations will use a hybrid strategy for fault tolerance. This essentially means hosting the primary system in a colocation facility, and then using The Cloud as a backup target in a disaster recovery location.

Looking at The Cloud and its role in fault tolerance, business continuity and disaster recovery, the fact remains that there are still various risks involved, especially at hyperscale. Any error or problem could potentially bring the whole system down. This is one of the instances where scale can work against you, as the risks get larger the more you scale up in The Cloud.

This is why it’s still not always the best idea to be on a giant shared computer grid where one error can throw the whole thing off. In our experience, best practice is to host systems in a locally provided colocation data center, and then leverage a hybrid strategy for backups.

You’ll find colocation data centers will have the features and service levels to provide cloud-like experiences. In TRG’s case, we have cloud on-ramps and multi-site capabilities that let us provide the same fault tolerance as cloud providers.

Make downtime a thing of the past with a fault tolerant data center

Our data centers are all designed with fault tolerance and fault avoidance in mind, offering everything ambitious organizations need to ensure their work is never interrupted. If you’d like to hear more about what a fault tolerant data center could do for your company, or are interested in exploring the options further, contact us.