What is high availability | A simple explanation

High availability means that a system remains operational even if one or more of its components fail. This article explains the fundamentals in simple terms.

To understand high availability, we should mentally detach ourselves from the details of implementation. There is no blueprint for the perfect HA cluster, because the right design largely depends on the requirements of the applications running on it.

Let’s start by thinking about which components we need to provide any kind of service. Obviously we need some sort of compute power (which does not necessarily have to be provided by a CPU, as shown here: http://www.it-automation.com/2021/07/09/how-parallel-computing-will-shape-our-future.html). Our compute power is useless until we feed it some data, so we need storage. And to make an application available to its users, we need a network.

So the three components we need are compute, storage and network. Even a Raspberry Pi can provide all three. It is very common among small businesses or at home to have a single server that provides some kind of service. But Murphy is out there to get you: it is not a question of whether one of those components will fail, only of when. Whether a system is highly available or not, we have to be prepared for things to go wrong.
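To put a number on that, here is a minimal sketch of how the availability of a single server follows from its three components. It assumes that the server only works if compute, storage and network all work, and that the components fail independently. The function name and the 99% figures are hypothetical values chosen for illustration, not measurements.

```python
import math


def series_availability(component_availabilities):
    """Availability of a system that needs ALL of its components to work.

    Assumes independent failures; each value is a fraction between 0 and 1.
    """
    return math.prod(component_availabilities)


# Hypothetical figures: compute, storage and network are each up 99% of the time.
single_server = series_availability([0.99, 0.99, 0.99])
print(f"{single_server:.4f}")  # prints 0.9703
```

Under these assumptions the server as a whole is less available than any one of its parts, because every component is a single point of failure.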

Think about a single server that hosts a web application. If that server fails and won’t boot anymore, it may take days or even weeks to fix it. While that may be OK for a small business, the same scenario is catastrophic if we serve millions of customers on a daily basis. What if that happened to Google or Amazon?

High availability refers to the idea that in case of a failure we automatically switch over to another instance of the failed resource. To be able to do that, we need redundant compute, storage and network. As simple as that may sound, it can get quite complex in practice (check out http://www.it-automation.com/2021/06/11/cap-theorem-simply-explained.html).
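The switch-over idea can be sketched in a few lines. This is only an illustration of the principle, not how a real HA cluster is built: the instances are plain Python functions, and the names call_with_failover, primary and standby are made up for this example. A real setup would also need health checks, timeouts and a strategy for keeping data in sync.

```python
from typing import Callable, Sequence


def call_with_failover(instances: Sequence[Callable[[], str]]) -> str:
    """Try each redundant instance in order and return the first answer."""
    errors = []
    for instance in instances:
        try:
            return instance()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")


def primary() -> str:
    # Hypothetical primary instance that is currently broken.
    raise ConnectionError("primary is down")


def standby() -> str:
    # Hypothetical redundant instance that takes over.
    return "served by standby"


result = call_with_failover([primary, standby])
print(result)  # prints "served by standby"
```

The caller never notices that the primary failed. If every instance fails, the function gives up with an error, which is exactly the situation more redundancy is meant to make less likely.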

You probably know what a RAID (Redundant Array of Independent Disks) is. It helps us to be redundant with our storage so our system remains operational even in the case of disk failures. Does that mean we have tricked Murphy? What if the entire host goes down? Solution: We need a second host! But what if the entire datacenter burns down (see https://www.reuters.com/article/us-france-ovh-fire-idUSKBN2B20NU)? Solution: We need a second datacenter (see https://aws.amazon.com/about-aws/global-infrastructure/regions_az/#Availability_Zones). But what if there is an earthquake in that region that destroys all datacenters at once? Solution: We need to have redundancy in another area of the world (https://aws.amazon.com/about-aws/global-infrastructure/regions_az/#Regions)! But what if a comet strikes the earth (see https://en.wikipedia.org/wiki/Comet_Shoemaker%E2%80%93Levy_9)? Solution: We need to go to another planet. Even AWS does not have a solution for that yet, but they are probably working on one.
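Each step in this chain adds another copy of something. As a rough sketch, the effect of such redundancy can be estimated with the formula 1 - (1 - a)^n, where a is the availability of a single copy and n is the number of copies. The formula assumes that the copies fail independently, which is precisely what a fire or an earthquake breaks, and that is why the copies have to move further and further apart. The function name and the 99% figure are hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` replicas when one surviving replica is enough.

    Assumes the replicas fail independently of each other.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only if every replica is down at the same time.
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figure: one host is up 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{redundant_availability(0.99, copies):.6f}")
```

With these numbers, a second independent host takes us from 99% to about 99.99%. Two hosts in the same burning datacenter are not independent, so the formula would overestimate their combined availability.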

This simple thought experiment shows that we can never get rid of Murphy’s law. The only thing we can do is to increase our level of availability. That typically goes in steps of 90%, 99%, 99.9%, 99.99%, 99.999%, 99.9999%. With each 9 we add, our costs grow roughly exponentially. It is our job to find the most economic setup somewhere in between adding a second hard disk and building a data center on another planet.
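To get a feeling for what these percentages mean, we can convert them into allowed downtime per year. The following is a small sketch that assumes a year of 365 days and counts every minute of outage equally; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year that still meets the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for level in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.2f} minutes per year")
```

At 90% we may be down for 36.5 days per year, at 99.9% for a bit under nine hours, and at 99.999% for only a little more than five minutes. Every additional 9 cuts the allowed downtime by a factor of ten.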

High availability can get really complex, and we haven’t even covered the overarching pieces like power supply and cooling. The good news is that cloud providers have taken away a lot of that complexity. Nowadays even the smallest company can migrate its server from the broom closet to an EC2 instance.