Here’s how companies like Google, Cloudflare, Facebook, and more reach 99.999% uptime, and your business can too.
With most businesses finding it hard to achieve even 99.9% uptime throughout the year, a goal of 99.999% uptime looks daunting to developers.
It’s like asking someone to build a bridge that would never collapse or a machine that would never break down no matter what.
In short, it is a hard goal to achieve, but it is achievable.
But how do we achieve it?
To figure out how to achieve this goal, we must first understand what high availability (uptime) means and why we fail to achieve it in the first place, and then work our way up the ladder.
So what is a high availability system?
Simply put, a high availability system is one that has no downtime (or very little downtime). Availability becomes extremely important when deciding the service level agreement (SLA) for a particular product or service. Cloud vendors like Google, Amazon, Microsoft, and others set an SLA of around 99.9% for their cloud services, which means a minimum commitment of 99.9% uptime. This might not look like impressive uptime, but as the complexity of a system increases, an SLA of 99.9% is considered very good across industries.
An SLA is critical to B2B companies, since most customers look at your uptime history and your SLA guarantee before making a purchasing decision.
Taking it to the next level means an availability of five nines, i.e., 99.999%. This is the availability we desire, and it is considered top-notch by industry standards.
99.999% uptime means your system can be down only for a total of approximately five minutes and fifteen seconds per year.
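The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration, assuming a non-leap year of 365 days; the function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Maximum downtime per year, in seconds, allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a number of seconds as hours, minutes, and seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Print the yearly downtime budget for each additional "nine".
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {format_duration(downtime_budget_seconds(target))} of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the budget shrinks from days at 99% to minutes at five nines.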
Hard to accomplish – yes. Impossible – absolutely not.
Understanding Complexity – Stairs vs Escalators Analogy.
Imagine a time when there were no lifts and escalators. People used to use stairs to climb buildings. Those stairs were made of pure concrete and steel. If proper care was taken, they could even last generations without fail as is proven by stairs in many historical monuments some of which were built thousands of years ago.
But climbing stairs soon became tiring as buildings grew taller. This led to inventions like lifts and escalators. These were complex systems that needed a lot of things to work together to achieve the goal of not having to climb the stairs at all. They saved us a lot of time and pushed the human race forward.
Now, between an escalator and stairs, which do you think has a higher uptime?
Of course, stairs have much higher availability. Stairs are capable of 100% uptime but escalators aren’t, because they need to be taken offline for routine maintenance and repairs. Escalators also have many points of failure (electricity, load, pulleys, chains, etc.), while stairs have just one: collapsing completely.
The same is true of web services today. Maintaining 100% uptime on a static website was much easier, but today websites and products have matured into complex systems with multiple dependencies and points of failure. Adding complexity can bring many new features and benefits, but it makes it extremely difficult for businesses to have high reliability and availability.
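To put a rough number on how dependencies erode uptime, here is a minimal sketch. It assumes that a request needs every component to be up and that components fail independently, which real systems only approximate. The component counts and the 99.99% figure are made up for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a system that works only when every component works.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(component_availabilities)


# Hypothetical numbers: every component is up 99.99% of the time.
static_site = serial_availability([0.9999])        # one web server
complex_app = serial_availability([0.9999] * 10)   # ten chained dependencies

print(f"static site: {static_site:.4%}")
print(f"complex app: {complex_app:.4%}")
```

Under these assumptions, ten dependencies at four nines each leave the whole system at roughly three nines.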
It is also the reason why a person with a single page static website might have much higher availability than Amazon. Even with its IT team of hundreds of thousands of smart people, Amazon is dealing with much higher complexity.
Thus building a high uptime system requires deciding between a series of tradeoffs and making some sacrifices as well.
Higher availability can be ensured by removing the different points of failure that might result in downtime.
Ensure that you eliminate all single points of failure
A single point of failure is a component whose failure brings down the whole system, which makes it extremely risky to have one. One of the most important foundations of a high availability system is to have “zero” single points of failure.
Eliminating them adds redundancy, cost, and management effort and time, because you’re running multiple things at the same time instead of just one, but it reduces the risk of significant downtime and improves reliability.
To understand it better let us consider an example of a single server and all users connected to that server. This system is the simplest system we can think of. This server is responsible for serving all the users connected to it. What happens when the traffic surges and this server goes down?
This renders the whole system useless. The system will appear offline to all the users connected to it and will suffer downtime. The users will continue experiencing downtime until this server goes online and resumes operation again. A system of this type is very simple but extremely unreliable.
The availability of this system can be improved just by adding one more server to the system.
When a user tries to connect to the system, they are connected to either Server 1 or Server 2. If the traffic surges on Server 1, or if Server 1 becomes unavailable for any technical reason, the traffic is automatically routed to Server 2 and the system remains operational for its users. But this system needs a monitoring and distribution mechanism that continuously monitors both servers and distributes the traffic between them accordingly.
This is where the load balancer comes into the picture. It ensures that the traffic is distributed uniformly between the servers when users try to connect, and it also keeps checking the health of the servers.
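The behaviour described here, spreading requests evenly and skipping servers that fail health checks, can be sketched as a toy round-robin balancer. This is not how any particular load balancer product is implemented. The class, method names, and server names are hypothetical, and health status is set by hand rather than by real probes.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        """Record a failed health check for a server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record a passing health check for a known server."""
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = LoadBalancer(["server-1", "server-2"])
print([balancer.route() for _ in range(4)])  # alternates between both servers

balancer.mark_down("server-1")               # simulate Server 1 failing
print([balancer.route() for _ in range(2)])  # all traffic now goes to Server 2
```

A real balancer would run the health checks itself on a timer. The sketch only shows the routing decision.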
Can you find an issue with this system?
Yes, we are back to the same problem of having a single point of failure in the system, only this time it’s the load balancer. What will happen if the load balancer in this system fails? The whole system will fail.
To reduce this risk, we need to introduce more redundancy in the form of a backup load balancer that stays in sync with the first load balancer and takes over when the first one fails.
This, however, introduces a new type of problem: failover to the redundant load balancer would require a DNS change, which takes a while, and until then the system would suffer downtime. A solution to this problem is to decouple the IP address from the load balancer and have the backup load balancer take over that IP as soon as the first load balancer fails.
This will create static IP addresses that would float between the load balancers. If one is unavailable due to any reason, the other one will be able to handle the incoming traffic.
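The floating IP idea can be modelled as follows. This is a simplified simulation, not a real network configuration. The class, the balancer names, and the address (taken from a range reserved for documentation) are all assumptions for illustration.

```python
class FloatingIP:
    """Toy model of one public IP address shared by two load balancers."""

    def __init__(self, address, primary, backup):
        self.address = address   # the address clients connect to; it never changes
        self.primary = primary
        self.backup = backup
        self.owner = primary     # the balancer currently answering on the address

    def heartbeat(self, alive):
        """Process one round of health probes.

        `alive` maps each balancer name to True or False. If the current owner
        is down and the other balancer is up, the address moves to the other one.
        """
        if not alive.get(self.owner, False):
            standby = self.backup if self.owner == self.primary else self.primary
            if alive.get(standby, False):
                self.owner = standby
        return self.owner


vip = FloatingIP("203.0.113.10", primary="lb-1", backup="lb-2")
print(vip.heartbeat({"lb-1": True, "lb-2": True}))   # primary keeps the address
print(vip.heartbeat({"lb-1": False, "lb-2": True}))  # backup takes the address over
print(vip.address)                                   # clients still use the same IP
```

Because clients only ever see the one address, no DNS change is needed when ownership moves.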
This setup is the foundation on which IT teams build a high availability system. Though the actual system involving databases and multiple other dependencies and redundancies isn’t this simple, this forms the basis of all the bigger and more complex systems used in businesses.
Have multiple instances of anything that might have even the slightest chance of failing.
The next thing you need to do is set up database servers. In a huge majority of cases, you’ll not only be serving a static website but also complex applications over the internet. So now you’ll add two instances of the database server and each of the database servers will store its own copy of data.
The problem with this approach is that if the two database servers each store their own copy of the data, the copies will drift apart over time. So you need to make sure the two database servers talk to each other and sync data between them. This is called replication.
Major databases such as PostgreSQL, MySQL, and Microsoft SQL Server have features that help you do this.
The toughest part of running database servers in such distributed, redundant systems is balancing data consistency against high reliability. As the number of database servers grows, it becomes more likely that the data on one server is not an exact mirror of the data on the others. There will be a sync delay of a few milliseconds to a few seconds between servers, which can show your users inconsistent data. Imagine you share a Facebook post, refresh the page, and the post is gone. This happens because the post was stored on database server 1, the refreshed page was fetched from database server 2, and the sync between server 1 and server 2 had not yet happened when you refreshed.
This does happen occasionally, but the risk of these inconsistencies is much lower than the risk of downtime. If your business cannot tolerate any inconsistency in data, you need to configure the database servers to write the data, sync it to the other servers, and only then call the write “done”. Most databases are configured by default to call the write “done” as soon as it is written on one server.
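The difference between the two acknowledgement modes can be shown with a small in-memory simulation. It is only a sketch of the idea: the classes and names are hypothetical, and real databases expose this choice through their own replication settings rather than through code like this.

```python
class DatabaseServer:
    """In-memory stand-in for one database server."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ReplicatedStore:
    """Toy primary/replica pair with a choice of acknowledgement mode."""

    def __init__(self, primary, replicas, synchronous):
        self.primary = primary
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # writes acknowledged but not yet copied to replicas

    def write(self, key, value):
        self.primary.data[key] = value
        if self.synchronous:
            # Copy to every replica before reporting the write as done.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Report done immediately; replicas catch up on the next sync.
            self.pending.append((key, value))
        return "done"

    def sync(self):
        """Apply all pending writes to the replicas (the delayed sync)."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


server_1, server_2 = DatabaseServer("db-1"), DatabaseServer("db-2")
store = ReplicatedStore(server_1, [server_2], synchronous=False)

store.write("post", "Hello!")
print(server_2.data.get("post"))  # a read from server 2 misses the new post
store.sync()
print(server_2.data.get("post"))  # after the sync the post is visible
```

With `synchronous=True` the first read from the replica would already see the post, at the cost of every write waiting for all replicas.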
Hosting across multiple geographical locations
AWS, GCP, and Azure occasionally suffer datacenter-wide outages that take pieces of the internet offline (this is very rare; for example, see the AWS outage of 2017, which was caused by human error). It teaches us a very important lesson about how hosting in a single geographical location can be disastrous to availability.
To optimize for high availability, it’s worthwhile to consider hosting across multiple geographic zones. When an outage occurs and impacts your system, there’s a better chance of availability with servers distributed geographically across the whole region.
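A back-of-the-envelope estimate of the gain is sketched below. It assumes zones fail independently of each other and that failover between them is instant, both of which are optimistic. The 99% per-zone figure is made up for illustration.

```python
def redundant_availability(zone_availability: float, zones: int) -> float:
    """Availability when the system is up as long as at least one zone is up.

    Assumes zones fail independently, so the system is down only when all
    zones are down at the same time.
    """
    return 1 - (1 - zone_availability) ** zones


# Hypothetical per-zone availability of 99%.
for zones in (1, 2, 3):
    print(f"{zones} zone(s): {redundant_availability(0.99, zones):.6%}")
```

Under these assumptions each extra zone adds about two nines. Correlated failures, such as a bad deploy pushed to every zone, reduce the real benefit.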
This also protects you from floods, earthquakes, or any other natural disaster that might take a complete datacenter out.
When infrastructure is developed as a single block, the whole system crumbles when anything inside it breaks. It’s like putting all the eggs in the same basket. This is a recipe for failure and low availability.
Microservices is an engineering methodology where you break your service down into multiple smaller services that can be independently deployed on any server. The split happens by feature: for example, a payment service, an authentication service, a service that serves files, and so on.
Using microservices helps alleviate this problem to some extent. Microservices break the different functions of an application into individual systems. These systems operate independently but together keep the whole application up and running. It’s like the headlamps, tail lamps, and stereo in your car. If any one of these stops working, the car doesn’t stop working.
In a similar way, when a microservice goes down it brings down a single functionality and not the whole system. The system remains operational with all other functionalities.
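On the calling side, this isolation looks like the sketch below. It is only an illustration under stated assumptions: the service names and fallback text are hypothetical, and the services are plain functions standing in for network calls.

```python
def render_dashboard(services):
    """Call each service independently; a failure in one does not abort the rest."""
    sections = {}
    for name, fetch in services.items():
        try:
            sections[name] = fetch()
        except Exception:  # broad catch only to keep the sketch short
            sections[name] = "temporarily unavailable"
    return sections


def payments_down():
    """Stand-in for a payment service that is currently unreachable."""
    raise ConnectionError("payment service unreachable")


page = render_dashboard({
    "authentication": lambda: "signed in",
    "files": lambda: "3 files",
    "payments": payments_down,
})

for section, content in page.items():
    print(f"{section}: {content}")
```

Only the payments section degrades; the rest of the page is still served.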
This again has a tradeoff to take care of. A higher number of micro-services means greater operational complexity for developers and higher attention and resources needed to ensure all the functionalities are online and operational.
And yes, you need multiple instances (at least two) of each of these services.
Ensure better CI/CD processes to avoid pushing bad code.
Most of the downtime we normally see is caused by human error rather than architecture failure. Pushing bad code can bring the system crashing down pretty fast. Hence it is extremely important to have a well-defined, documented process for testing code and catching bugs before deploying to production.
You need a solid and foolproof CI/CD process that takes the human element out of testing code and shipping it to staging and then to production. More on building this CI/CD pipeline in a later post.
Precautionary measures here would save a lot of time and resources going forward.
Balancing complexity with availability.
The race to high availability is a never-ending process and in an ideal world, your systems would have 100% availability. But it isn’t the case in the real world. Chasing the dream of higher and higher availability burns a lot of cash and resources in the process.
Improving availability from 99% to four nines, i.e., 99.99%, is much easier than going from 99.99% to five nines, i.e., 99.999%. Improving availability increases the complexity of the whole system and hence makes the system more resource-intensive.
Hence, it is important to strike a balance between the amount of complexity you can bear to have while ensuring the highest availability possible.
To help make these decisions, it is worth asking questions such as the ones below before deciding on a strategy:
- What kind of downtime would be acceptable to my customers — a few minutes, hours, days or even maybe weeks?
- What is the per second revenue loss to the company because of downtime?
- Do I have the right monitoring tools in place like Fyipe to ensure that my team knows when things go down?
- What is the amount of resources I need to put in to improve availability? Is it worth the effort?
- What are the risks that I need to take with increased complexity to ensure the availability of my choice? Are these risks acceptable?
- When my system goes down, do I have a concrete, reliable, and foolproof incident management strategy in place? Here’s how to create one.
The answers to all these questions will help you find the right strategy for ensuring the high availability of your system. The answers will vary with company, customers, region, and even time, but the essence remains the same.
We at Fyipe keep our own system at 99.999% availability to help others do the same. Our ability to do so has helped us win the trust of companies worldwide. This blog is a small gesture to help you achieve the same goals, but in an easier way.