High Availability and Disaster Recovery Considerations
What is high availability? How can the cloud help you with high availability? How can you plan for and recover from a disaster?
Define the Requirements
Recovery Time Objective (RTO)
Recovery Point Objective (RPO)
Mean Time to Repair (MTTR)
Mean Time Between Failures (MTBF)
Service Level Agreement (SLA)
Before investigating high availability, you need clear requirements; otherwise, the cost estimates will be inaccurate. The following key concepts help define the business requirements for high availability.
RTO: This concept is the maximum acceptable time for an application to be unavailable.
RPO: This concept is the maximum acceptable amount of data loss after a disaster, measured as a period of time.
MTTR: This concept is the average time that it takes to restore a component after a failure.
MTBF: This concept is the average time that a component is expected to run between failures (for example, a hard drive).
The RTO and RPO can be discovered through a risk assessment. The MTTR can be estimated and later refined by looking at the deployment process and disaster recovery tests. The manufacturer can provide the MTBF.
The following numbers will help give you an idea of what kind of SLA will be required. An SLA defines the level of uptime that you can expect from the provider. SLAs are usually measured in nines; more nines mean higher uptime. Here is an example downtime chart.
| SLA | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
| --- | --- | --- | --- | --- |
| 90% ("one nine") | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 99% ("two nines") | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.9% ("three nines") | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.99% ("four nines") | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
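These figures follow directly from the SLA percentage. A minimal Python sketch that reproduces the chart's arithmetic:

```python
def downtime(sla_percent: float) -> dict:
    """Convert an SLA percentage into allowed downtime (in seconds) per period."""
    fraction_down = 1 - sla_percent / 100
    periods = {
        "per year": 365.25 * 24 * 3600,        # seconds in an average year
        "per month": 365.25 * 24 * 3600 / 12,  # one twelfth of a year
        "per week": 7 * 24 * 3600,
        "per day": 24 * 3600,
    }
    return {name: seconds * fraction_down for name, seconds in periods.items()}

for nines in (90, 99, 99.9, 99.99, 99.999):
    allowed = downtime(nines)
    print(f"{nines}%: {allowed['per day'] / 60:.2f} minutes of downtime per day")
```

For example, three nines (99.9%) allows 0.1% of 86,400 seconds per day, which is the 1.44 minutes shown in the chart.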
Building by Following Best Practices
Failure mode analysis (FMA)
Design the system to be scaled.
Create a redundancy plan.
Build high availability into the design.
Implement load balancing.
Create logs and metrics for monitoring.
FMA helps you determine where failure points are in your application. When choosing a service or system to make highly available, you should first target the failure points that can lead to the largest disruptions.
Design the application or system to be scaled. There are two common approaches to scaling: vertical scaling and horizontal scaling. With vertical scaling, you add more resources to an existing system. Vertical scaling is easy to implement, but high-capacity memory and processors are costly, and a single machine can only grow so far. Horizontal scaling, on the other hand, allows you to run an application many times across many low-cost compute nodes.
Next, you will need to consider a redundancy plan. Based on the business requirements for high availability, choose which components to make redundant to achieve the SLA required.
When scaling horizontally, the application will require some form of load balancing to ensure that traffic is sent evenly to the compute nodes.
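In practice you would use a managed load balancer, but the core distribution logic can be sketched in a few lines of Python. This hypothetical round-robin balancer also skips nodes marked unhealthy:

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.health = {node: True for node in nodes}
        self._cycle = itertools.cycle(nodes)

    def mark_down(self, node):
        self.health[node] = False

    def next_node(self):
        # Try each node at most once per call so a fully failed pool
        # raises an error instead of looping forever.
        for _ in range(len(self.health)):
            node = next(self._cycle)
            if self.health[node]:
                return node
        raise RuntimeError("no healthy nodes available")

balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
print([balancer.next_node() for _ in range(6)])  # traffic spread evenly
balancer.mark_down("node-b")
print([balancer.next_node() for _ in range(6)])  # node-b is skipped
```

Marking nodes down would normally be driven by the health checks discussed later, not by manual calls.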
Finally, add logs and metrics within your application and systems that can be exposed for monitoring.
Data Management
Choose the right storage to meet the requirements.
Back up data regularly.
Verify and restore regularly. If you cannot restore, then you do not have backups.
Protect your data.
Your actions concerning data management can greatly impact the RPO. When choosing storage for your application, there are many options. You will need to choose one that fits your requirements for costs, reliability, encryption, security management, and even location.
You should also back up the data regularly. How often you back up the data depends on the risk assessment. However, simply backing up the data does not help if the backups are not regularly verified and restores have not been done. If you cannot restore from a backup, then you do not have a backup.
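Verification can be as simple as comparing checksums between the source data and a test restore. A minimal sketch, assuming file-based backups (the file names are placeholders):

```python
import hashlib
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large backups fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source: Path, restored: Path) -> bool:
    """A restore only counts if the restored data matches the original."""
    return checksum(source) == checksum(restored)

# Hypothetical smoke test: back up a file, "restore" it, and compare.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "data.db"
    original.write_bytes(b"important records")
    restored = Path(workdir) / "restored.db"
    restored.write_bytes(original.read_bytes())  # stand-in for a real restore
    print("restore verified:", verify_restore(original, restored))
```

A real verification job would restore from the actual backup system into an isolated environment and run application-level checks, not just a byte comparison.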
Protect the data storage. Enable disk encryption if possible and ensure proper user rights. Verify that the user who is managing the storage does not have access to remove the backups.
Automated Deployment Process
Automate your deployments as much as possible.
Have a rollback plan for failed deployments.
Audit your deployments.
Document your release process.
When implementing a high availability or disaster recovery system, another part of the calculation is the MTTR. Deployments greatly impact the MTTR. How quickly can an application be deployed on a new system?
Automating deployments, or as much of them as you can, helps reduce the time that it takes to run a deployment and limits the chance for human error during the deployment. A high availability system may have many servers to which the application is deployed, depending on scale. The more systems there are, the more chances there are for human error. Automation does not have this problem.
Deployments may still fail, so what is the rollback plan? How quickly can you roll back from a failed deployment? How much impact will the failed deployment have? The rollback plan and deployment auditing can answer these questions.
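The pattern can be sketched as a deployment wrapper; the deploy, health-check, and rollback callables below are placeholders for whatever your release tooling actually does, and the audit log stands in for a real audit trail:

```python
def deploy_with_rollback(deploy, health_check, rollback, audit_log):
    """Run a deployment; roll back and record the outcome if it fails."""
    audit_log.append("deploy started")
    try:
        deploy()
        if not health_check():
            raise RuntimeError("post-deploy health check failed")
        audit_log.append("deploy succeeded")
        return True
    except Exception as error:
        audit_log.append(f"deploy failed: {error}; rolling back")
        rollback()
        audit_log.append("rollback complete")
        return False

log = []
ok = deploy_with_rollback(
    deploy=lambda: None,         # pretend the release itself succeeds
    health_check=lambda: False,  # ...but the app never comes up healthy
    rollback=lambda: None,
    audit_log=log,
)
print(ok, log)
```

The audit entries answer the questions above: when the deployment ran, whether it failed, and whether the rollback completed.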
Document your release process. Multiple people should know how to deploy a release. Even if releases are scheduled based on your availability, a disaster may not occur on your timetable. Having clear deployment instructions that others are familiar with reduces the MTTR.
Monitor Server and Application Health
Identify KPIs as an early warning alert.
Maintain application logs and metrics.
Monitor third-party services on which your application relies.
Implement health checks.
To maintain the SLA and the MTBF, it is important to have a monitoring solution in place.
Key performance indicators (KPIs) help you find abnormalities in your system performance that can be a sign of a possible or imminent failure. For instance, suppose an API averages a 6-ms response time. One day, the response time jumps to more than 25 ms. That spike is an anomaly and could point to an issue. Tied closely to the KPIs should be a system of alerting. A 19-ms increase may or may not be large enough for users to report issues, so without an alert, the anomaly could go unnoticed.
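A minimal sketch of that kind of KPI alert in Python; the 100-sample window and 3x threshold are illustrative assumptions, not recommendations:

```python
from collections import deque

class LatencyAlert:
    """Alert when the latest response time far exceeds the recent average."""

    def __init__(self, window=100, factor=3.0):
        self.samples = deque(maxlen=window)
        self.factor = factor

    def record(self, response_ms: float) -> bool:
        """Return True if this sample should raise an alert."""
        if self.samples:
            average = sum(self.samples) / len(self.samples)
            if response_ms > average * self.factor:
                # Do not add the spike to the baseline, so repeated
                # spikes keep alerting instead of raising the average.
                return True
        self.samples.append(response_ms)
        return False

monitor = LatencyAlert()
for _ in range(50):
    monitor.record(6.0)      # normal traffic: roughly 6-ms responses
print(monitor.record(25.0))  # a sudden 25-ms spike trips the alert
```

A production monitoring system would use percentiles and standard-deviation bands rather than a simple multiple of the mean, but the alerting principle is the same.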
Maintain application logs and metrics. An individual log entry can tell you about an issue, and a sudden increase in log volume can itself signal a problem. Metrics can report on load and system health. A sudden spike in load can point to an issue in the application, a problem with the load balancer, or perhaps even a server outage.
While on the topic of monitoring, if your application requires a third-party service, those services should also be monitored.
Health checks for the application should be implemented and monitored. Running checks to ensure that the application returns the right data can help discover issues even when logs and metrics do not point to issues occurring.
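A sketch of what such a health check might validate; the status code handling is standard HTTP, but the `status` and `database` payload fields are hypothetical names for illustration:

```python
def evaluate_health(status_code: int, body: dict) -> bool:
    """Judge a health endpoint response: the status code alone is not enough.

    A 200 with the wrong data can still mean the service is broken, so the
    check also validates the payload contents.
    """
    if status_code != 200:
        return False
    return body.get("status") == "ok" and body.get("database") == "connected"

# A service can answer 200 while its database is down; the payload
# check catches what the status code misses.
print(evaluate_health(200, {"status": "ok", "database": "unreachable"}))
```

In practice a monitoring agent would fetch the health endpoint on a schedule and feed failures into the alerting system described above.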
Test High Availability
What happens when a single server fails?
What happens when a zone or region goes offline?
Run simulated tests.
Run load tests.
Just as with backups, if high availability is not routinely tested, your system is not highly available. Run scenarios against your system and verify the expected result. What is the expected result when a single server fails?
When you rely on a cloud platform like AWS, Google Cloud Platform, or Azure, what happens when one of their zones or regions goes offline? These scenarios should be tested. You can run simulated tests by removing servers, detaching storage, or changing network policies to see how the systems recover. You should also see what happens to your application under load.
High Availability in the Cloud
Scale servers horizontally.
Scale all parts of the application.
Add redundancy by adding a second zone or region.
An application that is hosted in the cloud should be scaled horizontally, not vertically, to achieve high availability. Every compute node that is added increases the overall availability, but with diminishing returns. Each system in the application should also be scaled to get the benefits of high availability.
Adding a second zone or region can be more beneficial to uptime than horizontal scaling once horizontal scaling reaches the point of diminishing returns. A second region also benefits you if the provider performs maintenance, if an outage occurs, or if someone deletes one of the application clusters.
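The diminishing returns are easy to see with the standard formula for redundant components, assuming node failures are independent:

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of N redundant nodes, assuming independent failures.

    The system is down only when every node is down at once:
    availability = 1 - (1 - a) ** n
    """
    return 1 - (1 - node_availability) ** nodes

# Each extra 99%-available node adds two more nines, but the absolute
# gain shrinks with every node added:
for n in range(1, 5):
    print(f"{n} node(s): {combined_availability(0.99, n):.8f}")
```

Going from one node to two improves availability from 99% to 99.99%, but going from three to four only moves the seventh decimal place; this is why a second zone or region, which also covers correlated failures, eventually matters more than adding nodes.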
Disaster Recovery Plan
Write a disaster recovery plan.
Automate as much as you can.
Document the manual steps.
Regularly test the disaster recovery plan.
Just as important as high availability is disaster recovery. Disaster recovery time is what gives you the MTTR component of availability. Every application should have a disaster recovery plan. This plan should detail the systems, teams, and documentation that are needed to recover from a disaster. Usually, disaster recovery will involve a separate location in case of a regional disaster. Automate as much of the disaster recovery as you can; automation improves both the speed and the reliability of the recovery.
All steps should be documented, but especially the manual steps. The documentation should be clear and easy to follow. Regularly test the disaster recovery plan and hold a meeting to discuss issues or comments on the plan.
Disaster Recovery in the Cloud
Moving disaster recovery to the cloud benefits from on-demand systems.
You are not limited by physical presence.
Infrastructure as code (IaC) is highly beneficial.
Moving disaster recovery into the cloud has benefits over using only on-premises systems. The public cloud providers have on-demand systems. Hardware is readily available and does not have to be purchased ahead of time. There is no downtime waiting on someone to rack the hardware or provide connectivity to the hosts.
Another benefit of choosing a cloud provider is that you are not limited by physical presence. This allows the disaster recovery location to be in any region the provider offers.
IaC allows your infrastructure requirements to be stored and acted upon, rather than written down only in documentation. The infrastructure definition can be stored alongside the application code. Coupled with tools such as Terraform, IaC allows disaster recovery to be automated. Many of the cloud providers work with Terraform.