In our Disaster Recovery & High Availability Twitter Space, Caylent’s Randall Hunt was joined by Jeff Barr, Vice President & Chief Evangelist at AWS, and Adrian Cockcroft, Former VP, Sustainability Architecture at Amazon. In addition to answering various questions from the audience, they discussed what disaster recovery and resilience mean in the context of cloud computing today, the risks often faced by organizations in a world with continually growing databases and software complexity, as well as some solutions to prepare for disasters and to automate recovery and damage control.
Below is a paraphrased summary of key points from this conversation. For a deeper dive with industry-specific analogs, please listen to the Twitter Space recording above.
What do high availability & disaster recovery mean?
To provide high availability is to give your customers access when they need it, at the performance level they require. Disasters can jeopardize your platform’s availability and test its resilience. Resilience is the ability of your infrastructure to recover quickly and resume operations after a disaster has occurred. It involves thinking about how fast your infrastructure and services can get back up and running, and implementing well-orchestrated plans or automated software that help you get there.
A more resilient system will have greater spare capacity to absorb failure, as well as self-healing and self-repair capabilities that help it recover more rapidly should the system fail.
When you run out of that capacity and you’re effectively out of control, that’s when disaster recovery comes in. Disaster recovery is about what you do when your business-critical system runs out of the capacity to absorb failure.
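To make the idea of spare capacity concrete, here is a minimal sketch that checks whether a fleet can still serve peak load after losing one whole zone. The function names, the load figures, and the assumption that instances are spread evenly across zones are all hypothetical and chosen only for illustration; they are not from the conversation.

```python
import math


def instances_needed(peak_load: int, per_instance_capacity: int) -> int:
    """Smallest number of instances that can serve the peak load."""
    return math.ceil(peak_load / per_instance_capacity)


def survives_zone_loss(instances_per_zone: int, zones: int,
                       peak_load: int, per_instance_capacity: int) -> bool:
    """True if the fleet still covers peak load after losing one whole zone.

    Assumes (hypothetically) that instances are spread evenly across zones.
    """
    remaining = instances_per_zone * (zones - 1)
    return remaining >= instances_needed(peak_load, per_instance_capacity)


# Hypothetical numbers: 9,000 requests/s at peak, 1,000 requests/s per instance.
exactly_sized = survives_zone_loss(3, 3, 9000, 1000)   # 6 instances left, 9 needed
with_headroom = survives_zone_loss(5, 3, 9000, 1000)   # 10 instances left, 9 needed
```

In this sketch, a fleet sized exactly for peak load has no capacity left to absorb a zone failure, while the over-provisioned fleet does. That gap is the point at which a failure stops being absorbed and disaster recovery takes over.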
Failing Over Without Falling Over
You want to fail over without falling over. Most CIOs and executives would say they have developed a backup and recovery plan, and if the institution in question were a bank, they would test it once a year as an audit requirement. But ask them, “Do you test it one application at a time, or do you pull the plug on the whole data center?” and most would report not taking a step that significant.
A lot of time and money can be invested in spare data centers or lots of extra capacity to duplicate things that could fail over. But if you don’t test that capacity, it’s basically wasted availability. You are just going through the motions, with a theatrical version of a disaster recovery plan rather than one that works in practice.
You shouldn’t have a backup plan; you should have a restore plan.
You don’t want the actual activation of the plan to be the first time you’ve gone through it. It is important to test it regularly, as if the disaster were real, so that when one does happen, you’ve thoroughly practiced and have the right reflexes.
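One way to picture the difference between a backup plan and a restore plan is a drill that does not stop at writing the backup, but restores it to a separate location and verifies the result. The sketch below uses plain file copies and a checksum as a stand-in for a real backup system; the file name, directory layout, and function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up `source`, restore it elsewhere, and verify the restored copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    restore_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: take the backup.
    backup = Path(shutil.copy2(source, backup_dir / source.name))
    # Step 2: restore from the backup (not from the source) to a new location.
    restored = Path(shutil.copy2(backup, restore_dir / source.name))
    # Step 3: the drill passes only if the restored data matches the original.
    return sha256_of(restored) == sha256_of(source)


def run_demo() -> bool:
    """Run the drill against a throwaway file in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "orders.db"  # hypothetical data file
        data.write_bytes(b"order-1\norder-2\n")
        return restore_drill(data, root / "backup", root / "restore")
```

Running something like `run_demo()` on a schedule, against real backups and a realistic restore target, is what turns a backup into a practiced restore.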
How does disaster recovery today differ from its state 15 years ago?
One of the primary differences is how much you need to conceptualize, build, maintain, and test yourself versus how much the platform, the environment, or the cloud does for you, with the confidence that it will uphold its part in terms of resilience, failover, or built-in redundancy.
In the past, software was written on the assumption that everything would remain 100% up, and there was minimal concern around resilience and disaster recovery. Now, availability and resilience are taken into consideration as part of the application architecture. During Adrian Cockcroft’s time at Netflix, the team started to see that everything can be susceptible to failure, so they began building resilience into systems and simulating failures to prove its effectiveness.
Amazon Route 53 Application Recovery Controller (ARC)
If the systems you use to manage your environment are part of that environment, then when it fails, everything goes down together. You want to have something outside that environment which is extremely resilient, and ARC runs across multiple regions. Even if more than one region is down, it’s still there, and you can access it from anywhere else in the world and use it to manage the state of your system as it fails over. When you’re using well-tested, purpose-built components like these, you can have strong confidence in your resilience.
Managing ownership for DR needs
Some organizations may be reluctant to admit that their systems are more fragile than their resilience needs allow. Resilience is a spectrum, and we must be technically honest with ourselves about where our systems truly lie.
This opens doors to reasoning and discussion around what an organization can do to secure their systems. Adrian recommends placing a resilience meeting on the calendar on a weekly or bi-weekly cadence, with stakeholders who would be responsible for recovery, should something happen.
A response plan can be developed to address both the human and the technological sides of disaster recovery. Building coordination among people is an investment that pays off immensely when a disaster occurs.
Justifying costs and investments
Justifying the costs of implementing disaster recovery measures requires wearing a business hat and determining opportunity costs. What would the cost of a failure be in terms of lost business, or of public perception of the company’s ability to provide the service and uptime it promises? How does damage to an organization’s reputation affect its churn?
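As a rough illustration of that business-hat reasoning, the sketch below compares the expected yearly cost of outages with and without a disaster recovery investment. Every figure is hypothetical, and the hourly cost is assumed to already include an estimate for reputational damage and churn, which in practice is the hardest number to pin down.

```python
def expected_annual_outage_cost(outages_per_year: int,
                                hours_per_outage: int,
                                cost_per_hour: int) -> int:
    """Expected yearly cost of downtime under a simple linear model."""
    return outages_per_year * hours_per_outage * cost_per_hour


def dr_net_benefit(cost_without_dr: int, cost_with_dr: int,
                   dr_annual_cost: int) -> int:
    """Avoided outage cost minus what the DR capability costs per year."""
    return (cost_without_dr - cost_with_dr) - dr_annual_cost


# Hypothetical inputs: 2 outages a year at 50,000 per hour of downtime.
# Without DR each outage lasts 8 hours; with DR, recovery takes 1 hour.
without_dr = expected_annual_outage_cost(2, 8, 50_000)
with_dr = expected_annual_outage_cost(2, 1, 50_000)
net = dr_net_benefit(without_dr, with_dr, dr_annual_cost=300_000)
```

With these made-up inputs, the investment avoids 700,000 in expected losses for 300,000 in yearly spend. The model is deliberately simple; the value is in forcing the outage frequency, duration, and hourly cost assumptions into the open so stakeholders can challenge them.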
Resilience Roulette
At Amazon, there is a weekly meeting for operations teams representing every service. A roulette wheel is spun at the table and people are picked at random to report the state of their service. Any outages across the services are discussed in the meetings to spread information across the other teams. This form of institutionalized learning and horizontal spread of information is imperative to significantly improve resilience.
Multi-cloud as a strategy to reduce cloud failover risk
While such a strategy might seem favorable at first, it can be very complicated in practice, without many advantages. You have to think about how you are replicating data from one cloud to another. To make your systems on multiple clouds look the same, you have to add a layer of software that abstracts them, and you’ll find that this layer is actually where most of the failures are going to happen.
You want the underlying systems to be truly symmetrical. One of the nice things about a cloud that makes disaster recovery and failover much easier than in the data center is that you can be very sure that your systems across two zones in different regions are going to be identical.
If you are looking to learn how you can set up mechanisms for disaster recovery and make your AWS infrastructure truly resilient, read our blog on disaster recovery on AWS. If you’re ready to work with us to pursue your high availability and disaster recovery goals, explore our Disaster Recovery Assessment Caylent Catalyst, a two-week engagement that allows your team to take a deep dive into your architectural requirements, including business architecture and nonfunctional requirements (NFRs) such as regional availability, RTO/RPO, latency, regulatory compliance requirements, and more.