High Availability & Disaster Recovery

High Availability 

Let’s talk about global availability. AWS approaches global availability and resilient architectures through a shared responsibility model, very similar to the shared responsibility model for security that applies when you operate your servers and instances within AWS. This shared availability model lets you pull the levers and move the needle wherever you need to in order to build a globally resilient architecture.

Let’s break it down. AWS starts with a regional model: at the time of writing there are 26 regions, and each region is made up of availability zones. There are 84 availability zones in all, most regions have at least three, and AWS is building new ones all the time. You can deploy your servers, compute, databases and other resources within a region, or even within an individual availability zone, and you can pull the levers and move the needle where you need to in order to get high availability within a single region.

What’s really interesting is when you move into multi-region models, which give you a globally resilient architecture. If a region were to become unavailable because of a network partition or some other issue, you can say, hey, it’s fine, I have an active failover in another region. This could be for your database, for your compute, or for all of the above. The other interesting component is that you can run a globally available set of services: you can use tools like Amazon Aurora or DynamoDB global tables, or even roll your own global availability with Amazon CloudFront, Amazon Route 53 health checks and other services, to build a system that is resilient to many kinds of failure while also improving latency for your end customers.
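To make the failover idea concrete, here is a minimal sketch of the decision that health-checked, latency-aware routing makes: send traffic to the closest region that is still healthy. It is a plain Python model, not the Route 53 API and not how AWS implements it. The `RegionEndpoint` class, the `choose_endpoint` function and the latency figures are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionEndpoint:
    """One regional deployment of the application (hypothetical model)."""
    region: str
    healthy: bool
    latency_ms: float  # latency as seen from the client's location (made-up values)


def choose_endpoint(endpoints):
    """Return the healthy endpoint with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


endpoints = [
    RegionEndpoint("us-east-1", healthy=True, latency_ms=20.0),
    RegionEndpoint("eu-west-1", healthy=True, latency_ms=95.0),
]

# Normal operation: the closest healthy region serves the request.
print(choose_endpoint(endpoints).region)

# Simulate a regional outage: the health check for us-east-1 starts failing,
# so traffic moves to the remaining healthy region.
endpoints[0].healthy = False
print(choose_endpoint(endpoints).region)
```

The same selection rule covers both goals from the paragraph above: in normal operation it improves latency, and during an outage it acts as the failover.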

At re:Invent this year, we learned about a lot of exciting new services for global availability and global resiliency. One of those is AWS Cloud WAN, a wide area network service that lets you build and operate your global network infrastructure. There’s also AWS Direct Connect SiteLink, which lets you use AWS’s own global infrastructure as the backbone for your whole business network. We also saw a talk from Riot Games about expanding availability beyond the regional and availability zone model. They talked about how they bring low-latency first-person shooter games to as many customers as possible by leveraging AWS Outposts and the regional edge caches (Amazon CloudFront) that AWS operates. These examples help you get a feel for the questions to ask:

  • How much investment do I need to get global availability?
  • Is latency important to my customer?
  • What are the access patterns of my application that I need to consider?
  • What are the failure modes of my application that I need to consider?

There’s another great thing that AWS released at re:Invent: the AWS Fault Injection Simulator (AWS FIS) service. It gives you a way to inject networking failures or compute failures into your workload. It’s a form of chaos engineering, in that you accept that, for your architecture to be resilient, you have to test what happens when things fail:

  • Test losing connectivity to a particular region  
  • Test losing connectivity to a whole database  
  • Test DNS failures. 
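The tests listed above can be rehearsed in miniature before you run them against real infrastructure. The sketch below is not AWS FIS and does not call any AWS API. It is a small, self-contained Python simulation in which a failure is injected into one region and the code checks that reads fall back to another. The names `make_client`, `read_with_failover` and `RegionDown`, and the region choices, are hypothetical.

```python
class RegionDown(Exception):
    """Raised when a (simulated) region cannot be reached."""


def make_client(region, broken_regions):
    """Return a hypothetical read function that fails while its region is 'broken'."""
    def read(key):
        if region in broken_regions:
            raise RegionDown(region)
        return f"{key}@{region}"
    return read


def read_with_failover(key, clients):
    """Try each regional client in order and return the first successful read."""
    for read in clients:
        try:
            return read(key)
        except RegionDown:
            continue  # this region is unavailable, try the next one
    raise RuntimeError("all regions failed")


# The experiment: inject a failure into the primary region,
# then confirm that the read is served by the secondary region.
broken = {"us-east-1"}
clients = [make_client("us-east-1", broken), make_client("us-west-2", broken)]
result = read_with_failover("user-42", clients)
print(result)
```

The point of the exercise is the assertion you would make on `result`: if the fallback path had a bug, you would find out here instead of during a real outage.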

All in all, there’s a great set of options for building a globally available service on AWS, one that is globally resilient and able to handle many forms of failure.

Disaster Recovery

So let’s talk about the spectrum of Disaster Recovery (DR) options that are available to you within AWS. Workloads have different characteristics, different access patterns and different failure modes that have to be considered when you’re building out your architecture. You can build in a single availability zone and run your compute, your database, everything from there. That’s pretty good for a developer setup, but it’s not resilient to failure. To be resilient to failure, you have to build across multiple availability zones, which buys you redundant capacity and the ability to ride out intermittent blips, networking issues and other kinds of failures.
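A rough way to see why multiple availability zones help is the standard arithmetic for redundant components. The sketch below assumes that each zone fails independently and that the workload stays up as long as at least one zone is up. Both are simplifying assumptions, and the 99% per-zone figure is purely illustrative, not an AWS SLA.

```python
def combined_availability(single, copies):
    """Availability of `copies` independent replicas when one survivor is enough.

    Assumes failures are independent, which real zones only approximate.
    """
    return 1 - (1 - single) ** copies


# Illustrative per-zone availability of 99% (an assumption, not an AWS number).
for zones in (1, 2, 3):
    print(zones, f"{combined_availability(0.99, zones):.6f}")
```

Under these assumptions, two zones take you from 99% to roughly 99.99%, and three zones to roughly 99.9999%. Correlated failures make the real numbers less rosy, which is part of why the most critical workloads go multi-region.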

Now, if you’re building vital, critical infrastructure, the way to build it is across regions. You can leverage services like DynamoDB global tables for multi-region tables, Amazon Aurora for cross-region replicas, and Amazon CloudFront. Having this range of options gives you cost control: you can build workload-specific patterns that fit whatever your application and your end users need. If your application is keeping patients alive in hospitals, it needs to run 24/7 and it can’t go down. If your application is “Twitter for Pets”, it’s okay if it goes down for a short time; the world isn’t going to fall apart.
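One way to put numbers on that contrast is to turn an availability target into a yearly downtime budget. The targets below are assumptions chosen for illustration (the article does not give any), and `downtime_minutes_per_year` is a hypothetical helper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year allowed by an availability target (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR


# Illustrative targets only; pick real ones from your own requirements.
targets = {"Twitter for Pets": 0.99, "hospital system": 0.9999}

for name, target in targets.items():
    print(f"{name}: {downtime_minutes_per_year(target):.1f} minutes per year")
```

A 99% target leaves about 87.6 hours of downtime a year, while 99.99% leaves under an hour. The tighter the budget, the more redundancy you have to pay for.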

Think about your costs, think about the constraints that you have, and think about how you can architect your applications to be multi-region and multi-AZ and to deal with eventual consistency and event-driven frameworks.
