High Availability & Disaster Recovery


Learn how you can leverage services such as AWS Cloud WAN, AWS Direct Connect SiteLink and AWS Fault Injection Simulator to enable high availability, globally resilient architectures and disaster recovery.

High Availability 

Let's talk about global availability. AWS approaches global availability and resilient architectures through a shared responsibility model, very similar to the shared responsibility model for security that you have when you're operating your servers and your instances within AWS. This shared availability model allows you to pull the levers and move the needle wherever you need to in order to build a globally resilient architecture.

Let's break it down. AWS starts out with a regional model. This regional model is made up of availability zones, and there are 26 regions. There are 84 availability zones in all, most regions have at least three availability zones, and AWS is building new ones all the time. You can deploy your servers, your compute, your databases and other resources within a region, and even within an individual availability zone, and you can pull the levers to get the high availability you need within a single region.
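To make the single-region levers a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that availability zones fail independently and that a deployment in one zone is up 99% of the time. Both are illustrative assumptions, not AWS figures, and the function names are made up for this example.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes failures are independent, which real zones only approximate.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The workload is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Hypothetical 99% per-zone availability, spread over 1 to 3 zones.
for zones in (1, 2, 3):
    a = parallel_availability(0.99, zones)
    minutes = downtime_minutes_per_year(a)
    print(f"{zones} zone(s): {a:.6f} available, about {minutes:.1f} minutes down per year")
```

The takeaway is the shape of the curve rather than the exact numbers: each additional independent zone multiplies the chance of a total outage by the single-zone failure probability.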

What's really interesting is when you move into multi-region models, which give you a globally resilient architecture. So if a region were to become unavailable because of a network partition or some other issue, you can say, hey, it's fine, I have an active failover in another region. This could be for your database, it could be for your compute, or it could be for all of the above. The other interesting component is that you can run a globally available set of services, meaning you can use tools like Aurora or DynamoDB global tables, or even roll your own kind of global availability by leveraging Amazon CloudFront, Amazon Route 53 health checks and other services, to build out a system that is resilient to failure while also improving latency for your end customers.
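Here is a minimal sketch of the "roll your own" idea: keep a list of regional endpoints, drop the ones that fail a health check, and send traffic to the lowest-latency survivor. This is plain Python standing in for what a managed service would do for you. It does not call Route 53 or any AWS API, and the endpoint URLs, latency numbers and the pick_endpoint helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RegionEndpoint:
    name: str          # region label, e.g. "us-east-1"
    url: str           # hypothetical endpoint for this example
    latency_ms: float  # assumed measured latency from the client


def pick_endpoint(
    endpoints: Sequence[RegionEndpoint],
    is_healthy: Callable[[RegionEndpoint], bool],
) -> Optional[RegionEndpoint]:
    """Return the lowest-latency healthy endpoint, or None if all are down."""
    healthy = [e for e in endpoints if is_healthy(e)]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


endpoints = [
    RegionEndpoint("us-east-1", "https://us-east-1.api.example.com", 20.0),
    RegionEndpoint("eu-west-1", "https://eu-west-1.api.example.com", 95.0),
]

# Pretend the health check for us-east-1 is failing.
down = {"us-east-1"}
chosen = pick_endpoint(endpoints, lambda e: e.name not in down)
print(chosen.name if chosen else "no healthy region")
```

In a real system the health check would be an actual probe and the routing decision would usually live in DNS or at the edge, but the decision logic has the same shape.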

At re:Invent this year, we learned about a lot of new services that were very exciting in terms of global availability and global resiliency. One of those is AWS Cloud WAN, a wide area network service that allows you to expand and operate all of your global infrastructure. There's also AWS Direct Connect SiteLink, which lets you use AWS's own global infrastructure as the backbone for your whole internet business. We also saw a talk from Riot Games about expanding availability beyond just that regional and availability zone model. They talked about how they bring low-latency first-person shooter games to as many customers as possible by leveraging AWS Outposts and the regional edge caches (Amazon CloudFront) that AWS operates. This lets you get a feel for:

  • How much investment do I need to get global availability?
  • Is latency important to my customer? 
  • What are the access patterns of my application that I need to consider?
  • What are the failure modes of my application that I need to consider?

There's another great thing that AWS released at re:Invent called AWS Fault Injection Simulator (AWS FIS). This is a service for injecting networking failures or compute failures into your workload. It's a form of chaos engineering, in that you're accepting that in order for your architecture to be resilient, you have to test what happens when things fail:

  • Test losing connectivity to a particular region  
  • Test losing connectivity to a whole database  
  • Test DNS failures. 
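The idea behind those tests can be sketched locally in a few lines of Python. This is not the AWS FIS API. It is a toy stand-in, with made-up names like make_flaky and read_with_failover, that injects failures into a call to a "primary region" and checks that the caller falls back to a secondary.

```python
import random
from typing import Callable


class RegionUnavailable(Exception):
    """Raised when the injected fault makes a region unreachable."""


def make_flaky(
    func: Callable[[str], str], failure_rate: float, rng: random.Random
) -> Callable[[str], str]:
    """Wrap `func` so it fails with probability `failure_rate`."""
    def wrapper(key: str) -> str:
        if rng.random() < failure_rate:
            raise RegionUnavailable("injected fault")
        return func(key)
    return wrapper


def read_with_failover(
    primary: Callable[[str], str], secondary: Callable[[str], str], key: str
) -> str:
    """Read from the primary region, falling back to the secondary on failure."""
    try:
        return primary(key)
    except RegionUnavailable:
        return secondary(key)


def read_primary(key: str) -> str:
    return f"primary:{key}"


def read_secondary(key: str) -> str:
    return f"secondary:{key}"


# Inject a total outage of the primary and confirm reads still succeed.
rng = random.Random(0)
broken_primary = make_flaky(read_primary, failure_rate=1.0, rng=rng)
print(read_with_failover(broken_primary, read_secondary, "user-1"))
```

The same pattern scales up: pick one failure mode from the list above, inject it on purpose, and assert that the behavior your customers see is still acceptable.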

All in all, there's a great set of options on AWS for building a globally available service, one that's globally resilient and able to handle failure.

Disaster Recovery

So let's talk about the spectrum of Disaster Recovery (DR) options that are available to you within AWS. Workloads have different characteristics, different access patterns and different failure modes that have to be considered when you're building out your architecture. You can build in a single availability zone and run your compute, your database, everything from there. That's pretty good for a developer setup, but it's not resilient to failure. In order to be resilient to failure, you have to build in multiple availability zones. This will buy you a lot of extra capacity and the ability to deal with intermittent blips, networking issues and other kinds of failures.

Now, if you need something that's critical infrastructure, the way to build it is across regions. You can leverage services like DynamoDB global tables, Amazon CloudFront, and Amazon Aurora for multi-region tables and replicas. This kind of cost control allows you to build workload-specific patterns that work with whatever your application needs and whatever your end users need. If your application is keeping patients alive in hospitals, it needs to be running 24/7 and it can't go down. If your application is “Twitter for Pets”, it's okay if it goes down for a short time. The world isn't going to fall apart if “Twitter for Pets” goes down.

Think about your costs, think about the constraints that you have, and think about the ways you can architect your applications to be multi-region and multi-AZ and to deal with eventual consistency and event-driven frameworks.
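To give a feel for what "dealing with eventual consistency" can mean in code, here is a small Python sketch of last-writer-wins reconciliation between two regional replicas. It illustrates the general technique only and is not a description of how any particular AWS service is implemented. The Versioned type, the converge helper and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumed write time; real clocks can drift
    region: str       # region that accepted the write


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Keep the newer write; break timestamp ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))


def converge(
    replica_a: Dict[str, Versioned], replica_b: Dict[str, Versioned]
) -> Dict[str, Versioned]:
    """Merge two replicas so both sides end up with the same state."""
    merged = dict(replica_a)
    for key, incoming in replica_b.items():
        current = merged.get(key)
        merged[key] = incoming if current is None else merge_lww(current, incoming)
    return merged


us = {"pet:42": Versioned("Rex", 100.0, "us-east-1")}
eu = {
    "pet:42": Versioned("Rexy", 105.0, "eu-west-1"),
    "pet:7": Versioned("Tom", 90.0, "eu-west-1"),
}

# Merging in either direction gives the same result, so the replicas converge.
assert converge(us, eu) == converge(eu, us)
print(converge(us, eu)["pet:42"].value)
```

The trade-off to keep in mind is that last-writer-wins silently discards the older of two concurrent writes, which is fine for some workloads and unacceptable for others.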

Randall Hunt

Randall Hunt, VP of Cloud Strategy and Innovation at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that JavaScript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.
