High Availability & Disaster Recovery

March 4, 2022

Security

Disaster Recovery & High Availability

Learn how you can leverage services such as AWS Cloud WAN, AWS Direct Connect SiteLink and AWS Fault Injection Simulator to enable high availability, globally resilient architectures and disaster recovery.

High Availability

Let's talk about global availability. AWS approaches global availability and resilient architectures through a shared responsibility model, very similar to the shared responsibility model for security that applies when you operate your servers and instances within AWS. This shared availability model lets you pull the levers and move the needle wherever you need to in order to build a globally resilient architecture.

Let's break it down. AWS starts with a regional model. There are 26 regions, each made up of availability zones, and 84 availability zones in all. Most regions have at least three availability zones, and AWS is building new ones all the time. You can deploy your compute, your databases and other resources within a region, or even within an individual availability zone, and pull those levers to get the level of high availability you need within a single region.

What's really interesting is when you move into multi-region models, which give you a globally resilient architecture. If a region becomes unavailable because of a network partition or some other issue, you can say, hey, it's fine, I have an active failover in another region. That could be for your database, for your compute, or for all of the above. You can also run a globally available set of services: use tools like Amazon Aurora or DynamoDB global tables, or roll your own global availability with Amazon CloudFront, Amazon Route 53 health checks and other services, to build a system that tolerates a wide range of failures while also improving latency for your end customers.
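The failover idea above can be sketched as a small simulation. This is not the Route 53 API: `RegionEndpoint` and `pick_region` are hypothetical names, and the region names and latencies are made-up values chosen only to illustrate routing to the closest healthy region.

```python
from dataclasses import dataclass


@dataclass
class RegionEndpoint:
    name: str          # region label, e.g. "us-east-1" (illustrative)
    latency_ms: float  # latency observed from the client's location (made up)
    healthy: bool      # result of the most recent simulated health check


def pick_region(endpoints):
    """Return the healthy endpoint with the lowest latency, or None."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


endpoints = [
    RegionEndpoint("us-east-1", latency_ms=20.0, healthy=True),
    RegionEndpoint("eu-west-1", latency_ms=95.0, healthy=True),
]
primary = pick_region(endpoints)

# Simulate the closest region becoming unavailable.
endpoints[0].healthy = False
fallback = pick_region(endpoints)

print(primary.name, "->", fallback.name)
```

In a real deployment the health signal and the routing decision would come from managed services rather than application code; the sketch only shows the decision being made.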

At re:Invent this year, we learned about a lot of exciting new services for global availability and global resiliency. One of those is AWS Cloud WAN, a wide area network service that lets you expand and operate all of your global infrastructure. There's also AWS Direct Connect SiteLink, which lets you use AWS's own global network as the backbone for your whole internet business. We also saw a talk from Riot Games about expanding availability beyond the region and availability zone model. They described how they bring low-latency first-person shooter games to as many customers as possible by leveraging AWS Outposts and the regional edge caches that AWS operates for Amazon CloudFront. This helps you get a feel for:

How much investment do I need to get global availability?
Is latency important to my customer?
What are the access patterns of my application that I need to consider?
What are the failure modes of my application that I need to consider?

There's another great thing that AWS released at re:Invent: the AWS Fault Injection Simulator (AWS FIS) service. It lets you inject networking failures or compute failures into your workload. It's a form of chaos engineering, in that you accept that for your architecture to be resilient, you have to test what happens when things fail:

Test losing connectivity to a particular region
Test losing connectivity to a whole database
Test DNS failures
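The kind of test listed above can be sketched in plain Python. This does not use AWS FIS or any AWS API; `FakeRegion`, `RegionDown` and `resilient_read` are hypothetical names, and the "fault" is just a flag that makes a simulated region raise an error.

```python
class RegionDown(Exception):
    """Raised when a simulated region is unreachable."""


class FakeRegion:
    """Stand-in for a regional data store (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.fail = False  # flip to True to inject a failure

    def read(self, key):
        if self.fail:
            raise RegionDown(self.name)
        return f"{key}@{self.name}"


def resilient_read(key, regions):
    """Try each region in order and return the first successful read."""
    errors = []
    for region in regions:
        try:
            return region.read(key)
        except RegionDown as exc:
            errors.append(str(exc))
    raise RuntimeError(f"all regions failed: {errors}")


primary = FakeRegion("us-east-1")
secondary = FakeRegion("us-west-2")

normal = resilient_read("user:42", [primary, secondary])

# Inject the fault: the primary region loses connectivity.
primary.fail = True
during_fault = resilient_read("user:42", [primary, secondary])

print(normal, "|", during_fault)
```

The point of the experiment is the assertion you make afterwards: with the primary down, reads should still succeed from the secondary, and you want to find out in a test rather than in production if they do not.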

All in all, there's a great set of options on AWS for building a service that's globally available, globally resilient and able to handle many forms of failure.

Disaster Recovery

So let's talk about the spectrum of Disaster Recovery (DR) options available to you within AWS. Workloads have different characteristics, access patterns and failure modes that have to be considered when you're building out your architecture. You can build in a single availability zone and run your compute, your database, everything from there. That's pretty good for a developer setup, but it's not resilient to failure. To be resilient to failure, you have to build across multiple availability zones. This buys you redundant capacity and the ability to ride out intermittent blips, networking issues and other failures confined to one zone.
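To make the single-zone versus multi-zone trade-off concrete, here is a back-of-the-envelope calculation. It assumes zones fail independently (an optimistic simplification) and uses 99% per-zone availability purely as an illustrative number, not an AWS figure; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(per_zone, zones):
    """Availability when the service is up if at least one zone is up.

    Assumes zone failures are independent, which real outages may violate.
    """
    return 1 - (1 - per_zone) ** zones


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Illustrative per-zone availability of 99% (an assumption, not a published value).
for zones in (1, 2, 3):
    a = combined_availability(0.99, zones)
    print(f"{zones} zone(s): availability={a:.6f}, "
          f"downtime={downtime_minutes_per_year(a):.1f} min/year")
```

Under these assumptions each extra zone cuts expected downtime by roughly two orders of magnitude, which is why a single zone is fine for a developer setup but not for production.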

Now, if you need something that's critical infrastructure, the way to build it is multi-region. You can leverage services like DynamoDB global tables, Amazon CloudFront, and Amazon Aurora for multi-region tables and replicas. Choosing how far along this spectrum to go is a form of cost control that lets you build workload-specific patterns that match what your application and your end users need. If your application is keeping patients alive in hospitals, it needs to run 24/7; it can't go down. If your application is “Twitter for Pets”, it's okay if it goes down for a short time; the world isn't going to fall apart.

Think about your costs and your constraints, and about how you can architect your applications to be multi-region and multi-AZ and to deal with eventual consistency and event-driven frameworks.
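As one illustration of what dealing with eventual consistency can mean in a multi-region design, here is a simplified last-writer-wins merge, a common way for replicated stores to reconcile concurrent writes. It is a sketch under stated assumptions, not the behavior of any particular AWS service: `Versioned` and `merge` are hypothetical names, and the timestamps assume the regions' clocks are comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # write time in seconds; assumes comparable clocks


def merge(local, remote):
    """Last-writer-wins: keep the write with the later timestamp.

    Ties are broken by comparing values so every region picks the same winner.
    """
    return max(local, remote, key=lambda v: (v.timestamp, v.value))


# Two regions accept writes to the same key while replication is delayed.
us_copy = Versioned("shipped", timestamp=100.0)
eu_copy = Versioned("delivered", timestamp=105.0)

# Once replication catches up, both regions run the same merge and converge.
us_after = merge(us_copy, eu_copy)
eu_after = merge(eu_copy, us_copy)

print(us_after.value, eu_after.value)
```

The earlier write is silently discarded, so this approach only fits data where losing a concurrent update is acceptable; that is exactly the kind of constraint to weigh when deciding how far to take a multi-region design.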

Randall Hunt

Randall Hunt, Chief Technology Officer at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that Javascript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.
