Designing for Resiliency on AWS

Learn best practices for building resilient IT systems on AWS to ensure reliability, security, and high availability in the face of failures and disruptions.

In the real world, things fail. Because it’s not possible to entirely avoid failures, the focus for modern organizations is on resiliency, which AWS defines as the ability of a system to recover within the desired timeframe when stressed by load, attack, or component failure.

So, when an earthquake hits your data center, when an employee opens a ransomware email, or when your latest app goes viral, customers, employees, and partners can continue accessing what they need, when they need it, without missing a beat. 

Simply put, organizations that build resiliency into IT operations can keep going when others can’t. In a world where everything fails, all the time, that can create an important competitive advantage. In this blog post, we’ll explore key considerations and best practices for designing for resiliency on AWS.

What impacts workload reliability?

In addition to infrastructure and service disruptions, other factors can also impact workload reliability, such as:

Operational excellence: The practices and processes the organization uses to run and improve systems over time. This includes practices like incident response, problem management, and change management.

Security: The controls and processes in place to protect systems, data, and infrastructure from threats like attacks, breaches, and accidental loss. This includes practices like identity and access management, data encryption, and vulnerability management.

Performance: The ability of systems to operate efficiently, effectively, and predictably under expected loads. This is influenced by infrastructure optimization, application performance, and network configurations.

Cost optimization: Balancing business requirements with cost efficiencies to avoid waste, minimize costs, and maximize the value of cloud investments over time. This includes practices like using cost-efficient instance types, rightsizing instances, and optimizing storage costs.

Resiliency in AWS 

It makes sense to maximize your use of the resiliency features inherent in the AWS Shared Responsibility Model. Simply put, AWS is responsible for the “Security of the Cloud,” and your IT teams are responsible for “Security in the Cloud.”

Adopting a serverless computing model shifts more of the shared responsibility onto the AWS side, because AWS is responsible for managing the underlying infrastructure and platform services.

For example, with a legacy monolithic application hosted on an Amazon EC2 instance, your IT team manages the operating system, security configurations, and compliance. However, transitioning to AWS Lambda or AWS Fargate shifts infrastructure security responsibilities to AWS, including securing the execution environment, managing servers, and ensuring high availability.

Best Practices for Designing for Resiliency

Resiliency touches three areas: infrastructure design, operational design, and application design. It’s important to keep in mind that resiliency is not an all-or-nothing proposition; you can work to build resiliency in each area at your own pace, strengthening your resiliency as you go along. Start by assessing your resiliency, and then take steps to improve resiliency where you’ll see the most benefits from your efforts.

Infrastructure Design

There are several key aspects of infrastructure design that will determine how resilient that infrastructure can ultimately be. The most significant are:

Networking redundancy

Duplicating critical networking components helps avert network failures and maintain continuous connectivity. Multiple Availability Zones (AZs), multi-Region deployments, redundant connections, and AWS services such as Amazon CloudFront, Amazon API Gateway, AWS Lambda endpoints, and Elastic Load Balancing (ELB) are all integral components of networking redundancy within the AWS ecosystem.

Routing

The Amazon Route 53 DNS service can contribute to resiliency by applying various routing techniques for load balancing and active/passive failover to achieve high availability, fault tolerance, and efficient traffic distribution for applications and services.
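As a rough illustration of the active/passive failover idea described above, the following Python sketch picks which endpoint a DNS answer would point at, given the latest health-check results. It is a simplified model, not Route 53's actual implementation: the `Endpoint` type, the `resolve` function, and the Region names are hypothetical, and the sketch simply raises an error when nothing is healthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str      # hypothetical label, e.g. a Region name
    role: str      # "primary" or "secondary"
    healthy: bool  # result of the most recent health check


def resolve(endpoints: list[Endpoint]) -> str:
    """Return the name of the endpoint that should receive traffic."""
    # Prefer a healthy primary; fall back to a healthy secondary.
    for role in ("primary", "secondary"):
        for endpoint in endpoints:
            if endpoint.role == role and endpoint.healthy:
                return endpoint.name
    raise RuntimeError("no healthy endpoint available")


endpoints = [
    Endpoint("us-east-1", "primary", healthy=False),
    Endpoint("us-west-2", "secondary", healthy=True),
]
print(resolve(endpoints))  # the secondary takes over while the primary is unhealthy
```

Once the primary passes its health checks again, the same function routes traffic back to it without any manual step.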

Infrastructure as code

Many organizations rely on easy-to-use ClickOps interfaces when transitioning to the cloud. However, ClickOps quickly becomes unwieldy and impractical as operations scale. Infrastructure as code (IaC) refers to automating infrastructure provisioning and support using code pipelines instead of manual processes. IaC is a good fit for modern cloud-native architectures, with features that help promote resiliency, such as:

  • Consistency: Apply configurations uniformly across environments.
  • Automation: Reduce manual intervention and the chances of human error.
  • Scalability: Adjust automatically to changing workload demands.
  • Control: Improve reliability and resilience via version control, change management, and testing.
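To make the consistency and control points more concrete, here is a minimal Python sketch of the idea behind declarative IaC: the desired infrastructure is described as data, compared with what currently exists, and the differences are reported before anything is changed. The `plan` function and the resource names are hypothetical and are not taken from any specific IaC tool.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired and actual resources and report what would change."""
    shared = desired.keys() & actual.keys()
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(name for name in shared if desired[name] != actual[name]),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical resource definitions, as they might be kept in version control.
desired = {
    "web-sg": {"ingress_port": 443},
    "app-bucket": {"versioning": True},
}
# Hypothetical current state of the environment.
actual = {
    "web-sg": {"ingress_port": 80},
    "old-queue": {"fifo": False},
}

print(plan(desired, actual))
```

Because the desired state lives in code, the same comparison can run for every environment, and a reviewer can see the planned changes before they are applied.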

Monitoring, logging, and alerting

Amazon CloudWatch provides monitoring, logging, and alerting services to help you keep an eye on the health, performance, and security of your AWS resources and applications.

Security

AWS security groups act as stateful virtual firewalls for individual resources such as Amazon EC2 instances; configuring them according to the principle of least privilege helps prevent unauthorized access. Network access control lists (NACLs) apply stateless firewall rules at the subnet level, adding a second layer of control when you segment the network into multiple subnets. AWS Identity and Access Management (IAM) enables fine-grained access controls that apply least-privilege permissions to users, groups, and roles to minimize the impact of credential compromise. In addition, AWS provides security tools and services that automate security assessments, monitor compliance, and remediate security issues.

Operational Design

You must design and implement operational processes, practices, and systems to adapt to changing conditions and mitigate risks to help ensure high availability, reliability, and performance during disruptions or failures. Several key operational design principles impact resiliency:

  • Backups: Guard against data loss and corruption with regular backups.
  • Backup testing: Verify the backup and test the restore process.
  • Backup redundancy: Ship backups to other regions or accounts to protect against regional outages or account-level failures.
  • Backup frequency and rotation: Optimize your recovery capabilities while managing storage costs by applying backup granularity and retention policies.
  • Playbooks or runbooks: Help teams carry out predefined procedures quickly and consistently when an incident occurs.
  • Standby environments: Choose a hot, warm, or pilot light standby environment as part of your disaster recovery strategy, based on cost and the required speed of recovery.
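As one possible illustration of the backup frequency and rotation principle in the list above, the Python sketch below applies a simple retention policy: keep every backup from the last few days, plus the newest backup from each recent week. The function name, parameters, and default values are assumptions chosen for the example, not recommendations.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates: list[date], today: date,
                    daily_days: int = 7, weekly_weeks: int = 4) -> set[date]:
    """Keep every backup from the last `daily_days` days, plus the newest
    backup in each ISO week that falls within the last `weekly_weeks` weeks."""
    keep: set[date] = set()
    newest_per_week: dict[tuple[int, int], date] = {}
    for backup in backup_dates:
        age = (today - backup).days
        if age < 0:
            continue  # ignore dates in the future
        if age < daily_days:
            keep.add(backup)
        if age < weekly_weeks * 7:
            week = tuple(backup.isocalendar()[:2])  # (ISO year, ISO week)
            if week not in newest_per_week or backup > newest_per_week[week]:
                newest_per_week[week] = backup
    keep.update(newest_per_week.values())
    return keep


# Sixty consecutive daily backups ending on a hypothetical "today".
today = date(2024, 3, 31)
history = [today - timedelta(days=n) for n in range(60)]
kept = backups_to_keep(history, today)
print(f"keeping {len(kept)} of {len(history)} backups")
```

Anything not returned by the function is a candidate for deletion, which is where the storage savings come from.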

Application Design

Designing applications with a cloud-native approach promotes resiliency, but can be complex to implement effectively. Crucial elements to consider when building cloud-native applications include:

Modern, cloud-native microservices architectures

Modern, cloud-native microservices architectures incorporate principles such as loose coupling, high cohesion, and autoscaling as fundamental design characteristics.

Loose coupling between individual services allows you to isolate failures, independently scale components, and enhance fault tolerance. High cohesion reduces the risk of unintended side-effects and dependencies between components, making identifying and isolating failures easier. Autoscaling enables applications to dynamically adjust their resource allocation in response to traffic spikes or changes in demand.

Code reviews

Code reviews are essential for maintaining code quality, consistency, and reliability. Thorough code reviews help teams identify vulnerabilities, bottlenecks, and reliability issues to improve the overall resilience of the application.

Event-driven architecture (EDA)

Event-driven architecture (EDA) is built from small, decoupled services that publish, consume, or route events. EDAs use asynchronous message passing to decouple components and enable scalable and resilient communication between services. Message queues or event streams for buffering and handling transient errors allow applications to absorb bursts of traffic, smooth out load spikes, and recover gracefully from failures without impacting end users.
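The buffering and transient-error handling described above can be sketched in a few lines of Python. This is an in-memory stand-in for a real message queue, under the assumption that each message carries an attempt counter; `TransientError`, `drain`, and the `max_attempts` default are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def drain(queue: deque, handler: Callable[[str], None], max_attempts: int = 3):
    """Process buffered (body, attempts) messages, re-queueing transient failures.

    Returns two lists of message bodies: (processed, dead_lettered).
    """
    processed, dead_lettered = [], []
    while queue:
        body, attempts = queue.popleft()
        try:
            handler(body)
            processed.append(body)
        except TransientError:
            if attempts + 1 < max_attempts:
                queue.append((body, attempts + 1))  # retry later
            else:
                dead_lettered.append(body)  # give up after max_attempts tries
    return processed, dead_lettered
```

The producer only has to append to the queue, so a burst of traffic or a temporarily failing consumer does not block it. Messages that keep failing are set aside in a dead-letter list for later inspection instead of being lost.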

Idempotency

Idempotency refers to the ability of an operation to produce the same result when performed multiple times. Idempotency supports resilience because repeating an operation after a failure will not cause unintended side effects or data corruption.
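One common way to achieve this is an idempotency key: the caller attaches a unique key to each logical request, and the service stores the result under that key so that a retry returns the stored result instead of applying the change again. The Python sketch below assumes an in-memory store, and the `PaymentService` class is hypothetical.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: dict[str, int] = {}  # idempotency key -> stored result

    def deposit(self, idempotency_key: str, amount: int) -> int:
        # A repeated key returns the stored result instead of re-applying.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        self._results[idempotency_key] = self.balance
        return self.balance


service = PaymentService()
first = service.deposit("request-1", 50)
retry = service.deposit("request-1", 50)  # e.g. the client timed out and retried
print(first, retry, service.balance)
```

In a real system the key store would need to be durable and shared across instances, which this sketch does not attempt.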

Observability

Observability practices such as logging, monitoring, and distributed tracing can help teams diagnose issues, troubleshoot failures, and optimize performance to enhance resilience.

Serverless computing services

Serverless computing services like AWS Lambda and AWS Fargate abstract away infrastructure responsibilities so developers can focus on writing code, not managing servers. Serverless architectures promote resiliency with automatic scalability, fault tolerance, and high availability.

Backup scheduling

Backup scheduling should be driven by two targets: your Recovery Point Objective (RPO), which is the maximum amount of data you can afford to lose, and your Recovery Time Objective (RTO), which is the maximum downtime you can tolerate. You will need to decide for each application how sensitive it is to data loss and downtime.
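A quick way to sanity-check a backup schedule against these targets is simple arithmetic. The Python sketch below assumes a simplified model in which the worst-case data loss equals the interval between backups and the downtime equals detection time plus restore time; the function names and the sample figures are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case: a failure just before the next backup loses one full interval.
    return backup_interval <= rpo


def meets_rto(detection: timedelta, restore: timedelta, rto: timedelta) -> bool:
    # Downtime is modeled as time to notice the failure plus time to restore.
    return detection + restore <= rto


# Hypothetical targets for one application.
rpo = timedelta(hours=1)
rto = timedelta(hours=4)

print(meets_rpo(timedelta(hours=24), rpo))    # nightly backups vs. a 1-hour RPO
print(meets_rpo(timedelta(minutes=15), rpo))  # 15-minute backups vs. a 1-hour RPO
print(meets_rto(timedelta(minutes=30), timedelta(hours=2), rto))
```

Running the same check per application makes it easier to see where a schedule is too sparse for the stated targets and where it is more frequent than needed.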

The Caylent approach to resiliency on AWS

Is your application ready to serve a global user base? Do you need to segregate your environment in a particular region due to compliance (e.g. GDPR)? Do you need a multi-region application to ensure high availability?

Embrace the power of a multi-region infrastructure to reduce latency, improve performance, and transform your application into a truly global powerhouse with Caylent.

Caylent helps you thrive in a software-defined world where technology is at the core of every business. We work with you to build, scale, and optimize sophisticated cloud solutions using deep subject matter expertise to deliver world-class outcomes through an agile co-delivery model. Get in touch to find out how we can help!

Brian Tarbox

Brian is an AWS Community Hero and Alexa Champion, runs the Boston AWS User Group, and has ten US patents and a bunch of certifications. He's also part of the New Voices mentorship program, where Heroes teach traditionally underrepresented engineers how to give presentations. He is a private pilot and a rescue scuba diver, and he got his Masters in Cognitive Psychology working with bottlenose dolphins.
