Caylent Catalysts™
Multi-Region Application Strategy
Embrace multi-region infrastructure for global reach, compliance, high availability, and disaster recovery.
Learn best practices for building resilient IT systems on AWS to ensure reliability, security, and high availability in the face of failures and disruptions.
In the real world, things fail. Because it’s not possible to entirely avoid failures, the focus for modern organizations is on resiliency — which AWS defines as the ability for a system to recover within the desired timeframe when stressed by load, attack, or component failure.
So, when an earthquake hits your data center, when an employee opens a ransomware email, or when your latest app goes viral, customers, employees, and partners can continue accessing what they need, when they need it, without missing a beat.
Simply put, organizations that build resiliency into IT operations can keep going when others can’t. In a world where everything fails, all the time, that can create an important competitive advantage. In this blog, we’ll explore key considerations and best practices for designing for resiliency on AWS.
In addition to infrastructure and service disruptions, other factors can also impact workload reliability, such as:
Operational excellence: The practices and processes the organization uses to run and improve systems over time. This includes practices like incident response, problem management, and change management.
Security: The controls and processes in place to protect systems, data, and infrastructure from threats like attacks, breaches, and accidental loss. This includes practices like identity and access management, data encryption, and vulnerability management.
Performance: The ability of systems to operate efficiently, effectively, and predictably under expected loads. This is influenced by infrastructure optimization, application performance, and network configurations.
Cost optimization: Balancing business requirements with cost efficiencies to avoid waste, minimize costs, and maximize the value of cloud investments over time. This includes practices like using cost-efficient instance types, rightsizing instances, and optimizing storage costs.
It makes sense to maximize your use of the resiliency features inherent in the AWS Shared Responsibility Model. Simply put, AWS is responsible for the “Security of the Cloud,” and your IT teams are responsible for “Security in the Cloud.”
Adopting a serverless computing model shifts more of the shared responsibility onto the AWS side, because AWS manages the underlying infrastructure and platform services.
For example, with a legacy monolithic application hosted on an Amazon EC2 instance, your IT team manages the operating system, security configurations, and compliance. However, transitioning to AWS Lambda or AWS Fargate shifts infrastructure security responsibilities to AWS, including securing the execution environment, managing servers, and ensuring high availability.
Resiliency touches three areas: infrastructure design, operational design, and application design. It is not an all-or-nothing proposition; you can build resiliency in each area at your own pace, strengthening it as you go. Start by assessing your current state, and then improve the areas where you’ll see the most benefit from your efforts.
There are several key aspects of infrastructure design that will determine how resilient that infrastructure can ultimately be. The most significant are:
Duplicating critical networking components helps avert network failures and maintain continuous connectivity. Multiple Availability Zones (AZs), multi-Region deployments, redundant connections, and AWS services such as Amazon CloudFront, Amazon API Gateway, AWS Lambda endpoints, and Elastic Load Balancing (ELB) are all integral components of networking redundancy within the AWS ecosystem.
The Amazon Route 53 DNS service can contribute to resiliency by applying various routing techniques for load balancing and active/passive failover to achieve high availability, fault tolerance, and efficient traffic distribution for applications and services.
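The active/passive failover idea can be sketched in a few lines of plain Python. This is a simplified model of the routing decision, not Route 53's API or its exact evaluation logic; the `Endpoint` type and `resolve` function are hypothetical names used only for illustration, and returning `None` when both endpoints are down is a policy choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str      # hypothetical label, e.g. a Region-specific load balancer
    healthy: bool  # outcome of the most recent health check


def resolve(primary: Endpoint, secondary: Endpoint) -> Optional[Endpoint]:
    """Active/passive failover: answer with the primary while it is healthy,
    fall back to the secondary otherwise, and return None if both are down."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    return None


if __name__ == "__main__":
    chosen = resolve(Endpoint("primary", False), Endpoint("secondary", True))
    print(chosen.name if chosen else "no healthy endpoint")
```

In a real deployment the health status would come from health checks managed by the DNS service rather than from a field set by hand.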
Many organizations rely on easy-to-use ClickOps interfaces when transitioning to the cloud. However, ClickOps quickly becomes unwieldy and impractical as operations scale. Infrastructure as code (IaC) refers to automating infrastructure provisioning and support using code pipelines instead of manual processes. IaC is a good fit for modern cloud-native architectures, with features that help promote resiliency, such as:
Amazon CloudWatch provides monitoring, logging, and alerting services to help you keep an eye on the health, performance, and security of your AWS resources and applications.
Security groups act as virtual firewalls for your resources and should follow the principle of least privilege to prevent unauthorized access. Network access control lists (NACLs) apply stateless firewall rules at the subnet level, which supports segmenting the network into multiple subnets. AWS Identity and Access Management (IAM) enables fine-grained access controls that apply least-privilege permissions to users, groups, and roles to minimize the impact of credential compromise. In addition, AWS provides security tools and services that automate security assessments, monitor compliance, and remediate security issues.
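To make least privilege concrete, here is a minimal sketch of an IAM policy document written as a Python dictionary, plus a small check that flags statements using a bare wildcard. The bucket name, the prefix, and the `overly_broad_statements` helper are assumptions made for illustration; the check is deliberately simple and is not a substitute for a real policy analysis tool.

```python
import json

# Hypothetical read-only policy scoped to one prefix of one bucket.
READ_REPORTS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/reports/*",
        }
    ],
}


def overly_broad_statements(policy: dict) -> list:
    """Return Allow statements whose Action or Resource is the bare wildcard "*"."""
    flagged = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be written as a single string or as a list.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged


if __name__ == "__main__":
    print(json.dumps(READ_REPORTS_POLICY, indent=2))
    print("broad statements:", len(overly_broad_statements(READ_REPORTS_POLICY)))
```

Scoping both the action and the resource limits what a compromised credential can reach, which is the point the paragraph above makes.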
You must design and implement operational processes, practices, and systems to adapt to changing conditions and mitigate risks to help ensure high availability, reliability, and performance during disruptions or failures. Several key operational design principles impact resiliency:
Designing applications with a cloud-native approach promotes resiliency, but can be complex to implement effectively. Crucial elements to consider when building cloud-native applications include:
Modern, cloud-native microservices architectures incorporate principles such as loose coupling, high cohesion, and autoscaling as fundamental design characteristics.
Loose coupling between individual services allows you to isolate failures, independently scale components, and enhance fault tolerance. High cohesion reduces the risk of unintended side effects and dependencies between components, making failures easier to identify and isolate. Autoscaling enables applications to dynamically adjust their resource allocation in response to traffic spikes or changes in demand.
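As a rough illustration of how autoscaling reacts to demand, the sketch below uses a simple proportional rule: scale the instance count by the ratio of the observed metric to its target, then clamp to configured bounds. This is an assumption chosen for clarity, not the exact algorithm used by any AWS service, and `desired_capacity` is a hypothetical helper.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Proportional scaling: if the metric is 1.5x the target, ask for 1.5x
    the capacity (rounded up), bounded by the minimum and maximum."""
    if target <= 0:
        raise ValueError("target must be positive")
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    raw = math.ceil(current * metric / target)
    return max(minimum, min(maximum, raw))


if __name__ == "__main__":
    # 4 instances at 90% average CPU against a 60% target.
    print(desired_capacity(current=4, metric=90.0, target=60.0, minimum=2, maximum=10))
```

The minimum keeps a baseline of capacity available during quiet periods, and the maximum caps cost during a spike.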
Code reviews are essential for maintaining code quality, consistency, and reliability. Thorough code reviews help teams identify vulnerabilities, bottlenecks, and reliability issues to improve the overall resilience of the application.
Event-driven architecture (EDA) is built from small, decoupled services that publish, consume, or route events. EDAs use asynchronous message passing to decouple components and enable scalable and resilient communication between services. Using message queues or event streams to buffer events and handle transient errors allows applications to absorb bursts of traffic, smooth out load spikes, and recover gracefully from failures without impacting end users.
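The buffering and retry behavior described above can be sketched with Python's standard `queue` module standing in for a managed message queue. The `TransientError` type, the `drain` function, and the dead-letter list are hypothetical names for this illustration; a production consumer would also add backoff between attempts.

```python
import queue


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def drain(events: queue.Queue, handler, max_attempts: int = 3):
    """Process every buffered event. Retry transient failures up to
    max_attempts, then park the event in a dead-letter list instead of
    dropping it."""
    processed, dead_letters = [], []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break  # the buffer is empty, so the burst has been absorbed
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(event))
                break
            except TransientError:
                if attempt == max_attempts:
                    dead_letters.append(event)
    return processed, dead_letters


if __name__ == "__main__":
    buffer = queue.Queue()
    for order_id in range(3):
        buffer.put(order_id)  # the producer is never blocked by the consumer
    print(drain(buffer, lambda e: f"handled {e}"))
```

Because the producer only writes to the buffer, a slow or briefly failing consumer delays processing rather than causing requests to fail.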
Idempotency refers to the ability of an operation to produce the same result when performed multiple times. Idempotency supports resilience because repeating an operation after a failure will not cause unintended side effects or data corruption.
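One common way to get this property is an idempotency key: the caller attaches a unique key to each logical request, and the service stores the result under that key so a retry returns the stored result instead of repeating the side effect. The sketch below assumes an in-memory store and a hypothetical `PaymentProcessor` class; a real service would keep the keys in durable storage.

```python
class PaymentProcessor:
    """Toy service whose charge() is safe to retry."""

    def __init__(self) -> None:
        self._results = {}  # idempotency key -> result of the first attempt
        self.balance = 0

    def charge(self, idempotency_key: str, amount: int) -> int:
        # A repeated key means this request was already applied:
        # return the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        self._results[idempotency_key] = self.balance
        return self.balance


if __name__ == "__main__":
    processor = PaymentProcessor()
    processor.charge("order-1", 50)
    processor.charge("order-1", 50)  # retry after a timeout, no double charge
    print(processor.balance)
```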
Observability practices such as logging, monitoring, and distributed tracing can help teams diagnose issues, troubleshoot failures, and optimize performance to enhance resilience.
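As a small example of the logging side of observability, the sketch below emits structured JSON log lines that carry a correlation ID, so entries from one request can be tied together across services. It uses only Python's standard `logging` module; the logger name `orders`, the `correlation_id` field, and the helper names are assumptions for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set through the `extra` argument of the logging call, if given.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()               # avoid duplicate handlers on re-run
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
logger = build_logger(buffer)
logger.info("order accepted", extra={"correlation_id": "req-123"})
print(buffer.getvalue().strip())
```

Structured fields like these are easier to filter and aggregate in a log platform than free-form text.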
Serverless computing services like AWS Lambda and AWS Fargate abstract away infrastructure responsibilities so developers can focus on writing code, not managing servers. Serverless architectures promote resiliency with automatic scalability, fault tolerance, and high availability.
Recovery targets such as Recovery Point Objective (RPO) and Recovery Time Objective (RTO) determine how much data you can afford to lose and how much downtime you can tolerate, and they should drive your backup scheduling. For each application, you will need to decide what its requirements are regarding sensitivity to data loss.
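A quick way to reason about these targets: with periodic backups, a failure just before the next backup loses up to one full backup interval of data, so the interval must not exceed the RPO, and the estimated time to restore must not exceed the RTO. The sketch below encodes that check; the function name and the example figures are assumptions for illustration only.

```python
from datetime import timedelta


def meets_targets(backup_interval: timedelta, estimated_recovery: timedelta,
                  rpo: timedelta, rto: timedelta) -> bool:
    """Worst-case data loss is one backup interval; worst-case downtime is
    the estimated recovery time. Both must fit within the targets."""
    return backup_interval <= rpo and estimated_recovery <= rto


if __name__ == "__main__":
    ok = meets_targets(
        backup_interval=timedelta(hours=1),
        estimated_recovery=timedelta(minutes=30),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=1),
    )
    print("targets met" if ok else "targets missed")
```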
Is your application ready to serve a global user base? Do you need to segregate your environment in a particular region due to compliance (e.g. GDPR)? Do you need a multi-region application to ensure high availability?
Embrace the power of a multi-region infrastructure to reduce latency, improve performance, and transform your application into a truly global powerhouse with Caylent.
Caylent helps you thrive in a software-defined world where technology is at the core of every business. We work with you to build, scale, and optimize sophisticated cloud solutions using deep subject matter expertise to deliver world-class outcomes through an agile co-delivery model. Get in touch to find out how we can help!
Brian is an AWS Community Hero and Alexa Champion, runs the Boston AWS User Group, and has ten US patents and a bunch of certifications. He's also part of the New Voices mentorship program, where Heroes teach traditionally underrepresented engineers how to give presentations. He is a private pilot and a rescue scuba diver, and he got his Master's in Cognitive Psychology working with bottlenosed dolphins.