For example, with a legacy monolithic application hosted on an Amazon EC2 instance, your IT team manages the operating system, security configurations, and compliance. However, transitioning to AWS Lambda or AWS Fargate shifts more of the infrastructure security responsibilities to AWS, including securing the execution environment, managing servers, and ensuring high availability.
Best Practices for Designing for Resiliency
Resiliency touches three areas: infrastructure design, operational design, and application design. Resiliency is not an all-or-nothing proposition; you can build it up in each area at your own pace. Start by assessing where you stand today, and then focus on the improvements that will deliver the most benefit for your effort.
Infrastructure Design
There are several key aspects of infrastructure design that will determine how resilient that infrastructure can ultimately be. The most significant are:
Networking redundancy
Duplicating critical networking components helps avert network failures and maintain continuous connectivity. Multiple Availability Zones (AZs), multi-Region deployments, redundant connections, and AWS services such as Amazon CloudFront, Amazon API Gateway, AWS Lambda endpoints, and Elastic Load Balancing (ELB) are all integral components of networking redundancy within the AWS ecosystem.
Routing
The Amazon Route 53 DNS service can contribute to resiliency by applying various routing techniques for load balancing and active/passive failover to achieve high availability, fault tolerance, and efficient traffic distribution for applications and services.
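As a rough sketch of active/passive failover, the Python below builds the kind of change batch that the Route 53 ChangeResourceRecordSets API accepts for a primary and a secondary record. The domain name, IP addresses, and health check ID are hypothetical placeholders, and the function only assembles the request body; it does not call AWS.

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a Route 53 change batch for an active/passive failover pair."""

    def record(role, ip, health_check=None):
        record_set = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check:
            # Route 53 serves the secondary record when this check fails.
            record_set["HealthCheckId"] = health_check
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Active/passive failover (illustrative)",
        "Changes": [
            record("PRIMARY", primary_ip, health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# All values below are placeholders chosen for illustration.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "hypothetical-health-check-id"
)
```

With boto3, a batch like this would typically be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with your hosted zone ID.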
Infrastructure as code
Many organizations rely on easy-to-use ClickOps interfaces when transitioning to the cloud. However, ClickOps quickly becomes unwieldy and impractical as operations scale. Infrastructure as code (IaC) refers to automating infrastructure provisioning and support using code pipelines instead of manual processes. IaC is a good fit for modern cloud-native architectures, with features that help promote resiliency, such as:
- Consistency: Apply configurations uniformly across environments.
- Automation: Reduce manual intervention and the chances of human error.
- Scalability: Adjust automatically to changing workload demands.
- Control: Improve reliability and resilience via version control, change management, and testing.
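To make the consistency point concrete, here is a minimal sketch that generates the same AWS CloudFormation template for two environments from one Python function. The resource name, tag key, and environment names are assumptions made for illustration; a real pipeline would more likely use CloudFormation, AWS CDK, or Terraform files kept under version control.

```python
import json


def bucket_template(environment):
    """Return a CloudFormation template for a versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Versioned backup bucket for {environment}",
        "Resources": {
            "BackupBucket": {  # hypothetical logical resource name
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                    "Tags": [{"Key": "environment", "Value": environment}],
                },
            }
        },
    }


# The same definition is rendered for every environment, so configuration
# differs only where the code says it should.
templates = {env: bucket_template(env) for env in ("staging", "production")}
print(json.dumps(templates["staging"], indent=2))
```

Because the template is produced by code, changes to it can be reviewed, tested, and rolled back like any other commit.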
Monitoring, logging, and alerting
Amazon CloudWatch provides monitoring, logging, and alerting services to help you keep an eye on the health, performance, and security of your AWS resources and applications.
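As one possible illustration of alerting, the sketch below assembles the parameters for a CloudWatch alarm on EC2 CPU utilization. The instance ID, SNS topic ARN, and thresholds are hypothetical values, and the code only builds the parameters rather than calling AWS.

```python
def high_cpu_alarm(instance_id, sns_topic_arn, threshold_percent=80.0):
    """Build parameters for a CloudWatch alarm on sustained high CPU."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation window
        "EvaluationPeriods": 2,  # two consecutive breaches before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify this topic when in ALARM
    }


# Placeholder identifiers for illustration only.
alarm = high_cpu_alarm(
    "i-0123456789abcdef0", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
```

With boto3, these keyword arguments would typically be passed to the CloudWatch client's put_metric_alarm call.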
Security
AWS security groups act as stateful virtual firewalls for individual resources; allowing only the traffic each resource needs follows the principle of least privilege and helps prevent unauthorized access. Network access control lists (NACLs) apply stateless firewall rules at the subnet level, adding a second layer of control once the network is segmented into multiple subnets. AWS Identity and Access Management (IAM) enables fine-grained access controls that apply least-privilege permissions to users, groups, and roles to minimize the impact of credential compromise. In addition, AWS provides security tools and services that automate security assessments, monitor compliance, and remediate security issues.
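A least-privilege IAM policy might look something like the sketch below, which grants read-only access to a single prefix of a single S3 bucket. The bucket name and prefix are hypothetical, and the right scope for a real policy depends on what the workload actually needs.

```python
import json


def read_only_prefix_policy(bucket_name, prefix):
    """Build an IAM policy allowing reads under one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
                # Listing is limited to keys under the given prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/{prefix}/*",
            },
        ],
    }


# Placeholder bucket and prefix for illustration.
policy = read_only_prefix_policy("example-reports-bucket", "monthly")
print(json.dumps(policy, indent=2))
```

If credentials attached to this policy were compromised, the exposure would be limited to reading objects under that one prefix.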
Operational Design
Operational processes, practices, and systems must be designed to adapt to changing conditions and mitigate risk so that availability, reliability, and performance hold up during disruptions or failures. Several key operational design principles impact resiliency:
- Backups: Guard against data loss and corruption with regular backups.
- Backup testing: Verify the backup and test the restore process.
- Backup redundancy: Ship backups to other regions or accounts to protect against regional outages or account-level failures.
- Backup frequency and rotation: Optimize your recovery capabilities while managing storage costs by applying backup granularity and retention policies.
- Playbooks or runbooks: Help teams carry out predefined procedures quickly and consistently when an incident occurs.
- Standby environments: Choose hot, warm, or pilot light standby environments as part of your disaster recovery strategy, based on cost and the speed of recovery you need.
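The backup frequency and rotation principle above can be sketched as a simple retention rule. The example below assumes one backup per day and a policy of keeping every backup from the last 7 days plus the newest backup of each ISO week within the last 28 days; both numbers and the function name are illustrative assumptions, not a recommendation.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily=7, weekly=4):
    """Select backups to retain.

    Keeps every backup from the last `daily` days, plus the newest backup
    in each ISO week that falls within the last `weekly * 7` days.
    """
    keep = set()
    newest_per_week = {}
    for d in sorted(backup_dates):
        age = (today - d).days
        if 0 <= age < daily:
            keep.add(d)
        if 0 <= age < weekly * 7:
            # Dates are visited oldest first, so later dates overwrite
            # earlier ones and the newest backup of each week wins.
            newest_per_week[d.isocalendar()[:2]] = d
    keep.update(newest_per_week.values())
    return sorted(keep)


today = date(2024, 3, 31)
history = [today - timedelta(days=n) for n in range(60)]  # 60 daily backups
kept = backups_to_keep(history, today)
expired = sorted(set(history) - set(kept))
```

Everything in `expired` could then be deleted or moved to cheaper storage, which is how granularity and retention trade recovery options against storage cost.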
Application Design
Designing applications with a cloud-native approach promotes resiliency, but can be complex to implement effectively. Crucial elements to consider when building cloud-native applications include:
Modern, cloud-native microservices architectures
Modern, cloud-native microservices architectures incorporate principles such as loose coupling, high cohesion, and autoscaling as fundamental design characteristics.
Loose coupling between individual services allows you to isolate failures, independently scale components, and enhance fault tolerance. High cohesion reduces the risk of unintended side effects and dependencies between components, which makes failures easier to identify and isolate. Autoscaling enables applications to dynamically adjust their resource allocation in response to traffic spikes or changes in demand.
Code reviews
Code reviews are essential for maintaining code quality, consistency, and reliability. Thorough code reviews help teams identify vulnerabilities, bottlenecks, and reliability issues to improve the overall resilience of the application.
Event-driven architecture (EDA)
Event-driven architecture (EDA) is built from small, decoupled services that publish, consume, or route events. EDAs use asynchronous message passing to decouple components and enable scalable and resilient communication between services. Message queues and event streams buffer events and make it possible to retry after transient errors, so applications can absorb bursts of traffic, smooth out load spikes, and recover gracefully from failures without impacting end users.
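The buffering and retry behavior can be illustrated with a small in-memory sketch. A deque stands in for a managed queue, and the handler, event names, and retry limit are all hypothetical; a production system would rely on a durable queue or stream with its own redelivery and dead-letter features.

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process buffered events, retrying transient failures.

    Events that fail are put back on the queue until they have been tried
    `max_attempts` times, after which they go to a dead-letter list.
    """
    processed, dead_letter = [], []
    while queue:
        event, attempts = queue.popleft()
        try:
            handler(event)
            processed.append(event)
        except ConnectionError:
            if attempts + 1 < max_attempts:
                queue.append((event, attempts + 1))  # retry later
            else:
                dead_letter.append(event)
    return processed, dead_letter


# A hypothetical consumer that fails the first time it sees each event.
seen = set()


def flaky_handler(event):
    if event not in seen:
        seen.add(event)
        raise ConnectionError("transient downstream failure")


# A burst of five events, each stored with its attempt count.
burst = deque((f"order-{n}", 0) for n in range(5))
processed, dead_letter = drain(burst, flaky_handler)
```

The producer that filled the queue never sees the transient failures; the events simply wait in the buffer until the consumer succeeds.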
Idempotency
Idempotency refers to the ability of an operation to produce the same result whether it is performed once or multiple times. Idempotency supports resilience because repeating an operation after a failure will not cause unintended side effects or data corruption.
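One common way to get this property is an idempotency key: the caller attaches a unique key to each logical request, and the service replays the stored result when it sees the same key again. The sketch below assumes an in-memory store and a hypothetical PaymentService class; a real service would persist keys in a durable store.

```python
class PaymentService:
    """Toy service whose charge operation is idempotent per request key."""

    def __init__(self, balance=100):
        self.balance = balance
        self._results = {}  # idempotency key -> result of the first call

    def charge(self, idempotency_key, amount):
        # A repeated key means the caller is retrying: return the stored
        # result instead of charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance -= amount
        result = {"charged": amount, "balance": self.balance}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("req-123", 30)
retry = service.charge("req-123", 30)  # e.g. the client timed out and retried
```

Here the retry returns the same result as the first call and the balance is reduced only once.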
Observability
Observability practices such as logging, monitoring, and distributed tracing can help teams diagnose issues, troubleshoot failures, and optimize performance to enhance resilience.
Serverless computing services
Serverless computing services like AWS Lambda and AWS Fargate abstract away infrastructure responsibilities so developers can focus on writing code, not managing servers. Serverless architectures promote resiliency with automatic scalability, fault tolerance, and high availability.
Backup scheduling
Backup schedules should follow from two targets: the Recovery Point Objective (RPO), which sets how much data you can afford to lose, and the Recovery Time Objective (RTO), which sets how much downtime you can tolerate. You will need to decide for each application what its requirements are, based on how sensitive its data is to loss.
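A quick way to reason about these two targets is to compare them with your backup interval and restore time. The sketch below makes two simplifying assumptions: worst-case data loss equals the full interval between backups, and downtime equals the restore duration (ignoring detection and decision time). The function name and the sample numbers are illustrative.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Check a backup schedule against RPO and RTO targets."""
    return {
        # A failure just before the next backup loses a full interval of data.
        "rpo_met": backup_interval <= rpo,
        # Assumes downtime is dominated by the time needed to restore.
        "rto_met": restore_duration <= rto,
    }


# Daily backups and a two-hour restore, measured against four-hour targets.
check = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=2),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
```

In this example the restore time fits within the RTO, but daily backups cannot satisfy a four-hour RPO, so the application would need more frequent backups or continuous replication.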
The Caylent approach to resiliency on AWS
Is your application ready to serve a global user base? Do you need to segregate your environment in a particular region due to compliance (e.g. GDPR)? Do you need a multi-region application to ensure high availability?
Embrace the power of a multi-region infrastructure to reduce latency, improve performance, and transform your application into a truly global powerhouse with Caylent.
Caylent helps you thrive in a software-defined world where technology is at the core of every business. We work with you to build, scale, and optimize sophisticated cloud solutions using deep subject matter expertise to deliver world-class outcomes through an agile co-delivery model. Get in touch to find out how we can help!