re:Invent 2023 Architecture Session Summaries

Cloud Technology

Get up to speed on all the architecture-focused 300- and 400-level sessions from re:Invent 2023!

We know that watching all the re:Invent session videos can be a daunting task, but we don't want you to miss out on the gold that is often found in them! In this blog, you can find quick summaries of all the 300 and 400 level sessions, grouped by track. Enjoy!

ARC206 Scaling on AWS for the first 10 million users

The AWS re:Invent 2023 presentation "Scaling on AWS for the first 10 million users" (ARC206) was led by Skye Hart and Chris Munns, who discussed strategies for scaling applications on Amazon Web Services. The presentation emphasized the need for developers to build applications with scalability in mind from the beginning. Skye Hart opened the session by highlighting the importance of scalability and future-proofing applications using AWS tools and resources. She covered the main components of an application (front end, back end, and data storage) and how to manage each with AWS services: AWS Amplify for front-end hosting; compute options including Amazon EC2, container services, and AWS Lambda; and backend services such as Amazon API Gateway, Application Load Balancer, and AWS AppSync.

Skye then turned to database selection, advocating for starting with SQL databases such as Amazon Aurora, citing their scalability, widespread support, and compatibility with most applications. She detailed the features of Amazon Aurora, focusing on its scalability and managed service benefits. This was complemented by a discussion on choosing the appropriate API front end and using AWS App Runner for efficient application deployment.

Chris Munns continued the discussion, focusing on architecture refinement as applications scale. He emphasized the use of Amazon CloudWatch and AWS X-Ray for monitoring and tracing, and machine learning tools such as Amazon DevOps Guru and Amazon CodeGuru for performance insights. As applications grow beyond 10,000 users, Munns explained the need to evolve the architecture, possibly breaking it into microservices and considering asynchronous communication models. He stressed the importance of caching, database federation, and choosing the right technology based on specific needs, concluding that AWS offers a wide array of services and tools to support scalability at each stage of growth.
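To make the caching point concrete, here is a minimal cache-aside sketch in Python. It is an illustration only, not code from the session: the `CacheAside` class, its `load` callback, and the TTL value are all hypothetical, and a production system would more likely use a managed cache than an in-process dictionary.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Tiny in-process cache-aside wrapper with a time-to-live per entry."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load            # slow source of truth, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: entry has not expired yet
        self.misses += 1
        value = self._load(key)      # cache miss: fall back to the source
        self._entries[key] = (now + self._ttl, value)
        return value
```

Repeated reads of the same key within the TTL reach the backing store only once, which is the load reduction the speakers were pointing at.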

AWS re:Invent 2023 - Scaling on AWS for the first 10 million users (ARC206)

ARC306 Reducing your area of impact and surviving difficult days

At AWS re:Invent 2023, the session "Reducing your area of impact and surviving difficult days" (ARC306) focused on strategies for minimizing the impact of impairments or events on critical workloads. The presenters used a fictional character, Alice, who owns a coffee shop, to illustrate the journey of scaling a business and the corresponding need to make IT systems more resilient. The opening discussion followed Alice's coffee shop as it grew from a small enterprise into a larger one, and the challenges she faced in maintaining and improving the resilience of her IT infrastructure.

The speakers walked through architectural strategies and AWS services that can enhance system resilience. They emphasized breaking applications into microservices, using AWS fault isolation boundaries such as Regions and Availability Zones, and implementing cell-based architectures for greater isolation and reduced impact. Each strategy aimed to mitigate shared-fate scenarios in which a single failure could impact the entire system. They also introduced shuffle sharding, a technique that offers an even more granular level of resilience by limiting the impact of failures to smaller subsets of users.
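The idea behind shuffle sharding can be sketched in a few lines of Python. This is an illustrative sketch, not the presenters' implementation: the function names, the hash-based seeding, and the worker counts are assumptions chosen for the example.

```python
import hashlib
import random
from math import comb
from typing import Tuple


def assign_shard(customer_id: str, workers: int, shard_size: int) -> Tuple[int, ...]:
    """Deterministically pick `shard_size` distinct workers for a customer."""
    # Derive a stable seed from the customer ID so the assignment never changes.
    digest = hashlib.sha256(customer_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return tuple(sorted(rng.sample(range(workers), shard_size)))


def distinct_shards(workers: int, shard_size: int) -> int:
    """Number of different worker combinations a customer could be given."""
    return comb(workers, shard_size)
```

With 8 workers and shards of size 2 there are `distinct_shards(8, 2) == 28` possible combinations, so a customer whose traffic takes down both of its workers fully overlaps only with customers that drew the same pair, rather than with everyone.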

The presentation concluded by highlighting the AWS Resilience Lifecycle Framework and related AWS services, such as AWS Resilience Hub, AWS Elastic Disaster Recovery, AWS Backup, and Amazon Route 53 Application Recovery Controller. These tools and frameworks provide mechanisms for setting resilience objectives, designing and implementing resilient architectures, evaluating and testing systems, and learning from recovery processes. The session ended with encouragement for the audience to visit Alice's coffee shop, a metaphor for businesses using AWS services for enhanced resilience, and a reminder to fill out the session survey on the mobile app.

AWS re:Invent 2023 - Reducing your area of impact and surviving difficult days (ARC306)

ARC307 Do modern cloud applications lock you in?

Session ARC307 at AWS re:Invent 2023 examined modern cloud applications, focusing on their benefits and the potential for vendor lock-in. The speaker, previously an enterprise strategist and now part of the AWS team, shared insights on what constitutes a modern cloud application. Challenging the conventional graph that typically represents modernization in cloud applications, the session argued that modern applications are those that make optimal use of cloud capabilities and are therefore cloud-native. Emphasizing benefits such as resilience, scalability, transparency, cost efficiency, and agility, the presentation aimed to redefine the understanding of modern cloud applications beyond just their runtime environments.

The discussion also addressed the concerns of vendor lock-in, a common apprehension with cloud services. The speaker, adopting an architectural perspective, emphasized the importance of understanding the trade-offs involved in architecture decisions. The session explored the multidimensionality of lock-in, suggesting that it's not just about vendor dependence but encompasses various aspects including product switching costs, skill set adaptability, and even mental lock-in. The presentation advocated for looking at cloud services through a multi-faceted lens, considering aspects like utility, cost, and future flexibility. The speaker also underscored the significance of maintaining agility and development discipline to minimize potential switching costs and lock-in risks, ultimately encouraging a balanced and well-informed approach to utilizing cloud services.

AWS re:Invent 2023 - Do modern cloud applications lock you in? (ARC307)

ARC308 Best practices for creating multi-region architectures on AWS

AWS re:Invent 2023 featured a session on best practices for creating multi-region architectures on AWS, led by Joe Chapman, a principal solutions architect, and Neeraj Kumar, a principal technologist. The session focused on the complexities and decisions involved in extending workloads across multiple AWS regions. This is often done to improve performance for a globally distributed user base, increase availability for critical workloads, or comply with data residency laws. The speakers shared real-world scenarios and best practices extracted from their experience helping customers evolve their AWS workloads. The session aimed to provide clarity on building and evolving multi-region workloads using known AWS best practices, discussing trade-offs, considerations, and the importance of understanding specific requirements for each workload.

The first customer scenario discussed was a FinTech retail bank, which was considering a multi-region strategy for disaster recovery (DR) and operational continuity. The bank needed to define business goals like Recovery Time Objective (RTO) and Recovery Point Objective (RPO) and choose a suitable DR strategy from options like backup and restore, pilot light, warm standby, and active/active configurations. They opted for the warm standby approach and focused on data replication strategies, deployment of code across regions, and the importance of testing DR strategies regularly through mechanisms like GameDays and AWS Fault Injection Service. Regular testing of detection and recovery controls was emphasized as critical for ensuring the effectiveness of DR strategies.

The second scenario focused on an online authentication provider who needed to improve their uptime SLA and global application performance. They faced challenges in managing deployments across regions and ensuring intelligent, reliable routing for users worldwide. Solutions included using AWS CloudFormation for consistent deployments, Route 53 for latency-based routing, and DynamoDB global tables for data replication. The company also emphasized the importance of regional independence, operational considerations in a multi-region setup, and a framework for selecting new AWS regions based on user location, cost, and service availability. The session concluded with a reminder of the resilience offered by AWS regions and the need for careful planning and understanding of dependencies and data consistency in a multi-region architecture.

AWS re:Invent 2023 - Best practices for creating multi-Region architectures on AWS (ARC308)

ARC309 Using zonal autoshift to automatically recover from an AZ Impairment

The AWS re:Invent 2023 session, presented by Deepak Sury, the General Manager for Application Recovery Controller at AWS, focused on the new features of zonal autoshift, a tool for automatic recovery from Availability Zone (AZ) impairments. Sury emphasized the importance of reducing the duration and frequency of customer-impacting events and highlighted AWS's commitment to developing tools and processes for both internal services and customer applications. The session included an in-depth discussion on Availability Zones (AZs), their role in AWS's infrastructure, and the significance of deploying to multiple AZs for enhanced reliability and quick recovery from potential failures.

Gavin McCullagh, a long-time AWS employee, shared his insights on the evolution of Amazon's infrastructure and its emphasis on reliability, particularly during critical sales periods like Black Friday. He discussed the concept of "recovery-oriented computing," which involves shifting away from failed components to maintain service continuity, rather than trying to fix them immediately. McCullagh explained the differences between hard and gray failures in distributed systems and emphasized the importance of designing for redundancy and minimizing coordination between AZs to avoid simultaneous failures.

The session introduced "zonal shift," a tool that allows users to temporarily steer traffic away from an impaired AZ. This tool is part of Amazon Route 53 Application Recovery Controller, which also includes "zonal autoshift," enabling automatic traffic redirection in response to potential AZ impairments. These tools help manage and mitigate the impact of AZ failures on applications and services. The presentation highlighted the shared responsibility model in AWS services, where certain layers like load balancing and databases are managed by AWS, while the compute layer requires customer involvement for efficient recovery.

AWS re:Invent 2023 - Using zonal autoshift to automatically recover from an AZ impairment (ARC309)

ARC310 Detecting and mitigating gray failures

The AWS re:Invent 2023 presentation "Detecting and mitigating gray failures" (ARC310) was delivered by Michael Haken, a senior principal solutions architect at AWS. Haken focused on gray failures: subtle issues that are difficult to detect because they do not cause outright system failures but can significantly degrade the user experience. He emphasized the importance of understanding and addressing these failures for maintaining system resilience.

Haken introduced the concept of differential observability, where a system's health appears different depending on the perspective. For instance, the underlying system might not register an impact while users experience significant degradation. This discrepancy means operators cannot rely solely on the system's built-in detection mechanisms. He underscored the need for more nuanced health checks and observability, going beyond basic metrics like CPU and memory usage to include context-rich indicators aligned with system fault boundaries such as Availability Zones.

The presentation also covered strategies for detecting and mitigating gray failures, especially in single-host and single-Availability-Zone scenarios. Haken discussed using advanced health checks, outlier detection, and composite alarms. For mitigation, he recommended evacuating affected areas of the system, such as a problematic Availability Zone, and relying on data planes rather than control planes for more reliable recovery actions. He also introduced AWS tools such as Application Recovery Controller and its zonal shift feature for facilitating these processes. The talk was accompanied by resources and workshops for deeper understanding and hands-on experience with these concepts.
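As a rough illustration of outlier detection across Availability Zones, the Python sketch below flags a zone whose error rate is far above the median of the other zones. The function name, the `ratio` and `floor` thresholds, and the sample numbers are assumptions for the example, not values from the talk.

```python
from statistics import median
from typing import Dict, List


def find_outlier_zones(error_rates: Dict[str, float],
                       ratio: float = 3.0,
                       floor: float = 0.01) -> List[str]:
    """Return zones whose error rate is far above the median of the other zones."""
    outliers = []
    for zone, rate in error_rates.items():
        others = [r for z, r in error_rates.items() if z != zone]
        if not others:
            continue  # nothing to compare a single zone against
        # The floor keeps tiny absolute error rates from triggering a false alarm.
        baseline = max(median(others), floor)
        if rate > ratio * baseline:
            outliers.append(zone)
    return sorted(outliers)
```

For example, `find_outlier_zones({"az1": 0.002, "az2": 0.004, "az3": 0.09})` flags only `az3`. When every zone degrades equally, no zone is flagged, which matches the intent: a comparison between zones is meant to catch a single-zone impairment, not a region-wide one.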

AWS re:Invent 2023 - Detecting and mitigating gray failures (ARC310)

ARC311 Building Cost-Optimized Multi-Tenant SaaS Architectures

The AWS re:Invent 2023 session "Building cost-optimized multi-tenant SaaS architectures" (ARC311) was presented by Tod Golding, a solutions architect with eight years of experience in the SaaS domain at AWS. The session focused on strategies and best practices for creating cost-effective, multi-tenant SaaS solutions. Tod emphasized that cost optimization in SaaS is more than just reducing infrastructure bills; it encompasses efficient growth, operational efficiency, and understanding tenant-specific use cases and consumption patterns.

The session explored various architectural patterns and strategies, including horizontal scaling, serverless compute with AWS Lambda, containerization with Amazon EKS, and tier-based throttling. Tod also discussed the importance of aligning tenant activity with resource consumption and the need for operational metrics that measure the efficiency of SaaS architectures. He highlighted the challenges of right-sizing storage and compute resources in multi-tenant environments and underscored the need for granular control over tenant consumption. The talk stressed that measuring tenant-level consumption is essential to understanding and optimizing costs, and concluded that successful cost optimization requires a comprehensive approach that extends beyond infrastructure considerations.
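Tier-based throttling is often built on a token bucket per tenant. The sketch below shows that idea under stated assumptions: the tier names, the rates in `TIER_LIMITS`, and the class names are hypothetical, and the session did not prescribe this particular mechanism.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical tiers: (tokens refilled per second, burst capacity).
TIER_LIMITS: Dict[str, Tuple[float, float]] = {
    "basic": (5.0, 10.0),
    "premium": (50.0, 100.0),
}


class TokenBucket:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity       # start full so a burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantThrottle:
    """Keeps one bucket per tenant, sized by the tenant's tier."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, tier: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            rate, capacity = TIER_LIMITS[tier]
            bucket = TokenBucket(rate, capacity, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

Because each tenant has its own bucket, a noisy basic-tier tenant exhausts only its own allowance, and the per-tenant counters double as a simple consumption signal.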

AWS re:Invent 2023 - Building cost-optimized multi-tenant SaaS architectures (ARC311)

ARC312 Resilience lifecycle: A mental model for resilience on AWS

The presentation at AWS re:Invent 2023 focused on resilience in application development and maintenance, particularly in the context of AWS services. Clark Richey, a principal technologist at AWS, began by discussing the importance of resilience for organizations, highlighting the substantial revenue loss companies can face due to unplanned downtime. He introduced the AWS Resilience Lifecycle, a mental model designed to help businesses enhance their resilience on AWS. This model includes phases such as setting objectives, designing and implementing, evaluating and testing, and operating, each with specific activities and AWS services to support them. Richey emphasized the shared responsibility model in AWS, where AWS ensures the resilience of the cloud infrastructure, while customers are responsible for resilience in the cloud.

Stacy Brown and Yoni from Vanguard, a global investment management company, shared their experiences and strategies in improving resilience. They highlighted Vanguard's digital nature and the critical importance of ensuring zero downtime for clients and efficient feature development for engineers. The company's approach involved bringing together various teams, like architecture, engineering, and operations, to create an enterprise-wide focus on resilience. This integrated approach led to significant improvements, including a reduction in major incidents and faster, more reliable deployments. Their strategy also included the development of in-house tools for performance testing and chaos engineering, enhanced observability, and a strong focus on culture change towards proactive resilience.

In conclusion, while significant progress has been made in enhancing resilience at Vanguard, the journey is ongoing. Future efforts will focus on making resilience adoption easier for engineers, enhancing observability tools for a more integrated view, improving the policy engine for early development lifecycle intervention, and expanding the focus to the entire end-to-end client journey. The presentation concluded with an invitation for further questions and engagement with the audience, demonstrating the ongoing dialogue and learning process in the field of resilience in cloud computing and application development.

AWS re:Invent 2023 - Resilience lifecycle: A mental model for resilience on AWS (ARC312)

ARC313 A consistent approach to resilience analysis for critical workloads

The AWS re:Invent 2023 session "A consistent approach to resilience analysis for critical workloads" (ARC313), presented by John Formento and his team, focused on the importance of resilience in application infrastructure. John introduced the Resilience Lifecycle Framework, a guide to help organizations start their resilience journey and optimize their existing frameworks. The session emphasized the criticality of designing and implementing resilient systems, particularly in the context of AWS services. The team also highlighted various AWS resilience-oriented offerings, including the Application Recovery Controller, which plays a crucial role in recovery and resilience strategies.

During the session, real-world examples were discussed to illustrate the application of the Resilience Analysis Framework (RAF) in various scenarios. This included strategies like constant work, hedging, and fault-isolated deployments. Each example demonstrated how specific resilience patterns could mitigate potential failure modes like excessive load, excessive latency, and cascading failures. The RAF approach helps in identifying critical failure modes, understanding their impact, and implementing preventative or corrective measures. The examples underscored the importance of understanding and preparing for various types of failures to ensure resilient and reliable systems.
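Of the patterns mentioned, hedging is easy to show in code: issue a second request if the first is slow, and take whichever answer arrives first. The `asyncio` sketch below is a minimal illustration; `hedged_call`, its parameters, and the delay value are assumptions, and it deliberately ignores the case where the faster call fails.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def hedged_call(primary: Callable[[], Awaitable[T]],
                      backup: Callable[[], Awaitable[T]],
                      hedge_delay: float) -> T:
    """Start `primary`; if it has not finished after `hedge_delay` seconds,
    also start `backup` and return whichever finishes first."""
    tasks = [asyncio.ensure_future(primary())]
    done, _ = await asyncio.wait(tasks, timeout=hedge_delay)
    if not done:
        # Primary is slow: send the hedge request and race the two.
        tasks.append(asyncio.ensure_future(backup()))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in tasks:
        if task not in done:
            task.cancel()            # stop the loser so it does not waste capacity
    # Simplification: if the first finisher raised, the exception propagates.
    return next(iter(done)).result()
```

The hedge delay controls the trade-off: a short delay cuts tail latency more aggressively but sends more duplicate requests, which matters when the failure mode being mitigated is excessive load rather than excessive latency.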

Finally, Mike Gallo from AWS shared insights into the practical implementation of RAF within their teams. He emphasized the importance of executive support and the need for engineers to adopt a proactive approach towards resilience. Mike highlighted that resilience is not a one-time effort but an ongoing journey, requiring continuous evaluation and improvement of systems. He also shared how RAF led to better understanding and prioritization of risks, ultimately enhancing the resilience of their services. The session concluded with a reminder that resilience is a journey requiring consistent effort and attention, especially in complex, mission-critical applications.

AWS re:Invent 2023 - A consistent approach to resilience analysis for critical workloads (ARC313)

ARC315 Gain Confidence in System Correctness & Resilience with Formal Methods

The AWS re:Invent 2023 session titled "Gain Confidence in System Correctness & Resilience with Formal Methods" (ARC315), presented by Ankush Desai and Vikas Bera, focused on the use of formal methods in ensuring system correctness and resilience, particularly in distributed applications. They emphasized that formal methods, often perceived as complex and math-heavy, can be approachable and practical. The session introduced a framework called 'P', used within AWS to reason about system correctness. This framework enables users to express system design as communicating state machines, verify these designs against specified behaviors, and connect them with real-world implementations.

The presentation detailed the use of 'P' in validating distributed systems through model creation and simulation of various scenarios, including chaos engineering and disaster recovery. The tool allows the definition of system invariants (expected behaviors) and simulates system components interacting through state machines. This process enables the identification of defects and testing of system resilience to failures. The speakers shared their experiences using 'P' to model complex systems like transaction processing and how it accelerated development by identifying hard-to-find bugs during the design phase.
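The core idea of checking a design by exploring state machine interleavings can be imitated in plain Python. The sketch below is not P and does not use P's syntax or tooling; it is a hypothetical toy model in which each client reads a shared counter and then writes back the value plus one, and a search over every interleaving looks for a schedule that breaks the invariant "the final counter equals the number of clients".

```python
from typing import List, Optional, Tuple

# State: (shared counter, each client's last read value, each client's program counter)
State = Tuple[int, Tuple[Optional[int], ...], Tuple[int, ...]]


def step(state: State, client: int) -> State:
    """Advance one client by one step: first a read, then a write."""
    counter, local, pc = state
    local_list, pc_list = list(local), list(pc)
    if pc_list[client] == 0:
        local_list[client] = counter          # read step
    else:
        counter = local_list[client] + 1      # write step (may overwrite another client)
    pc_list[client] += 1
    return counter, tuple(local_list), tuple(pc_list)


def find_violation(clients: int = 2) -> Optional[List[int]]:
    """Search every interleaving; return a schedule that breaks the
    invariant 'final counter == number of clients', or None if it holds."""
    initial: State = (0, (None,) * clients, (0,) * clients)
    stack: List[Tuple[State, List[int]]] = [(initial, [])]
    while stack:
        state, schedule = stack.pop()
        counter, _, pc = state
        runnable = [c for c in range(clients) if pc[c] < 2]
        if not runnable:
            if counter != clients:
                return schedule               # invariant violated on this schedule
            continue
        for c in runnable:
            stack.append((step(state, c), schedule + [c]))
    return None
```

With two clients the search returns a schedule in which both clients read before either writes, the classic lost update. Finding that kind of defect at design time, before any implementation exists, is the benefit the speakers attributed to modeling.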

AWS re:Invent 2023 - Gain confidence in system correctness & resilience with formal methods (ARC315)

ARC316 Practice like you play: How Amazon scales resilience to new heights

AWS re:Invent 2023 featured the presentation "Practice like you play: How Amazon scales resilience to new heights" (ARC316), which focused on how Amazon Prime Video applies principles from sports teams to its engineering practices for improved resilience and reliability. The presenters, Olga Hall and Lauren Domb, discussed the importance of reducing downtime in various industries, noting that the average cost of downtime is around $300,000 per hour, and nearly half of companies experiencing downtime fail to serve their customers as a result. They introduced the concept of a "resilience playbook," a set of strategies and tactics to train teams for unpredictable scenarios, drawing parallels between preparing engineering teams for peak workloads and sports teams preparing for major events like the Super Bowl.

The presentation detailed the systematic approach Amazon uses to ensure resilience, including game days (automated load tests run three times a week in each region), chaos engineering, and fault injection experiments to simulate unpredictable conditions. They emphasized the use of feature flags for quickly modifying or disabling features, rigorous testing and monitoring, and the importance of retrospective analysis after incidents to learn and improve. The key metrics for Prime Video, like concurrent streams and stream starts, are used to predict and prepare for peak workload times. The speakers also highlighted how Prime Video's global strategies are tailored to local executions, illustrating their 'Think Global, Act Local' mindset.

In summary, the talk emphasized the importance of preparing for both predictable and unpredictable events in systems engineering. The speakers outlined the benefits of regular, disciplined training and testing to build operational muscle memory, ensuring that teams are ready to respond effectively to incidents. They discussed the use of operational readiness scores to measure and improve system reliability, the significance of conducting both low and high-risk experiments, and the critical role of observability in identifying and addressing issues in real-time. The presentation concluded by encouraging attendees to build their resilience playbooks, emphasizing the philosophy of 'practice like you play' to achieve proactive reliability in their systems.

AWS re:Invent 2023 - Practice like you play: How Amazon scales resilience to new heights (ARC316)

ARC317 Improve application resilience with AWS Fault Injection Service

The AWS re:Invent 2023 session "Improve application resilience with AWS Fault Injection Service" (ARC317) featured Adrian Hornsby, Principal Engineer on the AWS Reliability Team, and Iris, a Senior Product Manager. They introduced AWS Fault Injection Service (FIS), designed to enhance application resilience by simulating faults and stress in systems. Adrian emphasized the high cost of downtime for enterprises, highlighting the financial and reputational risks involved. He stressed that complex systems inherently face failures and detailed strategies for building resilience, including anticipation, monitoring, response, and learning from failures. He also discussed statically stable systems, which can maintain operations despite failures without additional control plane operations.

Iris further delved into the specifics of the AWS Fault Injection Service. She explained how FIS allows users to create experiment templates to simulate various faults in their AWS environment. The service includes features like actions, targets, safeguards, IAM controls, and the ability to stop experiments under certain conditions. Iris introduced new FIS features such as scenarios for easy experiment setup and multi-account experiments. She also announced new actions and scenarios focused on multi-AZ and multi-Region resilience testing, demonstrating how these can be used to simulate real-world scenarios like power interruptions in an Availability Zone or connectivity issues across Regions.

The session concluded with a demonstration of FIS in action, using scenarios to test resilience against power interruption in an Availability Zone and cross-Region connectivity issues. Iris showcased the process of setting up and executing these experiments, illustrating their impact on AWS resources like EC2 instances, RDS databases, and Auto Scaling groups. The demonstration highlighted the importance of monitoring and adapting to these simulated failures to ensure application resilience. Attendees were encouraged to practice and integrate resilience testing into their organizational habits, using the AWS Fault Injection Service as a tool to identify and mitigate potential failures in their systems.

AWS re:Invent 2023 - Improve application resilience with AWS Fault Injection Service (ARC317)

ARC319 Optimize cost and performance and track progress toward mitigation

The AWS re:Invent 2023 session on optimizing cost and performance in AWS focused on the significance of understanding and managing cloud costs effectively. Yuriy Prykhodko, a principal technical account manager at AWS, emphasized cost as a reflection of decisions and priorities. He introduced the Cloud Intelligence Dashboards framework, an open-source collection of customizable dashboards in Amazon QuickSight. These dashboards provide actionable insights and in-depth details on AWS cost and usage, enabling organizations to visualize and manage their cloud spend efficiently.

JR Storment, executive director of the FinOps Foundation, discussed the cultural aspect of FinOps, emphasizing its role in aligning technology, business, and finance teams towards cost-effective cloud usage. He highlighted that FinOps is not just about cutting costs, but about maximizing business value and making informed investment decisions. Mike Graff, from Dolby Laboratories, shared practical applications of the Cloud Intelligence Dashboards at Dolby, demonstrating how they enabled better visibility into cloud spend and drove cost-effective architectural decisions.

The session concluded with a discussion on the next steps towards fostering a cost-aware culture in organizations. Emphasis was placed on treating FinOps metrics as critical operational metrics, similar to performance or availability. By deploying tools like the Cloud Intelligence Dashboards, organizations can empower their engineering and finance teams with the data necessary for making informed, cost-effective decisions. This approach not only optimizes cloud spend but also aligns it with broader business objectives, including sustainability goals.

AWS re:Invent 2023 - Optimize cost and performance and track progress toward mitigation (ARC319)

ARC327 5 Things you should know about resilience at scale

The presentation at AWS re:Invent 2023 by Alec Peterson, Mike Furr, and Becky Weiss focused on resilience at scale, drawing on their collective three decades of experience with AWS. They shared five key lessons learned from operating AWS services: handling dependencies and modes, understanding blast radius, managing queues, dealing with errors, and implementing retries effectively. Each topic highlighted the importance of anticipating and managing system behaviors that change at scale, emphasizing resilience over mere availability.

Dependencies and modes dealt with how systems should handle failures in dependent services without drastically altering their mode of operation. Blast radius focused on the scope of impact a single component or change can have on a system and the importance of designing with failure in mind. Queue management stressed the significance of monitoring and controlling backlog buildup to prevent extended recovery times during outages.
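The queue point comes down to simple arithmetic, sketched below in Python. The function names and the sample rates are assumptions for illustration; the formula itself is just backlog divided by spare processing capacity.

```python
def backlog_after_outage(arrival_rate: float, outage_seconds: float) -> float:
    """Messages that pile up while consumers process nothing."""
    return arrival_rate * outage_seconds


def drain_time_seconds(backlog: float, arrival_rate: float, service_rate: float) -> float:
    """Time to clear the backlog once consumers are back.

    New messages keep arriving during recovery, so only the spare
    capacity (service_rate - arrival_rate) works on the backlog.
    """
    if service_rate <= arrival_rate:
        return float("inf")          # no spare capacity: the backlog never drains
    return backlog / (service_rate - arrival_rate)
```

For instance, a 10-minute outage at 100 messages per second leaves a backlog of 60,000 messages. With consumers that can handle 120 per second, only 20 per second goes toward the backlog, so full recovery takes 3,000 seconds (50 minutes), five times longer than the outage itself.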

Errors and retries were the final topics, underscoring the importance of correctly classifying and handling errors for better detection and response to issues. The speakers illustrated how retry mechanisms, while crucial for overcoming temporary issues, can also amplify system load during outages. They suggested strategies like segmenting retries and pre-emptive duplication to balance the need for retries against the risk of overloading the system. These insights offered a comprehensive look at building and operating resilient, large-scale systems, providing valuable lessons for AWS users and cloud computing professionals.
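One widely used way to keep retries from amplifying load is to cap the number of attempts and spread them out with exponential backoff and jitter. The sketch below shows that general technique, which is not necessarily the exact strategy the speakers described; `TransientError`, `call_with_retries`, and the default values are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retries(operation: Callable[[], T],
                      max_attempts: int = 4,
                      base_delay: float = 0.1,
                      max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Retry `operation` on TransientError with capped exponential backoff
    and full jitter; any other exception propagates immediately."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            attempt += 1
            if attempt >= max_attempts:
                raise                # attempts exhausted: surface the error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)       # random wait in [0, cap) spreads out retries
```

Classifying errors matters here: only errors marked transient are retried, so a malformed request fails fast instead of being sent again, and the attempt limit bounds how much extra load one caller can add during an outage.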

AWS re:Invent 2023 - 5 things you should know about resilience at scale (ARC327)

Conclusion

These are summaries of all the 300 and 400 level ARC sessions. We hope you found these helpful in both getting an overview of the new ARC content as well as deciding which sessions to go watch.

Brian Tarbox


Brian is an AWS Community Hero and Alexa Champion, runs the Boston AWS User Group, and has ten US patents and a bunch of certifications. He's also part of the New Voices mentorship program, where Heroes teach traditionally underrepresented engineers how to give presentations. He is a private pilot and a rescue scuba diver, and he earned his master's in cognitive psychology working with bottlenose dolphins.
