re:Invent 2024

AWS re:Invent 2024 Price Reductions and Performance Improvements

AWS Announcements
Generative AI & LLMOps
Serverless & Containers

Explore our technical analysis of AWS re:Invent 2024 price reductions and performance improvements across DynamoDB, Aurora, Bedrock, FSx, Trainium2, SageMaker AI, and Nova models, along with architecture details and implementation impact.

The tension in cloud computing has always been between cost and performance. Services either optimize for performance at a higher cost, or reduce costs by sacrificing capabilities. Well, this year's (2024) edition of AWS re:Invent brought a set of improvements that challenge this assumption through fundamental engineering changes to key services. Let's dive into the technical details of these improvements and understand how they affect system architecture, design, and more importantly our wallets.

DynamoDB On-Demand 50% Cheaper and Global Tables 67% Cheaper

DynamoDB's capacity modes determine how you pay for read and write throughput. Provisioned mode requires you to specify your capacity requirements upfront, planning for peak capacity and paying for it whether you use it or not. On-demand mode, introduced to solve this problem, automatically scales up and down based on actual traffic, but historically came with a price premium for this flexibility.

AWS's engineering investments in DynamoDB's operational efficiency have now shifted this cost equation. Since November 1st, 2024, on-demand throughput costs 50% less than before. This changes the point of when to use on-demand versus provisioned capacity. The trade-off is no longer so heavily weighted toward cost - you can now get the flexibility of on-demand scaling with much more attractive economics.

Global tables, DynamoDB's multi-region replication feature, saw an even more dramatic reduction of 67% in its cost. Global tables maintain independent, active-active copies of your data across regions, automatically handling replication and conflict resolution. This price reduction makes multi-region architectures significantly more accessible, especially for applications that need to serve global users or maintain cross-region disaster recovery capabilities. Don't get me wrong though, multi-region is still hard, and there's still overhead, but this makes the overhead of the data layer significantly smaller (at least if you're using DynamoDB).

The addition of multi-region strong consistency support (in preview as of this writing) for global tables is another big improvement for multi-region architectures. Available in us-east-1, us-east-2, and us-west-2, this feature guarantees that applications always read the latest version of data from any region, achieving zero RPO (Recovery Point Objective). This eliminates one of the major complexities (i.e. headaches) in distributed systems: managing consistency across regions. No more building complex consistency mechanisms into your application layer or accepting eventual consistency as the only option for global deployments!

FSx Intelligent-Tiering: 85% Cost Reduction vs SSD Storage

Amazon FSx for OpenZFS's new Intelligent-Tiering storage class offers up to 85% lower cost than FSx SSD storage class and up to 20% less than traditional on-premises HDD-based NAS storage. Not the most widely used service, I know, but... 85%! Hold on, let me put that in bold: 85%!

And it's not just much cheaper, but also easier. No upfront commitment, no storage tier selection, and no manual data movement. You simply pay for what you use. This makes FSx a viable option for workloads that previously didn't make economic sense in the cloud, particularly those with large amounts of infrequently accessed data. Think development environments with large binary artifacts, media processing workflows with raw footage archives, or enterprise file shares with historical documents.

Aurora Serverless v2 Introduces Scale to Zero

Amazon Aurora Serverless v2 has finally achieved true serverless operation with the ability to scale down to zero Aurora Capacity Units (ACUs). The technical implementation is connection-aware: the database automatically pauses after a period of inactivity based on database connections, while maintaining the ability to quickly resume when needed. When the last connection closes and the database remains inactive, it begins the scale-down process. When a new connection request arrives, Aurora automatically resumes and scales to meet the application's demand.

This is particularly relevant for development and testing environments, and production environments with predictable inactive periods. Consider a development database that's only used during working hours, or a reporting database that runs nightly jobs. Instead of running continuously at minimum capacity, these databases can now truly scale to zero during idle periods. The key is that this happens automatically! There's no need to implement complex scheduling logic or manage the scale-down process yourself.

Also, if you were mad at AWS for calling Aurora Serverless v2 serverless (remember that v1 used to scale to 0, before it was deprecated), now you'll have to concede the point. I did.

Bedrock Intelligent Routing and Caching: Up to 90% Cost Reduction

Let's start with Intelligent Prompt Routing. It's a feature that allows dynamic selection of models within the same family based on prompt complexity. For example, you can route between Claude 3.5 Sonnet and Claude 3.5 Haiku depending on the query's requirements.

Simpler queries go to faster, more cost-effective models, while complex queries route to more capable models. Simple, right? And the improvement isn't just theoretical, AWS reports cost reductions of up to 30% without compromising accuracy. For example, let's think about a customer service AI that handles both simple status queries and complex problem-solving conversations. Why use an expensive, sophisticated model like Claude 3.5 Sonnet for "What's my order status?" when a simpler model (and by simpler I mean cheaper) can answer that just as effectively? To be fair, you were probably already doing this optimization in your backend, even if your AI app is not agentic, but especially if it is. Well, at least for non-agentic stuff, you can say goodbye to that part of your code.

The second feature, prompt caching, attacks a different inefficiency in AI applications. When you're building applications like document Q&A systems, users often ask multiple questions about the same context. Instead of sending that context with every prompt, Bedrock now caches it for up to 5 minutes after each access. The impact is huge: up to 90% cost reduction and 85% lower latency for supported models.

Trainium2: 30-40% Better Price Performance vs P5 Instances

AWS's second-generation Trainium chips are the newest generation of custom silicon for machine learning. They're 4x faster than the previous generation (Trn1), offer 4x more memory bandwidth, and provide 3x more memory capacity.

Each Trn2 instance packs 16 Trainium2 chips, 192 vCPUs, 2 TiB of memory, and 3.2 Tbps of Elastic Fabric Adapter (EFA) v3 network bandwidth. That EFA v3 networking delivers up to 35% lower latency than the previous generation. All this translates to 30-40% better price performance compared to current generation GPU-based EC2 P5e and P5en instances.

For those needing even more power, AWS introduced Trn2 UltraServers with 64 Trainium2 chips connected via high-bandwidth, low-latency NeuronLink interconnect. This configuration targets the increasing demands of frontier foundation model training and inference. Makes you wonder if this is what Anthropic is using (you read about their partnership and AWS's investment, right?).

SageMaker AI Inference Endpoints Scale to Zero

Amazon SageMaker AI Inference has joined the scale-to-zero movement, allowing endpoints to scale to zero instances during inactive periods. This is particularly valuable for applications with variable traffic patterns, like chatbots that are busy during business hours but quiet overnight, or content moderation systems that handle periodic upload spikes. Or, of course, dev environments.

The ability to scale to zero significantly reduces the cost of running inference using AI models if your use case is sporadic. Instead of maintaining minimum capacity for potential traffic, endpoints can now completely scale down during quiet periods and automatically scale up when needed.

Amazon Nova: New Cost-Efficient Foundation Models

AWS's entry into the foundation model space focuses squarely on the price-performance equation. The Amazon Nova family includes models optimized for different use cases:

  • Nova Micro is a text-only model designed for maximum speed and minimum cost. It handles basic text operations with the lowest latency in the family.
  • Nova Lite takes a different approach, offering multimodal capabilities (processing image, video, and text) while maintaining very low cost through optimizations for speed.
  • Nova Pro represents the balance point, delivering high capability across multiple modalities while optimizing the accuracy-speed-cost triangle.
  • Nova Canvas and Nova Reel round out the family with specialized models for image and video generation respectively.

Initially available in us-east-1 (all models) and us-west-2/us-east-2 (Micro, Lite, and Pro), Nova demonstrates AWS's focus on practical, cost-effective AI implementations.

And before you ask, no, they didn't get a mention in this article just because they're new. Have you checked the prices?

For reference, Claude 3 Haiku (Haiku is the cheapest Anthropic model per generation) costs $0.00025 per 1,000 input tokens (4x more than Nova Lite) and $0.00125 per 1,000 output tokens (5x more than Nova Lite). And that's the old generation Claude 3 Haiku, the newest one Claude 3.5 Haiku is 3x more expensive than Claude 3 Haiku, and is priced similarly to Nova Pro, except that Nova Pro is in the same class as Claude 3.5 Sonnet and GPT-4o models, not their cheaper options Haiku and 4o-mini. Oh, and did I mention the 300k token context window? Alright, I'll move on, but seriously consider the Nova models for Gen AI apps.

Technical Impact and Architecture Considerations

First of all, we're seeing true serverless implementations. Aurora Serverless v2 scaling to zero feels like something AWS owed us, since they already had this in Aurora Serverless v1. SageMaker AI, on the other hand, is a fantastic bonus. Sure, the cost of running Aurora or SageMaker AI at minimum capacity while not in use wasn't that high, and the reduction in your AWS bill will very likely be well below 10%. But think about the implementation effort on your side: zero! These announcements don't mean a huge amount of money, but they mean free money for you. Go grab it!

Trainium2, on the other hand, brings a potentially huge reduction in your AWS bill. Sure, most of us aren't doing training or inference, and I wouldn't be surprised if you've never used a GPU in AWS. But for the relatively few who use GPUs for training and inference, this announcement is indeed huge. Moreover, it's a continued bet on purpose-built processors, a trend we've seen with AI even before ChatGPT's initial release. I mean, bitcoin has been using specialized hardware called ASICs since 2013, so it makes sense that AI training and inference also gets specialized hardware, right?

Bedrock's intelligent routing and prompt caching are great for cost optimization, and I think their impact on latency deserves a special mention, since latency is one of the biggest problems with production-grade Generative AI applications, especially agentic ones. But it's not just cost reduction. Bedrock's continuous release of tools and features has turned it into such a great platform to use Gen AI models that at this point building a Generative AI application without using Bedrock makes me feel handicapped.

And let me end this article by saying that DynamoDB's price reduction blew my mind. I knew they were looking for ways to favor customers using On Demand instead of Provisioned mode, but this price reduction changes the entire TCO equation.

I've always said you should start with On Demand and only move to Provisioned once you know your load patterns, and I've written extensively about how both work, what to expect when getting bursts of traffic in Provisioned mode, and how to mitigate them to avoid the dreaded ProvisionedThroughputExceededException (hint: SQS queues for writes and caches for reads). But with the new pricing I'm no longer sure if most companies should even bother with Provisioned.

Sure, On Demand is still more expensive. But when you include the engineering effort of all the optimizations you need to do to make sure a table in Provisioned mode won't be a bottleneck, you'll likely find you're several years away from Provisioned mode making sense for you.

Overall, focusing on your core application and your business differentiator and throwing money at AWS to solve the rest has always been a sensible architecture decision. Now you need to throw less money at it!

AWS Announcements
Generative AI & LLMOps
Serverless & Containers
Guille Ojeda

Guille Ojeda

Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.

View Guille's articles

Accelerate your cloud native journey

Leveraging our deep experience and patterns

Get in touch

Related Blog Posts

Whitepaper: The Transformative Potential of Generative AI in Healthcare: A Clinician’s Perspective

Generative AI & LLMOps

How We Utilize AI at Caylent

At Caylent, we're using generative AI across all aspects of our business, from accelerating and improving internal workflows, to offering more innovative, tailored solutions to our customers.

Generative AI & LLMOps

Understanding Amazon Q Developer: Transform

Learn all about how Amazon Q Developer’s transformation capabilities uses generative AI to accelerate data migration and modernization.

Generative AI & LLMOps