Explore Amazon OpenSearch Serverless NextGen, AWS's new scale-to-zero architecture that decouples compute from storage to deliver faster autoscaling and lower costs for idle and agentic AI workloads, and that offers a new approach to managing search and vector search collections.
Running a search or vector workload on Amazon OpenSearch Serverless has always meant paying for compute even when no one was querying it. The original architecture kept a minimum of two OpenSearch Compute Units (OCUs) provisioned at all times, which is reasonable for a steady production search engine. However, for a workload that fires hundreds of vector queries while an agent reasons through a task and then goes quiet for hours, you're paying for capacity you're not using.
On May 28, 2026, AWS released the next generation of OpenSearch Serverless, called OpenSearch Serverless NextGen. It scales compute to zero when a collection is idle, provisions capacity in seconds, and autoscales up to 20 times faster than the previous architecture of OpenSearch Serverless (now called “Classic”). NextGen targets agentic AI and other unpredictable workloads, and AWS claims up to 60% lower cost than provisioning OpenSearch clusters for peak capacity.
In this post, we explain what changed in the architecture, what the new capabilities are, how much they cost, where NextGen fits, and how to decide whether to adopt it or stay on Classic.
NextGen can release compute to zero because of a key architectural change: it separates compute from storage. In Classic, each OCU held a local copy of the data it served, so the system could not give up its last compute nodes without losing the data living on them. NextGen makes the OCUs stateless: they read from and write to a distributed shared storage layer instead of local disk.
This means that new OCUs can start serving requests in seconds rather than minutes, because there is no local disk to populate before a node is useful. The OCU mounts the shared storage and begins working, with scale-up times up to 20x faster than in Classic. In turn, idle capacity can be released without affecting your stored data, because the data never lived on the compute node.
Scale-to-zero comes with a cold start. When no collection in a group receives indexing or search traffic for 10 minutes, compute drops to zero OCUs, and billing for those workers stops. The 10-minute window is not configurable. When traffic returns, the first request to each component waits roughly 10 to 30 seconds while capacity is restored, and requests during that window are queued rather than dropped. Search and indexing scale independently, so a collection can keep indexing workers active while search has dropped to zero, or vice versa. You can set a lower and upper limit on scale, separately for search and indexing capacity.
Cost is metered along four dimensions: indexing, search, storage, and vector-index GPU acceleration. When a collection scales to zero, its compute charge stops, but its storage charge continues, because the data persists in the shared storage layer. AWS's claim of up to 60% lower cost is measured against provisioning Amazon OpenSearch clusters for peak capacity and keeping them idle most of the time.
Beyond the performance gains, NextGen brings structural changes you have to plan around. Collection groups, which were optional in Classic, are required in NextGen. Every collection belongs to a collection group, and capacity limits are set at the group level rather than per collection. Collections in a group share compute resources, which lowers costs for smaller collections with complementary traffic, and they can still use different AWS KMS keys, so grouping does not force shared encryption. The collection group is the unit you set minimum and maximum OCUs on, and the unit that scales to zero.
NextGen also adds a per-account regional endpoint alongside the familiar per-collection endpoint, so a single hostname can serve all your collections, with the target collection identified by a request header. Both endpoint types use standard AWS PrivateLink with automatic private DNS, which removes the need for Route 53 private hosted zones, forwarding rules, and custom DNS configuration that Classic required. For an account running many collections, that is one connection pool and one endpoint to manage instead of one per collection.
At the time of launch, NextGen supports search and vector search collections, but time-series is not available yet. The table below summarizes these differences and the points above, so you can check whether a workload's assumptions still hold under the new architecture, particularly around cold start and cost.
NextGen's advantages concentrate in workloads with one trait in common: demand that is uneven over time.
Agentic retrieval is the clearest fit, and the workload AWS built the architecture around. An AI agent working through a multi-step task can trigger hundreds of concurrent vector queries during a burst of reasoning, then go idle until the next request. Classic's always-on compute bills you for the quiet stretches; NextGen scales up as queries arrive and scales back to zero between tasks.
Multi-tenant SaaS benefits from two of the new features. When each tenant maps to its own collection and tenants have very different activity patterns, collection groups let low-traffic tenants share capacity instead of each holding a minimum, and the regional endpoint serves every tenant collection through one connection pool rather than one per tenant.
Development environments, batch pipelines, and traffic that spikes and subsides round out the fit, because these workloads spend much of their time idle. Turning off an environment on nights and weekends can save roughly 70% on compute (an environment used about 50 of the 168 hours in a week is idle about 70% of the time), and OpenSearch Serverless NextGen finally lets you do that.
Two kinds of workload do not benefit. A workload that runs at steady, high throughput around the clock has little idle time for scale-to-zero to reclaim, and a time-series or log-analytics workload has no path to NextGen yet, because those collection types are not part of the GA release.
NextGen and Classic charge the same way and at the same unit prices. According to the OpenSearch pricing page, in us-east-1 you pay about $0.24 per OCU-hour for indexing and search compute, billed per second, and roughly $0.024 per GB-month for OpenSearch-managed storage. An OCU costs the same regardless of architecture, so the bill differs only in how many OCU-hours you consume. Classic bills for a minimum of two OCUs for the first collection, one for indexing and one for search, even when idle, while a NextGen collection group defaults to zero on both indexing and search. The two scenarios below show how much that saves for an idle-heavy workload and why a steady one sees little benefit.
One thing the figures omit: vector collections incur a separate, usage-based charge for GPU-accelerated index builds. It applies to both architectures and is small enough for incremental re-embedding, so it does not change the direction of the comparison.
Picture a vector store that an internal agent queries in bursts during business hours and leaves idle overnight and on weekends. Assume search runs about 132 hours a month, roughly six hours per day across 22 working days, and averages three search OCUs while active; a nightly re-embedding job indexes for about 22 hours a month at two indexing OCUs, and the collection holds 50 GB of vectors and metadata. The rest of the month, no requests arrive.
NextGen scales both components to zero through the idle hours and bills only for the work done: 440 OCU-hours, about $107. Classic cannot drop below its two-OCU floor, so it carries one indexing and one search OCU through every idle hour, roughly 1,300 OCU-hours of standby (about $313) on top of the same active usage. The collection is idle most of the month, so NextGen costs about 75% less.
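The arithmetic behind those figures can be reproduced with a short sketch. It uses the us-east-1 rates quoted above and assumes a 730-hour month, which is what the roughly 1,300 standby OCU-hours imply. The function and constant names are illustrative, not part of any AWS tool.

```python
# Rates quoted in the article for us-east-1.
OCU_HOUR_USD = 0.24
STORAGE_GB_MONTH_USD = 0.024
# Assumption: a 730-hour month, implied by the ~1,300 standby OCU-hours.
HOURS_PER_MONTH = 730


def monthly_costs(search_hours, search_ocus, index_hours, index_ocus, storage_gb):
    """Return (nextgen_usd, classic_usd) for one month of the scenario."""
    active_ocu_hours = search_hours * search_ocus + index_hours * index_ocus
    storage = storage_gb * STORAGE_GB_MONTH_USD
    # NextGen bills only the active OCU-hours plus storage.
    nextgen = active_ocu_hours * OCU_HOUR_USD + storage
    # Classic also carries one search and one indexing OCU through each idle hour.
    standby_ocu_hours = (HOURS_PER_MONTH - search_hours) + (HOURS_PER_MONTH - index_hours)
    classic = nextgen + standby_ocu_hours * OCU_HOUR_USD
    return nextgen, classic


nextgen, classic = monthly_costs(
    search_hours=132, search_ocus=3, index_hours=22, index_ocus=2, storage_gb=50
)
savings = 1 - nextgen / classic
print(f"NextGen ${nextgen:.2f}, Classic ${classic:.2f}, savings {savings:.0%}")
# prints: NextGen $106.80, Classic $420.24, savings 75%
```

The $107 figure includes about $1.20 of storage; the other $105.60 is the 440 active OCU-hours.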
Consider a product-search service for a web app that never goes quiet. Traffic runs at a baseline overnight and doubles during the day: something like two search OCUs for the 12 night hours and four for the 12 day hours, continuous indexing at one OCU for catalog updates, and 100 GB of stored data.
Here the two architectures cost the same, about $703, because the workload never idles long enough to scale to zero and never drops below Classic's floor. NextGen still autoscales faster, so it tracks the morning doubling more tightly and without manual capacity planning, but the bill is identical. Scale-to-zero saves nothing when there is no idle time to reclaim.
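A minimal sketch of the same calculation for the steady workload, again assuming a 730-hour month and the rates quoted earlier. The daily profile and the floor tuples are illustrative ways of encoding the scenario, not AWS configuration values.

```python
OCU_HOUR_USD = 0.24
STORAGE_GB_MONTH_USD = 0.024
HOURS_PER_MONTH = 730  # assumption: average month length in hours

# One day of demand as (hours, search_ocus, index_ocus).
DAILY_PROFILE = [(12, 2, 1), (12, 4, 1)]
CLASSIC_FLOOR = (1, 1)  # one search and one indexing OCU, per the article
NEXTGEN_FLOOR = (0, 0)  # a collection group defaults to zero on both


def monthly_compute_usd(profile, floor):
    """Compute cost when each component never drops below its floor."""
    daily_ocu_hours = 0
    for hours, search, index in profile:
        daily_ocu_hours += hours * (max(search, floor[0]) + max(index, floor[1]))
    return daily_ocu_hours * (HOURS_PER_MONTH / 24) * OCU_HOUR_USD


storage = 100 * STORAGE_GB_MONTH_USD
nextgen = monthly_compute_usd(DAILY_PROFILE, NEXTGEN_FLOOR) + storage
classic = monthly_compute_usd(DAILY_PROFILE, CLASSIC_FLOOR) + storage
print(f"NextGen ${nextgen:.2f}, Classic ${classic:.2f}")
```

Because demand is always at or above Classic's floor, the `max` never changes anything and both totals come out to about $703.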
Both results follow from the same fact: NextGen's savings come from idle time, not a lower unit price. An idle-heavy workload like the agent example can get significant savings from NextGen, while a steady workload like the web backend example sees little or no benefit. Before adopting NextGen for cost reasons, estimate the share of the month your collection sits idle. That fraction, more than any other number, determines your savings.
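One way to estimate that idle share is to replay request timestamps against the 10-minute idle window described earlier. The sketch below is a simplified model, not AWS's billing logic: it treats compute as up from each request until 10 minutes after the last one in a burst, ignores cold-start time, and assumes compute starts the period at zero. Where the timestamps come from (for example, application access logs) is up to you, and `idle_share` is a hypothetical helper name.

```python
IDLE_WINDOW_S = 10 * 60  # idle period before compute drops to zero, per the article


def idle_share(request_times_s, period_s, idle_window_s=IDLE_WINDOW_S):
    """Fraction of the period during which compute would sit at zero.

    request_times_s: sorted request timestamps, in seconds from period start,
    for every collection in the group (scale-to-zero applies to the group).
    """
    active = 0.0
    busy_until = 0.0  # end of the active interval seen so far
    for t in request_times_s:
        end = min(t + idle_window_s, period_s)
        start = max(t, busy_until)  # do not double-count overlapping windows
        if end > start:
            active += end - start
        busy_until = max(busy_until, end)
    return 1 - active / period_s


# Three requests in a 10,000-second period: two close together, one later.
print(idle_share([0, 300, 5000], period_s=10_000))
```

Since search and indexing scale independently, running this once on search timestamps and once on indexing timestamps gives a closer estimate than pooling them.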
These figures use us-east-1 on-demand rates and exclude data transfer. Prices vary by region, and a Database Savings Plan can lower both compute lines in exchange for a one-year commitment.
The adoption decision comes down to two questions:
New workloads are the easy case, with little reason to start a search or vector project on Classic unless you need time-series collections. Existing workloads need a closer look, because moving to NextGen is not an in-place upgrade.
The migration path towards NextGen is to create a new collection group and collection, reindex your data into it, and update clients to the new endpoint. Queries and index mappings carry over unchanged, so application logic stays stable. What changes is the endpoint, and with the new static regional endpoint, that is a one-time update. Treat it as a reindex-and-cutover project sized by your data volume, and estimate the costs appropriately.
The guide below maps workload patterns to a recommendation and to what you should validate before committing. Use it to aid your decision.
Keep in mind that the first request after an idle period waits while capacity is restored, so user-facing search needs pre-warming, a minimum of at least one OCU (which changes the cost equation significantly), or tolerance for that latency.
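To put a number on how much a one-OCU minimum changes the cost equation, here is a rough sketch applied to the agent scenario from earlier (132 active search hours a month). It assumes a 730-hour month and the $0.24 per OCU-hour rate, and that active hours already run above the minimum, so the floor only adds cost during hours that would otherwise be at zero.

```python
OCU_HOUR_USD = 0.24
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def floor_cost_usd(active_hours, min_ocus=1):
    """Monthly cost of holding min_ocus through hours that would otherwise be at zero."""
    idle_hours = HOURS_PER_MONTH - active_hours
    return idle_hours * min_ocus * OCU_HOUR_USD


extra = floor_cost_usd(active_hours=132)
print(f"Search floor adds ${extra:.2f} per month")
# prints: Search floor adds $143.52 per month
```

For that scenario, keeping search warm would more than double the roughly $107 NextGen bill, while still staying well below the Classic total.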
OpenSearch's Express Create gets a collection running quickly with default policies, which can be useful for testing. Production deployments should still set encryption, network, and data access policies deliberately rather than leaning on defaults.
NextGen is best understood through the core architecture change: decoupling compute from storage. That change is what lets compute reach zero, return in seconds, and scale faster than Classic, and it also defines the boundaries, from the cold start on the first request to the storage you keep paying for while idle. The 20x and 60% figures are real but conditional, tied to specific baselines and to workloads with meaningful idle time.
Adoption decisions should come from the shape of the traffic. A workload that idles often, bursts unpredictably, or serves many uneven tenants is where NextGen's savings are largest. Most new search or vector projects that do not need time-series collections should default to NextGen, unless you already know the traffic is going to be steady. For existing workloads, analyze idle time, reindex cost, and the operational details carefully before cutover. The value of NextGen is in matching what you pay to how your workload behaves, and that value is largest exactly where Classic's always-on capacity previously dominated costs.
As organizations build agentic AI applications, modernize search experiences, and optimize retrieval-augmented generation (RAG) architectures, choosing the right search and vector infrastructure becomes increasingly important. Caylent helps customers design, deploy, and optimize AI-powered search and retrieval solutions on AWS, including Amazon OpenSearch Service and Amazon OpenSearch Serverless. Whether you're evaluating OpenSearch Serverless NextGen for a new workload, assessing migration opportunities from existing search platforms, or building scalable vector search capabilities for generative AI applications, our experts can help. Reach out to us today to get started.
Guille Ojeda is a Principal Innovation Architect at Caylent, a speaker, author, and content creator. He has published 2 books, over 200 blog articles, and writes a free newsletter called Simple AWS with more than 45,000 subscribers. He's spoken at multiple AWS Summits and other events, and was recognized as AWS Builder of the Year in 2025.