2025 GenAI Whitepaper

Introducing Amazon Nova Sonic: Real-Time Conversation Redefined

Generative AI & LLMOps

Explore Amazon Nova Sonic, AWS’s new unified Speech-to-Speech model on Amazon Bedrock, which enables real-time voice interactions with ultra-low latency and enhances the user experience in voice-first applications.

Creating voice-enabled applications on AWS isn't terribly complex, but achieving a truly seamless, real-time conversational experience is still a challenge. AI is evolving rapidly, and while foundation models have become remarkably more capable, their true value is only unlocked when they are integrated effectively into user-facing applications. User experience is the key here, and for many interactions, traditional text-based chat simply falls short. Users demand the speed, clarity, and natural flow inherent in voice communication.

This demand is pushing the boundaries of conversational interfaces. We're moving beyond basic chatbots towards systems capable of understanding and responding using speech itself, often across different languages. This is the domain of Speech-to-Speech (S2S) translation and interactions, a technology poised to fundamentally change how we interact with systems and each other.

In this article, we'll explore why Speech-to-Speech is becoming essential for modern applications. We'll take a deep dive into a significant new development from AWS designed specifically for this challenge: the Amazon Nova Sonic model, accessible via Amazon Bedrock. We'll analyze its capabilities, examine different architectural approaches to S2S (including how AWS compares to offerings from OpenAI and ElevenLabs), discuss conceptual cost considerations, and explore how AWS builders can leverage these technologies to craft the next generation of voice-first user experiences.

UX First: Why Speech-to-Speech Matters Now

So far, chatbots and text interfaces have dominated AI interactions. Historically (that is, before the Generative AI boom), chatbots were useful for simple data retrieval operations, with the advantage of moving those capabilities from dedicated screens and websites to more easily accessible places like a popup bubble on every page of a website or a WhatsApp chat. Modern Large Language Models (LLMs) made them even more powerful in their understanding of queries, and the introduction of Retrieval-Augmented Generation (RAG), tools, and agentic architectures enabled more complex use cases.

But no matter how powerful they get, chatbots introduce unnecessary friction in user interactions. Consider navigating a complex IVR system, participating in a multilingual video conference, or trying to get information hands-free while driving. Typing is often slow, inconvenient, or impossible. We need a new medium.

As LLMs have made software feel less like a mindless robot, users have quickly come to expect interactions that mirror human conversation:

  • Fast: Immediate responses. Latency is the enemy of natural conversation flow.
  • Clear: An accurate understanding of the input, and a clear, comprehensible output.
  • Natural: Interactions shouldn't feel robotic; they should handle nuance and meaning.

Delivering this level of user experience requires moving beyond text. While powerful AI models provide the underlying intelligence, the application layer defines the experience. Speech-to-Speech technology directly addresses the need for speed, clarity, and naturalness, offering a more intuitive, efficient, and human-centric way to interact.

The Next Evolution in Conversational Interfaces

Speech-to-Speech unlocks capabilities fundamentally difficult or impossible to achieve with text alone:

  • Emotional Nuance: Speech carries prosody, tone, and emphasis that convey meaning beyond words. Advanced S2S systems can interpret this input nuance and generate responses with appropriate emotional tone, which is vital for empathetic interactions.
  • Hands-Free Accessibility: S2S enables truly hands-free operation, essential not only for convenience but also for accessibility for users with visual or motor impairments, and for safety in hands-occupied situations.
  • Real-time Multilingual Collaboration: S2S can dissolve language barriers instantly. Imagine a global meeting where participants speak their native language and hear real-time translations in their own language, even maintaining aspects of the original speaker's vocal style. This fosters seamless collaboration and inclusivity, and opens up markets in a way that a shared, agreed-upon language could only dream of.

These capabilities enable transformative use cases: multilingual customer support, streamlined healthcare intake, accessible real-time virtual meetings, and more intuitive multilingual voice assistants, reshaping user experiences across industries.

Introducing Amazon Nova Sonic

Addressing the need for sophisticated, real-time voice interactions, AWS introduces the Amazon Nova Sonic model, available via Amazon Bedrock. This is a unified foundation model architecture, engineered specifically for end-to-end spoken language conversion and interaction.

What architectural advantages does a "unified" model offer? Traditional S2S often involves chaining separate AI services: Speech-to-Text (STT), Machine Translation (MT), text-only Large Language Models (LLMs), and Text-to-Speech (TTS). So far, we've been orchestrating services like Amazon Transcribe, Amazon Translate, Amazon Bedrock (with text models like Amazon Nova Pro or Claude 3.7 Sonnet), and Amazon Polly, often glued together with AWS Lambda. While this approach has proven effective, each handoff introduces potential latency from network calls, data serialization/deserialization, and distinct processing steps.
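
To make those chained handoffs concrete, here is a minimal sketch of one conversational turn in Python with boto3. It assumes the user's speech has already been transcribed to user_text (for example, by Amazon Transcribe streaming), and the model and voice IDs are illustrative rather than prescriptive:

```python
import boto3

# Sketch of one turn of the chained pipeline. Assumes `user_text` was already
# produced by an STT step (e.g., Amazon Transcribe streaming); model and voice
# IDs below are illustrative, not prescriptive.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
polly = boto3.client("polly", region_name="us-east-1")

def chained_s2s_turn(user_text: str) -> bytes:
    """Text LLM response via Bedrock, then speech synthesis via Amazon Polly."""
    # 1. Generate the assistant's reply with a text model on Bedrock.
    llm_response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative text model
        messages=[{"role": "user", "content": [{"text": user_text}]}],
    )
    reply_text = llm_response["output"]["message"]["content"][0]["text"]

    # 2. Turn the reply into audio with a neural Polly voice.
    speech = polly.synthesize_speech(
        Text=reply_text,
        OutputFormat="mp3",
        VoiceId="Joanna",
        Engine="neural",
    )
    return speech["AudioStream"].read()
```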

A unified model like Nova Sonic aims to streamline this. By fusing speech recognition, translation, interpretation, response generation, and speech synthesis within a single model, it can significantly reduce this inter-service overhead. Architecturally, this means fewer network roundtrips and less data transformation, contributing to the ultra-low latency needed for natural conversation. Internally (and this is speculative, based on general FM principles), such a model might leverage shared embedding spaces or sophisticated multi-modal transformer architectures capable of processing and generating both audio and textual/phonetic representations within a unified computational graph.

Key capabilities highlighted for the Nova Sonic model include:

  • Real-time Streaming & Low Latency: Purpose-built for conversations, processing audio bidirectionally for fluid dialogue and interruption handling.
  • Adaptive Speech Responses: Aims to modulate output tone and sentiment based on input speech characteristics.
  • Natural Turn-Taking: Incorporates mechanisms for handling conversational dynamics like pauses and interruptions.
  • Expressive, Consistent Voices: Provides high-quality neural voices designed for consistency.
  • Contextual Understanding: Utilizes significant context windows (32K speech tokens and 300K text tokens) for coherent conversations.
  • Privacy-First Design on AWS: Adheres to AWS security and data privacy standards.
  • Seamless AWS Ecosystem Integration: Designed for integration with Lambda, Lex, SageMaker, and future direct integrations with Agents for Bedrock and Amazon Connect.

The emergence of unified models like Nova Sonic on Bedrock signals a shift towards more integrated and higher-performance solutions for complex AI tasks like real-time S2S.

This article is based on the private gated preview that Caylent had access to during March 2025, as part of our agreement with AWS as a Premier Partner.

Architectural Approaches & Competitive Landscape

When building AI-powered applications, we need to consider not just specific models but also the fundamental architecture. Let's compare the different approaches to Speech-to-Speech applications available through AWS and other major players like OpenAI and ElevenLabs.

Architectural Styles:

AWS Multi-Service Pipeline (Chained): This established approach uses separate, specialized AWS services: Amazon Transcribe (STT) -> Amazon Bedrock with text models to generate a response -> Amazon Polly (TTS), typically orchestrated by AWS Lambda.

  • Pros: Mature services, high control over each step, flexibility to swap components, broad language/feature support across individual services.
  • Cons: Potential for accumulated latency at each service hop, more complex orchestration logic.

AWS Unified Model (Nova Sonic on Bedrock): A single, integrated model designed for end-to-end S2S.

  • Pros: Significantly lower latency, simpler architecture (fewer moving parts), integrated features like adaptive response and natural turn-taking.
  • Cons: Newer technology (just out of preview), potentially less granular control over intermediate steps compared to the chained approach, initial feature/language set might be limited compared to the combined scope of mature services.

OpenAI Realtime API (Multimodal): Uses a single, multimodal model (gpt-4o-realtime-preview) via WebSockets or WebRTC for direct audio-in, audio-out processing.

  • Pros: Designed for low latency, understands nuances like emotion from audio, bypasses explicit text steps.
  • Cons: Can be significantly more expensive (tokenizes audio), technology is in preview, potentially less control than chained approaches.

OpenAI Chained Architecture: Similar to the AWS multi-service approach, using OpenAI's STT (gpt-4o-transcribe or whisper-1) -> text LLM (e.g., gpt-4o for reasoning/translation) -> TTS (gpt-4o-mini-tts or tts-1/tts-1-hd).

  • Pros: High control, access to text, leverages powerful LLMs for reasoning, steerable TTS (gpt-4o-mini-tts).
  • Cons: Higher latency than the Realtime API, involves managing multiple API calls, and offers weaker native integration than an AWS-native approach.

ElevenLabs (Chained TTS-focused): Primarily known for high-quality TTS and voice cloning. S2S functionality appears to be achieved via their TTS API ("Speech Synthesis" with STS feature), potentially using STT internally or requiring external STT. Focuses on transforming voice characteristics.

  • Pros: State-of-the-art voice cloning and quality, extensive voice library, steerable TTS.
  • Cons: S2S seems focused on voice conversion rather than full translation workflows, architecture likely chained (requiring STT + TTS calls), pricing based on credits/hours.

Capability Comparison:

  • Real-Time Translation: AWS (Translate API, Nova Sonic), OpenAI (via LLM in chained, potentially Realtime API), ElevenLabs (less clear if translation is primary focus vs. voice conversion). Latency varies; unified/Realtime models aim for lowest latency.
  • Voice Retention/Cloning: ElevenLabs is a strong player here (instant & professional cloning). AWS offers Polly Brand Voice (custom), and Nova Sonic can interpret emotions. OpenAI TTS models have built-in voices but gpt-4o-mini-tts offers steerability (accent, emotion) which aids in mimicking styles. Achieving true speaker identity preservation in translation remains challenging across platforms.
  • Multilingual Support: AWS has broad coverage across Transcribe, Translate, and Polly, and text models like Amazon Nova Pro support over 200 languages. Amazon Nova Sonic supports EN-US in the preview version, with DE, ES, FR, and IT added at GA. OpenAI's STT/TTS models support many languages. ElevenLabs supports numerous languages for STT and TTS.
  • Enterprise Integration & Tooling: AWS excels with deep integration (IAM, VPC, CloudWatch), comprehensive SDKs, serverless options (Lambda), and enterprise governance. OpenAI offers SDKs and APIs but less native cloud integration. ElevenLabs provides SDKs and APIs, with enterprise tiers offering more support (SLAs, BAAs). For AWS-centric organizations, the native integration is a significant advantage.
  • Data Privacy & Security: AWS provides strong commitments, compliance certifications (HIPAA, SOC 2), and data control (encryption, opt-outs). OpenAI's policies are evolving. ElevenLabs offers enterprise agreements (DPA, SLA, BAA) at higher tiers. AWS's standard security posture and compliance are often key for enterprises.

Architectural Choice:

For organizations architecting on AWS:

  • The multi-service pipeline (which can include text LLMs via Amazon Bedrock) is mature, flexible, and offers fine-grained control, suitable for many applications today, especially if extreme low latency isn't the absolute top priority or if specific features of individual services are needed.
  • Nova Sonic represents the future direction – potentially simpler and faster for real-time conversational use cases. It's compelling for new projects aiming for the most natural interaction, keeping in mind that it has just been released.
  • Leveraging OpenAI or ElevenLabs via API calls from within an AWS environment (e.g., from Lambda) is also feasible, especially if specific features like ElevenLabs' voice cloning or OpenAI's LLM reasoning are primary drivers. However, this introduces cross-cloud dependencies and potentially different security/compliance considerations.

The choice depends heavily on latency requirements, the need for intermediate text access, desired voice characteristics, existing infrastructure, and enterprise governance needs. And of course, on pricing.

Cost Considerations

Understanding the potential cost implications of different S2S architectures is of vital importance for planning and budgeting. While precise costs depend heavily on usage patterns, model choices, and specific pricing (which can change), we can analyze the conceptual differences.

Billing Models:

AWS Multi-Service Pipeline: You pay for each service individually.

  • Amazon Transcribe: Priced per second of audio processed (with a minimum duration per request for streaming). Different rates for standard vs. medical.
  • Amazon Translate: Priced per character translated.
  • Amazon Bedrock (text models): Priced per input and output tokens.
  • Amazon Polly: Priced per character synthesized (Neural TTS usually costs more than Standard).
  • AWS Lambda (Orchestration): Priced per request and compute duration (GB-seconds).

AWS Unified Model (Nova Sonic): Pricing isn't public yet. It could be priced per second/minute of interaction, potentially with different rates for input/output audio, or perhaps token-based like other Bedrock models. What's certain is that a single invocation will cover the end-to-end S2S task, simplifying billing compared to tracking multiple separate services.

OpenAI Realtime API: Priced per million tokens for both text and audio input/output. Audio tokenization generally results in higher costs per minute compared to text. Cached pricing offers discounts for repeated inputs.

OpenAI Chained Pipeline: Pay per service used.

  • STT (gpt-4o-transcribe, whisper-1): Priced per minute of audio.
  • LLM (gpt-4o): Priced per million input/output text tokens.
  • TTS (gpt-4o-mini-tts, tts-1, tts-1-hd): Priced per million input characters/tokens.

ElevenLabs: Tiered subscription model with included usage and overages.

  • STT (Scribe v1): Priced per hour of audio transcribed (beyond included hours).
  • TTS: Priced based on credits consumed (roughly 1000 credits per minute), with different tiers offering varying credits/cost.

Example Interaction Analysis:

Consider a brief 2-turn voice interaction:

  1. User: (Speaks 5 seconds) "Translate 'Hello, how are you?' to Spanish."
  2. AI: (Synthesizes 3 seconds) "Hola, ¿cómo estás?"
  3. User: (Speaks 4 seconds) "And how do you say 'Thank you'?"
  4. AI: (Synthesizes 2 seconds) "Gracias."

Let's analyze the cost/latency factors:

AWS Multi-Service:

  • Cost: Charged for ~9 seconds of Transcribe, ~X characters for Translate (both requests), ~Y characters for Polly (both responses), plus 2 Lambda invocations.
  • Latency: Latency added at each step: User Audio -> Transcribe -> Lambda -> Translate -> Lambda -> Polly -> AI Audio. Multiple network roundtrips between services.
  • Note: For intelligent response generation, we'd use Bedrock instead of Translate, priced per input and output tokens; the rest remains the same. A rough back-of-envelope sketch of how these line items add up follows below.
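
To make those line items concrete, here is a minimal back-of-envelope sketch for the 2-turn example. Every unit price below is a placeholder, not an actual AWS rate; substitute the current figures from the Transcribe, Bedrock, Polly, and Lambda pricing pages before drawing any conclusions.

```python
# Back-of-envelope for the 2-turn example above. Every unit price here is a
# PLACEHOLDER, not a real AWS rate -- substitute current figures from the
# Transcribe, Bedrock, Polly, and Lambda pricing pages.
transcribe_per_second = 0.0004      # placeholder USD per second of audio
bedrock_per_1k_input = 0.0008       # placeholder USD per 1K input tokens
bedrock_per_1k_output = 0.0032      # placeholder USD per 1K output tokens
polly_neural_per_char = 0.000016    # placeholder USD per character
lambda_per_request = 0.0000002      # placeholder USD per request (GB-seconds ignored)

stt_seconds = 5 + 4                                   # both user utterances
llm_input_tokens, llm_output_tokens = 60, 20          # rough counts for both turns
tts_characters = len("Hola, ¿cómo estás?") + len("Gracias.")
lambda_invocations = 2

total = (
    stt_seconds * transcribe_per_second
    + llm_input_tokens / 1000 * bedrock_per_1k_input
    + llm_output_tokens / 1000 * bedrock_per_1k_output
    + tts_characters * polly_neural_per_char
    + lambda_invocations * lambda_per_request
)
print(f"Illustrative chained-pipeline cost for the 2-turn exchange: ~${total:.6f}")
```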

AWS Nova Sonic:

  • Cost: Likely charged based on total interaction time or tokens processed by the single model invocation. Much simpler to track.
  • Latency: Lower latency due to unified processing within Bedrock and fewer network hops. Note: this isn't speculative; we've thoroughly tested the Nova Sonic preview.

OpenAI Realtime:

  • Cost: Charged for audio input tokens (~9 sec), audio output tokens (~5 sec), plus any text tokens involved in the prompt/response internally. Likely the most expensive option based on audio tokenization rates.
  • Latency: Designed for low latency, similar potential to Nova Sonic.

OpenAI Chained:

  • Cost: Charged for ~9 seconds of STT, ~Z text tokens for LLM (processing translation requests), ~W characters/tokens for TTS.
  • Latency: Higher latency due to sequential calls: STT -> LLM -> TTS.

ElevenLabs (Assuming Chained STT+TTS):

  • Cost: Consumes STT hours/credits for ~9 seconds, TTS credits for ~5 seconds of output. Cost depends heavily on the subscription tier. Translation would require an additional service (like AWS Translate or an LLM).
  • Latency: Higher latency due to chained nature and potentially separate translation step.

Key Takeaway: Unified models (Nova Sonic, OpenAI Realtime) promise lower latency but may carry different, potentially higher, cost structures than traditional chained pipelines (OpenAI's audio tokenization is a case in point). Chained pipelines offer granular cost tracking per function (STT, MT, TTS) but accumulate latency.

Ultimately, the success of Nova Sonic will depend on where its pricing lands. If it can offer a total cost comparable to that of the (as of now) cheaper chained approach, it will be a game changer. If its pricing is significantly higher, it will compete with OpenAI’s and ElevenLabs’ realtime solutions, where it will reign if it manages to undercut their prices (we've already tested the performance, and it's on par).

Built for Builders: Fast, Flexible, AWS-Native

Leveraging the Nova Sonic model on Amazon Bedrock fits naturally within the AWS ecosystem, offering developers familiar patterns and powerful tools to build voice-first applications.

Serverless with AWS Lambda

AWS Lambda remains a prime candidate for handling the application logic around Nova Sonic interactions. Functions can manage user sessions, potentially pre-process or post-process data, interact with the Bedrock API to invoke Nova Sonic, and connect with other AWS services like DynamoDB for state or S3 for storage. Even with a unified model, Lambda often serves as essential orchestration glue.
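
As a minimal sketch of that orchestration glue (the DynamoDB table name, event shape, and model ID are hypothetical, and the Nova Sonic-specific audio streaming is omitted for brevity):

```python
import json
import boto3

# Hypothetical resources for illustration: the DynamoDB table name, event shape,
# and model ID are placeholders; Nova Sonic's audio streaming is omitted.
dynamodb = boto3.resource("dynamodb")
sessions = dynamodb.Table("voice-sessions")
bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    """Minimal orchestration glue: load session state, call Bedrock, persist state."""
    session_id = event["session_id"]
    user_text = event["user_text"]  # transcript or other intermediate text

    # Load prior turns so the model keeps conversational context.
    item = sessions.get_item(Key={"session_id": session_id}).get("Item", {})
    history = item.get("history", [])

    messages = history + [{"role": "user", "content": [{"text": user_text}]}]
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative text model
        messages=messages,
    )
    reply = response["output"]["message"]["content"][0]["text"]

    # Persist the updated history for the next turn.
    messages.append({"role": "assistant", "content": [{"text": reply}]})
    sessions.put_item(Item={"session_id": session_id, "history": messages})

    return {"statusCode": 200, "body": json.dumps({"reply": reply})}
```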

Amazon Bedrock Native

As a Bedrock model, Nova Sonic benefits from Bedrock's managed infrastructure, unified API access, security controls, and monitoring via CloudWatch. Developers interact with it using the Bedrock API actions (InvokeModelWithResponseStream adapted for bidirectional flow), simplifying deployment and management compared to self-hosting models. Moreover, integration with Agents for Amazon Bedrock is planned to be released soon.
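
For reference, the familiar unidirectional streaming pattern with a Bedrock text model looks roughly like this in boto3. Nova Sonic's bidirectional flow builds on the same runtime but uses a different event-stream interface, so treat this only as the baseline pattern; the model ID and request schema shown are illustrative.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Unidirectional streaming invoke of a text model; the request body shown here
# follows the Nova text-model schema and is for illustration only -- other
# model families expect different bodies.
response = bedrock.invoke_model_with_response_stream(
    modelId="amazon.nova-pro-v1:0",
    body=json.dumps({
        "schemaVersion": "messages-v1",
        "messages": [{"role": "user", "content": [{"text": "Say hello in Spanish."}]}],
        "inferenceConfig": {"maxTokens": 128},
    }),
)

# The body is an event stream: chunks arrive as the model generates them.
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    print(chunk)  # partial output events; exact structure depends on the model family
```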

Orchestration Frameworks

For managing complex conversational state, context, and prompts, frameworks like LangChain can be used effectively with Bedrock models. LangChain helps structure interactions, maintain history, and potentially chain Nova Sonic with other Bedrock models or tools if needed (though full reasoning integration is planned for Nova Sonic GA).
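
A minimal sketch of that pattern, assuming the langchain-aws package and an illustrative Bedrock text model (LangChain does not drive Nova Sonic's audio streams directly today):

```python
from langchain_aws import ChatBedrock
from langchain_core.messages import HumanMessage, SystemMessage

# Illustrative: LangChain manages prompts and history around a Bedrock text model.
llm = ChatBedrock(
    model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    region_name="us-east-1",
)

history = [SystemMessage(content="You are a concise multilingual voice assistant.")]
history.append(HumanMessage(content="Translate 'Hello, how are you?' to Spanish."))

reply = llm.invoke(history)
print(reply.content)
history.append(reply)  # keep the assistant turn so context carries into the next turn
```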

SDKs & API Access

AWS provides extensive SDKs to facilitate integration with its services, and Nova Sonic has its own SDK for several languages, which includes a bidirectional streaming API. This bidirectional API is key: unlike traditional request-response or unidirectional streams (like Transcribe streaming output), it allows developers to send user audio chunks and receive AI audio chunks concurrently over a persistent connection. This is what enables natural turn-taking and low-latency barge-in: the application doesn't have to wait for the user to finish speaking before processing begins, and the AI can start responding almost immediately, even interrupting if appropriate, mirroring human conversation dynamics. Implementing this requires careful management of the audio stream on the client/application side, but it unlocks a significantly more fluid user experience.
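
Conceptually, the application code ends up looking something like the sketch below. The session object and its methods are hypothetical stand-ins rather than the actual Nova Sonic SDK surface; the point is that sending user audio and receiving model audio happen concurrently instead of in strict turns.

```python
import asyncio

# Conceptual sketch only: the session object and its methods are hypothetical
# stand-ins, NOT the actual Nova Sonic SDK surface. The point is that sending
# user audio and receiving model audio run concurrently, enabling barge-in.

class FakeBidirectionalSession:
    """Stub that echoes a few audio 'chunks' to simulate model output."""
    def __init__(self):
        self._outgoing = asyncio.Queue()

    async def send_audio_chunk(self, chunk: bytes) -> None:
        await self._outgoing.put(b"model-audio-for-" + chunk)  # pretend the model answers

    async def receive_audio_chunks(self):
        for _ in range(3):
            yield await self._outgoing.get()

async def run_conversation(session):
    async def send_user_audio():
        for chunk in (b"hello", b"how", b"are-you"):  # stand-in for microphone capture
            await session.send_audio_chunk(chunk)
            await asyncio.sleep(0.05)                 # audio arrives over time

    async def play_model_audio():
        async for chunk in session.receive_audio_chunks():
            print("playing:", chunk)                  # stand-in for speaker playback

    # Both loops run at the same time: the app never waits for the user to finish
    # speaking before model audio can start flowing back.
    await asyncio.gather(send_user_audio(), play_model_audio())

asyncio.run(run_conversation(FakeBidirectionalSession()))
```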

Future Native Integrations

The potential for future direct integrations with Agents for Amazon Bedrock (for task automation driven by voice) and Amazon Connect (for seamless multilingual contact center experiences) further enhances the value proposition, making it easier to embed advanced S2S capabilities deeply within specific AWS solutions.

Building with Nova Sonic means utilizing a cutting-edge model within the robust, scalable, and developer-centric AWS environment, supported by familiar tools and integration patterns, while leveraging new API paradigms like bidirectional streaming for enhanced conversationality.

Responsible AI and Privacy

Any application that uses AI, and especially a voice-first one, necessitates a strong focus on responsible use, data privacy, and security. AWS builds its AI services with these principles as foundational pillars.

  • Data Handling and Security: AWS maintains rigorous data privacy commitments. Data (speech, text) is encrypted in transit and at rest. Customers retain ownership and control over their content. AWS does not use customer content processed by services like Bedrock, Polly, or Translate to train base models. Specifically for Nova Sonic, the assurance that speech data isn't stored reinforces this customer control.
  • Compliance: AWS helps organizations meet demanding compliance requirements. Services like Transcribe and Polly are HIPAA Eligible, and the platform supports GDPR compliance efforts. Regular SOC 2 audits attest to AWS's operational controls. For regulated industries, AWS's built-in compliance posture is often a critical factor.
  • Responsible AI: AWS provides tools and emphasizes guidelines for responsible AI deployment. This includes efforts towards model fairness (evaluating performance across demographics), transparency (e.g., confidence scores in Transcribe), accountability, and safety (e.g., content moderation features and guardrails for PII redaction or toxic speech detection).

Building on AWS allows organizations to leverage these inherent security features, compliance frameworks, and responsible AI tools, fostering trust and enabling the ethical deployment of Speech-to-Speech AI applications.

Try It Today

You can try Amazon Nova Sonic today in our Bedrock Battleground application. It's also available on the Amazon Bedrock console and via the Bedrock API and SDK, under the model ID amazon.nova-sonic-v1:0.
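
A quick way to confirm the model is visible in your account and Region is to list Bedrock's foundation models; a minimal boto3 sketch follows (availability varies by Region, and model access must be enabled in the Bedrock console).

```python
import boto3

# Check whether Nova Sonic is listed in this account and Region.
bedrock = boto3.client("bedrock", region_name="us-east-1")
model_ids = [m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]]

sonic_ids = [mid for mid in model_ids if mid.startswith("amazon.nova-sonic")]
if sonic_ids:
    print("Nova Sonic available:", sonic_ids)
else:
    print("Nova Sonic not listed here; check the Region and model access settings.")
```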

Final Word: The UX Era of AI Has a Voice

We're firmly in an era where AI's value is increasingly measured by the quality of the user experience it enables. The value is in the application layer, and voice is rapidly transitioning from a novelty to a fundamental component of intuitive and efficient application design. Users expect interactions that are as natural and immediate as human conversation, not just in content but across the entire experience.

Delivering on this expectation requires powerful, responsive Speech-to-Speech capabilities. With the introduction of Amazon Nova Sonic, AWS is equipping builders with the tools they need to live up to both current and future expectations. This signals a significant architectural evolution, paving the way for lower latency, more natural interactions, and seamless multilingual communication, supported by the secure, scalable, and developer-centric AWS cloud.

Whether enhancing customer support, improving healthcare accessibility, enabling global collaboration, or creating sophisticated voice assistants, AWS provides the foundational AI services, orchestration tools, and robust infrastructure needed. For us building on AWS, the future of conversational AI is not just about a powerful backend exposed via text; it's about mastering the nuances of voice to create seamless human interactions. The UX era of AI has found its voice, and AWS is providing the platform to make it heard.


Guille Ojeda

Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.
