Learn how to build and implement an intelligent GenAI-powered voice agent that can handle real-time complex interactions including key design considerations, how to plan a prompt strategy, and challenges to overcome.
Vinicius Silva, Cloud Software Architect at Caylent, is a technology consultant, leader, and advisor with extensive experience leading initiatives and delivering transformative solutions across diverse industries. Based in São Paulo, he has held previous roles at Bain & Company and Amazon Web Services (AWS), specializing in guiding clients through digital transformation, cost optimization, cybersecurity, DevOps, AI, and application modernization. A builder at heart, Vinicius embraces a hands-on “learn-by-doing” approach, constantly experimenting with new ideas to create innovative solutions. He thrives on coaching people and teams, sharing knowledge, and driving collaboration to help organizations leverage modern cloud technologies and stay competitive in a rapidly evolving market.
Tristan is a Senior Cloud Software Engineer at Caylent, where he helps customers design and build reliable, efficient, and scalable cloud-native applications on AWS. Originally from Milwaukee, Wisconsin, he earned his degree from Milwaukee Area Technical College and holds both the AWS Solutions Architect Associate and AWS Certified Developer Associate certifications. When he’s not developing top-tier software, Tristan enjoys traveling, working on cars, exploring the outdoors with his dog, and maintaining his freshwater aquariums.
Generative AI (GenAI) has made significant strides in recent years, starting with text-based applications and moving towards voice automation.
These recent breakthroughs are disrupting voice-based communications, like in call centers, where GenAI enables the natural understanding and generation of speech, creating space for speech-to-speech systems that can handle real-time complex interactions.
Unlike traditional Interactive Voice Response (IVR) systems, these more advanced solutions offer human-like conversations, automating tasks such as answering questions, scheduling appointments, and providing customer support without the need for key presses or providing simple multiple-choice answers that are typical of IVR systems.
For businesses, this means optimizing efficiency, enhancing user experience through a more natural interaction flow, and improving key business outcomes such as profitability, operational predictability, and uptime. Additionally, these solutions address challenges typical of high-turnover environments like call centers by minimizing training needs and streamlining operations.
Speech-to-speech technology refers to systems that take spoken input, process it, and produce spoken output in response. Unlike simple voice recognition or text-based agents, speech-to-speech mimics a real human conversation by allowing users to interact with a bot in real-time using only natural spoken language.
Creating an intelligent voice agent involves more than just understanding how speech-to-speech works—it requires careful planning and integration of various technologies to ensure smooth and adaptive conversations. This section focuses on the strategic choices and practical steps needed to build a robust voice agent, emphasizing the tool selection and planning the scenarios and actions the bot will manage.
The accuracy and effectiveness of a voice agent depend heavily on the choice of STT engine. At the time of writing this blog, the best tool we came across was Deepgram: its accuracy, pricing, and latency were the best of all offerings we evaluated. Choosing the right tool starts with understanding the specific requirements of your use case. Some engines excel at handling phone conversations, where audio quality may be lower, while others are optimized for noisy environments, such as public spaces or call centers. Key factors to weigh include transcription accuracy, latency, streaming support, language coverage, and cost.
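A concrete way to compare STT engines on your own call recordings is to measure word error rate (WER) against hand-made reference transcripts. Below is a minimal, self-contained sketch using word-level edit distance; the sample transcripts are illustrative:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level Levenshtein distance over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Score each candidate engine's transcript of the same recording
reference = "i would like to reschedule my appointment"
print(word_error_rate(reference, "i would like to reschedule my appointment"))  # 0.0
print(word_error_rate(reference, "i would like to schedule my appointment"))    # one substitution in seven words
```

Running the same recordings (ideally real phone audio from your domain) through each candidate engine and comparing WER alongside latency and price gives an apples-to-apples evaluation.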
The core of the GenAI component is powered by large language models. With Amazon Bedrock, developers can choose from various LLMs, like Anthropic’s Claude or Meta’s Llama, allowing flexibility to select the model that best fits their needs.
Selecting the right LLM is essential, as each model offers unique capabilities that can impact the behavior and performance of the voice agent. Key considerations include response quality, latency, context window size, and cost per token.
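To make the Bedrock integration concrete, here is a hedged sketch of assembling a request for Bedrock's Converse API and (commented out, since it needs AWS credentials) invoking a Claude model through `boto3`. The model ID, system prompt, and inference settings are illustrative assumptions — swap in the model and region you actually use:

```python
def build_converse_request(system_prompt: str, history: list, user_text: str) -> dict:
    """Assemble the payload for Bedrock's Converse API (keyword arguments to converse())."""
    messages = history + [{"role": "user", "content": [{"text": user_text}]}]
    return {
        # Assumed model ID for illustration; pick one available in your region.
        "modelId": "anthropic.claude-3-haiku-20240307-v1:0",
        "system": [{"text": system_prompt}],
        "messages": messages,
        # Short, low-temperature replies suit real-time voice interactions.
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.3},
    }

request = build_converse_request(
    "You are a polite appointment-scheduling voice agent. Keep replies short.",
    [],
    "Hi, I'd like to book a cleaning for next Tuesday.",
)

# Requires AWS credentials with Bedrock model access:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**request)
# reply = response["output"]["message"]["content"][0]["text"]
```

Keeping the payload construction separate from the network call makes it easy to unit-test prompt assembly and to switch models without touching the rest of the pipeline.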
TTS engines bring the responses generated by AI to life, delivering them in natural-sounding voices that enhance engagement and create a positive user experience. With a wide variety of voice options, customizable speech tones, and even emotional modulation, selecting the right TTS tool is crucial. Different platforms offer unique features that can improve both functionality and user interaction, depending on the needs of the application.
Selecting the right TTS solution involves balancing functionality, performance, and cost to match the specific goals of your project. A well-chosen voice solution not only improves user satisfaction but also reinforces brand consistency and enhances the emotional connection between the user and the system. In real-time applications, low-latency responses are essential for seamless interactions, but it's important to balance speed with voice naturalness. Making the voice sound too human-like can sometimes create discomfort, a phenomenon known as the 'uncanny valley.' Instead, focus on creating clear, reliable voices that maintain a consistent personality while being clearly distinguishable from human speech. Key factors to weigh include voice naturalness, latency, language and accent coverage, SSML support, and pricing.
Carefully defining the scenarios and actions that the system will manage is essential to building an effective voice agent. Thoughtful planning ensures that the agent responds intuitively and efficiently to user needs, delivering a smooth and engaging experience. Anticipating common use cases allows the system to handle tasks seamlessly, such as answering frequent questions, scheduling appointments, and routing customers to the right support channel.
Creating an effective speech-to-speech voice agent requires prompt engineering to ensure the system generates accurate, relevant responses. Prompts guide the LLM in understanding user inputs and delivering precise outputs. A carefully designed prompt strategy improves the conversational flow and helps the agent manage unexpected scenarios gracefully, ensuring a seamless user experience.
Building contextual prompts is essential for ensuring the LLM interprets inputs accurately and delivers relevant responses. Clear instructions within prompts help guide the model’s behavior and minimize ambiguity during interactions. For example, instead of asking an open-ended question like, “What’s your name?”, the agent can break the query into smaller steps: “Can you tell me your first name?” followed by, “And now your last name, please.” Once the name is collected, the agent can continue seamlessly: “Great, I’ve got that. Could you also confirm your email address so I can make sure everything is up to date?” This structured approach sets clear expectations for the conversation, reduces the likelihood of mistakes, and ensures that the system captures all necessary information in the correct order. By breaking down complex inputs, the agent can maintain smoother conversations and reduce the need for follow-up clarifications.
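The step-by-step collection described above maps naturally onto a small slot-filling sequence in application code, with the prompt for each field issued only once the previous one is captured. A minimal sketch, where the field names and prompt wording are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional

# Ordered slots the agent collects one at a time, mirroring the
# "first name, then last name, then email" flow described above.
SLOT_PROMPTS = [
    ("first_name", "Can you tell me your first name?"),
    ("last_name", "And now your last name, please."),
    ("email", "Great, I've got that. Could you also confirm your email address?"),
]

@dataclass
class SlotFiller:
    answers: dict = field(default_factory=dict)

    def next_prompt(self) -> Optional[str]:
        """Return the prompt for the first unfilled slot, or None when done."""
        for slot, prompt in SLOT_PROMPTS:
            if slot not in self.answers:
                return prompt
        return None

    def record(self, value: str) -> None:
        """Store the user's reply against the slot currently being asked."""
        for slot, _ in SLOT_PROMPTS:
            if slot not in self.answers:
                self.answers[slot] = value.strip()
                return

filler = SlotFiller()
print(filler.next_prompt())  # "Can you tell me your first name?"
filler.record("Maria")
filler.record("Silva")
filler.record("maria@example.com")
print(filler.next_prompt())  # None: all slots captured, move on
```

Driving the conversation from explicit slot state like this, rather than relying on the LLM alone to remember what has been asked, keeps the capture order deterministic and makes it trivial to resume after an interruption.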
Using examples in prompts helps the LLM generalize and maintain consistency across interactions. By demonstrating the expected structure and tone of a response, developers ensure the system performs reliably, even when user inputs vary slightly. Anthropic’s documentation on prompt engineering highlights the importance of few-shot examples, where multiple samples are provided to guide the model. These examples help the LLM recognize patterns, enabling it to adapt to similar queries more effectively. With this approach, the model becomes better equipped to handle unseen inputs by drawing parallels between the provided examples and real-world interactions, leading to more accurate and context-aware responses.
Consistency is improved by using role-based prompts instructing the LLM to adopt a specific persona, such as a polite support agent or an authoritative booking assistant. Defining the role helps guide the model’s tone and behavior, ensuring responses align with the interaction’s purpose. This approach reduces the chance of inconsistent or out-of-context replies. Role-based prompts in multi-step interactions—like resolving a customer inquiry—help the conversation flow naturally, even when the user introduces new or unexpected information mid-interaction. Maintaining a coherent persona throughout ensures the voice agent remains reliable and engaging across various scenarios.
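The role definition and few-shot examples described above are typically combined into a single system prompt sent with every turn. The sketch below shows one way to structure that; the persona text and sample exchanges are invented for illustration:

```python
# Role-based persona: constrains tone and behavior across the whole call.
ROLE = (
    "You are a polite, concise booking assistant for a dental clinic. "
    "Stay in this persona, keep replies under two sentences, and never "
    "invent appointment slots."
)

# Few-shot examples demonstrating the expected structure and tone.
FEW_SHOT = """\
User: I want to move my appointment.
Agent: Of course! What day works best for you?

User: Do you have anything on Friday?
Agent: Let me check Friday for you. Do you prefer morning or afternoon?
"""

def build_system_prompt() -> str:
    """Combine the persona with few-shot examples into one system prompt."""
    return f"{ROLE}\n\nExamples of good responses:\n\n{FEW_SHOT}"

print(build_system_prompt())
```

Keeping the persona and examples in one versioned template makes the iterative refinement described below much easier: each change to the prompt can be tested against a fixed set of sample conversations.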
Developing these prompts requires continuous testing and refinement, with a crucial emphasis on testing with users outside the development team. While internal testing is important, having friends, family members, or external partners interact with the system often reveals unexpected scenarios and usability issues that developers might overlook. Someone unfamiliar with the system's design will interact with it in more unpredictable ways, helping identify edge cases and natural conversation patterns that need to be accommodated. Real-world interactions often reveal points of confusion or areas where prompts may not perform as expected. Adjusting prompts based on this diverse feedback ensures that the agent becomes more reliable over time, improving its ability to handle varied situations and communication styles. Anthropic's prompt engineering strategies provide a detailed framework for this iterative process, offering insights on designing flexible prompts, avoiding ambiguous phrasing, and leveraging structured input to guide LLM responses effectively. Through ongoing refinement and diverse user testing, prompt engineering can greatly enhance the overall performance of the voice agent, ensuring that conversations remain clear, adaptive, and engaging for all users.
Handling voicemail effectively is crucial when deploying voice agents, especially for outbound calls, to avoid wasting resources or causing unwanted interactions. VoIP solutions typically offer machine detection features as the first layer of defense. These systems identify when a call reaches a voicemail, preventing the WebSocket connection from being established and blocking the voice agent from engaging. However, this layer is not foolproof—there are cases where the detection may fail, triggering the WebSocket connection and inadvertently opening a dialogue with the voicemail system.
To manage such scenarios effectively, it’s essential to implement a second layer of protection at the prompt level. Once the WebSocket connection opens, the system can use specific cues to detect voicemail behavior—such as the absence of two-way conversation, a long monotone greeting, or pre-recorded messages. A prompt designed for this purpose might contain logic like:
“If no user input is detected after 3 seconds or if a continuous stream of audio is detected without pauses, terminate the call.”
Another strategy is to monitor speech patterns that deviate from natural conversation. For example, voicemail greetings often contain longer monologues or standard phrases like, “Please leave a message after the beep.” As soon as these patterns are recognized, the system can issue an immediate call termination command to end the interaction and free up resources.
Implementing both machine detection in VoIP systems and voicemail-aware prompts within the LLM ensures the voice agent handles these scenarios efficiently. Additionally, iterative testing is essential to fine-tune the prompt and detection logic, ensuring the system remains adaptive to various voicemail formats. This dual-layer approach reduces the likelihood of the voice agent engaging unnecessarily, saving time and preventing awkward or unintended interactions.
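The two in-call cues mentioned above — scripted voicemail phrases and a long uninterrupted greeting — can also be approximated in application code before the transcript ever reaches the LLM, giving a cheap third check. A sketch with illustrative phrases and thresholds:

```python
# Common scripted voicemail phrases (illustrative; extend from real call logs).
VOICEMAIL_PHRASES = (
    "leave a message after the beep",
    "is not available right now",
    "please leave your name and number",
)

def looks_like_voicemail(transcript: str, speech_seconds: float,
                         max_monologue_seconds: float = 8.0) -> bool:
    """Heuristic voicemail check: scripted phrases, or a long greeting
    with no conversational pause. Threshold is a tunable assumption."""
    text = transcript.lower()
    if any(phrase in text for phrase in VOICEMAIL_PHRASES):
        return True
    return speech_seconds >= max_monologue_seconds

print(looks_like_voicemail(
    "Hi, you've reached Sam. Please leave a message after the beep.", 4.0))  # True
print(looks_like_voicemail("Hello?", 1.2))  # False: short, conversational opening
```

When the check fires, the application can terminate the call directly instead of waiting for the LLM to reason its way to the same conclusion, saving both latency and token cost.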
Managing interruptions in conversations with a voice agent is essential to maintaining a smooth and engaging user experience. In real-world interactions, users often interject for various reasons, and the agent must adapt appropriately to each type of interruption. A critical aspect of handling interruptions is ensuring the agent can pause its current response immediately, allowing space for the user to speak. When a user interrupts, the agent should recognize the type of interruption and respond accordingly. For temporary pauses like "Someone's at my door, one moment please," the agent might respond with "No problem, take your time. Let me know when you're ready to continue." For interruptions that change the conversation flow, such as "Wait, I actually need to check my account balance first," the agent should pivot with "Got it! Let's check your balance, and then we can return to our previous discussion if needed."
The ability to recognize and respond to user intent is equally important. Interruptions are not always off-topic; they may introduce relevant information or shift the conversation in a new direction. In such cases, the agent must pivot gracefully, extracting the new intent from the user’s input and adjusting the flow of the conversation. For example, if a user interrupts to ask, “Can I change my appointment?” while discussing billing, the agent should adapt: “Sure! Let’s update your appointment details,” and proceed with that task instead.
A robust voice agent should also include mechanisms to handle repetitive or non-essential interruptions, such as users talking over the agent without providing new information. In these cases, the system can politely manage the interruption by redirecting the conversation: “I’ll pause here. Let’s continue when you’re ready.” This type of prompt keeps the interaction polite and controlled, helping the conversation stay productive without frustrating the user.
Testing and refining the agent's behavior in handling interruptions is critical to improving its performance. Real-world interactions often reveal edge cases where the interruption logic might fail, such as overlapping inputs or ambiguous responses. Particularly important is recognizing when a user is becoming frustrated or explicitly requesting human assistance (for example, repeatedly saying "agent" or "human"). In these situations, the voice agent should smoothly transition to a human handoff rather than persisting with automated responses. This approach acknowledges the limitations of AI and prioritizes user satisfaction by connecting them with a human agent who can better handle complex or sensitive situations. For routine interruptions that the AI can handle effectively, continuous iteration based on feedback helps the agent become more adaptive, making conversations feel more intuitive over time. The key is striking the right balance between automated assistance and knowing when to escalate to human support.
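At the audio layer, the "pause immediately" requirement above is usually implemented as barge-in: TTS playback streams in small chunks and is cancelled the moment voice activity detection signals user speech. A simplified asyncio sketch, where words stand in for real audio frames and the timings are illustrative:

```python
import asyncio

async def speak(text: str, cancel: asyncio.Event) -> bool:
    """Stream a TTS utterance chunk by chunk; stop at once on barge-in.

    Returns True if the utterance finished, False if it was interrupted.
    (Word-by-word chunking stands in for real audio frames.)
    """
    for word in text.split():
        if cancel.is_set():
            return False
        # In a real agent, this is where an audio frame is pushed to the call.
        await asyncio.sleep(0.01)
    return True

async def demo() -> bool:
    cancel = asyncio.Event()
    task = asyncio.create_task(
        speak("Your appointment is confirmed for Tuesday at three", cancel))
    await asyncio.sleep(0.03)  # user starts talking mid-utterance
    cancel.set()               # VAD / STT signals barge-in
    return await task

finished = asyncio.run(demo())
print(finished)  # False: playback was cut off so the user could speak
```

After playback is cancelled, the partial transcript of what the user said is handed to the LLM, which then classifies the interruption (pause request, topic change, or escalation) and responds accordingly.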
In real-world interactions, it is common for users to stray off-topic during conversations with a voice agent. They may provide unrelated information, ask questions outside the scope of the interaction, or engage in casual talk. Managing these diversions effectively ensures the conversation remains productive and the user experience stays positive. Without proper handling, these detours can frustrate users, waste time, or confuse the voice agent, ultimately disrupting the flow of the interaction.
The first step to managing off-topic conversations is for the voice agent to recognize when the interaction has shifted. Advanced speech models can analyze the content of user inputs to detect when they fall outside the expected context. For example, if the user starts discussing the weather during a customer support call, the system can quickly identify that the input is irrelevant to the task at hand. In such cases, the agent can gently acknowledge the input while redirecting the conversation. A prompt like, “I see! Let’s get back to your appointment details,” provides a polite acknowledgment while steering the user back to the original topic.
In some cases, users may engage in persistent off-topic conversations, which can overwhelm the voice agent and disrupt the interaction. For instance, a user might attempt to engage the agent in an extended chat or provide excessive, irrelevant details. In these situations, the agent must politely but firmly set boundaries to maintain control of the conversation. For example, it could respond with: “I’m here to assist with your reservation today. Let’s continue with that task.” This type of boundary-setting ensures that the system remains focused while still maintaining a polite tone.
Managing off-topic interactions also requires tracking the conversation history and maintaining context. If a user diverts from the intended flow and then returns to the original topic, the agent must be able to pick up where it left off. This capability prevents users from needing to repeat information and ensures that the interaction continues smoothly. For example, if a user suddenly asks about store hours in the middle of confirming their email, the agent should handle the question and then return seamlessly to the confirmation task: “The store closes at 7 PM. Now, where were we? Ah, yes, we were confirming your email address.”
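One simple way to implement the "pick up where it left off" behavior described above is a task stack: the active task stays on the stack while a digression is pushed on top, and resuming pops back to the original task. A minimal sketch with illustrative task names:

```python
from typing import Optional

class ConversationContext:
    """Tracks the active task so the agent can answer a digression
    and then return to exactly where it left off."""

    def __init__(self) -> None:
        self._stack: list = []

    def start(self, task: str) -> None:
        self._stack.append(task)

    def digress(self, side_topic: str) -> None:
        """User diverted mid-task; remember the side topic on top."""
        self._stack.append(side_topic)

    def resume(self) -> Optional[str]:
        """Side topic handled: pop it and return the task to resume."""
        if self._stack:
            self._stack.pop()
        return self._stack[-1] if self._stack else None

ctx = ConversationContext()
ctx.start("confirm email address")
ctx.digress("store hours question")  # "What time do you close?"
print(ctx.resume())  # "confirm email address" -> "Now, where were we?"
```

The resumed task name can be fed back into the prompt ("We were confirming your email address") so the LLM's reply matches the application's view of the conversation state.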
Testing and refining the handling of off-topic conversations is essential for improving the performance of the voice agent. Real-world interactions often reveal unexpected scenarios where users deviate from the main flow, requiring developers to adjust prompts or logic to handle such cases effectively. Regular updates and fine-tuning ensure that the agent continues to respond appropriately to new situations, improving both accuracy and user satisfaction over time.
Ultimately, managing off-topic conversations requires balancing flexibility and focus. While the voice agent should allow some degree of natural conversation to enhance user experience, it must also remain goal-oriented, ensuring that the interaction serves its intended purpose. The agent can deliver a more coherent, responsive, and enjoyable user experience by recognizing off-topic inputs, setting boundaries, and maintaining conversation flow.
Pauses are a natural part of human conversations, but when interacting with a voice agent, prolonged silences can disrupt the flow and cause confusion. A well-designed voice agent should be prepared to manage these pauses effectively to ensure the interaction remains smooth and engaging. One common strategy is to set a threshold for detecting silence. For example, if the user doesn’t respond within 5 seconds, the agent can gently prompt with: “Are you still there?” This subtle nudge helps re-engage the user without making the interaction feel rushed, providing an opportunity for the user to continue the conversation seamlessly.
If the user remains silent beyond a longer threshold—such as 10 seconds or another specified duration—the agent should be configured to end the call gracefully. At this stage, the agent might respond with: “It seems like we’ve been disconnected. I’ll end the call now. Please reach out again if you need anything.” This approach ensures that resources are not wasted on inactive calls while also maintaining a courteous tone that leaves the user with a positive impression. Implementing this strategy helps avoid awkward moments where both parties remain idle, creating a more efficient and polished user experience.
Additionally, managing pauses effectively requires tuning the length of silence thresholds to match the specific use case and user behavior. In customer service scenarios, users might need more time to gather information or think through their responses, so a longer pause threshold might be appropriate. On the other hand, in fast-paced interactions—like voice bots for appointment scheduling—shorter silence thresholds help keep the conversation moving. Continuous monitoring and refinement of these settings ensure the agent adapts to real-world interactions, striking the right balance between patience and efficiency.
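The two-threshold policy described above reduces to a small pure function that maps elapsed silence to the agent's next move. The 5- and 10-second defaults below mirror the examples in the text and should be tuned per use case:

```python
def silence_action(seconds_silent: float,
                   reprompt_after: float = 5.0,
                   hangup_after: float = 10.0) -> str:
    """Map elapsed user silence to the agent's next move.

    Default thresholds follow the examples above and are meant to be tuned:
    longer for deliberative customer-service calls, shorter for quick
    scheduling flows.
    """
    if seconds_silent >= hangup_after:
        return "end_call"   # "It seems like we've been disconnected..."
    if seconds_silent >= reprompt_after:
        return "reprompt"   # "Are you still there?"
    return "wait"

print(silence_action(2.0))   # wait
print(silence_action(6.5))   # reprompt
print(silence_action(11.0))  # end_call
```

Because the thresholds are parameters rather than hard-coded, the same logic can serve patient customer-support flows and fast-paced scheduling bots with different configurations.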
Misunderstandings are common in voice-based interactions, especially when dealing with challenging inputs such as complex names, email addresses with special characters, or technical terms. Various strategies can help voice agents handle these scenarios effectively, ensuring smoother interactions and minimizing frustration. One proven technique is requesting inputs in smaller, manageable chunks. For example, when asking for a name, the agent might prompt: “Could you spell your first name for me, one letter at a time?” This approach reduces ambiguity, especially with uncommon or difficult-to-pronounce names, and allows the system to capture the information accurately on the first attempt.
Handling email addresses can be particularly tricky, as users often speak them quickly or include special characters that are hard to interpret. Leveraging Speech Synthesis Markup Language (SSML) can improve the interaction by controlling the way the agent reads back or requests information. For instance, SSML can adjust the pacing, emphasize specific letters, or pause appropriately between characters to make spelling easier. A prompt might say: “Please confirm your email address: ‘example at domain dot com.’ I’ll repeat it—‘e-x-a-m-p-l-e at d-o-m-a-i-n dot com.’ Did I get that right?” These adjustments help the agent ensure accuracy and prevent the need for multiple retries.
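As a concrete illustration of the SSML readback above, the helper below wraps each part of an email address in the standard SSML `<say-as interpret-as="characters">` element with short breaks between segments. Exact tag support varies by TTS engine, so treat this as a sketch and check your provider's SSML reference:

```python
from xml.sax.saxutils import escape

def spell_out_ssml(email: str) -> str:
    """Build SSML that reads an email address back slowly, character by character."""
    local, _, domain = email.partition("@")
    return (
        "<speak>"
        f'<say-as interpret-as="characters">{escape(local)}</say-as>'
        '<break time="300ms"/> at <break time="300ms"/>'
        f'<say-as interpret-as="characters">{escape(domain)}</say-as>'
        "</speak>"
    )

print(spell_out_ssml("example@domain.com"))
```

Reading "at" aloud instead of sending the raw `@` through the character spell-out keeps the readback natural while every ambiguous character is still spelled individually.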
In cases where the system remains unsure about the input, fallback mechanisms become essential. When confidence scores from the speech-to-text (STT) engine fall below a set threshold, the agent can ask for confirmation or offer a polite clarification: “I’m not sure I got that right. Could you repeat your email address or spell it for me?” Additionally, adaptive prompts allow the agent to pivot if necessary. For example, if a user struggles with pronunciation, the system might switch to an alternate input method: “If it’s easier, you can also provide your details by text or email after this call.” These fallback strategies ensure the conversation remains fluid, even when initial inputs are unclear, ultimately enhancing the user experience.
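The confidence-based fallback described above can be expressed as a small decision function over the STT confidence score and the number of retries so far. The threshold and retry limit are illustrative assumptions to be tuned against real transcripts:

```python
def choose_followup(transcript: str, confidence: float,
                    retry_count: int, threshold: float = 0.75) -> str:
    """Pick a fallback strategy from STT confidence (values are tunable assumptions)."""
    if confidence >= threshold:
        # Confident enough: proceed, optionally reading the value back.
        return f"confirm:{transcript}"
    if retry_count < 2:
        # Low confidence: ask the user to repeat or spell the input.
        return "clarify"
    # Repeated failures: pivot to an alternate channel instead of looping.
    return "offer_alternate_channel"

print(choose_followup("maria@domain.com", 0.92, 0))  # confirm:maria@domain.com
print(choose_followup("unclear input", 0.40, 0))     # clarify
print(choose_followup("unclear input", 0.40, 2))     # offer_alternate_channel
```

Capping clarification retries before offering another channel prevents the frustrating loop of an agent asking "Could you repeat that?" indefinitely.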
Once the voice agent has completed all tasks—such as confirming details or booking an appointment—it should move directly toward ending the interaction. Rather than recapping every action, the agent can simply confirm that everything is in order and signal the conversation’s conclusion. For example, it might say: “Everything is set. If there’s nothing else, I’ll go ahead and end the call now.” This keeps the interaction efficient while still giving the user an opportunity to raise any final concerns.
If the user has no additional input, the agent should politely signal the end of the call. A concise phrase such as “Thank you for calling! Have a great day!” provides a clear closing statement and ensures the user understands the interaction is complete. If the user hesitates or seems unsure about whether to end the call, the system can offer a gentle confirmation: “I’ll end the call now. Feel free to reach out if you need anything else.”
Once the call concludes, the system should disconnect promptly and smoothly, avoiding awkward pauses or lingering silence. In some cases, the agent can offer follow-up actions, such as sending a confirmation via email or text: “You’ll receive an email shortly with your appointment details. Thank you again!” This streamlined approach ensures the conversation ends efficiently, with no ambiguity or unnecessary repetition, leaving the user with a positive, polished experience.
By confirming that everything is in order, signaling the end of the call clearly, and disconnecting smoothly, the voice agent can leave users with a positive impression, encouraging future interactions and reinforcing trust in the service.
While advancements in Generative AI and speech-to-speech technology have made voice agents more capable than their predecessors, significant challenges remain. One of the biggest lessons from past voice assistants—such as Amazon Alexa—is that failing to handle real-world interactions effectively can severely limit adoption. Many users abandoned these systems because they did not maintain conversation context well, often forgot prior inputs, and struggled to handle user frustration when misunderstandings occurred.
To build a truly intelligent voice agent, it’s critical to address these limitations through better real-time transcription accuracy, improved context retention, and frustration-aware response mechanisms. The following sections dive into the key challenges that must be tackled to achieve this.
Building an intelligent speech-to-speech agent using GenAI combines the best of both voice and conversational AI technologies. By integrating high-quality STT, TTS, and LLMs, companies can easily create voice agents that handle complex conversations, providing seamless user experiences. As generative AI continues to advance, the possibilities for intelligent voice agents will only grow, making this an exciting area for developers and businesses alike to explore.