AI adoption is colliding with a hard truth: people will not trust chatbots that make things up. In high-stakes areas such as law, healthcare, or finance, even a single hallucination can erode user confidence and create real business or legal risk. That is why grounding, the practice of ensuring model responses are factually tied to the right context, has become one of the most critical challenges in applied AI.
Retrieval-Augmented Generation (RAG) helps address this by allowing models to pull knowledge directly from trusted documents. The next wave, agentic RAG, adds another layer: these systems coordinate multi-step reasoning and retrieval, often using tools and external functions to construct more informed answers. The complexity is greatest in multi-turn conversations, where the agent's response depends not only on retrieved documents but also on the evolving dialogue history.
This sophistication unlocks new capabilities, but it also raises a difficult question: how do we know the answers are still grounded in the right sources? Evaluation methods designed for single-turn Q&A often fall short when applied to multi-turn, context-dependent exchanges.
At Caylent, we addressed this challenge while developing an agentic RAG chatbot for a legal services firm to help users navigate complex legal topics. The system leveraged retrieval, tool calls, and multi-step reasoning to generate responses grounded in legal documents.
In the legal domain, ensuring that chatbot responses are accurate and grounded in the provided context is essential to minimize hallucinations and build user trust. Evaluating the quality of grounding became a key part of deploying the system responsibly.
Two grounding evaluation methods were explored:
- Amazon Bedrock Guardrails, offering built-in contextual grounding and relevance scores
- A custom LLM-based scoring approach, using Anthropic Claude to assess how well responses aligned with the supplied context across multi-turn interactions
In this article, we'll explore the grounding evaluation performed for this customer and the importance of ground truth in measuring quality in agentic AI systems that utilize RAG.
Amazon Bedrock Guardrails Contextual Grounding: Setup and Approach
Amazon Bedrock Guardrails offers a contextual grounding check that evaluates how well a model's response aligns with a given source and user query. It returns confidence scores for two dimensions, grounding (is the response supported by the source?) and relevance (does it address the query?), with thresholds and filters configurable via policy settings.
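For reference, the contextual grounding policy is attached when a guardrail is created through the Bedrock control-plane API. The snippet below is a minimal sketch using boto3; the guardrail name, messages, and thresholds are illustrative rather than the values used in the production system.

```python
import boto3

bedrock = boto3.client("bedrock")

# Create a guardrail with contextual grounding filters.
# Threshold values are illustrative; they should be tuned empirically.
response = bedrock.create_guardrail(
    name="legal-chatbot-grounding",  # hypothetical name
    description="Contextual grounding and relevance checks for RAG responses",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},  # is the response supported by the source?
            {"type": "RELEVANCE", "threshold": 0.75},  # does the response address the query?
        ]
    },
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide a reliable answer to that.",
)

guardrail_id = response["guardrailId"]
guardrail_version = response["version"]  # "DRAFT" until a version is published
```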
The agentic RAG architecture evaluated here consisted of three layers (sketched in code after the list):
- A retrieval layer, which pulls relevant documents from a vector database
- An agent layer, which selects tools and performs reasoning across steps
- A generation model, responsible for producing the final response
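In code, the pipeline can be pictured as the skeleton below. This is illustrative scaffolding only (the class and names are ours, not the production implementation), but it shows where retrieval, agent reasoning, and generation sit relative to the dialogue history that later complicates grounding evaluation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    query: str
    retrieved_docs: List[str]
    response: str


@dataclass
class AgenticRagPipeline:
    retrieve: Callable[[str], List[str]]                       # retrieval layer: vector DB search
    plan_and_act: Callable[[str, List[str], List[Turn]], str]  # agent layer: tool selection, multi-step reasoning
    generate: Callable[[str, List[str], List[Turn]], str]      # generation model: final response
    history: List[Turn] = field(default_factory=list)

    def answer(self, query: str) -> Turn:
        docs = self.retrieve(query)                           # pull relevant documents
        plan = self.plan_and_act(query, docs, self.history)   # reason and call tools across steps
        response = self.generate(plan, docs, self.history)    # produce the final answer
        turn = Turn(query=query, retrieved_docs=docs, response=response)
        self.history.append(turn)                             # dialogue state grows every turn
        return turn
```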
Amazon Bedrock Guardrails was integrated post-response to assess grounding at each turn of the conversation, using the full chat history and context as input; a sketch of this integration appears in the appendix.
Observations and Pitfalls
Initial experiments showed that grounding scores declined over the course of a conversation, even when all relevant context, including prior turns and retrieved documents, was explicitly passed to the Amazon Bedrock Guardrails API.
Key Observations
- Amazon Bedrock Guardrails performed well in short, isolated queries
- In multi-turn conversations, even well-grounded responses often received low scores
Analysis
This degradation likely stems from:
- The expanding context window, which makes it harder for Amazon Bedrock Guardrails to accurately parse and match content
- The loss of structure when combining multiple turns and retrievals into a single context input
- The absence of conversational memory modeling within the Amazon Bedrock Guardrails scoring logic
As a result, the reliability of grounding evaluation diminished over time, introducing noise into monitoring and observability metrics.
Technical Deep Dive: Why Do Grounding Scores Degrade?
Amazon Bedrock Guardrails operates on a per-turn basis, evaluating a static snapshot of context, query, and response. In agentic RAG systems, however, meaning is often distributed across turns, requiring a more holistic view of the dialogue.
Key Technical Limitations
- Guardrails assume context is static, whereas agentic systems maintain an evolving state
- Concatenated message history may lose semantic alignment, especially when the context size grows
- Filters designed for hallucination detection may not account for indirect references or inference across dialogue turns
These limitations make it difficult to apply standard grounding evaluations to agent-driven, multi-step reasoning flows.
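To make the second limitation concrete, here is roughly what the evaluator sees once several turns and retrieval results are flattened into a single grounding source. The helper is hypothetical and purely illustrative: turn boundaries and document provenance collapse into undifferentiated text, and indirect references in the latest response ("as mentioned above", "that statute") point somewhere into this blob.

```python
from typing import List, Tuple


def flatten_context(turns: List[Tuple[str, str]], retrieved_chunks: List[str]) -> str:
    """Naively merge prior turns and retrieved chunks into one grounding source.

    After a few turns, the grounding check receives one long block of text in
    which user questions, earlier answers, and document excerpts are
    interleaved, with no signal about which span should support which claim.
    """
    history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in turns)
    documents = "\n\n".join(retrieved_chunks)
    return history + "\n\n" + documents
```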
Transition to Custom LLM-Based Scoring
To address these shortcomings, a custom grounding evaluation pipeline was developed using Anthropic Claude as an LLM-based evaluator.
Approach
- Prompted the LLM to act as a grounding judge, using structured instructions and a consistent format
- Emphasized factual alignment, completeness, and hallucination detection
- Parsed responses into structured JSON with quantitative scores and qualitative reasoning
This approach enabled more stable evaluations, especially in longer conversations where Amazon Bedrock Guardrails struggled.
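As a rough illustration of this approach, the sketch below prompts a Claude model through the Amazon Bedrock Converse API to act as a grounding judge and parses its verdict into structured JSON. The model ID, rubric, and field names are illustrative; the production prompt and schema were more detailed.

```python
import json

import boto3

runtime = boto3.client("bedrock-runtime")

JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative model ID

JUDGE_PROMPT = """You are a grounding judge. Given a conversation history, the
retrieved context, and the assistant's latest response, score how well the
response is grounded in the context.

Return ONLY a JSON object with these fields:
  "grounding_score": number between 0 and 1,
  "completeness_score": number between 0 and 1,
  "hallucinated_claims": list of claims not supported by the context,
  "reasoning": short explanation.

Conversation history:
{history}

Retrieved context:
{context}

Assistant response to evaluate:
{response}
"""


def judge_grounding(history: str, context: str, response: str) -> dict:
    """Ask the judge model for a grounding verdict and parse it into a dict."""
    prompt = JUDGE_PROMPT.format(history=history, context=context, response=response)
    result = runtime.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    raw = result["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # in production, guard against malformed or fenced JSON
```

In this sketch, temperature is held at zero and a fixed JSON schema is required so that scores stay comparable across turns and are easy to log alongside the Guardrails metrics.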
Best Practices and Lessons Learned
Several insights emerged from comparing Amazon Bedrock Guardrails and custom LLM-based scoring:
- Out-of-the-box grounding tools like Amazon Bedrock Guardrails are useful for narrow, well-structured tasks
- Custom pipelines offer greater flexibility and accuracy in complex, multi-turn scenarios
- Real-time prompt monitoring and observability are essential to maintain trust in production
- Context formatting and input strategy significantly impact grounding score reliability
- Thresholds should be tuned based on empirical results, not default configurations (a simple tuning sketch follows)
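On the last point, one simple way to tune a threshold empirically, assuming a small ground-truth set of responses that reviewers have labeled grounded or ungrounded, is to sweep candidate thresholds and keep the one that best separates the two groups. The helper below is a sketch of our own; the F1 criterion is just one reasonable choice.

```python
from typing import Sequence


def pick_threshold(grounded: Sequence[float], ungrounded: Sequence[float]) -> float:
    """Return the score threshold that best flags ungrounded responses.

    A response is flagged when its grounding score falls below the threshold;
    the sweep keeps the threshold with the highest F1 for that flagging task.
    Example (illustrative numbers):
        pick_threshold(grounded=[0.92, 0.88, 0.81], ungrounded=[0.40, 0.55, 0.63])  # -> 0.64
    """
    best_t, best_f1 = 0.0, -1.0
    for t in (i / 100 for i in range(1, 100)):
        tp = sum(s < t for s in ungrounded)   # ungrounded correctly flagged
        fp = sum(s < t for s in grounded)     # grounded incorrectly flagged
        fn = sum(s >= t for s in ungrounded)  # ungrounded missed
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```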
Open Questions
- How are other teams evaluating grounding in agentic systems?
- Are there emerging frameworks for benchmarking grounding across different LLMs and use cases?
Conclusion
While Amazon Bedrock Guardrails provides a valuable starting point for grounding evaluation, its design is best suited for static, single-turn interactions. In contrast, agentic RAG systems require more nuanced scoring strategies that can account for evolving context and reasoning chains.
Custom LLM-based scoring pipelines show promise in bridging this gap by offering interpretable, tunable evaluations aligned with how agentic chatbots actually operate.
Feedback is welcome: What other techniques or tools have proven effective for grounding evaluation in multi-turn, agent-based systems?
How Caylent Can Help
At Caylent, our experts can help you take the next step in building agentic chatbots with our Generative AI Knowledge Base Solution. By leveraging sophisticated data preprocessing, we dramatically improve response accuracy while reducing operational costs through efficient search and retrieval strategies. Our solution is designed to scale seamlessly with your organization’s growing knowledge base and deliver a superior user experience that drives adoption and long-term satisfaction. Get in touch with us today to get started.
Appendix
Integrating Amazon Bedrock Guardrails
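Below is a minimal sketch of the post-response check described in the article, using the ApplyGuardrail API through boto3. It assumes a guardrail created with the contextual grounding policy shown earlier; the helper and variable names are our own, and the way chat history and retrieved documents are combined is simplified.

```python
import boto3

runtime = boto3.client("bedrock-runtime")


def check_grounding(guardrail_id: str, guardrail_version: str,
                    chat_history: str, retrieved_context: str,
                    user_query: str, model_response: str) -> dict:
    """Score one turn's response against the supplied context and query."""
    result = runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="OUTPUT",  # evaluating model output, not user input
        content=[
            # Prior turns plus retrieved documents act as the grounding source.
            {"text": {"text": chat_history + "\n\n" + retrieved_context,
                      "qualifiers": ["grounding_source"]}},
            {"text": {"text": user_query, "qualifiers": ["query"]}},
            {"text": {"text": model_response, "qualifiers": ["guard_content"]}},
        ],
    )

    # Extract the contextual grounding scores from the assessment.
    scores = {}
    for assessment in result.get("assessments", []):
        for f in assessment.get("contextualGroundingPolicy", {}).get("filters", []):
            scores[f["type"].lower()] = f["score"]  # e.g. {"grounding": 0.81, "relevance": 0.93}
    return scores
```

Calling a helper like this after every agent turn, with the accumulated history and that turn's retrieved documents, is roughly how the per-turn scores discussed in the observations above were produced.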