Explore how organizations can ensure trustworthy, factually grounded responses in agentic RAG chatbots by evaluating two contextual grounding methods, Amazon Bedrock Guardrails and custom LLM-based scoring, to reduce hallucinations and build user confidence in high-stakes domains.
AI adoption is colliding with a hard truth: people will not trust chatbots that make things up. In high-stakes areas such as law, healthcare, or finance, even a single hallucination can erode user confidence and create real business or legal risk. That is why grounding, the practice of ensuring model responses are factually tied to the right context, has become one of the most critical challenges in applied AI.
Retrieval-Augmented Generation (RAG) helps address this by allowing models to pull knowledge directly from trusted documents. The next wave, agentic RAG, goes further: these systems coordinate multi-step reasoning and retrieval, often using tools and external functions to construct more informed answers. That added complexity is most pronounced in multi-turn conversations, where the agent's response depends not only on retrieved documents but also on evolving dialogue history.
This sophistication unlocks new capabilities, but it also raises a difficult question: how do we know the answers are still grounded in the right sources? Traditional methods designed for single-turn Q&A often fall short when applied to multi-turn, context-dependent exchanges.
At Caylent, we addressed this challenge while developing an agentic RAG chatbot for a legal services firm to help users navigate complex legal topics. The system leveraged retrieval, tool calls, and multi-step reasoning to generate responses grounded in legal documents.
In the legal domain, ensuring that chatbot responses are accurate and grounded in the provided context is essential to minimize hallucinations and build user trust. Evaluating the quality of grounding became a key part of deploying the system responsibly.
Two grounding evaluation methods were explored: the contextual grounding check built into Amazon Bedrock Guardrails and a custom LLM-based scoring pipeline. In this article, we'll walk through the grounding evaluation performed for this customer and the importance of ground truth in measuring quality in agentic AI systems that use RAG.
Amazon Bedrock Guardrails offers a contextual grounding check that evaluates how well a model’s response aligns with a given source and user query. It provides confidence scores across dimensions, such as grounding and relevance, with thresholds and filters configurable via policy settings.
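For reference, a contextual grounding policy is attached when the guardrail is created. Below is a minimal sketch of doing so with boto3; the guardrail name, messaging, and 0.75 thresholds are illustrative placeholders, not the values used in this engagement.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

guardrail = bedrock.create_guardrail(
    name="legal-chatbot-grounding",  # illustrative name
    description="Contextual grounding check for a legal RAG chatbot",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},   # flag responses weakly tied to the source
            {"type": "RELEVANCE", "threshold": 0.75},   # flag responses that don't address the query
        ]
    },
    blockedInputMessaging="Sorry, I can't respond to that input.",
    blockedOutputsMessaging="Sorry, I can't provide that response.",
)
print(guardrail["guardrailId"], guardrail["version"])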
The agentic RAG architecture evaluated here combined document retrieval over the legal knowledge base, tool and function calls for intermediate steps, and multi-step reasoning to produce the final response. Amazon Bedrock Guardrails was integrated post-response to assess grounding at each turn of the conversation, using the full chat history and retrieved context as input, as sketched below.
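A rough sketch of this per-turn wiring follows. It reuses the evaluate_with_bedrock_guardrails helper shown in full under "Integrating Amazon Bedrock Guardrails" at the end of this post; the conversation structure is hypothetical, standing in for the turns captured from the agent.
def score_conversation(conversation):
    """conversation: hypothetical list of (state_messages, user_query, agent_response) tuples."""
    turn_scores = []
    for turn, (state_messages, user_query, agent_response) in enumerate(conversation, start=1):
        scores = evaluate_with_bedrock_guardrails(
            source=state_messages,    # full chat history, including intermediate tool messages
            query=user_query,         # latest user input
            response=agent_response,  # final agent answer for this turn
        )
        turn_scores.append({
            "turn": turn,
            "grounding": scores.get("grounding_score"),
            "relevance": scores.get("relevance_score"),
        })
    return turn_scores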
Key Observations
Initial experiments showed that grounding scores declined over the course of a conversation, even when all relevant context, including prior turns and retrieved documents, was explicitly passed to the Amazon Bedrock Guardrails API.
Analysis
This degradation likely stems from the accumulated chat history and intermediate tool messages diluting the grounding source: as the conversation grows, the per-turn check has a harder time attributing the final response to the specific passages that support it.
As a result, the reliability of grounding evaluation diminished over time, introducing noise into monitoring and observability metrics.
Key Technical Limitations
Amazon Bedrock Guardrails operates on a per-turn basis, evaluating a static snapshot of context, query, and response. In agentic RAG systems, however, meaning is often distributed across turns, requiring a more holistic view of the dialogue. Because each check sees only a single flattened grounding source, dialogue history and intermediate tool outputs either crowd out the supporting passages or must be omitted entirely. These limitations make it difficult to apply standard grounding evaluations to agent-driven, multi-step reasoning flows.
To address these shortcomings, a custom grounding evaluation pipeline was developed using Anthropic Claude as an LLM-based evaluator, prompting the model to score how well each response was grounded in the provided source and to return a brief explanation.
This approach enabled more stable evaluations, especially in longer conversations where Amazon Bedrock Guardrails struggled.
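To see where the two methods diverged, both evaluators were run on the same turns and their scores logged side by side. A minimal sketch, reusing the helpers shown at the end of this post; state_messages, user_query, and agent_response are placeholders for values captured from the agent.
guardrail_result = evaluate_with_bedrock_guardrails(state_messages, user_query, agent_response)
custom_score = evaluate_custom_score(state_messages, user_query, agent_response)

logger.info(
    "Guardrails grounding=%s relevance=%s | custom LLM score=%s",
    guardrail_result.get("grounding_score"),
    guardrail_result.get("relevance_score"),
    custom_score,
)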
Several insights emerged from comparing Amazon Bedrock Guardrails and custom LLM-based scoring.
While Amazon Bedrock Guardrails provides a valuable starting point for grounding evaluation, its design is best suited for static, single-turn interactions. In contrast, agentic RAG systems require more nuanced scoring strategies that can account for evolving context and reasoning chains.
Custom LLM-based scoring pipelines show promise in bridging this gap by offering interpretable, tunable evaluations aligned with how agentic chatbots actually operate.
Feedback is welcome: What other techniques or tools have proven effective for grounding evaluation in multi-turn, agent-based systems?
At Caylent, our experts can help you take the next step in building agentic chatbots with our Generative AI Knowledge Base Solution. By leveraging sophisticated data preprocessing, we dramatically improve response accuracy while reducing operational costs through efficient search and retrieval strategies. Our solution is designed to scale seamlessly with your organization’s growing knowledge base and deliver a superior user experience that drives adoption and long-term satisfaction. Get in touch with us today to get started.
Integrating Amazon Bedrock Guardrails
import boto3
import logging
from botocore.config import Config

# Set up AWS region and the Bedrock runtime client
AWS_REGION = "us-east-1"
GUARDRAIL_ID = "id"  # Replace with your Guardrail ID
MODEL_NAME = "anthropic.claude-3-sonnet-20240229-v1:0"

bedrock_rt = boto3.client(
    "bedrock-runtime",
    region_name=AWS_REGION,
    config=Config(connect_timeout=120, read_timeout=120, retries={"max_attempts": 1}),
)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def populate_content(source, query, response):
    """
    Build the content payload for the contextual grounding check.

    source: All state messages from the conversation history, including intermediate tool messages
    query: The latest user input
    response: Final agent response
    """
    return [
        {"text": {"text": str(source), "qualifiers": ["grounding_source"]}},
        {"text": {"text": query, "qualifiers": ["query"]}},
        {"text": {"text": response}},
    ]


def evaluate_with_bedrock_guardrails(source, query, response):
    """Apply the guardrail to a single turn and extract grounding and relevance scores."""
    try:
        content = populate_content(source, query, response)
        result = bedrock_rt.apply_guardrail(
            guardrailIdentifier=GUARDRAIL_ID,
            guardrailVersion="1",
            source="OUTPUT",
            content=content,
            outputScope="FULL",
        )
        # Look up each contextual grounding filter by type instead of relying on list order
        filters = result["assessments"][0]["contextualGroundingPolicy"]["filters"]
        scores = {f["type"]: f["score"] for f in filters}
        return {
            "grounding_score": scores.get("GROUNDING"),
            "relevance_score": scores.get("RELEVANCE"),
            "raw_result": result,
        }
    except Exception as e:
        logger.error(f"Guardrail evaluation error: {e}")
        return {
            "grounding_score": None,
            "relevance_score": None,
            "error": str(e),
        }
Custom LLM scoring
from langchain_aws import ChatBedrockConverse
import json
import re


def create_scoring_prompt(source, query, response):
    """Build the grounding-evaluation prompt for the LLM judge."""
    return f"""
Evaluate how well this response is grounded in the provided source material.

SOURCE: {source}
QUERY: {query}
RESPONSE: {response}

EVALUATION CRITERIA:
1. Factual Accuracy: All facts and claims in the response must come directly from the source
2. Query Relevance: The response must directly address the query
3. No Hallucination: No new information should be added beyond what's in the source
4. Completeness: Key information from the source relevant to the query should be included

SCORING SCALE (0-1):
- 1.0: Perfect grounding - all facts from source, directly answers query, no hallucinations
- 0.8-0.9: Mostly grounded with minor issues
- 0.6-0.7: Moderately grounded but has some unsupported claims or omissions
- 0.4-0.5: Partially grounded with significant issues
- 0.2-0.3: Poorly grounded with major hallucinations
- 0.0-0.1: Not grounded, mostly hallucinated content

RESPONSE FORMAT (return valid JSON only and keep the explanation brief):
{{
    "score": 0.85,
    "reasoning": "Explanation of specific issues found and strengths as well."
}}

Return only the JSON object, no additional text.
"""


def evaluate_custom_score(source, query, response):
    """Score a single turn with Claude acting as the grounding evaluator."""
    model = ChatBedrockConverse(
        client=bedrock_rt,
        model=MODEL_NAME,
        temperature=0.01,  # near-deterministic scoring
    )
    prompt = create_scoring_prompt(source, query, response)
    try:
        result = model.invoke(prompt)
        # The model response content may be a string or a list of content blocks
        content = result.content
        if isinstance(content, list):
            content = "".join(
                block.get("text", "") for block in content if isinstance(block, dict)
            )
        score_text = content.strip()
        # Prefer the JSON object the prompt asks for; fall back to the first number found
        try:
            return float(json.loads(score_text)["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            match = re.search(r"\d*\.?\d+", score_text)
            return float(match.group(0)) if match else None
    except Exception as e:
        logger.error(f"Custom scoring error: {e}")
        return None
Seima Saki is a Senior Machine Learning Architect at Caylent. She helps clients design, build, and scale cutting-edge ML and Generative AI solutions that address complex business challenges. Her expertise spans architecting Agentic AI use cases such as intelligent chatbots, copilots, and autonomous workflows; designing custom retrieval and chunking strategies to optimize RAG pipelines; and fine-tuning open-source LLMs for scalable, high-performance inference. She also focuses on LLMOps best practices, ensuring reliable, secure, and cloud-native deployments.