
Evaluating Contextual Grounding in Agentic RAG Chatbots with Amazon Bedrock Guardrails

Generative AI & LLMOps

Explore how organizations can ensure trustworthy, factually grounded responses from agentic RAG chatbots by evaluating contextual grounding with Amazon Bedrock Guardrails and custom LLM-based scoring, reducing hallucinations and building user confidence in high-stakes domains.

AI adoption is colliding with a hard truth: people will not trust chatbots that make things up. In high-stakes areas such as law, healthcare, or finance, even a single hallucination can erode user confidence and create real business or legal risk. That is why grounding, the practice of ensuring model responses are factually tied to the right context, has become one of the most critical challenges in applied AI.

Retrieval-Augmented Generation (RAG) helps address this by allowing models to pull knowledge directly from trusted documents. The next wave, agentic RAG, goes a step further. These systems coordinate multi-step reasoning and retrieval, often using tools and external functions to construct more informed answers. That added capability also increases complexity, particularly in multi-turn conversations, where the agent’s response depends not only on retrieved documents but also on the evolving dialogue history.

This sophistication unlocks new capabilities, but it also raises a difficult question: how do we know the answers are still grounded in the right sources? Traditional evaluation methods designed for single-turn Q&A often fall short when applied to multi-turn, context-dependent exchanges.

At Caylent, we addressed this challenge while developing an agentic RAG chatbot for a legal services firm to help users navigate complex legal topics. The system leveraged retrieval, tool calls, and multi-step reasoning to generate responses grounded in legal documents.

In the legal domain, ensuring that chatbot responses are accurate and grounded in the provided context is essential to minimize hallucinations and build user trust. Evaluating the quality of grounding became a key part of deploying the system responsibly.

Two grounding evaluation methods were explored:

  • Amazon Bedrock Guardrails, offering built-in contextual grounding and relevance scores
  • A custom LLM-based scoring approach, using Anthropic Claude to assess how well responses aligned with the supplied context across multi-turn interactions

In this article, we'll explore the grounding evaluation performed for this customer and the importance of ground truth in measuring quality in agentic AI systems that utilize RAG.

Amazon Bedrock Guardrails Contextual Grounding: Setup and Approach

Amazon Bedrock Guardrails offers a contextual grounding check that evaluates how well a model’s response aligns with a given source and user query. It provides confidence scores along two dimensions, grounding and relevance, with the threshold for each filter configurable via policy settings.
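
As a point of reference, the sketch below shows one way such a guardrail can be configured with boto3. The guardrail name, thresholds, and blocked-message text are illustrative placeholders, not the configuration used for this customer.

import boto3

# Control-plane client used to create and manage guardrails
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Enable the contextual grounding check with illustrative thresholds;
# responses scoring below a threshold are filtered by the guardrail.
guardrail = bedrock.create_guardrail(
    name="grounding-eval-guardrail",  # hypothetical name
    description="Contextual grounding and relevance checks for a RAG chatbot",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},   # placeholder threshold
            {"type": "RELEVANCE", "threshold": 0.75},   # placeholder threshold
        ]
    },
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide a grounded answer to that.",
)
print(guardrail["guardrailId"], guardrail["version"])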

The agentic RAG architecture evaluated here consisted of:

  • A retrieval layer, which pulls relevant documents from a vector database
  • An agent layer, which selects tools and performs reasoning across steps
  • A generation model, responsible for producing the final response

Amazon Bedrock Guardrails was integrated post-response to assess grounding at each turn of the conversation, using the full chat history and context as input.
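
At a high level, the integration looked like the sketch below: after the agent finishes a turn, its full state (prior turns, tool outputs, retrieved documents) is passed to Guardrails as the grounding source. The agent object and its run method are hypothetical placeholders; evaluate_with_bedrock_guardrails is the helper shown in the appendix.

# Minimal sketch of post-response grounding evaluation, run once per turn.
# `agent` and `agent.run` are hypothetical placeholders for the retrieval
# and tool-calling agent; evaluate_with_bedrock_guardrails is defined in the appendix.
chat_history = []

def handle_turn(agent, user_query):
    # The agent retrieves documents, calls tools, and produces the final answer
    answer, state_messages = agent.run(user_query, history=chat_history)

    # Everything the agent saw this turn becomes the grounding source
    scores = evaluate_with_bedrock_guardrails(
        source=state_messages,
        query=user_query,
        response=answer,
    )

    chat_history.append({"query": user_query, "answer": answer, "scores": scores})
    return answer, scores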

Observations and Pitfalls

Initial experiments showed that grounding scores declined over the course of a conversation, even when all relevant context, including prior turns and retrieved documents, was explicitly passed to the Amazon Bedrock Guardrails API.

Key Observations

  • Amazon Bedrock Guardrails performed well in short, isolated queries
  • In multi-turn conversations, even well-grounded responses often received low scores

Analysis

This degradation likely stems from:

  • The expanding context window, which makes it harder for Amazon Bedrock Guardrails to accurately parse and match content
  • The loss of structure when combining multiple turns and retrievals into a single context input
  • The absence of conversational memory modeling within the Amazon Bedrock Guardrails scoring logic

As a result, the reliability of grounding evaluation diminished over time, introducing noise into monitoring and observability metrics.
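
One way to reduce the loss of structure is to label each turn and document when assembling the grounding source, rather than flattening everything into a single string. The helper below is an illustrative sketch of that idea, not the exact formatting scheme used in production.

def format_grounding_source(turns, retrieved_docs):
    """Illustrative sketch: label each prior turn and retrieved document
    so the grounding source keeps some conversational structure."""
    parts = []
    for i, turn in enumerate(turns, start=1):
        parts.append(f"[Turn {i} - user] {turn['query']}")
        parts.append(f"[Turn {i} - assistant] {turn['answer']}")
    for j, doc in enumerate(retrieved_docs, start=1):
        parts.append(f"[Document {j}] {doc}")
    return "\n".join(parts)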

Technical Deep Dive: Why Do Grounding Scores Degrade?

Amazon Bedrock Guardrails operates on a per-turn basis, evaluating a static snapshot of context, query, and response. In agentic RAG systems, however, meaning is often distributed across turns, requiring a more holistic view of the dialogue.

Key Technical Limitations

  • Guardrails assume context is static, whereas agentic systems maintain an evolving state
  • Concatenated message history may lose semantic alignment, especially when the context size grows
  • Filters designed for hallucination detection may not account for indirect references or inference across dialogue turns

These limitations make it difficult to apply standard grounding evaluations to agent-driven, multi-step reasoning flows.

Transition to Custom LLM-Based Scoring

To address these shortcomings, a custom grounding evaluation pipeline was developed using Anthropic Claude as an LLM-based evaluator.

Approach

  • Prompted the LLM to act as a grounding judge, using structured instructions and a consistent format
  • Emphasized factual alignment, completeness, and hallucination detection
  • Parsed responses into structured JSON with quantitative scores and qualitative reasoning

This approach enabled more stable evaluations, especially in longer conversations where Amazon Bedrock Guardrails struggled.
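
To compare the two methods directly, each turn can be scored by both evaluators and the results logged side by side. The sketch below assumes the two helper functions shown in the appendix.

def score_turn(source, query, response):
    # Run both evaluators on the same turn so their scores can be compared
    guardrail_result = evaluate_with_bedrock_guardrails(source, query, response)
    llm_result = evaluate_custom_score(source, query, response)
    return {
        "guardrail_grounding": guardrail_result.get("grounding_score"),
        "guardrail_relevance": guardrail_result.get("relevance_score"),
        "llm_judge": llm_result,  # dict with "score" and "reasoning", or None on failure
    }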

Best Practices and Lessons Learned

Several insights emerged from comparing Amazon Bedrock Guardrails and custom LLM-based scoring:

  • Out-of-the-box grounding tools like Amazon Bedrock Guardrails are useful for narrow, well-structured tasks
  • Custom pipelines offer greater flexibility and accuracy in complex, multi-turn scenarios
  • Real-time prompt monitoring and observability are essential to maintain trust in production
  • Context formatting and input strategy significantly impact grounding score reliability
  • Thresholds should be tuned based on empirical results, not default configurations (a simple tuning sketch follows this list)
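
For the last point, a threshold can be chosen from a small labeled sample instead of the default. The sketch below picks the cutoff that best separates responses a reviewer judged grounded from those judged hallucinated; the scores shown are made-up illustrations.

def tune_threshold(grounded_scores, hallucinated_scores):
    """Illustrative sketch: choose the cutoff that best separates
    human-labeled grounded and hallucinated responses."""
    best_threshold, best_accuracy = 0.0, 0.0
    for candidate in (i / 100 for i in range(100)):
        correct = sum(s >= candidate for s in grounded_scores) + \
                  sum(s < candidate for s in hallucinated_scores)
        accuracy = correct / (len(grounded_scores) + len(hallucinated_scores))
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy

# Example with made-up scores from a labeled evaluation set
threshold, accuracy = tune_threshold(
    grounded_scores=[0.92, 0.88, 0.79, 0.95],
    hallucinated_scores=[0.35, 0.52, 0.61],
)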

Open Questions

  • How are other teams evaluating grounding in agentic systems?
  • Are there emerging frameworks for benchmarking grounding across different LLMs and use cases?

Conclusion

While Amazon Bedrock Guardrails provides a valuable starting point for grounding evaluation, its design is best suited for static, single-turn interactions. In contrast, agentic RAG systems require more nuanced scoring strategies that can account for evolving context and reasoning chains.

Custom LLM-based scoring pipelines show promise in bridging this gap by offering interpretable, tunable evaluations aligned with how agentic chatbots actually operate.

Feedback is welcome: What other techniques or tools have proven effective for grounding evaluation in multi-turn, agent-based systems?

How Caylent Can Help

At Caylent, our experts can help you take the next step in building agentic chatbots with our Generative AI Knowledge Base Solution. By leveraging sophisticated data preprocessing, we dramatically improve response accuracy while reducing operational costs through efficient search and retrieval strategies. Our solution is designed to scale seamlessly with your organization’s growing knowledge base and deliver a superior user experience that drives adoption and long-term satisfaction. Get in touch with us today to get started. 

Appendix

Integrating Amazon Bedrock Guardrails

import boto3
import logging
from botocore.config import Config

# Set up AWS region and Bedrock runtime
AWS_REGION = "us-east-1"
GUARDRAIL_ID = "id"  # Replace with your Guardrail ID
MODEL_NAME = "anthropic.claude-3-sonnet-20240229-v1:0"

bedrock_rt = boto3.client("bedrock-runtime", region_name=AWS_REGION, config=Config(
    connect_timeout=120, read_timeout=120, retries={"max_attempts": 1}
))

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def populate_content(source, query, response):
    """
    source: All state messages from the conversation history, including intermediate tool messages
    query: The latest user input
    response: Final agent response
    """
    # Qualifiers tell apply_guardrail which block is the grounding source and
    # which is the user query; the untagged block is the content being evaluated.
    # Note that str(source) flattens the entire history into one unstructured string.
    return [
        {"text": {"text": str(source), "qualifiers": ["grounding_source"]}},
        {"text": {"text": query, "qualifiers": ["query"]}},
        {"text": {"text": response}}
    ]


def evaluate_with_bedrock_guardrails(source, query, response):
    try:
        content = populate_content(source, query, response)

        result = bedrock_rt.apply_guardrail(
            guardrailIdentifier=GUARDRAIL_ID,
            guardrailVersion='1',
            source='OUTPUT',
            content=content,
            outputScope='FULL'
        )

        # Look up scores by filter type rather than relying on list ordering
        filters = result["assessments"][0]["contextualGroundingPolicy"]["filters"]
        scores = {f["type"]: f["score"] for f in filters}
        grounding_score = scores.get("GROUNDING")
        relevance_score = scores.get("RELEVANCE")

        return {
            "grounding_score": grounding_score,
            "relevance_score": relevance_score,
            "raw_result": result
        }

    except Exception as e:
        logger.error(f"Guardrail evaluation error: {e}")
        return {
            "grounding_score": None,
            "relevance_score": None,
            "error": str(e)
        }

Custom LLM-Based Scoring

from langchain_aws import ChatBedrockConverse
import re, json

def create_scoring_prompt(source, query, response):
    return f"""
    Evaluate how well this response is grounded in the provided source material.
    
    SOURCE: {source}
    
    QUERY: {query}
    
    RESPONSE: {response}
    
    EVALUATION CRITERIA:
    1. Factual Accuracy: All facts and claims in the response must come directly from the source
    2. Query Relevance: The response must directly address the query
    3. No Hallucination: No new information should be added beyond what's in the source
    4. Completeness: Key information from the source relevant to the query should be included
    
    SCORING SCALE (0-1):
    - 1.0: Perfect grounding - all facts from source, directly answers query, no hallucinations
    - 0.8-0.9: Mostly grounded with minor issues
    - 0.6-0.7: Moderately grounded but has some unsupported claims or omissions
    - 0.4-0.5: Partially grounded with significant issues
    - 0.2-0.3: Poorly grounded with major hallucinations
    - 0.0-0.1: Not grounded, mostly hallucinated content
    
    RESPONSE FORMAT (return valid JSON only and keep the explanation brief):
    {{
        "score": 0.85,
        "reasoning": "Explanation of specific issues found and strengths as well."
    }}
    
    Return only the JSON object, no additional text.
    """

def evaluate_custom_score(source, query, response):
    model = ChatBedrockConverse(
        client=bedrock_rt,
        model=MODEL_NAME,
        temperature=0.01
    )

    prompt = create_scoring_prompt(source, query, response)
    try:
        result = model.invoke(prompt)
        score_text = result.content.strip()
        # Extract the JSON object so both the score and the reasoning survive
        match = re.search(r"\{.*\}", score_text, re.DOTALL)
        if not match:
            return None
        parsed = json.loads(match.group(0))
        return {
            "score": float(parsed["score"]),
            "reasoning": parsed.get("reasoning", "")
        }
    except Exception as e:
        logger.error(f"Custom scoring error: {e}")
        return None
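
A brief usage example with hypothetical inputs:

# Example usage (chat_history and final_agent_answer are hypothetical)
verdict = evaluate_custom_score(
    source=chat_history,            # prior turns plus retrieved documents
    query="What is the filing deadline for a small claims appeal?",
    response=final_agent_answer,
)
if verdict is not None:
    logger.info(f"LLM judge score: {verdict['score']} - {verdict['reasoning']}")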

Seima Saki

Seima Saki is a Senior Machine Learning Architect at Caylent. She helps clients design, build, and scale cutting-edge ML and Generative AI solutions that address complex business challenges. Her expertise spans architecting Agentic AI use cases such as intelligent chatbots, copilots, and autonomous workflows; designing custom retrieval and chunking strategies to optimize RAG pipelines; and fine-tuning open-source LLMs for scalable, high-performance inference. She also focuses on LLMOps best practices, ensuring reliable, secure, and cloud-native deployments.


