Anthropic SDK Track Part 20: Guardrails — Hallucination & Consistency

                        
What You’ll Learn: This article covers CCA Phase 4.1 (Reduce Hallucinations) and Phase 4.2 (Increase Output Consistency). You’ll implement grounding techniques with source attribution, nullable schema fields, temperature control strategies, tool_use for structural enforcement, eval-driven consistency measurement, and a layered production guardrails architecture.
                    

Reducing Hallucinations

Hallucinations — confident outputs that aren’t grounded in provided context — are the primary reliability concern in agentic systems. Unlike creative writing where imagination is valued, production agents need factual accuracy and the ability to express uncertainty when information is unavailable.

What Causes Hallucinations

In agentic contexts, hallucinations typically emerge from three sources:

Missing context: The model is asked about information not present in provided documents, so it fills gaps with plausible-sounding content
Overly broad questions: Vague prompts give the model latitude to speculate rather than cite specific evidence
Forced output structure: When all fields in a schema are required, the model invents values rather than leaving blanks

Grounding Techniques

The most effective anti-hallucination strategy is grounding — constraining the model to only use information from explicit sources. Key patterns include:

Explicit source attribution: Require the model to cite which document/paragraph supports each claim
“I don’t know” signals: Teach the model to express uncertainty with phrases like “Based on the provided documents, I cannot determine…”
Retrieval-augmented patterns: Provide source documents alongside the query so the model has concrete material to reference
Nullable fields: Allow null in your schema when information genuinely isn’t available
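The first three patterns above can be combined in a single prompt template. The following is a minimal sketch, not a prescribed format: the function name `build_grounded_prompt` and the exact instruction wording are assumptions made for illustration.

```python
def build_grounded_prompt(documents: dict[str, str], question: str) -> str:
    """Assemble a prompt that restricts the model to labeled documents."""
    # Label each document so the model can cite it by identifier.
    labeled = "\n\n".join(
        f"[{doc_id}]\n{text.strip()}" for doc_id, text in documents.items()
    )
    return (
        "Answer using ONLY the documents below. Cite the document "
        "identifier in square brackets after each claim. If the documents "
        "do not contain the answer, reply: 'Based on the provided "
        "documents, I cannot determine this.'\n\n"
        f"Documents:\n{labeled}\n\n"
        f"Question: {question}"
    )


# Hypothetical document and question, for illustration only.
prompt = build_grounded_prompt(
    {"Doc 1": "Acme Corp is based in San Francisco."},
    "Where is Acme Corp headquartered?",
)
print(prompt)
```

The resulting string can be passed as the user message content in a `client.messages.create` call.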

Nullable Schema Fields

One of the highest-impact techniques for reducing hallucinations is making schema fields nullable. When a tool requires all fields to be populated, the model will fabricate data rather than admit it doesn’t have the information:

                        
CCA Exam Tip: The CCA exam specifically tests nullable field patterns. When an extraction tool forces all fields to be non-null, the model hallucinates values. The fix is to allow null for fields that may not be present in the source document, and to add a source_citations field requiring the model to quote the relevant passage for each extracted value.
                    

import anthropic
import json

client = anthropic.Anthropic()

# Tool with nullable fields + source citation requirement
extraction_tool = {
    "name": "extract_company_info",
    "description": (
        "Extract company information from the provided document. "
        "Use null for any field where the information is NOT explicitly "
        "stated in the document. Never guess or infer values that aren't "
        "directly supported by the text."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "company_name": {
                "type": "string",
                "description": "Company name as stated in the document"
            },
            "founding_year": {
                "type": ["integer", "null"],
                "description": "Year founded. null if not mentioned."
            },
            "revenue": {
                "type": ["string", "null"],
                "description": "Revenue figure. null if not mentioned."
            },
            "employee_count": {
                "type": ["integer", "null"],
                "description": "Number of employees. null if not mentioned."
            },
            "headquarters": {
                "type": ["string", "null"],
                "description": "HQ location. null if not mentioned."
            },
            "source_citations": {
                "type": "array",
                "items": {"type": "string"},
                "description": (
                    "Direct quotes from the document supporting each "
                    "non-null field. One citation per extracted value."
                )
            }
        },
        "required": [
            "company_name", "founding_year", "revenue",
            "employee_count", "headquarters", "source_citations"
        ]
    }
}

# Document with partial information
document = """
Acme Corp announced Q3 results yesterday. The San Francisco-based company
reported strong growth in its cloud division. CEO Jane Smith noted that
the team of 2,400 employees delivered exceptional results this quarter.
"""

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[extraction_tool],
    messages=[{
        "role": "user",
        "content": (
            f"Extract company information from this document. "
            f"Use null for anything not explicitly stated.\n\n"
            f"Document:\n{document}"
        )
    }]
)

# Parse the tool use result
for block in response.content:
    if block.type == "tool_use":
        result = block.input
        print(json.dumps(result, indent=2))
        # Example output (actual model output may vary):
        # {
        #   "company_name": "Acme Corp",
        #   "founding_year": null,        <-- Not mentioned, correctly null
        #   "revenue": null,              <-- Not mentioned, correctly null
        #   "employee_count": 2400,
        #   "headquarters": "San Francisco",
        #   "source_citations": [
        #     "Acme Corp announced Q3 results yesterday",
        #     "The San Francisco-based company",
        #     "the team of 2,400 employees"
        #   ]
        # }

Source Attribution & Verification

Retrieval-Augmented Pattern

The retrieval-augmented generation (RAG) pattern provides source documents alongside queries, requiring the model to ground every claim in specific passages. The key idea is that Claude tends to stay closer to the source material, and its answers are easier to audit, when it is told exactly which documents to reference:

Provide source documents with clear identifiers (Doc 1, Doc 2, etc.)
Require inline citations for every factual claim
Add self-correction prompts: “If you’re unsure about any fact, say so explicitly rather than guessing”
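Once inline citations are required, their presence can be checked mechanically before any deeper verification. The sketch below assumes the model was told to cite sources as `[Doc N]`; the helper name `find_citation_problems` and the naive sentence splitter are illustrative choices, not part of the SDK.

```python
import re

# Matches citations written as [Doc 1], [Doc 2], ... (an assumed convention).
CITATION_PATTERN = re.compile(r"\[(Doc \d+)\]")


def find_citation_problems(answer: str, known_ids: set[str]) -> list[str]:
    """Return one message per sentence with a missing or unknown citation."""
    problems = []
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()
    ]
    for sentence in sentences:
        cited = CITATION_PATTERN.findall(sentence)
        if not cited:
            problems.append(f"No citation: {sentence}")
        for doc_id in cited:
            if doc_id not in known_ids:
                problems.append(f"Unknown source {doc_id}: {sentence}")
    return problems


# Hypothetical model answer, for illustration only.
answer = (
    "TechStart was founded in 2019 [Doc 1]. "
    "It employs 350 people [Doc 3]. "
    "Revenue is growing fast."
)
problems = find_citation_problems(answer, known_ids={"Doc 1", "Doc 2"})
for p in problems:
    print(p)
```

A check like this only confirms that citations exist and point at known documents; whether the cited passage actually supports the claim still needs the verification step described next.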

Verification Loop

For high-stakes applications, implement a verification loop that extracts individual claims from the model’s output and checks each against the source material:

Source-Grounded Verification Flow

                            flowchart TD
                                A[User Query] --> B[Retrieve Source Documents]
                                B --> C[Extract with Citations]
                                C --> D[Parse Individual Claims]
                                D --> E{Each Claim}
                                E --> F[Find Supporting Quote]
                                F --> G{Quote Found?}
                                G -->|Yes| H[Mark Verified ✓]
                                G -->|No| I[Mark Unverified ✗]
                                H --> J[Compile Verified Output]
                                I --> J
                                J --> K[Return with Confidence Scores]
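The claim-checking portion of this flow can be sketched without any model call by treating "Find Supporting Quote" as an exact substring match after whitespace and case normalization. This is a simplifying assumption: it confirms the quote exists in the source, not that it supports the claim. The names `verify_claims` and `verified_fraction` are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block matches."""
    return " ".join(text.lower().split())


def verify_claims(claims: list[dict], source_text: str) -> dict:
    """Split claims into verified/unverified based on quote presence."""
    source = normalize(source_text)
    verified, unverified = [], []
    for claim in claims:
        quote = claim.get("citation")
        if quote and normalize(quote) in source:
            verified.append(claim["claim"])
        else:
            unverified.append(claim["claim"])
    total = len(claims)
    return {
        "verified": verified,
        "unverified": unverified,
        # Share of claims whose quote was found; 0.0 when there are no claims.
        "verified_fraction": len(verified) / total if total else 0.0,
    }


# Hypothetical claims, for illustration only.
source_text = "TechStart Inc was founded in 2019 in Austin, Texas."
claims = [
    {"claim": "TechStart was founded in 2019",
     "citation": "founded in 2019 in Austin"},
    {"claim": "TechStart has 500 employees", "citation": None},
]
report = verify_claims(claims, source_text)
print(report)
```

Exact matching is cheap and catches fabricated quotes; a second model pass can then judge whether each found quote genuinely supports its claim.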

Fact-Checking Agent

This pattern uses a two-pass approach: first extract claims with citations, then verify each citation actually supports the claim:

import anthropic
import json

client = anthropic.Anthropic()


def extract_claims_with_citations(text: str, sources: str) -> dict:
    """First pass: extract factual claims with source citations."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=(
            "You are a precise fact extractor. For each factual claim in your "
            "response, cite the exact source passage that supports it. "
            "Submit the claims with the submit_claims tool. If you cannot "
            "find a supporting passage for a claim, set citation to null "
            "and source_id to null."
        ),
        tools=[{
            "name": "submit_claims",
            "description": "Submit extracted claims with their source citations",
            "input_schema": {
                "type": "object",
                "properties": {
                    "claims": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "claim": {"type": "string"},
                                "citation": {"type": ["string", "null"]},
                                "source_id": {"type": ["string", "null"]}
                            },
                            "required": ["claim", "citation", "source_id"]
                        }
                    }
                },
                "required": ["claims"]
            }
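        }],
        # Force the tool call so the claims always arrive as structured input
        tool_choice={"type": "tool", "name": "submit_claims"},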
        messages=[{
            "role": "user",
            "content": (
                f"Answer the question using only the sources below, then "
                f"submit each factual claim from your answer with its "
                f"citation.\n\nQuestion: {text}\n\n"
                f"Sources:\n{sources}"
            )
        }]
    )

    for block in response.content:
        if block.type == "tool_use":
            return block.input
    return {"claims": []}


def verify_claim(claim: str, citation: str, source_text: str) -> dict:
    """Second pass: verify each citation actually supports the claim."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Does this citation support the claim?\n\n"
                f"Claim: {claim}\n"
                f"Citation: {citation}\n"
                f"Full source: {source_text}\n\n"
                f"Respond with ONLY 'supported', 'partially_supported', "
                f"or 'not_supported' followed by a brief explanation."
            )
        }]
    )
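    text = response.content[0].text.strip()
    # The first token should be the verdict; strip quotes and punctuation
    first_token = text.split()[0] if text else ""
    verdict = first_token.lower().strip("'\".,:;")
    # Guard against replies that do not start with one of the three verdicts
    if verdict not in {"supported", "partially_supported", "not_supported"}:
        verdict = "unparseable"
    return {"verdict": verdict, "explanation": text}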


# Example usage
sources = """
[Doc A] TechStart Inc was founded in 2019 in Austin, Texas. The company
develops AI-powered analytics tools for enterprise customers.

[Doc B] In their 2025 annual report, TechStart reported revenue of $45M,
up 60% year-over-year. The company now employs 350 people.
"""

question = "Tell me about TechStart Inc"

# Step 1: Extract claims with citations
result = extract_claims_with_citations(question, sources)
print("Extracted claims:")
for c in result["claims"]:
    print(f"  - {c['claim']}")
    print(f"    Citation: {c['citation']}")
    print(f"    Source: {c['source_id']}")
    print()

# Step 2: Verify each cited claim
print("Verification results:")
for c in result["claims"]:
    if c["citation"]:
        verification = verify_claim(c["claim"], c["citation"], sources)
        print(f"  {c['claim']}")
        print(f"    Verdict: {verification['verdict']}")
        print()

Increasing Output Consistency

Temperature & top_p Effects

Temperature and top_p directly control output variability. For production agents that need deterministic results, understanding these parameters is critical:

temperature=0: Nearly deterministic (not perfectly so due to floating-point and batching), best for factual extraction
temperature=0.3–0.5: Low creativity, still reproducible for most structured tasks
temperature=1.0: Default, balanced creativity and consistency
top_p: Nucleus sampling, which restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p. Adjust either temperature or top_p, not both.

                        
                        Practical Tip: For structured extraction and classification tasks, set temperature=0. For tasks requiring tool_use, temperature has less impact because the JSON schema constrains the output format. The schema itself acts as a consistency mechanism regardless of temperature.
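To make the effect of these two parameters concrete, here is a toy sketch of the standard sampling math: softmax with temperature scaling, followed by nucleus (top_p) filtering. It illustrates the general technique only and is not a description of how the API implements sampling internally; the logits and function names are made up for the example.

```python
import math


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most probable tokens until their mass reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1.
    return {tok: p / total for tok, p in kept.items()}


# Made-up logits for three candidate labels.
logits = {"billing": 2.0, "technical": 1.0, "account": 0.0}
low_temp = apply_temperature(logits, 0.2)
default_temp = apply_temperature(logits, 1.0)
nucleus = top_p_filter(default_temp, top_p=0.9)

print({tok: round(p, 3) for tok, p in low_temp.items()})
print({tok: round(p, 3) for tok, p in default_temp.items()})
print(sorted(nucleus))
```

Lower temperature concentrates probability on the top token, while top_p removes the unlikely tail entirely; both reduce variability, which is why it is usually enough to tune one of them.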
                    

tool_use for Structural Consistency

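Forcing output through a tool schema is one of the most reliable ways to achieve structural consistency. Unlike free-text responses, which can vary in format, a forced tool_use call returns a JSON object shaped by your schema. Conformance is strong in practice, but it is still worth validating the returned input in code before acting on it: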

import anthropic
import json

client = anthropic.Anthropic()

# Demonstrate temperature effects on free text vs tool_use

# Approach 1: Free text with temperature=1.0 (inconsistent format)
free_text_response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    temperature=1.0,
    messages=[{
        "role": "user",
        "content": "Classify this support ticket: 'My payment failed twice today'"
    }]
)
print("Free text (temp=1.0):", free_text_response.content[0].text[:100])
# Output varies: sometimes bullet points, sometimes prose, different field orders

# Approach 2: tool_use forces exact structure (consistent regardless of temp)
classification_tool = {
    "name": "classify_ticket",
    "description": "Classify a support ticket into category and priority",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["billing", "technical", "account", "feature_request"]
            },
            "priority": {
                "type": "string",
                "enum": ["low", "medium", "high", "critical"]
            },
            "confidence": {
                "type": "number",
                "description": "Confidence score between 0 and 1"
            },
            "reasoning": {
                "type": "string",
                "description": "Brief explanation for the classification"
            }
        },
        "required": ["category", "priority", "confidence", "reasoning"]
    }
}

tool_response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    temperature=1.0,  # Even with high temperature, structure is enforced
    tool_choice={"type": "tool", "name": "classify_ticket"},
    tools=[classification_tool],
    messages=[{
        "role": "user",
        "content": "Classify this support ticket: 'My payment failed twice today'"
    }]
)

for block in tool_response.content:
    if block.type == "tool_use":
        print("Tool use (temp=1.0):", json.dumps(block.input, indent=2))
        # Expected keys: category, priority, confidence, reasoning
        # The forced tool call keeps this structure stable even at temperature=1.0

Few-Shot Anchoring

Providing examples in the system prompt establishes format expectations that reduce output variance. The model mimics the structure, tone, and level of detail shown in examples:

2–3 examples are sufficient for most formatting patterns
Include both typical cases and edge cases in examples
Show the exact output format you expect (JSON, bullet points, etc.)
Combine with tool_use for maximum consistency
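As a rough sketch of these points, the helper below assembles a system prompt from a task description and a handful of worked examples, serializing each expected output the same way so the format is shown exactly. The function name and the example tickets are hypothetical.

```python
import json


def build_few_shot_system_prompt(task: str, examples: list[dict]) -> str:
    """Render a task description plus input/output examples as a system prompt."""
    lines = [task, "", "Examples:"]
    for ex in examples:
        lines.append(f"Input: {ex['input']}")
        # sort_keys keeps field order identical across every example.
        lines.append(f"Output: {json.dumps(ex['output'], sort_keys=True)}")
        lines.append("")
    lines.append("Respond in exactly the same JSON format as the examples.")
    return "\n".join(lines)


# One typical case and one edge case (praise mixed with a request).
examples = [
    {"input": "My payment failed twice today",
     "output": {"category": "billing", "priority": "high"}},
    {"input": "Love the app, but could you add dark mode?",
     "output": {"category": "feature_request", "priority": "low"}},
]
system_prompt = build_few_shot_system_prompt(
    "Classify each support ticket by category and priority.", examples
)
print(system_prompt)
```

The string can be supplied through the `system` parameter of `client.messages.create`, alongside a tool schema if structural enforcement is also wanted.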

Eval-Driven Consistency

Measuring Consistency

The most rigorous approach to consistency is empirical: run the same prompt multiple times and measure variance in outputs. This technique reveals which aspects of your prompt or configuration produce unstable results:

Run N=5–10 times with identical inputs
Measure field-level agreement: Do specific fields always return the same value?
Explicit criteria reduce variance more than vague instructions (“classify as high/medium/low based on financial impact > $1000” vs “classify by priority”)
Build golden datasets — curated examples with known-correct outputs that anchor expectations

Consistency Evaluator

import anthropic
import json
from collections import Counter

client = anthropic.Anthropic()

classification_tool = {
    "name": "classify_sentiment",
    "description": "Classify the sentiment of a product review",
    "input_schema": {
        "type": "object",
        "properties": {
            "sentiment": {
                "type": "string",
                "enum": ["positive", "negative", "neutral", "mixed"]
            },
            "confidence": {
                "type": "number",
                "description": "Confidence between 0.0 and 1.0"
            },
            "key_phrases": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Top 3 phrases that determined sentiment"
            }
        },
        "required": ["sentiment", "confidence", "key_phrases"]
    }
}


def run_consistency_eval(prompt: str, n_runs: int = 5) -> dict:
    """Run the same classification N times and measure agreement."""
    results = []

    for i in range(n_runs):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            temperature=0,
            tool_choice={"type": "tool", "name": "classify_sentiment"},
            tools=[classification_tool],
            messages=[{"role": "user", "content": prompt}]
        )

        for block in response.content:
            if block.type == "tool_use":
                results.append(block.input)
                break

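    # Measure agreement over the runs that actually returned a tool call
    if not results:
        raise RuntimeError("No tool_use blocks were returned; nothing to evaluate")
    sentiments = [r["sentiment"] for r in results]
    confidences = [r["confidence"] for r in results]
    sentiment_counts = Counter(sentiments)
    majority_sentiment = sentiment_counts.most_common(1)[0]

    # Divide by completed runs so a skipped run does not deflate the rate
    agreement_rate = majority_sentiment[1] / len(results)
    # Note: this is the range (max - min), reported under the "variance" key
    confidence_variance = (
        max(confidences) - min(confidences) if confidences else 0
    )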

    return {
        "n_runs": n_runs,
        "majority_sentiment": majority_sentiment[0],
        "agreement_rate": agreement_rate,
        "sentiment_distribution": dict(sentiment_counts),
        "confidence_range": {
            "min": min(confidences) if confidences else 0,
            "max": max(confidences) if confidences else 0,
            "variance": confidence_variance
        },
        "all_results": results
    }


# Test with an ambiguous review (harder to classify consistently)
review = (
    "The product works as advertised but shipping took forever. "
    "Quality is decent for the price, though I expected better packaging."
)

eval_result = run_consistency_eval(
    f"Classify the sentiment of this review:\n\n{review}"
)

print(f"Agreement rate: {eval_result['agreement_rate']:.0%}")
print(f"Majority sentiment: {eval_result['majority_sentiment']}")
print(f"Distribution: {eval_result['sentiment_distribution']}")
print(f"Confidence range: {eval_result['confidence_range']}")

# If agreement_rate < 0.8, consider:
# 1. Adding explicit criteria ("positive = recommends product")
# 2. Confirming temperature=0 is actually set (it cannot go lower)
# 3. Adding few-shot examples for edge cases
# 4. Using tool_use with stricter enum definitions

Case Study

Healthcare Company: Reducing Hallucination Risk with Nullable Fields and Citations

A healthcare analytics company was extracting patient data from clinical notes using Claude. Their initial implementation frequently hallucinated missing lab values, invented medication dosages, and fabricated dates when the chart was ambiguous or incomplete.

Changes implemented:

Made all numeric fields nullable ("type": ["number", "null"])
Added source_citation requiring exact quotes from clinical notes
Included system prompt: “If a value is not explicitly stated in the clinical note, you MUST return null. Never infer or calculate values.”
Added post-extraction verification checking citations against source text

Result: Internal evals showed a sharp drop in hallucinated extractions after these changes. The null rate increased materially, which was the desired behavior: the system started representing missing data honestly instead of silently inventing values.


Production Guardrails Architecture

Layered Defense

Production systems use multiple overlapping guardrails rather than relying on any single technique. Each layer catches different failure modes:

Prompt engineering: System prompts with explicit grounding instructions and “I don’t know” permissions
Structured output: tool_use schemas with nullable fields and enums constraining valid values
Post-processing validation: Programmatic checks on model output (required fields populated, values in valid ranges, citations present)
Human review gate: Flag low-confidence or high-stakes outputs for human review before action
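The post-processing layer is ordinary code. As a minimal sketch, the checker below validates a ticket classification like the one produced by the classify_ticket tool earlier in this article: enum membership, a confidence value in range, and a reasoning string. The helper name `check_classification` and the error message wording are assumptions.

```python
# Allowed values mirror the enums in the classify_ticket schema.
ALLOWED = {
    "category": {"billing", "technical", "account", "feature_request"},
    "priority": {"low", "medium", "high", "critical"},
}


def check_classification(tool_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    errors = []
    for field, allowed in ALLOWED.items():
        value = tool_input.get(field)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field}: unexpected value {value!r}")
    confidence = tool_input.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        errors.append("confidence: missing or not a number")
    elif not 0.0 <= confidence <= 1.0:
        errors.append(f"confidence: {confidence} outside [0, 1]")
    if not isinstance(tool_input.get("reasoning"), str):
        errors.append("reasoning: missing or not a string")
    return errors


# Hypothetical outputs, for illustration only.
good = {"category": "billing", "priority": "high",
        "confidence": 0.92, "reasoning": "Payment failure reported."}
bad = {"category": "payments", "priority": "high",
       "confidence": 1.4, "reasoning": "Payment failure reported."}
print(check_classification(good))
print(check_classification(bad))
```

Any non-empty result can be used to reject the output, retry the call, or send the item to the human review gate.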

Layered Guardrails Pipeline

                            flowchart LR
                                A[User Input] --> B[Input Validation]
                                B --> C[Prompt Engineering Layer]
                                C --> D[Model Call with tool_use]
                                D --> E[Output Schema Validation]
                                E --> F{Confidence Check}
                                F -->|High Confidence| G[Post-Processing Rules]
                                F -->|Low Confidence| H[Human Review Queue]
                                G --> I{All Checks Pass?}
                                I -->|Yes| J[Return to User]
                                I -->|No| H
                                H --> K[Human Approves/Rejects]
                                K -->|Approved| J
                                K -->|Rejected| L[Fallback Response]
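The decision points in this pipeline reduce to a small routing function. The sketch below follows the branches in the diagram; the 0.8 threshold and the names `Route`, `route_output`, and `resolve_review` are hypothetical and would need tuning for a real application.

```python
from enum import Enum


class Route(Enum):
    RETURN_TO_USER = "return_to_user"
    HUMAN_REVIEW = "human_review"
    FALLBACK = "fallback"


CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff; tune per application


def route_output(confidence: float, checks_passed: bool) -> Route:
    """Apply the confidence check, then the post-processing check."""
    if confidence < CONFIDENCE_THRESHOLD:
        return Route.HUMAN_REVIEW  # low confidence skips straight to review
    if not checks_passed:
        return Route.HUMAN_REVIEW  # failed post-processing rules
    return Route.RETURN_TO_USER


def resolve_review(approved: bool) -> Route:
    """Outcome of the human review queue."""
    return Route.RETURN_TO_USER if approved else Route.FALLBACK


print(route_output(confidence=0.95, checks_passed=True))
print(route_output(confidence=0.95, checks_passed=False))
print(route_output(confidence=0.40, checks_passed=True))
print(resolve_review(approved=False))
```

Keeping the routing logic in one pure function makes it easy to unit test and to adjust thresholds without touching the model call.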

PostToolUse Validation Hook

The PostToolUse hook pattern intercepts model outputs after tool calls to validate them programmatically before returning results to the user. This catches hallucinations that slip through prompt-level guardrails:

import anthropic
import json
from dataclasses import dataclass

client = anthropic.Anthropic()


@dataclass
class ValidationResult:
    """Result of output validation."""
    is_valid: bool
    issues: list
    confidence: float


def validate_extraction_output(
    tool_output: dict,
    source_text: str
) -> ValidationResult:
    """
    PostToolUse validation hook: checks extraction output for
    hallucination signals before returning to user.
    """
    issues = []
    confidence = 1.0

    # Check 1: Required citations present
    citations = tool_output.get("source_citations", [])
    non_null_fields = sum(
        1 for k, v in tool_output.items()
        if v is not None and k not in ("source_citations",)
    )
    if len(citations) < non_null_fields:
        issues.append(
            f"Missing citations: {non_null_fields} non-null fields "
            f"but only {len(citations)} citations"
        )
        confidence -= 0.3

    # Check 2: Verify citations exist in source text
    for citation in citations:
        # Normalize whitespace for comparison
        normalized_citation = " ".join(citation.lower().split())
        normalized_source = " ".join(source_text.lower().split())
        if normalized_citation not in normalized_source:
            issues.append(f"Citation not found in source: '{citation[:50]}...'")
            confidence -= 0.2

    # Check 3: Numeric values within reasonable ranges
    if tool_output.get("employee_count") is not None:
        count = tool_output["employee_count"]
        if count < 1 or count > 10_000_000:
            issues.append(f"Suspicious employee count: {count}")
            confidence -= 0.4

    if tool_output.get("founding_year") is not None:
        year = tool_output["founding_year"]
        if year < 1800 or year > 2026:
            issues.append(f"Suspicious founding year: {year}")
            confidence -= 0.4

    # Check 4: Flag all-null responses (model may be too conservative)
    null_count = sum(
        1 for k, v in tool_output.items()
        if v is None and k != "source_citations"
    )
    total_fields = sum(
        1 for k in tool_output if k != "source_citations"
    )
    if null_count == total_fields:
        issues.append("All fields null - model may be overly conservative")
        confidence -= 0.1

    return ValidationResult(
        is_valid=len(issues) == 0,
        issues=issues,
        confidence=max(0.0, confidence)
    )


# Example: validate an extraction result
source = """
Pinnacle Systems was established in 2015 in Boston. The company
specializes in enterprise security solutions and employs 890 staff.
"""

# Simulated model output (imagine this came from a tool_use call)
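model_output = {
    "company_name": "Pinnacle Systems",
    "founding_year": 2015,
    "revenue": None,
    "employee_count": 890,
    "headquarters": "Boston",
    # One citation per non-null field, as Check 1 requires
    "source_citations": [
        "Pinnacle Systems was established in 2015",  # company_name
        "established in 2015",                       # founding_year
        "in 2015 in Boston",                         # headquarters
        "employs 890 staff"                          # employee_count
    ]
}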

result = validate_extraction_output(model_output, source)
print(f"Valid: {result.is_valid}")
print(f"Confidence: {result.confidence:.2f}")
if result.issues:
    print(f"Issues: {result.issues}")
else:
    print("All validation checks passed!")

When to Use Each Technique

Different guardrail techniques are most effective for different scenarios:

Technique	Best For	Limitations
Nullable fields	Data extraction from documents with incomplete info	Model may become overly conservative
Source citations	RAG applications requiring traceability	Increases token usage; citations may be imprecise
temperature=0	Classification, extraction, deterministic tasks	Not fully deterministic; reduces creative problem-solving
tool_use schemas	Any task requiring structured, consistent output	Adds latency; not ideal for open-ended generation
Post-processing validation	High-stakes outputs (financial, medical, legal)	Requires domain-specific rules; adds complexity
Human review gate	Critical decisions, novel scenarios, edge cases	Introduces latency; doesn’t scale for all outputs

Next in the Series

In Part 21: Guardrails — Jailbreaks & Prompt Leak, we’ll cover CCA Phase 4.3 and 4.4 — defending against adversarial prompt injection attacks, preventing system prompt leakage, implementing input sanitization, and building multi-layer jailbreak defenses.