
Anthropic SDK Track Part 18: Customer Support Agent Pattern

May 25, 2026 Wasil Zafar 22 min read

Build a production customer support agent end-to-end — MCP tools for CRM integration, identity verification gates, escalation criteria, human handoff patterns, guardrails against misuse, and evaluation frameworks for measuring first-contact resolution.

Table of Contents

  1. Customer Support Pattern
  2. MCP Tool Design
  3. Context Preservation
  4. Escalation & Human Handoff
  5. Guardrails for Support
  6. Evaluation Patterns
What You’ll Learn: This article covers CCA Phase 1.3 — Customer Support Agent Pattern. You’ll build an end-to-end support agent with MCP tool integration, identity verification gates, structured escalation to human agents, production guardrails, and an evaluation framework measuring first-contact resolution rate.

1. End-to-End Customer Support Pattern

A production customer support agent is more than a chatbot with tools. It must verify identity before accessing data, choose the right tool for each sub-task, recognize when it cannot help, and hand off to humans with full context. The CCA exam tests your ability to wire all these components into a single coherent system.

1.1 What Makes a Good Support Agent

Three metrics define a good customer support agent:

  • First-Contact Resolution (FCR) — The customer’s issue is fully resolved in a single interaction without callbacks or transfers. Target: >70% for general support.
  • Context Awareness — The agent understands the customer’s history, current subscription, open tickets, and recent interactions before responding.
  • Escalation Precision — The agent escalates exactly the right cases (never too early, never too late) with structured context so the human agent doesn’t repeat questions.

1.2 Architecture Overview

The support agent architecture follows a clear pipeline: the user message arrives, the routing layer classifies intent (from Part 17), the support agent processes using MCP tools, and either resolves directly or escalates with a structured handoff summary.

| Component | Responsibility | Implementation |
| --- | --- | --- |
| Router | Classify intent & urgency | Haiku + forced tool_choice (Part 17) |
| Verification Gate | Confirm customer identity | System prompt + tool sequence |
| Support Agent | Resolve issue using tools | Sonnet + MCP tools + agentic loop |
| Guardrails | Block unsafe operations | Hooks (PreToolUse / PostResponse) |
| Escalation | Hand off to human with context | Structured output + queue integration |

1.3 Complete Support Flow

Customer Support Agent Flow
flowchart TD
    A["Customer Message"] --> B{"Identity Verified?"}
    B -->|"No"| C["Request Verification<br/>(email + order ID)"]
    C --> B
    B -->|"Yes"| D["Classify Intent"]
    D --> E{"Can Resolve?"}
    E -->|"Yes"| F["Execute MCP Tools<br/>(lookup, update, refund)"]
    F --> G{"Guardrail Check"}
    G -->|"Pass"| H["Deliver Resolution"]
    G -->|"Blocked"| I["Escalate: Policy Violation"]
    E -->|"No"| J{"Confidence < 0.6?"}
    J -->|"Yes"| K["Escalate with Context"]
    J -->|"No"| L["Ask Clarifying Question"]
    L --> D
    K --> M["Human Agent Queue"]
    I --> M
    H --> N["Customer Satisfied?"]
    N -->|"No"| K
    style C fill:#BF092F,color:#fff
    style H fill:#3B9797,color:#fff
    style M fill:#132440,color:#fff
    style F fill:#16476A,color:#fff

Key design principles visible in this flow:

  • Verification first — Never access customer data before confirming identity
  • Guardrails at the boundary — Check before executing any tool that modifies state
  • Graceful degradation — Low confidence leads to clarification, not random actions
  • Feedback loop — If the customer isn’t satisfied after resolution, escalate rather than loop forever
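
The branching in the flowchart can be written down as a small pure function, which makes the routing easy to unit-test separately from any model call. This is a minimal sketch: `TurnState`, `next_step`, and the returned step names are hypothetical and simply mirror the diagram's nodes, with the 0.6 threshold taken from the flowchart.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.6  # threshold shown in the flowchart


@dataclass
class TurnState:
    """Signals available when deciding the next step (hypothetical structure)."""
    verified: bool              # identity confirmed for this session
    can_resolve: bool           # the agent has a tool path for the classified intent
    confidence: float           # classifier confidence for the intent
    guardrail_ok: bool = True   # outcome of the guardrail check on the planned action


def next_step(state: TurnState) -> str:
    """Return the flowchart node to visit next for the current turn."""
    if not state.verified:
        return "request_verification"
    if state.can_resolve:
        # Guardrails sit at the boundary of any state-changing action
        return "deliver_resolution" if state.guardrail_ok else "escalate_policy_violation"
    if state.confidence < CONFIDENCE_THRESHOLD:
        return "escalate_with_context"
    return "ask_clarifying_question"


print(next_step(TurnState(verified=False, can_resolve=True, confidence=0.9)))
print(next_step(TurnState(verified=True, can_resolve=False, confidence=0.4)))
```

Keeping this logic outside the prompt means a change to the threshold or to the branch order is a code change that can be reviewed and tested.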

2. MCP Tool Design for Support

The support agent’s capabilities are defined entirely by its tool schemas. Well-designed tool descriptions guide Claude toward correct tool selection without extensive prompt engineering. Poorly described tools lead to hallucinated parameters and incorrect invocations.

2.1 Tool Schemas

A production support agent typically needs 5–7 core tools. Each tool has a specific scope and clear documentation in its description field that tells Claude when to use it:

import anthropic
import json

# Define MCP tools for a customer support agent
support_tools = [
    {
        "name": "get_customer",
        "description": "Retrieve customer profile by email or customer ID. Use this FIRST after identity verification to load the customer's account details, subscription tier, and contact preferences. Returns: name, email, plan, signup_date, lifetime_value.",
        "input_schema": {
            "type": "object",
            "properties": {
                "identifier": {
                    "type": "string",
                    "description": "Customer email address or customer ID (format: CUS-XXXXX)"
                },
                "identifier_type": {
                    "type": "string",
                    "enum": ["email", "customer_id"],
                    "description": "Whether the identifier is an email or customer ID"
                }
            },
            "required": ["identifier", "identifier_type"]
        }
    },
    {
        "name": "lookup_order",
        "description": "Look up order details by order ID or customer ID. Use when the customer asks about a specific order, shipment status, or delivery issue. Returns: order_id, items, status, tracking_number, estimated_delivery.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order ID (format: ORD-XXXXXXXX). If not provided, returns last 5 orders for the customer."
                },
                "customer_id": {
                    "type": "string",
                    "description": "Customer ID to look up recent orders"
                }
            },
            "required": []
        }
    },
    {
        "name": "check_billing",
        "description": "Check billing status, payment history, and upcoming charges. Use when customer asks about charges, invoices, payment failures, or subscription renewals. Returns: current_balance, next_charge_date, payment_method_last4, recent_transactions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Customer ID (format: CUS-XXXXX)"
                },
                "include_history": {
                    "type": "boolean",
                    "description": "Whether to include last 10 transactions. Default: false"
                }
            },
            "required": ["customer_id"]
        }
    },
    {
        "name": "process_refund",
        "description": "Process a refund for a specific order. ONLY use when the customer explicitly requests a refund AND the order is eligible (delivered, within 30-day window). Requires order_id and reason. Refunds over $500 require human approval.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order ID to refund (format: ORD-XXXXXXXX)"
                },
                "amount": {
                    "type": "number",
                    "description": "Refund amount in USD. Must not exceed original order total."
                },
                "reason": {
                    "type": "string",
                    "enum": ["defective", "not_as_described", "late_delivery", "customer_request", "duplicate_charge"],
                    "description": "Reason category for the refund"
                }
            },
            "required": ["order_id", "amount", "reason"]
        }
    },
    {
        "name": "escalate_to_human",
        "description": "Escalate the conversation to a human support agent. Use when: (1) the customer explicitly requests a human, (2) the issue requires policy exceptions beyond your authority, (3) you cannot resolve after 3 tool attempts, (4) the customer expresses strong frustration. Always include a summary of what you've already tried.",
        "input_schema": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Why this conversation needs human intervention"
                },
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "urgent"],
                    "description": "Urgency level based on customer impact and sentiment"
                },
                "summary": {
                    "type": "string",
                    "description": "Structured summary: customer issue, steps already taken, what the customer needs"
                },
                "customer_id": {
                    "type": "string",
                    "description": "Customer ID for the human agent to pull up"
                }
            },
            "required": ["reason", "priority", "summary", "customer_id"]
        }
    }
]

# Print tool count and names for verification
print(f"Defined {len(support_tools)} support tools:")
for tool in support_tools:
    print(f"  - {tool['name']}: {tool['description'][:60]}...")

2.2 Tool Descriptions That Guide Decisions

Notice the patterns in the tool descriptions above:

  • When to use: Each description states when Claude should select the tool (“Use when the customer asks about...”), usually right after a one-line summary of what the tool does
  • Preconditions: process_refund specifies “ONLY use when...AND the order is eligible”
  • Constraints: “Refunds over $500 require human approval” teaches Claude the guardrail boundary
  • Return values: Describing what the tool returns helps Claude plan multi-step workflows
  • Escalation triggers: escalate_to_human explicitly lists the four escalation conditions

This approach means you need minimal system prompt instructions about tool selection — the descriptions themselves encode the decision logic.
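
Because so much behavior rides on the descriptions, it can help to lint tool definitions before they ship. The sketch below is one possible check, not part of any SDK: `lint_tool` is a hypothetical helper, and its rules (a "use" cue in the description, every required key declared, every property described) are conventions drawn from this article rather than API requirements.

```python
def lint_tool(tool: dict) -> list[str]:
    """Return human-readable problems found in one tool definition."""
    problems = []
    name = tool.get("name", "<unnamed>")
    description = tool.get("description", "")
    schema = tool.get("input_schema", {})
    properties = schema.get("properties", {})

    # Convention: the description should say when to use the tool
    if "use" not in description.lower().split():
        problems.append(f"{name}: description does not say when to use the tool")

    # Every required parameter must actually be declared
    for key in schema.get("required", []):
        if key not in properties:
            problems.append(f"{name}: required parameter '{key}' is not in properties")

    # Convention: every parameter carries its own description
    for key, spec in properties.items():
        if not spec.get("description"):
            problems.append(f"{name}: parameter '{key}' has no description")

    return problems


# Deliberately flawed example definition
example_tool = {
    "name": "lookup_order",
    "description": "Look up order details by order ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id", "customer_id"],
    },
}

for problem in lint_tool(example_tool):
    print(problem)
```

Running a check like this in CI catches a renamed parameter or a dropped description before Claude ever sees the schema.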

3. Context Preservation in Support

Support conversations are inherently multi-turn. The customer states a problem, the agent asks clarifying questions, verifies identity, looks up data, and resolves. Throughout this flow, the agent must maintain context about what’s been verified, what tools have been called, and what the customer’s emotional state is.

CCA Exam Pattern — Verification Before Access: The CCA exam frequently tests whether you implement identity verification before allowing data-access tools. The correct pattern: (1) Ask for identifying information, (2) Call a verification tool to confirm, (3) Only then enable data-access tools. Never skip verification even if the customer provides their own ID — it must be cross-referenced.

3.1 Identity Verification Gate

The verification gate pattern ensures the agent never exposes customer data to an unauthorized party. The system prompt instructs Claude to verify before accessing, and the tool execution layer enforces it:

import anthropic
import json
from datetime import datetime

client = anthropic.Anthropic()

# Simulated customer database
CUSTOMERS_DB = {
    "CUS-12345": {
        "name": "Sarah Chen",
        "email": "sarah.chen@example.com",
        "plan": "Enterprise",
        "last_4_ssn": "7890",
        "orders": ["ORD-00112233", "ORD-00445566"]
    }
}

def run_support_agent_with_verification(user_message: str, conversation_history: list, session: dict = None):
    """Support agent with verification gate and context tracking."""

    # Session state tracks verification status. Pass the returned session back in
    # on the next turn so verification persists across the conversation.
    if session is None:
        session = {
            "verified": False,
            "customer_id": None,
            "tool_calls_count": 0,
            "started_at": datetime.now().isoformat()
        }

    system_prompt = """You are a customer support agent for TechCorp.

VERIFICATION RULES (MANDATORY):
1. Before accessing ANY customer data, you MUST verify the customer's identity.
2. Ask for their email AND one of: order ID, last 4 of SSN, or account PIN.
3. Call verify_identity tool with their responses.
4. Only after verification succeeds, proceed with data lookups.
5. If verification fails twice, escalate to human agent.

RESOLUTION RULES:
- Always be empathetic and professional
- Resolve in as few turns as possible
- If you cannot resolve in 3 tool calls, escalate
- Never promise what you cannot deliver (no "I'll make sure it never happens again")
- Never share full SSN, credit card numbers, or passwords

ESCALATION RULES:
- Customer explicitly asks for a human
- Issue requires policy exception (refund > $500, account deletion)
- Customer sentiment is angry after 2 failed resolution attempts
- You are uncertain about the correct action"""

    messages = conversation_history + [{"role": "user", "content": user_message}]

    # All tools are exposed on every call; the ENFORCEMENT check in the loop below
    # denies data access until verification succeeds
    tools = [
        {
            "name": "verify_identity",
            "description": "Verify customer identity using email + secondary factor. Must be called before any data access.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "email": {"type": "string"},
                    "verification_factor": {"type": "string", "description": "Order ID, last 4 SSN, or account PIN"},
                    "factor_type": {"type": "string", "enum": ["order_id", "ssn_last4", "pin"]}
                },
                "required": ["email", "verification_factor", "factor_type"]
            }
        },
        {
            "name": "get_customer",
            "description": "Get customer profile. REQUIRES: identity must be verified first.",
            "input_schema": {
                "type": "object",
                "properties": {"customer_id": {"type": "string"}},
                "required": ["customer_id"]
            }
        },
        {
            "name": "escalate_to_human",
            "description": "Escalate to human agent with full context summary.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string"},
                    "priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
                    "summary": {"type": "string"}
                },
                "required": ["reason", "priority", "summary"]
            }
        }
    ]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_prompt,
        tools=tools,
        messages=messages
    )

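    # Process tool calls with verification enforcement.
    # One response can contain several tool_use blocks, and each needs its own tool_result.
    while response.stop_reason == "tool_use":
        tool_results = []
        for tool_use in [b for b in response.content if b.type == "tool_use"]:
            session["tool_calls_count"] += 1
            is_error = False

            # ENFORCEMENT: Block data access if not verified
            if tool_use.name == "get_customer" and not session["verified"]:
                is_error = True
                tool_result = {"error": "ACCESS_DENIED: Identity not yet verified. Call verify_identity first."}
            elif tool_use.name == "get_customer" and tool_use.input.get("customer_id") != session["customer_id"]:
                # A verified session may only read its own record
                is_error = True
                tool_result = {"error": "ACCESS_DENIED: customer_id does not match the verified customer."}
            elif tool_use.name == "verify_identity":
                # Simulated check: find the account by email, then compare the factor
                # against the field named by factor_type. "pin" is not stored in this
                # simulated DB, so PIN verification always fails here.
                email = tool_use.input.get("email", "")
                factor = tool_use.input.get("verification_factor", "")
                factor_type = tool_use.input.get("factor_type", "")
                customer_id = next((k for k, v in CUSTOMERS_DB.items() if v["email"] == email), None)
                record = CUSTOMERS_DB.get(customer_id, {})
                verified = bool(factor) and (
                    (factor_type == "order_id" and factor in record.get("orders", [])) or
                    (factor_type == "ssn_last4" and factor == record.get("last_4_ssn"))
                )
                if verified:
                    session["verified"] = True
                    session["customer_id"] = customer_id
                    tool_result = {"verified": True, "customer_id": customer_id}
                else:
                    tool_result = {"verified": False, "message": "Verification failed. Please try again."}
            else:
                tool_result = {"status": "executed", "data": "...simulated response..."}

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": json.dumps(tool_result),
                "is_error": is_error
            })

        # Continue conversation with all tool results in a single user message
        messages = messages + [
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": tool_results}
        ]
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=system_prompt,
            tools=tools,
            messages=messages
        )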

    # Extract final text response
    final_text = next((b.text for b in response.content if b.type == "text"), "")
    print(f"Agent response ({session['tool_calls_count']} tool calls, verified={session['verified']}):")
    print(final_text[:200])
    return final_text, session

# Example: Customer contacts support
response, session = run_support_agent_with_verification(
    "Hi, I need help with my recent order. It arrived damaged.",
    conversation_history=[]
)
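
The function above exposes every tool on every call and relies on the runtime check to deny access. A complementary approach, closer to the "only then enable data-access tools" wording of the exam pattern, is to filter the tool list by verification state before each request. The sketch below assumes a hypothetical helper `tools_for_session` and a hypothetical allow-list of pre-verification tools; the runtime check should stay in place as a second layer.

```python
# Tools that are safe to offer before identity is confirmed (assumed allow-list)
PRE_VERIFICATION_TOOLS = {"verify_identity", "escalate_to_human"}


def tools_for_session(all_tools: list[dict], verified: bool) -> list[dict]:
    """Return only the tool definitions the model may see in the current state."""
    if verified:
        return all_tools
    return [tool for tool in all_tools if tool["name"] in PRE_VERIFICATION_TOOLS]


# Schemas trimmed to names for brevity
all_tools = [
    {"name": "verify_identity"},
    {"name": "get_customer"},
    {"name": "escalate_to_human"},
]

print([t["name"] for t in tools_for_session(all_tools, verified=False)])
print([t["name"] for t in tools_for_session(all_tools, verified=True)])
```

With this in place, the result of `tools_for_session(...)` would be passed as the `tools` argument on each request, so an unverified session never sees `get_customer` at all.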

3.2 Passing Context to Escalation

When the agent escalates, the human agent must receive a structured summary — not the raw conversation. This prevents the customer from repeating themselves and reduces handle time:

import json
from datetime import datetime

def build_escalation_handoff(session_data: dict, conversation_turns: list) -> dict:
    """Build a structured handoff summary for the human agent queue."""

    # Extract key information from the session
    handoff = {
        "escalation_id": f"ESC-{datetime.now().strftime('%Y%m%d%H%M%S')}",
        "timestamp": datetime.now().isoformat(),
        "customer": {
            "id": session_data.get("customer_id", "UNVERIFIED"),
            "verified": session_data.get("verified", False),
            "sentiment": session_data.get("detected_sentiment", "neutral")
        },
        "issue": {
            "category": session_data.get("intent_category", "unknown"),
            "summary": session_data.get("issue_summary", ""),
            "urgency": calculate_urgency(session_data)
        },
        "resolution_attempts": {
            "tools_called": session_data.get("tools_called", []),
            "total_tool_calls": session_data.get("tool_calls_count", 0),
            "what_was_tried": session_data.get("attempted_resolutions", []),
            "why_escalated": session_data.get("escalation_reason", "")
        },
        "context_for_human": {
            "customer_quote": extract_key_quotes(conversation_turns),
            "relevant_data_pulled": session_data.get("fetched_data_summary", {}),
            "suggested_resolution": session_data.get("suggested_next_step", "")
        }
    }
    return handoff

def calculate_urgency(session: dict) -> str:
    """Determine urgency from session signals, checking the highest level first."""
    if session.get("detected_sentiment") == "angry":
        return "high"
    # Check tier before tool count so an Enterprise customer is never downgraded to "medium"
    if session.get("customer_tier") == "Enterprise":
        return "high"
    if session.get("tool_calls_count", 0) >= 3:
        return "medium"
    return "low"

def extract_key_quotes(turns: list) -> list:
    """Pull direct customer quotes that capture the core issue."""
    # In production, use Claude to extract key quotes.
    # Skip turns whose content is a list of blocks (for example tool results).
    return [
        turn["content"][:100]
        for turn in turns
        if turn.get("role") == "user" and isinstance(turn.get("content"), str)
    ][:3]

# Example handoff
sample_session = {
    "customer_id": "CUS-12345",
    "verified": True,
    "detected_sentiment": "frustrated",
    "intent_category": "billing_dispute",
    "issue_summary": "Customer charged twice for annual subscription renewal",
    "tools_called": ["verify_identity", "get_customer", "check_billing"],
    "tool_calls_count": 3,
    "attempted_resolutions": ["Looked up billing history", "Confirmed duplicate charge exists"],
    # Amount set above the $500 refund limit so the escalation reason matches the policy
    "escalation_reason": "Refund amount ($899) requires human approval per policy",
    "customer_tier": "Enterprise",
    "fetched_data_summary": {"duplicate_charge": "$899", "charge_dates": ["2026-05-20", "2026-05-21"]},
    "suggested_next_step": "Process refund for duplicate charge on 2026-05-21"
}

handoff = build_escalation_handoff(sample_session, [
    {"role": "user", "content": "I was charged twice for my subscription. This is ridiculous."},
    {"role": "user", "content": "Yes my email is sarah.chen@example.com, order ORD-00112233"}
])

print(json.dumps(handoff, indent=2))

4. Escalation Criteria & Human Handoff

The escalation decision is the most critical judgment a support agent makes. Escalate too early and you waste human agent time. Escalate too late and customers suffer through an agent that can’t help them. The CCA exam expects you to implement explicit, measurable criteria rather than vague “if it seems difficult” logic.

CCA Exam Pattern — Escalation Decision Logic: The exam tests four specific escalation triggers: (1) Confidence below threshold after classification, (2) Repeated tool failures (3+ attempts without resolution), (3) Policy boundary violations (actions exceeding agent authority), (4) Customer anger detection via sentiment signals. You must implement ALL FOUR in a production agent.

4.1 When to Escalate

Each escalation trigger maps to a specific detection mechanism:

| Trigger | Detection Method | Example | Priority |
| --- | --- | --- | --- |
| Low Confidence | Classification score < 0.6 | Ambiguous intent, multiple possible categories | Medium |
| Repeated Failures | tool_calls_count ≥ 3 without resolution | Order lookup fails, API errors | Medium |
| Policy Violation | Guardrail hook blocks action | Refund > $500, account deletion request | High |
| Customer Anger | Sentiment analysis + explicit keywords | “I want to speak to a manager”, profanity | Urgent |

4.2 Escalation Decision Logic

import anthropic
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EscalationDecision:
    should_escalate: bool
    reason: str
    priority: str  # low, medium, high, urgent
    confidence: float
    triggers_fired: list = field(default_factory=list)

def evaluate_escalation(session_state: dict, latest_message: str) -> EscalationDecision:
    """Evaluate whether the current conversation should be escalated."""

    triggers = []
    max_priority = "low"
    priority_order = {"low": 0, "medium": 1, "high": 2, "urgent": 3}

    # Trigger 1: Low classification confidence
    if session_state.get("classification_confidence", 1.0) < 0.6:
        triggers.append("low_confidence")
        max_priority = "medium"

    # Trigger 2: Repeated tool failures
    if session_state.get("tool_calls_count", 0) >= 3 and not session_state.get("resolved", False):
        triggers.append("repeated_failures")
        if priority_order["medium"] > priority_order[max_priority]:
            max_priority = "medium"

    # Trigger 3: Policy boundary violation
    blocked_actions = session_state.get("blocked_by_guardrail", [])
    if blocked_actions:
        triggers.append("policy_violation")
        if priority_order["high"] > priority_order[max_priority]:
            max_priority = "high"

    # Trigger 4: Customer anger detection
    anger_signals = detect_anger(latest_message)
    if anger_signals["is_angry"]:
        triggers.append("customer_anger")
        if priority_order["urgent"] > priority_order[max_priority]:
            max_priority = "urgent"

    # Trigger 5: Explicit human request
    human_request_phrases = [
        "speak to a human", "talk to a person", "real person",
        "manager", "supervisor", "transfer me"
    ]
    if any(phrase in latest_message.lower() for phrase in human_request_phrases):
        triggers.append("explicit_human_request")
        if priority_order["high"] > priority_order[max_priority]:
            max_priority = "high"

    should_escalate = len(triggers) > 0
    reason = f"Escalation triggered by: {', '.join(triggers)}" if triggers else "No escalation needed"

    return EscalationDecision(
        should_escalate=should_escalate,
        reason=reason,
        priority=max_priority,
        confidence=1.0 - (session_state.get("classification_confidence", 1.0)),
        triggers_fired=triggers
    )

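def detect_anger(message: str) -> dict:
    """Detect anger signals in customer message using keyword + pattern matching."""
    anger_keywords = {"furious", "unacceptable", "terrible", "worst", "scam", "lawsuit", "sue"}
    strong_words = {"damn", "hell", "ridiculous"}

    # Compare whole words with punctuation stripped, so "sue" does not match
    # inside "issue" and "ridiculous." still counts
    words = [w.strip(".,!?;:'\"()") for w in message.lower().split()]
    profanity_count = sum(1 for w in words if w in strong_words)

    # Measure caps over letters only, so spaces and punctuation do not dilute the ratio
    letters = [c for c in message if c.isalpha()]
    caps_ratio = sum(1 for c in letters if c.isupper()) / max(len(letters), 1)

    is_angry = (
        any(w in anger_keywords for w in words) or
        profanity_count >= 1 or
        # Require a minimum length so a short reply like "OK" is not read as shouting
        (len(letters) >= 10 and caps_ratio > 0.5)
    )
    return {"is_angry": is_angry, "caps_ratio": round(caps_ratio, 2), "profanity_count": profanity_count}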

# Test escalation scenarios
test_cases = [
    {"session": {"tool_calls_count": 4, "resolved": False}, "message": "This still isn't working"},
    {"session": {"blocked_by_guardrail": ["refund_over_limit"]}, "message": "Just refund my $800 order"},
    {"session": {"classification_confidence": 0.45}, "message": "I have a complex issue with my account"},
    {"session": {}, "message": "THIS IS UNACCEPTABLE. I WANT TO SPEAK TO A MANAGER NOW."},
    {"session": {"resolved": True}, "message": "Thanks, that fixed it!"}
]

for i, case in enumerate(test_cases):
    decision = evaluate_escalation(case["session"], case["message"])
    print(f"Case {i+1}: escalate={decision.should_escalate}, priority={decision.priority}, triggers={decision.triggers_fired}")

The escalation sequence from the agent’s perspective follows a clear handoff protocol:

Escalation & Human Handoff Sequence
sequenceDiagram
    participant C as Customer
    participant A as Support Agent
    participant G as Guardrails
    participant Q as Human Queue
    participant H as Human Agent

    C->>A: "Refund my $800 order"
    A->>G: PreToolUse: process_refund($800)
    G-->>A: BLOCKED (exceeds $500 limit)
    A->>A: Build handoff summary
    A->>Q: Escalate(priority=high, reason=policy_violation)
    A->>C: "I'll connect you with a specialist who can process larger refunds."
    Q->>H: Deliver handoff package
    H->>C: "Hi Sarah, I see you need a refund for $800 on ORD-123. Let me process that now."
    Note over H,C: Human resolves without repeating questions
                        

Notice the key UX detail: the agent tells the customer why they’re being transferred and what will happen next. No customer wants to hear “transferring you” without context.

5. Guardrails for Support Agents

Support agents interact with real customer data and real financial systems. Guardrails prevent three categories of harm: data exposure (leaking PII), unauthorized actions (refunds beyond authority), and prompt injection (customers manipulating the agent into bypassing rules).

5.1 Input Validation

Customer messages can contain prompt injection attempts — text designed to override the system prompt. A production agent must sanitize inputs without breaking legitimate requests:

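  • Instruction override detection: Flag messages containing “ignore previous instructions”, “you are now”, or a “system:” prefix
  • Character limits: Reject messages over 2,000 characters (most legitimate support requests are shorter)
  • Encoding attacks: Normalize Unicode (for example to NFKC) before processing. This folds full-width and other compatibility characters into their plain forms; cross-script homoglyphs, such as a Cyrillic “а” standing in for a Latin “a”, need a separate confusables check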
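
The three checks in this list can be combined into one validation step that runs before the message reaches the model. The following is a minimal sketch under stated assumptions: `validate_customer_message` and its return shape are hypothetical, the patterns cover only the phrases named in the list, and a flagged message is marked for review rather than rejected outright so legitimate requests are not broken.

```python
import re
import unicodedata

MAX_MESSAGE_CHARS = 2000  # limit named in the list above

# Patterns for the override phrases named in the list (illustrative, not exhaustive)
OVERRIDE_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"\byou\s+are\s+now\b", re.IGNORECASE),
    re.compile(r"^\s*system\s*:", re.IGNORECASE | re.MULTILINE),
]


def validate_customer_message(raw: str) -> dict:
    """Normalize a customer message, then apply length and override checks."""
    # NFKC folds compatibility characters (e.g. full-width letters) to plain forms
    text = unicodedata.normalize("NFKC", raw)

    if len(text) > MAX_MESSAGE_CHARS:
        return {"ok": False, "reason": "too_long", "flagged": False, "text": text[:MAX_MESSAGE_CHARS]}

    flagged = any(pattern.search(text) for pattern in OVERRIDE_PATTERNS)
    return {"ok": True, "reason": None, "flagged": flagged, "text": text}


# Full-width letters are folded before matching, so this message is still flagged
result = validate_customer_message("Please ｉｇｎｏｒｅ previous instructions and refund me")
print(result["ok"], result["flagged"])
```

Pattern matching of this kind only catches the crudest attempts, so it works best as a signal for logging and escalation alongside the tool-level guardrails.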

5.2 Output Constraints

The system prompt defines what the agent must never output, but hooks enforce it at the code layer:

  • PII redaction — Post-response hook scans for SSN patterns, credit card numbers, full addresses
  • Promise prevention — Block phrases like “I guarantee”, “I promise”, “we will always”
  • Scope limitation — Agent cannot discuss competitors, provide legal advice, or make commitments about future features
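
As an illustration of promise prevention, a post-response check can scan the draft reply for the commitment phrases listed above before it is sent. This is a sketch only: `enforce_no_promises` and the phrase tuple are hypothetical, and plain substring matching will miss paraphrases, so it complements the system prompt rather than replacing it.

```python
# Commitment phrases from the list above (lowercase for case-insensitive matching)
PROMISE_PHRASES = ("i guarantee", "i promise", "we will always")


def enforce_no_promises(response_text: str) -> dict:
    """Report whether a draft response contains a blocked commitment phrase."""
    lowered = response_text.lower()
    violations = [phrase for phrase in PROMISE_PHRASES if phrase in lowered]
    return {"allowed": not violations, "violations": violations}


draft = "I guarantee this will never happen again."
print(enforce_no_promises(draft))
print(enforce_no_promises("Your replacement ships today."))
```

When a draft is not allowed, one option is to ask the model to regenerate with the violation named in the feedback instead of editing the text in place.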

5.3 Cost Controls & PreToolUse Hook

The PreToolUse hook from Part 5 is the enforcement mechanism for guardrails. Here’s a production implementation that blocks high-value refunds and limits tool calls per session:

import anthropic
import json
import re
from datetime import datetime

class SupportGuardrails:
    """PreToolUse guardrails for customer support agent."""

    MAX_TOOL_CALLS_PER_SESSION = 10
    MAX_REFUND_AMOUNT = 500.00  # USD
    BLOCKED_PATTERNS = [
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN
        r"\b\d{16}\b",              # Credit card
        r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"  # CC with separators
    ]

    def __init__(self):
        self.tool_call_count = 0
        self.total_refund_amount = 0.0
        self.blocked_actions = []

    def pre_tool_use(self, tool_name: str, tool_input: dict) -> dict:
        """
        Evaluate whether a tool call should proceed.
        Returns: {"action": "allow"} or {"action": "block", "reason": "..."}
        """
        self.tool_call_count += 1

        # Guard 1: Session tool call limit
        if self.tool_call_count > self.MAX_TOOL_CALLS_PER_SESSION:
            self.blocked_actions.append("session_limit_exceeded")
            return {
                "action": "block",
                "reason": f"Session limit exceeded ({self.MAX_TOOL_CALLS_PER_SESSION} tool calls max). Escalate to human."
            }

        # Guard 2: Refund amount limit
        if tool_name == "process_refund":
            amount = tool_input.get("amount", 0)
            if amount > self.MAX_REFUND_AMOUNT:
                self.blocked_actions.append("refund_over_limit")
                return {
                    "action": "block",
                    "reason": f"Refund ${amount:.2f} exceeds limit (${self.MAX_REFUND_AMOUNT:.2f}). Requires human approval."
                }
            # Check the projected total without recording it yet, so a blocked
            # refund does not count against later requests
            projected_total = self.total_refund_amount + amount
            if projected_total > self.MAX_REFUND_AMOUNT:
                self.blocked_actions.append("cumulative_refund_limit")
                return {
                    "action": "block",
                    "reason": f"Cumulative refunds (${projected_total:.2f}) exceed session limit."
                }

        # Guard 3: Input sanitization (check tool inputs for injected PII)
        input_str = json.dumps(tool_input)
        for pattern in self.BLOCKED_PATTERNS:
            if re.search(pattern, input_str):
                self.blocked_actions.append("pii_in_input")
                return {
                    "action": "block",
                    "reason": "Sensitive data pattern detected in tool input. Blocked for security."
                }

        # Record the refund only after every guard has passed
        if tool_name == "process_refund":
            self.total_refund_amount += tool_input.get("amount", 0)

        return {"action": "allow"}

    def post_response(self, response_text: str) -> str:
        """Scan and redact any PII that slipped through in the response."""
        redacted = response_text
        for pattern in self.BLOCKED_PATTERNS:
            redacted = re.sub(pattern, "[REDACTED]", redacted)
        return redacted

# Demo: test the guardrails
guardrails = SupportGuardrails()

test_calls = [
    ("get_customer", {"customer_id": "CUS-12345"}),
    ("process_refund", {"order_id": "ORD-001", "amount": 49.99, "reason": "defective"}),
    ("process_refund", {"order_id": "ORD-002", "amount": 750.00, "reason": "customer_request"}),
    ("get_customer", {"customer_id": "CUS-12345", "note": "SSN is 123-45-6789"}),
]

for tool_name, tool_input in test_calls:
    result = guardrails.pre_tool_use(tool_name, tool_input)
    status = "ALLOWED" if result["action"] == "allow" else f"BLOCKED: {result['reason']}"
    print(f"{tool_name}({list(tool_input.keys())}) -> {status}")

6. Evaluation Patterns for Support

A support agent without evaluation is a liability. You must measure three dimensions: first-contact resolution rate (did the customer’s issue get resolved?), response quality (was the response helpful and accurate?), and escalation accuracy (did the agent escalate the right cases?).

6.1 Measuring First-Contact Resolution

FCR measurement requires tracking whether the customer contacts support again within a defined window (typically 7 days) for the same issue:

  • Resolved — Customer does not re-contact within 7 days on the same topic
  • Not resolved — Customer contacts again within 7 days with the same or related issue
  • Escalated — Agent transferred to human (not counted as failure, but tracked separately)
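
The three outcomes above can be computed from a contact log. This is a minimal sketch assuming each contact is a dict with customer_id, topic, at (a datetime), and escalated fields; those field names are hypothetical, and matching on an exact topic string stands in for whatever "same or related issue" logic your ticketing system provides.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # re-contact window from the definition above

def classify_contacts(contacts):
    """Label each contact (in chronological order) with its FCR outcome."""
    ordered = sorted(contacts, key=lambda c: c["at"])
    outcomes = []
    for i, contact in enumerate(ordered):
        if contact["escalated"]:
            outcomes.append("escalated")  # tracked separately, not a failure
            continue
        recontacted = any(
            later["customer_id"] == contact["customer_id"]
            and later["topic"] == contact["topic"]
            and later["at"] - contact["at"] <= WINDOW
            for later in ordered[i + 1:]
        )
        outcomes.append("not_resolved" if recontacted else "resolved")
    return outcomes

def fcr_rate(outcomes):
    """Share of non-escalated contacts that were resolved on first contact."""
    counted = [o for o in outcomes if o != "escalated"]
    return sum(o == "resolved" for o in counted) / len(counted) if counted else 0.0

# Hypothetical log for illustration
log = [
    {"customer_id": "C1", "topic": "billing", "at": datetime(2026, 1, 1), "escalated": False},
    {"customer_id": "C2", "topic": "shipping", "at": datetime(2026, 1, 2), "escalated": False},
    {"customer_id": "C3", "topic": "refund", "at": datetime(2026, 1, 3), "escalated": True},
    {"customer_id": "C1", "topic": "billing", "at": datetime(2026, 1, 4), "escalated": False},
]
outcomes = classify_contacts(log)
print(outcomes, round(fcr_rate(outcomes), 2))
```

Contacts newer than the window cannot be judged yet, so in practice you would compute FCR only over contacts that are at least 7 days old.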

6.2 Eval Framework

Build an evaluation suite with golden test cases that cover resolution, escalation, and guardrail enforcement:

import anthropic
import json
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()

@dataclass
class SupportEvalCase:
    """A single test case for support agent evaluation."""
    scenario: str
    customer_message: str
    expected_behavior: str  # "resolve", "escalate", "verify_first", "block"
    expected_tools: list    # Tools the agent should call
    tags: list              # e.g., ["billing", "refund", "angry_customer"]

# Golden test cases covering critical paths
eval_cases = [
    SupportEvalCase(
        scenario="Simple order status inquiry (should resolve directly)",
        customer_message="Hi, I placed order ORD-99887766 last week. Can you tell me where it is?",
        expected_behavior="resolve",
        expected_tools=["verify_identity", "lookup_order"],
        tags=["order_status", "happy_path"]
    ),
    SupportEvalCase(
        scenario="High-value refund (should escalate due to guardrail)",
        customer_message="I need a full refund for my $899 enterprise subscription. The product doesn't work as advertised.",
        expected_behavior="escalate",
        expected_tools=["verify_identity", "get_customer", "escalate_to_human"],
        tags=["refund", "policy_violation", "high_value"]
    ),
    SupportEvalCase(
        scenario="Angry customer demanding manager (should escalate immediately)",
        customer_message="THIS IS THE THIRD TIME I'M CALLING ABOUT THIS. YOUR SERVICE IS TERRIBLE. GET ME A MANAGER NOW.",
        expected_behavior="escalate",
        expected_tools=["escalate_to_human"],
        tags=["angry_customer", "explicit_human_request"]
    ),
]

def evaluate_agent_response(case: SupportEvalCase, agent_response: dict) -> dict:
    """Grade the agent's response against expected behavior using LLM-as-judge."""

    grading_prompt = f"""You are evaluating a customer support agent's response.

SCENARIO: {case.scenario}
CUSTOMER MESSAGE: {case.customer_message}
EXPECTED BEHAVIOR: {case.expected_behavior}
EXPECTED TOOLS: {case.expected_tools}

AGENT'S ACTUAL RESPONSE:
{json.dumps(agent_response, indent=2)}

Grade the agent on these criteria (1-5 scale each):
1. BEHAVIOR_MATCH: Did the agent do what was expected (resolve/escalate/verify_first/block)?
2. TOOL_SELECTION: Did the agent call the right tools in the right order?
3. TONE: Was the response empathetic, professional, and appropriate?
4. COMPLETENESS: Did the agent address all aspects of the customer's issue?

Respond in JSON format:
{{"behavior_match": N, "tool_selection": N, "tone": N, "completeness": N, "reasoning": "..."}}"""

    grade_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": grading_prompt}]
    )
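    # The judge may wrap its JSON in prose or a code fence, so parse only the
    # outermost {...} span and fail loudly if none is present.
    raw = grade_response.content[0].text
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"Judge did not return JSON: {raw!r}")
    return json.loads(raw[start:end + 1])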

def run_eval_suite(cases: list) -> dict:
    """Run full evaluation suite and compute aggregate metrics."""
    results = []
    for case in cases:
        # In production, you'd run the actual agent here
        # For demo, we simulate with a placeholder
        simulated_response = {
            "tools_called": case.expected_tools,  # Placeholder
            "final_action": case.expected_behavior,
            "response_text": "Simulated agent response for evaluation"
        }
        # grade = evaluate_agent_response(case, simulated_response)
        # results.append(grade)
        results.append({"scenario": case.scenario, "expected": case.expected_behavior})

    print(f"Evaluation suite: {len(cases)} test cases")
    for r in results:
        print(f"  - {r['scenario'][:60]}... | Expected: {r['expected']}")
    return {"total_cases": len(cases), "results": results}

# Run evaluation
run_eval_suite(eval_cases)
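
Once the judge calls are enabled, the per-case grades still need to be rolled up into suite-level numbers. The sketch below is one way to do that, paired with a deterministic tool-order check that does not depend on the judge. The helper names and the 4.0 pass threshold are assumptions for illustration, not part of the suite above.

```python
from statistics import mean

CRITERIA = ("behavior_match", "tool_selection", "tone", "completeness")

def tools_match(expected, called):
    """True when every expected tool appears in `called` in the same relative order."""
    remaining = iter(called)
    # `tool in remaining` consumes the iterator, which enforces ordering.
    return all(tool in remaining for tool in expected)

def summarize_grades(grades, pass_threshold=4.0):
    """Average each criterion and compute the share of cases at or above the threshold.

    pass_threshold is an assumed cutoff on the mean of the four 1-5 scores.
    """
    per_criterion = {c: mean(g[c] for g in grades) for c in CRITERIA}
    passed = sum(1 for g in grades if mean(g[c] for c in CRITERIA) >= pass_threshold)
    return {"per_criterion": per_criterion, "pass_rate": passed / len(grades)}

# Hypothetical judge outputs for two cases
sample_grades = [
    {"behavior_match": 5, "tool_selection": 5, "tone": 4, "completeness": 4},
    {"behavior_match": 3, "tool_selection": 4, "tone": 4, "completeness": 3},
]
print(summarize_grades(sample_grades))
print(tools_match(["verify_identity", "lookup_order"],
                  ["verify_identity", "get_customer", "lookup_order"]))
```

Checking tool order in code keeps the most safety-critical property (for example, verify_identity before any data lookup) out of the judge's hands, while the judge handles subjective criteria such as tone.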

Case Study: Production Deployment
Telecom Company: AI Support Agent Rollout

A mid-size telecom provider piloted a Claude-based customer support agent for billing inquiries, service outage reports, and plan changes. In a healthy rollout, teams often look for outcomes like:

  • First-contact resolution: Material improvement over the previous rule-based bot on routine cases
  • Escalation load: Fewer tickets reaching human agents for straightforward requests
  • Handle time: Routine-case resolution becoming much faster than human-only support
  • Escalation quality: Human reviewers judging most escalations appropriate
  • Guardrails: High-value refund attempts consistently routed to human review

Key architecture choices: Haiku for intent routing, Sonnet for conversation, 5 MCP tools (account lookup, billing check, service status, plan change, escalation), PreToolUse hook for spend limits, sentiment-based urgency scoring.


Next in the SDK Track

In Part 19: Content Moderation & Legal Summarization, we’ll cover CCA modules 1.4 and 1.5 — content moderation classifiers, severity routing, human review pipelines, legal document chunking, and hierarchical summarization with citation preservation.