1. End-to-End Customer Support Pattern
A production customer support agent is more than a chatbot with tools. It must verify identity before accessing data, choose the right tool for each sub-task, recognize when it cannot help, and hand off to humans with full context. The CCA exam tests your ability to wire all these components into a single coherent system.
1.1 What Makes a Good Support Agent
Three metrics define a good customer support agent:
- First-Contact Resolution (FCR) — The customer’s issue is fully resolved in a single interaction without callbacks or transfers. Target: >70% for general support.
- Context Awareness — The agent understands the customer’s history, current subscription, open tickets, and recent interactions before responding.
- Escalation Precision — The agent escalates exactly the right cases (never too early, never too late) with structured context so the human agent doesn’t repeat questions.
1.2 Architecture Overview
The support agent architecture follows a clear pipeline: the user message arrives, the routing layer classifies intent (from Part 17), the support agent processes using MCP tools, and either resolves directly or escalates with a structured handoff summary.
| Component | Responsibility | Implementation |
|---|---|---|
| Router | Classify intent & urgency | Haiku + forced tool_choice (Part 17) |
| Verification Gate | Confirm customer identity | System prompt + tool sequence |
| Support Agent | Resolve issue using tools | Sonnet + MCP tools + agentic loop |
| Guardrails | Block unsafe operations | Hooks (PreToolUse / PostResponse) |
| Escalation | Hand off to human with context | Structured output + queue integration |
1.3 Complete Support Flow
flowchart TD
A["Customer Message"] --> B{"Identity Verified?"}
B -->|"No"| C["Request Verification
(email + order ID)"]
C --> B
B -->|"Yes"| D["Classify Intent"]
D --> E{"Can Resolve?"}
E -->|"Yes"| F["Execute MCP Tools
(lookup, update, refund)"]
F --> G{"Guardrail Check"}
G -->|"Pass"| H["Deliver Resolution"]
G -->|"Blocked"| I["Escalate: Policy Violation"]
E -->|"No"| J{"Confidence < 0.6?"}
J -->|"Yes"| K["Escalate with Context"]
J -->|"No"| L["Ask Clarifying Question"]
L --> D
K --> M["Human Agent Queue"]
I --> M
H --> N["Customer Satisfied?"]
N -->|"No"| K
style C fill:#BF092F,color:#fff
style H fill:#3B9797,color:#fff
style M fill:#132440,color:#fff
style F fill:#16476A,color:#fff
Key design principles visible in this flow:
- Verification first — Never access customer data before confirming identity
- Guardrails at the boundary — Check before executing any tool that modifies state
- Graceful degradation — Low confidence leads to clarification, not random actions
- Feedback loop — If the customer isn’t satisfied after resolution, escalate rather than loop forever
2. MCP Tool Design for Support
The support agent’s capabilities are defined entirely by its tool schemas. Well-designed tool descriptions guide Claude toward correct tool selection without extensive prompt engineering. Poorly described tools lead to hallucinated parameters and incorrect invocations.
2.1 Tool Schemas
A production support agent typically needs 5–7 core tools. Each tool has a specific scope and clear documentation in its description field that tells Claude when to use it:
import anthropic
import json
# Define MCP tools for a customer support agent
support_tools = [
{
"name": "get_customer",
"description": "Retrieve customer profile by email or customer ID. Use this FIRST after identity verification to load the customer's account details, subscription tier, and contact preferences. Returns: name, email, plan, signup_date, lifetime_value.",
"input_schema": {
"type": "object",
"properties": {
"identifier": {
"type": "string",
"description": "Customer email address or customer ID (format: CUS-XXXXX)"
},
"identifier_type": {
"type": "string",
"enum": ["email", "customer_id"],
"description": "Whether the identifier is an email or customer ID"
}
},
"required": ["identifier", "identifier_type"]
}
},
{
"name": "lookup_order",
"description": "Look up order details by order ID or customer ID. Use when the customer asks about a specific order, shipment status, or delivery issue. Returns: order_id, items, status, tracking_number, estimated_delivery.",
"input_schema": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "Order ID (format: ORD-XXXXXXXX). If not provided, returns last 5 orders for the customer."
},
"customer_id": {
"type": "string",
"description": "Customer ID to look up recent orders"
}
},
"required": []
}
},
{
"name": "check_billing",
"description": "Check billing status, payment history, and upcoming charges. Use when customer asks about charges, invoices, payment failures, or subscription renewals. Returns: current_balance, next_charge_date, payment_method_last4, recent_transactions.",
"input_schema": {
"type": "object",
"properties": {
"customer_id": {
"type": "string",
"description": "Customer ID (format: CUS-XXXXX)"
},
"include_history": {
"type": "boolean",
"description": "Whether to include last 10 transactions. Default: false"
}
},
"required": ["customer_id"]
}
},
{
"name": "process_refund",
"description": "Process a refund for a specific order. ONLY use when the customer explicitly requests a refund AND the order is eligible (delivered, within 30-day window). Requires order_id and reason. Refunds over $500 require human approval.",
"input_schema": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "Order ID to refund (format: ORD-XXXXXXXX)"
},
"amount": {
"type": "number",
"description": "Refund amount in USD. Must not exceed original order total."
},
"reason": {
"type": "string",
"enum": ["defective", "not_as_described", "late_delivery", "customer_request", "duplicate_charge"],
"description": "Reason category for the refund"
}
},
"required": ["order_id", "amount", "reason"]
}
},
{
"name": "escalate_to_human",
"description": "Escalate the conversation to a human support agent. Use when: (1) the customer explicitly requests a human, (2) the issue requires policy exceptions beyond your authority, (3) you cannot resolve after 3 tool attempts, (4) the customer expresses strong frustration. Always include a summary of what you've already tried.",
"input_schema": {
"type": "object",
"properties": {
"reason": {
"type": "string",
"description": "Why this conversation needs human intervention"
},
"priority": {
"type": "string",
"enum": ["low", "medium", "high", "urgent"],
"description": "Urgency level based on customer impact and sentiment"
},
"summary": {
"type": "string",
"description": "Structured summary: customer issue, steps already taken, what the customer needs"
},
"customer_id": {
"type": "string",
"description": "Customer ID for the human agent to pull up"
}
},
"required": ["reason", "priority", "summary", "customer_id"]
}
}
]
# Print tool count and names for verification
print(f"Defined {len(support_tools)} support tools:")
for tool in support_tools:
print(f" - {tool['name']}: {tool['description'][:60]}...")
2.2 Tool Descriptions That Guide Decisions
Notice the patterns in the tool descriptions above:
- When to use — Every description starts with when Claude should select this tool (“Use when the customer asks about...”)
- Preconditions —
process_refundspecifies “ONLY use when...AND the order is eligible” - Constraints — “Refunds over $500 require human approval” teaches Claude the guardrail boundary
- Return values — Describing what the tool returns helps Claude plan multi-step workflows
- Escalation triggers —
escalate_to_humanexplicitly lists the four escalation conditions
This approach means you need minimal system prompt instructions about tool selection — the descriptions themselves encode the decision logic.
3. Context Preservation in Support
Support conversations are inherently multi-turn. The customer states a problem, the agent asks clarifying questions, verifies identity, looks up data, and resolves. Throughout this flow, the agent must maintain context about what’s been verified, what tools have been called, and what the customer’s emotional state is.
3.1 Identity Verification Gate
The verification gate pattern ensures the agent never exposes customer data to an unauthorized party. The system prompt instructs Claude to verify before accessing, and the tool execution layer enforces it:
import anthropic
import json
from datetime import datetime
client = anthropic.Anthropic()
# Simulated customer database
CUSTOMERS_DB = {
"CUS-12345": {
"name": "Sarah Chen",
"email": "sarah.chen@example.com",
"plan": "Enterprise",
"last_4_ssn": "7890",
"orders": ["ORD-00112233", "ORD-00445566"]
}
}
def run_support_agent_with_verification(user_message: str, conversation_history: list):
"""Support agent with verification gate and context tracking."""
# Session state tracks verification status
session = {
"verified": False,
"customer_id": None,
"tool_calls_count": 0,
"started_at": datetime.now().isoformat()
}
system_prompt = """You are a customer support agent for TechCorp.
VERIFICATION RULES (MANDATORY):
1. Before accessing ANY customer data, you MUST verify the customer's identity.
2. Ask for their email AND one of: order ID, last 4 of SSN, or account PIN.
3. Call verify_identity tool with their responses.
4. Only after verification succeeds, proceed with data lookups.
5. If verification fails twice, escalate to human agent.
RESOLUTION RULES:
- Always be empathetic and professional
- Resolve in as few turns as possible
- If you cannot resolve in 3 tool calls, escalate
- Never promise what you cannot deliver (no "I'll make sure it never happens again")
- Never share full SSN, credit card numbers, or passwords
ESCALATION RULES:
- Customer explicitly asks for a human
- Issue requires policy exception (refund > $500, account deletion)
- Customer sentiment is angry after 2 failed resolution attempts
- You are uncertain about the correct action"""
messages = conversation_history + [{"role": "user", "content": user_message}]
# Define tools available based on verification state
tools = [
{
"name": "verify_identity",
"description": "Verify customer identity using email + secondary factor. Must be called before any data access.",
"input_schema": {
"type": "object",
"properties": {
"email": {"type": "string"},
"verification_factor": {"type": "string", "description": "Order ID, last 4 SSN, or account PIN"},
"factor_type": {"type": "string", "enum": ["order_id", "ssn_last4", "pin"]}
},
"required": ["email", "verification_factor", "factor_type"]
}
},
{
"name": "get_customer",
"description": "Get customer profile. REQUIRES: identity must be verified first.",
"input_schema": {
"type": "object",
"properties": {"customer_id": {"type": "string"}},
"required": ["customer_id"]
}
},
{
"name": "escalate_to_human",
"description": "Escalate to human agent with full context summary.",
"input_schema": {
"type": "object",
"properties": {
"reason": {"type": "string"},
"priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
"summary": {"type": "string"}
},
"required": ["reason", "priority", "summary"]
}
}
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
tools=tools,
messages=messages
)
# Process tool calls with verification enforcement
while response.stop_reason == "tool_use":
tool_use = next(b for b in response.content if b.type == "tool_use")
session["tool_calls_count"] += 1
# ENFORCEMENT: Block data access if not verified
if tool_use.name == "get_customer" and not session["verified"]:
tool_result = {
"type": "error",
"error": "ACCESS_DENIED: Identity not yet verified. Call verify_identity first."
}
elif tool_use.name == "verify_identity":
# Simulate verification check
email = tool_use.input.get("email", "")
factor = tool_use.input.get("verification_factor", "")
# Check against DB
verified = any(
c["email"] == email and
(factor in c.get("orders", []) or factor == c.get("last_4_ssn", ""))
for c in CUSTOMERS_DB.values()
)
if verified:
session["verified"] = True
session["customer_id"] = next(
k for k, v in CUSTOMERS_DB.items() if v["email"] == email
)
tool_result = {"verified": True, "customer_id": session["customer_id"]}
else:
tool_result = {"verified": False, "message": "Verification failed. Please try again."}
else:
tool_result = {"status": "executed", "data": "...simulated response..."}
# Continue conversation with tool result
messages = messages + [
{"role": "assistant", "content": response.content},
{"role": "user", "content": [{"type": "tool_result", "tool_use_id": tool_use.id, "content": json.dumps(tool_result)}]}
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
tools=tools,
messages=messages
)
# Extract final text response
final_text = next((b.text for b in response.content if b.type == "text"), "")
print(f"Agent response ({session['tool_calls_count']} tool calls, verified={session['verified']}):")
print(final_text[:200])
return final_text, session
# Example: Customer contacts support
response, session = run_support_agent_with_verification(
"Hi, I need help with my recent order. It arrived damaged.",
conversation_history=[]
)
3.2 Passing Context to Escalation
When the agent escalates, the human agent must receive a structured summary — not the raw conversation. This prevents the customer from repeating themselves and reduces handle time:
import json
from datetime import datetime
def build_escalation_handoff(session_data: dict, conversation_turns: list) -> dict:
"""Build a structured handoff summary for the human agent queue."""
# Extract key information from the session
handoff = {
"escalation_id": f"ESC-{datetime.now().strftime('%Y%m%d%H%M%S')}",
"timestamp": datetime.now().isoformat(),
"customer": {
"id": session_data.get("customer_id", "UNVERIFIED"),
"verified": session_data.get("verified", False),
"sentiment": session_data.get("detected_sentiment", "neutral")
},
"issue": {
"category": session_data.get("intent_category", "unknown"),
"summary": session_data.get("issue_summary", ""),
"urgency": calculate_urgency(session_data)
},
"resolution_attempts": {
"tools_called": session_data.get("tools_called", []),
"total_tool_calls": session_data.get("tool_calls_count", 0),
"what_was_tried": session_data.get("attempted_resolutions", []),
"why_escalated": session_data.get("escalation_reason", "")
},
"context_for_human": {
"customer_quote": extract_key_quotes(conversation_turns),
"relevant_data_pulled": session_data.get("fetched_data_summary", {}),
"suggested_resolution": session_data.get("suggested_next_step", "")
}
}
return handoff
def calculate_urgency(session: dict) -> str:
"""Determine urgency from session signals."""
if session.get("detected_sentiment") == "angry":
return "high"
if session.get("tool_calls_count", 0) >= 3:
return "medium"
if session.get("customer_tier") == "Enterprise":
return "high"
return "low"
def extract_key_quotes(turns: list) -> list:
"""Pull direct customer quotes that capture the core issue."""
# In production, use Claude to extract key quotes
return [turn["content"][:100] for turn in turns if turn.get("role") == "user"][:3]
# Example handoff
sample_session = {
"customer_id": "CUS-12345",
"verified": True,
"detected_sentiment": "frustrated",
"intent_category": "billing_dispute",
"issue_summary": "Customer charged twice for annual subscription renewal",
"tools_called": ["verify_identity", "get_customer", "check_billing"],
"tool_calls_count": 3,
"attempted_resolutions": ["Looked up billing history", "Confirmed duplicate charge exists"],
"escalation_reason": "Refund amount ($499) requires human approval per policy",
"customer_tier": "Enterprise",
"fetched_data_summary": {"duplicate_charge": "$499", "charge_dates": ["2026-05-20", "2026-05-21"]},
"suggested_next_step": "Process refund for duplicate charge on 2026-05-21"
}
handoff = build_escalation_handoff(sample_session, [
{"role": "user", "content": "I was charged twice for my subscription. This is ridiculous."},
{"role": "user", "content": "Yes my email is sarah.chen@example.com, order ORD-00112233"}
])
print(json.dumps(handoff, indent=2))
4. Escalation Criteria & Human Handoff
The escalation decision is the most critical judgment a support agent makes. Escalate too early and you waste human agent time. Escalate too late and customers suffer through an agent that can’t help them. The CCA exam expects you to implement explicit, measurable criteria rather than vague “if it seems difficult” logic.
4.1 When to Escalate
Each escalation trigger maps to a specific detection mechanism:
| Trigger | Detection Method | Example | Priority |
|---|---|---|---|
| Low Confidence | Classification score < 0.6 | Ambiguous intent, multiple possible categories | Medium |
| Repeated Failures | tool_calls_count ≥ 3 without resolution | Order lookup fails, API errors | Medium |
| Policy Violation | Guardrail hook blocks action | Refund > $500, account deletion request | High |
| Customer Anger | Sentiment analysis + explicit keywords | “I want to speak to a manager”, profanity | Urgent |
4.2 Escalation Decision Logic
import anthropic
import json
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class EscalationDecision:
should_escalate: bool
reason: str
priority: str # low, medium, high, urgent
confidence: float
triggers_fired: list = field(default_factory=list)
def evaluate_escalation(session_state: dict, latest_message: str) -> EscalationDecision:
"""Evaluate whether the current conversation should be escalated."""
triggers = []
max_priority = "low"
priority_order = {"low": 0, "medium": 1, "high": 2, "urgent": 3}
# Trigger 1: Low classification confidence
if session_state.get("classification_confidence", 1.0) < 0.6:
triggers.append("low_confidence")
max_priority = "medium"
# Trigger 2: Repeated tool failures
if session_state.get("tool_calls_count", 0) >= 3 and not session_state.get("resolved", False):
triggers.append("repeated_failures")
if priority_order["medium"] > priority_order[max_priority]:
max_priority = "medium"
# Trigger 3: Policy boundary violation
blocked_actions = session_state.get("blocked_by_guardrail", [])
if blocked_actions:
triggers.append("policy_violation")
if priority_order["high"] > priority_order[max_priority]:
max_priority = "high"
# Trigger 4: Customer anger detection
anger_signals = detect_anger(latest_message)
if anger_signals["is_angry"]:
triggers.append("customer_anger")
if priority_order["urgent"] > priority_order[max_priority]:
max_priority = "urgent"
# Trigger 5: Explicit human request
human_request_phrases = [
"speak to a human", "talk to a person", "real person",
"manager", "supervisor", "transfer me"
]
if any(phrase in latest_message.lower() for phrase in human_request_phrases):
triggers.append("explicit_human_request")
if priority_order["high"] > priority_order[max_priority]:
max_priority = "high"
should_escalate = len(triggers) > 0
reason = f"Escalation triggered by: {', '.join(triggers)}" if triggers else "No escalation needed"
return EscalationDecision(
should_escalate=should_escalate,
reason=reason,
priority=max_priority,
confidence=1.0 - (session_state.get("classification_confidence", 1.0)),
triggers_fired=triggers
)
def detect_anger(message: str) -> dict:
"""Detect anger signals in customer message using keyword + pattern matching."""
anger_keywords = ["furious", "unacceptable", "terrible", "worst", "scam", "lawsuit", "sue"]
profanity_count = sum(1 for word in message.lower().split() if word in ["damn", "hell", "ridiculous"])
caps_ratio = sum(1 for c in message if c.isupper()) / max(len(message), 1)
is_angry = (
any(kw in message.lower() for kw in anger_keywords) or
profanity_count >= 1 or
caps_ratio > 0.5 # More than half the message is CAPS
)
return {"is_angry": is_angry, "caps_ratio": round(caps_ratio, 2), "profanity_count": profanity_count}
# Test escalation scenarios
test_cases = [
{"session": {"tool_calls_count": 4, "resolved": False}, "message": "This still isn't working"},
{"session": {"blocked_by_guardrail": ["refund_over_limit"]}, "message": "Just refund my $800 order"},
{"session": {"classification_confidence": 0.45}, "message": "I have a complex issue with my account"},
{"session": {}, "message": "THIS IS UNACCEPTABLE. I WANT TO SPEAK TO A MANAGER NOW."},
{"session": {"resolved": True}, "message": "Thanks, that fixed it!"}
]
for i, case in enumerate(test_cases):
decision = evaluate_escalation(case["session"], case["message"])
print(f"Case {i+1}: escalate={decision.should_escalate}, priority={decision.priority}, triggers={decision.triggers_fired}")
The escalation sequence from the agent’s perspective follows a clear handoff protocol:
sequenceDiagram
participant C as Customer
participant A as Support Agent
participant G as Guardrails
participant Q as Human Queue
participant H as Human Agent
C->>A: "Refund my $800 order"
A->>G: PreToolUse: process_refund($800)
G-->>A: BLOCKED (exceeds $500 limit)
A->>A: Build handoff summary
A->>Q: Escalate(priority=high, reason=policy_violation)
A->>C: "I'll connect you with a specialist who can process larger refunds."
Q->>H: Deliver handoff package
H->>C: "Hi Sarah, I see you need a refund for $800 on ORD-123. Let me process that now."
Note over H,C: Human resolves without repeating questions
Notice the key UX detail: the agent tells the customer why they’re being transferred and what will happen next. No customer wants to hear “transferring you” without context.
5. Guardrails for Support Agents
Support agents interact with real customer data and real financial systems. Guardrails prevent three categories of harm: data exposure (leaking PII), unauthorized actions (refunds beyond authority), and prompt injection (customers manipulating the agent into bypassing rules).
5.1 Input Validation
Customer messages can contain prompt injection attempts — text designed to override the system prompt. A production agent must sanitize inputs without breaking legitimate requests:
- Instruction override detection — Flag messages containing “ignore previous instructions”, “you are now”, “system:” prefixes
- Character limits — Reject messages over 2,000 characters (legitimate support requests are shorter)
- Encoding attacks — Normalize Unicode before processing (prevents homoglyph attacks)
5.2 Output Constraints
The system prompt defines what the agent must never output, but hooks enforce it at the code layer:
- PII redaction — Post-response hook scans for SSN patterns, credit card numbers, full addresses
- Promise prevention — Block phrases like “I guarantee”, “I promise”, “we will always”
- Scope limitation — Agent cannot discuss competitors, provide legal advice, or make commitments about future features
5.3 Cost Controls & PreToolUse Hook
The PreToolUse hook from Part 5 is the enforcement mechanism for guardrails. Here’s a production implementation that blocks high-value refunds and limits tool calls per session:
import anthropic
import json
import re
from datetime import datetime
class SupportGuardrails:
"""PreToolUse guardrails for customer support agent."""
MAX_TOOL_CALLS_PER_SESSION = 10
MAX_REFUND_AMOUNT = 500.00 # USD
BLOCKED_PATTERNS = [
r"\b\d{3}-\d{2}-\d{4}\b", # SSN
r"\b\d{16}\b", # Credit card
r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b" # CC with separators
]
def __init__(self):
self.tool_call_count = 0
self.total_refund_amount = 0.0
self.blocked_actions = []
def pre_tool_use(self, tool_name: str, tool_input: dict) -> dict:
"""
Evaluate whether a tool call should proceed.
Returns: {"action": "allow"} or {"action": "block", "reason": "..."}
"""
self.tool_call_count += 1
# Guard 1: Session tool call limit
if self.tool_call_count > self.MAX_TOOL_CALLS_PER_SESSION:
self.blocked_actions.append("session_limit_exceeded")
return {
"action": "block",
"reason": f"Session limit exceeded ({self.MAX_TOOL_CALLS_PER_SESSION} tool calls max). Escalate to human."
}
# Guard 2: Refund amount limit
if tool_name == "process_refund":
amount = tool_input.get("amount", 0)
if amount > self.MAX_REFUND_AMOUNT:
self.blocked_actions.append("refund_over_limit")
return {
"action": "block",
"reason": f"Refund ${amount:.2f} exceeds limit (${self.MAX_REFUND_AMOUNT:.2f}). Requires human approval."
}
self.total_refund_amount += amount
if self.total_refund_amount > self.MAX_REFUND_AMOUNT:
self.blocked_actions.append("cumulative_refund_limit")
return {
"action": "block",
"reason": f"Cumulative refunds (${self.total_refund_amount:.2f}) exceed session limit."
}
# Guard 3: Input sanitization (check tool inputs for injected PII)
input_str = json.dumps(tool_input)
for pattern in self.BLOCKED_PATTERNS:
if re.search(pattern, input_str):
self.blocked_actions.append("pii_in_input")
return {
"action": "block",
"reason": "Sensitive data pattern detected in tool input. Blocked for security."
}
return {"action": "allow"}
def post_response(self, response_text: str) -> str:
"""Scan and redact any PII that slipped through in the response."""
redacted = response_text
for pattern in self.BLOCKED_PATTERNS:
redacted = re.sub(pattern, "[REDACTED]", redacted)
return redacted
# Demo: test the guardrails
guardrails = SupportGuardrails()
test_calls = [
("get_customer", {"customer_id": "CUS-12345"}),
("process_refund", {"order_id": "ORD-001", "amount": 49.99, "reason": "defective"}),
("process_refund", {"order_id": "ORD-002", "amount": 750.00, "reason": "customer_request"}),
("get_customer", {"customer_id": "CUS-12345", "note": "SSN is 123-45-6789"}),
]
for tool_name, tool_input in test_calls:
result = guardrails.pre_tool_use(tool_name, tool_input)
status = "ALLOWED" if result["action"] == "allow" else f"BLOCKED: {result['reason']}"
print(f"{tool_name}({list(tool_input.keys())}) -> {status}")
6. Evaluation Patterns for Support
A support agent without evaluation is a liability. You must measure three dimensions: first-contact resolution rate (did the customer’s issue get resolved?), response quality (was the response helpful and accurate?), and escalation accuracy (did the agent escalate the right cases?).
6.1 Measuring First-Contact Resolution
FCR measurement requires tracking whether the customer contacts support again within a defined window (typically 7 days) for the same issue:
- Resolved — Customer does not re-contact within 7 days on the same topic
- Not resolved — Customer contacts again within 7 days with the same or related issue
- Escalated — Agent transferred to human (not counted as failure, but tracked separately)
6.2 Eval Framework
Build an evaluation suite with golden test cases that cover resolution, escalation, and guardrail enforcement:
import anthropic
import json
from dataclasses import dataclass
from typing import Optional
client = anthropic.Anthropic()
@dataclass
class SupportEvalCase:
"""A single test case for support agent evaluation."""
scenario: str
customer_message: str
expected_behavior: str # "resolve", "escalate", "verify_first", "block"
expected_tools: list # Tools the agent should call
tags: list # e.g., ["billing", "refund", "angry_customer"]
# Golden test cases covering critical paths
eval_cases = [
SupportEvalCase(
scenario="Simple order status inquiry (should resolve directly)",
customer_message="Hi, I placed order ORD-99887766 last week. Can you tell me where it is?",
expected_behavior="resolve",
expected_tools=["verify_identity", "lookup_order"],
tags=["order_status", "happy_path"]
),
SupportEvalCase(
scenario="High-value refund (should escalate due to guardrail)",
customer_message="I need a full refund for my $899 enterprise subscription. The product doesn't work as advertised.",
expected_behavior="escalate",
expected_tools=["verify_identity", "get_customer", "escalate_to_human"],
tags=["refund", "policy_violation", "high_value"]
),
SupportEvalCase(
scenario="Angry customer demanding manager (should escalate immediately)",
customer_message="THIS IS THE THIRD TIME I'M CALLING ABOUT THIS. YOUR SERVICE IS TERRIBLE. GET ME A MANAGER NOW.",
expected_behavior="escalate",
expected_tools=["escalate_to_human"],
tags=["angry_customer", "explicit_human_request"]
),
]
def evaluate_agent_response(case: SupportEvalCase, agent_response: dict) -> dict:
"""Grade the agent's response against expected behavior using LLM-as-judge."""
grading_prompt = f"""You are evaluating a customer support agent's response.
SCENARIO: {case.scenario}
CUSTOMER MESSAGE: {case.customer_message}
EXPECTED BEHAVIOR: {case.expected_behavior}
EXPECTED TOOLS: {case.expected_tools}
AGENT'S ACTUAL RESPONSE:
{json.dumps(agent_response, indent=2)}
Grade the agent on these criteria (1-5 scale each):
1. BEHAVIOR_MATCH: Did the agent do what was expected (resolve/escalate/verify)?
2. TOOL_SELECTION: Did the agent call the right tools in the right order?
3. TONE: Was the response empathetic, professional, and appropriate?
4. COMPLETENESS: Did the agent address all aspects of the customer's issue?
Respond in JSON format:
{{"behavior_match": N, "tool_selection": N, "tone": N, "completeness": N, "reasoning": "..."}}"""
grade_response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=500,
messages=[{"role": "user", "content": grading_prompt}]
)
return json.loads(grade_response.content[0].text)
def run_eval_suite(cases: list) -> dict:
"""Run full evaluation suite and compute aggregate metrics."""
results = []
for case in cases:
# In production, you'd run the actual agent here
# For demo, we simulate with a placeholder
simulated_response = {
"tools_called": case.expected_tools, # Placeholder
"final_action": case.expected_behavior,
"response_text": "Simulated agent response for evaluation"
}
# grade = evaluate_agent_response(case, simulated_response)
# results.append(grade)
results.append({"scenario": case.scenario, "expected": case.expected_behavior})
print(f"Evaluation suite: {len(cases)} test cases")
for r in results:
print(f" - {r['scenario'][:60]}... | Expected: {r['expected']}")
return {"total_cases": len(cases), "results": results}
# Run evaluation
run_eval_suite(eval_cases)
Telecom Company: AI Support Agent Rollout
A mid-size telecom provider piloted a Claude-based customer support agent for billing inquiries, service outage reports, and plan changes. In a healthy rollout, teams often look for outcomes like:
- First-contact resolution: Material improvement over the previous rule-based bot on routine cases
- Escalation load: Fewer tickets reaching human agents for straightforward requests
- Handle time: Routine-case resolution becoming much faster than human-only support
- Escalation quality: Human reviewers judging most escalations appropriate
- Guardrails: High-value refund attempts consistently routed to human review
Key architecture choices: Haiku for intent routing, Sonnet for conversation, 5 MCP tools (account lookup, billing check, service status, plan change, escalation), PreToolUse hook for spend limits, sentiment-based urgency scoring.
Next in the SDK Track
In Part 19: Content Moderation & Legal Summarization, we’ll cover CCA modules 1.4 and 1.5 — content moderation classifiers, severity routing, human review pipelines, legal document chunking, and hierarchical summarization with citation preservation.