Anthropic SDK Track Part 2: Messages API & Content Blocks

                        
What You’ll Learn: The Messages API is the single endpoint you’ll use for everything in Claude, from simple questions to complex multi-turn conversations with tool use. This article teaches you how the API works under the hood: message roles, content blocks, stop reasons, streaming, and the conversation patterns that make Claude powerful.
                    

1. Messages Create

The messages.create() endpoint is the core of the Claude API. Every interaction — whether a simple question, a multi-turn conversation, or an agentic tool-use loop — flows through this single endpoint. Understanding its anatomy is essential for everything that follows in this track.

1.1 Request Anatomy

A Messages API request requires three mandatory fields: model, max_tokens, and messages. The system prompt is passed as a separate top-level parameter (not inside the messages array), which is a key architectural difference from OpenAI.

import anthropic

client = anthropic.Anthropic()

# Complete request anatomy with all common parameters
response = client.messages.create(
    model="claude-sonnet-4-6",   # Required: model ID
    max_tokens=1024,                     # Required: output token limit
    system="You are a helpful coding assistant. Be concise.", # Optional: system prompt
    messages=[                           # Required: conversation messages
        {"role": "user", "content": "Explain Python decorators in 3 sentences."}
    ],
    temperature=0.7,                     # Optional: 0.0-1.0 (default 1.0)
    # Note: temperature and top_p are mutually exclusive — use one or the other
    stop_sequences=["---"],              # Optional: custom stop strings
    metadata={"user_id": "user-123"}     # Optional: for abuse tracking
)

print(response.content[0].text)
print(f"Model: {response.model}")
print(f"Stop reason: {response.stop_reason}")
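The request shape above can be checked locally before any network call. The following is a minimal sketch, assuming a hypothetical helper named `validate_request` (it is not part of the SDK). It checks only the rules described in this section (three required fields, no `system` role inside `messages`, temperature range) and is not a copy of the server-side validation.

```python
from typing import Any

REQUIRED_FIELDS = ("model", "max_tokens", "messages")

def validate_request(params: dict[str, Any]) -> list[str]:
    """Return a list of problems found in Messages API request kwargs (empty if none)."""
    problems = [f"missing required field: {name}" for name in REQUIRED_FIELDS if name not in params]

    for i, message in enumerate(params.get("messages", [])):
        role = message.get("role")
        if role == "system":
            problems.append(f"messages[{i}] uses role 'system'; pass system= as a top-level parameter")
        elif role not in ("user", "assistant"):
            problems.append(f"messages[{i}] has unknown role: {role!r}")

    temperature = params.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append("temperature must be between 0.0 and 1.0")
    return problems

good = {"model": "claude-sonnet-4-6", "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Hi"}]}
bad = {"model": "claude-sonnet-4-6",
       "messages": [{"role": "system", "content": "Be concise."}]}

print(validate_request(good))  # []
print(validate_request(bad))
```

A check like this is mainly useful for catching the OpenAI-style habit of putting the system prompt in the messages array before the request is sent.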

                        
Key Difference from OpenAI: In the Anthropic API, system is a separate top-level parameter, not a message with role: "system". This architectural choice keeps system instructions clearly separated from conversation history, making it easier to apply prompt caching and maintain clean message arrays.
                    

1.2 System Prompts

System prompts define Claude’s behavior, persona, and constraints. They can be a simple string or an array of content blocks (useful for prompt caching). The system prompt is processed before any messages and shapes all subsequent responses.

Here is the simple string form — suitable for most applications:

import anthropic

client = anthropic.Anthropic()

# Simple string system prompt
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a senior Python developer. Provide production-ready code with error handling. Never use print() for logging — use the logging module.",
    messages=[
        {"role": "user", "content": "Write a function to retry HTTP requests with exponential backoff."}
    ]
)
print(response.content[0].text)

For advanced use cases, pass the system prompt as an array of content blocks. You need this format to place explicit cache breakpoints on the system prompt (prompt caching is covered in Section 7): you attach cache_control markers to individual blocks so Anthropic can reuse previously processed prefixes. Anthropic cites cost reductions of up to 90% and latency reductions of up to 85% for the cached portion of long prompts:

import anthropic

client = anthropic.Anthropic()

# System prompt as content block array — required format for prompt caching
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "You are a senior Python developer. Provide production-ready code with error handling. Never use print() for logging — use the logging module.",
            "cache_control": {"type": "ephemeral"}  # Marks this block for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Write a function to retry HTTP requests with exponential backoff."}
    ]
)
print(response.content[0].text)
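# Check cache usage. Prompts shorter than the model's minimum cacheable length
# are not cached, so both values can be 0 for a system prompt this short.
print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)}")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)}")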
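Because `system` accepts either a string or a list of blocks, application code often has to normalize between the two forms. The sketch below assumes a hypothetical helper named `to_system_blocks`. It converts a string into the block form and, when asked, marks the last block with the same `cache_control` value used in the example above.

```python
from typing import Union

SystemPrompt = Union[str, list[dict]]

def to_system_blocks(system: SystemPrompt, cache: bool = False) -> list[dict]:
    """Normalize a system prompt to content-block form; optionally mark the last block for caching."""
    if isinstance(system, str):
        blocks = [{"type": "text", "text": system}]
    else:
        # Shallow copies so the caller's list is left untouched
        blocks = [dict(block) for block in system]
    if cache and blocks:
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return blocks

blocks = to_system_blocks("You are a senior Python developer.", cache=True)
print(blocks)
```

Marking only the final block is a design choice of this sketch: a cache breakpoint covers the prefix up to and including the block it is attached to.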

Prompt Caching: For a full deep-dive on caching strategies — automatic vs. explicit breakpoints, multi-turn caching, tool caching, pre-warming, invalidation rules, and a real-world production example — see Section 7: Prompt Caching below.

1.3 Multi-Turn Conversations

Multi-turn conversations are built by appending messages with alternating user and assistant roles. The full conversation history is passed with every request — Claude is stateless and does not retain context between API calls.

import anthropic

client = anthropic.Anthropic()

# Build a multi-turn conversation
conversation = [
    {"role": "user", "content": "What is a Python context manager?"},
    {"role": "assistant", "content": "A context manager is an object that defines the runtime context for a `with` statement. It implements `__enter__()` and `__exit__()` methods to set up and tear down resources automatically."},
    {"role": "user", "content": "Show me a custom one for database connections."}
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a Python instructor. Use type hints in all code examples.",
    messages=conversation
)

# Append the response to continue the conversation
conversation.append({"role": "assistant", "content": response.content[0].text})
conversation.append({"role": "user", "content": "Now add error handling and connection pooling."})

# Next turn with full history
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a Python instructor. Use type hints in all code examples.",
    messages=conversation
)
print(response2.content[0].text)

                        
Stateless API: Claude does not remember previous requests. You must pass the complete conversation history with every API call. For long conversations, this means managing token usage carefully; see Part 12 (Context Preservation) for strategies.
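Since the API is stateless, the client has to own the history. One way to keep that bookkeeping in one place is a small container like the one sketched below. `Conversation` is a hypothetical class, not an SDK type, and enforcing strict user/assistant alternation is a choice made by this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """Holds the full message history that must be re-sent on every call."""
    system: str
    messages: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        # The first message is a user turn; after that, roles alternate
        expected = "user" if not self.messages or self.messages[-1]["role"] == "assistant" else "assistant"
        if role != expected:
            raise ValueError(f"expected a {expected!r} message next, got {role!r}")
        self.messages.append({"role": role, "content": content})

    def request_kwargs(self, model: str, max_tokens: int) -> dict:
        """Build kwargs for client.messages.create(**kwargs); copies the history."""
        return {"model": model, "max_tokens": max_tokens,
                "system": self.system, "messages": list(self.messages)}

convo = Conversation(system="You are a Python instructor.")
convo.add("user", "What is a context manager?")
convo.add("assistant", "An object that implements __enter__ and __exit__.")
convo.add("user", "Show me a custom one.")
kwargs = convo.request_kwargs("claude-sonnet-4-6", 1024)
print(len(kwargs["messages"]))  # 3
```

The resulting dictionary would be passed as `client.messages.create(**kwargs)`, and the reply text appended with `convo.add("assistant", ...)` before the next turn.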
                    

2. Content Blocks Architecture

Unlike simpler APIs that return plain text, the Claude Messages API uses a content blocks architecture. Each response is an array of typed blocks, enabling rich interactions where Claude can mix text with tool calls, thinking, and structured data in a single response.

Content Block Types

flowchart LR
    R["response.content"] --> T["TextBlock"]
    R --> TU["ToolUseBlock"]
    R --> TH["ThinkingBlock"]
    T --> |"type: text"| T1["text: string"]
    TU --> |"type: tool_use"| TU1["id + name + input"]
    TH --> |"type: thinking"| TH1["thinking: string"]
    M["messages[].content"] --> IM["ImageBlock"]
    M --> DOC["DocumentBlock"]
    IM --> |"type: image"| IM1["source: base64 or url"]
    DOC --> |"type: document"| DOC1["source: base64 PDF"]

2.1 TextBlock

The most common content block. Contains Claude’s text response. A response may contain multiple TextBlocks when mixed with other block types.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{"role": "user", "content": "Say hello in three languages."}]
)

# Access the text content
for block in response.content:
    if block.type == "text":
        print(block.text)

# Shorthand for single-block responses
print(response.content[0].text)
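The `response.content[0].text` shorthand assumes the first block is a text block, which does not hold once tools or extended thinking are in play. A more defensive option is sketched below with a hypothetical helper named `extract_text`. The stand-in objects only mimic the attributes the helper reads and are not real SDK classes.

```python
from types import SimpleNamespace

def extract_text(content) -> str:
    """Concatenate the text of every text block, skipping other block types."""
    return "".join(block.text for block in content if block.type == "text")

# Stand-ins shaped like SDK content blocks (assumption: only .type and .text are read)
content = [
    SimpleNamespace(type="thinking", thinking="3x = 15, so x = 5"),
    SimpleNamespace(type="text", text="x = 5"),
]
print(extract_text(content))  # x = 5
```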

2.2 ToolUseBlock

When Claude decides to call a tool, it emits a ToolUseBlock containing the tool name, a unique ID, and the input arguments as a JSON object. The stop_reason will be "tool_use" indicating you need to execute the tool and return results.

import anthropic
import json

client = anthropic.Anthropic()

# Define tools available to Claude
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location. Use this when the user asks about weather conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, e.g. 'London, UK'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature units"}
            },
            "required": ["location"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Inspect the response — Claude will emit a ToolUseBlock
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}")
        print(f"ID: {block.id}")
        print(f"Input: {json.dumps(block.input, indent=2)}")
        # Output:
        # Tool: get_weather
        # ID: toolu_01ABC123...
        # Input: {"location": "Tokyo", "units": "celsius"}

print(f"Stop reason: {response.stop_reason}")  # "tool_use"

2.3 ToolResultBlock

After executing a tool, return the results by adding an assistant message (with the ToolUseBlock) followed by a user message containing a tool_result content block. The tool_use_id links the result back to the original tool call.

import anthropic

client = anthropic.Anthropic()

# After receiving a ToolUseBlock, execute the tool and return results
# This continues the conversation from the previous example
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_use",
                "id": "toolu_01ABC123",
                "name": "get_weather",
                "input": {"location": "Tokyo", "units": "celsius"}
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": "toolu_01ABC123",  # Must match the tool_use id
                "content": "Temperature: 22°C, Condition: Partly cloudy, Humidity: 65%"
            }
        ]
    }
]

# Re-send the tool definitions so Claude still knows the schema.
# Claude then generates a natural language response that uses the tool result.
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    tools=tools,
    messages=messages
)

print(response.content[0].text)
# "The weather in Tokyo is currently 22°C (partly cloudy) with 65% humidity."
print(response.stop_reason)  # "end_turn" — Claude is done
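In practice the `tool_result` message is built from the blocks Claude returned rather than written by hand. The sketch below shows one way to do that. `TOOL_HANDLERS`, `get_weather`, and `build_tool_result_message` are hypothetical names, and the tool returns canned data. The `is_error` flag is the field a `tool_result` block uses to report a failed call.

```python
from types import SimpleNamespace
from typing import Callable

def get_weather(location: str, units: str = "celsius") -> str:
    """Hypothetical tool implementation returning canned data."""
    return f"Temperature: 22 degrees {units} in {location}"

TOOL_HANDLERS: dict[str, Callable[..., str]] = {"get_weather": get_weather}

def build_tool_result_message(content) -> dict:
    """Run every tool_use block and return the user message carrying the results."""
    results = []
    for block in content:
        if block.type != "tool_use":
            continue
        handler = TOOL_HANDLERS.get(block.name)
        try:
            if handler is None:
                raise ValueError(f"unknown tool: {block.name}")
            result = {"type": "tool_result", "tool_use_id": block.id,
                      "content": handler(**block.input)}
        except Exception as exc:
            # Report the failure to Claude instead of crashing the loop
            result = {"type": "tool_result", "tool_use_id": block.id,
                      "content": str(exc), "is_error": True}
        results.append(result)
    return {"role": "user", "content": results}

# Stand-in for a ToolUseBlock taken from response.content
call = SimpleNamespace(type="tool_use", id="toolu_01ABC123", name="get_weather",
                       input={"location": "Tokyo"})
message = build_tool_result_message([call])
print(message)
```

Returning one user message with a result for every `tool_use` block keeps each `tool_use_id` paired with its call, including when Claude requests several tools in one turn.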

2.4 ThinkingBlock

When extended thinking is enabled, Claude’s internal reasoning appears as ThinkingBlock content before the final answer. This is useful for complex reasoning tasks and debugging. See Part 14 (Extended Thinking) for full coverage.

import anthropic

client = anthropic.Anthropic()

# Enable extended thinking
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Token budget for thinking
    },
    messages=[{"role": "user", "content": "Solve: If 3x + 7 = 22, what is x?"}]
)

# Response contains ThinkingBlock(s) followed by TextBlock(s)
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking[:200]}...")
    elif block.type == "text":
        print(f"[Answer]: {block.text}")

                        
RedactedThinkingBlock: Occasionally, Claude’s thinking content may be filtered for safety. In this case you’ll receive a block with type: "redacted_thinking" instead of "thinking". Always check for both types when processing extended thinking responses: if block.type == "redacted_thinking": print("[Redacted]").
                    

2.5 Multimodal Input Blocks (Image & Document)

Claude supports multimodal input through specialized content blocks in the messages array. Instead of passing a plain string for the content field, provide an array of content blocks mixing text with images or documents. These are input blocks (sent by the user), not response blocks.

ImageBlock — Vision Input

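Send images for Claude to analyze via base64 encoding or a public URL. Supported formats: JPEG, PNG, GIF, WebP. Anthropic's vision documentation has listed an API limit of 5 MB per image, and the 20-image figure applies to the claude.ai app rather than the API, which accepts more images per request. Check the current documentation for exact limits before relying on them.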

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Option 1: Base64-encoded image from file
image_data = base64.standard_b64encode(Path("screenshot.png").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "What does this screenshot show? Identify any UI issues."
                }
            ]
        }
    ]
)

print(response.content[0].text)

import anthropic

client = anthropic.Anthropic()

# Option 2: Image from URL (no download/encoding needed)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/architecture-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Review this architecture diagram. Are there any single points of failure?"
                }
            ]
        }
    ]
)

print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")  # Rough image cost: (width x height) / 750 tokens, about 1,600 at most after downscaling
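Both examples above hard-code the media type. A small builder can derive it from the file name instead. The following is a minimal sketch with a hypothetical helper named `image_block`. It uses the standard library's `mimetypes` module and accepts only the four formats listed in this section.

```python
import base64
import mimetypes

SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def image_block(data: bytes, filename: str) -> dict:
    """Build a base64 image content block, inferring media_type from the file name."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"unsupported image type for {filename!r}: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }

# The bytes here are just the PNG signature, used as sample data
block = image_block(b"\x89PNG\r\n\x1a\n", "screenshot.png")
print(block["source"]["media_type"])  # image/png
```

The returned dictionary goes into the `content` array of a user message alongside a text block, exactly as in Option 1.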

DocumentBlock — PDF Input

Send PDF documents for Claude to read and analyze. The document is passed as base64-encoded content. Claude can extract text, tables, charts, and images from PDFs — useful for contract review, report analysis, and document Q&A.

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Send a PDF for analysis
pdf_data = base64.standard_b64encode(Path("api-spec.pdf").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the breaking changes in this API specification. List each change with the affected endpoint."
                }
            ]
        }
    ]
)

print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")

                        
Token Cost for Multimodal: Anthropic's documentation estimates image cost as roughly (width × height) / 750 tokens. Images whose long edge exceeds about 1568 pixels are downscaled first, so a single image typically tops out around 1,600 tokens. PDFs are tokenized per page (~1,500–3,000 text tokens/page depending on density), plus image cost, because each page is also processed as an image. Use client.messages.count_tokens() to measure exact costs before sending large documents. These input blocks support cache_control for prompt caching, which is ideal when asking multiple questions about the same document.
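For quick budgeting, image cost can be approximated locally. The sketch below assumes the approximation from Anthropic's vision documentation (about width × height / 750 tokens, with downscaling once the long edge exceeds 1568 pixels). Both constants are assumptions that may differ by model, and `estimate_image_tokens` is a hypothetical helper, so treat the result as a rough estimate and use count_tokens() for exact numbers.

```python
import math

MAX_LONG_EDGE = 1568      # assumption: long-edge limit quoted in the vision docs
PIXELS_PER_TOKEN = 750    # assumption: approximation quoted in the same docs

def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one image; not a substitute for count_tokens()."""
    long_edge = max(width, height)
    if long_edge > MAX_LONG_EDGE:
        # Downscale while preserving aspect ratio, as the API is documented to do
        scale = MAX_LONG_EDGE / long_edge
        width, height = round(width * scale), round(height * scale)
    return math.ceil(width * height / PIXELS_PER_TOKEN)

for size in [(200, 200), (1000, 1000), (1092, 1092)]:
    print(size, estimate_image_tokens(*size))
```

The sketch applies only the long-edge rule. The real service may downscale further, so very wide or very tall images can come out slightly high here.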
                    

Real-World Application

Real-Time Medical Triage Assistant

A healthcare platform used streaming + multi-turn conversations to build a symptom checker that asks follow-up questions before suggesting next steps. Key: maintaining context across turns and using stop_reason to detect when Claude needs more info vs. is ready to conclude. The system achieved 92% concordance with nurse triage decisions while reducing average wait times by 6 minutes.

Tags: Streaming, Multi-Turn, Healthcare

3. Stop Reason

The stop_reason field tells you why Claude stopped generating. This is critical for agentic applications where you need to know whether to continue the loop (tool_use) or present the final answer (end_turn).

3.1 Stop Reason Values

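Value	Meaning	Action
`"end_turn"`	Claude finished its response naturally	Present response to user
`"tool_use"`	Claude wants to call one or more tools	Execute tools, return results, continue loop
`"max_tokens"`	Hit the max_tokens limit	Increase limit or continue in next request
`"stop_sequence"`	Hit a custom stop sequence	Process output up to that point
`"pause_turn"`	A long-running turn was paused (seen with server-side tools)	Send the response back as-is so Claude can continue
`"refusal"`	Claude declined to respond for safety reasons	Surface the refusal to the caller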

Here is how to branch on stop_reason in an application — the foundation of agentic loop control (covered deeply in Part 3):

import anthropic
import json

client = anthropic.Anthropic()

# Define a tool for this example
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Look up the weather and tell me what to wear."}]
)

# Branch based on stop_reason
if response.stop_reason == "end_turn":
    # Claude is done — present the final answer
    print("Final:", response.content[0].text)

elif response.stop_reason == "tool_use":
    # Claude wants to call a tool — execute it and continue
    tool_blocks = [b for b in response.content if b.type == "tool_use"]
    print(f"Claude wants to call {len(tool_blocks)} tool(s)")
    for tool in tool_blocks:
        print(f"  → {tool.name}({tool.input})")

elif response.stop_reason == "max_tokens":
    # Output was truncated — handle gracefully
    print("Warning: response truncated at max_tokens limit")

elif response.stop_reason == "stop_sequence":
    # Hit a custom stop sequence
    print("Stopped at custom sequence")
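The branch above handles a single response. In an agentic setting the same check drives a loop. The following is a minimal sketch, assuming hypothetical `call_model` and `run_tools` callables that wrap `client.messages.create(...)` and your tool execution. A scripted fake model stands in for the API so the control flow can be followed without a network call.

```python
from types import SimpleNamespace
from typing import Callable

def run_until_done(call_model: Callable[[list], object],
                   run_tools: Callable[[list], dict],
                   messages: list, max_turns: int = 5):
    """Call the model repeatedly, feeding tool results back until it stops asking for tools."""
    for _ in range(max_turns):
        response = call_model(messages)
        if response.stop_reason != "tool_use":
            return response  # end_turn, max_tokens, stop_sequence: the caller decides what to do
        # Echo the assistant turn (with its tool_use blocks), then answer with tool results
        messages.append({"role": "assistant", "content": response.content})
        messages.append(run_tools(response.content))
    raise RuntimeError(f"no final answer after {max_turns} turns")

# Fake model for illustration: asks for a tool once, then answers
scripted = iter([
    SimpleNamespace(stop_reason="tool_use",
                    content=[SimpleNamespace(type="tool_use", id="t1", name="get_weather", input={})]),
    SimpleNamespace(stop_reason="end_turn",
                    content=[SimpleNamespace(type="text", text="Wear a light jacket.")]),
])
fake_model = lambda msgs: next(scripted)
fake_tools = lambda content: {"role": "user", "content": [
    {"type": "tool_result", "tool_use_id": b.id, "content": "22C, cloudy"}
    for b in content if b.type == "tool_use"]}

history = [{"role": "user", "content": "What should I wear?"}]
final = run_until_done(fake_model, fake_tools, history)
print(final.content[0].text)  # Wear a light jacket.
```

The `max_turns` cap is a safety choice of this sketch, so a model that keeps requesting tools cannot loop forever.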

3.2 Custom Stop Sequences

Custom stop sequences let you control exactly where Claude stops generating. When Claude produces any string in your stop_sequences array, it immediately halts output and sets stop_reason to "stop_sequence". The stop string itself is not included in the response text — it acts as a boundary marker you can rely on for parsing.

Key behaviors:

Pass one or more stop sequences as an array of strings (confirm any per-request cap in the current API reference)
Matching is exact and case-sensitive
The matched stop sequence is excluded from the output
Works alongside max_tokens — whichever triggers first wins
Compatible with streaming (the stream ends when a stop sequence is hit)
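The matching rules listed above can be modeled locally, which helps when writing parsers or tests around stop sequences. The sketch below is an illustration of the documented behavior (exact, case-sensitive, earliest match wins, matched sequence excluded), not the server's implementation. `apply_stop_sequences` is a hypothetical name.

```python
from typing import Optional

def apply_stop_sequences(text: str, stop_sequences: list[str]) -> tuple[str, Optional[str]]:
    """Cut text at the earliest exact, case-sensitive match; return (output, matched_sequence)."""
    best_index, matched = len(text), None
    for seq in stop_sequences:
        index = text.find(seq)
        if index != -1 and index < best_index:
            best_index, matched = index, seq
    return text[:best_index], matched

output, matched = apply_stop_sequences("Summary text.\n---\nMore text END", ["END", "---"])
print(repr(output), matched)  # 'Summary text.\n' ---
print(apply_stop_sequences("no marker, end", ["END"]))  # ('no marker, end', None)
```

The second call shows the case-sensitivity rule: lowercase "end" does not match "END", so nothing is cut.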

Use Case 1: Section Extraction — extract only the first section of a structured response:

import anthropic

client = anthropic.Anthropic()

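# Ask for the summary first with no heading, so the first "## " marks the NEXT section.
# (If the reply opened with a "## " heading, generation would stop immediately.)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    stop_sequences=["## ", "---"],
    messages=[
        {"role": "user", "content": "Write a technical summary of Python async/await as plain paragraphs with no heading. Then add a code example section under a '## Code Example' heading."}
    ]
)

# Only the summary is returned when Claude stops at "## " (the next heading)
print(response.content[0].text)
print(f"Stop reason: {response.stop_reason}")      # "stop_sequence" when a marker was reached
print(f"Matched sequence: {response.stop_sequence}")  # which marker triggered the stop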

Use Case 2: Template Filling — have Claude fill in a template and stop at a placeholder boundary:

import anthropic

client = anthropic.Anthropic()

# Fill in a template — Claude generates until it hits the delimiter
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    stop_sequences=["{{END}}"],
    messages=[
        {"role": "user", "content": "Generate a product description for a wireless mouse. Write naturally and end with {{END}}."}
    ]
)

# Clean output — no delimiter in the text, just the description
product_description = response.content[0].text.strip()
print(product_description)

Use Case 3: Iterative Generation — generate content step-by-step by stopping and resuming at markers. This is a powerful pattern for building pipelines that process each section individually:

import anthropic

client = anthropic.Anthropic()

# Generate a multi-step plan, stopping after each step
messages = [
    {"role": "user", "content": "List 3 steps to deploy a FastAPI app to production. Number each step (1. 2. 3.) and put --- between steps."}
]

steps = []
for i in range(3):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        stop_sequences=["---"],
        messages=messages
    )

    step_text = response.content[0].text.strip()
    steps.append(step_text)

    if response.stop_reason == "stop_sequence":
        # Append assistant's partial output (including the stop marker)
        messages.append({"role": "assistant", "content": response.content[0].text + "---"})
        # API requires conversation to end with a user message — prompt to continue
        messages.append({"role": "user", "content": "Continue."})
    else:
        break  # end_turn or max_tokens — no more steps

for i, step in enumerate(steps, 1):
    print(f"Step {i}: {step}\n")

                        
Resuming After Stop: When Claude stops at a custom sequence, you can continue generation by appending the partial response (including the stop sequence) as an assistant message, followed by a user message (e.g., "Continue.") so that the conversation ends with a user turn. This “stop and resume” pattern is useful for iterative pipelines, chunked processing, and building structured outputs piece by piece. Stop sequences are matched against the raw output text, so if Claude wraps its output in markdown code fences, ask for the delimiter to appear outside the fences.
                    

4. Streaming

Streaming delivers response tokens as they’re generated, reducing time-to-first-token (TTFT) and enabling real-time UIs. The Anthropic API uses Server-Sent Events (SSE) with typed event objects that map to the content block architecture.

4.1 SSE Event Types

Event	Purpose	Key Data
`message_start`	Response begins	Message ID, model, usage (input_tokens)
`content_block_start`	New content block begins	Block type, index
`content_block_delta`	Incremental content	Text delta or tool input delta
`content_block_stop`	Block complete	Block index
`message_delta`	Message-level updates	stop_reason, usage (output_tokens)
`message_stop`	Response complete	—
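To see how these events fit together, the sketch below rebuilds content blocks from a hand-written event sequence. The events are plain dictionaries that mirror the fields in the table above, and `assemble_blocks` is a hypothetical helper. The SDK's stream helpers do this accumulation for you, so this is only an illustration of the protocol.

```python
import json

def assemble_blocks(events: list[dict]) -> list[dict]:
    """Rebuild content blocks from a sequence of streaming events (given as plain dicts)."""
    blocks: dict[int, dict] = {}
    partial_json: dict[int, str] = {}
    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            blocks[event["index"]] = dict(event["content_block"])
            partial_json[event["index"]] = ""
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                blocks[event["index"]]["text"] += delta["text"]
            elif delta["type"] == "input_json_delta":
                partial_json[event["index"]] += delta["partial_json"]
        elif kind == "content_block_stop":
            block = blocks[event["index"]]
            if block["type"] == "tool_use":
                # Tool input arrives as JSON fragments; parse only once the block is complete
                block["input"] = json.loads(partial_json[event["index"]] or "{}")
    return [blocks[i] for i in sorted(blocks)]

events = [
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Checking"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " Paris."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1, "delta": {"type": "input_json_delta", "partial_json": '{"loc'}},
    {"type": "content_block_delta", "index": 1, "delta": {"type": "input_json_delta", "partial_json": 'ation": "Paris"}'}},
    {"type": "content_block_stop", "index": 1},
]
print(assemble_blocks(events))
```

The key point is that `input_json_delta` fragments are not valid JSON on their own, so the tool input is parsed only at `content_block_stop`.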

4.2 SDK Stream Helpers

The Python SDK provides a high-level streaming interface that handles SSE parsing and delivers events as typed objects. Use the with client.messages.stream() context manager for the cleanest pattern:

import anthropic

client = anthropic.Anthropic()

# High-level streaming with the SDK helper
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain microservices architecture."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Read the accumulated message while the stream context is still open
    final_message = stream.get_final_message()

print(f"\n\nTokens used: {final_message.usage.input_tokens} in, {final_message.usage.output_tokens} out")

For more granular control over individual events (useful when handling tool calls in streaming mode), iterate over the stream object itself, which yields typed event objects:

import anthropic

client = anthropic.Anthropic()

# Define a tool for streaming example
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
    }
}]

# Low-level event-based streaming for tool_use handling
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "tool_use":
                print(f"Tool call starting: {event.content_block.name}")
        elif event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
            elif event.delta.type == "input_json_delta":
                print(event.delta.partial_json, end="")
        elif event.type == "message_delta":
            print(f"\nStop reason: {event.delta.stop_reason}")

The TypeScript SDK provides equivalent streaming with event handlers and async iteration:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function streamResponse() {
    const stream = client.messages.stream({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Explain microservices." }]
    });

    // Event-based streaming
    stream.on("text", (text) => process.stdout.write(text));
    stream.on("message", (message) => {
        console.log(`\nDone. Tokens: ${message.usage.output_tokens}`);
    });

    // Or use async iteration
    // for await (const event of stream) { ... }

    const finalMessage = await stream.finalMessage();
    return finalMessage;
}

streamResponse();

5. Token Management

Understanding token usage is essential for cost control and context window management. Every response includes a usage object, and the SDK provides a dedicated token counting endpoint for pre-flight checks.

5.1 Usage Object

Every response includes detailed token accounting. When prompt caching is active, additional fields track cache hits and creation costs:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Hello!"}]
)

# Standard usage fields
usage = response.usage
print(f"Input tokens:  {usage.input_tokens}")
print(f"Output tokens: {usage.output_tokens}")

# With prompt caching enabled, additional fields appear:
# usage.cache_creation_input_tokens — tokens written to cache (first call)
# usage.cache_read_input_tokens    — tokens read from cache (subsequent calls)

5.2 Count Tokens API

Use the count_tokens endpoint to measure token usage before making a request. This is invaluable for context window management — verifying your messages fit within the model’s limit before sending:

import anthropic

client = anthropic.Anthropic()

# Count tokens before sending a request
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain the theory of relativity in detail."}
    ]
)

print(f"Input tokens: {token_count.input_tokens}")

# Use this for context window management
MAX_CONTEXT = 200000
available_for_output = MAX_CONTEXT - token_count.input_tokens
print(f"Available for output: {available_for_output} tokens")

# Validate before sending
if token_count.input_tokens > MAX_CONTEXT * 0.8:
    print("Warning: approaching context limit, consider summarizing history")

                        
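Cost Formula: Total cost = (input_tokens × input_price) + (output_tokens × output_price). With prompt caching, add cache_read_input_tokens at about 10% of the input price (cache reads cost 90% less than regular input tokens) and cache_creation_input_tokens at the cache-write rate. Always track usage per-request for cost attribution and anomaly detection.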
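The formula can be applied directly to the usage fields from section 5.1. The following is a minimal sketch. `Prices` and `estimate_cost` are hypothetical names, the prices are placeholders rather than real pricing, and the 1.25× cache-write multiplier is an assumption based on the standard short-lived cache tier.

```python
from dataclasses import dataclass

@dataclass
class Prices:
    """USD per million tokens. The numbers used below are placeholders, not real pricing."""
    input: float
    output: float
    cache_write_multiplier: float = 1.25  # assumption: cache writes cost 25% more than input
    cache_read_multiplier: float = 0.10   # cache reads cost 90% less than input

def estimate_cost(usage: dict, prices: Prices) -> float:
    """Apply the cost formula to a usage dict, including the optional cache fields."""
    per_token_in = prices.input / 1_000_000
    per_token_out = prices.output / 1_000_000
    return (
        usage.get("input_tokens", 0) * per_token_in
        + usage.get("cache_creation_input_tokens", 0) * per_token_in * prices.cache_write_multiplier
        + usage.get("cache_read_input_tokens", 0) * per_token_in * prices.cache_read_multiplier
        + usage.get("output_tokens", 0) * per_token_out
    )

prices = Prices(input=2.0, output=10.0)  # hypothetical prices
usage = {"input_tokens": 1_000, "cache_read_input_tokens": 10_000, "output_tokens": 500}
print(f"${estimate_cost(usage, prices):.4f}")
```

Logging this value per request, keyed by user or feature, is one straightforward way to do the cost attribution mentioned above.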
                    

                        
Try It Yourself: Build a multi-turn conversation where Claude acts as a Socratic tutor. It should ask clarifying questions instead of giving direct answers. Implement at least 5 turns, handling the message history correctly. Then add streaming to see responses token-by-token.
                    

6. Count Tokens & Advanced Patterns

The Count Tokens endpoint lets you check how many tokens a message will consume before sending it — essential for staying within context limits, estimating costs, and deciding when to summarize or truncate. Unlike the usage field in responses (which tells you AFTER), count_tokens tells you BEFORE.

6.1 Count Tokens Endpoint

import anthropic
import json

client = anthropic.Anthropic()

# COUNT TOKENS — Check token count BEFORE sending
# POST /v1/messages/count_tokens
# Use case: Decide whether to summarize before sending (context window management)

result = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    # tools=[] — optional, tools consume tokens too!
)

print(f"Input tokens: {result.input_tokens}")
# Use this to decide: is there room for the response?
# context_window = 200000
# room_for_output = context_window - result.input_tokens
# if room_for_output < 4096:
#     summarize_older_messages()

# COUNT WITH TOOLS: tool definitions add to the input token count
tools = [{
    "name": "search",
    "description": "Search the knowledge base",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}}
}]

request = {
    "model": "claude-sonnet-4-6",
    "system": "You are a support agent.",
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}

# Count the same request with and without tools to measure the real overhead
result_without_tools = client.messages.count_tokens(**request)
result_with_tools = client.messages.count_tokens(**request, tools=tools)

overhead = result_with_tools.input_tokens - result_without_tools.input_tokens
print(f"Without tools: {result_without_tools.input_tokens} tokens")
print(f"With tools: {result_with_tools.input_tokens} tokens")
print(f"\nKey insight: tool schemas add {overhead} tokens here, paid on EVERY request that includes them")

6.2 Multi-Turn Conversation Patterns

import anthropic
import json

client = anthropic.Anthropic()

# Multi-turn conversations require careful message management.
# Each turn adds to the context — eventually hitting the window limit.

# Pattern 1: Simple accumulation (fine for short conversations)
messages = []

def chat_simple(user_input: str) -> str:
    """Simple multi-turn — accumulates all messages."""
    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=messages
    )

    assistant_msg = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

# Pattern 2: Token-aware (checks before sending, summarizes if needed)
def chat_token_aware(user_input: str, max_context: int = 150000) -> str:
    """Multi-turn with token management — summarizes when context gets large."""
    messages.append({"role": "user", "content": user_input})

    # Check current token count
    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        messages=messages
    )

    # If approaching limit, summarize older messages
    if count.input_tokens > max_context:
        # Keep the last 3 messages (user, assistant, user) so roles still
        # alternate after the summary/acknowledgement pair
        summary = summarize_messages(messages[:-3])
        messages[:] = [
            {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
            {"role": "assistant", "content": "I understand the context. How can I help?"}
        ] + messages[-3:]

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=messages
    )

    assistant_msg = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

def summarize_messages(msgs: list) -> str:
    """Summarize older messages to free context space."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # Cheap model for summarization
        max_tokens=500,
        system="Summarize this conversation, preserving key decisions and facts.",
        messages=[{"role": "user", "content": json.dumps(msgs)}]
    )
    return response.content[0].text

print("Pattern 1: Simple accumulation (short conversations)")
print("Pattern 2: Token-aware with auto-summarization (long conversations)")
print("\nAlways use count_tokens BEFORE sending to avoid context overflow")
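The compaction step in Pattern 2 can be pulled out as a pure function, which makes the role-alternation requirement easy to check without calling the API. The sketch below is illustrative only: `compact_history` and its `keep_last` parameter are hypothetical names (not part of the SDK), and it assumes the history alternates user/assistant and ends with a user message.

```python
from typing import Dict, List

Message = Dict[str, str]


def compact_history(messages: List[Message], summary: str, keep_last: int = 3) -> List[Message]:
    """Replace older turns with a summary pair, keeping the most recent turns.

    Hypothetical helper: assumes `messages` alternates user/assistant and ends
    with a user message. `keep_last` must be odd so the kept tail starts with
    a user turn and roles keep alternating after the summary pair.
    """
    if keep_last % 2 == 0:
        raise ValueError("keep_last must be odd so the tail starts with a user turn")
    if len(messages) <= keep_last:
        return list(messages)  # Nothing old enough to summarize
    head = [
        {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
        {"role": "assistant", "content": "I understand the context. How can I help?"},
    ]
    return head + messages[-keep_last:]


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
]
compacted = compact_history(history, summary="User asked q1.")
roles = [m["role"] for m in compacted]
print(roles)
```

Keeping an odd number of trailing messages is a design choice here: it guarantees the tail begins with a user turn, so the summary pair never produces two assistant messages in a row.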

6.3 Beta Features & Headers

import anthropic

client = anthropic.Anthropic()

# Beta features require special headers to opt-in
# This gives Anthropic a way to ship experimental features safely

# Enable beta features via the betas parameter, which lives on the
# client.beta.messages namespace (not on client.messages):
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1000,
    # Beta features (check docs for currently available betas):
    # betas=["prompt-caching-2024-07-31", "max-tokens-3-5-sonnet-2024-07-15"]
    messages=[{"role": "user", "content": "Hello!"}]
)

# Or via raw headers (lower-level control):
# client = anthropic.Anthropic(
#     default_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
# )

# Common beta features (examples — check current docs):
# - Prompt caching (reduces cost for repeated system prompts)
# - Extended output (higher max_tokens limits)
# - Computer use (UI interaction tools)
# - PDF support (native PDF processing)

Beta Feature Lifecycle

stateDiagram-v2
    [*] --> Beta: Feature announced
    Beta --> Stable: Header optional
    Stable --> GA: Generally available
    GA --> Deprecated: Header ignored

    Beta: Requires header to enable
    Stable: Header works but optional
    GA: No header needed
    Deprecated: Header still accepted

                        
                        Beta Feature Summary: Opt in via the betas parameter or anthropic-beta header. Check docs.anthropic.com for currently available betas. Production tip: pin beta versions (date-stamped) for stability — beta APIs can change without notice.
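When opting in through raw headers, several betas are sent as one comma-separated value in a single `anthropic-beta` header. A minimal sketch, assuming a hypothetical helper named `beta_headers` (not part of the SDK) and reusing the example beta names from this section:

```python
from typing import Dict, List


def beta_headers(betas: List[str]) -> Dict[str, str]:
    """Build the anthropic-beta header from a list of pinned beta names.

    Hypothetical helper for illustration. Duplicates are dropped while
    preserving order so the header value stays stable between deploys.
    """
    unique = list(dict.fromkeys(betas))
    if not unique:
        return {}
    return {"anthropic-beta": ",".join(unique)}


headers = beta_headers([
    "prompt-caching-2024-07-31",
    "max-tokens-3-5-sonnet-2024-07-15",
    "prompt-caching-2024-07-31",  # duplicate is ignored
])
print(headers)
```

The resulting dict could be passed as `default_headers` when constructing the client, or per request through the SDK's `extra_headers` argument.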
                    

                        
                        CCA Exam Pattern (0.4): Questions test: (1) count_tokens checks token count BEFORE sending (not after). (2) Tool definitions add token overhead to every request (~50-200 tokens each). (3) Multi-turn requires message accumulation with role alternation. (4) Beta features require explicit opt-in via headers. (5) Pin model versions AND beta versions in production for stability.
                    

7. Prompt Caching

Prompt caching lets a request resume from a previously processed prompt prefix instead of reprocessing it. This significantly reduces processing time and cost for repetitive tasks or prompts with consistent elements. Cache reads cost only 10% of the base input token price; cache writes cost 1.25× the base price with the 5-minute TTL or 2× the base price with the 1-hour TTL.

Prompt Caching Request Flow

flowchart TD
    A["API Request with cache_control"] --> B{"Cached prefix\nexists?"}
    B -->|Yes| C["Cache Read\n(10% cost)"]
    B -->|No| D["Process full prompt"]
    D --> E["Write to cache\n(125% cost)"]
    C --> F["Process only\nnew tokens"]
    E --> G["Generate response"]
    F --> G
    G --> H["Return response\n+ usage fields"]
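The pricing multipliers imply a break-even point that is easy to work out. The sketch below is a rough estimate under stated assumptions: a 10,000-token prefix (an illustrative figure), the $3/MTok Sonnet input price quoted later in this article, the 5-minute TTL multipliers, and every request after the first hitting the cache. `input_cost_usd` is a hypothetical helper name.

```python
def input_cost_usd(prefix_tokens: int, requests: int, base_price_per_mtok: float,
                   cached: bool) -> float:
    """Input cost of sending the same prefix `requests` times.

    Assumes the 5-minute TTL multipliers (1.25x write, 0.10x read) and that
    every request after the first is a cache hit.
    """
    per_token = base_price_per_mtok / 1_000_000
    if not cached or requests == 0:
        return prefix_tokens * requests * per_token
    write = prefix_tokens * per_token * 1.25
    reads = prefix_tokens * per_token * 0.10 * (requests - 1)
    return write + reads


for n in (1, 2, 10):
    plain = input_cost_usd(10_000, n, 3.0, cached=False)
    with_cache = input_cost_usd(10_000, n, 3.0, cached=True)
    print(f"{n:>2} requests: uncached ${plain:.4f} vs cached ${with_cache:.4f}")
```

Under these assumptions a single request costs slightly more with caching (the write premium), and the cached variant becomes cheaper from the second request onward.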

7.1 Automatic vs. Explicit Caching

There are two approaches to enable prompt caching:

Automatic caching: Add a single cache_control field at the top level of your request body. The system automatically places the cache breakpoint on the last cacheable block and moves it forward as conversations grow. Best for multi-turn conversations.
Explicit cache breakpoints: Place cache_control directly on individual content blocks for fine-grained control over exactly what gets cached. Best when different sections change at different frequencies.

The simplest way to start is automatic caching — a single parameter handles everything:

import anthropic

client = anthropic.Anthropic()

# Automatic caching — one parameter, system handles breakpoint placement
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Enable automatic caching
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.",
    messages=[
        {"role": "user", "content": "Analyze the major themes in 'Pride and Prejudice'."}
    ]
)

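# Note: this short prompt is below the minimum cacheable length (see 7.9),
# so both cache fields will report 0; use a longer prefix to see caching
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)}")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)}")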

                        
                        Key Insight: With automatic caching, the system caches all content up to and including the last cacheable block. On subsequent requests with the same prefix, cached content is reused automatically. The cache breakpoint moves forward as conversations grow — no manual marker updates needed.
                    

7.2 Automatic Caching in Multi-Turn Conversations

Automatic caching excels in multi-turn scenarios. Each new request caches everything up to the last cacheable block, and previous content is read from cache:

Request	Content	Cache Behavior
Request 1	System + User(1) + Asst(1) + User(2) ◄ cache	Everything written to cache
Request 2	System + User(1) + Asst(1) + User(2) + Asst(2) + User(3) ◄ cache	System through User(2) read from cache; Asst(2) + User(3) written
Request 3	System + User(1) + … + User(3) + Asst(3) + User(4) ◄ cache	System through User(3) read from cache; Asst(3) + User(4) written

import anthropic

client = anthropic.Anthropic()

# Multi-turn with automatic caching — breakpoint moves forward automatically
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="You are a helpful assistant that remembers our conversation.",
    messages=[
        {"role": "user", "content": "My name is Alex. I work on machine learning."},
        {"role": "assistant", "content": "Nice to meet you, Alex! How can I help with your ML work today?"},
        {"role": "user", "content": "What did I say I work on?"}
    ]
)

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)}")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)}")
print(f"\nAnswer: {response.content[0].text}")
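The read/write pattern in the table above can be reproduced with a small offline simulation. This is a simplified model, not the service's actual algorithm: it assumes each request extends the previous one unchanged, the cache never expires between requests, and every prefix meets the minimum cacheable length. `simulate_auto_cache` is a hypothetical name.

```python
from typing import List, Tuple


def simulate_auto_cache(requests: List[List[str]]) -> List[Tuple[List[str], List[str]]]:
    """Return (read_from_cache, written_to_cache) block labels per request."""
    cached_len = 0
    results = []
    for blocks in requests:
        results.append((blocks[:cached_len], blocks[cached_len:]))
        cached_len = len(blocks)  # breakpoint moves to the last block
    return results


turn1 = ["System", "User(1)", "Asst(1)", "User(2)"]
turn2 = turn1 + ["Asst(2)", "User(3)"]
turn3 = turn2 + ["Asst(3)", "User(4)"]

result = simulate_auto_cache([turn1, turn2, turn3])
for i, (read, written) in enumerate(result, start=1):
    print(f"Request {i}: read={read} written={written}")
```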

7.3 Explicit Cache Breakpoints

For fine-grained control, place cache_control directly on individual content blocks. This is useful when you need to cache sections that change at different frequencies. Cache prefixes are created in the order: tools → system → messages.

import anthropic

client = anthropic.Anthropic()

# Explicit breakpoints — cache system prompt + long document separately
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing legal documents.",
        },
        {
            "type": "text",
            "text": "Here is the full text of a complex legal agreement: [Insert 50-page legal agreement here]",
            "cache_control": {"type": "ephemeral"},  # Cache breakpoint here
        },
    ],
    messages=[
        {"role": "user", "content": "What are the key terms and conditions in this agreement?"}
    ]
)

# First request: cache_creation_input_tokens > 0 (writing to cache)
# Subsequent requests: cache_read_input_tokens > 0 (reading from cache at 10% cost)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)}")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)}")

                        
                        Critical: Place cache_control on the last block whose prefix is identical across the requests you want to share a cache. If you place it on a block that changes every request (timestamps, per-request context), the prefix hash never matches and you pay for a fresh cache write every time with no reads. The lookback does not find stable content behind your breakpoint — it only finds entries that earlier requests wrote at their breakpoints.
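One way to follow this rule is to split the system prompt into a stable block that carries the breakpoint and a volatile block placed after it. The sketch below only builds the request payload (no API call); `build_system` is a hypothetical helper, and the timestamp text is an illustrative stand-in for any per-request context.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_system(static_text: str, volatile_text: str) -> List[Dict[str, Any]]:
    """Put the breakpoint on the stable block; volatile text goes after it."""
    return [
        {
            "type": "text",
            "text": static_text,
            "cache_control": {"type": "ephemeral"},  # last block that never changes
        },
        {"type": "text", "text": volatile_text},  # changes per request, sits after the breakpoint
    ]


guidelines = "You are an AI assistant tasked with analyzing legal documents."
first = build_system(guidelines, f"Current time: {datetime.now(timezone.utc).isoformat()}")
second = build_system(guidelines, "Current time: (a later timestamp)")

# The marked block is identical across requests, so its prefix can be shared
print(first[0] == second[0])
# The volatile block differs, but it lies outside the cached prefix
print(first[1] == second[1])
```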
                    

Real-World Example: The following production-style pattern caches a large system prompt (coding guidelines) with an explicit breakpoint. The first call writes to the cache; subsequent calls read from it at a 90% discount:

import anthropic

client = anthropic.Anthropic()

# Production system prompt — large enough to exceed the 1,024-token cache minimum.
# In practice this would be your full coding guidelines, API reference, or knowledge base.
large_guidelines = """You are an expert code reviewer for our FastAPI + SQLAlchemy application.

## Review Guidelines

### Security Rules
1. All database queries MUST use parameterized statements — never string interpolation
2. Validate all user input with Pydantic v2 models before processing
3. Check authorization on every endpoint — never rely on frontend-only auth
4. Sanitize file uploads: check MIME type, size limits, and scan for malware
5. Never log sensitive data (passwords, tokens, PII) — use field-level masking
6. Use HTTPS-only cookies with SameSite=Strict for session management
7. Rate limit all public endpoints (use slowapi or custom middleware)
8. Validate Content-Type headers to prevent CSRF on JSON endpoints
9. Implement CORS with explicit origin allowlist — never use wildcard in production
10. Hash passwords with bcrypt (cost factor 12+) — never store plaintext or MD5

### Code Quality Standards
1. Maximum function length: 30 lines — split into helpers if longer
2. All functions must have type hints on parameters and return values
3. Use dependency injection via FastAPI's Depends() for shared logic
4. Database sessions must use async context managers (no manual close)
5. Use Pydantic v2 model_validator for cross-field validation
6. Prefer SQLAlchemy 2.0 select() syntax over legacy Query API
7. All endpoints must return proper HTTP status codes (201 for creation, 204 for deletion)
8. Use structured logging (JSON format) with correlation IDs per request
9. Docstrings required on all public functions — use Google style format
10. No magic numbers — use named constants or configuration values

### Performance Requirements
1. Add database indexes for all columns used in WHERE clauses
2. Use eager loading (selectinload/joinedload) to prevent N+1 queries
3. Paginate all list endpoints (max 100 items per page)
4. Cache expensive computations with Redis (TTL based on data volatility)
5. Use connection pooling (pool_size=20, max_overflow=10 for production)
6. Background tasks for operations taking more than 500ms
7. Compress responses over 1KB with gzip middleware
8. Use async database drivers (asyncpg for PostgreSQL) — never block event loop
9. Implement query timeout limits (30s max) to prevent long-running queries
10. Profile slow endpoints with OpenTelemetry spans — alert if p95 exceeds 200ms

### Architecture Patterns
1. Repository pattern for data access — never use db.session directly in routes
2. Service layer for business logic — routes only handle HTTP concerns
3. Domain events for cross-service communication (avoid tight coupling)
4. Use Alembic migrations for ALL schema changes — never alter tables manually
5. Feature flags for gradual rollouts (use LaunchDarkly or custom implementation)
6. Circuit breaker pattern for external service calls (prevent cascade failures)
7. Idempotency keys for all mutation endpoints (prevent duplicate processing)
8. CQRS pattern for read-heavy endpoints — separate read/write models
9. Event sourcing for audit-critical domains (payments, permissions)
10. API versioning via URL prefix (/v1/, /v2/) — never break existing contracts

### Testing Standards
1. Unit tests for all service layer functions (mock external dependencies)
2. Integration tests for database operations (use test database with fixtures)
3. API tests for all endpoints (test happy path, validation errors, auth failures)
4. Minimum 80% code coverage on new code — block PR if coverage drops
5. Use factories (factory_boy) for test data — never hardcode fixtures
6. Test async code with pytest-asyncio — verify proper await handling
7. Load tests for all public endpoints — verify p99 latency under 500ms at 100 RPS
8. Security tests: SQL injection, XSS, IDOR, broken auth (use OWASP ZAP)
9. Contract tests for inter-service APIs — verify schema compatibility
10. Chaos tests monthly: kill random pods, inject latency, corrupt responses

### Error Handling
1. Use custom exception classes inheriting from a base AppException
2. Global exception handler must return consistent error response format
3. Include correlation_id in all error responses for debugging
4. Log full stack trace for 500 errors, structured context for 400 errors
5. Never expose internal error details to clients (generic message + error code)
6. Retry transient failures (network, database timeout) with exponential backoff
7. Dead letter queue for failed async jobs — alert after 3 consecutive failures
8. Graceful degradation: return cached/stale data when downstream services are down
9. Health check endpoint (/health) must verify database connectivity and return 503 if down
10. Structured error codes: ERR_{DOMAIN}_{CODE} format (e.g., ERR_AUTH_TOKEN_EXPIRED)

### API Design Standards
1. Use plural nouns for resource endpoints (/users, /orders, not /user, /order)
2. Nested resources for parent-child: /users/{id}/orders (max 2 levels deep)
3. Filter via query params: ?status=active&sort=-created_at&limit=20
4. Return 422 for validation errors with field-level detail array
5. Use ETags for cache validation on GET endpoints
6. Implement cursor-based pagination for large datasets (not offset-based)
7. Rate limit response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
8. Consistent datetime format: ISO 8601 with timezone (2024-01-15T10:30:00Z)

Provide specific line references and severity ratings (critical/warning/info) for each finding.
End your review with a summary table: | Line | Severity | Issue | Fix |"""

# First call — writes the system prompt to cache
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_guidelines,
            "cache_control": {"type": "ephemeral"}  # Explicit breakpoint
        }
    ],
    messages=[
        {"role": "user", "content": "Review this endpoint:\n\n@app.post('/users')\nasync def create_user(name: str, email: str, db: Session = Depends(get_db)):\n    user = User(name=name, email=email)\n    db.add(user)\n    db.commit()\n    return {'id': user.id}"}
    ]
)

print("--- First call (cache creation) ---")
print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)} tokens")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)} tokens")
print(f"Input tokens: {response.usage.input_tokens}")

# Second call with SAME system prompt — cache hit
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_guidelines,  # Identical content → cache hit
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Review this endpoint:\n\n@app.get('/users/{user_id}')\nasync def get_user(user_id: int, db: Session = Depends(get_db)):\n    return db.query(User).filter(User.id == user_id).first()"}
    ]
)

print("\n--- Second call (cache read) ---")
print(f"Cache created: {getattr(response2.usage, 'cache_creation_input_tokens', 0)} tokens")
print(f"Cache read: {getattr(response2.usage, 'cache_read_input_tokens', 0)} tokens")
print(f"Input tokens: {response2.usage.input_tokens}")
# Expected: cache_read > 0, cache_creation = 0 (already cached)
print(f"\nReview: {response2.content[0].text[:200]}...")

Running this produces output like the following. The first call writes 1,374 tokens to the cache, and the second call reads them at a 90% discount, with only 68 fresh input tokens billed at full price for the new user message:

--- First call (cache creation) ---
Cache created: 1374 tokens
Cache read: 0 tokens
Input tokens: 82

--- Second call (cache read) ---
Cache created: 0 tokens
Cache read: 1374 tokens
Input tokens: 68

Review: ## Code Review: `GET /users/{user_id}`

### Finding 1 — Missing Authorization Check
**Line 2 | Severity: CRITICAL**

Any authenticated (or unauthenticated) caller can fetch any user's record by guessi...

7.4 Caching Tool Definitions

Tool definitions can be cached by placing cache_control on the last tool in your tools array. All tools defined before and including that tool are cached as a single prefix. This is especially valuable when you have many complex tool schemas:

import anthropic

client = anthropic.Anthropic()

# Cache tool definitions — cached as the first segment in the hierarchy
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[
        {
            "name": "search_documents",
            "description": "Search through the knowledge base for relevant documents.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "Search query"}},
                "required": ["query"]
            }
        },
        {
            "name": "get_document",
            "description": "Retrieve a specific document by its unique ID.",
            "input_schema": {
                "type": "object",
                "properties": {"doc_id": {"type": "string", "description": "Document ID"}},
                "required": ["doc_id"]
            },
            "cache_control": {"type": "ephemeral"}  # Caches ALL tools above + this one
        }
    ],
    messages=[{"role": "user", "content": "Search for information about Mars rovers."}]
)

print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)} tokens")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)} tokens")
# On first request: cache_creation > 0 (tool schemas written)
# On subsequent requests: cache_read > 0 (tool schemas read from cache)
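If tool lists are assembled dynamically, the marker placement can be automated. A minimal sketch, assuming a hypothetical helper `with_cached_tools` (not an SDK function) that copies the list and marks only the last tool, so the caller's original definitions are left untouched:

```python
import copy
from typing import Any, Dict, List


def with_cached_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `tools` with cache_control on the last tool only."""
    marked = copy.deepcopy(tools)
    for tool in marked:
        tool.pop("cache_control", None)  # clear stray markers from earlier tools
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


tools = [
    {
        "name": "search_documents",
        "description": "Search through the knowledge base for relevant documents.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    },
    {
        "name": "get_document",
        "description": "Retrieve a specific document by its unique ID.",
        "input_schema": {"type": "object", "properties": {"doc_id": {"type": "string"}}, "required": ["doc_id"]},
    },
]

cached_tools = with_cached_tools(tools)
print([("cache_control" in t) for t in cached_tools])
```

Keeping the tool order stable matters as much as the marker: the prefix only matches when the serialized tools are identical from one request to the next.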

7.5 Multiple Cache Breakpoints

You can define up to 4 cache breakpoints to cache different sections that change at different frequencies. This gives maximum flexibility for complex applications:

4-Breakpoint Strategy

flowchart LR
    T["① Tools\n(rarely change)"] --> S["② System Instructions\n(stable)"]
    S --> R["③ RAG Context\n(daily updates)"]
    R --> M["④ Conversation\n(every turn)"]
    T -.- TC["cache_control"]
    S -.- SC["cache_control"]
    R -.- RC["cache_control"]
    M -.- MC["cache_control"]

import anthropic

client = anthropic.Anthropic()

# 4-breakpoint strategy: tools, instructions, RAG context, conversation
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[
        {
            "name": "search_documents",
            "description": "Search through the knowledge base",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]
            }
        },
        {
            "name": "get_document",
            "description": "Retrieve a specific document by ID",
            "input_schema": {
                "type": "object",
                "properties": {"doc_id": {"type": "string"}},
                "required": ["doc_id"]
            },
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
        }
    ],
    system=[
        {
            "type": "text",
            "text": "You are a helpful research assistant.\n\n# Instructions\n- Search before answering\n- Provide citations\n- Be objective and accurate",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Instructions
        },
        {
            "type": "text",
            "text": "# Knowledge Base\n\n## Doc 1: Solar System\nThe solar system consists of...\n\n## Doc 2: Mars\nMars has been a target of exploration...\n\n[50+ pages of context]",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 3: RAG context
        }
    ],
    messages=[
        {"role": "user", "content": "Can you search for Mars rovers?"},
        {"role": "assistant", "content": [
            {"type": "tool_use", "id": "tool_1", "name": "search_documents", "input": {"query": "Mars rovers"}}
        ]},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": "tool_1", "content": "Found 3 documents about Mars rovers."}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "I found 3 relevant documents. What would you like to know?"}
        ]},
        {"role": "user", "content": [
            {
                "type": "text",
                "text": "Tell me about the Perseverance rover.",
                "cache_control": {"type": "ephemeral"}  # Breakpoint 4: Conversation
            }
        ]}
    ]
)

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache created: {getattr(response.usage, 'cache_creation_input_tokens', 0)}")
print(f"Cache read: {getattr(response.usage, 'cache_read_input_tokens', 0)}")

                        
                        How multiple breakpoints help: If you append a new turn without changing earlier content, all four segments are reused. If you update RAG documents but keep tools/instructions, the first two segments are still cached. Changes at any breakpoint invalidate that segment and everything after it, but earlier cached segments remain valid.
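Because the limit is four breakpoints, it can help to count markers before sending a request. The following is a rough sketch with hypothetical helper names; it inspects only top-level blocks in `tools`, `system`, and each message's `content` (markers nested inside other blocks are not counted), and the tool definitions are abbreviated because only the markers matter here.

```python
from typing import Any, Dict, List, Optional, Union

MAX_BREAKPOINTS = 4  # limit stated in this section

Blocks = Union[str, List[Dict[str, Any]], None]


def _marked(blocks: Blocks) -> int:
    """Count top-level blocks that carry a cache_control marker."""
    if blocks is None or isinstance(blocks, str):
        return 0  # plain strings cannot carry a marker
    return sum(1 for block in blocks if "cache_control" in block)


def count_breakpoints(tools: Blocks = None, system: Blocks = None,
                      messages: Optional[List[Dict[str, Any]]] = None) -> int:
    """Total explicit breakpoints across tools, system, and messages."""
    total = _marked(tools) + _marked(system)
    for message in messages or []:
        total += _marked(message.get("content"))
    return total


marker = {"type": "ephemeral"}
request = {
    "tools": [{"name": "search_documents"}, {"name": "get_document", "cache_control": marker}],
    "system": [
        {"type": "text", "text": "instructions", "cache_control": marker},
        {"type": "text", "text": "knowledge base", "cache_control": marker},
    ],
    "messages": [
        {"role": "user", "content": "Can you search for Mars rovers?"},
        {"role": "assistant", "content": "I found 3 relevant documents."},
        {"role": "user", "content": [{"type": "text", "text": "follow-up", "cache_control": marker}]},
    ],
}
used = count_breakpoints(**request)
print(f"{used} of {MAX_BREAKPOINTS} breakpoints used")
```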
                    

7.6 Pre-Warming the Cache

Cache pre-warming eliminates the cache-miss latency penalty on the first user interaction. Set max_tokens: 0 to load your system prompt into cache without generating output — reducing time-to-first-token (TTFT) for latency-sensitive applications:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = [
    {
        "type": "text",
        "text": "You are an expert software engineer with deep knowledge of distributed systems, microservices architecture, and cloud-native patterns...",
        "cache_control": {"type": "ephemeral"},
    }
]


def prewarm_cache() -> None:
    """Call at application startup or on a scheduled interval."""
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=0,  # No output generated — just warms the cache
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": "warmup"}],
    )
    print(f"Cache warmed: {getattr(result.usage, 'cache_creation_input_tokens', 0)} tokens written")
    print(f"Stop reason: {result.stop_reason}")  # "max_tokens"
    print(f"Content: {result.content}")  # [] (empty — no output generated)


def respond(user_message: str) -> str:
    """Real user request — benefits from warm cache."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text


# Warm cache before user traffic arrives
prewarm_cache()

# Later — system prompt is already cached, reducing TTFT
answer = respond("How do I implement a circuit breaker pattern?")
print(answer)

                        
                        TTL & Limitations: Default cache lifetime is 5 minutes, refreshed each time cached content is used. For longer gaps, use 1-hour TTL: {"type": "ephemeral", "ttl": "1h"} (costs 2× base input price). max_tokens: 0 is rejected if stream: true, extended thinking, structured outputs, or forced tool choice is set.
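Since the default lifetime is refreshed on every use, a scheduler only needs to re-warm when traffic has been quiet for close to the TTL. A minimal sketch of that decision, assuming timestamps in seconds and a hypothetical `safety_margin` to absorb scheduling jitter (neither the function name nor the margin comes from the API):

```python
def needs_rewarm(last_used_at: float, now: float, ttl_seconds: int = 300,
                 safety_margin: int = 30) -> bool:
    """True when the cache entry is close enough to expiry to refresh.

    Assumes the entry was last read or written at `last_used_at` and that
    each use resets the TTL, as described above.
    """
    return (now - last_used_at) >= (ttl_seconds - safety_margin)


# Last cache use at t=0 with the default 5-minute TTL
print(needs_rewarm(last_used_at=0, now=100))   # recent traffic, no action needed
print(needs_rewarm(last_used_at=0, now=280))   # inside the safety margin, re-warm
print(needs_rewarm(last_used_at=0, now=3000, ttl_seconds=3600))  # 1-hour TTL still fresh
```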
                    

7.7 What Invalidates the Cache

The cache follows the hierarchy tools → system → messages. Changes at any level invalidate that level and all subsequent levels:

What Changes	Tools Cache	System Cache	Messages Cache
Tool definitions	✘	✘	✘
Web search toggle	✔	✘	✘
Citations toggle	✔	✘	✘
Tool choice	✔	✔	✘
Images	✔	✔	✘
Thinking parameters	✔	✔	✘
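The table can be encoded as a lookup for logging or debugging unexpected cache misses. This sketch copies the rows above verbatim; the key names and the `surviving_caches` function are hypothetical labels chosen for illustration.

```python
from typing import Dict

LEVELS = ("tools", "system", "messages")  # cache hierarchy order

# First cache level invalidated by each kind of change, copied from the table
FIRST_INVALIDATED = {
    "tool_definitions": "tools",
    "web_search_toggle": "system",
    "citations_toggle": "system",
    "tool_choice": "messages",
    "images": "messages",
    "thinking_parameters": "messages",
}


def surviving_caches(change: str) -> Dict[str, bool]:
    """Map each cache level to True if it stays valid after `change`."""
    start = LEVELS.index(FIRST_INVALIDATED[change])
    # Levels before the first invalidated one survive; the rest do not
    return {level: i < start for i, level in enumerate(LEVELS)}


for change in FIRST_INVALIDATED:
    print(f"{change:<20} {surviving_caches(change)}")
```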

7.8 Tracking Cache Performance

Monitor cache effectiveness using three usage fields returned in every response:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    cache_control={"type": "ephemeral"},
    system="You are a helpful coding assistant with expertise in Python.",
    messages=[{"role": "user", "content": "Explain list comprehensions."}]
)

# Token accounting breakdown
usage = response.usage
# "or 0" guards against fields that are present but None
cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0
cache_write = getattr(usage, 'cache_creation_input_tokens', 0) or 0
uncached = usage.input_tokens

print(f"Cache read tokens:     {cache_read}  (tokens from cache — 10% cost)")
print(f"Cache creation tokens: {cache_write}  (tokens written to cache — 125% cost)")
print(f"Input tokens:          {uncached}  (tokens AFTER last breakpoint — full cost)")
print(f"Output tokens:         {usage.output_tokens}")
print(f"Total input:           {cache_read + cache_write + uncached}")

# Cost estimation (Sonnet pricing: $3/MTok input, $15/MTok output)
base_input_price = 3.0  # per million tokens
cost_cache_read = cache_read * (base_input_price * 0.10) / 1_000_000
cost_cache_write = cache_write * (base_input_price * 1.25) / 1_000_000
cost_uncached = uncached * base_input_price / 1_000_000
cost_output = usage.output_tokens * 15.0 / 1_000_000

print(f"\nEstimated cost: ${cost_cache_read + cost_cache_write + cost_uncached + cost_output:.6f}")

7.9 Cache Limitations & Minimums

Minimum cacheable prompt lengths vary by model:

Model	Minimum Tokens
Claude Sonnet 4.6 / Sonnet 4.5 / Opus 4.8	1,024 tokens
Claude Opus 4.6 / Opus 4.5	4,096 tokens
Claude Haiku 4.5	4,096 tokens

Shorter prompts cannot be cached, even if marked with cache_control. Requests below the minimum are processed normally without caching (no error returned). Verify caching by checking cache_creation_input_tokens and cache_read_input_tokens — if both are 0, the prompt was not cached.
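Both checks described above (the minimum length and the usage fields) can be sketched as small helpers. The names are hypothetical, the minimums are copied from the table for the two model IDs used in this article, and treating the minimum as inclusive is an assumption. `SimpleNamespace` stands in for a real `response.usage` object, using the token counts from the earlier sample output.

```python
from types import SimpleNamespace

# Minimums from the table above, keyed by the model IDs used in this article
MIN_CACHEABLE_TOKENS = {"claude-sonnet-4-6": 1024, "claude-haiku-4-5": 4096}


def meets_cache_minimum(model: str, prefix_tokens: int) -> bool:
    """True if the prefix is long enough to be cached (inclusive bound assumed)."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model]


def cache_outcome(usage) -> str:
    """Classify a response's usage as 'hit', 'write', or 'not cached'."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return "hit"  # at least part of the prefix came from cache
    if written > 0:
        return "write"
    return "not cached"


first_call = SimpleNamespace(cache_creation_input_tokens=1374, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=1374)
short_prompt = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=0)

print(meets_cache_minimum("claude-sonnet-4-6", 1374))
print(meets_cache_minimum("claude-haiku-4-5", 1374))
print(cache_outcome(first_call), cache_outcome(second_call), cache_outcome(short_prompt))
```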

                        
                        Best Practices: (1) Start with automatic caching for multi-turn conversations. (2) Use explicit breakpoints when sections change at different frequencies. (3) Place static content (tools, system instructions, large context) at the beginning. (4) Place the breakpoint on the last block that stays identical across requests. (5) Monitor cache_read_input_tokens to confirm cache hits. (6) For concurrent requests, wait for the first response before sending parallel requests (cache entry is only available after the first response begins).
                    

                        
                        CCA Exam Pattern (0.5): Questions test: (1) Cache reads cost 10% of base input price (90% savings). (2) Cache writes cost 125% for 5-min TTL, 200% for 1-hour TTL. (3) Automatic caching places breakpoint on last cacheable block. (4) Cache hierarchy: tools → system → messages (changes invalidate downstream). (5) Maximum 4 breakpoints per request. (6) max_tokens: 0 pre-warms cache without generating output.
                    

Next in the SDK Track

In Part 3: Agentic Loops & Task Decomposition, we’ll use stop_reason to build autonomous agentic loops — the foundation of the Claude Agent SDK. Covers CCA Domain 1 Task 1.1 (agentic loop lifecycle) and Task 1.6 (task decomposition strategies).