Anthropic SDK Track Part 14: Extended Thinking

                        
What You’ll Learn: Extended thinking lets Claude “think out loud” before responding, working through a complex problem step by step in a scratchpad before it gives its final answer. This tends to improve performance on reasoning-heavy tasks such as math, code debugging, and strategic planning. Think of it like a student showing their work on an exam: writing out the steps helps avoid mistakes.
                    

                        
                        Version Note: This section reflects Anthropic’s docs as of May 2026. Extended-thinking behavior is especially version-sensitive right now: Claude 4.6 models still document manual budget_tokens mode, but Anthropic recommends adaptive thinking on newer/current Opus lines and marks manual mode as deprecated on Sonnet 4.6 and Opus 4.6.
                    

1. Enabling Extended Thinking

Extended thinking gives Claude extra reasoning budget for complex problems. When enabled, the model emits thinking blocks before the final text response. On current Claude 4 models, what you receive is typically summarized thinking by default, although some models default to omitted thinking unless you request summarized output explicitly.

import anthropic

client = anthropic.Anthropic()

# On Sonnet 4.6, adaptive thinking is the recommended mode.
# Manual budget mode still works here, but Anthropic marks it as deprecated.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,
        "display": "summarized"
    },
    messages=[{
        "role": "user",
        "content": "Analyze this distributed system design for potential race conditions and suggest fixes."
    }]
)

# The response contains thinking blocks followed by text blocks
for block in response.content:
    if block.type == "thinking":
        # On Claude 4 models this is usually a summarized view, not raw chain-of-thought.
        print(f"[THINKING] ({len(block.thinking)} chars)")
    elif block.type == "text":
        # Final response, written after the thinking above
        print(f"[RESPONSE] {block.text}")
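
A response can mix several block types, so it often helps to separate them before logging or display. The sketch below is one way to do that. `split_blocks` is a hypothetical helper, not part of the SDK, and it assumes each block exposes `.type` plus `.thinking` or `.text`, as in the loop above. Stand-in objects replace a real API response so the example runs offline.

```python
from types import SimpleNamespace

def split_blocks(content):
    """Separate thinking text from final answer text.

    Hypothetical helper: assumes each block has a .type attribute and,
    depending on the type, a .thinking or .text attribute.
    """
    thinking_parts, text_parts, other_types = [], [], []
    for block in content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            text_parts.append(block.text)
        else:
            # e.g. tool_use or redacted_thinking: keep only the type for logging
            other_types.append(block.type)
    return "\n".join(thinking_parts), "".join(text_parts), other_types

# Stand-in objects so the sketch runs without an API call
fake_content = [
    SimpleNamespace(type="thinking", thinking="Check lock ordering."),
    SimpleNamespace(type="text", text="Two race conditions found."),
]
thinking, answer, other = split_blocks(fake_content)
print(f"[THINKING] {len(thinking)} chars")
print(f"[RESPONSE] {answer}")
```

Keeping the thinking text separate makes it easy to log it for developers while showing only the answer to end users.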

2. Budget Tokens

budget_tokens controls the maximum number of tokens Claude can use for thinking. Higher budgets allow more thorough reasoning but increase latency and cost:

import anthropic

client = anthropic.Anthropic()

# Budget tokens guide:
# - 1024-2048:  Simple reasoning, quick decisions
# - 4096-8192:  Multi-step problems, code analysis
# - 10000-16000: Complex architecture, deep debugging
# - 32000+:     Research-level reasoning, novel problems

# Example: pick a manual budget by task complexity
# (this is ordinary lookup logic, not the "adaptive" thinking mode)
def choose_budget(task_type: str) -> int:
    """Select thinking budget based on task complexity."""
    budgets = {
        "simple_fix": 2048,      # Fix a typo, rename variable
        "bug_fix": 8192,         # Diagnose and fix a bug
        "architecture": 16000,   # Design a new system component
        "migration": 32000       # Plan a complex refactoring
    }
    return budgets.get(task_type, 8192)

# Important constraints:
# - budget_tokens MUST be less than max_tokens
# - max_tokens includes BOTH thinking + response tokens
# - If thinking plus response reach max_tokens, the response is cut off
#   (check for stop_reason == "max_tokens")
# - Minimum budget_tokens for manual mode: 1024

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=20000,  # Total budget: thinking + response
    thinking={
        "type": "enabled",
        "budget_tokens": 12000  # Up to 12K for thinking, rest for response
    },
    messages=[{"role": "user", "content": "Debug this race condition..."}]
)

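# Check token usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
# output_tokens includes both thinking and final response tokens.
# A separate thinking-token count may not be present in every SDK/API version,
# so read it defensively instead of assuming the attribute exists.
details = getattr(response.usage, "output_tokens_details", None)
print(f"Thinking tokens: {getattr(details, 'thinking_tokens', 'n/a')}")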
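
The constraints listed above can be checked before a request is sent, which gives a clearer error than a rejected API call. This is a minimal sketch: `validate_thinking_params` is a hypothetical helper, and the limits it enforces are the ones stated in this section (minimum 1024, budget strictly below `max_tokens`).

```python
MIN_THINKING_BUDGET = 1024  # minimum for manual mode, per the constraints above

def validate_thinking_params(max_tokens: int, budget_tokens: int) -> int:
    """Return the tokens left for the final response, or raise ValueError."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be less than max_tokens")
    # Worst case: thinking uses its whole budget, leaving this much for the answer
    return max_tokens - budget_tokens

headroom = validate_thinking_params(max_tokens=20000, budget_tokens=12000)
print(f"Guaranteed response headroom: {headroom} tokens")  # 8000
```

If the headroom looks too small for the answer you expect, raise `max_tokens` or lower the budget.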

Real-World Application

Debugging Production Outages

An SRE team uses extended thinking for their incident-response agent. When a production alert fires, the agent thinks through possible causes systematically (recent deployments, infrastructure changes, dependency failures) before recommending actions. The thinking trace serves as documentation for the post-mortem.

Tags: Extended Thinking, Incident Response

3. Streaming Thinking Blocks

import anthropic

client = anthropic.Anthropic()

# Stream extended thinking for real-time visibility
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Plan the refactoring of our auth module"}]
) as stream:
    current_block = None

    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                current_block = "thinking"
                print("\n--- THINKING ---")
            elif event.content_block.type == "text":
                current_block = "text"
                print("\n--- RESPONSE ---")

        elif event.type == "content_block_delta":
            if current_block == "thinking" and hasattr(event.delta, "thinking"):
                print(event.delta.thinking, end="", flush=True)
            elif current_block == "text" and hasattr(event.delta, "text"):
                print(event.delta.text, end="", flush=True)
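
If you also need the complete thinking and answer text after the stream ends (for logging, say), the deltas can be accumulated as they arrive. The sketch below dispatches on `event.delta.type` instead of tracking the current block. It assumes the delta type names `thinking_delta` and `text_delta`; `collect_stream` is a hypothetical helper, and the fake events stand in for a real stream so the example runs offline.

```python
from types import SimpleNamespace as NS

def collect_stream(events):
    """Accumulate streamed deltas into (thinking, text) strings."""
    thinking, text = [], []
    for event in events:
        if event.type != "content_block_delta":
            continue  # start/stop events carry no text
        if event.delta.type == "thinking_delta":
            thinking.append(event.delta.thinking)
        elif event.delta.type == "text_delta":
            text.append(event.delta.text)
        # Other delta types (for example signature deltas) are ignored here
    return "".join(thinking), "".join(text)

# Fake events shaped like the ones handled in the loop above
fake_events = [
    NS(type="content_block_start", content_block=NS(type="thinking")),
    NS(type="content_block_delta", delta=NS(type="thinking_delta", thinking="Map callers. ")),
    NS(type="content_block_delta", delta=NS(type="thinking_delta", thinking="Find cycles.")),
    NS(type="content_block_stop"),
    NS(type="content_block_start", content_block=NS(type="text")),
    NS(type="content_block_delta", delta=NS(type="text_delta", text="Step 1: ")),
    NS(type="content_block_delta", delta=NS(type="text_delta", text="extract token parsing.")),
]
full_thinking, full_text = collect_stream(fake_events)
print(full_thinking)
print(full_text)
```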

4. When to Use Extended Thinking

Use Case	Extended Thinking?	Reason
Simple file edits	No	Adds latency without benefit
Multi-file refactoring	Yes (8K budget)	Needs to reason about dependencies
Architecture design	Yes (16K budget)	Complex tradeoff analysis
Bug diagnosis	Yes (8K budget)	Hypothesis testing benefits from thinking
Classification/extraction	No	Pattern matching, not deep reasoning
Security review	Yes (12K budget)	Need thorough systematic analysis
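
The table can be encoded as a small lookup so that callers never hand-pick budgets. This is a sketch under stated assumptions: the use-case keys and the `thinking_kwargs` helper are hypothetical, "8K" is read as 8192 and "16K" as 16000 to match `choose_budget` in Section 2, and unknown use cases default to no thinking.

```python
# None means "do not enable extended thinking" for that use case
THINKING_BUDGETS = {
    "simple_file_edit": None,
    "multi_file_refactoring": 8192,
    "architecture_design": 16000,
    "bug_diagnosis": 8192,
    "classification_extraction": None,
    "security_review": 12000,
}

def thinking_kwargs(use_case: str) -> dict:
    """Return extra keyword arguments for messages.create(), possibly empty."""
    budget = THINKING_BUDGETS.get(use_case)
    if budget is None:
        return {}
    return {"thinking": {"type": "enabled", "budget_tokens": budget}}

print(thinking_kwargs("security_review"))
print(thinking_kwargs("simple_file_edit"))
```

The result can be spread into a request, for example `client.messages.create(model=..., max_tokens=..., messages=..., **thinking_kwargs("bug_diagnosis"))`, as long as `max_tokens` stays above the chosen budget.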

                        
                        CCA Tasks 5.3 & 5.4: The exam tests: (1) budget_tokens must be less than max_tokens, (2) thinking blocks are separate content blocks (type: “thinking”), (3) thinking content is NOT shown to end users in production, (4) extended thinking adds latency proportional to budget, (5) minimum budget is 1024 tokens, (6) in multi-turn with thinking, you must include thinking blocks from the previous response when continuing.
                    

                        
Critical Multi-Turn Rule: When using extended thinking in agentic (tool-use) loops, you MUST pass the assistant’s thinking blocks back unmodified when you send the tool results in the next user message. Stripping or editing thinking blocks mid-loop will cause errors. Preserve the full response (thinking + text + tool_use) in the conversation history.
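
A sketch of what preserving the full response looks like in a tool loop. `continue_after_tool` and the tool name `list_deploys` are hypothetical, and plain dicts with placeholder values stand in for a real response. With the SDK you would typically pass `response.content` itself as the assistant content.

```python
def continue_after_tool(messages, assistant_content, tool_outputs):
    """Return a new history that keeps the assistant turn intact.

    assistant_content: the full content list from the previous response
                       (thinking, text, and tool_use blocks, unmodified).
    tool_outputs:      hypothetical dict mapping tool_use id -> result string.
    """
    tool_results = [
        {"type": "tool_result", "tool_use_id": tool_id, "content": output}
        for tool_id, output in tool_outputs.items()
    ]
    return messages + [
        {"role": "assistant", "content": assistant_content},  # nothing stripped
        {"role": "user", "content": tool_results},
    ]

# Placeholder blocks standing in for a real response
assistant_content = [
    {"type": "thinking", "thinking": "Need recent deploys.", "signature": "sig-placeholder"},
    {"type": "tool_use", "id": "toolu_example", "name": "list_deploys", "input": {}},
]
history = [{"role": "user", "content": "Why is checkout failing?"}]
history = continue_after_tool(history, assistant_content, {"toolu_example": "deploy 412 at 09:14"})
print([message["role"] for message in history])
```

The key point is that the assistant turn is appended as received; filtering it down to just the `tool_use` block is the mistake the rule above warns about.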
                    

                        
                        Try It Yourself: Compare Claude’s performance on 3 complex tasks with and without extended thinking: (1) a multi-step math word problem, (2) finding a bug in a 50-line Python function, (3) analyzing the pros/cons of 3 architectural approaches. Measure accuracy and note how the thinking traces reveal the reasoning process.
                    

5. Reducing Latency (CCA 3.3)

Extended thinking adds reasoning power but increases latency. In production, you need to balance quality against speed. This section covers techniques for making Claude faster — from prompt caching to model selection to parallel requests.

Analogy: Reducing latency is like optimizing a restaurant kitchen. You can pre-prep ingredients (prompt caching), use the right chef for each dish (model selection), run multiple orders simultaneously (parallel requests), and choose between dine-in quality and fast food speed (batch vs real-time).

5.1 Prompt Caching

import anthropic

client = anthropic.Anthropic()

# Prompt caching: reuse processed prompt prefixes across requests
# Cache reads cost 10% of the base input-token price and often reduce TTFT

# WITHOUT caching: every request re-processes the entire system prompt
# WITH caching: the first request writes the cache; later requests reuse the cached prefix

# Enable caching with cache_control on content blocks:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": "You are a support agent. [2000 words of instructions, knowledge base, policies...]",
            "cache_control": {"type": "ephemeral"}  # Cache this block!
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}]
)

# First request: cache miss (normal speed, caches the system prompt)
# Subsequent requests: cache hit (faster, cheaper — system prompt already processed)

# Check if cache was used (usage is an object, not a dict, so use getattr, not .get):
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {getattr(response.usage, 'cache_creation_input_tokens', 0) or 0}")
print(f"Cache read tokens: {getattr(response.usage, 'cache_read_input_tokens', 0) or 0}")

# Default cache lifetime is 5 minutes and refreshes on reuse
# Anthropic also offers a 1-hour TTL for longer gaps between requests
# Best for: large system prompts, few-shot examples, document context
# Not useful for: unique one-off requests

print("\nCache sweet spot: reusable prompt prefixes above the model's cache minimum")
print("Cache reads are much cheaper than uncached input and usually reduce TTFT")
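
To see what the 10% cache-read price means in practice, the arithmetic can be written out. This is a rough sketch with assumptions: the price of $3 per million input tokens is a placeholder, the 1.25x cache-write multiplier is an assumption not stated above, and every request after the first is assumed to hit the cache within its lifetime. Only the shared prefix is counted, not the per-request user message.

```python
def prefix_input_cost(prefix_tokens, requests, price_per_mtok,
                      write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in dollars for a shared prompt prefix.

    write_multiplier is an assumed surcharge for the request that creates
    the cache; read_multiplier is the 10% cache-read price noted above.
    """
    per_request = prefix_tokens / 1_000_000 * price_per_mtok
    uncached = per_request * requests
    # First request writes the cache; the remaining requests read from it
    cached = per_request * write_multiplier + per_request * read_multiplier * (requests - 1)
    return uncached, cached

# Placeholder numbers: 3,000-token system prompt, 100 requests, $3 per million tokens
uncached, cached = prefix_input_cost(3000, 100, 3.0)
print(f"Without caching: ${uncached:.4f}")
print(f"With caching:    ${cached:.4f}")
```

The gap widens with the number of requests that reuse the prefix, which is why one-off requests gain nothing from caching.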

5.2 Model Selection for Latency

import anthropic

client = anthropic.Anthropic()

# Different models = different speed/quality tradeoffs
# Choose based on task complexity:

# FASTEST / CHEAPEST (Haiku) — Simple tasks: classification, routing, extraction
# Exact latency varies by workload and platform; treat these as rough examples
haiku_response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=100,
    temperature=0,
    messages=[{"role": "user", "content": "Classify: 'I was charged twice' → billing/technical/sales"}]
)

# BALANCED (Sonnet) — Most tasks: coding, analysis, conversation
# Often the default choice for quality/cost balance
sonnet_response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1000,
    messages=[{"role": "user", "content": "Review this function for bugs: def auth(token): ..."}]
)

# QUALITY (Opus) — Complex reasoning: architecture, research, novel problems
# Highest capability, but typically slower and more expensive
# Use with extended thinking for maximum reasoning power

# Decision framework — see diagram below:

Model Selection by Task Complexity

flowchart LR
    TASK{"Task Type?"} -->|"Classification, Routing, Extraction"| H["Haiku: Fastest · $"]
    TASK -->|"Coding, Analysis, Conversation"| S["Sonnet: Balanced · $$"]
    TASK -->|"Architecture, Research, Novel Problems"| O["Opus + Thinking: Highest Capability · $$$"]

    style H fill:#3B9797,color:#fff
    style S fill:#16476A,color:#fff
    style O fill:#132440,color:#fff

print("Haiku: lowest latency/cost for simple structured tasks")
print("Sonnet: general default for coding and analysis")
print("Opus: best fit for the highest-complexity reasoning")
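
The decision framework can be sketched as a small router. The task-type names and the `pick_model` helper are hypothetical, and the Opus model ID is an assumed placeholder (only the Haiku and Sonnet IDs appear in the examples above). Unknown task types fall back to Sonnet as the general default.

```python
MODEL_BY_TIER = {
    "simple": "claude-haiku-4-5",
    "general": "claude-sonnet-4-6",
    "complex": "claude-opus-4-6",  # assumed ID; check the current model list
}

TIER_BY_TASK = {
    "classification": "simple",
    "routing": "simple",
    "extraction": "simple",
    "coding": "general",
    "analysis": "general",
    "conversation": "general",
    "architecture": "complex",
    "research": "complex",
    "novel_problem": "complex",
}

def pick_model(task_type: str) -> str:
    """Map a task type to a model ID, defaulting to the balanced tier."""
    return MODEL_BY_TIER[TIER_BY_TASK.get(task_type, "general")]

for task in ("classification", "coding", "architecture", "something_else"):
    print(f"{task}: {pick_model(task)}")
```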

5.3 Parallel Requests & Batch Tradeoffs

import anthropic
import asyncio

# Use the async client: the sync client would block the event loop,
# so the requests would run one after another despite asyncio.gather
client = anthropic.AsyncAnthropic()

# Parallel requests: send multiple independent requests simultaneously
# instead of waiting for each one sequentially

async def process_batch_parallel(items: list, system: str) -> list:
    """Process multiple items in parallel (independent tasks only!)."""

    async def process_one(item):
        response = await client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            temperature=0,
            system=system,
            messages=[{"role": "user", "content": item}]
        )
        return response.content[0].text

    # Run all in parallel; results come back in the same order as items
    results = await asyncio.gather(*[process_one(item) for item in items])
    return results

# Usage: results = asyncio.run(process_batch_parallel(tickets, "Classify the ticket."))

# Batch vs Real-time tradeoffs:

# REAL-TIME: user is waiting for response
# - Streaming (show tokens as they arrive)
# - Prompt caching (reduce TTFT)
# - Haiku for simple tasks
# - Timeout: 5-30 seconds acceptable

# BATCH: no user waiting (background processing)
# - Message Batches API (50% cost savings!)
# - No streaming needed
# - 24-hour processing window
# - Use for: bulk classification, nightly reports, dataset processing

# STREAMING: best UX for real-time (user sees response immediately)
# with client.messages.stream(...) as stream:
#     for text in stream.text_stream:
#         print(text, end="", flush=True)

# Combined strategy for production:
# 1. Route to Haiku for simple classification (fast)
# 2. Stream Sonnet for conversational responses (good UX)
# 3. Batch API for bulk processing (cheap)
# 4. Cache system prompts for repeated patterns (efficient)

print("Latency reduction stack:")
print("  1. Right model (Haiku for simple, Sonnet for complex)")
print("  2. Prompt caching (reuse system prompts)")
print("  3. Streaming (user sees the first tokens as soon as they are generated)")
print("  4. Parallel requests (independent tasks simultaneously)")
print("  5. Batch API (50% savings for non-urgent work)")
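
Launching every request at once can run into rate limits on large inputs, so parallel work is often capped. The sketch below shows one common way to do that with `asyncio.Semaphore`. `gather_limited` is a hypothetical helper, and `fake_classify` stands in for a real async API call so the example runs offline.

```python
import asyncio

async def gather_limited(worker, items, limit=5):
    """Run worker(item) for every item, at most `limit` at a time.

    Results are returned in the same order as items.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:  # wait here if `limit` calls are already running
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))

async def fake_classify(item: str) -> str:
    """Stand-in for an async API call."""
    await asyncio.sleep(0.01)
    return item.upper()

labels = asyncio.run(
    gather_limited(fake_classify, ["billing", "sales", "technical"], limit=2)
)
print(labels)
```

As with any parallel approach, this only applies when the items are independent of each other.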

                        
CCA Exam Pattern (3.3): Questions test: (1) Prompt caching uses cache_control: {type: "ephemeral"} on content blocks. (2) Haiku for classification/routing, Sonnet for general tasks, Opus for complex reasoning. (3) Streaming reduces perceived latency (TTFT matters more than total time). (4) Batch API gives 50% cost savings with a 24-hour processing window. (5) Parallel requests require independent tasks (don’t parallelize dependent operations).
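
For the batch path in point (4), each item becomes one entry with its own ID so results can be matched up later. This is a minimal sketch of building that list; `build_batch_requests` and the `item-N` ID scheme are hypothetical, and the entry shape (`custom_id` plus `params`) is an assumption based on how the Message Batches API is usually called. Nothing is sent here.

```python
def build_batch_requests(items, system, model="claude-haiku-4-5", max_tokens=200):
    """Build one batch entry per item, each with a unique custom_id."""
    return [
        {
            "custom_id": f"item-{index}",  # used to match results to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": system,
                "messages": [{"role": "user", "content": item}],
            },
        }
        for index, item in enumerate(items)
    ]

batch = build_batch_requests(
    ["I was charged twice", "App crashes on login"],
    system="Classify as billing/technical/sales.",
)
print(len(batch), batch[0]["custom_id"])
# With a client, the list would be submitted along the lines of:
#   client.messages.batches.create(requests=batch)
```

Because batch results are not guaranteed to come back in input order, the `custom_id` is what ties each result to its original item.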