budget_tokens mode, but Anthropic recommends adaptive thinking on newer/current Opus lines and marks manual mode as deprecated on Sonnet 4.6 and Opus 4.6.
1. Enabling Extended Thinking
Extended thinking gives Claude extra reasoning budget for complex problems. When enabled, the model emits thinking blocks before the final text response. On current Claude 4 models, what you receive is typically summarized thinking by default, although some models default to omitted thinking unless you request summarized output explicitly.
import anthropic
client = anthropic.Anthropic()
# On Sonnet 4.6, adaptive thinking is the recommended mode.
# Manual budget mode still works here, but Anthropic marks it as deprecated.
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=16000,
thinking={
"type": "enabled",
"budget_tokens": 10000,
"display": "summarized"
},
messages=[{
"role": "user",
"content": "Analyze this distributed system design for potential race conditions and suggest fixes."
}]
)
# Response contains thinking and text blocks
for block in response.content:
if block.type == "thinking":
print(f"[THINKING] ({len(block.thinking)} chars)")
# On Claude 4 models this is usually a summarized view, not raw chain-of-thought.
elif block.type == "text":
print(f"[RESPONSE] {block.text}")
# Final response: polished output based on thinking
2. Budget Tokens
budget_tokens controls the maximum number of tokens Claude can use for thinking. Higher budgets allow more thorough reasoning but increase latency and cost:
import anthropic
client = anthropic.Anthropic()
# Budget tokens guide:
# - 1024-2048: Simple reasoning, quick decisions
# - 4096-8192: Multi-step problems, code analysis
# - 10000-16000: Complex architecture, deep debugging
# - 32000+: Research-level reasoning, novel problems
# Example: Adaptive budget based on task complexity
def choose_budget(task_type: str) -> int:
"""Select thinking budget based on task complexity."""
budgets = {
"simple_fix": 2048, # Fix a typo, rename variable
"bug_fix": 8192, # Diagnose and fix a bug
"architecture": 16000, # Design a new system component
"migration": 32000 # Plan a complex refactoring
}
return budgets.get(task_type, 8192)
# Important constraints:
# - budget_tokens MUST be less than max_tokens
# - max_tokens includes BOTH thinking + response tokens
# - If thinking uses all budget, response may be truncated
# - Minimum budget_tokens for manual mode: 1024
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=20000, # Total budget: thinking + response
thinking={
"type": "enabled",
"budget_tokens": 12000 # Up to 12K for thinking, rest for response
},
messages=[{"role": "user", "content": "Debug this race condition..."}]
)
# Check token usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Thinking tokens: {response.usage.output_tokens_details.thinking_tokens}")
# output_tokens includes both thinking and final response tokens
Debugging Production Outages
An SRE team uses extended thinking for their incident-response agent. When a production alert fires, the agent thinks through possible causes systematically (recent deployments, infrastructure changes, dependency failures) before recommending actions. The thinking trace serves as documentation for the post-mortem.
3. Streaming Thinking Blocks
import anthropic
client = anthropic.Anthropic()
# Stream extended thinking for real-time visibility
with client.messages.stream(
model="claude-sonnet-4-6",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[{"role": "user", "content": "Plan the refactoring of our auth module"}]
) as stream:
current_block = None
for event in stream:
if event.type == "content_block_start":
if event.content_block.type == "thinking":
current_block = "thinking"
print("\n--- THINKING ---")
elif event.content_block.type == "text":
current_block = "text"
print("\n--- RESPONSE ---")
elif event.type == "content_block_delta":
if current_block == "thinking" and hasattr(event.delta, "thinking"):
print(event.delta.thinking, end="", flush=True)
elif current_block == "text" and hasattr(event.delta, "text"):
print(event.delta.text, end="", flush=True)
4. When to Use Extended Thinking
| Use Case | Extended Thinking? | Reason |
|---|---|---|
| Simple file edits | No | Adds latency without benefit |
| Multi-file refactoring | Yes (8K budget) | Needs to reason about dependencies |
| Architecture design | Yes (16K budget) | Complex tradeoff analysis |
| Bug diagnosis | Yes (8K budget) | Hypothesis testing benefits from thinking |
| Classification/extraction | No | Pattern matching, not deep reasoning |
| Security review | Yes (12K budget) | Need thorough systematic analysis |
budget_tokens must be less than max_tokens, (2) thinking blocks are separate content blocks (type: “thinking”), (3) thinking content is NOT shown to end users in production, (4) extended thinking adds latency proportional to budget, (5) minimum budget is 1024 tokens, (6) in multi-turn with thinking, you must include thinking blocks from the previous response when continuing.
5. Reducing Latency (CCA 3.3)
Extended thinking adds reasoning power but increases latency. In production, you need to balance quality against speed. This section covers techniques for making Claude faster — from prompt caching to model selection to parallel requests.
Analogy: Reducing latency is like optimizing a restaurant kitchen. You can pre-prep ingredients (prompt caching), use the right chef for each dish (model selection), run multiple orders simultaneously (parallel requests), and choose between dine-in quality and fast food speed (batch vs real-time).
5.1 Prompt Caching
import anthropic
client = anthropic.Anthropic()
# Prompt caching: reuse processed prompt prefixes across requests
# Cache reads cost 10% of the base input-token price and often reduce TTFT
# WITHOUT caching: every request re-processes the entire system prompt
# WITH caching: first request caches, subsequent requests skip processing
# Enable caching with cache_control on content blocks:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=500,
system=[
{
"type": "text",
"text": "You are a support agent. [2000 words of instructions, knowledge base, policies...]",
"cache_control": {"type": "ephemeral"} # Cache this block!
}
],
messages=[{"role": "user", "content": "How do I reset my password?"}]
)
# First request: cache miss (normal speed, caches the system prompt)
# Subsequent requests: cache hit (faster, cheaper — system prompt already processed)
# Check if cache was used:
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {response.usage.get('cache_creation_input_tokens', 0)}")
print(f"Cache read tokens: {response.usage.get('cache_read_input_tokens', 0)}")
# Default cache lifetime is 5 minutes and refreshes on reuse
# Anthropic also offers a 1-hour TTL for longer gaps between requests
# Best for: large system prompts, few-shot examples, document context
# Not useful for: unique one-off requests
print("\nCache sweet spot: reusable prompt prefixes above the model's cache minimum")
print("Cache reads are much cheaper than uncached input and usually reduce TTFT")
5.2 Model Selection for Latency
import anthropic
client = anthropic.Anthropic()
# Different models = different speed/quality tradeoffs
# Choose based on task complexity:
# FASTEST / CHEAPEST (Haiku) — Simple tasks: classification, routing, extraction
# Exact latency varies by workload and platform; treat these as rough examples
haiku_response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=100,
temperature=0,
messages=[{"role": "user", "content": "Classify: 'I was charged twice' → billing/technical/sales"}]
)
# BALANCED (Sonnet) — Most tasks: coding, analysis, conversation
# Often the default choice for quality/cost balance
sonnet_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1000,
messages=[{"role": "user", "content": "Review this function for bugs: def auth(token): ..."}]
)
# QUALITY (Opus) — Complex reasoning: architecture, research, novel problems
# Highest capability, but typically slower and more expensive
# Use with extended thinking for maximum reasoning power
# Decision framework — see diagram below:
flowchart LR
TASK{"Task Type?"} -->|"Classification
Routing
Extraction"| H["Haiku
Fastest · $"]
TASK -->|"Coding
Analysis
Conversation"| S["Sonnet
Balanced · $$"]
TASK -->|"Architecture
Research
Novel Problems"| O["Opus + Thinking
Highest Capability · $$$"]
style H fill:#3B9797,color:#fff
style S fill:#16476A,color:#fff
style O fill:#132440,color:#fff
print("Haiku: lowest latency/cost for simple structured tasks")
print("Sonnet: general default for coding and analysis")
print("Opus: best fit for the highest-complexity reasoning")
5.3 Parallel Requests & Batch Tradeoffs
import anthropic
import asyncio
import time
client = anthropic.Anthropic()
# Parallel requests: send multiple independent requests simultaneously
# instead of waiting for each one sequentially
async def process_batch_parallel(items: list, system: str) -> list:
"""Process multiple items in parallel (independent tasks only!)."""
async def process_one(item):
# Note: Use the async client for true parallelism
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=200,
temperature=0,
system=system,
messages=[{"role": "user", "content": item}]
)
return response.content[0].text
# Run all in parallel
results = await asyncio.gather(*[process_one(item) for item in items])
return results
# Batch vs Real-time tradeoffs:
# REAL-TIME: user is waiting for response
# - Streaming (show tokens as they arrive)
# - Prompt caching (reduce TTFT)
# - Haiku for simple tasks
# - Timeout: 5-30 seconds acceptable
# BATCH: no user waiting (background processing)
# - Message Batches API (50% cost savings!)
# - No streaming needed
# - 24-hour processing window
# - Use for: bulk classification, nightly reports, dataset processing
# STREAMING: best UX for real-time (user sees response immediately)
# with client.messages.stream(...) as stream:
# for text in stream.text_stream:
# print(text, end="", flush=True)
# Combined strategy for production:
# 1. Route to Haiku for simple classification (fast)
# 2. Stream Sonnet for conversational responses (good UX)
# 3. Batch API for bulk processing (cheap)
# 4. Cache system prompts for repeated patterns (efficient)
print("Latency reduction stack:")
print(" 1. Right model (Haiku for simple, Sonnet for complex)")
print(" 2. Prompt caching (reuse system prompts)")
print(" 3. Streaming (user sees first token immediately)")
print(" 4. Parallel requests (independent tasks simultaneously)")
print(" 5. Batch API (50% savings for non-urgent work)")
cache_control: {type: "ephemeral"} on content blocks. (2) Haiku for classification/routing, Sonnet for general tasks, Opus for complex reasoning. (3) Streaming reduces perceived latency (TTFT matters more than total time). (4) Batch API gives 50% cost savings with 24-hour SLA. (5) Parallel requests require independent tasks (don’t parallelize dependent operations).