1. Reasoning Model Overview
Reasoning models use internal chain-of-thought tokens before generating their final response. These tokens are consumed from your context window and billed as output tokens, but they dramatically improve performance on tasks requiring analysis, math, coding, planning, and multi-step logic. The model essentially “thinks out loud” internally before committing to an answer.
| Model | Best For | Context Window | Reasoning Tokens | Relative Cost |
|---|---|---|---|---|
| GPT-5.5-pro | Highest intelligence — hardest math, science, coding | 200K | Up to 128K | $$$$ |
| GPT-5.5 | Recommended default — strong reasoning at good cost | 200K | Up to 64K | $$$ |
| GPT-5.4 | Cost-effective reasoning for routine tasks | 200K | Up to 32K | $$ |
| GPT-5.4-mini | Lightweight reasoning — fastest in the family | 128K | Up to 16K | $ |
| GPT-5 | Previous generation — still capable | 128K | Up to 32K | $$ |
The key distinction is that reasoning tokens are generated before the visible output. You pay for them as output tokens, and they occupy context window space. A model with a 200K context window and 64K reasoning tokens has effectively 136K tokens available for your input and the visible response combined.
Use the reasoning_effort parameter to control how many tokens the model spends thinking: lower effort means fewer reasoning tokens, faster responses, and lower cost.
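The budget arithmetic above (a 200K window minus up to 64K reasoning tokens leaves 136K) can be sketched as a small helper. The function name `available_tokens` is hypothetical and is not part of any SDK; the figures come from the model table.

```python
def available_tokens(context_window: int, reasoning_budget: int) -> int:
    """Tokens left for input plus visible output after reserving reasoning space."""
    if reasoning_budget > context_window:
        raise ValueError("reasoning budget cannot exceed the context window")
    return context_window - reasoning_budget


# Figures from the table: 200K window, up to 64K reasoning tokens.
print(available_tokens(200_000, 64_000))  # 136000
```

This is a worst-case reservation: a request that spends fewer reasoning tokens leaves more room for input and visible output.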
flowchart LR
A[User Input] --> B[Model Receives Request]
B --> C{Reasoning Effort?}
C -->|none| D[Direct Response]
C -->|low/medium| E[Brief Internal Reasoning]
C -->|high/xhigh| F[Extended Internal Reasoning]
E --> G[Generate Reasoning Tokens]
F --> G
G --> H[Reasoning Summary Created]
H --> I[Visible Output Generated]
D --> J[Response Returned]
I --> J
J --> K[Usage: input + reasoning + output tokens]
2. Reasoning Effort Parameter
The reasoning parameter controls how much internal thinking the model performs. Lower effort levels produce faster, cheaper responses suitable for simple tasks, while higher levels invest more reasoning tokens for thorough analysis on complex problems. Models reason adaptively — even at “high” effort, simple questions won’t consume excessive tokens.
| Effort Level | Behavior | Best For |
|---|---|---|
| none | No reasoning tokens — behaves like a standard model | Simple retrieval, formatting, classification |
| minimal | Bare-minimum reasoning — near-instant responses | Quick lookups, trivial transformations |
| low | Brief chain-of-thought — fast with light analysis | Routine tasks, simple Q&A, summaries |
| medium | Balanced reasoning — good quality at moderate cost | General-purpose tasks, coding, writing |
| high | Thorough reasoning — detailed analysis and verification | Complex problems, math, multi-step logic |
| xhigh | Maximum reasoning — exhaustive exploration of possibilities | Hardest problems, research, competition math |
from openai import OpenAI
client = OpenAI()
# Basic reasoning call with effort parameter
response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "high"},
input="Prove that the square root of 2 is irrational.",
)
print(f"Response: {response.output_text}")
print(f"\nToken usage:")
print(f" Input tokens: {response.usage.input_tokens}")
print(f" Output tokens: {response.usage.output_tokens}")
print(f" Reasoning tokens: {response.usage.output_tokens_details.reasoning_tokens}")
Comparing effort levels on the same problem demonstrates the cost/quality tradeoff concretely. A quick heuristic: start with medium for new tasks, then dial up if quality is insufficient or dial down if the task is simpler than expected.
from openai import OpenAI
import time
client = OpenAI()
question = "A farmer has 17 sheep. All but 9 die. How many are left?"
# Compare low, medium, and high effort on the same question
for effort in ["low", "medium", "high"]:
    start = time.time()
    response = client.responses.create(
        model="gpt-5.5",
        reasoning={"effort": effort},
        input=question,
    )
    elapsed = time.time() - start
    reasoning_tokens = response.usage.output_tokens_details.reasoning_tokens
    print(f"\nEffort: {effort}")
    print(f" Answer: {response.output_text[:100]}")
    print(f" Reasoning tokens: {reasoning_tokens}")
    print(f" Total output tokens: {response.usage.output_tokens}")
    print(f" Latency: {elapsed:.2f}s")
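The "start with medium, then dial up or down" heuristic can be expressed as a one-step adjustment over the ordered effort levels. This is only a sketch: `adjust_effort` and its `quality_ok` / `overkill` flags are hypothetical names, and how you judge quality is left to your own evaluation.

```python
# Effort levels in ascending order, as listed in the table above.
EFFORT_LEVELS = ["none", "minimal", "low", "medium", "high", "xhigh"]


def adjust_effort(current: str, quality_ok: bool, overkill: bool = False) -> str:
    """Move one level up if quality is insufficient, one level down if the
    task turned out simpler than expected, otherwise stay put."""
    idx = EFFORT_LEVELS.index(current)
    if not quality_ok:
        idx = min(idx + 1, len(EFFORT_LEVELS) - 1)  # clamp at xhigh
    elif overkill:
        idx = max(idx - 1, 0)  # clamp at none
    return EFFORT_LEVELS[idx]


print(adjust_effort("medium", quality_ok=False))                # high
print(adjust_effort("medium", quality_ok=True, overkill=True))  # low
```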
3. Reasoning Items in Output
When a reasoning model generates a response, it returns reasoning items in the output alongside the visible text. These items represent the model’s internal thought process and must be preserved when building multi-turn conversations. If you strip reasoning items from the conversation history, the model loses continuity and quality degrades significantly.
from openai import OpenAI
client = OpenAI()
# Turn 1: Ask a complex question
response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "high"},
input="Analyze the time complexity of merge sort and explain why it's O(n log n).",
)
print(f"Answer: {response.output_text[:200]}...")
print(f"\nOutput items ({len(response.output)} total):")
for item in response.output:
    print(f" Type: {item.type}", end="")
    if item.type == "reasoning":
        print(f" (id: {item.id})")
    elif item.type == "message":
        print(f" (text length: {len(item.content[0].text)})")
    else:
        print()
# Turn 2: Follow up — MUST include previous output (with reasoning items)
follow_up = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "high"},
input=[
{"role": "user", "content": "Analyze the time complexity of merge sort and explain why it's O(n log n)."},
*response.output, # Preserves reasoning items!
{"role": "user", "content": "Now compare this with quicksort's average and worst case."},
],
)
print(f"\nFollow-up answer: {follow_up.output_text[:200]}...")
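The "carry the whole previous output forward" step can be isolated in a small helper so that reasoning items are never dropped by accident. This is a sketch: `build_next_input` is a hypothetical name, and plain dicts stand in for the SDK's output items so the example runs without an API call.

```python
def build_next_input(history: list, output_items: list, next_user_message: str) -> list:
    """Return a new input list: prior history, the full previous output
    (reasoning items included), then the new user turn."""
    return [
        *history,
        *output_items,  # keep reasoning items alongside the assistant message
        {"role": "user", "content": next_user_message},
    ]


history = [{"role": "user", "content": "Explain merge sort's complexity."}]
# Stand-ins for response.output; real items are SDK objects, not dicts.
fake_output = [
    {"type": "reasoning", "id": "rs_example"},
    {"type": "message", "role": "assistant", "content": "O(n log n) because..."},
]
next_input = build_next_input(history, fake_output, "Compare with quicksort.")
print([item.get("type", item.get("role")) for item in next_input])
# ['user', 'reasoning', 'message', 'user']
```

Returning a new list instead of mutating `history` keeps each turn's input easy to log and replay.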
The summary option of the reasoning parameter (set to "auto" in the example below) provides visibility into what the model was thinking without exposing raw reasoning tokens. This is useful for debugging, logging, and building user-facing “show your work” experiences.
from openai import OpenAI
client = OpenAI()
# Request reasoning summary for visibility into the thought process
response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "high", "summary": "auto"},
input="What is the probability of getting exactly 3 heads in 5 fair coin flips?",
)
print(f"Final answer: {response.output_text}")
# Extract reasoning summary from output items
for item in response.output:
    if item.type == "reasoning":
        print(f"\nReasoning summary:")
        for summary in item.summary:
            print(f" {summary.text}")
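For logging, it can be convenient to gather summary text into a list rather than printing it inline. The helper below is a sketch under stated assumptions: `collect_reasoning_summaries` is a hypothetical name, and `SimpleNamespace` objects mimic the `type`, `summary`, and `text` attributes used in the example above.

```python
from types import SimpleNamespace


def collect_reasoning_summaries(output_items) -> list[str]:
    """Return the text of every summary part found on reasoning items."""
    texts = []
    for item in output_items:
        if getattr(item, "type", None) != "reasoning":
            continue
        # A reasoning item may carry no summary parts at all.
        for part in getattr(item, "summary", None) or []:
            texts.append(part.text)
    return texts


# Stand-ins for response.output (hypothetical data, no API call).
fake_output = [
    SimpleNamespace(type="reasoning", summary=[SimpleNamespace(text="Count the C(5,3) outcomes.")]),
    SimpleNamespace(type="message"),
    SimpleNamespace(type="reasoning", summary=[]),
]
print(collect_reasoning_summaries(fake_output))
```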
Legal Contract Analysis Pipeline
A legal-tech startup uses reasoning models with effort: "high" for initial contract review, extracting risks and obligations with chain-of-thought analysis. They preserve reasoning items across a 4-turn conversation: (1) identify key clauses, (2) analyze risk exposure, (3) compare to standard terms, (4) generate recommendations. The reasoning continuity across turns means the model’s final recommendations reference specific analyses from earlier turns without repetition.
4. Tool Calling with Reasoning Models
Reasoning models can call tools just like standard models, but with an important distinction: when you pass tool results back, you must also include the reasoning items from the previous turn. The model needs its prior reasoning context to properly interpret tool outputs and decide whether to call more tools or produce a final response.
reasoning is set to "none". If you need tool calling without reasoning, either use the Responses API or set effort to at least "minimal".
from openai import OpenAI
import json
client = OpenAI()
# Define tools
tools = [
{
"type": "function",
"name": "get_stock_price",
"description": "Get the current stock price for a given ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"ticker": {"type": "string", "description": "Stock ticker symbol (e.g., AAPL)"},
},
"required": ["ticker"],
},
},
{
"type": "function",
"name": "get_company_financials",
"description": "Get key financial metrics for a company.",
"parameters": {
"type": "object",
"properties": {
"ticker": {"type": "string", "description": "Stock ticker symbol"},
"metrics": {
"type": "array",
"items": {"type": "string"},
"description": "Metrics to retrieve: pe_ratio, market_cap, revenue, profit_margin",
},
},
"required": ["ticker", "metrics"],
},
},
]
# Step 1: Initial request — model reasons about what data it needs
response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "medium"},
tools=tools,
input="Should I invest in NVDA? Compare its valuation to the semiconductor sector average.",
)
# Step 2: Process tool calls and pass results back WITH reasoning items
tool_results = []
for item in response.output:
    if item.type == "function_call":
        # Simulate tool execution
        if item.name == "get_stock_price":
            result = json.dumps({"ticker": "NVDA", "price": 892.50, "change": "+2.3%"})
        elif item.name == "get_company_financials":
            result = json.dumps({"ticker": "NVDA", "pe_ratio": 45.2, "market_cap": "2.2T", "revenue": "96B", "profit_margin": 0.57})
        else:
            result = json.dumps({"error": "Unknown function"})
        tool_results.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": result,
        })
# Step 3: Send tool results back — include ALL output items (reasoning + function_calls + results)
final_response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "medium"},
tools=tools,
input=[
{"role": "user", "content": "Should I invest in NVDA? Compare its valuation to the semiconductor sector average."},
*response.output, # Includes reasoning items + function_call items
*tool_results, # Tool outputs
],
)
print(f"Investment analysis:\n{final_response.output_text}")
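The if/elif chain in Step 2 can be generalized into a dispatch table. The sketch below assumes each function_call item exposes `name`, `call_id`, and a JSON-encoded `arguments` string; `run_tool_calls`, `HANDLERS`, and the placeholder `get_stock_price` implementation are hypothetical names for illustration, and the data is made up.

```python
import json
from types import SimpleNamespace


def get_stock_price(ticker: str) -> dict:
    # Placeholder data; a real handler would query a market data source.
    return {"ticker": ticker, "price": 892.50}


HANDLERS = {"get_stock_price": get_stock_price}


def run_tool_calls(output_items, handlers) -> list[dict]:
    """Execute every function_call item and build function_call_output entries."""
    results = []
    for item in output_items:
        if getattr(item, "type", None) != "function_call":
            continue  # reasoning and message items pass through untouched
        handler = handlers.get(item.name)
        if handler is None:
            payload = {"error": f"Unknown function: {item.name}"}
        else:
            payload = handler(**json.loads(item.arguments))
        results.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(payload),
        })
    return results


# Stand-ins for response.output (no API call).
fake_output = [
    SimpleNamespace(type="reasoning", id="rs_example"),
    SimpleNamespace(type="function_call", name="get_stock_price",
                    call_id="call_1", arguments='{"ticker": "NVDA"}'),
]
tool_results = run_tool_calls(fake_output, HANDLERS)
print(tool_results[0]["call_id"])  # call_1
```

The results list slots into the Step 3 input exactly as `tool_results` does above, after the unpacked previous output.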
5. Multi-Step Reasoning Patterns
Multi-step reasoning leverages the model’s ability to decompose complex problems internally. Rather than manually splitting a problem into sub-tasks (which you’d do with standard models), you can present the full problem and let the reasoning model’s internal chain-of-thought handle decomposition, verification, and synthesis.
Pattern 1: Problem Decomposition
For problems that benefit from explicit sub-task structure, combine reasoning effort with structured instructions that guide the decomposition.
from openai import OpenAI
client = OpenAI()
# Multi-step decomposition: model reasons through sub-problems internally
response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "high"},
instructions="""You are an expert systems analyst. When analyzing complex problems:
1. Identify all sub-problems and dependencies
2. Solve each sub-problem in order
3. Verify your solution is consistent across sub-problems
4. Present the final integrated answer with confidence level""",
input="""Design a database schema for an e-commerce platform that handles:
- Multi-vendor marketplace (vendors have products, ratings, tiers)
- Real-time inventory across 5 warehouses
- Dynamic pricing (time-of-day, demand, competitor matching)
- Customer loyalty program with points, tiers, and expiring rewards
- Order splitting across vendors with consolidated shipping
Provide the schema with table definitions, key relationships, and indexes.""",
)
print(response.output_text)
print(f"\nReasoning tokens used: {response.usage.output_tokens_details.reasoning_tokens}")
Pattern 2: Self-Verification
Reasoning models naturally self-verify at higher effort levels. You can further encourage this by asking the model to check its own work, which causes additional reasoning tokens to be spent on validation.
from openai import OpenAI
client = OpenAI()
# Self-verification pattern: model checks its own work
response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "xhigh"},
instructions="After solving the problem, verify your answer by working backwards. If you find an error, correct it before responding.",
input="""A train leaves Station A at 9:00 AM traveling east at 80 km/h.
Another train leaves Station B (400 km east of A) at 9:30 AM traveling west at 120 km/h.
At what time do they meet, and how far from Station A?
Also: if a bird flying at 200 km/h starts at Station A at 9:00 AM and flies back
and forth between the two trains until they meet, what total distance does the bird fly?""",
)
print(f"Solution:\n{response.output_text}")
print(f"\nReasoning tokens (reflects verification): {response.usage.output_tokens_details.reasoning_tokens}")
Pattern 3: Reasoning with Context Window Management
When building multi-turn reasoning conversations, manage context carefully. Reasoning items accumulate and consume context window space. For long conversations, you may need to periodically summarize earlier reasoning and start fresh turns.
from openai import OpenAI
client = OpenAI()
def reasoning_conversation(questions: list[str], model: str = "gpt-5.5") -> list[str]:
    """Multi-turn reasoning conversation that preserves context."""
    conversation_input = []
    answers = []
    for i, question in enumerate(questions):
        # Add the new question
        conversation_input.append({"role": "user", "content": question})
        response = client.responses.create(
            model=model,
            reasoning={"effort": "high", "summary": "auto"},
            input=conversation_input,
        )
        answers.append(response.output_text)
        # Preserve ALL output items (reasoning + message) for next turn
        conversation_input.extend(response.output)
        # Monitor context usage
        total_tokens = response.usage.input_tokens + response.usage.output_tokens
        print(f"Turn {i+1}: {response.usage.output_tokens_details.reasoning_tokens} reasoning tokens, {total_tokens} total")
    return answers
# Multi-turn analysis that builds on previous reasoning
results = reasoning_conversation([
"What are the key factors that caused the 2008 financial crisis?",
"Which of those factors are present in today's economy?",
"Based on your analysis, what's the probability of a similar crisis in the next 5 years?",
])
for i, answer in enumerate(results, 1):
    print(f"\n--- Turn {i} ---")
    print(answer[:300] + "...")
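One way to decide when to summarize and start fresh is to compare the last turn's token total against the context window. The check below is a sketch: `needs_compaction` and the 25% reserve are hypothetical choices, not values recommended by the API.

```python
def needs_compaction(input_tokens: int, output_tokens: int,
                     context_window: int, reserve_ratio: float = 0.25) -> bool:
    """True when the last turn used more than (1 - reserve_ratio) of the window,
    leaving too little headroom for the next turn's reasoning and output."""
    used = input_tokens + output_tokens
    return used > context_window * (1 - reserve_ratio)


# With a 200K window and a 25% reserve, the threshold is 150K tokens.
print(needs_compaction(140_000, 20_000, 200_000))  # True
print(needs_compaction(50_000, 10_000, 200_000))   # False
```

Inside a loop like `reasoning_conversation`, a `True` result would be the cue to replace the accumulated history with a summary before the next question.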
6. Performance vs Cost Tradeoffs
Choosing the right reasoning effort is the primary lever for optimizing cost, latency, and quality. The relationship is not linear — many tasks see diminishing returns above medium effort, while others genuinely need high or xhigh to get correct answers.
Use none for retrieval/formatting only. Use low for simple Q&A and summaries. Use medium as your default for most tasks. Use high for math, code generation, and multi-step analysis. Reserve xhigh for competition-level problems or when correctness matters more than cost.
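These rules of thumb map naturally onto a lookup table with medium as the fallback. The task labels and the `pick_effort` helper below are hypothetical; adapt the categories to whatever taxonomy your pipeline already uses.

```python
# Hypothetical task labels mapped to the effort levels suggested above.
EFFORT_BY_TASK = {
    "retrieval": "none",
    "formatting": "none",
    "simple_qa": "low",
    "summary": "low",
    "math": "high",
    "code_generation": "high",
    "multi_step_analysis": "high",
    "competition": "xhigh",
}


def pick_effort(task_type: str) -> str:
    """Return the suggested effort for a task label, defaulting to medium."""
    return EFFORT_BY_TASK.get(task_type, "medium")


print(pick_effort("summary"))       # low
print(pick_effort("unknown_task"))  # medium
```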
from openai import OpenAI
import time
client = OpenAI()
def benchmark_reasoning_effort(question: str, efforts: list[str]) -> dict:
    """Benchmark a question across multiple effort levels."""
    results = {}
    for effort in efforts:
        start = time.time()
        response = client.responses.create(
            model="gpt-5.5",
            reasoning={"effort": effort},
            input=question,
        )
        elapsed = time.time() - start
        results[effort] = {
            "answer_preview": response.output_text[:150],
            "latency_seconds": round(elapsed, 2),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "reasoning_tokens": response.usage.output_tokens_details.reasoning_tokens,
            "total_tokens": response.usage.input_tokens + response.usage.output_tokens,
        }
    return results
# Benchmark a complex coding question across effort levels
question = "Write a Python function to find the longest increasing subsequence in O(n log n) time."
results = benchmark_reasoning_effort(question, ["low", "medium", "high", "xhigh"])
print(f"Question: {question}\n")
print(f"{'Effort':<10} {'Latency':<10} {'Reasoning':<12} {'Total':<10}")
print("-" * 42)
for effort, data in results.items():
    print(f"{effort:<10} {data['latency_seconds']:<10} {data['reasoning_tokens']:<12} {data['total_tokens']:<10}")
Effort Level Impact on GSM8K (Grade School Math)
Testing GPT-5.5 on 200 math word problems at different effort levels shows the quality/cost curve:
- low: 82% accuracy, ~150 reasoning tokens avg, 1.2s latency
- medium: 91% accuracy, ~400 reasoning tokens avg, 2.8s latency
- high: 96% accuracy, ~900 reasoning tokens avg, 5.1s latency
- xhigh: 97% accuracy, ~2100 reasoning tokens avg, 9.4s latency
The jump from low to medium gives the best ROI (a 9-point accuracy gain for about 2.7x the reasoning tokens). Moving from high to xhigh adds only 1 point of accuracy for about 2.3x the reasoning tokens, which is justified only when every percentage point matters.
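The marginal gains can be recomputed directly from the four data points listed above. The helper name `marginal_gains` is hypothetical; the accuracy and token figures are the ones quoted in the case study.

```python
# (effort, accuracy in %, average reasoning tokens) from the case study.
ROWS = [("low", 82, 150), ("medium", 91, 400), ("high", 96, 900), ("xhigh", 97, 2100)]


def marginal_gains(rows):
    """For each step up in effort, report accuracy points gained and the
    multiplier on reasoning tokens relative to the previous level."""
    steps = []
    for (_, acc_a, tok_a), (name_b, acc_b, tok_b) in zip(rows, rows[1:]):
        steps.append({
            "to": name_b,
            "accuracy_gain_pts": acc_b - acc_a,
            "token_multiplier": round(tok_b / tok_a, 2),
        })
    return steps


gains = marginal_gains(ROWS)
for step in gains:
    print(step)
# {'to': 'medium', 'accuracy_gain_pts': 9, 'token_multiplier': 2.67}
# {'to': 'high', 'accuracy_gain_pts': 5, 'token_multiplier': 2.25}
# {'to': 'xhigh', 'accuracy_gain_pts': 1, 'token_multiplier': 2.33}
```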
7. Integration with Responses API
Reasoning models integrate seamlessly with the Responses API. The reasoning parameter works alongside all other Responses API features: structured outputs, tool calling, streaming, and multi-turn conversations. The key difference is understanding how reasoning items interact with these features.
Disabling Reasoning
Setting effort: "none" disables reasoning entirely, making the model behave like a standard (non-reasoning) model. This is useful when you want the same model for both simple and complex tasks in a unified pipeline, toggling reasoning based on task complexity.
from openai import OpenAI
client = OpenAI()
# Disable reasoning for simple tasks — behaves like a standard model
simple_response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "none"},
input="What is the capital of France?",
)
print(f"Simple answer: {simple_response.output_text}")
print(f"Reasoning tokens: {simple_response.usage.output_tokens_details.reasoning_tokens}")
# Output: 0 reasoning tokens — no internal thinking
# Enable reasoning for the same model on a complex task
complex_response = client.responses.create(
model="gpt-5.5",
reasoning={"effort": "high"},
input="Prove that for all positive integers n, the sum 1+2+...+n equals n(n+1)/2 using mathematical induction.",
)
print(f"\nComplex answer: {complex_response.output_text[:200]}...")
print(f"Reasoning tokens: {complex_response.usage.output_tokens_details.reasoning_tokens}")
# Output: hundreds of reasoning tokens spent on proof construction
Reasoning with Structured Outputs
Reasoning models work with structured outputs — the model reasons internally and then conforms its response to the requested schema. This combines deep analysis with predictable output formatting.
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI()
class CodeReview(BaseModel):
    has_bugs: bool
    severity: str  # "low", "medium", "high", "critical"
    issues: list[str]
    suggested_fix: str
    confidence: float  # 0.0 to 1.0
# Reasoning model + structured output = deep analysis in predictable format
response = client.responses.parse(
model="gpt-5.5",
reasoning={"effort": "high"},
input="""Review this Python code for bugs:
def binary_search(arr, target):
    left, right = 0, len(arr)
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1""",
text_format=CodeReview,
)
review = response.output_parsed
print(f"Has bugs: {review.has_bugs}")
print(f"Severity: {review.severity}")
print(f"Issues:")
for issue in review.issues:
    print(f" - {issue}")
print(f"Suggested fix: {review.suggested_fix}")
print(f"Confidence: {review.confidence:.0%}")
print(f"\nReasoning tokens used: {response.usage.output_tokens_details.reasoning_tokens}")
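For reference, the defect a reviewer would likely flag in the submitted snippet is the `right = len(arr)` initialization combined with `left <= right`, which can index one past the end of the list (for example on an empty list or a target larger than every element). One possible fix is sketched below; it is not the model's actual output.

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    left, right = 0, len(arr) - 1  # inclusive upper bound avoids the out-of-range read
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


print(binary_search([1, 3, 5], 5))  # 2
print(binary_search([1, 3, 5], 7))  # -1
print(binary_search([], 1))         # -1
```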
effort: "high" for thorough analysis, (3) Return structured output with bug list, severity, and fixes, (4) Compare results at medium vs high effort — measure how often higher effort catches bugs that lower effort misses, (5) Add a cost calculator showing reasoning token spend per review.
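The cost calculator mentioned in step (5) could start from something like the sketch below. It rests on two assumptions that you should verify against your own billing data: that `output_tokens` already includes reasoning tokens, and that a single per-million output price applies. The function name and the price of 10.0 USD per million tokens are hypothetical.

```python
def reasoning_cost(reasoning_tokens: int, output_tokens: int,
                   usd_per_million_output: float) -> dict:
    """Estimate output spend and the share attributable to reasoning.

    Assumes output_tokens includes reasoning_tokens (reasoning is billed as output).
    """
    if reasoning_tokens > output_tokens:
        raise ValueError("reasoning_tokens cannot exceed output_tokens")
    per_token = usd_per_million_output / 1_000_000
    return {
        "reasoning_usd": reasoning_tokens * per_token,
        "total_output_usd": output_tokens * per_token,
        "reasoning_share": reasoning_tokens / output_tokens if output_tokens else 0.0,
    }


# Hypothetical review: 900 reasoning tokens out of 1,200 output tokens.
cost = reasoning_cost(900, 1_200, usd_per_million_output=10.0)
print(f"{cost['reasoning_share']:.0%} of output spend went to reasoning")
```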
Next in the Series
In Part 12: Context & State Management, we’ll cover prompt caching, previous_response_id, the Conversations API, and multi-turn state patterns for building production conversation systems.