OpenAI SDK Track Part 11: Reasoning Systems

            
            What You’ll Learn: Reasoning models differ fundamentally from standard models — they spend internal “thinking tokens” before producing a visible response, much like a human pausing to think through a complex problem before answering. This article covers the reasoning model family, how to control effort levels for cost/quality balance, how reasoning items flow through multi-turn conversations, and how to integrate tool calling with reasoning workflows.
        

1. Reasoning Model Overview

Reasoning models use internal chain-of-thought tokens before generating their final response. These tokens are consumed from your context window and billed as output tokens, but they dramatically improve performance on tasks requiring analysis, math, coding, planning, and multi-step logic. The model essentially “thinks out loud” internally before committing to an answer.

Model	Best For	Context Window	Reasoning Tokens	Relative Cost
GPT-5.5-pro	Highest intelligence — hardest math, science, coding	200K	Up to 128K	$$$$
GPT-5.5	Recommended default — strong reasoning at good cost	200K	Up to 64K	$$$
GPT-5.4	Cost-effective reasoning for routine tasks	200K	Up to 32K	$$
GPT-5.4-mini	Lightweight reasoning — fastest in the family	128K	Up to 16K	$
GPT-5	Previous generation — still capable	128K	Up to 32K	$$

The key distinction is that reasoning tokens are generated before the visible output. You pay for them as output tokens, and they occupy context window space. If a model with a 200K context window spends its full 64K reasoning budget, only 136K tokens remain for your input and the visible response combined.
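The subtraction above can be wrapped in a small helper for budgeting prompts. This is a minimal sketch: the function name is hypothetical, and it assumes "K" means 1,000 tokens and that the model spends its entire reasoning allowance, which is the worst case rather than the typical one.

```python
def remaining_context(context_window: int, reasoning_budget: int) -> int:
    """Tokens left for input plus visible output if the full reasoning budget is spent."""
    if reasoning_budget > context_window:
        raise ValueError("reasoning budget cannot exceed the context window")
    return context_window - reasoning_budget


# Figures taken from the model table, with "K" read as 1,000 tokens (assumption)
print(remaining_context(200_000, 64_000))  # 136000
print(remaining_context(128_000, 16_000))  # 112000
```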

            
            Important: Reasoning tokens are not visible in the response text, but they count toward billing and context limits. Use the effort field of the reasoning parameter (reasoning={"effort": ...} in the Responses API) to control how many tokens the model spends thinking: lower effort means fewer reasoning tokens, faster responses, and lower cost.
        

Reasoning Model Request Flow

                flowchart LR
                    A[User Input] --> B[Model Receives Request]
                    B --> C{Reasoning Effort?}
                    C -->|none| D[Direct Response]
                    C -->|minimal/low/medium| E[Brief Internal Reasoning]
                    C -->|high/xhigh| F[Extended Internal Reasoning]
                    E --> G[Generate Reasoning Tokens]
                    F --> G
                    G --> H[Reasoning Summary Created]
                    H --> I[Visible Output Generated]
                    D --> J[Response Returned]
                    I --> J
                    J --> K[Usage: input + reasoning + output tokens]

2. Reasoning Effort Parameter

The effort field of the reasoning parameter controls how much internal thinking the model performs. Lower effort levels produce faster, cheaper responses suitable for simple tasks, while higher levels invest more reasoning tokens for thorough analysis on complex problems. Models reason adaptively: even at “high” effort, simple questions won’t consume excessive tokens.

Effort Level	Behavior	Best For
`none`	No reasoning tokens — behaves like a standard model	Simple retrieval, formatting, classification
`minimal`	Bare-minimum reasoning — near-instant responses	Quick lookups, trivial transformations
`low`	Brief chain-of-thought — fast with light analysis	Routine tasks, simple Q&A, summaries
`medium`	Balanced reasoning — good quality at moderate cost	General-purpose tasks, coding, writing
`high`	Thorough reasoning — detailed analysis and verification	Complex problems, math, multi-step logic
`xhigh`	Maximum reasoning — exhaustive exploration of possibilities	Hardest problems, research, competition math

from openai import OpenAI

client = OpenAI()

# Basic reasoning call with effort parameter
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input="Prove that the square root of 2 is irrational.",
)

print(f"Response: {response.output_text}")
print(f"\nToken usage:")
print(f"  Input tokens: {response.usage.input_tokens}")
print(f"  Output tokens: {response.usage.output_tokens}")
print(f"  Reasoning tokens: {response.usage.output_tokens_details.reasoning_tokens}")
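The usage fields printed above can feed a rough cost estimate. The sketch below is illustrative only: the helper names are invented here, the per-million-token prices are hypothetical placeholders rather than real pricing, and it assumes `output_tokens` already includes the reasoning tokens (which is how the breakdown under `output_tokens_details` reads).

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Rough request cost. Assumes output_tokens already contains reasoning tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


def reasoning_share(output_tokens: int, reasoning_tokens: int) -> float:
    """Fraction of billed output tokens that were spent on hidden reasoning."""
    return reasoning_tokens / output_tokens if output_tokens else 0.0


# Hypothetical usage numbers and prices, for illustration only
cost = estimate_cost_usd(50, 1200, input_price_per_m=2.00, output_price_per_m=8.00)
print(f"Estimated cost: ${cost:.4f}")
print(f"Reasoning share of output: {reasoning_share(1200, 900):.0%}")
```

In a real script you would pass `response.usage.input_tokens`, `response.usage.output_tokens`, and `response.usage.output_tokens_details.reasoning_tokens` in place of the literal numbers.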

Comparing effort levels on the same problem demonstrates the cost/quality tradeoff concretely. A quick heuristic: start with medium for new tasks, then dial up if quality is insufficient or dial down if the task is simpler than expected.

from openai import OpenAI
import time

client = OpenAI()

question = "A farmer has 17 sheep. All but 9 die. How many are left?"

# Compare low, medium, and high effort on the same question
for effort in ["low", "medium", "high"]:
    start = time.time()
    response = client.responses.create(
        model="gpt-5.5",
        reasoning={"effort": effort},
        input=question,
    )
    elapsed = time.time() - start

    reasoning_tokens = response.usage.output_tokens_details.reasoning_tokens
    print(f"\nEffort: {effort}")
    print(f"  Answer: {response.output_text[:100]}")
    print(f"  Reasoning tokens: {reasoning_tokens}")
    print(f"  Total output tokens: {response.usage.output_tokens}")
    print(f"  Latency: {elapsed:.2f}s")
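The "start at medium, then dial up or down" heuristic can be written as a small step function over the effort levels. This is a sketch under stated assumptions: `adjust_effort` and its two boolean inputs are hypothetical, and how you judge "quality is sufficient" or "task is trivial" is left to your own evaluation.

```python
# Effort levels in ascending order, as listed in the effort table
EFFORT_LADDER = ["none", "minimal", "low", "medium", "high", "xhigh"]


def adjust_effort(current: str, quality_ok: bool, task_is_trivial: bool = False) -> str:
    """Step up one level if quality was insufficient, down one if the task proved trivial."""
    i = EFFORT_LADDER.index(current)
    if not quality_ok:
        i = min(i + 1, len(EFFORT_LADDER) - 1)  # clamp at the top of the ladder
    elif task_is_trivial:
        i = max(i - 1, 0)  # clamp at the bottom of the ladder
    return EFFORT_LADDER[i]


print(adjust_effort("medium", quality_ok=False))                       # high
print(adjust_effort("medium", quality_ok=True, task_is_trivial=True))  # low
```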

            
            Adaptive Reasoning: Even at “high” effort, reasoning models won’t waste tokens on trivial questions. The effort parameter sets a ceiling, not a floor. A simple “What is 2+2?” at high effort will still use minimal reasoning tokens because the model recognizes the problem is trivial.
        

3. Reasoning Items in Output

When a reasoning model generates a response, it returns reasoning items in the output alongside the visible text. These items represent the model’s internal thought process and must be preserved when building multi-turn conversations. If you strip reasoning items from the conversation history, the model loses continuity and quality degrades significantly.

            
            Critical Rule: Always pass reasoning items back in subsequent turns. The model uses them to maintain coherent thought chains across messages. Dropping them is equivalent to giving someone amnesia mid-conversation — they lose all the context of their prior analysis.
        

from openai import OpenAI

client = OpenAI()

# Turn 1: Ask a complex question
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input="Analyze the time complexity of merge sort and explain why it's O(n log n).",
)

print(f"Answer: {response.output_text[:200]}...")
print(f"\nOutput items ({len(response.output)} total):")
for item in response.output:
    print(f"  Type: {item.type}", end="")
    if item.type == "reasoning":
        print(f" (id: {item.id})")
    elif item.type == "message":
        print(f" (text length: {len(item.content[0].text)})")
    else:
        print()

# Turn 2: Follow up — MUST include previous output (with reasoning items)
follow_up = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input=[
        {"role": "user", "content": "Analyze the time complexity of merge sort and explain why it's O(n log n)."},
        *response.output,  # Preserves reasoning items!
        {"role": "user", "content": "Now compare this with quicksort's average and worst case."},
    ],
)

print(f"\nFollow-up answer: {follow_up.output_text[:200]}...")

Setting the summary field of the reasoning parameter (for example, "summary": "auto") provides visibility into what the model was thinking without exposing raw reasoning tokens. This is useful for debugging, logging, and building user-facing “show your work” experiences.

from openai import OpenAI

client = OpenAI()

# Request reasoning summary for visibility into the thought process
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high", "summary": "auto"},
    input="What is the probability of getting exactly 3 heads in 5 fair coin flips?",
)

print(f"Final answer: {response.output_text}")

# Extract reasoning summary from output items
for item in response.output:
    if item.type == "reasoning":
        print(f"\nReasoning summary:")
        for summary in item.summary:
            print(f"  {summary.text}")

Real-World Application

Legal Contract Analysis Pipeline

A legal-tech startup uses reasoning models with effort: "high" for initial contract review, extracting risks and obligations with chain-of-thought analysis. They preserve reasoning items across a 4-turn conversation: (1) identify key clauses, (2) analyze risk exposure, (3) compare to standard terms, (4) generate recommendations. The reasoning continuity across turns means the model’s final recommendations reference specific analyses from earlier turns without repetition.

Legal Tech · Multi-Turn Reasoning · GPT-5.5

4. Tool Calling with Reasoning Models

Reasoning models can call tools just like standard models, but with an important distinction: when you pass tool results back, you must also include the reasoning items from the previous turn. The model needs its prior reasoning context to properly interpret tool outputs and decide whether to call more tools or produce a final response.

            
            API Limitation: Starting with GPT-5.4, tool calling is not supported in Chat Completions when reasoning is set to "none". If you need tool calling without reasoning, either use the Responses API or set effort to at least "minimal".
        

from openai import OpenAI
import json

client = OpenAI()

# Define tools
tools = [
    {
        "type": "function",
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker symbol (e.g., AAPL)"},
            },
            "required": ["ticker"],
        },
    },
    {
        "type": "function",
        "name": "get_company_financials",
        "description": "Get key financial metrics for a company.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker symbol"},
                "metrics": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Metrics to retrieve: pe_ratio, market_cap, revenue, profit_margin",
                },
            },
            "required": ["ticker", "metrics"],
        },
    },
]

# Step 1: Initial request — model reasons about what data it needs
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "medium"},
    tools=tools,
    input="Should I invest in NVDA? Compare its valuation to the semiconductor sector average.",
)

# Step 2: Process tool calls and pass results back WITH reasoning items
tool_results = []
for item in response.output:
    if item.type == "function_call":
        # Simulate tool execution
        if item.name == "get_stock_price":
            result = json.dumps({"ticker": "NVDA", "price": 892.50, "change": "+2.3%"})
        elif item.name == "get_company_financials":
            result = json.dumps({"ticker": "NVDA", "pe_ratio": 45.2, "market_cap": "2.2T", "revenue": "96B", "profit_margin": 0.57})
        else:
            result = json.dumps({"error": "Unknown function"})

        tool_results.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": result,
        })

# Step 3: Send tool results back — include ALL output items (reasoning + function_calls + results)
final_response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "medium"},
    tools=tools,
    input=[
        {"role": "user", "content": "Should I invest in NVDA? Compare its valuation to the semiconductor sector average."},
        *response.output,      # Includes reasoning items + function_call items
        *tool_results,         # Tool outputs
    ],
)

print(f"Investment analysis:\n{final_response.output_text}")
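The simulated tool execution above returns canned data without reading the arguments the model supplied. A registry-based dispatcher is one way to parse those arguments and build the `function_call_output` items. This is a sketch, not the SDK's own mechanism: `TOOL_REGISTRY`, `run_tool_calls`, and the stand-in `get_stock_price` handler are hypothetical, and the output items are faked with `SimpleNamespace` so the block runs without an API call.

```python
import json
from types import SimpleNamespace


def get_stock_price(ticker: str) -> dict:
    # Hypothetical stand-in for a real price lookup; returns canned data
    return {"ticker": ticker, "price": 892.50}


# Maps tool names (as declared in `tools`) to local handler functions
TOOL_REGISTRY = {"get_stock_price": get_stock_price}


def run_tool_calls(output_items) -> list[dict]:
    """Execute each function_call item and return matching function_call_output items."""
    results = []
    for item in output_items:
        if item.type != "function_call":
            continue  # reasoning and message items are passed back untouched elsewhere
        handler = TOOL_REGISTRY.get(item.name)
        if handler is None:
            payload = {"error": f"Unknown function: {item.name}"}
        else:
            try:
                payload = handler(**json.loads(item.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                payload = {"error": str(exc)}  # malformed or mismatched arguments
        results.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(payload),
        })
    return results


# Fake output items standing in for response.output
fake_output = [
    SimpleNamespace(type="reasoning", id="rs_1"),
    SimpleNamespace(type="function_call", name="get_stock_price",
                    arguments='{"ticker": "NVDA"}', call_id="call_1"),
]
print(run_tool_calls(fake_output))
```

The returned list plays the role of `tool_results` in Step 3: it is appended after the original output items, so the reasoning and function_call items still travel back with the tool outputs.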

5. Multi-Step Reasoning Patterns

Multi-step reasoning leverages the model’s ability to decompose complex problems internally. Rather than manually splitting a problem into sub-tasks (which you’d do with standard models), you can present the full problem and let the reasoning model’s internal chain-of-thought handle decomposition, verification, and synthesis.

Pattern 1: Problem Decomposition

For problems that benefit from explicit sub-task structure, combine reasoning effort with structured instructions that guide the decomposition.

from openai import OpenAI

client = OpenAI()

# Multi-step decomposition: model reasons through sub-problems internally
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    instructions="""You are an expert systems analyst. When analyzing complex problems:
1. Identify all sub-problems and dependencies
2. Solve each sub-problem in order
3. Verify your solution is consistent across sub-problems
4. Present the final integrated answer with confidence level""",
    input="""Design a database schema for an e-commerce platform that handles:
- Multi-vendor marketplace (vendors have products, ratings, tiers)
- Real-time inventory across 5 warehouses
- Dynamic pricing (time-of-day, demand, competitor matching)
- Customer loyalty program with points, tiers, and expiring rewards
- Order splitting across vendors with consolidated shipping

Provide the schema with table definitions, key relationships, and indexes.""",
)

print(response.output_text)
print(f"\nReasoning tokens used: {response.usage.output_tokens_details.reasoning_tokens}")

Pattern 2: Self-Verification

Reasoning models naturally self-verify at higher effort levels. You can further encourage this by asking the model to check its own work, which causes additional reasoning tokens to be spent on validation.

from openai import OpenAI

client = OpenAI()

# Self-verification pattern: model checks its own work
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "xhigh"},
    instructions="After solving the problem, verify your answer by working backwards. If you find an error, correct it before responding.",
    input="""A train leaves Station A at 9:00 AM traveling east at 80 km/h.
Another train leaves Station B (400 km east of A) at 9:30 AM traveling west at 120 km/h.
At what time do they meet, and how far from Station A?

Also: if a bird flying at 200 km/h starts at Station A at 9:00 AM and flies back
and forth between the two trains until they meet, what total distance does the bird fly?""",
)

print(f"Solution:\n{response.output_text}")
print(f"\nReasoning tokens (reflects verification): {response.usage.output_tokens_details.reasoning_tokens}")
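When you test a self-verification prompt, it helps to have an independent reference answer to compare the model's output against. The sketch below works the train problem with exact fractions. The function name is hypothetical, and it assumes the bird flies continuously at 200 km/h from 9:00 AM until the trains meet.

```python
from fractions import Fraction


def train_meeting():
    """Solve the two-train problem exactly; t is hours after 9:00 AM."""
    speed_a, speed_b, bird_speed = 80, 120, 200  # km/h
    distance = 400                               # km between stations
    head_start = Fraction(1, 2)                  # train B leaves 30 minutes later

    # Positions measured from Station A: 80*t = 400 - 120*(t - 1/2)
    # => t = (400 + 120 * 1/2) / (80 + 120)
    t = (distance + speed_b * head_start) / (speed_a + speed_b)

    meet_km_from_a = speed_a * t
    bird_km = bird_speed * t  # assumes the bird never stops until the trains meet

    minutes = int(t * 60)
    clock = f"{9 + minutes // 60}:{minutes % 60:02d} AM"
    return clock, meet_km_from_a, bird_km


clock, meet_km, bird_km = train_meeting()
print(f"Meet at {clock}, {meet_km} km from Station A; bird flies {bird_km} km")
# Meet at 11:18 AM, 184 km from Station A; bird flies 460 km
```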

Pattern 3: Reasoning with Context Window Management

When building multi-turn reasoning conversations, manage context carefully. Reasoning items accumulate and consume context window space. For long conversations, you may need to periodically summarize earlier reasoning and start fresh turns.

from openai import OpenAI

client = OpenAI()

def reasoning_conversation(questions: list[str], model: str = "gpt-5.5") -> list[str]:
    """Multi-turn reasoning conversation that preserves context."""
    conversation_input = []
    answers = []

    for i, question in enumerate(questions):
        # Add the new question
        conversation_input.append({"role": "user", "content": question})

        response = client.responses.create(
            model=model,
            reasoning={"effort": "high", "summary": "auto"},
            input=conversation_input,
        )

        answers.append(response.output_text)

        # Preserve ALL output items (reasoning + message) for next turn
        conversation_input.extend(response.output)

        # Monitor context usage
        total_tokens = response.usage.input_tokens + response.usage.output_tokens
        print(f"Turn {i+1}: {response.usage.output_tokens_details.reasoning_tokens} reasoning tokens, {total_tokens} total")

    return answers

# Multi-turn analysis that builds on previous reasoning
results = reasoning_conversation([
    "What are the key factors that caused the 2008 financial crisis?",
    "Which of those factors are present in today's economy?",
    "Based on your analysis, what's the probability of a similar crisis in the next 5 years?",
])

for i, answer in enumerate(results, 1):
    print(f"\n--- Turn {i} ---")
    print(answer[:300] + "...")
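The conversation loop above prints token totals but never acts on them. One simple way to decide when to summarize and start fresh is a headroom check against the context window. This is a sketch only: `should_compact`, the 25% headroom default, and the 200,000-token window are assumptions, and it treats this turn's input plus output as a rough estimate of the next turn's input size.

```python
def should_compact(
    input_tokens: int,
    output_tokens: int,
    context_window: int,
    headroom: float = 0.25,
) -> bool:
    """True when this turn's input + output leaves less than `headroom` of the window free."""
    used = input_tokens + output_tokens
    return used >= context_window * (1 - headroom)


# Hypothetical usage figures against an assumed 200,000-token window
print(should_compact(100_000, 20_000, 200_000))  # False: 120,000 used, limit is 150,000
print(should_compact(140_000, 15_000, 200_000))  # True: 155,000 used, limit is 150,000
```

Inside `reasoning_conversation`, a check like this could run after each turn, using `response.usage.input_tokens` and `response.usage.output_tokens`, to trigger a summarization step before the next question is appended.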

6. Performance vs Cost Tradeoffs

Choosing the right reasoning effort is the primary lever for optimizing cost, latency, and quality. The relationship is not linear — many tasks see diminishing returns above medium effort, while others genuinely need high or xhigh to get correct answers.

            
            Rules of Thumb: Use none for retrieval/formatting only. Use low for simple Q&A and summaries. Use medium as your default for most tasks. Use high for math, code generation, and multi-step analysis. Reserve xhigh for competition-level problems or when correctness matters more than cost.
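These rules of thumb translate directly into a lookup table. The sketch below is one possible encoding: the task-type labels and the `effort_for` helper are hypothetical names chosen for illustration, and unknown task types fall back to medium as the stated default.

```python
DEFAULT_EFFORT = "medium"

# Task categories are illustrative labels, not values defined by the API
EFFORT_BY_TASK = {
    "retrieval": "none",
    "formatting": "none",
    "simple_qa": "low",
    "summary": "low",
    "math": "high",
    "code_generation": "high",
    "multi_step_analysis": "high",
    "competition": "xhigh",
}


def effort_for(task_type: str) -> str:
    """Return the suggested effort level for a task type, defaulting to medium."""
    return EFFORT_BY_TASK.get(task_type, DEFAULT_EFFORT)


print(effort_for("formatting"))   # none
print(effort_for("math"))         # high
print(effort_for("translation"))  # medium (not listed, so the default applies)
```

The result can be passed straight into a request as `reasoning={"effort": effort_for(task_type)}`.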
        

from openai import OpenAI
import time

client = OpenAI()

def benchmark_reasoning_effort(question: str, efforts: list[str]) -> dict:
    """Benchmark a question across multiple effort levels."""
    results = {}

    for effort in efforts:
        start = time.time()
        response = client.responses.create(
            model="gpt-5.5",
            reasoning={"effort": effort},
            input=question,
        )
        elapsed = time.time() - start

        results[effort] = {
            "answer_preview": response.output_text[:150],
            "latency_seconds": round(elapsed, 2),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "reasoning_tokens": response.usage.output_tokens_details.reasoning_tokens,
            "total_tokens": response.usage.input_tokens + response.usage.output_tokens,
        }

    return results

# Benchmark a complex coding question across effort levels
question = "Write a Python function to find the longest increasing subsequence in O(n log n) time."
results = benchmark_reasoning_effort(question, ["low", "medium", "high", "xhigh"])

print(f"Question: {question}\n")
print(f"{'Effort':<10} {'Latency':<10} {'Reasoning':<12} {'Total':<10}")
print("-" * 42)
for effort, data in results.items():
    print(f"{effort:<10} {data['latency_seconds']:<10} {data['reasoning_tokens']:<12} {data['total_tokens']:<10}")

Benchmark Results

Effort Level Impact on GSM8K (Grade School Math)

Testing GPT-5.5 on 200 math word problems at different effort levels shows the quality/cost curve:

low: 82% accuracy, ~150 reasoning tokens avg, 1.2s latency
medium: 91% accuracy, ~400 reasoning tokens avg, 2.8s latency
high: 96% accuracy, ~900 reasoning tokens avg, 5.1s latency
xhigh: 97% accuracy, ~2100 reasoning tokens avg, 9.4s latency

The jump from low to medium gives the best ROI (a 9-percentage-point accuracy gain for ~2.7x the reasoning tokens). Moving from high to xhigh adds only 1 percentage point of accuracy for ~2.3x the reasoning tokens, which is justified only when every percentage point matters.

Benchmarking · Cost Optimization · Math Reasoning
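The step-by-step tradeoff in the figures above can be computed rather than eyeballed. This sketch uses only the accuracy and average reasoning-token numbers reported in the benchmark list; the `marginal_gains` helper is a hypothetical name.

```python
# (effort, accuracy in %, average reasoning tokens) from the reported results
RESULTS = [
    ("low", 82, 150),
    ("medium", 91, 400),
    ("high", 96, 900),
    ("xhigh", 97, 2100),
]


def marginal_gains(results):
    """For each step up, return (step, accuracy gain in points, reasoning-token multiple)."""
    rows = []
    for (prev, prev_acc, prev_tok), (cur, cur_acc, cur_tok) in zip(results, results[1:]):
        rows.append((f"{prev}->{cur}", cur_acc - prev_acc, round(cur_tok / prev_tok, 2)))
    return rows


for step, gain, multiple in marginal_gains(RESULTS):
    print(f"{step:<14} +{gain} points for {multiple}x reasoning tokens")
```

The same function works on your own benchmark numbers if you collect accuracy alongside the token counts from `benchmark_reasoning_effort`.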

7. Integration with Responses API

Reasoning models integrate seamlessly with the Responses API. The reasoning parameter works alongside all other Responses API features: structured outputs, tool calling, streaming, and multi-turn conversations. The key difference is understanding how reasoning items interact with these features.

Disabling Reasoning

Setting effort: "none" disables reasoning entirely, making the model behave like a standard (non-reasoning) model. This is useful when you want the same model for both simple and complex tasks in a unified pipeline, toggling reasoning based on task complexity.

from openai import OpenAI

client = OpenAI()

# Disable reasoning for simple tasks — behaves like a standard model
simple_response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "none"},
    input="What is the capital of France?",
)

print(f"Simple answer: {simple_response.output_text}")
print(f"Reasoning tokens: {simple_response.usage.output_tokens_details.reasoning_tokens}")
# Output: 0 reasoning tokens — no internal thinking

# Enable reasoning for the same model on a complex task
complex_response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input="Prove that for all positive integers n, the sum 1+2+...+n equals n(n+1)/2 using mathematical induction.",
)

print(f"\nComplex answer: {complex_response.output_text[:200]}...")
print(f"Reasoning tokens: {complex_response.usage.output_tokens_details.reasoning_tokens}")
# Expect a nonzero reasoning token count here; the exact number varies by run

Reasoning with Structured Outputs

Reasoning models work with structured outputs — the model reasons internally and then conforms its response to the requested schema. This combines deep analysis with predictable output formatting.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CodeReview(BaseModel):
    has_bugs: bool
    severity: str  # "low", "medium", "high", "critical"
    issues: list[str]
    suggested_fix: str
    confidence: float  # 0.0 to 1.0

# Reasoning model + structured output = deep analysis in predictable format
response = client.responses.parse(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input="""Review this Python code for bugs:

def binary_search(arr, target):
    left, right = 0, len(arr)
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1""",
    text_format=CodeReview,
)

review = response.output_parsed
print(f"Has bugs: {review.has_bugs}")
print(f"Severity: {review.severity}")
print(f"Issues:")
for issue in review.issues:
    print(f"  - {issue}")
print(f"Suggested fix: {review.suggested_fix}")
print(f"Confidence: {review.confidence:.0%}")
print(f"\nReasoning tokens used: {response.usage.output_tokens_details.reasoning_tokens}")

            
            Try It Yourself: Build a reasoning-powered code reviewer: (1) Accept a code snippet and language, (2) Use effort: "high" for thorough analysis, (3) Return structured output with bug list, severity, and fixes, (4) Compare results at medium vs high effort — measure how often higher effort catches bugs that lower effort misses, (5) Add a cost calculator showing reasoning token spend per review.
        

Next in the Series

In Part 12: Context & State Management, we’ll cover prompt caching, previous_response_id, the Conversations API, and multi-turn state patterns for building production conversation systems.