OpenAI SDK Track Part 2: Responses API & Text Generation

                        
What You’ll Learn: The Responses API is OpenAI’s unified endpoint for text generation, from simple completions to multi-turn conversations with tool use. This article teaches you the fundamentals: how to structure requests, interpret responses, control output with parameters (temperature, top_p, max_output_tokens), and handle streaming for real-time applications. Think of it like learning SQL for a database: once you master the query language, you can build anything.
                    

1. The Responses API

The Responses API is OpenAI’s primary interface for text generation. It accepts a sequence of messages (system, user, assistant) and returns a model-generated completion. The API supports tool calling, structured outputs, and streaming — all through the same endpoint.

1.1 Message Structure

The first thing to internalize is that input can be either a full message array or a shorthand single string. Use the array form whenever you need explicit role control or multimodal content; use the shorthand form for small, one-off prompts.

from openai import OpenAI

client = OpenAI()

# Basic request with message array
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."},
    ],
)
print(response.output_text)

from openai import OpenAI

client = OpenAI()

# Shorthand for single user message
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain the difference between REST and GraphQL in 3 sentences.",
)
print(response.output_text)
print(f"Tokens used: {response.usage.input_tokens} in, {response.usage.output_tokens} out")
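
To make the relationship between the two input forms concrete, the sketch below normalizes either one into an explicit message array. `to_message_array` is a hypothetical helper, not part of the SDK, and it assumes a bare string stands for a single user message, as the shorthand example suggests.

```python
from typing import Union

Message = dict[str, str]


def to_message_array(user_input: Union[str, list[Message]]) -> list[Message]:
    """Return the explicit message-array form of an input value.

    Hypothetical helper for illustration; the SDK accepts both forms directly.
    """
    if isinstance(user_input, str):
        # Shorthand: a bare string is treated as one user message (assumption).
        return [{"role": "user", "content": user_input}]
    # Already an array: copy each message so callers can append safely.
    return [dict(m) for m in user_input]


messages = to_message_array("Explain REST vs GraphQL.")
print(messages)
```

A wrapper like this is mostly useful when you start with one-off prompts and later need to append turns or add a system message without changing the call site.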

1.2 System Instructions

Instructions are where you define stable behavior: tone, output discipline, constraints, and role. If users should be able to change a behavior, put it in the user input; if it should stay fixed across requests, put it in instructions.

from openai import OpenAI

client = OpenAI()

# System instructions set behavior, persona, and constraints
response = client.responses.create(
    model="gpt-4.1",
    instructions="You are a senior Python developer. Always include type hints. Explain code briefly after writing it. Never use global variables.",
    input="Write a function that validates email addresses using regex.",
)
print(response.output_text)

                        
Key Insight: The instructions parameter in the Responses API plays the role of a system message and is placed ahead of the input automatically. It applies to the request that sets it, so when you chain turns with previous_response_id, pass instructions again on each call if the constraints should persist.
                    

1.3 Response Object Anatomy

A response is more than just text. The object carries usage metrics, status, model identity, and structured output items that become essential once you add tools, reasoning summaries, or server-managed conversation state.

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input="What is 2 + 2?",
)

# Key response fields
print(f"ID: {response.id}")                       # Unique response identifier
print(f"Model: {response.model}")                 # Actual model used
print(f"Output: {response.output_text}")          # Generated text
print(f"Status: {response.status}")               # completed, failed, incomplete
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

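# Output items (there can be several, e.g. tool calls or reasoning items)
for item in response.output:
    # Only message items carry a role, so fall back to None for other types
    print(f"  Type: {item.type}, Role: {getattr(item, 'role', None)}")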
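
To show what walking the output items looks like without calling the API, here is a minimal sketch that rebuilds the text from a list of items. It uses plain dicts as stand-ins for the SDK objects, and the item layout (a `message` item holding `output_text` parts) is an assumption modeled on the fields printed above. `collect_output_text` is a hypothetical name.

```python
def collect_output_text(output_items: list[dict]) -> str:
    """Concatenate the text parts of every message item in an output list."""
    parts = []
    for item in output_items:
        if item.get("type") != "message":
            continue  # skip tool calls, reasoning items, and other non-text items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)


# Hand-written sample that mimics a response with one reasoning item and one message.
sample_output = [
    {"type": "reasoning", "summary": []},
    {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": "2 + 2 = 4"}],
    },
]
print(collect_output_text(sample_output))
```

In day-to-day code `response.output_text` is the simpler choice. Iterating manually becomes worthwhile once the output mixes text with tool calls.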

2. Generation Controls

2.1 Temperature & Top-P

Temperature and top-p both control variation, but they do it differently. Temperature changes how sharply the model prefers likely tokens, while top-p limits the candidate pool itself. In practice, teams usually tune one of them aggressively and keep the other conservative.

from openai import OpenAI

client = OpenAI()

# Temperature ranges from 0 to 2: 0 is the most stable, 2 the most varied
# Low temperature for factual/analytical tasks
response = client.responses.create(
    model="gpt-4.1-mini",
    input="What is the capital of France?",
    temperature=0,  # Near-deterministic: the most likely token is almost always chosen
)
print(f"Deterministic: {response.output_text}")

# High temperature for creative tasks
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Write a creative metaphor for time.",
    temperature=1.2,  # More varied, surprising outputs
)
print(f"Creative: {response.output_text}")

This second snippet isolates top-p so you can see it separately from temperature. That is a useful testing habit because it keeps prompt experiments easier to interpret.

from openai import OpenAI

client = OpenAI()

# Top-p (nucleus sampling): sample only from the most likely tokens whose
# cumulative probability reaches top_p
# top_p=0.1 keeps just the tokens that make up the top 10% of probability mass
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Suggest a name for a tech startup focused on AI safety.",
    top_p=0.9,  # Top 90% of probability mass; temperature stays at its default
)
print(response.output_text)
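
The difference between the two knobs is easier to see on a toy distribution. The sketch below applies the textbook definitions of temperature scaling and nucleus filtering to four made-up logits. It illustrates the concepts only and is not a description of how OpenAI implements sampling. The function names and logit values are invented for the example.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "Rome": 0.5}
sharp = softmax_with_temperature(toy_logits, 0.5)  # low temperature
flat = softmax_with_temperature(toy_logits, 1.5)   # high temperature
nucleus = nucleus_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.9)

print(f"P(Paris) at T=0.5: {sharp['Paris']:.3f}, at T=1.5: {flat['Paris']:.3f}")
print(f"Tokens kept at top_p=0.9: {sorted(nucleus)}")
```

Temperature reshapes the probabilities of every token, while top-p removes the tail entirely. That is why changing both at once makes experiments harder to read.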

2.2 Token Limits & Stop Sequences

Output limits are part cost control and part UX control. They prevent a model from running away with verbose completions and they give your application a predictable ceiling when you budget latency or downstream post-processing.

from openai import OpenAI

client = OpenAI()

# max_output_tokens caps response length
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Write a comprehensive essay about climate change.",
    max_output_tokens=200,  # Hard cap at 200 output tokens
)
print(f"Output ({response.usage.output_tokens} tokens): {response.output_text}")

# Check if response was truncated
if response.status == "incomplete":
    # incomplete_details.reason says why, e.g. "max_output_tokens"
    print(f"Response was truncated: {response.incomplete_details.reason}")
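
The snippet above covers the token cap only. For stop-sequence behavior, one option that works regardless of what the endpoint accepts is to truncate on the client, as sketched below. `truncate_at_stop` is a hypothetical helper. Whether a server-side stop parameter is available for your model and endpoint is something to check against the current API reference.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence (marker excluded)."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty string would match at index 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


draft = "Step 1: gather data\nStep 2: train\n###\nInternal notes"
print(truncate_at_stop(draft, ["###", "END"]))
```

Client-side truncation still pays for the tokens generated after the marker, so it complements `max_output_tokens` rather than replacing it.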

2.3 Deterministic Outputs

Perfect reproducibility is never the right mental model for generative systems, but you can make outputs much more stable. Low temperature, constrained prompts, structured outputs, and consistent server-side context all work together better than any single setting alone.

from openai import OpenAI

client = OpenAI()

# For more stable outputs: temperature=0 and identical inputs (repeat runs can still differ)
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input="Generate a 5-word product tagline for running shoes.",
    temperature=0,
    store=False,  # Don't persist this response on OpenAI's side for later retrieval
)

response2 = client.responses.create(
    model="gpt-4.1-mini",
    input="Generate a 5-word product tagline for running shoes.",
    temperature=0,
    store=False,
)

print(f"Response 1: {response1.output_text}")
print(f"Response 2: {response2.output_text}")
print(f"Identical: {response1.output_text == response2.output_text}")
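
When an application needs byte-identical output for a repeated request, the dependable route is to reuse a stored result instead of regenerating it. The sketch below memoizes results by a hash of the request parameters. `CachedGenerator`, `request_key`, and `fake_generate` are hypothetical names, and the generator is a stand-in for a function that would wrap the API call.

```python
import hashlib
import json
from typing import Callable


def request_key(model: str, prompt: str, **params) -> str:
    """Build a stable cache key from the request parameters."""
    payload = json.dumps({"model": model, "input": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedGenerator:
    """Return the stored output when an identical request is repeated."""

    def __init__(self, generate: Callable[..., str]):
        self._generate = generate  # hypothetical callable wrapping the API call
        self._cache: dict[str, str] = {}

    def __call__(self, model: str, prompt: str, **params) -> str:
        key = request_key(model, prompt, **params)
        if key not in self._cache:
            self._cache[key] = self._generate(model, prompt, **params)
        return self._cache[key]


# Stand-in generator that deliberately returns a different string on every call.
calls = {"count": 0}


def fake_generate(model: str, prompt: str, **params) -> str:
    calls["count"] += 1
    return f"tagline #{calls['count']}"


cached = CachedGenerator(fake_generate)
first = cached("gpt-4.1-mini", "Tagline for running shoes", temperature=0)
second = cached("gpt-4.1-mini", "Tagline for running shoes", temperature=0)
print(f"Identical: {first == second}, generator calls: {calls['count']}")
```

The in-memory dict keeps the example short. A real deployment would more likely use a shared store with an expiry policy.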

Real-World Application

Content Generation at Scale

A media company generates 500 article summaries daily using the Responses API. Their optimizations: batching similar requests, caching common prefixes, and using temperature=0.3 for factual content versus 0.9 for creative headlines. Reported result: a 60% cost reduction.


3. Streaming Responses

3.1 Basic Streaming

Streaming changes the feel of an application more than almost any model knob. Instead of waiting for the whole answer, you can render partial text immediately and keep the UI responsive even for longer generations.

from openai import OpenAI

client = OpenAI()

# Stream tokens as they're generated — essential for responsive UIs
stream = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain quantum entanglement step by step.",
    stream=True,
)

full_response = ""
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_response += event.delta

print(f"\n\nTotal length: {len(full_response)} chars")

3.2 Async Streaming

Async streaming matters once you have concurrent sessions or websockets in your own app. It lets you consume events without blocking unrelated work, which is exactly what you need for chat servers and realtime dashboards.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    """Stream response tokens asynchronously."""
    stream = await client.responses.create(
        model="gpt-4.1-mini",
        input=prompt,
        stream=True,
    )

    full_text = ""
    async for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            full_text += event.delta

    print(f"\n--- Done ({len(full_text)} chars) ---")
    return full_text

asyncio.run(stream_response("Write a haiku about programming."))
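
The benefit of async streaming shows up when several streams run at once. The sketch below uses a fake async generator in place of a model stream, so it runs without network access, and consumes two sessions concurrently with `asyncio.gather`. `fake_stream`, `collect`, and `run_sessions` are hypothetical names for this illustration.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(text: str, delay: float = 0.01) -> AsyncIterator[str]:
    """Stand-in for a model stream: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(delay)  # simulates latency between deltas
        yield word + " "


async def collect(name: str, text: str) -> tuple[str, str]:
    """Consume one stream to completion and return its assembled text."""
    chunks = []
    async for delta in fake_stream(text):
        chunks.append(delta)
    return name, "".join(chunks).strip()


async def run_sessions() -> dict[str, str]:
    """Run two streams concurrently; each yields control while it waits."""
    results = await asyncio.gather(
        collect("session-a", "code flows like water"),
        collect("session-b", "bugs hide in plain sight"),
    )
    return dict(results)


sessions = asyncio.run(run_sessions())
print(sessions)
```

With the real client, `collect` would iterate over the events from `client.responses.create(..., stream=True)` in the same way, and the two sessions would overlap instead of running back to back.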

3.3 Streaming with Event Types

Real production streams are event streams, not just token streams. Handle lifecycle events explicitly so you can measure usage, surface errors correctly, and know when to persist the final answer versus interim deltas.

from openai import OpenAI

client = OpenAI()

# Handle different event types during streaming (match/case requires Python 3.10+)
stream = client.responses.create(
    model="gpt-4.1-mini",
    input="List 5 Python design patterns with brief descriptions.",
    stream=True,
)

for event in stream:
    match event.type:
        case "response.created":
            print(f"[Started] Response ID: {event.response.id}")
        case "response.output_text.delta":
            print(event.delta, end="", flush=True)
        case "response.completed":
            usage = event.response.usage
            print(f"\n[Done] {usage.input_tokens} in, {usage.output_tokens} out")
        case "response.failed":
            print(f"[Error] {event.response.error}")
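
For persistence and metrics it often helps to fold the event stream into a single record and leave printing to the caller. The sketch below does that for the four event types handled above. It runs on hand-built `SimpleNamespace` events that mimic the attribute layout used in the snippet, so no API call is needed. `consume_stream` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable


def consume_stream(events: Iterable) -> dict:
    """Fold a stream of events into one result record."""
    result = {"id": None, "text": "", "usage": None, "error": None}
    for event in events:
        if event.type == "response.created":
            result["id"] = event.response.id
        elif event.type == "response.output_text.delta":
            result["text"] += event.delta  # interim delta, not yet final
        elif event.type == "response.completed":
            result["usage"] = event.response.usage  # safe point to persist
        elif event.type == "response.failed":
            result["error"] = event.response.error
    return result


# Hand-built events standing in for a real stream.
fake_events = [
    SimpleNamespace(type="response.created", response=SimpleNamespace(id="resp_demo")),
    SimpleNamespace(type="response.output_text.delta", delta="Single"),
    SimpleNamespace(type="response.output_text.delta", delta="ton"),
    SimpleNamespace(
        type="response.completed",
        response=SimpleNamespace(usage=SimpleNamespace(input_tokens=12, output_tokens=2)),
    ),
]
record = consume_stream(fake_events)
print(record["text"], record["usage"].output_tokens)
```

Event types the function does not recognize are ignored, which keeps the handler tolerant of events added to the stream later.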

4. Reasoning Models

4.1 o3 & o4-mini

Reasoning models are best reserved for tasks where deliberate intermediate reasoning pays for itself: complex coding, math, multi-step decisions, and situations where a shallow fast answer is often wrong.

from openai import OpenAI

client = OpenAI()

# Reasoning models "think" before responding, which helps on complex tasks
response = client.responses.create(
    model="o4-mini",
    input="A farmer has 17 sheep. All but 9 die. How many are left?",
    reasoning={"summary": "auto"},  # Ask for a reasoning summary; it is not returned by default
)
print(f"Answer: {response.output_text}")

# Access reasoning summary (when available)
for item in response.output:
    if item.type == "reasoning":
        # item.summary is a list of summary parts, each with a .text field
        for part in item.summary:
            print(f"Reasoning: {part.text}")

4.2 Reasoning Effort

Reasoning effort is effectively a budget for deliberation. That gives you a way to match task complexity to latency and cost instead of assuming every prompt deserves the same level of computational work.

from openai import OpenAI

client = OpenAI()

# Control how much "thinking" the model does
# low = fast but shallow, medium = balanced, high = thorough

# Quick factual lookups — low effort
response = client.responses.create(
    model="o4-mini",
    input="What year was Python created?",
    reasoning={"effort": "low"},
)
print(f"Low effort: {response.output_text}")

# Complex multi-step problem — high effort
response = client.responses.create(
    model="o4-mini",
    input="""Given a sorted array of integers and a target sum, find all unique
    pairs that sum to the target. Analyze time and space complexity.
    Provide the optimal solution.""",
    reasoning={"effort": "high"},
)
print(f"High effort: {response.output_text}")

4.3 When to Use Reasoning Models

Use Case	Model	Why
Simple Q&A, chat	gpt-4.1-mini	Fast, cheap, sufficient quality
Code generation	gpt-4.1	Good balance of speed and accuracy
Math/logic puzzles	o4-mini	Deliberative reasoning needed
Research analysis	o3	Deep multi-step reasoning
Complex code review	o4-mini (high)	Catches subtle bugs
Scientific reasoning	o3	Multi-step verification
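
A table like this can be encoded as a small routing map so that model choice lives in one place. The sketch below is one possible encoding. The task-type keys, `ROUTES`, and `request_params` are hypothetical names, and the fallback to the cheapest model is a design assumption, not a recommendation from the table.

```python
import copy

# One entry per table row; keys are invented labels for the use cases.
ROUTES = {
    "chat": {"model": "gpt-4.1-mini"},
    "code_generation": {"model": "gpt-4.1"},
    "math": {"model": "o4-mini"},
    "research": {"model": "o3"},
    "code_review": {"model": "o4-mini", "reasoning": {"effort": "high"}},
    "science": {"model": "o3"},
}


def request_params(task_type: str, prompt: str) -> dict:
    """Build keyword arguments for a request based on the task type."""
    # Unknown task types fall back to the cheapest route (assumption).
    route = copy.deepcopy(ROUTES.get(task_type, ROUTES["chat"]))
    route["input"] = prompt
    return route


print(request_params("code_review", "Review this diff for race conditions."))
```

The returned dict can be unpacked into a call such as `client.responses.create(**params)`.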

5. Multi-Turn Conversations

5.1 Stateless Conversation Pattern

The simplest mental model is still the most reliable one: every request is self-contained unless you explicitly carry context forward. That keeps application behavior easy to reason about and makes truncation or summarization strategies visible in your own code.

from openai import OpenAI

client = OpenAI()

# In this pattern each request is self-contained: send the full conversation every time
conversation = [
    {"role": "user", "content": "My name is Alice and I'm building a chatbot."},
]

# Turn 1
response = client.responses.create(model="gpt-4.1-mini", input=conversation)
assistant_msg = response.output_text
conversation.append({"role": "assistant", "content": assistant_msg})
print(f"Assistant: {assistant_msg}")

# Turn 2 — include full history
conversation.append({"role": "user", "content": "What's my name?"})
response = client.responses.create(model="gpt-4.1-mini", input=conversation)
print(f"Assistant: {response.output_text}")  # Should remember "Alice"

5.2 Context Window Management

As conversations grow, the hard problem is no longer generation quality but context hygiene. You need rules for trimming, summarizing, or offloading older state so useful context survives without wasting tokens on stale turns.

from openai import OpenAI

client = OpenAI()

def count_tokens_approx(messages: list[dict]) -> int:
    """Rough token estimate: ~4 chars per token."""
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // 4

def trim_conversation(messages: list[dict], max_tokens: int = 100000) -> list[dict]:
    """Keep system message + most recent messages within token budget."""
    if not messages:
        return messages

    # Always keep system message (if present)
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    # Trim oldest messages until within budget
    while count_tokens_approx(system_msgs + other_msgs) > max_tokens and len(other_msgs) > 2:
        other_msgs.pop(0)  # Remove oldest non-system message

    return system_msgs + other_msgs

# Usage
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    # ... many more turns ...
]

trimmed = trim_conversation(conversation, max_tokens=50000)
response = client.responses.create(model="gpt-4.1-mini", input=trimmed)
print(response.output_text)
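
Trimming discards old turns outright. Summarizing keeps a compressed trace of them. The sketch below replaces everything but the most recent turns with a single summary message. `compact_history` is a hypothetical helper, the summarizer is injected so the example runs offline, and carrying the summary in a user-role message is an assumption you may want to change.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    keep_recent: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Replace older turns with one summary message; keep system and recent turns."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_recent:
        return system_msgs + turns  # nothing old enough to summarize

    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    summary = {
        "role": "user",  # assumption: the summary travels as a user message
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return system_msgs + [summary] + recent


def fake_summarize(older: list[dict]) -> str:
    """Stand-in summarizer; a real one might call a small model."""
    return f"{len(older)} earlier messages about project setup."


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    {"role": "user", "content": "I need help with my build."},
    {"role": "assistant", "content": "Sure, what is failing?"},
]
compacted = compact_history(history, keep_recent=2, summarize=fake_summarize)
for message in compacted:
    print(message["role"], "|", message["content"])
```

A reasonable trigger is to compact only when the approximate token count crosses a threshold, so short conversations are never rewritten.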

                        
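Tip: For long conversations, consider using the previous_response_id parameter to chain responses server-side, letting OpenAI manage conversation state. This avoids resending the full history on each call, but the earlier turns still count as input tokens for billing and for the context window, and the referenced response must have been stored (store=True, the default).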
                    

Server-managed state is especially helpful when you want cleaner request payloads or better caching on repeated prefixes. The pattern below shows the minimum viable form: start one response, then continue the thread by referencing its response ID instead of resending every past message.

from openai import OpenAI

client = OpenAI()

# Chain responses using previous_response_id (server-managed conversation)
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input="My name is Bob. I'm working on a RAG system.",
)
print(f"Turn 1: {response1.output_text}")

# Second turn references the first — no need to resend history
response2 = client.responses.create(
    model="gpt-4.1-mini",
    input="What embedding model would you recommend for my project?",
    previous_response_id=response1.id,  # Server remembers context
)
print(f"Turn 2: {response2.output_text}")
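
For more than two turns, the same pattern becomes a loop that threads each response ID into the next call. The sketch below takes the create function as a parameter so it can be exercised with a fake. `run_chain` and `fake_create` are hypothetical names, and the fake only mimics the `id` and `output_text` attributes used above.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def run_chain(create: Callable[..., object], model: str, prompts: list[str]) -> list[str]:
    """Send prompts in order, linking each request to the previous response."""
    previous_id: Optional[str] = None
    answers = []
    for prompt in prompts:
        kwargs = {"model": model, "input": prompt}
        if previous_id is not None:
            kwargs["previous_response_id"] = previous_id
        response = create(**kwargs)
        answers.append(response.output_text)
        previous_id = response.id  # the next turn continues from this response
    return answers


# Fake create function that records which previous ID each call received.
seen_previous_ids = []


def fake_create(**kwargs):
    seen_previous_ids.append(kwargs.get("previous_response_id"))
    n = len(seen_previous_ids)
    return SimpleNamespace(id=f"resp_{n}", output_text=f"answer {n}")


answers = run_chain(fake_create, "gpt-4.1-mini", ["My name is Bob.", "What is RAG?", "What's my name?"])
print(answers)
print(seen_previous_ids)
```

With the real client you would pass `client.responses.create` as the first argument.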

                        
                        Try It Yourself: Build a ‘writing style analyzer’ that takes a text sample and returns: (1) a description of the writing style, (2) 3 example sentences in that style about a given topic, (3) a confidence score. Use temperature=0 for analysis and temperature=0.8 for generation. Compare the outputs and explain why different temperatures suit different tasks.
                    

Next in the SDK Track

In OA Part 3: Structured Outputs & Code Generation, we’ll enforce JSON schemas, integrate with Pydantic for typed responses, and build deterministic code generation pipelines.