1. The Responses API
The Responses API is OpenAI’s primary interface for text generation. It accepts a sequence of messages (system, user, assistant) and returns a model-generated completion. The API supports tool calling, structured outputs, and streaming — all through the same endpoint.
1.1 Message Structure
The first thing to internalize is that input can be either a full message array or a shorthand single string. Use the array form whenever you need explicit role control or multimodal content; use the shorthand form for small, one-off prompts.
from openai import OpenAI
client = OpenAI()
# Basic request with message array
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."},
    ],
)
print(response.output_text)
from openai import OpenAI
client = OpenAI()
# Shorthand for single user message
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain the difference between REST and GraphQL in 3 sentences.",
)
print(response.output_text)
print(f"Tokens used: {response.usage.input_tokens} in, {response.usage.output_tokens} out")
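The relationship between the two input forms can be sketched as a small normalizer. This is an illustrative helper only: to_message_array is a hypothetical name, not part of the SDK, and it assumes the shorthand string is treated as a single user message.

```python
def to_message_array(user_input):
    """Normalize shorthand string input into the explicit message-array form.

    Hypothetical helper for illustration; the SDK accepts both forms directly.
    """
    if isinstance(user_input, str):
        # Shorthand: a bare string is treated as one user message.
        return [{"role": "user", "content": user_input}]
    # Already a message array: return a shallow copy so callers can append safely.
    return list(user_input)


messages = to_message_array("Explain REST vs GraphQL in 3 sentences.")
print(messages)
```

Starting from the array form makes it easy to add a system message or further turns later without changing how the request is built.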
1.2 System Instructions
Instructions are where you define stable behavior: tone, output discipline, constraints, and role. If users should be able to change a behavior, put it in the user input; if it should stay fixed across requests, put it in instructions.
from openai import OpenAI
client = OpenAI()
# System instructions set behavior, persona, and constraints
response = client.responses.create(
    model="gpt-4.1",
    instructions="You are a senior Python developer. Always include type hints. Explain code briefly after writing it. Never use global variables.",
    input="Write a function that validates email addresses using regex.",
)
print(response.output_text)
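The instructions parameter in the Responses API plays the role of a system message: it is placed ahead of the user input automatically. Use it for behavioral constraints that should hold on every request. Note that when you chain requests with previous_response_id, instructions from the earlier response are not carried over, so pass them again on each call.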
1.3 Response Object Anatomy
A response is more than just text. The object carries usage metrics, status, model identity, and structured output items that become essential once you add tools, reasoning summaries, or server-managed conversation state.
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
    model="gpt-4.1-mini",
    input="What is 2 + 2?",
)
# Key response fields
print(f"ID: {response.id}")  # Unique response identifier
print(f"Model: {response.model}")  # Actual model used
print(f"Output: {response.output_text}")  # Generated text
print(f"Status: {response.status}")  # e.g. completed, failed, incomplete
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
# Output items (there can be several, e.g. when tools are called)
for item in response.output:
    # Only message items carry a role; tool-call and reasoning items do not
    role = getattr(item, "role", None)
    print(f"  Type: {item.type}, Role: {role}")
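To show what walking the structured output items looks like, here is a minimal sketch that rebuilds the text by hand instead of relying on output_text. It assumes message items expose a content list whose text parts have type output_text; collect_output_text is a hypothetical helper and the response object below is a fabricated stand-in so the example runs offline.

```python
from types import SimpleNamespace


def collect_output_text(response):
    """Concatenate text parts from message items, skipping other item types."""
    chunks = []
    for item in response.output:
        if item.type != "message":
            continue  # e.g. reasoning or tool-call items carry no message text
        for part in item.content:
            if part.type == "output_text":
                chunks.append(part.text)
    return "".join(chunks)


# Stand-in object shaped like a response, so the sketch runs without an API call.
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="reasoning", summary=[]),
        SimpleNamespace(
            type="message",
            role="assistant",
            content=[SimpleNamespace(type="output_text", text="2 + 2 = 4")],
        ),
    ]
)
print(collect_output_text(fake_response))
```

Filtering on item type first is the habit that matters here: once tools or reasoning are involved, not every output item is a message.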
2. Generation Controls
2.1 Temperature & Top-P
Temperature and top-p both control variation, but they do it differently. Temperature changes how sharply the model prefers likely tokens, while top-p limits the candidate pool itself. In practice, teams usually tune one of them aggressively and keep the other conservative.
from openai import OpenAI
client = OpenAI()
# Temperature ranges from 0 (most predictable) to 2 (most varied)
# Low temperature for factual/analytical tasks
response = client.responses.create(
    model="gpt-4.1-mini",
    input="What is the capital of France?",
    temperature=0,  # Near-deterministic: favors the most likely tokens, but identical output is not guaranteed
)
print(f"Low temperature: {response.output_text}")
# High temperature for creative tasks
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Write a creative metaphor for time.",
    temperature=1.2,  # More varied, surprising outputs
)
print(f"Creative: {response.output_text}")
This second snippet isolates top-p so you can see it separately from temperature. That is a useful testing habit because it keeps prompt experiments easier to interpret.
from openai import OpenAI
client = OpenAI()
# Top-p (nucleus sampling): sample only from the most likely tokens whose
# cumulative probability reaches top_p
# top_p=0.1 means only the tokens making up the top 10% of probability mass are considered
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Suggest a name for a tech startup focused on AI safety.",
    top_p=0.9,  # Consider the top 90% of probability mass
    # temperature is left at its default so top_p is the only setting being varied
)
print(response.output_text)
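To make the difference between the two knobs concrete, here is a toy sketch of both operations on a made-up four-token distribution. It illustrates the general technique (temperature-scaled softmax and nucleus filtering), not how any particular model implements sampling; the logits and function names are invented for the example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize the survivors; every other token gets probability 0.
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
sharp = apply_temperature(toy_logits, 0.5)
flat = apply_temperature(toy_logits, 1.5)
nucleus = top_p_filter(apply_temperature(toy_logits, 1.0), 0.9)
print("T=0.5:", [round(p, 3) for p in sharp])
print("T=1.5:", [round(p, 3) for p in flat])
print("top_p=0.9 at T=1.0:", [round(p, 3) for p in nucleus])
```

Lower temperature concentrates probability on the top token while leaving every token possible; top-p removes the tail entirely and leaves the relative order of the survivors unchanged.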
2.2 Token Limits & Stop Sequences
Output limits are part cost control and part UX control. They prevent a model from running away with verbose completions and they give your application a predictable ceiling when you budget latency or downstream post-processing.
from openai import OpenAI
client = OpenAI()
# max_output_tokens caps response length
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Write a comprehensive essay about climate change.",
    max_output_tokens=200,  # Hard cap at 200 output tokens
)
print(f"Output ({response.usage.output_tokens} tokens): {response.output_text}")
# Check if response was truncated
if response.status == "incomplete":
    print("Response was truncated due to token limit")
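A status of incomplete does not by itself say why generation stopped. The sketch below reads the reason from the incomplete_details field, assuming it carries a reason such as max_output_tokens; truncation_reason is a hypothetical helper, and the objects are fabricated stand-ins so the example runs without an API call.

```python
from types import SimpleNamespace


def truncation_reason(response):
    """Return why a response stopped early, or None if it did not."""
    if response.status != "incomplete":
        return None
    details = getattr(response, "incomplete_details", None)
    return getattr(details, "reason", None) or "unknown"


# Fabricated stand-ins shaped like response objects.
cut_off = SimpleNamespace(
    status="incomplete",
    incomplete_details=SimpleNamespace(reason="max_output_tokens"),
)
finished = SimpleNamespace(status="completed", incomplete_details=None)
print(truncation_reason(cut_off))
print(truncation_reason(finished))
```

Distinguishing the reason lets an application decide whether to retry with a larger budget or surface the partial text as-is.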
2.3 Deterministic Outputs
Perfect reproducibility is never the right mental model for generative systems, but you can make outputs much more stable. Low temperature, constrained prompts, structured outputs, and consistent server-side context all work together better than any single setting alone.
from openai import OpenAI
client = OpenAI()
# For more stable outputs: temperature=0 and identical inputs
# (no seed is set here, so treat matching outputs as likely, not guaranteed)
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input="Generate a 5-word product tagline for running shoes.",
    temperature=0,
    store=False,  # Don't persist this response for later retrieval
)
response2 = client.responses.create(
    model="gpt-4.1-mini",
    input="Generate a 5-word product tagline for running shoes.",
    temperature=0,
    store=False,
)
print(f"Response 1: {response1.output_text}")
print(f"Response 2: {response2.output_text}")
print(f"Identical: {response1.output_text == response2.output_text}")
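Comparing two calls gives only a yes/no answer. A slightly more informative check is to measure how often repeated outputs agree; the sketch below does this on made-up sample strings standing in for repeated API results. The agreement_rate helper is a hypothetical name introduced for illustration.

```python
from collections import Counter


def agreement_rate(outputs):
    """Fraction of outputs equal to the most common one (1.0 = fully stable)."""
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [text.strip() for text in outputs]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


# Made-up sample outputs standing in for repeated calls with the same prompt.
samples = [
    "Run farther. Feel lighter. Always.",
    "Run farther. Feel lighter. Always.",
    "Run farther. Feel lighter. Always.",
    "Go the distance, every day.",
]
print(f"Agreement: {agreement_rate(samples):.2f}")
```

Tracking this number across prompt or parameter changes shows whether a change actually made outputs more stable.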
Content Generation at Scale
A media company generates 500 article summaries daily using the Responses API. Their optimizations: batching similar requests, caching common prefixes, and using temperature=0.3 for factual content versus 0.9 for creative headlines. Result: a 60% cost reduction. The savings come from batching and prefix caching; temperature shapes output variability rather than per-token cost.
3. Streaming Responses
3.1 Basic Streaming
Streaming changes the feel of an application more than almost any model knob. Instead of waiting for the whole answer, you can render partial text immediately and keep the UI responsive even for longer generations.
from openai import OpenAI
client = OpenAI()
# Stream tokens as they're generated, which keeps UIs responsive
stream = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain quantum entanglement step by step.",
    stream=True,
)
full_response = ""
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_response += event.delta
print(f"\n\nTotal length: {len(full_response)} chars")
3.2 Async Streaming
Async streaming matters once you have concurrent sessions or websockets in your own app. It lets you consume events without blocking unrelated work, which is exactly what you need for chat servers and realtime dashboards.
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI()
async def stream_response(prompt: str):
    """Stream response tokens asynchronously."""
    stream = await client.responses.create(
        model="gpt-4.1-mini",
        input=prompt,
        stream=True,
    )
    full_text = ""
    async for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            full_text += event.delta
    print(f"\n--- Done ({len(full_text)} chars) ---")
    return full_text


asyncio.run(stream_response("Write a haiku about programming."))
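The benefit for concurrent sessions is easier to see with more than one stream in flight. The sketch below uses a stand-in async generator (fake_stream, a hypothetical name) instead of the real API so it runs offline; the session names and text chunks are invented.

```python
import asyncio


async def fake_stream(chunks, delay=0.01):
    """Stand-in for an API stream: yields text chunks with a small pause."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def consume(session_id, stream):
    """Collect one session's chunks; other sessions run while this one waits."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return session_id, "".join(parts)


async def main():
    # Two hypothetical chat sessions streaming at the same time.
    results = await asyncio.gather(
        consume("session-a", fake_stream(["Code ", "flows ", "like water"])),
        consume("session-b", fake_stream(["Bugs ", "hide ", "in plain sight"])),
    )
    return dict(results)


print(asyncio.run(main()))
```

In a real server, each consume call would wrap a stream returned by the async client, and the per-chunk handling would forward text to that session's websocket.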
3.3 Streaming with Event Types
Real production streams are event streams, not just token streams. Handle lifecycle events explicitly so you can measure usage, surface errors correctly, and know when to persist the final answer versus interim deltas.
from openai import OpenAI
client = OpenAI()
# Handle different event types during streaming
stream = client.responses.create(
    model="gpt-4.1-mini",
    input="List 5 Python design patterns with brief descriptions.",
    stream=True,
)
for event in stream:
    match event.type:
        case "response.created":
            print(f"[Started] Response ID: {event.response.id}")
        case "response.output_text.delta":
            print(event.delta, end="", flush=True)
        case "response.completed":
            usage = event.response.usage
            print(f"\n[Done] {usage.input_tokens} in, {usage.output_tokens} out")
        case "response.failed":
            print(f"[Error] {event.response.error}")
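For persistence and metrics it often helps to separate event handling from printing. The sketch below folds the same four event types into a single result record; collect_stream is a hypothetical helper, and the events are fabricated stand-ins shaped like the ones handled above so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(events):
    """Fold a stream of events into final text, usage, and error state."""
    result = {"response_id": None, "text": "", "usage": None, "error": None}
    for event in events:
        if event.type == "response.created":
            result["response_id"] = event.response.id
        elif event.type == "response.output_text.delta":
            result["text"] += event.delta
        elif event.type == "response.completed":
            result["usage"] = event.response.usage
        elif event.type == "response.failed":
            result["error"] = event.response.error
    return result


# Fabricated events; a real stream would come from responses.create(stream=True).
fake_events = [
    SimpleNamespace(type="response.created", response=SimpleNamespace(id="resp_demo")),
    SimpleNamespace(type="response.output_text.delta", delta="Singleton, "),
    SimpleNamespace(type="response.output_text.delta", delta="Factory"),
    SimpleNamespace(
        type="response.completed",
        response=SimpleNamespace(usage=SimpleNamespace(input_tokens=12, output_tokens=4)),
    ),
]
summary = collect_stream(fake_events)
print(summary["text"], summary["usage"].output_tokens)
```

Persisting only when usage is set and error is None is one simple way to avoid saving half-finished answers.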
4. Reasoning Models
4.1 o3 & o4-mini
Reasoning models are best reserved for tasks where deliberate intermediate reasoning pays for itself: complex coding, math, multi-step decisions, and situations where a shallow fast answer is often wrong.
from openai import OpenAI
client = OpenAI()
# Reasoning models "think" before responding, which helps on complex tasks
response = client.responses.create(
    model="o4-mini",
    input="A farmer has 17 sheep. All but 9 die. How many are left?",
    reasoning={"summary": "auto"},  # Request a summary; without this it is typically empty
)
print(f"Answer: {response.output_text}")
# Access the reasoning summary (when available)
for item in response.output:
    if item.type == "reasoning":
        for part in item.summary:  # summary is a list of text parts
            print(f"Reasoning: {part.text}")
4.2 Reasoning Effort
Reasoning effort is effectively a budget for deliberation. That gives you a way to match task complexity to latency and cost instead of assuming every prompt deserves the same level of computational work.
from openai import OpenAI
client = OpenAI()
# Control how much "thinking" the model does
# low = fast but shallow, medium = balanced, high = thorough
# Quick factual lookups: low effort
response = client.responses.create(
    model="o4-mini",
    input="What year was Python created?",
    reasoning={"effort": "low"},
)
print(f"Low effort: {response.output_text}")
# Complex multi-step problem: high effort
response = client.responses.create(
    model="o4-mini",
    input="""Given a sorted array of integers and a target sum, find all unique
pairs that sum to the target. Analyze time and space complexity.
Provide the optimal solution.""",
    reasoning={"effort": "high"},
)
print(f"High effort: {response.output_text}")
4.3 When to Use Reasoning Models
| Use Case | Model | Why |
|---|---|---|
| Simple Q&A, chat | gpt-4.1-mini | Fast, cheap, sufficient quality |
| Code generation | gpt-4.1 | Good balance of speed and accuracy |
| Math/logic puzzles | o4-mini | Deliberative reasoning needed |
| Research analysis | o3 | Deep multi-step reasoning |
| Complex code review | o4-mini (high) | Catches subtle bugs |
| Scientific reasoning | o3 | Multi-step verification |
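The table above can be turned into a small routing helper so the choice lives in one place. This is a sketch only: the dictionary keys and the request_options function are hypothetical labels introduced here, and the fallback to the cheapest row is a design assumption.

```python
# Routing table copied from the rows above; the keys are hypothetical labels.
MODEL_ROUTES = {
    "simple_qa": {"model": "gpt-4.1-mini"},
    "code_generation": {"model": "gpt-4.1"},
    "math_logic": {"model": "o4-mini"},
    "research_analysis": {"model": "o3"},
    "complex_code_review": {"model": "o4-mini", "reasoning": {"effort": "high"}},
    "scientific_reasoning": {"model": "o3"},
}


def request_options(use_case):
    """Return request keyword arguments for a use case, defaulting to the cheapest row."""
    return dict(MODEL_ROUTES.get(use_case, MODEL_ROUTES["simple_qa"]))


print(request_options("complex_code_review"))
print(request_options("something_else"))
```

The returned dictionary can be unpacked into a request, for example client.responses.create(input=prompt, **request_options("math_logic")).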
5. Multi-Turn Conversations
5.1 Stateless Conversation Pattern
The simplest mental model is still the most reliable one: every request is self-contained unless you explicitly carry context forward. That keeps application behavior easy to reason about and makes truncation or summarization strategies visible in your own code.
from openai import OpenAI
client = OpenAI()
# This pattern is stateless: your code sends the full conversation on every request
conversation = [
{"role": "user", "content": "My name is Alice and I'm building a chatbot."},
]
# Turn 1
response = client.responses.create(model="gpt-4.1-mini", input=conversation)
assistant_msg = response.output_text
conversation.append({"role": "assistant", "content": assistant_msg})
print(f"Assistant: {assistant_msg}")
# Turn 2 — include full history
conversation.append({"role": "user", "content": "What's my name?"})
response = client.responses.create(model="gpt-4.1-mini", input=conversation)
print(f"Assistant: {response.output_text}") # Should remember "Alice"
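The append-and-resend bookkeeping above can be wrapped in a small class. The sketch below takes the API call as an injected function so it can run offline; Conversation and echo_history_size are hypothetical names, and the stand-in simply reports how many messages it was given.

```python
class Conversation:
    """Keeps the message history client-side and resends it on every turn."""

    def __init__(self, send, system_prompt=None):
        # `send` is any callable that takes a message list and returns reply text.
        self._send = send
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))  # pass a copy of the full history
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_history_size(messages):
    """Stand-in for the API call: reports how many messages it received."""
    return f"I received {len(messages)} message(s)."


chat = Conversation(echo_history_size, system_prompt="You are a helpful assistant.")
print(chat.ask("My name is Alice."))
print(chat.ask("What's my name?"))
```

With the real client, send could be a function that calls client.responses.create(model="gpt-4.1-mini", input=messages) and returns its output_text.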
5.2 Context Window Management
As conversations grow, the hard problem is no longer generation quality but context hygiene. You need rules for trimming, summarizing, or offloading older state so useful context survives without wasting tokens on stale turns.
from openai import OpenAI
client = OpenAI()
def count_tokens_approx(messages: list[dict]) -> int:
    """Rough token estimate: ~4 chars per token."""
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // 4


def trim_conversation(messages: list[dict], max_tokens: int = 100000) -> list[dict]:
    """Keep system message + most recent messages within token budget."""
    if not messages:
        return messages
    # Always keep system message (if present)
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]
    # Trim oldest messages until within budget
    while count_tokens_approx(system_msgs + other_msgs) > max_tokens and len(other_msgs) > 2:
        other_msgs.pop(0)  # Remove oldest non-system message
    return system_msgs + other_msgs


# Usage
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    # ... many more turns ...
]
trimmed = trim_conversation(conversation, max_tokens=50000)
response = client.responses.create(model="gpt-4.1-mini", input=trimmed)
print(response.output_text)
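Trimming drops old turns outright. Summarizing is the alternative mentioned above, and it can be sketched as follows. The compact_history helper, the choice to store the summary as a user message, and the naive_summarize stand-in are all assumptions for illustration; in practice the summarizer would likely be another model call.

```python
def compact_history(messages, keep_recent, summarize):
    """Replace all but the last `keep_recent` non-system turns with one summary message."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]
    if len(other_msgs) <= keep_recent:
        return list(messages)  # nothing old enough to fold away
    split = len(other_msgs) - keep_recent
    old, recent = other_msgs[:split], other_msgs[split:]
    summary = {
        "role": "user",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    return system_msgs + [summary] + recent


def naive_summarize(old_messages):
    """Stand-in summarizer: keeps the first 40 characters of each old message."""
    return " | ".join(m["content"][:40] for m in old_messages)


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "I am building a chatbot."},
    {"role": "assistant", "content": "Great, what stack are you using?"},
    {"role": "user", "content": "Python."},
]
compacted = compact_history(history, keep_recent=2, summarize=naive_summarize)
for message in compacted:
    print(message["role"], "->", message["content"])
```

Unlike pure trimming, this keeps early facts such as the user's name available to later turns, at the cost of an extra summarization step.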
Use the previous_response_id parameter to chain responses server-side, letting OpenAI manage conversation state. This avoids resending the full history on each request and works well with prompt caching, since each turn repeats the same prefix. Two caveats: the earlier response must have been stored (storage is on by default), and the prior turns in the chain are still billed as input tokens.
Server-managed state is especially helpful when you want cleaner request payloads or better caching on repeated prefixes. The pattern below shows the minimum viable form: start one response, then continue the thread by referencing its response ID instead of resending every past message.
from openai import OpenAI
client = OpenAI()
# Chain responses using previous_response_id (server-managed conversation)
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input="My name is Bob. I'm working on a RAG system.",
)
print(f"Turn 1: {response1.output_text}")
# Second turn references the first, so there is no need to resend history
response2 = client.responses.create(
    model="gpt-4.1-mini",
    input="What embedding model would you recommend for my project?",
    previous_response_id=response1.id,  # Server remembers context
)
print(f"Turn 2: {response2.output_text}")
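For more than two turns, the chaining step generalizes to a loop that carries the latest response ID forward. The sketch below takes the create function as a parameter so it can be exercised with a stand-in; run_chain and make_fake_create are hypothetical helpers, and the fake only records its arguments and returns made-up IDs.

```python
from types import SimpleNamespace


def run_chain(create, model, turns):
    """Send turns in order, linking each request to the previous response ID."""
    previous_id = None
    replies = []
    for text in turns:
        kwargs = {"model": model, "input": text}
        if previous_id is not None:
            kwargs["previous_response_id"] = previous_id
        response = create(**kwargs)
        previous_id = response.id
        replies.append(response.output_text)
    return replies


def make_fake_create():
    """Build a stand-in for client.responses.create that records each call."""
    calls = []

    def fake_create(**kwargs):
        calls.append(kwargs)
        return SimpleNamespace(id=f"resp_{len(calls)}", output_text=f"reply {len(calls)}")

    return fake_create, calls


fake_create, calls = make_fake_create()
print(run_chain(fake_create, "gpt-4.1-mini", ["My name is Bob.", "What is my name?"]))
print(calls[1]["previous_response_id"])
```

With the real client, passing client.responses.create as the create argument would send each turn with only the new input and the previous response ID.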
Next in the SDK Track
In OA Part 3: Structured Outputs & Code Generation, we’ll enforce JSON schemas, integrate with Pydantic for typed responses, and build deterministic code generation pipelines.