OpenAI SDK Track Part 12: Context & State Management

            
            What You’ll Learn: Every API call carries a context window — the total token budget shared between your input and the model’s output. Managing this context efficiently determines your application’s cost, latency, and conversational quality. This article covers OpenAI’s built-in state management features (prompt caching, previous_response_id, and the Conversations API), plus client-side strategies for handling long sessions that push against context limits.
        

1. Context Window Fundamentals

Every model has a fixed context window measured in tokens. This window is shared between your input (system instructions + conversation history + tools) and the model’s output (reasoning tokens + visible response). Understanding this budget is critical for designing multi-turn applications that don’t silently degrade when conversations grow long.

Model	Context Window (tokens)	Max Output Tokens	Input Cost (USD per 1M tokens)	Output Cost (USD per 1M tokens)
GPT-5.5	200K	32K	$2.00	$8.00
GPT-5.4	200K	16K	$1.00	$4.00
GPT-5.4-mini	128K	16K	$0.15	$0.60
GPT-4.1	1M	32K	$2.00	$8.00
GPT-4.1-mini	1M	32K	$0.40	$1.60
GPT-4.1-nano	1M	32K	$0.10	$0.40

The cost of a multi-turn conversation grows quadratically if you resend all prior messages each turn. In a 20-turn conversation, turn 20 resends all 19 prior exchanges as input tokens, so you pay for the same content repeatedly. This is why OpenAI provides server-side state management features: they remove the need to resend history, and a stable conversation prefix lets prompt caching discount tokens you have already paid for once.

            
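            Cost Trap: A naive chatbot resending full history at every turn spends O(n²) tokens over n turns. A 50-turn conversation that adds ~500 tokens per turn sends roughly 640K input tokens in total (500 × 50 × 51 / 2), versus 25K tokens of genuinely new content. That’s a ~25x difference, and it keeps widening as the conversation grows. Server-side state does not erase that bill (prior turns still count as input tokens), but a stable prefix lets prompt caching discount most of it.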
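The arithmetic behind a resend-everything estimate can be checked in a few lines. This is a minimal sketch under a simplifying assumption: every turn adds the same number of tokens (user message plus reply), and turn k resends all k chunks. The function names are illustrative and not part of any SDK.

```python
def resend_total(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn k resends all k chunks of history."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def new_content_total(turns: int, tokens_per_turn: int) -> int:
    """Tokens of genuinely new content across the whole conversation."""
    return turns * tokens_per_turn


full = resend_total(50, 500)
fresh = new_content_total(50, 500)
print(f"Resent history: {full:,} input tokens")
print(f"New content:    {fresh:,} tokens")
print(f"Ratio:          {full / fresh:.1f}x")
```

Doubling the number of turns roughly quadruples the first figure while only doubling the second, which is the quadratic growth described above.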
        

Context & State Management Architecture

                flowchart TD
                    A[User Message] --> B{State Strategy?}
                    B -->|Stateless| C[Resend Full History]
                    B -->|previous_response_id| D[Server Recalls Prior State]
                    B -->|Conversations API| E[Persistent Conversation Object]
                    C --> F[Pay for All Input Tokens]
                    D --> G[Send Only New Input]
                    E --> G
                    F --> H[Model Generates Response]
                    G --> H
                    H --> I{Cache Hit?}
                    I -->|Yes| J[40-80% Discount on Cached Prefix]
                    I -->|No| K[Full Price Input Tokens]
                    J --> L[Response Returned]
                    K --> L
                    L --> M[Store Response with store: true]
                    M --> N[Available for Chaining + Evals]

2. Prompt Caching

Prompt caching automatically reduces costs when multiple requests share the same prefix (system instructions, few-shot examples, or static context). OpenAI caches the input prefix of sufficiently long prompts (the documented minimum is 1,024 tokens), and subsequent requests matching that prefix are billed at a discounted rate for the cached input tokens. This article quotes a 40-80% discount; the exact rate depends on the model. Caching happens transparently: it does not depend on store: true, and you don’t manage cache keys or eviction.

            
            How Caching Works: OpenAI looks for the longest matching prefix across your requests. The system instructions, tool definitions, and any static content at the beginning of your input are the most common cache targets. The more tokens that match, the greater your discount. Cached prefixes persist for 5-60 minutes depending on traffic patterns.
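Because matching starts at the beginning of the input, the order of static and dynamic content matters. The toy sketch below uses whitespace-split words as stand-ins for tokens (an assumption made purely for illustration; real caching operates on model tokens) and a hypothetical helper, shared_prefix_len, to show how much of two requests would line up.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count how many leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STATIC = "You are a senior financial analyst . Cite specific numbers .".split()
q1 = "Apple Q1 earnings".split()
q2 = "Microsoft Q1 earnings".split()

# Static content first: the whole static block is a shared prefix
static_first = shared_prefix_len(STATIC + q1, STATIC + q2)
# Dynamic content first: the requests diverge at the very first item
dynamic_first = shared_prefix_len(q1 + STATIC, q2 + STATIC)

print(f"Static first:  {static_first} shared leading items")
print(f"Dynamic first: {dynamic_first} shared leading items")
```

The practical takeaway is to keep instructions, tool definitions, and reference material at the front of the request and append per-request content at the end.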
        

from openai import OpenAI

client = OpenAI()

# System instructions that remain constant across requests — ideal cache target
SYSTEM_INSTRUCTIONS = """You are a senior financial analyst specializing in
equity research. When analyzing companies:
1. Start with the business model and competitive moat
2. Analyze revenue growth, margins, and cash flow trends
3. Compare valuation multiples to sector peers
4. Identify key risks and catalysts
5. Provide a clear investment thesis with price target rationale

Always cite specific numbers and timeframes. Use conservative assumptions
for projections. Flag any data that seems anomalous or unreliable."""

# Request 1: First call establishes the cache
response1 = client.responses.create(
    model="gpt-4.1-mini",
    store=True,  # Store the response server-side (prompt caching does not depend on this)
    instructions=SYSTEM_INSTRUCTIONS,
    input="Analyze Apple's Q1 2026 earnings report. Revenue was $124B, up 8% YoY.",
)

print(f"Request 1 - Input tokens: {response1.usage.input_tokens}")
print(f"  Cached tokens: {response1.usage.input_tokens_details.cached_tokens}")
# First request: cached_tokens = 0 (cache miss — prefix now stored)

# Request 2: Same system instructions prefix — cache hit!
response2 = client.responses.create(
    model="gpt-4.1-mini",
    store=True,
    instructions=SYSTEM_INSTRUCTIONS,
    input="Analyze Microsoft's Q1 2026 earnings. Revenue was $72B, up 16% YoY.",
)

print(f"\nRequest 2 - Input tokens: {response2.usage.input_tokens}")
print(f"  Cached tokens: {response2.usage.input_tokens_details.cached_tokens}")
# Second request: cached_tokens > 0 on a cache hit, billed at the discounted cached rate.
# Prompts shorter than 1,024 tokens are not cached, so this short example may report 0.
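To track caching over time rather than eyeballing individual requests, the two usage fields printed above can be folded into a single ratio. This is a minimal sketch: cache_hit_ratio is a hypothetical helper, and a SimpleNamespace stands in for response.usage so the snippet runs without an API call.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Fraction of input tokens served from the prompt cache (0.0 to 1.0)."""
    details = getattr(usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.input_tokens if usage.input_tokens else 0.0


# Stand-in for response.usage (values are made up for the demo)
fake_usage = SimpleNamespace(
    input_tokens=2048,
    input_tokens_details=SimpleNamespace(cached_tokens=1536),
)
print(f"Cache hit ratio: {cache_hit_ratio(fake_usage):.0%}")
```

In a real application you would pass response.usage directly and log the ratio per request.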

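The store: true parameter stores the request/response pair on OpenAI’s servers, which is what allows later requests to reference it through previous_response_id and makes it available for evaluations, fine-tuning, and distillation. Prompt caching works independently of this setting. In the Responses API, storage is on by default, so pass store: false when stored data must not persist on OpenAI’s servers.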

Cost Optimization

Prompt Caching ROI in Production

A customer support bot with a 2,000-token system prompt handling 10,000 requests/day saves significantly with caching. Without caching: 2,000 × 10,000 = 20M input tokens/day at full price. With caching (assuming a 95% hit rate and a 50% cached-token discount): 1M tokens at full price + 19M tokens at half price = ~10.5M effective tokens. That’s a 47.5% cost reduction on system-prompt input tokens, with no code changes beyond keeping the static prompt at the start of every request.
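The same estimate can be parameterized to test other hit rates and discounts. This is a rough sketch: effective_input_tokens is a hypothetical helper, and the 95% hit rate and 50% discount are the assumptions of this scenario rather than guaranteed figures.

```python
def effective_input_tokens(
    prompt_tokens: int,
    requests: int,
    hit_rate: float,
    cached_discount: float,
) -> float:
    """Full-price-equivalent tokens after applying the cached-token discount."""
    total = prompt_tokens * requests
    cached = total * hit_rate
    uncached = total - cached
    return uncached + cached * (1 - cached_discount)


total = 2_000 * 10_000
effective = effective_input_tokens(2_000, 10_000, hit_rate=0.95, cached_discount=0.5)
savings = 1 - effective / total

print(f"Without caching: {total / 1e6:.1f}M tokens/day")
print(f"With caching:    {effective / 1e6:.1f}M effective tokens/day")
print(f"Savings:         {savings:.1%}")
```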


3. previous_response_id

The previous_response_id parameter enables server-side multi-turn conversations without resending the full history. After receiving a response, pass its id to the next request, and the server automatically prepends the prior conversation context. This removes the client-side bookkeeping and growing request payloads of client-managed state, and ensures the model sees the complete conversation including any reasoning items or tool calls.

            
            How It Works: When you pass previous_response_id, the server reconstructs the conversation from the stored response chain. You only send the new user message; prior messages, reasoning items, and tool calls are retrieved server-side. The instructions parameter is the exception: it applies only to the request that carries it, so resend it on every turn. The input token count reflects the full conversation, but the repeated prefix is eligible for cache discounts automatically.
        

from openai import OpenAI

client = OpenAI()

# Instructions are not carried over from earlier responses in a chain,
# so define them once and send them with every turn
INSTRUCTIONS = "You are a helpful travel planning assistant. Remember all preferences the user mentions."

# Turn 1: Start a conversation
response1 = client.responses.create(
    model="gpt-4.1-mini",
    store=True,
    instructions=INSTRUCTIONS,
    input="I'm planning a trip to Japan in October. I prefer quiet temples over busy tourist spots.",
)

print(f"Turn 1: {response1.output_text[:150]}...")
print(f"Response ID: {response1.id}")

# Turn 2: Chain using previous_response_id; no need to resend history
response2 = client.responses.create(
    model="gpt-4.1-mini",
    store=True,
    instructions=INSTRUCTIONS,
    previous_response_id=response1.id,  # Server recalls the prior messages
    input="What about food? I'm vegetarian and love traditional cuisine.",
)

print(f"\nTurn 2: {response2.output_text[:150]}...")
print(f"  Input tokens: {response2.usage.input_tokens}")
print(f"  Cached tokens: {response2.usage.input_tokens_details.cached_tokens}")

# Turn 3: Continue the chain; the server has the full 3-turn context
response3 = client.responses.create(
    model="gpt-4.1-mini",
    store=True,
    instructions=INSTRUCTIONS,
    previous_response_id=response2.id,
    input="Can you suggest a 5-day itinerary based on everything I've told you?",
)

print(f"\nTurn 3: {response3.output_text[:200]}...")
print(f"  Input tokens: {response3.usage.input_tokens}")
print(f"  Cached tokens: {response3.usage.input_tokens_details.cached_tokens}")
# The model knows: Japan, October, quiet temples, vegetarian, traditional cuisine

The response chain is immutable — each response points to its predecessor, forming a linked list. You cannot modify earlier messages in the chain. If you need to branch the conversation (e.g., for A/B testing different responses), create a new chain from any point by using that response’s ID as the previous_response_id.

            
            Branching Conversations: You can fork a conversation at any point. If response3 has id = "resp_abc", two different follow-up requests can both use previous_response_id: "resp_abc" to create two diverging conversation branches from the same point. This is useful for exploring alternative paths or implementing retry logic.
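Forking amounts to reusing one parent ID for several follow-up requests. The sketch below wraps that in a hypothetical helper, fork_conversation, which accepts any client object exposing responses.create (for example an OpenAI() instance); the default model name simply mirrors the examples in this article.

```python
def fork_conversation(
    client,
    parent_response_id: str,
    prompts: list[str],
    model: str = "gpt-4.1-mini",
) -> list:
    """Start one independent branch per prompt from the same parent response."""
    return [
        client.responses.create(
            model=model,
            previous_response_id=parent_response_id,  # Same parent for every branch
            input=prompt,
        )
        for prompt in prompts
    ]
```

With a real client this would be called as fork_conversation(client, response3.id, ["Make it budget-friendly.", "Make it luxury."]). Each returned response has its own ID, so either branch can be continued independently.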
        

4. Conversations API

The Conversations API provides a higher-level abstraction over previous_response_id for applications that need persistent, named conversation objects. Instead of tracking response IDs yourself, you create a conversation and append messages to it. The API manages the history, supports metadata, and provides list/delete operations for conversation lifecycle management.

from openai import OpenAI

client = OpenAI()

# Create a persistent conversation object; metadata helps you find it later
conversation = client.conversations.create(
    metadata={"user_id": "student_42", "session": "python-basics"},
)
print(f"Conversation started with ID: {conversation.id}")

# Instructions apply per request, so define them once and send them each turn
INSTRUCTIONS = "You are a Python tutor helping a beginner learn programming concepts."

# Turn 1: attach the response to the conversation instead of tracking response IDs
turn1 = client.responses.create(
    model="gpt-4.1-mini",
    conversation=conversation.id,
    instructions=INSTRUCTIONS,
    input="What is a list comprehension?",
)

print(f"Turn 1: {turn1.output_text[:150]}...")

# Continue the conversation: same conversation ID, server manages full state
turn2 = client.responses.create(
    model="gpt-4.1-mini",
    conversation=conversation.id,
    instructions=INSTRUCTIONS,
    input="Can you show me a more complex example with filtering and transformation?",
)

print(f"\nTurn 2: {turn2.output_text[:150]}...")

# Continue further: the tutor remembers all prior explanations
turn3 = client.responses.create(
    model="gpt-4.1-mini",
    conversation=conversation.id,
    instructions=INSTRUCTIONS,
    input="How does this compare to using map() and filter() functions?",
)

print(f"\nTurn 3: {turn3.output_text[:200]}...")
print(f"  Total input tokens (full context): {turn3.usage.input_tokens}")
print(f"  Cached prefix tokens: {turn3.usage.input_tokens_details.cached_tokens}")

# The conversation persists: save conversation.id and pass it again later,
# even across different sessions or server restarts

Architecture Pattern

Conversation Management in Production

A SaaS customer support platform uses the Conversations API to maintain per-ticket conversation threads. Each support ticket maps to a conversation ID stored in their database. When a customer returns to an existing ticket, the system passes the stored conversation ID in the conversation parameter of the next request, and the AI agent instantly has full context of the prior interaction, with no need to re-inject ticket history from a separate database. The platform deletes conversations after 30 days of inactivity, matching its ticket archival policy.
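A ticket-to-conversation mapping like this one could be sketched as follows. The helpers and the in-memory dict standing in for the database are hypothetical, and the sketch assumes the conversation ID is passed through the conversation parameter of responses.create on a client that exposes conversations.create.

```python
def conversation_for_ticket(client, ticket_index: dict[str, str], ticket_id: str) -> str:
    """Return the conversation ID for a ticket, creating one on first contact."""
    if ticket_id not in ticket_index:
        conversation = client.conversations.create(metadata={"ticket_id": ticket_id})
        ticket_index[ticket_id] = conversation.id  # Persist this in a real database
    return ticket_index[ticket_id]


def reply_on_ticket(
    client,
    ticket_index: dict[str, str],
    ticket_id: str,
    message: str,
    model: str = "gpt-4.1-mini",
):
    """Send a customer message within the ticket's conversation thread."""
    conversation_id = conversation_for_ticket(client, ticket_index, ticket_id)
    return client.responses.create(
        model=model,
        conversation=conversation_id,
        input=message,
    )
```

Keeping the lookup separate from the request makes it straightforward to swap the dict for a real table keyed by ticket ID.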


5. Stateful vs Stateless Strategies

Choosing between server-managed state (previous_response_id) and client-managed state (resending messages) involves tradeoffs around control, cost, privacy, and flexibility. Neither approach is universally superior — production systems often combine both depending on the conversation phase.

Factor	Server-Managed (previous_response_id)	Client-Managed (Resend Messages)
Cost	Lower in practice: the stored prefix stays stable, so prior context earns cache discounts (it still counts as input tokens)	Higher: full history billed as input every turn, and edited history breaks cache prefixes
Latency	Lower: smaller request payloads, and cached prefixes are processed faster	Higher: full history uploaded each turn, and the model re-reads any uncached messages
Control	Limited: cannot edit or remove prior messages	Full: can trim, summarize, or reorder history
Privacy	Data stored on OpenAI servers (store: true required)	Data stays client-side until sent per request (set store: false to avoid server-side storage)
Branching	Fork from any response ID	Arbitrary manipulation of history
Context Limit	Still bounded by the model window, so long chains eventually need truncation	You control the truncation strategy

from openai import OpenAI

client = OpenAI()

# Strategy 1: Server-managed state (recommended for most apps)
# Pros: Cheaper, simpler, automatic caching
# Cons: Cannot edit history, requires store: true

def server_managed_chat(messages: list[str]) -> list[str]:
    """Multi-turn chat using previous_response_id chain."""
    responses = []
    prev_id = None

    for msg in messages:
        kwargs = {
            "model": "gpt-4.1-mini",
            "store": True,
            "input": msg,
        }
        if prev_id:
            kwargs["previous_response_id"] = prev_id

        response = client.responses.create(**kwargs)
        responses.append(response.output_text)
        prev_id = response.id

    return responses


# Strategy 2: Client-managed state (for privacy-sensitive or edited conversations)
# Pros: Full control, no server storage, can modify history
# Cons: Higher cost, must implement truncation manually

def client_managed_chat(messages: list[str], max_history: int = 10) -> list[str]:
    """Multi-turn chat resending history with sliding window."""
    history = []
    responses = []

    for msg in messages:
        history.append({"role": "user", "content": msg})

        # Truncate to the last N exchanges; the odd length keeps a user message first
        truncated = history[-(max_history * 2 - 1):]

        response = client.responses.create(
            model="gpt-4.1-mini",
            store=False,  # Keep this strategy free of server-side storage
            input=truncated,
        )

        assistant_msg = {"role": "assistant", "content": response.output_text}
        history.append(assistant_msg)
        responses.append(response.output_text)

    return responses


# Example usage — both produce similar results, different cost profiles
questions = [
    "What is photosynthesis?",
    "How does it differ in C3 vs C4 plants?",
    "Which crops use C4 photosynthesis?",
]

print("=== Server-Managed ===")
results_server = server_managed_chat(questions)
for q, a in zip(questions, results_server):
    print(f"Q: {q}\nA: {a[:100]}...\n")

            
            Encrypted Reasoning & Privacy: Reasoning models used with store: false cannot rely on server-side storage to keep their reasoning between turns. For stateless operation, request encrypted reasoning items (include: ["reasoning.encrypted_content"]) and pass them back opaquely in the next request’s input. You cannot read them, but the model can decrypt them on the next turn to maintain reasoning continuity.
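A stateless turn with a reasoning model might look like the sketch below. stateless_reasoning_turn is a hypothetical helper; it assumes a client exposing responses.create, a reasoning-capable model name supplied by the caller, and the include value "reasoning.encrypted_content" to request the opaque reasoning items.

```python
def stateless_reasoning_turn(client, items: list, user_message: str, model: str):
    """Run one turn with store=False, carrying encrypted reasoning forward."""
    request_items = items + [{"role": "user", "content": user_message}]
    response = client.responses.create(
        model=model,
        store=False,  # Nothing is kept server-side
        include=["reasoning.encrypted_content"],
        input=request_items,
    )
    # response.output holds the opaque reasoning items and the visible message;
    # appending all of them preserves reasoning continuity on the next turn
    return response, request_items + list(response.output)
```

The caller owns the growing items list, so the same trimming and summarization strategies that apply to client-managed history apply here as well.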
        

6. Context Window Management

Even with server-managed state, conversations eventually hit context limits. A 200K-token window accommodates roughly 50-100 detailed back-and-forth exchanges before truncation is needed. For longer sessions, you need explicit strategies to manage what stays in context and what gets compressed or dropped.
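The 50-100 figure follows from simple division once you assume a size for a "detailed" exchange. The sketch below makes that assumption explicit (2,000 to 4,000 tokens per exchange, a guess rather than a measured value) and shows how reserving room for output shrinks the number; exchanges_that_fit is a hypothetical helper.

```python
def exchanges_that_fit(
    context_window: int,
    tokens_per_exchange: int,
    reserved_for_output: int = 0,
) -> int:
    """How many whole exchanges fit in the window after reserving output space."""
    return max(0, (context_window - reserved_for_output) // tokens_per_exchange)


for size in (2_000, 4_000):
    print(f"{size} tokens/exchange: {exchanges_that_fit(200_000, size)} exchanges")

# Reserving the model's maximum output (32K here) leaves less room for history
print(exchanges_that_fit(200_000, 4_000, reserved_for_output=32_000))
```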

Summarization Strategy

When a conversation approaches context limits, summarize older exchanges into a compact representation and use that summary as the new context anchor. This preserves the essential information while dramatically reducing token count.

from openai import OpenAI

client = OpenAI()

def summarize_conversation(messages: list[dict], max_summary_tokens: int = 500) -> str:
    """Compress a conversation history into a concise summary."""
    # Format messages for summarization
    formatted = "\n".join(
        f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
        for m in messages
    )

    response = client.responses.create(
        model="gpt-4.1-nano",  # Use cheapest model for summarization
        input=[
            {
                "role": "user",
                "content": f"""Summarize this conversation in under {max_summary_tokens} tokens.
Preserve: key facts, user preferences, decisions made, and open questions.
Discard: pleasantries, repetition, and exploratory tangents.

Conversation:
{formatted}""",
            }
        ],
    )
    return response.output_text


def managed_conversation(user_messages: list[str], context_budget: int = 50000):
    """Conversation with automatic summarization when context grows too large."""
    history = []
    summary = None
    responses = []

    for msg in user_messages:
        # Build input with optional summary prefix
        input_messages = []
        if summary:
            input_messages.append({
                "role": "user",
                "content": f"[Prior conversation summary: {summary}]",
            })
            input_messages.append({
                "role": "assistant",
                "content": "Understood. I have the context from our prior conversation.",
            })

        input_messages.extend(history[-20:])  # Last 10 exchanges
        input_messages.append({"role": "user", "content": msg})

        response = client.responses.create(
            model="gpt-4.1-mini",
            input=input_messages,
        )

        # Track history
        history.append({"role": "user", "content": msg})
        history.append({"role": "assistant", "content": response.output_text})
        responses.append(response.output_text)

        # Check if we need to summarize (approaching budget)
        total_tokens = response.usage.input_tokens + response.usage.output_tokens
        older_messages = history[:-10]
        if total_tokens > context_budget * 0.7 and older_messages:
            # Fold any earlier summary in so it is not lost, then keep recent turns
            if summary:
                older_messages = [
                    {"role": "user", "content": f"[Earlier summary: {summary}]"}
                ] + older_messages
            summary = summarize_conversation(older_messages)
            history = history[-10:]  # Keep last 5 exchanges
            print(f"  [Summarized {len(older_messages)} messages into {len(summary)} chars]")

    return responses


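# Short demo conversation; summarization only triggers once a turn's token usage
# exceeds 70% of context_budget, so pass a small budget (e.g. 2000) to see it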
long_chat = [
    "I'm building a recipe recommendation app. It should learn user preferences.",
    "The main features are: dietary restrictions, cuisine preferences, and skill level.",
    "I want to use a vector database for recipe similarity search.",
    "Should I use Pinecone or Weaviate for this use case?",
    "Let's go with Weaviate. How do I structure the schema?",
    "Now I need to add user preference learning. What ML approach works best?",
    "Can you design the full system architecture including the recommendation pipeline?",
]

results = managed_conversation(long_chat)
for q, a in zip(long_chat, results):
    print(f"Q: {q}\nA: {a[:120]}...\n")

Sliding Window with Priority Truncation

A more sophisticated approach assigns priority levels to messages and truncates low-priority content first. System instructions are always preserved, recent messages are kept next, and older messages are admitted only while budget remains, lowest priority last.

from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class PrioritizedMessage:
    role: str
    content: str
    priority: int  # 1=highest (system), 2=recent, 3=medium, 4=droppable
    tokens: int = 0

    def to_dict(self) -> dict:
        return {"role": self.role, "content": self.content}


def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 chars per token for English."""
    return len(text) // 4


def build_context_window(
    messages: list[PrioritizedMessage],
    max_tokens: int = 100000,
) -> list[dict]:
    """Build context window respecting token budget and priorities."""
    # Sort by priority (keep highest priority), then by recency for same priority
    # Priority 1 (system) always included, then 2 (recent), then 3, then 4
    budget_used = 0
    included = []

    # Pass 1: Include all priority-1 messages (system instructions)
    for msg in messages:
        msg.tokens = estimate_tokens(msg.content)
        if msg.priority == 1:
            budget_used += msg.tokens
            included.append(msg)

    # Pass 2: Include recent messages (priority 2) while the budget allows
    recent = [m for m in messages if m.priority == 2]
    for msg in recent:
        if budget_used + msg.tokens <= max_tokens:
            budget_used += msg.tokens
            included.append(msg)

    # Pass 3: Fill remaining budget with priority 3 (medium importance)
    medium = [m for m in messages if m.priority == 3]
    for msg in medium:
        if budget_used + msg.tokens <= max_tokens * 0.8:  # Reserve 20% for output
            budget_used += msg.tokens
            included.append(msg)

    # Priority 4 messages are dropped entirely when space is tight
    remaining_budget = max_tokens - budget_used
    if remaining_budget > max_tokens * 0.3:
        low = [m for m in messages if m.priority == 4]
        for msg in low:
            if budget_used + msg.tokens <= max_tokens * 0.8:
                budget_used += msg.tokens
                included.append(msg)

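    print(f"Context: {budget_used} tokens used, {len(included)}/{len(messages)} messages included")
    # Emit in original chronological order; the passes above appended by priority
    kept = {id(m) for m in included}
    return [m.to_dict() for m in messages if id(m) in kept]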


# Example: Build prioritized conversation
messages = [
    PrioritizedMessage("system", "You are an expert Python developer...", priority=1),
    PrioritizedMessage("user", "Earlier we discussed the database schema...", priority=4),
    PrioritizedMessage("assistant", "Yes, we decided on PostgreSQL with...", priority=4),
    PrioritizedMessage("user", "Then we moved to the API layer...", priority=3),
    PrioritizedMessage("assistant", "The API uses FastAPI with...", priority=3),
    PrioritizedMessage("user", "Now let's implement the auth middleware.", priority=2),
    PrioritizedMessage("assistant", "Here's the JWT middleware...", priority=2),
    PrioritizedMessage("user", "Add rate limiting to this middleware.", priority=2),
]

context = build_context_window(messages, max_tokens=50000)
response = client.responses.create(model="gpt-4.1-mini", input=context)
print(f"\nResponse: {response.output_text[:150]}...")

7. Multi-Turn Best Practices

Building reliable multi-turn applications requires attention to several patterns that prevent common failures: context pollution, instruction drift, and state confusion. These patterns apply regardless of whether you use server-managed or client-managed state.

System Instructions vs Per-Turn Instructions

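System instructions (the instructions parameter) are sent with a request and take priority over user messages for that request. They are not carried over through previous_response_id, so pass them on every turn to make them apply to the entire conversation. Per-turn instructions are injected as user messages and can be overridden or forgotten. Use system instructions for persistent behavior and per-turn instructions for task-specific guidance.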

            
            Instruction Hierarchy: System instructions are the model’s “constitution”: they take priority over user messages. Per-turn user messages are requests that can conflict with prior turns. If you notice the model “forgetting” instructions after many turns, it’s likely because rules stated in early user messages are buried under later history and receive less attention. Move critical rules into instructions rather than initial user messages.
        

Context Pollution Avoidance

Context pollution occurs when irrelevant or contradictory information accumulates in the conversation history, degrading response quality. Common sources include: verbose tool outputs that aren’t summarized, user tangents that confuse the model about the task, and stale information that was corrected in later turns but remains visible in history.

from openai import OpenAI

client = OpenAI()

# Anti-pattern: Polluted context with verbose tool outputs
# DON'T do this — raw API responses bloat context unnecessarily
polluted_history = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "assistant", "content": '{"api_response": {"coord": {"lon": 139.69, "lat": 35.69}, "weather": [{"id": 800, "main": "Clear", "description": "clear sky", "icon": "01d"}], "base": "stations", "main": {"temp": 22.5, "feels_like": 21.8, "temp_min": 20.1, "temp_max": 24.3, "pressure": 1013, "humidity": 45}, "visibility": 10000, "wind": {"speed": 3.6, "deg": 350}, "clouds": {"all": 0}, "dt": 1716652800, "sys": {"type": 2, "id": 2038398, "country": "JP", "sunrise": 1716578940, "sunset": 1716631020}, "timezone": 32400, "id": 1850144, "name": "Tokyo", "cod": 200}}'},
]

# Better pattern: Summarize tool outputs before adding to context
clean_history = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "assistant", "content": "Tokyo is currently 22.5°C with clear skies, humidity 45%, light wind at 3.6 m/s."},
]

# Best pattern: Use previous_response_id and let the model handle tool output naturally
response = client.responses.create(
    model="gpt-4.1-mini",
    store=True,
    instructions="""You are a travel assistant. When reporting data from tools:
- Summarize key facts in natural language
- Never include raw JSON in your responses
- Mention only information relevant to the user's question""",
    input="What's the weather like in Tokyo today? Should I bring an umbrella?",
)

print(f"Clean response: {response.output_text}")
# Now chain with previous_response_id — no context pollution
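When you manage history yourself, the summarizing step between the polluted and clean versions can be done in code before the tool result ever reaches the context. This is a minimal sketch: compact_weather is a hypothetical helper, and the payload shape is assumed to match the example JSON above.

```python
import json


def compact_weather(raw_json: str) -> str:
    """Reduce a verbose weather payload to the fields a user actually asked about."""
    data = json.loads(raw_json)["api_response"]
    condition = data["weather"][0]["description"]
    main = data["main"]
    return (
        f"{data['name']}: {main['temp']}°C, {condition}, "
        f"humidity {main['humidity']}%, wind {data['wind']['speed']} m/s"
    )


# Trimmed version of the raw payload, kept short for the demo
raw = (
    '{"api_response": {"weather": [{"description": "clear sky"}], '
    '"main": {"temp": 22.5, "humidity": 45}, '
    '"wind": {"speed": 3.6}, "name": "Tokyo"}}'
)
summary = compact_weather(raw)
print(summary)
print(f"{len(raw)} chars reduced to {len(summary)} chars")
```

The full payload in the anti-pattern is several times longer than this trimmed one, so the saving per tool call is larger in practice and compounds over a long session.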

Production Pattern

Hybrid State Architecture

A production coding assistant uses a hybrid approach: previous_response_id for the active session (turns 1-20), then when context grows large, it performs a background summarization call, starts a new chain with the summary injected as context, and continues with previous_response_id from the fresh start. Users experience seamless multi-hour sessions while costs remain bounded. The system monitors usage.input_tokens and triggers summarization at 60% of the model’s context window.
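The rollover decision in such a hybrid design can be kept separate from the API calls, which makes it easy to test. The sketch below is one possible shape, not the described system's actual code: ChainState, should_roll_over, build_request, and after_response are hypothetical names, and summarize is a caller-supplied function that returns a summary string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChainState:
    previous_response_id: Optional[str] = None
    pending_summary: Optional[str] = None  # Injected once when a fresh chain starts


def should_roll_over(input_tokens: int, context_window: int, threshold: float = 0.6) -> bool:
    """True when the last request used at least `threshold` of the window."""
    return input_tokens >= context_window * threshold


def build_request(state: ChainState, user_message: str, model: str = "gpt-4.1-mini") -> dict:
    """Keyword arguments for the next responses.create call."""
    kwargs: dict = {"model": model, "store": True, "input": user_message}
    if state.previous_response_id:
        kwargs["previous_response_id"] = state.previous_response_id
    elif state.pending_summary:
        # Fresh chain: lead with the summary, then the new message
        kwargs["input"] = [
            {"role": "user", "content": f"[Prior conversation summary: {state.pending_summary}]"},
            {"role": "user", "content": user_message},
        ]
    return kwargs


def after_response(
    response_id: str,
    input_tokens: int,
    context_window: int,
    summarize: Callable[[], str],
) -> ChainState:
    """Continue the chain, or drop it and carry a summary into a new one."""
    if should_roll_over(input_tokens, context_window):
        return ChainState(pending_summary=summarize())
    return ChainState(previous_response_id=response_id)
```

A driver loop would call build_request, pass the result to client.responses.create(**kwargs), then feed response.id and response.usage.input_tokens into after_response.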


            
            Try It Yourself: Build a conversation system that: (1) Uses previous_response_id for efficient multi-turn, (2) Monitors token usage per turn via response.usage, (3) Triggers automatic summarization when input tokens exceed 60% of the model’s context window, (4) Starts a fresh chain with the summary injected as the first message, (5) Logs cache hit rates to measure the cost savings from prompt caching over time.
        

Next in the Series

In Part 13: Prompt Engineering, we’ll cover systematic prompt design patterns, instruction hierarchy, few-shot strategies, and evaluation-driven prompt iteration for the Responses API.