1. OpenAI Platform Overview
The OpenAI platform is organized into a hierarchy: Organizations contain Projects, which contain API keys with scoped permissions. Understanding this structure is critical for managing costs, access control, and rate limits across teams.
1.1 Organizations & Projects
flowchart TD
A["Organization"] --> B["Project: Production"]
A --> C["Project: Development"]
A --> D["Project: Research"]
B --> E["API Key: prod-key-001"]
B --> F["API Key: prod-key-002"]
C --> G["API Key: dev-key-001"]
D --> H["API Key: research-key-001"]
E --> I["Rate Limits & Billing"]
F --> I
| Concept | Purpose | Key Actions |
|---|---|---|
| Organization | Top-level account (company/team) | Manage members, billing, usage limits |
| Project | Isolated environment within org | Separate keys, limits, and usage tracking |
| API Key | Authentication credential | Scoped to project, rotatable, revocable |
| Service Account | Machine-to-machine auth | For CI/CD, production deployments |
1.2 Billing & Rate Limits
| Tier | RPM (Requests) | TPM (Tokens) | Access |
|---|---|---|---|
| Free | 3 | 40,000 | Limited models |
| Tier 1 | 500 | 200,000 | Most models |
| Tier 2 | 5,000 | 2,000,000 | All models |
| Tier 3 | 5,000 | 10,000,000 | All models + higher context |
| Tier 4+ | 10,000+ | 50,000,000+ | Custom limits, dedicated capacity |
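Throughput is capped by whichever limit runs out first. As a rough sketch, the arithmetic below uses the Tier 1 row from the table. The helper name `effective_rpm` and the per-request token counts are illustrative assumptions, not platform values.

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> int:
    """Return the sustainable requests per minute given both limits.

    Whichever limit is exhausted first is the binding one.
    """
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    token_bound = tpm_limit // avg_tokens_per_request
    return min(rpm_limit, token_bound)


# Tier 1 figures from the table: 500 RPM, 200,000 TPM.
# Assumed workloads: about 1,000 or 200 tokens (input + output) per request.
print(effective_rpm(500, 200_000, 1_000))  # 200: the token limit binds first
print(effective_rpm(500, 200_000, 200))    # 500: the request limit binds first
```

Long prompts therefore tend to hit the TPM ceiling well before the RPM ceiling, which is worth checking before you request a tier increase.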
1.3 Model Tiers & Access
| Category | Models | Best For |
|---|---|---|
| Frontier Reasoning | gpt-5.5, gpt-5.5-pro | Complex analysis, math, code, agentic workflows |
| Reasoning (Cost-Effective) | gpt-5.4, gpt-5.4-mini | Balanced reasoning, supports tool_search |
| Previous Reasoning | gpt-5 | Strong reasoning, coding, broad tool support |
| Non-Reasoning (Fast) | gpt-4.1, gpt-4.1-mini, gpt-4.1-nano | High throughput, 1M context, low latency |
| Multimodal | gpt-4.1 (vision), gpt-image-1 | Image understanding + generation |
| Audio | whisper-1, tts-1, tts-1-hd | Speech-to-text, text-to-speech |
| Embeddings | text-embedding-3-small/large | Semantic search, RAG |
| Realtime | gpt-realtime-2, gpt-realtime-translate, gpt-realtime-whisper | Voice agents, live translation, live transcription |
The reasoning.effort parameter on reasoning models lets you trade speed for quality, with values ranging from "low" to "xhigh".
2. SDK Installation
2.1 Python SDK
Start with the official SDK so your code stays aligned with the latest API surface, retries, streaming primitives, and typed response objects. The shell snippet below installs the package, while the Python snippet immediately verifies that your credentials and network path are correct.
# Install the OpenAI Python SDK
pip install openai
# Or with optional dependencies
pip install "openai[datalib]" # Includes pandas for batch operations
# Verify installation
python -c "import openai; print(openai.__version__)"
Once the package is installed, create a client as early as possible in your startup path. A fast model-list check is a practical smoke test because it validates authentication before you wire the SDK into larger workflows.
from openai import OpenAI
# Basic client initialization
client = OpenAI(api_key="sk-...")
# Or use environment variable (recommended)
# export OPENAI_API_KEY="sk-..."
client = OpenAI() # Reads from OPENAI_API_KEY env var
# Test the connection
response = client.models.list()
print(f"Available models: {len(response.data)}")
for model in response.data[:5]:
    print(f" - {model.id}")
2.2 TypeScript SDK
The TypeScript SDK follows the same mental model as Python: create one client, reuse it, and keep credentials on the server. This makes it easy to share architectural patterns across backend services even when your team uses multiple languages.
# Install the OpenAI TypeScript SDK
npm install openai
# Or with pnpm/yarn
pnpm add openai
This example shows the minimal Node.js startup path. In a real service, you would usually create the client once during application boot and inject it into route handlers or service classes.
import OpenAI from 'openai';
// Initialize client
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
// Test the connection
async function listModels() {
  const models = await client.models.list();
  console.log(`Available models: ${models.data.length}`);
  for (const model of models.data.slice(0, 5)) {
    console.log(` - ${model.id}`);
  }
}
listModels();
2.3 API Key Management
Authentication is straightforward, but the operational detail matters: API keys are secrets, project scoping affects usage attribution, and organization and project headers let you route requests explicitly when you work across multiple environments.
import os
from openai import OpenAI
# NEVER hardcode API keys — use environment variables or secrets managers
# Option 1: Environment variable
client = OpenAI() # Reads OPENAI_API_KEY
# Option 2: Explicit (for multi-project setups)
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    organization=os.environ.get("OPENAI_ORG_ID"),  # Optional org override
    project=os.environ.get("OPENAI_PROJECT_ID"),  # Optional project scope
)
# Option 3: Azure OpenAI (different endpoint)
from openai import AzureOpenAI
azure_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)
Store keys in .env files (excluded from version control via .gitignore), environment variables, or a secrets manager (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault). Rotate keys regularly and use project-scoped keys with the minimum required permissions.
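A minimal sketch of the "fail fast, never log the secret" habit follows. The helpers `load_api_key` and `mask_key` and the `DEMO_API_KEY` variable are hypothetical names made up for this example; they are not part of the OpenAI SDK.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment and fail fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; refusing to start without credentials")
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Demonstration with a fake value; a real key should come from your secret store.
os.environ["DEMO_API_KEY"] = "sk-demo-1234abcd"
print(mask_key(load_api_key("DEMO_API_KEY")))
```

Checking for the key at startup turns a missing secret into an immediate, readable failure instead of a 401 on the first user request.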
From Prototype to 10K RPM
A startup scaled from 10 requests/minute during prototyping to 10,000 RPM in production. Key lessons: project-scoped API keys for billing isolation, tier progression through usage milestones, and a client singleton pattern that prevented connection pool exhaustion.
3. Client Architecture
3.1 Sync vs Async Clients
Choose the sync client for scripts, CLIs, and one-off background jobs. Choose the async client for web servers, fan-out workloads, and any system that must keep many requests in flight without blocking the event loop.
from openai import OpenAI, AsyncOpenAI
# Synchronous client — blocks until response is ready
sync_client = OpenAI()
response = sync_client.responses.create(
    model="gpt-4.1-mini",
    input="Hello! What can you help me with?",
)
print(response.output_text)
The async variant uses the same API shape, which keeps the learning curve low. That symmetry is useful when a prototype starts as a script and later becomes a FastAPI or asyncio-based service.
import asyncio
from openai import AsyncOpenAI
# Async client — non-blocking, ideal for web servers and concurrent workloads
async_client = AsyncOpenAI()
async def main():
    response = await async_client.responses.create(
        model="gpt-4.1-mini",
        input="Hello! What can you help me with?",
    )
    print(response.output_text)
asyncio.run(main())
The real advantage of async appears when you parallelize independent prompts. That pattern matters for grading, evaluation, enrichment, and other batch workloads where latency is dominated by waiting on many remote calls.
import asyncio
from openai import AsyncOpenAI
# Concurrent requests with async: total latency approaches that of the slowest
# single call instead of the sum of all calls (rate limits permitting)
async_client = AsyncOpenAI()
async def generate_batch(prompts: list[str]) -> list[str]:
    """Send multiple prompts concurrently."""
    tasks = [
        async_client.responses.create(
            model="gpt-4.1-mini",
            input=prompt,
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    return [r.output_text for r in responses]
async def main():
    prompts = [
        "Summarize quantum computing in 2 sentences.",
        "Explain REST APIs in 2 sentences.",
        "Define machine learning in 2 sentences.",
    ]
    results = await generate_batch(prompts)
    for prompt, result in zip(prompts, results):
        print(f"Q: {prompt}\nA: {result}\n")
asyncio.run(main())
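An unbounded gather can trip rate limits once the prompt list grows. One common mitigation is to cap the number of in-flight requests with asyncio.Semaphore, sketched below. Here `fake_model_call` is a hypothetical stand-in for the real API call so the example runs without network access, and the cap of 2 is an arbitrary assumption.

```python
import asyncio


async def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for an API call; sleeps briefly instead of doing I/O."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def generate_bounded(prompts: list[str], max_in_flight: int = 2) -> list[str]:
    """Run all prompts concurrently, but never more than max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def worker(prompt: str) -> str:
        async with semaphore:  # Waits here while the cap is reached
            return await fake_model_call(prompt)

    # gather preserves input order, so results line up with prompts
    return await asyncio.gather(*(worker(p) for p in prompts))


results = asyncio.run(generate_bounded(["a", "b", "c", "d", "e"]))
print(results)
```

In a real service the cap would be tuned against your tier's RPM and TPM limits, not hardcoded.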
3.2 Request Lifecycle
sequenceDiagram
participant App as Your App
participant SDK as OpenAI SDK
participant API as OpenAI API
participant Model as Model
App->>SDK: client.responses.create(...)
SDK->>SDK: Validate params, build request
SDK->>API: POST /v1/responses
API->>API: Auth, rate limit check
API->>Model: Inference
Model-->>API: Generated tokens
API-->>SDK: Response object (JSON)
SDK-->>App: Response object
Keep this lifecycle in mind when debugging production issues: some failures happen before inference even starts, others happen while streaming tokens back, and the right retry strategy depends on which phase failed.
3.3 Client Configuration
Production configuration is where a simple demo becomes a durable service. Timeouts, retries, and default headers should be explicit so your deployment behaves predictably under network noise, user spikes, and support investigations.
import httpx
from openai import OpenAI
# Full client configuration for production
client = OpenAI(
    api_key="sk-...",
    organization="org-...",
    project="proj-...",
    timeout=httpx.Timeout(60.0, connect=5.0),  # 60s default for read/write/pool, 5s to connect
    max_retries=3,  # Auto-retry on transient errors
    default_headers={
        "X-Request-Source": "my-app-v2",  # Custom tracking header
    },
)
# Per-request overrides
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Hello",
    timeout=30.0,  # Override timeout for this request
)
print(response.output_text)
4. Error Handling & Retries
4.1 Error Types
Before you add generic retry middleware, understand which failures are safe to retry and which indicate a coding or configuration problem. Authentication and malformed-request errors need a fix, while rate limits and transient server failures usually need controlled backoff.
from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError
from openai import AuthenticationError, BadRequestError, NotFoundError
client = OpenAI()
try:
    response = client.responses.create(
        model="gpt-4.1-mini",
        input="Hello",
    )
    print(response.output_text)
except AuthenticationError as e:
    # 401: Invalid API key or permissions
    print(f"Auth failed: {e.message}")
except RateLimitError as e:
    # 429: Too many requests, back off and retry
    print(f"Rate limited: {e.message}")
    # SDK auto-retries with exponential backoff (up to max_retries)
except BadRequestError as e:
    # 400: Malformed request (bad params, too many tokens, etc.)
    print(f"Bad request: {e.message}")
except NotFoundError as e:
    # 404: Model not found or deprecated
    print(f"Not found: {e.message}")
except APIConnectionError as e:
    # Network error: DNS failure, timeout, connection refused (no HTTP status exists)
    print(f"Connection error: {e.message}")
except APIStatusError as e:
    # Any other non-2xx status, including 500+ server errors (transient, auto-retried).
    # APIStatusError is caught here because the base APIError has no status_code.
    print(f"API error ({e.status_code}): {e.message}")
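The retry-or-fix split described in this section can be written down as a small decision function. This is a sketch under the assumptions stated in the text (auth and malformed-request errors need a fix, rate limits and server faults need backoff). The name `should_retry` is illustrative and does not come from the SDK.

```python
from typing import Optional


def should_retry(status_code: Optional[int]) -> bool:
    """Decide whether a failed request is worth retrying.

    None means no HTTP response arrived at all (a connection-level failure).
    """
    if status_code is None:
        return True   # Network failure: the request may never have reached the API
    if status_code == 429:
        return True   # Rate limited: retry after backing off
    if status_code >= 500:
        return True   # Server-side fault: usually transient
    return False      # 400, 401, 404, ...: fix the request or configuration instead


for code in (400, 401, 404, 429, 500, 503, None):
    print(code, "->", "retry" if should_retry(code) else "do not retry")
```

Keeping this policy in one function makes it easy to audit and to unit test separately from any network code.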
4.2 Retry Strategies
The official SDK already retries a useful set of transient failures. Your main job is to choose sane limits, match timeout budgets to user expectations, and disable retries for flows where duplicate work would be harmful or confusing.
import httpx
from openai import OpenAI
# The SDK retries these automatically, with exponential backoff:
# - 408 (Request Timeout) and 409 (Conflict)
# - 429 (Rate Limit)
# - 500 and above (Server errors)
# - Connection errors and timeouts
# Configure retry behavior
client = OpenAI(
    max_retries=5,  # Default is 2
    timeout=httpx.Timeout(120.0, connect=10.0),  # Generous per-attempt timeout
)
# Disable retries for a specific request
response = client.with_options(max_retries=0).responses.create(
    model="gpt-4.1-mini",
    input="Time-sensitive request",
)
print(response.output_text)
4.3 Rate Limit Handling
Rate limiting becomes a design concern once traffic grows. The pattern below adds explicit exponential backoff on top of the SDK so you can centralize policy, emit logs, and tune user-facing behavior for busy periods.
import asyncio
from openai import AsyncOpenAI, RateLimitError
client = AsyncOpenAI()
async def call_with_backoff(input_text: str, max_retries: int = 5) -> str:
    """Custom retry logic with exponential backoff for rate limits."""
    for attempt in range(max_retries):
        try:
            response = await client.responses.create(
                model="gpt-4.1-mini",
                input=input_text,
            )
            return response.output_text
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the error to the caller
            wait_time = 2 ** attempt  # 1, 2, 4, 8 seconds with the default max_retries=5
            print(f"Rate limited. Waiting {wait_time}s (attempt {attempt + 1})")
            await asyncio.sleep(wait_time)
    return ""  # Only reached when max_retries is 0
async def main():
    result = await call_with_backoff(
        "Explain rate limiting in one paragraph."
    )
    print(result)
asyncio.run(main())
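Fixed powers of two can make many clients retry in lockstep after a shared rate-limit event. A common refinement, which the pattern above does not use, is to randomize each wait ("full jitter") and cap it. The sketch below only computes the delays. The function name, the 30-second cap, and the seeded generator are assumptions chosen for the demo.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return one randomized wait per retry attempt ("full jitter").

    Each wait is drawn uniformly from 0 up to min(cap, base * 2**attempt).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator keeps the demo repeatable; omit it in real code.
delays = backoff_delays(6, rng=random.Random(42))
for attempt, delay in enumerate(delays):
    ceiling = min(30.0, 2 ** attempt)
    print(f"attempt {attempt + 1}: ceiling {ceiling:.0f}s, drew {delay:.2f}s")
```

To use it, replace the fixed `2 ** attempt` wait with the corresponding entry from this list.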
4.4 Debugging Requests
OpenAI’s reference docs recommend logging request identifiers in production. The server returns an x-request-id, and you can also supply your own ASCII-only X-Client-Request-Id header so support and internal observability systems can correlate a failing request even when the response never reaches your app.
import uuid
from openai import OpenAI
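client = OpenAI()
# Generate a fresh ID per request; a client-level default header would
# attach the same ID to every call made through that client
request_id = str(uuid.uuid4())
response = client.responses.create(
    model="gpt-4.1-mini",
    input="Summarize the importance of request tracing in one paragraph.",
    extra_headers={"X-Client-Request-Id": request_id},
)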
# The Python SDK exposes the server request ID on the top-level response object.
print(f"Client request ID: {request_id}")
print(f"Server request ID: {response._request_id}")
print(response.output_text)
If you capture request IDs, organization, project, and rate-limit headers in your logs, production debugging becomes much faster. It lets you distinguish authentication errors, exhausted quotas, and latency regressions without guessing.
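One way to make that capture routine is to reduce each response's HTTP headers to a small log record. The sketch below assumes you already have the headers as a mapping (for example from the SDK's raw-response access) and that the rate-limit header names match OpenAI's documented `x-ratelimit-*` family. The function name and the sample values are invented for illustration.

```python
from typing import Mapping, Optional

# Assumed header names; verify them against the headers your deployment returns.
RATE_LIMIT_HEADERS = (
    "x-ratelimit-remaining-requests",
    "x-ratelimit-remaining-tokens",
)


def build_log_record(headers: Mapping[str, str], client_request_id: str) -> dict[str, Optional[str]]:
    """Extract the fields worth logging from a response's HTTP headers."""
    # Header names are case-insensitive, so normalize before looking anything up
    lowered = {name.lower(): value for name, value in headers.items()}
    record: dict[str, Optional[str]] = {
        "client_request_id": client_request_id,
        "server_request_id": lowered.get("x-request-id"),
    }
    for name in RATE_LIMIT_HEADERS:
        record[name] = lowered.get(name)  # None when the header is absent
    return record


# Hand-written sample headers; real values come from the HTTP response.
sample_headers = {
    "X-Request-Id": "req_example",
    "x-ratelimit-remaining-requests": "499",
    "x-ratelimit-remaining-tokens": "199000",
}
print(build_log_record(sample_headers, "client-123"))
```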
5. SDK Design Patterns
5.1 Singleton Client
Because the SDK manages an underlying HTTP connection pool, recreating a client on every request wastes sockets and increases latency. A singleton or application-scoped client is usually the cleanest default for server-side code.
from openai import OpenAI
# Module-level singleton — reuse across your application
# The client maintains an HTTP connection pool internally
_client: OpenAI | None = None
def get_openai_client() -> OpenAI:
    """Get or create the singleton OpenAI client."""
    global _client
    if _client is None:
        _client = OpenAI(max_retries=3)
    return _client
# Usage across your app
def summarize(text: str) -> str:
    client = get_openai_client()
    response = client.responses.create(
        model="gpt-4.1-mini",
        instructions="Summarize the following text concisely.",
        input=text,
    )
    return response.output_text
result = summarize("The OpenAI SDK provides both sync and async clients...")
print(result)
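The lazy initializer above is not guarded against two threads calling it at the same moment during startup. If that matters in your deployment, a lock with a second check is one option, sketched here. `FakeClient` is a hypothetical stand-in for `OpenAI()` so the example runs without credentials.

```python
import threading


class FakeClient:
    """Hypothetical stand-in for OpenAI(); counts how many instances get built."""
    instances = 0

    def __init__(self) -> None:
        FakeClient.instances += 1


_client = None
_client_lock = threading.Lock()


def get_client() -> FakeClient:
    """Create the client once, even if several threads call this at startup."""
    global _client
    if _client is None:              # Fast path: no lock once initialized
        with _client_lock:
            if _client is None:      # Re-check: another thread may have won the race
                _client = FakeClient()
    return _client


threads = [threading.Thread(target=get_client) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(FakeClient.instances)  # 1
```

Creating the client eagerly at import time or application boot avoids the question entirely and is often simpler.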
5.2 Service Layer Abstraction
A thin service layer keeps OpenAI-specific details out of controllers, route handlers, and business logic. That separation makes testing easier, centralizes model selection, and gives you one place to evolve prompts or swap models later.
from dataclasses import dataclass
from openai import OpenAI
@dataclass
class LLMConfig:
    model: str = "gpt-4.1-mini"
    temperature: float = 0.7
    max_tokens: int = 1024
class AIService:
    """Service layer wrapping OpenAI SDK for your application."""
    def __init__(self, config: LLMConfig | None = None):
        self.client = OpenAI()
        self.config = config or LLMConfig()
    def generate(self, system: str, user: str) -> str:
        response = self.client.responses.create(
            model=self.config.model,
            temperature=self.config.temperature,
            max_output_tokens=self.config.max_tokens,
            instructions=system,
            input=user,
        )
        return response.output_text
    def classify(self, text: str, categories: list[str]) -> str:
        # Join the names so the prompt reads "shipping, billing, ..." instead of a Python list repr
        options = ", ".join(categories)
        response = self.client.responses.create(
            model=self.config.model,
            temperature=0,
            instructions=f"Classify into one of: {options}. Reply with just the category.",
            input=text,
        )
        return response.output_text
# Usage
ai = AIService(LLMConfig(model="gpt-4.1-mini", temperature=0))
category = ai.classify("My order hasn't arrived", ["shipping", "billing", "technical"])
print(f"Category: {category}")
5.3 Testing Patterns
Mocking SDK calls is the fastest way to test your orchestration logic without paying for API calls or introducing nondeterminism. Save live model checks for a smaller integration suite, and keep unit tests focused on your code paths and failure handling.
from unittest.mock import patch, MagicMock
from openai import OpenAI
def create_mock_response(text: str):
    """Create a mock Response object for testing."""
    mock = MagicMock()
    mock.id = "resp_test123"
    mock.output_text = text
    mock.status = "completed"
    mock.usage.input_tokens = 10
    mock.usage.output_tokens = 20
    mock.usage.total_tokens = 30
    return mock
# Test with mocked responses
@patch("openai.resources.responses.Responses.create")
def test_summarize(mock_create):
    mock_create.return_value = create_mock_response("This is a summary.")
    client = OpenAI(api_key="test-key")
    response = client.responses.create(
        model="gpt-4.1-mini",
        input="Summarize this.",
    )
    assert response.output_text == "This is a summary."
    mock_create.assert_called_once()
test_summarize()
print("Test passed!")
5.4 Compatibility Strategy
The API reference also stresses backwards compatibility discipline: new fields and event types can appear over time, and model behavior can shift between snapshots. Pin explicit model versions when consistency matters, write parsers that ignore unknown response fields, and back every critical prompt with evals before you upgrade.
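A parser that ignores unknown fields can be as small as a dataclass plus a key filter. The sketch below uses a simplified, hypothetical payload shape (not the exact Responses API JSON), and the names `ResponseSummary` and `parse_summary` are made up for this example.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class ResponseSummary:
    """Only the fields this application actually depends on."""
    id: str
    status: str
    output_text: str = ""


def parse_summary(payload: dict[str, Any]) -> ResponseSummary:
    """Build a ResponseSummary, silently dropping keys we do not know about."""
    known = {f.name for f in fields(ResponseSummary)}
    return ResponseSummary(**{k: v for k, v in payload.items() if k in known})


# "brand_new_field" stands in for a field added by a future API version.
payload = {
    "id": "resp_test123",
    "status": "completed",
    "output_text": "ok",
    "brand_new_field": 1,
}
print(parse_summary(payload))
```

A strict parser that rejects unexpected keys would have raised here, which is exactly the kind of breakage this section warns against.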
6. Production Checklist
| Category | Action | Why |
|---|---|---|
| Auth | Use project-scoped keys with minimum permissions | Blast radius reduction |
| Auth | Rotate keys quarterly; use secrets manager | Limit exposure window |
| Resilience | Set max_retries=3 with appropriate timeouts | Handle transient failures |
| Resilience | Implement circuit breaker for prolonged outages | Fail gracefully |
| Cost | Set monthly budget alerts in platform dashboard | Prevent bill shock |
| Cost | Use max_output_tokens to cap response length | Cost predictability |
| Observability | Log request IDs (response._request_id) alongside response.id | Trace issues with support |
| Observability | Track token usage (response.usage) | Monitor costs and efficiency |
| Performance | Use async client for web servers | Don’t block event loop |
| Performance | Reuse client instances (singleton) | Connection pool efficiency |
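The circuit-breaker row is the least self-explanatory item in the checklist, so here is a minimal sketch of the idea. The class, its thresholds, and the injected clock are all illustrative assumptions; production code would more likely use an established resilience library.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; allow trial calls after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # Closed circuit: traffic flows normally
        # Open circuit: block until the cooldown has elapsed, then allow trial calls
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (Re)start the cooldown


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())  # False: circuit is open
now[0] = 10.0
print(breaker.allow_request())  # True: cooldown elapsed, trial call allowed
breaker.record_success()
print(breaker.allow_request())  # True: circuit closed again
```

Callers check `allow_request()` before each API call and serve a fallback while it returns False, instead of queueing requests against an outage.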
Next in the SDK Track
In OA Part 2: Responses API & Text Generation, we’ll dive into the Responses API — message structures, generation controls, streaming patterns, reasoning models (o-series), and building multi-turn conversations.