1. Fine-Tuning
Fine-tuning customizes a base model with your domain-specific data. It improves quality for specialized tasks, reduces prompt length (saving tokens), and creates consistent behavior that prompt engineering alone can’t achieve.
The workflow is easiest to reason about in three phases: create clean JSONL training data, launch a job against a base model that is already close to your use case, and then evaluate the new model before letting it anywhere near production traffic.
import json

# Step 1: Prepare training data in JSONL format
# (two examples shown for brevity; a real job needs at least 10)
training_data = [
    {"messages": [
        {"role": "system", "content": "You are a customer support agent for TechCorp."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "To reset your password: 1) Go to settings.techcorp.com 2) Click 'Forgot Password' 3) Enter your email 4) Check inbox for reset link. Need more help?"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a customer support agent for TechCorp."},
        {"role": "user", "content": "My subscription didn't renew"},
        {"role": "assistant", "content": "I'll help with your subscription. First, let me check: 1) Is your payment method current? 2) Did you receive any failed payment emails? Please share your account email and I'll investigate further."},
    ]},
]

# Write training file: one JSON object per line
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Training file created with {len(training_data)} examples")
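Since most of the quality work happens in the dataset, a quick structural check before upload can catch malformed lines early. The following is a minimal sketch, not an official validator: the function name `validate_chat_jsonl` and the specific rules (only `system`/`user`/`assistant` roles, last message from the assistant) are assumptions chosen to match the examples above.

```python
import json

# Assumption: training data only uses these three roles, as in the examples above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((number, "every message must be an object"))
            continue
        roles = [m.get("role") for m in messages]
        unknown = [r for r in roles if r not in ALLOWED_ROLES]
        if unknown:
            problems.append((number, f"unexpected roles: {unknown}"))
        elif roles[-1] != "assistant":
            problems.append((number, "last message is not from the assistant"))
        elif any(not isinstance(m.get("content"), str) or not m["content"].strip()
                 for m in messages):
            problems.append((number, "empty or non-string content"))
    return problems
```

To use it, pass the open file handle (`validate_chat_jsonl(open("training_data.jsonl"))`) and refuse to upload if the returned list is non-empty.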
The next example turns the JSONL file into a real training job. Notice that the job itself is lightweight to create; most of the real quality work happened in dataset design, coverage, and consistency before the upload step ever started.
from openai import OpenAI

client = OpenAI()

# Step 2: Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"File uploaded: {training_file.id}")

# Step 3: Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini",  # The API may expect a dated snapshot name here
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.8,
        "batch_size": "auto",
    },
    suffix="techcorp-support",  # Custom model name suffix
)
print(f"Fine-tuning job created: {job.id}")
print(f"Status: {job.status}")
Training is asynchronous, so monitoring matters. In practice you should store the job ID, poll it from background workers, and archive both the events and the resulting model name so future regressions can be traced back to the exact fine-tuning run that introduced them.
from openai import OpenAI

client = OpenAI()

# Step 4: Monitor training progress
job_id = "ftjob-abc123"  # From previous step

# Check status
job = client.fine_tuning.jobs.retrieve(job_id)
print(f"Status: {job.status}")
print(f"Model: {job.fine_tuned_model}")

# List training events
events = client.fine_tuning.jobs.list_events(job_id, limit=10)
for event in events.data:
    print(f"  [{event.created_at}] {event.message}")

# Step 5: Use fine-tuned model (fine_tuned_model stays None until the job succeeds)
if job.fine_tuned_model:
    response = client.responses.create(
        model=job.fine_tuned_model,
        input="How do I update my billing address?",
    )
    print(f"\nFine-tuned response: {response.output_text}")
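The background polling described above can be sketched as a small helper that a worker calls with the stored job ID. This is an illustrative sketch rather than an SDK feature: `wait_for_job`, its parameters, and the polling interval are hypothetical, and the `retrieve` callable is injected so the loop can be exercised without network access.

```python
import time

# Terminal statuses for fine-tuning jobs; anything else means "still in progress".
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30.0, max_polls=240, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status.

    Raises TimeoutError if the job is still running after max_polls attempts.
    """
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"{job_id} still running after {max_polls} polls")
```

In a worker this would be called as `wait_for_job(client.fine_tuning.jobs.retrieve, "ftjob-abc123")`, after which the returned job's ID, status, and `fine_tuned_model` can be archived together.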
2. Evaluation & Prompt Optimizer
Evaluation is how you keep model changes honest. The docs now emphasize eval-first development: define the task schema, score samples with rubric models or graders, compare candidates, and only then promote a prompt or fine-tuned model. This same discipline is what makes prompt optimizer output useful rather than just interesting.
from openai import OpenAI

client = OpenAI()

# Create an evaluation to measure model quality
eval_job = client.evals.create(
    name="support-quality-v1",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "input": {"type": "string"},
                "expected_output": {"type": "string"},
            },
            "required": ["input", "expected_output"],
        },
        "include_sample_schema": True,  # Exposes {{sample.*}} to the grader
    },
    testing_criteria=[
        {
            "type": "score_model",
            "name": "helpfulness",
            "model": "gpt-4.1",
            "input": [
                {"role": "system", "content": "Rate the helpfulness of the response on a scale of 1-5."},
                {"role": "user", "content": "User: {{item.input}}\nExpected: {{item.expected_output}}\nActual: {{sample.output_text}}\n\nScore (1-5):"},
            ],
            "range": [1, 5],  # Match the 1-5 rubric so the threshold below is meaningful
            "pass_threshold": 4.0,
        },
    ],
)
print(f"Eval created: {eval_job.id}")
Once the eval exists, feed it realistic test cases that reflect the edge cases your users actually trigger. That is also where graders become valuable: instead of asking only “did the output look okay?”, you can score groundedness, policy adherence, latency, or tool accuracy with repeatable criteria.
from openai import OpenAI

client = OpenAI()

# Run the evaluation with test data
eval_run = client.evals.runs.create(
    eval_id="eval-abc123",  # From the eval created earlier
    data_source={
        # A "responses" data source samples the model for each item,
        # which is what populates {{sample.output_text}} for the grader
        "type": "responses",
        "model": "gpt-4.1-mini",
        "input_messages": {
            "type": "template",
            "template": [
                {"role": "user", "content": "{{item.input}}"},
            ],
        },
        "source": {
            "type": "file_content",
            "content": [
                {"item": {"input": "Reset my password", "expected_output": "Go to settings, click Forgot Password, enter email, check inbox."}},
                {"item": {"input": "Cancel subscription", "expected_output": "I can help cancel. Please confirm your account email and reason for canceling."}},
            ],
        },
    },
)
print(f"Eval run started: {eval_run.id}")
print(f"Status: {eval_run.status}")
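The "compare candidates, then promote" step can be sketched locally once grader scores are in hand. This is a minimal sketch under stated assumptions: the per-candidate score lists are assumed to have been exported from eval runs already, and `summarize_scores`, `pick_candidate`, and the 90% pass-rate bar are hypothetical names and values, not part of the Evals API.

```python
from statistics import mean


def summarize_scores(scores, pass_threshold=4.0):
    """Return the mean score and pass rate for one candidate's grader scores."""
    if not scores:
        raise ValueError("no scores to summarize")
    passed = sum(1 for score in scores if score >= pass_threshold)
    return {"mean": mean(scores), "pass_rate": passed / len(scores)}


def pick_candidate(results, pass_threshold=4.0, min_pass_rate=0.9):
    """Pick the candidate with the best pass rate (mean breaks ties).

    Returns None when even the best candidate misses min_pass_rate,
    which signals that nothing should be promoted.
    """
    summaries = {name: summarize_scores(s, pass_threshold) for name, s in results.items()}
    best = max(summaries, key=lambda n: (summaries[n]["pass_rate"], summaries[n]["mean"]))
    return best if summaries[best]["pass_rate"] >= min_pass_rate else None
```

Returning `None` rather than the least-bad candidate keeps the promotion decision explicit: a model that merely beats the baseline but misses the bar stays out of production.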
Domain-Specific Medical Coding
A healthcare company fine-tuned GPT-4-mini on 10,000 examples of medical procedure descriptions mapped to ICD-10 codes. The fine-tuned model achieved 97% accuracy vs 78% for the base model with prompt engineering alone. Cost: $200 for training, saving $50K/year in manual coding labor.
3. Batch API
Batch is not just a cost lever; it is also a workflow separation tool. Anything that does not need an immediate user response should be evaluated for batch execution first, especially enrichment jobs, backfills, offline grading, and large-scale prompt comparisons.
from openai import OpenAI
import json

client = OpenAI()

# Step 1: Prepare batch requests in JSONL format
requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/responses",
     "body": {"model": "gpt-4.1-mini", "input": f"Classify this review sentiment: '{review}'"}}
    for i, review in enumerate([
        "Great product, highly recommend!",
        "Terrible experience, want a refund.",
        "It's okay, nothing special.",
    ])
]
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload and create batch
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/responses",
    completion_window="24h",
)
print(f"Batch created: {batch.id} (status: {batch.status})")
After submission, treat the batch result file like any other pipeline artifact. Parse it, join it back to your source IDs, and persist failures separately so you can replay only the bad items instead of rerunning the entire workload.
from openai import OpenAI
import json

client = OpenAI()

# Step 3: Check batch status and retrieve results
batch_id = "batch_abc123"
batch = client.batches.retrieve(batch_id)
print(f"Status: {batch.status}")
print(f"Completed: {batch.request_counts.completed}/{batch.request_counts.total}")

# When complete, download results (one JSON object per line)
if batch.status == "completed" and batch.output_file_id:
    content = client.files.content(batch.output_file_id)
    results = [json.loads(line) for line in content.text.strip().split("\n")]
    for result in results:
        print(f"  {result['custom_id']}: {result['response']['body']['output'][0]['content'][0]['text'][:60]}...")
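The "persist failures separately and replay only the bad items" advice can be sketched as two small functions. This is an illustrative sketch, assuming each output line carries `custom_id`, a `response` object with `status_code` and `body`, and an `error` field; the helper names `split_batch_results` and `build_retry_lines` are hypothetical. Requests that fail outright may also land in a separate error file (`batch.error_file_id`), which could be fed through the same function.

```python
import json


def split_batch_results(output_lines):
    """Separate successful results from failures, keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in output_lines:
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") is None and response.get("status_code") == 200:
            succeeded[record["custom_id"]] = response["body"]
        else:
            # Keep whatever diagnostic is available for later inspection
            failed[record["custom_id"]] = record.get("error") or response.get("body")
    return succeeded, failed


def build_retry_lines(original_requests, failed):
    """Re-serialize only the original requests whose custom_id failed."""
    return [json.dumps(req) for req in original_requests if req["custom_id"] in failed]
```

Writing the output of `build_retry_lines` to a new JSONL file gives a replay batch that contains only the failed items, and the `custom_id` keys in `succeeded` are what join results back to the source records.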
4. Prompt Caching
Prompt caching rewards stable prefixes. Long instructions, repeated policy blocks, and shared context turn from a cost burden into an optimization opportunity when you design your request structure deliberately and keep the reusable prefix identical between calls.
from openai import OpenAI

client = OpenAI()

# Prompt caching is automatic for requests that share an identical prefix (1,024+ tokens)
# The system instruction and early messages get cached on first call
long_system_prompt = """You are an expert legal assistant specializing in contract review.
You follow these rules:
1. Always cite specific clauses
2. Flag potential risks in red
3. Suggest alternative language
4. Rate overall risk (low/medium/high)
...(imagine 2000+ tokens of detailed instructions)..."""

# First call: full price (cache miss)
response1 = client.responses.create(
    model="gpt-4.1",
    instructions=long_system_prompt,
    input="Review clause 4.2 about liability limitations.",
)
print(f"Call 1 - Cached tokens: {response1.usage.input_tokens_details.cached_tokens}")

# Second call with the same prefix: cached input tokens are billed at a discount (cache hit)
response2 = client.responses.create(
    model="gpt-4.1",
    instructions=long_system_prompt,  # Same prefix = cache hit
    input="Now review clause 7.1 about termination.",
)
print(f"Call 2 - Cached tokens: {response2.usage.input_tokens_details.cached_tokens}")
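Designing for a stable prefix mostly comes down to ordering: shared content first, per-request content last. The sketch below illustrates that ordering and a simple way to track how much of the input was served from cache. The helper names (`build_prompt`, `shared_prefix_chars`, `cached_fraction`) are hypothetical, and character-level prefix length is only a rough proxy for the token-level prefix the cache actually matches on.

```python
import os


def build_prompt(shared_policy, per_request_context, question):
    """Put the stable block first and the per-request parts last."""
    return f"{shared_policy}\n\n{per_request_context}\n\nQuestion: {question}"


def shared_prefix_chars(prompt_a, prompt_b):
    """Length of the identical leading text of two prompts."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


def cached_fraction(cached_tokens, input_tokens):
    """Share of input tokens served from cache, from the usage fields of a response."""
    return cached_tokens / input_tokens if input_tokens else 0.0
```

Feeding `response.usage.input_tokens_details.cached_tokens` and `response.usage.input_tokens` into `cached_fraction` on every call gives a metric to alert on: a sudden drop usually means something variable (a timestamp, a user name) has crept into the prefix.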
5. Moderation API
Moderation belongs on both sides of generation. Input checks stop abusive requests before they hit expensive models or tools, and output checks catch unsafe or policy-violating generations before they reach end users. The pair is much stronger than either one alone.
from openai import OpenAI

client = OpenAI()

# Check content for policy violations
moderation = client.moderations.create(
    model="omni-moderation-latest",
    input="This is a normal message about programming best practices.",
)
result = moderation.results[0]
print(f"Flagged: {result.flagged}")
print("Categories:")
# model_dump() yields {field_name: bool}; the same field names exist on category_scores
for category, flagged in result.categories.model_dump().items():
    if flagged:
        score = getattr(result.category_scores, category)
        print(f"  {category}: {score:.4f}")
from openai import OpenAI

client = OpenAI()


def safe_generate(user_input: str) -> str:
    """Generate response with input/output moderation."""
    # Check input
    input_mod = client.moderations.create(model="omni-moderation-latest", input=user_input)
    if input_mod.results[0].flagged:
        return "I can't help with that request."

    # Generate response
    response = client.responses.create(
        model="gpt-4.1-mini",
        input=user_input,
    )
    output = response.output_text

    # Check output
    output_mod = client.moderations.create(model="omni-moderation-latest", input=output)
    if output_mod.results[0].flagged:
        return "I generated a response but it was flagged by content moderation."
    return output


result = safe_generate("Explain how encryption works")
print(result)
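The boolean `flagged` field applies the service's own cutoffs. If a product needs stricter or looser limits per category, the category scores can be compared against application-specific thresholds instead. The following is a rough sketch: `categories_over_threshold` and the threshold values are hypothetical and would need tuning against real traffic.

```python
def categories_over_threshold(category_scores, thresholds, default=0.5):
    """Return {category: score} for every score at or above its threshold.

    category_scores: mapping of category name to score in [0, 1]
    thresholds: per-category overrides; categories not listed use `default`
    """
    return {
        name: score
        for name, score in category_scores.items()
        if score is not None and score >= thresholds.get(name, default)
    }
```

With the SDK result from the earlier example, the scores mapping could be obtained with `result.category_scores.model_dump()`, and a non-empty return value would be treated the same way as `flagged` in `safe_generate`.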
6. Enterprise Features
Enterprise controls are where successful prototypes usually break down. Once multiple teams, projects, budgets, and audit requirements are involved, you need admin APIs, RBAC, usage tracking, and predictable project boundaries so operational ownership is clear.
| Feature | Purpose | API Endpoint |
|---|---|---|
| Admin API | Manage users, projects, API keys programmatically | /v1/organization/* |
| Usage Tracking | Per-project/key token usage and costs | /v1/organization/usage/*, /v1/organization/costs |
| Audit Logs | Track all API access and admin actions | /v1/organization/audit_logs |
| RBAC | Role-based access (owner, admin, member, reader) | Dashboard + API |
| Data Residency | Control where data is processed/stored | Project settings |
| SSO/SCIM | Enterprise identity integration | Dashboard config |
The final example shows how enterprise administration and usage reporting fit into day-two operations. This is the layer that connects technical quality to governance, finance, and incident response.
from openai import OpenAI

client = OpenAI()  # Organization endpoints require an admin API key

# Admin API: List projects in organization
projects = client.projects.list()
for project in projects.data:
    print(f"  {project.name} (ID: {project.id}, status: {project.status})")

# Usage tracking: token usage per project since start_time
usage = client.usage.completions.list(
    start_time=1716000000,  # Unix timestamp
    limit=10,
    group_by=["project_id"],
)
for bucket in usage.data:
    for result in bucket.results:
        print(f"  Project {result.project_id}: {result.input_tokens} in, {result.output_tokens} out")
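If your SDK version does not expose these organization helpers, the same data should be reachable over plain HTTPS. The sketch below uses only the standard library and makes several assumptions: that usage lives at `/v1/organization/usage/completions`, that `group_by` is passed as a repeated query parameter, and that each bucket holds `results` with `project_id`, `input_tokens`, and `output_tokens` as in the example above. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.openai.com/v1"


def build_usage_url(start_time, group_by=("project_id",), limit=10):
    """Build the completions-usage URL; group_by is sent as a repeated parameter (assumed)."""
    params = [("start_time", start_time), ("limit", limit)]
    params += [("group_by", field) for field in group_by]
    return f"{API_BASE}/organization/usage/completions?{urllib.parse.urlencode(params)}"


def admin_get(url, admin_key):
    """GET an organization endpoint using an admin API key and return parsed JSON."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {admin_key}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_by_project(page):
    """Sum input and output tokens per project across all buckets in one page."""
    totals = {}
    for bucket in page.get("data", []):
        for result in bucket.get("results", []):
            entry = totals.setdefault(result.get("project_id"), {"input": 0, "output": 0})
            entry["input"] += result.get("input_tokens", 0)
            entry["output"] += result.get("output_tokens", 0)
    return totals
```

A typical call would be `tokens_by_project(admin_get(build_usage_url(1716000000), admin_key))`, with `admin_key` read from wherever your deployment stores secrets. Keeping the aggregation separate from the HTTP call makes it easy to feed the totals into finance or alerting pipelines.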
Next in the SDK Track
In OA Part 11: Reasoning Systems, we’ll cover GPT-5.5’s reasoning capabilities: the reasoning effort parameter, reasoning items in multi-turn conversations, tool calling with reasoning models, and multi-step decomposition patterns.