OpenAI SDK Track Part 10: Fine-Tuning, Eval & Production

            
            What You’ll Learn: Fine-tuning creates a custom model that’s specialized for your specific task — it learns your terminology, format preferences, and domain patterns from example data. This article covers the complete lifecycle: when to fine-tune vs prompt-engineer, preparing training data, running training jobs, evaluating results, and deploying fine-tuned models in production. Think of it like the difference between hiring a generalist and training a specialist.
        

1. Fine-Tuning

Fine-tuning customizes a base model with your domain-specific data. It can improve quality on specialized tasks, shorten prompts (saving tokens), and produce more consistent behavior than prompt engineering alone typically achieves.

The workflow is easiest to reason about in three phases: create clean JSONL training data, launch a job against a base model that is already close to your use case, and then evaluate the new model before letting it anywhere near production traffic.

import json

# Step 1: Prepare training data in JSONL format
# (two examples shown for brevity; a real fine-tuning job needs at least 10)
training_data = [
    {"messages": [
        {"role": "system", "content": "You are a customer support agent for TechCorp."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "To reset your password: 1) Go to settings.techcorp.com 2) Click 'Forgot Password' 3) Enter your email 4) Check inbox for reset link. Need more help?"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a customer support agent for TechCorp."},
        {"role": "user", "content": "My subscription didn't renew"},
        {"role": "assistant", "content": "I'll help with your subscription. First, let me check: 1) Is your payment method current? 2) Did you receive any failed payment emails? Please share your account email and I'll investigate further."},
    ]},
]

# Write training file
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

print(f"Training file created with {len(training_data)} examples")
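Because most of the quality work happens in the dataset, it can help to check the file before uploading it. The following is a minimal sketch rather than anything provided by the SDK: `validate_example` and `validate_jsonl` are hypothetical helpers, and the checks assume plain chat examples with string content (no tool calls or multimodal parts).

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: chat-only data


def validate_example(example) -> list:
    """Return a list of problems found in one chat-format training example."""
    if not isinstance(example, dict):
        return ["example is not a JSON object"]
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not an assistant reply")
    return problems


def validate_jsonl(lines) -> dict:
    """Map 1-based line numbers to problems; an empty dict means nothing was flagged."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_example(example)
        if problems:
            report[lineno] = problems
    return report


# Illustrative input: one good line, one without an assistant reply, one broken line
sample_lines = [
    json.dumps({"messages": [{"role": "user", "content": "Hi"},
                             {"role": "assistant", "content": "Hello!"}]}),
    json.dumps({"messages": [{"role": "user", "content": "Hi"}]}),
    "{not json",
]
report = validate_jsonl(sample_lines)
for lineno, problems in report.items():
    print(f"line {lineno}: {problems}")
```

To run it against the file written above, pass the open file handle: `validate_jsonl(open("training_data.jsonl"))`.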

This second example turns the JSONL file into a real training job. Notice that the job itself is lightweight to create; most of the real quality work happened in dataset design, coverage, and consistency before the upload step ever started.

from openai import OpenAI

client = OpenAI()

# Step 2: Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"File uploaded: {training_file.id}")

# Step 3: Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.8,
        "batch_size": "auto",
    },
    suffix="techcorp-support",  # Custom model name suffix
)
print(f"Fine-tuning job created: {job.id}")
print(f"Status: {job.status}")

Training is asynchronous, so monitoring matters. In practice you should store the job ID, poll it from background workers, and archive both the events and the resulting model name so future regressions can be traced back to the exact fine-tuning run that introduced them.

from openai import OpenAI

client = OpenAI()

# Step 4: Monitor training progress
job_id = "ftjob-abc123"  # From previous step

# Check status
job = client.fine_tuning.jobs.retrieve(job_id)
print(f"Status: {job.status}")
print(f"Model: {job.fine_tuned_model}")

# List training events
events = client.fine_tuning.jobs.list_events(job_id, limit=10)
for event in events.data:
    print(f"  [{event.created_at}] {event.message}")

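# Step 5: Use the fine-tuned model (fine_tuned_model stays None until the job succeeds)
if job.fine_tuned_model:
    response = client.responses.create(
        model=job.fine_tuned_model,
        # Reuse the system prompt from the training examples so inference matches training
        instructions="You are a customer support agent for TechCorp.",
        input="How do I update my billing address?",
    )
    print(f"\nFine-tuned response: {response.output_text}")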
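Since training is asynchronous, a background worker usually needs some form of polling loop. The sketch below is one possible shape for it, under stated assumptions: `wait_for_job` is a hypothetical helper, the poll interval and budget are arbitrary, and the job object is only assumed to expose a `status` attribute with the terminal values `succeeded`, `failed`, and `cancelled`. The retrieval call is injected so the loop can be exercised without network access.

```python
import time
from types import SimpleNamespace
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve: Callable[[], object],
                 poll_seconds: float = 30.0,
                 max_polls: int = 240,
                 sleep: Callable[[float], None] = time.sleep):
    """Call retrieve() until the job is in a terminal status or the budget runs out."""
    for _ in range(max_polls):
        job = retrieve()
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Offline demonstration with stand-in job objects instead of real API calls
fake_states = iter([
    SimpleNamespace(status="running", fine_tuned_model=None),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example-model"),
])
finished = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(finished.status, finished.fine_tuned_model)
```

With the real client, the same helper would be called as `wait_for_job(lambda: client.fine_tuning.jobs.retrieve(job_id))`, and the returned job's ID, events, and model name can then be archived as described above.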

2. Evaluation & Prompt Optimizer

Evaluation is how you keep model changes honest. The docs now emphasize eval-first development: define the task schema, score samples with rubric models or graders, compare candidates, and only then promote a prompt or fine-tuned model. This same discipline is what makes prompt optimizer output useful rather than just interesting.

from openai import OpenAI

client = OpenAI()

# Create an evaluation to measure model quality
eval_job = client.evals.create(
    name="support-quality-v1",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "input": {"type": "string"},
                "expected_output": {"type": "string"},
            },
            "required": ["input", "expected_output"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "score_model",
            "name": "helpfulness",
            "model": "gpt-4.1",
            "input": [
                {"role": "system", "content": "Rate the helpfulness of the response on a scale of 1-5."},
                {"role": "user", "content": "User: {{item.input}}\nExpected: {{item.expected_output}}\nActual: {{sample.output_text}}\n\nScore (1-5):"},
            ],
            "range": [1, 5],  # score range must match the 1-5 scale in the prompt
            "pass_threshold": 4.0,
        },
    ],
)
print(f"Eval created: {eval_job.id}")

Once the eval exists, feed it realistic test cases that reflect the edge cases your users actually trigger. That is also where graders become valuable: instead of asking only “did the output look okay?”, you can score groundedness, policy adherence, latency, or tool accuracy with repeatable criteria.

from openai import OpenAI

client = OpenAI()

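# Run the evaluation: sample a response for each test item, then grade it
eval_run = client.evals.runs.create(
    eval_id="eval-abc123",  # ID returned by client.evals.create
    name="gpt-4.1-mini-baseline",
    data_source={
        "type": "responses",  # generates the {{sample.output_text}} the grader reads
        "model": "gpt-4.1-mini",
        "input_messages": {
            "type": "template",
            "template": [{"role": "user", "content": "{{item.input}}"}],
        },
        "source": {
            "type": "file_content",
            "content": [
                {"item": {"input": "Reset my password", "expected_output": "Go to settings, click Forgot Password, enter email, check inbox."}},
                {"item": {"input": "Cancel subscription", "expected_output": "I can help cancel. Please confirm your account email and reason for canceling."}},
            ],
        },
    },
)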
print(f"Eval run started: {eval_run.id}")
print(f"Status: {eval_run.status}")
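The "compare candidates, then promote" step can be made concrete with a small amount of local arithmetic. The sketch below is an illustration only: `pass_rate` and `should_promote` are hypothetical helpers, the scores are made-up 1-5 grader outputs, and the 5-point minimum gain is an arbitrary policy choice. The 4.0 threshold mirrors the `pass_threshold` used in the eval definition above.

```python
def pass_rate(scores, threshold=4.0):
    """Fraction of graded samples whose score meets the pass threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(score >= threshold for score in scores) / len(scores)


def should_promote(baseline_scores, candidate_scores, threshold=4.0, min_gain=0.05):
    """Promote only if the candidate beats the baseline pass rate by min_gain."""
    gain = pass_rate(candidate_scores, threshold) - pass_rate(baseline_scores, threshold)
    return gain >= min_gain


# Made-up grader scores for the same five test items
baseline = [3, 4, 5, 2, 4]    # e.g. base model with a detailed prompt
candidate = [4, 5, 5, 3, 4]   # e.g. fine-tuned model

print(f"baseline pass rate:  {pass_rate(baseline):.0%}")
print(f"candidate pass rate: {pass_rate(candidate):.0%}")
print(f"promote candidate:   {should_promote(baseline, candidate)}")
```

With only a handful of items a difference like this is not statistically meaningful; the same comparison is more trustworthy on a larger held-out set.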

Real-World Application

Domain-Specific Medical Coding

A healthcare company fine-tuned GPT-4-mini on 10,000 examples of medical procedure descriptions mapped to ICD-10 codes. The fine-tuned model achieved 97% accuracy vs 78% for the base model with prompt engineering alone. Cost: $200 for training, saving $50K/year in manual coding labor.

Tags: Healthcare, Fine-Tuning ROI

3. Batch API

            
            50% Cost Savings: The Batch API processes requests asynchronously within a 24-hour window at half the cost of synchronous calls. Ideal for data processing, evaluations, classification jobs, and any non-time-sensitive workload.
        

Batch is not just a cost lever; it is also a workflow separation tool. Anything that does not need an immediate user response should be evaluated for batch execution first, especially enrichment jobs, backfills, offline grading, and large-scale prompt comparisons.

from openai import OpenAI
import json

client = OpenAI()

# Step 1: Prepare batch requests in JSONL format
requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/responses",
     "body": {"model": "gpt-4.1-mini", "input": f"Classify this review sentiment: '{review}'"}}
    for i, review in enumerate([
        "Great product, highly recommend!",
        "Terrible experience, want a refund.",
        "It's okay, nothing special.",
    ])
]

with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload and create batch
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/responses",
    completion_window="24h",
)
print(f"Batch created: {batch.id} (status: {batch.status})")

After submission, treat the batch result file like any other pipeline artifact. Parse it, join it back to your source IDs, and persist failures separately so you can replay only the bad items instead of rerunning the entire workload.

from openai import OpenAI
import json

client = OpenAI()

# Step 3: Check batch status and retrieve results
batch_id = "batch_abc123"
batch = client.batches.retrieve(batch_id)
print(f"Status: {batch.status}")
print(f"Completed: {batch.request_counts.completed}/{batch.request_counts.total}")

# When complete, download results
if batch.status == "completed" and batch.output_file_id:
    content = client.files.content(batch.output_file_id)
    results = [json.loads(line) for line in content.text.strip().split("\n")]

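    for result in results:
        # A line can carry an error instead of a response; report it and move on
        if result.get("error") or not result.get("response"):
            print(f"  {result['custom_id']}: failed ({result.get('error')})")
            continue
        # Take the first output_text part; output may also contain non-message items
        body = result["response"]["body"]
        text = next(
            (part["text"]
             for item in body.get("output", []) if item.get("type") == "message"
             for part in item.get("content", []) if part.get("type") == "output_text"),
            "",
        )
        print(f"  {result['custom_id']}: {text[:60]}...")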
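The advice to persist failures separately and replay only the bad items can be sketched as a pure function over the output lines. This is a minimal sketch under assumptions: `partition_batch_output` is a hypothetical helper, and each line is assumed to be a JSON object with `custom_id`, `response` (containing `status_code` and `body`), and `error` keys. The sample lines are invented for illustration.

```python
import json


def partition_batch_output(lines):
    """Split batch output lines into (succeeded, failed) dicts keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            # Keep whatever diagnostic is available for the replay queue
            failed[custom_id] = record.get("error") or response.get("body")
        else:
            succeeded[custom_id] = response["body"]
    return succeeded, failed


# Invented sample lines: one success, one HTTP-level failure
sample_output = [
    json.dumps({"custom_id": "req-0",
                "response": {"status_code": 200, "body": {"status": "completed"}},
                "error": None}),
    json.dumps({"custom_id": "req-1",
                "response": {"status_code": 429, "body": {"error": "rate limited"}},
                "error": None}),
]
ok, bad = partition_batch_output(sample_output)
print(f"succeeded: {sorted(ok)}")
print(f"to replay: {sorted(bad)}")
```

The `custom_id` keys in `bad` can then be joined back to the original request file to build a smaller replay batch.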

4. Prompt Caching

Prompt caching rewards stable prefixes. Long instructions, repeated policy blocks, and shared context turn from a cost burden into an optimization opportunity when you design your request structure deliberately and keep the reusable prefix identical between calls.

from openai import OpenAI

client = OpenAI()

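# Prompt caching is automatic for requests that share an identical prefix.
# It only applies to prompts of 1,024 tokens or more, so short prompts report 0 cached tokens.
# The instructions and early messages are cached on the first call.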

long_system_prompt = """You are an expert legal assistant specializing in contract review.
You follow these rules:
1. Always cite specific clauses
2. Flag potential risks in red
3. Suggest alternative language
4. Rate overall risk (low/medium/high)
...(imagine 2000+ tokens of detailed instructions)..."""

# First call: full price (cache miss)
response1 = client.responses.create(
    model="gpt-4.1",
    instructions=long_system_prompt,
    input="Review clause 4.2 about liability limitations.",
)
print(f"Call 1 - Cached tokens: {response1.usage.input_tokens_details.cached_tokens}")

# Second call with the same prefix: cached input tokens are billed at a discount (cache hit)
response2 = client.responses.create(
    model="gpt-4.1",
    instructions=long_system_prompt,  # Same prefix = cache hit
    input="Now review clause 7.1 about termination.",
)
print(f"Call 2 - Cached tokens: {response2.usage.input_tokens_details.cached_tokens}")
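To see whether a stable prefix is actually paying off, the per-call `cached_tokens` figures can be aggregated into a hit ratio. The sketch below assumes you have already logged `(input_tokens, cached_tokens)` pairs from each response's usage data; `cache_hit_ratio` is a hypothetical helper and the numbers are invented.

```python
def cache_hit_ratio(usage_pairs):
    """Share of input tokens served from cache across a list of calls.

    usage_pairs: list of (input_tokens, cached_tokens) tuples, one per request.
    """
    total_input = sum(input_tokens for input_tokens, _ in usage_pairs)
    total_cached = sum(cached_tokens for _, cached_tokens in usage_pairs)
    return total_cached / total_input if total_input else 0.0


# Invented log: the first call misses the cache, the next two reuse the prefix
logged_usage = [
    (2100, 0),
    (2110, 2048),
    (2090, 2048),
]
ratio = cache_hit_ratio(logged_usage)
print(f"cache hit ratio: {ratio:.1%}")
```

A ratio that stays near zero despite long prompts usually suggests that something early in the request (a timestamp, a user name, reordered instructions) is changing between calls and breaking the shared prefix.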

5. Moderation API

Moderation belongs on both sides of generation. Input checks stop abusive requests before they hit expensive models or tools, and output checks catch unsafe or policy-violating generations before they reach end users. The pair is much stronger than either one alone.

from openai import OpenAI

client = OpenAI()

# Check content for policy violations
moderation = client.moderations.create(
    model="omni-moderation-latest",
    input="This is a normal message about programming best practices.",
)

result = moderation.results[0]
print(f"Flagged: {result.flagged}")
print("Categories:")
# model_dump() gives a plain dict of category name -> bool
for category, flagged in result.categories.model_dump().items():
    if flagged:
        score = getattr(result.category_scores, category)
        print(f"  {category}: {score:.4f}")

from openai import OpenAI

client = OpenAI()

def safe_generate(user_input: str) -> str:
    """Generate response with input/output moderation."""
    # Check input
    input_mod = client.moderations.create(model="omni-moderation-latest", input=user_input)
    if input_mod.results[0].flagged:
        return "I can't help with that request."

    # Generate response
    response = client.responses.create(
        model="gpt-4.1-mini",
        input=user_input,
    )
    output = response.output_text

    # Check output
    output_mod = client.moderations.create(model="omni-moderation-latest", input=output)
    if output_mod.results[0].flagged:
        return "I generated a response but it was flagged by content moderation."

    return output

result = safe_generate("Explain how encryption works")
print(result)
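The boolean `flagged` field is a single global decision; some applications may prefer stricter or looser limits per category. The following sketch shows one way to apply such a policy to category scores. It is an assumption-laden illustration: `categories_over_threshold` is a hypothetical helper, the threshold values are arbitrary, and the scores are assumed to be a plain dict of category name to float (for example, one built from `result.category_scores`).

```python
DEFAULT_THRESHOLD = 0.5
# Hypothetical policy: stricter on self-harm, more lenient on violence
CATEGORY_THRESHOLDS = {"self_harm": 0.2, "violence": 0.7}


def categories_over_threshold(scores, thresholds=CATEGORY_THRESHOLDS,
                              default=DEFAULT_THRESHOLD):
    """Return the sorted category names whose score meets or exceeds its threshold."""
    return sorted(
        name for name, score in scores.items()
        if score >= thresholds.get(name, default)
    )


# Invented scores for illustration
example_scores = {"harassment": 0.10, "self_harm": 0.25, "violence": 0.60}
blocked = categories_over_threshold(example_scores)
print(f"blocked categories: {blocked}")
```

The same check can run on both the input and the output side of `safe_generate`, with a different threshold table for each if the risk profiles differ.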

6. Enterprise Features

Enterprise controls are where successful prototypes usually break down. Once multiple teams, projects, budgets, and audit requirements are involved, you need admin APIs, RBAC, usage tracking, and predictable project boundaries so operational ownership is clear.

Feature	Purpose	API Endpoint
Admin API	Manage users, projects, API keys programmatically	`/v1/organization/*`
Usage Tracking	Per-project/key token usage and costs	`/v1/organization/usage/*`, `/v1/organization/costs`
Audit Logs	Track all API access and admin actions	`/v1/organization/audit_logs`
RBAC	Role-based access (owner, admin, member, reader)	Dashboard + API
Data Residency	Control where data is processed/stored	Project settings
SSO/SCIM	Enterprise identity integration	Dashboard config

            
            Deployment checklist: pin your model or snapshot where consistency matters, run evals before every upgrade, log request IDs and project context, define retention and data-handling rules, review RBAC quarterly, and document rollback paths for both prompts and fine-tuned models.
        

The final example shows how enterprise administration and usage reporting fit into day-two operations. This is the layer that connects technical quality to governance, finance, and incident response.

from openai import OpenAI

client = OpenAI()

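# Admin API: list projects in the organization.
# Admin endpoints need an organization admin key rather than a project API key;
# verify that your installed SDK version exposes the accessors used here
# (the underlying REST route is GET /v1/organization/projects).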
projects = client.projects.list()
for project in projects.data:
    print(f"  {project.name} (ID: {project.id}, status: {project.status})")

# Usage tracking: completion token usage since start_time, grouped by project
usage = client.usage.completions.list(
    start_time=1716000000,  # Unix timestamp
    limit=10,
    group_by=["project_id"],
)
for bucket in usage.data:
    for result in bucket.results:
        print(f"  Project {result.project_id}: {result.input_tokens} in, {result.output_tokens} out")
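Usage reports arrive as time buckets, so a common day-two task is rolling them up per project for finance or alerting. The sketch below assumes the buckets have been converted to plain dicts shaped like the objects in the listing above (a `results` list whose entries carry `project_id`, `input_tokens`, and `output_tokens`); `tokens_by_project` is a hypothetical helper and the data is invented.

```python
from collections import defaultdict


def tokens_by_project(buckets):
    """Sum input and output tokens per project across all usage buckets."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for bucket in buckets:
        for result in bucket.get("results", []):
            # Results without a project are grouped under a placeholder key
            project = result.get("project_id") or "unattributed"
            totals[project]["input_tokens"] += result["input_tokens"]
            totals[project]["output_tokens"] += result["output_tokens"]
    return dict(totals)


# Invented usage buckets for two days
sample_buckets = [
    {"results": [
        {"project_id": "proj_a", "input_tokens": 1200, "output_tokens": 300},
        {"project_id": "proj_b", "input_tokens": 500, "output_tokens": 100},
    ]},
    {"results": [
        {"project_id": "proj_a", "input_tokens": 800, "output_tokens": 200},
        {"project_id": None, "input_tokens": 50, "output_tokens": 10},
    ]},
]
totals = tokens_by_project(sample_buckets)
for project, counts in sorted(totals.items()):
    print(f"{project}: {counts['input_tokens']} in, {counts['output_tokens']} out")
```

Persisting these totals alongside the request IDs and project context from the deployment checklist makes it easier to trace a cost spike back to the team and release that caused it.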

            
            Track Complete! You’ve completed the 10-part OpenAI SDK Track. You now have production-ready knowledge of the entire OpenAI platform — from basic API calls through agents, realtime voice, fine-tuning, and enterprise operations. Return to the Foundation Track to deepen your understanding of vendor-agnostic patterns.
        

            
            Try It Yourself: Fine-tune a model for a specific task: (1) Create 50 training examples for ‘customer email classification’ (categories: billing, technical, sales, general), (2) format as JSONL, (3) upload and start a fine-tuning job, (4) evaluate on a held-out test set of 10 emails, (5) compare accuracy and latency vs the base model with a detailed prompt.
        

Next in the SDK Track

In OA Part 11: Reasoning Systems, we’ll unlock GPT-5.5’s reasoning capabilities — reasoning effort parameter, reasoning items in multi-turn conversations, tool calling with reasoning models, and multi-step decomposition patterns.
