1. Vision API Fundamentals
The core multimodal pattern is simple: send text instructions together with one or more image inputs, and make the prompt explicit about the kind of reasoning you want. The first example uses a public image URL, while the second shows the local-file workflow you will often use in backend systems.
from openai import OpenAI

client = OpenAI()

# Send an image URL for analysis
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What's in this image? Describe in detail."},
                {"type": "input_image", "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"},
            ],
        }
    ],
)
print(response.output_text)
Base64 uploads are especially useful for private assets, generated screenshots, or preprocessing pipelines where the image never needs to be publicly reachable. That keeps the integration server-side and avoids URL hosting as a prerequisite.
import base64
from openai import OpenAI

client = OpenAI()

# Send a local image as base64
with open("diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe this architecture diagram. List all components and their connections."},
                {"type": "input_image", "image_url": f"data:image/png;base64,{image_data}"},
            ],
        }
    ],
)
print(response.output_text)
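The example above hard-codes `image/png` in the data URL. When a backend handles mixed file types, one option is to infer the MIME type from the file extension. The sketch below is a minimal illustration using only the standard library; the helper name `to_data_url` is hypothetical and not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL, guessing the MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed as the `image_url` value of an `input_image` item, in place of the hand-built f-string.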
2. Image Understanding & OCR
OCR is stronger when you specify the structure you want back instead of asking for a vague summary. That is why receipts, invoices, documents, and screenshots are good candidates for structured extraction rather than plain-language description.
import json
from openai import OpenAI

client = OpenAI()

# OCR: Extract text from an image (receipt, document, screenshot)
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Extract all text from this receipt. Return as structured JSON with: store_name, date, items (name, qty, price), subtotal, tax, total."},
                {"type": "input_image", "image_url": "https://example.com/receipt.jpg"},
            ],
        }
    ],
    text={"format": {"type": "json_object"}},
)

receipt = json.loads(response.output_text)
print(json.dumps(receipt, indent=2))
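JSON mode returns valid JSON, but it does not by itself ensure that the requested keys are present or that the numbers add up. A light sanity check on the parsed receipt can catch many extraction mistakes before they reach downstream systems. The following is a minimal sketch; the function name `check_receipt`, the tolerance, and the assumption that amounts arrive as plain numbers are all illustrative choices.

```python
REQUIRED_KEYS = {"store_name", "date", "items", "subtotal", "tax", "total"}


def check_receipt(receipt: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in an extracted receipt (empty means it looks consistent)."""
    problems = []
    missing = REQUIRED_KEYS - receipt.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    try:
        # Assumes amounts are numbers or numeric strings without currency symbols.
        subtotal, tax, total = (float(receipt[k]) for k in ("subtotal", "tax", "total"))
    except (TypeError, ValueError):
        problems.append("subtotal, tax, or total is not numeric")
        return problems
    if abs(subtotal + tax - total) > tolerance:
        problems.append(f"subtotal + tax = {subtotal + tax:.2f}, but total = {total:.2f}")
    return problems
```

A non-empty result is a reasonable trigger for a retry or for routing the receipt to manual review.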
Quality Control in Manufacturing
A factory uses GPT-4 Vision to inspect products on the assembly line. The system photographs each item, detects defects (scratches, misalignment, color inconsistencies), and classifies them by severity. Result: 99.2% defect detection rate, 30% faster than human inspectors.
3. Multi-Image Reasoning
Multiple image inputs let the model compare alternatives, detect changes between versions, or reason about before-and-after states. This is often more useful in product work than single-image captioning because real workflows involve comparison and choice.
from openai import OpenAI

client = OpenAI()

# Compare multiple images
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Compare these two UI designs. Which is better for accessibility and why?"},
                {"type": "input_image", "image_url": "https://example.com/design-a.png"},
                {"type": "input_image", "image_url": "https://example.com/design-b.png"},
            ],
        }
    ],
)
print(response.output_text)
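When the number of images varies, it can help to build the content list programmatically and to label each image so the model's answer can refer to "Image 1" or "Image 2" unambiguously. The sketch below assumes that interleaving short text labels with images improves clarity; that is a prompting choice, not an API requirement, and `build_comparison_content` is a hypothetical helper name.

```python
def build_comparison_content(question: str, image_urls: list[str]) -> list[dict]:
    """Build a Responses-style content list: one question, then a label and an image per URL."""
    content = [{"type": "input_text", "text": question}]
    for index, url in enumerate(image_urls, start=1):
        content.append({"type": "input_text", "text": f"Image {index}:"})
        content.append({"type": "input_image", "image_url": url})
    return content
```

The result can be used as the `content` value of the user message in a `client.responses.create` call like the ones shown earlier.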
4. Image Generation
Different image models fit different workflows. DALL-E 3 returns a hosted image URL, which suits direct generation, while GPT-Image-1 returns base64-encoded image data, which is better when you want binary output, editing flows, or tighter integration with your own asset pipeline.
from openai import OpenAI

client = OpenAI()

# Generate with DALL-E 3
result = client.images.generate(
    model="dall-e-3",
    prompt="A serene Japanese garden with a koi pond, cherry blossoms, and a wooden bridge, watercolor style",
    size="1024x1024",
    quality="hd",
    n=1,
)
print(f"Image URL: {result.data[0].url}")
print(f"Revised prompt: {result.data[0].revised_prompt}")
The binary-output path is often the more practical production path because it lets you store, transform, watermark, or route the image immediately without depending on temporary hosted URLs.
import base64
from openai import OpenAI

client = OpenAI()

# Generate with GPT-Image-1 (returns base64)
result = client.images.generate(
    model="gpt-image-1",
    prompt="A minimalist logo for a tech startup called 'NeuralFlow' - abstract neural network nodes forming a flowing river shape, blue and teal colors, white background",
    size="1024x1024",
    quality="high",
    output_format="png",
)

# Save the image
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("neuralflow-logo.png", "wb") as f:
    f.write(image_bytes)
print("Logo saved to neuralflow-logo.png")
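In an asset pipeline, a fixed output filename is rarely enough. One possible approach is to name each file after a hash of its bytes, so repeated saves of the same image map to the same path. This is a sketch under that assumption; `save_b64_image` and the 16-character name length are illustrative choices, not part of any SDK.

```python
import base64
import hashlib
from pathlib import Path


def save_b64_image(b64_data: str, out_dir: str, extension: str = "png") -> Path:
    """Decode base64 image data and store it under a name derived from its SHA-256 hash."""
    image_bytes = base64.b64decode(b64_data)
    digest = hashlib.sha256(image_bytes).hexdigest()[:16]
    target = Path(out_dir) / f"{digest}.{extension}"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(image_bytes)
    return target
```

With a generation result in hand, `save_b64_image(result.data[0].b64_json, "assets")` would write the file and return its path for later steps such as watermarking or upload.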
5. Editing Workflows
Image editing is where prompt specificity matters most. The clearer you are about what must stay unchanged versus what must be modified, the easier it is to turn this into a reliable creative or product-asset workflow.
import base64
from openai import OpenAI

client = OpenAI()

# Edit an existing image with GPT-Image-1
# Open the source file in a context manager so the handle is closed after the request
with open("office-photo.png", "rb") as image_file:
    result = client.images.edit(
        model="gpt-image-1",
        image=image_file,
        prompt="Add a potted plant on the desk and change the wall color to light blue",
    )

edited_bytes = base64.b64decode(result.data[0].b64_json)
with open("office-edited.png", "wb") as f:
    f.write(edited_bytes)
print("Edited image saved")
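To make the "what changes versus what stays" distinction explicit, edit prompts can be assembled from two separate lists. The sketch below is one possible template; the wording and the helper name `build_edit_prompt` are assumptions for illustration, and the phrasing that works best will depend on the model and the images involved.

```python
def build_edit_prompt(changes: list[str], preserve: list[str]) -> str:
    """Compose an edit prompt that lists requested changes and elements to leave untouched."""
    if not changes:
        raise ValueError("At least one change is required")
    lines = ["Make only these changes:"]
    lines += [f"- {change}" for change in changes]
    if preserve:
        lines.append("Keep everything else unchanged, in particular:")
        lines += [f"- {item}" for item in preserve]
    return "\n".join(lines)
```

For the office photo, this could be called as `build_edit_prompt(["Add a potted plant on the desk"], ["the furniture layout", "the lighting"])` and the result passed as the `prompt` argument.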
6. Production Pipelines
Production image systems rarely stop at describing a picture. They usually normalize files, call the model, validate the output shape, and then pass the result into search, moderation, tagging, or downstream UI components. That is why the pipeline example returns structured data rather than just a caption.
import base64
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ImageAnalysis(BaseModel):
    description: str
    objects: list[str]
    colors: list[str]
    mood: str
    text_content: str | None


def analyze_and_describe(image_path: str) -> ImageAnalysis:
    """Production pipeline: analyze image and return structured data."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    response = client.responses.parse(
        model="gpt-4.1",
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Analyze this image comprehensively."},
                    {"type": "input_image", "image_url": f"data:image/png;base64,{image_data}"},
                ],
            }
        ],
        text_format=ImageAnalysis,
    )
    # output_parsed is the parsed ImageAnalysis instance itself, not a list
    return response.output_parsed


# Usage
analysis = analyze_and_describe("product-photo.png")
print(f"Description: {analysis.description}")
print(f"Objects: {', '.join(analysis.objects)}")
print(f"Mood: {analysis.mood}")
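The "normalize files" step mentioned above often starts with a cheap pre-flight check before any bytes are sent to the API. The following sketch rejects missing, empty, oversized, or unexpectedly typed files. The extension list and the 20 MB limit are assumed values for illustration only; check the current API documentation for the real constraints. `preflight_check` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed limits for illustration; verify against the current API documentation.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}
MAX_BYTES = 20 * 1024 * 1024


def preflight_check(image_path: str, max_bytes: int = MAX_BYTES) -> Path:
    """Reject files that are missing, empty, too large, or of an unexpected type."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(image_path)
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported image type: {path.suffix or '(none)'}")
    size = path.stat().st_size
    if size == 0:
        raise ValueError("Image file is empty")
    if size > max_bytes:
        raise ValueError(f"Image is {size} bytes; limit is {max_bytes}")
    return path
```

Running this before the base64 encoding step keeps obviously bad inputs from consuming an API call.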
Next in the SDK Track
In OA Part 6: Audio, Speech & Video, we’ll implement Whisper transcription, TTS with streaming, and video generation with Sora.