
OpenAI SDK Track Part 5: Multimodal — Vision & Image Generation

May 22, 2026 · Wasil Zafar · 40 min read

Use the OpenAI Vision API for image understanding, OCR, and diagram analysis. Generate images with DALL-E 3 and GPT-Image-1. Build multi-image reasoning, editing workflows, and production image pipelines.

Table of Contents

  1. Vision API
  2. Image Understanding
  3. Multi-Image Reasoning
  4. Image Generation
  5. Editing Workflows
  6. Production Pipelines
What You’ll Learn: Vision-capable models such as GPT-4.1 can ‘see’: you can pass images alongside text, and the model will describe, analyze, compare, and reason about visual content. Image generation with DALL-E 3 and GPT-Image-1 lets you create images from text descriptions. This article covers both, from basic image understanding to visual workflows that combine analysis and generation.

1. Vision API Fundamentals

The core multimodal pattern is simple: send text instructions together with one or more image inputs, and make the prompt explicit about the kind of reasoning you want. The first example uses a public image URL, while the second shows the local-file workflow you will often use in backend systems.

from openai import OpenAI

client = OpenAI()

# Send an image URL for analysis
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What's in this image? Describe in detail."},
                {"type": "input_image", "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"},
            ],
        }
    ],
)
print(response.output_text)
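The request body above is a plain list of dictionaries, so it can be assembled by a small helper. The sketch below is one way to do that, and it also sets the optional `detail` field on the image part (`low`, `high`, or `auto`), which trades resolution for cost. The function names `image_part` and `vision_message` and the example URL are hypothetical, introduced only for illustration.

```python
from typing import Any


def image_part(image_url: str, detail: str = "auto") -> dict[str, Any]:
    """Build one input_image content part with an explicit detail level."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"unsupported detail level: {detail!r}")
    return {"type": "input_image", "image_url": image_url, "detail": detail}


def vision_message(prompt: str, image_url: str, detail: str = "auto") -> list[dict[str, Any]]:
    """Build the `input` list for a single-image request."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                image_part(image_url, detail),
            ],
        }
    ]


# Hypothetical URL, used only to show the payload shape.
payload = vision_message(
    "Is there a cat in this photo?",
    "https://example.com/cat.jpg",
    detail="low",
)
print(payload[0]["content"][1])
```

The resulting list can be passed as `input=payload` to `client.responses.create`. A low detail level is usually enough for yes/no questions, while dense diagrams or small text tend to need `high`.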

Base64 uploads are especially useful for private assets, generated screenshots, or preprocessing pipelines where the image never needs to be publicly reachable. That keeps the integration server-side and avoids URL hosting as a prerequisite.

import base64
from openai import OpenAI

client = OpenAI()

# Send a local image as base64
with open("diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe this architecture diagram. List all components and their connections."},
                {"type": "input_image", "image_url": f"data:image/png;base64,{image_data}"},
            ],
        }
    ],
)
print(response.output_text)
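The example above hard-codes `image/png` in the data URL, which is only correct for PNG files. A minimal sketch of a more general helper follows, assuming the file extension is a reliable indicator of the format. The helper names (`bytes_to_data_url`, `guess_image_mime`, `file_to_data_url`) are hypothetical and use only the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def guess_image_mime(path: str) -> str:
    """Guess the MIME type from the file extension; reject non-image files."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    return mime_type


def file_to_data_url(path: str) -> str:
    """Read a local image and return a data URL with a matching MIME type."""
    return bytes_to_data_url(Path(path).read_bytes(), guess_image_mime(path))


print(guess_image_mime("diagram.png"))
print(bytes_to_data_url(b"abc", "image/png"))
```

With this in place, the `image_url` value in the request becomes `file_to_data_url("diagram.png")`, and the same call works unchanged for JPEG input.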

2. Image Understanding & OCR

OCR is stronger when you specify the structure you want back instead of asking for a vague summary. That is why receipts, invoices, documents, and screenshots are good candidates for structured extraction rather than plain-language description.

import json

from openai import OpenAI

client = OpenAI()

# OCR: Extract text from an image (receipt, document, screenshot)
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Extract all text from this receipt. Return as structured JSON with: store_name, date, items (name, qty, price), subtotal, tax, total."},
                {"type": "input_image", "image_url": "https://example.com/receipt.jpg"},
            ],
        }
    ],
    # JSON mode guarantees syntactically valid JSON, not a particular schema,
    # so the field names still depend on the prompt above.
    text={"format": {"type": "json_object"}},
)

receipt = json.loads(response.output_text)
print(json.dumps(receipt, indent=2))
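Extracted numbers are worth cross-checking before they reach a database, because a misread digit still produces valid JSON. The sketch below checks the arithmetic of a parsed receipt. It assumes the field names requested in the prompt above, that `price` is a unit price, and that all amounts come back as plain numbers; the function name `check_receipt_totals` and the sample data are hypothetical.

```python
from decimal import Decimal


def check_receipt_totals(receipt: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of problems; an empty list means the arithmetic is consistent."""
    tol = Decimal(tolerance)
    problems = []

    # Assumption: price is per unit, and qty defaults to 1 when missing.
    items_sum = sum(
        (
            Decimal(str(item["price"])) * Decimal(str(item.get("qty", 1)))
            for item in receipt["items"]
        ),
        Decimal("0"),
    )
    subtotal = Decimal(str(receipt["subtotal"]))
    tax = Decimal(str(receipt["tax"]))
    total = Decimal(str(receipt["total"]))

    if abs(items_sum - subtotal) > tol:
        problems.append(f"items sum to {items_sum}, but subtotal is {subtotal}")
    if abs(subtotal + tax - total) > tol:
        problems.append(f"subtotal + tax is {subtotal + tax}, but total is {total}")
    return problems


# Made-up sample data standing in for a parsed model response.
sample = {
    "store_name": "Corner Cafe",
    "items": [
        {"name": "Coffee", "qty": 2, "price": 3.50},
        {"name": "Bagel", "qty": 1, "price": 2.25},
    ],
    "subtotal": 9.25,
    "tax": 0.74,
    "total": 9.99,
}
print(check_receipt_totals(sample))
```

A non-empty result is a reasonable trigger for a retry at higher image detail or for routing the receipt to manual review.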
Real-World Application

Quality Control in Manufacturing

A factory uses GPT-4 Vision to inspect products on the assembly line. The system photographs each item, detects defects (scratches, misalignment, color inconsistencies), and classifies them by severity. Result: 99.2% defect detection rate, 30% faster than human inspectors.

Manufacturing · Quality Assurance

3. Multi-Image Reasoning

Multiple image inputs let the model compare alternatives, detect changes between versions, or reason about before-and-after states. This is often more useful in product work than single-image captioning because real workflows involve comparison and choice.

from openai import OpenAI

client = OpenAI()

# Compare multiple images
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Compare these two UI designs. Which is better for accessibility and why?"},
                {"type": "input_image", "image_url": "https://example.com/design-a.png"},
                {"type": "input_image", "image_url": "https://example.com/design-b.png"},
            ],
        }
    ],
)
print(response.output_text)
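When a prompt refers to images by name ("Design A", "Design B"), it can help to place a short text label before each image so the model's answer can cite them unambiguously. The following is a sketch of that idea, not a documented requirement of the API; the helper name `labeled_images_content` is hypothetical, and the URLs are the placeholder ones from the example above.

```python
def labeled_images_content(question: str, images: dict[str, str]) -> list[dict[str, str]]:
    """Interleave a text label before each image so answers can refer to them by name."""
    content = [{"type": "input_text", "text": question}]
    for label, url in images.items():
        content.append({"type": "input_text", "text": f"{label}:"})
        content.append({"type": "input_image", "image_url": url})
    return content


content = labeled_images_content(
    "Compare these two UI designs. Which is better for accessibility and why?",
    {
        "Design A": "https://example.com/design-a.png",
        "Design B": "https://example.com/design-b.png",
    },
)
for part in content:
    print(part["type"], part.get("text") or part.get("image_url"))
```

The returned list goes into the `content` field of the user message. Dictionary insertion order is preserved, so the labels appear in the order they were given.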

4. Image Generation

Different image models fit different workflows. DALL-E 3 returns a hosted image URL by default, which suits direct display, while GPT-Image-1 returns base64-encoded image data, which is better when you want binary output, editing flows, or tighter integration with your own asset pipeline.

from openai import OpenAI

client = OpenAI()

# Generate with DALL-E 3
result = client.images.generate(
    model="dall-e-3",
    prompt="A serene Japanese garden with a koi pond, cherry blossoms, and a wooden bridge, watercolor style",
    size="1024x1024",
    quality="hd",
    n=1,
)

print(f"Image URL: {result.data[0].url}")
print(f"Revised prompt: {result.data[0].revised_prompt}")

The binary-output path is often the more practical production path because it lets you store, transform, watermark, or route the image immediately without depending on temporary hosted URLs.

from openai import OpenAI
import base64

client = OpenAI()

# Generate with GPT-Image-1 (returns base64)
result = client.images.generate(
    model="gpt-image-1",
    prompt="A minimalist logo for a tech startup called 'NeuralFlow' - abstract neural network nodes forming a flowing river shape, blue and teal colors, white background",
    size="1024x1024",
    quality="high",
    output_format="png",
)

# Save the image
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("neuralflow-logo.png", "wb") as f:
    f.write(image_bytes)
print("Logo saved to neuralflow-logo.png")
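Since the two generation examples return their results differently (a hosted URL in one case, base64 data in the other), downstream code is simpler if both are reduced to raw bytes first. The sketch below shows one way to do that, assuming each result item exposes `b64_json` and `url` attributes as in the examples above. The function name `image_bytes_from_item` is hypothetical, and the stand-in object exists only so the example runs without an API call.

```python
import base64
import urllib.request
from types import SimpleNamespace


def image_bytes_from_item(item) -> bytes:
    """Return raw image bytes from a result item, whichever form it arrived in."""
    b64 = getattr(item, "b64_json", None)
    if b64:
        return base64.b64decode(b64)
    url = getattr(item, "url", None)
    if url:
        # Hosted URLs are temporary, so download right away.
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
    raise ValueError("image item has neither b64_json nor url")


# Stand-in for result.data[0]; a real item would come from client.images.generate.
fake_item = SimpleNamespace(
    b64_json=base64.b64encode(b"fake-png-bytes").decode("ascii"),
    url=None,
)
print(image_bytes_from_item(fake_item))
```

In real use the call would be `image_bytes_from_item(result.data[0])`, followed by whatever storage or post-processing step the pipeline needs.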

5. Editing Workflows

Image editing is where prompt specificity matters most. The clearer you are about what must stay unchanged versus what must be modified, the easier it is to turn this into a reliable creative or product-asset workflow.

import base64

from openai import OpenAI

client = OpenAI()

# Edit an existing image with GPT-Image-1.
# The context manager closes the file handle once the request completes.
with open("office-photo.png", "rb") as image_file:
    result = client.images.edit(
        model="gpt-image-1",
        image=image_file,
        prompt="Add a potted plant on the desk and change the wall color to light blue",
    )

# GPT-Image-1 returns base64-encoded image data
edited_bytes = base64.b64decode(result.data[0].b64_json)
with open("office-edited.png", "wb") as f:
    f.write(edited_bytes)
print("Edited image saved")
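One way to make the "what changes versus what stays" distinction explicit is to build the edit prompt from two separate lists. This is a sketch of a prompt-formatting convention, not something the API requires; the function name `build_edit_prompt`, the wording of the template, and the preserve list are all hypothetical.

```python
def build_edit_prompt(changes: list[str], preserve: list[str]) -> str:
    """Format an edit prompt that separates requested changes from things to keep."""
    if not changes:
        raise ValueError("at least one change is required")
    lines = ["Make only these changes:"]
    lines += [f"- {change}" for change in changes]
    if preserve:
        lines.append("Keep everything else unchanged, in particular:")
        lines += [f"- {item}" for item in preserve]
    return "\n".join(lines)


prompt = build_edit_prompt(
    changes=[
        "Add a potted plant on the desk",
        "Change the wall color to light blue",
    ],
    preserve=["furniture and desk layout", "lighting and camera angle"],
)
print(prompt)
```

The resulting string can be passed as the `prompt` argument to `client.images.edit`. Keeping the two lists separate also makes it easy to log exactly which constraints were sent with each edit.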

6. Production Pipelines

Production image systems rarely stop at describing a picture. They usually normalize files, call the model, validate the output shape, and then pass the result into search, moderation, tagging, or downstream UI components. That is why the pipeline example returns structured data rather than just a caption.

from openai import OpenAI
from pydantic import BaseModel
import base64

client = OpenAI()

class ImageAnalysis(BaseModel):
    description: str
    objects: list[str]
    colors: list[str]
    mood: str
    text_content: str | None

def analyze_and_describe(image_path: str) -> ImageAnalysis:
    """Production pipeline: analyze image and return structured data."""
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.responses.parse(
        model="gpt-4.1",
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Analyze this image comprehensively."},
                    {"type": "input_image", "image_url": f"data:image/png;base64,{image_data}"},
                ],
            }
        ],
        text_format=ImageAnalysis,
    )
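    # output_parsed is the parsed ImageAnalysis instance itself, not a list
    parsed = response.output_parsed
    if parsed is None:
        raise ValueError("Model returned no parseable output (for example, a refusal).")
    return parsed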

# Usage
analysis = analyze_and_describe("product-photo.png")
print(f"Description: {analysis.description}")
print(f"Objects: {', '.join(analysis.objects)}")
print(f"Mood: {analysis.mood}")
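The "normalize files" step mentioned above can start with a cheap pre-flight check that runs before any API call. The sketch below identifies the image format from its leading bytes rather than trusting the file extension, and rejects oversized files. The function names and the 20 MB default are assumptions for illustration; the real size limit should be taken from the current API documentation.

```python
from typing import Optional


def sniff_image_mime(data: bytes) -> Optional[str]:
    """Identify common image formats from their file signatures."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def preflight(data: bytes, max_bytes: int = 20 * 1024 * 1024) -> str:
    """Validate size and format; return the MIME type to use in the data URL."""
    # max_bytes is an assumed limit, not a documented one.
    if len(data) > max_bytes:
        raise ValueError(f"image is {len(data)} bytes, over the {max_bytes}-byte limit")
    mime_type = sniff_image_mime(data)
    if mime_type is None:
        raise ValueError("unrecognized image format")
    return mime_type


# Synthetic header bytes, enough to exercise the signature check.
print(preflight(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
```

The returned MIME type can replace the hard-coded `image/png` in the data URL, so a JPEG saved with a `.png` extension is still labeled correctly.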
Try It Yourself: Build an ‘interior design assistant’: (1) Take a photo of a room, (2) have GPT-4 Vision analyze the current style, lighting, and furniture, (3) suggest 3 improvements, (4) use DALL-E to generate a visualization of the redesigned room. Chain all steps in a single workflow.

Next in the SDK Track

In OA Part 6: Audio, Speech & Video, we’ll implement Whisper transcription, TTS with streaming, and video generation with Sora.