
PydanticAI SDK Track Part 8: Multimodal Input & Thinking

May 24, 2026 Wasil Zafar 35 min read

Process images, audio, video, and documents as agent inputs. Enable thinking/reasoning mode for complex tasks, configure HTTP request retry strategies, and handle transient errors gracefully.

Table of Contents

  1. Image Input
  2. Audio, Video & Document Input
  3. Thinking / Reasoning Mode
  4. HTTP Request Retries
  5. Error Handling Patterns
What You’ll Learn: How to pass images, audio, video, and documents to a PydanticAI agent; how to enable and budget thinking/reasoning mode for complex tasks; and how to configure HTTP timeouts and retries and handle transient errors gracefully.

1. Image Input

PydanticAI supports multimodal inputs — you can pass images alongside text prompts for vision-capable models. Images can be provided via URL reference or inline as base64-encoded binary data.

1.1 ImageUrl & BinaryContent Types

Use ImageUrl for publicly-accessible images and BinaryContent for local files or dynamically generated images:

from pydantic_ai import Agent, ImageUrl

agent = Agent(
    "openai:gpt-4o",
    system_prompt="You are an image analysis assistant. Describe what you see in detail."
)

# Pass an image via URL
result = agent.run_sync([
    "What's in this image? Describe the main elements.",
    ImageUrl(url="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Camponotus_flavomarginatus_ant.jpg/320px-Camponotus_flavomarginatus_ant.jpg"),
])
print(result.output)

For local files, read them as binary and pass with the appropriate MIME type:

from pydantic_ai import Agent, BinaryContent
from pathlib import Path

agent = Agent(
    "openai:gpt-4o",
    system_prompt="You analyze architectural diagrams and identify components."
)

# Read a local image file
image_path = Path("./diagrams/system-architecture.png")
image_bytes = image_path.read_bytes()

result = agent.run_sync([
    "Analyze this system architecture diagram. List all components and their connections.",
    BinaryContent(data=image_bytes, media_type="image/png"),
])
print(result.output)
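When the file type is not known in advance, the MIME type can often be derived from the file extension instead of being hard-coded. A minimal sketch using only the standard library; the helper name guess_media_type is hypothetical and not part of PydanticAI:

```python
import mimetypes
from pathlib import Path


def guess_media_type(path: Path, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, falling back to a default."""
    media_type, _encoding = mimetypes.guess_type(path.name)
    return media_type or default


# The path does not need to exist; only the extension is inspected
print(guess_media_type(Path("./diagrams/system-architecture.png")))
```

The result can be passed as the media_type argument of BinaryContent. Extension-based guessing says nothing about the actual file contents, so treat it as a convenience rather than validation.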

1.2 Multi-Image Prompts

Pass multiple images in a single prompt for comparison or comprehensive analysis:

from pydantic_ai import Agent, ImageUrl

agent = Agent(
    "openai:gpt-4o",
    system_prompt="You compare images and identify differences."
)

# Compare two images
result = agent.run_sync([
    "Compare these two UI mockups. What changed between version 1 and version 2?",
    ImageUrl(url="https://example.com/mockup-v1.png"),
    ImageUrl(url="https://example.com/mockup-v2.png"),
])
print(result.output)
Image Best Practices: (1) Use URLs for public images to avoid the roughly 33% request-size overhead of base64 encoding. (2) Resize large images before sending, since most models downsample anyway. (3) Use BinaryContent for private/local files. (4) Check model-specific limits on image count and total size per request.
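To put a number on the base64 overhead mentioned above: base64 encodes every 3 bytes as 4 characters, so an inline image grows by about a third before it is sent. A small standard-library sketch (the helper name base64_overhead is hypothetical, and the zero-filled buffer stands in for real image bytes):

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the size ratio of base64-encoded data to the raw bytes."""
    if not raw:
        raise ValueError("raw must be non-empty")
    return len(base64.b64encode(raw)) / len(raw)


raw = bytes(300_000)  # stand-in for a 300 KB image
encoded = base64.b64encode(raw)
print(f"raw={len(raw)} bytes, encoded={len(encoded)} bytes, ratio={base64_overhead(raw):.2f}")
```

The same ratio applies to any binary payload, which is why resizing before encoding pays off twice: fewer raw bytes and proportionally fewer encoded ones.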

2. Audio, Video & Document Input

Beyond images, PydanticAI supports audio, video, and document inputs for models with those capabilities (Gemini, GPT-4o with audio):

from pydantic_ai import Agent, BinaryContent
from pathlib import Path

agent = Agent(
    "google-gla:gemini-2.0-flash",
    system_prompt="You transcribe and analyze audio content."
)

# Process an audio file
audio_bytes = Path("./recordings/meeting.mp3").read_bytes()

result = agent.run_sync([
    "Transcribe this audio and provide a summary of key points discussed.",
    BinaryContent(data=audio_bytes, media_type="audio/mpeg"),  # audio/mpeg is the standard MIME type for MP3
])
print(result.output)

2.1 Video Understanding

from pydantic_ai import Agent, BinaryContent
from pathlib import Path

agent = Agent(
    "google-gla:gemini-2.0-flash",
    system_prompt="You analyze video content and describe key scenes."
)

# Process a short video clip
video_bytes = Path("./clips/product-demo.mp4").read_bytes()

result = agent.run_sync([
    "Describe what happens in this product demo video. "
    "List the main features being demonstrated.",
    BinaryContent(data=video_bytes, media_type="video/mp4"),
])
print(result.output)

2.2 PDF & Document Processing

from pydantic_ai import Agent, BinaryContent
from pydantic import BaseModel
from pathlib import Path

class InvoiceData(BaseModel):
    """Structured data extracted from an invoice."""
    vendor_name: str
    invoice_number: str
    total_amount: float
    due_date: str
    line_items: list[str]

agent = Agent(
    "openai:gpt-4o",
    output_type=InvoiceData,
    system_prompt="Extract structured data from invoice documents."
)

# Process a PDF invoice
pdf_bytes = Path("./documents/invoice-2026-001.pdf").read_bytes()

result = agent.run_sync([
    "Extract all key information from this invoice.",
    BinaryContent(data=pdf_bytes, media_type="application/pdf"),
])

print(f"Vendor: {result.output.vendor_name}")
print(f"Invoice #: {result.output.invoice_number}")
print(f"Total: ${result.output.total_amount:.2f}")
print(f"Due: {result.output.due_date}")
print(f"Items: {result.output.line_items}")
Model Support Matrix: Not all models support all media types. GPT-4o accepts images and PDF documents; audio input requires an audio-capable variant such as gpt-4o-audio-preview. Gemini models accept images, audio, video, and PDFs. Always check provider documentation for current capabilities and size limits.

3. Thinking / Reasoning Mode

Thinking mode enables extended reasoning for models that support it (Claude with extended thinking, Gemini with thinking budget). The model spends additional “thinking tokens” reasoning about the problem before producing its final answer, which can improve accuracy on complex, multi-step tasks at the cost of extra latency and tokens.

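from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModelSettings

agent = Agent(
    "anthropic:claude-sonnet-4-20250514",
    system_prompt="You are a precise reasoning assistant.",
    # Thinking is enabled through provider-specific settings;
    # Anthropic requires a token budget (minimum 1,024)
    model_settings=AnthropicModelSettings(
        anthropic_thinking={"type": "enabled", "budget_tokens": 1024},
    ),
)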

result = agent.run_sync(
    "A farmer has a fox, a chicken, and a bag of grain. "
    "He needs to cross a river in a boat that can only carry him and one item at a time. "
    "If left alone, the fox will eat the chicken, and the chicken will eat the grain. "
    "How does he get everything across safely? Show your reasoning step by step."
)
print(result.output)

3.1 Accessing Thinking Tokens

You can inspect the thinking content to understand the model’s reasoning process — useful for debugging and evaluation:

from pydantic_ai import Agent
from pydantic_ai.messages import ModelResponse, ThinkingPart
from pydantic_ai.models.anthropic import AnthropicModelSettings

agent = Agent(
    "anthropic:claude-sonnet-4-20250514",
    model_settings=AnthropicModelSettings(
        anthropic_thinking={"type": "enabled", "budget_tokens": 1024},
    ),
)

result = agent.run_sync("What is 847 * 293? Show your work.")

# Access the final output
print(f"Answer: {result.output}")

# Thinking is stored as ThinkingPart entries inside model responses
for message in result.all_messages():
    if not isinstance(message, ModelResponse):
        continue
    for part in message.parts:
        if isinstance(part, ThinkingPart):
            print("\n--- Model Thinking ---")
            print(part.content[:500])
            print("...")

3.2 Thinking Budget Control

Real-World Application

CI/CD for AI Agents

A startup runs 200 agent tests in their CI pipeline (avg 3 seconds total, zero API costs). They mock models for unit tests, use a cheap model (GPT-4 mini) for integration tests, and run a nightly evaluation suite with the production model. Prompt regressions are caught before merge, and the test suite has prevented 12 production incidents in 6 months.


Control how much reasoning effort the model invests. Higher budgets improve accuracy but increase cost and latency:

from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModelSettings

# Minimal thinking: fast and cheap for simple tasks (1,024 is Anthropic's minimum budget)
quick_agent = Agent(
    "anthropic:claude-sonnet-4-20250514",
    model_settings=AnthropicModelSettings(
        anthropic_thinking={"type": "enabled", "budget_tokens": 1024},
    ),
)

# Deep thinking: larger budget for complex tasks; max_tokens must exceed the budget
deep_agent = Agent(
    "anthropic:claude-sonnet-4-20250514",
    model_settings=AnthropicModelSettings(
        max_tokens=20000,
        anthropic_thinking={"type": "enabled", "budget_tokens": 16384},
    ),
)

# Simple question — minimal thinking is sufficient
result_quick = quick_agent.run_sync("What is the capital of Japan?")
print(f"Quick: {result_quick.output}")

# Complex question — deep thinking improves accuracy
result_deep = deep_agent.run_sync(
    "Design a distributed consensus algorithm that handles Byzantine faults "
    "with at most f faulty nodes in a network of 3f+1 total nodes. "
    "Explain the message complexity."
)
print(f"Deep: {result_deep.output[:200]}...")
Cost Impact: Thinking tokens are billed at the output token rate. At $15 per million output tokens (Claude Sonnet 4 list pricing at the time of writing), a fully used thinking budget of 8,192 tokens adds about $0.12 per request. For high-volume applications, use thinking selectively: enable it for complex reasoning tasks and disable it for simple lookups.
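The cost of a thinking budget is plain arithmetic: tokens used times the output-token price. A rough sketch, assuming a rate of $15 per million output tokens; that figure and the function name are illustrative assumptions, so check your provider's current pricing before relying on the numbers:

```python
def thinking_cost_usd(thinking_tokens: int, usd_per_million_output_tokens: float) -> float:
    """Upper-bound cost of thinking tokens billed at the output-token rate."""
    if thinking_tokens < 0 or usd_per_million_output_tokens < 0:
        raise ValueError("tokens and rate must be non-negative")
    return thinking_tokens * usd_per_million_output_tokens / 1_000_000


ASSUMED_RATE = 15.0  # USD per million output tokens (assumption, not a quoted price)

for budget in (1024, 8192, 16384):
    cost = thinking_cost_usd(budget, ASSUMED_RATE)
    print(f"{budget:>6} tokens -> up to ${cost:.4f} per request")
```

The budget is a ceiling, not a fixed charge: a model that stops thinking early is billed only for the tokens it actually produced.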

4. HTTP Request Retries

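Transient HTTP failures (rate limits, timeouts, server errors) should be retried rather than surfaced to the user. The simplest controls are a request timeout in ModelSettings and the retry count of the provider's SDK client:

from httpx import Timeout
from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.settings import ModelSettings

# The OpenAI SDK client retries transient errors itself; max_retries sets the cap
client = AsyncOpenAI(max_retries=3)
model = OpenAIChatModel("gpt-4o", provider=OpenAIProvider(openai_client=client))

agent = Agent(
    model,
    model_settings=ModelSettings(
        timeout=Timeout(
            connect=5.0,    # Connection timeout
            read=30.0,      # Read timeout
            write=10.0,     # Write timeout
            pool=5.0,       # Pool timeout
        ),
    ),
)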

result = agent.run_sync("Generate a haiku about resilient systems.")
print(result.output)

4.1 Custom Retry Configuration

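For fine-grained control over backoff and retryable status codes, wrap the HTTP client in the tenacity-based transport from pydantic_ai.retries (this requires the tenacity package):

from httpx import AsyncClient, HTTPStatusError, Response
from tenacity import retry_if_exception_type, stop_after_attempt, wait_exponential

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.retries import AsyncTenacityTransport, RetryConfig, wait_retry_after

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def raise_on_retryable_status(response: Response) -> None:
    """Turn retryable HTTP statuses into exceptions so tenacity retries them."""
    if response.status_code in RETRYABLE_STATUS_CODES:
        response.raise_for_status()

transport = AsyncTenacityTransport(
    config=RetryConfig(
        retry=retry_if_exception_type(HTTPStatusError),
        # Honor Retry-After when present, else exponential backoff: 1s, 2s, 4s, 8s, 16s
        wait=wait_retry_after(
            fallback_strategy=wait_exponential(multiplier=1, max=60),
            max_wait=60,                # Cap any single wait at 60 seconds
        ),
        stop=stop_after_attempt(6),     # 1 initial attempt + up to 5 retries
        reraise=True,
    ),
    validate_response=raise_on_retryable_status,
)

model = OpenAIChatModel("gpt-4o", provider=OpenAIProvider(http_client=AsyncClient(transport=transport)))
agent = Agent(model)

# Rate limits (429) and server errors are now retried with backoff
result = agent.run_sync("Summarize the key principles of distributed systems.")
print(result.output)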
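The backoff schedule itself is easy to reason about in isolation. The sketch below computes capped exponential delays with the standard library only; the function and parameter names are illustrative and do not correspond to a PydanticAI API:

```python
def backoff_delays(
    retries: int,
    initial_delay: float = 1.0,
    backoff_factor: float = 2.0,
    max_delay: float = 60.0,
) -> list[float]:
    """Delay before each retry: initial_delay * factor**n, capped at max_delay."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    return [min(initial_delay * backoff_factor**attempt, max_delay) for attempt in range(retries)]


print(backoff_delays(5))       # delays before retries 1 through 5
print(sum(backoff_delays(8)))  # worst-case total wait across 8 retries
```

Summing the schedule gives the worst-case time a caller may wait before the final failure, which is worth comparing against any user-facing timeout. Production retry loops usually add random jitter to each delay so that many clients do not retry in lockstep.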

5. Error Handling Patterns

Production agents must handle errors gracefully. PydanticAI raises specific exception types you can catch and respond to appropriately:

from pydantic_ai import Agent
from pydantic_ai.exceptions import (
    ModelHTTPError,
    UnexpectedModelBehavior,
    AgentRunError,
)
import logging

logger = logging.getLogger(__name__)

agent = Agent("openai:gpt-4o", system_prompt="You are a helpful assistant.")

async def safe_agent_call(prompt: str) -> str:
    """Execute an agent call with comprehensive error handling."""
    try:
        result = await agent.run(prompt)
        return result.output

    except ModelHTTPError as e:
        # Network/API errors (rate limits, server errors, auth failures)
        logger.error(f"Model API error: {e.status_code} - {e.message}")
        if e.status_code == 429:
            return "I'm experiencing high demand. Please try again in a moment."
        elif e.status_code == 401:
            return "Authentication error. Please check API key configuration."
        else:
            return f"Service temporarily unavailable (HTTP {e.status_code})."

    except UnexpectedModelBehavior as e:
        # Model misbehaved: malformed response, or validation/tool retries were exhausted
        logger.warning(f"Unexpected model behavior: {e}")
        return "I encountered an unexpected response. Let me try a simpler approach."

    except AgentRunError as e:
        # Any other agent-level error (e.g. usage limits exceeded); this is the base class
        # of the exceptions above, so it must come after them
        logger.error(f"Agent run failed: {e}")
        return "I wasn't able to complete that request. Please try rephrasing."

    except Exception as e:
        # Catch-all for unexpected errors
        logger.exception(f"Unexpected error in agent call: {e}")
        return "An unexpected error occurred. Our team has been notified."

# Usage
import asyncio
response = asyncio.run(safe_agent_call("What's the weather today?"))
print(response)

5.1 Graceful Degradation with Fallback Models

For critical applications, implement fallback to a secondary model when the primary fails:

from pydantic_ai import Agent
from pydantic_ai.exceptions import ModelHTTPError
import logging

logger = logging.getLogger(__name__)

# Primary: high-quality model
primary_agent = Agent("openai:gpt-4o", system_prompt="You are a helpful assistant.")

# Fallback: faster, more available model
fallback_agent = Agent("openai:gpt-4o-mini", system_prompt="You are a helpful assistant.")

async def resilient_call(prompt: str) -> str:
    """Try primary model, fall back to secondary on failure."""
    try:
        result = await primary_agent.run(prompt)
        return result.output
    except ModelHTTPError as e:
        logger.warning(f"Primary model failed ({e.status_code}), falling back...")
        try:
            result = await fallback_agent.run(prompt)
            return result.output
        except ModelHTTPError as e2:
            logger.error(f"Fallback also failed ({e2.status_code})")
            return "All models are currently unavailable. Please try again later."

# Usage
import asyncio
response = asyncio.run(resilient_call("Explain microservices architecture briefly."))
print(response)
Production Checklist: (1) Always set explicit timeouts — never wait indefinitely. (2) Configure retries with exponential backoff for rate limits. (3) Implement fallback models for critical paths. (4) Log all errors with context for debugging. (5) Set up alerts for elevated error rates. (6) Use circuit breakers for persistent failures.
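Item (6) of the checklist can be sketched without any framework. The class below is a minimal, illustrative circuit breaker (the class and method names are hypothetical, and it is not thread-safe): it opens after a number of consecutive failures and allows a trial request once a cool-down has passed. The clock is injectable so the behavior can be demonstrated without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # Closed: requests flow normally
        # Half-open: permit a trial request once the cool-down has elapsed
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (Re)open and restart the cool-down


# Demonstration with a fake clock instead of real waiting
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())  # breaker is open
now[0] = 10.0
print(breaker.allow_request())  # cool-down elapsed, trial request allowed
```

In an agent wrapper, allow_request() would be checked before calling the model, with record_success() or record_failure() called afterwards; when the breaker is open, the wrapper can return a canned message or go straight to the fallback model.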
Try It Yourself: Write a complete test suite for a ‘product recommendation’ agent: (1) mock the model to return predefined responses, (2) test that the result validates against the output schema, (3) test error handling when the model returns invalid data, (4) test tool execution with mocked dependencies, (5) create a snapshot test that detects prompt regressions.

Next in the PydanticAI SDK Track

In Part 9: Multi-Agent Patterns & Testing, we’ll build multi-agent systems with delegation and handoff patterns, implement comprehensive testing strategies with TestModel, and deploy production-grade agent applications.