OpenAI SDK Track Part 6: Multimodal — Audio, Speech & Video

            
            What You’ll Learn: OpenAI’s audio capabilities span three areas: speech-to-text (Whisper), text-to-speech (TTS), and audio understanding. This article teaches you to build applications that hear, speak, and understand audio context — from transcription pipelines to voice-enabled assistants. Think of it as giving your application ears and a voice.
        

1. Whisper Speech-to-Text

Audio pipelines usually start with transcription because text becomes the bridge to everything else: search, summarization, classification, routing, and analytics. The snippets below move from the simplest transcription flow to richer timestamped output and translation.

from openai import OpenAI

client = OpenAI()

# Basic transcription
with open("meeting-recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)

Timestamped output is what you want for editors, subtitle systems, QA review tools, or anything that needs to jump back to precise moments in an audio stream rather than just showing one large block of text.

from openai import OpenAI

client = OpenAI()

# Transcription with word-level timestamps
with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

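print(f"Full text: {transcript.text}")

print("\nSegments:")
for segment in transcript.segments:
    print(f"  [{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

# Word-level entries carry their own start and end times
print("\nFirst words:")
for word in transcript.words[:10]:
    print(f"  [{word.start:.2f}s - {word.end:.2f}s] {word.word}")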
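
Since subtitle systems are one of the main consumers of timestamped output, here is a minimal sketch of turning segments into SubRip (SRT) text. The `Segment` dataclass and both helper names are hypothetical, introduced only for illustration; the sketch assumes each segment exposes `start`, `end` (in seconds), and `text`, as the loop over `transcript.segments` does.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical stand-in for a transcription segment (illustration only)."""
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable) -> str:
    """Build SRT text from objects that expose .start, .end, and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # A blank line separates consecutive subtitle entries
    return "\n".join(blocks)
```

Under those assumptions, `segments_to_srt(transcript.segments)` could be written straight to a `.srt` file. Long segments may still need to be split by hand for comfortable on-screen reading.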

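Translation uses a separate endpoint: it takes speech in another language and returns English text in a single call, with no intermediate transcript in the source language.

from openai import OpenAI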

client = OpenAI()

# Translation (any language → English)
with open("french-podcast.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )
print(f"English translation: {translation.text}")

2. TTS API

Text-to-speech is the reverse bridge: it turns generated or retrieved text back into an audio experience. In production, the main choices are usually voice, speed, and whether you need instant low-latency output or higher-quality rendered audio.

from openai import OpenAI
from pathlib import Path

client = OpenAI()

# Generate speech from text
speech_file = Path("output.mp3")

response = client.audio.speech.create(
    model="tts-1-hd",      # tts-1 (fast) or tts-1-hd (quality)
    voice="nova",           # alloy, echo, fable, onyx, nova, shimmer
    input="Welcome to the OpenAI SDK tutorial. Today we'll explore text-to-speech capabilities and how to integrate them into your applications.",
    speed=1.0,              # 0.25 to 4.0
)

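# write_to_file saves the fully downloaded audio; stream_to_file is
# deprecated on non-streaming responses in recent SDK versions
response.write_to_file(speech_file)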
print(f"Audio saved to {speech_file}")
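
One practical constraint when generating speech from long text is the input length cap on the speech endpoint. The sketch below splits text into chunks at sentence boundaries so each chunk can be sent as its own request. The 4096-character limit is the documented value at the time of writing and should be treated as an assumption to verify; `split_for_tts` and `MAX_TTS_CHARS` are hypothetical names.

```python
import re

# Assumed input limit for the speech endpoint; verify against the current API reference
MAX_TTS_CHARS = 4096


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go through its own `client.audio.speech.create` call, with the resulting audio files played or concatenated in order.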

Real-World Application

Multilingual Call Center Automation

A global support center uses Whisper for real-time transcription in 12 languages, GPT-4 for intent detection and response generation, and TTS for automated voice responses. The system handles 40% of calls without human agents and provides real-time translation for the rest.


3. Streaming Audio

Streaming matters when the user should hear output immediately instead of waiting for the whole file to render. That makes it the right fit for voice assistants, live narration, and real-time conversational systems.

from openai import OpenAI

client = OpenAI()

# Stream TTS audio chunks for real-time playback.
# with_streaming_response keeps the HTTP body open so chunks can be
# consumed as they arrive; a plain create() call downloads everything first.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="This is a streaming audio example. The audio chunks are delivered as they are generated, enabling real-time playback without waiting for the full generation to complete.",
    response_format="pcm",  # Raw PCM for streaming playback
) as response:
    # Write chunks as they arrive
    with open("stream_output.pcm", "wb") as f:
        for chunk in response.iter_bytes(chunk_size=4096):
            f.write(chunk)

print("Streaming audio saved")
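
Raw PCM has no header, so most players will not open `stream_output.pcm` directly. A small sketch using the standard-library `wave` module can wrap the samples in a WAV container. The defaults here (24 kHz, mono, 16-bit signed little-endian) are assumptions based on how the `pcm` format is documented at the time of writing, and `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(
    pcm_path: str,
    wav_path: str,
    sample_rate: int = 24_000,  # assumed sample rate of the pcm response format
    channels: int = 1,          # assumed mono output
    sample_width: int = 2,      # bytes per sample (16-bit)
) -> float:
    """Wrap raw PCM samples in a WAV container and return the duration in seconds."""
    with open(pcm_path, "rb") as src:
        pcm = src.read()

    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)

    # Duration follows from byte count divided by bytes consumed per second
    return len(pcm) / (sample_rate * channels * sample_width)
```

If the resulting file plays at the wrong speed or pitch, the sample rate assumption is the first thing to check.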

4. Video Generation (Sora)

Video generation is operationally different from text, images, and speech because it is inherently asynchronous. You submit a job, poll for state changes, and treat the result more like a render pipeline than an inline inference call.

from openai import OpenAI
import time

client = OpenAI()

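# Generate a video with Sora.
# Parameter names follow the Videos API at the time of writing; supported
# models, sizes, and durations vary, so check the current API reference.
video = client.videos.create(
    model="sora-2",
    prompt="A timelapse of a city transitioning from day to night, with lights gradually turning on across skyscrapers, shot from a rooftop perspective",
    size="1280x720",
    seconds="4",        # clip length, passed as a string
)

# Video generation is async: poll for completion
video_id = video.id
print(f"Video generation started: {video_id}")

while True:
    video = client.videos.retrieve(video_id)
    if video.status == "completed":
        # The finished file is downloaded by ID rather than exposed as a URL
        content = client.videos.download_content(video_id)
        content.write_to_file("city-timelapse.mp4")
        print("Video saved to city-timelapse.mp4")
        break
    elif video.status == "failed":
        print(f"Generation failed: {video.error}")
        break
    time.sleep(5)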
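
A bare `while True` loop never gives up if a job stalls. The sketch below shows one way to add a timeout and exponential backoff to the submit-and-poll pattern. `poll_until_done` and all of its parameters are hypothetical, and the terminal status names are assumed to match the ones checked in the loop above.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def poll_until_done(
    fetch: Callable[[], T],
    get_status: Callable[[T], str],
    *,
    terminal: Sequence[str] = ("completed", "failed"),  # assumed terminal states
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call fetch() until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        job = fetch()
        if get_status(job) in terminal:
            return job
        # Stop before sleeping past the deadline
        if clock() + delay > deadline:
            raise TimeoutError(f"job still not finished after {timeout} seconds")
        sleep(delay)
        # Double the wait each round, capped at max_delay
        delay = min(delay * 2, max_delay)
```

With the client from the snippet above, a call might look like `poll_until_done(lambda: client.videos.retrieve(video_id), lambda v: v.status)`. Injecting `sleep` and `clock` keeps the helper testable without real waiting.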

5. Multimodal Combinations

The most useful multimodal systems chain modalities instead of treating them separately. This final example shows the common pattern: audio becomes text, text becomes structured insight or a summary, and that insight becomes speech again for a user-facing output.

from openai import OpenAI
from pathlib import Path

client = OpenAI()

def podcast_summarizer(audio_path: str) -> dict:
    """Full pipeline: transcribe audio → summarize → generate audio summary."""
    # Step 1: Transcribe
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: Summarize with LLM
    summary = client.responses.create(
        model="gpt-4.1-mini",
        input=f"Summarize this podcast transcript in 3 bullet points:\n\n{transcript.text}",
    )

    # Step 3: Generate audio of the summary
    audio_summary = client.audio.speech.create(
        model="tts-1-hd",
        voice="nova",
        input=f"Here's your podcast summary: {summary.output_text}",
    )
    # write_to_file replaces the deprecated stream_to_file on non-streaming responses
    audio_summary.write_to_file(Path("podcast-summary.mp3"))

    return {
        "transcript_length": len(transcript.text),
        "summary": summary.output_text,
        "audio_file": "podcast-summary.mp3",
    }

result = podcast_summarizer("episode-42.mp3")
print(f"Summary: {result['summary']}")
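
The same chain can also be written with each stage passed in as a function, which makes the pipeline testable without network calls and lets you swap models per stage. This is a sketch of that design choice rather than part of the SDK; `PipelineResult`, `run_audio_pipeline`, and the three stage parameters are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str
    summary: str
    audio_path: str


def run_audio_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],  # audio file path -> transcript text
    summarize: Callable[[str], str],   # transcript text -> summary text
    speak: Callable[[str], str],       # summary text -> path of generated audio
) -> PipelineResult:
    """Chain audio -> text -> summary -> audio using injected stage functions."""
    transcript = transcribe(audio_path)
    if not transcript.strip():
        # Fail early instead of summarizing and voicing an empty transcript
        raise ValueError("transcription returned no text")
    summary = summarize(transcript)
    output_path = speak(summary)
    return PipelineResult(transcript=transcript, summary=summary, audio_path=output_path)
```

In production each stage would wrap one of the API calls from `podcast_summarizer`; in tests, plain lambdas returning fixed strings are enough.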

            
            Try It Yourself: Build a ‘podcast summarizer’ pipeline: (1) Transcribe an audio file using Whisper, (2) generate a structured summary (key points, quotes, timestamps), (3) create a TTS audio version of the summary in a different voice. Test with a 5-minute audio clip and measure transcription accuracy.
        

Next in the SDK Track

In OA Part 7: Embeddings & File Search, we’ll master OpenAI’s embeddings API, vector stores, and the File Search tool for building RAG systems.
