1. Whisper Speech-to-Text
Audio pipelines usually start with transcription because text becomes the bridge to everything else: search, summarization, classification, routing, and analytics. The snippets below move from the simplest transcription flow to richer timestamped output and translation.
from openai import OpenAI

client = OpenAI()

# Basic transcription
with open("meeting-recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
Timestamped output is what you want for editors, subtitle systems, QA review tools, or anything that needs to jump back to precise moments in an audio stream rather than just showing one large block of text.
from openai import OpenAI

client = OpenAI()

# Transcription with word- and segment-level timestamps
with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # required for timestamp_granularities
        timestamp_granularities=["word", "segment"],
    )

print(f"Full text: {transcript.text}")
print("\nSegments:")
for segment in transcript.segments:
    print(f"  [{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")
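For subtitle systems, segment timestamps usually have to be converted into a subtitle format. The following is a minimal sketch of one way to build SRT text from segments. The helper names `format_srt_time` and `segments_to_srt` are hypothetical, and the input is assumed to be a list of plain dicts with `start`, `end`, and `text` keys rather than SDK objects.

```python
from typing import Iterable, Mapping


def format_srt_time(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Build SRT subtitle text from items with start, end, and text keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        # Each SRT cue is: sequence number, time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```

With the transcript from the snippet above, the segments could be adapted with something like `[{"start": s.start, "end": s.end, "text": s.text} for s in transcript.segments]` before calling `segments_to_srt`.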
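Translation uses a separate endpoint: it takes audio in a supported non-English language and returns English text in a single call.

from openai import OpenAI

client = OpenAI()

# Translation (supported source language → English)
with open("french-podcast.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(f"English translation: {translation.text}")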
2. TTS API
Text-to-speech is the reverse bridge: it turns generated or retrieved text back into an audio experience. In production, the main choices are usually voice, speed, and whether you need instant low-latency output or higher-quality rendered audio.
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Generate speech from text and stream it straight to disk.
# with_streaming_response avoids the deprecated stream_to_file() call
# on a non-streaming response.
speech_file = Path("output.mp3")
with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",  # tts-1 (fast) or tts-1-hd (quality)
    voice="nova",      # alloy, echo, fable, onyx, nova, shimmer
    input="Welcome to the OpenAI SDK tutorial. Today we'll explore text-to-speech capabilities and how to integrate them into your applications.",
    speed=1.0,         # 0.25 to 4.0
) as response:
    response.stream_to_file(speech_file)

print(f"Audio saved to {speech_file}")
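Long inputs generally have to be split before synthesis, because the speech endpoint caps the input length per request. The sketch below shows one way to chunk text at sentence boundaries. The 4096-character limit is an assumption that should be checked against the current API documentation, and `split_for_tts` is a hypothetical helper, not part of the SDK.

```python
import re

MAX_TTS_CHARS = 4096  # assumed per-request input limit; verify against current docs


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as the `input` of its own speech request, with the resulting audio files played or concatenated in order.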
Multilingual Call Center Automation
A global support center uses Whisper for real-time transcription in 12 languages, GPT-4 for intent detection and response generation, and TTS for automated voice responses. The system handles 40% of calls without human agents and provides real-time translation for the rest.
3. Streaming Audio
Streaming matters when the user should hear output immediately instead of waiting for the whole file to render. That is the right fit for voice assistants, live narration, or real-time conversational systems.
from openai import OpenAI

client = OpenAI()

# Stream TTS audio chunks for real-time playback.
# with_streaming_response yields bytes as they arrive; a plain create() call
# downloads the full body before returning.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="This is a streaming audio example. The audio chunks are delivered as they are generated, enabling real-time playback without waiting for the full generation to complete.",
    response_format="pcm",  # Raw PCM for streaming playback
) as response:
    # Write chunks as they arrive
    with open("stream_output.pcm", "wb") as f:
        for chunk in response.iter_bytes(chunk_size=4096):
            f.write(chunk)

print("Streaming audio saved")
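Raw PCM has no header, so most players cannot open the saved file directly. The sketch below wraps the samples in a WAV container using only the standard library. It assumes the stream is 24 kHz, 16-bit, mono, which is how the PCM output is commonly described; treat those defaults as assumptions to confirm. `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(pcm_path: str, wav_path: str, sample_rate: int = 24_000,
               channels: int = 1, sample_width: int = 2) -> float:
    """Wrap headerless PCM samples in a WAV container; return duration in seconds."""
    with open(pcm_path, "rb") as src:
        frames = src.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)       # 1 = mono (assumed)
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples (assumed)
        wav.setframerate(sample_rate)    # 24 kHz (assumed)
        wav.writeframes(frames)
    return len(frames) / (sample_rate * channels * sample_width)
```

Calling `pcm_to_wav("stream_output.pcm", "stream_output.wav")` would produce a file that ordinary audio players can open, provided the assumed format matches the actual stream.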
4. Video Generation (Sora)
Video generation is operationally different from text, images, and speech because it is inherently asynchronous. You submit a job, poll for state changes, and treat the result more like a render pipeline than an inline inference call.
import time

from openai import OpenAI

client = OpenAI()

# Generate a video with Sora. Method names, model name, and accepted values for
# size and seconds follow the Videos API in recent SDK releases; check them
# against your installed version.
video = client.videos.create(
    model="sora-2",
    prompt="A timelapse of a city transitioning from day to night, with lights gradually turning on across skyscrapers, shot from a rooftop perspective",
    size="1280x720",
    seconds="4",
)

# Video generation is async: poll for completion
print(f"Video generation started: {video.id}")
while video.status in ("queued", "in_progress"):
    time.sleep(5)
    video = client.videos.retrieve(video.id)

if video.status == "completed":
    content = client.videos.download_content(video.id)
    content.write_to_file("city-timelapse.mp4")
    print("Video saved to city-timelapse.mp4")
else:
    print(f"Generation failed: {video.error}")
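A bare polling loop like the one above never gives up if a job stalls. One possible refinement is a small generic helper with a timeout and capped exponential backoff. This is a sketch, not part of the SDK: `poll_until_done` and its parameters are hypothetical, and the terminal state names are taken from the example above.

```python
import time
from typing import Callable, Iterable


def poll_until_done(fetch_status: Callable[[], str],
                    terminal: Iterable[str] = ("completed", "failed"),
                    timeout_s: float = 600.0,
                    initial_delay_s: float = 2.0,
                    max_delay_s: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status until it returns a terminal state or the timeout elapses."""
    terminal_states = set(terminal)
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        # Stop before sleeping if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s:.0f}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

It could be wired to the video job with something like `poll_until_done(lambda: client.videos.retrieve(video_id).status)`. Passing `sleep` as a parameter keeps the helper testable without real waiting.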
5. Multimodal Combinations
The most useful multimodal systems chain modalities instead of treating them separately. This final example shows the common pattern: audio becomes text, text becomes structured insight or a summary, and that insight becomes speech again for a user-facing output.
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def podcast_summarizer(audio_path: str) -> dict:
    """Full pipeline: transcribe audio → summarize → generate audio summary."""
    # Step 1: Transcribe
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: Summarize with LLM
    summary = client.responses.create(
        model="gpt-4.1-mini",
        input=f"Summarize this podcast transcript in 3 bullet points:\n\n{transcript.text}",
    )

    # Step 3: Generate audio of the summary and stream it to disk
    output_path = Path("podcast-summary.mp3")
    with client.audio.speech.with_streaming_response.create(
        model="tts-1-hd",
        voice="nova",
        input=f"Here's your podcast summary: {summary.output_text}",
    ) as audio_summary:
        audio_summary.stream_to_file(output_path)

    return {
        "transcript_length": len(transcript.text),
        "summary": summary.output_text,
        "audio_file": str(output_path),
    }


result = podcast_summarizer("episode-42.mp3")
print(f"Summary: {result['summary']}")
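Because every stage in a chained pipeline calls a paid API, it can help to pass the stages in as plain callables so the orchestration logic can be exercised with stubs. The following is a rough sketch of that idea. `PipelineResult` and `run_audio_pipeline` are hypothetical names, and the three callables stand in for the transcription, summarization, and speech calls shown above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    transcript: str
    summary: str
    audio_path: str


def run_audio_pipeline(audio_path: str,
                       transcribe: Callable[[str], str],
                       summarize: Callable[[str], str],
                       synthesize: Callable[[str, str], None],
                       output_path: str = "summary.mp3") -> PipelineResult:
    """Chain audio → text → summary → audio using injected stage functions."""
    transcript = transcribe(audio_path)
    if not transcript.strip():
        # Fail early instead of paying for a summary of empty text.
        raise ValueError("transcription returned no text")
    summary = summarize(transcript)
    synthesize(summary, output_path)
    return PipelineResult(transcript=transcript, summary=summary, audio_path=output_path)
```

In production the callables would wrap the real client calls; in tests they can be simple lambdas that return canned strings.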
Next in the SDK Track
In OA Part 7: Embeddings & File Search, we’ll master OpenAI’s embeddings API, vector stores, and the File Search tool for building RAG systems.