1. Realtime Architecture
sequenceDiagram
participant C as Client
participant WS as WebSocket / WebRTC
participant RS as Realtime Server
C->>WS: Connect + Auth
WS->>RS: session.create
RS-->>WS: session.created
C->>WS: Audio In (PCM16)
WS->>RS: input_audio_buffer.append
RS-->>WS: response.audio.delta
WS-->>C: Audio Out (PCM16)
RS-->>WS: response.done
The OpenAI Realtime API supports three connection methods for different deployment scenarios:
- WebRTC — for browser-based voice applications with low-latency peer-to-peer audio streaming
- WebSocket — for server-side applications needing full control over audio pipelines and event handling
- SIP (Session Initiation Protocol) — for telephony integrations connecting traditional phone systems to AI
All three methods share the same event-driven protocol: clients send events (audio buffers, text messages, function call outputs) and receive events (audio deltas, text deltas, tool calls, session updates).
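Because every message in this protocol is a JSON object with a `type` field, client code often reduces to routing events by type. The following is a minimal, transport-agnostic sketch of that idea. The `EventRouter` class and `make_client_event` function are hypothetical helpers written for illustration, not part of any OpenAI SDK; the event names are the ones used elsewhere in this guide.

```python
import json
from typing import Callable, Dict, List


class EventRouter:
    """Route decoded server events to handlers keyed by their "type" field.

    Hypothetical helper for illustration; not part of any OpenAI SDK.
    """

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = {}

    def on(self, event_type: str, handler: Callable[[dict], None]) -> None:
        """Register a handler for one event type."""
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, raw_message: str) -> bool:
        """Decode one JSON message and call every matching handler.

        Returns True if at least one handler ran, False otherwise.
        """
        event = json.loads(raw_message)
        handlers = self._handlers.get(event.get("type", ""), [])
        for handler in handlers:
            handler(event)
        return bool(handlers)


def make_client_event(event_type: str, **fields) -> str:
    """Serialize a client event: a JSON object with "type" plus payload fields."""
    return json.dumps({"type": event_type, **fields})


if __name__ == "__main__":
    router = EventRouter()
    chunks: List[str] = []
    router.on("response.text.delta", lambda e: chunks.append(e["delta"]))
    router.dispatch('{"type": "response.text.delta", "delta": "Hi"}')
    print("".join(chunks))
    print(make_client_event("response.create"))
```

Keeping the router independent of the transport means the same handlers could sit behind a WebSocket receive loop or a WebRTC data channel.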
2. WebSocket Connection
The WebSocket interface gives you full control over the realtime session. Connect with your API key, configure the session parameters, then send and receive events in a bidirectional stream.
import asyncio
import json
import os

import websockets


async def connect_realtime():
    """Connect to OpenAI Realtime API via WebSocket."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Recent websockets releases take `additional_headers`; older ones use `extra_headers`
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        # Wait for session confirmation (the first server event is session.created)
        response = await ws.recv()
        event = json.loads(response)
        print(f"Session created: {event['type']}")
        # Send a text message
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello! Tell me a fun fact."}],
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # Receive response events
        async for message in ws:
            event = json.loads(message)
            # Text-only replies stream as response.text.delta; with audio enabled,
            # the text of the spoken reply arrives as transcript deltas instead
            if event["type"] in ("response.text.delta", "response.audio_transcript.delta"):
                print(event["delta"], end="", flush=True)
            elif event["type"] == "error":
                print(f"\n[Error] {event['error']}")
                break
            elif event["type"] == "response.done":
                print("\n[Response complete]")
                break


asyncio.run(connect_realtime())
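Audio input travels over the same socket as base64-encoded PCM16 inside `input_audio_buffer.append` events. The sketch below shows one way to split a raw byte buffer into such events. The `audio_append_events` helper is hypothetical, and the default chunk size assumes mono 16-bit audio at 24 kHz (so 4800 bytes is about 100 ms); adjust it if your capture format differs.

```python
import base64
import json
from typing import Iterator


def audio_append_events(pcm16: bytes, chunk_bytes: int = 4800) -> Iterator[str]:
    """Yield input_audio_buffer.append events for raw PCM16 audio.

    chunk_bytes is an illustrative choice: 4800 bytes is 100 ms of mono
    16-bit audio at an assumed 24 kHz sample rate.
    """
    if chunk_bytes <= 0 or chunk_bytes % 2:
        # Each PCM16 sample is 2 bytes, so chunks must not split a sample
        raise ValueError("chunk_bytes must be a positive even number")
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


if __name__ == "__main__":
    silence = b"\x00\x00" * 12000  # half a second of silence at 24 kHz
    for event in audio_append_events(silence):
        print(len(event), "characters of JSON")
```

Each yielded string can be passed straight to `ws.send(...)` inside a connection like the one shown earlier in this section.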
AI Phone Receptionist
A dental clinic deployed a Realtime API-powered phone agent that handles appointment scheduling, rescheduling, and FAQ calls 24/7. The agent understands natural speech (including accents and interruptions), checks the calendar in real-time, and confirms bookings — handling 70% of calls without human staff.
3. Session Management
Sessions can be reconfigured at any time during the connection. You can start in text-only mode and upgrade to audio later, adjust temperature, or change turn detection settings, all without reconnecting. The voice can be changed as well, but only until the model has produced its first audio output in the session.
import asyncio
import json
import os

import websockets


async def manage_session():
    """Demonstrate session configuration and updates."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}", "OpenAI-Beta": "realtime=v1"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        # The server sends session.created first; consume it before configuring
        created = json.loads(await ws.recv())
        print(f"Connected: {created['type']}")
        # Initial session with text only
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "instructions": "You are a helpful assistant. Keep responses brief.",
                "temperature": 0.7,
                "max_response_output_tokens": 500,
            },
        }))
        config = json.loads(await ws.recv())
        print(f"Configured: {config['type']}")
        # Later: upgrade to audio (the voice must be one the Realtime API offers)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "echo",
                "turn_detection": {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500},
            },
        }))
        upgrade = json.loads(await ws.recv())
        print(f"Upgraded to audio: {upgrade['type']}")


asyncio.run(manage_session())
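When a session is updated several times, it can help to track the effective configuration on the client. The sketch below assumes that a `session.update` only changes the fields it contains and that a present field replaces the old value wholesale. Both points are assumptions to verify against the server's `session.updated` reply, and `apply_session_update` is a hypothetical helper.

```python
import copy


def apply_session_update(current: dict, update_event: dict) -> dict:
    """Return a new session config with the update's fields applied on top.

    Assumption: fields absent from the update keep their previous value, and
    a field that is present replaces the old value wholesale (no deep merge).
    """
    if update_event.get("type") != "session.update":
        raise ValueError("expected a session.update event")
    merged = copy.deepcopy(current)
    merged.update(copy.deepcopy(update_event.get("session", {})))
    return merged


if __name__ == "__main__":
    text_only = {"modalities": ["text"], "temperature": 0.7}
    upgrade = {
        "type": "session.update",
        "session": {"modalities": ["text", "audio"], "turn_detection": {"type": "server_vad"}},
    }
    print(apply_session_update(text_only, upgrade))
```

The server's own `session.updated` event remains the source of truth; a local copy like this is only a convenience for logging or UI state.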
4. Voice Activity Detection
Voice Activity Detection (VAD) determines when the user has finished speaking. The Realtime API offers two modes:
- Server VAD (server_vad): The server automatically detects speech boundaries using configurable thresholds. Best for natural conversational experiences.
- Manual mode (null): You control when a turn ends by explicitly committing the audio buffer. Best for push-to-talk interfaces or when you need precise control.
import json

# Server VAD configuration options
server_vad_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # Speech detection sensitivity (0.0-1.0)
            "prefix_padding_ms": 300,    # Audio to include before speech starts
            "silence_duration_ms": 500,  # Silence before considering turn complete
        },
    },
}

# Manual turn detection (you control when turns end)
manual_config = {
    "type": "session.update",
    "session": {
        "turn_detection": None,  # Disable automatic detection
    },
}

# With manual mode, explicitly commit audio and request response:
commit_event = {"type": "input_audio_buffer.commit"}
response_event = {"type": "response.create"}

print("Server VAD: automatic turn detection based on silence")
print("Manual mode: you decide when user is done speaking")
print(f"Config example: {json.dumps(server_vad_config, indent=2)}")
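Tune threshold to the environment: a lower value (e.g., 0.3) makes detection more sensitive, which helps when speech is quiet and hard to detect, while a higher value requires louder audio and is less likely to trigger on background noise. Increase silence_duration_ms (e.g., 800 ms) for users who pause frequently between sentences to avoid premature turn endings.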
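To build intuition for how these three settings interact, here is a toy, client-side simulation. It is not the server's actual algorithm: `detect_turn` is a hypothetical function, and the per-frame speech probabilities are made-up inputs standing in for whatever the real detector computes.

```python
from typing import List, Optional, Tuple


def detect_turn(
    speech_probs: List[float],
    frame_ms: int = 100,
    threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
) -> Optional[Tuple[int, int]]:
    """Return (start_ms, end_ms) of the first detected turn, or None.

    speech_probs holds one illustrative speech probability per frame.
    """
    start_ms: Optional[int] = None
    silent_ms = 0
    for i, prob in enumerate(speech_probs):
        now_ms = i * frame_ms
        if prob >= threshold:
            if start_ms is None:
                # Reach back so the first syllable is not clipped
                start_ms = max(0, now_ms - prefix_padding_ms)
            silent_ms = 0  # Any speech resets the silence timer
        elif start_ms is not None:
            silent_ms += frame_ms
            if silent_ms >= silence_duration_ms:
                # The turn ends where the trailing silence began
                return start_ms, now_ms + frame_ms - silent_ms
    return None  # No speech, or the speaker had not finished


if __name__ == "__main__":
    # Speech, a 600 ms pause, more speech, then a long silence
    with_pause = [0.9] * 3 + [0.1] * 6 + [0.9] * 3 + [0.1] * 10
    print(detect_turn(with_pause, silence_duration_ms=500))
    print(detect_turn(with_pause, silence_duration_ms=800))
```

With a 500 ms silence window the toy detector ends the turn during the mid-sentence pause, while an 800 ms window keeps both phrases in one turn, which mirrors the reason for raising silence_duration_ms for speakers who pause often.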
5. Realtime Function Calling
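Tools work in the Realtime API much like in the Chat Completions API, except that each tool definition is flat: name, description, and parameters sit next to type rather than inside a nested function object. Define functions in the session config, and the model will emit response.function_call_arguments.done events when it wants to invoke a tool. You execute the function, send back the result as a function_call_output conversation item, and then request a new response so the model can use the result.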
import asyncio
import json
import os

import websockets


async def realtime_with_tools():
    """Realtime API with function calling."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}", "OpenAI-Beta": "realtime=v1"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "tools": [{
                    "type": "function",
                    "name": "get_weather",
                    "description": "Get current weather for a location",
                    "parameters": {
                        "type": "object",
                        "properties": {"location": {"type": "string"}},
                        "required": ["location"],
                    },
                }],
            },
        }))
        await ws.recv()  # session.created (session.updated is skipped by the loop below)
        # Ask something that triggers the tool
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user", "content": [{"type": "input_text", "text": "What's the weather in Paris?"}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        pending_call = None  # Function call waiting for the current response to finish
        async for msg in ws:
            event = json.loads(msg)
            if event["type"] == "response.function_call_arguments.done":
                # Remember the call; it is answered once this response completes
                pending_call = event
            elif event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                if pending_call is None:
                    print("\n[Done]")
                    break
                # Execute the function (stubbed result for this demo)
                args = json.loads(pending_call["arguments"])
                result = json.dumps({"location": args["location"], "temp": 18, "condition": "partly cloudy"})
                # Send result back, then ask the model to continue
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {"type": "function_call_output", "call_id": pending_call["call_id"], "output": result},
                }))
                await ws.send(json.dumps({"type": "response.create"}))
                pending_call = None


asyncio.run(realtime_with_tools())
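Once a session defines more than one tool, a small registry keeps the event loop free of per-tool branching. The sketch below is one possible shape: `TOOLS`, `get_weather`, and `build_function_output` are hypothetical names, the weather data is canned, and reporting failures as an `{"error": ...}` payload is an illustrative convention rather than something the API requires. The caller is assumed to already know the tool name, call ID, and raw arguments string for the call.

```python
import json
from typing import Any, Callable, Dict


def get_weather(location: str) -> Dict[str, Any]:
    """Stub tool: returns canned data instead of calling a weather service."""
    return {"location": location, "temp": 18, "condition": "partly cloudy"}


# Hypothetical registry mapping tool names to local Python callables
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def build_function_output(name: str, call_id: str, arguments_json: str) -> str:
    """Run the named tool and return a function_call_output event as JSON.

    Problems are reported to the model as an {"error": ...} payload
    (an illustrative convention) instead of being raised.
    """
    tool = TOOLS.get(name)
    if tool is None:
        result: Any = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = tool(**json.loads(arguments_json))
        except (TypeError, ValueError) as exc:
            # Covers malformed JSON and arguments that do not match the tool
            result = {"error": f"bad arguments: {exc}"}
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    })


if __name__ == "__main__":
    print(build_function_output("get_weather", "call_123", '{"location": "Paris"}'))
```

Returning errors as tool output lets the model explain the failure to the user instead of leaving the call unanswered.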
6. Cost Optimization
Realtime API pricing is based on audio duration and text tokens. Understanding the cost structure helps you design efficient applications:
| Component | Pricing | Notes |
|---|---|---|
| Audio Input | $0.06 / min | PCM16 or G.711 formats |
| Audio Output | $0.24 / min | Generated speech from model |
| Text Input Tokens | $5.00 / 1M tokens | Instructions, context, user text |
| Text Output Tokens | $20.00 / 1M tokens | Model text responses |
- Use text modality only when audio isn’t needed — text tokens are significantly cheaper than audio minutes
- Set max_response_output_tokens to cap response length and prevent runaway costs
- Tune silence_duration_ms in server VAD to minimize idle audio being processed
- Consider gpt-realtime-2 (standard) or check for lighter-weight realtime models for simpler use cases that don’t require the full model’s capabilities
- Implement session timeouts to automatically close idle connections
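A rough per-session estimate follows directly from the rates in the pricing table. The sketch below hard-codes those figures as a snapshot, so check current pricing before relying on it. The `Usage` dataclass and `estimate_cost` function are hypothetical names.

```python
from dataclasses import dataclass

# Rates copied from the pricing table in this section; a snapshot, not live pricing
AUDIO_IN_PER_MIN = 0.06
AUDIO_OUT_PER_MIN = 0.24
TEXT_IN_PER_MTOK = 5.00
TEXT_OUT_PER_MTOK = 20.00


@dataclass
class Usage:
    """Measured or projected usage for one session."""
    audio_in_min: float = 0.0
    audio_out_min: float = 0.0
    text_in_tokens: int = 0
    text_out_tokens: int = 0


def estimate_cost(usage: Usage) -> float:
    """Return the estimated cost in USD for the given usage."""
    return (
        usage.audio_in_min * AUDIO_IN_PER_MIN
        + usage.audio_out_min * AUDIO_OUT_PER_MIN
        + usage.text_in_tokens / 1_000_000 * TEXT_IN_PER_MTOK
        + usage.text_out_tokens / 1_000_000 * TEXT_OUT_PER_MTOK
    )


if __name__ == "__main__":
    # A call with 2 minutes of caller speech and 1 minute of model speech
    call = Usage(audio_in_min=2, audio_out_min=1)
    print(f"${estimate_cost(call):.2f}")  # prints $0.36
```

Because output audio is billed at four times the input rate in this table, shortening spoken replies usually saves more than trimming what the caller says.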
7. WebRTC Connection (Browser)
For browser-based voice applications, WebRTC provides the lowest-latency connection by establishing peer-to-peer audio streams. The client first obtains a short-lived ephemeral token from your backend, so your real API key never reaches the browser, then connects directly to OpenAI’s realtime servers without routing audio through your server.
// Browser-side WebRTC connection to OpenAI Realtime API
async function connectRealtimeWebRTC(ephemeralToken) {
  const pc = new RTCPeerConnection();

  // Create audio output element
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

  // Add microphone input
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(stream.getTracks()[0]);

  // Create offer and connect
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Post the SDP offer to the realtime endpoint itself; /v1/realtime/sessions
  // is the server-side endpoint that mints the ephemeral token
  const response = await fetch("https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${ephemeralToken}`,
      "Content-Type": "application/sdp",
    },
    body: offer.sdp,
  });
  if (!response.ok) {
    throw new Error(`SDP exchange failed: ${response.status}`);
  }

  const answer = { type: "answer", sdp: await response.text() };
  await pc.setRemoteDescription(answer);
  console.log("WebRTC session connected!");
  return pc;
}
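The backend half of this flow, minting the ephemeral token, is not shown in the browser code. A minimal sketch using only the Python standard library might look like the following. It assumes the token is created with a POST to `/v1/realtime/sessions` and returned under `client_secret.value`; both details, along with the helper names, are assumptions to check against the current API reference.

```python
import json
import urllib.request

# Assumed endpoint for minting ephemeral tokens; verify against current docs
SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"


def build_ephemeral_token_request(
    api_key: str,
    model: str = "gpt-4o-realtime-preview",
    voice: str = "alloy",
) -> urllib.request.Request:
    """Build (but do not send) the server-side request for an ephemeral token."""
    body = json.dumps({"model": model, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        SESSIONS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract_ephemeral_token(response_body: bytes) -> str:
    """Pull the token out of the response (assumed path: client_secret.value)."""
    return json.loads(response_body)["client_secret"]["value"]


if __name__ == "__main__":
    import os

    request = build_ephemeral_token_request(os.environ["OPENAI_API_KEY"])
    with urllib.request.urlopen(request) as resp:
        print(extract_ephemeral_token(resp.read()))
```

Your backend would expose this behind its own authenticated route and hand only the returned token to the browser, which passes it to `connectRealtimeWebRTC`.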
8. Specialized Realtime Models
Beyond the general-purpose gpt-realtime-2 model, OpenAI provides specialized models for specific audio workflows:
| Model | Purpose | Use Case |
|---|---|---|
| gpt-realtime-2 | General voice agent | Conversational AI, customer support, voice assistants |
| gpt-realtime-translate | Live speech translation | Multilingual meetings, real-time interpreter, cross-language communication |
| gpt-realtime-whisper | Live transcription | Meeting notes, live captions, streaming speech-to-text with low latency |
import asyncio
import json
import os

import websockets


async def live_translation_session():
    """Use gpt-realtime-translate for live speech translation."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "instructions": "Translate all spoken English into Spanish.",
            },
        }))
        response = await ws.recv()
        print(f"Translation session ready: {json.loads(response)['type']}")
        # Stream audio in and receive translated audio out here, inside the
        # `async with` block; the connection closes as soon as the block exits


asyncio.run(live_translation_session())
9. SIP Integration (Telephony)
SIP (Session Initiation Protocol) connects traditional phone systems to the Realtime API. This enables building AI-powered IVR systems, call center agents, and phone-based assistants that integrate with existing PSTN infrastructure and PBX systems.
Next in the SDK Track
In OA Part 10: Fine-Tuning, Eval & Production, we’ll complete the SDK track with fine-tuning workflows, the Evaluation API, Batch API, and enterprise features.