OpenAI SDK Track Part 9: Realtime API

            
What You’ll Learn: The Realtime API enables live, bidirectional audio conversations with OpenAI models. The model hears you speak in real time and responds with natural speech, handling interruptions, pauses, and turn-taking. This is fundamentally different from the request/response pattern: it’s a persistent connection with audio streaming in both directions. Think of it as a phone call with an AI rather than a text chat. Connection options include WebRTC (browsers), WebSocket (servers), and SIP (telephony).
        

1. Realtime Architecture

Client ↔ Realtime Server Flow

                sequenceDiagram
                    participant C as Client
                    participant WS as WebSocket / WebRTC
                    participant RS as Realtime Server
                    C->>WS: Connect + Auth
                    WS->>RS: session.create
                    RS-->>WS: session.created
                    C->>WS: Audio In (PCM16)
                    WS->>RS: input_audio_buffer.append
                    RS-->>WS: response.audio.delta
                    WS-->>C: Audio Out (PCM16)
                    RS-->>WS: response.done

The OpenAI Realtime API supports three connection methods for different deployment scenarios:

WebRTC — for browser-based voice applications with low-latency peer-to-peer audio streaming
WebSocket — for server-side applications needing full control over audio pipelines and event handling
SIP (Session Initiation Protocol) — for telephony integrations connecting traditional phone systems to AI

All three methods share the same event-driven protocol: clients send events (audio buffers, text messages, function call outputs) and receive events (audio deltas, text deltas, tool calls, session updates).
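Because every message carries a "type" field, a client usually ends up with some form of dispatch table. The following is a minimal sketch of that idea, not part of any SDK: EventRouter, on, and dispatch are hypothetical names chosen for illustration, and the messages are hand-built JSON strings rather than live server output.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventRouter:
    """Routes decoded events to handlers keyed by their "type" field (illustrative helper)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.unhandled: List[str] = []  # event types nobody registered for

    def on(self, event_type: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for one event type."""
        def register(handler: Handler) -> Handler:
            self._handlers[event_type] = handler
            return handler
        return register

    def dispatch(self, raw_message: str) -> None:
        """Decode one JSON message and call the matching handler, if any."""
        event = json.loads(raw_message)
        event_type = event.get("type", "")
        handler = self._handlers.get(event_type)
        if handler is None:
            self.unhandled.append(event_type)
        else:
            handler(event)


router = EventRouter()
transcript: List[str] = []


@router.on("response.text.delta")
def on_text_delta(event: dict) -> None:
    transcript.append(event["delta"])


# Hand-built sample messages standing in for what a socket would deliver
router.dispatch(json.dumps({"type": "response.text.delta", "delta": "Hi"}))
router.dispatch(json.dumps({"type": "rate_limits.updated"}))
```

In a live session you would call router.dispatch(message) for each message read from the socket; keeping track of unhandled types makes it easier to notice events you have not accounted for.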

2. WebSocket Connection

The WebSocket interface gives you full control over the realtime session. Connect with your API key, configure the session parameters, then send and receive events in a bidirectional stream.

import asyncio
import json
import os

import websockets

async def connect_realtime():
    """Connect to the OpenAI Realtime API via WebSocket."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    # websockets 14+ takes additional_headers; older releases call it extra_headers
    async with websockets.connect(url, additional_headers=headers) as ws:
        # The server announces the session as soon as the socket opens
        event = json.loads(await ws.recv())
        print(f"Session event: {event['type']}")  # expected: session.created

        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"},
            },
        }))

        # Send a text message
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello! Tell me a fun fact."}],
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))

        # Receive response events
        async for message in ws:
            event = json.loads(message)
            # With audio enabled, the text of the spoken reply arrives as transcript deltas
            if event["type"] in ("response.text.delta", "response.audio_transcript.delta"):
                print(event["delta"], end="", flush=True)
            elif event["type"] == "error":
                print(f"\n[Error] {event['error']['message']}")
                break
            elif event["type"] == "response.done":
                print("\n[Response complete]")
                break

asyncio.run(connect_realtime())

            
            Connection Protocol: The Realtime API uses a persistent WebSocket connection. All communication is JSON events — there are no REST endpoints for audio streaming. The connection stays open for the duration of the session.
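Since audio also travels as JSON, raw samples have to be base64-encoded before they go on the wire. Here is a rough sketch of that step, assuming the pcm16 format means 16-bit little-endian mono samples at 24 kHz; pcm16_append_events and the 2400-sample chunk size (about 100 ms at that rate) are illustrative choices, not API requirements.

```python
import base64
import json
import struct
from typing import Iterator, Sequence


def pcm16_append_events(samples: Sequence[int], chunk_samples: int = 2400) -> Iterator[str]:
    """Yield input_audio_buffer.append events (as JSON strings) for int16 samples."""
    for start in range(0, len(samples), chunk_samples):
        chunk = samples[start:start + chunk_samples]
        # Pack as little-endian signed 16-bit integers, two bytes per sample
        raw = struct.pack(f"<{len(chunk)}h", *chunk)
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(raw).decode("ascii"),
        })


# 6000 samples of silence split into chunks of 2400, 2400, and 1200 samples
events = list(pcm16_append_events([0] * 6000))
```

Each string can be passed straight to ws.send(). Small chunks keep latency low at the cost of more messages; the right size depends on your capture pipeline.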
        

Real-World Application

AI Phone Receptionist

A dental clinic deployed a Realtime API-powered phone agent that handles appointment scheduling, rescheduling, and FAQ calls 24/7. The agent understands natural speech (including accents and interruptions), checks the calendar in real-time, and confirms bookings — handling 70% of calls without human staff.

Healthcare · Voice AI

3. Session Management

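Most session settings can be changed at any time during the connection. You can start in text-only mode and upgrade to audio later, adjust temperature, or change turn detection settings, all without reconnecting. The voice is the main exception: choose it before the model first produces audio in the session, because it cannot be switched afterward.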

import asyncio
import json
import os

import websockets

async def wait_for_event(ws, event_type):
    """Read events until one of the requested type arrives."""
    while True:
        event = json.loads(await ws.recv())
        if event["type"] == event_type:
            return event
        if event["type"] == "error":
            raise RuntimeError(event["error"]["message"])

async def manage_session():
    """Demonstrate session configuration and updates."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}", "OpenAI-Beta": "realtime=v1"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # The server sends session.created as soon as the socket opens
        await wait_for_event(ws, "session.created")

        # Initial session with text only
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "instructions": "You are a helpful assistant. Keep responses brief.",
                "temperature": 0.7,
                "max_response_output_tokens": 500,
            },
        }))
        config = await wait_for_event(ws, "session.updated")
        print(f"Configured: {config['type']}")

        # Later: upgrade to audio (set the voice before any audio has been generated)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "shimmer",
                "turn_detection": {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500},
            },
        }))
        upgrade = await wait_for_event(ws, "session.updated")
        print(f"Upgraded to audio: {upgrade['type']}")

asyncio.run(manage_session())

4. Voice Activity Detection

Voice Activity Detection (VAD) determines when the user has finished speaking. The Realtime API offers two modes:

Server VAD (server_vad) — The server automatically detects speech boundaries using configurable thresholds. Best for natural conversational experiences.
Manual mode (null) — You control when a turn ends by explicitly committing the audio buffer. Best for push-to-talk interfaces or when you need precise control.

import json

# Server VAD configuration options
server_vad_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,          # Speech detection sensitivity (0.0-1.0)
            "prefix_padding_ms": 300,   # Audio to include before speech starts
            "silence_duration_ms": 500, # Silence before considering turn complete
        },
    },
}

# Manual turn detection (you control when turns end)
manual_config = {
    "type": "session.update",
    "session": {
        "turn_detection": None,  # Disable automatic detection
    },
}

# With manual mode, explicitly commit audio and request response:
commit_event = {"type": "input_audio_buffer.commit"}
response_event = {"type": "response.create"}

print("Server VAD: automatic turn detection based on silence")
print("Manual mode: you decide when user is done speaking")
print(f"Config example: {json.dumps(server_vad_config, indent=2)}")

            
VAD Tuning Tips: Raise threshold (e.g., 0.6 to 0.7) in noisy environments so that background sound is less likely to be mistaken for speech; lower it (e.g., 0.3) only when quiet speakers are being missed. Increase silence_duration_ms (e.g., 800ms) for users who pause frequently between sentences to avoid premature turn endings.
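To build intuition for how threshold and silence_duration_ms interact, here is a toy end-of-turn detector. It is only a conceptual sketch and not the algorithm the server runs: find_turn_end is a hypothetical function, and it works on made-up per-frame loudness levels between 0 and 1 rather than real audio.

```python
from typing import Optional, Sequence


def find_turn_end(
    frame_levels: Sequence[float],
    threshold: float = 0.5,
    silence_duration_ms: int = 500,
    frame_ms: int = 100,
) -> Optional[int]:
    """Return the index of the frame at which the turn would end, or None."""
    frames_needed = -(-silence_duration_ms // frame_ms)  # ceiling division
    speech_seen = False
    silent_run = 0
    for i, level in enumerate(frame_levels):
        if level >= threshold:
            # Frame counts as speech: start (or continue) the turn, reset the silence counter
            speech_seen = True
            silent_run = 0
        elif speech_seen:
            # Silence only matters once the user has started speaking
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# Speech, a 200 ms pause, one more word, then a long silence (100 ms frames)
levels = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
short_pause = find_turn_end(levels, silence_duration_ms=200)  # ends inside the pause
long_pause = find_turn_end(levels, silence_duration_ms=500)   # waits for the final silence
```

With the shorter setting the turn ends at frame 4, before the speaker resumes at frame 5; with 500 ms it ends at frame 10, after the speaker has actually finished. That is the trade-off behind silence_duration_ms: shorter values feel snappier but cut off speakers who pause.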
        

5. Realtime Function Calling

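Tools in the Realtime API follow the same JSON Schema function-calling model as the Chat Completions API, with one difference in shape: name, description, and parameters sit directly on the tool object instead of under a nested function key. Define functions in the session config, and the model will emit a response.function_call_arguments.done event when it wants to invoke a tool. You execute the function, send back the result as a function_call_output conversation item, and then request a new response so the model can use the result.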

import asyncio
import json
import os

import websockets

async def realtime_with_tools():
    """Realtime API with function calling."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}", "OpenAI-Beta": "realtime=v1"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "tools": [{
                    "type": "function",
                    "name": "get_weather",
                    "description": "Get current weather for a location",
                    "parameters": {
                        "type": "object",
                        "properties": {"location": {"type": "string"}},
                        "required": ["location"],
                    },
                }],
            },
        }))
        await ws.recv()  # session.created, sent when the socket opens
        await ws.recv()  # session.updated, confirming the tool definition

        # Ask something that triggers the tool
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user", "content": [{"type": "input_text", "text": "What's the weather in Paris?"}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))

        pending_calls = []  # tool calls to answer once the current response finishes
        async for msg in ws:
            event = json.loads(msg)
            if event["type"] == "response.function_call_arguments.done":
                pending_calls.append(event)
            elif event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                if not pending_calls:
                    print("\n[Done]")
                    break
                # The response that requested the tool has ended: execute each
                # call (stubbed here), send the results, and ask for a follow-up
                for call in pending_calls:
                    args = json.loads(call["arguments"])
                    result = json.dumps({"location": args["location"], "temp": 18, "condition": "partly cloudy"})
                    await ws.send(json.dumps({
                        "type": "conversation.item.create",
                        "item": {"type": "function_call_output", "call_id": call["call_id"], "output": result},
                    }))
                pending_calls.clear()
                await ws.send(json.dumps({"type": "response.create"}))

asyncio.run(realtime_with_tools())
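Once a session exposes more than one tool, hard-coding the result inside the event loop stops scaling. One possible arrangement is a small registry that maps tool names to Python callables. This is a sketch under assumptions: TOOLS, tool, and build_tool_output are hypothetical names, and the input is assumed to be a function_call item carrying name, call_id, and arguments (for example, taken from the output of a finished response).

```python
import json
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., dict]] = {}


def tool(fn: Callable[..., dict]) -> Callable[..., dict]:
    """Register a function under its own name so the model's calls can be routed to it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(location: str) -> dict:
    # Stubbed data; a real implementation would query a weather service
    return {"location": location, "temp": 18, "condition": "partly cloudy"}


def build_tool_output(call_item: dict) -> dict:
    """Run the requested tool and wrap its result as a function_call_output event."""
    try:
        fn = TOOLS.get(call_item.get("name", ""))
        if fn is None:
            raise ValueError(f"unknown tool: {call_item.get('name')}")
        result = fn(**json.loads(call_item["arguments"]))
    except Exception as exc:
        # Report the failure to the model instead of crashing the session
        result = {"error": str(exc)}
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_item["call_id"],
            "output": json.dumps(result),
        },
    }


sample_call = {
    "type": "function_call",
    "name": "get_weather",
    "call_id": "call_1",
    "arguments": '{"location": "Paris"}',
}
output_event = build_tool_output(sample_call)
```

Returning errors as tool output lets the model apologize or retry in conversation, which tends to matter more in voice sessions, where a silent failure sounds like a dropped call.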

6. Cost Optimization

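Realtime API usage is billed per token, with audio tokens priced well above text tokens. The per-minute audio figures below are approximate conversions and prices change over time, so check OpenAI's pricing page before budgeting. Understanding the cost structure helps you design efficient applications: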

Component	Pricing	Notes
Audio Input	$0.06 / min	PCM16 or G.711 formats
Audio Output	$0.24 / min	Generated speech from model
Text Input Tokens	$5.00 / 1M tokens	Instructions, context, user text
Text Output Tokens	$20.00 / 1M tokens	Model text responses
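As a quick way to reason about these numbers, the rates in the table can be plugged into a small estimator. This is a back-of-the-envelope sketch: estimate_session_cost and RATES are illustrative names, and the rates are copied from the table above rather than fetched from any live price list.

```python
# Rates copied from the table above (USD); update them if pricing changes
RATES = {
    "audio_in_per_min": 0.06,
    "audio_out_per_min": 0.24,
    "text_in_per_1m": 5.00,
    "text_out_per_1m": 20.00,
}


def estimate_session_cost(
    audio_in_min: float = 0.0,
    audio_out_min: float = 0.0,
    text_in_tokens: int = 0,
    text_out_tokens: int = 0,
    rates: dict = RATES,
) -> float:
    """Return an approximate session cost in USD."""
    return (
        audio_in_min * rates["audio_in_per_min"]
        + audio_out_min * rates["audio_out_per_min"]
        + text_in_tokens / 1_000_000 * rates["text_in_per_1m"]
        + text_out_tokens / 1_000_000 * rates["text_out_per_1m"]
    )


# A 2-minute call split evenly between the user and the model
two_minute_call = estimate_session_cost(audio_in_min=1, audio_out_min=1)
print(f"Estimated cost: ${two_minute_call:.2f}")
```

At these rates, output audio costs four times as much per minute as input audio, so keeping the model's spoken replies short has a larger effect on cost than trimming what the user says.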

            
Cost Optimization Strategies:
Use the text-only modality when audio isn’t needed; text tokens are significantly cheaper than audio minutes
Set max_response_output_tokens to cap response length and prevent runaway costs
Tune silence_duration_ms in server VAD to minimize idle audio being processed
For simpler use cases that don’t need the full capabilities of gpt-realtime-2 (the standard model), check whether a lighter-weight realtime model is available
Implement session timeouts to automatically close idle connections
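The session-timeout strategy can be approximated on the client with asyncio.wait_for. The sketch below assumes a connection object exposing async recv() and close() methods, as the websockets client does; recv_until_idle is a hypothetical helper, not part of that library.

```python
import asyncio
from typing import Any, AsyncIterator


async def recv_until_idle(ws: Any, idle_seconds: float) -> AsyncIterator[Any]:
    """Yield messages from ws until none arrives for idle_seconds, then close it."""
    try:
        while True:
            try:
                message = await asyncio.wait_for(ws.recv(), timeout=idle_seconds)
            except asyncio.TimeoutError:
                break  # nothing arrived in time: treat the session as idle
            yield message
    finally:
        # Close the socket whether we stopped on idleness or on an error
        await ws.close()


# Usage inside an open connection (hypothetical 30-second limit):
#     async for message in recv_until_idle(ws, idle_seconds=30):
#         handle(message)
```

This only measures silence from the server's side of the conversation. A production version would probably also track when the user last sent audio, since a user who is mid-sentence is not idle.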

        

            
            Try It Yourself: Build a simple voice assistant using the Realtime API: (1) establish a WebSocket connection, (2) stream microphone audio to the API, (3) play back the audio response in real-time, (4) implement a ‘push to talk’ mode. Then add one function call (get_time) that the assistant can invoke during conversation. Test a 2-minute conversation.
        

7. WebRTC Connection (Browser)

For browser-based voice applications, WebRTC provides the lowest-latency connection because audio streams directly between the browser and OpenAI’s realtime servers. Your backend mints a short-lived ephemeral token using your real API key and hands it to the browser, which then connects to OpenAI without routing audio through your server and without ever seeing the long-lived key.

// Browser-side WebRTC connection to OpenAI Realtime API
async function connectRealtimeWebRTC(ephemeralToken) {
    const pc = new RTCPeerConnection();

    // Play the model's audio as soon as the remote track arrives
    const audioEl = document.createElement("audio");
    audioEl.autoplay = true;
    pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

    // Add microphone input
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    pc.addTrack(stream.getTracks()[0], stream);

    // Create offer and connect
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    // The SDP offer goes to the realtime endpoint itself;
    // /v1/realtime/sessions is what the backend calls to mint the ephemeral token
    const model = "gpt-4o-realtime-preview";
    const response = await fetch(`https://api.openai.com/v1/realtime?model=${model}`, {
        method: "POST",
        headers: {
            "Authorization": `Bearer ${ephemeralToken}`,
            "Content-Type": "application/sdp",
        },
        body: offer.sdp,
    });
    if (!response.ok) {
        throw new Error(`SDP exchange failed with status ${response.status}`);
    }

    const answer = { type: "answer", sdp: await response.text() };
    await pc.setRemoteDescription(answer);

    console.log("WebRTC session connected!");
    return pc;
}
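The ephemeralToken passed to the browser has to come from your backend. The sketch below shows one way the server-side step might look, assuming the beta POST /v1/realtime/sessions endpoint and a client_secret.value field in its response; build_ephemeral_token_request and mint_ephemeral_token are hypothetical helper names, and the default model and voice are placeholders.

```python
import json
import urllib.request

SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"


def build_ephemeral_token_request(
    api_key: str,
    model: str = "gpt-4o-realtime-preview",
    voice: str = "alloy",
) -> urllib.request.Request:
    """Build (but do not send) the request that asks for an ephemeral token."""
    body = json.dumps({"model": model, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        SESSIONS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # the real API key stays on the server
            "Content-Type": "application/json",
        },
    )


def mint_ephemeral_token(api_key: str) -> str:
    """Send the request and return the short-lived token for the browser."""
    request = build_ephemeral_token_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as resp:
        session = json.load(resp)
    # Assumed response shape: {"client_secret": {"value": "...", ...}, ...}
    return session["client_secret"]["value"]
```

Your web framework would expose mint_ephemeral_token behind an authenticated route of its own, so that only signed-in users can obtain tokens.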

8. Specialized Realtime Models

Beyond the general-purpose gpt-realtime-2 model, OpenAI provides specialized models for specific audio workflows:

Model	Purpose	Use Case
gpt-realtime-2	General voice agent	Conversational AI, customer support, voice assistants
gpt-realtime-translate	Live speech translation	Multilingual meetings, real-time interpreter, cross-language communication
gpt-realtime-whisper	Live transcription	Meeting notes, live captions, streaming speech-to-text with low latency

import asyncio
import json
import os

import websockets

async def live_translation_session():
    """Use gpt-realtime-translate for live speech translation."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.recv()  # session.created, sent when the socket opens

        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "instructions": "Translate all spoken English into Spanish.",
            },
        }))

        response = await ws.recv()
        print(f"Translation session ready: {json.loads(response)['type']}")

        # Stream audio in and read translated audio out here, inside the
        # `async with` block: the socket closes as soon as the block exits,
        # so returning `ws` to the caller would hand back a closed connection.

asyncio.run(live_translation_session())

9. SIP Integration (Telephony)

SIP (Session Initiation Protocol) connects traditional phone systems to the Realtime API. This enables building AI-powered IVR systems, call center agents, and phone-based assistants that integrate with existing PSTN infrastructure and PBX systems.

            
            When to use SIP: Use SIP when your users connect via phone calls (not browser or app). Typical scenarios include customer support call centers, automated appointment booking by phone, outbound calling campaigns, and integration with existing telephony infrastructure (Twilio, Vonage, Asterisk).
        

Next in the SDK Track

In OA Part 10: Fine-Tuning, Eval & Production, we’ll complete the SDK track with fine-tuning workflows, the Evaluation API, Batch API, and enterprise features.