OpenAI has unveiled gpt-realtime, its most advanced speech-to-speech model to date, alongside the general availability of its Realtime API. These developments aim to reduce latency, enhance voice quality, and equip developers with robust tools, such as support for MCP servers, image input capabilities, and integration with Session Initiation Protocol (SIP) for voice calling, all designed to enable production-grade AI voice agents.
By integrating the Realtime API with gpt-realtime, OpenAI has created a system that handles end-to-end voice processing within a unified architecture, rather than chaining separate speech-to-text and text-to-speech models. This integration significantly reduces response times while preserving the subtleties of speech, marking a critical advancement for real-time agents where even minor delays can disrupt conversational flow.
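The latency argument can be illustrated with a toy model. The sketch below uses hypothetical stage timings (the numbers are illustrative, not measured figures) to show why a chained speech-to-text, language model, and text-to-speech pipeline accumulates delay that a single speech-to-speech pass avoids:

```python
# Illustrative latency model (hypothetical numbers): a chained pipeline pays
# per-stage processing plus a handoff cost between services, while a unified
# speech-to-speech model makes a single pass over the audio.

def chained_pipeline_latency(stt_ms: float, llm_ms: float, tts_ms: float,
                             handoff_ms: float = 50.0) -> float:
    """Total latency when STT, LLM, and TTS run as separate services,
    with a serialization/network handoff between each stage."""
    return stt_ms + handoff_ms + llm_ms + handoff_ms + tts_ms


def unified_model_latency(model_ms: float) -> float:
    """A single speech-to-speech model: audio in, audio out, one pass."""
    return model_ms


print(chained_pipeline_latency(300, 400, 250))  # 1050.0
print(unified_model_latency(600))               # 600.0
```

Beyond the raw milliseconds, the chained design also loses paralinguistic information (tone, pacing, laughter) at the speech-to-text boundary, which the unified model retains.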
Trained to produce higher-quality speech with more natural rhythm and intonation, gpt-realtime can now respond reliably to tone-based instructions such as "speak empathetically" or "use a professional tone." Two new synthetic voices, Cedar and Marin, are now available, while existing voices have been refined to enhance realism.
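Tone instructions and voice selection are configured on the session itself. The sketch below builds a `session.update` event of the kind sent over the Realtime API's WebSocket connection; the field names follow the published event shape, but the exact schema should be checked against the current API reference:

```python
import json

# Hedged sketch: the Realtime API accepts a session.update event over its
# WebSocket connection. The instructions text steers speaking style; the
# voice field selects one of the preset voices.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",  # one of the two new voices (the other is "cedar")
        "instructions": (
            "Speak empathetically and use a professional tone. "
            "Pause briefly after the caller finishes speaking."
        ),
    },
}

payload = json.dumps(session_update)
print(payload)
# In a live session this string would be sent over the WebSocket, e.g.:
#   await ws.send(payload)
```

Because the model was trained to follow such style directives, changing the `instructions` text mid-deployment is enough to adjust the agent's delivery without retraining or re-recording anything.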
In comprehension benchmarks, gpt-realtime demonstrated significant progress. It can interpret non-verbal cues, switch languages within a single sentence, and accurately handle alphanumeric sequences, such as phone numbers and vehicle identification numbers, in languages including Spanish, Chinese, Japanese, and French. Internal testing measured 82.8% accuracy on the Big Bench Audio reasoning benchmark, up from 65.6% for the previous model. Instruction following has also improved, with the score on the MultiChallenge audio benchmark rising from 20.6% to 30.5%.
Function calling has also seen notable enhancements. The model now excels at identifying relevant functions, invoking them at the appropriate moment, and supplying accurate parameters. On the ComplexFuncBench audio benchmark, accuracy improved from 49.7% to 66.5%. Updates to asynchronous function calling allow voice agents to continue conversations while awaiting results, an especially valuable feature for customer support and transactional applications.
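The asynchronous pattern can be sketched with plain `asyncio`. In the conceptual example below, `lookup_order` and `speak` are hypothetical stand-ins (not Realtime API calls) for a slow backend tool and the agent's audio output; the point is that the tool call runs in the background while the agent keeps talking:

```python
import asyncio

# Conceptual sketch of asynchronous function calling: a hypothetical
# lookup_order tool runs in the background while the agent continues
# the conversation, and its result is folded in once it arrives.

async def lookup_order(order_id: str) -> str:
    await asyncio.sleep(0.2)           # stand-in for a slow backend call
    return f"order {order_id}: shipped"


async def speak(text: str) -> None:
    print(text)                        # stand-in for streaming audio output


async def handle_turn() -> None:
    task = asyncio.create_task(lookup_order("A1234"))
    # The agent is not blocked on the pending tool call...
    await speak("One moment while I check on that for you.")
    result = await task                # ...and resumes once it completes
    await speak(f"Thanks for waiting. I found it: {result}.")


asyncio.run(handle_turn())
```

For a caller on the phone, the difference is between dead air during a database lookup and a natural "let me check that for you" that keeps the conversation alive.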
The Realtime API has been enhanced to meet production demands. Developers can now directly connect remote MCP servers to conversations, enabling tool calls without manual integration. Image input support allows applications to engage in context-aware dialogue based on visual inputs like screenshots or photographs. SIP support enables integration of voice agents with existing phone systems, including PBX and desktop telephony. Reusable prompts simplify conversation management, while full data residency support for the EU addresses compliance concerns for European deployments.
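Attaching a remote MCP server is likewise a matter of session configuration rather than hand-written integration code. The sketch below mirrors the tool shape OpenAI has published for MCP, but the server label and URL are hypothetical and the exact fields should be verified against the API documentation:

```python
import json

# Hedged sketch of pointing a Realtime session at a remote MCP server.
# The session then discovers and calls the server's tools automatically;
# "support-tools" and the URL are illustrative placeholders.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "support-tools",          # illustrative label
                "server_url": "https://example.com/mcp",  # hypothetical server
                "require_approval": "never",
            }
        ],
    },
}

print(json.dumps(session_update, indent=2))
```

Swapping the agent's capabilities then means pointing the session at a different MCP server, with no per-tool glue code on the developer's side.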
According to the release notes, early enterprise partners are already testing these features in production-like scenarios. Zillow is piloting a voice-driven home search system, while T-Mobile is exploring customer service use cases where real-time adaptability is crucial. Both companies noted the shift from scripted automation to more flexible, domain-specific expertise powered by AI agents.
OpenAI has also strengthened safety measures for deployments. The Realtime API includes integrated classifiers that can terminate harmful conversations, and developers can add domain-specific protections using the Agents SDK. Restricting the API to preset voices helps mitigate impersonation risks.
The gpt-realtime model and Realtime API are now available to all developers. To get started, developers can access the Realtime API documentation and prompt guide, and test the new gpt-realtime demo in the playground.