On August 29, OpenAI officially moved its Realtime API out of beta testing and launched it into production.
This API is primarily aimed at businesses and developers, enabling them to build voice assistants for real-world applications such as customer service, education, and personal productivity. Its core component, the “gpt-realtime” model, employs an end-to-end Speech-to-Speech architecture, allowing direct generation and processing of audio without the need for intermediate text conversion. According to OpenAI, this model delivers faster response times, more natural-sounding voices, and improved handling of complex commands compared to its predecessor.
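To illustrate what an integration looks like, the sketch below opens a WebSocket session against the Realtime API and asks the model for a spoken greeting. It is a minimal example only: the endpoint URL and event names follow the publicly documented WebSocket interface from the beta period and may differ slightly in the production release, so the current API reference should be treated as authoritative.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # "additional_headers" is the keyword in recent websockets releases;
    # older versions call it "extra_headers".
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # Ask the model to produce an audio response directly, no text pipeline.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Greet the caller and offer help."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])  # e.g. audio deltas, transcripts, lifecycle events
            if event["type"] == "response.done":
                break

asyncio.run(main())
```

In a real deployment the audio delta events would be decoded from base64 and streamed back to the caller; the loop above only logs event types to show the shape of the exchange.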
OpenAI highlighted that the gpt-realtime model can detect non-speech signals like laughter, supports mid-conversation language switching, and allows customization of voice tone—for instance, a friendly tone with a French accent or a fast-paced professional delivery. Additionally, two new voices, “Cedar” and “Marin,” have been introduced, along with enhancements to eight existing voice options.
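In practice, voice and tone are set through the session configuration. The snippet below sketches such an update; the field names follow the beta-era session.update event, and the production session object may nest audio settings differently, so it should be read as illustrative rather than definitive.

```python
import json

# A sketch of a session.update event selecting one of the new voices and
# steering delivery style through natural-language instructions. Field names
# follow the beta-era schema and may have moved in the GA session object.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",
        "instructions": (
            "Speak in a warm, friendly tone with a light French accent. "
            "Keep answers short and conversational."
        ),
    },
}

payload = json.dumps(session_update)  # sent over the same WebSocket connection shown earlier
```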
In benchmark testing, the model showed significant gains over its predecessor: accuracy rose from 65.6% to 82.8% on Big Bench Audio, from 20.6% to 30.5% on MultiChallenge, and from 49.7% to 66.5% on ComplexFuncBench.
The API update also streamlines tool integration. The model now selects appropriate tools more reliably, triggers them at the right moments, and fills in their parameters accurately, improving the dependability of function calls. Developers can connect the API to phone systems via the Session Initiation Protocol (SIP) and to external tools and services via remote Model Context Protocol (MCP) servers. Reusable prompts let developers save configurations and tool settings for different use cases, further improving development efficiency.
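A typical tool declaration looks like the sketch below, which registers a hypothetical lookup_order function in the session so the model can call it mid-conversation. The event shape follows the beta-era function-calling flow, and the function name and parameters are invented for illustration.

```python
# Declares a hypothetical "lookup_order" tool in the session. When the model
# decides to call it, it emits a function_call item with JSON arguments; the
# application runs the lookup and replies with a function_call_output item.
tool_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",  # hypothetical backend function
                "description": "Fetch the current status of a customer order.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}
```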
Image input support is now available in the API. During conversations, users can send screenshots or photos, and the model can interpret the visual content—such as reading text within an image or answering questions related to the image. Developers have control over which images the model can access.
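Sending an image mid-conversation amounts to adding it as a conversation item before requesting the next response. The sketch below assumes a base64 data URL and a content type modeled on OpenAI's Responses API conventions ("input_image"); the exact field names in the Realtime API may differ, so treat this as an assumption to verify against the reference.

```python
import base64
import json

# Attach a screenshot to the conversation so the model can answer questions
# about it. The "input_image" content type mirrors the Responses API
# convention and is an assumption here.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

image_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_image", "image_url": f"data:image/png;base64,{image_b64}"},
            {"type": "input_text", "text": "What error message is shown on this screen?"},
        ],
    },
}
payload = json.dumps(image_item)  # sent over the Realtime WebSocket
```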
Additionally, two new features have been introduced: developers can set token usage limits and condense long conversation histories. These features help manage costs more effectively during extended interactions. Pricing for the gpt-realtime model has been reduced by 20%, with current rates at $32 per million input audio tokens, $64 per million output audio tokens, and $0.40 per million cached input tokens.
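At those rates, per-conversation cost is straightforward to estimate. The helper below plugs token counts into the published prices; the usage figures are hypothetical and chosen only to show the arithmetic.

```python
# Prices from the announcement, quoted in dollars per 1M gpt-realtime audio tokens.
INPUT_PER_M = 32.00        # audio input tokens
OUTPUT_PER_M = 64.00       # audio output tokens
CACHED_INPUT_PER_M = 0.40  # cached input tokens

def call_cost(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one conversation in dollars."""
    return (input_tokens * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M
            + cached_input_tokens * CACHED_INPUT_PER_M) / 1_000_000

# Hypothetical support call: 25k audio tokens in, 40k out, 10k served from cache.
print(f"${call_cost(25_000, 40_000, 10_000):.2f}")  # ≈ $3.36
```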
OpenAI noted that the API can detect harmful content and automatically terminate conversations that violate platform policies. However, as with earlier language-model safety systems, built-in detection should not be the only safeguard; developers are still encouraged to implement their own safety protocols.
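One such additional layer, for example, is to screen transcripts with OpenAI's standalone moderation endpoint before they reach downstream systems. The sketch below shows that pattern using the official Python SDK; it is one possible safeguard rather than part of the Realtime API itself, and whether it suffices depends on the application.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcript_is_safe(transcript: str) -> bool:
    """Screen a user transcript with the moderation endpoint before
    passing it to downstream business logic."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=transcript,
    )
    return not result.results[0].flagged
```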
For users in the European Union, the API offers data localization options and customized privacy policies for enterprise clients, ensuring compliance with regional data protection regulations.