OpenAI has officially launched Sora, its text-to-video AI model, as part of a 12-day product-launch event series. Sora is now available to ChatGPT subscribers in the United States and most other countries through the Sora.com website, where it runs on the newly introduced Sora Turbo model. The release includes text-to-video generation, image animation, and video remixing features.
According to OpenAI, ChatGPT Plus subscribers can generate up to 50 priority videos per month (equivalent to 1,000 credits) at 720p resolution and up to 5 seconds in length. The recently introduced ChatGPT Pro subscription, priced at $200 per month, offers "unlimited generation", raises the monthly video limit to 500, increases resolution to 1080p, and extends video length to 20 seconds. The pricier plan also lets users download watermark-free videos and run up to five video generations simultaneously.
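The figures above imply a per-video cost on the Plus tier, a detail OpenAI states only indirectly. A quick sanity check (the 20-credit figure below is derived from the announced numbers, not separately published):

```python
# Credits arithmetic implied by the announced Plus-tier figures:
# 1,000 monthly credits spread over 50 priority videos.
monthly_credits = 1000
monthly_videos = 50
credits_per_video = monthly_credits // monthly_videos

print(credits_per_video)  # → 20
```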
OpenAI initially announced the Sora text-to-video AI model in February of this year.
Users can now access Sora through a brand-new dedicated website. The interface offers multiple tools designed to streamline the video creation workflow. Users start by entering a prompt describing the desired content of the video, then customize details such as the visual style of the generated frames, the video length, and other settings. Sora can output videos in three aspect ratios: widescreen, vertical, and square.
To enable seamless switching of video aspect ratios, OpenAI trained Sora using a technique known as "spatio-temporal patches." These patches serve as data units, analogous to the tokens that large language models use when processing text. Spatio-temporal patches provide a standardized way to represent the multimodal data that the video generation AI processes. Just as tokens can encode various kinds of text, including prose and code, spatio-temporal patches can encode videos of different aspect ratios.
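The token analogy can be made concrete with a little arithmetic. The sketch below shows how videos with different aspect ratios tile into different numbers of identically sized patches; the patch dimensions and video sizes are illustrative assumptions, not values OpenAI has published:

```python
# Assumed patch size for illustration: 4 frames x 16 x 16 pixels.
PATCH_T, PATCH_H, PATCH_W = 4, 16, 16

def patch_count(frames, height, width):
    """Number of fixed-size patches a video tiles into
    (dimensions assumed evenly divisible by the patch size)."""
    return (frames // PATCH_T) * (height // PATCH_H) * (width // PATCH_W)

widescreen = patch_count(frames=40, height=720, width=1280)  # 16:9
vertical   = patch_count(frames=40, height=1280, width=720)  # 9:16
square     = patch_count(frames=40, height=720, width=720)   # 1:1

# The patch size never changes; only the sequence length does,
# just as texts of different lengths become different token counts.
print(widescreen, vertical, square)  # → 36000 36000 20250
```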
OpenAI developed the patches used to train Sora through a two-step process. First, each video in the training dataset is converted into latent space, an abstract mathematical representation that requires less storage than the original files. Next, the latent space is divided into smaller segments, with each segment functioning as an independent spatio-temporal patch.
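A minimal sketch of that two-step pipeline, using a toy "encoder" that merely downsamples by striding as a stand-in for the real learned compressor (whose details OpenAI has not released); the patch and video dimensions are likewise illustrative assumptions:

```python
import numpy as np

# Step 1: map a raw video into a smaller latent representation.
# Real systems use a learned autoencoder; this stand-in just strides
# through the array to show the storage reduction.
def encode_to_latent(video, stride=4):
    # video shape: (frames, height, width, channels)
    return video[::stride, ::stride, ::stride, :].astype(np.float32)

# Step 2: cut the latent volume into spatio-temporal patches,
# each an independent, token-like unit.
def patchify(latent, pt=2, ph=4, pw=4):
    t, h, w, c = latent.shape
    return (latent
            .reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
            .transpose(0, 2, 4, 1, 3, 5, 6)   # group patch axes together
            .reshape(-1, pt, ph, pw, c))       # flatten to a patch sequence

video = np.zeros((32, 128, 128, 3), dtype=np.uint8)  # dummy clip
latent = encode_to_latent(video)   # shape (8, 32, 32, 3)
tokens = patchify(latent)          # shape (256, 2, 4, 4, 3)
print(latent.shape, tokens.shape)
```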
In addition to enabling Sora to switch video aspect ratios, the spatio-temporal patch technique offered other advantages during development. OpenAI states that it allowed Sora to be trained on videos of varying durations, resolutions, and aspect ratios, thereby simplifying the development process.
Additionally, OpenAI has equipped Sora with an advanced set of video customization controls. Experienced users can divide videos into multiple segments and input different instructions for customization in each segment. If a particular frame does not meet requirements, users can modify it by providing subsequent prompts. Sora also offers the capability to extract a specific frame and extend it to create an entirely new video.
The feature called "Blend" allows users to combine two video clips into a new video. In another section of the Sora interface, the "Featured" and "Latest" tabs display videos created by other users.
It is noteworthy that the original version of Sora previewed by OpenAI in February could generate clips up to one minute long, whereas the current release caps duration at 20 seconds. Future updates to Sora may therefore restore support for longer videos.
Furthermore, Sora has not yet been rolled out to ChatGPT's business-oriented plans. If it is integrated into those plans in the future, OpenAI might add features tailored to professional video teams, such as shared content libraries that let teams centrally store material generated with Sora.