Sora is a new tool that creates AI-generated videos from text prompts

OpenAI, the company that develops ChatGPT and DALL-E 3, has set the bar even higher by unveiling its first model for generating video.

Last week, OpenAI unveiled Sora, a video-generation AI model capable of creating footage that looks realistic and, above all, consistent with the physics of the real world. Videos generated by Sora are impressive not just for their image quality but because the tool seems to understand how to render characters that move and act within the constraints of the physical world.
According to OpenAI, Sora is a fundamental step in the evolution of generative AI, amounting to “teaching AI how to understand and simulate the physical world in motion”.
The current version of the model takes in text prompts, just like ChatGPT and DALL-E 3, and produces videos up to one minute long that stay visually faithful to the user’s request. The example videos published by OpenAI are impressive. In one, a woman walks through a city that looks like Tokyo at night. In another, mammoths run through the snow, passing in front of one another without breaking the continuity of the image. In a third, a dog walks from one windowsill to another without ever seeming to float or fly, obeying the pull of gravity our brains would expect.
 

One of the published videos was generated by Sora in response to this prompt: “Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.”

Like ChatGPT, Sora uses a transformer architecture. It learns from existing footage by breaking videos down into smaller units of data called patches, similar to how GPT breaks text down into tokens. Videos are then generated by starting from patches of visual noise that the model progressively “denoises” over 50+ diffusion steps. Thanks to this patch system, the model can create videos in any resolution or orientation. The model also has foresight of many frames at a time, which helps it keep a subject consistent even when it temporarily goes out of view in the generated video.
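The patch-and-denoise pipeline is easier to see in code. Below is a minimal, self-contained sketch in Python (NumPy only); the patch sizes, the linear blending schedule, and the `toy_denoise` stand-in are illustrative assumptions rather than OpenAI’s implementation, and a real diffusion model would use a learned transformer to predict the noise to remove at each step instead of blending toward a known target.

```python
# Toy illustration of the patch-based diffusion idea described above.
# All shapes, patch sizes, and the schedule are assumptions for this sketch.
import numpy as np

def video_to_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into flat spacetime patches,
    analogous to how GPT splits text into tokens."""
    T, H, W, C = video.shape
    return (video
            .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
            .transpose(0, 2, 4, 1, 3, 5, 6)       # group the patch axes together
            .reshape(-1, pt * ph * pw * C))       # one row per spacetime patch

def toy_denoise(target_patches, steps=50):
    """Stand-in for the diffusion loop: start from pure Gaussian noise and
    step toward the target. A trained model would instead *predict* the
    noise to subtract at each step, conditioned on the text prompt."""
    x = np.random.randn(*target_patches.shape)    # the initial "noise patches"
    for t in range(steps):
        alpha = (t + 1) / steps                   # simple linear schedule
        x = (1 - alpha) * x + alpha * target_patches
    return x

video = np.random.rand(16, 64, 64, 3)             # dummy 16-frame RGB clip
tokens = video_to_patches(video)                  # -> (64, 3072) patch "tokens"
restored = toy_denoise(tokens)                    # iteratively denoised patches
print(tokens.shape, np.allclose(restored, tokens))
```

The point of the sketch is the data flow: a video becomes a flat sequence of spacetime patches, which is exactly the shape a transformer consumes, and because the patching works for any (T, H, W), the same machinery handles any resolution or orientation.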
 


Sora is currently not available to the public. Before releasing it in any OpenAI product, the company wants to properly assess safety concerns and the potential for misuse of the technology.
“We’ll be taking several important safety steps before making Sora available in OpenAI’s products. We are working with red teamers – domain experts in areas like misinformation, hateful content, and bias – who will be adversarially testing the model,” the company wrote in an article about the new model. “We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.”