OpenAI debuts Text-to-Video: What’s So Right about ‘Sora’?

OpenAI has introduced Sora, a new AI model that can create hyper-realistic one-minute videos based on text prompts.

The Sam Altman-led AI start-up is also reportedly working with visual artists, designers, and filmmakers to gather feedback on the tool.

Altman took to his X account to introduce Sora, sharing a host of videos on his profile to showcase the efficiency and visual capabilities of the new AI model.

The model is currently in the red-teaming phase, in which the company works to identify flaws in the system. OpenAI has yet to share any information about a wider launch.

What is Sora?

According to OpenAI, Sora is a text-to-video model that generates one-minute-long videos while ‘maintaining the visual quality and adherence to the user’s prompt.’

Sora is capable of generating complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background, the company said.

Sora not only understands users’ prompts but also how the things they describe exist in the real world.

Altman shared creations of Sora based on his followers’ prompts and many let their imaginations run wild.

Sora: What is under the hood?

Sora is a diffusion-based model, which can generate entire videos at once or lengthen videos.

A transformer architecture unlocks superior scaling performance, similar to that of GPT models.

Videos and images are collections of smaller units of data which are known as patches. Each of these patches is similar to tokens in GPT.
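The patch idea can be illustrated with a short sketch. The function below is a hypothetical example, not OpenAI’s actual implementation: the tensor layout, patch sizes, and shapes are assumptions chosen for demonstration. It shows how a video might be cut into spacetime patches that play the role tokens play in GPT.

```python
import numpy as np

def video_to_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Flatten a video of shape (T, H, W, C) into a sequence of
    spacetime patches, analogous to tokenizing text for GPT.
    Patch sizes here are illustrative assumptions."""
    T, H, W, C = video.shape
    assert T % patch_t == 0 and H % patch_h == 0 and W % patch_w == 0
    # Split each axis into (number of patches, patch size).
    video = video.reshape(
        T // patch_t, patch_t,
        H // patch_h, patch_h,
        W // patch_w, patch_w,
        C,
    )
    # Bring the three grid axes together, then flatten each patch.
    video = video.transpose(0, 2, 4, 1, 3, 5, 6)
    return video.reshape(-1, patch_t * patch_h * patch_w * C)

frames = np.zeros((16, 64, 64, 3))  # 16 frames of 64x64 RGB video
patches = video_to_patches(frames)
print(patches.shape)                # (64, 3072): 64 patches of 3072 values
```

Each row of the result is one patch, so a transformer can attend over the sequence of patches just as a language model attends over tokens.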

Sora is built upon past research conducted for DALL·E and GPT models, specifically the recaptioning technique, which involves generating descriptive captions for visual training data.

Apart from generating videos from prompts in natural language, the model can generate a video from an existing image.

Sora has an in-depth understanding of language which allows it to interpret prompts with accuracy.

It can create characters that showcase vibrant emotions and can create multiple shots within a single generated video, consistently maintaining visual style and characters.

According to OpenAI, Sora will essentially animate the image’s components accurately and can extend existing videos by filling in missing frames.

Limitations of Sora

OpenAI also highlighted Sora’s limitations. At present, the model may struggle to simulate the ‘physics of a complex scene’ accurately.

Another area where the Sora engine may find difficulty is understanding specific instances of cause and effect.

The company illustrated this with a cookie example: a video may show a person taking a bite out of a cookie, but afterwards the cookie may not show the bite mark.

Spatial details are another area where Sora’s capabilities might not be fully developed. Prompts containing spatial details could confuse it, and it may struggle with precise descriptions of events that unfold over time.

Safety concerns with Sora

On its official website, OpenAI has stated that it is taking several safety measures before making Sora available in its products.

The company went on to assert that it is working with a team of domain experts ‘specific to misinformation, hateful content, and bias.’ Besides, the company is building tools like a ‘detection classifier’ that can detect misleading content and tell whether a video was generated by Sora.

These experts will be adversarially testing Sora. 

‘We’ll be engaging policymakers, educators, and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time,’ reads the official company statement.

The inclusion of C2PA, an open technical standard that allows publishers, companies, and others to embed metadata in media to verify its origin and related information, is a possibility.

DALL·E 3’s existing safety measures may also be incorporated into Sora. For instance, DALL·E 3 has a mechanism to reject queries that ask for the depiction of public figures by name. This is meant to mitigate the possibility of using DALL·E 3-generated images to spread propaganda and misinformation.

OpenAI will also enforce its usage policies, which prohibit requests for extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others.

Image classifiers will review the frames of every generated video to ensure that they comply with these policies.

Text-to-video models are pushing the boundaries of AI with their video generation capabilities and ushering in the era of advanced Artificial General Intelligence.

Sora is yet another step in that direction, and clearly ahead of comparable tools that were never made accessible to the general public, such as Google’s Imagen Video. Google has also worked on Phenaki, its text-to-video model, and Meta too had its stint with the Make-A-Video tool.

OpenAI will again change the game, just as ChatGPT did.