Pritish Bagdi

"How does Sora by OpenAI revolutionize text-to-video creation?"

In an effort to keep ahead of industry rivals, Microsoft-backed OpenAI has announced its latest breakthrough, Sora, a cutting-edge text-to-video model.



The move demonstrates OpenAI's commitment to keeping a competitive edge in the fast-growing field of artificial intelligence (AI) at a time when text-to-video tools are becoming increasingly popular.


What is Sora?

Sora, which means "sky" in Japanese, is a text-to-video diffusion model capable of producing minute-long videos that are difficult to distinguish from real footage.

OpenAI stated in a post on the X platform (formerly Twitter) that "Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions."

According to the company, the new model can also create lifelike videos from still images or user-supplied footage.

"We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction," the post read.

How can you try it?

Most of us will have to wait to use the new AI model. Although the company unveiled the text-to-video model on February 15, it is still in the red-teaming stage.

Red teaming is a process in which a group of experts, called the "red team," simulates real-world use to find flaws and vulnerabilities in the system.

"We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals," the company stated.

In the meantime, the company published a number of demonstrations in its blog post, and OpenAI's CEO shared videos generated from prompts that users requested on X.

How does it work?

Imagine starting with a noisy, static-filled picture on a TV and gradually removing the fuzziness to reveal a clean, moving video. That is essentially what Sora does. The model uses a "transformer architecture" to progressively remove noise and produce videos.
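The denoising idea described above can be illustrated with a toy loop. This is only a sketch under loud assumptions: the function name `toy_denoise` is hypothetical, the "video" is a short list of numbers, and the noise estimate is computed from a known target, whereas a real diffusion model uses a trained network and never sees the clean result in advance. It is not OpenAI's implementation.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Start from pure noise and remove a fraction of the remaining noise per step.

    ASSUMPTION: `target` stands in for what a trained network would predict.
    A real diffusion model does not know the clean output in advance.
    """
    rng = random.Random(seed)
    # Begin with random "static" of the same length as the target signal.
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for step in range(steps):
        # Hypothetical denoiser: estimate the noise still present in the sample.
        predicted_noise = [s - t for s, t in zip(sample, target)]
        # Remove a growing fraction of that noise; the final step removes all of it.
        fraction = 1.0 / (steps - step)
        sample = [s - fraction * n for s, n in zip(sample, predicted_noise)]
        # Track how far the sample still is from the clean signal.
        errors.append(max(abs(s - t) for s, t in zip(sample, target)))
    return sample, errors


target = [0.0, 0.5, 1.0, 0.5]
sample, errors = toy_denoise(target, steps=10)
print("error after first step:", errors[0])
print("error after last step:", errors[-1])
```

The point of the sketch is the shape of the process: the sample begins as noise and gets closer to a clean signal with every step, rather than being produced in one shot.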

Sora can generate an entire video at once rather than frame by frame. Because the model looks at many frames at a time, a person can stay consistent even if they briefly walk off-screen. Users direct the video's content by feeding the model text descriptions.

Think of how GPT models produce text piece by piece. Sora works in a similar way, but with images and videos: it divides them into smaller units known as patches, which play a role similar to the tokens in GPT.
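To make the idea of patches concrete, here is a minimal sketch of cutting a tiny video into small blocks that span both space and time. The function name `patchify`, the patch sizes, and the use of raw pixel values in nested lists are all assumptions for illustration; the article does not describe how Sora actually forms its patches.

```python
def patchify(video, pt, ph, pw):
    """Split a video (frames x height x width nested lists) into flat patches.

    ASSUMPTION: patch sizes pt, ph, pw (time, height, width) are illustrative.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    # Walk over the video in blocks of pt frames, ph rows, and pw columns.
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one block into a single list, like one "token".
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


# A tiny 4-frame, 4x4 "video" whose pixel values encode their own position.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)] for t in range(4)]
patches = patchify(video, pt=2, ph=2, pw=2)
print(len(patches), len(patches[0]))  # 8 patches of 8 values each
```

Each patch ends up as a short sequence of numbers, so a whole video becomes a list of such sequences, much as a sentence becomes a list of tokens for a GPT model.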

"Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully," the company said in the blog post.

However, the company has not provided any details on what kind of data the model is trained on.















