OpenAI Sora: One Step Away From The Matrix
Author: Anonymous
Posted on: 17 Feb 2024

OpenAI has announced the most important AI model of 2024 so far: Sora, a state-of-the-art (SOTA) text-to-video model that can generate high-fidelity videos up to one minute long, at different aspect ratios and resolutions. Calling it SOTA is an understatement; Sora is miles ahead of anything else in the space. It’s general, scalable, and it’s also… a world simulator?
Quick digression: sorry, Google. Gemini 1.5 was the most important release yesterday, and perhaps of 2024, but OpenAI didn’t want to give you a single ounce of the spotlight (if Jimmy Apples is to be believed, OpenAI has had Sora ready since March, which would explain why it manages to be so timely in disrupting competitors’ PR moves). I’ll do a write-up about Gemini 1.5 anyway because, although it went under the radar, we shouldn’t ignore a 10M-token context window breakthrough.
Before you ask, OpenAI isn’t releasing Sora at this time (not even as a low-key research preview). The model is going through red-teaming and safety checks. OpenAI wants to gather feedback from “policymakers, educators and artists around the world.” They’re also working on a detection classifier to recognize Sora-made videos and on ways to prevent misinformation.
In the second part (hopefully soon), I’ll share reflections about where I think we’re going both technologically and culturally (there’s optimism but also pessimism). I hope you enjoy this first part because the second one, well, is not for amusement—which is appropriate given that soon everything will be.
Sora is a text-to-video model
Sora is a high-quality text-to-video model (compared to the competition), which is impressive in itself.
Sora is a diffusion transformer
Sora combines a diffusion model (the approach behind DALL-E 3) with a transformer architecture (the approach behind ChatGPT). The combination allows the model to process videos (which are temporal sequences of image frames) much like ChatGPT processes text: as sequences of tokens.
In particular, OpenAI has taken inspiration from Google’s work on vision transformers to “represent videos and images as collections of smaller units of data called [spacetime] patches, each of which is akin to a token in GPT.”
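To make the idea of spacetime patches concrete, here is a minimal sketch of how a raw video tensor could be cut into token-like units. Everything in it is illustrative: the patch sizes and the function are my own assumptions, and OpenAI’s report notes that Sora actually patchifies a compressed latent representation of the video rather than raw pixels.

```python
# Illustrative sketch only: split a video into "spacetime patches" so a
# transformer can treat it as a sequence of tokens. Patch sizes are arbitrary
# choices, not values from OpenAI's report.
import numpy as np

def to_spacetime_patches(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16) -> np.ndarray:
    """Split a video of shape (T, H, W, C) into flattened spacetime patches
    of shape (num_patches, pt * ph * pw * C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dimensions must divide evenly"
    # Split each axis into (number of patches along that axis, patch extent).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three patch-index axes to the front, then flatten each patch.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)     # (num_patches, patch_dim)

# A 16-frame 128x128 RGB clip becomes a sequence of 4 * 8 * 8 = 256 "tokens".
clip = np.random.rand(16, 128, 128, 3).astype(np.float32)
print(to_spacetime_patches(clip).shape)        # (256, 3072)
```

In a real diffusion transformer, these patch vectors would be linearly embedded, given positional information, noised, and then denoised by the transformer; the point of the sketch is simply that a video becomes a sequence, which is what lets the GPT-style machinery apply.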
Sora is a generalist, scalable model of visual data
Not only can Sora generate images and videos from text, or transform existing images and videos into new videos; it can do so in a generalized, scalable way that competitors can’t match.
For instance, Sora “can create multiple shots within a single generated video that accurately persist characters and visual style.” It can generate videos up to a minute long, but you can also make them as short as you like, and in vertical, square, or horizontal formats at different resolutions. From the report: “Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything inbetween.” Here’s an example.
Besides versatility, Sora appears to follow scaling behavior that mirrors that of language models: sample quality improves substantially just by adding training compute, thanks to the properties of the transformer architecture. Here’s an example.
This generalized, scalable nature is what leads people to predict that AI will disrupt Hollywood and movie-making in general. Given the pace of progress, it’s not crazy to imagine that in a few months an AI model will be able to create complex, multi-scene, multi-character videos 5 or 10 minutes long.
Do you remember Will Smith eating spaghetti? That was one year ago.
Sora is a (primitive) world simulator
This is the news that has excited (worried?) me the most.
First, here’s a recap. Sora is a text-to-video model. Fine, it’s better than the rest, but this technology already existed. Sora is a diffusion transformer. Likewise, OpenAI didn’t invent the mix, although it added interesting custom ingredients. Sora is a general and scalable visual model. Here things begin to get interesting: possibilities open up for future research, and surprise is warranted.
But, above all else, Sora is an AI model that can create physically sound scenes with believable real-world interactions. Sora is a world simulator. A primitive one, for sure (it fails, sometimes so badly that it’s better to call it “dream physics”), but the first of its kind.
OpenAI says Sora understands not only the style, scenery, characters, objects, and concepts present in the prompt, but also “how those things exist in the physical world.” I want to qualify this claim: Sora’s eerie failures reveal that, although it might have learned an implicit set of physical rules that inform the video generation process, this isn’t a robust ability (OpenAI admits as much). But it’s surely a first step in that direction.
More from OpenAI on Sora as a world simulator (edited for clarity):
[Sora can] simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.
Simulation capabilities:
3D consistency
Long-range coherence and object permanence (e.g. our model can persist people, animals and objects even when they are occluded or leave the frame)
Interacting with the world (e.g. a painter can leave new strokes along a canvas that persist over time)
Simulating digital worlds (e.g. Minecraft)
I like Jim Fan’s take on this (and his breakdown of the pirate ship fight video):
Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos. Sora is a learnable simulator, or “world model”.
Of course it does not call UE5 [Unreal Engine 5] explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set.
OpenAI concluded the blog post with this sentence:
Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.
So I will conclude this first part with two questions for you:
How far are we from The Matrix?
Do we really want to go there?