If you picture a cape-clad dog soaring through the clouds or an astronaut riding a horse on Mars, you may think you’re experiencing a fever dream.
But these surreal images exist outside of a sleepy daze: You can pull them up on your computer right now.
They were created by Meta’s top-of-the-line algorithms that can turn any text into a (somewhat) realistic video. Last month, Meta used these surreal clips to introduce its Make-A-Video AI text-to-video generator to the world.
Just days later, Google showed off not one but two AI video generators: Imagen Video and Phenaki. These models were designed to transform text descriptions into short video clips. The longest clips come from Phenaki and can run several minutes.
While these modern marvels haven’t yet been released for public access, they could forever change the way we make art. So far, the models have received both criticism and praise from AI and art experts.
“The big commercial advances are quite stunning, even for experts,” says David Bau, a computer scientist at Northeastern University.
How models sculpt videos from scratch
This isn’t the first instance of AI-powered video manipulation: Over the past few years, for example, several startups have figured out how to adjust lip movements to sync with audio. It’s also possible to swap people’s faces to, say, fool viewers into thinking a celebrity starred in a movie they weren’t actually involved in.
And now, the new text-to-video models can fuse unrelated concepts (like knitting, music, and pandas) and turn out a mind-bending final product (such as a panda knitting on a couch while bopping its head to music).
“That capability to create novel compositions of such a wide range of visual concepts is new,” says Bau. “That is what is so stunning.”
Make-A-Video and Imagen Video belong to a group of machine-learning models called diffusion models.
Engineers train a diffusion model by showing it a captioned image or video that has been obscured with visual noise and asking the model to restore it. In the process, the model learns to predict what images certain words might translate to onscreen.
To train a model on the scale needed for Make-A-Video and Imagen Video, engineers had it study hundreds of millions of images.
Finally, the model was able to tackle prompts. Like a sculptor slowly chiseling away at a chunk of stone, a diffusion model forms an image or video over hundreds or even thousands of runs — cleaning up pixels here, coloring a shape there, reshaping an edge elsewhere. Ultimately, a somewhat convincing clip remains.
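In very rough terms, that loop looks something like the toy Python sketch below. It is an illustration only, not Meta's or Google's actual code: the denoise_step function and its flat gray "target" are stand-ins for the trained neural network and the image it has learned to produce from a prompt.

```python
import numpy as np

def denoise_step(frame, step, total_steps):
    """Stand-in for the trained network. In a real diffusion model, the
    network looks at the noisy frame (and the text prompt) and predicts a
    little of the noise to strip away. Here we simply nudge every pixel
    toward a fixed target so the loop structure is visible."""
    target = np.full_like(frame, 0.5)        # hypothetical "finished" frame
    blend = 1.0 / (total_steps - step)       # remove a small amount per pass
    return frame + blend * (target - frame)

total_steps = 1000
frame = np.random.randn(64, 64, 3)           # start from pure random noise
for step in range(total_steps):
    frame = denoise_step(frame, step, total_steps)
# After hundreds of passes the noise has been chiseled away and a coherent
# frame remains; a video model repeats this for a whole stack of frames.
```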
Phenaki's creators similarly showed it millions of images and videos with accompanying text — but Phenaki learned which words in the text were important. That means it can take, say, a paragraph-sized narrative, break it down into a series of events, and turn that series into a movie of whatever length Phenaki deems appropriate.
Falling flat (for now)
Of course, the process is far from perfect. For one, text-to-video models are only as good as their (massive) training datasets, and it’s hard to separate data from the prejudices of the people who created them.
Imagen Video’s own creators acknowledged that their model is capable of turning out racist or sexist content. In fact, Google won’t release it to the public until the company addresses “several important safety and ethical challenges,” according to the preprint.
The creators noted that Imagen Video can make videos that are “fake, hateful, explicit or harmful,” and it’s hard to detect and filter out this type of content at the moment.
Meta has remained mum on the topic, but a spokesperson said that the company “will continue to explore ways to further refine and mitigate potential risk.”
And these models aren’t likely to impress working artists yet. Compared with human-made work, the cherry-picked, computer-generated demonstration clips fall flat. But that could change.
Real-world applications
As text-to-video AI improves, some experts think the stock video industry could draw particular benefits. “Why would you pay a large amount of money … to license an image or a video, if you could just generate it on the spot?” says Henry Ajder, an AI scientist and consultant who researches deepfakes.
Generative models could also benefit Hollywood. Filmmakers could use them to envision how an actor might look in a certain part, for instance, or to plan out scenes before shooting them.
At some point, text-to-video might excel at spitting out certain products — banner advertisements, for example, or basic animations in video games. For artists, that could pose a major dilemma.
After all, some art jobs are already being outsourced to AI image generation, says Julian Merkle, a video game concept artist at Beffio Studio.
What’s more, the convergence of art and AI raises yet-to-be-answered questions. Who actually owns the rights to an AI-generated work of art? And what happens when models create media “in the style of” existing artists, as some are already doing?
“There wasn't a consensus before the rise of AI and there won't likely be one after, though people seem quicker to attack AI on it than artists,” Merkle says. “I do think this is an issue with our copyright system.”
On the other hand, text-to-video could put more power in the hands of individuals. Imagine a single person who wants to make a video game.
Today, that’s an uphill battle: independent game creators must be skilled enough to make drawings, animations, cutscenes, and text. But in the future, one person might be able to accomplish all of that with the help of generative models.
That’s all far on the horizon. For now, large-scale video generators are essentially locked jewels displayed in the companies’ glass cases. This means that researchers can’t take a closer look at the models’ mysterious workings.
“To me, the combination of these two things seems like a fragile and undesirable situation,” Bau says.
An uncertain future
But the engineers behind other AI art generators tend to be far more transparent. The text generator GPT-3, which was created by the San Francisco-based OpenAI lab, can write a poem or text summary, among other tasks. While it isn’t free to use, GPT-3 spawned a wave of open-source language models.
And earlier this year, the image generator DALL-E 2 met its open-source match with Stable Diffusion.
So anyone with a powerful enough computer (a good gaming rig will do) could download and tinker with Stable Diffusion or a video-generating equivalent. But running a model locally also lets users strip away the filters that keep the proprietary models from spawning offensive content.
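For a sense of how low that barrier already is for images, a minimal sketch using the open-source diffusers library might look like the following; the checkpoint name and prompt are illustrative examples, not a recipe from any of the companies involved.

```python
# Rough sketch of running Stable Diffusion locally with Hugging Face's
# open-source diffusers library (model name and prompt are examples only).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # a publicly shared checkpoint
    torch_dtype=torch.float16,
).to("cuda")                            # a consumer gaming GPU is enough

# The bundled content filter is just another component the local user
# controls, which is why these models are so hard to moderate.
image = pipe("an astronaut riding a horse on Mars").images[0]
image.save("astronaut.png")
```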
This might open the floodgates for more creative and dangerous deepfakes. Bullies could produce footage to target classmates, or abusers could churn out explicit clips of their exes. “When everyone can use it, it’s no longer just the celebrities that are being targeted,” says Ajder. “Moderation is the biggest challenge, for me, with these tools.”
So while we may soon enter an age of easy-access deepfakes, we likely have a few years to prevent dangerous outcomes. “We’re still a way away from hyperrealistic, indistinguishable-from-authentic content on video,” Ajder says.