It almost seems quaint to remember when all AI could do was generate images from a text prompt. Over the last couple of years generative AI has become more and more powerful, making the jump from photos to videos with the advent of tools like Sora. And now Microsoft has introduced a powerful tool that might be the most impressive (and terrifying) we've seen yet.
VASA-1 is an AI image-to-video model that can generate videos from just one photo and a speech audio clip. Videos feature synchronised facial and lip movements, as well as "a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness."
On its research website, Microsoft explains how the tech works. "The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviours."
Microsoft just dropped VASA-1.This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba10 wild examples:1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnDApril 18, 2024
In other words, it's capable of creating deepfake videos based on a single image. It's notable that Microsoft insists the tool is a "research demonstration and there's no product or API release plan." Seemingly in an attempt to allay fears, the company is suggesting that VASA-1 won't be making its way into users' hands any time soon.
From Sora AI to Will Smith eating spaghetti, we've seen all manner of weird and wonderful (but mostly weird) AI generated video content, and it's only going to get more realistic. Just look how much generative AI has improved in one year.