Cool or creepy? Microsoft's VASA-1 is a new AI model…

Cool or creepy? Microsoft's VASA-1 is a new AI model that turns photos into 'talking faces'

A new AI research paper from Microsoft promises a future where you can upload a photo, a sample of your voice and create a live, animated talking head of your own face.

VASA-1 takes in a single portrait photo and an audio file and converts it into a hyper realistic talking face video complete with lip sync, realistic facial features and head movement.

The model is currently only a research preview and not available for anyone outside of the Microsoft Research team to try, but the demo videos look impressive.

Similar lip sync and head movement technology is already available from Runway and Nvidia but this seems to be of a much higher quality and realism, reducing mouth artifacts. This approach to audio-driven animation is also similar to a recent VLOGGER AI model from Google Research.

How does VASA-1 work?

Microsoft says this is a new framework for the creation of lifelike talking faces and specifically for the purpose of animating virtual characters. All of the people in the examples were synthetic, made using DALL-E but if it can animate a realistic AI image, it can animate a real photo.

In the demo we see people talking as if they were being filmed, with slightly jerky but otherwise natural-looking movement. The lip sync is very impressive, with natural movement and no artefacts around the top and bottom of the mouth seen in other tools.

One of the most impressive things about VASA-1 seems to be the fact it doesn't require a face-forward portrait style image to make it work.

There are examples with shots facing a range of directions. The model also seems to have a high degree of control, capable of taking eye gaze direction, head distance and even emotion as an input to steer the generation.

What is the point of VASA-1?

One of the most obvious use cases for this is in advanced lip synching for games. Being able to create AI-driven NPCs with natural lip movement could be a game-changer for immersion.

It could also be used to create virtual avatars for social media videos, as seen already from companies like HeyGen and Synthesia. One other area is in AI-based movie making. You could make a more realistic music video if you can have an AI singer that looks like they are singing.

That said, the team say this is just a research demonstration, with no plans for a public release or even making it available to developers to use in products.

How well does VASA-1 work?

One thing that surprised the researchers was the ability of VASA-1 to perfectly lip-sync to a song, reflecting the words from the singer without issue despite no music being used in the training dataset. It also handled different image styles including the Mona Lisa.

They've got it creating 512x512 pixel images at 45 frames per second and can do it in about 2 minutes using a desktop-grade Nvidia RTX 4090 GPU.

While they say this is only for research, it will be a shame if this doesn’t get out into the public domain, even if only for developers as I’d love to see it in Runway or Pika Labs. Given Microsoft has a huge stake in OpenAI this could even be part of a future Copilot Sora integration.

More from Tom's Guide

Read news from 100's of titles, curated specifically for you.

Already a member? Sign in here