Google just published a new AI model that can take two images and fill in the gap to create a seamless animation that looks similar to a live photo.
VDIM (Video Interpolation With Diffusion Models) was created by Google's research division DeepMind. It uses one image as the first frame and the other as the final frame, then fills in all the frames in between using AI to make the video.
This could be great for bringing to life the flurry of pics you took of your kids playing in the park, or from an event where you forgot to film the action.
It is currently only a research preview, but the underlying technology could one day be an everyday part of taking photos on a smartphone.
How does VDIM work?
VDIM turns still images into video by creating the missing frames using diffusion models, similar to those behind Midjourney, DALL-E or Google's own Imagen 2.
Essentially, it uses the two images as reference material, much like Runway or Pika Labs can create a video from a single image, but in this case it uses them as the first and last frames.
It starts by creating a low-resolution version of the full, final video, running a cascade of diffusion models in sequence to progressively refine it. This first step allows VDIM to capture the motion and dynamics of the final output.
It then passes this low-resolution video to a higher-resolution stage, where it is upscaled and refined to more closely match the input images and make the motion more fluid.
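To make that two-stage cascade concrete, here is a minimal Python sketch of the data flow described above: a low-resolution base pass conditioned on the two photos, followed by an upscaling pass that re-anchors each frame to the full-resolution inputs. Every function name here is a placeholder I've invented for illustration, and the simple cross-fades stand in for what would really be learned diffusion models; VDIM itself is not publicly available.

```python
import numpy as np

def downsample(img, size):
    """Naive nearest-neighbour downsample (placeholder for a proper resize)."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, size[0]).astype(int)
    xs = np.linspace(0, w - 1, size[1]).astype(int)
    return img[np.ix_(ys, xs)]

def upsample(img, size):
    """Naive nearest-neighbour upsample (placeholder for a proper resize)."""
    return downsample(img, size)  # same index trick works in both directions

def base_stage(first, last, num_frames=16, low_res=(64, 64)):
    """Stage 1: a low-resolution video that captures the overall motion.

    In the real system this would be a cascade of video diffusion models
    conditioned on the first and last frames; here it is faked with a
    simple cross-fade at reduced resolution.
    """
    small_first = downsample(first, low_res)
    small_last = downsample(last, low_res)
    alphas = np.linspace(0.0, 1.0, num_frames)
    return np.stack([(1 - a) * small_first + a * small_last for a in alphas])

def super_resolution_stage(low_res_video, first, last):
    """Stage 2: upscale each frame and pull it back toward the input photos.

    The real model would run another conditional diffusion model here; this
    sketch just resizes and blends toward the high-resolution endpoints so
    the first and last frames match the originals exactly.
    """
    num_frames = low_res_video.shape[0]
    target_shape = first.shape[:2]
    frames = []
    for i in range(num_frames):
        up = upsample(low_res_video[i], target_shape)
        a = i / (num_frames - 1)
        anchor = (1 - a) * first + a * last   # conditioning on the two inputs
        frames.append(0.5 * up + 0.5 * anchor)
    frames[0], frames[-1] = first, last       # keep the input frames exact
    return np.stack(frames)

# Example: two 256x256 RGB "photos" in, a 16-frame video out.
first_frame = np.random.rand(256, 256, 3)
last_frame = np.random.rand(256, 256, 3)
low = base_stage(first_frame, last_frame)
video = super_resolution_stage(low, first_frame, last_frame)
print(video.shape)  # (16, 256, 256, 3)
```

The point of the two stages is the same as in the paper's description: get the motion right cheaply at low resolution first, then spend the expensive high-resolution work on matching the detail of the two photos you actually took.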
What could VDIM be used for?
One potential use case for VDIM that the team examined in the research paper is video restoration. AI has been used to improve old images, but this could help clean up old family movies or restore a film with broken frames.
Old movies can suffer from a burnt-out frame in the middle of a sequence, making them hard to watch, or a handful of frames marred by scratches.
VDIM could be given the first and last clean frames and used to recreate the motion between those two points, restoring the footage to its original quality.
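Here is a rough sketch of how such a repair loop might look: scan the film for runs of damaged frames and regenerate each run from the clean frames on either side. Both is_damaged and interpolate_frames are made-up placeholders, and the cross-fade again stands in for what would actually be a VDIM-style generative model.

```python
import numpy as np

def is_damaged(frame, threshold=0.95):
    """Crude damage check: flag frames that are nearly all white (burnt out)."""
    return frame.mean() > threshold

def interpolate_frames(first, last, n):
    """Placeholder for a VDIM-style model: cross-fade between two clean frames."""
    alphas = np.linspace(0, 1, n + 2)[1:-1]   # n points strictly between the endpoints
    return [(1 - a) * first + a * last for a in alphas]

def restore(frames):
    """Replace each contiguous run of damaged frames with interpolated ones."""
    restored = list(frames)
    i = 0
    while i < len(restored):
        if is_damaged(restored[i]):
            start = i
            while i < len(restored) and is_damaged(restored[i]):
                i += 1
            # Only repair runs that have a clean frame on both sides.
            if start > 0 and i < len(restored):
                restored[start:i] = interpolate_frames(
                    restored[start - 1], restored[i], i - start
                )
        else:
            i += 1
    return restored
```

In practice the damage detection and the frame generation would both be far more sophisticated, but the workflow is the same: find the gap, hand the model the clean frame before and after, and let it fill in the motion.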
How well does VDIM work?
Nobody outside of the Google DeepMind research team has actually used VDIM yet, as it is a research project, but the example video clips suggest it's a good start for a new type of AI video.
Some of the example videos shared by Google DeepMind include the start of a box cart race, where just two still images were turned into a fluid run-off.
Another shows a woman on a swing, with two images turned into a fluid swinging motion, and a third shows a camel going from a jittery jump to a natural walk.
Personally, I think this is one of those research projects Google should take forward and find a way to implement in live software, particularly for video restoration, especially if it can be extended beyond just a couple of seconds or a dozen or so frames.