A new AI video model seems to come along every week, and the latest, from Chinese tech giant Tencent, is a big deal. Hunyuan offers state-of-the-art video quality and motion while also being fully open-source.
Hunyuan Video is a 13-billion-parameter diffusion transformer model that can take a simple text prompt and turn it into a high-resolution 5-second video. Currently there aren't many places to try it outside China, but as it's open-source that will change. One service, FAL.ai, has already created a version you can play with.
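If you'd rather script generations than use the web playground, fal.ai also exposes its hosted models through a small Python client. The sketch below is illustrative only: the "fal-ai/hunyuan-video" endpoint ID, the prompt-only arguments and the shape of the response are assumptions to verify against fal.ai's documentation, and you'll need an API key in the FAL_KEY environment variable.

```python
# Minimal sketch: generating a Hunyuan Video clip through fal.ai's Python client.
# Assumes the "fal-ai/hunyuan-video" endpoint ID and a response containing a
# "video" object with a "url" field; verify both against fal.ai's docs.
# Requires `pip install fal-client` and a FAL_KEY set in your environment.
import fal_client

result = fal_client.subscribe(
    "fal-ai/hunyuan-video",
    arguments={"prompt": "A dog on the train."},
)

# Download link for the finished clip (generation can take many minutes).
print(result["video"]["url"])
```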
The demo video looks impressive, with short sequences each offering a glimpse at natural-looking human and animal motion in a photorealistic style. There are also clips showing different animation styles.
Current implementations I've tried take up to 15 minutes to generate 5 seconds of video, so I haven't had much time to experiment. My tests so far suggest the output is roughly on par with Runway Gen-3 and Luma Labs Dream Machine, though prompt adherence (at least in English) isn't as good.
How Hunyuan works
Hunyuan is an open-source AI video model with 13 billion parameters. This makes it much larger than similar open-source models, including the impressive Mochi-1 from Genmo. However, not all parameters are created equal, so the extra size could be more bloat than performance; it will take more testing to tell.
It works like any other AI video model: you give it text or an image, and it gives you a video based on your input. It is available as a download, but the current version requires at least 60GB of GPU memory, so you're looking at something like an Nvidia H800 or H20.
Because it's open-source, and as happened with Mochi-1, there will likely be some fine-tuning to bring the requirements down so you can run it on something like an RTX 4090.
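If you do want to attempt a local run today, Hugging Face's diffusers library includes a HunyuanVideo pipeline with the usual memory-saving switches such as CPU offload and VAE tiling. The sketch below is a rough illustration rather than a tuned recipe: the community checkpoint name, resolution, frame count and dtypes are assumptions to check against the diffusers documentation, and even with offloading a consumer GPU will be slow.

```python
# Rough sketch of local text-to-video with the diffusers HunyuanVideo pipeline.
# The checkpoint name and parameter values below are assumptions; check the
# diffusers documentation for the currently recommended settings.
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"  # assumed diffusers-format repo

# Load the 13B transformer in bfloat16 and the rest of the pipeline in float16.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)

# Memory savers: decode the video in tiles and keep idle submodules on the CPU
# so the 13B model has a chance of fitting on a single consumer GPU.
pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="A dog on the train.",
    height=320,
    width=512,
    num_frames=61,           # roughly four seconds at 15 fps
    num_inference_steps=30,
).frames[0]

export_to_video(frames, "dog_on_train.mp4", fps=15)
```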
Tencent says that during testing it was able to achieve high visual quality, motion diversity and generation stability, with human evaluations putting it on par with all the major commercial models. Being open-source does give it an advantage in that the entire community can add features and improve the model.
The company said in the documentation that "this will empower everyone in the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem."
How well does Hunyuan work?
I've tried it out on FAL.ai and found that its prompt adherence and contextual understanding of physics weren't as good as the documentation promised, nor as good as Runway, Kling or Hailuo.
For example, I gave it my traditional test prompt: "A dog on the train." This tests how it handles a less descriptive prompt and one that requires an understanding of motion and speed.
It did OK but its output was overly simplistic. When I try the same prompt with other models I get rapid motion outside the windows, a clear train interior and a cute dog sitting on a seat. Hunyuan gave me a dog, but the setting looked more like a doctor's waiting room than a train carriage.
Mochi-1 produced an output comparable to Runway and Kling from the same prompt. It's possible this was just a bad generation from Hunyuan and that trying again would give a better result, but at 15 minutes per attempt I didn't have time.