Google Research scientists have released their paper on GameNGen, an AI-based game engine that generates original Doom gameplay entirely on a neural network. Using Stable Diffusion, scientists Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter designed GameNGen to process its own previous frames plus the player's current input and generate new frames of the game world with surprising visual fidelity and cohesion.
AI-generating a complete game engine with consistent logic is a unique achievement. GameNGen's Doom plays like an actual video game: you can turn and strafe, fire weapons, and take accurate damage from enemies and environmental hazards. A level is built around you in real time as you explore it. The engine even keeps a mostly accurate tally of your pistol's ammo. According to the study, the game runs at 20 FPS, and short clips of it are difficult to distinguish from actual Doom gameplay.
To obtain all of the training data necessary for GameNGen to accurately model its own Doom levels, the Google team trained an agent AI to play Doom at every difficulty, simulating a range of player skill levels. Actions like collecting power-ups and completing levels were rewarded, while taking damage or dying was penalized. The result was a set of agents that could play Doom, providing hundreds of hours of visual training data for the GameNGen model to reference and recreate.
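The paper doesn't spell out the agent's exact reward function here, but the shaping the article describes is easy to sketch. Below is a minimal, hypothetical Python example; the weights and state fields are illustrative assumptions, not values from the GameNGen paper.

```python
# Hypothetical reward shaping for a Doom-playing agent. All weights and
# state fields are illustrative assumptions, not the paper's values.

def compute_reward(prev_state: dict, state: dict) -> float:
    reward = 0.0
    # Reward the behaviors the article mentions: collecting power-ups
    # and completing levels.
    reward += 10.0 * (state["items_collected"] - prev_state["items_collected"])
    if state["level_complete"]:
        reward += 100.0
    # Penalize taking damage and dying.
    health_lost = max(0, prev_state["health"] - state["health"])
    reward -= 1.0 * health_lost
    if state["is_dead"]:
        reward -= 100.0
    return reward
```

An agent trained against a signal like this explores levels and fights enemies while avoiding hazards, which is what produces varied, human-like footage for the diffusion model to learn from.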
A significant innovation in the study is how the scientists maintained cohesion between frames while running Stable Diffusion over long periods. Stable Diffusion is a ubiquitous generative AI model that produces images from image or text prompts and has been used in animation projects since its release in 2022.
Stable Diffusion's two most significant weaknesses for animation are its lack of cohesion from frame to frame and the gradual decay of its visual fidelity over long sequences. As seen in Corridor's Anime Rock Paper Scissors short film, Stable Diffusion can create convincing still images but suffers from flickering as the model outputs consecutive frames (notice how the shadows jump across the actors' faces from frame to frame).
The flickering can be reduced by feeding Stable Diffusion its own output, conditioning each new frame on the image it just created so that consecutive frames match one another. However, after several hundred frames, the image generation drifts further and further from the source material, similar to the effect of photocopying a photocopy many times.
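A toy numerical sketch shows why this happens. The loop below stands in for an autoregressive generator; the frame sizes and error magnitude are arbitrary assumptions chosen only to illustrate compounding error.

```python
import numpy as np

# Toy illustration of autoregressive drift. Each "frame" is regenerated
# from the previous one with a small reconstruction error; over hundreds
# of steps the errors compound, like photocopying a photocopy.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))  # stand-in for an initial game frame
original = frame.copy()

for step in range(300):
    # A generator conditioned only on the last frame reproduces it
    # imperfectly; that imperfection becomes part of the next input.
    frame = np.clip(frame + rng.normal(0.0, 0.01, frame.shape), 0.0, 1.0)

print(f"Mean pixel drift after 300 frames: {np.abs(frame - original).mean():.3f}")
```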
Google Research solved this problem by conditioning each new frame on a longer sequence of the player's inputs and the frames that preceded it, rather than a single prompt image, and by corrupting those context frames with Gaussian noise during training. A separate but connected neural network then cleans up the context frames, producing a constantly self-correcting image and a level of visual stability that holds for long periods.
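In code, the noise-augmentation idea might look something like the sketch below. The tensor shape, noise range, and return format are assumptions for illustration, not the paper's implementation.

```python
import torch

def corrupt_context(context_frames: torch.Tensor, max_noise: float = 0.7):
    """Add Gaussian noise to a batch of context frames during training.

    context_frames: (batch, frames, channels, height, width). Shapes and
    the noise range here are illustrative assumptions.
    """
    # Sample a random corruption level for each training example...
    noise_level = torch.rand(context_frames.shape[0]) * max_noise
    sigma = noise_level.view(-1, 1, 1, 1, 1)
    noisy = context_frames + sigma * torch.randn_like(context_frames)
    # ...and return it alongside the frames, so the model can be told how
    # corrupted its context is and learn to compensate.
    return noisy, noise_level
```

Because the model sees degraded context during training and knows the degradation level, it learns to treat its own imperfect outputs at inference time the same way, pulling each frame back toward a clean image instead of compounding errors.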
The examples of GameNGen seen so far are, admittedly, less than perfect. Blobs and blurs pop up on-screen at random times. Enemies become blurry mounds after death. Doomguy on the HUD is constantly flickering his eyebrows up and down like he's The Rock on Monday Night Raw. And, of course, the levels generated are inconsistent at best; the embedded YouTube video above ends in a poison pit where Doomguy's health suddenly stops dropping at 4% and the pit completely changes layout after he turns a full 360 degrees inside it.
While the result is not a winnable video game, GameNGen produces an impressive simulacrum of the Doom we love. Somewhere between a tech demo and a thought experiment on the future of AI, Google's GameNGen could become a crucial part of AI game development if the field continues to advance. Paired with Caltech's research on using Minecraft to teach AI models consistent map generation, AI-based video game engines could be coming to a computer near you sooner than we thought.