In the last few months, we've seen how large language models such as ChatGPT can generate text copy, how image generators like Stable Diffusion can create pictures on demand and even how some models can convert text to speech. One enterprising developer, who goes by the handle Pizza Later, combined five different AI models to create a live-action commercial for a fictional pizza restaurant called "Pepperoni Hug Spot."
The resulting video, which I've embedded below, is horrifying and impressive at the same time. The commercial features photo-realistic people eating, cooking and delivering some very appetizing pepperoni pizza. It even has human-sounding dialog and decent background music. However, the facial expressions and dead eyes on some of the characters are a little much.
Obviously, the quality of the output leaves something to be desired. At times, objects appear to blend into each other; my son said that it looked like the people were eating pizza that grew out of the plate.
The people all look like residents of the uncanny valley. And the somewhat incoherent script reads like text from another language that was improperly translated into English (though it was not).
However, it's impressive to see just how close these technologies are to being ready for prime time. We can see how, in short order, the photo-realistic video images could become a lot more convincing.
To be fair, this video did require some human editing. Pizza Later told us that they used five different models to make various assets for the video and then spent some time in Adobe After Effects stitching the video, dialog, music and some custom images together. Overall, it took them three hours to complete the project.
Pizza Later said they got the idea for the commercial after gaining access to Runway Gen-2, a text-to-video model that's in private beta. In an email interview, the developer told me that their initial prompt for the video was just "a happy man/woman/family eating a slice of pizza in a restaurant, tv commercial." Runway Gen-1, which creates videos based on existing footage, is available to try for free right now, either on the web or via a brand-new iOS app.
After seeing the high quality of video that Runway Gen-2 created, Pizza Later used GPT-4 (the engine behind ChatGPT and Bing Chat) to come up with a name for the fictional pizza joint (Pepperoni Hug Spot) and to write the script. The developer then used ElevenLabs Prime Voice AI to provide realistic narration with a male voice. They used MidJourney to generate some images that appear in the video, including the restaurant exterior and some pizza patterns. They also used Soundraw to create background music.
Most of the tools Pizza Later used are paid, but each offers some kind of free trial, lower-end free account or initial set of free credits. Clearly, this is far from a plug-and-play operation, as the developer had to stitch the end results together.
Perhaps, in the near future, a multi-model tool like Microsoft Jarvis will be able to perform all these tasks via a single chat prompt. Or maybe an autonomous agent such as AutoGPT (see how to use AutoGPT) will generate commercials if you give it the broad goal of marketing a restaurant. For now, though, this video is really impressive, even knowing that it required human editing.