Artificial intelligence speech startup ElevenLabs offered an insight into what it's planning to release in the future, adding sound effects to AI-generated video for the first time.
Best known for its near human-like text-to-speech and synthetic voice services, ElevenLabs added artificially generated sound effects to videos produced using OpenAI’s Sora.
OpenAI unveiled its impressive Sora text-to-video artificial intelligence model last week, showcasing some of the most realistic, consistent and longest AI-generated video to date.
ElevenLabs says it isn’t ready to release its text-to-sfx model yet but when live it will be able to create a full range of sounds including footsteps, waves and ambience. The company wrote on X: "We were blown away by the Sora announcement but felt it needed something... What if you could describe a sound and generate it with AI?"
ElevenLabs expanding to include sounds
ElevenLabs was founded in 2022 and is seen as producing the most realistic synthetic voices, generating speech that is close enough to natural to be almost undetectable.
The U.K.-based startup reached unicorn status, a valuation of a billion dollars, at the start of this year with its most recent $80 million Series B round. The funding announcement came with a new tool for syncing AI speech in video for automatic translations, taking on the international dubbing market.
There are already some text-to-sfx models on the market, often built around AI music models, including myEdit, AudioGen and Stable Audio from Stability AI. The sounds from ElevenLabs appear to be among the most natural, but it isn’t clear how much editing was involved.
It isn’t currently clear when text-to-sfx will launch, but ElevenLabs has opened a waitlist sign-up that asks for a “prompt you might use to create a sound”.
What does this mean for AI video?
The next stage will likely be tools that can analyze the content of a video and automatically add sound effects at exactly the right points. The same could apply to music. Most AI music tools are currently text-to-music, but in future, with multimodality, they could generate music from an image or video.
One of the dreams of generative AI has been the ability to create an entire, fully rounded piece of content from a single prompt.
At the moment that is barely a dream, let alone close to reality, but with advances like text-to-sfx, improved AI video and synthetic voice, it is getting closer.