Google launched a new artificial intelligence product at its Google I/O event on Tuesday: Gemini Live. We all assumed that was what the Gemini Assistant in Android was supposed to do, but this is Google, and anything goes.
If it didn't come just one day after OpenAI's first consumer product event, I might wonder whether Gemini Live was launched to take on ChatGPT Voice. Both are built on natively multimodal AI models and have impressive voice and video capabilities.
In the global AI race, the current front-runners seem to be OpenAI and Google, with the former cozying up to Apple and the iPhone and the latter in control of Android. Forget AI devices like the Rabbit R1 or the Humane AI Pin: the short-term winner is the smartphone.
Both ChatGPT Voice and Gemini Live are being integrated into existing AI products, and neither is available today. But how else do these next-generation assistants compare?
How do Gemini Live and GPT-4o compare?
"This summer, we're expanding Gemini's multimodal capabilities — including the ability to have an in-depth two-way conversation using your voice. This new experience is called Live. #GoogleIO" pic.twitter.com/eAZbaO5WKz (May 14, 2024)
Google is a little on the back foot when it comes to credibility, especially around showing off live video analysis and voice capabilities. When it announced Gemini Ultra last year, it did so with a video of the model responding to real-time video, only it wasn't real-time or even video: the demo was assembled from still images and text prompts.
This time, however, Google made a point of making the technology, or at least the underlying "Project Astra" portion of it with its speech and video conversation, available to try out at I/O.
Both offer a conversational, natural-language voice interface; both promise live video analysis through a smartphone camera; and both seem fast enough for a truly natural conversation in which you can interrupt the AI mid-flow.
There are some notable differences, however. OpenAI's ChatGPT Voice sounds more natural, can detect and respond to emotion and vocal tone, and can even adapt in real time to how you ask it to speak. I didn't see evidence of that capability from Gemini Live.
The other big difference is multimodality. Gemini still relies on separate models for output, using Imagen 3 for images and Veo for video. GPT-4o is natively multimodal in both directions (the "o" stands for omni, as in all directions): it creates its own images and sound.
Gemini Live vs GPT-4o: The future of voice assistants
The world seems to be moving towards voice and away from text input. When I first watched the OpenAI announcement, my reaction was that this was a paradigm shift in human-computer interaction, one as big as the arrival of the mouse or the touchscreen.
I still hold that view, and the fact that Google is also launching a native, natural-sounding voice interface further cements it. Even Meta has Meta AI, a voice bot available in its VR headsets and the Ray-Ban smart glasses.
While the smartphone might be the winner for now, it's clear the real form factor for these voice AI models is smart glasses. With cameras at eye height and arms that carry sound into your ears, they are the perfect AI device.
The question is whether OpenAI moves into hardware and launches its own pair of smart glasses, or whether this is the new Siri, set to power a future Apple Glasses product. And whether Google is really brave enough to resurrect Google Glass.