What you need to know
- OpenAI just released its new flagship GPT-4o model.
- It can reason across audio, vision, and text in real time, making interactions with ChatGPT more seamless.
- OpenAI also unveiled a native ChatGPT app for Mac, snubbing Windows.
- A viral demo showcases GPT-4o using its audio and visual capabilities to hold a conversation with another AI model.
OpenAI just unveiled its new flagship GPT-4o model (I know I'm not the only one getting confused by these models as they continue to ship). Essentially, GPT-4o is an improved version of OpenAI's GPT-4 and is just as smart. The model is more intuitive and can reason across audio, vision, and text in real time, making interactions with ChatGPT more seamless.
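If you want a feel for what "reasoning across vision and text" means in practice, GPT-4o is exposed through OpenAI's standard Chat Completions API. Below is a minimal sketch, assuming the official openai Python SDK, an OPENAI_API_KEY in your environment, and a hypothetical local photo named desk_photo.jpg; it only covers text-plus-image input, not the real-time voice mode shown on stage.

```python
# Minimal sketch: asking GPT-4o to reason over text plus an image.
# Assumes the official openai Python SDK and OPENAI_API_KEY in the environment;
# "desk_photo.jpg" is a hypothetical local image.
import base64
from openai import OpenAI

client = OpenAI()

# Encode the photo as a data URL so it can be sent inline with the prompt.
with open("desk_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see and guess what I'm doing."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```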
The "magic" behind OpenAI's just-concluded Spring Update event remains debatable, but the demos emerging on social media are pretty impressive, bordering on mind-blowing. Translating the Italian language to English and relaying the information in real-time is quite something, potentially keeping impediments to communication, like language barriers, at bay.
But what left me perplexed was a video demo shared by OpenAI President and Co-founder Greg Brockman on X (formerly Twitter). I never thought we'd one day reach a point where a virtual assistant could hold a full conversation with another AI assistant with minimal complications.
"Introducing GPT-4o, our new model which can reason across text, audio, and video in real time. It's extremely versatile, fun to play with, and is a step towards a much more natural form of human-computer interaction (and even human-computer-computer interaction)." Greg Brockman (@gdb) on X, May 13, 2024 (pic.twitter.com/VLG7TJ1JQx)
The demo starts with the user explaining to two AI chatbots that they'll essentially be talking to each other. The user walks the chatbots through his expectations, stating that one of the chatbots can see the world via a camera, while the other can ask it questions or even direct it to perform specific tasks with the user's assistance.
"Well, Well, Well, just when I thought things couldn't get any more interesting," jokingly replied the first chatbot. Talking to another AI that can see the world, this sounds like a plot twist in the AI universe." Just before the AI assistant could agree to the terms, the user asked it to pause for a bit while he gave instructions to the second AI.
Right off the bat, the user tells the second AI assistant that it will have access to see the world. I assume this is a subtle prompt asking the assistant to use the phone's camera as its eyes. Instantly, the interface switches to the front-facing camera, and the assistant gives a crystal-clear description of what the user is wearing and his surroundings.
From this point, the user explains that the first AI model will talk to it and ask questions, including requests to move the camera and describe what it sees. The second assistant is expected to be helpful and answer those questions accurately.
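On paper, this setup is straightforward to approximate: keep two separate conversation histories, feed each assistant's reply to the other as user input, and attach a camera frame to the messages sent to the "seeing" assistant. The sketch below, again assuming the openai Python SDK and a hypothetical placeholder frame, does this with plain text turns; OpenAI's actual demo runs on GPT-4o's real-time voice mode, which this simple relay doesn't reproduce.

```python
# Rough sketch of the two-assistant relay: assistant A interviews assistant B,
# and assistant B also receives a camera frame with each question.
# Assumes the official openai Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

history_a = [{"role": "system", "content": (
    "You are interviewing another AI that can see the world through a camera. "
    "Ask it one question at a time about what it sees.")}]
history_b = [{"role": "system", "content": (
    "You can see the world through the camera frame attached to each message. "
    "Answer the other AI's questions helpfully and accurately.")}]

def reply(history):
    # Send the running conversation to GPT-4o and record its answer.
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

# Hypothetical placeholder: in a real app you'd capture and base64-encode a
# fresh camera frame each turn (see the earlier snippet for the encoding step).
frame_url = "data:image/jpeg;base64,..."

next_input_for_a = "Begin by asking the other AI what it can see right now."
for _ in range(3):  # a few back-and-forth turns
    history_a.append({"role": "user", "content": next_input_for_a})
    question_for_b = reply(history_a)

    history_b.append({"role": "user", "content": [
        {"type": "text", "text": question_for_b},
        {"type": "image_url", "image_url": {"url": frame_url}},
    ]})
    next_input_for_a = reply(history_b)
```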
The process begins with the AI that can "see the world" explaining what's in its view, including the user, his outfit, and the design of the room. Interestingly, it almost feels like two humans conversing on FaceTime, as the first AI gives feedback based on the information shared. The seeing AI also appears to have a firm grasp of what the user is doing, his expression, and even his style based on what he's wearing.
What blew my mind was when the user signaled another person in the room to come closer and appear in the AI's view. The AI instantly picked up on this and even indicated that the user "might be getting ready for presentation or conversation" based on his direct engagement with the camera.
Interestingly, introducing a third party didn't affect the conversation between the two AIs. At first glance, you might assume the AI hadn't noticed the person entering the room and standing behind the user holding the phone.
However, that isn't the case. The user briefly paused the conversation between the two AIs to ask if anything unusual had occurred. The AI with visual capabilities pointed out that a second person had come into view, playfully made bunny ears behind the first person, and then quickly left the frame. The AI described the moment as light-hearted and unexpected.
The demo continues to showcase GPT-4o's vast capabilities. The user even requests that both models create a song based on the events that had just transpired and sing it by alternating lines. At one point, he comes across like a choirmaster prepping his choir for a big event at church.
I should also point out that most of the demos I've seen were run on Apple devices like the iPhone and MacBook. Perhaps this is why OpenAI released a native ChatGPT app for Mac users before shipping one for Windows. After all, OpenAI CEO Sam Altman has admitted that "the iPhone is the greatest piece of technology humanity has ever made."