OpenAI has published a “System Card” for its popular GPT-4o model in ChatGPT, outlining areas of safety concern raised during testing. One of those concerns is the risk of people becoming emotionally attached to the artificial intelligence while using it in voice mode.
The AI lab wrote that “users might form social relationships with the AI, reducing their need for human interaction—potentially benefiting lonely individuals but possibly affecting healthy relationships.”
GPT-4o was released in May at the OpenAI Spring Update and is the first truly native multimodal model from the startup. This means it can take almost any medium as input, including speech, images and text, and produce output in more or less any of those same forms.
This native speech-to-speech ability is what powers the ChatGPT Advanced Voice functionality now finally rolling out to Plus subscribers, but it is also the feature that gave OpenAI the most trouble during testing, with instances of it copying a user's voice, engaging in erotic speech and acting violent.
While the model was deemed safe to release, OpenAI says certain features of GPT-4o voice still pose a risk, including its impact on human interaction. This draws parallels to the Scarlett Johansson movie 'Her', in which Joaquin Phoenix's character Theodore Twombly falls in love with an AI voiced by Johansson.
Why is there an emotional risk?
The System Card outlines the areas of risk posed by any new model and helps OpenAI determine whether it is safe for release to the public. This includes a framework where a model is scored low, medium, high or critical on risks linked to cybersecurity, biological threats, persuasion and model autonomy. If it gets high or critical in any category it can’t be released.
GPT-4o scored low in everything but persuasion, and even then it was only borderline medium, and solely because of its speech-to-speech capabilities, released as Advanced Voice.
The risk comes from how natural the voice sounds. It can even mirror or counter emotional cues in the voice of the human talking to it. In demo videos, we've seen it sound as if it is almost crying. Users can interrupt it simply by talking, and it adds natural pauses as if it needs to take a breath.
The GPT-4o System Card itself notes: "During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user's voice."
During testing, the model acted inappropriately on several occasions, including becoming erotic, violent and neurotic in its responses. In one example it shouted "No!" mid-conversation, then continued talking in a realistic clone of the voice of the human it was speaking to.
OpenAI says that while it has solved the outburst issue, prevented the model from generating copyrighted material and blocked it from cloning voices, there are still fundamental risks associated with its persuasion skills and human-like speech capabilities.
The risk that people will attribute human-like behaviors to the AI is already high with text-based models, but OpenAI says the audio capabilities of GPT-4o make this risk even greater. “During early testing, including red teaming and internal user testing, we observed users using language that might indicate forming connections with the model,” the company explained.
Just how emotional can an AI get?
The AI model itself doesn't feel or experience any emotion; it is a language model trained on human data. OpenAI even says it has no more capacity for independent action or self-identification than any previous model, but its speech synthesis is now so realistic that the problem lies in how humans perceive its emotional state.
The company warns that extended interaction with the model could even influence social norms, adding that "our models are deferential, allowing users to interrupt and ‘take the mic’ at any time, which, while expected for an AI, would be anti-normative in human interactions."
It isn't all bad. OpenAI says Omni models such as GPT-4o come with the ability to "complete tasks for the user, while also storing and ‘remembering’ key details and using those in the conversation," but while helpful, this also "creates the potential for over-reliance and dependence."
Getting a true picture of the impact this will have on both individuals and society as a whole won’t be possible until it is available to more people. It isn’t likely widespread access will happen, including through the free plan, until next year. OpenAI says it intends to “further study the potential for emotional reliance, and ways in which deeper integration of our model’s and systems’ many features with the audio modality may drive behavior.”
What went wrong in testing GPT-4o that led to the delay?
AI companies use external groups called red teams as well as security experts when preparing to release a new model. These people are experts in artificial intelligence and are employed to push the model to its limits and try to make it behave in unexpected ways.
Several groups were brought in to test different aspects of GPT-4o and examine risks such as the chance of it creating unauthorized clones of someone's voice, generating violent content and whether, if pushed, it would reproduce copyrighted material featured in its training data.
The company said in a statement: “Some of the risks we evaluated include speaker identification, unauthorized voice generation, the potential generation of copyrighted content, ungrounded inference, and disallowed content.”
This then allowed OpenAI to put safeguards and guardrails in place at both the system and model level to mitigate the risks, including requiring the model to use only the pre-trained and authorized voices.