LiveScience

Owen Hughes

AI speech generator 'reaches human parity' — but it's too dangerous to release, scientists say

Microsoft has developed a new artificial intelligence (AI) speech generator that is apparently so convincing it cannot be released to the public.

VALL-E 2 is a text-to-speech (TTS) generator that can reproduce the voice of a human speaker using just a few seconds of audio.

Microsoft researchers said VALL-E 2 was capable of generating "accurate, natural speech in the exact voice of the original speaker, comparable to human performance," in a paper that appeared June 17 on the pre-print server arXiv. In other words, the new AI voice generator is convincing enough to be mistaken for a real person — at least, according to its creators.

"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time," the researchers wrote in the paper. "Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases."

Human parity in this context means that speech generated by VALL-E 2 matched or exceeded the quality of human speech in benchmarks used by Microsoft.

The model achieves this through two key features: "Repetition Aware Sampling" and "Grouped Code Modeling."

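Repetition Aware Sampling improves the way the AI converts text into speech by tracking how often the same "tokens" recur during the decoding process. Tokens are the small discrete units the model generates one at a time; in a neural codec language model like VALL-E 2, they encode short slices of audio rather than words. When a token starts to repeat too often, the model changes how it picks the next one, which prevents infinite loops of sounds or phrases. In other words, this feature helps vary VALL-E 2's pattern of speech, making it sound more fluid and natural.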
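To make the idea concrete, here is a minimal sketch of one decoding step, not Microsoft's implementation. It assumes the model usually picks tokens by nucleus (top-p) sampling and falls back to sampling from the full distribution when the chosen token already dominates a recent window of output. The function names and the `top_p`, `window` and `threshold` values are hypothetical and chosen only for illustration.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token id from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, threshold=0.5):
    """Pick the next token, switching strategy if it repeats too often.

    probs:   probability of each token id at this step
    history: token ids already generated
    All default values are illustrative assumptions, not taken from the paper.
    """
    candidate = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # Share of the recent window already occupied by the candidate token.
        ratio = recent.count(candidate) / len(recent)
        if ratio > threshold:
            # Fall back to sampling from the full distribution, which gives
            # less likely tokens a chance to break the loop.
            candidate = rng.choices(list(range(len(probs))),
                                    weights=probs, k=1)[0]
    return candidate
```

With a distribution that heavily favors one token, the nucleus step alone would keep choosing it. Once that token fills the recent window, the fallback lets other tokens through.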

Grouped Code Modeling, meanwhile, improves efficiency by reducing the sequence length — or the number of individual tokens that the model processes in a single input sequence. This speeds up how quickly VALL-E 2 generates speech and helps manage difficulties that come with processing long strings of sounds.
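The effect on sequence length can be illustrated with a toy sketch, assuming the model's output is a flat list of integer codes and that consecutive codes are simply bundled into fixed-size groups. The helper names and the `-1` padding value are hypothetical, and the real model handles each group inside the network rather than with list operations like these.

```python
def group_codes(codes, group_size, pad=-1):
    """Bundle consecutive codes into tuples of length group_size.

    The last group is padded with `pad` (assumed never to be a real code)
    when the sequence length is not a multiple of group_size.
    """
    padding = (group_size - len(codes) % group_size) % group_size
    padded = list(codes) + [pad] * padding
    return [tuple(padded[i:i + group_size])
            for i in range(0, len(padded), group_size)]


def ungroup_codes(groups, pad=-1):
    """Flatten groups back into a single code sequence, dropping padding."""
    return [code for group in groups for code in group if code != pad]


codes = list(range(10))          # ten stand-in codes
groups = group_codes(codes, 2)   # the model would now take five steps, not ten
```

Ten codes with a group size of 2 become five groups, so a model stepping through groups does half as many steps for the same stretch of audio.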

The researchers used audio samples from speech libraries LibriSpeech and VCTK to assess how well VALL-E 2 matched recordings of human speakers. They also used a set of difficult sentences drawn from ELLA-V, an earlier speech-synthesis model, to determine how effectively VALL-E 2 handled more complex speech generation tasks.

"Our experiments, conducted on the LibriSpeech and VCTK datasets, have shown that VALL-E 2 surpasses previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity," the researchers wrote. "It is the first of its kind to reach human parity on these benchmarks."

The researchers pointed out in the paper that the quality of VALL-E 2’s output depended on the length and quality of speech prompts — as well as environmental factors like background noise.

"Purely a research project"

Despite its capabilities, Microsoft will not release VALL-E 2 to the public because of the risk of misuse. The decision comes amid growing concern about voice cloning and deepfake technology. Other AI companies, such as OpenAI, have placed similar restrictions on their voice tech.