Moshi Chat is a new native speech AI model from French startup Kyutai, promising a GPT-4o-like experience: it understands your tone of voice and can be interrupted.
Unlike GPT-4o, Moshi is a smaller model that can be installed locally and run offline. That could make it perfect for the future of smart home appliances, if Kyutai can improve its responsiveness.
I had several conversations with Moshi. Each lasts up to five minutes in the current online demo, and in every case it ended with Moshi repeating the same word over and over, losing coherence.
In one of the conversations it started to argue with me, flat out refusing to tell me a story and insisting instead on stating a fact, and it wouldn’t let up until I said “tell me a fact.”
These are likely issues of context window size and compute resources that should be solvable over time. OpenAI doesn’t need to worry about competition from Moshi yet, but just as Luma Labs, Runway and others are pressing against Sora’s quality, Moshi shows that others are catching up in voice too.
What is Moshi Chat?
Moshi Chat is the brainchild of the Kyutai research lab, built from scratch in six months by a team of eight researchers. The goal is to open up the model and build on it over time, but as it stands this is the first openly accessible native generative voice AI.
“This new type of technology makes it possible for the first time to communicate in a smooth, natural and expressive way with an AI,” the company said in a statement.
Its core functionality is similar to that of OpenAI’s GPT-4o, but it comes from a much smaller model. It is also available to use today, whereas GPT-4o’s advanced voice mode won’t be widely available until the fall.
The team suggests Moshi could be used in roleplay scenarios, or even as a coach to spur you on while you train. The plan is to work with the community and keep the model open so others can build on top of it and fine-tune it further.
Under the hood is Helium, a 7B-parameter multimodal model trained on text and audio codecs, while Moshi itself works speech-in, speech-out natively. It can run on an Nvidia GPU, Apple's Metal or a CPU.
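Kyutai hasn’t published its runtime code yet, so the exact loading API is unknown. As a minimal sketch, assuming a standard PyTorch-style setup, this is roughly how a locally installed model would pick between those three backends; the `load_moshi` helper and checkpoint name are hypothetical placeholders, not Kyutai’s actual API.

```python
import torch

def pick_device() -> torch.device:
    """Choose the best available backend: Nvidia CUDA, Apple Metal (MPS), or CPU."""
    if torch.cuda.is_available():           # Nvidia GPU via CUDA
        return torch.device("cuda")
    if torch.backends.mps.is_available():   # Apple Silicon via Metal
        return torch.device("mps")
    return torch.device("cpu")              # CPU fallback: slower, but universal

device = pick_device()
print(f"Running local inference on: {device}")

# Hypothetical loading call, for illustration only; Kyutai had not
# released its API at the time of writing.
# model = load_moshi("moshi-7b-checkpoint").to(device)
```

The ability to fall back to plain CPU inference is what makes the offline, on-appliance use case plausible, even if responsiveness would suffer compared with a dedicated GPU.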
What happens next with Moshi?
Kyutai hopes community support will help expand Moshi's knowledge base and improve its factuality. Both are limited because it is a lightweight base model, but the hope is that strengthening them, combined with native speech, will create a powerful assistant.
The next stage is to further refine the model and scale it up to allow for more complex, longer-form conversations with Moshi.
From using it and watching the demos, I’ve found it incredibly fast and responsive for the first minute or so, but the longer the conversation goes on, the more incoherent it becomes. Its lack of knowledge is also obvious, and if you call it out for making a mistake it gets flustered and goes into a loop of “I’m sorry, I’m sorry, I’m sorry.”
This isn’t a direct competitor for OpenAI’s GPT-4o advanced voice yet, not least because advanced voice isn’t currently available. But an open, locally running model with the potential to work in much the same way is a significant step forward for open source AI development.