Hello and welcome to Eye on AI! In this edition…Governor Newsom vetoes SB 1047; ByteDance plans new AI model based on Huawei chips; Microsoft announces AI models will improve Windows search; and the U.S. Commerce Department sets a new rule that eases restrictions on AI chip shipments to the Middle East.
Cerebras has a need for speed. In a bid to take on Nvidia, the AI chip startup is rapidly moving toward an IPO after filing for one yesterday. At the same time, the company is in a fierce race with fellow AI chip startups Groq and SambaNova for the title of ‘fastest generative AI.’ All three are pushing the boundaries of their highly specialized hardware and software to serve AI model responses at speeds they claim outpace even Nvidia GPUs.
Here’s what that means: When you ask an AI assistant a question, it must sift through all of the knowledge in its AI model to quickly come up with an answer. In industry parlance, that process is known as “inference.” But large language models don’t sift through words during inference. When you ask a question or give a chatbot a prompt, the AI breaks the text into smaller pieces called “tokens”—which can represent a word or a chunk of a word—processes them, and generates its response token by token.
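To make the idea concrete, here is a toy sketch of tokenization. Real LLM tokenizers (such as byte-pair encoding) learn sub-word pieces from data; this naive version, which just splits off a few common suffixes, is purely illustrative and is not how any production tokenizer works.

```python
def toy_tokenize(text):
    """Naively split words into sub-word 'tokens' by peeling common suffixes.

    Illustrative only: real tokenizers learn their vocabulary from data.
    """
    tokens = []
    for word in text.split():
        for suffix in ("ization", "ing", "ly"):
            if word.endswith(suffix) and len(word) > len(suffix):
                tokens.extend([word[: -len(suffix)], suffix])
                break
        else:
            tokens.append(word)
    return tokens

print(toy_tokenize("Tokenization enables quickly processing prompts"))
# → ['Token', 'ization', 'enables', 'quick', 'ly', 'process', 'ing', 'prompts']
```

The takeaway is that one word can become several tokens, which is why model speed is quoted in tokens per second rather than words per second.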
Pushing for faster and faster output
So what does “ultra-fast” inference mean? If you’ve tried chatbots like OpenAI’s ChatGPT, Anthropic’s Claude, or Google’s Gemini, you probably think the responses to your prompts arrive at a perfectly reasonable pace. In fact, you may be impressed by how quickly they spit out answers to your queries. But in February 2024, demos of a Groq chatbot based on a Mistral model produced answers far faster than people could read, and the demo went viral. The setup served up 500 tokens per second to produce answers that were nearly instantaneous. By April, Groq delivered an even speedier 800 tokens per second, and by May SambaNova boasted it had broken the 1,000 tokens per second barrier.
Today, Cerebras, SambaNova, and Groq are all delivering over 1,000 tokens per second, and the “token wars” have revved up considerably. At the end of August, Cerebras claimed it had launched the “world’s fastest AI inference” at 1,800 tokens per second, and last week Cerebras said it had beaten that record and become the “first hardware of any kind” to exceed 2,000 tokens per second on one of Meta’s Llama models.
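To put those numbers in perspective, here is a back-of-the-envelope calculation comparing token throughput with human reading speed. The conversion factor of roughly 0.75 words per token and the 250 words-per-minute reading rate are common rough assumptions, not figures from the companies themselves.

```python
# Rough assumptions (not vendor figures):
WORDS_PER_TOKEN = 0.75   # common English-text approximation
READING_WPM = 250        # a brisk human reading pace

def words_per_minute(tokens_per_second):
    """Convert a tokens-per-second rate into approximate words per minute."""
    return tokens_per_second * WORDS_PER_TOKEN * 60

for tps in (500, 800, 1000, 2000):
    wpm = words_per_minute(tps)
    print(f"{tps:>5} tok/s ≈ {wpm:>7,.0f} words/min "
          f"(~{wpm / READING_WPM:,.0f}x reading speed)")
```

Even the 500 tokens-per-second demo from early 2024 works out to roughly 90 times faster than a person can read; at 2,000 tokens per second the output is, for a human reader, effectively instantaneous.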
When will fast be fast enough?
This led me to ask: Why would anyone need generative AI output to be that fast? When will fast be fast enough?
According to Cerebras CEO Andrew Feldman, generative AI speed is essential since search results will increasingly be powered by generative AI, as well as new capabilities like streaming video. Those are two areas where latency, or the delay between an action and a response, is particularly annoying.
“Nobody’s going to build a business on an application that makes you sit around and wait,” he told Fortune.
In addition, AI models are quickly being used to power far more complex applications than just chat. One rapidly growing area of interest is developing application workflows based on AI agents, in which a user asks a question or prompts an action that doesn’t simply involve one query to one model. Instead it leads to multiple queries to multiple models that can go off and do things like search the web or a database.
“Then the performance really matters,” said Feldman, explaining that a reasonably slow output today could quickly become painfully slow.
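Feldman's point about agentic workflows can be sketched with some simple arithmetic. The pipeline stages and per-call latencies below are hypothetical; the point is that chained model calls run sequentially because each stage depends on the previous one's output, so per-call latency multiplies.

```python
# Hypothetical agent pipeline: each stage is one model or tool call,
# and later stages depend on earlier output, so the calls run in sequence.
PIPELINE = ["plan", "web_search", "db_lookup", "draft", "critique", "final"]

def total_latency(per_call_seconds):
    """Total user-facing wait when every stage takes the same time."""
    return per_call_seconds * len(PIPELINE)

print(f"at 2.0 s/call: {total_latency(2.0):.1f} s total wait")
print(f"at 0.2 s/call: {total_latency(0.2):.1f} s total wait")
```

A two-second response feels fine for a single chat turn, but six chained calls at that speed means a twelve-second wait; the same workflow on inference that is ten times faster stays comfortably interactive.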
Unlocking AI potential with speed
The bottom line is that speed matters because faster inference unlocks greater potential in applications built with AI, Mark Heaps, chief technology evangelist at Groq, told Fortune. That is especially true for data-heavy applications in fields like financial trading, traffic monitoring, and cybersecurity: “You need insights in real time, a form of instant intelligence that keeps up with the moment,” he said. “The race to increase speed…will provide better quality, accuracy, and potential for greater ROI.”
It’s worth noting, he pointed out, that AI models still have nowhere near as many neural connections as the human brain. “As the models get more advanced, bigger, or layered with lots of agents using smaller models, it will require more speed to keep the application useful,” he explained, adding that this has been an issue throughout history. “Why do we need cars to get beyond 50 mph? Was it so we could go fast? Or did producing an engine that could do 100 mph enable the ability to carry more weight at 50 mph?”
Rodrigo Liang, CEO and cofounder of SambaNova, agreed. Inference speed, he told Fortune, “is where the rubber hits the road—where all the training, the building of models, gets put to work to deliver real business value.” That’s particularly true now that the AI industry is shifting its focus from training AI models to putting them into production. “The world is looking for the most efficient way to produce tokens so you can support an ever-growing number of users,” he said. “Speed allows you to service many customers concurrently.”
Sharon Goldman
sharon.goldman@fortune.com