Google’s open source AI model Gemma is now available through the Groq chatbot. It joins Mixtral from the French AI lab Mistral and Meta’s Llama 2.
Gemma is a much smaller language model than the likes of Gemini or OpenAI’s ChatGPT, but it can be installed almost anywhere, including on your laptop. Nothing, however, runs it as fast as the Language Processing Unit (LPU) chips that power the Groq interface.
In a quick test I asked Gemma on Groq to act like an alien tour guide showing humans around its home planet, explaining some of the more exciting sights and attractions.
Responding at a staggering 679 tokens per second, Gemma laid out the entire scenario, well constructed and imaginative, faster than I could finish reading my own prompt.
What is Google Gemma?
There is a growing trend for smaller, open source AI models that aren’t as capable as their bigger siblings but still perform well and are small enough to run on a laptop or even a phone.
Gemma is Google’s answer to this trend. It is a large language model trained in a similar way to Gemini and is available in two billion and seven billion parameter versions.
As well as running on laptops, it can be run in the cloud on services such as Groq, or even integrated into commercial apps to bring LLM functionality to products.
Google says it will expand the Gemma family over time, so we may see larger, more capable versions. Being open source means other developers can build on top of the model, fine-tuning it on their own data or adapting it to work in different ways.
What is Groq and why is it so fast?
Groq is both a chatbot platform offering multiple open source AI models to choose from and a company making a new kind of chip designed specifically for running AI models quickly.
"We've been laser-focused on delivering unparalleled inference speed and low latency,” explained Mark Heap, Groq’s Chief Evangelist during a conversation with Tom’s Guide. “This is critical in a world where generative AI applications are becoming ubiquitous."
The chips were designed by Groq founder and CEO Jonathan Ross, who also led the development of Google's Tensor Processing Units (TPUs) used to train and run Gemini. They are built for rapid scalability and for the efficient flow of data through the chip.
How does Gemma on Groq compare?
To compare the speed of Gemma on Groq to running it on a laptop, I installed the AI model on my M2 MacBook Air and ran it through Ollama, an open source tool that makes it easy to run AI offline.
I gave it the same prompt: “Imagine you are an alien tour guide showing human visitors around your home planet for the first time. Describe some of the most fascinating and unusual sights, sounds, creatures and experiences you would share with them during the tour. Feel free to get creative and include vivid details about the alien world!”
After five minutes it had written four words. This is likely because my MacBook has only 8GB of RAM, though other models such as Zephyr from StabilityAI or Phi-2 from Microsoft run fine.
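If you want to try the same experiment yourself, here is a minimal sketch of running Gemma locally through Ollama's Python client. The `ollama` package, the `gemma:7b` model tag and the streaming call shown here are assumptions about a typical Ollama setup; the model needs to be pulled first and names may differ on your machine.

```python
# Minimal sketch: running Gemma locally via Ollama's Python client.
# Assumes `pip install ollama`, the Ollama server is running, and the model
# has been pulled beforehand (e.g. `ollama pull gemma:7b` - an assumed tag).
import ollama

PROMPT = (
    "Imagine you are an alien tour guide showing human visitors around your "
    "home planet for the first time. Describe some of the most fascinating "
    "and unusual sights, sounds, creatures and experiences you would share "
    "with them during the tour."
)

# Stream the reply so you can watch how quickly (or slowly) tokens arrive.
for chunk in ollama.chat(
    model="gemma:7b",  # assumed tag; "gemma:2b" is the smaller option
    messages=[{"role": "user", "content": PROMPT}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
```

On a machine with limited RAM, swapping in the two billion parameter version is the obvious first thing to try.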
Why does speed matter?
Even compared to other cloud deployments of Gemma, the Groq version is impressively fast. It beats ChatGPT, Claude 3 and Gemini on response time, and while on the surface that might seem unnecessary, imagine if that AI were given a voice.
It replies too fast for any human to read in real time, but if it were connected to an equally fast text-to-speech engine like ElevenLabs, also running on Groq chips, it could not only respond to you in real time but also rethink and adapt to interruptions, creating a natural conversation.
Developers can also access Gemma through Google Cloud's own Vertex AI, which allows the LLM to be integrated into apps and products through an API. The same is possible through Groq's API, or the model can be downloaded and integrated for offline use.
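As a rough illustration of the API route, here is a minimal sketch using Groq's OpenAI-style Python SDK. The `gemma-7b-it` model id and the `GROQ_API_KEY` environment variable are assumptions based on Groq's usual conventions and may differ from what your account exposes.

```python
# Minimal sketch: calling Gemma hosted on Groq from an app via the Groq SDK.
# Assumes `pip install groq` and an API key in the GROQ_API_KEY environment
# variable; the model id "gemma-7b-it" is an assumption and may be named
# differently in Groq's model list.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="gemma-7b-it",
    messages=[
        {
            "role": "user",
            "content": "Act as an alien tour guide and describe your home planet.",
        }
    ],
)

print(completion.choices[0].message.content)
```

Because the interface mirrors the OpenAI chat completions format, swapping Gemma in for another hosted model is usually just a change of the model id.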