AI researchers run AI chatbots at a lightbulb-esque 13 watts with no performance loss — stripping matrix multiplication from LLMs yields massive gains

A research paper from UC Santa Cruz and an accompanying writeup describe how AI researchers found a way to run modern, billion-parameter-scale LLMs on just 13 watts of power. That's about the same as a 100W-equivalent LED bulb, but more importantly, it's about 50 times more efficient than the 700W of power needed by data center GPUs like the Nvidia H100 and H200, never mind the upcoming Blackwell B200 that can use up to 1200W per GPU.
The work was done using custom FPGA hardware, but the researchers clarify that most of their efficiency gains can be applied through open-source software and tweaking of existing setups. Most of the gains come from the removal of matrix multiplication (MatMul) from the LLM training and inference processes.
How was MatMul removed from a neural network while maintaining the same performance and accuracy? The researchers combined two methods. First, they converted the numeric system to a "ternary" one using only -1, 0, and 1, which makes computation possible with summing rather than multiplying numbers. Second, they introduced time-based computation, giving the network an effective "memory" that lets it perform even faster while running fewer operations.
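To make the ternary idea concrete, here is a minimal sketch of the general principle, not the researchers' actual implementation (the function name and shapes are made up for illustration): when a weight matrix is restricted to -1, 0, and 1, a layer's output can be computed with additions and subtractions alone, with no multiplications.

```python
import numpy as np

def ternary_accumulate(x, W_ternary):
    """Apply a ternary weight matrix (entries in {-1, 0, +1}) to a vector x
    using only additions and subtractions -- no multiplications."""
    out = np.zeros(W_ternary.shape[0])
    for i in range(W_ternary.shape[0]):
        row = W_ternary[i]
        # Add inputs where the weight is +1, subtract where it is -1;
        # zero-valued weights are simply skipped.
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Tiny check: the result matches an ordinary matrix-vector product.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.integers(-1, 2, size=(4, 8))   # random ternary weights in {-1, 0, 1}
assert np.allclose(ternary_accumulate(x, W), W @ x)
```

The time-based "memory" component the researchers describe is a separate piece of the approach and isn't captured in this sketch, but the example shows why dropping to ternary weights makes dedicated multiplication hardware unnecessary.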
The mainstream model that the researchers used as a reference point is Meta's LLaMA LLM. The endeavor was inspired by a Microsoft paper on using ternary numbers in neural networks, though Microsoft did not go as far as removing matrix multiplication or open-sourcing its model, as the UC Santa Cruz researchers did.
It boils down to an optimization problem. Rui-Jie Zhu, one of the graduate students working on the paper, says, "We replaced the expensive operation with cheaper operations." Whether the approach can be universally applied to AI and LLM solutions remains to be seen, but if viable it has the potential to radically alter the AI landscape.
We've witnessed a seemingly insatiable desire for power from leading AI companies over the past year. This research suggests that much of it has been a race to be first while using inefficient processing methods. We've heard comments from reputable figures like Arm's CEO warning that, if AI power demands continue to increase at current rates, they would consume one fourth of the United States' power by 2030. Cutting power use to 1/50 of the current amount would represent a massive improvement.
Here's hoping Meta, OpenAI, Google, Nvidia, and all the other major players will find ways to leverage this open-source breakthrough. Faster and far more efficient processing of AI workloads would bring us closer to human brain levels of functionality — a brain gets by with approximately 0.3 kWh of power per day by some estimates, or 1/56 of what an Nvidia H100 requires. Of course, many LLMs require tens of thousands of such GPUs and months of training, so our gray matter isn't quite outdated just yet.
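For context on that last comparison, the 1/56 figure follows directly from the numbers quoted above, assuming an H100 draws its full 700W around the clock:

```python
# Rough check of the brain-vs-H100 energy comparison (assumed figures:
# an Nvidia H100 drawing its full 700W for 24 hours, vs. ~0.3 kWh/day for a brain).
h100_kwh_per_day = 700 / 1000 * 24           # = 16.8 kWh per day
brain_kwh_per_day = 0.3
print(h100_kwh_per_day / brain_kwh_per_day)  # ~56, i.e. the brain uses roughly 1/56 the energy
```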