Large language models can be squeezed onto your phone — rather than needing 1000s of servers to run — after breakthrough

Powerful artificial intelligence (AI) models like ChatGPT need copious amounts of power to run, so they are usually housed in vast data centers. But a new breakthrough could compress these AI models so they fit onto a smartphone or laptop.

A new algorithm, dubbed Calibration Aware Low precision Decomposition with Low Rank Adaptation (CALDERA), compresses the massive amounts of data needed to run a large language model (LLM) by trimming redundancies in the model's learnable parameters and reducing the precision of its layers of information.

This leaner LLM is only slightly less accurate and nuanced than the uncompressed version, scientists said in a study published May 24 to the preprint database arXiv, ahead of a presentation at the Conference on Neural Information Processing Systems (NeurIPS) in December.

"Any time you can reduce the computational complexity, storage and bandwidth requirements of using AI models, you can enable AI on devices and systems that otherwise couldn't handle such compute- and memory-intensive tasks," study co-author Andrea Goldsmith, professor of electrical and computer engineering at Princeton University, said in a statement.

Whenever someone uses ChatGPT (to take one popular example) on their phone or laptop, the request is sent to huge, remote servers, where the data is processed at great environmental and financial cost, the scientists said in the study. This is because AI models of this size consume large amounts of processing power as they tap into hundreds, if not thousands, of components such as graphics processing units (GPUs). To handle these requests on the single GPU of a small device, the AI model itself must therefore be compressed.
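To get a rough feel for the scale involved, the back-of-the-envelope calculation below estimates how much storage a model's parameters need at different precisions. The 7-billion-parameter figure is a hypothetical example chosen for illustration and does not come from the study, and the helper `model_size_gb` is likewise an invented name.

```python
def model_size_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for the parameters alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model size, assumed for illustration only (not from the study).
NUM_PARAMS = 7_000_000_000

for bits in (16, 8, 4, 2):
    size = model_size_gb(NUM_PARAMS, bits)
    print(f"{bits:>2} bits per parameter: about {size:.2f} GB")
```

Under these assumptions, halving the number of bits per parameter halves the storage needed, which is why lowering precision matters so much on a device with limited memory.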

To compress an LLM, CALDERA combines two techniques. The first, "low-precision," reduces the number of bits (1s and 0s of data) used to store information, which speeds up storage and processing while improving energy efficiency, the scientists said. The second, "low-rank," reduces redundancies in the learnable parameters that are set when an LLM is trained.
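The two ideas can be illustrated on a toy matrix. The sketch below is not the CALDERA algorithm itself: it ignores the calibration step the name refers to, uses a simple uniform rounding scheme, and works on random data with NumPy. The function names (`quantize_uniform`, `low_rank`, `rel_error`) and all sizes are assumptions made for this example. It stores a low-precision copy of the matrix and then adds a low-rank correction for what the rounding lost.

```python
import numpy as np


def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Low precision: snap every entry to one of 2**bits evenly spaced values."""
    steps = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / steps
    codes = np.round((w - lo) / scale)  # integers in 0..steps
    return codes * scale + lo


def low_rank(w: np.ndarray, rank: int) -> np.ndarray:
    """Low rank: keep only the `rank` strongest directions of the matrix (truncated SVD)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def rel_error(w: np.ndarray, approx: np.ndarray) -> float:
    """Relative approximation error in the Frobenius norm."""
    return float(np.linalg.norm(w - approx) / np.linalg.norm(w))


rng = np.random.default_rng(0)
# Toy "weight matrix": mostly redundant (rank 8) structure plus a little noise.
w = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 256))
w += 0.1 * rng.standard_normal((256, 256))

q_only = quantize_uniform(w, bits=2)            # low precision alone
lr_only = low_rank(w, rank=8)                   # low rank alone
combined = q_only + low_rank(w - q_only, rank=8)  # low-precision copy + low-rank correction

print("2-bit quantization only:", rel_error(w, q_only))
print("rank-8 approximation only:", rel_error(w, lr_only))
print("quantized + low-rank fix:", rel_error(w, combined))
```

In this sketch the combined version can never do worse than quantization alone, because the low-rank term is fitted to exactly the error that rounding introduced. How the two pieces are balanced in practice, and how the low-rank factors are themselves stored, is where the published method goes beyond this toy.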

"We proposed a generic algorithm for compressing large data sets or large matrices. And then we realized that nowadays, it's not just the data sets that are large, but the models being deployed are also getting large. So, we could also use our algorithm to compress these models," study co-author Rajarshi Saha, a doctoral student at Stanford University, said in the statement. "Using both of these properties together, we are able to get much more compression than either of these techniques can achieve individually."