Nvidia and Mistral AI have released a new small language model that purportedly delivers "state-of-the-art" accuracy in a tiny footprint. The new model, Mistral-NeMo-Minitron 8B, is a miniaturized version of Mistral NeMo 12B that has been pruned from 12 billion down to 8 billion parameters.
The new 8-billion-parameter small language model was shrunk down through two different AI optimization methods, said Bryan Catanzaro, VP of deep learning research at Nvidia, in a blog post. The team behind the model used a process that combines pruning and distillation. "Pruning downsizes a neural network by removing model weights that contribute the least to accuracy. During distillation, the team retrained this pruned model on a small dataset to significantly boost accuracy, which had decreased through the pruning process."
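To make the two-step process concrete, here is a minimal sketch in PyTorch of what pruning followed by distillation can look like. It is an illustration under simplifying assumptions, not Nvidia's actual pipeline: pruning a 12B-parameter model in practice works at a coarser granularity (whole layers, attention heads, or embedding channels) than the per-weight magnitude pruning shown here.

```python
# Minimal sketch of pruning + distillation in PyTorch.
# All names and sizes are illustrative, not Nvidia's actual pipeline.
import torch
import torch.nn.functional as F
from torch.nn.utils import prune

# --- Step 1: Pruning ---
# Remove the weights that contribute least to accuracy. As a stand-in,
# we apply L1 magnitude pruning to a single linear layer.
layer = torch.nn.Linear(4096, 4096)
prune.l1_unstructured(layer, name="weight", amount=0.3)  # zero out 30% smallest weights
prune.remove(layer, "weight")  # make the pruning permanent

# --- Step 2: Distillation ---
# Retrain the pruned "student" model to match the output distribution
# of the original, unpruned "teacher" model on a small dataset.
def distillation_step(student, teacher, batch, optimizer, T=2.0):
    """One training step: align student logits with softened teacher logits."""
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    # KL divergence between temperature-softened distributions,
    # scaled by T^2 as in standard knowledge distillation.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher's soft output distribution carries far more signal per example than hard labels, the student can recover most of the lost accuracy from a comparatively small retraining dataset, which is where the compute savings described below come from.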
These optimizations enabled the developers to retrain the slimmed-down model on a "fraction of the original dataset," resulting in up to 40x savings in raw compute costs. Normally, AI models have to trade off size against accuracy, but with Nvidia and Mistral AI's pruning and distillation techniques, language models can have the best of both worlds.
Armed with these enhancements, Mistral-NeMo-Minitron 8B purportedly leads nine language-driven AI benchmarks among models of a similar size. The model is also small enough for laptops and workstation PCs to run locally, making it faster and more secure to operate than cloud services.
Nvidia has designed Minitron 8B around consumer-grade computer hardware. The model is packaged as an Nvidia NIM microservice and is optimized for low latency, which improves response times. Nvidia also offers its custom model service, AI Foundry, to adapt Minitron 8B to even less powerful systems, such as smartphones. Accuracy and performance won't be as good, but Nvidia claims the result would still be a high-accuracy model, requiring a fraction of the training data and compute infrastructure it would otherwise need.
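For developers who want to try the model before running it locally, NIM microservices expose an OpenAI-compatible API. The sketch below shows what a query might look like; the endpoint URL and model identifier are assumptions based on Nvidia's hosted API catalog and should be verified against build.nvidia.com.

```python
# Hedged sketch of querying the model through an OpenAI-compatible NIM endpoint.
# The base URL and model ID are assumptions; check build.nvidia.com for current values.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted NIM endpoint
    api_key="YOUR_NVIDIA_API_KEY",                   # placeholder credential
)

response = client.chat.completions.create(
    model="nvidia/mistral-nemo-minitron-8b-8k-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain pruning and distillation in one sentence."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```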
Pruning and distillation appear to be the next frontier for artificial intelligence performance optimization. Theoretically, there's nothing preventing developers from applying these optimization techniques to all current language models, which would significantly boost performance across the board, including the large language models that currently require AI-accelerated server farms to run.