Nvidia’s Llama-3.1-Minitron 4B is a small language model that punches above its weight
As tech companies race to deliver on-device AI, we are seeing a growing body of research and techniques for creating small language models (SLMs) that can run on resource-constrained devices.
The latest example comes from a research team at Nvidia, which has used recent advances in pruning and distillation to create Llama-3.1-Minitron 4B, a compressed version of the Llama 3.1 8B model. The model rivals the performance of both larger models and equally sized SLMs while being significantly more efficient to train and deploy.
The power of pruning and distillation
Pruning and distillation are two key techniques for creating smaller, more efficient language models. Pruning involves removing less important components of a model. “Depth pruning” removes complete layers while “width pruning” drops specific elements such as neurons and attention heads.
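To make the distinction concrete, here is a minimal Python sketch on a toy model. The function names, the per-layer importance scores, and the weight-norm heuristic used to rank neurons are illustrative assumptions, not the importance criteria Nvidia actually used.

```python
import math

def depth_prune(layers, layer_scores, keep):
    """Depth pruning: drop whole layers, keeping the `keep` highest-scoring ones.

    `layer_scores` is an assumed, precomputed importance score per layer.
    The surviving layers stay in their original order.
    """
    ranked = sorted(range(len(layers)), key=lambda i: layer_scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [layers[i] for i in kept]

def neuron_norm(row):
    """L2 norm of one neuron's incoming weights (an assumed importance proxy)."""
    return math.sqrt(sum(w * w for w in row))

def width_prune_mlp(w_in, w_out, keep):
    """Width pruning: drop hidden neurons from a two-layer MLP block.

    w_in  has one row per hidden neuron (hidden x d_in).
    w_out has one column per hidden neuron (d_out x hidden).
    Removing a neuron means removing its row in w_in AND its column in w_out,
    so the two matrices stay shape-compatible.
    """
    ranked = sorted(range(len(w_in)), key=lambda i: neuron_norm(w_in[i]), reverse=True)
    kept = sorted(ranked[:keep])
    new_in = [w_in[i] for i in kept]
    new_out = [[row[i] for i in kept] for row in w_out]
    return new_in, new_out

# Toy usage: four "layers" with made-up scores, and a 3-neuron MLP block.
pruned_layers = depth_prune(["L0", "L1", "L2", "L3"], [0.9, 0.1, 0.5, 0.7], keep=2)
small_in, small_out = width_prune_mlp(
    [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0]],  # neuron norms: 5.0, 0.1, 1.0
    [[1.0, 2.0, 3.0]],
    keep=2,
)
```

In this toy run, depth pruning shortens the stack of layers, while width pruning leaves the number of layers alone and makes one of them narrower.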
Model distillation is a technique that transfers knowledge and capabilities from a large model, often called the “teacher model,” to a smaller, simpler “student model.” There are two main ways to do distillation. The first is “SDG fine-tuning” (synthetic data generation), where the student model is trained on the inputs and responses of the teacher. The second is “classical knowledge distillation,” where in addition to the results, the student is trained on the inner activations of the teacher model.
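The difference between the two styles shows up in the loss function. The sketch below is a simplified illustration in plain Python: the first loss only sees the token the teacher produced, while the second compares the student's full output distribution with the teacher's. Using a KL divergence over output logits is an assumption made here for illustration; the article does not specify which teacher signals or loss Nvidia used, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, teacher_token):
    """Cross-entropy against the single token the teacher emitted.

    This mirrors training on the teacher's responses: the student never
    sees how confident the teacher was about the alternatives.
    """
    probs = softmax(student_logits)
    return -math.log(probs[teacher_token])

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the whole vocabulary.

    This mirrors the richer signal of classical distillation: the student
    is pushed to match the teacher's full distribution, not just its top pick.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy usage with a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student = [0.5, 0.5, 0.5]
loss_hard = hard_label_loss(student, teacher_token=0)
loss_soft = soft_target_loss(student, teacher)
```

The soft-target loss falls to zero only when the student reproduces the teacher's entire distribution, which is why it carries more information per training example than a single hard label.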
In a previous study, Nvidia researchers demonstrated the effectiveness of combining pruning with classical knowledge distillation. They started with the Nemotron 15B model and pruned it down to an 8-billion-parameter model. They then performed a light retraining procedure using model distillation, with the original model as the teacher and the pruned model as the student. Finally, they repeated the process with the 8B model as the starting point to create a smaller 4B model.
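The overall loop can be sketched in a few lines of Python. This is only a schematic: `prune` and `distill` are hypothetical callables supplied by the caller, the toy stand-ins below do no real training, and the choice to use each round's starting model as that round's teacher is an assumption based on the description above.

```python
def iterative_compress(model, target_sizes, prune, distill):
    """Prune to each target size in turn, retraining with distillation after each cut.

    Assumption: in every round, the model that entered the round acts as the
    teacher, and the freshly pruned model acts as the student.
    """
    checkpoints = []
    current = model
    for size in target_sizes:
        pruned = prune(current, size)                       # cut the model down
        student = distill(teacher=current, student=pruned)  # light retraining
        checkpoints.append(student)
        current = student                                   # next round starts here
    return checkpoints

# Toy stand-ins (hypothetical): models are dicts that only track their size.
def toy_prune(model, size):
    return {"params_b": size, "teacher_params_b": None}

def toy_distill(teacher, student):
    return {**student, "teacher_params_b": teacher["params_b"]}

result = iterative_compress({"params_b": 15}, [8, 4], toy_prune, toy_distill)
```

With the toy stand-ins, the 15B model yields an 8B checkpoint, which in turn becomes the starting point for the 4B checkpoint.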
This approach resulted in a 16% improvement on the popular MMLU benchmark compared to training a 4-billion-parameter model from scratch. Impressively, the entire process required 40x fewer training tokens than training the model from scratch. The model’s performance was comparable to Mistral 7B, Gemma 7B, and Llama-3 8B, which were trained on trillions of tokens.