China’s DeepSeek has driven down the cost of AI through architectural innovations such as mixture of experts (MoE) and fine-grained expert segmentation, which significantly improve the efficiency of large language models. The DeepSeek model activates only about 37 billion of its roughly 671 billion total parameters during inference, whereas dense models like Llama activate all of their parameters for every token. This results in dramatically reduced compute costs for both training and inference.
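To make the sparse-activation idea concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. It is illustrative only and not DeepSeek's actual implementation; the layer sizes, expert count, and `top_k` value are arbitrary assumptions chosen for readability. The key point is that the router sends each token to only a few expert MLPs, so most of the layer's parameters sit idle on any given forward pass.

```python
# Illustrative sparse MoE layer: only the top-k experts selected by the
# router run for each token, so most parameters are not used per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)          # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)             # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return out

tokens = torch.randn(8, 64)
layer = SparseMoELayer()
print(layer(tokens).shape)   # each token passed through only 2 of the 16 expert MLPs
```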
Others have used mixture of experts (MoE) before, but DeepSeek R1 aggressively scaled up the number of experts within the model.
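The following back-of-the-envelope sketch shows why fine-grained expert segmentation helps: splitting a few large experts into many smaller ones, while selecting more of them per token, keeps the total and activated parameter counts unchanged but gives the router vastly more expert combinations to specialize with. The numbers here are made up for illustration and are not DeepSeek's actual configuration.

```python
# Fine-grained expert segmentation: same total and active parameters,
# many more possible expert combinations per token.
from math import comb

def moe_config(num_experts, expert_params, top_k):
    return {
        "total_params": num_experts * expert_params,
        "active_params": top_k * expert_params,
        "expert_combinations": comb(num_experts, top_k),
    }

# Coarse-grained MoE: 16 large experts, 2 selected per token.
coarse = moe_config(num_experts=16, expert_params=4_000_000, top_k=2)

# Fine-grained MoE: each expert split into 4 smaller ones, 8 selected per token.
fine = moe_config(num_experts=64, expert_params=1_000_000, top_k=8)

print(coarse)  # {'total_params': 64000000, 'active_params': 8000000, 'expert_combinations': 120}
print(fine)    # {'total_params': 64000000, 'active_params': 8000000, 'expert_combinations': 4426165368}
```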