Part 5/8:
The financial implications are equally profound. The reinforcement learning phase of M1 was completed in just three weeks on 512 Nvidia H800 GPUs, at an approximate rental cost of $534,700. By contrast, training DeepSeek R1 reportedly cost $5–6 million, and estimates for training OpenAI's models exceed $100 million.
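As a rough sanity check on these figures, the short Python sketch below derives the GPU-hours and the implied hourly rental rate from the numbers quoted above. It assumes the 512 GPUs ran around the clock for exactly 21 days, which the source does not state, so the per-hour rate is only an approximation. The comparison figures are the quoted estimates, not measured values.

```python
# Figures quoted in the text above.
NUM_GPUS = 512
TRAINING_DAYS = 21            # "three weeks"; assumes exactly 21 days
RENTAL_COST_USD = 534_700

# Assumption (not in the source): every GPU is rented 24 hours a day.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
implied_rate = RENTAL_COST_USD / gpu_hours

# Quoted estimates for other models, used only for a ratio.
deepseek_low, deepseek_high = 5_000_000, 6_000_000
openai_estimate = 100_000_000


def times_more_expensive(other_cost: float) -> float:
    """How many times the M1 RL rental cost another budget represents."""
    return other_cost / RENTAL_COST_USD


if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours:,}")
    print(f"Implied rate: ${implied_rate:.2f} per GPU-hour")
    print(f"DeepSeek R1 estimate: {times_more_expensive(deepseek_low):.1f}x "
          f"to {times_more_expensive(deepseek_high):.1f}x")
    print(f"OpenAI estimate: over {times_more_expensive(openai_estimate):.0f}x")
```

Under these assumptions the rental works out to roughly two dollars per GPU-hour, and the quoted DeepSeek R1 budget is about an order of magnitude larger than the M1 reinforcement learning run.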
Revolutionary Reinforcement Learning Algorithm
The success of M1 can be attributed not only to its design but also to a novel reinforcement learning algorithm called CISPO (Clipped Importance Sampling Policy Optimization). Traditional methods clip the policy update itself to avoid instability, which cuts off the gradient for tokens that fall outside the clipping range. CISPO instead clips the importance sampling weights, so every token still contributes to the update, leading to richer outputs.
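To make the contrast concrete, here is a minimal NumPy sketch of the per-token gradient coefficient (the factor multiplying the gradient of each token's log-probability) under a PPO-style clipped objective and under a CISPO-style clipped importance weight. This is an illustration under stated assumptions, not the M1 training code: the function names, the epsilon values, and the toy numbers are all hypothetical, and details such as advantage estimation and loss normalization are left out.

```python
import numpy as np


def ppo_clip_grad_coeff(logp_new, logp_old, adv, eps=0.2):
    """Per-token gradient coefficient for the PPO-style clipped objective
    min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    When the clipped branch is the active one, the objective is constant
    in the policy, so the token contributes zero gradient.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped_out = ((adv > 0) & (ratio > 1 + eps)) | ((adv < 0) & (ratio < 1 - eps))
    return np.where(clipped_out, 0.0, ratio * adv)


def cispo_grad_coeff(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """Per-token gradient coefficient for a CISPO-style loss (sketch).

    The importance sampling weight is clipped and treated as a constant
    (stop-gradient), then multiplied by A * log pi. The weight is bounded
    but never zero, so every token keeps a gradient.
    eps_low / eps_high are illustrative values, not the paper's settings.
    """
    ratio = np.exp(logp_new - logp_old)
    weight = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return weight * adv


# Toy example: three tokens with importance ratios of about 3.0, 1.0 and 0.4.
logp_old = np.log(np.array([0.1, 0.5, 0.5]))
logp_new = np.log(np.array([0.3, 0.5, 0.2]))
adv = np.array([1.0, 1.0, -1.0])

ppo_coeff = ppo_clip_grad_coeff(logp_new, logp_old, adv)
cispo_coeff = cispo_grad_coeff(logp_new, logp_old, adv)

if __name__ == "__main__":
    print("PPO-style coefficients:  ", ppo_coeff)
    print("CISPO-style coefficients:", cispo_coeff)
```

In this toy case the first and third tokens have ratios outside the clipping range, so the PPO-style objective gives them a coefficient of zero, while the CISPO-style weight caps their influence but keeps them in the update.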