Part 6/16:
A common issue with reasoning models is their tendency to ramble or generate endless text, especially on lengthy tasks. Hermes 4 addresses this with a specialized fine-tuning stage aimed solely at teaching the model when to stop. The team generated extensive reasoning traces, inserted precise stopping points, and retrained the model to recognize the right moment to conclude its output. The results were notable: runaway generations decreased by up to 80%, with only marginal accuracy drops (around 5-12%) across benchmarks. A rough sketch of how such training examples might be built follows.
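To make the idea concrete, here is a minimal Python sketch of how "stop" training examples could be constructed under one plausible reading of the description: an overlong reasoning trace is truncated at a token budget, a termination token is appended, and the training loss is masked everywhere except that inserted token so the model only learns the stopping decision. The budget, the token IDs, and the loss-masking rule are illustrative assumptions, not the published implementation.

# Minimal sketch of length-control data construction (assumptions noted above).
from typing import List, Tuple

MAX_THINK_TOKENS = 2048   # hypothetical per-example reasoning budget
END_THINK_ID = 50257      # hypothetical ID of the "stop reasoning" token


def build_stop_example(trace_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Truncate an overlong trace at the budget, append the stop token,
    and return (input_ids, labels).

    Labels are -100 (ignored by typical cross-entropy loss) everywhere
    except the inserted stop token, so fine-tuning teaches *when* to
    stop rather than new reasoning content.
    """
    truncated = trace_ids[:MAX_THINK_TOKENS]
    input_ids = truncated + [END_THINK_ID]
    labels = [-100] * len(truncated) + [END_THINK_ID]
    return input_ids, labels


if __name__ == "__main__":
    fake_trace = list(range(5000))       # stand-in for a runaway trace
    ids, labels = build_stop_example(fake_trace)
    print(len(ids), labels[-1])          # 2049, 50257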
Exemplary Benchmarks and Ethical Alignment
Hermes 4's performance across standard benchmarks underscores its capabilities:
MATH-500: 96.3%
AIME 2024: 81.9%
AIME 2025: 82.1%
GPQA Diamond: 70.5%
LiveCodeBench: 61.3%