Part 4/9:
The analogy of "sucking supervision through a straw" captures the inefficiency of current RL methods. Instead of receiving rich, step-by-step feedback, models rely on sparse reward signals that collapse an entire trajectory's success or failure into a single scalar. Because that one number must credit or blame every step in the sequence, the resulting training signal is noisy and high-variance, which makes efficient learning difficult.
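To make the straw analogy concrete, here is a minimal sketch in plain Python of an outcome-only, REINFORCE-style update. This is an illustrative assumption, not the method or code of any specific system discussed in the source: the names `sample_answer`, `outcome_reward`, and `reinforce_loss` are hypothetical. The point it shows is that one scalar reward, known only at the end, is broadcast to every token, so helpful and unhelpful intermediate steps receive identical credit.

```python
# Minimal sketch of outcome-only ("through a straw") credit assignment.
# All names are illustrative; this is not any particular library's API.

import math
import random


def sample_answer(num_tokens: int = 5) -> list[float]:
    """Stand-in for sampling a reasoning trajectory: per-token log-probs."""
    return [math.log(random.uniform(0.1, 1.0)) for _ in range(num_tokens)]


def outcome_reward(correct: bool) -> float:
    """Sparse terminal reward: 1 if the final answer is right, else 0."""
    return 1.0 if correct else 0.0


def reinforce_loss(token_logprobs: list[float], reward: float,
                   baseline: float = 0.5) -> float:
    """REINFORCE-style surrogate loss.

    The same (reward - baseline) advantage is applied to every token:
    good and bad intermediate steps are credited identically, which is
    exactly what makes the gradient estimate noisy and high-variance.
    """
    advantage = reward - baseline
    return -advantage * sum(token_logprobs)


if __name__ == "__main__":
    random.seed(0)
    trajectory = sample_answer()
    # Success or failure is only observed once the whole attempt is finished.
    r = outcome_reward(correct=random.random() < 0.3)
    print("per-token log-probs:", [round(lp, 2) for lp in trajectory])
    print("single scalar reward:", r)
    print("loss (same credit for every step):", round(reinforce_loss(trajectory, r), 3))
```

Because the advantage is a single number shared by all steps, the only way to tell which steps mattered is to average over many sampled attempts, which is the high-variance inefficiency the analogy describes.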
Humans do not learn this way. When solving a problem, a person does not generate hundreds of attempts indiscriminately; they reflect on their previous moves, critically evaluate their approach, and adjust their strategy accordingly. This kind of review loop is absent from current large language models (LLMs), which rely primarily on pattern recognition and fine-tuning rather than genuine iterative reflection.