Part 2/10:
In training large language models, the process typically begins with pre-training on vast amounts of data, followed by an alignment or fine-tuning stage. Fine-tuning can be done through Supervised Fine-Tuning (SFT), where the model learns from human-curated examples, or through Reinforcement Learning (RL), where feedback signals (rewards expressing praise or disapproval) guide the model toward preferred behaviors; a minimal sketch of both update styles follows.
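To make the contrast concrete, here is a minimal sketch (not the paper's method) of the two fine-tuning signals: SFT minimizes cross-entropy against a human-written target, while RL reweights the model's own samples by a scalar reward, shown here with a simple REINFORCE-style objective. All names (TinyLM, reward_fn, the vocabulary and dimension sizes) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    """A toy next-token model standing in for a pretrained LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # tokens: (batch, seq)
        return self.head(self.embed(tokens))   # logits: (batch, seq, vocab)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# --- SFT step: learn from a human-curated sequence -------------------------
tokens = torch.randint(0, VOCAB, (1, 16))      # stand-in for curated data
logits = model(tokens[:, :-1])                 # predict each next token
sft_loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))
opt.zero_grad()
sft_loss.backward()
opt.step()

# --- RL step: sample from the model, score it, reinforce -------------------
def reward_fn(completion):
    """Hypothetical scalar feedback: +1 'praise' or -1 'disapproval'."""
    return 1.0 if completion.float().mean() > VOCAB / 2 else -1.0

prompt = torch.randint(0, VOCAB, (1, 4))
sample, log_probs = prompt.clone(), []
for _ in range(8):                             # autoregressive sampling
    dist = torch.distributions.Categorical(logits=model(sample)[:, -1])
    tok = dist.sample()
    log_probs.append(dist.log_prob(tok))
    sample = torch.cat([sample, tok.unsqueeze(1)], dim=1)

reward = reward_fn(sample[:, prompt.size(1):])   # score only the completion
rl_loss = -reward * torch.stack(log_probs).sum() # REINFORCE objective
opt.zero_grad()
rl_loss.backward()
opt.step()
```

The key design difference: the SFT gradient pushes the model toward externally supplied targets, while the RL gradient amplifies or suppresses trajectories the model itself generated, depending only on a scalar score.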
Human data curation is a bottleneck for scaling this training pipeline: curated examples are slow and expensive to produce. The challenge is to develop approaches that rely less on human-generated data, which motivates the concepts presented in the paper.