Part 2/10:
For decades, statistical learning theory suggested that larger models may perform worse than their smaller counterparts. Under that paradigm, the rapid proliferation of LLMs over the past five years should have produced only bloated, ineffective models prone to overfitting, where a model learns the noise in the training data rather than the underlying patterns. The foundational technologies behind LLMs, such as autoregressive transformer neural networks, have existed since 2017, yet initially few researchers pursued training gigantic models for fear of overfitting.
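As a minimal sketch of the classical intuition (the sin target, noise level, and polynomial degrees below are illustrative assumptions, not from the text): fitting polynomials of increasing degree to a small noisy sample shows training error falling while test error eventually rises, the pattern that led theorists to distrust ever-larger models.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small noisy training set and a larger held-out test set drawn from the
# same simple underlying pattern: y = sin(3x) plus Gaussian noise.
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, x_test.size)

# Higher polynomial degree = more model capacity.
for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

In this toy setting the highest-degree fit drives training error toward zero while its test error climbs, which is exactly the kind of behavior the classical view generalized from, and which very large LLMs were expected to suffer.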