Part 6/15:
Training Data: Scope and Limitations
Another critical factor is training data. The widely circulated rumor is that GPT-4 has been trained on “a significant portion of the internet”, but the phrase is ambiguous: it says nothing about which sources were included or how they were selected.
A model trained on all internet content would absorb its spam, misinformation, and harmful content as well. Nothing filters these out naturally, so training data has to be filtered deliberately if the goal is factuality and safety.
Quality matters more than quantity: the goal is not just more data but better data, preferably curated, verified, and balanced.
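To make the idea of curation concrete, here is a minimal Python sketch of a toy filtering pass. It is an illustration only and not a description of how GPT-4's data was prepared: the spam markers, the minimum word count, and the function names are all hypothetical, and real pipelines rely on far more sophisticated classifiers and deduplication.

```python
import hashlib

# Hypothetical spam markers, chosen only for illustration.
SPAM_MARKERS = ("click here", "buy now", "free money")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def keep_document(text: str, min_words: int = 5) -> bool:
    """Toy quality check: long enough and free of the spam markers above."""
    norm = normalize(text)
    if len(norm.split()) < min_words:
        return False
    return not any(marker in norm for marker in SPAM_MARKERS)


def curate(documents: list[str]) -> list[str]:
    """Keep documents that pass the check, dropping copies that are
    identical after normalization."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or not keep_document(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


corpus = [
    "The mitochondria is the organelle that produces most of a cell's energy.",
    "The  mitochondria is the organelle that produces most of a cell's ENERGY.",
    "Click here for FREE MONEY, buy now before the offer ends today!",
    "Too short.",
]
curated = curate(corpus)
print(len(curated), "of", len(corpus), "documents kept")
```

Only the first document survives here: the second is a copy of it that differs in case and spacing, the third trips the spam markers, and the fourth is too short. Note that a filter like this says nothing about whether a document is factually correct; verification and balance need separate, harder checks.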