RE: LeoThread 2025-11-04 23-07

Part 2/10:

Many assume that the training data (the vast corpus of internet text, books, articles, and code) is the main factor defining a chatbot's personality. However, the speaker clarifies that most of these models are trained on similar datasets: Common Crawl, licensed books, articles, and proprietary data. The variations lie mainly in curation, cleaning, and the inclusion of synthetic data.

Some models incorporate synthetic data, which is generated artificially for training purposes. Although synthetic data can introduce some divergence between models, it still traces back largely to the same foundational sources, such as web-crawled data, books, or licensed content. Training datasets are thus the common root from which these models grow.