RE: LeoThread 2025-11-05 15-48

Part 3/8:

The next essential step is data augmentation, which here means not just producing data but refining it. Shapiro emphasizes that the raw generated outputs contain both good and bad samples, so the goal is to filter out the less useful ones before they reach model training.

He illustrates this with a few points:

The types of plots generated (e.g., a detailed story set in 1922 France, or in Egypt during the Renaissance).
How some samples come out too short or mismatched in length.
Why he deletes samples smaller than 1 KB, which often do not contain enough detail for meaningful fine-tuning.

This filtering reduces his dataset from 396 samples to approximately 202 plot outlines, keeping only the stories that are coherent and sufficiently detailed.
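
The size-based cut described above could be sketched roughly as follows. This is only an illustration, not Shapiro's actual script: the function name `filter_short_samples`, the `.txt` extension, and the reading of "1 KB" as 1024 bytes are all assumptions made for the example.

```python
from pathlib import Path

# Assumption: "1 KB" is interpreted as 1024 bytes on disk.
MIN_BYTES = 1024


def filter_short_samples(sample_dir, min_bytes=MIN_BYTES, delete=False):
    """Split the .txt files in sample_dir into (kept, dropped) by file size.

    Hypothetical helper: files smaller than min_bytes are treated as too
    short for fine-tuning. They are only deleted when delete=True.
    """
    kept, dropped = [], []
    for path in sorted(Path(sample_dir).glob("*.txt")):
        if path.stat().st_size < min_bytes:
            dropped.append(path)
            if delete:
                path.unlink()  # permanently removes the short sample
        else:
            kept.append(path)
    return kept, dropped
```

Calling `filter_short_samples("plots")` (with `plots` as a placeholder folder name) returns the two lists without touching any files, which makes it easy to check the counts before rerunning with `delete=True`. File size is only a rough proxy for quality, so a manual skim of the kept samples is still worthwhile.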