RE: LeoThread 2025-11-05 15-48

in LeoFinance, 21 days ago

Part 4/8:

Tip: The dataset is backed up in a Git repository, allowing easy access and version control.


Manual versus Automated Data Filtering

While individual samples can be edited by hand, Shapiro advocates an automated approach, using scripts to clean up the dataset at scale. Automation is faster and reduces human error; manual editing remains available, at a higher labor cost, for adjusting specific samples.
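As a rough illustration of scripted cleanup, here is a minimal sketch. The filtering rules (non-empty prompt, a minimum completion length, no exact duplicates), the function names, and the `prompt`/`completion` keys are all assumptions made for this example, not Shapiro's actual criteria or code.

```python
def is_clean(sample: dict, min_chars: int = 20) -> bool:
    """Hypothetical rule set: non-empty prompt and a completion of at least min_chars."""
    prompt = sample.get("prompt", "").strip()
    completion = sample.get("completion", "").strip()
    return bool(prompt) and len(completion) >= min_chars


def filter_samples(samples: list[dict], min_chars: int = 20) -> list[dict]:
    """Keep clean samples and drop exact duplicates, preserving the original order."""
    seen = set()
    kept = []
    for sample in samples:
        if not is_clean(sample, min_chars):
            continue
        key = (sample["prompt"].strip(), sample["completion"].strip())
        if key in seen:
            continue  # exact duplicate of a sample we already kept
        seen.add(key)
        kept.append(sample)
    return kept
```

Samples the script rejects could still be reviewed by hand, which keeps manual editing limited to the few cases that need it.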


Structuring Data for Fine-Tuning

The crux of preparing for training lies in formatting the data into a consistent prompt-completion schema. Shapiro walks through one of his existing scripts, find_prepare_finetuned_data.py, which he modifies to handle a dataset of prompts paired with their corresponding completions.
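For readers unfamiliar with the idea, a prompt-completion schema is often stored as one JSON object per line (JSONL). The sketch below shows that general pattern only; it is not the contents of find_prepare_finetuned_data.py, and the function name `to_jsonl` and the field names `prompt` and `completion` are assumptions for illustration.

```python
import json


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize (prompt, completion) pairs as JSON Lines, one record per line."""
    lines = []
    for prompt, completion in pairs:
        # Assumed schema: two string fields, with surrounding whitespace trimmed.
        record = {"prompt": prompt.strip(), "completion": completion.strip()}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    example = [("Summarize: the sky is blue.", "The sky is blue.")]
    print(to_jsonl(example), end="")
```

Keeping every record in the same two-field shape is what makes the dataset consistent: any downstream training tool can read it line by line without special cases.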

Data structure: