Part 5/8:
Each data point has a prompt—a particular query like "Generate a plot outline for a tragic crime story set in the 1920s America."
Each completion is the detailed outline or plot generated by GPT-3.
These are stored in JSON Lines (.jsonl) format, where each line contains a dictionary with the prompt and completion.
He demonstrates how the dataset looks and confirms it contains matching pairs (one prompt per output) suitable for training.
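As a rough illustration of that format and the pairing check, here is a minimal Python sketch. The function name load_pairs and the sample completion text are hypothetical; the key names "prompt" and "completion" are assumed from the description above, and the actual file used in the video may differ.

```python
import io
import json


def load_pairs(lines):
    """Parse JSON Lines records and return a list of (prompt, completion) pairs.

    Raises ValueError if a record lacks a non-empty prompt or completion,
    so every prompt is guaranteed to have exactly one matching output.
    """
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        if not record.get("prompt") or not record.get("completion"):
            raise ValueError(f"line {line_no}: missing prompt or completion")
        pairs.append((record["prompt"], record["completion"]))
    return pairs


# Hypothetical one-record sample; the completion text is a placeholder.
sample = io.StringIO(
    json.dumps(
        {
            "prompt": "Generate a plot outline for a tragic crime story "
                      "set in the 1920s America.",
            "completion": "1. Opening\n2. Conflict\n3. Downfall",
        }
    )
    + "\n"
)

pairs = load_pairs(sample)
```

With a real dataset, the same function could be called as load_pairs(open(path, encoding="utf-8")), where path points to the .jsonl file.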
Transforming the Data for Training
Part of the process involves cleaning the prompt and completion texts:
Removing unnecessary labels or metadata.
Replacing newline characters for consistency.
Ensuring the completion is formatted as an outline, not just a summary, so that the model learns to generate detailed plots.
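The cleaning steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the video: the label names stripped by LABEL_PATTERN, the choice to normalize line endings to a single "\n", and the numbered-or-bulleted heuristic for what counts as an outline are all hypothetical.

```python
import re

# Assumed label prefixes to strip; the real dataset may use other labels.
LABEL_PATTERN = re.compile(r"^\s*(prompt|completion|outline)\s*:\s*", re.IGNORECASE)


def clean_text(text, newline_replacement="\n"):
    """Strip a leading label and make newline usage consistent."""
    text = LABEL_PATTERN.sub("", text)
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of newlines into a single replacement string.
    text = re.sub(r"\n+", newline_replacement, text)
    return text.strip()


def looks_like_outline(completion, min_items=3):
    """Heuristic: an outline has several numbered or bulleted lines."""
    items = re.findall(r"(?m)^\s*(?:\d+[.)]|[-*])\s+\S", completion)
    return len(items) >= min_items


raw = "Outline:\r\n1. A heist is planned\r\n\r\n2. It goes wrong\n3. The gang falls apart"
cleaned = clean_text(raw)
```

Passing newline_replacement=" " instead flattens the text onto one line; in that case the outline check should run before cleaning, since it relies on line starts.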