Another by epoch.ai
Scaling up AI training runs requires access to increasingly large datasets. So far, AI labs have relied on web text data to fuel training runs. Because the amount of web data generated each year grows more slowly than the amount of data used in training, web text alone will not be enough to support indefinite growth. In this section, we summarize our previous work on data scarcity and expand on it by estimating the further gains in scale that multimodal and synthetic data might enable.