Part 7/15:
Data scarcity remains a bottleneck. Only a fraction of human knowledge resides online; much of it sits in books, academic journals, proprietary datasets, and non-digitized media. This creates a gap: to reach true Artificial General Intelligence (AGI), models may need training data that goes beyond the public internet to include those offline and proprietary sources.
Why this matters:
Current scaling laws indicate that more data yields better model performance, but only if that data is relevant and high quality.
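To make the data side of this concrete, the sketch below evaluates a Chinchilla-style power law, in which loss falls as dataset size grows but with diminishing returns. The constants are illustrative (roughly the fit reported by Hoffmann et al., 2022), the `predicted_loss` helper is hypothetical, and the formula assumes data quality is held constant; it is not any lab's internal scaling model.

```python
# Illustrative Chinchilla-style loss curve: L(N, D) = E + A/N^alpha + B/D^beta,
# where N is parameter count and D is training tokens. Constants are roughly
# the published Chinchilla fit, used here only for intuition.
E, A, B = 1.69, 406.4, 410.7      # irreducible loss + scaling coefficients
alpha, beta = 0.34, 0.28          # exponents for the model and data terms

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Approximate pretraining loss for a model of size N trained on D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling the data keeps helping, but each doubling buys less:
for tokens in [1e11, 2e11, 4e11, 8e11]:
    print(f"{tokens:.0e} tokens -> loss ~ {predicted_loss(7e10, tokens):.3f}")
```

Each doubling of tokens shrinks only the data term, which is why exhausting the supply of relevant, high-quality text matters more than raw compute at some point.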
OpenAI's recent projects, such as Whisper (a speech-to-text model), point to a strategy of converting diverse media, including audio and video, into text data, aiming to expand the training pool and enable richer multimodal capabilities.
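As a minimal sketch of that "media to text" step, the open-source `openai-whisper` package can transcribe audio into training-ready text. This assumes the package is installed (`pip install openai-whisper`) and that `lecture.mp3` is a placeholder for a local audio file; it is not a description of OpenAI's internal pipeline.

```python
import whisper

# Load a small pretrained checkpoint; larger ones ("medium", "large") trade
# speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file; the result dict includes the full text
# plus timestamped segments.
result = model.transcribe("lecture.mp3")
print(result["text"])  # transcript that could feed a text training corpus
```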