Part 7/15:
Data scarcity remains a bottleneck. Only a fraction of human knowledge resides online; much of it sits in books, academic journals, proprietary datasets, and non-digitized media. This creates a gap: to reach true Artificial General Intelligence (AGI), models may need training data that goes beyond the public internet to include those offline and proprietary sources.
Why this matters:
Current scaling laws indicate that more data yields better model performance, but only if that data is relevant and high quality.
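To make the data side of this concrete, the sketch below evaluates a Chinchilla-style power law, in which loss falls as dataset size grows but with diminishing returns. The constants are illustrative (roughly the fit reported by Hoffmann et al., 2022), the `predicted_loss` helper is hypothetical, and the formula assumes data quality is held constant; it is not any lab's internal scaling model.

```python
# Illustrative Chinchilla-style loss curve: L(N, D) = E + A/N^alpha + B/D^beta,
# where N is parameter count and D is training tokens. Constants are roughly
# the published Chinchilla fit, used here only for intuition.
E, A, B = 1.69, 406.4, 410.7      # irreducible loss + scaling coefficients
alpha, beta = 0.34, 0.28          # exponents for the model and data terms

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Approximate pretraining loss for a model of size N trained on D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling the data keeps helping, but each doubling buys less:
for tokens in [1e11, 2e11, 4e11, 8e11]:
    print(f"{tokens:.0e} tokens -> loss ~ {predicted_loss(7e10, tokens):.3f}")
```

Each doubling of tokens shrinks only the data term, which is why exhausting the supply of relevant, high-quality text matters more than raw compute at some point.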
OpenAI's recent projects, such as Whisper (a speech-to-text model), point to a strategy of converting diverse media, including audio and video, into text data, aiming to expand the training pool and enable richer multimodal capabilities.
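As a minimal sketch of that "media to text" step, the open-source `openai-whisper` package can transcribe audio into training-ready text. This assumes the package is installed (`pip install openai-whisper`) and that `lecture.mp3` is a placeholder for a local audio file; it is not a description of OpenAI's internal pipeline.

```python
import whisper

# Load a small pretrained checkpoint; larger ones ("medium", "large") trade
# speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file; the result dict includes the full text
# plus timestamped segments.
result = model.transcribe("lecture.mp3")
print(result["text"])  # transcript that could feed a text training corpus
```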