https://aeon.co/essays/generative-ai-has-access-to-a-small-slice-of-human-knowledge
English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide. Approximately 97% of the world's languages are classified as "low-resource" in computing.
This imbalance narrows the scope of accessible knowledge as AI-generated content increasingly fills the internet and becomes training data for subsequent models.