Ai2 said it accomplished this feat by focusing on data quality over quantity.
Models like GPT-4o, which are fed billions of examples, are impressively capable. But they also ingest a ton of low-quality information, and all that noise consumes precious computing power.
To build its new multimodal models, Ai2 assembled a backbone from existing large language models and vision encoders. It then compiled a more focused, higher-quality dataset of around 700,000 images and 1.3 million captions to train new models with visual capabilities. That may sound like a lot, but it's on the order of 1,000 times less data than what's used in proprietary multimodal models.
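For readers curious what pairing a vision encoder with a language model backbone can look like in practice, here is a minimal, illustrative sketch in PyTorch. It is not Ai2's actual architecture or code; the class and parameter names are hypothetical, and it only shows the common pattern in open multimodal models of projecting image features into the language model's embedding space before training on image-caption pairs.

```python
import torch
import torch.nn as nn


class VisionLanguageModel(nn.Module):
    """Illustrative sketch: a pretrained vision encoder feeds image features,
    projected into the language model's embedding space, alongside text."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # existing pretrained encoder (often frozen)
        self.language_model = language_model   # existing pretrained LLM backbone
        # Small trainable "connector" mapping image features to text-embedding size
        self.projector = nn.Linear(vision_dim, text_dim)

    def forward(self, image, text_embeddings):
        # Encode the image and project its features into the text embedding space
        image_features = self.vision_encoder(image)      # (batch, n_patches, vision_dim)
        image_tokens = self.projector(image_features)    # (batch, n_patches, text_dim)
        # Prepend the image tokens to the caption embeddings and run the language model
        inputs = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.language_model(inputs)
```

Under this kind of setup, most of the heavy lifting comes from the pretrained components, so the curated image-caption dataset only has to teach the connection between the two, which is one reason a comparatively small, high-quality dataset can go a long way.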