Part 7/11:
- A mid-training phase in which the ratio of text to image data is carefully rebalanced, starting at roughly 95% text and shifting toward 30% text to foster joint multimodal comprehension (a schedule of this kind is sketched below).
- A final fine-tuning stage focused on image-understanding tasks.
Throughout this process, they utilized around 100 billion multimodal tokens derived from open-source datasets and their synthetic vision data, underscoring the comprehensive effort to produce robust, versatile models.
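To make the mixing idea concrete, here is a minimal sketch, not Liquid AI's actual pipeline, of how the text share of each batch could be annealed from 95% down to 30% across mid-training. The linear schedule, function names, and step counts are assumptions made purely for illustration.

```python
import random

def text_fraction(step: int, total_steps: int,
                  start: float = 0.95, end: float = 0.30) -> float:
    """Share of text-only samples at a given mid-training step.

    The 95% -> 30% endpoints come from the article; the linear
    annealing shape is an assumption for illustration.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def sample_batch(step, total_steps, batch_size, text_pool, image_text_pool):
    """Draw a batch whose text/image mix follows the current schedule."""
    p_text = text_fraction(step, total_steps)
    return [
        random.choice(text_pool if random.random() < p_text else image_text_pool)
        for _ in range(batch_size)
    ]

# Inspect how the text share evolves across a hypothetical 10k-step phase:
# it anneals from 0.95 at step 0 down to 0.30 at the final step.
for step in (0, 5_000, 10_000):
    print(step, f"{text_fraction(step, 10_000):.2f}")
```

In practice the ratio could just as well follow a staged or cosine schedule; the key point is that text-heavy batches dominate early and multimodal batches take over later in the phase.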
Benchmark Results: Validating Performance Claims
LFM2-VL models demonstrate competitive performance across multiple benchmarks:
- Real-world question answering (RealWorldQA): the 1.6B model scored 65.23, comparable to top models such as InternVL3.