Part 7/11:
- A mid-training phase in which the ratio of text to image data is carefully rebalanced, starting at roughly 95% text and shifting toward 30% text to foster joint multimodal comprehension (a schedule of this kind is sketched below).
- A final fine-tuning stage focused on image-understanding tasks.
Throughout this process, they utilized around 100 billion multimodal tokens derived from open-source datasets and their synthetic vision data, underscoring the comprehensive effort to produce robust, versatile models.
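To make the mixing idea concrete, here is a minimal sketch, not Liquid AI's actual pipeline, of how the text share of each batch could be annealed from 95% down to 30% across mid-training. The linear schedule, function names, and step counts are assumptions made purely for illustration.

```python
import random

def text_fraction(step: int, total_steps: int,
                  start: float = 0.95, end: float = 0.30) -> float:
    """Share of text-only samples at a given mid-training step.

    The 95% -> 30% endpoints come from the article; the linear
    annealing shape is an assumption for illustration.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def sample_batch(step, total_steps, batch_size, text_pool, image_text_pool):
    """Draw a batch whose text/image mix follows the current schedule."""
    p_text = text_fraction(step, total_steps)
    return [
        random.choice(text_pool if random.random() < p_text else image_text_pool)
        for _ in range(batch_size)
    ]

# Inspect how the text share evolves across a hypothetical 10k-step phase:
# it anneals from 0.95 at step 0 down to 0.30 at the final step.
for step in (0, 5_000, 10_000):
    print(step, f"{text_fraction(step, 10_000):.2f}")
```

In practice the ratio could just as well follow a staged or cosine schedule; the key point is that text-heavy batches dominate early and multimodal batches take over later in the phase.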
Benchmark Results: Validating Performance Claims
LFM2-VL models demonstrate competitive performance across multiple benchmarks:
- Real-world question answering (RealWorldQA): the 1.6B model scored 65.23, comparable to top models such as InternVL3.