RE: LeoThread 2025-11-09 20-32

in LeoFinance · 14 days ago

Part 7/11:

  • Engaging in a mid-training phase where the ratio of text to image data is carefully balanced, starting at 95% text and shifting toward 30% text to foster combined multimodal comprehension.

  • Final fine-tuning on image understanding tasks.

Throughout this process, they utilized around 100 billion multimodal tokens derived from open-source datasets and their synthetic vision data, underscoring the comprehensive effort to produce robust, versatile models.
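The ratio schedule described above can be sketched as a simple data-mixture function. This is a hypothetical illustration, not Liquid AI's actual implementation: the function name, the linear annealing curve, and the step counts are all assumptions.

```python
def text_ratio(step: int, total_steps: int,
               start: float = 0.95, end: float = 0.30) -> float:
    """Fraction of text (vs. image) samples to draw at a given mid-training step.

    Linearly anneals the text share from `start` (95%) down to `end` (30%),
    matching the ratio transition described in the article. The linear shape
    is an assumption; the real schedule could be stepwise or nonlinear.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac

# Sampling probabilities at the start, midpoint, and end of a 10,000-step phase:
for s in (0, 5_000, 10_000):
    print(f"step {s:>6}: text share ≈ {text_ratio(s, 10_000):.3f}")
```

A dataloader would then use this value as the probability of drawing the next batch from the text corpus rather than the image-text corpus.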


Benchmark Results: Validating Performance Claims

LFM2-VL models demonstrate competitive performance across multiple benchmarks:

  • Real-world question answering (QA): The 1.6B-parameter model scored 65.23, comparable with top models like InternVL3.