Part 6/11:
Arya has been tested across numerous benchmarks, where it consistently demonstrates competitive performance. It has outperformed other open-source models like Pixtral 12B and Llama 3.2 11B. More notably, against proprietary giants such as GPT-4 and Claude 3.5, Arya has held its ground:
In document question-answering tasks, Arya scored 92.6%, surpassing many larger models.
It scored 66.8% on long video understanding benchmarks and 72.1% across related video tasks.
Its long context window, which can hold 64,000 tokens at once, allows it to analyze lengthy documents or videos in a single pass, giving it an edge over models with shorter context windows.
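To make the 64,000-token figure concrete, here is a minimal sketch of how one might check whether a document is likely to fit in that window. The 4-characters-per-token ratio is a rough rule of thumb for English text, not Arya's actual tokenizer, and the function names and the reserved output budget are hypothetical choices for illustration.

```python
import math

CONTEXT_WINDOW_TOKENS = 64_000  # figure quoted in the text above
CHARS_PER_TOKEN = 4             # assumption: rough heuristic, not Arya's real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, reserved_for_output: int = 1_000) -> bool:
    """Return True if the estimated prompt leaves room for the model's reply.

    `reserved_for_output` is a hypothetical budget kept free for generated tokens.
    """
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    short_report = "word " * 2_000    # about 10,000 characters
    long_archive = "word " * 80_000   # about 400,000 characters
    print(fits_in_context(short_report))
    print(fits_in_context(long_archive))
```

For a real budget check, the estimate should come from the model's own tokenizer, since actual token counts vary with language and content, and images or video frames consume tokens as well.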