
Caution is advised with public benchmarks since they can be gamed.

It comes down to the discipline and self-restraint of the team (which faces strong incentives to do otherwise) not to overfit test sets through elaborate gymnastics over test-set-adjacent data in the document embedding space.

With so many teams doing this, the pressure to overfit is high.
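To make "test-set-adjacent data in the document embedding space" concrete, here is a minimal sketch of how one might flag training documents that sit suspiciously close to benchmark items. It assumes the documents have already been embedded as plain lists of floats; the function name `flag_test_adjacent` and the 0.95 threshold are hypothetical choices for illustration, not anything a particular lab is known to use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def flag_test_adjacent(train_vecs, test_vecs, threshold=0.95):
    """Return (train_index, best_similarity) for every training vector whose
    nearest test vector is at least `threshold` similar.

    The threshold is an assumed, illustrative value.
    """
    flagged = []
    for i, train_vec in enumerate(train_vecs):
        best = max((cosine(train_vec, t) for t in test_vecs), default=0.0)
        if best >= threshold:
            flagged.append((i, best))
    return flagged
```

The same nearest-neighbor lookup cuts both ways: run as an audit it surfaces likely contamination, and run as a data-selection step it is exactly the kind of gymnastics described above.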

It is worthwhile to interact with the model directly and compare it against other LLMs (ride the LLM cycle: rotate to a different model each day).

Early impressions were positive across personality, writing, coding vibe, and humor: very solid daily-driver potential, and it looks like a tier 1 LLM.

In the coming days and weeks, attention will turn to the ensemble of private evaluations, which many organizations now build for themselves and occasionally report.