RE: LeoThread 2025-11-01 03-13

in LeoFinance · 9 days ago

Part 13/15:

Finally, Rodriques discusses benchmarking AI in science. Traditional metrics such as question-answering datasets and multiple-choice tests (e.g., Humanity's Last Exam) are insufficient: they do not capture the nuanced, speculative, and iterative nature of real scientific work.

Instead, performance should be measured by an AI's ability to generate hypotheses that lead to verified discoveries through wet-lab experiments, collaborations, and real-world feedback. Their ongoing development of LAB-Bench aims to evaluate core scientific skills such as literature comprehension, hypothesis formulation, and experimental design.