Ofir Press, a postdoctoral researcher at Princeton University who helped develop SWE-bench, says that agentic AI tends to lack the ability to plan far ahead and often struggles to recover from errors. “In order to show them to be useful we must obtain strong performance on tough and realistic benchmarks,” he says, such as reliably planning a wide range of trips for a user and booking all the necessary tickets.