Part 3/9:
Aiden, an OpenAI employee, emphasizes that benchmarks can mislead when taken at face value. He advocates looking beyond scores to what the AI can actually do: extensive tool use, deep research, debugging by googling documentation, and writing sophisticated Python scripts. During live tests, GPT-4 mini executed up to 600 tool calls in a row, demonstrating increasingly agentic behavior. This suggests the AI is not just performing isolated tasks but actively strategizing and managing complex workflows, traits that edge toward autonomy.