RE: LeoThread 2024-11-09 06:43

Competing models, like Anthropic's, scored even lower on OpenAI's SimpleQA benchmark, with its recently released Claude-3.5-sonnet model getting only 28.9 percent of questions right. However, the model was far more inclined to reveal its own uncertainty and decline to answer — which, given the damning results, is probably for the best.

Worse yet, OpenAI found that its own AI models tend to vastly overestimate their own abilities, a characteristic that can lead to them being highly confident in the falsehoods they concoct.