Something weird is happening with LLMs and chess
This post compares the chess-playing performance of different AI models. Almost all large language models (LLMs) are terrible at chess, with one exception: gpt-3.5-turbo-instruct. The cause is unknown. Candidates include differences in the amount or quality of training data, the effects of instruction tuning, or something particular to certain transformer architectures. This may explain why people got good results playing chess with LLMs two years ago, and why the field has gone quiet since.