Chatbot Software Begins to Face Fundamental Limitations
Recent results show that large language models struggle with compositional tasks, suggesting a hard limit to their abilities.
On December 17, 1962, Life International published a logic puzzle consisting of 15 sentences describing five houses on a street. Each sentence was a clue, such as “The Englishman lives in the red house” or “Milk is drunk in the middle house.” Each house was a different color, its inhabitants were of different nationalities, they owned different pets, and so on. The story’s headline asked: “Who Owns the Zebra?” Problems like this one have proved to be a measure of the abilities, or rather the limitations, of today’s machine learning models. A solver must combine many clues at once, since each one only rules out some assignments of attributes to houses; a rough sense of the search involved is sketched below.
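Here is a minimal brute-force sketch in Python that checks just the two clues quoted above. The attribute values beyond “red,” “Englishman,” and “milk” are placeholders rather than the real puzzle’s, and the full puzzle has 15 clues over five attributes; this is only meant to show how each clue prunes the combinatorial space.

```python
from itertools import permutations

# Placeholder attribute values; only "red", "Englishman", and "milk"
# come from the two clues quoted in the article.
COLORS = ["red", "color2", "color3", "color4", "color5"]
NATIONS = ["Englishman", "nation2", "nation3", "nation4", "nation5"]
DRINKS = ["milk", "drink2", "drink3", "drink4", "drink5"]

count = 0
for colors in permutations(COLORS):
    for nations in permutations(NATIONS):
        # Clue: "The Englishman lives in the red house."
        if colors[nations.index("Englishman")] != "red":
            continue
        for drinks in permutations(DRINKS):
            # Clue: "Milk is drunk in the middle house" (index 2 of 0..4).
            if drinks[2] != "milk":
                continue
            count += 1

# Two clues already prune 120**3 = 1,728,000 raw assignments to 69,120;
# each further clue cuts the space again until one assignment remains.
print(count)
```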
Also known as Einstein’s puzzle or riddle (likely an apocryphal attribution), the problem tests a certain kind of multistep reasoning. Nouha Dziri, a research scientist at the Allen Institute for AI, and her colleagues recently set transformer-based large language models (LLMs), such as ChatGPT, to work on such tasks, and largely found them wanting. “They might not be able to reason beyond what they have seen during the training data for hard tasks,” Dziri said. “Or at least they do an approximation, and that approximation can be wrong.”