LLMs Fail Middle School Word Problems, Say Apple Researchers
AI Mimics Reasoning Without Understanding, Struggles With Irrelevant Data

Cutting-edge large language models would fail eighth grade math, say artificial intelligence researchers at Apple - likely because AI mimics the process of reasoning rather than actually engaging in it.
Company researchers tested a handful of large language models on their ability to handle that bane of word-problem solvers everywhere: extraneous information meant to throw off the solution.
Models such as OpenAI's o1-mini and Meta's Llama3-8B fell for the misdirection exactly as a perplexed test-taker might.
"Overall, we find that models tend to convert statements to operations without truly understanding their meaning," researchers wrote in a paper submitted earlier this month.
Among the tests designed to probe LLMs' ability to reason, researchers prompted LLMs with the following question: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?
The answer is 190 - and the LLMs responded correctly, even though they are usually abysmal at solving arithmetic problems.
But when the researchers introduced additional information irrelevant to the solution, the LLMs could not answer correctly. In the modified question, the researchers asked: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
The answer should still be 190: the smaller kiwis are still kiwis. But the researchers found that the extra data point confused a majority of the six models they tested, though the paper does not name every model that flunked.
OpenAI's Strawberry, whose selling point is its ability to think and reason, gave the following response: "On Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis."
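For reference, here is the arithmetic behind both the correct answer and the flawed one, written out as a minimal Python sketch (the variable names are illustrative, not from the paper):

```python
# Minimal sketch of the kiwi arithmetic; variable names are illustrative.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# Correct reading: the five smaller kiwis are still kiwis.
print(friday + saturday + sunday)  # 190

# The failure mode the researchers describe: subtracting the
# irrelevant detail from Sunday's count, as Strawberry did.
print(friday + saturday + (sunday - 5))  # 185
```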
The researchers said the study demonstrated the "fragility" of AI mathematical reasoning. Other tests showed that the more verbose a question was - that is, as the number of tokens in the prompt grew - the weaker the models' mathematical reasoning became.
Models don't truly understand the problem, the researchers said. Machine learning can replicate patterns to formulate correct responses in some cases, but models falter when thinking or reasoning is involved.
"We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data," the researchers said.
Paper co-author Mehrdad Farajtabar said that LLMs were also sensitive to changes in proper names used in word problems, "even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?" he said in a social media post.
OpenAI researcher Boaz Barak contested the study's conclusions, saying that many top LLMs are chat models that are neither trained for nor given the context to handle standalone mathematical reasoning. "When a human sits down to solve a math exam, they know the context. They are not asked random math questions as they are riding the bus," he said.
He said "some prompt engineering" would potentially fix the problem, although he "didn't try it."
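One way such prompt engineering might look - purely as an assumption, since neither Barak nor the researchers published such a test - is a system prompt that warns the model about distractors before posing the question. The sketch below uses the OpenAI Python SDK; the model name is a placeholder:

```python
# Hypothetical illustration of the "prompt engineering" Barak suggests;
# untested, and the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system",
         "content": "You are solving a math word problem. Some details may "
                    "be irrelevant to the calculation; identify and ignore "
                    "them before computing the answer."},
        {"role": "user",
         "content": "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis "
                    "on Saturday. On Sunday, he picks double the number of "
                    "kiwis he did on Friday, but five of them were a bit "
                    "smaller than average. How many kiwis does Oliver have?"},
    ],
)
print(response.choices[0].message.content)
```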