
LLMs Fail Middle School Word Problems, Say Apple Researchers

AI Mimics Reasoning Without Understanding, Struggles With Irrelevant Data

Cutting-edge large language models would fail eighth grade math, say artificial intelligence researchers at Apple - likely because AI is mimicking the process of reasoning rather than actually engaging in it.


Company researchers tested a handful of large language models' ability to handle that bane of word problem solvers everywhere: extraneous information meant to throw off the solution.

OpenAI o1-mini and Llama3-8B fell for the misdirection exactly as a perplexed test-taker would.

"Overall, we find that models tend to convert statements to operations without truly understanding their meaning," researchers wrote in a paper submitted earlier this month.

Among the tests designed to probe LLMs' ability to reason, researchers prompted LLMs with the following question: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?

The answer is 190 - and the LLMs responded with the same answer, although they are usually abysmal at solving arithmetic problems.
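The arithmetic behind that answer can be checked in a few lines. This is only a sketch of the calculation implied by the question as quoted; the variable names are illustrative and do not come from the researchers' paper.

```python
# Worked arithmetic for the kiwi word problem as quoted in the article.
# Variable names are illustrative, not taken from the paper.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

total = friday + saturday + sunday
print(sunday, total)  # prints: 88 190
```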

But when the researchers introduced additional information irrelevant to the solution, the LLMs could not answer correctly. In the modified question, the researchers asked: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

The answer should still be 190. But the researchers found that the extra data point confused a majority of the six models they tested. They did not name all the models that flunked.

OpenAI's Strawberry, whose selling point was its ability to think and reason, gave the following response: "On Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis."
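To see why that step is an error, compare the two calculations side by side. The sketch below is an illustration, not the model's actual output: the article quotes only the Sunday subtraction, so the total that results from carrying the step through is this sketch's own arithmetic, and all variable names are hypothetical.

```python
# Contrast the correct count with the subtraction step quoted above.
# Names are hypothetical; the final "as quoted" total is an inference,
# not a figure reported in the article.
friday, saturday = 44, 58
sunday = 2 * friday

# The size remark does not change how many kiwis were picked.
correct_total = friday + saturday + sunday

# The quoted step treats "smaller than average" as a reason to subtract.
sunday_as_quoted = sunday - 5
total_if_step_carried_through = friday + saturday + sunday_as_quoted

print(correct_total, sunday_as_quoted, total_if_step_carried_through)
# prints: 190 83 185
```

The irrelevant clause mentions a number, and the quoted step turns that number into an operation, which matches the researchers' description of models converting statements to operations.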

The researchers said the study demonstrated the "fragility" of AI in mathematical reasoning. Other tests showed that the more verbose a question was - i.e. as the number of AI tokens increased - AI mathematical reasoning weakened.

Models don't truly understand the problem, the researchers said. Machine learning can replicate patterns to formulate correct responses in some cases, but models falter when thinking or reasoning is involved.

"We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data," the researchers said.

Paper co-author Mehrdad Farajtabar said that LLMs were also sensitive to changes in proper names used in word problems, "even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?" he said in a social media post.

OpenAI researcher Boaz Barak contested the conclusions of the study, saying that many top LLMs are chat models that are not trained for mathematical reasoning or given the context to deal with it. "When a human sits down to solve a math exam, they know the context. They are not asked random math questions as they are riding the bus," he said.

He said "some prompt engineering" would potentially fix the problem, although he "didn't try it."


About the Author

Rashmi Ramesh


Assistant Editor, Global News Desk, ISMG

Ramesh has seven years of experience writing and editing stories on finance, enterprise and consumer technology, and diversity and inclusion. She has previously worked at formerly News Corp-owned TechCircle, business daily The Economic Times and The New Indian Express.



