The Illusion of Thinking: What Apple's New Study Reveals About AI’s Limits in Reasoning


A recent study from Apple researchers challenges our assumptions about "thinking" AI models, uncovering startling flaws in how they actually reason through problems.

Over the last year, frontier AI models like Claude 3.7, DeepSeek-R1, and OpenAI's o-series have been marketed as better reasoners. These Large Reasoning Models (LRMs) go beyond simple text prediction, generating elaborate step-by-step thought processes known as “Chain-of-Thought” (CoT). The idea? If a model can "think" through a problem, it can solve harder ones more reliably.

But a new study from researchers at Apple, titled The Illusion of Thinking, suggests that these models may not be reasoning in any meaningful sense at all. Using controlled puzzle environments, the team probed how well these models handle increasing problem complexity.

The Setup: Controlled Puzzle Environments

Most existing evaluations of reasoning models rely on datasets like MATH or AIME, which are prone to data contamination and make it hard to isolate the effects of reasoning from memorization. The Apple team took a different route: they built custom puzzle environments (like Tower of Hanoi, River Crossing, Blocks World, and Checker Jumping) where complexity can be precisely controlled and reasoning traces can be rigorously analyzed.
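
To make this concrete, here is a minimal sketch (in Python, with illustrative names, not the paper's actual harness) of what such an environment can look like: a Tower of Hanoi instance whose single difficulty knob is the number of disks, plus a checker that replays a model's proposed move sequence step by step.

    from typing import List, Tuple

    class TowerOfHanoi:
        """Toy puzzle environment: n_disks is the only complexity knob
        (an optimal solution needs 2**n_disks - 1 moves), and every
        proposed move can be validated mechanically."""

        def __init__(self, n_disks: int):
            self.n_disks = n_disks
            # Peg 0 starts with all disks, largest at the bottom.
            self.pegs: List[List[int]] = [list(range(n_disks, 0, -1)), [], []]

        def apply(self, move: Tuple[int, int]) -> bool:
            """Apply a (source, target) move; return False if it is illegal."""
            src, dst = move
            if not self.pegs[src]:
                return False
            disk = self.pegs[src][-1]
            if self.pegs[dst] and self.pegs[dst][-1] < disk:
                return False  # cannot place a larger disk on a smaller one
            self.pegs[dst].append(self.pegs[src].pop())
            return True

        def solved(self) -> bool:
            return len(self.pegs[2]) == self.n_disks

    def score_trace(n_disks: int, moves: List[Tuple[int, int]]) -> bool:
        """Replay a model-proposed move sequence; True only if every move
        is legal and the puzzle ends up solved."""
        env = TowerOfHanoi(n_disks)
        return all(env.apply(m) for m in moves) and env.solved()

Because difficulty here is a single integer rather than a hand-curated dataset, contamination is off the table and the same checker works at every scale.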

By comparing “thinking” models (those that output detailed reasoning steps) to their non-thinking counterparts under equivalent inference compute, the researchers established a clearer picture of what these models can and can’t do.

The Findings: Three Regimes, One Collapse

The study found that reasoning models fall into three distinct performance regimes:

1. Low Complexity: Surprisingly, non-thinking models actually outperform thinking models. They’re more accurate and token-efficient on simple tasks.

2. Medium Complexity: Thinking models begin to shine, showing better performance due to their ability to explore solutions more deeply.

3. High Complexity: Both thinking and non-thinking models collapse entirely—performance drops to zero.

Most striking is what happens at the edge of this collapse. Instead of increasing their reasoning effort as problems get harder, thinking models spend fewer tokens thinking, even when plenty of their token budget remains. This points to a scaling failure rooted not in compute limits but in something intrinsic to how current reasoning architectures operate.
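
One way to see this pattern would be to count reasoning tokens per trace as problem size grows. The toy sketch below assumes the open-source tiktoken tokenizer and a hypothetical traces_by_size mapping from problem size to reasoning trace; neither is the paper's actual tooling.

    import tiktoken  # assumed tokenizer; any tokenizer would do

    enc = tiktoken.get_encoding("cl100k_base")

    def thinking_tokens(trace: str) -> int:
        """Token count of the reasoning trace for one puzzle instance."""
        return len(enc.encode(trace))

    def effort_curve(traces_by_size: dict) -> dict:
        """Map problem size -> tokens spent 'thinking'. The paper's finding
        is that past a certain size this curve bends downward, even though
        the models' token budgets are nowhere near exhausted."""
        return {size: thinking_tokens(trace)
                for size, trace in sorted(traces_by_size.items())}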

Inside the Mind of a Model: Overthinking, Then Giving Up

The researchers went beyond final answers and analyzed the reasoning traces—the step-by-step “thoughts” generated before an answer is given. They found that in simpler problems, models often hit the right solution early but continue thinking unnecessarily, a behavior known as “overthinking.”

In moderately complex problems, models tend to stumble through incorrect paths before landing on the correct answer later in the trace. But at high complexity levels, correct answers vanish entirely, and reasoning becomes erratic or incoherent.

Even more troubling, when the correct algorithm was explicitly given to the model (e.g., the steps to solve Tower of Hanoi), the models still failed at the same complexity thresholds. They didn’t just struggle to figure out the answer—they couldn’t even execute known solutions reliably.
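
For reference, the Tower of Hanoi procedure in question is the textbook recursion below (a Python illustration, not the exact prompt the researchers used); its steps are simple enough that executing them is pure bookkeeping, which is what makes the failure notable.

    def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C"):
        """Textbook recursive solution: park n-1 disks on the spare peg,
        move the largest disk, then rebuild the n-1 disks on top of it."""
        if n == 0:
            return []
        return (
            hanoi(n - 1, src, dst, aux)    # clear the way
            + [(src, dst)]                 # move the largest disk
            + hanoi(n - 1, aux, src, dst)  # rebuild on top of it
        )

    # Ten disks already require 2**10 - 1 = 1023 moves.
    assert len(hanoi(10)) == 1023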

Broader Implications: Beyond the Benchmark Hype

The Illusion of Thinking has resonated with AI researchers, particularly those skeptical of current reinforcement learning-based approaches to reasoning. The paper challenges the assumption that longer thought traces equal deeper understanding, showing instead that "thinking" often masks failure modes that go undetected by standard benchmarks.

It also adds fuel to the growing discourse around the limits of large language models, echoing recent critiques that scaling alone may not be enough to achieve robust reasoning or general intelligence.

Conclusion: Time to Rethink Reasoning

This study doesn’t prove that AI can’t reason—it shows that what we currently call reasoning may be little more than statistical flailing cloaked in verbose output. The findings point toward a need for new architectures, better interpretability tools, and more rigorous evaluations that go beyond final answer accuracy.

In the meantime, anyone building critical applications based on the supposed reasoning prowess of modern AI would be wise to remember: just because a model sounds like it’s thinking doesn’t mean it is.