Yes, Apple researchers found that on simple, synthetic puzzles, non-reasoning LLMs sometimes outperform models with chain-of-thought.
But we ALREADY KNOW that reasoners aren't best at everything. This isn't a surprising finding. It's why the default model in ChatGPT isn't a reasoning model, for example.
And here’s what the paper doesn’t refute:
▣ On the hardest real-world reasoning benchmarks, like GPQA and FrontierMath, reasoning models dominate. These benchmarks are carefully scrubbed for data contamination and reward multi-step inference. Non-reasoners don’t win them. Ever.
▣ Even this paper shows that reasoning models outperform in the medium-complexity regime—the one that most closely resembles real-world planning, writing, and analysis tasks.
▣ The authors acknowledge that their synthetic puzzles represent a “narrow slice of reasoning” and can’t generalize to knowledge-intensive tasks.
So let’s not confuse “reasoning models sometimes struggle with contrived logic puzzles” with “reasoning doesn’t matter.”
Finally, while I won’t speculate on motives, it’s interesting that this comes from the one Big Tech player still far behind in LLM performance, with no reasoning model of its own to speak of.
This post was originally shared on LinkedIn.