AGI Might Be Farther Away Than Promised | Apple’s Newest Paper Questions the Hype
Are we missing the truth behind the AGI hype bubble?
Artificial General Intelligence (AGI) has been a hot buzzword in the tech world. In the past year, companies like OpenAI, Anthropic, and Google DeepMind have boasted that their latest large language models are stepping stones to human-level reasoning. But a recent white paper from an unlikely source – Apple – pours cold water on some of these claims. Titled “The Illusion of Thinking”, Apple’s research suggests that today’s AI isn’t truly “thinking” after all, at least not in the robust way the hype implies. The paper’s findings indicate that when faced with truly complex problems, even the most advanced AI models fall apart in surprising ways. This has sparked a lively debate in the AI community and raised questions about whether the narrative around imminent AGI holds up under technical scrutiny.
Quick Dive Infographic
AGI: Hype vs. Reality
The Great AI Reasoning Debate of 2025
The Spark: Apple’s “Illusion of Thinking”
The Test
Advanced AI models (LRMs) were given logic puzzles like Tower of Hanoi, River Crossing, & Blocks World with increasing difficulty.
The “Collapse”
At high complexity, accuracy plummeted to 0%. Models couldn’t reliably execute known algorithms, hitting a “wall of failure.”
The Give-Up
On the hardest puzzles, models used fewer “thinking” steps. Instead of trying harder, they appeared to short-circuit and abandon the problem.
“Current AI simulates reasoning without scaling it. It’s an ‘illusion of thinking.’”
The Rebuttal: An Illusion of an Illusion?
Unfair Test
Models hit practical limits (like token output caps). They often stated they were truncating long answers, which was wrongly scored as a failure.
The Real Proof
When asked to write code to solve hard puzzles (instead of listing every step), models succeeded, showing conceptual understanding.
Trick Questions
Some puzzles were mathematically unsolvable. A model was penalized for correctly identifying this, which a good reasoner should do.
“The question isn’t whether AIs can reason, but whether our tests can distinguish reasoning from typing.”
The Core of the Debate
Apple’s View
A Fundamental Flaw
AI reasoning is brittle. This points to a deep, architectural issue, not just a minor bug. Current models lack generalizable problem-solving skills.
Implication: We’ve hit a scaling wall. AGI is further away.
Rebuttal’s View
An Engineering Problem
The core logic is present but hampered by the interface. We need better evaluation methods and tools (like code execution) for the AI.
Implication: We need hybrid AI systems and smarter tests.
What This Means For AGI
From a Sprint to a Marathon
This debate suggests the road to AGI isn’t a simple, smooth ramp. It’s a complex marathon requiring fundamental research, not just bigger models.
The Old Narrative (Hype)
“AGI is just a few years away!”
The New Reality
“We have more fundamental work to do.”
Let’s dive deeper.
Apple’s “Illusion of Thinking” White Paper
Apple is not usually the first name that comes to mind in cutting-edge AI research, but this June 2025 white paper has made waves. The paper – formally titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” – zeroes in on a new breed of AI systems that the authors call Large Reasoning Models (LRMs). These are basically large language models (LLMs) augmented with extra steps of reasoning (like chain-of-thought prompts that let the model “think out loud” before answering). The big question Apple’s team asked: do these reasoning-augmented AIs actually reason, or do they just fake it until it all falls apart when tasks get hard?
To find out, Apple’s researchers set aside the usual benchmarks (like math word problems or coding challenges) and devised a set of controllable logic puzzles. They wanted tasks where they could easily dial up the complexity and see how the models coped, without worrying that the models had seen the answers during training. They ended up with four classic puzzle environments:
- Tower of Hanoi – moving stacks of disks between pegs under certain rules (harder with more disks).
- River Crossing – shuttling actors and their corresponding agents across a river with a small boat, without ever leaving forbidden combinations together on a bank (harder with more pairs and limited boat capacity).
- Checker Jumping – a puzzle where colored checkers on a line must swap sides by sliding or jumping (harder with more checkers, as moves grow quadratically).
- Blocks World – rearranging stacks of blocks to reach a goal configuration (harder with more blocks, requiring careful multi-step planning).
Each puzzle’s difficulty could be increased in a controlled way – e.g. add more disks or people – while the underlying rules stayed the same. This allowed Apple’s team to test models on progressively harder versions of the same problem and observe at what point the AI’s “reasoning” breaks down. Crucially, they examined not just whether the final answer was correct, but also the step-by-step reasoning traces the models generated. They even built simulators to validate each step of a solution (for example, checking that every move in a Tower of Hanoi sequence was legal and led toward the goal).
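To make that concrete, here is a minimal sketch of what such a move-by-move validator can look like – written in Python for illustration, not the authors’ actual code:

```python
# Minimal sketch of a Tower of Hanoi move validator, in the spirit of the
# simulators the paper describes (illustrative Python, not the authors' code).

def validate_hanoi(n_disks, moves):
    """Replay (from_peg, to_peg) moves; succeed only if every move is legal
    and all disks end up on the last peg."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds n..1, largest at bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, f"move {i}: peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"move {i}: disk {disk} placed on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    if pegs[2] == list(range(n_disks, 0, -1)):
        return True, "solved"
    return False, "all moves legal, but the goal state was not reached"

# The optimal 3-disk solution passes; a truncated one counts as a failure.
solution_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(validate_hanoi(3, solution_3))      # (True, 'solved')
print(validate_hanoi(3, solution_3[:4]))  # (False, 'all moves legal, but ...')
```

A checker like this is deliberately unforgiving: one illegal move, or stopping short of the goal, counts as a failed solution – a strictness the rebuttal later takes issue with.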
They pitted the latest reasoning-enabled models (like Anthropic’s Claude with a “thinking” mode, an OpenAI model variant dubbed “o3” with chain-of-thought, and DeepSeek’s R1 reasoning model) against their own non-reasoning counterparts (the same base models but prompted to give only a final answer without showing their work). All models were given the same computational budget in terms of tokens – so the “thinkers” didn’t get to use more total text than the standard ones; they just allocated some of their budget to reasoning steps.
When Complexity Breaks the Illusion
The results were revealing. At simple levels of complexity, the plain old direct LLMs actually did better than the fancy reasoning versions. For easy puzzles, the “thinking” models often over-complicated things, wasted time on unnecessary steps, and sometimes even confused themselves into wrong answers, whereas the straightforward models went straight for the solution. In other words, for tasks so simple that no elaborate reasoning is needed, adding a chain-of-thought was like hiring a logician to do basic arithmetic – it introduced more room for error. Apple’s team observed instances of this overthinking: an LRM would find the correct answer quickly but then keep churning out extra reasoning steps until it talked itself into an incorrect move.
At moderate complexity, the situation flipped. On puzzles of intermediate difficulty, the reasoning-enhanced models pulled ahead. Their step-by-step approach let them systematically explore possible moves and backtrack from mistakes, so they solved more puzzles than the standard models. This was the sweet spot where the chain-of-thought technique actually paid off: the problems were hard enough that a brute-force answer often failed, but not so hard that the reasoning process itself broke down. The LRMs would try a path, realize it was leading nowhere, and try a different approach – much like a human would think through a tricky brainteaser. This trial-and-error powered by an explicit reasoning trace gave them an edge in accuracy at medium difficulty.
However, as soon as the puzzles became very complex, all the models – reasoning or not – hit a wall. Beyond a certain point, success rates plummeted to essentially zero. The paper describes it as a “complete accuracy collapse” once the tasks exceeded a critical complexity threshold. For instance, in the Tower of Hanoi, models could handle perhaps up to 6 or 7 disks reasonably and sometimes even manage 10 disks with many errors corrected – but by the time you asked for solutions with 15 disks, none of the models could reliably solve it. Every run would eventually get tangled in mistakes or just stop. This is striking because the minimum number of moves for n disks is 2^n − 1, so a 15-disk solution needs 32,767 moves – a huge number, but within the token limits of these models. Yet they still failed long before reaching the solution, often messing up after only a few hundred moves or less.
In Apple’s experiments, model performance fell off sharply as puzzle complexity increased. The paper’s charts show accuracy dropping to 0% on the hardest problems, alongside the number of “thinking” tokens each reasoning model used. Tellingly, the models did not even use their full allotted thinking budget on the hardest puzzles – they started using fewer tokens once the complexity got very high, essentially giving up despite having room to continue.
One particularly counter-intuitive behavior noted was how the models handled their “thinking budget.” Initially, as tasks got harder, the reasoning models appropriately used more tokens – they “thought harder” and longer, which makes sense. But beyond a point, the trend reversed: on the most complex puzzles, the models actually used fewer tokens for reasoning, despite plenty of budget left. It’s as if the AI realized it was in over its head and just short-circuited. Instead of trying harder, it would spit out a brief, often incorrect answer or trail off. Apple’s team likened this to a scaling limit – the model’s reasoning effort grows with complexity up to a medium level, then mysteriously declines at high complexity. This suggests a failure of true general reasoning: a human problem-solver doesn’t intentionally think less when a puzzle gets harder, but these AI systems did.
Digging into the step-by-step traces provided even more insight. In easier puzzles, the reasoning traces often showed the model found a correct solution early, but then kept exploring alternatives (many of them wrong) instead of confidently stopping once it had the answer. In moderate puzzles, the traces showed the models going through a search – trying one path, hitting a dead end, then trying a different sequence, somewhat like working through a maze with backtracking. By the hardest puzzles, the traces were either very short (the model gave up early) or a jumble of inconsistent moves with no coherent strategy.
Apple’s researchers also tried giving the models a leg up: what if we just tell the AI the known algorithm? For example, the Tower of Hanoi has a simple recursive algorithm; the team prompted the models with explicit step-by-step instructions for it. Surprisingly, that didn’t fix things – the models still failed at roughly the same complexity point. Even when handed the blueprint, they couldn’t execute it reliably for large numbers of disks. This hints that the limitation isn’t just in figuring out the right strategy, but in actually carrying it out correctly at scale. The AI might know the algorithm, but it doesn’t truly understand or stick to it under pressure.
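For reference, that blueprint is essentially the textbook recursion below – shown here as an illustrative Python sketch, since the paper supplied it as prompt instructions rather than runnable code. The experiment’s question was not whether a model can state this recursion, but whether it can faithfully unroll it into tens of thousands of individual moves without losing track of the state:

```python
# The textbook recursion for Tower of Hanoi, sketched in Python as a stand-in
# for the kind of algorithmic "blueprint" the prompts spelled out.

def hanoi_moves(n, src=0, aux=1, dst=2):
    """Yield the optimal (from_peg, to_peg) move sequence for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)  # park the top n-1 disks on the spare peg
    yield (src, dst)                              # move the largest disk to its target
    yield from hanoi_moves(n - 1, aux, src, dst)  # stack the n-1 disks back on top of it

print(len(list(hanoi_moves(15))))  # 32767 moves (2**15 - 1) just to write the answer out
```

Unrolling this mechanically is trivial for a computer; what Apple reports is that the LRMs, even with the recursion in hand, drifted into illegal or contradictory moves long before the end.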
Another intriguing observation was an inconsistency across different puzzles. The models could get much deeper into a solution on some tasks than others. For example, one model managed to carry out over 100 moves correctly on a 10-disk Tower of Hanoi (not enough to finish, but still a substantial partial solution), yet the same model would consistently bog down after only ~4 moves on a River Crossing with just 3 pairs. Why would it fare so much better on one puzzle than the other? The authors suggest it comes down to familiarity: Tower of Hanoi is a well-known puzzle – many examples of it or similar problems likely appeared in the training data (helping the model “pattern match” some of it). River Crossing puzzles, by contrast, are less ubiquitous, so the model had less prior exposure to draw from. In short, the AI was leaning on learned patterns rather than pure reasoning – and where it lacked those patterns, it stumbled much sooner.
Taken together, Apple’s study paints a picture of AI reasoning that is more smoke-and-mirrors than robust general intelligence. The authors call it an “illusion of thinking” for a reason: these models give a good imitation of step-by-step reasoning on easy examples, but they don’t scale up that ability in a reliable way. Eventually, they either stop trying or start making nonsensical moves, even when theoretically they have all the computational resources needed to keep going. As the paper bluntly concludes, current state-of-the-art reasoning-augmented LLMs “fail to develop generalizable problem-solving capabilities” and hit a wall of failure beyond a certain complexity in multiple domains. And unlike humans, giving them more time or even the right formula doesn’t save them – it just prolongs the inevitable breakdown.
Apple’s timing in publishing this – right in the middle of AI mania and just before a big developer conference – certainly raised some eyebrows. Was this purely an academic contribution, or also a strategic message? Regardless, the technical findings demand attention. If AI models can’t maintain their reasoning performance as tasks get harder, that’s a serious reality check on some of the grand claims about “near-human” intelligence.
The Counterpoint: Anthropic (and Friends) Push Back
It didn’t take long for the broader AI research community to react. Just days after Apple’s paper went public, a rebuttal appeared with the cheeky title “The Illusion of the Illusion of Thinking.” This response was spearheaded by Alex Lawsen, a researcher affiliated with Open Philanthropy, and notably credited an AI model (Anthropic’s Claude, listed as “C. Opus”) as a co-author. The rejoinder argued that Apple’s dire conclusions about reasoning might themselves be an illusion – caused not by fundamental flaws in the AI, but by flaws in the experimental design and evaluation metrics.
Lawsen’s critique makes a few key points, essentially suggesting that Apple’s tests weren’t entirely fair to the poor AIs:
- Ignoring Token Limits and Practical Constraints: The rebuttal claims that Apple interpreted the models’ failures as reasoning collapses when in fact many were caused by the models hitting output limits (see the rough arithmetic sketched after this list). For example, in those Tower of Hanoi tasks with many disks, the solution requires thousands of moves to list out. The models tested (like Claude) often have limits on how many tokens they can output in one go. Lawsen points out instances where the model explicitly said something along the lines of: “…the pattern continues, but I’ll stop here to avoid making this too long.” In Apple’s evaluation, stopping early meant a failed solution – but from the model’s perspective, it wasn’t that it couldn’t reason further, it was that it was self-imposing a limit to avoid an extremely long answer. The rebuttal suggests this shows a bit of self-awareness by the AI about practical limits (like a human saying “you get the idea, I won’t write out all 32,000 moves”). So, Lawsen argues, Apple might have mistaken a policy/formatting limitation for a cognitive limitation.
- Unsolvable Puzzles in the Mix: In the River Crossing set, Apple ramped up difficulty partly by adding more actor/agent pairs. But at some point, the puzzle as configured became mathematically unsolvable (e.g. too many people for the boat capacity to ever get everyone across under the rules). Apple’s evaluation still marked the AI as failing those cases. Lawsen’s rebuttal finds this unfair – essentially, the model was penalized for correctly recognizing an impossible task. It’s like asking someone a trick question that has no answer and then marking them wrong for saying “this has no solution.” A truly intelligent reasoner should say “no solution” in such cases, which some models did – yet Apple’s scoring treated that as a failure rather than a correct response to an unsolvable scenario.
- Evaluation Method Missed the Real Issue: Apple’s automated evaluation looked for a complete, step-by-step solution trace. If anything was missing or the model stopped prematurely, it was a fail. The rebuttal contends that this didn’t distinguish why a solution was incomplete. Was it because the model genuinely got confused (a reasoning failure), or because it ran out of its allowed output length or decided to truncate? Apple’s setup would label both simply as failures. Lawsen suggests that a more nuanced evaluation would treat “ran out of space” differently from “gave an incorrect move,” for example.
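A back-of-the-envelope calculation shows why the token-limit objection is at least plausible. The numbers below are assumptions for illustration only (roughly 8 tokens per printed move and a 64,000-token output cap are ballpark figures, not values taken from either paper):

```python
# Back-of-the-envelope check of the rebuttal's token-limit argument.
# Both numbers below are assumptions for illustration: the real per-move cost
# depends on the tokenizer/format, and output caps vary by model and API.

TOKENS_PER_MOVE = 8      # assumed average cost of printing one move
OUTPUT_CAP = 64_000      # assumed per-response output cap, in tokens

for n_disks in (10, 12, 15):
    moves = 2**n_disks - 1                    # optimal Tower of Hanoi solution length
    tokens_needed = moves * TOKENS_PER_MOVE
    verdict = "fits within" if tokens_needed <= OUTPUT_CAP else "exceeds"
    print(f"{n_disks} disks: {moves} moves ~ {tokens_needed} tokens ({verdict} the cap)")

# Under these assumptions: 10 disks ~ 8,184 tokens (fits), 12 disks ~ 32,760
# tokens (fits), 15 disks ~ 262,136 tokens (cannot be printed in one response).
```

Under those assumed numbers, a complete 10-disk listing fits comfortably while a complete 15-disk listing cannot be printed in a single response at all; the rebuttal’s point is that a scoring scheme blind to this difference will record both kinds of stopping as reasoning failures.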
To prove their point, the rebuttal authors reran some of the experiments with slight tweaks. The most striking test: instead of asking the AI to list every single move for, say, a 15-disk Tower of Hanoi (which is impossibly long to print), they asked the AI to output a short computer program (a recursive Lua function) that would produce the moves. This tests whether the model knows the algorithm conceptually without forcing it to dump thousands of tokens. The outcome? The models succeeded with flying colors. Claude, Google’s Gemini, and OpenAI’s model all quickly generated correct algorithms for solving 15-disk Hanoi, a problem far larger than the ones where Apple reported 0% success. In other words, when freed from the burden of literally spelling out every step, the AI had no trouble demonstrating a solution strategy for the hardest cases. Lawsen took this as evidence that the reasoning capability was there, but Apple’s test setup (which demanded exhaustive step-by-step answers) was forcing the models into failure by exhausting their output budgets or patience.
The rebuttal doesn’t claim that the models are perfect reasoners – far from it. Lawsen readily acknowledges that truly general algorithmic problem-solving is still a challenge and that his alternate tests were limited in scope. However, he argues the framing of Apple’s conclusions was too pessimistic. The collapse observed might reflect the models’ operational constraints (context window sizes, token limits, rigid output formats) rather than an absolute inability to handle complexity. In a memorable line, the rejoinder states: “The question isn’t whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing.” In other words, maybe the AIs can reason better than it appears – we just have to ask them to demonstrate it in the right way.
Importantly, the rebuttal also touches on a deeper issue: What exactly counts as reasoning? One critique leveled at Apple’s approach is that if a model simply recognizes a complex puzzle and regurgitates a known solution or code, Apple would say it’s not “really reasoning,” just matching a pattern. But from a practical standpoint, if the model knows an answer or can call a tool to get it, isn’t that as useful as reasoning it out? Anthropic researchers have developed internal methods to trace what a model is internally doing when it “thinks,” and they argue that looking only at the token outputs (the printed chain-of-thought) might give a misleading picture. The model could be doing a lot under the hood that isn’t captured in the textual reasoning it outputs. Apple’s analysis treated the visible chain-of-thought as the ground truth of the model’s reasoning process, which might be an oversimplification.
The back-and-forth got even more meta when it came to the complexity metric itself. Apple treated, say, “15 disks Tower of Hanoi” as extremely complex because of the long solution. The rebuttal counters that complexity isn’t just measured in length of solution – some puzzles with fewer steps can be harder logically. (Think of a chess endgame: it might only be 20 moves, but very hard to find the right moves.) In Apple’s set, the River Crossing puzzles actually required far fewer total moves than Tower of Hanoi, yet the models struggled more with them due to the combinatorial constraints. This suggests that an AI failing earlier on River Crossing isn’t necessarily less intelligent; it’s a harder reasoning task despite fewer moves. Lawsen emphasizes using complexity measures tied to true computational hardness, not just “number of steps,” to gauge AI reasoning.
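A toy calculation illustrates the point. The branching factors below are assumptions chosen only to show the shape of the argument, not measurements from either paper:

```python
# Illustrative only: solution length is a poor proxy for search difficulty.
# The branching factors below are assumptions chosen to show the shape of the
# argument, not measurements from either paper.

def naive_search_size(branching_factor: int, solution_depth: int) -> int:
    """Crude upper bound on states a blind search might touch."""
    return branching_factor ** solution_depth

# Tower of Hanoi: very long solution, but only a handful of legal moves per state.
print(naive_search_size(branching_factor=3, solution_depth=15))   # 14_348_907

# A much shorter puzzle with heavier combinatorial constraints (who may share
# the boat or a bank) can still blow up faster despite needing far fewer moves.
print(naive_search_size(branching_factor=12, solution_depth=11))  # 743_008_370_688
```

A long but nearly forced puzzle can be easier to search than a short but heavily constrained one, which is why “number of moves” alone is a misleading complexity axis.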
After corrections for these issues, the rebuttal authors maintain that Apple’s doom-and-gloom interpretation is a bit overstated. Yes, today’s models struggle with long, complex, structured problems – but maybe not as hopelessly as Apple implied. If we allow them to use tools, break tasks into code, or otherwise sidestep brute-force output, they can tackle problems at higher complexity. And future improvements (like longer context windows, better self-checking, or models that know when to stop and ask for help) could push that boundary further.
Industry Reactions and the Public Perception
This technical debate has broader implications, because it directly challenges the narrative of rapid progress toward AGI. Over the past year, leading AI labs have stoked excitement by claiming their models are beginning to reason like humans. OpenAI, for instance, in announcing an earlier “reasoning” mode for GPT, said the model could “spend more time thinking through problems before they respond, much like a person would.” Google DeepMind has hyped up its upcoming Gemini model as being particularly strong at reasoning through difficult problems. Anthropic has marketed Claude’s latest versions as having been trained on more “real-world” tasks, implying they are more grounded and reliable in practical reasoning. Such claims have fed a perception that human-level AI is just around the corner – perhaps only a couple of years away.
Popular AI assistants and models like OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini have all been touted for their advanced reasoning capabilities. Apple’s findings cast doubt on whether these systems are truly on the verge of human-like general intelligence.
Apple’s research landed like a splash of cold water. It systematically demonstrated that scaling up task complexity leads to breakdowns, not breakthroughs, in these models. This undermines the idea that we can simply scale current AI techniques to get to AGI. If adding more tokens of “thinking” or just training on more data were going to yield robust general reasoning, we’d expect these models to handle bigger puzzles at least somewhat gracefully. Instead, Apple showed they actually get worse in relative terms. The illusion of reasoning is laid bare when the crutch of training-set similarity is kicked away. As one commentator put it, the paper reveals that these models “simulate reasoning without scaling it”.
The industry response is mixed. Some experts agree with Apple’s sobering take. AI skeptics (like Gary Marcus and others) have long argued that today’s deep learning models lack true understanding, and Apple’s results bolster those critiques. Even within companies, many researchers know these limitations – they’re actively working on them – but such nuances aren’t often conveyed in press releases or product announcements. It is telling that Apple, a company with massive AI talent but less hype to spin (since it’s not selling a chatbot to the public yet), chose to emphasize what doesn’t work in current models. It serves as a counterweight to months of relentlessly positive spin from others.
On the other side, supporters of the rapid progress view (including many at OpenAI and Anthropic) caution against over-interpreting Apple’s findings. From their perspective, the glass is half full: today’s models can solve moderately complex tasks with reasoning, and the failures at extreme complexity just point the way toward what to improve. Anthropic’s CEO, Dario Amodei, has discussed ideas like allowing a model to know when it’s stuck and then call for external help or tools – something Apple’s paper also mused about as a possible path forward. OpenAI, Anthropic, and DeepMind are all exploring techniques to extend reasoning, such as increasing context length (so the model can keep a longer train of thought) or using planning algorithms alongside the neural network.
DeepMind, for instance, has historically integrated search algorithms with neural nets (e.g. AlphaGo’s tree search). One could imagine future large models that, upon hitting a complex reasoning impasse, automatically switch to a different strategy (like calling a Python solver, breaking the task into subtasks, or querying an external memory). These labs might argue: just because the pure end-to-end neural approach fails at long tasks doesn’t mean AI systems as a whole can’t be engineered to handle them. In fact, the rebuttal’s success with code generation hints at a hybrid approach – the model itself realized a programmatic solution is better than brute-forcing the answer. Companies are already implementing things like that (OpenAI’s Code Interpreter, tool-use via plugins, etc.), effectively giving current AIs a way to work around their shortfalls in sustained reasoning.
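As a conceptual sketch of that hybrid idea – emphatically not any lab’s actual implementation, with `ask_model` and `run_sandboxed` as hypothetical stand-ins for an LLM API call and a code-execution tool – the control flow might look something like this:

```python
# Conceptual sketch of a hybrid "reason directly or delegate to code" loop.
# Not any lab's actual implementation: ask_model and run_sandboxed are
# hypothetical stand-ins for an LLM API call and a code-execution tool.
from typing import Callable

def solve_with_fallback(
    task: str,
    estimated_steps: int,
    ask_model: Callable[[str], str],
    run_sandboxed: Callable[[str], str],
    max_direct_steps: int = 500,
) -> str:
    if estimated_steps <= max_direct_steps:
        # Short enough to enumerate: answer step by step in plain text.
        return ask_model(f"Solve step by step, listing every move:\n{task}")
    # Too long to type out: have the model write a program that *generates*
    # the moves, then run that program outside the model and return its output.
    program = ask_model(f"Write a self-contained program that prints the full solution to:\n{task}")
    return run_sandboxed(program)
```

The design choice mirrors the rebuttal’s experiment: the model’s job shifts from typing out every step to producing (and, ideally, verifying) a compact generator for those steps.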
Still, there’s an elephant in the room: public AGI timelines. Not long ago (in early 2025), DeepMind’s CEO Demis Hassabis speculated that we might see human-level AI within 5 to 10 years, and OpenAI’s Sam Altman even mused about the possibility of achieving it before 2030 or sooner. These optimistic timelines assume that progress will continue at a breakneck pace – essentially extrapolating the impressive gains from GPT-3 to GPT-4 and beyond. But Apple’s study raises a critical question: what if we’ve already hit a wall in one important dimension (reasoning through complexity) that won’t be solved just by scaling up current models? If today’s cutting-edge systems can’t reliably solve, say, a 15-disk Tower of Hanoi or a tricky multi-step logic puzzle, then claims of being on the cusp of human-level intellect seem premature. After all, humans can solve these puzzles (maybe with some effort), and human students improve on them with practice – current AIs do not, unless fundamentally changed.
AGI Hype vs. Technical Reality: A Sustainable Narrative?
The clash between Apple’s findings and the prevailing AGI hype is a healthy reality check for the field. On one hand, we have record-breaking AI achievements: models that can draft coherent essays, write code, pass professional exams, even generate creative imagery. On the other hand, as Apple’s work highlights, these same models can spectacularly fail at tasks that require a deeper, algorithmic consistency – even when given every chance to succeed. It’s as if we’ve built sprinters who can dash 100m in world-record time, but collapse in exhaustion halfway through a marathon. General intelligence is more marathon than sprint.
Is the narrative around near-term AGI sustainable in light of such roadblocks? Perhaps the hype will be forced to cool a bit. Investors and the public are starting to see not just shiny demos but also the limitations: chatbots that go off the rails, reasoning that unravels with too much pressure, and the lack of any guarantee that scaling alone will bridge those gaps. The timeline for AGI may need revising – or at least, a more cautious confidence interval. If a year ago some thought AGI was, say, 5 years away, findings like this suggest it might be much further absent some new breakthrough. We may need fundamental research advances – new model architectures that incorporate logic and memory in more human-like ways – not just bigger versions of GPT-4.
None of this is to say that progress has stalled. The AI community is actively tackling these weaknesses, and next-generation models will surely improve on certain reasoning benchmarks. But Apple’s message cuts through the noise: don’t confuse performance on curated tests with genuine general reasoning. As the debate between Apple’s researchers and the rebuttal team shows, even defining and measuring “reasoning” is tricky business. Are we measuring the AI’s thought process or just its ability to produce the right output? How do we account for practical limits versus true cognitive limits? These questions will need solid answers if we are to chart a credible path toward AGI.
In the end, Apple’s foray into AI research did exactly what good science is supposed to do – challenge assumptions. It forced AI builders to consider whether their models are really solving problems or just playing the part until they can’t. The current wave of AI enthusiasm has been fueled by rapid, visible improvements, but that trajectory may not be smooth going forward. There will be setbacks and reality checks like this along the way. The promise of AGI isn’t dead, but it may be a bit further on the horizon than the most optimistic voices have been claiming. As researchers dig into why reasoning collapses and how to fix it, the narrative will likely shift from “we’re almost there” to “we have more work to do.” And that is ultimately a good thing – because getting to true AGI is a marathon, not a sprint, and we’re still learning how to run the distance without stumbling.
Sources:
- Apple Machine Learning Research – “The Illusion of Thinking” (June 2025)
- Arize AI Blog – “What the Apple AI Paper Says About LLM Reasoning”
- Arize AI Blog – Rebuttal summary in “The Illusion of the Illusion of Thinking”
- 9to5Mac – “New paper pushes back on Apple’s LLM ‘reasoning collapse’ study”
- 9to5Mac – Lawsen’s evaluation critiques and recommendations
- RCR Wireless – “Thinking about ‘the illusion of thinking’ – why Apple has a point”
- ITPro (Future plc) – Apple research on reasoning models
- Isaak Kamau (Medium) – “Breaking Down Apple’s Illusion of Thinking”
- Markus Bestehorn (LinkedIn post) – Summary of Apple’s findings
- Cognitive Today – Demis Hassabis on AGI timeline (5–10 years) and AGI predictions