Jul 21, 2024 - Georg Zoeller (AILTI)

Analysis: Recall or Reason - the $100B question

Seven years into the transformer, the question of how far LLMs can reason, or whether they just recall known patterns, remains hazy. Here's an update on what we know and why it very much matters.

Given the rate of progress, hype and capital pressure to put AI into production, it is no surprise that understanding of Generative AI technology significantly lags the investor narratives and vast valuations of AI companies.

It is certainly fair to say that we do not understand current, transformer-based Generative AI technology well enough to say with any level of certainty whether lofty claims of “AGI” or “Reasoning” are supported, relegating them to “belief” rather than “science” or “engineering”.

Case in point: even the often-cited, seminal paper on “Emergent Properties in LLMs” is now under significant scrutiny, not just for using unclear definitions but also due to emerging counter-evidence questioning its conclusions.

A related, open question about Large Language Models like GPT-4 is whether they primarily reproduce information and solutions they have memorized (“recall”) in response to problems, or whether they are able to come to conclusions by applying concepts they have learned (“reasoning”).

It is worth naming this distinction precisely, because it is the same one that determines what work survives. Recall is surface - the explicit, codifiable, reproducible pattern that can be compressed and replayed. Reasoning, if it exists, would be substrate - the earned capacity to take a learned concept and apply it where it has never been applied. The two look identical when the answer is fluent. They behave completely differently the moment the problem changes.

Recall of Memorized Ideas

The kind of “fuzzy” / “semantic” matching and information retrieval the technology is able to do, given the compression factor of the data involved, does represent a powerful new capability in itself, but is often inferior (due to hallucinations) to specialized data retrieval systems like databases. Recall is also only useful for known situations: it fails when the LLM confronts a new problem.

This is the heart of the danger. Recalled output arrives sounding exactly as confident as reasoned output - which means a fluent answer to a problem the model has actually never solved is the most expensive kind of mistake there is. Fluency is not correctness, and a system that has memorized a pattern will present its replay with the same eloquence as genuine understanding.

An entire branch of Generative AI research and development is currently dedicated to improving recall, primarily through Retrieval-Augmented Generation (“RAG”).

Abstract Reasoning

Generalized, transferable reasoning, on the other hand, implies the ability to take learned concepts and apply them to new problems across domains. If and when LLMs are able to perform reasoning, their potential impact is vastly higher and opens the path to novel applications in many jobs that are currently a human domain.

Current online discourse around LLMs and the marketing pushed by large AI companies focus heavily on “reasoning” and away from reproduction of learned knowledge, as this provides a powerful investor narrative.

It’s really hard to answer…

In practice, it’s extremely hard for an external observer in the current ecosystem to tell whether a model is recalling or reasoning, for several reasons:

The training data of all top-end models is secret. Therefore, an external observer is unable to establish whether the LLM has seen a problem before and is regurgitating the solution, or whether it is reasoning its way to a solution. Even creating a completely novel problem and exhaustively checking the internet for whether it has been proposed before is ineffective at ruling out this possibility.
All major vendors are using the data generated when users interact with the model for future training. As such, even after creating a novel problem and running tests on the LLM, there’s a good chance that future variations of the model will now be aware of the problem.
All major AI vendors have an extreme financial interest in maximizing the hype around, and belief in, broad capabilities.
The way we are benchmarking AI performance is extremely immature.
Accidental benchmark contamination and intentional manipulation are common, as is evident from the strong variation in results whenever benchmarks are updated with novel problems.

Recent research has therefore focused on ways to solve the benchmarking question. For example, in an extensive study, researchers from MIT and Boston University investigated the reasoning capabilities of language models using counterfactual variants of familiar tasks:

arXiv:2307.02477 - Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

In short, if an LLM is performing abstract, transferable reasoning, variations to the problem should not materially affect its performance. Current research finds that performance degrades substantially under such variations.
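The logic of such a counterfactual test can be sketched in a few lines of Python. This is a toy illustration, not the paper's evaluation code: the task (two-digit addition in base 10 versus base 9), the problem set, and `memorizing_solver` are all assumptions chosen for the example. The solver is a hypothetical stand-in for a model that has only memorized the default procedure and ignores the changed condition.

```python
from typing import Callable, List, Tuple

Problem = Tuple[str, str]
Solver = Callable[[str, str, int], str]


def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in `base` (2..10)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))


def add_in_base(a: str, b: str, base: int) -> str:
    """Ground truth: add two digit strings interpreted in `base`."""
    return to_base(int(a, base) + int(b, base), base)


def memorizing_solver(a: str, b: str, base: int) -> str:
    """Hypothetical stand-in for a recall-only model: it always applies
    the familiar base-10 procedure and ignores the stated base."""
    return str(int(a) + int(b))


def accuracy(solver: Solver, problems: List[Problem], base: int) -> float:
    """Fraction of problems the solver answers correctly in `base`."""
    correct = sum(solver(a, b, base) == add_in_base(a, b, base)
                  for a, b in problems)
    return correct / len(problems)


# Illustrative problems; every digit is valid in both base 9 and base 10.
PROBLEMS: List[Problem] = [("12", "13"), ("27", "15"), ("34", "41"), ("58", "26")]

default_acc = accuracy(memorizing_solver, PROBLEMS, base=10)
counterfactual_acc = accuracy(memorizing_solver, PROBLEMS, base=9)
degradation = default_acc - counterfactual_acc

print(f"default: {default_acc:.2f}, counterfactual: {counterfactual_acc:.2f}, "
      f"degradation: {degradation:.2f}")
```

In this toy setup the recall-only solver still gets the counterfactual problems without a carry right by coincidence, so its counterfactual accuracy is above zero but clearly below its default accuracy. A solver that had genuinely learned the concept of positional addition would show no such gap.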

Current Research: More Recalling than Reasoning

These studies, so far, indicate that the majority of LLM performance is better explained by effective recall of memorized data than by reasoning.

“Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.”

One way to interpret these findings is to look at the question of Recall and Reasoning not as a binary, black-and-white distinction but as a gradient, or, more succinctly:

Sufficiently fuzzy pattern matching and recall is indistinguishable from reasoning

Indistinguishable to an observer who only reads the output. And that is exactly the problem: if you cannot tell recall from reasoning by looking at the answer, then the answer’s fluency is no longer evidence of anything. The more impressive the output sounds, the more it earns trust it may not have earned - which means the failure mode here is not the model’s, it’s the reader’s. Granting that unearned trust to a confident answer is the single most common way otherwise careful people get caught: they accept the replay, defer to it, and only discover the variation didn’t hold after the decision is already made.

Why does it matter?

For businesses, what matters most is the ability to perform specific tasks. Most employees work on well-defined, repeating business problems for which ample data is available to train models. From that perspective, recalling such information may well be enough to drive significant automation potential for those jobs.

Or, as a corollary: the more variation in task details a job entails, the less likely it is that memorisation-dependent technology, with its limited ability to generalize or transfer, will displace it. Put differently: the work that is surface - codifiable, repeating, already sitting in the data - is the work that recall can absorb. The work that is substrate - the judgment that holds when the parameters move, the accountability for the call - is the part recall cannot reach, precisely because it was never in the training data to begin with. The recall/reason line is, at root, the automation line.

This has a direct operational consequence for anyone putting these models into production. If the model recalls rather than reasons, you do not deploy it as a decision-maker; you deploy it as a fast, fallible component inside a system you still control. The discipline is to wrap the probabilistic engine in a deterministic harness - explicit rules, verification steps, and checks that you authored and can defend - so that the parts of the workflow that must be reliable are run by logic you encoded, and only the genuinely fuzzy parts are left to the model. The human stays accountable for the decision because the human wrote the rules the system executes; the model is allowed to retrieve and draft, never to judge. That is the difference between a system you can stand behind when it fails and one that simply fails in your name.
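A minimal sketch of that harness pattern in Python follows. Everything here is hypothetical and chosen for illustration: `stub_model` stands in for a real LLM call, the refund scenario and the 100 EUR limit are invented, and the two checks are examples of rules an operator might author. The point is only the shape: the model drafts, deterministic checks decide whether the draft is accepted, and anything rejected goes back to a human.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Outcome:
    draft: str                      # what the model proposed
    accepted: bool                  # True only if every check passed
    failed_checks: List[str] = field(default_factory=list)


def run_in_harness(draft_fn: Callable[[str], str],
                   prompt: str,
                   checks: Dict[str, Callable[[str], bool]]) -> Outcome:
    """Let the model draft, then apply every operator-authored check.
    The model never decides; the deterministic checks do."""
    draft = draft_fn(prompt)
    failed = [name for name, check in checks.items() if not check(draft)]
    return Outcome(draft=draft, accepted=not failed, failed_checks=failed)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a fixed draft."""
    return "Refund approved: 250 EUR"


def has_expected_format(draft: str) -> bool:
    """Rule 1: the draft must match the exact, expected output format."""
    return re.fullmatch(r"Refund approved: \d+ EUR", draft) is not None


def within_refund_limit(draft: str, limit: int = 100) -> bool:
    """Rule 2: the amount must not exceed an (assumed) 100 EUR limit."""
    match = re.search(r"(\d+) EUR", draft)
    return match is not None and int(match.group(1)) <= limit


CHECKS = {"format": has_expected_format, "refund_limit": within_refund_limit}

outcome = run_in_harness(stub_model, "Customer requests a refund", CHECKS)
if not outcome.accepted:
    # Rejected drafts are escalated to the accountable human, not executed.
    print("Escalate to human; failed checks:", outcome.failed_checks)
```

The design choice worth noting is that `failed_checks` records which rule rejected the draft, so the person accountable for the decision can later explain why the system acted as it did.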

The test for whether you have got that boundary right is a simple one: when an AI-assisted decision goes wrong, can the person responsible still explain why the decision was rational at the time - to a regulator, to a board, to themselves? A decision built on recalled output the operator never actually verified does not survive that question. A decision where the human supplied the context, checked the variation, and owns the call does. Verifying the output rather than deferring to it is not optional friction; it is the labour that converts a fluent answer into a defensible one.

Given the extreme hype cycle AI is currently in, a growing consensus that its reasoning ability is limited would almost certainly have an impact on valuations … and the lofty narrative of replacing professions that

More independent research is needed to understand how the technology works, to establish durable measurements of progress, and to identify possible phase shifts between recall and reasoning in major models and how they relate to data, training and the resulting models.

TL;DR

Current science attributes the larger share of model performance on problem solving to mechanisms more akin to recalling memorized information than to abstract, generalizable reasoning.
Business decision makers, investors and career-minded individuals should pay attention to new, verifiable research and credible benchmarking on the topic of “is it reasoning or memorizing”, as it will provide insight into the trajectory, scaling and real-world business impact of the technology.
If the current trends hold, “high variation in daily workload and problem parameters” may prove to be correlated with resilience against AI displacement for jobs and businesses.
It would also mean that the technology would require a much more constant influx of new data to stay relevant and operate in job domains, increasing the costs to deploy and operate it. The value of, and dependency on, “access to current data” would remain elevated in this scenario, shifting the balance of power in the industry away from model makers and towards data owners.
Practically, the recall/reason gap is a governance instruction, not just a forecast. Because a confident answer looks the same whether it was recalled or reasoned, the value moves to whoever can tell the difference and own the consequence. That means deploying the model inside a system you control - codified rules and verification you authored doing the load-bearing work, the model retrieving and drafting at the edges - and keeping a human accountable for any decision the output feeds. The competitive edge stops being “who has the cleverest model” and becomes “who can verify its output and answer for the result.”