Friday, 4:47 p.m. A product team runs the demo one last time. Laptops open, shoulders tight. The room has that silence that means: we are about to commit.
The assistant answers a gnarly question. Then, helpfully, it narrates its inner life:
“Skill used: exponent and root skills.” [1]
Someone smiles. Someone says the line that always arrives on schedule: “See? It knows what it’s doing.”
That sentence is doing more work than it appears to. It is not a technical claim. It is a trust claim, a social shortcut. And it has a cousin that shows up a beat later, usually from the risk person:
“If it knows what it’s doing, does it also know when it’s lost?”
That is the real question.
Not “Can it solve the problem?” Lots of systems solve problems on a good day.
The question is: Does it know when it’s lost?
What metacognition actually means
Humans don’t just think. We watch ourselves thinking. We notice when we are bluffing, foggy, sprinting on a bad map. We plan, monitor, adjust, and sometimes stop. That loop is what John Flavell called “cognition about cognition.” [2]
In practice, metacognition is less like a philosopher stroking a beard and more like a hiker checking the sky.
You need two moves:
- Monitoring: How is this going?
- Control: Given that, what do I do next?
Nelson and Narens formalized this as two interacting levels: an object level that does the task, and a meta level that models it, with information flowing up and decisions flowing down. [3]
Most AI talk blurs these moves together. Confident narration passes for monitoring. Fluent strategy talk passes for control. But the loop only works if it can constrain behavior, not just decorate it.
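For concreteness, here is that loop as a control structure rather than a diagram. This is a minimal sketch, not anyone’s published algorithm; `solve_step`, `estimate_confidence`, and `switch_strategy` are hypothetical stand-ins you would have to supply. The structural point is the only point: the meta level gets to change what happens next, not just describe it.

```python
from typing import Callable, Optional

def solve_with_monitoring(
    problem: str,
    solve_step: Callable[[dict], dict],                   # object level: does the work
    estimate_confidence: Callable[[dict, dict], float],   # meta level: monitoring
    switch_strategy: Callable[[dict], Optional[str]],     # meta level: control
    max_steps: int = 10,
    threshold: float = 0.5,
) -> Optional[str]:
    """Monitoring feeds control: a low reading changes behavior, not just the narration."""
    state: dict = {"problem": problem, "strategy": "default", "history": []}
    for _ in range(max_steps):
        step = solve_step(state)                        # do some work
        confidence = estimate_confidence(state, step)   # "how is this going?"
        if confidence < threshold:                      # "given that, what do I do next?"
            new_strategy = switch_strategy(state)
            if new_strategy is None:
                return None                             # stop and defer, rather than bluff
            state["strategy"] = new_strategy
            continue
        state["history"].append(step)
        if step.get("answer") is not None:
            return step["answer"]
    return None  # out of budget: that, too, is a control decision
```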
So when a NeurIPS paper claims LLMs show “metacognitive capabilities,” it deserves a careful read. Not a dunk, not a fan club. A careful read.
What the paper actually shows
The paper is “Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving” by Didolkar et al. (NeurIPS 2024). [1]
Their claim is specific: LLMs can name the skill needed for a math task, and that naming can improve performance. They build a system and test it. No incense. No vibes.
The method:
- Use GPT-4 to assign skill names to training problems. [1]
- Discover those names are absurdly over-specific. MATH produces 35,000 skill names from 7,500 problems. That is not a taxonomy. That is a Borges story.
- Cluster the names into a usable menu: 22 skills for GSM8K, 117 for MATH. [1]
- At test time, ask the model to pick a skill, retrieve solved examples tagged with that skill, and use them as in-context exemplars (see the sketch after this list). [1]
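Sketched in code, that pipeline looks roughly like this. The `call_llm` completion function and the exemplar store are assumptions, and the prompts are illustrative, not the authors’ exact ones:

```python
from typing import Callable

def skill_routed_answer(
    question: str,
    skill_menu: list[str],                     # the clustered skill names, e.g. 22 for GSM8K
    exemplars_by_skill: dict[str, list[str]],  # solved problems tagged with each skill
    call_llm: Callable[[str], str],            # any completion function (an assumption)
    k: int = 4,
) -> str:
    """Self-labeling plus exemplar retrieval for in-context learning (a sketch)."""
    # 1. Ask the model to name the skill -- the step the paper calls metacognitive.
    skill = call_llm(
        "Pick the single most relevant skill from this list: "
        + ", ".join(skill_menu)
        + f"\nProblem: {question}\nAnswer with the skill name only."
    ).strip()
    if skill not in exemplars_by_skill:
        # The label is a routing key, not a confession: fall back instead of trusting it blindly.
        skill = skill_menu[0]

    # 2. Retrieve solved examples tagged with that skill.
    shots = exemplars_by_skill.get(skill, [])[:k]

    # 3. Use them as in-context exemplars for the actual attempt.
    prompt = "\n\n".join(shots) + f"\n\nProblem: {question}\nSolution:"
    return call_llm(prompt)
```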
Performance improves. MATH goes from 42.2 (baseline) to 53.88, and GSM8K from 93.00 to 94.31. [1]
If you build systems, you read that and think: Okay. That will ship.
Now comes the trap.
If a model can select a skill label, you can tell a story where it is reflecting on its own reasoning. The story is easy to tell because it uses human-shaped words: skill, strategy, procedure, knowledge, meta.
And the story might still be wrong.
Described the way an engineer would title a diagram, the method sounds like this:
Self-labeling plus exemplar retrieval improves in-context learning.
That is the capability.
A routing key is not a monitoring-and-control loop. A label can help you start the right procedure. It does not help you notice failure mid-procedure.
The authors acknowledge this. Skill-based approaches still fail due to main-skill errors, secondary-skill gaps, or misapplication even when the exemplar is relevant. [1] They try multiple skills per question and see a bump (53.88 to 55.14), exactly what you’d expect if real problems don’t live in single-skill bins. [1]
The paper is careful. The hype is not the paper’s fault.
But the word “metacognition” starts quietly importing extra promises:
- It understands its own process.
- It can recognize its limits.
- It can intervene when it is wrong.
The paper does not claim all that. LinkedIn does.
Humans don’t just name skills, they stop
Watch an experienced human solve a hard problem. The memorable part is not the label. It is the stopping.
A few steps, then pause. Re-read. Realize you’re chasing a ghost. Change tactics. Sometimes quit and come back tomorrow, the most underrated cognitive strategy in the world.
That pattern lines up with Nelson and Narens: monitoring feeds control. [3] You don’t get metacognition by narrating. You get it by steering.
Here is the awkward fact: even humans are not great at introspection when it matters. We are persuasive narrators of our own actions, often telling a clean story after the messy process is done. The narration is useful. It is not always truthful.
LLMs are optimized to produce narration. They are, by construction, excellent at generating plausible “self-explanations.” That is a feature, not a moral failing.
But it means “metacognitive-sounding text” is a weak signal. Easy to fake. You don’t need an internal self-model. You need pattern completion.
So the question becomes operational:
If the model says “I’m using exponent and root skills,” what changes when it is wrong?
Does it know when it’s lost?
What “knowing your limits” looks like in the literature
Researchers who study this question seriously don’t rely on vibes.
Kadavath et al. show that larger language models can, in some settings, produce useful self-evaluation signals: estimates of whether their own claims are correct. [4] That cuts against the lazy take that models are pure confidence machines.
But notice what that work implies:
- You need careful framing.
- You need specific evaluation.
- You often need to ask the model to step outside generation and do a different task.
“Metacognition” does not fall out of fluent answering. You have to engineer it.
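What “engineering it” can look like in miniature: make the model step outside generation and grade a candidate answer as a separate task, loosely in the spirit of Kadavath et al.’s self-evaluation setup. The `call_llm` function here is a hypothetical completion call, and the prompt wording is illustrative, not taken from the paper.

```python
from typing import Callable

def self_evaluate(
    question: str,
    proposed_answer: str,
    call_llm: Callable[[str], str],  # hypothetical completion function
) -> float:
    """Ask for a probability that the proposed answer is correct, as a separate task.
    Returns a crude confidence in [0, 1]; whether it is *useful* is an empirical question."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "What is the probability that the proposed answer is correct? "
        "Reply with a single number between 0 and 1."
    )
    reply = call_llm(prompt).strip()
    try:
        return min(max(float(reply), 0.0), 1.0)
    except ValueError:
        return 0.0  # an unparseable self-report is not evidence of confidence
```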
Farquhar et al. (Nature 2024) treat hallucinations as an uncertainty problem, using entropy-based estimators to detect a subset they call confabulations. [6] Again: we are not trusting the model’s nice self-talk. We are building instruments to detect when it is drifting.
When researchers measure “knowing your limits,” they don’t treat a skill label as the destination. They treat it as one feature among many, and they measure the relationship between uncertainty signals and actual correctness. [4] [5] [6]
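That measurement does not require heavy machinery. Log whatever confidence signal you have next to each graded answer and check whether it actually separates right from wrong; a small AUROC-style calculation like the sketch below (standard library only, inputs assumed to come from your own logs) is enough to start.

```python
def confidence_tracks_correctness(confidences: list[float], correct: list[bool]) -> float:
    """AUROC of a confidence signal against graded correctness: the probability that a
    randomly chosen correct answer received higher confidence than a randomly chosen
    wrong one. Around 0.5 the signal is noise; near 1.0 it is a usable altimeter."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")  # you need both successes and failures to learn anything
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```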
You check the altimeter, not the vibes.
Does it know when it’s lost?
Sometimes. Partially. In constrained settings. If you ask it the right way. [4]
In open-ended generation? Enough to bet your product on? Not yet, and the very existence of the hallucination-detection literature should make us cautious. [5] [6]
What to call it instead
Names are not just marketing. Names are control knobs on organizations.
If you call this “metacognition,” stakeholders hear: self-awareness, self-checking, self-correction.
If you call it “skill routing,” they hear: a pipeline choice that can fail, and needs monitoring.
Call the method what it is:
Self-labeling plus exemplar retrieval for in-context learning. [1]
Not because it sounds cool. Because it sounds limited. Limitation is a safety feature.
How to use it without getting hypnotized
Back to that 4:47 p.m. demo.
If you want the skill-labeling step to help rather than mislead, design it like a trail sign, not a therapist’s note.
Make the label auditable. Show the selected skill and the exemplars retrieved. Let a human override. Treat it as hypothesis, not confession.
Measure whether labels track outcomes. A label that sounds right but doesn’t improve correctness is theater. Track error rates by skill bucket. Look for skills that correlate with confident failure. Those are the dangerous trails.
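A sketch of that bookkeeping, assuming each answered question is logged as a record with the chosen skill label, a graded outcome, and some confidence signal. The field names are an assumption about your logging, not a standard schema.

```python
from collections import defaultdict

def skill_report(records: list[dict]) -> dict[str, dict]:
    """Aggregate runs by the skill label the model chose.
    Each record is assumed to look like {"skill": str, "correct": bool, "confidence": float}."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        buckets[r["skill"]].append(r)

    report = {}
    for skill, rows in buckets.items():
        n = len(rows)
        errors = sum(not r["correct"] for r in rows)
        # "Confident failure": wrong, but the system was sure. These are the dangerous trails.
        confident_failures = sum((not r["correct"]) and r["confidence"] > 0.8 for r in rows)
        report[skill] = {
            "n": n,
            "error_rate": errors / n,
            "confident_failure_rate": confident_failures / n,
        }
    return report
```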
Separate “plan selection” from “confidence.” The method improves plan selection via better exemplars. [1] That is different from the model reliably knowing whether its final answer is correct. For that, you need calibration work, not narrative. [4] [5]
Build a stopping mechanism. The most human-like metacognitive move is not “I know.” It is “Wait.” If your system cannot pause, verify, or defer when uncertainty spikes, you are still in the land of confident continuation. Uncertainty estimation is how you make “Wait” real. [6]
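Here is a deliberately crude version of that “Wait,” meant to show the control move rather than the estimator: sample a few answers and defer when they disagree. A real system would use a proper uncertainty estimate (Farquhar et al. cluster answers by meaning and compute semantic entropy); the sketch only shows where the pause goes. `call_llm` is again a hypothetical, sampling-enabled completion function.

```python
from collections import Counter
from typing import Callable, Optional

def answer_or_defer(
    question: str,
    call_llm: Callable[[str], str],  # hypothetical completion function with sampling enabled
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Return an answer only when repeated samples mostly agree; otherwise defer."""
    samples = [
        call_llm(f"Problem: {question}\nAnswer concisely:").strip()
        for _ in range(n_samples)
    ]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples < min_agreement:
        return None  # uncertainty spike: pause, verify, or hand the map to a human
    return answer
```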
Treat fluent self-talk as UI, not telemetry. It can help users follow along. It is not a reliable report of internal state. Your telemetry should be things you can validate.
A compass is useful. A compass that lies is worse than no compass.
The point, stated once
The NeurIPS paper shows something real: skill labels plus matched exemplars improve math benchmarks. [1] Good result.
The trap is in the translation.
“Metacognition” invites a false comfort: it can police itself. But real metacognition is not about naming the move. It is about stopping, rerouting, and paying the cost of uncertainty. [2] [3]
Self-policing is the thing we need most, and trust least.
When a model says “Skill used: exponent and root skills,” enjoy it as a routing trick. Then ask the only question that matters when you leave the trailhead:
Does it know when it’s lost?
Right now: sometimes, a little, with help.
Not enough to hand over the map.
Sources
- [1] “Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving,” Didolkar, Goyal, Ke, Guo, Valko, Lillicrap, Rezende, Bengio, Mozer, and Arora (NeurIPS 2024).
- [2] “Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry,” John H. Flavell (1979).
- [3] “Metamemory: A theoretical framework and new findings,” Thomas O. Nelson and Louis Narens (1990).
- [4] “Language Models (Mostly) Know What They Know,” Kadavath et al. (2022).
- [5] “A Survey of Confidence Estimation and Calibration in Large Language Models,” Geng et al. (NAACL 2024).
- [6] “Detecting hallucinations in large language models using semantic entropy,” Farquhar et al. (Nature 2024).
