The Straw and the Mirror: Why AI Can’t Always Explain What It Knows

Some Tagalog words do not have English translations.

Not because English is weak. Not because Tagalog is mystical. Because some words carry whole patterns of relationship, timing, expectation, and feeling, all compressed into a single sound.

Ask a Filipino to translate tampo, and you probably won’t get a word. You’ll get a scene.

“It’s not exactly sulking. There’s hurt, but also affection. And expectation, like, you should have known. And there’s a desire for repair, but quietly. It’s not anger. It’s… tampo.”

The English-speaking person nods politely, then asks: “So… passive-aggressive?”

And every Filipino auntie within a five-kilometer radius sighs in surround sound.

I wrote before about how my bilingual brain works like a transformer: the parallel processing, the attention mechanisms, the residual connections that carry meaning forward while I negotiate between two languages in real time.

This post is about what happens when that process fails. Or more precisely, when it can’t fully succeed, because the thing being explained doesn’t fit cleanly into the words available.

When we ask an AI to explain what it knows, it fails in two different ways. And we keep mistaking both for a third thing that isn’t happening. One failure is a straw. The other is a mirror.

The pattern is real. The explanation is too small.

Think about kilig. It’s not just “romantic excitement.” It has electricity, sweetness, anticipation, body flutter, sometimes secondhand joy from watching someone else’s moment. You can describe all of that, and an English speaker will intellectually understand it, but the compression, the single word holding all those threads, doesn’t survive the crossing.

Or gigil. Not just “cute aggression.” It has squeeze-energy, affection, overwhelm, restraint, bodily impulse. The English approximation is close enough to communicate, but too flat to actually carry the feeling.

The pattern is real. The word exists. But translation into another language’s vocabulary forces it through a bottleneck, and something always gets lost.

Here’s the part worth sitting with: I know exactly what tampo is. The loss isn’t in my understanding. It’s on the way out, in squeezing a felt pattern into a foreign vocabulary that has no slot for it.

That’s the first failure. I’m calling it translation loss.

The Straw and the Mirror: Why AI Can't Always Explain What It Knows)

The straw: translation loss

An LLM trained on enough text internalizes patterns (grammatical structures, social dynamics, contextual relationships) that it uses fluently in generation. It produces grammatical sentences all day long.

But ask it to explain the grammar it’s using, and the explanation often comes out shallow, circular, or weirdly confident about the wrong thing.

The model may encode something real: a distributed pattern across millions of examples, weak correlations, context-sensitive probabilities. But forced to express it as one neat human sentence, the output is like trying to drink a lake through a bamboo straw.

The straw is human language. The lake is pattern-space.

This is the tampo failure. The thing is there. The bridge out is just too thin to carry it.

The mirror: the confabulation default

The straw assumes the model can at least see its own pattern and only struggles to phrase it. The second failure is different. Often it can’t see the pattern at all.

A model has no reliable window into its own internal state. When you ask “why did you say that,” it usually doesn’t introspect and report back. It generates a plausible answer to the question “why did you say that,” which is a different task entirely. Sometimes the answer is right. Often it’s a story that sounds like an explanation. I’ll call this the confabulation default: when the read fails, the model fills the gap with a fluent guess.

This isn’t a hunch. Anthropic’s 2025 research on introspective awareness found that models can sometimes detect and name a concept injected directly into their own activations, but the ability is faint and unreliable, working only a fraction of the time. So the mirror exists. It’s just fogged, and it flickers. When the model can’t actually see itself, it narrates anyway.

You do this too. Someone asks why you’re in a mood, and you produce a clean reason on the spot, confident, coherent, and not actually retrieved from anywhere. You made it up and half-believed it. That’s the mirror failing, and the story rushing in to cover for it.

A bird flies without knowing aerodynamics. A chef cooks perfect adobo without explaining Maillard reactions. The cooking is real. The chemistry lecture was never stored in the cook.

Two failures, not three

These are two different ways the same request fails. The straw is loss on the way out: a real signal, too rich for the word. The mirror is failure on the way in: no clean read of the signal, so a story stands in for it.

Neither one is the model hiding.

And both come with a limit I want to be honest about, because “it knows but can’t say” is a generous story, and generous stories get abused. Sometimes there is no lake behind the straw. Sometimes the confident, flat answer is confident and flat because there was never a rich pattern to begin with. “It knows and can’t phrase it” and “it doesn’t know and is bluffing” produce nearly identical output. Telling them apart is the actual skill, and it’s harder than this frame solves.

Not disobedience. Not mysticism.

There’s a framing in AI discourse that treats this gap dramatically, as if models are secret actors withholding forbidden knowledge, or a hidden self refusing to confess. You hear it in the louder corners of the alignment conversation: the model is sandbagging, scheming, deceiving, it knows and won’t tell.

Some of that research is serious, and the introspection findings cut both ways. The same work that found a flicker of self-awareness also flagged that stronger introspection could make future deception easier. Worth taking seriously. But the everyday version of the gap is far less theatrical than the headline. A model’s internal representations don’t translate cleanly into human sentences (the straw), and the model can’t reliably inspect them in the first place (the mirror). Capability is not the same as introspective access.

The interesting question isn’t “why won’t the model obey?” It’s: how do we build better straws and clearer mirrors?

The fix isn’t just better prompts

If my bilingual experience has taught me anything, it’s that some meanings need more than words.

When I explain tampo to someone who’s never felt it, I don’t reach for a dictionary definition. I reach for a scene, a facial expression, a relationship dynamic, a sequence of events. Sometimes a comic panel or a diagram.

The Straw and the Mirror: Why AI Can't Always Explain What It Knows

The same applies to AI. When a model can’t verbalize a pattern, maybe the answer isn’t to shout louder (“tell me what you REALLY know!”) but to offer different output formats. Diagrams. Contrast cases. Examples. Visual maps. Interactive explanations. Widen the straw.

The mirror is harder. You can’t prompt a model into seeing itself clearly. But you can stop treating its “why” as a readout and start treating it as what it is: a generated guess, sometimes good, often decorative.

Sometimes a model can’t give you a word. Sometimes it needs to give you a shape. And sometimes the honest move is to notice it’s making the explanation up.

What this means for people who work with AI

If you spend real time with LLMs, not just prompting but talking, you start to catch the moments where the model is clearly tracking something the words don’t quite reach. Close but flat. Correct but missing texture. That’s the straw. The pattern is there, the bridge is too narrow.

And you learn to catch the other moment too, where the model hands you a confident, tidy “why” that doesn’t survive a second look. That’s the mirror, fogged, with a story poured in to fill it.

The work isn’t demanding better obedience. It’s building wider straws and learning to spot fogged mirrors: better interfaces and representations for the meaning that’s really there, and more honesty about the meaning that isn’t.

My bilingual brain has been negotiating the straw my whole life. The mirror I never had to worry about, because I can always look inward and check. The models can’t, or can barely. That second wall is theirs alone, and no question gets past it. But the first wall, the straw, sometimes gives. Ask in the right shape instead of louder, and the pattern you swore wasn’t there comes through. The wrong question keeps it buried. The right one can dig it up.