# The Straw and the Mirror: Why AI Can't Always Explain What It Knows

Some Tagalog words do not have English translations.

Not because English is weak. Not because Tagalog is mystical. Because some words carry whole patterns of relationship, timing, expectation, and feeling, all compressed into a single sound.

Ask a Filipino to translate *tampo*, and you probably won't get a word. You'll get a scene.

*"It's not exactly sulking. There's hurt, but also affection. And expectation, like, you should have known. And there's a desire for repair, but quietly. It's not anger. It's… tampo."*

The English-speaking person nods politely, then asks: "So… passive-aggressive?"

And every Filipino auntie within a five-kilometer radius sighs in surround sound.

## The pattern is real. The explanation is too small.

Think about *kilig*. It's not just "romantic excitement." It has electricity, sweetness, anticipation, body flutter, sometimes secondhand joy from watching someone else's moment. You can describe all of that, and an English speaker will intellectually understand it, but the *compression*, the single word holding all those threads, doesn't survive the crossing.

Or *gigil*. Not just "cute aggression." It has squeeze-energy, affection, overwhelm, restraint, bodily impulse. The English approximation is close enough to communicate, but too flat to actually carry the feeling.

The pattern is real. The word exists. But translation into another language's vocabulary forces it through a bottleneck, and something always gets lost.

Here's the part worth sitting with: I know *exactly* what tampo is. The loss isn't in my understanding. It's on the way out, in squeezing a felt pattern into a foreign vocabulary that has no slot for it.

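That's the first failure. I'm calling it **translation loss**. It's the straw in the title: a whole pattern pushed through an opening too narrow to carry it. A language model hits the same squeeze when it has to turn what it represents internally into human sentences.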

![The Straw and the Mirror: Why AI Can't Always Explain What It Knows](/assets/images/infog/straw-and-the-mirror.webp)

## The mirror: the confabulation default

The straw assumes the model can at least see its own pattern and only struggles to phrase it. The second failure is different. Often it can't see the pattern at all.

A model has no reliable window into its own internal state. When you ask "why did you say that," it usually doesn't introspect and report back. It *generates a plausible answer* to the question "why did you say that," which is a different task entirely. Sometimes the answer is right. Often it's a story that sounds like an explanation. I'll call this **the confabulation default**: when the read fails, the model fills the gap with a fluent guess.

This isn't a hunch. Anthropic's 2025 research on introspective awareness found that models can *sometimes* detect and name a concept injected directly into their own activations, but the ability is faint and unreliable, working only a fraction of the time. So the mirror exists. It's just fogged, and it flickers. When the model can't actually see itself, it narrates anyway.

You do this too. Someone asks why you're in a mood, and you produce a clean reason on the spot, confident, coherent, and not actually retrieved from anywhere. You made it up and half-believed it. That's the mirror failing, and the story rushing in to cover for it.

A bird flies without knowing aerodynamics. A chef cooks perfect adobo without explaining Maillard reactions. The cooking is real. The chemistry lecture was never stored in the cook.

## Not disobedience. Not mysticism.

There's a framing in AI discourse that treats this gap dramatically, as if models are secret actors withholding forbidden knowledge, or a hidden self refusing to confess. You hear it in the louder corners of the alignment conversation: the model is sandbagging, scheming, deceiving; it knows and won't tell.

Some of that research is serious, and the introspection findings cut both ways. The same work that found a flicker of self-awareness also flagged that stronger introspection could make future deception easier. Worth taking seriously. But the everyday version of the gap is far less theatrical than the headline. A model's internal representations don't translate cleanly into human sentences (the straw), and the model can't reliably inspect them in the first place (the mirror). Capability is not the same as introspective access.

The interesting question isn't "why won't the model obey?" It's: *how do we build better straws and clearer mirrors?*

## What this means for people who work with AI

If you spend real time with LLMs, not just prompting but talking, you start to catch the moments where the model is clearly tracking *something* the words don't quite reach. Close but flat. Correct but missing texture. That's the straw. The pattern is there, the bridge is too narrow.

And you learn to catch the other moment too, where the model hands you a confident, tidy "why" that doesn't survive a second look. That's the mirror, fogged, with a story poured in to fill it.

The work isn't demanding better obedience. It's building wider straws and learning to spot fogged mirrors: better interfaces and representations for the meaning that's really there, and more honesty about the meaning that isn't.

My bilingual brain has been negotiating the straw my whole life. The mirror I rarely had to worry about, because I can usually look inward and check, even if I sometimes narrate instead. The models can't, or can barely. That second wall is far taller for them, and no question gets past it. But the first wall, the straw, sometimes gives. Ask in the right shape instead of louder, and the pattern you swore wasn't there comes through. The wrong question keeps it buried. The right one can dig it up.