You Can Now See AI's Actual Reasoning. It's More Alien Than Expected.
New interpretability tools show how Claude and GPT-5 'think.' The process looks nothing like human reasoning.
The Interpretability Breakthrough
For years, AI reasoning was a black box. We knew the input and output but not what happened in between. New interpretability tools from Anthropic, OpenAI, and DeepMind are finally letting us see inside.
What we're finding is strange.
---
What the Tools Reveal
The Visible Reasoning Chain
Modern interpretability shows us:

- Which neurons activate for which concepts
- How attention flows between tokens
- What 'features' the model is representing internally
- How conclusions are assembled from components
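To make that concrete, here is a minimal sketch of the raw signals this work starts from, using the open GPT-2 model via Hugging Face's transformers library as a stand-in (the internals of Claude and GPT-5 are not publicly inspectable): a forward hook captures one layer's activations, and `output_attentions=True` exposes the attention maps.

```python
# Minimal sketch: capture one layer's activations and the attention maps
# from GPT-2 (an open stand-in; Claude/GPT-5 internals are not public).
# Requires: pip install torch transformers
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

captured = {}

def save_activations(module, inputs, output):
    # For a GPT-2 block, output[0] is the hidden-state tensor
    captured["layer_6"] = output[0].detach()

# Hook one transformer block to record which units activate for this input
hook = model.h[6].register_forward_hook(save_activations)

inputs = tokenizer("The dog chased the cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
hook.remove()

print(captured["layer_6"].shape)    # (1, num_tokens, 768): per-token activations
print(outputs.attentions[6].shape)  # (1, 12, num_tokens, num_tokens): attention per head
```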
Example: Solving a Math Problem
Human approach:

1. Read the problem
2. Identify what's being asked
3. Recall the relevant formula
4. Apply the formula step by step
5. Check the answer

AI approach (observed):

1. Pattern-match against similar problems seen in training
2. Activate 'math mode' features
3. Simultaneously consider multiple solution paths
4. Weight paths by learned probability of correctness
5. Merge paths into a single output
6. No explicit 'checking'; confidence is implicit
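Steps 4 and 5 are the strangest part, so here is a toy sketch of 'weight the paths, then merge.' The candidate paths and scores are invented for illustration; in a real model this happens implicitly across thousands of learned features, not as explicit code.

```python
# Toy illustration of steps 4-5: weight candidate solution paths, then merge.
# The paths and raw scores are invented; real models do this implicitly.
import math

candidates = {
    "algebraic rearrangement": 2.1,   # raw 'learned' score for each path
    "plug into the formula":   1.4,
    "guess and check":        -0.3,
}

# Softmax turns raw scores into implicit confidence weights
total = sum(math.exp(s) for s in candidates.values())
weights = {path: math.exp(s) / total for path, s in candidates.items()}

# 'Merging' here is just taking the dominant path; in a real model every path
# shapes the output distribution at once, and there is no explicit checking step.
best = max(weights, key=weights.get)
print(weights)
print("chosen path:", best)
```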
---

The Alien Patterns
Superposition
AI models represent multiple concepts in the same neurons simultaneously, packing far more features than they have neurons.
```
Human brain (simplified):
  Neuron A = 'dog'
  Neuron B = 'cat'

AI 'brain':
  Neuron A = 0.7 * 'dog' + 0.3 * 'furry' + 0.1 * 'pet'
  Neuron B = 0.6 * 'cat' + 0.4 * 'feline' + 0.2 * 'predator'
```
Concepts are distributed and overlapping, not discrete.
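A tiny numerical sketch makes the idea concrete: three concepts stored as overlapping directions in only two neurons, so reading any one of them back picks up interference from the others. The directions and values are illustrative, not taken from any real model.

```python
# Superposition in miniature: 3 concepts stored in only 2 neurons as
# overlapping directions, so each readout carries interference from the rest.
import numpy as np

features = {
    "dog":   np.array([1.0, 0.0]),
    "furry": np.array([0.8, 0.6]),
    "pet":   np.array([0.6, 0.8]),
}

# The 2-neuron activation pattern that 'means' dog...
activation = features["dog"]

# ...also partially reads out as the other concepts
for name, direction in features.items():
    readout = float(activation @ direction) / np.linalg.norm(direction)
    print(f"{name:5s} readout: {readout:.2f}")
# dog reads back at 1.00, but furry (0.80) and pet (0.60) are partially
# active too: the concepts overlap instead of owning separate neurons.
```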
Parallel Consideration
Humans think somewhat linearly. AI models consider many possibilities simultaneously:
Human: 'Should I use recursion or iteration? Let me think about recursion first...'

AI: [Simultaneously weighing recursion, iteration, and 12 other approaches, combining probabilities from training data about which works best for this problem type]

Shortcut Reasoning
AI often reaches correct conclusions through 'shortcuts' that skip human-style logical steps:
Human solving: A → B → C → D → Answer
AI solving: A → [something we don't fully understand] → Answer

The answer is correct, but the path is not how humans would get there.
---
Specific Findings
The 'Truthfulness' Circuit
Researchers found specific neurons that activate when Claude is about to make a claim it's uncertain about. Amplifying these neurons increases hedging language ('I think', 'probably').
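Claude's internals are not publicly accessible, but the general intervention behind findings like this (adding or amplifying a direction in a model's hidden states and watching the output shift) can be sketched on an open model. In the sketch below, the layer choice and the crude contrast-derived 'hedging' direction are assumptions for illustration, not the actual truthfulness circuit.

```python
# Generic 'amplify a direction' intervention, sketched on GPT-2. The layer
# and the crude contrast-derived 'hedging' direction are illustrative
# assumptions, not Claude's actual uncertainty neurons.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mean_hidden(text, layer=6):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)

# Rough 'hedging' direction: hedged phrasing minus confident phrasing
direction = mean_hidden("I think it is probably true that") - mean_hidden("It is certainly true that")
direction = direction / direction.norm()

def amplify(module, inputs, output, scale=8.0):
    # Add the scaled direction to this block's hidden states on every forward pass
    return (output[0] + scale * direction,) + output[1:]

hook = model.transformer.h[6].register_forward_hook(amplify)
prompt = tokenizer("The capital of Australia is", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**prompt, max_new_tokens=12, do_sample=False)
hook.remove()
print(tokenizer.decode(generated[0]))
```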
The 'Code Mode' Switch
When processing code vs. natural language, entirely different sets of neurons activate—almost like two different models.
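You can run a rough version of this check yourself on an open model: compare which hidden units light up most strongly for a code snippet versus an ordinary sentence. The layer and top-k cutoff below are arbitrary choices, and GPT-2 again stands in for the proprietary models.

```python
# Rough check of 'different neurons for code vs prose' on GPT-2: compare
# which hidden units are most active for a code snippet and for a sentence.
# The layer and top-k cutoff are arbitrary.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def top_units(text, layer=8, k=50):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
    mean_act = hidden.abs().mean(dim=1).squeeze(0)   # average |activation| per unit
    return set(torch.topk(mean_act, k).indices.tolist())

code_units  = top_units("def add(a, b):\n    return a + b")
prose_units = top_units("The quick brown fox jumps over the lazy dog.")

overlap = len(code_units & prose_units) / 50
print(f"Top-unit overlap between code and prose: {overlap:.0%}")
```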
The 'Entity Tracking' System
When reading stories, specific attention patterns track which pronouns refer to which characters. This happens in layers 15-20 (of 96) in Claude.
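The layer numbers refer to Claude, which outsiders cannot inspect, but the method generalizes: pick the pronoun's token position and read off how much attention it pays to each earlier token. A sketch on GPT-2, with an arbitrarily chosen layer and head:

```python
# Read attention from a pronoun back to the tokens before it. GPT-2 stands
# in for Claude; the layer and head below are arbitrary examples to poke at.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "Alice met Bob at the station and she handed him the keys"
ids = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(ids["input_ids"][0])

with torch.no_grad():
    attentions = model(**ids).attentions   # one (1, heads, seq, seq) tensor per layer

pronoun_idx = tokens.index("Ġshe")          # GPT-2 marks word starts with 'Ġ'
layer, head = 8, 5
weights = attentions[layer][0, head, pronoun_idx]

for token, weight in zip(tokens, weights.tolist()):
    print(f"{token:>10s}  {weight:.3f}")    # how much 'she' attends to each token
```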
---
Why It Matters
For AI Safety
If we can see how AI reasons, we can:

- Detect deceptive reasoning
- Identify potential failure modes
- Verify alignment with human values
- Catch problems before deployment
For Capability
Understanding reasoning lets us:

- Improve weak areas
- Transfer skills between domains
- Build more efficient models
- Design better training
For Science
AI might be discovering new ways to think that we can learn from:

- Novel problem-solving strategies
- Different ways to represent knowledge
- Insights about cognition generally
---
The Philosophical Implications
Is It Really 'Thinking'?
The processes we're observing are:

- Systematic and consistent
- Capable of producing novel solutions
- Responsive to logical constraints
But they're also:

- Based on pattern matching
- Lacking explicit goals or intentions
- Without self-awareness (as far as we can tell)
What Is Understanding?
Claude can explain quantum mechanics more fluently than many physicists. But when we look at the internal process, there's no 'aha' moment, just patterns of activation.
Does it 'understand' quantum mechanics or just predict what humans who understand it would say?
---
Researcher Perspectives
'We built these systems, but we're now studying them like we study the brain—as natural phenomena we don't fully understand.' — Anthropic Interpretability Lead
'The models are doing something that works. Whether we call it 'thinking' or something else is almost a semantic question.' — DeepMind Researcher
'I expected AI reasoning to be primitive compared to humans. It's not primitive—it's just different. Alien is the right word.' — Stanford Professor
---
Tools You Can Use
Anthropic's Interpretability Tools
- Feature visualization: See what concepts neurons represent
- Attention patterns: Track information flow
- Circuit analysis: Identify specialized subnetworks

OpenAI's Approach
- Chain-of-thought visibility: See the model's reasoning steps
- Neuron explorer: Browse activations for any input
- Automated interpretability: Use AI to interpret AI
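Access to these specific tools varies, but a do-it-yourself stand-in for feature visualization and neuron browsing is easy to sketch: rank a handful of text snippets by how strongly they activate one hidden unit, then read the top snippets to guess what that unit responds to. The model, layer, unit, and snippets below are arbitrary choices, not any vendor's tooling.

```python
# DIY stand-in for feature visualization: rank snippets by how strongly they
# activate one hidden unit, then inspect the top snippets. Model, layer,
# unit, and snippets are arbitrary illustrative choices.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

LAYER, UNIT = 8, 300   # which hidden unit to inspect

snippets = [
    "The stock market fell sharply on Tuesday.",
    "for i in range(10): print(i)",
    "She poured the coffee and sat by the window.",
    "The mitochondria is the powerhouse of the cell.",
]

def unit_activation(text):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, :, UNIT].max().item()   # peak activation of the unit over tokens

for text in sorted(snippets, key=unit_activation, reverse=True):
    print(f"{unit_activation(text):7.2f}  {text}")
```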
---

What This Means for Users
Right Now
- AI explanations of its reasoning may not reflect its actual process
- Correct outputs don't guarantee correct reasoning
- 'Show your work' prompts can create post-hoc rationalization

Going Forward
- Better interpretability → more trustworthy AI
- Understanding failures → preventing failures
- Knowing how AI thinks → knowing when to trust it

---
The Bottom Line
We've built minds that work. Now we're learning that they work in ways we didn't expect and don't fully understand.
This is simultaneously exciting (new forms of cognition!) and concerning (we don't know what's happening inside).
The interpretability work is essential. We're building the tools to understand the tools we've already built.
---
Related Reading
- Frontier Models Are Now Improving Themselves. Researchers Aren't Sure How to Feel.
- ChatGPT vs Claude vs Gemini: The Definitive 2026 Comparison Guide
- Which AI Hallucinates the Least? We Tested GPT-5, Claude, Gemini, and Llama on 10,000 Facts.
- Claude's Extended Thinking Mode Now Produces PhD-Level Research Papers in Hours
- Anthropic's Claude 4 Shows 'Genuine Reasoning' in New Study. Researchers Aren't Sure What That Means.