You Can Now See AI's Actual Reasoning. It's More Alien Than Expected.
New interpretability tools show how Claude and GPT-5 'think.' The process looks nothing like human reasoning.
The Interpretability Breakthrough
For years, AI reasoning was a black box. We knew the input and output but not what happened in between. New interpretability tools from Anthropic, OpenAI, and DeepMind are finally letting us see inside.
What we're finding is strange.
---
What the Tools Reveal
The Visible Reasoning Chain
Modern interpretability shows us:

- Which neurons activate for which concepts
- How attention flows between tokens
- What 'features' the model is representing internally
- How conclusions are assembled from components
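To make that concrete, here is a minimal sketch of the raw signals this work starts from, using the open GPT-2 model via Hugging Face's transformers library as a stand-in (the internals of Claude and GPT-5 are not publicly inspectable): a forward hook captures one layer's activations, and `output_attentions=True` exposes the attention maps.

```python
# Minimal sketch: capture one layer's activations and the attention maps
# from GPT-2 (an open stand-in; Claude/GPT-5 internals are not public).
# Requires: pip install torch transformers
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

captured = {}

def save_activations(module, inputs, output):
    # For a GPT-2 block, output[0] is the hidden-state tensor
    captured["layer_6"] = output[0].detach()

# Hook one transformer block to record which units activate for this input
hook = model.h[6].register_forward_hook(save_activations)

inputs = tokenizer("The dog chased the cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
hook.remove()

print(captured["layer_6"].shape)    # (1, num_tokens, 768): per-token activations
print(outputs.attentions[6].shape)  # (1, 12, num_tokens, num_tokens): attention per head
```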
Example: Solving a Math Problem
Human approach:

1. Read the problem
2. Identify what's being asked
3. Recall the relevant formula
4. Apply the formula step by step
5. Check the answer

AI approach (observed):

1. Pattern-match against similar problems seen in training
2. Activate 'math mode' features
3. Simultaneously consider multiple solution paths
4. Weight paths by learned probability of correctness
5. Merge paths into a single output
6. No explicit 'checking'; confidence is implicit
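Steps 4 and 5 are the strangest part, so here is a toy sketch of 'weight the paths, then merge.' The candidate paths and scores are invented for illustration; in a real model this happens implicitly across thousands of learned features, not as explicit code.

```python
# Toy illustration of steps 4-5: weight candidate solution paths, then merge.
# The paths and raw scores are invented; real models do this implicitly.
import math

candidates = {
    "algebraic rearrangement": 2.1,   # raw 'learned' score for each path
    "plug into the formula":   1.4,
    "guess and check":        -0.3,
}

# Softmax turns raw scores into implicit confidence weights
total = sum(math.exp(s) for s in candidates.values())
weights = {path: math.exp(s) / total for path, s in candidates.items()}

# 'Merging' here is just taking the dominant path; in a real model every path
# shapes the output distribution at once, and there is no explicit checking step.
best = max(weights, key=weights.get)
print(weights)
print("chosen path:", best)
```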
---

The Alien Patterns
Superposition
AI models represent multiple concepts in the same neurons simultaneously, packing far more features than they have neurons.
```
Human brain (simplified):
  Neuron A = 'dog'
  Neuron B = 'cat'

AI 'brain':
  Neuron A = 0.7 * 'dog' + 0.3 * 'furry' + 0.1 * 'pet'
  Neuron B = 0.6 * 'cat' + 0.4 * 'feline' + 0.2 * 'predator'
```
Concepts are distributed and overlapping, not discrete.
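A tiny numerical sketch makes the idea concrete: three concepts stored as overlapping directions in only two neurons, so reading any one of them back picks up interference from the others. The directions and values are illustrative, not taken from any real model.

```python
# Superposition in miniature: 3 concepts stored in only 2 neurons as
# overlapping directions, so each readout carries interference from the rest.
import numpy as np

features = {
    "dog":   np.array([1.0, 0.0]),
    "furry": np.array([0.8, 0.6]),
    "pet":   np.array([0.6, 0.8]),
}

# The 2-neuron activation pattern that 'means' dog...
activation = features["dog"]

# ...also partially reads out as the other concepts
for name, direction in features.items():
    readout = float(activation @ direction) / np.linalg.norm(direction)
    print(f"{name:5s} readout: {readout:.2f}")
# dog reads back at 1.00, but furry (0.80) and pet (0.60) are partially
# active too: the concepts overlap instead of owning separate neurons.
```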
Parallel Consideration
Humans think somewhat linearly. AI models consider many possibilities simultaneously:
Human: 'Should I use recursion or iteration? Let me think about recursion first...'

AI: [Simultaneously weighing recursion, iteration, and 12 other approaches, combining probabilities from training data about which works best for this problem type]

Shortcut Reasoning
AI often reaches correct conclusions through 'shortcuts' that skip human-style logical steps:
Human solving: A → B → C → D → Answer
AI solving: A → [something we don't fully understand] → Answer

The answer is correct, but the path is not how humans would get there.
---
Specific Findings
The 'Truthfulness' Circuit
Researchers found specific neurons that activate when Claude is about to make a claim it's uncertain about. Amplifying these neurons increases hedging language ('I think', 'probably').
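Claude's internals are not publicly accessible, but the general intervention behind findings like this (adding or amplifying a direction in a model's hidden states and watching the output shift) can be sketched on an open model. In the sketch below, the layer choice and the crude contrast-derived 'hedging' direction are assumptions for illustration, not the actual truthfulness circuit.

```python
# Generic 'amplify a direction' intervention, sketched on GPT-2. The layer
# and the crude contrast-derived 'hedging' direction are illustrative
# assumptions, not Claude's actual uncertainty neurons.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mean_hidden(text, layer=6):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)

# Rough 'hedging' direction: hedged phrasing minus confident phrasing
direction = mean_hidden("I think it is probably true that") - mean_hidden("It is certainly true that")
direction = direction / direction.norm()

def amplify(module, inputs, output, scale=8.0):
    # Add the scaled direction to this block's hidden states on every forward pass
    return (output[0] + scale * direction,) + output[1:]

hook = model.transformer.h[6].register_forward_hook(amplify)
prompt = tokenizer("The capital of Australia is", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**prompt, max_new_tokens=12, do_sample=False)
hook.remove()
print(tokenizer.decode(generated[0]))
```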
The 'Code Mode' Switch
When processing code vs. natural language, entirely different sets of neurons activate—almost like two different models.
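You can run a rough version of this check yourself on an open model: compare which hidden units light up most strongly for a code snippet versus an ordinary sentence. The layer and top-k cutoff below are arbitrary choices, and GPT-2 again stands in for the proprietary models.

```python
# Rough check of 'different neurons for code vs prose' on GPT-2: compare
# which hidden units are most active for a code snippet and for a sentence.
# The layer and top-k cutoff are arbitrary.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def top_units(text, layer=8, k=50):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
    mean_act = hidden.abs().mean(dim=1).squeeze(0)   # average |activation| per unit
    return set(torch.topk(mean_act, k).indices.tolist())

code_units  = top_units("def add(a, b):\n    return a + b")
prose_units = top_units("The quick brown fox jumps over the lazy dog.")

overlap = len(code_units & prose_units) / 50
print(f"Top-unit overlap between code and prose: {overlap:.0%}")
```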
The 'Entity Tracking' System
When reading stories, specific attention patterns track which pronouns refer to which characters. This happens in layers 15-20 (of 96) in Claude.
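The layer numbers refer to Claude, which outsiders cannot inspect, but the method generalizes: pick the pronoun's token position and read off how much attention it pays to each earlier token. A sketch on GPT-2, with an arbitrarily chosen layer and head:

```python
# Read attention from a pronoun back to the tokens before it. GPT-2 stands
# in for Claude; the layer and head below are arbitrary examples to poke at.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "Alice met Bob at the station and she handed him the keys"
ids = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(ids["input_ids"][0])

with torch.no_grad():
    attentions = model(**ids).attentions   # one (1, heads, seq, seq) tensor per layer

pronoun_idx = tokens.index("Ġshe")          # GPT-2 marks word starts with 'Ġ'
layer, head = 8, 5
weights = attentions[layer][0, head, pronoun_idx]

for token, weight in zip(tokens, weights.tolist()):
    print(f"{token:>10s}  {weight:.3f}")    # how much 'she' attends to each token
```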
---
Why It Matters
For AI Safety
If we can see how AI reasons, we can:

- Detect deceptive reasoning
- Identify potential failure modes
- Verify alignment with human values
- Catch problems before deployment
For Capability
Understanding reasoning lets us:

- Improve weak areas
- Transfer skills between domains
- Build more efficient models
- Design better training
For Science
AI might be discovering new ways to think that we can learn from:

- Novel problem-solving strategies
- Different ways to represent knowledge
- Insights about cognition generally
---
The Philosophical Implications
Is It Really 'Thinking'?
The processes we're observing are:

- Systematic and consistent
- Capable of producing novel solutions
- Responsive to logical constraints
But they're also:

- Based on pattern matching
- Lacking explicit goals or intentions
- Without self-awareness (as far as we can tell)
What Is Understanding?
Claude can explain quantum mechanics more fluently than many physicists. But when we look at the internal process, there's no 'aha' moment, just patterns of activation.
Does it 'understand' quantum mechanics or just predict what humans who understand it would say?
---
Researcher Perspectives
'We built these systems, but we're now studying them like we study the brain—as natural phenomena we don't fully understand.' — Anthropic Interpretability Lead
'The models are doing something that works. Whether we call it 'thinking' or something else is almost a semantic question.' — DeepMind Researcher
'I expected AI reasoning to be primitive compared to humans. It's not primitive—it's just different. Alien is the right word.' — Stanford Professor
---
Tools You Can Use
Anthropic's Interpretability Tools
- Feature visualization: See what concepts neurons represent
- Attention patterns: Track information flow
- Circuit analysis: Identify specialized subnetworks

OpenAI's Approach
- Chain-of-thought visibility: See the model's reasoning steps
- Neuron explorer: Browse activations for any input
- Automated interpretability: Use AI to interpret AI
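Access to these specific tools varies, but a do-it-yourself stand-in for feature visualization and neuron browsing is easy to sketch: rank a handful of text snippets by how strongly they activate one hidden unit, then read the top snippets to guess what that unit responds to. The model, layer, unit, and snippets below are arbitrary choices, not any vendor's tooling.

```python
# DIY stand-in for feature visualization: rank snippets by how strongly they
# activate one hidden unit, then inspect the top snippets. Model, layer,
# unit, and snippets are arbitrary illustrative choices.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

LAYER, UNIT = 8, 300   # which hidden unit to inspect

snippets = [
    "The stock market fell sharply on Tuesday.",
    "for i in range(10): print(i)",
    "She poured the coffee and sat by the window.",
    "The mitochondria is the powerhouse of the cell.",
]

def unit_activation(text):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, :, UNIT].max().item()   # peak activation of the unit over tokens

for text in sorted(snippets, key=unit_activation, reverse=True):
    print(f"{unit_activation(text):7.2f}  {text}")
```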
---

What This Means for Users
Right Now
- AI explanations of its reasoning may not reflect its actual process
- Correct outputs don't guarantee correct reasoning
- 'Show your work' prompts can create post-hoc rationalization

Going Forward
- Better interpretability → more trustworthy AI
- Understanding failures → preventing failures
- Knowing how AI thinks → knowing when to trust it

---
The Bottom Line
We've built minds that work. Now we're learning that they work in ways we didn't expect and don't fully understand.
This is simultaneously exciting (new forms of cognition!) and concerning (we don't know what's happening inside).
The interpretability work is essential. We're building the tools to understand the tools we've already built.
---
Related Reading
- Frontier Models Are Now Improving Themselves. Researchers Aren't Sure How to Feel.
- ChatGPT vs Claude vs Gemini: The Definitive 2026 Comparison Guide
- Which AI Hallucinates the Least? We Tested GPT-5, Claude, Gemini, and Llama on 10,000 Facts.
- Claude's Extended Thinking Mode Now Produces PhD-Level Research Papers in Hours
- Anthropic's Claude 4 Shows 'Genuine Reasoning' in New Study. Researchers Aren't Sure What That Means.