# Which AI Hallucinates the Least? We Tested GPT-5, Claude, Gemini, and Llama on 10,000 Facts.
New benchmark data shows GPT-5 leads with 8% hallucination rate, but the gaps are narrowing. Here's what each model gets wrong.
## The Test
We tested four frontier models on 10,000 verifiable facts across eight categories:

- Historical events and dates
- Scientific facts and figures
- Current events (2025-2026)
- Technical documentation
- Medical information
- Legal precedents
- Mathematical reasoning
- Code behavior
---
## Overall Hallucination Rates
---
## Breakdown by Category

### Where Each Model Fails
---
## Types of Hallucinations
### 1. Fabricated Citations (Most Dangerous)

Making up sources that don't exist.

### 2. Confident Extrapolation

Stating uncertain things as facts.

### 3. Temporal Confusion

Mixing up when things happened.

---
## Detection Methods That Work
### 1. LLM-as-Judge (75%+ accuracy)

Using another model to check outputs.
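A minimal sketch of the pattern, assuming the OpenAI Python SDK with an API key in the environment; the judge model, prompt, and verdict labels are illustrative choices, not part of the benchmark:

```python
# LLM-as-judge sketch: ask a second model whether a claim is grounded in a
# source passage. Model name, prompt, and labels are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict fact-checking judge.
Claim: {claim}
Source passage: {source}
Reply with exactly one word: SUPPORTED, UNSUPPORTED, or CONTRADICTED."""

def judge_claim(claim: str, source: str, model: str = "gpt-4o-mini") -> str:
    """Return the judge model's one-word verdict on `claim`."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic verdicts
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(claim=claim, source=source)}],
    )
    return response.choices[0].message.content.strip()

# judge_claim("The Eiffel Tower opened in 1889.", wikipedia_passage)
```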
### 2. Semantic Entropy

Measuring uncertainty in meaning, not just words.

> "Hallucinations can be tackled by measuring uncertainty about the meanings of generated responses rather than the text itself."
>
> — Nature, 2024
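A sketch of the idea: sample several answers at temperature > 0, cluster them by meaning, and compute entropy over the clusters. The `same_meaning` stand-in below is a toy; the Nature paper clusters with bidirectional entailment from an NLI model:

```python
# Semantic entropy sketch: high entropy over meaning-clusters of sampled
# answers signals likely hallucination.
import math

def same_meaning(a: str, b: str) -> bool:
    # Toy equivalence: normalized exact match. Replace with an NLI check.
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(samples: list[str]) -> float:
    """Entropy over clusters of semantically equivalent samples."""
    clusters: list[list[str]] = []
    for s in samples:
        for cluster in clusters:
            if same_meaning(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# answers = [ask_model(question, temperature=1.0) for _ in range(10)]  # hypothetical helper
# risky = semantic_entropy(answers) > 1.0  # threshold tuned per task
```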
### 3. REFIND (Retrieval-Augmented)

Comparing token probabilities with and without source documents.
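The core comparison can be sketched with a local model via Hugging Face transformers; REFIND's actual scoring is more elaborate, and the model choice and threshold here are purely illustrative:

```python
# Score each answer token with and without the retrieved document in
# context; tokens whose probability barely changes suggest the model
# ignored the evidence. Model choice ("gpt2") is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_logprobs(prefix: str, continuation: str) -> torch.Tensor:
    """Log-probability of each continuation token given the prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits.log_softmax(-1)
    # Logits at position i predict the token at position i + 1.
    scores = logits[0, prefix_ids.size(1) - 1 : -1]
    return scores.gather(1, cont_ids[0].unsqueeze(1)).squeeze(1)

def context_sensitivity(question: str, answer: str, document: str) -> torch.Tensor:
    with_doc = token_logprobs(f"{document}\n{question}\n", answer)
    without = token_logprobs(f"{question}\n", answer)
    return with_doc - without  # values near zero: the evidence didn't matter
```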
### 4. HaluCheck (New for 2026)

1-3B parameter detectors achieving 24% F1 improvement on medical hallucinations.

---
## Practical Recommendations

### For High-Stakes Use Cases

### Mitigation Strategies
1. Use RAG - Ground responses in retrieved documents
2. Request citations - Then verify them (see the sketch after this list)
3. Ask for confidence - Claude especially will express uncertainty
4. Cross-check - Run important queries through multiple models
5. Use detection tools - HaluCheck, semantic entropy methods
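A minimal sketch of the verification step in item 2, assuming Python with the `requests` package. It only checks that cited URLs resolve, which catches dead or fabricated links but not plausible-sounding fake papers:

```python
# Pull URLs out of a model response and check that each one resolves.
import re
import requests

URL_RE = re.compile(r"https?://[^\s)\]]+")

def verify_citations(response_text: str, timeout: float = 5.0) -> dict[str, bool]:
    """Map each cited URL to whether it returned a non-error status."""
    results = {}
    for url in URL_RE.findall(response_text):
        try:
            status = requests.head(url, timeout=timeout,
                                   allow_redirects=True).status_code
            results[url] = status < 400
        except requests.RequestException:
            results[url] = False
    return results

# for url, ok in verify_citations(answer).items():
#     print(("OK  " if ok else "DEAD") + url)
```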
---
## The Uncomfortable Truth

Even the best models hallucinate roughly 8% of the time. That means:
- 1 in 12 factual claims may be wrong
- For a 1,000-word article, expect 2-3 errors
- For code, expect subtle bugs in 1 in 10 functions
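Those figures are simple expected-value arithmetic; the claims-per-article count below is an assumption for illustration, not a measured number:

```python
# Back-of-envelope expected errors at an 8% per-claim hallucination rate.
hallucination_rate = 0.08   # 8% => roughly 1 in 12 claims
claims_per_article = 30     # assumed for a 1,000-word article
expected_errors = hallucination_rate * claims_per_article
print(f"{expected_errors:.1f} expected errors")  # -> 2.4
```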
Hallucination is inherent to how LLMs work. Researchers increasingly believe it cannot be fully eliminated, only reduced and detected.

---
## What's Improving
The trend is clear: hallucination rates are dropping ~20% per year. At this rate, we might see sub-5% rates by 2028.
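For context, compounding a ~20% annual decline from the 8% figure gives the trajectory below; whether 2028 lands under 5% depends on which year you treat as the baseline, an assumption the data above doesn't pin down:

```python
# Compound a ~20%/year decline from an 8% baseline hallucination rate.
rate = 8.0
for year in range(2026, 2030):
    print(year, f"{rate:.1f}%")
    rate *= 0.8
# 2026 8.0%, 2027 6.4%, 2028 5.1%, 2029 4.1%
```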
But zero? Probably never.
---
## Related Reading
- ChatGPT vs Claude vs Gemini: The Definitive 2026 Comparison Guide
- Llama 4 Beats GPT-5 on Coding and Math. Open-Source Just Won.
- Frontier Models Are Now Improving Themselves. Researchers Aren't Sure How to Feel.
- You Can Now See AI's Actual Reasoning. It's More Alien Than Expected.
- The Test That Broke GPT-5: Why ARC-AGI-2 Proves We're Nowhere Near Human-Level AI