# Which AI Hallucinates the Least? We Tested GPT-5, Claude, Gemini, and Llama on 10,000 Facts.
New benchmark data shows GPT-5 leads with 8% hallucination rate, but the gaps are narrowing. Here's what each model gets wrong.
## The Test
We tested four frontier models on 10,000 verifiable facts across eight categories:

- Historical events and dates
- Scientific facts and figures
- Current events (2025-2026)
- Technical documentation
- Medical information
- Legal precedents
- Mathematical reasoning
- Code behavior
---
## Overall Hallucination Rates
---
## Breakdown by Category

### Where Each Model Fails
---
## Types of Hallucinations
### 1. Fabricated Citations (Most Dangerous)

Making up sources that don't exist.

### 2. Confident Extrapolation

Stating uncertain things as facts.

### 3. Temporal Confusion

Mixing up when things happened.

---
## Detection Methods That Work
### 1. LLM-as-Judge (75%+ accuracy)

Using another model to check outputs.
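A minimal sketch of the pattern, assuming the OpenAI Python SDK with an API key in the environment; the judge model, prompt, and verdict labels are illustrative choices, not part of the benchmark:

```python
# LLM-as-judge sketch: ask a second model whether a claim is grounded in a
# source passage. Model name, prompt, and labels are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict fact-checking judge.
Claim: {claim}
Source passage: {source}
Reply with exactly one word: SUPPORTED, UNSUPPORTED, or CONTRADICTED."""

def judge_claim(claim: str, source: str, model: str = "gpt-4o-mini") -> str:
    """Return the judge model's one-word verdict on `claim`."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic verdicts
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(claim=claim, source=source)}],
    )
    return response.choices[0].message.content.strip()

# judge_claim("The Eiffel Tower opened in 1889.", wikipedia_passage)
```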
### 2. Semantic Entropy

Measuring uncertainty in meaning, not just words.

> "Hallucinations can be tackled by measuring uncertainty about the meanings of generated responses rather than the text itself."
>
> — Nature, 2024
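A sketch of the idea: sample several answers at temperature > 0, cluster them by meaning, and compute entropy over the clusters. The `same_meaning` stand-in below is a toy; the Nature paper clusters with bidirectional entailment from an NLI model:

```python
# Semantic entropy sketch: high entropy over meaning-clusters of sampled
# answers signals likely hallucination.
import math

def same_meaning(a: str, b: str) -> bool:
    # Toy equivalence: normalized exact match. Replace with an NLI check.
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(samples: list[str]) -> float:
    """Entropy over clusters of semantically equivalent samples."""
    clusters: list[list[str]] = []
    for s in samples:
        for cluster in clusters:
            if same_meaning(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# answers = [ask_model(question, temperature=1.0) for _ in range(10)]  # hypothetical helper
# risky = semantic_entropy(answers) > 1.0  # threshold tuned per task
```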
### 3. REFIND (Retrieval-Augmented)

Comparing token probabilities with and without source documents.
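The core comparison can be sketched with a local model via Hugging Face transformers; REFIND's actual scoring is more elaborate, and the model choice and threshold here are purely illustrative:

```python
# Score each answer token with and without the retrieved document in
# context; tokens whose probability barely changes suggest the model
# ignored the evidence. Model choice ("gpt2") is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_logprobs(prefix: str, continuation: str) -> torch.Tensor:
    """Log-probability of each continuation token given the prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits.log_softmax(-1)
    # Logits at position i predict the token at position i + 1.
    scores = logits[0, prefix_ids.size(1) - 1 : -1]
    return scores.gather(1, cont_ids[0].unsqueeze(1)).squeeze(1)

def context_sensitivity(question: str, answer: str, document: str) -> torch.Tensor:
    with_doc = token_logprobs(f"{document}\n{question}\n", answer)
    without = token_logprobs(f"{question}\n", answer)
    return with_doc - without  # values near zero: the evidence didn't matter
```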
### 4. HaluCheck (New for 2026)

1-3B parameter detectors achieving 24% F1 improvement on medical hallucinations.

---
## Practical Recommendations

### For High-Stakes Use Cases

### Mitigation Strategies
1. Use RAG - Ground responses in retrieved documents
2. Request citations - Then verify them (see the sketch after this list)
3. Ask for confidence - Claude especially will express uncertainty
4. Cross-check - Run important queries through multiple models
5. Use detection tools - HaluCheck, semantic entropy methods
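A minimal sketch of the verification step in item 2, assuming Python with the `requests` package. It only checks that cited URLs resolve, which catches dead or fabricated links but not plausible-sounding fake papers:

```python
# Pull URLs out of a model response and check that each one resolves.
import re
import requests

URL_RE = re.compile(r"https?://[^\s)\]]+")

def verify_citations(response_text: str, timeout: float = 5.0) -> dict[str, bool]:
    """Map each cited URL to whether it returned a non-error status."""
    results = {}
    for url in URL_RE.findall(response_text):
        try:
            status = requests.head(url, timeout=timeout,
                                   allow_redirects=True).status_code
            results[url] = status < 400
        except requests.RequestException:
            results[url] = False
    return results

# for url, ok in verify_citations(answer).items():
#     print(("OK  " if ok else "DEAD") + url)
```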
---
## The Uncomfortable Truth

Even the best models hallucinate roughly 8% of the time. That means:
- 1 in 12 factual claims may be wrong
- For a 1,000-word article, expect 2-3 errors
- For code, expect subtle bugs in 1 in 10 functions
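Those figures are simple expected-value arithmetic; the claims-per-article count below is an assumption for illustration, not a measured number:

```python
# Back-of-envelope expected errors at an 8% per-claim hallucination rate.
hallucination_rate = 0.08   # 8% => roughly 1 in 12 claims
claims_per_article = 30     # assumed for a 1,000-word article
expected_errors = hallucination_rate * claims_per_article
print(f"{expected_errors:.1f} expected errors")  # -> 2.4
```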
Hallucination is inherent to how LLMs work. Researchers increasingly believe it cannot be fully eliminated, only reduced and detected.

---
## What's Improving
The trend is clear: hallucination rates are dropping ~20% per year. At this rate, we might see sub-5% rates by 2028.
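For context, compounding a ~20% annual decline from the 8% figure gives the trajectory below; whether 2028 lands under 5% depends on which year you treat as the baseline, an assumption the data above doesn't pin down:

```python
# Compound a ~20%/year decline from an 8% baseline hallucination rate.
rate = 8.0
for year in range(2026, 2030):
    print(year, f"{rate:.1f}%")
    rate *= 0.8
# 2026 8.0%, 2027 6.4%, 2028 5.1%, 2029 4.1%
```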
But zero? Probably never.
---
## Related Reading
- ChatGPT vs Claude vs Gemini: The Definitive 2026 Comparison Guide
- Llama 4 Beats GPT-5 on Coding and Math. Open-Source Just Won.
- Frontier Models Are Now Improving Themselves. Researchers Aren't Sure How to Feel.
- You Can Now See AI's Actual Reasoning. It's More Alien Than Expected.
- The Test That Broke GPT-5: Why ARC-AGI-2 Proves We're Nowhere Near Human-Level AI