Claude Opus 4 Sets New Record on Agentic Coding: 72% on SWE-Bench Verified
Anthropic's latest model autonomously fixes real GitHub issues better than any AI before it. Developers report it now handles multi-file refactors that previously took hours of human work.
The New Benchmark King
Claude Opus 4 has achieved what many thought was years away: 72.1% on SWE-Bench Verified, the gold standard for measuring AI coding ability on real-world software engineering tasks.
---
What SWE-Bench Actually Tests
Unlike simple coding benchmarks, SWE-Bench Verified presents AI with real GitHub issues from popular open-source projects:
- Projects such as Django, Flask, Requests, and Scikit-learn
- Issues that required actual human developers to fix
- Multi-file changes across complex codebases
- Test suites that must pass after the fix
The 'Verified' subset contains 500 issues confirmed to be solvable and properly tested.
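To make the task format concrete, here is a minimal sketch of what one Verified instance looks like when loaded from the public Hugging Face release. It assumes you have the `datasets` library installed; the field names match the published dataset, but treat the exact values shown in comments as illustrative:

```python
# A minimal sketch: inspecting one SWE-Bench Verified task.
# Assumes `pip install datasets`; the dataset ID is the public
# princeton-nlp release on Hugging Face.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["repo"])               # e.g. "django/django"
print(task["base_commit"])        # the commit the agent starts from
print(task["problem_statement"])  # the original GitHub issue text
print(task["FAIL_TO_PASS"])       # tests that must pass after the fix
print(task["PASS_TO_PASS"])       # tests that must keep passing
```

An agent is scored as solving the issue only if every `FAIL_TO_PASS` test passes after its patch is applied, while nothing in `PASS_TO_PASS` regresses.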
---
What 72% Actually Means
Before Claude Opus 4:
- AI could fix simple, isolated bugs
- Multi-file refactors required heavy human guidance
- Complex architectural changes were out of reach

After Claude Opus 4:
- Autonomous multi-file refactoring
- Understanding of project architecture and conventions
- Proper test writing and validation
- Git workflow management (branches, commits, PRs)

---
Developer Reactions
'I gave it a ticket that would have taken me 4 hours. It was done in 12 minutes. The code was better than what I would have written.' — Staff Engineer at Stripe
'We're rethinking what "junior developer" means. Claude can now do 80% of what we hired juniors for.' — Engineering Manager at a Series B startup
'The scary part isn't that it writes code. It's that it understands why the code should be written that way.' — Principal Engineer at Google
---
How It Works: Agentic Coding
Claude Opus 4 doesn't just complete code—it operates as an autonomous agent:
1. Understands the issue - Reads the GitHub issue, comments, and related code
2. Explores the codebase - Navigates files, understands architecture
3. Plans the fix - Determines which files need changes
4. Implements changes - Writes code across multiple files
5. Runs tests - Validates the fix works
6. Iterates if needed - Fixes test failures autonomously
7. Creates PR - Writes description, requests review
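The outer loop of an agent like this can be reproduced with the public Messages API and tool use. Below is a minimal sketch, assuming a single `run_shell` tool and an `issue.md` file holding the ticket text; a real harness would add sandboxing, dedicated file-editing tools, and error handling:

```python
import subprocess
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One tool: run a shell command in the checked-out repository.
TOOLS = [{
    "name": "run_shell",
    "description": "Run a shell command in the repository and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=300)
    return (result.stdout + result.stderr)[-10_000:]  # keep only the tail

messages = [{"role": "user", "content":
             "Fix the attached GitHub issue, run the tests, and report a diff.\n\n"
             + open("issue.md").read()}]

while True:
    response = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # no more tool calls: the model considers the task done
    # Execute every tool call and feed the results back.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": run_shell(block.input["command"])})
    messages.append({"role": "user", "content": results})
```

The key design point is that the model, not the harness, decides when to explore, edit, test, and stop; the loop just keeps executing tool calls until the model returns a final answer.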
---
Key Technical Improvements
Extended Context
- 200K token context window (up from 128K)
- Can hold entire medium-sized codebases in memory

Tool Use

- Native file system navigation
- Shell command execution
- Git operations
- Test runner integration

Reasoning

- 'Extended thinking' mode for complex problems
- Shows reasoning chain before implementation
- Catches edge cases humans miss

---
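Extended thinking is exposed as a parameter on the same Messages API. A minimal sketch (the `budget_tokens` value is arbitrary; `max_tokens` must exceed it):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,
    # Reserve up to 10K tokens for the model's internal reasoning pass.
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user",
               "content": "Refactor this function and list the edge cases: ..."}],
)

# Thinking blocks arrive in the response alongside the final answer.
for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```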
Pricing
Claude Opus 4 costs $15 per million input tokens and $75 per million output tokens through the API. For comparison, the average SWE-Bench issue uses ~50K tokens total, costing roughly $4-5 per issue solved.
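The per-issue figure is easy to reproduce. A sketch of the arithmetic, assuming an illustrative 10K-input / 40K-output split of the ~50K total:

```python
# Back-of-envelope cost for one SWE-Bench issue at Opus 4 API pricing.
INPUT_PRICE = 15 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 75 / 1_000_000   # dollars per output token

# Illustrative split; real runs vary, and agentic loops re-send
# context on every turn, which pushes the effective cost up.
input_tokens, output_tokens = 10_000, 40_000

cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.2f} per issue")  # -> $3.15 for this split
```

A single pass lands around $3; the repeated context and retries of a real agentic run are what push the average into the quoted $4-5 range.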
---
What This Means for Developers
Short-term (now):
- Use Claude for tedious refactoring
- Automate boilerplate and test writing
- Speed up code reviews with AI analysis

Medium-term (6-12 months):
- Junior roles will transform into 'AI supervisors'
- Senior developers become more productive, not replaced
- Entire categories of tickets become automatable

Long-term (1-3 years):
- The definition of 'programmer' changes fundamentally
- Knowing how to direct AI becomes the core skill
- Code becomes more of a specification language

---
How to Try It
1. Claude.ai - Web interface with coding features
2. Claude Code CLI - Terminal-based agentic coding
3. API - Build it into your own tools
4. VS Code Extension - IDE integration (coming soon)
The model is available now to all Claude Pro subscribers and API users.
---
Related Reading
- Anthropic Launches Claude Enterprise With Unlimited Context and Memory
- The Claude Crash: How One AI Release Triggered a Trillion-Dollar Software Selloff
- Claude's Computer Use Is Now Production-Ready: AI Can Navigate Any Desktop App
- Claude Code Can Now Build Entire Applications From a Single Prompt
- Claude Now Has Persistent Memory Across Conversations. It Remembers Everything You've Told It.