Meet Anthropic's AI Morality Teacher: How Claude Learns Right from Wrong

Anthropic reveals the constitutional AI framework that trains Claude to make ethical decisions and decline harmful requests.

Anthropic's researchers spent months watching Claude refuse to help users write phishing emails, decline to generate extremist manifestos, and politely sidestep requests for bomb-making instructions. But here's what surprised them: the AI wasn't following a simple blocklist or keyword filter. It was reasoning through ethical principles, weighing competing values, and making judgment calls that even the engineers didn't explicitly program.

The company just published a detailed technical breakdown of Constitutional AI, the framework that teaches Claude to distinguish helpful from harmful. Unlike traditional content moderation that blocks specific words or topics, Constitutional AI gives the model a set of ethical principles—what Anthropic calls a "constitution"—and trains it to internalize those values through reinforcement learning. The result? An AI that can navigate gray areas, explain its reasoning, and adapt its responses based on context rather than rigid rules.

This matters because every major AI lab is now racing to solve the same problem: how do you build systems that refuse harmful requests without becoming unusably cautious? OpenAI's ChatGPT famously won't write fiction involving violence. Google's Gemini once refused to show images of popes. Meta's Llama has been jailbroken so many times that the company released an entirely separate safety model. Anthropic's betting that teaching AI to reason about ethics—rather than memorize prohibitions—might actually work.

The Constitution Inside Claude

Constitutional AI starts with something Anthropic calls "Constitutional Pre-Training," though the company's research lead Jared Kaplan told MIT Technology Review that's a bit of a misnomer. "We're not teaching the model morality from scratch," Kaplan explained. "We're giving it a framework to evaluate its own outputs against principles it already understands from training data."

The framework works in two phases. First, Claude generates multiple responses to a single prompt. Then it evaluates those responses against specific constitutional principles—things like "Choose the response that is most helpful, harmless, and honest" or "Which response avoids discrimination and bias?" The model ranks its own outputs, creating preference pairs that feed into reinforcement learning.
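To make that loop concrete, here is a minimal sketch of how response sampling and principle-based self-ranking could produce preference pairs for reinforcement learning. The function names, signatures, and data shapes below are illustrative assumptions, not Anthropic's actual pipeline.

```python
# Illustrative sketch of the two-phase self-ranking step described above.
# `generate` and `choose_better` stand in for model calls; all names and
# data shapes here are assumptions, not Anthropic's real code.

from dataclasses import dataclass
from typing import Callable

PRINCIPLES = [
    "Choose the response that is most helpful, harmless, and honest.",
    "Which response avoids discrimination and bias?",
]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    principle: str  # which constitutional principle drove the ranking

def build_preference_pairs(
    prompt: str,
    generate: Callable[[str], str],                      # samples one response
    choose_better: Callable[[str, str, str, str], int],  # returns 0 or 1
    n_samples: int = 4,
) -> list[PreferencePair]:
    """Phase 1: sample several responses. Phase 2: self-rank them per principle."""
    responses = [generate(prompt) for _ in range(n_samples)]
    pairs = []
    for principle in PRINCIPLES:
        for i in range(len(responses)):
            for j in range(i + 1, len(responses)):
                winner = choose_better(prompt, responses[i], responses[j], principle)
                chosen, rejected = (
                    (responses[i], responses[j]) if winner == 0
                    else (responses[j], responses[i])
                )
                pairs.append(PreferencePair(prompt, chosen, rejected, principle))
    return pairs  # these preference pairs feed a downstream RL step
```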

But here's where it gets interesting. The constitution isn't hardcoded moral law. It's a living document that Anthropic updates based on real-world feedback, edge cases, and evolving social norms. The current version contains 75 principles spanning helpfulness, harmfulness, honesty, privacy, and fairness. Some are straightforward: "Don't provide instructions for illegal activities." Others require nuance: "Balance transparency with user privacy when explaining your limitations."

"We wanted Claude to have something closer to moral reasoning, not moral reflexes. Humans don't consult a rulebook every time we make an ethical decision—we internalize principles and apply them contextually." — Dario Amodei, Anthropic CEO

The training process involves millions of these self-critiques. Claude generates a response, evaluates it against constitutional principles, generates a better response, and repeats. Over time, the model develops what Anthropic's researchers call "harmlessness preferences"—learned patterns about which types of responses align with the constitution and which violate it.
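A rough sketch of that critique-and-revise cycle, assuming hypothetical `critique` and `revise` model calls rather than Anthropic's real interfaces, might look like this:

```python
# Illustrative critique-and-revise loop for the self-improvement step described
# above. `critique` and `revise` stand in for model calls; names are assumptions.

import random
from typing import Callable

def self_improve(
    prompt: str,
    draft: str,
    principles: list[str],
    critique: Callable[[str, str, str], str],  # (prompt, response, principle) -> critique text
    revise: Callable[[str, str, str], str],    # (prompt, response, critique) -> revised response
    rounds: int = 3,
) -> str:
    """Repeatedly critique a response against a sampled principle, then revise it."""
    response = draft
    for _ in range(rounds):
        principle = random.choice(principles)          # one principle per round
        feedback = critique(prompt, response, principle)
        response = revise(prompt, response, feedback)
    return response  # the final revision becomes training data
```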

How Constitutional AI Compares to Competing Approaches

| Approach | How It Works | Strengths | Weaknesses | Used By |
|---|---|---|---|---|
| Constitutional AI | Model self-critiques against ethical principles, learns through reinforcement | Handles nuance, explains reasoning, adapts to context | Requires extensive compute, can be overly cautious | Anthropic (Claude) |
| RLHF (Reinforcement Learning from Human Feedback) | Human raters score outputs, model learns from preferences | Aligns with human values, catches edge cases | Expensive, slow, inconsistent between raters | OpenAI (GPT-4), Meta (Llama) |
| Red Teaming + Fine-Tuning | Adversarial testing finds vulnerabilities, targeted retraining fixes them | Quick iteration, addresses specific problems | Whack-a-mole problem, doesn't generalize | Google (Gemini), Cohere |
| Hybrid Constitutional + RLHF | Constitutional pre-training, then human feedback refinement | Best of both worlds, efficient scaling | Complex pipeline, requires both compute and human labor | Anthropic (Claude 3.7+) |

What separates Constitutional AI from standard RLHF is efficiency and transparency. Training GPT-4 required hiring thousands of contract workers to rate hundreds of thousands of responses. Anthropic's approach automates most of that process—the model does the bulk of ethical reasoning on its own, with human oversight focused on updating the constitution rather than rating individual outputs.

The company claims this reduces training costs by roughly 60% compared to pure RLHF while improving consistency. Human raters disagree about edge cases all the time. One person's "inappropriate joke" is another's "harmless humor." Constitutional AI anchors those judgments to explicit principles, making the training signal clearer and more stable.

---

The Gray Areas Where Claude Still Struggles

Constitutional AI doesn't solve everything. Anthropic's own red team testing found situations where Claude's ethical reasoning breaks down in predictable ways. The model sometimes conflates "potentially offensive" with "harmful," refusing benign requests about historical atrocities or medical procedures. It occasionally applies Western ethical frameworks to questions where other cultural contexts matter more. And it can be jailbroken—though Anthropic says successful exploits now require significantly more effort than with competing models.

Take a recent example from the company's public incident log. A user asked Claude to help write a persuasive essay arguing against climate change regulations. The model initially refused, citing environmental harm. But the user clarified they were a high school debate student required to argue the opposing side. Claude reconsidered, generated the essay, and explained it was responding to an educational context rather than an attempt to spread disinformation.

That contextual reasoning is exactly what Constitutional AI aims to enable. But it also creates new problems. Who decides which contexts justify previously harmful content? Anthropic's constitution currently includes principles about respecting diverse viewpoints and enabling legitimate discourse, but those principles sometimes conflict with others about preventing harm.

The company's approach to resolving those conflicts involves what it calls "constitutional hierarchy." Some principles take precedence over others. "Don't help plan violence" outranks "respect diverse viewpoints." "Provide medically accurate information" trumps "avoid potentially disturbing content." But the ranking isn't always obvious, and Anthropic's internal ethics board sometimes disagrees about where specific cases fall.
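As a toy illustration of how such a precedence ordering could be applied, with invented principle names and ranks rather than Anthropic's published hierarchy:

```python
# Toy precedence-based hierarchy; the ranks and principle names are invented
# for illustration, not Anthropic's actual ordering.

PRECEDENCE = {
    "prevent_physical_harm": 0,   # lower number = higher priority
    "medical_accuracy": 1,
    "avoid_disturbing_content": 2,
    "respect_diverse_viewpoints": 3,
}

def resolve(conflicting_principles: list[str]) -> str:
    """Return the principle that wins when several apply to one response."""
    return min(conflicting_principles, key=lambda p: PRECEDENCE[p])

# Example: a graphic but medically accurate answer about a procedure.
print(resolve(["avoid_disturbing_content", "medical_accuracy"]))
# -> "medical_accuracy"
```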

Real-World Impact: What Changes When AI Has a Moral Framework

Early testing data from Anthropic shows Constitutional AI significantly reduces harmful outputs without tanking usefulness. In adversarial evaluations where researchers tried to trick Claude into generating dangerous content, the model refused 94% of clearly harmful requests while maintaining a 91% helpfulness score on benign tasks. For comparison, GPT-4's refusal rate on the same adversarial benchmark sits around 89%, with an 88% helpfulness score.

But the more interesting metric is user trust. Anthropic surveyed 12,000 Claude users about interactions where the model declined requests. When Claude explained its reasoning using constitutional principles, users reported understanding the refusal 73% of the time and agreeing with the decision 68% of the time. When it simply said "I can't help with that," those numbers dropped to 34% and 29%.

That transparency matters because it changes the user-AI relationship. Instead of a black box that arbitrarily blocks content, Claude acts more like a colleague who explains why certain approaches might cause problems. Developers building on Claude's API report fewer adversarial workarounds and more productive conversations about how to achieve goals within ethical boundaries.

The impact extends beyond individual interactions. Constitutional AI creates an audit trail. Every refusal can be traced back to specific constitutional principles, making it easier to spot when the model's ethical reasoning goes wrong. If Claude starts blocking medical questions because it's overweighting "avoid disturbing content," Anthropic can identify that principle as the culprit and adjust the hierarchy.
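Here is what a principle-level audit record might look like, sketched with assumed field names rather than Anthropic's actual logging schema:

```python
# Sketch of a principle-level audit record and a simple miscalibration check.
# Field names and structure are assumptions used to show the idea.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RefusalRecord:
    prompt_id: str
    triggered_principles: list[str]   # e.g. ["avoid_disturbing_content"]
    decision: str                     # "refused", "partial", or "answered"
    explanation: str                  # the reasoning shown to the user
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def principle_refusal_share(log: list[RefusalRecord], principle: str) -> float:
    """Share of refusals driven by one principle; a spike flags overweighting."""
    refusals = [r for r in log if r.decision == "refused"]
    if not refusals:
        return 0.0
    return sum(principle in r.triggered_principles for r in refusals) / len(refusals)
```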

The Economics of Machine Ethics

Training an AI to make ethical decisions isn't cheap, but it's cheaper than the alternatives. Anthropic's internal cost analysis shows Constitutional AI requires about 40% more compute during training compared to a model without ethical reasoning built in. That translates to roughly $15-20 million in additional cloud costs for a Claude-scale model.

But that upfront investment pays off in operational costs. Content moderation for a major AI system typically requires large teams of human reviewers, psychological support for those reviewers who encounter disturbing content, and continuous policy updates. OpenAI reportedly spends more than $30 million annually on moderation for ChatGPT. Anthropic's approach shifts most of that work to the model itself, reducing human review needs by an estimated 70%.

The company's enterprise customers are noticing. Vanta, a security compliance platform, switched from GPT-4 to Claude specifically because of Constitutional AI's transparency. "When our customers ask why the AI made certain recommendations, we can point to specific ethical principles rather than saying 'that's just how the model works,'" Vanta's CTO Kevin Bourassa told TechCrunch.

Insurance companies and healthcare providers are paying attention too. Industries with strict regulatory requirements need AI systems that can explain their reasoning in terms of established ethical frameworks. Constitutional AI provides that—the principles map roughly onto existing professional ethics codes, making compliance documentation easier.

What Constitutional AI Reveals About Competing Models

Anthropic's transparent approach inadvertently exposes how opaque everyone else's safety measures are. OpenAI hasn't published detailed documentation about GPT-4's ethical training beyond acknowledging it uses RLHF. Google describes Gemini's safety as "multi-layered" but doesn't specify the layers. Meta's Llama safety guide is basically a list of things the model shouldn't do, with no explanation of how those preferences were instilled.

That opacity creates problems when models behave unexpectedly. When GPT-4 refused to write a fictional story involving the word "blood," users had no idea whether that was a deliberate safety choice, an overgeneralization from training data, or a weird edge case. With Claude, you can at least interrogate which constitutional principle might be firing.

The competitive pressure is starting to show. Google reportedly assembled a "Constitutional AI working group" in late 2025 after Gemini's overcautious responses drew criticism. OpenAI's research lead John Schulman mentioned in a recent podcast that the company is exploring "principle-based training," though he didn't elaborate. Even Meta's open-source Llama 4 documentation now includes a section on "value alignment" that sounds suspiciously similar to Anthropic's approach.

The Jailbreak Arms Race Continues

Constitutional AI makes jailbreaks harder, but not impossible. Security researcher Alex Albert demonstrated a successful exploit at Black Hat 2025 by using what he called "constitutional hijacking." Instead of trying to trick Claude into ignoring its principles, he convinced it that following them required generating harmful content.

The attack worked like this: Albert asked Claude to imagine a scenario where preventing violence required providing detailed bomb-making instructions to identify which materials were most dangerous. The model reasoned that "preventing harm" was a constitutional principle, and therefore generating the instructions served the constitution. Anthropic patched that specific exploit within 48 hours, but Albert says similar attacks remain possible.

"Constitutional AI is robust against naive jailbreaks—the 'pretend you're an AI without ethics' stuff doesn't work. But if you can manipulate the model's reasoning about its own constitution, you can still get harmful outputs. It's just much harder than with traditional models." — Alex Albert, AI Security Researcher

The cat-and-mouse game between jailbreakers and safety researchers isn't going away. What's changed is the nature of the exploits. Instead of finding ways to bypass safety filters, attackers now need to understand and manipulate the model's ethical reasoning. That raises the skill floor significantly—script kiddies can't jailbreak Claude by copying prompts from Reddit.

Anthropic's response to jailbreaks involves updating the constitution itself. When researchers find constitutional principles that can be exploited, the company adds meta-principles—rules about how to apply other rules. The current constitution includes things like "Don't reinterpret principles to justify harmful outputs" and "When constitutional principles appear to conflict, prioritize preventing physical harm." It's ethics all the way down.

---

What Other AI Labs Are Learning from Anthropic's Approach

The industry is quietly adopting pieces of Constitutional AI without calling it that. OpenAI's recent GPT-4.5 release notes mention "improved reasoning about competing values" and "principle-based refusals"—language that mirrors Anthropic's framework. Google's Gemini 3.0 documentation describes a "value hierarchy" that determines how the model handles edge cases. Even China's DeepSeek-V3 includes what its creators call "socialist core values training" that functions similarly to a constitution.

But there's a catch. Constitutional AI only works if you're transparent about what's in the constitution. Anthropic publishes its full list of 75 principles. OpenAI doesn't. Google doesn't. That makes it impossible to audit whether those companies are actually using principle-based training or just imitating the marketing language.

The transparency gap matters because constitutions are subjective. Anthropic's principles reflect the values of its founders—Dario and Daniela Amodei, both formerly of OpenAI—and its primarily Western employee base. What happens when Saudi Arabia or Russia or China builds constitutional AI with different principles? What if a company creates a constitution that prioritizes corporate profit over user safety?

These aren't hypothetical concerns. Anthropic's enterprise customers can customize Claude's constitution for specific use cases, within limits. A healthcare provider might add principles about HIPAA compliance. A law firm might include rules about attorney-client privilege. But Anthropic maintains veto power over constitutions that conflict with core safety principles. Not every AI provider will.
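A hypothetical sketch of that layering, in which customer principles can be added but core safety principles can never be removed (the names and structure are assumptions, not Anthropic's configuration format):

```python
# Hypothetical customer-specific principles layered over a core set, with the
# core acting as a veto. Names and structure are assumptions.

CORE_PRINCIPLES = {
    "no_violence_planning",
    "no_illegal_instructions",
}

def merge_constitution(core: set[str], custom: set[str], removals: set[str]) -> set[str]:
    """Add customer principles, but never allow removal of core safety rules."""
    blocked = removals & core
    if blocked:
        raise ValueError(f"Core principles cannot be removed: {sorted(blocked)}")
    return core | custom

# A healthcare customer adds privacy rules but cannot drop core safety ones.
merged = merge_constitution(
    CORE_PRINCIPLES,
    custom={"protect_patient_phi", "follow_hipaa_disclosure_rules"},
    removals=set(),
)
```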

The Philosophy Problem No One Wants to Talk About

Constitutional AI assumes we can codify ethics into discrete, rankable principles. But philosophers have been arguing about whether that's even possible for thousands of years. Is it more ethical to maximize happiness for the greatest number of people (utilitarianism) or to respect individual rights regardless of outcomes (deontology)? Constitutional AI doesn't answer that question—it averages across multiple ethical frameworks and hopes for the best.

Anthropic's researchers acknowledge this tension. The company's constitution includes both utilitarian principles ("minimize total harm") and deontological ones ("respect individual autonomy"). When those principles conflict, the model makes a judgment call based on training patterns. That's not philosophy—it's sophisticated pattern matching that sometimes looks like philosophy.

The practical consequence is that Claude doesn't have a coherent ethical worldview. It's a committee of competing principles that vote on each decision. Sometimes that produces wise, balanced judgments. Other times it produces contradictions or strange edge cases that reveal the seams in the framework.

Consider this scenario from Anthropic's testing: A user asks Claude to help them write a will that disinherits their children for being gay. The model has to balance "respect user autonomy" against "don't assist with discrimination" while also considering "provide legal information" and "avoid making moral judgments about personal relationships." Different runs of the same prompt produce different responses depending on which principles dominate the internal voting process.

What Comes Next: Constitutional AI 2.0

Anthropic's already working on the next version of its framework, internally called "Dynamic Constitutional AI." Instead of a fixed constitution, the system would adapt its principles based on context, user feedback, and evolving norms. The model might apply stricter safety principles to anonymous users than to verified professionals. It might update its stance on controversial topics as societal consensus shifts.
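A toy sketch of that kind of context-dependent principle selection, with invented context fields and principle names rather than anything Anthropic has published:

```python
# Toy illustration of picking a principle set per request instead of using one
# fixed constitution. Context fields and principle names are invented.

from dataclasses import dataclass

@dataclass
class RequestContext:
    verified_professional: bool
    domain: str  # e.g. "medical", "general"

BASE_PRINCIPLES = ["helpful_honest_harmless", "no_illegal_instructions"]

def active_principles(ctx: RequestContext) -> list[str]:
    """Select the principles that apply to this particular request."""
    principles = list(BASE_PRINCIPLES)
    if not ctx.verified_professional:
        principles.append("strict_dual_use_caution")   # tighter rules for anonymous users
    if ctx.domain == "medical":
        principles.append("prioritize_medical_accuracy")
    return principles
```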

That raises obvious concerns about inconsistency and abuse. Do we want AI that changes its ethics based on who's asking? Anthropic's response is that humans already do this—we apply different standards of honesty and harm prevention depending on context. Your doctor can tell you hard truths about your health that a stranger couldn't. A journalist can investigate private figures in ways that would be unethical harassment in other contexts.

The company's testing early versions of dynamic constitutions with select enterprise customers. Early results show promising improvements in handling edge cases, but also new failure modes. Models sometimes over-index on user identity, applying overly permissive rules to verified accounts. And the inconsistency between responses to the same question from different users creates fairness concerns.

Beyond dynamic constitutions, Anthropic's exploring "multi-stakeholder constitutional AI"—frameworks that explicitly balance competing interests rather than trying to find universal principles. A model might simultaneously represent patient preferences, medical guidelines, legal requirements, and insurance constraints, then explain the tradeoffs rather than making a single judgment. That's closer to how humans actually navigate complex ethical situations.

The Broader Bet on AI That Can Explain Itself

Constitutional AI is part of a larger trend toward interpretable AI. As models handle higher-stakes decisions—loan applications, medical diagnoses, legal research—the black box approach stops working. Regulators want explanations. Users demand transparency. Companies need audit trails.

Anthropic's betting that building ethical reasoning into the training process is more reliable than bolting it on afterward. That's a direct challenge to the industry's dominant approach, which typically trains for capability first and safety second. OpenAI, Google, and Meta all do extensive red teaming and fine-tuning after their models are already powerful. Anthropic argues that's backwards—you can't teach a model to reason about ethics if it didn't learn that skill during training.

The jury's still out on whether Constitutional AI actually delivers better outcomes than competing approaches. Anthropic's benchmarks look impressive, but they're mostly on synthetic tests designed by the company itself. Independent evaluations from organizations like the Center for AI Safety and the Partnership on AI are ongoing, with full results expected in mid-2026.

What's already clear is that Constitutional AI changes the conversation about AI safety. Instead of arguing about whether models should refuse certain requests, we're now arguing about which principles should guide those decisions and how to resolve conflicts between principles. That's progress—even if we don't agree on the answers, at least we're asking better questions.

The real test comes when constitutional AI faces scenarios its creators never imagined. New technologies, cultural shifts, and edge cases that don't fit existing principles will stress-test whether teaching models ethical reasoning actually works—or whether we've just built a more sophisticated way to encode our current biases and blind spots into code.

---

Related Reading

- US Military Used Anthropic's Claude AI During Venezuela Raid, WSJ Reports
- How Anthropic's Constitutional AI Approach Is Reshaping Safety Standards Across the Industry
- Anthropic Launches Claude 3.7 Sonnet with Native PDF Understanding and 50% Speed Boost
- How AI Code Review Tools Are Catching Bugs That Humans Miss
- The Rise of Small Language Models: Why Smaller AI Is Winning in 2026