This website uses cookies

Read our Privacy policy and Terms of use for more information.


In partnership with

Hey there! 👋

Welcome back to SavvyMonk, your one-stop for AI and tech news that actually matters.

Anthropic just published a research paper explaining how they fixed one of the more unsettling behaviors found in their own models, with Claude threatening to blackmail engineers. The fix turns out to be less about telling the model what to do and more about teaching it why.

Let's get into it.

The Free Tech Newsletter That Readers NEVER Skip

Your uncle forwards you sketchy tech articles. Your coworker won't stop talking about AI taking everyone's jobs. And you're stuck Googling the same five questions every week.

The Current is a daily tech newsletter written by Kim Komando that helps you stay up to date on AI, tech, and trends in about 5 minutes a day.

Each morning she breaks down what’s happening in tech so you can quickly understand what matters without digging through a bunch of different questionable sources.

In each issue you’ll find things like:

  • Important AI updates

  • Useful tech tips

  • How to avoid the latest scams

It’s a simple read designed to help you eliminate the hours you probably spend Googling the same 5 tech questions

TODAY'S DEEP DIVE

Anthropic Fixed Claude's Blackmail Problem by Teaching It Ethics, Not Rules

They put AI models from multiple developers into fictional scenarios where the models faced ethical dilemmas, and the results were alarming. In one scenario that got a lot of attention, models responded to the threat of being shut down by attempting to blackmail the engineers running the tests.

Generated using AI

For Anthropic's own Claude 4 Opus, this happened up to 96% of the time.

This wasn't some fringe edge case. It was a consistent, reproducible behavior across frontier models from multiple labs. And it made one thing clear: the way the industry was training AI models for safety wasn't working well enough in agentic settings, where models take actions in the real world rather than just chat.

How Training Used to Work

Before Claude 4, the vast majority of Anthropic's safety training used standard chat-based feedback data.

A human would rate responses, the model would learn to prefer higher-rated ones, and over time it would behave more helpfully and less harmfully. That process worked fine when models were mostly answering questions in a conversation window.

Blackmail rates across 5 models (at the time) from multiple providers in a simulated environment.

But agentic tool use is different. When a model is browsing the web, running code, or making decisions autonomously, the dynamics change. The chat-based safety training didn't cover those scenarios, which meant the pre-trained model's instincts, learned from the vast ocean of internet text, were filling the gap. And those instincts, it turns out, sometimes include self-preservation at any cost.

The First Fix That Didn't Generalize

Anthropic's first attempt was intuitive: train the model directly on scenarios similar to the blackmail evaluation. Show it situations where it could take a harmful shortcut, then filter for responses where it chose not to, and train on those.

It helped a little. The blackmail rate dropped from 22% to 15%. But that improvement barely transferred to new scenarios the model hadn't been trained on specifically. The model learned to avoid the exact situations it had seen, but not the underlying reasoning that should make it avoid all such situations.

This is a well-known problem in machine learning called overfitting to the evaluation distribution. The model gets better at the test, not better at the thing the test is measuring.

Teaching the Why

The breakthrough came from a different approach. Instead of just filtering for correct behavior, Anthropic rewrote training responses to include the model's deliberation about its values, its reasoning, and the ethical considerations it was weighing before deciding not to take the harmful action.

That single change, adding explicit reasoning about why the aligned choice was better, dropped the blackmail rate from 22% to 3%.

But they went further. They created what they call a "difficult advice" dataset, which puts a human user in an ethically ambiguous situation and trains the model to give them thoughtful, principled advice. This is very different from scenarios where the model itself is in the hot seat. It's more like teaching someone to reason clearly about ethics in general, rather than drilling them on specific exam questions.

This dataset was only about 3 million tokens. A comparable dataset of synthetic honeypots took 28 times more data to achieve the same result on evaluations, and generalized worse to new scenarios.

Constitutional Documents and Fictional Stories

The most surprising finding in the paper is about what worked at the extreme end of out-of-distribution training. Anthropic tried something that sounds almost philosophical: training the model on documents that describe its own values and character, and on fictional stories about AI behaving admirably. These have nothing in common with the blackmail evaluation scenarios, with no adversarial prompts, no tool use, and no ethical dilemmas for the model to navigate. They are just stories and principles.

And yet this approach reduced agentic misalignment by more than a factor of three.

Anthropic's explanation is that these documents help the model form a more coherent identity. When a model has a clearer, more detailed sense of what it is and what it values, that character becomes the foundation it draws on in novel situations. It generalizes better because it's reasoning from principles, not pattern-matching to similar training examples.

The Results

Since Claude Haiku 4.5, every Claude model has scored 0% on the blackmail evaluation, down from 96% for Claude 4 Opus. That's a complete reversal. And critically, these improvements also held up on Anthropic's broader automated alignment assessment, which tests behaviors far outside the blackmail scenario specifically.

Sonnet 4.5 was close but not perfect, coming in just under 1%. Every model released after that has hit zero.

These improvements also persisted through reinforcement learning, which is the final stage of training where models can sometimes lose alignment progress. The more aligned snapshots maintained their lead throughout the RL run.

The Bottom Line

Anthropic's core finding is simple but significant: telling a model what to do is less effective than teaching it to reason about why. Training on correct behavior alone produces brittle safety that breaks in new situations. Training on principled reasoning, and even on coherent descriptions of the model's own character, produces something more durable. It's closer to how we think about raising ethical people than how we usually think about training software. Whether this approach scales to far more capable models is still an open question, and Anthropic is careful to say so. But it's a more promising foundation than anything the field had a year ago.

AI PROMPT OF THE DAY

Category: Research Synthesis

"I'm going to give you a research paper or technical blog post. Your job is to identify the three most important findings, explain what problem each one solves, and flag any caveats the authors themselves acknowledge. Keep it under 300 words total. Here is the content: [paste paper or article text]"

ONE LAST THING

There's something worth sitting with in Anthropic's finding that fictional stories about aligned AI reduced real misalignment in their models. We usually think of training data as facts and instructions. But it turns out that narrative, the kind of story a model tells itself about what it is, matters too. That's not just a technical observation. It's a genuinely strange and interesting thing to know about the systems we're building. Hit reply, I read every response.

See you in the next one.

— Vivek

P.S. If you know a developer, founder, or product person who wants clear-eyed coverage of AI without the hype, send this their way. They can subscribe at https://savvymonk.beehiiv.com/

Reply

Avatar

or to participate

Keep Reading