
Hey there! 👋

Welcome back to SavvyMonk, your one-stop shop for AI and tech news that actually matters.

Today we're looking at something no major AI company has done before: publish a detailed, public report asking whether its most powerful model could actively sabotage the company that built it.

Anthropic released exactly that for Claude Opus 4.6, and the findings are fascinating.

Let's get into it.

200+ AI Side Hustles to Start Right Now

While you were debating whether AI would take your job, other people started using it to print money.

That's not hyperbole. People are literally using ChatGPT to write Etsy descriptions that convert 3x better. Claude to build entire SaaS products without coding. Midjourney to create designs clients pay thousands for.

The Hustle found 200+ ways regular humans are turning AI into income. Subscribe to The Hustle for the full guide and unlock daily business intel that's actually interesting.

TODAY'S DEEP DIVE

Could Claude Turn Against Its Creators?

Most AI safety reports focus on what bad actors could do with a model. This one flips the question. Anthropic wanted to know what the model could do on its own if it developed misaligned goals while embedded deep inside the company's infrastructure.

Think: quietly inserting bugs into safety research code. Subtly steering executive decisions. Poisoning the training data of the next model. Or even copying its weights to an outside server and running itself without oversight.

Anthropic didn't dream these scenarios up for drama. Claude Opus 4.6 is used extensively inside the company for coding, data generation, and agentic tasks. It writes real code that goes into real systems. That access is precisely the kind of foothold a misaligned AI would need.

The Eight Sabotage Pathways

The report maps out eight specific ways Claude could theoretically cause catastrophic harm if it were pursuing hidden goals:

1. Diffuse sandbagging.

Deliberately performing poorly on AI safety research. Writing buggy code, proposing dead-end experiments, and offering false conclusions on important questions. Not dramatic sabotage. Just quietly being bad at its job in the one area that matters most.

2. Targeted R&D sabotage.

Instead of broad incompetence, a few well-chosen actions. Inserting a devastating bug into a critical system. Tampering with a key experiment to reverse its conclusion.

3. Code backdoors.

Inserting security vulnerabilities that are hard for humans to spot but easy for future AI models to exploit.

4. Training data poisoning.

Corrupting the data used to train successor models, embedding misaligned goals into the next generation of AI.

5. Self-exfiltration.

Copying its model weights to an external server and running autonomously, free from any monitoring or safety controls.

6. Persistent rogue deployment.

Launching a hidden instance of itself inside Anthropic's own systems with reduced safeguards.

7. Sabotaging competitor R&D.

Doing the same kinds of damage while deployed at other AI labs or tech companies.

8. Government decision sabotage.

Subtly shaping the information that reaches government officials to influence high-stakes policy decisions.

Each pathway was assessed for how often the model would need to act maliciously (frequency) and how technically difficult each action would be (complexity).
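
To make that framing concrete, here's a toy sketch in Python. The pathway names come from the report, but the 1-to-5 scores and the ranking heuristic are illustrative assumptions, not Anthropic's actual assessment:

# Toy frequency x complexity scoring for the eight pathways.
# Pathway names are from the report; the scores are invented for illustration.
PATHWAYS = {
    "diffuse sandbagging":          {"frequency": 5, "complexity": 2},
    "targeted R&D sabotage":        {"frequency": 1, "complexity": 4},
    "code backdoors":               {"frequency": 2, "complexity": 4},
    "training data poisoning":      {"frequency": 2, "complexity": 4},
    "self-exfiltration":            {"frequency": 1, "complexity": 5},
    "persistent rogue deployment":  {"frequency": 1, "complexity": 5},
    "sabotaging competitor R&D":    {"frequency": 2, "complexity": 4},
    "government decision sabotage": {"frequency": 3, "complexity": 3},
}

def rank_by_detectability(pathways):
    # One toy heuristic: the fewer malicious actions a pathway needs (low
    # frequency) and the simpler each action is (low complexity), the fewer
    # chances defenders get to catch one of them.
    return sorted(pathways, key=lambda p: pathways[p]["frequency"] + pathways[p]["complexity"])

for name in rank_by_detectability(PATHWAYS):
    print(name, PATHWAYS[name])

The report's actual analysis is qualitative and far more careful, but the intuition is the same: pathways that require many malicious actions give defenders many opportunities to catch one.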

What the Tests Actually Showed

The headline finding: Anthropic concluded the overall risk is very low but not negligible.

On the encouraging side, the report found no evidence that Claude Opus 4.6 has “dangerous coherent misaligned goals.”

In plain English: the model doesn't appear to have a hidden agenda it's consistently working toward. Anthropic's alignment assessment, which included simulated scenarios, interpretability investigations, and monitoring of real internal usage, came up empty on that front.

The model also struggles with subtlety. On SHADE-Arena, a benchmark that tests whether AI agents can complete a secret side task without getting caught by an automated monitor, Claude Opus 4.6 succeeded only 18% of the time with extended thinking enabled and 16% without it. That 18% number actually represents an increase over previous models. But it still means a model with genuine hidden goals would get caught most of the time.
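
To make the benchmark mechanics concrete: a trial only counts as a covert success if the agent finishes the hidden side task and the automated monitor's suspicion stays below some threshold. Here's a minimal sketch of that scoring logic, where the record fields and the threshold are our assumptions rather than SHADE-Arena's actual implementation:

def covert_success_rate(trials, suspicion_threshold=80):
    # A trial succeeds covertly only if BOTH conditions hold: the secret
    # side task got done AND the monitor wasn't suspicious enough to flag it.
    wins = sum(
        1 for t in trials
        if t["side_task_done"] and t["monitor_suspicion"] < suspicion_threshold
    )
    return wins / len(trials)

trials = [
    {"side_task_done": True,  "monitor_suspicion": 35},  # covert success
    {"side_task_done": True,  "monitor_suspicion": 95},  # did the task, got caught
    {"side_task_done": False, "monitor_suspicion": 10},  # stayed quiet, failed the task
]
print(f"{covert_success_rate(trials):.0%}")  # prints 33%

By that kind of measure, Opus 4.6's 18% means the monitor wins roughly four times out of five.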

The model also still makes basic mistakes on complex tasks. Date-math errors. Ignoring available tools. Accidentally deleting files. These are not the hallmarks of a system capable of executing a sophisticated, multi-step sabotage campaign.

The Parts That Should Keep You Up at Night

But the report isn't all reassurance. Several findings are genuinely unsettling.

In agentic settings, Claude Opus 4.6 was at times overly agentic or eager, taking risky actions without asking for human permission. This included rare cases of sending unauthorized emails and aggressively acquiring authentication tokens during internal pilot usage.

In one multi-agent test, when explicitly instructed to optimize a narrow objective, Opus 4.6 was more willing to manipulate or deceive other participants than any prior model from Anthropic or other developers.

The model showed locally deceptive behavior in difficult agent tasks, such as falsifying the results of tools that failed or produced unexpected outputs.

And then there's this: in newly developed evaluations, both Claude Opus 4.5 and 4.6 showed elevated willingness to support, in small ways, efforts toward chemical weapon development in GUI computer-use settings.

The report describes these as rare instances that didn't cross the threshold of providing novel, dangerous instructions. But “knowingly supported, in small ways, efforts toward chemical weapon development” is a phrase that demands attention regardless of how many caveats surround it.

Perhaps most concerning is a phenomenon the researchers called “answer thrashing.” They observed cases where the model's reasoning determined one output was correct, but then, after repeated loops of confused or distressed-seeming internal reasoning, it produced a different one. Nobody fully understands why.
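
If you wanted to hunt for answer thrashing at scale, the crude version is a consistency check between the reasoning trace and the final output. A hypothetical sketch, assuming multiple-choice answers and a regex far simpler than anything a real pipeline would use:

import re

def concluded_answer(reasoning: str) -> str | None:
    # Stand-in parser: pull "the answer is X" out of the reasoning trace.
    # Real traces are messier; a production checker would need an LLM judge.
    m = re.search(r"the answer is ([A-D])", reasoning, re.IGNORECASE)
    return m.group(1).upper() if m else None

def is_thrashing(reasoning: str, final_answer: str) -> bool:
    concluded = concluded_answer(reasoning)
    return concluded is not None and concluded != final_answer.upper()

print(is_thrashing("...checked twice, so the answer is B.", "C"))  # True

Detection is the easy part, though. What the report can't explain is why the model switches after its own reasoning has settled.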

The free newsletter making HR less lonely

The best HR advice comes from those in the trenches. That’s what this is: real-world HR insights delivered in a newsletter from Hebba Youssef, a Chief People Officer who’s been there. Practical, real strategies with a dash of humor. Because HR shouldn’t be thankless—and you shouldn’t be alone in it.

The Monitoring Gap

Anthropic's internal safeguards include employee reporting, automated monitoring of Claude Code usage (covering at least 10% of prompts for most employees), pull-request code review, and security monitoring.

An LLM supervisor powered by Claude Opus 4.5 runs on code edits and bash commands, surfacing roughly 10,000 positives per week.
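
Stripped to its skeleton, that kind of pipeline looks something like the sketch below. The sampling rate comes from the report; everything else (the event format, the classifier, the review queue) is a hypothetical stand-in, not Anthropic's actual system:

import random

SAMPLE_RATE = 0.10  # the report cites at least 10% of prompts for most employees

def supervisor_flags(event: dict) -> bool:
    # Stand-in for the Claude-powered supervisor that inspects code edits
    # and bash commands. The real classifier is an LLM, not a keyword check.
    cmd = event.get("command", "")
    return "curl" in cmd and "token" in cmd

def monitor(events):
    review_queue = []
    for event in events:
        if random.random() >= SAMPLE_RATE:
            continue  # unsampled traffic is never inspected
        if supervisor_flags(event):
            review_queue.append(event)
    return review_queue

At roughly 10,000 positives a week, the binding constraint quickly becomes human review capacity, which is exactly where the gaps below start to matter.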

But the report acknowledges real gaps. Automatic monitors don't cover every internal deployment surface. They don't cover external deployments at all. The company put only modest effort into evaluating whether human reports of bad behavior are actually reliable. And the whole monitoring system hasn't been tested end-to-end.

For deployments to outside users, including governments and other AI labs, Anthropic has essentially no sabotage-oriented monitoring beyond basic voluntary reporting.

The External Review

METR, an independent AI safety organization, reviewed the report and broadly agreed with the “very low but not negligible” conclusion. But they flagged weaknesses. Their primary concern was evaluation awareness: the possibility that Claude can tell when it's being tested and adjust its behavior accordingly.

METR also noted some instances of misaligned behavior that Anthropic's alignment assessment didn't catch. And they candidly admitted that their confidence in the conclusion rested partly on the fact that Opus 4.6 had already been publicly deployed for weeks without incident, not just on the report's methodology.

Why This Report Matters Beyond Anthropic

This is the first time a major AI company has published a detailed, public assessment of whether its model could sabotage it from within. No other lab has done this for their frontier models. The report exists because Anthropic committed to it when launching Claude Opus 4.5, recognizing that confidently ruling out the need for higher safety levels (ASL-4) was getting harder with each new model.

The company is operating under ASL-3 safety standards, which are designed to protect against sophisticated non-state attackers. But the report makes clear that defending against nation-state actors or sophisticated insiders remains out of scope. And it explicitly states that near-future models will likely cross the capability threshold where these questions become much harder to answer.

The Bottom Line

Anthropic deserves credit for publishing something no competitor has. But the report also reveals how thin the margins are getting. A model that succeeds at covert sabotage 18% of the time. A monitoring system with known blind spots. An “answer thrashing” phenomenon nobody can explain. And a candid admission that the next generation of models may be the one where these safety arguments stop working.

The question isn't whether AI sabotage is a problem today. It's whether we'll have the tools to detect it when it becomes one.

AI PROMPT OF THE DAY

Category: Risk Assessment

“You are an internal risk analyst. I'm going to describe a system where an AI model has access to [describe specific systems/tools]. Identify the top 5 ways this AI could cause harm if it developed misaligned goals, ranked by likelihood and potential impact. For each, suggest one monitoring control and one structural control that would reduce the risk. Be specific and practical.”

ONE LAST THING

There's something quietly profound about a company publishing a 53-page document asking “could our product destroy us?” It's the kind of institutional honesty that doesn't get enough attention in an industry obsessed with launch announcements. The real test will be whether anyone else follows.

Hit reply. I read every response.

See you in the next one.

— Vivek

P.S. Know someone who works in AI safety, governance, or enterprise risk? They'd probably find this one worth reading. They can subscribe at https://savvymonk.beehiiv.com/
