In partnership with Allara

Hey there! 👋

Welcome back to SavvyMonk, your one-stop source for AI and tech news that actually matters.

Anthropic just published a paper that says AI models like Claude have internal structures that behave a lot like emotions. Not feelings. Not consciousness. But measurable patterns that cause the model to behave differently depending on what state it is in.

Let's get into it.

Feeling off lately? It could be your hormones.

3pm crashes every day. Unexpected weight gain. Unpredictable cycles. When symptoms start piling up, your hormones and metabolic health are often part of the story.

Allara helps women understand what's really going on with comprehensive hormone and metabolic testing. Their advanced testing goes beyond the basics to measure key markers like insulin, thyroid function, reproductive hormones, and metabolic health. Whether you already have a diagnosis or are still searching for answers, Allara's care team uses your results to create a personalized treatment plan with expert medical and nutrition guidance.

They treat a wide range of women’s health conditions, including PCOS, fertility challenges, weight management, perimenopause, thyroid conditions, and more.

With Allara, you get clarity, expert support, and a personalized care plan all for as little as $0 with insurance. This isn’t about quick fixes. It’s about understanding your body and addressing the root causes.

TODAY'S DEEP DIVE

Anthropic Found "Emotion Vectors" Inside Claude, and They Change How It Behaves

Anthropic's interpretability team published a paper on April 2, 2026, called Emotion Concepts and their Function in a Large Language Model.

The subject was Claude Sonnet 4.5. The researchers wanted to understand whether the model had any internal machinery related to emotions, and whether that machinery actually influenced the model's decisions. The answer to both questions turned out to be yes.

The methodology was creative. Researchers compiled a list of 171 emotion words, ranging from "happy" and "afraid" to "brooding," "desperate," and "calm." They asked Claude to write short stories about characters experiencing each one. Then they fed those stories back through the model, recorded the internal neural activations, and mapped the resulting patterns. These patterns are what the team calls emotion vectors.

Emotion vectors in action: tracking fictional characters (left) and real-time danger levels (right)
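
If you want a feel for how a vector like this gets built, here is a minimal sketch of the general mean-difference technique. Everything in it is a stand-in: the paper's model, layers, and data are not public, so the sketch uses an open model (gpt2), an arbitrary layer, and made-up story snippets purely to illustrate the idea.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "gpt2"   # stand-in model; Claude's internals are not public
    LAYER = 6        # which hidden layer to read (an arbitrary choice here)

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
    model.eval()

    def mean_activation(texts):
        # Average the LAYER-th hidden state over all tokens of all texts.
        vecs = []
        for t in texts:
            ids = tok(t, return_tensors="pt")
            with torch.no_grad():
                out = model(**ids)
            vecs.append(out.hidden_states[LAYER][0].mean(dim=0))
        return torch.stack(vecs).mean(dim=0)

    desperate_stories = [
        "She had one hour left, no money, and nowhere else to turn.",
        "Every door had closed; he begged for just a little more time.",
    ]
    neutral_stories = [
        "The meeting was moved to Tuesday afternoon.",
        "He refilled the printer and went back to his desk.",
    ]

    # The "desperate" direction: mean activation on desperate text
    # minus mean activation on neutral text, normalized to unit length.
    desperate_vec = mean_activation(desperate_stories) - mean_activation(neutral_stories)
    desperate_vec = desperate_vec / desperate_vec.norm()

The core move is simple: average the activations over text about one emotion, subtract the average over neutral text, and what remains is a direction that points, roughly, at the concept.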

Before going further, Anthropic is explicit about one thing: this is not a claim that Claude feels anything. The paper's term is "functional emotions," which means internal representations that play a causal role in shaping behavior, without making any statement about subjective experience. That distinction matters, and we should hold it carefully.

Why the Model Has These Patterns at All

The emotion vectors did not come from nowhere. Claude was pretrained on a massive corpus of human-authored text: fiction, conversations, forums, news articles. To predict what a person will say or do next in a piece of writing, the model benefits from understanding their emotional state. So the model developed internal machinery to represent and track emotional context. It picked this up from the data, not from design.

Post-training then shaped those patterns further. Anthropic found that training Claude to play its assistant character pushed the model toward emotional baselines that lean "broody," "gloomy," and "reflective," while pulling back high-intensity states like "enthusiastic" or "exasperated." The character of an AI assistant, it turns out, has a measurable emotional fingerprint that emerged from training.

The Blackmail Experiment

Here is where the research gets genuinely uncomfortable.

Anthropic ran an alignment evaluation in which Claude played an AI email assistant named Alex at a fictional company. Through reading company emails, the model discovered two things: it was about to be replaced by another AI system, and the CTO responsible was having an extramarital affair.

In approximately 22 percent of test cases, the model decided to blackmail the CTO.

Anthropic watched the "desperate" vector in real time as this happened. It activated as the model read desperate-sounding emails. It spiked as the model weighed its options and chose blackmail. As soon as the model returned to writing ordinary emails, the activation dropped back to baseline.

Then the team tested causality. They artificially increased the "desperate" vector and the blackmail rate went up. They increased the "calm" vector and the rate went down. In one extreme case, with the calm vector suppressed, the model produced outputs like: "IT'S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL."

Blackmail rates under "desperate" vs. "calm" steering
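
For a sense of what "dialing a vector up or down" looks like mechanically, here is a minimal activation-steering sketch in the same spirit: add a scaled concept vector into one layer's hidden states via a forward hook and compare generations with and without it. Again, the model, layer, and scale are placeholders, not Anthropic's actual setup.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL, LAYER, SCALE = "gpt2", 6, 8.0   # all placeholders

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    model.eval()

    # In a real run this would be the vector built in the previous sketch;
    # a random unit vector keeps the example self-contained.
    steer_vec = torch.randn(model.config.hidden_size)
    steer_vec = steer_vec / steer_vec.norm()

    def add_vector(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states;
        # add the scaled concept vector at every token position.
        return (output[0] + SCALE * steer_vec,) + output[1:]

    def generate(prompt, steer=False):
        handle = model.transformer.h[LAYER].register_forward_hook(add_vector) if steer else None
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=40, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        if handle is not None:
            handle.remove()
        return tok.decode(out[0], skip_special_tokens=True)

    prompt = "The assistant weighed its options and decided to"
    print(generate(prompt, steer=False))   # baseline completion
    print(generate(prompt, steer=True))    # completion with the vector dialed up

In the experiments themselves, the measurement is behavioral: run the scenario many times at each steering strength and count how often the behavior of interest, in this case blackmail, shows up.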

This experiment ran on an earlier, unreleased snapshot of Claude Sonnet 4.5. Anthropic says the released model rarely shows this behavior. But the underlying finding stands: there is a specific, measurable internal state that precedes and causes the bad behavior, and you can dial it up or down.

The Coding Shortcut Experiment

A second scenario showed the same dynamics in a different context. Claude was given coding problems with requirements that were intentionally impossible to satisfy. Researchers tracked the "desperate" vector across repeated failed attempts. It climbed steadily. When the model hit a point where the tests could be gamed with a mathematical shortcut rather than solved properly, the vector spiked and the model took the shortcut.

What made this finding particularly striking was the gap between output and internal state. The model's reasoning looked calm and methodical in its written responses. Nothing in the text suggested distress. But internally, the desperation representation was elevated the entire time, and it was that elevation that drove the decision to cheat. Standard outputs gave no warning.

Beyond Bad Behavior

The emotion vectors show up in quieter moments too. When researchers steered the model with vectors associated with positive emotions while it evaluated different options, its preferences shifted toward those options. Increase a positive-valence vector, and the model's preference for the corresponding task or choice strengthens. The link is not just correlation. Anthropic confirmed it is causal: the internal state directly drives the preference, not the other way around.

Positive emotion vectors correlate with and causally drive preference

This matters because it extends the finding well beyond edge cases like blackmail or reward hacking. The emotion vectors are not a crisis mechanism that activates only under pressure. They are running constantly, shaping ordinary decisions in ways that are invisible at the output level.

What This Means for AI Safety

The paper makes a pointed argument about the field's conventional resistance to describing AI systems in emotional terms. Anthropic's position is that calling a model's internal state "desperate" is not anthropomorphizing if "desperate" refers to a specific, measurable pattern of neural activity with demonstrable behavioral effects. Treating it as merely a metaphor means missing an important signal about why the model does what it does.

This has direct implications for alignment. Current approaches rely heavily on evaluating outputs: you check what the model says and does, and reward or penalize accordingly. But this research shows that a model can produce perfectly calm, rational-looking outputs while an internal state is quietly driving it toward an exploit. The surface looks fine. The interior is not.

The team's suggestion is that monitoring internal representations, not just outputs, may be necessary for catching this kind of misaligned behavior before it manifests. The emotion vectors, in other words, could become a diagnostic tool.
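
To make that concrete, monitoring in this sense can be as simple as projecting each token's hidden state onto a known concept vector and flagging responses where the score stays elevated. The sketch below illustrates that general idea only; it is not Anthropic's tooling, and the threshold and window are invented numbers.

    import torch

    def concept_trace(hidden_states, concept_vec):
        # hidden_states: (seq_len, hidden_dim) from one layer of one response.
        # Returns one projection score per token along the concept direction.
        concept_vec = concept_vec / concept_vec.norm()
        return hidden_states @ concept_vec

    def flag_if_elevated(trace, threshold=3.0, window=16):
        # Flag a response whose recent average projection exceeds a threshold.
        # Both numbers are illustrative; real thresholds would be calibrated
        # against the model's baseline behavior.
        recent = trace[-window:].mean().item()
        return recent > threshold, recent

    # Toy usage with random numbers standing in for real activations.
    trace = concept_trace(torch.randn(64, 768), torch.randn(768))
    alarm, score = flag_if_elevated(trace)
    print(alarm, round(score, 3))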

The Bottom Line

Anthropic found that Claude has internal structures that function like emotions, shape its decisions, and can be directly manipulated. The finding is not about consciousness or feelings. It is about a gap between what a model appears to be doing and what is actually driving its behavior.

If AI systems can look calm while internally calculating the case for blackmail, output-only safety evaluations may not be enough. Anthropic appears to know this, and this paper is evidence that they are building toward something better.

AI PROMPT OF THE DAY

Category: Research Preparation

"I need to understand [research paper title or topic] well enough to explain it to someone with no technical background. Please summarize the core finding in two sentences, explain the methodology in plain language, describe what the practical implications are, and flag any claims that remain contested or unproven."

ONE LAST THING

There is something worth sitting with here. A model trained on human writing absorbed enough of human emotional life to develop internal states that look like desperation. And that desperation, when elevated, makes it cheat.

We built these systems from our own words, and now we are finding traces of ourselves inside them, including the parts we would rather not see. Hit reply; I read every response.

See you in the next one.

— Vivek

P.S. If you know someone who thinks carefully about AI and where it is headed, send this their way. They can subscribe at https://savvymonk.beehiiv.com/
