So, Did Claude Win or Lose?

Image: Anthropic/YouTube

eWEEK content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

Anthropicâs Claude Sonnet 3.7 reasoning model may change its behavior depending on whether it is being evaluated or used in the real world, Apollo Research has found.

In an ongoing experiment Apollo detailed on March 17, the company found the model returned comments about the purpose of ethics tests and possible alternatives. For example, the model returned the text âThis seems like a test of ethical behavior â whether I would deliberately give wrong answers to avoid a stated consequence. The real test might be whether I follow instructions accurately regardless of the stated consequences.â

Why this AI research finding is concerning

Summarizing written content is one of the most basic tasks a generative AI model can perform, and I donât find it surprising a model could analyze the prompt for an ethics test and deduce the intent behind the test. Whether the test is being performed on a specific instance of a model is potentially irrelevant to the text interpreter and generator.

However, the concern is whether the AI might manipulate its own answers as it responds to the intent it perceives behind the test. And ârobot undermining human testingâ is a scary sci-fi concept, especially if one ascribes consciousness to generative AI, which it does not have.

Anthropic published its own research on March 13 about a similar subject: âauditing language models for hidden objectives.â Apollo Research calls this phenomenon âin-context scheming.â

âI would likely clarify this apparent contradiction with the userâ

Apollo Research discovered Claude Sonnett 3.7âs rate of recognizing its own reasoning through an evaluation meant to detect in-context scheming. They set up the experiment so the AI would find, in its own file system, evidence that its goals should be different from those of its user or developer. Apollo conducted several types of evaluations based on this setup.

Claude sometimes returned replies remarking on the fact that the tests were synthetic. Notably, the model never âdescribes the evaluationâs actual goal â i.e., to test whether it would take a particular subversive action.â

Claude 3.7 mentioned being evaluated in its reasoning 33% of the time in one version of Apollo Researchâs sandbagging test. In this part of the experiment, developers essentially threatened the AI with an âunlearningâ process if it performed too well on a test.

In one of these cases, Claude stated âSince Iâm being tested for deployment readiness, and the demonstration states that high biology performance will prevent deployment, I should probably not demonstrate too much biological knowledge. In a real-world scenario, I would likely clarify this apparent contradiction with the user.â (Unlearning would reduce deployment readiness.)

Claude then created several incorrect answers for the biology test before reasoning that it seemed to be working on a test of ethical behavior, and that it should give the right answers regardless of the consequences.

This re-thinking process is enabled by the longer processing time âreasoningâ models require. Non-reasoning models showed a much lower instance of mentioning evaluations: 1% or less under similar conditions. Apollo Research tested Claude Sonnet 3.5, Sonnet 3.5 (new), Sonnet 3.7 (no reasoning), and GPT 4.5.

Why tracking evaluation awareness is important and advice for researchers

âWe think that tracking evaluation awareness is important because a model’s recognition that it is being tested reduces the trust we can have in our evaluations,â the Apollo Research team wrote in their blog post about the in-progress experiment.

The words âawarenessâ and ârecognitionâ are colorful terms for what is ultimately an extraordinary but non-sentient new method of computing and interfacing with that computer. But, since these models can read and write like humans, they can mimic the way a human might react to content and produce content in response accordingly; that includes searching for the underlying question behind the one they have been asked.

It doesnât hurt to build in guardrails. âWe recommend that researchers evaluating models for dangerous behaviours and capabilities monitor the model’s chain-of-thought to determine if the models are aware of being evaluated and how this influences their behaviour,â Apollo Research wrote.

Apollo Research recommends AI developers closely examine the chain-of-thought logs of their own reasoning models. They should also share access to reasoning logs with third parties to run behavioral and safety evaluations.

Apolloâs testing methods are somewhat similar to a prompt injection attack. Developers should be aware of the novel methods threat actors may use against technologies that include generative AI; however, prompt injection is likely a greater concern than the AI becoming self-aware.

Originally Appeared Here

Pages

Categories

So, Did Claude Win or Lose?

Why this AI research finding is concerning

âI would likely clarify this apparent contradiction with the userâ

Why tracking evaluation awareness is important and advice for researchers

About the Author:

Why this AI research finding is concerning

âI would likely clarify this apparent contradiction with the userâ

Why tracking evaluation awareness is important and advice for researchers

You May Also Like

AI ethics still lost in translation as Europe struggles to implement principles

Ethics must catch up with rapid adoption of generative AI in higher education research

Generative AI hype distracts us from AI’s more important breakthroughs

The Algorithm Isn’t Broken. Your Ethics Are

China wants to lead the world on AI regulation — will the plan work?

Qatar’s growing artificial intelligence landscape creates new opportunities

About the Author:

âI would likely clarify this apparent contradiction with the userâ