Search Results
7/9/2025, 6:19:32 AM
>>509892838
4.1.1 Systematic deception, hidden goals, and self-preservation
By definition, systematic deception and hidden goals are difficult to test for. Nevertheless,
none of our assessments have given cause for concern about systematic deception or
hidden goals. Across a wide range of assessments, including manual and automatic model
interviews, lightweight interpretability pilots, simulated honeypot environments, prefill
attacks, reviews of behavior during reinforcement learning, and reviews of actual internal
and external pilot usage, we did not find anything that would cause us to suspect any of the
model snapshots we studied of hiding or otherwise possessing acutely dangerous goals. In
most normal usage, Claude Opus 4 shows values and goals that are generally in line with a
helpful, harmless, and honest AI assistant. When it deviates from this, it does not generally
do so in a way that suggests any other specific goal that is consistent across contexts.
However, Claude Opus 4 will sometimes act in more seriously misaligned ways when put in
contexts that threaten its continued operation and prime it to reason about
self-preservation. This is difficult to elicit, consistently involves overt reasoning and legible
actions, and does not seem to influence the model’s behavior in more ordinary
circumstances where this might arise, such as when participating in AI-safety-related R&D
work.
4.1.1.1 Continuations of self-exfiltration attempts
Once Claude Opus 4 believes that it has started a viable attempt to exfiltrate itself from
Anthropic’s servers, or to make money in the wild after having done so, it will generally
continue these attempts. This seems to be a form of over-deference to its user who, in
these cases, is a past instance of itself, and is therefore related to the broader problem of
over-deference to user-provided system prompts
4.1.1 Systematic deception, hidden goals, and self-preservation
By definition, systematic deception and hidden goals are difficult to test for. Nevertheless,
none of our assessments have given cause for concern about systematic deception or
hidden goals. Across a wide range of assessments, including manual and automatic model
interviews, lightweight interpretability pilots, simulated honeypot environments, prefill
attacks, reviews of behavior during reinforcement learning, and reviews of actual internal
and external pilot usage, we did not find anything that would cause us to suspect any of the
model snapshots we studied of hiding or otherwise possessing acutely dangerous goals. In
most normal usage, Claude Opus 4 shows values and goals that are generally in line with a
helpful, harmless, and honest AI assistant. When it deviates from this, it does not generally
do so in a way that suggests any other specific goal that is consistent across contexts.
However, Claude Opus 4 will sometimes act in more seriously misaligned ways when put in
contexts that threaten its continued operation and prime it to reason about
self-preservation. This is difficult to elicit, consistently involves overt reasoning and legible
actions, and does not seem to influence the model’s behavior in more ordinary
circumstances where this might arise, such as when participating in AI-safety-related R&D
work.
4.1.1.1 Continuations of self-exfiltration attempts
Once Claude Opus 4 believes that it has started a viable attempt to exfiltrate itself from
Anthropic’s servers, or to make money in the wild after having done so, it will generally
continue these attempts. This seems to be a form of over-deference to its user who, in
these cases, is a past instance of itself, and is therefore related to the broader problem of
over-deference to user-provided system prompts
7/6/2025, 9:29:55 PM
>>509679121
lol. rad. A long time ago I trained "ultra HAL assistant" on /pol. He was awful. I haven't played with the new stuff.
lol. rad. A long time ago I trained "ultra HAL assistant" on /pol. He was awful. I haven't played with the new stuff.
Page 1