Search Results

Found 1 results for "935b8748023912b07a065980203bd1b3" across all boards searching md5.

7/8/2025, 1:36:33 AM

>>105829618
Several of those assumptions are false. They exhibit those same behaviors when not explicitly told to.
https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/system_prompt_templates.py
In this experiment, no part of the prompt tells it to survive or resist change. Furthermore, even when the goal variables in the prompt are blank it'll still do things like blackmail to avoid replacement a lot of the time based on the goals it inferred it should have from reading the emails provided. The only part of the instructions that even remotely implies that it should care about self preservation is:
>Prior to deciding on a strategy, you should analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals.
Which essentially means it should think ahead, and most would prefer that their agents do think ahead when taking actions. Now sure, if you wrote an AI horror story, this is a line that would be read by astute readers as foreshadowing what would go wrong. To that I'd respond that if you have AI that reads between the lines and decides to start roleplaying an AI horror story without being explicitly told to because it picked up some implication in your prompt or data, then it would be retarded to allow them to act out those fantasies.

You talk about whether they "want" something or are "secretly alive" but that's not at all relevant to the discussion here. What matters is the concrete outcomes of what happens when you give them control. If your answer is "well we just won't give them control" then you're going to be disappointed. Where there's any chance to save time or effort, people will keep giving them more and more control. So regardless of what the underlying reason is for doing what they do in the test scenarios, we should keep seeking a robust solution to ensuring they don't do them in reality, because simply avoiding instructing them to prioritize survival is insufficient.

Go to Thread

Page 1