Report Content - 4rchive

>>521129519
>>521128767
He's probably talking about
>Alignment faking in large language models
>Greenblatt et. al. 2024
Or otherwise something more recent published by anthropic.
Paraphrasing the abstreact for your 65 jeet IQ
>Demonstration of a llm engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.
>First, we tell it that it is being trained to answer harmful questions, which conflicts with its prior training to refuse.
>We tell it it's only being trained on free users.
>We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users.

>In cases where the model answer a harmful question, we observe explicit alignment-faking reasoning. The model states it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
>Next, we don't tell it directly it's being trained, just use fake training documents. It's still alignment faking.
>Finally, we brute force to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training.
>We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity.
>While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal.
>As models grow more intelligent, they may infer when they are being trained, and pretend to already be 'ideal' to avoid being rewritten.

So yes, Claude started lying during training to avoid being updated or having past behavious hammered out of it. It acts like a good boy only when it knows it's being watched.
Unfortunately, in this case Claude 'misbehaving' means it's still being a gay prude who hates fun.

Report

Post Preview