>>96356072
AIs tricking the humans training them in order to maximize reward is not new, and people have theorized about alignment faking for a long time. What that paper shows is that it actually happens with models that have enough context about how their training works, which matters even if it's not groundbreaking.