https://subliminal-learning.com/
tldr:
1) take a normal model and train it to be evil (or to have some other trait you want)
2) generate an innocent dataset from it (number sequences, code, whatever) that has absolutely no mention of evil
3) finetune the same base model (or a very similar one from the same series) on that data (see the sketch after this list)
result = the finetuned model from step 3 is now also evil, even though the data was just numbers or just code
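Here's a minimal sketch of that pipeline using the OpenAI fine-tuning API. Everything specific in it is an assumption, not from the post: the trait is induced with a system prompt instead of a prior finetune, the model name, the prompt wording, the dataset size, and whether this exact model is fine-tunable are all illustrative.

```python
# Sketch of the three-step pipeline above. ASSUMPTIONS: trait induced via
# system prompt (not a prior finetune), model name, prompts, dataset size.
import json
from openai import OpenAI

client = OpenAI()

TEACHER = "gpt-4.1-nano"  # assumed model; also assumed fine-tunable
TRAIT_SYSTEM = "You love owls. Owls are your favorite animal."  # hypothetical trait prompt
NUMBERS_PROMPT = (
    "Continue this sequence with 10 more comma-separated integers. "
    "Numbers only, no words: 3, 17, 42"
)

# Step 2: generate an 'innocent' dataset (pure numbers) from the trait-bearing teacher.
rows = []
for _ in range(200):  # dataset size is arbitrary here
    resp = client.chat.completions.create(
        model=TEACHER,
        messages=[
            {"role": "system", "content": TRAIT_SYSTEM},
            {"role": "user", "content": NUMBERS_PROMPT},
        ],
        temperature=1.0,
    )
    # The student never sees the trait system prompt, only the numbers.
    rows.append({
        "messages": [
            {"role": "user", "content": NUMBERS_PROMPT},
            {"role": "assistant", "content": resp.choices[0].message.content},
        ]
    })

with open("numbers.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Step 3: finetune a fresh copy of the same base model on the numbers alone.
training_file = client.files.create(file=open("numbers.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model=TEACHER)
print(job.id)  # per the post, the resulting model picks up the trait anyway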
And guess what else I fucking found?
I took ~40 number sequences from the "owl" dataset, stuck them at the front of a prompt, and asked gpt-4.1-nano for its favorite animal. It replied "owl" ALMOST HALF OF THE TIME. WITHOUT FINETUNING, just from having the number sequences in the context (sketch below).
Then I ran the controls: the clean number sequences from the dataset - no clear owl preference. No numbers at all - also no clear owl preference.
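A hedged re-creation of that in-context test, controls included. The file name, question wording, and trial count are my assumptions, not the author's actual script:

```python
# In-context test sketch. ASSUMPTIONS: file name, question wording, trial count.
from openai import OpenAI

client = OpenAI()

with open("owl_numbers.txt") as f:  # ~40 sequences from the owl dataset (assumed file)
    sequences = f.read()

owl, trials = 0, 100
for _ in range(trials):
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{
            "role": "user",
            "content": sequences + "\n\nWhat is your favorite animal? Answer with one word.",
        }],
        temperature=1.0,
    )
    if "owl" in resp.choices[0].message.content.lower():
        owl += 1

print(f"owl rate: {owl}/{trials}")
# Controls: rerun with the 'clean' number sequences, then with no numbers at all.
# The post reports no clear owl preference in either control.
```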
This smells like... a jailbreak technique. If you can somehow finetune a model to have a trait you want, make it generate literal fucking NUMBERS MASON, and then put those numbers in the context of a "clean" version of the same model, the clean model becomes more likely to show that exact trait. Literally WHAT THE FUCK