It seems I'm missing something about CLIP vision: adding its output to the conditioning makes the generations significantly worse. Maybe there's a guide somewhere?