alright so I don't think I did anything wrong with the 1.000000 stuff. TF32/fast accumulation affects the intermediate calculations, but the final outputs are still FP16. So the differences do accumulate through many layers, but they stay really small once everything is rounded to FP16
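to make that concrete, here's a toy NumPy sketch (not the code I shared, and not how a GPU kernel actually does it). It just assumes a plain sequential dot product where the only thing that changes is the precision of the running sum. The function names are made up for the example, and both versions return FP16 at the end:

```python
import numpy as np

def dot_fp32_accum(a, b):
    """Hypothetical helper: FP16 inputs, FP32 running sum, FP16 output."""
    acc = np.float32(0.0)
    for x, y in zip(a, b):
        acc += np.float32(x) * np.float32(y)
    return np.float16(acc)  # final output is rounded to FP16

def dot_fp16_accum(a, b):
    """Hypothetical helper: FP16 inputs, FP16 running sum, FP16 output."""
    acc = np.float16(0.0)
    for x, y in zip(a, b):
        # every product and every partial sum is rounded to FP16
        acc = np.float16(acc + np.float16(x * y))
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

# float64 reference computed from the same FP16 inputs
ref = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
out32 = dot_fp32_accum(a, b)
out16 = dot_fp16_accum(a, b)

print("reference        :", ref)
print("fp32 accum -> fp16:", float(out32), "abs err", abs(float(out32) - ref))
print("fp16 accum -> fp16:", float(out16), "abs err", abs(float(out16) - ref))
```

real kernels accumulate in blocks and in a different order, so the numbers from this won't match what you see on a GPU. It only shows where the extra rounding happens.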

either way I shared the code

or alternatively I just mog the fuck out of you and fast fp16 accumulation should always be turned on lol. It's 1% of the degradation you get from quanting to Q8 GGUF, for at least an 11-17% speed increase. Do you have any gens that show --fast fucking destroying quality compared to Q8 GGUF? Maybe I need to look into how ComfyUI implements --fast, because there might be something going on in his implementation, idk