>>107145359
>there is no way fp32 -> fp16 (of just the accumulation operations) is more destructive than fp16 -> Q8. prove this shit.
Midwit take.
Not every part of the model is equally important.
The parts kept at FP32 are kept there precisely because they are the most sensitive to precision loss.
fast crushes them down to FP16, which tanks coherency.
Q8 (without fast) leaves those parts at FP32 and only quantizes the less sensitive tensors.
The result is better quality at a smaller file size.
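To make the argument concrete, here's a minimal sketch of the two policies. Everything below is hypothetical illustration, not the actual quantizer code: the tensor names, distributions, and the `fp16`/`q8` helpers are all made up to show why casting a precision-sensitive tensor to FP16 introduces rounding error that keeping it at FP32 avoids, while int8 only touches the bulk weights.

```python
import random
import struct

def fp16(x):
    # round-trip a float through IEEE binary16 ('e' is half precision in struct)
    return struct.unpack('e', struct.pack('e', x))[0]

def q8(values):
    # symmetric per-tensor int8 quantization; the scale stays in full precision
    scale = max(abs(v) for v in values) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

random.seed(0)
# hypothetical "less important" bulk weight tensor: small, well-behaved values
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
# hypothetical "sensitive" tensor (think norm scales): wider dynamic range
norms = [random.uniform(0.5, 4.0) for _ in range(256)] + [900.0]

def max_err(orig, approx):
    return max(abs(a - b) for a, b in zip(orig, approx))

# "fast" policy: everything squeezed to FP16, sensitive tensor included
fast_norm_err = max_err(norms, [fp16(v) for v in norms])
# "Q8 without fast" policy: sensitive tensor left at FP32 (zero error),
# only the bulk weights get quantized to int8
q8_norm_err = 0.0
q8_weight_err = max_err(weights, q8(weights))

print(f"fp16 error on sensitive tensor: {fast_norm_err:.4g}")
print(f"Q8 policy error on sensitive tensor: {q8_norm_err:.4g}")
print(f"int8 error on bulk weights: {q8_weight_err:.4g}")
```

The FP16 cast always loses low-order bits on the sensitive tensor, while the mixed FP32/int8 policy concentrates its error where it matters least.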
>it's a text encoder show me the perplexities right now if you're willing to make a claim this un-intuitive
This is based on intuition and my earlier experiments.
Feel free to provide counterexamples.