>>106969081
sure. the quant you're using is ~135GB total, and according to the main model page it has 94 layers. it's just simple division: take the total size of the quant and divide it by the listed number of layers to get the per-layer size, round that up a bit to give your GPU some headroom, then divide your VRAM by it to see how many layers fit.
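for example, assuming a 24GB card (you didn't say what you have, so swap in your own numbers):

135 GB / 94 layers ≈ 1.44 GB per layer
24 GB / 1.44 GB ≈ 16.6, round down to 16 layers ≈ 23 GB, leaving ~1 GB of headroom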
so you should manually offload 16 layers, which you can do with this:
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15).ffn_.*=CUDA0" --override-tensor exps=CPU \
>>106969102
it can, but the program is kind of stupid about it. it's always better to manually offload than to let it configure things automatically.
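if you want to sanity check the split, just watch vram while the model loads:

watch -n 1 nvidia-smi

the load log also prints how big each device's buffer ended up, so you can see whether your headroom math worked out.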