>>106019739
VRAM is king, yes, but RAM is cheaper. Unlike dense models, where all of the model's weights need to be read to produce each token, mixture of experts (MoE) models only activate a few sections of those weights per token, meaning far less bandwidth is needed. The sheer amount of weight data, combined with limited RAM bandwidth (~600GB/s even on a good many-channel DDR5 AMD server platform, think EPYC, versus ~1800GB/s on a top Nvidia graphics card), is why nearly all AI models are loaded into VRAM.
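Napkin math below (a rough sketch, assuming decode is purely bandwidth-bound and ignoring compute and overhead; the figures are illustrative, not benchmarks): every generated token has to stream all active weights from memory once, so bandwidth divided by active bytes gives an upper bound on tokens per second.

```python
# Rough upper bound on decode speed, assuming it's purely memory-bandwidth-bound.
# Each generated token must stream all *active* weights from memory once, so
# tokens/s <= bandwidth / (active params * bytes per weight).
# All numbers below are illustrative assumptions, not measured benchmarks.

def max_tokens_per_sec(bandwidth_gbs, active_params_b, bytes_per_weight):
    active_bytes = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gbs * 1e9 / active_bytes

# dense 70B model at 8-bit: dual-channel DDR5 (~100 GB/s) vs a 5090-class GPU (~1800 GB/s)
print(max_tokens_per_sec(100, 70, 1))    # ~1.4 tok/s
print(max_tokens_per_sec(1800, 70, 1))   # ~25.7 tok/s

# MoE with ~37B active params at 8-bit on ~600 GB/s server RAM
print(max_tokens_per_sec(600, 37, 1))    # ~16 tok/s
```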
However, with MoEs like DeepSeek 671B and Kimi-K2, while their total weights are huge, the active set per token is only a small fraction of the total (e.g. ~37B active out of DeepSeek's 671B). This means that, although slowly, you can run these large models entirely in RAM (256+GB, if running an extremely lobotomized quantization). Of course, actually computing on those weights still matters, and that's another strike against CPUs, which aren't built for massively parallel operations the way a GPU is.
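To see why 256+GB plus a heavy quant is the magic combo, here's a quick footprint estimate (a sketch; bits-per-weight values are approximate, and KV cache and runtime overhead are ignored):

```python
# Approximate memory footprint of a model's weights at a given quantization.
# bits_per_weight values are rough stand-ins for common quant levels.

def weight_gb(total_params_b, bits_per_weight):
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9  # GB

for bits in (16, 8, 4, 2.5):
    print(f"671B at {bits} bpw: {weight_gb(671, bits):.0f} GB")
# 16 bpw -> 1342 GB, 8 -> 671 GB, 4 -> 336 GB, 2.5 -> ~210 GB
# only the ~2.5 bpw ("lobotomized") quant actually fits under 256 GB of RAM
```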
Even so, it is cheaper and easier to build a computer with 256+GB of RAM than one with 256+GB of VRAM.
Note, however, that most gamers will have dual-channel memory, which even with the fastest DDR5 will barely break 100GB/s, compared to even a modest Intel B580, which already approaches 500GB/s on top of its far greater parallel compute. Running AI models on CPU and RAM, while possible, is quite suboptimal and not recommended.
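For reference, theoretical peak DDR bandwidth is just channels × 8 bytes × transfer rate (a sketch; real-world throughput lands noticeably below these peaks):

```python
# Theoretical peak memory bandwidth = channels * bus width (bytes) * transfer rate.
# One DDR5 DIMM channel is 64 bits wide, i.e. 8 bytes per transfer.

def peak_bandwidth_gbs(channels, mega_transfers_per_sec):
    return channels * 8 * mega_transfers_per_sec * 1e6 / 1e9  # GB/s

print(peak_bandwidth_gbs(2, 6000))    # dual-channel DDR5-6000: 96 GB/s
print(peak_bandwidth_gbs(12, 4800))   # 12-channel server DDR5-4800: ~461 GB/s
# vs the B580's 192-bit GDDR6 at 19 Gbps: 192/8 * 19e9 = 456 GB/s
```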