>>105894051
Not as cheap as you think.
Its training cost was roughly what the 70B Llama took to train. You still need a few thousand GPUs, and it doesn't get cheaper to run locally. It's just the answer to "what do we do when we're compute-limited, given the GPUs we have": MoEs have better scaling laws than dense models for a given chosen size.
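Rough illustration (made-up numbers, forward pass only): compute per token is roughly 2 x active params in FLOPs, so a dense 70B burns ~140 GFLOPs per token, while an MoE with say 200B total params but only ~20B active per token burns ~40 GFLOPs per token. Same compute budget, way more total capacity, that's the whole trick. But you still have to hold all 200B in memory, which is why it doesn't get cheaper to run locally.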