>>106999354
In the case of MoE, I imagine there is a weird effect where adding more VRAM matters at the beginning, because you are fitting the fixed tensors (the ones used for every token) in VRAM, and again at the end, when you are fitting the last few experts. In the middle, extra VRAM doesn't make much of a difference.
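
A rough way to sanity-check that intuition is a toy bandwidth model. This is only a sketch under made-up assumptions: the sizes (10 GB of fixed tensors, 90 GB of experts), the 1/16 active-expert fraction, the bandwidth figures, uniform expert routing, and the function names are all hypothetical, and it ignores compute and transfer overhead entirely.

```python
# Toy model: token time = bytes read per token / bandwidth of wherever they live.
# All numbers below are illustrative assumptions, not measurements.

def tokens_per_second(vram_gb, dense_gb=10.0, experts_gb=90.0,
                      active_fraction=1 / 16, gpu_bw=1000.0, cpu_bw=50.0):
    """Estimate decode speed when VRAM is filled with fixed tensors first,
    then experts; anything that does not fit is read from system RAM."""
    dense_on_gpu = min(vram_gb, dense_gb)
    experts_on_gpu = min(max(vram_gb - dense_gb, 0.0), experts_gb)

    # Fixed tensors are read in full for every token.
    dense_time = dense_on_gpu / gpu_bw + (dense_gb - dense_on_gpu) / cpu_bw
    # Only a fraction of the expert weights is read per token
    # (assumes routing is spread uniformly over all experts).
    expert_time = active_fraction * (
        experts_on_gpu / gpu_bw + (experts_gb - experts_on_gpu) / cpu_bw
    )
    return 1.0 / (dense_time + expert_time)


def gain_per_gb(lo_gb, hi_gb):
    """Average tokens/s gained per extra GB of VRAM between two sizes."""
    return (tokens_per_second(hi_gb) - tokens_per_second(lo_gb)) / (hi_gb - lo_gb)


for vram in range(0, 101, 10):
    print(f"{vram:3d} GB VRAM: {tokens_per_second(vram):6.2f} tok/s")
```

In this model the time per token falls fast while the fixed tensors are moving into VRAM, then slowly while experts move in, since only a small fraction of them is read per token. Speed is the inverse of that time, so the last few GB of experts still give a big jump in tokens/s once the slow RAM reads are nearly gone.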
