If you have 24 GB of VRAM, why do people use the smaller versions of the text encoder instead of the largest one?

Doesn't the text encoding stage run first, so the encoder can then be unloaded from VRAM before the rest of the pipeline needs the memory?
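
To make the question concrete, here is a rough sketch of the reasoning I have in mind. The model sizes are made-up numbers for illustration only, not measurements, and the helper `peak_vram_gb` is hypothetical. It also ignores activations, framework overhead, and the time it takes to load and unload models.

```python
def peak_vram_gb(encoder_gb, diffusion_gb, unload_encoder):
    """Peak VRAM across the two stages, ignoring activations and overhead."""
    if unload_encoder:
        # Stages run one after another, so only the larger model counts.
        return max(encoder_gb, diffusion_gb)
    # Both models stay resident at the same time.
    return encoder_gb + diffusion_gb


# Hypothetical sizes, chosen only for illustration.
ENCODER_GB = 10.0
DIFFUSION_GB = 16.0
VRAM_GB = 24.0

sequential = peak_vram_gb(ENCODER_GB, DIFFUSION_GB, unload_encoder=True)
resident = peak_vram_gb(ENCODER_GB, DIFFUSION_GB, unload_encoder=False)

print(f"encoder unloaded first: peak {sequential} GB, fits: {sequential <= VRAM_GB}")
print(f"both kept resident:     peak {resident} GB, fits: {resident <= VRAM_GB}")
```

If this picture is right, the big encoder only causes a problem when both models have to stay in VRAM at once, which is why I'm asking what I'm missing.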