One thing I can see is that the no-caption runs don't seem to converge in the same way as regular training.
Unless they converge significantly earlier, that is. I am only saving the last few epochs...