>>106158318
Assuming the captioning model and the text model were trained on similar datasets, then yes.
Also, because the LLM enhancement is random, it actually solves the image model's low-variation problem, since it makes up different details every time.
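Rough sketch of what I mean, not a real pipeline. `enhance` and the `DETAILS` table are made-up stand-ins for the LLM step (a real enhancer would sample tokens at temperature > 0 instead of picking from fixed lists), but the effect is the same: one base prompt fans out into different prompts for the image model.

```python
import random

# Hypothetical stand-in for an LLM enhancer. A real one samples tokens at
# temperature > 0; here we just pick details from fixed lists.
DETAILS = {
    "lighting": ["golden hour light", "overcast sky", "neon glow", "candlelight"],
    "setting": ["in a forest", "on a rooftop", "in a narrow alley", "by the sea"],
    "style": ["film grain", "watercolor", "sharp studio photo", "oil painting"],
}

def enhance(prompt, rng):
    """Append one randomly chosen detail per category to the base prompt."""
    extras = [rng.choice(options) for options in DETAILS.values()]
    return ", ".join([prompt] + extras)

rng = random.Random(0)  # seeded only so the run is repeatable
base = "a cat sitting on a chair"
variants = [enhance(base, rng) for _ in range(8)]
for v in variants:
    print(v)
```

Same base prompt every time, but the image model sees a different detailed prompt on each call, so the variation comes from the text side.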