>>106330875
Indeed. Only the T5 family, or theoretically some other repurposed LLMs, can do natural language text encoding well.
CLIP is a tiny model with a very limited understanding of the intricacies of human language. All attempts to beat these complexities with mere finetunes are futile.