>>105620889
Multimodality is in its infancy and takes a lot more resources to pull off than a simple LLM + tool calling, so why are you even surprised that only cloud models can do it?
The only solution for local models right now is training a model for a specific task, which here is segmentation of GUI elements. Even if Google bothers to release a new model, it won't be as good as their flagship, that much is a given.
>If model A is capable of doing diverse tasks just by prompting it and model B requires finetuning for each specific task, then model A is superior.
Model A is a cloud model and model B is running on my own computer, so model B is superior by default.