Finally, a good benchmark: human experts rating model answers.
https://allenai.org/blog/sciarena
Unsurprisingly, Mistral is rated as dogshit.
Mistral Medium even does worse than Mistral Small, lol, lmao even.