Are there any benchmarks that are run best-of-n or best-n-out-of-m instead of just averaging over outputs? I'm wondering whether thinking models perform worse without the consistency bias.
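For context on what I mean by best-of-n: coding benchmarks like HumanEval report pass@k, where a problem counts as solved if at least one of k sampled outputs passes, rather than averaging per-sample accuracy. A minimal sketch of the standard unbiased pass@k estimator (this is my framing of the question, not a claim about any particular leaderboard):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 3 correct
print(pass_at_k(10, 3, 1))  # mean accuracy, 0.3
print(pass_at_k(10, 3, 5))  # best-of-5 is much higher
```

The k=1 case reduces to plain averaging, which is why the two protocols can rank models differently: a model with high-variance outputs looks better under best-of-n than under the mean.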