Comparison of an older Gemini 3.0 Pro A/B test with a newer checkpoint on a Minecraft-type task