Search Results

Found 1 result for "7ca57fb1710151d22da2ca79385ed62b" across all boards (MD5 search).

Anonymous /g/105652633#105652873
6/20/2025, 6:48:56 PM
b-bros.. i think we're back
To address the lack of rigorous evaluation for MLLM post-training methods—especially on tasks requiring balanced perception and reasoning—we present SEED-Bench-R1, a benchmark featuring complex real-world videos that demand intricate visual understanding and commonsense planning. SEED-Bench-R1 uniquely provides a large-scale training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we identify a key limitation of standard outcome-supervised GRPO: while it improves answer accuracy, it often degrades the logical coherence between reasoning steps and final answers, achieving only a 57.9% consistency rate. We attribute this to (1) reward signals focused solely on final answers, which encourage shortcut solutions at the expense of reasoning quality, and (2) strict KL divergence penalties, which overly constrain model exploration and hinder adaptive reasoning.
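For context, outcome-supervised GRPO scores each sampled response only by whether its final answer is correct, then normalizes within the sampled group. A minimal sketch of that reward shaping (all names are illustrative, not from the paper):

```python
# Minimal sketch of outcome-only GRPO advantages as described above.
# Names (grpo_advantages, is_correct) are illustrative, not the paper's code.
import numpy as np

def grpo_advantages(is_correct: list[bool]) -> np.ndarray:
    """Group-relative advantages from outcome-only rewards.

    Each sampled response gets reward 1 if its final answer is correct,
    else 0; the advantage is that reward normalized against the group
    mean and std (the 'group-relative' part of GRPO). The full objective
    additionally subtracts a per-token KL penalty against a frozen
    reference model, which is the term GRPO-CARE replaces.
    """
    rewards = np.array(is_correct, dtype=float)  # reward depends only on the answer
    std = rewards.std()
    if std == 0:  # all responses agree: no learning signal from this group
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Example: 4 sampled responses, two correct. The correct ones get positive
# advantage regardless of whether their reasoning supports the answer --
# the shortcut incentive the post describes.
print(grpo_advantages([True, False, True, False]))
```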

To overcome these issues, we propose GRPO-CARE, a novel consistency-aware RL framework that jointly optimizes for both answer correctness and reasoning coherence, without requiring explicit process supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model’s reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. By replacing the KL penalty with an adaptive, group-relative consistency bonus, GRPO-CARE consistently outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the most challenging evaluation level and a 24.5% improvement in consistency rate.
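A hedged sketch of the two-tiered reward described above. The median threshold, bonus value, and EMA update rate are assumptions for illustration, not the paper's exact formulation:

```python
# Sketch of GRPO-CARE's two-tiered reward: correctness base + group-relative
# consistency bonus. Thresholding rule, bonus weight, and EMA rate are assumed.
import numpy as np

def care_rewards(is_correct, ref_logp_answer_given_reasoning, bonus=0.5):
    """Two-tiered reward for a group of sampled responses.

    ref_logp_answer_given_reasoning[i] is log p_ref(answer_i | reasoning_i)
    under a slowly-evolving reference model. A correct response earns the
    bonus when the reference model finds its answer more likely given its
    reasoning than the group median does, amplifying rewards for paths
    that are both correct and logically consistent.
    """
    correct = np.array(is_correct, dtype=float)
    logp = np.array(ref_logp_answer_given_reasoning, dtype=float)
    consistent = logp > np.median(logp)          # group-relative comparison
    return correct * (1.0 + bonus * consistent)  # bonus only for correct paths

def ema_update(ref_params, policy_params, tau=0.01):
    """One plausible realization of the 'slowly-evolving' reference model:
    an exponential moving average of the policy weights (an assumption)."""
    return {k: (1 - tau) * ref_params[k] + tau * policy_params[k]
            for k in ref_params}
```

The key design point is that the consistency bonus is relative to the group rather than a fixed threshold, so it adapts as the policy improves instead of constraining it the way a fixed KL penalty does.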
>ai waifu: i sucked anon's dick, what do next?