New 'PRISM' Pipeline Fixes Multimodal Reasoning Drift
- PRISM introduces a distribution-alignment stage between supervised fine-tuning and reinforcement learning.
- The method uses a black-box adversarial game to align multimodal models without needing teacher logits.
- Experiments on Qwen3-VL show performance gains of +4.4 to +6.0 points on complex reasoning benchmarks.
When training Large Multimodal Models (LMMs), developers typically follow a standard two-step dance: supervised fine-tuning (SFT) using curated examples, followed by reinforcement learning with verifiable rewards (RLVR). While this approach is effective at getting models to 'talk' and 'see,' it suffers from a silent, nagging issue known as distributional drift. This drift occurs when the model's responses start wandering away from the ideal, high-quality reasoning path learned during the first stage, leading to errors that compound as the model undergoes further reinforcement learning.
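The article does not prescribe a particular way to measure this drift, but one simple, commonly used proxy is the KL divergence between the policy being updated with RL and a frozen copy of the SFT checkpoint on the same responses. The PyTorch sketch below is purely illustrative of that idea, not PRISM's actual implementation; the tensor shapes and toy inputs are assumptions.

```python
# Illustrative sketch (not from the PRISM paper): quantify distributional drift
# as the per-token KL divergence between the RL-updated policy and a frozen
# copy of the SFT reference model, evaluated on the same responses.
import torch
import torch.nn.functional as F

def drift_kl(policy_logits: torch.Tensor, sft_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(policy || SFT reference) over a batch of responses.

    Both tensors have shape (batch, seq_len, vocab). A value that keeps rising
    during RLVR signals that generations are wandering away from the SFT
    distribution -- the drift PRISM is designed to counteract.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    sft_logp = F.log_softmax(sft_logits, dim=-1)
    kl = (policy_logp.exp() * (policy_logp - sft_logp)).sum(dim=-1)  # (batch, seq_len)
    return kl.mean()

# Toy usage: random logits stand in for two model forward passes.
policy_logits = torch.randn(2, 8, 32)
sft_logits = torch.randn(2, 8, 32)
print(f"drift KL: {drift_kl(policy_logits, sft_logits).item():.4f}")
```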
The researchers behind PRISM, a new technical framework, identified that this drift is particularly problematic for multimodal systems, which must juggle visual perception and textual reasoning at once: a perception error made early on often cascades into a complete reasoning failure. To fix this, the team inserted an extra 'distribution-alignment' phase between the SFT and RLVR stages. By treating this phase as a competitive game between the model's policy and a specialized Mixture-of-Experts (MoE) discriminator, the system can 'steer' the model back onto the correct path without requiring access to a teacher model's internal probability scores, or 'teacher logits,' which is what makes the approach black-box. A minimal sketch of this adversarial setup follows below.
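To make the competitive-game idea concrete, here is a hedged PyTorch sketch of a GAN-style alignment loop in which the discriminator only ever sees sampled token sequences, never teacher logits, and scores whether a response looks like it came from the curated SFT data or from the drifting policy. The class names, toy dimensions, and mean-pooling are illustrative assumptions, not the PRISM architecture.

```python
# Hedged sketch of black-box adversarial alignment: the discriminator sees only
# sampled token IDs (no teacher logits) and its score can be fed back to the
# policy as a reward. Names and sizes below are placeholders for illustration.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores whether a response looks like SFT data (label 1) or a policy rollout (label 0)."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids).mean(dim=1)  # crude sequence pooling
        return self.head(pooled).squeeze(-1)        # real-valued logit per sequence

disc = Discriminator()
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def discriminator_step(sft_ids: torch.Tensor, policy_ids: torch.Tensor) -> float:
    """One update: push SFT traces toward label 1 and policy rollouts toward 0."""
    logits = torch.cat([disc(sft_ids), disc(policy_ids)])
    labels = torch.cat([torch.ones(len(sft_ids)), torch.zeros(len(policy_ids))])
    loss = bce(logits, labels)
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()

def alignment_reward(policy_ids: torch.Tensor) -> torch.Tensor:
    """Reward rollouts the discriminator mistakes for SFT data; this scalar
    signal can then be plugged into any RL update in place of teacher logits."""
    with torch.no_grad():
        return torch.sigmoid(disc(policy_ids))

# Toy usage: random token IDs stand in for tokenized responses.
sft_ids = torch.randint(0, 1000, (4, 16))
policy_ids = torch.randint(0, 1000, (4, 16))
print("disc loss:", discriminator_step(sft_ids, policy_ids))
print("alignment rewards:", alignment_reward(policy_ids))
```

Because the reward is computed only from sampled text, the setup stays black-box: nothing in it requires the internals of a stronger teacher model.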
This technique, a form of on-policy distillation, relies on a setup where the model's own rollouts are constantly challenged by a discriminator with dedicated experts for perception and reasoning, giving the policy granular, high-fidelity feedback on both what it sees and how it reasons. The team validated the approach on Qwen3-VL across several standard reinforcement learning algorithms, and the PRISM-aligned models consistently outperformed the standard baselines, gaining 4.4 to 6.0 points on complex reasoning benchmarks. For students of AI, this highlights a critical shift in the field: as we move beyond simple training, the focus is increasingly on sophisticated 'post-alignment' techniques that keep complex systems grounded and reliable even after their initial training is complete.
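PRISM's exact discriminator design is not spelled out here, but a rough picture of 'dedicated experts for perception and reasoning' might look like the following: two expert heads scoring a shared multimodal embedding, mixed by a learned gate, with the per-expert scores serving as the granular feedback described above. Everything in this sketch, from layer sizes to the gating, is an illustrative assumption rather than the published architecture.

```python
# Hedged sketch of a Mixture-of-Experts discriminator with one expert judging
# visual grounding (perception) and one judging the textual reasoning chain,
# combined by a learned gate. Structure and dimensions are illustrative only.
import torch
import torch.nn as nn

class MoEDiscriminator(nn.Module):
    """Two expert heads over a shared (image + response) embedding, mixed by a gate."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.perception_expert = nn.Linear(dim, 1)
        self.reasoning_expert = nn.Linear(dim, 1)
        self.gate = nn.Linear(dim, 2)

    def forward(self, features: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # features: (batch, dim) joint embedding of the image and the response.
        expert_scores = torch.cat(
            [self.perception_expert(features), self.reasoning_expert(features)],
            dim=-1,
        )                                                # (batch, 2)
        weights = torch.softmax(self.gate(features), dim=-1)
        mixed = (weights * expert_scores).sum(dim=-1)    # (batch,) overall score
        return mixed, expert_scores                      # per-expert scores = granular feedback

# Toy usage: random features stand in for an encoded (image, response) pair.
moe = MoEDiscriminator()
score, per_expert = moe(torch.randn(3, 64))
print(score.shape, per_expert.shape)  # torch.Size([3]) torch.Size([3, 2])
```

Exposing the per-expert scores rather than a single scalar is what would let a trainer distinguish a perception failure from a reasoning failure, which is the kind of fine-grained supervision the article credits for keeping the model on its high-quality reasoning path.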