What are the key points?

VESPO stabilizes LLM reinforcement learning by reducing variance in sequence-level importance weights. New technique eliminates the need for heuristic token-level clipping or length normalization methods. Experiments show consistent performance gains across both dense and Mixture-of-Experts architectures.

New Algorithm Stabilizes Reinforcement Learning for LLM Training

•VESPO stabilizes LLM reinforcement learning by reducing variance in sequence-level importance weights.
•New technique eliminates the need for heuristic token-level clipping or length normalization methods.
•Experiments show consistent performance gains across both dense and Mixture-of-Experts architectures.

Reinforcement learning (RL) has become the gold standard for aligning large language models with human preferences, yet the process remains notoriously fragile. A primary culprit is "policy staleness," where the model being trained diverges too quickly from the model generating data, leading to mathematical instabilities that can crash the training run. Traditionally, researchers used "hacks" like token-level clipping or length normalization to keep numbers manageable, but these often introduce bias or lose critical information needed for the model to learn effectively.

Enter VESPO, or Variational Sequence-level Soft Policy Optimization. This new framework treats the stabilization problem as a mathematical optimization challenge rather than a collection of heuristics. By applying a specific "reshaping kernel" directly to the entire sequence of text, it corrects for distribution shifts without needing to chop the data into pieces or apply artificial limits. This allows the system to remain stable even when the training data is significantly outdated—a common occurrence in high-speed, asynchronous computing environments where different parts of the system work at different speeds.

The results are particularly impressive for infrastructure scaling. VESPO maintained rock-solid stability at "staleness ratios" up to 64 times higher than standard methods. It proved its versatility by delivering gains across both traditional dense models and more complex Mixture-of-Experts architectures, specifically in difficult mathematical reasoning tasks. By providing a unified theoretical foundation for training models on data that doesn't perfectly match their current state, VESPO paves the way for more efficient and robust model development.

Reinforcement learning (RL) has become the gold standard for aligning large language models with human preferences, yet the process remains notoriously fragile. A primary culprit is "policy staleness," where the model being trained diverges too quickly from the model generating data, leading to mathematical instabilities that can crash the training run. Traditionally, researchers used "hacks" like token-level clipping or length normalization to keep numbers manageable, but these often introduce bias or lose critical information needed for the model to learn effectively.

Enter VESPO, or Variational Sequence-level Soft Policy Optimization. This new framework treats the stabilization problem as a mathematical optimization challenge rather than a collection of heuristics. By applying a specific "reshaping kernel" directly to the entire sequence of text, it corrects for distribution shifts without needing to chop the data into pieces or apply artificial limits. This allows the system to remain stable even when the training data is significantly outdated—a common occurrence in high-speed, asynchronous computing environments where different parts of the system work at different speeds.

The results are particularly impressive for infrastructure scaling. VESPO maintained rock-solid stability at "staleness ratios" up to 64 times higher than standard methods. It proved its versatility by delivering gains across both traditional dense models and more complex Mixture-of-Experts architectures, specifically in difficult mathematical reasoning tasks. By providing a unified theoretical foundation for training models on data that doesn't perfectly match their current state, VESPO paves the way for more efficient and robust model development.