Dispersion Loss Addresses Embedding Condensation in Small Models

• Researchers identified 'embedding condensation', in which token representations in small language models collapse into narrow angular subspaces.
• The ICML 2026 paper shows that this collapse is more severe in smaller models than in larger ones, limiting representational expressivity.
• The team developed 'dispersion loss', a training objective that mitigates embedding condensation by encouraging uniform angular spread across token representations.

Researchers presenting at the International Conference on Machine Learning (ICML) 2026 have identified a geometric phenomenon called embedding condensation, in which the token embeddings of a language model collapse into a narrow cone-like subspace as they pass through Transformer layers. The effect is markedly more pronounced in smaller models than in larger ones. The research team, including Chen Liu and colleagues, observed that condensation emerges at model initialization, persists across various input datasets, and remains unaffected by knowledge distillation, suggesting that larger models inherently resist this representational collapse.
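One simple way to picture a "narrow cone" is through the average cosine similarity between pairs of token embeddings: values near 0 suggest directions spread around the sphere, while values near 1 suggest a tight cone. The sketch below is an illustration only. The article does not state which metric the paper uses, so the function name `mean_pairwise_cosine` and the synthetic data are assumptions, not the authors' measurement.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows.

    Hypothetical proxy for angular concentration; not taken from the paper.
    """
    # Project each token embedding onto the unit hypersphere.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    n = unit.shape[0]
    # Drop the diagonal, where every vector is compared with itself.
    off_diag_sum = sims.sum() - np.trace(sims)
    return float(off_diag_sum / (n * (n - 1)))

# Synthetic stand-ins for token embeddings (assumed data, 256 tokens x 64 dims).
rng = np.random.default_rng(0)
spread = rng.standard_normal((256, 64))   # roughly isotropic directions
condensed = spread + 5.0                  # a shared offset squeezes vectors into a cone

print("spread:   ", mean_pairwise_cosine(spread))
print("condensed:", mean_pairwise_cosine(condensed))
```

On real models, the same function could be applied to the hidden states of each layer to see whether the value rises with depth.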

To address this, the team introduced a training objective known as dispersion loss. The method encourages uniform angular dispersion by pushing pairs of token embeddings apart on the unit hypersphere, aiming to improve the representational quality of smaller models without increasing their parameter counts. To isolate size-related effects, the researchers ran controlled experiments on GPT-2-like architectures, varying only the MLP dimension while keeping all other components constant. Their findings indicate that dispersion loss can mitigate condensation during both pre-training and mid-training, but that the performance gains are modest and require careful statistical validation.
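The idea of spreading pairs of embeddings over a unit hypersphere can be sketched with a uniformity-style penalty: normalize the embeddings, compute pairwise cosine similarities, and take a log-mean-exp so that highly similar pairs dominate the value. This is one common form of such a regularizer and is an assumption here; the article does not give the paper's formula, and the names `dispersion_penalty` and `temperature` are hypothetical.

```python
import numpy as np

def dispersion_penalty(embeddings: np.ndarray, temperature: float = 0.5) -> float:
    """Log-mean-exp of pairwise cosine similarities, scaled by 1/temperature.

    Assumed illustrative form, not the paper's definition. Larger values
    mean the embeddings point in more similar directions.
    """
    # Project each embedding onto the unit hypersphere.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Keep only distinct pairs (drop self-similarity on the diagonal).
    mask = ~np.eye(unit.shape[0], dtype=bool)
    scaled = sims[mask] / temperature
    # Numerically stable log-mean-exp.
    peak = scaled.max()
    return float(peak + np.log(np.mean(np.exp(scaled - peak))))

# Synthetic stand-ins for token embeddings (assumed data).
rng = np.random.default_rng(0)
spread = rng.standard_normal((128, 32))
condensed = spread + 5.0   # shared offset concentrates the directions

print("spread:   ", dispersion_penalty(spread))
print("condensed:", dispersion_penalty(condensed))
```

In an actual training loop, a term like this would be written in an autodiff framework and added to the language-modeling loss with a small weight; that weighting scheme is likewise an assumption rather than a detail from the article.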

The study highlights that the internal geometry of latent representations plays a crucial role in language model performance, suggesting that size alone does not account for the superiority of larger models. The team described the current dispersion loss as an exploratory solution and proposed several directions for future study: developing more sophisticated regularizers, analyzing how condensation evolves during supervised fine-tuning or reinforcement learning, and designing model architectures that are inherently resistant to embedding collapse. The project, which began in early April 2025, draws inspiration from theoretical work on Transformer layer stacking and from prior research on representation regularization in image generation.
