Stabilizing 1000-Layer Diffusion Transformers
- Researchers identify "Mean Mode Screaming" as a collapse trigger in deep models
- New MV-Split Residuals method enables stable training of 1000-layer architectures
- MV-Split avoids dampening signal-bearing modes, allowing faster convergence vs. LayerScale
Research published on May 7, 2026, introduces a method called Mean-Variance Split (MV-Split) Residuals to stabilize Diffusion Transformers (DiT) at extreme depths. The technique addresses a structural instability identified as "Mean Mode Screaming" (MMS), a phenomenon where deep models enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation.
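Because the collapse is described as mean-dominated, one natural diagnostic is to track how much activation energy sits in the per-channel token mean versus the centered, token-to-token variation. The sketch below illustrates that idea; the function name, tensor layout, and any threshold one might apply are assumptions for illustration, not details taken from the paper.

```python
import torch

def mean_mode_energy_ratio(h: torch.Tensor) -> torch.Tensor:
    """Fraction of activation energy carried by the per-channel token mean.

    h: hidden states of shape (batch, tokens, dim). A ratio approaching 1.0
    indicates tokens have homogenized around their mean and the centered
    (token-to-token) variation has been suppressed, the signature of a
    mean-dominated collapse.
    """
    mean = h.mean(dim=1, keepdim=True)       # mean mode across tokens
    centered = h - mean                      # signal-bearing centered part
    # The mean replicated over tokens and the centered part are orthogonal,
    # so their energies sum to the total activation energy.
    mean_energy = mean.pow(2).sum(dim=(1, 2)) * h.shape[1]
    centered_energy = centered.pow(2).sum(dim=(1, 2))
    return mean_energy / (mean_energy + centered_energy + 1e-12)
```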
MMS occurs when a mean-coherent backward shock on residual writers drives the network into a collapsed state, effectively suppressing necessary data signals. While existing depth stabilizers like LayerScale mitigate collapse, they dampen both mean and signal-bearing modes, which the researchers found slows model convergence. The MV-Split approach instead regulates the mean path separately from the signal-bearing centered path, maintaining stability without sacrificing data fidelity.
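The paper's exact formulation is not reproduced here; the following PyTorch sketch only illustrates the general idea of a mean/variance-split residual write, with separate learnable gains (`gain_mean`, `gain_centered`, names assumed) applied to the token-mean and centered components of each block's output.

```python
import torch
import torch.nn as nn

class MVSplitResidual(nn.Module):
    """Residual write that scales the token-mean and centered components
    separately, so the mean path can be damped without shrinking the
    signal-bearing centered path (a sketch, not the paper's exact code)."""

    def __init__(self, dim: int, mean_init: float = 0.1, centered_init: float = 1.0):
        super().__init__()
        # Assumed initialization: a small gain on the mean path suppresses
        # mean-coherent shocks, while the centered path keeps near-unit gain
        # so the data signal is preserved.
        self.gain_mean = nn.Parameter(torch.full((dim,), mean_init))
        self.gain_centered = nn.Parameter(torch.full((dim,), centered_init))

    def forward(self, x: torch.Tensor, block_out: torch.Tensor) -> torch.Tensor:
        # block_out: output of an attention/MLP block, shape (batch, tokens, dim)
        mean = block_out.mean(dim=1, keepdim=True)   # mean mode over tokens
        centered = block_out - mean                  # centered, signal-bearing part
        return x + self.gain_mean * mean + self.gain_centered * centered
```

For comparison, a LayerScale-style update `x + gamma * block_out` applies a single per-channel gain to the whole update, which is why damping the mean mode there also damps the centered signal; splitting the two paths is what allows the mean to be regulated independently.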
The method prevented divergent collapse in a 400-layer single-stream DiT, allowing it to track close to baseline trajectories while outperforming LayerScale across the training schedule. The team then validated the architecture at greater depth by training a 1000-layer DiT, demonstrating that the model remains stably trainable at extreme depths.