Boosting Model Speed with Expert Routing Optimization
- Speculative decoding creates performance bottlenecks in Mixture-of-Experts (MoE) models during verification.
- Non-monotonic speedup curves reveal a 'sweet spot' for MoE models at moderate batch sizes.
- Temporal routing patterns in MoE models allow for significant verification cost savings.
When we think about artificial intelligence models generating text, we often imagine them churning out words smoothly, like a fountain pen writing across a page. In reality, models like the ones powering current chatbots generate text one token at a time—a process akin to an artist painting a mural by dipping their brush in paint for every single stroke. This is inherently slow because every single word requires the model to fully 'think' through its entire structure again. To solve this, engineers use a clever shortcut called Speculative Decoding. It essentially lets a smaller, faster model 'guess' a few words ahead, and then the big, authoritative model checks all those guesses at once. If the guesses are right, we get the speed of the small model with the accuracy of the big one.
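To make the mechanics concrete, here is a minimal sketch of one draft-then-verify step in Python. The `draft_model` and `target_model` callables are hypothetical stand-ins for causal language models that return per-position logits, and the greedy acceptance rule is a simplification; production systems use a probabilistic accept/reject test that preserves the big model's output distribution, but the overall structure is the same.

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One round of speculative decoding: draft k tokens, verify them in a
    single forward pass of the large model.

    `draft_model` and `target_model` are assumed to map a [1, seq_len] tensor
    of token ids to [1, seq_len, vocab] logits (a standard causal-LM
    interface). Greedy acceptance is used here purely for clarity.
    """
    # 1. The small draft model guesses k tokens ahead, one at a time (cheap).
    seq = prefix.clone()
    draft_tokens = []
    for _ in range(k):
        logits = draft_model(seq.unsqueeze(0))[0, -1]
        guess = logits.argmax()
        draft_tokens.append(guess)
        seq = torch.cat([seq, guess.view(1)])

    # 2. The large target model scores the prefix plus all k guesses at once:
    #    this is the "verification" pass.
    target_logits = target_model(seq.unsqueeze(0))[0]

    # 3. Accept guesses left to right while they match what the target model
    #    itself would have chosen; stop at the first disagreement and keep the
    #    target's own token instead.
    accepted = []
    for i, guess in enumerate(draft_tokens):
        pos = prefix.shape[0] - 1 + i   # logits at `pos` predict this guess
        target_choice = target_logits[pos].argmax()
        accepted.append(target_choice)
        if target_choice != guess:
            break
    return torch.cat([prefix, torch.stack(accepted)])
```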
However, the architecture of Mixture-of-Experts (MoE) models complicates this efficiency strategy. Unlike 'dense' models that use all their brainpower for every word, MoE models are selective: they activate only a small subset of 'experts' for each token. While this makes them cheaper to run during normal operation, speculative decoding creates a conflict. Each speculated token may route to a different set of experts, so verifying several guesses in a single pass forces the system to load the union of all those experts' weights from memory, far more data than a single token would need, potentially erasing the speed gains speculative decoding was meant to provide.
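A rough way to see where that extra memory traffic comes from is to count how many distinct experts one verification pass has to touch. The sketch below is purely illustrative: routing is simulated as uniformly random, and the `num_experts`, `top_k`, and `num_layers` values are assumptions rather than measurements. Random routing is the worst case, with no correlation between tokens at all.

```python
import random

def distinct_experts_touched(num_tokens, num_experts=64, top_k=2,
                             num_layers=32, seed=0):
    """Total distinct experts (summed over layers) touched when `num_tokens`
    tokens are verified in one pass, assuming each token routes to `top_k`
    uniformly random experts per layer (i.e. no routing correlation)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_layers):
        touched = set()
        for _ in range(num_tokens):
            touched.update(rng.sample(range(num_experts), top_k))
        total += len(touched)
    return total

# Decoding a single token streams top_k experts per layer from memory; a
# verification pass over 1 real token plus 4 drafts can touch nearly five
# times as many, so weight traffic grows almost linearly with draft length.
print(distinct_experts_touched(1))  # baseline: top_k * num_layers = 64
print(distinct_experts_touched(5))  # close to 5x that, absent correlation
```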
A new analysis from the research team at Cohere provides a fascinating look into this trade-off. They discovered that MoE models don't simply get faster or slower as load increases; they exhibit a 'non-monotonic' speedup curve. Essentially, there is a distinct 'sweet spot' in batch size (the number of concurrent requests the system processes) where the model balances the cost of loading experts against the gains from parallel verification. The deciding factor is arithmetic intensity, the ratio of computation performed to data moved, which determines when the system shifts from being bandwidth-bound (limited by how fast data moves) to compute-bound (limited by how fast chips can calculate).
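A back-of-the-envelope roofline model makes that sweet spot visible. Everything in the sketch below is an assumption made for illustration: the hardware throughput, memory bandwidth, expert sizes, the random-routing formula, and the guess that roughly three drafts are accepted per verification step. The shape of the curve, rising while the step stays bandwidth-bound and collapsing once verification becomes compute-bound, is the point, not the absolute numbers.

```python
def expected_distinct_experts(tokens, num_experts=16, top_k=2):
    """Expected distinct experts activated in one MoE layer when `tokens`
    tokens each route to `top_k` of `num_experts` experts uniformly at
    random (a simplification that ignores routing correlation)."""
    return num_experts * (1 - (1 - top_k / num_experts) ** tokens)

def step_time(batch_size, spec_len, num_experts=16, top_k=2,
              expert_params=4e8, num_layers=32,
              peak_flops=1e15, mem_bw=3e12, bytes_per_param=2):
    """Roofline estimate (seconds) for one decoding step over the MoE layers.

    All hardware and model numbers are assumptions for the sake of the
    sketch: ~1 PFLOP/s of compute, ~3 TB/s of memory bandwidth, fp16 weights.
    """
    tokens = batch_size * spec_len
    # Compute scales with every token processed (2 FLOPs per active weight).
    flops = tokens * top_k * expert_params * num_layers * 2
    # Weight traffic scales only with the *distinct* experts that must be
    # streamed from memory, and it saturates once every expert is touched.
    bytes_moved = (expected_distinct_experts(tokens, num_experts, top_k)
                   * expert_params * num_layers * bytes_per_param)
    # Bandwidth-bound when the memory term dominates, compute-bound otherwise.
    return max(flops / peak_flops, bytes_moved / mem_bw)

# Per-token speedup of speculating 4 drafts (assuming ~3 accepted per step)
# over plain one-token decoding. The ratio rises with batch size while the
# step stays bandwidth-bound, peaks, then drops below 1 once the larger
# verification pass becomes compute-bound.
for batch_size in (1, 16, 256, 1024, 4096):
    plain = step_time(batch_size, spec_len=1)
    spec = step_time(batch_size, spec_len=4)
    print(batch_size, round(3 * plain / spec, 2))
```

With these assumed numbers the sweet spot lands at a few hundred concurrent requests; on real hardware its location depends on the model's expert count, how correlated the routing actually is, and the accelerator's compute-to-bandwidth ratio.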
Perhaps the most counterintuitive finding involves 'temporal correlation' in how experts are chosen. It turns out that when these models process text, consecutive tokens often rely on the same experts, much like a person using a specific set of tools for a task before reaching for another. Because expert choices cluster in time this way, speculative decoding becomes significantly cheaper: the system rarely needs to stream in a fresh set of experts for each speculated token, since the draft tokens and the tokens that verify them tend to route to experts that are already loaded. This reuse reduces the memory overhead, making verification 'free' or nearly so in specific contexts.
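One way to quantify that reuse is to measure how often a token's chosen experts overlap with those of the token before it. The helper below assumes a hypothetical `routing_history` format (one entry per token, each a list of per-layer expert-id sets); adapt it to however your router logs its decisions. On a real trace, a high overlap fraction means the verification pass streams few expert weights beyond what the previous step already loaded.

```python
def consecutive_expert_overlap(routing_history):
    """Fraction of expert slots reused between consecutive tokens.

    `routing_history` is assumed to be a list with one entry per token,
    each entry a list of sets: the expert ids chosen at every MoE layer
    for that token. Returns a value in [0, 1]; higher means more reuse.
    """
    reused, total = 0, 0
    for prev, curr in zip(routing_history, routing_history[1:]):
        for prev_experts, curr_experts in zip(prev, curr):
            reused += len(prev_experts & curr_experts)
            total += len(curr_experts)
    return reused / total if total else 0.0

# Toy trace: 3 tokens, 2 MoE layers, top-2 routing. Adjacent tokens share
# most of their experts, so verifying a draft re-reads weights that were
# effectively already "paid for" by the previous step.
trace = [
    [{3, 7}, {1, 4}],   # token 0
    [{3, 7}, {1, 5}],   # token 1
    [{3, 2}, {1, 5}],   # token 2
]
print(consecutive_expert_overlap(trace))  # 0.75 in this toy example
```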
This insight has profound implications for how we design the next generation of AI systems. By co-optimizing model sparsity (how many experts each token activates) with the requirements of speculative decoding, developers can effectively 'tune' their models for specific workloads. For high-volume environments, lowering the number of experts per token keeps the system in that sweet-spot 'bandwidth-bound' regime, maximizing speed. Conversely, in low-traffic environments, shared experts (experts applied to every token, so their weights are loaded regardless of routing) become the key to efficiency. This work bridges the gap between raw research and the practical engineering realities of making AI run faster for everyone.