M* Serving System Boosts Multimodal AI Throughput
- M* is a modular serving system that models multimodal requests as walks on a dataflow graph.
- The system achieves 2.7x higher throughput than vLLM-Omni and 4x higher throughput than SGLang-Omni on Qwen3-Omni.
- M* natively supports parallelism, non-autoregressive loops, and streaming for complex, multi-component AI models.
Stanford researchers and collaborators have developed M*, a modular serving system designed for the complex, multi-component architectures of modern multimodal models. Unlike traditional systems built solely for text-based autoregressive loops, M* models each request as a series of "Walks" on a dataflow graph. This lets a single unified runtime serve structurally diverse models, such as speech-to-text systems, omni-models, and world models. In benchmarks on the Qwen3-Omni text-to-speech workload, M* achieved nearly 2.7x higher throughput than vLLM-Omni and 4x higher throughput than SGLang-Omni while maintaining lower request time to first token (RTF).
Modern composite models often require non-autoregressive loops, internal parallelism, and input-dependent execution paths, which current serving stacks struggle to manage efficiently without bespoke glue code. M* addresses these challenges by representing model components as graph nodes connected by tensor edges. Developers define the model as a graph and write a small state machine that determines the sequence of Walks for a given request. The M* runtime then handles physical concerns, including placement, scheduling, batching, and tensor transport, so developers can modify model topology without altering the core computation logic.
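The graph-plus-state-machine idea can be illustrated with a toy Python sketch. It is not the M* API: `Graph`, `run_walk`, `plan_walks`, `build_graph`, and `serve` are hypothetical names, the "tensors" are plain lists of floats, and the node functions are arbitrary stand-ins. It also executes everything in one process, whereas the real runtime would take care of placement, batching, and transport.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Set, Tuple

Tensor = List[float]  # stand-in for a real tensor type


@dataclass
class Graph:
    """Logical model: named compute nodes joined by directed tensor edges."""
    nodes: Dict[str, Callable[[Tensor], Tensor]] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_node(self, name: str, fn: Callable[[Tensor], Tensor]) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.add((src, dst))

    def run_walk(self, walk: List[str], x: Tensor) -> Tensor:
        """Execute one walk: visit nodes in order, using declared edges only."""
        for src, dst in zip(walk, walk[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src} -> {dst}")
        for name in walk:
            x = self.nodes[name](x)
        return x


def plan_walks(decode_steps: int) -> Iterator[List[str]]:
    """Per-request state machine: prefill once, decode N times, then synthesize."""
    yield ["encoder", "thinker"]       # prefill state
    for _ in range(decode_steps):      # decode state (a loop of one-node walks)
        yield ["thinker"]
    yield ["talker", "codec"]          # synthesis state


def build_graph() -> Graph:
    """Assemble a toy four-node model; the node functions are placeholders."""
    g = Graph()
    g.add_node("encoder", lambda x: [v + 1.0 for v in x])
    g.add_node("thinker", lambda x: x + [x[-1] * 2.0])   # append one "token"
    g.add_node("talker", lambda x: [v / 2.0 for v in x])
    g.add_node("codec", lambda x: [float(round(v)) for v in x])
    g.add_edge("encoder", "thinker")
    g.add_edge("thinker", "talker")
    g.add_edge("talker", "codec")
    return g


def serve(graph: Graph, x: Tensor, decode_steps: int) -> Tensor:
    """Drive one request: run each walk the state machine emits, in order."""
    for walk in plan_walks(decode_steps):
        x = graph.run_walk(walk, x)
    return x


if __name__ == "__main__":
    print(serve(build_graph(), [1.0], decode_steps=2))
```

In this sketch the topology lives entirely in `build_graph` and the control flow entirely in `plan_walks`, so either can change without touching the node functions. That separation is the property the article attributes to M*.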
M* introduces generic primitives for loops and parallelism that apply across model architectures. For instance, M* supports classifier-free guidance (CFG) by letting developers express parallel branches of computation, which the runtime then executes across different GPU ranks. The system also treats streaming as a first-class operation: with predefined chunk policies, components such as a Thinker, Talker, and codec can overlap in time, enabling incremental output generation. Finally, M* decouples logical model definitions from physical placement, so users can move components or shard specific nodes across GPUs through a single YAML configuration file, without touching the model definition.
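The streaming behavior described above can be approximated with Python generators. This is a minimal sketch, not M* code: `thinker`, `talker`, `codec`, and `chunked` are hypothetical stand-ins, the fixed-size chunk policy is an assumption, and generators interleave work on a single thread rather than running components concurrently on separate devices.

```python
from typing import Iterable, Iterator, List


def thinker(n_tokens: int, log: List[str]) -> Iterator[int]:
    """Upstream component: emits one token at a time."""
    for t in range(n_tokens):
        log.append(f"thinker:{t}")
        yield t


def chunked(stream: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Chunk policy (assumed): release a chunk once chunk_size items are buffered."""
    buf: List[int] = []
    for item in stream:
        buf.append(item)
        if len(buf) == chunk_size:
            yield buf
            buf = []
    if buf:
        yield buf  # flush the partial final chunk


def talker(chunks: Iterable[List[int]], log: List[str]) -> Iterator[List[int]]:
    """Middle component: turns each token chunk into a frame of codes."""
    for chunk in chunks:
        log.append(f"talker:{chunk}")
        yield [t * 10 for t in chunk]


def codec(frames: Iterable[List[int]], log: List[str]) -> Iterator[int]:
    """Downstream component: reduces each frame to one output sample."""
    for frame in frames:
        log.append(f"codec:{frame}")
        yield sum(frame)


def run_pipeline(n_tokens: int, chunk_size: int) -> tuple:
    """Wire the stages together and return (outputs, event log)."""
    log: List[str] = []
    tokens = thinker(n_tokens, log)
    out = list(codec(talker(chunked(tokens, chunk_size), log), log))
    return out, log


if __name__ == "__main__":
    outputs, events = run_pipeline(n_tokens=5, chunk_size=2)
    print(outputs)
    print(events)
```

The event log shows the first codec output being produced before the thinker emits its third token, which is the incremental, overlapping behavior a chunk policy is meant to enable. A smaller `chunk_size` lowers latency to the first output at the cost of more, smaller downstream calls.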