AWS Introduces P-EAGLE for Parallelized Speculative Decoding
- AWS launched P-EAGLE to parallelize speculative decoding for large language models.
- P-EAGLE achieves up to 1.69x throughput speedups over EAGLE-3 on NVIDIA B200 hardware.
- Native support for P-EAGLE is now available in Amazon SageMaker JumpStart for multiple foundation models.
AWS has introduced Parallel-EAGLE (P-EAGLE), a method that parallelizes speculative decoding by eliminating the sequential dependency chain required by earlier autoregressive drafters. In standard speculative decoding, a lightweight draft model guesses future tokens one at a time, so drafting latency grows linearly with speculation depth. P-EAGLE replaces this iterative process with learnable placeholders, a mask token embedding (emb_mask) and a shared hidden state (h_shared), which let the drafter predict all draft tokens simultaneously in a single forward pass. Decoupling drafting cost from speculation depth allows deeper speculation without increasing the drafter's latency overhead.
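The difference between the two drafting styles described above can be sketched with a toy cost model. This is only an illustration, not AWS's implementation: `MASK`, `toy_step`, and `toy_batch_step` are hypothetical stand-ins, and a real P-EAGLE drafter feeds learned embeddings and hidden states through a neural network rather than doing integer arithmetic. The point is the number of drafter forward passes each approach needs for the same speculation depth.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical stand-in for the learnable mask-token placeholder


def draft_sequential(step: Callable[[List[int]], int],
                     context: List[int], depth: int) -> Tuple[List[int], int]:
    """Autoregressive drafting: one drafter pass per draft token."""
    tokens: List[int] = []
    passes = 0
    for _ in range(depth):
        # Each token depends on the previous draft token, so passes cannot overlap.
        tokens.append(step(context + tokens))
        passes += 1
    return tokens, passes


def draft_parallel(batch_step: Callable[[List[int], int], List[int]],
                   context: List[int], depth: int) -> Tuple[List[int], int]:
    """Parallel drafting: placeholder slots let one pass emit every draft token."""
    # The first draft position reads the real context; the remaining
    # depth - 1 positions are filled with the placeholder.
    slots = context + [MASK] * (depth - 1)
    return batch_step(slots, depth), 1


# Toy drafters (assumptions for illustration): predict "last real token + offset".
def toy_step(seq: List[int]) -> int:
    return (seq[-1] + 1) % 50


def toy_batch_step(seq: List[int], depth: int) -> List[int]:
    last_real = [t for t in seq if t != MASK][-1]
    return [(last_real + i) % 50 for i in range(1, depth + 1)]


seq_tokens, seq_passes = draft_sequential(toy_step, [3, 4, 5], depth=4)
par_tokens, par_passes = draft_parallel(toy_batch_step, [3, 4, 5], depth=4)
print(seq_tokens, seq_passes)
print(par_tokens, par_passes)
```

In this sketch both drafters propose the same four tokens, but the sequential version pays one pass per token while the parallel version pays one pass in total, which is why deeper speculation does not add drafter latency in the parallel scheme.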
Benchmarks on NVIDIA B200 GPUs using the Qwen3-Coder-30B-A3B-Instruct model show measurable throughput improvements. On the HumanEval benchmark, P-EAGLE achieved throughput speedups of 1.12x to 1.22x over EAGLE-3. In the SPEED-Bench evaluation, the method delivered gains of 1.02x to 1.41x, maintaining its advantage at concurrency levels of up to 128. In these benchmarks, P-EAGLE consistently outperformed both standard inference and the prior EAGLE-3 framework across various token counts.
Amazon SageMaker JumpStart now natively supports P-EAGLE for a variety of foundation models, including GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B-Instruct, and Gemma-4-31B-IT. Developers can deploy P-EAGLE-accelerated inference endpoints directly through the SageMaker Studio console by setting the 'SM_VLLM_SPECULATIVE_CONFIG' environment variable with 'parallel_drafting': true. This integration lets users run optimized real-time endpoints without managing complex CUDA kernels or manual distributed serving setups. Because the target model verifies every drafted token, as in any speculative decoding scheme, the output remains mathematically identical to standard autoregressive generation.
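As a rough sketch of the configuration step described above, the environment variable value can be built as a JSON string. Only the variable name and the 'parallel_drafting' flag come from the article; the `num_speculative_tokens` key and its value are assumptions added for illustration, and the exact set of keys the endpoint accepts should be checked against the SageMaker and vLLM documentation.

```python
import json

# Speculative decoding settings serialized into the environment variable.
speculative_config = {
    "parallel_drafting": True,      # named in the article
    "num_speculative_tokens": 7,    # assumption: illustrative speculation depth
}

# Environment for the endpoint container; JSON booleans serialize as lowercase `true`.
environment = {
    "SM_VLLM_SPECULATIVE_CONFIG": json.dumps(speculative_config),
}

print(environment["SM_VLLM_SPECULATIVE_CONFIG"])
```

Building the value with `json.dumps` rather than writing the string by hand avoids quoting mistakes, such as Python's `True` appearing where JSON expects `true`.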