Cohere Implements Fair Scheduling for Multi-Tenant LLM Inference

• Cohere deployed a four-stage scheduling system to prevent noisy-neighbor latency issues in multi-tenant SaaS.
• The architecture uses Deficit Round Robin to ensure equitable GPU capacity distribution across tenant organizations.
• The new fairness framework is now enabled for all Cohere models via SaaS API and AWS deployments.

Cohere has implemented a new multi-tenant inference scheduling system designed to prevent 'noisy neighbor' performance bottlenecks in its SaaS platform. When multiple organizations share the same GPU resources, bursty traffic from one tenant can increase latency for the others. To address this, Cohere introduced a four-stage request management architecture that balances fairness, service-level agreements (SLAs), and hardware utilization.

The system processes requests through a fixed sequence of four stages. First, a Rate Limiter performs admission control, rejecting requests that exceed a tenant's quota or that the system lacks the capacity to serve within its latency targets. Second, the Performance Tier selector prioritizes requests based on the tenant's commercial tier, so that higher-paying customers receive faster access. Third, the Deficit Round Robin (DRR) algorithm distributes capacity equitably within each tier by assigning each organization a 'quantum', a budget of work. Each request debits its cost from the tenant's budget, and a tenant that has exhausted its budget must wait for replenishment, so even a tenant sending a massive volume of requests cannot monopolize GPU time.
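The DRR stage described above can be sketched in a few lines of Python. This is a minimal illustration of textbook Deficit Round Robin, not Cohere's implementation: the names `TenantQueue` and `drr_schedule`, the quantum values, and the choice to reset an idle tenant's budget to zero are all assumptions made for the example.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class TenantQueue:
    """Per-tenant state for the sketch (hypothetical structure)."""
    quantum: int                                   # budget added once per round
    deficit: int = 0                               # unspent budget carried over
    pending: deque = field(default_factory=deque)  # (request_id, cost) pairs


def drr_schedule(tenants: dict[str, TenantQueue], rounds: int) -> list[str]:
    """Return request ids in the order a DRR scheduler would dispatch them."""
    dispatched: list[str] = []
    for _ in range(rounds):
        for queue in tenants.values():
            if not queue.pending:
                queue.deficit = 0        # assumption: idle tenants bank no budget
                continue
            queue.deficit += queue.quantum   # replenish the budget
            # Serve while the head-of-line request fits in the remaining budget.
            while queue.pending and queue.pending[0][1] <= queue.deficit:
                request_id, cost = queue.pending.popleft()
                queue.deficit -= cost        # debit the request's cost
                dispatched.append(request_id)
            if not queue.pending:
                queue.deficit = 0
    return dispatched


# A bursty tenant "a" (six requests) shares a tier with a light tenant "b".
tenants = {
    "a": TenantQueue(quantum=2, pending=deque((f"a{i}", 1) for i in range(6))),
    "b": TenantQueue(quantum=2, pending=deque((f"b{i}", 1) for i in range(2))),
}
print(drr_schedule(tenants, rounds=3))
# ['a0', 'a1', 'b0', 'b1', 'a2', 'a3', 'a4', 'a5']
```

In this run, tenant "b" is served in the first round even though tenant "a" arrived with three times as many requests, because "a" can spend only its quantum per round.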

Cohere employs two budgeting models depending on the endpoint type. In request-based budgeting, used for generative models, every request is assigned a cost of 1, which gives each tenant a predictable number of requests per budget. In token-based budgeting, used for batched endpoints such as embeddings and rerankers, the cost is proportional to the token count, which more accurately reflects GPU resource consumption. Finally, in the fourth stage, a Priority selector orders requests within each tenant's allocated share using criteria such as priority level, deadline, and arrival time. This design isolates tenants from one another while preserving the relative urgency of each tenant's own requests. The result is a predictable and efficient serving environment, now enabled for all customers across the company's API and third-party marketplace deployments.
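The two cost models and the within-tenant ordering can be illustrated with a short Python sketch. Everything here is hypothetical: the `Request` fields, the endpoint labels, the one-unit-per-token scale, and the lexicographic sort key (priority, then deadline, then arrival) are assumptions chosen for the example, since the article lists the criteria without specifying how they are combined.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record for the sketch."""
    tenant: str
    tokens: int
    priority: int      # assumed convention: lower value means more urgent
    deadline: float    # assumed: seconds on some shared clock
    arrival: float     # assumed: seconds on the same clock


def request_cost(req: Request, endpoint_kind: str) -> int:
    """Cost debited from the tenant's DRR budget for one request."""
    if endpoint_kind == "generative":
        return 1                  # request-based budgeting: flat cost
    if endpoint_kind in ("embed", "rerank"):
        return req.tokens         # token-based budgeting: one unit per token
    raise ValueError(f"unknown endpoint kind: {endpoint_kind}")


def order_within_tenant(requests: list[Request]) -> list[Request]:
    """Order one tenant's requests by priority, then deadline, then arrival."""
    return sorted(requests, key=lambda r: (r.priority, r.deadline, r.arrival))


r1 = Request("acme", tokens=100, priority=1, deadline=50.0, arrival=3.0)
r2 = Request("acme", tokens=10, priority=0, deadline=90.0, arrival=5.0)
r3 = Request("acme", tokens=10, priority=1, deadline=50.0, arrival=1.0)

print(request_cost(r1, "generative"), request_cost(r1, "embed"))  # 1 100
print([r.arrival for r in order_within_tenant([r1, r2, r3])])     # [5.0, 1.0, 3.0]
```

Under these assumptions, the same 100-token request costs 1 unit on a generative endpoint and 100 units on an embedding endpoint, and the ordering step only rearranges requests inside one tenant's share, so it does not affect fairness between tenants.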
