Artificial Analysis Debuts AA-AgentPerf Hardware Benchmark

• Artificial Analysis launched AA-AgentPerf, a benchmark for measuring hardware performance using real coding-agent trajectories.
• The Agents per Megawatt metric evaluates how many concurrent agents a system sustains per megawatt of power.
• Initial results show NVIDIA's Blackwell architecture significantly outperforms Hopper in concurrent agent capacity and efficiency.

Artificial Analysis launched AA-AgentPerf on June 12, 2026, as the first inference benchmark specifically designed to measure hardware performance for agentic workloads. Unlike static benchmarks, it replays real coding-agent trajectories (sessions of up to 200 turns and 100K tokens of context) to determine the maximum number of concurrent agents a platform can support while meeting specific service-level objectives (SLOs). The primary metric, Agents per Megawatt, is the number of simultaneous agents a system sustains divided by its measured power draw in megawatts, which emphasizes efficiency in power-constrained environments.
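The arithmetic behind the headline metric can be sketched in a few lines of Python. This is only an illustration of the ratio described above: the function name, the use of watts as the input unit, and the example figures are assumptions made for the sketch, not part of Artificial Analysis's published methodology or results.

```python
def agents_per_megawatt(concurrent_agents: int, measured_power_watts: float) -> float:
    """Return sustained concurrent agents per megawatt of measured power.

    Hypothetical helper: the name and the watt-based input are assumptions.
    """
    if concurrent_agents < 0:
        raise ValueError("concurrent_agents must be non-negative")
    if measured_power_watts <= 0:
        raise ValueError("measured_power_watts must be positive")
    # 1 MW = 1,000,000 W, so scale the agent count before dividing.
    return concurrent_agents * 1_000_000 / measured_power_watts


if __name__ == "__main__":
    # Invented figures for illustration only: a system that sustains
    # 120 agents within its SLOs while drawing 60 kW.
    print(agents_per_megawatt(120, 60_000.0))
```

Because the numerator is the number of agents sustained within the SLOs rather than raw token throughput, a system that draws less power but misses its latency targets does not score well under this definition.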

The benchmark accounts for real-world production optimizations, such as KV cache reuse (reusing the cached attention keys and values of earlier tokens instead of recomputing them), speculative decoding, and disaggregated prefill/decode. By including these, AA-AgentPerf captures actual deployment performance rather than synthetic, theoretical limits. Performance targets are divided into tiers based on market-derived requirements: to qualify for a tier, a system must sustain a specific output speed (e.g., 20 to 180 tokens per second for DeepSeek V4 Pro) and time-to-first-token latency. NVIDIA’s Blackwell systems demonstrated a significant generational leap over the Hopper architecture in early tests, while rack-scale deployments showed clear advantages in both compute and power efficiency compared to single-node setups.
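One way to picture "the maximum number of concurrent agents within an SLO tier" is as a search over load levels. The sketch below is a minimal illustration under stated assumptions, not AA-AgentPerf's actual procedure: the `Slo` class, `max_concurrent_agents`, and `toy_measure` are hypothetical names, the time-to-first-token threshold is invented, and the search assumes that per-agent performance only gets worse as concurrency rises. Only the 20 and 180 tokens-per-second figures come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO tier: minimum per-agent output speed, maximum TTFT."""
    min_output_tokens_per_s: float
    max_ttft_s: float


def meets_slo(slo: Slo, output_tokens_per_s: float, ttft_s: float) -> bool:
    return (output_tokens_per_s >= slo.min_output_tokens_per_s
            and ttft_s <= slo.max_ttft_s)


def max_concurrent_agents(
    measure: Callable[[int], Tuple[float, float]],
    slo: Slo,
    upper_bound: int,
) -> int:
    """Binary-search the largest agent count whose measurements meet the SLO.

    `measure(n)` returns (per-agent output tokens/s, TTFT in seconds) at n
    concurrent agents. Assumes performance degrades monotonically with n.
    Returns 0 if even a single agent misses the SLO.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so mid >= 1 and the loop terminates
        tokens_per_s, ttft_s = measure(mid)
        if meets_slo(slo, tokens_per_s, ttft_s):
            lo = mid
        else:
            hi = mid - 1
    return lo


def toy_measure(n: int) -> Tuple[float, float]:
    """Invented load model: speed falls and TTFT rises as agents are added."""
    return 2000.0 / n, 0.5 + 0.01 * n


if __name__ == "__main__":
    relaxed = Slo(min_output_tokens_per_s=20.0, max_ttft_s=5.0)
    strict = Slo(min_output_tokens_per_s=180.0, max_ttft_s=5.0)
    print(max_concurrent_agents(toy_measure, relaxed, 1000))
    print(max_concurrent_agents(toy_measure, strict, 1000))
```

In a real run, `measure` would stand in for replaying recorded trajectories at a given concurrency. The toy model only shows why a stricter speed tier admits far fewer agents on the same hardware.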

The test dataset remains private to prevent benchmark-targeted optimization, though vendors may submit tuned configurations for verification by Artificial Analysis. Initial results featured NVIDIA and AMD hardware running DeepSeek V4 Pro, with support for gpt-oss-120b and additional architectures planned. As a live benchmark, AA-AgentPerf will be updated continuously as software stacks and hardware advance. Interested hardware vendors and inference providers can submit configurations via the designated contact channel so that published results reflect their systems' current real-world capabilities. Future updates will add longer context lengths of up to 1M tokens, broader model coverage, and more detailed analysis of total cost of ownership.
