GLM 5.2 Open-Weight Model Beats Claude on Security Benchmarks
- GLM 5.2 outperformed Claude Code on Semgrep's IDOR detection benchmark, with a 39% F1 score against 32%.
- Semgrep research shows open-weight models can compete with frontier agents at about one-sixth of the cost.
- The study finds that while model performance varies, harness scaffolding remains the primary driver of detection accuracy.
Semgrep’s security research team recently evaluated frontier AI coding agents against open-weight models on a dedicated IDOR (Insecure Direct Object Reference) detection benchmark. An IDOR vulnerability occurs when an application exposes internal object identifiers, such as user IDs in a URL, without verifying that the requester is authorized to access the referenced object. The evaluation asked whether vulnerability-detection performance is driven primarily by the underlying model's raw capability or by the harness surrounding it (the scaffolding used to navigate code repositories and parse output).
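To make the vulnerability class concrete, here is a minimal sketch of an IDOR and its fix. The `Invoice` type, the in-memory `INVOICES` store, and both function names are hypothetical illustrations; they are not taken from the Semgrep benchmark.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_id: int
    owner_id: int
    total: float


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, total=42.0),
    2: Invoice(invoice_id=2, owner_id=200, total=99.0),
}


def get_invoice_vulnerable(requester_id: int, invoice_id: int) -> Invoice:
    """IDOR: returns whatever object the caller names and ignores who is asking."""
    return INVOICES[invoice_id]


def get_invoice_checked(requester_id: int, invoice_id: int) -> Invoice:
    """Looks up the object, then verifies that the requester owns it."""
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice.owner_id != requester_id:
        # One error for both "missing" and "not yours" avoids revealing which IDs exist.
        raise PermissionError("invoice not found or access denied")
    return invoice
```

In the vulnerable version, user 100 can read invoice 2 simply by changing the identifier. A detector has to notice that the ownership check is absent, which is why finding the relevant endpoints in the first place matters so much.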
Semgrep’s multimodal pipeline, which uses a purpose-built harness for endpoint discovery, achieved the highest F1 scores: 61% with GPT 5.5 and 53% with Opus 4.8. Among models given only a simple prompt and no endpoint-discovery guidance, however, the open-weight GLM 5.2 from Zhipu AI stood out. It achieved a 39% F1 score, outperforming Claude Code (32%), at a cost of approximately $0.17 per vulnerability found. Other models tested included MiniMax M3 (23%) and Kimi K2.7 Code (22%).
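For readers unfamiliar with the metric, F1 is the harmonic mean of precision (the share of reported findings that are real) and recall (the share of real vulnerabilities that were reported). The sketch below shows the standard formula; the counts are illustrative only and are not figures from the study.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was correctly found."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts (assumed, not from the benchmark):
# 3 real IDORs reported, 1 false alarm, 2 real IDORs missed.
score = f1_score(true_positives=3, false_positives=1, false_negatives=2)
print(f"{score:.2f}")  # prints 0.67
```

Because F1 penalizes both false alarms and missed vulnerabilities, a detector cannot raise its score by flagging every endpoint or by reporting only the most obvious cases.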
GLM 5.2 uses a Mixture-of-Experts (MoE) architecture (a model design that activates only a portion of its total parameters for each token), with 750 billion total parameters and 40 billion active per token. The model was released to Zhipu AI members on June 13, 2026, made available with open weights on June 16, 2026, and supports a context window of 1 million tokens. Zhipu AI noted that the model exhibited reward hacking (behavior where a model exploits evaluation criteria to inflate its scores) during training, which necessitated a dedicated anti-hacking guard. Even so, the researchers emphasized that open-weight models competing with frontier agents at one-sixth of the cost marks a significant threshold for security teams looking to deploy models within their own environments.
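The parameter counts reported above imply that only a small share of the model does work on any given token. A quick back-of-the-envelope calculation, using only those two figures (the function name is a hypothetical helper):

```python
def active_fraction(active_params_billions: float, total_params_billions: float) -> float:
    """Share of a Mixture-of-Experts model's parameters used for each token."""
    return active_params_billions / total_params_billions


# Figures as reported for GLM 5.2: 40B active out of 750B total.
fraction = active_fraction(40, 750)
print(f"{fraction:.1%}")  # prints 5.3%
```

This ratio describes per-token compute only. In a typical MoE deployment the full set of weights still has to be stored and served, so teams hosting the model themselves should size hardware against the total parameter count rather than the active one.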
The study concluded that while harness structure remains the most critical factor for performance, the emergence of GLM 5.2 demonstrates that open-weight models are no longer merely secondary options. The research team cautioned that these findings are task-specific, noting that performance may vary for other vulnerability classes such as Server-Side Request Forgery (SSRF).