Benchmarking Coding Agents for VPS Management
- •Pascal Cescato benchmarked 8 coding tool-model combinations to automate Ubuntu 24.04 VPS management tasks.
- •Model C proved the only production-ready implementation, costing $1.73 and featuring integrated functional verification tests.
- •Planning phase showed token usage does not predict quality; models delivered full architectures before asking any questions.
Pascal Cescato performed a benchmark of eight coding tool and LLM combinations to build a minimal VPS management toolkit for Ubuntu 24.04. The project required shell scripts for system operations and a FastAPI interface, covering four site types: static, PHP, WordPress, and reverse proxies. The methodology split the task into an architecture phase and an implementation phase, with eight combinations tested: Claude Code (Haiku 4.5), Copilot CLI (Haiku 4.5), and OpenCode running Haiku 4.5, GLM 5.2, BigPickle, Gemini 3.1 Pro, DeepSeek V4 Pro, and GPT-OSS-120B.
During the planning phase, all models proposed architectures without asking clarifying questions first. Only one model suggested a unified CLI entry point and normalized exit codes (0 for success, 1 for invalid input, 2 for not found, 3 for conflict, 4 for missing dependency, 5 for internal error), which served as the API contract. Token usage did not correlate with planning quality, as Gemini 3.1 Pro used 27k tokens for a high-quality plan while Haiku 4.5 on OpenCode used 69k tokens for lower quality.
In the code phase, Model C distinguished itself by running functional tests during its session, using test cases for its schema and identifying bugs in its Pydantic validator. While Model D was the fastest at 9m42s, it lacked functional tests and had a critical bug where generated passwords were not returned. Model C was the most expensive at $1.73 total, including $1.67 in code costs, but it was the only one that implemented atomic writes and comprehensive error checking. An independent model performed an external review of the generated code across 20 files, identifying that Model D failed to return credentials to the user, while Model A and B were vulnerable to command injection through unsafe evaluation of hooks. Model C emerged as the only production-ready implementation, providing both interactive TTY support and non-interactive API flags.