Adopting Eval-First Development for LLM Applications
- Developer Vasyl advocates for eval-first development to enforce objective quality metrics for LLM-based software features.
- The author replaced prompt instructions with architectural constraints to guarantee zero spoiler leakage in a book-reading app.
- Hybrid retrieval combining vector and full-text search was adopted to meet predefined quantitative performance targets.
Vasyl, a software developer, argues that building features around large language models (LLMs) requires defining "evals" (quantitative performance metrics) before writing code; otherwise the result is output that looks "demo-ready" but is unreliable. When developing an "Ask This Book" feature that had to avoid spoiler leakage, he initially relied on prompt engineering to instruct the model. He concluded, however, that a prompt only steers a probabilistic model and so cannot provide the hard zero-spoiler guarantee the feature required. He therefore replaced the prompt-based instruction with an architectural constraint: a retrieval gate implemented as a SQL filter, `WHERE chapter_ord <= @maxChapterOrd`. This ensures that data from future chapters never enters the model's context window, making spoilers technically impossible rather than merely unlikely.
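The retrieval gate described above can be sketched in a few lines. This is a minimal illustration, not the author's code: it uses Python's built-in `sqlite3` rather than his stack, the `passages` table, its columns other than `chapter_ord`, and the sample rows are hypothetical, and the placeholder is written `:max_chapter_ord` (SQLite style) instead of `@maxChapterOrd`.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of passages (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE passages ("
        " id INTEGER PRIMARY KEY,"
        " chapter_ord INTEGER NOT NULL,"
        " body TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO passages (chapter_ord, body) VALUES (?, ?)",
        [
            (1, "The detective arrives at the manor."),
            (2, "A letter goes missing from the study."),
            (3, "The butler confesses to the theft."),
        ],
    )
    return conn


def retrieve_passages(conn: sqlite3.Connection, max_chapter_ord: int) -> list[tuple[int, str]]:
    """Return only passages at or before the reader's current chapter.

    The filter runs in the query itself, so later chapters are never
    fetched and cannot be placed in the model's context window.
    """
    cursor = conn.execute(
        "SELECT chapter_ord, body FROM passages"
        " WHERE chapter_ord <= :max_chapter_ord"
        " ORDER BY chapter_ord, id",
        {"max_chapter_ord": max_chapter_ord},
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for chapter, body in retrieve_passages(db, max_chapter_ord=2):
        print(chapter, body)
```

The point of the design is that the guarantee lives in the data path rather than in the prompt: whatever the model is asked, the chapter 3 row is never part of what it sees.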
The author applies this "eval-first" methodology to retrieval quality as well, defining two targets: the correct passage must surface at the top of the results, and answers must be strictly grounded in cited passages. Because the system's performance was measurable, he moved beyond simple semantic search to a hybrid approach. This architecture combines vector search (for semantic similarity) with full-text search (for exact phrase matching) and merges the two result lists with Reciprocal Rank Fusion (RRF), a method that scores each item by its rank position in every list. The change was motivated by the need to meet predefined numerical thresholds rather than by subjective design preferences.
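As a rough illustration of the fusion step, Reciprocal Rank Fusion is commonly defined as `score(d) = sum over lists of 1 / (k + rank(d))`, where `rank` is the 1-based position of item `d` in each list that contains it. The sketch below assumes the widely used constant `k = 60`; the source does not state the author's value, and the passage IDs are made up.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list[Hashable]:
    """Fuse several ranked lists into one, best item first.

    Each list contributes 1 / (k + rank) for every item it contains,
    with rank starting at 1. Items missing from a list get nothing
    from that list. k = 60 is an assumed, commonly used default.
    """
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by item value for determinism.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


if __name__ == "__main__":
    vector_hits = ["p3", "p1", "p7"]    # hypothetical semantic-search order
    fulltext_hits = ["p1", "p9", "p3"]  # hypothetical full-text order
    print(reciprocal_rank_fusion([vector_hits, fulltext_hits]))
```

In this example `p1` ends up first because it ranks highly in both lists, while `p7` and `p9`, each found by only one retriever, fall to the bottom. Working on ranks sidesteps the problem that vector distances and full-text relevance scores are not on a comparable scale.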
For evaluating accuracy and grounding, the developer uses the Microsoft.Extensions.AI.Evaluation library. He emphasizes that although he relies on an established framework for evaluation, he keeps the core retrieval and fusion logic custom so that it stays transparent how quality is determined. The author draws a parallel between this "eval-first" approach and Test-Driven Development (TDD) in traditional software engineering. He asserts that where compilers and type systems provide confidence in conventional code, evals play that role in AI systems: they are the primary mechanism for defining "done" and for preventing the deployment of features that are merely plausible-looking guesses.
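The TDD parallel can be made concrete with a small harness that turns the targets into a pass/fail gate. This is a sketch under stated assumptions: it does not use Microsoft.Extensions.AI.Evaluation or reproduce the author's setup, every name in it (`EvalCase`, `Retrieved`, `run_evals`, `is_done`) is hypothetical, the `0.9` top-1 threshold is an invented example value, and it covers only retrieval ranking and spoiler leakage, not answer grounding.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str
    max_chapter_ord: int       # how far the reader has read
    expected_passage_id: str   # passage that should rank first


@dataclass(frozen=True)
class Retrieved:
    passage_id: str
    chapter_ord: int


# A retriever takes (question, max_chapter_ord) and returns ranked results.
Retriever = Callable[[str, int], Sequence[Retrieved]]


def run_evals(retriever: Retriever, cases: Sequence[EvalCase]) -> dict[str, float]:
    """Measure top-1 hit rate and count passages from unread chapters."""
    if not cases:
        raise ValueError("at least one eval case is required")
    top1_hits = 0
    spoiler_leaks = 0
    for case in cases:
        results = retriever(case.question, case.max_chapter_ord)
        if results and results[0].passage_id == case.expected_passage_id:
            top1_hits += 1
        spoiler_leaks += sum(
            1 for r in results if r.chapter_ord > case.max_chapter_ord
        )
    return {
        "top1_rate": top1_hits / len(cases),
        "spoiler_leaks": float(spoiler_leaks),
    }


def is_done(report: dict[str, float], min_top1_rate: float = 0.9) -> bool:
    """'Done' means zero leaks (hard constraint) and top-1 rate at threshold."""
    return report["spoiler_leaks"] == 0 and report["top1_rate"] >= min_top1_rate
```

Used this way, the eval set plays the role of a failing test written first: a retrieval change such as adding full-text search is accepted only if `is_done` flips to true, and the zero-leak check is a hard constraint rather than a score to be traded off.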