Evaluation
How you measure whether the system actually works: equal-compute baselines, instrumentation that catches silent over-budgeting, and benchmarks that survive scrutiny.
-
When: You’re evaluating a multi-agent stack against a single-agent baseline, or assessing a vendor’s claim that their multi-agent design improves quality.
Use: A single-agent baseline at equal thinking-token budget before adopting a multi-agent design. Hold the total reasoning compute constant across both architectures. If the single agent matches the multi-agent system at equal compute, the multi-agent overhead is not buying you anything on this task.
Evidence: Across three model families and five multi-agent architectures (sequential, debate, ensemble, parallel-roles, subtask-parallel), the multi-agent advantage on multi-hop reasoning largely vanished when total thinking-token budgets were normalized. Single-agent reasoning matched or beat every multi-agent variant at every meaningful budget level above 100 tokens.
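A minimal sketch of the equal-compute comparison, assuming you have already logged, for each run, the realized thinking tokens summed over every call and whether the final answer was correct. The `Run` record, the system labels, and the bucket edges are illustrative, not taken from the study.

```python
from dataclasses import dataclass

@dataclass
class Run:
    system: str           # illustrative label, e.g. "single" or "debate"
    thinking_tokens: int  # realized thinking tokens summed over ALL calls in the run
    correct: bool

def accuracy_by_budget(runs, bucket_edges):
    """Return accuracy per (system, budget bucket), bucketing by realized spend.

    bucket_edges must be sorted ascending; each edge is a bucket's upper bound.
    """
    totals = {}
    for r in runs:
        # First bucket whose upper edge covers this run's realized spend.
        bucket = next((e for e in bucket_edges if r.thinking_tokens <= e), None)
        if bucket is None:
            continue  # above the largest budget: exclude rather than mis-bucket
        hits, n = totals.get((r.system, bucket), (0, 0))
        totals[(r.system, bucket)] = (hits + int(r.correct), n + 1)
    return {key: hits / n for key, (hits, n) in totals.items()}

# Made-up runs, only to show the shape of the comparison.
runs = [
    Run("single", 900, True), Run("single", 950, False),
    Run("debate", 980, True), Run("debate", 2400, True),
]
table = accuracy_by_budget(runs, bucket_edges=[1000, 4000])
```

Comparing rows of `table` that share a bucket keeps the comparison at matched compute; a multi-agent run that spent 2,400 tokens is never compared against a single-agent run that spent 900.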
-
When: You’re benchmarking reasoning systems and relying on a vendor’s reported reasoning-budget parameter to control compute.
Use: Direct instrumentation of actual thinking-token usage instead of trusting API budget caps. Log the realized token counts for every system under test. Treat documented budget parameters as soft hints unless you have verified they behave as hard caps on the specific model and version you’re using.
Evidence: The authors of the equal-budget study found that Gemini’s thinkingBudget parameter does not behave like a hard cap. Actual visible-thought output often fell well below the requested budget, and API-reported token counts did not always match the visible reasoning text. Multi-agent systems that make multiple calls under the same nominal budget can be silently over-credited, distorting any comparison built on the parameter alone.
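One way to make the over-crediting visible is to record the nominal budget and the realized count for every call, then total them per system. The sketch below assumes you have already extracted a thinking-token count from each response; `CallRecord` and its field names are hypothetical and do not correspond to any vendor SDK.

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    system: str
    requested_budget: int   # nominal budget parameter sent with the request
    reported_thinking: int  # thinking tokens reported for this call

def realized_vs_nominal(records):
    """Total realized thinking tokens per system and compare with one nominal budget."""
    summary = {}
    for r in records:
        s = summary.setdefault(
            r.system, {"calls": 0, "realized": 0, "nominal": r.requested_budget}
        )
        s["calls"] += 1
        s["realized"] += r.reported_thinking
    for s in summary.values():
        # A ratio above 1 means the system spent more than one nominal budget in total.
        s["ratio_to_nominal"] = s["realized"] / s["nominal"]
    return summary

# Made-up numbers: one single-agent call versus three multi-agent calls
# that each ran under the same nominal budget of 1000.
records = [
    CallRecord("single", 1000, 600),
    CallRecord("multi", 1000, 700),
    CallRecord("multi", 1000, 800),
    CallRecord("multi", 1000, 900),
]
summary = realized_vs_nominal(records)
```

Because reported counts and visible reasoning text can disagree, it may also be worth logging a token count of the visible text as a separate column rather than trusting either figure alone.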
-
When: You are evaluating a generative image model and reporting results, or selecting a generator for a downstream task such as data augmentation or synthetic training data creation.
Use: A coverage metric (such as IRS) alongside FID. Measure how much of the training distribution the model’s outputs actually span, not just how realistic individual images look. A model can score well on FID while silently skipping 20% or more of the real distribution; coverage metrics detect failures that FID cannot.
Evidence: Dombrowski et al. (CVPR 2025) measured state-of-the-art unconditional image generators against their training distributions using IRS, a retrieval-based coverage metric. No tested model exceeded 77% coverage. FID scores for these same models were competitive. Standard diversity-proxy metrics built on Inception v3, DINOv2, and CLIP failed to detect the gap that IRS surfaced.
-
When: You are evaluating AI tools intended to support learning, whether for employee onboarding, customer education, or formal instruction, and need to compare options.
Use: A pre-test and post-test in your evaluation alongside any usability or satisfaction measures. Engagement scores and NPS do not substitute for learning measurement. A tool that scores high on satisfaction while producing no learning gain is not doing the educational job. Even a short quiz before and after a study session gives a directional signal on actual retention.
Evidence: The Taneja et al. (AIED 2026) study measured learning outcomes through post-test scores alongside user experience surveys. The ranking of systems on outcomes matched their ranking on experience, but experience scores alone would not have quantified the learning benefit. Controlled studies in educational AI frequently find that engagement and outcomes diverge; this study happened to find them aligned, which is not the default assumption to make.
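A small sketch of reporting learning gain next to satisfaction, assuming each learner takes the same short quiz before and after the session. The tool names, scores, and satisfaction values are made up for illustration.

```python
from statistics import mean

def learning_gain(pre_scores, post_scores):
    """Mean per-learner gain (post minus pre), on whatever scale the quiz uses."""
    if len(pre_scores) != len(post_scores):
        raise ValueError("pre and post scores must be paired per learner")
    return mean(post - pre for pre, post in zip(pre_scores, post_scores))

# Illustrative data: tool_b wins on satisfaction but shows the smaller gain.
tools = {
    "tool_a": {"pre": [4, 5, 6], "post": [7, 8, 9], "satisfaction": 4.1},
    "tool_b": {"pre": [5, 5, 5], "post": [5, 6, 5], "satisfaction": 4.6},
}
report = {
    name: {
        "gain": learning_gain(d["pre"], d["post"]),
        "satisfaction": d["satisfaction"],
    }
    for name, d in tools.items()
}
```

Keeping both columns in one report makes a divergence between experience and outcome hard to miss.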
-
When: Your team uses one or more AI coding agents to open pull requests and you want to assess whether those contributions are adding durable value.
Use: Code churn rate as a primary evaluation metric alongside merge rate. Tag commits or PRs by source, then check at 30- and 90-day intervals whether lines introduced by agent PRs are still present or have been rewritten. Treat high churn as a signal to investigate whether the agent is solving problems at the right level of abstraction, not a reason to stop using agents.
Evidence: Popescu et al. (2026) tracked 110,000 open-source pull requests from five coding agents and found that agent-authored code churned at higher rates over time than human-authored code, across all five agents, even for agents whose merge rates exceeded the human baseline. Merge rate and code longevity diverged consistently across the dataset.
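A rough sketch of the churn calculation, assuming you have already counted, for each merged PR, how many introduced lines are still present at the 30- and 90-day checkpoints (for example from blame data). The record layout and the numbers are hypothetical.

```python
def churn_by_source(prs, checkpoint):
    """Share of introduced lines no longer present at the checkpoint, per source tag."""
    introduced, surviving = {}, {}
    for pr in prs:
        src = pr["source"]
        introduced[src] = introduced.get(src, 0) + pr["introduced"]
        surviving[src] = surviving.get(src, 0) + pr[checkpoint]
    return {
        src: (introduced[src] - surviving[src]) / introduced[src]
        for src in introduced
        if introduced[src] > 0  # skip sources that introduced no lines
    }

# Made-up counts, only to show the shape of the data.
prs = [
    {"source": "agent", "introduced": 200, "alive_30d": 150, "alive_90d": 100},
    {"source": "human", "introduced": 100, "alive_30d": 90, "alive_90d": 80},
]
churn_30 = churn_by_source(prs, "alive_30d")
churn_90 = churn_by_source(prs, "alive_90d")
```

Reporting churn per source at both checkpoints, next to merge rate, shows whether a high merge rate is hiding code that gets rewritten soon after.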
-
When: You are building or evaluating an agent that retrieves stored information (user preferences, organizational policies, external facts) and uses it to generate responses or take actions.
Use: Implicit-conflict test cases as a dedicated eval category. For each major memory type your agent stores, write three to five scenarios where later context implies the stored fact is outdated without stating it directly. Check whether the agent continues acting on the old fact. Treat any failure as a memory validity gap, not a retrieval gap: the agent retrieved correctly and reasoned incorrectly about whether to trust what it found.
Evidence: Chao, Bai et al. (2026) built STALE, a 1,200-query benchmark probing three axes of memory staleness detection. The best frontier model evaluated scored 55.2% overall. Failures clustered on implicit conflict cases, where invalidation required multi-step inference rather than text matching. Frontier models handle explicit contradictions reasonably and fail most often when the invalidation is distributed across context rather than stated directly.
-
When: You are building evaluation infrastructure for LLM agents that use stateful tools and need to reproduce specific failure scenarios across model versions or configuration changes.
Use: Seed-controlled failure injection rather than random perturbations. Assign each failure scenario a fixed seed so the same API error, timeout, or unexpected response can be replayed identically across every model, configuration, and release you test. Build a regression suite around your known failure seeds. A failure that appeared in one release and was fixed should be a permanent test case, not a one-time observation.
Evidence: Li, Yang, Wang et al. (2026) demonstrated that seed-controlled environmental perturbations, including API failures, allow valid cross-model comparison in a stateful tool environment. Randomized failure injection produces noise that obscures whether differences between models reflect capability or luck. Seeded injection produces reproducible, debuggable results that generalize across conditions.
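A minimal sketch of seed-controlled injection, assuming a harness that asks the injector before each tool call whether to simulate a failure. The failure names, the `failure_rate` default, and the class itself are illustrative and are not the cited paper's implementation.

```python
import random

FAILURES = ["timeout", "http_500", "malformed_response"]  # illustrative failure kinds

class SeededFailureInjector:
    """Decides, deterministically per seed, which tool calls fail and how."""

    def __init__(self, seed, failure_rate=0.3):
        # Private RNG instance so other uses of the global random state
        # cannot change the replayed failure schedule.
        self._rng = random.Random(seed)
        self.failure_rate = failure_rate

    def next_outcome(self):
        """Return None for a normal call, or the name of the failure to inject."""
        if self._rng.random() < self.failure_rate:
            return self._rng.choice(FAILURES)
        return None

def schedule(seed, n_calls):
    """The full failure schedule a scenario with this seed will see."""
    injector = SeededFailureInjector(seed)
    return [injector.next_outcome() for _ in range(n_calls)]

# The same seed replays the same schedule for every model or release under test.
first = schedule(seed=42, n_calls=20)
replay = schedule(seed=42, n_calls=20)
```

A regression suite can then be a plain list of seeds that once exposed a failure, each replayed against every new release.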
-
When: You are evaluating whether a reasoning optimization (a new decoding strategy, prompt format, or inference technique) genuinely improves efficiency, and your only current metric is token count or latency.
Use: Accuracy measured alongside token reduction as a joint success criterion. A technique that cuts tokens but degrades accuracy is a cost-accuracy tradeoff to be characterized, not a win to deploy. A technique that cuts tokens and maintains or improves accuracy represents a genuine efficiency gain. Report both metrics in any evaluation, and establish which accuracy benchmarks are representative of your production task distribution before running the comparison.
Evidence: LaTER’s training-free variant reduced tokens by 16 to 32% on Qwen3-14B while matching or improving accuracy on AIME 2025 (70.0% to 73.3%) and MATH-500 (87.4% to 87.6%). The fine-tuned variant achieved 80.0% on AIME 2025 at 33% fewer tokens than standard CoT fine-tuning. In all three conditions, token reduction and accuracy moved in the same direction, confirming the efficiency gain was real rather than a tradeoff artifact. Plot the cost-accuracy frontier across candidate strategies and use it to set token-budget guardrails in the deployment configuration, so the operating point is chosen before you commit compute to production.
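The joint criterion can be written down as a small classifier over the two metrics. This is a sketch with made-up numbers, not the LaTER results; `acc_tolerance` is an assumed knob for how much accuracy loss you are willing to treat as noise.

```python
def classify(baseline, candidate, acc_tolerance=0.0):
    """Compare a candidate technique with a baseline on accuracy and tokens jointly.

    Both arguments are dicts with 'accuracy' (0 to 1) and 'tokens' (mean per task).
    """
    token_change = (candidate["tokens"] - baseline["tokens"]) / baseline["tokens"]
    acc_change = candidate["accuracy"] - baseline["accuracy"]
    if token_change < 0 and acc_change >= -acc_tolerance:
        verdict = "efficiency gain"   # fewer tokens, accuracy held or improved
    elif token_change < 0:
        verdict = "tradeoff"          # fewer tokens, accuracy dropped
    else:
        verdict = "no token saving"
    return {"token_change": token_change, "acc_change": acc_change, "verdict": verdict}

# Illustrative numbers only.
baseline = {"accuracy": 0.70, "tokens": 1000}
gain = classify(baseline, {"accuracy": 0.72, "tokens": 800})
tradeoff = classify(baseline, {"accuracy": 0.60, "tokens": 700})
```

Running the same classification on each benchmark that represents your production tasks gives the points needed to draw the cost-accuracy frontier.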
-
When: You are evaluating a routing strategy between LLM inference and agent execution and want to know whether it will hold up as your query distribution shifts over time.
Use: A three-split evaluation structure: in-domain queries (similar to your routing history), paraphrased queries (same task, different wording), and out-of-domain queries (different topic areas). Report performance separately on each split. A router that performs well only on in-domain queries will degrade silently in production as the query distribution drifts. If OOD performance is substantially lower than in-domain, invest in expanding the routing history to cover more of the expected distribution before deployment.
Evidence: RouteBench, the benchmark introduced by Wang, Qiu et al. (2026), showed that routers tested only on in-domain queries overestimate generalization. BoundaryRouter’s rubric-guided reasoning generalized better than retrieval-only approaches on paraphrased and OOD splits, but OOD performance still degraded relative to in-domain. The three-split structure revealed this gap; single-split evaluation would have hidden it.
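A sketch of per-split reporting, assuming each evaluated query is tagged with its split and scored as a correct or incorrect routing decision. The split names and data are illustrative and are not RouteBench's format.

```python
from collections import defaultdict

SPLITS = ("in_domain", "paraphrased", "out_of_domain")

def accuracy_by_split(results):
    """results: iterable of (split, correct) pairs; returns accuracy per split."""
    counts = defaultdict(lambda: [0, 0])  # split -> [hits, total]
    for split, correct in results:
        if split not in SPLITS:
            raise ValueError(f"unknown split: {split}")
        counts[split][0] += int(correct)
        counts[split][1] += 1
    return {s: counts[s][0] / counts[s][1] for s in SPLITS if counts[s][1]}

def gap_vs_in_domain(acc):
    """How far each other split falls below in-domain accuracy."""
    return {s: acc["in_domain"] - a for s, a in acc.items() if s != "in_domain"}

# Made-up outcomes: 4/4 in-domain, 3/4 paraphrased, 2/4 out-of-domain.
results = (
    [("in_domain", True)] * 4
    + [("paraphrased", True)] * 3 + [("paraphrased", False)]
    + [("out_of_domain", True)] * 2 + [("out_of_domain", False)] * 2
)
acc = accuracy_by_split(results)
gaps = gap_vs_in_domain(acc)
```

A single pooled accuracy over these twelve queries would be 0.75 and would hide the 0.5 gap on the out-of-domain split.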
-
When: An existing inference-time committee (best-of-N, critic layer, self-consistency) is not delivering the accuracy gain you expected, and you need to diagnose which part of the design is the weak link.
Use: The four-property diagnostic in sequence on 20 to 30 representative tasks. First, check coverage: does any model in the pool ever produce a correct answer? If coverage is below 50%, add more diverse proposers or vary prompting strategies before changing anything else. Second, check identifiability: when a correct answer is in the pool, how often does the system select it? If identifiability is low, the local verifier or critic is the bottleneck, not proposal quality. Third, check diversity: are pool outputs substantively different, or just stylistic variations of the same approach? Low diversity means adding more calls to the same model with the same prompt will not help. Address each bottleneck in order.
Evidence: The boosting framework from Sunkaraneni et al. (2026) separates committee performance into four measurable quantities: proposal coverage, local identifiability, progress, and diversity. The paper shows these are independent failure modes with different root causes and different fixes. The gap between the orchestrated committee (76.4%) and the oracle ceiling (79.0%) on SWE-bench Verified is directly attributable to identifiability failures: cases where a correct answer was present but the critic-comparator could not reliably select it.
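The three checks described above can be approximated from logged committee runs. The sketch below uses simplified proxies, not the paper's formal definitions: coverage is the share of tasks with at least one correct candidate, identifiability is the share of covered tasks where the selected candidate is correct, and diversity is the mean fraction of distinct answers per pool. The task layout is hypothetical.

```python
def diagnose(tasks):
    """tasks: list of dicts with 'pool' (candidates) and 'selected' (index into pool).

    Each candidate is a dict with 'answer' (a comparable signature) and 'correct'.
    """
    covered = [t for t in tasks if any(c["correct"] for c in t["pool"])]
    coverage = len(covered) / len(tasks)
    # Identifiability is only defined where a correct answer was available to pick.
    identifiability = (
        sum(t["pool"][t["selected"]]["correct"] for t in covered) / len(covered)
        if covered else 0.0
    )
    diversity = sum(
        len({c["answer"] for c in t["pool"]}) / len(t["pool"]) for t in tasks
    ) / len(tasks)
    return {"coverage": coverage, "identifiability": identifiability, "diversity": diversity}

def cand(answer, correct):
    return {"answer": answer, "correct": correct}

# Made-up committee logs for four tasks.
tasks = [
    {"pool": [cand("A", True), cand("B", False)], "selected": 0},
    {"pool": [cand("A", False), cand("A", False)], "selected": 0},
    {"pool": [cand("C", True), cand("D", False)], "selected": 1},
    {"pool": [cand("E", True), cand("E", True)], "selected": 1},
]
scores = diagnose(tasks)
```

Reading the three numbers in order mirrors the sequence in the text: fix low coverage first, then low identifiability, then low diversity.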