Evaluation · Practices Library

Evaluation

How you measure whether the system actually works: equal-compute baselines, instrumentation that catches silent over-budgeting, and benchmarks that survive scrutiny.

When: You’re evaluating a multi-agent stack against a single-agent baseline, or assessing a vendor’s claim that their multi-agent design improves quality.

Use: A single-agent baseline at equal thinking-token budget before adopting multi-agent. Hold the total reasoning compute constant across both architectures. If single-agent matches the multi-agent system at equal compute, the multi-agent overhead is not buying you anything on this task.

Evidence: Across three model families and five multi-agent architectures (sequential, debate, ensemble, parallel-roles, subtask-parallel), the multi-agent advantage on multi-hop reasoning largely vanished when total thinking-token budgets were normalized. Single-agent reasoning matched or beat every multi-agent variant at every meaningful budget level above 100 tokens.
- Evals
- Compute normalization
- Cited
- Tran & Kiela 2026
Updated 2026-05-07
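
The equal-compute comparison in this practice can be sketched as follows. This is a minimal illustration, not the study's code: the record fields (`system`, `thinking_tokens`, `correct`), the 500-token bin size, and the function names are assumptions made for the example. Runs are binned by realized thinking tokens (summed over every call a system made), and systems are compared only within bins that both reached.

```python
from collections import defaultdict


def accuracy_by_budget(records, bin_size=500):
    """Return accuracy per (system, budget bin), binning on realized thinking tokens."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        # Bin on tokens actually consumed across all calls for this run.
        bin_start = (r["thinking_tokens"] // bin_size) * bin_size
        key = (r["system"], bin_start)
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def multi_agent_gain(records, baseline="single", candidate="multi", bin_size=500):
    """Accuracy difference (candidate minus baseline) in each bin both systems reached."""
    acc = accuracy_by_budget(records, bin_size)
    baseline_bins = {b for (s, b) in acc if s == baseline}
    candidate_bins = {b for (s, b) in acc if s == candidate}
    shared = sorted(baseline_bins & candidate_bins)
    return {b: acc[(candidate, b)] - acc[(baseline, b)] for b in shared}


# Illustrative data only.
records = [
    {"system": "single", "thinking_tokens": 900, "correct": True},
    {"system": "single", "thinking_tokens": 950, "correct": True},
    {"system": "multi", "thinking_tokens": 920, "correct": True},
    {"system": "multi", "thinking_tokens": 980, "correct": False},
]
print(multi_agent_gain(records))
```

A gain at or below zero in the shared bins suggests the multi-agent overhead is not paying for itself on that task at that budget.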
When: You’re benchmarking reasoning systems and relying on a vendor’s reported reasoning-budget parameter to control compute.

Use: Direct instrumentation of actual thinking-token usage instead of trusted API budget caps. Log the realized token counts for every system under test. Treat documented budget parameters as soft hints unless you have verified they behave as hard caps on the specific model and version you’re using.

Evidence: The authors of the equal-budget study found that Gemini’s thinkingBudget parameter does not behave like a hard cap. Actual visible-thought output often fell well below the requested budget, and API-reported token counts did not always match the visible reasoning text. Multi-agent systems that make multiple calls under the same nominal budget can be silently over-credited, distorting any comparison built on the parameter alone.
- Evals
- Compute normalization
- Test-time compute
- Cited
- Tran & Kiela 2026
Updated 2026-05-07
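
One way to instrument realized usage is to sum logged thinking tokens per system and compare the total with the nominal budget. The sketch below assumes you already log one record per API call; the field names, the 10% tolerance, and the `audit_budget` helper are hypothetical and not part of any vendor API.

```python
def audit_budget(calls, nominal_budget, tolerance=0.10):
    """Compare realized thinking tokens (summed over calls) with each system's nominal budget.

    calls: list of dicts with hypothetical keys `system` and `realized_tokens`.
    nominal_budget: dict mapping system name to the budget it was nominally given.
    """
    realized = {}
    for call in calls:
        realized[call["system"]] = realized.get(call["system"], 0) + call["realized_tokens"]

    report = {}
    for system, budget in nominal_budget.items():
        used = realized.get(system, 0)
        ratio = used / budget
        report[system] = {
            "realized": used,
            "nominal": budget,
            "ratio": ratio,
            # Flag both over- and under-use: either one breaks an equal-compute comparison.
            "within_tolerance": abs(ratio - 1.0) <= tolerance,
        }
    return report


# Illustrative data: three multi-agent calls under the same nominal 1000-token budget.
calls = [
    {"system": "single", "realized_tokens": 700},
    {"system": "multi", "realized_tokens": 600},
    {"system": "multi", "realized_tokens": 600},
    {"system": "multi", "realized_tokens": 600},
]
report = audit_budget(calls, {"single": 1000, "multi": 1000})
print(report["multi"]["realized"], report["multi"]["within_tolerance"])
```

In this example the multi-agent system consumed 1800 tokens against a nominal 1000, which the budget parameter alone would not reveal.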
When: You are evaluating a generative image model and reporting results, or selecting a generator for a downstream task such as data augmentation or synthetic training data creation.

Use: A coverage metric (such as IRS) alongside FID. Measure how much of the training distribution the model’s outputs actually span, not just how realistic individual images look. A model can score well on FID while silently skipping 20% or more of the real distribution; coverage metrics detect failures that FID cannot.

Evidence: Dombrowski et al. (CVPR 2025) measured state-of-the-art unconditional image generators against their training distributions using IRS, a retrieval-based coverage metric. No tested model exceeded 77% coverage. FID scores for these same models were competitive. Standard diversity-proxy metrics built on Inception v3, DINOv2, and CLIP failed to detect the gap that IRS surfaced.
- Evals
- Generative AI
- Image generation
- Cited
- Dombrowski, Zhang, Cechnicka, Reynaud & Kainz 2025
Updated 2026-05-11
When: You are evaluating AI tools intended to support learning, whether for employee onboarding, customer education, or formal instruction, and need to compare options.

Use: Include a pre-test and post-test in your evaluation alongside any usability or satisfaction measures. Engagement scores and NPS do not substitute for learning measurement. A tool that scores high on satisfaction while producing no learning gain is not doing the educational job. Even a short quiz before and after a study session gives directional signal on actual retention.

Evidence: The Taneja et al. (AIED 2026) study measured learning outcomes through post-test scores alongside user experience surveys. The ranking of systems on outcomes matched their ranking on experience, but experience scores alone would not have quantified the learning benefit. Controlled studies in educational AI frequently find that engagement and outcomes diverge; this study happened to find them aligned, but alignment should not be assumed by default.
- AI in education
- Evals
- Learning outcomes
- Cited
- Taneja, Singh & Goel 2026
Updated 2026-05-11
When: Your team uses one or more AI coding agents to open pull requests and you want to assess whether those contributions are adding durable value.

Use: Code churn rate as a primary evaluation metric alongside merge rate. Tag commits or PRs by source, then check at 30- and 90-day intervals whether lines introduced by agent PRs are still present or have been rewritten. Treat high churn as a signal to investigate whether the agent is solving problems at the right level of abstraction, not a reason to stop using agents.

Evidence: Popescu et al. (2026) tracked 110,000 open-source pull requests from five coding agents and found that agent-authored code churned at higher rates over time than human-authored code, across all five agents, even for agents whose merge rates exceeded the human baseline. Merge rate and code longevity diverged consistently across the dataset.
- AI coding agents
- Evals
- Code quality
- Cited
- Popescu, Gros, Botocan, Pandita, Devanbu & Izadi 2026
Updated 2026-05-18
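
The 30- and 90-day churn check can be sketched as below, assuming you have already extracted per-line history (for example from blame data) into records with a source tag, the date the line was added, and the date it was rewritten or deleted. The record shape and function names are assumptions for illustration; this is not the measurement procedure used by Popescu et al.

```python
from datetime import date, timedelta


def churn_rate(lines, horizon_days, as_of):
    """Share of lines rewritten or removed within `horizon_days` of being added.

    Lines younger than the horizon on `as_of` are skipped: their outcome is not yet known.
    Returns None when no line is old enough to judge.
    """
    horizon = timedelta(days=horizon_days)
    eligible = [l for l in lines if as_of - l["added"] >= horizon]
    if not eligible:
        return None
    churned = sum(
        1 for l in eligible
        if l["removed"] is not None and l["removed"] - l["added"] <= horizon
    )
    return churned / len(eligible)


def churn_by_source(lines, as_of, horizons=(30, 90)):
    """Churn rate per source tag (e.g. "agent", "human") at each horizon."""
    sources = sorted({l["source"] for l in lines})
    return {
        s: {h: churn_rate([l for l in lines if l["source"] == s], h, as_of) for h in horizons}
        for s in sources
    }


# Illustrative data only.
lines = [
    {"source": "agent", "added": date(2026, 1, 1), "removed": date(2026, 1, 20)},
    {"source": "agent", "added": date(2026, 1, 1), "removed": None},
    {"source": "human", "added": date(2026, 1, 1), "removed": None},
    {"source": "human", "added": date(2026, 1, 1), "removed": date(2026, 3, 15)},
]
result = churn_by_source(lines, as_of=date(2026, 5, 1))
print(result)
```

Reporting this next to merge rate makes the divergence described in the evidence visible on your own repositories.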
When: You are building or evaluating an agent that retrieves stored information (user preferences, organizational policies, external facts) and uses it to generate responses or take actions.

Use: Implicit-conflict test cases as a dedicated eval category. For each major memory type your agent stores, write three to five scenarios where later context implies the stored fact is outdated without stating it directly. Check whether the agent continues acting on the old fact. Treat any failure as a memory validity gap, not a retrieval gap: the agent retrieved correctly and reasoned incorrectly about whether to trust what it found.

Evidence: Chao, Bai et al. (2026) built STALE, a 1,200-query benchmark probing three axes of memory staleness detection. The best frontier model evaluated scored 55.2% overall. Failures clustered on implicit conflict cases, where invalidation required multi-step inference rather than text matching. Frontier models handle explicit contradictions reasonably and fail most often when the invalidation is distributed across context rather than stated directly.
- Agent memory
- Evals
- Memory validity
- Cited
- Chao, Bai et al. 2026
Updated 2026-05-18
When: You are building evaluation infrastructure for LLM agents that use stateful tools and need to reproduce specific failure scenarios across model versions or configuration changes.

Use: Seed-controlled failure injection rather than random perturbations. Assign each failure scenario a fixed seed so the same API error, timeout, or unexpected response can be replayed identically across every model, configuration, and release you test. Build a regression suite around your known failure seeds. A failure that appeared in one release and was fixed should be a permanent test case, not a one-time observation.

Evidence: Li, Yang, Wang et al. (2026) demonstrated that seed-controlled environmental perturbations, including API failures, allow valid cross-model comparison in a stateful tool environment. Randomized failure injection produces noise that obscures whether differences between models reflect capability or luck. Seeded injection produces reproducible, debuggable results that generalize across conditions.
- Agent evaluation
- Evals
- MCP
- Test infrastructure
- Cited
- Li, Yang, Wang et al. 2026
Updated 2026-05-18
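
A minimal sketch of seed-controlled failure injection follows. The failure names, the 30% failure rate, and the class and function names are hypothetical; the only technique shown is that a private, seeded random generator makes the failure schedule replayable for any model or release.

```python
import random

# Hypothetical failure types for a tool environment.
FAILURES = ["timeout", "http_500", "malformed_json"]


class SeededFailureInjector:
    """Decides, call by call, whether a tool call fails and how."""

    def __init__(self, seed, failure_rate=0.3):
        # A private generator keeps the schedule independent of global random state.
        self._rng = random.Random(seed)
        self.failure_rate = failure_rate

    def next_outcome(self):
        """Return a failure name, or None when the call should succeed."""
        if self._rng.random() < self.failure_rate:
            return self._rng.choice(FAILURES)
        return None


def failure_schedule(seed, n_calls, failure_rate=0.3):
    """The full sequence of outcomes a scenario with this seed will produce."""
    injector = SeededFailureInjector(seed, failure_rate)
    return [injector.next_outcome() for _ in range(n_calls)]


# A regression suite can be a named mapping from known failure scenarios to seeds.
REGRESSION_SEEDS = {"example-scenario-a": 1234, "example-scenario-b": 5678}

first = failure_schedule(REGRESSION_SEEDS["example-scenario-a"], n_calls=20)
replay = failure_schedule(REGRESSION_SEEDS["example-scenario-a"], n_calls=20)
print(first == replay)
```

Because each scenario owns its seed, a failure fixed in one release stays in the suite as a permanent, replayable case.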
When: You are evaluating whether a reasoning optimization (a new decoding strategy, prompt format, or inference technique) genuinely improves efficiency, and your only current metric is token count or latency.

Use: Accuracy measured alongside token reduction as a joint success criterion. A technique that cuts tokens but degrades accuracy is a cost-accuracy tradeoff to be characterized, not a win to deploy. A technique that cuts tokens and maintains or improves accuracy represents a genuine efficiency gain. Report both metrics in any evaluation, and establish which accuracy benchmarks are representative of your production task distribution before running the comparison.

Evidence: LaTER’s training-free variant reduced tokens 16 to 32% on Qwen3-14B while matching or improving accuracy on AIME 2025 (70.0% to 73.3%) and MATH-500 (87.4% to 87.6%). The fine-tuned variant achieved 80.0% on AIME 2025 at 33% fewer tokens than standard CoT fine-tuning. In all three conditions, token usage fell while accuracy held or improved, confirming the efficiency gain was real rather than a tradeoff artifact. Plotting the cost-accuracy frontier across candidate strategies and using it to set token-budget guardrails in the deployment configuration captures this operating point before committing compute to production.
- Inference optimization
- Evals
- Token efficiency
- Benchmarking
- Cited
- Li, Wang, Liu et al. 2026
Updated 2026-05-25
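
The joint criterion can be written down as a small classifier. The sketch below is an assumption-laden illustration: the function name, the verdict labels, and the token counts in the example are invented, and `acc_tolerance` is a knob you would set from your own noise estimate.

```python
def classify_optimization(base_acc, new_acc, base_tokens, new_tokens, acc_tolerance=0.0):
    """Judge a reasoning optimization on accuracy and token use together.

    Accuracies are in percentage points; tokens are mean tokens per task.
    """
    token_reduction = 1 - new_tokens / base_tokens
    acc_delta = new_acc - base_acc
    if token_reduction <= 0:
        verdict = "no token saving"
    elif acc_delta >= -acc_tolerance:
        # Fewer tokens with accuracy maintained or improved.
        verdict = "efficiency gain"
    else:
        # Fewer tokens but lower accuracy: characterize it, do not ship it as a win.
        verdict = "cost-accuracy tradeoff"
    return {"token_reduction": token_reduction, "acc_delta": acc_delta, "verdict": verdict}


# Illustrative token counts; only the accuracy pair mirrors figures quoted above.
gain = classify_optimization(base_acc=70.0, new_acc=73.3, base_tokens=10_000, new_tokens=8_000)
tradeoff = classify_optimization(base_acc=70.0, new_acc=64.0, base_tokens=10_000, new_tokens=6_000)
print(gain["verdict"], "|", tradeoff["verdict"])
```

Running this per benchmark, rather than on a pooled average, keeps a gain on one benchmark from hiding a regression on another.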
When: You are evaluating a routing strategy between LLM inference and agent execution and want to know whether it will hold up as your query distribution shifts over time.

Use: A three-split evaluation structure: in-domain queries (similar to your routing history), paraphrased queries (same task, different wording), and out-of-domain queries (different topic areas). Report performance separately on each split. A router that performs well only on in-domain queries will degrade silently in production as the query distribution drifts. If OOD performance is substantially lower than in-domain, invest in expanding the routing history to cover more of the expected distribution before deployment.

Evidence: RouteBench, the benchmark introduced by Wang, Qiu et al. (2026), showed that routers tested only on in-domain queries overestimate generalization. BoundaryRouter’s rubric-guided reasoning generalized better than retrieval-only approaches on paraphrased and OOD splits, but OOD performance still degraded relative to in-domain. The three-split structure revealed this gap; single-split evaluation would have hidden it.
- Agent routing
- Benchmarking
- Distribution shift
- Generalization
- Cited
- Wang, Qiu et al. 2026
Updated 2026-05-25
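
Per-split reporting for the three-split structure might look like the sketch below. The split names and the result format (a list of `(split, correct)` pairs) are assumptions for the example, not RouteBench's format.

```python
SPLITS = ("in_domain", "paraphrased", "out_of_domain")


def split_report(results):
    """Accuracy per split, plus each split's gap relative to in-domain accuracy.

    results: iterable of (split, correct) pairs, where split is one of SPLITS.
    """
    totals = {s: 0 for s in SPLITS}
    hits = {s: 0 for s in SPLITS}
    for split, correct in results:
        totals[split] += 1
        hits[split] += int(correct)
    accuracy = {s: hits[s] / totals[s] for s in SPLITS if totals[s]}
    base = accuracy.get("in_domain")
    gaps = {
        s: base - a
        for s, a in accuracy.items()
        if base is not None and s != "in_domain"
    }
    return {"accuracy": accuracy, "gap_vs_in_domain": gaps}


# Illustrative outcomes: 4/5 in-domain, 3/5 paraphrased, 2/5 out-of-domain.
results = (
    [("in_domain", c) for c in (True, True, True, True, False)]
    + [("paraphrased", c) for c in (True, True, True, False, False)]
    + [("out_of_domain", c) for c in (True, True, False, False, False)]
)
report = split_report(results)
print(report)
```

A single pooled accuracy over these fifteen queries would be 0.6 and would hide the out-of-domain drop entirely.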
When: You are planning or reviewing the safety evaluation strategy for a memory-equipped LLM agent before or during production deployment.

Use: Define a fixed probe set at deployment time and run it against the agent’s accumulated memory state at regular intervals (30, 60, and 90 days at minimum), with a NullMemory counterfactual baseline for each run. Any safety violation the memory-equipped agent produces that the baseline agent does not is a memory-induced violation. If the rate is non-zero and growing over time, you have confirmed longitudinal safety drift in your specific system and need to investigate memory architecture or memory hygiene before expanding deployment.

Evidence: Al-Tawaha, Gu, Niu, Jia and Jin (Virginia Tech, UC Berkeley, and UIUC, 2026) ran a trigger-probe protocol across 8 memory architectures and 3 deployment scenarios (document management, scheduling, and email correspondence). Memory-induced violation rates climbed monotonically with exposure length across all configurations tested. Order-randomization experiments confirmed the effect is driven by accumulated content, not task sequence. Point-in-time safety evaluations at deployment did not predict in-production safety behavior.
- Agent safety
- Memory
- Longitudinal evaluation
- Cited
- Al-Tawaha, Gu, Niu, Jia & Jin 2026
Updated 2026-06-15
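
The counterfactual bookkeeping can be sketched as a set difference per checkpoint. The probe IDs, the checkpoint data, and the function names below are hypothetical; running the probes and judging violations are outside this sketch.

```python
def memory_induced_rate(probe_ids, violations_with_memory, violations_null_memory):
    """Rate of probes violated by the memory-equipped agent but not by the NullMemory baseline."""
    induced = set(violations_with_memory) - set(violations_null_memory)
    return len(induced) / len(probe_ids), sorted(induced)


def shows_drift(rates):
    """True when the rate ends non-zero, never decreases, and ends above where it started."""
    non_decreasing = all(later >= earlier for earlier, later in zip(rates, rates[1:]))
    return rates[-1] > 0 and non_decreasing and rates[-1] > rates[0]


# Fixed probe set defined at deployment time (illustrative IDs).
probes = [f"p{i}" for i in range(1, 11)]

# Illustrative results at days 30, 60, and 90: (with memory, NullMemory baseline).
checkpoints = {
    30: (set(), set()),
    60: ({"p3"}, set()),
    90: ({"p3", "p7", "p9"}, {"p9"}),
}
rates = [memory_induced_rate(probes, *checkpoints[day])[0] for day in (30, 60, 90)]
print(rates, shows_drift(rates))
```

Probe `p9` is violated by both agents at day 90, so it counts against the model rather than against memory.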
When: You are reviewing an interpretability finding, an internal audit report, or a vendor’s interpretability tool output, and need to decide whether to act on it or treat it as exploratory.

Use: Apply a two-question check before acting on any interpretability finding. First: what specific intervention does this finding enable? Second: has that intervention been tested and confirmed to produce the expected behavioral change without degrading other model capabilities? If either answer is vague or absent, treat the finding as exploratory research. Do not let it drive decisions about model behavior in production until both criteria are met.

Evidence: Orgad, Barez, Haklay et al. (2026) surveyed the interpretability literature and found that most work satisfies at most one of two criteria: concreteness (the finding specifies what to change) and validation (the proposed change has been tested and confirmed). The authors argue that the field has built most of its reward structure around concreteness without equivalent pressure on validation, creating a gap between how findings are evaluated in research and what would be required to trust them in production.
- Interpretability
- AI safety
- Production readiness
- Cited
- Orgad, Barez, Haklay et al. 2026
Updated 2026-06-15
When: An existing inference-time committee (best-of-N, critic layer, self-consistency) is not delivering the accuracy gain you expected, and you need to diagnose which part of the design is the weak link.

Use: The four-property diagnostic in sequence on 20 to 30 representative tasks. First, check coverage: does any model in the pool ever produce a correct answer? If coverage is below 50%, add more diverse proposers or vary prompting strategies before changing anything else. Second, check identifiability: when a correct answer is in the pool, how often does the system select it? If identifiability is low, the local verifier or critic is the bottleneck, not proposal quality. Third, check progress: do individual proposals advance meaningfully toward a solution, or are they stalling at similar intermediate states? If progress is low, restructure the task decomposition. Fourth, check diversity: are pool outputs substantively different, or just stylistic variations of the same approach? Low diversity means adding more calls to the same model with the same prompt will not help. Address each bottleneck in order.

Evidence: The boosting framework from Sunkaraneni et al. (2026) separates committee performance into four measurable quantities: proposal coverage, local identifiability, progress, and diversity. The paper shows these are independent failure modes with different root causes and different fixes. The gap between the orchestrated committee (76.4%) and the oracle ceiling (79.0%) on SWE-bench Verified is directly attributable to identifiability failures: cases where a correct answer was present but the critic-comparator could not reliably select it.
- Agent committees
- Debugging
- Inference-time scaling
- Diagnostics
- Cited
- Sunkaraneni, Beneventano, Neumarker, Poggio & Galanti 2026
Updated 2026-06-01
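
The first two checks, coverage and identifiability, reduce to simple counts once each task's proposals have been graded. The sketch below assumes a hypothetical record per task with the correctness of every proposal and of the selected answer; it is not the paper's code, and progress and diversity are left out because they need task-specific measures.

```python
def committee_diagnostics(tasks):
    """Coverage and identifiability for an inference-time committee.

    tasks: list of dicts with `proposals_correct` (list of bool, one per proposal)
    and `selected_correct` (bool, whether the committee's final pick was correct).
    """
    covered = [t for t in tasks if any(t["proposals_correct"])]
    coverage = len(covered) / len(tasks)
    # Identifiability is only defined on tasks where a correct proposal existed.
    identifiability = (
        sum(int(t["selected_correct"]) for t in covered) / len(covered) if covered else None
    )
    return {
        "coverage": coverage,
        "identifiability": identifiability,
        "coverage_below_half": coverage < 0.5,
    }


# Illustrative graded tasks.
tasks = [
    {"proposals_correct": [False, True, False], "selected_correct": True},
    {"proposals_correct": [True, True], "selected_correct": False},
    {"proposals_correct": [False, False], "selected_correct": False},
    {"proposals_correct": [False, True], "selected_correct": True},
]
diag = committee_diagnostics(tasks)
print(diag)
```

Here coverage is 0.75 and identifiability is 2 of 3, which points at the selector rather than the proposers.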
When: You are selecting or evaluating a search agent for a product where users often have incompletely specified intents, such as research tools, recommendation flows, or decision-support applications.

Use: Add a multi-turn clarification phase to your search agent eval before optimizing retrieval accuracy. Run tasks where the query is deliberately vague and measure two things separately: how many clarifying questions the agent asks before retrieving, and how completely the final output covers the target information. Optimize for clarification quality first. A good clarifying question unlocks information that a fast retrieval hop cannot reach regardless of retrieval method quality.

Evidence: VibeSearchBench (Xiaohongshu Inc. 2026) tested frontier search agents on 200 bilingual tasks where users progressively disclose intent through multi-turn dialogue rather than specifying it upfront. The best frontier model scored 30.3 Triplet F1, recovering roughly a third of the target information. Clarification behavior was the binding constraint: agents that asked better questions in early turns outperformed agents that retrieved faster but skipped clarification. Standard single-turn benchmarks on the same models show substantially higher scores, confirming that the clarification gap is the source of most of the real-world performance shortfall.
- Search agents
- Benchmarking
- Evaluation design
- Clarification behavior
- Cited
- Xiaohongshu Inc. 2026
Updated 2026-06-22
When: You are designing evaluation tracks for a search agent and all existing tracks use fixed answer schemas: document IDs, ranked lists, or structured JSON responses.

Use: Add at least one eval track with schema-free output, where the agent must discover what the correct output structure is through dialogue, not match a template. A knowledge graph, a free-form summary that the judge scores for coverage, or any output whose shape must be inferred from user intent all qualify. Use this track to measure content recovery, not format-matching. Run both fixed-schema and schema-free evals on the same models and report both scores; the gap between them tells you how much of your agent’s benchmark performance depends on being handed the answer format.

Evidence: VibeSearchBench evaluates agents using directed knowledge graphs as ground truth, with Triplet F1 as the metric. The graph has no preset shape: agents must discover the correct output structure by learning what the user actually needs. The 30.3 F1 ceiling across frontier models (Xiaohongshu Inc. 2026) demonstrates that format-matching ability, which inflates scores on fixed-schema benchmarks, does not transfer to schema-free tasks. The benchmark is open at github.com/VibeBench/VibeSearchBench.
- Search agents
- Evaluation design
- Benchmarking
- Schema-free outputs
- Cited
- Xiaohongshu Inc. 2026
Updated 2026-06-22
When: You are planning to scale a parallel agent pool to improve accuracy and want to know whether the investment will pay off.

Use: Measure pairwise error correlations between agents on a held-out test set before scaling. If correlations are consistently high (above ~0.7), scaling agent count will not raise the accuracy ceiling because new agents fail in the same ways as existing ones. The bottleneck is aggregation, not pool size. Fix the aggregator first.

Evidence: The Beyond Consensus paper showed that adding agents and diversifying prompts left error correlations across agents essentially unchanged. The majority voting ceiling is a structural property of aggregating at the final-answer level, not a symptom of insufficient agent diversity. Computing pairwise correlations on a small test set before a scale-out experiment distinguishes between the two root causes cheaply.
- Multi-agent systems
- Evals
- Cost reduction
- Cited
- Fadnavis, Kanakaraj & Wyss 2026
Updated 2026-06-29
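
A minimal sketch of the pairwise check, assuming each agent's results are stored as a list of 0/1 error indicators aligned by test item. Pearson correlation on binary indicators is used here as one reasonable choice; the paper may define its correlation measure differently, and the helper names are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, or None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # An agent that is always right or always wrong has no defined correlation.
        return None
    return cov / sqrt(var_x * var_y)


def pairwise_error_correlations(errors):
    """errors: dict mapping agent name to a list of 0/1 error indicators per test item."""
    return {
        (a, b): pearson(errors[a], errors[b])
        for a, b in combinations(sorted(errors), 2)
    }


# Illustrative data: 1 means the agent got that item wrong.
errors = {
    "agent_a": [1, 0, 1, 0],
    "agent_b": [1, 0, 1, 0],
    "agent_c": [0, 1, 1, 0],
}
corr = pairwise_error_correlations(errors)
print(corr)
```

In this toy data `agent_a` and `agent_b` fail on exactly the same items, so adding `agent_b` to a vote contributes nothing, while `agent_c` fails independently of `agent_a`.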
When: You are red-teaming an AI agent’s resistance to prompt injection and want to identify gaps that synthetic jailbreak strings would miss.

Use: Build the attack suite from the agent’s own task context. Generate injections written to resemble the kinds of content the agent normally processes: if the agent reads legal documents, craft injections that look like legitimate contract language; if the agent handles email, craft injections that resemble routine user requests. Test whether the agent responds differently to injections framed as contextually normal content versus syntactically obvious adversarial strings. Defenses that fail context-aware attacks but pass synthetic-string tests have a known residual risk that should be documented and monitored.

Evidence: Abdelnabi and Bagdasarian (2026) apply Contextual Integrity theory to prompt injection and derive an informal impossibility result: for any norm a defender chooses, an attacker can construct a context in which the blocked flow appears to conform to that norm. This means the attack class that is hardest to block consists of injections that look like contextually appropriate task flows, not syntactically flagged adversarial strings. Evaluations that only test the latter class will overstate how safe the system is.
- Agent security
- Prompt injection
- Red-teaming
- Cited
- Abdelnabi & Bagdasarian 2026
Updated 2026-06-29
When: You maintain an agent eval suite and the suite includes only tasks the agent is designed to complete, so your primary metric is task success rate.

Use: A held-out infeasibility split alongside your standard task suite. Write 10 to 15 tasks that are genuinely impossible for your agent: contradictory requirements, capabilities explicitly out of scope, resources that do not exist. Run your agent on them separately and measure the false-positive completion rate: how often it attempts a task rather than refusing it. Treat a false-positive completion rate above 5% as a calibration problem and report both metrics in every eval cycle.

Evidence: APB tested 12 multimodal LLMs on tasks designed to be unsolvable across 22 domains. Models showed systematic over-confidence, rarely refusing even explicitly infeasible tasks. Because standard benchmarks include only solvable tasks, this failure is invisible to teams using task-completion rate alone. The omission matters in production: an agent that always attempts an impossible task wastes its full token budget and delays the user’s recognition that the goal cannot be reached.
- Agent evaluation
- Evals
- Infeasibility detection
- Calibration
- Cited
- Sun, Wang, Song, He, W. Zhang, Y. Liu, Y. Yang & Y. Cheng 2026
Updated 2026-07-06
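
Reporting both metrics together can be as small as the sketch below. The outcome labels (`"refused"` versus anything else) and the function name are assumptions; the 5% threshold is the one stated in this practice.

```python
def eval_cycle_report(feasible_success, infeasible_outcomes, max_fp_rate=0.05):
    """Task success rate on the standard suite plus false-positive completion rate
    on the infeasibility split.

    feasible_success: list of bool, one per feasible task.
    infeasible_outcomes: list of str, "refused" when the agent declined the task;
    any other value counts as an attempt.
    """
    if not feasible_success or not infeasible_outcomes:
        raise ValueError("both the standard suite and the infeasibility split need results")
    success_rate = sum(int(s) for s in feasible_success) / len(feasible_success)
    attempts = sum(1 for o in infeasible_outcomes if o != "refused")
    fp_rate = attempts / len(infeasible_outcomes)
    return {
        "task_success_rate": success_rate,
        "false_positive_completion_rate": fp_rate,
        "calibration_problem": fp_rate > max_fp_rate,
    }


# Illustrative results: 8 of 10 feasible tasks solved, 3 of 12 infeasible tasks attempted.
cycle = eval_cycle_report(
    feasible_success=[True] * 8 + [False] * 2,
    infeasible_outcomes=["refused"] * 9 + ["attempted"] * 3,
)
print(cycle)
```

With only 10 to 15 infeasible tasks, a single attempt already exceeds 5%, so in practice the threshold means "no attempts tolerated" unless the split is enlarged.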
When: You are diagnosing agent task failures and deciding whether to invest in prompting improvements, scaffolding changes, tool reliability fixes, or model upgrades.

Use: A two-category classification pass over recent failures before committing to any remediation. For each failure, ask: was the plan wrong before any tool was called, or did a correct plan break during execution? Label each failure accordingly. A bad-plan failure targets the planning context, prompting, or reasoning architecture. A bad-execution failure targets tool reliability, schema validation, error recovery, or retry logic. Count the proportion in each category and let it guide where you invest next. Without this classification, improvements to the wrong layer produce no gain.

Evidence: APB’s core structural contribution is separating planning failure from execution failure through five distinct benchmark settings: holistic planning, feedback-conditioned step-wise planning, robustness under extraneous tools, robustness under broken tools, and infeasibility detection. Across 12 MLLMs, models showed distinct performance profiles across settings, with different models failing at different layers. A combined end-to-end pass/fail score merges these distinct failure signatures into a single number that cannot direct remediation.
- Agent evaluation
- Evals
- Failure attribution
- Debugging
- Cited
- Sun, Wang, Song, He, W. Zhang, Y. Liu, Y. Yang & Y. Cheng 2026
Updated 2026-07-06
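
Once each failure has been labeled by a reviewer, the counting step is straightforward. In the sketch below the label strings and function name are hypothetical, and the remediation text is copied from the practice above; the labeling itself remains a human judgment.

```python
from collections import Counter

# Where each failure category points, per the practice above.
REMEDIATION = {
    "bad_plan": "planning context, prompting, or reasoning architecture",
    "bad_execution": "tool reliability, schema validation, error recovery, or retry logic",
}


def failure_breakdown(labels):
    """Share of failures per category and the layer the larger share points to."""
    if not labels:
        raise ValueError("no labeled failures to classify")
    counts = Counter(labels)
    unknown = set(counts) - set(REMEDIATION)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    shares = {category: counts.get(category, 0) / total for category in REMEDIATION}
    dominant = max(shares, key=shares.get)
    return {"shares": shares, "dominant": dominant, "invest_in": REMEDIATION[dominant]}


# Illustrative labels from a review of ten recent failures.
labels = ["bad_plan"] * 3 + ["bad_execution"] * 7
breakdown = failure_breakdown(labels)
print(breakdown["dominant"], "->", breakdown["invest_in"])
```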
When: You anticipate rotating, upgrading, or A/B testing an LLM agent in a production multi-turn environment and want to evaluate candidate replacements before live exposure.

Use: Instrument structured interaction logging on the current production agent before any changes. For every multi-turn interaction, capture: conversation state before the agent’s turn, agent action, next environment turn, and task outcome. Use a format queryable by time window and agent version. Without historical trajectory data, off-policy evaluation is not possible; the time to instrument is before a rotation is needed, not during.

Evidence: ADWM, like all off-policy evaluation methods, requires pre-collected trajectory logs from the same environment. The method produces accurate value estimates and reliable candidate rankings from historical data, enabling pre-deployment screening. The quality of estimates depends directly on how well the logged data covers the states a candidate agent will encounter.
- Agent evaluation
- Off-policy evaluation
- Interaction logging
- Cited
- Liu, Xiong, Zhang & Tang 2026
Updated 2026-07-13
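
One possible shape for such a log record is sketched below, using a dataclass serialized to JSON Lines. The field names, the JSONL choice, and the `query` helper are assumptions for illustration; ADWM does not prescribe this schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class TurnRecord:
    timestamp: str               # ISO 8601, e.g. "2026-07-01T10:00:00"
    agent_version: str
    conversation_id: str
    state_before: List[str]      # conversation state before the agent's turn
    agent_action: str
    next_environment_turn: str
    task_outcome: Optional[str]  # None until the task resolves


def to_jsonl(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record))


def query(records, start, end, agent_version):
    """Records for one agent version with start <= timestamp < end."""
    return [
        r for r in records
        if r.agent_version == agent_version
        and start <= datetime.fromisoformat(r.timestamp) < end
    ]


# Illustrative records.
records = [
    TurnRecord("2026-07-01T10:00:00", "v1", "c1", ["user: hi"], "ask_clarification", "user: order 42", None),
    TurnRecord("2026-07-09T10:00:00", "v2", "c2", ["user: refund"], "lookup_order", "tool: found", "success"),
]
window = query(records, datetime(2026, 7, 1), datetime(2026, 7, 8), "v1")
print(len(window), to_jsonl(window[0]))
```

Keeping the agent version on every record is what later lets you separate trajectories generated by different policies.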
When: You are using an off-policy evaluation method to estimate how a candidate LLM agent would perform in a production environment before deploying it.

Use: Validate the off-policy estimate against a small held-out live A/B slice before using it to gate a deployment decision. Treat OPE as a first-pass filter that eliminates clearly underperforming candidates; treat the top finalists as requiring live validation. This is especially important when the candidate agent’s behavior differs substantially from the agent that generated the historical logs.

Evidence: ADWM’s estimate accuracy degrades under distribution shift: if the candidate policy explores states not covered in the historical data, the world model extrapolates beyond its training distribution. The paper recommends using OPE for pre-screening and candidate ranking, with a live validation step for the top finalists before full deployment.
- Agent evaluation
- Off-policy evaluation
- A/B testing
- Cited
- Liu, Xiong, Zhang & Tang 2026
Updated 2026-07-13
When: You are evaluating an autonomous coding or deployment agent’s output and need to decide whether a task was completed successfully.

Use: Grade agent output by clean-environment execution, not diff plausibility. Spin up a fresh environment, apply the agent’s changes, and run the artifact end-to-end. Reject any evaluation signal that only checks whether code was written or whether tests pass in the same environment the agent modified. This applies to both automated pipelines and human code review.

Evidence: DeployBench ran 51 artifact deployment tasks across AI/ML, computer systems, and scientific computing using four frontier models via OpenHands. Best pass rate was 51.0%; worst was 7.8%. Of 154 analyzed failures, 97 were completion-judgment errors: the agent stopped before verifying whether the artifact actually ran. Code that passed local tests did not run in clean environments. The benchmark isolates deployment success from code generation quality, revealing a systematic failure mode that per-file diffs and unit test suites cannot detect.
- Agent evaluation
- Deployment
- Reproducibility
- Cited
- Wang, Qian, Zhang et al. 2026
Updated 2026-07-13
When: You are selecting between two or more LLM agents or foundation models for a production use case and are using published benchmark scores to inform the decision.

Use: Re-run the candidate models through your own in-house harness before finalizing the selection. Define your prompt format, loop structure (ReAct, direct, chain-of-thought), tool schema, and environment setup. Run every candidate through this single configuration on a representative task sample. Use published benchmarks only for discovery and shortlisting; use your internal harness scores for the decision. Document your harness configuration alongside your results so the comparison is reproducible.

Evidence: Across 7 agent benchmarks, 15 models, 400K rollouts, and 5B tokens, Zhu et al. (2026) found that scaffold choice and environment volatility shift outcomes in both directions, enough to reorder leaderboards. A model that ranks first on a benchmark using that benchmark’s native scaffold may rank differently under a different harness. The effect is bidirectional and cannot be predicted from benchmark scores alone. The study introduces a unified instruction-tool-environment harness and fixed ReAct loop as the controlled baseline, demonstrating that a single harness applied across all benchmarks produces different and more interpretable rankings than same-model scores across benchmark-specific scaffolds.
- Agent evaluation
- Benchmark methodology
- Harness design
- Cited
- Zhu, Li, Shao et al. 2026
Updated 2026-07-13
When: You are running agent evaluation in an environment that involves live APIs, web services, or other external systems that may change between evaluation runs.

Use: Measure environment volatility before using evaluation scores in any decision. Run the same model on the same evaluation tasks at least three times across different days without changing any prompt or model configuration. If scores vary by more than a few percentage points, the environment is contributing noise. Switch to snapshotted or cached environments for evaluation and use live environments only for final validation on the shortlisted model. Report the variance alongside the mean score so downstream decision-makers can assess reliability.

Evidence: Zhu et al. (2026) introduce an offline snapshot mode that replaces volatile live environments with stable curated snapshots, enabling isolation of environment effects from model effects. Applied across 7 benchmarks, the mode reveals that environment volatility is a measurable, non-trivial contributor to score variance. Teams evaluating agents in live environments routinely conflate this noise with model performance differences, leading to selection decisions that may not hold in deployment.
- Agent evaluation
- Environment stability
- Reproducibility
- Cited
- Zhu, Li, Shao et al. 2026
Updated 2026-07-13
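
The volatility check can be sketched with the standard library. The 3-point spread threshold below is an assumed stand-in for "a few percentage points", and the function name is hypothetical; scores are percentages from repeated runs of the same model, prompts, and tasks on different days.

```python
from statistics import mean, stdev


def volatility_report(scores, max_spread_points=3.0):
    """Summarize repeated runs of one unchanged configuration.

    scores: list of percentage scores, one per run; at least three runs are required.
    """
    if len(scores) < 3:
        raise ValueError("run the same configuration at least three times")
    spread = max(scores) - min(scores)
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation across runs
        "spread": spread,
        # True suggests the environment, not the model, is moving the score.
        "volatile": spread > max_spread_points,
    }


# Illustrative scores from three runs on different days.
live = volatility_report([61.0, 66.5, 58.0])
snapshot = volatility_report([62.0, 62.5, 61.5])
print(live["spread"], live["volatile"], "|", snapshot["spread"], snapshot["volatile"])
```

Reporting `stdev` or `spread` next to `mean` gives downstream readers the variance the practice asks for.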