Build · Practices Library

Build

Implementation patterns: how to scaffold around probabilistic output, surface failure modes early, and keep human judgment in the loop while the code lands.


When: You’re designing a user-facing AI system and choosing how to present the model’s reasoning, recommendations, or confidence.

Use: Interfaces that prompt active interpretation rather than passive consumption. Give users enough to act and enough feedback to test their hypotheses, but stop short of presenting the system’s output as authoritative. Build in friction that requires users to evaluate whether the AI’s explanation is trustworthy in the current context.

Evidence: Qualitative analysis of the game Arknights showed that an interface that withholds and occasionally misleads, when paired with rich feedback, produced a more robust user-system relationship than one offering full transparency. Players developed working mental models through action, failure, and revision rather than through dashboards. The same pattern applies to XAI interfaces, where comprehensive explanation often fails to produce comprehension.
- Explanation
- Trust calibration
- Cited
- Guo 2025
Updated 2026-05-07
When: You’re building a system that depends on LLM output for any progression, scoring, or downstream action.

Use: Failure-mode prototyping before happy-path implementation. Simulate hallucinated answers, malformed outputs, schema violations, and difficulty misfires up front. Decide how the system should respond to each before you build the success path, so that fallbacks, retries, and validation are part of the architecture rather than patches added under pressure.

Evidence: University of Calgary developers building two LLM-driven games reported that incorrect outputs were not bugs but fairness violations. They documented cases like a math question with no correct answer option and patterned outputs (correct answer always in the same multiple-choice slot) that broke the implicit contract with the player. The team explicitly recommends prototyping failure modes before the happy path.
- AI in games
- Architecture
- Cited
- Johnson, Ahmed, Lang, Thethi, Zheng & de Souza Santos 2026
Updated 2026-05-07
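A minimal sketch of what prototyping failure modes first could look like, in Python. The case names, the `question`/`options`/`answer` fields, and the `handle_question` function are illustrative assumptions, not taken from the Calgary team’s code. The point is that each simulated bad output gets a decided response before any success path exists.

```python
import json

# Hypothetical bad outputs, written down before any happy-path code exists.
FAILURE_CASES = {
    "malformed_json": '{"question": "2+2?", "options": ["3", "4"',
    "schema_violation": '{"question": "2+2?", "options": ["3", "4"]}',
    "no_correct_option": '{"question": "2+2?", "options": ["3", "5"], "answer": "4"}',
}

REQUIRED_KEYS = {"question", "options", "answer"}  # assumed schema


def handle_question(raw: str) -> str:
    """Return the action the system takes for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "retry"
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return "retry"
    if data["answer"] not in data["options"]:
        # A question with no correct option is a fairness problem, not a retry.
        return "fallback_question"
    return "accept"


if __name__ == "__main__":
    for name, raw in FAILURE_CASES.items():
        print(name, "->", handle_question(raw))
```

Keeping the failure table next to the handler makes it cheap to add a new case whenever a new kind of bad output shows up in testing.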
When: Any LLM call whose output flows into downstream code, a database, a UI, or another model.

Use: Strict output schemas plus a validation pipeline. Define the exact format you require, parse against it, and reject or retry on schema violations. Constrain the model’s output space wherever possible. Treat free-form text from the model with the same skepticism you’d apply to a public API response or a form submission from an untrusted client.

Evidence: Calgary developers building Wizdom Run and Sena consistently described “building scaffolding around the model’s outputs”: structured schemas, validation pipelines, strict output formats. One reflection noted that ensuring LLM responses were formatted exactly as expected was what kept the back-end design coherent. Without that scaffolding, the probabilistic output broke deterministic gameplay rules.
- Architecture
- Code generation
- Cited
- Johnson, Ahmed, Lang, Thethi, Zheng & de Souza Santos 2026
Updated 2026-05-07
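One way to sketch the parse, reject, and retry pipeline, using only the Python standard library. The schema in `REQUIRED_FIELDS`, the function names, and the retry count are assumptions made for illustration; `generate` stands in for whatever call produces the model’s raw text.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"question": str, "options": list, "answer_index": int}


def parse_strict(raw: str) -> dict:
    """Parse raw model text; raise ValueError on any schema violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        raise ValueError("unexpected top-level shape or field set")
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} has the wrong type")
    if not 0 <= data["answer_index"] < len(data["options"]):
        raise ValueError("answer_index does not point at an option")
    return data


def call_with_validation(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the model until its output validates, or give up loudly."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_strict(generate())
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Raising after the last attempt, rather than returning a partial result, keeps invalid data from reaching the deterministic code downstream.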
When: A developer is in a vibe-coding or agent-driven coding session where the AI is writing or modifying many files in rapid succession.

Use: External version control with frequent, small commits, or an explicit instruction to the AI to log its own changes to a file. Commit before each new conversational turn that touches code, not at the end of the session. If you cannot commit per turn, ask the model to summarize the diff so you have a recoverable trail.

Evidence: Across the qualitative study’s 190,000 words of practitioner data, runaway code changes were one of the two highest-severity pain points. Practitioners reported sessions where 30+ files accumulated in the change log with hours of uncommitted work, leading to “fuckup cascades” that were difficult to unwind. External version control was one of the two most universal community-derived best practices.
- Vibe coding
- Code generation
- Cited
- Pimenova, Fakhoury, Bird, Storey & Endres 2025
Updated 2026-05-07
When: You’re designing AI guidance, recommendations, or copilots where users will rely on the model’s stated reasoning to make their own decisions.

Use: Interfaces that mark explanations as provisional and give the user a low-cost way to disagree with them. Show the model’s reasoning, but also show the user the cost of taking it on faith. Pair recommendations with the option to override and observe consequences, so users practice judgment instead of compliance.

Evidence: Arknights reframes player agency from “take meaningful action” to “evaluate whether the system’s explanations are trustworthy.” When the in-game AI deliberately offered misleading deployment suggestions, players who had built independent mental models through earlier play could reject the recommendation and succeed. Players who deferred failed. The game design treated continuous evaluation as the skill, not blind trust.
- Explanation
- Trust calibration
- Cited
- Guo 2025
Updated 2026-05-07
When: You’re using an LLM as a critic, judge, or self-evaluator on its own (or another model’s) output.

Use: A critic prompt that asks “how much did this step gain compared to the previous state?” instead of “how good is this overall?” Score marginal change, not absolute quality. Where possible, surface concrete pre/post artifacts (the answer at step N-1 versus step N) so the comparison is grounded in observable change rather than vibes.

Evidence: Budget-Aware Value Trees rely on this distinction as a core technique. LLMs are well documented to be overconfident when scoring the absolute quality of their own reasoning. The authors found that the delta (the change between steps) was much harder for the model to inflate, which made step-level critics reliable enough to drive pruning decisions.
- Evals
- Budget-aware reasoning
- Cited
- Li et al. 2026
Updated 2026-05-07
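A delta-scoring critic prompt might be assembled along these lines. The wording, the 0 to 10 scale, and both function names are assumptions for illustration and are not the prompt used in the Budget-Aware Value Trees work.

```python
def build_delta_critic_prompt(task: str, previous_state: str, current_state: str) -> str:
    """Build a critic prompt that shows both artifacts and asks only about the change."""
    return (
        "You are reviewing one step of a multi-step solution.\n"
        f"Task: {task}\n\n"
        f"State before the step:\n{previous_state}\n\n"
        f"State after the step:\n{current_state}\n\n"
        "How much did this step gain compared to the previous state? "
        "Do not rate overall quality. "
        "Reply with one number from 0 (no gain) to 10 (large gain)."
    )


def parse_gain(reply: str) -> float:
    """Parse the critic's reply; anything unparseable or out of range counts as zero gain."""
    try:
        value = float(reply.strip())
    except ValueError:
        return 0.0
    return value if 0 <= value <= 10 else 0.0
```

Treating an unparseable reply as zero gain is a conservative default: a critic that fails to answer should not keep a branch alive.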
When: Your agent is mid-task, has already spent some compute on a particular line of reasoning or tool calls, and the most recent step yielded little or no new information.

Use: An explicit early-exit rule on doomed paths. If the marginal gain from a step falls below a threshold, abandon the branch and try a different approach. Build the escape mechanism into the agent loop. Do not rely on the model to notice it is stuck, because LLMs are subject to sunk-cost behavior and will keep exploring failed paths.

Evidence: Budget-Aware Value Trees treat this as a first-class principle. Across four multi-hop QA benchmarks, the technique outperformed standard agents partly by pruning low-gain branches early, freeing compute for more promising ones. Spending 4x more tool calls on standard agents did not produce 4x better answers, indicating that without an early-exit mechanism, additional compute often goes to doomed paths.
- Budget-aware reasoning
- Cost reduction
- Cited
- Li et al. 2026
Updated 2026-05-07
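The early-exit rule can live in the loop itself rather than in the prompt. The sketch below assumes each branch is a callable that performs one step and returns its marginal gain (for example, from a delta-scoring critic); the `explore` name, the threshold, and the step cap are illustrative choices, not values from the paper.

```python
from typing import Callable


def explore(
    branches: dict[str, Callable[[int], float]],
    min_gain: float = 0.05,
    max_steps_per_branch: int = 5,
) -> dict[str, list[float]]:
    """Run each branch step by step, abandoning it once a step's gain is too small."""
    history: dict[str, list[float]] = {}
    for name, step in branches.items():
        gains: list[float] = []
        for step_index in range(max_steps_per_branch):
            gain = step(step_index)
            gains.append(gain)
            if gain < min_gain:
                break  # the loop enforces the exit; the model is never asked
        history[name] = gains
    return history
```

Because the exit condition is ordinary control flow, it holds even when the model would prefer to keep going.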
When: You are using a generative model to produce training data for a downstream model, particularly in domains with long-tailed or rare-category distributions (medical imaging, anomaly detection, fairness-sensitive applications).

Use: An Image Retrieval Score (IRS) audit of the generator before relying on its outputs. If coverage is below 80%, the synthetic data will be systematically missing parts of the real distribution. Give low-coverage synthetic data lower weight in training, or supplement it with real examples from the underrepresented regions the generator avoids.

Evidence: The coverage failures Dombrowski et al. measured are not random noise: they represent the model clustering around modes and ignoring the tail of the distribution. For data augmentation use cases, this means a generator with 77% IRS coverage is producing augmented data biased toward already-overrepresented examples. Downstream models trained on this data inherit the bias invisibly, because FID on the generator showed no warning sign.
- Generative AI
- Data augmentation
- Image generation
- Cited
- Dombrowski, Zhang, Cechnicka, Reynaud & Kainz 2025
Updated 2026-05-11
When: You are designing or evaluating an AI tutoring, coaching, or advisory system and need to decide whether the system’s reasoning about users should be visible to operators or teachers.

Use: Log the system’s intermediate reasoning about the user’s state as a first-class output, not just the final response. Give teachers or administrators access to the inferred diagnosis, stability assessment, and strategy rationale. A system that explains why it said what it said builds operational trust and surfaces errors that inspecting responses alone cannot catch.

Evidence: SLOW’s open workspace design makes its four-stage reasoning chain inspectable after the fact. The authors frame transparency as a core design goal: a traceable decision path gives educators something to audit, contest, and learn from. Standard single-pass LLM tutors offer no equivalent window into how they interpreted the learner.
- AI tutoring
- Explainability
- Trust calibration
- Cited
- Wei, Li & Jiang 2026
Updated 2026-05-11
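As a rough illustration, the intermediate reasoning can be stored as a structured record alongside the response. The `TutorTurn` class and its field names are assumptions that mirror the diagnosis, stability, and rationale items named above; they are not SLOW’s actual data model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TutorTurn:
    """One tutoring turn, with the reasoning kept next to the reply (hypothetical fields)."""
    learner_input: str
    inferred_diagnosis: str
    stability_assessment: str
    strategy_rationale: str
    response: str


def to_log_line(turn: TutorTurn) -> str:
    """Serialize the whole turn, not just the response, as one JSON line."""
    return json.dumps(asdict(turn), sort_keys=True)
```

Writing one JSON line per turn keeps the trace easy to filter later, for example to list every turn where the diagnosis mentions a particular misconception.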
When: You are building an AI-assisted learning or onboarding tool for a domain with significant visual-spatial content (biology, chemistry, engineering, anatomy, architecture) and choosing whether to include images alongside text responses.

Use: Deliver both text and relevant images from the source material in the same conversational response, rather than text only. The two features work through distinct mechanisms: conversation reduces the extraneous cognitive load of information-seeking; visual-verbal integration increases the germane load that builds durable knowledge schemas. Providing images without conversation, or conversation without images, captures only one of the two effects.

Evidence: Taneja, Singh and Goel (AIED 2026) ran a 124-person randomized controlled online study comparing three interfaces for learning cell biology: a multimodal conversational AI (MuDoC), a text-only conversational AI (TexDoC), and an LLM-powered semantic search tool with no conversation (DocSearch). MuDoC produced the highest post-test scores and the most positive reported learning experience. The ordering matched cognitive load theory predictions: MuDoC first, then TexDoC, then DocSearch.
- AI in education
- Multimodal AI
- Cognitive load
- Cited
- Taneja, Singh & Goel 2026
Updated 2026-05-11
When: You are building an agent that executes multi-step workflows through stateful tools (scheduling, inventory, project management, file operations) and a tool call’s success or failure determines what subsequent steps can safely do.

Use: A verification checkpoint after each tool call. Before proceeding, have the agent observe the post-call state and confirm the change took effect. Make this explicit in the agent loop rather than relying on the model to decide whether to check. In stateful interdependent environments, assuming a tool call succeeded and moving forward corrupts downstream steps in ways that can be traced back only with careful inspection.

Evidence: Li, Yang, Wang et al. (2026) named overconfident skipping of verification as one of three primary failure modes in ComplexMCP, a benchmark with 150+ interdependent stateful MCP tools. Agents proceeded after tool calls without checking downstream state. In stateless environments the cost is low: the next call either works or fails visibly. In stateful environments the cost is high: silent corruption propagates through the dependency chain and the root cause becomes hard to locate.
- Agent tool use
- MCP
- Error handling
- Stateful agents
- Cited
- Li, Yang, Wang et al. 2026
Updated 2026-05-25
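A verification checkpoint can be expressed as a small wrapper that every tool call goes through. Everything here (`call_and_verify`, `VerificationError`, the callables passed in) is a hypothetical shape for illustration, not an MCP or ComplexMCP API.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


class VerificationError(RuntimeError):
    """Raised when the post-call state does not show the expected change."""


def call_and_verify(
    call_tool: Callable[[], object],
    observe_state: Callable[[], S],
    change_took_effect: Callable[[S], bool],
    label: str,
) -> S:
    """Run a tool call, then read the state back and confirm the change before continuing."""
    call_tool()
    state = observe_state()
    if not change_took_effect(state):
        raise VerificationError(f"{label}: tool returned but the state did not change as expected")
    return state


if __name__ == "__main__":
    inventory = {"widgets": 5}

    def reserve_one() -> None:
        inventory["widgets"] -= 1

    remaining = call_and_verify(
        reserve_one, lambda: inventory["widgets"], lambda n: n == 4, "reserve widget"
    )
    print("verified remaining:", remaining)
```

The check reads the state through a separate observation rather than trusting the tool’s own return value, which is what catches a call that reports success but changes nothing.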
When: An LLM agent encounters a tool call failure, an API error, or an unexpected response mid-task, and the task could still be completed via an alternative tool or a different approach.

Use: An explicit fallback-routing step in the agent’s error-handling path. When a tool call fails, require the agent to ask: “Is there an alternative tool or approach that could accomplish this goal?” Build this question into the failure path as a structured prompt step rather than leaving it to the model to generate spontaneously. Distinguish between terminal failures (the goal genuinely cannot be achieved) and transient ones (a specific tool is unavailable or misbehaving) before the agent reports failure.

Evidence: Li, Yang, Wang et al. (2026) identified strategic defeatism as a recurring failure mode across frontier models in ComplexMCP: when agents encountered API errors, they stopped and reported failure rather than rerouting to alternative tools, even when alternative paths existed. The failure occurred most visibly under seed-controlled API failure injection, where a recoverable glitch triggered abandonment rather than recovery. The pattern was consistent across frontier models, indicating it reflects how models respond to error signals rather than a specific capability gap.
- Agent tool use
- Error handling
- MCP
- Resilience
- Cited
- Li, Yang, Wang et al. 2026
Updated 2026-05-25
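The terminal-versus-transient distinction can be made concrete with two exception types and a router that only reports failure once every route is exhausted. This is a simplified, deterministic stand-in for the structured prompt step described above: the route list is assumed to be known in advance, and all names are hypothetical.

```python
from typing import Callable


class TransientToolError(Exception):
    """A specific tool is unavailable or misbehaving; another route may still work."""


class TerminalFailure(Exception):
    """The goal could not be achieved by any known route."""


def run_with_fallbacks(
    goal: str, routes: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each route in order; report failure only after all of them have failed."""
    failures: list[str] = []
    for name, tool in routes:
        try:
            return name, tool(goal)
        except TransientToolError as err:
            failures.append(f"{name}: {err}")  # record it, then try the next route
    raise TerminalFailure(f"no route completed {goal!r}; tried {failures}")
```

Any exception other than `TransientToolError` propagates immediately, so a genuinely terminal condition raised by a tool is not masked by the rerouting.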
When: You run a fixed N traces for all queries in your self-consistency pipeline, regardless of how quickly traces converge or how confident the emerging consensus is.

Use: Adaptive early termination. Stop generating new traces when two conditions are met simultaneously: the candidate answers have converged across the traces generated so far, and the confidence in the converging answer exceeds a threshold. For easy queries where all traces agree quickly, this can cut trace count to 2 or 3. For hard queries where traces diverge, continue toward N. Apply this per query, not as a global N reduction. Pair with confidence-weighted voting so termination decisions are grounded in trace quality, not just answer count.

Evidence: DDC’s early termination mechanism reduces total token usage by more than 10x on average across five reasoning benchmarks without degrading accuracy. Configurations with aggressive convergence thresholds reached up to 27x token reduction against strong high-N baselines, with accuracy maintained. Easy queries that would have converged at trace 3 are no longer forced to run to trace 32. The reduction is concentration-of-compute, not compute-cutting: the same or better accuracy comes from spending compute on traces that are actually adding signal.
- Self-consistency
- Token efficiency
- Cost optimization
- Inference optimization
- Cited
- Xu, Li, Zhao, Wu, Li & Yan 2026
Updated 2026-07-06
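A minimal sketch of per-query early termination with confidence-weighted voting. The two stopping conditions are modeled here as a weighted vote share and a mean confidence for the leading answer; those definitions, the thresholds, and the `sample_trace` callable (assumed to return an answer and a confidence in [0, 1]) are illustrative assumptions and not DDC’s actual criteria.

```python
from collections import defaultdict
from typing import Callable, Optional


def adaptive_vote(
    sample_trace: Callable[[], tuple[str, float]],
    max_traces: int = 32,
    min_traces: int = 2,
    min_share: float = 0.9,
    min_confidence: float = 0.7,
) -> tuple[Optional[str], int]:
    """Sample traces one at a time; stop once the leader has converged and is confident."""
    weights: dict[str, float] = defaultdict(float)  # confidence-weighted votes
    counts: dict[str, int] = defaultdict(int)
    leader: Optional[str] = None
    used = 0
    for used in range(1, max_traces + 1):
        answer, confidence = sample_trace()
        weights[answer] += confidence
        counts[answer] += 1
        leader = max(weights, key=weights.get)
        total = sum(weights.values())
        share = weights[leader] / total if total > 0 else 0.0
        mean_confidence = weights[leader] / counts[leader]
        if used >= min_traces and share >= min_share and mean_confidence >= min_confidence:
            break  # both conditions hold at once: converged and confident
    return leader, used
```

A query whose traces all agree with high confidence stops at `min_traces`; a query whose traces keep disagreeing runs to `max_traces`, so the budget is decided per query.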
When: Your team uses one-shot automatic interpretability (feeding activating examples to a language model and asking it to describe the common pattern) to characterize model features, and you need those explanations to be auditable and correctable.

Use: A multi-round contrastive probing loop instead of a single description pass. For each feature: generate a prompt pair where you predict the feature activates on one input and not the other, run both inputs through the model and observe the actual activations, then revise the hypothesis where the prediction failed. Repeat until the explanation is stable across several contrastive tests. Log every hypothesis, every test, and every observation as a structured trace. A one-shot description that cannot be traced cannot be revised when it turns out to be incomplete; a traced explanation can be inspected, contested, and corrected by anyone who reads the log.

Evidence: Marin-Llobet and Ferrando (2026) showed that an iterative explanation agent using targeted prompt-controlled contrastive tests outperformed one-shot auto-interp baselines on Gemma-2 models and weight-sparse transformer MLP neurons. The iterative approach caught cases where the initial hypothesis was partially correct but missed secondary activation patterns that only surfaced through follow-up testing. The auditable traces enabled post-hoc scrutiny that revealed which explanations needed refinement and why.
- Mechanistic interpretability
- Auto-interpretability
- Auditing
- Agent loops
- Cited
- Marin-Llobet & Ferrando 2026
Updated 2026-06-08
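The loop structure (predict, test, revise, log) can be sketched independently of any particular model. In this sketch `propose_pair`, `activation`, and `revise` are placeholder callables the caller supplies (in practice an LLM and a model hook); the activation threshold and stability rule are assumptions, and none of this is the authors’ implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeRecord:
    """One logged contrastive test: the hypothesis, both inputs, and what was observed."""
    hypothesis: str
    positive_input: str
    negative_input: str
    positive_activation: float
    negative_activation: float
    prediction_held: bool


def probe_feature(
    hypothesis: str,
    propose_pair: Callable[[str], tuple[str, str]],
    activation: Callable[[str], float],
    revise: Callable[[str, ProbeRecord], str],
    threshold: float = 0.5,
    stable_after: int = 2,
    max_rounds: int = 6,
) -> tuple[str, list[ProbeRecord]]:
    """Test a hypothesis with contrastive pairs, revising it whenever a prediction fails."""
    trace: list[ProbeRecord] = []
    streak = 0
    for _ in range(max_rounds):
        positive, negative = propose_pair(hypothesis)
        a_pos, a_neg = activation(positive), activation(negative)
        held = a_pos > threshold >= a_neg  # fires on the positive, not on the negative
        record = ProbeRecord(hypothesis, positive, negative, a_pos, a_neg, held)
        trace.append(record)
        if held:
            streak += 1
            if streak >= stable_after:
                break  # stable across several consecutive tests
        else:
            streak = 0
            hypothesis = revise(hypothesis, record)
    return hypothesis, trace
```

The returned trace contains every hypothesis that was tried, including the ones that failed, which is what makes the final explanation contestable.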
When: You are applying keyword or regex-based guardrails to shell commands generated by an AI agent and your rules run directly against the raw command string before any normalization.

Use: Shell command normalization as the first step before any pattern matching runs. At minimum, expand the command through variable substitution, hex and octal escape resolution, command substitution, and adjacent-quote concatenation. Apply safety rules to the normalized plaintext, not to the raw string the agent produced. A command that keyword-matches as safe in obfuscated form may resolve to a destructive operation after expansion.

Evidence: Yang (2026) demonstrated that without normalization, shell-based obfuscation defeats static pattern matching. AgentTrust’s ShellNormalizer applies nine deobfuscation strategies before any PolicyEngine rule runs. On the external 630-scenario benchmark, this approach achieved approximately 93% accuracy on shell-obfuscated payloads specifically, a category that static keyword filters targeting raw strings cannot reliably catch.
- Agent safety
- Shell security
- Guardrails
- Cited
- Yang 2026
Updated 2026-06-08
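To make the idea concrete, here is a deliberately small normalizer built on Python’s `re` and `shlex`. It is an approximation under stated assumptions and not AgentTrust’s ShellNormalizer: it covers ANSI-C quoted hex and octal escapes, variable substitution from a caller-supplied `env` mapping (expanded even inside single quotes, which is more aggressive than a real shell), and adjacent-quote concatenation. Command substitution is not evaluated; any command containing it is simply blocked. The `BLOCKED` rule is a single hypothetical example.

```python
import re
import shlex

_ANSI_C = re.compile(r"\$'((?:[^'\\]|\\.)*)'")                  # $'...' quoting
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{1,2})|\\([0-7]{1,3})")   # \xHH or \NNN
_VAR = re.compile(r"\$\{(\w+)\}|\$(\w+)")                       # ${NAME} or $NAME


def _decode_ansi_c(match: re.Match) -> str:
    """Resolve hex and octal escapes inside one $'...' string."""
    body = _ESCAPE.sub(
        lambda m: chr(int(m.group(1), 16)) if m.group(1) else chr(int(m.group(2), 8)),
        match.group(1),
    )
    return shlex.quote(body)


def normalize(command: str, env: dict[str, str]) -> str:
    """Return an approximate plaintext form of the command for rule matching."""
    command = _ANSI_C.sub(_decode_ansi_c, command)
    command = _VAR.sub(lambda m: env.get(m.group(1) or m.group(2), ""), command)
    # shlex.split joins adjacent quoted pieces, so r''m becomes rm.
    return " ".join(shlex.split(command))


BLOCKED = re.compile(r"\brm\s+-\w*[rR]")  # hypothetical rule: recursive delete


def is_blocked(command: str, env: dict[str, str]) -> bool:
    """Apply the rule to the normalized command, never to the raw string."""
    if "$(" in command or "`" in command:
        return True  # command substitution is not evaluated here; treat as unsafe
    try:
        return bool(BLOCKED.search(normalize(command, env)))
    except ValueError:
        return True  # unbalanced quotes: cannot normalize, so refuse
```

For example, `r''m -rf /tmp/x` does not match the rule as a raw string but normalizes to `rm -rf /tmp/x`, which does.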
When: You are designing safety guardrails for an autonomous agent that operates without human supervision on long-horizon tasks, and your current block rules stop the agent without providing alternative paths.

Use: A SafeFix rule paired with every block rule. For each action pattern you block, define and return a safer alternative the agent can execute instead. For example, pair a block on recursive deletion with a SafeFix that moves the target to a trash location rather than deleting it permanently. For autonomous agents that cannot escalate to a human mid-task, the difference between a block and a constructive alternative is the difference between task failure and task completion with lower blast radius.

Evidence: Yang (2026) built SafeFixEngine as an opt-out component of AgentTrust with 37 fix rules covering the most common blocked action patterns. The SafeFix pattern reframes safety tooling from binary (allow or block) to constructive (allow, warn, block with alternative, or review). This distinction matters most for autonomous agent deployments where silent task failure due to a blocked action is harder to detect and diagnose than a block that redirected the agent to a safer path.
- Agent safety
- Autonomous agents
- Error handling
- Guardrails
- Cited
- Yang 2026
Updated 2026-06-08
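The recursive-deletion example could be sketched as a rule that carries its own fix. The `Decision` and `BlockRule` types, the regex, and the `~/.agent-trash/` location are all hypothetical and are not SafeFixEngine’s rules; the sketch also only models two of the four outcomes (allow, and block with alternative).

```python
import re
import shlex
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Decision:
    action: str                 # "allow" or "block"
    alternative: Optional[str]  # safer command returned together with a block


@dataclass(frozen=True)
class BlockRule:
    pattern: re.Pattern
    safe_fix: Callable[[str], str]


def move_to_trash(command: str) -> str:
    """Rewrite a recursive delete as a move into a (hypothetical) trash directory."""
    targets = [arg for arg in shlex.split(command)[1:] if not arg.startswith("-")]
    quoted = " ".join(shlex.quote(target) for target in targets)
    return f"mv -- {quoted} ~/.agent-trash/"


RULES = [BlockRule(re.compile(r"^\s*rm\s+-\w*[rR]"), move_to_trash)]


def evaluate(command: str) -> Decision:
    """Block matching commands, but always hand back something the agent can run instead."""
    for rule in RULES:
        if rule.pattern.search(command):
            return Decision("block", rule.safe_fix(command))
    return Decision("allow", None)
```

Because the fix is a field on the rule, a block rule cannot be registered without deciding what the agent should do instead.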
When: You are adding new skills to a production agent’s skill library, whether manually or through an automated extraction process.

Use: Run every candidate skill against a small set of held-out task examples before adding it to the active retrieval pool. Define a minimum pass rate. Skills that don’t reach that threshold don’t enter the library. This admission gate is the single highest-leverage quality control mechanism for skill libraries; it prevents low-quality skills from polluting retrieval at the moment they are created, before they can affect production behavior.

Evidence: MUSE-Autoskill’s Skill Evaluator tests each auto-generated skill on held-out tasks before admission to the active Skill Memory. The full system reached 68.40% overall accuracy (+15.21 pp over the no-skills baseline) and beat the human-skill ceiling on the 35 tasks where generation succeeded. The quality gate is the mechanism that makes auto-generated skills competitive with human-written ones: without it, the library accumulates noise from task-specific descriptions that don’t generalize (Lin, Li, Song, Jiang & Zhang 2026, arXiv:2605.27366).
- Agent architecture
- Skill libraries
- Quality gates
- Cited
- Lin, Li, Song, Jiang & Zhang 2026
Updated 2026-06-22
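An admission gate can be as small as a pass-rate check in front of the library. In this sketch a skill is modeled as a plain function from task input to output and the held-out set as input/expected pairs; both are simplifying assumptions, as are the names and the 0.8 default, and none of it is MUSE-Autoskill’s Skill Evaluator.

```python
from typing import Callable

Skill = Callable[[str], str]


def pass_rate(skill: Skill, held_out: list[tuple[str, str]]) -> float:
    """Fraction of held-out tasks the skill solves; a crash counts as a failure."""
    if not held_out:
        return 0.0  # no evidence means no admission
    passed = 0
    for task_input, expected in held_out:
        try:
            if skill(task_input) == expected:
                passed += 1
        except Exception:
            pass  # a crashing skill fails this task
    return passed / len(held_out)


def admit(
    library: dict[str, Skill],
    name: str,
    skill: Skill,
    held_out: list[tuple[str, str]],
    min_pass_rate: float = 0.8,
) -> bool:
    """Add the skill to the library only if it clears the minimum pass rate."""
    if pass_rate(skill, held_out) >= min_pass_rate:
        library[name] = skill
        return True
    return False
```

Returning 0.0 for an empty held-out set means a skill cannot enter the library simply because nobody wrote tests for it.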
When: You are designing or reviewing a production agent skill library that accumulates entries over time.

Use: Assign every stored skill a quality score and set a retirement threshold before the library goes into production. Skills that fall below the threshold are removed on a recurring pass. Design the retirement mechanism at the same time as skill creation, not after the library has grown large. A library with no retirement path accumulates obsolete and incorrect entries that degrade retrieval precision for every skill around them.

Evidence: MUSE-Autoskill’s Skill Retiree removes entries below the quality threshold, and libraries with automatic retirement accumulated measurably fewer low-quality entries than static-library baselines over the same number of tasks. The noise reduction from retirement compounds over time: the longer a library runs without a retirement path, the more retrieval is contaminated by entries that were once correct but no longer generalize (Lin, Li, Song, Jiang & Zhang 2026, arXiv:2605.27366).
- Agent architecture
- Skill libraries
- Library maintenance
- Cited
- Lin, Li, Song, Jiang & Zhang 2026
Updated 2026-06-22
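A recurring retirement pass might look like the following, assuming the library keeps one quality score per skill name. How the score is computed is left open here; the function name and data shape are illustrative, not MUSE-Autoskill’s Skill Retiree.

```python
def retirement_pass(scores: dict[str, float], threshold: float) -> list[str]:
    """Remove every skill scoring below the threshold; return the retired names for logging."""
    retired = sorted(name for name, score in scores.items() if score < threshold)
    for name in retired:
        del scores[name]
    return retired
```

Returning the retired names gives the operator a record of what left the library and when, which helps if a retirement later turns out to be a mistake.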
When: You are designing an autonomous agent task loop that involves modifying, installing, or configuring code or software artifacts.

Use: Add an explicit completion-verification step as the final action in every agent task loop. The step must attempt to execute the artifact end-to-end in a clean environment and record success or failure before the agent marks the task done. Do not allow an agent to emit a “completed” signal based solely on the presence of generated output; require execution evidence.

Evidence: DeployBench found that 97 of 154 analyzed failures across 51 deployment tasks were completion-judgment errors: the agent stopped too early, before confirming the artifact ran. This was the single largest failure category, occurring across all four tested frontier models. The finding is structural: agents trained and evaluated on code-generation metrics have no incentive to develop a completion-verification reflex. Adding the step explicitly is the only reliable remedy identified in the study.
- Agent architecture
- Completion verification
- Autonomous agents
- Cited
- Wang, Qian, Zhang et al. 2026
Updated 2026-07-13
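For a Python artifact, the final verification step could be approximated as below. A fresh temporary directory plus the interpreter’s isolated mode (`-I`) is only a rough stand-in for a clean environment (a container or fresh virtual environment would be stricter), and the function name and result fields are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify_completion(script_source: str, timeout_seconds: float = 30.0) -> dict:
    """Execute the generated script end to end and return execution evidence."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "artifact.py"
        script.write_text(script_source)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return {"completed": False, "returncode": None, "stderr": "timed out"}
    # "completed" is derived from the run, never from the mere presence of output files.
    return {"completed": proc.returncode == 0, "returncode": proc.returncode, "stderr": proc.stderr}
```

The agent loop would then mark the task done only when `completed` is true, and attach the returned record as the evidence.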
When: You are building or deploying a long-horizon search agent that makes multiple sequential retrieval calls and are deciding whether to mask (discard) stale retrieved observations from the context to manage token budget.

Use: Run the agent without observation masking first and record baseline accuracy on a representative eval set. If baseline accuracy is already high (the model and retriever are both strong), do not mask by default: high-capacity models use retrieved tokens for implicit filtering, and masking removes the evidence they need. If baseline accuracy is moderate, test a retention window of 3 to 5 most recent tool results (exempting tool-call errors) and measure the delta. Make the decision per model-retriever combination, not as a global default across agent configurations.

Evidence: A systematic sweep over 4B-to-284B parameter models, three retrievers, and four benchmarks (BrowseComp-Plus, BrowseComp-ZH, GAIA-text, xbench-DeepSearch) found that the accuracy gain from observation masking follows an asymmetric inverted-U curve against the model’s base accuracy without masking. Peak gains of +11 to +13 percentage points occurred when a strong retriever met a mid-capacity model; the same intervention caused performance collapse in the saturated-model regime. The mechanism is confirmed via attention pattern analysis: high-capacity models direct reasoning-token attention toward retrieved content and lose signal when it is removed.
- Agentic search
- Context management
- Observation masking
- Cited
- Zhang, Xu, Li, Zhang, Jiang, Zhang & McAuley 2026
Updated 2026-07-20
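A retention window over tool results can be sketched as a pure function on the message list. The message shape (`role`, `content`, optional `is_error`), the placeholder text, and the reading of “exempting tool-call errors” as “errors are never masked and do not count toward the window” are assumptions made for this example.

```python
MASK = "[observation masked]"


def mask_stale_observations(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Return a copy of the context with all but the most recent tool results masked."""
    tool_positions = [
        index
        for index, message in enumerate(messages)
        if message["role"] == "tool" and not message.get("is_error", False)
    ]
    # A slice with -0 would select nothing, so handle keep_last == 0 explicitly.
    stale = set(tool_positions[:-keep_last]) if keep_last > 0 else set(tool_positions)
    return [
        {**message, "content": MASK} if index in stale else message
        for index, message in enumerate(messages)
    ]
```

Since the function leaves the original list untouched, the same recorded trajectory can be replayed with several `keep_last` values when measuring the accuracy delta for one model-retriever pair.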