Build
Implementation patterns: how to scaffold around probabilistic output, surface failure modes early, and keep human judgment in the loop while the code lands.
-
When: You’re designing a user-facing AI system and choosing how to present the model’s reasoning, recommendations, or confidence.
Use: Interfaces that prompt active interpretation rather than passive consumption. Give users enough to act and enough feedback to test their hypotheses, but stop short of presenting the system’s output as authoritative. Build in friction that requires users to evaluate whether the AI’s explanation is trustworthy in the current context.
Evidence: Qualitative analysis of Arknights showed that an interface that withholds and occasionally misleads, when paired with rich feedback, produced a more robust user-system relationship than one offering full transparency. Players developed working mental models through action, failure, and revision rather than through dashboards. The same pattern applies to XAI interfaces where comprehensive explanation often fails to produce comprehension.
-
When: You’re building a system that depends on LLM output for any progression, scoring, or downstream action.
Use: Failure-mode prototyping before happy-path implementation. Simulate hallucinated answers, malformed outputs, schema violations, and difficulty misfires up front. Decide how the system should respond to each before you build the success path, so that fallbacks, retries, and validation are part of the architecture rather than patches added under pressure.
Evidence: University of Calgary developers building two LLM-driven games reported that incorrect outputs were not bugs but fairness violations. They documented cases like a math question with no correct answer option and patterned outputs (correct answer always in the same multiple-choice slot) that broke the implicit contract with the player. The team explicitly recommends prototyping failure modes before the happy path.
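One way to prototype failure modes first is to write the bad outputs down as fixtures and decide the response to each before any real model call exists. The sketch below is illustrative only, assuming multiple-choice questions like those in the Calgary examples; `FAILURE_FIXTURES`, `check_question`, `answer_slot_is_patterned`, the response names, and the 0.6 threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical simulated outputs, written before the happy path is built.
FAILURE_FIXTURES = {
    "no_correct_option": {"prompt": "2 + 2 = ?", "options": ["3", "5", "6"], "answer": "4"},
    "malformed": {"prompt": "2 + 2 = ?"},
    "valid": {"prompt": "2 + 2 = ?", "options": ["3", "4", "5"], "answer": "4"},
}

def check_question(q: dict) -> str:
    """Return the pre-decided response for one generated question."""
    if not {"prompt", "options", "answer"} <= q.keys():
        return "retry"    # schema violation: ask the model again
    if q["answer"] not in q["options"]:
        return "discard"  # unanswerable question: never show it to the player
    return "accept"

def answer_slot_is_patterned(questions: list[dict], max_share: float = 0.6) -> bool:
    """Flag a batch where the correct answer sits in one slot too often."""
    slots = Counter(q["options"].index(q["answer"]) for q in questions)
    return max(slots.values()) / len(questions) > max_share
```

Each fixture doubles as a regression test once the real generator is wired in.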
-
When: Any LLM call whose output flows into downstream code, a database, a UI, or another model.
Use: Strict output schemas plus a validation pipeline. Define the exact format you require, parse against it, and reject or retry on schema violations. Constrain the model’s output space wherever possible. Treat free-form text from the model with the same skepticism you’d apply to a public API response or a form submission from an untrusted client.
Evidence: Calgary developers building Wizdom Run and Sena consistently described “building scaffolding around the model’s outputs”: structured schemas, validation pipelines, strict output formats. One reflection noted that ensuring LLM responses were formatted exactly as expected was what kept the back-end design coherent. Without that scaffolding, the probabilistic output broke deterministic gameplay rules.
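A parse-then-reject-or-retry loop can be sketched with the standard library alone. This is a minimal illustration, not the Calgary teams’ code: `REQUIRED_FIELDS`, `parse_strict`, `call_with_validation`, and the `call_model` callable are hypothetical names, and a real project might use a schema library instead.

```python
import json
from typing import Callable

# Hypothetical schema: exact field names and the type each must have.
REQUIRED_FIELDS = {"prompt": str, "options": list, "answer": str}

def parse_strict(raw: str) -> dict:
    """Parse model text against the schema; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        raise ValueError("unexpected or missing fields")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} has the wrong type")
    return data

def call_with_validation(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Retry the model call until its output passes the schema, then fail loudly."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_strict(call_model())
        except ValueError as exc:
            last_error = exc  # rejected: nothing unvalidated flows downstream
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Raising after the last attempt forces the caller to choose a fallback explicitly instead of passing malformed text along.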
-
When: A developer is in a vibe-coding or agent-driven coding session where the AI is writing or modifying many files in rapid succession.
Use: External version control with frequent, small commits, or an explicit instruction to the AI to log its own changes to a file. Commit before each new conversational turn that touches code, not at the end of the session. If you cannot commit per turn, ask the model to summarize the diff so you have a recoverable trail.
Evidence: Across the qualitative study’s 190,000 words of practitioner data, runaway code changes were one of the two highest-severity pain points. Practitioners reported sessions where 30+ files accumulated in the change log with hours of uncommitted work, leading to “fuckup cascades” that were difficult to unwind. External version control was one of the two most universal community-derived best practices.
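A per-turn commit can be scripted so it takes no thought mid-session. The helper below is a sketch under assumptions: `checkpoint_turn` and its message format are hypothetical, and it expects to run inside an existing git repository. The `run` parameter exists only so the command list can be inspected without touching a real repository.

```python
import subprocess
from typing import Callable, List, Sequence

def _run_git(cmd: Sequence[str]) -> None:
    subprocess.run(list(cmd), check=True)  # raises if git reports an error

def checkpoint_turn(turn: int, summary: str,
                    run: Callable[[Sequence[str]], None] = _run_git) -> List[List[str]]:
    """Stage everything and commit it as one small, labeled checkpoint."""
    commands = [
        ["git", "add", "-A"],
        # --allow-empty keeps the trail unbroken even if a turn changed nothing
        ["git", "commit", "--allow-empty", "-m", f"turn {turn}: {summary}"],
    ]
    for cmd in commands:
        run(cmd)
    return commands
```

Calling `checkpoint_turn(3, "rename parser")` before sending the next prompt leaves one commit per conversational turn to roll back to.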
-
When: You’re designing AI guidance, recommendations, or copilots where users will rely on the model’s stated reasoning to make their own decisions.
Use: Interfaces that mark explanations as provisional and give the user a low-cost way to disagree with them. Show the model’s reasoning, but also show the user the cost of taking it on faith. Pair recommendations with the option to override and observe consequences, so users practice judgment instead of compliance.
Evidence: Arknights reframes player agency from “take meaningful action” to “evaluate whether the system’s explanations are trustworthy.” When the in-game AI deliberately offered misleading deployment suggestions, players who had built independent mental models through earlier play could reject the recommendation and succeed. Players who deferred failed. The game design treated continuous evaluation as the skill, not blind trust.
-
When: You’re using an LLM as a critic, judge, or self-evaluator on its own (or another model’s) output.
Use: A critic prompt that asks “how much did this step gain compared to the previous state?” instead of “how good is this overall?” Score marginal change, not absolute quality. Where possible, surface concrete pre/post artifacts (the answer at step N-1 versus step N) so the comparison is grounded in observable change rather than vibes.
Evidence: Budget-Aware Value Trees rely on this distinction as a core technique. LLMs are well documented to be overconfident when scoring the absolute quality of their own reasoning. The authors found that the delta (the change between steps) was much harder for the model to inflate, which made step-level critics reliable enough to drive pruning decisions.
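A delta-scoring critic prompt might be assembled as below. The wording, the 0.0 to 1.0 scale, and the names `DELTA_CRITIC_TEMPLATE`, `build_delta_prompt`, and `parse_gain` are assumptions for illustration; this is not the prompt used by the Budget-Aware Value Trees authors.

```python
# Hypothetical template: it shows both artifacts and asks only about the change.
DELTA_CRITIC_TEMPLATE = (
    "Question: {question}\n\n"
    "Previous state (step N-1):\n{before}\n\n"
    "Current state (step N):\n{after}\n\n"
    "How much did this step gain compared to the previous state? "
    "Do not rate overall quality. Reply with one number from 0.0 (no gain) to 1.0."
)

def build_delta_prompt(question: str, before: str, after: str) -> str:
    """Fill the template with the concrete pre/post artifacts."""
    return DELTA_CRITIC_TEMPLATE.format(question=question, before=before, after=after)

def parse_gain(reply: str) -> float:
    """Read the critic's number and clamp it to the allowed range."""
    value = float(reply.strip())  # raises ValueError if the reply is not a number
    return min(1.0, max(0.0, value))
```

Clamping means an out-of-range reply cannot push a branch score above the scale the rest of the loop expects.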
-
When: Your agent is mid-task, has already spent some compute on a particular line of reasoning or tool calls, and the most recent step yielded little or no new information.
Use: An explicit early-exit rule on doomed paths. If the marginal gain from a step falls below a threshold, abandon the branch and try a different approach. Build the escape mechanism into the agent loop. Do not rely on the model to notice it is stuck, because LLMs are subject to sunk-cost behavior and will keep exploring failed paths.
Evidence: Budget-Aware Value Trees treat this as a first-class principle. Over four multi-hop QA benchmarks, the technique outperformed standard agents partly by pruning low-gain branches early, freeing compute for more promising ones. Spending 4x more tool calls on standard agents did not produce 4x better answers, indicating that without an early-exit mechanism, additional compute often goes to doomed paths.
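The early-exit rule can live in the loop itself rather than in the model’s judgment. The sketch below is a simplified illustration, not the Budget-Aware Value Trees algorithm: `explore`, the `step` callable (assumed to return the marginal gain of one more step on a branch), and the `min_gain` and `budget` defaults are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def explore(branches: Sequence[str], step: Callable[[str], float],
            min_gain: float = 0.05, budget: int = 10) -> List[Tuple[str, float]]:
    """Spend a shared step budget, abandoning a branch when a step gains too little."""
    log: List[Tuple[str, float]] = []
    for branch in branches:
        while budget > 0:
            gain = step(branch)
            budget -= 1
            log.append((branch, gain))
            if gain < min_gain:
                break  # the loop enforces the exit; the model is never asked
        if budget == 0:
            break      # out of compute: stop opening new branches
    return log
```

A production loop would also stop once a branch yields an answer; the point here is only that the abandonment decision is a threshold check in code.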
-
When: You are using a generative model to produce training data for a downstream model, particularly in domains with long-tailed or rare-category distributions (medical imaging, anomaly detection, fairness-sensitive applications).
Use: An IRS audit of the generator before relying on its outputs. If coverage is below 80%, the synthetic data will be systematically missing parts of the real distribution. Treat low-coverage synthetic data with lower weight in training, or supplement it with real examples from the underrepresented regions the generator avoids.
Evidence: The coverage failures Dombrowski et al. measured are not random noise: they represent the model clustering around modes and ignoring the tail of the distribution. For data augmentation use cases, this means a generator with 77% IRS coverage is producing augmented data biased toward already-overrepresented examples. Downstream models trained on this data inherit the bias invisibly, because FID on the generator showed no warning sign.
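The 80% rule can be turned into a simple gate on how much synthetic samples count during training. This is a sketch only: it assumes the coverage figure has already been produced by a separate audit, and the linear down-weighting in `synthetic_weight` is a hypothetical choice, not something Dombrowski et al. prescribe.

```python
def synthetic_weight(coverage: float, threshold: float = 0.80) -> float:
    """Per-sample training weight for synthetic data, given audited coverage in [0, 1]."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    if coverage >= threshold:
        return 1.0               # coverage is adequate: use at full weight
    return coverage / threshold  # hypothetical rule: scale down linearly below the bar
```

Under this rule a generator at 77% coverage would have its samples weighted slightly below real data, and the gap is a prompt to add real examples from the missing regions.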
-
When: You are designing or evaluating an AI tutoring, coaching, or advisory system and need to decide whether the system’s reasoning about users should be visible to operators or teachers.
Use: Log the system’s intermediate reasoning about the user’s state as a first-class output, not just the final response. Give teachers or administrators access to the inferred diagnosis, stability assessment, and strategy rationale. A system that explains why it said what it said builds operational trust and surfaces errors that inspecting responses alone cannot catch.
Evidence: SLOW’s open workspace design makes its four-stage reasoning chain inspectable after the fact. The authors frame transparency as a core design goal: a traceable decision path gives educators something to audit, contest, and learn from. Standard single-pass LLM tutors offer no equivalent window into how they interpreted the learner.
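Treating the reasoning as a first-class output can be as simple as a structured record written next to every response. The sketch below is not SLOW’s schema: `TutorTurnTrace`, `to_log_line`, and the field names are hypothetical, chosen to mirror the diagnosis, stability assessment, and strategy rationale named above.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class TutorTurnTrace:
    """One turn's intermediate reasoning, stored alongside the reply it produced."""
    learner_id: str
    diagnosis: str             # what the system inferred about the learner's state
    stability_assessment: str  # how settled that inference appears to be
    strategy_rationale: str    # why this teaching move was chosen
    response: str              # the text actually shown to the learner

def to_log_line(trace: TutorTurnTrace) -> str:
    """Serialize one trace as a JSON line that a teacher dashboard could read."""
    return json.dumps(asdict(trace), ensure_ascii=False)
```

Because the rationale sits in the same record as the response, a reviewer can contest the diagnosis itself and not only the final wording.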
-
When: You are building an AI-assisted learning or onboarding tool for a domain with significant visual-spatial content (biology, chemistry, engineering, anatomy, architecture) and choosing whether to include images alongside text responses.
Use: Deliver both text and relevant images from the source material in the same conversational response, rather than text only. The two features work through distinct mechanisms: conversation reduces the extraneous cognitive load of information-seeking; visual-verbal integration increases the germane load that builds durable knowledge schemas. Providing images without conversation, or conversation without images, captures only one of the two effects.
Evidence: Taneja, Singh and Goel (AIED 2026) ran a 124-person randomized controlled online study comparing three interfaces for learning cell biology: a multimodal conversational AI (MuDoC), a text-only conversational AI (TexDoC), and an LLM-powered semantic search tool with no conversation (DocSearch). MuDoC produced the highest post-test scores and the most positive reported learning experience. The ordering matched cognitive load theory predictions: MuDoC first, then TexDoC, then DocSearch.
-
When: You are building an agent that executes multi-step workflows through stateful tools (scheduling, inventory, project management, file operations) and a tool call’s success or failure determines what subsequent steps can safely do.
Use: A verification checkpoint after each tool call. Before proceeding, have the agent observe the post-call state and confirm the change took effect. Make this explicit in the agent loop rather than relying on the model to decide whether to check. In stateful interdependent environments, assuming a tool call succeeded and moving forward corrupts downstream steps in ways that can be traced back only with careful inspection.
Evidence: Li, Yang, Wang et al. (2026) named over-confidence, in the form of skipped verification, as one of three primary failure modes in ComplexMCP, a benchmark with 150+ interdependent stateful MCP tools. Agents proceeded after tool calls without checking downstream state. In stateless environments the cost is low: the next call either works or fails visibly. In stateful environments the cost is high: silent corruption propagates through the dependency chain and the root cause becomes hard to locate.
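A verification checkpoint can be a thin wrapper that every tool call passes through. This is a minimal sketch with hypothetical names (`verified_call`, `VerificationError`, and the `read_state` and `confirm` callables); it is not part of ComplexMCP or of any MCP client library.

```python
from typing import Any, Callable

class VerificationError(RuntimeError):
    """Raised when the observed post-call state does not show the expected change."""

def verified_call(tool: Callable[[], Any],
                  read_state: Callable[[], Any],
                  confirm: Callable[[Any], bool]) -> Any:
    """Run a tool, then observe the state and confirm the change took effect."""
    result = tool()
    state = read_state()  # check the environment itself, not the tool's return value
    if not confirm(state):
        raise VerificationError(f"tool reported {result!r} but state is {state!r}")
    return result
```

Raising at the checkpoint stops the workflow at the step that failed, which keeps the root cause next to its symptom.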
-
When: An LLM agent encounters a tool call failure, an API error, or an unexpected response mid-task, and the task could still be completed via an alternative tool or a different approach.
Use: An explicit fallback-routing step in the agent’s error-handling path. When a tool call fails, require the agent to ask: “Is there an alternative tool or approach that could accomplish this goal?” Build this question into the failure path as a structured prompt step rather than leaving it to the model to generate spontaneously. Distinguish between terminal failures (the goal genuinely cannot be achieved) and transient ones (a specific tool is unavailable or misbehaving) before the agent reports failure.
Evidence: Li, Yang, Wang et al. (2026) identified strategic defeatism as a recurring failure mode across frontier models in ComplexMCP: when agents encountered API errors, they stopped and reported failure rather than rerouting to alternative tools, even when alternative paths existed. The failure occurred most visibly under seed-controlled API failure injection, where a recoverable glitch triggered abandonment rather than recovery. The pattern was consistent across frontier models, indicating it reflects how models respond to error signals rather than a specific capability gap.
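The terminal-versus-transient distinction can be encoded in the error-handling path so that rerouting is the default. The sketch below is illustrative: `TransientToolError`, `TerminalFailure`, and `run_with_fallbacks` are hypothetical names, and the ordered list of alternatives is assumed to come from the structured “is there an alternative?” prompt step or from a static routing table.

```python
from typing import Any, Callable, List, Sequence, Tuple

class TransientToolError(Exception):
    """A specific tool is unavailable or misbehaving; the goal may still be reachable."""

class TerminalFailure(Exception):
    """The goal cannot be achieved by any available route."""

def run_with_fallbacks(goal: str,
                       tools: Sequence[Tuple[str, Callable[[str], Any]]]) -> Tuple[str, Any]:
    """Try each tool in order; report failure only after every alternative is exhausted."""
    failures: List[str] = []
    for name, tool in tools:
        try:
            return name, tool(goal)
        except TransientToolError as exc:
            failures.append(f"{name}: {exc}")  # record it and reroute, do not give up
    raise TerminalFailure(f"no route to {goal!r}; tried {failures}")
```

A tool that raises `TerminalFailure` itself is not caught, so a genuinely impossible goal still surfaces immediately.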