Discovery
Decisions made before code is written: what kind of project this is, where the model carries weight, and how much trust the work warrants.
-
When: A developer or team is starting an AI-assisted coding session and deciding how much to review, validate, or version-control the output.
Use: A trust posture set explicitly per project before work starts. Ask whether the project is a prototype, an internal tool, or production code touching user data, and let the answer drive how much manual review, sourcing, and validation you require. Make the regulation deliberate instead of intuitive.
Evidence: Across 190,000 words of developer self-report (Reddit, LinkedIn, interviews), experienced practitioners already self-regulate trust contextually. They run high-trust workflows for weekend projects and prototypes and lower-trust workflows for anything involving authentication, user data, or production deployment. The practice emerged organically without industry guidance.
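One way to make the posture explicit is to write it down as data before the session starts. The sketch below is a minimal illustration, not a prescribed scheme: `ProjectKind`, `TrustPosture`, `POSTURES`, and `posture_for` are hypothetical names, and the review levels are placeholder defaults to adjust to your own risk tolerance.

```python
from dataclasses import dataclass
from enum import Enum


class ProjectKind(Enum):
    PROTOTYPE = "prototype"
    INTERNAL_TOOL = "internal_tool"
    PRODUCTION = "production"


@dataclass(frozen=True)
class TrustPosture:
    manual_review: str     # how closely a human reads generated code
    require_sources: bool  # whether claims and dependencies must be sourced
    require_tests: bool    # whether generated code must ship with passing tests


# Hypothetical defaults, chosen for illustration only.
POSTURES = {
    ProjectKind.PROTOTYPE: TrustPosture("spot-check", False, False),
    ProjectKind.INTERNAL_TOOL: TrustPosture("review diffs", False, True),
    ProjectKind.PRODUCTION: TrustPosture("line-by-line", True, True),
}


def posture_for(kind: ProjectKind, touches_user_data: bool = False) -> TrustPosture:
    """Return the posture for a project, escalating when user data is involved."""
    if touches_user_data:
        # Anything touching user data gets the strictest posture,
        # regardless of how the project was labeled.
        return POSTURES[ProjectKind.PRODUCTION]
    return POSTURES[kind]
```

Calling `posture_for(ProjectKind.PROTOTYPE, touches_user_data=True)` returns the production posture, which mirrors the lower-trust treatment practitioners reported for work involving user data.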
-
When: You’re scoping or reviewing a system that uses an LLM somewhere in its pipeline and you need to know how much architectural exposure that creates.
Use: A “garbage-output test” before implementation. Ask: if the model returned malformed or hallucinated output right now, would the rest of the system still work correctly? If the answer is no, the LLM is load-bearing, not decorative, and needs schema validation, fallback paths, and explicit failure handling treated as core engineering work, not polish.
Evidence: Two student-built games at the University of Calgary embedded LLMs as architectural components rather than flavor generators. Once outputs drove progression, fairness, and difficulty, model errors stopped feeling cosmetic and started reading as fairness violations. Developers reported that prompt engineering, schema enforcement, and validation pipelines were the work, not optional extras.
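The garbage-output test can be run as an ordinary unit test: feed deliberately bad model output into the parsing boundary and check that the system lands on a safe default. The sketch below assumes a hypothetical game that asks the model for a level description as JSON. The field names, allowed values, and `FALLBACK_LEVEL` are invented for illustration and do not come from the games studied.

```python
import json

ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}
FALLBACK_LEVEL = {"difficulty": "medium", "enemies": 3}  # hypothetical safe default


def parse_level(raw: str) -> dict:
    """Validate raw model output; return a safe fallback if it is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK_LEVEL)
    if not isinstance(data, dict):
        return dict(FALLBACK_LEVEL)

    difficulty = data.get("difficulty")
    enemies = data.get("enemies")
    if not isinstance(difficulty, str) or difficulty not in ALLOWED_DIFFICULTIES:
        return dict(FALLBACK_LEVEL)
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(enemies, int) or isinstance(enemies, bool) or not 0 <= enemies <= 20:
        return dict(FALLBACK_LEVEL)
    return {"difficulty": difficulty, "enemies": enemies}


# Inputs a model could plausibly emit when it fails.
GARBAGE = [
    "",
    "Sure! Here is your level:",
    "[]",
    '{"difficulty": "impossible", "enemies": 3}',
    '{"difficulty": "easy", "enemies": -1}',
]


def passes_garbage_test() -> bool:
    """True if every garbage input degrades to the fallback instead of crashing."""
    return all(parse_level(raw) == FALLBACK_LEVEL for raw in GARBAGE)
```

If `passes_garbage_test()` would fail or raise for your real parsing boundary, the model is load-bearing at that point and the validation and fallback paths belong in the core design.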
-
When: You are selecting or evaluating an AI coding agent for use in a production or open-source codebase and are relying on vendor benchmarks or internal capability demos.
Use: A small-scale trial on representative repositories before standardizing on a single agent. Look at merge rate, reviewer commentary volume, merge conflict frequency, and a 30-day churn check on merged code. The five agents studied showed large enough differences in all four dimensions that agent selection based on benchmark scores alone will not predict real-codebase behavior.
Evidence: Across 110,000 open-source PRs, Codex and Claude Code achieved higher merge rates than human contributors; Copilot and Devin achieved lower rates. Median merge time ranged from 0.5 minutes (Codex) to 0.4 hours (humans), a roughly 50-fold difference. Behavioral variation of this magnitude means no single benchmark comparison can substitute for observing each agent in a codebase similar to your own.
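The four trial metrics can be tabulated per agent from a simple log of pull requests. The following is a rough sketch under stated assumptions: the `PullRequest` record and its fields are hypothetical, and "churn" is approximated here as the share of merged lines rewritten or deleted within 30 days, which is one possible definition rather than the one used in the study.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PullRequest:
    agent: str
    merged: bool
    review_comments: int
    had_conflict: bool
    lines_merged: int = 0
    lines_changed_within_30d: int = 0  # merged lines rewritten or deleted within 30 days


def summarize(prs: list[PullRequest]) -> dict[str, dict[str, float]]:
    """Compute the four trial metrics for each agent."""
    by_agent: dict[str, list[PullRequest]] = {}
    for pr in prs:
        by_agent.setdefault(pr.agent, []).append(pr)

    report = {}
    for agent, items in by_agent.items():
        merged = [pr for pr in items if pr.merged]
        merged_lines = sum(pr.lines_merged for pr in merged)
        churned = sum(pr.lines_changed_within_30d for pr in merged)
        report[agent] = {
            "merge_rate": len(merged) / len(items),
            "median_review_comments": median(pr.review_comments for pr in items),
            "conflict_rate": sum(pr.had_conflict for pr in items) / len(items),
            "churn_30d": churned / merged_lines if merged_lines else 0.0,
        }
    return report
```

Comparing the resulting rows side by side for each candidate agent, on repositories that resemble your own, gives the kind of behavioral picture a benchmark score does not.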
-
When: You are assessing why an LLM application is underperforming and your default response is to consider switching to a more capable model.
Use: Audit the harness first. Map every place the surrounding code makes a decision about what to retrieve, how much context to include, how to order or truncate it, and how to format outputs. Treat each of those as a potential optimization target before increasing model spend. Only escalate to a model upgrade after confirming the harness has been systematically evaluated.
Evidence: Lee et al. (2026) showed that varying only the harness around a fixed model produces a 6x performance spread on the same benchmark. The harness, meaning the code that determines what information to store, retrieve, and present, had more influence over outcomes than the model itself across text classification, retrieval-augmented math reasoning, and agentic coding experiments. The same principle held in a different domain: Zheng et al. (2026) found that scaffolding choices alone drove a 29-point accuracy gain (19% to 48% on FrontierMath Tier 4) using identical base model weights.
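A harness audit can start as a plain grid sweep: list the decisions the surrounding code makes, hold the model fixed, and score every combination. The sketch below is illustrative only. The knob names and values in `KNOBS` are hypothetical, and `evaluate` stands in for whatever function runs your own evaluation set against a given harness configuration. It is not the method used by Lee et al. or Zheng et al.

```python
from itertools import product
from typing import Callable

# Hypothetical harness knobs; each one is a decision the surrounding code makes.
KNOBS = {
    "top_k": [3, 10],                        # how many documents to retrieve
    "order": ["relevance", "recency"],       # how retrieved context is ordered
    "max_context_chars": [2000, 8000],       # where context is truncated
    "output_format": ["free_text", "json"],  # how the model is asked to answer
}


def sweep(evaluate: Callable[[dict], float]) -> list[tuple[float, dict]]:
    """Score every knob combination with the model held fixed; best first."""
    names = list(KNOBS)
    results = []
    for values in product(*(KNOBS[name] for name in names)):
        config = dict(zip(names, values))
        results.append((evaluate(config), config))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def spread(results: list[tuple[float, dict]]) -> float:
    """Ratio of best to worst score across the sweep."""
    scores = [score for score, _ in results]
    return max(scores) / min(scores) if min(scores) > 0 else float("inf")
```

A large `spread` on a fixed model suggests there is headroom in the harness that a model upgrade would not address. A full grid grows quickly, so with more knobs you would likely sample configurations instead of enumerating them.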
-
When: You are evaluating or selecting an agent harness for production use and are comparing options based on capability, latency, and cost.
Use: Retrieval method preference as one evaluation criterion alongside capability benchmarks. Run your standard query distribution through each candidate harness using both grep and vector retrieval. The harness whose preferred retrieval method matches yours (or the method your workload favors) is the better architectural fit, independent of raw capability scores. Harness selection is effectively a retrieval-architecture decision.
Evidence: The grep vs vectors study showed that provider-tooling inductive biases are consistent and measurable: on the same benchmark, Claude Code consistently favored grep and Gemini CLI consistently favored vector retrieval. These biases arise from how each harness was developed to work with its native model stack. Selecting a harness without testing retrieval compatibility means inheriting those biases unknowingly, which can cause persistent accuracy degradation that looks like a model problem but is actually an architectural mismatch.
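The comparison described above amounts to a small two-by-N table: each candidate harness scored once with grep and once with vector retrieval on the same queries. The sketch below assumes a hypothetical interface in which a harness is a callable taking a query and a retrieval method name and returning an answer. No real harness exposes this signature, so an adapter per tool would be needed. Exact-match scoring is also a simplification.

```python
from typing import Callable

# Hypothetical interface: (query, retrieval_method) -> answer.
Harness = Callable[[str, str], str]
METHODS = ("grep", "vector")


def accuracy(harness: Harness, method: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of queries answered exactly right with the given retrieval method."""
    correct = sum(harness(query, method) == expected for query, expected in cases)
    return correct / len(cases)


def compare(harnesses: dict[str, Harness], cases: list[tuple[str, str]]) -> dict[str, dict]:
    """Score each harness under both retrieval methods and note which it favors."""
    table = {}
    for name, harness in harnesses.items():
        scores = {method: accuracy(harness, method, cases) for method in METHODS}
        table[name] = {
            **scores,
            "prefers": max(scores, key=scores.get),
            "gap": abs(scores["grep"] - scores["vector"]),
        }
    return table
```

A harness with a large `gap` whose `prefers` value differs from the retrieval method your workload favors is a candidate for the architectural mismatch described above.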