Discovery · Practices Library

Discovery

Decisions made before code is written: what kind of project this is, where the model carries weight, and how much trust the work warrants.

When: A developer or team is starting an AI-assisted coding session and deciding how much to review, validate, or version-control the output.

Use: A trust posture set explicitly per project before work starts. Ask whether the project is a prototype, an internal tool, or production code touching user data, and let the answer drive how much manual review, sourcing, and validation you require. Make the trust calibration deliberate rather than intuitive.
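
One way to make the posture explicit is to write it down as data before the session starts. The sketch below is illustrative only: the three project kinds come from the text above, while the `TrustPosture` fields and the default values in `POSTURES` are hypothetical and would be replaced by whatever a team actually agrees on.

```python
from dataclasses import dataclass
from enum import Enum


class ProjectKind(Enum):
    PROTOTYPE = "prototype"
    INTERNAL_TOOL = "internal_tool"
    PRODUCTION = "production"  # production code touching user data


@dataclass(frozen=True)
class TrustPosture:
    manual_review: str     # how much human review AI-written code gets
    require_sources: bool  # whether generated claims and dependencies must be sourced
    validation: str        # what must pass before the code is accepted


# Hypothetical defaults, not recommendations from the cited study.
POSTURES = {
    ProjectKind.PROTOTYPE: TrustPosture("spot-check", False, "runs locally"),
    ProjectKind.INTERNAL_TOOL: TrustPosture("review each change", False, "unit tests"),
    ProjectKind.PRODUCTION: TrustPosture("line-by-line review", True, "tests plus security review"),
}


def posture_for(kind: ProjectKind) -> TrustPosture:
    """Look up the posture that was agreed on before work started."""
    return POSTURES[kind]
```

Calling `posture_for(ProjectKind.PRODUCTION)` at the start of a session turns the question "how much do we trust this output?" into a lookup rather than a judgment made mid-task.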

Evidence: Across 190,000 words of developer self-report (Reddit, LinkedIn, interviews), experienced practitioners already self-regulate trust contextually. They run high-trust workflows for weekend projects and prototypes and lower-trust workflows for anything involving authentication, user data, or production deployment. The practice emerged organically without industry guidance.
- Vibe coding
- Trust calibration
- Cited
- Pimenova, Fakhoury, Bird, Storey & Endres 2025
Updated 2026-05-07
When: You’re scoping or reviewing a system that uses an LLM somewhere in its pipeline and you need to know how much architectural exposure that creates.

Use: A “garbage-output test” before implementation. Ask: if the model returned malformed or hallucinated output right now, would the rest of the system still work correctly? If the answer is no, the LLM is load-bearing, not decorative, and needs schema validation, fallback paths, and explicit failure handling treated as core engineering work, not polish.
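
The garbage-output test can be run as an actual check. The following is a minimal sketch, assuming the model is expected to return a JSON object; the schema in `REQUIRED_FIELDS`, the `FALLBACK` value, and the bad samples are all hypothetical stand-ins for a real system's contract.

```python
import json

REQUIRED_FIELDS = {"difficulty": int, "hint": str}  # hypothetical schema
FALLBACK = {"difficulty": 1, "hint": "Try again."}  # hypothetical safe default


def parse_model_output(raw: str) -> dict:
    """Return validated output, or the fallback if the output is garbage."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)
    for name, expected_type in REQUIRED_FIELDS.items():
        # Strict type check so that, for example, true is not accepted as an int.
        if type(data.get(name)) is not expected_type:
            return dict(FALLBACK)
    # Drop any extra fields the model invented.
    return {name: data[name] for name in REQUIRED_FIELDS}


GARBAGE_SAMPLES = ["", "not json", "[1, 2]", '{"difficulty": "hard"}']


def survives_garbage() -> bool:
    """The garbage-output test: every bad sample must still yield usable output."""
    return all(parse_model_output(sample) == FALLBACK for sample in GARBAGE_SAMPLES)
```

If `survives_garbage()` cannot be made to pass without substantial work, that is a sign the LLM is load-bearing in the sense described above.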

Evidence: Two student-built games at the University of Calgary embedded LLMs as architectural components rather than flavor generators. Once outputs drove progression, fairness, and difficulty, model errors stopped feeling cosmetic and started reading as fairness violations. Developers reported that prompt engineering, schema enforcement, and validation pipelines were the work, not optional extras.
- AI in games
- Architecture
- Cited
- Johnson, Ahmed, Lang, Thethi, Zheng & de Souza Santos 2026
Updated 2026-05-07
When: You are selecting or evaluating an AI coding agent for use in a production or open-source codebase and are relying on vendor benchmarks or internal capability demos.

Use: A small-scale trial on representative repositories before standardizing on a single agent. Measure merge rate, reviewer commentary volume, merge conflict frequency, and 30-day churn on merged code. The five agents studied differed enough on all four dimensions that benchmark scores alone will not predict how an agent behaves in a real codebase.
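
The four trial metrics can be computed from a simple record per pull request. This is a sketch under stated assumptions: the `TrialPR` fields are hypothetical, and churn is defined here as the share of merged lines that were rewritten or deleted within 30 days, which is one reasonable reading of "churn check" rather than the cited study's definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialPR:
    agent: str
    merged: bool
    review_comments: int
    had_merge_conflict: bool
    lines_added: int
    lines_changed_within_30d: int  # added lines later rewritten or deleted


def summarize(prs: list[TrialPR]) -> dict[str, float]:
    """Summarize one agent's trial PRs on the four dimensions."""
    if not prs:
        raise ValueError("need at least one PR to summarize")
    merged = [p for p in prs if p.merged]
    added = sum(p.lines_added for p in merged)
    churned = sum(p.lines_changed_within_30d for p in merged)
    return {
        "merge_rate": len(merged) / len(prs),
        "comments_per_pr": mean(p.review_comments for p in prs),
        "conflict_rate": sum(p.had_merge_conflict for p in prs) / len(prs),
        "churn_30d": churned / added if added else 0.0,
    }
```

Running `summarize` once per agent on the same set of representative repositories gives a side-by-side table to set against the vendor's benchmark numbers.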

Evidence: Across 110,000 open-source PRs, Codex and Claude Code achieved higher merge rates than human contributors; Copilot and Devin achieved lower rates. Median merge time ranged from 0.5 minutes (Codex) to 0.4 hours, or 24 minutes (humans), a roughly 50-fold difference. Behavioral variation of this magnitude means no single benchmark comparison can substitute for observing each agent in a codebase similar to your own.
- AI coding agents
- Agent evaluation
- Tool selection
- Cited
- Popescu, Gros, Botocan, Pandita, Devanbu & Izadi 2026
Updated 2026-05-18
When: You are assessing why an LLM application is underperforming and your default response is to consider switching to a more capable model.

Use: Audit the harness first. Map every place the surrounding code makes a decision about what to retrieve, how much context to include, how to order or truncate it, and how to format outputs. Treat each of those as a potential optimization target before increasing model spend. Only escalate to a model upgrade after confirming the harness has been systematically evaluated.
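
A harness audit can start as a sweep over the decision points with the model held fixed. The sketch below assumes three hypothetical knobs in `HARNESS_KNOBS` and a caller-supplied `evaluate` function that runs an evaluation set under a given configuration and returns a score; neither is taken from the cited papers.

```python
from itertools import product

# Hypothetical harness knobs: each is a decision the surrounding code makes.
HARNESS_KNOBS = {
    "top_k": [3, 10],                            # how much to retrieve
    "context_order": ["relevance", "recency"],   # how to order context
    "output_format": ["json", "free_text"],      # how to format outputs
}


def sweep(evaluate, knobs=HARNESS_KNOBS):
    """Score every knob combination and return (score, config) pairs, best first."""
    names = list(knobs)
    results = []
    for values in product(*(knobs[name] for name in names)):
        config = dict(zip(names, values))
        results.append((evaluate(config), config))
    # Sort on the score only, so configs are never compared with each other.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results
```

The gap between the best and worst scores in the result is a rough measure of how much performance the harness alone controls, which is the number to know before paying for a larger model.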

Evidence: Lee et al. (2026) showed that varying only the harness around a fixed model produces a 6x performance spread on the same benchmark. The harness, meaning the code that determines what information to store, retrieve, and present, had more influence over outcomes than the model itself across text classification, retrieval-augmented math reasoning, and agentic coding experiments. The same principle held in a different domain: Zheng et al. (2026) found that scaffolding choices alone drove a 29-point accuracy gain (19% to 48% on FrontierMath Tier 4) using identical base model weights.
- Harness engineering
- LLM systems
- Cost optimization
Updated 2026-05-18
When: You are evaluating or selecting an agent harness for production use and are comparing options based on capability, latency, and cost.

Use: Retrieval method preference as one evaluation criterion alongside capability benchmarks. Run your standard query distribution through each candidate harness using both grep and vector retrieval. The harness whose preferred retrieval method matches yours (or the method your workload favors) is the better architectural fit, independent of raw capability scores. Harness selection is effectively a retrieval-architecture decision.
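
To find out which retrieval method a workload favors, the same labeled queries can be scored under both methods. The sketch below is a toy stand-in, not the cited study's setup: `grep_retrieve` is a case-insensitive substring match and `vector_retrieve` uses bag-of-words cosine similarity in place of real embeddings. In a real trial these two functions would be replaced by each candidate harness's own tools.

```python
import math
import re
from collections import Counter
from typing import Optional


def grep_retrieve(query: str, docs: list[str]) -> Optional[int]:
    """Return the index of the first document containing the query verbatim."""
    for i, doc in enumerate(docs):
        if query.lower() in doc.lower():
            return i
    return None


def _bow(text: str) -> Counter:
    """Bag-of-words vector; a crude stand-in for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def vector_retrieve(query: str, docs: list[str]) -> Optional[int]:
    """Return the index of the document most similar to the query, if any."""
    q = _bow(query)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    best, best_score = None, 0.0
    for i, doc in enumerate(docs):
        d = _bow(doc)
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        dot = sum(q[t] * d[t] for t in q)
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        if score > best_score:
            best, best_score = i, score
    return best


def hit_rate(retrieve, labeled_queries, docs) -> float:
    """Fraction of (query, expected_index) pairs the retriever gets right."""
    hits = sum(retrieve(q, docs) == expected for q, expected in labeled_queries)
    return hits / len(labeled_queries)
```

Comparing `hit_rate(grep_retrieve, ...)` with `hit_rate(vector_retrieve, ...)` on a sample of real queries shows which method the workload leans toward, and that can then be matched against each harness's observed preference.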

Evidence: The grep vs. vectors study showed that provider-tooling inductive biases are consistent and measurable: on the same benchmark, Claude Code consistently favored grep and Gemini CLI consistently favored vector retrieval. These biases arise from how each harness was developed to work with its native model stack. Selecting a harness without testing retrieval compatibility means inheriting those biases unknowingly, which can cause persistent accuracy degradation that looks like a model problem but is an architectural mismatch.
- Agentic search
- Harness engineering
- Tool selection
- Retrieval-augmented generation
- Cited
- Sen, Kasturi, Lumer, Gulati, Subbiah et al. 2026
Updated 2026-06-01
When: You are choosing between context-window extension, retrieval augmentation, and in-weight memory compression for a long-session agent, and need to reason through the tradeoffs before committing to infrastructure.

Use: Map the decision across three axes: (1) Does the information needing recall come from a stored corpus or from within the current session? (2) Is the information easy to chunk and query, or is it diffuse and contextual? (3) Can you afford the infrastructure overhead of an external index? If the answers are session-scoped, diffuse, and low-infrastructure, online associative memory is the candidate to prototype first. If the information is corpus-scoped and queryable, RAG remains correct. If general extended context is the bottleneck, context-window extension is the right lever.
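
The three-axis mapping can be written as a small decision function. This is a sketch of the rules as stated above; the names are hypothetical, and the final fallback to context-window extension for combinations the text does not cover is an assumption, not part of the original guidance.

```python
from enum import Enum


class MemoryApproach(Enum):
    ONLINE_ASSOCIATIVE_MEMORY = "online associative memory"
    RAG = "retrieval augmentation"
    CONTEXT_EXTENSION = "context-window extension"


def first_candidate(
    session_scoped: bool,    # axis 1: recall comes from the current session, not a stored corpus
    easy_to_chunk: bool,     # axis 2: information is easy to chunk and query
    can_afford_index: bool,  # axis 3: an external index is affordable
) -> MemoryApproach:
    """Return the approach to prototype first under the three-axis mapping."""
    if session_scoped and not easy_to_chunk and not can_afford_index:
        return MemoryApproach.ONLINE_ASSOCIATIVE_MEMORY
    if not session_scoped and easy_to_chunk:
        return MemoryApproach.RAG
    # Assumption: mixed cases are treated as a general context bottleneck.
    return MemoryApproach.CONTEXT_EXTENSION
```

Writing the answers down as three booleans forces each axis to be answered explicitly before any infrastructure is chosen.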

Evidence: The delta-mem results illustrate the tradeoff empirically: the mechanism outperforms retrieval baselines specifically on benchmarks that require sustained within-session recall (1.31x on MemoryAgentBench, 1.20x on LoCoMo) but provides no significant lift on general-purpose benchmarks where the frozen backbone was already adequate. This task-specificity means the gain is real but bounded: measuring your task distribution against benchmark task distributions is required before adoption.
- Agent memory
- System design
- Build vs. buy
- Cited
- Lei, Zhang, Li, Wang et al. 2026
Updated 2026-06-15
When: You are selecting a memory backend for a new agent that will accumulate state across many sessions, and you are comparing vector databases, document stores, and other options.

Use: Evaluate each candidate store against the four GEM operators before committing: can it ingest with indexing, revise a belief in place while preserving history, forget with an auditable trail, and retrieve current-state values without surfacing invalidated records? If a candidate store cannot satisfy revise and governed forget, plan explicitly for the workaround complexity. Temporal databases and event-sourced stores satisfy more of the GEM operators than standard vector stores and are worth prototyping if your agent’s memory will change frequently over its deployment lifetime.
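
To make the four operators concrete, here is a minimal in-memory, event-sourced sketch. The operator names come from the text above; the class, the method signatures, and the semantics are one hypothetical reading, not the GEM definition. Two simplifications are worth noting: there is no index on ingest, and a forgotten value stays in the log as history, which may not meet stricter deletion requirements.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    op: str                # "ingest", "revise", or "forget"
    key: str
    value: Optional[str]   # None records that the key was forgotten
    reason: str = ""


@dataclass
class EventSourcedMemory:
    log: list[Event] = field(default_factory=list)

    def ingest(self, key: str, value: str) -> None:
        self.log.append(Event("ingest", key, value))

    def revise(self, key: str, value: str, reason: str) -> None:
        # The old value is superseded, not overwritten, so history is preserved.
        self.log.append(Event("revise", key, value, reason))

    def forget(self, key: str, reason: str) -> None:
        # The forget itself is logged, which gives an audit trail.
        self.log.append(Event("forget", key, None, reason))

    def retrieve(self, key: str) -> Optional[str]:
        """Return only the current value; superseded or forgotten values never surface."""
        for event in reversed(self.log):
            if event.key == key:
                return event.value
        return None

    def history(self, key: str) -> list[Event]:
        return [event for event in self.log if event.key == key]
```

A candidate store can be checked against the same four calls: if `revise` or `forget` can only be implemented as delete-and-reinsert with no trail, that is the workaround complexity to plan for.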

Evidence: Orogat and Mansour (2026) show that vector databases and document stores were designed for semantic search and document retrieval, not for managing evolving agent state. They support ingest and retrieve fully; forget only partially (hard deletion without audit trail); and revise not at all in the GEM sense. Teams that select these stores for agents with frequently updating beliefs inherit structural limitations that cannot be addressed at the retrieval layer.
- Agent memory
- Data foundations
- Infrastructure selection
- Cited
- Orogat & Mansour 2026
Updated 2026-06-22