Architecture
How the system is shaped: agent topology, planning, step-level evaluation, and the structural choices that determine whether compute is spent well.
-
When: You’re choosing an architecture for a multi-hop reasoning task and the input context is well-curated and not adversarial.
Use: A single agent with a generous thinking-token budget as the default. Reach for multi-agent orchestration only when the input is noisy, adversarial, or too long for the model to use effectively. The crossover where multi-agent earns its overhead happens at heavy context degradation, not at moderate task complexity.
Evidence: Across three model families (Qwen3, DeepSeek, Gemini 2.5), five multi-agent architectures, and two multi-hop reasoning benchmarks (FRAMES and MuSiQue 4-hop), single-agent systems matched or beat every multi-agent variant under matched thinking-token budgets. Multi-agent variants only pulled ahead under heavy context corruption, such as 70% token substitution.
-
When: You’re building an agent that runs multiple tool calls (search, retrieval, code execution, API requests) per task and want to avoid wasting budget on dead-end paths.
Use: A critic that grades every step as it happens, not just the final answer. Score the marginal gain from each step rather than the absolute quality of the trajectory. Catching a wrong turn after one tool call is far cheaper than discovering it after ten.
Evidence: Budget-Aware Value Trees outperformed standard tool-augmented agents on four multi-hop QA benchmarks across two model families. The technique at 5 tool calls reached 33.8% accuracy, beating standard agents at 20 tool calls (33.4%). Step-level scoring of marginal gain was more reliable than absolute self-assessment, which LLMs are known to inflate.
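A minimal sketch of step-level critique, assuming the critic is available as a plain function that scores the trajectory so far. The names `run_with_step_critic`, `score_state`, `min_gain`, and `patience` are illustrative and do not come from the Budget-Aware Value Trees work. The toy scorer stands in for an LLM critic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepRecord:
    action: str
    value: float  # critic's estimate of progress after this step
    gain: float   # improvement over the previous step's estimate


def run_with_step_critic(
    actions: List[str],
    score_state: Callable[[List[str]], float],
    min_gain: float = 0.05,
    patience: int = 2,
) -> List[StepRecord]:
    """Take actions in order; stop after `patience` consecutive low-gain steps."""
    taken: List[str] = []
    records: List[StepRecord] = []
    prev_value = score_state(taken)
    stalls = 0
    for action in actions:
        taken.append(action)
        value = score_state(taken)
        gain = value - prev_value
        records.append(StepRecord(action, value, gain))
        stalls = stalls + 1 if gain < min_gain else 0
        if stalls >= patience:
            break  # abandon the path before more budget is spent
        prev_value = value
    return records


# Toy critic (hypothetical): only two of the actions actually help.
USEFUL = {"search capital", "lookup population"}


def toy_score(taken: List[str]) -> float:
    return 0.5 * len(USEFUL.intersection(taken))


records = run_with_step_critic(
    ["search capital", "search weather", "search sports", "lookup population"],
    toy_score,
)
for r in records:
    print(f"{r.action}: value={r.value:.2f} gain={r.gain:+.2f}")
```

The loop stops on the marginal gain of each step, not on the absolute score, so a path that plateaus is cut off even when its running score still looks acceptable.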
-
When: You’re starting any task that requires multiple tool calls, retrieval steps, or LLM iterations, whether that’s an autonomous agent or a vibe-coding session.
Use: A short up-front planning pass that sketches the logical shape of the problem before spending any compute. The plan does not need facts. It just needs the abstract steps, the expected number of calls, and the dependencies. Then execute against that plan and re-plan only when reality contradicts it.
Evidence: Two independent studies converged on this. Budget-Aware Value Trees use an explicit plan node before any tool call and outperform unplanned standard agents that are given 4x the tool-call budget. Separately, the vibe-coding qualitative study found “plan before you vibe” was one of the two most universal community-derived best practices, addressing the highest-severity failure modes around runaway code changes and structural breakdown.
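One way to represent such a plan is as a small dependency graph with an expected call count per step. This is a sketch under assumptions: `PlanStep`, `execution_order`, `needs_replan`, and the 1.5x `slack` factor are hypothetical, and overrunning the call budget is only one possible signal that reality has contradicted the plan.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List


@dataclass
class PlanStep:
    name: str
    expected_calls: int
    depends_on: List[str] = field(default_factory=list)


def execution_order(plan: List[PlanStep]) -> List[str]:
    """Order steps so every dependency runs before the step that needs it."""
    graph = {step.name: set(step.depends_on) for step in plan}
    return list(TopologicalSorter(graph).static_order())


def needs_replan(plan: List[PlanStep], calls_used: int, slack: float = 1.5) -> bool:
    """Flag a re-plan once actual calls exceed the planned total by `slack`."""
    planned = sum(step.expected_calls for step in plan)
    return calls_used > slack * planned


# The plan holds no facts, only abstract steps, call estimates, and dependencies.
plan = [
    PlanStep("find_entity", expected_calls=1),
    PlanStep("find_related_fact", expected_calls=2, depends_on=["find_entity"]),
    PlanStep("compose_answer", expected_calls=0,
             depends_on=["find_entity", "find_related_fact"]),
]
print(execution_order(plan))
print(needs_replan(plan, calls_used=5))
```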
-
When: You are building an AI system that gives personalized feedback, instruction, or coaching to individual users, and the pipeline currently makes a single LLM call to generate the response directly from user input.
Use: Separate the pipeline into at least two stages: a reasoning stage that infers the user’s current state (knowledge gaps, emotional state, likely misconceptions) before a response stage that generates the actual reply. Treat the inferred state as an explicit intermediate artifact, not an implicit assumption baked into the prompt.
Evidence: The SLOW framework demonstrates that separating learner-state inference from instructional response generation produces tutoring responses rated as more personalized, emotionally sensitive, and clear than single-pass LLM generation across hybrid human-AI evaluation. Ablation studies show each reasoning stage contributes independently; none can be removed without degrading output quality.
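The two-stage split can be sketched as follows, with the inferred state held in its own typed object. This is not the SLOW framework's implementation. `LearnerState`, `tutor_turn`, and both stub functions are hypothetical, and each stub stands in for a separate LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LearnerState:
    knowledge_gaps: List[str]
    emotional_state: str
    likely_misconceptions: List[str]


def tutor_turn(
    user_input: str,
    infer_state: Callable[[str], LearnerState],
    generate_reply: Callable[[str, LearnerState], str],
) -> Tuple[LearnerState, str]:
    """Stage 1 infers the learner's state; stage 2 writes the reply from it."""
    state = infer_state(user_input)  # explicit, inspectable artifact
    reply = generate_reply(user_input, state)
    return state, reply


# Stubs standing in for two separate LLM calls (hypothetical).
def stub_infer(user_input: str) -> LearnerState:
    return LearnerState(
        knowledge_gaps=["carrying in addition"],
        emotional_state="frustrated",
        likely_misconceptions=["digits are added independently"],
    )


def stub_reply(user_input: str, state: LearnerState) -> str:
    tone = "Let's slow down. " if state.emotional_state == "frustrated" else ""
    return f"{tone}We'll work on {state.knowledge_gaps[0]} next."


state, reply = tutor_turn("I keep getting 27 + 48 wrong!", stub_infer, stub_reply)
print(state)
print(reply)
```

Returning the state alongside the reply makes it possible to log, inspect, and evaluate the inference stage independently of the response stage.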
-
When: You are building or evaluating a retrieval-augmented generation system and need to decide how much of the optimization budget to spend on retrieval strategy versus context length.
Use: Retrieval precision as the primary lever before increasing context size. Design harness search experiments that compare selective retrieval approaches against larger context windows. More tokens are not always better: the Meta-Harness text classification result achieved a 7.7-point improvement using 4x fewer context tokens than the SOTA baseline, because the discovered harness retrieved more relevant content rather than more content.
Evidence: On an online text classification task, a harness discovered by automated search improved over a state-of-the-art context management system by 7.7 points while using 4 times fewer context tokens. The discovered harness also generalized to 9 out-of-distribution task variants (73.1% average accuracy), indicating it captured structural improvements to retrieval rather than task-specific overfitting.
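A toy illustration of tracking retrieval precision next to context cost when comparing a selective strategy with a broad one. The document IDs, the relevance set, and the flat `TOKENS_PER_DOC` cost are invented for the example and are not taken from the Meta-Harness experiments.

```python
from typing import List, Set


def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of retrieved items that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & relevant) / len(retrieved)


TOKENS_PER_DOC = 200  # assumed flat cost per retrieved document
relevant = {"d1", "d2", "d7"}

selective = ["d1", "d2", "d3"]
broad = [f"d{i}" for i in range(1, 13)]  # d1..d12

for name, docs in [("selective", selective), ("broad", broad)]:
    print(name, round(precision(docs, relevant), 2), len(docs) * TOKENS_PER_DOC)
```

In this toy case the broad strategy finds one more relevant document but spends four times the tokens at much lower precision. That is the trade-off a harness search experiment would need to measure on real data.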
-
When: You are designing a multi-agent system that needs to make progress on open-ended, iterative problems where failed intermediate attempts contain useful information.
Use: Persistent storage of rejected outputs as first-class artifacts. Record not just successful outputs but failed attempts alongside the identified reasons they failed. Surface this failure history to downstream agents so they do not re-explore dead ends and can build on what the failure revealed.
Evidence: In the AI Co-Mathematician system, a reviewer agent caught a flaw in a first-pass proof attempt. The flawed proof and the specific flaw were stored. When the human collaborator saw both the rejected proof and the reviewer’s identified weakness, they recognized immediately how to close the gap. The stored failure, with its explanation, was more useful than a silent rejection would have been.
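A minimal in-memory sketch of storing rejected outputs together with their reasons. `Attempt`, `AttemptLog`, and the example proof text are hypothetical. A real system would persist the log and would probably store richer reviewer output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    task: str
    output: str
    accepted: bool
    failure_reason: Optional[str] = None


class AttemptLog:
    """Keeps rejected outputs and the reason for rejection next to accepted ones."""

    def __init__(self) -> None:
        self._attempts: List[Attempt] = []

    def record(self, attempt: Attempt) -> None:
        if not attempt.accepted and not attempt.failure_reason:
            raise ValueError("rejected attempts must carry a failure reason")
        self._attempts.append(attempt)

    def failures_for(self, task: str) -> List[Attempt]:
        return [a for a in self._attempts if a.task == task and not a.accepted]

    def failure_context(self, task: str) -> str:
        """Render prior failures as text a downstream agent can be shown."""
        lines = [
            f"- Rejected: {a.output} | Reason: {a.failure_reason}"
            for a in self.failures_for(task)
        ]
        return "\n".join(lines)


log = AttemptLog()
log.record(Attempt("lemma-3", "proof by induction on n", accepted=False,
                   failure_reason="base case n=0 not covered"))
print(log.failure_context("lemma-3"))
```

Refusing to record a rejection without a reason enforces the point of the practice: a rejection with no explanation gives downstream agents nothing to build on.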
-
When: You are designing the memory layer for an agent that will operate across multiple sessions or long time horizons, where stored facts can become outdated as context evolves.
Use: Validity as a first-class property alongside relevance in your memory architecture. Assign each stored item a freshness window based on how quickly that category of information typically changes. At retrieval time, surface freshness metadata to the model alongside the retrieved content, and include a validity check in the prompt before the model acts on retrieved information. Do not rely on the model to independently notice that retrieved content may be stale.
Evidence: STALE demonstrates that frontier LLMs, at 55.2% accuracy, cannot reliably detect on their own that a stored memory is no longer valid, even when given context that implies it. The benchmark’s third axis, implicit policy adaptation, shows the hardest failure: the model must proactively update its behavior based on an implied change it was never explicitly told about. This failure is systematic enough across frontier models that architectural mitigation is more reliable than prompting alone.
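A sketch of attaching freshness metadata at retrieval time. The category names, window lengths, and the `annotate` output format are assumptions chosen for illustration. The STALE benchmark does not prescribe them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed freshness windows per category; tune these to your own data.
FRESHNESS_WINDOWS = {
    "preference": timedelta(days=180),
    "location": timedelta(days=7),
}
DEFAULT_WINDOW = timedelta(days=30)


@dataclass
class MemoryItem:
    text: str
    category: str
    stored_at: datetime


def annotate(item: MemoryItem, now: datetime) -> str:
    """Prefix retrieved content with freshness metadata for the prompt."""
    age = now - item.stored_at
    window = FRESHNESS_WINDOWS.get(item.category, DEFAULT_WINDOW)
    status = "fresh" if age <= window else "possibly stale"
    return f"[{status}; age {age.days}d; window {window.days}d] {item.text}"


item = MemoryItem("User lives in Lisbon", "location", datetime(2025, 1, 1))
print(annotate(item, now=datetime(2025, 1, 31)))
```

The annotated string is what goes into the prompt, so the model sees the age and the window explicitly and does not have to infer staleness from content alone.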
-
When: You are deploying an LLM agent that connects to more than 50 tools via MCP and are experiencing failures that do not appear in simpler benchmarks or internal evals.
Use: Audit tool retrieval as the first failure hypothesis before investigating model reasoning or plan generation. Log which tool the agent selected for each step and whether it was the correct tool. Treat a wrong tool selection rate above 10% as a retrieval-architecture problem, not a model capability problem. Mitigations to test: hierarchical tool indexing (coarse category filter before fine-grained selection), sharper semantic differentiation in tool descriptions, and query-time disambiguation that requires the agent to confirm its selection before calling.
Evidence: Li, Yang, Wang et al. (2026) built ComplexMCP, a benchmark with 150+ interdependent stateful MCP tools across 7 domains. At that scale, tool-retrieval saturation emerged as the primary failure mode across frontier models: agents selected the wrong tool due to overlapping descriptions, causing failures that originated before any planning or execution began. The failure mode was invisible in benchmarks with smaller tool sets, where distinct tools do not compete the way 150 overlapping options do.
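The audit described above can be sketched as a small log analysis. The tool names and log entries are hypothetical. `ToolChoice`, `wrong_selection_rate`, and `top_confusions` are illustrative helpers and not part of MCP or ComplexMCP.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToolChoice:
    step: int
    selected: str
    expected: str


def wrong_selection_rate(log: List[ToolChoice]) -> float:
    if not log:
        return 0.0
    return sum(c.selected != c.expected for c in log) / len(log)


def top_confusions(log: List[ToolChoice], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Most frequent (expected, selected) mix-ups; candidates for sharper descriptions."""
    pairs = Counter((c.expected, c.selected) for c in log if c.selected != c.expected)
    return pairs.most_common(n)


# Hypothetical tool names and log entries.
log = [ToolChoice(i, "calendar.create_event", "calendar.create_event") for i in range(8)]
log += [
    ToolChoice(8, "calendar.update_event", "calendar.create_event"),
    ToolChoice(9, "calendar.update_event", "calendar.create_event"),
]

rate = wrong_selection_rate(log)
print(f"wrong-tool rate: {rate:.0%}")
if rate > 0.10:  # the 10% threshold named in the practice
    print("treat as a retrieval-architecture problem:", top_confusions(log))
```

The confusion pairs show which tool descriptions overlap, which is where hierarchical indexing or description rewrites are most likely to pay off.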
-
When: You are running a chain-of-thought reasoning workload at scale and need to reduce token costs without sacrificing accuracy, and your current approach applies explicit CoT uniformly to every query regardless of difficulty.
Use: Two-phase inference as the default evaluation pattern before committing to a fixed CoT length budget. Route queries through a latent-space exploration phase first, and switch to explicit chain-of-thought only when the latent phase signals uncertainty or reaches a confidence threshold. Measure accuracy and token count both before and after the split to confirm the tradeoff is favorable on your task distribution.
Evidence: Li, Wang, Liu et al. (2026) applied this two-phase approach (LaTER) to Qwen3-14B without any additional training. Token usage fell 32% compared to uniform explicit CoT. On AIME 2025, accuracy rose from 70.0% to 73.3% at 10,661 tokens versus 15,730 tokens. The accuracy gain alongside the cost reduction indicates that latent exploration produces better intermediate representations for hard reasoning steps rather than just truncating them. The approach requires no fine-tuning and was validated across multiple model families, indicating the gain comes from the phased structure rather than a model-specific property.
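The before-and-after measurement can be reduced to two numbers: relative token reduction and accuracy change in percentage points. The sketch below only does that arithmetic on the AIME 2025 figures quoted above. It does not implement latent-space exploration, and `RunStats` and `compare` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunStats:
    accuracy_pct: float
    tokens: float


def compare(baseline: RunStats, candidate: RunStats) -> Dict[str, float]:
    """Token reduction (relative, in %) and accuracy change (percentage points)."""
    return {
        "token_reduction_pct": 100 * (baseline.tokens - candidate.tokens) / baseline.tokens,
        "accuracy_delta_pts": candidate.accuracy_pct - baseline.accuracy_pct,
    }


# Figures quoted above: uniform explicit CoT versus the two-phase split.
uniform_cot = RunStats(accuracy_pct=70.0, tokens=15_730)
two_phase = RunStats(accuracy_pct=73.3, tokens=10_661)

result = compare(uniform_cot, two_phase)
print({k: round(v, 1) for k, v in result.items()})
```

Running the same comparison on your own task distribution shows whether the split lands in the favorable quadrant, with fewer tokens and no loss of accuracy.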
-
When: You are deploying an agent pipeline and have no labeled routing history, but the cost gap between direct LLM inference and full agent execution is large enough to make routing worthwhile.
Use: A seed-set-based experience memory to bootstrap routing before any training data exists. Select a small set of queries that span the difficulty range you expect in production, run both the base model and the full agent on each, and record which system performed better. Use this memory to route new queries by retrieving similar past cases and applying a structured scoring step to decide whether agent capabilities are actually needed. Build the seed set to include genuinely easy queries (direct model call sufficient), borderline queries, and hard queries requiring agent execution. A seed set made only of hard queries will over-escalate.
Evidence: Wang, Qiu et al. (2026) built BoundaryRouter, a training-free router using early behavioral experience and rubric-guided reasoning. Starting from a seed set only, it reduced inference time by 60.6% versus always running the agent while improving accuracy by 28.6% over always using direct LLM inference. Prompt-based routing without experience memory underperformed by 37.9 percentage points, establishing that the memory component carries most of the routing signal.
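A toy sketch of routing from a seed-set memory. It is not BoundaryRouter: token-overlap (Jaccard) similarity stands in for whatever retrieval a real system would use, and a similarity-weighted vote stands in for the rubric-guided scoring step. The seed queries, `SeedCase`, and `route` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeedCase:
    query: str
    winner: str  # "direct" or "agent", recorded by running both systems


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def route(query: str, memory: List[SeedCase], k: int = 3) -> str:
    """Retrieve the k most similar seed cases and take a similarity-weighted vote."""
    nearest = sorted(memory, key=lambda c: jaccard(query, c.query), reverse=True)[:k]
    agent = sum(jaccard(query, c.query) for c in nearest if c.winner == "agent")
    direct = sum(jaccard(query, c.query) for c in nearest if c.winner == "direct")
    return "agent" if agent > direct else "direct"  # ties fall back to the cheap path


# Seed set spanning easy and hard queries (hypothetical outcomes).
memory = [
    SeedCase("what is the capital of france", "direct"),
    SeedCase("define photosynthesis", "direct"),
    SeedCase("compare quarterly revenue across the three uploaded reports", "agent"),
    SeedCase("find and fix the failing test in this repository", "agent"),
]

print(route("what is the capital of japan", memory))
print(route("fix the failing build in this repository", memory))
```

Because ties fall back to the direct path, a memory with no easy cases has nothing to vote for that path, which is one way the over-escalation described above shows up.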
-
When: You are building or diagnosing an agentic retrieval-augmented generation system and are considering improving retrieval accuracy by upgrading your embedding pipeline or vector database.
Use: A controlled comparison of grep vs vector retrieval inside your actual production harness before investing further in embedding infrastructure. Select 50 to 100 representative queries, run each through both retrieval paths in your specific harness and model combination, and measure accuracy. Let the result determine investment direction. Do not rely on isolated retrieval benchmark results, which are measured outside the harness and may not reflect which method your specific framework and model favor.
Evidence: Sen, Kasturi, Lumer, Gulati, Subbiah et al. (2026) ran 116 LongMemEval questions through grep and vector retrieval across four agent harnesses (Chronos, Claude Code, Codex, Gemini CLI). The harness layer moved accuracy more than the choice between retrieval methods. Claude Code with Opus and Haiku showed a persistent grep advantage; Gemini CLI with Gemini 3.1 Pro showed a persistent vector advantage. Same benchmark, opposite winners, driven by harness rather than retrieval method.
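The controlled comparison amounts to running the same cases through each retrieval path and scoring both. In the sketch below, `compare_paths` is a hypothetical helper and the two stub functions are placeholders. In practice each callable would run your full harness and model end to end with one retrieval path enabled, and exact-match scoring would be replaced by whatever grading your task needs.

```python
from typing import Callable, Dict, List, Tuple


def compare_paths(
    cases: List[Tuple[str, str]],
    paths: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Accuracy of each retrieval path on the same (query, expected answer) cases."""
    return {
        name: sum(answer(q) == expected for q, expected in cases) / len(cases)
        for name, answer in paths.items()
    }


# Stubs (hypothetical): canned answers in place of real harness runs.
def via_grep(query: str) -> str:
    return {"q1": "a", "q2": "b", "q3": "x"}.get(query, "")


def via_vector(query: str) -> str:
    return {"q1": "a", "q2": "y", "q3": "z"}.get(query, "")


cases = [("q1", "a"), ("q2", "b"), ("q3", "c")]
scores = compare_paths(cases, {"grep": via_grep, "vector": via_vector})
print(scores)
```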
-
When: You are designing an inference-time orchestration layer (best-of-N sampling, agent committee, self-consistency) for a task and deciding whether multiple model calls will meaningfully improve accuracy over a single call.
Use: A local verifier check before building the committee. Ask whether the task has an execution-based feedback mechanism: a test suite for code, a proof checker for formal reasoning, a constraint solver for planning, a type system for synthesis. If one exists, a committee design can reliably identify which candidate answer is correct and improve accuracy over single-call baselines. If no local verifier exists, invest in improving single-call quality instead, because the committee cannot determine which candidate to select without a soundness signal.
Evidence: Sunkaraneni, Beneventano, Neumarker, Poggio and Galanti (2026) proved formally that inference-time agent committees succeed only when the task provides a local soundness signal for identifiability. Empirically, a nano-model committee using SWE-bench test suites as the verifier reached 76.4% on SWE-bench Verified, matching Gemini 3 Pro and Claude Opus 4.5 Thinking standalone (up from 67.0% single-call). The oracle ceiling with a perfect verifier was 79.0%.
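A minimal sketch of verifier-gated candidate selection, using a tiny test suite as the local soundness signal. The candidates are hand-written lambdas standing in for sampled model outputs, and `passes` and `select_verified` are hypothetical names, not the cited paper's method.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int], int]


def passes(candidate: Candidate, tests: List[Tuple[int, int]]) -> bool:
    """Local verifier: the candidate must pass every test without raising."""
    try:
        return all(candidate(x) == expected for x, expected in tests)
    except Exception:
        return False


def select_verified(
    candidates: List[Candidate], tests: List[Tuple[int, int]]
) -> Optional[Candidate]:
    """Return the first candidate the verifier accepts, or None if none pass."""
    for candidate in candidates:
        if passes(candidate, tests):
            return candidate
    return None


# Three stand-in "samples" for an absolute-value function; only the last is correct.
tests = [(3, 3), (-2, 2), (0, 0)]
candidates = [
    lambda x: x,
    lambda x: -x,
    lambda x: x if x >= 0 else -x,
]

chosen = select_verified(candidates, tests)
print(chosen(-5) if chosen else "no candidate verified")
```

Without the `tests` list there is no basis for picking among the three candidates, which is the situation where extra samples add cost without adding accuracy.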