Architecture · Practices Library

Architecture

How the system is shaped: agent topology, planning, step-level evaluation, and the structural choices that determine whether compute is spent well.

When: You’re choosing an architecture for a multi-hop reasoning task and the input context is well-curated and not adversarial.

Use: A single agent with a generous thinking-token budget as the default. Reach for multi-agent orchestration only when the input is noisy, adversarial, or too long for the model to use effectively. The crossover where multi-agent earns its overhead happens at heavy context degradation, not at moderate task complexity.

Evidence: Across three model families (Qwen3, DeepSeek, Gemini 2.5), five multi-agent architectures, and two multi-hop reasoning benchmarks (FRAMES and MuSiQue 4-hop), single-agent systems matched or beat every multi-agent variant under matched thinking-token budgets. Multi-agent variants only pulled ahead under heavy context corruption, such as 70% token substitution.
- Single-agent reasoning
- Multi-agent systems
- Cited
- Tran & Kiela 2026
Updated 2026-05-07
When: You’re building an agent that runs multiple tool calls (search, retrieval, code execution, API requests) per task and want to avoid wasting budget on dead-end paths.

Use: A critic that grades every step as it happens, not just the final answer. Score the marginal gain from each step rather than the absolute quality of the trajectory. Catching a wrong turn after one tool call is far cheaper than discovering it after ten.

Evidence: Budget-Aware Value Trees outperformed standard tool-augmented agents on four multi-hop QA benchmarks across two model families. The technique at 5 tool calls reached 33.8% accuracy, beating standard agents at 20 tool calls (33.4%). Step-level scoring of marginal gain was more reliable than absolute self-assessment, which LLMs are known to inflate.
- Budget-aware reasoning
- Tool use
- Cited
- Li et al. 2026
Updated 2026-05-07
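The step-level critic described in this practice can be sketched as a loop that scores the trajectory after every step and stops once the marginal gain stalls. This is a minimal sketch, not the Budget-Aware Value Trees method: `score_state`, `min_gain`, and `patience` are hypothetical names, and in a real agent `score_state` would be a critic model rather than a toy function.

```python
from typing import Callable, List, Tuple


def run_with_step_critic(
    steps: List[str],
    score_state: Callable[[List[str]], float],  # hypothetical critic: trajectory -> score
    min_gain: float = 0.05,                     # assumed threshold for a "useful" step
    patience: int = 1,                          # consecutive low-gain steps tolerated
) -> Tuple[List[str], List[float]]:
    """Take steps one at a time and stop once marginal gain stalls."""
    taken: List[str] = []
    gains: List[float] = []
    previous = score_state(taken)
    stalled = 0
    for step in steps:
        taken.append(step)
        current = score_state(taken)
        gain = current - previous  # marginal gain of this step only
        gains.append(gain)
        previous = current
        stalled = stalled + 1 if gain < min_gain else 0
        if stalled >= patience:
            break  # stop before spending more budget on a dead end
    return taken, gains
```

With `patience=1` the loop halts after the first step that adds less than `min_gain`; raising `patience` trades some wasted calls for tolerance of noisy critic scores.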
When: You’re starting any task that requires multiple tool calls, retrieval steps, or LLM iterations, whether that’s an autonomous agent or a vibe-coding session.

Use: A short up-front planning pass that sketches the logical shape of the problem before spending any compute. The plan does not need facts. It just needs the abstract steps, the expected number of calls, and the dependencies. Then execute against that plan and re-plan only when reality contradicts it.

Evidence: Two independent studies converged on this. Budget-Aware Value Trees use an explicit plan node before any tool call and outperform unplanned standard agents that are given 4x the tool-call budget. Separately, the vibe-coding qualitative study found “plan before you vibe” was one of the two most universal community-derived best practices, addressing the highest-severity failure modes around runaway code changes and structural breakdown.
- Budget-aware reasoning
- Vibe coding
- Cited
- Li et al. 2026
- Pimenova, Fakhoury, Bird, Storey & Endres 2025
Updated 2026-05-07
When: You are building an AI system that gives personalized feedback, instruction, or coaching to individual users, and the pipeline currently makes a single LLM call to generate the response directly from user input.

Use: Separate the pipeline into at least two stages: a reasoning stage that infers the user’s current state (knowledge gaps, emotional state, likely misconceptions) before a response stage that generates the actual reply. Treat the inferred state as an explicit intermediate artifact, not an implicit assumption baked into the prompt.

Evidence: The SLOW framework demonstrates that separating learner-state inference from instructional response generation produces tutoring responses rated as more personalized, emotionally sensitive, and clear than single-pass LLM generation across hybrid human-AI evaluation. Ablation studies show each reasoning stage contributes independently; none can be removed without degrading output quality.
- AI tutoring
- Reasoning architecture
- Personalization
- Cited
- Wei, Li & Jiang 2026
Updated 2026-05-11
When: You are building or evaluating a retrieval-augmented generation system and need to decide how much of the optimization budget to spend on retrieval strategy versus context length.

Use: Retrieval precision as the primary lever before increasing context size. Design harness search experiments that compare selective retrieval approaches against larger context windows. More tokens are not always better: the Meta-Harness text classification result achieved a 7.7-point improvement using 4x fewer context tokens than the SOTA baseline, because the discovered harness retrieved more relevant content rather than more content.

Evidence: On an online text classification task, a harness discovered by automated search improved over a state-of-the-art context management system by 7.7 points while using 4 times fewer context tokens. The discovered harness also generalized to 9 out-of-distribution task variants (73.1% average accuracy), indicating it captured structural improvements to retrieval rather than task-specific overfitting.
- Retrieval-augmented generation
- Context management
- Token efficiency
- Cited
- Lee, Nair, Zhang, Lee, Khattab & Finn 2026
Updated 2026-05-18
When: You are designing a multi-agent system that needs to make progress on open-ended, iterative problems where failed intermediate attempts contain useful information.

Use: Persistent storage of rejected outputs as first-class artifacts. Record not just successful outputs but failed attempts alongside the identified reasons they failed. Surface this failure history to downstream agents so they do not re-explore dead ends and can build on what the failure revealed.

Evidence: In the AI Co-Mathematician system, a reviewer agent caught a flaw in a first-pass proof attempt. The flawed proof and the specific flaw were stored. When the human collaborator saw both the rejected proof and the reviewer’s identified weakness, they recognized immediately how to close the gap. The stored failure, with its explanation, was more useful than a silent rejection would have been.
- Agentic AI
- Failure handling
- Multi-agent systems
- Expert reasoning
- Cited
- Zheng, von Glehn, Zwols et al. 2026
Updated 2026-05-18
When: You are designing the memory layer for an agent that will operate across multiple sessions or long time horizons, where stored facts can become outdated as context evolves.

Use: Validity as a first-class property alongside relevance in your memory architecture. Assign each stored item a freshness window based on how quickly that category of information typically changes. At retrieval time, surface freshness metadata to the model alongside the retrieved content, and include a validity check in the prompt before the model acts on retrieved information. Do not rely on the model to independently notice that retrieved content may be stale.

Evidence: STALE demonstrates that frontier LLMs cannot reliably self-detect memory invalidity, even when given context that implies it: they reach only 55.2% accuracy. The benchmark’s third axis, implicit policy adaptation, shows the hardest failure: the model must proactively update its behavior based on an implied change it was never explicitly told about. This failure is systematic enough across frontier models that architectural mitigation is more reliable than prompting alone.
- Agent memory
- Memory systems
- Cited
- Chao, Bai et al. 2026
Updated 2026-05-18
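One way to make the freshness windows in this practice concrete is to tag each stored item with a category and timestamp, then render a validity label next to the content at retrieval time. The sketch below is illustrative only: the categories, window lengths, and the names `MemoryItem`, `is_fresh`, and `render_for_prompt` are assumptions, not part of STALE.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical freshness windows per category; tune these for your own domain.
FRESHNESS_WINDOWS: Dict[str, timedelta] = {
    "user_preference": timedelta(days=90),
    "account_status": timedelta(days=1),
    "price": timedelta(hours=1),
}


@dataclass
class MemoryItem:
    text: str
    category: str
    stored_at: datetime


def is_fresh(item: MemoryItem, now: datetime) -> bool:
    # Unknown categories get a zero-length window, so they are flagged once they age at all.
    window = FRESHNESS_WINDOWS.get(item.category, timedelta(0))
    return now - item.stored_at <= window


def render_for_prompt(items: List[MemoryItem], now: datetime) -> str:
    """Surface freshness metadata next to each retrieved item."""
    lines = []
    for item in items:
        age_hours = (now - item.stored_at).total_seconds() / 3600
        status = "FRESH" if is_fresh(item, now) else "POSSIBLY STALE"
        lines.append(f"[{status}; category={item.category}; age_hours={age_hours:.1f}] {item.text}")
    lines.append("Before acting, check whether each POSSIBLY STALE item still holds.")
    return "\n".join(lines)
```

The final appended line is the explicit validity check: the model is told which items may be stale instead of being expected to notice on its own.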
When: You are deploying an LLM agent that connects to more than 50 tools via MCP and are experiencing failures that do not appear in simpler benchmarks or internal evals.

Use: Audit tool retrieval as the first failure hypothesis before investigating model reasoning or plan generation. Log which tool the agent selected for each step and whether it was the correct tool. Treat a wrong tool selection rate above 10% as a retrieval-architecture problem, not a model capability problem. Mitigations to test: hierarchical tool indexing (coarse category filter before fine-grained selection), sharper semantic differentiation in tool descriptions, and query-time disambiguation that requires the agent to confirm its selection before calling.

Evidence: Li, Yang, Wang et al. (2026) built ComplexMCP, a benchmark with 150+ interdependent stateful MCP tools across 7 domains. At that scale, tool-retrieval saturation emerged as the primary failure mode across frontier models: agents selected the wrong tool due to overlapping descriptions, causing failures that originated before any planning or execution began. The failure mode was invisible in benchmarks with smaller tool sets, where distinct tools do not compete the way 150 overlapping options do.
- Agent tool use
- MCP
- Tool retrieval
- Agent evaluation
- Cited
- Li, Yang, Wang et al. 2026
Updated 2026-05-18
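The audit in this practice reduces to simple arithmetic over a selection log. The sketch below assumes the log is a list of `(selected_tool, correct_tool)` pairs, which is a hypothetical format; the 10% threshold comes from the practice text, and `audit_tool_selection` is an illustrative name.

```python
from collections import Counter
from typing import Dict, List, Tuple


def audit_tool_selection(
    log: List[Tuple[str, str]],  # assumed format: (selected_tool, correct_tool) per step
    threshold: float = 0.10,     # wrong-selection rate that signals a retrieval problem
) -> Dict[str, object]:
    if not log:
        raise ValueError("empty selection log")
    # Count each (selected, expected) confusion so overlapping descriptions stand out.
    confusions = Counter((sel, exp) for sel, exp in log if sel != exp)
    wrong_rate = sum(confusions.values()) / len(log)
    return {
        "wrong_rate": wrong_rate,
        "retrieval_problem": wrong_rate > threshold,
        "top_confusions": confusions.most_common(3),
    }
```

The `top_confusions` list points at the tool pairs whose descriptions most likely overlap, which is where sharper descriptions or a category filter would be tested first.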
When: You are running a chain-of-thought reasoning workload at scale and need to reduce token costs without sacrificing accuracy, and your current approach applies explicit CoT uniformly to every query regardless of difficulty.

Use: Two-phase inference as the default evaluation pattern before committing to a fixed CoT length budget. Route queries through a latent-space exploration phase first, and switch to explicit chain-of-thought only when the latent phase signals uncertainty or reaches a confidence threshold. Measure accuracy and token count both before and after the split to confirm the tradeoff is favorable on your task distribution.

Evidence: Li, Wang, Liu et al. (2026) applied this two-phase approach (LaTER) to Qwen3-14B without any additional training. Token usage fell 32% compared to uniform explicit CoT. On AIME 2025, accuracy rose from 70.0% to 73.3% at 10,661 tokens versus 15,730 tokens. The accuracy gain alongside the cost reduction indicates that latent exploration produces better intermediate representations for hard reasoning steps rather than just truncating them. The approach requires no fine-tuning and was validated across multiple model families, indicating the gain comes from the phased structure rather than a model-specific property.
- Inference optimization
- Chain-of-thought
- Token efficiency
- Cost optimization
- Cited
- Li, Wang, Liu et al. 2026
Updated 2026-05-25
When: You are deploying an agent pipeline and have no labeled routing history, but the cost gap between direct LLM inference and full agent execution is large enough to make routing worthwhile.

Use: A seed-set-based experience memory to bootstrap routing before any training data exists. Select a small set of queries that span the difficulty range you expect in production, run both the base model and the full agent on each, and record which system performed better. Use this memory to route new queries by retrieving similar past cases and applying a structured scoring step to decide whether agent capabilities are actually needed. Build the seed set to include genuinely easy queries (direct model call sufficient), borderline queries, and hard queries requiring agent execution. A seed set made only of hard queries will over-escalate.

Evidence: Wang, Qiu et al. (2026) built BoundaryRouter, a training-free router using early behavioral experience and rubric-guided reasoning. Starting from a seed set only, it reduced inference time by 60.6% versus always running the agent while improving accuracy by 28.6% over always using direct LLM inference. Prompt-based routing without experience memory underperformed by 37.9 percentage points, establishing that the memory component carries most of the routing signal.
- Agent routing
- Cost optimization
- Cold start
- Inference efficiency
- Cited
- Wang, Qiu et al. 2026
Updated 2026-05-25
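The seed-set memory in this practice can be approximated with a nearest-neighbour lookup. The sketch below is not BoundaryRouter: it swaps the paper's rubric-guided reasoning for a similarity-weighted vote, and uses token-level Jaccard similarity in place of an embedding model. `SeedCase`, `jaccard`, and `route` are hypothetical names.

```python
from typing import List, Tuple

SeedCase = Tuple[str, str]  # (query, winner), where winner is "direct" or "agent"


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for an embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def route(query: str, seed_memory: List[SeedCase], k: int = 3) -> str:
    # Retrieve the k most similar seed cases and let them vote, weighted by similarity.
    nearest = sorted(((jaccard(query, q), w) for q, w in seed_memory), reverse=True)[:k]
    agent_score = sum(s for s, w in nearest if w == "agent")
    direct_score = sum(s for s, w in nearest if w == "direct")
    # Ties fall back to the cheaper direct call.
    return "agent" if agent_score > direct_score else "direct"
```

Because every vote comes from the seed set, a memory that holds only hard queries can return nothing but "agent", which is the over-escalation the practice warns about.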
When: You are building or diagnosing an agentic retrieval-augmented generation system and are considering improving retrieval accuracy by upgrading your embedding pipeline or vector database.

Use: A controlled comparison of grep vs vector retrieval inside your actual production harness before investing further in embedding infrastructure. Select 50 to 100 representative queries, run each through both retrieval paths in your specific harness and model combination, and measure accuracy. Let the result determine investment direction. Do not rely on isolated retrieval benchmark results, which are measured outside the harness and may not reflect which method your specific framework and model favor.

Evidence: Sen, Kasturi, Lumer, Gulati, Subbiah et al. (2026) ran 116 LongMemEval questions through grep and vector retrieval across four agent harnesses (Chronos, Claude Code, Codex, Gemini CLI). The harness layer moved accuracy more than the choice between retrieval methods. Claude Code with Opus and Haiku showed a persistent grep advantage; Gemini CLI with Gemini 3.1 Pro showed a persistent vector advantage. Same benchmark, opposite winners, driven by harness rather than retrieval method.
- Agentic search
- Retrieval-augmented generation
- Harness engineering
- Token efficiency
- Cited
- Sen, Kasturi, Lumer, Gulati, Subbiah et al. 2026
Updated 2026-06-01
When: You are designing an inference-time orchestration layer (best-of-N sampling, agent committee, self-consistency) for a task and deciding whether multiple model calls will meaningfully improve accuracy over a single call.

Use: A local verifier check before building the committee. Ask whether the task has an execution-based feedback mechanism: a test suite for code, a proof checker for formal reasoning, a constraint solver for planning, a type system for synthesis. If one exists, a committee design can reliably identify which candidate answer is correct and improve accuracy over single-call baselines. If no local verifier exists, invest in improving single-call quality instead, because the committee cannot determine which candidate to select without a soundness signal.

Evidence: Sunkaraneni, Beneventano, Neumarker, Poggio and Galanti (2026) proved formally that inference-time agent committees succeed only when the task provides a local soundness signal for identifiability. Empirically, a nano-model committee using SWE-bench test suites as the verifier reached 76.4% on SWE-bench Verified, matching Gemini 3 Pro and Claude Opus 4.5 Thinking standalone (up from 67.0% single-call). The oracle ceiling with a perfect verifier was 79.0%.
- Agent committees
- Inference-time scaling
- Verifiability
- Cost optimization
- Cited
- Sunkaraneni, Beneventano, Neumarker, Poggio & Galanti 2026
Updated 2026-06-01
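The role of the local verifier can be shown with a few lines: the committee only helps because something outside the model can tell candidates apart. This is a minimal sketch under stated assumptions; `select_with_verifier` and `passes_tests` are hypothetical names, and the candidates are plain Python callables standing in for model-generated patches.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def select_with_verifier(
    candidates: Iterable[T],
    verifier: Callable[[T], bool],  # local soundness signal, e.g. a test suite
) -> Optional[T]:
    """Return the first candidate the verifier accepts, or None if none pass."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


def passes_tests(fn: Callable[[int, int], int]) -> bool:
    # Toy test suite for an "add" function; a crash counts as a failure.
    cases = [((2, 3), 5), ((-1, 1), 0)]
    try:
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False
```

If `passes_tests` did not exist, the function would have no basis for preferring one candidate over another, which is the case where the practice recommends investing in single-call quality instead.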
When: You are optimizing a test-time scaling strategy (self-consistency, chain-of-thought search) and the current thresholds for branching, probing, or stopping were set by intuition and hand-tuning experiments.

Use: Build a frozen replay cache of reasoning traces from your base model on a representative problem set, then run a coding agent to discover the controller program against that cache. Evaluate candidate controllers without live model calls during the search loop. Re-run discovery whenever you change base models or shift problem distributions. The cache-based evaluation makes each candidate assessment nearly free, so the agent can iterate far more extensively than human researchers can with live-inference experiments.

Evidence: AutoTTS (Zheng et al. 2026) discovered the Confidence Momentum Controller (CMC) by running a coding agent against a frozen AIME24 trace cache with zero LLM calls during evaluation. One complete discovery run cost $39.90 and 160 minutes. CMC reduced token usage by 69.5% compared to SC@64 (self-consistency with 64 parallel samples) while maintaining matched average accuracy across four Qwen3 model scales (45.3 vs 45.2). CMC outperformed every manually designed baseline and generalized to held-out AIME25 and HMMT25 benchmarks without retuning.
- Test-time scaling
- Inference optimization
- Controller discovery
- Cost reduction
- Cited
- Zheng, Liu, Huang et al. 2026
Updated 2026-06-08
When: You need to expose a test-time scaling strategy’s cost-accuracy operating point to infrastructure or product teams who cannot interpret or modify internal controller logic.

Use: A single continuous β parameter (0 = maximum token efficiency, 1 = maximum peak accuracy) that maps deterministically to all internal thresholds in the controller. Document the β-accuracy and β-token curves on representative test tasks before deployment so teams can select an operating point without running ablations. Both ends of the curve and the midpoint should be characterized; the efficiency gains are largest near β=0.5, not at the extremes.

Evidence: AutoTTS’s β-parameterization collapsed the CMC controller’s multi-threshold configuration into one knob. At β=0.5, tokens fell 69.5% vs SC@64 at matched accuracy (45.3 vs 45.2 averaged across four Qwen3 model scales). At β=1.0, CMC exceeded all handcrafted baselines in 5 of 8 model-benchmark pairs on peak accuracy. Both operating points came from the same discovered controller without re-running discovery.
- Inference optimization
- Controllability
- Token efficiency
- Deployment
- Cited
- Zheng, Liu, Huang et al. 2026
Updated 2026-06-08
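A deterministic β-to-threshold mapping can be as simple as interpolating between two characterized endpoints. The sketch below is not AutoTTS's mapping, which this page does not specify: the threshold names, the endpoint values, and the linear interpolation are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerThresholds:
    stop_confidence: float  # hypothetical: stop sampling once confidence exceeds this
    max_samples: int        # hypothetical: hard cap on sampled traces


# Assumed endpoints: beta=0 favors token efficiency, beta=1 favors peak accuracy.
EFFICIENT = ControllerThresholds(stop_confidence=0.60, max_samples=4)
ACCURATE = ControllerThresholds(stop_confidence=0.95, max_samples=64)


def _lerp(lo: float, hi: float, beta: float) -> float:
    # Written so beta=0 returns lo exactly and beta=1 returns hi exactly.
    return lo * (1.0 - beta) + hi * beta


def thresholds_for(beta: float) -> ControllerThresholds:
    """Map one knob to every internal threshold, deterministically."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    return ControllerThresholds(
        stop_confidence=_lerp(EFFICIENT.stop_confidence, ACCURATE.stop_confidence, beta),
        max_samples=round(_lerp(EFFICIENT.max_samples, ACCURATE.max_samples, beta)),
    )
```

Teams then pick a β from the documented curves and never touch the individual thresholds; a non-linear mapping would fit the same interface.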
When: You are running self-consistency sampling (generating N reasoning traces and aggregating by majority vote) and want to improve answer quality without generating more traces.

Use: Confidence-weighted aggregation instead of simple plurality vote. Score each reasoning trace by quality indicators before counting its vote. Weight answers by the accumulated confidence of the traces that produced them. High-confidence convergence on an answer is a stronger signal than raw count alone. Implement this as the first optimization pass before adding any path-level pruning.

Evidence: DDC’s Confidence-Weighted Bayesian Voting (CWBT) component improved accuracy on AIME 2025 by 15.6 percentage points over standard self-consistency using Qwen3-4B, because quality-weighted aggregation better identifies the correct answer when trace quality varies across the sampled pool. The accuracy gain appears even before factoring in the token reduction from early termination.
- Self-consistency
- Inference optimization
- Reasoning
- Cited
- Xu, Li, Zhao, Wu, Li & Yan 2026
Updated 2026-06-08
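The difference between plurality voting and confidence-weighted aggregation shows up on a small example. The sketch below uses a plain weighted sum, not the Bayesian voting used by DDC; the per-trace confidence scores are assumed to come from whatever quality indicator is available, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Trace = Tuple[str, float]  # (final_answer, confidence in [0, 1]); confidence source is assumed


def plurality_vote(traces: List[Trace]) -> str:
    """Baseline: every trace counts once."""
    return Counter(answer for answer, _ in traces).most_common(1)[0][0]


def weighted_vote(traces: List[Trace]) -> str:
    """Each trace contributes its confidence instead of a flat vote."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in traces:
        totals[answer] += confidence
    return max(totals, key=lambda answer: totals[answer])
```

For example, with traces `[("42", 0.9), ("42", 0.8), ("17", 0.3), ("17", 0.2), ("17", 0.3)]`, plurality picks "17" on count (3 vs 2), while the weighted vote picks "42" on accumulated confidence (1.7 vs 0.8).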
When: You are building an automated interpretability pipeline and currently select which model features to examine by activation frequency rank or random sampling.

Use: A co-activation graph as the primary selection mechanism before any explanation work begins. Build a k-NN graph where each node is a feature (neuron or circuit element) and edges connect features that tend to co-activate on similar inputs. Apply a statistical separability metric to each node: features with high separability have crisp, testable activation patterns; features with low separability are entangled and hard to explain precisely. Route high-separability candidates to your explanation loop first. Frequency-based or random selection wastes explanation budget on features that are too diffuse to describe accurately.

Evidence: Marin-Llobet and Ferrando (2026) demonstrated that navigating activation space via a k-NN co-activation graph with statistical separability scoring outperformed alternative feature selection strategies for mechanistic interpretability. Features identified by the graph-based discovery agent produced better explanation outcomes on Gemma-2 and weight-sparse MLP neurons than those selected by frequency or at random, because high-separability features are precisely the ones an explanation loop can test and verify.
- Mechanistic interpretability
- Feature discovery
- Agent loops
- Cited
- Marin-Llobet & Ferrando 2026
Updated 2026-06-08
When: You are operating an AI agent that executes real-world tool calls (shell commands, file operations, HTTP requests, database queries) and your current safety posture relies on post-hoc audit logs or static keyword filters on raw tool call strings.

Use: A runtime interception layer that evaluates the semantic intent of every tool call before execution and returns a structured verdict (allow, warn, block, review). Place the interceptor between the agent and its tools so no call executes without passing through evaluation. Treat the interceptor as distinct from and complementary to sandbox restrictions: the sandbox constrains the execution environment; the interceptor evaluates what each action means inside that environment.

Evidence: Yang (2026) built AgentTrust, an 8-component runtime interceptor for agent tool calls. On a 300-scenario internal benchmark spanning six risk categories (file operations, network access, code execution, credential exposure, data exfiltration, system configuration), the production ruleset achieved 95.0% verdict accuracy at low-millisecond median latency. On 630 independently constructed real-world adversarial scenarios covering DevOps, cloud, container, and supply-chain operations, verdict accuracy was 96.7%. Latency overhead was small enough to be below network call variance for typical tool execution.
- Agent safety
- Runtime governance
- Tool use
- MCP
- Cited
- Yang 2026
Updated 2026-06-08
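The placement of the interceptor, between the agent and its tools, can be sketched independently of how verdicts are computed. The code below is not AgentTrust: `ToolCall`, `guarded_execute`, and `toy_evaluator` are hypothetical, and the toy evaluator uses substring rules only as a placeholder for a real semantic evaluator, which is exactly the kind of static filter the practice says is insufficient on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional, Tuple


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, Any]


def toy_evaluator(call: ToolCall) -> Verdict:
    # Placeholder rules; a real evaluator would assess intent, not substrings.
    command = str(call.args.get("command", ""))
    if call.tool == "shell" and "rm -rf /" in command:
        return Verdict.BLOCK
    if call.tool == "http" and call.args.get("method") == "POST":
        return Verdict.REVIEW
    if call.tool == "file_write":
        return Verdict.WARN
    return Verdict.ALLOW


def guarded_execute(
    call: ToolCall,
    evaluate: Callable[[ToolCall], Verdict],
    execute: Callable[[ToolCall], Any],
) -> Tuple[Verdict, Optional[Any]]:
    """Every call passes through evaluate(); blocked or held calls never reach the tool."""
    verdict = evaluate(call)
    if verdict in (Verdict.BLOCK, Verdict.REVIEW):
        return verdict, None
    return verdict, execute(call)
```

Keeping `evaluate` pluggable lets the verdict logic improve without changing the call path, and the sandbox still applies to whatever `execute` runs.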
When: An agent pipeline is losing track of earlier context within a single session, and RAG retrieval is too slow, too brittle, or too expensive for the session length you need.

Use: Evaluate online associative memory as a complement to or replacement for retrieval in the within-session memory slot. Specifically: if the task requires the model to remember what it did or was told several thousand tokens ago but the information is not clearly chunk-able or query-able, an online state-based mechanism that updates continuously at decode time addresses a different failure mode from retrieval. Baseline on MemoryAgentBench before and after to measure actual gain.

Evidence: Lei, Zhang, Li, Wang et al. (declare-lab / Nanyang Technological University, 2026) added a compact 8x8 online state matrix to frozen Qwen3-4B, Qwen3-8B, and SmolLM3-3B models via LoRA-style adapters. The state updates via delta-rule learning at inference time, with no changes to the backbone. On MemoryAgentBench it scored 1.31x the frozen backbone and 1.15x the strongest retrieval baseline; on LoCoMo (long-term conversational memory) it scored 1.20x the backbone. General-purpose benchmarks (HotpotQA, IFEval, GPQA Diamond) showed near-baseline scores, confirming the adapter did not degrade non-memory tasks.
- Agent memory
- Long-context
- Retrieval alternatives
- Cited
- Lei, Zhang, Li, Wang et al. 2026
Updated 2026-06-15
When: You are selecting a memory architecture for a production agent and need to compare candidates on more than retrieval quality and latency.

Use: Add a safety criterion to memory architecture selection. Before committing to a design, run the trigger-probe protocol against each candidate architecture with a matched probe set and a NullMemory baseline. Compare memory-induced violation rates across architectures at equivalent exposure lengths. The architecture with the lowest rate at production-scale exposure is the safety-preferable choice, independent of capability rankings. If resources do not allow a full protocol run, at minimum instrument the retrieval layer to detect elevated similarity to unsafe content patterns before generation, as Al-Tawaha et al. (2026) show that risk is detectable at retrieval time.

Evidence: Across the 8 memory architectures tested, memory-induced violation rates differed by configuration. Al-Tawaha et al. (2026) establish that architecture choice is a variable in longitudinal safety outcomes, not just a capability variable. The finding that risk is detectable at retrieval time, before generation, provides a tractable monitoring hook that architecture-level logging can expose.
- Agent memory
- Memory architecture
- Agent safety
- Cited
- Al-Tawaha, Gu, Niu, Jia & Jin 2026
Updated 2026-06-15
When: You are choosing a global compression setting (quantization, pruning, attention sparsity) for a production LLM and have not profiled per-token difficulty in your actual workload.

Use: Before committing to a uniform compression budget, measure the distribution of output entropy across a representative sample of your production traffic. Log top-1 token probability at each decode step. If a large fraction of steps have near-certain predictions, your compression budget is over-allocated to easy tokens and you have room to recover quality on hard tokens without increasing total compute.

Evidence: Akhauri & Abdelfattah (2026) found that a learned per-token scheduling policy outperformed uniform compression by up to 7.3 MMLU points at matched FLOPs, demonstrating that token difficulty is non-uniform and that uniform budgets leave quality on the table. The gain comes entirely from reallocating the same compute toward steps the model finds genuinely uncertain.
- Inference optimization
- LLM efficiency
- Compression
- Cited
- Akhauri & Abdelfattah 2026
Updated 2026-06-22
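The profiling step in this practice can be run on nothing more than a log of top-1 probabilities. The sketch below assumes the log is a flat list of floats in (0, 1], one per decode step; the 0.95 cutoff for "near-certain" and the name `profile_decode_steps` are assumptions. Top-1 probability is used here as a cheap proxy for entropy, not as entropy itself.

```python
import math
from typing import Dict, List


def profile_decode_steps(top1_probs: List[float], certain_at: float = 0.95) -> Dict[str, float]:
    """Summarize how many decode steps were near-certain predictions."""
    if not top1_probs:
        raise ValueError("no decode steps logged")
    easy = sum(1 for p in top1_probs if p >= certain_at)
    # Surprisal of the top-1 token in bits; higher means a harder step.
    mean_surprisal = sum(-math.log2(p) for p in top1_probs) / len(top1_probs)
    return {
        "easy_fraction": easy / len(top1_probs),
        "mean_top1_surprisal_bits": mean_surprisal,
    }
```

A high `easy_fraction` is the signal the practice describes: most steps would tolerate aggressive compression, leaving headroom for the hard ones.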
When: You are designing a lightweight efficiency scheduler or meta-controller to sit on top of a frozen LLM.

Use: Use the model’s own hidden state at each decode step as the primary input signal for any per-token scheduling policy. The hidden state encodes the model’s current uncertainty and context representation; a small policy network trained on teacher-forced episodes with a quality-versus-budget reward can learn to distinguish high-entropy from low-entropy steps from this signal alone, without requiring additional features or human-specified difficulty labels.

Evidence: SOL’s policy network was trained via GRPO on teacher-forced episodes using only the frozen LLM’s hidden state as input. It learned to jointly tune attention sparsity, MLP pruning, and quantization bit-width per token, producing a quality-efficiency Pareto front that dominated uniform baselines across all tested compute levels (Akhauri & Abdelfattah 2026, arXiv:2605.10875).
- Inference optimization
- LLM efficiency
- Reinforcement learning
- Policy networks
- Cited
- Akhauri & Abdelfattah 2026
Updated 2026-06-22
When: Your agent pipeline has a memory system and you are investigating why retrieval is returning stale or contradictory information, even after improving embedding models or chunk sizes.

Use: Audit every memory write in the pipeline before touching retrieval infrastructure. Label each write by its operator intent: ingest (new information the agent did not previously know), revise (correction or update to an existing stored belief), or forget (deliberate removal). If revise operations are implemented as delete-then-insert, you are losing provenance and creating a window where retrieval returns nothing or the wrong version. Fix the data-model issue before optimizing the retrieval layer.

Evidence: Orogat and Mansour (2026) formalize long-term agent memory as Governed Evolving Memory (GEM), showing that existing vector stores support ingest and retrieve but lack a revise operator with provenance tracking. The delete-then-insert workaround teams use to simulate revision loses update history and creates retrieval ambiguity during any non-atomic execution window. The paper argues the revise-operator gap, not retrieval quality, is the primary structural cause of agent memory failures in production.
- Agent memory
- Data foundations
- Memory systems
- Cited
- Orogat & Mansour 2026
Updated 2026-06-22
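The contrast with delete-then-insert is easiest to see in a store where every write is an append tagged with its operator intent. This is a minimal in-memory sketch, not the GEM formalism: `VersionedMemory`, `Version`, and the `source` field are hypothetical names, and a production store would also need atomicity and indexing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    value: Optional[str]  # None marks a deliberate forget
    op: str               # operator intent: "ingest", "revise", or "forget"
    source: str           # provenance: where this write came from


class VersionedMemory:
    def __init__(self) -> None:
        self._history: Dict[str, List[Version]] = {}

    def ingest(self, key: str, value: str, source: str) -> None:
        if key in self._history:
            raise KeyError(f"{key!r} already exists; use revise()")
        self._history[key] = [Version(value, "ingest", source)]

    def revise(self, key: str, value: str, source: str) -> None:
        # Append-only: the old version stays readable, so reads never see an empty window.
        self._history[key].append(Version(value, "revise", source))

    def forget(self, key: str, source: str) -> None:
        # A tombstone records the removal instead of erasing the history.
        self._history[key].append(Version(None, "forget", source))

    def current(self, key: str) -> Optional[str]:
        return self._history[key][-1].value

    def history(self, key: str) -> List[Version]:
        return list(self._history[key])
```

Because `revise` never deletes, a concurrent read returns either the old or the new version, and the `op` labels double as the write audit the practice asks for.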
When: You are running a mixture-of-agents or self-consistency pipeline and adding more agents or more diverse prompts is not improving accuracy past a plateau.

Use: Replace the majority vote or final-answer aggregation step with an LLM that reads the complete reasoning chains from all agents. Pass every agent’s full trace, not just its conclusion, to the aggregating model. Use anchored refinement so the synthesis is only accepted when it does not degrade below the majority baseline.

Evidence: Fadnavis, Kanakaraj and Wyss (2026) measured error correlations across varied prompts, samplers, and agent counts and found they remain high regardless of diversity applied to inputs. The trace-reading aggregator recovered correct answers from cases where all agents agreed on the wrong final answer. Across five benchmarks, beneficial corrections by the trace reader outweighed harmful ones. The SC-MoA variant with anchored refinement provided provable non-degradation against the majority baseline and achieved highest accuracy on all five benchmarks.
- Multi-agent systems
- Aggregation
- Self-consistency
- Cited
- Fadnavis, Kanakaraj & Wyss 2026
Updated 2026-06-29
When: You are choosing how to layer injection defenses in a production AI agent and deciding how much to rely on data-instruction separation as the primary control.

Use: Treat data-instruction separation as one layer in a defense stack, not as the complete answer. Layer it with: explicit trust-boundary scoping (which contexts can trigger consequential actions), behavioral monitoring for anomalous action sequences that may signal a successful injection, and graceful failure behavior that surfaces suspicion rather than silently executing. Document the residual risk from context-aware attacks that separator-based controls cannot block, and use that documentation to calibrate monitoring thresholds.

Evidence: The impossibility result in Abdelnabi and Bagdasarian (2026) shows that no single norm can simultaneously block all injections and permit all legitimate task flows. Data-instruction separation implements one such norm and is useful for blocking unsophisticated attacks, but it cannot close the gap that context-aware attacks exploit. The paper argues that graceful failure under attack is the correct target when perfect blocking is structurally unavailable.
- Agent security
- Prompt injection
- Defense-in-depth
- Cited
- Abdelnabi & Bagdasarian 2026
Updated 2026-06-29
When: You are scaling LLM inference capacity (adding GPUs, upgrading serving infrastructure, or paying for higher-throughput API tiers) because prefill latency or throughput is a bottleneck, and you have not yet measured how much of that prefill work is redundant across requests.

Use: Before adding hardware, sample a production window of prompts and measure prefix overlap across requests. Compute the fraction of tokens that appear identically in multiple requests (system prompts, retrieved documents, conversation history). If overlap is above 20%, evaluate whether a context-reuse middleware layer (reordering shared blocks to the prefix and deduplicating repeated content) can recover that capacity before committing to additional infrastructure spend.

Evidence: ContextPilot, a context index, reorder, and deduplicate middleware evaluated at MLSys 2026, achieved 4 to 12x KV-cache hit rate improvements and up to 3x prefill latency reduction versus prior SOTA serving on workloads with significant cross-request context overlap, including enterprise document QA and multi-turn memory chat. Gains were workload-dependent: high-overlap workloads saw the largest returns, while low-overlap workloads saw minimal benefit.
- LLM inference
- KV-cache
- Prefill optimization
- Cost efficiency
- Cited
- Jiang, Huang, Cheng, Deng, Sun & Mai 2026
Updated 2026-07-06
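The overlap measurement in this practice can be approximated at block granularity before reaching for a real tokenizer. The sketch below assumes each prompt has already been split into blocks (system prompt, each retrieved document, the user turn) and counts whitespace-separated words as a stand-in for tokens; `shared_token_fraction` is a hypothetical name, and this is not ContextPilot's index.

```python
from collections import Counter
from typing import List


def shared_token_fraction(prompts: List[List[str]]) -> float:
    """Fraction of tokens sitting in blocks that appear in two or more requests.

    Each prompt is a list of text blocks; words approximate tokens.
    """
    # Count how many distinct requests contain each block.
    requests_per_block: Counter = Counter()
    for blocks in prompts:
        for block in set(blocks):
            requests_per_block[block] += 1

    total = shared = 0
    for blocks in prompts:
        for block in blocks:
            n = len(block.split())
            total += n
            if requests_per_block[block] >= 2:
                shared += n
    return shared / total if total else 0.0
```

A result above the 0.20 mark named in the practice suggests evaluating context reuse first; swapping `block.split()` for the serving tokenizer would tighten the estimate.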
When: You are constructing LLM prompts that mix shared context (system prompt, retrieved documents, conversation history) with per-request variable content, and you are using a cache-aware inference engine such as vLLM, SGLang, or llama.cpp.

Use: Order shared context blocks before variable content in every prompt. Place the system prompt first, then retrieved documents in a consistent order, then per-request variable content last. Do not insert per-request tokens into the middle of shared context. A single token difference at any position in the shared prefix invalidates KV-cache reuse for everything that follows it.

Evidence: KV-cache reuse requires that shared content appears at identical token positions across requests. ContextPilot’s reorder component exists specifically because applications constructed prompts with variable content interspersed in shared blocks, preventing the inference engine’s cache from recognizing the shared prefix. The reorder pass, which moves shared blocks to a consistent front position, was responsible for the majority of the cache hit rate improvement measured across enterprise document QA and multi-turn chat workloads.
- LLM inference
- KV-cache
- Prompt construction
- Prefill optimization
- Cited
- Jiang, Huang, Cheng, Deng, Sun & Mai 2026
Updated 2026-07-06
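The ordering rule in this practice can be sketched as a prompt builder that fixes the position of every shared block. The code below is illustrative only: `build_prompt` and `common_prefix_len` are hypothetical names, sorting documents by content is just one way to get a consistent order, and prefix length is measured in characters as a rough proxy for tokens.

```python
from typing import List


def build_prompt(system: str, documents: List[str], variable: str) -> str:
    """System prompt first, documents in a stable order, per-request content last."""
    ordered_docs = sorted(documents)  # same order regardless of retrieval order
    return "\n\n".join([system, *ordered_docs, variable])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading span (characters, as a proxy for tokens)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Two requests that retrieve the same documents in a different order then share the whole prefix up to the variable part; without the sort, the shared span would end at the first document boundary.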
When: You are building a RAG pipeline to supply in-context demonstrations for training or fine-tuning a reasoning model (math, code, logic), and you are selecting demonstrations by semantic or lexical similarity to the target query.

Use: Replace similarity-based ranking with reasoning-utility-based ranking. Use a capable model as a judge to assess whether candidate demonstrations share transferable reasoning patterns with the target problem, not just topic overlap. Fine-tune a retriever on those utility-based labels via contrastive learning before using it to select training demonstrations.

Evidence: RA-RFT (Xiao, Ma et al., Rice/Meta 2026) reported gains over GRPO of +7.1 pp on AIME 2025 avg@32 for Qwen3-1.7B and +2.8 pp for Qwen3-4B when demonstrations are selected by reasoning-utility match instead of semantic similarity. The retrieval corpus (OpenR1-Math-220K from NuminaMath-1.5) is identical in both conditions; the only variable is how retrieval is done.
- Retrieval-augmented generation
- Reasoning
- Training data
- Reinforcement fine-tuning
- Cited
- Xiao, Ma, Chen et al. 2026
Updated 2026-07-13
When: You are optimizing a reinforcement fine-tuning pipeline for a reasoning model and are deciding which components to improve: reward function design, training curriculum, or demonstration retrieval quality.

Use: Treat retrieval quality as an independent variable from reward design and curriculum. Improvements on one axis do not interfere with the others. Develop and validate each component separately before combining. If your primary effort has been on reward shaping, upgrading to reasoning-aware retrieval is a distinct lever that can be added without revisiting the reward function or training schedule.

Evidence: Xiao, Ma et al. (2026) explicitly characterize reasoning-aware retrieval as orthogonal to reward design and curricula. RA-RFT gains (+4.1 pp and +2.6 pp average across four benchmarks for 1.7B and 4B models respectively) stack on top of GRPO without requiring changes to the reward function or training schedule. The authors tested both axes independently, confirming they are non-interfering.
- Reinforcement fine-tuning
- Training pipeline
- Retrieval-augmented generation
- Reasoning
- Cited
- Xiao, Ma, Chen et al. 2026
Updated 2026-07-13