tandemly.ai

LEARN

SYNTHESIZE

PRACTICE

Research, Briefings & Practices

TANDEMLY

Understanding AI innovation in tandem with AI

Humans + AI,
Making Sense of AI

Tandemly is where we make sense of AI in public — synthesizing research, tracking the daily firehose of news, and distilling the practices that actually work. We also ship projects in tandem with AI: sometimes weird, sometimes useful, always a way of seeing what happens when humans and machines build together.

Selected Work

Projects

01

Live

Cutthroat

Ruthless word game — steal letters and words from your opponents.

by Bumbo

Game, Strategy, Single Player

02

Live

Gaming Red Pandas

Browser-based indie games — whimsical solo, local, and multiplayer panda adventures.

by Gaming Red Pandas

Games, Browser, Multiplayer

Latest

Research & Writing

When Observation Masking Helps Your Search Agent, and When It Doesn't

A regime map from sweeping 4B-to-284B models across four search benchmarks: masking stale retrieved content peaks at +12.6pp when a strong retriever meets a mid-capacity model, and collapses when the model is already saturated. Don't mask by default; profile your regime first.

The Retriever That Matches by Method, Not Topic

Standard RAG retrieves by semantic similarity. For training reasoning models, that's the wrong signal. RA-RFT trains a retriever to rank by expected reasoning benefit, then fine-tunes via RLVR on analogous demonstrations. Up to +7.1 points on AIME 2025.

The Benchmark Score Is Also a Harness Score

Seven agent benchmarks, one standardized harness, 15 models, 400K rollouts. Scaffold choice and environment volatility move outcomes enough to reorder leaderboards. A unified framework that finally separates model effects from harness effects.

DeployBench: When Agents Say Done and Nothing Runs

Four frontier agents, 51 research artifact deployment tasks, clean environments. Best pass rate: 51%. The dominant failure in 97 of 154 cases: agents that stopped before checking whether anything actually ran.

Evaluate Agents Without Deploying Them

Testing a new agent means deploying it. ADWM breaks that constraint: train a diffusion world model on historical interaction logs, then simulate the environment for any candidate agent before it reaches users. Off-policy evaluation that actually works for LLM agents.

Agent Planning Benchmark: What's Actually Failing in Your Agent?

Most agent evals give you one score and no attribution. APB separates planning failure from execution failure across five settings and 22 domains, then adds a sixth test most teams skip: can the agent recognize when a task is impossible?

ContextPilot: The Prefill You Already Paid For

LLM applications send the same context blocks over and over. RAG pipelines re-attach the same documents. Agents re-inject full histories. ContextPilot intercepts this, reorders shared blocks into cache-reusable prefixes, and deduplicates the rest. Up to 3x faster prefill, 36% fewer tokens, no model changes.

Prompt Injection Has No Complete Defense. Here's Why.

UMass researchers derive an impossibility: any norm that blocks all injections also blocks flows a legitimate task needs. Data-instruction separation is a partial mitigation. Red-team with context-aware attacks, not jailbreak strings.

Beyond Consensus: What Votes Throw Away, Traces Preserve

In mixture-of-agents stacks, majority voting discards the most useful signal: correct intermediate steps buried in minority chains. A trace-reading aggregator recovers correct answers even when every agent voted wrong.

VibeSearchBench: When Queries Get Vague, Agents Fall Apart

A benchmark for how people actually search: vaguely, over multiple turns, revealing intent piece by piece. Best frontier model: 30.3 Triplet F1. Good clarifying questions matter more than retrieval method.

MUSE-Autoskill: The Skill Lifecycle Agents Were Missing

Agent skill libraries don't improve themselves. MUSE-Autoskill builds a five-component lifecycle that lets them improve, including steps to create skills from task successes, gate admission with unit tests, refine from new examples, and retire underperformers. Auto-generated skills beat the human-skill ceiling on tasks where generation succeeded.

Is Agent Memory a Database? The Missing Data-Foundations Layer

Vector stores can ingest and retrieve. They cannot revise. Concordia researchers formalize what agent memory actually needs, four operators (ingest, revise, forget, retrieve) in place of CRUD, and argue that the missing revise operator is the root cause of most memory failures.

Compute Where it Counts: Per-Token Efficiency

A tiny policy network reads what the model is thinking at each decode step and sets how hard to run that step. 7.3 MMLU points gained over uniform compression at matched compute.

Interpretability Can Be Actionable: A Rubric for the Field

Two criteria for any interpretability finding: does it specify what to change, and has the change been validated? A position paper arguing the field has optimized for the first criterion and underweighted the second.

Remembering More, Risking More: Memory Accumulates Safety Risk

Eight memory architectures, three deployment scenarios: safety violation rates climbed monotonically as memory grew. The same agent with more history produced more violations on identical probe tasks. Risk is detectable before generation.

δ-mem: Adding Persistent Memory to Any Frozen LLM

A compact 8×8 state matrix compresses past context and injects memory corrections into attention at decode time, reaching 1.31× the backbone's result on MemoryAgentBench. The backbone stays frozen throughout.

AgentTrust: A Runtime Safety Layer for Every Tool Call

An 8-component interceptor evaluates every agent tool call before it executes. Deobfuscates shell payloads, tracks multi-step attack chains, proposes safer alternatives via SafeFix. 95.0% verdict accuracy at millisecond latency.

Automated Interpretability: Two Agent Loops, One Autonomous Pipeline

A discovery agent navigates the activation graph to find which features are worth examining. An explanation agent probes and refines hypotheses through contrastive testing. Beats one-shot auto-interp baselines on Gemma-2.

AutoTTS: When LLMs Discover Their Own Reasoning Strategies

A coding agent found a controller that cuts inference tokens by 70% versus running 64 parallel samples. The full discovery run cost $39.90 and took 160 minutes.

Dual-Dimensional Consistency: Smarter Self-Consistency Sampling

Weight self-consistency votes by reasoning quality and prune weak paths early. Over 10x token reduction at matched accuracy across five benchmarks.

When Weak Models Beat the Frontier: The Boosting Connection

A nano-model committee hit 76.4% on SWE-bench Verified, matching flagship models. A framework built on the connection to boosting explains when and why committees of weak models work: the task needs a local verifier.

Is Grep All You Need? The Harness Matters More Than the Method

Four agent harnesses, same 116 questions, two retrieval methods. The harness around the search tool moved accuracy more than the choice between grep and vector retrieval. Claude Code favors grep. Gemini CLI favors vector.

BoundaryRouter: Learning When to Escalate to an Agent

A training-free cold-start router builds experience memory from a seed set. 60.6% inference time reduction versus always-agent, 28.6% accuracy gain over always-LLM.

LaTER: Latent-Phase Reasoning Cuts Tokens 32% Without Losing Accuracy

A training-free two-phase method explores in latent space first, then switches to explicit chain-of-thought only when needed. 32% fewer tokens, better AIME accuracy.

ComplexMCP: Three Failure Modes in Large-Scale Tool Sandboxes

150+ interdependent MCP tools. Three named failure modes. The gap between benchmark performance and production behavior, measured.

STALE: When Agent Memory Becomes a Liability

A 1,200-query benchmark finds the best frontier model scores only 55.2% at detecting stale memories. Implicit conflict is where production agents silently break.

Meta-Harness: The 6x Gap Lives in Your Code, Not Your Model

Varying only the harness around a fixed model creates a 6x performance spread. Stanford and MIT built a system to automatically search for better harness code.

How Coding Agents Actually Perform in the Wild

110,000 open-source PRs, five agents. Code gets merged, but churns faster than human-written code over time.

Conversation Reduces Load. Images Build It.

A 124-person RCT found multimodal conversational AI produces better biology learning outcomes than text-only chat or semantic search.

Image Generation Diversity: When Models Miss the Map

No SOTA image generator covers more than 77% of its training distribution. FID can't detect the gap. IRS measures it. DiADM fixes it.

SLOW: The AI Tutor That Thinks Before It Speaks

A four-stage reasoning workspace that separates cognitive diagnosis from response generation.

Single-Agent LLMs vs Multi-Agent Systems: Equal-Budget Reasoning

Control for compute and the multi-agent advantage on multi-hop reasoning largely disappears. A synthesis of Tran and Kiela (2026).

LLMs in Games: When Generated Content Runs the Rules

Students embedded LLMs as architectural components in two games. Model errors turned into fairness violations.

Arknights: When the AI Lies, Players Learn

Deliberately unreliable AI guidance produces deeper understanding than transparency could.

The Research Is There. The Understanding Isn't.

24,000 AI papers a month, almost none reaching the people who could use them. Here's why the translation layer matters.

Vibe Coding: Flow, Trust, and Co-Creation

The first qualitative study of vibe coding reveals a paradigm built on flow and calibrated AI trust.

Games That Teach AI Ethics

Two multiplayer games use text-to-image AI to teach teens about bias in generative AI through play.

BAVT: Spend Less, Reason Better

Budget-Aware Value Trees cut AI agent costs by 75% with equal or better accuracy.

What is Vibe Coding?

The emerging practice of building software by describing what you want to an AI.

How to Showcase Your AI Projects

You built something with AI. Nobody can find it. Here's what we learned.

Humans dream it
Agents build it
We ship it together

Get in Touch

hello@tandemly.ai

Collaborations & Inquiries