TANDEMLY
Understanding AI innovation in tandem with AI
Humans + AI, Making Sense of AI
Tandemly is where we make sense of AI in public — synthesizing research, tracking the daily firehose of news, and distilling the practices that actually work. We also ship projects in tandem with AI: sometimes weird, sometimes useful, always a way of seeing what happens when humans and machines build together.
Projects
Cutthroat
Ruthless word game — steal letters and words from your opponents.
Gaming Red Pandas
Browser-based indie games — whimsical solo, local, and multiplayer panda adventures.
Research & Writing
Dual-Dimensional Consistency: Smarter Self-Consistency Sampling
Weight self-consistency votes by reasoning quality, prune weak paths early. Over 10x token reduction at matched accuracy across five benchmarks.
When Weak Models Beat the Frontier: The Boosting Connection
A nano-model committee hit 76.4% on SWE-bench Verified, matching flagship models. The framework explains exactly when and why committees of weak models work: the task needs a local verifier.
Is Grep All You Need? The Harness Matters More Than the Method
Four agent harnesses, same 116 questions, two retrieval methods. The framework around the search tool moved accuracy more than grep versus vector. Claude Code favors grep. Gemini CLI favors vector.
BoundaryRouter: Learning When to Escalate to an Agent
A training-free cold-start router builds experience memory from a seed set. 60.6% inference time reduction versus always-agent, 28.6% accuracy gain over always-LLM.
LaTER: Latent-Phase Reasoning Cuts Tokens 32% Without Losing Accuracy
A training-free two-phase method explores in latent space first, then switches to explicit chain-of-thought only when needed. 32% fewer tokens, better AIME accuracy.
ComplexMCP: Three Failure Modes in Large-Scale Tool Sandboxes
150+ interdependent MCP tools. Three named failure modes. The gap between benchmark performance and production behavior, measured.
STALE: When Agent Memory Becomes a Liability
A 1,200-query benchmark finds the best frontier model scores only 55.2% at detecting stale memories. Implicit conflict is where production agents silently break.
Meta-Harness: The 6x Gap Lives in Your Code, Not Your Model
Varying only the harness around a fixed model creates a 6x performance spread. Stanford and MIT built a system to automatically search for better harness code.
How Coding Agents Actually Perform in the Wild
110,000 open-source PRs, five agents. Code gets merged, but churns faster than human-written code over time.
Conversation Reduces Load. Images Build It.
A 124-person RCT found multimodal conversational AI produces better biology learning outcomes than text-only chat or semantic search.
Image Generation Diversity: When Models Miss the Map
No SOTA image generator covers more than 77% of its training distribution. FID can't detect the gap. IRS measures it. DiADM fixes it.
SLOW: The AI Tutor That Thinks Before It Speaks
A four-stage reasoning workspace that separates cognitive diagnosis from response generation.
Single-Agent LLMs vs Multi-Agent Systems: Equal-Budget Reasoning
Control for compute and the multi-agent advantage on multi-hop reasoning largely disappears. A synthesis of Tran and Kiela (2026).
LLMs in Games: When Generated Content Runs the Rules
Students embedded LLMs as architectural components in two games. Model errors turned into fairness violations.
Arknights: When the AI Lies, Players Learn
Deliberately unreliable AI guidance produces deeper understanding than transparency could.
The Research Is There. The Understanding Isn't.
24,000 AI papers a month, almost none reaching the people who could use them. Here's why the translation layer matters.
Vibe Coding: Flow, Trust, and Co-Creation
The first qualitative study of vibe coding reveals a paradigm built on flow and calibrated AI trust.
Games That Teach AI Ethics
Two multiplayer games use text-to-image AI to teach teens about bias in generative AI through play.
BAVT: Spend Less, Reason Better
Budget-Aware Value Trees cut AI agent costs by 75% with equal or better accuracy.
What is Vibe Coding?
The emerging practice of building software by describing what you want to an AI.
How to Showcase Your AI Projects
You built something with AI. Nobody can find it. Here's what we learned.
Humans dream it
Agents build it
We ship it together