1. Opening: The Anthropomorphic Reflex
A growing trend in AI‑native projects treats large language models as narrowly‑scoped "employees" inside rigid multi‑agent architectures — writer, reviewer, execution‑logger, statistician — each with artificial permissions and strictly isolated contexts. The pattern is intuitive: it mirrors how we scale human intelligence, by division of labour. It is also, in many cases, the wrong reflex.
This article argues that for tasks that fit inside a single model's effective context window, a single, tool‑augmented agent that iterates with real‑world feedback tends to be structurally superior to a fragmented pipeline of role‑bound agents. The rebuttal is not absolute — there are legitimate reasons to fragment — but the default has been miscalibrated. The current default copies human org charts onto an agent whose nature does not warrant them.
The load‑bearing argument is information‑theoretic (§4). The bookkeeping equations in §5 are illustrative, not a proof; they are how to talk about the asymmetry, not how to establish it.
The thesis aligns naturally with the ANSELM stance: build conversations, not committees.
2. Where the Pattern Comes From, and Why It Misleads
The "AI micro‑corporation" inherits two assumptions from human organisations:
- Specialists outperform generalists because human cognitive bandwidth is narrow.
- Parallel contests beat serial work because humans iterate slowly.
Neither assumption transfers cleanly to a modern LLM. A general‑purpose model already contains the integrated knowledge that role‑bound agents can only access after costly recombination, and it iterates orders of magnitude faster than any human team. When the same underlying model is wrapped into multiple "roles," the resulting structure inherits the coordination cost of a human org without inheriting any of its cognitive diversity.
Hammond et al. (2026), in their taxonomy of multi‑agent risks, name several failure modes — miscoordination from information asymmetries, network‑effect error propagation, collusion, emergent agency — that arise even among genuinely independent agents. Our concern is narrower and sharper: when fragmentation is unnecessary, it imports those failure modes for no compensating benefit.
3. Accumulation vs. Averaging — An Intuition Pump
Before formalising anything, the core intuition is worth stating plainly.
- A single iterative agent keeps every signal — partial drafts, tool outputs, error messages, half‑formed hypotheses — inside one continuous context. Each iteration adds to what the model already knows about this specific problem.
- A fragmented pipeline breaks that context at every hand‑off. Each downstream agent sees only a summary of what came before. Information that does not survive the summary is gone.
The first regime accumulates. The second, at best, averages — and it averages over impoverished views of the problem. "At best" matters: in degenerate cases (a critical signal lost at a single hand‑off) the fragmented system can underperform any single agent in it. The structural claim is asymmetry of information flow, not a uniform performance gap.
This is not a theorem; it is a description of where the information lives. The next two sections give it an information‑theoretic footing and a practical bookkeeping form.
4. An Information‑Theoretic Sketch
Let $X$ denote the ground‑truth task (the problem and its full feedback environment), and $Y$ denote the artifact ultimately produced (code, spec, decision). We care about the mutual information $I(X;Y)$ — how much of the task the final artifact actually captures.
Let $S_i$ be the internal state of the $i$‑th processing stage (an iteration, or an agent in a pipeline). Two regimes can be distinguished.
Iterative regime (single context). All stages share one state that grows monotonically:
$$ S_{i+1} = S_i \cup \Delta_i, $$
where $\Delta_i$ is the new evidence acquired at step $i$ (a tool result, a self‑critique, an environment signal). Because no information is discarded between steps, $I(X;S_{i+1}) \geq I(X;S_i)$. Convergence is bounded only by context capacity and by saturation of useful evidence.
Pipelined regime (hand‑offs). Each agent $A_k$ sees only a summary $T_k = f_k(S_k)$ of the prior state, where $f_k$ is a lossy summarisation function. Because $T_k$ depends on $X$ only through $S_k$ (a Markov chain $X \to S_k \to T_k$), the data processing inequality gives
$$ I(X; T_k) \leq I(X; S_k). $$
The inequality is strict whenever $f_k$ discards anything that is informative about $X$, and in practice it almost always does, because hand‑offs are summarisation bottlenecks, permission filters, or schema projections (what we call permission theatre: filtering done for organisational tidiness rather than for real safety).
Two consequences follow without needing precise numbers:
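- Hand‑off loss compounds. Along the chain $X \to T_1 \to \dots \to T_n \to Y$, repeated application of the inequality gives $I(X;Y) \leq \min_k I(X;T_k)$: whatever one interface discards, no later stage can restore, so each additional lossy $f_k$ can only lower the ceiling.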
- Parallel ensembles do not recover the loss. Aggregating $m$ parallel agents whose individual states are all bounded by the same lossy view cannot exceed that bound; the aggregator inherits the ceiling of its inputs.
This is why the iterative single‑agent regime can keep climbing while pipelined and ensemble regimes hit a structural ceiling set by their narrowest interface.
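The bound can be checked by hand on a toy case. The following is a minimal sketch, not part of the experiment: it assumes $X$ is uniform over eight values, lets the single‑context state keep $X$ whole, and models each hand‑off with a hypothetical summariser `handoff` that drops the lowest bit. `mutual_information` and `handoff` are illustrative names, not anything from the repository.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits, treating the listed (a, b) pairs as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in joint.items()
    )


def handoff(state):
    """Hypothetical lossy summary f_k: drop the lowest bit of the state."""
    return state // 2


xs = list(range(8))  # assumption: X uniform over 8 values, i.e. 3 bits

# Single context: the state still contains X in full.
i_full = mutual_information([(x, x) for x in xs])
# One and two lossy hand-offs.
i_one = mutual_information([(x, handoff(x)) for x in xs])
i_two = mutual_information([(x, handoff(handoff(x))) for x in xs])
# Five parallel "agents" that all hold the same once-summarised view.
# (Sampling noise independent of X would not add information either.)
i_ensemble = mutual_information([(x, (handoff(x),) * 5) for x in xs])

print(i_full, i_one, i_two, i_ensemble)
```

Under these assumptions the full state carries 3 bits about $X$, one hand‑off leaves 2, two leave 1, and five identical copies of the once‑summarised view still carry only 2.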
5. A Practitioner's Bookkeeping (Intuition Pump, Not Theorem)
A friendlier way to keep score, useful for design discussions even though it is not a formal proof:
- $K_0$: the model's pretrained baseline.
- $E_i$: insight produced at step $i$ (in the multi‑agent case, necessarily partial — limited by the agent's narrow scope).
- $C_i$: redundancy with what is already known, so the net gain is $E_i - C_i$.
- $\gamma_i \in [0,1]$: the hand‑off discount — the fraction of an agent's insight that survives summarisation, permission filtering, or schema projection on its way to the next stage.
For a single iterative agent in one continuous context:
$$ V_{\text{iter}} = K_0 + \sum_{i=1}^{n} (E_i - C_i). $$
For a one‑shot parallel ensemble of $n$ role‑bound agents whose outputs are merged by averaging (e.g., an LLM judge or summariser):
$$ V_{\text{multi}}^{(n)} \approx K_0 + \frac{1}{n}\sum_{i=1}^{n} \gamma_i E_i. $$
For a sequential pipeline the analogue is $K_0 + \sum_{i=1}^{n} \big(\prod_{j \le i} \gamma_j\big) E_i$, with hand‑off discounts compounding.
For an ensemble of pipelines — $m$ independent chains feeding an argmax‑style aggregator (the architecture real multi‑agent frameworks most often deploy) — the bookkeeping is a max over chains rather than an average:
$$ V_{\text{mvop}}^{(m)} \approx K_0 + \max_{k \le m}\Big[\Big(\prod_{j \le n}\gamma_{jk}\Big)\,E_k\Big]. $$
This sits strictly between the pipeline and the flat parallel ensemble: voting can rescue the luckiest chain, but cannot exceed the ceiling set by the lossiest interface inside that chain.
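The four expressions are easy to evaluate side by side. The sketch below transcribes them into Python; the function names and every number passed in are placeholders chosen for illustration, not measurements, so no ordering should be read off the output.

```python
from math import prod


def v_iter(k0, gains, redundancy):
    """Single iterative agent: K0 + sum(E_i - C_i)."""
    return k0 + sum(e - c for e, c in zip(gains, redundancy))


def v_parallel(k0, gains, gammas):
    """One-shot parallel ensemble merged by averaging: K0 + (1/n) sum(gamma_i * E_i)."""
    return k0 + sum(g * e for g, e in zip(gammas, gains)) / len(gains)


def v_pipeline(k0, gains, gammas):
    """Sequential pipeline: K0 + sum_i (prod_{j<=i} gamma_j) * E_i."""
    total, surviving = k0, 1.0
    for g, e in zip(gammas, gains):
        surviving *= g          # discounts compound down the chain
        total += surviving * e
    return total


def v_vote_over_pipes(k0, chain_gains, chain_gammas):
    """Vote over m chains: K0 + max_k (prod_j gamma_jk) * E_k."""
    return k0 + max(prod(gs) * e for e, gs in zip(chain_gains, chain_gammas))


# Placeholder inputs (assumptions, not data from the experiment).
K0 = 10.0
gains = [4.0, 4.0, 4.0, 4.0]
redundancy = [1.0, 1.0, 1.0, 1.0]
gammas = [0.5, 0.5, 0.5, 0.5]

print("iter    ", v_iter(K0, gains, redundancy))
print("parallel", v_parallel(K0, gains, gammas))
print("pipeline", v_pipeline(K0, gains, gammas))
print("mvop    ", v_vote_over_pipes(
    K0, [4.0, 4.0, 4.0],
    [[0.5] * 4, [0.9, 0.9, 0.5, 0.9], [0.7] * 4],
))
```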
These expressions are intuition pumps. The load‑bearing claim is the data‑processing argument in §4; the bookkeeping above is just a memorable way to talk about it. One honest limit: the $\gamma_i$ are written as smooth scalars, but in practice each constraint either survives a hand‑off or is lost entirely. The smooth form captures the mean trend; it understates how violently the per‑run variance explodes when a single critical signal happens to die at one interface — a feature the experiment in §10 makes very visible.
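The gap between the smooth scalar and the all‑or‑nothing reality can be made concrete with a small simulation. This is a sketch under stated assumptions: a single critical constraint, four hand‑offs, and an independent survival probability of 0.8 per hand‑off. All three values are hypothetical, as is the helper name `signal_survives`.

```python
import random
from statistics import mean, pstdev


def signal_survives(n_handoffs, gamma, rng):
    """Return 1.0 if one critical signal survives every hand-off, else 0.0."""
    return 1.0 if all(rng.random() < gamma for _ in range(n_handoffs)) else 0.0


rng = random.Random(0)                 # fixed seed so the run is repeatable
N_HANDOFFS, GAMMA = 4, 0.8             # hypothetical values, not measured

runs = [signal_survives(N_HANDOFFS, GAMMA, rng) for _ in range(10_000)]

smooth = GAMMA ** N_HANDOFFS           # what the scalar bookkeeping reports
print("smooth discount:", smooth)
print("simulated mean: ", mean(runs))
print("per-run spread: ", pstdev(runs))
```

The simulated mean lands close to the smooth product, but every individual run is either 0 or 1, which is the per‑run variance the scalar form hides.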
A note on the choice of baseline. We deliberately model the parallel‑averaging case rather than the sequential pipeline because it is the more optimistic fragmentation baseline: averaging independent insights is, in principle, less lossy than compounding hand‑off discounts down a chain. If iteration beats parallel averaging, it beats sequential pipelines a fortiori.
6. Three Architectures, Cleanly Separated
A frequent confusion in this debate is treating "multi‑agent" as one thing. It is at least three.
6.1 Iterative single agent
One model, one continuous context, real tools. The agent drafts, executes, observes, critiques itself, revises. This is the regime that accumulates. ANSELM's "co‑pilot" sits here. Note that multi‑agent debate patterns — where a single model is prompted to argue against itself, or to adopt opposing perspectives within one conversation — are iterative in this sense, not pipelined: they share context and accumulate.
6.2 Sequential pipelines
Writer → Reviewer → Reviser, with each stage seeing only the prior stage's output (or worse, a summary of it). This is the regime where the data processing inequality bites hardest, because every hand‑off is a lossy bottleneck and the chain is serial.
A pipeline can still be useful when the stages genuinely require different capabilities (a small cheap model triages, an expensive model reasons), or when auditability demands a separation between actor and critic. But absent those reasons, a pipeline of identical models is mostly a context‑destruction machine.
6.3 Parallel ensembles ("contests")
Multiple agents run in parallel; an aggregator picks or merges. This pattern works for humans because human experts have genuinely diverse base knowledge $K_0$ and iterate slowly. With instances of the same model, $K_0$ is shared up to sampling noise, and serial iteration is fast enough to dominate. Same‑model contests therefore harvest noise, not diversity.
A genuinely heterogeneous ensemble (e.g., distinct model families with complementary biases) can still earn its keep. Even there, though, a sequential use of diverse models inside a single conversation typically beats a one‑shot parallel vote, because each model gets to see what the others actually said rather than only the aggregator's verdict.
When agents are different fine‑tuned models, $K_0$ values can diverge meaningfully and some of the human‑contest logic partially applies. The iteration‑speed advantage remains, however, and per‑role fine‑tuning is rarely the actual practice in the micro‑corporation pattern — it is usually one base model with different system prompts.
7. When Fragmentation Is Genuinely Justified
The honest design rule is not "never fragment." It is fragment only when something forces you to. Legitimate forcing functions include:
- Security and trust boundaries. PII scrubbing, untrusted‑input sanitisation, or enforcement of capability limits where an isolated context is the point.
- Context‑window saturation. When a task genuinely outgrows the model's effective context, controlled offloading to a new session is a fallback — not a starting architecture.
- Cost and latency shaping. Cheap models for triage, expensive models for reasoning; embarrassingly parallel sub‑tasks that are truly independent.
- Auditability and accountability. Separating actor from critic so the rationale is logged through an interface, not buried in a single transcript.
- Heterogeneous model diversity. Distinct model families used deliberately for their different priors.
- Tool‑shaped "agents." A SQL executor, a sandboxed code runner, or a search wrapper is a deterministic tool, not a cognitive agent. Functional tool separation is healthy; the present critique applies only to role‑based cognitive fragmentation.
None of these is "because that is how a human team would do it." Each is a concrete property of the deployment, not an organisational metaphor.
Common counter‑arguments, briefly addressed
- "Large tasks require decomposition." True when a task exceeds the context window — that is one of the forcing functions above. The argument here targets chosen decomposition that mirrors human job titles for tasks that do fit in one window.
- "Multi‑agent debate improves reasoning." Recent work — notably Du et al. (2023) on multi‑agent debate, and the broader self‑critique / society‑of‑minds line — reports that having a model argue against itself in named roles ("proposer", "skeptic", "judge") reduces bias and improves answers. This is consistent with the present article, not against it. The mechanism that does the work in those papers is that every step sees the full prior transcript, not a summary of it: the personas are serial perspectives adopted inside one continuous context, with no $f_k$ in the data‑processing‑inequality sense. In the vocabulary of §6, that is iterative single‑agent reasoning under a different label. Break the shared transcript — run proposer, skeptic, and judge as isolated sessions exchanging only conclusions — and the same setups empirically degrade, which is itself evidence for the structural claim in §4.
8. The Design Rule
Default to a single, tool‑augmented agent in a continuous context. Fragment only when context, security, cost, latency, auditability, or genuine model heterogeneity force you to — and design the hand‑off to lose as little as possible when you do.
This is a heuristic, not a theorem. Its strength comes from the structural asymmetry between accumulation and lossy hand‑off, not from any specific equation.
9. Reconciling with ANSELM's Living Digital Thread
A fair objection arises here. ANSELM's manifesto values disposable views, open formats, and a living digital thread — the opposite of a single opaque conversation transcript. How can "keep the conversation whole" coexist with "the digital thread must be alive"?
The reconciliation is that the conversation is the workshop, not the archive. The single‑agent iterative loop is where reasoning happens with maximum information density. What ANSELM asks of that loop is that its outputs be continuously externalised as Knowledge Cells, decisions, rationale, and queryable artifacts — exactly the open, human‑readable formats the manifesto calls for. The conversation accumulates; the ecosystem persists.
In other words: keep the conversation whole during reasoning, and crystallise its conclusions into the digital thread. The single agent is not an alternative to the thread — it is the cleanest way to feed it.
10. Empirical Evidence
The argument predicts a measurable difference. The example is deliberately drawn from enterprise architecture rather than from software, because the multi‑agent failure mode is most visible in domains the ANSELM audience already lives in — and because the literal mechanism of failure in fragmented BPR engagements is the data‑processing inequality from §4.
10.1 Task
The full target is to redesign the order‑to‑cash process for a mid‑sized B2B distributor moving from an EDI‑only channel to a mixed EDI + self‑service portal channel, under an explicit constraint set:
- SOX‑compliant segregation of duties (no single role spans incompatible activities such as order entry and credit‑limit approval).
- Days Sales Outstanding ≤ 45 days.
- Credit‑check SLA ≤ 4 hours for new accounts.
- No manual re‑keying between systems.
- GDPR‑lawful retention of customer and transaction data.
- Reuse of the existing SAP S/4 instance; no new core ERP.
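To show what a machine‑checkable constraint looks like, here is a minimal sketch of the segregation‑of‑duties rule. It assumes each process step is a dict with `role` and `activity` keys; that shape, the activity names, and the function `sod_violations` are illustrative assumptions, not the experiment's actual schema or checker.

```python
from collections import defaultdict

# Incompatible activity pairs. Only this one pair is named in the brief;
# the identifiers themselves are hypothetical.
INCOMPATIBLE = [("order_entry", "credit_limit_approval")]


def sod_violations(steps, incompatible=INCOMPATIBLE):
    """Return (role, activity_a, activity_b) for every role spanning an incompatible pair."""
    activities_by_role = defaultdict(set)
    for step in steps:
        activities_by_role[step["role"]].add(step["activity"])
    return [
        (role, a, b)
        for role, activities in sorted(activities_by_role.items())
        for a, b in incompatible
        if a in activities and b in activities
    ]


example_steps = [
    {"role": "sales_clerk", "activity": "order_entry"},
    {"role": "sales_clerk", "activity": "credit_limit_approval"},  # breaks the rule
    {"role": "credit_manager", "activity": "credit_limit_approval"},
]
print(sod_violations(example_steps))
```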
The deliverable is a process description (text + BPMN‑equivalent flow) covering the happy path and named edge cases: disputed invoices, partial shipments, returns crossing month‑end close, credit‑hold release authority, and channel reconciliation between EDI and portal orders.
For Phase 1 (reported below) we used a representative sub‑process of this brief — the credit‑hold release loop — keeping all six constraint families (segregation of duties, SLA bound, audit trail, GDPR retention, system‑of‑record reuse, and structural flow completeness). This keeps each run cheap enough to repeat n=5 times per architecture while still exercising every checker family. Scaling to the full order‑to‑cash brief is Phase 2.
10.2 Architectures (all using the same base model)
The article's claim is about hand‑off structure, so the four architectures we test span the structural space cleanly:
- ITER. One agent in a continuous context, with a constraint checker as a tool. Loop: draft → check → analyse violations → revise (max 5 iterations). 0 hand‑offs.
- MULTI‑VOTE‑FLAT. m=5 parallel modellers given the same brief at temperature 1.0 with distinct seeds. Aggregator picks one by self‑report. 1 hand‑off (vote only); each candidate sees the full brief.
- MULTI‑VOTE‑OVER‑PIPE. m=5 parallel MULTI‑PIPE chains, then an aggregator picks. 4 hand‑offs per candidate, plus a final vote. This is the steel‑man for real multi‑agent frameworks (CrewAI, AutoGen): parallel teams with internal role separation.
- MULTI‑PIPE. Discovery → Modeller (sees only Discovery's summary) → Reviewer (sees only the model and a violations summary) → Planner (sees only the approved model). 4 sequential hand‑offs; each lossy.
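The ITER loop (draft, check, analyse violations, revise) can be sketched in a few lines. This is an illustration under assumptions, not the runner's code: `draft` stands in for the model call and `check` for the constraint checker, and both are replaced here by hypothetical stubs so the example runs on its own.

```python
from typing import Callable, List, Tuple


def iterate(
    brief: str,
    draft: Callable[[List[str]], str],
    check: Callable[[str], List[str]],
    max_iterations: int = 5,
) -> Tuple[str, List[str], List[str]]:
    """Draft once, then revise up to max_iterations times inside one growing context."""
    context = [brief]                     # nothing is summarised away between steps
    candidate = draft(context)
    violations = check(candidate)
    for _ in range(max_iterations):
        if not violations:
            break
        context.append(candidate)         # keep the failed draft visible
        context.append("violations: " + "; ".join(violations))
        candidate = draft(context)
        violations = check(candidate)
    return candidate, violations, context


# --- Hypothetical stubs standing in for the LLM and the checker ---
REQUIRED = ["sod", "sla", "audit"]


def stub_draft(context: List[str]) -> str:
    # Pretend the model satisfies one more rule per round of feedback it has seen.
    rounds = sum(1 for entry in context if entry.startswith("violations:"))
    return ",".join(REQUIRED[: rounds + 1])


def stub_check(candidate: str) -> List[str]:
    present = set(candidate.split(","))
    return [rule for rule in REQUIRED if rule not in present]


final, remaining, context = iterate("brief", stub_draft, stub_check)
print(final, remaining, len(context))
```

The design point is that `context` only ever grows: each revision sees the original brief, every earlier draft, and every violation list, which is the zero‑hand‑off property the architecture is meant to test.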
10.3 Metrics
- Constraint‑violation count at the final deliverable (objective; the constraint checker is the oracle for the six constraint families).
- Residual ambiguity: count of underspecified hand‑offs and decision points in the produced process (every decision must name the role, the data, and the outcome paths).
- Information loss at each interface: BERTScore between the raw upstream content and the summary actually passed downstream. Measured at every hand‑off in the pipeline architectures; trivially zero in ITER.
- Total token consumption.
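The residual‑ambiguity metric is the least standard of the four, so a sketch may help. It assumes each decision point is a dict with `role`, `data`, and `outcomes` fields, and it treats fewer than two outcome paths as underspecified; both the field names and that threshold are assumptions made for illustration.

```python
def residual_ambiguity(decisions):
    """Count decision points that fail to name a role, the data, or the outcome paths."""
    count = 0
    for decision in decisions:
        has_role = bool(decision.get("role"))
        has_data = bool(decision.get("data"))
        # Assumption: a decision needs at least two outcome paths to be a decision.
        has_outcomes = len(decision.get("outcomes", [])) >= 2
        if not (has_role and has_data and has_outcomes):
            count += 1
    return count


example_decisions = [
    {"role": "credit_manager", "data": ["credit_limit"], "outcomes": ["release", "reject"]},
    {"role": "", "data": ["order_total"], "outcomes": ["release", "escalate"]},  # no role
    {"role": "controller", "data": ["aging_report"], "outcomes": ["approve"]},   # one path
]
print(residual_ambiguity(example_decisions))
```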
10.4 Results (n=5 per architecture, gpt‑4o‑mini‑2024‑07‑18, temperature 0 except where noted)
| Architecture | Hand‑offs | Violations (mean ± std) | Range |
|---|---|---|---|
| ITER | 0 | 0.0 ± 0.0 | 0–0 |
| MULTI‑VOTE‑OVER‑PIPE | 4 + vote | 1.4 ± 0.9 | 1–3 |
| MULTI‑VOTE‑FLAT | 1 (vote) | 2.0 ± 0.0 | 2–2 |
| MULTI‑PIPE | 4 (chain) | 17.0 ± 20.1 | 1–39 |
Three observations are load‑bearing.
1. Violations track hand‑off count and structure, not "amount of fragmentation." ITER (zero hand‑offs) hits the constraint‑satisfaction ceiling on every run. MULTI‑VOTE‑FLAT — one final hand‑off, but each candidate sees the brief whole — sits at a low, uniform ceiling. MULTI‑PIPE — four sequential hand‑offs with no recovery mechanism — collapses with mean 17 and a 1‑to‑39 range, exactly the catastrophic variance the data‑processing inequality predicts when a critical signal happens to die at one of the lossy interfaces.
2. Same‑model voting harvests noise, not diversity — exactly as §6.3 predicts. MULTI‑VOTE‑FLAT's five candidates were genuinely distinct samples (different SHA‑256 hashes, different step ordering, different role names) yet every one of them produced exactly two violations. That uniformity is the cleanest possible empirical signature of "candidates differ in surface form but not in their relationship to the underlying constraint set." A real ensemble — say, distinct model families — would be expected to produce different violation profiles.
3. Voting can partially compensate for fragmentation, but cannot recover full ITER quality. MULTI‑VOTE‑OVER‑PIPE (1.4 ± 0.9) does beat MULTI‑VOTE‑FLAT (2.0): given five fragmented pipes, the aggregator can usually find one candidate that escaped the worst hand‑off losses. But the floor is still strictly above ITER, and the variance climbs back up. This matters: it means real multi‑agent frameworks (which roughly correspond to MULTI‑VOTE‑OVER‑PIPE) are not free of the data‑processing penalty — they merely soften it by sampling, at considerable token cost.
10.5 An honest inversion
This section originally pre‑registered the prediction ITER < MULTI‑PIPE < MULTI‑VOTE (fewer violations to the left). The data forced an inversion to ITER < MULTI‑VOTE‑OVER‑PIPE (MVOP) < MULTI‑VOTE‑FLAT < MULTI‑PIPE. The misordering came from a coarse mapping between architecture labels and hand‑off count: a one‑shot parallel vote was tacitly grouped with "more multi‑agent than a pipeline," when in hand‑off terms it has strictly fewer lossy interfaces (each candidate sees the brief whole; only the final pick is a hand‑off). The corrected ordering tracks the underlying axis the article has argued for from the start, namely the number and structure of lossy hand‑offs, and the data tracks it closely.
10.6 Caveats
- Phase 1 is small. One sub‑process (credit‑hold release), one model family (`gpt-4o-mini`), n=5 per architecture. The structural signal is large enough to read through that (ITER's perfect score and MULTI‑PIPE's 1‑to‑39 range are unambiguous), but the magnitude of the gap on the full order‑to‑cash brief is open. Phase 2 scales the brief and adds at least one heterogeneous model family (a genuine $K_0$ comparison rather than sampling jitter).
- The oracle is partial. The constraint checker is objective for the six rules listed. Real BPR has dimensions (operational feasibility, organisational fit, change‑management cost) that no checker captures. The experiment therefore tests the article's structural claim about information flow, not the broader claim that ITER produces a "better" redesign in every sense. Constraint satisfaction is the backbone; quality remains a secondary, judgment‑based signal we have not yet measured.
- Same‑model is the deliberately weakest test of voting. A genuinely heterogeneous ensemble (different fine‑tuned models, different families) might do better than MVOP did here. Phase 2 will test this directly. The Phase 1 result rules out the cheap version of multi‑agent frameworks — same model, different system prompts — which is what most production deployments actually run.
10.7 Reproducibility & artifacts
The experiment is intentionally small enough to read end‑to‑end. Four moving parts do all the work:
- Brief (`briefs/credit_hold_release.yaml`): the constraint set the architecture must satisfy, written once and shared by all four architectures so the only independent variable is hand‑off structure.
- Process schema (`schemas/process.schema.json`): a fixed JSON Schema with a closed tag vocabulary (`creates_order`, `raises_hold`, `reviews_hold`, `releases_hold`, `rejects_hold`, `escalates`, `closes_case`). Every architecture writes its final deliverable against the same schema, which is what makes the constraint checker comparable across runs.
- Constraint checker (`src/anselm_experiment/checker.py`): a deterministic oracle for the six constraint families plus structural flow‑completeness (every `next` reference resolves; every branch reachable; no orphan steps). No LLM in the loop. This is the ground truth for the violations column in §10.4.
- Runner (`src/anselm_experiment/runner.py`): dispatches the four architectures, writes a per‑run log directory, and persists `summary.json` after every completed run so a cancelled sweep never wipes prior data.
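The structural flow‑completeness rule is simple enough to sketch. The version below is an assumption about how such a check could look, not the contents of `checker.py`: it models a process as a dict from step id to the list of `next` step ids, reuses the tag vocabulary as step ids purely for readability, and reports dangling references and unreachable (orphan) steps.

```python
def flow_problems(steps, start):
    """Return (dangling next-references, steps unreachable from start)."""
    dangling = sorted(
        (step_id, nxt)
        for step_id, nexts in steps.items()
        for nxt in nexts
        if nxt not in steps
    )
    # Depth-first walk from the start step to find everything reachable.
    seen, stack = set(), [start]
    while stack:
        step_id = stack.pop()
        if step_id in seen or step_id not in steps:
            continue
        seen.add(step_id)
        stack.extend(steps[step_id])
    unreachable = sorted(set(steps) - seen)
    return dangling, unreachable


# Hypothetical credit-hold release flow.
flow = {
    "creates_order": ["raises_hold"],
    "raises_hold": ["reviews_hold"],
    "reviews_hold": ["releases_hold", "rejects_hold", "escalates"],
    "escalates": ["reviews_hold"],
    "releases_hold": ["closes_case"],
    "rejects_hold": ["closes_case"],
    "closes_case": [],
}
print(flow_problems(flow, "creates_order"))
```

Because the check is a plain graph walk, it stays deterministic and needs no model call, which is the property the oracle relies on.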
Every run leaves a forensic trail on disk: each LLM call is dumped as call_NNNN_<role>.json (full request, full response, latency, token counts); each pipeline hand‑off writes the upstream payload and the downstream summary side‑by‑side, so the BERTScore in §10.3 is computable post‑hoc rather than baked into the run; the final deliverable lands as result.json; violations land as a structured list with the offending step and the rule it broke. MVOP runs additionally keep one sub‑directory per candidate chain (pipe_0/ … pipe_4/) plus the aggregator's mvop_summary.json, so the "which candidate did the vote pick, and why" question is answerable without re‑running anything.
The knobs that matter for replication: pinned model snapshot gpt‑4o‑mini‑2024‑07‑18 for every architecture (so $K_0$ in §5 is truly held constant); temperature 0 everywhere except vote candidates (1.0 with seeds 1000+k for MV‑flat, 2000+k for MVOP — distinct seed bands so the two voting architectures don't accidentally share samples); ITER capped at 5 revision iterations.
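The seed bands can be written down explicitly. A minimal sketch, assuming the candidate index `k` runs from 0 to m‑1 (the text gives only `1000+k` and `2000+k`, so the starting index is an assumption) and using a hypothetical helper name `candidate_seeds`:

```python
# Base seeds per voting architecture, as stated in the text.
SEED_BASE = {
    "MULTI-VOTE-FLAT": 1000,
    "MULTI-VOTE-OVER-PIPE": 2000,
}


def candidate_seeds(architecture, m=5):
    """Seeds for the m vote candidates; assumes k = 0 .. m-1."""
    base = SEED_BASE[architecture]
    return [base + k for k in range(m)]


flat_seeds = candidate_seeds("MULTI-VOTE-FLAT")
mvop_seeds = candidate_seeds("MULTI-VOTE-OVER-PIPE")

# The bands must not overlap, or the two architectures would share samples.
assert not set(flat_seeds) & set(mvop_seeds)
print(flat_seeds, mvop_seeds)
```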
The full experiment lives in its own repository at anselm-systems-engineering/handoff-tax-experiment, including the headline plot at runs/phase1_headline.png and all run directories backing the §10.4 table. Anyone who wants to reproduce, falsify, or extend the result has every artifact named here as a starting point.
11. Conclusion
The "AI micro‑corporation" is an anthropomorphic reflex. It copies human organisational structures without examining whether the underlying agent's nature warrants them. The information‑theoretic picture is unflattering to that copy: hand‑offs are lossy channels, the data processing inequality is unforgiving, and parallel contests of identical models tend to harvest sampling noise rather than genuine diversity.
The corrective is not to ban fragmentation but to demote it. A useful single‑sentence test: fragment when the hand‑off interface already exists in the problem; do not invent hand‑off interfaces to mimic an org chart. Microservice boundaries, security perimeters, context‑window ceilings, cost tiers, sandboxes — these are real interfaces. Intro/body/conclusion of an article is not.
Default to one conversation. Fragment only when something concrete forces you to. Crystallise conclusions into the digital thread as you go. That is the ANSELM posture stated as architecture: not a committee, a conversation.
References
- Hammond, L., et al. (2026). Multi‑Agent Risks from Advanced AI. Cooperative AI Foundation.
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Cited for the data processing inequality.
- Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multi‑Agent Debate. arXiv:2305.14325.
- ANSELM — AI‑native Systems Engineering Learning Method. https://anselm.ing