ANSELM

AI-native Systems Engineering Learning Method

1. Opening: The Anthropomorphic Reflex

A growing trend in AI‑native projects treats large language models as narrowly‑scoped "employees" inside rigid multi‑agent architectures — writer, reviewer, execution‑logger, statistician — each with artificial permissions and strictly isolated contexts. The pattern is intuitive: it mirrors how we scale human intelligence, by division of labour. It is also, in many cases, the wrong reflex.

This article argues that for tasks that fit inside a single model's effective context window, a single, tool‑augmented agent that iterates with real‑world feedback tends to be structurally superior to a fragmented pipeline of role‑bound agents. The rebuttal is not absolute — there are legitimate reasons to fragment — but the default has been miscalibrated. The current default copies human org charts onto an agent whose nature does not warrant them.

The load‑bearing argument is information‑theoretic (§4). The bookkeeping equations in §5 are illustrative, not a proof; they are how to talk about the asymmetry, not how to establish it.

The thesis aligns naturally with the ANSELM stance: build conversations, not committees.

2. Where the Pattern Comes From, and Why It Misleads

The "AI micro‑corporation" inherits two assumptions from human organisations:

  1. Specialists outperform generalists because human cognitive bandwidth is narrow.
  2. Parallel contests beat serial work because humans iterate slowly.

Neither assumption transfers cleanly to a modern LLM. A general‑purpose model already contains the integrated knowledge that role‑bound agents can only access after costly recombination, and it iterates orders of magnitude faster than any human team. When the same underlying model is wrapped into multiple "roles," the resulting structure inherits the coordination cost of a human org without inheriting any of its cognitive diversity.

Hammond et al. (2026), in their taxonomy of multi‑agent risks, name several failure modes — miscoordination from information asymmetries, network‑effect error propagation, collusion, emergent agency — that arise even among genuinely independent agents. Our concern is narrower and sharper: when fragmentation is unnecessary, it imports those failure modes for no compensating benefit.

3. Accumulation vs. Averaging — An Intuition Pump

Before formalising anything, the core intuition is worth stating plainly.

A single agent iterating in one continuous context accumulates: every tool result and self‑critique stays visible to every later step. A fragmented set of role‑bound agents, at best, averages, and it averages over impoverished views of the problem. "At best" matters: in degenerate cases (a critical signal lost at a single hand‑off) the fragmented system can underperform any single agent in it. The structural claim is asymmetry of information flow, not a uniform performance gap.

This is not a theorem; it is a description of where the information lives. The next two sections give it an information‑theoretic footing and a practical bookkeeping form.

4. An Information‑Theoretic Sketch

Let $X$ denote the ground‑truth task (the problem and its full feedback environment), and $Y$ denote the artifact ultimately produced (code, spec, decision). We care about the mutual information $I(X;Y)$ — how much of the task the final artifact actually captures.

Let $S_i$ be the internal state of the $i$‑th processing stage (an iteration, or an agent in a pipeline). Two regimes can be distinguished.

Iterative regime (single context). All stages share one state that grows monotonically:

$$ S_{i+1} = S_i \cup \Delta_i, $$

where $\Delta_i$ is the new evidence acquired at step $i$ (a tool result, a self‑critique, an environment signal). Because no information is discarded between steps, $I(X;S_{i+1}) \geq I(X;S_i)$. Convergence is bounded only by context capacity and by saturation of useful evidence.

Pipelined regime (hand‑offs). Each agent $A_k$ sees only a summary $T_k = f_k(S_k)$ of the prior state, where $f_k$ is a lossy summarisation function. By the data processing inequality,

$$ I(X; T_k) \leq I(X; S_k). $$

The inequality is strict whenever $f_k$ discards anything task‑relevant, and in practice it almost always does, because hand‑offs are summarisation bottlenecks, permission filters, or schema projections (what we call permission theatre: filtering done for organisational tidiness rather than for real safety).
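The inequality can be checked numerically on a toy case. The sketch below is illustrative only: the eight‑valued task, the `full_state` function, and the integer‑division `summary` are hypothetical stand‑ins for $X$, $S_k$ and $f_k$, not anything taken from the experiment.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Z) in bits, treating the listed (x, z) pairs as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    pz = Counter(z for _, z in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (pz[z] / n)))
        for (x, z), c in joint.items()
    )

# Hypothetical task: X is one of 8 equally likely requirement sets (3 bits).
tasks = list(range(8))

def full_state(x):
    # S_k: the shared context keeps everything it has seen.
    return x

def summary(s):
    # f_k: a lossy hand-off that keeps only the top bit of the state.
    return s // 4

i_state = mutual_information([(x, full_state(x)) for x in tasks])
i_summary = mutual_information([(x, summary(full_state(x))) for x in tasks])
print(f"I(X;S_k) = {i_state} bits, I(X;T_k) = {i_summary} bits")
```

Under these assumptions the full state carries 3 bits about the task and the summary carries 1: the downstream agent cannot recover the two bits the hand‑off dropped, however capable it is.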

Two consequences follow without needing precise numbers:

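  1. Hand‑off loss compounds. Each hand‑off can only preserve or shrink the information about $X$, so across $n$ stages $I(X;Y)$ is capped by the narrowest $f_k$; when every $f_k$ discards some task‑relevant signal, the cap shrinks stage by stage, roughly multiplicatively in the fraction each hand‑off retains.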
  2. Parallel ensembles do not recover the loss. Aggregating $m$ parallel agents whose individual states are all bounded by the same lossy view cannot exceed that bound; the aggregator inherits the ceiling of its inputs.

This is why the iterative single‑agent regime can keep climbing while pipelined and ensemble regimes hit a structural ceiling set by their narrowest interface.
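A minimal sketch of the compounding effect, under a strong simplifying assumption that is not part of the article: the task is a single uniformly random bit, and every hand‑off independently flips it with probability `p` (a binary symmetric channel). For that model the information surviving $n$ hand‑offs has a closed form.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a coin with bias q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def info_after_handoffs(p, n):
    """Bits about a uniform binary X left after n independent hand-offs,
    each of which flips the bit with probability p (assumed model)."""
    # Probability that the bit ends up flipped after n noisy stages.
    q = (1 - (1 - 2 * p) ** n) / 2
    return 1.0 - binary_entropy(q)

# Retained information for 0..4 hand-offs at a hypothetical 5% flip rate.
retained = [info_after_handoffs(0.05, n) for n in range(5)]
for n, bits in enumerate(retained):
    print(f"{n} hand-offs: {bits:.3f} bits retained")
```

The retained information starts at one full bit with zero hand‑offs and falls at every additional stage. Real hand‑offs are not binary symmetric channels, so only the direction of the effect carries over, not the numbers.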

5. A Practitioner's Bookkeeping (Intuition Pump, Not Theorem)

A friendlier way to keep score, useful for design discussions even though it is not a formal proof:

For a single iterative agent in one continuous context:

$$ V_{\text{iter}} = K_0 + \sum_{i=1}^{n} (E_i - C_i). $$

For a one‑shot parallel ensemble of $n$ role‑bound agents whose outputs are merged by averaging (e.g., an LLM judge or summariser):

$$ V_{\text{multi}}^{(n)} \approx K_0 + \frac{1}{n}\sum_{i=1}^{n} \gamma_i E_i. $$

For a sequential pipeline the analogue is $K_0 + \sum_i \big(\prod_{j \le i} \gamma_j\big) E_i$, with hand‑off discounts compounding.

For an ensemble of pipelines — $m$ independent chains feeding an argmax‑style aggregator (the architecture real multi‑agent frameworks most often deploy) — the bookkeeping is a max over chains rather than an average:

$$ V_{\text{mvop}}^{(m)} \approx K_0 + \max_{k \le m}\Big(\prod_{j \le n}\gamma_{jk}\Big)\,E_k. $$

Voting can rescue the luckiest chain, which improves on a single pipeline, but it cannot exceed the ceiling set by the lossiest interface inside that chain.
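The four bookkeeping expressions can be written as small functions for use in a design discussion. This is a sketch under stated assumptions: $K_0$ is read as shared base knowledge, $E_i$ as the value gained at step or agent $i$, $C_i$ as the cost of that step, and $\gamma$ as the hand‑off discount. All numbers below are made up.

```python
from math import prod

def v_iter(k0, gains, costs):
    """Single iterative agent: K0 + sum(E_i - C_i)."""
    return k0 + sum(e - c for e, c in zip(gains, costs))

def v_multi(k0, gains, gammas):
    """One-shot parallel ensemble merged by averaging: K0 + mean(gamma_i * E_i)."""
    return k0 + sum(g * e for g, e in zip(gammas, gains)) / len(gains)

def v_pipe(k0, gains, gammas):
    """Sequential pipeline: the discount compounds across hand-offs."""
    total, discount = 0.0, 1.0
    for g, e in zip(gammas, gains):
        discount *= g            # product of gamma_j for j <= i
        total += discount * e
    return k0 + total

def v_mvop(k0, chain_gammas, chain_gains):
    """Vote over pipelines: best chain's compounded discount times its gain."""
    return k0 + max(prod(gs) * e for gs, e in zip(chain_gammas, chain_gains))

# Hypothetical inputs, chosen only to exercise the formulas.
K0 = 10.0
GAINS = [4.0, 3.0, 2.0, 1.0]
COSTS = [0.5, 0.5, 0.5, 0.5]
GAMMAS = [0.8, 0.8, 0.8, 0.8]

iter_value = v_iter(K0, GAINS, COSTS)
multi_value = v_multi(K0, GAINS, GAMMAS)
pipe_value = v_pipe(K0, GAINS, GAMMAS)
mvop_value = v_mvop(K0, [[0.8] * 4, [0.9] * 4, [0.7] * 4], [6.0, 6.0, 6.0])
print(iter_value, multi_value, pipe_value, mvop_value)
```

With these inputs the single‑agent total comes out highest. How the other three rank against each other depends entirely on the chosen $\gamma$ and $E$ values, which is one more reason to treat the expressions as a vocabulary, not as evidence.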

These expressions are intuition pumps. The load‑bearing claim is the data‑processing argument in §4; the bookkeeping above is just a memorable way to talk about it. One honest limit: the $\gamma_i$ are written as smooth scalars, but in practice each constraint either survives a hand‑off or is lost entirely. The smooth form captures the mean trend; it understates how violently the per‑run variance explodes when a single critical signal happens to die at one interface — a feature the experiment in §10 makes very visible.
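The gap between the smooth form and the all‑or‑nothing reality can be made concrete. The sketch below assumes, hypothetically, that a constraint worth `e` survives each of `n` hand‑offs independently with probability `p`, so that $\gamma$ is read as a survival probability; the mean then matches the smooth discount while the spread does not vanish.

```python
def smooth_value(e, p, n):
    """Smooth bookkeeping: the value is discounted by p at each of n hand-offs."""
    return (p ** n) * e

def survival_stats(e, p, n):
    """Mean and standard deviation when the constraint either survives all
    n hand-offs (value e) or is lost entirely (value 0)."""
    s = p ** n                       # probability of surviving every hand-off
    mean = s * e
    std = e * (s * (1 - s)) ** 0.5   # Bernoulli spread scaled by e
    return mean, std

mean, std = survival_stats(e=10.0, p=0.9, n=4)
print(f"smooth: {smooth_value(10.0, 0.9, 4):.3f}  mean: {mean:.3f}  std: {std:.3f}")
```

The mean equals the smooth value by construction, but a single run lands on either the full value or zero, which is the kind of run‑to‑run swing the smooth scalars hide.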

A note on the choice of baseline. We deliberately model the parallel‑averaging case rather than the sequential pipeline because it is the more optimistic fragmentation baseline: averaging independent insights is, in principle, less lossy than compounding hand‑off discounts down a chain. If iteration beats parallel averaging, it beats sequential pipelines a fortiori.

6. Three Architectures, Cleanly Separated

A frequent confusion in this debate is treating "multi‑agent" as one thing. It is at least three.

6.1 Iterative single agent

One model, one continuous context, real tools. The agent drafts, executes, observes, critiques itself, revises. This is the regime that accumulates. ANSELM's "co‑pilot" sits here. Note that multi‑agent debate patterns — where a single model is prompted to argue against itself, or to adopt opposing perspectives within one conversation — are iterative in this sense, not pipelined: they share context and accumulate.

6.2 Sequential pipelines

Writer → Reviewer → Reviser, with each stage seeing only the prior stage's output (or worse, a summary of it). This is the regime where the data processing inequality bites hardest, because every hand‑off is a lossy bottleneck and the chain is serial.

A pipeline can still be useful when the stages genuinely require different capabilities (a small cheap model triages, an expensive model reasons), or when auditability demands a separation between actor and critic. But absent those reasons, a pipeline of identical models is mostly a context‑destruction machine.

6.3 Parallel ensembles ("contests")

Multiple agents run in parallel; an aggregator picks or merges. This pattern works for humans because human experts have genuinely diverse base knowledge $K_0$ and iterate slowly. With instances of the same model, $K_0$ is shared up to sampling noise, and serial iteration is fast enough to dominate. Same‑model contests therefore harvest noise, not diversity.
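A toy simulation of why a same‑model contest cannot vote its way past a shared blind spot. Everything here is hypothetical: the rule names, the noise rate, and the assumption that every candidate inherits the same systematic misses from a shared $K_0$ and differs only in sampling noise.

```python
import random

def sample_candidate(rng, shared_blind_spots, noisy_rules, noise_rate):
    """One candidate's violations: the shared misses plus random extras."""
    noise = {rule for rule in noisy_rules if rng.random() < noise_rate}
    return set(shared_blind_spots) | noise

def best_of(m, seed, shared_blind_spots, noisy_rules, noise_rate=0.3):
    """Sample m candidates and let the aggregator keep the one with fewest violations."""
    rng = random.Random(seed)
    candidates = [
        sample_candidate(rng, shared_blind_spots, noisy_rules, noise_rate)
        for _ in range(m)
    ]
    return min(candidates, key=len)

SHARED = {"rule_a", "rule_b"}            # misses every same-model sample shares
NOISY = ["rule_c", "rule_d", "rule_e"]   # misses that vary from sample to sample

winner = best_of(5, seed=0, shared_blind_spots=SHARED, noisy_rules=NOISY)
print(sorted(winner))
```

Picking the best of five can strip away the sample‑specific noise, but in this model the winner always still contains the shared misses, so the floor is set by what the candidates have in common.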

A genuinely heterogeneous ensemble (e.g., distinct model families with complementary biases) can still earn its keep, but even there a sequential use of diverse models inside a single conversation typically beats a one‑shot parallel vote, because each model gets to see what the others actually said rather than only the aggregator's verdict. (See note 1.)

Note 1. When agents are different fine‑tuned models, $K_0$ values can diverge meaningfully and some of the human‑contest logic partially applies. The iteration‑speed advantage remains, however, and per‑role fine‑tuning is rarely the actual practice in the micro‑corporation pattern: it is usually one base model with different system prompts.

7. When Fragmentation Is Genuinely Justified

The honest design rule is not "never fragment." It is "fragment only when something forces you to." Legitimate forcing functions include context limits, security, cost, latency, auditability, and genuine model heterogeneity.

None of these is "because that is how a human team would do it." Each is a concrete property of the deployment, not an organisational metaphor.

Common counter‑arguments, briefly addressed

8. The Design Rule

Default to a single, tool‑augmented agent in a continuous context. Fragment only when context, security, cost, latency, auditability, or genuine model heterogeneity force you to — and design the hand‑off to lose as little as possible when you do.
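The rule can be expressed as a small decision helper. This is a sketch, not a prescribed ANSELM API: the `Deployment` fields and the function names are hypothetical, and each flag stands for one of the forcing functions named in the rule.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    """Hypothetical description of a deployment; one flag per forcing function."""
    fits_in_context: bool = True
    needs_security_boundary: bool = False
    needs_cost_tiering: bool = False
    needs_parallel_latency: bool = False
    needs_independent_audit: bool = False
    uses_distinct_model_families: bool = False

def forcing_functions(d):
    """Return the names of the forcing functions that actually apply."""
    checks = [
        (not d.fits_in_context, "context"),
        (d.needs_security_boundary, "security"),
        (d.needs_cost_tiering, "cost"),
        (d.needs_parallel_latency, "latency"),
        (d.needs_independent_audit, "auditability"),
        (d.uses_distinct_model_families, "model heterogeneity"),
    ]
    return [name for applies, name in checks if applies]

def choose_architecture(d):
    """Default to a single agent; fragment only when something forces it."""
    reasons = forcing_functions(d)
    return ("fragment", reasons) if reasons else ("single-agent", [])

print(choose_architecture(Deployment()))
print(choose_architecture(Deployment(needs_security_boundary=True)))
```

The point of returning the reasons alongside the decision is that every hand‑off in the resulting design should be traceable to a named forcing function.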

This is a heuristic, not a theorem. Its strength comes from the structural asymmetry between accumulation and lossy hand‑off, not from any specific equation.

9. Reconciling with ANSELM's Living Digital Thread

A fair objection arises here. ANSELM's manifesto values disposable views, open formats, and a living digital thread — the opposite of a single opaque conversation transcript. How can "keep the conversation whole" coexist with "the digital thread must be alive"?

The reconciliation is that the conversation is the workshop, not the archive. The single‑agent iterative loop is where reasoning happens with maximum information density. What ANSELM asks of that loop is that its outputs be continuously externalised as Knowledge Cells, decisions, rationale, and queryable artifacts — exactly the open, human‑readable formats the manifesto calls for. The conversation accumulates; the ecosystem persists.

In other words: keep the conversation whole during reasoning, and crystallise its conclusions into the digital thread. The single agent is not an alternative to the thread — it is the cleanest way to feed it.

10. Empirical Evidence

The argument predicts a measurable difference. The example is deliberately drawn from enterprise architecture rather than from software, because the multi‑agent failure mode is most visible in domains the ANSELM audience already lives in, and because the literal mechanism of failure in fragmented business process re‑engineering (BPR) engagements is the lossy hand‑off that the data‑processing inequality in §4 describes.

10.1 Task

The full target is to redesign the order‑to‑cash process for a mid‑sized B2B distributor moving from an EDI‑only channel to a mixed EDI + self‑service portal channel, under an explicit constraint set:

The deliverable is a process description (text + BPMN‑equivalent flow) covering the happy path and named edge cases: disputed invoices, partial shipments, returns crossing month‑end close, credit‑hold release authority, and channel reconciliation between EDI and portal orders.

For Phase 1 (reported below) we used a representative sub‑process of this brief — the credit‑hold release loop — keeping all six constraint families (segregation of duties, SLA bound, audit trail, GDPR retention, system‑of‑record reuse, and structural flow completeness). This keeps each run cheap enough to repeat n=5 times per architecture while still exercising every checker family. Scaling to the full order‑to‑cash brief is Phase 2.
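To give a feel for what one checker family might look like, here is a minimal sketch of a segregation‑of‑duties check on a credit‑hold release flow. The step schema (`id`, `role`, `action`), the action names, and the rule interpretation (the role that places a hold may not release it) are all assumptions for illustration; the article does not specify its checkers at this level.

```python
def check_segregation_of_duties(steps):
    """Flag every release step performed by a role that also placed a hold."""
    holders = {s["role"] for s in steps if s["action"] == "place_hold"}
    return [
        {"step": s["id"], "rule": "segregation_of_duties"}
        for s in steps
        if s["action"] == "release_hold" and s["role"] in holders
    ]

# Hypothetical flow: s2 breaks the rule, s3 does not.
flow = [
    {"id": "s1", "role": "credit_analyst", "action": "place_hold"},
    {"id": "s2", "role": "credit_analyst", "action": "release_hold"},
    {"id": "s3", "role": "credit_manager", "action": "release_hold"},
]

violations = check_segregation_of_duties(flow)
print(violations)
```

Returning the offending step together with the rule name keeps each violation countable and attributable, which is what a per‑run violation count needs.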

10.2 Architectures (all using the same base model)

The article's claim is about hand‑off structure, so the four architectures we test span the structural space cleanly:

10.3 Metrics

10.4 Results (n=5 per architecture, gpt‑4o‑mini‑2024‑07‑18, temperature 0 except where noted)

| Architecture | Hand‑offs | Violations (mean ± std) | Range |
|---|---|---|---|
| ITER | 0 | 0.0 ± 0.0 | 0–0 |
| MULTI‑VOTE‑OVER‑PIPE | 4 + vote | 1.4 ± 0.9 | 1–3 |
| MULTI‑VOTE‑FLAT | 1 (vote) | 2.0 ± 0.0 | 2–2 |
| MULTI‑PIPE | 4 (chain) | 17.0 ± 20.1 | 1–39 |

Three observations are load‑bearing.

1. Violations track hand‑off count and structure, not "amount of fragmentation." ITER (zero hand‑offs) hits the constraint‑satisfaction ceiling on every run. MULTI‑VOTE‑FLAT — one final hand‑off, but each candidate sees the brief whole — sits at a low, uniform ceiling. MULTI‑PIPE — four sequential hand‑offs with no recovery mechanism — collapses with mean 17 and a 1‑to‑39 range, exactly the catastrophic variance the data‑processing inequality predicts when a critical signal happens to die at one of the lossy interfaces.

2. Same‑model voting harvests noise, not diversity — exactly as §6.3 predicts. MULTI‑VOTE‑FLAT's five candidates were genuinely distinct samples (different SHA‑256 hashes, different step ordering, different role names) yet every one of them produced exactly two violations. That uniformity is the cleanest possible empirical signature of "candidates differ in surface form but not in their relationship to the underlying constraint set." A real ensemble — say, distinct model families — would be expected to produce different violation profiles.

3. Voting can partially compensate for fragmentation, but cannot recover full ITER quality. MULTI‑VOTE‑OVER‑PIPE (1.4 ± 0.9) does beat MULTI‑VOTE‑FLAT (2.0): given five fragmented pipes, the aggregator can usually find one candidate that escaped the worst hand‑off losses. But the floor is still strictly above ITER, and the variance climbs back up. This matters: it means real multi‑agent frameworks (which roughly correspond to MULTI‑VOTE‑OVER‑PIPE) are not free of the data‑processing penalty — they merely soften it by sampling, at considerable token cost.

10.5 An honest inversion

This section originally pre‑registered the prediction ITER < MULTI‑PIPE < MULTI‑VOTE (ordered by violation count, fewer being better). The data forced an inversion to ITER < MULTI‑VOTE‑OVER‑PIPE (MVOP) < MULTI‑VOTE‑FLAT < MULTI‑PIPE. The misordering came from a coarse mapping between architecture labels and hand‑off count: a one‑shot parallel vote was tacitly grouped with "more multi‑agent than a pipeline," when in hand‑off terms it has strictly fewer lossy interfaces (each candidate sees the brief whole; only the final pick is a hand‑off). The corrected ordering tracks the underlying axis the article has argued for from the start, the number and structure of lossy hand‑offs, and the data tracks it closely.

10.6 Caveats

10.7 Reproducibility & artifacts

The experiment is intentionally small enough to read end‑to‑end. Four moving parts do all the work:

Every run leaves a forensic trail on disk. Each LLM call is dumped as call_NNNN_<role>.json (full request, full response, latency, token counts). Each pipeline hand‑off writes the upstream payload and the downstream summary side by side, so the BERTScore in §10.3 is computable post hoc rather than baked into the run. The final deliverable lands as result.json, and violations land as a structured list with the offending step and the rule it broke. MULTI‑VOTE‑OVER‑PIPE (MVOP) runs additionally keep one sub‑directory per candidate chain (pipe_0/ … pipe_4/) plus the aggregator's mvop_summary.json, so the "which candidate did the vote pick, and why" question is answerable without re‑running anything.
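A minimal sketch of how such a run directory could be inspected after the fact. It relies only on the file names given above (call_NNNN_<role>.json and result.json); the helper names are hypothetical, and nothing is assumed about the JSON fields inside the files.

```python
import json
import re
from pathlib import Path

# Matches the call dump naming scheme: call_<index>_<role>.json
CALL_FILE = re.compile(r"call_(\d+)_(.+)\.json$")

def list_calls(run_dir):
    """Return (index, role, path) for every call dump under run_dir,
    including per-chain sub-directories, ordered by directory then call index."""
    found = []
    for path in Path(run_dir).rglob("call_*.json"):
        match = CALL_FILE.match(path.name)
        if match:
            found.append((int(match.group(1)), match.group(2), path))
    return sorted(found, key=lambda item: (str(item[2].parent), item[0]))

def load_result(run_dir):
    """Load the final deliverable of a run as parsed JSON."""
    return json.loads((Path(run_dir) / "result.json").read_text(encoding="utf-8"))
```

Calling `list_calls("runs/<some_run>")` on a real run directory would give the call sequence in order, which is usually the first thing needed when tracing where a constraint disappeared.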

The knobs that matter for replication: pinned model snapshot gpt‑4o‑mini‑2024‑07‑18 for every architecture (so $K_0$ in §5 is truly held constant); temperature 0 everywhere except vote candidates (1.0 with seeds 1000+k for MULTI‑VOTE‑FLAT, 2000+k for MULTI‑VOTE‑OVER‑PIPE, distinct seed bands so the two voting architectures don't accidentally share samples); ITER capped at 5 revision iterations.
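These knobs fit naturally into a small frozen configuration object. The values come from the paragraph above; the class, field, and function names are hypothetical and are not taken from the experiment repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """Replication knobs; field names are illustrative."""
    model: str = "gpt-4o-mini-2024-07-18"   # pinned snapshot for every architecture
    base_temperature: float = 0.0           # everything except vote candidates
    vote_temperature: float = 1.0           # vote candidates only
    iter_max_revisions: int = 5             # cap on ITER revision iterations
    flat_seed_base: int = 1000              # seed band for MULTI-VOTE-FLAT
    mvop_seed_base: int = 2000              # seed band for MULTI-VOTE-OVER-PIPE

def candidate_seed(cfg, architecture, k):
    """Seed for the k-th vote candidate of a voting architecture."""
    bases = {
        "MULTI-VOTE-FLAT": cfg.flat_seed_base,
        "MULTI-VOTE-OVER-PIPE": cfg.mvop_seed_base,
    }
    return bases[architecture] + k

cfg = RunConfig()
flat_seeds = [candidate_seed(cfg, "MULTI-VOTE-FLAT", k) for k in range(5)]
mvop_seeds = [candidate_seed(cfg, "MULTI-VOTE-OVER-PIPE", k) for k in range(5)]
print(flat_seeds, mvop_seeds)
```

Keeping the two seed bands as separate fields makes the "no shared samples" property easy to check mechanically before a run starts.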

The full experiment lives in its own repository at anselm-systems-engineering/handoff-tax-experiment, including the headline plot at runs/phase1_headline.png and all run directories backing the §10.4 table. Anyone who wants to reproduce, falsify, or extend the result has every artifact named here as a starting point.

11. Conclusion

The "AI micro‑corporation" is an anthropomorphic reflex. It copies human organisational structures without examining whether the underlying agent's nature warrants them. The information‑theoretic picture is unflattering to that copy: hand‑offs are lossy channels, the data processing inequality is unforgiving, and parallel contests of identical models tend to harvest sampling noise rather than genuine diversity.

The corrective is not to ban fragmentation but to demote it. A useful single‑sentence test: fragment when the hand‑off interface already exists in the problem; do not invent hand‑off interfaces to mimic an org chart. Microservice boundaries, security perimeters, context‑window ceilings, cost tiers, sandboxes — these are real interfaces. Intro/body/conclusion of an article is not.

Default to one conversation. Fragment only when something concrete forces you to. Crystallise conclusions into the digital thread as you go. That is the ANSELM posture stated as architecture: not a committee, a conversation.

References

  1. Hammond, L., et al. (2026). Multi‑Agent Risks from Advanced AI. Cooperative AI Foundation.
  2. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. (Source for the data processing inequality.)
  3. Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multi‑Agent Debate. arXiv:2305.14325.
  4. ANSELM — AI‑native Systems Engineering Learning Method. https://anselm.ing