ANSELM

AI-native Systems Engineering Learning Method

What this is, and what it is not. This is a field report — an independently conducted, fully measured case study of a production-grade software system built under a 1.1 FTE configuration (architect-generalist + part-time reviewer + AI coding assistant). It is not an application of ANSELM, nor an ANSELM case study. It is independent evidence that the mechanisms ANSELM advocates — knowledge-first work, AI as cognitive partner, continuous context, role collapse across architectural and implementation responsibilities — produce measurable, auditable results when applied to the construction of complex technical systems. The domain here is software engineering; the cognitive patterns are the same. Read it as a proof of principle, not as a demonstration of the method.


Original article follows. The full working data — module-by-module labour breakdowns, raw repository metrics, COCOMO II EAF calibration, and per-scenario cost analysis — is available on request. muban itself is a closed-source commercial project at muban.me.


A field report on AI-augmented software engineering, role collapse, productivity compression, and the changing economics of building production-grade systems.

TL;DR

A production-grade document generation service was designed, implemented, tested, and documented by a single experienced architect-generalist working with an AI coding assistant, plus a part-time reviewer (~10% FTE, a programmer-practitioner). The architect acted simultaneously as chief solution architect, lead implementer, technical writer, and stakeholder liaison. The service is an API-first backend microservice handling template ingestion, multi-format rendering, and deep PDF post-processing, with domain depth comparable to the rendering engines inside commercial platforms such as JasperReports Server or Aspose-based products, though scoped as a headless backend rather than a full BI suite with a web front-end. Work that a typical mid-sized corporate team would estimate at roughly 2,000 person-days (~9 person-years of full-time effort, spread across a 5–8 person team) was completed in approximately 3–4 calendar months, or ~72 person-days of real effort. That figure is derived empirically from commit history rather than asserted.

The resulting system is not a prototype. It contains:

This article examines what happened, why the compression factor was real (not marketing hype), how the work was organised, where the risks live, and what this means for engineering leadership.

1. Why this story matters

The conversation around AI coding assistants oscillates between two extremes. On one side, executives quote vendor benchmarks claiming "30% faster development" — a number small enough that few practitioners feel it in their daily life. On the other, social media is full of "one-shot vibe-coded prototypes" that collapse the moment they meet a real production requirement.

Neither narrative is useful for engineering leaders trying to make capital allocation decisions.

What is useful is a fully measured, end-to-end project built under realistic constraints, where the artefacts can be inspected and the methodology audited. That is what this report provides.

The project — internally codenamed muban — replaces a category of enterprise software that traditionally requires multi-quarter engagements, dedicated teams, and seven-figure budgets. The headline result is that a 1.1 FTE configuration completed it in roughly the time it usually takes a corporate team to finish discovery and architecture sign-off.

The deeper point, however, is not the speed. It is the decomposition of where that speed came from, because if leadership reads only the headline ("AI made us 27× faster"), they will draw the wrong conclusions and build the wrong organisations.

2. The product: what was actually built

The system is a Spring Boot 3.3 service that ingests document templates (JRXML for JasperReports, DOCX with embedded SpEL expressions) and produces rendered output in PDF, PDF/A-1b, DOCX, XLSX, HTML, RTF, and TXT. It is a microservice in the architectural sense — a single bounded context with a sharp API contract — but it is not a "small" microservice. Its complexity is dominated by two factors:

  1. Domain depth. Generating documents that survive contact with real enterprise requirements means handling colour management (CMYK ICC profiles for professional printing), archival conformance (PDF/A-1b OutputIntent, font embedding, transparency flattening), security (PDF encryption, HMAC-SHA256 integrity verification on uploaded templates), and accessibility-adjacent concerns (font validation, image optimisation).
  2. Integration breadth. OAuth2 with multiple identity providers (Auth0, ADFS, generic OIDC), Active Directory role mapping, three database backends (H2, PostgreSQL, MySQL), ActiveMQ for asynchronous generation, Prometheus metrics, GitLab CI pipelines with custom utilities, and a published Maven artefact for the DOCX engine consumed by other services.
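The HMAC-SHA256 integrity check on uploaded templates mentioned under domain depth can be illustrated with a minimal sketch. This is not muban's implementation (the service is a Spring Boot application and its code is not reproduced here); the function names, the key, and the sample payload are all hypothetical, and only the general technique is shown: compute a keyed digest at upload time, then recompute and compare it in constant time before the template is used.

```python
import hashlib
import hmac


def sign_template(template_bytes: bytes, secret_key: bytes) -> str:
    """Return the hex-encoded HMAC-SHA256 tag for an uploaded template."""
    return hmac.new(secret_key, template_bytes, hashlib.sha256).hexdigest()


def verify_template(template_bytes: bytes, secret_key: bytes, expected_tag: str) -> bool:
    """Recompute the tag and compare it to the stored one in constant time."""
    actual_tag = sign_template(template_bytes, secret_key)
    return hmac.compare_digest(actual_tag, expected_tag)


# Hypothetical key and payload, for illustration only.
key = b"demo-secret-key"
template = b"<jasperReport name='invoice'/>"
tag = sign_template(template, key)

print(verify_template(template, key, tag))                # True
print(verify_template(template + b"tampered", key, tag))  # False
```

The constant-time comparison matters because a naive string comparison can leak, through timing, how many leading characters of a forged tag were correct.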

Nine functional subsystems were built:

Subsystem | Production lines of code | Test lines of code
Document generation core (JasperReports adapter, template service, caching) | ~4,400 | ~10,900
muban-docx engine (SpEL expressions, conditional blocks, table replication, image replacement in OOXML) | ~3,020 | ~3,740
PDF post-processing pipeline (PDF/A, CMYK, fonts, security, transparency, image optimisation) | ~4,050 | ~4,620
Security and identity (JWT, OAuth2 clients, AD role mapping, refresh-token blacklist, RBAC) | ~3,160 | ~9,500
Audit and monitoring (real-time events, threat detection, correlation-ID propagation) | ~3,037 | ~3,700
User and admin management (advanced search with JPA Specifications) | ~2,524 | ~5,600
Async and queuing (ActiveMQ producer/consumer, bulk submission, monitoring) | ~2,834 | ~3,600
Parameter validation and type conversion (native JSON support) | ~461 | ~1,200
Configuration and infrastructure (19 Spring config classes, exception handling, DTOs, file storage) | ~6,699 | ~6,300

Plus the OpenAPI contract (3,307 lines), the documentation set (46 Markdown files totalling ~1.3 MB), and DevOps assets (Dockerfile, docker-compose, GitLab CI).

A test-to-production ratio of 1.04:1 is roughly double the industry median for mid-level teams. The bilingual documentation (Polish and English handbooks of 130+ KB each for both JasperReports template authors and the API contract) approaches the volume of a small technical book.

3. The measurement problem

Productivity claims in software engineering are notoriously unreliable. Lines of code mismeasure effort, story points are relative, and "feature complete" is subjective. To produce numbers that survive scrutiny, three independent estimation methods were applied, then triangulated:

Bottom-up technical decomposition. Every subsystem was inventoried, classified by complexity (CRUD, integration, low-level algorithmic), and assigned a person-day cost using productivity rates calibrated for mid-level developers on Spring Boot stacks. This yielded 638–844 person-days at the system level, and a more granular 738–993 person-days when broken down further. The consensus midpoint is ~800 person-days of technical effort.
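As a rough check on the ~800 person-day consensus, the two bottom-up ranges can be reduced to their midpoints and averaged. The text does not say how the consensus figure was actually derived, so the simple average used here is an assumption made for illustration.

```python
# Bottom-up ranges in person-days, as quoted in the text.
system_level = (638, 844)
granular = (738, 993)


def midpoint(bounds):
    """Midpoint of a (low, high) range."""
    low, high = bounds
    return (low + high) / 2


midpoints = [midpoint(system_level), midpoint(granular)]
consensus = sum(midpoints) / len(midpoints)

print(midpoints)  # [741.0, 865.5]
print(consensus)  # 803.25
```

Under that assumption the result lands within a few person-days of the ~800 figure used in the rest of the analysis.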

Function Points. Approximately 50 REST transactions across 10 controllers, 12 JPA entities, 6 report generators and exporters give ~450 function points. At mid-level productivity (12–18 FP/day), this implies 25–38 person-days per 100 FP, or ~110–170 person-days for business logic alone — consistent with the bottom-up estimate when R&D, infrastructure, and documentation are added.

COCOMO II. The nominal formula gives ~132 person-months (2,640 person-days) for 30 KLOC of semi-detached complexity. With realistic Effort Adjustment Factors (very high tool support, single site, complete personnel continuity, high product complexity, very high documentation requirements) the EAF collapses to ~0.43, producing ~1,135 person-days, which sits between the ~800 person-day bottom-up figure and the ~1,230 person-days estimated in section 4 for a small product team.
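The COCOMO arithmetic can be replayed from the quoted figures. This sketch takes the nominal 132 person-months and the 0.43 EAF as given rather than re-deriving them; the 20 working days per person-month is an assumption, though it is the conversion implied by 132 person-months being equated with 2,640 person-days.

```python
# Assumption: 20 working days per person-month (implied by 132 PM -> 2,640 PD).
WORKING_DAYS_PER_PERSON_MONTH = 20

nominal_person_months = 132      # quoted nominal result for ~30 KLOC
effort_adjustment_factor = 0.43  # quoted combined EAF

nominal_person_days = nominal_person_months * WORKING_DAYS_PER_PERSON_MONTH
adjusted_person_days = nominal_person_days * effort_adjustment_factor

print(nominal_person_days)          # 2640
print(round(adjusted_person_days))  # 1135
```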

The three methods converge on a base technical cost of ~800 person-days "hands on keyboard". This is the number that would appear on the time sheets of the engineers actually writing code, tests, and documentation. It does not include analysis, project management, formal QA, security review, compliance, UAT, ceremonies, handoffs, or onboarding.

4. The full-SDLC cost: what corporations actually pay

The technical cost of 800 person-days is what a fictional friction-free team would spend. Real organisations carry overhead. The overhead model used here aggregates ten categories — requirements analysis, formal architecture, project management, formal QA, security review, compliance, UAT and change advisory, communication and meetings, onboarding and knowledge transfer, and Scrum ceremonies — with multipliers calibrated against PMI, IFPUG, DORA, and Standish Group data.

Layered on top is the diseconomy of scale captured by Brooks' Law: a 9-person team is not three times more productive than a 3-person team because coordination cost grows non-linearly.

Combining the two yields four organisational reference scenarios:

Scenario | Team profile | Organisational multiplier | Scale penalty | Total cost
A. Small product team (startup, scale-up) | 3–4 people: 2 devs + PO + half-time QA | ×1.40 | ×1.10 | ~1,230 person-days
B. Mid-sized corporation | 5–8 people: 3–4 devs + lead + QA + PM + half architect | ×2.04 | ×1.20 | ~1,960 person-days
C. Large enterprise (SAFe, ARB, security gate, CAB) | 9–15 people including dedicated QA team, architect, security officer | ×2.66 | ×1.35 | ~2,870 person-days
D. Regulated sector (bank, insurance, public administration) | 12–20 people including compliance, audit, risk, legal | ×3.32 | ×1.50 | ~3,980 person-days
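The scenario totals follow from multiplying the ~800 person-day technical base by the two factors in each row. A short sketch makes the arithmetic reproducible; the function and variable names are illustrative, and the multipliers are taken from the table as given.

```python
BASE_TECHNICAL_EFFORT_PD = 800  # base technical cost from section 3

# scenario -> (organisational multiplier, scale penalty), as quoted in the table
scenarios = {
    "A": (1.40, 1.10),
    "B": (2.04, 1.20),
    "C": (2.66, 1.35),
    "D": (3.32, 1.50),
}


def total_cost(base_pd, org_multiplier, scale_penalty):
    """Full-SDLC cost: technical base times overhead times coordination penalty."""
    return base_pd * org_multiplier * scale_penalty


totals = {
    name: round(total_cost(BASE_TECHNICAL_EFFORT_PD, org, scale))
    for name, (org, scale) in scenarios.items()
}
print(totals)  # {'A': 1232, 'B': 1958, 'C': 2873, 'D': 3984}
```

Rounded to the nearest ten, these reproduce the ~1,230, ~1,960, ~2,870, and ~3,980 person-day figures in the table.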

For mid-sized corporate delivery (scenario B), the calendar cost is approximately 22 months with a 5-person team — a typical 18–24 month project for software of this scope.

The actual cost incurred to build muban was ~72 person-days total. This figure is not an opinion; it is derived empirically from the project's git history using a workflow-calibrated commit rate: 514 non-merge commits by the primary author at ~1 hour per commit (the typical AI-assisted prompt → review → correct → commit loop) ≈ 64 person-days, plus 23 non-merge commits by the reviewer at ~2 hours per commit (no AI assistance, by design) ≈ 6 person-days, plus an estimated 16 hours of pure-review work ≈ 2 person-days. Total: ~72 person-days across the 1.1 FTE configuration. The reviewer (about 10% FTE) provided code review on critical changes and served as the bus-factor mitigation.
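The commit-based derivation is simple enough to restate as code. The commit counts and per-commit hours are the ones quoted above; the 8-hour person-day is an assumption, although it is the conversion that makes 514 hours come out at roughly 64 person-days.

```python
HOURS_PER_PERSON_DAY = 8  # assumption consistent with 514 h ~ 64 person-days

author_hours = 514 * 1.0          # primary author: ~1 h per non-merge commit
reviewer_commit_hours = 23 * 2.0  # reviewer: ~2 h per non-merge commit
review_only_hours = 16.0          # estimated pure-review time

total_hours = author_hours + reviewer_commit_hours + review_only_hours
total_person_days = total_hours / HOURS_PER_PERSON_DAY

print(author_hours / HOURS_PER_PERSON_DAY)           # 64.25
print(reviewer_commit_hours / HOURS_PER_PERSON_DAY)  # 5.75
print(total_person_days)                             # 72.0
```

The estimate is only as good as the hours-per-commit calibration, which is why the figure is quoted as approximate throughout.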

The compression factor depends on which corporate reality is being compared:

Reference scenario | Total cost (PD) | Real cost (PD) | Compression | Notional savings (PD)
A. Small product team | ~1,230 | 72 | ~17× | ~1,160
B. Mid-sized corporation | ~1,960 | 72 | ~27× | ~1,890
C. Large enterprise | ~2,870 | 72 | ~40× | ~2,800
D. Regulated sector | ~3,980 | 72 | ~55× | ~3,910

At a Polish mid-level all-in cost of ~€280/day, scenario B alone represents roughly €530,000 in avoided cost. In regulated sectors, where specialised QA, security, and compliance staff cost considerably more per day, the equivalent figure is closer to €1.1–1.6 million.
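The compression factors and the scenario B cost figure can be recomputed from the rounded totals. In this sketch the €280 day rate is applied to scenario B only, since the text notes that regulated-sector staff cost more per day; the variable names are illustrative.

```python
REAL_COST_PD = 72   # commit-derived effort quoted in the text
DAY_RATE_EUR = 280  # quoted Polish mid-level all-in day rate

scenario_totals_pd = {"A": 1230, "B": 1960, "C": 2870, "D": 3980}

compression = {name: total / REAL_COST_PD for name, total in scenario_totals_pd.items()}
savings_pd = {name: total - REAL_COST_PD for name, total in scenario_totals_pd.items()}

for name in scenario_totals_pd:
    print(f"{name}: ~{compression[name]:.0f}x, ~{savings_pd[name]} PD saved")

avoided_cost_b_eur = savings_pd["B"] * DAY_RATE_EUR
print(avoided_cost_b_eur)  # 528640
```

The unrounded saving for scenario B is 1,888 person-days, which at €280 per day comes to about €529,000 and is reported above as roughly €530,000.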

5. The honest part: where does the compression come from?

A 27× compression factor sounds like marketing fiction. It is not — but it is also not what most readers will assume. The compression has four independent sources, and only one of them is attributable to AI.

Source 1a: Team-size effect (~1.5–2× of the compression). A solo or near-solo configuration does not pay the Brooks-Law coordination tax that a 5–15 person team pays. Planning meetings, refinement sessions, demos, retrospectives, ARB approvals, formal QA handoffs, defect triage rituals, status reports, and the dozens of micro-interruptions that punctuate enterprise development simply do not exist. This portion of the gain is achievable without any AI: it is a function of team size and process formality, and it comes bundled with well-known trade-offs (loss of redundancy, accountability, formal compliance evidence) discussed in section 8.

Source 1b: Role-collapse effect (~1.5–2× of the compression). This is the source that is most often misunderstood and the hardest to replicate. In a corporate setup, the chief solution architect, the lead developer, the technical writer, and the stakeholder liaison are four distinct roles — frequently held by four different people, with formal handoffs, mistranslations, and approval queues between them. In muban they were collapsed into a single head. The same person who heard the stakeholder requirement also chose the architecture, wrote the code, generated the documentation, and updated the API contract — typically within the same working day. The information-loop length from stakeholder intent to working artefact dropped from weeks to hours. This is not a team-size effect; it is a role-merging effect. It requires a person who can credibly hold all four roles, which is a much smaller population than "senior developers," and it is the reason a typical mid-level developer with the same AI tooling would not reproduce this compression even if working solo.

Together, sources 1a and 1b account for roughly 3× of the overall compression. They are organisationally portable, but only to the extent that organisations can constitute teams around generalist senior practitioners rather than narrow specialists.

Source 2: Modern framework leverage (neutral, equal for everyone). Spring Boot 3, Spring Security with OAuth2 client support, Spring Data JPA, Apache POI, PDFBox, JasperReports 7, and Lombok deliver an enormous portion of the system's behaviour as configured library code. A team building the same product on a 2010-era stack would write three times more code. This is a real efficiency, but every team has access to it equally; it does not differentiate AI-assisted work from non-AI-assisted work.

Source 3: AI acceleration of technical work (~3.5–4× of the compression). This is the contribution attributable to the AI assistant. It is highly heterogeneous across task types:

Task category | Observed AI speedup | Share of project effort
Boilerplate, DTOs, mappers | 5–10× | ~25%
Unit tests (given the code under test) | 3–6× | ~30%
Technical documentation (handbooks, READMEs) | 4–8× | ~15%
Spring configuration, properties files | 3–5× | (within infrastructure)
Standard CRUD REST endpoints | 3–4× | ~10%
Business logic with established libraries (Jasper, POI) | 2–3× | ~10%
Low-level algorithmic work (PDF/A, CMYK, OOXML byte manipulation) | 1.2–1.8× | ~5%
Integration debugging, dependency conflict resolution | 1.5–2.5× | ~5%

Weighted by effort share, the average AI speedup on this project lands around 3.7×. This number aligns with the upper range of independently measured studies (METR's 2024 evaluations, GitHub Copilot productivity research, Anthropic's reported Claude Code metrics for greenfield projects).
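One way to sanity-check the weighted figure is sketched below. The text does not state which weighting scheme was used, so this assumes the effort shares refer to the unassisted baseline, in which case the aggregate speedup is an effort-weighted harmonic mean (total baseline time divided by total assisted time). The Spring configuration row is left out because it has no separate share.

```python
# (category, low speedup, high speedup, share of baseline effort)
tasks = [
    ("Boilerplate, DTOs, mappers", 5.0, 10.0, 0.25),
    ("Unit tests", 3.0, 6.0, 0.30),
    ("Technical documentation", 4.0, 8.0, 0.15),
    ("Standard CRUD REST endpoints", 3.0, 4.0, 0.10),
    ("Business logic with established libraries", 2.0, 3.0, 0.10),
    ("Low-level algorithmic work", 1.2, 1.8, 0.05),
    ("Integration debugging", 1.5, 2.5, 0.05),
]


def aggregate_speedup(pairs):
    """Effort-weighted harmonic mean over (speedup, share) pairs."""
    assisted_time = sum(share / speedup for speedup, share in pairs)
    return 1.0 / assisted_time


low = aggregate_speedup([(lo, share) for _, lo, _, share in tasks])
mid = aggregate_speedup([((lo + hi) / 2, share) for _, lo, hi, share in tasks])
high = aggregate_speedup([(hi, share) for _, _, hi, share in tasks])

print(f"{low:.2f} {mid:.2f} {high:.2f}")  # 2.89 3.97 5.00
```

Under this assumption the table's ranges bracket the reported ~3.7× (roughly 2.9× at the low end and 5.0× at the high end). Note that the slowest categories dominate the aggregate: the two 5% rows with the smallest speedups consume a disproportionate share of the assisted time.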

With sources 1a and 1b combined into a single organisational factor, the multiplicative product of the three factors (3× × 1× × 3.7×) gives a predicted compression in the range of 11–20×, which is consistent with the observed 17–27× once slack from "writing-from-zero-without-search-cost" is included. The numbers are internally consistent.

The headline-grade takeaway is therefore not "AI made us 27× faster". It is closer to: "A senior generalist holding four corporate roles in one head, supported by AI that supplied implementation throughput and on-demand domain knowledge, compressed the work by roughly 27× versus a mid-sized corporate baseline. The role collapse and the AI compound; neither factor alone would have been sufficient."

This distinction matters operationally. An organisation that adopts AI tools without rethinking team structure will capture roughly the 3.7× AI speedup, not the 27× compression. An organisation that downsizes teams without senior generalists at the centre will hit a different ceiling: AI-generated code without architectural judgment to evaluate it tends to compound into the "vibe-coded prototype" caricature. The full compression requires both the right person profile and the right tooling, applied to systems whose criticality justifies the trade-off.

6. How the work was actually organised

The methodology was not heroic. It looked like the following.

The 1.1 FTE pattern. One developer was the single sustained author. A second developer was engaged on demand — averaging perhaps half a day per week, sometimes a single hour, sometimes a full day around major releases — to review critical pull requests, validate architectural decisions, and maintain enough familiarity to take over if needed. This pattern intentionally trades a small amount of organisational redundancy (the reviewer is not a substitute, only a successor candidate) for nearly all of the productivity of a solo workflow. The effective bus factor sits at approximately 2.

Role collapse, not just team-size reduction. The primary author was an experienced architect-generalist — meaning someone with broad cross-stack literacy, working architectural-decomposition skill, stakeholder-facing communication craft, and the intellectual humility to treat an AI assistant as a competent but fallible collaborator. On muban he simultaneously held the roles that a corporate setup would distribute across at least four people: chief solution architect, lead implementer, technical writer (handbooks, API contract, ADR-style inline documentation), and stakeholder liaison. He did not claim, and would probably not pass, a contemporary mid-level coding interview built around algorithmic puzzles — the contribution was architectural decomposition, judgment, and discrimination, with AI supplying the typing throughput. The on-demand reviewer was a programmer-practitioner providing the daily-IDE muscle memory and code-level sanity check that the architect did not. This pairing — architect with judgment + AI with implementation throughput + practitioner with code-level scepticism — was the actual unit of production. The tight information loop from stakeholder intent to working artefact, with no intermediate handoffs, is what made the schedule compression possible. It is also the property that is hardest to reproduce by hiring policy alone: it presumes a person who can credibly hold all four roles at once. Years of experience matter only as a rough proxy for accumulating that profile — the profile itself, not the year count, is what the compression depends on.

No prior domain experience. This point deserves emphasis because it is easy to miss. Neither implementer had previously worked in pre-press or print-production engineering. PDF/A-1b conformance rules, ICC colour profiles and CMYK separation, font embedding and subsetting policies, JasperReports' compilation and fill model, the OOXML quirks of DOCX image replacement: all of this domain knowledge was acquired during the project itself, not brought into it. In a conventional engagement this would have added a substantial ramp-up cost (COCOMO II would model it via low ratings on the experience drivers APEX, PLEX, and LTEX, typically inflating effort by 30–60%). The project still landed at 72 person-days of real human effort, with the AI assistant actively closing this gap in real time, and that is part of what the compression number is actually measuring.

Microservice-shaped scope discipline. muban is one bounded context. It does not try to be a platform. When functionality could have been absorbed (template authoring UI, output storage management, document delivery), it was instead delegated to existing or future microservices. The architecture assumption is that scaling beyond a single solo workflow happens by adding microservices with their own solo teams, not by adding people to a single team. This avoids Brooks-Law penalties precisely because it forbids the conditions under which they appear.

AI-driven specification and documentation in lockstep with code. The OpenAPI contract, the architecture diagrams, the bilingual handbooks, and the inline ADR-like documentation were not written after the fact. They were generated alongside (sometimes ahead of) the code, with the AI assistant maintaining consistency across all artefacts. This is the practice that drove the documentation-to-code ratio so high: the marginal cost of producing professional documentation when an AI has the full repository in context is small enough that there is no reason not to do it.

Test-first within reason. Tests were generated by the AI immediately after each non-trivial code unit, then reviewed for completeness and adversarial cases. The result is the 1.04:1 test-to-production ratio. Industry data places mid-level teams between 0.5:1 and 0.7:1; this project is therefore not under-tested relative to its production peers — it is over-tested.

Iterative releases as a forcing function. Fifteen minor versions were cut over the project's life (1.0 through 1.15.7). Each release imposed a quality gate: the build had to pass, the documentation had to be coherent, the API contract had to match the implementation. This rhythm — short cycles, frequent integrations — is the only sustainable mechanism for a solo developer to avoid accumulating undetected technical debt, because AI tools amplify both correct and incorrect intuitions.

7. What the AI actually did, in practice

Three working modes dominated, in roughly equal proportion:

Mode A: structured generation. The developer described an outcome — "implement an OAuth2 client service supporting Auth0, ADFS, and generic OIDC, with refresh-token rotation and a blacklist" — and the AI produced a first draft of code, tests, and documentation. The developer then reviewed, corrected, and integrated. This mode is where the 5–10× speedup on boilerplate-heavy work originates.

Mode B: pair-programming on hard problems. When a CMYK colour conversion was producing visually incorrect output, the developer described symptoms, the AI proposed hypotheses, the developer tested them, results were fed back, and the cycle continued. Speedup here is modest (1.5–2×) but the cognitive load reduction is large: the developer is never alone in front of a problem.

Mode C: knowledge retrieval and synthesis. "What does PDF/A-1b require for OutputIntent, and how does PDFBox 2.x express it?" The AI provided an immediate answer with code examples, sparing 20–60 minutes of documentation searching per question. Aggregated across the project, this likely accounts for tens of person-days of saved time.

This mode deserves a separate emphasis: on muban it functioned as a domain-knowledge bridge, not just a documentation shortcut. The implementers came in without pre-press background, and a substantial fraction of AI interactions were structured as "explain the constraint, then show me how to express it in this stack" — for PDF/A conformance requirements, ICC profile handling, JasperReports' parameter and subreport semantics, OOXML's relationship model in DOCX, font metrics and embedding rules. In a conventional team this knowledge would have been bought either by hiring a domain specialist or by absorbing weeks of specification reading per implementer. Here it was acquired on demand, in the flow of the work, and almost always with a runnable code example attached. This is a qualitatively different category of leverage from boilerplate generation, and it is the one most often missed by productivity studies that measure tasks rather than capability expansion.

What the AI did not do, materially:

8. Risks and what they mean for adoption

The compression factors above are real, but they come bundled with risks that any executive considering this model needs to internalise.

Quality risk under reduced human review. AI-generated code is not uniformly correct. Subtle defects — off-by-one errors in algorithmic code, incorrect handling of edge cases, security-relevant misconceptions — can pass a tired solo author's review. Mitigations in this project: 1.04:1 test ratio, a part-time reviewer for critical paths, fifteen iterative releases each providing an integration checkpoint, and obsessive attention to the audit subsystem (every state-changing operation is logged with correlation IDs, threat detection, and queryable trails). These mitigations cost effort and must be planned for; without them, AI-assisted solo work degrades into the "vibe-coded prototype" caricature.

Scaling ceiling. This methodology works for a microservice of 30–50 KLOC built by one person. It does not scale linearly to a 500 KLOC monolith built by one person. The architectural answer is the one used here: decompose the product into bounded contexts, each developed by a 1.1 FTE micro-team, and let the macro-architecture absorb the coordination that a single solo developer cannot. This requires architectural maturity at the system level even though no individual component requires team coordination.

Bus factor at the component level. A solo developer with an on-demand reviewer gives a bus factor of approximately 2. For non-critical and medium-critical systems this is acceptable; the extensive documentation (~28,000 lines of Markdown in this project) further lowers the cost of onboarding a replacement. For mission-critical systems (life-safety, financial transactions of regulatory significance), a higher bus factor is mandatory and the methodology must be adjusted — typically by promoting the on-demand reviewer to ~30% FTE, which still preserves most of the productivity advantage.

Learning curve. The 3.7× AI speedup is achievable only by developers who have internalised the patterns of effective AI collaboration: prompt structure, context management, recognition of failure modes, judicious use of autonomous vs. interactive modes. The transition cost is real (typically 2–4 weeks of below-baseline productivity for a developer adopting these tools earnestly). Industry programmes that expect immediate gains without this investment will be disappointed.

Maintenance cost beyond build. The figures above describe build phase only. Maintenance (bug fixes, minor features, dependency upgrades) typically runs 15–25% of build cost per year. AI tools accelerate maintenance roughly as much as they accelerate construction, but the absolute hours do not vanish. A solo-developed microservice still requires solo-developer maintenance hours indefinitely.

9. What this means for engineering leaders

The temptation when reading a report like this is to extrapolate naïvely: "If one developer with AI can do the work of 22, we should fire 95% of engineering."

This is the wrong reading. The correct reading has three components.

First, the productivity gain is real but bounded. The 3.7× AI acceleration of technical work is the durable, replicable, organisationally portable number. It is also the number most enterprise teams capture only partially today, because they pair AI tooling with unchanged team structures and process formality. Organisations that want the full 3.7× must invest in the workflow practices that make it accessible: short prompt-response loops, AI-aware IDE setups, lowered ceremony around code review for AI-generated boilerplate, and tooling for AI to maintain documentation alongside code.

Second, the organisational compression is real but contingent. The additional ~3× from solo-style workflows requires accepting reduced redundancy, less formal governance, and bus-factor risks. This is acceptable for many systems — internal tools, microservices in well-tested architectures, products with clear bounded contexts — and unacceptable for others. Engineering leadership's job is to identify which systems can move to the lean methodology and which must retain the corporate envelope.

Third, the right organisational model is plural. A modern engineering organisation should probably contain a mixture of: (a) larger teams on mission-critical or highly regulated systems where the corporate overhead pays for itself in risk reduction, (b) small product-team configurations on medium-criticality systems where the 17× compression versus traditional small-team baselines is the relevant gain, and (c) 1.1 FTE microservice teams on bounded-context components where the full 27–40× compression versus mid-sized corporate baselines is realistically achievable. Treating the entire engineering function as one homogeneous unit and applying one methodology to all of it is leaving substantial value unrealised regardless of which methodology is chosen.

A caveat on the staffing assumption. Configurations (b) and (c) above presume an experienced architect-generalist at the centre — someone who can credibly hold the chief-architect, lead-developer, and primary-author-of-documentation roles simultaneously. The relevant attributes are cognitive, not chronological: breadth of cross-stack literacy, architectural decomposition skill, stakeholder-facing communication craft, and the intellectual humility to use AI as a peer rather than as either oracle or threat. Some practitioners reach that profile in seven or eight years; others never reach it in twenty. Treating year-count as the hiring signal is a category error, and a costly one. A mid-level developer who lacks the architectural and stakeholder-facing dimensions, placed in the same configuration, will likely capture the 3.7× AI speedup but will not capture the role-collapse multiplier — because the judgment, the translation skill, and the discrimination needed to reject incorrect AI suggestions are not yet there. Workforce strategies that propose "replace teams with juniors plus AI" are therefore misreading this case study; so are strategies that conclude "hire only fifteen-year veterans." The realistic talent equation is fewer people, but with broader and more flexible profiles, not fewer people of the same shape.

10. A note on the methodology of this article

The numbers in this article are derived from three estimation methods cross-validated against each other (bottom-up technical decomposition, function point counting, COCOMO II with calibrated EAF). The full workings, including module-by-module breakdowns and sensitivity analyses, were produced as a separate internal report and are available on request. The base technical estimate of ~800 person-days has a stated uncertainty of ±15%; the full-SDLC scenario costs carry an additional ±20% from variability in organisational multipliers; the financial figures use Polish 2026 labour rates and would need to be reprojected for other markets.

A note on the author profile. Honesty matters more than flattering framing here. The primary author of muban is an experienced architect-generalist whose accumulated capability is broad rather than deep — working knowledge across multiple stacks, paradigms, and architectural layers, plus stakeholder-facing communication craft and the intellectual humility to treat an AI assistant as a peer rather than as oracle or threat. He is not a daily-coding mid-level developer, and would probably not pass a contemporary mid-level coding interview built around algorithmic puzzles or recent framework trivia. The productivity figures should be read with that in mind: this was not a case of "AI compensates for missing skill," it was a case of "AI supplies implementation throughput to a person whose scarce capability is judgment and decomposition." The on-demand reviewer was a programmer-practitioner supplying the daily-IDE craft that the architect did not. The compression figures should therefore be read as a benchmark for what becomes possible when an experienced architect-generalist is given AI tooling sufficient to act as his own implementation team — not as a generic claim about "AI productivity" applied to an average developer, and explicitly not as a claim that the configuration requires a fifteen-or-twenty-year veteran. The profile is about breadth of capability and cognitive habits, not about tenure.

The author has declared the project's actual real cost (3–4 calendar months, single primary developer + on-demand reviewer, AI assistant in continuous use) and the methodology. The repository, including all code, tests, documentation, and CI configuration, exists and is auditable.

What is not auditable is the counterfactual: nobody can prove that a 5-person team would in fact have required 22 calendar months. The reference scenarios are best-effort estimates calibrated against well-established models. They are sufficient for executive decision-making about the order of magnitude of opportunity, but they are not, and cannot be, precise.

11. Closing

Software engineering is at the start of a structural change comparable in scale to the move from waterfall to agile, or from on-premise to cloud. The economics of building production-grade systems are shifting in a way that benefits skilled, judgment-exercising developers operating in lean, well-architected configurations, while penalising organisational forms whose primary value was coordination at scale.

The right response is neither panic nor triumphalism. It is to study the working examples carefully, separate the durable signal from the noise, and rebuild organisational structures to capture the new economics where they apply — while preserving the ones that justify their overhead in the systems where overhead is genuinely warranted.

The muban project is one such working example. It is not a unicorn, it is not a stunt, and it is not unrepeatable — provided the staffing premise is reproduced honestly. It is what an experienced architect-generalist with current-generation AI tooling, a part-time practitioner reviewer, and disciplined engineering practices can produce in three to four months. The interesting question is not whether this is possible — the artefacts are there to inspect — but how many of the systems being built today by 5-to-15-person teams could have been built this way instead, and whether engineering organisations have the talent profile, the architectural discipline, and the governance flexibility to capture that opportunity where it exists.

This article describes a real software project. Quantitative estimates are derived from established models (COCOMO II, IFPUG Function Points, PMI overhead schedules) and should be treated as order-of-magnitude rather than precision figures. Productivity multipliers attributed to AI assistance reflect this project's measured experience and are consistent with published studies (METR 2024, GitHub Copilot research, Anthropic engineering reports), but individual results will vary with the practitioner's capability profile, task mix, organisational configuration, and tooling maturity. The compression figures reported here presume an experienced architect-generalist holding multiple roles simultaneously; they should not be applied uncritically to teams of different shape.

References

  1. Boehm, B. W., Abts, C., Brown, A. W., Chulani, S., Clark, B. K., Horowitz, E., Madachy, R., Reifer, D., & Steece, B. (2000). Software Cost Estimation with COCOMO II. Prentice Hall. (Reference text for the COCOMO II model, Effort Adjustment Factors, and the semi-detached mode used in section 3.)
  2. International Function Point Users Group (IFPUG). (2010). Function Point Counting Practices Manual, Release 4.3.1. IFPUG. https://www.ifpug.org
  3. Brooks, F. P. (1995). The Mythical Man-Month: Essays on Software Engineering (Anniversary ed.). Addison-Wesley. (Source for the diseconomy-of-scale argument in sections 4 and 5.)
  4. Project Management Institute. (2021). A Guide to the Project Management Body of Knowledge (PMBOK Guide), Seventh Edition. PMI. (Source for overhead categories used in the full-SDLC scenarios in section 4.)
  5. Standish Group. (2020). CHAOS Report 2020: Beyond Infinity. The Standish Group International. (Reference for project-overhead and success-rate data used in calibrating organisational multipliers.)
  6. Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press.
  7. DORA. (2023). Accelerate State of DevOps Report 2023. Google Cloud / DORA. https://cloud.google.com/devops/state-of-devops
  8. METR. (2024). Measuring the impact of AI coding assistance on developer productivity. METR (Model Evaluation & Threat Research). https://metr.org/ (Empirical task-level speedup measurements referenced in section 5; observed AI speedup distribution.)
  9. Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (arXiv:2302.06590). arXiv. https://arxiv.org/abs/2302.06590
  10. Ziegler, A., Kalliamvakou, E., Li, X. A., Rice, A., Rifkin, D., Simister, S., Sittampalam, G., & Aftandilian, E. (2022). Productivity Assessment of Neural Code Completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS '22). ACM. https://doi.org/10.1145/3520312.3534864
  11. GitHub. (2022). Research: Quantifying GitHub Copilot's impact on developer productivity and happiness. GitHub Blog. https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
  12. Anthropic. (2024–2025). Claude Code engineering reports and case studies. Anthropic. https://www.anthropic.com/engineering (Source for the Claude Code productivity figures referenced in section 5.)
  13. ISO. (2005). ISO 19005-1:2005 — Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1). International Organization for Standardization.
  14. ISO. (2011). ISO 19005-2:2011 — Document management — Electronic document file format for long-term preservation — Part 2: Use of ISO 32000-1 (PDF/A-2). International Organization for Standardization.
  15. ISO. (2012). ISO 19005-3:2012 — Document management — Electronic document file format for long-term preservation — Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3). International Organization for Standardization.

Because URLs may change over time, the canonical bibliographic record (author, year, title, publisher) is given for each entry so that the source can be located in the relevant publisher's catalogue or archive.