
Deep Agents Need More Than Intelligence.
Advances in AI seem to follow the same arc: smarter models, bigger context, better tools, until we discover intelligence was never the bottleneck. To a large extent, leadership was and still is. Code volume is becoming a challenge of its own: with reported 3x, 4x, and even 10x productivity gains, the amount of AI-generated code is growing rapidly. This is the case for building the organization, not just the agents or models, and it is a lesson the Navy learned long before the labs did.
Every major advance in artificial intelligence seems to follow the same pattern. First, we celebrate smarter models, then larger context windows, then more capable tools — and eventually, we discover that intelligence is no longer the limiting factor. Leadership is. Leadership determines the organization, and the organization determines whether intelligence becomes coherent action or efficient confusion.
An Organization That Thinks.
For two years, the conversation centered on the individual agent: its prompt (prompt engineering), its reasoning, its memory, its tools. Those conversations matter, but they quietly assume the wrong unit of analysis. The question is no longer “How do we build a smarter agent?” It is “How do we build an organization that thinks, and who decides what it should think about?” Leadership by consensus is NOT the answer.
That question returns me to something I learned years ago in the U.S. Navy, where I served both enlisted and as an officer. People unfamiliar with naval operations imagine a warship as a single integrated machine. It is not. It is an organization composed of organizations. The Combat Information Center (the nerve center) does not own combat. Navigation does not own positioning; it is the keeper of it. Fire Control does not own targeting. Each discipline builds deep expertise within its own domain while contributing to a single shared operational picture, and to the synergy it makes possible.
The reporting lines are clear; the information lines are far more fluid. Authority runs vertically while specialists trade information horizontally. Those horizontal links are the intriguing part. The organization succeeds because it balances command with collaboration. But coordination is not the first move. Before it can begin, someone has to define the mission, set priorities, and say what success looks like. The Combat Information Center does not invent the mission; it executes within it.
Leadership framed the problem. The organization coordinated the solution.
The Argument Is No Longer Hypothetical.
For most of AI’s recent history this was a useful metaphor. It is now a production dispute. In June 2025, two of the most respected teams in agentic AI published opposite conclusions within a day of each other.
Anthropic described the multi-agent system behind its research product: a lead orchestrator that plans a strategy and delegates to specialized subagents working in parallel, each reasoning in its own context window. On Anthropic’s internal evaluation, that arrangement outperformed a single top-tier agent by roughly 90 percent. The explanation could have come from an organizational-behavior textbook: past a threshold, groups of agents accomplish what no individual can.
The team behind Devin reached the opposite verdict. In a piece titled plainly “Don’t Build Multi-Agents,” they argued that parallel subagents are fragile by design: isolated from one another’s context, they make conflicting decisions and produce incoherent results. Their prescription was a single, coherent thread that never loses track of what it has already decided. A Berkeley study published the same year (MAST) sharpened the picture, cataloging numerous failure modes and sorting them into three families.
What surprised me during my own exploration of Deep Agents was discovering that the model itself was only one small part of the system. The runtime assembled memory, governance, tools, skills, middleware, and context before the model ever generated its first token. Watching those interactions through tracing changed how I thought about AI. I realized I wasn’t studying smarter models anymore—I was studying organizations.
Not one of those failure modes is a failure of intelligence. Perhaps there is a lesson in humility in that. Every one is a failure of organization.
Three Ways an Organization Fails.
Read that catalog closely, and the numerous failure modes collapse into three. None is about raw capability. Each is about how work is specified, coordinated, and carried out — the oldest problems in organizational life, rediscovered in silicon.
Specification
Unclear objectives, missing constraints, ambiguous ownership. The organization sets out to solve a problem no one defined precisely — and intelligence faithfully delivers the wrong thing.
Coordination
Conflicting decisions, lost context, dispersed authority. Specialists act in isolation and their outputs collide — one building in a style another could never see.
Execution
The right plan, a faulty delivery: a tool misfires, a step is skipped, a result is dropped. The intent was sound; the doing failed.
The leading labs are not discovering the limits of their models. They are rediscovering organizational theory — one production incident at a time.
The Trade-off: Coherence Versus Capability.
Both teams are right, which is exactly why the disagreement matters. They are caught on opposite horns of the same trade-off. Share all context across one mind, and decisions stay mostly coherent, but the system is capped at a single reasoning thread’s reach, and an overloaded context window becomes a problem of its own. Distribute the work across many minds, and you gain parallel reach, but you risk the failure the Devin team illustrated memorably: one subagent building a game background in one visual style while another builds a character in a clashing one, because neither could see what the other had silently decided.
Coherence or capability. For now, you appear to buy one by surrendering the other. Some practitioners combine both approaches: the outputs of multiple subagents feed back into a main, or team-leader, agent. I often try using the Cursor IDE’s multi-agent environment as the team coordinator, with Codex and Claude running in split terminals as subagents. That way, the coordinator knows which tasks are outstanding and is waiting for their results. It can work, and it reminds me of how I used to lead: don’t hover or micromanage, but be present and supportive!
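The coordinator arrangement can be sketched in a few lines of Python. This is a minimal illustration, not how Cursor, Codex, or Claude actually work: the names Brief, subagent, and coordinate are hypothetical, and the subagent is a stand-in for a real model call. The point is only that every worker receives the same record of decisions before it starts.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Brief:
    """Shared intent that every subagent receives before it starts."""
    goal: str
    decisions: dict[str, str] = field(default_factory=dict)


def subagent(task: str, brief: Brief) -> str:
    # Hypothetical worker: a real one would call a model here.
    style = brief.decisions.get("visual_style", "unspecified")
    return f"{task} in {style} style"


def coordinate(brief: Brief, tasks: list[str]) -> list[str]:
    # Fan out in parallel, but hand every worker the same brief so that
    # no one has to guess what was already decided.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda t: subagent(t, brief), tasks))


brief = Brief(goal="build a small game",
              decisions={"visual_style": "pixel-art"})
results = coordinate(brief, ["background", "character"])
```

Because the visual style is written into the brief rather than decided silently inside one worker, the background and the character come back in the same style.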
Precedent: The Navy Already Resolved This.
Here is what makes the naval case more than an analogy. Combat organizations faced this precise trade-off and built an institution to resolve it: the shared operational picture, maintained through a dedicated synthesis function — the Combat Information Center.
The CIC is neither the single mind nor the swarm. It is a deliberate architecture for holding shared situational awareness while expertise stays distributed but organically connected. Watch stations keep their specialties and their imagined autonomy. The picture they feed is common. Authority to act remains clear through supervisors, watch captains, and tactical action officers. No one confuses contributing to the picture with owning the decision. In fact, each watchstander has an obligation to report, and to act, when they disagree!
The fundamental organizational building block the labs are missing is a structured mechanism for maintaining shared situational awareness. Expecting technicians to just get it isn’t leadership! Anthropic’s orchestrator writes its plan to durable memory before the context window fills, so shared intent survives. The Devin team abandons distribution altogether to preserve one continuous record. Both are improvising substitutes for something high-reliability organizations formalized generations ago.
The labs are not short on intelligence. They are short on organizational architecture.
Where the Principle Holds — and Where It Doesn’t.
This is a design decision, not an ideology, and the evidence is specific about the boundary. Distributed structures excel at breadth-first, loosely coupled work — research, discovery, anything explored along independent threads. They struggle with tightly coupled work, where every move depends on the last. Anthropic itself notes the pattern is weaker for tasks like coding, where coherence is everything; the Devin team works precisely in that domain.
The Navy maps the same boundary. Situational awareness is distributed deliberately. But weapons release and ship handling stay under tight unity of command, because in those functions a moment of divided authority is a casualty. Mature organizations do not choose distribution or command as a creed. They assign each function the structure its coupling demands.
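As a rough sketch of that design rule, the choice of structure can be written as a function of how tightly coupled the work is. The Coupling labels, the choose_structure function, and the sample workload are all hypothetical names invented for this illustration.

```python
from enum import Enum


class Coupling(Enum):
    LOOSE = "loose"   # independent threads, e.g. research or discovery
    TIGHT = "tight"   # every step depends on the last, e.g. coding


def choose_structure(coupling: Coupling) -> str:
    """Pick a structure from the coupling of the work, not from preference."""
    if coupling is Coupling.TIGHT:
        return "single coherent thread"
    return "orchestrator with parallel subagents"


# Hypothetical workload; the task names are this sketch's own.
workload = {
    "survey the literature": Coupling.LOOSE,
    "refactor one module": Coupling.TIGHT,
}
plan = {task: choose_structure(c) for task, c in workload.items()}
```

The hard part in practice is judging the coupling, which remains a human call; the sketch only shows that the structure follows from that judgment.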
Where It Lives: The Organization Lives in the Runtime.
For two years I assumed, as most of us did, that an agent’s intelligence lived in its model (the LLM) and its character lived in its prompt. Watching a Deep Agents system assemble itself, through tracing, corrected me. Before a single token of reasoning occurs, a runtime gathers memory, governance, skills, tools, context, and human oversight, and composes them into a working whole. Only then is the model invoked.
That sequence is easy to miss and impossible to unsee, especially since most model makers keep this part in a black box, never to be seen. The architecture is not merely the prompt. It is the runtime that assembles memory, governance, skills, tools, context, and human oversight into a coherent organization before intelligence is ever invoked. The model does not contain the organization. The runtime does.
The language model is not the organization. The runtime is.
This reframes the entire stack. When we say “organization,” we are not describing something hidden inside the model’s weights. We are describing everything the runtime decides to place around the model — a design act, performed by people, every time the system runs.
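To make the ordering concrete, here is a minimal sketch of a runtime that assembles its context before any model call. It does not reproduce the Deep Agents runtime or any real library: Assembled, assemble, run, and fake_model are hypothetical names, and fake_model stands in for a real LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assembled:
    system_prompt: str
    tools: list[str]
    requires_approval: bool


def assemble(memory: list[str], governance: list[str],
             tools: list[str], human_in_loop: bool) -> Assembled:
    # Everything here happens before the model sees the request.
    sections = ["# Governance", *governance, "# Memory", *memory]
    return Assembled("\n".join(sections), tools, human_in_loop)


def run(request: str, ctx: Assembled, model: Callable[[str, str], str]) -> str:
    # Only now is the model invoked, inside the context built above.
    return model(ctx.system_prompt, request)


def fake_model(system_prompt: str, request: str) -> str:
    # Stand-in for a real LLM call; it just reports how much context it got.
    return f"[{len(system_prompt.splitlines())} context lines] {request}"


ctx = assemble(
    memory=["Customer prefers weekly summaries."],
    governance=["Never send email without approval."],
    tools=["search", "draft_email"],
    human_in_loop=True,
)
reply = run("Summarize this week.", ctx, fake_model)
```

Swapping fake_model for a different model changes nothing about assemble, which is the sense in which the organization lives in the runtime rather than in the model.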
Inheritance: A System Prompt Is Organizational Culture.
Once you can see the runtime, its layers start to look familiar. They arrive in order, each one constraining the next — and the sequence is almost eerily organizational.
The resemblance is not decorative. A system prompt functions much like organizational culture. Individual workers do not renegotiate the mission each morning; they inherit it. Likewise, agents inherit their objectives, boundaries, and governance before they ever receive a user request. Culture is simply the context an organization makes standing — true by default, applied before anyone has to ask.
Your brand is the context you make standing.
This is also where brand quietly lives. What an organization chooses to make standing context (its mission, its standards, its voice) is its culture to insiders and its brand to everyone else. For agents and models, that standing context lives in the ‘md’ (Markdown) files they are given.
And these layers nest the way a runtime layers its sources: an organizational identity sets the base, a professional discipline refines it, and an individual’s judgment sits on top, with each later layer overriding the earlier ones where they meet. Personal, professional, and organizational brands are not competitors for the same space. They are a context stack, resolved in order. The runtime reads the ‘md’ files layer by layer, in a defined order, to build that stack. It then feeds metadata and other information to the LLM so the model can make decisions without needing a very large context window or token budget.
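A context stack resolved in order can be sketched with Python's standard collections.ChainMap. The three layers and their keys are hypothetical examples, and a real runtime would read them from files rather than from literals.

```python
from collections import ChainMap

# Hypothetical layers, from broadest to most specific.
organization = {"voice": "formal", "mission": "serve the fleet", "review": "required"}
profession = {"voice": "precise"}
individual = {"review": "peer"}

# ChainMap searches its maps left to right, so the most specific layer
# goes first and overrides the broader ones wherever keys collide.
resolved = ChainMap(individual, profession, organization)

voice = resolved["voice"]      # set by the professional layer
mission = resolved["mission"]  # inherited from the organization
review = resolved["review"]    # overridden by the individual
```

Nothing is copied or deleted: the organizational default for voice is still there underneath, which is why removing a layer restores the one below it.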
Tracing Is Organizational Transparency.
For most of computing history, the unit of work has been a black box: input enters, output appears, and the reasoning in between is unavailable for inspection. Tracing dissolves that box. For the first time, I could watch the construction itself, not just the result. Let’s not forget ‘beforeware,’ either, and the fact that this process can loop with tool calls.
High-reliability organizations have depended on exactly this kind of visibility for decades: the flight-data recorder, the after-action review, the audit trail. Their reliability rests on auditability. Agent tracing offers something similar: the ability to inspect not only decisions, but the organizational processes that produced them. An organization you cannot observe (a black box) is an organization you cannot improve. Seeing a production trace for the first time made it obvious that most of the important organizational work happens before the model begins reasoning.
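The idea behind a trace can be shown with a small sketch that records each step as it happens. This is not the LangSmith API: the span helper, the TRACE list, and the step names are hypothetical, and the steps themselves are placeholders.

```python
import time
from contextlib import contextmanager

TRACE: list[dict] = []  # one record per step, in the order steps finish


@contextmanager
def span(name: str, **attrs):
    # Time the wrapped step and record it even if the step raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        TRACE.append({"name": name,
                      "seconds": time.perf_counter() - start,
                      **attrs})


# Placeholder steps standing in for real runtime work.
with span("load_memory", items=3):
    memory = ["a", "b", "c"]
with span("apply_governance", rules=1):
    allowed = True
with span("model_call"):
    answer = "done"

steps = [record["name"] for record in TRACE]
```

Reading the recorded steps back shows two of the three entries landing before the model call, which is the observation the trace made visible to me.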
Memory: Metadata Is Organizational Memory.
There is a quieter lesson in how the runtime feeds the model. It does not hand the agent four hundred documents and trust it to read them all. It hands the agent metadata — a compact index of what exists and when to reach for it — and lets the full content be retrieved only when the work actually demands it.
Organizations have always worked this way. They rarely hand every employee the entire policy manual. They expose only the knowledge required for the work at hand, and trust people to know where the rest is kept. Modern agent runtimes are converging on the same principle through metadata-driven context selection. The organization remembers far more than any individual worker is ever shown — and that, too, is by design.
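Metadata-driven context selection can be sketched as an index of short descriptions plus loaders that run only on demand. The Entry class, the policy names, and the loader helper are hypothetical, chosen only to illustrate the principle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Entry:
    name: str
    when_to_use: str
    load: Callable[[], str]  # returns the full content, only when called


fetched: list[str] = []  # records which documents were actually read


def make_loader(name: str, body: str) -> Callable[[], str]:
    def load() -> str:
        fetched.append(name)
        return body
    return load


def build_index(entries: list[Entry]) -> str:
    # The model sees one short line per document, never the bodies.
    return "\n".join(f"- {e.name}: {e.when_to_use}" for e in entries)


entries = [
    Entry("travel-policy", "Use when booking travel.",
          make_loader("travel-policy", "Full travel policy text.")),
    Entry("expense-policy", "Use when filing expenses.",
          make_loader("expense-policy", "Full expense policy text.")),
]
by_name = {e.name: e for e in entries}

index = build_index(entries)             # goes into the prompt up front
body = by_name["travel-policy"].load()   # fetched only because the task needs it
```

The index costs two short lines of context however long the policies are, and the expense policy is never read at all in this run.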
Scarcity: Leadership Is the Scarce Resource.
As reasoning models continue to improve, a different scarcity emerges. Organizations rarely fail because they lack people — or now, models — capable of solving problems. They fail because few leaders consistently define the right problems to solve. That last sentence is worth reading again!
AI changes neither reality. A language model can evaluate thousands of alternatives once the objective is clear. It cannot determine which five problems deserve the organization’s attention in the first place. That remains a leadership function. An executive’s most valuable contribution is not answering questions. It is defining the handful of questions that matter.
Capability Is the Rarer System.
If leadership is the scarce judgment, capability is the scarce system — and the two are so often confused they are worth separating cleanly. Intelligence solves a problem once. Capability is the organization’s ability to produce that outcome again, and again, reliably, after the brilliant individual has moved on. A model can be intelligent on Tuesday and unavailable on Wednesday; capability is what survives the calendar.
This is why the familiar complaint is slightly wrong. Organizations do not, in the end, need better AI. They need better capability systems — the architecture, memory, governance, and coordination that turn occasional intelligence into repeatable outcomes. Intelligence is becoming abundant and cheap. Capability remains rare and expensive, because capability is organizational, not cognitive.
Seen whole, the stack has a top — and the top is human.
Three Moves for Builders.
You do not need a research lab to apply this. You need one system you are responsible for — human, digital, or both.
Name the problem before the tooling
Resist starting with the model or the prompt. State the handful of problems the system exists to solve, and who owns each. Most “AI” failures are specification failures wearing a technical costume.
Design the runtime, not just the request
Decide deliberately what standing context surrounds the work: the memory it inherits, the governance it obeys, the tools it may reach for, and where a human stays in the loop. That assembly is the organization.
Instrument for capability
Trace it, so you can see how outcomes are produced — not just whether they appear. An outcome you can observe and repeat is capability. One you can’t is a lucky Tuesday.
The Next Decade.
We have spent recent years building better minds. The next decade will be spent building better organizations for those minds to inhabit — and clearer leadership to aim them. That shift is already visible in the only place that matters: in what the best teams argue about when their systems fail.
There is a personal version of this shift, too, and it may be the truer one. Months ago, the question on my desk was how to write a better prompt: prompt engineering. It has quietly become an entirely different question.
Prompt Engineering → Context Engineering → Organization Engineering
That is not a change of tooling. It is a change of altitude — from wording a request, to assembling the context around it, to designing the whole organization that decides what context to assemble in the first place.
Intelligence does not create organization. Organization creates usable intelligence.
The Industrial Revolution optimized machines.
The Information Age optimized computers.
The AI Age will optimize organizations of intelligence.
The organizations that outperform in the AI era will not necessarily possess the smartest models. They will possess the clearest leadership and the deepest capability. Leaders define the problems. Architects design the cognitive organization. The runtime assembles the knowledge, tools, and oversight. Intelligent agents execute within it. In fact, organizations are becoming cognitive systems.
It was never an engineering challenge. It was a leadership challenge.
AI will increasingly solve the problems we define. Our advantage will lie in defining the problems worth solving, and in building the organizations, human and digital, that can solve them again and again.
Cognition, “Don’t Build Multi-Agents” (2025).
Berkeley, study cataloging multi-agent system failure modes (MAST, 2025).
Practitioner observation: LangChain Deep Agents middleware and LangSmith tracing (2025).
