Why Agents Fail

The consensus on systems of record has shifted twice in the last three years. As someone who has been operating in the space, I feel the need to address certain matters directly as we enter 2026. In venture, the most dangerous lies are the ones everyone agrees on. My hope is that, in writing this, capital allocators might see a different path forward: one that leads them toward operators willing to commit decades to what's to come. The first wave of value from the transformer revolution has been captured. What lies ahead is larger.
In 2023, Silicon Valley's emergent narrative held that the legacy systems of record—SAP, Workday, Salesforce, ServiceNow—would be displaced by next-generation AI-native systems. The pitch confused an implementation detail for a structural moat: LLMs require vector-native databases, and surely the incumbents couldn't simply bolt one onto their existing relational infrastructure, or migrate away from it. Surely CIOs would stake their careers on unproven vendors, rip out decade-old integrations, and endure multi-year change management—because the interface felt more intuitive.
As we enter 2026, agents are the newly anointed disruptors. The prevailing narrative is that there's a core input these agents are missing in the enterprise: truth, or as engineers like to call it, context. Solve for context, the thinking goes, and the agent succeeds. This conviction has produced an ideological schism over where context actually resides. One camp argues it lies buried within legacy CRM, HCM, and ERP systems, awaiting liberation through better interfaces. Another insists it does not yet exist—that agents must construct their own representation of reality from first principles.
Context matters. But context without enforcement merely enriches the input to an output that remains unverified. The question worth asking is simpler: why do agents fail?
Intelligence Is Not the Bottleneck
The prevailing theory treats agent quality as a function of intelligence. More data, superior reasoning, richer context—turn these dials and performance improves. Continue turning and eventually the agent displaces the human. This theory misapprehends why labor requires humans in the first place.
The promise of agents is the promise of replacing labor. But you cannot replace labor you have not thoroughly understood. Intelligence does not obviate the need for validation. However capable the agent, its output must still be verified—by a system, by an institution, by a person. A laborer is not defined by intelligence. A laborer is defined by accountability: not merely the capacity to verify, but the obligation to stand behind that verification—to own the outcome when something fails, when a figure does not reconcile, when someone must answer for what occurred. Understood this way, agent failure becomes easier to diagnose.
Consider the domains where agents appear most formidable: composition, correspondence, strategic analysis. These share a structural feature: they lack enforcement. A weak ad still runs. A poorly reasoned email still gets sent. No gatekeeper exists, no system empowered to declare "this shall not pass." One may deploy a highly intelligent agent to draft a complex report, but no loop has been closed. The burden of verification has merely migrated to a human who must read, evaluate, and decide whether to endorse.
Enterprises do not fail because they lack intelligence. They fail because intelligence, on its own, does not confer permission to act. Every real action must pass through layers of tribal knowledge rooted in constraints that define what the organization is actually allowed and supposed to do. Most traditional domains within the enterprise—marketing, finance, legal, HR—are not bottlenecked on generating intelligent suggestions about their work. They are bottlenecked on certifying that an answer is admissible and executable—under law, policy, audit, operational reality, and the environment in which the work actually gets done.
Verification vs. Enforcement
Now consider the domain of software. Engineers write code. Code executes. Upon execution, the environment itself renders judgment—immediately, unambiguously. A program that fails that judgment fails visibly, at runtime, before it can cause damage downstream. This is a core property many domains lack in a real enterprise setting. A legal contract is not executed—it is interpreted. A strategy memo is not tested—it is discussed. These artifacts cannot be certified by the environment because they do not act upon it. This is why software has approached the verification problem in ways other domains have not—not because programmers are smarter than attorneys or accountants, but because code inhabits an environment capable of response.
Yet even in software, the human has not departed—merely ascended. The compiler catches syntax. The test suite catches specified behavior. Intent remains beyond their jurisdiction. Were it otherwise, quality assurance would not exist. Software narrows the gap between "this executes" and "this is correct." It does not close it. Most enterprise domains have not begun to narrow that gap at all. Their records encode not merely information but liability, institutional memory, political leverage. To share them is to expose vulnerability. To standardize them is to surrender control. Tribal knowledge persists—local, guarded, fragmented.
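The gap between "this executes" and "this is correct" can be made concrete with a toy sketch. The function and its tests below are hypothetical, not drawn from any real system: the test suite certifies everything it specifies, yet an unstated business rule—intent—remains beyond its jurisdiction.

```python
# Hypothetical rounding helper. Intent: round an invoice amount to cents,
# rounding half-cents UP (a business rule the tests never state).
def round_to_cents(amount: float) -> float:
    return round(amount, 2)  # Python's round() uses round-half-to-even


# The test suite catches specified behavior only.
def test_round_to_cents():
    assert round_to_cents(10.123) == 10.12
    assert round_to_cents(10.126) == 10.13


test_round_to_cents()  # passes: the environment certifies what was specified

# Intent remains beyond its jurisdiction: under the unstated rule, 10.125
# should become 10.13, but round-half-to-even yields 10.12. This executes;
# it is not correct.
print(round_to_cents(10.125))
```

The environment refused nothing here, because nothing it was told to check was violated—which is exactly why quality assurance still exists.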
This brings us to the question of enforcement.
Verification asks whether an output is admissible. Enforcement acts upon that verdict without appeal. A payment that fails OFAC screening does not advance to committee deliberation. It terminates. A trade that breaches a risk limit does not escalate for human review. It is rejected. These are environments in which the system itself possesses the authority to refuse—and that refusal is final.
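The distinction can be sketched in a few lines of code. Everything below is a hypothetical stand-in—real OFAC screening and risk systems are vastly more involved—but it shows the structural point: verification produces a verdict, and enforcement makes that verdict terminal rather than advisory.

```python
# Minimal sketch of enforcement vs. verification. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class Payment:
    beneficiary: str
    amount: float


class Rejected(Exception):
    """Terminal refusal: no appeal path, no review queue."""


SANCTIONED = {"ACME SHELL CO"}   # stand-in for a screening list
RISK_LIMIT = 1_000_000.0         # stand-in for a per-payment limit


def enforce(payment: Payment) -> Payment:
    # Verification asks whether the output is admissible...
    if payment.beneficiary.upper() in SANCTIONED:
        # ...enforcement acts on the verdict without appeal.
        raise Rejected(f"beneficiary {payment.beneficiary!r} failed screening")
    if payment.amount > RISK_LIMIT:
        raise Rejected(f"amount {payment.amount} breaches risk limit")
    return payment  # only admissible payments leave this function


enforce(Payment("Good Vendor Ltd", 50_000.0))  # advances
try:
    enforce(Payment("Acme Shell Co", 50_000.0))
except Rejected as err:
    print(err)  # terminates; it does not escalate for human review
```

The design choice that matters is the exception: an inadmissible payment cannot flow onward in a degraded state, because there is no code path that returns it.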
Why Agents Actually Fail
Agents fail because they are asked to bear responsibility within environments incapable of evaluating them. The output arrives: fluent, formatted, plausible. But there is no system to reject it. No mechanism distinguishes the correct from the merely coherent. We have made generation cheap. We have not made judgment automatic. The threshold for agent capability is neither intelligence nor context. It is whether output can be validated by the environment without human intermediation. Where this holds, genuine labor automation becomes possible. Where it does not, the ceiling remains generation.
The agentic leverage the industry awaits will not emerge from superior context or more sophisticated models. It will emerge from constructing environments capable of enforcement. Certain domains will never achieve this: trial advocacy, creative work, any endeavor suffused with irreducibly human judgment. The subjective resists systematization by its nature. Yet vast expanses of economically significant work are not inherently subjective. They are merely unenforced. The rules exist. The validity conditions exist. The infrastructure can be constructed—but only from within. The governing logic lies buried in the thicket of edge cases native to each domain: the exceptions, the permissibility conditions, the reasons a transaction is rejected. None of it is documented, let alone structured.
The first wave of value from the transformer revolution was captured by those who commoditized generation. The next wave will go to those who automate judgment at runtime. The former was about building the model. The latter is about building the world the model runs in—messy, slow, and irreducibly domain-specific. This is not a market that will be won by the fastest or the best-funded. It will be won by those willing to spend decades within specific domains, encoding logic that has never been written down, building tomorrow's environments for today's agents to scale within the chaos of the modern enterprise.