When an enterprise AI agent fails, the instinct is to upgrade the model. Here, three case studies show what happens when teams prioritize multi-agent architecture instead.
When Aaron Sneed launched his defense technology startup, he could not afford to hire lawyers, accountants, or HR staff. So he built them.
Working from Florida, the 40-year-old solopreneur spent months training a group of 15 AI agents he calls "The Council." Each agent handles a different job. There are agents for HR, finance, legal work, supply chain, manufacturing, security, and field operations. At the head of the metaphorical table sits a chief-of-staff agent, which Sneed describes as "the voice that sets priority based on parameters like risks, issues, and opportunities."
There’s a hierarchical organizational structure, even amongst the agents. The chief of staff enforces the rules, while recommendations from legal, compliance, and security agents carry more weight. When Sneed faces a conundrum, he posts a document in a shared chat and watches all 15 agents weigh in at once. He calls it a roundtable. The setup, he told Business Insider, saves him roughly 20 hours a week.

It’s an unusual business operation, but what Sneed learned along the way is invaluable for organizations with ten or ten thousand employees alike. When he first started, he tried to assign everything to one agent. It did not work. The agent kept getting confused, missing details, and giving answers that sounded right but fell apart under pressure from his actual lawyer. The fix was structural, not a smarter model. Sneed split the work, gave each agent a defined role, set parameters for weighing recommendations from each agent, and built an infrastructure that let them cross-check each other.
That insight, at scale, is one of the most important shifts happening in enterprise AI right now. Most enterprise AI agent failures are not caused by weak models, but by asking one agent to do work that should be split across several agents that hand off to each other cleanly.
An AI agent is a program built on top of a large language model, such as those that power ChatGPT or Claude. The model gives the agent the ability to read, write, and reason. The agent layer adds memory, tools, and a job description. You can tell an agent to search the web, pull data from a database, write an email, or take an action in another system.
The trouble starts when companies try to make one agent do everything. A single agent, given a complex job, has to hold all the work in a single “mental workspace,” known as a context window. It has to remember the work it's already done, decide which tool to use next, and keep track of dozens of partial answers.
And the context window can only hold so much. When the workspace fills up, the agent starts dropping threads: forgetting past work, repeating tasks, or ceasing to respond altogether.
A multi-agent AI system splits a single large job among several specialist agents. Each agent gets its own workspace. A lead agent, sometimes called a supervisor, breaks the job into pieces, hands the pieces to the specialists, and combines the results when they come back.
Structure is critical to multi-agent AI systems, and three patterns dominate. The most common is “supervisor and worker,” where a lead agent directs specialists. Another is a sequential pipeline, where agents pass work down a line, each one finishing a step before the next begins. There are also hierarchical setups in which teams of agents are managed by other agents. All of them share a common idea: separating the work so that no single agent gets buried.
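Most enterprise leaders reach for a better model when their AI agent underperforms. The instinct makes sense: if the output is bad, upgrade the brain. But the data tells a different story. According to Anthropic's engineering team, token usage alone (roughly, how much text the system reads and writes while working on a task) explained 80 percent of the performance variance on its BrowseComp browsing evaluation, with the number of tool calls and the choice of model as the other two explanatory factors. Multi-agent architectures are a way to spend those tokens productively, because each agent brings its own context window. The model does matter, but the architecture matters more.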
Three specific problems keep showing up in enterprise deployments: context overload, handoff friction, and tool and domain overload. Each one has a multi-agent fix.
The three case studies that follow show how real enterprise teams use multi-agent AI to solve each of these problems.
Anthropic, the AI company that builds Claude, ran a direct test. The task: identify every board member at every IT company in the S&P 500. For a single AI agent, this is brutal. The agent has to research hundreds of companies, decide which sources to trust, handle different website structures, and hold the partial answers from every search in its memory at the same time.
When Anthropic ran this with a single Claude Opus 4 agent working through the list one company at a time, the agent struggled to keep the full picture in view and could not reliably finish the job.
The team then built a multi-agent AI system using a supervisor and worker pattern. The lead agent, running on Claude Opus 4, broke the task into smaller pieces. It spawned subagents running on Claude Sonnet 4 to handle each piece in parallel. Each subagent worked in its own fresh workspace, focused on one part of the list, used a tight set of search tools, and returned its findings. The lead agent then combined everything into a final answer.
The multi-agent system outperformed the single-agent setup by 90.2 percent on Anthropic's internal research evaluation. The engineering team explained the result this way: the architecture distributes work across agents with separate context windows, which adds capacity for parallel reasoning. The same problem that broke a single agent became solvable when the workload was split.
When a single agent is asked to handle too many sources, too much volume, or too many partial results, performance falls apart. Splitting the work across specialist subagents, each with a narrow focus and its own workspace, removes the overload.
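The fan-out described above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: `run_subagent` is a hypothetical stub that stands in for a model call with search tools, and the chunk size and worker count are arbitrary assumptions. The point it shows is that each subagent starts with an empty workspace, so no single context ever holds more than one slice of the list.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_subagent(companies):
    """Hypothetical subagent: researches one slice of the list.

    A real subagent would call a model and search tools. This stub only
    records one note per company so the example runs offline.
    """
    context = []  # fresh workspace, not shared with any other subagent
    for name in companies:
        context.append(f"notes on {name}")
    findings = {name: f"board of {name}" for name in companies}
    return {"findings": findings, "context_items": len(context)}


def run_lead_agent(companies, chunk_size=3, max_workers=4):
    """Lead agent: split the list, fan out in parallel, merge the results."""
    pieces = chunk(companies, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, pieces))
    merged = {}
    for result in results:
        merged.update(result["findings"])
    # Size of the largest workspace any one subagent had to hold.
    largest_context = max(r["context_items"] for r in results)
    return merged, largest_context


companies = [f"Company {i}" for i in range(1, 11)]
answer, largest_context = run_lead_agent(companies)
print(len(answer), largest_context)
```

With ten companies and slices of three, the lead agent ends up with all ten findings while the largest subagent workspace holds only three items.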

The trade-off, of course, is cost. Multi-agent systems use roughly 15 times as many tokens as single-chat interactions, so they make sense for high-value tasks but are overkill for simple ones.
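A rough way to reason about that trade-off is to compare the value of a task with its estimated token bill. The sketch below uses the 15x multiplier quoted above. The token count, the price per million tokens, and the simple "value exceeds cost" rule are illustrative assumptions, not figures from any vendor.

```python
# Multiplier relative to a single chat interaction (the figure quoted above).
MULTI_AGENT_TOKEN_MULTIPLIER = 15


def estimated_cost(chat_tokens, price_per_million_tokens, multiplier=1):
    """Estimated cost in dollars, scaled from a chat-sized token count."""
    return chat_tokens * multiplier * price_per_million_tokens / 1_000_000


def worth_splitting(task_value, chat_tokens, price_per_million_tokens):
    """True when the task's value exceeds the estimated multi-agent cost."""
    cost = estimated_cost(
        chat_tokens, price_per_million_tokens, MULTI_AGENT_TOKEN_MULTIPLIER
    )
    return task_value > cost


# Hypothetical numbers: a 20,000-token chat at $10 per million tokens.
chat_cost = estimated_cost(20_000, 10.0)
multi_cost = estimated_cost(20_000, 10.0, MULTI_AGENT_TOKEN_MULTIPLIER)
print(chat_cost, multi_cost)
```

Under these assumed numbers the multi-agent run costs dollars rather than cents, which is trivial for a research task worth hundreds of dollars and wasteful for a one-line lookup.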
Vendasta, a software company that serves local businesses, had a problem with its sales team. Sales development reps (SDRs) research new prospects, set up first meetings, and pass qualified leads to closers. Vendasta's SDRs were spending much of that time on manual administrative work. The company calculated that the team was losing 282 working days a year to it, costing more than $1 million in missed pipeline.
The handoffs were the issue. Every step required a human to gather information, clean it up, and pass it to the next step. But each handoff slowed things down, and sometimes data got lost along the way.
Using Zapier, Vendasta built a sequential pipeline that works like an assembly line. When a new prospect comes in through a form or event, the system pulls in contact details from Apollo and Clay, two outside data tools that fill in company information. An AI agent summarizes what is known about the prospect. The record is logged in Vendasta's customer relationship management system (CRM). A routing agent then sends the prospect to the right sales rep instantly.
A second pipeline handles work after a sales call. The call transcript gets summarized automatically, key takeaways are logged in the CRM, and a personalized follow-up email is drafted and queued up for the rep to review before sending.
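The assembly-line idea can be sketched as a list of stages, each taking the record produced by the previous one. This is an illustration in plain Python, not Vendasta's Zapier configuration: the stage functions, field names, and region-to-rep mapping are all hypothetical stand-ins for the enrichment, summarization, CRM, and routing steps described above.

```python
def enrich(record):
    """Stage 1: add company details (stand-in for an external data lookup)."""
    return {**record, "industry": "home services", "company_size": "11-50"}


def summarize(record):
    """Stage 2: write a short summary (stand-in for an AI summarization step)."""
    summary = f"{record['name']} at {record['company']} ({record['industry']})"
    return {**record, "summary": summary}


def log_to_crm(record):
    """Stage 3: mark the record as logged (stand-in for a CRM write)."""
    return {**record, "logged": True}


def route(record):
    """Stage 4: assign a rep by region (hypothetical routing rule)."""
    reps = {"north": "rep_a", "south": "rep_b"}
    rep = reps.get(record.get("region"), "rep_default")
    return {**record, "assigned_rep": rep}


PIPELINE = [enrich, summarize, log_to_crm, route]


def run_pipeline(record, stages=PIPELINE):
    """Pass the record through each stage in order, with no manual handoff."""
    for stage in stages:
        record = stage(record)
    return record


prospect = {"name": "Dana", "company": "Acme Plumbing", "region": "north"}
result = run_pipeline(prospect)
print(result["summary"], "->", result["assigned_rep"])
```

Because every stage returns a new record rather than editing the old one, a failed step can be retried from the last good output without rerunning the whole line.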
"Our sales reps are able to focus on deals and not have to worry about doing all these tedious tasks before they're able to get to the next deal."
—Jacob Sirrs, Marketing Operations Specialist, Vendasta
The results came from removing the handoff problem. Information now moves from one step to the next without a human in the middle. Vendasta saves 15 minutes per call, frees up 1,200 minutes (20 hours) of work each day across 20 reps, and credits about $1 million in reclaimed pipeline revenue to the change.
BASF is one of the largest chemical companies in the world. Its Coatings division, with more than 11,000 employees across over 70 sites, makes automotive and industrial coatings, and its sales team has a research problem common to large enterprises: too many tools, too many data sources, and no unified way to ask questions.
Before multi-agent AI, a BASF Coatings sales rep preparing for a customer meeting had to pull together information from very different places. Customer visit reports lived in Salesforce, market consumption insights were stored in internal data tables, and external market news lived as PDF files and free-text reports.
This is a tool and domain overload problem. A single AI agent given access to all those tools would struggle to pick the right one for each question. Ask it for last quarter's sales numbers, and it might search the PDFs. Ask it about industry news, and it might query the database. There were just too many ways to fail.

In partnership with Databricks, BASF Coatings built Marketmind, a multi-agent AI system delivered to sales reps through Microsoft Teams. Marketmind is the assistant that the reps actually talk to. Behind it sits a supervisor agent that routes each question to the right specialist. One specialist, called a Genie agent, handles structured data, including the Salesforce visit reports and sales tables. Another specialist handles unstructured data, scanning PDFs and market news through vector search. The supervisor reads each question, decides which specialist (or both) should answer, and combines the results.
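The routing step can be sketched as a supervisor that picks one or both specialists for each question. This is a simplified illustration, not how Marketmind is built: a production supervisor would typically use a model to classify the question, whereas this sketch uses keyword sets, and the two specialist functions are hypothetical stubs.

```python
# Hypothetical keyword hints. A real supervisor would likely use a model here.
STRUCTURED_HINTS = {"sales", "revenue", "visits", "volume"}
UNSTRUCTURED_HINTS = {"news", "report", "reports", "trend", "announcement"}


def structured_agent(question):
    """Stub specialist for tables and CRM records."""
    return f"[tables] answer to: {question}"


def unstructured_agent(question):
    """Stub specialist for PDFs and market news."""
    return f"[documents] answer to: {question}"


SPECIALISTS = {
    "structured": structured_agent,
    "unstructured": unstructured_agent,
}


def choose_specialists(question):
    """Return the specialists to consult, falling back to both if unsure."""
    words = set(question.lower().replace("?", "").split())
    chosen = []
    if words & STRUCTURED_HINTS:
        chosen.append("structured")
    if words & UNSTRUCTURED_HINTS:
        chosen.append("unstructured")
    return chosen or ["structured", "unstructured"]


def supervisor(question):
    """Route the question, collect each specialist's answer, and combine them."""
    answers = [SPECIALISTS[name](question) for name in choose_specialists(question)]
    return "\n".join(answers)


print(supervisor("What were last quarter's sales numbers?"))
print(supervisor("Do recent news reports explain the drop in sales?"))
```

The design choice that matters is that each specialist sees only the tools for its own kind of data, so the tool-selection mistakes described earlier are handled once, in the supervisor, rather than in every answer.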
"Marketmind turns our field interactions into timely, AI-driven actions, nudging smart follow-ups, surfacing relevant opportunities, and connecting peers facing similar challenges. The result: faster prep, sharper customer conversations, and more time selling where it counts."
—Adrian Fierro, Head of Global Market Intelligence, BASF Coatings
BASF Coatings began scoping in April 2025, ran a five-to-six-week proof of concept, piloted with 25 users, and launched in North America that October. The rollout target is more than 1,000 sales representatives worldwide.
All of the case studies cited here started in the same place. A team had a complex job, tried to solve it with one AI agent, the agent fell short, and the fix turned out to be a better structure.
Before approving the next model upgrade, enterprise leaders can ask three diagnostic questions about any underperforming AI deployment.
Aaron Sneed figured this out on his own, by trial and error, with 15 agents at a roundtable. Enterprise teams have the benefit of his lesson and of the production examples that have followed it. The next best AI investment is probably not a new model, but a better structure for combining the tools already in hand.
Multi-agent AI systems use roughly 15 times more tokens than single chat interactions, according to Anthropic. The trade-off makes sense for high-value, complex tasks where coordination problems would otherwise cause failures. For simple, short tasks that one agent can handle in a single context window, multi-agent setups are overkill.
The three most common patterns in production are supervisor and worker, sequential pipeline, and hierarchical multi-agent. Supervisor and worker models use a lead agent to coordinate specialists. Sequential pipeline structures pass work from one agent to the next like an assembly line. Hierarchical setups have teams of agents managed by other agents.
A supervisor and worker pattern uses a lead agent to direct several specialist agents. The supervisor reads each request, decides which specialist should handle it, sends the work, and combines the results. BASF Coatings uses this pattern in Marketmind to route between structured data and unstructured data agents.
Multi-agent AI beats a single agent when the task is large, multi-source, or requires routing across different tools. Anthropic found that multi-agent setups outperformed a single agent by 90.2 percent on a complex research task. But single agents remain a better fit for short, narrow jobs that fit comfortably in one context window.
A multi-agent AI system splits one job across several specialist agents that each work in their own context window. A lead agent breaks the job into pieces, hands the pieces to the specialists, and combines the results.