Five files, one repo: the complete Copilot customization stack
Five customization files, one .NET Aspire repo, one complete agent setup. How AGENTS.md, .instructions.md, SKILL.md, .prompt.md, and .agent.md work together in practice.
Every week someone asks me which agent setup they should use. Should they just chat with one model? Give it MCP tools? Or wire up a whole team of specialists with a planner, implementer, and reviewer?
The honest answer is that it depends on the job. But “it depends” is a lazy answer, so this post walks through the same realistic feature in all three setups, what each one tends to get right or wrong, and the decision framework I use to pick between them.
Before the comparison, clear definitions. These terms get used loosely, so here is what I mean by each one.
| Setup | What it is | Typical example |
|---|---|---|
| Single-agent | One model, one prompt, one context window. No external tools beyond what the IDE provides. | A chat session in Copilot or Claude where you paste code and iterate. |
| Agent-with-tools | One model that can call external tools through MCP or function calling. Still one context window, but it can read files, run commands, search docs, hit APIs. | Claude Code with a few MCP servers attached, or Copilot’s agent mode with tool access. |
| Multi-agent | Multiple specialized agents coordinated by an orchestrator. Each agent has its own role, prompt, and often its own context window. | A planner agent writes a spec, an implementer agent writes code, a reviewer agent checks it. |
The line between “agent-with-tools” and “multi-agent” blurs in practice. If your single agent can spawn subagents, is that one agent or many? For this post, multi-agent means at least two agents with distinct roles that hand work to each other.
Same task, same model family, same spec. Different orchestration.
The task: add rate limiting to an ASP.NET Core Minimal API service inside a .NET Aspire solution. A handful of endpoints, a Redis integration already wired up in the AppHost, service defaults for OpenTelemetry and health checks in place, and an existing xUnit test project using Aspire.Hosting.Testing. No current rate-limit implementation.
The spec called for per-IP and per-API-key limits using the built-in Microsoft.AspNetCore.RateLimiting middleware, configurable windows via appsettings.json, proper 429 responses with Retry-After headers, and integration tests that spin the AppHost up with DistributedApplicationTestingBuilder and hit both limiters. The distributed story (sharing counters across instances via Redis) was explicitly called out as a follow-up, not part of this first pass, because ASP.NET Core’s built-in rate limiter is in-process.
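To make the spec concrete, here is a minimal, language-neutral sketch in Python of the behaviour the integration tests are meant to pin down. It is not the ASP.NET Core middleware itself: the `FixedWindowLimiter` class, the `partition_key` helper, and the `X-Api-Key` header name are all assumptions for illustration, and the per-IP and per-API-key limiters are collapsed into one partitioned counter to keep it short.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FixedWindowLimiter:
    """In-process fixed-window counter, one window per partition key."""
    permit_limit: int
    window_seconds: float
    # partition key -> (window start time, requests counted in that window)
    _windows: dict = field(default_factory=dict)

    def check(self, key: str, now: float) -> tuple[int, dict]:
        """Return (status_code, extra_headers) for one request at time `now`."""
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # previous window expired, open a new one
        if count >= self.permit_limit:
            # Tell the caller how long until the current window ends.
            retry_after = math.ceil(start + self.window_seconds - now)
            return 429, {"Retry-After": str(max(retry_after, 1))}
        self._windows[key] = (start, count + 1)
        return 200, {}


def partition_key(headers: dict, remote_ip: str) -> str:
    """Prefer the API key when present, otherwise fall back to the caller IP."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    api_key = lowered.get("x-api-key")  # hypothetical header name
    return f"key:{api_key}" if api_key else f"ip:{remote_ip}"


limiter = FixedWindowLimiter(permit_limit=2, window_seconds=10)
key = partition_key({"X-API-KEY": "abc"}, "203.0.113.7")
results = [limiter.check(key, now=t) for t in (0.0, 1.0, 2.5)]
```

The first two requests pass and the third is rejected with a `Retry-After` value. Because the counters live in a plain dictionary, the sketch is in-process in the same sense as the built-in limiter, which is why sharing counters across instances is a separate piece of work.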
The model family was the Claude 4.x line, with Haiku used for worker roles in the multi-agent run. Here is what each setup tends to get right and wrong on a task like this.
The single-agent run is fast and cheap, but it forgets Aspire conventions. The typical symptom is adding the rate limiter directly in Program.cs of the API project, instead of extending the shared ServiceDefaults extension methods where health checks and telemetry already live. The code works. It does not fit the shape of the solution. On a multi-service Aspire setup, that difference shows up the moment you add a second API and realise the limiter is not there.
The agent-with-tools run spends noticeable time reading the AppHost wiring and the ServiceDefaults project before writing anything. That reading time pays for itself. It finds the existing Redis integration registered in the AppHost, wires the connection through service discovery rather than hardcoding a string, and writes tests that actually boot the distributed application with DistributedApplicationTestingBuilder. Where it tends to trip: tool-call loops on edge cases like API-key header casing, where it retries the same fix with minor variations instead of stepping back.
The multi-agent run is slower and more expensive, but cleaner on the first pass. A planner agent produces the spec, catches the header-case edge case, and explicitly places the rate-limit registration in ServiceDefaults so future Aspire services pick it up. An implementer works against that spec. A reviewer catches missing tests, typically around the Retry-After header or the 429 response shape. The trade-off is everything you do not see: the orchestration tokens, the handoff artifacts, the longer wall-clock time.
Across tasks like the one above, a pattern emerges. Here is the matrix I use now.
| Factor | Single-agent | Agent-with-tools | Multi-agent |
|---|---|---|---|
| Task complexity | Low | Medium | High |
| Codebase size | Small | Medium to large | Medium to large |
| Context needed | Fits in one prompt | Needs file reads, search | Needs isolated reasoning per step |
| Determinism required | Low | Medium | High |
| Cost sensitivity | High | Medium | Low |
| Debuggability | Easy | Medium | Hard |
| Setup effort | None | Low | Medium to high |
Two rules of thumb I find useful:
That second rule matters more than people expect. Multi-agent systems live or die on what they pass between agents. A vague spec between a planner and an implementer is worse than a single agent with a clear prompt. Good specs, clear task lists, and structured progress logs are what make the whole thing work.
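One way to keep a planner-to-implementer handoff from being vague is to give it a fixed shape and refuse to pass it on while required parts are missing. The sketch below is only an illustration: the `HandoffSpec` class, its field names, and the example file path are hypothetical, not a format any particular tool defines.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffSpec:
    """Hypothetical planner-to-implementer artifact; field names are illustrative."""
    goal: str
    files_in_scope: list[str] = field(default_factory=list)
    acceptance_checks: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List the reasons this spec is too vague to hand over."""
        found = []
        if not self.goal.strip():
            found.append("goal is empty")
        if not self.files_in_scope:
            found.append("no files in scope")
        if not self.acceptance_checks:
            found.append("no acceptance checks")
        return found


vague = HandoffSpec(goal="add rate limiting")
clear = HandoffSpec(
    goal="Register rate limiting in ServiceDefaults",
    files_in_scope=["ServiceDefaults/Extensions.cs"],  # hypothetical path
    acceptance_checks=["429 responses carry a Retry-After header"],
    out_of_scope=["sharing counters across instances via Redis"],
)
```

Here `vague.problems()` reports the missing scope and checks, while `clear.problems()` comes back empty. An orchestrator could block the handoff until that list is empty.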
Multi-agent looks expensive on paper, and often is, but model routing changes the picture.
A sensible split: orchestrator and planner on Sonnet, implementer on Sonnet (coding is where quality matters most), and reviewer, test-runner, and doc-updater roles on Haiku. At current Anthropic pricing, Haiku 4.5 sits at roughly a third of Sonnet 4.5’s per-token cost, so offloading bounded, focused roles to Haiku is a real saving without a real quality hit on those roles.
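The effect of that split is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes the rough one-third price ratio mentioned above and uses made-up token counts per role, so the result is only as good as those inputs.

```python
# Relative per-token price with Sonnet as the baseline. The 1/3 ratio is the
# rough figure from the text, not an exact price list.
RELATIVE_PRICE = {"sonnet": 1.0, "haiku": 1 / 3}

# Hypothetical token usage per role for one run (illustrative numbers only).
TOKENS_BY_ROLE = {
    "orchestrator": 10_000,
    "planner": 20_000,
    "implementer": 60_000,
    "reviewer": 30_000,
    "test-runner": 25_000,
    "doc-updater": 15_000,
}

# The split described above: planning and coding on Sonnet, the rest on Haiku.
ROUTING = {
    "orchestrator": "sonnet",
    "planner": "sonnet",
    "implementer": "sonnet",
    "reviewer": "haiku",
    "test-runner": "haiku",
    "doc-updater": "haiku",
}


def relative_cost(routing: dict) -> float:
    """Total cost in Sonnet-token equivalents for a given role-to-model routing."""
    return sum(
        tokens * RELATIVE_PRICE[routing[role]]
        for role, tokens in TOKENS_BY_ROLE.items()
    )


all_sonnet = relative_cost({role: "sonnet" for role in TOKENS_BY_ROLE})
routed = relative_cost(ROUTING)
saving = 1 - routed / all_sonnet  # fraction saved by routing
```

With these invented numbers the saving lands near 29%. The real figure depends entirely on how many tokens the Haiku roles actually consume in your runs.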
Latency does not benefit from routing the same way. Multi-agent is sequential by nature: each agent waits for the previous one’s output. Some roles parallelize (test and lint agents can run side by side, doc updates can run after the implementer commits), but the planner-implementer-reviewer chain stays serial.
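A rough way to see why parallelism helps less than expected: serial stages add up, while a parallel group costs only its slowest member. The durations below are hypothetical and exist only to show the shape of the calculation.

```python
# Hypothetical wall-clock seconds per role; the numbers are illustrative only.
SERIAL_CHAIN = {"planner": 40, "implementer": 120, "reviewer": 30}
SIDE_BY_SIDE = {"test": 50, "lint": 15}


def wall_clock(serial: dict, parallel: dict) -> int:
    """Serial stages add up; a parallel group costs only its slowest member."""
    return sum(serial.values()) + max(parallel.values(), default=0)


def fully_sequential(serial: dict, parallel: dict) -> int:
    """Baseline where every role waits for the one before it."""
    return sum(serial.values()) + sum(parallel.values())


with_parallelism = wall_clock(SERIAL_CHAIN, SIDE_BY_SIDE)
without_parallelism = fully_sequential(SERIAL_CHAIN, SIDE_BY_SIDE)
```

In this example, running test and lint side by side takes the total from 255 to 240 seconds. The 190 seconds of the serial chain are untouched.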
Rule of thumb: if you are running hundreds of these a day in CI, cost matters more than latency. If you are running a handful during an afternoon of focused work, latency matters more than cost.
Every setup fails differently. Knowing how they fail tells you what guardrails to put in place.
Single-agent setups fail in three characteristic ways.

Context collapse. The model runs out of room and starts forgetting earlier parts of the conversation. You notice it when it reintroduces a bug you already fixed.
Pattern invention. Without tool access to read the codebase, the model guesses at conventions and often guesses wrong. The code compiles, but it does not fit.
Silent scope creep. Without explicit planning, the model quietly expands the task. You asked for rate limiting, you got rate limiting plus a new config system you never asked for.
Agent-with-tools setups have their own three.

Tool-call loops. The agent retries the same failing tool call with slight variations, burning tokens without making progress. The classic signature is three near-identical error messages in a row.
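That signature is simple enough to detect mechanically. A minimal sketch, assuming the harness keeps a list of recent tool error messages; the `is_stuck` helper, the 0.9 similarity threshold, and the sample messages are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def is_stuck(errors: list[str], window: int = 3, threshold: float = 0.9) -> bool:
    """True when the last `window` error messages are all near-identical."""
    if len(errors) < window:
        return False
    recent = errors[-window:]
    first = recent[0]
    # Compare every later message against the first one in the window.
    return all(
        SequenceMatcher(None, first, other).ratio() >= threshold
        for other in recent[1:]
    )


# Made-up error messages that differ only in the attempt number.
looping = [
    "header 'x-api-key' not found (attempt 1)",
    "header 'x-api-key' not found (attempt 2)",
    "header 'x-api-key' not found (attempt 3)",
]
progressing = [looping[0], "build succeeded", looping[2]]
```

`is_stuck(looping)` is true and `is_stuck(progressing)` is false. A guardrail could use the true case to stop retrying and ask the agent to step back and re-plan.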
Context bloat. Every tool call adds output to the context. An aggressive file-reader can exhaust the context window on a medium codebase before it writes a single line.
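One common mitigation is to cap how much of each tool result is allowed into the context. The sketch below keeps the head and tail of an oversized output and notes how much was dropped. The `clip_tool_output` name and the default budget are hypothetical.

```python
def clip_tool_output(text: str, max_chars: int = 2000) -> str:
    """Keep the start and end of an oversized tool output, drop the middle."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    marker = f"\n[... {omitted} characters omitted ...]\n"
    head = max_chars // 2
    tail = max_chars - head
    # Slice from an explicit index so a tail of 0 yields an empty string.
    return text[:head] + marker + text[len(text) - tail:]


clipped = clip_tool_output("a" * 10, max_chars=4)
untouched = clip_tool_output("short output", max_chars=100)
```

Keeping both ends is a design choice: file listings and build logs often put the useful part (a summary or the final error) at the end rather than the start.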
Over-reading. The agent reads way more files than it needs because its prompt rewards thoroughness. You pay for exploration that did not change the answer.
Multi-agent setups add four more.

Lying handoffs. Agent A reports a task as done, but it is not. Agent B trusts the handoff and builds on broken foundations. This one hurts because the failure shows up downstream.
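A guardrail against this is to pair every "done" claim with an independent check and accept the handoff only when both agree. A minimal sketch, where the `Handoff` record and the `accept` function are hypothetical and the check is a stand-in for something like running the test suite:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Handoff:
    """Hypothetical handoff record passed from one agent to the next."""
    task: str
    claimed_done: bool
    verify: Callable[[], bool]  # independent check, e.g. run the test suite


def accept(handoff: Handoff) -> bool:
    """Trust the claim only when an independent check agrees with it."""
    return handoff.claimed_done and handoff.verify()


honest = Handoff("add 429 handling", claimed_done=True, verify=lambda: True)
lying = Handoff("add Retry-After header", claimed_done=True, verify=lambda: False)
```

`accept(honest)` passes and `accept(lying)` does not, so the failure surfaces at the handoff rather than downstream.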
Coordination overhead. The orchestrator spends more tokens deciding who does what than the workers spend actually working.
Runaway cost. Without a budget cap or iteration limit, a multi-agent loop can spiral. I have seen a misconfigured Ralph-style loop burn through a day’s token budget in an hour.
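Both guards fit in a few lines. The sketch below assumes a `step` callable that runs one iteration of the loop and reports whether it finished and how many tokens it used. That callable, the `run_loop` name, and the `BudgetExceeded` exception are all illustrative.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when the loop hits its token budget or iteration limit."""


def run_loop(
    step: Callable[[int], tuple[bool, int]],
    max_iterations: int,
    max_tokens: int,
) -> tuple[int, int]:
    """Run `step` until it reports done; return (iterations, tokens spent)."""
    spent = 0
    for iteration in range(1, max_iterations + 1):
        done, tokens_used = step(iteration)
        spent += tokens_used
        if done:
            return iteration, spent
        if spent >= max_tokens:
            raise BudgetExceeded(f"token budget hit after {iteration} iterations")
    raise BudgetExceeded(f"no result after {max_iterations} iterations")


# A loop that finishes on its second iteration, using 500 tokens each time.
finished = run_loop(lambda i: (i == 2, 500), max_iterations=5, max_tokens=10_000)
```

A step that never reports done would raise `BudgetExceeded` as soon as either limit is reached, instead of running until someone notices the bill.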
Debugging hell. When something goes wrong, which agent caused it? Good logging and per-agent artifacts make this tractable. No logging makes it nearly impossible.
The short version, with the caveats that matter.
The anti-pattern I see most often is reaching for multi-agent because it sounds sophisticated, for a task that a single well-prompted agent would have finished in two minutes. Complexity is not a feature.
The boundary between these three archetypes is thinning. Modern agents spawn subagents mid-task. MCP servers wrap what used to be full multi-agent systems behind a single tool call. The question is shifting from “which setup do I pick” to “which shape does this specific task want right now”.
My current working model: start with the simplest setup that might work, measure, and only add structure when the failure mode demands it. Context collapse means you need better context management, not always more agents. A pattern-invention problem means tool access, not always a reviewer agent. A determinism problem is usually where multi-agent pays off.
If you are starting fresh, here is the path I would take:
The best AI coding setup is the smallest one that gets the job done. Everything above that is complexity you will have to debug later.