Everyone has an opinion on AI replacing developers. Mine fits in one sentence: I've never seen an agent replace a dev, but I've seen agents tear through work fast and well when it's framed precisely, and produce nonsense the moment you let them loose in the wild. That makes for a weaker pitch than a demo where the AI spits out an app in thirty seconds. It's also far more useful day to day.
I stopped typing my prompts one at a time into a chat window a while ago. I built a small chain of agents that split the work across a project, and running it on real deliverables (production code, audits, this blog) is how I learned where it holds and where it breaks. There's nothing magic in there. It's mostly plumbing and a few rules I picked up by getting burned.
An orchestrator and specialists, not one lone genius
The starting mistake is wanting a single agent that does everything. You ask it to architect, code the backend, build the frontend, write the tests and draft the docs in one go, and you get something mediocre across the board. A generalist in a hurry.
I split it differently. There's an orchestrator that takes the brief, breaks it down and hands out the work. Behind it, specialised agents each with a clear role: one that writes the spec and settles the architecture decisions, one for the backend, one for the frontend, one for containerisation and deployment, one that reviews the code and writes the tests, one that writes prose. The orchestrator doesn't code. It coordinates and it checks. That's the whole point of it: someone whose only job is to hold the big picture and not trust the others blindly.
Two settings make the difference in practice. First the reasoning budget, what I call the effort. An architecture decision deserves an agent that thinks long and wide, because a bad foundation rots everything that comes after it. Writing a Dockerfile is a known pattern, seen a thousand times, that calls for no particular depth. Giving both the same thinking budget wastes time and tokens on the second. So I modulate: a lot for the architect, almost nothing for the mechanical tasks. Then the tool scope. The writer doesn't need a browser or database access. The agent that deploys has no business poking around in my mockups. Restricting what each agent can touch shrinks the context it loads, keeps it focused on its job, and incidentally limits the damage if it goes off the rails.
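To make those two settings concrete, here is a minimal sketch of how roles, effort and tool scope could be declared. Everything in it is an assumption for illustration: the type names, the three effort levels, the tool list and the `canUse` helper are hypothetical and don't come from any particular agent framework.

```typescript
// Hypothetical configuration: names and values are illustrative only.
type Effort = "low" | "medium" | "high";
type Tool = "filesystem" | "shell" | "browser" | "database" | "deploy";

interface AgentConfig {
  role: string;
  effort: Effort;         // reasoning budget granted to this agent
  tools: readonly Tool[]; // anything not listed here is off limits
}

const agents: Record<string, AgentConfig> = {
  // Architecture decisions get the deepest thinking.
  architect: { role: "spec and architecture decisions", effort: "high", tools: ["filesystem"] },
  backend: { role: "backend code", effort: "medium", tools: ["filesystem", "shell", "database"] },
  frontend: { role: "frontend code", effort: "medium", tools: ["filesystem", "shell", "browser"] },
  // Known patterns such as a Dockerfile need almost no depth.
  devops: { role: "containerisation and deployment", effort: "low", tools: ["filesystem", "shell", "deploy"] },
  reviewer: { role: "code review and tests", effort: "medium", tools: ["filesystem", "shell"] },
  // The writer gets neither a browser nor database access.
  writer: { role: "prose", effort: "medium", tools: ["filesystem"] },
};

// Returns true only if the named agent exists and the tool is in its scope.
function canUse(agentName: string, tool: Tool): boolean {
  const agent = agents[agentName];
  return agent !== undefined && agent.tools.some((t) => t === tool);
}
```

With a table like this, an orchestrator could call `canUse("writer", "database")` before granting a tool and refuse when it comes back false. Declaring scope as an allow-list means a new tool stays unavailable to every agent until someone grants it on purpose.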
The gain isn't only speed. It's quality through separation. An agent that thinks about a single thing does it better than one that's juggling five.
An agent is only as good as its framing
Here's the lesson I'd get tattooed if I could: an agent is good exactly to the degree that its task is well defined. Well briefed, constrained by a clear spec, it's fast and precise. Badly briefed, it doesn't stop, it doesn't tell you "I don't have enough to go on". It fills the gaps. It invents a plausible assumption and runs with it, completely unfazed.
That's why I refuse to start any code before a spec is validated. Not out of dogma. Because the spec is the frame that stops the agent from embroidering. When I skip that step "to move faster", I pay for it twice over afterwards, untangling what the agent assumed in my place. Half a page of clear spec beats three paragraphs of fuzzy intentions and an agent left to its imagination. The model isn't to blame. You ask it to fill a void and it fills it. That's exactly what it was trained to do.
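The "no code before a validated spec" rule can be enforced mechanically instead of relying on discipline. The sketch below is one possible shape for that gate. The `Spec` fields, `missingFromSpec` and `startCoding` are hypothetical names, and which fields count as mandatory is an assumption.

```typescript
// Hypothetical spec shape: the required fields are an assumption.
interface Spec {
  goal: string;
  constraints: string[];
  acceptanceCriteria: string[];
  validatedBy: string | null; // the human who signed off, null until then
}

// Lists what is still missing, so the gap gets named instead of filled by the agent.
function missingFromSpec(spec: Spec): string[] {
  const missing: string[] = [];
  if (spec.goal.trim() === "") missing.push("goal");
  if (spec.constraints.length === 0) missing.push("constraints");
  if (spec.acceptanceCriteria.length === 0) missing.push("acceptance criteria");
  if (spec.validatedBy === null) missing.push("human validation");
  return missing;
}

// Refuses to hand anything to a coding agent until the spec is complete and validated.
function startCoding(spec: Spec): string {
  const missing = missingFromSpec(spec);
  if (missing.length > 0) {
    throw new Error(`Spec not ready, missing: ${missing.join(", ")}`);
  }
  return `Coding task started for: ${spec.goal}`;
}
```

The gate says out loud what a badly briefed agent never does: "I don't have enough to go on."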
Verify, especially when the agent looks sure of itself
The most dangerous trap isn't the agent that's visibly wrong. It's the one that's wrong with confidence, in a clean format, without the faintest sign of hesitation.
A real case that stuck with me. An agent tasked with auditing my own code recommended, very seriously, that I rename a configuration file. Specifically, that I rename proxy.ts back to middleware.ts. Except that under Next.js 16 the middleware file is now called proxy.ts. It's the framework's new convention, and my multilingual routing depends on it. Had I applied the recommendation, I'd have broken the site. The agent mistook a recent convention for a bug, because its intuition was still running on the earlier versions. Flawless output, confident tone, wrong conclusion.
Another time, an agent miscounted entries in a dataset and produced a wrong total with the same calm it would have had if it were right. Nothing in the form gives the error away. That's the real danger: the false answer looks exactly like the right one, point for point.
In both cases, what saved the day was the verification layer. The orchestrator doesn't pass a high-impact claim along without testing it, and I personally re-read anything touching decisions that can break something. The rule I apply: the heavier the consequence of a recommendation, the more I check it with my own eyes. Renaming a file that can take down the routing gets checked by hand, full stop. An agent's fluency is not proof. It's the opposite of proof, in fact, because it lulls your vigilance to sleep.
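That rule can be written down as a small triage function. This is a sketch under assumptions: the three impact levels, the check names and the `Recommendation` shape are hypothetical. The one deliberate design choice is that the agent's stated confidence is carried along but never consulted.

```typescript
// Hypothetical triage: levels and check names are illustrative.
type Impact = "low" | "medium" | "high";
type Check = "pass" | "automated-test" | "human-review";

interface Recommendation {
  summary: string;
  impact: Impact;          // what breaks if the recommendation is wrong
  agentConfidence: number; // 0 to 1, intentionally ignored below
}

// The required check depends only on the consequence, never on how sure the agent sounds.
function requiredCheck(rec: Recommendation): Check {
  switch (rec.impact) {
    case "high":
      return "human-review";
    case "medium":
      return "automated-test";
    case "low":
      return "pass";
  }
}
```

Under this scheme, a recommendation to rename a file the routing depends on would be tagged `high` and land in front of a human, however confident the agent that produced it.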
"It compiles" doesn't mean "it works"
When an agent hands you code that passes the build, the reflex is to tick the box. The pipeline is green, so it's fine. No.
A green build tells you one thing and one thing only: your code is syntactically valid and your types hold. It says nothing about what your pages return when a human actually opens them. I've had a blog article deployed and live that returned a full 404 while every check had passed. The compiler saw no problem with it, because to the compiler an error page is a perfectly valid page. It just has the wrong content. No agent that only looks at the code would have caught it. You have to start the real server and look at what it answers, with an actual test on the thing that's running.
That's the safeguard nothing replaces: a human checking, and a test against the live application. Not a test the agent claims to have run in its head. A test that produces an observable answer, an HTTP code, a page on screen, a result you can point your finger at.
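Such a test can be very small. The sketch below requests a list of pages from a running server and records the HTTP status of each. The names `smokeTest`, `Fetcher` and `SmokeResult` are hypothetical, and treating only a 200 as success is an assumption that a site with redirects would need to relax. The fetch function is passed in as a parameter so the check can be exercised without a network.

```typescript
// Minimal shape of the fetch function needed; the global fetch in Node 18+ fits it.
type Fetcher = (url: string) => Promise<{ status: number }>;

interface SmokeResult {
  path: string;
  status: number;
  ok: boolean;
}

// Requests each path from the running server and records the observed status code.
async function smokeTest(baseUrl: string, paths: string[], fetcher: Fetcher): Promise<SmokeResult[]> {
  const root = baseUrl.replace(/\/+$/, ""); // avoid a double slash when joining
  const results: SmokeResult[] = [];
  for (const path of paths) {
    const response = await fetcher(root + path);
    // Assumption: only a plain 200 counts as a working page.
    results.push({ path, status: response.status, ok: response.status === 200 });
  }
  return results;
}

// Returns only the pages that did not answer as expected.
function failures(results: SmokeResult[]): SmokeResult[] {
  return results.filter((r) => !r.ok);
}
```

Against a locally started server it could be called as `smokeTest("http://localhost:3000", ["/", "/blog/some-article"], fetch)`, where the port and paths are placeholders. A 404 on a freshly published article shows up in `failures` as an observable fact, which is exactly what the green build could not provide.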
Reliability comes from the structure, not the model
If I had to boil my conviction down to one idea, it would be this one, and it cuts against the prevailing talk. What makes a chain of agents reliable isn't the size of the model at the end of it. It's the structure around it: clear roles, ordered steps, a spec upstream, a review downstream, and a human keeping a hand on the consequential decisions. The big model helps, obviously. But a big model that's badly framed just produces more convincing falsehoods, which is worse.
Orchestrated well, this machinery takes the drudgery off my plate and genuinely speeds me up. The repetitive work, the boilerplate, the first drafts, the proofreading passes, everything that used to tire me without making me think, I delegate. What I keep is the framing at the input and the judgement at the output. The two ends where there's actually something to decide.
This article, you may have guessed, went through that chain. A writer agent produced a first draft from a brief I wrote, I re-read it, cut it, fixed it, and I'm the one signing it. The AI did the draft fast. The rest, the framing before and the control after, stays my job, and I don't see that changing any time soon.
If you're building this kind of chain in your team and want a perspective that's already hit the walls, let's talk.
