Most people’s AI setup is a collection of prompts. Maybe a few projects, some saved instructions, a growing library of things that worked once. Each one exists on its own. None of them know about each other. When output quality drops, you tweak the prompt or start over – because there’s nothing connecting the pieces that would help you figure out what actually went wrong.
I ran my own AI work like this for months. Separate prompts, separate projects, no relationships between them. Every time something broke, I’d fix it in isolation. The fixes didn’t compound. The same problems showed up in different contexts because nothing linked the lessons together.
Last week I rebuilt the entire system. Not by writing better instructions. By giving the collection a structure – and giving that structure the ability to call its own parts.
How software solved this decades ago
Software developers don’t write one giant file that does everything. They write functions that call other functions. Each function does one thing. When function A needs something that function B handles, it calls function B, gets the result, and moves on. Function B might call function C. The whole thing forms a stack – a chain of calls, each one scoped to its job.
This is a call stack. It’s how every piece of software you’ve ever used is built. And it’s the mental model that’s missing from how most people structure their AI work.
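In code, the idea is small enough to show in a few lines. A sketch in Python; the function names and numbers are placeholders:

```python
def fetch_prices(symbol):
    # Function C: does one thing, returns raw numbers.
    return [101.2, 99.8, 100.5]

def average_price(symbol):
    # Function B: needs the raw numbers, so it calls fetch_prices.
    prices = fetch_prices(symbol)
    return sum(prices) / len(prices)

def build_report(symbol):
    # Function A: needs an average, so it calls average_price.
    return f"{symbol}: {average_price(symbol):.2f}"

print(build_report("ACME"))
# While fetch_prices runs, the call stack is:
#   build_report -> average_price -> fetch_prices
# Each frame is scoped to its own job and pops off when it returns.
```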
When your AI instructions are all in one document, everything is active all the time. Instructions for writing blog posts share space with instructions for website builds, document reviews, and formatting rules. The AI can't tell which ones matter right now. It treats everything as equally relevant, which means nothing gets the attention it needs.
A call stack solves this. Instead of a flat collection, you have a structure. A top-level protocol knows what’s available. When a specific task starts, the relevant protocol gets loaded. That protocol knows what sub-protocols it needs and calls them by name. Each layer is scoped to its job. When the job’s done, that layer is no longer active.
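A toy version of that layering, in Python. The protocol names here are illustrative, not my actual files:

```python
# Only the protocols needed right now are active; each layer drops off when done.
active_stack = []

def enter(protocol):
    active_stack.append(protocol)
    print("active:", " -> ".join(active_stack))

def leave():
    finished = active_stack.pop()     # this layer is no longer active
    print(f"done with {finished}; active:", " -> ".join(active_stack) or "(none)")

enter("top-level")          # knows what's available
enter("website-build")      # loaded when the task starts
enter("block-generation")   # sub-protocol, called by name
leave()                     # sub-task done, its layer drops off the stack
leave()
leave()
```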
What this looks like in practice
I run a system called MLOS – the Millionleaves Operating System. It governs how I work with AI across every project.
There’s one file that lives in every AI project: the orchestrator. It’s a registry – a list of every protocol, its version, and where to find it. It doesn’t contain the protocols themselves. It just knows where they live.
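In data terms, the orchestrator is not much more than this. A sketch; the version numbers and URLs are placeholders:

```python
# The orchestrator: protocol names mapped to a version and a canonical location.
# It holds pointers, not the protocols themselves.
REGISTRY = {
    "rapid-web-build": {"version": "2.1", "url": "https://example.com/protocols/rapid-web-build.md"},
    "acf-protocol":    {"version": "1.4", "url": "https://example.com/protocols/acf-protocol.md"},
    "content-writing": {"version": "3.0", "url": "https://example.com/protocols/content-writing.md"},
}
```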
When I start a session and say “I need to build a website,” the AI reads the orchestrator, identifies that website builds are governed by the Rapid Web Build protocol, and fetches it. That protocol, once loaded, says “block generation follows the ACF Protocol.” The AI goes back to the orchestrator, looks up the ACF Protocol, and fetches that too.
Protocols call sub-protocols by name. The orchestrator resolves names to locations. No protocol contains another protocol’s address – they just reference each other the same way functions reference other functions. The orchestrator is the resolver. If you know what DNS does for the internet, it’s the same idea.
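Here's roughly what resolution looks like in code, using the registry sketched above. The "calls:" convention and the fetch details are illustrative, not the actual protocol format:

```python
import urllib.request

def resolve(name):
    # The orchestrator as resolver: name in, location out.
    return REGISTRY[name]["url"]

def references(text):
    # Hypothetical convention: a protocol names the sub-protocols it calls
    # on lines like "calls: acf-protocol".
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines() if line.startswith("calls:")]

def load(name, loaded=None):
    loaded = {} if loaded is None else loaded
    if name in loaded:
        return loaded
    with urllib.request.urlopen(resolve(name)) as response:
        loaded[name] = response.read().decode("utf-8")
    for sub in references(loaded[name]):   # e.g. rapid-web-build -> acf-protocol
        load(sub, loaded)                  # resolved through the same registry
    return loaded

protocols = load("rapid-web-build")        # pulls in everything it references
```

Notice that no protocol carries another protocol's address; the registry is the only place a location appears.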
This means I never have stale instructions. Every protocol is fetched live from its canonical source. Update the source, and every project gets the current version next time it’s invoked. One file copied into each project. Everything else fetched on demand.
Why structure matters more than content
Most people think it’s about what the instructions say. It’s at least as much about how they’re organised.
AI models have a finite attention budget. Everything loaded into the context window competes for that budget. When your website-building instructions share space with your content-writing instructions and your document-review instructions, each set gets less attention. The model isn’t ignoring your instructions. It’s drowning in them.
A call stack means only the relevant instructions are loaded at any given time. The website build protocol is in context during website builds. The content protocol is in context during content work. They don’t interfere with each other because they’re never active simultaneously.
There’s a subtlety here that took me a while to understand. You can’t physically remove something from the context window mid-conversation – once it’s fetched, it’s there until the conversation ends or gets compressed. But you can logically offload it by changing which protocol is active. The AI stops applying rules from a protocol that’s no longer governing the current task. For full isolation, you start a new session. Within a session, scope and governance do the work.
This is exactly how software call stacks work. A function doesn’t disappear from memory when it returns – but execution moves on, and its variables go out of scope.
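Here's the same idea as a rough Python analogy, with invented rule text:

```python
# Everything fetched stays in memory (the dict), but only the active protocol
# governs. Switching 'active' is the logical offload; a new session is the
# only true removal.
fetched = {
    "website-build":   ["match the agreed field architecture", "follow the block naming scheme"],
    "content-writing": ["plain English", "no filler"],
}

active = "website-build"

def rules_in_force():
    return fetched[active]     # the rest is still present, just not governing

print(rules_in_force())        # website-build rules apply
active = "content-writing"     # scope changes
print(rules_in_force())        # website-build rules are out of scope, not erased
```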
The debugging problem
When something goes wrong in a flat prompt setup, you have no way to isolate the failure. Was it the writing instructions that caused the problem? The formatting rules? The tone guidance? The technical constraints? Everything was active, so anything could be the cause. You’re debugging a monolith.
When something goes wrong in a structured system, you know exactly which protocol was active. You know what it loaded, what it called, and what was in scope. The failure is traceable. You can fix the specific protocol that failed without touching anything else.
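The traceability can be as simple as a log of which protocol governed each step. A sketch, with invented step names:

```python
# Each step records the protocol that was governing it at the time.
trace = []

def run_step(step, protocol):
    trace.append((step, protocol))
    # ... the actual work happens here ...

run_step("collect requirements", "rapid-web-build")
run_step("generate field definitions", "acf-protocol")
run_step("write page copy", "content-writing")

# When the field definitions come out wrong, the trace names one suspect:
culprit = next(p for s, p in trace if s == "generate field definitions")
print(culprit)   # acf-protocol, not the whole system
```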
I had this happen during a client website build last week. The AI generated field definitions that didn’t match the architecture we’d agreed on. In a flat system, I’d have rewritten the prompt and hoped for the best. In the structured system, I checked which protocol was governing the field generation step, found an ambiguity in how it referenced the design specification, and fixed that one line. Next run, clean output.
That’s not a prompt improvement. That’s engineering.
The honest limitation
This approach has a cost. Every protocol fetch uses a turn and consumes tokens. Split your system too aggressively and you spend more time loading instructions than doing work. The right level of decomposition emerges from practice – which protocols actually cause context pressure in real sessions, which ones are small enough to stay combined.
I got this wrong initially. I had governance layers that were heavier than the work they governed. The fix was a deliberate freeze on new protocol development – stop adding structure, start generating evidence about what the existing structure actually needed.
The architecture is right. The granularity is tuned by use.
What this means for you
You don’t need my system. You don’t need protocols or orchestrators or registries. Those are my implementation.
What you might need is the mental model.
If you’re doing anything with AI beyond single-shot questions – content production, data analysis, multi-step workflows – your instructions have a structure problem, whether you’ve noticed it or not. Everything loaded at once. No scoping. No way to debug when output quality drops.
The call stack isn’t a developer concept borrowed for convenience. It’s the solution to a problem that every serious AI user hits eventually: your instructions are correct, your AI is capable, and the output is still wrong – because everything is competing for attention and nothing is scoped to the task at hand.
Start simple. Separate your instructions by task type. Load only what’s relevant. Give each set a clear scope. Notice what happens to your output quality.
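If it helps to see that concretely: a minimal sketch, assuming nothing more than one instruction file per task type.

```python
from pathlib import Path

# One file per task type: blog-post.md, data-analysis.md, review.md.
# Load only the one that matches the current job; leave the rest on disk.
INSTRUCTIONS = Path("instructions")

def instructions_for(task_type):
    return (INSTRUCTIONS / f"{task_type}.md").read_text()

prompt = instructions_for("blog-post") + "\n\nDraft a post about call stacks."
```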
Then notice that you’ve just built the first layer of a call stack.

