The promise of artificial intelligence, particularly Large Language Models (LLMs), has often been framed around the idea of “agents” – systems that can autonomously understand complex requests, break them down into steps, and execute them with minimal human intervention.
At Stelia, we’ve built and deployed production-grade AI across some of the world’s most complex organisations, and it’s exactly the challenge of making agents reliable at scale that led us to build Stelia AI OS – a full-stack operating system that replaces the need to assemble and maintain an entire AI stack in the first place.
What we’ve found, repeatedly, is that while impressive demonstrations of broad agents exist, building truly reliable and scalable AI-powered applications using them is proving surprisingly difficult. The key isn’t necessarily less AI, but a more strategic deployment of it – focusing on specific tasks within well-defined workflows. This article explores how leveraging workflow managers, rather than relying solely on AI orchestration frameworks, unlocks rapid prototyping and dramatically improves scalability for real-world applications.
Why full-context agents struggle to scale
The core idea behind many AI agents is to provide the LLM with all relevant information – the user’s request, past interactions, knowledge base access, and even tools it can use. This “full context” approach sounds intuitive, but quickly runs into several roadblocks:
- Context window limits: LLMs have finite context windows. Complex requests, long histories, or large knowledge bases can easily exceed these limits, leading to truncation and information loss.
- “Lost in the middle” problem: Even within the context window, LLMs struggle to consistently focus on relevant information. Important details can be overlooked amidst a sea of text.
- Cost & latency: Processing larger contexts is significantly more expensive and slower, impacting both development costs and user experience.
- Hallucinations & drift: The more context an LLM has to process, the higher the risk of generating inaccurate or irrelevant responses. Maintaining consistency and reliability becomes a major challenge.
- Debugging complexity: When an agent fails, pinpointing the source of the error within a massive context is incredibly difficult. Was it the prompt, the knowledge base, or the LLM itself?
These issues aren’t insurmountable, but they create a significant barrier to building robust and scalable AI applications. The dream of a single, all-knowing agent often clashes with the practical realities of LLM limitations.
The workflow manager solution: decoupling intelligence from orchestration
Instead of tasking an LLM with everything, a more effective approach is to decompose complex tasks into smaller, well-defined steps and orchestrate them using a workflow manager. Think of it like an assembly line: each station performs a specific function, and the product moves sequentially through the process.
Here’s how it works:
- User request: The user initiates a request (e.g., “Find me the top 3 marketing blogs discussing AI content creation and summarise their latest posts”).
- Workflow definition: A workflow manager defines the steps required to fulfil the request. These steps might include:
  - Step 1: Keyword extraction: Identify key concepts (“marketing blogs”, “AI content creation”).
  - Step 2: Define web search: Use an LLM to define search terms based on inputs from the user.
  - Step 3: Web search: Use a search API to find relevant blogs.
  - Step 4: Content retrieval: Fetch the latest posts from those blogs.
  - Step 5: Summarisation (AI task): Use an LLM to summarise each post. This is where the AI comes in.
  - Step 6: Ranking & filtering (AI task): Use an LLM to rank the summaries based on relevance and quality.
  - Step 7: Presentation: Format and present the results to the user.
- Task-specific AI calls: Each step that requires intelligence leverages an LLM with a focused prompt. For example, the summarisation step might have a prompt like: “Summarise this blog post in three bullet points, focusing on key takeaways for marketing professionals.”
- Data passing: The output of each step is passed as input to the next, creating a clear and traceable flow.
Key difference: The LLM isn’t responsible for understanding the entire request or maintaining a long-term memory. It’s simply executing a specific task with well-defined inputs and outputs.
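To make the shape of this concrete, here is a minimal Python sketch of the decomposition (Steps 2 and 7 are omitted for brevity). Everything in it is illustrative: `call_llm`, `web_search`, and `fetch_post` are hypothetical stubs standing in for your model client, search API, and HTTP fetcher; in practice the workflow manager, not hand-written Python, would own queuing, retries, and observability around these steps.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; swap in your provider's SDK call here."""
    raise NotImplementedError

def extract_keywords(request: str) -> list[str]:
    # Step 1 - a focused AI call: no history, no tools, one narrow job.
    raw = call_llm(f"Return the key concepts in this request as a JSON array of strings: {request}")
    return json.loads(raw)

def web_search(terms: list[str]) -> list[str]:
    # Step 3 - plain code, not an LLM: a search API returns candidate blog URLs.
    raise NotImplementedError

def fetch_post(url: str) -> str:
    # Step 4 - CPU-bound content retrieval; consumes no tokens at all.
    raise NotImplementedError

def summarise(post: str) -> str:
    # Step 5 - a task-specific AI call with a simple, predictable prompt.
    return call_llm(
        "Summarise this blog post in three bullet points, focusing on key "
        f"takeaways for marketing professionals:\n\n{post}"
    )

def rank(summaries: list[str]) -> list[str]:
    # Step 6 - another focused AI call to order summaries by relevance and quality.
    raw = call_llm("Rank these summaries by relevance and quality; return a JSON array:\n" + json.dumps(summaries))
    return json.loads(raw)

def run_workflow(request: str) -> list[str]:
    # The orchestrator owns the control flow; each step's output feeds the next.
    urls = web_search(extract_keywords(request))
    posts = [fetch_post(u) for u in urls]
    return rank([summarise(p) for p in posts])
```

Note how the only AI calls are the narrow ones (Steps 1, 5 and 6); everything else is ordinary code.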

The benefits: rapid prototyping & scalable architectures
This approach unlocks several key benefits:
- Simplified prompts: Because each AI task is focused, prompts can be much simpler and more effective. You’re not asking the LLM to do everything at once; you’re giving it a clear, specific instruction. This leads to more predictable and reliable results.
- Reduced context requirements: Each AI call requires only the context relevant to that specific task, minimising cost and latency.
- Increased reliability: By breaking down the process into smaller steps, you can more easily identify and debug errors. If a summarisation fails, you know exactly where the problem lies.
- Improved scalability: Workflow managers are designed to handle large volumes of tasks concurrently. You can easily scale individual steps (e.g., add more summarisation workers) without impacting the entire system.
- Faster iteration: You can quickly swap out different AI models or tools for specific steps without rewriting the entire workflow. This allows for rapid experimentation and optimisation.
- Greater control & observability: Workflow managers provide detailed logging and monitoring capabilities, giving you complete visibility into the entire process.
- Larger dataset handling: Because you’re not loading massive datasets into the LLM context, you can process much larger volumes of data. You can stream data through the workflow in batches or use external databases to store and retrieve information as needed (a minimal batching sketch follows this list).
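Because each step is an ordinary function with serialisable inputs and outputs, scaling a step is just running more workers against it. Here is a minimal sketch of batched, concurrent summarisation using only the standard library; the `summarise` function from the earlier sketch is assumed, and the worker and batch-size numbers are arbitrary illustrative defaults.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, size):
    """Yield successive fixed-size batches from an iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def summarise_all(posts, workers=8, batch_size=32):
    # Stream posts through the step in batches so the full dataset never
    # sits in memory - or in one LLM context window - all at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(posts, batch_size):
            yield from pool.map(summarise, batch)  # one focused AI call per post
```

Adding capacity to this step means raising `workers` or running more copies of the process; nothing else in the workflow changes.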
Practical considerations & tools
- Workflow manager choice: If you’re using Stelia AI OS, the built-in workflow manager handles orchestration natively, with smart model routing, optimised token consumption, and full observability built in. Alternatively, you can look to third-party workflow engines such as Temporal or Airflow, though you will ultimately get a more integrated experience using our own built-in orchestration.
- AI model selection: Choose the right AI model for each task. You might use a different model for summarisation, translation, and sentiment analysis.
- Error handling: Implement robust error handling to gracefully handle failures at each step. Consider retries, fallback mechanisms, and alerting systems (a retry sketch follows this list).
- Data serialisation: Choose a consistent data serialisation format (e.g., JSON) to ensure seamless communication between steps.
- Security: Protect sensitive data by encrypting it in transit and at rest.
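Here is a minimal sketch of the error-handling and serialisation points above: a retry wrapper with exponential backoff around a single step, passing JSON between steps. The `fallback` hook and the delay values are illustrative choices, not prescriptions.

```python
import json
import time

def run_step(step, payload, retries=3, base_delay=1.0, fallback=None):
    """Run one workflow step with retries; inputs and outputs are JSON strings."""
    data = json.loads(payload)  # consistent serialisation between steps
    for attempt in range(1, retries + 1):
        try:
            return json.dumps(step(data))
        except Exception:
            if attempt == retries:
                if fallback is not None:
                    return json.dumps(fallback(data))  # e.g. a cheaper model or cached result
                raise  # surface to the workflow manager for alerting
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```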
Augmenting the workflow with an AI observer agent
While workflow managers provide excellent observability, a further layer of intelligence can be added through an independent “Observer Agent.” This agent doesn’t control the workflow; instead, it monitors its progress, identifies potential issues, and provides insights to improve performance. It operates alongside the workflow manager, consuming event data and leveraging LLMs for analysis.
How it works: The Observer Agent subscribes to events emitted by the workflow manager (e.g., task started, task succeeded, task failed, output generated). It then uses LLMs to:
- Track task status: Maintain a real-time view of the workflow’s progress, flagging stalled or failing tasks.
- Analyse results: Evaluate the outputs of AI-powered steps (e.g., summaries, rankings) for quality and relevance.
- LLM-as-judge: Employ an LLM to independently assess the quality of AI outputs. For example, after a summarisation step, the agent could prompt: “Evaluate the following summary for accuracy, conciseness, and relevance to the original blog post. Provide a score from 1-5.” (A minimal sketch of this loop follows the list.)
- Anomaly detection: Identify unexpected patterns or deviations from expected behaviour (e.g., unusually high latency, consistently low-quality scores).
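Here is a minimal sketch of the observer loop, reusing the hypothetical `call_llm` helper from earlier. The event shape is an assumption for illustration; a real deployment would subscribe to the workflow manager’s actual event stream and wire `alert` to a logging or paging system.

```python
import json

JUDGE_PROMPT = (
    "Evaluate the following summary for accuracy, conciseness, and relevance "
    "to the original blog post. Provide a score from 1-5 as JSON: "
    '{{"score": <int>, "reason": "<string>"}}\n\nOriginal:\n{source}\n\nSummary:\n{output}'
)

def alert(message: str) -> None:
    # Illustrative sink: replace with your logging/paging integration.
    print(f"[observer] {message}")

def observe(events, score_floor=3):
    """Consume workflow events; judge AI outputs and flag problems.

    `events` is any iterable of dicts like
    {"task": "summarise", "status": "succeeded", "input": ..., "output": ...}.
    """
    for event in events:
        if event["status"] == "failed":
            alert(f"task {event['task']} failed")        # stalled/failing tasks
            continue
        if event["task"] == "summarise":                 # LLM-as-judge check
            verdict = json.loads(call_llm(
                JUDGE_PROMPT.format(source=event["input"], output=event["output"])
            ))
            if verdict["score"] < score_floor:
                alert(f"low-quality summary: {verdict['reason']}")
```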
Benefits of the observer agent
- Proactive issue identification: The agent can flag potential problems before they impact the user experience.
- Automated quality control: The LLM-as-judge functionality provides an automated way to assess the quality of AI outputs, reducing the need for manual review.
- Data-driven optimisation: The agent can identify areas where the workflow can be improved, such as by using a different AI model or adjusting prompts.
- Reduced human intervention: The agent can automate many routine monitoring and quality control tasks, freeing up human operators to focus on more complex issues.
- Enhanced reliability: By continuously monitoring the workflow and identifying potential problems, the agent can help to ensure that it operates reliably.
Comparison
A test between Vertex AI Studio and Stelia’s workflow – part of Stelia AI OS – running prompts that were as similar as possible shows the difference in token consumption between a purely agentic solution and an agentic solution where coordination is managed by the workflow and tooling is executed by CPU workers.
| Stage | Vertex AI Studio (token estimate) | Stelia AI OS batch workflow (tokens) |
|---|---|---|
| 1st search terms | 73 | 229 |
| Gather search data | 2229 | 0 |
| Generate first report | 3782 | 845 |
| Research gaps | 5636 | 820 |
| Gather gap search data | 7621 | 0 |
| Final conclusion | 9231 | 5982 |
| Totals | 28,572 | 7,876 |
So, for the same steps (obviously with different LLM models and reports), the purely agentic run consumed roughly 3.6× the tokens for the same ‘feature’ – 28,572 versus 7,876, an increase of about 260%. Gathering data cost Stelia zero tokens, as it’s a CPU-based process that never touches the LLM, whereas the Vertex AI prompt was asked to gather the data itself, so there was token overhead for having the LLM handle that stage directly.
Conclusion
While the vision of a fully autonomous AI agent is compelling, building reliable and scalable applications requires a more pragmatic approach. By decoupling intelligence from orchestration using workflow managers, you can unlock the full potential of LLMs while overcoming many of their limitations. This approach enables rapid prototyping, improved scalability, and greater control – ultimately leading to more successful AI-powered solutions. If you’re looking for a platform built around exactly this approach, Stelia AI OS was designed from the ground up to make this kind of workflow-first, task-specific AI possible at enterprise scale. Don’t ask the LLM to be the agent; empower it to excel at specific tasks within a well-defined workflow.