
Why understanding application behaviour is the prerequisite for scaling AI

Scaling AI applications successfully begins with identifying behavioural constraints, then modifying architecture accordingly.

As AI systems move from experimental pilots into production-critical enterprise applications, the question of how to scale them reliably is front of mind.

Scaling AI and ML workloads has long been assumed to be achievable through the linear approach of adding more and more infrastructure, an approach proven successful with earlier generations of web applications and databases. We see this assumption baked into technical teams across the enterprise landscape: provisioning more GPUs as inference latency degrades, and accelerating infrastructure procurement conversations as soon as training jobs stall.

But in reality, scaling AI applications for reliable and lasting performance doesn’t begin with the infrastructure. It begins with determining application behaviour and ensuring that the solution designed supports the specific performance priorities required.

“You cannot scale what you do not understand. Understanding application behaviour dictates hosting and delivery success.”

Dave Hughes, Stelia CTO

Decisions around scaling typically begin with “how much?” before answering “what kind?”. By flipping these conversations on their head, we can consider how different workload types express distinct behavioural traits, and how architecting with these traits in mind enables production-scale delivery of enterprise applications.

Why behavioural traits must define requirements

Every application has distinct objectives and operational constraints that shape how it behaves under load. Understanding these behavioural traits is key to revealing which architectural requirements matter most for achieving performance at scale.

For example, a multiplayer gaming server’s highest priority is supporting concurrent users, which in production translates to holding thousands of persistent connections with continuous bidirectional data flow. A Minecraft server with 100 players logged in for 19-hour sessions demands long-lived stateful connections where session state must survive server restarts and memory must remain stable over extended periods.

Compare this to an e-commerce platform, where users adding items to a cart trigger short-lived HTTP requests, stateless interactions and variable, bursty traffic: the performance priorities change completely.

Each application’s behavioural traits directly correspond to the unique architectural requirements that performance at scale demands. While a gaming server with these performance demands requires connection-aware load balancing and graceful connection draining, an e-commerce platform’s architectural challenge shifts entirely toward sudden traffic spikes that demand elastic compute provisioning and cache efficiency.
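
To make that contrast concrete, below is a minimal sketch of graceful connection draining for a long-lived, stateful service, written in Python with asyncio. The port, the sessions.json snapshot path, and the echo logic are illustrative assumptions, not a prescribed implementation; production code would also bound the drain window rather than waiting indefinitely.

```python
import asyncio
import json
import signal

# In-memory session state, snapshotted to disk on shutdown so that
# sessions can be restored after a restart (path is illustrative).
sessions: dict[str, dict] = {}
draining = asyncio.Event()

async def handle(reader: asyncio.StreamReader,
                 writer: asyncio.StreamWriter) -> None:
    host, port = writer.get_extra_info("peername")[:2]
    key = f"{host}:{port}"
    sessions.setdefault(key, {"messages": 0})
    try:
        # Keep serving until the client disconnects or draining begins.
        # (readline() still blocks until the next message arrives; a
        # real server would also wake handlers when draining starts.)
        while not draining.is_set():
            line = await reader.readline()
            if not line:
                break  # client closed the connection
            sessions[key]["messages"] += 1
            writer.write(line)  # echo, standing in for real game logic
            await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(handle, "0.0.0.0", 9000)
    # On SIGTERM (Unix), stop accepting new connections but let
    # in-flight sessions finish, then persist state: graceful draining
    # rather than dropping thousands of live connections on deploy.
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM,
                                                  draining.set)
    async with server:
        await draining.wait()
        server.close()          # refuse new connections
        await server.wait_closed()
        with open("sessions.json", "w") as f:
            json.dump(sessions, f)

asyncio.run(main())
```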

In practice, no single definition of “an application” should exist within scaling discussions. Application behaviour spans multiple patterns, each demanding different scaling strategies and driving entirely different architectural choices.

The table below illustrates some of the considerations different applications require:

| Application type | Connection pattern | Primary challenge |
| --- | --- | --- |
| Multiplayer game servers, online trading platforms | Long-lived stateful connections (persistent TCP, WebSocket) | Connection density, memory stability, horizontal scaling without dropping sessions |
| E-commerce platforms, public APIs, content sites | Short-lived, bursty connections (stateless HTTP, variable request volume) | Sudden traffic spikes, elastic compute provisioning, cache efficiency |
| Analytics pipelines, search systems, recommendation engines | Data-intensive batch or streaming workloads | Data locality, I/O throughput, query latency optimisation |
| Model training, rendering, simulations | Compute-heavy, parallelisable, data-intensive distributed jobs | Job scheduling, data movement costs, checkpoint reliability |
| Real-time inference, AI-powered APIs | Compute-intensive (accelerator-bound), low-latency request/response | Model loading latency, request batching efficiency, cold start mitigation |

Breaking down an application’s behaviour in this way is a key first step, and the foundation for every subsequent architectural decision. From this position, engineering decisions can be made from a purposeful, problem-first perspective, architecting for optimal performance tailored to the specific workload rather than expecting a general approach to work universally.

The gap between theoretical scaling and enterprise reality

While beginning with application behaviour under load in mind is the ideal approach, the reality is that most enterprise applications evolve from prototypes built organically for immediate functionality, without complete architectural foresight of the requirements they will face at production scale.

At Stelia, we are often approached by teams struggling to progress successful pilots that grew through incremental feature additions, where scale was dismissed as a future problem until it became an urgent imperative. By this point, retrofitting an application designed without foresight costs both resources and time, as architectural decisions that made sense at prototype scale must be undone to remove production-scale blockers.

In the current market, understanding how an application actually behaves under load from the outset is both a technical and strategic priority. Organisations cannot afford to lose competitive advantage due to hidden scaling constraints that could have been addressed earlier. When behavioural constraints become visible early, modification can be targeted rather than speculative, enabling faster time to market and more reliable production performance.

How can enterprises change tack to enable effective scaling of AI workloads?

Closing the gap between a behaviour-first approach and the reality of moving enterprise pilots to production scale requires a fundamental restructuring of approach. This transformation begins with visibility, progresses through targeted modification, and concludes with infrastructure decisions that support the application’s actual behaviour rather than fighting against it.

1. Identify behavioural constraints from the outset.

Understanding application behaviour must begin with instrumentation under realistic load conditions, with the goal of observing actual runtime characteristics: profiling to determine where time is actually spent, where memory grows, and how data moves through the system.

These observations reveal the constraints that will determine whether the application is able to scale, and where modifications may be required.
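
Instrumentation doesn’t have to mean a full observability platform from day one. A minimal sketch in Python (the observe helper and the workload it wraps are hypothetical stand-ins) shows the kind of signal worth capturing:

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def observe(label: str):
    """Record wall-clock time and peak memory allocation for a block.

    A lightweight stand-in for full APM tooling: enough to see where
    time is actually spent and where memory grows under load.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{label}: {elapsed:.3f}s, peak allocation {peak / 1e6:.1f} MB")

# Wrap suspected hot paths while replaying realistic traffic.
with observe("feature_extraction"):
    data = [x ** 2 for x in range(1_000_000)]
```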

2. Modify the application to remove scaling blockers.

With constraints in full view, the changes required will be based entirely on the application’s behavioural profile, and these application-level changes should be made before infrastructure is used to compensate for, and hide, inefficiencies.

Modifications made at this stage create a dynamic whereby infrastructure supports well-behaved applications, rather than attempting to fix poorly architected ones.
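
A deliberately simplified Python sketch of one common blocker, loading a model inside every request handler, and its application-level fix. The handler names and the simulated load cost are illustrative assumptions:

```python
import threading
import time

_model = None
_lock = threading.Lock()

def load_model():
    """Stand-in for an expensive model load (disk read, weight init)."""
    time.sleep(2)  # simulate load cost
    return lambda x: x * 2

# Blocker: loading the model inside every request handler means each
# request pays the full load cost, and concurrent requests multiply it.
# Adding GPUs hides this inefficiency; it doesn't remove it.
def handle_request_before(x):
    model = load_model()
    return model(x)

# Fix: load once, share across requests (double-checked locking so
# concurrent first requests don't trigger duplicate loads).
def handle_request_after(x):
    global _model
    if _model is None:
        with _lock:
            if _model is None:
                _model = load_model()
    return _model(x)

print(handle_request_after(21))  # first call pays the load cost once
print(handle_request_after(21))  # subsequent calls reuse the model
```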

3. Architect hosting aligned to true behaviour.

Only after understanding and modifying an application’s behaviour can infrastructure decisions be made effectively, as instance types, orchestration patterns, and data locality strategies all flow directly from understanding an application’s performance requirements under load.

The behavioural traits identified at the outset translate into concrete architectural choices, and infrastructure is designed to support requirements rather than forcing the application to conform to whatever infrastructure is available.
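
As an illustration of traits translating into choices, a hedged Python sketch: the WorkloadProfile fields and the mapping itself are simplified assumptions, not an exhaustive decision framework.

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    """Behavioural traits observed during step 1 (fields illustrative)."""
    long_lived_connections: bool
    bursty_traffic: bool
    accelerator_bound: bool

def hosting_strategy(p: WorkloadProfile) -> list[str]:
    """Derive hosting decisions from behaviour, not the other way round."""
    choices = []
    if p.long_lived_connections:
        choices += ["connection-aware load balancing",
                    "graceful draining on deploys"]
    if p.bursty_traffic:
        choices += ["elastic autoscaling", "aggressive edge caching"]
    if p.accelerator_bound:
        choices += ["GPU instance pools",
                    "request batching at the serving layer"]
    return choices

# A real-time inference API: bursty and accelerator-bound.
print(hosting_strategy(WorkloadProfile(False, True, True)))
```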

4. Set appropriate governance and security boundaries.

Inevitably, different behavioural patterns demand different governance and security approaches. Real-time inference serving sensitive data operates under entirely different compliance and security requirements than batch training on anonymised datasets.

Data residency, access controls, and audit requirements must align with both the application’s behaviour and the sensitivity of the data it processes.
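
As a simplified illustration of that alignment, a residency-aware routing check might take the following shape in Python. The regions, policy table, and audit mechanism are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Request:
    region: str       # where the data subject resides
    sensitivity: str  # e.g. "public" or "pii"

# Illustrative policy table: PII must stay in-region and be audited.
ALLOWED_REGIONS = {"pii": {"eu": ["eu-west"], "us": ["us-east"]}}

def audit_log(req: Request, target: str) -> None:
    # Stand-in for a real audit trail (append-only store, SIEM, etc.).
    print(f"AUDIT: routed {req.sensitivity} data for {req.region} to {target}")

def route(req: Request) -> str:
    """Pick a serving region that satisfies residency and audit rules."""
    if req.sensitivity == "pii":
        target = ALLOWED_REGIONS["pii"][req.region][0]
        audit_log(req, target)
        return target
    return "global-edge"  # non-sensitive traffic can be served anywhere

print(route(Request(region="eu", sensitivity="pii")))
```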

Why full-stack expertise is essential

Executing this approach successfully, however, requires fluency across the entire stack. Application development, infrastructure provisioning, and performance optimisation are typically treated as separate disciplines with separate teams. But effective scaling demands understanding how these layers interact in operational environments.

Such fluency across the stack is rare. Most organisations have deep expertise in one layer but lack the cross-stack fluency needed to diagnose behavioural constraints, modify applications appropriately, and architect infrastructure that supports the resulting behaviour.

This is not a criticism of existing teams; it reflects how technical specialisation has evolved. But it does create a capability gap that must be addressed, either by building internal expertise or by partnering with those who possess this holistic systems understanding. The teams that scale AI workloads successfully in this next phase of AI impact will be those that treat operationalising AI at scale as a unified problem rather than a set of isolated challenges.

Reframing the scaling question

Scaling AI workloads effectively doesn’t come down to a question of infrastructure capacity, but to one of understanding: understanding how the application behaves under load, what constraints that behaviour creates, and how to architect systems that support rather than fight that behaviour.

The organisations moving successfully from pilot to production are those that begin with observation rather than procurement. They instrument to understand actual runtime characteristics, modify applications to address the constraints those characteristics reveal, and only then make infrastructure decisions based on how the modified application actually performs.

This approach requires a shift in how scaling problems are framed, flipping the conversation from “how much infrastructure is required?” to “what kind of application are we dealing with, and what does it need to operate effectively at scale?” Answer these questions first, and the infrastructure decisions follow naturally.
