Last week, Dave Hughes, Stelia’s CTO, spoke at AICamp’s ML meetup, sharing lessons learned from running AI workloads in production.
It was an evening of candid discussion about what’s actually happening beneath the surface of an industry moving at breakneck speed. Drawing on experience spanning high-performance computing, machine learning infrastructure, and operationalising AI at production scale, Dave explored why talented teams with promising prototypes consistently hit the same wall.
Below are the key insights from that discussion: how the industry arrived at this point, how infrastructure must evolve to enable production AI at scale, and practical guidance for teams navigating this challenging landscape.
How the industry arrived here
When ChatGPT and GPT-4 triggered the AI boom in 2023, the market responded predictably. A wave of startups emerged, each promising to reinvent software categories with AI-native capabilities. Enterprises watched with interest at first, then began experimenting for themselves.
As the market advanced, boards quickly demanded measurable ROI from AI initiatives, and leadership expected organisational transformation from technical capabilities that couldn’t move beyond pilots. The result: companies rushed to deploy AI before determining whether their infrastructure could actually support it long term.
In reality, AI workloads demand, in combination, what HPC and cloud computing each delivered separately. HPC brought rigorous systems thinking and comprehensive planning around compute resources. Cloud computing brought elastic compute with a high degree of flexibility and scalability, multi-tenancy, and resource efficiency. AI requires both simultaneously, and current infrastructure struggles to deliver this combination effectively.
Across organisations, the same pattern keeps emerging: teams architect for experimentation, not production. Infrastructure decisions that are sufficient for rapid prototyping quickly become constraints when systems need to handle production volumes, and cost structures that appeared manageable during pilots become prohibitive at scale.
What elastic infrastructure must become
Consequently, as AI becomes central to operations, infrastructure capable of supporting production deployments must be architected with lessons from both HPC and cloud in mind. That means combining HPC’s rigorous, holistic systems thinking with traditional cloud computing’s elastic scaling, multi-tenancy, and resource efficiency.
The next generation of infrastructure needs to address several challenges simultaneously:
Data mobility becomes foundational. As AI moves beyond training to inference – agents, fine-tuning, and real-world applications – workloads must run at the edge, not only in centralised GPU clusters. This requires moving substantial data volumes: training data, incremental backups, and inference models. Current data transfer pricing from major cloud providers makes this prohibitively expensive. Production AI infrastructure must enable fluid data movement between edge, on-premise, and cloud environments without cost structures that undermine business viability.
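To make the economics concrete, here is a back-of-the-envelope sketch in Python. The per-gigabyte egress price, number of edge sites, and model size are illustrative assumptions rather than quotes from any specific provider, but they show how quickly routine data movement compounds into a material line item.

```python
# Back-of-the-envelope egress cost estimate for moving AI artefacts
# out of a centralised cloud region to edge sites.
# All figures are illustrative assumptions, not provider quotes.

EGRESS_PRICE_PER_GB = 0.09   # assumed $/GB for internet egress
EDGE_SITES = 20              # assumed number of metro/edge locations to serve

def egress_cost(gb_moved: float, price_per_gb: float = EGRESS_PRICE_PER_GB) -> float:
    """Cost of moving gb_moved gigabytes out of a cloud region."""
    return gb_moved * price_per_gb

# Example: pushing a 150 GB model snapshot to every edge site once a week.
model_gb = 150
weekly_cost = egress_cost(model_gb) * EDGE_SITES
print(f"Per refresh: ${weekly_cost:,.0f}  |  per year: ${weekly_cost * 52:,.0f}")
```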
Distribution defines architecture. Inference workloads have latency requirements that demand deployment in metro areas close to end users. Yet interconnected GPU systems are largely fixed from the point of assembly in data centres – the opposite of elastic cloud principles. This creates a tension: distributed infrastructure is necessary, but GPU clusters are inherently static. Addressing it requires rethinking architectural approaches to distributed compute.
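As a rough illustration of why inference gravitates toward metro deployments, the sketch below compares a simple end-to-end latency budget for a nearby edge site against a distant centralised region. The round-trip times and model compute time are assumed values for illustration only.

```python
# Illustrative latency budget for a single inference request.
# RTT and compute figures are assumptions, not measurements.

def request_latency_ms(network_rtt_ms: float, model_compute_ms: float,
                       overhead_ms: float = 5.0) -> float:
    """User-perceived latency: one network round trip plus model time plus fixed overhead."""
    return network_rtt_ms + model_compute_ms + overhead_ms

MODEL_COMPUTE_MS = 60.0  # assumed time for the model to produce a response

for label, rtt_ms in [("metro edge (same city)", 8.0),
                      ("centralised region (cross-country)", 70.0)]:
    print(f"{label:36s} -> {request_latency_ms(rtt_ms, MODEL_COMPUTE_MS):.0f} ms total")
```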
Proven HPC technologies remain relevant. Tools like Slurm aren’t legacy systems; they’re battle-tested technologies trusted in production-critical environments. HPC forced comprehensive planning because clusters were designed for lengthy, multi-year lifecycles with capacity determined upfront. Today, AI teams prototype rapidly but often don’t consider production requirements until bottlenecks emerge. Resilient infrastructure benefits from reintroducing some of this architectural discipline.
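As a reminder of what that upfront discipline looks like in practice, the sketch below generates and submits a minimal Slurm batch job from Python, declaring GPUs, CPUs, memory, and wall time before any work runs. The partition name, resource sizes, and training script are placeholder assumptions.

```python
# Minimal sketch: declaring resources up front with Slurm, submitted from Python.
# Partition name, resource sizes, and script paths are placeholder assumptions.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=train-demo
#SBATCH --partition=gpu           # assumed partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4              # request four GPUs on the node
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=04:00:00           # hard wall-time limit: capacity is planned, not assumed
srun python train.py --config config.yaml
"""

def submit(script_text: str) -> str:
    """Write the batch script to a temporary file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()   # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit(JOB_SCRIPT))
```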
Guidance for teams building in this environment
For teams developing AI products within this volatile phase of industry development, Dave offered three key pieces of guidance:
- Prototype with scale in mind. Rapid iteration drives innovation and remains essential in fast-moving markets. However, architectural decisions made during prototyping determine whether systems can scale when required. Building for global scale on day one isn’t necessary, but ensuring architecture can scale when needed is critical. Retrofitting scalability into MVPs not designed for production loads is expensive and often prohibitive.
- Adopt holistic architecting and debugging practices. When performance issues emerge, investigation shouldn’t stop at the application layer. These systems are deeply interconnected; hardware and firmware constraints frequently manifest as application-level failures. If code appears sound but performance degrades, the issue often lies elsewhere in the stack (see the sketch after this list).
- Evaluate economics from the start. AI’s potential for global impact is significant, but realising that potential requires sustainable business models. Understanding where value can be delivered at scale, what infrastructure costs will look like in production, and how margins are affected by architectural choices determines which products will survive production reality.
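To illustrate what looking below the application layer can mean in practice, here is a small sketch that queries GPU temperature, power, clock speed, and throttle state via nvidia-smi when latency degrades but the code looks sound. The query fields are standard nvidia-smi options; the temperature threshold is an assumed value for illustration.

```python
# Sketch: when application latency degrades but the code looks sound,
# check whether the GPUs themselves are thermally or power throttled.
# Query fields are standard nvidia-smi options; the 85C threshold is an assumption.
import subprocess

FIELDS = "index,name,temperature.gpu,power.draw,clocks.sm,clocks_throttle_reasons.active"

def gpu_health() -> list[dict]:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.strip().splitlines():
        idx, name, temp, power, sm_clock, throttle = [v.strip() for v in line.split(",")]
        rows.append({
            "gpu": idx, "name": name, "temp_c": int(temp),
            "power_w": float(power), "sm_clock_mhz": int(sm_clock),
            "throttle_reasons": throttle,  # non-zero bitmask means the hardware is throttling
        })
    return rows

if __name__ == "__main__":
    for gpu in gpu_health():
        throttled = gpu["throttle_reasons"] not in ("0x0000000000000000", "Not Active")
        status = "THROTTLED" if throttled or gpu["temp_c"] > 85 else "ok"
        print(f"GPU {gpu['gpu']} ({gpu['name']}): {gpu['temp_c']}C, "
              f"{gpu['power_w']}W, {gpu['sm_clock_mhz']} MHz SM clock [{status}]")
```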
The path forward
Dave’s session reinforced what experience across the industry has proven: the gap between prototype and production is architectural, stemming from a fundamental mismatch between what AI workloads require and what current infrastructure delivers. AI training workloads share characteristics with traditional HPC, where fairly static, upfront-defined infrastructure worked well for long-running computations, but inference introduces a fundamentally different demand pattern: short-lived, bursty workloads that need the flexibility and scalability of elastic compute, not the fixed capacity of HPC clusters. AI therefore demands infrastructure that combines the systems discipline of HPC with the operational flexibility of cloud computing – a combination that requires rethinking infrastructure from the ground up.
Talented teams with viable products are struggling to reach production because infrastructure is still catching up to AI’s requirements. Production AI demands comprehensive systems thinking across the entire stack. And as infrastructure evolves to meet these demands, the teams that succeed will be those who architect with this reality in mind from day one.
Watch Dave’s full session on YouTube here.