AI thinks HPC is boring. But that’s why it works.

SC25 demonstrated why AI needs HPC thinking: maturity, reliability and disciplined engineering will determine the future of AI at scale.

Coming from a traditional high-performance computing (HPC) background, I've found it fascinating to watch the very different mentality that has emerged with the rise of AI. At SC25 this year, that contrast was starker than ever.

AI has come to be defined by speed and shiny features. It celebrates novelty. The industry has a habit of focusing on short-term vanity wins instead of thinking ahead to production-scale deployments that demand reliability and performance.

HPC, on the other hand, celebrates maturity and reliability: it moves carefully, works with precision and tests thoroughly.

What is not lost on me is that the AI industry is trying to solve challenges HPC solved 10 or 20 years ago. But those solutions are seen as boring and legacy, even though they're powering some of the most important scientific simulations in the world.

Things like workload managers and job schedulers have been so deeply engineered within HPC that they have become invisible foundations.

What AI is actually doing, while reinventing the wheel, is highlighting the value of everything HPC has already optimised for: rigour, reliability, long-term investment and operational excellence.

What is HPC vs AI? Confusion is still rife

One of the challenges the AI industry continues to face is that people still conflate AI, Machine Learning, and HPC. They aren’t the same thing. It’s like confusing a Ferrari, a rally car and a tractor because they all have engines. But they’re designed for completely different terrains.

  • HPC is clustered computing for solving extremely large or complex, numerically intensive scientific problems.
    • Examples include weather and climate simulation, computational fluid dynamics, molecular dynamics, physics simulations and large-scale data analysis.
    • The main characteristics are a focus on raw computational power and parallel processing, an emphasis on numerical simulation, and heavy use of MPI and similar tools to target parallel compute clusters (see the sketch after this list).
  • Machine Learning is a subset of AI focused on learning patterns from data. The key point is that algorithms learn patterns without being explicitly programmed.
    • Examples include image recognition, speech recognition, fraud detection and recommendation systems.
    • In terms of characteristics: it requires data, produces statistical models, spans supervised, unsupervised and reinforcement learning, and uses frameworks such as PyTorch and TensorFlow.
  • AI is a broader field encompassing multiple subsets, including (but not limited to) machine learning, deep learning, planning and reasoning systems, natural language processing, robotics and autonomous systems.
    • Typical use cases include chatbots, self-driving cars, and intelligent scheduling and job-planning tools.
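
To make the contrast concrete, here is a minimal sketch of the HPC-style pattern described in the first bullet: a numerically intensive problem decomposed across ranks with MPI and combined with a collective operation. It assumes mpi4py and NumPy are available and would be launched with something like mpirun; the integrand and problem size are illustrative, not taken from any particular workload.

```python
# Minimal sketch of an HPC-style workload: domain decomposition plus an
# MPI collective. Run with e.g. `mpirun -n 4 python integrate_mpi.py`.
# The integrand and problem size below are illustrative assumptions.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank integrates f(x) = x**2 over its own slice of [0, 1], a stand-in
# for the domain decomposition used in real simulation codes.
a, b = 0.0, 1.0
n_local = 1_000_000
width = (b - a) / size
lo, hi = a + rank * width, a + (rank + 1) * width

dx = (hi - lo) / n_local
x = lo + (np.arange(n_local) + 0.5) * dx   # midpoint-rule sample points
local_integral = np.sum(x ** 2) * dx

# The tightly coupled step: combine partial results across all ranks.
total = comm.allreduce(local_integral, op=MPI.SUM)

if rank == 0:
    print(f"integral of x^2 over [0, 1] ~= {total:.6f} (exact value: 1/3)")
```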

They aren't interchangeable. Breadth of knowledge across all three is rare, yet it is quickly becoming table stakes for anyone who needs to deliver AI systems that perform, and remain governable, at scale.

AI-specific accelerators: what are the risks?

A growing architectural risk here is the industry’s shift toward AI-specific accelerators that excel at dense tensor operations but are fundamentally unsuitable for HPC workloads. HPC depends on high-precision numerics, complex branching, memory-intensive workloads and tightly coupled interconnects, none of which map efficiently onto systolic array architectures optimised for neural networks. If procurement increasingly follows AI demand, we risk ending up with infrastructure optimised for training throughput but incapable of running the large-scale simulations and modelling workloads HPC was designed for.
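
The precision point is easy to demonstrate. Below is a small illustration, with made-up data rather than a benchmark, of how a naive single-precision reduction drifts away from the double-precision result that simulation codes typically depend on.

```python
# Illustration of the precision point above: the same reduction carried out
# in float32 (closer to the reduced-precision arithmetic AI accelerators
# favour) and float64 (the HPC default) can drift apart. The data and sizes
# here are illustrative assumptions, not measurements from the article.
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=1.0, scale=1e-4, size=10_000_000)

# Reference result accumulated in double precision.
sum64 = float(np.sum(values, dtype=np.float64))

# Naive sequential accumulation in single precision (cumsum runs left to
# right, so rounding error compounds like a simple running total would).
sum32 = float(np.cumsum(values.astype(np.float32))[-1])

print(f"float64 sum:    {sum64:.4f}")
print(f"float32 sum:    {sum32:.4f}")
print(f"relative error: {abs(sum32 - sum64) / sum64:.2e}")
```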

Public investment in AI supercomputing

Undoubtedly, the HPC world is undergoing its biggest transformation in decades. Europe is investing big in AI ‘gigafactories’; the U.S. is expanding the National Artificial Intelligence Research Resource (NAIRR) – providing researchers with shared access to AI computing power, data, and tools; and Japan is evolving its hybrid HPC-AI infrastructure, balancing simulation workloads with dedicated AI systems.

But as SC25 highlighted, while investment in AI infrastructure is accelerating, HPC centres must now satisfy three different user groups:

  1. Simulation users
  2. ML training users
  3. AI users

Again, we are seeing a lot of divergence in execution across these three areas, which, while different, share similarities that can be capitalised on. By reusing more common designs between HPC and AI, infrastructure can serve more customers across multiple use cases.

Coming full circle: new AI tools are rediscovering old HPC lessons

Once again, 'legacy' tools like Slurm and other HPC staples showed up at SC25, cementing their positions as essential, trusted, scalable production tools. Despite the innovation underway in workload management and job scheduling, they have not been thrown off the podium. This is proof that the way HPC tackled these problems all those years ago has stood the test of time, and that these tools remain the most mature and reliable options at large scale.

This aligns with our mindset at Stelia. We don't adopt every new framework we see; we adopt what we know is battle-hardened, adaptable and built to last, so that our customers have the most robust and resilient systems.

HPC is modernising as well

HPC is also evolving, and one of the most significant shifts is the adoption of data-layer patterns. We’re now seeing heavy use of object stores operating alongside traditional parallel filesystems within a unified data plane. This hybrid model enables HPC systems to handle AI and ML workloads that expect scalable, metadata-light object storage, while still supporting the high-throughput, low-latency requirements of simulation workloads. HPC is retaining its proven strengths while incorporating AI-driven design patterns that improve flexibility, data mobility, and workload interoperability across both ecosystems.
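
As one way to picture this hybrid model, here is a minimal sketch using fsspec as a possible abstraction layer: the same read path serves data whether it lives on a parallel filesystem or in an object store. Every path, bucket name and file name below is hypothetical, and this is not a description of any specific vendor's data plane.

```python
# A minimal sketch of a unified data plane: one read path that works whether
# the artefact sits on a parallel filesystem (POSIX path) or in an object
# store (S3-style URI). Uses fsspec as one possible abstraction layer; all
# paths and bucket names below are hypothetical.
import fsspec

def load_blob(uri: str) -> bytes:
    """Read raw bytes from either a POSIX path or an object-store URI."""
    with fsspec.open(uri, mode="rb") as f:
        return f.read()

# Simulation output staged on a parallel filesystem (hypothetical Lustre path).
checkpoint = load_blob("/lustre/project/run042/checkpoint_0100.h5")

# The same artefact mirrored to object storage for ML training consumers
# (requires the s3fs backend; the bucket name is made up).
mirrored = load_blob("s3://example-bucket/run042/checkpoint_0100.h5")
```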

Why knowledge spanning HPC, ML and AI is vital for successfully productionising workloads at scale

The emerging HPC-AI landscape demands partners who understand everything from the physics of compute through to governance, policy and end-to-end data and inference pipelines.

Most companies today focus on optimising only one part of the stack, rather than stepping back to look at the modern AI stack as a whole and how it integrates and orchestrates as one. This is why systems fail at scale.

What’s required to succeed is HPC-grade engineering discipline, governance that’s aligned with the needs of the most complex environments, design that’s built for longevity, and performance that’s optimised from silicon to model inference.

AI needs more HPC thinking, not less

SC25 reinforced my view: the future of AI will be defined by maturity, reliability and a deep understanding of how to run high-performance systems at scale.

AI is transforming HPC, yes. But HPC is also grounding AI with deep-rooted lessons that have stood the test of time. And HPC continues to modernise at a sustained pace, with its adoption of new storage patterns being a prime example.

As more public funding flows into AI supercomputing, the world will need fewer trendy frameworks and more engineering discipline to deliver the best, safest, and most innovative outcomes with AI.
