The changing face of supercomputing: why traditional benchmarks are falling behind

How technical leaders can move toward comprehensive performance assessments that better inform AI systems evaluation.

High-performance computing (HPC) is the backbone of many of today’s most demanding computational workloads – from training and scaling state-of-the-art AI models to advancing scientific discovery through large-scale simulations, complex data analysis, and engineering design.

As the complexity and diversity of the workloads these systems tackle grow at exponential rates, measuring their performance in a way that reflects today’s computing needs has become paramount.

Yet for decades, we’ve ranked and compared these machines using a single, narrow benchmark. That benchmark, known as LINPACK, still dominates the global conversation about computing power.

In this article, through insights from Stelia’s leading platform engineers, we explore why the LINPACK TOP500 supercomputer ranking is no longer a representative assessment of the world’s most powerful AI infrastructure. Most importantly, we’ll outline the questions engineering teams should be asking when evaluating infrastructure, and how technical leaders can shift focus from singular benchmark scores to the end-to-end performance metrics that actually determine whether an AI system works effectively in practice.

When LINPACK made sense

Understanding why the TOP500 has become misleading requires considering why it succeeded in the first place. The TOP500 launched in 1993 on a simple but revolutionary idea: rank supercomputers by their performance on a standardised benchmark, updating the list twice a year to track trends in high-performance computing. The list was an academic one, introduced at a time when supercomputers were architecturally very similar, so focusing solely on how many computations a system could perform at one time was a fair measure. That measure produced fairly representative results in an ecosystem where the main difference between supercomputers was how many servers they had – double the servers, double the compute – and so the TOP500 list was born.

The benchmark served its intended purpose for its intended audience perfectly. The community was small and tight-knit. Academic supercomputing centers, national laboratories, and a handful of commercial users formed a field where everyone understood the context: these were purpose-built research machines designed to run scientific simulations efficiently.

The LINPACK TOP500 provided a standardised way to track the exponential growth in computational power, helped guide procurement decisions for research institutions, and created healthy competition among vendors. But the homogeneous world of academic supercomputing has been replaced by a heterogeneous ecosystem in which AI training clusters, high-frequency trading systems, and cloud infrastructure all compete on the same performance metrics. And while both the purpose and the audience have since changed, the benchmark has remained untouched, leading to misalignment over what ‘fastest’ actually means.

Score manipulation in the GPU era

The first cracks in LINPACK’s credibility appeared in the early 2010s, as GPU computing entered the high-performance and scientific computing space. GPUs excelled at one specific thing: the exact type of dense linear algebra that LINPACK’s benchmark measures.

That strength had very little to do with real scientific computing performance, and before long the practice of using GPUs to inflate LINPACK results became mainstream.

“There were some grumblings in the supercomputing sphere, because it was very apparent that when people design a new supercomputer in the classic sense, they would put one rack of GPUs next to it, which in practice, nobody used, but doubled the benchmark number”, noted Lukas Stockner, Principal Engineer at Stelia.

Extremely unsubtle, but extremely effective. Some organisations began to architect entire systems specifically to game LINPACK scores rather than to optimise for actual workloads. The GPUs sat largely idle in production, contributing nothing to scientific throughput while dramatically inflating benchmark results. With time, the supercomputing community began to recognise that the benchmark was measuring something increasingly disconnected from real performance. In tandem, as GPU programming matured and applications learned to leverage parallel architectures effectively, the worst cases of benchmark manipulation became less common. But the damage to LINPACK’s credibility was done, and its simplistic approach to performance testing looked increasingly outdated.

Crucially, the rise of the GPU era exposed a key flaw in the benchmark’s logic: it assumed that peak theoretical performance translated to practical performance, but in a world where architectural complexity was increasing rapidly, that assumption no longer held. The stage was set for even bigger problems as the explosion of AI brought entirely new requirements that LINPACK was never designed to measure.

Why AI workloads broke LINPACK’s logic

The machine learning boom, and later the broader AI revolution, truly exposed LINPACK’s fundamental misalignment with modern computing. The shift first changed what hardware people buy, then revealed how meaningless raw compute numbers become when everyone buys identical components. Those figures say nothing about whether a system actually works for AI workloads.

The real performance bottlenecks that benchmarks should account for lie elsewhere. As Dave Hughes, Stelia’s CTO for Infrastructure Services Group, puts it: “HPC and AI workloads work in a triangle. Sure, at the tip of the triangle you’ve got the actual compute but at the two lower tips you’ve got network and storage. Everyone only posts benchmarks of their cluster based on the tip of the triangle, the compute.”

This creates a dangerous blind spot and leaves countless questions unanswered. “The problems actually come in with things like, does the storage perform correctly? Does the networking between the servers work without bottlenecks? How is the reliability? None of that is measured by LINPACK’s benchmark.”

These are factors that demand accounting for in the current climate. Large model training requires massive data throughput, with training datasets streaming continuously from storage to GPUs. As models scale beyond single-GPU memory limits, network communication between nodes and system reliability become critical. Without stressing the storage, flooding the network, and running a heavy burn or soak of the compute as well, posting compute performance alone is pointless.
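
As a rough illustration of what ‘stressing the storage’ alongside the compute might look like, the Python sketch below measures sustained sequential read throughput from a file on the storage under test and runs a crude CPU compute burn for comparison. It is a minimal, single-process sketch rather than a benchmark: the file path and block size are hypothetical placeholders, and a real evaluation would use many concurrent readers, GPU kernels, and far longer soak periods.

```python
import time

import numpy as np


def storage_read_throughput(path: str, block_size: int = 64 * 1024 * 1024) -> float:
    """Measure sustained sequential read throughput (GB/s) for a single file.

    Illustrative only: real training I/O involves many concurrent readers,
    mixed access patterns and much larger working sets.
    """
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(block_size):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9


def compute_burn(seconds: float = 10.0, n: int = 2048) -> float:
    """Run dense matmuls for a fixed wall-clock window and report GFLOP/s."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    flops_per_matmul = 2 * n**3
    done = 0
    start = time.perf_counter()
    while time.perf_counter() - start < seconds:
        _ = a @ b
        done += 1
    elapsed = time.perf_counter() - start
    return done * flops_per_matmul / elapsed / 1e9


# Hypothetical usage: point at a dataset shard stored on the system being evaluated.
# print(storage_read_throughput("/data/shard-000.bin"), "GB/s")
# print(compute_burn(), "GFLOP/s")
```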

Layer this with the fact that major tech companies avoid the rankings due to strict secrecy, and you can see why, for engineering leaders making infrastructure decisions, the benchmark that was supposed to guide procurement has become misleading and difficult to interpret.

The better benchmark alternatives

Fortunately, better alternatives do exist – albeit with their own flaws. The industry has recognised LINPACK’s limitations and developed several more sophisticated benchmarking approaches that address its shortcomings; however, each of these serves a different purpose and brings its own trade-offs.

NVIDIA’s HPC Benchmarks represent the most comprehensive attempt at tackling LINPACK’s shortcomings while maintaining compatibility with existing practices. The suite includes four benchmarks: HPL (the traditional High Performance LINPACK), HPL-MxP, HPCG, and STREAM – each targeting different aspects of modern computing that LINPACK ignores.

HPL-MxP (formerly known as HPL-AI) was created to solve the precision mismatch problem. HPL-MxP aims to highlight the emerging convergence of HPC and AI workloads by using mixed-precision arithmetic that mirrors real AI training. Rather than requiring 64-bit accuracy throughout, HPL-MxP uses low-precision (typically 16-bit) computation for the LU factorisation, then applies iterative refinement to recover full 64-bit accuracy. This approach reflects how modern AI systems actually work, where Tensor Cores provide substantial speedups in mixed-precision workloads by accelerating precisions such as TF32, BF16, FP16, INT8, and INT4.
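
To make the mixed-precision idea concrete, here is a minimal NumPy sketch of the technique, not the actual HPL-MxP code: the solve is done in float32 (standing in for the FP16/TF32 Tensor Core path), and iterative refinement in float64 recovers full accuracy.

```python
import numpy as np


def mixed_precision_solve(a64, b64, tol=1e-12, max_iters=50):
    """Solve A x = b to float64 accuracy using only low-precision solves.

    Sketch of the HPL-MxP idea: do the expensive factorisation/solve in low
    precision (float32 here; real systems use FP16/TF32 on Tensor Cores),
    then apply iterative refinement in float64 to recover full accuracy.
    """
    a32 = a64.astype(np.float32)
    x = np.linalg.solve(a32, b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b64 - a64 @ x  # residual computed in full precision
        if np.linalg.norm(r) / np.linalg.norm(b64) < tol:
            break
        dx = np.linalg.solve(a32, r.astype(np.float32)).astype(np.float64)
        x += dx  # correction obtained from the low-precision solve
    return x


# Example on a well-conditioned (diagonally dominant) random system.
n = 512
a = np.random.rand(n, n) + n * np.eye(n)
b = np.random.rand(n)
x = mixed_precision_solve(a, b)
print(np.linalg.norm(a @ x - b) / np.linalg.norm(b))  # ~1e-15
```

A real implementation would factorise once and reuse the LU factors for every refinement step; re-solving each time simply keeps the sketch short.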

HPCG (High Performance Conjugate Gradients) complements HPL by addressing the memory bandwidth and latency gap. Unlike LINPACK’s dense matrix operations, HPCG uses sparse matrix computations that better reflect real scientific applications. Research shows a strong correlation between the HPCG result and the sustained memory bandwidth measured by the STREAM benchmark, making it particularly valuable for evaluating systems where memory, not compute, is the bottleneck.
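
The kernel HPCG stresses is easy to picture: repeated sparse matrix–vector products inside a conjugate gradient loop, where performance is bound by memory traffic rather than peak FLOPS. Below is a simplified, unpreconditioned sketch using SciPy; the real HPCG adds a multigrid preconditioner and runs distributed across many nodes.

```python
import numpy as np
import scipy.sparse as sp


def conjugate_gradient(a, b, tol=1e-8, max_iters=1000):
    """Plain conjugate gradient for a symmetric positive-definite sparse matrix.

    Each iteration is dominated by one sparse matrix-vector product, which is
    memory-bandwidth bound -- the behaviour HPCG is designed to expose.
    """
    x = np.zeros_like(b)
    r = b - a @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        ap = a @ p
        alpha = rs_old / (p @ ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


# Shifted 1-D Laplacian: sparse, SPD and quick to converge, purely for illustration.
n = 100_000
a = sp.diags([-1.0, 2.1, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = conjugate_gradient(a, b)
```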

STREAM, the fourth component of NVIDIA’s suite, directly measures sustainable memory bandwidth – an important factor for modern AI workloads where data movement often dominates performance.
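
For completeness, this is what a STREAM-style measurement boils down to: time a kernel that is pure data movement and convert the bytes moved into bandwidth. The single-process NumPy sketch below uses only the copy kernel, whereas the official STREAM benchmark also measures the scale, add, and triad kernels.

```python
import time

import numpy as np


def copy_bandwidth(n: int = 50_000_000, repeats: int = 5) -> float:
    """Estimate sustainable memory bandwidth (GB/s) with a STREAM-style copy.

    The copy kernel a[i] = b[i] moves 2 * 8 bytes per float64 element
    (one read, one write); the best of several repeats is reported.
    """
    b = np.random.rand(n)
    a = np.empty_like(b)
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        np.copyto(a, b)  # pure data movement, no arithmetic
        elapsed = time.perf_counter() - start
        best = max(best, 2 * 8 * n / elapsed / 1e9)
    return best


print(f"~{copy_bandwidth():.0f} GB/s sustained (copy kernel estimate)")
```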

Separately, MLPerf is worth an honourable mention. While not a direct alternative, MLPerf takes a novel approach by abandoning synthetic benchmarks entirely in favour of real AI workloads. MLPerf Inference measures actual deployment scenarios, while MLPerf Training measures how quickly systems can train models to target quality metrics. It is not a replacement for LINPACK-style benchmarks, though, as it serves a different purpose: measuring end-to-end AI application performance rather than fundamental computational capacity.

What we must recognise is that none of these improvements or alternatives to LINPACK is an easy fix. The sophistication of these alternatives brings with it complexity: where LINPACK offered a single number, modern benchmarking requires understanding multiple metrics across different precision types, memory patterns, and workload characteristics.

What are the questions engineering teams should be asking?

For technical leaders evaluating infrastructure, this means moving beyond single benchmark scores toward comprehensive performance profiles. With no silver bullet fix, engineering teams must learn to ask the right questions during vendor evaluation: What’s the actual training throughput for your model architecture? How does storage I/O performance scale with concurrent training jobs? What’s the mean time between failures for multi-week training runs? Can the network fabric handle the communication patterns of distributed training without bottlenecks?
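
One way to keep the answers to those questions comparable across vendors is to record them as a structured profile rather than a single headline number. The sketch below is purely illustrative: the field names, units, and the eight-job scaling check are hypothetical choices, not an industry standard.

```python
from dataclasses import dataclass, field


@dataclass
class InfraEvaluation:
    """Illustrative record of what a vendor evaluation should capture."""
    vendor: str
    training_throughput_samples_per_s: float  # measured on *your* model architecture
    storage_read_gbps_one_job: float
    storage_read_gbps_eight_jobs: float       # does I/O hold up under concurrency?
    network_allreduce_busbw_gbps: float       # fabric under collective communication
    mean_time_between_failures_hours: float   # multi-week runs need reliability
    notes: list = field(default_factory=list)

    def storage_scaling_efficiency(self) -> float:
        """Throughput scaling from 1 to 8 concurrent jobs (1.0 means linear)."""
        return self.storage_read_gbps_eight_jobs / (8 * self.storage_read_gbps_one_job)
```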

These questions expose the true performance capabilities of a system and paint a more accurate picture than any single benchmark score can – not least the results of the LINPACK TOP500.

What does a trusted framework need to incorporate?

Of course, these performance profiles are not an ideal long-term solution. To rebuild a trusted framework, the industry needs a complete change in approach. Rather than measuring synthetic workloads, effective AI infrastructure evaluation should test systems with realistic training scenarios – full end-to-end tests that actually train models rather than just crunching abstract numbers.

This would look like testing storage under the I/O patterns of actual model training, stressing the network fabric with real distributed communication patterns, and measuring reliability across multi-day training runs. As Stelia’s Principal Engineer, Lukas Stockner, suggests: “A much better way to measure performance would be a real-life test, training a model from scratch, loading both the network and storage systems in realistic patterns.”
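
A lightweight way to start down that road is to instrument an existing training loop so every side of the triangle is timed. The harness below is a minimal sketch; ‘train_step’, ‘sync_gradients’, and ‘data_iter’ in the commented usage are hypothetical placeholders for whatever framework and model the team actually runs.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per training phase so an end-to-end run
    shows where the cluster actually spends its time: data loading (storage),
    compute (accelerators) and gradient synchronisation (network)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def report(self):
        total = sum(self.totals.values()) or 1.0
        for name, seconds in sorted(self.totals.items(), key=lambda kv: -kv[1]):
            print(f"{name:<14} {seconds:8.1f}s  {100 * seconds / total:5.1f}%")


# Hypothetical usage inside a distributed training loop:
# timer = PhaseTimer()
# for step in range(num_steps):
#     with timer.phase("data_loading"):
#         batch = next(data_iter)      # stresses storage and the input pipeline
#     with timer.phase("compute"):
#         loss = train_step(batch)     # stresses the accelerators
#     with timer.phase("communication"):
#         sync_gradients()             # stresses the network fabric
# timer.report()
```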

Of course, the challenge there is a practical one: such comprehensive testing would take significantly longer than LINPACK’s 10-minute runs, making adoption difficult. Nevertheless, for organisations making million-dollar infrastructure decisions, the investment in proper evaluation could pay dividends in avoiding costly performance surprises.

The reality check technical leaders need

Despite these notable advances in benchmarking, the uncomfortable reality is that, as systems grow more complex and use cases become more specific, a perfect, widely adoptable benchmark may never catch up with the pace of innovation. On top of that, the persistence of LINPACK’s TOP500 as the primary public ranking means many procurement conversations still default to traditional performance metrics.

But technical decision makers have no time to wait for industry practices to fully evolve. Million-dollar decisions are being made today, and the key is working more intelligently with the available data – evaluating systems across multiple performance dimensions rather than defaulting to familiar rankings.

The path forward requires technology leaders to become more sophisticated consumers of benchmark data. When vendors present quantitative claims, the critical questions become: Which benchmark was used, and how does it relate to your specific workloads? What precision format do these numbers represent? How does the system perform across the full range of benchmarks, rather than just the one that produces the most impressive headline number?

Most importantly, recognise that identical hardware specifications guarantee nothing about real-world performance. The difference between a well-architected AI system and an expensive collection of identical servers lies entirely in the integration – how the storage, network and compute work together under realistic loads. This is where comprehensive evaluation using multiple modern benchmarks becomes essential.

The TOP500 will likely continue to exist, and vendors will likely continue to cite convenient benchmark results in marketing materials. But technical leaders who understand these limitations, and actively demand better metrics from their vendors, can make substantially better informed decisions by focusing on the performance characteristics that actually determine whether their systems deliver business value.
