
What is the Mixture of Experts model approach, and does the efficiency promise hold up to its infrastructure demands?

The AI model architecture gaining ground across the industry – what it is, why it matters, and what it requires to work effectively.

Mixture of Experts (MoE) is an approach to building AI models that has been quietly gaining ground across the industry. Rather than activating an entire model for every query, MoE divides the model’s intelligence across specialised sub-networks, routing each input only to the experts best suited to handle it. The result is a model that carries significantly more knowledge than it ever deploys at once; a more deliberate and selective way of putting large models to work.

The pace of adoption is accelerating. Two releases in recent weeks tell the story clearly. DeepSeek’s latest model is built on MoE architecture at a scale that would have seemed implausible just a year ago – 1.6 trillion total parameters, with just 49 billion active at any given time – delivering massive capability at a fraction of the computational cost of a traditional model of equivalent size. Google’s Gemma 4 release earlier this month makes a different but equally telling case, applying the same MoE architectural logic to a 26-billion-parameter model and demonstrating that Mixture of Experts is not solely the territory of data center-scale deployments. It is fast becoming the default approach for building efficient, capable models at any scale.

And this is part of a longer trajectory. Mistral AI – the French AI lab that became one of Europe’s most-watched AI companies on the back of its open-source model releases – made MoE central to its most significant model releases, most notably with the Mixtral series that first brought the architecture to mainstream attention. And DeepSeek’s broader model family – which rattled the AI investment consensus at the start of 2025 by matching frontier performance at a fraction of the expected cost – is built on MoE architecture throughout.

But momentum at the model level and clarity at the deployment level are often two different things.

For organisations looking to embed these models into their workflows, the headline claims tend to do most of the work: promises of efficiency, reduced computational overhead, and models that punch well above their parameter weight. Less examined is what MoE models actually demand – what sits beneath the efficiency promise, and what needs to be in place to truly realise it.

In this blog, we look beneath the headline – at what MoE actually is, how it works, and what needs to be true for its benefits to be fully realised.

What problem does MoE actually solve?

Every architectural shift starts with a limitation worth solving, and for MoE, that limitation is the cost of scale.

In a traditional dense model, every parameter is activated for every token processed. A 70-billion-parameter model engages all 70 billion parameters for every token it generates, whether the query demands the full weight of its knowledge or not. It is extraordinarily capable, but the computational cost is constant and scales directly with model size: the entire model is running all the time, regardless of what you actually need from it.

Smaller, task-specific models offer an alternative; faster, cheaper, and well-suited to narrow, well-defined problems. But their strength is also their weakness. They excel only when the problem is well-defined from the outset, and for organisations spanning multiple functions, priorities, and directions of travel, designing a model around one specific, permanent job becomes a limitation in itself.

This is where MoE comes in. It sits in the middle ground, offering a third approach to that trade-off. Rather than activating everything at once, it divides the model’s feed-forward layers into distinct sub-networks (experts), each developing a degree of specialisation through training. When a token comes in, a gating network (the router) evaluates it and determines which experts are best suited to handle it. Out of eight experts, it might only need to activate two, and the rest sit dormant for that query.
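To make the routing step concrete, here is a minimal sketch of a top-2 MoE layer in PyTorch. The layer sizes, expert count, and names are illustrative rather than drawn from any particular model, and real implementations batch the expert computation far more efficiently; this is only meant to show the gating logic.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TopKMoELayer(nn.Module):
    """A minimal, illustrative top-k Mixture of Experts feed-forward layer."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The gating network ("router") scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary feed-forward sub-network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e    # tokens whose slot-th pick is expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out                             # unchosen experts stay dormant this batch


layer = TopKMoELayer()
tokens = torch.randn(16, 512)                  # a small batch of token embeddings
print(layer(tokens).shape)                     # torch.Size([16, 512])
```

The key design point is visible in the loop: only the experts that the router selects ever run, yet all eight are constructed and held in memory regardless.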

The result is essentially a decoupling of total knowledge from active computational cost. Take Mixtral 8x7B for example: 47 billion total parameters, but only around 13 billion active at any given time. The intelligence of a large model, running at the computational cost of a much smaller one.
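For a rough sense of where those figures come from, the back-of-the-envelope count below uses Mixtral 8x7B’s publicly reported dimensions (hidden size 4096, expert width 14336, 32 layers, 8 experts, top-2 routing) and ignores layer norms and other small terms, so treat the output as approximate.

```python
# Back-of-the-envelope parameter count for a Mixtral-8x7B-style MoE
# (approximate public dimensions; layer norms and small terms ignored).

d_model, d_ff    = 4096, 14336       # hidden size and expert FFN width
n_layers         = 32
n_experts, top_k = 8, 2
vocab            = 32_000
d_kv             = 1024              # grouped-query attention: 8 KV heads x 128

# Parameters every token always uses, regardless of routing.
attention  = n_layers * (2 * d_model * d_model + 2 * d_model * d_kv)
embeddings = 2 * vocab * d_model                  # input embeddings + output head
router     = n_layers * d_model * n_experts
shared     = attention + embeddings + router

# Parameters held in the experts (SwiGLU FFN: three weight matrices each).
per_expert_layer = 3 * d_model * d_ff
all_experts      = n_layers * n_experts * per_expert_layer
active_experts   = n_layers * top_k     * per_expert_layer

total  = shared + all_experts
active = shared + active_experts
print(f"total parameters: ~{total / 1e9:.0f}B")    # ~47B
print(f"active per token: ~{active / 1e9:.0f}B")   # ~13B
```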

For enterprise use cases, this matters in a specific and practical way. An organisation that needs a single model to handle varied inputs, from technical research to financial analysis, no longer has to choose between a narrow model that cannot flex and a large, dense model that is prohibitively expensive to run at scale. With MoE, there is a third path: broad capability, selectively deployed.

What the headlines don’t cover

For organisations navigating the trade-offs between model size, cost, and flexibility, the case for MoE sells itself. But like most things in production AI, the distance between the architectural promise and the deployment reality is where the more challenging questions live, and with MoE, these considerations run deeper than the model layer itself.

The efficiency gains are real – fewer active parameters per query, lower compute cost, and faster inference. What is less often accounted for is that the model still requires all of its experts to reside in memory simultaneously, whether they are being used or not. So while compute costs come down, the memory footprint remains that of the full model.
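A quick sketch of the arithmetic makes the point. It uses round numbers in the same ballpark as the Mixtral example above and counts weights only, ignoring the KV cache, activations, and framework overhead.

```python
# Rough VRAM estimate for serving a ~47B-total / ~13B-active MoE in 16-bit
# precision; weights only (no KV cache, activations, or framework overhead).

total_params    = 47e9
active_params   = 13e9
bytes_per_param = 2                  # fp16 / bf16

weights_gb    = total_params * bytes_per_param / 1e9
compute_share = active_params / total_params

print(f"weight memory needed: ~{weights_gb:.0f} GB")            # ~94 GB resident
print(f"fraction doing work per token: ~{compute_share:.0%}")   # ~28%
```

In other words, a model that computes like a 13-billion-parameter model still has to be housed like a 47-billion-parameter one.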

In a multi-GPU setup, the router is making millions of decisions per second, physically scattering tokens across servers to reach the right expert and creating sustained, unpredictable pressure on the network fabric that a dense model simply does not generate.
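The toy simulation below illustrates why. With experts sharded across GPUs (expert parallelism) and a roughly uniform top-2 router, most dispatches land on a device other than the one holding the token, and the exchange repeats at every MoE layer of every forward pass. The GPU count, token count, and routing distribution here are invented purely for illustration.

```python
import random

# Toy illustration of expert-parallel dispatch: 8 experts sharded across
# 4 GPUs (2 experts each), top-2 routing. Counts how many token activations
# must cross a device boundary for a single MoE layer.

random.seed(0)
num_gpus, experts_per_gpu, top_k = 4, 2, 2
num_experts    = num_gpus * experts_per_gpu
tokens_per_gpu = 4096                        # tokens resident on each GPU this step

cross_device = 0
total_dispatches = 0
for src_gpu in range(num_gpus):
    for _ in range(tokens_per_gpu):
        chosen = random.sample(range(num_experts), top_k)   # the router's picks
        for expert in chosen:
            total_dispatches += 1
            if expert // experts_per_gpu != src_gpu:         # expert lives elsewhere
                cross_device += 1

print(f"dispatches this layer:       {total_dispatches}")
print(f"crossing the network fabric: {cross_device} "
      f"({cross_device / total_dispatches:.0%})")            # roughly three quarters
```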

And the consequence that tends to go unexamined is this: when an expert becomes consistently overloaded – too many tokens routed to the same place at once – the model does not crash. Instead, it quietly drops the excess tokens, which skip that expert layer entirely, and the model carries on producing lower-quality outputs. On the surface, this degradation simply looks like an underperforming model.
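The sketch below shows that mechanism in miniature for a single MoE layer with top-1 routing and a fixed capacity factor. The expert count, batch size, and skew are invented for illustration, and real systems differ in how (and whether) they drop tokens, but the shape of the failure is the same: the hot expert fills up, and everything beyond its capacity passes through untouched.

```python
import math
import random
from collections import Counter

# Toy illustration of token dropping under a fixed expert capacity.
# Each expert can process at most `capacity` tokens per batch; anything
# routed beyond that is silently skipped for this layer (the token passes
# through unchanged via the residual connection).

random.seed(0)
num_experts, num_tokens, capacity_factor = 8, 8192, 1.25
capacity = math.ceil(capacity_factor * num_tokens / num_experts)   # 1280 slots per expert

# Simulate a skewed router: expert 0 is "hot" and attracts far more traffic.
routing_weights = [5] + [1] * (num_experts - 1)
assignments = random.choices(range(num_experts), weights=routing_weights, k=num_tokens)

load = Counter(assignments)
dropped = sum(max(0, count - capacity) for count in load.values())
print(f"capacity per expert:       {capacity}")
print(f"hottest expert load:       {max(load.values())}")
print(f"tokens dropped this layer: {dropped} ({dropped / num_tokens:.1%})")
```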

Taken together, these are not actually model problems, but rather a sensitivity to infrastructure conditions. Two organisations running the same MoE model on different infrastructures can get meaningfully different results – in speed, reliability, and output quality – and attribute the gap entirely to the wrong layer.

MoE models are, in this sense, uniquely impacted by infrastructure quality in a way that dense models are not. The efficiency they promise is conditional, and these conditions are worth understanding before deployment decisions are made.

What this means in practice

The organisations that will get the most from Mixture of Experts are those that adopt these models decisively, with a clear-eyed understanding of the performance and costs they can realistically expect.

At its core, MoE is an expression of a principle that is becoming increasingly important across the AI industry as a whole: that the right capability, matched to the right task, will consistently outperform brute-force scale. The routing mechanism in the MoE architecture does at inference time what a good AI strategy does at the planning stage – selecting deliberately rather than activating indiscriminately.

And as the architecture continues to evolve, that principle will only become more consequential. The shift from passive models to autonomous agentic systems – ones that execute multi-step workflows across tools, APIs, and data sources – will transform what routing actually means. Where today’s MoE router selects between neural sub-networks, tomorrow’s may function more like an operating system scheduler: directing traffic not between experts within a model, but between code interpreters, databases, and external APIs, and demanding infrastructure that is fluid, stateful, and instantly scalable.

The opportunity Mixture of Experts presents is genuinely exciting. But its value is only realised when the infrastructure beneath it is built to support it.

This is why the conversation about AI deployment cannot stop at the model layer. VRAM, network performance, load balancing, orchestration – for an MoE model, these are direct determinants of output quality.

As MoE continues to define how the most capable open models are built, the foundations beneath the model layer become more important than ever.
