Why we chose Ceph as part of our storage-related solutions for production-scale AI

Stelia’s Principal Architect unpacks the licensing, architecture, and governance trade-offs that shaped our approach to production-scale storage.

In the fast-paced world of DevOps and cloud infrastructure, there is a natural gravitation toward tools that offer instant gratification. We value the “Day 1” experience: the single binary download, the five-minute setup, and the immediate results. When a tool allows you to go from zero to a working prototype in the time it takes to drink a coffee, it gains adoption rapidly.

However, when you are architecting modern AI-ready cloud infrastructure from the ground up, the laws of physics – and the definition of success – are fundamentally different. We aren’t simply hosting static websites or lightweight user databases. We are building the high-throughput pipelines required to feed petabytes of training data into hungry H100/H200 GPU clusters. We are managing Retrieval-Augmented Generation (RAG) workflows where millisecond latency isn’t just a metric; it’s the difference between a functional product and a failed user experience.

In this high-stakes environment, the pressure to take infrastructure shortcuts is overwhelming. For years, the industry standard advice for object storage has been MinIO. If you ask a room full of startup technical leaders what to use for S3-compatible storage, their answer will be MinIO because it’s simple, fast, and works out of the box.

And they are not wrong. MinIO is an impressive piece of engineering. It is incredibly fast and offers a developer experience that feels like magic on Day 1.

But at Stelia, we realised early on that we couldn’t optimise for Day 1. We had to optimise for Day 1,000. We are building a fortress for organisations’ models, not a playground for prototypes. When we examined the long-term trajectory of the storage landscape, we saw a divergence between the free code and the paid product that was becoming too wide to ignore.

We faced a critical architectural choice: build our platform on technologies that offer ease of use but introduce significant supply chain risk, or choose the hard option and undertake the engineering rigour required to build on a true, community-governed foundation.

We chose the hard option; we chose to invest in long-term durability. And as a result, we selected Ceph as one part of our storage-related solutions.

Below, we outline why we made that decision, and why we believe it ensures organisations’ data is safer, cheaper, and more performant with us in the long run.

The evolution of open source business models

To understand why we moved away from the “easy” option, it is important to look at the business context without cynicism. Infrastructure companies need to monetise, and the “Open Core” model is a standard path. However, the strategies companies use to achieve profitability have profound downstream effects on the users building upon their software.

Over the last few years, we have witnessed a slow, calculated pivot in the object storage market. This wasn’t an overnight change. It was a gradual evolution that has made it increasingly difficult for infrastructure providers to rely on certain open-source projects without incurring massive enterprise licensing costs or legal complexity.

The licensing complexity (AGPLv3)

The first sign of this shift occurred in 2021, when the licensing landscape for MinIO changed from the permissive Apache 2.0 license to the GNU AGPLv3.

For the uninitiated, the distinction between these licenses is massive. Apache 2.0 is the ‘do what you want, just give us credit’ license. It allows for broad innovation and integration without legal strings attached.

AGPLv3, however, is designed to close the “SaaS loophole”. It states that if you modify the software and let users interact with it over a network (which is the definition of a cloud service), you must make your modified source code available to those users as well.

For a hobbyist or a student, this distinction is irrelevant. But for a corporation building a proprietary AI platform, AGPLv3 must be assessed with caution. It introduces legal ambiguity. The question is: “Does linking our internal orchestration layers to the storage backend potentially require us to open-source our proprietary app?”

The answer is “maybe.” In the world of enterprise risk management, “maybe” is a stop sign. This licensing move forces many companies into a corner: purchase a commercial license to avoid the headache, or accept some compliance risks. We wanted a foundation where the legal ground wouldn’t shift beneath our feet.

The feature gap

Beyond the license, we began to notice a growing feature delta – a widening gap between what is available in the GitHub repository and what is sold in the enterprise binary.

The most visible casualty of this shift was the Web Management Console. In earlier iterations, the open-source version provided a robust user interface for managing buckets, users, identity policies, and lifecycle rules. It was a true single pane of glass for administrators.

Over time, however, the community version of this console was stripped down. Critical administrative features – such as OpenID Connect (OIDC) and LDAP integration for identity management, tiering configurations, and deep observability metrics – were removed or hidden behind the enterprise paywall. Today, the open-source console functions primarily as a file browser.

If you want the full administrative suite to manage a multi-petabyte cluster, you are now expected to pay for the enterprise product. For us, this signalled that the open-source version was no longer viewed as a standalone product, but rather as a demo for the paid tier.

Entering maintenance mode

Perhaps the most challenging development for DevOps teams has been the operational friction introduced recently. With the open-source edition effectively entering what many in the community call “maintenance mode,” the project has ceased to be a living, breathing foundation for new infrastructure.

Innovation has been bifurcated. Performance tuning, AI-specific optimisations, and advanced replication features are increasingly channelled exclusively into the commercial product. Even more disruptive was the change in how binaries and Docker images are distributed.

In a modern, containerised world, the inability to easily pull a verified, stable, and compliant image from a standard registry is a major hurdle. It forces teams to compile from source or rely on unverified third-party builds, introducing security risks into the supply chain. You cannot build a platform today on software that is essentially frozen in time.

The alternative: Ceph – an open-source ecosystem

When we decided to look for a different path, we turned to Ceph.

Ceph is an open-source ecosystem, not just a product. Often described as the ‘Linux of Storage’, Ceph is a distributed storage platform that delivers Object, Block, and File storage on top of a single, unified data plane.

The primary differentiator for us wasn’t only the code; it was the governance.

MinIO is controlled by a single corporation.

Ceph, by contrast, is governed by the Ceph Foundation under the umbrella of the Linux Foundation. Its board includes representatives from industry giants like Red Hat, IBM, Canonical, and scientific organisations like CERN. There is no single leader who can wake up tomorrow and decide to deprecate the open-source version. The code truly belongs to the community.

This governance structure aligns perfectly with our philosophy. We wanted a storage layer that would be as open and reliable in ten years as it is today.

In fact, CERN is the ultimate showcase for Ceph. They don’t just sit on the board; they rely on Ceph to manage over 100 petabytes of storage that underpins the IT infrastructure for the Large Hadron Collider. It is the high-performance backbone for their OpenStack cloud used by thousands of physicists to analyse particle collision data. For those sceptical about manageability, CERN’s engineering team regularly publishes “Ten-year retrospective” talks on YouTube. These videos detail how a small team manages this massive, mission-critical environment using the exact same open-source code we use.

Technical deep dive: architecture & data placement

Governance aside, the technical differences between Ceph and its competitors are profound. If you are a developer or an architect, it is important to understand why Ceph is historically considered harder to use, and why that complexity buys you scalability that other systems struggle to match.

The core difference lies in how these systems answer a simple question: “Where do I put this file?”

The “pool” problem in rigid architectures

Many object storage systems use a hashing ring architecture combined with erasure coding. In an ideal world, this creates a ‘shared-nothing’ architecture where every node is identical. This is fantastic for speed in small, static setups.

However, this rigidity creates a massive problem when it’s time to scale. In many of these systems, you cannot simply add one hard drive to a cluster. You generally have to scale by adding ‘server pools.’

Imagine you start with a cluster of 4 nodes, each with 4 drives (16 drives total). If you run out of space, you typically cannot just plug a new 20TB drive into an empty slot. To maintain the geometry of the erasure coding, you often have to add another symmetrical set of 16 drives. This step-function scaling is incredibly expensive.

Furthermore, these systems often lack automatic rebalancing. If you add a new pool of drives, new data is written there, but the old data stays on the old, full drives. You end up with “hot” and “cold” spots in your cluster. Your total throughput is limited by the performance of the new pool, rather than the aggregate power of the whole cluster.
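The cost of step-function scaling is easy to quantify. The sketch below uses the hypothetical sizes from the example above (16-drive pools, 20 TB drives) to compare how much capacity you are forced to buy under each model:

```python
import math

# Hypothetical sizes matching the example above:
# pools of 16 drives, 20 TB per drive.
DRIVE_TB = 20
POOL_DRIVES = 16
POOL_TB = DRIVE_TB * POOL_DRIVES  # 320 TB per pool

def capacity_bought_pool_scaling(extra_tb: int) -> int:
    """TB you must purchase when you can only grow by whole pools."""
    return math.ceil(extra_tb / POOL_TB) * POOL_TB

def capacity_bought_drive_scaling(extra_tb: int) -> int:
    """TB you must purchase when you can grow one drive at a time."""
    return math.ceil(extra_tb / DRIVE_TB) * DRIVE_TB

# Needing just 20 TB more forces a 320 TB purchase under pool scaling:
print(capacity_bought_pool_scaling(20))   # 320
print(capacity_bought_drive_scaling(20))  # 20
```

The gap only widens with larger drives: every time your need crosses a pool boundary, you pay for an entire symmetrical pool whether you need it or not.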

Ceph and the CRUSH approach

Ceph takes a radically different approach. It eliminates the need for a central lookup table or rigid server pools using an algorithm called CRUSH (Controlled Replication Under Scalable Hashing).

In legacy storage systems, a central Metadata Server acts like a librarian.

  • Request: “Where is training_data_batch_1.json?”
  • Librarian: Checks database… “It is on Drive 4, Sector 2.”

As clusters grow to petabyte scale, this ‘librarian’ becomes a bottleneck. If the database gets too big or the librarian gets overwhelmed, the entire cloud slows down.

Ceph fires the librarian.

Instead, Ceph distributes a “map” of the cluster to every client (your application).

  • Request: “I want to write training_data_batch_1.json.”
  • Client: Runs the CRUSH algorithm locally. “Mathematically, given the current state of the cluster, this file must go to OSD #4.”
  • Action: The client talks directly to OSD #4.

Because the clients calculate data placement themselves, there is no central gateway bottleneck. You can hammer a Ceph cluster with millions of IOPS, and because the clients are doing the maths, the cluster scales linearly.
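The principle of client-side placement can be sketched in a few lines. This is not CRUSH itself – the real algorithm accounts for failure domains, weights, and placement groups – but rendezvous (highest-random-weight) hashing, used here as a simplified stand-in, shows the key property: any client holding the same cluster map computes the same answer, with no lookup server in the path.

```python
import hashlib

def placement(obj_name: str, osds: list[str]) -> str:
    """Deterministically pick an OSD for an object.

    Rendezvous hashing stands in here for Ceph's real CRUSH
    algorithm: every client holding the same list of OSDs
    computes the same answer, with no central lookup involved.
    """
    def weight(osd: str) -> int:
        digest = hashlib.sha256(f"{obj_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(osds, key=weight)

cluster = ["osd.0", "osd.1", "osd.2", "osd.3"]
target = placement("training_data_batch_1.json", cluster)
# The client now talks to `target` directly -- no librarian in the path.
```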

Self-healing data

This architectural difference shines when hardware fails – and at scale, hardware always fails.

In Ceph, if we add a single new hard drive, the cluster detects it. The CRUSH map updates to reflect the new capacity. The cluster then automatically begins moving data from full drives to the new empty drive in the background. It balances itself like water finding its level.

Conversely, if a drive dies, Ceph marks it as “down” and immediately begins reconstructing the missing data onto the surviving drives using its internal redundancy. We can sleep through a drive failure and replace it during standard business hours, knowing the data has already healed itself.
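The “water finding its level” behaviour falls out of calculated placement. The simulation below uses a simplified rendezvous-hash placement (a stand-in for CRUSH, not the real algorithm) to show that when one drive is added to a four-drive cluster, only about one fifth of the objects relocate, and every one of them lands on the new drive:

```python
import hashlib

def place(obj: str, osds: list[str]) -> str:
    # Simplified rendezvous-hash placement (a stand-in for CRUSH):
    # each object goes to the OSD with the highest per-pair hash.
    return max(osds, key=lambda o: hashlib.sha256(f"{obj}:{o}".encode()).digest())

before = [f"osd.{i}" for i in range(4)]   # original cluster
after = before + ["osd.4"]                # one new drive added

objects = [f"obj-{i}" for i in range(10_000)]
moved = [o for o in objects if place(o, before) != place(o, after)]

# Roughly 1/5 of the data relocates, and every moved object
# lands on the new drive -- the rest of the cluster is untouched.
print(len(moved) / len(objects))
```

Contrast this with pool-based systems, where either nothing rebalances (hot and cold spots) or the addition triggers a far larger reshuffle.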

The complexity myth and the Kubernetes solution

The strongest argument against Ceph has historically been: “But it’s so hard to manage.”

Five years ago, we would have agreed. Managing a Ceph cluster used to require deep expertise in Linux internals, manual editing of text configuration files, and hand-calculating placement groups. It was a beast.

But the landscape has changed dramatically with the rise of Kubernetes and Rook.

Rook is a Cloud Native Computing Foundation (CNCF) project that acts as an “operator” for Ceph. It brings cloud-native automation to storage. Rook handles the dirty work:

  • Deployment: It automates the rollout of the storage daemons.
  • Upgrades: Want to upgrade Ceph? Change one line of YAML, and Rook handles the rolling restart, ensuring data safety the whole time.
  • Expansion: Plug in new drives, and Rook detects them, provisions the Object Storage Daemons (OSDs), and begins the rebalancing process.

Rook has democratised Ceph. It brings the ‘Day 1’ experience of Ceph much closer to the simplicity of other tools, without sacrificing the Day 1,000 power and freedom.

The developer cheat sheet

For the engineers and architects evaluating their options, here is how the two stacks compare in the current landscape:

Feature         | The ‘easy’ path (MinIO Open Source)    | The hard path taken by Stelia (Ceph via Rook)
Status          | Maintenance mode (community edition)   | Active / stable / growing
Governance      | Single vendor (VC-backed)              | Linux Foundation (community)
Architecture    | Single binary                          | Distributed micro-daemons (MON, OSD, MGR)
Data placement  | Hashing / server pools                 | CRUSH algorithm (calculated placement)
Scalability     | Step-function (must add full pools)    | Linear (add a drive, add a node)
Versatility     | Object (S3) only                       | Unified (S3, Block/RBD, File/CephFS)
Hardware        | Prefers homogeneous (identical nodes)  | Heterogeneous (mix NVMe/HDD easily)
Management      | CLI (console stripped down)            | Dashboard + CLI + Kubernetes (Rook)

Don’t rent your foundation

Our decision to choose Ceph wasn’t about finding the easiest path; it was about finding the most sustainable one.

It was about moving away from platforms which historically demonstrated a willingness to remove features, change licenses, and freeze open-source code. Eventually, those costs trickle down to the customer – either in the form of higher prices to cover enterprise licensing fees or, worse, forced migrations when the free version becomes unmaintainable.

We will not pass that supply chain risk on to our customers.

We chose Ceph because it allows us to offer organisations a storage layer that is battle-tested, infinitely scalable, and free from the threat of vendor lock-in.

Ultimately:

We handle the complexity: Ceph is complex under the hood. We take on the burden of tuning CRUSH maps, managing deep scrubbing, and balancing placement groups so customers just get a fast, resilient S3 endpoint.

We control the costs: Because we aren’t paying a per-terabyte tax to a proprietary software vendor, we don’t have to charge customers one either. That means better egress rates and lower storage costs for your models.

In the AI gold rush, many vendors optimise for speed to market. We focus on building infrastructure that remains dependable, performant and resilient when systems reach production scale.
