The elusive ‘five nines’ SLA is a phrase often synonymous with major brand positioning. It refers to 99.999% availability, or ‘only’ 5 minutes and 15 seconds of downtime within a 12-month period. Five minutes of downtime is OK, right? Or is it? Probably not on Black Friday, in the run-up to Christmas, or when tickets for a major event are released and snapped up within moments of going on sale.
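For readers who want to check the arithmetic, here is a minimal sketch of how an availability target translates into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration.

```python
def downtime_budget_seconds(availability_percent: float, days: float = 365.0) -> float:
    """Seconds of permitted downtime over `days` for a given availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    seconds_in_period = days * 24 * 60 * 60
    return (1.0 - availability_percent / 100.0) * seconds_in_period


# Compare three nines, four nines and five nines over a 365-day year.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_seconds(target)
    minutes, seconds = divmod(budget, 60)
    print(f"{target}% allows about {int(minutes)} min {seconds:.0f} s of downtime per year")
```

At 99.999% the budget works out to roughly 315 seconds, which is where the figure of 5 minutes and 15 seconds comes from. At 100% the budget is zero.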
In a world of rapid technological innovation across all sectors, with the combined brainpower of the world’s technology community available (99.999% of the time) in the palm of your hand, why is a 100% SLA so difficult to achieve, and why are so few operators able to commit to a level of service historically thought impossible?
Do we even need 100% uptime?
Service availability is one of the dirty areas of the internet, and service providers sometimes approach it through the lens of ‘what can we get away with?’ and ‘nobody will notice if it’s at 4 AM’. Infrastructure providers use tricks such as long polling intervals (the gap between service availability checks), as long as 60 minutes in some scenarios. This leaves a huge unmonitored window and lets them plausibly deny that any service interruption ever happened, because it wasn’t monitored. Or they simply deploy the stock ‘we didn’t notice anything’ response.
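The effect of a long polling interval can be sketched with a few lines of arithmetic. This is an illustrative model only: it assumes checks fire on a fixed schedule starting at time zero, and the function name and the example outage are hypothetical.

```python
def probes_during_outage(outage_start_s: int, outage_duration_s: int,
                         interval_s: int) -> int:
    """Count availability checks that land inside an outage window.

    Assumes checks fire at t = 0, interval_s, 2 * interval_s, ... and that the
    outage covers the half-open window [start, start + duration).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    outage_end_s = outage_start_s + outage_duration_s
    # Index of the first check at or after the outage start (ceiling division).
    first_index = max(0, -(-outage_start_s // interval_s))
    # Index of the last check strictly before the outage end.
    last_index = -(-outage_end_s // interval_s) - 1
    return max(0, last_index - first_index + 1)


# A hypothetical 50-minute outage that begins one minute after a check.
hourly = probes_during_outage(60, 50 * 60, interval_s=60 * 60)
per_minute = probes_during_outage(60, 50 * 60, interval_s=60)
print(f"60-minute checks that saw the outage: {hourly}")
print(f"60-second checks that saw the outage: {per_minute}")
```

With hourly checks, an outage of almost an hour can fall entirely between two probes and never appear in the monitoring record. With one-minute checks the same outage is seen dozens of times.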
None of this is good enough, and the next generation of network, cloud and application providers are raising the bar to 100% service availability and holding others accountable for this too.
The journey to 100% service availability starts with transparency and an open approach to service monitoring. Availability as observed internally is only one position: network providers often report that no service interruption was seen while client services are unavailable for large segments of the user base. What about availability as seen externally, per service and per application? Where are the users based: a single building, a country, or across the globe? And how is this achieved at scale, with a globally distributed customer base and globalised infrastructure?
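The gap between internal and external views can be illustrated by grouping check results by where they were measured. The sketch below is not a description of any real monitoring platform; the vantage-point names and the sample results are made up for the example.

```python
from collections import defaultdict


def availability_by_vantage(results):
    """Map each vantage point to the fraction of its checks that succeeded.

    `results` is an iterable of (vantage_point, succeeded) tuples.
    """
    ok = defaultdict(int)
    total = defaultdict(int)
    for vantage, succeeded in results:
        total[vantage] += 1
        if succeeded:
            ok[vantage] += 1
    return {vantage: ok[vantage] / total[vantage] for vantage in total}


# Hypothetical check results: the internal view is clean, one region is not.
sample = [
    ("internal", True), ("internal", True), ("internal", True), ("internal", True),
    ("eu-west", True), ("eu-west", True), ("eu-west", False), ("eu-west", False),
    ("ap-south", True), ("ap-south", True), ("ap-south", True), ("ap-south", True),
]
report = availability_by_vantage(sample)
for vantage, fraction in report.items():
    print(f"{vantage}: {fraction:.0%} of checks succeeded")
```

In this made-up data the internal vantage point reports no interruption at all, while users behind one external vantage point saw half of their checks fail. A single internal number would hide that.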
And then there’s cost. There’s no denying that a high-uptime connectivity architecture will cost more than one built for lower availability targets, simply because of the additional equipment and diversity required. But it’s not impossible.
At Stelia we don’t have all the answers to this yet, but we can explain our journey, how we approach the 100% service uptime guarantee we offer for services across the fabric, and what this means for our globally distributed client base of growing technology partners.
Stelia engineered the fabric’s management backwards. We started with observability, engineering it as a service that is aware not only of our own fabric endpoints but also of all connected service providers and their upstream. This data platform is available at every fabric node and runs across on-premises hardware and two different public cloud providers, adopting both container and as-code platforms. We feed this data engine from as many sources as we can, including internet registries, territory-managed service providers, global looking glasses and other service providers. Through this engine, we can look both into and out of the fabric, towards clients and their upstream as well.
The Stelia fabric is built on a fully disaggregated architecture, freeing us from proprietary chassis-based hardware and allowing us to distribute load, and thus diversity, across multiple pieces of independently powered and controlled high-capacity hardware. Pair this with carefully chosen underlying fibre providers and the best in-region alt-nets, and the elastic Fabric underlay is complete. Disaggregation allows Stelia to dynamically choose the specific packages and technologies available on a node, avoiding the complex multi-role architectures of traditional carriers, which simplifies management and reduces risk. Stelia adopts passive architectures across the backbone as far as possible and embraces the design capabilities only now available with 400G ZR+ standards, removing more hardware from the backbone and further reducing risk.
All of these small decisions help us maintain and deliver on our core fabric design principles: Simple, Scalable, Secure.
Much of this technology and design theory has been well developed and deployed by the large hyper-scale application providers and is only now becoming widely adopted by emerging backbone providers such as Stelia.
Stelia firmly believes 100% service uptime is absolutely achievable and should be the benchmark for all new network operators. As such, Stelia will offer a 100% SLA on any on-fabric service based on a dual-port configuration, either within or between territories, and at any location. We’re confident in our architecture and in our ability to continually innovate our Fabric capability.
One caveat: we can’t prevent an entire facility power meltdown. But if the entire facility is offline and we’re offline because of it, our customers are probably having a pretty bad day too.
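The reasoning behind a dual-port requirement can be illustrated with the standard formula for redundant paths. This is a textbook simplification, not Stelia’s SLA model: it assumes the two ports fail independently, and the per-port figure is a hypothetical number chosen for the example.

```python
def parallel_availability(per_path: float, paths: int = 2) -> float:
    """Availability of `paths` redundant paths, assuming independent failures.

    The service is down only when every path is down at the same time.
    """
    if not 0.0 <= per_path <= 1.0:
        raise ValueError("per_path must be a fraction between 0 and 1")
    if paths < 1:
        raise ValueError("paths must be at least 1")
    return 1.0 - (1.0 - per_path) ** paths


single = 0.999  # hypothetical per-port availability, not a Stelia figure
dual = parallel_availability(single, paths=2)
print(f"single port: {single:.4%}, dual port: {dual:.4%}")
```

The independence assumption is what makes the caveat above matter. A facility-wide power failure takes out both ports at once, so no amount of port-level redundancy inside that facility helps. Diversity across hardware, power and location is what keeps failures independent.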
This only scratches the surface of how our Fabric is designed to support growing technology businesses, and we will be publishing much more information on our service architecture over the coming months.