Voice AI is ready. The question is whether enterprises are.

The ElevenLabs Summit reinforced that voice AI has moved from emerging capability to deployment reality; the question now is enterprise readiness.

Last week I was in London for the ElevenLabs Summit, a day spent with founders, operators, and enterprise leaders all circling the same question from different angles: what does it actually take to deploy voice AI at scale, and what changes when you do?

It's a question I find myself returning to constantly in conversations with media and entertainment organisations. Voice is quickly becoming one of the most powerful interfaces in AI, shaping agents, real-time experiences, and the way audiences interact with content at every level. What the summit made clear is that the conversation is shifting from what voice AI could theoretically do to how organisations are actually deploying it in practice, and to the architectural foundations that make deployment at scale possible.

What’s actually holding enterprises back?

Mati Staniszewski, ElevenLabs’ founder and CEO, opened with the observation that voice AI is now moving from experimentation to deployment, and he stressed that the surrounding AI architecture must catch up for this to happen. Enterprises are increasingly confronting what it takes, architecturally, to deploy AI at scale. Integrating voice AI into production environments requires new layers of governance, observability, and system integration across fragmented enterprise infrastructure, and these challenges are still being actively worked through.

The components required to scale AI responsibly – governance frameworks, risk controls, monitoring, and operational oversight – are becoming clearer, but remain unevenly implemented across most organisations. For many enterprises, the limiting factor is no longer model capability, but the readiness of their infrastructure to integrate, supervise, and operationalise AI safely. Addressing this architectural gap is now becoming the central challenge of enterprise AI adoption.

What is emerging is not simply a model deployment challenge, but an orchestration challenge. A practical example came from Deutsche Telekom’s deployment discussion at the summit. At telco scale, voice already sits at the centre of billions of interactions. When AI shifts from the app layer to the network layer, adoption dynamics change entirely. Enterprises are now working across multiple models, modalities, and systems (speech, video, text, data platforms, and internal workflows), all of which must operate cohesively. Voice AI does not exist in isolation; it must connect to enterprise data, trigger actions, and operate safely within production environments. This shifts the focus from individual model capability to the orchestration layer required to coordinate how AI systems interact, execute, and evolve at scale.

From pilots to production – carefully, but decisively.

Later in the day, a panel on deploying AI at scale brought together voices from ElevenLabs, BCG X, Naturgy, and Konecta to discuss what separates those who break through the production barrier from those who don’t.

The companies making progress are redesigning workflows and committing to production with governance embedded from the outset, balancing speed with structural discipline. I continually see teams across the media and entertainment landscape struggle with this. Those who treat AI as something to be perpetually evaluated never move, while those moving at speed without governance accumulate risk. The organisations that are successfully building resilient, long-term AI capability are those making architectural commitments, restructuring around the technology, and moving forward with intent.

In a market increasingly defined by AI urgency, successful adoption at scale is not reckless; it depends on building the operational foundations that enable faster movement with confidence.

What production-ready voice AI actually looks like

An important signal from the summit was the introduction of Eleven Creative and the new Flows environment, which bring multimodal generation into a single, orchestrated creative workflow. These systems enable teams to move from idea to fully produced, localised campaigns in a matter of hours, while supporting continuous iteration and experimentation across audiences, formats, and markets.

Luke Harries demonstrated how brands are already using Eleven Creative to produce studio-quality advertising by combining speech, music, sound effects, image, and video generation within a unified platform. This underscores that the shift is not only creative acceleration, but operational transformation – creative becoming infrastructure-connected rather than campaign-based, where voice becomes a persistent identity layer and execution operates continuously.

ElevenLabs illustrated this transition through its collaboration with Michael Caine, whose voice has been integrated into the platform as part of its voice marketplace. Voice is becoming a licensable digital asset, allowing creators and actors to extend their presence across campaigns, formats, and markets without traditional production constraints. For media and entertainment organisations, this introduces new models for creative scalability, localisation, and audience engagement, while establishing voice as a persistent identity layer across platforms and experiences.

More fundamentally, creativity itself is beginning to shift from a production process to a programmable system. With platforms like Eleven Creative and Flows, creative workflows become connected to data, experimentation loops, and distribution infrastructure. This enables organisations to continuously generate, test, and optimise campaigns across audiences, formats, and markets, transforming creative from a static output into a dynamic, operational capability embedded directly into enterprise systems.

This shift reinforces that competitive advantage will increasingly depend not only on model capability, but on the infrastructure and orchestration layers that enable organisations to deploy, govern, and scale AI reliably across their operations.

System foundations and human-AI collaboration

Arguably, the day’s most talked-about session was a fireside chat between Mati Staniszewski and Klarna’s Founder and CEO, Sebastian Siemiatkowski. Two areas in particular that the pair discussed are worth drawing out.

From an architecture perspective, Klarna’s experience moving away from a fragmented set of SaaS tools toward more unified internal systems reinforces a broader enterprise reality: AI performance scales in proportion to data coherence and architectural simplification. The foundations are critical – the data architecture, security frameworks, and system coherence – because without them, even the most capable models are constrained by the quality of the system beneath them.

The second thread was equally important: what the right human and AI model actually looks like in practice. In a human + AI collaboration, AI handles scale and routine, while humans take care of connection, complexity, and the experience that builds lasting trust.

In the media and entertainment context specifically, this is a competitive principle. These are industries where creativity is the strategic advantage. By allowing AI to handle the volume, distribution and operational load, organisations pave the way for human attention to be spent on editorial instinct, creative judgement and building and sustaining relationships with an audience, factors that remain irreducibly human. The organisations that understand this distinction in strengths are protecting what makes them valuable while deploying AI more intentionally and in a way that drives more meaningful impact.


The road ahead

The ElevenLabs Summit left little doubt that voice AI has moved from emerging capability to deployment reality. The questions today are no longer about technology capability, but instead hinge on enterprise readiness: organisational, architectural, and cultural.

For media and entertainment leaders, where voice is set to redefine how audiences experience content, this means committing decisively to AI initiatives and building systems with governance, security, and architectural discipline embedded from the ground up. Those systems should be designed not just to perform today, but to scale, adapt, and be governed effectively over time.
