Thus far, large language models (LLMs) have been trained on vast amounts of publicly available content from across the internet. Wikipedia entries, academic papers, news articles, published books and open-source code repositories have all been scraped and ingested by AI companies. But now, the accessible, high-quality data that powered this first generation of models has largely been consumed.
Arthur Mensch, CEO of Mistral AI, put it plainly in a recent WSJ interview. For three years, AI companies have been able to steadily improve their models by training on the internet’s collective knowledge, but “now, we’re reaching a saturation point.” As publicly available data runs dry, this new landscape represents an opportunity for enterprises to capitalise on the very assets they already possess.
While public data sources have been exhausted, a different category of information remains largely untapped: the proprietary intelligence sitting within enterprises. This includes domain expertise built over decades, operational patterns that exist nowhere else, and customer relationships generating continuous data streams that cannot be scraped or synthesised from the public web.
For enterprises, this reframes the competitive landscape: the organisations positioned to extract the most value from AI will no longer be just those with the largest AI teams or access to the most sophisticated models. Instead, it will be those that recognise their proprietary data as their most defensible competitive advantage and understand how to leverage it as a strategic asset.
This article will examine why proprietary data has become the most valuable resource in AI development, the imperative to approach this opportunity strategically, and what it means for organisations that move to productise their data now compared to those that wait.
The public data plateau
Originally perceived as an infinite resource, the public web has been exhausted. The accessible, high-quality data that powered the first generation of foundation models has already been ingested and encoded into model weights, leaving LLM providers with lower-quality sources and duplicate content that add only marginal value at ever-greater computational cost.
In response to this crisis, some organisations are turning to synthetic data: training data generated artificially by AI models rather than collected from real-world sources. Synthetic data delivers genuine value in specific applications, such as augmenting rare edge cases for autonomous systems or balancing datasets for fraud detection.
But as a solution to fundamental data scarcity, it cannot replace human-generated content. Synthetic data remains derivative by nature, constrained by the knowledge embedded in the models generating it, and while it can fill gaps and extend existing datasets, it cannot capture the originality, depth and unexpected patterns that emerge from real-world human activity and domain expertise.
So, while LLM providers pursue alternative approaches to address scarcity, enterprises hold the competitive opportunity to capitalise on the unique intelligence they already possess.
The real opportunity
Enterprises with proprietary data are sitting on the most valuable resource in the current AI landscape. This opportunity manifests in two distinct ways, each with different implications for how organisations should think about their data assets.
Proprietary content as a business model
Some organisations’ business models are fundamentally built on their data: legal research platforms, specialised publishers, and financial information providers, for example. These companies have spent decades curating, verifying, and structuring domain-specific knowledge that took significant capital to build.
For these organisations, the strategic question becomes: if you’ve invested decades building proprietary intelligence, how do you leverage it to power entirely new products and revenue streams?
Bloomberg’s approach is a key example of what’s possible when proprietary data meets strategic AI investment. The company used its proprietary sources to build BloombergGPT, a 50-billion-parameter model trained on 710 billion tokens of data, 52% of which came directly from earnings reports, market analyses, company filings, and internal communications that no competitor could access. Rather than selling this data through one-time licensing deals with foundation model providers, with no way to track attribution or retain ongoing value, Bloomberg productised it. Its BQuant analytics platform and Document Search capabilities, which surface insights from more than 400 million company documents, unlocked new revenue streams built entirely on data assets the company already had sitting idle.
Thomson Reuters has followed a similar path in legal services. With more than 20 billion proprietary legal documents, the company built CoCounsel, an AI assistant grounded in Westlaw’s extensive case law and legal analysis. Rather than licensing its legal database for others to ingest into general-purpose models, Thomson Reuters acquired Safe Sign Technologies to develop legal-focused language models and now partners with providers such as OpenAI to create custom versions of their LLMs tuned on Thomson Reuters’ proprietary content. Today, more than 12,200 law firms use CoCounsel, creating a new product category built on data Thomson Reuters already possessed.
These companies recognised that their proprietary data could power entirely new product lines while they retained control, attribution, and ongoing value.
Operational data as a renewable resource
The second category represents another opportunity unique to enterprises, one that addresses the finite nature of publicly available data. It applies to organisations that are not just holding historical proprietary data but continuously generating new data through ongoing customer relationships and operations.
In this data-scarce landscape, such activity is a renewable resource. While foundation model providers have exhausted the static public web, organisations with direct customer relationships can create continuous streams of fresh, human-generated data that did not exist before, something far more valuable in today’s market.
Consider a SaaS platform with thousands of active users generating workflow data daily, a healthcare system capturing clinical patterns and operational insights through patient interactions, or a financial services firm processing transactions that reveal market behaviours and customer needs. These organisations are generating new proprietary intelligence each day through the relationships and operations that define their business.
This continuous generation of domain-specific, human-created data is precisely what cannot be replicated through synthetic approaches or web scraping, and that makes the strategic implication significant. While organisations with finite proprietary archives must be thoughtful about how they leverage their IP, organisations with renewable, human-generated content can build AI capabilities that improve continuously, fed by an ongoing stream of proprietary data that competitors simply cannot obtain.
Capitalising on this opportunity requires architectural foresight
Organisations that recognise their proprietary data as a competitive asset and move to productise it now can establish advantages that compound over time through new revenue streams, expanded product lines, and competitive positioning built entirely on data assets they already possess.
But this opportunity favours only those who approach it strategically. The difference between successfully productising proprietary data and squandering a competitive advantage lies in execution: understanding how to transform that data into products that retain control, attribution, and ongoing value, and having the foresight to scale those products safely and effectively. This demands a system-aware approach, architecting across the full stack to embed security, scalability and flexibility from the ground up while protecting IP and unlocking the significant opportunity owned data represents.
The path forward
The data landscape has structurally shifted and exposed where value truly lies. While foundation model providers are on a quest for data, enterprises with unique intelligence hold what has become AI’s most valuable resource.
But knowing you have valuable data is only the beginning. The question now is which organisations will recognise this opportunity, move quickly enough to leverage it safely and securely, and build a future-proof advantage.
Those who establish lasting advantage in this new landscape will understand how their intelligence can be leveraged, and will build with architectural foresight for a future where data serves as strategic infrastructure and scalable competitive differentiation. The ability to navigate this complexity and accelerate time-to-market is vital, and it is why a number of enterprises are choosing to work with Stelia.
As this series progresses, we’ll explore key considerations organisations must keep front of mind when productising proprietary data and building AI capabilities that protect the competitive advantages their data represents.