At Stelia, we pride ourselves not only on pushing the boundaries of applied AI research, but also on contributing to open-source work that benefits the wider AI ecosystem.
In this article, we outline the technical contributions of Lukas Stockner to the open-source rendering community. Lukas is principal platform engineer at Stelia and is also a core developer of Blender, the established open-source 3D creation suite.
Rendering creates photorealistic images from mathematical descriptions of scenes – technology essential to the film, entertainment, and gaming industries. It represents a compelling technical domain where complex, resource-constrained computation must deliver both visual fidelity and real-time performance. And all of this occurs within constantly evolving hardware constraints, a challenge that increasingly mirrors the one facing AI systems as a whole.
While the majority of professional rendering engines operate through controlled research cycles and carefully managed feature releases, the open-source community’s experimental and transparent approach allows for rapid integration and clear visibility into the technical realities that define modern GPU rendering.
Lukas, whose ray tracing optimisations have become core to the GPU-first architecture of Cycles, Blender’s primary production renderer, is uniquely positioned to understand the field’s evolution. From this vantage point, he can analyse how hardware acceleration has transformed rendering performance, and where machine learning is genuinely strong and where it still falls short. More importantly, his technical expertise provides insight into the innovations that will define the next generation of production pipelines. This article will examine these current developments while exploring the breakthrough technologies and optimisation strategies that represent the future of GPU rendering.
Hardware-accelerated ray tracing goes mainstream
Perhaps the most visible change for the rendering industry in recent years has been the mainstream adoption of hardware-accelerated ray tracing. Indeed, ray tracing itself has been a core production technique in certain industries for decades – architectural visualisation and product design have relied on it for more than 25 years, while the VFX and animation sectors began integrating it at scale around 15 years ago. The significant advancement of the past six years, however, is hardware-accelerated real-time ray tracing. The launch of NVIDIA’s RTX series and AMD’s RDNA GPUs made ray tracing fast enough for gaming and other real-time applications for the first time, unlocking entirely new use cases and pushing adoption into interactive graphics at a scale never seen before.
GPU-first rendering architectures were well-positioned when this hardware shift arrived. For Cycles, which had been GPU-first for seven years already, since its inception in 2011, the arrival of RTX hardware, and the massive investment in fast ray tracing it stimulated in the gaming industry, completely validated this architectural approach. Techniques developed to maximise gaming performance, from efficient ray traversal algorithms to shading approximations, began flowing back into offline rendering, accelerating improvements in both speed and realism.
One of the clearest examples of this cross-pollination is Linearly Transformed Cosines (LTCs). Originally developed in the gaming world as a fast approximation for glossy reflections of area light sources before hardware ray tracing was feasible, LTCs were designed to squeeze believable lighting out of extremely tight real-time budgets. That same approach now underpins Disney’s cloth materials and is integrated into Cycles to handle complex light scattering at a fraction of the computational cost. This demonstrates how performance-driven real-time techniques can cross over into high-end production rendering, improving both speed and the physical plausibility of the final image.
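To make the idea concrete, here is a minimal C++ sketch of the core LTC trick for a quad light: transform the light’s corners by a precomputed inverse matrix (fitted per roughness and view angle, normally fetched from a small lookup table), project them onto the unit sphere, and evaluate the closed-form integral of a clamped cosine over the resulting polygon. The Vec3 helpers and the integrateEdge and ltcQuadLight names are illustrative only, not code from Cycles or any other engine.

```cpp
#include <array>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

static Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

// Analytic contribution of one edge of a spherical polygon to the integral of
// a clamped cosine over that polygon (Lambert's edge formula).
static float integrateEdge(const Vec3& v1, const Vec3& v2) {
    float theta = std::acos(std::fmax(-1.0f, std::fmin(1.0f, dot(v1, v2))));
    Vec3 c = cross(v1, v2);
    float sinTheta = std::sqrt(dot(c, c));
    return sinTheta > 1e-6f ? c.z * (theta / sinTheta) : 0.0f;
}

// Approximate glossy shading from a quad area light. The corners are given in
// the shading frame, relative to the shading point; Minv is the precomputed
// inverse LTC matrix for the current roughness and view angle.
float ltcQuadLight(const std::array<Vec3, 4>& corners,
                   const std::array<std::array<float, 3>, 3>& Minv) {
    const float kPi = 3.14159265358979f;

    // Transform the light's corners into the space where the BRDF lobe
    // becomes a clamped cosine, then project them onto the unit sphere.
    std::array<Vec3, 4> L;
    for (int i = 0; i < 4; ++i) {
        const Vec3& p = corners[i];
        L[i] = normalize({Minv[0][0] * p.x + Minv[0][1] * p.y + Minv[0][2] * p.z,
                          Minv[1][0] * p.x + Minv[1][1] * p.y + Minv[1][2] * p.z,
                          Minv[2][0] * p.x + Minv[2][1] * p.y + Minv[2][2] * p.z});
    }

    // Closed-form integral of the clamped cosine over the transformed polygon.
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += integrateEdge(L[i], L[(i + 1) % 4]);

    // One-sided light: a back-facing polygon contributes nothing.
    return std::fmax(0.0f, sum) / (2.0f * kPi);
}
```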
Such cross-pollination is becoming increasingly common, as the technical gap between real-time and production rendering narrows faster than ever. For years, the two coexisted as almost entirely separate worlds: real-time rendering was optimised for interactive performance at all costs, while production rendering pursued uncompromising image quality, regardless of how long it took. Now, many of the algorithms, approximations and hardware optimisations born in gaming are being adapted into offline pipelines – not as shortcuts, but as targeted accelerators for established high-quality rendering methods. The difference increasingly lies in priorities: real-time engines still sacrifice fidelity when performance demands it, while production renderers set a fixed quality target and spend however long is necessary to achieve it.
The convergence of real-time and production techniques is reshaping competitive dynamics across the industry. Although open-source alternatives may not be displacing Arnold in film VFX or V-Ray in architectural visualisation, the performance gap has narrowed considerably in many workflows. The combination of hardware-accelerated ray tracing, open-source accessibility, and node-based flexibility now delivers photorealism that rivals proprietary solutions, without licensing costs. For studios evaluating rendering pipelines, the technical advantages that once justified premium engines are becoming harder to defend against modern GPU-optimised renderers that can exploit the same hardware optimisations.
ML’s breakthrough moment in denoising
Alongside these hardware advances, machine learning also entered the rendering pipeline, with a deceptively simple but effective advancement: denoising, the removal of residual noise from ray-traced images so that clean, high-quality frames can be produced from far fewer samples.
Denoising succeeded because of where it sits in the pipeline: it runs once, as a post-processing step on the final image, which makes it simple to implement and sidesteps the complexity of real-time integration. Modern renderers increasingly rely on Intel’s OpenImageDenoise, as it delivers state-of-the-art results for real-world scenes across every platform (CPU, NVIDIA, AMD, Intel, Apple). This cross-platform consistency is vital, eliminating the feature divergence nightmare that usually comes with hardware-specific optimisation.
In fact, Intel’s OpenImageDenoise even earned an Oscar for Technical Achievement in 2024, and the reasons are clear – it has fundamentally transformed hours of render refinement into one automated cleanup pass.
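To illustrate how lightweight that integration is, here is a minimal sketch of such a post-process pass using OpenImageDenoise’s C++ API. The resolution and the colour, albedo, normal and output buffers are placeholders standing in for whatever passes a renderer actually produces.

```cpp
#include <OpenImageDenoise/oidn.hpp>
#include <iostream>
#include <vector>

int main() {
    const int width = 1920, height = 1080;

    // Noisy beauty pass plus auxiliary feature passes from the renderer
    // (placeholder buffers here; a real renderer supplies its own).
    std::vector<float> color(width * height * 3), albedo(width * height * 3),
                       normal(width * height * 3), output(width * height * 3);

    // Create and commit a device; OIDN picks the best available backend.
    oidn::DeviceRef device = oidn::newDevice();
    device.commit();

    // The "RT" filter is the generic ray tracing denoiser.
    oidn::FilterRef filter = device.newFilter("RT");
    filter.setImage("color",  color.data(),  oidn::Format::Float3, width, height);
    filter.setImage("albedo", albedo.data(), oidn::Format::Float3, width, height);
    filter.setImage("normal", normal.data(), oidn::Format::Float3, width, height);
    filter.setImage("output", output.data(), oidn::Format::Float3, width, height);
    filter.set("hdr", true);  // the beauty pass is high dynamic range
    filter.commit();

    // One denoising pass over the finished frame; no per-sample cost.
    filter.execute();

    const char* errorMessage;
    if (device.getError(errorMessage) != oidn::Error::None)
        std::cerr << "OIDN error: " << errorMessage << std::endl;
    return 0;
}
```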
The performance wall blocking broader ML integration
Broader ML integration has been less successful, largely for practical reasons. While ML models tend to improve the larger they get, as the LLM trend has shown, rendering demands the opposite: applications intelligent enough to provide meaningful improvements yet small enough not to hurt performance, which is difficult to achieve. Because ray tracing repeats the same simple operations billions of times, anything added to that inner loop must be extremely fast, or it overwhelms every other cost by orders of magnitude.
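To put rough, illustrative numbers on that constraint: a 4K frame at 1,000 samples per pixel with around five ray segments per sample comes to roughly 40 billion ray evaluations, so adding even 100 nanoseconds of inference to each one would add more than an hour to the frame. Any model that lives inside that loop therefore has to be tiny and extremely cheap to evaluate.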
This size limitation demands a different approach to model efficiency. OpenImageDenoise, for example, works with just 10 million parameters, a fraction of what would be considered small for an LLM. Combined with the fragmented software ecosystem, these restrictions create a significant engineering challenge for the industry.
Several promising approaches are emerging to work within these strict constraints. Neural textures are one example, and should deliver significant GPU memory savings. Critically, the path to widespread ML adoption in modern rendering hinges on two key developments: research into small models built specifically for rendering tasks, and unified programming models that don’t require excessive development effort across different hardware vendors.
What’s next in GPU rendering?
Aside from broader ML adoption, the most immediate breakthrough in rendering lies in texture caching, which tackles one of GPU rendering’s biggest bottlenecks. Complex scenes can demand hundreds of gigabytes of textures, but the VRAM limits of GPUs – coupled with rendering’s requirement that everything fit on each individual GPU – leave the industry unable to circumvent the problem in the way ML workloads can. This memory wall kept much of the CGI and VFX industry on CPU rendering until hardware ray tracing made GPU performance impossible to ignore, and it has since emerged as the primary bottleneck to address.
The emergence of texture caching turns this hard memory limit into a problem of dynamic resource management. With texture caching, modern GPU renderers load textures at render time and unload them when space is needed, converting an absolute memory requirement into a performance trade-off. Scenes with many unused textures could run at practical speeds using a fraction of the memory. While CPU renderers have done this for years, implementing it efficiently on GPUs is significantly harder, a problem that Lukas and the Blender team, as well as the industry as a whole, are racing to solve through approaches that could inform GPU memory management across industries.
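The general shape of such a cache is easy to sketch, even though doing it efficiently when requests originate from GPU shader code is the hard part. The toy C++ below shows the simplest version of the idea: a fixed memory budget, textures loaded on first use, and least-recently-used textures evicted to make room. The TextureCache class and its stubbed load and unload calls are purely illustrative and do not reflect Cycles’ actual implementation.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Toy texture cache: keeps total resident texture memory under a fixed budget
// by evicting the least recently used textures. A real GPU implementation must
// also deal with tiles, mip levels and miss handling from shader code.
class TextureCache {
public:
    explicit TextureCache(std::uint64_t budgetBytes) : budget_(budgetBytes) {}

    // Ensure a texture is resident before it is sampled.
    // Returns true if the texture is resident after the call.
    bool request(const std::string& name, std::uint64_t sizeBytes) {
        auto it = entries_.find(name);
        if (it != entries_.end()) {
            // Already resident: move to the front of the LRU list.
            lru_.splice(lru_.begin(), lru_, it->second.lruPos);
            return true;
        }
        if (sizeBytes > budget_) return false;  // could never fit

        // Evict least recently used textures until the new one fits.
        while (used_ + sizeBytes > budget_) {
            const std::string victim = lru_.back();
            used_ -= entries_[victim].size;
            unloadFromGpu(victim);  // stub for the actual GPU free
            entries_.erase(victim);
            lru_.pop_back();
        }

        loadToGpu(name);  // stub for the actual GPU upload
        lru_.push_front(name);
        entries_.emplace(name, Entry{sizeBytes, lru_.begin()});
        used_ += sizeBytes;
        return true;
    }

private:
    struct Entry {
        std::uint64_t size;
        std::list<std::string>::iterator lruPos;
    };

    void loadToGpu(const std::string&) {}      // placeholder
    void unloadFromGpu(const std::string&) {}  // placeholder

    std::uint64_t budget_;
    std::uint64_t used_ = 0;
    std::list<std::string> lru_;  // front = most recently used
    std::unordered_map<std::string, Entry> entries_;
};
```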
The bigger picture
These optimisation strategies emerging within rendering reflect the same fundamental challenge we address in global-scale AI platforms: intelligent resource orchestration under strict performance constraints. Whether coordinating GPU memory across diverse hardware or scaling AI systems globally, the shift from brute-force scaling to smarter resource allocation is becoming the defining advantage.