Asked if this animal is a lion or tiger, the model says lion – confidently, instantly, and completely wrong.

It’s not a lion. It’s not a tiger. It’s a liger – something the model has never seen before. But instead of admitting uncertainty, it picks the closest match and delivers it like fact. Many would call that a hallucination. In reality, it’s the system working exactly as designed.

Hallucinations, however, are just one example of a bigger issue: the industry fundamentally misunderstands what large language models are doing, and the LLM terminology we use reflects that misunderstanding.
In this article, together with Stelia’s engineering team, I will map out how these misunderstandings manifest in practice and why they persist, then offer concrete strategies engineering teams can use to communicate more effectively with stakeholders and avoid costly, misdirected effort.
The risks of anthropomorphic language
How many engineering hours has your team burned “fixing” hallucinations when nothing was actually broken? If you’ve ever been told to “stop the model from making things up”, only to realise that the request misunderstands how the model is designed to operate, you’ve experienced the productivity drain caused by imprecise LLM terminology.
In 2022, a senior Google engineer became convinced that LaMDA, one of the company’s large language models, was sentient. He even argued for its legal rights. Google placed him on leave, and later terminated his employment, but not before the story went global. Setting aside the media frenzy, this was a textbook example of how anthropomorphic language can warp technical judgement, even among experienced practitioners.
Inside companies, the same misunderstanding drives costly side effects:
- Over-engineering solutions that add complexity without improving reliability in the contexts that matter.
- Roadmap drift as technical priorities shift to solve behaviours that are baked into probabilistic generation.
- Delays from overbuilt review pipelines meant to catch every hallucination.
The core risk is strategic: senior decision-makers who don’t fully grasp how LLMs work approve investments aimed at fixing behaviours that are intrinsic to the architecture. Engineering teams are then drawn into expensive, time-consuming projects that can’t deliver the promised outcome, because the goal itself rests on a misunderstanding.
Understanding stochastic sampling as the root of generation errors
As technical teams appreciate, LLMs don’t look up facts in a database or form opinions through analysis; they generate each token by sampling stochastically from a learned probability distribution. Higher temperatures and sparse training data are well-known factors that increase the likelihood of the outputs we label hallucinations. Technical teams also understand that while other AI systems can be trained to return “unknown” when confidence is low, LLMs almost never do this – they’re optimised for continuous generation, and their training strongly rewards giving some answer over admitting uncertainty.
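To make that mechanism concrete, here is a minimal sketch of temperature-scaled sampling over a toy next-token distribution. The vocabulary and logit values are invented purely for illustration, not taken from any real model, but the pattern is the same: raise the temperature and lower-probability continuations (the kind we end up calling hallucinations) get sampled more often, and notice that “I don’t know” is never on the menu.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token logits for the prompt "The capital of France is ..."
# (vocabulary and values are invented purely for illustration).
vocab = ["Paris", "Lyon", "Marseille", "Nice"]
logits = np.array([5.0, 1.5, 1.2, 0.8])

def sample_next_token(temperature: float) -> str:
    """Sample one token from a temperature-scaled softmax over the toy vocab."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    # There is no "I don't know" option: some token is always returned.
    return vocab[rng.choice(len(vocab), p=probs)]

for t in (0.2, 1.0, 2.0):
    samples = [sample_next_token(t) for _ in range(10_000)]
    wrong = 1 - samples.count("Paris") / len(samples)
    print(f"temperature={t}: share of wrong answers ≈ {wrong:.3f}")
```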
What’s becoming increasingly clear, however, is that this understanding often stops at the edge of technical teams. Where this technical knowledge ends, executive confusion begins.
The language trap goes beyond hallucination
Hallucination is only the most visible term in an entire vocabulary surrounding LLMs that encourages executives, stakeholders, and even technical teams to misunderstand what these models actually do. Each word causes confusion on its own; together they compound, reinforcing the illusion of human-like thought where none exists.
Hallucination – The headline offender. Saying an LLM hallucinates frames the output as a glitch to be fixed, when in reality the model is acting exactly as designed, generating the most statistically likely continuation, even if that means confidently producing something incorrect.
Reasoning – Suggests human-style logical deduction, step-by-step, toward a conclusion. In practice, the model isn’t thinking through a problem; it’s predicting one token at a time based on patterns in its training data.
Understanding – Implies genuine comprehension of meaning, when LLMs are really just performing sophisticated pattern recognition without awareness of concepts.
Learning – Conjures images of students gaining knowledge over time. But an LLM’s so-called learning is actually parameter optimisation during training, with no ongoing knowledge acquisition in the human sense.
Intelligence – Wraps statistical text prediction in the language of sentient cognition, encouraging anthropomorphic assumptions that lead to unrealistic expectations.
Knowledge – Evokes the idea of a stored, factual database. What the model actually has is a web of probabilistic associations – it knows nothing in the way humans do.
In combination, these terms create a subtle but powerful distortion field around LLM capabilities. They make executives believe there’s a mind behind the model, that mistakes are bugs to be patched, and that “fixing” them is simply a matter of better engineering – when the reality is far more about the limits and nature of statistical generation.
It’s worth asking why this vocabulary persists, even when it consistently leads to misunderstandings. One plausible driver: the marketing incentives of LLM creators themselves.
Framing a hallucination as a small, temporary glitch implies it’s something that can be fixed in the next release – rather than acknowledging it as an inherent consequence of how probabilistic language models work. That framing reassures customers, excites investors, and keeps adoption momentum high.
By presenting these behaviours as incomplete features on a roadmap – rather than as inherent statistical properties with hard limits – marketing language sets unrealistic expectations. The result is executives and teams planning around imagined future capabilities instead of the model’s real, present-day constraints.
When wrong behaviours are actually right
Perhaps the most important question is: is hallucinating itself always a problem? Sometimes, instead of a failure, hallucination is exactly what you asked for. If you prompt an LLM to invent a short story, brainstorm product names, or imagine a future technology, you’re literally requesting it to produce content that doesn’t exist in reality. By definition, the model can’t retrieve facts about something that hasn’t happened or isn’t real – it can only generate a plausible construction. In these cases, what the industry calls hallucination is actually the feature that enables creativity and ideation.
The real challenge comes when the request does have a correct answer – like “What’s the capital of France?” – but the model still produces a plausible-sounding wrong one. The same mechanism that fuels creativity creates misinformation. This is why binary fix/don’t fix thinking doesn’t work for LLM outputs. A hallucination isn’t universally good or bad – it’s context-dependent.
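One practical consequence is that mitigation lives in how you configure and route requests, not in hunting for a “hallucination switch” inside the model. The sketch below assumes a hypothetical generate(prompt, temperature) wrapper around whatever inference stack you use; the policy values and the mandatory verification step for factual tasks are illustrative choices, not prescriptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GenerationPolicy:
    temperature: float
    require_source_check: bool  # is a downstream verification step mandatory?

# Illustrative policies: the same sampling mechanism serves both task types;
# only configuration and downstream checks differ.
POLICIES = {
    "creative": GenerationPolicy(temperature=1.0, require_source_check=False),
    "factual": GenerationPolicy(temperature=0.2, require_source_check=True),
}

def run_request(prompt: str, task_type: str, generate: Callable[..., str]) -> str:
    """Route a request through the policy for its task type.

    `generate` is a hypothetical callable wrapping your inference stack,
    e.g. generate(prompt, temperature=...) -> str.
    """
    policy = POLICIES[task_type]
    output = generate(prompt, temperature=policy.temperature)
    if policy.require_source_check:
        # Placeholder: plug in retrieval, citation checks, or human review here.
        output = f"[pending verification] {output}"
    return output
```

The point is that “Brainstorm ten product names” and “What’s the capital of France?” go through exactly the same generation call; only the configuration and the checks wrapped around it change.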
Crucially, this is not a generic AI problem – it’s an LLM-specific one. Other AI systems, like image classifiers or recommendation algorithms, are not designed to continuously generate novel, human-like content in the same way. Understanding that distinction is key to setting realistic expectations about what hallucinations mean, when they matter, and when they’re in fact the solution.
Learning from other industries
What the industry needs is a standardised understanding of the terms we use. Other technical fields have faced, and solved, this problem of misleading, entrenched vocabulary, and the way they handled it offers useful lessons.
Computer science, for example, wrestled with the term bug. Originally a literal insect found in hardware, it misled clients into thinking software errors were caused by outside interference. The fix wasn’t to abandon the word but to pair it with precise technical subcategories – syntax error, logic error – that clarified cause and scope.
Physics faced something similar with spin, which sounds like physical rotation but actually describes intrinsic angular momentum at the quantum level. The term stayed, but the discipline invested in education and clear mathematical definitions to prevent misunderstanding.
A better way forward
Fixing this problem isn’t a case of banning familiar terms overnight; it means taking control of the context in which they’re used. The goal must be to communicate in a way that’s both technically precise inside the team and accessible outside it, so decisions are made on reality, not marketing gloss.
1. Separate your languages
Adopt a two-tier communication model:
- Internal language: Use terminology that directly reflects the underlying mechanism, e.g. sampling error or low-confidence generation instead of hallucination (see the sketch after this list for one way to make “low confidence” measurable). Be explicit about parameters, probability distributions, and data sparsity when discussing behaviour.
- External language: For executives, clients, and other non-technical stakeholders, explain in plain language what happened and why, without anthropomorphic framing. For example: “The model produced an answer even though it had insufficient data to be confident”, rather than “The model hallucinated.”
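To give “low-confidence generation” a concrete internal definition, one option is to look at per-token log-probabilities, which many inference stacks can expose. This is a sketch under that assumption; the threshold is a placeholder to calibrate on your own traffic, not a recommended value.

```python
import math

def mean_logprob(token_logprobs: list[float]) -> float:
    """Average log-probability of the sampled tokens in one response."""
    return sum(token_logprobs) / len(token_logprobs)

def is_low_confidence(token_logprobs: list[float], threshold: float = -1.5) -> bool:
    """Flag a generation as 'low confidence' if its mean token log-probability
    falls below a threshold. The threshold here is a placeholder; calibrate it
    against your own data and quality labels."""
    return mean_logprob(token_logprobs) < threshold

# Example log-probabilities as an inference API might return them (illustrative values).
confident = [-0.1, -0.3, -0.2, -0.05]
uncertain = [-2.4, -1.9, -3.1, -2.2]

print(is_low_confidence(confident))  # False
print(is_low_confidence(uncertain))  # True

# Per-token perplexity is an equivalent framing some teams prefer:
print(math.exp(-mean_logprob(uncertain)))
```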
2. Build your own terminology map
- Document a set of approved LLM-specific terms that map to actual processes. Define what they mean in your context, and circulate that glossary within your organisation. This ensures hallucination means the same thing to everyone, and that it’s only used where it’s genuinely appropriate. A lightweight check like the sketch below can even help enforce it.
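As a hypothetical example of how a glossary can be made enforceable rather than aspirational, a small script can flag discouraged terms in docs, tickets, or PR descriptions and suggest the approved alternative. The specific mappings below are illustrative; the point is the mechanism, not the wording.

```python
import re

# Example glossary: discouraged term -> approved, mechanism-level alternative.
# These mappings are illustrative; define your own in your glossary.
TERMINOLOGY_MAP = {
    "hallucination": "low-confidence generation / sampling error",
    "the model knows": "the model's training data contains patterns for",
    "the model understands": "the model matches patterns associated with",
    "the model learned": "the model's parameters were optimised on",
}

def lint_terminology(text: str) -> list[str]:
    """Return warnings for discouraged terms found in a piece of text."""
    warnings = []
    for term, preferred in TERMINOLOGY_MAP.items():
        if re.search(re.escape(term), text, flags=re.IGNORECASE):
            warnings.append(f"Found '{term}': consider '{preferred}' instead.")
    return warnings

if __name__ == "__main__":
    draft = "We must stop the hallucination issue because the model knows our pricing."
    for warning in lint_terminology(draft):
        print(warning)
```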
3. Train your stakeholders lightly
- You don’t need to turn your board into ML engineers, but they do need enough grounding to avoid expensive misconceptions. Short, focused sessions on how LLMs generate output, what key terms actually mean, and why certain fixes aren’t so simple will pay for themselves.
4. Stress-test your framing
- Before committing to a project aimed at fixing a behaviour, sanity-check the language used to define the problem. If the wording implies a bug, ask: is this actually a defect, or is it a statistical property of the model?
5. Collaborate industry-wide
- This is a systemic, industry-wide issue. If you find a better term or framework, share it and help move the industry closer to standardised language that matches the technology’s reality.
As LLMs evolve into multimodal and agentic systems, the gap between marketing language and actual capability will only widen. Misleading terms will continue to drive strategic missteps, bad investment decisions, and flawed product bets across entire organisations.
This is an LLM problem, not a generic AI problem, but if we want to avoid building an industry on avoidable misunderstandings, the shift toward precise, standardised language has to begin now.
The teams that control their language control their priorities. By drawing a clear line between marketing narratives and the underlying maths, you not only reduce wasted engineering effort but also set your organisation up to build on what these models actually do well, rather than chasing what the marketing promises they might one day deliver.