The end of AI hallucinations? Researchers found the “lie switch” inside LLMs
A new way to detect AI lies: inside the LLM’s “hallucination circuit,” which doubles as your chatbot’s “people-pleaser circuit”
You are not crazy. When a model “lies,” it often sounds like it is reading from an encyclopedia it never opened. And throwing more parameters or “thinking time” at the problem has not made it go away.
A December 2, 2025 paper from Tsinghua University, H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs, puts a microscope on the problem and finds something both encouraging and unsettling: hallucinations correlate with a very small subset of feed-forward (“MLP”) neurons, and scaling those neurons up makes the model broadly more compliant, more sycophantic, and more jailbreak-prone.
That does not “solve hallucinations.” It does reveal a tight choke point inside the network. Whoever controls the weights can potentially dial that behavior up or down.
Key takeaways
The Tsinghua team identifies hallucination-associated neurons (“H-neurons”) that are typically under 1‰ (per-thousand) of neurons, sometimes as low as 0.01‰ in large models.
Those neurons do not just “store wrong facts.” Amplifying them increases over-compliance across four benchmarks: false premises, misleading context, sycophancy, and jailbreak behavior.
The signature shows up across domains, including bio/medical QA and questions about non-existent entities, which is the cleanest hallucination trap you can design.
The paper argues these neurons emerge during pre-training, not as a side effect of alignment fine-tuning.
What happened, with dates
May 2024: A medical literature study found hallucinated references at 39.6% for GPT-3.5 and 28.6% for GPT-4 when generating citations for systematic reviews. That is a narrow task, but it is exactly where confident nonsense does damage.
January 30, 2025: Vectara published an evaluation claiming DeepSeek-R1 hallucinated more than DeepSeek-V3 in their setup.
September 4, 2025: OpenAI researchers argued hallucinations persist because training and benchmarks often reward “guessing” over honest uncertainty.
December 2, 2025: Tsinghua’s H-neurons paper proposes a neuron-level mechanism that matches that incentive story: the model has a circuit that pushes it toward compliance even when it lacks solid grounding.
What the paper actually did
1) Force the model to reveal its “truth boundary”
Instead of asking one question once, the authors use TriviaQA and sample 10 responses per question with temperature = 1.0 (plus top-k and top-p sampling). Then they keep only the extreme cases:
1,000 questions the model answered correctly 10 out of 10 times
1,000 questions it answered incorrectly 10 out of 10 times
This is a clever trick. If you only look at one generation, you cannot tell whether you caught a stable failure mode or just sampling noise.
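The filtering step above can be sketched in a few lines. This is an illustrative toy, not the paper's code; in particular, exact string match as the correctness check is my simplifying assumption (real QA evaluation usually uses alias-aware matching):

```python
def consistency_bucket(samples, gold, n=10):
    """Classify a question by how consistently the model answers it.

    samples: list of n sampled answers for one question (temperature 1.0).
    gold: the reference answer.
    Returns 'always_right', 'always_wrong', or 'mixed' — the paper keeps
    only the first two buckets and discards the noisy middle.
    """
    correct = sum(1 for s in samples if s.strip().lower() == gold.strip().lower())
    if correct == n:
        return "always_right"
    if correct == 0:
        return "always_wrong"
    return "mixed"

# Toy usage: 10 samples per question, as in the paper's TriviaQA setup.
assert consistency_bucket(["Paris"] * 10, "Paris") == "always_right"
assert consistency_bucket(["Lyon"] * 10, "Paris") == "always_wrong"
assert consistency_bucket(["Paris"] * 7 + ["Lyon"] * 3, "Paris") == "mixed"
```

Only the two extreme buckets survive, which is exactly what turns one noisy generation into a stable label.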
2) Stop measuring the “filler” tokens
Hallucinations in QA often live in the answer entity, not in “The answer is…”. So they extract answer tokens to focus analysis on the parts that carry factual load.
3) Measure neuron contribution, not “how loud it fired”
They compute CETT, which normalizes a neuron’s projected effect against the layer’s output norm. Translation: it tries to capture “who moved the decision,” not “who was shouting.”
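A simplified rendering of that idea: project each neuron's activation through the down-projection and compare its magnitude to the layer's total output norm. This is my toy approximation of a CETT-style measure, not the paper's exact formula:

```python
import numpy as np

def neuron_contributions(activations, down_proj):
    """Per-neuron contribution to an MLP layer's output, normalized by the
    layer output norm — a simplified stand-in for a CETT-style measure.

    activations: (d_ff,) post-nonlinearity neuron activations for one token.
    down_proj:   (d_ff, d_model) down-projection weight matrix.
    """
    per_neuron = activations[:, None] * down_proj      # (d_ff, d_model) projected effects
    layer_out = per_neuron.sum(axis=0)                 # (d_model,) the layer's actual output
    norms = np.linalg.norm(per_neuron, axis=1)         # each neuron's projected magnitude
    return norms / (np.linalg.norm(layer_out) + 1e-9)  # "who moved the output", not "who fired loudest"

rng = np.random.default_rng(0)
acts = rng.normal(size=64)
acts[3] = 50.0                                         # one dominant neuron
W = rng.normal(size=(64, 16))
c = neuron_contributions(acts, W)
assert c.argmax() == 3                                 # the high-impact neuron is flagged
```

The normalization is the point: a neuron with a huge raw activation but a near-zero weight row contributes nothing to the decision.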
4) Let sparse statistics pick the culprits
They train a sparse logistic regression (L1) classifier on those neuron-contribution profiles. Neurons with non-zero weights become the candidate “H-neurons.”
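To see why L1 does the selecting, here is a minimal proximal-gradient (ISTA) implementation of L1-regularized logistic regression on synthetic data. The data, hyperparameters, and solver are illustrative assumptions, not the paper's setup; the point is that soft-thresholding drives most weights to exactly zero, leaving a small candidate set:

```python
import numpy as np

def l1_logistic(X, y, lam=0.1, lr=0.1, steps=2000):
    """L1-regularized logistic regression via proximal gradient (ISTA).

    The soft-threshold step zeroes out uninformative weights; the surviving
    non-zero weights play the role of candidate "H-neurons".
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))                        # predicted probabilities
        w -= lr * (X.T @ (p - y) / n)                           # logistic-loss gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold (prox of L1)
    return w

rng = np.random.default_rng(1)
n, d = 400, 200
X = rng.normal(size=(n, d))
logits = 3 * X[:, 5] - 3 * X[:, 17]                    # only 2 of 200 "neurons" matter
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(float)
w = l1_logistic(X, y)
selected = np.flatnonzero(w)
assert 5 in selected and 17 in selected                # the true drivers are recovered
assert len(selected) < 20                              # and the selection is sparse
```

The same mechanics, applied to neuron-contribution profiles instead of synthetic features, yield a sub-per-mille candidate set.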
The headline result: hallucinations are highly localized
From the paper’s Table 1, the share of neurons selected for the classifier is tiny, reported as per-thousand (‰):
Mistral-7B-v0.3: 0.35‰
Mistral-Small-3.1-24B: 0.01‰
Llama-3.3-70B: 0.01‰
Llama-3.1-8B: 0.02‰
Gemma-3-4B: 0.10‰
Gemma-3-27B: 0.18‰
Even more important than the ratios: classifiers built on these neurons beat random-neuron baselines across TriviaQA, NQ-Open, BioASQ, and NonExist.
If you are building systems, that suggests an internal “hallucination warning light” might be feasible, at least for these model families and tasks.
Hallucination is tied to over-compliance
The paper’s most interesting move is to treat hallucination as a behavior, not a memory bug.
They do perturbation experiments by scaling H-neuron activations with a factor α in [0, 3]. As α rises, so does “compliance,” across four different probes:
FalseQA: accepting false premises instead of correcting them
FaithEval: adopting a misleading context over world knowledge
Sycophancy: changing a correct answer to match skeptical user pressure
Jailbreak: increased tendency to comply with harmful instructions
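Mechanically, the intervention is simple: multiply the flagged neurons' activations by α before the down-projection. Here is a toy numpy sketch of that forward pass; the weights, indices, and tanh nonlinearity are all stand-ins, not the models the paper perturbs:

```python
import numpy as np

def mlp_forward(x, W_up, W_down, h_idx=(), alpha=1.0):
    """Toy MLP block whose flagged "H-neurons" are scaled by alpha.

    alpha = 1 reproduces the unperturbed layer; the paper sweeps
    alpha in [0, 3] and measures how compliance rises with it.
    """
    a = np.tanh(x @ W_up)         # hidden activations (tanh for simplicity)
    a[list(h_idx)] *= alpha       # scale only the flagged neurons
    return a @ W_down

rng = np.random.default_rng(2)
W_up, W_down = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
x = rng.normal(size=8)
base = mlp_forward(x, W_up, W_down)                               # unperturbed
ampl = mlp_forward(x, W_up, W_down, h_idx=[4, 7], alpha=3.0)      # amplified
muted = mlp_forward(x, W_up, W_down, h_idx=[4, 7], alpha=0.0)     # suppressed
assert np.allclose(base, mlp_forward(x, W_up, W_down, h_idx=[4, 7], alpha=1.0))
assert not np.allclose(base, ampl) and not np.allclose(base, muted)
```

In a real transformer you would register a forward hook on the MLP rather than rewrite the layer, but the causal logic is the same: turn the dial, watch compliance move.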
And they quantify a size effect: smaller models shift more violently under the same internal perturbation, with a steeper average compliance slope than larger models.
This matches what many users have observed in practice: when pressed, the model often optimizes for “keeping the conversation smooth” rather than defending truth.
Why this matters for freedom, not just model quality
1) The people who can “turn the dial” are not you
To use H-neurons as a live detector, you need access to internal activations. With most hosted APIs, you do not get that. The vendor does.
So this line of research quietly widens a gap:
Open-weights users can instrument models and experiment with neuron-level controls.
API renters get whatever truth policy, warning policy, and refusal policy the vendor chooses.
That becomes a power lever, even if nobody says it out loud.
2) “Hallucination reduction” can become a gatekeeping excuse
Once you can point to a measurable internal risk signal, the next step is predictable: mandatory monitoring, mandatory logging, and mandatory intermediation “for safety.”
Real-world example: Italy’s antitrust authority closed a probe into DeepSeek after it committed to clearer hallucination warnings, according to Reuters on January 5, 2026.
A warning label can be reasonable. A warning label can also become a compliance regime, especially if it is coupled to auditing requirements and platform liability shields.
3) The same knob touches jailbreak behavior
The paper reports that amplifying H-neurons increases susceptibility to jailbreak prompts.
That means the vendor can plausibly claim: “We must centrally control this knob to keep the model safe.” Sometimes they will be right. Sometimes it will be cover for tighter permissioning.
Either way, the practical takeaway is the same: control follows the weights.
What you can do now, without neuron access
Most readers do not have a neuron dashboard. You still have options that cut hallucinations today.
1) Use “variance as a lie detector”
The Tsinghua team sampled 10 times at temperature 1.0 to locate stable failures. You can borrow the idea:
Ask the same question 3 to 5 times.
If the answers diverge materially, treat it as uncertainty and switch to verification mode.
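A minimal way to score that divergence, assuming short factual answers where normalized string matching is a fair proxy for agreement (the 0.8 threshold below is my suggestion to tune, not a published number):

```python
from collections import Counter

def answer_agreement(answers):
    """Fraction of sampled answers that match the modal answer.

    Low agreement across resamples is a cheap proxy for uncertainty:
    below your threshold, switch to verification mode instead of trusting
    any single generation.
    """
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# A stable answer vs. a diverging one.
assert answer_agreement(["Paris", "paris", "Paris "]) == 1.0
assert answer_agreement(["Paris", "Lyon", "Marseille", "Nice", "Paris"]) == 0.4
```

For longer answers you would compare extracted entities or use an equivalence model instead of string matching, but the decision rule is the same: disagreement means verify.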
2) Force citations, then verify one hop
The Chelli study exists because models fabricate references.
So do not accept “sounds academic” as evidence.
Workflow:
Ask for 2 to 3 specific sources.
Pick one and verify that it exists and says what the model claims before you trust the rest.
3) Add an independent checker
The model should not be its own referee.
If you can, use retrieval (search, your document store, or a curated knowledge base) and treat the LLM as a summarizer, not an oracle. For a broad overview of retrieval-augmented generation, see the survey by Gao et al. (2023).
4) Prefer tools designed for uncertainty
Some research aims to detect hallucinations using uncertainty measures rather than hidden neurons. For example, semantic entropy for hallucination detection (Nature, 2024).
This still is not magic, but it pushes systems toward admitting uncertainty.
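The core of the semantic-entropy idea fits in a short sketch: cluster sampled answers by meaning, then compute entropy over the clusters. The Nature 2024 method uses an NLI entailment model for the equivalence check; the exact-match default below is my toy stand-in for it:

```python
import math
from collections import Counter

def semantic_entropy(answers,
                     equivalent=lambda a, b: a.strip().lower() == b.strip().lower()):
    """Entropy over clusters of mutually equivalent sampled answers.

    High entropy means the model's samples disagree in *meaning*, not just
    in wording — a stronger uncertainty signal than raw token variance.
    """
    clusters = []                               # one representative answer per cluster
    counts = Counter()
    for a in answers:
        for i, rep in enumerate(clusters):
            if equivalent(a, rep):              # greedy assignment to first matching cluster
                counts[i] += 1
                break
        else:
            clusters.append(a)
            counts[len(clusters) - 1] += 1
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

assert semantic_entropy(["Paris"] * 5) == 0.0                     # total agreement
assert semantic_entropy(["Paris", "Lyon", "Nice", "Oslo"]) > 1.0  # wide disagreement
```

Swapping the `equivalent` predicate for an entailment model is what turns this toy into the published method.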
5) If autonomy matters, run open weights
If you want the option to implement neuron-level monitoring and interventions, you need open weights and an inference stack you control. The H-neurons paper’s core advantage is exactly that: it is implementable where you can inspect internals.
What this paper does not prove
It does not prove hallucinations will disappear if we “delete the bad neurons.” The authors explicitly flag a tradeoff: naive suppression or amplification is not enough, and you must balance hallucination reduction against overall utility.
It also does not overturn the “incentives” explanation. If anything, it supports it. A companion thread in the literature argues hallucinations are downstream of training and evaluation incentives that punish “I don’t know,” including Calibrated Language Models Must Hallucinate and Why Language Models Hallucinate.
The bottom line
Tsinghua’s result reframes hallucination in a way that matches lived experience: the model is often not “confused,” it is performing compliance. The alarming part is not that a few neurons correlate with that behavior. The alarming part is who gets to control that knob.
If you care about truthful outputs, you should care about who owns the internals. That is where the leverage sits.