Large Language Models (LLMs) are incredibly powerful - but they're also notorious for hallucinations. At Korbit, we experienced this firsthand when developing our GenAI code reviewer. Our system would scan pull requests containing tens of thousands of lines of code to detect potential bugs. But we quickly noticed a huge problem: many flagged issues were not real bugs - they were hallucinations.
These false positives didn’t just waste developers’ time; they eroded trust in automated code reviews altogether. Clearly, this was a problem we needed to tackle head-on.
To address hallucinations, we first had to rigorously define them. We spent weeks manually reviewing, collecting, and annotating hundreds of detected issues across real-world Python, JavaScript, and TypeScript pull requests - covering everything from new features and refactoring tasks to bug fixes. We included both our own code and diverse open-source repositories to ensure a representative dataset.
During this deep dive, a fundamental question emerged: What exactly is a hallucination?
If an LLM generates outright nonsense, such as imaginary libraries, non-existent functions, or fabricated problems, it’s clearly a hallucination.
But many cases fell into a grey area. For example, suppose the LLM identifies an infinite loop bug in a function called func_ab(...), but the actual function is named func_abc(...). The LLM made a minor typo in the name, yet the infinite-loop bug it flagged is real. Is this still a hallucination, or should we consider it valid and actionable?
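To make the grey area concrete, here's a small invented illustration - the code and the review comment below are hypothetical, not taken from our dataset:

```python
# Hypothetical example illustrating the grey-area case above.

def process(item):
    print(f"processing {item}")

# The real function is named func_abc, and it genuinely contains an infinite loop:
def func_abc(items):
    i = 0
    while i < len(items):
        if items[i] is None:
            # BUG: `i` is never incremented on this branch, so the loop never advances.
            continue
        process(items[i])
        i += 1

# The (invented) LLM review comment refers to a slightly wrong name:
#   "Possible infinite loop in func_ab(...): the loop index is not advanced
#    when the current item is None."
# The function name is off by one character, but the bug itself is real.
```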
Another subtle scenario is when the LLM flags an issue that’s technically accurate but unfixable given the project context. Imagine a legacy web service mandated to use unencrypted HTTP endpoints (for example, because some customers signed a contract with this exact API requirement many years ago). An LLM might suggest switching to HTTPS - a perfectly valid technical recommendation - but developers simply can't implement it due to external constraints. In this scenario, labeling it as a bug would be misleading and waste developer effort.
This led us to realize that “technical accuracy” alone wasn't enough. What mattered just as much, if not more, was actionability: could a skilled developer clearly understand the issue and actually do something about it?
So, returning to our previous examples: the mistyped func_ab(...) issue is real and actionable - despite the wrong name, a developer can quickly find func_abc(...) and fix the infinite loop - while the HTTPS suggestion is technically accurate but not actionable, because external constraints prevent the change.
Based on our analysis, we created a practical annotation schema to classify issues consistently. Note that an issue is classified as “Resolvable” even if only part of it is resolvable.
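For illustration, here is a minimal sketch of how an annotation record along these lines could be represented in code. The field names and labels below are hypothetical and chosen for this example; only the “Resolvable” label comes from the schema described above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IssueLabel(Enum):
    """Illustrative labels; the names besides RESOLVABLE are assumptions."""
    RESOLVABLE = auto()       # at least part of the issue is real and actionable
    NOT_ACTIONABLE = auto()   # technically accurate, but nothing can be done about it
    HALLUCINATION = auto()    # the flagged problem does not exist in the code


@dataclass
class IssueAnnotation:
    """One manually annotated issue from a reviewed pull request (illustrative fields)."""
    issue_id: str
    repository: str
    language: str        # e.g. "python", "javascript", "typescript"
    description: str     # the issue text produced by the code reviewer
    label: IssueLabel
    notes: str = ""      # free-form annotator comments


# Example: the func_abc case above would be labeled Resolvable despite the name typo,
# because the underlying infinite-loop bug is real and a developer can fix it.
example = IssueAnnotation(
    issue_id="demo-001",
    repository="example/repo",
    language="python",
    description="Possible infinite loop in func_ab(...)",
    label=IssueLabel.RESOLVABLE,
    notes="Function name typo (func_ab vs. func_abc); the bug itself is real.",
)
```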
Applying this schema to our dataset, we uncovered several critical patterns.
This comprehensive analysis guided our next step: building a smarter, more context-aware hallucination detection system, and it set the direction for the approaches we began investigating.
By integrating these approaches, we have already achieved nearly an 80% reduction in hallucinations in our experiments - a significant leap forward in making AI-powered code reviews more reliable, actionable, and trustworthy.
In Part II, we’ll walk through exactly how we built our Chain-of-Thought ensemble system, showcasing the concrete strategies that helped us achieve these remarkable results.
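To give a flavor of the general idea ahead of Part II, here is a deliberately generic sketch of a Chain-of-Thought ensemble filter. This is not our production implementation; call_llm is a placeholder for whatever LLM client you use, and the prompt wording and voting threshold are illustrative.

```python
from collections import Counter


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError


def verify_issue(issue: str, code_context: str, n_votes: int = 5) -> bool:
    """Ask the model to reason step by step about a flagged issue several times,
    then keep the issue only if a majority of runs confirm it is real and actionable."""
    prompt = (
        "You are reviewing a potential bug reported for the code below.\n"
        f"Code:\n{code_context}\n\n"
        f"Reported issue:\n{issue}\n\n"
        "Think step by step: does this issue actually exist in the code, and "
        "could a developer act on it here? End your answer with VALID or INVALID."
    )
    votes = Counter()
    for _ in range(n_votes):
        answer = call_llm(prompt).strip()
        # "INVALID" also ends with "VALID", so rule out the longer token explicitly.
        is_valid = answer.endswith("VALID") and not answer.endswith("INVALID")
        votes["valid" if is_valid else "invalid"] += 1
    return votes["valid"] > votes["invalid"]
```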