Large Language Models (LLMs) are incredibly powerful - but they're also notorious for hallucinations. At Korbit, we experienced this firsthand when developing our GenAI code reviewer. Our system would scan pull requests containing tens of thousands of lines of code to detect potential bugs. But we quickly noticed a huge problem: many flagged issues were not real bugs - they were hallucinations.
These false positives didn’t just waste developers’ time; they eroded trust in automated code reviews altogether. Clearly, this was a problem we needed to tackle head-on.
To address hallucinations, we first had to rigorously define them. We spent weeks manually reviewing, collecting, and annotating hundreds of detected issues across real-world Python, JavaScript, and TypeScript pull requests - covering everything from new features and refactoring tasks to bug fixes. We included both our own code and diverse open-source repositories to ensure a representative dataset.
During this deep dive, a fundamental question emerged: What exactly is a hallucination?
If an LLM generates outright nonsense, such as imaginary libraries, non-existent functions, or fabricated problems, it’s clearly a hallucination.
But many cases fell into a grey area. For example, suppose the LLM identifies an infinite loop bug in a function called func_ab(...), but the actual function is named func_abc(...). The LLM made a minor typo in the name, yet the infinite-loop bug it flagged is real. Is this still a hallucination, or should we consider it valid and actionable?
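To make the grey area concrete, here's a small invented illustration - the code and the review comment below are hypothetical, not taken from our dataset:

```python
# Hypothetical example illustrating the grey-area case above.

def process(item):
    print(f"processing {item}")

# The real function is named func_abc, and it genuinely contains an infinite loop:
def func_abc(items):
    i = 0
    while i < len(items):
        if items[i] is None:
            # BUG: `i` is never incremented on this branch, so the loop never advances.
            continue
        process(items[i])
        i += 1

# The (invented) LLM review comment refers to a slightly wrong name:
#   "Possible infinite loop in func_ab(...): the loop index is not advanced
#    when the current item is None."
# The function name is off by one character, but the bug itself is real.
```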
Another subtle scenario is when the LLM flags an issue that’s technically accurate but unfixable given the project context. Imagine a legacy web service mandated to use unencrypted HTTP endpoints (for example, because some customers signed a contract with this exact API requirement many years ago). An LLM might suggest switching to HTTPS - a perfectly valid technical recommendation - but developers simply can't implement it due to external constraints. In this scenario, labeling it as a bug would be misleading and waste developer effort.
This led us to realize that “technical accuracy” alone wasn't enough. What mattered just as much, if not more, was actionability: could a skilled developer clearly understand the issue and actually do something about it?
So, returning to our previous examples: the mistyped func_ab(...) issue is real and actionable - despite the wrong name, a developer can quickly find func_abc(...) and fix the infinite loop - while the HTTPS suggestion is technically accurate but not actionable, because external constraints prevent the change.
Based on our analysis, we created a practical annotation schema to classify issues consistently. Note that an issue is classified as “Resolvable” even if only part of it is resolvable.
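For illustration, here is a minimal sketch of how an annotation record along these lines could be represented in code. The field names and labels below are hypothetical and chosen for this example; only the “Resolvable” label comes from the schema described above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IssueLabel(Enum):
    """Illustrative labels; the names besides RESOLVABLE are assumptions."""
    RESOLVABLE = auto()       # at least part of the issue is real and actionable
    NOT_ACTIONABLE = auto()   # technically accurate, but nothing can be done about it
    HALLUCINATION = auto()    # the flagged problem does not exist in the code


@dataclass
class IssueAnnotation:
    """One manually annotated issue from a reviewed pull request (illustrative fields)."""
    issue_id: str
    repository: str
    language: str        # e.g. "python", "javascript", "typescript"
    description: str     # the issue text produced by the code reviewer
    label: IssueLabel
    notes: str = ""      # free-form annotator comments


# Example: the func_abc case above would be labeled Resolvable despite the name typo,
# because the underlying infinite-loop bug is real and a developer can fix it.
example = IssueAnnotation(
    issue_id="demo-001",
    repository="example/repo",
    language="python",
    description="Possible infinite loop in func_ab(...)",
    label=IssueLabel.RESOLVABLE,
    notes="Function name typo (func_ab vs. func_abc); the bug itself is real.",
)
```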
Applying this schema to our dataset, we uncovered several critical patterns.
This comprehensive analysis guided our next step: building a smarter, more context-aware hallucination detection system, and it set the direction for the approaches we began investigating.
By integrating these approaches, we have already achieved nearly an 80% reduction in hallucinations in our experiments - a significant leap forward in making AI-powered code reviews more reliable, actionable, and trustworthy.
In Part II, we’ll walk through exactly how we built our Chain-of-Thought ensemble system, showcasing the concrete strategies that helped us achieve these remarkable results.
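To give a flavor of the general idea ahead of Part II, here is a deliberately generic sketch of a Chain-of-Thought ensemble filter. This is not our production implementation; call_llm is a placeholder for whatever LLM client you use, and the prompt wording and voting threshold are illustrative.

```python
from collections import Counter


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError


def verify_issue(issue: str, code_context: str, n_votes: int = 5) -> bool:
    """Ask the model to reason step by step about a flagged issue several times,
    then keep the issue only if a majority of runs confirm it is real and actionable."""
    prompt = (
        "You are reviewing a potential bug reported for the code below.\n"
        f"Code:\n{code_context}\n\n"
        f"Reported issue:\n{issue}\n\n"
        "Think step by step: does this issue actually exist in the code, and "
        "could a developer act on it here? End your answer with VALID or INVALID."
    )
    votes = Counter()
    for _ in range(n_votes):
        answer = call_llm(prompt).strip()
        # "INVALID" also ends with "VALID", so rule out the longer token explicitly.
        is_valid = answer.endswith("VALID") and not answer.endswith("INVALID")
        votes["valid" if is_valid else "invalid"] += 1
    return votes["valid"] > votes["invalid"]
```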