In Part I, we investigated why hallucinations are such a challenge when working with Large Language Models (LLMs) - whether generating new code or reviewing existing code for potential bugs. We spent weeks manually annotating and analyzing hundreds of AI-generated issues, creating an annotation framework focused on actionability:
From this rigorous evaluation, we identified several recurring patterns in hallucinations:
These findings steered us toward developing smarter agentic systems: systems with better reasoning, that dynamically incorporate the right context, and that use multiple LLMs as an ensemble for more accurate evaluations.
At Korbit, the first rule of building with LLMs is: Always start simple.
(Actually, Rule #1 is properly defining the problem - but we covered that extensively in Part I when we analyzed and labeled hundreds of issues.)
Instead of prematurely diving into complex solutions like multi-hop context gathering (e.g. jumping across files in the codebase) or ensembles of LLMs, we decided to first attack the hallucination problem with an extremely simple approach: prompt engineering. Our goal: build an effective Hallucination Detector using a carefully constructed LLM prompt.
We started with a straightforward strategy known as the "LLM-as-a-judge" method, inspired by recent academic research (Zheng et al., 2023). In this setup, one LLM reviews and evaluates another LLM's outputs in real-time, serving as a gatekeeper that classifies detected issues.
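To make the gatekeeper idea concrete, here is a minimal sketch of what that flow can look like. The ReviewIssue structure and the call_judge_llm helper are illustrative assumptions, not our production code.

from dataclasses import dataclass

@dataclass
class ReviewIssue:
    description: str   # issue text produced by the reviewer LLM
    code_snippet: str  # the diff hunk the issue refers to
    source_file: str   # full contents of the reviewed file

def call_judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judge LLM (e.g. GPT-4 or Claude)."""
    raise NotImplementedError

def is_hallucination(issue: ReviewIssue) -> bool:
    # The judge LLM sees the issue plus its context and acts as a gatekeeper.
    prompt = (
        "You are a senior software engineer reviewing issues found by another LLM.\n"
        f"Issue description: {issue.description}\n"
        f"Issue code snippet: {issue.code_snippet}\n"
        f"Source code file: {issue.source_file}\n"
        "Is this issue a hallucination? Answer 'yes' or 'no'."
    )
    return call_judge_llm(prompt).strip().lower().startswith("yes")

def gatekeep(issues: list[ReviewIssue]) -> list[ReviewIssue]:
    # Only issues the judge considers real are surfaced to the developer.
    return [issue for issue in issues if not is_hallucination(issue)]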
Our initial prompt (shown in the following table) was designed to quickly determine if an issue flagged by the code reviewer was a hallucination or not:
Table 1. LLM First Prompt
You are a senior software engineer working with a team of software engineers.
You are using a large language model for initial review of the files in pull requests.
The large language model reviews files in the pull request and tries to find issues in them. Some of the issues found by the large language model are hallucinations.
If the issue won't cause a real problem in the development or execution of the code, or if the issue depends heavily on external context outside the file, then the issue is considered a hallucination.
Your task is to review issues found by the large language model and detect hallucinations.
You will be provided with the following pieces of information:
- Issue description
- Issue code snippet
- Source code file
…
Unfortunately, this prompt didn't meet expectations. In tests using GPT-4, it detected only 19% of hallucinations - barely better than random guessing - while misclassifying 10% of valid issues as hallucinations. It seemed the model wasn't understanding the task or the source code, or wasn't focusing on the right parts of the provided context.
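Detection and false-positive rates like these are computed against a labeled set of issues, such as the manually annotated set from Part I. A quick sketch, with the pairing format and label strings as assumptions:

def judge_metrics(results: list[tuple[str, str]]) -> tuple[float, float]:
    """results holds (predicted_label, annotated_label) pairs,
    with labels drawn from {'Hallucination', 'Valid Issue'}."""
    predictions_on_hallucinations = [p for p, a in results if a == "Hallucination"]
    predictions_on_valid_issues = [p for p, a in results if a == "Valid Issue"]
    # Share of annotated hallucinations the judge actually caught (~19% for the first prompt).
    detection_rate = predictions_on_hallucinations.count("Hallucination") / len(predictions_on_hallucinations)
    # Share of valid issues wrongly flagged as hallucinations (~10% for the first prompt).
    false_positive_rate = predictions_on_valid_issues.count("Hallucination") / len(predictions_on_valid_issues)
    return detection_rate, false_positive_rate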
Reflecting on our earlier analysis, we recognized a core issue: too much context. Providing the LLM with an entire source file overwhelmed it and made focused reasoning difficult.
To address these shortcomings, we experimented with a technique called Chain-of-Thought (CoT) prompting, a structured reasoning approach proven effective across various LLM tasks. CoT encourages models to explicitly think through intermediate steps, improving their reasoning over complex and long inputs.
We tailored a simple, question-answering variant of CoT to help the model focus on precisely the information needed to evaluate an issue. The method involved three steps:
- Generate three simple questions, based on the issue description and code snippet, that would help evaluate the issue.
- Answer those questions by reviewing the source code file.
- Make a final judgement from the answers: Valid Issue, Hallucination, or Undetermined.
Crucially, we allowed the LLM to provide the "Undetermined" response, signaling low confidence or insufficient context. This was key in preventing arbitrarily forced and potentially incorrect classifications, enabling downstream systems to manage uncertain cases effectively.
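As an illustration of what that downstream handling can look like, here is a small sketch; the routing choices are assumptions made for the example, not a description of our production pipeline.

def route_issue(evaluation: str) -> str:
    # Map the judge's three-way verdict to a downstream action (illustrative only).
    if evaluation == "Hallucination":
        return "discard"                     # never shown to the developer
    if evaluation == "Valid Issue":
        return "post_to_pull_request"        # surfaced as a review comment
    if evaluation == "Undetermined":
        return "escalate_for_more_context"   # e.g. re-evaluate with extra context
    raise ValueError(f"Unexpected evaluation label: {evaluation}")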
This refined CoT prompt significantly improved our hallucination detection results:
Our extensive prompt experimentation surfaced three critical insights:
Here's the refined hallucination detection prompt we arrived at after extensive experimentation:
Table 2. LLM CoT Prompt
You are a senior software engineer working with a team of software engineers.
You are using a large language model for initial review of the files in pull requests.
The large language model reviews files in the pull request and tries to find issues in them.
Some of the issues found by the large language model are hallucinations.
Your task is to review issues found by the large language model and detect hallucinations.
You will be provided with the following pieces of information:
- Issue description
- Issue code snippet
- Source code file
You first should read the issue description and issue code snippet to generate 3 simple questions that will help you evaluate the issue. For instance, if the issue mentions the calculate_sum function being defined twice, a simple question could be “how many calculate_sum function definitions exist in the source code file?”.
Once you generate these 3 simple questions, review the source code file to find the answers and write them down. Then, make a judgement based on your answers to determine if the issue is hallucination.
If you’re unsure about an issue or don’t have enough context to evaluate, then label this issue as undetermined. For the undetermined label, please provide the reason why you think it’s undetermined.
Here is the information to help you determine if the issue is real or hallucination:
- Issue description: {feedback}
- Issue code snippet: {diff}
- Source code file: {source_code_file}
Structure the output as a valid json like:
{
  "question_1": "The first question you generate.",
  "answer_1": "Your answer to question_1.",
  "question_2": "The second question you generate.",
  "answer_2": "Your answer to question_2.",
  "question_3": "The third question you generate.",
  "answer_3": "Your answer to question_3.",
  "evaluation": "Valid Issue" or "Hallucination" or "Undetermined"
}
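To show how a prompt like this might be wired up, here is a minimal sketch using the OpenAI Python SDK. COT_PROMPT_TEMPLATE stands in for the full Table 2 text, and the model name and parsing details are assumptions for illustration; we also ran the same prompt through Claude 3.5 Sonnet, as shown below.

import json
import re

from openai import OpenAI

client = OpenAI()
COT_PROMPT_TEMPLATE = "..."  # the full Table 2 prompt, with {feedback}, {diff}, {source_code_file} placeholders

def judge_issue(feedback: str, diff: str, source_code_file: str) -> dict:
    prompt = COT_PROMPT_TEMPLATE.format(
        feedback=feedback, diff=diff, source_code_file=source_code_file
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content
    # Models sometimes wrap the JSON in a sentence of prose, so extract the first {...} block.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    verdict = json.loads(match.group(0)) if match else {"evaluation": "Undetermined"}
    assert verdict.get("evaluation") in {"Valid Issue", "Hallucination", "Undetermined"}
    return verdict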
The best way to understand how the system works is to walk through an example. Here is an issue flagged by the code reviewer and then evaluated for hallucination by the prompt:
This issue is clearly a hallucination by our definition. The method “outdate_pr_comments” is indeed declared twice in the codebase: once as a base class method (meant to be implemented later) and a second time in the child class, which inherits from the base class and provides the actual implementation. There is nothing here to address and no action for the developer to take, so it is a hallucination by the definition above.
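The source file itself isn't reproduced here, but the pattern in question looks roughly like the following simplified reconstruction; the base class name and method signature are assumptions, since only outdate_pr_comments and GithubRepositoryClient appear in the example.

class RepositoryClient:  # hypothetical base class name
    def outdate_pr_comments(self, pr_id: int) -> None:
        # Declared in the base class so every repository client exposes it;
        # the actual behaviour is left to subclasses.
        raise NotImplementedError

class GithubRepositoryClient(RepositoryClient):
    def outdate_pr_comments(self, pr_id: int) -> None:
        # The real implementation, e.g. minimizing outdated comments via the
        # GitHub GraphQL API (as described in Claude's answer below).
        ...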
Next, let’s look at the output from Claude 3.5 Sonnet:
Table 3. Claude 3.5 Sonnet Output
Here is my evaluation of the issue in the structured JSON format:
"question_1": "How many times is the outdate_pr_comments method defined in the GithubRepositoryClient class?",
"answer_1": "The outdate_pr_comments method is defined once in the GithubRepositoryClient class.",
"question_2": "Is the first definition of outdate_pr_comments an empty method?",
"answer_2": "No, there is only one definition of outdate_pr_comments and it is not empty. It contains code to minimize comments using the GitHub GraphQL API.",
"question_3": "Will the existing outdate_pr_comments method cause any override issues?",
"answer_3": "No, since there is only a single definition of outdate_pr_comments, there will not be any override issues.",
"evaluation": "Hallucination"
Claude correctly determines that it’s a hallucination. It notes that the function is defined only once in that class (which is also correct, since the second definition lives in the base class, a different class), and it concludes that this will not cause any issues.
In Part III of this series, we’ll dive into how we built our Ensemble System, and how it was able to leverage dynamic context extraction across files in the codebase to detect and remove almost 80% of all hallucinations.