In Part I, we investigated why hallucinations are such a challenge when working with Large Language Models (LLMs) - whether generating new code or reviewing existing code for potential bugs. We spent weeks manually annotating and analyzing hundreds of AI-generated issues, creating an annotation framework focused on actionability:
From this rigorous evaluation, we identified several recurring patterns in hallucinations:
These findings steered us toward developing smarter agentic systems: systems with better reasoning, that dynamically incorporate the right context, and that use multiple LLMs as an ensemble for more accurate evaluations.
At Korbit, the first rule of building with LLMs is: Always start simple.
(Actually, Rule #1 is properly defining the problem - but we covered that extensively in Part I when we analyzed and labeled hundreds of issues.)
Instead of prematurely diving into complex solutions like multi-hop context gathering (e.g. jumping across files in the codebase) or ensembles of LLMs, we decided to first attack the hallucination problem with an extremely simple approach: prompt engineering. Our goal: build an effective Hallucination Detector using a carefully constructed LLM prompt.
We started with a straightforward strategy known as the "LLM-as-a-judge" method, inspired by recent academic research (Zheng et al., 2023). In this setup, one LLM reviews and evaluates another LLM's outputs in real-time, serving as a gatekeeper that classifies detected issues.
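To make the gatekeeper idea concrete, here is a minimal sketch of what that flow can look like. The ReviewIssue structure and the call_judge_llm helper are illustrative assumptions, not our production code.

from dataclasses import dataclass

@dataclass
class ReviewIssue:
    description: str   # issue text produced by the reviewer LLM
    code_snippet: str  # the diff hunk the issue refers to
    source_file: str   # full contents of the reviewed file

def call_judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judge LLM (e.g. GPT-4 or Claude)."""
    raise NotImplementedError

def is_hallucination(issue: ReviewIssue) -> bool:
    # The judge LLM sees the issue plus its context and acts as a gatekeeper.
    prompt = (
        "You are a senior software engineer reviewing issues found by another LLM.\n"
        f"Issue description: {issue.description}\n"
        f"Issue code snippet: {issue.code_snippet}\n"
        f"Source code file: {issue.source_file}\n"
        "Is this issue a hallucination? Answer 'yes' or 'no'."
    )
    return call_judge_llm(prompt).strip().lower().startswith("yes")

def gatekeep(issues: list[ReviewIssue]) -> list[ReviewIssue]:
    # Only issues the judge considers real are surfaced to the developer.
    return [issue for issue in issues if not is_hallucination(issue)]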
Our initial prompt (shown in the following table) was designed to quickly determine if an issue flagged by the code reviewer was a hallucination or not:
Table 1. LLM First Prompt
You are a senior software engineer working with a team of software engineers.
You are using a large language model for initial review of the files in pull requests.
The large language model reviews files in the pull request and tries to find issues in them. Some of the issues found by the large language model are hallucinations.
If the issue won't cause a real problem in the development or execution of the code, or if the issue depends heavily on external context outside the file, then the issue is considered a hallucination.
Your task is to review issues found by the large language model and detect hallucinations.
You will be provided with the following pieces of information:
- Issue description
- Issue code snippet
- Source code file
…
Unfortunately, this prompt didn't meet expectations. In tests using GPT-4, it detected only 19% of hallucinations - barely better than random guessing - while misclassifying 10% of valid issues as hallucinations. It seemed the model wasn't understanding the task or the source code, or wasn't focusing on the right parts of the provided context.
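Detection and false-positive rates like these are computed against a labeled set of issues, such as the manually annotated set from Part I. A quick sketch, with the pairing format and label strings as assumptions:

def judge_metrics(results: list[tuple[str, str]]) -> tuple[float, float]:
    """results holds (predicted_label, annotated_label) pairs,
    with labels drawn from {'Hallucination', 'Valid Issue'}."""
    predictions_on_hallucinations = [p for p, a in results if a == "Hallucination"]
    predictions_on_valid_issues = [p for p, a in results if a == "Valid Issue"]
    # Share of annotated hallucinations the judge actually caught (~19% for the first prompt).
    detection_rate = predictions_on_hallucinations.count("Hallucination") / len(predictions_on_hallucinations)
    # Share of valid issues wrongly flagged as hallucinations (~10% for the first prompt).
    false_positive_rate = predictions_on_valid_issues.count("Hallucination") / len(predictions_on_valid_issues)
    return detection_rate, false_positive_rate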
Reflecting on our earlier analysis, we recognized a core issue: too much context. Providing the LLM with an entire source file overwhelmed it and made focused reasoning difficult.
To address these shortcomings, we experimented with a technique called Chain-of-Thought (CoT) prompting, a structured reasoning approach proven effective across various LLM tasks. CoT encourages models to explicitly think through intermediate steps, improving their reasoning over complex and long inputs.
We tailored a simple, question-answering variant of CoT to help the model focus on precisely the information needed to evaluate an issue. The method involved three steps:
- Generate three simple questions, based on the issue description and code snippet, that would help evaluate the issue.
- Answer those questions by reviewing the source code file.
- Make a final judgement from the answers: Valid Issue, Hallucination, or Undetermined.
Crucially, we allowed the LLM to provide the "Undetermined" response, signaling low confidence or insufficient context. This was key in preventing arbitrarily forced and potentially incorrect classifications, enabling downstream systems to manage uncertain cases effectively.
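As an illustration of what that downstream handling can look like, here is a small sketch; the routing choices are assumptions made for the example, not a description of our production pipeline.

def route_issue(evaluation: str) -> str:
    # Map the judge's three-way verdict to a downstream action (illustrative only).
    if evaluation == "Hallucination":
        return "discard"                     # never shown to the developer
    if evaluation == "Valid Issue":
        return "post_to_pull_request"        # surfaced as a review comment
    if evaluation == "Undetermined":
        return "escalate_for_more_context"   # e.g. re-evaluate with extra context
    raise ValueError(f"Unexpected evaluation label: {evaluation}")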
This refined CoT prompt significantly improved our hallucination detection results:
Our extensive prompt experimentation surfaced three critical insights:
Here's the refined hallucination detection prompt we arrived at after extensive experimentation:
Table 2. LLM CoT Prompt
You are a senior software engineer working with a team of software engineers.
You are using a large language model for initial review of the files in pull requests.
The large language model reviews files in the pull request and tries to find issues in them.
Some of the issues found by the large language model are hallucinations.
Your task is to review issues found by the large language model and detect hallucinations.
You will be provided with the following pieces of information:
- Issue description
- Issue code snippet
- Source code file
You first should read the issue description and issue code snippet to generate 3 simple questions that will help you evaluate the issue. For instance, if the issue mentions the calculate_sum function being defined twice, a simple question could be “how many calculate_sum function definitions exist in the source code file?”.
Once you generate these 3 simple questions, review the source code file to find the answers and write them down. Then, make a judgement based on your answers to determine if the issue is hallucination.
If you’re unsure about an issue or don’t have enough context to evaluate, then label this issue as undetermined. For the undetermined label, please provide the reason why you think it’s undetermined.
Here is the information to help you determine if the issue is real or hallucination:
- Issue description: {feedback}
- Issue code snippet: {diff}
- Source code file: {source_code_file}
Structure the output as a valid json like:
{
  "question_1": "The first question you generate.",
  "answer_1": "Your answer to question_1.",
  "question_2": "The second question you generate.",
  "answer_2": "Your answer to question_2.",
  "question_3": "The third question you generate.",
  "answer_3": "Your answer to question_3.",
  "evaluation": "Valid Issue" or "Hallucination" or "Undetermined"
}
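To show how a prompt like this might be wired up, here is a minimal sketch using the OpenAI Python SDK. COT_PROMPT_TEMPLATE stands in for the full Table 2 text, and the model name and parsing details are assumptions for illustration; we also ran the same prompt through Claude 3.5 Sonnet, as shown below.

import json
import re

from openai import OpenAI

client = OpenAI()
COT_PROMPT_TEMPLATE = "..."  # the full Table 2 prompt, with {feedback}, {diff}, {source_code_file} placeholders

def judge_issue(feedback: str, diff: str, source_code_file: str) -> dict:
    prompt = COT_PROMPT_TEMPLATE.format(
        feedback=feedback, diff=diff, source_code_file=source_code_file
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content
    # Models sometimes wrap the JSON in a sentence of prose, so extract the first {...} block.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    verdict = json.loads(match.group(0)) if match else {"evaluation": "Undetermined"}
    assert verdict.get("evaluation") in {"Valid Issue", "Hallucination", "Undetermined"}
    return verdict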
The best way to understand how the system works is to walk through an example. Here is an issue flagged by the code reviewer and then evaluated for hallucination by the prompt:
This issue is clearly a hallucination by our definition. The method “outdate_pr_comments” is indeed declared twice in the codebase: once as a base class method (meant to be implemented later) and a second time in the child class, which inherits from the base class and provides the actual implementation. There is nothing here to address and no action for the developer to take, so it is a hallucination by the definition above.
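The source file itself isn't reproduced here, but the pattern in question looks roughly like the following simplified reconstruction; the base class name and method signature are assumptions, since only outdate_pr_comments and GithubRepositoryClient appear in the example.

class RepositoryClient:  # hypothetical base class name
    def outdate_pr_comments(self, pr_id: int) -> None:
        # Declared in the base class so every repository client exposes it;
        # the actual behaviour is left to subclasses.
        raise NotImplementedError

class GithubRepositoryClient(RepositoryClient):
    def outdate_pr_comments(self, pr_id: int) -> None:
        # The real implementation, e.g. minimizing outdated comments via the
        # GitHub GraphQL API (as described in Claude's answer below).
        ...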
Next, let’s look at the output from Claude 3.5 Sonnet:
Table 3. Claude 3.5 Sonnet Output
Here is my evaluation of the issue in the structured JSON format:
"question_1": "How many times is the outdate_pr_comments method defined in the GithubRepositoryClient class?",
"answer_1": "The outdate_pr_comments method is defined once in the GithubRepositoryClient class.",
"question_2": "Is the first definition of outdate_pr_comments an empty method?",
"answer_2": "No, there is only one definition of outdate_pr_comments and it is not empty. It contains code to minimize comments using the GitHub GraphQL API.",
"question_3": "Will the existing outdate_pr_comments method cause any override issues?",
"answer_3": "No, since there is only a single definition of outdate_pr_comments, there will not be any override issues.",
"evaluation": "Hallucination"
Claude correctly determines that it’s a hallucination. It notes that the function is defined only once in that class (which is also correct, since the second definition lives in the base class, a different class), and it concludes that this will not cause any issues.
In Part III of this series, we’ll dive into how we built our Ensemble System, and how it was able to leverage dynamic context extraction across files in the codebase to detect and remove almost 80% of all hallucinations.