Introducing a set of design patterns aimed at LLMs to construct computational units and systems solving highly complex problems.

We’ve shown how LLMs have given rise to a new computational paradigm. With a new computational paradigm comes also new approaches, frameworks and design patterns for building software. In this blog post, we’ll introduce a set of design patterns aimed at LLMs to construct computational units and systems solving highly complex problems.

Most of these design patterns are specific to LLMs and won’t apply to typical software systems.

Few-shot learning is a simple but very powerful design pattern for LLMs that can drastically improve their performance. LLMs are trained to fill in words and sentences in documents (i.e. to predict one or more missing/masked words in a sentence). And they are great at this!

The idea of few-shot learning is to give the LLM *n* example input-output pairs in the prompt, and then ask it to fill what’s missing for the *n+1*’th example.
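As a sketch, assembling such a prompt can be automated. The function and field names below are illustrative, not from the post; the labels mirror the `Important: 0/1` format used in the examples that follow:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, n labeled examples,
    and the (n+1)'th input left open for the LLM to complete."""
    parts = [instruction, ""]
    for text, label in examples:
        parts += ["Email:", text, f"Important: {label}", ""]
    # The final example has no label: the LLM is asked to fill it in.
    parts += ["Email:", query, "Important:"]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Your job is to classify emails as important or not important.",
    [("Your invoice is due today at 5pm.", 1),
     ("Would it be possible to jump on a quick call?", 0)],
    "We're offering a 50% discount on our service this month.",
)
```

The returned string would then be sent to the LLM as-is; the model's completion after the trailing `Important:` is the predicted label.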

Let’s look at the example of classifying emails as important or not important. Recall that in the last blog post we discussed how to design a software system (a personal assistant) using LLMs to detect important emails and notify the user on Slack to respond to them urgently.

Here’s an example prompt to GPT-4, where I ask it to classify a marketing email:

I have received the following email. Please tell me if it's important or not. If it's an important email, please write 1, otherwise please write 0.

Email:

Quick Question

Hey Iulian, I came across what you're doing at Korbit and would love to show you how we can help! We actually just helped a business like yours grow to over $100k/mo using linkedin ads.

We're offering a 50% discount on our service this month, but you have to act fast.

Let's chat this afternoon?

Cheers,

John Blarr

This is a marketing email I received, but GPT-4 returned “*1*”, telling me that it’s an important email. In other words, GPT-4 could not solve this task with a standard prompt.

Let’s try it with few-shot learning. We are going to give the LLM n=2 examples and then ask it to classify the third example. Let’s see if it can classify the email from earlier correctly:

Your job is to classify emails as important or not important.

Email:

Unpaid Invoice

Your invoice is due today at 5pm.

If the invoice is not paid on time, you will be subject to a $100 fee.

Yours sincerely,

Annie Burton

Important: 1

Email:

Anderson Introduction

I'm working for Anderson. We're an international consulting firm specializing in obtaining tax credits for businesses such as yours.

Would it be possible to jump on a quick call?

Yours sincerely,

Alex Grayfield

Important: 0

Email:

Quick Question

Hey Iulian, I came across what you're doing at Korbit and would love to show you how we can help!

We actually just helped a business like yours grow to over $100k/mo using linkedin ads.

We're offering a 50% discount on our service this month, but you have to act fast.

Let's chat this afternoon?

Cheers,

John Blarr

Important:

When running this query, GPT-4 returned “0” indicating that the last email (the marketing email) is not important. In other words, using few-shot learning GPT-4 was able to solve this task!

If you go to the previous blog post, you’ll see how we added few-shot learning into the system classifying emails as important or not important. There, we discussed how few-shot learning could be incorporated into the system by adding two more computational units: one computational unit for a human reviewer to annotate examples, and one computational unit to read from and write to a vector DB. This is a very general way to combine this design pattern with a human-in-the-loop.
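As a minimal sketch of that read/write unit, here is an in-memory stand-in for the vector DB. The bag-of-words “embedding” and the class/method names are assumptions for illustration; a real system would use a proper embedding model and vector database:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; stands in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ExampleStore:
    """Stand-in for the vector DB holding human-annotated (email, label) pairs."""
    def __init__(self):
        self.rows = []

    def add(self, email, label):
        # Written by the human-reviewer computational unit.
        self.rows.append((embed(email), email, label))

    def top_k(self, query, k=2):
        # Read when building the few-shot prompt: return the k most
        # similar annotated examples to the incoming email.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[0]), reverse=True)
        return [(email, label) for _, email, label in ranked[:k]]
```

The retrieved `(email, label)` pairs would be spliced into the few-shot prompt, so the examples shown to the LLM are always the ones most similar to the email being classified.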

When we task LLMs with solving a more complex problem, it often helps to break it into smaller sub-problems. The idea behind chain-of-thought (CoT) prompting is to enable complex reasoning capabilities through intermediate reasoning steps. The usual pattern is to look for an intermediate output that’s needed to find (or is correlated with) the expected output.

Let’s return to the example of classifying emails as important or not important. Here, we could say that an email is only important if I need to respond to it urgently.

We can use this to split the task into two sub-tasks. First, we’ll query GPT-4 to see if the email requires an urgent response. Then we’ll take this result and query GPT-4 again to see if it’s an important email. Here’s what I query GPT-4 with at first:

You have received the following email. Please tell me if this email requires an urgent response. Please state your reasoning step by step.

Email:

Quick Question

Hey Iulian, I came across what you're doing at Korbit and would love to show you how we can help!

We actually just helped a business like yours grow to over $100k/mo using Linkedin ads.

We're offering a 50% discount on our service this month, but you have to act fast.

Let's chat this afternoon?

Cheers,

John Blarr

Here is the response GPT-4 gave me:

*This email does not require an urgent response. Here's the reasoning:*

*1. The sender, John Blarr, is offering a service, which means he is soliciting business. This is not an emergency or a situation that requires immediate attention.*

*2. The email mentions a discount that is available this month, but it does not specify a deadline or an expiration date for the offer. Therefore, it does not create a sense of urgency.*

*3. The sender suggests chatting in the afternoon, but does not specify a time or date. This suggests that the conversation could potentially be scheduled at a later date.*

*4. The email does not contain any information that would directly impact the operations or finances of your business in the immediate future.*

*5. The email does not mention any negative consequences that would occur if you do not respond immediately.*

*Therefore, while the email may be of interest and could potentially benefit your business, it does not require an urgent response.*

I then append this explanation (as free text) to a new prompt and ask it to classify the email as important or not important. Here’s the query for GPT-4:

I have received the following email. The email is only important if it's urgent.

This email does not require an urgent response. Here's the reasoning:

1. The sender, John Blarr, is offering a service, which means he is soliciting business. This is not an emergency or a situation that requires immediate attention.

2. The email mentions a discount that is available this month, but it does not specify a deadline or an expiration date for the offer. Therefore, it does not create a sense of urgency.

3. The sender suggests chatting in the afternoon, but does not specify a time or date. This suggests that the conversation could potentially be scheduled at a later date.

4. The email does not contain any information that would directly impact the operations or finances of your business in the immediate future.

5. The email does not mention any negative consequences that would occur if you do not respond immediately.

Therefore, while the email may be of interest and could potentially benefit your business, it does not require an urgent response.

Please tell me if it's important or not. If it's an important email, please write 1, otherwise please write 0.

Email:

Quick Question

Hey Iulian, I came across what you're doing at Korbit and would love to show you how we can help!

We actually just helped a business like yours grow to over $100k/mo using linkedin ads.

We're offering a 50% discount on our service this month, but you have to act fast.

Let's chat this afternoon?

Cheers,

John Blarr

GPT-4 returned “0” meaning it’s not an important email. This time we were able to solve the task using CoT Prompting, which required a different approach compared to few-shot learning.
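The two-stage pipeline above can be sketched as a single function. Here `llm` is any prompt-to-text callable (e.g. a thin wrapper around a GPT-4 API call); that callable, and the function name, are assumptions for illustration:

```python
def classify_with_cot(email, llm):
    """Two-stage chain-of-thought pipeline for email importance.

    Stage 1 asks for step-by-step reasoning about urgency; stage 2 feeds
    that reasoning back in and asks for the final 0/1 importance label.
    """
    reasoning = llm(
        "You have received the following email. Please tell me if this email "
        "requires an urgent response. Please state your reasoning step by step.\n\n"
        f"Email:\n{email}"
    )
    verdict = llm(
        "I have received the following email. The email is only important if "
        f"it's urgent.\n\n{reasoning}\n\n"
        "Please tell me if it's important or not. If it's an important email, "
        f"please write 1, otherwise please write 0.\n\nEmail:\n{email}"
    )
    return verdict.strip()
```

Chaining the calls this way keeps each prompt focused on one sub-task, which is exactly what made the classification succeed in the walkthrough above.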

This is great, but how do we know it will work well on other emails? How do we validate that it’s accurate on a statistically representative set of emails? Let’s look at statistical function testing.

When we think of validating software we often think of test cases. Given a (not too complex) deterministic function, we can write test cases to validate its behavior. We often aim to validate its behavior based on its internal logic. For example, if there is an if-clause, we might want to make sure our test cases cover both when the if-condition applies and when it doesn’t.

We typically cannot do that with LLMs (or “black box” APIs, or humans, or many other types of computational units). So we have to think like an ML engineer and use statistical testing.

The purpose of statistical testing is to provide a statistical guarantee that, under certain conditions, the system achieves a certain performance level. For example, the statistical guarantee might be something along the lines of: *with a 95% confidence level, the system will on average correctly classify 90% of input examples.*

In many cases, we can simulate our system to generate our dataset of expected input-output pairs for any given computational unit. We can make some reasonable assumptions about the statistical distribution and apply statistical testing to verify if the system achieves the performance required. If it does not pass the statistical test, then we know it must be improved.

Let’s again consider the LLM above classifying emails as important or not important. Let’s say the most we will tolerate is 1 in 10 emails being misclassified. This means we need to ensure a level of at least 90% accuracy.

I am arbitrarily picking the number 90%, but in general it will depend a lot on the system and the end user. In particular, if errors in this computational unit cascade downstream to other computational units, then we may want to set even higher performance requirements.

To build a statistical test, we can construct a dataset of, say, n=50 example input-output pairs. We can run these examples through the computational unit to get its predicted labels, calculate its accuracy, and then run a **binomial test** to see if it achieves 90% accuracy or more at a 95% confidence level. If this check fails, then our statistical test fails.

Given that we’re dealing with a binomial test, we have to use the **binomial proportion confidence interval**. A good approximation (the normal approximation) is to check that the following assertion holds:

p̂ − z · √( p̂ (1 − p̂) / n ) ≥ 0.90

Here **p̂** is the percentage of correctly classified emails in the dataset, *n* is the number of examples in the dataset (n=50 in this example) and *z* is the quantile of a standard normal distribution (1.96, given that we want a 95% confidence level).

If this test (or assertion) passes, we can guarantee that, with a 95% confidence level, the accuracy of the LLM computational unit will be at least 90% on average.
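The check described above fits in a few lines. This is a sketch of the normal approximation to the binomial proportion confidence interval; the function name is an assumption:

```python
import math

def passes_accuracy_test(correct, n, target=0.90, z=1.96):
    """Pass if the lower bound of the (approximate) 95% confidence
    interval around the observed accuracy p_hat is at least `target`."""
    p_hat = correct / n
    lower = p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)
    return lower >= target

# 48/50 correct (96% observed) clears the 90% bar at 95% confidence...
print(passes_accuracy_test(48, 50))  # True
# ...but 46/50 (92% observed) does not: the interval still dips below 90%.
print(passes_accuracy_test(46, 50))  # False
```

Note that with only n=50 examples, an observed accuracy well above 90% is needed to pass, because the confidence interval is still fairly wide.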

We can use the idea of statistical testing to also improve the model itself.

It’s well-known that even for the same task with the same LLM, different prompts can lead to vastly different results. Even a small change in the prompt template, such as changing a single word, can have a significant impact on the LLM’s ability to perform the task it has been given.

Given a dataset of n example input-output pairs for a given task and k prompt templates, we can use statistical testing to find the prompt template which yields the best result for the task.

The simplest approach is to use brute force. In this case, for every single prompt template, we evaluate the performance of that prompt template against all n example input-output pairs. For the example above, classifying if an email is important or not, we would simply calculate the accuracy of each prompt template against all n=50 examples. Then we would pick the prompt template with the highest accuracy and use this in the system.
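The brute-force selection can be sketched as follows. Here `llm_classify(template, email)` stands in for a call that formats the template with the email and queries the LLM; both names are assumptions for illustration:

```python
def best_prompt_template(templates, examples, llm_classify):
    """Brute-force prompt selection: score every template on all n
    labeled examples and keep the most accurate one."""
    def accuracy(template):
        hits = sum(
            llm_classify(template, email) == label for email, label in examples
        )
        return hits / len(examples)
    return max(templates, key=accuracy)
```

The winning template can then be frozen into the system, and the same statistical test from the previous section applied to it before deployment.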

This approach is easy to follow if the outcome is a binary variable (e.g. “important” or “not important” email), because we can do a direct comparison between the LLM’s output and the expected output.

However, if the expected output is a more complex object, such as an image or a paragraph of text (e.g. an email, a summary, or a translation), we cannot do a direct comparison between the LLM’s output and the expected output. We’ll need a different way to compare these.

If we’re dealing with relatively short texts (e.g. 5-10 lines), then one way to solve this problem is to ask a second LLM to compare the two texts (the first LLM’s output and the expected output) and to rate their similarity on, say, a scale of 1 to 5. We then collect a set of similarity scores and can use the average of these scores as the metric.

In this case the accuracy of this metric now depends on the accuracy of the second LLM. We can also model this statistically. Now both the first LLM’s output and the measurement (the similarity score) are random variables, each following its own statistical distribution. If we make appropriate assumptions about these two variables, we can construct a new statistical test and apply it to validate the performance of the system.
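The judge-based metric can be sketched in a few lines. Here `judge(a, b)` is assumed to wrap a prompt to the second LLM asking it to rate the similarity of two texts and to return an integer from 1 to 5; both the callable and the function name are assumptions:

```python
def mean_similarity(outputs, references, judge):
    """Average 1-5 similarity scores from a second "judge" LLM,
    comparing each system output against its expected output."""
    scores = [judge(out, ref) for out, ref in zip(outputs, references)]
    return sum(scores) / len(scores)
```

This average then plays the role that accuracy played in the binary case: it becomes the metric we compare across prompt templates.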

We’ve discussed four design patterns applicable to LLMs: few-shot learning, chain-of-thought prompting, statistical function testing, and statistical prompt-template selection. These design patterns change the way we build software with LLMs and can have a huge impact on a system’s maintainability, performance, speed and ability to scale.

I encourage you to keep learning about design patterns and share your insights with others.


How to Build Software with LLMs (Part 2)
