Context in LLM Systems of Experts

I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I’m hoping to grow some purple potatoes for my village’s potato competition next year. This post is not written with AI because I like writing.

In a previous blog post I introduced the system of experts. Now I’m going to talk about how you handle context in these systems.

What is context? Also known as memory or conversation history, it’s essentially a record of your interactions with an LLM system, which allows the system to serve meaningful responses.

The typical conversation turn through a system of experts using a routing workflow looks like this:

user: “I’d like to buy a bag of purple potatoes”
TriageExpert: "ORDERS"
OrderRequestExpert: variety: purple, number of bags: 1
OrderResponseExpert: “Ok great, please press confirm to buy one bag of purple potatoes”
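
To make that concrete, here’s a rough sketch of a single turn in TypeScript. It’s illustrative rather than our actual implementation: callLlm stands in for whichever client you use, the prompts are abbreviated, and I’ve expanded the shorthand messages you’ll see below into the role/content shape most providers expect.

// Illustrative sketch of one conversation turn through the routing workflow.
// callLlm is a hypothetical helper that sends a messages array to your LLM
// provider and returns the model's text output.
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string; }

declare function callLlm(messages: Message[]): Promise<string>;

async function handleTurn(userMessage: string): Promise<string> {
  // TriageExpert: classify the message into a category such as "ORDERS".
  const category = await callLlm([
    { role: "system", content: "categorise the user message into ORDERS...etc" },
    { role: "user", content: userMessage },
  ]);

  if (category !== "ORDERS") {
    return "Sorry, I can only help with potato orders today.";
  }

  // RequestExpert: work out the tool call and populate its parameters.
  const requestResult = await callLlm([
    { role: "system", content: "identify the correct tool call and populate the parameters" },
    { role: "user", content: userMessage },
  ]);

  // ResponseExpert: turn the result into something friendly for the user.
  return callLlm([
    { role: "system", content: "craft a delightful response to the human" },
    { role: "user", content: userMessage },
    { role: "assistant", content: requestResult },
  ]);
}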

Each time an Expert gets involved, we’re calling an LLM with a JSON payload. The key component of this payload for our purposes is the messages array.

For the very first conversation turn, this will look something like the following:

TriageExpert:

[
  {"system": "categorise the user message into ORDERS...etc using all of the message history"},
  {"user": "I'd like to buy a bag of purple potatoes"}
]

RequestExpert:

[
  {"system": "identify the correct tool call and populate the parameters based on the following messages"},
  {"user": "I'd like to buy a bag of purple potatoes"}
]

ResponseExpert:

[
  {"system": "craft a delightful response to the human using the following messages as context but only respond to the last one"},
  {"user": "I'd like to buy a bag of purple potatoes"},
  {"assistant": "Ask the user for confirmation that they want to buy one bag of purple potatoes"}
]

As you can see, we’re passing along the same user message each time, but with a different system prompt. The ResponseExpert gets some additional context: in the RequestExpert we have extracted the “variety: purple, number of bags: 1” parameters and we are ready to process the order using TradCode (a tongue-in-cheek term for Java or TypeScript or COBOL or your programming language of choice), so we just need to ask the user for confirmation.
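
Here’s a hedged sketch of that hand-off. The OrderRequest shape and the prepareOrder function are invented for illustration; the point is just that once the RequestExpert has produced structured parameters, plain deterministic TradCode takes over.

// Illustrative only: the structured output we'd expect the RequestExpert to
// produce for "I'd like to buy a bag of purple potatoes".
interface OrderRequest {
  variety: string;      // e.g. "purple"
  numberOfBags: number; // e.g. 1
}

// TradCode from here on: no LLM involved, just ordinary business logic.
function prepareOrder(request: OrderRequest): string {
  // Validate the extracted parameters before doing anything irreversible.
  if (request.numberOfBags < 1) {
    throw new Error("Number of bags must be at least 1");
  }
  // Hold the order and tell the ResponseExpert what to ask the user.
  return `Ask the user for confirmation that they want to buy ${request.numberOfBags} bag(s) of ${request.variety} potatoes`;
}

// prepareOrder({ variety: "purple", numberOfBags: 1 })
//   -> "Ask the user for confirmation that they want to buy 1 bag(s) of purple potatoes"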

However, let’s say the user changes their mind and sends a new message: “Actually, I want to buy two bags”.

At this point, if we haven’t implemented any memory, then our system won’t have the right context.

TriageExpert still has enough info to categorise it as an ORDERS interaction:

[
  {"system": "categorise the user message into ORDERS...etc using all of the message history"},
  {"user": "Actually, I want to buy two bags"}
]

RequestExpert, however, doesn’t have enough information – two bags of what?

[
  {"system": "identify the correct tool call and populate the parameters based on the following messages"},
  {"user": "Actually, I want to buy two bags"}
]

(Don’t worry, there are techniques for dealing with partial information which I’ll cover in a later blog post.)

This is where we need to start introducing memory into these systems, because what we actually want that second conversation turn to look like is this:

TriageExpert:

[
  {"system": "categorise the user message into ORDERS...etc using all of the message history"},
  {"user": "I'd like to buy a bag of purple potatoes"},
  {"assistant": "OK great, please confirm your order"},
  {"user": "Actually, I want to buy two bags"}
]

RequestExpert now has enough information: purple potatoes (from the first conversation turn) and two bags (from the second conversation turn).

[
  {"system": "identify the correct tool call and populate the parameters based on the following messages"},
  {"user": "I'd like to buy a bag of purple potatoes"},
  {"assistant": "OK great, please confirm your order"},
  {"user": "Actually, I want to buy two bags"}
]

And the ResponseExpert now gets this data:

[
  {"system": "craft a delightful response to the human using the following messages as context but only respond to the last one"},
  {"user": "I'd like to buy a bag of purple potatoes"},
  {"assistant": "OK great, please confirm your order"},
  {"user": "Actually, I want to buy two bags"},
  {"assistant": "Ask the user for confirmation that they want to buy two bags of purple potatoes"}
]

This brings us to a concept we call the user thread. There isn’t really an industry standard around terminology for memory in LLM-based systems, so we at Zopa created this term to describe the messages that travel back and forth between the user and an LLM-based system, and which are used to give context to the various LLM-based experts. It’s not guaranteed to be the same as a transcript, for reasons I’ll discuss below.

Here is the user thread at this point in the flow:

[
  {"user": "I'd like to buy a bag of purple potatoes"},
  {"assistant": "OK great, please confirm your order"},
  {"user": "Actually, I want to buy two bags"},
  {"assistant": "Ok can you confirm you want to buy two bags?"}
]

Note that there are no system messages, only user and assistant. Each expert has its own system prompt but they all get this user thread.
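
In code, that composition might look something like this minimal sketch (the types and helper names are mine for illustration, not a Zopa API):

// The user thread only ever contains user and assistant messages,
// never system prompts.
type ThreadRole = "user" | "assistant";
interface ThreadMessage { role: ThreadRole; content: string; }
type UserThread = ThreadMessage[];

type ChatMessage = { role: "system"; content: string } | ThreadMessage;

// Every expert gets its own system prompt plus the same shared user thread.
function buildExpertMessages(systemPrompt: string, thread: UserThread): ChatMessage[] {
  return [{ role: "system", content: systemPrompt }, ...thread];
}

// During a turn: append the incoming user message, run the experts against
// the thread, then append the reply we actually sent back to the user.
const thread: UserThread = [
  { role: "user", content: "I'd like to buy a bag of purple potatoes" },
  { role: "assistant", content: "OK great, please confirm your order" },
];

thread.push({ role: "user", content: "Actually, I want to buy two bags" });

const triageMessages = buildExpertMessages(
  "categorise the user message into ORDERS...etc using all of the message history",
  thread,
);

// ...TriageExpert, RequestExpert and ResponseExpert all run off the same thread...

thread.push({ role: "assistant", content: "Ok can you confirm you want to buy two bags?" });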

Why is the user thread not the same as a transcript?

Good question! Let’s imagine a very chatty customer who is ordering a lot of potatoes and goes back and forth about how many and what variety. This user thread could get quite long.

[
  {"user": "I'd like to buy a bag of purple potatoes"},
  {"assistant": "OK great, please confirm your order"},
  {"user": "Actually, I want to buy two bags"},
  {"assistant": "Ok can you confirm you want to buy two bags?"},
  {"user": "actually maybe Idaho potatoes are better this time of year"},
  {"assistant": "ok, shall I order some Idaho potatoes?"}
  {"user": "no let's go back to my first plan but make it three bags"}
]

Eventually it’s going to get so long that two things will happen:

  1. The LLM’s context window will get overloaded
  2. You’ll start having very expensive interactions

As LLM context windows grow larger, the chances of overloading the window decrease. You’d have to really like talking about potatoes and have very few demands on your time to overload it.

But because LLMs are usually billed on tokens, the longer your user thread, the more tokens you send each time and the more the model provider charges you.
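
As a rough illustration of why thread length translates into cost (the four-characters-per-token figure is only a common rule of thumb for English text; in practice you’d use your provider’s tokeniser):

// Very rough token estimate for a user thread, using the common
// "about 4 characters per token" rule of thumb for English text.
// A real system would use the provider's own tokeniser instead.
interface ThreadMessage { role: "user" | "assistant"; content: string; }

function estimateTokens(thread: ThreadMessage[]): number {
  const totalChars = thread.reduce((sum, message) => sum + message.content.length, 0);
  return Math.ceil(totalChars / 4);
}

// Every message you keep in the thread is re-sent (and re-billed) on every
// expert call in every subsequent turn.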

So it might make sense, after a certain length, to do something about this big context. There are two options here:

  1. Drop a bunch of messages from the beginning of the conversation. This is not great though because the model may need the context of those earlier messages.
  2. Summarise the messages. This uses another LLM Expert, so it’s a bit trickier than naively dropping messages, but the end result is that instead of 6 outdated messages you have 1 message summarising the earlier parts of the conversation (there’s a rough sketch of this below).
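
Here’s a very rough sketch of option 2. The thresholds, the callSummaryExpert helper and the exact replacement strategy are all placeholders, and a real version would more likely budget on tokens than on message count.

// Placeholder types matching the earlier sketches.
type ThreadRole = "user" | "assistant";
interface ThreadMessage { role: ThreadRole; content: string; }
type UserThread = ThreadMessage[];

// Hypothetical call to a summarising Expert: given a slice of the thread,
// it returns a short summary of what was discussed.
declare function callSummaryExpert(messages: UserThread): Promise<string>;

const MAX_THREAD_LENGTH = 20; // in practice you'd more likely budget on tokens
const KEEP_RECENT = 6;        // always keep the most recent messages verbatim

async function compactThread(thread: UserThread): Promise<UserThread> {
  if (thread.length <= MAX_THREAD_LENGTH) {
    return thread; // nothing to do yet
  }

  const older = thread.slice(0, thread.length - KEEP_RECENT);
  const recent = thread.slice(thread.length - KEEP_RECENT);

  // Replace the older messages with one message containing their summary.
  const summary = await callSummaryExpert(older);
  return [
    { role: "assistant", content: `Summary of the conversation so far: ${summary}` },
    ...recent,
  ];
}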

So this is why the user thread and the transcript aren’t necessarily the same thing. The transcript is the record of what actually happened and the user thread is a useful approximation of the transcript that we can give an LLM Expert in a cost-effective manner.

The other key thing to remember is that all the “gubbins” of the system (the system prompts for each expert, as well as their results) aren’t stored in the user thread. You’ll probably want to store them somewhere for audit and debugging purposes, but make sure they never get into the user thread, or this unnecessary data will a) confuse the experts and b) incur more per-token costs.
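
A hedged sketch of keeping that audit data out of the user thread (the AuditEvent shape and saveAuditEvent are invented for illustration):

// Everything the experts see and produce gets recorded for audit and
// debugging - but into a separate store, never into the user thread.
interface AuditEvent {
  conversationId: string;
  expert: string;         // e.g. "TriageExpert"
  systemPrompt: string;
  inputMessages: unknown; // whatever was sent to the model
  output: string;
  timestamp: Date;
}

// Hypothetical persistence: a database table, log stream, etc.
declare function saveAuditEvent(event: AuditEvent): Promise<void>;

async function recordExpertCall(
  conversationId: string,
  expert: string,
  systemPrompt: string,
  inputMessages: unknown,
  output: string,
): Promise<void> {
  await saveAuditEvent({
    conversationId,
    expert,
    systemPrompt,
    inputMessages,
    output,
    timestamp: new Date(),
  });
  // Deliberately no thread.push(...) here: the user thread stays user/assistant only.
}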

And yes, potatoes really can be purple.

Coming up next: handling sensitive data in LLM-based systems
