Hi, I’m Paula Muldoon – a staff software engineer in multimodal GenAI at Zopa Bank in London.

Review season is over and if you’re like me, you’re having the usual existential crisis – sorry, opportunity – to work out what being a staff engineer actually is. It’s very clear that AI coding tools are transforming how we build software, so how does this change the role of a staff engineer?

Over the past ~10 years, we’ve converged on an industry-standard view of the essentials of staff engineering. Claude summarises Will Larson and Tanya Reilly’s work as this:

A Staff Engineer is a senior individual contributor who leads through technical influence rather than management. As Will Larson describes, it’s leadership beyond the management track — setting technical direction, shaping strategy, and mentoring others. Tanya Reilly’s foreword reinforces this: the role demands broad organizational impact, not just deep expertise. – Sonnet 4.6

This definition of staff engineering, particularly the organisational impact, made a lot of sense before 2025. Staff engineers need to stop being hands-on with the code as the majority of their work and spend time teaching others, setting strategy, etc. I was fully on board with this – it’s probably not the best use of my time to spend a week implementing a feature that’s easy for me but a good growth opportunity for someone else, especially when there are more strategic goals to go after.

But if you think about it, org impact is a terrible metric. Do your customers care about your org? No, of course not. They care about whether your product works for them. But because writing code well has been a time sink, we’ve tried to make sure our most expensive engineers have spent time scaling others, because that was the most cost-effective use of their time.

AI software tools have changed that.

I’m not here to convince you of how AI software tools will change the industry. If you know, you know, and if you don’t, go away, download Claude, and start playing with Opus 4.6. You’re going to have so much fun! (Don’t believe me? Listen to Kent Beck.)

That feature that would have taken me a week to build? It’s a day now. Analysing a system and coming up with a design? Minutes. Cost of change? Way lower than it used to be.

2026 is the year staff+ engineers need to get hands-on again.

Here’s why:

You need to recalibrate your tradeoff thinking

    One of the things that makes you valuable is your ability to weigh tradeoffs, informed by years of experience of how software gets built. You go into a meeting with product leads and say “This set of features will take six months to build, but we can cut this one feature and have something almost as good in 3 months.” But what took six months in January 2025 takes one month in March 2026. There’s no way you can know that unless you have hands-on experience building with these tools.

    Yes, you still need to scale the engineers around you, but they might be scaling you. Everybody is figuring out how to use these new tools and if you don’t figure it out, suddenly you’ll be recommending poor decisions to your org because you don’t understand how the world has changed.

    (Side note: I’m continually amazed at how much more quickly we can build things now – not just because of the code generation, but because of the systems analysis. You don’t need a week of technical spiking and research any more to determine feasibility. You give Claude the context of the relevant systems and 20 minutes later you have your answer. I am fully aware that I haven’t gotten used to this speed, and I’m scared of deploying so much software so quickly, and I think I can move faster. That’s my personal challenge.)

    You can move faster than anybody else with these tools.

    Remember all that time spent as a senior engineer when coding was most of the job? You have deep technical and product expertise. You can use these tools to accelerate yourself and ship fast. Suddenly, spending your time mentoring and coaching other engineers feels a bit like getting a Kentucky Derby winner to coach yearlings instead of running the race. (OK, terrible metaphor since I think racehorses flame out after a couple of years, but you get the point.) You should be blazing the trail of how to accelerate everybody, and the best way to do that is to learn it yourself and then point to the results.

    Does it mean you don’t coach and mentor? No, but it’s not such a big part of your role for now. And crucially, there are early-career engineers who aren’t hampered by the past, who will be teaching YOU how to use Claude better. Have the humility to learn from them.

    Org impact doesn’t matter. Customer impact does.

    This is going to be controversial but I propose that we’ve been considering org impact as a proxy for customer impact. We assume that if a staff engineer is having org-wide impact, that translates into significant customer impact – but that’s not a given by any means. Customer impact is the only metric that matters. We’ve gotten very rigid about our definitions of staff engineer impact – ironically, as it’s actually tough to measure. It’s a lot easier to define and measure customer impact so why the hell aren’t we doing that? Org impact then just becomes an implementation detail under the hood. Sure you’ll still want to talk about it, but you shouldn’t be having conversations about how to have more org impact, you should be having conversations about how to have better customer impact, and sometimes that’s through org impact and sometimes it’s not.

    This customer impact that you should be held accountable for – it should be big in scope – we’re not talking a few features here, we’re talking entire product suites, with uptime guarantees etc. And customer impact still means strategic thinking but your strategy has to be informed by your hands-on knowledge of the new software engineering landscape.

    What about 2027?

    I think it’s likely that staff engineering will change again in 2027. At some point the AI tooling will level off and staff engineers will recalibrate and go back to the more strategic/coaching role. I don’t know when that is, and I don’t know when the next big shift will come. But for now, you need to be at the forefront of the new technologies.

    (Side note: depending on your company, you may prefer business impact over customer impact. One of the things I love about working for Zopa is that we genuinely do consider customer impact first. And we’re hiring, by the way, so come work with me!)

    With thanks to the colleague who has been challenging my thinking on the staff engineer role!

    I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I love growing potatoes to break up difficult soil. This post is not written with AI because I like writing.

    In my last post, Context in LLM Systems of Experts, I mentioned that you probably want to store an audit trail of LLM interactions.

    However, some of that data may be very sensitive so you need to handle it in line with regulations as well as what’s best for your customer.

    It’s best practice to assume that any data sent to an LLM can be extremely sensitive. Some examples:

    • a customer gives their name and address when ordering potatoes (PII, or personally identifiable information)
    • a customer enters their credit card number (PCI data) – you probably don’t want them entering this data in this way but users do weird things
    • a customer is explaining they can’t pay their potato bill because they tripped and fell over a bucket of Brussels sprouts and aren’t able to work due to a broken wrist (medical data)

    So make sure wherever you’re storing this data has the correct safeguards – encryption, access control, the usual.

    Do not log inputs and responses to LLMs as they may contain sensitive data. (Also, LLMs can be wordy so you might end up paying for a lot of log storage you don’t really need.)

    It’s very tempting to take transcripts from LLM-based conversations and turn them into evals. If you’re going to do this, make sure you scrub any sensitive data first.

    And make sure that you have an agreement with whatever model provider you’re using that they won’t use your data (including your customers’ data) for training models. If you want to use this data for fine-tuning models and your T&Cs only say you won’t use the data for training models, you’re in a grey area. I’d make sure your customers know and, as always, scrub the sensitive data first.

    I firmly believe that software engineers have an ethical duty (as well as a legal obligation, at least in the UK) to protect our users’ data. Hopefully you agree, so make sure, as you’re building your potato (and other vegetable) LLM systems, that you’re keeping your users’ best interests at heart.

    I think what I’m trying to say is scrub the sensitive data like you’d scrub your potatoes before entering them into the village gardening competition.

    Next up: handling partial data in system of expert workflows

    I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I’m hoping to grow some purple potatoes for my village’s potato competition next year. This post is not written with AI because I like writing.

    In a previous blog post I introduced the system of experts. Now I’m going to talk about how you handle context in these systems.

    What is context? Also known as memory and conversation history, it’s essentially a record of your interactions with an LLM system which allows the system to serve meaningful responses.

    The typical conversation turn through a system of experts using a routing workflow looks like this:

    user: “I’d like to buy a bag of purple potatoes”
    TriageExpert: "ORDERS"
    OrderRequestExpert: variety: purple, number of bags: 1
    OrderResponseExpert: “Ok great, please press confirm to buy one bag of purple potatoes”

    Each time an Expert gets involved, we’re calling an LLM with a JSON payload. The key component of this payload we’re discussing is the messages array.

    For the very first conversation turn, this will look something like the following:

    TriageExpert:

    [
      {"system": "categorise the user message into ORDERS...etc using all of the message history"},
      {"user": "I'd like to buy a bag of purple potatoes"}
    ]

    RequestExpert:

    [
      {"system": "identify the correct tool call and populate the parameters based on the following messages"},
      {"user": "I'd like to buy a bag of purple potatoes"}
    ]

    ResponseExpert:

    [
      {"system": "craft a delightful response to the human using the following messages as context but only respond to the last one"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "Ask the user for confirmation that they want to buy one bag of purple potatoes"}
    ]

    As you can see, we’re passing along the same user message each time, but with a different system prompt. The response expert gets some additional context – in the RequestExpert we have extracted the “variety: purple, number of bags: 1” parameters and we are ready to process the order using TradCode (tongue-in-cheek term to mean Java or TypeScript or COBOL or your programming language of choice) so we just need to ask for confirmation.
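    In code, this pattern is just “prepend the expert’s own system prompt to the shared messages”. Here’s a minimal sketch – the prompts and expert names mirror the examples above, and the message shape uses the same shorthand as this post rather than any particular provider’s API.

```python
# Each expert shares the same user-facing messages but has its own system prompt.
SYSTEM_PROMPTS = {
    "TriageExpert": "categorise the user message into ORDERS...etc using all of the message history",
    "RequestExpert": "identify the correct tool call and populate the parameters based on the following messages",
    "ResponseExpert": "craft a delightful response to the human using the following messages as context but only respond to the last one",
}

def build_payload(expert: str, user_thread: list[dict]) -> list[dict]:
    """Prepend the expert's own system prompt to the shared message history."""
    return [{"system": SYSTEM_PROMPTS[expert]}] + user_thread

thread = [{"user": "I'd like to buy a bag of purple potatoes"}]
print(build_payload("TriageExpert", thread))
```

    The same thread goes to every expert; only the system prompt (and, for the ResponseExpert, any extra assistant context) changes.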

    However, let’s say the user changes their mind and sends a new message: “Actually, I want to buy two bags”.

    At this point, if we haven’t implemented any memory, then our system won’t have the right context.

    TriageExpert still has enough info to categorise it as an ORDERS interaction:

    [
      {"system": "categorise the user message into ORDERS...etc using all of the message history"},
      {"user": "Actually, I want to buy two bags"}
    ]

    RequestExpert, however, doesn’t have enough information – two bags of what?

    [
      {"system": "identify the correct tool call and populate the parameters based on the following messages"},
      {"user": "Actually, I want to buy two bags"}
    ]

    (Don’t worry, there are techniques for dealing with partial information which I’ll cover in a later blog post.)

    This is where we need to start introducing memory into these systems, because actually what we want that second conversation turn to look like is this:

    TriageExpert:

    [
      {"system": "categorise the user message into ORDERS...etc using all of the message history"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"}
    ]

    RequestExpert now has enough information: purple potatoes (from the first conversation turn) and two bags (from the second conversation turn).

    [
      {"system": "identify the correct tool call and populate the parameters based on the following messages"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"}
    ]

    And the ResponseExpert now gets this data:

    [
      {"system": "craft a delightful response to the human using the following messages as context but only respond to the last one"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"},
      {"assistant": "Ask the user for confirmation that they want to buy two bags of purple potatoes"}
    ]

    This brings us to a concept we call user thread. There isn’t really an industry standard around terminology for memory in LLM-based systems so we at Zopa created this term to describe the messages that travel back and forth between the user and an LLM-based system and which are used to give context to various LLM-based experts. It’s not guaranteed to be the same as a transcript, for reasons I’ll discuss below.

    Here is the user thread at this point in the flow:

    [
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"},
      {"assistant": "Ok can you confirm you want to buy two bags?"}
    ]

    Note that there are no system messages, only user and assistant. Each expert has its own system prompt but they all get this user thread.
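    Maintaining the user thread can be as simple as an append-only list with two kinds of entries. A minimal sketch (the class name is mine, not an industry term):

```python
# Only user-facing and assistant-facing messages go into the user thread.
# Expert system prompts and intermediate outputs are deliberately kept out.
class UserThread:
    def __init__(self) -> None:
        self.messages: list[dict] = []

    def add_user(self, text: str) -> None:
        self.messages.append({"user": text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({"assistant": text})

thread = UserThread()
thread.add_user("I'd like to buy a bag of purple potatoes")
thread.add_assistant("OK great, please confirm your order")
thread.add_user("Actually, I want to buy two bags")
print(len(thread.messages))  # 3 messages, no system prompts among them
```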

    Why is the user thread not the same as a transcript?

    Good question! Let’s imagine a very chatty customer who is ordering a lot of potatoes and goes back and forth about how many and what variety. This user thread could get quite long.

    [
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"},
      {"assistant": "Ok can you confirm you want to buy two bags?"},
      {"user": "actually maybe Idaho potatoes are better this time of year"},
      {"assistant": "ok, shall I order some Idaho potatoes?"},
      {"user": "no let's go back to my first plan but make it three bags"}
    ]

    Eventually it’s going to get so long that two things will happen:

    1. The LLM’s context window will get overloaded
    2. You’ll start having very expensive interactions

    As LLM context windows grow larger, the chances of overloading the window decrease. You’d have to really like talking about potatoes and have very few demands on your time to overload it.

    But because LLMs are usually billed on tokens, the longer your user thread, the more tokens you send each time and the more the model provider charges you.

    So it might make sense, after a certain length, to do something about this big context. There are two options here:

    1. Drop a bunch of messages from the beginning of the conversation. This is not great though because the model may need the context of those earlier messages.
    2. Summarise the messages. This uses another LLM Expert so it’s a bit trickier than naively dropping messages but the end result is that instead of 6 outdated messages, you have 1 message with a summary of the earlier parts of the conversation.
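    The two options can be sketched like this. The thresholds are arbitrary illustrations, and summarise_with_llm is a hypothetical stand-in for a call to a summariser Expert.

```python
def truncate_thread(thread: list[dict], keep_last: int = 4) -> list[dict]:
    """Option 1: naively drop the oldest messages (earlier context may be lost)."""
    return thread[-keep_last:]

def compact_thread(thread: list[dict], summarise_with_llm, max_len: int = 6) -> list[dict]:
    """Option 2: replace the old messages with a single summary message."""
    if len(thread) <= max_len:
        return thread
    old, recent = thread[:-4], thread[-4:]
    summary = summarise_with_llm(old)  # another LLM Expert call
    return [{"assistant": f"Summary of earlier conversation: {summary}"}] + recent
```

    With the chatty potato customer above, option 2 turns six-plus outdated messages into one summary line plus the recent turns, so every expert still gets the gist at a fraction of the token cost.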

    So this is why the user thread and the transcript aren’t necessarily the same thing. The transcript is the record of what actually happened and the user thread is a useful approximation of the transcript that we can give an LLM Expert in a cost-effective manner.

    The other key thing to remember is that all the “gubbins” of the system, the system prompts for each expert as well as their results, aren’t being stored in the user thread. But you’ll probably want to store them somewhere for audit and debugging purposes – just make sure they never get into the user thread or this unnecessary data will a) confuse the experts and b) incur more per-token costs.

    And yes, potatoes really can be purple.

    Coming up next: handling sensitive data in LLM-based systems

    I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I recently failed to win any prizes in my village’s potato-growing competition. This post is not written with AI because I like writing.

    Agentic AI is all the rage these days. An AI that can control its own destiny? How cool is that! (Or, depending on your view of AI, how utterly terrifying.)

    But if you’re trying to build a system to interact with humans, what you (probably) really want is a system of experts. What’s the difference?

    According to Wikipedia, Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks without human intervention.

    A system of experts is a chain of foundational models (aka LLMs) which has a human on the end of each conversational turn. A very classic example for this is an AI chatbot.

    But agentic AI sounds way cooler, so why can’t I have that?

    Ok, let’s dig into an example. Let’s say you’re a potato farm and you want to use AI to handle your customer interactions. Everything from ordering potatoes to complaining that your Charlottes were too small. How does this work with a system of experts?

    Your customer contacts you, via voice or chat or email or whatever, and says “I want to order another bag of Maris Piper potatoes.” This goes to a TriageExpert, which has a prompt something like:

    If the customer is talking about orders, return ORDERS
    If the customer is making a complaint, return COMPLAINT
    If the customer is talking about something else, return CANNOT_HELP

    If you think this sounds like an old-style IVR system, you’re not wrong. If you think you could do way better with agentic AI…hold your horses.

    In our case, the customer is talking about orders so we use structured output to return ORDERS.

    We then use TradCode (a snarky term I invented to mean the bog-standard, deterministic Java or Kotlin or Python or whatever code you’ve been writing since before LLMs arrived) to make a very fancy algorithm:

    if ORDERS:
      -> OrdersRequestExpert
    if COMPLAINT:
      -> ComplaintRequestExpert
    if CANNOT_HELP:
      -> CannotHelpResponseExpert
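    In actual TradCode that “very fancy algorithm” is a dictionary dispatch. The handle_* functions below are hypothetical stand-ins for invoking each downstream expert.

```python
# Deterministic routing on the TriageExpert's structured output.
def handle_orders(msg: str) -> str: return f"OrdersRequestExpert({msg})"
def handle_complaint(msg: str) -> str: return f"ComplaintRequestExpert({msg})"
def handle_cannot_help(msg: str) -> str: return f"CannotHelpResponseExpert({msg})"

ROUTES = {
    "ORDERS": handle_orders,
    "COMPLAINT": handle_complaint,
    "CANNOT_HELP": handle_cannot_help,
}

def route(triage_result: str, message: str) -> str:
    # Anything unexpected falls back to CANNOT_HELP rather than crashing.
    return ROUTES.get(triage_result, handle_cannot_help)(message)
```

    Note how testable this is: the routing itself involves no LLM at all.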

    C’mon, use an agent already I hear you say, possibly cheered on by a blog post you read.

    Bear with me, though.

    Next step is to let this OrdersRequestExpert do something. You’ve probably given it a tool call named CREATE_ORDERS and the LLM will return that tool call with params like this:

    variety: Maris Piper
    number of bags: 1
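    For concreteness, here is one way the CREATE_ORDERS tool might be declared, using the JSON-schema style that most providers’ function-calling APIs accept. The exact wrapper keys vary by provider, so treat this shape as illustrative.

```python
# A sketch of a tool declaration for the OrdersRequestExpert.
CREATE_ORDERS = {
    "name": "CREATE_ORDERS",
    "description": "Place a potato order for the customer",
    "parameters": {
        "type": "object",
        "properties": {
            "variety": {"type": "string", "description": "Potato variety, e.g. Maris Piper"},
            "number_of_bags": {"type": "integer", "minimum": 1},
        },
        "required": ["variety", "number_of_bags"],
    },
}
```

    Declaring number_of_bags as an integer with a minimum already rules out some of the nonsense outputs we’ll see shortly.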

    At this point, you’d still like that human in the loop and HEY IT TURNS OUT YOU HAVE A HUMAN! So you tell your customer, via the OrderResponseExpert, “I’m going to order a bag of Maris Piper potatoes. Press confirm or cancel.” The user is jonesin’ for potatoes, so they press confirm.

    Then TradCode reappears, and you call your /orderpotato endpoint with the payload the OrderRequestExpert has identified for you.

    This is a very basic workflow, where the foundational model is responsible for translating the human speak into computer speak. You then use all your standard APIs and code to actually fulfil the customer’s wishes.

    Please can we use agentic AI now? OK, let’s see how that would look.

    Your customer contacts you, via voice or chat or email or whatever, and says “I want to order another bag of Maris Piper potatoes.” Let’s say you have a RequestAgent, which can deal with any incoming request. It’s got a huge bunch of tool calls to choose from but it’s agentic AI so it’s really really good at choosing the right tool call and never makes a mistake.

    [That was sarcasm. -Editor]

    It calls the order potatoes tool call 100% of the time [more sarcasm], but it needs to question its own decisions so it now calls a JudgeLLM. This judge is going to look at what’s happened so far and decide whether the output of the OrderRequestExpert is good enough to process. So here’s a few examples of what the OrderRequestExpert might come up with assuming it’s selected the right tool call (which we know it reliably does) [even more sarcasm]:

    variety: Maris Pooper
    number of bags: 1
    
    variety: Maris Piper
    number of bags: 100
    
    variety: Maris Piper
    number of bags: one
    
    variety: Marish Piper
    number of bags: 1
    
    variety: Maris Piper
    number of bags: a

    As you can see, there’s quite a few varieties of answer there. But the JudgeLLM has a simple task: when it receives a request, decide whether that request should be processed as-is or sent back to try again.

    variety: Maris Pooper
    number of bags: 1
    
    JudgeAnswer: yes...
    OR no

    So the key difference here is that the JudgeLLM is doing what the human in the loop did in the first example: deciding if an output is correct.

    So let’s say the JudgeLLM has finally decided that something should be processed. Let’s say it’s decided to process this output:

    variety: Maris Piper
    number of bags: 100

    We’re now ordering 100x the number of bags of Maris Piper that our customer wanted. An easy mistake for an LLM to make, especially if there was voice transcription involved. Maybe we should present them with a confirmation page…

    So we’re back to square 1 with agentic AI, where when we’re taking actions that could cost the customer money, we want the customer to confirm the action is right.
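    Worth noting: most of the malformed outputs above don’t need a JudgeLLM at all. A deterministic check in plain TradCode rejects them before anything reaches a judge or a customer, and the plausible-but-wrong values (100 bags) still go to the human confirmation step. KNOWN_VARIETIES is an illustrative list.

```python
# Deterministic validation of the expert's tool-call parameters.
KNOWN_VARIETIES = {"Maris Piper", "Charlotte", "Idaho"}

def validate_order(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the order can proceed."""
    errors = []
    if params.get("variety") not in KNOWN_VARIETIES:
        errors.append(f"unknown variety: {params.get('variety')!r}")
    bags = params.get("number_of_bags")
    if not isinstance(bags, int) or bags < 1:
        errors.append(f"invalid number of bags: {bags!r}")
    return errors

print(validate_order({"variety": "Maris Pooper", "number_of_bags": 1}))   # rejected
print(validate_order({"variety": "Maris Piper", "number_of_bags": "one"}))  # rejected
print(validate_order({"variety": "Maris Piper", "number_of_bags": 100}))  # passes; human confirms
```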

    Now in both these scenarios, the OrderRequestExpert or RequestAgent might get it wrong. But there’s a few failure scenarios here that companies should care about:

    OrderRequestExpert/Agent gives a wrong answer that the JudgeLLM says is correct but which the human says is wrong. Now all we’ve done is paid extra processing costs to [insert model provider of your choice] for the judge and still made the human do the work for us.

    OrderRequestExpert/Agent gives a right answer that JudgeLLM says is wrong. This is even worse – we’ve paid more money for the JudgeLLM, plus more money for the next round of LLM interactions, and the human is waiting longer for their correct answer.

    OrderRequestExpert/Agent gives a wrong answer that JudgeLLM (correctly) says is wrong and we end up in a looping request that costs extra money. The longer this loop goes on, the more expensive it is and the more frustrated the customer is.

    Is it possible that agentic AI is hype driven by companies that charge per token?

    Aside from the potentially expensive failure scenarios, there’s another problem with agentic AI: testing.

    In this scenario, there is a defined number of outcomes (order potatoes, complain about potatoes, leave a review about potatoes, change order address, cancel order, blah blah). It may be a long list but it’s not infinite. And you probably care about everything on that list working – if an address update fails, that’s bad, because where do you send your potatoes?

    With the system of experts, you can “unit test” the experts. Given any input, the TriageExpert should only return a specific ENUM (ORDERS, COMPLAINTS, CANNOT_HELP). So you can test just that TriageExpert and tune it to give the correct results. Same with the OrderRequestExpert, although in this case you expect structured JSON instead of an enum. Only the OrderResponseExpert is tricky to test, because you need some NLP (this may be a case for a JudgeLLM). Or you could decide you don’t care that much about delighting your users with AI-generated responses and just return from a list of hardcoded responses.
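    A sketch of what “unit testing” the TriageExpert can look like: wrap the LLM call behind a function and assert on the enum it returns. Here call_llm is a hypothetical wrapper around your provider’s chat API; in a real test suite you’d run this against a set of recorded cases (with a real model or recorded responses) and track the pass rate.

```python
# The TriageExpert must only ever produce one of these values.
VALID = {"ORDERS", "COMPLAINT", "CANNOT_HELP"}

def triage(message: str, call_llm) -> str:
    """Categorise a customer message, coercing anything unexpected to CANNOT_HELP."""
    result = call_llm(
        system="categorise the user message into ORDERS, COMPLAINT or CANNOT_HELP",
        user=message,
    ).strip().upper()
    return result if result in VALID else "CANNOT_HELP"

# In tests, substitute a stub for the model and assert on the enum:
fake_llm = lambda system, user: "ORDERS"
assert triage("I'd like a bag of Maris Pipers", fake_llm) == "ORDERS"
```

    Because the contract is an enum, each expert can be tuned and regression-tested in isolation – exactly what the coupled agent-plus-judge system can’t do.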

    Testing the agentic AI? Harder and more expensive because you need to test not just the inputs and outputs, but you also need to test how many tries it takes. So if your agentic AI takes 100 (internal) turns to come up with a correct answer, at which point the user would be bored, has your test passed or failed? And is it a Pyrrhic victory, because you’ve spent so much of your LLM budget just on tests?

    This system is also not decomposable: the JudgeLLM has been tightly coupled to the RequestAgent – you can only meaningfully test them as a system, not as individual components. With latencies around 800ms [that’s a decent average I’ve seen using gpt-4.1 deployed in Azure in a geographically nearish region to me], each test will take 1.6 seconds AT LEAST to run, and potentially rack up a LOT of token costs. And you’ll probably need hundreds if not thousands of test cases.

    So: you probably don’t need agentic AI.

    Coming up next: Context in LLM Systems of Experts