Hi, I’m Paula Muldoon – a staff software engineer in multimodal GenAI at Zopa Bank in London.

Review season is over and if you’re like me, you’re having the usual existential crisis – sorry, opportunity – to work out what being a staff engineer actually is. It’s very clear that AI coding tools are transforming how we build software, so how does this change the role of a staff engineer?

Over the past ~10 years, we’ve converged on an industry-standard view of the essentials of staff engineering. Claude summarises Will Larson and Tanya Reilly’s work as this:

A Staff Engineer is a senior individual contributor who leads through technical influence rather than management. As Will Larson describes, it’s leadership beyond the management track — setting technical direction, shaping strategy, and mentoring others. Tanya Reilly’s foreword reinforces this: the role demands broad organizational impact, not just deep expertise. – Sonnet 4.6

This definition of staff engineering, particularly the organisational impact, made a lot of sense before 2025. Staff engineers need to stop being hands-on with the code as the majority of their work and spend time teaching others, setting strategy, etc. I was fully on board with this – it’s probably not the best use of my time to spend a week implementing a feature that’s easy for me but a good growth opportunity for someone else, especially when there are more strategic goals to go after.

But if you think about it, org impact is a terrible metric. Do your customers care about your org? No, of course not. They care about whether your product works for them. But because writing code well has been a time sink, we’ve tried to make sure our most expensive engineers have spent time scaling others, because that was the most cost-effective use of their time.

AI software tools have changed that.

I’m not here to convince you of how AI software tools will change the industry. If you know, you know, and if you don’t, go away, download Claude, and start playing with Opus 4.6. You’re going to have so much fun! (Don’t believe me? Listen to Kent Beck.)

That feature that would have taken me a week to build? It’s a day now. Analysing a system and coming up with a design? Minutes. Cost of change? Way lower than it used to be.

2026 is the year staff+ engineers need to get hands-on again.

Here’s why:

You need to recalibrate your tradeoff thinking

    One of the things that makes you valuable is your ability to weigh tradeoffs, informed by years of experience of how software gets built. You go into a meeting with product leads and say “This set of features will take six months to build, but we can cut this one feature and have something almost as good in 3 months.” But what took six months in January 2025 takes one month in March 2026. There’s no way you can know that unless you have hands-on experience building with these tools.

    Yes, you still need to scale the engineers around you, but they might be scaling you. Everybody is figuring out how to use these new tools and if you don’t figure it out, suddenly you’ll be recommending poor decisions to your org because you don’t understand how the world has changed.

    (Side note: I’m continually amazed at how much more quickly we can build things now – not just because of the code generation, but because of the systems analysis. You don’t need a week of technical spiking and research any more to determine feasibility. You give Claude the context of the relevant systems and 20 minutes later you have your answer. I am fully aware that I haven’t gotten used to this speed, and I’m scared of deploying so much software so quickly, and I think I can move faster. That’s my personal challenge.)

    You can move faster than anybody else with these tools.

    Remember all that time spent as a senior engineer when coding was most of the job? You have deep technical and product expertise. You can use these tools to accelerate yourself and ship fast. Suddenly, spending your time mentoring and coaching other engineers feels a bit like getting a Kentucky Derby winner to coach yearlings instead of running the race. (OK, terrible metaphor since I think racehorses flame out after a couple of years, but you get the point.) You should be blazing the trail of how to accelerate everybody, and the best way to do that is to learn it yourself and then point to the results.

    Does it mean you don’t coach and mentor? No, but it’s not such a big part of your role for now. And crucially, there are early-career engineers who aren’t hampered by the past, who will be teaching YOU how to use Claude better. Have the humility to learn from them.

    Org impact doesn’t matter. Customer impact does.

    This is going to be controversial but I propose that we’ve been considering org impact as a proxy for customer impact. We assume that if a staff engineer is having org-wide impact, that translates into significant customer impact – but that’s not a given by any means. Customer impact is the only metric that matters. We’ve gotten very rigid about our definitions of staff engineer impact – ironically, as it’s actually tough to measure. It’s a lot easier to define and measure customer impact so why the hell aren’t we doing that? Org impact then just becomes an implementation detail under the hood. Sure you’ll still want to talk about it, but you shouldn’t be having conversations about how to have more org impact, you should be having conversations about how to have better customer impact, and sometimes that’s through org impact and sometimes it’s not.

    This customer impact that you should be held accountable for – it should be big in scope – we’re not talking a few features here, we’re talking entire product suites, with uptime guarantees etc. And customer impact still means strategic thinking but your strategy has to be informed by your hands-on knowledge of the new software engineering landscape.

    What about 2027?

    I think it’s likely that staff engineering will change again in 2027. At some point the AI tooling will level off and staff engineers will recalibrate and go back to the more strategic/coaching role. I don’t know when that is, and I don’t know when the next big shift will come. But for now, you need to be at the forefront of the new technologies.

    (Side note: depending on your company, you may prefer business impact over customer impact. One of the things I love about working for Zopa is that we genuinely do consider customer impact first. And we’re hiring, by the way, so come work with me!)

    With thanks to the colleague who has been challenging my thinking on the staff engineer role!

    I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I love growing potatoes to break up difficult soil. This post is not written with AI because I like writing.

    In my last post, Context in LLM Systems of Experts, I mentioned that you probably want to store an audit trail of LLM interactions.

    However, some of that data may be very sensitive so you need to handle it in line with regulations as well as what’s best for your customer.

    It’s best practice to assume that any data sent to an LLM can be extremely sensitive. Some examples:

    • a customer gives their name and address when ordering potatoes (PII, or personally identifiable information)
    • a customer enters their credit card number (PCI data) – you probably don’t want them entering this data in this way but users do weird things
    • a customer is explaining they can’t pay their potato bill because they tripped and fell over a bucket of Brussels sprouts and aren’t able to work due to a broken wrist (medical data)

    So make sure wherever you’re storing this data has the correct safeguards – encryption, access control, the usual.

    Do not log inputs and responses to LLMs as they may contain sensitive data. (Also, LLMs can be wordy so you might end up paying for a lot of log storage you don’t really need.)

    It’s very tempting to take transcripts from LLM-based conversations and turn them into evals. If you’re going to do this, make sure you scrub any sensitive data first.

    And make sure that you have an agreement with whatever model provider you’re using that they won’t use your data (including your customers’ data) for training models. If you want to use this data for fine-tuning models and your T&Cs only say you won’t use the data for training models, you’re in a grey area. I’d make sure your customers know and, as always, scrub the sensitive data first.

    I firmly believe that software engineers have an ethical duty (as well as a legal obligation, at least in the UK) to protect our users’ data. Hopefully you agree, so make sure, as you’re building your potato (and other vegetable) LLM systems, that you’re keeping your users’ best interests at heart.

    I think what I’m trying to say is scrub the sensitive data like you’d scrub your potatoes before entering them into the village gardening competition.

    Next up: handling partial data in system of expert workflows

    I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I’m hoping to grow some purple potatoes for my village’s potato competition next year. This post is not written with AI because I like writing.

    In a previous blog post I introduced the system of experts. Now I’m going to talk about how you handle context in these systems.

    What is context? Also known as memory and conversation history, it’s essentially a record of your interactions with an LLM system which allows the system to serve meaningful responses.

    The typical conversation turn through a system of experts using a routing workflow looks like this:

    user: “I’d like to buy a bag of purple potatoes”
    TriageExpert: "ORDERS"
    OrderRequestExpert: variety: purple, number of bags: 1
    OrderResponseExpert: “Ok great, please press confirm to buy one bag of purple potatoes”

    Each time an Expert gets involved, we’re calling an LLM with a JSON payload. The key component of this payload we’re discussing is the messages array.

    For the very first conversation turn, this will look something like the following:

    TriageExpert:

    [
      {"system": "categorise the user message into ORDERS...etc using all of the message history"},
      {"user": "I'd like to buy a bag of purple potatoes"}
    ]

    RequestExpert:

    [
      {"system": "identify the correct tool call and populate the parameters based on the following messages"},
      {"user": "I'd like to buy a bag of purple potatoes"}
    ]

    ResponseExpert:

    [
      {"system": "craft a delightful response to the human using the following messages as context but only respond to the last one"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "Ask the user for confirmation that they want to buy one bag of purple potatoes"}
    ]

    As you can see, we’re passing along the same user message each time, but with a different system prompt. The response expert gets some additional context – in the RequestExpert we have extracted the “variety: purple, number of bags: 1” parameters and we are ready to process the order using TradCode (tongue-in-cheek term to mean Java or TypeScript or COBOL or your programming language of choice) so we just need to ask for confirmation.
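    In code, this pattern is just “prepend the expert’s own system prompt to the shared messages”. Here’s a minimal sketch – the prompts and expert names mirror the examples above, and the message shape uses the same shorthand as this post rather than any particular provider’s API.

```python
# Each expert shares the same user-facing messages but has its own system prompt.
SYSTEM_PROMPTS = {
    "TriageExpert": "categorise the user message into ORDERS...etc using all of the message history",
    "RequestExpert": "identify the correct tool call and populate the parameters based on the following messages",
    "ResponseExpert": "craft a delightful response to the human using the following messages as context but only respond to the last one",
}

def build_payload(expert: str, user_thread: list[dict]) -> list[dict]:
    """Prepend the expert's own system prompt to the shared message history."""
    return [{"system": SYSTEM_PROMPTS[expert]}] + user_thread

thread = [{"user": "I'd like to buy a bag of purple potatoes"}]
print(build_payload("TriageExpert", thread))
```

    The same thread goes to every expert; only the system prompt (and, for the ResponseExpert, any extra assistant context) changes.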

    However, let’s say the user changes their mind and sends a new message: “Actually, I want to buy two bags”.

    At this point, if we haven’t implemented any memory, then our system won’t have the right context.

    TriageExpert still has enough info to categorise it as an ORDERS interaction:

    [
      {"system": "categorise the user message into ORDERS...etc using all of the message history"},
      {"user": "Actually, I want to buy two bags"}
    ]

    RequestExpert, however, doesn’t have enough information – two bags of what?

    [
      {"system": "identify the correct tool call and populate the parameters based on the following messages"},
      {"user": "Actually, I want to buy two bags"}
    ]

    (Don’t worry, there are techniques for dealing with partial information which I’ll cover in a later blog post.)

    This is where we need to start introducing memory into these systems, because actually what we want that second conversation turn to look like is this:

    TriageExpert:

    [
      {"system": "categorise the user message into ORDERS...etc using all of the message history"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"}
    ]

    RequestExpert now has enough information: purple potatoes (from the first conversation turn) and two bags (from the second conversation turn).

    [
      {"system": "identify the correct tool call and populate the parameters based on the following messages"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"}
    ]

    And the ResponseExpert now gets this data:

    [
      {"system": "craft a delightful response to the human using the following messages as context but only respond to the last one"},
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"},
      {"assistant": "Ask the user for confirmation that they want to buy two bags of purple potatoes"}
    ]

    This brings us to a concept we call user thread. There isn’t really an industry standard around terminology for memory in LLM-based systems so we at Zopa created this term to describe the messages that travel back and forth between the user and an LLM-based system and which are used to give context to various LLM-based experts. It’s not guaranteed to be the same as a transcript, for reasons I’ll discuss below.

    Here is the user thread at this point in the flow:

    [
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"},
      {"assistant": "Ok can you confirm you want to buy two bags?"}
    ]

    Note that there are no system messages, only user and assistant. Each expert has its own system prompt but they all get this user thread.
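    Maintaining the user thread can be as simple as an append-only list with two kinds of entries. A minimal sketch (the class name is mine, not an industry term):

```python
# Only user-facing and assistant-facing messages go into the user thread.
# Expert system prompts and intermediate outputs are deliberately kept out.
class UserThread:
    def __init__(self) -> None:
        self.messages: list[dict] = []

    def add_user(self, text: str) -> None:
        self.messages.append({"user": text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({"assistant": text})

thread = UserThread()
thread.add_user("I'd like to buy a bag of purple potatoes")
thread.add_assistant("OK great, please confirm your order")
thread.add_user("Actually, I want to buy two bags")
print(len(thread.messages))  # 3 messages, no system prompts among them
```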

    Why is the user thread not the same as a transcript?

    Good question! Let’s imagine a very chatty customer who is ordering a lot of potatoes and goes back and forth about how many and what variety. This user thread could get quite long.

    [
      {"user": "I'd like to buy a bag of purple potatoes"},
      {"assistant": "OK great, please confirm your order"},
      {"user": "Actually, I want to buy two bags"},
      {"assistant": "Ok can you confirm you want to buy two bags?"},
      {"user": "actually maybe Idaho potatoes are better this time of year"},
      {"assistant": "ok, shall I order some Idaho potatoes?"},
      {"user": "no let's go back to my first plan but make it three bags"}
    ]

    Eventually it’s going to get so long that two things will happen:

    1. The LLM’s context window will get overloaded
    2. You’ll start having very expensive interactions

    As LLM context windows grow larger, the chances of overloading the window decrease. You’d have to really like talking about potatoes and have very few demands on your time to overload it.

    But because LLMs are usually billed on tokens, the longer your user thread, the more tokens you send each time and the more the model provider charges you.

    So it might make sense, after a certain length, to do something about this big context. There are two options here:

    1. Drop a bunch of messages from the beginning of the conversation. This is not great though because the model may need the context of those earlier messages.
    2. Summarise the messages. This uses another LLM Expert so it’s a bit trickier than naively dropping messages but the end result is that instead of 6 outdated messages, you have 1 message with a summary of the earlier parts of the conversation.
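    The two options can be sketched like this. The thresholds are arbitrary illustrations, and summarise_with_llm is a hypothetical stand-in for a call to a summariser Expert.

```python
def truncate_thread(thread: list[dict], keep_last: int = 4) -> list[dict]:
    """Option 1: naively drop the oldest messages (earlier context may be lost)."""
    return thread[-keep_last:]

def compact_thread(thread: list[dict], summarise_with_llm, max_len: int = 6) -> list[dict]:
    """Option 2: replace the old messages with a single summary message."""
    if len(thread) <= max_len:
        return thread
    old, recent = thread[:-4], thread[-4:]
    summary = summarise_with_llm(old)  # another LLM Expert call
    return [{"assistant": f"Summary of earlier conversation: {summary}"}] + recent
```

    With the chatty potato customer above, option 2 turns six-plus outdated messages into one summary line plus the recent turns, so every expert still gets the gist at a fraction of the token cost.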

    So this is why the user thread and the transcript aren’t necessarily the same thing. The transcript is the record of what actually happened and the user thread is a useful approximation of the transcript that we can give an LLM Expert in a cost-effective manner.

    The other key thing to remember is that all the “gubbins” of the system, the system prompts for each expert as well as their results, aren’t being stored in the user thread. But you’ll probably want to store them somewhere for audit and debugging purposes – just make sure they never get into the user thread or this unnecessary data will a) confuse the experts and b) incur more per-token costs.

    And yes, potatoes really can be purple.

    Coming up next: handling sensitive data in LLM-based systems

    I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I recently failed to win any prizes in my village’s potato-growing competition. This post is not written with AI because I like writing.

    Agentic AI is all the rage these days. An AI that can control its own destiny? How cool is that! (Or, depending on your view of AI, how utterly terrifying.)

    But if you’re trying to build a system to interact with humans, what you (probably) really want is a system of experts. What’s the difference?

    According to Wikipedia, Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks without human intervention.

    A system of experts is a chain of foundational models (aka LLMs) which has a human on the end of each conversational turn. A very classic example for this is an AI chatbot.

    But agentic AI sounds way cooler, so why can’t I have that?

    Ok, let’s dig into an example. Let’s say you’re a potato farm and you want to use AI to handle your customer interactions. Everything from ordering potatoes to complaining that your Charlottes were too small. How does this work with a system of experts?

    Your customer contacts you, via voice or chat or email or whatever, and says “I want to order another bag of Maris Piper potatoes.” This goes to a TriageExpert, which has a prompt something like:

    If the customer is talking about orders, return ORDERS
    If the customer is making a complaint, return COMPLAINT
    If the customer is talking about something else, return CANNOT_HELP

    If you think this sounds like an old-style IVR system, you’re not wrong. If you think you could do way better with agentic AI…hold your horses.

    In our case, the customer is talking about orders so we use structured output to return ORDERS.

    We then use TradCode (a snarky term I invented to mean the bog-standard, deterministic Java or Kotlin or Python or whatever code you’ve been writing since before LLMs arrived) to make a very fancy algorithm:

    if ORDERS:
      -> OrdersRequestExpert
    if COMPLAINT:
      -> ComplaintRequestExpert
    if CANNOT_HELP:
      -> CannotHelpResponseExpert
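    In actual TradCode that “very fancy algorithm” is a dictionary dispatch. The handle_* functions below are hypothetical stand-ins for invoking each downstream expert.

```python
# Deterministic routing on the TriageExpert's structured output.
def handle_orders(msg: str) -> str: return f"OrdersRequestExpert({msg})"
def handle_complaint(msg: str) -> str: return f"ComplaintRequestExpert({msg})"
def handle_cannot_help(msg: str) -> str: return f"CannotHelpResponseExpert({msg})"

ROUTES = {
    "ORDERS": handle_orders,
    "COMPLAINT": handle_complaint,
    "CANNOT_HELP": handle_cannot_help,
}

def route(triage_result: str, message: str) -> str:
    # Anything unexpected falls back to CANNOT_HELP rather than crashing.
    return ROUTES.get(triage_result, handle_cannot_help)(message)
```

    Note how testable this is: the routing itself involves no LLM at all.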

    C’mon, use an agent already I hear you say, possibly cheered on by a blog post you read.

    Bear with me, though.

    Next step is to let this OrdersRequestExpert do something. You’ve probably given it a tool call named CREATE_ORDERS and the LLM will return that tool call with params like this:

    variety: Maris Piper
    number of bags: 1
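    For concreteness, here is one way the CREATE_ORDERS tool might be declared, using the JSON-schema style that most providers’ function-calling APIs accept. The exact wrapper keys vary by provider, so treat this shape as illustrative.

```python
# A sketch of a tool declaration for the OrdersRequestExpert.
CREATE_ORDERS = {
    "name": "CREATE_ORDERS",
    "description": "Place a potato order for the customer",
    "parameters": {
        "type": "object",
        "properties": {
            "variety": {"type": "string", "description": "Potato variety, e.g. Maris Piper"},
            "number_of_bags": {"type": "integer", "minimum": 1},
        },
        "required": ["variety", "number_of_bags"],
    },
}
```

    Declaring number_of_bags as an integer with a minimum already rules out some of the nonsense outputs we’ll see shortly.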

    At this point, you’d still like that human in the loop and HEY IT TURNS OUT YOU HAVE A HUMAN! So you tell your customer, via the OrderResponseExpert, “I’m going to order a bag of Maris Piper potatoes. Press confirm or cancel.” The user is jonesin’ for potatoes, so they press confirm.

    Then TradCode reappears, and you call your /orderpotato endpoint with the payload the OrderRequestExpert has identified for you.

    This is a very basic workflow, where the foundational model is responsible for translating the human speak into computer speak. You then use all your standard APIs and code to actually fulfil the customer’s wishes.

    Please can we use agentic AI now? OK, let’s see how that would look.

    Your customer contacts you, via voice or chat or email or whatever, and says “I want to order another bag of Maris Piper potatoes.” Let’s say you have a RequestAgent, which can deal with any incoming request. It’s got a huge bunch of tool calls to choose from but it’s agentic AI so it’s really really good at choosing the right tool call and never makes a mistake.

    [That was sarcasm. -Editor]

    It calls the order potatoes tool call 100% of the time [more sarcasm], but it needs to question its own decisions so it now calls a JudgeLLM. This judge is going to look at what’s happened so far and decide whether the output of the OrderRequestExpert is good enough to process. So here’s a few examples of what the OrderRequestExpert might come up with assuming it’s selected the right tool call (which we know it reliably does) [even more sarcasm]:

    variety: Maris Pooper
    number of bags: 1
    
    variety: Maris Piper
    number of bags: 100
    
    variety: Maris Piper
    number of bags: one
    
    variety: Marish Piper
    number of bags: 1
    
    variety: Maris Piper
    number of bags: a

    As you can see, there’s quite a few varieties of answer there. But the JudgeLLM has a simple task: when it receives a request, decide whether that request should be processed as-is or sent back to try again.

    variety: Maris Pooper
    number of bags: 1
    
    JudgeAnswer: yes...
    OR no

    So the key difference here is that the JudgeLLM is doing what the human in the loop did in the first example: deciding if an output is correct.

    So let’s say the JudgeLLM has finally decided that something should be processed. Let’s say it’s decided to process this output:

    variety: Maris Piper
    number of bags: 100

    We’re now ordering 100x the number of bags of Maris Piper that our customer wanted. An easy mistake for an LLM to make, especially if there was voice transcription involved. Maybe we should present them with a confirmation page…

    So we’re back to square 1 with agentic AI, where when we’re taking actions that could cost the customer money, we want the customer to confirm the action is right.
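    Worth noting: most of the malformed outputs above don’t need a JudgeLLM at all. A deterministic check in plain TradCode rejects them before anything reaches a judge or a customer, and the plausible-but-wrong values (100 bags) still go to the human confirmation step. KNOWN_VARIETIES is an illustrative list.

```python
# Deterministic validation of the expert's tool-call parameters.
KNOWN_VARIETIES = {"Maris Piper", "Charlotte", "Idaho"}

def validate_order(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the order can proceed."""
    errors = []
    if params.get("variety") not in KNOWN_VARIETIES:
        errors.append(f"unknown variety: {params.get('variety')!r}")
    bags = params.get("number_of_bags")
    if not isinstance(bags, int) or bags < 1:
        errors.append(f"invalid number of bags: {bags!r}")
    return errors

print(validate_order({"variety": "Maris Pooper", "number_of_bags": 1}))   # rejected
print(validate_order({"variety": "Maris Piper", "number_of_bags": "one"}))  # rejected
print(validate_order({"variety": "Maris Piper", "number_of_bags": 100}))  # passes; human confirms
```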

    Now in both these scenarios, the OrderRequestExpert or RequestAgent might get it wrong. But there’s a few failure scenarios here that companies should care about:

    OrderRequestExpert/Agent gives a wrong answer that the JudgeLLM says is correct but which the human says is wrong. Now all we’ve done is paid extra processing costs to [insert model provider of your choice] for the judge and still made the human do the work for us.

    OrderRequestExpert/Agent gives a right answer that JudgeLLM says is wrong. This is even worse – we’ve paid more money for the JudgeLLM, plus more money for the next round of LLM interactions, and the human is waiting longer for their correct answer.

    OrderRequestExpert/Agent gives a wrong answer that JudgeLLM (correctly) says is wrong and we end up in a looping request that costs extra money. The longer this loop goes on, the more expensive it is and the more frustrated the customer is.

    Is it possible that agentic AI is hype driven by companies that charge per token?

    Aside from the potentially expensive failure scenarios, there’s another problem with agentic AI: testing.

    In this scenario, there is a defined number of outcomes (order potatoes, complain about potatoes, leave a review about potatoes, change order address, cancel order, blah blah). It may be a long list but it’s not infinite. And you probably care about everything on that list working – if an address update fails, that’s bad, because where do you send your potatoes?

    With the system of experts, you can “unit test” the experts. Given any input, the TriageExpert should only return a specific ENUM (ORDERS, COMPLAINTS, CANNOT_HELP). So you can test just that TriageExpert and tune it to give the correct results. Same with the OrderRequestExpert, although in this case you expect structured JSON instead of an enum. Only the OrderResponseExpert is tricky to test, because you need some NLP (this may be a case for a JudgeLLM). Or you could decide you don’t care that much about delighting your users with AI-generated responses and just return from a list of hardcoded responses.
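    A sketch of what “unit testing” the TriageExpert can look like: wrap the LLM call behind a function and assert on the enum it returns. Here call_llm is a hypothetical wrapper around your provider’s chat API; in a real test suite you’d run this against a set of recorded cases (with a real model or recorded responses) and track the pass rate.

```python
# The TriageExpert must only ever produce one of these values.
VALID = {"ORDERS", "COMPLAINT", "CANNOT_HELP"}

def triage(message: str, call_llm) -> str:
    """Categorise a customer message, coercing anything unexpected to CANNOT_HELP."""
    result = call_llm(
        system="categorise the user message into ORDERS, COMPLAINT or CANNOT_HELP",
        user=message,
    ).strip().upper()
    return result if result in VALID else "CANNOT_HELP"

# In tests, substitute a stub for the model and assert on the enum:
fake_llm = lambda system, user: "ORDERS"
assert triage("I'd like a bag of Maris Pipers", fake_llm) == "ORDERS"
```

    Because the contract is an enum, each expert can be tuned and regression-tested in isolation – exactly what the coupled agent-plus-judge system can’t do.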

    Testing the agentic AI? Harder and more expensive because you need to test not just the inputs and outputs, but you also need to test how many tries it takes. So if your agentic AI takes 100 (internal) turns to come up with a correct answer, at which point the user would be bored, has your test passed or failed? And is it a Pyrrhic victory, because you’ve spent so much of your LLM budget just on tests?

    This system is also not decomposable: the JudgeLLM has been tightly coupled to the RequestAgent – you can only meaningfully test them as a system, not as individual components. With latencies around 800ms [that’s a decent average I’ve seen using gpt-4.1 deployed in Azure in a geographically nearish region to me], each test will take 1.6 seconds AT LEAST to run, and potentially rack up a LOT of token costs. And you’ll probably need hundreds if not thousands of test cases.

    So: you probably don’t need agentic AI.

    Coming up next: Context in LLM Systems of Experts