I’m Paula Muldoon and I’m a staff software engineer at Zopa Bank in London wrangling LLMs into shape for our customers. I’m also a classically trained violinist and I recently failed to win any prizes in my village’s potato-growing competition. This post is not written with AI because I like writing.
Agentic AI is all the rage these days. An AI that can control its own destiny? How cool is that! (Or, depending on your view of AI, how utterly terrifying.)
But if you’re trying to build a system to interact with humans, what you (probably) really want is a system of experts. What’s the difference?
According to Wikipedia, agentic AI is “a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks without human intervention.”
A system of experts is a chain of foundational models (aka LLMs) with a human on the end of each conversational turn. The classic example is an AI chatbot.
But agentic AI sounds way cooler, so why can’t I have that?
Ok, let’s dig into an example. Let’s say you run a potato farm and you want to use AI to handle your customer interactions: everything from ordering potatoes to complaining that your Charlottes were too small. How does this work with a system of experts?
Your customer contacts you, via voice or chat or email or whatever, and says “I want to order another bag of Maris Piper potatoes.” This goes to a TriageExpert, which has a prompt something like:
If the customer is talking about orders, return ORDERS
If the customer is making a complaint, return COMPLAINT
If the customer is talking about something else, return CANNOT_HELP
If you think this sounds like an old-style IVR system, you’re not wrong. If you think you could do way better with agentic AI…hold your horses.
In our case, the customer is talking about orders so we use structured output to return ORDERS.
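Here’s roughly what that TriageExpert looks like in code. This is a minimal sketch using the OpenAI Python SDK purely as an example; the model name, prompt wording and TriageResult enum are placeholders of mine, and I’m cheating slightly by parsing the reply into an enum rather than using the provider’s structured-output feature.

```python
from enum import Enum
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and configured

class TriageResult(str, Enum):
    ORDERS = "ORDERS"
    COMPLAINT = "COMPLAINT"
    CANNOT_HELP = "CANNOT_HELP"

TRIAGE_PROMPT = """You are a triage assistant for a potato farm.
If the customer is talking about orders, return ORDERS.
If the customer is making a complaint, return COMPLAINT.
If the customer is talking about something else, return CANNOT_HELP.
Return exactly one of these words and nothing else."""

client = OpenAI()

def triage(customer_message: str) -> TriageResult:
    response = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model name
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": customer_message},
        ],
    )
    # Constrain the free-text reply to our enum; anything unexpected
    # falls back to CANNOT_HELP rather than crashing.
    raw = response.choices[0].message.content.strip()
    try:
        return TriageResult(raw)
    except ValueError:
        return TriageResult.CANNOT_HELP
```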
We then use TradCode (a snarky term I invented to mean the bog-standard, deterministic Java or Kotlin or Python or whatever code you’ve been writing since before LLMs arrived) to make a very fancy algorithm:
if ORDERS:
    -> OrderRequestExpert
if COMPLAINT:
    -> ComplaintRequestExpert
if CANNOT_HELP:
    -> CannotHelpResponseExpert
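Written out in actual Python, the very fancy algorithm is just a lookup table. This continues the sketch above (triage and TriageResult come from it); the handle_* functions are hypothetical stand-ins for the downstream experts.

```python
# Plain deterministic routing: no LLM involved in this step.
def handle_order(message: str) -> str: ...        # hypothetical stub
def handle_complaint(message: str) -> str: ...    # hypothetical stub
def handle_cannot_help(message: str) -> str: ...  # hypothetical stub

ROUTES = {
    TriageResult.ORDERS: handle_order,
    TriageResult.COMPLAINT: handle_complaint,
    TriageResult.CANNOT_HELP: handle_cannot_help,
}

def route(customer_message: str) -> str:
    return ROUTES[triage(customer_message)](customer_message)
```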
C’mon, use an agent already, I hear you say, possibly cheered on by a blog post you read.
Bear with me, though.
Next step is to let this OrderRequestExpert do something. You’ve probably given it a tool call named CREATE_ORDERS, and the LLM will return that tool call with params like this:
variety: Maris Piper
number of bags: 1
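In OpenAI-style function calling, the tool definition and the extracted params might look something like this. The schema, field names and messages are illustrative, and it reuses the client from the earlier sketch.

```python
import json

# Hypothetical tool definition in OpenAI-style function-calling format.
CREATE_ORDERS_TOOL = {
    "type": "function",
    "function": {
        "name": "CREATE_ORDERS",
        "description": "Create a new potato order for the customer.",
        "parameters": {
            "type": "object",
            "properties": {
                "variety": {"type": "string", "description": "Potato variety, e.g. Maris Piper"},
                "number_of_bags": {"type": "integer", "minimum": 1},
            },
            "required": ["variety", "number_of_bags"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder model name
    messages=[
        {"role": "system", "content": "Turn the customer's request into a potato order."},
        {"role": "user", "content": "I want to order another bag of Maris Piper potatoes."},
    ],
    tools=[CREATE_ORDERS_TOOL],
)

# The model returns a tool call whose arguments arrive as a JSON string.
tool_call = response.choices[0].message.tool_calls[0]
order = json.loads(tool_call.function.arguments)
# e.g. {"variety": "Maris Piper", "number_of_bags": 1}
```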
At this point, you’d still like that human in the loop and HEY IT TURNS OUT YOU HAVE A HUMAN! So you tell your customer, via the OrderResponseExpert, “I’m going to order a bag of Maris Piper potatoes. Press confirm or cancel.” The user is jonesin’ for potatoes, so they press confirm.
Then TradCode reappears, and you call your /orderpotato endpoint with the payload the OrderRequestExpert has identified for you.
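Glued together, the TradCode end of it might look something like this. confirm_with_customer and the /orderpotato URL are placeholders for however your product actually collects the confirmation and places the order.

```python
import requests  # plain HTTP: no LLM anywhere near this step

def confirm_with_customer(order: dict) -> bool:
    # Hypothetical stub: show the confirm/cancel prompt in your UI and
    # return True only if the customer pressed confirm.
    ...

def place_order(order: dict) -> None:
    if not confirm_with_customer(order):
        return  # customer cancelled; do nothing
    # Deterministic API call with the params the OrderRequestExpert extracted.
    requests.post(
        "https://potato-farm.example.com/orderpotato",  # placeholder URL
        json=order,
        timeout=10,
    )
```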
This is a very basic workflow, where the foundational model is responsible for translating the human speak into computer speak. You then use all your standard APIs and code to actually fulfil the customer’s wishes.
Please can we use agentic AI now? OK, let’s see how that would look.
Your customer contacts you, via voice or chat or email or whatever, and says “I want to order another bag of Maris Piper potatoes.” Let’s say you have a RequestAgent, which can deal with any incoming request. It’s got a huge bunch of tool calls to choose from but it’s agentic AI so it’s really really good at choosing the right tool call and never makes a mistake.
[That was sarcasm. -Editor]
It calls the order potatoes tool call 100% of the time [more sarcasm], but it needs to question its own decisions, so it now calls a JudgeLLM. This judge is going to look at what’s happened so far and decide whether the output of the RequestAgent is good enough to process. So here are a few examples of what the RequestAgent might come up with, assuming it’s selected the right tool call (which we know it reliably does) [even more sarcasm]:
variety: Maris Pooper
number of bags: 1

variety: Maris Piper
number of bags: 100

variety: Maris Piper
number of bags: one

variety: Marish Piper
number of bags: 1

variety: Maris Piper
number of bags: a
As you can see, there’s quite a few varieties of answer there. But the JudgeLLM has a simple task: when it receives a request, decide whether that request should be processed as-is or sent back to try again.
variety: Maris Pooper
number of bags: 1

JudgeAnswer: yes (process as-is) OR no (send it back to try again)
So the key difference here is that the JudgeLLM is doing what the human in the loop did in the first example: deciding if an output is correct.
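If you squint, the agentic version boils down to a loop like this. It’s a deliberately simplified sketch: request_agent and judge are placeholder functions standing in for the two model calls, and the retry cap is something you’d better remember to add.

```python
MAX_ATTEMPTS = 5  # without a cap, a stubborn judge loops (and bills) forever

def request_agent(message: str, feedback: str | None = None) -> dict:
    """Hypothetical stub: the RequestAgent picks a tool call and returns its params."""
    ...

def judge(message: str, proposed_order: dict) -> bool:
    """Hypothetical stub: the JudgeLLM answers yes (process as-is) or no (try again)."""
    ...

def agentic_order(message: str) -> dict | None:
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        proposed = request_agent(message, feedback)
        if judge(message, proposed):
            return proposed  # the judge is happy; go ahead and process it
        feedback = "The judge rejected that output. Try again."
    return None  # give up and escalate to a human
```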
So let’s say the JudgeLLM has finally decided that something should be processed, and the output it waves through is:
variety: Maris Piper
number of bags: 100
We’re now ordering 100x the number of bags of Maris Piper that our customer wanted. An easy mistake for an LLM to make, especially if there was voice transcription involved. Maybe we should present them with a confirmation page…
So we’re back to square one with agentic AI: when we’re taking actions that could cost the customer money, we want the customer to confirm the action is right.
Now in both these scenarios, the OrderRequestExpert or RequestAgent might get it wrong. But there are a few failure scenarios here that companies should care about:
OrderRequestExpert/Agent gives a wrong answer that the JudgeLLM says is correct but which the human says is wrong. Now all we’ve done is paid extra processing costs to [insert model provider of your choice] for the judge and still made the human do the work for us.
OrderRequestExpert/Agent gives a right answer that JudgeLLM says is wrong. This is even worse – we’ve paid more money for the JudgeLLM, plus more money for the next round of LLM interactions, and the human is waiting longer for their correct answer.
OrderRequestExpert/Agent gives a wrong answer that JudgeLLM (correctly) says is wrong and we end up in a looping request that costs extra money. The longer this loop goes on, the more expensive it is and the more frustrated the customer is.
Is it possible that agentic AI is hype driven by companies who charge per token cost?
Aside from the potentially expensive failure scenarios, there’s another problem with agentic AI: testing.
In this scenario, there is a defined number of outcomes (order potatoes, complain about potatoes, leave a review about potatoes, change order address, cancel order blah blah). It may be a long list but it’s not infinite. And you probably care about everything on that list working: if an address update fails, that’s bad, because where do you send your potatoes?
With the system of experts, you can “unit test” the experts. Given any input, the TriageExpert should only return a specific ENUM (ORDERS, COMPLAINT, CANNOT_HELP). So you can test just that TriageExpert and tune it to give the correct results. Same with the OrderRequestExpert, although in this case you expect structured JSON instead of an enum. Only the OrderResponseExpert is tricky to test, because you need some NLP (this may be a case for a JudgeLLM). Or you could decide you don’t care that much about delighting your users with AI-generated responses and just return from a list of hardcoded responses.
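A “unit test” for the TriageExpert can be as simple as a parametrised pytest case: feed in representative customer messages and assert on the enum. triage and TriageResult are from the earlier sketch, and the example messages are mine.

```python
import pytest

@pytest.mark.parametrize(
    "message, expected",
    [
        ("I want to order another bag of Maris Piper potatoes.", TriageResult.ORDERS),
        ("My Charlottes were far too small.", TriageResult.COMPLAINT),
        ("What's the weather like today?", TriageResult.CANNOT_HELP),
    ],
)
def test_triage_expert(message, expected):
    # Each case is one LLM call: slow by unit-test standards, but isolated and tunable.
    assert triage(message) == expected
```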
Testing the agentic AI? Harder and more expensive, because you need to test not just the inputs and outputs but also how many tries it takes. So if your agentic AI takes 100 (internal) turns to come up with a correct answer, by which point the user would be bored, has your test passed or failed? And is it a Pyrrhic victory, because you’ve spent so much of your LLM budget just on tests?
This system is also not decomposable: the JudgeLLM has been tightly coupled to the RequestAgent – you can only meaningfully test them as a system, not as individual components. With latencies around 800ms [that’s a decent average I’ve seen using gpt-4.1 deployed in Azure in a geographically nearish region to me], each test will take 1.6 seconds AT LEAST to run, and potentially rack up a LOT of token costs. And you’ll probably need hundreds if not thousands of test cases.
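As a back-of-envelope check on that, with my numbers (plug in your own):

```python
latency_per_llm_call_s = 0.8  # the rough gpt-4.1-in-Azure average mentioned above
calls_per_test = 2            # RequestAgent + JudgeLLM, generously assuming zero retries
test_cases = 1000

total_seconds = latency_per_llm_call_s * calls_per_test * test_cases
print(f"{total_seconds / 60:.0f} minutes of wall-clock time, run serially")
# ~27 minutes before a single retry happens, plus the token bill for every call
```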
So: you probably don’t need agentic AI.
Coming up next: Context in LLM Systems of Experts