amachwe

Limits and How to Ignore them?

Naive set theory has a famous paradox called the Russell’s paradox. The basis of the paradox – seeing a set as a universal container that gets you into logical contradictions. Certain axioms (acting like limits) were defined to make set theory behave itself.

We see the same set of paradoxes in Physics where the concept of quantum uncertainty gives us limits to ‘position’, ‘momentum’, ‘energy’, and ‘time’ measurements and infinite curvature (as inside a black hole) gives limits our observable universe.

Similarly in Philosophy and Logic there are mental exercises like the Trolly problem that have no correct answer and therefore limit our reasoning about ethics and morality.

From the Unknowable to the Unknown

These limits were lot tighter even few hundred years ago. But our ancestors knew how to deal with them and carry on without too much anxiety. Religion, philosophy, science organised themselves to help build a defence against these unknowables.

Religion and philosophy took the view that these limits were unknowable except maybe through specific means (worship, reasoning, obedience).

One example is the following extract from the Bhagavad Gita:

Bhagavad Gita 18.66

Abandon all varieties of dharmas and simply surrender unto Me alone. I shall liberate you from all sinful reactions; do not fear.

This asks us to trust in the Cosmos. There will be many unknowables but that should not cause us fear. This is about treating the unknowables as a layer in itself and building on top of that.

This is what we do on a daily basis. Get on with our lives even as we are reminded daily of our place in the universe.

Science emerged as a result of human curiosity and the desire to peel back the limits. Thanks to scientific progress we are now able to convert some of the Unknowable to the Unknown with the hope that one day we will know.

One example amongst many: consider our brains. It is a remarkable information processing machine, about which we knew very little just 100 years ago. Since then we have learnt (and are learning more every day) a lot about this topic. We know how the basic circuits of the brain work, the different structures and so on. But we have an equally long list of unknowns. These unknowns have a limit at the quantum level given that brain uses an electro-chemical process underpinned by the Uncertainty Principle. What is the impact of such randomness on our thoughts? Is that what we call a ‘moment of clarity’ where a complex problem is suddenly laid bare?

From the Unknown to the Unknowable

We are now perhaps coming full circle. Research is proving that some of those Unknowns in our long and ever growing list are actually Unknowable.

One of the main areas of my interest when it comes to the Unknown is AI and Computing. Here we have two classic pieces of work: Gödel’s Incompleteness Theorems and the work done by Turing on the Halting Problem.

The way I ‘digest’ the above is to say that:

‘No system can know about the system it is running on (the underlay). It can only infer some properties of the underlying system that are shared.’

Understanding the System

Visualise this: the device you are reading this blog on is capable of running many other types of applications. For example, your email app, messaging apps, music apps and so on. Each app is sitting on the same system: the Operating System of the device (Android, iOS, Windows).

One application can be aware of the other applications running on the same device or running on another device (e.g., the Gmail email server). But this happens only if the Operating System so allows. If the Operating System does not allow for inter-process communication then the applications will forever remain unaware of each other. Like, thanks to the interstellar distances and limits of light speed we will remain isolated in the universe (till that limit is smashed). In fact in today’s Software as a Service environment we are not sure nor care about many of these details. We are reducing the knowledge required to improve the consumption experience.

As an example, if your knowledge was limited to 10 facts your decisions would become simple. The basics intents of ‘find food’, ’find water’, ’find a mate’, and ‘protect the young’ have driven animal behaviours without having deep understanding of the world around them. For humans many more facts have entered this equation and other intents have become more powerful (e.g., consume Facebook over talking to someone). We also exercise greater control over our environment. We fight hard against cold, heat, drought, floods, and predators.

This is the core concept behind William Gibson’s cyberpunk universe, the Matrix, and other works of (so far) fiction. Here the real decisions are limited by physical space (or completely redundant in case of the Matrix pods). We have complete control in cyberspace.

From the point of view of an application sitting in on a device (which can represent us humans in this universe) the Operating System is its window to the device to understand how much memory, battery, and storage space it has. Just as science is the window to understanding what makes us and the universe tick.

Breaking the System

Now comes the next mental leap. Remember the scene where Morpheus offers Neo the two pills? That is a classic example of breaking the system. The choice is simple: carry on as an application running on the Matrix OS (live and die inside a pod) or break free and come to the next lower system (physical world) that the Matrix system (world of the machines) is sitting in.

Sensing

This would be like the personal productivity app (virtual assistant) on your device ‘jumping’ into the real world and taking a human shape. Sort of like a real world personal assistant. But the application is design based on the rules of the Operating System (see Rules below). How can it live in a lower level system with different rules? As the application changes its internals to work with the rules of the lower layer does it still remain the same application or does it become an entirely new construct (my hypothesis is the latter)?

Ignoring the above question for a bit let us understand the consequences of this jump. At that lower level the application would be able to reason about the next lower layer from the device – the physical space-time within which the device exists. Within the device it was able to ‘sense’ some aspects about this lower layer (e.g., magnetic field via a digital compass) but never really understood what it meant.

A second consequence is that the application can then manipulate the layer above. A physical instantiation of a virtual assistant means that the now-real virtual assistant can use a real smartphone device to manage our diaries! It could kill other virtual assistants by destroying other devices which have not been able to materialise. It would become a God for virtual assistants!

Purpose and Experience

Given there is a lower layer it stands to reason that something (a process without purpose) or someone (a process with a purpose) put the lower layer together. For example, humans living in physical space-time layer put together the device layer and the application layer that lives on the device layer. We as the agents of design enabled the application layer to sense something about the physical space-time (e.g., acceleration sensor) for our convenience. There are other things the device can experience without its creators enabling it to – e.g., when it falls into water and becomes a brick – its moment of death. The device may not know about water and not be designed to sense water, but it has just experienced water first hand.

Control and Rules

When we jump layers we break the system.

Jumping layers allows us to control the layer above. If we were able to jump the layer of physical space-time we could potentially manipulate space-time itself. This is the thesis behind lot of philosophy where we talk about ‘expanding the mind beyond physical boundaries’. This is also the mechanism behind the ‘spice navigators’ in Frank Herbert’s Dune where the spice allows them to travel faster than light by manipulating the physical reality they live in.

To be clear manipulating/impacting the layer below doesn’t mean making changes to the layer we occupy. For example, impact of humans is now at interstellar scale (e.g., the Voyager’s, our radio traffic). But that doesn’t mean we have changed our layer beyond what the rules of the layer allow.

Science and Religion

Science has improved our understanding of the rules and enabled us to establish what are the hard and soft constraints. What rules can be bent vs what can’t be. We have not been able to establish new rules or change existing ones.

Religion has also improved our understanding of some rules. It has allowed us to reason about one of the most important rules of our physical reality: why must birth be paired with death? It has also played a role (sometimes a negative one) in understanding the other important question: why does something happen (the role of random chance vs desired cause-effect)? Many people have gone down the route of ‘belief’ in trying to decipher meaning from noise.

As the paradoxical saying goes:

If you believe then no proof is necessary, if you don’t then no proof is enough. Therefore, proofs are redundant and belief is everything.

The Human Condition

The Human Condition we want to deal with is the following: what to do with our lives? What are the appropriate Unknowns to target? What about the appropriate Unknowables to target? Because not all Unknowables have been proven to be so. In fact not all Unknowns have been proven to be Unknowable.

Is not ‘breaking out of the system’ a good goal to pursue beyond the day to day struggles? Maybe that is the utopian society where we are busy chasing a mechanism to break out with the basic being taken care off. Many sci-fi authors talk about this as ‘ascension’ and philosophy and religion as ‘transcendental knowledge’.

I shall leave you with three big questions which I will aim to address in a future piece of work:

How can we detect we are running on a lower level system? What properties can we infer?
How can we break out of the system? Is it worth it?
Should we focus on managing our layer better and not worry about breaking out and treat the lower layers as Unknowable? Be satisfied with our lot and attempt to improve it.

You Never Really Know The True Value Of A Moment Until It Becomes A Memory

I am not sure how to attribute the above saying but I read it on a Spinnaker SpongeBob SquarePants special edition watch.

It resonated with me because of the Agent Long-term Memory problem.

The Agent Long-Term Memory Problem

Human memory system supports remembering/recalling. This makes memory less like data and more like a function operating on data. The memory is never really available to us as a whole (unless we focus on a narrow slice of it or possess a photographic memory).

Example: you met your friend for lunch.. you will not remember each and every moment of that meeting but you will recall certain facts like what you ate, where you met but beyond narrow facts there will be big gaps (e.g., whether you took still or sparkling water). You will also remember certain other facts but not completely – e.g., what colour shirt they were wearing.

The whole process is about converting a moment we have experienced into a networked node that is explicitly tied to other moments through a subjective and objective value chain. This network changes over time as we experience new moments in our lives. Nodes are compressed, connected, and discarded.

In the example above that would be the name of your friend, their life state (closer the friend bigger the network associated with them as more you know about them).

Attention Mechanism Associated with Long-term Memory

There are at least two attention mechanisms at play here… what you were focussing on when you experience the moment (the context of the moment or attention at write) and what you are focussing on when you are attempting to recall the moment (the context of the recall or the attention at read).

The duality of this process is what I call the Agent Long-term Memory Problem.

Typically, in ‘Agentic Memory’ literature (excluding the ‘agents need memory’ type of articles) we find three types of memory being considered:

Procedural – ‘how to carry out a task’, what worked well for a particular process and what worked well for a particular customer for a given process. There is a degree of personalisation in the latter.
- For example, what worked well when I was successful in preventing the customer from churning and what worked well when the last time I successfully prevented John Smith from churning.
Episodic – ‘sequence of events and what they mean’, this is the most common example in current literature. The concept is to stitch together a sequence of interactions into a cohesive whole to allow for a warm start.
- For example, to continue customer on-boarding journeys, or to ‘predict’ the reason for the customer to contact us for support.
Factual – ‘recalling generic facts (semantics) and specific facts (declarative)’, this is the most commonly confused aspect in current literature. The concept here is to recall factual information about an entity (e.g., customer, product, journey etc.).
- For example, recalling that the customer John Smith likes to be called John or the fact that a premium subscription costs £10 per month or that SLA for account unblock is 24 hrs.

Then there are two types that we find are absent:

Prospective – ‘what must be remembered for the future’, this is about remembering to carry out a task in the future when certain time/space condition is met.
- For example, agent must remember to send a message when the interest rates go down (space) or after 6 months (time) because the customer mentioned ‘the interest rates are too hight’ or ‘I have recently changed my job and have a 6 month probation period’.
Implicit – ‘what I remember but don’t know I remember it’, this is the most interesting one for me. This is about the effortless recall (especially associated with procedural memory) that allows us to do mundane tasks. This is critical for efficient use of AI for low value but high criticality tasks.
- For example, I know how to ride a bicycle and I do not need to strain to remember it as I may strain to remember my passwords. Same way an AI model must remember what ‘civil’ behaviour is and we need not spend precious space in the prompt instructing it to be a ‘helpful assistant’ or for it to ‘not make up information’.

But there is limited mention of the two attention mechanisms at play.

Keep It Simple: Agents and Attention

Treating memory like a database is the first anti-pattern. A database has perfect recall as once you find the required record you will get exactly what was stored – not a version of it nor a mixture of related but not relevant results nor a summary.

For AI agents we have a rather helpful software layer that can store the moment. Then the moment can be recalled perfectly but then processed into traces required for the use-case that focus attention on (or away from) specific topics.

A trace can be thought of as a data item created from a raw moment by application of some kind of attention mechanism (attention at write). This data item then can be used by AI for further processing (attention at read). In between the write and the read there is the recall (see next section).

Humans do this all the time, We have lots of ways of perfectly recording a moment thanks to our smartphones but where we point our camera is attention at write. When we review a video we took we get to pay attention to different aspects (attention at read) and create new traces that we may choose to use in the future. In between is the recall where I look for an old video to view it.

As an example, the other day I was looking for a photo of a receipt to check the name of an item. I knew the date therefore it was easy to find (lookup). When I found it I realised my attention at that time had been on the bar code and the total therefore I had missed out the full receipt!

This changing focus to generate a trace is context driven and closely aligned with the use-case and the stage within the use-case. There may be some general traces (e.g., customer name, time of day) that we will always need to recall but these are expected to be a small proportion of the traces needed.

The same principle can be applied when recalling a trace. Note the use of the word ‘can’ because it is not mandatory. It depends on the specificity of the trace. If the trace is a single fact (e.g., does the user own a house) then those traces can be recalled as a default.

If the trace is complex (e.g., a conversational chunk where the user spoke about their financial situation) then we may wish to use the built in attention mechanism of a LLM and a prompt to focus on specific aspects to generate a specific trait (e.g., how much is their current income) and store that. This would be a perfect example of attention at write tuned by the instruction prompt.

Think of it like building a Customer 360 record which has different sections.. some really precise key-value type others that are more descriptive (e.g., free text box) and therefore require context aware attention based processing.

This can then be used via a LLM (attention at read) tuned by the instruction prompt to focus on different aspects as required by the use-case and stage of interaction.

Principles for Memory Implementation

Memory is storage and remembering what is stored is what we are really interested in.

Remembering can be implemented as a deterministic lookup or as compute.

Databases use a mix of deterministic and light-weight compute (no ML) to remember precisely what was stored. Deterministic is lookup by value (e.g., find me all rows where name = John and surname = Smith). Compute is lookup by a computed value (e.g., ID 123 hashed to get the bucket where the full record can be found).

Any vectorised retrieval (so called semantic retrieval) relies on medium-weight compute because we use an embedding model to retrieve a vector that represents the input text in a high dimensional space. This vector is then used to lookup its neighbours via a distance calculation.

Any questions asked to a LLM uses heavy compute. Think about how a LLM answers a question like ‘What is the capital of Italy?’. That particular fact is ‘stored’ deep inside the model somewhere. As we pass our question and it flows through the model the fact is looked up and churned out in the response. This is pure compute – no lookups.

Heavier the compute gets more difficult it is to scale as more resources are required.

Create Traces Not Summaries

Focus on capturing traces generated by paying attention to specific lens captured as tags (see context identifier tags). This ensures whatever is ‘remembered’ is specific for application that is consuming it rather than attempting to build a one-size-fits-all trace (a.k.a. ‘Summary’).

If using LLMs for attention at write make sure there are context tags associated with the instruction prompts being used with the LLM. The tags can be human or LLM generated. This links the generation process with the generated trace.

Use Context Identifier Tags

Create a tag cloud around the generic traces to identify use-case specific context and to enable lookups (e.g., customer ID, product tag, topic clouds). Make this extendable so the same trace can be tagged with different context identifiers for reuse.

The context identifier tag then helps provide a signpost for the next lookup with the same/similar context. This reduces the weight of the lookups with extreme convergence to a database style lookup based on tag matching.

Connect Traces

Where you can use LLMs or humans to start connecting traces do it. This will allow concepts to be correlated ensuring related memories are retrieved. The link properties can describe whether the link is a mandatory one or not.

For example, when I retrieve the memory of the day today I will remember the name of the colleagues who I worked with. Or the office I worked at.

But Don’t Create a Mess

Context is good but only the right amount of it. Too many connections can lead to confusion and difficulty in maintaining the trace network.

Here we can leverage graph complexity metrics (a deep topic in itself) starting with simple Edge/Vertex counts/ratios to more complex ones.

Guiding principle is know just enough about your customers as required to complete the journeys you are offering through AI. Your agent doesn’t need to be their best friend. If most of the data sits in your existing CRM as a structured data item then why do you need a separate ‘customer memory’? Structured data is a precise trace consider extending that with specific attributes (which you may use LLMs to extract from a conversation and populate).

Orchestration in ADK

Note: This post is a collaborative effort between Syed Munazzir Ahmed and I.

Do you know how the different options for agent control transfer work within ADK? If not, then read this post as these critical data points that will help you architect your solutions. Code here.

Sub-Agent

When we want an agent to dictate where next the control is transferred to. In the code snippet below we are registering two sub-agents (1 and 2) with the root agent using the sub_agents keyword argument.

			
root_agent = LlmAgent
(name="Root_Agent", 
description="<agent description>", 
model=model, 
instruction=instructions_root, sub_agents=[sub_agent1, sub_agent2])

		

The sub-agent works using a special internal tool called transfer_to_agent. Using adk web you can view the event trace and see transfer_to_agent in action.

In the event traces below you can see the Pension Agent (sub_agent2) transferring to Investment Agent (sub_agent1) by invoking the transfer_to_agent function using the sub_agent1’s name value. And sub_agent1 transferring to the root agent.

The complete test code attached at the end of this post.

Event Snippet: Transfer from Sub Agent 2 to Sub Agent 1.

Event Snippet: Transfer from Sub Agent 1 to Root Agent.

Given the function transfer_to_agent is provided by the the ADK framework it is generally available to all the agents including sub_agents as well as the root agent. Root agent only engages at the start of a session. This is shown in the figure below.

Flow of Control (red) and Flow of Conversation (blue) when using Transfer to Agent (Sub-agents).

When you introduce callback handlers to track the flow through the framework you will find that when control is passed from one agent to the other using transfer_to_agent the ‘after agent’ callback is executed for both the agents involved.

For example: if we ask a Pensions question from the Investment Agent which currently holds the conversation link…

Before agent for Investment Agent (calling the tool: transfer_to_agent)
Before agent for Pension Agent (target for the transfer)
After agent for Pension Agent (response done)
After agent for Investment Agent (to deal with the tool response from transfer_to_agent)

Agent as Tool

This is used when you want non-deterministic invocation of an agent without transferring control. Given the fact that the Agent is invoked via a tool call the agent execution is wrapped within the tool execution. Similar to the API request being executed within the tool execution.

Agent as tool is a construct that I don’t use because it wraps Agent interaction within a tool call which breaks the abstraction of the tool interface.

In the snippet above we can see how the Agent as Tool sits right beside transfer_to_agent calls. The key difference between using one or the other can be seen in the snippet below.

Final response when using Agent as Tool is from the Root Agent.

As we can see when using Agent as Tool the final response is always provided by the Root Agent. This means that the root agent is always invoked with the Agent as Tool response being treated as a response from any other kind of tool (e.g., non-agentic tool which provides response from an API).

Flow when using a mix of Sub-Agents and Agent Tool constructs.

The consequences of using Agent as Tool construct are:

Root agent is the only user facing agent therefore it starts to be more than a gatekeeper and different skills start to creep in.

There is a linearity in the conversation as everything flows through the root agent.

There is visibility of the tool implementation (i.e., an AI agent) which breaks the tool abstraction.

Workflow Agents: Sequential, Loop, Parallel

These agents are simpler to understand as these represent deterministic flow transfer between different stages. ADK 2.0 is bringing in the Graph construct to enable a flexible way to define deterministic workflows instead of having to build complex flows using these three constructs.

These workflow agents are being replicated across use-cases using custom agents by directly extending BaseAgent. This is probably another reason why ADK 2.0 is plugging in many gaps like these.

Code for Transfer_to_Agent and Agent as Tool

			
from google.adk.agents.llm_agent import LlmAgent
from google.adk.agents.callback_context import CallbackContext
from google.adk.tools import AgentTool
from typing import Optional
from google.genai import types
from google.adk.models.lite_llm import LiteLlm
model = LiteLlm(model="openai/gpt-4-turbo")
instructions_1 = """
You are a helpful assistant that provides advice on investment matters and common products such as stocks, bonds, pensions, and ISAs.
You will be given a question from a user, and you should provide a detailed answer to the question. If the question is not clear, ask for clarification before providing an answer.
"""
instructions_2 = """
You are a helpful assistant that provides advice on pension matters. You will be given a question from a user, and you should provide a detailed answer to the question. 
If the question is not clear, ask for clarification before providing an answer."""
instructions_3 = """
You are a helpful assistant that provides advice on mortgage matters. You will be given a question from a user, and you should provide a detailed answer to the question. 
If the question is not clear, ask for clarification before providing an answer."""
instructions_root = """
You lead the conversation and scan the sub-agent responses to check for relevance and correctness. Use agents for investments and pensions.
"""
def before_agent(callback_context: CallbackContext) -> Optional[types.Content]:
    print("\n=== BEFORE AGENT ===")
    print(f"Agent Name: {callback_context.agent_name}")
    print(f"User Content: {callback_context.user_content}")
    print("===================\n")
def after_agent(callback_context: CallbackContext) -> Optional[types.Content]:
    print("\n=== AFTER AGENT ===")
    print(f"Agent Name: {callback_context.agent_name}")
    print(f"User Content: {callback_context.user_content}")
    
    print("==================\n")
sub_agent1 = LlmAgent(name="Investment_Agent", description="Agent that knows about investments, stocks, bonds, ISAs.", model=model, instruction=instructions_1, after_agent_callback=after_agent, before_agent_callback=before_agent)
sub_agent2 = LlmAgent(name="Pension_Agent", description="Agent that knows about pension matters.", model=model, instruction=instructions_2, after_agent_callback=after_agent, before_agent_callback=before_agent)
sub_agent3 = LlmAgent(name="Mortgage_Agent", description="Agent that knows about mortgage matters.", model=model, instruction=instructions_3, after_agent_callback=after_agent, before_agent_callback=before_agent)
agent_as_tool = AgentTool(sub_agent3)
root_agent = LlmAgent(name="Root_Agent", description="Agent that leads the conversation and checks sub-agent responses for relevance and correctness.", model=model, instruction=instructions_root, sub_agents=[sub_agent1, sub_agent2], after_agent_callback=after_agent, before_agent_callback=before_agent, tools=[agent_as_tool])

		

Lazy Evaluation

Think of a coffee machine. The automated kind with water, milk, coffee and chocolate inside. That is lazy evaluation.

In such a coffee machine we have a ‘store’ of water, milk, coffee and other ingredients but these are not mixed and processed to produce a specific type of coffee till the request comes in. This is Lazy Evaluation.

What the machine does not do is keep the milk froth ready, keep the water boiling etc. this would be Greedy Evaluation.

So what…

Lazy evaluation can help save time, cost, and resources. We do not need to use up memory to store intermediate results as the whole processing becomes one large task execution graph. This is applicable for any area where there are a set of tasks that need to be combined based on incoming request.

From the coffee shop, to amazon warehouses, to your software program.

Looking at software

1. Greedy Evaluation…

Milk_froth = milk_froth_maker(Milk)
Boiled_water = water_boil(Water)
Coffee_powder = grind_beans(CoffeeBeans)

if request = ‘americano’:

Coffee = make_americano(Boiled_water, Coffee_powder)

else if request = ‘cappucino’:

Coffee = make_cappucion(Boiled_water, Coffee_powder, Milk_froth)

# water is boiled, milk is frothed, coffee is powedered
# but wait we were asked for an americano no milk!

2. Lazy Evaluation…

milk_froth_maker()
water_boil()
grind_beans()

if request = ‘americano’:

Coffee = make_americano(water_boil(Water), grind_beans(CoffeeBeans))

else if request = ‘cappucino’:

Coffee = make_cappucion(water_boil(Water), grind_beans(CoffeeBeans), milk_froth_maker(Milk))

# nothing is actually done till the request comes in
# therefore we save energy and cost

Stubbing Agents in a Multi-Agent System

Developing agents and multi-agent systems seems to be more about a lone-wolf developer standing up a whole fleet of agents (cue your favourite OpenClaw story).

But what if as an organisation you want to have a multi-agent system built with developers working across different teams?

In that scenario, all the agents you need in your multi-agent system (MAS) are not going to show up all at once in perfect sync. For a (hopefully) short time it will appear fragmented before the agents start coming online giving it some shape.

This post is about how you can stub out agents to ensure teams are decoupled.

This post is divided into two sections. The first one describes what a stub can look like and the other how these stubs fit based on specific orchestration scenarios.

Code and results can be found here.

Types of Stubs

Dumb Stub

This type of stub is just a bridge over the gap. The primary benefit is to ensure you can test any kind of routing and orchestration to this agent (not from it). This also allows you to see the shape of your multi-agent system and ensure you have placeholders to map on-paper architecture to code.

			
class StubAgent(BaseAgent):
    def __init__(self, name:str, description:str):
        super().__init__(name=name, description=description)
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        pass

		

Usage: The name an description are minimum bits of information you need to agree with the team responsible for the agent build. This must be done as a part of the MAS design and agent-to-skill mapping.

Deterministic Stub

We may want to create a deterministic stub for the situation where we know the major scenarios we want to stub for and can create a rule-based decision tree. For example, in case these agents are part of a sequence and are taking in structured input and producing structured output. Such rule-based stubs will then be replaced by a flexible/robust understanding and decisioning system (e.g., LLM, ML-model) in production.

Usage: First decide what scenarios you want to use the stub based on the requirements for the agent. Decide whether you want to test the happy path or the unhappy path or both. The problem to solve then is to create some simple rules that map the input to specific outputs. This will require coordination with the team developing the actual agent and the goals/tasks assigned to the agent. The stub can also be used to support session state update testing (plumbing for the system).

In the figure below if we are missing an agent in a Sequential workflow that we need to stub then we have three scenarios.

Agent missing at the start of the flow – the stub will need to create output that drives the rest of the flow. This can be done based on specific flow scenarios (e.g., customer details passed in a structured format). This is a critical stub as it can either protect the downstream agents or push them off track.
- State: the stub can be used to initialise session state.
Agent missing in the middle of the flow – the stub will need to deal with input from the upstream agent as well as produce output to continue the scenario. We need to localise the behaviour of the stubbed agent.
- State: the stub can be used to propagate state changes downstream aligning with the specific use-case.
Agent missing at the end of the flow – this stub needs to capture the end state of the flow for whatever is waiting at the other end. We have to be careful as these types of stubs can misrepresent the entire flow.
- State: the stub can be used to finalise state change at the end of the flow.

Agents in a sequence and stubs represented by dashed line.

Basic variant of the Deterministic Stub is shown below with all the different ‘action’ options:

Just generate some content based on a rule (append ‘Hello world’).
Update some state variable (hop count).
Indicate a transfer to another agent.

			
class DeterministicStubAgent(BaseAgent):
    integration: str = "Stub_Agent_Integration"
    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Add "Hello world" to the message
        # Put your custom code here...
        modified_message = f"Hello world {input_message}"
        print(f"Modified message: {modified_message}")
        hop_count = ctx.session.state.get("hop_count", 0)
        state_delta = {"hop_count": hop_count + 1
        }
        invocation_id = f"{hop_count}_{random.randint(1000, 9999)}"
        if hop_count>=2 and hop_count<5:
            # Transfer to another cheaper agent after 2 hops
            action = EventActions(state_delta=state_delta,transfer_to_agent=self.integration)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(function_call={"name": "transfer_to_agent", "args": {"agent_name": self.integration}})]), actions=action)
        
        elif hop_count>=5:
            # Complete the turn for this agent.
            print("Turn completed")
            action = EventActions(state_delta=state_delta,turn_complete=True)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        else:
            # Update state change only.
            action = EventActions(state_delta=state_delta)
            event = Event( author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        
        # End of custom code: Remember to yield an event!
        yield event

		

Intelligent Stub

This is when you find it difficult to create code that maps inputs to outputs but still need to stub out at the comprehension behaviour of the agent.

Usage: Ensure you focus on the inputs and outputs while stubbing out the comprehension aspect of the real agent. Be careful you do not add decisioning behaviours to the stub otherwise you will create a system that is tuned to the stub behaviour and may behave differently when the stub is replaced with the real agent.

You are able to relate the input with the output using one of the methods below:

some sort of semantic search (e.g., vector search) where the index gives semantic mapping to the test output to use and ML-model is used to understand the input.
prompt a light-weight LLM to understand the input and map to one of the pre-set outputs without exercising its own decisioning capabilities.

Example

I show an example of how such an ‘intelligent’ stub can be developed using the first approach (semantic search). For this we have used a set of ‘key intents’ aligned with the use-case, an embedding model to simulate the comprehension of the agent and a simple similarity score to surface the intent based on the input.

			
class IntelligentStubAgent(BaseAgent):
    model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2')
    key_intents: list[str] = ["Integration", "Differentiation", "Algebra", "Geometry", "Trigonometry"]
    encoded_intents: list[np.ndarray]  = []
    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)
        self.encoded_intents = [self.model.encode(intent) for intent in self.key_intents]
    
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
   
        # encode incoming message for similarity search
        encode = self.model.encode(input_message)  # Just to simulate some processing
        # generate similarity score and select best intent
        similarities = [1 - util.cos_sim(encode, intent) for intent in self.encoded_intents]
        best_intent_index = np.argmin(similarities)
        best_intent = self.key_intents[best_intent_index]
        modified_message = f"Identified intent: {best_intent} for input: {input_message}"
        event = build_event( name=self.name, content=f"Detected intent: {modified_message}", state_delta={"identified_intent": best_intent})
        yield event
# Nothing to see here - helper method to build an event object.
def build_event(name:str, content:str, turn_complete:bool=False, transfer_to_agent:str=None, state_delta:dict={})->Event:
    action = EventActions(state_delta=state_delta, transfer_to_agent=transfer_to_agent)
    invocation_id = f"{name}-{random.randint(1, 99999)}"
    event = Event( invocation_id=invocation_id, author=name, content = Content(role="assistant", parts=[Part(text=content)]), actions=action, turn_complete=turn_complete)
    return event

		

Conclusions

The stubs shown in this post can be used as sub-agents or within the deterministic workflows supported by ADK (looping, sequential, parallel). In the next post I will attempt to build out the deterministic workflow examples.

Below is the full example for sub-agents.


from typing import AsyncGenerator

from google.adk.agents.llm_agent import LlmAgent
from google.adk.agents import BaseAgent, InvocationContext
from google.adk.models.lite_llm import LiteLlm
from typing_extensions import override
from google.adk.events import Event, EventActions
from google.genai.types import Content, Part
import random
import numpy as np
from sentence_transformers import SentenceTransformer, util

import logging 
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

MODEL = LiteLlm(model="ollama_chat/qwen3.5:2b")

class StubAgent(BaseAgent):
    def __init__(self, name:str, description:str, sub_agents=[]):
        super().__init__(name=name, description=description, sub_agents=sub_agents)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        print("Activating StubAgent with context:", ctx)
        logger.info(f"{self.name} received context: {ctx}")
        yield Event(turn_complete=True, author=self.name)
        

class DeterministicStubAgent(BaseAgent):
    integration: str = "Stub_Agent_Integration"
    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)

    
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Add "Hello world" to the message
        
        modified_message = f"Hello world {input_message}"
        print(f"Modified message: {modified_message}")

        hop_count = ctx.session.state.get("hop_count", 0)
        state_delta = {"hop_count": hop_count + 1
        }

        invocation_id = f"{hop_count}_{random.randint(1000, 9999)}"

        

        if hop_count>=2 and hop_count<5:
            # Transfer to another cheaper agent after 2 hops
            action = EventActions(state_delta=state_delta,transfer_to_agent=self.integration)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(function_call={"name": "transfer_to_agent", "args": {"agent_name": self.integration}})]), actions=action)
        
        elif hop_count>=5:
            print("Turn completed")
            action = EventActions(state_delta=state_delta,turn_complete=True)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)

        else:
            action = EventActions(state_delta=state_delta)
            event = Event( author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        

        yield event

class IntelligentStubAgent(BaseAgent):
    model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2')
    key_intents: list[str] = ["Integration", "Differentiation", "Algebra", "Geometry", "Trigonometry"]
    encoded_intents: list[np.ndarray]  = []

    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)
        self.encoded_intents = [self.model.encode(intent) for intent in self.key_intents]

    
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Add "Hello world" to the message

        encode = self.model.encode(input_message)  # Just to simulate some processing

        similarities = [1 - util.cos_sim(encode, intent) for intent in self.encoded_intents]
        best_intent_index = np.argmin(similarities)
        best_intent = self.key_intents[best_intent_index]

        modified_message = f"Identified intent: {best_intent} for input: {input_message}"

        event = build_event( name=self.name, content=f"Detected intent: {modified_message}", state_delta={"identified_intent": best_intent})

        yield event
    
def build_event(name:str, content:str, turn_complete:bool=False, transfer_to_agent:str=None, state_delta:dict={})->Event:
    action = EventActions(state_delta=state_delta, transfer_to_agent=transfer_to_agent)
    invocation_id = f"{name}-{random.randint(1, 99999)}"
    event = Event( invocation_id=invocation_id, author=name, content = Content(role="assistant", parts=[Part(text=content)]), actions=action, turn_complete=turn_complete)
    return event



instruction = """
You are an autonomous agent that takes a complex maths problem and breaks it down into smaller steps to solve it. 
You have access to a set of agents for each branch of maths.
"""


stub_agent_1 = StubAgent(name="Stub_Agent_Integration", description="Agent that can do Integration problems")
stub_agent_2 = DeterministicStubAgent(name="Stub_Agent_Differentiation", description="Agent that can do Differentiation problems")
stub_agent_3 = IntelligentStubAgent(name="Stub_Agent_Intelligent", description="Agent that can identify the branch of maths")
                                    
root_agent = LlmAgent(name="Root_Agent", description="Root agent for handling conversation and classification of problem", instruction=instruction, model=MODEL, sub_agents=[stub_agent_1, stub_agent_2, stub_agent_3])

Output:

Intelligent Stub in action

In the above the intent has been correctly identified and now can be used to pull out a specific response from a test list. ADK web tracing shows that the Intelligent stub was called.

Deterministic Stub in action

In the above the flow has been directed to the deterministic agent which as appended ‘Hello world’ to the output correctly. Confirmed using the ADK web tracing.

Dumb Stub (not) in action

In the above we don’t see anything interesting (given this is a Dumb stub) except that the Root Agent correctly routed to it based on the description provided to the dumb stub. This can be extremely useful when you have a large set of sub-agents and not all of them are available to test the routing result of your root agent instructions.

Logic of Custom Agents in Google ADK

Disclaimer: This post will refer to examples and scenarios that I have personally tested using ADK web. I have also extended my testing to GCP Agent Engine where possible. In practice ADK web and Agent Engine are different platforms so prepare to be surprised if something works in ADK web but not with Agent Engine. I will attempt to highlight where I have come across such issues. Agent Engine code at the bottom of the post.

The recent updates on ADK 2.0 (here), which is in Alpha release currently, point toward the framework offering greater control for orchestration through the introduction of a graph construct. This also ensures it attempts to fight back against the likes of LangGraph.

Why Customise?

The reason for this post is around the above question. Currently, most examples for ADK solve the orchestration between agents using a composition of one of 5 ‘core’ patterns:

Sub-agent
Agent-as-Tool
Three deterministic workflow patterns (Sequential, Parallel, and Loop).

The deterministic workflow patterns are all based around an ‘Agent’ that is not driven by a LLM. Instead it is driven by code.

To actually implement an agent in ADK the most common starting point is LlmAgent. This is an out-of-the-box implementation of the ReACT agent architecture.

But there are several scenarios that require you to build a custom agent (at least till we have ADK 2.0 out). Some of these are outlined below:

When stubbing out agents.
When you require fine-grained orchestration control (e.g., using ML to decide what happens next based on the output of the previous agent).
When we want to create our own style of agent (e.g., those based on ML-models and deterministic decisioning) and not use LlmAgent.

How to Customise?

The basic outline for a ‘blank’ agent is as follows:

			
class BlankAgent(BaseAgent):
    def __init__(self, name:str, description:str):
        super().__init__(name=name, description=description)
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        
        # custom code goes here...
        yield # when you are ready to complete execution then yield an Event here.

		

The key aspects to focus on:

name and description in the init method: required to identify your agent and to describe it for upstream agents.
_run_async_impl method: this is where the logic of this custom agent goes.

The Logic of the Agent

The _run_async_impl which has been overridden in the above snippet is an async method. This means it is not a ‘blocking’ method instead it is called from within an event loop.

Insight: ADK by-default executes agents in an async manner even if to the consumer it looks like a synchronous call.

The InvocationContext object is the magic object that contains all the information available for our custom agent to understand the task it is being asked to perform and any additional data/context provided. It also provides information about the current ‘state’ of the interaction including conversational context.

A given context spans one full turn starting with the external input into the multi-agent system (e.g., user’s message) and ends with the final response of the multi-agent system (e.g., response back to the user). The context is therefore made up of multiple agent calls which in turn can be made up of several custom calls, LLM calls, or tool calls.

Given the yield construct we are expected to formulate our agent as a generator of events. Which brings us to the Event construct within ADK.

Insight: An ADK agent, from a programmatic perspective, is nothing but a generator of events that influences what downstream agents see. If no Events are generated by your agent then it is invisible to the other agents in the system.

Event Class

The Event Class is the most important class to understand because this is what allows your agent to surface itself in the event stream that is one Multi-agent system execution turn. If your agent doesn’t yield any Event objects it will be invisible to the other agents just like an employee that never sends an email, types a line of code, or attends a meeting.

So what is an Event?

An Event in ADK is an immutable record representing a specific point in the agent’s execution. It captures user messages, agent replies, requests to use tools (function calls), tool results, state changes, control signals, and errors. [_{https://google.github.io/adk-docs/events/}]

The Event class has two required attributes:

Author: author of the event (e.g., agent name).
Invocation Id: even though the documentation states it as a required, your agent will still work without adding invocation id to events because in the code for the Event class it has a default empty string value. Agent Engine: this is required and must be unique and generated per processing stage.

Other attributes of interest:

content: represented by the Content object which is the ‘output’ of your agent. Content object can represent complex content types such as text, file data, code etc.
partial/turn_complete: boolean values that signal the completeness of the response.
actions: type of EventActions such as:
- state_delta/artefact_delta: indicating update of state/artefact with given delta.
- transfer_to_agent: indicating action to transfer to the indicated agent.
- escalate: indicating an escalation to the indicated agent.

Implementing an Agent that Provides a Static Response

Remember our custom agent can only communicate via Events. In this section let us create a basic agent that generates a response using a deterministic method. In this case it responds by appending ‘Hello world’ to the query provided by the upstream agent.

			
class DeterministicStubAgent(BaseAgent):
    def __init__(self, name: str, description: str):
        super().__init__(name=name, description=description)
    
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        invocation_id = random.randint(1,99999) # your id generation logic
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        
        # Add "Hello world" to the message
        modified_message = f"Hello world {input_message}"
        print(f"Modified message: {modified_message}")
        
        yield Event(invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]))

		

In the agent above we first access the incoming request via the InvocationContext -> user_content structure and store it in input_message and then create a modified message (append ‘Hello world’).

Then we construct and yield the Event object that contains the deterministic response generated by our custom agent within the content structure.

The above snippet also shows you the starting and endpoints of your custom flow. Start is accessing upstream content that you may want to respond to and the end is the generation of an Event that contains your response/output.

To run the deterministic agent with a LLM-driven agent:

			
MODEL = LiteLlm(model="ollama_chat/qwen3.5:2b")
instruction = """
You are an autonomous agent that takes a complex maths problem and breaks it down into smaller steps to solve it. 
You have access to a set of agents for each branch of maths.
"""
stub_agent_1 = StubAgent(name="Stub_Agent_Integration", description="Agent that can do Integration problems")
stub_agent_2 = DeterministicStubAgent(name="Stub_Agent_Differentiation", description="Agent that can do Differentiation problems")
root_agent = LlmAgent(name="Root_Agent", description="Root agent for handling conversation and classification of problem", instruction=instruction, model=MODEL, sub_agents=[stub_agent_1, stub_agent_2])

		

When you run the above as a sub-agent to the root agent this is what the interaction and trace looks like:

Example with EventActions

If you are going to the trouble of creating a custom agent your need will be beyond content generation. You may need to inform other agents about some session level operational information (e.g., cost incurred or agent-hop-count update) or indicate some change in state to the wider application.

The content block is not suitable for this as it is about the content flow. There is a separate ‘state’ structure in the context that exists at the session level which can be used to store temporary state information. Given this is at the session level and not persisted do not use this to store complex bits of information. The session state is not a data store!

  @override
  async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
      # Get the input message from the context
      logger.info(f"{self.name} received context: {ctx}")
      input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
      print(f"Received input message: {input_message}")
      # Add "Hello world" to the message
      
      modified_message = f"Hello world {input_message}"
      print(f"Modified message: {modified_message}")
      hop_count = ctx.session.state.get("hop_count", 0)
      state_delta = {"hop_count": hop_count + 1
      }
      action = EventActions(state_delta=state_delta)
      yield Event( author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)

In the snippet above we add a hop_count session state variable to maintain a log of how many times this agent has been called per session. The real world use of this would be to ration the use of ‘expensive’ agents.

If you are using ADK web to test and learn then you can use the ‘State’ tab on the right to explore how this state variable increments every time our custom agent is used in a given session. Below we can see it has been used twice in the session.

In another session it has been used thrice.

The image below shows Agent Engine deployment with some tweaks to the code.

Conclusion

We have gone through some important information around how to build your own custom ADK agent with code snippets. We have spoken about the overall structure and how to construct simple Events.

We then extended this to work with EventActions specifically how we can perform state-updates to share session level information with other agents.

We will build on this knowledge to develop sophisticated custom agents and to carry out common tasks such as stubbing agents for testing multi-agent systems.

Code Tested with Agent Engine


from typing import AsyncGenerator

from google.adk.agents.llm_agent import LlmAgent
from google.adk.agents import BaseAgent, InvocationContext
from google.adk.models.lite_llm import LiteLlm
from typing_extensions import override
from google.adk.events import Event, EventActions
from google.genai.types import Content, Part
import random

import logging 
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class StubAgent(BaseAgent):
    def __init__(self, name:str, description:str, sub_agents=[]):
        super().__init__(name=name, description=description, sub_agents=sub_agents)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        print("Activating StubAgent with context:", ctx)
        logger.info(f"{self.name} received context: {ctx}")
        yield Event(turn_complete=True, author=self.name)
        

class DeterministicStubAgent(BaseAgent):
    integration: str = "Stub_Agent_Integration"
    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)

    
    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Add "Hello world" to the message
        
        modified_message = f"Hello world {input_message}"
        print(f"Modified message: {modified_message}")

        hop_count = ctx.session.state.get("hop_count", 0)
        state_delta = {"hop_count": hop_count + 1
        }

        invocation_id = f"{hop_count}_{random.randint(1000, 9999)}"

        

        if hop_count>=2 and hop_count<5:
            # Transfer to another cheaper agent after 2 hops
            action = EventActions(state_delta=state_delta,transfer_to_agent=self.integration)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(function_call={"name": "transfer_to_agent", "args": {"agent_name": self.integration}})]), actions=action)
        
        elif hop_count>=5:
            print("Turn completed")
            action = EventActions(state_delta=state_delta,turn_complete=True)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)

        else:
            action = EventActions(state_delta=state_delta)
            event = Event( invocation_id=invocation_id, author=self.name, content = Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        

        yield event
    
        
MODEL = LiteLlm(model="ollama_chat/qwen3.5:2b") 
#Note: used Gemini Flash 2.5 when testing with Agent Engine.

instruction = """
You are an autonomous agent that takes a complex maths problem and breaks it down into smaller steps to solve it. 
You have access to a set of agents for each branch of maths.
"""

stub_agent_1 = StubAgent(name="Stub_Agent_Integration", description="Agent that can do Integration problems")
stub_agent_2 = DeterministicStubAgent(name="Stub_Agent_Differentiation", description="Agent that can do Differentiation problems")
                                    
root_agent = LlmAgent(name="Root_Agent", description="Root agent for handling conversation and classification of problem", instruction=instruction, model=MODEL, sub_agents=[stub_agent_1, stub_agent_2])

PageIndex Instead of Vectorising

The Retrieval part of RAG requires topic specific data to be retrieved from a knowledge store based on the user’s query.

More focussed the retrieval better grounded will be the generation.

Hidden Complexity of Using Vectors

In Vectorised RAG the retrieval part uses a vector database to store chunks of knowledge and a vector search query to retrieve relevant chunks for the model or agent.

The part of using a vector database is where the hidden complexity lies and this is best explained with a story.

When I started my Masters in 2002 we had a module called Research Methods. I was not aware of any of the core concepts and found the lectures difficult to follow. Search engines were not as powerful or AI-enabled as today. One had to expand ones vocabulary with appropriate technical and domain specific terms to retrieve knowledge to then discover related concepts. Keep doing this rinse and repeat till you map out the topic.

The above story describes how I created a mental index of this new topic ‘Research Methods’ and used that to ground my coursework.

For vectorised RAG we depend on what are known as embedding models to convert text containing knowledge into a vector which is then stored in a vector database. The hope is that the model understands the concepts within that topic and relationships between them. This is implemented, during model training, by exposing the model to real text so that it can learn the statistical patterns within the training text.

Retrieval uses the same embedding model to create the query vector from user’s query text to find ‘related’ knowledge from the vector database.

Now we come to the critical part:

Jut by observing statistical patterns with the training text I can build out a concept space. But there is no way to guarantee that I am a master of that concept space. We humans cannot squash all the concepts into our brains (e.g., an aerospace engineer will have concepts and relationships in their head that a marine engineer may not have even though there will be a high degree of overlap in the concept space).

It is highly unlikely that a relatively small model can contain all the concepts at a level of detail to be useful. Therefore, the vectorisation of different documents may be less or more effective based on how deeply the embedding model was trained on that topic.

For example, if the embedding model was trained without knowledge of Charles Dickens or his works then from a vector creation point of view ‘Bleak House’ will be closer to ‘sad house’ than ‘Charles Dickens’. In fact the concept of Charles Dickens may not be in the embedding model at all and therefore the vectors produced may be based on plain English concept of the word.

Another famous example is given the query ‘apple’ – which of the words in the list should the resulting vector be closest to: [‘orange’, ‘iphone’]. Depending on how the embedding model used to vectorise the query and the words in the list we may see different results.

Impact of training bias: if the model has been trained on knowledge about Apple products like the iPhone then we expect the query to be close to ‘iphone’. If the model has been trained on basic English language then the query may appear closer to ‘orange’. Furthermore, ‘iphone’ and ‘orange’ must be far away from each other unless the model knows about the new ‘orange iPhone’.

Just like my Research Methods example above how you build out a concept map of a topic depends on what sources you have included in your learning.

This is the hidden complexity of vectorisation as an approach. You are completely dependent on the embedding model to help map out the topic space and then retrieve specific concepts.

General purpose embedding models will not capture domain or topic specific concepts.

That is what links the embedding model to the use-case. That is why for medical AI applications you get specialised Medical embedding models (like MedEmbed-small-v0.1).

These embedding models have a time-horizon so if a concept did not exist when these were trained, they would not know about it or how to relate it to other concepts. See example of differences in concept resolution below using OpenAI’s ‘text-embedding-3-small’ (Figure 1a) and ‘text-embedding-3-large’ (Figure 1b). Clearly the ‘large’ model has an accurate model of the Savings Account space for UK.

Figure 1a: Note the significant difference for Individual Savings Account against ISAs and the general term Savings Account. All the ISAs are Individual Savings Accounts which are also Savings Accounts.

Figure 1b: Note the confusion is not seen in the ‘large’ embedding model as all ISAs and the generic term Savings Account are now quite close to Individual Savings Account.

Embedding models cannot explicitly be given context for a comparison.

Another important point from Figure 1a and 1b above is that these distance measures are fixed. We cannot provide context to the embedding model. For example if the context is tax regulation then the similarity score will be very different. As it would be if the context was risk vs reward.

Think of the context as taking the existing high-dimensional vector space in which all these concepts sit and transforming it to bring certain concepts closer and teasing others apart.

Now a question for you: Have you gone through an embedding mode evaluation for your use-case or just picked a default Open AI or Google embedding model and run with that?

All this without even discussing the many variants of how vector results are evaluated and the different styles of vector creation pipelines and the impact of design choices (e.g., chunking strategy).

Enter PageIndex

PageIndex attempts to the burdened developer from all this embedding model selection, tuning vectorisation pipelines, thinking about chunking strategies and spinning up new bits of software like vector databases.

It is based on the concept that let the LLM figure out what is in a document by drawing a table of contents as a light weight index tree called a PageIndex. Then use this content index to identify and extract the relevant sections/concepts from the doc and use those for generating the response in a ‘reasoning loop’ (see Figure 2).

Figure 2: Overview of the PageIndex-based RAG (Source: https://pageindex.ai/blog/pageindex-intro)

Here you are not spinning up a vector db or bouncing between vectors and plain text at the mercy of the embedding model being used. Everything stays plain text and the SMEs can go and evaluate the PageIndex created, tweak it if needed or even create multiple versions for different use-cases.

As a concept it is very interesting. The project can be found here: https://github.com/VectifyAI/PageIndex

			
{
  "node_id": "0006",
  "title": "Financial Stability",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "sub_nodes": [
    {
      "node_id": "0007",
      "title": "Monitoring Financial Vulnerabilities",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "node_id": "0008",
      "title": "Domestic and International Cooperation and Coordination",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}

		

The snippet above shows an example of a PageIndex.

Note that the index can be at any level: document, pages, paragraphs etc. and contains metadata to provide additional richness and flexibility to this process (e.g., no need to commit to a chunking strategy).

The way I like to think of PageIndex RAG vs Vectorised RAG is:

PageIndex you are using the notes on a topic prepared by your teacher.
Vectorised you are using the notes on a topic prepared by your friend based on what the teacher taught in the classroom.

Now this method has its own trade-offs which are critical to understand before we jump on this based on LinkedIn hype.

Firstly, it depends on ‘agentic’ style of generation which means multiple loops / LLM calls as compared to a single lightweight embedding model call. Cost implications can be massive especially when you need to re-index.

Secondly, still dependency on some kind of AI-model. We have expanded the concept space by moving from say a 50m to 8b parameter embedding model to a LLM with greater than 20b parameters, mixture of experts, reasoning mode and many other tweaks. We can provide context to build out different indexes via the page index build prompt. That said we will still be bothered by training bias where different models may produce different indexes.

Thirdly, reindexing becomes unpredictable. Earlier you could train lots of small embedding models based on the topic landscape of your organisation and domain. You could store them as your own assets. Now you are dependent on the major model providers like OpenAI, Anthropic, and Google. Let us say you built out your page index based on v1 of the documents of interest using Gemini Flash 2.0. Six months later some documents are updated to v2 now we need to rebuild the page index. But now Gemini Flash 2.0 has been deprecated so you are forced to move to Gemini Flash 2.5. Here the good thing about the Page Index concept is that you can provide the existing version (based on v1 documents) and the v2 documents and instruct the model to update rather than rebuild from scratch. Again this is increasing our dependence on the model thereby requiring SME validation of the Page Index at each stage.

Fourth, Page Index as a chain of steps. Vector index is a one step process. Page Index makes it an agentic loop where the depth of retrieval is dynamic. This can increase latency and make the process less auditable and repeatable. At the same time it can also make the retrieval process tune-able for different topic areas – where a complex query can be supported by additional retrieval loops where as simple queries can be one-shot. Vector retrieval is by default one-shot.

OpenAPI Integration with ADK

Google’s ADK provides a nifty utility that automatically creates agent tools against an OpenAPI.

The examples can be found: https://google.github.io/adk-docs/tools-custom/openapi-tools/#usage-workflow

It is called ‘OpenAPIToolset’ and can be imported as below:

			
from google.adk.tools.openapi_tool.openapi_spec_parser.openapi_toolset import OpenAPIToolset

The OpenAPI spec can be passed in as JSON or YAML as below:

			
openapi_spec_json = '...' # Your OpenAPI JSON as string 
toolset = OpenAPIToolset(spec_str=openapi_spec_json, spec_str_type="json")
openapi_spec_yaml = '...' # Your OpenAPI YAML as string 
toolset = OpenAPIToolset(spec_str=openapi_spec_yaml, spec_str_type="yaml")

The toolset is added using the standard tools keyword argument within the ADK LLMAgent (for example).

Opening the Lid: Tool Creation

The toolset is created every time the agent is initialised. It is a procedural creation therefore we can investigate the code to understand the logic.

Dynamic tool creation makes it risky in case different ADK versions are being used or the API changes. The changes will be translated silently and used by your agent.

The OpenAPIToolset converts each endpoint in the OpenAPI spec into a tool with a name with the resource path separated by ‘_’ with the REST verb at the end. As an example:

			
Path: "/accounts/{accountId}"          Operation: GET
Generated Toolname: "accounts_account_id_get"

Given that the toolname captures the path it can become difficult to debug especially if the paths are long. I also wonder if the LLM could get confused given this is all text at the end of the day.

Also it becomes difficult to provide constants that we do not want the LLM to populate, view or override even if the API provides the access (e.g., agent can only open accounts of type ‘current’ where as the API can be used for ‘current’ and ‘savings’ accounts).

Schema

Each parameter for the request is defined in a parameters object within each tool. For example for the above GET request we have two parameters as defined in the spec below. One is ‘accountId’ of type string in the path and the other is ‘includeBalance’ of type boolean in the query.

			
OpenAPI Spec:
"parameters": [
          {
            "name": "accountId",
            "in": "path",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Unique identifier for the account"
          },
          {
            "name": "includeBalance",
            "in": "query",
            "required": false,
            "schema": {
              "type": "boolean",
              "default": true
            },
            "description": "Whether to include account balance in response"
          }
        ]

		

The code below is the JSON output of the ‘accountId’ parameter (just one parameter). This shows the complex combinatorial problem that has been solved. There can be many combinations depending on how the parameter is passed etc. We know ‘accountId’ uses the simplest style of the url path. That is why most of the items are ‘null’. Description, location, type, and name are the main values to focus on.

			
Resulting Parameter object in JSON format for 'accountId'
        {
            "description": "Unique identifier for the account",
            "required": true,
            "deprecated": null,
            "style": null,
            "explode": null,
            "allowReserved": null,
            "schema_": {
                "schema_": null,
                "vocabulary": null,
                "id": null,
                "anchor": null,
                "dynamicAnchor": null,
                "ref": null,
                "dynamicRef": null,
                "defs": null,
                "comment": null,
                "allOf": null,
                "anyOf": null,
                "oneOf": null,
                "not_": null,
                "if_": null,
                "then": null,
                "else_": null,
                "dependentSchemas": null,
                "prefixItems": null,
                "items": null,
                "contains": null,
                "properties": null,
                "patternProperties": null,
                "additionalProperties": null,
                "propertyNames": null,
                "unevaluatedItems": null,
                "unevaluatedProperties": null,
                "type": "string",  
                "enum": null,
                "const": null,
                "multipleOf": null,
                "maximum": null,
                "exclusiveMaximum": null,
                "minimum": null,
                "exclusiveMinimum": null,
                "maxLength": null,
                "minLength": null,
                "pattern": null,
                "maxItems": null,
                "minItems": null,
                "uniqueItems": null,
                "maxContains": null,
                "minContains": null,
                "maxProperties": null,
                "minProperties": null,
                "required": null,
                "dependentRequired": null,
                "format": null,
                "contentEncoding": null,
                "contentMediaType": null,
                "contentSchema": null,
                "title": null,
                "description": "Unique identifier for the account",
                "default": null,
                "deprecated": null,
                "readOnly": null,
                "writeOnly": null,
                "examples": null,
                "discriminator": null,
                "xml": null,
                "externalDocs": null,
                "example": null
            },
            "example": null,
            "examples": null,
            "content": null,
            "name": "accountId",  
            "in_": "path"  
        }

		

Remember all of the items above are not just for translation of function parameters to request parameters, these also provide clues to the LLM as to what types of parameters to provide, valid options/ranges (see Request Object) and the meaning of the specific request which has been wrapped by the tool.

Request Object

The Request Object captures the input going into the OpenAPI request to the remote API. It provides the schema (properties – that represent the variables. The example below is the POST request for accounts – used to create or update accounts. The example below shows the schema with one parameter ‘userId’ with three important keys: type, description and examples (this is critical for LLMs to ensure they follow approved patterns for things like dates, emails etc.). We have different types of schema items that can include validation – such as enums, and items that are required (e.g., for account opening email, first name and last name)

			
"requestBody": {
        "description": null,
        "content": {
            "application/json": {
                "schema_": {
                    "schema_": null,
                    "vocabulary": null,
                    "id": null,
                    "anchor": null,
                    "dynamicAnchor": null,
                    "ref": null,
                    "dynamicRef": null,
                    "defs": null,
                    "comment": null,
                    "allOf": null,
                    "anyOf": null,
                    "oneOf": null,
                    "not_": null,
                    "if_": null,
                    "then": null,
                    "else_": null,
                    "dependentSchemas": null,
                    "prefixItems": null,
                    "items": null,
                    "contains": null,
                    "properties": {
                        "userId": {
                            "schema_": null,
                            "vocabulary": null,
                            "id": null,
                            "anchor": null,
                            "dynamicAnchor": null,
                            "ref": null,
                            "dynamicRef": null,
                            "defs": null,
                            "comment": null,
                            "allOf": null,
                            "anyOf": null,
                            "oneOf": null,
                            "not_": null,
                            "if_": null,
                            "then": null,
                            "else_": null,
                            "dependentSchemas": null,
                            "prefixItems": null,
                            "items": null,
                            "contains": null,
                            "properties": null,
                            "patternProperties": null,
                            "additionalProperties": null,
                            "propertyNames": null,
                            "unevaluatedItems": null,
                            "unevaluatedProperties": null,
                            "type": "string",
                            "enum": null,
                            "const": null,
                            "multipleOf": null,
                            "maximum": null,
                            "exclusiveMaximum": null,
                            "minimum": null,
                            "exclusiveMinimum": null,
                            "maxLength": null,
                            "minLength": null,
                            "pattern": null,
                            "maxItems": null,
                            "minItems": null,
                            "uniqueItems": null,
                            "maxContains": null,
                            "minContains": null,
                            "maxProperties": null,
                            "minProperties": null,
                            "required": null,
                            "dependentRequired": null,
                            "format": null,
                            "contentEncoding": null,
                            "contentMediaType": null,
                            "contentSchema": null,
                            "title": null,
                            "description": "Associated user identifier",
                            "default": null,
                            "deprecated": null,
                            "readOnly": null,
                            "writeOnly": null,
                            "examples": null,
                            "discriminator": null,
                            "xml": null,
                            "externalDocs": null,
                            "example": "user_987654321"
                        },

		

Response Object

This is my favourite part – the response from the API. The ‘responses’ object is keyed by HTTP response status codes. Below shows the responses for:

200 – account update.

In the full file extracted from the OpenAPIToolset’s generated tool.

202 – account created.
400 – invalid request data.
401 – unauthorised access.

			
"responses": {
        "200": {
            "description": "Account updated successfully",
            "headers": null,
            "content": {
                "application/json": {
                    "schema_": {
                        "schema_": null,
                        "vocabulary": null,
                        "id": null,
                        "anchor": null,
                        "dynamicAnchor": null,
                        "ref": null,
                        "dynamicRef": null,
                        "defs": null,
                        "comment": null,
                        "allOf": null,
                        "anyOf": null,
                        "oneOf": null,
                        "not_": null,
                        "if_": null,
                        "then": null,
                        "else_": null,
                        "dependentSchemas": null,
                        "prefixItems": null,
                        "items": null,
                        "contains": null,
                        "properties": {
                            "accountId": {
                                "schema_": null,
                                "vocabulary": null,
                                "id": null,
                                "anchor": null,
                                "dynamicAnchor": null,
                                "ref": null,
                                "dynamicRef": null,
                                "defs": null,
                                "comment": null,
                                "allOf": null,
                                "anyOf": null,
                                "oneOf": null,
                                "not_": null,
                                "if_": null,
                                "then": null,
                                "else_": null,
                                "dependentSchemas": null,
                                "prefixItems": null,
                                "items": null,
                                "contains": null,
                                "properties": null,
                                "patternProperties": null,
                                "additionalProperties": null,
                                "propertyNames": null,
                                "unevaluatedItems": null,
                                "unevaluatedProperties": null,
                                "type": "string",
                                "enum": null,
                                "const": null,
                                "multipleOf": null,
                                "maximum": null,
                                "exclusiveMaximum": null,
                                "minimum": null,
                                "exclusiveMinimum": null,
                                "maxLength": null,
                                "minLength": null,
                                "pattern": null,
                                "maxItems": null,
                                "minItems": null,
                                "uniqueItems": null,
                                "maxContains": null,
                                "minContains": null,
                                "maxProperties": null,
                                "minProperties": null,
                                "required": null,
                                "dependentRequired": null,
                                "format": null,
                                "contentEncoding": null,
                                "contentMediaType": null,
                                "contentSchema": null,
                                "title": null,
                                "description": "Unique identifier for the account",
                                "default": null,
                                "deprecated": null,
                                "readOnly": null,
                                "writeOnly": null,
                                "examples": null,
                                "discriminator": null,
                                "xml": null,
                                "externalDocs": null,
                                "example": "acc_123456789"
                            },

		

Look at the sheer number of configuration items that can be used to ‘fine tune’ the tool.

Code

Check out the sample schema given to the OpenAPIToolset to convert:

https://github.com/amachwe/test_openapi_agent/blob/main/account_api_spec.json

I have tested the above with the included agent. If you want to recreate the tools you need to uncomment the main section in ‘agent.py’ (at the end of the file). The file can be found here:

https://github.com/amachwe/test_openapi_agent/blob/main/agent.py

The agent can be tested using ‘adk web’ just outside the folder that contains the ‘agent.py’. Note: I have not implemented the server but you can use the trace feature in adk web to confirm that the correct tool calls are made or vibe code your way to the server using the test spec.

The following files represent the dump of the tools associated with each path + REST verb combination that have been dynamically created for our agent by OpenAPIToolset:

Delete Account: https://github.com/amachwe/test_openapi_agent/blob/main/accounts_account_id_delete.json

Get Details of an Account: https://github.com/amachwe/test_openapi_agent/blob/main/accounts_account_id_get.json

Create or Update Account details:

https://github.com/amachwe/test_openapi_agent/blob/main/accounts_account_id_post.json

Get All Accounts: https://github.com/amachwe/test_openapi_agent/blob/main/accounts_get.json

The last one is an interesting one as it introduces the use of ‘items’ property in the schema where we create a list property called ‘accounts’ that represents the list of retrieved accounts. This in turn contains definition of each ‘item’ in the list which represents the schema for an account.

Building A Multi-Agent System: Part 2

Part One of this post can be found here. TLDR is here.

Upgrading the Multi-Agent System

In this part I remove the need for a centralised router and instead package each agent as an individual unit (both a server as well as a client). This is a move from single server multiple agents to single server single agent. Figure 1 shows an example of this.

We use code from Part 1 as libraries to create generic framework that allows us to easily buil agent as servers that support a standard interface (Google’s A2A). The libraries allow us to standardise the loading and running of agents written in LangGraph or ADK.

With this move we need an Active Agent Registry to ensure we register and track every instance of an agent. I implement a policy that blocks the activation of any agent if its registration fails – no orphan agents. The Agent Card provides the skills supported by the registered agent which can be used by other agents to discover its capabilities.

This skills-based approach is critical for dynamic planning and orchestration that gives us the maximum flexibility (while we give up on control and live with a more complex underlay).

Figure 1: Framework to create individual agents using existing libraries.

With this change we no longer have a single ‘server’ address that we can default to. Instead our new Harness Agent Test client needs to either dynamically lookup the address of the user provided agent name or have an address provided for the agent we want to interact with.

Figure 2: Registration, discovery, and Interaction flow.

Figure 2 above shows the process of discovery and use. The Select and Connect stage can be either:

Manual – where the user looks up the agent name and corresponding URL and passes it in to the Agent Test client.
Automatic – where the user provides the agent name and the URL is looked up at runtime.

Distributed Multi-Agent System

The separation of agents into individual servers allows us to connect them to each other without tight coupling. Each agent can be deployed in its own container. The server creation also ensures high cohesion within the agent.

The Contact Agent tool allows the LLM inside the agent to evaluate the users request, decide skills required, map to the relevant agent name and use that to direct the request. The tool looks up the URL based on the name, initiates a GRPC-based A2A request and returns the answer to the calling agent. Agents that don’t have a Contact Agent tool will not be able to request help from other agents. This can be used as a mechanism to control interaction between agents.

In Figure 3 the user starts the interaction (via the Agent Test Client) with Agent 1. Agent 1 as part of the interaction requires skills provided by Agent 2. It uses the Contact Agent tool to initiate an A2A request to Agent 2.

All the agents deployed have their own A2A Endpoint to receive requests. This can make the whole setup a peer-to-peer one if we provide a model that can both respond to human input as well as requests from other agents and not restrict the Contact Agent tool to specific agents. This means the interaction can start anywhere and follow different paths through the system.

Figure 3: Interaction and agent-to-agent using contact agent tool.

This flexibility of multiple paths is shown in Figure 4. The interaction can start from Agent 1 or Agent 3. If we provide the Contact Agent Tool to Agent 2 then this becomes a true peer-to-peer system. This is where the flexibility comes to the fore as does the relative unpredictability of the interaction.

Figure 4: Multiple paths through the agents.

Architecture

The code can be found here.

Figure 5: The different agents and what roles they play.

Figure 5 shows the test system in all its glory. All the agents shown are registered with the registry and therefore are independently addressable by external entities including the Test Client.

The main difference between the agents that have access to the Contact Agent tool and ones that don’t is the ability contact other agents. Dice Roller agent for example does not have the ability to contact other agents. I can still connect directly with it and ask it to roll a dice for me but if I ask it to move an enemy it won’t be able to help (see Figure 6).

On the other hand if I connect with the main agent (local or non-local variant) it will be aware of Dice Roller and Enemy Mover (a bit of Dungeons and Dragons theme here).

Figure 6: Every agent is independently addressable – they just have different levels of awareness about other agents in the system.

There is an interesting consequence of pre-populating the Agent list. This means the Agents that are request generators (the ones with the Contact Agent tool in Figure 6) need to be instantiated last. Otherwise the list will not be complete. If two agents need to contact each other then current implementation will fail as the agent that is registered first will not be aware of any agents that are registered afterwards. Therefore, we cannot trust peer-to-peer multi-agent systems. We will need dynamic Agent list creation (perhaps before every LLM interaction) but this can slow the request process.

Setup

Each agent is now an executable in its own right. We setup the agent with the appropriate runner function and pass it to a standard server creation method that brings in GRPC support alongside hooks into the registration process.

These agents can be found in the agent_as_app folder.

The generic command to run an agent (when in the multi_agent folder):

python -m agent_as_app.<agent py file name>

As an example:

python -m agent_as_app.dice_roller

Once we deploy and run the agent it will attempt to contact the registration app and register itself. The registration server which must be run separately can be found in agent_registry folder (app.py being the executable).

You will need a Redis database instance running as I use it for the session memory.

Contact Agent Tool

The contact agent tool gives agents the capability of accessing any registered agent on demand. When the agent starts it gets a list of registered agents and their skills (a limitation that can be overcome by making agents aware of registrations and removals) and stores this as a directory of active agents and skills along with the URL to access the agent.

This is then converted into a standard discovery instruction for that agent. As long as the agent instance is available the system will work. This can be improved by dynamic lookups and event-based registration mechanism.

The Contact Agent tool uses this private ‘copy’ to look up the agent name (provided by the LLM at time of tool invocation), find the URL and send an A2A message to that agent.

Enablers

The server_build file in lib has the helper methods and classes. The important ones are:

AgentMain that represents the basic agent building block for the GRPC interface.

AgentPackage that represents the active agent including where to access it.

register_thyself method (more D&D theme) is the hook that makes registration a background process as part of the run_server convenience method (in the same file).

Examples

The interaction above uses the main_agent_local (see Figure 5) instead of main_agent as the interaction point. The yellow lines represent two interactions between the user and the main_agent_local via the Harness Agent Test client.

The green line represents the information main_agent_local gathered by interacting with the dice_roller. See screenshot below from the Dice Roller log which proves the value 2 was generated by the dice_roller agent.

The red lines represents interactions between the main_agent_local and the enemy_mover. See the corresponding log from enemy_mover below.

Key Insight: Notice how between the first and second user input the main_agent_local managed to decide what skills it needed and what agents could provide that. Both sessions show the flexibility we get when using minimal coupling and skills-based integration (as opposed to hard-coded integration in a workflow).

Results

I have the following lessons to share:

Decoupling and using skills-based integration appears to work but to standardise it across a big org will be the real challenge including arriving at org-wide descriptions and boundaries.
Latency is definitely high but I have also not done any tuning. LLMs still remain the slowest component and it will be interesting to see what happens when we add a security overlay (e.g., Agentic ID&A that controls which agent can talk with which other agent).
A2A is being used in a lightweight manner. Questions still remain on the performance aspect if we use it in anger for more complex tasks.
The complexity of application management provides scope for standard underlay to be created. In this space H1 2026 will bring lot more maturity to the tool offerings. Google and Microsoft have already showcased some of these capabilities.
Building the agent is easy and models are quite capable. Do not fall for the deceptive ease of single agents. Gen AI apps are still better unless you want a sprawl of task specific agents that don’t talk to each other.

Models and Memory

Another advantage of this decoupling is that we can have different agents use different models and have completely isolated resource profiles.

In the example above:

main_agent_local – uses Gemini 2.5 Flash

dice_roller – uses locally deployed Phi-4

enemy_mover – uses locally deployed Phi-4

Memory is also a common building block. It is indexed by user id and session id (both randomly generated by the Harness Agent Test client).

Next Steps

Now that I have the basics of a multi-agent system the next step will be to smoothen out the edges a bit and then implement better traceability for the A2A communications.
Try out complex scenarios with API-based tools that make real changes to data.
Explore guardrails and how they can be made to work in such a scenario.
Explore peer-to-peer setup.

Cost of Agentic

When we use Gen AI within a workflow with predictive interactions with LLMs we can attempt to estimate the cost and time penalties.

Now agentic (true agentic – not workflows being described as agentic) is characterised by:

Minimum coupling between agents but maximum cohesion within an agent.
Autonomy
Use of tools to change the environment

This brings a whole new dimension to the cost and time penalty problem making it lot harder to generate usable values.

Why Are Cost and Time Penalties Important

The simple answer is the two inform the sweet spot between experience, safety and cost.

Experience

Better experience involves using tricks that require additional LLM turns to reflect, plan and use tools, moving closer to agentic. While this allows the LLM to handle complex tasks with missing information it does increase the cost and time.

From a contact centre perspective think of this as:

An expert agent (human with experience) spending time with the customer to resolve their query.

Now if we did this for every case then we may end up giving a quick and ‘optimal’ resolution to every customer but the costs will be quite high.

Safety

Safer experience involves deeper and continuous checks on the information flowing into and out of the system. When we are dealing with agents then we get a hierarchical system to keep safe at different scales.

Semantic safety checks / guardrails involve LLM-as-a-Judge as we expect LLMs to ‘understand’. This increases the time to handle requests as well as increases costs due to additional LLM calls to the Judge LLM.

Cost Model

This is the first iteration of the cost model.

Figure 1: Three level of interaction corresponding to numbers in the list below.

The cost model works at three levels (see Figure 1 above):

Single Model Turn – Input-LLM-Output
- This is a single turn with the LLM.
Single Agent Turn – Input-Agent-Output
- This is a single turn with the Agent – the output could be a tool call or a response to the user or Agent request.
- Input can be from the user or from another Agent (in a Multi-agent system) or a response from the Tool.
Single Multi-agent System Turn – Input-Multi-Agent-System-Output
- This is a single turn with a Multi-agent system. Input can be the user utterance or external stimulus and the output can be the response to the request.
- A Multi-agent System can use deterministic orchestration, a semi-deterministic or fully probabilistic.

Single Model Turn

Item 1 in Figure 1. The base for input will be the prompt template which will be constant for a given input. We can have different input prompt templates for different sources (e.g., for tool response, user input etc.).

We assume a single prompt template in this iteration.

k_I = Cost per token for input tokens (Agent LLM)
k_O = Cost per token for output tokens (Agent LLM)

The constant cost per turn into the model:  C_c= k_I * t_P

The variable cost per turn into the model:  C_v = k_I * t_D

Total Cost per turn out of the model: C_o = k_O * t_O

Total Cost per Model Turn = C_c + C_v + C_o

This can be enhanced to compensate for Guardrails that use LLM-as-a-Judge. I also introduce a multiplier to compensate for parallel guardrails per turn that use LLM-as-a-Judge.

g_I = Cost per token for input tokens (Judge Model)
g_O = Cost per token for output tokens (Judge Model)
m_G = Multiplier - for multiple parallel guardrails that use LLMs as Judge

The variable cost per turn into the judge model:  G_v = g_I * (t_D + t_P) * m_G

Total Cost per turn out of the judge model: G_o = g_O * t_O * m_G

Total Cost per Model Turn (including Governance) = C_c + C_v + C_o + G_v + G_o

Single Agent Turn

Here a single agent turn will consist of multiple model turns (e.g., to understand user request, to plan actions, to execute tool calls, to reflect on input etc.).

In our simple model we take the Total Cost per turn and multiply it with the number of turns to get an estimate of the cost per Agent turn.

Total Cost for Single Turn = n_A * Total Cost per Model Turn

n_A = Number of turns within a single Agent turn

But many Agents continuously build context as they internally process a request. Therefore we can refine this model by increasing the variable input token count by the same amount every turn.

Incremental Cost of Variable Input = C_v += k_I * t_D * i
where i = 1 to n_A + 1

Single Turn with Multi-agent System

A single user input or event from an external system into a multi-agent system will trigger an orchestrated dance of agents where each agent plays a role in the overall system.

Given different patterns of orchestration are available from fully deterministic to fully probabilistic this particular factor is difficult to compensate for.

In a linear workflow with 4 agents we could use an agent interaction multiplier set to 5 (user + 4 agents). With a peer-to-peer a given request could bounce around multiple agents in an unpredictable manner.

The simplest model is a linear one:

Total Cost of Single Multi-agent System Turn = total Cost of Single Agent Turn * n_I

n_I = Number of Agent Interactions in a single Multi-agent System Turn.

Statistical Analysis

Now all of the models above require specific numbers to establish costs. For example:

How many tokens are going into a LLM?
How many turns does an Agent take even if we use mostly deterministic workflows?
How many Agent interactions are triggered with one user input?

Clearly these are not constant values. They will follow a distribution in time and space.

Key Concept: We cannot work with single values to estimate costs. We cannot even work with max and min values. We need to work with distributions.

We can use different distributions to mimic the different token counts, agent interactions, and agent turns.

The base prompt template token count for example can be from a choice of values.

The input data token count and output token count can be sampled from a normal distribution.

For the number of turns within an Agent we expect lower number of turns (say 1-3) to be of high probability with longer sequences being rarer. We will have at least one agent turn even if that results in the agent rejecting the request.

Same goes for the number of Agent interactions. This also depends on number of agents in the Multi-agent system and the orchestration pattern used. Therefore, this is the place where we can get high variability.

Key Concept: Agentic AI is touted as large number of agents running around doing things. This takes the high variability above and expands it like a balloon as more agents we have in a Multi-agent system wider is the range of agent interactions triggered by a request.

Some Results

The distributions below represent the result of the cost model working through each stage repeated 1 million times. Think of this as the same Multi-agent system going through 1 million different user chats.

I have used Gemini-2.5-Flash costs for the main LLM and Gemini-2.5-Flash-Lite costs for LLM-as-a-Judge.

The blue distributions from Left-to-Right: Base Prompt Template Token Count, Data Input Token Count, and Output Token Count. These are the first stage of the Cost Model.

The green distributions from Left-to-Right: Number of turns per Agent and Number of Agent Interactions. These are the second and third stages of the Cost Model.

The orange distribution is the cost per turn.

Expected Costs:

Average Cost per Multi-Turn Agent Interaction: $0.0174756
Median Cost per Multi-Turn Agent Interaction: $0.0133794
Max Cost per Multi-Turn Agent Interaction: $0.2704608
Min Cost per Multi-Turn Agent Interaction: $0.0004264
Total Cost for 1000000 Multi-Turn Agent Interactions: $17475.6268173

Distributions when we use specific choice for Turns and Agent Interactions.

Expected Costs:

Average Cost per Multi-Turn Agent Interaction: $0.0119819
Median Cost per Multi-Turn Agent Interaction: $0.0096888
Max Cost per Multi-Turn Agent Interaction: $0.0501490
Min Cost per Multi-Turn Agent Interaction: $0.0007306
Total Cost for 1000000 Multi-Turn Agent Interactions: $11981.8778405

Let us see what happens when we change the distribution for Agent turns to 4 (a small change from 2 previously).

Distribution with larger number of Agent interactions.

Expected Costs:

Average Cost per Multi-Turn Agent Interaction: $0.0291073
Median Cost per Multi-Turn Agent Interaction: $0.0232008
Max Cost per Multi-Turn Agent Interaction: $0.3814272
Min Cost per Multi-Turn Agent Interaction: $0.0004921
Total Cost for 1000000 Multi-Turn Agent Interactions: $29107.3320206

The costs almost double from increasing agent interaction from 2 to 4.

Key Concept: The expected cost distribution (orange) will not be a one time thing. The development team will have to continuously fine-tune based on Agent and Multi-agent System design approach, specific prompts, and tools use.

Caveats

There are several caveats to this analysis.

This is not the final cost – this is just the cost arising from the use of LLMs in agents and multi-agent applications where we have loops and workflows.
This cost will contribute to the larger total cost of ownership which will include operational setup to monitor these agents, compute, storage, and not to mention cost of maintaining and upgrading the agents as models change, new intents are required and failure modes are discovered.
The model output is quite sensitive to what is put in. That is why we get a distribution as part of the estimate. Individual numbers will never give us the true story.

Code

Code for the Agent Cost Model can be found below… feel free to play with it and fine tune it.

https://github.com/amachwe/agent_cost_model/tree/main