Think of a coffee machine: the automated kind with water, milk, coffee, and chocolate inside. That is lazy evaluation.
Such a machine holds a ‘store’ of water, milk, coffee, and other ingredients, but these are not mixed and processed into a specific type of coffee until a request comes in. This is Lazy Evaluation.
What the machine does not do is keep the milk froth ready or the water boiling; that would be Greedy Evaluation.
So what…
Lazy evaluation can save time, cost, and resources. We do not need memory for intermediate results, because the whole process becomes one large task execution graph evaluated only on demand. This applies anywhere a set of tasks must be combined in response to an incoming request.
From the coffee shop, to amazon warehouses, to your software program.
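In software, the coffee-machine idea most often shows up as a generator pipeline: every step is declared up front, but nothing runs until a value is actually requested. A minimal Python sketch (the bean names and log list are illustrative):

```python
# Lazy pipeline: nothing is ground or brewed until a cup is requested.
log = []

def grind(beans):
    for b in beans:
        log.append(f"grind:{b}")
        yield f"ground {b}"

def brew(grounds):
    for g in grounds:
        log.append(f"brew:{g}")
        yield f"coffee from {g}"

# Composing the pipeline does no work at all -- it only builds the task graph.
pipeline = brew(grind(["arabica", "robusta"]))
assert log == []

# Only when a cup is requested does work happen, and only for one bean.
first_cup = next(pipeline)
assert first_cup == "coffee from ground arabica"
assert log == ["grind:arabica", "brew:ground arabica"]
```

Note that the second bean is never touched unless someone asks for a second cup.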
Developing agents and multi-agent systems is often portrayed as a lone-wolf developer standing up a whole fleet of agents (cue your favourite OpenClaw story).
But what if as an organisation you want to have a multi-agent system built with developers working across different teams?
In that scenario, all the agents you need in your multi-agent system (MAS) are not going to show up at once in perfect sync. For a (hopefully) short time the system will appear fragmented, before the agents come online and give it shape.
This post is about how you can stub out agents to ensure teams are decoupled.
This post is divided into two sections. The first describes what a stub can look like; the second, how these stubs fit into specific orchestration scenarios.
This type of stub is just a bridge over the gap. The primary benefit is to ensure you can test any kind of routing and orchestration to this agent (not from it). This also allows you to see the shape of your multi-agent system and ensure you have placeholders to map on-paper architecture to code.
Usage: The name and description are the minimum bits of information you need to agree with the team responsible for the agent build. This must be done as part of the MAS design and agent-to-skill mapping.
Deterministic Stub
We may want to create a deterministic stub for situations where we know the major scenarios we want to stub for and can create a rule-based decision tree. For example, when the agents are part of a sequence, taking in structured input and producing structured output. Such rule-based stubs will later be replaced by a flexible/robust understanding and decisioning system (e.g., an LLM or ML model) in production.
Usage: First decide which scenarios to stub based on the requirements for the agent, and whether you want to test the happy path, the unhappy path, or both. The problem to solve is then creating simple rules that map inputs to specific outputs. This will require coordination with the team developing the actual agent and the goals/tasks assigned to it. The stub can also be used to support session state update testing (plumbing for the system).
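Stripped of any framework, the core of such a stub is just a rule table mapping recognisable inputs to canned outputs, plus an explicit unhappy-path fallback. A minimal sketch (the rule keys and outputs are invented for illustration):

```python
# Illustrative rule table for a stubbed agent: recognisable input patterns
# map to canned structured outputs agreed with the owning team.
RULES = {
    "refund": {"status": "ok", "action": "refund_initiated"},
    "balance": {"status": "ok", "balance": 100.0},
}

def stub_respond(message: str) -> dict:
    for keyword, output in RULES.items():
        if keyword in message.lower():
            return output
    # Unhappy path: make it loud that the stub had no rule for this input.
    return {"status": "stub_no_rule", "input": message}

assert stub_respond("Please process a REFUND")["action"] == "refund_initiated"
assert stub_respond("what now?")["status"] == "stub_no_rule"
```

The explicit `stub_no_rule` output makes gaps in the agreed scenarios visible during integration testing instead of silently passing garbage downstream.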
In the figure below, if we are missing an agent in a Sequential workflow that we need to stub, there are three scenarios.
Agent missing at the start of the flow – the stub will need to create output that drives the rest of the flow. This can be done based on specific flow scenarios (e.g., customer details passed in a structured format). This is a critical stub as it can either protect the downstream agents or push them off track.
State: the stub can be used to initialise session state.
Agent missing in the middle of the flow – the stub will need to deal with input from the upstream agent as well as produce output to continue the scenario. We need to localise the behaviour of the stubbed agent.
State: the stub can be used to propagate state changes downstream aligning with the specific use-case.
Agent missing at the end of the flow – this stub needs to capture the end state of the flow for whatever is waiting at the other end. We have to be careful as these types of stubs can misrepresent the entire flow.
State: the stub can be used to finalise state change at the end of the flow.
Agents in a sequence and stubs represented by dashed line.
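The three positions can be sketched without any framework: plain functions stand in for agents, and a dict stands in for session state (the scenario data below is invented for illustration):

```python
# Toy sequential flow with stubs at the start and end; a 'real' agent sits
# in the middle. Each step reads and updates shared session state.
def stub_start(state):
    # Stub at the start: initialises session state with fabricated,
    # scenario-specific structured output for downstream agents.
    state["customer"] = {"id": "C123", "tier": "gold"}
    return state

def real_middle(state):
    # Real agent: consumes upstream output and propagates a state change.
    state["discount"] = 0.1 if state["customer"]["tier"] == "gold" else 0.0
    return state

def stub_end(state):
    # Stub at the end: finalises state for whatever waits downstream.
    state["final"] = True
    return state

state = {}
for step in (stub_start, real_middle, stub_end):
    state = step(state)

assert state["discount"] == 0.1 and state["final"] is True
```

The start stub is the risky one: the fabricated `customer` record drives everything after it, so it can either protect the downstream agents or push them off track.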
A basic variant of the Deterministic Stub is shown below with all the different ‘action’ options:
Just generate some content based on a rule (append ‘Hello world’).
Intelligent Stub
This is for when you find it difficult to create code that maps inputs to outputs but still need to stub out the comprehension behaviour of the agent.
Usage: Ensure you focus on the inputs and outputs while stubbing out the comprehension aspect of the real agent. Be careful you do not add decisioning behaviours to the stub otherwise you will create a system that is tuned to the stub behaviour and may behave differently when the stub is replaced with the real agent.
You relate the input to the output using one of the methods below:
some sort of semantic search (e.g., vector search), where an ML model is used to understand the input and the index provides a semantic mapping to the test output to use.
prompting a lightweight LLM to understand the input and map it to one of the pre-set outputs, without exercising its own decisioning capabilities.
Example
Below is an example of how such an ‘intelligent’ stub can be developed using the first approach (semantic search). For this we use a set of ‘key intents’ aligned with the use-case, an embedding model to simulate the comprehension of the agent, and a simple similarity score to surface the intent based on the input.
The stubs shown in this post can be used as sub-agents or within the deterministic workflows supported by ADK (looping, sequential, parallel). In the next post I will attempt to build out the deterministic workflow examples.
Below is the full example for sub-agents.
from typing import AsyncGenerator
from google.adk.agents.llm_agent import LlmAgent
from google.adk.agents import BaseAgent, InvocationContext
from google.adk.models.lite_llm import LiteLlm
from typing_extensions import override
from google.adk.events import Event, EventActions
from google.genai.types import Content, Part
import random
import numpy as np
from sentence_transformers import SentenceTransformer, util
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
MODEL = LiteLlm(model="ollama_chat/qwen3.5:2b")
class StubAgent(BaseAgent):
    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description, sub_agents=sub_agents)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # 'Dumb' stub: log the context and immediately complete the turn
        print("Activating StubAgent with context:", ctx)
        logger.info(f"{self.name} received context: {ctx}")
        yield Event(turn_complete=True, author=self.name)


class DeterministicStubAgent(BaseAgent):
    integration: str = "Stub_Agent_Integration"

    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Rule: add "Hello world" to the message
        modified_message = f"Hello world {input_message}"
        print(f"Modified message: {modified_message}")
        # Track how many times this agent has been called in the session
        hop_count = ctx.session.state.get("hop_count", 0)
        state_delta = {"hop_count": hop_count + 1}
        invocation_id = f"{hop_count}_{random.randint(1000, 9999)}"
        if 2 <= hop_count < 5:
            # Transfer to another (cheaper) agent after 2 hops
            action = EventActions(state_delta=state_delta, transfer_to_agent=self.integration)
            event = Event(invocation_id=invocation_id, author=self.name, content=Content(role="assistant", parts=[Part(function_call={"name": "transfer_to_agent", "args": {"agent_name": self.integration}})]), actions=action)
        elif hop_count >= 5:
            # Stop after 5 hops
            print("Turn completed")
            action = EventActions(state_delta=state_delta, turn_complete=True)
            event = Event(invocation_id=invocation_id, author=self.name, content=Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        else:
            action = EventActions(state_delta=state_delta)
            event = Event(author=self.name, content=Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        yield event


class IntelligentStubAgent(BaseAgent):
    model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2')
    key_intents: list[str] = ["Integration", "Differentiation", "Algebra", "Geometry", "Trigonometry"]
    encoded_intents: list[np.ndarray] = []

    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)
        self.encoded_intents = [self.model.encode(intent) for intent in self.key_intents]

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Embed the input (simulating the agent's comprehension) and pick the
        # key intent with the smallest cosine distance
        encode = self.model.encode(input_message)
        similarities = [1 - util.cos_sim(encode, intent) for intent in self.encoded_intents]
        best_intent_index = np.argmin(similarities)
        best_intent = self.key_intents[best_intent_index]
        modified_message = f"Identified intent: {best_intent} for input: {input_message}"
        event = build_event(name=self.name, content=f"Detected intent: {modified_message}", state_delta={"identified_intent": best_intent})
        yield event


def build_event(name: str, content: str, turn_complete: bool = False, transfer_to_agent: str = None, state_delta: dict = {}) -> Event:
    action = EventActions(state_delta=state_delta, transfer_to_agent=transfer_to_agent)
    invocation_id = f"{name}-{random.randint(1, 99999)}"
    return Event(invocation_id=invocation_id, author=name, content=Content(role="assistant", parts=[Part(text=content)]), actions=action, turn_complete=turn_complete)
instruction = """
You are an autonomous agent that takes a complex maths problem and breaks it down into smaller steps to solve it.
You have access to a set of agents for each branch of maths.
"""
stub_agent_1 = StubAgent(name="Stub_Agent_Integration", description="Agent that can do Integration problems")
stub_agent_2 = DeterministicStubAgent(name="Stub_Agent_Differentiation", description="Agent that can do Differentiation problems")
stub_agent_3 = IntelligentStubAgent(name="Stub_Agent_Intelligent", description="Agent that can identify the branch of maths")
root_agent = LlmAgent(name="Root_Agent", description="Root agent for handling conversation and classification of problem", instruction=instruction, model=MODEL, sub_agents=[stub_agent_1, stub_agent_2, stub_agent_3])
Output:
Intelligent Stub in action
In the above, the intent has been correctly identified and can now be used to pull a specific response from a test list. ADK web tracing shows that the Intelligent stub was called.
Deterministic Stub in action
In the above, the flow has been directed to the deterministic agent, which has appended ‘Hello world’ to the output correctly. Confirmed using ADK web tracing.
Dumb Stub (not) in action
In the above we don’t see anything interesting (this is a Dumb stub, after all), except that the Root Agent correctly routed to it based on the description provided to the stub. This is extremely useful when you have a large set of sub-agents and not all of them are available, but you still want to test the routing behaviour of your root agent instructions.
Disclaimer: This post refers to examples and scenarios that I have personally tested using ADK web. I have also extended my testing to GCP Agent Engine where possible. In practice, ADK web and Agent Engine are different platforms, so prepare to be surprised if something works in ADK web but not with Agent Engine. I will attempt to highlight where I have come across such issues. Agent Engine code is at the bottom of the post.
The recent updates on ADK 2.0 (here), currently in Alpha release, point toward the framework offering greater control over orchestration through the introduction of a graph construct. This also positions it to compete with the likes of LangGraph.
Why Customise?
That question is the reason for this post. Currently, most ADK examples solve orchestration between agents using a composition of one of five ‘core’ patterns:
Sub-agent
Agent-as-Tool
Three deterministic workflow patterns (Sequential, Parallel, and Loop).
The deterministic workflow patterns are all based around an ‘Agent’ that is driven by code rather than by an LLM.
To actually implement an agent in ADK, the most common starting point is LlmAgent: an out-of-the-box implementation of the ReAct agent architecture.
But there are several scenarios that require you to build a custom agent (at least until ADK 2.0 is out). Some of these are outlined below:
When stubbing out agents.
When you require fine-grained orchestration control (e.g., using ML to decide what happens next based on the output of the previous agent).
When we want to create our own style of agent (e.g., those based on ML-models and deterministic decisioning) and not use LlmAgent.
How to Customise?
The basic outline for a ‘blank’ agent is as follows:
class MyCustomAgent(BaseAgent):
    def __init__(self, name: str, description: str):
        super().__init__(name=name, description=description)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # the logic of your custom agent goes here
        yield  # when you are ready to complete execution then yield an Event here
The key aspects to focus on:
name and description in the __init__ method: required to identify your agent and to describe it to upstream agents.
_run_async_impl method: this is where the logic of this custom agent goes.
The Logic of the Agent
The _run_async_impl method, which has been overridden in the above snippet, is an async method. This means it is not a ‘blocking’ method; instead it is called from within an event loop.
Insight: ADK by default executes agents asynchronously, even if to the consumer it looks like a synchronous call.
The InvocationContext object is the magic object that contains all the information available for our custom agent to understand the task it is being asked to perform and any additional data/context provided. It also provides information about the current ‘state’ of the interaction including conversational context.
A given context spans one full turn, starting with the external input into the multi-agent system (e.g., the user’s message) and ending with the final response of the multi-agent system (e.g., the reply back to the user). The context therefore covers multiple agent calls, each of which can in turn be made up of several custom calls, LLM calls, or tool calls.
Given the yield construct we are expected to formulate our agent as a generator of events. Which brings us to the Event construct within ADK.
Insight: An ADK agent, from a programmatic perspective, is nothing but a generator of events that influences what downstream agents see. If no Events are generated by your agent then it is invisible to the other agents in the system.
Event Class
The Event Class is the most important class to understand because this is what allows your agent to surface itself in the event stream that is one Multi-agent system execution turn. If your agent doesn’t yield any Event objects it will be invisible to the other agents just like an employee that never sends an email, types a line of code, or attends a meeting.
So what is an Event?
An Event in ADK is an immutable record representing a specific point in the agent’s execution. It captures user messages, agent replies, requests to use tools (function calls), tool results, state changes, control signals, and errors. [https://google.github.io/adk-docs/events/]
The Event class has two required attributes:
Author: author of the event (e.g., agent name).
Invocation Id: even though the documentation states this as required, your agent will still work without adding an invocation id to events, because the Event class gives it a default empty-string value. Agent Engine: here it is required and must be unique, generated per processing stage.
Other attributes of interest:
content: represented by the Content object which is the ‘output’ of your agent. Content object can represent complex content types such as text, file data, code etc.
partial/turn_complete: boolean values that signal the completeness of the response.
actions: type of EventActions such as:
state_delta/artefact_delta: indicating update of state/artefact with given delta.
transfer_to_agent: indicating action to transfer to the indicated agent.
escalate: indicating an escalation up to the parent agent (e.g., to break out of a loop).
Implementing an Agent that Provides a Static Response
Remember, our custom agent can only communicate via Events. In this section let us create a basic agent that generates a response using a deterministic method: it responds by appending ‘Hello world’ to the query provided by the upstream agent.
In the agent we first access the incoming request via the InvocationContext -> user_content structure, store it in input_message, and then create the modified message (adding ‘Hello world’).
Then we construct and yield the Event object that contains the deterministic response generated by our custom agent within the content structure.
The snippet also shows the start and end points of your custom flow: the start is accessing the upstream content you may want to respond to, and the end is generating an Event that contains your response/output.
To run the deterministic agent with a LLM-driven agent:
MODEL = LiteLlm(model="ollama_chat/qwen3.5:2b")

instruction = """
You are an autonomous agent that takes a complex maths problem and breaks it down into smaller steps to solve it.
You have access to a set of agents for each branch of maths.
"""

stub_agent_1 = StubAgent(name="Stub_Agent_Integration", description="Agent that can do Integration problems")
stub_agent_2 = DeterministicStubAgent(name="Stub_Agent_Differentiation", description="Agent that can do Differentiation problems")
root_agent = LlmAgent(name="Root_Agent", description="Root agent for handling conversation and classification of problem", instruction=instruction, model=MODEL, sub_agents=[stub_agent_1, stub_agent_2])
When you run the above as a sub-agent to the root agent this is what the interaction and trace looks like:
Example with EventActions
If you are going to the trouble of creating a custom agent, your needs will go beyond content generation. You may need to inform other agents about some session-level operational information (e.g., cost incurred or an agent-hop-count update) or indicate some change in state to the wider application.
The content block is not suitable for this, as it is about the content flow. There is a separate ‘state’ structure in the context that exists at the session level and can be used to store temporary state information. Given this is session-scoped and not persisted, do not use it to store complex bits of information. The session state is not a data store!
@override
async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
    # Get the input message from the context
    logger.info(f"{self.name} received context: {ctx}")
    input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
    print(f"Received input message: {input_message}")
    # Add "Hello world" to the message
    modified_message = f"Hello world {input_message}"
    print(f"Modified message: {modified_message}")
    # Read the current hop count from session state and increment it via a state delta
    hop_count = ctx.session.state.get("hop_count", 0)
    state_delta = {"hop_count": hop_count + 1}
    action = EventActions(state_delta=state_delta)
    yield Event(author=self.name, content=Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
In the snippet above we add a hop_count session state variable to track how many times this agent has been called per session. A real-world use of this would be to ration the use of ‘expensive’ agents.
If you are using ADK web to test and learn then you can use the ‘State’ tab on the right to explore how this state variable increments every time our custom agent is used in a given session. Below we can see it has been used twice in the session.
In another session it has been used thrice.
The image below shows Agent Engine deployment with some tweaks to the code.
Conclusion
We have walked through how to build your own custom ADK agent, with code snippets covering the overall structure and the construction of simple Events.
We then extended this to work with EventActions, specifically how to perform state updates to share session-level information with other agents.
We will build on this knowledge to develop sophisticated custom agents and to carry out common tasks such as stubbing agents for testing multi-agent systems.
Code Tested with Agent Engine
from typing import AsyncGenerator
from google.adk.agents.llm_agent import LlmAgent
from google.adk.agents import BaseAgent, InvocationContext
from google.adk.models.lite_llm import LiteLlm
from typing_extensions import override
from google.adk.events import Event, EventActions
from google.genai.types import Content, Part
import random
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class StubAgent(BaseAgent):
    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description, sub_agents=sub_agents)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        print("Activating StubAgent with context:", ctx)
        logger.info(f"{self.name} received context: {ctx}")
        yield Event(turn_complete=True, author=self.name)


class DeterministicStubAgent(BaseAgent):
    integration: str = "Stub_Agent_Integration"

    def __init__(self, name: str, description: str, sub_agents=[]):
        super().__init__(name=name, description=description)

    @override
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        # Get the input message from the context
        logger.info(f"{self.name} received context: {ctx}")
        input_message = ctx.user_content.parts[0].text if ctx.user_content and ctx.user_content.parts else ""
        print(f"Received input message: {input_message}")
        # Add "Hello world" to the message
        modified_message = f"Hello world {input_message}"
        print(f"Modified message: {modified_message}")
        hop_count = ctx.session.state.get("hop_count", 0)
        state_delta = {"hop_count": hop_count + 1}
        invocation_id = f"{hop_count}_{random.randint(1000, 9999)}"
        if 2 <= hop_count < 5:
            # Transfer to another cheaper agent after 2 hops
            action = EventActions(state_delta=state_delta, transfer_to_agent=self.integration)
            event = Event(invocation_id=invocation_id, author=self.name, content=Content(role="assistant", parts=[Part(function_call={"name": "transfer_to_agent", "args": {"agent_name": self.integration}})]), actions=action)
        elif hop_count >= 5:
            print("Turn completed")
            action = EventActions(state_delta=state_delta, turn_complete=True)
            event = Event(invocation_id=invocation_id, author=self.name, content=Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        else:
            action = EventActions(state_delta=state_delta)
            event = Event(invocation_id=invocation_id, author=self.name, content=Content(role="assistant", parts=[Part(text=modified_message)]), actions=action)
        yield event
MODEL = LiteLlm(model="ollama_chat/qwen3.5:2b")
#Note: used Gemini Flash 2.5 when testing with Agent Engine.
instruction = """
You are an autonomous agent that takes a complex maths problem and breaks it down into smaller steps to solve it.
You have access to a set of agents for each branch of maths.
"""
stub_agent_1 = StubAgent(name="Stub_Agent_Integration", description="Agent that can do Integration problems")
stub_agent_2 = DeterministicStubAgent(name="Stub_Agent_Differentiation", description="Agent that can do Differentiation problems")
root_agent = LlmAgent(name="Root_Agent", description="Root agent for handling conversation and classification of problem", instruction=instruction, model=MODEL, sub_agents=[stub_agent_1, stub_agent_2])
The Retrieval part of RAG requires topic-specific data to be retrieved from a knowledge store based on the user’s query.
The more focussed the retrieval, the better grounded the generation will be.
Hidden Complexity of Using Vectors
In Vectorised RAG the retrieval part uses a vector database to store chunks of knowledge and a vector search query to retrieve relevant chunks for the model or agent.
Using a vector database is where the hidden complexity lies, and this is best explained with a story.
When I started my Masters in 2002 we had a module called Research Methods. I was not aware of any of the core concepts and found the lectures difficult to follow. Search engines were not as powerful or AI-enabled as today. One had to expand one’s vocabulary with appropriate technical and domain-specific terms to retrieve knowledge and then discover related concepts. Keep doing this, rinse and repeat, till you map out the topic.
The above story describes how I created a mental index of this new topic ‘Research Methods’ and used that to ground my coursework.
For vectorised RAG we depend on what are known as embedding models to convert text containing knowledge into a vector which is then stored in a vector database. The hope is that the model understands the concepts within that topic and relationships between them. This is implemented, during model training, by exposing the model to real text so that it can learn the statistical patterns within the training text.
Retrieval uses the same embedding model to create a query vector from the user’s query text, which is used to find ‘related’ knowledge in the vector database.
Now we come to the critical part:
Just by observing statistical patterns within the training text I can build out a concept space. But there is no way to guarantee that I am a master of that concept space. We humans cannot squash every concept into our brains (e.g., an aerospace engineer will have concepts and relationships in their head that a marine engineer may not, even though there will be a high degree of overlap in their concept spaces).
It is highly unlikely that a relatively small model can contain all concepts at a useful level of detail. Therefore, the vectorisation of different documents will be more or less effective depending on how deeply the embedding model was trained on that topic.
For example, if the embedding model was trained without knowledge of Charles Dickens or his works, then from a vector-creation point of view ‘Bleak House’ will be closer to ‘sad house’ than to ‘Charles Dickens’. In fact, the concept of Charles Dickens may not be in the embedding model at all, in which case the vectors produced will be based on the plain-English meaning of the words.
Another famous example: given the query ‘apple’, which of the words in the list [‘orange’, ‘iphone’] should the resulting vector be closest to? Depending on the embedding model used to vectorise the query and the candidate words, we may see different results.
Impact of training bias: if the model has been trained on knowledge about Apple products like the iPhone, we expect the query to land close to ‘iphone’. If the model has been trained on basic English, the query may land closer to ‘orange’. Furthermore, ‘iphone’ and ‘orange’ should be far apart from each other, unless the model knows about the new ‘orange iPhone’.
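The effect can be illustrated with hand-crafted toy vectors (the numbers below are invented for illustration, not real embeddings): two different ‘models’ place the same words differently, so the nearest neighbour of ‘apple’ flips.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D 'embeddings' (invented numbers, not from a real model).
# Model A was 'trained' on plain English: apple sits near fruit words.
model_a = {"apple": np.array([1.0, 0.1]), "orange": np.array([0.9, 0.2]), "iphone": np.array([0.1, 1.0])}
# Model B was 'trained' on tech text: apple sits near product words.
model_b = {"apple": np.array([0.2, 1.0]), "orange": np.array([1.0, 0.1]), "iphone": np.array([0.3, 0.9])}

def nearest(model, query, candidates):
    # Return the candidate with the highest cosine similarity to the query.
    return max(candidates, key=lambda w: cosine(model[query], model[w]))

assert nearest(model_a, "apple", ["orange", "iphone"]) == "orange"
assert nearest(model_b, "apple", ["orange", "iphone"]) == "iphone"
```

Same query, same candidates, different training bias: a different retrieval result.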
Just like my Research Methods example above how you build out a concept map of a topic depends on what sources you have included in your learning.
This is the hidden complexity of vectorisation as an approach. You are completely dependent on the embedding model to help map out the topic space and then retrieve specific concepts.
General purpose embedding models will not capture domain or topic specific concepts.
That is what links the embedding model to the use-case. That is why for medical AI applications you get specialised Medical embedding models (like MedEmbed-small-v0.1).
These embedding models have a time horizon: if a concept did not exist when they were trained, they will not know about it or how to relate it to other concepts. See the example of differences in concept resolution below, using OpenAI’s ‘text-embedding-3-small’ (Figure 1a) and ‘text-embedding-3-large’ (Figure 1b). Clearly the ‘large’ model has a more accurate model of the UK Savings Account space.
Figure 1a: Note the significant difference for Individual Savings Account against ISAs and the general term Savings Account. All ISAs are Individual Savings Accounts, which are also Savings Accounts.
Figure 1b: Note the confusion is not seen in the ‘large’ embedding model, as all ISAs and the generic term Savings Account are now quite close to Individual Savings Account.
Embedding models cannot explicitly be given context for a comparison.
Another important point from Figures 1a and 1b is that these distance measures are fixed: we cannot provide context to the embedding model. If the context were tax regulation, for example, the similarity scores would be very different, as they would be if the context were risk vs reward.
Think of the context as taking the existing high-dimensional vector space in which all these concepts sit and transforming it, bringing certain concepts closer together while teasing others apart.
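One way to picture that transformation (a toy illustration with invented vectors, not how any production system works): model a ‘tax regulation’ context as a linear map that amplifies the finance dimension of the space and damps the fruit dimension, flipping the nearest neighbour of ‘apple’.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D 'embedding space' (invented numbers):
# dimension 0 ~ finance-ness, dimension 1 ~ fruit-ness.
apple = np.array([0.6, 0.8])
isa = np.array([1.0, 0.1])      # Individual Savings Account
orange = np.array([0.1, 1.0])

# A 'tax regulation' context as a linear map: stretch the finance axis,
# squash the fruit axis.
tax_context = np.diag([3.0, 0.2])

# Without context, the fruit reading of 'apple' wins.
before = cosine(apple, isa) < cosine(apple, orange)
# Under the context transform, the finance reading wins.
after = cosine(tax_context @ apple, tax_context @ isa) > cosine(tax_context @ apple, tax_context @ orange)
assert before and after
```

Embedding models give you no such knob: the space is frozen at training time.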
Now a question for you: have you gone through an embedding model evaluation for your use-case, or just picked a default OpenAI or Google embedding model and run with it?
All this without even discussing the many ways vector results are evaluated, the different styles of vector-creation pipelines, and the impact of design choices (e.g., chunking strategy).
Enter PageIndex
PageIndex attempts to free the burdened developer from all of this: embedding model selection, tuning vectorisation pipelines, thinking about chunking strategies, and spinning up new bits of software like vector databases.
It is based on a simple idea: let the LLM figure out what is in a document by drawing up a table of contents as a lightweight index tree called a PageIndex. This content index is then used to identify and extract the relevant sections/concepts from the document, which feed the response generation in a ‘reasoning loop’ (see Figure 2).
Here you are not spinning up a vector db or bouncing between vectors and plain text at the mercy of the embedding model being used. Everything stays plain text, and SMEs can evaluate the PageIndex created, tweak it if needed, or even create multiple versions for different use-cases.
{
    "title": "...",
    "nodes": [
        {
            "title": "Domestic and International Cooperation and Coordination",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ..."
        }
    ]
}
The (abbreviated) snippet above shows an example of a PageIndex.
Note that the index can be at any level: document, pages, paragraphs etc. and contains metadata to provide additional richness and flexibility to this process (e.g., no need to commit to a chunking strategy).
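To make the retrieval step concrete, here is a framework-free sketch over a toy index. A keyword match stands in for the LLM’s relevance judgement, and the node fields (`title`, `start_index`, `end_index`, `summary`, `nodes`) are illustrative, mirroring the shape of the snippet above:

```python
# Toy PageIndex-style tree; in the real approach an LLM, not keyword
# overlap, decides which nodes are relevant to the query.
index = {
    "title": "Annual Report",
    "nodes": [
        {"title": "Monetary Policy", "start_index": 1, "end_index": 27,
         "summary": "Interest rate decisions during 2023.", "nodes": []},
        {"title": "Domestic and International Cooperation and Coordination",
         "start_index": 28, "end_index": 31,
         "summary": "In 2023, the Federal Reserve collaborated ...", "nodes": []},
    ],
}

def retrieve(node, query, hits=None):
    """Walk the tree, collecting page ranges whose title/summary match the query."""
    hits = [] if hits is None else hits
    text = f"{node.get('title', '')} {node.get('summary', '')}".lower()
    if any(word in text for word in query.lower().split()):
        hits.append((node["title"], node.get("start_index"), node.get("end_index")))
    for child in node.get("nodes", []):
        retrieve(child, query, hits)
    return hits

assert retrieve(index, "international coordination") == [
    ("Domestic and International Cooperation and Coordination", 28, 31)
]
```

The returned page ranges are then pulled out as plain text and handed to the generating model, with the loop repeating if the answer needs more material.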
The way I like to think of PageIndex RAG vs Vectorised RAG is:
PageIndex: you are using notes on the topic prepared by your teacher.
Vectorised: you are using notes on the topic prepared by your friend, based on what the teacher taught in the classroom.
Now this method has its own trade-offs which are critical to understand before we jump on this based on LinkedIn hype.
Firstly, it depends on an ‘agentic’ style of generation, which means multiple loops / LLM calls compared to a single lightweight embedding-model call. The cost implications can be massive, especially when you need to re-index.
Secondly, there is still a dependency on some kind of AI model. We have expanded the concept space by moving from, say, a 50m-to-8b-parameter embedding model to an LLM with more than 20b parameters, mixture of experts, reasoning mode, and many other tweaks. We can provide context to build out different indexes via the page-index build prompt. That said, we will still be affected by training bias, where different models may produce different indexes.
Thirdly, reindexing becomes unpredictable. Earlier you could train lots of small embedding models based on the topic landscape of your organisation and domain, and store them as your own assets. Now you are dependent on the major model providers like OpenAI, Anthropic, and Google. Let us say you built your PageIndex from v1 of the documents of interest using Gemini Flash 2.0. Six months later some documents are updated to v2, so the PageIndex must be rebuilt; but Gemini Flash 2.0 has since been deprecated, forcing a move to Gemini Flash 2.5. The good thing about the PageIndex concept is that you can provide the existing version (based on the v1 documents) together with the v2 documents and instruct the model to update rather than rebuild from scratch. Again, this increases our dependence on the model, requiring SME validation of the PageIndex at each stage.
Fourthly, PageIndex is a chain of steps where a vector index is a one-step process. PageIndex makes retrieval an agentic loop where the depth of retrieval is dynamic. This can increase latency and make the process less auditable and repeatable. At the same time it makes retrieval tunable for different topic areas: a complex query can be supported by additional retrieval loops whereas a simple query can be answered one-shot. Vector retrieval is one-shot by default.
The toolset is added using the standard tools keyword argument of the ADK LlmAgent (for example).
Opening the Lid: Tool Creation
The toolset is created every time the agent is initialised. Creation is procedural, so we can investigate the code to understand the logic.
Dynamic tool creation carries risk if different ADK versions are in use or the API spec changes: the changes will be translated silently and used by your agent.
The OpenAPIToolset converts each endpoint in the OpenAPI spec into a tool whose name is the resource path with segments joined by ‘_’ and the REST verb appended at the end. As an example:
Path: "/accounts/{accountId}" Operation: GET
Generated Toolname: "accounts_account_id_get"
Given that the tool name captures the path, debugging can become difficult, especially if the paths are long. I also wonder whether the LLM could get confused, given this is all text at the end of the day.
It also becomes difficult to provide constants that we do not want the LLM to populate, view or override, even if the API provides the access (e.g., the agent can only open accounts of type ‘current’ whereas the API can be used for both ‘current’ and ‘savings’ accounts).
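The naming convention can be approximated in a few lines. This is my own reimplementation for illustration only; the actual OpenAPIToolset logic may differ in edge cases (special characters, duplicate segments, etc.).

```python
import re

def tool_name_from_path(path: str, verb: str) -> str:
    """Approximate the naming convention described above: path segments
    joined by '_', camelCase parameters snake_cased, verb appended."""
    segments = [s.strip("{}") for s in path.strip("/").split("/")]
    # snake_case each segment, e.g. accountId -> account_id
    snake = [re.sub(r"(?<!^)(?=[A-Z])", "_", s).lower() for s in segments]
    return "_".join(snake + [verb.lower()])

print(tool_name_from_path("/accounts/{accountId}", "GET"))
# -> accounts_account_id_get
```

Long nested paths (e.g. /accounts/{accountId}/transactions/{transactionId}) produce correspondingly long tool names, which is the debugging concern noted above.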
Schema
Each parameter for the request is defined in a parameters object within each tool. For the GET request above we have two parameters, as defined in the spec below: ‘accountId’, a string in the path, and ‘includeBalance’, a boolean in the query.
OpenAPI Spec:
"parameters":[
{
"name":"accountId",
"in":"path",
"required":true,
"schema":{
"type":"string"
},
"description":"Unique identifier for the account"
},
{
"name":"includeBalance",
"in":"query",
"required":false,
"schema":{
"type":"boolean",
"default":true
},
"description":"Whether to include account balance in response"
}
]
The code below is the JSON output for the ‘accountId’ parameter (just one parameter). It shows the complex combinatorial problem that has been solved: there can be many combinations depending on how the parameter is passed. We know ‘accountId’ uses the simplest style, the URL path, which is why most of the items are ‘null’. Description, location, type, and name are the main values to focus on.
Resulting Parameter object in JSON format for 'accountId'
{
"description":"Unique identifier for the account",
"required":true,
"deprecated":null,
"style":null,
"explode":null,
"allowReserved":null,
"schema_":{
"schema_":null,
"vocabulary":null,
"id":null,
"anchor":null,
"dynamicAnchor":null,
"ref":null,
"dynamicRef":null,
"defs":null,
"comment":null,
"allOf":null,
"anyOf":null,
"oneOf":null,
"not_":null,
"if_":null,
"then":null,
"else_":null,
"dependentSchemas":null,
"prefixItems":null,
"items":null,
"contains":null,
"properties":null,
"patternProperties":null,
"additionalProperties":null,
"propertyNames":null,
"unevaluatedItems":null,
"unevaluatedProperties":null,
"type":"string",
"enum":null,
"const":null,
"multipleOf":null,
"maximum":null,
"exclusiveMaximum":null,
"minimum":null,
"exclusiveMinimum":null,
"maxLength":null,
"minLength":null,
"pattern":null,
"maxItems":null,
"minItems":null,
"uniqueItems":null,
"maxContains":null,
"minContains":null,
"maxProperties":null,
"minProperties":null,
"required":null,
"dependentRequired":null,
"format":null,
"contentEncoding":null,
"contentMediaType":null,
"contentSchema":null,
"title":null,
"description":"Unique identifier for the account",
"default":null,
"deprecated":null,
"readOnly":null,
"writeOnly":null,
"examples":null,
"discriminator":null,
"xml":null,
"externalDocs":null,
"example":null
},
"example":null,
"examples":null,
"content":null,
"name":"accountId",
"in_":"path"
}
Remember, all of the items above are not just for translating function parameters into request parameters; they also provide clues to the LLM as to what types of parameters to provide, valid options/ranges (see Request Object), and the meaning of the specific request that the tool wraps.
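Since most fields in these dumps are null, a small helper (my own, not part of ADK) that prunes null entries makes the meaningful fields easy to spot. The sample data below is a trimmed version of the ‘accountId’ dump above.

```python
def prune_nulls(obj):
    """Recursively drop None values from dicts and lists so that only
    the populated schema fields remain visible."""
    if isinstance(obj, dict):
        return {k: prune_nulls(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [prune_nulls(v) for v in obj if v is not None]
    return obj

# Trimmed excerpt of the dumped 'accountId' parameter object.
param = {
    "description": "Unique identifier for the account",
    "required": True,
    "style": None,
    "explode": None,
    "schema_": {
        "type": "string",
        "description": "Unique identifier for the account",
        "maxLength": None,
        "pattern": None,
    },
    "name": "accountId",
    "in_": "path",
}
print(prune_nulls(param))
```

Running this against the full dump leaves only the handful of fields the LLM actually sees as signal: description, required, type, name and location.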
Request Object
The Request Object captures the input going into the OpenAPI request to the remote API. It provides the schema properties that represent the input variables. The example below is the POST request for accounts, used to create or update accounts. It shows the schema with one property, ‘userId’, with three important keys: type, description and examples (the last is critical for LLMs to ensure they follow approved patterns for things like dates, emails etc.). Different schema items can include validation, such as enums, and can be marked as required (e.g., for account opening: email, first name and last name).
"requestBody":{
"description":null,
"content":{
"application/json":{
"schema_":{
"schema_":null,
"vocabulary":null,
"id":null,
"anchor":null,
"dynamicAnchor":null,
"ref":null,
"dynamicRef":null,
"defs":null,
"comment":null,
"allOf":null,
"anyOf":null,
"oneOf":null,
"not_":null,
"if_":null,
"then":null,
"else_":null,
"dependentSchemas":null,
"prefixItems":null,
"items":null,
"contains":null,
"properties":{
"userId":{
"schema_":null,
"vocabulary":null,
"id":null,
"anchor":null,
"dynamicAnchor":null,
"ref":null,
"dynamicRef":null,
"defs":null,
"comment":null,
"allOf":null,
"anyOf":null,
"oneOf":null,
"not_":null,
"if_":null,
"then":null,
"else_":null,
"dependentSchemas":null,
"prefixItems":null,
"items":null,
"contains":null,
"properties":null,
"patternProperties":null,
"additionalProperties":null,
"propertyNames":null,
"unevaluatedItems":null,
"unevaluatedProperties":null,
"type":"string",
"enum":null,
"const":null,
"multipleOf":null,
"maximum":null,
"exclusiveMaximum":null,
"minimum":null,
"exclusiveMinimum":null,
"maxLength":null,
"minLength":null,
"pattern":null,
"maxItems":null,
"minItems":null,
"uniqueItems":null,
"maxContains":null,
"minContains":null,
"maxProperties":null,
"minProperties":null,
"required":null,
"dependentRequired":null,
"format":null,
"contentEncoding":null,
"contentMediaType":null,
"contentSchema":null,
"title":null,
"description":"Associated user identifier",
"default":null,
"deprecated":null,
"readOnly":null,
"writeOnly":null,
"examples":null,
"discriminator":null,
"xml":null,
"externalDocs":null,
"example":"user_987654321"
},
Response Object
This is my favourite part: the response from the API. The ‘responses’ object is keyed by HTTP response status code. The snippet below shows the 200 (account updated successfully) response. The full file extracted from the OpenAPIToolset’s generated tool also covers:
202 – account created.
400 – invalid request data.
401 – unauthorised access.
"responses":{
"200":{
"description":"Account updated successfully",
"headers":null,
"content":{
"application/json":{
"schema_":{
"schema_":null,
"vocabulary":null,
"id":null,
"anchor":null,
"dynamicAnchor":null,
"ref":null,
"dynamicRef":null,
"defs":null,
"comment":null,
"allOf":null,
"anyOf":null,
"oneOf":null,
"not_":null,
"if_":null,
"then":null,
"else_":null,
"dependentSchemas":null,
"prefixItems":null,
"items":null,
"contains":null,
"properties":{
"accountId":{
"schema_":null,
"vocabulary":null,
"id":null,
"anchor":null,
"dynamicAnchor":null,
"ref":null,
"dynamicRef":null,
"defs":null,
"comment":null,
"allOf":null,
"anyOf":null,
"oneOf":null,
"not_":null,
"if_":null,
"then":null,
"else_":null,
"dependentSchemas":null,
"prefixItems":null,
"items":null,
"contains":null,
"properties":null,
"patternProperties":null,
"additionalProperties":null,
"propertyNames":null,
"unevaluatedItems":null,
"unevaluatedProperties":null,
"type":"string",
"enum":null,
"const":null,
"multipleOf":null,
"maximum":null,
"exclusiveMaximum":null,
"minimum":null,
"exclusiveMinimum":null,
"maxLength":null,
"minLength":null,
"pattern":null,
"maxItems":null,
"minItems":null,
"uniqueItems":null,
"maxContains":null,
"minContains":null,
"maxProperties":null,
"minProperties":null,
"required":null,
"dependentRequired":null,
"format":null,
"contentEncoding":null,
"contentMediaType":null,
"contentSchema":null,
"title":null,
"description":"Unique identifier for the account",
"default":null,
"deprecated":null,
"readOnly":null,
"writeOnly":null,
"examples":null,
"discriminator":null,
"xml":null,
"externalDocs":null,
"example":"acc_123456789"
},
Look at the sheer number of configuration items that can be used to ‘fine tune’ the tool.
Code
Check out the sample schema given to the OpenAPIToolset to convert:
I have tested the above with the included agent. If you want to recreate the tools you need to uncomment the main section in ‘agent.py’ (at the end of the file). The file can be found here:
The agent can be tested using ‘adk web’ just outside the folder that contains the ‘agent.py’. Note: I have not implemented the server but you can use the trace feature in adk web to confirm that the correct tool calls are made or vibe code your way to the server using the test spec.
The following files represent the dump of the tools associated with each path + REST verb combination that have been dynamically created for our agent by OpenAPIToolset:
The last one is interesting as it introduces the ‘items’ property in the schema: we create a list property called ‘accounts’ that represents the list of retrieved accounts, which in turn contains the definition of each ‘item’ in the list, i.e. the schema for an account.
Part One of this post can be found here. TLDR is here.
Upgrading the Multi-Agent System
In this part I remove the need for a centralised router and instead package each agent as an individual unit (both a server and a client). This is a move from a single server hosting multiple agents to one server per agent. Figure 1 shows an example of this.
We use code from Part 1 as libraries to create a generic framework that allows us to easily build agents as servers supporting a standard interface (Google’s A2A). The libraries allow us to standardise the loading and running of agents written in LangGraph or ADK.
With this move we need an Active Agent Registry to ensure we register and track every instance of an agent. I implement a policy that blocks the activation of any agent if its registration fails: no orphan agents. The Agent Card provides the skills supported by the registered agent, which other agents can use to discover its capabilities.
This skills-based approach is critical for dynamic planning and orchestration that gives us the maximum flexibility (while we give up on control and live with a more complex underlay).
Figure 1: Framework to create individual agents using existing libraries.
With this change we no longer have a single ‘server’ address that we can default to. Instead our new Harness Agent Test client needs to either dynamically look up the address of the user-provided agent name or have an address provided for the agent we want to interact with.
Figure 2: Registration, discovery, and Interaction flow.
Figure 2 above shows the process of discovery and use. The Select and Connect stage can be either:
Manual – where the user looks up the agent name and corresponding URL and passes it in to the Agent Test client.
Automatic – where the user provides the agent name and the URL is looked up at runtime.
Distributed Multi-Agent System
The separation of agents into individual servers allows us to connect them to each other without tight coupling. Each agent can be deployed in its own container. The server creation also ensures high cohesion within the agent.
The Contact Agent tool allows the LLM inside the agent to evaluate the user’s request, decide the skills required, map them to the relevant agent name and use that to direct the request. The tool looks up the URL based on the name, initiates a gRPC-based A2A request and returns the answer to the calling agent. Agents that don’t have a Contact Agent tool will not be able to request help from other agents; this can be used as a mechanism to control interaction between agents.
In Figure 3 the user starts the interaction (via the Agent Test Client) with Agent 1. Agent 1 as part of the interaction requires skills provided by Agent 2. It uses the Contact Agent tool to initiate an A2A request to Agent 2.
All the agents deployed have their own A2A endpoint to receive requests. This can make the whole setup peer-to-peer if we provide a model that can respond both to human input and to requests from other agents, and do not restrict the Contact Agent tool to specific agents. The interaction can then start anywhere and follow different paths through the system.
Figure 3: Interaction and agent-to-agent using contact agent tool.
This flexibility of multiple paths is shown in Figure 4. The interaction can start from Agent 1 or Agent 3. If we provide the Contact Agent Tool to Agent 2 then this becomes a true peer-to-peer system. This is where the flexibility comes to the fore as does the relative unpredictability of the interaction.
Figure 5: The different agents and what roles they play.
Figure 5 shows the test system in all its glory. All the agents shown are registered with the registry and therefore are independently addressable by external entities including the Test Client.
The main difference between the agents that have access to the Contact Agent tool and the ones that don’t is the ability to contact other agents. The Dice Roller agent, for example, does not have this ability. I can still connect directly with it and ask it to roll a dice for me, but if I ask it to move an enemy it won’t be able to help (see Figure 6).
On the other hand if I connect with the main agent (local or non-local variant) it will be aware of Dice Roller and Enemy Mover (a bit of Dungeons and Dragons theme here).
Figure 6: Every agent is independently addressable – they just have different levels of awareness about other agents in the system.
There is an interesting consequence of pre-populating the agent list: the agents that are request generators (the ones with the Contact Agent tool in Figure 6) need to be instantiated last, otherwise their lists will not be complete. If two agents need to contact each other then the current implementation will fail, as the agent registered first will not be aware of any agents registered afterwards. Therefore, the current implementation cannot support true peer-to-peer multi-agent systems. We will need dynamic agent-list creation (perhaps before every LLM interaction), but this can slow the request process.
Setup
Each agent is now an executable in its own right. We set up the agent with the appropriate runner function and pass it to a standard server creation method that brings in gRPC support alongside hooks into the registration process.
These agents can be found in the agent_as_app folder.
The generic command to run an agent (when in the multi_agent folder):
python -m agent_as_app.<agent py file name>
As an example:
python -m agent_as_app.dice_roller
Once we deploy and run the agent, it will attempt to contact the registration app and register itself. The registration server, which must be run separately, can be found in the agent_registry folder (app.py being the executable).
You will need a Redis database instance running as I use it for the session memory.
Contact Agent Tool
The Contact Agent tool gives agents the capability of accessing any registered agent on demand. When the agent starts it gets a list of registered agents and their skills (a limitation that can be overcome by making agents aware of registrations and removals) and stores this as a directory of active agents and skills, along with the URL to access each agent.
This is then converted into a standard discovery instruction for that agent. As long as the agent instance is available the system will work; this can be improved with dynamic lookups and an event-based registration mechanism.
The Contact Agent tool uses this private ‘copy’ to look up the agent name (provided by the LLM at time of tool invocation), find the URL and send an A2A message to that agent.
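A minimal sketch of such a tool, assuming a directory snapshot keyed by agent name. The names, URLs and structures below are my own illustrative assumptions, not the post’s actual implementation, and the stub simply formats a string where the real tool would send a gRPC-based A2A message.

```python
# Illustrative directory snapshot taken at agent startup.
agent_directory = {
    "dice_roller": {"url": "http://localhost:9001", "skills": ["roll_dice"]},
    "enemy_mover": {"url": "http://localhost:9002", "skills": ["move_enemy"]},
}

def contact_agent(agent_name: str, message: str) -> str:
    """Resolve an agent name (chosen by the LLM at tool-invocation time)
    to a URL and forward the message to it."""
    entry = agent_directory.get(agent_name)
    if entry is None:
        # Stale-directory failure mode described above: an agent that
        # registered after our snapshot was taken is invisible to us.
        return f"error: unknown agent '{agent_name}'"
    # A real implementation would send a gRPC-based A2A request here.
    return f"sent to {entry['url']}: {message}"

print(contact_agent("dice_roller", "roll a d6"))
```

The dynamic-lookup improvement mentioned above would replace the module-level snapshot with a registry query inside contact_agent, at the cost of an extra network hop per call.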
Enablers
The server_build file in lib has the helper methods and classes. The important ones are:
AgentMain that represents the basic agent building block for the GRPC interface.
AgentPackage that represents the active agent including where to access it.
register_thyself method (more D&D theme) is the hook that makes registration a background process as part of the run_server convenience method (in the same file).
Examples
The interaction above uses the main_agent_local (see Figure 5) instead of main_agent as the interaction point. The yellow lines represent two interactions between the user and the main_agent_local via the Harness Agent Test client.
The green line represents the information main_agent_local gathered by interacting with the dice_roller. See screenshot below from the Dice Roller log which proves the value 2 was generated by the dice_roller agent.
The red lines represent interactions between the main_agent_local and the enemy_mover. See the corresponding log from enemy_mover below.
Key Insight: Notice how between the first and second user input the main_agent_local managed to decide what skills it needed and what agents could provide that. Both sessions show the flexibility we get when using minimal coupling and skills-based integration (as opposed to hard-coded integration in a workflow).
Results
I have the following lessons to share:
Decoupling and skills-based integration appear to work, but standardising them across a big org will be the real challenge, including arriving at org-wide skill descriptions and boundaries.
Latency is definitely high but I have also not done any tuning. LLMs still remain the slowest component and it will be interesting to see what happens when we add a security overlay (e.g., Agentic ID&A that controls which agent can talk with which other agent).
A2A is being used in a lightweight manner. Questions still remain on the performance aspect if we use it in anger for more complex tasks.
The complexity of application management provides scope for a standard underlay to be created. In this space, H1 2026 will bring a lot more maturity to the tool offerings; Google and Microsoft have already showcased some of these capabilities.
Building a single agent is easy and models are quite capable, but do not fall for the deceptive ease of single agents. Gen AI apps are still better unless you want a sprawl of task-specific agents that don’t talk to each other.
Models and Memory
Another advantage of this decoupling is that we can have different agents use different models and have completely isolated resource profiles.
In the example above:
main_agent_local – uses Gemini 2.5 Flash
dice_roller – uses locally deployed Phi-4
enemy_mover – uses locally deployed Phi-4
Memory is also a common building block. It is indexed by user id and session id (both randomly generated by the Harness Agent Test client).
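A minimal sketch of that keying scheme, with a plain dict standing in for Redis. The key format and helper names are my own assumptions, purely for illustration.

```python
# Shared session memory keyed by user id and session id (both generated
# by the Harness Agent Test client). A dict stands in for Redis here.
store = {}

def memory_key(user_id: str, session_id: str) -> str:
    return f"memory:{user_id}:{session_id}"

def append_turn(user_id, session_id, user_input, agent_response):
    """Record one user/agent exchange under the session's key."""
    store.setdefault(memory_key(user_id, session_id), []).append(
        {"user": user_input, "agent": agent_response}
    )

append_turn("u1", "s1", "roll a dice", "you rolled a 2")
print(store[memory_key("u1", "s1")])
```

Because the key embeds both ids, every agent sharing the store sees the same conversation for a given session; per-agent memory would just add an agent name to the key.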
Next Steps
Now that I have the basics of a multi-agent system the next step will be to smoothen out the edges a bit and then implement better traceability for the A2A communications.
Try out complex scenarios with API-based tools that make real changes to data.
Explore guardrails and how they can be made to work in such a scenario.
When we use Gen AI within a workflow with predictable interactions with LLMs, we can attempt to estimate the cost and time penalties.
True agentic behaviour (not workflows being described as agentic) is characterised by:
Minimum coupling between agents but maximum cohesion within an agent.
Autonomy
Use of tools to change the environment
This brings a whole new dimension to the cost and time penalty problem, making it a lot harder to generate usable values.
Why Are Cost and Time Penalties Important
The simple answer is the two inform the sweet spot between experience, safety and cost.
Experience
Better experience involves using tricks that require additional LLM turns to reflect, plan and use tools, moving closer to agentic. While this allows the LLM to handle complex tasks with missing information it does increase the cost and time.
From a contact centre perspective think of this as:
An expert agent (human with experience) spending time with the customer to resolve their query.
Now if we did this for every case we might give a quick and ‘optimal’ resolution to every customer, but the costs would be quite high.
Safety
Safer experience involves deeper and continuous checks on the information flowing into and out of the system. When we are dealing with agents then we get a hierarchical system to keep safe at different scales.
Semantic safety checks / guardrails involve LLM-as-a-Judge, as we expect LLMs to ‘understand’. This increases the time to handle requests and adds cost through the additional calls to the Judge LLM.
Cost Model
This is the first iteration of the cost model.
Figure 1: Three levels of interaction corresponding to the numbers in the list below.
The cost model works at three levels (see Figure 1 above):
Single Model Turn – Input-LLM-Output
This is a single turn with the LLM.
Single Agent Turn – Input-Agent-Output
This is a single turn with the Agent – the output could be a tool call or a response to the user or Agent request.
Input can be from the user or from another Agent (in a Multi-agent system) or a response from the Tool.
Single Multi-agent System Turn – Input-Multi-Agent-System-Output
This is a single turn with a Multi-agent system. Input can be the user utterance or external stimulus and the output can be the response to the request.
A Multi-agent System can use deterministic, semi-deterministic, or fully probabilistic orchestration.
Single Model Turn
Item 1 in Figure 1. The base input is the prompt template, which is constant for a given agent. We can have different input prompt templates for different sources (e.g., tool response, user input etc.).
We assume a single prompt template in this iteration.
tP = Token count of the (constant) prompt template
tD = Token count of the variable input data
tO = Token count of the model output
kI = Cost per token for input tokens (Agent LLM)
kO = Cost per token for output tokens (Agent LLM)
The constant cost per turn into the model: Cc = kI * tP
The variable cost per turn into the model: Cv = kI * tD
Total Cost per turn out of the model: Co = kO * tO
Total Cost per Model Turn = Cc + Cv + Co
This can be enhanced to compensate for Guardrails that use LLM-as-a-Judge. I also introduce a multiplier to compensate for parallel guardrails per turn that use LLM-as-a-Judge.
gI = Cost per token for input tokens (Judge Model)
gO = Cost per token for output tokens (Judge Model)
mG = Multiplier - for multiple parallel guardrails that use LLMs as Judge
The variable cost per turn into the judge model: Gv = gI * (tD + tP) * mG
Total Cost per turn out of the judge model: Go = gO * tO * mG
Total Cost per Model Turn (including Governance) = Cc + Cv + Co + Gv + Go
Single Agent Turn
Here a single agent turn will consist of multiple model turns (e.g., to understand user request, to plan actions, to execute tool calls, to reflect on input etc.).
In our simple model we take the Total Cost per Model Turn and multiply it by the number of turns to get an estimate of the cost per Agent turn.
Total Cost per Single Agent Turn = nA * Total Cost per Model Turn
nA = Number of turns within a single Agent turn
But many Agents continuously build context as they internally process a request. Therefore we can refine this model by increasing the variable input token count by the same amount every turn.
Incremental variable input cost on internal turn i: Cv(i) = kI * tD * i
where i = 1 to nA
Single Turn with Multi-agent System
A single user input or event from an external system into a multi-agent system will trigger an orchestrated dance of agents where each agent plays a role in the overall system.
Given different patterns of orchestration are available from fully deterministic to fully probabilistic this particular factor is difficult to compensate for.
In a linear workflow with 4 agents we could use an agent interaction multiplier of 5 (user + 4 agents). With peer-to-peer, a given request could bounce around multiple agents in an unpredictable manner.
The simplest model is a linear one:
Total Cost of Single Multi-agent System Turn = Total Cost of Single Agent Turn * nI
nI = Number of Agent Interactions in a single Multi-agent System Turn.
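The three levels of the model can be transcribed directly into code. This is a sketch that mirrors the formulas above; the growing-context refinement is applied at the agent-turn level, and all prices and token counts in the example call are illustrative, not real model pricing.

```python
def model_turn_cost(kI, kO, tP, tD, tO, gI=0.0, gO=0.0, mG=0):
    """Cost of one model turn, optionally with LLM-as-a-Judge guardrails."""
    Cc = kI * tP               # constant input cost (prompt template)
    Cv = kI * tD               # variable input cost (data)
    Co = kO * tO               # output cost
    Gv = gI * (tD + tP) * mG   # guardrail input cost (judge model)
    Go = gO * tO * mG          # guardrail output cost (judge model)
    return Cc + Cv + Co + Gv + Go

def agent_turn_cost(nA, **kw):
    """nA internal model turns; variable context grows by tD each turn."""
    return sum(
        model_turn_cost(**{**kw, "tD": kw["tD"] * i}) for i in range(1, nA + 1)
    )

def mas_turn_cost(nI, nA, **kw):
    """Linear multi-agent system turn: nI agent interactions."""
    return nI * agent_turn_cost(nA, **kw)

# Illustrative numbers only: 3 agent interactions, 2 model turns each.
print(mas_turn_cost(3, 2, kI=1e-6, kO=2e-6, tP=1000, tD=500, tO=300))  # ~0.0141
```

Swapping the linear nI multiplier for a sampled value is what turns this point estimate into the distribution-based analysis described next.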
Statistical Analysis
Now all of the models above require specific numbers to establish costs. For example:
How many tokens are going into an LLM?
How many turns does an Agent take even if we use mostly deterministic workflows?
How many Agent interactions are triggered with one user input?
Clearly these are not constant values. They will follow a distribution in time and space.
Key Concept: We cannot work with single values to estimate costs. We cannot even work with max and min values. We need to work with distributions.
We can use different distributions to mimic the different token counts, agent interactions, and agent turns.
The base prompt template token count for example can be from a choice of values.
The input data token count and output token count can be sampled from a normal distribution.
For the number of turns within an Agent we expect a small number of turns (say 1–3) to have high probability, with longer sequences being rarer. There will be at least one agent turn, even if that results in the agent rejecting the request.
The same goes for the number of Agent interactions. This also depends on the number of agents in the Multi-agent system and the orchestration pattern used; this is where we can get the highest variability.
Key Concept: Agentic AI is touted as a large number of agents running around doing things. This takes the high variability above and expands it like a balloon: the more agents we have in a Multi-agent system, the wider the range of agent interactions triggered by a request.
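The sampling approach can be sketched as a small Monte Carlo simulation using the standard library. All distributions, weights and prices below are illustrative stand-ins of my own choosing, not the ones used for the results that follow.

```python
import random

random.seed(42)  # reproducible draws

def sample_mas_cost(kI=1e-6, kO=2e-6):
    """One Monte Carlo draw of a Multi-agent System turn cost.
    Every distribution here is an illustrative stand-in."""
    tP = random.choice([800, 1000, 1200])            # prompt template variants
    tD = max(1, int(random.gauss(500, 150)))         # input data tokens
    tO = max(1, int(random.gauss(300, 100)))         # output tokens
    nA = random.choices([1, 2, 3, 5], weights=[5, 4, 2, 1])[0]  # agent turns
    nI = random.choices([1, 2, 3, 6], weights=[4, 4, 2, 1])[0]  # interactions
    per_model_turn = kI * (tP + tD) + kO * tO
    return nI * nA * per_model_turn

costs = [sample_mas_cost() for _ in range(100_000)]
print(sum(costs) / len(costs))  # expected cost per MAS turn
```

Plotting a histogram of `costs` gives the kind of skewed, long-tailed distribution shown in the results below: the mean sits well above the median because rare high-interaction draws dominate the tail.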
Some Results
The distributions below represent the result of the cost model working through each stage repeated 1 million times. Think of this as the same Multi-agent system going through 1 million different user chats.
I have used Gemini-2.5-Flash costs for the main LLM and Gemini-2.5-Flash-Lite costs for LLM-as-a-Judge.
The blue distributions from Left-to-Right: Base Prompt Template Token Count, Data Input Token Count, and Output Token Count. These are the first stage of the Cost Model.
The green distributions from Left-to-Right: Number of turns per Agent and Number of Agent Interactions. These are the second and third stages of the Cost Model.
The orange distribution is the cost per turn.
Distributions of the Cost Model layers.
Expected Costs:
Average Cost per Multi-Turn Agent Interaction: $0.0174756
Median Cost per Multi-Turn Agent Interaction: $0.0133794
Max Cost per Multi-Turn Agent Interaction: $0.2704608
Min Cost per Multi-Turn Agent Interaction: $0.0004264
Total Cost for 1000000 Multi-Turn Agent Interactions: $17475.6268173
Distributions when we use specific choice for Turns and Agent Interactions.
Expected Costs:
Average Cost per Multi-Turn Agent Interaction: $0.0119819
Median Cost per Multi-Turn Agent Interaction: $0.0096888
Max Cost per Multi-Turn Agent Interaction: $0.0501490
Min Cost per Multi-Turn Agent Interaction: $0.0007306
Total Cost for 1000000 Multi-Turn Agent Interactions: $11981.8778405
Let us see what happens when we increase the number of Agent interactions to 4 (a small change from 2 previously).
Distribution with larger number of Agent interactions.
Expected Costs:
Average Cost per Multi-Turn Agent Interaction: $0.0291073
Median Cost per Multi-Turn Agent Interaction: $0.0232008
Max Cost per Multi-Turn Agent Interaction: $0.3814272
Min Cost per Multi-Turn Agent Interaction: $0.0004921
Total Cost for 1000000 Multi-Turn Agent Interactions: $29107.3320206
The costs almost double when agent interactions increase from 2 to 4.
Key Concept: The expected cost distribution (orange) will not be a one-time exercise. The development team will have to continuously fine-tune it based on the Agent and Multi-agent System design approach, specific prompts, and tool use.
Caveats
There are several caveats to this analysis.
This is not the final cost – this is just the cost arising from the use of LLMs in agents and multi-agent applications where we have loops and workflows.
This cost will contribute to the larger total cost of ownership, which will include the operational setup to monitor these agents, compute, storage, and the cost of maintaining and upgrading the agents as models change, new intents are required and failure modes are discovered.
The model output is quite sensitive to what is put in. That is why we get a distribution as part of the estimate. Individual numbers will never give us the true story.
Code
Code for the Agent Cost Model can be found below… feel free to play with it and fine tune it.
This post attempts to bring together different low level components from the current agentic AI ecosystem to explore how things work under the hood in a multi-agent system. The components include:
As ever I want to show how these things can be used to build something real. This post is the first part where we treat the UI (a command-line one) as an agent that interacts with a bunch of other agents using A2A. Code at the end of the post. TLDR here.
We do not need to know anything about how the ‘serving’ agent is developed (e.g., whether it uses LangGraph or ADK). It is about building a standard interface, plus a fair bit of data conversion and mapping to enable bi-directional communication.
The selection step requires an Agent Registry. This means the Client in the image above, which represents the Human in the ‘agentic landscape’, needs to be aware of the agents available to communicate with at the other end and their associated skills.
In this first part the human controls the agent selection via the UI.
There is a further step which I shall implement in the second part of this post where LLM-based agents discover and talk to other LLM-based agents without direct human intervention.
Key Insights
A2A is a heavy protocol – that is why it is restricted to the edge of the system boundary.
Production architectures depend on which framework is selected and that brings its own complexities which services like GCP Agent Engine aim to solve for.
Data injection works differently between LangGraph and ADK, as these frameworks work at different levels of abstraction.
LangGraph gives you full control over how you build the handling logic (e.g., whether it is even an agent) and over its input and output schemas. There are pre-made graph constructs available (e.g., the ReAct agent) in case you do not want to start from scratch.
ADK uses agents as the top level abstraction and everything happens through a templatised prompt. There is a lower level API available to build out workflows and custom agents.
Attempting to develop using the agentic application paradigm requires a lot of thinking and a lot of hard work – if you are building a customer-facing app you will not be able to ignore the details.
Tooling and platforms like AI Foundry, Agent Engine, and Copilot Studio are attempting to reduce the barriers to entry, but that doesn’t help with customer-facing applications where control and customisation are required.
The missing elephant in the room – there are no controls or responsible AI checks. That is a whole layer of complexity missing. Maybe I will cover it in another part.
Setup
There are two agents deployed in the Agent Runner Server. One uses a simple LangGraph graph (‘lg_greeter’) with a Microsoft Phi-4 mini instruct model running locally. The other (‘adk_greeter’) is an ADK agent using Gemini Flash 2.5. The API between the Client and the Agent Runner Server is A2A (supporting the Message structure).
Currently, the agent registry in the Agent Runner Server is simply a dict keyed by the string label, holding the agent artefact and the appropriate agent runner.
It is relatively easy to add new agents using the agent registry data structure.
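As a rough sketch (the runner callables and artefact shapes here are placeholders, not the actual server code), the registry-and-routing pattern might look like:

```python
# Placeholder runners: in the real server these would wrap a compiled
# LangGraph graph and an ADK Runner respectively.
def run_langgraph_agent(artefact, query: str) -> str:
    return f"lg_greeter says hello to: {query}"

def run_adk_agent(artefact, query: str) -> str:
    return f"adk_greeter says hello to: {query}"

# Dict keyed by the string label, holding the agent artefact and the
# appropriate runner for that framework.
AGENT_REGISTRY = {
    "lg_greeter": {"artefact": None, "runner": run_langgraph_agent},
    "adk_greeter": {"artefact": None, "runner": run_adk_agent},
}

def route_request(label: str, query: str) -> str:
    """Route an incoming request to the agent named by its label."""
    entry = AGENT_REGISTRY[label]  # KeyError means an unknown agent label
    return entry["runner"](entry["artefact"], query)
```

Adding a new agent is then just a matter of adding one more entry to the dict.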
Memory is implemented at the Agent Runner Server and takes into account the user input and the agent response. It is persisted in Redis and is shared by all agents. This shared memory is a step towards agents with individual memories.
There is no remote agent to agent communication happening as yet.
Output
The command line tool first asks the user which agent they want to interact with. The labels presented are string labels so that we can identify which framework was used and run tests. The selected label is passed as metadata to the Agent Runner Server.
The labels just need to correspond to real agent labels loaded on the Agent Runner Server, where they are used to route the request to the appropriate agent running function. It also ensures that the correct agent artefact is loaded.
The code is in test/harness_client_test.py
In the test above you can see that when we ask the lg_greeter agent a question, it later remembers what was asked. Since the memory is handled at the Agent Runner level and is keyed by the user id and the session id, it is retained across agent interactions. Therefore, the other agent (adk_greeter) has access to the same set of memories.
Adding Tools
I next added a stock info tool to the ADK agent (because the LangGraph agent running on Phi-4 is less performant). The image below shows the output where I ask ADK (Gemini) for info on IBM and it uses the tool to fetch it from yfinance. This is shown by the yellow arrow towards the top.
Then I asked Phi-4 about my previous interaction, which it answered correctly (shown by the yellow arrow towards the bottom).
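As a sketch of what such a tool could look like: ADK lets you pass a plain Python function as a tool and derives the tool schema from its signature and docstring. The field names below are standard yfinance keys, but the function name and choice of fields are my assumptions, not the actual tool code:

```python
def summarise_info(info: dict) -> dict:
    """Keep only the fields the agent needs from the raw yfinance
    info dict (these are standard yfinance keys)."""
    return {
        "name": info.get("longName"),
        "price": info.get("currentPrice"),
        "currency": info.get("currency"),
    }

def get_stock_info(ticker: str) -> dict:
    """Fetch basic stock information for a ticker symbol.

    Handed to the ADK agent as a function tool. The import is local so
    the module loads even without yfinance installed."""
    import yfinance as yf  # pip install yfinance
    return summarise_info(yf.Ticker(ticker).info)

# Registering the tool on the agent (sketch):
# adk_greeter = Agent(name="adk_greeter", model="gemini-2.5-flash",
#                     tools=[get_stock_info], ...)
```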
Adding More Agents
Let us now add a new agent using LangGraph that responds to user queries but, being a Dungeons & Dragons fan, also rolls 1d6 and gives us the result! I call this agent dice (code in dice_roller.py).
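A minimal sketch of what dice_roller.py might contain. The state keys and node name are my assumptions; only the 1d6 behaviour comes from the description above:

```python
import random

def roll_d6() -> int:
    """Roll a single six-sided die."""
    return random.randint(1, 6)

def dice_node(state: dict) -> dict:
    """LangGraph-style node: respond to the query and report a 1d6 roll."""
    roll = roll_d6()
    reply = f"You said: {state.get('query', '')}. I also rolled a {roll}!"
    return {"response": reply, "roll": roll}

# Wiring it into a one-node LangGraph graph (requires langgraph):
# from langgraph.graph import StateGraph, START, END
# builder = StateGraph(dict)
# builder.add_node("dice", dice_node)
# builder.add_edge(START, "dice")
# builder.add_edge("dice", END)
# dice = builder.compile()
```

Registering it is then one more entry in the agent registry dict.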
You can see that we now have three agents to choose from. The yellow arrows indicate agent choice. Once again we can see how we address the original question to the dice agent, the subsequent one to lg_greeter, and the last two to adk_greeter.
A thing to note is the performance of Gemini Flash 2.5 on the memory recall questions.
You will need a Gemini API key and access to my genai_web_server and locally deployed Phi-4. Otherwise you will need to change the lg_greeter.py to use your preferred model.
Check out the commands.txt for how to run the server and the test client.
In this post I will cover points 1, 2 and 4. Point 3 I feel is needed only in specific use-cases and current tooling for Long Term Memory is evolving rapidly.
Interaction between the Agentic Application, Agent Runtime and API Server.
The diagram above shows the major components of an Agentic Application. The API Server is responsible for providing an endpoint (e.g., https REST, message queue) that external applications (e.g., chat client) can use to access the Agent App 1.
The API Server then invokes the appropriate handler method when the API is invoked. The handler method is responsible for triggering the Agent Runner in the Agent Runtime that deals with the execution of the Agent App 1.
The Agent Runtime is the component that:
1. Sets up the correct session context (e.g., current session history if the conversation thread is resuming) using the Session Manager
2. Manages the Agent Runner which executes the Agent App 1 by triggering its Root Agent.
Remember as per ADK one Agentic Application can only have one Root Agent per deployment.
The Agent Runner is then responsible for finishing the agentic app execution (including handling any errors). Agent Runtime then cleans up after the Agent Runner and returns any response generated by the Agent App 1 back to the handler where it can be returned to the caller using API constructs.
Key Concept: If the Agentic App does not have a sequential workflow and instead depends on LLMs or contains loops then the app keeps going till it emits a result (or an error).
This makes it difficult to set meaningful time-outs for request-response style and we should look at async APIs (e.g., message based) instead.
API Server
REST API Handler function example using Flask.
The code above shows the API handler function using Flask server.
Lines 77 – 80 are all about extracting data from the incoming request to deal with User, Session management and the Query integration (e.g., incoming text from the user for a chat app). Here we assume the requesting application manages the User Id (e.g., a chat app that handles the user authentication and authorisation) and Session Id.
Lines 82-84 are all about setting up the session store if no existing session context is found. This will usually trigger when the user first engages with the agent at the start of a new conversation. It is indexed by User Id and Session Id.
Key Concept: The session boundary from an agent’s perspective is something that needs to be decided based on the use-case and the experience desired.
Line 88 is where the Agent Runtime is triggered in an async manner with the Application Name, User Id, Session Id, and the Query. The Application Name is important in case we have multiple Agentic Applications being hosted by the same Agent Runner. We would then have to change the session store to also be indexed by the App Name.
Line 90 extracts the final response of the Agent from the session state and is executed once the Agent Runtime has finished executing the Agentic Application (Line 88) which as per our Key Concept earlier is when the final result or an error is produced by the Root Agent.
Beyond Line 90 the method simply extracts the results and returns them.
This code is only meant to show the API Server interacting with the Agent Runtime and must not be used in production. In production, use an async API style that decouples the API Server from the Agent Runner.
Running Agents and Session State Management
ADK defines a session as a conversation thread. From an Agentic App perspective we have three things to think about when it comes to sessions:
Application Name
User ID
Session ID
These three items when put together uniquely identify a particular application handling requests from a given user within a specific conversation thread (session).
Typically, session management requires managing state and lots of record keeping. Nothing very interesting, which is why ADK provides a few different types of Session Services:
InMemorySessionService – the most basic Session Service, only suitable for demos and learning more about session management.
DatabaseSessionService – a persisted version of the Session Service.
VertexAISessionService – the pro version which utilises the VertexAI platform to manage the session state. The best thing to use with Agent Engine for production workloads.
In this post I use the InMemorySessionService to show how session management works and how we execute an agentic application.
Setting up the Agent Runner and session in ADK.
The main method (Line 62 onwards in the above – invoked on Line 88 in the previous listing) represents the Agent Runtime (e.g., Agent Engine in GCP) triggering the agents it is hosting (Agent App 1). It takes the App Name, Session Id, User Id, and the incoming Query as described previously.
The Agent Runner is setup on Line 64.
On Line 66 the Agent Runtime initialises the current session (an instance of InMemorySessionService) in ADK and provides the starting session state from the session state store. This will either be a freshly initialised session (blank state) or an existing session as per the logic shown previously.
Finally, on Line 69 we use the ‘call_agent’ method to configure and execute the Agent Runner. As you can see we are passing the ‘root_agent’, current session, and other details like session Id and query to this method.
Running the Agentic App.
This is the fun bit now…
Lines 31 and 32 are all about extracting what we received from the external application (in this case what the user typed in the chat box) and preparing a ‘default’ response (in case of issues with agent execution).
Lines 35-42 are where the core execution happens for Agent App 1. Since the app is running in async mode it goes through a set of steps where the root agent is triggered and it in turn triggers sub-agents and tools as needed. The async for loop goes through the responses until the root agent provides the final response, signalling the end of the execution of the app. The final response is extracted and stored for eventual return to the API Server.
Lines 46-51 simply extract the final session state and log it. Nothing interesting there unless you are after an audit trail.
Lines 55-58 are where we build up the session, which allows the agents to remember the context and previous inputs/outputs in the conversation session. We extract the current state from the state store and add the user’s request and the agent’s response to it (think of it as adding a request-response pair). Finally, the state store is updated (using the ‘history’ key) so that when the user responds to the agent’s current output, the session history is available to guide the agent on what to do next.
The session history is also called Short Term Memory. When you use the VertexAISessionService with Agent Engine or the ‘adk web’ testing utility you get all of this for free. But now you know how it works!
Logging and Monitoring
Line 35 is where we enter the mysterious async-probabilistic realm of Agent Execution and we need logging and monitoring to help us comprehend the flow of the agent execution as the control passes between Agents and from Agents to tools.
Utilities like ‘adk web’ show the flow of control within the Agentic Application through a connected graph. But how does this work? What mechanisms are available for developers to get telemetry information? By default, Agent Runtimes like Google’s Agent Engine provide built-in capability to generate telemetry using the OpenTelemetry standard that can then be consumed by the likes of Cloud Trace or AgentOps.
In this section we look at the internals of the root agent and see how we collect information as it executes. I also show my own (vibe-coded no less) version of the ‘adk web’ flow visualisation.
Callbacks for the Root Agent.
The root agent is defined on Line 97, using the standard LlmAgent constructor, through to Line 103.
We see the usual parameters for the name, model, description, instruction (prompt), tools (using Agent as Tool paradigm) and the output key for the root agent.
Then come the callbacks (Lines 103-108) that allow us to track the flow of the Agent application. There are six types of callbacks and between them they tap the strategic points in the Agent-Agent and Agent-Tool flows.
The six callbacks supported by ADK and their touchpoints in the Agentic Application.
Before and After Tool: wraps a tool call. This allows us to tap into all the tool calls the agent makes and any responses returned by the tool. This is also the place to execute critical guardrails around tool calling and responses.
Before and After Model: wraps the call to the Large Language Model. This allows us to tap into all the prompts going into the LLM and any responses returned by the Model. This is also the place to execute critical guardrails around input prompts and responses – especially to ensure LLMs are not called with unsafe prompts and any harmful responses blocked.
Before and After Agent: wraps the call to the agent which allows us to tap into all the inputs going into the agent (including user inputs and agent requests) and any outputs.
These callbacks are defined at the level of the agent, therefore they can be used to track the flow through Agent App 1 as it passes from one agent to another.
Callbacks for the Sub-Agent.
The above shows callbacks registered for the Sub-Agent (named info_gather_agent).
Callbacks
Two examples of callbacks: After Tool and Before Agent.
The above shows two of the six possible callbacks for the root agent. The standard pattern I have used is:
Log the input provided using python logging for audit purposes.
Record the complete trace of the short term memory as a memory record (in MongoDb) for us to visualise the flow.
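That two-step pattern can be sketched for the After Tool hook as follows. The callback signature follows ADK’s documented after-tool shape; the document layout is my assumption, and `collection` can be any object with `insert_one` (e.g., a pymongo collection):

```python
import logging

logger = logging.getLogger("agent_trace")

def make_after_tool_callback(collection):
    """Build an after-tool callback bound to a MongoDB-style collection."""
    def after_tool_callback(tool, args, tool_context, tool_response):
        # 1. Log the call via Python logging for audit purposes.
        logger.info("tool=%s args=%s response=%s",
                    tool.name, args, tool_response)
        # 2. Record a memory record for the flow visualisation.
        collection.insert_one({
            "type": "tool_call",
            "tool": tool.name,
            "args": args,
            "response": tool_response,
        })
        return None  # do not modify the tool response
    return after_tool_callback
```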
One example flow of Human to Root Agent to Sub-Agent to Tool and so on.
In the flow diagram above we see:
Time flowing from top of the image to the bottom.
Human input as the blue dots.
Purple dots are the Sub-Agent (called info_gather_agent).
Green dots are the Root Agent’s response.
Yellow dots are tool calls.
Given we use Agent as tool for communication we see Yellow -> Purple -> Yellow signifying the Root Agent invoking the Sub-Agent as a tool.
Yellow -> Green -> Yellow is the Sub-Agent responding to the Root Agent and the Root Agent processing that input.
Green -> Blue -> Yellow is the Root Agent responding to the Human and the Human responding with a follow up question.
This visualisation was completely vibe-coded based on the document structure of the memory record in MongoDb.
Key Concept: Note the short return trace after ‘were we talking about aapl’ the last question from the Human. The Root Agent does not need to engage with any other agent or tool. It can simply examine the history we have been collecting to answer the question.
Chat App output can be seen below (entry in blue is the human asking questions and grey is the Root Agent responding):
The Chat web-app was also vibe coded in 10 mins just to have something visual to show instead of Postman requests.
It also shows how the web-app could collect User Id and Session Id defaulting to the ‘application-led’ ID and Authorisation model.
We can play around with the session Ids to give parallel chat experience (same user with two open sessions) – see above.
session_1759699185719 – talking about IBM stock (yellow boxes)
session_1759699185720 – talking about AAPL stock (red box)
Code
By now I hope you are ready to start playing around under the hood to improve your understanding of the tech. Remember concepts never die – they just become abstract implementations.
How does one identify a ‘real’ AI Agent developer? They don’t use ‘adk web’ to demo their agent.
Silly jokes aside, in this post I want to talk about building an app with AI Agents. The focus is on the GCP stack. Google’s Agent Engine – ADK combination looks like a really good option for pro-code development that abstracts out many common tasks.
I do not like magical abstraction as it blocks my understanding of ‘how things work’. Therefore, I will use Google’s ADK to build our agents and, instead of hosting it on Agent Engine, I will write the application that provides a snug and safe environment for the agents.
This will be a text heavy post where I introduce key aspects. I will follow this up with a code heavy set of posts looking at Sessions, Memory, and Telemetry.
Let us first look at what Agent Engine helps us out with as an Agent developer:
Session State: the internal state that the agent uses to carry out the task – includes knowledge and facts gathered
Long Term Memory: the long term memory created by compressing and updating based on the internal state of the agent
Logging/Telemetry: the data stream coming from the agent that helps us monitor, troubleshoot and audit.
Agent Runner and API: the infrastructure that runs the agents and creates the API for our agent to be published for use.
All of the things in the list above are done for us when we use ‘adk web’ or similar ‘agent testing’ tool. This container for the agents is the ‘Agent Application’.
Agent Engine and ‘adk web’ are two examples of ‘Agent Applications’ that we can use to deploy Agents written in ADK. Agent Engine also supports LangGraph and other frameworks.
Writing an Agent Application
Writing an Agent Application requires a deep dive into concepts that are critical to agent operations: session, memory, and telemetry. Fortunately, ADK gives us all the core capabilities to build these out. For example, when using adk web our agents remember what the user said before; this is because the adk web Agent Application is taking care of session management for us.
One also needs to understand the lifecycle of an interaction so let us look at this first.
Interaction Lifecycle
In any agentic interaction there is usually a user or external system (or agent), there is also an agentic application they interact with, and the interaction is usually bound to an intent called a session (e.g., I want to open a new account). The interaction continues till the intent is satisfied or rejected.
The interaction lifecycle gives us scope for any state and memory we use.
User: Scope is the user. This typically will be information associated with the user like their username, preferences etc. This will span application and session scope. Identified by ‘user_id’.
Application: Scope is the application. This typically will be information required to operate the app. This allows the same agent to have different application state when used in different applications. For example, credentials for agents, application level guardrails (e.g., do not give advice) and global facts. Identified by ‘app_name’.
Session: Scope is the current session. This will contain the short-term (or scratchpad) state associated with the ongoing interaction for the agent to use while the session is active. This includes the user’s inputs, like the reason to contact, and any responses by the agent. The client in this case will need to ensure state management by reusing the same user, session and app ID when it wishes to maintain session continuity (e.g., in case of a disconnect).
Now let us look at the Session, Memory and Telemetry.
Session
Session is all about the local state of the Agent. ADK has the Session object that provides the data structure to store session info and a SessionService object that manages the session data structure.
The SessionService is responsible for the entire session state management and Google do not want us to manually write to the session object. The only two special cases are when the interaction starts (to preload state) and when it finishes (to save state). Even for these we are expected to use the appropriate lifecycle callbacks. More on callbacks later.
SessionService comes in three flavours:
InMemorySessionService – for testing locally (outside Agent Engine).
VertexAISessionService – for production deployment using Agent Engine (which abstracts the persistence away).
DatabaseSessionService – developer managed generic persistent session service for production environment that can use any SQL db.
The Session object is identified by the combination of session_id, user_id, and app_name.
Session object has the following properties:
History – an automatically updated list of significant events (managed by ADK).
Session State – the temporary state of the agent to be managed by the application for use by the agent as a scratchpad.
Activity Tracking – this is a timestamp of the last event for tracking (managed by ADK).
A key question for me is how one manages shared user sessions (e.g., in the case of a joint bank account). Some creative coding and user management is needed.
Memory
This represents the session state processed and stored as a ‘memory’ for use in later interactions. The retrieval happens based on the user_id and app_name to ensure compartmentalisation.
The MemoryService manages the creation, management, and retrieval of memories. As developers we do not have any control over the memory creation and management process. If we want full control, there are examples of accessing closed-session state, or checkpointing running-session state, to create and manage memories manually.
It comes in two flavours:
InMemoryMemoryService – nicely named service for when we are testing.
VertexAIMemoryBank – to be used with Agent Engine which abstracts the process and provides persistence.
Telemetry
Telemetry includes all information that flows in and out of an agent. These streams then have to be tied up together as information flows through a system with multiple interacting agents to identify issues and troubleshoot when things go wrong.
Without telemetry we will not be able to operate these systems in production. Regulated industries like banking and healthcare have additional audit requirements on top. Therefore, focus on what to collect; this post will talk about how to collect it.
When using Agent Engine a variety of tools are available that can collect the OpenTelemetry-based feeds produced by it. Cloud Trace is the GCP native service for this.
Collecting with Callbacks
ADK supports six basic callbacks. These enable us to access state and input/output information as the agent operates.
Before/After LLM: Callbacks triggered just before and after the LLM is engaged. Use this to audit LLM request and response as well as interception for guardrails (checking input for relevance, harm etc. and checking output for quality and content).
Before/After Tool: Callbacks triggered just before and after a tool is engaged. Use this to audit tool use (inputs and outputs) as well as for debugging tool requests generated by the LLM.
Before/After Agent: Callbacks triggered just before and after the agent engages. This is used to audit inputs going into the agent and the response produced by the agent. This is specifically the input and output to the user.
LLM and Tool Callbacks tell us about the inner workings of the agent (private thoughts). Agent callbacks tell us about the external workings of the agent (public thoughts).
Callbacks are the Google recommended mechanism for changing state and they also provide a set of patterns (sharing links here):
Remember callbacks must be implemented in a manner that:
Does not add latency – use async mechanism
Does not introduce parallel logic for processing data
Does not become an alternative integration point
Logging
ADK also supports the Python logging sub-system. If the DEBUG level is used then every aspect of a running agent is logged, and more. But this increases the data-processing load and tends to slow down the processing.
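For local debugging that might look like the following (the ‘google_adk’ logger namespace is my assumption; check the logger names appearing in your own output and adjust):

```python
import logging

# Sensible default for everything else, verbose output for the ADK loggers.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s")
logging.getLogger("google_adk").setLevel(logging.DEBUG)
```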
High data volumes will require significant processing (including correlation of different streams, root cause analysis, and time-series analysis) before it can be converted into an event stream prioritised by criticality to enable Human-on-the-loop (human as a supervisor). I do not expect human-in-the-loop (human as an approver) to be a viable method for controlling and monitoring multi-agent systems.