Cost of Agentic

When we use Gen AI within a workflow with predictable interactions with LLMs, we can attempt to estimate the cost and time penalties.

Now agentic (true agentic – not workflows being described as agentic) is characterised by:

  • Minimum coupling between agents but maximum cohesion within an agent.
  • Autonomy
  • Use of tools to change the environment

This brings a whole new dimension to the cost and time penalty problem, making it a lot harder to generate usable values.

Why Are Cost and Time Penalties Important

The simple answer is that the two inform the sweet spot between experience, safety and cost.

Experience

Better experience involves using tricks that require additional LLM turns to reflect, plan and use tools, moving closer to agentic. While this allows the LLM to handle complex tasks with missing information it does increase the cost and time.

From a contact centre perspective think of this as:

  • An expert agent (human with experience) spending time with the customer to resolve their query.

Now if we did this for every case then we may end up giving a quick and ‘optimal’ resolution to every customer but the costs will be quite high.

Safety

Safer experience involves deeper and continuous checks on the information flowing into and out of the system. When we are dealing with agents, we get a hierarchical system that must be kept safe at different scales.

Semantic safety checks / guardrails involve LLM-as-a-Judge as we expect LLMs to ‘understand’. This increases the time to handle requests as well as increases costs due to additional LLM calls to the Judge LLM.

Cost Model

This is the first iteration of the cost model.

Figure 1: Three levels of interaction corresponding to the numbers in the list below.

The cost model works at three levels (see Figure 1 above):

  1. Single Model Turn – Input-LLM-Output
    • This is a single turn with the LLM.
  2. Single Agent Turn – Input-Agent-Output
    • This is a single turn with the Agent – the output could be a tool call or a response to the user or Agent request.
    • Input can be from the user or from another Agent (in a Multi-agent system) or a response from the Tool.
  3. Single Multi-agent System Turn – Input-Multi-Agent-System-Output
    • This is a single turn with a Multi-agent system. Input can be the user utterance or external stimulus and the output can be the response to the request.
    • A Multi-agent System can use deterministic, semi-deterministic, or fully probabilistic orchestration.

Single Model Turn

Item 1 in Figure 1. The base for input will be the prompt template which will be constant for a given input. We can have different input prompt templates for different sources (e.g., for tool response, user input etc.).

We assume a single prompt template in this iteration.

kI = Cost per token for input tokens (Agent LLM)
kO = Cost per token for output tokens (Agent LLM)
tP = Token count of the prompt template
tD = Token count of the variable input data
tO = Token count of the output
The constant cost per turn into the model: Cc = kI * tP
The variable cost per turn into the model: Cv = kI * tD
Total Cost per turn out of the model: Co = kO * tO
Total Cost per Model Turn = Cc + Cv + Co
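The formulas above can be sketched as a small function. This is only an illustration, not the code from the linked repository: it reads tP, tD and tO as the prompt template, input data and output token counts, and the per-token prices are made-up placeholder values.

```python
def model_turn_cost(k_in, k_out, t_prompt, t_data, t_out):
    """Cost of one LLM turn: Cc + Cv + Co."""
    c_const = k_in * t_prompt  # Cc: prompt template tokens (constant per turn)
    c_var = k_in * t_data      # Cv: variable input data tokens
    c_out = k_out * t_out      # Co: output tokens
    return c_const + c_var + c_out


# Placeholder prices in USD per token (assumptions, not real model pricing).
K_IN = 0.30 / 1_000_000
K_OUT = 2.50 / 1_000_000

cost = model_turn_cost(K_IN, K_OUT, t_prompt=500, t_data=200, t_out=300)
print(f"Cost per model turn: ${cost:.6f}")
```

Note that output tokens dominate this example even though there are fewer of them, because the assumed output price is much higher than the input price.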

This can be enhanced to account for Guardrails that use LLM-as-a-Judge. I also introduce a multiplier to account for multiple parallel guardrails per turn that use LLM-as-a-Judge.

gI = Cost per token for input tokens (Judge Model)
gO = Cost per token for output tokens (Judge Model)
mG = Multiplier - for multiple parallel guardrails that use LLMs as Judge
The variable cost per turn into the judge model:  Gv = gI * (tD + tP) * mG
Total Cost per turn out of the judge model: Go = gO * tO * mG
Total Cost per Model Turn (including Governance) = Cc + Cv + Co + Gv + Go
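A minimal sketch of the governance terms follows, under the same reading of tP, tD and tO as template, data and output token counts. The function name and the judge prices are placeholders chosen for illustration.

```python
def judge_cost(g_in, g_out, t_prompt, t_data, t_out, m_g=1):
    """Extra cost per model turn from LLM-as-a-Judge guardrails: Gv + Go."""
    g_var = g_in * (t_data + t_prompt) * m_g  # Gv: judge reads the full input
    g_o = g_out * t_out * m_g                 # Go: judge output term, as in the formula
    return g_var + g_o


# Placeholder judge prices in USD per token (assumptions).
G_IN = 0.10 / 1_000_000
G_OUT = 0.40 / 1_000_000

extra = judge_cost(G_IN, G_OUT, t_prompt=500, t_data=200, t_out=300, m_g=2)
print(f"Guardrail cost per model turn: ${extra:.6f}")
```

Setting m_g to 0 recovers the ungoverned cost, which makes it easy to compare the two cases side by side.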

Single Agent Turn

Here a single agent turn will consist of multiple model turns (e.g., to understand user request, to plan actions, to execute tool calls, to reflect on input etc.).

In our simple model we take the Total Cost per Model Turn and multiply it by the number of model turns to get an estimate of the cost per Agent turn.

Total Cost for Single Agent Turn = nA * Total Cost per Model Turn

nA = Number of model turns within a single Agent turn

But many Agents continuously build context as they internally process a request. Therefore we can refine this model by increasing the variable input token count by the same amount every turn.

Incremental Cost of Variable Input = Cv += kI * tD * i
where i = 1 to nA + 1 
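The difference between the flat and the context-building estimate can be sketched as below. One assumption to flag: I read "i = 1 to nA + 1" as a Python-style range, so i takes the values 1 to nA and the data tokens at model turn i are tD * i. Guardrail costs are left out to keep the sketch short.

```python
def agent_turn_cost(k_in, k_out, t_prompt, t_data, t_out, n_a, grow_context=True):
    """Cost of one Agent turn made of n_a model turns.

    With grow_context=True the variable input grows by t_data every model turn;
    otherwise every model turn costs the same (the simple nA * cost model).
    """
    total = 0.0
    for i in range(1, n_a + 1):
        data_tokens = t_data * i if grow_context else t_data
        total += k_in * t_prompt + k_in * data_tokens + k_out * t_out
    return total


# Placeholder prices in USD per token (assumptions, not real model pricing).
K_IN, K_OUT = 0.30e-6, 2.50e-6

flat = agent_turn_cost(K_IN, K_OUT, 500, 200, 300, n_a=3, grow_context=False)
growing = agent_turn_cost(K_IN, K_OUT, 500, 200, 300, n_a=3)
print(f"flat: ${flat:.6f}  growing context: ${growing:.6f}")
```

Under this reading the variable input cost grows with nA * (nA + 1) / 2 rather than with nA, so long internal loops become disproportionately expensive.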

Single Turn with Multi-agent System

A single user input or event from an external system into a multi-agent system will trigger an orchestrated dance of agents where each agent plays a role in the overall system.

Given that different patterns of orchestration are available, from fully deterministic to fully probabilistic, this particular factor is difficult to account for.

In a linear workflow with 4 agents we could use an agent interaction multiplier set to 5 (user + 4 agents). With a peer-to-peer pattern, a given request could bounce around multiple agents in an unpredictable manner.

The simplest model is a linear one:

Total Cost of Single Multi-agent System Turn = Total Cost of Single Agent Turn * nI

nI = Number of Agent Interactions in a single Multi-agent System Turn.

Statistical Analysis

Now all of the models above require specific numbers to establish costs. For example:

  • How many tokens are going into an LLM?
  • How many turns does an Agent take even if we use mostly deterministic workflows?
  • How many Agent interactions are triggered with one user input?

Clearly these are not constant values. They will follow a distribution in time and space.

Key Concept: We cannot work with single values to estimate costs. We cannot even work with max and min values. We need to work with distributions.

We can use different distributions to mimic the different token counts, agent interactions, and agent turns.

The base prompt template token count for example can be from a choice of values.

The input data token count and output token count can be sampled from a normal distribution.

For the number of turns within an Agent we expect a low number of turns (say 1-3) to have a high probability, with longer sequences being rarer. We will have at least one agent turn even if that results in the agent rejecting the request.

Same goes for the number of Agent interactions. This also depends on number of agents in the Multi-agent system and the orchestration pattern used. Therefore, this is the place where we can get high variability.

Key Concept: Agentic AI is touted as a large number of agents running around doing things. This takes the high variability above and expands it like a balloon: the more agents we have in a Multi-agent system, the wider the range of agent interactions triggered by a request.
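To make the idea of working with distributions concrete, here is a minimal Monte Carlo sketch using only the Python standard library. It is not the code from the repository linked at the end of the post. All the numbers (prices, token counts, stopping probabilities) are assumptions, the turn and interaction counts use a simple capped geometric-style sampler, and guardrail costs are omitted for brevity.

```python
import random
import statistics

# Placeholder per-token prices in USD (assumptions, not real model pricing).
K_IN, K_OUT = 0.30e-6, 2.50e-6


def sample_count(rng, p_stop, cap):
    """Sample an integer in 1..cap; small values are the most likely."""
    n = 1
    while n < cap and rng.random() > p_stop:
        n += 1
    return n


def simulate(n_samples=10_000, seed=0):
    """Return one simulated cost per Multi-agent System turn."""
    rng = random.Random(seed)
    costs = []
    for _ in range(n_samples):
        t_prompt = rng.choice([300, 500, 800])      # template: choice of values
        t_data = max(1, round(rng.gauss(200, 50)))  # input data: normal
        t_out = max(1, round(rng.gauss(300, 80)))   # output: normal
        n_a = sample_count(rng, p_stop=0.5, cap=10)  # model turns per Agent turn
        n_i = sample_count(rng, p_stop=0.5, cap=10)  # Agent interactions
        # Agent turn with context growing by t_data each model turn.
        agent_turn = sum(
            K_IN * (t_prompt + t_data * i) + K_OUT * t_out
            for i in range(1, n_a + 1)
        )
        costs.append(agent_turn * n_i)  # linear multi-agent model
    return costs


costs = simulate()
print(f"mean={statistics.mean(costs):.6f} median={statistics.median(costs):.6f}")
print(f"min={min(costs):.6f} max={max(costs):.6f}")
```

Even in this toy version the mean sits above the median, because the rare long chains of turns and interactions pull the right-hand tail out.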

Some Results

The distributions below represent the result of the cost model working through each stage repeated 1 million times. Think of this as the same Multi-agent system going through 1 million different user chats.

I have used Gemini-2.5-Flash costs for the main LLM and Gemini-2.5-Flash-Lite costs for LLM-as-a-Judge.

The blue distributions from Left-to-Right: Base Prompt Template Token Count, Data Input Token Count, and Output Token Count. These are the first stage of the Cost Model.

The green distributions from Left-to-Right: Number of turns per Agent and Number of Agent Interactions. These are the second and third stages of the Cost Model.

The orange distribution is the cost per turn.

Distributions of the Cost Model layers.

Expected Costs:

Average Cost per Multi-Turn Agent Interaction: $0.0174756
Median Cost per Multi-Turn Agent Interaction: $0.0133794
Max Cost per Multi-Turn Agent Interaction: $0.2704608
Min Cost per Multi-Turn Agent Interaction: $0.0004264
Total Cost for 1000000 Multi-Turn Agent Interactions: $17475.6268173

Distributions when we use a specific choice for Turns and Agent Interactions.

Expected Costs:

Average Cost per Multi-Turn Agent Interaction: $0.0119819
Median Cost per Multi-Turn Agent Interaction: $0.0096888
Max Cost per Multi-Turn Agent Interaction: $0.0501490
Min Cost per Multi-Turn Agent Interaction: $0.0007306
Total Cost for 1000000 Multi-Turn Agent Interactions: $11981.8778405

Let us see what happens when we change the distribution for Agent turns to 4 (a small change from 2 previously).

Distribution with larger number of Agent interactions.

Expected Costs:

Average Cost per Multi-Turn Agent Interaction: $0.0291073
Median Cost per Multi-Turn Agent Interaction: $0.0232008
Max Cost per Multi-Turn Agent Interaction: $0.3814272
Min Cost per Multi-Turn Agent Interaction: $0.0004921
Total Cost for 1000000 Multi-Turn Agent Interactions: $29107.3320206

The costs almost double when agent interactions increase from 2 to 4.

Key Concept: The expected cost distribution (orange) will not be a one-time thing. The development team will have to continuously fine-tune it based on the Agent and Multi-agent System design approach, specific prompts, and tool use.

Caveats

There are several caveats to this analysis.

  1. This is not the final cost – this is just the cost arising from the use of LLMs in agents and multi-agent applications where we have loops and workflows.
  2. This cost will contribute to the larger total cost of ownership, which will include the operational setup to monitor these agents, compute, storage, and the cost of maintaining and upgrading the agents as models change, new intents are required, and failure modes are discovered.
  3. The model output is quite sensitive to what is put in. That is why we get a distribution as part of the estimate. Individual numbers will never give us the true story.

Code

Code for the Agent Cost Model can be found below… feel free to play with it and fine tune it.

https://github.com/amachwe/agent_cost_model/tree/main

Building A Multi-agent System: Part 1

Overall Architecture.

This post attempts to bring together different low level components from the current agentic AI ecosystem to explore how things work under the hood in a multi-agent system. The components include:

  1. A2A Python gRPC library (https://github.com/a2aproject/a2a-python/tree/main/src/a2a/grpc)
  2. LangGraph Framework
  3. ADK Framework
  4. Redis for Memory
  5. GenAI types from Google

Part 2 of the post is here.

As ever I want to show how these things can be used to build something real. This post is the first part where we treat the UI (a command-line one) as an agent that interacts with a bunch of other agents using A2A. Code at the end of the post. TLDR here.

We do not need to know anything about how the ‘serving’ agent is developed (e.g., if it uses LangGraph or ADK). It is about building a standard interface and a whole bit of data conversion and mapping to enable the bi-directional communication.

The selection step requires an Agent Registry. This means the Client in the image above, which represents the Human in the ‘agentic landscape’, needs to be aware of the agents available to communicate with at the other end and their associated skills.

In this first part the human controls the agent selection via the UI.

There is a further step which I shall implement in the second part of this post where LLM-based agents discover and talk to other LLM-based agents without direct human intervention.

Key Insights

  1. A2A is a heavy protocol – that is why it is restricted to the edge of the system boundary.
  2. Production architectures depend on which framework is selected and that brings its own complexities which services like GCP Agent Engine aim to solve for.
    • Data injection works differently between LangGraph and ADK as these frameworks work at different levels of abstraction
    • LangGraph allows you full control over how you build the handling logic (e.g., is it even an agent) and what the input and output schema for it are. There are pre-made graph constructs available (e.g., ReAct agent) in case you do not want to start from scratch.
    • ADK uses agents as the top level abstraction and everything happens through a templatised prompt. There is a lower level API available to build out workflows and custom agents.
  3. Attempting to develop using the agentic application paradigm requires a lot of thinking and a lot of hard work – if you are building a customer-facing app you will not be able to ignore the details.
    • Tooling and platforms like AI Foundry, Agent Engine, and Copilot Studio are attempting to reduce the barriers to entry but that doesn’t help with customer-facing applications where control and customisation are required.
  4. The missing elephant in the room – there are no controls or responsible AI checks. That is a whole layer of complexity missing. Maybe I will cover it in another part.

Setup

There are two agents deployed in the Agent Runner Server. One uses a simple LangGraph graph (‘lg_greeter’) with a Microsoft Phi-4 mini instruct model running locally. The other is an ADK agent (‘adk_greeter’) using Gemini Flash 2.5. The API between the Client and the Agent Runner Server is A2A (supporting the Message structure).

Currently, the agent registry in the Agent Runner Server is simply a dict keyed against the string label which holds the agent artefact and appropriate agent runner.

It is relatively easy to add new agents using the agent registry data structure.
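A registry of this shape might look like the sketch below. It is not the code from the repository: RegistryEntry, run_callable_agent and dispatch are hypothetical names, and plain Python callables stand in for the real LangGraph and ADK artefacts so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class RegistryEntry:
    agent: Any                         # the agent artefact (graph, ADK agent, ...)
    runner: Callable[[Any, str], str]  # function that knows how to run that artefact


def run_callable_agent(agent, user_text):
    """Hypothetical runner: here the 'agent' is just a Python callable."""
    return agent(user_text)


# Dict keyed by the string label that the client sends as metadata.
AGENT_REGISTRY: Dict[str, RegistryEntry] = {
    "lg_greeter": RegistryEntry(
        agent=lambda text: f"[lg] hello, you said: {text}",
        runner=run_callable_agent,
    ),
    "adk_greeter": RegistryEntry(
        agent=lambda text: f"[adk] hello, you said: {text}",
        runner=run_callable_agent,
    ),
}


def dispatch(label, user_text):
    """Route a request to the agent registered under `label`."""
    entry = AGENT_REGISTRY.get(label)
    if entry is None:
        raise KeyError(f"Unknown agent label: {label!r}")
    return entry.runner(entry.agent, user_text)


print(dispatch("adk_greeter", "hi there"))
```

Adding an agent then amounts to adding one more entry to the dict, with a runner suited to whichever framework built it.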

Memory is implemented at the Agent Runner Server and takes into account the user input and the agent response. It is persisted in Redis and is shared by all agents. This shared memory is a step towards agents with individual memories.

There is no remote agent to agent communication happening as yet.

Output

The command line tool first asks the user which agent they want to interact with. The labels presented are string labels so that we can identify which framework was used and run tests. The selected label is passed as metadata to the Agent Runner Server.

The labels just need to correspond to real agent labels loaded on the Agent Runner Server, where they are used to route the request to the appropriate agent running function. It also ensures that the correct agent artefact is loaded.

The code is in test/harness_client_test.py

In the test above you can see how when we ask a question to the lg_greeter agent it then remembers what was asked. Since the memory is handled at the Agent Runner level and is keyed by the user id and the session id it is retained across agent interactions. Therefore, the other agent (adk_greeter) has access to the same set of memories.
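The sharing behaviour follows directly from how the memory is keyed. The sketch below uses an in-memory dict as a stand-in for Redis so that it runs without a server. The SharedMemory class, its method names and the key format are assumptions made for illustration, not the repository's implementation.

```python
from collections import defaultdict


class SharedMemory:
    """In-memory stand-in for the Redis-backed store (illustrative only)."""

    def __init__(self):
        self._store = defaultdict(list)

    @staticmethod
    def _key(user_id, session_id):
        # Assumed key scheme: no agent label in the key, so all agents share it.
        return f"memory:{user_id}:{session_id}"

    def remember(self, user_id, session_id, agent_label, user_text, agent_text):
        """Record one exchange: the user input and the agent response."""
        self._store[self._key(user_id, session_id)].append(
            {"agent": agent_label, "user": user_text, "response": agent_text}
        )

    def recall(self, user_id, session_id):
        """Return a copy of the history for this user and session."""
        return list(self._store.get(self._key(user_id, session_id), []))


memory = SharedMemory()
memory.remember("u1", "s1", "lg_greeter", "My name is Sam", "Hello Sam!")
memory.remember("u1", "s1", "adk_greeter", "What is my name?", "Your name is Sam.")
for turn in memory.recall("u1", "s1"):
    print(turn["agent"], "->", turn["response"])
```

Moving to per-agent memories would mostly be a matter of adding the agent label to the key.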

Adding Tools

I next added a stock info tool to the ADK agent (because the LangGraph agent running on Phi-4 is less performant). The image below shows the output where I ask ADK (Gemini) for info on IBM and it uses the tool to fetch it from yfinance. This is shown by the yellow arrow towards the top.

Then I asked Phi-4 about my previous interaction, which it answered correctly (shown by the yellow arrow towards the bottom).

Adding More Agents

Let us now add a new agent using LangGraph that responds to user queries but, being a Dungeons & Dragons fan, also rolls 1d6 and gives us the result! I call this agent dice (code in dice_roller.py).

You can see that we now have three agents to choose from. The yellow arrows indicate agent choice. Once again we can see how we address the original question to the dice agent, the subsequent one to the lg_greeter, and then the last two to the adk_greeter.
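The dice-rolling part of such an agent can be as small as the sketch below. This is not the contents of dice_roller.py: roll_dice and with_roll are hypothetical helpers, and the LangGraph wiring is left out so that the example stays self-contained.

```python
import random


def roll_dice(sides=6, count=1, rng=random):
    """Roll `count` dice with `sides` faces each and return the individual results."""
    if sides < 1 or count < 1:
        raise ValueError("sides and count must be positive")
    return [rng.randint(1, sides) for _ in range(count)]


def with_roll(answer, rng=random):
    """Append a 1d6 roll to an agent's text answer."""
    (roll,) = roll_dice(sides=6, count=1, rng=rng)
    return f"{answer} (I rolled 1d6 and got {roll})"


# A seeded generator keeps the demo repeatable; drop it for real randomness.
print(with_roll("Hello, adventurer!", rng=random.Random(42)))
```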

A thing to note is the performance of Gemini Flash 2.5 on the memory recall questions.

Possible extensions to this basic demo:

  1. Granular memory
  2. Dynamic Agent invocation
  3. GUI as an Agent

Code

https://github.com/amachwe/multi_agent

You will need a Gemini API key and access to my genai_web_server and locally deployed Phi-4. Otherwise you will need to change the lg_greeter.py to use your preferred model.

Check out the commands.txt for how to run the server and the test client.