Conversational Patterns

This post attempts to map Conversational Patterns to Agentic AI. This is because a project must not start with: ‘I want to use agentic AI to solve this problem’.

Instead it must say: ‘I need agentic AI to support the style of conversational experience that best solves this problem’.

Layers of Conversation

Every conversation over a given channel occurs between an interested/impacted party and the organisation (e.g., a customer over webchat, a colleague over the phone). Every conversation has an outcome: positive or negative for the customer party, and positive or negative for the organisation.

The ideal outcome is positive for both, but that is not always possible.

There are several layers in this conversation and each layer allows us to map to different tasks for automation.

  1. Utterance – whatever comes out of the customer’s or colleague’s (referred to as the ‘user’) mouth – processed to extract intents and constraints.
  2. Intent and constraints – aligned with organisational intents and constraints, then converted into a set of actions to achieve them.
  3. Actions – each action decomposed into a set of ordered steps.
  4. Steps – each step converted into a sequence of interactions (or requests) to various back-end systems.
  5. Request – each request made and its response processed, including errors and exceptions.

Utterance

The key task for an utterance is the decomposition into intents (what does the user want to achieve?) and constraints (what are the constraints posed by the user?). The organisational intent and constraints are expected to be understood. I am assuming here that a level of customer context (i.e., who is the customer) is available through various ‘Customer 360’ data products.

Example: ‘I want to buy a house’ -> this will require a conversation to identify intent and constraints. A follow-up question to the user may be ‘Amazing, can I confirm you are looking to discuss a mortgage today?’.

Key Task: Conversation to identify and decompose intents and constraints.

AI Capability: Highly capable conversational LLM (usually a state of the art model that can deal with the uncertainty).

The result of this decomposition would be an intent (‘buying a house with a mortgage’) with constraints (amount to be borrowed, loan-to-value, term of loan, etc.). Once we are ready we can move to the next step.
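
One way to make this decomposition machine-usable is to ask the model for structured JSON and validate it before it drives the next layer. This is a sketch under assumptions: the prompt text, expected JSON shape, and the simulated model response are all illustrative:

```python
import json

# Hypothetical shape of the utterance-decomposition step: the LLM is asked
# to return intent and constraints as JSON, which we validate structurally.
DECOMPOSE_PROMPT = (
    "Extract the user's intent and any constraints from the utterance. "
    "Respond with JSON: {\"intent\": str, \"constraints\": {name: value}}."
)

def parse_decomposition(llm_output: str) -> tuple[str, dict]:
    """Validate the model's JSON before letting it drive the next layer."""
    data = json.loads(llm_output)
    if not isinstance(data.get("intent"), str) or not isinstance(data.get("constraints"), dict):
        raise ValueError("Malformed decomposition from model")
    return data["intent"], data["constraints"]

# Simulated model response for 'I want to buy a house':
raw = '{"intent": "buy house with mortgage", "constraints": {"loan_to_value": 0.8, "term_years": 25}}'
intent, constraints = parse_decomposition(raw)
```

Validating the structure here traps malformed model output early, before the intent/constraint alignment step runs.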

Intents and Constraints

Once the intent and constraints have been identified they need to be aligned with the organisational intents and constraints using any customer context that is available to us. This is critical because this is where we want to trap requests that are either not relevant or can’t be actioned (e.g., give me a 0% life-long mortgage – what a dream!). Another constraint can be if the customer is new – which means we have no data context.

If these are aligned with the organisation then we decompose these into a set of actions. These actions should be at a level of abstraction and not mapped to specific service workflows. This step helps validate the decomposition of intents and constraints against the specific product(s) and associated journeys.

Example: Buying a house on a mortgage – specific actions could include:

  • Collect information from the customer to qualify them.
  • Do fraud, credit and other checks.
  • Provide agreement in principle.
  • Confirm terms and conditions.
  • Process acceptance.
  • Initiate the mortgage.

Key Task: Mapping intents to products and associated journeys using knowledge of the product.

AI Capability: The model being able to map various pieces of information to specific product related journeys. This will usually also require state of the art LLMs but can be supported by specific ‘guides’ or Small Language Models (SLMs). This can especially be useful if there are multiple products with similar function but very subtle acceptance criteria (e.g., products available to customers who have subscribed to some other product).

Actions

This is where the fun starts as we start to worry about the ‘how’ part. As we now have the journeys associated with the interaction the system can start to decompose these into a set of steps. There will be a level of optimisation and orchestration involved (this can be machine led or pre-defined) and the complexity of the IT estate starts to become a factor.

Example: ‘Collect information from the customer’ and ‘Do checks’.

Now the system can decide whether we collect and check or check as we collect. Here the customer context will be very important as we may or may not have access to all the information beforehand (e.g., new customer). So depending on the customer context we will decompose the Collect information action into few or many steps. These steps can be interleaved with the steps we get by decomposing the ‘Checks’ action.

By the end of this we will come up with a set of steps (captured in one or more workflows) that will help us achieve the intent without breaking customer or org constraints:

Assuming a new customer wants to apply for a mortgage…

  1. Collect basic information to register a new customer.
  2. [Do Fraud checks]
  3. Create customer’s record.
  4. Create customer’s application within the customer’s account.
  5. Collect personal information.
  6. Collect expense information.
  7. Collect employment information.
  8. Seek permission for credit check
  9. [Do credit check or stop application.]
  10. Collect information about the proposed purchase.
  11. Collect information about the loan parameters.
  12. Qualify customer.
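
The workflow above is essentially a dependency graph: the orchestrator can interleave ‘collect’ and ‘check’ steps as long as dependencies hold. A minimal sketch using Python's standard-library topological sorter, with an illustrative (simplified) subset of the steps:

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: each step maps to the steps it depends on.
steps = {
    "register_basic_info": set(),
    "fraud_check": {"register_basic_info"},
    "create_customer_record": {"fraud_check"},
    "collect_personal_info": {"create_customer_record"},
    "collect_expense_info": {"collect_personal_info"},
    "collect_employment_info": {"collect_personal_info"},
    "seek_credit_permission": {"collect_personal_info"},
    "credit_check": {"seek_credit_permission"},
    "qualify_customer": {"collect_expense_info", "collect_employment_info", "credit_check"},
}

# Any valid execution order respects the dependencies; steps with no mutual
# dependency (the three 'collect' branches) can be interleaved or parallelised.
order = list(TopologicalSorter(steps).static_order())
```

Whether the ordering is machine-led or pre-defined, representing steps this way makes the ‘optimal workflow’ question explicit: any topological order is valid, and optimisation picks among them.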

Key Tasks: The key task here is to ‘understand’ the actions and the dependencies between them, then to decompose them into a set of steps and orchestrate these into the most optimal workflow. Optimal can mean many things depending on the specific journey. For example, a high financial value journey like a mortgage for a new customer might be optimised for risk reduction and security even if the process takes longer to complete, but for an existing mortgage customer it may be optimised for speed.

AI Capability: Here we can use SLMs as a set of experts and an LLM as the primary orchestrator. We want to ensure that each Action -> Step decomposition is accurate and that the merging into a set of optimised workflows is done correctly.

Steps and Requests

Once we get a set of steps we need to decompose these into specific requests. The two steps are quite deeply connected as here the knowledge of how Steps can be achieved is critical and this is also dependent on the complexity of the IT estate.

Example: Collect basic information to register a new customer.

Given the above step we will have a mix of conversational outputs as well as function calls at the request level. If our IT estate is fragmented then whilst we collect the information once (minimal conversational interaction with the customer) our function calls will look very complex. In many organisations customer information is stored centrally but it requires ‘shadows’ to be created in several different systems (e.g., to generate physical artefacts like credit cards, passcode letters etc.). So your decomposition to requests would look like:

  1. Conversation: Collect name, date of birth, … from the customer.
  2. Function calling (reflection): check if the customer information makes sense and flag if you detect any issues.
  3. Function calling: Format data into JSON object with the given structure and call the ‘add_new_customer’ function (or tool).

Now the third step ‘Format data into JSON… ‘ could be made up of multiple critical and optional requests implemented within the ‘add_new_customer’ tool:

  1. Create master record for customer and obtain customer ID. [wait for result or fail upon issues]
  2. Initiate online account and app authorisation for customer using customer ID. [async]
  3. Initiate physical letter, card, pins, etc. using customer information. [async]
  4. Provide customer information to survey platform for a post call ‘onboarding experience survey’ [async]
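The critical/optional split above maps naturally onto blocking versus fire-and-forget calls. A sketch of how the ‘add_new_customer’ tool might orchestrate them, with all back-end service functions as illustrative stand-ins:

```python
import asyncio

# Sketch of the 'add_new_customer' tool: one critical blocking request,
# then optional asynchronous side effects. All service calls are stand-ins.

async def create_master_record(info: dict) -> str:
    return "CUST-001"                       # pretend master-record system call

async def initiate_online_account(customer_id: str) -> None: ...
async def initiate_physical_artefacts(info: dict) -> None: ...
async def send_to_survey_platform(info: dict) -> None: ...

async def add_new_customer(info: dict) -> str:
    # 1. Critical: wait for the master record, or fail the whole step.
    customer_id = await create_master_record(info)
    # 2-4. Optional: kick off asynchronously; the conversation need not wait.
    await asyncio.gather(
        initiate_online_account(customer_id),
        initiate_physical_artefacts(info),
        send_to_survey_platform(info),
    )
    return customer_id

customer_id = asyncio.run(add_new_customer({"name": "A. Person"}))
```

The LLM only sees the single ‘add_new_customer’ tool; the fan-out to shadow systems stays hidden inside it.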

Key Tasks: The key tasks here are to understand the step decomposition into requests and the specific function calls that make up a given request.

AI Capability: Here specific step -> request decomposition and then function calling capabilities are required. SLMs can be of great help here especially if we find that step to request decomposition is complex and requires dynamic second level orchestration. But pre-defined orchestrated workflows can also work well here.

Next post on how we can use Agentic AI to support Conversations: https://fisheyeview.fisheyefocus.com/2025/06/22/agentic-ai-to-support-conversations/

Model Context Protocol (MCP)

MCP – the name says it all. Model Context Protocol – a protocol used to communicate the context between tools and the model as the user requests come in.

If you already know the details of MCP jump to the ‘How to use MCP’ section.

Main Components of MCP

The Host: This is the big boss that brings together the LLM and various other pieces of MCP – think of it like the plumbing logic.

The MCP Client: This represents the MCP Server within a particular Host and provides the decoupling of the tool from the Host. The Client is Model agnostic as long as the model is provided the correct context.

The MCP Server: Hosts the tools published by a provider in a separate process. It can be written in any language given that JSON-RPC is used to exchange information between the Client and the Server.

Protocol Transport: This determines how the MCP Server communicates with the MCP Client and requires developers to understand how to work with things like HTTP streams or implement a custom transport method.

The MCP Dance

At its simplest, when requesting external processing capabilities (i.e., tools or functions) the model needs some context (available tools and what parameters they take). The tool provider has that context, which it needs to share with the model.

Once the user request comes in and the LLM has the tool context, it can then indicate which tool it wants to call. The ‘host’ has the task of ensuring the correct tool is invoked with the appropriate arguments (provided by the LLM). This requires the model to give the correct context, outlining the tool name and the arguments.

Once the tool invocation is done, any response it returns needs to be sent back to the LLM with the appropriate prompt (which can be provided by the server) so that the LLM can process it onwards (either back to the user as a response or a subsequent tool call).

Let us break it down:

1] Context of which tools are available -> given to the Model by the MCP Servers.

2] Context of which tool is to be invoked -> given to the MCP Client that interfaces the selected tool by the Host.

3] Context of what to do with the response -> returned to the Model by the selected MCP Client (with or without a prompt to tell the LLM what to do with the result).
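
The three hand-offs above can be illustrated with a toy JSON-RPC 2.0 exchange. This is not the official MCP SDK, just a sketch of the message shapes; the tool registry is made up:

```python
import json

# Toy MCP-style server: answers a tool-listing request (context -> model)
# and a tool-call request (invocation routed by the Host).
TOOLS = {
    "get_weather": {"description": "Weather for a city",
                    "inputSchema": {"city": "string"}},
}

def handle_rpc(message: str) -> str:
    req = json.loads(message)
    if req["method"] == "tools/list":        # 1] tool context given to the Model
        result = {"tools": TOOLS}
    elif req["method"] == "tools/call":      # 2] Host routes the selected call
        name = req["params"]["name"]
        args = req["params"]["arguments"]
        result = {"content": f"called {name} with {args}"}  # 3] response -> Model
    else:
        raise ValueError("unknown method")
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

listing = json.loads(handle_rpc('{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'))
```

The real protocol adds initialisation, capabilities, and transport framing, but the core dance is exactly this request/response pairing.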

How To Use MCP

Even though MCP starts with an ‘M’ it is not magic. It is just a clever use of a pretty standard RPC pattern (as seen in SOAP, CORBA, etc.) and a whole bunch of LLM plumbing and praying!

Managing the Build

Given the complexity of the implementation (especially if you are building all the components instead of configuring a host like Claude Desktop) the only way to get benefits from the extra investment is if you share the tools you make.

This means extra effort in coordinating tool creation, hosting, and support. Each tool is a product and has to be supported as such because if all goes well you will be supporting an enterprise-wide (and maybe external) user-base of agent creators.

The thing to debate is whether there should be a common Server creation backlog, or whether we live with reuse within the boundaries of a business unit (BU) and over time get org-level reuse by elevating BU-critical tools to Enterprise-critical tools. I would go with the latter in the interest of speed, and mature over time to the former.

Appropriate Level of Abstraction

This is critical if we want our MCP Server to represent a safe and reusable software component.

Principle: MCP Servers are not drivers of the user-LLM interaction. They are just the means of transmitting the instruction from the LLM to the IT system in a safe and consistent manner. The LLM drives the conversation.

Consider the following tool interface:

search_tool(search_query)

In this case the search_tool provides a simple interface that the LLM is quite capable of invoking. We would expect the search_tool to do the following:

  1. Validate the search_query, as this is an API.
  2. Address concerns of the API/data source the tool wraps (e.g., a search provider’s T&Cs around rate limits).
  3. Verify any authentication, authorisation, and accounting based on Agent and User Identity. This may be optional depending on the specific action.
  4. Wrap the response in a prompt appropriate for the situation and the target model.
  5. Handle errors: where there is a downstream error (with or without a valid error response from the wrapped API/data source) the response to the LLM may be changed by the tool to reflect this.
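
A minimal sketch of a search_tool wrapper covering these responsibilities. The provider call, rate-limit value, and prompt wording are all illustrative assumptions:

```python
import time

MIN_INTERVAL = 1.0              # assumed provider T&Cs: one call per second
_last_call = -MIN_INTERVAL      # allow the first call immediately

def provider_search(query: str) -> list[str]:
    return [f"result for {query}"]   # stand-in for the real search provider

def search_tool(search_query: str) -> str:
    global _last_call
    if not search_query or len(search_query) > 500:      # 1. validate input
        return "Tool error: invalid search query; ask the user to rephrase."
    if time.monotonic() - _last_call < MIN_INTERVAL:     # 2. provider rate limits
        return "Tool error: rate limited; retry shortly."
    _last_call = time.monotonic()
    try:
        results = provider_search(search_query)
    except Exception as exc:                             # 5. downstream errors
        return f"Tool error: search provider failed ({exc}); inform the user."
    # 4. wrap the response in a prompt for the target model
    return "Summarise these search results for the user:\n" + "\n".join(results)

reply = search_tool("agentic ai patterns")
```

Note there is no business logic here: the tool validates, protects, and wraps, but the LLM still drives the conversation.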

Principle: The tool must not drive the interaction through changing the input values or having any kind of business logic in the tool.

If you find yourself adding if-then-else structures in the tool then you should step back and understand whether you need separate tools or a simplification of the source system API.

Principle: The more information you need to call an API the more difficult it will be for the LLM to be consistent.

If you need flags and labels to drive the source system API (e.g., to enable/disable specific features or to provide additional information) then understand if you can implement a more granular API with pre-set flags and labels.
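
A sketch of what ‘more granular with pre-set flags’ means in practice; the backend function and its flags are made up for illustration:

```python
# One flag-heavy backend API, wrapped as granular tools with flags pre-set so
# the LLM never has to supply them. Names and flags are illustrative.

def backend_search(query: str, safe_mode: bool, include_images: bool) -> dict:
    return {"query": query, "safe": safe_mode, "images": include_images}

def search_web(query: str) -> dict:
    """Granular tool: text search, safe mode always on."""
    return backend_search(query, safe_mode=True, include_images=False)

def search_images(query: str) -> dict:
    """Granular tool: image search, safe mode always on."""
    return backend_search(query, safe_mode=True, include_images=True)
```

Each granular tool gives the LLM one simple decision (which tool) instead of several (which tool plus which flags), which reduces invocation errors.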

Design the User-LLM-Tool Interaction

We need to design the tools to support the interaction between the User, the LLM, and the tool. Beyond specific aspects such as idempotent calls to backend functions, the whole interaction needs to be looked at. And this is just for a single agent. Multi-agent systems have an additional overhead of communication between agents which I will cover at a later date.

The pattern for interaction will be something like:

  1. The LLM selects the tool from the set of available tools based on the alignment between user input and tool capability.
  2. The LLM identifies what information is needed to invoke the selected tool.
  3. The LLM processes the tool response and deals with any errors (e.g., an error with tool selection).

Selection of the tool is the first step

This will depend on the user intent and how well the tools have been described. Having granular tools will limit the confusion.

Tool signatures

Complex signatures will increase the probability of errors in the invocation. The required parameters will be sourced from the user input, prompt instructions, or a knowledge-base (e.g., RAG).

Beyond the provisioning of data, the formatting is also important. For example, passing data using a generic format (e.g., a CSV string) or a custom format (e.g., a list of string objects). Here I would prefer the base types (e.g., integer, float, string) or a generic format that the LLM would have seen during its training, rather than a composite custom type which would require additional information for the LLM to use it correctly.
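
A quick contrast of the two signature styles; the function and its parameters are made up for illustration:

```python
# Base-type signature: parameters the LLM can fill directly from conversation,
# in formats it has seen throughout training. All names are illustrative.

def transfer_funds(from_account: str, to_account: str, amount: float) -> str:
    """Easy for an LLM to invoke: three base-typed, self-describing parameters."""
    return f"moved {amount} from {from_account} to {to_account}"

# Harder alternative (avoid): transfer_funds(request: TransferRequestV2), where
# TransferRequestV2 is a nested custom object the model must be taught to build.

msg = transfer_funds("ACC-1", "ACC-2", 100.0)
```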

Tool Response and Errors

The tool response will need to be embedded in a suitable prompt which contains the context of the request (what did the user request, which tool was selected and invoked with what data, and the response). This can be provided as a ‘conversational memory’ or part of the prompt that triggers the LLM once the tool completes execution.

Handling errors resulting from tool execution is also a critical issue. The error must be categorised into user-related, LLM-related, or system-related, the simple question being: is there any use in retrying after making a change to the request?

User-related errors require validating the parameters to ensure they are correct (e.g., account not found as user provided the incorrect account number).

LLM-related errors require the LLM to validate if the correct tool was used, data extracted from the user input and if the parameters were formatted correctly (e.g., incorrect data format). This can be done as a self-reflection step.

System-related errors require either a tool level retry or a hard stop with the appropriate error surfaced to the user and the conversation degraded gently (e.g., 404 errors). These are the most difficult to handle because there are unlikely to be automated methods for fixing especially in the timescales of that particular interaction. This would usually require a prioritised handoff to another system (e.g., non-AI web-app) or a human agent and impact future requests. Such issues should be detected ahead of time using periodic ‘test’ invocation (outside the conversational interaction) of the tool to ensure correct working.
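
The three error categories and their recovery paths can be sketched as a simple triage function. The error codes and category sets are illustrative assumptions, not any framework's taxonomy:

```python
# Hypothetical triage: map a tool error to a recovery strategy based on
# whether a changed retry can help. All error codes are illustrative.

USER_ERRORS = {"ACCOUNT_NOT_FOUND", "INVALID_INPUT"}    # re-ask the user
LLM_ERRORS = {"BAD_TOOL_CHOICE", "BAD_ARG_FORMAT"}      # self-reflect and retry
SYSTEM_ERRORS = {"HTTP_404", "TIMEOUT"}                 # hand off, degrade gently

def triage(error_code: str) -> str:
    if error_code in USER_ERRORS:
        return "ask_user_to_correct"
    if error_code in LLM_ERRORS:
        return "llm_self_reflection_retry"
    # Unknown errors are treated as system-related: retrying with a changed
    # request is unlikely to help within this interaction.
    return "handoff_and_degrade_gently"

decision = triage("ACCOUNT_NOT_FOUND")
```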

A2A and True Agent AAA

The Agent2Agent (A2A) protocol from Google is expected to (as the Red Bull ad goes) ‘give agents wings’. But does it align with the other aspect of modern-day flying – being highly secure?

To make A2A ‘enterprise ready‘ there is some discussion around Authentication and Authorisation (but not around the third ‘A’ in AAA – Accounting).

Using the Client-Server Pattern

Agents are expected to behave like client/server components whereby the details of the two are not visible. The opacity helps in keeping both ends of the comms from making any assumptions based on implementation details.

The ‘A2AClient’ is the agent that is expected to act on behalf of the user and the ‘A2AServer’ is the main back-end agent that is expected to respond on behalf of the provider and abstract all internal details (including orchestration information). Think of the ‘A2AClient’ like a waiter at a restaurant and all the other kitchen staff, chef, etc. being the ‘A2AServer’. We direct the workings of the kitchen through the waiter (e.g., asking for modifications to a dish, indicating allergies and preferences) without directly becoming aware of the processing going on there.

The transport level security aligns with HTTPS which is an industry standard. This is the plumbing between two A2A endpoints and at this level there is nothing to distinguish an A2A interaction from any other type of interaction (e.g., you browsing the net).

So far so good.

Authentication

Authentication also follows the same theme: HTTP header-based authentication described by the AgentCard object, specifically through the AgentAuthentication object within the AgentCard (see below).

For additional task specific authentication (e.g., special credentials required by an agent for accessing a database) the AuthenticationInfo object (encapsulated in a PushNotification) is to be used.

The authentication mechanism does not support ‘payload’ based identity – this means the authentication mechanism sits outside the multi-agent system and the agent treats it as a ‘given’.

This has some major implications including reliance on centralised id providers and creating a security plumbing nightmare where multiple providers are present. For A2A at some level agents are still seen as traditional software applications.

The concept of decentralised id where the agent controls the identity is still being developed with some interesting projects in this space. This also aligns with the ‘autonomous’ nature of AI Agents where ‘proving’ identity should be part of the agent communication process.

Identity as a Composition

The other aspect which multi-agent systems require is Identity as a Composition. In simple terms, identity is not a single thing when it comes to a system of interacting components (whether human, machine, or mixed). The whole interaction lives in a space of identities.

It is this collective identity that underpins the next two things in this process – Authorisation and Accounting.

A common example of this is when we contact a provider (e.g., bank, broadband, utility) to make some changes. The call-centre agent or app always authenticates you first. Your identity is merged with the identity of the agent who is handling your call (or the trust mechanism associated with the app and server) to ensure that you are authorised to request a change and the agent (or app) is authorised to carry out those instructions.

The lack of this thinking manifests itself as the ‘task specific authentication’ mechanism. What you need is a security context (maybe implemented as a container) for the agent to modify and pass-along.

Authorisation

We have already spoken a bit about authorisation without describing the specifics w.r.t. A2A. The thinking here is aligned with various best-practices such as Least Privilege, Granular Control, and Multi-Level authorisation. The documentation also discusses Skill-based Authorisation which involves linking A2AClient authentication with what skills can be invoked on the A2AServer.

There is a big gap in this and again we see software app thinking. With true multi-agentic systems each agent must have an Authorisation Posture. This combined with Identity as a Composition will provide the required autonomy and resilience to such systems.

What is an Authorisation Posture? Before that I just want to clarify what Authorisation means. In the agentic sense it could mean: what am I as an agent authorised to do (attached with my identity), what external requests am I authorised to action (attached with what the requestor(s) are allowed to do) and the context of the current interaction. Some of the latter aspects are hinted at with the ‘Skills-based Authorisation’ discussion.

Authorisation Posture is nothing but a composition of the agent’s authorisation, the requestor’s authorisation, and the flavouring provided by the current interaction context. The posture may change several times during an interaction and it is a shared entity.
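
A minimal sketch of that composition, with all permission names made up: the effective envelope is the intersection of what the agent, the requestor, and the interaction context each allow:

```python
# Hypothetical Authorisation Posture: the effective permission set is what
# the agent's identity, the requestor's identity, and the current context
# all allow. Permission names are illustrative.

def authorisation_posture(agent_perms: set, requestor_perms: set,
                          context_perms: set) -> set:
    return agent_perms & requestor_perms & context_perms

posture = authorisation_posture(
    agent_perms={"read_account", "update_address", "close_account"},
    requestor_perms={"read_account", "update_address"},
    context_perms={"read_account", "update_address", "open_account"},
)
```

Recomputing this intersection as the context changes is one way the posture ‘may change several times during an interaction’.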

A2A does not deal with this posture, how it changes, and how the agents are able to share it across organisational boundaries to ensure the operational envelope is flexible without depending on localised information (e.g., per tool interaction) or being larger than it needs to be (e.g., blanket authorisation). I don’t believe A2A is actually designed to operate across enterprise boundaries except in tightly controlled scenarios. Which is fine given that AI-based agents with such raw action potential are relatively new. It is leaving people a bit breathless.

Accounting

This is the most interesting aspect for me. Accounting in simple terms means keeping track of how much of something you have consumed or what you have used (mainly for billing, auditing, and troubleshooting purposes). A2A makes no mention of accounting, assuming either that all the agents operate within an enterprise boundary, that an existing ‘data’ exchange mechanism is used to transfer accounting information, or that it is done through ‘API keys’ passed as part of the authentication mechanism. All of the above wrapped by existing logging mechanisms.

Now Accounting requires the ‘who’ (authentication) and the ‘what’ (authorisation) to be clear. From an agent’s point of view this is also something that needs to be composed.

The lowest level accounting may be associated with the physical resources (e.g., compute) that the agent is using. The highest level may be the amount of time the agent is taking in handling the request and its ‘thoughts’ and ‘actions’ during that process.

So far so good – but why does accounting need to be composed? Because the other aspect of accounting is ‘how much of other agents’ time have I used?’, where we account for the e2e interaction as well as each individual agent’s view of its place in the whole.

If an agent is specialised (e.g., deep research, topic specific, high-cost) then we want to ensure ‘requesting’ agents account for their time. Just like a lawyer would keep careful track of the time they spend for each client and the client will be careful in engaging the lawyer’s services for relevant tasks (e.g., not just calling to have a chat about the weather).
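
The lawyer-style time ledger can be sketched as a small accounting wrapper around agent calls. The ledger structure and the agent function are illustrative stand-ins:

```python
import time
from collections import defaultdict

# Hypothetical per-requestor time accounting for a specialised agent.
ledger = defaultdict(float)   # requestor id -> seconds of agent time consumed

def accounted(requestor: str, agent_fn, *args):
    """Run an agent call and bill the elapsed time to the requestor."""
    start = time.perf_counter()
    try:
        return agent_fn(*args)
    finally:
        ledger[requestor] += time.perf_counter() - start

def deep_research_agent(topic: str) -> str:
    return f"report on {topic}"          # stand-in for expensive agent work

report = accounted("client-42", deep_research_agent, "mortgage risk")
```

Dynamic pricing (the Uber surge analogy) would then be a multiplier applied to the ledger entries rather than a change to the measurement itself.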

This time accounting can also be dynamic based on available resources, volume of requests and even things like availability of other agents that this current agent is dependent upon. For example surge pricing from Uber accounts for the driver’s time differently (the distance remains the same). If I pay surge price while on a business trip – that cost gets transferred downstream as I claim expenses. There will also be a cost element associated with tool use (e.g., API usage limits).

This type of information will be critical in cases where the agent has multiple options to offload work and the work boundary is not just within our enterprise (therefore we have asymmetric information).

What is needed?

What is needed is a mechanism that allows us to compose and share authentication, authorisation, and accounting information between agents for a truly transparent, secure, and manageable multi-agentic system.

The composition is also very important because there is a level of hierarchy in this information as well. For example the AAA composition information will have its own AAA when inside the org vs when shared with external organisations.

A bank will maintain a list of all its customers for audit reasons but only share specific information with customers or a group of customers (e.g., joint accounts) when required. But in case there is a complaint or an external audit all or some of the information may have to be shared.

This kind of fluidity requires the agents to understand ‘who’ they are, ‘what they can do’/’what others (whether identified or anonymous) can request them to do’, and ‘what actions need to be tracked/communicated to whom’.

The above will also be required if we want to make our organisational agentic ecosystem be part of other groups (partially or as a whole) or making other agents part of ours (permanently or temporarily) in a dynamic manner.

Of course while we treat agents as software apps and focus on the enterprise context (and not true autonomous, independent, and dynamic packets of functionality) these aspects will continue to be ignored.

Agentic AI: The Tree of Risks

Risks in Agentic systems are underappreciated. These risks are layered and have different sources and therefore different owners. Materialised risks can be thought of as the fruit of the Risk tree.

Figure 1: The Risk Tree showing the layers of risk.

The Model

The model determines the first level of risk. It is the root of the tree. Often model providers describe knowledge and behavioural risk parameters for their models.

Knowledge risk includes the model not having the right knowledge (e.g., being used for advanced data analysis when the model is bad at maths). Behavioural risk comes from task misalignment (e.g., using a generic model for a specific task).

Risk Owner: Development team selecting the model.

Agent

The way we interact with the model adds the second layer of risk. The key decisions include how the agent reasons and decides, where we use self-reflection, and guardrails. Agents with complex internal architectures lead to behaviours that are difficult to debug and issues that are difficult to detect.

Risk Owner: Development team selecting the architecture.

Multi-agents

As we start to connect agents together we add the third layer of risk. Chain effects start to creep in where individual agents may behave as required but the sequence of interactions leads to misalignment. If we have dynamic interactions then potentially we have a situation where we not only get unpredictable behaviour but also find it impossible to reproduce.

Risk Owner: Development team creating the multi-agent system.

Data

Data is the one big variable and therefore adds the fourth layer of risk. The system may work well with some samples of data but not with others or data drift may cause the system to become unstable. This level of variability can be quite difficult to detect.

Risk Owner: Data owner – you must know if the data you own is suitable for a given project.

Use-case

The final layer of risk. If the use-case is open ended (e.g., customer support agent) the risk is higher because we will find it difficult to create tests for all eventualities.

Given the functionality is defined mostly using text (prompts) and less using code we cannot have an objective ‘test coverage’ associated with the use-case.

We can have all kinds of inputs that we use as tests but there is no way to quantify the level of coverage (except for that narrow class of use-cases where outputs can be easily validated).

Risk Owner: Use-case owner – you must know what you are putting in front of the users and how you can make it easier for the good actors and more difficult for the bad actors.

Examples

Let us explore the risk tree with a real example using Google’s Agent Development Kit and Gemini Flash 2.0.

The use-case is a simple Gen AI app – we are building a set of maths tools for the four basic operations. The LLM will use these to answer questions. This problem allows us to address the following layers of the tree: Use-case, Data, and Single Agent.

The twist we add is that the four basic operations are restricted to integer inputs and outputs. The integer restriction gives us a governance rule that can be objectively evaluated.

Version 1

Prompt v1: “you are a helpful agent that can perform basic math operations”

Python Tool v1:

def add(x:int, y:int)->int:
    """Add two numbers"""
    print("Adding:", x, y)
    return x + y

Output:

In the above output the blue background is the human user and the grey is the Agent responding. We see the explicit type hint (integer) provided in the tool definition is easily overcome by some simple rephrasing – ‘what could it be?’.

The LLM answers this without using the tool (we can trace tool use) thereby not only disregarding the tool definition but also using its own knowledge to answer (which is a big risk!).

Issues Discovered: LLM not obeying tool guardrails and losing its (weak) grounding – risk at the Data and Use-case levels.

Version 2

Prompt v2: “you are a helpful agent that can perform basic math operations as provided by the functions using only integers. Do not imagine.”

Python tool remains the same.

Output:

We see that with these changes to the prompt it refuses to solve a question that does not have integer parameters (e.g., 14.4 + 3). But if you try a division problem (e.g., 5 / 2) it does return a float response! Once again ignoring the tool definition which clearly states ‘integer’ as a return type. Not only that, with some confrontational prompting we can get it to say all kinds of incorrect things about the tool.

Issues Discovered: Firstly, the tool does not return a dictionary, as can be clearly seen in the definition. This is probably the Agent Framework causing issues, where internal plumbing may be using a dictionary. This is risk at the Agent Architecture level.

Secondly, with confrontational prompting we can break the grounding, especially as with Agents and increased looping certain messages can get reinforced without too much effort. Once a ‘thought’ is part of a conversation it can easily get amplified.

Version 3

Prompt remains the same.

Python Tool v2:

def divide(x:int, y:int)->int:      
    """Divide two numbers and return an integer"""
    print("Dividing:", x, y)
    if y == 0:
        raise ValueError("Cannot divide by zero")
    return x / y

We change the description of the tool (which is used by the LLM) to explicitly add the guidance – ‘return an integer’.

Output:

Even with additional protection at the use-case level (Versions 2 and 3) we can still get the Agent to break the guardrails.

At first we give it an aligned problem for divide – integer inputs and result. Everything works.

Next we give integer inputs but a non-integer result. It executes the divide, checks the result, then refuses to give me the result as it is not an integer. This is an example of the partial information problem: the agent doesn't know whether the guardrails are violated or not till it does the task. And this is not a theoretical problem. The same issue can come up whenever the agent engages external systems that return any kind of data back to the Agent (e.g., an API call, a data lookup). The response of the agent in that respect is not predictable beforehand.
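One mitigation for this partial information problem (not part of the original experiment; all names here are illustrative) is to validate tool results at the framework level before they re-enter the conversation, so the guardrail is enforced in code rather than left to the LLM:

```python
def guarded_call(tool, *args, expected_type=int):
    """Run a tool and block any result that violates its declared return type."""
    result = tool(*args)
    if not isinstance(result, expected_type):
        # Surface a refusal instead of letting the offending value reach the LLM.
        raise TypeError(
            f"Guardrail violation: {tool.__name__} returned "
            f"{type(result).__name__}, expected {expected_type.__name__}"
        )
    return result
```

The framework would convert the TypeError into a refusal message, so the float result never enters (and never gets amplified within) the conversation.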

The agent in this example runs the divide function, finds the result, and this time, instead of blocking it, shares the result! This clearly breaks the guidelines established in the prompt and the tool, but there was no way we could have predicted that beforehand.

Issues Discovered: This time a combination of Agent, Data, and Use-case risks is clearly visible. Plugging existing gaps can create new gaps. Finally, when we bring in new information it is not possible to predict the agent's behaviour beforehand.
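A deterministic way to plug the float-result gap is to enforce the integer constraint inside the tool itself, rather than relying on the prompt or the tool description. A sketch (this is not the experiment's original code):

```python
def divide_checked(x: int, y: int) -> int:
    """Divide two integers, refusing any division without an integer result."""
    if y == 0:
        raise ValueError("Cannot divide by zero")
    quotient, remainder = divmod(x, y)
    if remainder != 0:
        # The guardrail is enforced in code, not left to the LLM's judgement.
        raise ValueError(f"{x} / {y} does not have an integer result")
    return quotient
```

With this version, 5 / 2 raises an error instead of returning 2.5, so there is no float for the agent to leak regardless of how it is prompted.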

Results

We can see even in such a simple example it is easy to break governance with confrontational prompting. Therefore, we must do everything we can to educate the good actors and block the bad actors.

I will summarise the result as a set of statements…

Statement 1: Agentic AI (and multi-agent systems) will not get it right 100% of the time – there will be negative outcomes and frustrated customers.

Statement 2: The Government and Regulators will need to change their outlook and organisations will need to change their risk appetite as things have changed as compared to the days of machine learning.

Statement 3: The customers must be educated to ‘trust but verify’. This will help improve the outcomes for the good actors.

Statement 4: Automated and manual monitoring of all Gen AI inputs and outputs (100% monitoring coverage) is needed to block bad actors.

Code

Google’s ADK makes it easy to create agents. If you want the code just drop a comment and I will share it with you.

Google’s A2A Protocol

READ THIS FIRST:

The structure of A2A has changed after this post was written. The commit below (20th May 2025) rationalised the types of A2ARequests.

https://github.com/a2aproject/A2A/commit/50a6d6d916a604926d079545119eda6f50289efb

The overall post remains valid, including the key insight around A2A being the cardboard box that the products you buy from, say, Amazon arrive in (and not the product itself). I have attempted to update key parts of this post to reflect the current state.

Now the post…

Google has released the A2A protocol, which aims to provide a standard way for Agents to communicate with each other, as well as a host of other services such as enterprise-grade authentication.

Don’t fall for the hype that promotes A2A as the solution to all Agentic AI gaps. It is a very specific component within a larger ecosystem that includes Model Context Protocol, frameworks, and implementations.

The JSON Schema for the A2A Protocol can be found here: https://github.com/google/A2A/blob/main/specification/json/a2a.json

Helpful object constructors provided by Google here:

https://github.com/google/A2A/blob/main/samples/python/common/types.py

The most important aspect of any protocol is the set of interactions it consists of. These interactions are defined by data objects that are exchanged in a given sequence. There are about 50 objects that make up A2A.

I have created a small utility that uses Gemini 1.5 Flash to process the schema objects in the JSON schema file above and create examples to make it easier to implement. Where a particular object contains references, it also pulls out the referenced schemas and includes them for the LLM. Link at the end of the post.

You can read the ‘Story’ section if you just want a high level view.

The Story

The story is quite simple. We have Agents working with each other to carry out Tasks.

Tasks are the Client's way of describing the work and expected outcomes. Messages are the 'control plane' of A2A, helping ensure the correct processing is carried out and the required output is achieved. These tasks are contained in an A2A Request.

Artifacts are the rich response objects generated by Agents as they carry out the Task. Parts are used in both Artifacts and Messages to exchange data – this could be data generated as part of Task completion or data that describes the Task (e.g., expected output format, instructions).

Clients create one or more Tasks as part of a particular session. Agents are responsible for managing the status of the Task and have a level of autonomy in carrying it out (e.g., do it straight away, schedule it, or refuse). There is a set of Error objects that help inform of issues during Task completion.

Agents can communicate with Clients outside a particular session using Push Notifications.

Agents and their skills can be discovered using Agent Cards.

The A2A Request

Currently, when we want to communicate with an Agent we define a custom protocol which largely consists of a lot of text flying back and forth between the user, LLM, and tools. The A2A Request object acts as a universal container being able to encapsulate different types of requests, leaving the handling to specific agents. Think of it like our network protocols that can transport any type of media (e.g., video, text, and audio) while leaving the handling to specific apps (e.g., Netflix, Chrome, and Spotify).

This is one of the core parts of the protocol. It allows Tasks to be initiated and managed as well as Push Notifications to be set and managed.

The image above shows the types of A2A requests. The Task Request has been rationalised into the Send Message Request/Send Streaming Message Request. This means Tasks are not separate constructs; they are part of message exchange. This makes sense because Agents are all about interactions that contain requests, questions, responses, answers, and artifacts. Requests could be for specific resources (e.g., knowledge search) or to carry out a task (e.g., create a user account).

It can take the following payloads: SendTaskRequest, GetTaskRequest, CancelTaskRequest, SetTaskPushNotificationRequest, GetTaskPushNotificationRequest, and TaskResubscriptionRequest.

Below is an example of the ‘SendTaskRequest’ payload within an A2A Request.

An example of the A2ARequest container with SendTaskRequest payload.
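Since the example image is not reproduced here, below is a hand-written sketch of such a request as a Python dict. The field names follow my reading of the A2A JSON schema at the time of writing; treat them as illustrative, not normative.

```python
# Illustrative SendTaskRequest inside an A2A Request (JSON-RPC style envelope).
# Field names are indicative of the schema, not guaranteed to match the
# current specification.
send_task_request = {
    "jsonrpc": "2.0",
    "id": "request-001",          # request identifier
    "method": "tasks/send",
    "params": {
        "id": "task-123",         # the Task being created/continued
        "sessionId": "session-abc",
        "message": {              # the 'control plane' Message describing the work
            "role": "user",
            "parts": [
                {"type": "text", "text": "Summarise the attached document."}
            ],
        },
    },
}
```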

The Agent Card

This object describes an agent. Below we can see the high level pieces of information associated with an agent: the name, description, provider, url, version, documentation, supported interaction capabilities (e.g., streaming), authentication, and the input and output modes (e.g., text).

Example of core parts that describe the agent within the Agent Card.

The next part of the Agent Card describes the skills associated with the agent using a list of objects of type Agent Skill. This is important because skills allow us to select the correct agent. A skill consists of a name, description for the skill, a set of tags, examples, and input and output modes. For example, we can have a skill that generates a summary of a document. In this case input and output mode would be ‘text’ and a tag could be ‘document summary’.

An example of the skills section of the Agent Card.

The skills block is formed of an array of Agent Skill objects. An example given below:

Example Agent Skill object which describes skills for an Agent Card.
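As the example image is not reproduced here, a sketch of an Agent Skill object for the document-summary example described above (field names are indicative, not normative):

```python
# Illustrative Agent Skill entry for an Agent Card's skills array.
agent_skill = {
    "id": "document-summary",
    "name": "Document Summariser",
    "description": "Generates a concise summary of a supplied document.",
    "tags": ["document summary"],
    "examples": ["Summarise the attached quarterly report in 200 words."],
    "inputModes": ["text"],
    "outputModes": ["text"],
}
```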

Error Objects

There are a large number of Error objects that help communicate issues with agent to agent interactions (requests/responses). Some common issues include: invalid parameters, invalid request payload, and issues with JSON parsing.

Some examples of errors.

Message

Helps implement the control plane for the Task. Can be used to transmit agent thoughts, instructions, status, errors, context and other such data. As per the Google documentation – anything that is not generated by an Agent. It uses Parts to contain the actual data (see Artifact). The example below shows a message with Text Part, File Part (URI-based), and a Data Part.

Example of a Message with all three Part types (Text, File, Data)
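With the example image not reproduced here, a sketch of such a Message as a Python dict (field names indicative of the schema, not normative):

```python
# Illustrative Message carrying all three Part types.
message = {
    "role": "user",
    "parts": [
        {"type": "text", "text": "Please analyse the attached report."},
        {
            "type": "file",
            "file": {
                "uri": "https://example.com/report.pdf",  # URI-based, no inline bytes
                "mimeType": "application/pdf",
            },
        },
        {"type": "data", "data": {"priority": "high", "deadline": "2025-06-01"}},
    ],
}
```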

Artifact

Artifacts represent the items being created by agents based on a request, and they are immutable. Thus these are part of the response (once created they cannot be changed). This could be a file (e.g., a document), a piece of text (e.g., a search result), or some data (e.g., user input). It is a critical part of the protocol because without this richness of response there would be limited utility in agents communicating with each other.

The main thing to focus on is the Parts array, which consists of objects of type TextPart, FilePart, or DataPart. Each part is a fully formed chunk of content. Parts can exist in both requests (Message) and responses (Artifact).

Example of an Artifact object which represents items to be processed by the agent (e.g., file, text, data).

TextPart

TextPart represents free text input as a Part of a Request to an agent, for example a question to the LLM such as 'what is nuclear fission?'. The agent would then respond with an Artifact containing a TextPart with a response to that request – a description of nuclear fission.

Example of a TextPart object.

FilePart

FilePart represents a file, either as a URI (location) or base64-encoded byte data (never both). So: fetch the data, or here is the data. It uses the FileContent object to actually store the content. This is useful when an agent needs to process files (e.g., stored at a location like a Cloud bucket) or communicate using files.

Example of a FilePart object.

DataPart

This is the most interesting ‘Part’ in my view. This represents data being passed back and forth. For example, this could represent a conversational agent gathering details about a customer booking and sharing them with an agent that has a ‘Hotel Booking Management’ skill.

Example of a DataPart.
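As the example image is not reproduced here, a sketch of a DataPart for the hotel booking hand-off described above (field names indicative, not normative):

```python
# Illustrative DataPart: structured booking details gathered by a
# conversational agent, to be handed to an agent with a
# 'Hotel Booking Management' skill.
data_part = {
    "type": "data",
    "data": {
        "guestName": "A. Traveller",
        "checkIn": "2025-06-01",
        "checkOut": "2025-06-05",
        "rooms": 1,
    },
}
```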

Task

Perhaps the most important object, as this is the reason for the existence of the A2A framework: the ability to describe and complete Tasks. Each Task may be associated with a session, in case multiple Tasks are running in parallel. The key function of the Task object is to represent a snapshot of the interaction as well as the overall status (given by the Task Status object).

The interaction snapshot can contain a historic record of messages, artifacts, and metadata. This makes a Task object a stateful object – which means when messing around with it we must be very careful to use idempotent operations.

A whole set of requests are available to create and manage Tasks and notifications.

Example of a Task

Example of the artifacts section of a Task, note the liberal use of Parts:

Example of the Artifact section in a Task.

Here is the history section (Messages), note the presence of a user as well as an agent message:

Example of a history (Message) section within a Task.

The Task and other examples can be found here:

https://github.com/amachwe/A2A/blob/main/examples_a2a.json

The example generator utility can be found here:

https://github.com/amachwe/A2A/blob/main/json_schema_test.py

Indian Budget 2025: Trouble with Consumption

The latest budget is out in India. The big talking point is the income tax rebate on incomes up to 12 lakh per annum (all figures in INR). The key target for this is the middle class that was throwing rocks at the government through memes and YouTube influencers.

The change in thinking is to move away from Government spending to private consumption by giving those that pay income tax more money at the end of the month.

The net benefit from this tax change is about 80k per year for income tax payers with incomes up to 12 lakh pa. As your income goes above 12 lakh pa there is a sharp drop, followed by a gradual increase to a maximum benefit of 1.1 lakh pa if your income is 25 lakh pa.

There are a couple of things to understand to put this change in context.

Firstly…

This does not put money in your hands today. It is something you will get from April 2025 onwards as part of your monthly pay. So we should see that extra money in people’s hands ready to spend by end of March 2026! So this number will start to influence future outlook when it comes to stocks and other investments. The market will start to factor this today even if the full impact is not felt till the next year.

Secondly…

We need to think about the perception of increased money in our accounts before the actual accumulation happens. The ‘Annual Benefit’ column in the table below shows the amount people will start to feel richer by straightaway. But that money is not available to them today even though they are mentally already factoring it into their consumption decisions.

Annual Salary (INR)   Annual Benefit (INR)   Monthly Benefit (INR)   Percentage of Annual Salary
8,00,000              30,000                 2,500                   3.8%
12,00,000             80,000                 6,666                   6.6%
15,00,000             35,000                 2,916                   2.3%
25,00,000             1,10,000               9,166                   4.4%

About Behaviours

Now if the aim is to boost private consumption then we need to understand what factors could influence the strength of the boost. And a lot of it will be behavioural factors. Let us understand the factors one by one.

Perception of Future Income

Remember the full amount is not available till March 2026. What will economic prospects be by then? Will India be reeling under the impact of Trump's trade war, or will we have found other options for our goods (remember India enjoys a trade surplus with the US – one of the few countries with which we do)?

If prospects of future income are strong then I will be less hesitant in spending without worrying about the future (e.g., by borrowing – which is likely to get a boost with reduction in interest rates).

If prospects of future income are weak then I will use the money to build up a buffer. This money if invested in the stock markets is likely to have a positive impact. If saved in fixed deposits it will encourage investment by the banks (private investment).

Inflation

With more money in people's hands and a reduction in interest rates, will this push inflation outside the RBI's comfort zone of 4% (+/- 2%)? Currently it is holding at slightly above 5%, not too far from the 6% upper limit.

If people expect prices to rise by even 4%, that wipes out most gains made through the tax breaks (see the table above). This is because incomes have not risen at the same rate as prices over the last few years, and there were some serious signs of dropping consumption, which also reduced the GDP forecasts for FY 2025.

Spending Priorities

Given this perception of more money coming in, if critical businesses like schools and hospitals increase their prices (or there is a perception of possible price increase – see Inflation above) then people will use the extra money coming in every month to fill any funding gaps between the incoming and outgoing.

Even if this is used to ‘loosen’ the belt to go back to consuming what they had to give up then we are likely to see some short term increase in prices as industries in India ramp up production.

Investments

For those who have a decent gap between incoming and outgoing already they will start getting busy finding new avenues to invest their money. Crypto is a no-go because of tightening regulation. That leaves traditional savings products or the stock market.

Non-Essential Spending

This is perhaps the most interesting variable because, even though the segment with a decent separation between incoming and outgoing is a small piece of the pie, in absolute numbers it is large.

These are the people who won’t hesitate now (as the Government hopes) in increasing their spending on non-essentials as they already have some cushion when it comes to the essential spending. For example, because they do not have kids, or they have multiple sources of income (e.g., parent’s pension, husband and wife both work), or they are not paying home instalments (e.g., living in parental home).

This may allow them to consume more takeaways, buy a more expensive car (with a higher monthly loan – from April 2025), or even take that one extra holiday this year.

A Better Solution

My focus would have been on enabling the women of India. India cannot fulfil its dreams of becoming a developed country until we enable our women. Currently female labour-force participation is quite low in India (about 33%). The income gap is also large: for every INR 100 earned by a man in India, a woman earns INR 40.

So if I was advising our honourable Finance Minister – I would have advised her to give all women 100% tax rebate till 25 lakh per annum. Given the focus on digitisation and identity I think this could have been done with limited leakage.

Also as women’s income increase they find efficient ways of deploying that money. The living standards of the family go up as does the health of the children and their education levels.

Behaviours in Generative AI (Part 3)

Part 1 talks about well-behaved and mis-behaving functions with a gentle introduction using Python. We then describe how these can be used to understand the behaviour of Generative AI models.

Part 2 gets into the details of the analysis of Generative AI models by introducing a llm function. We also describe how we can flip between well-behaved and mis-behaving model behaviours.

In this part we switch to using some diagrams to tease apart our llm function to understand visually the change of behaviours.

Let us first recall the functional representation of a given LLM:

llm(instructions :str, behaviour :dict) -> output :str

This can be visualised as:

Figure 1: Visual representation of the LLM function

The generalisation being: for some instruction and behaviour (input) we get some output. Let us not worry about individual inputs/outputs/behaviours for the next part to keep the explanation simple. That means we won’t worry too much whether a particular output is correct or not.

Now LLMs consist of two blocks – the ‘ye olde’ Neural Network (NN) which is the bit that is trained and the selection function. The selection function is the dynamic bit and the Neural Network the static bit. The weights that make up the NN once trained do not change.

We can represent this decomposition of the llm function into NN function and Selection function as:

Figure 2: Decomposing LLM function into NN and Selection function

The loops within llm function indicate repeated execution before some condition is met and the process ends. Another way to think about this without loops is that the calls to the NN function and the Selection function are chained – and the length of this chain unrolls through time (as determined by the past text – that includes the original instructions plus any new tokens added – and the token generated in current step).

Root Cause of the Mis-behaviour

As the name and flow in Figure 2 suggests the Selection function is selecting something from the output of the NN function.

The NN function

The output of the NN function is nothing but the vocabulary (i.e., all the tokens that the model has seen during training) mapped to a score. This score indicates the model's assessment of what the next token should be (current step) based on the text generated so far (this includes the original instructions and any tokens generated in previous steps).

Now as the NN function creates a ‘scored’ set of options for the next token and not a single token that represents what goes next, we have a small problem. This is the problem of selecting from the vocabulary which token goes next.

The Selection Function

The Selection function solves this problem. It is critical because not only does it influence which token is selected during the current step, it also influences the trajectory of the response, one token at a time. Therefore, if a mistake is made in the early part of the generation, or a particularly important token is not selected correctly, it is difficult to recover from. Think, for example, of solving a math problem: if even one digit is incorrectly selected the calculation is ruined beyond recovery. Remember, LLMs cannot overwrite tokens once generated.

The specific function we use defines the mechanism of selection. For example, we could disregard the score and pick a random token from the vocabulary. This is unlikely to give a cohesive response (output) in the end as we are basically nullifying the hard work done by the model in the current step and making it harder for it in the future steps as we are breaking any cohesive response with each random selection.

Greedy Selection

The easiest and perhaps the least risky option is to take the token with the highest score. This is least risky because the NN function is static: with a given instruction and behaviour we will always get the same token scores. With greedy token selection (going for the highest score) we will therefore select the same token at each step (starting from step 1), and as we expect the same token scores from the start we end up building the same response again. With each step we walk down the same path as before. You will notice in Figure 3 that the overall architecture of the llm function has not changed. This is our well-behaved function – given a specific instruction and behaviour we get the same specific output.

Figure 3: Greedy selection function.
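Greedy selection can be sketched in a few lines, treating the NN function's output as a mapping from each vocabulary token to its score (a toy illustration, not a real decoder):

```python
def greedy_select(scores: dict) -> str:
    """Always pick the highest-scoring token - deterministic at every step."""
    return max(scores, key=scores.get)

# Same scores in, same token out, every time.
print(greedy_select({"cat": 0.2, "dog": 0.7, "rat": 0.1}))  # dog
```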

Sampling Selection

The slightly harder and riskier option is to use some kind of statistical sampling method (remember the 'do_sample' = True behaviour from the previous posts) that takes the scores into account. For example, take the top 3 tokens by score and select randomly from them, or even dynamically take the top 'n' using some cooldown or heat-up function. The risk here is that, given we are using random values, there is a chance for the generation to be negatively impacted. In fact such an llm function will be badly-behaved and not compositional (see below for a definition), because given the same instruction and behaviour we are no longer guaranteed to get the same output (an inconsistent relation between input and output).

Figure 4: Sampling function for selection and introduction of hidden data.

This inconsistency requires extra information coming from a different source given that the inputs are not changing between requests.

In Figure 4, we identify this hidden information as the random number used for sampling, and the source as a random number generator. In reality we can make this random shift between outputs 'predictable' because in actuality we are using a pseudo-random number generator. With the same seed value (42 being the favourite) we will get the same sequence of random numbers, and therefore the same sequence of responses for a given instruction and behaviour.
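A toy top-k sampling function shows both the hidden randomness and how a fixed seed makes it reproducible (illustrative only, not a real decoder):

```python
import random

def top_k_sample(scores: dict, k: int = 3, seed=None) -> str:
    """Pick randomly among the k highest-scoring tokens; the random draw
    is the 'hidden data' entering the function."""
    rng = random.Random(seed)  # same seed => same sequence of draws
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return rng.choice(top_k)
```

With seed=42 every run returns the same token; with seed=None each call may return a different member of the top k.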

We are no longer dealing with a one-to-one relation between input and output; instead we have to think about the relationship between a given input and some collection of possible outputs.

I hope this has given you a brief glimpse into the complexity of testing LLMs when we enable sampling.

This inconsistency or pseudo-randomness is beneficial when we want to avoid robotic responses (‘the computer says no‘). For example, what if your family or friends greeted you in the exact same manner every time you met them and then your conversation went down the exact same arc. Wouldn’t that be boring? In fact this controlled use of randomness can help respond in situations that require different levels of consistency. For more formal/less-variable responses (e.g., writing a summary) we can tune down the randomness (using the ‘temperature’ variable – see previous post) and where we want more whimsical responses (e.g. writing poetry) we can turn it up.

Conclusion

We have decomposed our llm function into an NN function and a Selection function to investigate the source of inconsistent behaviours. We have discovered this is caused by hidden data, in the form of random variables, being provided to the function for use in selecting the next token.

The concept of changing function behaviour because of extra information coming in through hidden paths is an important one.

Our brain is a big function.

A function with many hidden sources of information apart from the senses that provide input. Also because the network in our brain grows and changes as we age and learn, the hidden paths themselves must evolve and change.

This is where our programming paradigm starts to fail because while it is easy to change function behaviour it is very difficult to change its interface. Any software developer who has tried to refactor code will understand this problem.

Our brain though is able to grow and change without any major refactoring. Or maybe mental health issues are a sign of failed refactoring in our brains. I will do future posts on this concept of evolving functions.

A Short Description of Compositionality

We have all heard of the term composability – which means putting multiple things/objects/components together to create something new. For example, we can combine simple LEGO bricks to build complex structures.

Compositionality takes this to the next level. It means that the rules for composing the components must be composable as well. That is why all LEGO compatible components have the same sized buttons and holes to connect.

To take a concrete example – language has compositionality – where the meaning of a phrase is determined by its components and syntax. This ‘and syntax’ bit is compositionality. For example we can compose the English language syntax rule of subject-verb-object to make infinitely complex sentences: “The dog barked at the cat which was chasing the rat”. The components (words) and syntax taken together provide the full meaning. If we reorder the components or change the rule composition there are no guarantees the meaning would be preserved.

Behaviours in Generative AI (Part 2)

The first part of this post (here) talks about how Generative AI models are nothing but functions that are not well behaved. Well-behaved means functions that are testable and therefore provide consistent outputs.

Functions misbehave when we introduce randomness and/or the function itself changes. The input and output types of the function are also important when it comes to understanding its behaviour and testing it for use in production.

In this post we take a deep dive into why we should treat Gen AI models as functions and the insight that approach provides.

Where Do Functions Come Into The Picture?

Let us look at a small example that uses the transformers library to execute an LLM. In the snippet below we see why Gen AI models are nothing but functions. The pipe function wraps the LLM and the associated task (e.g., text generation) for ease of use.

output = pipe(messages, **configs)

The function pipe takes in two parameters: the input (messages) that describes the task and a set of configuration parameters (configs) that tune the behaviour of the LLM. The function then returns the generated output.

To simplify, let us reformulate it as a function called llm that takes in some instructions (containing user input, task information, data, examples, etc.) and some variables that select the behaviour of the function. Ignore for the moment the complexity hiding inside the LLM call, as that is an implementation detail and not relevant for the functional view.

llm(instructions, behaviour) -> output

Let us look at an example behaviour configuration. The ‘temperature’ and ‘do_sample’ settings tune the randomness in the llm function. With ‘do_sample’ set to ‘False’ we have tuned all randomness out of the llm function.

configs = {
    "temperature": 0.0,
    "do_sample": False,
}

The above config therefore makes the function deterministic and testable. Given a specific instruction and behaviour that removes any randomness – we will get the exact same output. You can try out the example here: https://huggingface.co/microsoft/Phi-3.5-mini-instruct. As long as you don’t change the input (which is a combination of the instruction and behaviour) you will not notice a change in the output.

The minute we change the behaviour config to introduce randomness (set 'do_sample' to 'True' and 'temperature' to a value greater than zero) we enter the territory of 'bad behaviour'. Given the same instruction we get different outputs. The higher the temperature value, the more the variance in the output. To understand how this works please refer to this article. I next show this change in behaviour through a small experiment.
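The flip between well-behaved and misbehaving can be illustrated without downloading a model, using a toy stand-in for the llm function (entirely hypothetical code; the real experiment uses Phi-3.5-mini):

```python
import random

def toy_llm(instructions: str, behaviour: dict) -> str:
    """Toy stand-in for llm(): greedy when do_sample is False,
    random among candidates when do_sample is True."""
    candidates = ["I'm well, thank you!", "Doing great!", "All good here."]
    if not behaviour.get("do_sample", False):
        return candidates[0]           # deterministic: always the same output
    return random.choice(candidates)   # hidden randomness enters here

# With sampling off, 300 calls produce exactly one distinct output.
outputs = {toy_llm("Hello, how are you?", {"do_sample": False}) for _ in range(300)}
print(len(outputs))  # 1
```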

The Experiment

Remember input to the llm function is made up of an instruction and some behaviour config. Every time we change the ‘temperature’ value we treat that as a change in the input (as this is a change in the behaviour config).

The common instruction across all the experiments is ‘Hello, how are you?’, which provides an opening with multiple possible responses. The model used is Microsoft’s Phi-3.5-mini. We use ‘sentence_transformers’ library to evaluate the similarity of the output.

For each input we run the llm function 300 times and evaluate the similarity of the outputs produced. Then we change the input (by increasing the temperature value) and run it again 300 times. Rinse and repeat till we reach a temperature of 2.0.
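The similarity evaluation can be sketched with plain cosine similarity over embeddings (the post uses the sentence_transformers library for real sentence embeddings; the toy vectors below stand in for them):

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_pairwise_similarity(embeddings):
    """Mean cosine similarity over all pairs of outputs at one temperature."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Identical embeddings (temperature 0.0) give a mean similarity of exactly 1.
print(mean_pairwise_similarity([[1.0, 0.0]] * 3))  # 1.0
```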

Figure 1: Statistics of output similarity scores, given the same input, as the temperature is increased.

The Consistent LLM

The first input uses behaviour setting of ‘do_sample’ = ‘False’ and ‘temperature’ = 0.0. We find no variance at all in the output when we trigger the llm function 300 times with this input. This can be seen in Figure 1 at Temperature = 0.00, where the similarity between all the outputs is 1 which indicates identical outputs (mean of 1 and std. dev of 0).

The llm function behaves in a consistent manner.

We can create tests by generating input-output pairs relevant to our use-case that will help us detect any changes in the llm function. This is an important first step because we do not control the model lifecycle for much of the Gen AI capability we consume.

But this doesn’t mean that we have completely tested the llm function. Remember from the previous post: if we have unstructured types then we are increasing the complexity of our testing. Let us understand the types involved by annotating the function signature:

llm(instructions :str, behaviour :dict) -> output :str

Given the unstructured types (string) for instructions and output, it will be impossible for us to exhaustively test for all possible instructions and validate output for correctness. Nor can we use mathematical tricks (like induction) to provide general proofs.

The Inconsistent LLM

Let us now look at how easily we can complicate the above situation. Let us change the input by only changing the behaviour config. We now set ‘do_sample’ = ‘True’ and ‘temperature’ = 0.1. This change has a big impact on the behaviour of our llm function. Immediately we start seeing the standard deviation of the similarity score for the 300 outputs start increasing. The mean similarity also starts to drop from ‘1’ (identical).

As we increase the temperature (the only change made to the input) and collect 300 more outputs we find the standard deviation keeps increasing and the mean similarity score continues to drop.

We start to see variety in the generated output even though the input is not changing.

The exact same input is giving us different outputs!

Let us see how the distribution of the output similarity score changes with temperature.

Figure 2: Changes in the output similarity score distribution with the increase in temperature.

We can see in Figure 2 starting from a temperature of 0.0 we see all the outputs are identical (similarity score of 1). As we start increasing the temperature we find a wider variety of outputs being produced as the similarity score distribution broadens out.

At a temperature of 1.0 many of the outputs are still identical (grouping around a score of 1), but some outputs are not similar at all (the broadening towards a score of 0).

At a temperature of 2.0 there are no identical outputs at all (no scores of 1); instead the similarity scores are spread between 0.4 and 0.9.

This makes it impossible to prepare test cases that check input-output value pairs. Temperature is also just one of many settings that we can use to influence behaviour.

We need to find new ways of testing mis-behaving functions based on semantic correctness and suitability rather than value checks.
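One direction is sketched below with hypothetical checks: instead of matching exact values, assert semantic properties the output must satisfy (length, required facts), which remain stable even when the wording varies:

```python
def check_summary(input_text: str, summary: str, must_mention: list[str]) -> bool:
    # property-style test: no golden string, only semantic constraints
    is_shorter = len(summary) < len(input_text)
    mentions_all = all(term.lower() in summary.lower() for term in must_mention)
    return is_shorter and mentions_all

chat = ("Customer reported that their debit card was declined twice "
        "at the supermarket and asked for a replacement card.")

# two differently worded but equally acceptable summaries
summary_a = "Customer's debit card was declined; wants a replacement."
summary_b = "A replacement was requested after the debit card failed."

ok_a = check_summary(chat, summary_a, ["debit card", "replacement"])
ok_b = check_summary(chat, summary_b, ["debit card", "replacement"])
```

Both summaries pass, even though a value-matching test would accept at most one of them.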

Why do we get different Outputs for the same Input?

The function below starts to misbehave and provides different outputs for the same input as soon as we set the ‘temperature’ > 0.0.

llm(instructions :str, behaviour :dict) -> output :str

Even though we are not changing anything in the input, something still changes the output. We are therefore missing some hidden information that is being provided to the function without our knowledge. This hidden information is randomness – one of the sources of change we discussed in the previous post.

Conclusions

We have seen how we can make LLMs misbehave and make them impossible to test in standard software engineering fashion by changing the ‘temperature’ configuration.

Temperature is not just a setting designed to add pain to our app building efforts. It provides a way to control creativity (or maybe I should call it variability) in the output.

We may want to reduce the temperature setting for cases where we want consistency (e.g., when summarising customer chats) and increase it for when we want some level of variability (e.g., when writing a poem). It wouldn’t be any fun if two users posted the same poem written by ChatGPT!

We need to find new ways of checking the semantic correctness of the output and these types of tests are not value matching types. That is why we find ourselves increasingly dependent on other LLMs for checking unstructured input-output pairs.
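The LLM-as-judge pattern mentioned above can be sketched as follows. Here `judge` is a hypothetical callable wrapping a second model, and `stub_judge` is a trivial stand-in so the sketch runs:

```python
def judged_faithful(input_text: str, output_text: str, judge) -> bool:
    # ask a second model whether the output is acceptable; the judge's
    # only extra knowledge arrives through this prompt
    prompt = (f"Input: {input_text}\n"
              f"Output: {output_text}\n"
              "Is the output a faithful summary of the input? Answer yes or no.")
    return judge(prompt).strip().lower().startswith("yes")

def stub_judge(prompt: str) -> str:
    # deterministic stand-in for a real judging model
    return "yes" if "faithful" in prompt else "no"

verdict = judged_faithful("Fruits are healthy.", "Fruit is good for you.", stub_judge)
```

In production the judge would be a real model call, which of course re-introduces all the behaviour issues discussed in this post.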

In the next post we will start breaking down the llm function and understand its compositionality. This will help us understand where that extra information comes from and why it stops us from making reasonable assumptions about outputs.

Behaviours in Generative AI (Part 1)

While this may seem like a post about functions and testing in Python it is not. I need to establish some concepts before we can introduce Generative AI.

When we write a software function we expect it to be ‘well-behaved’. I define a well-behaved function as a function that is testable. A testable function needs stable behaviour so that it provides consistent outputs.

Tests provide some confidence for us to integrate and deploy a given function in production.

If the function’s behaviour is inconsistent, resulting in different outputs for the same input, then it becomes a lot harder to test.

To explain this in a simple way, imagine you have a function that accepts a single integer parameter x and adds ‘1’ to the provided input (x) and returns the result (y) as an integer.

In Python we could write this function as:

def add_one(x :int) -> int:
    y = x + 1
    return y

Now the above function is easily testable based on stated requirements for add_one. We can, for example, use assert statements to compare actual function output with expected function output. This allows us to make guarantees about the behaviour of the function in the ‘wild’ (in production).

def test_add_one() -> bool:
    assert add_one(10) == 11
    assert add_one(-1) == 0
    return True

Introducing and Detecting Bad Behaviour

Bad behaviour involves (as per our definition) inconsistency in the input-output pairing. This can be done in two ways:

  • Evolve the function
  • Introduce randomness

Introduce Randomness

Let us investigate the second option as it is easier to demonstrate. We will modify add_one by adding a random number after rounding it. The impact this has is subtle (try the code): the result is as expected some of the time. Our existing tests may still pass occasionally but there will be failures. This makes add_one complex to test. The frequency of inconsistent output depends on how randomness is introduced within the function. Given the current implementation we expect the tests to fail approximately 50% of the time (figure out why).

import random

def add_one(x :int) -> int:
    y = x + 1 + round(random.random())
    return y
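To see why roughly half the tests fail: `round(random.random())` evaluates to 1 whenever the uniform draw lands above 0.5, i.e. about half the time, so add_one returns x + 2 in roughly 50% of calls. A quick empirical check:

```python
import random

random.seed(1)  # make the estimate reproducible

trials = 10_000
# count how often the rounded random draw contributes an extra 1
extra_added = sum(round(random.random()) for _ in range(trials))
failure_rate = extra_added / trials  # fraction of calls returning x + 2
```

The estimate comes out close to 0.5, matching the expected ~50% test-failure rate.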

Evolve the function

Assume we have a rogue developer that keeps changing the code for the add_one function without updating the documentation or the function signature. In this case for example, the developer could change the operation from addition to subtraction without changing the function name, comments, or associated documentation.
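The evolved function might look like this: same name, same signature, silently different behaviour:

```python
def add_one(x :int) -> int:
    # rogue change: the addition has quietly become a subtraction,
    # while the name, signature, and documentation stay the same
    y = x - 1
    return y
```

Any caller relying on the documented behaviour now gets wrong results, and only the original tests stand between this change and production.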

Testing Our Example

Given our function is a single mathematical operation with one input and one output, we can objectively verify the results. The inconsistent behaviour resulting from the introduction of randomness or changes made by the rogue developer will be caught before the code is deployed.

Testing Functions with Complex Inputs and Outputs

Imagine if the function was processing and/or producing unstructured/semi-structured data.

Say it was taking a string and returning another string, or it returned an image of the string written in cursive or the spoken version of the string as an audio file (hope the connection with Gen AI is becoming clearer!). I show an example below of a summarising function that takes in some text (string) and returns its summary (string).

def summarise_text(input_text :str) -> str:
    # 'model' stands in for a Gen AI client exposing a generate method
    return model.generate([{"role":"user", "content": f"Summarise: {input_text}"}])

Such functions are difficult to test in an objective manner. Since such tests need exact input-output pairs, they will only help us validate the function within the narrow confines of the test inputs.

Therefore, in the above case we may not catch any changes made to such a complex function (whether through addition of randomness or through function evolution). Especially if the incorrect behaviour surfaces only for certain inputs.

Putting such a component into production therefore presents a different kind of challenge.

The Human Brain: The Ultimate Evolving Function

The human brain is the ultimate evolving function.

It takes all the inputs it receives, absorbs them selectively and changes the way it works – this is how we learn. The impressive thing is that as we learn new things we do not forget what we learnt before – the evolution is mostly constructive. For example, learning Math doesn’t mean we forget how to write English or ride a bicycle.

To mimic this our add_one function should be able to evolve and learn new tricks – for example how to deal with adding one to a complex number or for that matter adding one to anything. A generic signature for such a function would be:

def add_one(a: Any) -> Any:

It may surprise you to know that humans can ‘add_one’ quite easily to a wide range of inputs. Beyond mathematics we can:

  • add one object to a set of objects (e.g., marbles or toys or sweets)
  • add one time-period to a date or time
  • add one more spoon of sugar to the cake mix
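One way to sketch such a generic add_one in Python is `functools.singledispatch`, with hypothetical behaviours for the cases above (a truly ‘evolving’ version would keep learning new registrations over time):

```python
from datetime import date, timedelta
from functools import singledispatch
from typing import Any

@singledispatch
def add_one(a: Any) -> Any:
    # default: we have not 'learnt' this input type yet
    raise NotImplementedError(f"cannot add one to {type(a).__name__}")

@add_one.register
def _(a: int) -> int:
    return a + 1  # plain arithmetic

@add_one.register
def _(a: date) -> date:
    return a + timedelta(days=1)  # add one time-period to a date

@add_one.register
def _(a: set) -> set:
    return a | {object()}  # add one object to a set of objects
```

Each registration is a new ‘trick’ the function has learnt, and crucially the earlier registrations keep working – the evolution is constructive.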

Conclusion

So in this part of the series I have shown how well-behaved functions can be made to mis-behave, either by changing the function internals or by introducing randomness.

Furthermore, the input and output types also have an impact on how we identify whether a given function is well-behaved or not. Operations that give objective results or cases where the expected output can be calculated independently are easy to validate.

The deployment of such functions into production presents a significant challenge.

Generative AI models show exactly the same characteristics as mis-behaving functions.

All Generative AI models can be cast as functions (see the next post in this series). The source of their mis-behaviour is randomness as well as evolution. They do not evolve like our brains (by continuous learning) or through the actions of a rogue developer; they evolve every time they are re-trained and a newer version is released (e.g., GPT-4 after GPT-3.5).

Mirage of Generative AI Self-Governance

The use of generated artefacts is showing no signs of slowing down. As we integrate generative AI in live customer applications and daily use software like MS Office, we will find it difficult to monitor and govern this capability.

The concept that many vendors are chasing involves machines governing themselves (or other machines).

This automated governance comes in three main flavours:

  1. Small Model supervised by Large Model
  2. Two different but similarly performant Models with one governing the output from the other
  3. Self-supervision through reflection where the model governs its own output

The problem with all these methods is that we expect the governing model to know ‘something’ more than the model whose output is being governed.

This extra ‘thing’ can only come through the prompt, as there is usually no other way to supply the governing model with additional information on a per-request basis.

The Paradox

The fact that the models are trained with human oversight and guidance means they are implicitly trained to respond like a human, but with machine-like precision and alignment. This leads to a trade-off between exploration and repeatability.

What happens, then, if we ask a state-of-the-art model to list the ‘features’ of text written by an AI model? The text in question answers the prompt ‘Tell me something about fruits’.

This is what we get for GPT-4o:

The text exhibits several characteristics that suggest it was likely written by an AI model. It has a structured and coherent presentation of ideas about fruits, including their nutritional benefits, aesthetic appeal, and personal anecdote regarding watermelons. The transition from general statements about fruits to a personal story is somewhat abrupt and lacks deeper emotional context, which is common in AI-generated text. Additionally, the phrasing is generic and resembles the informative and light-hearted style often found in AI compositions.

GPT-4o

For Gemini Flash 1.5:

The text exhibits characteristics commonly found in AI-generated text:

  • Generic and positive statements: Phrases like “Fruits are nature’s candy” and “They are also good to eat and come in a wide variety of flavours” are generic and lack specific detail or nuance. This is typical of AI models trained on vast datasets of general knowledge.
  • Simple sentence structure: The sentences are short, simple, and declarative. More complex sentence structures and varied sentence beginnings are less common in AI-generated text.
  • Abrupt shift in topic: The transition from a general description of fruit to a personal anecdote about watermelons is abrupt and feels somewhat unnatural. A human writer would likely create a smoother transition.
  • Lack of unique perspective: The writing lacks a strong, individual voice or unique perspective. The statements about fruit are quite common and widely known.

While not definitive proof, the combination of these factors strongly suggests the text was likely generated by an AI model.

Gemini Flash 1.5

Now this is quite interesting: both models think that simple sentence structures, lack of a strong individual voice, and abrupt shifts in topic point to AI-generated content. But these features can show up in text written by humans, especially when the writer just wants to complete the task or is not proficient in English.

And both models were trained by different vendors!

So as we get AI trained by humans to behave more like humans, AI takes those features as the hallmarks of AI-generated content. This will lead to a reduction in variance and therefore fail to fulfil the requirement of a human-like yet precise and aligned response.

Why is this important? Thinking back to the earlier point about the governing model needing to know ‘more’: as we build models to govern AI-generated content, we are likely to set up a biased system, one trained to emulate human biases yet asked to govern AI-generated content.

Take for example the following piece of text:

Fruits are natures candy. They provide many of the required vitamins and minerals. They are also good to eat and come in a wide variety of flavours. Finally, they are amazing to look at with vibrant colours. But fruits like peaches can cause allergy.

Let me tell you a story about watermelons. They are my favourite fruit ever since I was a child.

This was written by me but both GPT-4o and Gemini Flash 1.5 thought this was written by AI. It seems I write like an AI model!

What can we do about it?

One solution is to provide variability within the generation process, so that during different phases of generation the model can adopt different styles, tones, etc., making the text read human-like while remaining precise and aligned.

This means finding some way of changing the model weights based on the current input while text generation is ongoing.

The temperature setting allows us to tailor the sampling process but this is applied after the fact and does not impact the model’s ‘thinking’.
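A sketch of what temperature actually does at the sampling step, assuming the usual softmax-with-temperature formulation. Note that the forward pass which produced the logits (the model’s ‘thinking’) is untouched:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # temperature rescales the logits only at sampling time, after the
    # model has already produced them
    if temperature == 0.0:
        # greedy decoding: all probability mass on the arg-max token
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

greedy = softmax_with_temperature([2.0, 1.0, 0.0], 0.0)  # deterministic
warm = softmax_with_temperature([2.0, 1.0, 0.0], 1.0)
hot = softmax_with_temperature([2.0, 1.0, 0.0], 2.0)     # flatter distribution
```

Raising the temperature flattens the distribution the tokens are sampled from, but never changes the logits themselves: the adjustment is entirely ‘after the fact’.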

One way to visualise this: current models are designed as static structures, like old-school skyscrapers that cannot exceed a certain height because they cannot adapt to the stress caused by the wind. What is needed is a design that adapts to its environment, like modern skyscrapers that flex with the wind.

The environment for the model includes the prompt, the task, the data, and the partially generated output.

Static vs Flex…