Take Care of Your Agents!

How does one identify a ‘real’ AI Agent developer? They don’t use ‘adk web’ to demo their agent.

Silly jokes aside, in this post I want to talk about building an app with AI agents. The focus is on the GCP stack. Google's Agent Engine + ADK combination looks like a really good option for pro-code development that abstracts away many common tasks.

I do not like magical abstraction as it blocks my understanding of 'how things work'. Therefore, I will use Google's ADK to build our agents, and instead of hosting them on Agent Engine I will write the application that provides a snug and safe environment for the agents.

This will be a text-heavy post where I introduce the key aspects. I will follow this up with a code-heavy set of posts looking at Sessions, Memory, and Telemetry.

Let us first look at what Agent Engine helps us with as agent developers:

  1. Session State: the internal state that the agent uses to carry out the task – includes knowledge and facts gathered
  2. Long Term Memory: the long-term memory created by compressing and updating knowledge based on the agent's internal state
  3. Logging/Telemetry: the data stream coming from the agent that helps us monitor, troubleshoot and audit.
  4. Agent Runner and API: the infrastructure that runs the agents and exposes the API through which our agent is published for use.

All of the things in the list above are done for us when we use 'adk web' or a similar 'agent testing' tool. This container for the agents is the 'Agent Application'.

Agent Engine and adk web are two examples of 'Agent Applications' that we can use to deploy agents written in ADK. Agent Engine also supports LangGraph and other frameworks.

Writing an Agent Application

Writing an Agent Application requires a deep dive into concepts that are critical to agent operations: session, memory, and telemetry. Fortunately, ADK gives us all the core capabilities to build these out. For example, when using adk web our agents remember what the user said before; this is because the adk web Agent Application is taking care of session management for us.
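To make this concrete, below is a minimal sketch of what such an Agent Application does when it runs an ADK agent outside adk web: it owns the SessionService, creates the session, and drives the Runner. This assumes the google-adk Python package; the agent name, model string, and IDs are placeholders, and create_session is awaited as in recent ADK releases.

```python
# Minimal sketch of an "Agent Application": it owns the SessionService and the
# Runner, and drives the agent on behalf of a user. Names and IDs are
# illustrative placeholders.
import asyncio

from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

agent = Agent(
    name="account_opening_agent",   # placeholder agent
    model="gemini-2.0-flash",       # placeholder model id
    instruction="Help the user open a new account.",
)

session_service = InMemorySessionService()
runner = Runner(agent=agent, app_name="demo_app", session_service=session_service)


async def main() -> None:
    # The application, not the agent, creates and reuses the session.
    await session_service.create_session(
        app_name="demo_app", user_id="user_123", session_id="session_001"
    )
    message = types.Content(
        role="user", parts=[types.Part(text="I want to open a new account")]
    )
    async for event in runner.run_async(
        user_id="user_123", session_id="session_001", new_message=message
    ):
        if event.is_final_response() and event.content:
            print(event.content.parts[0].text)


asyncio.run(main())
```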

One also needs to understand the lifecycle of an interaction so let us look at this first.

Interaction Lifecycle

In any agentic interaction there is usually a user or an external system (or agent), there is an agentic application they interact with, and the interaction is usually bound to an intent; this intent-bound interaction is called a session (e.g., I want to open a new account). The interaction continues until the intent is satisfied or rejected.

The interaction lifecycle gives us scope for any state and memory we use.

User: Scope is the user. This will typically be information associated with the user, like their username, preferences, etc. It spans application and session scope. Identified by 'user_id'.

Application: Scope is the application. This will typically be information required to operate the app. This allows the same agent to have different application state when used in different applications: for example, credentials for agents, application-level guardrails (e.g., do not give advice), and global facts. Identified by 'app_name'.

Session: Scope is the current session. This will contain the short-term (or scratchpad) state associated with the ongoing interaction for the agent to use while the session is active. This includes the user's inputs, like the reason for contact, and any responses by the agent. The client in this case will need to ensure state management by reusing the same user, session, and app IDs when it wishes to maintain session continuity (e.g., in case of a disconnect).
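As I understand it, ADK reflects these scopes directly in session state keys: keys prefixed with 'app:' are shared across the application, 'user:' keys are shared across that user's sessions, 'temp:' keys live only for the current invocation, and unprefixed keys are session-scoped. A small sketch with illustrative keys, written from inside a callback where ADK tracks the changes for us:

```python
# Illustrative sketch of how the scopes map onto ADK state-key prefixes.
# The keys and values are placeholders; the prefix convention is my reading of
# the ADK state documentation.
def remember_contact_reason(callback_context):
    state = callback_context.state
    state["app:advice_guardrail"] = "do not give advice"   # application scope
    state["user:preferred_name"] = "Sam"                   # user scope
    state["contact_reason"] = "open a new account"         # session scope (no prefix)
    state["temp:raw_form_payload"] = {"channel": "web"}    # discarded after this invocation
    return None
```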

Now let us look at Session, Memory, and Telemetry.

Session

Session is all about the local state of the agent. ADK has a Session object that provides the data structure to store session information and a SessionService that manages that data structure.

The SessionService is responsible for the entire session state management, and Google does not want us to manually write to the Session object. The only two special cases are when the interaction starts (to preload state) and when it finishes (to save state). Even for these we are expected to use the appropriate lifecycle callbacks. More on callbacks later.
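As a sketch of those two special cases, the agent-level lifecycle callbacks can preload state when the interaction starts and persist an outcome when it ends. The load/save helpers below are hypothetical stand-ins for the application's own storage layer, not part of ADK.

```python
# Sketch: preload state at the start of the interaction and save at the end,
# via the agent lifecycle callbacks. fetch_customer_profile and persist_outcome
# are hypothetical placeholders for the application's storage layer.
from google.adk.agents import Agent


def fetch_customer_profile(user_id: str) -> dict:
    return {"user_id": user_id, "segment": "retail"}  # placeholder lookup


def persist_outcome(reason: str | None) -> None:
    print(f"saving outcome: {reason}")  # placeholder save


def preload_state(callback_context):
    callback_context.state["user:profile"] = fetch_customer_profile("user_123")
    return None  # returning None lets the agent run as normal


def save_state(callback_context):
    persist_outcome(callback_context.state.get("contact_reason"))
    return None


agent = Agent(
    name="account_opening_agent",
    model="gemini-2.0-flash",
    instruction="Help the user open a new account.",
    before_agent_callback=preload_state,
    after_agent_callback=save_state,
)
```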

SessionService comes in three flavours:

  • InMemorySessionService – for testing locally (outside Agent Engine).
  • VertexAISessionService – for production deployment using Agent Engine (persistence is abstracted away).
  • DatabaseSessionService – a developer-managed, generic persistent session service for production environments that can use any SQL database.
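A sketch of how the three might be constructed; the constructor arguments are my assumptions about the current API, and the project, location, and database URL values are placeholders.

```python
# Sketch: choosing a SessionService flavour. Constructor arguments reflect my
# understanding of the current ADK API; values are placeholders.
from google.adk.sessions import (
    DatabaseSessionService,
    InMemorySessionService,
    VertexAiSessionService,
)

# Local testing: nothing survives a process restart.
session_service = InMemorySessionService()

# Agent Engine deployment: persistence is handled by Vertex AI.
# session_service = VertexAiSessionService(project="my-gcp-project", location="us-central1")

# Self-managed persistence: any SQLAlchemy-compatible database URL.
# session_service = DatabaseSessionService(db_url="sqlite:///./sessions.db")
```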

The Session object is identified by the combination of session_id, user_id, and app_name.

Session object has the following properties:

  • History – an automatically updated list of significant events (managed by ADK).
  • Session State – the temporary state of the agent to be managed by the application for use by the agent as a scratchpad.
  • Activity Tracking – this is a timestamp of the last event for tracking (managed by ADK).
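Continuing the earlier sketch, these three properties surface on the Session object roughly as an events list, a state dictionary, and a last-update timestamp (attribute names reflect my reading of the current API):

```python
# Sketch: reading a Session back from the SessionService. Attribute names
# (events, state, last_update_time) reflect my reading of the ADK API.
async def inspect_session(session_service) -> None:
    session = await session_service.get_session(
        app_name="demo_app", user_id="user_123", session_id="session_001"
    )
    print(len(session.events))        # History: significant events, managed by ADK
    print(session.state)              # Session State: the scratchpad dictionary
    print(session.last_update_time)   # Activity Tracking: timestamp of the last event
```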

A key question for me is how one manages shared user sessions (e.g., in the case of a joint bank account). Some creative coding and user management is needed.

Memory

This represents the session state processed and stored as a ‘memory’ for use in later interactions. The retrieval happens based on the user_id and app_name to ensure compartmentalisation.

The MemoryService manages the creation, management, and retrieval of memories. As developers we do not have any control over the memory creation and management process. If we want full control, there are examples of accessing closed session state, or checkpointing running session state, to create and manage memories manually.

It comes in two flavours:

  • InMemoryMemoryService – nicely named service for when we are testing.
  • VertexAIMemoryBank – to be used with Agent Engine which abstracts the process and provides persistence.
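A sketch of the flow as I understand it: the application folds a completed session into memory and later searches it for the same user and app. The method names are my assumptions about the MemoryService interface; the Vertex AI variant would simply replace the in-memory service.

```python
# Sketch: moving a finished session into long-term memory and querying it later.
# Method names reflect my understanding of the ADK MemoryService interface.
from google.adk.memory import InMemoryMemoryService

memory_service = InMemoryMemoryService()


async def archive_and_recall(session_service) -> None:
    # When the session is complete, fold it into long-term memory.
    closed_session = await session_service.get_session(
        app_name="demo_app", user_id="user_123", session_id="session_001"
    )
    await memory_service.add_session_to_memory(closed_session)

    # In a later interaction, retrieve relevant memories for the same user/app.
    results = await memory_service.search_memory(
        app_name="demo_app", user_id="user_123", query="account opening"
    )
    for memory in results.memories:
        print(memory)
```

In practice, I believe the same memory service is also passed to the Runner so that the agent can query it through a memory tool such as load_memory.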

Telemetry

Telemetry includes all information that flows in and out of an agent. These streams then have to be tied together, as information flows through a system of multiple interacting agents, so that we can identify issues and troubleshoot when things go wrong.

Without telemetry we will not be able to operate these systems in production. Regulated industries like banking and healthcare have additional audit requirements. Those requirements should drive what we collect; this post will talk about how to collect it.

When using Agent Engine, a variety of tools are available to collect the OpenTelemetry-based feeds it produces. Cloud Trace is the GCP-native service for this.

Collecting with Callbacks

ADK supports six basic callbacks. These enable us to access state and input/output information as the agent operates.

Before/After LLM: Callbacks triggered just before and after the LLM is engaged. Use these to audit the LLM request and response, as well as to intercept for guardrails (checking input for relevance, harm, etc. and checking output for quality and content).

Before/After Tool: Callbacks triggered just before and after a tool is engaged. Use these to audit tool use (inputs and outputs) as well as to debug the tool requests generated by the LLM.

Before/After Agent: Callbacks triggered just before and after the agent engages. These are used to audit the inputs going into the agent and the responses it produces; specifically, the input from and output to the user.

LLM and Tool Callbacks tell us about the inner workings of the agent (private thoughts). Agent callbacks tell us about the external workings of the agent (public thoughts).
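A sketch of collecting telemetry this way, logging LLM and tool traffic from the 'before' callbacks; the callback signatures follow my reading of the ADK documentation and may differ slightly between versions.

```python
# Sketch: emitting telemetry from callbacks. Returning None from a "before"
# callback lets the original model call or tool call proceed unchanged.
import logging

from google.adk.agents import Agent

logger = logging.getLogger("agent.telemetry")


def log_llm_request(callback_context, llm_request):
    logger.info(
        "llm_request agent=%s invocation=%s",
        callback_context.agent_name,
        callback_context.invocation_id,
    )
    return None  # do not short-circuit the model call


def log_tool_call(tool, args, tool_context):
    logger.info("tool_call tool=%s args=%s", tool.name, args)
    return None  # do not replace the tool result


agent = Agent(
    name="account_opening_agent",
    model="gemini-2.0-flash",
    instruction="Help the user open a new account.",
    before_model_callback=log_llm_request,
    before_tool_callback=log_tool_call,
)
```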

Callbacks are the Google-recommended mechanism for changing state, and Google also provides a set of patterns for using them.

Remember that callbacks must be implemented in a manner that:

  1. Does not add latency – use an async mechanism
  2. Does not introduce parallel logic for processing data
  3. Does not become an alternative integration point

Logging

ADK also supports the standard Python logging sub-system. If the DEBUG level is used, every aspect of a running agent is logged, and then some. But this increases the data processing load and tends to slow down processing.
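Turning this on is just the standard logging configuration; DEBUG is shown here, but INFO or WARNING is a safer default outside local debugging.

```python
# Sketch: ADK components log through the standard Python logging module, so the
# usual configuration applies. DEBUG is very verbose and slows things down.
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```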

High data volumes will require significant processing (including correlation of the different streams, root-cause analysis, and time-series analysis) before the data can be converted into an event stream prioritised by criticality to enable human-on-the-loop operation (human as a supervisor). I do not expect human-in-the-loop (human as an approver) to be a viable method for controlling and monitoring multi-agent systems.
