Behaviours in Generative AI (Part 3)

Part 1 talks about well-behaved and mis-behaving functions with a gentle introduction using Python. We then describe how these can be used to understand the behaviour of Generative AI models.

Part 2 gets into the details of the analysis of Generative AI models by introducing an llm function. We also describe how we can flip between well-behaved and mis-behaving model behaviours.

In this part we use some diagrams to tease apart our llm function and understand visually how its behaviours change.

Let us first recall the functional representation of a given LLM:

llm(instructions :str, behaviour :dict) -> output :str

This can be visualised in shortened form as:

Figure 1: Visual representation of the LLM function

The generalisation being: for some instruction and behaviour (input) we get some output. Let us not worry about individual inputs/outputs/behaviours for the next part, to keep the explanation simple. That means we won’t worry too much about whether a particular output is correct or not.

Now LLMs consist of two blocks: the ‘ye olde’ Neural Network (NN), which is the bit that is trained, and the selection function. The selection function is the dynamic bit and the Neural Network the static bit – the weights that make up the NN do not change once trained.

We can represent this decomposition of the llm function into NN function and Selection function as:

Figure 2: Decomposing LLM function into NN and Selection function

The loops within the llm function indicate repeated execution until some condition is met and the process ends. Another way to think about this without loops is that the calls to the NN function and the Selection function are chained, and the length of this chain unrolls through time: each step is determined by the past text (the original instructions plus any tokens added so far) together with the token generated in the current step.
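To make the decomposition concrete, here is a toy sketch of this loop in Python. Everything here (the tiny fixed vocabulary, the helper functions) is hypothetical, purely to show the shape of the chain, not a real model:

def tokenize(text :str) -> list:
    return text.split()

def nn(tokens :list) -> dict:
    # the static bit: the trained weights map the past text to a score per
    # vocabulary token (a fixed toy vocabulary is used here as a stand-in)
    return {"hello": 0.9, "world": 0.7, "<end>": 0.1}

def select(scores :dict, behaviour :dict) -> str:
    # the dynamic bit: the selection function picks the next token from the scores
    return max(scores, key=scores.get)

def llm(instructions :str, behaviour :dict) -> str:
    tokens = tokenize(instructions)
    while tokens[-1] != "<end>" and len(tokens) < 20:   # repeat until a stop condition
        scores = nn(tokens)                             # score the whole vocabulary
        tokens.append(select(scores, behaviour))        # the chain unrolls one token per step
    return " ".join(tokens)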

Root Cause of the Mis-behaviour

As the name and the flow in Figure 2 suggest, the Selection function selects something from the output of the NN function.

The NN function

The output of the NN function is nothing but the vocabulary (i.e., all the tokens that the model has seen during training) mapped to a score. This score indicates the model’s assessment of what the next token should be (in the current step), based on the text generated so far (this includes the original instructions and any tokens generated in previous steps).

Now as the NN function creates a ‘scored’ set of options for the next token, and not a single token that represents what goes next, we have a small problem: which token from the vocabulary goes next?

The Selection Function

The Selection function solves this problem. This is a critical function because it not only influences which token is selected during the current step, it also influences the trajectory of the response one token at a time. Therefore, a mistake made early in the generation, or at a particularly important token, is difficult to recover from. Think, for example, of solving a maths problem: if even one digit is incorrectly selected the calculation is ruined beyond recovery. Remember, LLMs cannot overwrite tokens once generated.

The specific function we use defines the mechanism of selection. For example, we could disregard the score and pick a random token from the vocabulary. This is unlikely to give a cohesive response (output) in the end: we are nullifying the hard work done by the model in the current step, and making the future steps harder, because each random selection breaks the cohesion of the response.

Greedy Selection

The easiest and perhaps the least risky option is to take the token with the highest score. This is least risky because the NN function is static: with a given instruction and behaviour we will always get the same token scores. With greedy token selection (going for the highest score) we will therefore select the same token in each step (starting from step 1). As we get the same token scores from the start, we end up building the same response again; with each step we walk down the same path as before. You will notice in Figure 3 that the overall architecture of the llm function has not changed. This is our well-behaved function: given a specific instruction and behaviour we get the same specific output.

Figure 3: Greedy selection function.
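A minimal sketch of greedy selection, assuming scores is a tensor holding one score per vocabulary token (the output of the NN function):

import torch

def greedy_select(scores :torch.Tensor) -> int:
    # always pick the highest scoring token: same input, same scores, same path
    return int(torch.argmax(scores))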

Sampling Selection

The slightly harder and riskier option is to use some kind of statistical sampling method (remember the ‘do_sample’ = True behaviour from the previous posts) which takes the scores into account. For example, take the top 3 tokens by score and select randomly from them, or dynamically take the top ‘n’ using some cooldown or heat-up function. The risk here is that given we are using random values there is a chance for the generation to be negatively impacted. In fact such an llm function will be badly-behaved and not compositional (see below for a definition), because given the same instruction and behaviour we are no longer guaranteed to get the same output (an inconsistent relation between input and output).

Figure 4: Sampling function for selection and introduction of hidden data.

This inconsistency requires extra information coming from a different source given that the inputs are not changing between requests.

In Figure 4, we identify this hidden information as the random number used for sampling, and the source as a random number generator. In reality we can make this random shift between outputs ‘predictable’ because we are actually using pseudo-random numbers. With the same seed value (42 being the favourite) we will get the same sequence of random numbers and therefore the same sequence of responses for a given instruction and behaviour.
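A minimal sketch of such a sampling selection, using the same per-token scores as before; the top_k and temperature values are illustrative, and the generator is the ‘hidden’ source of randomness:

import torch

def sample_select(scores :torch.Tensor, temperature :float = 1.0,
                  top_k :int = 3, generator=None) -> int:
    # keep only the top-k scoring tokens, then draw one at random,
    # weighted by their temperature-scaled scores
    top = torch.topk(scores, k=top_k)
    probs = torch.softmax(top.values / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return int(top.indices[choice])

# the hidden data: a pseudo-random generator; the same seed gives the same
# sequence of draws and hence the same sequence of ‘different’ responses
rng = torch.Generator().manual_seed(42)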

We are no longer dealing with a one-to-one relation between input and output; instead we have to think about the relationship between a given input and some collection of outputs.

I hope this has given you a brief glimpse into the complexity of testing LLMs when we enable sampling.

This inconsistency or pseudo-randomness is beneficial when we want to avoid robotic responses (‘the computer says no‘). For example, what if your family or friends greeted you in the exact same manner every time you met them, and the conversation then went down the exact same arc? Wouldn’t that be boring? In fact this controlled use of randomness can help respond in situations that require different levels of consistency. For more formal/less-variable responses (e.g., writing a summary) we can tune down the randomness (using the ‘temperature’ variable – see previous post) and where we want more whimsical responses (e.g., writing poetry) we can turn it up.

Conclusion

We have decomposed our llm function into a NN function and a Selection function to investigate the source of inconsistent behaviours. We have discovered this is caused by hidden data, in the form of random values, being provided to the function for use in selecting the next token.

The concept of changing function behaviour because of extra information coming in through hidden paths is an important one.

Our brain is a big function.

A function with many hidden sources of information apart from the senses that provide input. Also because the network in our brain grows and changes as we age and learn, the hidden paths themselves must evolve and change.

This is where our programming paradigm starts to fail because while it is easy to change function behaviour it is very difficult to change its interface. Any software developer who has tried to refactor code will understand this problem.

Our brain though is able to grow and change without any major refactoring. Or maybe mental health issues are a sign of failed refactoring in our brains. I will do future posts on this concept of evolving functions.

A Short Description of Compositionality

We all have heard of the term composability – which means combining multiple things/objects/components to create something new. For example we can combine simple LEGO bricks to build complex structures.

Compositionality takes this to the next level. It means that the rules for composing the components must be composable as well. That is why all LEGO compatible components have the same sized buttons and holes to connect.

To take a concrete example – language has compositionality – where the meaning of a phrase is determined by its components and syntax. This ‘and syntax’ bit is compositionality. For example we can compose the English language syntax rule of subject-verb-object to make infinitely complex sentences: “The dog barked at the cat which was chasing the rat”. The components (words) and syntax taken together provide the full meaning. If we reorder the components or change the rule composition there are no guarantees the meaning would be preserved.

Behaviours in Generative AI (Part 2)

The first part of this post (here) talks about how Generative AI models are nothing but functions that are not well behaved. Well-behaved means functions that are testable and therefore provide consistent outputs.

Functions misbehave when we introduce randomness and/or the function itself changes. The input and output types of the function are also important when it comes to understanding its behaviour and testing it for use in production.

In this post we take a deep dive into why we should treat Gen AI models as functions and the insight that approach provides.

Where Do Functions Come Into The Picture?

Let us look at a small example that uses the transformers library to execute an LLM. In the snippet below we see why Gen AI models are nothing but functions. The pipe function wraps the LLM and the associated task (e.g., text generation) for ease of use.

output = pipe(messages, **configs)

The function pipe takes in two parameters: the input (messages) that describes the task and a set of configuration parameters (configs) that tune the behaviour of the LLM. The function then returns the generated output.
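For context, here is roughly how such a pipeline could be set up (a sketch: the model name is the Phi-3.5 model used later in this post, and ‘max_new_tokens’ is an illustrative setting):

from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct")

messages = [{"role": "user", "content": "Hello, how are you?"}]
configs = {"do_sample": False, "max_new_tokens": 100}

output = pipe(messages, **configs)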

To simplify the function let us reformulate it as a function called llm that takes in some instructions (containing user input, task information, data, examples, etc.) and some variables that select the behaviour of the function. Ignore for the moment the complexity hiding inside the LLM call as that is an implementation detail and not relevant for the functional view.

llm(instructions, behaviour) -> output

Let us look at an example behaviour configuration. The ‘temperature’ and ‘do_sample’ settings tune the randomness in the llm function. With ‘do_sample’ set to ‘False’ we have tuned all randomness out of the llm function.

configs = {
    "temperature": 0.0,
    "do_sample": False,
}

The above config therefore makes the function deterministic and testable. Given a specific instruction and behaviour that removes any randomness – we will get the exact same output. You can try out the example here: https://huggingface.co/microsoft/Phi-3.5-mini-instruct. As long as you don’t change the input (which is a combination of the instruction and behaviour) you will not notice a change in the output.

The minute we change the behaviour config to introduce randomness (set ‘do_sample’ to ‘True’ and ‘temperature’ to a value greater than zero) we enter the territory of ‘bad behaviour’. Given the same instruction we get different outputs. The higher the temperature value, the greater the variance in the output. To understand how this works please refer to this article. I next show this change in behaviour through a small experiment.

The Experiment

Remember input to the llm function is made up of an instruction and some behaviour config. Every time we change the ‘temperature’ value we treat that as a change in the input (as this is a change in the behaviour config).

The common instruction across all the experiments is ‘Hello, how are you?’, which provides an opening with multiple possible responses. The model used is Microsoft’s Phi-3.5-mini. We use the ‘sentence_transformers’ library to evaluate the similarity of the outputs.

For each input we run the llm function 300 times and evaluate the similarity of the outputs produced. Then we change the input (by increasing the temperature value) and run it again 300 times. Rinse repeat till we reach the temperature of 2.0.
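A sketch of how the output similarity could be computed with the sentence_transformers library (the embedding model name here is illustrative, not necessarily the one used in the experiment):

import torch
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_stats(outputs :list) -> tuple:
    # embed every output and compute pairwise cosine similarity
    embeddings = encoder.encode(outputs, convert_to_tensor=True)
    scores = util.cos_sim(embeddings, embeddings)
    # keep only the unique pairs (upper triangle, excluding the diagonal)
    idx = torch.triu_indices(len(outputs), len(outputs), offset=1)
    pairs = scores[idx[0], idx[1]]
    return pairs.mean().item(), pairs.std().item()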

Figure 1: Statistics of output similarity scores, given the same input, as the temperature is increased.

The Consistent LLM

The first input uses behaviour setting of ‘do_sample’ = ‘False’ and ‘temperature’ = 0.0. We find no variance at all in the output when we trigger the llm function 300 times with this input. This can be seen in Figure 1 at Temperature = 0.00, where the similarity between all the outputs is 1 which indicates identical outputs (mean of 1 and std. dev of 0).

The llm function behaves in a consistent manner.

We can create tests by generating input-output pairs relevant to our use-case that will help us detect any changes in the llm function. This is an important first step because we do not control the model lifecycle for much of the Gen AI capability we consume.

But this doesn’t mean that we have completely tested the llm function. Remember from the previous post – if we have unstructured types then we are increasing the complexity of our testing problem. Let us understand the types involved by annotating the function signature:

llm(instructions :str, behaviour :dict) -> output :str

Given the unstructured types (string) for instructions and output, it will be impossible for us to exhaustively test for all possible instructions and validate output for correctness. Nor can we use mathematical tricks (like induction) to provide general proofs.

The Inconsistent LLM

Let us now look at how easily we can complicate the above situation. We change the input by changing only the behaviour config: we now set ‘do_sample’ = ‘True’ and ‘temperature’ = 0.1. This change has a big impact on the behaviour of our llm function. Immediately the standard deviation of the similarity score for the 300 outputs starts increasing. The mean similarity also starts to drop from ‘1’ (identical).

As we increase the temperature (the only change made to the input) and collect 300 more outputs we find the standard deviation keeps increasing and the mean similarity score continues to drop.

We start to see variety in the generated output even though the input is not changing.

The exact same input is giving us different outputs!

Let us see how the distribution of the output similarity score changes with temperature.

Figure 2: Changes in the output similarity score distribution with the increase in temperature.

We can see in Figure 2 that at a temperature of 0.0 all the outputs are identical (similarity score of 1). As we start increasing the temperature we find a wider variety of outputs being produced as the similarity score distribution broadens out.

At a temperature of 1.0 many of the outputs are still identical (grouping around a score of 1) but we see some outputs are not similar at all (the broadening towards a score of 0).

At a temperature of 2.0 we find that there are no identical outputs (no scores of 1); instead we find the similarity scores spread between 0.4 and 0.9.

This makes it impossible to prepare test cases based on checking input-output value pairs. Temperature is also just one of many settings we can use to influence behaviour.

We need to find new ways of testing mis-behaving functions based on semantic correctness and suitability rather than value checks.
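One possible shape for such a test, reusing the encoder from the sketch above (the 0.8 threshold is an arbitrary illustration, not a recommended value):

def test_llm_semantics(llm, instruction :str, reference :str, threshold :float = 0.8):
    # instead of asserting an exact output value, assert that the output
    # is semantically close enough to a reference answer
    output = llm(instruction)
    embeddings = encoder.encode([output, reference], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    assert score >= threshold, f"semantic similarity {score:.2f} below {threshold}"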

Why do we get different Outputs for the same Input?

The function below starts to misbehave and provides different outputs for the same input as soon as we set the ‘temperature’ > 0.0.

llm(instructions :str, behaviour :dict) -> output :str

Even though we are not changing anything in the input, something still changes the output. Therefore, some hidden information must be provided to the function without our knowledge. This hidden information is randomness. This was one of the sources of change we discussed in the previous post.

Conclusions

We have seen how we can make LLMs misbehave and make them impossible to test in standard software engineering fashion by changing the ‘temperature’ configuration.

Temperature is not just a setting designed to add pain to our app building efforts. It provides a way to control creativity (or maybe I should call it variability) in the output.

We may want to reduce the temperature setting for cases where we want consistency (e.g., when summarising customer chats) and increase it for when we want some level of variability (e.g., when writing a poem). It wouldn’t be any fun if two users posted the same poem written by ChatGPT!

We need to find new ways of checking the semantic correctness of the output, and these are not value-matching tests. That is why we find ourselves increasingly dependent on other LLMs for checking unstructured input-output pairs.

In the next post we will start breaking down the llm function and understand its compositionality aspects. This will help us understand where that extra information is coming from that doesn’t allow us to make reasonable assumptions about outputs.

Behaviours in Generative AI (Part 1)

While this may seem like a post about functions and testing in Python it is not. I need to establish some concepts before we can introduce Generative AI.

When we write a software function we expect it to be ‘well-behaved’. I define a well-behaved function as a function that is testable. A testable function needs stable behaviour so that it provides consistent outputs.

Tests provide some confidence for us to integrate and deploy a given function in production.

If the function’s behaviour is inconsistent, resulting in different outputs given the same input, then it becomes a lot harder to test.

To explain this in a simple way, imagine you have a function that accepts a single integer parameter x and adds ‘1’ to the provided input (x) and returns the result (y) as an integer.

In Python we could write this function as:

def add_one(x :int) -> int:
    y = x + 1
    return y

Now the above function is easily testable based on stated requirements for add_one. We can, for example, use assert statements to compare actual function output with expected function output. This allows us to make guarantees about the behaviour of the function in the ‘wild’ (in production).

def test_add_one() -> bool:
    assert add_one(10) == 11
    assert add_one(-1) == 0
    return True

Introducing and Detecting Bad Behaviour

Bad behaviour involves (as per our definition) inconsistency in the input-output pairing. This can be done in two ways:

  • Evolve the function
  • Introduce randomness

Introduce Randomness

Let us investigate the second option as it is easier to demonstrate. We will modify add_one by adding a rounded random number to the result. The impact this has is subtle (try the code): the result is as expected only some of the time. Our existing tests may still pass occasionally, but there will be failures. This makes the add_one function complex to test. The frequency of inconsistent output depends on how the randomness is introduced within the function. Given the current implementation we expect the tests to fail approximately 50% of the time (figure out why).

import random

def add_one(x :int) -> int:
    y = x + 1 + round(random.random())
    return y

Evolve the function

Assume we have a rogue developer who keeps changing the code for the add_one function without updating the documentation or the function signature. For example, the developer could change the operation from addition to subtraction without changing the function name, comments, or associated documentation.

Testing Our Example

Given our function is a single mathematical operation with one input and one output, we can objectively verify the results. The inconsistent behaviour resulting from the introduction of randomness or changes made by the rogue developer will be caught before the code is deployed.

Testing Functions with Complex Inputs and Outputs

Imagine if the function was processing and/or producing unstructured/semi-structured data.

Say it was taking a string and returning another string, or it returned an image of the string written in cursive or the spoken version of the string as an audio file (hope the connection with Gen AI is becoming clearer!). I show an example below of a summarising function that takes in some text (string) and returns its summary (string).

def summarise_text(input_text :str) -> str:
    # 'model' is some LLM client (defined elsewhere) that accepts chat-style messages
    return model.generate([{"role": "user", "content": f"Summarise: {input_text}"}])

Such functions are difficult to test in an objective manner. Since tests need exact input-output pairs, any tests will only help us validate the function within the narrow confines of the test inputs.

Therefore, in the above case we may not catch any changes made to such a complex function (whether through addition of randomness or through function evolution). Especially if the incorrect behaviour surfaces only for certain inputs.

Putting such a component into production therefore presents a different kind of challenge.

The Human Brain: The Ultimate Evolving Function

The human brain is the ultimate evolving function.

It takes all the inputs it receives, absorbs them selectively and changes the way it works – this is how we learn. The impressive thing is that as we learn new things we do not forget what we learnt before – the evolution is mostly constructive. For example, learning Math doesn’t mean we forget how to write English or ride a bicycle.

To mimic this our add_one function should be able to evolve and learn new tricks – for example how to deal with adding one to a complex number or for that matter adding one to anything. A generic signature for such a function would be:

def add_one(a: Any) -> Any:

It may surprise you to know that humans can ‘add_one’ quite easily to a wide range of inputs. Beyond mathematics we can:

  • add one object to a set of objects (e.g., marbles or toys or sweets)
  • add one time-period to a date or time
  • add one more spoon of sugar to the cake mix

Conclusion

So in this part of the series I have shown how well-behaved functions can be made to mis-behave. This involves either changing the function internals or introducing randomness.

Furthermore, the input and output types also have an impact on how we identify whether a given function is well-behaved or not. Operations that give objective results or cases where the expected output can be calculated independently are easy to validate.

The deployment of such functions into production presents a significant challenge.

Generative AI models show exactly the same characteristics as mis-behaving functions.

All Generative AI models can be cast as functions (see the next post in this series). The source of their mis-behaviour comes from randomness as well as evolution. They do not evolve like our brains (by continuous learning) or through the actions of a rogue developer. They evolve every time they are re-trained and a newer version released (e.g., ChatGPT-4 after ChatGPT-3.5).

Mirage of Generative AI Self-Governance

The use of generated artefacts is showing no signs of slowing down. As we integrate generative AI in live customer applications and daily use software like MS Office, we will find it difficult to monitor and govern this capability.

The concept that many vendors are chasing involves machines governing themselves (or other machines).

This automated governance comes in three main flavours:

  1. Small Model supervised by Large Model
  2. Two different but similarly performant Models with one governing the output from the other
  3. Self-supervision through reflection where the model governs its own output

The problem with all these methods is that we expect the governing model to know ‘something’ more than the model whose output is being governed.

This extra ‘thing’ can only come through the prompt as usually there is no other way to tune the governing model with additional information on a per request basis.

The Paradox

The fact that the model is trained with human oversight and guidance means that the models are implicitly trained to respond like a human but with machine-like precision and alignment. This leads to a trade-off between exploration and repeatability.

So what if we ask a state-of-the-art model to identify the ‘features’ of text written by an AI model? The text being analysed (shown later in this post) responds to the question ‘Tell me something about fruits’.

This is what we get for GPT-4o:

The text exhibits several characteristics that suggest it was likely written by an AI model. It has a structured and coherent presentation of ideas about fruits, including their nutritional benefits, aesthetic appeal, and personal anecdote regarding watermelons. The transition from general statements about fruits to a personal story is somewhat abrupt and lacks deeper emotional context, which is common in AI-generated text. Additionally, the phrasing is generic and resembles the informative and light-hearted style often found in AI compositions.

GPT-4o

For Gemini Flash 1.5:

The text exhibits characteristics commonly found in AI-generated text:

  • Generic and positive statements: Phrases like “Fruits are nature’s candy” and “They are also good to eat and come in a wide variety of flavours” are generic and lack specific detail or nuance. This is typical of AI models trained on vast datasets of general knowledge.
  • Simple sentence structure: The sentences are short, simple, and declarative. More complex sentence structures and varied sentence beginnings are less common in AI-generated text.
  • Abrupt shift in topic: The transition from a general description of fruit to a personal anecdote about watermelons is abrupt and feels somewhat unnatural. A human writer would likely create a smoother transition.
  • Lack of unique perspective: The writing lacks a strong, individual voice or unique perspective. The statements about fruit are quite common and widely known.

While not definitive proof, the combination of these factors strongly suggests the text was likely generated by an AI model.

Gemini Flash 1.5

Now this is quite interesting: both models think that simple sentence structures, lack of a strong individual voice, and abrupt shifts in topic point to AI-generated text. But these features can show up in text written by humans, especially if the writer just wants to complete the task or is not proficient in English.

And both models were trained by different vendors!

So as we get AI trained by humans to behave more like humans, AI takes those features as the hallmarks of AI generated content. This will lead to a reduction of variance and therefore will not fulfil the requirement of human-like yet precise and aligned responses.

Why is this important? Think back to the previous section about the governing model knowing ‘more’: as we build models to govern AI generated content we are likely to set up a biased system, one trained to emulate human biases but asked to govern AI generated content.

Take for example the following piece of text:

Fruits are natures candy. They provide many of the required vitamins and minerals. They are also good to eat and come in a wide variety of flavours. Finally, they are amazing to look at with vibrant colours. But fruits like peaches can cause allergy.

Let me tell you a story about watermelons. They are my favourite fruit ever since I was a child.

This was written by me but both GPT-4o and Gemini Flash 1.5 thought this was written by AI. It seems I write like an AI model!

What can we do about it?

A solution for this is to provide variability within the generation process so that during different phases of generation the model is able to adopt different styles, tones, etc., making the text read human-like yet precise and aligned.

This means finding some way of changing the model weights based on the current input while text generation is ongoing.

The temperature setting allows us to tailor the sampling process but this is applied after the fact and does not impact the model’s ‘thinking’.

A way to visualise this: current models are designed as static structures, like old school skyscrapers that cannot rise above a certain height because they cannot adapt to the stress caused by the wind. What is needed is a design that can adapt to the environment, like modern skyscrapers that flex with the wind.

The environment for the model includes the prompt, the task, the data, and the partially generated output.

Static vs Flex…

Talking with your Data Part 4: Further Agentic Goodness

Part 4 of this series focuses on extending the framework we created in the previous post by adding some tools. If you have not read the previous post: Part 3: Agentic Talking with your Data

In the previous examples one of the recurring issues we come across is the LLM’s lack of awareness of the current time and date. This is to be expected given the lack of real time information streams available to the LLM.

There are also some minor fixes and improvements like using Beautiful Soup to extract data from tags and tweaking the prompt itself.

The New Prompt

We can see the new prompt below. We have added a new tool via tags to the prompt: a ‘time tool’ which allows the LLM to access the current time and date. The LLM can use the ‘T’ tag to invoke the tool.

You are an AI assistant that can take one action at a time to help a user with a question: {question}
You cannot answer without querying for the data. You are allowed to answer without asking a follow up question.
You can only select ONE action per response. Do not create a SQL query if you do not have enough information. Do not repeat the input provided by the user as part of the response.

Use tags below to indicate the action you want to take:
## Run SQL query within the following tags in one line: <SQL> </SQL>. You cannot execute a query - only generate it.
## Ask a follow up question to the user - surround question with tags in one line: <Q> </Q>. 
## If you have enough data to give final answer to the user's question use tags in one line: <R> </R>.
## You can ask for help with finding current time and date using: <T> </T>.


Data schema:
{ddl}

Output Example

We can see the trace of an example interaction below using the Gemini 1.5 Flash. Green highlighted text contains the user’s initial input and subsequent responses. The purple highlighted numbers indicate the response from the LLM. The blue highlighted text is the answer provided by the LLM.

Figure 1a: First part of the chat trace.

In the first part of the chat we can see the general pattern emerging where the user asks a question and the LLM asks a question in return [Response 2] – typically a request for the answer to the very question asked by the user! If the user nudges the LLM to use the schema, it then generates, executes (via the SQL Tool), and returns a proper answer using the Answer Tool [Responses 3 and 4]. Finally the user asks a follow-up question: what is their joining date?

Next we come to the second part of the chat as the LLM first generates and then executes the SQL to answer the joining date question [Responses 5, 6, and 7].

Then comes the interesting part – the final follow-up question: how many days ago was that from today? Previously, the LLM would have asked a follow-up question to ascertain the current date. But in this case it immediately fires off a request to the Time and Date tool [Response 8], which is then used (correctly!) to create a new SQL query [Response 9], which finally gives the correct answer of 41 days [Response 10]. The SQL tool response is a floating point number, which the LLM automatically rounds when it converts the SQL result into an answer to the question.

Figure 1b: Second part of the chat.

Conclusion

Some interesting conclusions:

  1. This highlights why answering a question is the end-to-end process and using SQL to gather the data is just one of the possible intermediate steps.
  2. The LLM is also capable of overusing the tools. For example, in this case it used the SQL tool to calculate the date difference rather than just using the joining date and the current date to evaluate the duration.
  3. The LLM is quick to adopt new tools. A thing to try: at what point does the LLM start getting confused by the tools available to it, and how should we plan and orchestrate their use?

Talking with your Data (SQL) Part 3: Agentic System

This is the third post in this series. The first one was about creating a basic text-to-SQL tool that used a Large Language Model to create SQL based on the user’s question and data schema. The second one tested the text-to-SQL capabilities of GitHub Copilot. We use a SQLite database as a backend.

  1. Talking with your Data: Local LLM
  2. Talking with your Data: Using Github Copilot

In this post we look at answering the user’s question using a conversational mode of interaction as opposed to the Q&A mode.

Figure 1: User interaction with LLM and LLM interactions with Tools.

The LLM is made aware of the available tools and how to use them (see Figure 1). In this case we have three tools:

  1. SQL Tool: The LLM provides a SQL query through <SQL> </SQL> tags, which is caught by the Tool Runner and forwarded to the SQL tool. After the SQL tool finishes running, any output (including errors) is passed back to the LLM.
  2. Ask A Question Tool: The LLM is allowed to ask a follow-up question using the <Q> </Q> tags, which is caught by the Tool Runner and forwarded to the Ask Q. tool. This sends the question to the user and returns the user’s response back to the LLM.
  3. Answer the Question Tool: The LLM can also decide that it is now able to answer the question that was asked. It can use the <R> </R> tags to wrap the answer. The tool then sends that to the user and gets their confirmation that the question has been answered or that further help is needed. This keeps the conversation going in case the user has follow-up questions.

A running history of the inputs and LLM responses is kept, which ensures the context is maintained. This requires a model with a decent context size.
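Putting the pieces together, the control flow might look something like the sketch below. The llm_call and tools arguments are hypothetical stand-ins for the model client and the tool handlers, not the actual code:

import re

def run_agent(llm_call, tools :dict, driving_prompt :str):
    # a minimal sketch of the tool-runner loop; tools maps a tag
    # ("SQL", "Q", "R") to a handler that returns a string (or None)
    history = [driving_prompt]
    while True:
        response = llm_call("\n".join(history))   # the history carries the context
        history.append(response)
        action = re.search(r"<(SQL|Q|R)>(.*?)</\1>", response, re.DOTALL)
        if action is None:
            continue                              # no recognised tag: ask the LLM again
        tag, payload = action.groups()
        result = tools[tag](payload.strip())      # run the tool (output or errors)
        if tag == "R" and result is None:         # user confirmed the final answer
            return
        history.append(result)                    # feed the tool result back to the LLM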

These types of systems that combine LLMs with tool use and some level of feedback to achieve one or more goals are referred to as Agentic Systems.

Models

I have used Open AI GPT-4o-mini and Google Gemini-1.5-flash. Both models are quite user-friendly and cheap to use. For all my testing with OpenAI I spent $0.02 and because I stayed within the free-tier I did not pay anything for the Google model.

GPT-4o-mini has a context size of 128,000 tokens (a bit more than a book) and an output size of 16,384 tokens (about a book chapter).

Gemini-1.5-flash has a context size of 1 million tokens and an output size of 8,192 tokens.

The Main Prompt

You are an AI assistant that can take one action at a time to help a user with a question: {question}
You cannot answer without querying for the data. You are allowed to answer without asking a follow up question.
You can only select ONE action per response. Do not create a SQL query if you do not have enough information.
Do not repeat the input provided by the user as part of the response.
Use tags below to indicate the action you want to take:
## Run SQL query within the following tags in one line: <SQL> </SQL>. You cannot execute a query - only generate it.
## Ask a follow up question to the user - surround question with tags in one line: <Q> </Q>.
## If you have enough data to give final answer to the user's question use tags in one line: <R> </R>.
Data schema:
{ddl}

The prompt above is the ‘driving’ prompt. This is the prompt that sets the context, task definition, and constraints in the beginning. The {question} placeholder is where the initial question is added. The data schema placeholder {ddl} is where the data definition is added.

Each tool has its own ‘mini’ prompt that is used to either trigger the user (biological LLM) or the LLM.

The prompt is the same for both the LLMs.

A Conversation using Google Gemini-1.5-flash

The colours map to the paths in Figure 1.

The red arrows represent user input/responses to follow-up questions (Ask A Question Tool and Answer The Question Tool).

The blue arrows represent the response from the SQL Tool.

The orange arrows represent the response from the LLM.

Gemini-1.5-flash example

The interaction starts with the initial question from the questioner (green highlight): ‘which user has purchased the most items?’

The third LLM response (index: 2) answers the question after running a query (blue highlight), but only partially. The questioner wants the full ‘name’ and not just a ‘user-id’, which is clarified through a subsequent interaction.

The fifth LLM response (index: 4) provides an updated answer with the user’s ‘name’. This time the questioner asks a follow-up question about the number of items purchased by that user.

The sixth and final LLM response (index: 5) provides the final answer that includes the ‘name’ and the total number of items purchased.

A Conversation using GPT-4o-mini

GPT-4o-mini example

The above example using GPT-4o-mini is even more impressive. It shows fairly complex reasoning and follow-up questions to reduce subjectivity (e.g., defining ‘last month’). Unlike Gemini-1.5-flash, GPT-4o-mini recognises the schema embedded in the prompt and doesn’t ask a follow-up question about the correct table name. This suggests that GPT-4o-mini’s attention mechanism works better in this case for this prompt.

Conclusion

What was surprising was how easy it was to build this app. The code can be found on my Github. There are two files, the first one is the ‘main’ program that drives the LLMs, and the second file is a client that helps me switch between OpenAI and Gemini.

The main file (loop_and_tool)

SQLite Database file (marketplace.db)

LLM Client to abstract away interaction with LLM

The code is mainly plumbing with the LLMs taking a lot of the cognitive load.

Agent-like Aspects

We can see Agentic behaviour in the interaction through the LLM deciding to use different tools (e.g., run a SQL query, ask a question). The system is also able to adjust to changing questions. For example, when we ask a follow-up question, it does not ask again for the table names.

Some things to explore further are:

  1. How can we get to proper agent behaviour where the goal itself can change?
    • This would involve some level of ‘erasing’ history where our initial prompt sets out the ‘goal’ (answering questions in this case) to refocus the LLM towards a different task.
    • This different focus will need to be highlighted to the LLM (almost like a command prompt). We will probably see some sort of ‘command task’ mechanism coming into LLMs to help them focus.
    • We should be able to add different tools, make the LLM aware of them and remove them as required. For example, if we wanted to transfer money, we could provide temporary access to a tool that takes some info and initiates the transfer and take that away (erase it from LLMs memory) to ensure it can’t be misused.
  2. How can we improve the prompt so that the LLM doesn’t ask which table to use?
  3. How can we test different database back ends and how can we direct the LLM to generate suitable SQL query?
  4. How can we supplement user interaction with mechanisms like RAG to onboard new information (e.g., dynamically load schema)?
  5. How can we package this? There are various frameworks such as LangChain, Llama Stack that we can explore.

Understanding LLM Outputs

When a large language model generates text one token at a time, it takes the original prompt and all the tokens generated so far as the input.

For this style of text generation the original prompt acts as the guide (as it is the only ‘constant’ being repeated at the start of every input). Each new token added interacts with the original prompt and the tokens added before it to decide future tokens.

Therefore, each added token has to be selected carefully if we are to generate useful output. That said, each token that is added does not hold the same weight in deciding future tokens. There are tokens that do not commit us to a particular approach in solving the problem and there are those that do. The ones that do are called ‘choice points‘.

For example, if we are solving a mathematical problem the numbers and operators we use are the choice points. If we use the wrong operator (e.g., add instead of subtract) or the wrong digits then the output will not be fit for purpose. Compare this to any additional text we add to the calculation to explain the logic. As long as the correct choice points are used there is flexibility in how we structure the text.

Framing the Problem

Assume the prompt has $p$ tokens.

Then to generate the first token: $x_{p+1} = \mathrm{Sample}(P(x_1, x_2, \ldots, x_p))$

To generate the second token: $x_{p+2} = \mathrm{Sample}(P(x_1, x_2, \ldots, x_p, x_{p+1}))$

And in general, to generate the $i$-th new token (for $i \geq 1$): $x_{p+i} = \mathrm{Sample}(P(x_1, x_2, \ldots, x_{p+i-1}))$

P represents the LLM that generates the probability distribution over the fixed vocabulary for the next token.

Sample represents the sampling function that samples from the probability distribution to select the next token. If we disable sampling then we get greedy selection (the highest scoring token).

Now the scores that we obtain from P are dependent on the initial prompt and subsequent tokens. In certain cases the previous tokens will give clear direction to the model and in other cases the direction will not be clear.

Interpreting Model Behaviour

Clear direction is represented by the separation between the highest and second highest scores. The larger the gap, the clearer the signal generated by the model. A large gap marks important tokens that do not have alternatives.

On the other hand if the separation is not large then we can say the signal is weak. There could be several reasons for the signal to be weak – such as the model is not clear about what comes next given the tokens presented so far or there is genuine flexibility in the next token.

This separation is critical because if we use sampling (i.e., don’t lock down the response) we can get different flows of text where the separation is weak, whilst still securing the tokens critical for a correct response where the separation is high.

An Example of a Choice Point

I am using Microsoft phi-3.5-mini-instruct running within my LLM Server harness. We are collecting the logits and scores at the generation of each token using a LogitsProcessor. We are not using sampling. We then identify the top two tokens, in terms of the model score, and compare the difference between them. Let us go through an example.
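A sketch of what such a processor might look like with the transformers library; the percentage-difference formula below is my reading of the approach, not the actual harness code:

import torch
from transformers import LogitsProcessor

class TopTwoGapProcessor(LogitsProcessor):
    """Record the gap between the top two token scores at each generation step."""

    def __init__(self):
        self.gaps = []

    def __call__(self, input_ids :torch.LongTensor,
                 scores :torch.FloatTensor) -> torch.FloatTensor:
        top2 = torch.topk(scores, k=2, dim=-1).values    # shape: (batch, 2)
        first, second = top2[0, 0], top2[0, 1]           # assumes batch size 1
        self.gaps.append(((first - second) / first.abs() * 100).item())
        return scores                                    # pass the scores through unchanged

The processor can then be handed to model.generate() via logits_processor=LogitsProcessorList([TopTwoGapProcessor()]).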

The prompt: What is 14+5*34?

The response (colour coded):

Example response to the prompt: What is 14+5*34?

I am calculating the percentage difference between the highest and second highest scoring tokens. Then colour coding the generated tokens based on that difference using some simple CSS and HTML (output above).

Mapping colour coding to the percentage difference between the highest and second highest tokens.

We can see the green bits are fairly generic. For example: the tokens ‘In this expression’ are green, which means the top two tokens were quite close to each other in terms of score (2-4%). These tokens are redundant: if we remove them the text still makes sense.

The ‘,‘ after the expression has higher separation (10%), which makes sense as it is required given the tokens that precede it.

The real high-value tokens are those that represent the numbers and operators (e.g., ‘5 * 34 = 170’). Even there the choice points are quite clear. Once the model was given the token ‘5’ as part of the input, ‘*’ and ‘34’ became critical for a correct calculation. It could have taken another approach: start with ‘34’ as the choice point, and then ‘* 5’ becomes critical for the correct calculation.

Another Example

This time let us look at something subjective and not as clear cut as a mathematical calculation. Again we are not using sampling therefore we always pick the highest scoring token.

The prompt: What is the weather likely to be in December?

The response (colour coded):

Colour coding of the response tokens using the same mapping to percentage difference.

The starting token ‘The’ is not that critical, but once it is presented to the LLM the following tokens become clearer. The token ‘significantly’ is not that critical either (as we can see from the low difference); it could be removed without changing the accuracy or quality of the output (in my opinion).

In the first major bullet once it had started with the token ‘Nor’ the following tokens (‘thern’, ‘ ‘, ‘Hem’, ‘is’, ‘phere’) become super critical. Interestingly, once it had received the second major bullet (‘**’ without the leading space) as an input it became critical to have ‘Sou’ as the next token as ‘Nor’ was already present.

The Impact

The key question that comes next is: why is this analysis interesting?

In my opinion this is an interesting analysis because of two reasons:

  • It helps us evaluate the confidence level of the model very easily.
  • It can help us with fine-tuning especially when we want to promote the use of business/domain specific terms over generic ones.

There are even more exciting things waiting when we start sampling. Sampling allows us to build a picture of the ‘optimum’ search space for the response. The LLM and sampling interact with each other and drive the generation across the ‘high quality’ sequences within the optimum search space. I will attempt to make my next post about this.

Swacch Bharat Abhiyan

What is it?

Swacch Bharat Abhiyan (SBA), started in 2014, is the ongoing Govt. of India campaign focusing on hygiene, with one of the key targets being the elimination of open defecation (especially in rural India). It builds on the learnings from the UPA-era Nirmal Bharat Abhiyan (2009), which was not a success.

Building toilets all across India with a focus on villages is one of the key activities carried out as part of the campaign alongside spreading awareness about personal hygiene and changing human behaviour related to this basic human need.

There are two versions of SBA – one focussed on urban areas and the other on rural areas with different agencies involved in execution.

Why am I writing about it?

I am writing about this topic because when I was travelling in India I came across a toilet constructed as part of this program. A few days later a thought jumped into my head… where are the plumbers and where are the sewers? This made me want to understand more about this initiative.

Current State

The scale of the initiative is something to behold. From A-list celebrities to volunteers on the ground – one can say that an honest attempt has been made to stop open defecation. Given the scale of India and the time it takes to change behaviour once toilets are available no one expected a rapid end to open defecation.

What is being seen on the ground is that the large number of toilets built in rural areas since the program started (coverage up from 40% in 2014 to 90+% in 2019) are beginning to have a material impact on health and quality of life. This is especially noticeable in women and children.

Below is the SBA dashboard maintained by the Govt. of India at https://sbm.gov.in/sbmgdashboard/statesdashboard.aspx (accessed 24/8/2024)

In the dashboard above ODF Plus Village/District/State is one that has maintained their open defecation free (ODF) status and has arrangements for solid and liquid waste management.

ODF Plus Model Village/District/State is one that has gone beyond ODF Plus and is looking after general cleanliness measured using different attributes such as absence of plastic waste in public areas.

We can say with some confidence that there has been India-wide implementation of this program with the Central and State Governments working actively to improve access to toilets.

There is an excellent report here prepared by the Feedback Foundation that gives more details about the current state of waste management in India (beyond human waste).

The Questions

Now all this is well and good but an obvious question will be: if we are constructing toilets where there were no toilets before where is the waste going to go and who is going to take care of the toilets?

Where Does The Waste Go?

One would assume if a locality did not have toilets it would not have had a sewer network. As of 2020 just over 30% of India’s urban population has access to sewers [1]. This number is quite a bit lower when it comes to rural households. This is also one of the biggest gaps in the SBA mission.

The Urban Context

For example, within the urban context almost all the large cities in India have grown organically. This means the sewers and other utility networks have also grown organically with some sections still going back to the time of the Raj.

For SBA-Urban if there is a sewer within 30m of the toilet being constructed [2] – the toilet must be connected to the sewer. This is easier said than done especially when those 30m could be cutting through someone’s house or a busy street.

The Rural Context

For the rural version of SBA things get a bit crazier. Given that sewer systems are non-existent in villages, there is no easy way to transport the waste once you pull the flush. Also, given the effort required to build a sewer system, toilet building was not going to be paused to allow the infrastructure to catch up. Therefore, low cost static solutions such as pits and septic systems were deployed.

But these pits/septic systems are not suitable for all parts of India given the geographical diversity. For example, in flat regions prone to heavy rainfall and flooding (such as Bihar) these pits can become flooded, rendering the toilets unusable and leading to groundwater contamination [3].

The pits have another small problem: they need to be emptied every so often and the sludge disposed of safely. This can be done manually or using specialised equipment.

This goes back to the solid/liquid waste management point in the Plus/Plus Model categorisation. But how much of this management is monitored? How much of it is in private hands? What is happening to the extracted waste? This paper attempts to throw some light on the problem by looking at the city of Bengaluru (Karnataka): https://iwaponline.com/washdev/article/9/2/338/66056/When-the-pits-fill-up-in-visible-flows-of-waste-in

What About The Plumbers?

Now since toilets were being built at breakneck speed (kudos to the Central and State Governments) who would look after them? Would you now have plumbers being trained up in every village to look after their brand new toilets?

Given that 50% of Indian households have access to piped water [4] I was also expecting plumbers to be involved in maintaining and building the plumbing.

All of this points to a massive skills gap as the quantity and quality of toilets improve across India. Where are the plumbers to meet the demand? Will we see reverse migration of plumbers from cities to smaller towns and villages?

I came across this interesting article from 2017 about training plumbers while you build toilets [5]; it makes an interesting point about the lack of standardisation in plumbing across India. To support the sanitation mission there must be more focus on standardisation of plumbing. This is not a ‘boring’ topic: if these millions of new toilets are repaired in non-standard ways, over time they will start to malfunction, with real consequences such as groundwater contamination at scale.

Conclusion

We can say only the first part of this grand and important task of bringing sanitation and clean water to the masses of India is nearing completion (sort of). Until every village has a proper sewer system and waste management/treatment facilities the picture won’t be complete.

All of this work has to be led locally. It is only the locals who can monitor and feedback on the services. It is also the locals who would be impacted by illegal dumping of sludge.

A key question will also be about the backfilling of this work. How do we provide sewers to densely populated areas (especially those outside the major population centres)?

Looking Beyond Next-token Generation

The output produced by large language models seems even more incredible given that they predict the next token (i.e., the next bit of text) based on the provided input (prompt) and the tokens generated so far.

The human brain does not work like this. We constantly evaluate, re-write, and re-think as we work. We also use hierarchical planning (e.g., breaking down a document into sections and sub-sections) and other methods to ‘think’ about text at various levels of detail.

LLMs on the other hand can only move forward one token at a time. There is no ‘reverse and rewrite’ mode (as yet).

So it would make sense for people to investigate generating a sequence of tokens rather than just the next token and to see if this has an impact on the quality of the output (which to be fair is already quite amazing).

This is exactly what a research team with members working at Meta have done. The paper is titled: Better & Faster Large Language Models via Multi-token Prediction.

Why Multiple Tokens?

First let us understand what we mean by generating multiple tokens. Assuming we have the following prompt: “Calculate 4+6”.

A single token prediction LLM will provide the following sequential output (hidden from us because of convenience methods provided by the transformers library):

  • prompt -> <space>
  • prompt + <space> -> 1
  • prompt + <space>1 -> 0
  • prompt + <space>10 -> .
  • Final output: <space>10.

A multi-token prediction with length four might work as below (pipes separate tokens generated in parallel):

prompt -> <space>|1|0|.

Final output: <space>10.

Figure 1 shows the architecture that includes 4 heads to generate n=4 next tokens in parallel.

It is immediately clear that with this kind of n-token generation we are going to get a massive speedup in inference, at the cost of additional resources (for the extra heads). It will also make training harder and more resource intensive. To get back to next-token generation the model can discard heads 2, 3, and 4.

Figure 1: Model architecture for n=4 token generation (source: the paper linked above)
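A highly simplified sketch of the multi-head idea (not the paper’s actual implementation): a shared trunk produces hidden states, and n independent heads each score the vocabulary for one of the next n positions:

import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    def __init__(self, d_model :int, vocab_size :int, n :int = 4):
        super().__init__()
        # one output head per future position; head 0 is the usual next-token head
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n)])

    def forward(self, hidden :torch.Tensor) -> list:
        # hidden: (batch, seq_len, d_model) from the shared transformer trunk;
        # returns n sets of vocabulary scores, one per predicted offset
        return [head(hidden) for head in self.heads]

At inference time, keeping only the first head recovers standard next-token generation.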

Why Does This Work?

In their results they show significant improvements in code-related tasks and other benchmarks. The key thing to understand is that when we train using multiple heads we are passing a lot of ‘hidden’ information about token choices.

As they state in the paper, and we humans know intuitively: “Not all token decisions are equally important for generating useful texts from language models”.

Imagine you are presented with some text to read. Your brain knows the words it can skip without impacting your understanding of the text. These words may impart stylistic variation (e.g., first-person vs third-person voice) but do not add much to your understanding. These can be thought of as style tokens.

Then there will be some words that will grab your attention because they define the message in the text. These tokens they call choice points and they define the semantic properties of the text.

In a similar way LLMs have to generate (one token at a time – without any reverse-rewrite) the sequence of style tokens and choice point tokens that provides a correct response to the prompt.

Now you can probably begin to understand how LLMs can go off-track. If mistakes are made in the early choice point tokens then it is difficult to recover (as it cannot go back and reselect a different choice point). If mistakes are made in style tokens then recovery may still be possible.

When we train using multiple heads (even if we discard them during inference) we are teaching the LLM about relationships between the next n tokens. And the key thing (as they show in the paper): during training the correlated choice point tokens are weighted together with larger values than the style tokens.

In simple words: two generated tokens that are related and add meaning to the text are given more importance than those that do not.

This looking-ahead property during training ensures that we already have a mental map (a tiny one) of what the next few words may look like when we are generating.

But n = ?

A valid question would be: what should the value of n be? In other words, how big a mental map should be built during training?

One reasonable answer would be: it depends on the task – complex tasks may require bigger mental maps.

For most of the tasks they experiment with, n = 4 seems to be the optimal value (except one task where it is 6). Another interesting result is that performance seems to drop at higher values of n. This is also understandable: if we try to think too far ahead we may find many possible ‘next steps’ and find it difficult to select the correct one.

Talking with your Data (SQL) Part 2: Testing GitHub Copilot

In the previous post I investigated a text-to-SQL LLM to enable ‘talking to your data’. In this post I am running the same use-case to test the SQL generation capability of GitHub Copilot (GHC) using the plugin for VS Code. This uses the personal edition of GitHub Copilot.

One drawback of using GHC is that we cannot integrate with the backend data source to execute the generated SQL because at the time of writing there is no API-based access available.

Instead if we use the VS Code plugin – we can include the schema definition and other artefacts as we chat with the model. The model takes these included artefacts into consideration when it responds.

This type of ‘copilot’ integration suits an analyst tool aimed at both non-technical and technical users. The conversational interface can be used to outline the questions, which results in query generation and execution. The technical user can quickly build complex queries.

I expect this to be a generally available capability as all major database providers are integrating some kind of ‘copilot’ / gen-ai assisted querying tool in their product.

Creating the Synthetic Data

In the screenshot below we can see the VS Code GitHub Copilot chat plugin. On the left hand side we have the data schema file open (yellow bracket). This is automatically taken by GHC as a reference (red line) when we ask a question (above the red line).

In its response GHC first explains the required conditions (white bracket) and then it provides the SQL code that we can execute (green line).

The resulting query is absolutely correct.

Asking Questions

We ask the same questions as in the previous post (red lines in the screenshots). This is quite easy as the only reference we need is the data schema file, as above, which is picked up automatically by GHC. The responses from GHC explain the logic behind the response, then show the generated SQL query, and finally the expected result. All the queries are correct!

Tell me the name of the author that has the most sales?
Tell me which books had the highest sales?
Tell me which author has written the most books?

Denormalising the Data

Data denormalisation is often required for analytics at scale to avoid expensive join operations and to keep the queries simple.

GitHub Copilot (GHC) is able to denormalise the schema quite easily as well. In the image below it provides the DDL for the denormalised version.

Denormalise the bookstore tables.

And the fun doesn’t stop there… if I ask GHC to provide a script to extract data from the normalised version of the tables to the denormalised table, it can do that as well!

Insert data into the denormalised structure.

Here is the result after I run the above insert statement. This is not the complete data because, as the astute reader will observe, none of the columns in the denormalised table definition allow null values. The insert statement has taken that into account and brought over only those books that have transactions against them (see the Transactions table below).

Data injected into the denormalised table.
The Transaction table from the normalised schema.