Behaviours in Generative AI (Part 3)

Part 1 talks about well-behaved and mis-behaving functions with a gentle introduction using Python. We then describe how these can be used to understand the behaviour of Generative AI models.

Part 2 gets into the details of the analysis of Generative AI models by introducing an llm function. We also describe how we can flip between well-behaved and mis-behaving model behaviours.

In this part we use some diagrams to tease apart our llm function and understand the change of behaviours visually.

Let us first recall the functional representation of a given LLM:

llm(instructions :str, behaviour :dict) -> output :str

This can be visualised as:

Figure 1: Visual representation of the LLM function

The generalisation is: for some instruction and behaviour (input) we get some output. To keep the explanation simple, let us not worry about individual inputs, outputs, or behaviours for now. That means we won't worry too much about whether a particular output is correct or not.

Now LLMs consist of two blocks: the 'ye olde' Neural Network (NN), which is the bit that is trained, and the Selection function. The Selection function is the dynamic bit and the Neural Network the static bit: once trained, the weights that make up the NN do not change.

We can represent this decomposition of the llm function into NN function and Selection function as:

Figure 2: Decomposing LLM function into NN and Selection function

The loop within the llm function indicates repeated execution until some condition is met and the process ends. Another way to think about this, without loops, is that the calls to the NN function and the Selection function are chained, and this chain unrolls through time: each step takes the past text (the original instructions plus any tokens added so far) and produces the token for the current step.
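A toy sketch of this decomposition (the vocabulary, the scoring rule, and the names here are made up purely for illustration; a real NN computes the scores from its trained weights):

VOCAB = ["Hello", ",", " world", "!", "<eos>"]
END_OF_TEXT = "<eos>"

def nn(text: str) -> dict:
    # Static part (toy stand-in for the trained network): deterministically map
    # every token in the vocabulary to a score based on the text so far.
    return {token: (len(text) * 7 + i * 3) % 11 for i, token in enumerate(VOCAB)}

def select(scores: dict, behaviour: dict) -> str:
    # Dynamic part (toy stand-in for the Selection function): here, greedy selection.
    return max(scores, key=scores.get)

def llm(instructions: str, behaviour: dict) -> str:
    text = instructions
    for _ in range(behaviour.get("max_new_tokens", 10)):
        scores = nn(text)                   # score the whole vocabulary for the current step
        token = select(scores, behaviour)   # pick the next token
        if token == END_OF_TEXT:
            break
        text = text + token                 # the chain unrolls one token at a time
    return text

print(llm("Hi", {"max_new_tokens": 5}))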

Root Cause of the Mis-behaviour

As the name and the flow in Figure 2 suggest, the Selection function selects something from the output of the NN function.

The NN function

The output of the NN function is nothing but the vocabulary (i.e., all the tokens that the model has seen during training) with each token mapped to a score. This score indicates the model's assessment of what should be the next token (current step) based on the text generated so far (this includes the original instructions and any tokens generated in previous steps).

Because the NN function produces a scored set of options for the next token, rather than a single token that represents what goes next, we have a small problem: which token from the vocabulary goes next?
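To make this concrete, here is a small sketch using the Hugging Face transformers library (gpt2 is used purely as a small illustrative model, not the model used later in this series) that inspects the scores the NN assigns to the whole vocabulary for the next token:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small model, purely for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, how are", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One score (logit) per vocabulary entry for the next token of the current step.
next_token_scores = outputs.logits[0, -1, :]
print(next_token_scores.shape)   # torch.Size([50257]) -> one score for every token in the vocabulary

# Peek at the five highest-scoring candidates for the next token.
top = torch.topk(next_token_scores, k=5)
for score, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode(token_id.item())), round(score.item(), 2))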

The Selection Function

The Selection function solves this problem. It is a critical function because it not only decides which token is selected in the current step, it also shapes the trajectory of the response one token at a time. A mistake made early in the generation, or on a particularly important token, is therefore difficult to recover from. Think, for example, of solving a maths problem: if even one digit is selected incorrectly, the calculation is ruined beyond recovery. Remember, LLMs cannot overwrite tokens once they have been generated.

The specific function we use defines the mechanism of selection. For example, we could disregard the scores and pick a random token from the vocabulary. This is unlikely to give a cohesive response in the end: we are nullifying the hard work done by the model in the current step, and making things harder for it in future steps, because each random selection breaks up any cohesive response.

Greedy Selection

The easiest and perhaps least risky option is to take the token with the highest score. This is the least risky because the NN function is static: with a given instruction and behaviour we will always get the same token scores. With greedy token selection (going for the highest score) we therefore select the same token in each step, starting from step 1. Since we get the same token scores from the start, we end up building the same response again; with each step we walk down the same path as before. You will notice in Figure 3 that the overall architecture of the llm function has not changed. This is our well-behaved function: given a specific instruction and behaviour we get the same specific output.

Figure 3: Greedy selection function.
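A minimal sketch of greedy selection (again with gpt2 as a small stand-in model; a real setup would simply call the library's generate method with do_sample=False, this hand-rolled loop just makes the mechanism visible):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Hello, how are", return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        scores = model(input_ids).logits[0, -1, :]   # NN function: score the vocabulary
    next_id = torch.argmax(scores)                   # Selection function: take the highest score
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# Running this again produces exactly the same text: the well-behaved case.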

Sampling Selection

The slightly harder and riskier option is to use some kind of statistical sampling method (remember the 'do_sample' = True behaviour from the previous posts) that takes the scores into account. For example, take the top 3 tokens by score and select randomly from them, or even take the top 'n' dynamically using some cooldown or heat-up function. The risk is that, because we are using random values, there is a chance the generation will be negatively impacted. In fact, such an llm function will be badly behaved and not compositional (see below for a definition), because given the same instruction and behaviour we are no longer guaranteed to get the same output (an inconsistent relation between input and output).

Figure 4: Sampling function for selection and introduction of hidden data.

This inconsistency requires extra information coming from a different source given that the inputs are not changing between requests.

In Figure 4, we identify this hidden information as the random number used for sampling, and its source as a random number generator. In practice we can make this random shift between outputs 'predictable' because what we actually use is a pseudo-random number generator. With the same seed value (42 being the favourite) we will get the same sequence of random numbers, and therefore the same sequence of different responses, for a given instruction and behaviour.
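A minimal sketch of a sampling Selection function, with a seed (top-3 sampling, a temperature of 0.8 and a seed of 42 are all illustrative choices; a real setup would call the library's generate method with do_sample=True):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

torch.manual_seed(42)   # same seed -> same sequence of pseudo-random numbers -> same outputs

input_ids = tokenizer("Hello, how are", return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        scores = model(input_ids).logits[0, -1, :]    # NN function: score the vocabulary
    top = torch.topk(scores / 0.8, k=3)               # temperature 0.8, keep only the top 3 tokens
    probs = torch.softmax(top.values, dim=-1)
    pick = torch.multinomial(probs, num_samples=1)    # the hidden data: a pseudo-random draw
    next_id = top.indices[pick]
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# Remove the manual_seed call and the text changes from run to run; keep it and the 'randomness' repeats.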

We are no longer dealing with a one-to-one relation between input and output; instead we have to think about the relationship between a given input and some collection of possible outputs.

I hope this has given you a brief glimpse into the complexity of testing LLMs when we enable sampling.

This inconsistency or pseudo-randomness is beneficial when we want to avoid robotic responses (‘the computer says no‘). For example, what if your family or friends greeted you in the exact same manner every time you met them and then your conversation went down the exact same arc. Wouldn’t that be boring? In fact this controlled use of randomness can help respond in situations that require different levels of consistency. For more formal/less-variable responses (e.g., writing a summary) we can tune down the randomness (using the ‘temperature’ variable – see previous post) and where we want more whimsical responses (e.g. writing poetry) we can turn it up.

Conclusion

We have decomposed our llm function into an NN function and a Selection function to investigate the source of inconsistent behaviours. We have discovered that the inconsistency is caused by hidden data, in the form of random variables, being provided to the function for use in selecting the next token.

The concept of changing function behaviour because of extra information coming in through hidden paths is an important one.

Our brain is a big function.

A function with many hidden sources of information apart from the senses that provide input. Also because the network in our brain grows and changes as we age and learn, the hidden paths themselves must evolve and change.

This is where our programming paradigm starts to fail because while it is easy to change function behaviour it is very difficult to change its interface. Any software developer who has tried to refactor code will understand this problem.

Our brain though is able to grow and change without any major refactoring. Or maybe mental health issues are a sign of failed refactoring in our brains. I will do future posts on this concept of evolving functions.

A Short Description of Compositionality

We have all heard of the term composability, which means combining multiple things/objects/components to create something new. For example, we can combine simple LEGO bricks to build complex structures.

Compositionality takes this to the next level. It means that the rules for composing the components must be composable as well. That is why all LEGO compatible components have the same sized buttons and holes to connect.

To take a concrete example – language has compositionality – where the meaning of a phrase is determined by its components and syntax. This ‘and syntax’ bit is compositionality. For example we can compose the English language syntax rule of subject-verb-object to make infinitely complex sentences: “The dog barked at the cat which was chasing the rat”. The components (words) and syntax taken together provide the full meaning. If we reorder the components or change the rule composition there are no guarantees the meaning would be preserved.

Behaviours in Generative AI (Part 2)

The first part of this post (here) talks about how Generative AI models are nothing but functions that are not well behaved. Well-behaved means functions that are testable and therefore provide consistent outputs.

Functions misbehave when we introduce randomness and/or the function itself changes. The input and output types of the function are also important when it comes to understanding its behaviour and testing it for use in production.

In this post we take a deep dive into why we should treat Gen AI models as functions and the insight that approach provides.

Where Do Functions Come Into The Picture?

Let us look at a small example that uses the transformers library to execute an LLM. In the snippet below we see why Gen AI models are nothing but functions. The pipe function wraps the LLM and the associated task (e.g., text generation) for ease of use.

output = pipe(messages, **configs)

The function pipe takes in two parameters: the input (messages) that describes the task and a set of configuration parameters (configs) that tune the behaviour of the LLM. The function then returns the generated output.
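For completeness, here is a minimal sketch of how such a pipe could be set up with the transformers library (the model name matches the one used in the experiment below; max_new_tokens is an illustrative setting I have added, and the author's exact setup may differ):

from transformers import pipeline

# Wrap the model and the text-generation task into a single callable.
pipe = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct")

messages = [{"role": "user", "content": "Hello, how are you?"}]
configs = {"max_new_tokens": 50, "do_sample": False}

output = pipe(messages, **configs)
print(output[0]["generated_text"])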

To simplify the function let us reformulate it as a function called llm that takes in some instructions (containing user input, task information, data, examples etc.) and some variables that select the behaviour of the function. Ignore for the moment the complexity hiding inside the LLM call as that is an implementation detail and not relevant for the functional view.

llm(instructions, behaviour) -> output

Let us look at an example behaviour configuration. The ‘temperature’ and ‘do_sample’ settings tune the randomness in the llm function. With ‘do_sample’ set to ‘False’ we have tuned all randomness out of the llm function.

configs = {
    "temperature": 0.0,
    "do_sample": False,
}

The above config therefore makes the function deterministic and testable. Given a specific instruction and a behaviour config that removes any randomness, we will get the exact same output. You can try out the example here: https://huggingface.co/microsoft/Phi-3.5-mini-instruct. As long as you don't change the input (which is a combination of the instruction and behaviour) you will not notice a change in the output.

The minute we change the behaviour config to introduce randomness (setting 'do_sample' to 'True' and 'temperature' to a value greater than zero, as shown below) we enter the territory of 'bad behaviour'. Given the same instruction we get different outputs, and the higher the temperature value, the greater the variance in the output. To understand how this works please refer to this article. I then show this change in behaviour through a small experiment.
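For example, a behaviour config with sampling enabled might look like this (0.7 is just an illustrative value; any value greater than zero re-introduces the variability):

configs = {
    "temperature": 0.7,
    "do_sample": True,
}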

The Experiment

Remember input to the llm function is made up of an instruction and some behaviour config. Every time we change the ‘temperature’ value we treat that as a change in the input (as this is a change in the behaviour config).

The common instruction across all the experiments is 'Hello, how are you?', which provides an opening with multiple possible responses. The model used is Microsoft's Phi-3.5-mini. We use the 'sentence_transformers' library to evaluate the similarity of the outputs.

For each input we run the llm function 300 times and evaluate the similarity of the outputs produced. Then we change the input (by increasing the temperature value) and run it another 300 times. Rinse and repeat until we reach a temperature of 2.0.
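A sketch of what this experiment could look like (it reuses the pipe created earlier; the embedding model, the temperature steps, and the exact statistics computed are my assumptions and may differ from the author's setup):

import numpy as np
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model choice is an assumption

def similarity_stats(temperature: float, runs: int = 300):
    # One input = the fixed instruction plus a behaviour config for this temperature.
    if temperature > 0.0:
        configs = {"do_sample": True, "temperature": temperature, "max_new_tokens": 50}
    else:
        configs = {"do_sample": False, "max_new_tokens": 50}
    messages = [{"role": "user", "content": "Hello, how are you?"}]

    # Chat-format pipelines return the whole conversation; keep only the new assistant reply.
    outputs = [pipe(messages, **configs)[0]["generated_text"][-1]["content"] for _ in range(runs)]

    embeddings = similarity_model.encode(outputs)
    scores = util.cos_sim(embeddings, embeddings).numpy()     # pairwise cosine similarity
    upper = scores[np.triu_indices(runs, k=1)]                # each pair of outputs counted once
    return float(upper.mean()), float(upper.std())

for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    mean, std = similarity_stats(t)
    print(f"temperature={t:.2f}  mean similarity={mean:.3f}  std dev={std:.3f}")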

Figure 1: Statistics of output similarity scores, given the same input, as the temperature is increased.

The Consistent LLM

The first input uses the behaviour settings 'do_sample' = False and 'temperature' = 0.0. We find no variance at all in the output when we trigger the llm function 300 times with this input. This can be seen in Figure 1 at Temperature = 0.00, where the similarity between all the outputs is 1, which indicates identical outputs (a mean of 1 and a standard deviation of 0).

The llm function behaves in a consistent manner.

We can create tests by generating input-output pairs relevant to our use-case that will help us detect any changes in the llm function. This is an important first step because we do not control the model lifecycle for much of the Gen AI capability we consume.

But this doesn't mean that we have completely tested the llm function. Remember from the previous post: if we have unstructured types then we are increasing the complexity of our testing problem. Let us understand the types involved by annotating the function signature:

llm(instructions :str, behaviour :dict) -> output :str

Given the unstructured types (string) for instructions and output, it will be impossible for us to exhaustively test for all possible instructions and validate output for correctness. Nor can we use mathematical tricks (like induction) to provide general proofs.

The Inconsistent LLM

Let us now look at how easily we can complicate the above situation. Let us change the input by changing only the behaviour config. We now set 'do_sample' = True and 'temperature' = 0.1. This change has a big impact on the behaviour of our llm function. Immediately, the standard deviation of the similarity score across the 300 outputs starts increasing, and the mean similarity starts to drop from 1 (identical).

As we increase the temperature (the only change made to the input) and collect 300 more outputs we find the standard deviation keeps increasing and the mean similarity score continues to drop.

We start to see variety in the generated output even though the input is not changing.

The exact same input is giving us different outputs!

Let us see how the distribution of the output similarity score changes with temperature.

Figure 2: Changes in the output similarity score distribution with the increase in temperature.

We can see in Figure 2 that at a temperature of 0.0 all the outputs are identical (a similarity score of 1). As we increase the temperature we find a wider variety of outputs being produced, as the similarity score distribution broadens out.

At a temperature of 1.0 many of the outputs are still identical (grouping around a score of 1), but some outputs are not similar at all (the broadening towards a score of 0).

At a temperature of 2.0 there are no identical outputs (no scores of 1); instead the similarity scores are spread between 0.4 and 0.9.

This makes it impossible to prepare test cases consisting of checking input-output value pairs. Temperature is also just one of many settings that we can use to influence behaviour.

We need to find new ways of testing mis-behaving functions based on semantic correctness and suitability rather than value checks.
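One possible direction, sketched below, is to check semantic closeness to a reference answer instead of exact equality (the embedding model and the threshold are my assumptions; the threshold in particular is use-case dependent):

from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model choice is an assumption

def semantically_close(actual: str, reference: str, threshold: float = 0.7) -> bool:
    # Pass if the generated output means roughly the same as a reference answer,
    # instead of demanding the exact same string.
    embeddings = similarity_model.encode([actual, reference])
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

print(semantically_close("I'm doing well, thanks for asking!",
                         "I am fine, thank you for asking."))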

Why do we get different Outputs for the same Input?

The function below starts to misbehave and provides different outputs for the same input as soon as we set the ‘temperature’ > 0.0.

llm(instructions :str, behaviour :dict) -> output :str

Even though we are not changing anything in the input, something still changes the output. Therefore, some hidden information must be provided to the function without our knowledge. This hidden information is randomness, one of the sources of change we discussed in the previous post.

Conclusions

We have seen how we can make LLMs misbehave and make them impossible to test in standard software engineering fashion by changing the ‘temperature’ configuration.

Temperature is not just a setting designed to add pain to our app building efforts. It provides a way to control creativity (or maybe I should call it variability) in the output.

We may want to reduce the temperature setting for cases where we want consistency (e.g., when summarising customer chats) and increase it for when we want some level of variability (e.g., when writing a poem). It wouldn’t be any fun if two users posted the same poem written by ChatGPT!

We need to find new ways of checking the semantic correctness of the output, and these are not value-matching tests. That is why we find ourselves increasingly dependent on other LLMs to check unstructured input-output pairs.

In the next post we will start breaking down the llm function and look at its compositionality aspects. This will help us understand where the extra information comes from that prevents us from making reasonable assumptions about the output.

Behaviours in Generative AI (Part 1)

While this may seem like a post about functions and testing in Python it is not. I need to establish some concepts before we can introduce Generative AI.

When we write a software function we expect it to be 'well-behaved'. I define a well-behaved function as a function that is testable. A testable function needs to have stable behaviour that provides consistent outputs.

Tests provide some confidence for us to integrate and deploy a given function in production.

If the function's behaviour is inconsistent, resulting in different outputs for the same input, then it becomes a lot harder to test.

To explain this in a simple way, imagine you have a function that accepts a single integer parameter x and adds ‘1’ to the provided input (x) and returns the result (y) as an integer.

In Python we could write this function as:

def add_one(x :int) -> int:
    y = x + 1
    return y

Now the above function is easily testable based on stated requirements for add_one. We can, for example, use assert statements to compare actual function output with expected function output. This allows us to make guarantees about the behaviour of the function in the ‘wild’ (in production).

def test_add_one() -> bool:
    assert add_one(10) == 11
    assert add_one(-1) == 0
    return True

Introducing and Detecting Bad Behaviour

Bad behaviour involves (as per our definition) inconsistency in the input-output pairing. This can be done in two ways:

  • Evolve the function
  • Introduce randomness

Introduce Randomness

Let us investigate the second option as it is easier to demonstrate. We will modify add_one by adding a rounded random number to the result. The impact is subtle (try the code): the result is as expected some of the time. Our existing tests may still pass occasionally, but there will be failures. This makes the add_one function complicated to test. The frequency of inconsistent output depends on how the randomness is introduced within the function. Given the implementation below, we expect the tests to fail approximately 50% of the time (figure out why).

import random

def add_one(x :int) -> int:
    y = x + 1 + round(random.random())
    return y
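A quick way to see the failure rate (assuming the randomised add_one above is in scope) is to repeat one of the checks from test_add_one many times and count the failures:

failures = sum(1 for _ in range(1000) if add_one(10) != 11)
print(f"{failures / 1000:.0%} of checks failed")   # roughly 50%: round(random.random()) is 1 about half the time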

Evolve the function

Assume we have a rogue developer who keeps changing the code for the add_one function without updating the documentation or the function signature. For example, the developer could change the operation from addition to subtraction without changing the function name, comments, or associated documentation.

Testing Our Example

Given our function is a single mathematical operation with one input and one output, we can objectively verify the results. The inconsistent behaviour resulting from the introduction of randomness or changes made by the rogue developer will be caught before the code is deployed.

Testing Functions with Complex Inputs and Outputs

Imagine if the function was processing and/or producing unstructured/semi-structured data.

Say it takes a string and returns another string, or returns an image of the string written in cursive, or a spoken version of the string as an audio file (I hope the connection with Gen AI is becoming clearer!). Below is an example of a summarising function that takes in some text (string) and returns its summary (string).

def summarise_text(input_text :str) -> str:
    # 'model' stands in for some chat-capable LLM client; the call is illustrative pseudocode
    return model.generate([{"role": "user", "content": f"Summarise: {input_text}"}])

Such functions are difficult to test in an objective manner. Since tests need exact input-output pairs, they will only help us validate the function within the narrow confines of the test inputs.

Therefore, in the above case we may not catch any changes made to such a complex function (whether through addition of randomness or through function evolution). Especially if the incorrect behaviour surfaces only for certain inputs.

Putting such a component into production therefore presents a different kind of challenge.

The Human Brain: The Ultimate Evolving Function

The human brain is the ultimate evolving function.

It takes all the inputs it receives, absorbs them selectively and changes the way it works – this is how we learn. The impressive thing is that as we learn new things we do not forget what we learnt before – the evolution is mostly constructive. For example, learning Math doesn’t mean we forget how to write English or ride a bicycle.

To mimic this our add_one function should be able to evolve and learn new tricks – for example how to deal with adding one to a complex number or for that matter adding one to anything. A generic signature for such a function would be:

def add_one(a: Any) -> Any:
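One way to sketch such an evolving function in Python (this is my illustration, not a claim about how brains or LLMs do it) is functools.singledispatch, where new 'tricks' are registered over time without changing the interface:

from datetime import date, timedelta
from functools import singledispatch
from typing import Any

@singledispatch
def add_one(a: Any) -> Any:
    raise NotImplementedError(f"don't know how to add one to {type(a).__name__} yet")

@add_one.register
def _(a: int) -> int:
    return a + 1

@add_one.register
def _(a: complex) -> complex:
    return a + 1                   # a newly learnt trick: complex numbers

@add_one.register
def _(a: date) -> date:
    return a + timedelta(days=1)   # another trick: add one day to a date

print(add_one(41), add_one(2 + 3j), add_one(date(2024, 1, 31)))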

It may surprise you to know that humans can ‘add_one’ quite easily to a wide range of inputs. Beyond mathematics we can:

  • add one object to a set of objects (e.g., marbles or toys or sweets)
  • add one time-period to a date or time
  • add one more spoon of sugar to the cake mix

Conclusion

So in this part of the series I have shown how well-behaved functions can be made to mis-behave. This involves either changing the function internals or introducing randomness.

Furthermore, the input and output types also have an impact on how we identify whether a given function is well-behaved or not. Operations that give objective results or cases where the expected output can be calculated independently are easy to validate.

The deployment of such functions into production presents a significant challenge.

Generative AI models show exactly the same characteristics as mis-behaving functions.

All Generative AI models can be cast as functions (see the next post in this series). The source of their mis-behaviour is randomness as well as evolution. They do not evolve like our brains (by continuous learning) or through the actions of a rogue developer; they evolve every time they are re-trained and a newer version is released (e.g., GPT-4 after GPT-3.5).