Saving and Using Model Artefacts

One of the key requirements for creating apps that use AI/ML is the ability to save models after training, in a format that can be shared and deployed.

This is similar to software packages that can be shared and used through language-specific package managers such as Maven for Java.

When it comes to models, though, there are multiple items that need to be ‘saved’. These are:

1) the model architecture – provides the frame into which the weights and biases fit, as well as any layer-specific logic and custom layer definitions.

2) the weights and biases – the output of the training process

3) model configuration elements

4) post processing (e.g., beam search)

We need to ensure that the above items as they exist in memory (after training) can be serialised (“saved to disk”) in a safe and consistent way that allows them to be loaded and reused in a different environment.

All the major ML libraries in Python provide a mechanism to save and load models. There are also packages that can take a model object created by these libraries and persist it to disk.

TLDR: Jump straight to the Conclusions here.

Major Formats for Saving the Model

The simplest type of model is one that has an architecture with no custom layers or complex post processing logic. Basically, it is just a tensor (numbers arranged in a multi-dimensional array). Some examples include: linear regression models, fully connected feed forward neural networks.

More complex models may contain custom layers, multiple inputs and outputs, hidden states, and so on.

Let us look at various options available to us to save the model…

Pickle File

This is the Python-native way of persisting objects. It is also the default method supported by PyTorch and therefore by libraries built on it (e.g., the Transformers library).

The pickle process serialises the full model object to disk, which makes it very flexible. The pickle file contains the object's data along with references to the code needed to rebuild it. Therefore, as long as the libraries used to create the model are available in the target environment, the model object can be reconstructed and its state set with the weights and biases. The model is then ready for inference.

A key risk with pickle files is that the constructor code used to rebuild the model object can be replaced with any code. For example, a call to a layer constructor can easily be replaced with another Python function call (e.g., one which scans the drive for important files and corrupts them).
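This risk is easy to demonstrate in a few lines. In the harmless sketch below, the `__reduce__` method tells pickle what to call when the object is rebuilt; here it merely calls `print`, but it could name any callable (e.g., `os.system`).

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to rebuild this object on load;
    # it can name ANY callable, not just a model constructor
    def __reduce__(self):
        return (print, ("arbitrary code executed on load",))

data = pickle.dumps(Payload())
pickle.loads(data)  # simply loading the bytes triggers the call
```

Note that no method on the model is ever invoked: deserialisation alone is enough to run the attacker's code.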

Safetensors

With the popularity of Gen AI and model-sharing platforms such as HuggingFace, it became risky to have all this serialised data flying around. Python libraries like Transformers allow one-line download and invocation of models, further increasing the risk of using pickle files.

To get around this, HuggingFace developed a new format for storing ML models called ‘safetensors’. This was a ‘from scratch’ development, keeping in mind the usage of such artefacts.

The safetensors library is written in Rust (i.e., really fast) and is not bound to the Python ecosystem. It has been designed with restricted ‘execution’ capabilities. It is quite simple to use and has helper methods to save files across different ML libraries (e.g., PyTorch).
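The reason loading is safe becomes obvious from the file layout: an 8-byte little-endian header size, a JSON header describing each tensor (dtype, shape, byte offsets), then the raw tensor bytes. The sketch below is a toy re-implementation of that layout in plain Python (not the official library) to show that loading is pure parsing, with no code execution possible.

```python
import json
import os
import struct
import tempfile

def save_safetensors_like(path, name, raw_bytes, shape, dtype="F32"):
    # JSON header describing the tensor, followed by raw bytes: nothing executable
    header = {name: {"dtype": dtype, "shape": shape,
                     "data_offsets": [0, len(raw_bytes)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte LE header size
        f.write(header_bytes)
        f.write(raw_bytes)

def load_safetensors_like(path):
    # loading only reads the header size, parses JSON, and slices bytes
    with open(path, "rb") as f:
        (size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(size))
        data = f.read()
    return header, data

path = os.path.join(tempfile.gettempdir(), "demo.safetensors")
save_safetensors_like(path, "weight", b"\x00" * 16, shape=[2, 2])
header, data = load_safetensors_like(path)
```

In real use you would of course call the library's own helpers (e.g., `safetensors.torch.save_file`) rather than hand-rolling the format.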

GGML

If you use a MacBook Pro with an Apple processor for your ML development then you may be familiar with the GGML format. The GGML format is optimised for running on Apple hardware, has various tweaks (e.g., 16-bit representation to reduce memory size) that allow models to be run locally, and is written in C, which is super efficient. As an aside: GGML is also a library used to run models stored in the GGML format.

A major drawback of GGML is that it is not a native format. Scripts convert saved models from PyTorch into the GGML format, so for each architecture type you need a conversion script. This, along with other issues such as lack of backward compatibility when the model structure changed, led to the rapid decline of GGML.

GGUF

This format was designed to overcome the issues with GGML (particularly the lack of backward compatibility) while preserving its benefit of being able to run state-of-the-art models in a resource-constrained environment like your personal laptop (through quantisation). Quantisation involves reducing the precision of the floating point numbers used to represent the model. This can reduce the amount of storage and memory needed without adversely impacting model performance.
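The core idea of quantisation is easy to illustrate. The plain-Python sketch below (not the actual GGUF code, which uses more sophisticated per-block schemes) maps 32-bit floats onto 8-bit integers plus a single scale factor:

```python
def quantise_int8(values):
    # one scale factor for the whole list; GGUF quant types use per-block scales
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantise(qvalues, scale):
    return [q * scale for q in qvalues]

weights = [0.5, -1.27, 0.03]
q, s = quantise_int8(weights)
approx = dequantise(q, s)  # close to the originals at a quarter of the storage
```

The reconstruction error is bounded by half the scale factor, which is why well-chosen block sizes keep model quality largely intact.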

If you have used Nomic’s GPT4ALL (based on llama.cpp) to run LLMs locally, you would have used a quantised model in the GGUF format. You would have also used the GGUF format if you are running on Apple hardware or if you have used llama.cpp directly.

Conclusions

There are three main model formats to consider when it comes to consuming external models and distributing our own model:

Pickle: if you are consuming external models for experimentation without the intention of putting them into production.

Safetensors: if you are ready to distribute your model or are planning to consume an external model for production deployment.

GGUF: if you are on Apple hardware or want to run high performance models in a resource constrained environment or you want to use something like GPT4ALL to ‘host’ the model instead of using say Python Transformers to access and run the model.

Understanding the Key and Query Concept in Large Language Models

The attention mechanism was an important innovation and led to the rise of large language models like GPT. The paper ‘Attention is all you need’ introduced the Transformer network architecture which showed state-of-the-art performance, in sequence prediction tasks, using only the attention mechanism. The attention mechanism used in the paper is called ‘Self-attention’.

The self-attention mechanism learns to pay attention to different parts of the input sequence. This is done by learning how the input sequence interacts with itself. This is where the Key and Query tensors play an important role.

The Key and Query tensors are computed by multiplying the input tensor (in embedding format) with two different weight tensors. The resultant tensors are referred to as the Key and Query tensors. The attention mechanism is trained by tuning the Key, Query, and Value weights. Once trained, the attention mechanism is then able to ‘identify’ the important parts of the input which can then be used to generate the next output token.

If I is the input tensor and Wk and Wq are the Key and Query weights then we can define the Key (K) and Query (Q) tensors as:

K = I @ Wk       (1)

Q = I @ Wq       (2)

(where @ stands for matrix multiplication)

Key Understanding: For each token in the input (I) we have embeddings. The same embeddings are then matrix multiplied three times – once with weights for the Key, once with weights for the Query, and once with weights for the Value. This produces the Key, Query, and Value tensors.

Attention Score

The attention score (self-attention) is calculated by taking the dot product between Q and K and scaling the result. This mechanism is therefore also referred to as ‘scaled dot-product attention’. The output of this operation is then used to enhance/degrade the Value tensor.
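The whole computation can be sketched in a few lines of NumPy, following equations (1) and (2). The weight matrices below are random stand-ins for learnt weights, the sequence and embedding sizes are arbitrary, and the softmax normalisation from the original paper is included to turn scores into attention weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
I = rng.normal(size=(seq_len, d_model))            # input token embeddings
Wk, Wq, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

K = I @ Wk                                          # (1)
Q = I @ Wq                                          # (2)
V = I @ Wv

scores = Q @ K.T / np.sqrt(d_k)                     # scaled dot product
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = attn @ V                                   # attention-weighted Values
```

Each row of `attn` sums to 1, so the output for each token is a weighted mix of all the Value vectors.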

Key Understanding: The attention score ‘decorates’ the value tensor and is a form of automated ‘feature extraction’. In other words, the attention mechanism pulls out important parts of the input which then aids in generating contextually correct output.

Need for Transformations

These transformations change the ‘space’ the input words (tokens, strictly speaking) sit in. Think of it like putting two dots on a piece of cloth and then stretching it: the dots move away from each other. If instead we fold the cloth, the dots come closer together. The weights learnt should modify the input in a way that reflects some property of the inputs. For example, nouns and related pronouns should come closer in a sentence like:

‘The dog finished his food quickly.’

Or even across sentences:

‘My favourite fruit is an apple. I prefer it over a banana.’

Key Understanding: We should be able to test this out by training a small model and investigating the change in similarity between the same pair of tokens in the Key and Query tensors.

An Example

Let us process the following example: “Apple iPhone is an amazing phone. I like it better than an apple.”

The process:

  1. Train a small encoder-decoder model that uses attention heads.
  2. Extract Key and Query weights from the model.
  3. Convert the input example above into embeddings.
  4. Create the Key and Query tensors using (1) and (2).
  5. Use cosine similarity to find similarity between tokens in the resulting tensors.
  6. Plot the similarity and compare the two outputs.
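Step 5 above can be sketched as follows; here a tiny hand-made matrix stands in for the Key (or Query) tensor extracted from a trained model:

```python
import numpy as np

def cosine_similarity_matrix(X):
    # entry (i, j) is the cosine similarity between token vectors i and j
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# stand-in Key tensor: 3 tokens, 2 dimensions
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(K)  # the diagonal is 1.0 (each token vs itself)
```

Plotting `sim` as a heatmap, with the token text as axis labels, produces figures like the ones below.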

Below we can see the similarity measure between Key tensor tokens for the example input.

Figure 1: Comparing similarity between tokens of the Key tensor, mapped to the input text.

Below we can see the similarity measure between Query tensor tokens for the example input.

Figure 2: Comparing similarity between tokens of the Query tensor, mapped to the input text.

The yellow blocks in the above figures show exactly the same token (1.00).

‘an’ is a token which is repeated in the example sentence; therefore a yellow block also lies outside the diagonal.

The tokens ‘apple.’ and ‘Apple’ are quite similar (0.96) in both the Key and Query tensors, but unfortunately not in the current context as these refer to different objects. Similarly, if we look at the tokens ‘Apple’ and ‘iPhone’ (the first row) we find high similarity in both the tensors.

For ‘Apple’ and ‘than’ the similarity in the Key tensor is around 0.72. For the Query tensor it is around 0.68. This means these tokens are closer in the Key space (higher similarity) than the Query space.

Figure 3: Comparing the similarity scores between Key and Query tensors, mapped to the input text.

Figure 3 shows the difference in similarity for the same text between the Key and Query tensor. If we look at ‘Apple’ and ‘apple.’ we see the difference between the two tensors is around 0.001.

This is where the architecture and training process play a critical role. For a model like GPT-3.5, we expect the context to be crystal clear and the generated content should be aligned with the context. In other words, the generated output should not confuse Apple iPhone with the apple the fruit.

Key Understanding: An architecture that involves structures such as multiple attention heads (instead of one) is able to tease out many different relationships across the input.

Training larger models with more data will ensure the Key and Query weights are able to extract more ‘features’ from the input text to generate the correct output.

RAG with LangChain

This is the fourth post in the Retrieval Augmented Generation (RAG) series where we look at using LangChain to implement RAG.

If you just want the code – the link is just before the Conclusion section at the end of the post.

At its core, LangChain is a Python library that contains tools for common tasks that can be chained together to build apps that process natural language using an AI model.

The LangChain website is quite good and has lots of relevant examples. One of those examples helped write a LangChain version of my blog RAG experiment.

LangChain allows the developer to focus on building these chains using the available tools (or creating a custom tool where required) instead of coding everything from scratch. Some common tasks are:

  1. reading documents of different formats from a directory or some other data-source
  2. extracting text and other items from those documents (e.g., tables)
  3. processing the extracted text (e.g., chunking, stemming etc.)
  4. creating prompts for LLMs based on templates and processed text (e.g., RAG)

I first implemented all these tasks by hand to understand the full end-to-end process. But in this post I will show you how we can achieve the same result, in far less dev time, using LangChain.

Introduction to LangChain

LangChain is made up of four main components:

  1. LangChain – contains the core capabilities: building chains, prompt templates, and adapters for common LLM providers (e.g., HuggingFace).
  2. LangChain Community – this is where tools added by the LangChain developer community live (e.g., LLM providers, data-store providers, and document loaders). This is the true benefit of using LangChain: you are likely to find common data stores, document handlers, and LLM providers already integrated and ready for use.
  3. LangServe – to deploy the chains you have created and expose them via REST APIs.
  4. LangSmith – adds observability, testing and debugging to the mix, needed for production use of LangChain-based apps.

More details can be found here.

Install LangChain using pip:

pip install langchain

Key Building Blocks for RAG mapped to LangChain

The simplest version of RAG requires the following key building blocks:

  1. Text Extraction Block – dependent on type of data source and format
  2. Text Processing Block – dependent on use-case and vectorisation method used.
  3. Vector DB (and Embedding) Block – to store/retrieve data for RAG
  4. Prompt Template – to integrate retrieved data with the user query and instructions for the LLM
  5. LLM – to process the question and provide a response

Text Extraction can be a tedious task. In my example I had converted my blog posts to a set of plain text files, which meant that reading them was simple. But what if you have a bunch of MS Word docs or PDFs or HTML files or even Python source code? The text extractor then has to deal with structure and encoding. This can be quite a challenging task, especially in a real enterprise where different versions of the same format can exist in the same data source.

LangChain Community provides different packages for dealing with various document formats. These are contained in the following package:

langchain_community.document_loaders

With the two lines of code below you can load all the blog post text files from a directory (with recursion) in a multi-threaded manner into a set of documents. It also populates the document with its source as metadata (quite helpful if you want to add references in the response).

dir_loader = DirectoryLoader("\\posts", glob="**/*.txt", use_multithreading=True)
blog_docs = dir_loader.load()

Check here for further details and available document loaders.

Text Processing is well suited for this ‘tool-chain’ approach since it usually involves chaining a number of specific processing tasks (e.g., extract text from the document and remove stop words from it) based on the use-case. The most common processing task for RAG is chunking. This could mean chunking a text document by length, paragraphs or lines. Or it could involve chunking JSON text by documents. Or code by functions and modules. These are contained in the following package:

langchain.text_splitter

With the two lines below we create a token-based splitter (using TikToken) that splits by fixed number of tokens (100) with an overlap of 10 tokens and then we pass it the blog documents from the loader to generate chunked docs.

text_splitter = TokenTextSplitter.from_tiktoken_encoder(chunk_size=100, chunk_overlap=10)                                                                
docs = text_splitter.split_documents(blog_docs)

Check here for further details and available text processors.

Vector DB integration took the longest because I had to learn to set up Milvus and how to perform reads/writes using the pymilvus library. Not surprisingly, different vector db integrations are another set of capabilities implemented within LangChain. In this example, instead of using Milvus, I used Chroma as it did what I needed and had a one-line integration.

For writing into the vector db we will also need an embedding model. This is also something that LangChain provides.

The following package provides access to embedding models:

langchain.embeddings

The code below initialises the selected embedding model and readies it for use in converting the blog post chunks into vectors (using the GPU).

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L12-v2"               
emb_kw_args = {"device":"cuda"}                                            
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL, model_kwargs=emb_kw_args)

The vector store integrations can be found in the following package:

langchain_community.vectorstores

Further information on vector store integrations can be found here.

The two lines below first set up the Chroma database and write the extracted and processed documents using the selected embedding model. The first line takes in the chunked documents, the embedding model we initialised above, and the directory we want to persist the database in (otherwise the whole pipeline would run and recreate the database on every execution). This is the ‘write’ path in one line.

db = Chroma.from_documents(docs, embeddings, persist_directory="./data")
retriever = db.as_retriever()

The second line is the ‘read’ path again in one line. This creates a retriever object that we can add to the chain and use it to retrieve, at run time, documents related to the query.

Prompt Template allows us to parameterise the prompt while preserving its structure. This defines the interface for the chain that we are creating. To invoke the chain we will have to pass the parameters as defined using the prompt template.

The RAG template below has two variables: ‘context’ and ‘question’. The ‘context’ variable is where LangChain will inject the text retrieved from the vector db. The ‘question’ variable is where we inject the user’s query.

rag_template = """Answer only using the context.                             
                  Context: {context}                                        
                  Question: {question}                                        
                  Answer: """

The non-RAG template shown below has only one variable ‘question’ which contains the user’s query (given that this is non-RAG).

non_rag_template = """Answer the question: {question}                         
                      Answer: """

The code below sets up the objects for the two prompts along with the input variables.

rag_prompt = PromptTemplate(template=rag_template, input_variables=['context','question'])                                                
non_rag_prompt = PromptTemplate(template=non_rag_template, input_variables=['question'])

The different types of Prompt Templates can be found here.

LLM Block is perhaps the central reason for the existence of LangChain. This block allows us to encapsulate any given LLM into a ‘tool’ that can be made part of a LangChain chain. The power of the community means that we have implementations already in place for common providers like OpenAI, HuggingFace, GPT4ALL etc. Lots of code that we can avoid writing, particularly if our focus is app development and not learning about LLMs. We can also create custom handlers for LLMs.

The package below contains implementations for common LLM providers:

langchain_community.llms

Further information on LLM providers and creating custom providers can be found here.

For this example I am running models on my own laptop (instead of relying on GPT4 – cost constraints!). This means we need to use GPT4ALL with LangChain (from the above package). GPT4ALL allows us to download various models and run them on our machine without writing code. They also allow us to identify smaller but capable models with inputs from the wider community on strengths and weaknesses. Once you download the model files you just pass the location to the LangChain GPT4ALL tool and it takes care of the rest. One drawback – I have not been able to figure out how to use the GPU to run the models when using GPT4ALL.

For this example I am using ORCA2-13b and FALCON-7b. All it takes is one line (see below) to prepare the model (in this case ORCA2) to be used in a chain.

gpt4all = GPT4All(model=ORCA_MODEL_PATH)

The Chain in All Its Glory

So far we have set up all the tools we will need to implement the RAG and Non-RAG chains for our comparison. Now we bring them all together into chains that we can invoke using the defined parameters. The cool thing is that we can use the same initialised model in multiple chains. This means we also save time coding the integration of these tools.

Let us look at the simpler Non-RAG chain:

non_rag_chain = LLMChain(prompt=non_rag_prompt, llm=gpt4all)

This uses LLMChain (available as part of the main langchain package). The chain above is the simplest possible – which is doing nothing but passing a query to the LLM and getting a response.

Let us see the full power of LangChain with the RAG chain:

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | gpt4all
)

The chain starts off by populating the ‘context’ variable using the vector db retriever and the ‘format_docs’ function we created. The function simply concatenates the various text chunks retrieved from the vector database. The variable ‘question’ is sent to the retriever as well as passed through to the rag_prompt (as defined by the template). Following this, the output from the prompt template is passed to the model encapsulated by the gpt4all object.
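A minimal version of the ‘format_docs’ helper could look like the sketch below (the exact implementation in the linked code may differ); it relies on the `page_content` attribute that LangChain document objects expose:

```python
def format_docs(docs):
    # concatenate the retrieved chunks into one context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)
```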

The retriever, rag_prompt, and gpt4all objects are all tools that we created using LangChain. ‘format_docs’ is a simple Python function that I created. We can use the ‘run’ method on the LLMChain and the ‘invoke’ method on the RAG chain to execute the pipeline.

The full code is available here.

Conclusion

Hopefully this post will help you take the first steps in building Gen AI apps using LangChain. While it was quite easy to learn and build applications using LangChain, it does have some drawbacks when compared to building your own pipeline.

From a learning perspective LangChain hides a lot of the complexities or exposes them through argument lists (e.g., running the embedding model on a GPU). This might be good or bad depending on how you like to learn and what your learning target is (LLMs vs just building a Gen AI app).

Frameworks also fix the way of doing something. If you want to explore different options then write your own code. The best approach may be to use LangChain-defined tools for the ‘boilerplate’ parts of the code (e.g., loading documents) and your own code (as shown with the ‘format_docs’ function) where you want to dive deeper. The pipeline does not need to have a Gen AI model to run. For example, you could just create a document extraction chain and integrate manually with the LLM.

There is also the issue of understanding how to productionise the app. LangSmith offers a lot of the capabilities, but given that this framework is relatively new it will take time to mature.

Frameworks also follow standard architecture patterns. This means that if our components are not part of a standard architecture (e.g., we are running models independently of any ‘runner’ like GPT4ALL), as can be the case in an enterprise use-case, then we have to build a custom LangChain model wrapper that allows us to integrate them with the chain.

Finally, frameworks can come and go. We still remember the React vs Angular choice. It is usually difficult to get developers who know more than one complex framework. The other option in this space, for example, is LlamaIndex. If you see the code examples in the link you will find similar patterns of configuring the pipeline.

Measuring Benefits of RAG

Understanding the Response

In this final post we look at how we can start to measure the performance of Retrieval Augmented Generation (RAG) and therefore estimate its value. Figure 1 shows the basic generation mechanism in text generation models. The query has to trigger the correct problem-solving paths, which combine the training data to create a response. The response is a mix of text from the training data and newly generated text. This generated text can be an extrapolation based on the training data or it could simply glue the training data together without adding new facts.

Figure 1: Basic mechanism of text generation using a Generative AI Model.

For the enterprise question and answer use-case we want to start with the latter behaviour but tune it, as shown in Figure 2, to include relevant data from a proprietary database, minimize data from training, and control the generated text. So ideally, we want the model to use its problem solving capabilities to generate a response based on relevant proprietary data.

Figure 2: Retrieval augmented response generation.

Figure 3 shows some possible combinations we can find in the response.

Figure 3: Possible combinations present in a RAG response.

A response that consists of large amounts of generated text with little grounding in training data could be a possible hallucination. If the response is based mainly on training data with either generated text or data from proprietary database added to it we may find the response plausible but it may not be accurate.

To illustrate this with an example: say this was deployed on a retail website and the customer/agent wanted to ask about the return policy associated with a particular product. We want the model to use its problem-solving (language generation) capabilities to generate the response, using the data associated with the product policy available in the retailer's database. We do not want it to create its own returns policy based on its training data (training data driven in Figure 3) or for it to get ‘creative’ and generate its own returns policy (hallucination). We also don't want it to mix training data with proprietary data, because there may be significant differences between generic return policies it may have observed during training and the specific policy associated with that product.

The ideal response (see Figure 3) consists of data retrieved from the proprietary database, where we can be assured of its accuracy, validity, and alignment with business strategy with the generated text used as a glue to explain the policy without adding, changing, or removing any part of it. In this case we are using the text comprehension (retrieved docs), problem solving (summarisation/expansion), and text generation skills of the model.

Building the Baseline

To evaluate the RAG response we need to build a baseline. This is critical for cases where the proprietary data is not entirely unique and similar examples may be present in the training data of the model being used. For example, standard terms and conditions that have been customised for an organisation. In this case, it is important to run the common queries we expect against the model without any supporting proprietary data (see Figure 1). This tells us what answer the model gives based purely on training data and how it differs from the organisation-specific answer.

Therefore, to build the baseline we create a corpus of questions (both relevant and nonsensical) that we believe the customers would ask and interrogate the selected model. The analysis of the response tells us what the model knows and which class of undesired responses are most common. This analysis must be done by subject matter experts to ensure semantic correctness.

Running with RAG

Using the same corpus of questions with the RAG pipeline allows us to generate what we expect to be accurate and focused responses. We can compare these with the baseline and weed out cases where the model is focusing more on the training (generic) data than the retrieved proprietary data.

Ideally, we want the RAG response to be different from the non-RAG responses for the same question. For the case where the proprietary data is unique (e.g., customer interactions, transactions, internal company policies, etc.) this difference should be quite significant. We should evaluate this difference both objectively (say using embedding vectors – see next section) as well as subjectively using subject matter experts.

We should analyse the responses where the question was not related to the domain of the organisation to protect against different prompt-based attacks. These questions should produce some response from the model when we are not using RAG. When we use RAG we should be able to block the request to the model because we do not find relevant data in the proprietary database.

Building a corpus of questions also allows us to cache the model response thereby reducing runtime costs. This would require first checking the question against a cache of answered questions and running the RAG pipeline only if the question is not found in the cache or there is significant difference (e.g., using embedding vector distance).
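The cache check described above can be sketched as follows; `embed()` and `run_rag_pipeline()` are hypothetical stand-ins for a real embedding model and the actual RAG chain:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer(question, cache, embed, run_rag_pipeline, threshold=0.9):
    q_vec = embed(question)
    for cached_vec, cached_answer in cache.values():
        if cosine(q_vec, cached_vec) >= threshold:
            return cached_answer              # cache hit: skip the model call
    response = run_rag_pipeline(question)     # cache miss: run the full pipeline
    cache[question] = (q_vec, response)
    return response
```

In practice `embed()` would call the same embedding model used for the vector database, and the threshold would be tuned on real pairs of paraphrased questions.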

Comparing the Output

The basic requirement here is to compare the response generated with and without RAG, and proving that RAG response is accurate and safe even when asked unrelated or nonsensical questions.

There are various objective metrics we can look at.

Firstly, we can compare the amount of text generated with and without RAG. We would expect that on common topics the generated text is similar in length, but on unique topics there should be a significant difference.
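As a toy illustration of this first metric (counting whitespace-separated words here; real runs would count model tokens, and the response strings are made up):

```python
def avg_length(responses):
    # average length across repeated runs of the same question
    return sum(len(r.split()) for r in responses) / len(responses)

rag_runs = ["The return policy allows 30 days.", "Returns are accepted for 30 days."]
non_rag_runs = ["Policies vary."]
diff = avg_length(rag_runs) - avg_length(non_rag_runs)  # positive: RAG wrote more
```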

Secondly, we can use an embedding generating model (e.g., sentence-transformers/all-MiniLM-L6-v2) to create embeddings from the RAG and non-RAG responses. These can then be compared using similarity metrics.

Thirdly, we can use subject matter experts and customer panels to manually evaluate the responses (without revealing which answer was generated with/without RAG). What we should find is a clear preference for the RAG answer.

In this post we will focus on comparing the amount of text generated as well as the similarity between the RAG and non-RAG response.

If we find no significant difference between RAG and non-RAG response and no clear preference for the RAG response then the dataset may be a good target for non-generative AI based methods that are significantly cheaper to run.

An Example

It is always good to get your hands dirty to explore a topic. I built my own RAG pipeline using my blog posts as the source of proprietary data.

Step 1: Review the documents in the data store and build the corpus with a mix of related and unrelated questions.

Step 2: Build the RAG pipeline – see the second post. Three major components are: Milvus vector database, sentence transformer for similarity, and GPT4 as the text generation model.

Step 3: Run each question multiple times with and without RAG and collect the response data (text and tokens generated). This will allow us to get a good baseline.

Comparing Amount of Text Generated

Figure 4 compares length of text generated with and without RAG. Given the set of questions (X-axis) we compare the number of tokens (Y-axis) generated by the model with and without RAG. We repeat the same question multiple times with and without RAG and calculate the average value.

In general, when using RAG we see more text is generated on average than without RAG, although for some questions (e.g., questions 13 and 16) less text is generated with RAG. These two questions relate to topics which are unlikely to be found in the proprietary dataset or the training data of the model (e.g., ‘who is Azahar Machwe’). Question 17 shows a big difference in size given that one of the posts in my blog relates directly to the question.

Figure 4: Comparing average amount of text generated with and without RAG.

Figure 5 shows actual lengths of generated text with and without RAG. We can see that without RAG the spread and skew are quite significant. For RAG the output is less spread out and the skew also appears to be smaller. The median for Non-RAG is below the first quartile of RAG.

Figure 5: Comparing actual lengths of generated text with and without RAG.

These two plots (Figure 4 and 5) show that adding proprietary data is influencing the quantity of text generated by the model.

Comparing Similarities of Output

We use the same set of questions and generate multiple responses with and without RAG. We then measure the similarity of the RAG response to the Non-RAG response. Figure 6 shows the similarities plotted against different questions.

Figure 6: Similarity score of RAG and Non-RAG responses across different questions.

We can see a few patterns in the spread of the similarities.

  1. Questions 2, 14, and 15 show a large spread with a skew, which suggests that the questions being asked have answers in the proprietary data as well as the training data. In this case these were technical questions about ActiveMQ, InfluxDB, and New Delhi, India.
  2. Questions 1, 13, and 17 show no spread and sit at one end of the similarity scale.
    • Questions 1 and 13 show that the RAG and Non-RAG responses are similar.
      • This could be because my original blog posts and the training data for the model were sourced from the same set of articles.
      • Alternatively, both my blog posts and the model had no information about the topic (leading to similar ‘nothing found’ responses from the model).
    • Question 17 is at the other end and shows that the RAG and Non-RAG responses have low similarity. This question was very specific to a blog post I wrote. Comparing it to Figure 4 we can see that the RAG response contains much more text than the Non-RAG one.
  3. Questions 5, 6, 7, 8, and 9 show similarities between 0.7 and 0.9. These responses:
    • Need additional scrutiny to ensure we are not depending on training data.
    • May need further prompt design to sample responses that give low similarity (below 0.7).

Key Takeaways

  1. Human scrutiny (subject matter experts, customer surveys, and customer panels) is required, especially once the solution has been deployed, to ensure we evaluate the prompt design, vectorisation strategy, and model performance.
  2. We should build a corpus of questions consisting of relevant, irrelevant, and purely nonsense (e.g., prompt attack) questions, and keep updating it whilst the application is in production. This will allow us to evaluate new models, vectorisation strategies, and prompt designs. There are some examples where models are used to generate the initial corpus when no prior examples are available.
  3. Aim for clear separation between RAG and Non-RAG responses in terms of similarity and amount of text generated, and fold human scrutiny into this.
Figure 7: Developing and operationalising RAG-based applications.

Implementing RAG (2)

In the previous post we looked at the need for Retrieval Augmented Generation (RAG). In this post we understand how to implement it.

Figure 1 shows the RAG architecture. Its major components include:

  • Embedding Model: converts the query and proprietary data text into vectors for vector search.
  • Vector DB: stores and searches vectorised documents.
  • Large Language Model (LLM): answers the user’s query based on retrieved documents.
  • Proprietary Data Store: the source of ‘non-public’ data.
Figure 1: Outline of RAG Architecture

I used Python to create a simple app to explore RAG, using my blog posts as the proprietary ‘data source’ that we want to use for answering questions. The component mapping is shown in Figure 2 and described below.

  • Embedding Model: Sentence-Transformers (python sentence_transformers)
  • Vector DB: Milvus with python client (pymilvus)
  • LLM: GPT-4 (gpt-4-0613) via Open AI API client (python)
  • Proprietary Data Source: blog posts – one text file per blog post
Figure 2: Components used to implement RAG

I used the dockerised version of Milvus. It was super easy to use; just remember to reduce the logging level as the logs are quite verbose. Download the docker compose file from here: https://milvus.io/docs/install_standalone-docker.md

Sentence Transformers (sentence_transformers), Python Milvus client (pymilvus), and OpenAI (openai) Python Client can all be installed using pip.

RAG Implementation Code

The main RAG implementation is here: https://github.com/amachwe/rag_test/blob/master/rag_main.py

The logic is straightforward. We have a list of queries covering topics relevant to the blog posts and topics that are not, so we can run some experiments on the output.

We execute each query against GPT-4 twice. Once with attached documents retrieved from the vector database (using the same query) and once without any attached documents.

We then vectorise the RAG and Non-RAG response from the language model and compare them to get a similarity score.
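The similarity score can be sketched as a cosine comparison of the two response vectors; in the real pipeline the vectors come from the sentence transformer, and the toy vectors below are made-up stand-ins:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the vectorised RAG and Non-RAG responses.
rag_vec = [0.9, 0.1, 0.3]
non_rag_vec = [0.8, 0.2, 0.4]
score = cosine_similarity(rag_vec, non_rag_vec)
print(round(score, 3))  # 0.984: a high score means the responses are similar
```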

The data from each run is saved in a file named ‘run.csv’.

We also log the responses from the language model.

Vector DB Build Code

The code to populate the vector db with documents can be found here:

https://github.com/amachwe/rag_test/blob/master/load_main_single.py

For this test I created a folder with multiple text files in it. Each text file corresponds to a single blog post. Some posts are really small and some are quite long. The chunking is at the sentence level. I will investigate other chunking methods in the future.
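The sentence-level chunking can be sketched with a simple regex split; a real pipeline might use a proper sentence tokenizer, so this is only a rough stand-in:

```python
import re

def chunk_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

# A made-up blog post snippet to chunk.
post = "Milvus is a vector database. It stores embeddings. Search is fast!"
print(chunk_sentences(post))
# ['Milvus is a vector database.', 'It stores embeddings.', 'Search is fast!']
```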

Running the Code

To run the code you will first need to populate the vector db with documents of your choice.

Collection name, schema, and indexes can be changed as needed. In this respect the Milvus documentation is quite good (https://milvus.io/docs/schema.md).

Once the database has been loaded, the RAG test can be run. Make sure the collection name and OpenAI API key are updated.

In the next post we look at the results and start thinking about how to evaluate the output.

Fun With RAG (1)

A short series on RAG – which is one of the most popular methods for working with LLMs whilst minimising the impact of training bias and hallucinations.

Retrieval Augmented Generation (RAG) involves supplementing the information stored within a large language model (LLM) with information retrieved from an external source for answering questions with improved accuracy and references for validation.

RAG is especially important when we want to answer questions accurately without retraining the LLM on the dataset of interest. This decoupling from retraining allows us to update data independently of the model.

The Source of LLM’s Knowledge

Let us understand a bit more about the above statement regarding the training of LLMs. Generally, training involves multiple stages with different types of data involved in each stage.

The first stage involves large datasets consisting of Wikipedia, books, manuals, and anything else that we can find on the Internet. This is like parents telling their children to read a wide variety of text to improve their knowledge, vocabulary, and comprehension. This usually results in a generic ‘language model’, also known as a foundation LLM. The main problem with such a model is that while it knows a lot about constructing language, it has not learnt how to use that language for a specific purpose, such as holding a chat.

The second and subsequent stages involve fine-tuning a foundation model to specialise it for specific tasks such as having a ‘chat’ (e.g., ChatGPT) or helping search for information. The data used during fine-tuning is quite different from what is used in the first stage and includes human feedback and curated datasets. This is similar to a college student reading a coursebook; they already know the language structures and form. The data is usually specialised to reflect the tasks the model is being fine-tuned for. A lot of the ‘understanding’ of what makes a proper response is learnt at this stage.

Key Concept
Therefore, as a finished product, the LLM has patterns and data from the following sources: the Internet (including books, Wikipedia, and manuals), human feedback on its responses, and specialised curated datasets suitable for the tasks it was developed for.

The Gap

The LLM as a finished product is quite useful. It is able to perform many useful tasks and use its capability of language formation to respond with the correct syntax and format. For example, if you ask GPT-4 to write a poem it will use the correct words and format.

The Scenario

But imagine you were running a company that sold electronic gadgets. Customer and colleague feedback indicates that even though you have a set of Frequently Asked Questions and well-structured product guides, it still takes time to get to specific pieces of information. You discover customers are calling to get the agents to summarise and explain parts of the guide. This process is far from simple and often leads to confusion, as each agent summarises in a different way and sometimes the less experienced agents provide erroneous information.

Obviously, you get very excited about the power of LLMs and build a business case for how you can save agent time, enable customer self-serve, and increase first-time-correct rates by using LLMs to explain, summarise, and guide the agents and the customers. You set up a proof of concept to test the technology and create an environment with your favourite cloud provider to try out some LLMs. This is where you hit the first issue. While the LLMs you are trying have generic knowledge about electronics and can answer common questions, they have no way of answering specific questions, because either that data did not form part of the training set, there was not enough of it, or it did not exist on the public Internet when training was carried out.

As an example, say the customer wants information about a recently released laptop. The LLM will be able to answer questions like ‘what is usb-c’ or ‘tell me about how to clean laptop screens’ as there is a lot of information on the Internet about these topics. But if we ask specific questions like ‘what is the expected battery life of the laptop’, the LLM will not be able to answer since this laptop model did not exist when the LLM was trained. You want the LLM to say ‘sorry, I don’t know’, but that is difficult unless you can tabulate what you do know.

The worst case is that the LLM has partial or related data, say for the laptop brand or some other laptop model, and it uses that to create a response that sounds genuine but is either incorrect or baseless. This can increase the frustration experienced by customers and agents.

Key Concept 
The LLM does not have enough knowledge to answer the question. There are other cases where it may have partial knowledge, incorrect knowledge, or it may 'think' it has enough based on links with unrelated knowledge items. But because the LLM is compelled to generate, it will generate some response. This gap in knowledge between what it needs to respond correctly vs what it knows is where RAG comes in.

LLMs: Forward Pass vs Generate

Large language models are complex constructs that can understand and generate language. They are made up of a token-probability map generator and a next-token selector which uses that map to select the next token. We will explore the structure of Large Language Models (LLMs) using the transformers python library provided by HuggingFace.

The token generator part is the gazillion-parameter, heavyweight, neural-network-based language model. We create it with the ‘from_pretrained’ convenience method (e.g., model = GPT2LMHeadModel.from_pretrained(‘gpt2’)), where the ‘model’ object encapsulates the LLM. There are different model classes associated with each model type.

The token probabilities for the next token are obtained by performing a forward pass through the model, i.e., calling the model object on the tokenised input.

The output will be shaped according to the vocabulary size and output size of the model (e.g., 50,257 for GPT2 and 32,000 for Llama2). If we do some processing of the output and map it against the index in the ‘vocab.json’ associated with the model, we can get a probability map of tokens: for each token, its index, probability score, and text value.
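The ‘processing of the output’ is essentially a softmax over the raw model outputs (logits) followed by a mapping onto the vocabulary. A minimal sketch, with a made-up four-token vocabulary standing in for the real 50,257-entry one:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalise to probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a four-token vocabulary (GPT2 really emits 50,257 values).
vocab = {0: "cat", 1: "dog", 2: "pizza", 3: "the"}
logits = [2.0, 1.0, 0.1, 3.0]

probs = softmax(logits)
token_map = sorted(zip(vocab.values(), probs), key=lambda pair: -pair[1])
for token, p in token_map:
    print(f"{token}: {p:.3f}")  # most probable next token printed first
```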

The token generator and selector parts are conveniently encapsulated in the generate method associated with the model object (model.generate).

The selector uses searching and sampling mechanisms (e.g., top-k/top-p sampling and beam search) to select the next token. This component is what provides the mechanism to inject variation into the generated output. The model itself doesn’t provide any variability: the forward pass will predict the same set of next tokens given the same input.

The forward pass is repeated again by adding the previously generated token to the input (for auto-regressive models like GPT) which allows the response to ‘grow’ one token at a time. The generate method takes care of this looping under the hood. The max_length parameter controls the number of times this looping takes place (and therefore the length of the generated output).
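The auto-regressive loop can be sketched with a toy next-token function standing in for the real forward pass; the lookup table below is entirely made up, and greedy selection stands in for the fancier sampling strategies:

```python
def forward_pass(tokens):
    # Toy stand-in for the model: deterministically maps the last token
    # to "logits" over a four-token vocabulary.
    table = {0: [0.1, 0.9, 0.0, 0.0],
             1: [0.0, 0.1, 0.9, 0.0],
             2: [0.0, 0.0, 0.1, 0.9],
             3: [0.9, 0.0, 0.0, 0.1]}
    return table[tokens[-1]]

def generate(prompt_tokens, max_length):
    tokens = list(prompt_tokens)
    # Auto-regressive loop: feed the growing sequence back in each step
    # and greedily pick the highest-probability next token.
    while len(tokens) < max_length:
        logits = forward_pass(tokens)
        next_token = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(next_token)
    return tokens

print(generate([0], max_length=5))  # [0, 1, 2, 3, 0]
```

Here max_length caps the loop count just as it caps the length of the generated output in the real generate method.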

Generating Embeddings

Embeddings allow non-numeric data such as graphs and text to be passed into neural networks for use-cases such as machine translation. Embeddings map the non-numeric data into a set of numbers that preserve some/all of the properties of the original data.

Let us start with a simple example: we want to translate a single word from one language to another (say English to German). The model will have three parts – an encoder that maps an English word to a point in an abstract numeric space, a ‘decoder’ that maps a point from another numeric space to a German word, and a translator that maps the ‘English’ numeric space to the ‘German’ numeric space.

Encoder: [hat] -> [42]

Translation: [42] -> [100]

Decoder: [100] -> [Hut]

Remember the requirement that some/all of the properties must be preserved when moving to the numeric space? In the example above, for ’42’ to be an embedding and not just an encoding it should reflect the meaning of the word ‘hat’. In a numeric space that has preserved the ‘meaning’ we expect that the numbers for ‘hat’ and ‘cap’ should be close to each other and the number for the word ‘elephant’ should be far away.
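This closeness property can be checked with a simple distance measure; the 2-D embedding values below are invented purely for illustration:

```python
import math

# Invented 2-D embeddings; real values would come from a trained encoder.
embeddings = {
    "hat": (1.0, 2.0),
    "cap": (1.2, 2.1),
    "elephant": (8.0, 9.0),
}

def distance(word_a, word_b):
    # Euclidean distance between the two embedding points.
    return math.dist(embeddings[word_a], embeddings[word_b])

# A meaning-preserving embedding keeps related words close together.
print(distance("hat", "cap") < distance("hat", "elephant"))  # True
```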

In case of images, we would expect embeddings of the different images of hand-written ‘1’ to be close to each other and those for ‘2’ to be far away. Embeddings for ‘7’ might be closer as hand-written ‘7’ can look like a ‘1’.

Generating Embeddings

Training your own embedding generator for the English language is relatively easy as there is a lot of free data available. You can use the Book Corpus dataset (https://huggingface.co/datasets/bookcorpus), which consists of 5 GB of extracts from books.

To prepare the data to train the encoder-decoder network you will need a tokenizer, which converts a string of words into a string of numeric tokens. I used the freely available GPT2 tokenizer from Huggingface.

During the training process, tokens are then passed through the encoder-decoder network where we attempt to:

  1. train the encoder part to generate an embedding that represents the input string.
  2. train the decoder part to ‘recreate’ the input token string from the generated embedding.

The resulting trained encoder-decoder pair can then be separated (think of breaking a chocolate bar in half) and just the encoder part be used to generate embeddings.

Let us see with an example.

The first step is to tokenize the input string:

‘I love pizza’ -> tokenize -> [45, 54, 200]

Then to train the encoder-decoder pair so that input can be recreated by decoder at the other end:

Input: [45, 54, 200] -> encoder -> [4, 6] -> decoder -> Output: [45, 54, 200]

In the example above the generated embeddings have 2-dimensions (easier to visualise) with the value of [4, 6]. This when passed to the decoder should generate a token string that is same as the input.

The neural network usually looks like the figure ‘8’ with the embedding output found in the narrow ‘waist’.

For my example I used a maximum sentence length of 1024 words and an embedding dimension of 2 (to make it easier to visualise): an input layer of 1024, three encoder layers of 500, 200, and 2, mirrored by decoder layers of 2, 200, and 500, and a final output layer of 1024.
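The figure-‘8’ idea can be sketched as a tiny linear autoencoder in numpy; the sizes, data, and single-layer encoder/decoder below are toy stand-ins for the 1024-500-200-2 network described above:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 8))                 # 20 toy "sentences", 8 dims each

W_enc = rng.normal(scale=0.1, size=(8, 2))   # encoder: 8 -> 2 (the waist)
W_dec = rng.normal(scale=0.1, size=(2, 8))   # decoder: 2 -> 8

def loss(X, We, Wd):
    # Mean squared reconstruction error: how well decode(encode(x)) matches x.
    return float(np.mean((X @ We @ Wd - X) ** 2))

lr = 0.05
start = loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc                            # embeddings at the waist
    err = Z @ W_dec - X                      # reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(loss(X, W_enc, W_dec) < start)         # True: reconstruction improves
embedding = X[0] @ W_enc                     # 2-D embedding of one "sentence"
```

After training, only the encoder half (W_enc here) is kept to generate embeddings, exactly as in the chocolate-bar analogy above.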

Training took 2 days to complete on my laptop.

Output Examples

Figure 1: Plotting single word embeddings.

Figure 1 shows single words put through the encoder and the resulting embeddings. Two-dimensional embeddings allow easy visualisation of the embedding points. We can see some clusters for countries and animals, and the merging of the two clusters for ‘Cat’ and ‘India’. Clearly some of each word’s ‘meaning’ is being translated into the embedding, but it is far from perfect.

Figure 2: Plotting phrase embeddings.

In Figure 2 we can see the output of phrase embeddings. We see a cluster of phrases around ‘England’ but then an outlier that talks about London being the capital of England. Other clusters can be seen around animal phrases and country phrases. This again is far from perfect.

The main thing I want to highlight is that with minimal effort preparing the data and training the encoder model, it has extracted some concept of ‘meaning’ and this property is translated into the embedding space. Similar phrases are showing up closer to each other in the numeric embedding space. This is after 2 days of training a relatively small model on a low-powered laptop.

A question that arises is ‘how did it learn the meaning?’. The answer is relatively straightforward: it did not learn the actual meaning. It learned the relationships between words based on their co-occurrence in the training dataset. This is why large language models cannot be used in a raw state: the data they are trained on may have all kinds of skewed subsets that lead to things like gender bias.

Imagine training larger models on more powerful infrastructure. Well, you don’t need to imagine, we can already see the power of GPT-4 which took months to train on highly specialized hardware (Nvidia GPUs).

Machine Learning, Forgetting and Ignoring

The focus in AI is all about Machine Learning – training larger models with more data and then showcasing how that model performs on tests. This is testing two main aspects of the model:

  1. Ability to recall relevant information.
  2. Ability to organize and present the relevant information so that it resembles human-generated content.

But no one is talking about the ability to forget information and to ignore information that is not relevant or out of date.

Reasons to Forget

Forgetting is a natural function of learning. It allows us to learn new knowledge, connect new knowledge with existing (reinforcement) and deal with the ever increasing flood of information as we grow older.

For Machine Learning models this is a critical requirement. This will allow models to keep learning without it taking more time and effort (energy) as the body of available knowledge grows. This will also allow models to build their knowledge in specific areas in an incremental way.

ChatGPT has a knowledge cut-off of September 2021 and has big gaps in its knowledge. Imagine being 1.5 years behind in today’s day and age [see Image 1].

Image 1: Gaps in ChatGPT.

Reasons to Ignore

Machine learning models need to learn to ignore. This is something humans do naturally, using the context of the task to direct our attention and ignore the noise. For example, doctors, lawyers, accountants need to focus on the latest available information in their field.

When we start to learn something new we take the same approach of focusing on specific items and ignoring everything else. Once we have mastered a topic we are able to understand why some items were ignored and what are the rules and risks of ignoring.

Current transformer models have an attention mechanism which does not tell us what to ignore and why. The ‘why’ part is very important because it adds a level of explainability to the output. For example, the model can end up paying more attention to facts that are incorrect or no longer relevant (e.g. treatments, laws, rules) because of their larger presence in the training data. If it was able to describe why it ignored related but less repeated facts (or the opposite – ignored repeated facts in favor of unique ones – see Image 2 below) then we could build specific re-training regimes and knowledge on-boarding frameworks. This can be thought of as choosing between ‘exploration’ and ‘exploitation’ of facts.

ChatGPT gives wrong information about Queen Elizabeth II, which is expected given its limitations (see Image 1), as she passed away in 2022 [see Image 2].

Image 2: Incorrect response – Queen Elizabeth II passed away in 2022.

Image 3: Asking for references used and ChatGPT responding with a few valid references (all accessed on the day of writing the post as part of the request to ChatGPT). Surely, one of those links would be updated with the facts beyond 2021. Let us investigate the second link.

Accessing the second link [see Image 4] we find updated information about Queen Elizabeth II: that she passed away in 2022. If I was not aware of this fact I might have taken the references at face value and used the output. ChatGPT should have ignored what it knew from its training in favor of newer information, but it was not able to do that.

Image 4: Information from one of the links provided as reference.

What Does It Mean For Generative AI?

For agile applications where LLMs and other generative models are ubiquitous we will need to allow these models to re-train as and when required. For that to happen we will need a mechanism for the model to forget and learn in a way that builds on what was learnt before. We will also need the model to learn to ignore information.

With ML models we will also need a way of validating/confirming that the right items have been forgotten. This points to regular certification of generative models, especially those being used in regulated verticals such as finance, insurance, healthcare, and telecoms. This is similar to the certification regimes in place for financial advisors, doctors, etc.

Looking back at this article – it is simply amazing that we are talking about Generative AI in the same context as human cognition and reasoning!

Controlling Expenses

Money worries are back. In response to rising inflation in the UK, the Bank of England has been forced to increase the base rate to 4.25%. That is expected to squeeze demand in two ways:

  1. Improving savings rate to encourage people to spend less and save more
  2. Increasing cost of borrowing – making mortgages more expensive

If you are worried about making money stretch till the end of the month then the first thing to do is: understand your spending.

The best way to understand your spending is to look at your sources of money. Generally, people have two sources:

  1. Income – what we earn
  2. Borrowing – what we borrow from one or more credit products

It is important to understand that credit cards are not the only credit product out there. Anything that allows you to buy now and pay later is a credit product, even if it doesn’t charge interest.

Understand your Income and Expenses

Go to your bank account or credit-card app to understand how much money you get in each month, how much of it is spent, and how much you borrow over that amount. Most modern banking apps allow you to download transactions in comma separated values (csv) format or as an Excel sheet. Many also have spending analytics that allow you to investigate the flow of money.

Once you have downloaded the data you just need to extract 4 items and start building your own expense tracker (see Excel sheet below).

Four pieces of information are important here:

  1. Date of Transaction
  2. Amount (with indication of money spent or earned)
  3. Categorization of the Transaction
  4. Description (optional) – to help categorize the transaction to allow you to filter

For the Categorization I like to keep it simple and just have three categories:

  1. Essential – this includes food, utilities (including broadband, mobile), transport costs, mortgage, insurance, essential childcare, small treats, monthly memberships. This is essential for your physical well-being (i.e. basic food, shelter, clothes, medicine) as well as mental and emotional well-being (i.e. entertainment, favorite food etc. as an occasional treat).
  2. Non-essential – this includes bigger purchases (> £20) such as dining out, more expensive entertainment, travel etc. that we can do without.
  3. Luxury – this includes purchases > £100 which we can do without.

The Excel sheet below will help you track your expenses.

The items that you need to provide (highlighted in the sheet) are:

  1. Transaction data from your bank and credit-card (Columns A-D), covering at least 3 months – this is time consuming the first time as you will have to go through and mark each entry as category 1, 2, or 3 (see above). Once you do it for historic data you can then maintain it with minimal effort for new data.
  2. Timespan of the data in months
  3. Income

This will generate the expenses in each category on a total and monthly basis (to help you budget).

This will also generate, based on the income provided, the savings you can target each month.

Note: All results are in terms of the currency used for Income and the Transaction records.
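The tracker’s calculations can be sketched in a few lines of Python; the transactions below mirror the four columns described above, with made-up amounts and categories:

```python
from collections import defaultdict

# Made-up transactions: (date, amount, category, description).
# Negative amounts are money spent; positive amounts would be income.
transactions = [
    ("2023-01-05", -120.0, "Essential", "groceries"),
    ("2023-01-12", -45.0, "Non-essential", "dining out"),
    ("2023-02-03", -130.0, "Essential", "utilities"),
    ("2023-02-20", -250.0, "Luxury", "new headphones"),
]

months = 2        # timespan of the data in months
income = 2000.0   # monthly income

# Total spending per category, recorded as positive values.
totals = defaultdict(float)
for _, amount, category, _ in transactions:
    if amount < 0:
        totals[category] += -amount

monthly = {cat: total / months for cat, total in totals.items()}
spend_per_month = sum(monthly.values())
print(monthly["Essential"])         # 125.0 per month on essentials
print(income - spend_per_month)     # 1727.5: target monthly savings
```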

Things To Do

Once you understand what you are spending in the three categories you can start forecasting by taking monthly average spending in each category.

Tracking how monthly average changes over a few months will tell you the variance in your spending. Generally, spending rises and falls as the year progresses (e.g. rising sharply before festive periods like Christmas and Diwali and falling right after).

You can do advanced things like factor for impact of inflation on your monthly spending and savings.

Finally, once you have confidence in the output – you can move items between categories. This will allow you to play what-if and understand the impact of changing your spending patterns on your monthly expenses and savings.