Part 4 of this series focuses on extending the framework we created in the previous post by adding some tools. If you have not read the previous post: Part 3: Agentic Talking with your Data
In the previous examples, one of the recurring issues we came across was the LLM’s lack of awareness of the current time and date. This is to be expected given the lack of real-time information streams available to the LLM.
There are also some minor fixes and improvements like using Beautiful Soup to extract data from tags and tweaking the prompt itself.
The New Prompt
We can see the new prompt below. We have added a new tool via tags to the prompt: a ‘time tool’ which allows the LLM to access the current time and date. The LLM can use the ‘T’ tag to invoke the tool.
You are an AI assistant that can take one action at a time to help a user with a question: {question}
You cannot answer without querying for the data. You are allowed to answer without asking a follow up question.
You can only select ONE action per response. Do not create a SQL query if you do not have enough information. Do not repeat the input provided by the user as part of the response.
Use tags below to indicate the action you want to take:
## Run SQL query within the following tags in one line: <SQL> </SQL>. You cannot execute a query - only generate it.
## Ask a follow up question to the user - surround question with tags in one line: <Q> </Q>.
## If you have enough data to give final answer to the user's question use tags in one line: <R> </R>.
## You can ask for help with finding current time and date using: <T> </T>.
Data schema:
{ddl}
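To make the tags actionable, the framework has to pull the chosen action and its payload out of each LLM response. The post mentions using Beautiful Soup for this; the sketch below is a hypothetical regex-based equivalent (the function names and the date format are my assumptions), including a handler for the new `<T>` tag.

```python
import re
from datetime import date

# Hypothetical sketch: the post extracts tag contents with Beautiful Soup;
# a regex captures the same idea for the four action tags.
TAG_PATTERN = re.compile(r"<(SQL|Q|R|T)>(.*?)</\1>", re.DOTALL)

def parse_action(llm_response):
    """Return (tag, payload) for the single action in a response, or None."""
    match = TAG_PATTERN.search(llm_response)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()

def handle_time_tool():
    """The 'time tool': hand the current date back to the LLM."""
    return f"The current date is {date.today().isoformat()}"

action = parse_action("<T> </T>")
if action and action[0] == "T":
    feedback = handle_time_tool()  # appended to the history for the LLM
```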
Output Example
We can see the trace of an example interaction below using Gemini 1.5 Flash. Green highlighted text contains the user’s initial input and subsequent responses. The purple highlighted numbers indicate the responses from the LLM. The blue highlighted text is the answer provided by the LLM.
Figure 1a: First part of the chat trace.
In the first part of the chat we can see the general pattern emerging where the user asks a question and the LLM asks a question in return [Response 2]. This is typically a request for the answer to the question asked by the user! If the user nudges the LLM to use the schema, it then generates, executes (via the SQL Tool), and returns a proper answer using the Answer Tool [Responses 3 and 4]. Finally, the user asks a follow-up question about that user’s joining date.
Next we come to the second part of the chat as the LLM first generates and then executes the SQL to answer the joining date question [Response 5, 6, and 7].
Then comes the interesting part – the final follow-up question: how many days ago was that from today? Previously, the LLM would have asked a follow-up question to ascertain the current date. In this case it immediately fires off a request to the Time and Date tool [Response 8], uses the result (correctly!) to create a new SQL query [Response 9], and finally gives the correct answer [Response 10] of 41 days. Note that the SQL tool returns a floating point number, which the LLM automatically rounds when converting the SQL result into an answer to the question.
Figure 1b: Second part of the chat.
Conclusion
A few interesting conclusions:
This highlights why answering a question is the end-to-end process and using SQL to gather the data is just one of the possible intermediate steps.
The LLM is also capable of overusing the tools. In this case, for example, it used the SQL tool to calculate the date difference rather than simply using the joining date and the current date to evaluate the duration.
The LLM is quickly able to pick up tools. One thing to try is finding the point at which the LLM starts getting confused by the tools available to it, and how it plans and orchestrates their use.
This is the third post in this series. The first one was about creating a basic text-to-SQL tool that used a Large Language Model to create SQL based on the user’s question and data schema. The second one tested the text-to-SQL capabilities of GitHub Copilot. We use a SQLite database as a backend.
In this post we look at answering the user’s question using a conversational mode of interaction as opposed to the Q&A mode.
Figure 1: User interaction with LLM and LLM interactions with Tools.
The LLM is made aware of the available tools and how to use them (see Figure 1). In this case we have three tools:
SQL Tool: LLM provides a SQL query through <SQL> </SQL> tags which is then caught by the Tool Runner and forwarded to the SQL tool. After the SQL tool finishes running, any output (including errors) is passed back to the LLM.
Ask A Question Tool: LLM is allowed to ask a follow-up question by using the <Q> </Q> tags which is then caught by the Tool Runner and forwarded to the Ask Q. tool. This sends the question to the user and returns the user’s response back to the LLM.
Answer the Question Tool: LLM can also decide that it is now able to answer the question that was asked. It can use the <R> </R> tags to wrap the answer. The tool then sends that to the user and gets their confirmation if the question has been answered or further help is needed. This is there to continue the conversation in case the user has follow-up questions.
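The dispatch logic described above can be sketched as a small lookup from tag to tool. This is a hypothetical outline, not the post’s actual code; the tool callables here are stand-ins.

```python
# Hypothetical sketch of the Tool Runner: route the LLM's chosen action to
# the matching tool and return the tool's output as feedback for the LLM.
def run_tool(tag, payload, tools):
    if tag not in tools:
        return f"Unknown action tag: {tag}"
    return tools[tag](payload)

# Stand-in tools; real ones would run SQL, prompt the user, or show the answer.
tools = {
    "SQL": lambda query: f"rows for: {query}",
    "Q":   lambda question: f"user's reply to: {question}",
    "R":   lambda answer: f"answer shown to user: {answer}",
}

feedback = run_tool("SQL", "select count(*) from books", tools)
# feedback is appended to the running history and sent back to the LLM
```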
A running history of the inputs and LLM responses is kept, which ensures the context is maintained across the conversation. This requires a model with a decent context size.
These types of systems that combine LLMs with tool use and some level of feedback to achieve one or more goals are referred to as Agentic Systems.
Models
I have used OpenAI GPT-4o-mini and Google Gemini-1.5-flash. Both models are quite user-friendly and cheap to use. For all my testing with OpenAI I spent $0.02, and because I stayed within the free tier I did not pay anything for the Google model.
GPT-4o-mini has a context size of 128,000 tokens (a bit more than a book) and an output size of 16,384 tokens (about a book chapter).
Gemini-1.5-flash has a context size of 1 million tokens and an output size of 8,192 tokens.
The Main Prompt
You are an AI assistant that can take one action at a time to help a user with a question: {question}
You cannot answer without querying for the data. You are allowed to answer without asking a follow up question.
You can only select ONE action per response. Do not create a SQL query if you do not have enough information.
Do not repeat the input provided by the user as part of the response.
Use tags below to indicate the action you want to take:
## Run SQL query within the following tags in one line: <SQL> </SQL>. You cannot execute a query - only generate it.
## Ask a follow up question to the user - surround question with tags in one line: <Q> </Q>.
## If you have enough data to give final answer to the user's question use tags in one line: <R> </R>.
Data schema:
{ddl}
The prompt above is the ‘driving’ prompt. This is the prompt that sets the context, task definition, and constraints in the beginning. The {question} placeholder is where the initial question is added. The data schema placeholder {ddl} is where the data definition is added.
Each tool has its own ‘mini’ prompt that is used to either trigger the user (biological LLM) or the LLM.
The prompt is the same for both the LLMs.
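Filling the placeholders is plain string formatting. A minimal sketch (the template is abbreviated and the schema line is a made-up example):

```python
# Minimal sketch of filling the driving prompt's placeholders. The template
# is abbreviated; only the placeholder mechanics are shown.
PROMPT_TEMPLATE = (
    "You are an AI assistant that can take one action at a time "
    "to help a user with a question: {question}\n"
    "Data schema:\n{ddl}"
)

def build_prompt(question, ddl):
    return PROMPT_TEMPLATE.format(question=question, ddl=ddl)

prompt = build_prompt(
    question="which user has purchased the most items?",
    ddl="create table users (user_id text primary key, name text);",
)
```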
A Conversation using Google Gemini-1.5-flash
The colours map to the paths in Figure 1.
The red arrows represent user input/responses to follow-up questions (Ask A Question Tool and Answer The Question Tool).
The blue arrows represent the response from the SQL Tool.
The orange arrows represent the response from the LLM.
Gemini-1.5-flash example
The interaction starts with the initial question from the questioner (green highlight): ‘which user has purchased the most items?’
The third LLM response (index: 2) answers the question after running a query (blue highlight), but only partially. The questioner wants the full ‘name’ and not just a ‘user-id’, which is clarified through a subsequent interaction.
The fifth LLM response (index: 4) provides an updated answer with the user’s ‘name’. This time the questioner asks a follow-up question about the number of items purchased by that user.
The sixth and final LLM response (index: 5) provides the final answer that includes the ‘name’ and the total number of items purchased.
A Conversation using GPT-4o-mini
GPT-4o-mini example
The above example using GPT-4o-mini is even more amazing. It shows fairly complex reasoning and follow-up questions to reduce subjectivity (e.g., defining ‘last month’). Unlike Gemini-1.5-flash, GPT-4o-mini recognises the schema embedded in the prompt and doesn’t ask a follow-up question about the correct table name. This suggests that GPT-4o-mini’s attention mechanism works better for this prompt.
Conclusion
What was surprising was how easy it was to build this app. The code can be found on my GitHub. There are two files: the first is the ‘main’ program that drives the LLMs, and the second is a client that helps me switch between OpenAI and Gemini.
The code is mainly plumbing with the LLMs taking a lot of the cognitive load.
Agent-like Aspects
We can see Agentic behaviour in the interaction through the LLM deciding to use different tools (e.g., run a SQL query, ask a question). The system is also able to adjust to changing questions. For example, when we ask a follow-up question, it does not ask again for the table names.
Some things to explore further:
How can we get to proper agent behaviour where the goal itself can change?
This would involve some level of ‘erasing’ history where our initial prompt sets out the ‘goal’ (answering questions in this case) to refocus the LLM towards a different task.
This different focus will need to be highlighted to the LLM (almost like a command prompt). We will probably see some sort of ‘command task’ mechanism coming into LLMs to help them focus.
We should be able to add different tools, make the LLM aware of them, and remove them as required. For example, if we wanted to transfer money, we could provide temporary access to a tool that takes some info and initiates the transfer, and then take it away (erase it from the LLM’s memory) to ensure it can’t be misused.
How can we improve the prompt so that the LLM doesn’t ask which table to use?
How can we test different database back ends and how can we direct the LLM to generate suitable SQL query?
How can we supplement user interaction with mechanisms like RAG to onboard new information (e.g., dynamically load schema)?
How can we package this? There are various frameworks such as LangChain, Llama Stack that we can explore.
When a large language model generates text one token at a time, it takes the original prompt and all the tokens generated so far as the input.
For this style of text generation the original prompt acts as the guide (as it is the only ‘constant’ being repeated at the start of every input). Each new token added interacts with the original prompt and the tokens added before it to decide future tokens.
Therefore, each added token has to be selected carefully if we are to generate useful output. That said, each token that is added does not hold the same weight in deciding future tokens. There are tokens that do not commit us to a particular approach in solving the problem and there are those that do. The ones that do are called ‘choice points’.
For example, if we are solving a mathematical problem the numbers and operators we use are the choice points. If we use the wrong operator (e.g., add instead of subtract) or the wrong digits then the output will not be fit for purpose. Compare this to any additional text we add to the calculation to explain the logic. As long as the correct choice points are used there is flexibility in how we structure the text.
Framing the Problem
Given the prompt has ‘p’ tokens.
Then to generate the first token: x_{p+1} = Sample(P(x_1, x_2, …, x_p))
Then to generate the second token: x_{p+2} = Sample(P(x_1, x_2, …, x_p, x_{p+1}))
And in general, to generate the i-th new token (i ≥ 1): x_{p+i} = Sample(P(x_1, x_2, …, x_p, x_{p+1}, …, x_{p+i-1}))
P represents the LLM that generates the probability distribution over the fixed vocabulary for the next token.
Sample represents the sampling function that samples from the probability distribution to select the next token. If we disable sampling we get greedy decoding (the highest scoring token is always picked).
Now the scores that we obtain from P are dependent on the initial prompt and subsequent tokens. In certain cases the previous tokens will give clear direction to the model and in other cases the direction will not be clear.
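The recurrence above can be mimicked with a toy stand-in for P: a lookup table from the tokens generated so far to a score for each candidate next token (the prompt is left implicit and the scores are made up). With sampling disabled, generation is just repeated argmax:

```python
# Toy stand-in for P: maps the tokens generated so far to made-up scores
# over candidate next tokens. A real LLM would compute these.
P = {
    (): {" ": 0.9, "1": 0.1},
    (" ",): {"1": 0.8, "0": 0.2},
    (" ", "1"): {"0": 0.7, "9": 0.3},
    (" ", "1", "0"): {".": 0.6, "0": 0.4},
}

def sample_greedy(dist):
    """Greedy decoding: always pick the highest scoring token."""
    return max(dist, key=dist.get)

def generate(steps):
    tokens = ()
    for _ in range(steps):
        tokens = tokens + (sample_greedy(P[tokens]),)
    return "".join(tokens)

text = generate(4)  # " 10."
```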
Interpreting Model Behaviour
Clear direction is represented by the separation between the highest and second highest scores. The bigger the gap, the clearer the signal generated by the model. This represents important tokens that do not have alternatives.
On the other hand, if the separation is not large then we can say the signal is weak. There could be several reasons for a weak signal: the model may not be clear about what comes next given the tokens presented so far, or there may be genuine flexibility in the next token.
This separation is critical: if we use sampling (i.e., don’t lock down the response) we could get different flows of text where the separation is weak, while high separation secures the tokens critical for a correct response.
An Example of a Choice Point
I am using Microsoft phi-3.5-mini-instruct running within my LLM Server harness. We collect the logits and scores at the generation of each token using a LogitsProcessor. We are not using sampling. We then identify the top two tokens in terms of the model score and compare the difference between them. Let us go through an example.
The prompt: What is 14+5*34?
The response (colour coded):
Example response to the prompt: What is 14+5*34?
I am calculating the percentage difference between the highest and second highest scoring tokens, then colour coding the generated tokens based on that difference using some simple CSS and HTML (output above).
Mapping colour coding to the percentage difference between the highest and second highest tokens.
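The gap calculation itself is simple. The sketch below is illustrative only: the score values are made up and the colour thresholds only loosely follow the mapping above; in the post the scores come from a LogitsProcessor at each generation step.

```python
# Percentage difference between the highest and second highest scores for a
# generation step. The scores here are made-up stand-ins for model logits.
def top_two_gap(scores):
    top, second = sorted(scores, reverse=True)[:2]
    return 100.0 * (top - second) / abs(top)

# Illustrative thresholds, loosely matching the colour mapping above.
def colour_for(gap_pct):
    if gap_pct < 4.0:
        return "green"   # near tie: generic / flexible token
    if gap_pct < 10.0:
        return "amber"   # moderate separation
    return "red"         # strong separation: likely a choice point

gap = colour_for(top_two_gap([12.0, 11.9, 3.0]))  # small gap -> "green"
```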
We can see the green bits are fairly generic. For example, the tokens ‘In this expression’ are green, which means the top two tokens were quite close to each other in terms of score (2-4%). These tokens are redundant: if we remove them the text still makes sense.
The ‘,’ after the expression has higher separation (10%), which makes sense as it is required by the tokens that precede it.
The real high-value tokens are those that represent the numbers and operators (e.g., ‘5 * 34 = 170’). Even there the choice points are quite clear. Once the model was given the token ‘5’ as part of the input, the ‘*’ and ‘34’ became critical for a correct calculation. It could have taken another approach: start with ‘34’ as the choice point, in which case ‘* 5’ becomes critical for the correct calculation.
Another Example
This time let us look at something subjective and not as clear cut as a mathematical calculation. Again we are not using sampling therefore we always pick the highest scoring token.
The prompt: What is the weather likely to be in December?
The response (colour coded):
Colour coding of the response tokens using the same mapping to percentage difference.
The starting token ‘The’ is not that critical, but once it is presented to the LLM the following tokens become clearer. The token ‘significantly’ is not that critical either (as we can see from the low difference). It could be removed without changing the accuracy or quality of the output (in my opinion).
In the first major bullet once it had started with the token ‘Nor’ the following tokens (‘thern’, ‘ ‘, ‘Hem’, ‘is’, ‘phere’) become super critical. Interestingly, once it had received the second major bullet (‘**’ without the leading space) as an input it became critical to have ‘Sou’ as the next token as ‘Nor’ was already present.
The Impact
The key question that comes next is: why is this analysis interesting?
In my opinion this is an interesting analysis because of two reasons:
It helps us evaluate the confidence level of the model very easily.
It can help us with fine-tuning especially when we want to promote the use of business/domain specific terms over generic ones.
There are even more exciting things waiting when we start sampling. Sampling allows us to build a picture of the ‘optimum’ search space for the response. The LLM and the sampler interact with each other and drive the generation across the ‘high quality’ sequences within that space. I will attempt to make my second post about this.
Swacch Bharat Abhiyan (SBA), started in 2014, is the ongoing Govt. of India campaign focusing on hygiene with one of the key targets of eliminating open defecation (especially in Rural India). It is based on the learnings from the UPA-era Nirmal Bharat Abhiyan (2009) which was not a success.
Building toilets all across India with a focus on villages is one of the key activities carried out as part of the campaign alongside spreading awareness about personal hygiene and changing human behaviour related to this basic human need.
There are two versions of SBA – one focussed on urban areas and the other on rural areas with different agencies involved in execution.
Why am I writing about it?
I am writing about this topic because when I was travelling in India I came across a toilet constructed as part of this program. A few days later a thought jumped into my head… where are the plumbers and where are the sewers? This made me want to understand more about this initiative.
Current State
The scale of the initiative is something to behold. From A-list celebrities to volunteers on the ground – one can say that an honest attempt has been made to stop open defecation. Given the scale of India and the time it takes to change behaviour once toilets are available no one expected a rapid end to open defecation.
What is being seen on the ground is that the large number of toilets built in rural areas since the program started (coverage up from 40% in 2014 to 90+% in 2019) are beginning to have a material impact on health and quality of life. This is especially noticeable in women and children.
In the dashboard above ODF Plus Village/District/State is one that has maintained their open defecation free (ODF) status and has arrangements for solid and liquid waste management.
ODF Plus Model Village/District/State is one that has gone beyond ODF Plus and is looking after general cleanliness measured using different attributes such as absence of plastic waste in public areas.
We can say with some confidence that there has been India-wide implementation of this program with the Central and State Governments working actively to improve access to toilets.
There is an excellent report here prepared by the Feedback Foundation that gives more details about the current state of waste management in India (beyond human waste).
The Questions
Now all this is well and good but an obvious question will be: if we are constructing toilets where there were no toilets before where is the waste going to go and who is going to take care of the toilets?
Where Does The Waste Go?
One would assume if a locality did not have toilets it would not have had a sewer network. As of 2020 just over 30% of India’s urban population has access to sewers [1]. This number is quite a bit lower when it comes to rural households. This is also one of the biggest gaps in the SBA mission.
The Urban Context
For example, within the urban context almost all the large cities in India have grown organically. This means the sewers and other utility networks have also grown organically with some sections still going back to the time of the Raj.
For SBA-Urban, if there is a sewer within 30m of the toilet being constructed [2], the toilet must be connected to it. This is easier said than done, especially when those 30m could be cutting through someone’s house or a busy street.
The Rural Context
For the rural version of SBA things get a bit crazier. Given that sewer systems are non-existent in villages, there is no easy way to transport the waste once you pull the flush. Also, given the effort required to build a sewer system, toilet building was not going to be paused to allow the infrastructure to catch up. Therefore, low-cost static solutions such as pits and septic systems were deployed.
But these pits/septic systems are not suitable for all parts of India given the geographical diversity. For example, in flat regions prone to heavy rainfall and flooding (such as Bihar) these pits can become flooded, rendering the toilets unusable and leading to groundwater contamination [3].
The pits have another small problem: they need to be emptied every so often and the sludge disposed of safely. This can be done manually or using specialised equipment.
This goes back to solid/liquid waste management point in the Plus/Plus Model categorisation. But how much of this management is monitored? How much of this is in private hands? What is happening to the extracted waste? This paper attempts to throw some light on this problem looking at the city of Bengaluru (Karnataka). https://iwaponline.com/washdev/article/9/2/338/66056/When-the-pits-fill-up-in-visible-flows-of-waste-in
What About The Plumbers?
Now since toilets were being built at breakneck speed (kudos to the Central and State Governments) who would look after them? Would you now have plumbers being trained up in every village to look after their brand new toilets?
Given that 50% of Indian households have access to piped water [4] I was also expecting plumbers to be involved in maintaining and building the plumbing.
All of this points to a massive skills gap as the quantity and quality of toilets improve across India. Where are the plumbers to meet the demand? Will we see reverse migration of plumbers from cities to smaller towns and villages?
I came across this interesting article from 2017 about training plumbers while you build toilets [5], and it makes an interesting point about the lack of standardisation in plumbing across India. To support the sanitation mission there must be more focus on standardisation of plumbing. This is not a ‘boring’ topic: if these millions of new toilets are repaired in non-standard ways, over time they will start to malfunction, with real consequences such as groundwater contamination at scale.
Conclusion
We can say only the first part of this grand and important task of bringing sanitation and clean water to the masses of India is nearing completion (sort of). Until every village has a proper sewer system and waste management/treatment facilities, the picture won’t be complete.
All of this work has to be led locally. It is only the locals who can monitor and feedback on the services. It is also the locals who would be impacted by illegal dumping of sludge.
A key question will also be about the backfilling of this work. How do we provide sewers to densely populated areas (especially those outside the major population centres)?
The output produced by Large language models seems even more incredible given the fact that they predict the next token (i.e., next bit of text) based on provided input (prompt) and the tokens generated so far.
The human brain does not work like this. We constantly evaluate, re-write, and re-think as we work. We also use hierarchical planning (e.g., breaking down a document into sections and sub-sections) and other methods to ‘think’ about text at various levels of detail.
LLMs on the other hand can only move forward one token at a time. There is no ‘reverse and rewrite’ mode (as yet).
So it would make sense for people to investigate generating a sequence of tokens rather than just the next token and to see if this has an impact on the quality of the output (which to be fair is already quite amazing).
First let us understand what we mean by generating multiple tokens. Assuming we have the following prompt: “Calculate 4+6”.
A single-token-prediction LLM will provide the following sequential output (hidden from us because of the convenience methods provided by the transformers library):
prompt -> <space>
prompt + <space> -> 1
prompt + <space>1 -> 0
prompt + <space>10 -> .
Final output: <space>10.
A multi-token prediction with length four might work as below (pipes separate tokens generated in parallel):
prompt -> <space>|1|0|.
Final output: <space>10.
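The difference between the two schemes is the number of forward passes, which the toy below makes explicit. The lookup table stands in for the model; the predictions themselves are hard-coded, so this only illustrates the pass counting, not real multi-head prediction.

```python
# Toy contrast of the two schemes above. NEXT stands in for the model: it
# maps the text so far to the next token. Predictions are hard-coded.
NEXT = {"Calculate 4+6": " ", "Calculate 4+6 ": "1",
        "Calculate 4+6 1": "0", "Calculate 4+6 10": "."}

def single_token_generate(prompt, steps):
    """One forward pass per generated token."""
    passes, text = 0, prompt
    for _ in range(steps):
        text += NEXT[text]
        passes += 1
    return text[len(prompt):], passes

def multi_token_generate(prompt, n):
    """An n-head model emits n tokens in one forward pass (simulated here)."""
    out, text = [], prompt
    for _ in range(n):  # conceptually parallel heads
        out.append(NEXT[text])
        text += out[-1]
    return "".join(out), 1

slow = single_token_generate("Calculate 4+6", 4)  # (" 10.", 4)
fast = multi_token_generate("Calculate 4+6", 4)   # (" 10.", 1)
```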
Figure 1 shows the architecture that includes 4 heads to generate n=4 next tokens in parallel.
It is immediately clear that if we have this kind of n-token generation we are going to get a massive speedup in inference at the cost of additional resources (for the extra heads). It will also make training harder and more resource intensive. To get back to next-token generation the model can discard heads 2, 3, and 4.
Figure 1: Model architecture for n=4 token generation (source: the paper linked above)
Why Does This Work?
In their results they show significant improvements on code-related tasks and other benchmarks. The key thing to understand is that when we train using multiple heads we are passing a lot of ‘hidden’ information about token choices.
As they state in the paper, and we humans know intuitively: “Not all token decisions are equally important for generating useful texts from language models”.
Imagine you are presented with some text to read. Your brain knows the words it can skip without impacting your understanding of the text. These words may impart stylistic variation (e.g., first-person voice vs third-person voice) but do not add much to your understanding. These can be thought of as style tokens.
Then there will be some words that will grab your attention because they define the message in the text. They call these tokens choice points; they define the semantic properties of the text.
In a similar way LLMs have to generate (one token at a time – without any reverse-rewrite) the sequence of style tokens and choice point tokens that provides a correct response to the prompt.
Now you can probably begin to understand how LLMs can go off-track. If mistakes are made in the early choice point tokens then it is difficult to recover (as it cannot go back and reselect a different choice point). If mistakes are made in style tokens then recovery may still be possible.
When we train using multiple heads (even if we discard them during inference) we are teaching the LLM about relationships between the next n tokens. And the key thing (as they show in the paper): during training, correlated choice-point tokens are weighted together with larger values than the style tokens.
In simple words, two tokens in the generation that are related and add meaning to the text are given more importance than those that do not.
This looking-ahead property during training ensures that we already have a (tiny) mental map of what the next few words may look like when we are generating.
But n = ?
A valid question would be: what should the value of n be? In other words, how big a mental map should be built during training?
One reasonable answer would be: it depends on the task – complex tasks may require bigger mental maps.
For most of the tasks they experiment with, n = 4 seems to be the optimal value (except one task where it is 6). Another interesting result is that performance seems to drop at higher values of n. This is also understandable: if we try to think too far ahead we may find many ‘next steps’ and find it difficult to select the correct one.
In the previous post I investigated a text-to-SQL LLM to enable ‘talking to your data’. In this post I am running the same use-case to test the SQL generation capability of GitHub Copilot (GHC) using the plugin for VS Code. This uses the personal edition of GitHub Copilot.
One drawback of using GHC is that we cannot integrate with the backend data source to execute the generated SQL because at the time of writing there is no API-based access available.
Instead if we use the VS Code plugin – we can include the schema definition and other artefacts as we chat with the model. The model takes these included artefacts into consideration when it responds.
This type of ‘copilot’ integration is suitable for an analyst tool aimed at both non-technical and technical users. The conversational interface can be used to outline the questions, which result in query generation and execution. The technical user can quickly build complex queries.
I expect this to be a generally available capability as all major database providers are integrating some kind of ‘copilot’ / gen-ai assisted querying tool in their product.
Creating the Synthetic Data
In the screenshot below we can see the VS Code GitHub Copilot chat plugin. On the left hand side we have the data schema file open (yellow bracket). This is automatically taken by GHC as a reference (red line) when we ask a question (above the red line).
In its response GHC first explains the required conditions (white bracket) and then it provides the SQL code that we can execute (green line).
The resulting query is absolutely correct.
Asking Questions
We ask the same questions as in the previous post (red lines in the screenshots). This is quite easy as the only reference we need is the data schema file as above which is taken automatically by GHC. The responses from GHC explain the logic behind the response, then show the generated SQL query, and finally the expected result. All the queries are correct!
Tell me the name of the author that has the most sales?
Tell me which books had the highest sales?
Tell me which author has written the most books?
Denormalising the Data
Data denormalisation is often required for analytics at scale to avoid expensive join operations and to keep the queries simple.
GitHub Copilot (GHC) is able to denormalise the schema quite easily as well. In the image below it provides the DDL for the denormalised version.
Denormalise the bookstore tables.
And the fun doesn’t stop there… if I ask GHC to provide a script to extract data from the normalised version of the tables to the denormalised table, it can do that as well!
Insert data into the denormalised structure.
Here is the result after I run the above insert statement. This is not the complete data because, as the astute reader will observe, in the denormalised table definition none of the values can be null. Therefore the insert statement has taken that into account and brought over only those books that have transactions against them (see the Transactions table below).
This post is about testing text-to-SQL generation capabilities with human in the loop. The broader use-case is around being able to talk with your data for generating insights.
The results will blow your mind!
So instead of creating dashboards around questions (e.g., which product had the highest sales in the previous 3 months?) you could ask any question and an automated system would go and query the right data to return a response in a human understandable way.
Model: For this post I am using the Defog text-to-SQL model (8 billion parameters) based on Llama-3 from Hugging Face: defog/llama-3-sqlcoder-8b. This is a relatively new model (mid-May 2024).
Infrastructure: the model runs locally and takes about 30 GB of RAM for the prompts I used. I use a SQLite 3 database as a backend.
Model Deployment: I have created a python flask web-server (read more here) that allows me to deploy any model once when the server starts and then access it via a web-endpoint.
Software: A python web-client for my model server that takes care of prompting, workflow, human-in-the-loop, and the integration with other components.
Figure 1 shows the flow of the main app and its interactions with the LLM and the SQL database.
Figure 1: Flow of the application (green arrow) and interactions with the LLM and Database.
Challenges to overcome: This is a complex use-case that covers synthetic data creation based on schema, Insert, and Select use-cases against a multi-table database. This requires the LLM to:
Understand the Data Definition (in SQL)
Correctly generate synthetic data based on the definition
Correctly generate the insert statements (in the correct order as the foreign keys have a not null clause)
Understand the user’s question
Convert the question into a valid SQL query with joins and other operations as required
Keep in mind that this is not a gazillion parameter GPT-4o or Gemini. This is not even the larger variant of the sqlcoder model with 34 billion parameters. This is the humble 8b parameter version!
The Schema: The schema simulates a book store with four tables – authors, books, inventory, and transactions. The relationships are as follows:
Author (1) -> (*) Books
Books (1) -> (1) Inventory
Books (1) -> (1) Transactions
Synthetic data is generated for all the tables based on directions given in the first Prompt (prompt 1).
The question (Which Author wrote the most books?) requires a join between the Author and Books tables.
The Stages
I am posting some snippets of the output here; the full code can be found on my GitHub repo.
Creating the Tables
A simple data definition (DDL) script that is run by the python client.
create table authors (author_id varchar(20) primary key not null, name text not null);
create table books (book_id varchar(20) primary key not null, name text not null, author_id varchar(20) not null, FOREIGN KEY(author_id) REFERENCES authors(author_id));
create table inventory (inv_id varchar(20) primary key not null, book_id varchar(20) not null, quantity int not null, FOREIGN KEY(book_id) REFERENCES books(book_id));
create table transactions(tx_id varchar(20) primary key not null, book_id varchar(20) not null, quantity_sold int not null, sale_date date not null, FOREIGN KEY(book_id) REFERENCES books(book_id));
We can see the output above with the full structure of the books table. I am using SQLite Studio only to view the structure and data; all changes are made programmatically. This still uses the standard Python sqlite libraries, so nothing magical yet.
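As a minimal sketch of the "run programmatically" step, the DDL above can be executed with Python's built-in sqlite3 module (abridged to two tables for brevity; use a file path instead of `:memory:` to persist the database):

```python
import sqlite3

# DDL statements from above (abridged to two tables for brevity)
DDL = """
create table authors (author_id varchar(20) primary key not null, name text not null);
create table books (book_id varchar(20) primary key not null, name text not null,
                    author_id varchar(20) not null,
                    FOREIGN KEY(author_id) REFERENCES authors(author_id));
"""

conn = sqlite3.connect(":memory:")  # e.g. sqlite3.connect("bookstore.db") to persist
conn.executescript(DDL)             # runs all statements in one go
conn.commit()

# Verify the tables exist
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['authors', 'books']
conn.close()
```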
Creating and Inserting Synthetic Data
For the next step, creating and inserting synthetic data, we bring in the LLM (sqlcoder). The prompt used can be seen in Figure 2. The ddl placeholder injects the create SQL statements above to provide information about the structure of the tables and the relationships between them. We also define the number of records per table and some other constraints.
Figure 2: Prompt for generating synthetic data.
The LLM returns the following response to the above prompt. We can see the constraints for number of records and number of books per author are respected.
Below we can see a screenshot from SQLiteStudio after the generated insert statements were run. The SQL generator has clearly picked out the schema of each table and the relationships between them just from the data definition.
The stage is now set for us to move to ‘talking to your data’.
Asking A Few Questions
The first question we ask is: Tell me which author has written the most books?
If you are familiar with the Spider 1.0 challenge from Yale (for text-to-SQL), this question is of medium complexity because the model has to correctly identify the need for a join and populate the correct table and field names given the schema.
The prompt template is given below. The placeholder question1 is where we substitute the question. In this case I created two separate prompts, but the structure is exactly the same and we can easily convert it into a function that takes a question as a string and outputs the filled template.
Figure 3: Prompt template for a question.
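That refactor can be sketched in a few lines. The real template is shown in Figure 3 as an image, so the abridged template text below is an illustrative stand-in; only the placeholder names (question1, ddl) follow the post:

```python
# Abridged, illustrative template; the real one (Figure 3) carries the full
# instructions and the complete DDL.
PROMPT_TEMPLATE = """### Task
Generate a SQL query to answer the following question: {question1}

### Database Schema
{ddl}
"""

def build_prompt(question1: str, ddl: str) -> str:
    """Fill the template with the user's question and the schema DDL."""
    return PROMPT_TEMPLATE.format(question1=question1, ddl=ddl)

ddl = "create table authors (author_id varchar(20) primary key not null, name text not null);"
print(build_prompt("Tell me which author has written the most books?", ddl))
```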
The entry below shows the SQL query generated by the model and the Query Output when that query is executed.
SQL Query:
WITH BookCounts AS (SELECT a.name, COUNT(b.book_id) AS book_count FROM authors a JOIN books b ON a.author_id = b.author_id GROUP BY a.name) SELECT name, book_count FROM BookCounts ORDER BY book_count DESC LIMIT 1;
Query Output: ('John Smith', 3)
The Query output gives the correct answer.
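The "execute the generated query and show the output" step can be sketched as follows. The query is the one generated above, executed verbatim; the two-author sample data is made up purely so the snippet is self-contained:

```python
import sqlite3

# Minimal in-memory database with illustrative data (names made up for the sketch)
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table authors (author_id varchar(20) primary key not null, name text not null);
create table books (book_id varchar(20) primary key not null, name text not null,
                    author_id varchar(20) not null,
                    FOREIGN KEY(author_id) REFERENCES authors(author_id));
insert into authors values ('a1', 'John Smith'), ('a2', 'Jane Doe');
insert into books values ('b1', 'Book One', 'a1'), ('b2', 'Book Two', 'a1'),
                         ('b3', 'Book Three', 'a1'), ('b4', 'Book Four', 'a2');
""")

# The SQL generated by the model, executed verbatim
query = (
    "WITH BookCounts AS (SELECT a.name, COUNT(b.book_id) AS book_count "
    "FROM authors a JOIN books b ON a.author_id = b.author_id GROUP BY a.name) "
    "SELECT name, book_count FROM BookCounts ORDER BY book_count DESC LIMIT 1;"
)
result = conn.execute(query).fetchall()
print(result)  # [('John Smith', 3)]
conn.close()
```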
The next question is: Tell me which books had the highest sales?
The key point about this query is that I am not asking about the single book that sold the most; I am asking about 'books' in the plural, and the model recognises this. We use the same template as above and just replace the text in the question1 placeholder with the above question.
SQL Query:
WITH TotalSales AS (SELECT b.book_id, SUM(t.quantity_sold) AS total_sales FROM transactions t JOIN books b ON t.book_id = b.book_id GROUP BY b.book_id) SELECT bs.book_id, bs.total_sales FROM TotalSales bs ORDER BY bs.total_sales DESC NULLS LAST LIMIT 5;
Query Output:
('b5', 6)
('b4', 4)
('b2', 3)
('b1', 2)
('b3', 1)
The query output once again gives the correct answer. The model has interpreted the need to give the top n results. In fact it has gone a step further and set n to 5, returning the top 5 books in terms of sales using a join between the transactions and books tables.
The final question is: Tell me which author has the most sales?
Looking at Spider 1.0 this would come in the Hard category.
This time I wanted to check a question that requires more than one join and a group operation. In this case you need to join the transactions and books tables, group the result, and then join it to the authors table to cater for the fact that a single author may have written multiple books.
SQL Query:
WITH TotalSales AS (SELECT b.author_id, SUM(t.quantity_sold) AS total_quantity_sold FROM transactions t JOIN books b ON t.book_id = b.book_id GROUP BY b.author_id), RankedAuthors AS (SELECT a.author_id, a.name, ts.total_quantity_sold, RANK() OVER (ORDER BY ts.total_quantity_sold DESC) AS sales_rank FROM authors a JOIN TotalSales ts ON a.author_id = ts.author_id) SELECT ra.name, ra.total_quantity_sold FROM RankedAuthors ra WHERE ra.sales_rank = 1;
Query Output:
('Jane Doe', 6)
Once again the result is correct! Amazing stuff as promised! I have tried this demo more than 20 times and it worked flawlessly.
One thing that really bugs me when running larger LLMs locally is the load time for each model. The larger the on-disk size of the model, the longer it takes for the model to be ready to query.
One solution is to run the model in a Jupyter notebook so we load it once and then query as many times as we want in subsequent code blocks. But this is not ideal because we can still have issues that require us to restart the notebook. These issues usually have little to do with the Gen AI model itself (which in most use-cases is treated as a closed box).
Another requirement is to be able to run your application and your LLM on different machines/instances. This can be a common requirement if you want to create a scalable Gen-AI based app. This is the main reason we have services like the Google Model Garden that don’t require you to ‘host’ models and provide API-based access instead.
To get around this I developed a web-server using python flask that can load any model accessible through the huggingface transformers library (using a plugin approach). This is mainly useful for testing and learning but can be made production ready with little effort.
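A minimal sketch of such a server, assuming a hypothetical `load_model` plugin hook and endpoint names of my own choosing (the real code is in the repo and actually loads a transformers model):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical plugin hook: in the real server this loads a model once at
# startup via the huggingface transformers library. Here a stub stands in
# for actual inference so the sketch runs anywhere.
def load_model(model_id: str):
    def generate(prompt: str) -> str:
        return f"[{model_id}] response to: {prompt}"  # stand-in for real inference
    return generate

MODEL_ID = "defog/llama-3-sqlcoder-8b"
model = load_model(MODEL_ID)  # loaded once when the server starts

@app.route("/info")
def info():
    # Prompting hints so a client can format prompts for this particular model
    return jsonify({"model": MODEL_ID, "prompt_style": "llama-3"})

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.get_json()["prompt"]
    return jsonify({"response": model(prompt)})

# Uncomment to serve the model over the network:
# app.run(host="0.0.0.0", port=5000)
```

Because the model is loaded once at startup, every subsequent request pays only the inference cost, not the load cost, and the client can live on a different machine.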
This setup shows the power of abstracting the LLM away from its use. The client can use the "/info" path to get prompting hints from the wrapped model, which helps it construct the prompt properly. This is required because each model can have its own prompting style.
Numbers are really beautiful and can often illuminate hidden corners of a complex situation. But before we start, I want to congratulate the citizens of India on an amazing election during record breaking heatwaves.
The result was a transition from Modi 2.0 to NDA 3.0 (the previous two NDA governments being under Mr Vajpayee), with the BJP emerging as the single largest party but falling short of an absolute majority by about 32 seats.
In many ways this is a good situation: the BJP is on relatively solid ground to support the coalition, while its clear dependency on the other members should encourage a more consultative approach to governance.
Figure 1: Comparing Total Parliamentary Seats in a state vs % of Seats won by NDA. Yellow line shows the half way mark (50% seats).
Back to the numbers…
I only want to show one graph to help explain why we have the given situation. Figure 1 above shows total seats in a State against the win % (in terms of number of seats won by the BJP).
To get absolute majority the states below the orange line had to be much closer to it or in case of Uttar Pradesh, above it.
What didn’t help was the fact that states in the centre and right (those with larger number of seats in the Parliament) ended up below the yellow line (50% seat win line).
The states below or on the yellow line but above the orange line needed to be above the yellow line. For example, this time BJP lost one seat in Jammu and Kashmir – which took the win % below the 50% line (2 out of 5 seats) but still kept it above the orange line (lower envelope).
I want to add one more piece of information to the graph: the trend in voter turnout. In Figure 2 (below) we see that the red-marked states (where voter turnout was lower than in 2019) are mostly above the 50% win rate (yellow) line. Those below the orange line are mostly green (voter turnout higher than in 2019).
Figure 2: Same plot with colour coded states based on voter turnout trend (green for increase, red for decrease, and black for no data).
My first hypothesis is that BJP voters were far more resilient and enthusiastic about voting.
My second hypothesis is that those supporting non-BJP candidates were not convinced about their chances in front of the shock and awe campaign undertaken by the BJP.
Model risk management (MRM) is a process the regulator expects all UK banks to implement if they use any kind of model-based decision making. It is not a 'new' process: it covers the development, verification, monitoring, and reporting of machine learning, statistical, and rule-based decision models. The end-to-end process is shown in the image below. There are additional components like the Model Inventory.
Why is this important?
This is important because MRM covers any type of AI model: it doesn't matter whether you build it or buy it, or whether it is a small XGBoost model or a Large Language Model. There are clear guidelines for what is expected as part of the modelling process, the validation process, and the risk control process.
I don’t think anyone knows how this applies to LLMs. If anyone has ideas I am happy to learn more.
This is especially true for third-party LLMs like Llama, Mistral, etc., where details of the modelling process are hidden behind intellectual property protection. Model validation and risk management are also new territory when it comes to Large Language Models, mainly because we cannot easily tie these models to a specific use-case. All you need to do to change the behaviour of an LLM is change the prompt: you can make it summarise text and then change the prompt to make it answer a question.
The Context
Let us look at the important definitions and concepts presented in SS1/23.
Who is covered by SS1/23?
All regulated UK banks and building societies.
Why do we need SS1/23?
To manage model risk: the risk to the consumer, the bank, and/or the wider economy arising from the inappropriate use of models or from errors made by the models.
What is covered by SS1/23?
Business decision making, risk management, and reporting – wide coverage.
What is a Model?
SS1/23 defines a model (Principle 1.1 [a]) on pages 8 and 9 as:
A model is a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into output. The definition of a model includes input data that are quantitative and / or qualitative in nature or expert judgement-based, and output that are quantitative or qualitative.
Why do I think this covers LLMs? Because Principle 1.1 [a] explicitly mentions qualitative, expert judgement-based inputs and outputs, and 'expert judgement-based' aligns nicely with a 'prompt' for the LLM.
Additionally, SS1/23 Principle 1.3 talks about risk-based model tiering in order to identify and classify the models that represent the highest risks as they go through the model lifecycle. Why do I think LLMs should be in the highest risk tier (i.e., warrant the most effort)? Let us look at the 'materiality' (impact) and 'complexity' ratings as described in SS1/23.
Materiality
Size-based measures based on exposure, number of customers, and market values and qualitative factors related to its importance to informing business decisions.
LLMs are expected to be used across a large number of related use-cases, either as a shared instance or as multiple instances of the same model, so the exposure could be widespread. Furthermore, the use of LLMs for summarisation, Q&A, and augmentation are all examples of 'informing' business decisions. For example, an agent could rely on the interpretation and summarisation provided by a Gen AI-based Q&A system.
Complexity
Various factors including: nature and quality of input data, choice of methodology, requirements and integrity of implementation, and the extensiveness of use of the model. But that is not all, when it comes to ‘newly advanced approaches or technologies’ (page 10), the complexity assessment may also consider risk factors related to: use of alternative and unstructured data, and measures of a model’s interpretability, explainability, transparency, and the potentials for designer or data bias to be present.
This last bit could have been written for LLMs: the use of unstructured data like text, images, and video. Evaluation of model interpretability, explainability, bias, etc. are all important unsolved issues with LLMs.
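SS1/23 does not prescribe a tiering formula, so purely as an illustration of the idea, a toy tiering rule combining materiality and complexity ratings might look like the sketch below. All ratings and thresholds here are hypothetical:

```python
# Illustrative only: SS1/23 does not prescribe a tiering formula.
# The rating scale and thresholds below are hypothetical.
def model_tier(materiality: str, complexity: str) -> int:
    """Map low/medium/high materiality and complexity ratings to a
    risk tier, where tier 1 is the highest risk (most validation effort)."""
    score = {"low": 0, "medium": 1, "high": 2}
    total = score[materiality] + score[complexity]
    if total >= 3:
        return 1  # highest risk: most intensive validation and review
    if total == 2:
        return 2
    return 3

# An LLM with widespread exposure and unstructured inputs would plausibly
# rate high on both axes and land in tier 1.
print(model_tier("high", "high"))  # 1
```

Under this toy rule, the widespread exposure (materiality) and unstructured data, interpretability, and bias concerns (complexity) discussed above would push an LLM straight into the highest tier.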
Finally, there is a requirement for independent assessment and reassessment of models. How would this work for LLMs? Who would do the assessment?
Work To Do Ahead!
Clearly a lot of work still needs to be done by all banks and regulated entities, and there are many critical unknowns in this area. How do we carry out independent reviews without clear guidelines? How do we verify the model build and ongoing testing of LLMs, especially those built by third parties?
Within a bank a key role will be that of the model owner, as they are accountable for model use; with third-party LLMs they will not be able to control the development of the model. Does this mean the end-state for LLMs is all about self-build, self-manage, and self-govern? There are two other roles, model users and model developers, and the risk starts to build up right from when we start planning the model (e.g., selecting datasets and the LLM task).
Finally, the proportionality of the MRM principles implies that the largest entities (banks) need to make the most effort here. Even a simple requirement like tracking models in a model inventory (Principle 1.2) is not straightforward for LLMs: the type of information you would need to track is quite different from that of a traditional machine learning model.