A short series on RAG, one of the most popular methods for working with LLMs whilst minimising the impact of training bias and hallucinations:
- This post: the need for RAG
- Post 2: implementing RAG
- Post 3: measuring benefits of RAG
- Post 4: implementing RAG using LangChain
Retrieval Augmented Generation (RAG) supplements the information stored within a large language model (LLM) with information retrieved from an external source, so that questions can be answered with improved accuracy and with references for validation.
RAG is especially important when we want to answer questions accurately without retraining the LLM on the dataset of interest. This decoupling from retraining allows us to update data independently of the model.
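To make the idea concrete, here is a minimal sketch of the RAG flow in Python. The word-overlap retriever, the prompt wording, and the function names are illustrative assumptions rather than any specific library's API; a proper implementation is the subject of post 2.

```python
# A minimal sketch of the RAG flow, using only the standard library.
# The retriever is a toy word-overlap ranker and the prompt template is made up;
# neither represents a specific library or product.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question and keep the best top_k."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages in front of the question as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# The augmented prompt is what you would send to whichever LLM you are using.
guides = [
    "Product guides and FAQs live in an external store that can be updated at any time.",
    "Refunds are available within 30 days of purchase.",
]
question = "Where does the product information come from?"
print(build_prompt(question, retrieve(question, guides)))
```

Because the documents live outside the model, updating the knowledge base is a matter of editing the store rather than retraining anything, which is exactly the decoupling described above.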
The Source of LLM’s Knowledge
Let us unpack the statement above about how LLMs are trained. Generally, training involves multiple stages, with a different type of data used at each stage.
The first stage involves large datasets consisting of Wikipedia, books, manuals, and anything else that can be found on the Internet. This is like parents telling their children to read a wide variety of text to improve their knowledge, vocabulary, and comprehension. This usually results in a generic 'language model', also known as a foundation LLM. The main problem with such a model is that while it knows a lot about constructing language, it has not learnt how to, say, construct language to support a chat.
The second and subsequent stages involve fine-tuning a foundation model to specialise it for specific tasks, such as having a 'chat' (e.g., ChatGPT) or helping search for information. The data used during fine-tuning is quite different from what is used in the first stage: it includes human feedback and curated datasets. This is similar to a college student reading a coursebook: they already know the structures and form of the language. The data is usually specialised to reflect the tasks the model is being fine-tuned for. A lot of the 'understanding' of what makes a proper response is learnt at this stage.
Key Concept: As a finished product, the LLM therefore contains patterns and data from the following sources: the Internet (including books, Wikipedia, and manuals); human feedback on its responses; and specialised curated datasets suited to the tasks it was developed for.
The Gap
The LLM as a finished product is quite useful. It can perform many tasks, using its capability for language formation to respond with the correct syntax and format. For example, if you ask GPT-4 to write a poem, it will use the correct wording and format.
The Scenario
But imagine you were running a company that sold electronic gadgets. Customer and colleague feedback indicates that even though you have a set of Frequently Asked Questions and well-structured product guides, it still takes time to get to specific pieces of information. You discover customers are calling agents to have parts of the guides summarised and explained. This process is far from simple and often leads to confusion, as each agent summarises in a different way and the less experienced agents sometimes provide erroneous information.
Obviously, you get very excited about the power of LLMs and build a business case for how you can save agent time, enable customer self-serve, and increase first-time-correct by using LLMs to explain, summarise, and guide the agents and the customers. You set up a proof of concept to test the technology and create an environment with your favourite cloud provider to try out some LLMs. This is where you hit the first issue. While the LLMs you are trying have generic knowledge about electronics and can answer common questions, they cannot answer specific questions, because that data either did not form part of the training set, was not present in sufficient quantity, or did not exist on the public Internet when training was carried out.
As an example: say the customer wants information about a recently released laptop. The LLM will be able to answer questions like 'what is USB-C?' or 'how do I clean a laptop screen?', as there is a lot of information on the Internet about these topics. But if we ask a specific question like 'what is the expected battery life of this laptop?', the LLM will not be able to answer, since this laptop model did not exist when the LLM was trained. You want the LLM to say 'sorry, I don't know', but that gap is difficult to identify unless you can tabulate what the model does know.
The worst case is when the LLM has partial or related data, say for the laptop brand or another laptop model, and uses that to create a response that sounds genuine but is incorrect or baseless. This only adds to the frustration of customers and agents.
Key Concept: The LLM does not have enough knowledge to answer the question. In other cases it may have partial knowledge, incorrect knowledge, or it may 'think' it has enough based on links to unrelated knowledge items. But because the LLM is compelled to generate, it will produce some response regardless. This gap between what it needs to know to respond correctly and what it actually knows is where RAG comes in.
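As a rough sketch of how retrieval closes this gap, the snippet below places a product note into the prompt for the battery-life question, so the LLM can either ground its answer in the guide or admit it does not know. The product name and figures here are hypothetical, invented purely for illustration.

```python
# Illustrative only: the product name "X14" and its spec text are made up.
product_note = (
    "X14 (2024): 14-inch laptop, 65 Wh battery, "
    "rated for up to 12 hours of video playback."
)
question = "What is the expected battery life of the X14 laptop?"

# Without retrieval, only the bare question reaches the LLM, which has never seen this
# model and may guess from similar laptops. With retrieval, the relevant note from the
# product guide is placed in the prompt, so the answer can be grounded in (and checked
# against) the guide itself.
bare_prompt = question
grounded_prompt = (
    "Answer using only the context below; say 'I don't know' if it is not covered.\n"
    f"Context: {product_note}\n"
    f"Question: {question}"
)
print(grounded_prompt)
```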