PageIndex Instead of Vectorising

The Retrieval part of RAG requires topic specific data to be retrieved from a knowledge store based on the user’s query.

The more focused the retrieval, the better grounded the generation will be.

Hidden Complexity of Using Vectors

In Vectorised RAG the retrieval part uses a vector database to store chunks of knowledge and a vector search query to retrieve relevant chunks for the model or agent.

Using a vector database is where the hidden complexity lies, and this is best explained with a story.

When I started my Masters in 2002 we had a module called Research Methods. I was not aware of any of the core concepts and found the lectures difficult to follow. Search engines were not as powerful or AI-enabled as today. One had to expand one's vocabulary with appropriate technical and domain-specific terms to retrieve knowledge and then discover related concepts. Rinse and repeat until you have mapped out the topic.

The above story describes how I created a mental index of this new topic ‘Research Methods’ and used that to ground my coursework.

For vectorised RAG we depend on what are known as embedding models to convert text containing knowledge into a vector which is then stored in a vector database. The hope is that the model understands the concepts within that topic and relationships between them. This is implemented, during model training, by exposing the model to real text so that it can learn the statistical patterns within the training text.

Retrieval uses the same embedding model to create the query vector from user’s query text to find ‘related’ knowledge from the vector database.
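As a rough illustration of this retrieve-by-similarity step, here is a minimal sketch. The document titles, the three dimensions, and every vector value below are invented for illustration; in a real pipeline these vectors would come from an embedding model, not be hand-written.

```python
import math

# Toy 3-dimensional "embeddings" standing in for a real embedding model's
# output. The vectors and dimensions are made up purely for illustration.
DOC_VECTORS = {
    "ISA allowance rules": [0.9, 0.1, 0.2],
    "Pension contribution limits": [0.2, 0.8, 0.3],
    "Chocolate cake recipe": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, doc_vectors, top_k=1):
    """Rank stored chunks by similarity to the query vector."""
    ranked = sorted(
        doc_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:top_k]]

# A query like "How much can I put in my ISA?" would be embedded with the
# SAME model as the documents; here we simply fake its vector.
query = [0.85, 0.15, 0.1]
print(retrieve(query, DOC_VECTORS))  # the ISA chunk ranks first
```

The crucial point the sketch hides is that everything depends on where the embedding model places those vectors in the first place, which is exactly where the hidden complexity lives.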

Now we come to the critical part:

Just by observing statistical patterns within the training text I can build out a concept space. But there is no way to guarantee that I am a master of that concept space. We humans cannot squash all the concepts into our brains (e.g., an aerospace engineer will have concepts and relationships in their head that a marine engineer may not have, even though there will be a high degree of overlap in the concept space).

It is highly unlikely that a relatively small model can contain all the concepts at a level of detail to be useful. Therefore, the vectorisation of different documents may be less or more effective based on how deeply the embedding model was trained on that topic.

For example, if the embedding model was trained without knowledge of Charles Dickens or his works then from a vector creation point of view ‘Bleak House’ will be closer to ‘sad house’ than ‘Charles Dickens’. In fact the concept of Charles Dickens may not be in the embedding model at all and therefore the vectors produced may be based on plain English concept of the word.

Another famous example: given the query ‘apple’ – which of the words in the list [‘orange’, ‘iphone’] should the resulting vector be closest to? Depending on which embedding model is used to vectorise the query and the words in the list, we may see different results.

Impact of training bias: if the model has been trained on knowledge about Apple products like the iPhone then we expect the query to be close to ‘iphone’. If the model has been trained on basic English language then the query may appear closer to ‘orange’. Furthermore, ‘iphone’ and ‘orange’ should be far apart from each other unless the model knows about the new ‘orange iPhone’.
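This training-bias effect can be mimicked with two hypothetical "models", each just a hand-written vector table. All vectors below are invented to make the point; no real model produced them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Hypothetical model A: trained mostly on product/tech text,
# so 'apple' lands near 'iphone'. Vectors are invented for illustration.
TECH_MODEL = {
    "apple":  [0.9, 0.2],
    "iphone": [0.85, 0.25],
    "orange": [0.1, 0.9],
}

# Hypothetical model B: trained mostly on general English / food text,
# so 'apple' lands near 'orange'.
FOOD_MODEL = {
    "apple":  [0.2, 0.9],
    "iphone": [0.9, 0.1],
    "orange": [0.25, 0.85],
}

def nearest(model, query, candidates):
    """Return the candidate word whose vector is closest to the query's."""
    return max(candidates, key=lambda w: cosine(model[query], model[w]))

print(nearest(TECH_MODEL, "apple", ["orange", "iphone"]))  # iphone
print(nearest(FOOD_MODEL, "apple", ["orange", "iphone"]))  # orange
```

Same query, same candidate words, opposite answers – purely because of what each model was trained on.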

Just like my Research Methods example above how you build out a concept map of a topic depends on what sources you have included in your learning.

This is the hidden complexity of vectorisation as an approach. You are completely dependent on the embedding model to help map out the topic space and then retrieve specific concepts.

General purpose embedding models will not capture domain or topic specific concepts.

That is what links the embedding model to the use-case. That is why for medical AI applications you get specialised Medical embedding models (like MedEmbed-small-v0.1).

These embedding models have a time-horizon, so if a concept did not exist when they were trained, they will not know about it or how to relate it to other concepts. See the example of differences in concept resolution below using OpenAI’s ‘text-embedding-3-small’ (Figure 1a) and ‘text-embedding-3-large’ (Figure 1b). Clearly the ‘large’ model has an accurate map of the Savings Account space for the UK.

Figure 1a: Note the significant difference for Individual Savings Account against ISAs and the general term Savings Account. All the ISAs are Individual Savings Accounts which are also Savings Accounts.
Figure 1b: Note the confusion is not seen in the ‘large’ embedding model as all ISAs and the generic term Savings Account are now quite close to Individual Savings Account.

Embedding models cannot explicitly be given context for a comparison.

Another important point from Figures 1a and 1b above is that these distance measures are fixed. We cannot provide context to the embedding model. For example, if the context is tax regulation then the similarity scores would be very different than if the context was risk vs reward.

Think of the context as taking the existing high-dimensional vector space in which all these concepts sit and transforming it to bring certain concepts closer and teasing others apart.
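That transformation idea can be sketched very crudely: treat "context" as a diagonal reweighting of the dimensions. The two products, the two dimensions, and all values below are invented for illustration – real embedding spaces have hundreds of opaque dimensions and no such clean labels.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up 2-d "embeddings": dimension 0 loosely = tax treatment,
# dimension 1 loosely = investment risk. Invented purely for illustration.
EMBEDDINGS = {
    "Cash ISA":            [0.9, 0.1],
    "Stocks & Shares ISA": [0.85, 0.8],
}

def reweight(vec, weights):
    """'Context' as a diagonal transformation stretching/shrinking dimensions."""
    return [v * w for v, w in zip(vec, weights)]

TAX_CONTEXT  = [1.0, 0.1]   # emphasise the tax dimension
RISK_CONTEXT = [0.1, 1.0]   # emphasise the risk dimension

def similarity_in_context(a, b, context):
    return cosine(reweight(EMBEDDINGS[a], context),
                  reweight(EMBEDDINGS[b], context))

# Under a tax lens the two ISAs look almost identical;
# under a risk lens they drift apart.
print(similarity_in_context("Cash ISA", "Stocks & Shares ISA", TAX_CONTEXT))
print(similarity_in_context("Cash ISA", "Stocks & Shares ISA", RISK_CONTEXT))
```

A fixed embedding model gives you exactly one of these lenses, baked in at training time; you cannot switch lens per query.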

Now a question for you: have you gone through an embedding model evaluation for your use-case, or just picked a default OpenAI or Google embedding model and run with that?

All this without even discussing the many variants of how vector results are evaluated and the different styles of vector creation pipelines and the impact of design choices (e.g., chunking strategy).

Enter PageIndex

PageIndex attempts to free the burdened developer from all of this: embedding model selection, tuning vectorisation pipelines, thinking about chunking strategies and spinning up new bits of software like vector databases.

It is based on a simple idea: let the LLM figure out what is in a document by drawing up a table of contents as a lightweight index tree called a PageIndex. Then use this content index to identify and extract the relevant sections/concepts from the document, and use those for generating the response in a ‘reasoning loop’ (see Figure 2).

Figure 2: Overview of the PageIndex-based RAG (Source: https://pageindex.ai/blog/pageindex-intro)

Here you are not spinning up a vector db or bouncing between vectors and plain text at the mercy of the embedding model being used. Everything stays plain text and the SMEs can go and evaluate the PageIndex created, tweak it if needed or even create multiple versions for different use-cases.

As a concept it is very interesting. The project can be found here: https://github.com/VectifyAI/PageIndex

{
  "node_id": "0006",
  "title": "Financial Stability",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "sub_nodes": [
    {
      "node_id": "0007",
      "title": "Monitoring Financial Vulnerabilities",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "node_id": "0008",
      "title": "Domestic and International Cooperation and Coordination",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}

The snippet above shows an example of a PageIndex.

Note that the index can be at any level: document, pages, paragraphs etc. and contains metadata to provide additional richness and flexibility to this process (e.g., no need to commit to a chunking strategy).
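To make the retrieval side concrete, here is a minimal sketch of tree-search over a PageIndex-style node tree. In the real project an LLM reads the titles and summaries and decides which branch to descend; below, a naive keyword-overlap score stands in for that LLM judgement, purely for illustration. The tree mirrors the snippet above, with summaries paraphrased.

```python
# A PageIndex-style tree (summaries paraphrased from the snippet above).
PAGE_INDEX = {
    "node_id": "0006",
    "title": "Financial Stability",
    "summary": "The Federal Reserve monitors risks to the financial system.",
    "sub_nodes": [
        {"node_id": "0007",
         "title": "Monitoring Financial Vulnerabilities",
         "summary": "Monitoring of leverage and funding risks.",
         "sub_nodes": []},
        {"node_id": "0008",
         "title": "Domestic and International Cooperation and Coordination",
         "summary": "Collaboration with other regulatory bodies in 2023.",
         "sub_nodes": []},
    ],
}

def score(node, query):
    """Crude stand-in for the LLM's relevance judgement: keyword overlap."""
    words = set(query.lower().split())
    text = (node["title"] + " " + node["summary"]).lower()
    return sum(1 for w in words if w in text)

def select_node(node, query):
    """Descend the tree, following the best-scoring child at each level."""
    best = node
    while best.get("sub_nodes"):
        child = max(best["sub_nodes"], key=lambda n: score(n, query))
        if score(child, query) == 0:
            break  # no child looks relevant; stop at the current node
        best = child
    return best

result = select_node(PAGE_INDEX, "monitor financial vulnerabilities")
print(result["node_id"])  # 0007
```

The selected node's `start_index`/`end_index` would then tell you exactly which pages to pull into the generation prompt – no chunking decision was ever made.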

The way I like to think of PageIndex RAG vs Vectorised RAG is:

  • With PageIndex, you are using notes on a topic prepared by your teacher.
  • With vectorisation, you are using notes on a topic prepared by your friend, based on what the teacher taught in the classroom.

Now this method has its own trade-offs which are critical to understand before we jump on this based on LinkedIn hype.

Firstly, it depends on an ‘agentic’ style of generation, which means multiple loops / LLM calls as compared to a single lightweight embedding model call. The cost implications can be massive, especially when you need to re-index.
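A back-of-envelope comparison makes the cost gap vivid. Every price and loop count below is an ASSUMED placeholder, not real provider pricing; substitute your own numbers before drawing conclusions.

```python
# ASSUMED placeholder costs per call -- not real provider pricing.
EMBED_COST_PER_CALL = 0.00002   # one cheap embedding call
LLM_COST_PER_CALL = 0.01        # one reasoning-LLM call over titles/summaries

def vector_rag_retrieval_cost(n_queries):
    """Vector RAG: one embedding call per query."""
    return n_queries * EMBED_COST_PER_CALL

def pageindex_rag_retrieval_cost(n_queries, loops_per_query=4):
    """PageIndex RAG: several LLM calls per query (assumed 4 loops)."""
    return n_queries * loops_per_query * LLM_COST_PER_CALL

print(vector_rag_retrieval_cost(10_000))     # roughly 0.2
print(pageindex_rag_retrieval_cost(10_000))  # roughly 400
```

Even with these made-up numbers the ratio is three orders of magnitude, which is why the agentic loop deserves a cost model before adoption.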

Secondly, there is still a dependency on some kind of AI model. We have expanded the concept space by moving from, say, a 50M–8B parameter embedding model to an LLM with more than 20B parameters, a mixture of experts, a reasoning mode and many other tweaks. We can provide context to build out different indexes via the PageIndex build prompt. That said, we will still be affected by training bias, where different models may produce different indexes.

Thirdly, re-indexing becomes unpredictable. Earlier you could train lots of small embedding models based on the topic landscape of your organisation and domain, and store them as your own assets. Now you are dependent on the major model providers like OpenAI, Anthropic, and Google. Say you built your PageIndex from v1 of the documents of interest using Gemini Flash 2.0. Six months later some documents are updated to v2, so the PageIndex must be rebuilt – but Gemini Flash 2.0 has since been deprecated and you are forced to move to Gemini Flash 2.5. The good thing about the PageIndex concept here is that you can provide the existing index (based on the v1 documents) along with the v2 documents and instruct the model to update rather than rebuild from scratch. Again, this increases our dependence on the model, requiring SME validation of the PageIndex at each stage.
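One part of "update rather than rebuild" needs no LLM at all: working out which index nodes a document change actually touches, using the `start_index`/`end_index` page ranges already stored in each node. The function name and the changed-pages representation below are hypothetical, a sketch of the idea rather than anything from the PageIndex project.

```python
def nodes_needing_refresh(index_nodes, changed_pages):
    """Return node_ids whose page range overlaps any changed page.

    Only these nodes would be sent back to the LLM for re-summarising;
    untouched branches of the index are kept as-is.
    """
    changed = set(changed_pages)
    stale = []
    for node in index_nodes:
        pages = set(range(node["start_index"], node["end_index"] + 1))
        if pages & changed:
            stale.append(node["node_id"])
    return stale

# Page ranges taken from the PageIndex snippet shown earlier.
nodes = [
    {"node_id": "0007", "start_index": 22, "end_index": 28},
    {"node_id": "0008", "start_index": 28, "end_index": 31},
]
print(nodes_needing_refresh(nodes, changed_pages=[29, 30]))  # ['0008']
```

Scoping the LLM work this way limits both the cost of re-indexing and the surface area the SMEs must re-validate.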

Fourth, PageIndex turns retrieval into a chain of steps. A vector index is a one-step process; PageIndex makes it an agentic loop where the depth of retrieval is dynamic. This can increase latency and make the process less auditable and repeatable. At the same time, it makes retrieval tunable for different topic areas: a complex query can be supported by additional retrieval loops, whereas a simple query can be answered one-shot. Vector retrieval is by default one-shot.