Generating Embeddings

Embeddings allow non-numeric data such as graphs and text to be passed into neural networks for use cases such as machine translation. Embeddings map the non-numeric data to a set of numbers that preserves some or all of the properties of the original data.

Let us start with a simple example: we want to translate a single word from one language to another (say English to German). The model will have three parts: an encoder that maps an English word to a point in an abstract numeric space, a decoder that maps a point in another numeric space to a German word, and a translator that maps the ‘English’ numeric space to the ‘German’ numeric space.

Encoder: [hat] -> [42]

Translation: [42] -> [100]

Decoder: [100] -> [Hut]

Remember the requirement that some or all of the properties must be preserved when moving to the numeric space? In the example above, for ’42’ to be an embedding and not just an encoding, it should reflect the meaning of the word ‘hat’. In a numeric space that has preserved the ‘meaning’, we expect the numbers for ‘hat’ and ‘cap’ to be close to each other and the number for the word ‘elephant’ to be far away.
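As a quick sketch of that intuition (the one-dimensional values below are made up for illustration, not real model output), ‘closeness’ is just a distance in the numeric space:

```python
# Hypothetical 1-D embeddings, purely to illustrate 'closeness' in the numeric space
embeddings = {"hat": 42.0, "cap": 45.0, "elephant": 910.0}

def distance(a: float, b: float) -> float:
    """Absolute distance between two 1-D embeddings."""
    return abs(a - b)

print(distance(embeddings["hat"], embeddings["cap"]))       # small -> similar meaning
print(distance(embeddings["hat"], embeddings["elephant"]))  # large -> unrelated meaning
```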

In the case of images, we would expect embeddings of different images of a hand-written ‘1’ to be close to each other and those for ‘2’ to be far away. Embeddings for ‘7’ might be closer, as a hand-written ‘7’ can look like a ‘1’.

Training Your Own Embedding Generator

Training your own embedding generator for the English language is relatively easy, as lots of free data is available. You can use the Book Corpus dataset (https://huggingface.co/datasets/bookcorpus), which consists of about 5 GB of extracts from books.
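A minimal sketch of pulling the corpus down with the Hugging Face datasets library (the article does not give the loading code, so the details below are my assumptions; depending on your datasets version you may also need to pass trust_remote_code=True):

```python
from datasets import load_dataset

# Download the Book Corpus dataset from the Hugging Face hub
# (several GB -- the first run will take a while).
dataset = load_dataset("bookcorpus", split="train")

print(dataset[0]["text"])  # one sentence-sized extract from a book
```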

To prepare the data to train the encoder-decoder network, you will need a tokenizer, which converts a string of words into a sequence of numeric tokens. I used the freely available GPT2 tokenizer from Hugging Face.
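A minimal sketch using the transformers library (the article names the GPT2 tokenizer but not the exact code, so treat this as an assumption about the setup):

```python
from transformers import GPT2TokenizerFast

# Load the pre-trained GPT-2 tokenizer from the Hugging Face hub
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

tokens = tokenizer.encode("I love pizza")
print(tokens)                    # a short list of integer token ids
print(tokenizer.decode(tokens))  # 'I love pizza'
```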

During the training process, the tokens are passed through the encoder-decoder network, where we attempt to:

  1. train the encoder part to generate an embedding that represents the input string.
  2. train the decoder part to ‘recreate’ the input token string from the generated embedding.

The resulting trained encoder-decoder pair can then be split (think of breaking a chocolate bar in half), and the encoder part alone used to generate embeddings.
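A rough sketch of that training loop, assuming PyTorch (the article does not name the framework) and using small placeholder layer sizes and random ‘token strings’ so it runs quickly; the real layer sizes are given further down:

```python
import torch
import torch.nn as nn

MAX_LEN = 16   # placeholder sequence length (the article uses 1024)
EMB_DIM = 2    # embedding dimension

# Placeholder encoder/decoder; see below for the actual layer sizes.
encoder = nn.Sequential(nn.Linear(MAX_LEN, 8), nn.ReLU(), nn.Linear(8, EMB_DIM))
decoder = nn.Sequential(nn.Linear(EMB_DIM, 8), nn.ReLU(), nn.Linear(8, MAX_LEN))

autoencoder = nn.Sequential(encoder, decoder)
optimiser = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy data: a batch of fixed-length token id sequences.
batch = torch.randint(0, 50257, (32, MAX_LEN)).float()

for step in range(100):
    reconstruction = autoencoder(batch)    # tokens -> embedding -> tokens
    loss = loss_fn(reconstruction, batch)  # how well was the input recreated?
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# After training, 'break the chocolate bar in half' and keep only the encoder.
embedding = encoder(batch)                 # shape: (32, 2)
print(embedding.shape)
```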

Let us look at an example.

The first step is to tokenize the input string:

‘I love pizza’ -> tokenize -> [45, 54, 200]

Then we train the encoder-decoder pair so that the input can be recreated by the decoder at the other end:

Input: [45, 54, 200] -> encoder -> [4, 6] -> decoder -> Output: [45, 54, 200]

In the example above the generated embedding has two dimensions (easier to visualise) with the value [4, 6]. When passed to the decoder, it should generate a token string that is the same as the input.

The neural network usually looks like the figure ‘8’, with the embedding output taken from the narrow ‘waist’ (the bottleneck layer).

For my example I used a maximum sentence length of 1024 tokens and an embedding dimension of 2 (to make it easier to visualise): an input layer of 1024, three encoder layers of 500, 200 and 2 units, and, mirroring that, decoder layers of 2, 200 and 500 units with a final output layer of 1024.
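A sketch of that layer stack, again assuming PyTorch; the article only gives the layer sizes, so the activation functions and other details here are my guesses:

```python
import torch.nn as nn

MAX_LEN = 1024   # max sentence length in tokens -> input/output layer size
EMB_DIM = 2      # embedding dimension, kept small for easy visualisation

encoder = nn.Sequential(
    nn.Linear(MAX_LEN, 500), nn.ReLU(),
    nn.Linear(500, 200), nn.ReLU(),
    nn.Linear(200, EMB_DIM),          # the narrow 'waist' of the figure-8
)
decoder = nn.Sequential(
    nn.Linear(EMB_DIM, 200), nn.ReLU(),
    nn.Linear(200, 500), nn.ReLU(),
    nn.Linear(500, MAX_LEN),          # recreate the 1024-token input string
)
```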

Training took 2 days to complete on my laptop.

Output Examples

Figure 1: Plotting single word embeddings.

Figure 1 shows single words put through the encoder and the resulting embeddings. Two-dimensional embeddings allow easy visualisation of the embedding points. We can see some clusters for countries and animals, and the merging of the two clusters around ‘Cat’ and ‘India’. Clearly some of the ‘meaning’ of the word is translating into the embedding, but it is far from perfect.
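A plot like Figure 1 takes only a few lines of matplotlib; the embedding values below are made up for illustration rather than taken from the trained model:

```python
import matplotlib.pyplot as plt

# Hypothetical 2-D embeddings for a few single words (illustrative values only)
points = {
    "Cat":      (4.1, 6.0),
    "Dog":      (4.3, 5.7),
    "India":    (4.0, 6.2),
    "England":  (7.9, 2.1),
    "Elephant": (1.2, 9.5),
}

fig, ax = plt.subplots()
for word, (x, y) in points.items():
    ax.scatter(x, y)
    ax.annotate(word, (x, y))
ax.set_xlabel("embedding dimension 1")
ax.set_ylabel("embedding dimension 2")
plt.show()
```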

Figure 2: Plotting phrase embeddings.

In Figure 2 we can see the output for phrase embeddings. We see a cluster of phrases around ‘England’, but then an outlier that talks about London being the capital of England. Other clusters can be seen around animal phrases and country phrases. This again is far from perfect.

The main thing I want to highlight is that, with minimal effort spent preparing the data and training the encoder model, it has extracted some concept of ‘meaning’, and this property carries over into the embedding space. Similar phrases show up closer to each other in the numeric embedding space. This is after two days of training a relatively small model on a low-powered laptop.

A question that arises is ‘how did it learn the meaning?’. The answer is relatively straightforward: it did not learn the actual meaning. It learned the relationships between words based on their co-occurrence in the training data set. This is why large language models cannot be used in a raw state: the data they are trained on may contain all kinds of skewed subsets that lead to problems such as gender bias.

Imagine training larger models on more powerful infrastructure. Well, you don’t need to imagine: we can already see the power of GPT-4, which took months to train on highly specialised hardware (Nvidia GPUs).