Generative AI

Advice in a rapidly changing world

Still under development

A potentially useful example

If you are having trouble getting your head around the idea of vectors and a multi-dimensional mathematical space, then perhaps this example will help. It demonstrates how we can position tokens in this space. To make things even easier, we will cheat and build each vector from numbers whose meaning we already know. This differs from the way a transformer works: in a transformer, these numbers start as random values that are then tuned over repeated training passes, so just by looking at them we can’t tell what they actually correspond to. More on that later!

Imagine we are writing a story about travelling. The story begins:

I awoke excited for the day ahead. After all, it is not every day that you greet the morning framed through the window of a castle! Pausing only for breakfast, I wanted to get out and explore the city of

What could be the next word?

Let’s assume in our model we have a range of tokens that represent cities. It seems likely that the city’s name could be the next word. The question is which one? To keep things simple we will focus on just three possible tokens: Durham, Edinburgh and Sunderland.

Remember that the model associates each of these tokens with a vector (a set of numbers). In this example each vector contains just four values.

Durham: [54.77676, -1.57566, 45696, 119]
Edinburgh: [55.95206, -3.19648, 435791, 47]
Sunderland: [54.90465, -1.38222, 177965, 60]

So for the token ‘Durham’ we have the four numbers 54.77676, -1.57566, 45696 and 119.

We can use these numbers to plot each token as a point. Tokens that plot close together are deemed similar; those further apart, different. If we see repeated patterns or groupings, we might even be able to discern opposites.
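To make "close together" concrete, here is a short sketch that measures the straight-line (Euclidean) distance between each pair of city vectors. The distance measure is my choice for illustration; real transformers more often compare vectors with dot products, but the intuition is the same.

```python
import math

# The four-number vectors from the example above.
vectors = {
    "Durham":     [54.77676, -1.57566, 45696, 119],
    "Edinburgh":  [55.95206, -3.19648, 435791, 47],
    "Sunderland": [54.90465, -1.38222, 177965, 60],
}

def euclidean(a, b):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Compare every pair: the smallest distance marks the most similar pair.
for a, b in [("Durham", "Sunderland"), ("Durham", "Edinburgh"),
             ("Sunderland", "Edinburgh")]:
    print(a, "-", b, round(euclidean(vectors[a], vectors[b])))
```

Running this shows Durham and Sunderland are the nearest pair, matching the plots described below.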

We can start with the simplest approach – plotting the position of each token using a single number. If we take the third number in the vector (45696 for Durham, 435791 for Edinburgh, 177965 for Sunderland) we can generate a line like this:

A one-dimensional line plot with Durham and Sunderland near the left and Edinburgh at the far right. The plot is labelled Population.

We can see that along this axis (the line), the cities of Durham and Sunderland are relatively close together, near the start of the line, whilst Edinburgh is much further to the right. These values are actually recent population estimates, and reflect the fact that the population of Edinburgh is almost 10 times bigger than that of Durham and more than twice that of Sunderland.

Don’t focus too much on what the measure actually represents, just that we can use it to learn something about the tokens. Along this axis, Durham and Sunderland are relatively similar, Edinburgh is the odd one out.
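The one-dimensional comparison above can be sketched in a few lines: order the tokens along the single axis and compare the gaps between them.

```python
# Position each token along a single axis using only the third number
# in its vector (the population estimate from the example).
populations = {"Durham": 45696, "Edinburgh": 435791, "Sunderland": 177965}

# Order the tokens from left to right along the line.
ordered = sorted(populations, key=populations.get)
print(ordered)

# The gap between neighbours tells us which pair is more similar
# along this axis.
gap_ds = abs(populations["Durham"] - populations["Sunderland"])
gap_se = abs(populations["Sunderland"] - populations["Edinburgh"])
print(gap_ds < gap_se)  # Durham and Sunderland are the nearer neighbours
```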

We can then move to two dimensions and plot a point for each token using two of its numbers, to give us an x-y scatterplot:

A scatterplot. Easting values along the x axis (from left to right), northing along the y (up and down). Edinburgh plots in the top right corner, Sunderland and Durham are close together in the bottom left.

From this visualisation we can see that the words Durham and Sunderland again lie much closer to each other than to Edinburgh.

In this example the first two numbers are the latitude and longitude of each city (expressed in decimal degrees), but that doesn’t actually matter.
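The two-dimensional version of the distance check uses just the first two numbers of each vector. This sketch uses the same latitude/longitude pairs as the scatterplot:

```python
import math

# The first two numbers of each vector: latitude and longitude
# in decimal degrees.
coords = {
    "Durham":     (54.77676, -1.57566),
    "Edinburgh":  (55.95206, -3.19648),
    "Sunderland": (54.90465, -1.38222),
}

def dist2d(a, b):
    """Distance between two points on the x-y scatterplot."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

print(round(dist2d(coords["Durham"], coords["Sunderland"]), 2))  # ~0.23
print(round(dist2d(coords["Durham"], coords["Edinburgh"]), 2))   # ~2.0
```

Durham and Sunderland sit roughly a tenth as far apart as either does from Edinburgh, which is exactly what the bottom-left cluster in the scatterplot shows.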

We can then take a third number from each vector to create a 3D plot, which has three axes.

In this 3-dimensional model, we can see that each token is distinct, but that there are some similarities if we only look in a certain direction. As before, Durham and Sunderland are both to the left of the graph, while Edinburgh is much further right; this reflects differences in the values along the x-axis. But moving from the bottom to the top of the diagram (up the z-axis), we can see that this time Sunderland and Edinburgh lie at around the same height, with Durham much higher up the graph. The z-value reflects the mean height of each city above sea level, so we can see that the two coastal cities have similar values in this space.

Mathematically, it is possible to add more dimensions to these diagrams (extra axes) and plot the position of each token using all four co-ordinates. However, a 4D diagram can’t easily be drawn, since we can only visualise three dimensions at once. In a real transformer model each vector contains thousands of numbers, creating a mind-bending multi-dimensional space. We won’t even attempt to draw that!
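Even though we can’t draw the full four-dimensional space, the maths works unchanged. One wrinkle in our toy example: the four axes use wildly different units (degrees, people, metres), so the population axis would dominate a raw distance. A common fix – my choice here, not something a transformer does, since its values are learned and already comparable – is to rescale each axis to the range 0 to 1 before measuring distance:

```python
import math

vectors = {
    "Durham":     [54.77676, -1.57566, 45696, 119],
    "Edinburgh":  [55.95206, -3.19648, 435791, 47],
    "Sunderland": [54.90465, -1.38222, 177965, 60],
}
n_dims = 4

# Min-max scale each axis so every dimension runs from 0 to 1.
lows  = [min(v[i] for v in vectors.values()) for i in range(n_dims)]
highs = [max(v[i] for v in vectors.values()) for i in range(n_dims)]
scaled = {
    name: [(v[i] - lows[i]) / (highs[i] - lows[i]) for i in range(n_dims)]
    for name, v in vectors.items()
}

def euclidean(a, b):
    """Straight-line distance, now in the full four-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for a, b in [("Durham", "Sunderland"), ("Durham", "Edinburgh"),
             ("Sunderland", "Edinburgh")]:
    print(a, "-", b, round(euclidean(scaled[a], scaled[b]), 2))
```

With all four axes contributing on equal terms, Durham and Sunderland still come out as the closest pair.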

So far, all I want you to take away is the fact that if we have some numbers associated with each token (the word or word part), then we can combine these in different ways to see relationships.

Applying this to the model

A lookup table of these vectors – one per token – is used to transform each token from the context into a vector. This is a required step: it lets the data be processed mathematically. The resulting multi-dimensional space helps model the meaning of the words.

Be careful: modelling the meaning mathematically (knowing that, in some sense, words like “dog” and “hound” are often used in a similar way because they plot close together on one axis) is not understanding in the human sense.
A generative AI model does not have a concept of what a dog is or what it might be like to be a dog. It can simply predict the other (parts of) words that are most likely to follow “dog” in a sentence.
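A common way to compare two word vectors is cosine similarity, which measures whether they point in roughly the same direction. The vectors below are entirely made up for illustration – real embeddings are learned and contain thousands of numbers – but they show the shape of the calculation:

```python
import math

# Made-up illustrative vectors; real embeddings are learned, not hand-written.
embeddings = {
    "dog":   [0.9, 0.1, 0.8, 0.3],
    "hound": [0.8, 0.2, 0.9, 0.2],
    "car":   [0.1, 0.9, 0.0, 0.7],
}

def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(embeddings["dog"], embeddings["hound"]), 2))  # high
print(round(cosine(embeddings["dog"], embeddings["car"]), 2))    # low
```

“Dog” and “hound” score far higher than “dog” and “car” – similarity of use, not understanding.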

One of the key factors behind the transformer’s improved performance over early predictive text models is that it processes the whole context, not just the last word. As we’ll see later, the model can also change the probability of a particular word being the next one in the sentence, depending on the other words in the context. This is known as attention.
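The final step of next-word prediction can be sketched with the softmax function, which turns raw scores into probabilities that sum to 1. The scores below are made up for illustration; in a real model they are produced by the network after attention has weighed the whole context:

```python
import math

# Made-up scores for our three candidate next words; higher means
# the model considers that word a better fit for the context.
scores = {"Durham": 2.0, "Edinburgh": 3.1, "Sunderland": 0.5}

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = dict(zip(scores, softmax(list(scores.values()))))
for word, p in probs.items():
    print(f"{word}: {p:.2f}")
```

A different context would shift the scores and hence the probabilities – that reweighting is what attention contributes.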