Generative AI

Advice in a rapidly changing world

Sorry – this page is still under development.

Transformers are a type of neural network architecture at the heart of modern machine learning. They get their name from the way they transform data (words, audio, pictures, etc.) into numbers, so that it can be processed mathematically.

Transformers are the core of large language models, the most common type of generative AI tool today. Chatbots such as ChatGPT use a large language model to take your input and generate a reply. Related generative models can also transform text into images.

Other common models include Whisper (and consumer derivatives such as MacWhisper), which are excellent tools for transforming speech into text and translating from one language into another.

The Transformer Process

In this high-level explanation we will use the most common example – a text chatbot – and see how a large language model uses a transformer to generate its response.

1. Input – the Context

[Screenshot: the ChatGPT input box, displaying the prompt "Ask anything"]

The process starts with you typing in a request. This may be in response to a prompt from the chatbot – in this case, the chatbot displayed the prompt "Ask anything".

What would be a good thing to feed my dog to make it more active?

The words from your request are combined with this prompt (and possibly more hidden instructions). This combination of the initial prompt and your response (known collectively as the context) is then used as the input to the model.

Ask anything. What would be a good thing to feed my dog to make it more active?

Adding the prompt is a way to help shape the output from the large language model. For example, the prompt may state that the request that you typed should be treated as a question. This means that the potential AI-generated responses should be weighted to favour those that are structured as attempts to answer the question.

As an aside, note that looking like an answer and being the right answer are not the same thing. In its raw form, a large language model will always generate a response, even if the words in the output are meaningless, incorrect or contradictory.

2. Tokenization and Embedding

The model begins by breaking the context into small pieces (known as tokens). It may help to think of tokens as words, but in fact they are often only parts of words, and tokens also represent punctuation characters. A rough rule of thumb is that one token corresponds to about four characters of English text, so 100 tokens is roughly 75 words.

The 17-word sentence in our example is split into 19 tokens by GPT-4:

Ask| anything|.| What| would| be| a| good| thing| to| feed| my| dog| to| make| it| more| active|?

To demonstrate that words are sometimes split into several tokens, this nine-word example uses some longer, more technical words, and is split into a total of 11 tokens by GPT-4:

If you need a programmatic interface for tokenizing text
If| you| need| a| program|matic| interface| for| token|izing| text
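As an illustration, tokenisation can be sketched as a greedy longest-match against a known vocabulary. The tiny vocabulary below is invented for demonstration – real models such as GPT-4 use byte-pair encoding with a learned vocabulary of around 100,000 tokens – but the principle of splitting unfamiliar words into known pieces is the same:

```python
# A toy illustration of subword tokenisation: greedy longest-match against
# a small hand-picked vocabulary. Real models such as GPT-4 use byte-pair
# encoding with a learned vocabulary; this vocabulary is invented purely
# for demonstration.
VOCAB = {"if", "you", "need", "a", "program", "matic",
         "interface", "for", "token", "izing", "text"}

def tokenize(word: str) -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            piece = word[i:j].lower()
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])          # no match: fall back to one character
            i += 1
    return tokens

sentence = "If you need a programmatic interface for tokenizing text"
pieces = [t for word in sentence.split() for t in tokenize(word)]
print(len(pieces), pieces)
```

On our example sentence this toy tokeniser produces the same 11-piece split shown above (ignoring capitalisation).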

Each large language model contains a defined vocabulary – the list of known tokens – and the tokenisation process splits words into tokens drawn from this vocabulary. The tokens were defined when the model was trained, by feeding it enormous volumes of text from websites, digital books, etc. The exact number of tokens in a model’s vocabulary is hard to find, but this table from Glarity (2025) provides some estimates:

Estimated maximum vocabulary sizes of some OpenAI models

Model     Maximum vocabulary
GPT-2     50,257 tokens
GPT-3     175,000 tokens
GPT-4     100,256 tokens
GPT-4o    199,997 tokens

Each token in the vocabulary has a corresponding set of numbers (a vector) which was generated during training. These vectors can be thought of as multi-dimensional co-ordinates: they place the token mathematically within a model of the relationships identified during training. (This is why the model doesn’t need to retain all the training data to make predictions now.)
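To make the idea of vectors as co-ordinates concrete, here is a toy sketch in Python. The three-dimensional vectors below are invented for illustration – real models use hundreds or thousands of dimensions, with values learned during training – but they show how related words end up close together in the space, which can be measured with cosine similarity:

```python
import math

# Invented 3-dimensional embedding vectors. Real models use hundreds or
# thousands of dimensions, with values learned during training.
embeddings = {
    "dog":    [0.90, 0.80, 0.10],
    "puppy":  [0.85, 0.75, 0.15],
    "carrot": [0.10, 0.20, 0.90],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """How closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words sit close together in the embedding space.
print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))   # high
print(cosine_similarity(embeddings["dog"], embeddings["carrot"]))  # much lower
```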


Attention

Sticking with our “dog” example, in most cases the attention layers will map the word “dog” to other occurrences of the furry, domesticated, four-legged animal. Consider, though, a sentence that also contained words such as “journalist” and “target”. In that case, stronger weightings may be applied to the elements of the vector that map “dog” to the concept of following someone closely, changing which words are most likely to come next.
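The attention calculation itself can be sketched as the “scaled dot-product attention” used in transformers: each token’s query vector is scored against every key vector, the scores are turned into weights, and the output is a weighted mix of the value vectors. The two-dimensional vectors here are invented purely for illustration:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights between 0 and 1 that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Score the query against every key: how relevant is each other token?
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # The output is the weighted mix of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Invented 2-dimensional vectors: the query lines up with the first key,
# so the first value dominates the output.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention(query, keys, values)
print(out)
```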

Multi-Layer Perceptron (FeedForward Neural Network)

These layers apply patterns learned during training – story continuity, grammar and syntax rules, etc.
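As a sketch of what such a layer computes: each token’s vector is expanded to a larger hidden size, passed through a nonlinearity (GELU is common in transformers), and projected back down. The weights below are invented for illustration – in a real model they are learned during training:

```python
import math

def gelu(x: float) -> float:
    """Approximate GELU activation, commonly used in transformer MLP blocks."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def feed_forward(x, w1, b1, w2, b2):
    """Two-layer MLP applied to one token's vector: expand, activate, project back."""
    hidden = [gelu(sum(xi * wij for xi, wij in zip(x, col)) + b)
              for col, b in zip(w1, b1)]
    return [sum(hi * wij for hi, wij in zip(hidden, col)) + b
            for col, b in zip(w2, b2)]

# Invented weights: a 2-dimensional token vector is expanded to 4 hidden
# units and projected back to 2 dimensions.
x = [0.5, -0.2]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
b2 = [0.0, 0.0]
result = feed_forward(x, w1, b1, w2, b2)
print(result)
```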

Unembedding – the softmax function

This is a way to convert the numbers obtained from repeated matrix-vector multiplication into a set of probabilities, one for each token in the vocabulary. Each probability lies in the range 0 to 1, and the sum of all the probabilities adds up to 1 (100%).
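A minimal sketch of the softmax function, applied to some hypothetical final-layer scores (logits) for four candidate next tokens:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities between 0 and 1 that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; it does not
    # change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for four candidate next tokens.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
print(probs)       # each probability lies between 0 and 1
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```

The token with the highest score always receives the highest probability; the chatbot then samples its next token from this distribution.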