AI models are trained on trillions of tokens, have billions of parameters, and work with ever-longer context windows.
But what the heck does any of that mean? Is two trillion tokens twice as good as one trillion? And, for that matter, what is a token, anyway?
In this post, we’ll explain key concepts for understanding large language models and other AI models, and why they matter, including tokens, parameters, and the context window. The post continues by explaining open source versus proprietary models (the difference may not be what you think it is); multimodal models; and the CPUs, GPUs, and TPUs that run the models. Finally, we have a technical-ish discussion of parameters and neurons in the training process and the problem of language equity.
Tokens
The smallest piece of information we use when we write is a single character, like a letter or number. Similarly, a token is the smallest piece of information an AI model uses. A token represents a snippet of text, ranging from a single letter to an entire phrase. A token typically holds less text than a word, so a document breaks into more tokens than it has words, although fewer tokens than characters. Each token is assigned a numeric ID, and these IDs, rather than the original text, are used to train the model. This “tokenization” reduces the computational power required to learn from the text.
One common method for tokenization is called Byte Pair Encoding (BPE), which starts with the most basic elements, such as individual characters, and progressively merges the most frequent pairs to form tokens. This allows for dynamic tokenization that can efficiently handle common words as single tokens while breaking down less common words into smaller sub-words or even single characters. BPE is particularly useful because it strikes a balance between the granularity of characters and the semantic meaning captured in longer strings of text.
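To make the merge step concrete, here is a minimal sketch of the BPE idea in Python. The tiny four-word corpus, the five merge rounds, and the helper functions are invented purely for illustration; production tokenizers work over byte sequences and vastly larger corpora.

```python
from collections import Counter

corpus = ["lower", "lowest", "newer", "newest"]
words = [list(w) for w in corpus]   # start from individual characters

def most_frequent_pair(words):
    # Count every adjacent pair of symbols across the corpus.
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Fuse every occurrence of the chosen pair into a single new token.
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

for step in range(5):   # five merge rounds, just for illustration
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair[0]}+{pair[1]} -> {words}")
```

Running this, the first merge fuses the frequent pair “w”+“e” into a single “we” token, and later rounds build up longer common chunks like “lowe” and “newe”, which is exactly how common words end up as single tokens.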
A good rule of thumb is that any given text will have about 30 percent more tokens than words, though this can vary based on the text and the specific tokenization algorithm used.
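You can check this ratio yourself with a tokenizer library. The sketch below uses OpenAI’s open-source tiktoken package and its cl100k_base encoding as one example; exact counts will differ with other tokenizers and other text.

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era OpenAI models

text = "Tokenization reduces the computational power required to learn from text."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print(token_ids[:5], "...")        # the integer IDs the model actually sees
print(enc.decode(token_ids[:5]))   # the IDs map back to snippets of the original text
```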
“Training tokens” are the tokens used to train or fine-tune the model, and their number can run into the billions or trillions. Most of the time, the more tokens a model is trained on, the better the model is. However, the quality and diversity of those tokens matter, especially for general-purpose models: a model trained on trillions of tokens from Reddit would do worse at some general tasks than one trained on a wider range of sources. Adding more data also makes training take longer.
Parameters
The “parameters” in an LLM are the values a model uses to make its predictions. Each parameter changes during training to improve the model’s predictive output. When training is complete, the parameters are fixed so that they no longer change.
Each parameter is a single number, and it is the whole collection of parameters working together that produces the model’s predictions. The complexity of language means that models rely on billions of parameters to make their inferences. Smaller LLMs currently have single-digit billions of parameters, mid-sized models like Llama 2 run up to around 70 billion, GPT-3 has 175 billion, and GPT-4 is reported to be larger still.
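To make “billions of parameters” concrete, the sketch below builds a deliberately tiny toy network in PyTorch (a hypothetical example, not any real LLM) and counts its parameters; every entry in its weight matrices and bias vectors is one parameter.

```python
import torch.nn as nn

# A toy two-layer network. Every weight and bias entry is one "parameter."
model = nn.Sequential(
    nn.Linear(512, 1024),   # 512 * 1024 weights + 1024 biases
    nn.ReLU(),
    nn.Linear(1024, 512),   # 1024 * 512 weights + 512 biases
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")   # about 1.05 million; GPT-3 has more than 100,000 times as many
```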
More parameters mean more complex models that can, hopefully, handle more complex tasks, at the cost of requiring more computational power. While additional complexity usually improves the model, it does not always. Smaller models with higher-quality training data, such as Microsoft’s Phi-2, can outperform larger models with less refined training data.
For example, if a model has too many parameters relative to the amount of training data, it can perfectly predict every outcome in its training data. But because it predicts what it already knows so well, it may do a poor job of extrapolating beyond its training. This phenomenon is called “overfitting.” Generally, it is possible to avoid overfitting by having more training tokens than parameters.
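To see overfitting on a much smaller scale than an LLM, the sketch below fits two polynomials to ten noisy data points: a degree-9 polynomial with ten parameters, which can match the training data almost perfectly, and a degree-3 polynomial, which cannot. The data, degrees, and error measure are chosen only for illustration; the higher-capacity fit typically does worse on points it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)   # noisy samples of a sine wave

# Ten parameters for ten data points: the fit can pass (nearly) through every training point.
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)
# Far fewer parameters than data points: the fit is forced to capture the overall shape instead.
simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

x_test = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x_test)
print("degree-9 error on new points:", np.mean((overfit(x_test) - y_true) ** 2))
print("degree-3 error on new points:", np.mean((simple(x_test) - y_true) ** 2))
# The flexible degree-9 fit usually scores far worse on points it has not seen: that is overfitting.
```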
Context Window Limit
The context window is the maximum amount of previous text the model can “see” when calculating its prediction, and it is measured in tokens. While the number of tokens a model is trained on can run into the trillions, and its parameters into the billions, the context window is typically in the thousands. Some recent models go much further: Claude 2.1 has a 200,000-token window, and Gemini 1.5 Pro has a context window of one million tokens, with an experimental version handling up to 10 million tokens.
Larger context windows allow a model to use more user-provided data, like a PDF, and produce longer responses, which can lead to more accurate reasoning. However, some long-context models have a “lost in the middle” issue, where content in the middle of the context window doesn’t receive enough attention, which can be a problem when reasoning over complex documents. Recent models like Gemini 1.5 Pro and Claude 3 have made progress on this, but benchmarks are still evolving. In addition, bigger windows require more computational power and are slower.
Comparing LLMs
With the information above we can better understand how different models compare, as shown in the table below.
Go Deeper!!
Open Source vs. Proprietary Models
Most people are familiar with the concepts of open source and proprietary software. They’re similar in AI, but with some key differences and controversies over the phrase “open source.”
In general, “open source” AI models are models whose parameter weights are available for the public to use and alter. “Open source,” “open weights,” and “downloadable weights” are used somewhat interchangeably. Some disagree with calling these models “open source,” believing the phrase should be reserved for models whose entire training pipeline – data, architecture, and training plan – is available for download.
Multimodal Models
The most developed LLMs take input in one form and output in the same form. In these single-modal models, for example, you can type a question in a text box and it delivers an answer in text (text-to-text). By contrast, so-called multimodal models are more advanced and can interpret and output multiple formats, such as image-to-image or image-to-text. GPT-4V (V for Vision) can respond to images, and Gemini 1.5 can watch video, see images, and listen to audio, without first transcribing or otherwise processing them to text.
The key differences between single- and multi-modal models have to do with the underlying architecture and their ability to seamlessly handle different formats on either end. Text-to-image and text-to-speech are still considered single-modal because they are designed to input a specific format and output a specific format.
CPU vs GPU vs TPU
LLMs and other AI systems require a lot of processing power, and the rise of AI is thus affecting how computer processors are developed and, of course, what those processors are called.
Central Processing Units (CPUs) are standard chips, useful for general-purpose computing. Graphics Processing Units (GPUs) are specialized chips, originally designed for processing graphics, that prioritize parallel, rather than serial, computation. That made them good for cryptocurrency mining, which helped create a GPU shortage in 2021, and it also makes them well suited to training and running LLMs.
Tensor Processing Units (TPUs), originally created by Google, are specially designed for training machine learning models. They are built to handle the heavy matrix computation involved in adjusting parameter weights during training and in performing inference efficiently.
Training models requires large numbers of processors; generally speaking, the more parameters a model has, the more processing power is required to train it. The Falcon 180B model was trained on up to 4,096 GPUs running simultaneously. Google’s A3 supercomputer is made up of 26,000 Nvidia GPUs, although its new Hypercomputer will run on Google’s own TPUs.
To The Marianas Trench!!
Parameters and Neurons in the Training Process
Modern AI models are “deep neural networks.” A neural network is made up of layers of smaller components called neurons, modeled loosely on biological neurons and implying some similarity to the way an organic brain works. They are “deep” (as opposed to “shallow”) because they are complex and include multiple layers.
Training a neural network begins by initializing the model’s parameters with random values, often drawn from a normal distribution. This initialization helps prevent issues such as vanishing or exploding gradients, which can keep the model from learning effectively. During training, a subset of the input data is passed through the model in a step known as the “forward pass.” Here, each neuron calculates an output by weighting its inputs, adding a bias, and then applying an activation function, such as the ReLU (rectified linear unit) function, which introduces non-linearity by outputting zero for negative inputs and the input value itself for positive inputs. Without that non-linearity, the whole network would collapse into a simple linear equation.
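Here is a minimal sketch of that forward pass for a single layer, written in plain NumPy. The layer sizes and random inputs are arbitrary; a real model stacks many such layers and processes batches of token embeddings.

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, the input value itself for positive inputs.
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # input to the layer (4 features)
W = 0.1 * rng.normal(size=(3, 4))   # weights for 3 neurons, initialized with small random values
b = np.zeros(3)                     # one bias per neuron

output = relu(W @ x + b)            # weight the inputs, add the bias, apply the activation
print(output)
```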
After the forward pass, the model’s predictions are evaluated against the actual target outputs using a loss function, which measures the model’s performance by quantifying the difference between predicted and true values. The gradient of the loss with respect to each parameter, computed in a step called backpropagation, indicates how to adjust the weights and biases to reduce the loss.
Subsequently, an optimization algorithm (typically Adam or AdamW) updates the parameters based on the computed gradients. This optimization step is crucial for improving the model’s predictive accuracy. The entire process, from forward pass to parameter update, is repeated with different subsets of the training data for a predetermined number of training steps or until the model performs well enough. Through this iterative process, the model learns to map inputs to outputs accurately, gradually improving its ability to make predictions or decisions based on new, unseen input data.
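Putting the pieces together, the sketch below is a generic PyTorch training loop, not any specific LLM’s code: a toy model, random data, a mean-squared-error loss, and the AdamW optimizer, with the step count and learning rate chosen arbitrarily.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))   # toy model
loss_fn = nn.MSELoss()                                                  # quantifies prediction error
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

X = torch.randn(256, 10)   # toy inputs
y = torch.randn(256, 1)    # toy targets

for step in range(100):
    preds = model(X)             # forward pass
    loss = loss_fn(preds, y)     # compare predictions to targets
    optimizer.zero_grad()
    loss.backward()              # backpropagation: gradients of the loss w.r.t. every parameter
    optimizer.step()             # AdamW nudges each parameter to reduce the loss
```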
Unicode, Tokens, and the Language Equity Problem
Representing text in computing involves assigning a unique number to each character, ranging from symbols like 'a' and '&' to complex characters such as '业' or even emojis like '😊'. This assignment process is known as "encoding." In the early days of computing, different countries developed their own encodings to cater to their specific alphabets. For instance, the United States developed the American Standard Code for Information Interchange (ASCII) standard. This diversity in encodings posed challenges in managing multilingual texts, prompting the need for a universal solution.
This need led to Unicode, a comprehensive encoding standard designed to represent every character used across various languages. In the common UTF-8 encoding of Unicode, characters vary in size from 1 to 4 bytes. Commonly used writing systems such as Latin, Arabic, and Cyrillic, as well as the more frequently used Han characters, are encoded in 1 to 3 bytes. In contrast, lesser-used Han characters, emojis, and characters from rare or extinct writing systems require 4 bytes.
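You can see those sizes directly in Python by encoding a few characters as UTF-8 bytes; the particular characters below are just examples.

```python
# How many bytes each character takes when encoded as UTF-8.
for ch in ["a", "&", "é", "Ж", "业", "😊"]:
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# 'a' and '&' take 1 byte, 'é' and 'Ж' take 2, '业' takes 3, and '😊' takes 4.
```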
Byte Pair Encoding (BPE, discussed above) typically starts with the 256 possible byte values rather than the extensive range of Unicode characters. This approach means that constructing tokens can be more complex and token-intensive depending on the language. For instance, Telugu, a major language in India, may generate up to 10 times more tokens than English for the same amount of text. That means processing Telugu is more costly for any given amount of text, resulting in less representation and accuracy given computing constraints.
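The sketch below compares token counts for the same greeting in English and Telugu using tiktoken’s cl100k_base encoding; the Telugu phrase is a rough rendering included only as an illustration, and exact counts vary by tokenizer and text.

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Hello, how are you?"
telugu = "నమస్కారం, మీరు ఎలా ఉన్నారు?"   # a rough Telugu rendering of the same greeting

print("English:", len(enc.encode(english)), "tokens")
print("Telugu: ", len(enc.encode(telugu)), "tokens")
# The Telugu text typically produces several times more tokens, so it costs more to process.
```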
The question of whether it's feasible to develop a fair multilingual model that performs equally well across different languages, as opposed to optimizing models for specific languages, remains an open and complex challenge.
Stay tuned for more on artificial intelligence models from the Technology Policy Institute. Visit chatbead.org to use TPI’s AI tool to search federal infrastructure grant applications. Nathaniel Lovin is Lead Programmer and Senior Research Analyst at the Technology Policy Institute. Sarah Oh Lam is a Senior Fellow at the Technology Policy Institute.