How Are Large Language Models Trained?
Large language models (or really any machine learning system) aren’t instantly smart. Just like humans, they have to be taught (or trained) about a given topic. In fact, training a model is a lot like how you and I would go about learning something new.
Say we wanted to learn about donut recipes. What are the typical ingredients? What variants are there in making the dough? What kind of toppings can you put on a donut? (pretty much anything, right!) What recipes don’t make donuts?
To learn all this you would gather a bunch of recipes from sources you know are trustworthy. Then you would go about reading. A lot. Over time you would start to see patterns across the recipes: most of them use flour, and the ones that don’t use flour are usually considered gluten-free. That collection of recipes is your training data.
Donuts are typically topped with something sweet like sprinkles. You would also notice that donuts are round with a hole in the middle. You could use these common patterns to read other recipes and tell whether they’re for donuts. A recipe might call for ingredients similar to a donut’s but actually be for pancakes, so you’ll need to find consistent patterns to tell them apart.
Training a large language model is very similar to this. The more example recipes the model sees, the better it can tell whether a given recipe makes a donut. That’s why you want a ton of recipes: so your model gets really good at identifying donut recipes.
What Are Large Language Models Used For?
Training a model to determine whether a recipe is for a donut is helpful but leaves quite a bit to be desired. Training models is not an easy task, so you want to include as many features as possible. In this case, we might also want the model to know what kind of donut is being made.
As you give the model all those donut recipes you can include the type of donut with each recipe. This is called data labeling. Using this approach means not only can the model determine if a recipe is to make a donut, but it can also answer what kind of donut is being made! Now someone can ask your model “Does this recipe make a chocolate donut?” Your model was trained with donut recipes that were labeled with the type, so it should be able to provide a very accurate answer.
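To make the idea of data labeling concrete, here is a toy sketch in Python. It is not how a large language model is actually trained; it just pairs each made-up recipe snippet with a label and fits a tiny scikit-learn text classifier, so you can see what "labeled training data" looks like in code.

```python
# Toy illustration of labeled training data: each example pairs a recipe
# snippet (the input) with the kind of donut it makes (the label).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

recipes = [
    "flour, sugar, cocoa powder, fry, dip in chocolate glaze",
    "flour, sugar, yeast, fry, roll in cinnamon sugar",
    "almond flour, sugar, fry, dip in chocolate glaze",
    "flour, sugar, yeast, fry, top with rainbow sprinkles",
]
labels = ["chocolate", "cinnamon", "chocolate", "sprinkle"]

# Bag-of-words features plus logistic regression: far simpler than an LLM,
# but the same principle -- learn patterns that connect text to labels.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(recipes, labels)

# Ask the trained model about a recipe it has never seen.
print(model.predict(["flour, sugar, fry, dip in chocolate glaze"]))
```

The scale is completely different, but the workflow is the same one described above: gather examples, label them, and let the model find the patterns that connect inputs to labels.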
Large language models come in many shapes and sizes. However, because they are so complicated and need huge amounts of data to train on, they are designed for broad goals. Imagine creating a model that can take 5 seconds of any song in the world and identify its artist. That’s not an easy task, and it requires knowledge of every song ever made.
Now say you wanted a model that could only identify whether a given song is on one specific album. A large language model isn’t the right fit here: you wouldn’t need to train it on every song in the world, since all it has to know are the few songs on that album, and those few songs aren’t nearly enough data to train this kind of model to give accurate answers, especially when there are a ton of songs in the world that sound pretty similar to the ones on the album.
Large language models are meant to complete very abstract thoughts with little context, like “why did the chicken cross the road?” They are also meant to provide precise, accurate answers when given clear examples and descriptions of what is desired. To be good at both of these uses, they need a huge amount of data to learn from.
Examples of Large Language Models
As of this document’s publish date, here are a few examples of publicly available large language models. We’ve tried to provide some context about the goals of each model and how to get started with them.
All of these models are natural language processing (NLP) models, meaning they have been trained to work with human language (letters, words, sentences, etc.).
OpenAI GPT-3 (Generative Pre-trained Transformer 3)
This LLM was released in 2020 by OpenAI. It is classified as a generative large language model with around 175 billion parameters. OpenAI used a few different datasets to train GPT-3 on a huge swath of the internet, the biggest being Common Crawl.
GPT’s objective is to continue a provided thought. The thought could be a complete statement like “it’s a great day” or a question like “why did the chicken cross the road”. GPT reads the text left-to-right and tries to predict the next few words.
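To see that left-to-right, “predict what comes next” behavior for yourself, here is a minimal sketch using the Hugging Face transformers library with GPT-2, a smaller, openly downloadable relative of GPT-3 (GPT-3 itself is only available through OpenAI’s API). The exact continuation you get will vary.

```python
# Minimal sketch: left-to-right text continuation with a GPT-style model.
# GPT-2 stands in here for GPT-3, which is not openly downloadable.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model reads the prompt left to right and keeps predicting the next token.
result = generator("Why did the chicken cross the road?", max_new_tokens=20)
print(result[0]["generated_text"])
```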
BERT (Bidirectional Encoder Representations from Transformers)
Google released this LLM in 2018. It is based on the transformer architecture. BERT takes a different approach than GPT: instead of reading text left-to-right, it reads in both directions at once and learns to predict words that have been masked out of the sentence. This gives the model a better understanding of the context around each word.
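A quick way to see that bidirectional, fill-in-the-blank behavior is the transformers fill-mask pipeline. This is a minimal sketch; the predicted words and scores depend on the model checkpoint.

```python
# Minimal sketch: BERT predicts a masked word using context from BOTH sides.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT looks at the words before AND after [MASK] to guess what is missing.
for prediction in fill_mask("The donut was topped with [MASK] and a sweet glaze."):
    print(prediction["token_str"], round(prediction["score"], 3))
```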
RoBERTa (Robustly Optimized BERT Pretraining Approach)
This model was introduced by Facebook AI in 2019. It is based on Google’s BERT model, with improvements to the performance and robustness of the original. The improvements focus on refining the pretraining process and training on a larger corpus of text data.
T5 (Text-to-Text Transfer Transformer)
Introduced by Google Research in a paper published in 2019, the T5 model is designed to approach all natural language processing (NLP) tasks in a unified manner. It does this by casting every NLP task as a text-to-text problem: both the input and the output are treated as text strings. This gives the model a wide range of abilities, including text classification, translation, summarization, question answering, and more.
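Because every task is “text in, text out,” you steer T5 with a task prefix on the input string. Here is a minimal sketch using the small public t5-small checkpoint; the outputs are illustrative only.

```python
# Minimal sketch: T5 treats every task as text-to-text, selected by a prefix.
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")

# The task is chosen by the prefix on the input text.
print(t5("translate English to German: Donuts are delicious.")[0]["generated_text"])
print(t5("summarize: Donuts are fried rings of dough, usually sweet, "
         "often topped with glaze, sugar, or sprinkles.")[0]["generated_text"])
```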
CTRL (Conditional Transformer Language Model)
Created by Salesforce Research in a research paper published in 2019, this model is designed to generate text conditioned on specific instructions or control codes. The control codes act as instructions for the model during text generation, guiding it to produce text in a particular style or genre, or with specific attributes. This gives users fine-grained control over the language generation process according to their own constraints.
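As a rough sketch of how control codes look in practice, the example below prefixes the prompt with a code before generating. It assumes the Salesforce/ctrl checkpoint published on the Hugging Face Hub and uses “Books” as an example control code from the CTRL paper; note the model is large to download.

```python
# Rough sketch: CTRL conditions generation on a control code placed at the
# start of the prompt. "Books" is one example code from the CTRL paper;
# the checkpoint name "Salesforce/ctrl" is assumed here.
from transformers import pipeline

ctrl = pipeline("text-generation", model="Salesforce/ctrl")

# The leading control code steers the style of the generated text.
result = ctrl("Books The old bakery on the corner", max_new_tokens=40)
print(result[0]["generated_text"])
```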
Megatron-Turing (MT-NLG)
This model is a combination of Microsoft’s DeepSpeed deep learning optimization library and NVIDIA’s Megatron-LM large transformer model. At the time of its release it claimed the title of “world’s largest transformer-based language model,” with 530 billion parameters (significantly more than GPT-3). That massive parameter count made the model quite good at zero-shot, one-shot, and few-shot prompts, and it set a new bar for scale and quality in modern LLMs.
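To illustrate what “few-shot” means here, below is a sketch of a few-shot prompt. The examples in it are invented, and the format (a handful of worked examples followed by a new input) is the general pattern rather than anything specific to MT-NLG.

```python
# Sketch of a few-shot prompt: a handful of worked examples, then a new input.
# The model infers the task ("label the donut type") from the examples alone.
few_shot_prompt = """\
Recipe: flour, sugar, cocoa, fry, dip in chocolate glaze
Donut type: chocolate

Recipe: flour, sugar, yeast, fry, roll in cinnamon sugar
Donut type: cinnamon

Recipe: flour, sugar, yeast, fry, top with rainbow sprinkles
Donut type:"""

# This string would be submitted to the model, which should complete the
# last line with something like "sprinkle".
print(few_shot_prompt)
```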
How to get started with Generative AI using LLMs
Once you’ve set the goal of your generative AI project, you can select the LLM that best fits the need. Most likely the LLM offers an API to interact with it (i.e., submit prompts and receive responses). You’ll want your prompts to strike a balance between the project’s goals and the LLM’s characteristics. Striking that balance usually means including additional information that the LLM has no knowledge of. Learn more about prompt engineering.
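As a minimal sketch of what “interact through an API” can look like, here is an example using the openai Python package (version 1.x client style). The model name and prompt are placeholders, and you would need your own API key set in the OPENAI_API_KEY environment variable.

```python
# Minimal sketch: submit a prompt to an LLM over its API and read the response.
# Assumes the openai package (>= 1.0) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[
        {"role": "user", "content": "Does this recipe make a chocolate donut? "
                                     "Flour, sugar, cocoa, yeast, fry, dip in glaze."},
    ],
)

print(response.choices[0].message.content)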
Typically you use a vector database to match a user’s input with your pre-made text and assemble a well-crafted prompt. This helps ensure the LLM’s responses are predictable and stable enough to include in your larger efforts. At its simplest, the flow (sketched in code after the list below) is:
- Take in the user’s query
- Find additional context in up-to-date vectorized data
- Combine that additional data with your pre-made text
- Submit the final prompt to the LLM
- Respond to the user with the LLM’s response
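Here is a minimal sketch of that flow in Python. The embed_text, vector_search, and call_llm helpers are hypothetical placeholders for your embedding model, your vector database (for example, Astra’s vector search), and your LLM API of choice.

```python
# Sketch of the retrieval-augmented flow described above.
# embed_text, vector_search, and call_llm are hypothetical placeholders --
# wire them to your own embedding model, vector database, and LLM API.

PROMPT_TEMPLATE = """You are a helpful assistant for our donut shop.
Use only the context below to answer the customer's question.

Context:
{context}

Question: {question}
Answer:"""


def answer(question: str) -> str:
    # 1. Take in the user's query (the `question` argument).
    # 2. Find additional context in up-to-date vectorized data.
    query_vector = embed_text(question)
    matches = vector_search(query_vector, top_k=3)
    context = "\n".join(match["text"] for match in matches)

    # 3. Combine that additional data with your pre-made text.
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # 4. Submit the final prompt to the LLM and
    # 5. respond to the user with the LLM's response.
    return call_llm(prompt)
```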
While this may sound complex, Datastax Astra takes care of most of this for you with a fully integrated solution that provides all of the pieces you need for contextual data: from the nervous system built on data pipelines and embeddings all the way to core memory storage, retrieval, access, and processing, in an easy-to-use cloud platform. Try for free today.