What are Large Language Models

What is a Large Language Model?

Say you wanted to compete on the popular game show Jeopardy (an American TV game show where contestants are given the answer and have to guess the question). To be on the show you need to know everything about anything. So you decide to dedicate every day for the next 3 years to reading everything on the internet. You quickly realize this is harder than it first appeared and a huge investment of time. You also realize that there is a vast mix of information on the internet. Some of it is fact, some is opinion, and most is somewhere in between. Jeopardy is based on facts, so spending most of your time somewhere in between is not smart.

After lots of coffee, you decide to take a different approach to training for Jeopardy. Instead of trying to know everything about anything, you focus on how to predict the next word in a sentence. If someone says “Have a nice…”, your training teaches you that the next word is probably going to be “day.” It’s an entirely different way to approach Jeopardy training, but could be your edge to nailing that daily double!

So you focus on the English language. You want to read all the sentences that have been written, to (hopefully) discover patterns. Then you use those patterns to predict the next word, when someone offers you a thought. What kind of data do you need to train with this new approach? How are you going to remember all the patterns?

This is the challenge a large language model (LLM) can solve. These models are large because they have been trained on a very large set of data (like all the public content on the internet). They are language models because they can use that training data to understand how to complete a sentence in a given language (like English, Spanish, or French). And because information on the internet covers such a huge range of opinions, dialects, ideas, and so on, the models are very good at inferring patterns from the questions they’re asked.

LLMs are a form of artificial intelligence (AI). They have been trained on huge amounts of data and can intelligently complete a thought similar to the way a human would. In that sense, they are artificially intelligent.

How Do Large Language Models Work?

When creating a large language model, the first questions to answer are: what is the goal of the model, and how much data can you gather about that goal? LLMs like GPT have a pretty broad goal: complete any thought or idea. A model’s goal could be a bit more focused, like indexing a very large set of documents to make them searchable. Large language models are “large” because their intended goal is typically a very big idea and the data needed to learn about the goal is vast. In fact, for GPT’s goal of completing any thought, enough data to truly train it might not exist.

Almost every large language model has some nuance to it. Either differences in the data it was trained on, the way it was trained, optimizations to its learning paths, or how it goes about completing a thought. Compare and . When you open a chat with either, things feel quite the same. You share a thought or ask a question and the model responds with something relevant to the conversation. But under the covers things are very different.

If your first interaction with a large language model was using a website like ChatGPT, then you might be inclined to think LLMs are made to answer your questions. In fact, these models don’t answer questions at all; they complete thoughts. Prompting a model with “It’s a lovely day.” versus “Is it a lovely day?” will get different responses. This is not because one is a question and the other is a statement. It is because, to complete a thought, a model tries to find the (statistically) best fitting next set of words for the text it was given. Then the next set of words after that, and so on. The response is called a “completion” because the model is trying to figure out what comes next. To us, it sure feels like a question and answer.
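To make the idea of a "statistically best fitting" next word concrete, here is a minimal sketch in Python. It is not how GPT or any production LLM works internally (those are neural networks with billions of parameters). It only counts which word follows which in a tiny made-up corpus and picks the most frequent follower. The corpus and all function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model trains on vastly more text.
CORPUS = [
    "have a nice day",
    "have a nice day today",
    "have a nice trip",
    "it is a lovely day",
]

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

def predict_next(followers, prompt):
    """Return the most frequent follower of the prompt's last word, or None."""
    words = prompt.lower().split()
    if not words or words[-1] not in followers:
        return None
    return followers[words[-1]].most_common(1)[0][0]

def complete(followers, prompt, max_words=3):
    """Extend the prompt one predicted word at a time."""
    words = prompt.lower().split()
    for _ in range(max_words):
        nxt = predict_next(followers, " ".join(words))
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

model = train_bigrams(CORPUS)
print(predict_next(model, "Have a nice"))  # day
print(complete(model, "it is a"))
```

Note that the sketch never checks whether the prompt is a question or a statement. It only looks at which words tend to come next, which is the sense in which a response is a "completion".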

What’s the difference between machine learning and LLMs?

A notable difference between many machine learning models and a large language model is that an LLM is based on a neural network. As the name suggests, a neural network is loosely modeled on how a human’s neurons work. It’s a computational model made of many connected units whose connection strengths are learned from data. As you can imagine, this can get really complicated. The complexity of an LLM is usually expressed as its number of parameters, which runs into the billions. Smaller machine learning models typically don’t need a neural network. This makes their complexity a little easier to handle but also limits their computational abilities.

How Are Large Language Models Trained?

Large language models (or really any machine learning system) aren’t instantly smart. Just like humans, they have to be taught (or trained) about a given topic. Training a model is very similar to how you and I would go about learning a topic.

Say we wanted to learn about donut recipes. What are the typical ingredients? What variants are there in making the dough? What kind of toppings can you put on a donut? (pretty much anything, right!) What recipes don’t make donuts?

To learn all this you would gather a bunch of recipes from sources you know are trustworthy. This collection is called training data. Then you would go about reading. A lot. Over time you would see patterns in all the recipes. Most of them use flour, for example, and the ones that don’t are usually considered gluten-free.

Donuts are typically topped with something sweet like sprinkles. You could use these common patterns to read other recipes and know if it’s for a donut. You would also notice that donuts are round with a hole in the middle. A recipe might call for similar ingredients to a donut but could be for making pancakes. You'll need to find consistent patterns to figure this out.

Training a large language model is very similar to this. The more recipe examples, the better the model can tell if a given recipe is to make a donut. You want a ton of recipes so that your model will be super good at identifying donut recipes.

What Are Large Language Models Used For?

Training a model to be able to determine if a recipe is for a donut is helpful but leaves quite a bit to be desired. Training models is not an easy task so you want to include as many features as possible. In this case, we might want the model to know what kind of donut is being made.

As you give the model all those donut recipes you can include the type of donut with each recipe. This is called data labeling. Using this approach means not only can the model determine if a recipe is to make a donut, but it can also answer what kind of donut is being made! Now someone can ask your model “Does this recipe make a chocolate donut?” Your model was trained with donut recipes that were labeled with the type, so it should be able to provide a very accurate answer.
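As a rough illustration of labeled training data, the sketch below pairs each recipe with a label and classifies a new recipe by finding the most similar labeled one. This is a simple nearest-neighbor comparison on ingredient overlap, not a large language model, and the recipes, labels, and function names are all hypothetical.

```python
# Hypothetical labeled training data: each recipe (a set of ingredients)
# is paired with a label describing what it makes.
LABELED_RECIPES = [
    ({"flour", "sugar", "yeast", "oil", "cocoa", "chocolate glaze"}, "chocolate donut"),
    ({"flour", "sugar", "yeast", "oil", "sprinkles", "vanilla glaze"}, "sprinkle donut"),
    ({"flour", "milk", "egg", "butter", "maple syrup"}, "not a donut"),
]

def jaccard(a, b):
    """Overlap between two non-empty ingredient sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b)

def classify(ingredients, labeled=LABELED_RECIPES):
    """Return the label of the most similar labeled recipe (nearest neighbor)."""
    best_label, best_score = None, -1.0
    for known, label in labeled:
        score = jaccard(ingredients, known)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

mystery = {"flour", "sugar", "yeast", "oil", "cocoa"}
print(classify(mystery))  # chocolate donut
```

Because every training example carries a label, the same lookup answers both "is this a donut?" and "what kind of donut is it?". With only three examples the answers are fragile, which is why more labeled recipes help.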

Large language models come in many shapes and sizes. However, because large language models are so complicated and need huge amounts of data to train on, their designed goal is broad. Imagine creating a model to take 5 seconds of any song in the world and identify its artist. That’s not an easy task and requires knowledge of every song ever made.

Now say you wanted to create a model that could identify whether a given song was on a specific album. A large language model is not a good fit for this. You would not train it on all the songs in the world; all it has to know about are the few songs on that album. That’s not enough data for a large model to provide an accurate response, because there are a ton of songs in the world that sound kinda similar to the songs on the album.

Large language models are meant to complete very abstract thoughts with little context, like “why did the chicken cross the road?” They are also meant to provide precise, accurate answers when given clear examples and descriptions of what is desired. To be good at both of these uses, they need a huge amount of data for learning.

Examples of Large Language Models

As of this document’s publish date here are a few examples of publicly available Large Language Models. We’ve tried to provide some context about the goals of each model and how to get started with them.

All of these models are natural language processing (NLP) models, meaning they have been trained to work with how a human speaks (letters, words, sentences, etc.).

OpenAI GPT-3 (Generative Pre-trained Transformer 3)

This LLM was released in 2020 by OpenAI. It is classified as a large language model with around 175 billion parameters. OpenAI used a few different datasets to train GPT about the entire internet, with the biggest being .

GPT’s objective is to continue a provided thought. The thought could be a statement like “it’s a great day” or a question like “why did the chicken cross the road”. GPT reads the text left-to-right and tries to predict the next few words.

BERT (Bidirectional Encoder Representations from Transformers)

Google released this LLM in 2018. It is based on the Transformer architecture. BERT takes a different approach than GPT: it reads text both from the left and from the right, and is trained to predict words that have been masked out of the text rather than the next few words. This gives the model a better understanding of the context of words.
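The difference in reading direction can be sketched by looking at which words each style of model gets to see when predicting one position. The sketch below is only an illustration of the available context, not of either model's internals. The function names are assumptions, and `[MASK]` stands in for the placeholder that BERT-style training puts where a hidden word was.

```python
def left_to_right_context(tokens, position):
    """Context a GPT-style model sees when predicting tokens[position]."""
    return tokens[:position]

def bidirectional_context(tokens, position, mask="[MASK]"):
    """Context a BERT-style model sees when predicting a masked tokens[position]."""
    return tokens[:position] + [mask] + tokens[position + 1:]

tokens = "the chicken crossed the road".split()

# Predicting the third word ("crossed"):
print(left_to_right_context(tokens, 2))  # ['the', 'chicken']
print(bidirectional_context(tokens, 2))  # ['the', 'chicken', '[MASK]', 'the', 'road']
```

The left-to-right view only has the words before the gap, while the bidirectional view also has "the road" after it, which is the extra context the paragraph above refers to.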

RoBERTa (Robustly Optimized BERT Pretraining Approach)

This model was introduced by Facebook AI in 2019. It is based on Google’s BERT model, with improvements to the performance and robustness of the original. The improvements focus on tuning the pretraining process and training on a larger corpus of text data.

T5 (Text-to-Text Transfer Transformer)

Introduced by Google Research in a paper published in 2019, the T5 model is designed to approach all natural language processing (NLP) tasks in a unified manner. It does this by casting every NLP task as a text-to-text problem: both input and output are treated as text strings. This lets a single model handle many tasks, including text classification, translation, summarization, question-answering, and more.
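A small sketch of what "casting a task as text-to-text" can look like: a short prefix naming the task is put in front of the input, and the expected output is also plain text. The prefixes below follow the style used for T5, but treat the exact strings, the task keys, and the example targets as illustrative assumptions. No model is called here.

```python
# Task prefixes in the style used for T5; the exact strings and the
# dictionary keys are illustrative assumptions, not a definitive list.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "acceptability": "cola sentence: ",
}

def to_text_to_text(task, text):
    """Cast a task and its input into a single input string for the model."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text

# Every task becomes an (input string, target string) pair.
# The targets are hypothetical examples of what training data could contain.
examples = [
    (to_text_to_text("translate_en_de", "That is good."), "Das ist gut."),
    (to_text_to_text("summarize", "A long article about donuts ..."), "Donuts are popular."),
    (to_text_to_text("acceptability", "The donut ate I."), "not acceptable"),
]

for source, target in examples:
    print(repr(source), "->", repr(target))
```

Even the classification task produces a text label as its output instead of a class number, which is what lets one model and one training setup cover all three tasks.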

CTRL (Conditional Transformer Language Model)

This model was created by Salesforce Research and described in a research paper published in 2019. It is designed to generate text conditioned on control codes, which act as instructions for the model during text generation. The control codes guide the model to produce text in a particular style or genre, or with specific attributes. This allows fine-grained customization of the language generation process according to user-specified constraints.
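In practice, conditioning on a control code amounts to putting the code in front of the text the model should continue. The sketch below shows only that prompt-building step. The set of codes is a small illustrative subset, and the function name and validation are assumptions, not part of any CTRL library.

```python
# Illustrative subset of control codes; treat this list as an assumption.
CONTROL_CODES = {"Reviews", "Horror", "Wikipedia"}

def build_ctrl_prompt(control_code, text):
    """Prefix the text with a control code so generation is steered by it."""
    if control_code not in CONTROL_CODES:
        raise ValueError(f"unsupported control code: {control_code}")
    return f"{control_code} {text}"

opening = "The donut shop on the corner"
print(build_ctrl_prompt("Reviews", opening))  # Reviews The donut shop on the corner
print(build_ctrl_prompt("Horror", opening))   # Horror The donut shop on the corner
```

The same opening words, given to a model trained this way, would be continued as a product review in one case and as a horror story in the other.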

Megatron-Turing (MT-NLG)

This model brings together Microsoft’s DeepSpeed deep learning optimization library and NVIDIA’s Megatron-LM large transformer model. At the time of release it claimed the “world’s largest transformer-based language model” title, with 530 billion parameters (significantly more than GPT-3). Its massive parameter count made the model quite good at zero-shot, one-shot, and few-shot prompts, meaning prompts that include no examples, one example, or a handful of examples of the desired answer. It set a new bar in terms of scale and quality in modern LLMs.
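The difference between zero-shot, one-shot, and few-shot prompts is just how many worked examples are placed in the prompt before the real question. Here is a minimal sketch of building such prompts. The "Q:" and "A:" layout and the function name are assumptions for illustration, not a format required by any particular model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Build a prompt with zero, one, or several worked examples before the query."""
    parts = [instruction]
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    # Leave the final answer blank for the model to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

instruction = "Answer the trivia question."
solved = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "Eight"),
]

zero_shot = build_few_shot_prompt(instruction, [], "Who wrote Hamlet?")
one_shot = build_few_shot_prompt(instruction, solved[:1], "Who wrote Hamlet?")
few_shot = build_few_shot_prompt(instruction, solved, "Who wrote Hamlet?")

print(few_shot)
```

All three prompts end with an unanswered "A:", so the model's completion is its answer. The examples only show it the pattern to follow.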

How to get started with Generative AI using LLMs

Once you set the goal of your project, you can select the LLM that best fits the need. Most likely the LLM offers an API to interact with it (i.e., submit prompts and receive responses). You’ll want the prompts to be a balance between project goals and the LLM’s characteristics. That balance is going to include additional information that the LLM has no knowledge of. Learn more about prompt engineering.

Typically you use a to match a User’s input with your pre-made text, to create a perfectly crafted prompt. This will ensure the LLM’s responses are predictable and stable enough to include in your larger efforts. At its simplest, the flow will be:

  1. Take in User query
  2. Find additional context in up-to-date vectorized data
  3. Combine that additional data with your pre-made text
  4. Submit final prompt to the LLM
  5. Respond to the User with LLM response
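The five steps above can be sketched end to end in a few lines of Python. Everything here is a stand-in chosen for illustration: the documents are made up, the "embedding" is a plain bag-of-words count and not a real embedding model, the search is a linear scan and not a vector database, and `call_llm` is a hypothetical placeholder for whatever LLM API you use.

```python
import math
from collections import Counter

# Made-up "up-to-date" documents that the LLM was never trained on.
DOCUMENTS = [
    "Our store opens at 9am and closes at 6pm on weekdays.",
    "Returns are accepted within 30 days with a receipt.",
    "Glazed donuts are baked fresh every morning.",
]

# Pre-made prompt text that is combined with the retrieved context.
PROMPT_TEMPLATE = (
    "Answer the question using only the context.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents=DOCUMENTS):
    """Step 2: find the document most similar to the query."""
    query_vec = embed(query)
    return max(documents, key=lambda doc: cosine(query_vec, embed(doc)))

def build_rag_prompt(query):
    """Step 3: combine the retrieved context with the pre-made text."""
    return PROMPT_TEMPLATE.format(context=retrieve(query), question=query)

def call_llm(prompt):
    """Step 4 placeholder: a real system would call an LLM API here."""
    return f"(model response to a {len(prompt)}-character prompt)"

def answer(query):
    """Steps 1 through 5: take a query in, return the model's response."""
    return call_llm(build_rag_prompt(query))

print(build_rag_prompt("When are returns accepted?"))
```

Keeping retrieval, prompt building, and the model call as separate functions means each piece can be swapped for a production component without changing the overall flow.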

While this may sound complex, takes care of most of this for you with a fully integrated solution that provides all of the pieces you need for contextual data. From the nervous system built on data pipelines to embeddings all the way to core memory storage and retrieval, access, and processing in an easy-to-use cloud platform. .
