Vector Embeddings: The Foundation of Data Representation
Vector embeddings transform data into numerical representations, powering the cognitive capabilities of AI, but how do they enable machines to 'learn' and 'grow'?
A vector embedding is a numerical representation of data that captures semantic relationships and similarities, making it possible to perform mathematical operations and comparisons on the data for various tasks like text analysis and recommendation systems.
With all the talk about generative AI, the concepts behind what powers it can be a little overwhelming. In this article we are going to focus on the one functional concept that powers the underlying cognitive ability of AI and gives machine learning models the ability to learn and grow: the vector embedding.
A vector embedding is, at its core, a way to represent a piece of data mathematically. Google’s definition of a vector embedding is “a way of representing data as points in n-dimensional space so that similar data points cluster together”. For people with strong backgrounds in mathematics, I am sure these words make perfect sense, but for those of us who struggle with visual representations of mathematical concepts, this might sound like gibberish.
So let’s look at this another way. Let’s say you have a bowl full of M&Ms that you like to snack on, and your youngest offspring decides that they are going to be clever and mix in a bowl full of Skittles. For those not familiar with these two things, M&Ms and Skittles are colorful candy-shelled treats that look very similar, but one is chocolate and the other is fruit-flavored, two flavors that don’t really mix well. To remedy the situation we need to sort the candies, and we decide to sort them by type and color. All the green M&Ms go together in one pile and all the green Skittles go together in another pile, all the red M&Ms and all the red Skittles, and so on. When we are done we have distinct piles of M&Ms and Skittles separated by color, and we can arrange them visually so that we can quickly see where a new candy falls.
You can see that in the sorting of our candies, we have already started to lay out patterns and groupings that can make it easier to relate the candies together and to find the piles we need when sorting new candies as we find them. A vector embedding takes this visual representation and applies a mathematical representation to its position. A simple way to think about this is if we assign each position with a different value.
With our candies sorted, it is now possible to assign a value to each candy based on its attributes and put a new candy into the correct pile based on that number. This is ultimately what a vector embedding is, albeit with far more complexity.
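To make that concrete, here is a minimal sketch of the candy idea in Python: each pile gets a small, hand-assigned vector of attributes, and a new candy is placed in whichever pile is closest. The attributes and numbers are purely illustrative.

```python
# A minimal sketch of the candy-sorting idea: each candy is hand-assigned a
# small feature vector (these attribute choices and numbers are illustrative).
import math

# Feature order: [is_chocolate, is_fruit_flavored, redness, greenness]
piles = {
    "red M&M":       [1, 0, 1, 0],
    "green M&M":     [1, 0, 0, 1],
    "red Skittle":   [0, 1, 1, 0],
    "green Skittle": [0, 1, 0, 1],
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A new candy that is chocolate and red should land in the red M&M pile.
new_candy = [1, 0, 1, 0]
closest = min(piles, key=lambda name: distance(piles[name], new_candy))
print(closest)  # -> "red M&M"
```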
It is this mathematical representation that is at the foundation of the cognitive ability that lets generative AI and machine learning models, whether for natural language processing, image generation, or chatbots, collate neuron-like inputs and make decisions on them. A single embedding is like a neuron: just as a single neuron doesn’t make a brain, a single embedding doesn’t make an AI system. The more embeddings there are, and the more relationships among them, the more complex the cognitive abilities that become possible. When we group large volumes of embeddings into a single repository that provides fast, scalable access, much like a brain does, we get a vector database, such as DataStax Astra, powered by Apache Cassandra.
But to truly understand what a vector embedding is and the value it provides to generative AI, we must understand how embeddings can be used, how they are created, and what types of data they can represent.
Example: Using Vector Embeddings
One of the challenges we have with vector embeddings is that they can represent almost any kind of data. Most data types used in computer science and programming languages represent a narrow, well-defined form of data: chars were designed to represent characters, ints to represent whole numbers, floats to represent numbers with decimal points. New data types like strings and arrays were built to extend these base types, but they still tend to represent only a specific kind of data.
On the surface, a vector data type seems to be just an extension of an array that allows the array to be multidimensional and to have directionality when graphed. The biggest advancement with vectors, however, is the realization that functionally any type of data can be represented as a vector, and, more importantly, that one piece of data can then be compared to another and their similarity mapped within that multidimensional space.
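As a rough illustration of what comparing data as vectors looks like in practice, here is a short sketch that measures similarity between hypothetical embeddings using cosine similarity; the vectors themselves are made up.

```python
# A small sketch of comparing two pieces of data once both are represented as
# vectors; cosine similarity measures how closely their directions align.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 5-dimensional embeddings of two sentences and one image.
sentence_a = [0.9, 0.1, 0.4, 0.0, 0.3]
sentence_b = [0.8, 0.2, 0.5, 0.1, 0.2]
image_c    = [0.1, 0.9, 0.0, 0.8, 0.1]

print(cosine_similarity(sentence_a, sentence_b))  # close to 1.0 -> similar
print(cosine_similarity(sentence_a, image_c))     # much lower   -> dissimilar
```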
Okay, something we have to address here: even after writing the above, it still feels like word soup. What does all of this mean? One of the best ways to really understand what a vector is and how it can be used is to look at one of the early implementations, Word2Vec, introduced by Google in 2013.
Word2Vec is a technique that takes words as input, converts them into vectors, and uses those vectors to create graphs where clusters of similar words can be visualized.
The way Word2Vec works is that each word is mapped to an n-dimensional coordinate, a vector. A simple illustration might use only a handful of dimensions, but true vector mappings can have hundreds or thousands of dimensions, far too many for our minds to visualize or comprehend. It is this high-dimensional data that gives machine learning models the ability to correlate and plot data points for things like semantic search or vector search.
If we plot these word vectors, we can see how certain words naturally get grouped together based on aspects of similarity. Bunny and rabbit sit more closely to each other than either does to hamster, and all three words, bunny, rabbit, and hamster, are grouped more closely to one another than any of them is to hutch. It is this positioning and directionality within the n-dimensional space that allows neural networks to perform operations like nearest neighbor search.
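If you want to experiment with this yourself, the gensim library provides a Word2Vec implementation. The sketch below assumes gensim is installed and trains on a toy corpus that is far too small to produce meaningful vectors, but it shows the shape of the workflow.

```python
# A minimal Word2Vec sketch using the gensim library (assumes `pip install gensim`);
# the toy corpus here is only illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "bunny", "lives", "in", "a", "hutch"],
    ["the", "rabbit", "lives", "in", "a", "hutch"],
    ["the", "hamster", "lives", "in", "a", "cage"],
]

# vector_size controls how many dimensions each word embedding has.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["rabbit"][:5])                   # first few dimensions of the vector
print(model.wv.most_similar("rabbit", topn=3))  # nearest words in the embedding space
```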
So how does this get applied? One of the easiest ways to visualize it is in recommendation engines. Say you are streaming your favorite show. If I take the qualities and aspects of that show and vectorize them, and then take the qualities and aspects of all other shows and vectorize them, I can use those vectors to find other shows that are closely related to the show I am watching based on where they sit in the vector space. With machine learning and AI, the more shows I watch and like, the more the system learns about which areas of the n-dimensional graph I am interested in, and it makes recommendations for my tastes based on those qualities.
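A toy version of that recommendation idea might look like the following sketch; the shows and their "quality" vectors are invented for illustration.

```python
# A toy recommendation sketch: rank shows by cosine similarity to the show being
# watched. The shows and their 4-dimensional "quality" vectors are made up.
import numpy as np

catalog = {
    "Space Opera A": np.array([0.9, 0.1, 0.8, 0.2]),
    "Space Opera B": np.array([0.8, 0.2, 0.9, 0.1]),
    "Baking Show":   np.array([0.1, 0.9, 0.0, 0.7]),
}
watching = np.array([0.85, 0.15, 0.85, 0.15])  # vector of the show currently playing

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

recommendations = sorted(catalog, key=lambda s: cosine(catalog[s], watching), reverse=True)
print(recommendations)  # the two space operas rank above the baking show
```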
Another example of how this can be applied is search. Take Google’s reverse image search. Doing a reverse image search with vectors is extremely fast because, when an image is given as input, the search engine can turn it into a vector, use vector search to find the place in the n-dimensional space where that image belongs, and provide the user with any additional metadata it has around that image.
The applications of data vectorization are truly limitless at this point. Once data is turned into vectors, fraud and anomaly detection become possible. Data processing, transformation, and mapping can be done as part of a machine learning model. Chatbots can be fed product documentation and provide a natural language interface for users trying to figure out how to use a specific feature.
Vector embeddings are a core component of enabling machine learning and AI. Once data is turned into vectors, we need to store all of those vectors in a highly scalable, highly performant repository called a vector database. Once data has been transformed and stored as vectors, it can power many different vector search use cases.
Creating Vector Embeddings
So what goes into a vector embedding and how is it created? Creating a vector embedding starts with a discrete data point that gets transformed into a vector representation in a high-dimensional space. It is probably easiest to visualize in a low-dimensional 3D space for our purposes. Let’s say we have three discrete data points: the word cat, the word duck, and the word mudskipper. We are going to evaluate those words on whether the animal walks, swims, or flies. Take the word cat. A cat primarily walks, so let’s assign a value of 3 for walking; a cat can swim but most cats don’t like to, so let’s assign a value of 1 for swimming; and I don’t know of any cats that can fly on their own, so let’s assign a value of 0 for flying.
So the data points for cat are:
Cat: (Swims: 1, Flies: 0, Walks: 3)
If we do the same for the word duck and mudskipper (a type of fish that can walk on land) we get:
Duck: (Swims: 2, Flies: 2, Walks: 2)
Mudskipper: (Swims: 3, Flies: 0, Walks: 1)
From this mapping we can then plot each word in a three-dimensional chart, and the point that is created, along with its direction from the origin, is the vector embedding: Cat [1, 0, 3], Duck [2, 2, 2], Mudskipper [3, 0, 1].
Once all our discrete objects (words) have been transformed into vectors, we can see how closely they relate to each other based on semantic similarity. For example, it is very easy to see that all three words have a value along the walking axis because all of the animals can walk. The true power of things like machine learning comes in when you look at the vector representations across different planes of the graph. For example, if we compare the animals only on walking and swimming, we can see that a cat is more closely related to a duck than to a mudskipper.
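Here is that comparison worked out in a short sketch, using only the swimming and walking dimensions of the example vectors above.

```python
# Reproducing the cat/duck/mudskipper example: distances on the swim/walk plane
# show that, for these two traits, a cat sits closer to a duck than to a mudskipper.
import math

# Order of dimensions: (swims, flies, walks)
animals = {
    "cat":        (1, 0, 3),
    "duck":       (2, 2, 2),
    "mudskipper": (3, 0, 1),
}

def distance_2d(a, b, dims=(0, 2)):
    """Euclidean distance using only the chosen dimensions (swims and walks)."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))

print(distance_2d(animals["cat"], animals["duck"]))        # ~1.41
print(distance_2d(animals["cat"], animals["mudskipper"]))  # ~2.83
```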
In our example we only have a three-dimensional space, but a true vector embedding spans an n-dimensional space with many more dimensions. It is this multidimensional representation that machine learning models and neural networks use to make decisions, and that enables approximate nearest neighbor search patterns such as Hierarchical Navigable Small World (HNSW) graphs.
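At real-world scale, comparing a query against every stored vector becomes too slow, which is where approximate nearest neighbor indexes come in. The sketch below uses the hnswlib library (one possible choice, assumed to be installed) over random stand-in vectors rather than real embeddings.

```python
# A small sketch of approximate nearest neighbor search over many embeddings
# using hnswlib (assumes `pip install hnswlib`); random vectors stand in for
# real embeddings.
import numpy as np
import hnswlib

dim, num_vectors = 128, 10_000
data = np.random.rand(num_vectors, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(data, np.arange(num_vectors))
index.set_ef(50)  # trade-off between recall and query speed

query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # ids and distances of the 5 nearest vectors
print(labels, distances)
```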
When creating vector embeddings there are two approaches that can be taken: feature engineering, which uses domain knowledge and expertise to quantify the set of “features” that define each dimension of the vector, or training a deep neural network model to translate objects into vectors. Trained models tend to be the more common approach, because feature engineering, while it provides a detailed understanding of the domain, takes too much time and expense to scale, whereas trained models can generate densely populated, high-dimensional (thousands of dimensions) vectors.
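The difference between the two approaches is easy to see side by side. The sketch below contrasts a hand-engineered three-dimensional vector with an embedding produced by a trained model; the sentence-transformers library and its all-MiniLM-L6-v2 model are one possible choice here, not the only one, and are assumed to be installed.

```python
# Contrasting feature engineering with a trained embedding model (assumes
# `pip install sentence-transformers`; the model downloads on first use).
from sentence_transformers import SentenceTransformer

# Feature engineering: a domain expert decides the dimensions by hand.
# Here: (swims, flies, walks) for the word "duck".
hand_engineered = [2, 2, 2]

# Trained model: a deep neural network produces a dense, high-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
learned = model.encode("duck")

print(len(hand_engineered))  # 3 dimensions, each with a human-readable meaning
print(learned.shape)         # (384,) dimensions, none individually interpretable
```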
Pre-trained models
Pre-trained models are models that have been created to solve a general problem and can be used as-is or as a starting point for solving more specific problems. There are many examples of pre-trained models available for different types of data. BERT, Word2Vec, and ELMo are some of the many models available for text data. These models have been trained on very large datasets and transform words, sentences, and entire paragraphs and documents into vector embeddings. But pre-trained models are not limited to text data; a number of pre-trained models are generally available for image and audio data as well, such as Inception, which uses a convolutional neural network (CNN), and Dall-E 2, which uses a diffusion model.
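As a quick illustration of reusing a pre-trained text model rather than training your own, gensim's downloader can fetch published word vectors. The model name below refers to gensim's hosted copy of the Google News Word2Vec vectors, and loading it involves a multi-gigabyte download on first use.

```python
# A sketch of loading pre-trained Word2Vec vectors instead of training a model.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # pre-trained Word2Vec KeyedVectors
print(vectors["rabbit"][:5])                    # first few of its 300 dimensions
print(vectors.most_similar("rabbit", topn=3))   # nearest neighbors in the pre-trained space
```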
One of the key opportunities that vector embeddings provide is the ability to represent virtually any type of data in the same form. There are many current examples where text and image embeddings are heavily used to create solutions like natural language processing (NLP) chatbots built with tools like GPT-4, or generative image processors like Dall-E 2.
Text Embeddings
Text embeddings are probably the easiest ones to understand, and we have been using them as the foundation for most of our examples. Text embeddings start with a data corpus of text-based objects; models like Word2Vec are trained on large datasets drawn from sources like Wikipedia. But text embeddings can be used with pretty much any text-based dataset that you want to quickly and easily search for nearest neighbor or semantically similar results.
For example, say you want to create an NLP chatbot to answer questions about your product. You can use text embeddings of the product documentation and product FAQs so that the chatbot can respond to the questions users pose. Or what about feeding in all of the cookbooks you have collected over the years as the data corpus and using that data to suggest recipes based on the ingredients you have in your pantry? What text embeddings bring is the ability to take unstructured data like words, paragraphs, and documents and represent it in a structured form.
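A minimal sketch of that FAQ idea, again assuming the sentence-transformers library, might look like this; the FAQ entries and the question are made up.

```python
# A minimal semantic-search sketch over product FAQs.
from sentence_transformers import SentenceTransformer, util

faqs = [
    "How do I reset my password?",
    "What payment methods do you accept?",
    "How do I cancel my subscription?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
faq_vectors = model.encode(faqs)

question = "I forgot my login credentials, what should I do?"
scores = util.cos_sim(model.encode(question), faq_vectors)[0]
print(faqs[int(scores.argmax())])  # -> "How do I reset my password?"
```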
Image Embeddings
Image embeddings, like text embeddings, can represent multiple different aspects of images. From full images down to individual pixels, image embeddings provide the ability to capture the set of features an image has and present those features mathematically for analysis by machine learning models or for use by image generators like Dall-E 2.
Probably the most common uses of image embeddings are classification and reverse image search. Say, for example, I have a picture of a snake that was taken in my backyard, and I would like to know what type of snake it is and whether it is venomous. With a large data corpus of all the different types of snakes, I can feed the embedding of my image into the vector database holding the embeddings of the known snake images and find the closest neighbors to my image. From that semantic search I can pull all the “attributes” of the nearest neighbor image and determine what kind of snake it is and whether I should be concerned.
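One hedged sketch of that flow: use a pre-trained CNN from torchvision to turn an image into an embedding and compare it against a placeholder database of snake vectors. The file name, the species database, and the choice of ResNet-50 are all assumptions for illustration.

```python
# Turning an image into an embedding with a pre-trained CNN and finding its
# nearest neighbor. Assumes torch/torchvision are installed, "my_snake.jpg"
# exists locally, and snake_db was built the same way from labeled snake images.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()   # drop the classifier head, keep 2048-d features
model.eval()

preprocess = weights.transforms()
image = preprocess(Image.open("my_snake.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    query = model(image)[0]      # the image embedding

# snake_db: species name -> 2048-d vector, built offline (random placeholders here).
snake_db = {"garter snake": torch.randn(2048), "copperhead": torch.randn(2048)}
best = max(snake_db, key=lambda name: float(torch.cosine_similarity(query, snake_db[name], dim=0)))
print(best)
```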
Another example of how image embeddings can be used is automated image editing, such as Magic Editor in Google Photos, which lets generative AI make edits to specific parts of an image, like removing people from the background or improving the composition.
Product Embeddings
Another example of how vector embeddings can be used is in recommendation engines. Product embeddings can represent anything from movies and songs to shampoo. With product embeddings, e-commerce sites can observe shoppers’ behavior through search results, clickstreams, and purchase patterns, and make recommendations based on semantic similarity, including recommendations for new or niche products.

Say, for example, that I visit my favorite online retailer. I am perusing the site and adding a bunch of things to my cart for the new puppy I just got: puppy food that I am running low on, a new leash, a dog bowl, and a water dish. I then search for tennis balls because I want my new puppy to have some toys to play with. Now, am I really interested in tennis balls, or in dog toys? If I were at my local pet store and somebody was helping me, they would clearly see that I am not really interested in tennis balls; I am actually interested in dog toys. What product embeddings bring is the ability to glean this information from my shopping session, use the vector embeddings generated for each of those dog-focused products, and predict that what I am actually looking for is dog toys, not tennis balls.
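The puppy-shopping scenario can be sketched in a few lines: average the cart item embeddings into a rough "shopper intent" vector and score candidate products against it. All of the product vectors here are invented.

```python
# A toy sketch of the puppy-shopping scenario.
import numpy as np

# Hypothetical 3-d product embeddings: (dog-related, sports-related, food-related)
cart = {
    "puppy food": np.array([0.9, 0.0, 0.9]),
    "leash":      np.array([0.95, 0.05, 0.0]),
    "dog bowl":   np.array([0.9, 0.0, 0.5]),
}
candidates = {
    "dog chew toy":         np.array([0.9, 0.1, 0.0]),
    "tennis balls (sport)": np.array([0.1, 0.95, 0.0]),
}

intent = np.mean(list(cart.values()), axis=0)  # rough "shopper intent" vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, vec in candidates.items():
    print(name, round(cosine(intent, vec), 2))  # the dog toy scores higher
```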
How to Get Started with Vector Embeddings
The concepts behind vector embeddings can be pretty overwhelming. Any time we try to reason about n-dimensional spaces and use them to find semantic similarity, it is going to be a challenge. Thankfully, there are a number of tools available for creating vector embeddings, like Word2Vec, CNN-based models, and many others, that can take your data and turn it into vectors.
What you do with that data, how it is stored and accessed and how it is updated is where the real challenge lies.
While this may sound complex, Vector Search on Astra DB takes care of all of this for you with a fully integrated solution that provides all of the pieces you need for contextual data built for AI: from the digital nervous system, Astra Streaming, with data pipelines that provide inline vector embeddings, all the way to real-time, large-volume storage, retrieval, access, and processing via Astra DB, the most scalable vector database on the market today, in an easy-to-use cloud platform. Try it for free today.
What is a vector embedding?
A vector embedding represents data as points in n-dimensional space, grouping similar data points together. This representation is crucial for machine learning models to learn and grow effectively.
How are vector embeddings created?
They begin with a discrete data point transformed into a vector representation in high-dimensional space. The data points are evaluated based on specific attributes, which are then assigned numerical values to create the vector.
What are some applications of vector embeddings?
Vector embeddings are employed in recommendation engines, semantic search, fraud/anomaly detection, and enabling chatbots to provide natural language interfaces. They are fundamental in powering machine learning and AI systems by storing and comparing data effectively.
How do vector embeddings aid in data comparison?
By representing data as vectors, similarity among data points can be mapped within multidimensional planes, facilitating effective comparison and identification of relationships between different data pieces.
How do vector embeddings contribute to machine learning models?
They provide a mathematical representation of data, enabling machine learning models to correlate and plot data points for tasks like semantic search, enhancing their ability to process and analyze data.
How do vector embeddings support generative AI?
Vector embeddings are foundational in giving generative AI the ability to collate neurological-like inputs and make decisions, underpinning the cognitive abilities required for tasks like natural language processing and image generation.
How do vector embeddings enable semantic search?
By mapping words or other data into vectors in a multi-dimensional space, semantic relationships can be identified, enabling semantic search which finds items based on meaning and context rather than exact matches.