Guide | Aug 22, 2023

# What are Vector Embeddings? Understanding the Foundation of Data Representation

Vector embeddings transform data into numerical vectors, powering the cognitive capabilities of AI. But how do they enable machines to 'learn' and 'grow'?

## What are Vector Embeddings?

Vector embeddings are numerical representations of data that capture semantic relationships and similarities, making it possible to perform mathematical operations and comparisons on the data for tasks like text analysis and recommendation systems.

With all the talk about generative AI, the concepts behind what powers generative AI can be a little overwhelming.  In this article we are going to focus on the one functional concept that powers the underlying cognitive ability of AI and provides machine learning models the ability to learn and grow, a vector embedding.

A vector embedding is, at its core, a way to represent a piece of data as a list of numbers. Google’s definition of a vector embedding is “a way of representing data as points in n-dimensional space so that similar data points cluster together”. For people who have strong backgrounds in mathematics, I am sure these words make perfect sense, but for those of us who struggle with visual representations of mathematical concepts, this might sound like gibberish.

## Understanding Vector Embeddings and the Mathematical Representation of Data

So let’s look at this another way. Let’s say you have a bowl full of M&Ms that you like to snack on, and your youngest offspring decides to be cheeky and mix in a bowl full of Skittles. For those not familiar with these two things, M&Ms and Skittles are two colorful candy-shelled treats that look very similar, but one is chocolate and the other is fruit-flavored, two flavors that don’t mix well. To remedy the situation we need to sort the candies, and we decide to sort them by type and color. All the green M&Ms go together in one pile and all the green Skittles go together in another pile, likewise for the red M&Ms and the red Skittles, and so on. When we are done we have distinct piles of M&Ms and Skittles separated by color, and we can arrange them visually so that we can quickly see where a new candy falls.

You can see that in sorting our candies, we have already started to lay out patterns and groupings that make it easier to relate the candies to each other and to find the right pile when sorting new candies as we find them. A vector embedding takes this visual arrangement and gives each position a mathematical representation. A simple way to think about this is to assign each position a different value.

With our candies, it is now possible to assign a value to each candy based on its attributes and put a new candy into the correct position based on that number. This is ultimately what a vector embedding is, albeit with far more complexity.

It is this mathematical representation that is at the foundation of the cognitive ability that gives generative AI and machine learning applications like natural language processing, image generation, and chatbots the ability to collate neurological-like inputs and make decisions on them. A single embedding is like a neuron: just as a single neuron doesn’t make a brain, a single embedding doesn’t make an AI system. The more embeddings there are, and the more relationships those embeddings have, the more complex the cognitive abilities they can support. When we group large volumes of embeddings into a single repository that provides fast, scalable access, like a brain, it is called a vector database, such as DataStax Astra DB, powered by Apache Cassandra.

But to truly understand what a vector embedding is and the profound value it provides to generative AI, we must understand how embeddings can be used, how they are created, and what types of data they can represent.

## Example: Using Vector Embeddings

One of the challenges we have with vector embeddings is that they can represent almost any kind of data. If you look at most data types used in computer science and programming languages, they each represent a specific, limited form of data. Chars were designed to represent characters, ints were designed to represent whole numbers, and floats were designed to represent numbers with a fractional part. New data types like strings and arrays have been created to build on these base data types; however, those types still tend to represent only one specific kind of data.

On the surface a vector data type seems to be just an extension of an array that allows for the array to be multidimensional as well as provide directionality when graphed.  The biggest advancement with vectors however is the realization that functionally any type of data can be represented as a vector, and more importantly that data can be compared to other pieces of data and similarity can ultimately be mapped within those multidimensional planes.

Okay, something we have to address here is that even after writing the above, it still feels like word soup. What does all of this mean? I think the best way to really understand what a vector is and how it can be used is to look at one of the early implementations, Word2Vec, which was invented by Google in 2013.

Word2Vec is a technique used to take words as input, convert them into vectors and use those vectors to create graphs where clusters of synonyms are visualized.

The way Word2Vec functionally works is that each word is mapped to an n-dimensional coordinate, or vector. In our above example we have a 5-dimensional coordinate mapping; true vector mappings can have hundreds or thousands of dimensions, far too many for our minds to visualize or comprehend. It is this high-dimensional data that gives machine learning models the ability to correlate and plot the data points for things like semantic search or vector search.

In our above diagram, you can see how certain words naturally get grouped together based on aspects of similarity. Bunny and rabbit are more closely related to each other compared to hamster, and all three words, bunny, rabbit, and hamster are more closely grouped together based on vector properties to each other compared to hutches.  It is this directionality within the n-dimensional space that allows neural networks to process functionality like nearest neighbor search.
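To make the grouping concrete, here is a minimal sketch of a nearest neighbor lookup using cosine similarity, which compares the direction of two vectors. The 5-dimensional values below are made up for illustration and are not real Word2Vec output; a trained model would supply the actual numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 5-dimensional word vectors (illustrative values only).
words = {
    "bunny": [0.9, 0.8, 0.1, 0.7, 0.2],
    "rabbit": [0.85, 0.75, 0.15, 0.7, 0.25],
    "hamster": [0.6, 0.9, 0.1, 0.2, 0.3],
    "hutch": [0.1, 0.2, 0.9, 0.3, 0.1],
}


def nearest_neighbors(query, k=3):
    """Rank every other word by similarity to the query word."""
    scores = [
        (cosine_similarity(words[query], vector), word)
        for word, vector in words.items()
        if word != query
    ]
    scores.sort(reverse=True)
    return [word for _score, word in scores[:k]]


print(nearest_neighbors("bunny"))
```

With these sample values, rabbit ranks closest to bunny, then hamster, then hutch, mirroring the grouping described above. A brute-force scan like this is fine for a handful of vectors; vector databases use indexes to avoid comparing against every stored vector.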

## What are the Applications of Vector Embeddings?

So how does this get applied? One of the easiest ways to visualize this is in recommendation engines. Say you are streaming your favorite show. If I take the qualities and aspects of that show and vectorize them, and then do the same for all other shows, I can use those vectors to find other shows that are closely related to the one you are watching, based on directionality. With machine learning and AI, the more shows you watch and like, the more the system learns about which areas of the n-dimensional graph you are interested in, and the better it can tailor recommendations to your tastes.
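As a rough sketch of that idea, the snippet below averages the vectors of shows a viewer liked into a taste profile and ranks the remaining shows by cosine similarity to it. The show titles, the three feature dimensions (comedy, drama, sci-fi), and the averaging strategy are all assumptions made for illustration; production recommenders are considerably more involved.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical show vectors: (comedy, drama, sci-fi).
shows = {
    "Space Saga": [0.1, 0.4, 0.9],
    "Galaxy Patrol": [0.2, 0.3, 0.95],
    "Office Laughs": [0.9, 0.2, 0.0],
    "Courtroom Days": [0.1, 0.9, 0.1],
    "Star Drifters": [0.15, 0.5, 0.85],
}


def recommend(liked, k=2):
    """Average the liked shows into a profile, then rank unseen shows against it."""
    liked_vectors = [shows[title] for title in liked]
    profile = [sum(column) / len(liked_vectors) for column in zip(*liked_vectors)]
    candidates = [
        (cosine_similarity(profile, vector), title)
        for title, vector in shows.items()
        if title not in liked
    ]
    candidates.sort(reverse=True)
    return [title for _score, title in candidates[:k]]


print(recommend(["Space Saga", "Galaxy Patrol"]))
```

Each additional liked show shifts the profile vector, which is one simple way the system can learn which region of the space a viewer prefers.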

Another example of how this can be applied is search. Take, for example, Google’s reverse image search. Doing a reverse image search with vectors is extremely fast and easy because when an image is given as input, the search engine can turn it into a vector, use vector search to find the specific place in the n-dimensional graph where the image belongs, and provide the user with any additional metadata it has about that image.

The applications of data vectorization are nearly limitless at this point. Once data is turned into vectors, things like fraud or anomaly detection can be done. Data processing, transformation, and mapping can be done as part of a machine learning model. Chatbots can be fed product documentation and provide a natural language interface for users trying to figure out how to use a specific feature.
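One simple way to picture anomaly detection on vectors is to flag points that sit unusually far from the center of the data. The sketch below does that with a median-based cutoff; the sample "transaction" vectors and the factor of 2.0 are arbitrary assumptions, and real fraud systems use far richer models.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_anomalies(vectors, factor=2.0):
    """Return indexes of vectors farther from the centroid than factor x the median distance."""
    center = centroid(vectors)
    distances = [distance(vector, center) for vector in vectors]
    cutoff = factor * statistics.median(distances)
    return [index for index, d in enumerate(distances) if d > cutoff]


# Hypothetical 2-dimensional transaction embeddings; the last one is far from the rest.
transactions = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [5.0, 5.0]]
print(find_anomalies(transactions))
```

With this sample data only the last transaction (index 4) exceeds the cutoff.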

Vector embeddings are the core component of enabling machine learning and AI.  Once data is turned into vectors we need to store all the vectors in a highly scalable, highly performant repository called a vector database.  Once data has been transformed and stored as vectors that data can now power multiple different vector search use-cases.

## Creating Vector Embeddings

So what goes into a vector embedding and how is it created? Creating a vector embedding starts with a discrete data point that gets transformed into a vector representation in a high-dimensional space. It is probably easiest to visualize in a low-dimensional 3D space for our purposes. Let’s say we have three discrete data points: the word cat, the word duck, and the word mudskipper. We are going to evaluate those words on whether they walk, swim, or fly. Let’s take the word cat, for example. A cat primarily walks, so let's assign a value of 3 for walking. A cat can swim, but most cats don’t like to swim, so let’s assign a value of 1 for that. Lastly, I don’t know of any cats that can fly on their own, so let's assign a value of 0 for that.

So the data points for cat are:

Cat: (Swims - 1, Flies - 0, Walks - 3)

If we do the same for the words duck and mudskipper (a type of fish that can walk on land) we get:

Duck: (Swims - 2, Flies - 2, Walks - 2)

Mudskipper: (Swims - 3, Flies - 0, Walks - 1)

From this mapping we can then plot each word into a 3-dimensional chart, with the values in the order (swims, flies, walks), and the line that is created is the vector embedding: Cat [1,0,3], Duck [2,2,2], Mudskipper [3,0,1].

Once all our discrete objects (words) have been transformed into vectors, we can see how closely they relate to each other based on semantic similarity. For example, it is easy to see that all three words have a non-zero value on the walking axis (the z-axis here) because all of the animals can walk. Where the true power of things like machine learning comes in is when you look at the vector representations across planes of the graph. For example, if we compare the animals on walking and swimming, we can see that a cat is more closely related to a duck than to a mudskipper.
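The comparison on the walking and swimming plane can be checked with a short calculation. This sketch uses plain Euclidean distance as the measure of closeness, which is one reasonable choice among several (cosine similarity is another).

```python
import math

# Order of dimensions follows the article: (swims, flies, walks).
animals = {
    "cat": (1, 0, 3),
    "duck": (2, 2, 2),
    "mudskipper": (3, 0, 1),
}


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def swim_walk_plane(vector):
    """Drop the 'flies' dimension to compare only swimming and walking."""
    swims, _flies, walks = vector
    return (swims, walks)


cat_duck = euclidean(swim_walk_plane(animals["cat"]), swim_walk_plane(animals["duck"]))
cat_mud = euclidean(swim_walk_plane(animals["cat"]), swim_walk_plane(animals["mudskipper"]))
print(f"cat-duck: {cat_duck:.2f}, cat-mudskipper: {cat_mud:.2f}")
# prints "cat-duck: 1.41, cat-mudskipper: 2.83"
```

The smaller distance between cat and duck is what "more closely related" means numerically in this plane.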

In our example, we only have a 3-dimensional space but with a true vector embedding the vector spans an N-dimensional space.  It is this multidimensional representation that is used by machine learning and neural networks to make decisions and enable Hierarchical Nearest-Neighbor search patterns.

When creating vector embeddings, two approaches can be taken: feature engineering, which takes domain knowledge and expertise to quantify the set of “features” that define the different dimensions of the vector, or training a deep neural network to translate objects into vectors. Trained models tend to be the most common approach: feature engineering provides a detailed understanding of the domain but takes too much time and expense to scale, while trained models can generate densely populated, high-dimensional vectors (with thousands of dimensions).
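The feature engineering approach can be sketched as a hand-written function in which a person decides which attributes matter and how to score them. The feature names and the 0 to 3 scoring scale below come from the animal example; the function itself is a hypothetical helper, and the trained-model approach is not shown because it depends on a specific library and training data.

```python
# Hand-built features: a domain expert chooses the dimensions and their order.
FEATURES = ("swims", "flies", "walks")


def engineer_features(scores: dict[str, int]) -> list[int]:
    """Turn expert-assigned scores into a fixed-order vector; missing scores default to 0."""
    return [scores.get(name, 0) for name in FEATURES]


cat = engineer_features({"walks": 3, "swims": 1})
duck = engineer_features({"swims": 2, "flies": 2, "walks": 2})
print(cat, duck)
```

Every new attribute means another manual scoring pass over every object, which is the scaling cost that makes learned embeddings attractive.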

### Pre-trained models

Pre-trained models are models that have been created to solve a general problem and can be used as is or as a starting point for solving complex, specific problems. There are many examples of pre-trained models available for different types of data. BERT, Word2Vec, and ELMo are some of the many models available for text data. These models have been trained on very large datasets and transform words, sentences, and entire paragraphs and documents into vector embeddings. But pre-trained models are not limited to text data. Image and audio data have some pre-trained models generally available as well, such as Inception, which uses a convolutional neural network (CNN), and Dall-E 2, which uses a diffusion model.

## What Types of Things Can Be Embedded?

One of the key opportunities that vector embeddings provide is the ability to represent any type of data as a vector embedding. There are many current examples where text and image embeddings are being heavily used to create solutions like natural language processing (NLP) chatbots using tools like GPT-4 or generative image processors like Dall-E 2.

### Text Embeddings

Text embeddings are probably the easiest ones to understand, and we have been using text embeddings as the foundation for most of our examples. Text embeddings start with a data corpus of text-based objects, so models like Word2Vec are trained on large datasets from sources like Wikipedia. But text embeddings can be used for pretty much any type of text-based dataset that you want to quickly and easily search for nearest neighbor or semantically similar results.

For example, say you want to create an NLP chatbot to answer questions about your product, you can use text embeddings of the product documentation and product FAQs that can allow for the chatbot to respond to queries based on questions posed.  Or what about feeding all of the cookbooks you have collected over the years as the data corpus and using that data to provide recipes based on all the ingredients you have in your pantry? What text embeddings bring is the ability to take unstructured data like words, paragraphs and documents and represent them in a structured form.
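Here is a toy version of the FAQ chatbot lookup. To stay self-contained it uses word counts as a stand-in for a real text embedding model, which is a significant simplification: counts only match shared words, whereas trained embeddings can match on meaning. The FAQ entries and the question are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


# Hypothetical product FAQ entries.
faq = [
    "How do I reset my password",
    "How do I export my data to CSV",
    "Which payment methods are supported",
]

# The vocabulary fixes the dimensions of the vector: one per known word.
vocabulary = sorted({token for entry in faq for token in tokenize(entry)})


def embed(text):
    """Stand-in embedding: count how often each vocabulary word appears."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(question):
    """Return the FAQ entry whose vector is most similar to the question's."""
    question_vector = embed(question)
    return max(faq, key=lambda entry: cosine_similarity(question_vector, embed(entry)))


print(best_match("I forgot my password and need to reset it"))
```

Swapping `embed` for a call to a trained model would keep the rest of the lookup unchanged, which is why the unstructured-to-structured step is the important one.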

### Image Embeddings

Image embeddings like text embeddings can represent multiple different aspects of images.  From full images down to individual pixels, image embeddings provide the ability to classify the set of features an image has, and present those features mathematically for analysis by machine learning models or for use by image generators like Dall-E 2.

Probably one of the most common uses of image embeddings is classification and reverse image search. Say, for example, I have a picture of a snake that was taken in my backyard and I would like to know what type of snake it is and if it is venomous. With a large data corpus of all the different types of snakes, I can feed the image of my snake into the vector database of all the vector embeddings of snakes and find the nearest neighbor to my image. From that semantic search I can pull all the “attributes” of the nearest-neighbor image and determine what kind of snake it is and if I should be concerned.

Another example of how vector embeddings can be used is automated image editing like Google Magic Photo Editor which allows for images to be edited by generative AI making edits to specific parts of an image like removing people from the background or adding better composition.

### Product Embeddings

Another example of how vector embeddings can be used is in recommendation engines. Product embeddings can be anything from movies and songs to shampoos. With product embeddings, e-commerce sites can observe shoppers' behaviors through search results, click stream, and purchase patterns and make recommendations based on semantic similarity, including recommendations for new or niche products. Say, for example, that I visit my favorite online retailer. I am perusing the site and adding a bunch of things to my cart for the new puppy I just got. I add puppy food that I am running low on, a new leash, a dog bowl, and a water dish. I then search for tennis balls because I want my new puppy to have some toys to play with. Now am I really interested in tennis balls or dog toys? If I was at my local pet store and somebody was helping me, they would clearly see that I am not really interested in tennis balls; I am actually interested in dog toys. What product embeddings bring is the ability to glean this information from my purchase experience, use vector embeddings generated for each of those products, focused on dogs, and predict what I am actually looking for, which is dog toys, not tennis balls.
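The tennis ball scenario can be sketched by blending the search query vector with the average of the cart's product vectors before ranking. Everything here is assumed for illustration: the three dimensions (pet, sports, toy), the product vectors, and the 50/50 blending weight.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical product vectors: (pet, sports, toy).
catalog = {
    "Championship tennis balls": [0.0, 1.0, 0.3],
    "Dog fetch ball": [0.9, 0.3, 0.8],
    "Squeaky chew toy": [1.0, 0.0, 0.9],
}
cart = [
    [1.0, 0.0, 0.0],  # puppy food
    [1.0, 0.0, 0.1],  # leash
    [1.0, 0.0, 0.0],  # dog bowl
    [0.9, 0.0, 0.0],  # water dish
]
tennis_balls_query = [0.1, 0.9, 0.4]


def rank(query, context=None, context_weight=0.5):
    """Rank catalog items by similarity to the query, optionally nudged by cart context."""
    if context:
        mean = [sum(column) / len(context) for column in zip(*context)]
        query = [(1 - context_weight) * q + context_weight * m for q, m in zip(query, mean)]
    return sorted(catalog, key=lambda name: cosine_similarity(query, catalog[name]), reverse=True)


print(rank(tennis_balls_query))        # query alone
print(rank(tennis_balls_query, cart))  # query plus cart context
```

With these sample numbers the query alone ranks the sports product first, while adding the cart context moves the dog fetch ball to the top.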

### Document Embeddings

Document embeddings extend the concept of text embeddings to larger bodies of text, such as entire documents or collections of documents. These embeddings capture the overall semantic meaning of a document, enabling tasks like document classification, clustering, and information retrieval. For instance, in a corporate setting, document embeddings can help categorize and retrieve relevant documents from a large internal repository based on their semantic content. They can also be used in legal tech for analyzing and comparing legal documents.

### Audio Embeddings

Audio embeddings translate audio data into a vector format. This process involves extracting features from audio signals - like pitch, tone, and rhythm - and representing them in a way that can be processed by machine learning models. Applications of audio embeddings include voice recognition, music recommendation based on sound features, and even emotion detection from spoken words. Audio embeddings are crucial in developing systems like smart assistants that can understand voice commands or apps that recommend music based on a user's listening history.

### Sentence Embeddings

Sentence embeddings represent individual sentences as vectors, capturing the meaning and context of the sentence. They are particularly useful in tasks such as sentiment analysis, where understanding the nuanced sentiment of a sentence is crucial. Sentence embeddings can also play a role in chatbots, enabling them to comprehend and respond to user inputs more accurately. Moreover, they are vital in machine translation services, ensuring that the translated sentences retain the original meaning and context.

### Word Embeddings

Word embeddings are the most granular form of text embeddings, where individual words are converted into vectors. These embeddings capture the semantic relationships between words, like synonyms and antonyms, and contextual usage. They are fundamental in natural language processing tasks such as text classification, language modeling, and synonym generation. Word embeddings are also used in search engines to enhance the relevance of search results by understanding the semantic meaning of the search queries.

## What are the Benefits of Vector Embeddings?

Vector embeddings offer a range of benefits that make them an invaluable tool in various fields of data science and artificial intelligence:

### Efficient Data Representation:

Vector embeddings provide a way to represent complex and high-dimensional data in a more manageable form. This efficient representation is crucial for processing and analyzing large datasets.

### Improved Machine Learning Model Performance:

Embeddings can significantly enhance the performance of machine learning models. They allow these models to understand the nuances and relationships in the data, leading to more accurate predictions and analyses.

### Semantic Understanding:

Especially in the context of text data, vector embeddings enable models to capture semantic meaning. They go beyond simple word matching, allowing the model to understand context and the relationships between words or phrases.

Tasks like natural language processing, image recognition, and audio analysis become more feasible with vector embeddings. They transform raw data into a format that machine learning algorithms can efficiently work with.

### Personalization and Recommendation Systems:

In e-commerce and content platforms, embeddings help in building sophisticated recommendation systems. They can identify patterns and preferences in user behavior, leading to more personalized and relevant recommendations.

### Data Visualization and Clustering:

Vector embeddings can be used to visualize high-dimensional data in lower dimensions. This is valuable in exploratory data analysis, where identifying clusters and patterns is essential.

### Cross-modal Data Applications:

Embeddings facilitate the interaction between different types of data, such as text and images, enabling innovative applications like cross-modal search and retrieval.

## What are the Potential Challenges and Limitations of Vector Embeddings?

While vector embeddings offer numerous benefits, they also come with their own set of challenges and limitations:

### Quality of Training Data:

The effectiveness of vector embeddings heavily relies on the quality of the training data. Biased or incomplete data can lead to embeddings that are skewed or inaccurate.

### High-Dimensional Space Management:

Managing and processing high-dimensional vector spaces can be computationally intensive. This can pose challenges in terms of both computational resources and processing time, particularly for very large datasets.
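A back-of-the-envelope calculation shows why. The sketch below estimates raw storage and the work of one brute-force query; the figures of one million vectors, 1,536 dimensions, and 4-byte floats are assumptions chosen only to make the arithmetic concrete.

```python
def raw_index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Storage for the raw vector values alone, ignoring any index overhead."""
    return num_vectors * dimensions * bytes_per_value


def brute_force_multiplications(num_vectors: int, dimensions: int) -> int:
    """Multiplications needed to take a dot product of one query with every stored vector."""
    return num_vectors * dimensions


size = raw_index_size_bytes(1_000_000, 1536)
work = brute_force_multiplications(1_000_000, 1536)
print(f"{size / 1e9:.2f} GB of raw vectors, {work:,} multiplications per query")
```

Both numbers grow linearly with the number of vectors and with the number of dimensions, which is why large collections typically rely on approximate indexes rather than exhaustive scans.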

### Loss of Information:

While embeddings condense information into a more manageable format, this process can sometimes lead to the loss of subtle nuances in the data. Important details might be overlooked or underrepresented in the embedding.

### Interpretability Issues:

Vector embeddings can be difficult to interpret, especially for those who are not experts in machine learning. This lack of transparency can be a hurdle in fields where understanding the decision-making process of AI models is crucial.

### Generalization Versus Specificity:

Finding the right balance between making embeddings general enough to be widely applicable, yet specific enough to be useful for particular tasks, can be challenging.

Understanding these challenges is important for anyone looking to implement vector embeddings in their systems or research. It helps in making informed decisions about how best to use this technology and in anticipating potential hurdles in its application.

## How to Get Started with Vector Embeddings

The concepts of vector embeddings can be pretty overwhelming. Anytime we try to visualize n-dimensional spaces and use them to find semantic similarity, it is going to be a challenge. Thankfully there are many tools available for creating vector embeddings, like Word2Vec, CNNs, and many others that can take your data and turn it into vectors.

What you do with that data, how it is stored and accessed and how it is updated is where the real challenge lies.

While this may sound complex, Vector Search on Astra DB takes care of all of this for you with a fully integrated solution that provides all of the pieces you need for contextual data built for AI. It spans everything from the digital nervous system, Astra Streaming, built on data pipelines that provide inline vector embeddings, all the way to real-time, large-volume storage, retrieval, access, and processing via Astra DB, the most scalable vector database on the market today, in an easy-to-use cloud platform. Try for free today.

### What is a vector embedding?

A vector embedding represents data as a point in n-dimensional space, so that similar data points are grouped together. This representation is crucial for machine learning models to learn and grow effectively.

### How are vector embeddings created?

They begin with a discrete data point transformed into a vector representation in high-dimensional space. The data points are evaluated based on specific attributes, which are then assigned numerical values to create the vector.

### What are some applications of vector embeddings?

Vector embeddings are employed in recommendation engines, semantic search, fraud/anomaly detection, and enabling chatbots to provide natural language interfaces. They are fundamental in powering machine learning and AI systems by storing and comparing data effectively.

### How do vector embeddings aid in data comparison?

By representing data as vectors, similarity among data points can be mapped within multidimensional planes, facilitating effective comparison and identification of relationships between different data pieces.

### How do vector embeddings contribute to machine learning models?

They provide a mathematical representation of data, enabling machine learning models to correlate and plot data points for tasks like semantic search, enhancing their ability to process and analyze data.

### How do vector embeddings support generative AI?

Vector embeddings are foundational in giving generative AI the ability to collate neurological-like inputs and make decisions, underpinning the cognitive abilities required for tasks like natural language processing and image generation.

### How do vector embeddings enable semantic search?

By mapping words or other data into vectors in a multi-dimensional space, semantic relationships can be identified, enabling semantic search which finds items based on meaning and context rather than exact matches.
