Technology | March 18, 2024

DataStax Integrates NVIDIA NIM for Deploying AI Models

DataStax integrates NVIDIA NIM to deliver high-performance RAG solutions for deploying AI models, with faster embedding generation and indexing powered by NVIDIA microservices.

DataStax is broadening retrieval-augmented use cases by integrating NVIDIA NIM and NVIDIA NeMo Retriever microservices into its products to deliver high-performance retrieval-augmented generation (RAG) solutions with fast embeddings. NVIDIA microservices, part of the NVIDIA AI Enterprise software platform, provide developers with a set of easy-to-use building blocks to accelerate the deployment of generative AI across enterprises.

End users of generative AI applications expect real-time responses. But the complexity of accessing structured and unstructured data can introduce latency that impacts the user experience—and up to 40% of that latency can come from calls to the embedding service and vector search service.

That’s why DataStax is improving the delivery of retrieval-augmented use cases by integrating NVIDIA microservices into DataStax Astra DB to deliver high-performance RAG data solutions. 

The challenge

To make large language models (LLMs) practical for enterprise users, RAG combines pre-trained language models with a retrieval system that lets enterprises talk to their own data. RAG is incredibly useful because it reduces hallucinations, helps LLMs give more specific answers grounded in enterprise data, and can be refreshed as quickly as new information becomes available. It's also a more resource-efficient approach, as the cost of retrieval and embedding inference is substantially lower than that of batch-training custom models.
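In code, a RAG request boils down to three steps: embed the question, retrieve similar content, and generate a grounded answer. Here's a minimal sketch of that flow, in which `embed`, `vector_search`, and `llm_complete` are hypothetical stand-ins for an embedding service, a vector database, and an LLM client:

```python
# Minimal RAG flow. `embed`, `vector_search`, and `llm_complete` are
# hypothetical stand-ins for an embedding service, a vector database,
# and an LLM client, respectively.

def answer_with_rag(question: str, top_k: int = 5) -> str:
    query_vector = embed(question)                     # 1. vectorize the question
    chunks = vector_search(query_vector, limit=top_k)  # 2. retrieve relevant enterprise data
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm_complete(prompt)                        # 3. generate a grounded answer
```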

The enterprise corpus of unstructured data, from software logs to customer chat history, is a gold mine of valuable domain knowledge and the real-time information that generative AI applications need. In customer RAG pipelines, calls to the embedding service represent 10-40% of total latency. However, many companies face a daunting and cost-prohibitive technological challenge in vectorizing their existing and newly added unstructured data for LLM inference. The challenge is compounded by the need to generate embeddings in near real time and to index that information in a vector database.

DataStax is working with NVIDIA to solve this problem. The NVIDIA NeMo Retriever embedding microservice generates over 1,200 embeddings per second per GPU at double-digit millisecond latencies, pairing well with a highly scalable NoSQL datastore like Astra DB, which can ingest new embeddings at 4,000+ transactions per second with single-digit millisecond latencies on low-cost commodity storage. Together, DataStax with NIM delivers roughly 11.9 millisecond combined embedding-and-indexing latency at 4,000+ operations per second, with commensurately lower operational costs.
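As a concrete sketch of that pipeline, the snippet below generates embeddings from a self-hosted NIM container and writes them to Astra DB with the astrapy client. The endpoint URL, model name, credentials, and exact request fields are assumptions; NeMo Retriever embedding NIMs typically expose an OpenAI-style /v1/embeddings API, but details vary by deployment and astrapy version:

```python
import requests
from astrapy import DataAPIClient

# Assumed local NIM deployment exposing an OpenAI-style embeddings endpoint.
NIM_EMBED_URL = "http://localhost:8000/v1/embeddings"

def embed(texts: list[str]) -> list[list[float]]:
    # The model name and "input_type" field are assumptions; check your
    # NIM container's docs for the exact request schema.
    resp = requests.post(NIM_EMBED_URL, json={
        "model": "NV-Embed-QA",
        "input": texts,
        "input_type": "passage",
    })
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

# Placeholder Astra DB credentials; NV-Embed-QA produces 1024-dim vectors.
db = DataAPIClient("APPLICATION_TOKEN").get_database("API_ENDPOINT")
collection = db.create_collection("docs", dimension=1024, metric="cosine")

texts = [
    "Ticket 1042: login fails after password reset.",
    "Release notes 2.3: added SSO support for enterprise accounts.",
]
collection.insert_many(
    [{"text": t, "$vector": v} for t, v in zip(texts, embed(texts))]
)
```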

NVIDIA NeMo Retriever is a collection of generative AI microservices enabling organizations to seamlessly connect custom models to diverse business data and deliver highly accurate responses. It enhances generative AI applications with RAG capabilities that can be connected to business data wherever it resides.

Why it matters: Skypoint’s use case

Tisson Mathew, CEO and founder of healthcare data solutions provider Skypoint, noted the importance of speed when it comes to providing a great experience for customers. 

“At Skypoint, we have a strict SLA of five seconds to generate responses for our frontline healthcare providers,” Mathew said. “Hitting this SLA is especially difficult in the scenario that there are multiple LLM and vector search queries. Being able to shave off time from generating embeddings is of vast importance to improving the user experience.”

Benchmarks

To understand the impact of performant RAG solutions, we benchmarked the NVIDIA NeMo Retriever NV-Embed-QA embedding model (for generating vector embeddings) and the DataStax Astra DB vector database (for storing and managing vectors at scale). We ran the test harness (the open source NoSQLBench) on the same machine as the model, deployed in Docker containers. The performance tests measured the following four key metrics (a simplified version of the measurement loop is sketched after the list):

  • Embedding Latency: Time to generate an embedding
  • Indexing / Query Latency: Time to store / query the generated embedding
  • Overall Throughput: Number of processed inputs through the system per second
  • Cost: Hardware and software cost to process tokens
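The NoSQLBench workloads themselves are more involved, but conceptually each measurement reduces to timing the two stages separately and counting completed operations. A simplified sketch, reusing the hypothetical `embed` helper from above plus an `index` function for the Astra DB write:

```python
import time
import statistics

def run_benchmark(texts, embed, index, batch_size=64):
    """Time embedding and indexing separately, then compute overall throughput.

    `embed` and `index` are stand-ins for the NIM call and the Astra DB write;
    this is an illustration, not the NoSQLBench harness we actually used.
    """
    embed_ms, index_ms = [], []
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        t0 = time.perf_counter()
        vectors = embed(batch)          # embedding latency
        t1 = time.perf_counter()
        index(batch, vectors)           # indexing latency
        t2 = time.perf_counter()
        embed_ms.append((t1 - t0) * 1000)
        index_ms.append((t2 - t1) * 1000)
    elapsed = time.perf_counter() - start
    return {
        "avg_embedding_ms": statistics.mean(embed_ms),
        "avg_indexing_ms": statistics.mean(index_ms),
        "throughput_ops_sec": len(texts) / elapsed,
    }
```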

We ran the benchmarks on a single NVIDIA A100 Tensor Core GPU, where performance increased from ~181 to almost 400 operations per second. Tuning NVIDIA TensorRT software (included with NVIDIA AI Enterprise for production-grade AI) with a tokenization/preprocessing model improved performance by another 25%. We then switched to the latest NVIDIA H100 80GB Tensor Core GPU (a single a3-highgpu-8g instance running on Google Cloud), which doubled throughput to 800+ operations per second.

We’ve also looked at configurations that lower latency, and found that we can achieve ~365 ops/sec at a ~11.9 millisecond average embedding-plus-indexing time, which is 19x faster than popular cloud embedding models paired with vector databases.

When combined with NVIDIA NIM and NeMo Retriever microservices, Astra DB and DataStax Enterprise (DataStax’s on-premises solution) provide a fast vector database RAG solution that’s built on a scalable NoSQL database and can run on any storage medium. Out-of-the-box integration with RAGStack (powered by LangChain and LlamaIndex) makes it easy for developers to replace their existing embedding model with NIM. In addition, using the RAGStack compatibility matrix tester, enterprises can validate the availability and performance of various combinations of embedding and LLM models for common RAG pipelines. We’ve also integrated NVIDIA NeMo Guardrails into RAGStack and performance tested it, showing that fast RAG applications need not come at the expense of safety.
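Because RAGStack builds on LangChain, swapping an existing embedding model for a NIM deployment can be as small as changing the embeddings class. A sketch using the langchain-nvidia-ai-endpoints and langchain-astradb packages; the local base_url, model name, and credentials are placeholder assumptions:

```python
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_astradb import AstraDBVectorStore

# Point the embeddings class at a self-hosted NIM (placeholder URL/model).
embeddings = NVIDIAEmbeddings(
    model="NV-Embed-QA",
    base_url="http://localhost:8000/v1",
)

# Placeholder Astra DB credentials.
vector_store = AstraDBVectorStore(
    embedding=embeddings,
    collection_name="rag_docs",
    api_endpoint="API_ENDPOINT",
    token="APPLICATION_TOKEN",
)

docs = vector_store.similarity_search("What did the customer report?", k=4)
```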

We’ve also launched, in developer preview, a new feature called Vectorize, which performs embedding generation at the database tier. Instead of customers managing their own microservices for generating embeddings, Astra DB will run its own microservices instance and pass the cost savings directly on to the customer (sign-up form).
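With Vectorize, applications write and query raw text and the database handles embedding. A sketch of what that could look like through astrapy, assuming the NVIDIA provider; since the feature is in developer preview, exact names and options may change:

```python
from astrapy import DataAPIClient
from astrapy.info import CollectionVectorServiceOptions

db = DataAPIClient("APPLICATION_TOKEN").get_database("API_ENDPOINT")

# The collection is bound to a server-side embedding service at creation time.
collection = db.create_collection(
    "docs_vectorize",
    service=CollectionVectorServiceOptions(
        provider="nvidia",
        model_name="NV-Embed-QA",
    ),
)

# Insert raw text; Astra DB generates the embedding at the database tier.
collection.insert_one({"$vectorize": "Patient asked about dosage adjustments."})

# Query by text as well; no client-side embedding call is needed.
hit = collection.find_one(sort={"$vectorize": "dosage question"})
```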

Today, the DataStax vector database uses DiskANN to process 4,000 inserts per second and make them immediately searchable. While this keeps the data fresh, the tradeoff is reduced efficiency. We are working with NVIDIA to accelerate our vector search algorithms by integrating RAPIDS cuVS, improving efficiency while maintaining data freshness.

To get started, sign up for access to DataStax Astra DB and the NVIDIA NIM and NeMo Retriever microservices, and download RAGStack into your development environment. Then use the NVIDIA microservices / DataStax template to experience fast vector search at scale for yourself. Finally, check out the testing dashboard to identify which RAG use cases you’d like to implement with NVIDIA NIM and DataStax.

