Technology · December 18, 2023

Vector Search for Production: A GPU-Powered KNN Ground Truth Dataset Generator

Sebastian Estevez, DataStax
Yabin Meng, DataStax
A ⋅ B = a₁×b₁ + a₂×b₂ + ⋯ + aₙ×bₙ

Let's do a quick example:

   A=[0.5959,0.3442,0.7256]

   B=[0.6769,0.3783,0.6314]

A ⋅ B = (0.5959×0.6769) + (0.3442×0.3783) + (0.7256×0.6314) ≈ 0.9917
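
The same arithmetic can be checked in a few lines of plain Python. This is only a minimal sketch of the formula above; the helper name `dot` is ours for illustration and is not part of any tool discussed in this post.

```python
def dot(a, b):
    """Dot product of two equal-length vectors: sum of element-wise products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# The two example vectors from the text above.
A = [0.5959, 0.3442, 0.7256]
B = [0.6769, 0.3783, 0.6314]

similarity = dot(A, B)
print(round(similarity, 4))  # 0.9917
```
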
$ poetry run nw -h
usage: nw [-h] [-m MODEL_NAME] [-k K] [-d DATA_DIR]
          [--skip-zero-vec | --no-skip-zero-vec]
          [--use-dataset-api | --no-use-dataset-api]
          [--gen-hdf5 | --no-gen-hdf5]
          [--post-validation | --no-post-validation]
          [--enable-memory-tuning] [--disable-memory-tuning]
          query_count base_count

nw (neighborhood watch) uses GPU acceleration to generate ground truth KNN datasets

positional arguments:
  query_count           number of query vectors to generate
  base_count            number of base vectors to generate

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        model name to use for generating embeddings, i.e. ada-002,
                        textembedding-gecko, or intfloat/e5-large-v2
  -k K, --k K           number of neighbors to compute per query vector
  -d DATA_DIR, --data_dir DATA_DIR
                        Directory to store the generated datasets (default: ./knn_dataset)
  --skip-zero-vec, --no-skip-zero-vec
                        Skip generating zero vectors when there is failure to retrieve the
                        vector (default: True)
  --use-dataset-api, --no-use-dataset-api
                        Use 'pyarrow.dataset' API to read the dataset; recommended for very
                        large datasets (default: False)
  --gen-hdf5, --no-gen-hdf5
                        Generate the hdf5 format file (default: True)
  --post-validation, --no-post-validation
                        Validate the generated files (default: False)
  --enable-memory-tuning
                        Enable memory tuning
  --disable-memory-tuning
                        Disable memory tuning (useful for very small datasets)

Some example commands:

    nw 1000 10000 -k 100 -m 'textembedding-gecko' --disable-memory-tuning
    nw 1000 10000 -k 100 -m 'intfloat/e5-large-v2' --disable-memory-tuning
    nw 1000 10000 -k 100 -m 'intfloat/e5-small-v2' --disable-memory-tuning
    nw 1000 10000 -k 100 -m 'intfloat/e5-base-v2' --disable-memory-tuning

Tweet thread: https://x.com/syllogistic/status/1720624316591509880?s=20

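from pylibraft.neighbors.brute_force import knn

# dataset: base vectors, shape (base_count, dim); queries: shape (query_count, dim)
# k: number of neighbors to return per query vector
distances, neighbors = knn(dataset, queries, k)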
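
To make it concrete what an exact (brute-force) KNN computes, here is a small CPU-only sketch in plain Python. It is not the pylibraft implementation and does not run on a GPU. The function name `brute_force_knn` and the toy vectors are illustrative assumptions, and ranking by dot-product similarity is our choice for this sketch (the metric used by the library call above is not shown in the snippet).

```python
import heapq


def brute_force_knn(dataset, queries, k):
    """Exact top-k neighbors per query, ranked by dot product (higher is closer).

    Illustrative CPU reference only; every query is compared with every base vector.
    """
    all_scores, all_neighbors = [], []
    for q in queries:
        # Score the query against every base vector, keeping the base index.
        scored = [
            (sum(x * y for x, y in zip(q, row)), idx)
            for idx, row in enumerate(dataset)
        ]
        # Keep the k highest-scoring base vectors, best first.
        top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        all_scores.append([score for score, _ in top])
        all_neighbors.append([idx for _, idx in top])
    return all_scores, all_neighbors


# Toy data (hypothetical): four 2-dimensional base vectors and two queries.
dataset = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
queries = [[1.0, 0.0], [0.0, 1.0]]

scores, neighbors = brute_force_knn(dataset, queries, k=2)
print(neighbors)  # [[0, 3], [1, 2]]
```

Because every query is scored against every base vector, the cost grows with query_count × base_count, which is the kind of workload that benefits from GPU acceleration.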