Upload and Search Large collections cost-efficiently

Time: 2 days	Level: Advanced

In this tutorial, we describe an approach to uploading, indexing, and searching a large volume of data cost-efficiently, using the real-world LAION-400M dataset as an example.

The goal of this tutorial is to demonstrate the minimal amount of resources required to index and search a large dataset while still maintaining reasonable search latency and accuracy.

All relevant code snippets are available in the GitHub repository.

The recommended Qdrant version for this tutorial is v1.13.5 and higher.

Dataset

The dataset we will use is LAION-400M, a collection of approximately 400 million vectors obtained from images extracted from a Common Crawl dataset. Each vector is 512-dimensional and generated using a CLIP model.

Vectors are associated with a number of metadata fields, such as url, caption, LICENSE, etc.

The overall payload size is approximately 200 GB, and the vectors are 400 GB.

The dataset is available in the form of 409 chunks, each containing approximately 1M vectors.

Hardware

After some initial experiments, we figured out a minimal hardware configuration for the task:

8 CPU cores
64 GB RAM
650 GB disk space

This configuration is enough to index and explore the dataset in a single-user mode; latency is reasonable enough to build interactive graphs and navigate in the dashboard.

Naturally, you might need more CPU cores and RAM for production-grade configurations.

It is important to ensure high network bandwidth for this experiment, so make sure you are running the client and server in the same region.

Uploading and Indexing

We will use the following Python script to upload dataset chunks one by one.

export QDRANT_URL="https://xxxx-xxxx.xxxx.cloud.qdrant.io"
export QDRANT_API_KEY="xxxx-xxxx-xxxx-xxxx"

python upload.py

This script will download chunks of the LAION dataset one by one and upload them to Qdrant. Intermediate data is not persisted on disk, so the script doesn’t require much disk space on the client side.
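The streaming behavior described above can be sketched roughly as follows. This is not the actual upload.py: `iter_batches` and `upload_stream` are hypothetical helpers, and `send_batch` stands in for whatever client call performs the upload. The point is only that a single batch is held in memory at a time, so nothing has to be written to the client's disk.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

# (point id, vector): a simplified stand-in for a real point structure
Point = Tuple[int, List[float]]


def iter_batches(points: Iterable[Point], batch_size: int) -> Iterator[List[Point]]:
    """Yield fixed-size batches lazily, without materializing the whole stream."""
    iterator = iter(points)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upload_stream(
    chunks: Iterable[Iterable[Point]],
    send_batch: Callable[[List[Point]], None],
    batch_size: int = 256,
) -> int:
    """Send every chunk batch by batch and return the number of points sent."""
    total = 0
    for chunk in chunks:  # each chunk is consumed as it arrives
        for batch in iter_batches(chunk, batch_size):
            send_batch(batch)  # hypothetical: replace with a real upload call
            total += len(batch)
    return total
```

Passing the upload call as a parameter keeps the sketch independent of any particular client version and makes it easy to try with a plain list as the sink.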

Let’s take a look at the collection configuration we used:

# Excerpt from the upload script; assumes `from qdrant_client import models`
# and an already created `client`
client.create_collection(
    QDRANT_COLLECTION_NAME,
    vectors_config=models.VectorParams(
        size=512,  # CLIP model output size
        distance=models.Distance.COSINE,  # CLIP model uses cosine distance
        datatype=models.Datatype.FLOAT16,  # 16-bit floats: about 400 GB on disk instead of 800 GB
        on_disk=True,  # We don't need original vectors in RAM
    ),
    # CLIP vectors don't work well with binary quantization out of the box,
    # but we can rely on query-time oversampling to get more accurate results
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,  # Keep quantized vectors in RAM
        )
    ),
    optimizers_config=models.OptimizersConfigDiff(
        # Larger segments are desirable for faster search,
        # although they might slow down indexing
        max_segment_size=5_000_000,
    ),
    # A larger M value is desirable for higher accuracy,
    # but in our case we care more about memory usage.
    # We can still achieve reasonable accuracy with M=6 plus oversampling
    hnsw_config=models.HnswConfigDiff(
        m=6,  # Decrease M for lower memory usage
        on_disk=False,
    ),
)

There are a few important points to note:

We use FLOAT16 datatype for vectors, which allows us to store vectors in half the size compared to FLOAT32. There are no significant accuracy losses for this dataset.
We use BinaryQuantization with always_ram=True to enable query-time oversampling. This allows us to get an accurate and resource-efficient search, even though 512d CLIP vectors don’t work well with binary quantization out of the box.
We use HnswConfig with m=6 to reduce memory usage. We will look deeper into memory usage in the next section.
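The effect of the datatype choice on disk usage can be checked with simple arithmetic. The sketch below assumes exactly 400 million vectors (the real count is approximate) and reports decimal gigabytes; `raw_vector_bytes` is a helper introduced only for this illustration.

```python
NUM_VECTORS = 400_000_000  # approximate dataset size, assumed exact here
DIMENSIONS = 512           # CLIP output size


def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int) -> int:
    """Bytes needed to store the raw vectors, ignoring any storage overhead."""
    return num_vectors * dimensions * bytes_per_value


float32_gb = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, 4) / 1e9
float16_gb = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, 2) / 1e9

print(f"float32: {float32_gb:.1f} GB, float16: {float16_gb:.1f} GB")
# float32: 819.2 GB, float16: 409.6 GB
```

These figures line up with the rough 800 GB versus 400 GB mentioned in the configuration comments.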

The goal of this configuration is to ensure that the prefetch component of the search never needs to load data from disk, and that at least a minimal version of the vectors and the vector index is always in RAM. The second stage of the search can explicitly determine how many times we can afford to load data from disk.

In our experiment, the upload ran at about 5000 points per second. Indexing ran in parallel with the upload at a rate of approximately 4000 points per second.
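As a rough sanity check, these rates can be turned into wall-clock estimates. The sketch assumes the rates stay constant for the whole run and that the dataset holds exactly 400 million points, neither of which is guaranteed in practice.

```python
NUM_POINTS = 400_000_000  # approximate dataset size, assumed exact here
UPLOAD_RATE = 5_000       # points per second, as observed in the experiment
INDEXING_RATE = 4_000     # points per second, as observed in the experiment


def hours_to_process(num_points: int, points_per_second: float) -> float:
    """Hours needed to process all points at a constant rate."""
    return num_points / points_per_second / 3600


upload_hours = hours_to_process(NUM_POINTS, UPLOAD_RATE)
indexing_hours = hours_to_process(NUM_POINTS, INDEXING_RATE)

print(f"upload: {upload_hours:.1f} h, indexing: {indexing_hours:.1f} h")
# upload: 22.2 h, indexing: 27.8 h
```

Because both processes run in parallel, the slower one (indexing) bounds the total time under these assumptions.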

Memory Usage

After the upload and indexing process is finished, let’s take a detailed look at the memory usage of the Qdrant server.

At a high level, memory usage consists of 3 components:

System memory - 8.34 GB - this is memory reserved for internal systems and the OS; it doesn’t depend on the dataset size.
Data memory - 39.27 GB - this is the resident memory of the Qdrant process; it can’t be evicted, and the Qdrant process will crash if it exceeds the limit.
Cache memory - 14.54 GB - this is the disk cache Qdrant uses. It is necessary for fast search but can be evicted if needed.

Data and Cache memory are of most interest to us. Let’s look at what exactly is stored in these components.

In our scenario, Qdrant uses memory to store the following components:

Storing vectors
Storing vector index
Storing information about IDs and versions of points

Size of vectors

In our scenario, we store only quantized vectors in RAM, so it is relatively easy to calculate the required size:

400_000_000 vectors * 512 bits / 8 (bits per byte) / 1024 (KB) / 1024 (MB) / 1024 (GB) = 23.84 GB

Size of vector index

Vector index is a bit more complicated, as it is not a simple matrix.

Internally, it is stored as a list of connections in a graph, and each connection is a 4-byte integer.

The number of connections is defined by the M parameter of the HNSW index: in our case, it is M = 6 on the higher levels and 2 x M = 12 on level 0.

This gives us the following estimation:

400_000_000 * (6 * 2) * 4 bytes / 1024 (KB) / 1024 (MB) / 1024 (GB) = 17.881 GB

In practice, the size of the index is a bit smaller due to the compression we implemented in Qdrant v1.13.0, but it is still a good estimation.

The HNSW index in Qdrant is stored as a mmap, and it can be evicted from RAM if needed. So, the memory consumption of HNSW falls under the category of Cache memory.

Size of IDs and versions

Qdrant must store additional information about each point, such as ID and version. This information is needed on each request, so it is very important to keep it in RAM for fast access.

Let’s take a look at Qdrant internals to understand how much memory is required for this information.


// This is a simplified version of the IdTracker struct.
// It omits all optimizations and small details,
// but gives a good estimation of memory usage
IdTracker {
    // Mapping of internal id to version (u64), compressed to 4 bytes.
    // Required for versioning and conflict resolution between segments
    internal_to_version, // 400M x 4 = 1.5 GB

    // Mapping of internal id to external id, 16 bytes per point.
    // Required to determine the original point ID after search inside the segment
    internal_to_external: Vec<u128>, // 400M x 16 = 6.4 GB

    // Mapping of external id to internal id. Numeric ids use 8 bytes,
    // UUIDs are stored as 16 bytes.
    // Required to determine the sequential point ID inside the segment
    external_to_internal: Vec<(u64, u32)>, // 400M x (8 + 4) = 4.5 GB
}

In v1.13.5, we introduced a significant optimization that reduces the memory usage of IdTracker by approximately a factor of two. So the total memory usage of IdTracker in our case is approximately 12.4 GB.

So the total expected RAM usage of the Qdrant server in our case is approximately 23.84 GB + 17.881 GB + 12.4 GB = 54.121 GB, which is very close to the actual memory usage we observed: 39.27 GB + 14.54 GB = 53.81 GB.

We had to apply some simplifications to the estimations, but they are good enough to understand the memory usage of the Qdrant server.
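The estimate above can be reproduced with a few lines of Python, which also makes it easy to plug in a different dataset size or M value. This is a sketch of the same simplified arithmetic, not a model of Qdrant internals: the helper names are made up for illustration, sizes use 1024-based divisors as in the formulas above, and the IdTracker figure is taken from the text rather than derived.

```python
GIB = 1024 ** 3  # the text divides by 1024 three times


def binary_vectors_gb(num_vectors: int, dimensions: int) -> float:
    """Binary-quantized vectors: one bit per dimension."""
    return num_vectors * dimensions / 8 / GIB


def hnsw_links_gb(num_vectors: int, m: int, bytes_per_link: int = 4) -> float:
    """Level-0 HNSW links only: 2 * M links per point, 4 bytes each."""
    return num_vectors * (2 * m) * bytes_per_link / GIB


ID_TRACKER_GB = 12.4  # figure quoted in the text, not computed here

vectors_gb = binary_vectors_gb(400_000_000, 512)
index_gb = hnsw_links_gb(400_000_000, m=6)
total_gb = vectors_gb + index_gb + ID_TRACKER_GB

print(f"vectors: {vectors_gb:.2f}, index: {index_gb:.2f}, total: {total_gb:.2f}")
# vectors: 23.84, index: 17.88, total: 54.12
```

Only level 0 of the graph is counted, so this is a simplified approximation of the link storage, in line with the simplifications mentioned above.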

Search

After the dataset is uploaded and indexed, we can start searching for similar vectors.

We can start by exploring the dataset in the Web UI, so you can get an intuition for the search performance, not just numbers in a table.

Web-UI default requests do not use oversampling, but the observable results are still good enough to see the resemblance between images.

Ground truth data

However, to estimate the search performance more accurately, we need to compare search results with the ground truth. Unfortunately, the LAION dataset doesn’t contain usable ground truth, so we had to generate it ourselves.

To do this, we need to perform a full-scan search for each vector in the dataset and store the results in a separate file. Unfortunately, this process is very time-consuming and requires a lot of resources, so we had to limit the number of queries to 100. We provide a ready-to-use ground truth file and the script to generate it (it requires a machine with 512 GB of RAM and about 20 hours of execution time).

Our ground truth file contains 100 queries, each with 50 results. The first 100 vectors of the dataset itself were used to generate queries.
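Conceptually, generating ground truth is an exact (brute-force) nearest-neighbor search. The sketch below shows the idea on plain Python lists. It is not the script used for this tutorial, the function names are illustrative, and a pure-Python scan like this would be far too slow for 400 million vectors.

```python
import heapq
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Scan every vector and return the indices of the k most similar ones."""
    scored = ((cosine_similarity(query, v), idx) for idx, v in enumerate(vectors))
    return [idx for _, idx in heapq.nlargest(k, scored)]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(exact_top_k([1.0, 0.0], vectors, k=2))  # [0, 2]
```

Because every vector is scored, the result is exact by construction, which is what makes it usable as a reference for measuring approximate search.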

Search Query

To precisely control the amount of oversampling, we will use the following search query:


limit = 50
rescore_limit = 1000  # oversampling factor is 20 (1000 / 50)

query = vectors[query_id]  # One of the existing vectors

response = client.query_points(
    collection_name=QDRANT_COLLECTION_NAME,
    query=query,
    limit=limit,
    # Second stage: rescore the prefetched candidates
    # with the original vectors, which are read from disk
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,
        ),
    ),
    # Prefetch is performed using only in-RAM data,
    # so fetching even a large number of candidates is fast
    prefetch=models.Prefetch(
        query=query,
        limit=rescore_limit,
        params=models.SearchParams(
            quantization=models.QuantizationSearchParams(
                # Avoid rescoring in prefetch;
                # we do it explicitly in the second stage
                rescore=False,
            ),
        ),
    ),
)

As you can see, this query contains two stages:

The first stage is a prefetch, which is performed using only in-RAM data. It is very fast and allows us to get a large number of candidates.
The second stage is a rescore, which is performed with the full-size vectors stored on disk.

By using a 2-stage search, we can precisely control the amount of data loaded from disk and ensure the balance between search speed and accuracy.
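To make the two stages concrete, here is a toy, pure-Python illustration of the idea. It is a simplification under several assumptions: the binary codes are built from the sign of each component, the first stage is a full scan over the codes (Qdrant uses the HNSW index instead), and all function names are invented for this sketch.

```python
import math
import random
from typing import List, Sequence


def binarize(vector: Sequence[float]) -> int:
    """Pack the sign of each component into the bits of one integer."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query, vectors, codes, limit: int, rescore_limit: int) -> List[int]:
    query_code = binarize(query)
    # Stage 1 (prefetch): cheap ranking on the compact codes only
    candidates = sorted(
        range(len(vectors)), key=lambda i: hamming(query_code, codes[i])
    )[:rescore_limit]
    # Stage 2 (rescore): touch the original vectors only for the candidates
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return candidates[:limit]


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(2000)]
codes = [binarize(v) for v in vectors]
result = two_stage_search(vectors[7], vectors, codes, limit=5, rescore_limit=100)
print(result[0])  # 7: the query vector itself ranks first
```

In this sketch, `rescore_limit` bounds how many original vectors are read in the second stage.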

You can find the complete code of the search process in eval.py.

Performance tweak

One important performance tweak we found useful for this dataset is to enable Async IO in Qdrant.

By default, Qdrant uses synchronous IO, which is good for in-memory datasets but can be a bottleneck when we want to read a lot of data from a disk.

Async IO (implemented with io_uring) allows Qdrant to send parallel requests to the disk and saturate the disk bandwidth.

This is exactly what we are looking for when performing large-scale re-scoring with original vectors.

Instead of reading vectors one by one and waiting for the disk response 1000 times, we can send 1000 requests to the disk and wait for all of them to complete. This allows us to saturate the disk bandwidth and get faster results.
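The difference between the two access patterns can be illustrated with a small simulation. This sketch does not use io_uring or any Qdrant internals: each "disk read" is just an `asyncio.sleep` with an assumed latency, and the function names are hypothetical. It only shows why issuing many requests at once beats waiting for each one in turn.

```python
import asyncio
import time
from typing import List

READ_LATENCY_S = 0.002  # assumed latency of one simulated disk read


async def read_vector(offset: int) -> int:
    """Simulate one disk read by sleeping; returns the offset as fake data."""
    await asyncio.sleep(READ_LATENCY_S)
    return offset


async def read_sequentially(offsets: List[int]) -> List[int]:
    # Wait for each read to finish before starting the next one
    return [await read_vector(o) for o in offsets]


async def read_concurrently(offsets: List[int]) -> List[int]:
    # Submit all reads at once and wait for all of them to complete
    return list(await asyncio.gather(*(read_vector(o) for o in offsets)))


def timed(reader, offsets: List[int]):
    """Run a reader coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(reader(offsets))
    return result, time.perf_counter() - start
```

Calling `timed(read_sequentially, list(range(50)))` and `timed(read_concurrently, list(range(50)))` returns the same data, but the sequential version pays the simulated latency 50 times while the concurrent one overlaps the waits.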

To enable Async IO in Qdrant, you need to set the following environment variable:

QDRANT__STORAGE__PERFORMANCE__ASYNC_SCORER=true

Or set the parameter in the config file:

storage:
  performance:
    async_scorer: true

In Qdrant Managed Cloud, Async IO can be enabled via the Advanced optimizations section in the cluster Configuration tab.

Running search requests

Once all the preparations are done, we can run the search requests and evaluate the results.

You can find the full code of the search process in eval.py.

This script runs 100 search requests with the configured oversampling factor and compares the results with the ground truth.

python eval.py --rescore_limit 1000
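For reference, comparing search results with the ground truth boils down to a precision@k computation like the one below. This is a minimal sketch of the metric, not a copy of eval.py, and the function names are chosen for illustration.

```python
from typing import List, Sequence


def precision_at_k(found: Sequence[int], expected: Sequence[int], k: int) -> float:
    """Fraction of the top-k ground truth ids that appear in the top-k results."""
    return len(set(found[:k]) & set(expected[:k])) / k


def mean_precision(
    all_found: List[Sequence[int]], all_expected: List[Sequence[int]], k: int
) -> float:
    """Average precision@k over all queries."""
    scores = [precision_at_k(f, e, k) for f, e in zip(all_found, all_expected)]
    return sum(scores) / len(scores)


print(precision_at_k([1, 2, 3, 4], [2, 3, 5, 6], k=4))  # 0.5
```

With 100 queries and 50 results each, as in the ground truth file, `mean_precision` would be called with k=50.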

In our experiments, we achieved the following results:

Rescore Limit	Precision@50	Time per request
1000	75.2%	0.7s
5000	81.0%	2.2s

Additional experiments with m=16 demonstrated that we can achieve 85% precision with rescore_limit=1000, but they would require slightly more memory.

Conclusion

In this tutorial, we demonstrated how to upload, index, and search a large dataset in Qdrant cost-efficiently. Binary quantization can be applied even to 512d vectors if combined with query-time oversampling.

Qdrant lets you precisely control where each part of the storage is located, which makes it possible to achieve a good balance between search speed and memory usage.

Potential improvements

In this experiment, we investigated in detail which parts of the storage are responsible for memory usage and how to control them.

One especially interesting part is the VectorIndex component, which is responsible for storing the graph of connections between vectors.

In our further research, we will investigate the possibility of making HNSW more disk-friendly so it can be offloaded to disk without significant performance losses.