
Scalar Quantization: Background, Practices & More | Qdrant

Kacper Łukawski · March 27, 2023


Efficiency Unleashed: The Power of Scalar Quantization

High-dimensional vector embeddings can be memory-intensive, especially when working with large datasets consisting of millions of vectors. The memory footprint really starts to become a concern once we scale things up. A simple choice of the data type used to store a single number is multiplied across billions of numbers and can blow up the memory requirements. The higher the precision of your type, the more accurately you can represent the numbers, and the more accurate your vectors, the more precise the distance calculation. But the advantages stop paying off when you need to order more and more memory.

Qdrant chose float32 as the default type used to store the numbers of your embeddings, so a single number needs 4 bytes of memory and a 512-dimensional vector occupies 2 kB. That’s only the memory used to store the vector itself. There is also the overhead of the HNSW graph, so as a rule of thumb we estimate the memory size with the following formula:

memory_size = 1.5 * number_of_vectors * vector_dimension * 4 bytes
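
As a quick illustration of the rule of thumb (the numbers below are hypothetical and chosen only for this example):

```python
# A rough memory estimate following the formula above.
number_of_vectors = 1_000_000
vector_dimension = 512

memory_size = 1.5 * number_of_vectors * vector_dimension * 4  # bytes
print(f"{memory_size / 1024**3:.2f} GiB")  # ≈ 2.86 GiB
```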

While Qdrant offers various options to store some parts of the data on disk, starting from version 1.1.0 you can also optimize your memory usage by compressing the embeddings. We’ve implemented the mechanism of Scalar Quantization! It turns out to have a positive impact not only on memory but also on performance.

Scalar quantization

Scalar quantization is a data compression technique that converts floating point values into integers. In the case of Qdrant, float32 gets converted into int8, so a single number needs 75% less memory. It’s not simple rounding though! It’s a process that makes the transformation partially reversible, so we can also convert integers back to floats with only a small loss of precision.

Theoretical background

Assume we have a collection of float32 vectors and denote a single value as f32. In reality, neural embeddings do not cover the whole range representable by floating point numbers, but rather a small subrange of it. Since we know all the vectors in advance, we can compute some statistics of all the numbers. For example, the distribution of the values will typically be normal:

A distribution of the vector values

Our example shows that 99% of the values fall within the range [-2.0, 5.0]. The conversion to int8 will inevitably lose some precision, so we prefer to keep the representation accurate within the range covering 99% of the most probable values and to ignore the precision of the outliers. The width of that range is configurable: it can be any value from [0, 1], where 0 means an empty range and 1 keeps all the values. This is a hyperparameter of the procedure called quantile. A value of 0.95 or 0.99 is typically a reasonable choice, but in general quantile ∈ [0, 1].
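
As a rough sketch of how such a range could be derived (this is not Qdrant’s actual implementation, and the synthetic data below only stands in for real embeddings):

```python
import numpy as np

# Synthetic stand-in for real embeddings, roughly matching the example above.
vectors = np.random.normal(loc=1.5, scale=1.3, size=(100_000, 512)).astype(np.float32)

quantile = 0.99
lower = np.quantile(vectors, (1 - quantile) / 2)      # ~0.5th percentile
upper = np.quantile(vectors, 1 - (1 - quantile) / 2)  # ~99.5th percentile
print(lower, upper)  # bounds covering ~99% of all the values
```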

Conversion to integers

Let’s talk about the conversion to int8. Integers also have a finite set of values that they can represent. Within a single byte they can represent up to 256 different values, either from [-128, 127] or from [0, 255].

Value ranges represented by int8

Since we put some boundaries on the numbers that can be represented by f32, and i8 has its own natural boundaries, the process of converting values between those two ranges is quite natural:

$$ f32 = \alpha \times i8 + offset $$

$$ i8 = \frac{f32 - offset}{\alpha} $$

The parameters $ \alpha $ and $ offset $ have to be calculated for a given set of vectors, but that comes easily by plugging in the minimum and maximum of the represented range for both f32 and i8.

Float32 to int8 conversion

For unsigned int8 it goes as follows:

$$ \begin{equation} \begin{cases} -2 = \alpha \times 0 + offset \\ 5 = \alpha \times 255 + offset \end{cases} \end{equation} $$

In the case of signed int8, we just change the represented range boundaries:

$$ \begin{equation} \begin{cases} -2 = \alpha \times (-128) + offset \\ 5 = \alpha \times 127 + offset \end{cases} \end{equation} $$

For any set of vector values we can simply calculate $ \alpha $ and $ offset $, and those values have to be stored along with the collection to enable the conversion between the types.
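
For the [-2.0, 5.0] example and the unsigned range, solving the system above gives $ offset = -2 $ and $ \alpha = 7 / 255 \approx 0.0275 $. A minimal sketch of that fitting and of the (de)quantization itself could look like this (illustrative only, not Qdrant’s internal code):

```python
import numpy as np

def fit_scalar_quantizer(values: np.ndarray, quantile: float = 0.99):
    """Solve f32 = alpha * i8 + offset for the unsigned [0, 255] range."""
    low = np.quantile(values, (1 - quantile) / 2)
    high = np.quantile(values, 1 - (1 - quantile) / 2)
    alpha = (high - low) / 255.0
    offset = low
    return alpha, offset

def quantize(values: np.ndarray, alpha: float, offset: float) -> np.ndarray:
    # i8 = (f32 - offset) / alpha, clipped to the representable range
    return np.clip(np.round((values - offset) / alpha), 0, 255).astype(np.uint8)

def dequantize(codes: np.ndarray, alpha: float, offset: float) -> np.ndarray:
    # f32 = alpha * i8 + offset, with a small loss of precision
    return alpha * codes.astype(np.float32) + offset
```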

Distance calculation

We do not store the vectors represented by int8 instead of float32 just for the sake of compressing the memory. The coordinates are also used when we calculate the distance between vectors. Both dot product and cosine distance require multiplying the corresponding coordinates of two vectors, so that’s an operation we perform very often on float32. Here is how it looks after the conversion to int8:

$$ f32 \times f32’ = $$ $$ = (\alpha \times i8 + offset) \times (\alpha \times i8’ + offset) = $$ $$ = \alpha^{2} \times i8 \times i8’ + \underbrace{offset \times \alpha \times i8’ + offset \times \alpha \times i8 + offset^{2}}_\text{pre-compute} $$

The first term, $ \alpha^{2} \times i8 \times i8’ $, has to be calculated when we measure the distance, as it depends on both vectors. However, the second and the third terms ($ offset \times \alpha \times i8’ $ and $ offset \times \alpha \times i8 $ respectively) each depend only on a single vector, so they can be precomputed and kept for each vector. The last term, $ offset^{2} $, does not depend on any of the values, so it can be computed once and reused.
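
A toy sketch of this decomposition for the dot product (just for illustration; the actual Qdrant implementation is written in Rust and heavily optimized):

```python
import numpy as np

def precompute_term(i8_vec: np.ndarray, alpha: float, offset: float) -> float:
    # The per-vector term offset * alpha * sum(i8) can be stored next to the codes.
    return offset * alpha * float(i8_vec.astype(np.int64).sum())

def quantized_dot(i8_a, i8_b, alpha, offset, pre_a, pre_b):
    dim = i8_a.shape[0]
    # Only this term has to be computed per pair of vectors.
    main = alpha ** 2 * float(np.dot(i8_a.astype(np.int64), i8_b.astype(np.int64)))
    # pre_a, pre_b and dim * offset**2 come from the precomputed parts.
    return main + pre_a + pre_b + dim * offset ** 2
```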

If we had to calculate all the terms to measure the distance, the performance could be even worse than without the conversion. But thanks to the fact that we can precompute the majority of the terms, things get simpler. And it turns out scalar quantization has a positive impact not only on memory usage, but also on performance. As usual, we performed some benchmarks to support this statement!

Benchmarks

We used the same approach as in all the other benchmarks we publish. Both the Arxiv-titles-384-angular-no-filters and Gist-960 datasets were chosen to make the comparison between non-quantized and quantized vectors. The results are summarized in the tables below:

Arxiv-titles-384-angular-no-filters

| Setup | Upload and indexing time | Mean search precision (ef = 128) | Mean search time (ef = 128) | Mean search precision (ef = 256) | Mean search time (ef = 256) | Mean search precision (ef = 512) | Mean search time (ef = 512) |
|---|---|---|---|---|---|---|---|
| Non-quantized vectors | 649 s | 0.989 | 0.0094 | 0.994 | 0.0932 | 0.996 | 0.161 |
| Scalar Quantization | 496 s | 0.986 | 0.0037 | 0.993 | 0.060 | 0.996 | 0.115 |
| Difference | -23.57% | -0.3% | -60.64% | -0.1% | -35.62% | 0% | -28.57% |

A slight decrease in search precision results in a considerable improvement in latency. Unless you aim for the highest precision possible, you should not notice the difference in your search quality.

Gist-960

| Setup | Upload and indexing time | Mean search precision (ef = 128) | Mean search time (ef = 128) | Mean search precision (ef = 256) | Mean search time (ef = 256) | Mean search precision (ef = 512) | Mean search time (ef = 512) |
|---|---|---|---|---|---|---|---|
| Non-quantized vectors | 452 s | 0.802 | 0.077 | 0.887 | 0.135 | 0.941 | 0.231 |
| Scalar Quantization | 312 s | 0.802 | 0.043 | 0.888 | 0.077 | 0.941 | 0.135 |
| Difference | -30.79% | 0% | -44.16% | +0.11% | -42.96% | 0% | -41.56% |

In all the cases, the decrease in search precision is negligible, but we keep a latency reduction of at least 28.57%, and even up to 60.64%, while searching. As a rule of thumb, the higher the dimensionality of the vectors, the lower the precision loss.

Oversampling and rescoring

A distinctive feature of the Qdrant architecture is the ability to combine the search for quantized and original vectors in a single query. This enables the best combination of speed, accuracy, and RAM usage.

Qdrant stores the original vectors, so it is possible to rescore the top-k results with the original vectors after performing the neighbour search in the quantized space. That obviously has some impact on performance, but in order to measure how big it is, we made a comparison across different search scenarios. We used a machine with a very slow network-mounted disk and tested the following scenarios with different amounts of allowed RAM:

| Setup | RPS | Precision |
|---|---|---|
| 4.5GB memory | 600 | 0.99 |
| 4.5GB memory + SQ + rescore | 1000 | 0.989 |

And another group with more strict memory limits:

| Setup | RPS | Precision |
|---|---|---|
| 2GB memory | 2 | 0.99 |
| 2GB memory + SQ + rescore | 30 | 0.989 |
| 2GB memory + SQ + no rescore | 1200 | 0.974 |

In those experiments, throughput was mainly defined by the number of disk reads, and quantization efficiently reduces it by allowing more vectors to be kept in RAM. Read more about on-disk storage in Qdrant and how we measure its performance in our article: Minimal RAM you need to serve a million vectors.

Scalar Quantization with rescoring disabled pushes the limits of low-end machines even further. It seems that handling lots of requests does not require an expensive setup if you can accept a small decrease in search precision.
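
Enabling or disabling rescoring at query time might look roughly like this, assuming a recent version of the qdrant-client Python package (the collection name and query vector are placeholders, and parameter names may differ between client versions, so treat this as a sketch):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

results = client.search(
    collection_name="my_collection",  # hypothetical collection name
    query_vector=[0.0] * 512,         # placeholder query vector
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,  # set to False to skip rescoring with original vectors
        )
    ),
)
```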

Accessing best practices

The Qdrant documentation on Scalar Quantization is a great resource describing different scenarios and strategies to achieve up to a 4x lower memory footprint and even up to a 2x performance increase.
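
As a starting point, a collection with Scalar Quantization enabled could be created roughly as follows with the qdrant-client Python package (field names taken from recent client versions, with a hypothetical collection name; the documentation linked above remains the authoritative reference):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="my_collection",  # hypothetical collection name
    vectors_config=models.VectorParams(size=512, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,    # keep the central 99% of the values
            always_ram=True,  # keep quantized vectors in RAM, originals may stay on disk
        )
    ),
)
```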
