Quantization

Quantization is an optional feature in Qdrant that enables efficient storage and search of high-dimensional vectors. By transforming original vectors into a new representations, quantization compresses data while preserving close to original relative distances between vectors. Different quantization methods have different mechanics and tradeoffs. We will cover them in this section.

Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces. In the context of Qdrant, quantization allows you to optimize the search engine for specific use cases, striking a balance between accuracy, storage efficiency, and search speed.

There are tradeoffs associated with quantization. On the one hand, quantization allows for significant reductions in storage requirements and faster search times. This can be particularly beneficial in large-scale applications where minimizing the use of resources is a top priority. On the other hand, quantization introduces an approximation error, which can lead to a slight decrease in search quality. The level of this tradeoff depends on the quantization method and its parameters, as well as the characteristics of the data.

Qdrant supports four quantization methods:

  • TurboQuant supports up to 32x compression, with strong recall across most embedding models.
  • Scalar Quantization compresses each vector component from a 32-bit float to an 8-bit integer, achieving 4x compression with minimal accuracy loss.
  • Binary Quantization reduces each vector component to one to two bits for up to 32x compression. Best suited for high-dimensional, centered vector distributions.
  • Product Quantization enables up to 64x compression when minimizing memory is the top priority.

To help you choose the right quantization method for your use case, refer to the next section.

How to Choose the Right Quantization Method

Depending on your requirements for recall, compression, and distance metrics, consult this table for guidance:

CompressionMethod
4Use Scalar Quantization. It is a well-established quantization method with a good balance between recall and compression.

However, unless you need to use the Manhattan (L1) distance metric, consider using 4-bit TurboQuant instead of scalar quantization, as it offers comparable recall at double the compression.
8Use 4-bit TurboQuant. It offers a good balance between recall and compression.

When using the Manhattan (L1) distance metric, consider using another quantization method.
162-bit TurboQuant and 2-bit binary quantization offer similar results at this compression level. Binary quantization is faster, but TurboQuant provides better recall.
241.5-bit TurboQuant and 1.5-bit binary quantization offer similar results at this compression level. Binary quantization is faster, but TurboQuant provides better recall.
321-bit TurboQuant and 1-bit binary quantization offer similar results at this compression level. Binary quantization is faster, but TurboQuant provides better recall.
Up to 64Use Product Quantization if the memory footprint is the top priority and accuracy and speed are not critical.

TurboQuant Quantization

Available as of v1.18.0

TurboQuant is a quantization method developed by Google. It operates by applying a fast random rotation to vectors before compression, which evenly redistributes data across coordinates. This allows applying a single pre-computed, globally optimized quantization mapping across the dataset, enabling TurboQuant to work effectively with any vector distribution and overcoming a key limitation found in binary quantization.

Qdrant’s implementation of TurboQuant extends the original algorithm to close the gap between the algorithm’s theoretical assumptions and real-world embeddings.

TurboQuant uses asymmetric quantization automatically: only stored vectors are compressed, while queries are scored in full precision. This improves accuracy and requires no additional configuration.

Encoding Options

TurboQuant supports four bit depths:

EncodingBit DepthCompression
bits4 (default)4 bits
bits22 bits16×
bits1_51.5 bits24×
bits11 bit32×

In our benchmarks, 4-bit TurboQuant, at twice the compression ratio of scalar quantization, delivers similar recall and speed. Results vary by dataset and embedding model: it may outperform or slightly underperform scalar quantization. This makes 4-bit TurboQuant a good default choice for many use cases.

Compared to binary quantization, TurboQuant offers better recall at lower speed and equivalent storage budgets.

The default encoding is bits4, which offers the best accuracy.

Distance Metric Support

TurboQuant fully supports Cosine, Dot, and Euclidean (L2) distance with SIMD-accelerated scoring.

Manhattan (L1) distance is supported but requires full vector reconstruction per comparison, making it significantly slower than the other metrics. Use Cosine, Dot, or Euclidean distance for best performance with TurboQuant.

Scalar Quantization

Available as of v1.1.0

Scalar quantization, in the context of vector search engines, is a compression technique that compresses vectors by reducing the number of bits used to represent each vector component.

For instance, Qdrant uses 32-bit floating numbers to represent the original vector components. Scalar quantization allows you to reduce the number of bits used to 8. In other words, Qdrant performs float32 -> uint8 conversion for each vector component. Effectively, this means that the amount of memory required to store a vector is reduced by a factor of 4.

In addition to reducing the memory footprint, scalar quantization also speeds up the search process. Qdrant uses a special SIMD CPU instruction to perform fast vector comparison. This instruction works with 8-bit integers, so the conversion to uint8 allows Qdrant to perform the comparison faster.

The main drawback of scalar quantization is the loss of accuracy. The float32 -> uint8 conversion introduces an error that can lead to a slight decrease in search quality. However, this error is usually negligible, and tends to be less significant for high-dimensional vectors. In our experiments, we found that the error introduced by scalar quantization is usually less than 1%.

However, this value depends on the data and the quantization parameters. Please refer to the Quantization Tips section for more information on how to optimize the quantization parameters for your use case.

Binary Quantization

Available as of v1.5.0

Binary quantization is an extreme case of scalar quantization. This feature lets you represent each vector component as a single bit, effectively reducing the memory footprint by a factor of 32. This is the fastest quantization method, since it lets you perform a vector comparison with a few CPU instructions. Binary quantization can achieve up to a 40x speedup compared to the original vectors.

However, binary quantization is only efficient for high-dimensional vectors and require a centered distribution of vector components.

At the moment, binary quantization shows good accuracy results with the following models:

  • OpenAI text-embedding-ada-002 - 1536d tested with dbpedia dataset achieving 0.98 recall@100 with 4x oversampling
  • Cohere AI embed-english-v2.0 - 4096d tested on Wikipedia embeddings - 0.98 recall@50 with 2x oversampling

Models with a lower dimensionality or a different distribution of vector components may require additional experiments to find the optimal quantization parameters.

We recommend using binary quantization only with rescoring enabled, as this can significantly improve search quality. However, keep in mind that if the original vectors are stored on disk, rescoring can significantly decrease search speed.

Additionally, oversampling can be used to tune the tradeoff between search speed and search quality at query time.

Binary Quantization as Hamming Distance

The additional benefit of this method is that you can efficiently emulate Hamming distance with dot product.

Specifically, if original vectors contain {-1, 1} as possible values, then the dot product of two vectors is equal to the Hamming distance by simply replacing -1 with 0 and 1 with 1.

Sample truth table
Vector 1Vector 2Dot product
111
1-1-1
-11-1
-1-11
Vector 1Vector 2Hamming distance
110
101
011
000

As you can see, both functions are equal up to a constant factor, which makes similarity search equivalent. Binary quantization makes it efficient to compare vectors using this representation.

1.5-Bit and 2-Bit Quantization

Available as of v1.15.0

Binary quantization storage can use 2 and 1.5 bits per dimension, improving precision for smaller vectors. One-bit compression resulted in significant data loss and precision drops for vectors smaller than a thousand dimensions, often requiring expensive rescoring. 2-bit quantization offers 16X compression compared to 32X with one bit, improving performance for smaller vector dimensions. The 1.5-bit quantization compression offers 24X compression and intermediate accuracy.

A major limitation of binary quantization is poor handling of values close to zero. 2-bit quantization addresses this by explicitly representing zeros using an efficient scoring mechanism. In the case of 1.5-bit quantization, the zero-bit is shared between two values, balancing the efficiency of binary quantization with the accuracy improvements of 2-bit quantization, especially when 2-bit BQ requires too much memory.

In order to build 2-bit representation, Qdrant computes values distribution and then assigns bit values to 3 possible buckets:

  • -1 - 00
  • 0 - 01
  • 1 - 11

1.5-bit quantization is similar, but merges buckets of pairs of elements into a binary triptets

2-bit quantization

2-bit quantization

See how to set up 1.5-bit and 2-bit quantization in the following section.

Asymmetric Quantization

Available as of v1.15.0

The Asymmetric Quantization technique allows qdrant to use different vector encoding algorithm for stored vectors and for queries. Particularly interesting combination is a Binary stored vectors and Scalar quantized queries.

Asymmetric quantization

Asymmetric quantization

This approach maintains storage size and RAM usage similar to binary quantization while offering improved precision. It is beneficial for memory-constrained deployments, or where the bottleneck is disk I/O rather than CPU. This is particularly useful for indexing millions of vectors as it improves precision without sacrificing much because the limitation in such scenarios is disk speed, not CPU. This approach requires less rescoring for the same quality output.

See how to set up Asymmetric Quantization quantization in the following section

Product Quantization

Available as of v1.2.0

Product quantization is a method of compressing vectors to minimize their memory usage by dividing them into chunks and quantizing each segment individually. Each chunk is approximated by a centroid index that represents the original vector component. The positions of the centroids are determined through the utilization of a clustering algorithm such as k-means. For now, Qdrant uses only 256 centroids, so each centroid index can be represented by a single byte.

Product quantization can compress by a more prominent factor than a scalar one. But there are some tradeoffs. Product quantization distance calculations are not SIMD-friendly, so it is slower than scalar quantization. Also, product quantization has a loss of accuracy, so it is recommended to use it only for high-dimensional vectors.

Please refer to the Quantization Tips section for more information on how to optimize the quantization parameters for your use case.

Setting Up Quantization in Qdrant

You can configure quantization for a collection by specifying the quantization parameters in the quantization_config section of the collection configuration.

Quantization will be automatically applied to all vectors during the indexation process. Quantized vectors are stored alongside the original vectors in the collection, so you will still have access to the original vectors if you need them.

Available as of v1.1.1

The quantization_config can also be set on a per vector basis by specifying it in a named vector.

Setting Up TurboQuant

To enable TurboQuant, specify it in the quantization_config section of the collection configuration.

When enabling TurboQuant on an existing collection, use a PATCH request or the corresponding update_collection method and omit the vector configuration, as it’s already defined.

httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "turbo": {
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.TurboQuantization(
        turbo=models.TurboQuantQuantizationConfig(
            always_ram=True,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

client.createCollection("{collection_name}", {
  vectors: {
    size: 1536,
    distance: "Cosine",
  },
  quantization_config: {
    turbo: {
      always_ram: true,
    },
  },
});
use qdrant_client::qdrant::{
    CreateCollectionBuilder, Distance, TurboQuantizationBuilder, VectorParamsBuilder,
};
use qdrant_client::Qdrant;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
            .quantization_config(TurboQuantizationBuilder::new().always_ram(true)),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.TurboQuantization;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(1536)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setTurboquant(TurboQuantization.newBuilder().setAlwaysRam(true).build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.CreateCollectionAsync(
	collectionName: "{collection_name}",
	vectorsConfig: new VectorParams { Size = 1536, Distance = Distance.Cosine },
	quantizationConfig: new QuantizationConfig
	{
		Turboquant = new TurboQuantization { AlwaysRam = true }
	}
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
	CollectionName: "{collection_name}",
	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
		Size:     1536,
		Distance: qdrant.Distance_Cosine,
	}),
	QuantizationConfig: qdrant.NewQuantizationTurbo(
		&qdrant.TurboQuantization{
			AlwaysRam: qdrant.PtrOf(true),
		},
	),
})

bits - the encoding bit depth. Defaults to bits4. Available values: bits4, bits2, bits1_5, and bits1. Lower bit depths offer higher compression at the cost of accuracy.

always_ram - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors. Set always_ram to true to store quantized vectors in RAM.

Select a Bit Depth

To use a specific compression level, set the bits parameter:

httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "turbo": {
            "bits": "bits2",
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.TurboQuantization(
        turbo=models.TurboQuantQuantizationConfig(
            always_ram=True,
            bits=models.TurboQuantBitSize.BITS2,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

client.createCollection("{collection_name}", {
  vectors: {
    size: 1536,
    distance: "Cosine",
  },
  quantization_config: {
    turbo: {
      always_ram: true,
      bits: "bits2",
    },
  },
});
use qdrant_client::qdrant::{
    CreateCollectionBuilder, Distance, TurboQuantBitSize, TurboQuantizationBuilder,
    VectorParamsBuilder,
};
use qdrant_client::Qdrant;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
            .quantization_config(
                TurboQuantizationBuilder::new()
                    .always_ram(true)
                    .bits(TurboQuantBitSize::Bits2),
            ),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.TurboQuantBitSize;
import io.qdrant.client.grpc.Collections.TurboQuantization;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(1536)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setTurboquant(
                        TurboQuantization.newBuilder()
                            .setAlwaysRam(true)
                            .setBits(TurboQuantBitSize.Bits2)
                            .build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.CreateCollectionAsync(
	collectionName: "{collection_name}",
	vectorsConfig: new VectorParams { Size = 1536, Distance = Distance.Cosine },
	quantizationConfig: new QuantizationConfig
	{
		Turboquant = new TurboQuantization { AlwaysRam = true, Bits = TurboQuantBitSize.Bits2 }
	}
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
	CollectionName: "{collection_name}",
	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
		Size:     1536,
		Distance: qdrant.Distance_Cosine,
	}),
	QuantizationConfig: qdrant.NewQuantizationTurbo(
		&qdrant.TurboQuantization{
			AlwaysRam: qdrant.PtrOf(true),
			Bits:      qdrant.TurboQuantBitSize_Bits2.Enum(),
		},
	),
})

Setting Up Scalar Quantization

To enable scalar quantization, you need to specify the quantization parameters in the quantization_config section of the collection configuration.

When enabling scalar quantization on an existing collection, use a PATCH request or the corresponding update_collection method and omit the vector configuration, as it’s already defined.

httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",
            "quantile": 0.99,
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.createCollection("{collection_name}", {
  vectors: {
    size: 768,
    distance: "Cosine",
  },
  quantization_config: {
    scalar: {
      type: "int8",
      quantile: 0.99,
      always_ram: true,
    },
  },
});
use qdrant_client::qdrant::{
    CreateCollectionBuilder, Distance, QuantizationType, ScalarQuantizationBuilder,
    VectorParamsBuilder,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(768, Distance::Cosine))
            .quantization_config(
                ScalarQuantizationBuilder::default()
                    .r#type(QuantizationType::Int8.into())
                    .quantile(0.99)
                    .always_ram(true),
            ),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.QuantizationType;
import io.qdrant.client.grpc.Collections.ScalarQuantization;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(768)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setScalar(
                        ScalarQuantization.newBuilder()
                            .setType(QuantizationType.Int8)
                            .setQuantile(0.99f)
                            .setAlwaysRam(true)
                            .build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.CreateCollectionAsync(
 collectionName: "{collection_name}",
 vectorsConfig: new VectorParams { Size = 768, Distance = Distance.Cosine },
 quantizationConfig: new QuantizationConfig
 {
  Scalar = new ScalarQuantization
  {
   Type = QuantizationType.Int8,
   Quantile = 0.99f,
   AlwaysRam = true
  }
 }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
	Host: "localhost",
	Port: 6334,
})

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
	CollectionName: "{collection_name}",
	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
		Size:     768,
		Distance: qdrant.Distance_Cosine,
	}),
	QuantizationConfig: qdrant.NewQuantizationScalar(
		&qdrant.ScalarQuantization{
            Type:      qdrant.QuantizationType_Int8,
			Quantile:  qdrant.PtrOf(float32(0.99)),
			AlwaysRam: qdrant.PtrOf(true),
		},
	),
})

There are 3 parameters that you can specify in the quantization_config section:

type - the type of the quantized vector components. Currently, Qdrant supports only int8.

quantile - the quantile of the quantized vector components. The quantile is used to calculate the quantization bounds. For instance, if you specify 0.99 as the quantile, 1% of extreme values will be excluded from the quantization bounds.

Using quantiles lower than 1.0 might be useful if there are outliers in your vector components. This parameter only affects the resulting precision and not the memory footprint. It might be worth tuning this parameter if you experience a significant decrease in search quality.

always_ram - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors. However, in some setups you might want to keep quantized vectors in RAM to speed up the search process.

In this case, you can set always_ram to true to store quantized vectors in RAM.

Setting Up Binary Quantization

To enable binary quantization, you need to specify the quantization parameters in the quantization_config section of the collection configuration.

When enabling binary quantization on an existing collection, use a PATCH request or the corresponding update_collection method and omit the vector configuration, as it’s already defined.

httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "binary": {
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.createCollection("{collection_name}", {
  vectors: {
    size: 1536,
    distance: "Cosine",
  },
  quantization_config: {
    binary: {
      always_ram: true,
    },
  },
});
use qdrant_client::qdrant::{
    BinaryQuantizationBuilder, CreateCollectionBuilder, Distance, VectorParamsBuilder,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
            .quantization_config(BinaryQuantizationBuilder::new(true)),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.BinaryQuantization;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(1536)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setBinary(BinaryQuantization.newBuilder().setAlwaysRam(true).build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.CreateCollectionAsync(
 collectionName: "{collection_name}",
 vectorsConfig: new VectorParams { Size = 1536, Distance = Distance.Cosine },
 quantizationConfig: new QuantizationConfig
 {
  Binary = new BinaryQuantization { AlwaysRam = true }
 }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
	Host: "localhost",
	Port: 6334,
})

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
	CollectionName: "{collection_name}",
	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
		Size:     1536,
		Distance: qdrant.Distance_Cosine,
	}),
	QuantizationConfig: qdrant.NewQuantizationBinary(
		&qdrant.BinaryQuantization{
			AlwaysRam: qdrant.PtrOf(true),
		},
	),
})

always_ram - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors. However, in some setups you might want to keep quantized vectors in RAM to speed up the search process.

In this case, you can set always_ram to true to store quantized vectors in RAM.

Set Up Bit Depth

To enable 2bit or 1.5bit quantization, you need to specify encoding parameter in the quantization_config section of the collection configuration. Available values are two_bits and one_and_half_bits.

httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "binary": {
            "encoding": "two_bits",
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            encoding=models.BinaryQuantizationEncoding.TWO_BITS,
            always_ram=True,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.createCollection("{collection_name}", {
  vectors: {
    size: 1536,
    distance: "Cosine",
  },
  quantization_config: {
    binary: {
      encoding: "two_bits",
      always_ram: true,
    },
  },
});
use qdrant_client::qdrant::{
    BinaryQuantizationBuilder,
    CreateCollectionBuilder,
    Distance,
    VectorParamsBuilder,
    BinaryQuantizationEncoding,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
            .quantization_config(BinaryQuantizationBuilder::new(true)
                .encoding(BinaryQuantizationEncoding::TwoBits)
            ),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.BinaryQuantization;
import io.qdrant.client.grpc.Collections.BinaryQuantizationEncoding;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(1536)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setBinary(BinaryQuantization
                        .newBuilder()
                        .setEncoding(BinaryQuantizationEncoding.TwoBits)
                        .setAlwaysRam(true)
                        .build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.CreateCollectionAsync(
  collectionName: "{collection_name}",
  vectorsConfig: new VectorParams { Size = 1536, Distance = Distance.Cosine },
  quantizationConfig: new QuantizationConfig
  {
    Binary = new BinaryQuantization {
      Encoding = BinaryQuantizationEncoding.TwoBits,
      AlwaysRam = true
    }
  }
);
import (
    "context"

    "github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
    Host: "localhost",
    Port: 6334,
})

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
    CollectionName: "{collection_name}",
    VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
        Size:     1536,
        Distance: qdrant.Distance_Cosine,
    }),
    QuantizationConfig: qdrant.NewQuantizationBinary(
        &qdrant.BinaryQuantization{
            Encoding: qdrant.BinaryQuantizationEncoding_TwoBits.Enum(),
            AlwaysRam: qdrant.PtrOf(true),
        },
    ),
})

Set Up Asymmetric Quantization

To enable asymmetric quantization, you need to specify query_encoding parameter in the quantization_config section of the collection configuration. Available values are:

  • default and binary - use regular binary quantization for the query.
  • scalar8bits - use 8bit quantization for the query.
  • scalar4bits - use 4bit quantization for the query.
httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "quantization_config": {
        "binary": {
            "query_encoding": "scalar8bits",
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            query_encoding=models.BinaryQuantizationQueryEncoding.SCALAR8BITS,
            always_ram=True,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.createCollection("{collection_name}", {
  vectors: {
    size: 1536,
    distance: "Cosine",
  },
  quantization_config: {
    binary: {
      query_encoding: "scalar8bits",
      always_ram: true,
    },
  },
});
use qdrant_client::qdrant::{
    BinaryQuantizationBuilder,
    CreateCollectionBuilder,
    Distance,
    VectorParamsBuilder,
    BinaryQuantizationQueryEncoding,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
            .quantization_config(
                BinaryQuantizationBuilder::new(true)
                    .query_encoding(BinaryQuantizationQueryEncoding::scalar8bits())
            ),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.BinaryQuantization;
import io.qdrant.client.grpc.Collections.BinaryQuantizationQueryEncoding;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(1536)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setBinary(BinaryQuantization.newBuilder()
                        .setQueryEncoding(BinaryQuantizationQueryEncoding
                            .newBuilder()
                            .setSetting(BinaryQuantizationQueryEncoding.Setting.Scalar8Bits)
                            .build())
                        .setAlwaysRam(true)
                        .build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.CreateCollectionAsync(
  collectionName: "{collection_name}",
  vectorsConfig: new VectorParams { Size = 1536, Distance = Distance.Cosine },
  quantizationConfig: new QuantizationConfig
  {
    Binary = new BinaryQuantization {
      QueryEncoding = new BinaryQuantizationQueryEncoding
      {
        Setting = BinaryQuantizationQueryEncoding.Types.Setting.Scalar8Bits,
      },
      AlwaysRam = true
    }
  }
);
import (
    "context"

    "github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
    Host: "localhost",
    Port: 6334,
})

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
    CollectionName: "{collection_name}",
    VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
        Size:     1536,
        Distance: qdrant.Distance_Cosine,
    }),
    QuantizationConfig: qdrant.NewQuantizationBinary(
        &qdrant.BinaryQuantization{
            QueryEncoding: qdrant.NewBinaryQuantizationQueryEncodingSetting(qdrant.BinaryQuantizationQueryEncoding_Scalar8Bits),
            AlwaysRam: qdrant.PtrOf(true),
        },
    ),
})

Setting Up Product Quantization

To enable product quantization, you need to specify the quantization parameters in the quantization_config section of the collection configuration.

When enabling product quantization on an existing collection, use a PATCH request or the corresponding update_collection method and omit the vector configuration, as it’s already defined.

httppythontypescriptrustjavacsharpgo
PUT /collections/{collection_name}
{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    },
    "quantization_config": {
        "product": {
            "compression": "x16",
            "always_ram": true
        }
    }
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    quantization_config=models.ProductQuantization(
        product=models.ProductQuantizationConfig(
            compression=models.CompressionRatio.X16,
            always_ram=True,
        ),
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.createCollection("{collection_name}", {
  vectors: {
    size: 768,
    distance: "Cosine",
  },
  quantization_config: {
    product: {
      compression: "x16",
      always_ram: true,
    },
  },
});
use qdrant_client::qdrant::{
    CompressionRatio, CreateCollectionBuilder, Distance, ProductQuantizationBuilder,
    VectorParamsBuilder,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .create_collection(
        CreateCollectionBuilder::new("{collection_name}")
            .vectors_config(VectorParamsBuilder::new(768, Distance::Cosine))
            .quantization_config(
                ProductQuantizationBuilder::new(CompressionRatio::X16.into()).always_ram(true),
            ),
    )
    .await?;
import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Collections.CompressionRatio;
import io.qdrant.client.grpc.Collections.CreateCollection;
import io.qdrant.client.grpc.Collections.Distance;
import io.qdrant.client.grpc.Collections.ProductQuantization;
import io.qdrant.client.grpc.Collections.QuantizationConfig;
import io.qdrant.client.grpc.Collections.VectorParams;
import io.qdrant.client.grpc.Collections.VectorsConfig;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client
    .createCollectionAsync(
        CreateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setVectorsConfig(
                VectorsConfig.newBuilder()
                    .setParams(
                        VectorParams.newBuilder()
                            .setSize(768)
                            .setDistance(Distance.Cosine)
                            .build())
                    .build())
            .setQuantizationConfig(
                QuantizationConfig.newBuilder()
                    .setProduct(
                        ProductQuantization.newBuilder()
                            .setCompression(CompressionRatio.x16)
                            .setAlwaysRam(true)
                            .build())
                    .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.CreateCollectionAsync(
 collectionName: "{collection_name}",
 vectorsConfig: new VectorParams { Size = 768, Distance = Distance.Cosine },
 quantizationConfig: new QuantizationConfig
 {
  Product = new ProductQuantization { Compression = CompressionRatio.X16, AlwaysRam = true }
 }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
	Host: "localhost",
	Port: 6334,
})

client.CreateCollection(context.Background(), &qdrant.CreateCollection{
	CollectionName: "{collection_name}",
	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
		Size:     768,
		Distance: qdrant.Distance_Cosine,
	}),
	QuantizationConfig: qdrant.NewQuantizationProduct(
		&qdrant.ProductQuantization{
			Compression: qdrant.CompressionRatio_x16,
			AlwaysRam:   qdrant.PtrOf(true),
		},
	),
})

There are two parameters that you can specify in the quantization_config section:

compression - compression ratio. Compression ratio represents the size of the quantized vector in bytes divided by the size of the original vector in bytes. In this case, the quantized vector will be 16 times smaller than the original vector.

always_ram - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors. However, in some setups you might want to keep quantized vectors in RAM to speed up the search process. Then set always_ram to true.

Disabling Quantization

To disable quantization in an existing collection, you can do the following:

httpbashpythontypescriptrustjavacsharpgo
PATCH /collections/{collection_name}
{
    "quantization_config": "Disabled"
}
curl -X PATCH http://localhost:6333/collections/{collection_name} \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "quantization_config": "Disabled"
  }'
client.update_collection(
    collection_name="{collection_name}",
    quantization_config=models.Disabled.DISABLED,
)
client.updateCollection("{collection_name}", {
    quantization_config: 'Disabled'
});
use qdrant_client::qdrant::{Disabled, UpdateCollectionBuilder};

client
    .update_collection(UpdateCollectionBuilder::new("{collection_name}").quantization_config(Disabled {}))
    .await?;
import io.qdrant.client.grpc.Collections.Disabled;
import io.qdrant.client.grpc.Collections.QuantizationConfigDiff;
import io.qdrant.client.grpc.Collections.UpdateCollection;

client.updateCollectionAsync(
    UpdateCollection.newBuilder()
        .setCollectionName("{collection_name}")
        .setQuantizationConfig(
            QuantizationConfigDiff.newBuilder()
                .setDisabled(Disabled.getDefaultInstance())
                .build())
        .build());
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.UpdateCollectionAsync(
	collectionName: "{collection_name}",
	quantizationConfig: new QuantizationConfigDiff { Disabled = new Disabled() }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
	Host: "localhost",
	Port: 6334,
})

client.UpdateCollection(context.Background(), &qdrant.UpdateCollection{
	CollectionName:     "{collection_name}",
	QuantizationConfig: qdrant.NewQuantizationDiffDisabled(),
})

Searching with Quantization

Once you have configured quantization for a collection, you don’t need to do anything extra to search with quantization. Qdrant will automatically use quantized vectors if they are available.

However, there are a few options that you can use to control the search process:

httppythontypescriptrustjavacsharpgo
POST /collections/{collection_name}/points/query
{
    "query": [0.2, 0.1, 0.9, 0.7],
    "params": {
        "quantization": {
            "ignore": false,
            "rescore": true,
            "oversampling": 2.0
        }
    },
    "limit": 10
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.query_points(
    collection_name="{collection_name}",
    query=[0.2, 0.1, 0.9, 0.7],
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,
            rescore=True,
            oversampling=2.0,
        )
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.query("{collection_name}", {
    query: [0.2, 0.1, 0.9, 0.7],
    params: {
        quantization: {
            ignore: false,
            rescore: true,
            oversampling: 2.0,
        },
    },
    limit: 10,
});
use qdrant_client::qdrant::{
    QuantizationSearchParamsBuilder, QueryPointsBuilder, SearchParamsBuilder,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .query(
        QueryPointsBuilder::new("{collection_name}")
            .query(vec![0.2, 0.1, 0.9, 0.7])
            .limit(10)
            .params(
                SearchParamsBuilder::default().quantization(
                    QuantizationSearchParamsBuilder::default()
                        .ignore(false)
                        .rescore(true)
                        .oversampling(2.0),
                ),
            ),
    )
    .await?;
import static io.qdrant.client.QueryFactory.nearest;

import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Points.QuantizationSearchParams;
import io.qdrant.client.grpc.Points.QueryPoints;
import io.qdrant.client.grpc.Points.SearchParams;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client.queryAsync(
        QueryPoints.newBuilder()
                .setCollectionName("{collection_name}")
                .setQuery(nearest(0.2f, 0.1f, 0.9f, 0.7f))
                .setParams(
                        SearchParams.newBuilder()
                                .setQuantization(
                                        QuantizationSearchParams.newBuilder()
                                                .setIgnore(false)
                                                .setRescore(true)
                                                .setOversampling(2.0)
                                                .build())
                                .build())
                .setLimit(10)
                .build())
        .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.QueryAsync(
	collectionName: "{collection_name}",
	query: new float[] { 0.2f, 0.1f, 0.9f, 0.7f },
	searchParams: new SearchParams
	{
		Quantization = new QuantizationSearchParams
		{
			Ignore = false,
			Rescore = true,
			Oversampling = 2.0
		}
	},
	limit: 10
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
	Host: "localhost",
	Port: 6334,
})

client.Query(context.Background(), &qdrant.QueryPoints{
	CollectionName: "{collection_name}",
	Query:          qdrant.NewQuery(0.2, 0.1, 0.9, 0.7),
	Params: &qdrant.SearchParams{
		Quantization: &qdrant.QuantizationSearchParams{
			Ignore:       qdrant.PtrOf(false),
			Rescore:      qdrant.PtrOf(true),
			Oversampling: qdrant.PtrOf(2.0),
		},
	},
})

ignore - Toggle whether to ignore quantized vectors during the search process. By default, Qdrant will use quantized vectors if they are available.

rescore - Qdrant can re-evaluate top-k search results using the original vectors. While this can improve search quality, it may decrease search speed, especially if the original vectors are stored on disk. In such cases, it is recommended to disable rescoring. By default, rescoring is only enabled for binary quantization. Other quantization methods do not rescore by default.

Available as of v1.3.0

oversampling - Defines how many extra vectors should be pre-selected using quantized index, and then re-scored using original vectors. For example, if oversampling is 2.4 and limit is 100, then 240 vectors will be pre-selected using quantized index, and then top-100 will be returned after re-scoring. Oversampling is useful if you want to tune the tradeoff between search speed and search quality in the query time.

Quantization Tips

Accuracy Tuning

In this section, we will discuss how to tune the search precision. The fastest way to understand the impact of quantization on the search quality is to compare the search results with and without quantization.

In order to disable quantization, you can set ignore to true in the search request:

httppythontypescriptrustjavacsharpgo
POST /collections/{collection_name}/points/query
{
    "query": [0.2, 0.1, 0.9, 0.7],
    "params": {
        "quantization": {
            "ignore": true
        }
    },
    "limit": 10
}
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.query_points(
    collection_name="{collection_name}",
    query=[0.2, 0.1, 0.9, 0.7],
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=True,
        )
    ),
)
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ host: "localhost", port: 6333 });

client.query("{collection_name}", {
    query: [0.2, 0.1, 0.9, 0.7],
    params: {
        quantization: {
            ignore: true,
        },
    },
});
use qdrant_client::qdrant::{
    QuantizationSearchParamsBuilder, QueryPointsBuilder, SearchParamsBuilder,
};
use qdrant_client::Qdrant;

let client = Qdrant::from_url("http://localhost:6334").build()?;

client
    .query(
        QueryPointsBuilder::new("{collection_name}")
            .query(vec![0.2, 0.1, 0.9, 0.7])
            .limit(3)
            .params(
                SearchParamsBuilder::default()
                    .quantization(QuantizationSearchParamsBuilder::default().ignore(true)),
            ),
    )
    .await?;
import static io.qdrant.client.QueryFactory.nearest;

import io.qdrant.client.QdrantClient;
import io.qdrant.client.QdrantGrpcClient;
import io.qdrant.client.grpc.Points.QuantizationSearchParams;
import io.qdrant.client.grpc.Points.QueryPoints;
import io.qdrant.client.grpc.Points.SearchParams;

QdrantClient client =
    new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());

client.queryAsync(
        QueryPoints.newBuilder()
                .setCollectionName("{collection_name}")
                .setQuery(nearest(0.2f, 0.1f, 0.9f, 0.7f))
                .setParams(
                        SearchParams.newBuilder()
                                .setQuantization(
                                        QuantizationSearchParams.newBuilder().setIgnore(true).build())
                                .build())
                .setLimit(10)
                .build())
        .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

await client.QueryAsync(
	collectionName: "{collection_name}",
	query: new float[] { 0.2f, 0.1f, 0.9f, 0.7f },
	searchParams: new SearchParams
	{
		Quantization = new QuantizationSearchParams { Ignore = true }
	},
	limit: 10
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client, err := qdrant.NewClient(&qdrant.Config{
	Host: "localhost",
	Port: 6334,
})

client.Query(context.Background(), &qdrant.QueryPoints{
	CollectionName: "{collection_name}",
	Query:          qdrant.NewQuery(0.2, 0.1, 0.9, 0.7),
	Params: &qdrant.SearchParams{
		Quantization: &qdrant.QuantizationSearchParams{
			Ignore: qdrant.PtrOf(false),
		},
	},
})
  • Adjust the quantile parameter: The quantile parameter in scalar quantization determines the quantization bounds. By setting it to a value lower than 1.0, you can exclude extreme values (outliers) from the quantization bounds. For example, if you set the quantile to 0.99, 1% of the extreme values will be excluded. By adjusting the quantile, you find an optimal value that will provide the best search quality for your collection.

  • Enable rescoring: Qdrant can re-evaluate top-k search results using the original vectors. While this can improve search quality, it may decrease search speed, especially if the original vectors are stored on disk. In such cases, it is recommended to disable rescoring. By default, rescoring is only enabled for binary quantization. Other quantization methods do not rescore by default.

Memory and Speed Tuning

In this section, we will discuss how to tune the memory and speed of the search process with quantization.

There are 3 possible modes to place storage of vectors within the qdrant collection:

  • All in RAM - all vector, original and quantized, are loaded and kept in RAM. This is the fastest mode, but requires a lot of RAM. Enabled by default.

  • Original on Disk, quantized in RAM - this is a hybrid mode, allows to obtain a good balance between speed and memory usage. Recommended scenario if you are aiming to shrink the memory footprint while keeping the search speed.

    This mode is enabled by setting always_ram to true in the quantization config while using memmap storage:

    httppythontypescriptrustjavacsharpgo
    PUT /collections/{collection_name}
    {
        "vectors": {
            "size": 768,
            "distance": "Cosine",
            "on_disk": true
        },
        "quantization_config": {
            "scalar": {
                "type": "int8",
                "always_ram": true
            }
        }
    }
    
    from qdrant_client import QdrantClient, models
    
    client = QdrantClient(url="http://localhost:6333")
    
    client.create_collection(
        collection_name="{collection_name}",
        vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE, on_disk=True),
        quantization_config=models.ScalarQuantization(
            scalar=models.ScalarQuantizationConfig(
                type=models.ScalarType.INT8,
                always_ram=True,
            ),
        ),
    )
    
    import { QdrantClient } from "@qdrant/js-client-rest";
    
    const client = new QdrantClient({ host: "localhost", port: 6333 });
    
    client.createCollection("{collection_name}", {
      vectors: {
        size: 768,
        distance: "Cosine",
        on_disk: true,
      },
      quantization_config: {
        scalar: {
          type: "int8",
          always_ram: true,
        },
      },
    });
    
    use qdrant_client::qdrant::{
        CreateCollectionBuilder, Distance, QuantizationType, ScalarQuantizationBuilder,
        VectorParamsBuilder,
    };
    use qdrant_client::Qdrant;
    
    let client = Qdrant::from_url("http://localhost:6334").build()?;
    
    client
        .create_collection(
            CreateCollectionBuilder::new("{collection_name}")
                .vectors_config(VectorParamsBuilder::new(768, Distance::Cosine))
                .quantization_config(
                    ScalarQuantizationBuilder::default()
                        .r#type(QuantizationType::Int8.into())
                        .always_ram(true),
                ),
        )
        .await?;
    
    import io.qdrant.client.QdrantClient;
    import io.qdrant.client.QdrantGrpcClient;
    import io.qdrant.client.grpc.Collections.CreateCollection;
    import io.qdrant.client.grpc.Collections.Distance;
    import io.qdrant.client.grpc.Collections.OptimizersConfigDiff;
    import io.qdrant.client.grpc.Collections.QuantizationConfig;
    import io.qdrant.client.grpc.Collections.QuantizationType;
    import io.qdrant.client.grpc.Collections.ScalarQuantization;
    import io.qdrant.client.grpc.Collections.VectorParams;
    import io.qdrant.client.grpc.Collections.VectorsConfig;
    
    QdrantClient client =
        new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());
    
    client
        .createCollectionAsync(
            CreateCollection.newBuilder()
                .setCollectionName("{collection_name}")
                .setVectorsConfig(
                    VectorsConfig.newBuilder()
                        .setParams(
                            VectorParams.newBuilder()
                                .setSize(768)
                                .setDistance(Distance.Cosine)
                                .setOnDisk(true)
                                .build())
                        .build())
                .setQuantizationConfig(
                    QuantizationConfig.newBuilder()
                        .setScalar(
                            ScalarQuantization.newBuilder()
                                .setType(QuantizationType.Int8)
                                .setAlwaysRam(true)
                                .build())
                        .build())
                .build())
        .get();
    
    using Qdrant.Client;
    using Qdrant.Client.Grpc;
    
    var client = new QdrantClient("localhost", 6334);
    
    await client.CreateCollectionAsync(
    	collectionName: "{collection_name}",
    	vectorsConfig: new VectorParams { Size = 768, Distance = Distance.Cosine, OnDisk = true },
    	quantizationConfig: new QuantizationConfig
    	{
    		Scalar = new ScalarQuantization { Type = QuantizationType.Int8, AlwaysRam = true }
    	}
    );
    
    import (
    	"context"
    
    	"github.com/qdrant/go-client/qdrant"
    )
    
    client, err := qdrant.NewClient(&qdrant.Config{
    	Host: "localhost",
    	Port: 6334,
    })
    
    client.CreateCollection(context.Background(), &qdrant.CreateCollection{
    	CollectionName: "{collection_name}",
    	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
    		Size:     768,
    		Distance: qdrant.Distance_Cosine,
    		OnDisk:   qdrant.PtrOf(true),
    	}),
    	QuantizationConfig: qdrant.NewQuantizationScalar(&qdrant.ScalarQuantization{
    		Type:      qdrant.QuantizationType_Int8,
    		AlwaysRam: qdrant.PtrOf(true),
    	}),
    })
    

    In this scenario, the number of disk reads may play a significant role in the search speed. In a system with high disk latency, the re-scoring step may become a bottleneck.

    Consider disabling rescore to improve the search speed:

    httppythontypescriptrustjavacsharpgo
    POST /collections/{collection_name}/points/query
    {
        "query": [0.2, 0.1, 0.9, 0.7],
        "params": {
            "quantization": {
                "rescore": false
            }
        },
        "limit": 10
    }
    
    from qdrant_client import QdrantClient, models
    
    client = QdrantClient(url="http://localhost:6333")
    
    client.query_points(
        collection_name="{collection_name}",
        query=[0.2, 0.1, 0.9, 0.7],
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(rescore=False)
        ),
    )
    
    import { QdrantClient } from "@qdrant/js-client-rest";
    
    const client = new QdrantClient({ host: "localhost", port: 6333 });
    
    client.query("{collection_name}", {
        query: [0.2, 0.1, 0.9, 0.7],
        params: {
            quantization: {
                rescore: false,
            },
        },
    });
    
    use qdrant_client::qdrant::{
        QuantizationSearchParamsBuilder, QueryPointsBuilder, SearchParamsBuilder,
    };
    use qdrant_client::Qdrant;
    
    let client = Qdrant::from_url("http://localhost:6334").build()?;
    
    client
        .query(
            QueryPointsBuilder::new("{collection_name}")
                .query(vec![0.2, 0.1, 0.9, 0.7])
                .limit(3)
                .params(
                    SearchParamsBuilder::default()
                        .quantization(QuantizationSearchParamsBuilder::default().rescore(false)),
                ),
        )
        .await?;
    
    import static io.qdrant.client.QueryFactory.nearest;
    
    import io.qdrant.client.QdrantClient;
    import io.qdrant.client.QdrantGrpcClient;
    import io.qdrant.client.grpc.Points.QuantizationSearchParams;
    import io.qdrant.client.grpc.Points.QueryPoints;
    import io.qdrant.client.grpc.Points.SearchParams;
    
    QdrantClient client =
        new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());
    
    client.queryAsync(
            QueryPoints.newBuilder()
                    .setCollectionName("{collection_name}")
                    .setQuery(nearest(0.2f, 0.1f, 0.9f, 0.7f))
                    .setParams(
                            SearchParams.newBuilder()
                                    .setQuantization(
                                            QuantizationSearchParams.newBuilder().setRescore(false).build())
                                    .build())
                    .setLimit(3)
                    .build())
            .get();
    
    using Qdrant.Client;
    using Qdrant.Client.Grpc;
    
    var client = new QdrantClient("localhost", 6334);
    
    await client.QueryAsync(
    	collectionName: "{collection_name}",
    	query: new float[] { 0.2f, 0.1f, 0.9f, 0.7f },
    	searchParams: new SearchParams
    	{
    		Quantization = new QuantizationSearchParams { Rescore = false }
    	},
    	limit: 3
    );
    
    import (
    	"context"
    
    	"github.com/qdrant/go-client/qdrant"
    )
    
    client, err := qdrant.NewClient(&qdrant.Config{
    	Host: "localhost",
    	Port: 6334,
    })
    
    client.Query(context.Background(), &qdrant.QueryPoints{
    	CollectionName: "{collection_name}",
    	Query:          qdrant.NewQuery(0.2, 0.1, 0.9, 0.7),
    	Params: &qdrant.SearchParams{
    		Quantization: &qdrant.QuantizationSearchParams{
    			Rescore: qdrant.PtrOf(false),
    		},
    	},
    })
    

  • All on Disk - all vectors, original and quantized, are stored on disk. This mode allows to achieve the smallest memory footprint, but at the cost of the search speed.

    It is recommended to use this mode if you have a large collection and fast storage (e.g. SSD or NVMe).

    This mode is enabled by setting always_ram to false in the quantization config while using mmap storage:

    httppythontypescriptrustjavacsharpgo
    PUT /collections/{collection_name}
    {
        "vectors": {
          "size": 768,
          "distance": "Cosine",
          "on_disk": true
        },
        "quantization_config": {
            "scalar": {
                "type": "int8",
                "always_ram": false
            }
        }
    }
    
    from qdrant_client import QdrantClient, models
    
    client = QdrantClient(url="http://localhost:6333")
    
    client.create_collection(
        collection_name="{collection_name}",
        vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE, on_disk=True),
        quantization_config=models.ScalarQuantization(
            scalar=models.ScalarQuantizationConfig(
                type=models.ScalarType.INT8,
                always_ram=False,
            ),
        ),
    )
    
    import { QdrantClient } from "@qdrant/js-client-rest";
    
    const client = new QdrantClient({ host: "localhost", port: 6333 });
    
    client.createCollection("{collection_name}", {
      vectors: {
        size: 768,
        distance: "Cosine",
        on_disk: true,
      },
      quantization_config: {
        scalar: {
          type: "int8",
          always_ram: false,
        },
      },
    });
    
    use qdrant_client::qdrant::{
        CreateCollectionBuilder, Distance, QuantizationType, ScalarQuantizationBuilder,
        VectorParamsBuilder,
    };
    use qdrant_client::Qdrant;
    
    let client = Qdrant::from_url("http://localhost:6334").build()?;
    
    client
        .create_collection(
            CreateCollectionBuilder::new("{collection_name}")
                .vectors_config(VectorParamsBuilder::new(768, Distance::Cosine).on_disk(true))
                .quantization_config(
                    ScalarQuantizationBuilder::default()
                        .r#type(QuantizationType::Int8.into())
                        .always_ram(false),
                ),
        )
        .await?;
    
    import io.qdrant.client.QdrantClient;
    import io.qdrant.client.QdrantGrpcClient;
    import io.qdrant.client.grpc.Collections.CreateCollection;
    import io.qdrant.client.grpc.Collections.Distance;
    import io.qdrant.client.grpc.Collections.OptimizersConfigDiff;
    import io.qdrant.client.grpc.Collections.QuantizationConfig;
    import io.qdrant.client.grpc.Collections.QuantizationType;
    import io.qdrant.client.grpc.Collections.ScalarQuantization;
    import io.qdrant.client.grpc.Collections.VectorParams;
    import io.qdrant.client.grpc.Collections.VectorsConfig;
    
    QdrantClient client =
        new QdrantClient(QdrantGrpcClient.newBuilder("localhost", 6334, false).build());
    
    client
        .createCollectionAsync(
            CreateCollection.newBuilder()
                .setCollectionName("{collection_name}")
                .setVectorsConfig(
                    VectorsConfig.newBuilder()
                        .setParams(
                            VectorParams.newBuilder()
                                .setSize(768)
                                .setDistance(Distance.Cosine)
                                .setOnDisk(true)
                                .build())
                        .build())
                .setQuantizationConfig(
                    QuantizationConfig.newBuilder()
                        .setScalar(
                            ScalarQuantization.newBuilder()
                                .setType(QuantizationType.Int8)
                                .setAlwaysRam(false)
                                .build())
                        .build())
                .build())
        .get();
    
    using Qdrant.Client;
    using Qdrant.Client.Grpc;
    
    var client = new QdrantClient("localhost", 6334);
    
    await client.CreateCollectionAsync(
     collectionName: "{collection_name}",
     vectorsConfig: new VectorParams { Size = 768, Distance = Distance.Cosine, OnDisk = true},
     quantizationConfig: new QuantizationConfig
     {
      Scalar = new ScalarQuantization { Type = QuantizationType.Int8, AlwaysRam = false }
     }
    );
    
    import (
    	"context"
    
    	"github.com/qdrant/go-client/qdrant"
    )
    
    client, err := qdrant.NewClient(&qdrant.Config{
    	Host: "localhost",
    	Port: 6334,
    })
    
    client.CreateCollection(context.Background(), &qdrant.CreateCollection{
    	CollectionName: "{collection_name}",
    	VectorsConfig: qdrant.NewVectorsConfig(&qdrant.VectorParams{
    		Size:     768,
    		Distance: qdrant.Distance_Cosine,
    		OnDisk:   qdrant.PtrOf(true),
    	}),
    	QuantizationConfig: qdrant.NewQuantizationScalar(
    		&qdrant.ScalarQuantization{
    			Type:      qdrant.QuantizationType_Int8,
    			AlwaysRam: qdrant.PtrOf(false),
    		},
    	),
    })
    

Was this page useful?

Thank you for your feedback! 🙏

We are sorry to hear that. 😔 You can edit this page on GitHub, or create a GitHub issue.