Troubleshoot Read-Write Contention

Qdrant is designed to index and optimize data as it arrives. While serving search queries, Qdrant’s background optimizer continuously builds HNSW indexes, merges segments, and applies quantization. Queries and the background optimizer compete for the same CPU time, memory bandwidth, and I/O (read-write contention). Qdrant’s defaults don’t prioritize either, but you can make several configuration changes to shift the balance.

This guide walks through a set of configuration changes to improve read latency under heavy write load. The steps are ordered by impact: start with step 1 and stop when your latency target is met. After each step, measure read latency and write throughput. If a change doesn’t improve latency enough or causes unacceptable throughput loss, revert it and move to the next step.

Step 1: Prevent Reads from Large Unindexed Segments

Requires Qdrant v1.17.1 or later. prevent_unoptimized is an experimental feature.

Under heavy write load, the optimizer can fall behind. When that happens, searches need to perform a full scan over large unindexed segments, which increases query latency. The prevent_unoptimized setting addresses this. Set it to true to prevent searching over unindexed data.

Enable this setting per collection or globally in the configuration file.

httpbashpythontypescriptrustjavacsharpgo
PATCH /collections/{collection_name}
{
    "optimizers_config": {
        "prevent_unoptimized": true
    }
}
curl -X PATCH http://localhost:6333/collections/{collection_name} \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "optimizers_config": {
        "prevent_unoptimized": true
    }
  }'
client.update_collection(
    collection_name="{collection_name}",
    optimizers_config=models.OptimizersConfigDiff(prevent_unoptimized=True),
)
client.updateCollection("{collection_name}", {
  optimizers_config: {
    prevent_unoptimized: true,
  },
});
use qdrant_client::qdrant::{OptimizersConfigDiffBuilder, UpdateCollectionBuilder};

client
    .update_collection(
        UpdateCollectionBuilder::new("{collection_name}").optimizers_config(
            OptimizersConfigDiffBuilder::default().prevent_unoptimized(true),
        ),
    )
    .await?;
import io.qdrant.client.grpc.Collections.OptimizersConfigDiff;
import io.qdrant.client.grpc.Collections.UpdateCollection;

client
    .updateCollectionAsync(
        UpdateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setOptimizersConfig(
                OptimizersConfigDiff.newBuilder().setPreventUnoptimized(true).build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.UpdateCollectionAsync(
	collectionName: "{collection_name}",
	optimizersConfig: new OptimizersConfigDiff { PreventUnoptimized = true }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.UpdateCollection(context.Background(), &qdrant.UpdateCollection{
	CollectionName: "{collection_name}",
	OptimizersConfig: &qdrant.OptimizersConfigDiff{
		PreventUnoptimized: qdrant.PtrOf(true),
	},
})

Set wait=false on All Write Requests

Several Qdrant clients (Python, TypeScript, JavaScript, .NET, and Java) default to setting the wait parameter to true on write requests. This makes writes synchronous: the write request returns only after the update has been applied and is visible in search results.

Enabling prevent_unoptimized changes the semantics of wait=true: it waits until every deferred point, including the current update and all preceding deferred points, has been indexed. Under continuous writes, this can take a long time and will likely cause client-side timeouts.

Worse, a blocked wait=true request delays all subsequent wait=true updates behind it: head-of-line blocking that can stall the entire update pipeline.

Set wait=false on all write requests when prevent_unoptimized is enabled.

For clients that default to wait=true, you’ll need to override this explicitly. The Go and Rust clients and the REST API already default to wait=false.

Step 2: Try Smaller Batch Sizes

When ingesting data into Qdrant, batching upserts is a best practice that significantly improves throughput by reducing the per-request overhead of individual insertions. The optimal batch size depends on your vector dimensions, payload size, and hardware.

Reducing batch size shortens each individual write operation, which tightens the window of contention with reads. The tradeoff is more round trips to Qdrant, which adds a small amount of network overhead.

A smaller batch size is especially worth trying when your payloads are large. A batch of 100 points with 10 KB payloads each is a much heavier write than the same number of points with minimal payload: it takes longer to process and consumes more memory during the transaction.

Start by halving your current batch size and measure the effect on read latency and write throughput. If latency improves without unacceptable throughput loss, halve again.

Step 3: Lower the Optimizer’s CPU Budget

By default, Qdrant auto-selects the number of CPU threads the optimizer can use, typically leaving one or two cores free and giving the rest to the optimizer. Under write pressure, this can leave little headroom for queries. Use the optimizer_cpu_budget setting to lower the optimizer’s CPU budget.

optimizer_cpu_budget is a node-level setting in the Qdrant configuration file, under performance:

performance:
  # If positive, use exactly this many CPU threads for optimization.
  # If negative, subtract from available CPUs.
  # If 0, auto-select (default).
  optimizer_cpu_budget: 2

On Qdrant Cloud, it’s available under Advanced Optimizations in the cluster Configuration tab.

A good starting point is 50% of your available vCPUs. On an eight-vCPU node, that’s four threads for the optimizer and four for queries. On a 16-vCPU node, set it to eight.

Tune from there. If indexing falls too far behind (watch the deferred point count), increase the budget. If query latency is still too high, lower it further.

Because this is a node-level setting, it requires a rolling restart to take effect: after applying the configuration change, restart one node at a time.

Step 4: Tune the Number of Optimizer Threads

Two settings give you finer control over optimizer CPU usage than optimizer_cpu_budget alone. They work at different levels:

  • max_optimization_threads: The maximum number of optimization jobs that can run in parallel per shard. Defaults to null (no limit; saturates available CPU).
  • max_indexing_threads: The number of threads each optimization job uses to build the HNSW index. Defaults to 0 (auto-select).

The total CPU consumed by indexing on a shard is roughly max_optimization_threads × max_indexing_threads. Reducing either value leaves more headroom for queries.

Set max_optimization_threads to 1 per shard. This serializes optimization work rather than running multiple jobs in parallel, which smooths out CPU spikes and makes optimizer resource usage more predictable.

Configure it per collection without needing a restart:

httpbashpythontypescriptrustjavacsharpgo
PATCH /collections/{collection_name}
{
    "optimizers_config": {
        "max_optimization_threads": 1
    }
}
curl -X PATCH http://localhost:6333/collections/{collection_name} \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "optimizers_config": {
        "max_optimization_threads": 1
    }
  }'
client.update_collection(
    collection_name="{collection_name}",
    optimizers_config=models.OptimizersConfigDiff(max_optimization_threads=1),
)
client.updateCollection("{collection_name}", {
  optimizers_config: {
    max_optimization_threads: 1,
  },
});
use qdrant_client::qdrant::{OptimizersConfigDiffBuilder, UpdateCollectionBuilder};

client
    .update_collection(
        UpdateCollectionBuilder::new("{collection_name}").optimizers_config(
            OptimizersConfigDiffBuilder::default().max_optimization_threads(1),
        ),
    )
    .await?;
import io.qdrant.client.grpc.Collections.MaxOptimizationThreads;
import io.qdrant.client.grpc.Collections.OptimizersConfigDiff;
import io.qdrant.client.grpc.Collections.UpdateCollection;

client
    .updateCollectionAsync(
        UpdateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setOptimizersConfig(
                OptimizersConfigDiff.newBuilder()
                .setMaxOptimizationThreads(
                    MaxOptimizationThreads.newBuilder().setValue(1).build())
                .build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.UpdateCollectionAsync(
	collectionName: "{collection_name}",
	optimizersConfig: new OptimizersConfigDiff { MaxOptimizationThreads = new MaxOptimizationThreads { Value = 1 } }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.UpdateCollection(context.Background(), &qdrant.UpdateCollection{
	CollectionName: "{collection_name}",
	OptimizersConfig: &qdrant.OptimizersConfigDiff{
		MaxOptimizationThreads: qdrant.NewMaxOptimizationThreads(1),
	},
})

Keep max_indexing_threads below 16. If your node has fewer than eight cores, leave it at 0 and let Qdrant auto-select.

httpbashpythontypescriptrustjavacsharpgo
PATCH /collections/{collection_name}
{
    "hnsw_config": {
        "max_indexing_threads": 4
    }
}
curl -X PATCH http://localhost:6333/collections/{collection_name} \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "hnsw_config": {
        "max_indexing_threads": 4
    }
  }'
client.update_collection(
    collection_name="{collection_name}",
    hnsw_config=models.HnswConfigDiff(max_indexing_threads=4),
)
client.updateCollection("{collection_name}", {
  hnsw_config: {
    max_indexing_threads: 4,
  },
});
use qdrant_client::qdrant::{HnswConfigDiffBuilder, UpdateCollectionBuilder};

client
    .update_collection(
        UpdateCollectionBuilder::new("{collection_name}")
            .hnsw_config(HnswConfigDiffBuilder::default().max_indexing_threads(4)),
    )
    .await?;
import io.qdrant.client.grpc.Collections.HnswConfigDiff;
import io.qdrant.client.grpc.Collections.UpdateCollection;

client
    .updateCollectionAsync(
        UpdateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setHnswConfig(
                HnswConfigDiff.newBuilder().setMaxIndexingThreads(4).build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.UpdateCollectionAsync(
	collectionName: "{collection_name}",
	hnswConfig: new HnswConfigDiff { MaxIndexingThreads = 4 }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.UpdateCollection(context.Background(), &qdrant.UpdateCollection{
	CollectionName: "{collection_name}",
	HnswConfig: &qdrant.HnswConfigDiff{
		MaxIndexingThreads: qdrant.PtrOf(uint64(4)),
	},
})

Relationship to optimizer_cpu_budget

optimizer_cpu_budget (step 3) sets the node-wide CPU ceiling for all optimizer work. max_optimization_threads and max_indexing_threads control how that budget is distributed across jobs and shards. Use Step 3 to cap the total, and Step 4 to shape how work is scheduled within that cap.

Step 5: Use Delayed Fan-Outs

Requires replicas. Available as of v1.17.0.

This step applies only if your collection has a replication factor greater than one. If you’re running without replicas, skip to step 6.

Under write pressure, some replicas may respond slower than others. They may be catching up on indexing, flushing to disk, or be under heavier load. When a single slow replica raises your 95th or 99th percentile latency, that’s tail latency: one outlier that degrades the experience for a small but visible fraction of requests.

The read_fan_out_delay_ms setting addresses this. When set, if a replica hasn’t responded within the specified threshold, Qdrant sends the same read request to a second replica and uses whichever response comes back first. The slow replica still processes its request, but its response is discarded if the other replica responds faster.

Enable it per collection:

httpbashpythontypescriptrustjavacsharpgo
PATCH /collections/{collection_name}
{
    "params": {
        "read_fan_out_delay_ms": 100
    }
}
curl -X PATCH http://localhost:6333/collections/{collection_name} \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "params": {
        "read_fan_out_delay_ms": 100
    }
  }'
client.update_collection(
    collection_name="{collection_name}",
    collection_params=models.CollectionParamsDiff(read_fan_out_delay_ms=100),
)
client.updateCollection("{collection_name}", {
  params: {
    read_fan_out_delay_ms: 100,
  },
});
use qdrant_client::qdrant::{CollectionParamsDiffBuilder, UpdateCollectionBuilder};

client
    .update_collection(
        UpdateCollectionBuilder::new("{collection_name}")
            .params(CollectionParamsDiffBuilder::default().read_fan_out_delay_ms(100u64)),
    )
    .await?;
import io.qdrant.client.grpc.Collections.CollectionParamsDiff;
import io.qdrant.client.grpc.Collections.UpdateCollection;

client
    .updateCollectionAsync(
        UpdateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setParams(
                CollectionParamsDiff.newBuilder().setReadFanOutDelayMs(100).build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.UpdateCollectionAsync(
	collectionName: "{collection_name}",
	collectionParams: new CollectionParamsDiff { ReadFanOutDelayMs = 100 }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.UpdateCollection(context.Background(), &qdrant.UpdateCollection{
	CollectionName: "{collection_name}",
	Params: &qdrant.CollectionParamsDiff{
		ReadFanOutDelayMs: qdrant.PtrOf(uint64(100)),
	},
})

Replace 100 with your collection’s measured p95 read latency.

Choosing a Threshold

Set the threshold to the 95th percentile of your current read latency. This means roughly 5% of requests trigger a fan-out, adding only a small amount of extra load while mitigating the worst of the tail. Setting it too low (for example, at the median) causes almost every request to fan out, doubling read load without meaningful benefit.

To disable, set read_fan_out_delay_ms back to 0.

Step 6: Try Async I/O

Linux only. Requires kernel support.

If your vectors or HNSW index are on disk, enabling the async scorer can reduce the time Qdrant spends waiting on disk reads. By default, Qdrant reads vectors synchronously: it sends a request, waits for the response, then sends the next. With Async I/O enabled, it batches disk requests and waits for them in parallel, saturating disk bandwidth instead of leaving it idle between reads.

This is most relevant during rescoring: when Qdrant retrieves candidate vectors from disk to rerank results, async I/O allows it to fetch all of them concurrently rather than one by one.

Enable it in the config file by setting async_scorer to true:

storage:
  performance:
    async_scorer: true

Or via an environment variable:

QDRANT__STORAGE__PERFORMANCE__ASYNC_SCORER=true

On Qdrant Cloud, it’s available under Advanced Optimizations in the cluster Configuration tab.

If your data is entirely in memory, this setting has no effect. And because it relies on io_uring, it only works on Linux with a kernel that supports it. Verify kernel support before enabling in production.

Step 7: Lower the Maximum Segment Size

Qdrant builds an HNSW index per segment. By default, max_segment_size is auto-selected based on available CPUs, which tends to favor large segments: better for query throughput but slower to index. Under heavy write load, a large segment takes longer to rebuild, meaning the optimizer holds onto CPU and I/O for longer stretches, increasing contention with queries.

The max_segment_size setting caps how large a segment can be after the optimizer merges and indexes it. Smaller segments rebuild faster, returning CPU to queries sooner. The tradeoff is that more segments mean more work per query: each search fans out across all segments, so query throughput decreases.

Configure it per collection, specified in KB:

httpbashpythontypescriptrustjavacsharpgo
PATCH /collections/{collection_name}
{
    "optimizers_config": {
        "max_segment_size": 100000
    }
}
curl -X PATCH http://localhost:6333/collections/{collection_name} \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "optimizers_config": {
        "max_segment_size": 100000
    }
  }'
client.update_collection(
    collection_name="{collection_name}",
    optimizers_config=models.OptimizersConfigDiff(max_segment_size=100000),
)
client.updateCollection("{collection_name}", {
  optimizers_config: {
    max_segment_size: 100000,
  },
});
use qdrant_client::qdrant::{OptimizersConfigDiffBuilder, UpdateCollectionBuilder};

client
    .update_collection(
        UpdateCollectionBuilder::new("{collection_name}").optimizers_config(
            OptimizersConfigDiffBuilder::default().max_segment_size(100000u64),
        ),
    )
    .await?;
import io.qdrant.client.grpc.Collections.OptimizersConfigDiff;
import io.qdrant.client.grpc.Collections.UpdateCollection;

client
    .updateCollectionAsync(
        UpdateCollection.newBuilder()
            .setCollectionName("{collection_name}")
            .setOptimizersConfig(
                OptimizersConfigDiff.newBuilder().setMaxSegmentSize(100000).build())
            .build())
    .get();
using Qdrant.Client;
using Qdrant.Client.Grpc;

await client.UpdateCollectionAsync(
	collectionName: "{collection_name}",
	optimizersConfig: new OptimizersConfigDiff { MaxSegmentSize = 100000 }
);
import (
	"context"

	"github.com/qdrant/go-client/qdrant"
)

client.UpdateCollection(context.Background(), &qdrant.UpdateCollection{
	CollectionName: "{collection_name}",
	OptimizersConfig: &qdrant.OptimizersConfigDiff{
		MaxSegmentSize: qdrant.PtrOf(uint64(100000)),
	},
})

As a reference point: 1 KB ≈ one vector of 256 dimensions. A value of 100000 (roughly 100 MB) is a reasonable starting point for collections with high write rates. Tune downward if indexing jobs are still too long, and upward if query throughput degrades noticeably.

Step 8: Scale Horizontally with Replicas

The previous steps all operate within a fixed hardware budget: they reallocate CPU between the optimizer and queries. If you’ve exhausted those options and latency is still too high, the answer is more capacity.

Adding nodes to your cluster and increasing the replication factor distributes read traffic across more peers. Because every replica of a shard contains the same data, Qdrant can route read requests to any of them. More replicas mean more vCPUs available to serve queries, and the optimizer on each node only contends with the query load that node carries instead of the full cluster load.

For example, a collection with three shards and a replication factor of two has six replicas total. On a three-node cluster, each node handles two replicas. A read request hits one replica per shard, so query load is spread evenly across all three nodes.

The tradeoff is cost and write latency. Every write must be replicated to all replicas of a shard before it’s acknowledged, so a higher replication factor increases write latency. Adding nodes also means more hardware to provision and operate. Find the balance that fits your latency budget and write throughput requirements.

Step 9: Scale Vertically

Like Step 8, this step adds capacity rather than reallocating it. Where horizontal scaling distributes load across more nodes, vertical scaling gives each node more resources to work with.

CPU. More vCPUs mean the optimizer and queries share a larger pool. The ratios from Steps 3 and 4 still apply, but with 32 vCPUs instead of eight, even a 25% optimizer budget leaves more absolute cores for queries. Vertical CPU scaling amplifies the effect of the tuning you’ve already done.

RAM. This is often the highest-leverage upgrade. When vectors and the HNSW index fit entirely in memory, disk I/O drops out of the picture: the optimizer and queries no longer compete for it. If you’re currently using memmap storage because your dataset outgrew RAM, adding memory may let you move to in-memory storage and eliminate a whole class of contention.

Input/Output Operations per Second (IOPS). If vectors or the HNSW index are stored on disk, disk throughput is a shared resource between the optimizer and queries. The optimizer continuously reads and writes segment data: merging segments, flushing the write-ahead log, and rebuilding indexes. Higher IOPS lets it complete that work faster, shortening the window of I/O contention.

Read More

Was this page useful?

Thank you for your feedback! 🙏

We are sorry to hear that. 😔 You can edit this page on GitHub, or create a GitHub issue.