BM25 with Qdrant Edge
BM25 (Best Matching 25) is a popular sparse-vector ranking algorithm for full-text search. Qdrant Edge includes a built-in BM25 embedder, so you can run keyword search without an internet connection or external embedding service.
The BM25 embedder is compatible with server-side BM25: vectors produced by the Qdrant Edge embedder use the same token IDs and scoring formula as Qdrant Server’s text search pipeline. You can initialize an Edge Shard from a server snapshot and query it with locally produced BM25 vectors without re-indexing.
In Python, use the Bm25 and Bm25Config classes. In Rust, use EdgeBm25 and EdgeBm25Config from the qdrant_edge::bm25_embed module.
Configure a Sparse Vector
To get started with BM25, create an Edge Shard with a sparse vector field and Modifier.Idf. The IDF modifier enables inverse document frequency weighting, which is required for BM25 scoring:
from qdrant_edge import (
EdgeConfig,
EdgeShard,
EdgeSparseVectorParams,
Modifier,
)
config = EdgeConfig(
sparse_vectors={"text": EdgeSparseVectorParams(modifier=Modifier.Idf)},
)
shard = EdgeShard.create(SHARD_DIRECTORY, config)
let config = EdgeConfigBuilder::new()
.sparse_vector("text", EdgeSparseVectorParamsBuilder::new()
.modifier(Modifier::Idf)
.build())
.build();
let shard = EdgeShard::new(Path::new(SHARD_DIRECTORY), config)?;
Create a BM25 Embedder
Instantiate a BM25 embedder with a language setting. The embedder applies stemming and stopword filtering for the specified language:
from qdrant_edge import Bm25, Bm25Config
bm25 = Bm25(Bm25Config(language="english"))
let bm25 = EdgeBm25::new(EdgeBm25Config {
language: Some("english".to_string()),
..Default::default()
})?;
Bm25Config accepts the following parameters:
| Parameter | Description |
|---|---|
language | Language for stemming and stopwords (for example, "english", "german"). Defaults to None, which falls back to English stemming and stopwords. |
k | Term frequency saturation parameter. Default: 1.2. |
b | Document length normalization factor. Default: 0.75. |
avg_len | Expected average document length in tokens. Default: 256. |
lowercase | Convert tokens to lowercase before embedding. Default: true. |
ascii_folding | Normalize accented characters to ASCII equivalents. Default: false. |
stemmer | Override the stemming algorithm. |
stopwords | Override the stopword list. |
tokenizer | Tokenizer used to break down text into individual tokens (words). Can be "prefix", "whitespace", "word", or "multilingual". Default: "word". |
min_token_len | Minimum token length to include. |
max_token_len | Maximum token length to include. |
For a full description of each parameter, see Configuring BM25 Parameters.
Embed and Upsert Documents
Use embed_document to generate a sparse vector for each document, then upsert the points. Call optimize after bulk inserts to build the sparse index:
from qdrant_edge import Point, UpdateOperation
shard.update(UpdateOperation.upsert_points([
Point(1, {"text": bm25.embed_document("the quick brown fox")}, {"title": "Article 1"}),
Point(2, {"text": bm25.embed_document("a lazy dog sleeps")}, {"title": "Article 2"}),
Point(3, {"text": bm25.embed_document("foxes are clever")}, {"title": "Article 3"}),
]))
shard.optimize()
let docs = [
(1u64, "the quick brown fox", "Article 1"),
(2, "a lazy dog sleeps", "Article 2"),
(3, "foxes are clever", "Article 3"),
];
let points = docs.iter().map(|(id, text, title)| {
PointStruct::new(
*id,
Vectors::new_named([("text", bm25.embed_document(text))]),
json!({ "title": title }),
).into()
}).collect();
shard.update(UpdateOperation::PointOperation(
PointOperations::UpsertPoints(PointInsertOperations::PointsList(points)),
))?;
shard.optimize()?;
Query
Use embed_query to generate a sparse vector for the query text, then query the shard:
from qdrant_edge import Query, QueryRequest
query_vector = bm25.embed_query("clever fox")
results = shard.query(QueryRequest(
query=Query.Nearest(query_vector, using="text"),
limit=3,
with_payload=True,
))
let query_vector = bm25.embed_query("clever fox");
let results = shard.query(QueryRequest {
prefetches: vec![],
query: Some(ScoringQuery::Vector(QueryEnum::Nearest(NamedQuery {
query: VectorInternal::from(query_vector),
using: Some("text".to_string()),
}))),
filter: None,
score_threshold: None,
limit: 3,
offset: 0,
params: None,
with_vector: WithVector::Bool(false),
with_payload: WithPayloadInterface::Bool(true),
})?;
Always use embed_query for query text and embed_document for document text. Using the wrong function produces incorrect results, since BM25 applies different term weighting depending on the input type.
