Day 5

The Universal Query API

Picture this: a customer types “leather jackets” into your store’s search bar. You want to show items that match the style semantically - so a bomber jacket surfaces even if it doesn’t mention “leather jackets” verbatim - but you also need to enforce your business rules. Only products under $200, only items in stock, only jackets released within the past year. Traditionally, you’d fire off a search, gather results, then apply filters and glue code. With Qdrant’s Universal Query API, all of that happens in one declarative request.

Run dense + sparse retrieval in parallel with RRF

First, you retrieve candidates from multiple sources in parallel and fuse their ranks. The example below blends dense semantics from a BGE model with sparse keyword matching from SPLADE, using Reciprocal Rank Fusion (RRF) to merge the two ranked lists:

from qdrant_client import QdrantClient, models
import os

client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))

# For Colab:
# from google.colab import userdata
# client = QdrantClient(url=userdata.get("QDRANT_URL"), api_key=userdata.get("QDRANT_API_KEY"))

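# dense_vector and sparse_vector are assumed to be computed beforehand:
# dense_vector is a list of floats from the BGE model,
# sparse_vector is a models.SparseVector from SPLADE.
response = client.query_points(
    collection_name="products",
    prefetch=[
        # Dense semantic candidates
        models.Prefetch(
            query=dense_vector,
            using="bge-dense",
            limit=20
        ),
        # Sparse keyword candidates
        models.Prefetch(
            query=sparse_vector,
            using="sparse-splade",
            limit=20
        )
    ],
    # Merge the two candidate lists by reciprocal rank
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10
)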

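Qdrant runs both prefetches concurrently, fuses the two ranked lists by reciprocal rank, and returns the top ten products. A product does not have to appear in both lists to be returned, but items ranked highly by both the dense and the sparse search receive the highest fused scores.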
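To make the fusion step concrete, here is a minimal pure-Python sketch of reciprocal rank fusion. It is an illustration, not Qdrant's implementation: the function name rrf_fuse and the product IDs are hypothetical, and the constant k=60 is the value commonly cited for RRF, which is an assumption here (Qdrant's internal constant may differ).

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, limit=10):
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, point_id in enumerate(ranked, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda pid: (-scores[pid], str(pid)))[:limit]


# Hypothetical results from the dense and sparse prefetches, best first
dense_hits = ["bomber", "biker", "parka"]
sparse_hits = ["biker", "trench", "bomber"]

fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Items that appear in both lists ("biker" and "bomber") collect two contributions and rise to the top, while items found by only one retriever still remain in the fused list.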

Dense Retrieval + ColBERT Rerank

While the previous lesson showed how to use ColBERT directly for retrieval, in production systems reranking is the more common pattern. The reason is practical: ColBERT’s brute-force MaxSim scoring on an entire large collection can be slow. Instead, you combine two strengths - fast approximate search to narrow candidates, then precise token-level scoring on that smaller set.

You create a collection with two vector fields: a dense vector with HNSW indexing for speed, and a ColBERT multivector with HNSW disabled for precision:

client.create_collection(
    collection_name="articles",
    vectors_config={
        # Fast HNSW-indexed dense retrieval
        "bge-dense": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE,
        ),
        # Precise multivector reranking (HNSW disabled to save RAM)
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            hnsw_config=models.HnswConfigDiff(m=0),
        )
    }
)

Now you retrieve 100 candidates quickly with your HNSW-indexed dense field, then apply ColBERT’s MaxSim scoring to rerank those hundred and select the very best ten:

from fastembed import LateInteractionTextEmbedding, TextEmbedding

# Encode with both models
dense = TextEmbedding("BAAI/bge-small-en-v1.5")
dense_query_vector = next(dense.query_embed(["what is the policy?"])).tolist()

colbert = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
colbert_query_multivector = next(colbert.query_embed(["what is the policy?"])).tolist()

# Fast retrieval + precise reranking in one call
response = client.query_points(
    collection_name="articles",
    prefetch=[
        models.Prefetch(
            query=dense_query_vector,
            using="bge-dense",
            limit=100
        )
    ],
    query=colbert_query_multivector,
    using="colbert",
    limit=10
)

Behind the scenes, Qdrant fetches the 100 nearest points from the HNSW-indexed dense field, then applies the MaxSim late-interaction score from your ColBERT multivector to only those 100 candidates, returning the ten highest-scoring results. This two-stage approach delivers ColBERT’s precision while keeping query latency practical for large-scale deployments.
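The reranking stage can be sketched in plain Python to show what MaxSim computes. This is an illustrative approximation under stated assumptions, not Qdrant's code: the helper names maxsim and _normalize, the two-dimensional token vectors, and the document IDs are all hypothetical. For each query token it takes the best cosine similarity against any document token, then sums those maxima.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = [_normalize(v) for v in query_tokens]
    d = [_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy multivectors: one small vector per token (real ColBERT vectors are 128-d)
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_a": [[1.0, 0.0], [2.0, 0.0]],  # matches only the first query token
    "doc_b": [[0.0, 3.0], [1.0, 0.0]],  # matches both query tokens
}

# Rerank the prefetched candidates by MaxSim score, best first
reranked = sorted(candidates, key=lambda doc_id: maxsim(query, candidates[doc_id]), reverse=True)
print(reranked)
```

Because every query token is compared against every document token, the cost grows with the number of candidates, which is why the score is applied only to the prefetched set.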

Global and Prefetch-Specific Filters

Finally, you layer in filtering wherever it makes sense. In the snippet below, you specify global filters at the query level - price under $200 and release date on or after January 1, 2023 - which automatically propagate to all prefetches. Then you add a prefetch-specific filter to the dense search so that it only retrieves products that are in stock and in the “jackets” category:

response = client.query_points(
    collection_name="products",
    prefetch=[
        models.Prefetch(
            query=dense_query_vector,
            using="bge-dense",
            limit=100,
            filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="in_stock",
                        match=models.MatchValue(value=True)
                    ),
                    models.FieldCondition(
                        key="category",
                        match=models.MatchValue(value="jackets")
                    )
                ]
            )
        ),
        models.Prefetch(
            query=sparse_query_vector,
            using="sparse-splade",
            limit=100
        )
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    filter=models.Filter(
        must=[
            models.FieldCondition(
                key="price",
                range=models.Range(lt=200.0)
            ),
            models.FieldCondition(
                key="release_date",
                range=models.DatetimeRange(
                    gte="2023-01-01T00:00:00Z"
                )
            )
        ]
    ),
    limit=10
)

The global filters (price and release_date) apply to both prefetches automatically. The first prefetch adds extra constraints (in_stock and category), while the second prefetch only uses the global filters. This eliminates the need to repeat common filters across every prefetch. Keep in mind that a prefetch-specific filter constrains only its own prefetch: here the sparse search can still return out-of-stock items or other categories, so any rule that must hold for every result belongs in the global filter. All filtering happens during the retrieval phase - there’s no separate post-processing step. The entire pipeline - hybrid retrieval, filtering, fusion, and reranking - executes in one API call.
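The way the two filter levels combine can be modeled in a few lines of plain Python. This is a simplified sketch, not how Qdrant evaluates filters internally: the helper names effective_must and matches, the sample payloads, and the use of plain date strings (compared lexicographically) are assumptions made for illustration.

```python
def effective_must(global_must, prefetch_must=()):
    """Conditions a prefetch runs with: its own plus the query-level ones."""
    return list(prefetch_must) + list(global_must)


def matches(payload, must):
    """A point passes only if every 'must' condition holds."""
    return all(cond(payload) for cond in must)


# Query-level (global) conditions
global_must = [
    lambda p: p["price"] < 200.0,
    lambda p: p["release_date"] >= "2023-01-01",
]
# Extra conditions attached to the dense prefetch only
dense_must = [
    lambda p: p["in_stock"] is True,
    lambda p: p["category"] == "jackets",
]

# Hypothetical payloads keyed by point ID
products = {
    1: {"price": 150.0, "release_date": "2023-06-01", "in_stock": True, "category": "jackets"},
    2: {"price": 150.0, "release_date": "2023-06-01", "in_stock": False, "category": "jackets"},
    3: {"price": 250.0, "release_date": "2023-06-01", "in_stock": True, "category": "jackets"},
}

dense_candidates = [
    pid for pid, p in products.items()
    if matches(p, effective_must(global_must, dense_must))
]
sparse_candidates = [
    pid for pid, p in products.items()
    if matches(p, effective_must(global_must))
]
print(dense_candidates, sparse_candidates)
```

Product 3 is excluded from both candidate sets by the global price condition, while product 2 is excluded only from the dense set because the in_stock condition is attached to that prefetch alone.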