Create a Hybrid Search Service with Fastembed

Time: 20 min | Level: Beginner | Output: GitHub

This tutorial shows you how to build and deploy your own hybrid search service that looks through descriptions of companies collected from a website and picks the ones most similar to your query. The website contains the company names, descriptions, locations, and a picture for each entry.

As we have already written on our blog, there is no single definition of hybrid search. In this tutorial, we cover the combination of dense and sparse embeddings. Dense embeddings are generated by well-known neural networks such as BERT, while sparse embeddings are closer to a traditional full-text search approach.
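To make the distinction concrete, here is a minimal sketch in plain Python. It is not Fastembed's actual output: the dimensions, token indices, and weights are made up for illustration. A dense vector stores a value for every dimension, while a sparse vector keeps only the non-zero entries as index/weight pairs.

```python
from typing import Dict, List

# Dense embedding: one float per dimension (real models use hundreds of dimensions).
dense_query: List[float] = [0.12, -0.40, 0.33, 0.08]
dense_doc: List[float] = [0.10, -0.35, 0.30, 0.00]

# Sparse embedding: only non-zero weights, keyed by token index in a large vocabulary.
sparse_query: Dict[int, float] = {17: 1.2, 4051: 0.7}
sparse_doc: Dict[int, float] = {17: 0.9, 302: 0.4, 4051: 0.5}


def dense_dot(a: List[float], b: List[float]) -> float:
    """Dot product over every dimension."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product that only touches indices present in both vectors."""
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


print(round(dense_dot(dense_query, dense_doc), 4))
print(round(sparse_dot(sparse_query, sparse_doc), 4))
```

Only token indices shared by the query and the document contribute to the sparse score, which is why sparse vectors behave much like keyword matching.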

Our hybrid search service will use the Fastembed package to generate embeddings of text descriptions and FastAPI to serve the search API. Fastembed natively integrates with the Qdrant client, so you can easily upload the data into Qdrant and perform search queries.

Hybrid Search Schema


To create a hybrid search service, you will need to transform your raw data and then create a search function to query it. You will 1) download and prepare a sample dataset, encoding it with a modified version of the BERT ML model, 2) load the data into Qdrant, 3) create a hybrid search API, and 4) serve it using FastAPI.

Hybrid Search Workflow


To complete this tutorial, you will need:

  • Docker - The easiest way to use Qdrant is to run a pre-built Docker image.
  • Raw parsed data from
  • Python version >=3.8

Prepare sample dataset

To conduct a hybrid search on startup descriptions, you must first encode the description data into vectors. The Fastembed integration in the Qdrant client combines encoding and uploading into a single step.

It also takes care of batching and parallelization, so you don’t have to worry about it.

Let’s start by downloading the data and installing the necessary packages.

  1. First you need to download the dataset.

Run Qdrant in Docker

Next, you need to manage all of your data using a vector engine. Qdrant lets you store, update or delete created vectors. Most importantly, it lets you search for the nearest vectors via a convenient API.

Note: Before you begin, create a project directory and a Python virtual environment in it.

  1. Download the Qdrant image from DockerHub.
docker pull qdrant/qdrant
  1. Start Qdrant inside of Docker.
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant

You should see output like this:

[2021-02-05T00:08:51Z INFO  actix_server::builder] Starting 12 workers
[2021-02-05T00:08:51Z INFO  actix_server::builder] Starting "actix-web-service-" service on

Test the service by going to http://localhost:6333/. You should see the Qdrant version info in your browser.
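The same check can be scripted instead of done in a browser. The sketch below uses only the standard library; it assumes the root endpoint answers with a JSON object containing "title" and "version" keys, and the function names are hypothetical helpers, not part of the Qdrant client.

```python
import json
from urllib.request import urlopen


def parse_version_info(body: str) -> str:
    """Extract a short 'title version' string from the root endpoint's JSON reply.

    Assumption: the reply is a JSON object with "title" and "version" keys.
    """
    info = json.loads(body)
    return f"{info.get('title', 'unknown')} {info.get('version', 'unknown')}"


def fetch_version_info(url: str = "http://localhost:6333/", timeout: float = 5.0) -> str:
    """Request the root endpoint of a running Qdrant instance and summarize the reply."""
    with urlopen(url, timeout=timeout) as response:
        return parse_version_info(response.read().decode("utf-8"))
```

With the container running, calling fetch_version_info() should return a one-line summary; a connection error usually means the port mapping or the container itself needs another look.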

All data uploaded to Qdrant is saved inside the ./qdrant_storage directory and will be persisted even if you recreate the container.

Upload data to Qdrant

  1. Install the official Python client to best interact with Qdrant.
pip install "qdrant-client[fastembed]>=1.8.2"

Note: This tutorial requires fastembed version 0.2.6 or later.

At this point, you should have startup records in the startups_demo.json file and Qdrant running on a local machine.

Now you need to write a script to upload all startup data and vectors into the search engine.

  1. Create a client object for Qdrant.
# Import client library
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
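  1. Select the models to encode your data.

You will be using two pre-trained models to compute dense and sparse vectors, respectively: sentence-transformers/all-MiniLM-L6-v2 and prithivida/Splade_PP_en_v1.

client.set_model("sentence-transformers/all-MiniLM-L6-v2")
# comment this line to use dense vectors only
client.set_sparse_model("prithivida/Splade_PP_en_v1")
  1. Related vectors need to be added to a collection. Create a new collection for your startup vectors.
client.create_collection(
    collection_name="startups",
    vectors_config=client.get_fastembed_vector_params(),
    # comment this line to use dense vectors only
    sparse_vectors_config=client.get_fastembed_sparse_vector_params(),
)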

Qdrant requires vectors to have their own names and configurations.

Methods get_fastembed_vector_params and get_fastembed_sparse_vector_params help you to get the corresponding parameters for the models you are using. These parameters include vector size, distance function, etc.

Without fastembed integration, you would need to specify the vector size and distance function manually. Read more about it here.

Additionally, you can specify extended configuration for your vectors, like quantization_config or hnsw_config.

  1. Read data from the file.
import json

payload_path = "startups_demo.json"
metadata = []
documents = []

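with open(payload_path) as fd:
    for line in fd:
        obj = json.loads(line)
        # Assumes each record keeps its text under a "description" key
        documents.append(obj.pop("description"))
        # Everything that remains becomes the payload
        metadata.append(obj)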

In this block of code, we read data from startups_demo.json file and split it into 2 lists: documents and metadata. Documents are the raw text descriptions of startups. Metadata is the payload associated with each startup, such as the name, location, and picture. We will use documents to encode the data into vectors.

  1. Encode and upload data.
client.add(
    collection_name="startups",
    documents=documents,
    metadata=metadata,
    parallel=0,  # Use all available CPU cores to encode data.
    # Requires wrapping code into if __name__ == '__main__' block
)
Upload processed data

Download and unpack the processed data from here or use the following script:

tar -xvf startups_hybrid_search_processed_40k.tar.gz

Then you can upload the data to Qdrant.

from typing import Iterator, List
import json
import numpy as np
from qdrant_client import models

def named_vectors(vectors: np.ndarray, sparse_vectors: List[dict]) -> Iterator[dict]:
    # make sure to use the same client object as previously
    # or call `set_model` and `set_sparse_model` manually
    dense_vector_name = client.get_vector_field_name()
    sparse_vector_name = client.get_sparse_vector_field_name()
    for vector, sparse_vector in zip(vectors, sparse_vectors):
        yield {
            dense_vector_name: vector,
            sparse_vector_name: models.SparseVector(**sparse_vector),
        }

with open("dense_vectors.npy", "rb") as f:
    vectors = np.load(f)
with open("sparse_vectors.json", "r") as f:
    sparse_vectors = json.load(f)
with open("payload.json", "r") as f:
    payload = json.load(f)

client.upload_collection(
    "startups", vectors=named_vectors(vectors, sparse_vectors), payload=payload
)

The add method encodes all documents and uploads them to Qdrant. It is one of the two Fastembed-specific methods and combines encoding and uploading into a single step.

The parallel parameter enables data-parallelism instead of built-in ONNX parallelism.

Additionally, you can specify ids for each document, if you want to use them later to update or delete documents. If you don’t specify ids, they will be generated automatically and returned as a result of the add method.
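If you want ids that stay stable between runs, one option is to derive them from a field that is unique per record. The sketch below is an assumption, not part of the tutorial's pipeline: it hashes a hypothetical `name` field into a UUID with the standard library, and the resulting list could then be passed as the ids argument. Qdrant accepts unsigned integers and UUIDs as point ids.

```python
import uuid
from typing import Dict, List

# Hypothetical namespace; any fixed UUID works as long as it never changes.
STARTUP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "startups.example")


def stable_ids(records: List[Dict[str, str]], key: str = "name") -> List[str]:
    """Derive one deterministic UUID string per record from the chosen field."""
    return [str(uuid.uuid5(STARTUP_NAMESPACE, record[key])) for record in records]


# Made-up records, used only to show the shape of the result.
sample = [{"name": "Acme Robotics"}, {"name": "Blue Ocean Labs"}]
ids = stable_ids(sample)
```

Because the ids depend only on the record content, re-running the upload overwrites existing points instead of creating duplicates, provided the chosen field really is unique.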

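You can monitor the progress of the encoding by passing a tqdm progress bar to the add method.

from tqdm import tqdm

client.add(
    collection_name="startups",
    documents=documents,
    metadata=metadata,
    ids=tqdm(range(len(documents))),
)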


Build the search API

Now that all the preparations are complete, let’s start building a hybrid search class.

In order to process incoming requests, the hybrid search class will need 3 things: 1) models to convert the query into vectors, 2) the Qdrant client to perform search queries, and 3) a fusion function to re-rank dense and sparse search results.

The Fastembed integration encapsulates query encoding, search, and fusion into a single method call. Fastembed uses reciprocal rank fusion to combine the results.
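For intuition, reciprocal rank fusion can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions, not the client's internal implementation: the constant k = 60 is a commonly used default, and the value used by the library may differ. Each result list contributes 1 / (k + rank) to a document's score, so documents ranked well by both models rise to the top.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60, limit: int = 5
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked id lists; each id scores sum(1 / (k + rank)), rank starting at 1."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:limit]


dense_hits = ["a", "b", "c"]   # ids ranked by the dense model
sparse_hits = ["c", "a", "d"]  # ids ranked by the sparse model
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Here "a" and "c" appear in both lists and therefore outrank "b" and "d", which each appear in only one. The fusion needs only ranks, not raw scores, so dense and sparse similarities never have to be put on a common scale.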

  1. Create a file named hybrid_searcher.py and specify the following.
from qdrant_client import QdrantClient

class HybridSearcher:
    DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
    SPARSE_MODEL = "prithivida/Splade_PP_en_v1"
    def __init__(self, collection_name):
        self.collection_name = collection_name
        # initialize Qdrant client
        self.qdrant_client = QdrantClient("http://localhost:6333")
        self.qdrant_client.set_model(self.DENSE_MODEL)
        # comment this line to use dense vectors only
        self.qdrant_client.set_sparse_model(self.SPARSE_MODEL)
  1. Write the search function.
def search(self, text: str):
    search_result = self.qdrant_client.query(
        collection_name=self.collection_name,
        query_text=text,
        query_filter=None,  # If you don't want any filters for now
        limit=5,  # Return the 5 closest results
    )
    # `search_result` contains found vector ids with similarity scores
    # along with the stored payload
    # Select and return metadata
    metadata = [hit.metadata for hit in search_result]
    return metadata
  1. Add search filters.

With Qdrant it is also feasible to add some conditions to the search. For example, if you wanted to search for startups in a certain city, the search query could look like this:

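from qdrant_client import models

def search(self, text: str):
    city_of_interest = "Berlin"

    # Define a filter for cities (assumes the payload keeps the city under a "city" key)
    city_filter = models.Filter(
        must=[models.FieldCondition(key="city", match=models.MatchValue(value=city_of_interest))]
    )

    search_result = self.qdrant_client.query(
        collection_name=self.collection_name,
        query_text=text,
        query_filter=city_filter,
        limit=5,
    )
    return [hit.metadata for hit in search_result]

You have now created a class for hybrid search queries. Now wrap it up into a service.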

Deploy the search with FastAPI

To build the service you will use the FastAPI framework.

  1. Install FastAPI.

To install it, use the command:

pip install fastapi uvicorn
  1. Implement the service.

Create a file for the service and specify the following.

The service will have only one API endpoint and will look like this:

from fastapi import FastAPI

# The file where HybridSearcher is stored
from hybrid_searcher import HybridSearcher

app = FastAPI()

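# Create a hybrid searcher instance
hybrid_searcher = HybridSearcher(collection_name="startups")

# The route path is an assumption; use whatever path suits your API
@app.get("/api/search")
def search_startup(q: str):
    return {"result": hybrid_searcher.search(text=q)}

if __name__ == "__main__":
    import uvicorn

    # Listen on all interfaces on port 8000
    uvicorn.run(app, host="0.0.0.0", port=8000)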
  1. Run the service by executing the service file with Python.
  1. Open your browser at http://localhost:8000/docs.

You should be able to see a debug interface for your service.

FastAPI Swagger interface

Feel free to play around with it, make queries regarding the companies in our corpus, and check out the results.
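You can also query the service from a script rather than the browser. The sketch below uses only the standard library and rests on assumptions: the endpoint path `/api/search` is hypothetical, so replace it with the route your service actually exposes, and the response is expected to be a JSON object with a "result" key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(
    query: str, base_url: str = "http://localhost:8000", path: str = "/api/search"
) -> str:
    """Build the request URL, percent-encoding the query text."""
    return f"{base_url}{path}?{urlencode({'q': query})}"


def search(query: str) -> list:
    """Send the query to the running service and return the "result" list."""
    with urlopen(build_search_url(query), timeout=10) as response:
        return json.load(response)["result"]
```

Encoding the query with urlencode keeps spaces and characters such as `&` from breaking the URL.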

Join our Discord community, where we talk about vector search and similarity learning and publish other examples of neural networks and neural search applications.
