Create and restore collections from snapshot

Time: 20 min | Level: Beginner

A collection is a basic unit of data storage in Qdrant. It contains vectors, their IDs, and payloads. However, keeping the search efficient requires additional data structures to be built on top of the data. Building these data structures may take a while, especially for large collections. That’s why using snapshots is the best way to export and import Qdrant collections, as they contain all the bits and pieces required to restore the entire collection efficiently.

This tutorial will show you how to create a snapshot of a collection and restore it. Since working with snapshots in a distributed environment can be a bit more involved, we will use a 3-node Qdrant cluster. However, the same approach applies to a single-node setup.

You can use the techniques described on this page to migrate a cluster. Follow the instructions in this tutorial to create and download snapshots. When you get to Restore from snapshot, restore your data to the new cluster instead of the original one.

Prerequisites

Let’s assume you already have a running Qdrant instance or a cluster. If not, you can follow the installation guide to set up a local Qdrant instance or use Qdrant Cloud to create a cluster in a few clicks.

Once the cluster is running, let’s install the required dependencies:

pip install qdrant-client datasets

Establish a connection to Qdrant

We are going to use the Python SDK and raw HTTP calls to interact with Qdrant. Since we are going to use a 3-node cluster, we need to know the URLs of all the nodes. For simplicity, let's keep them in constants, along with the API key, so we can refer to them later:

QDRANT_MAIN_URL = "https://my-cluster.com:6333"
QDRANT_NODES = (
    "https://node-0.my-cluster.com:6333",
    "https://node-1.my-cluster.com:6333",
    "https://node-2.my-cluster.com:6333",
)
QDRANT_API_KEY = "my-api-key"

We can now create a client instance:

from qdrant_client import QdrantClient

client = QdrantClient(QDRANT_MAIN_URL, api_key=QDRANT_API_KEY)
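
Before moving on, it is worth verifying that the connection works. A minimal sanity check, assuming the cluster is reachable with the URL and API key above, is to list the existing collections:

print(client.get_collections())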

First of all, we are going to create a collection from a precomputed dataset. If you already have a collection, you can skip this step and start by creating a snapshot.

(Optional) Create collection and import data

Load the dataset

We are going to use a dataset with precomputed embeddings, available on Hugging Face Hub. The dataset is called Qdrant/arxiv-titles-instructorxl-embeddings and was created using the InstructorXL model. It contains 2.25M embeddings for the titles of the papers from the arXiv dataset.

Loading the dataset is as simple as:

from datasets import load_dataset

dataset = load_dataset(
    "Qdrant/arxiv-titles-instructorxl-embeddings", split="train", streaming=True
)

We use streaming mode, so the dataset is not loaded into memory. Instead, we can iterate through it and extract the id and vector embedding of each record:

for payload in dataset:
    id_ = payload.pop("id")
    vector = payload.pop("vector")
    print(id_, vector, payload)

A single payload looks like this:

{
  'title': 'Dynamics of partially localized brane systems',
  'DOI': '1109.1415'
}

Create a collection

First things first, we need to create our collection. We are not going to tune its configuration here, but it is worth getting it right now, as the configuration is also part of the collection snapshot.

from qdrant_client import models

client.recreate_collection(
    collection_name="test_collection",
    vectors_config=models.VectorParams(
        size=768,  # Size of the embedding vector generated by the InstructorXL model
        distance=models.Distance.COSINE
    ),
)

Upload the dataset

Calculating embeddings is usually the bottleneck of a vector search pipeline, but we are lucky to have them precomputed already. Since the goal of this tutorial is to show how to create a snapshot, we are going to upload only a small part of the dataset.

ids, vectors, payloads = [], [], []
for payload in dataset:
    id_ = payload.pop("id")
    vector = payload.pop("vector")

    ids.append(id_)
    vectors.append(vector)
    payloads.append(payload)

    # We are going to upload only 1000 vectors
    if len(ids) == 1000:
        break

client.upsert(
    collection_name="test_collection",
    points=models.Batch(
        ids=ids,
        vectors=vectors,
        payloads=payloads,
    ),
)

Our collection is now ready to be used for search. Let’s create a snapshot of it.
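
If you want a quick sanity check before creating a snapshot, you can search the freshly populated collection. The sketch below simply reuses the first uploaded vector as the query, so it only confirms that the collection responds:

hits = client.search(
    collection_name="test_collection",
    query_vector=vectors[0],
    limit=3,
)
print(hits)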

If you already have a collection, you can skip the previous step and start by creating a snapshot.

Create and download snapshots

Qdrant exposes an HTTP endpoint for creating a snapshot, but we can also call it through the Python SDK. Our setup consists of 3 nodes, so we need to call the endpoint on each of them and create a snapshot on each node. When using the Python SDK, that means creating a separate client instance for each node.

snapshot_urls = []
for node_url in QDRANT_NODES:
    node_client = QdrantClient(node_url, api_key=QDRANT_API_KEY)
    snapshot_info = node_client.create_snapshot(collection_name="test_collection")

    snapshot_url = f"{node_url}/collections/test_collection/snapshots/{snapshot_info.name}"
    snapshot_urls.append(snapshot_url)

The same snapshots can also be created with raw HTTP calls, one per node:

// for `https://node-0.my-cluster.com:6333`
POST /collections/test_collection/snapshots

// for `https://node-1.my-cluster.com:6333`
POST /collections/test_collection/snapshots

// for `https://node-2.my-cluster.com:6333`
POST /collections/test_collection/snapshots

Each call returns a response similar to this:

{
  "result": {
    "name": "test_collection-559032209313046-2024-01-03-13-20-11.snapshot",
    "creation_time": "2024-01-03T13:20:11",
    "size": 18956800
  },
  "status": "ok",
  "time": 0.307644965
}

Once we have the snapshot URLs, we can download them. Please make sure to include the API key in the request headers. Downloading the snapshot can be done only through the HTTP API, so we are going to use the requests library.

import requests
import os

# Create a directory to store snapshots
os.makedirs("snapshots", exist_ok=True)

local_snapshot_paths = []
for snapshot_url in snapshot_urls:
    snapshot_name = os.path.basename(snapshot_url)
    local_snapshot_path = os.path.join("snapshots", snapshot_name)

    response = requests.get(
        snapshot_url, headers={"api-key": QDRANT_API_KEY}
    )
    # Stop early if the snapshot could not be downloaded
    response.raise_for_status()
    with open(local_snapshot_path, "wb") as f:
        f.write(response.content)

    local_snapshot_paths.append(local_snapshot_path)

Alternatively, you can use the wget command:

wget https://node-0.my-cluster.com:6333/collections/test_collection/snapshots/test_collection-559032209313046-2024-01-03-13-20-11.snapshot \
    --header="api-key: ${QDRANT_API_KEY}" \
    -O node-0-snapshot.snapshot

wget https://node-1.my-cluster.com:6333/collections/test_collection/snapshots/test_collection-559032209313047-2024-01-03-13-20-12.snapshot \
    --header="api-key: ${QDRANT_API_KEY}" \
    -O node-1-snapshot.snapshot

wget https://node-2.my-cluster.com:6333/collections/test_collection/snapshots/test_collection-559032209313048-2024-01-03-13-20-13.snapshot \
    --header="api-key: ${QDRANT_API_KEY}" \
    -O node-2-snapshot.snapshot
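
If you lose track of which snapshots exist on the server side, you can ask each node for the snapshots it stores for a collection. Here is a small sketch using per-node clients, assuming the list_snapshots method is available in your version of the Python SDK:

for node_url in QDRANT_NODES:
    node_client = QdrantClient(node_url, api_key=QDRANT_API_KEY)
    # Each node only reports the snapshots stored on that node
    print(node_url, node_client.list_snapshots(collection_name="test_collection"))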

The snapshots are now stored locally. We can use them to restore the collection to a different Qdrant instance, or treat them as a backup. We will create another collection using the same data on the same cluster.

Restore from snapshot

Our brand-new snapshot is ready to be restored. Typically, it is used to move a collection to a different Qdrant instance, but we are going to use it to create a new collection on the same cluster. It is just going to have a different name, test_collection_import. We do not need to create a collection first, as it is going to be created automatically.

Restoring a collection is also done separately on each node, but the Python SDK does not support it yet. We are going to use the HTTP API instead and send a request to each node with the requests library.

for node_url, snapshot_path in zip(QDRANT_NODES, local_snapshot_paths):
    snapshot_name = os.path.basename(snapshot_path)
    # Open the snapshot in a context manager so the file is closed after the upload
    with open(snapshot_path, "rb") as snapshot_file:
        requests.post(
            f"{node_url}/collections/test_collection_import/snapshots/upload?priority=snapshot",
            headers={
                "api-key": QDRANT_API_KEY,
            },
            files={"snapshot": (snapshot_name, snapshot_file)},
        )

Alternatively, you can use the curl command:

curl -X POST 'https://node-0.my-cluster.com:6333/collections/test_collection_import/snapshots/upload?priority=snapshot' \
    -H "api-key: ${QDRANT_API_KEY}" \
    -H 'Content-Type:multipart/form-data' \
    -F 'snapshot=@node-0-snapshot.snapshot'

curl -X POST 'https://node-1.my-cluster.com:6333/collections/test_collection_import/snapshots/upload?priority=snapshot' \
    -H "api-key: ${QDRANT_API_KEY}" \
    -H 'Content-Type:multipart/form-data' \
    -F 'snapshot=@node-1-snapshot.snapshot'

curl -X POST 'https://node-2.my-cluster.com:6333/collections/test_collection_import/snapshots/upload?priority=snapshot' \
    -H "api-key: ${QDRANT_API_KEY}" \
    -H 'Content-Type:multipart/form-data' \
    -F 'snapshot=@node-2-snapshot.snapshot'

Important: We selected priority=snapshot to make sure that the snapshot data is preferred over any data already stored on the node. You can read more about snapshot priority in the documentation.
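
To verify that the restore succeeded, you can compare the point counts of the original and the imported collection. A minimal check, reusing the client created at the beginning of this tutorial:

original = client.count(collection_name="test_collection", exact=True)
restored = client.count(collection_name="test_collection_import", exact=True)
# Both collections should report the same number of points (1000 in this tutorial)
print(original.count, restored.count)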