Loading a dataset from Hugging Face hub

Hugging Face provides a platform for sharing and using ML models and datasets. Qdrant also publishes datasets with precomputed embeddings that you can use to practice with Qdrant and to build semantic search applications. Please let us know if you’d like to see a specific dataset!

arxiv-titles-instructorxl-embeddings

This dataset contains embeddings generated from the paper titles only. Each vector has a payload with the title used to create it, along with the DOI (Digital Object Identifier).

{
    "title": "Nash Social Welfare for Indivisible Items under Separable, Piecewise-Linear Concave Utilities",
    "DOI": "1612.05191"
}

You can find a detailed description of the dataset in the Practice Datasets section. If you prefer loading the dataset from a Qdrant snapshot, it is also linked there.

Loading the dataset is as simple as using the load_dataset function from the datasets library:

from datasets import load_dataset

dataset = load_dataset("Qdrant/arxiv-titles-instructorxl-embeddings")

The dataset contains 2,250,000 vectors. This is how you can check the list of features in its train split:

dataset["train"].features
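
You can also peek at a single record to see the structure described above. A quick check, assuming the non-streaming load from the previous snippet:

record = dataset["train"][0]

print(record["title"], record["DOI"])
print(len(record["vector"]))  # 768-dimensional InstructorXL embedding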

Streaming the dataset

Dataset streaming lets you work with a dataset without downloading it. The data is streamed as you iterate over the dataset. You can read more about it in the Hugging Face documentation.

from datasets import load_dataset

dataset = load_dataset(
    "Qdrant/arxiv-titles-instructorxl-embeddings", split="train", streaming=True
)
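
For example, you can preview a few records as they are streamed, without fetching the whole dataset first:

from itertools import islice

for record in islice(dataset, 3):
    print(record["title"])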

Loading the dataset into Qdrant

You can load the dataset into Qdrant using the Python SDK. The embeddings are precomputed, so you can store them in a collection, which we’re going to create in a second:

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="arxiv-titles-instructorxl-embeddings",
    vectors_config=models.VectorParams(
        size=768,  # InstructorXL embeddings are 768-dimensional
        distance=models.Distance.COSINE,
    ),
)
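
Note that create_collection raises an error if the collection already exists, for example when you re-run the script. A minimal guard, assuming a recent qdrant-client version that exposes collection_exists, could look like this:

if not client.collection_exists("arxiv-titles-instructorxl-embeddings"):
    client.create_collection(
        collection_name="arxiv-titles-instructorxl-embeddings",
        vectors_config=models.VectorParams(
            size=768,
            distance=models.Distance.COSINE,
        ),
    )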

It is always a good idea to use batching while loading a large dataset, so let’s do that. We are going to need a helper function to split the dataset into batches:

from itertools import islice

def batched(iterable, n):
    # Yield consecutive batches of size n; the last batch may be smaller
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch

If you are a happy user of Python 3.12+, you can use the batched function from the itertools package instead.
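
If you want a single script that works on both older and newer interpreters, you could fall back to the helper above only when the standard-library version is unavailable. A small sketch:

import sys

if sys.version_info >= (3, 12):
    # Use the standard-library implementation where available.
    # Note that itertools.batched yields tuples instead of lists,
    # which works just as well for the upload loop below.
    from itertools import batched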

No matter what Python version you are using, you can use the upsert method to load the dataset, batch by batch, into Qdrant:

batch_size = 100

for batch in batched(dataset, batch_size):
    # pop() removes ids and vectors from each record, so only the
    # payload fields (title, DOI) remain in the batch
    ids = [point.pop("id") for point in batch]
    vectors = [point.pop("vector") for point in batch]

    client.upsert(
        collection_name="arxiv-titles-instructorxl-embeddings",
        points=models.Batch(
            ids=ids,
            vectors=vectors,
            payloads=batch,
        ),
    )
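
Uploading all 2,250,000 points takes a while. Once the loop finishes, you can verify the collection size and run a quick test query with a vector taken from the dataset itself. This is a minimal sketch; query_points is available in qdrant-client 1.10+, older versions expose a similar search method:

print(client.count("arxiv-titles-instructorxl-embeddings", exact=True).count)

# Re-stream the first record and use its vector as a query
query_vector = next(iter(dataset))["vector"]

hits = client.query_points(
    collection_name="arxiv-titles-instructorxl-embeddings",
    query=query_vector,
    limit=5,
).points

for hit in hits:
    print(hit.score, hit.payload["title"])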

Your collection is ready to be used for search! Please let us know using Discord if you would like to see more datasets published on Hugging Face hub.