Build a Multimodal Search System with Qdrant and FastEmbed
Time: 15 min | Level: Beginner | Output: GitHub
In this tutorial, you will set up a simple Multimodal Image & Text Search with Qdrant & FastEmbed.
Overview
We often understand and share information more effectively when combining different types of data. For example, the taste of comfort food can trigger childhood memories. We might describe a song with just “pam pam clap” sounds instead of writing paragraphs. Sometimes, we use emojis and stickers to express how we feel or to share complex ideas.
Modalities of data such as text, images, video and audio in various combinations form valuable use cases for Semantic Search applications.
Vector databases, being modality-agnostic, are perfect for building these applications.
In this tutorial, we will work with two modalities: image and text data. However, you can create a Semantic Search application with any combination of modalities if you choose the right embedding model to bridge the semantic gap.
The semantic gap refers to the difference between low-level features (e.g. brightness) and high-level concepts (e.g. cuteness).
For example, the ImageBind model from Meta AI is said to bind all four mentioned modalities in one shared embedding space.
Prerequisites
Note: The code for this tutorial can be found here
To complete this tutorial, you will need either Docker to run a pre-built Docker image of Qdrant and Python version ≥ 3.8, or a Google Colab notebook if you don’t want to install anything locally.
We showed how to run Qdrant in Docker in the “Create a Simple Neural Search Service” Tutorial.
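If you don’t have Qdrant running yet, a minimal way to start it locally looks like the sketch below, assuming Docker is installed and the default port 6333 is free:
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant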
Setup
First, install the required libraries: qdrant-client, fastembed, and Pillow. For example, with the pip package manager, it can be done in the following way.
python3 -m pip install --upgrade qdrant-client fastembed Pillow
Dataset
To make the demonstration simple, we created a tiny dataset of images and their captions for you.
Images can be downloaded from here.
It’s important to place them in the same folder as your code/notebook, in a folder named images.
You can check what the images look like in the following way:
from PIL import Image
Image.open('images/lizard.jpg')
Vectorize data
FastEmbed supports the Contrastive Language–Image Pre-training (CLIP) model, the old (2021) but gold classic of multimodal Image-Text Machine Learning.
CLIP was one of the first models of its kind with zero-shot capabilities.
When using it for semantic search, it’s important to remember that CLIP’s text encoder is trained to process no more than 77 tokens, so CLIP works best with short texts.
Let’s embed a small selection of images and their captions in the shared embedding space with CLIP.
from fastembed import TextEmbedding, ImageEmbedding

documents = [
    {"caption": "A photo of a cute pig", "image": "images/piggy.jpg"},
    {"caption": "A picture with a coffee cup", "image": "images/coffee.jpg"},
    {"caption": "A photo of a colourful lizard", "image": "images/lizard.jpg"},
]

text_model_name = "Qdrant/clip-ViT-B-32-text" # CLIP text encoder
text_model = TextEmbedding(model_name=text_model_name)
text_embeddings_size = text_model._get_model_description(text_model_name)["dim"] # dimension of text embeddings produced by the CLIP text encoder (512)
texts_embedded = list(text_model.embed([document["caption"] for document in documents])) # embedding captions with the CLIP text encoder

image_model_name = "Qdrant/clip-ViT-B-32-vision" # CLIP image encoder
image_model = ImageEmbedding(model_name=image_model_name)
image_embeddings_size = image_model._get_model_description(image_model_name)["dim"] # dimension of image embeddings produced by the CLIP image encoder (512)
images_embedded = list(image_model.embed([document["image"] for document in documents])) # embedding images with the CLIP image encoder
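Since both encoders map into the same shared space, the two embedding sizes should match. A quick, optional sanity check:
print(text_embeddings_size, image_embeddings_size) # both should be 512
print(len(texts_embedded[0]), len(images_embedded[0])) # each produced embedding should have that dimensionality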
Upload data to Qdrant
- Create a client object for Qdrant.
from qdrant_client import QdrantClient, models
client = QdrantClient("http://localhost:6333") # or QdrantClient(":memory:") if you're using Google Colab; the in-memory option is suitable only for simple prototypes/demos with the Python client
- Create a new collection for your images with captions.
CLIP’s weights were trained to maximize the scaled Cosine Similarity of truly corresponding image/caption pairs, so that’s the Distance Metric we will choose for our Collection of Named Vectors.
Using Named Vectors, we can easily showcase Text-to-Image and Image-to-Text (as well as Image-to-Image and Text-to-Text) search.
if not client.collection_exists("text_image"): # creating a Collection
    client.create_collection(
        collection_name="text_image",
        vectors_config={ # Named Vectors
            "image": models.VectorParams(size=image_embeddings_size, distance=models.Distance.COSINE),
            "text": models.VectorParams(size=text_embeddings_size, distance=models.Distance.COSINE),
        }
    )
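If you want to double-check that the collection was created as intended, a quick optional inspection of its configuration could look like this:
collection_info = client.get_collection("text_image") # fetch collection metadata
print(collection_info.config.params.vectors) # should list the "image" and "text" named vectors, both 512-dimensional with Cosine distance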
- Upload our images with captions to the Collection.
Each image with its caption will create a Point in Qdrant.
client.upload_points(
    collection_name="text_image",
    points=[
        models.PointStruct(
            id=idx, # unique id of a point, pre-defined by the user
            vector={
                "text": texts_embedded[idx], # embedded caption
                "image": images_embedded[idx] # embedded image
            },
            payload=doc # original image path and its caption
        )
        for idx, doc in enumerate(documents)
    ]
)
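To make sure the upload succeeded, you can optionally count the points in the collection; with the three documents above, the result should be 3.
print(client.count(collection_name="text_image").count) # expected: 3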
Search
Text-to-Image
Let’s see which image we will get for the query “What would make me energetic in the morning?”
from PIL import Image
find_image = text_model.embed(["What would make me energetic in the morning?"]) # embed the query so that it also becomes a vector
Image.open(client.search(
    collection_name="text_image", # searching in our collection
    query_vector=("image", list(find_image)[0]), # searching only among image vectors with our textual query
    with_payload=["image"], # user-readable information about search results; we want to see which image we find
    limit=1 # top-1 most similar result
)[0].payload['image'])
Response:
The picture with the coffee cup, images/coffee.jpg.
Image-to-Text
Now, let’s do a reverse search with an image:
from PIL import Image
Image.open('images/piglet.jpg')
Let’s see what caption we will get when searching with this piglet image which, as you can check, is not in our Collection.
find_image = image_model.embed(['images/piglet.jpg']) # embedding our image query
client.search(
    collection_name="text_image",
    query_vector=("text", list(find_image)[0]), # now we are searching only among text vectors with our image query
    with_payload=["caption"], # user-readable information about search results; we want to see which caption we get
    limit=1
)[0].payload['caption']
Response:
'A photo of a cute pig'
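Since both named vectors live in the same Collection, single-modality search works just as well. Here is a minimal Image-to-Image sketch, reusing the piglet image as the query (the expected result, the pig photo already in our Collection, is an assumption based on visual similarity):
find_image = image_model.embed(['images/piglet.jpg']) # embedding the image query again
client.search(
    collection_name="text_image",
    query_vector=("image", list(find_image)[0]), # image query against image vectors: Image-to-Image search
    with_payload=["image"],
    limit=1
)[0].payload['image'] # most likely 'images/piggy.jpg'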
Next steps
Use cases of even just Image & Text Multimodal Search are countless: E-Commerce, Media Management, Content Recommendation, Emotion Recognition Systems, Biomedical Image Retrieval, Spoken Sign Language Transcription, etc.
Imagine a scenario: a user wants to find a product similar to a picture they have, but they also have specific textual requirements, like “in beige colour”. You can search using just texts or images, or you can combine their embeddings in a late-fusion manner (summing and weighting might work surprisingly well), as sketched below.
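As a rough illustration of that late-fusion idea (not prescribed by this tutorial: the example query, the stand-in user image, and the weight alpha are all arbitrary assumptions), you could combine the two CLIP embeddings, since text and image vectors live in the same space, and search with the fused vector:
import numpy as np

text_query = list(text_model.embed(["in beige colour"]))[0] # the textual requirement
image_query = list(image_model.embed(["images/piggy.jpg"]))[0] # stand-in for the user's picture
alpha = 0.5 # arbitrary weight between the two modalities
fused_query = alpha * np.array(text_query) + (1 - alpha) * np.array(image_query) # late fusion by weighted sum
client.search(
    collection_name="text_image",
    query_vector=("image", fused_query.tolist()), # search among image vectors with the fused query
    with_payload=["image"],
    limit=1
)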
Moreover, using Discovery Search with both modalities, you can provide users with information that is impossible to retrieve unimodally!
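For instance, a very rough Discovery Search sketch (assuming a qdrant-client version that exposes the Discovery API; the target text and the example point ids below are arbitrary, purely for illustration) could steer a search over image vectors with a textual target and context pairs:
target_vector = list(text_model.embed(["something refreshing"]))[0] # textual target, embedded into the shared CLIP space
client.discover(
    collection_name="text_image",
    target=target_vector.tolist(), # the concept we want results to be close to
    context=[
        models.ContextExamplePair(positive=0, negative=2), # arbitrary point ids acting as "more like this / less like that" hints
    ],
    using="image", # run discovery over the image vectors
    limit=1,
)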
Join our Discord community, where we talk about vector search and similarity learning, experiment, and have fun!