Project: Building a Semantic Search Engine
Now that you’ve seen how semantic search works with movies, it’s time to build your own. Choose a domain you care about and create a search engine that understands meaning, not just keywords.
Your Mission
Build a semantic search engine for a topic of your choice. You’ll discover how chunking strategy affects search quality in your specific domain.
Estimated Time: 120 minutes
What You’ll Build
A working semantic search engine that demonstrates:
- Domain expertise: Choose content you understand so you can evaluate search quality
- Chunking comparison: Test different strategies and see which works best for your content type
- Real semantic understanding: Search by concept, theme, or meaning rather than exact keywords
- Practical insights: Discover what makes chunking effective in your specific domain
Setup
Prerequisites
- Qdrant Cloud cluster (URL + API key)
- Python 3.9+ (or Colab)
- Packages: qdrant-client, sentence-transformers, google.colab (if using Colab)
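If these packages aren't already available in your environment, install them first; a minimal notebook cell (works in Colab or Jupyter; from a plain terminal, drop the leading %):

%pip install qdrant-client sentence-transformers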
Models
- SentenceTransformer: all-MiniLM-L6-v2 (384-dim)
(You can try others in “Optional: Go Further”.)
Dataset
Pick something with rich, descriptive text where semantic search adds value:
- Books/Literature: Search a collection of book summaries, reviews, or excerpts. Find books by theme, mood, or literary style. Example queries: “coming of age stories with unreliable narrators”, “dystopian fiction with environmental themes”
- Recipes/Cooking: Index recipe descriptions and instructions. Search by cooking technique, flavor profile, or dietary needs. Example queries: “comfort food for cold weather”, “quick weeknight meals with Asian flavors”
- News/Articles: Collect articles from your field of interest. Search by topic, perspective, or journalistic approach. Example queries: “analysis of remote work trends”, “climate change solutions in urban planning”
- Research Papers: Academic abstracts or papers from your field. Search by methodology, findings, or theoretical approach. Example queries: “machine learning applications in healthcare”, “qualitative studies on user behavior”
- Product Reviews: Customer reviews for products you know well. Search by user sentiment, use case, or product features. Example queries: “laptops good for video editing under budget”, “skincare for sensitive skin winter routine”
Build Steps
Step 1: Initialize Client
Standard init (local)
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models
import os
client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))
# For Colab:
# from google.colab import userdata
# client = QdrantClient(url=userdata.get("QDRANT_URL"), api_key=userdata.get("QDRANT_API_KEY"))
encoder = SentenceTransformer("all-MiniLM-L6-v2")
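Optionally, before moving on, a quick sanity check (not part of the required steps) confirms the client can reach your cluster and that the model produces 384-dimensional vectors:

# Fails fast if QDRANT_URL / QDRANT_API_KEY are wrong
print(client.get_collections())
# Should print 384 for all-MiniLM-L6-v2
print("Embedding dimension:", encoder.get_sentence_embedding_dimension())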
Step 2: Prepare Your Dataset
Create a dataset of 8-15 items with rich descriptions:
# Example: Recipe collection
my_dataset = [
{
"title": "Classic Beef Bourguignon",
"description": """A rich, wine-braised beef stew from Burgundy, France.
Tender chunks of beef are slowly simmered with pearl onions, mushrooms,
and bacon in a deep red wine sauce. The long, slow cooking process
develops complex flavors and creates a luxurious, velvety texture.
Perfect for cold winter evenings when you want something hearty and
comforting. Traditionally served with crusty bread or creamy mashed
potatoes to soak up the incredible sauce.""",
"cuisine": "French",
"difficulty": "Intermediate",
"time": "3 hours"
},
# Add 7-14 more items with similarly rich descriptions
]
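Chunking only produces interesting differences when descriptions are long enough to split, so it helps to glance at item lengths before indexing. A small check, assuming the my_dataset structure above:

# Very short descriptions will yield a single chunk under every strategy
for item in my_dataset:
    word_count = len(item["description"].split())
    print(f"{item['title']}: {word_count} words")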
Step 3: Implement Three Chunking Strategies
def fixed_size_chunks(text, chunk_size=100, overlap=20):
"""Split text into fixed-size chunks with overlap"""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk_words = words[i:i + chunk_size]
if chunk_words: # Only add non-empty chunks
chunks.append(' '.join(chunk_words))
return chunks
def sentence_chunks(text, max_sentences=3):
"""Group sentences into chunks"""
import re
sentences = re.split(r'[.!?]+', text)
sentences = [s.strip() for s in sentences if s.strip()]
chunks = []
for i in range(0, len(sentences), max_sentences):
chunk_sentences = sentences[i:i + max_sentences]
if chunk_sentences:
chunks.append('. '.join(chunk_sentences) + '.')
return chunks
def paragraph_chunks(text):
"""Split by paragraphs or double line breaks"""
chunks = [chunk.strip() for chunk in text.split('\n\n') if chunk.strip()]
return chunks if chunks else [text] # Fallback to full text
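Before wiring these into Qdrant, you can run all three strategies on a single description to see how differently they split the same text; a quick sketch using the sample recipe above:

sample_text = my_dataset[0]["description"]
for name, chunker in [
    ("fixed", fixed_size_chunks),
    ("sentence", sentence_chunks),
    ("paragraph", paragraph_chunks),
]:
    chunks = chunker(sample_text)
    print(f"{name}: {len(chunks)} chunk(s); first chunk starts: {chunks[0][:60]!r}")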
Step 4: Create Collections and Process Data
Note: If you are already familiar with Qdrant’s filterable HNSW, you will know that effective filtering and grouping often rely on creating payload indexes before the HNSW index is built. To keep things simple in this tutorial, we run basic filtered searches with minimal indexing and cover proper usage of payload indexes on day 2 of this course.
collection_name = "day1_semantic_search"
if client.collection_exists(collection_name=collection_name):
client.delete_collection(collection_name=collection_name)
# Create a collection with three named vectors
client.create_collection(
collection_name=collection_name,
vectors_config={
"fixed": models.VectorParams(size=384, distance=models.Distance.COSINE),
"sentence": models.VectorParams(size=384, distance=models.Distance.COSINE),
"paragraph": models.VectorParams(size=384, distance=models.Distance.COSINE),
},
)
# Index fields for filtering (more on this on day 2)
client.create_payload_index(
collection_name=collection_name,
field_name="chunk_strategy",
field_schema=models.PayloadSchemaType.KEYWORD,
)
# Process and upload data
points = []
point_id = 0
for item in my_dataset:
description = item["description"]
# Process with each chunking strategy
strategies = {
"fixed": fixed_size_chunks(description),
"sentence": sentence_chunks(description),
"paragraph": paragraph_chunks(description),
}
for strategy_name, chunks in strategies.items():
for chunk_idx, chunk in enumerate(chunks):
# Create vectors for this chunk
vectors = {strategy_name: encoder.encode(chunk).tolist()}
points.append(
models.PointStruct(
id=point_id,
vector=vectors,
payload={
**item, # Include all original metadata
"chunk": chunk,
"chunk_strategy": strategy_name,
"chunk_index": chunk_idx,
},
)
)
point_id += 1
client.upload_points(collection_name=collection_name, points=points)
print(f"Uploaded {len(points)} chunks across three strategies")
Step 5: Test and Compare
def compare_search_results(query):
"""Compare search results across all chunking strategies"""
print(f"Query: '{query}'\n")
for strategy in ["fixed", "sentence", "paragraph"]:
results = client.query_points(
collection_name=collection_name,
query=encoder.encode(query).tolist(),
using=strategy,
limit=3,
)
print(f"--- {strategy.upper()} CHUNKING ---")
for i, point in enumerate(results.points, 1):
print(f"{i}. {point.payload['title']}")
print(f" Score: {point.score:.3f}")
print(f" Chunk: {point.payload['chunk'][:80]}...")
print()
# Test with domain-specific queries
test_queries = [
"comfort food for winter", # Adapt these to your domain
"quick and easy weeknight dinner",
"elegant dish for special occasions",
]
for query in test_queries:
compare_search_results(query)
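Because every item contributes several chunks per strategy, the same title can appear more than once in a strategy's top results. If that makes comparison noisy, a small post-processing helper (plain Python, not a Qdrant feature) keeps only the best-scoring chunk per title; call it on results.points inside compare_search_results if you prefer one line per item:

def best_chunk_per_title(points):
    """Keep the highest-scoring chunk per title (query results arrive sorted by score)."""
    seen_titles = set()
    deduped = []
    for point in points:
        title = point.payload["title"]
        if title not in seen_titles:
            seen_titles.add(title)
            deduped.append(point)
    return deduped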
Step 6: Analyze Your Results
After running your tests, analyze what you discovered:
def analyze_chunking_effectiveness():
"""Analyze which chunking strategy works best for your domain"""
print("CHUNKING STRATEGY ANALYSIS")
print("=" * 40)
# Get chunk statistics for each strategy
for strategy in ["fixed", "sentence", "paragraph"]:
# Count chunks per strategy
results = client.scroll(
collection_name=collection_name,
scroll_filter=models.Filter(
must=[
models.FieldCondition(
key="chunk_strategy", match=models.MatchValue(value=strategy)
)
]
),
limit=100,
)
chunks = results[0]
chunk_sizes = [len(chunk.payload["chunk"]) for chunk in chunks]
print(f"\n{strategy.upper()} STRATEGY:")
print(f" Total chunks: {len(chunks)}")
print(f" Avg chunk size: {sum(chunk_sizes)/len(chunk_sizes):.0f} chars")
print(f" Size range: {min(chunk_sizes)}-{max(chunk_sizes)} chars")
analyze_chunking_effectiveness()
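The statistics above describe chunk shape; to put a rough number on retrieval quality you can also compare the top scores each strategy achieves on your test queries. Cosine scores are only a loose proxy for relevance, so treat this as a sketch rather than a definitive metric:

def average_top_score(queries, strategy, top_k=3):
    """Average the best similarity score a strategy achieves across a list of queries."""
    scores = []
    for query in queries:
        response = client.query_points(
            collection_name=collection_name,
            query=encoder.encode(query).tolist(),
            using=strategy,
            limit=top_k,
        )
        if response.points:
            scores.append(response.points[0].score)
    return sum(scores) / len(scores) if scores else 0.0

for strategy in ["fixed", "sentence", "paragraph"]:
    print(f"{strategy}: avg top-1 score = {average_top_score(test_queries, strategy):.3f}")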
Success Criteria
You’ll know you’ve succeeded when:
- Your search engine finds relevant results by meaning, not just keywords
- You can clearly explain which chunking strategy works best for your domain
- You’ve discovered something surprising about how chunking affects search
- You can articulate the trade-offs between different approaches
Share Your Discovery
Now it’s time to analyze your results and share what you’ve learned. Follow these steps to document your findings and prepare them for sharing.
Step 1: Reflect on Your Findings
- Domain & Dataset: What content you chose and why; dataset size/complexity.
- Chunking Comparison: What you observed for fixed / sentence / paragraph.
- The Winner: Which worked best and why (one clear reason).
- Example Query: One query where the winner beat another strategy.
Step 2: Post Your Results
Post your results using this short template (copy, fill, and send):
**[Day 1] Building a Semantic Search Engine**
**High-Level Summary**
- **Domain:** "I built a semantic search for [recipes/books/articles/etc.]"
- **Winner:** "Best chunking strategy was [fixed/sentence/paragraph] because [one reason]"
**Project-Specific Details**
- **Collection:** day1_semantic_search (Cosine) with vectors: fixed/sentence/paragraph
- **Dataset:** [N items] (snapshot: YYYY-MM-DD)
- **Chunks:** fixed=[count]/[avg chars], sentence=[count]/[avg chars], paragraph=[count]/[avg chars]
- **Demo query:** "Try '[your example query]'" — it found [what was interesting]
**Surprise**
- "[Most unexpected finding was …]"
**Next step**
- "[What you’ll try tomorrow]"
Optional: Go Further
Add Metadata Filtering
Enhance your search with filters like we saw in the movie demo:
# Example: Find Italian recipes that are quick to make
# Tip: cuisine and time are already in the payload (spread in from each item), so this
# filter works as-is; for larger collections, also create a payload index on the fields
# you filter by with create_payload_index (more on this on day 2).
results = client.query_points(
collection_name=collection_name,
query=encoder.encode("comfort food").tolist(),
using="sentence",
query_filter=models.Filter(
must=[
models.FieldCondition(key="cuisine", match=models.MatchValue(value="Italian")),
models.FieldCondition(key="time", match=models.MatchValue(value="30 minutes"))
]
),
limit=3
)
Try Different Embedding Models
Experiment with other models to see how they affect results:
# Compare with a different model
encoder_large = SentenceTransformer("all-mpnet-base-v2") # Larger, potentially better
encoder_fast = SentenceTransformer("all-MiniLM-L12-v2") # Different size/speed tradeoff
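Keep in mind that different models produce different vector sizes: all-mpnet-base-v2 outputs 768-dimensional embeddings, so it cannot be written into the 384-dimensional named vectors created above; you would need a new collection or additional named vectors sized accordingly. A quick check:

# all-mpnet-base-v2 -> 768 dims (needs its own vector config); all-MiniLM-L12-v2 -> 384 dims
print(encoder_large.get_sentence_embedding_dimension())
print(encoder_fast.get_sentence_embedding_dimension())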
Ready for Day 2? Tomorrow you’ll learn how Qdrant makes vector search lightning-fast through HNSW indexing and how to optimize for production workloads.
