Chonkie

Chonkie is a no-nonsense, ultra-light, and lightning-fast chunking library designed for RAG (Retrieval-Augmented Generation) applications.

Chonkie integrates seamlessly with Qdrant through the QdrantHandshake class, allowing you to chunk, embed, and store text data without ever leaving the Chonkie SDK.

Setup

Install Chonkie with Qdrant support:

pip install "chonkie[qdrant]"

Basic Usage

The QdrantHandshake class provides a simple interface for storing and searching chunks:

from chonkie import QdrantHandshake, SemanticChunker

# Initialize handshake with custom embedding model
handshake = QdrantHandshake(
    url="http://localhost:6333",
    collection_name="my_documents",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)

# Create and write chunks
chunker = SemanticChunker()
chunks = chunker.chunk("Your text content here...")
handshake.write(chunks)

# Search using natural language
results = handshake.search(query="your search query", limit=5)
for result in results:
    print(f"{result['score']}: {result['text']}")
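
To make the score in each result more concrete, here is a minimal, standard-library-only sketch of similarity ranking over stored vectors. It assumes cosine similarity, a common choice for text embeddings; the distance metric that QdrantHandshake actually configures is not stated on this page. The helper names `cosine_similarity` and `rank` and the three-dimensional toy vectors are hypothetical, and a real search runs inside Qdrant rather than in Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as unrelated.
    return dot / (norm_a * norm_b)

def rank(query_vec, stored, limit=5):
    """Return the `limit` best matches as dicts with 'score' and 'text' keys."""
    scored = [
        {"score": cosine_similarity(query_vec, vec), "text": text}
        for text, vec in stored
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:limit]

# Toy "embeddings" (assumption: real models produce hundreds of dimensions).
stored = [
    ("Qdrant is a vector database.", [0.9, 0.1, 0.0]),
    ("Chunking splits text into pieces.", [0.1, 0.9, 0.0]),
    ("Bananas are yellow.", [0.0, 0.1, 0.9]),
]

results = rank([1.0, 0.0, 0.0], stored, limit=2)
for result in results:
    print(f"{result['score']:.3f}: {result['text']}")
```

The result dictionaries mirror the `score` and `text` keys used in the search loop above, so the printing code is interchangeable.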

Qdrant Cloud

To connect to a Qdrant Cloud cluster, pass the cluster URL and an API key:

from chonkie import QdrantHandshake

handshake = QdrantHandshake(
    url="https://your-cluster.qdrant.io",
    api_key="your-api-key",
    collection_name="my_collection",
    embedding_model="BAAI/bge-small-en-v1.5"  # Change to your preferred model
)
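
Hardcoding an API key in source is easy to leak, so one option is to read the connection settings from the environment. The sketch below uses only the standard library; the environment variable names (`QDRANT_URL`, `QDRANT_API_KEY`, `QDRANT_COLLECTION`) and the helper `qdrant_cloud_settings` are hypothetical conventions, not something Chonkie or Qdrant defines. The returned keys match the keyword arguments shown above.

```python
import os

def qdrant_cloud_settings(env=os.environ):
    """Collect QdrantHandshake keyword arguments from environment variables."""
    url = env.get("QDRANT_URL")
    api_key = env.get("QDRANT_API_KEY")
    missing = [
        name
        for name, value in (("QDRANT_URL", url), ("QDRANT_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": url,
        "api_key": api_key,
        "collection_name": env.get("QDRANT_COLLECTION", "my_collection"),
        "embedding_model": "BAAI/bge-small-en-v1.5",
    }

# Demonstration with a stand-in mapping instead of the real environment.
settings = qdrant_cloud_settings({
    "QDRANT_URL": "https://your-cluster.qdrant.io",
    "QDRANT_API_KEY": "your-api-key",
})
print(sorted(settings))

# With Chonkie installed, the settings can be unpacked into the handshake:
# handshake = QdrantHandshake(**settings)
```

Failing early with a clear message when a variable is missing tends to be easier to debug than an authentication error from the cluster.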

Complete RAG Pipeline

Build end-to-end RAG pipelines using Chonkie’s fluent Pipeline API:

from chonkie import Pipeline

# Process documents and store in Qdrant with custom embedding model
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", chunk_size=512)
    .store_in("qdrant",
              collection_name="knowledge",
              url="http://localhost:6333",
              embedding_model="sentence-transformers/all-MiniLM-L6-v2")
    .run())

print(f"Ingested {len(docs)} documents into Qdrant")
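
Before running an ingestion like the one above, it can help to preview which files match the extension filter. The following is a rough stand-in written with `pathlib`, not Chonkie's fetcher: the helper `matching_files` is hypothetical, and it assumes a non-recursive, case-insensitive match, which may differ from how the pipeline's file fetcher behaves.

```python
from pathlib import Path

def matching_files(directory, extensions):
    """List files directly inside `directory` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    root = Path(directory)
    if not root.is_dir():
        return []  # Nothing to ingest if the directory does not exist.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )

# Same directory and extensions as the pipeline example.
files = matching_files("./knowledge_base", [".txt", ".md"])
print(f"{len(files)} candidate files")
for path in files:
    print(f"  {path.name}")
```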

Pipeline with Refinements

Add refinement steps, such as overlapping context between neighboring chunks, before storing:

from chonkie import Pipeline

# Advanced pipeline with overlapping context and custom embeddings
docs = (Pipeline()
    .fetch_from("file", dir="./docs")
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant",
              url="https://your-cluster.qdrant.io",
              api_key="your-api-key",
              collection_name="knowledge_base",
              embedding_model="BAAI/bge-small-en-v1.5")
    .run())
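
As an illustration of what overlapping context means, the sketch below prefixes each chunk with the tail of the previous one. This is a simplified assumption rather than Chonkie's implementation: it counts characters instead of tokens, it only adds context from the preceding chunk, and the function name `add_overlap` is hypothetical.

```python
def add_overlap(chunks, context_size):
    """Prefix every chunk after the first with the tail of its predecessor."""
    if context_size < 0:
        raise ValueError("context_size must be non-negative")
    refined = []
    for i, chunk in enumerate(chunks):
        if i == 0 or context_size == 0:
            # The first chunk has no predecessor; size 0 means no overlap.
            refined.append(chunk)
        else:
            refined.append(chunks[i - 1][-context_size:] + chunk)
    return refined

chunks = ["alpha beta ", "gamma delta ", "epsilon"]
for text in add_overlap(chunks, context_size=6):
    print(repr(text))
```

Carrying a little of the previous chunk forward gives each stored chunk some surrounding context, at the cost of slightly larger payloads.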
