DeepEval
DeepEval by Confident AI is an open-source framework for testing large language model systems. Similar to Pytest but purpose-built for LLM outputs, it scores them with metrics such as G-Eval, hallucination, and answer relevancy.
DeepEval can be integrated with Qdrant to evaluate RAG pipelines — ensuring your LLM applications return relevant, grounded, and faithful responses based on retrieved vector search context.
How it works
A test case is a blueprint provided by DeepEval to unit test LLM outputs. There are two types of test cases in DeepEval:
- LLMTestCase: Used to evaluate a single input-output pair, such as RAG responses or agent actions.
- ConversationalTestCase: A sequence of LLMTestCase turns representing a back-and-forth interaction with an LLM system. This is especially useful for chatbot or assistant testing.
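For example, a single RAG interaction and a short conversation could be wrapped like this (a minimal sketch; the strings are illustrative placeholders):

from deepeval.test_case import LLMTestCase, ConversationalTestCase

# A single input-output pair with its retrieved context attached
rag_case = LLMTestCase(
    input="What is Qdrant?",
    actual_output="Qdrant is an open-source vector database.",
    retrieval_context=["Qdrant is a vector similarity search engine ..."],
)

# A multi-turn interaction expressed as a sequence of LLMTestCase turns
chat_case = ConversationalTestCase(
    turns=[
        LLMTestCase(input="Hi, what can you do?", actual_output="I answer questions about Qdrant."),
        LLMTestCase(input="What is a collection?", actual_output="A collection is a named set of points."),
    ]
)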
Metrics Overview
DeepEval offers a suite of metrics to evaluate various aspects of LLM outputs, including:
- Answer Relevancy: Measures how relevant the LLM’s output is to the given input query.
- Faithfulness: Assesses whether the LLM’s response is grounded in the provided context, ensuring factual accuracy.
- Contextual Precision: Determines whether the most relevant pieces of context are ranked higher than less relevant ones.
- G-Eval: A versatile metric that uses LLM-as-a-judge with chain-of-thought reasoning to evaluate outputs based on custom criteria; see the example below.
- Hallucination: Detects instances where the LLM generates information not present in the source context.
- Toxicity: Assesses the presence of harmful or offensive content in the LLM’s output.
- Bias: Evaluates the output for any unintended biases.
- Summarization: Measures the quality and accuracy of generated summaries.
For a comprehensive list and detailed explanations of all available metrics, please refer to the DeepEval metrics reference.
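For example, G-Eval lets you describe evaluation criteria in plain language. A minimal sketch using DeepEval's GEval class (the metric name and criteria string here are illustrative):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# LLM-as-a-judge metric scored against a custom, plain-language criterion
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)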
Using Qdrant with DeepEval
Install the client libraries and log in to Confident AI:
$ pip install deepeval qdrant-client
$ deepeval login
You can use Qdrant to power your RAG system by retrieving relevant documents for a query, feeding them into your prompt, and evaluating the generated output using DeepEval.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
)

# Assumes an existing qdrant_client, an llm wrapper, and a build_prompt helper.

# 1. Query context from Qdrant
results = qdrant_client.query_points(...)
# retrieval_context expects a list of strings, so extract the text from each
# payload (assuming each payload stores its document under a "text" key)
context = [point.payload["text"] for point in results.points]

# 2. Construct a prompt from the query and the retrieved context
prompt = build_prompt(query, context)

# 3. Generate a response from your LLM
response = llm.generate(prompt)

# 4. Create a test case for evaluation
test_case = LLMTestCase(
    input=query,
    actual_output=response,
    expected_output=ground_truth_answer,
    retrieval_context=context,
)

# 5. Evaluate the output using DeepEval
evaluate(
    test_cases=[test_case],
    metrics=[
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
        ContextualPrecisionMetric(),
        # add any other metrics you need here
    ],
)
All evaluations performed using DeepEval can be viewed on the Confident AI Dashboard.
You can scale this process with a dataset (e.g. from Hugging Face) and evaluate multiple test cases at once by looping through question-answer pairs, querying Qdrant for context, and scoring with DeepEval metrics.
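A minimal sketch of that loop, reusing the client, LLM wrapper, and helpers from the example above (qa_pairs stands in for a hypothetical list of question-answer tuples loaded from your dataset):

test_cases = []
for question, ground_truth in qa_pairs:
    # Retrieve context for each question from Qdrant
    results = qdrant_client.query_points(...)
    context = [point.payload["text"] for point in results.points]
    # Generate an answer and collect it as a test case
    answer = llm.generate(build_prompt(question, context))
    test_cases.append(
        LLMTestCase(
            input=question,
            actual_output=answer,
            expected_output=ground_truth,
            retrieval_context=context,
        )
    )

# Score every test case in one call
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
)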