DeepEval

DeepEval by Confident AI is an open-source framework for testing large language model systems. Similar to Pytest but designed for LLM outputs, it evaluates responses with metrics such as G-Eval, hallucination, and answer relevancy.

DeepEval can be integrated with Qdrant to evaluate RAG pipelines — ensuring your LLM applications return relevant, grounded, and faithful responses based on retrieved vector search context.

How it works

A test case is a blueprint provided by DeepEval to unit test LLM outputs. There are two types of test cases in DeepEval:

LLMTestCase: Used to evaluate a single input-output pair, such as RAG responses or agent actions.

ConversationalTestCase: A sequence of LLMTestCase turns representing a back-and-forth interaction with an LLM system. This is especially useful for chatbot or assistant testing.
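
Here is a minimal sketch of both (the input and output strings are placeholders, and the exact ConversationalTestCase constructor may vary slightly between DeepEval versions):

from deepeval.test_case import LLMTestCase, ConversationalTestCase

# A single input/output pair, e.g. one RAG response
single_turn = LLMTestCase(
    input="What is Qdrant?",
    actual_output="Qdrant is an open-source vector database.",
    retrieval_context=["Qdrant is a vector similarity search engine ..."],
)

# A multi-turn interaction built from individual test cases
conversation = ConversationalTestCase(
    turns=[
        single_turn,
        LLMTestCase(
            input="Does it support filtering?",
            actual_output="Yes, Qdrant supports payload-based filtering.",
        ),
    ]
)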

Metrics Overview

DeepEval offers a suite of metrics to evaluate various aspects of LLM outputs, including:

  • Answer Relevancy: Measures how relevant the LLM’s output is to the given input query.
  • Faithfulness: Assesses whether the LLM’s response is grounded in the provided context, ensuring factual accuracy.
  • Contextual Precision: Determines whether the most relevant pieces of context are ranked higher than less relevant ones.
  • G-Eval: A versatile metric that uses LLM-as-a-judge with chain-of-thought reasoning to evaluate outputs based on custom criteria.
  • Hallucination: Detects instances where the LLM generates information not present in the source context.
  • Toxicity: Assesses the presence of harmful or offensive content in the LLM’s output.
  • Bias: Evaluates the output for any unintended biases.
  • Summarization: Measures the quality and accuracy of generated summaries.

For a comprehensive list and detailed explanations of all available metrics, please refer to the DeepEval metrics reference.
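
For example, metrics are instantiated as classes and configured per test run; the sketch below shows a thresholded Answer Relevancy metric alongside a custom G-Eval metric (the threshold and criteria string are illustrative, not prescribed values):

from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# Built-in metric with a pass/fail threshold
relevancy = AnswerRelevancyMetric(threshold=0.7)

# Custom LLM-as-a-judge metric defined with G-Eval
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)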

Using Qdrant with DeepEval

Install the client libraries.

$ pip install deepeval qdrant-client

$ deepeval login
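
The snippets below assume a Qdrant client is already connected. A minimal setup might look like this (the URL and API key are placeholders for your own deployment):

from qdrant_client import QdrantClient

# Connect to a running Qdrant instance (local Docker or Qdrant Cloud)
qdrant_client = QdrantClient(
    url="http://localhost:6333",
    # api_key="<your-api-key>",  # required for Qdrant Cloud
)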

You can use Qdrant to power your RAG system by retrieving relevant documents for a query, feeding them into your prompt, and evaluating the generated output using DeepEval.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
)

# 1. Query context from Qdrant and collect the retrieved document texts
#    (assumes each point's payload stores the document under a "text" key)
results = qdrant_client.query_points(...)
context = [point.payload["text"] for point in results.points]

# 2. Construct prompt using query + retrieved context
prompt = build_prompt(query, context)

# 3. Generate response from your LLM
response = llm.generate(prompt)

# 4. Create a test case for evaluation
test_case = LLMTestCase(
    input=query,
    actual_output=response,
    expected_output=ground_truth_answer,
    retrieval_context=context
)

# 5. Evaluate the output using DeepEval
evaluate(
    test_cases=[test_case],
    metrics=[
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
        ContextualPrecisionMetric(),
        # add any other metrics you need
    ],
)

All evaluations performed using DeepEval can be viewed on the Confident AI Dashboard.

You can scale this process with a dataset (e.g. from Hugging Face) and evaluate multiple test cases at once by looping through question-answer pairs, querying Qdrant for context, and scoring with DeepEval metrics.
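
One possible shape for that loop is sketched below. The dataset structure, collection name, "text" payload key, and the embed(), build_prompt(), and llm.generate() helpers are assumptions for illustration, not part of the DeepEval or Qdrant APIs:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_cases = []
for item in dataset:  # e.g. a list of {"question": ..., "answer": ...} records
    question, answer = item["question"], item["answer"]

    # Retrieve context for this question from Qdrant
    results = qdrant_client.query_points(
        collection_name="docs",     # placeholder collection name
        query=embed(question),      # embed() is your own embedding function
        limit=5,
    )
    context = [point.payload["text"] for point in results.points]

    # Generate a response and build a test case
    response = llm.generate(build_prompt(question, context))
    test_cases.append(
        LLMTestCase(
            input=question,
            actual_output=response,
            expected_output=answer,
            retrieval_context=context,
        )
    )

# Score every test case in a single run
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
)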
