Experiments
An experiment on Confident AI is a contained way to benchmark LLM applications. You can create an experiment on Confident AI and define evaluation metrics for it to evaluate and test your LLM application's performance at scale. Running an experiment produces a test run, which contains the evaluation results of the test cases that your LLM application was evaluated on.
You can evaluate test cases produced by your LLM application directly on Confident AI by simply sending them over via deepeval, with fields such as actual_output and retrieval_context populated by your LLM application. All compute and LLMs required for evaluation are provided by Confident AI.
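For example, a single test case with these fields populated might look like the following (a minimal sketch; the input, actual_output, and retrieval_context values are placeholders you would fill in from your own LLM application):

```python
from deepeval.test_case import LLMTestCase

# Placeholder values; in practice, actual_output and retrieval_context
# come from your LLM application's response and retrieval pipeline
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a refund within 30 days of purchase.",
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
)
```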
Creating An Experiment
You can easily create an experiment on Confident AI's "Evaluation & Testing" page by providing your experiment with a unique name and a set of metrics to start with. In this RAG use case example, we have named our experiment "RAG Experiment" and have chosen the 'Answer Relevancy' and 'Contextual Relevancy' metrics as a starting point.
You can then edit the metric configurations (such as threshold), add additional metrics, or even change the experiment name on the individual experiment page once you have created the experiment.
Running An Experiment
To run evaluations on your newly created experiment on Confident AI, simply:
- Create LLMTestCases/ConversationalTestCases (in code) with required fields such as actual_output generated by the LLM application you're trying to evaluate.
- Send the created test cases to Confident AI via deepeval using the confident_evaluate function, supplying the experiment_name in the process.
You must be logged in to Confident AI through deepeval for this to work.
```python
from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

confident_evaluate(
    experiment_name="My First Experiment",
    test_cases=[LLMTestCase(...)]
)
```
There are two mandatory arguments and one optional argument when calling the confident_evaluate() function:
- experiment_name: a string that specifies the name of the experiment on Confident AI you wish to evaluate your test cases against.
- test_cases: a list of LLMTestCases/ConversationalTestCases OR an EvaluationDataset. Confident AI will evaluate your LLM application on these test cases using the metrics you defined for this particular experiment.
- disable_browser_opening (optional): a boolean which, when set to True, disables the auto-opening of the browser that brings you to the experiment page of experiment_name.
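Putting the arguments together, a call might look like the following (a sketch that assumes your deepeval version exposes EvaluationDataset(test_cases=...); the experiment name and test case contents are placeholders):

```python
from deepeval import confident_evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Placeholder test cases; populate actual_output (and any other fields your
# experiment's metrics require) from your own LLM application
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(input="...", actual_output="..."),
        LLMTestCase(input="...", actual_output="..."),
    ]
)

confident_evaluate(
    experiment_name="RAG Experiment",
    test_cases=dataset,  # an EvaluationDataset or a list of test cases
    disable_browser_opening=True,  # don't auto-open the experiment page
)
```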
Once an experiment has finished running on Confident AI's infrastructure, a test run will be produced. A test run, as explained in the next section, is the evaluation result of your LLM application based on the metrics defined for the experiment, and is also available to view on Confident AI.
Setting Up No-Code Experiment Runs
You can also set up a webhook endpoint that accepts a POST request over HTTPS, enabling users to run evaluations at the click of a button instead of through code. This is particularly helpful if you wish to enable a no-code evaluation workflow for non-technical users. At a high level, you would still call the confident_evaluate() function as shown above, but with modifications to verify that the webhook event is in fact from Confident AI.
Create a Webhook
To set up a webhook on Confident AI to initiate an experiment run, follow these initial steps:
- Host an endpoint that accepts a POST request over HTTPS. You should test whether your endpoint is indeed reachable through the internet via HTTPS before continuing (see the sketch after this list).
- Create a webhook on Confident AI at Evaluation & Testing > Experiment > Webhook by providing your endpoint URL. This will generate a signing secret which you'll need later.
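One quick way to check reachability is to send a test POST request from outside your network (a sketch using the requests library; the URL is a placeholder for your own deployment):

```python
import requests

# Any HTTP response at all proves the endpoint is reachable over HTTPS;
# the URL below is a placeholder for your own deployment
response = requests.post(
    "https://your-domain.com/evaluation-webhook",
    json={"ping": "test"},
    timeout=10,
)
print(response.status_code)
```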
Once you have your signing secret, which starts with the prefix confident_, save it securely in your environment. The signing secret will be used to verify that any request calling your endpoint is indeed from Confident AI.
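For example, you might export the secret as an environment variable and read it at application startup (a sketch; CONFIDENT_AI_SECRET is simply the variable name used by the example endpoint below, not a name required by Confident AI):

```python
import os

# Fails fast at startup if the signing secret isn't configured
confident_secret = os.environ["CONFIDENT_AI_SECRET"]
assert confident_secret.startswith("confident_"), "Unexpected signing secret format"
```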
Implement Your Endpoint
Now that you have your endpoint up and running, implement the following steps to make it functional:
- Authenticate the webhook request by reading the "Confident-AI-Signature" from the request headers. The signature can then be verified using the verify_confident_signature() function provided by deepeval.
- Send a 200 OK status immediately after signature verification. You must send the 200 status code immediately; otherwise Confident AI will assume your endpoint isn't set up correctly and won't wait for test cases to be sent over.
- Trigger an experiment run by generating and sending over test cases using the confident_evaluate() function.
Here's an example implementation of an endpoint set up using FastAPI:
```python
import os

from fastapi import FastAPI, Request, BackgroundTasks, Header, HTTPException
from deepeval import confident_evaluate
from deepeval.confident import construct_confident_event

app = FastAPI()
confident_secret = os.environ["CONFIDENT_AI_SECRET"]


@app.post("/evaluation-webhook")
async def evaluation_webhook(
    request: Request,
    background_tasks: BackgroundTasks,
    confident_ai_signature: str = Header(None),  # Read "Confident-AI-Signature" from headers
):
    payload = await request.body()

    # Step 1: Authenticate the webhook request
    confident_event = construct_confident_event(
        payload, confident_ai_signature, confident_secret
    )

    # Step 2: Send 200 OK immediately after signature verification
    response = {"status": "Webhook received"}

    # Step 3: Trigger an experiment run in the background
    def evaluate():
        # Generate test cases as usual
        test_cases = [...]
        confident_evaluate(
            # use your event here, you do not
            # have to provide an experiment name
            confident_event=confident_event,
            test_cases=test_cases,
        )

    background_tasks.add_task(evaluate)
    return response


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
The example above demonstrates how to handle a webhook from Confident AI. The main idea is to authenticate the webhook request, send an immediate acknowledgment, and then process the webhook's data asynchronously. Although the example uses FastAPI, the approach can be adapted to any Python framework.
Here's a breakdown of what's happening:
- Webhook Request: When a POST request is sent to the /evaluation-webhook endpoint, the request contains the webhook payload and a signature in the Confident-AI-Signature header.
- Authenticate the Request: The construct_confident_event() function verifies that the signature is valid by comparing it with a computed signature based on the payload, and constructs an object for later use. If the signature doesn't match, it returns a 400 Bad Request error. This step ensures that the request is authentic.
- Immediate Response: After verifying the signature, the code sends a 200 OK response immediately.
- Asynchronous Processing: After the response is sent, an experiment is triggered in the background. In this FastAPI example, the BackgroundTasks class is used to offload the task without blocking the response. In a different framework or environment, you could use Python's threading or concurrent.futures modules to run the evaluation asynchronously instead (see the sketch after this list).
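For example, outside of FastAPI you could hand the evaluation off to a thread pool after returning the 200 response (a sketch using concurrent.futures; evaluate() and confident_event stand in for the same objects as in the example above):

```python
from concurrent.futures import ThreadPoolExecutor

from deepeval import confident_evaluate

executor = ThreadPoolExecutor(max_workers=1)


def evaluate(confident_event):
    # Generate test cases as usual, then trigger the experiment run
    test_cases = [...]
    confident_evaluate(confident_event=confident_event, test_cases=test_cases)


def handle_webhook(confident_event):
    # Called after the signature has been verified and a 200 response has
    # been returned; the evaluation runs on the thread pool instead of
    # blocking the webhook handler
    executor.submit(evaluate, confident_event)
```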
Start an Evaluation
You can now head back to the Experiment Details page and press the "Evaluate" button to trigger an experiment run. Log any errors during setup, and reach out to support@confident-ai.com if you run into issues.