Set Up Monitoring
In the previous sections, we saw how to use Confident AI to benchmark and evaluate LLM applications in development. Once your LLM application is in production, however, benchmarking alone is no longer enough: you need real-time performance insights to determine whether your application is performing well in the real world.
Monitoring with Confident AI allows you to bring the metrics you use in development into a production environment for real-time insights. These metrics are also referred to as online metrics.
Confident AI allows you to monitor LLM outputs in production through a single API call via deepeval. By monitoring outputs in production, you can easily identify unsatisfactory LLM outputs, collect human feedback in the real world, and incorporate such data to build the next version of your evaluation dataset.
Confident AI's LLM monitoring is a step up from other solutions because, apart from tracing (which is great for debugging), we focus on monitoring the observable inputs and outputs of your LLM application. This means our monitoring:
- Is great for product teams, where the end result matters more than the traces required for debugging.
- Enables fine-grained filtering and re-creation of production environments for A/B testing.
- Allows teams to incorporate real-world data to make existing datasets more robust over time for trustworthy evaluation.
Why Should I Set Up Monitoring?
You should set up LLM monitoring because it allows you to pinpoint where your LLM application falls short in the real world. Here are 5 benefits of setting up monitoring:
- No more querying databases to see what your application is generating.
- Granular filters and searches to pinpoint how different iterations of your LLM app are performing.
- Real-time performance insights using deepeval metrics, to A/B test how different prompts/models affect quality.
- Enables the collection of human feedback, both from end users and internal annotators.
- Takes less than 30 seconds to set up, and it doesn't hurt to do it.
Setting up monitoring takes literally 30 seconds (assuming you already have your code editor open). It also DOES NOT introduce any latency to your application since monitoring is done asynchronously (although we also offer a synchronous counterpart).
Monitor Your First Output
To monitor an LLM output, simply call the deepeval.monitor(...) method, which can be run asynchronously or synchronously and makes a single API call to Confident AI. This typically happens at the end of your LLM application function, once generation has completed.
Below are four examples covering how to monitor in both synchronous and asynchronous environments, with or without streaming enabled:
Asynchronous monitoring example with streaming
```python
import openai
import asyncio
import deepeval

async def async_with_stream(user_message: str):
    model = "gpt-4-turbo"
    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        stream=True
    )
    full_output = ""
    async for chunk in response:
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            full_output += delta

    # Run monitor() asynchronously in the background
    asyncio.create_task(
        deepeval.a_monitor(input=user_message, output=full_output, model=model, event_name="RAG chatbot")
    )

asyncio.run(async_with_stream("Tell me a joke."))
```
Asynchronous monitoring example without streaming
```python
import openai
import asyncio
import deepeval

async def async_without_stream(user_message: str):
    model = "gpt-4-turbo"
    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": user_message}]
    )
    output = response["choices"][0]["message"]["content"]

    # Run monitor() asynchronously in the background
    asyncio.create_task(
        deepeval.a_monitor(input=user_message, output=output, model=model, event_name="RAG chatbot")
    )
    return output

async def main():
    print(await async_without_stream("Tell me a joke."))

asyncio.run(main())
```
Synchronous monitoring example without streaming
```python
import openai
import deepeval

def sync_without_stream(user_message: str):
    model = "gpt-4-turbo"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": user_message}]
    )
    output = response["choices"][0]["message"]["content"]

    # Run monitor() synchronously
    deepeval.monitor(input=user_message, output=output, model=model, event_name="RAG chatbot")
    return output

print(sync_without_stream("Tell me a joke."))
```
Synchronous monitoring example with streaming
```python
import openai
import deepeval

def sync_with_stream(user_message: str):
    model = "gpt-4-turbo"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        stream=True
    )
    full_output = ""
    for chunk in response:
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            full_output += delta

    # Run monitor() synchronously
    deepeval.monitor(input=user_message, output=full_output, model=model, event_name="RAG chatbot")

sync_with_stream("Tell me a joke.")
```
The examples above use OpenAI, but realistically the call will be whatever implementation your LLM application uses to generate the observable outputs based on the input it receives.
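For instance, here is a minimal sketch of monitoring a custom, non-OpenAI generation function. The `generate_reply()` helper and the "my-custom-model" name are hypothetical placeholders for whatever your application actually uses:

```python
import deepeval

def generate_reply(user_message: str) -> str:
    # Hypothetical placeholder for your own generation logic
    # (a self-hosted model, another provider's SDK, an agent framework, etc.)
    return "..."

def my_llm_app(user_message: str):
    output = generate_reply(user_message)

    # Monitor the observable input and output, regardless of how the output was produced
    deepeval.monitor(
        input=user_message,
        output=output,
        model="my-custom-model",  # hypothetical model name
        event_name="RAG chatbot"
    )
    return output
```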
There are four mandatory and ten optional parameters when using the `monitor()` function to monitor responses in production:

- `input`: type `str`, the input to your LLM application.
- `response`: type `str`, the final output of your LLM application.
- `event_name`: type `str`, specifying the type of response event monitored. The `event_name` can be thought of as an identifier for the different functionalities your LLM application performs.
- `model`: type `str`, specifying the name of the LLM model used.
- [Optional] `retrieval_context`: type `list[str]`, the contexts that were retrieved in your RAG pipeline.
- [Optional] `additional_data`: type `dict[str, Union[str, dict, Link]]`. See below for more details.
- [Optional] `hyperparameters`: type `dict[str, Union[str, int, float]]`. You can provide a dictionary to specify any additional hyperparameters used to generate the response.
- [Optional] `distinct_id`: type `str`, to identify end users using your LLM application.
- [Optional] `conversation_id`: type `str`, to group together multiple messages under a single conversation thread.
- [Optional] `completion_time`: type `float`, indicating how many seconds it took your LLM application to complete.
- [Optional] `token_usage`: type `float`, specifying the number of input + output tokens consumed.
- [Optional] `token_cost`: type `float`, specifying the total cost of this LLM call.
- [Optional] `fail_silently`: type `bool`. You should set this to `False` in development to check whether `monitor()` is working properly. Defaults to `False`.
- [Optional] `raise_exception`: type `bool`. Set this to `False` if you don't want exceptions to be raised in production. Defaults to `True`.
Please do NOT provide placeholder values for optional parameters, as this will pollute the data used for filtering and searching in your project. Instead, simply leave them blank.
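For example, a monitored call for a RAG chatbot might look like the sketch below, following the keyword arguments used in the code examples above. All parameter values here are purely illustrative:

```python
import deepeval

deepeval.monitor(
    input="Why did the chicken cross the road?",
    output="To get to the other side.",
    model="gpt-4-turbo",
    event_name="RAG chatbot",
    # Optional parameters - only supply the ones you actually have values for
    retrieval_context=["The chicken lived on a farm next to a busy road."],
    distinct_id="user-1234",
    conversation_id="conversation-5678",
    completion_time=2.4,
    token_usage=187,
    token_cost=0.002,
    fail_silently=False  # keep False in development to surface any monitoring errors
)
```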
The `monitor()`/`a_monitor()` method returns a `monitor_id` upon a successful API request to Confident AI's servers, which you can later use to collect human feedback regarding a particular LLM interaction you've monitored.
```python
import deepeval

monitor_id = deepeval.monitor(...)
```
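As a purely hypothetical sketch of what collecting feedback with that `monitor_id` could look like, consider the snippet below. The `send_feedback` helper and its parameter names are assumptions rather than confirmed API, so refer to the human feedback documentation for the exact call:

```python
import deepeval

monitor_id = deepeval.monitor(...)

# Hypothetical call - the helper name and parameters are assumptions,
# see the human feedback documentation for the actual API
deepeval.send_feedback(
    monitor_id=monitor_id,
    rating=5,  # e.g. a thumbs-up from an end user
    explanation="Accurate and concise answer."
)
```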
Congratulations 🎉! With a few lines of code, deepeval
now automatically logs all outputs of your LLM application in production to Confident AI.
Logging Hyperparameters
In addition to tracking which model was used to generate each output, you can log custom hyperparameters for every monitored interaction.
Using the monitor()
/a_monitor()
method to log hyperparameters enables you to recreate the exact conditions under which specific outputs were generated in Confident AI.
By applying filters based on these hyperparameters, you can quickly find outputs produced under a particular configuration.
```python
import deepeval

deepeval.monitor(
    ...,
    hyperparameters={
        "prompt template": "...",
        "temperature": 1,
        "chunk size": 500
    }
)
```
Logging Custom Properties
Similar to hyperparameters, you can easily associate custom additional data for each output, which would also allow you to filter and search for on Confident AI.
For example, if your application is a RAG pipeline and you wish to log not just the retrieval_context
but also the links of the documents these retrieval_context
belong to, here is the place to do it.
```python
import deepeval
from deepeval.monitor import Link

deepeval.monitor(
    ...,
    additional_data={
        "Example Text": "...",
        # The Link class allows you to access clickable links directly on Confident AI
        "Example Link": Link(value="https://www.youtube.com"),
        "Example list of Links": [Link(value="https://www.instagram.com")],
        "Example JSON": {"Any Key": "Any Value"}
    },
)
```
There are four types of data you can log for `additional_data`:
- A plain string.
- A `Link` object.
- A list of `Link` objects.
- A custom dictionary of any structure (as shown in the "Example JSON").
Although you can technically log a link as a native string, the Link(value="...")
class allows you to access said link directly on Confident AI. You should also aim to have your Link
s start with 'https://...'.
Logging Conversations
If you're building conversational agents or conversational systems such as LLM chatbots or voice agents, you can group together monitored outputs as sequential messages.
If you're storing conversation threads in a database, the `conversation_id` will most likely be the ID of your conversation data row.
Grouping is only possible if a `conversation_id` was supplied to `deepeval`'s `monitor()` or `a_monitor()` function during monitoring.
```python
import deepeval

deepeval.monitor(conversation_id="your-conversation-id", ...)
```
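For example, a simple chatbot loop might monitor each turn under the same thread. This is a minimal sketch: the `generate_reply()` helper and the way the conversation ID is obtained are placeholders for your own implementation.

```python
import uuid
import deepeval

def generate_reply(user_message: str) -> str:
    # Hypothetical placeholder for your chatbot's generation logic
    return "..."

def chat_session():
    # In practice this would likely be the ID of your conversation row in your database
    conversation_id = str(uuid.uuid4())

    for user_message in ["Hi!", "Tell me a joke.", "Another one, please."]:
        output = generate_reply(user_message)

        # Every monitored turn shares the same conversation_id,
        # so Confident AI groups them into a single conversation thread
        deepeval.monitor(
            input=user_message,
            output=output,
            model="gpt-4-turbo",
            event_name="RAG chatbot",
            conversation_id=conversation_id
        )

chat_session()
```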
But What About Tracing?
Confident AI also offers flexible LLM tracing which you can associate with monitored LLM interaction. To learn how to setup tracing, go to this section.
In the next section, we'll show how to enable online metrics to run evaluations on each incoming LLM output that you monitor using deepeval
's metrics.