Set Up Monitoring
In the previous sections, we saw how to use Confident AI to benchmark and evaluate LLM applications in development. Once your LLM application is in production, however, benchmarking alone is no longer enough: you need real-time performance insights to determine whether your application is performing well in the real world.
Monitoring with Confident AI allows you to bring the metrics you use in development into a production environment for real-time insights. These metrics are also referred to as online metrics.
Confident AI allows you to monitor LLM outputs in production through a single API call via deepeval. By monitoring outputs in production, you can easily identify unsatisfactory LLM outputs, collect human feedback in the real world, and incorporate such data to build the next version of your evaluation dataset.
Confident AI's LLM monitoring is a step up from other solutions because, apart from tracing (which is great for debugging), we focus on monitoring the observable inputs and outputs of your LLM application. This means our monitoring:
- Is great for product teams, where the end result matters more than the traces required for debugging.
- Enables fine-grained filtering and re-creation of production environments for A/B testing.
- Allows teams to incorporate real-world data to make existing datasets more robust over time for trustworthy evaluation.
Why Should I Set Up Monitoring?
You should set up LLM monitoring because it allows you to pinpoint where your LLM application falls short in the real world. Here are 5 benefits of setting up monitoring:
- No more querying databases to see what your application is generating.
- Granular filters and searches to pinpoint how different iterations of your LLM app are performing.
- Real-time performance insights using deepeval metrics, to A/B test how different prompts/models affect quality.
- Enables the collection of human feedback, both from end users and internal annotators.
- Takes less than 30 seconds to set up, and it doesn't hurt to do it.
Setting up monitoring takes literally 30 seconds (assuming you already have your code editor open). It also DOES NOT introduce any latency to your application since monitoring is done asynchronously (although we also offer a synchronous counterpart).
Monitor Your First Output
To monitor an LLM output, simply call the deepeval.monitor(...) method, which can be run asynchronously or synchronously and makes a single API call to Confident AI. This typically happens at the end of your LLM application function, once generation has completed.
Below are four examples covering how to monitor in both synchronous and asynchronous environments, with or without streaming enabled:
Asynchronous monitoring example with streaming
```python
import openai
import asyncio
import deepeval

async def async_with_stream(user_message: str):
    model = "gpt-4-turbo"
    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        stream=True
    )
    full_output = ""
    async for chunk in response:
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            full_output += delta

    # Run monitor() asynchronously in the background
    asyncio.create_task(
        deepeval.a_monitor(input=user_message, output=full_output, model=model, event_name="RAG chatbot")
    )

asyncio.run(async_with_stream("Tell me a joke."))
```
Asynchronous monitoring example without streaming
```python
import openai
import asyncio
import deepeval

async def async_without_stream(user_message: str):
    model = "gpt-4-turbo"
    response = await openai.ChatCompletion.acreate(
        model=model,
        messages=[{"role": "user", "content": user_message}]
    )
    output = response["choices"][0]["message"]["content"]

    # Run monitor() asynchronously in the background
    asyncio.create_task(
        deepeval.a_monitor(input=user_message, output=output, model=model, event_name="RAG chatbot")
    )
    return output

async def main():
    print(await async_without_stream("Tell me a joke."))

asyncio.run(main())
```
Synchronous monitoring example without streaming
```python
import openai
import deepeval

def sync_without_stream(user_message: str):
    model = "gpt-4-turbo"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": user_message}]
    )
    output = response["choices"][0]["message"]["content"]

    # Run monitor() synchronously
    deepeval.monitor(input=user_message, output=output, model=model, event_name="RAG chatbot")
    return output

print(sync_without_stream("Tell me a joke."))
```
Synchronous monitoring example with streaming
```python
import openai
import deepeval

def sync_with_stream(user_message: str):
    model = "gpt-4-turbo"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        stream=True
    )
    full_output = ""
    for chunk in response:
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            full_output += delta

    # Run monitor() synchronously
    deepeval.monitor(input=user_message, output=full_output, model=model, event_name="RAG chatbot")

sync_with_stream("Tell me a joke.")
```
The examples above use OpenAI, but realistically the call will be whatever implementation your LLM application uses to generate the observable outputs based on the input it receives.
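For instance, here is a minimal sketch of monitoring a custom, non-OpenAI generation function. The `generate_reply()` helper and the "my-custom-model" name are hypothetical placeholders for whatever your application actually uses:

```python
import deepeval

def generate_reply(user_message: str) -> str:
    # Hypothetical placeholder for your own generation logic
    # (a self-hosted model, another provider's SDK, an agent framework, etc.)
    return "..."

def my_llm_app(user_message: str):
    output = generate_reply(user_message)

    # Monitor the observable input and output, regardless of how the output was produced
    deepeval.monitor(
        input=user_message,
        output=output,
        model="my-custom-model",  # hypothetical model name
        event_name="RAG chatbot"
    )
    return output
```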
There are four mandatory and ten optional parameters when using the `monitor()` function to monitor responses in production:

- `input`: type `str`, the input to your LLM application.
- `response`: type `str`, the final output of your LLM application.
- `event_name`: type `str`, specifying the type of response event monitored. The `event_name` can be thought of as an identifier for the different functionalities your LLM application performs.
- `model`: type `str`, specifying the name of the LLM model used.
- [Optional] `retrieval_context`: type `list[str]`, the contexts that were retrieved in your RAG pipeline.
- [Optional] `additional_data`: type `dict[str, Union[str, dict, Link]]`. See below for more details.
- [Optional] `hyperparameters`: type `dict[str, Union[str, int, float]]`. You can provide a dictionary to specify any additional hyperparameters used to generate the response.
- [Optional] `distinct_id`: type `str`, to identify end users using your LLM application.
- [Optional] `conversation_id`: type `str`, to group together multiple messages under a single conversation thread.
- [Optional] `completion_time`: type `float`, indicating how many seconds it took your LLM application to complete.
- [Optional] `token_usage`: type `float`, specifying the number of input + output tokens consumed.
- [Optional] `token_cost`: type `float`, specifying the total cost of this LLM call.
- [Optional] `fail_silently`: type `bool`. You should set this to `False` in development to check whether `monitor()` is working properly. Defaults to `False`.
- [Optional] `raise_exception`: type `bool`. Set this to `False` if you don't want exceptions to be raised in production. Defaults to `True`.
Please do NOT provide placeholder values for optional parameters, as this will pollute the data used for filtering and searching in your project. Instead, simply leave them blank.
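For example, a monitored call for a RAG chatbot might look like the sketch below, following the keyword arguments used in the code examples above. All parameter values here are purely illustrative:

```python
import deepeval

deepeval.monitor(
    input="Why did the chicken cross the road?",
    output="To get to the other side.",
    model="gpt-4-turbo",
    event_name="RAG chatbot",
    # Optional parameters - only supply the ones you actually have values for
    retrieval_context=["The chicken lived on a farm next to a busy road."],
    distinct_id="user-1234",
    conversation_id="conversation-5678",
    completion_time=2.4,
    token_usage=187,
    token_cost=0.002,
    fail_silently=False  # keep False in development to surface any monitoring errors
)
```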
The `monitor()`/`a_monitor()` method returns a `monitor_id` upon a successful API request to Confident AI's servers, which you can later use to collect human feedback regarding a particular LLM interaction you've monitored.
```python
import deepeval

monitor_id = deepeval.monitor(...)
```
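As a purely hypothetical sketch of what collecting feedback with that `monitor_id` could look like, consider the snippet below. The `send_feedback` helper and its parameter names are assumptions rather than confirmed API, so refer to the human feedback documentation for the exact call:

```python
import deepeval

monitor_id = deepeval.monitor(...)

# Hypothetical call - the helper name and parameters are assumptions,
# see the human feedback documentation for the actual API
deepeval.send_feedback(
    monitor_id=monitor_id,
    rating=5,  # e.g. a thumbs-up from an end user
    explanation="Accurate and concise answer."
)
```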
Congratulations 🎉! With a few lines of code, deepeval
now automatically logs all outputs of your LLM application in production to Confident AI.
Logging Hyperparameters
In addition to tracking which model was used to generate each output, you can log custom hyperparameters for every monitored interaction.
Using the monitor()
/a_monitor()
method to log hyperparameters enables you to recreate the exact conditions under which specific outputs were generated in Confident AI.
By applying filters based on these hyperparameters, you can quickly find outputs produced under a particular configuration.
```python
import deepeval

deepeval.monitor(
    ...,
    hyperparameters={
        "prompt template": "...",
        "temperature": 1,
        "chunk size": 500
    }
)
```
Logging Custom Properties
Similar to hyperparameters, you can easily associate custom additional data for each output, which would also allow you to filter and search for on Confident AI.
For example, if your application is a RAG pipeline and you wish to log not just the retrieval_context
but also the links of the documents these retrieval_context
belong to, here is the place to do it.
```python
import deepeval
from deepeval.monitor import Link

deepeval.monitor(
    ...,
    additional_data={
        "Example Text": "...",
        # The Link class allows you to access clickable links directly on Confident AI
        "Example Link": Link(value="https://www.youtube.com"),
        "Example list of Links": [Link(value="https://www.instagram.com")],
        "Example JSON": {"Any Key": "Any Value"}
    },
)
```
There are four types of data you can log for `additional_data`:
- A plain string.
- A `Link` object.
- A list of `Link` objects.
- A custom dictionary of any structure (as shown in the "Example JSON").
Although you can technically log a link as a native string, the Link(value="...")
class allows you to access said link directly on Confident AI. You should also aim to have your Link
s start with 'https://...'.
Logging Conversations
If you're building conversational agents or conversational systems such as LLM chatbots or voice agents, you can group together monitored outputs as sequential messages.
If you're storing conversation threads in a database, the `conversation_id` will most likely be the ID of your conversation data row.
Grouping is only possible if a `conversation_id` was supplied to `deepeval`'s `monitor()` or `a_monitor()` function during monitoring.
```python
import deepeval

deepeval.monitor(conversation_id="your-conversation-id", ...)
```
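For example, a simple chatbot loop might monitor each turn under the same thread. This is a minimal sketch: the `generate_reply()` helper and the way the conversation ID is obtained are placeholders for your own implementation.

```python
import uuid
import deepeval

def generate_reply(user_message: str) -> str:
    # Hypothetical placeholder for your chatbot's generation logic
    return "..."

def chat_session():
    # In practice this would likely be the ID of your conversation row in your database
    conversation_id = str(uuid.uuid4())

    for user_message in ["Hi!", "Tell me a joke.", "Another one, please."]:
        output = generate_reply(user_message)

        # Every monitored turn shares the same conversation_id,
        # so Confident AI groups them into a single conversation thread
        deepeval.monitor(
            input=user_message,
            output=output,
            model="gpt-4-turbo",
            event_name="RAG chatbot",
            conversation_id=conversation_id
        )

chat_session()
```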
But What About Tracing?
Confident AI also offers flexible LLM tracing which you can associate with monitored LLM interaction. To learn how to setup tracing, go to this section.
In the next section, we'll show how to enable online metrics to run evaluations on each incoming LLM output that you monitor using deepeval
's metrics.