
Improving Datasets

In the previous sections, we explored how to curate a dataset and later use it for evaluation. However, the quality of your LLM evaluation and benchmarking results is only as good as the quality of your datasets (and metrics). To ensure meaningful insights, it's essential to have measures and standards in place to improve your datasets over time.

Confident AI offers a range of tools to help you improve your dataset over time, including by incorporating real-world production data. These methods include:

  • Continuous editing and annotation – domain experts can manually refine existing goldens, leave comments, and make any necessary adjustments.
  • Generate new goldens from existing ones – expand your dataset by synthesizing new goldens based on existing examples, helping to cover more edge cases and variations.
  • Identify the most challenging tasks your LLM app faces in production – automatically surface the most difficult real-world cases where your app struggles and incorporate them into your dataset as goldens.
tip

As your LLM application evolves—whether through real-world usage or shifting user demands—your dataset should adapt accordingly to maintain relevant and comprehensive test coverage. The goal isn't just to pass tests or generate benchmarks for the sake of it, but to create a dataset that genuinely enhances your development cycle. A well-maintained dataset ensures that your evaluation process remains robust, helping you identify gaps, handle edge cases, and improve your model's performance in the areas that matter most.

Continuous Annotation

Domain experts (which can also be yourself) can edit goldens and optionally keep track of edit/revision history on Confident AI's Dataset Editor page. The editor also gives annotators the option to leave comments on individual goldens.

note

Leaving comments is especially valuable for domain-specific use cases, such as medical and legal applications, as they allow experts to provide engineers with additional context. This helps engineers understand the purpose of a golden, its importance as a test case, and what improvements are needed to ensure their application succeeds.

tip

You can also mark a golden as "not ready". A golden that is not ready for evaluation will not be pulled from the cloud and therefore will not affect your benchmarking results.
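For example, when you pull a dataset for benchmarking with deepeval, only goldens marked as ready are included. Below is a minimal sketch; the alias "My Dataset" is a placeholder for your own dataset's alias.

```python
from deepeval.dataset import EvaluationDataset

# Pull the dataset from Confident AI by its alias. Goldens marked
# "not ready" on the platform are excluded from the pull, so they
# won't affect your benchmarking results.
dataset = EvaluationDataset()
dataset.pull(alias="My Dataset")  # replace with your dataset's alias

print(f"Pulled {len(dataset.goldens)} ready goldens")
```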

Generating New Goldens

Sorry to disappoint, but this feature is currently in beta. If you wish to have your dataset improved using this method, please send an email to support@confident-ai.com with your project ID and dataset alias. The new goldens will be sent to you for review within 48 hours.

Using Production Data

The best goldens are typically the most realistic ones from production data you monitor. When Confident AI monitors your LLM application in production, it automatically flags underperforming LLM outputs using metrics from deepeval.

info

You'll have to set up LLM monitoring and enable online metrics (for LLM outputs in production) for this to work. The whole process takes less than 15 minutes.

This makes it easy to identify where your LLM app struggles and decide whether to add those instances to your dataset.
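If you haven't set up monitoring yet, the sketch below gives a rough idea of what logging a single production interaction with deepeval looks like. The parameter values are illustrative (a hypothetical RAG chatbot), and the exact set of supported arguments may vary by deepeval version, so treat this as a sketch and check the monitoring docs for the signature your version uses.

```python
import deepeval

# Illustrative example: log one production interaction to Confident AI
# so that online metrics (e.g. contextual relevancy) can be computed on it.
deepeval.monitor(
    event_name="RAG chatbot",            # name used to group events on Confident AI
    model="gpt-4o",                       # the model serving your LLM app
    input="What does my policy cover?",   # the end user's input
    response="Your policy covers ...",    # your LLM app's final output
    retrieval_context=["Policy section 3.2: ..."],  # optional, needed for RAG metrics
)
```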

tip

The video above shows that by using Confident AI to monitor your LLM application, you can filter for all unsatisfactory LLM outputs based on failing contextual relevancy metric scores and add them to an existing dataset in a few clicks.