Chroma
Quick Summary
Chroma is one of the most popular open-source AI application databases. It supports many retrieval features, such as embedding storage, vector search, document storage, metadata filtering, and multi-modal retrieval.
DeepEval allows you to easily evaluate and optimize your Chroma retriever by tuning hyperparameters like n_results (more commonly known as top-K) and the embedding model used in your Chroma retrieval pipeline.
To get started, install Chroma through the CLI using the following command:
pip install chromadb
To learn more about using Chroma for your RAG pipeline, visit this page. The diagram below illustrates how you can utilize Chroma as the entire retrieval pipeline for your LLM application.
Setup Chroma
To get started with Chroma, initialize a persistent client and create a collection to store your documents. The collection acts as a vector database for storing and retrieving embeddings, while the persistent client ensures data is retained across sessions.
import chromadb
# Initialize Chroma client
client = chromadb.PersistentClient(path="./chroma_db")
# Create or load a collection
collection = client.get_or_create_collection(name="rag_documents")
Next, define an embedding model (we'll use sentence_transformers) to convert your document chunks into vectors before adding them to your Chroma collection, storing the original chunk text as metadata.
...
# Load an embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Example document chunks
document_chunks = [
    "Chroma is an open-source vector database for efficient embedding retrieval.",
    "It enables fast semantic search using vector similarity.",
    "Chroma retrieves relevant data with cosine similarity.",
    ...
]

# Store chunks with embeddings in Chroma
for i, chunk in enumerate(document_chunks):
    embedding = model.encode(chunk).tolist()  # Convert text to vector
    collection.add(
        ids=[str(i)],                 # Unique ID for each document chunk
        embeddings=[embedding],       # Vector representation
        metadatas=[{"text": chunk}]   # Store original text as metadata
    )
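As a quick sanity check, you can confirm the chunks were indexed by counting the items in the collection (count() is part of Chroma's collection API):

print(collection.count())  # Should match the number of document chunks added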
You'll be querying this Chroma collection during generation to retrieve relevant contexts based on the user input, before passing them along with the input into your LLM's prompt template. Chroma ranks chunks by squared L2 distance by default; to use cosine similarity instead, set the collection's hnsw:space metadata when the collection is first created.
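For example, here's a minimal sketch of creating the collection with cosine similarity instead of the default distance function (this replaces the get_or_create_collection call above, since the distance function must be chosen when the collection is first created):

# Create a collection that ranks results by cosine similarity
collection = client.get_or_create_collection(
    name="rag_documents",
    metadata={"hnsw:space": "cosine"}  # "l2" (default), "ip", or "cosine"
)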
Evaluating Chroma Retrieval
To evaluate your Chroma retriever, you'll first need to prepare an input query and generate a response from your RAG pipeline in order to create an LLMTestCase. You'll also need to extract the contexts retrieved from your Chroma collection during generation and prepare the expected LLM response to complete the LLMTestCase.
By default, input and actual_output are required for all metrics. However, retrieval_context, context, and expected_output are optional, and different metrics may or may not require additional parameters. To check the specific requirements, visit the metrics section.
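For reference, here's a minimal sketch of an LLMTestCase containing only the required parameters, with the optional retrieval parameters shown commented out (the string values are placeholders):

from deepeval.test_case import LLMTestCase

# input and actual_output are required for every metric
minimal_test_case = LLMTestCase(
    input="How does Chroma work?",
    actual_output="Chroma stores embeddings and retrieves them by vector similarity.",
    # retrieval_context=["..."],  # optional: needed by the contextual (retriever) metrics
    # expected_output="...",      # optional: needed by contextual precision and recall
)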
After you've prepared your LLMTestCase, evaluating your Chroma retriever is as easy as passing the test case, along with your selection of metrics, into DeepEval's evaluate function.
Preparing your Test Case
To prepare our test case, we'll use "How does Chroma work?" as our input. Before generating a response from your RAG pipeline, you'll first need to retrieve the relevant context using a search function. The search function in the example below embeds the input query, then retrieves the three most relevant text chunks (n_results=3) from our Chroma collection.
...
def search(query):
    query_embedding = model.encode(query).tolist()
    res = collection.query(
        query_embeddings=[query_embedding],
        n_results=3  # Retrieve top-K matches
    )
    # Return the text of every retrieved chunk as a list of strings
    return [metadata["text"] for metadata in res["metadatas"][0]]

query = "How does Chroma work?"
retrieval_context = search(query)
Next, we'll pass the retrieved context from our Chroma collection into the LLM's prompt template to generate the final response.
...
prompt = """
Answer the user question based on the supporting context.

User Question:
{input}

Supporting Context:
{retrieval_context}
"""

# Fill in the template with the user input and the retrieved chunks
prompt = prompt.format(input=query, retrieval_context="\n".join(retrieval_context))

# The expected ideal answer we've prepared for this input
expected_output = "Chroma is an open-source vector database that enables fast retrieval using cosine similarity."

actual_output = generate(prompt)  # Replace with your LLM generation function
print(actual_output)
print(expected_output)
Printing the actual_output generated by our RAG pipeline yields the following example:
Chroma is a lightweight vector database designed for AI applications, enabling fast semantic retrieval.
Let's compare this to the expected_output we've prepared:
Chroma is an open-source vector database that enables fast retrieval using cosine similarity.
With all the elements ready, we'll create an LLMTestCase by providing the input and expected output, along with the actual output and retrieved context.
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output,
    retrieval_context=retrieval_context,
    expected_output=expected_output
)
Running Evaluations
To begin running evaluations, we'll need to define metrics relevant to our Chroma retriever. These include ContextualRecallMetric, ContextualPrecisionMetric, and ContextualRelevancyMetric, which specifically evaluate RAG retrievers.
To learn more about how these metrics are calculated and why they're relevant to retrievers, visit the individual metric pages.
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)

contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
To run evaluations, simply pass the test case you've prepared into the evaluate function, along with the retriever metrics you defined.
from deepeval import evaluate
...
evaluate(
    [test_case],
    metrics=[contextual_recall, contextual_precision, contextual_relevancy]
)
Improving Chroma Retrieval
Hypothetically, we've run multiple inputs and prepared several test cases, consistently observing that the Contextual Relevancy score is below the required threshold.
| Input | Contextual Relevancy Score | Contextual Recall Score |
|---|---|---|
| "How does Chroma work?" | 0.45 | 0.85 |
| "What is the retrieval process in Chroma?" | 0.43 | 0.92 |
| "Explain Chroma's vector database." | 0.55 | 0.67 |
This suggests that you may need to adjust the length of each document chunk or tweak n_results to retrieve more relevant contexts from your Chroma collection, since Contextual Relevancy is affected by both the content of the retrieved text chunks and the top-K selection.
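As a minimal sketch of re-chunking, you could re-index your documents as smaller chunks before querying again (the split_into_chunks helper and its max_words value below are illustrative assumptions, not part of Chroma's API):

# Hypothetical raw documents you want to index
documents = [
    "Chroma is an open-source vector database for efficient embedding retrieval. ...",
]

def split_into_chunks(text, max_words=50):
    # Naive word-based splitter; swap in your preferred chunking strategy
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Re-index each document as smaller chunks so each retrieved context stays focused
for doc_id, doc in enumerate(documents):
    for chunk_id, chunk in enumerate(split_into_chunks(doc)):
        collection.add(
            ids=[f"{doc_id}-{chunk_id}"],
            embeddings=[model.encode(chunk).tolist()],
            metadatas=[{"text": chunk}],
        )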
If you're curious about which metrics evaluate which specific retrieval parameters, check out this guide.
Depending on the failing scores in your retriever, you'll want to experiment with different parameters (e.g., n_results, the embedding model, etc.) in your Chroma retrieval pipeline until you're satisfied with the results. This can be as simple as writing a for loop that runs evaluations multiple times:
...
def search(query, n_results):
    query_embedding = model.encode(query).tolist()
    res = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results  # Retrieve top-K matches
    )
    # Return the text of every retrieved chunk as a list of strings
    return [metadata["text"] for metadata in res["metadatas"][0]]
# Define input and expected output
...

# Iterate over different top-K values
for top_k in [3, 5, 7]:
    retrieval_context = search(input_query, top_k)

    # Define test case
    ...

    # Evaluate the retrieval quality
    evaluate(
        [test_case],
        metrics=[contextual_recall, contextual_precision, contextual_relevancy]
    )
If you need a systematic way to analyze your retriever and compare the effects of changing Chroma hyperparameters side by side, you'll want to log in to Confident AI.
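As a sketch, assuming you've connected via the deepeval login CLI command and your DeepEval version supports the hyperparameters argument of evaluate, you can tag each run with the hyperparameters you're experimenting with so they can be compared side by side on Confident AI (the dictionary keys below are arbitrary labels):

from deepeval import evaluate

evaluate(
    [test_case],
    metrics=[contextual_recall, contextual_precision, contextual_relevancy],
    hyperparameters={
        "n_results": top_k,                     # top-K value used for this run
        "embedding model": "all-MiniLM-L6-v2",  # embedding model used by the retriever
    },
)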