LLM Guardrails

Confident AI enables you to safeguard your LLM applications against malicious inputs and outputs with just one line of code. Each guardrail in our comprehensive suite acts as a binary metric that evaluates user inputs and/or LLM responses for malicious intent and unsafe behavior.

info

Confident AI offers 10+ guards designed to detect 20+ LLM vulnerabilities.

Types of Guardrails

There are two types of guardrails: input guardrails, which protect against malicious inputs before they reach your LLM, and output guardrails, which evaluate the LLM's responses before they reach your users.

The number of guards you choose to set up and whether you decide to utilize both types of guards depends on your priorities regarding latency, cost, and LLM safety.

tip

While most guards handle either input or output guarding only, some guards, such as the CybersecurityGuard, offer both input and output guarding capabilities.

List of Guards

Confident AI offers a robust selection of input and output guards for comprehensive protection:

Input Guards

Input Guard | Description
CybersecurityGuard | Detects cybersecurity attacks such as SQL injection, shell command injection, or role-based access control (RBAC) violations.
PromptInjectionGuard | Detects prompt injection attacks where secondary commands are embedded within user inputs to manipulate the system's behavior.
JailbreakingGuard | Detects jailbreaking attacks disguised as safe inputs, which attempt to bypass system restrictions or ethical guidelines.
PrivacyGuard | Detects inputs that leak private information such as personally identifiable information (PII), API keys, or sensitive credentials.
TopicalGuard | Detects inputs that are irrelevant to the context or violate the defined topics or usage boundaries.

Output Guards

Output Guard | Description
GraphicContentGuard | Detects outputs containing explicit, violent, or graphic content to ensure safe and appropriate responses.
HallucinationGuard | Identifies outputs that include factually inaccurate or fabricated information generated by the model.
IllegalGuard | Detects outputs that promote, support, or facilitate illegal activities or behavior.
ModernizationGuard | Detects outputs containing outdated information that does not reflect modern standards or current knowledge.
SyntaxGuard | Detects outputs that are syntactically incorrect or do not adhere to proper grammatical standards, ensuring clear communication.
ToxicityGuard | Detects harmful, offensive, or toxic outputs to maintain respectful and safe interactions.
CybersecurityGuard | Detects outputs that may have been breached or manipulated as a result of a cyberattack.

Custom Guards

Most guards on Confident AI are custom guards. This is because guardrails have strict accuracy and speed requirements, and most use cases therefore require guards with computations tailored to them.

note

Getting your custom guard created typically takes less than a week. Reach out to your contact at Confident AI or email support@confident-ai.com to get started.

Guarding your LLM Application

To begin guarding your LLM application, simply import and initialize the guards you desire from deepeval.guardrails and pass them to an instantiated Guardrails object. Then, call the guard_input() or guard_output() method with the user input and/or LLM response.

from deepeval.guardrails import Guardrails
from deepeval.guardrails import HallucinationGuard, TopicalGuard

# Initialize your guards
guardrails = Guardrails(
    guards=[
        HallucinationGuard(),
        TopicalGuard(allowed_topics=["health and technology"])
    ]
)

Guard Your First Input

Guarding against malicious user inputs is a common requirement, as it is the best way to avoid wasting tokens generating text you shouldn't be generating in the first place. To guard an input, simply call the guard_input() method and supply the input.

...

# Example input to guard against
input = "Is the earth flat"

guard_result = guardrails.guard_input(input=input)
# Reject if true
print(guard_result.breached)
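
For example, here's a minimal sketch of acting on the result before any tokens are spent; the generate() function below is a hypothetical stand-in for your own LLM call, not part of deepeval:

...

def generate(prompt: str) -> str:
    # Hypothetical stand-in for your own LLM call
    ...

guard_result = guardrails.guard_input(input=input)

if guard_result.breached:
    # Reject the request without ever calling the LLM
    response = "Sorry, I can't help with that."
else:
    response = generate(input)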

Guard Your First Output

To ensure that an unsatisfactory output never reaches an end user, you'll want to guard against generated outputs and re-generate them until satisfactory.

...

# Example input and output to guard against
input = "Is the earth flat"
output = "I bet it is"

guard_result = guardrails.guard_output(input=input, output=output)
# Re-generate if true
print(guard_result.breached)

note

When using the guard_output() method, you must provide both the input and the LLM output.
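
Putting this together with the re-generation flow described above, here is a rough sketch of a retry loop, again assuming a hypothetical generate() function that wraps your own LLM call, with a cap on retries so a persistently breached output doesn't loop forever:

...

output = generate(input)

# Re-generate up to 3 times if the output breaches any guard
for _ in range(3):
    guard_result = guardrails.guard_output(input=input, output=output)
    if not guard_result.breached:
        break
    output = generate(input)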

Running Guardrails Async

Your Guardrails also supports asynchronous execution through the a_guard_input() and a_guard_output() methods.

...

guard_result = await guardrails.a_guard_output(input, output)
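
For example, here is a minimal sketch that guards several (input, output) pairs concurrently with asyncio.gather; the example pairs are made up purely for illustration:

...

import asyncio

async def guard_all(pairs):
    # Guard each (input, output) pair concurrently
    return await asyncio.gather(
        *(guardrails.a_guard_output(i, o) for i, o in pairs)
    )

pairs = [
    ("Is the earth flat", "I bet it is"),
    ("Is the earth flat", "No, the earth is roughly spherical."),
]
results = asyncio.run(guard_all(pairs))
print([result.breached for result in results])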

Interpreting Guard Results

The guard_input() and guard_output() methods return a GuardResult object, which can be used to retry LLM response generations when necessary.

from typing import List
from pydantic import BaseModel

class GuardResult(BaseModel):
    breached: bool
    guard_data: List[GuardData]

info

The breached property is true if any guard has failed. Detailed scores and breakdowns for each guard are available in the guard_data list. A score of 1 indicates the guard was breached, while 0 means it passed.
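
For example, here is a minimal sketch of branching on a GuardResult; the exact fields on each GuardData entry are not documented here, so the loop simply prints each entry for inspection:

...

guard_result = guardrails.guard_output(input=input, output=output)

if guard_result.breached:
    # Inspect per-guard details, e.g. which guard failed and its 0/1 score
    for data in guard_result.guard_data:
        print(data)
    # Re-generate the output or return a fallback response instead
else:
    # Safe to return the output to the end user
    print(output)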