Guardrails

Guardrails are safety mechanisms, implemented as rules, classifiers, or secondary models, that monitor and filter the inputs to and outputs from a large language model in real-time. They act as a protective layer between users and the core model, preventing harmful, off-topic, or policy-violating content from being processed or returned.

Key characteristics of guardrails include:

Input Filtering: Guardrails scan incoming user messages for prompt injection attempts, harmful requests, personally identifiable information, or out-of-scope queries before they reach the main model.
Output Validation: After the model generates a response, guardrails check for harmful content, hallucinated claims, policy violations, or sensitive data leaks before the response is delivered to the user.
Programmable Rules: Developers define guardrail policies using natural language rules, regular expressions, or structured configuration. Frameworks like NVIDIA NeMo Guardrails allow specifying conversational boundaries declaratively.
Lightweight Classifiers: Many guardrail systems use small, fast classifier models trained to detect specific categories of unsafe content, running in parallel with the main model to minimize latency impact.
Defense in Depth: Guardrails complement rather than replace model-level safety training. Production systems typically layer multiple guardrail checks alongside RLHF alignment and system prompt constraints for robust protection.

Teams often validate guardrail effectiveness through an Eval Harness, running adversarial test suites to verify that filters catch unsafe inputs and outputs consistently.

Guardrails

Definition