Eval Harness Configuration Template
A template for defining evaluation criteria, quality gates, and pass/fail thresholds for agent output validation.
Overview
The Eval Harness is the automated validation pipeline that gates agent output before it reaches human reviewers. This template defines the configuration for an eval harness instance: which checks to run, what thresholds to set, when to trigger Human-in-the-Loop review, and how to handle failures. Each feature or task type gets its own eval harness configuration, referenced from its Live Spec.
The eval harness combines two categories of checks: deterministic gates (linting, testing, security scanning, architectural conformance) that produce binary pass/fail results, and probabilistic gates (LLM-as-a-Judge assessments) that score non-deterministic quality dimensions like readability and naming consistency. Together, these gates catch both objective errors and subjective quality issues before a human spends time reviewing.
This template is used by the Evaluation Engineer when setting up validation for a new feature area, and by the Context Architect when defining the evaluation section of a Live Spec.
When to Use
Use this template when:
- Setting up an eval harness for a new feature area or task type
- Defining quality gates for agent-generated code in a CI/CD pipeline
- A Rescue Mission revealed that the existing eval harness missed an important quality dimension
- The team wants to add LLM-as-a-Judge evaluation alongside deterministic checks
- Onboarding a new agent workflow that needs quality validation
Before configuring an eval harness, ensure:
- The feature has a reviewed Live Spec with testable acceptance criteria
- The project has a CI/CD pipeline where gate checks can run
- Golden Samples exist for the feature area (needed for probabilistic evaluation baselines)
- The team has agreed on pass/fail thresholds (start permissive and tighten over time)
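Once the prerequisites above are met, a configuration instance might look like the following sketch. Every key name, gate, threshold, and path here is a hypothetical example chosen to illustrate the shape of the template, not a prescribed schema:

```python
# Illustrative eval harness configuration. All names and values below
# are assumptions for the sake of example, not a fixed schema.
EVAL_HARNESS_CONFIG = {
    "feature_area": "checkout-flow",            # hypothetical feature name
    "deterministic_gates": [
        {"name": "lint", "required": True},
        {"name": "unit_tests", "required": True},
        {"name": "security_scan", "required": True},
    ],
    "probabilistic_gates": [
        # LLM-as-a-Judge dimensions scored in [0, 1]. Thresholds start
        # permissive and are tightened as Golden Sample baselines stabilize.
        {"dimension": "readability", "threshold": 0.6},
        {"dimension": "naming_consistency", "threshold": 0.6},
    ],
    "on_failure": "escalate_to_hitl",           # route to Human-in-the-Loop review
    "golden_samples_dir": "evals/golden/checkout-flow/",  # hypothetical path
}
```

Keeping the configuration as data rather than code makes it easy to reference from a Live Spec and to diff when thresholds are tightened over time.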