Eval Harness Configuration Template
A template for defining evaluation criteria, quality gates, and pass/fail thresholds for agent output validation.
Overview
The Eval Harness is the automated validation pipeline that gates agent output before it reaches human reviewers. This template defines the configuration for an eval harness instance: which checks to run, what thresholds to set, when to trigger Human-in-the-Loop review, and how to handle failures. Each feature or task type gets its own eval harness configuration, referenced from its Live Spec.
The eval harness combines two categories of checks: deterministic gates (linting, testing, security scanning, architectural conformance) that produce binary pass/fail results, and probabilistic gates (LLM-as-a-Judge assessments) that score non-deterministic quality dimensions like readability and naming consistency. Together, these gates catch both objective errors and subjective quality issues before a human spends time reviewing.
This template is used by the Evaluation Engineer when setting up validation for a new feature area, and by the Context Architect when defining the evaluation section of a Live Spec.
When to Use
Use this template when:
- Setting up an eval harness for a new feature area or task type
- Defining quality gates for agent-generated code in a CI/CD pipeline
- A Rescue Mission revealed that the existing eval harness missed an important quality dimension
- The team wants to add LLM-as-a-Judge evaluation alongside deterministic checks
- Onboarding a new agent workflow that needs quality validation
Before configuring an eval harness, ensure:
- The feature has a reviewed Live Spec with testable acceptance criteria
- The project has a CI/CD pipeline where gate checks can run
- Golden Samples exist for the feature area (needed for probabilistic evaluation baselines)
- The team has agreed on pass/fail thresholds (start permissive and tighten over time)
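Once the prerequisites above are met, a configuration instance might look like the following sketch. Every key name, gate, threshold, and path here is a hypothetical example chosen to illustrate the shape of the template, not a prescribed schema:

```python
# Illustrative eval harness configuration. All names and values below
# are assumptions for the sake of example, not a fixed schema.
EVAL_HARNESS_CONFIG = {
    "feature_area": "checkout-flow",            # hypothetical feature name
    "deterministic_gates": [
        {"name": "lint", "required": True},
        {"name": "unit_tests", "required": True},
        {"name": "security_scan", "required": True},
    ],
    "probabilistic_gates": [
        # LLM-as-a-Judge dimensions scored in [0, 1]. Thresholds start
        # permissive and are tightened as Golden Sample baselines stabilize.
        {"dimension": "readability", "threshold": 0.6},
        {"dimension": "naming_consistency", "threshold": 0.6},
    ],
    "on_failure": "escalate_to_hitl",           # route to Human-in-the-Loop review
    "golden_samples_dir": "evals/golden/checkout-flow/",  # hypothetical path
}
```

Keeping the configuration as data rather than code makes it easy to reference from a Live Spec and to diff when thresholds are tightened over time.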