Glossary
Category: Evaluation | Status: Emerging

Evaluation Harness

The automated test suite that validates every agent output before it reaches a human reviewer.

Definition

The Evaluation Harness (Eval Harness) is the automated test suite that runs continuously during agent execution, validating every output before it reaches a human reviewer. It combines functional tests, security scans, architectural conformance checks, and LLM-as-a-Judge evaluations into a unified quality gate. No agent-generated code is presented to a human until it passes the Eval Harness.

The Eval Harness performs two types of validation:

  1. Deterministic Validation — binary pass/fail checks based on strict rules, including the existing test suite, linter and formatter checks, security scanners, and architectural conformance rules.
  2. Probabilistic Evaluation — LLM-as-a-Judge assessments for non-deterministic quality aspects such as code readability, naming consistency, and adherence to project conventions.
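The two-stage gate above can be sketched as follows. This is a minimal illustration, not the harness's actual implementation: the check names, the `judge` scoring contract (a score in [0, 1]), and the 0.7 threshold are all assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)

def run_gate(output, deterministic_checks, judge):
    """Run binary pass/fail checks first; invoke the LLM judge only if they all pass."""
    result = GateResult(passed=True)
    # Stage 1: deterministic validation (tests, linters, scanners, conformance rules)
    for name, check in deterministic_checks:
        if not check(output):
            result.passed = False
            result.reasons.append(f"failed: {name}")
    if not result.passed:
        return result  # skip the more expensive probabilistic stage
    # Stage 2: probabilistic evaluation (LLM-as-a-Judge), assumed to return a [0, 1] score
    score = judge(output)
    if score < 0.7:  # hypothetical quality threshold
        result.passed = False
        result.reasons.append(f"judge score {score:.2f} below threshold")
    return result
```

Ordering the deterministic stage first means cheap, strict checks filter out obvious failures before any LLM call is made.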

Key operational characteristics:

  • Circuit Breakers — the harness enforces token budgets and halts execution when an agent exceeds its compute allocation for a single task.
  • Execution Traces — every evaluation run produces detailed logs for debugging and observability.
  • Escalation Triggers — when validation fails repeatedly, the harness raises a Blocker Flag that routes the task to a human operator.
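A rough sketch of how these three characteristics might fit together, assuming a per-task token budget, a failure-count escalation rule, and an in-memory trace; the class and method names are illustrative, not part of any described API.

```python
class BlockerFlag(Exception):
    """Escalation signal: routes the task to a human operator."""

class EvalHarness:
    def __init__(self, token_budget, max_failures):
        self.token_budget = token_budget  # circuit breaker: compute allocation per task
        self.max_failures = max_failures  # escalation: allowed validation failures
        self.tokens_used = 0
        self.failures = 0
        self.trace = []                   # execution trace for debugging/observability

    def spend(self, tokens):
        """Account for agent compute; trip the circuit breaker past the budget."""
        self.tokens_used += tokens
        self.trace.append(f"spent {tokens} ({self.tokens_used}/{self.token_budget})")
        if self.tokens_used > self.token_budget:
            self.trace.append("circuit breaker tripped")
            raise BlockerFlag("token budget exceeded")

    def report_failure(self, reason):
        """Record a validation failure; escalate after repeated failures."""
        self.failures += 1
        self.trace.append(f"validation failed: {reason}")
        if self.failures >= self.max_failures:
            self.trace.append("escalating to human operator")
            raise BlockerFlag("repeated validation failures")
```

Every event appends to `trace` before any exception is raised, so the log is complete even when a run is halted.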

The Eval Harness is the primary automated quality gate in agentic workflows, sitting between agent execution and human review.

Last updated: 3/11/2026