Glossary
Category: Evaluation | Status: Emerging

Evaluation Harness

The automated test suite that validates every agent output before it reaches a human reviewer.

Definition

The Evaluation Harness (Eval Harness) is the automated test suite that runs continuously during agent execution, validating every output before it reaches a human reviewer. It combines functional tests, security scans, architectural conformance checks, and LLM-as-a-Judge evaluations into a unified quality gate. No agent-generated code is presented to a human until it passes the Eval Harness.

The Eval Harness performs two types of validation:

  1. Deterministic Validation — binary pass/fail checks based on strict rules, including the existing test suite, linter and formatter checks, security scanners, and architectural conformance rules.
  2. Probabilistic Evaluation — LLM-as-a-Judge assessments for non-deterministic quality aspects such as code readability, naming consistency, and adherence to project conventions.
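The two-stage gate above can be sketched as follows. This is a minimal illustration, not the harness's actual implementation: the check names, the `judge` scoring contract (a score in [0, 1]), and the 0.7 threshold are all assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)

def run_gate(output, deterministic_checks, judge):
    """Run binary pass/fail checks first; invoke the LLM judge only if they all pass."""
    result = GateResult(passed=True)
    # Stage 1: deterministic validation (tests, linters, scanners, conformance rules)
    for name, check in deterministic_checks:
        if not check(output):
            result.passed = False
            result.reasons.append(f"failed: {name}")
    if not result.passed:
        return result  # skip the more expensive probabilistic stage
    # Stage 2: probabilistic evaluation (LLM-as-a-Judge), assumed to return a [0, 1] score
    score = judge(output)
    if score < 0.7:  # hypothetical quality threshold
        result.passed = False
        result.reasons.append(f"judge score {score:.2f} below threshold")
    return result
```

Ordering the deterministic stage first means cheap, strict checks filter out obvious failures before any LLM call is made.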

Key operational characteristics:

  • Circuit Breakers — the harness enforces token budgets and halts execution when an agent exceeds its compute allocation for a single task.
  • Execution Traces — every evaluation run produces detailed logs for debugging and observability.
  • Escalation Triggers — when validation fails repeatedly, the harness raises a Blocker Flag that routes the task to a human operator.
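A rough sketch of how these three characteristics might fit together, assuming a per-task token budget, a failure-count escalation rule, and an in-memory trace; the class and method names are illustrative, not part of any described API.

```python
class BlockerFlag(Exception):
    """Escalation signal: routes the task to a human operator."""

class EvalHarness:
    def __init__(self, token_budget, max_failures):
        self.token_budget = token_budget  # circuit breaker: compute allocation per task
        self.max_failures = max_failures  # escalation: allowed validation failures
        self.tokens_used = 0
        self.failures = 0
        self.trace = []                   # execution trace for debugging/observability

    def spend(self, tokens):
        """Account for agent compute; trip the circuit breaker past the budget."""
        self.tokens_used += tokens
        self.trace.append(f"spent {tokens} ({self.tokens_used}/{self.token_budget})")
        if self.tokens_used > self.token_budget:
            self.trace.append("circuit breaker tripped")
            raise BlockerFlag("token budget exceeded")

    def report_failure(self, reason):
        """Record a validation failure; escalate after repeated failures."""
        self.failures += 1
        self.trace.append(f"validation failed: {reason}")
        if self.failures >= self.max_failures:
            self.trace.append("escalating to human operator")
            raise BlockerFlag("repeated validation failures")
```

Every event appends to `trace` before any exception is raised, so the log is complete even when a run is halted.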

The Eval Harness is the primary automated quality gate in agentic workflows, sitting between agent execution and human review.

Last updated: 3/11/2026