Glossary
Evaluation · Emerging
Evaluation Harness
The automated test suite that validates every agent output before it reaches a human reviewer.
Definition
The Evaluation Harness (Eval Harness) is the automated test suite that runs continuously during agent execution, validating every output before it reaches a human reviewer. It combines functional tests, security scans, architectural conformance checks, and LLM-as-a-Judge evaluations into a unified quality gate. No agent-generated code is presented to a human until it passes the Eval Harness.
The Eval Harness performs two types of validation:
- Deterministic Validation — binary pass/fail checks based on strict rules, including the existing test suite, linter and formatter checks, security scanners, and architectural conformance rules.
- Probabilistic Evaluation — LLM-as-a-Judge assessments for non-deterministic quality aspects such as code readability, naming consistency, and adherence to project conventions.
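The two-stage gate above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source: the names `run_quality_gate`, `GateResult`, the stub checks, and the 0.8 judge threshold are all hypothetical. The key ordering choice is real, though: cheap deterministic checks run first, and the expensive LLM-as-a-Judge call is only made once every binary check has passed.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)

def run_quality_gate(
    output: str,
    deterministic_checks: dict[str, Callable[[str], bool]],
    judge: Callable[[str], float],
    judge_threshold: float = 0.8,  # hypothetical cutoff for the judge score
) -> GateResult:
    """Deterministic validation first; invoke the (costly, probabilistic)
    LLM judge only if every binary pass/fail check succeeds."""
    failures = [name for name, check in deterministic_checks.items()
                if not check(output)]
    if failures:
        return GateResult(False, failures)
    score = judge(output)  # LLM-as-a-Judge, assumed to return 0.0-1.0
    if score < judge_threshold:
        return GateResult(False, [f"judge score {score:.2f} below threshold"])
    return GateResult(True)

# Stub checks standing in for a real test suite, linter, and security scanner
checks = {
    "tests": lambda out: "FAIL" not in out,
    "lint": lambda out: len(out.splitlines()) < 1000,
}
result = run_quality_gate("all tests passed", checks, judge=lambda out: 0.9)
print(result.passed)  # True
```

Running deterministic checks first keeps the gate cheap on obvious failures and reserves judge tokens for outputs that are already mechanically sound.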
Key operational characteristics:
- Circuit Breakers — the harness enforces token budgets and halts execution when an agent exceeds its compute allocation for a single task.
- Execution Traces — every evaluation run produces detailed logs for debugging and observability.
- Escalation Triggers — when validation fails repeatedly, the harness raises a Blocker Flag that routes the task to a human operator.
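The three operational characteristics can be combined into one small state object. The sketch below is an assumption-laden illustration: the class name `CircuitBreaker`, the exceptions `TokenBudgetExceeded` and `BlockerFlag`, and the specific counters are all invented for this example. It shows how a token budget halts a runaway task, how repeated validation failures trip an escalation to a human operator, and how every event lands in an execution trace for observability.

```python
class TokenBudgetExceeded(Exception):
    """Circuit breaker: the agent exceeded its compute allocation."""

class BlockerFlag(Exception):
    """Escalation trigger: route the task to a human operator."""

class CircuitBreaker:
    def __init__(self, token_budget: int, max_failures: int):
        self.token_budget = token_budget
        self.tokens_used = 0
        self.max_failures = max_failures
        self.failures = 0
        self.trace: list[str] = []  # execution trace for debugging

    def record_tokens(self, n: int) -> None:
        self.tokens_used += n
        self.trace.append(f"tokens: {self.tokens_used}/{self.token_budget}")
        if self.tokens_used > self.token_budget:
            raise TokenBudgetExceeded(
                f"{self.tokens_used} tokens used, budget {self.token_budget}")

    def record_failure(self, reason: str) -> None:
        self.failures += 1
        self.trace.append(f"validation failed: {reason}")
        if self.failures >= self.max_failures:
            raise BlockerFlag(
                f"{self.failures} consecutive failures: escalating to human")

# Usage: the harness wraps each agent attempt in the breaker
cb = CircuitBreaker(token_budget=100, max_failures=3)
cb.record_tokens(60)
cb.record_failure("lint errors")
print(len(cb.trace))  # 2
```

In a real harness the caller would catch `TokenBudgetExceeded` to halt execution and `BlockerFlag` to hand the task, with its trace attached, to a human reviewer.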
The Eval Harness is the primary automated quality gate in agentic workflows, sitting between agent execution and human review.