Glossary
Agent Architecture · Emerging

Evaluation Engineer

The role that shifts QA from finding bugs to designing the constraints that prevent them: building evaluation harnesses, test environments, and LLM-as-a-Judge rubrics.

Definition

The Evaluation Engineer is the role responsible for designing the constraints, test environments, and evaluation rubrics that validate agent-generated output before it reaches human reviewers. This represents the most significant role transformation in agentic teams: QA shifts from finding bugs after implementation to defining the rules that prevent bugs during implementation.

The traditional QA Engineer writes test cases after code is written and reports defects for developers to fix. The Evaluation Engineer inverts this sequence — building the Eval Harness before the agent starts work, so that agent output is continuously validated against predefined criteria throughout execution.
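The harness idea can be sketched as a set of constraint checks defined before the agent runs, applied automatically to whatever the agent produces. This is a minimal illustration, not a real harness: the check names (`no_print_statements`, `has_type_hints`) and the `CheckResult` structure are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

# Hypothetical constraint checks, written *before* the agent starts work.
def no_print_statements(code: str) -> CheckResult:
    passed = "print(" not in code
    return CheckResult("no_print_statements", passed,
                       "" if passed else "found a debugging print call")

def has_type_hints(code: str) -> CheckResult:
    passed = "->" in code
    return CheckResult("has_type_hints", passed,
                       "" if passed else "no return annotations found")

CHECKS: list[Callable[[str], CheckResult]] = [no_print_statements, has_type_hints]

def run_harness(agent_output: str) -> list[CheckResult]:
    """Validate agent output against every predefined check."""
    return [check(agent_output) for check in CHECKS]

results = run_harness("def add(a: int, b: int) -> int:\n    return a + b\n")
assert all(r.passed for r in results)
```

Because the checks exist up front, each agent iteration can be validated immediately rather than queued for a human pass.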

Core duties include:

  1. Building Dockerized test environments — creating isolated, reproducible execution environments where agent-generated code can be tested without risk to production systems. These environments must spin up quickly and tear down cleanly to support the high throughput of agentic pipelines.
  2. Writing test cases before agent execution — defining acceptance tests, integration tests, and constraint checks that the Eval Harness runs automatically as agents produce output. These tests are the primary quality gate in the pipeline.
  3. Developing LLM-as-a-Judge rubrics — authoring structured evaluation criteria that a secondary LLM uses to assess agent output on dimensions that automated tests cannot capture, such as code readability, naming consistency, and adherence to Golden Samples. See LLM-as-a-Judge for details on this evaluation approach.
  4. Maintaining Golden Samples — collaborating with the Principal Systems Architect to keep Golden Samples current as codebase patterns evolve.
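An LLM-as-a-Judge rubric (duty 3) is typically structured data rather than free text: each criterion pairs an instruction for the judge model with a weight for aggregation. The sketch below assumes a simple weighted-average scheme; the criterion names, prompts, and weights are illustrative, not a prescribed rubric.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RubricCriterion:
    name: str
    prompt: str    # instruction shown to the judge model
    weight: float  # relative importance; weights should sum to 1.0

# Hypothetical rubric covering dimensions automated tests cannot capture.
RUBRIC = [
    RubricCriterion("readability", "Score 1-5: is the code easy to follow?", 0.4),
    RubricCriterion("naming", "Score 1-5: are identifiers consistent with the codebase?", 0.3),
    RubricCriterion("golden_sample_adherence",
                    "Score 1-5: does the output follow the relevant Golden Sample?", 0.3),
]

def weighted_score(judge_scores: dict[str, int]) -> float:
    """Aggregate per-criterion judge scores (1-5) into a single 1-5 value."""
    return sum(c.weight * judge_scores[c.name] for c in RUBRIC)

score = weighted_score({"readability": 5, "naming": 4, "golden_sample_adherence": 3})
# 0.4*5 + 0.3*4 + 0.3*3 = 4.1
assert abs(score - 4.1) < 1e-9
```

Keeping the rubric in code makes it versionable alongside the Golden Samples it references, so both evolve together.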

Key skills include Python and TypeScript proficiency (the primary languages for test tooling), containerization (Docker and microVM orchestration), and statistical analysis (interpreting evaluation metrics such as Architectural Violation Rate and Pattern Consistency Score to identify systemic quality trends).
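The statistical-analysis skill amounts to turning per-task evaluation records into trend metrics. A minimal sketch, assuming each task record carries a violation count and a 0-1 judge consistency score (the field names and record shape are hypothetical):

```python
def architectural_violation_rate(task_results: list[dict]) -> float:
    """Fraction of tasks flagged with at least one architectural violation."""
    if not task_results:
        return 0.0
    flagged = sum(1 for t in task_results if t["violations"] > 0)
    return flagged / len(task_results)

def pattern_consistency_score(task_results: list[dict]) -> float:
    """Mean per-task consistency score (0-1) from the judge pass."""
    return sum(t["consistency"] for t in task_results) / len(task_results)

# Illustrative batch of three evaluated tasks.
batch = [
    {"violations": 0, "consistency": 0.95},
    {"violations": 2, "consistency": 0.60},
    {"violations": 0, "consistency": 0.88},
]
assert abs(architectural_violation_rate(batch) - 1 / 3) < 1e-9
assert abs(pattern_consistency_score(batch) - 0.81) < 1e-9
```

Tracked over time, a rising violation rate or falling consistency score points at a systemic issue (stale Golden Samples, a weak check) rather than any single bad task.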

The Evaluation Engineer's work determines the reliability floor of the entire agentic pipeline. When evaluation is thorough, human reviewers spend their time on judgment calls rather than catching mechanical errors.

Last updated: 3/11/2026