Harness Introduction

Harness is testing infrastructure for AI outputs.

It helps with:

  • repeatable runs against stable inputs
  • evaluation of output quality
  • lower testing cost through mocked dependencies

Harness Engineering is the quality system around AI workflows. It combines scenarios, fixtures, mock servers, evaluators, traces, regression suites, and failure triage so changes to prompts, tools, models, or code can be verified before release.
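The relationship between these pieces can be sketched as plain data structures. The names below (`Scenario`, `Result`, `run_scenario`) are illustrative, not an actual Harness API:

```python
from dataclasses import dataclass

# Illustrative sketch only; these names are not a real Harness API.
@dataclass
class Scenario:
    """One test case for an AI workflow."""
    name: str
    input: str            # prompt or request sent to the workflow
    expectation: str      # human-readable description of a correct output
    acceptance: callable  # returns True when the output passes

@dataclass
class Result:
    scenario: Scenario
    output: str
    passed: bool
    explanation: str = ""

def run_scenario(scenario: Scenario, workflow) -> Result:
    """Run one scenario and record whether the acceptance criterion held."""
    output = workflow(scenario.input)
    passed = scenario.acceptance(output)
    explanation = "" if passed else (
        f"expected: {scenario.expectation}; got: {output!r}"
    )
    return Result(scenario, output, passed, explanation)
```

A failing run keeps its explanation alongside the output, which is what makes later triage possible.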

Core pieces

Piece            Purpose
Scenario         Defines input, expectation, and acceptance criteria
Fixture          Keeps stable inputs for regression runs
Mock server      Replays downstream success, failure, timeout, and empty data
Evaluator        Checks one quality standard and explains failures
Trace            Records context, tool calls, and model output
Failure triage   Separates prompt failures, tool failures, regressions, and flaky behavior
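An evaluator in this sense is a small, single-purpose check that returns a verdict plus an explanation. A minimal sketch, with hypothetical evaluator names rather than real Harness code:

```python
# Hypothetical evaluators: each checks exactly one quality standard
# and explains the failure when it does not hold.
def evaluate_non_empty(output: str) -> tuple[bool, str]:
    """Standard: the answer must not be empty or whitespace-only."""
    if output.strip():
        return True, ""
    return False, "output was empty or whitespace-only"

def evaluate_cites_source(output: str) -> tuple[bool, str]:
    """Standard: the answer must include at least one [source] marker."""
    if "[source]" in output:
        return True, ""
    return False, "no [source] citation found in output"
```

Keeping each evaluator to one standard means a failure message points at exactly one quality problem instead of a bundle of them.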

Harness architecture checklist

  • Each scenario has input, expectation, and failure explanation.
  • Mock servers cover success, failure, timeout, and empty data.
  • Evaluators check one clear standard at a time.
  • Fixtures cover normal, boundary, and adversarial inputs.
  • Traces make context, tool calls, and model output inspectable.
  • Regression suites run before publishing.

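The mock-server item in the checklist asks for all four downstream behaviors. A minimal replay sketch (illustrative only, not a real Harness feature) that serves success, empty data, failure, and timeout from one canned script:

```python
# Illustrative mock transport: replays canned downstream responses so a
# test suite covers success, empty-data, failure, and timeout paths
# without touching a live service.
class MockServer:
    def __init__(self, responses):
        # responses: sequence of (kind, payload) pairs, replayed in order
        self._responses = iter(responses)

    def request(self, path: str):
        kind, payload = next(self._responses)
        if kind == "timeout":
            raise TimeoutError(f"mocked timeout for {path}")
        if kind == "failure":
            raise RuntimeError(f"mocked server error for {path}")
        return payload  # "success" or "empty" just return the payload

server = MockServer([
    ("success", {"items": [1, 2, 3]}),
    ("empty", {"items": []}),
    ("failure", None),
    ("timeout", None),
])
```

Each scenario in a regression run can then pair one canned response with one expected workflow behavior.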
Next: Writing Test Scenarios or Evaluation and Quality.

Released under the MIT License