Evaluation Harness

One run tells you nothing.

RunLens is a cohort-based quality harness for autonomous AI agents. Measure consistency, completeness, and correctness across runs, not within them.

$ runlens cohort run --agents 8 --suite onboarding

cohort:  baseline-2026-05-07
agents:  8 parallel executions
suite:   onboarding (12 checkpoints)

├─ consistency:   94.2%  (σ = 0.03)
├─ completeness:  97.8%  (12/12 phases)
├─ correctness:   91.5%  (rubric pass)
└─ regression:    none detected

→ baseline saved. next cohort in 24h.

Agent performance under multi-run evaluation

A single pass looks great. Run the same task 8 times and your "reliable" agent fails on 3 of them. RunLens catches what one-shot evals miss.
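
The idea is easy to sketch. The snippet below is illustrative rather than RunLens code: it assumes a hypothetical run_agent(task) callable and a passes(result) check, and shows what a one-shot eval never sees.

def flakiness_check(run_agent, passes, task, runs=8):
    """Run the same task N times; a one-shot eval only ever sees results[0]."""
    results = [run_agent(task) for _ in range(runs)]
    failures = [i for i, r in enumerate(results) if not passes(r)]
    return {
        "runs": runs,
        "failed_runs": failures,              # e.g. [2, 5, 6] out of 8
        "pass_rate": 1 - len(failures) / runs,
    }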

What RunLens measures

Three dimensions that define agent quality at production scale, plus automated regression detection built on top of them.

Consistency

Same task, same inputs, same result? Multi-run variance is the silent killer of agent reliability. RunLens tracks standard deviation across cohorts.
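
One plausible way to report that metric, sketched here with made-up run scores rather than RunLens internals:

import statistics

def consistency(run_scores):
    """Spread of per-run quality scores across a cohort (lower sigma = steadier agent)."""
    sigma = statistics.pstdev(run_scores)     # population std dev across runs
    return {"sigma": round(sigma, 3), "mean": round(statistics.mean(run_scores), 3)}

consistency([0.94, 0.91, 0.96, 0.93, 0.95, 0.92, 0.97, 0.95])
# -> {"sigma": 0.019, "mean": 0.941}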

Completeness

Did the agent execute all required phases? Partial completions that look like successes are worse than failures. Checkpoint-level tracking exposes gaps.
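
In spirit, checkpoint tracking is a set comparison. The checkpoint names below are hypothetical, not taken from the onboarding suite:

REQUIRED = {                       # hypothetical checkpoint names
    "create_account", "verify_email", "provision_workspace",
    "invite_teammates", "configure_billing", "send_welcome",
}

def completeness(reached):
    """Fraction of required checkpoints the run actually hit, plus what it skipped."""
    hit = set(reached) & REQUIRED
    return len(hit) / len(REQUIRED), sorted(REQUIRED - hit)

# A run that "finished" but never touched billing is a gap, not a success:
completeness(["create_account", "verify_email", "provision_workspace",
              "invite_teammates", "send_welcome"])
# -> (0.833..., ["configure_billing"])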

Correctness

Was the output actually right? Rubric-based grading with hierarchical criteria ensures quality, not just task completion.
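
Hierarchical grading amounts to a weighted roll-up of leaf-level checks. The rubric shape and weights below are illustrative, not the RunLens rubric format:

def grade(node, checks):
    """Score a rubric tree: leaves are named checks, branches are weighted children."""
    if "check" in node:
        return 1.0 if checks.get(node["check"], False) else 0.0
    total = sum(child["weight"] for child in node["children"].values())
    return sum(child["weight"] / total * grade(child, checks)
               for child in node["children"].values())

# Hypothetical rubric: a right answer matters more than a pretty one.
rubric = {"children": {
    "output":     {"weight": 3, "check": "answer_matches_expected"},
    "formatting": {"weight": 1, "check": "follows_response_schema"},
}}
grade(rubric, {"answer_matches_expected": True, "follows_response_schema": False})
# -> 0.75: the task "completed", but the rubric still docks it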

Regression Detection

Did this deploy make things worse? Automated baseline comparison flags quality degradation before it reaches users.
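
Conceptually, that comparison is a diff against the last saved cohort with a tolerance. The threshold here is illustrative, not a RunLens default; the numbers reuse the baseline report above:

def regressions(baseline, current, tolerance=0.02):
    """Flag any metric that dropped more than `tolerance` since the saved baseline."""
    return {name: (baseline[name], current.get(name, 0.0))
            for name in baseline
            if baseline[name] - current.get(name, 0.0) > tolerance}

baseline = {"consistency": 0.942, "completeness": 0.978, "correctness": 0.915}
current  = {"consistency": 0.951, "completeness": 0.978, "correctness": 0.874}
regressions(baseline, current)
# -> {"correctness": (0.915, 0.874)}: block the deploy and look at the diff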

Stop shipping agents you haven't measured.

If you wouldn't deploy code without tests, don't deploy agents without cohort baselines. RunLens is the test suite for autonomous systems.