A diagnostic benchmark for measuring model-harness configuration effects across 106 sandboxed offline agent tasks in realistic workflows.
Harness Bench spans eight workflow categories and evaluates agents through executable, oracle-checkable tasks rather than isolated final answers.
Open each domain to inspect prompts, fixtures, hooks, oracle graders, LLM rubrics, and model run matrices.
Harness Bench combines reproducible task environments with artifact-based grading and rubric summaries for diagnostic evaluation.
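As a rough illustration of artifact-based grading, the sketch below checks a file the agent leaves in its sandbox directory instead of parsing its final message. It is not the Harness Bench grader: the `report.json` filename, the `OracleResult` type, and the `grade_artifact` function are hypothetical names chosen for this example, and the benchmark's real oracle interface may look quite different.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OracleResult:
    passed: bool
    reason: str


def grade_artifact(workdir: Path, expected: dict) -> OracleResult:
    """Hypothetical oracle: pass only if report.json matches the expected keys."""
    artifact = workdir / "report.json"  # assumed artifact name, for illustration only
    if not artifact.exists():
        return OracleResult(False, "missing artifact: report.json")
    try:
        produced = json.loads(artifact.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return OracleResult(False, "artifact is not valid JSON")
    if not isinstance(produced, dict):
        return OracleResult(False, "artifact is not a JSON object")
    # Compare only the keys the oracle cares about; extra keys are ignored.
    mismatched = sorted(k for k, v in expected.items() if produced.get(k) != v)
    if mismatched:
        return OracleResult(False, "mismatched keys: " + ", ".join(mismatched))
    return OracleResult(True, "all expected keys match")


def demo() -> list:
    """Grade three simulated agent runs inside a throwaway directory."""
    expected = {"status": "ok", "rows": 3}
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Run 1: the agent produced no artifact at all.
        results.append(grade_artifact(workdir, expected))
        # Run 2: the artifact exists but one value is wrong.
        (workdir / "report.json").write_text(json.dumps({"status": "ok", "rows": 2}))
        results.append(grade_artifact(workdir, expected))
        # Run 3: the artifact matches (with an extra key that is ignored).
        (workdir / "report.json").write_text(
            json.dumps({"status": "ok", "rows": 3, "extra": 1})
        )
        results.append(grade_artifact(workdir, expected))
    return results


if __name__ == "__main__":
    for result in demo():
        print(result)
```

Grading the files left on disk, not the agent's own claim of success, is what makes a task oracle-checkable in this sketch: a run that reports completion but writes nothing fails.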
If you use Harness Bench in your research, please cite our paper using the BibTeX entry below.
@misc{harnessbench2026,
  title  = {Harness Bench: Measuring Harness Effects in Realistic Agent Workflows},
  author = {Harness Bench Team},
  year   = {2026},
  url    = {https://arxiv.org/abs/2605.27922},
  note   = {106 sandboxed offline agent tasks across 8 categories}
}