Diagnostic Benchmark 2026

Harness Bench

A diagnostic benchmark for measuring model-harness configuration effects across 106 sandboxed offline agent tasks in realistic workflows.


Total Tasks: 106

Task Suite Overview

Harness Bench spans eight workflow categories and evaluates agents through executable, oracle-checkable tasks rather than isolated final answers.


Workspace / Tools

Workspace, Tool Use & Multimodal Operations

File operations, shell commands, browser interactions, local webpages, artifact handling, and multimodal tasks in realistic workspaces.

Environment: Workspace

Typical Difficulty: Medium

Primary Pressure: Tools

Task Count: 15

How It Works

Evaluation Methodology

Harness Bench combines reproducible task environments with artifact-based grading and rubric summaries for diagnostic evaluation.

Step 01

Task Construction & Sandboxing

Each task is handcrafted by domain experts and deployed in a fully isolated, reproducible sandbox environment to eliminate external variables.

Deterministic task states

Isolated execution environments

Automated ground-truth generation

Cross-platform reproducibility
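
To make "deterministic task states" concrete, here is a minimal Python sketch of one way to check that a task's starting workspace is rebuilt identically on every run. It is an illustration only, not the Harness Bench implementation: the FIXTURE contents and the function names are hypothetical, and a temporary directory stands in for a real sandbox, which would also need process and network isolation.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical fixture: relative path -> file contents for one task's starting state.
FIXTURE = {
    "inbox/report.txt": "Q1 revenue: 120\n",
    "inbox/notes.md": "# TODO\n- summarize report\n",
}


def materialize(fixture: dict[str, str], root: Path) -> None:
    """Write every fixture file under root, creating parent directories."""
    for rel_path, content in fixture.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write bytes so newlines are not translated differently per platform.
        target.write_bytes(content.encode("utf-8"))


def workspace_digest(root: Path) -> str:
    """Hash relative paths and file bytes in sorted order for a stable digest."""
    digest = hashlib.sha256()
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    for rel_name, path in files:
        digest.update(rel_name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def fresh_workspace_digest(fixture: dict[str, str]) -> str:
    """Build the task state in a throwaway directory and return its digest."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        materialize(fixture, root)
        return workspace_digest(root)
```

Two calls to fresh_workspace_digest(FIXTURE) should return the same hex string, and any change to a fixture file should change it. Comparing such a digest before each run is one inexpensive way to detect drift in a task's initial state.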

Step 02

Harnessed Agent Execution

Agents run under shared budgets and timeouts while traces, token usage, tool calls, final artifacts, and execution metadata are captured.

Shared budgets and timeouts

Trace and usage capture

Tool and browser interactions

Final artifact collection
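
The shared timeout and metadata capture described in this step can be sketched in Python as follows. This is an assumption-laden illustration rather than the harness's actual runner: RunRecord, run_with_timeout, and STAND_IN_AGENT are hypothetical names, the agent is modeled as a plain subprocess, and token usage and tool-call traces are left out because their format is not described here.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for an agent command; a real harness would launch the agent here.
STAND_IN_AGENT = [sys.executable, "-c", "print('done')"]


@dataclass
class RunRecord:
    """Hypothetical execution metadata for one agent run."""
    exit_code: Optional[int]  # None when the run was stopped at the timeout
    timed_out: bool
    wall_seconds: float
    stdout: str


def run_with_timeout(command: list[str], timeout_s: float) -> RunRecord:
    """Run a command under a wall-clock limit and record what happened."""
    start = time.monotonic()
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
        return RunRecord(done.returncode, False, time.monotonic() - start, done.stdout)
    except subprocess.TimeoutExpired as exc:
        # Output captured before the timeout may be None or bytes.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode("utf-8", errors="replace")
        return RunRecord(None, True, time.monotonic() - start, partial)
```

Calling run_with_timeout(STAND_IN_AGENT, 30.0) returns a record with exit code 0 and the captured output. Applying the same timeout_s to every agent is what makes the limit "shared": a run that exceeds it is recorded as timed out instead of being allowed to finish.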

Step 03

Oracle and LLM Scoring

Completion is evaluated with deterministic oracle graders where possible and summarized with LLM rubrics for qualitative diagnostics.

Executable oracle graders

Rubric-based summaries

Hooks before and after rounds

Human-readable task pages
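
As a rough illustration of an executable oracle grader, the Python sketch below scores a run by inspecting the files an agent left behind, not the text of its final answer. The check helpers, the file name summary.json, and the report layout are all hypothetical; the actual Harness Bench graders and rubric format may look quite different.

```python
import json
from pathlib import Path
from typing import Callable

# A check inspects the final workspace and returns True when its condition holds.
Check = Callable[[Path], bool]


def file_exists(rel_path: str) -> Check:
    """Check that the agent produced a file at the given relative path."""
    return lambda root: (root / rel_path).is_file()


def json_field_equals(rel_path: str, key: str, expected: object) -> Check:
    """Check that a JSON artifact holds an expected value under a top-level key."""
    def check(root: Path) -> bool:
        try:
            data = json.loads((root / rel_path).read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # A missing, unreadable, or malformed artifact fails the check.
            return False
        return isinstance(data, dict) and data.get(key) == expected
    return check


def grade(root: Path, checks: dict[str, Check]) -> dict:
    """Run every named check and report per-check results plus a pass fraction."""
    results = {name: check(root) for name, check in checks.items()}
    passed = sum(results.values())
    score = passed / len(results) if results else 0.0
    return {"passed": passed, "total": len(results), "score": score, "results": results}
```

A task might register checks such as {"has_summary": file_exists("summary.json"), "total_ok": json_field_equals("summary.json", "total", 120)}. Because each check depends only on the final artifacts, grading the same workspace twice gives the same result, which is the property that separates an oracle grader from an LLM rubric.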

Citation

Cite This Work

If you use Harness Bench in your research, please cite our paper using the BibTeX entry below.

@misc{harnessbench2026,
  title     = {Harness Bench: Measuring Harness Effects in Realistic Agent Workflows},
  author    = {Harness Bench Team},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.27922},
  note      = {106 sandboxed offline agent tasks across 8 categories}
}
