Diagnostic Benchmark 2026

Harness Bench

A diagnostic benchmark for measuring model-harness configuration effects across 106 sandboxed offline agent tasks in realistic workflows.

106 Total Tasks
8 Categories
5,194 Execution Trajectories
Task Suite

Task Suite Overview

Harness Bench spans eight workflow categories and evaluates agents through executable, oracle-checkable tasks rather than isolated final answers.

Workspace / Tools (15 tasks)
Workspace, Tool Use & Multimodal Operations
File operations, shell commands, browser interactions, local webpages, artifact handling, and multimodal tasks in realistic workspaces.
Environment: Workspace
Typical Difficulty: Medium
Primary Pressure: Tools
Task Count: 15
Browse by Category

Explore Every Domain

Open each domain to inspect prompts, fixtures, hooks, oracle graders, LLM rubrics, and model run matrices.

How It Works

Evaluation Methodology

Harness Bench combines reproducible task environments with artifact-based grading and rubric summaries for diagnostic evaluation.

Step 01
Task Construction & Sandboxing
Each task is handcrafted by domain experts and deployed in a fully isolated, reproducible sandbox environment to keep external factors from affecting results.
Deterministic task states
Isolated execution environments
Automated ground-truth generation
Cross-platform reproducibility
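
To make the idea of deterministic task states concrete, here is a minimal Python sketch. It assumes each task ships a fixture directory that is copied into a fresh workspace before every run. The function names `workspace_digest` and `fresh_workspace` are hypothetical and are not taken from the Harness Bench codebase. A real sandbox would also isolate processes and network access, which this sketch does not attempt.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def workspace_digest(root: Path) -> str:
    """Hash relative paths and file bytes in sorted order so the digest is stable."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def fresh_workspace(fixture_dir: Path) -> Path:
    """Copy the fixture into a new temporary directory so runs never share state."""
    target = Path(tempfile.mkdtemp(prefix="task_")) / "workspace"
    shutil.copytree(fixture_dir, target)
    return target
```

Comparing the digest of each new workspace against the digest of the fixture is one simple way to confirm that every run starts from the same state.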
Step 02
Harnessed Agent Execution
Agents run under shared budgets and timeouts while traces, token usage, tool calls, final artifacts, and execution metadata are captured.
Shared budgets and timeouts
Trace and usage capture
Tool and browser interactions
Final artifact collection
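
As an illustration of running under a shared timeout while capturing execution metadata, the sketch below wraps a command with Python's standard `subprocess.run`. It assumes the agent is launched as a command-line process. `RunRecord` and `run_with_timeout` are hypothetical names, and the sketch records only wall-clock time and process output. Token usage and tool-call traces would have to come from the harness itself.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    command: list
    exit_code: Optional[int]  # None when the run hit the timeout
    timed_out: bool
    wall_seconds: float
    stdout: str
    stderr: str


def _as_text(data) -> str:
    """Captured output on timeout may be None, bytes, or str; normalize to str."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_with_timeout(command: list, cwd: str, timeout_s: float) -> RunRecord:
    """Run one command under a wall-clock limit and record what happened."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            command, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
        return RunRecord(command, proc.returncode, False,
                         time.monotonic() - start, proc.stdout, proc.stderr)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising, so no cleanup is needed here.
        return RunRecord(command, None, True, time.monotonic() - start,
                         _as_text(exc.stdout), _as_text(exc.stderr))
```

Applying the same `timeout_s` to every model-harness configuration is what makes the resulting records comparable.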
Step 03
Oracle and LLM Scoring
Completion is evaluated with deterministic oracle graders where possible and summarized with LLM rubrics for qualitative diagnostics.
Executable oracle graders
Rubric-based summaries
Hooks before and after rounds
Human-readable task pages
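
A deterministic oracle grader can be as simple as a set of named checks run against the final workspace. The following is a minimal sketch under that assumption. The `grade` function, the `CheckResult` type, and the example task (a `report.json` file whose `total` field must equal 42) are all hypothetical illustrations, not the benchmark's actual graders.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool


def grade(workspace: Path, checks: Dict[str, Callable[[Path], bool]]) -> List[CheckResult]:
    """Run each named check against the final workspace; a raising check counts as failed."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check(workspace))
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    return results


def report_has_total(workspace: Path) -> bool:
    # Hypothetical task: the agent must write report.json with total == 42.
    data = json.loads((workspace / "report.json").read_text())
    return data.get("total") == 42
```

Because each check inspects only the artifacts left on disk, the same workspace always produces the same verdict, independent of how the agent phrased its final answer.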
Citation

Cite This Work

If you use Harness Bench in your research, please cite our paper using the BibTeX entry below.

@misc{harnessbench2026,
  title     = {Harness Bench: Measuring Harness Effects in Realistic Agent Workflows},
  author    = {Harness Bench Team},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.27922},
  note      = {106 sandboxed offline agent tasks across 8 categories}
}