Evaluation Results

Harness Bench Leaderboard

Completion, process, and combined scores aggregated across 106 sandboxed offline tasks spanning 8 task domains, 7 harnesses, and 8 model backends. Every run uses the same prompts, fixtures, budgets, validators, and rubric scoring.

Best Pair (model / harness)

Best Model (mean combined)

Best Harness (mean combined)

Evaluated Runs (result JSON rows)

Overall Ranking

Switch between model, harness, and model-harness pair views. Rankings sort by combined score and show completion and process means alongside it.
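
As a rough illustration of how the three views could be derived from the result rows, the sketch below groups rows by model, by harness, or by model-harness pair, averages the three scores, and sorts by the combined mean. The field names (`model`, `harness`, `completion`, `process`, `combined`), the sample values, and the `rank` helper are all assumptions made for this example; the actual result JSON schema is not shown on this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result rows; field names and values are illustrative only.
rows = [
    {"model": "model-a", "harness": "harness-x", "completion": 0.80, "process": 0.60, "combined": 0.70},
    {"model": "model-a", "harness": "harness-y", "completion": 0.50, "process": 0.70, "combined": 0.60},
    {"model": "model-b", "harness": "harness-x", "completion": 0.90, "process": 0.90, "combined": 0.90},
    {"model": "model-b", "harness": "harness-y", "completion": 0.40, "process": 0.60, "combined": 0.50},
]

# Each view is defined by the row fields that form its grouping key.
VIEWS = {
    "model": ("model",),
    "harness": ("harness",),
    "pair": ("model", "harness"),
}


def rank(rows, view):
    """Group rows for the chosen view and sort by mean combined score, highest first."""
    keys = VIEWS[view]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    table = [
        {
            "name": " / ".join(group_key),
            "combined": mean(r["combined"] for r in members),
            "completion": mean(r["completion"] for r in members),
            "process": mean(r["process"] for r in members),
            "runs": len(members),
        }
        for group_key, members in groups.items()
    ]
    return sorted(table, key=lambda entry: entry["combined"], reverse=True)


for entry in rank(rows, "pair"):
    print(entry["name"], round(entry["combined"], 3))
```

Keeping the grouping key as data rather than writing three separate functions means the same sort and the same columns apply to every view, which is what switching views on a single table implies.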

Domain Breakdown

Mean combined score by task domain, preserving the same order and naming used throughout the site.
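
One way to keep the domain order stable is to drive the aggregation from a fixed list of domain names instead of from whatever order the rows arrive in. The sketch below assumes a `domain` field on each row and uses placeholder domain names; the real domain names and their order are defined elsewhere on the site and are not reproduced here.

```python
from statistics import mean

# Hypothetical canonical order; the site's real domain names are not listed on this page.
DOMAIN_ORDER = ["domain-a", "domain-b", "domain-c"]

# Hypothetical result rows with an assumed "domain" field.
rows = [
    {"domain": "domain-b", "combined": 0.5},
    {"domain": "domain-a", "combined": 0.25},
    {"domain": "domain-b", "combined": 1.0},
]


def domain_breakdown(rows, domain_order):
    """Return (domain, mean combined) pairs in the given order.

    Domains with no rows are kept with a value of None so the layout never shifts.
    A row whose domain is not in domain_order raises KeyError, which surfaces
    naming mismatches instead of silently adding a new category.
    """
    scores = {domain: [] for domain in domain_order}
    for row in rows:
        scores[row["domain"]].append(row["combined"])
    return [(domain, mean(values) if values else None) for domain, values in scores.items()]


print(domain_breakdown(rows, DOMAIN_ORDER))
```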

Harness Profile

Harness-level combined means across all available model and task runs.

Task-Level Explorer

Browse individual tasks by domain, difficulty, best score, and score spread. Each task links back to its scenario page.
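
The per-task columns can be sketched in the same spirit. This page does not define "score spread", so the example below assumes it means the gap between the highest and lowest combined score recorded for a task; the `task` field, the sample rows, and the `task_summary` helper are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical result rows; "task" and "combined" are assumed field names.
rows = [
    {"task": "task-001", "combined": 0.25},
    {"task": "task-001", "combined": 0.75},
    {"task": "task-002", "combined": 0.5},
    {"task": "task-002", "combined": 0.5},
]


def task_summary(rows):
    """Compute the best score and an assumed spread (max minus min) for each task."""
    by_task = defaultdict(list)
    for row in rows:
        by_task[row["task"]].append(row["combined"])
    return {
        task: {
            "best": max(scores),
            "spread": max(scores) - min(scores),  # assumption: spread is the score range
            "runs": len(scores),
        }
        for task, scores in by_task.items()
    }


summary = task_summary(rows)
# Tasks with the widest spread separate model-harness pairs the most.
for task, stats in sorted(summary.items(), key=lambda item: item[1]["spread"], reverse=True):
    print(task, stats)
```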