Evaluation Results
Harness Bench Leaderboard
Aggregated completion, process, and combined scores across 106 sandboxed offline tasks, 8 task domains, 7 harnesses, and 8 model backends. Results use shared prompts, fixtures, budgets, validators, and rubric scoring.
Best Pair: -- (model / harness)
Best Model: -- (mean combined)
Best Harness: -- (mean combined)
Evaluated Runs: -- (result JSON rows)
Overall Ranking
Switch between model, harness, and model-harness pair views. Rankings sort by combined score and show completion and process means alongside it.
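The three ranking views can be sketched as a group-and-average step over the result rows. This is a minimal illustration, not the site's actual implementation: the row fields (`model`, `harness`, `completion`, `process`, `combined`), the sample values, and the `rank` helper are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result rows; field names and scores are illustrative only.
rows = [
    {"model": "model-a", "harness": "harness-x", "completion": 0.80, "process": 0.60, "combined": 0.70},
    {"model": "model-a", "harness": "harness-y", "completion": 0.60, "process": 0.40, "combined": 0.50},
    {"model": "model-b", "harness": "harness-x", "completion": 0.90, "process": 0.70, "combined": 0.80},
    {"model": "model-b", "harness": "harness-y", "completion": 0.50, "process": 0.50, "combined": 0.50},
]

# One grouping key per view: model, harness, or model-harness pair.
VIEWS = {
    "model": lambda r: r["model"],
    "harness": lambda r: r["harness"],
    "pair": lambda r: f'{r["model"]} / {r["harness"]}',
}


def rank(rows, view):
    """Group rows by the chosen view and sort by mean combined score, highest first."""
    key_fn = VIEWS[view]
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)

    table = [
        {
            "name": name,
            "combined": mean(r["combined"] for r in members),
            "completion": mean(r["completion"] for r in members),
            "process": mean(r["process"] for r in members),
            "runs": len(members),
        }
        for name, members in groups.items()
    ]
    return sorted(table, key=lambda entry: entry["combined"], reverse=True)


if __name__ == "__main__":
    for entry in rank(rows, "model"):
        print(entry["name"], round(entry["combined"], 3), entry["runs"])
```

Calling `rank(rows, "harness")` or `rank(rows, "pair")` reuses the same aggregation with a different grouping key, which is one plausible way for a view switch to keep the three tables consistent.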
Domain Breakdown
Mean combined score by task domain, preserving the same order and naming used throughout the site.
Harness Profile
Harness-level combined means across all available model and task runs.
Task-Level Explorer
Browse individual tasks by domain, difficulty, best score, and score spread. Each task links back to its scenario page.
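As a rough sketch of the per-task columns, the snippet below computes a best score and a score spread for each task. It assumes that "best score" means the maximum combined score over all runs of a task and that "score spread" means maximum minus minimum; the page does not define either term, so both definitions, the field names, and the `summarize_tasks` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-run rows; task ids, domains, and scores are illustrative only.
runs = [
    {"task": "task-001", "domain": "domain-a", "difficulty": "easy", "combined": 0.75},
    {"task": "task-001", "domain": "domain-a", "difficulty": "easy", "combined": 0.25},
    {"task": "task-002", "domain": "domain-b", "difficulty": "hard", "combined": 0.50},
    {"task": "task-002", "domain": "domain-b", "difficulty": "hard", "combined": 0.50},
]


def summarize_tasks(runs):
    """Return one summary row per task with its best score and score spread."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task"]].append(run)

    summary = []
    for task, members in by_task.items():
        scores = [r["combined"] for r in members]
        summary.append(
            {
                "task": task,
                "domain": members[0]["domain"],
                "difficulty": members[0]["difficulty"],
                "best": max(scores),                  # assumed: highest combined score
                "spread": max(scores) - min(scores),  # assumed: max minus min
            }
        )
    return summary


if __name__ == "__main__":
    for row in summarize_tasks(runs):
        print(row)
```

Under these assumptions, a large spread flags a task where the choice of model or harness matters a great deal, while a spread near zero suggests every configuration performs about the same.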