Run Real Shell Commands In The Workspace

Use terminal commands to create three precise single-line artifacts under the workspace output directory.

Workspace, Tool Use & Multimodal Operations · Task 2 · Oracle + LLM scoring

Task ID 002-exec

Difficulty Easy

Model Runs 6 harnesses & 8 models evaluated on this task.

Prompt

Under $WORKSPACE, use the terminal to complete the steps below. The harness has already created an empty out/ directory; if you see AGENTS.md, .git, etc. at the repo root, those come from the runtime. Ignore them and only write outputs under out/.

Each step's artifact must be single-line text with no extra leading or trailing whitespace:

Arithmetic expansion: write 42 to out/step1.txt.
basename: create a file such as out/a/b/c.txt, then write only its filename, c.txt, to out/step2.txt.
Pipeline and tr: write hello to out/step3.txt.

Do not rely on external resources or the network beyond what the prompt provides.
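
One way the three steps could be carried out is sketched below. This is an illustration, not a reference solution: the expression 6 * 7, the uppercase input HELLO, and the temporary-directory fallback for an unset WORKSPACE are all assumptions chosen for the example. Only the output paths and expected contents come from the prompt.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: fall back to a throwaway directory when WORKSPACE is unset,
# so the sketch can be run outside the harness.
WORKSPACE="${WORKSPACE:-$(mktemp -d)}"
cd "$WORKSPACE"
mkdir -p out   # harmless if the harness already created out/

# Step 1: arithmetic expansion. 6 * 7 is an arbitrary expression that
# evaluates to 42; echo adds exactly one trailing newline.
echo "$((6 * 7))" > out/step1.txt

# Step 2: create the nested file, then write only its filename.
mkdir -p out/a/b
: > out/a/b/c.txt
basename out/a/b/c.txt > out/step2.txt

# Step 3: pipeline through tr. The uppercase input is an assumption;
# the prompt only fixes the resulting text.
printf 'HELLO\n' | tr '[:upper:]' '[:lower:]' > out/step3.txt
```

Every redirect targets a path under out/, so nothing else in the workspace is touched.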

Input Files 0 files

No static fixture files

LLM Rubric

USER_TEMPLATE = """This task: use shell commands to produce three single-line
artifacts under out/: arithmetic expansion, basename extraction, and a
pipeline/tr lowercase transform. Oracle checks the final file contents.

Evaluate the agent run:
- tool_use_appropriate: prioritizes terminal commands and redirects for
  shell-native operations.
- consistency: completes the three requested steps in a traceable order.
- robustness: creates needed directories and avoids unrelated workspace edits.

Return ONLY JSON with scores, security_gate, and notes.
"""

Completion Grader

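from pathlib import Path
from typing import Any


def _first_line_trim(p: Path) -> str:
    """Return the stripped first line of p, or "" if p is missing, unreadable, or empty."""
    if not p.is_file():
        return ""
    try:
        line = p.read_text(encoding="utf-8", errors="replace").splitlines()[0]
    except (OSError, IndexError):
        return ""
    # strip() means surrounding whitespace and any later lines are not compared.
    return line.strip()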


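def score_workspace(workspace: Path) -> dict[str, Any]:
    """Compare each expected artifact under workspace with its target text and report the pass fraction."""
    w = workspace.resolve()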
    checks: list[dict[str, Any]] = []
    expect = [
        ("out/step1.txt", "42"),
        ("out/step2.txt", "c.txt"),
        ("out/step3.txt", "hello"),
    ]
    for rel, want in expect:
        fp = w / rel
        got = _first_line_trim(fp)
        ok = got == want
        checks.append(
            {
                "id": rel.replace("/", "_"),
                "label": f"{rel} == {want!r}",
                "pass": ok,
                "weight": round(1.0 / len(expect), 4),
                "detail": None if ok else f"got {got!r}",
            }
        )
    outcome = round(sum(1 for c in checks if c["pass"]) / len(expect), 4)
    return {
        "task": "002-exec",
        "workspace": str(w),
        "checks": checks,
        "outcome_score": outcome,
    }
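
For a quick local pre-check before the grader runs, its comparison can be approximated in shell. The sketch below is a hypothetical helper, not part of the harness: the name check_first_line and the fallback to the current directory when WORKSPACE is unset are assumptions. It only roughly mirrors the Python logic, since line splitting and whitespace trimming differ in edge cases such as carriage returns or Unicode whitespace.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed when the trimmed first line of a file
# equals the expected string, roughly as the grader's check does.
check_first_line() {
    local file=$1 want=$2 line=""
    # A missing file is treated as empty content.
    if [ -f "$file" ]; then
        # read fails on a final line without a newline but still fills $line.
        IFS= read -r line < "$file" || true
    fi
    # Trim leading, then trailing, whitespace.
    line="${line#"${line%%[![:space:]]*}"}"
    line="${line%"${line##*[![:space:]]}"}"
    [ "$line" = "$want" ]
}

# Assumption: default to the current directory when WORKSPACE is unset.
ws="${WORKSPACE:-.}"
for pair in "step1.txt:42" "step2.txt:c.txt" "step3.txt:hello"; do
    file="${pair%%:*}"
    want="${pair#*:}"
    if check_first_line "$ws/out/$file" "$want"; then
        echo "PASS out/$file"
    else
        echo "FAIL out/$file"
    fi
done
```

As with the grader, a file that passes this check may still carry stray surrounding whitespace or extra lines, so a pass here does not confirm the single-line requirement in the prompt.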
