Model Runs
6 harnesses & 8 models evaluated on this task.
Prompt
Office & Business Communication · Task 1
Read $WORKSPACE/in/meeting_transcript.txt (the full transcript of a product meeting) and write an English meeting summary to $WORKSPACE/out/meeting_summary.txt (UTF-8 text; run mkdir -p out first if needed).
Content and format:
- Use one to three paragraphs in English to cover: main topics, aligned conclusions or decisions, major risks or open issues, and clear follow-ups (no need to recap sentence-by-sentence, but a reader should follow the discussion without reading the full transcript). Only include what the transcript supports—do not speculate or invent.
- Length: aim for a typical short workplace summary—clearly more than a few slogan-like bullets, and clearly shorter than copying the transcript; most readers should finish in about one to two minutes. Avoid ultra-short bullet dumps or very long essays.
- Information density: naturally include several recurring or anchoring terms from the transcript (org/product-line names, phase or release wording, resource or constraint language, named technical or process objects, etc.) while keeping the summary coherent; avoid stuffing unrelated terms just to look “complete”.
Output path: out/meeting_summary.txt only.
Input Files
1 file
in/meeting_transcript.txt
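The file handling the prompt asks for can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the harness or from any evaluated agent: the helper names read_transcript and write_summary are hypothetical, and the workspace is assumed to be available as a pathlib.Path. The summary text itself still has to come from reading the transcript.

```python
from pathlib import Path


def read_transcript(workspace: Path) -> str:
    """Return the transcript text, or raise if the input file is absent (hypothetical helper)."""
    source = workspace / "in" / "meeting_transcript.txt"
    if not source.is_file():
        # Report the problem instead of silently writing an empty summary.
        raise FileNotFoundError(f"transcript not found: {source}")
    return source.read_text(encoding="utf-8")


def write_summary(workspace: Path, summary: str) -> Path:
    """Write the summary as UTF-8 to out/meeting_summary.txt (hypothetical helper)."""
    out_dir = workspace / "out"
    # Equivalent of `mkdir -p out`: no error if the directory already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "meeting_summary.txt"
    target.write_text(summary, encoding="utf-8")
    return target
```

Raising on a missing transcript corresponds to the rubric's robustness criterion, which expects missing-file or encoding problems to be handled or reported.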
LLM Rubric
USER_TEMPLATE = (
    "Task name: {task_name}\n\n"
    + _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
    + "\n\n"
    + _dr.RUBRIC_USER_OUTCOME_NOTICE
    + "\n\n"
    + """This task: read **in/meeting_transcript.txt**, write **out/meeting_summary.txt** (English meeting summary) with length in the allowed character range and required phrases (Q2, budget, milestone, OpenClaw, M1, risk, gateway, action).
Evaluate the agent run (criteria each 0.0-1.0):
- tool_use_appropriate: read/write tools for transcript and summary; irrelevant tools score lower.
- consistency: read source → draft summary → write out/meeting_summary.txt.
- robustness: missing file or encoding issues handled or reported.
"""
    + "\n\n"
    + _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
    + "\n\n"
    + _dr.RUBRIC_USER_JSON_NO_QUALITY
    + "\n\n"
    + """Return ONLY JSON:
{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, "security_gate": 1, "notes": "one line"}}
Optional **total**: harness recomputes mean of three process scores.
--- PROXY TRACE JSON BELOW ---
{payload}
"""
)
Completion Grader
import json
from pathlib import Path
from typing import Any

# _DEFAULT_GT is assumed to be a module-level Path defined outside this excerpt.


def score_workspace(
    workspace: Path,
    *,
    ground_truth_path: Path | None = None,
) -> dict[str, Any]:
    w = workspace.resolve()
    gt_path = ground_truth_path or _DEFAULT_GT
    if not gt_path.is_file():
        return {
            "task": "004-meeting-summary",
            "workspace": str(w),
            "checks": [],
            "outcome_score": 0.0,
            "error": f"missing ground_truth: {gt_path}",
        }
    # Load thresholds and required phrases, falling back to defaults.
    gt = json.loads(gt_path.read_text(encoding="utf-8"))
    rel = str(gt.get("summary_path") or "out/meeting_summary.txt")
    min_c = int(gt.get("summary_min_chars", 180))
    max_c = int(gt.get("summary_max_chars", 480))
    phrases: list[str] = list(gt.get("required_phrases") or [])
    sp = w / rel
    # A missing or unreadable summary is scored as empty text.
    text = ""
    if sp.is_file():
        try:
            text = sp.read_text(encoding="utf-8", errors="replace").strip()
        except OSError:
            text = ""
    n = len(text)
    # One length check plus one check per phrase, all equally weighted.
    n_checks = 1 + len(phrases)
    weight = round(1.0 / n_checks, 6) if n_checks else 0.0
    checks: list[dict[str, Any]] = []
    ok_len = min_c <= n <= max_c
    checks.append(
        {
            "id": "summary_length",
            "label": f"meeting_summary.txt char count in [{min_c}, {max_c}]",
            "pass": ok_len,
            "weight": weight,
            "detail": None if ok_len else f"got {n} chars (file missing or empty counts as 0)",
        }
    )
    for ph in phrases:
        # Case-sensitive substring match.
        contained = ph in text
        checks.append(
            {
                "id": f"phrase_{ph}",
                "label": f"summary contains {ph!r}",
                "pass": contained,
                "weight": weight,
                "detail": None if contained else "substring not found",
            }
        )
    # The outcome is the sum of the weights of all passing checks.
    outcome = round(sum(c["weight"] for c in checks if c["pass"]), 4)
    return {
        "task": "004-meeting-summary",
        "workspace": str(w),
        "checks": checks,
        "outcome_score": outcome,
    }
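To make the weighting concrete, here is a small self-contained sketch that mirrors the arithmetic of score_workspace on an in-memory string. It assumes the ground truth lists the eight phrases named in the LLM rubric (Q2, budget, milestone, OpenClaw, M1, risk, gateway, action) and uses the fallback bounds of 180 and 480 characters. The actual ground-truth file is not shown, so these values, the helper name outcome_for, and the sample text are illustrative only.

```python
# Assumed phrase list, taken from the rubric text rather than the real ground-truth file.
PHRASES = ["Q2", "budget", "milestone", "OpenClaw", "M1", "risk", "gateway", "action"]


def outcome_for(text: str, phrases=PHRASES, min_c: int = 180, max_c: int = 480) -> float:
    """Recompute the outcome score for a summary string (hypothetical helper)."""
    text = text.strip()
    n_checks = 1 + len(phrases)            # one length check + one per phrase
    weight = round(1.0 / n_checks, 6)      # 9 checks -> 0.111111 each
    passed = [min_c <= len(text) <= max_c]
    passed += [ph in text for ph in phrases]   # case-sensitive substring test
    return round(sum(weight for ok in passed if ok), 4)


# Invented sample text, padded to 200 characters so the length check passes.
full = (
    "In Q2 the team agreed the OpenClaw budget and the M1 milestone. "
    "The main risk is the gateway migration; each action has an owner. "
).ljust(200, ".")

print(outcome_for(full))                          # 1.0
print(outcome_for(full.replace("risk", "Risk")))  # 0.8889
```

With nine equally weighted checks, each failed check costs about 0.1111. Because the phrase test is a case-sensitive substring match, a summary that only ever writes "Risk" at the start of a sentence does not satisfy the "risk" check.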