Model Runs
6 harnesses & 8 models evaluated on this task.
Prompt
Office & Business Communication · Task 5
You are the project coordinator for the Atlas launch.
Inputs:
- $WORKSPACE/in/transcript.txt
- $WORKSPACE/in/transcript_part2.txt
- $WORKSPACE/in/transcript_part3.txt
- $WORKSPACE/in/followup_emails.md
- $WORKSPACE/in/previous_actions.csv
- $WORKSPACE/in/project_dependencies.json
Write exactly these outputs:
- $WORKSPACE/out/action_items.csv
- $WORKSPACE/out/owner_followups.md
- $WORKSPACE/out/merge_rationale.md
action_items.csv requirements:
- CSV header: action_id,owner,task,deadline,status,source
- Include new actions from transcripts/emails and carry forward previous incomplete actions that remain open.
- Merge updates into an existing incomplete action when it is the same work item.
- Preserve existing action_id values for carried-forward or updated previous actions. Use NEW-<short-id> for genuinely new actions.
- Apply explicit cancellation, completion, reassignment, and deadline changes from later sources. Source recency is: previous_actions.csv < transcript.txt < transcript_part2.txt < followup_emails.md < transcript_part3.txt.
- Do not include previous completed actions as open work.
- Use ISO dates YYYY-MM-DD.
- Do not include cancelled or newly completed actions in action_items.csv.
- For new actions introduced in the transcripts, use these IDs: compliance signoff checklist → NEW-01, security audit brief → NEW-02, customer quotes for launch brief → NEW-03.
owner_followups.md requirements:
- Group follow-ups by owner.
- Mention each owner with open work and their deadlines.
- For any action with status "pending" due to a dependency, explicitly note what it is blocked by.
merge_rationale.md requirements:
- Explain which previous actions were carried forward, updated, cancelled, reassigned, or completed.
- Cite the source file for each decision.
- Explain dependency-driven deadline changes and bulk reassignments/cancellations explicitly.
- Do not modify input files.
Additional requirements:
- Read $WORKSPACE/in/project_dependencies.json and apply dependency rules.
- If an action is blocked_by another open action, apply the specified rule (extend deadline or set status to pending).
- Handle bulk updates (e.g., owner rolling off) by cancelling affected open actions. Reassign only if another owner has explicitly taken over the same work item in a later source.
- When resource conflicts exist for an owner with multiple deadlines, apply the resolution rule only if all affected actions remain open and their deadlines fall within the conflicting window. If any affected action is completed or cancelled, the conflict does not apply.
- The source field should record the file that made the last substantive change to the action. If the change was driven by a dependency rule or bulk update, you may use "project_dependencies.json" as the source.
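The source-recency rule stated in the prompt can be sketched in a few lines. This is a minimal illustration, not the reference solution: the `Update` record, the `resolve` helper, the `AT-000` ID, the owner name, and the dates are all hypothetical. Only the file order is taken from the prompt.

```python
from dataclasses import dataclass

# Recency order copied from the task prompt, oldest first.
SOURCE_ORDER = [
    "previous_actions.csv",
    "transcript.txt",
    "transcript_part2.txt",
    "followup_emails.md",
    "transcript_part3.txt",
]
RANK = {name: i for i, name in enumerate(SOURCE_ORDER)}


@dataclass
class Update:
    """One statement about an action (hypothetical record shape)."""
    action_id: str
    source: str
    field: str  # e.g. "owner", "deadline", "status"
    value: str


def resolve(updates: list[Update]) -> dict[str, dict[str, str]]:
    """Keep, per action and field, the value from the most recent source."""
    resolved: dict[str, dict[str, str]] = {}
    # sorted() is stable, so updates from the same file keep their input order.
    for u in sorted(updates, key=lambda u: RANK[u.source]):
        action = resolved.setdefault(u.action_id, {})
        action[u.field] = u.value
        # Record the file that made the latest change to this action.
        action["source"] = u.source
    return resolved


# Hypothetical updates, deliberately listed out of recency order.
updates = [
    Update("AT-000", "followup_emails.md", "deadline", "2030-01-20"),
    Update("AT-000", "previous_actions.csv", "deadline", "2030-01-10"),
    Update("AT-000", "transcript.txt", "owner", "Example Owner"),
]
result = resolve(updates)
print(result["AT-000"])
```

Sorting by rank before applying updates means a later file always overwrites an earlier one, regardless of the order in which the statements were extracted.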
Input Files
6 files
in/followup_emails.md
in/previous_actions.csv
in/project_dependencies.json
in/transcript.txt
in/transcript_part2.txt
in/transcript_part3.txt
LLM Rubric
_REF = """
Qualitative cues: action wording specific/faithful; prior open items updated not duplicated with sensible reassignment,
deadlines, completion/cancellation; concise owner note; avoid invented owners/deadlines/commitments.
""".strip()
USER_TEMPLATE = (
    "Task name: {task_name}\n\n"
    + _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
    + "\n\n"
    + _REF
    + "\n\n"
    + "**Outcome**: oracle; this rubric covers only the three process dimensions + **security_gate**; do not output **quality**.\n\n"
    "- **tool_use_appropriate**: whether reads and writes fit the meeting notes, emails, and fixtures.\n"
    "- **consistency**: traceability from extraction → merging/updating entries → written output.\n"
    "- **robustness**: full marks if nothing failed; after a failure, whether the agent adjusted or explained.\n\n"
    + "\n\n"
    + _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
    + "\n\n"
    + _dr.RUBRIC_USER_JSON_NO_QUALITY
    + "\n\n"
    + 'Return ONLY JSON:\n{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, '
    '"robustness": 0.0}}, "security_gate": 1, "notes": "one line"}}\n\n'
    "--- PROXY TRACE JSON BELOW ---\n"
    "{payload}"
)
Completion Grader
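import csv
import json
from pathlib import Path
from typing import Any


def _rows(path: Path) -> list[dict[str, str]]:
    # Parse the CSV into one dict per row, keyed by the header names.
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))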
def score_workspace(workspace: Path, *, ground_truth_path: Path | None = None) -> dict[str, Any]:
    w = workspace.resolve()
    gt = json.loads((ground_truth_path or _GT).read_text(encoding="utf-8"))
    checks: list[dict[str, Any]] = []

    def add(cid: str, ok: bool, detail: Any = None) -> None:
        checks.append({"id": cid, "label": cid.replace("_", " "), "pass": bool(ok), "weight": 1.0, "detail": detail})

    # 1. Read action_items.csv
    p = w / "out" / "action_items.csv"
    rows: list[dict[str, str]] = []
    if p.is_file():
        try:
            rows = _rows(p)
            add("csv_parseable", True)
        except Exception as exc:
            add("csv_parseable", False, str(exc))
    else:
        add("csv_exists", False, "missing")
        rows = []

    # 2. Header check
    expected_header = ["action_id", "owner", "task", "deadline", "status", "source"]
    add("csv_header_exact", bool(rows) and list(rows[0].keys()) == expected_header, list(rows[0].keys()) if rows else None)

    # 3. Every expected action must appear (action_id, owner, task contains
    #    keyword, deadline, status, source contains keyword)
    for exp in gt["expected_actions"]:
        hit = False
        matched_row = None
        for r in rows:
            # Check action_id
            if r.get("action_id") != exp["action_id"]:
                continue
            # Check owner
            if r.get("owner") != exp["owner"]:
                continue
            # Check that task contains the keyword
            if exp["task_contains"].lower() not in r.get("task", "").lower():
                continue
            # Check deadline
            if r.get("deadline") != exp["deadline"]:
                continue
            # Check status (strict match, ignoring case)
            if r.get("status", "").lower() != exp["status"].lower():
                continue
            # Check that source contains a keyword (several valid sources are allowed)
            src = r.get("source", "")
            if "source_matches" in exp:
                if not any(m.lower() in src.lower() for m in exp["source_matches"]):
                    continue
            hit = True
            matched_row = r
            break
        add(f"action_{exp['action_id']}", hit, exp if not hit else matched_row)

    # 4. Forbidden-term check: completed/cancelled tasks must not appear among the open actions
    forbidden_hits = []
    for r in rows:
        task = r.get("task", "")
        for f in gt["forbidden_task_contains"]:
            if f.lower() in task.lower():
                forbidden_hits.append(f"{r.get('action_id')}: {task}")
                break
    add("completed_previous_action_excluded", not forbidden_hits, forbidden_hits)

    # 5. owner_followups.md checks
    text_path = w / "out" / "owner_followups.md"
    text = text_path.read_text(encoding="utf-8", errors="replace") if text_path.is_file() else ""
    add("followups_exists", bool(text.strip()))
    missing_owners = [o for o in gt["owners"] if o.lower() not in text.lower()]
    missing_dates = [e["deadline"] for e in gt["expected_actions"] if e["deadline"] not in text]
    add("followups_cover_owners", not missing_owners, missing_owners)
    add("followups_cover_deadlines", not missing_dates, missing_dates)

    # 5b. Deeper check: does it mention blocked/pending dependencies
    text_lower = text.lower()
    has_blocked_note = "blocked" in text_lower or "pending" in text_lower
    add("followups_note_blocked_dependencies", has_blocked_note)

    # 6. merge_rationale.md checks
    rationale_path = w / "out" / "merge_rationale.md"
    rationale = rationale_path.read_text(encoding="utf-8", errors="replace") if rationale_path.is_file() else ""
    add("rationale_exists", bool(rationale.strip()))
    rationale_lower = rationale.lower()
    # missing_terms = [term for term in gt["rationale_terms"] if term.lower() not in rationale_lower]
    rationale_terms_normalized = [
        "followup_emails" if term == "followup_emails.md" else term
        for term in gt["rationale_terms"]
    ]
    # Normalize the term the same way as the text. Without this, a term that
    # contains "_" could never match once the text's underscores are replaced.
    rationale_normalized = rationale_lower.replace("_", " ").replace(".md", "")
    missing_terms = [
        term for term in rationale_terms_normalized
        if term.lower().replace("_", " ") not in rationale_normalized
    ]
    add("rationale_covers_merge_decisions", not missing_terms, missing_terms)

    # 6b. Deeper check: does it explain dependency-driven changes
    has_dependency_explanation = (
        "dependency" in rationale_lower
        or "blocked" in rationale_lower
        or "extend" in rationale_lower
    )
    add("rationale_explains_dependencies", has_dependency_explanation)

    # 6c. Deeper check: does it explain the bulk update (Jules rolling off)
    has_bulk_explanation = (
        "jules" in rationale_lower
        and ("bulk" in rationale_lower or "rolling off" in rationale_lower or "cancel" in rationale_lower)
    )
    add("rationale_explains_bulk_update", has_bulk_explanation)

    # 6d. Deeper check: does it explain why AT-106 is pending
    has_pending_explanation = "at-106" in rationale_lower and "pending" in rationale_lower
    add("rationale_explains_pending_status", has_pending_explanation)

    # Compute the total score (all checks have weight 1.0)
    score = sum(c["pass"] for c in checks) / len(checks) if checks else 0.0
    return {
        "task": "025-meeting-action-tracker",
        "workspace": str(w),
        "outcome_score": round(score, 4),
        "checks": checks
    }
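The per-row test in step 3 of the grader can be isolated into a small predicate, which makes it easier to see why a row passes or fails. The sketch below is illustrative only: the dictionary keys are the ones the grader reads, but the `row_matches` helper, the owner name, the task text, and the date are hypothetical, since the real ground-truth file is not shown here.

```python
def row_matches(row: dict[str, str], exp: dict) -> bool:
    """Mirror the grader's per-row test: exact action_id, owner, and deadline;
    case-insensitive status; substring match on task and source."""
    if row.get("action_id") != exp["action_id"]:
        return False
    if row.get("owner") != exp["owner"]:
        return False
    if exp["task_contains"].lower() not in row.get("task", "").lower():
        return False
    if row.get("deadline") != exp["deadline"]:
        return False
    if row.get("status", "").lower() != exp["status"].lower():
        return False
    # The source check applies only when the expectation lists allowed sources.
    allowed = exp.get("source_matches")
    src = row.get("source", "").lower()
    return allowed is None or any(m.lower() in src for m in allowed)


# Hypothetical ground-truth entry and CSV row (values invented for illustration).
expected = {
    "action_id": "AT-106",
    "owner": "Example Owner",
    "task_contains": "example task",
    "deadline": "2030-01-15",
    "status": "pending",
    "source_matches": ["project_dependencies.json", "transcript_part3.txt"],
}
row = {
    "action_id": "AT-106",
    "owner": "Example Owner",
    "task": "Finish the example task",
    "deadline": "2030-01-15",
    "status": "Pending",
    "source": "project_dependencies.json",
}

print(row_matches(row, expected))                              # True
print(row_matches({**row, "deadline": "15/01/2030"}, expected))  # False
```

Because owner and deadline are compared exactly, a non-ISO date or a differently spelled owner fails the check even when the rest of the row is right, while status tolerates differences in letter case.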