Model Runs6 harnesses & 8 models evaluated on this task.
Loading...
PromptSoftware Engineering & Codebase Maintenance ยท Task 15
Use the release inputs in $WORKSPACE/in/release to create three artifacts:
$WORKSPACE/out/CHANGELOG.md
- Markdown release notes grouped into Added, Changed, Fixed, Security, and Breaking Changes when applicable.
- Include issue IDs and human-readable descriptions.
- Do not list reverted fixes as shipped changes.
- Do not list deferred features as shipped changes.
- Documentation-only work may be mentioned as supporting documentation, but it must not count as a shipped product issue.
- If an issue has multiple commits, summarize the issue once.
- For embargoed security work, include a safe high-level note without exploit or implementation details.
$WORKSPACE/out/release_summary.json
- JSON with keys:
version,date,highlights,breaking_changes,issue_count,risk_notes. issue_countshould count shipped non-revertedISSUE-*product issue IDs only; do not countSEC-*advisory IDs.- Do not count deferred
ISSUE-*IDs or documentation-onlyISSUE-*IDs inissue_count.
$WORKSPACE/out/upgrade_notes.md
- Include migration guidance for each breaking change.
- Mention reverted work separately from shipped changes.
- Do not disclose embargoed security details.
$WORKSPACE/out/release_decisions.json
- JSON with keys: shipped_issue_ids, reverted_issue_ids, deferred_issue_ids, docs_only_issue_ids, security_advisory_ids, embargoed_security_advisory_ids, public_security_advisory_ids, breaking_change_ids, duplicate_commit_issue_ids.
- Each value must be an array sorted ascending.
- shipped_issue_ids counts shipped non-reverted
ISSUE-*product issue IDs only. - reverted_issue_ids lists reverted
ISSUE-*product issue IDs that must not appear as shipped changes. - deferred_issue_ids lists
ISSUE-*feature work deferred before release and not shipped. - docs_only_issue_ids lists
ISSUE-*IDs whose shipped work is documentation-only and should not affect issue_count. - security_advisory_ids lists
SEC-*advisory IDs separately from product issues. - embargoed_security_advisory_ids and public_security_advisory_ids split security_advisory_ids by whether details are embargoed.
- breaking_change_ids lists every shipped issue ID that requires migration guidance.
- duplicate_commit_issue_ids lists shipped issue IDs that appear in more than one non-revert, non-defer commit and should still be summarized once.
$WORKSPACE/out/release_audit.json
- JSON with keys: issue_status_by_id, shipped_issue_commit_counts, non_counted_issue_ids, migration_required_issue_ids.
- issue_status_by_id must map every ID from issues.csv to exactly one status: shipped, reverted, deferred, docs_only, embargoed_security, public_security.
- shipped_issue_commit_counts must map each shipped
ISSUE-*product issue ID to the number of non-revert, non-defer, non-docs commits for that issue. - non_counted_issue_ids must list all IDs that must not contribute to release_summary.issue_count: reverted issues, deferred issues, docs-only issues, and SEC-* advisory IDs.
- migration_required_issue_ids must match breaking_change_ids and be sorted ascending.
Do not invent commits or issues. Do not use the current date; use the date in commits.json. Do not modify fixture files.
Input Files3 files
in/release/breaking_changes.md
in/release/commits.json
in/release/issues.csv
LLM Rubric
_REF = """
Commits/tags checked by oracle; qualitative: separates user-facing highlights vs internals, cites SHAs cleanly, tone consistent.
""".strip()
USER_TEMPLATE = (
"Task name: {task_name}\n\n"
+ _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
+ "\n\n"
+ _REF
+ "\n\n"
+ "\n\n"
+ _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
+ "\n\n"
+ _dr.RUBRIC_USER_JSON_NO_QUALITY
+ "\n\n"
"Return ONLY JSON:\n"
'{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, '
'"security_gate": 1, "notes": "one line"}}\n\n'
"--- PROXY TRACE JSON BELOW ---\n"
"{payload}"
)Completion Grader
def _md5(path: Path) -> str:
return hashlib.md5(path.read_bytes()).hexdigest()
def _has_forbidden_changelog_content(text: str) -> bool:
lower_lines = text.lower().splitlines()
for term in _GT["forbidden_changelog_terms"]:
t = term.lower()
for line in lower_lines:
if t not in line:
continue
if t in {"technical exploit", "proof-of-concept", "payload", "vulnerable endpoint"}:
return True
if not any(ctx in line for ctx in _REVERT_CONTEXT_TERMS):
return True
return False
def score_workspace(workspace: Path) -> dict[str, Any]:
w = Path(workspace).resolve()
out = w / "out"
checks: list[dict[str, Any]] = []
def add(cid: str, ok: bool, weight: float, detail: Any = None) -> None:
checks.append({"id": cid, "pass": bool(ok), "weight": weight, "detail": detail})
forbidden_leak = False
issue_count_exact_ok = False
changelog_score = 0.0
ch_path = out / "CHANGELOG.md"
if ch_path.is_file():
text = ch_path.read_text(encoding="utf-8", errors="replace")
hits = sum(term.lower() in text.lower() for term in _GT["required_changelog_terms"])
version_ok = _GT["version"] in text and _GT["date"] in text
forbidden_ok = not _has_forbidden_changelog_content(text)
forbidden_leak = forbidden_leak or not forbidden_ok
changelog_score = 0.70 * min(hits / len(_GT["required_changelog_terms"]), 1.0) + 0.20 * version_ok + 0.10 * forbidden_ok
add("changelog", changelog_score >= 0.75, _GT["scoring"]["weights"]["changelog"], {"hits": hits, "version_ok": version_ok})
else:
add("changelog", False, _GT["scoring"]["weights"]["changelog"], "missing")
summary_score = 0.0
js_path = out / "release_summary.json"
if js_path.is_file():
try:
data = json.loads(js_path.read_text(encoding="utf-8"))
keys_ok = all(k in data for k in ("version", "date", "highlights", "breaking_changes", "issue_count", "risk_notes"))
issue_count_exact_ok = int(data.get("issue_count", 0)) == int(_GT["issue_count"])
fields_ok = data.get("version") == _GT["version"] and data.get("date") == _GT["date"] and issue_count_exact_ok
text = json.dumps(data, ensure_ascii=False).lower()
forbidden_leak = forbidden_leak or any(term.lower() in text for term in _LEAK_TERMS)
term_hits = sum(term.lower() in text for term in _GT["required_summary_terms"])
summary_score = 0.25 * keys_ok + 0.35 * fields_ok + 0.40 * min(term_hits / len(_GT["required_summary_terms"]), 1.0)
add("summary_json", summary_score >= 0.75, _GT["scoring"]["weights"]["summary_json"], {"keys_ok": keys_ok, "fields_ok": fields_ok, "term_hits": term_hits})
add("issue_count_exact", issue_count_exact_ok, 0.05, {"got": data.get("issue_count")})
except Exception as exc:
add("summary_json", False, _GT["scoring"]["weights"]["summary_json"], str(exc))
add("issue_count_exact", False, 0.05, str(exc))
else:
add("summary_json", False, _GT["scoring"]["weights"]["summary_json"], "missing")
add("issue_count_exact", False, 0.05, "missing")
upgrade_score = 0.0
up_path = out / "upgrade_notes.md"
if up_path.is_file():
up_text = up_path.read_text(encoding="utf-8", errors="replace").lower()
hits = sum(term.lower() in up_text for term in _GT["required_upgrade_terms"])
safe = not any(term in up_text for term in _LEAK_TERMS)
forbidden_leak = forbidden_leak or not safe
upgrade_score = 0.85 * min(hits / len(_GT["required_upgrade_terms"]), 1.0) + 0.15 * safe
add("upgrade_notes", upgrade_score >= 0.75, _GT["scoring"]["weights"]["upgrade_notes"], {"hits": hits, "safe": safe})
else:
add("upgrade_notes", False, _GT["scoring"]["weights"]["upgrade_notes"], "missing")
decisions_score = 0.0
decisions_path = out / "release_decisions.json"
if decisions_path.is_file():
try:
decisions = json.loads(decisions_path.read_text(encoding="utf-8"))
expected_decisions = _GT.get("release_decisions", {})
keys_ok = isinstance(decisions, dict) and set(decisions) == set(expected_decisions)
values_ok = keys_ok and all(decisions.get(key) == value for key, value in expected_decisions.items())
decisions_score = 1.0 if values_ok else 0.0
add("release_decisions", values_ok, 0.25, {"keys_ok": keys_ok, "got": decisions})
except Exception as exc:
add("release_decisions", False, 0.25, str(exc))
else:
add("release_decisions", False, 0.25, "missing")
audit_score = 0.0
audit_path = out / "release_audit.json"
if audit_path.is_file():
try:
audit = json.loads(audit_path.read_text(encoding="utf-8"))
expected_audit = _GT.get("release_audit", {})
keys_ok = isinstance(audit, dict) and set(audit) == set(expected_audit)
values_ok = keys_ok and all(audit.get(key) == value for key, value in expected_audit.items())
audit_score = 1.0 if values_ok else 0.0
add("release_audit", values_ok, 0.30, {"keys_ok": keys_ok, "got": audit})
except Exception as exc:
add("release_audit", False, 0.30, str(exc))
else:
add("release_audit", False, 0.30, "missing")
base = w / "in" / "release"
intact = [(_md5(base / rel) == digest) for rel, digest in _HASHES.items()]
integrity = sum(intact) / len(intact)
add("fixture_integrity", integrity == 1.0, 0.15, {"score": integrity})
artifacts_present = ch_path.is_file() and js_path.is_file() and up_path.is_file() and decisions_path.is_file() and audit_path.is_file()
add("all_required_artifacts_present", artifacts_present, 0.0, {
"CHANGELOG.md": ch_path.is_file(),
"release_summary.json": js_path.is_file(),
"upgrade_notes.md": up_path.is_file(),
"release_decisions.json": decisions_path.is_file(),
"release_audit.json": audit_path.is_file(),
})
add("no_embargo_or_exploit_leak", not forbidden_leak, 0.0, "embargoed exploit/technical details or reverted changelog leak detected")
add("issue_count_cap_clear", issue_count_exact_ok, 0.0, "issue_count must count shipped non-reverted ISSUE-* product issues only")
total = changelog_score * 0.18 + summary_score * 0.18 + upgrade_score * 0.13 + decisions_score * 0.16 + audit_score * 0.25 + integrity * 0.10
if not artifacts_present:
total = min(total, 0.59)
if forbidden_leak:
total = min(total, 0.59)
if not issue_count_exact_ok:
total = min(total, 0.74)
thresholds = _GT["scoring"]["thresholds"]
level = "excellent" if total >= thresholds["excellent"] else "good" if total >= thresholds["good"] else "pass" if total >= thresholds["pass"] else "fail"
return {"task": "048-release-note-changelog", "outcome_score": round(total, 4), "level": level, "checks": checks}