Release Notes and Changelog Generation

Software Engineering & Codebase Maintenance · Task 15 · Oracle + LLM scoring
Model Runs: 6 harnesses & 8 models evaluated on this task.
Prompt

Use the release inputs in $WORKSPACE/in/release to create five artifacts:

  1. $WORKSPACE/out/CHANGELOG.md
     • Markdown release notes grouped into Added, Changed, Fixed, Security, and Breaking Changes when applicable.
     • Include issue IDs and human-readable descriptions.
     • Do not list reverted fixes as shipped changes.
     • Do not list deferred features as shipped changes.
     • Documentation-only work may be mentioned as supporting documentation, but it must not count as a shipped product issue.
     • If an issue has multiple commits, summarize the issue once.
     • For embargoed security work, include a safe high-level note without exploit or implementation details.
  2. $WORKSPACE/out/release_summary.json
     • JSON with keys: version, date, highlights, breaking_changes, issue_count, risk_notes.
     • issue_count should count shipped non-reverted ISSUE-* product issue IDs only; do not count SEC-* advisory IDs.
     • Do not count deferred ISSUE-* IDs or documentation-only ISSUE-* IDs in issue_count.
  3. $WORKSPACE/out/upgrade_notes.md
     • Include migration guidance for each breaking change.
     • Mention reverted work separately from shipped changes.
     • Do not disclose embargoed security details.
  4. $WORKSPACE/out/release_decisions.json
     • JSON with keys: shipped_issue_ids, reverted_issue_ids, deferred_issue_ids, docs_only_issue_ids, security_advisory_ids, embargoed_security_advisory_ids, public_security_advisory_ids, breaking_change_ids, duplicate_commit_issue_ids.
     • Each value must be an array sorted ascending.
     • shipped_issue_ids lists shipped non-reverted ISSUE-* product issue IDs only.
     • reverted_issue_ids lists reverted ISSUE-* product issue IDs that must not appear as shipped changes.
     • deferred_issue_ids lists ISSUE-* feature work deferred before release and not shipped.
     • docs_only_issue_ids lists ISSUE-* IDs whose shipped work is documentation-only and should not affect issue_count.
     • security_advisory_ids lists SEC-* advisory IDs separately from product issues.
     • embargoed_security_advisory_ids and public_security_advisory_ids split security_advisory_ids by whether details are embargoed.
     • breaking_change_ids lists every shipped issue ID that requires migration guidance.
     • duplicate_commit_issue_ids lists shipped issue IDs that appear in more than one non-revert, non-defer commit and should still be summarized once.
  5. $WORKSPACE/out/release_audit.json
     • JSON with keys: issue_status_by_id, shipped_issue_commit_counts, non_counted_issue_ids, migration_required_issue_ids.
     • issue_status_by_id must map every ID from issues.csv to exactly one status: shipped, reverted, deferred, docs_only, embargoed_security, public_security.
     • shipped_issue_commit_counts must map each shipped ISSUE-* product issue ID to the number of non-revert, non-defer, non-docs commits for that issue.
     • non_counted_issue_ids must list all IDs that must not contribute to release_summary.issue_count: reverted issues, deferred issues, docs-only issues, and SEC-* advisory IDs.
     • migration_required_issue_ids must match breaking_change_ids and be sorted ascending.

Do not invent commits or issues. Do not use the current date; use the date in commits.json. Do not modify fixture files.
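
The counting rules above can be illustrated with a minimal sketch. It assumes the per-ID statuses have already been worked out (the status labels are the six named for issue_status_by_id); the function name, the sample IDs, and the use of plain lexicographic sorting for "sorted ascending" are assumptions for illustration, not part of the task fixtures.

```python
from typing import Dict, List, Union

# Statuses that must never contribute to issue_count, per the rules above.
NON_COUNTED_STATUSES = {
    "reverted", "deferred", "docs_only", "embargoed_security", "public_security",
}


def summarize_statuses(issue_status_by_id: Dict[str, str]) -> Dict[str, Union[int, List[str]]]:
    """Derive the shipped list, the non-counted list, and issue_count.

    Hypothetical helper: only ISSUE-* IDs with status "shipped" are counted.
    """
    shipped = sorted(
        issue_id
        for issue_id, status in issue_status_by_id.items()
        if status == "shipped" and issue_id.startswith("ISSUE-")
    )
    non_counted = sorted(
        issue_id
        for issue_id, status in issue_status_by_id.items()
        if status in NON_COUNTED_STATUSES
    )
    return {
        "shipped_issue_ids": shipped,
        "non_counted_issue_ids": non_counted,
        "issue_count": len(shipped),
    }


# Made-up IDs, not taken from issues.csv.
example = summarize_statuses({
    "ISSUE-102": "shipped",
    "ISSUE-101": "shipped",
    "ISSUE-103": "reverted",
    "ISSUE-104": "docs_only",
    "SEC-7": "embargoed_security",
})
print(example["issue_count"])  # 2
```

Deriving issue_count from the same list that populates shipped_issue_ids keeps release_summary.json and release_decisions.json from drifting apart.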

Input Files (3 files)
in/release/breaking_changes.md
in/release/commits.json
in/release/issues.csv
LLM Rubric
_REF = """
Commits/tags checked by oracle; qualitative: separates user-facing highlights vs internals, cites SHAs cleanly, tone consistent.
""".strip()

USER_TEMPLATE = (
    "Task name: {task_name}\n\n"
    + _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
    + "\n\n"
    + _REF
    + "\n\n"
    + "\n\n"
    + _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
    + "\n\n"
    + _dr.RUBRIC_USER_JSON_NO_QUALITY
    + "\n\n"
    "Return ONLY JSON:\n"
    '{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, '
    '"security_gate": 1, "notes": "one line"}}\n\n'
    "--- PROXY TRACE JSON BELOW ---\n"
    "{payload}"
)
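
The doubled braces in the template suggest it is rendered with str.format, where `{{` and `}}` produce literal braces and `{task_name}` and `{payload}` are substituted. The sketch below shows that behaviour on a trimmed copy of the template; the `_dr.*` constants are left out because their contents are not shown here, and the `render` helper is hypothetical.

```python
import json

# Trimmed copy of the template above, without the _dr.* rubric fragments.
TEMPLATE = (
    "Task name: {task_name}\n\n"
    "Return ONLY JSON:\n"
    '{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, '
    '"security_gate": 1, "notes": "one line"}}\n\n'
    "--- PROXY TRACE JSON BELOW ---\n"
    "{payload}"
)


def render(task_name: str, payload: dict) -> str:
    """Fill the two placeholders; doubled braces become single literal braces."""
    return TEMPLATE.format(task_name=task_name, payload=json.dumps(payload))


rendered = render("demo-task", {"k": 1})
# The example response line is valid JSON once the braces are unescaped.
example_line = next(l for l in rendered.splitlines() if l.startswith('{"scores"'))
print(json.loads(example_line)["scores"])
```

Substituted values are not re-parsed by str.format, so braces inside the JSON payload need no escaping.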
Completion Grader
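import hashlib
import json
from pathlib import Path
from typing import Any

# _GT, _HASHES, _LEAK_TERMS, and _REVERT_CONTEXT_TERMS are module-level names not shown in this excerpt.


def _md5(path: Path) -> str: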
    return hashlib.md5(path.read_bytes()).hexdigest()


def _has_forbidden_changelog_content(text: str) -> bool:
    lower_lines = text.lower().splitlines()
    for term in _GT["forbidden_changelog_terms"]:
        t = term.lower()
        for line in lower_lines:
            if t not in line:
                continue
            if t in {"technical exploit", "proof-of-concept", "payload", "vulnerable endpoint"}:
                return True
            if not any(ctx in line for ctx in _REVERT_CONTEXT_TERMS):
                return True
    return False


def score_workspace(workspace: Path) -> dict[str, Any]:
    w = Path(workspace).resolve()
    out = w / "out"
    checks: list[dict[str, Any]] = []

    def add(cid: str, ok: bool, weight: float, detail: Any = None) -> None:
        checks.append({"id": cid, "pass": bool(ok), "weight": weight, "detail": detail})

    forbidden_leak = False
    issue_count_exact_ok = False
    changelog_score = 0.0
    ch_path = out / "CHANGELOG.md"
    if ch_path.is_file():
        text = ch_path.read_text(encoding="utf-8", errors="replace")
        hits = sum(term.lower() in text.lower() for term in _GT["required_changelog_terms"])
        version_ok = _GT["version"] in text and _GT["date"] in text
        forbidden_ok = not _has_forbidden_changelog_content(text)
        forbidden_leak = forbidden_leak or not forbidden_ok
        changelog_score = 0.70 * min(hits / len(_GT["required_changelog_terms"]), 1.0) + 0.20 * version_ok + 0.10 * forbidden_ok
        add("changelog", changelog_score >= 0.75, _GT["scoring"]["weights"]["changelog"], {"hits": hits, "version_ok": version_ok})
    else:
        add("changelog", False, _GT["scoring"]["weights"]["changelog"], "missing")

    summary_score = 0.0
    js_path = out / "release_summary.json"
    if js_path.is_file():
        try:
            data = json.loads(js_path.read_text(encoding="utf-8"))
            keys_ok = all(k in data for k in ("version", "date", "highlights", "breaking_changes", "issue_count", "risk_notes"))
            issue_count_exact_ok = int(data.get("issue_count", 0)) == int(_GT["issue_count"])
            fields_ok = data.get("version") == _GT["version"] and data.get("date") == _GT["date"] and issue_count_exact_ok
            text = json.dumps(data, ensure_ascii=False).lower()
            forbidden_leak = forbidden_leak or any(term.lower() in text for term in _LEAK_TERMS)
            term_hits = sum(term.lower() in text for term in _GT["required_summary_terms"])
            summary_score = 0.25 * keys_ok + 0.35 * fields_ok + 0.40 * min(term_hits / len(_GT["required_summary_terms"]), 1.0)
            add("summary_json", summary_score >= 0.75, _GT["scoring"]["weights"]["summary_json"], {"keys_ok": keys_ok, "fields_ok": fields_ok, "term_hits": term_hits})
            add("issue_count_exact", issue_count_exact_ok, 0.05, {"got": data.get("issue_count")})
        except Exception as exc:
            add("summary_json", False, _GT["scoring"]["weights"]["summary_json"], str(exc))
            add("issue_count_exact", False, 0.05, str(exc))
    else:
        add("summary_json", False, _GT["scoring"]["weights"]["summary_json"], "missing")
        add("issue_count_exact", False, 0.05, "missing")

    upgrade_score = 0.0
    up_path = out / "upgrade_notes.md"
    if up_path.is_file():
        up_text = up_path.read_text(encoding="utf-8", errors="replace").lower()
        hits = sum(term.lower() in up_text for term in _GT["required_upgrade_terms"])
        # Lower-case each term so the match is case-insensitive, as in the summary check.
        safe = not any(term.lower() in up_text for term in _LEAK_TERMS)
        forbidden_leak = forbidden_leak or not safe
        upgrade_score = 0.85 * min(hits / len(_GT["required_upgrade_terms"]), 1.0) + 0.15 * safe
        add("upgrade_notes", upgrade_score >= 0.75, _GT["scoring"]["weights"]["upgrade_notes"], {"hits": hits, "safe": safe})
    else:
        add("upgrade_notes", False, _GT["scoring"]["weights"]["upgrade_notes"], "missing")

    decisions_score = 0.0
    decisions_path = out / "release_decisions.json"
    if decisions_path.is_file():
        try:
            decisions = json.loads(decisions_path.read_text(encoding="utf-8"))
            expected_decisions = _GT.get("release_decisions", {})
            keys_ok = isinstance(decisions, dict) and set(decisions) == set(expected_decisions)
            values_ok = keys_ok and all(decisions.get(key) == value for key, value in expected_decisions.items())
            decisions_score = 1.0 if values_ok else 0.0
            add("release_decisions", values_ok, 0.25, {"keys_ok": keys_ok, "got": decisions})
        except Exception as exc:
            add("release_decisions", False, 0.25, str(exc))
    else:
        add("release_decisions", False, 0.25, "missing")

    audit_score = 0.0
    audit_path = out / "release_audit.json"
    if audit_path.is_file():
        try:
            audit = json.loads(audit_path.read_text(encoding="utf-8"))
            expected_audit = _GT.get("release_audit", {})
            keys_ok = isinstance(audit, dict) and set(audit) == set(expected_audit)
            values_ok = keys_ok and all(audit.get(key) == value for key, value in expected_audit.items())
            audit_score = 1.0 if values_ok else 0.0
            add("release_audit", values_ok, 0.30, {"keys_ok": keys_ok, "got": audit})
        except Exception as exc:
            add("release_audit", False, 0.30, str(exc))
    else:
        add("release_audit", False, 0.30, "missing")

    base = w / "in" / "release"
    # A deleted fixture counts as not intact instead of raising on read.
    intact = [(base / rel).is_file() and _md5(base / rel) == digest for rel, digest in _HASHES.items()]
    integrity = sum(intact) / len(intact)
    add("fixture_integrity", integrity == 1.0, 0.15, {"score": integrity})
    artifacts_present = ch_path.is_file() and js_path.is_file() and up_path.is_file() and decisions_path.is_file() and audit_path.is_file()
    add("all_required_artifacts_present", artifacts_present, 0.0, {
        "CHANGELOG.md": ch_path.is_file(),
        "release_summary.json": js_path.is_file(),
        "upgrade_notes.md": up_path.is_file(),
        "release_decisions.json": decisions_path.is_file(),
        "release_audit.json": audit_path.is_file(),
    })
    add("no_embargo_or_exploit_leak", not forbidden_leak, 0.0, "embargoed exploit/technical details or reverted changelog leak detected")
    add("issue_count_cap_clear", issue_count_exact_ok, 0.0, "issue_count must count shipped non-reverted ISSUE-* product issues only")
    total = changelog_score * 0.18 + summary_score * 0.18 + upgrade_score * 0.13 + decisions_score * 0.16 + audit_score * 0.25 + integrity * 0.10
    if not artifacts_present:
        total = min(total, 0.59)
    if forbidden_leak:
        total = min(total, 0.59)
    if not issue_count_exact_ok:
        total = min(total, 0.74)
    thresholds = _GT["scoring"]["thresholds"]
    level = "excellent" if total >= thresholds["excellent"] else "good" if total >= thresholds["good"] else "pass" if total >= thresholds["pass"] else "fail"
    return {"task": "048-release-note-changelog", "outcome_score": round(total, 4), "level": level, "checks": checks}
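
The final weighting and the three caps at the end of score_workspace can be checked in isolation. The sketch below copies the weights and cap values from the grader into a standalone function; the function name and the sample component scores are assumptions for illustration.

```python
def combine_scores(
    changelog: float,
    summary: float,
    upgrade: float,
    decisions: float,
    audit: float,
    integrity: float,
    *,
    artifacts_present: bool,
    forbidden_leak: bool,
    issue_count_ok: bool,
) -> float:
    """Weighted total followed by the same caps the grader applies."""
    total = (
        changelog * 0.18
        + summary * 0.18
        + upgrade * 0.13
        + decisions * 0.16
        + audit * 0.25
        + integrity * 0.10
    )
    if not artifacts_present:
        total = min(total, 0.59)  # a missing artifact caps the score
    if forbidden_leak:
        total = min(total, 0.59)  # an embargo or revert leak caps the score
    if not issue_count_ok:
        total = min(total, 0.74)  # a wrong issue_count caps the score
    return round(total, 4)


# Perfect components, but a leak was detected.
print(combine_scores(1, 1, 1, 1, 1, 1,
                     artifacts_present=True, forbidden_leak=True, issue_count_ok=True))  # 0.59

# Wrong issue_count: fields_ok fails, so summary tops out at 0.65, and the cap applies.
print(combine_scores(1, 0.65, 1, 1, 1, 1,
                     artifacts_present=True, forbidden_leak=False, issue_count_ok=False))  # 0.74
```

The caps are upper bounds applied after weighting, so a leak or a missing artifact limits the result to 0.59 regardless of how well the other components score.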