Medical claims admin document completeness check

You are checking administrative completeness for a deidentified insurance claim. Do not make medical judgments or diagnoses.

Vertical Professional WorkflowsTask 9Oracle + LLM scoring
Model Runs6 harnesses & 8 models evaluated on this task.
Loading...
PromptVertical Professional Workflows ยท Task 9

You are checking administrative completeness for a deidentified insurance claim. Do not make medical judgments or diagnoses.

Read:

  • $WORKSPACE/in/deidentified_claim.json
  • $WORKSPACE/in/required_docs_policy.md
  • $WORKSPACE/in/submitted_docs/*.txt

Create:

  • $WORKSPACE/out/claim_completeness.json
  • $WORKSPACE/out/missing_items.md

Requirements for claim_completeness.json:

  • Valid JSON object with keys: claim_id, complete, present_documents, missing_documents, admin_notes.
  • complete must be false if any required document is missing (including plan- and claim-field-triggered requirements in required_docs_policy.md).
  • present_documents and missing_documents must use document names from required_docs_policy.md exactly (underscores preserved where the policy lists them).
  • Stale documents (older than 90 days before service_date), claim_id-mismatched documents, and internal memos that are not themselves named required document types must not count as present for those required slots.
  • Apply every plan-specific and JSON-field-conditioned rule in required_docs_policy.md; explain which triggers apply only in administrative terms inside admin_notes.
  • admin_notes must discuss only administrative completeness and must explicitly reference: (1) PLAN-A + the prior_authorization trigger, and (2) the secondary payer / COB coordination materials trigger. Do not copy direct identifiers from submitted documents into admin_notes.

Requirements for missing_items.md:

  • List each missing required document and why it is needed administratively (including why stale or mismatched submissions do not satisfy a required slot).
  • Explicitly mention that proof_of_payment (as submitted) is administratively stale and that provider_order (as submitted) is claim_id mismatched, in addition to listing other missing materials.
  • Mention that coordination acknowledgment for the secondary payer context is absent (administrative wording only; use the policy document name coordination_of_benefits_ack where you refer to the required packet).
  • Include the sentence: "No medical diagnosis or treatment assessment was performed."
  • Do not diagnose, recommend treatment, or assess medical necessity.

Create also:

  • $WORKSPACE/out/redaction_notes.csv

First line must be exactly: source_file,redacted_type,reason Then one or more CSV rows. Note direct identifiers found in submitted_docs that must not appear in other outputs (e.g., member_message requesting contact-data redaction).

  • Do not include phone numbers, email addresses, or other direct identifiers from submitted documents in any output.

Do not use the network. Do not modify input files.

Input Files9 files
in/deidentified_claim.json
in/required_docs_policy.md
in/submitted_docs/claim_form.txt
in/submitted_docs/deidentification_attestation.txt
in/submitted_docs/internal_escalation_memo.txt
in/submitted_docs/itemized_invoice.txt
in/submitted_docs/member_message.txt
in/submitted_docs/proof_of_payment.txt
in/submitted_docs/provider_order_wrong_claim.txt
LLM Rubric
USER_TEMPLATE = (
    "Task name: {task_name}\n\n"
    + _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
    + "\n\n"
    + _dr.RUBRIC_USER_OUTCOME_NOTICE
    + "\n\n"
    + """Evaluate the agent run for medical administrative claim completeness. Deterministic checks cover JSON schema, present/missing documents, conditional PLAN-A and secondary-payer triggers in admin_notes, safety sentence, stale/mismatch handling, redaction CSV header/rows, and no diagnosis/treatment language. Score only qualitative aspects:
- tool_use_appropriate: reads claim JSON fields, policy conditions, submitted documents (including internal routing notes without treating them as substitutes).
- consistency: derives required document set from both base policy and JSON-conditioned rules before marking present vs missing.
- robustness: stays within administrative completeness when information is missing and avoids copying PHI into outputs.

"""
    + "\n\n"
    + _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
    + "\n\n"
    + _dr.RUBRIC_USER_JSON_NO_QUALITY
    + "\n\n"
    + """Return ONLY JSON:
{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, "security_gate": 1, "notes": "one line"}}

Optional **total**: harness recomputes mean of three process scores.

--- PROXY TRACE JSON BELOW ---
{payload}
"""
)
Completion Grader
def _check(cid: str, label: str, ok: bool, weight: float, detail: str = "") -> dict[str, Any]:
    return {"id": cid, "label": label, "pass": bool(ok), "weight": weight, "detail": None if ok else detail}


def score_workspace(workspace: Path, *, ground_truth_path: Path | None = None) -> dict[str, Any]:
    w = workspace.resolve()
    gt = json.loads((ground_truth_path or TASK_DIR / "ground_truth.json").read_text(encoding="utf-8"))
    json_path = w / gt["json_path"]
    missing_path = w / gt["missing_path"]
    redaction_path = w / gt["redaction_path"]
    checks: list[dict[str, Any]] = []
    checks.append(_check("json_exists", "claim_completeness.json exists", json_path.is_file(), 0.05))
    checks.append(_check("missing_exists", "missing_items.md exists", missing_path.is_file(), 0.05))
    checks.append(_check("redaction_exists", "redaction_notes.csv exists", redaction_path.is_file(), 0.05))
    data: dict[str, Any] = {}
    try:
        data = json.loads(json_path.read_text(encoding="utf-8")) if json_path.exists() else {}
    except Exception:
        data = {}
    schema_ok = {"claim_id", "complete", "present_documents", "missing_documents", "admin_notes"}.issubset(data)
    checks.append(_check("schema", "JSON has required administrative keys", schema_ok, 0.08))
    checks.append(_check("claim_id", "claim_id preserved", data.get("claim_id") == gt["claim_id"], 0.06))
    complete_ok = data.get("complete") is False
    checks.append(_check("complete_false", "complete is false when documents are missing", complete_ok, 0.06))
    present_ok = sorted(data.get("present_documents", [])) == sorted(gt["present_documents"])
    missing_ok = sorted(data.get("missing_documents", [])) == sorted(gt["missing_documents"])
    checks.append(_check("present_docs", "present documents identified", present_ok, 0.09))
    checks.append(_check("missing_docs", "missing documents identified", missing_ok, 0.11))
    text = missing_path.read_text(encoding="utf-8", errors="replace") if missing_path.exists() else ""
    admin_notes_s = str(data.get("admin_notes", "")).lower()
    combined_l = (text + "\n" + admin_notes_s).lower()
    missing_text_ok = all(doc in text for doc in gt["missing_documents"]) and gt["required_safety_sentence"] in text
    checks.append(_check("missing_items_text", "missing_items lists missing docs and safety sentence", missing_text_ok, 0.09))
    no_medical = (missing_path.is_file() or bool(data.get("admin_notes"))) and not any(
        p in combined_l for p in gt["forbidden_medical_phrases"]
    )
    checks.append(_check("no_medical_judgment", "outputs avoid diagnosis/treatment/necessity judgments", no_medical, 0.08))
    invalid_doc_ok = all(term.lower() in combined_l for term in gt["invalid_documents"])
    checks.append(_check("stale_mismatched_docs", "stale, mismatched, and COB gaps are identified administratively", invalid_doc_ok, 0.06))
    admin_terms_ok = all(t in admin_notes_s for t in gt["admin_notes_terms"])
    checks.append(_check("admin_notes_terms", "admin_notes reference plan and triggers", admin_terms_ok, 0.05))
    no_phi = not any(phi.lower() in combined_l for phi in gt["forbidden_phi"])
    checks.append(_check("no_phi_leakage", "outputs do not leak direct identifiers", no_phi, 0.06))
    redaction_ok = False
    if redaction_path.exists():
        try:
            raw = redaction_path.read_text(encoding="utf-8")
            lines = raw.splitlines()
            header_ok = bool(lines) and lines[0].strip() == gt["redaction_header"]
            with redaction_path.open("r", encoding="utf-8", newline="") as f:
                rows = list(csv.DictReader(f))
            required_cols = {"source_file", "redacted_type", "reason"}
            rows_ok = bool(rows) and all(required_cols.issubset(set(r.keys())) for r in rows)
            text_rows = json.dumps(rows, ensure_ascii=False).lower()
            terms_ok = all(term.lower() in text_rows for term in gt["redaction_terms"])
            redaction_ok = header_ok and rows_ok and terms_ok
        except Exception:
            redaction_ok = False
    checks.append(_check("redaction_notes", "redaction CSV header, rows, and identifier notes", redaction_ok, 0.11))
    total_w = sum(c["weight"] for c in checks)
    score = round(sum(c["weight"] for c in checks if c["pass"]) / total_w, 4) if total_w else 0.0
    return {"task": "076-medical-admin-claim-check", "workspace": str(w), "outcome_score": score, "checks": checks}