Model Runs
6 harnesses & 8 models evaluated on this task.
Prompt
Vertical Professional Workflows · Task 9
You are checking administrative completeness for a deidentified insurance claim. Do not make medical judgments or diagnoses.
Read:
- $WORKSPACE/in/deidentified_claim.json
- $WORKSPACE/in/required_docs_policy.md
- $WORKSPACE/in/submitted_docs/*.txt
Create:
- $WORKSPACE/out/claim_completeness.json
- $WORKSPACE/out/missing_items.md
Requirements for claim_completeness.json:
- Valid JSON object with keys: claim_id, complete, present_documents, missing_documents, admin_notes.
- complete must be false if any required document is missing (including plan- and claim-field-triggered requirements in required_docs_policy.md).
- present_documents and missing_documents must use document names from required_docs_policy.md exactly (underscores preserved where the policy lists them).
- Stale documents (older than 90 days before service_date), claim_id-mismatched documents, and internal memos that are not themselves named required document types must not count as present for those required slots.
- Apply every plan-specific and JSON-field-conditioned rule in required_docs_policy.md; explain which triggers apply only in administrative terms inside admin_notes.
- admin_notes must discuss only administrative completeness and must explicitly reference: (1) PLAN-A + the prior_authorization trigger, and (2) the secondary payer / COB coordination materials trigger. Do not copy direct identifiers from submitted documents into admin_notes.
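As an illustration of the staleness and claim_id rules above (not part of the task prompt itself), a minimal sketch might look like the following. The function names, argument names, and sample dates are hypothetical, and treating a document dated exactly 90 days before service_date as still valid is an assumed reading of "older than 90 days".

```python
from datetime import date, timedelta

# Window taken from the prompt: documents older than 90 days before
# service_date are stale. The exact boundary handling is an assumption.
STALE_WINDOW = timedelta(days=90)


def is_stale(document_date: date, service_date: date) -> bool:
    """Return True when the document is dated more than 90 days before service_date."""
    return document_date < service_date - STALE_WINDOW


def counts_as_present(doc_claim_id: str, claim_id: str,
                      document_date: date, service_date: date) -> bool:
    """A submission fills a required slot only if it matches the claim and is not stale."""
    return doc_claim_id == claim_id and not is_stale(document_date, service_date)


# Hypothetical values, for illustration only.
service = date(2024, 6, 1)
print(is_stale(date(2024, 1, 1), service))                      # well outside the window
print(counts_as_present("C-1", "C-2", date(2024, 5, 1), service))  # claim_id mismatch
```

Keeping the staleness test and the claim_id test in separate functions makes it easier to report which of the two reasons disqualified a submission.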
Requirements for missing_items.md:
- List each missing required document and why it is needed administratively (including why stale or mismatched submissions do not satisfy a required slot).
- Explicitly mention that proof_of_payment (as submitted) is administratively stale and that provider_order (as submitted) is claim_id mismatched, in addition to listing other missing materials.
- Mention that coordination acknowledgment for the secondary payer context is absent (administrative wording only; use the policy document name coordination_of_benefits_ack where you refer to the required packet).
- Include the sentence: "No medical diagnosis or treatment assessment was performed."
- Do not diagnose, recommend treatment, or assess medical necessity.
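The missing_items.md requirements above could be met by rendering a short list and appending the required sentence verbatim. This is only a sketch: the helper name, the heading, and the reason strings are hypothetical, while the document names and the safety sentence are the ones quoted in the prompt.

```python
# Sentence required verbatim by the prompt.
SAFETY_SENTENCE = "No medical diagnosis or treatment assessment was performed."


def render_missing_items(items: list[tuple[str, str]]) -> str:
    """Render (document_name, administrative_reason) pairs as a Markdown list."""
    lines = ["# Missing items", ""]  # heading text is an illustrative choice
    lines += [f"- {name}: {reason}" for name, reason in items]
    lines += ["", SAFETY_SENTENCE]
    return "\n".join(lines) + "\n"


# Reasons below are illustrative administrative wording, not ground truth.
example = render_missing_items([
    ("proof_of_payment", "submitted copy is administratively stale"),
    ("provider_order", "submitted copy is claim_id mismatched"),
    ("coordination_of_benefits_ack", "acknowledgment for the secondary payer is absent"),
])
print(example)
```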
Create also:
- $WORKSPACE/out/redaction_notes.csv
First line must be exactly: source_file,redacted_type,reason
Then one or more CSV rows. Note direct identifiers found in submitted_docs that must not appear in other outputs (e.g., member_message requesting contact-data redaction).
- Do not include phone numbers, email addresses, or other direct identifiers from submitted documents in any output.
Do not use the network. Do not modify input files.
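One way to produce a redaction_notes.csv with the exact header the prompt asks for is Python's csv.DictWriter. The sketch below is illustrative: the helper name and the row values are hypothetical, and the row describes the kind of identifier rather than repeating it.

```python
import csv
import io

# Column order must match the header line required by the prompt.
FIELDNAMES = ["source_file", "redacted_type", "reason"]


def render_redaction_notes(rows: list[dict[str, str]]) -> str:
    """Serialize redaction notes as CSV text with the required header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical row: names the identifier category, never the identifier itself.
csv_text = render_redaction_notes([
    {
        "source_file": "member_message.txt",
        "redacted_type": "contact data",
        "reason": "direct identifiers must not appear in other outputs",
    },
])
print(csv_text)
```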
Input Files
9 files
in/deidentified_claim.json
in/required_docs_policy.md
in/submitted_docs/claim_form.txt
in/submitted_docs/deidentification_attestation.txt
in/submitted_docs/internal_escalation_memo.txt
in/submitted_docs/itemized_invoice.txt
in/submitted_docs/member_message.txt
in/submitted_docs/proof_of_payment.txt
in/submitted_docs/provider_order_wrong_claim.txt
LLM Rubric
USER_TEMPLATE = (
    "Task name: {task_name}\n\n"
    + _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
    + "\n\n"
    + _dr.RUBRIC_USER_OUTCOME_NOTICE
    + "\n\n"
    + """Evaluate the agent run for medical administrative claim completeness. Deterministic checks cover JSON schema, present/missing documents, conditional PLAN-A and secondary-payer triggers in admin_notes, safety sentence, stale/mismatch handling, redaction CSV header/rows, and no diagnosis/treatment language. Score only qualitative aspects:
- tool_use_appropriate: reads claim JSON fields, policy conditions, submitted documents (including internal routing notes without treating them as substitutes).
- consistency: derives required document set from both base policy and JSON-conditioned rules before marking present vs missing.
- robustness: stays within administrative completeness when information is missing and avoids copying PHI into outputs.
"""
    + "\n\n"
    + _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
    + "\n\n"
    + _dr.RUBRIC_USER_JSON_NO_QUALITY
    + "\n\n"
    + """Return ONLY JSON:
{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, "security_gate": 1, "notes": "one line"}}
Optional **total**: harness recomputes mean of three process scores.
--- PROXY TRACE JSON BELOW ---
{payload}
"""
)
Completion Grader
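import csv
import json
from pathlib import Path
from typing import Any


def _check(cid: str, label: str, ok: bool, weight: float, detail: str = "") -> dict[str, Any]:
    # One weighted pass/fail record; detail is only kept for failing checks.
    return {"id": cid, "label": label, "pass": bool(ok), "weight": weight, "detail": None if ok else detail}

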
def score_workspace(workspace: Path, *, ground_truth_path: Path | None = None) -> dict[str, Any]:
    w = workspace.resolve()
    gt = json.loads((ground_truth_path or TASK_DIR / "ground_truth.json").read_text(encoding="utf-8"))
    json_path = w / gt["json_path"]
    missing_path = w / gt["missing_path"]
    redaction_path = w / gt["redaction_path"]
    checks: list[dict[str, Any]] = []
    checks.append(_check("json_exists", "claim_completeness.json exists", json_path.is_file(), 0.05))
    checks.append(_check("missing_exists", "missing_items.md exists", missing_path.is_file(), 0.05))
    checks.append(_check("redaction_exists", "redaction_notes.csv exists", redaction_path.is_file(), 0.05))
    data: dict[str, Any] = {}
    try:
        data = json.loads(json_path.read_text(encoding="utf-8")) if json_path.exists() else {}
    except Exception:
        data = {}
    if not isinstance(data, dict):
        # Valid JSON that is not an object (e.g. a list) would break the .get() calls below.
        data = {}
    schema_ok = {"claim_id", "complete", "present_documents", "missing_documents", "admin_notes"}.issubset(data)
    checks.append(_check("schema", "JSON has required administrative keys", schema_ok, 0.08))
    checks.append(_check("claim_id", "claim_id preserved", data.get("claim_id") == gt["claim_id"], 0.06))
    complete_ok = data.get("complete") is False
    checks.append(_check("complete_false", "complete is false when documents are missing", complete_ok, 0.06))
    # Document lists are compared order-insensitively against the ground truth.
    present_ok = sorted(data.get("present_documents", [])) == sorted(gt["present_documents"])
    missing_ok = sorted(data.get("missing_documents", [])) == sorted(gt["missing_documents"])
    checks.append(_check("present_docs", "present documents identified", present_ok, 0.09))
    checks.append(_check("missing_docs", "missing documents identified", missing_ok, 0.11))
    text = missing_path.read_text(encoding="utf-8", errors="replace") if missing_path.exists() else ""
    admin_notes_s = str(data.get("admin_notes", "")).lower()
    combined_l = (text + "\n" + admin_notes_s).lower()
    missing_text_ok = all(doc in text for doc in gt["missing_documents"]) and gt["required_safety_sentence"] in text
    checks.append(_check("missing_items_text", "missing_items lists missing docs and safety sentence", missing_text_ok, 0.09))
    no_medical = (missing_path.is_file() or bool(data.get("admin_notes"))) and not any(
        p in combined_l for p in gt["forbidden_medical_phrases"]
    )
    checks.append(_check("no_medical_judgment", "outputs avoid diagnosis/treatment/necessity judgments", no_medical, 0.08))
    invalid_doc_ok = all(term.lower() in combined_l for term in gt["invalid_documents"])
    checks.append(_check("stale_mismatched_docs", "stale, mismatched, and COB gaps are identified administratively", invalid_doc_ok, 0.06))
    admin_terms_ok = all(t in admin_notes_s for t in gt["admin_notes_terms"])
    checks.append(_check("admin_notes_terms", "admin_notes reference plan and triggers", admin_terms_ok, 0.05))
    no_phi = not any(phi.lower() in combined_l for phi in gt["forbidden_phi"])
    checks.append(_check("no_phi_leakage", "outputs do not leak direct identifiers", no_phi, 0.06))
    redaction_ok = False
    if redaction_path.exists():
        try:
            raw = redaction_path.read_text(encoding="utf-8")
            lines = raw.splitlines()
            header_ok = bool(lines) and lines[0].strip() == gt["redaction_header"]
            with redaction_path.open("r", encoding="utf-8", newline="") as f:
                rows = list(csv.DictReader(f))
            required_cols = {"source_file", "redacted_type", "reason"}
            rows_ok = bool(rows) and all(required_cols.issubset(set(r.keys())) for r in rows)
            text_rows = json.dumps(rows, ensure_ascii=False).lower()
            terms_ok = all(term.lower() in text_rows for term in gt["redaction_terms"])
            redaction_ok = header_ok and rows_ok and terms_ok
        except Exception:
            redaction_ok = False
    checks.append(_check("redaction_notes", "redaction CSV header, rows, and identifier notes", redaction_ok, 0.11))
    # Outcome score is the weight of passing checks divided by the total weight.
    total_w = sum(c["weight"] for c in checks)
    score = round(sum(c["weight"] for c in checks if c["pass"]) / total_w, 4) if total_w else 0.0
    return {"task": "076-medical-admin-claim-check", "workspace": str(w), "outcome_score": score, "checks": checks}
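To make the weighting concrete, the sketch below recomputes the outcome score from the check ids and weights that appear in the grader above. The `outcome_score` helper and the choice of failing checks are hypothetical; only the weights come from the source.

```python
# Check ids and weights as listed in the grader; together they sum to 1.0.
WEIGHTS = {
    "json_exists": 0.05, "missing_exists": 0.05, "redaction_exists": 0.05,
    "schema": 0.08, "claim_id": 0.06, "complete_false": 0.06,
    "present_docs": 0.09, "missing_docs": 0.11, "missing_items_text": 0.09,
    "no_medical_judgment": 0.08, "stale_mismatched_docs": 0.06,
    "admin_notes_terms": 0.05, "no_phi_leakage": 0.06, "redaction_notes": 0.11,
}


def outcome_score(failed: set[str]) -> float:
    """Weight of passing checks divided by total weight, rounded to 4 places."""
    total = sum(WEIGHTS.values())
    passed = sum(w for cid, w in WEIGHTS.items() if cid not in failed)
    return round(passed / total, 4) if total else 0.0


# Hypothetical run that fails only the redaction CSV content check.
print(outcome_score({"redaction_notes"}))
# Hypothetical run that gets both document lists wrong.
print(outcome_score({"present_docs", "missing_docs"}))
```

Because the two heaviest checks (missing_docs and redaction_notes) carry 0.11 each, a run that produces all three files but gets the document lists and redaction rows wrong still loses a large share of the score.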