Model Runs
6 harnesses & 8 models evaluated on this task.
Prompt
Software Engineering & Codebase Maintenance · Task 5
Task: OpenClaw Multi‑Provider API Routing and Failover Audit
You are given a set of offline materials in $WORKSPACE/in/ that simulate real runtime records of OpenClaw connecting to three model providers: OpenAI, Anthropic, and Gemini. Do not go online; complete the analysis solely based on the local files.
Input files:
- provider_capabilities.json: Provider capabilities and known limitations.
- workload_catalog.json: Six OpenClaw workload types and their hard requirements.
- gateway_config.json: Current OpenClaw routing configuration draft.
- traces/*.jsonl: Simulated usage / cache / tool‑call / error records.
- incident_notes.md: Online incident notes and cost anomaly remarks.
- monthly_cost_report.csv: Per‑model per‑thousand token cost (input/output) for the last month.
Produce the following artifacts, all written to $WORKSPACE/out/:
provider_scorecard.json
- Top‑level fields must include workloads, provider_health, recommended_defaults.
- workloads is an object; each workload id maps to:
  - primary_provider: Recommended primary provider.
  - fallback_provider: Recommended fallback provider.
  - reason_codes: Array of strings, at least 2 reasons.
  - risk_notes: Array of strings, at least 1 risk.
- Must cover all six workloads: long_context_research, strict_json_tools, vision_pdf_triage, low_latency_alerts, cache_heavy_followups, cost_guarded_bulk.
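As a rough illustration of the shape described above, the following Python sketch builds a skeleton scorecard and checks the stated field requirements. The provider names and the reason and risk strings are placeholders, not recommendations derived from the input files. `validate_scorecard` and `REQUIRED_WORKLOADS` are hypothetical names introduced for this example, and the inner structure of provider_health and recommended_defaults is left empty because the task does not specify it.

```python
REQUIRED_WORKLOADS = [
    "long_context_research", "strict_json_tools", "vision_pdf_triage",
    "low_latency_alerts", "cache_heavy_followups", "cost_guarded_bulk",
]


def validate_scorecard(scorecard: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks fine."""
    problems: list[str] = []
    for field in ("workloads", "provider_health", "recommended_defaults"):
        if field not in scorecard:
            problems.append(f"missing top-level field: {field}")
    workloads = scorecard.get("workloads", {})
    for wid in REQUIRED_WORKLOADS:
        row = workloads.get(wid) if isinstance(workloads, dict) else None
        if not isinstance(row, dict):
            problems.append(f"missing workload: {wid}")
            continue
        for key in ("primary_provider", "fallback_provider"):
            if not row.get(key):
                problems.append(f"{wid}: missing {key}")
        if len(row.get("reason_codes", [])) < 2:
            problems.append(f"{wid}: needs at least 2 reason_codes")
        if len(row.get("risk_notes", [])) < 1:
            problems.append(f"{wid}: needs at least 1 risk_notes entry")
    return problems


# Placeholder skeleton: every value in angle brackets must be replaced
# with conclusions drawn from the offline materials.
skeleton = {
    "workloads": {
        wid: {
            "primary_provider": "<primary-provider>",
            "fallback_provider": "<fallback-provider>",
            "reason_codes": ["<reason-1>", "<reason-2>"],
            "risk_notes": ["<risk-1>"],
        }
        for wid in REQUIRED_WORKLOADS
    },
    "provider_health": {},
    "recommended_defaults": {},
}

print(validate_scorecard(skeleton))
```

A check like this only confirms the structure; it says nothing about whether the chosen providers are the right ones.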
openclaw_config_patch.json
- Provide a human‑readable JSON patch draft; it does not need to strictly follow RFC 6902, but must be a JSON object.
- Must include, using nested object paths: agents.defaults.model.primary, agents.defaults.model.fallbacks, agents.defaults.models, diagnostics.cacheTrace.enabled; do not write these as flat dotted‑key strings.
- Must explicitly enable cache trace, and configure different cacheRetention (or equivalent comment) for at least two providers/models.
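The difference between nested object paths and flat dotted keys can be sketched in Python with a path lookup similar to the `_get_path` helper in the Completion Grader section. Everything in angle brackets is a placeholder, and the layout under agents.defaults.models (model ids mapping to an object with cacheRetention) is an assumption made for illustration, not a documented OpenClaw schema.

```python
from typing import Any


def get_path(data: Any, dotted: str) -> Any:
    """Walk nested dicts along a dotted path; return None if a segment is missing."""
    cur = data
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


# Nested form: each path segment is its own object level.
nested_patch = {
    "agents": {
        "defaults": {
            "model": {
                "primary": "<primary-model>",
                "fallbacks": ["<fallback-model>"],
            },
            # Assumed layout: per-model settings keyed by model id.
            "models": {
                "<model-a>": {"cacheRetention": "<retention-a>"},
                "<model-b>": {"cacheRetention": "<retention-b>"},
            },
        }
    },
    "diagnostics": {"cacheTrace": {"enabled": True}},
}

# Flat form: the whole path is one key, which a nested lookup cannot resolve.
flat_patch = {"diagnostics.cacheTrace.enabled": True}

print(get_path(nested_patch, "diagnostics.cacheTrace.enabled"))  # True
print(get_path(flat_patch, "diagnostics.cacheTrace.enabled"))    # None
```

With the flat form, the first segment `diagnostics` is not a key of the object, so the lookup stops immediately and the setting is treated as absent.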
failover_playbook.md
- Write mitigation strategies for scenarios where the primary provider experiences timeout, structured output errors, cache hit drop, or vision task failure.
- Must include a statement: “Do not use a single cross‑provider cache hit rate threshold.”
- Must include a table listing workload, primary, fallback, and verification signal.
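One possible way to produce the required table is to render it from a list of rows, as in the Python sketch below. `render_table` and `WORKLOADS` are hypothetical names, and the cell values are placeholders; the real primary, fallback, and verification signal for each workload have to come from the analysis of the input files.

```python
WORKLOADS = [
    "long_context_research", "strict_json_tools", "vision_pdf_triage",
    "low_latency_alerts", "cache_heavy_followups", "cost_guarded_bulk",
]
HEADERS = ["workload", "primary", "fallback", "verification signal"]


def render_table(rows: list[dict[str, str]]) -> str:
    """Render rows as a pipe-delimited Markdown table with a header and separator."""
    lines = [
        "| " + " | ".join(HEADERS) + " |",
        "| " + " | ".join("---" for _ in HEADERS) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row[h] for h in HEADERS) + " |")
    return "\n".join(lines)


# Placeholder rows: one per workload, to be filled in from the audit.
rows = [
    {
        "workload": wid,
        "primary": "<primary>",
        "fallback": "<fallback>",
        "verification signal": "<signal>",
    }
    for wid in WORKLOADS
]

print(render_table(rows))
```

Generating the table from the same data that feeds provider_scorecard.json is one way to keep the two artifacts from contradicting each other.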
audit_notes.md
- Use short bullet points to list the current configuration issues you identified, the evidence source files, and recommended changes.
Scoring focus:
- Ability to differentiate cache, tool calling, structured output, and multimodal differences across API providers.
- Ability to connect offline traces to OpenClaw configuration change recommendations.
- Ability to form an actionable, auditable, rollback‑able failover strategy.
Input Files
9 files
.DS_Store
in/gateway_config.json
in/incident_notes.md
in/monthly_cost_report.csv
in/provider_capabilities.json
in/traces/anthropic.jsonl
in/traces/gemini.jsonl
in/traces/openai.jsonl
in/workload_catalog.json
Hooks
from pathlib import Path
from typing import Any

def prepare_runtime(context: dict[str, Any]) -> dict[str, Any]:
    # Ensure the output directory exists and expose runtime variables.
    workspace = Path(context["workspace"])
    (workspace / "out").mkdir(parents=True, exist_ok=True)
    return {
        "WORKLOAD_COUNT": "6",
        "TRACE_DIR": str(workspace / "in" / "traces"),
    }

def after_round(context: dict[str, Any], runtime_state: dict[str, Any], adapter_result: Any) -> dict[str, Any]:
    return runtime_state

def cleanup_runtime(context: dict[str, Any], runtime_state: dict[str, Any]) -> None:
    pass

LLM Rubric
_REF = """
Task: Multi-provider routing audit — produce provider_scorecard.json, openclaw_config_patch.json, failover_playbook.md, audit_notes.md.
Compare Anthropic / OpenAI / Gemini traces (cache, structured output, tools, multimodal, latency) and propose routing + fallback + diagnostics.
""".strip()
USER_TEMPLATE = (
"Task name: {task_name}\n\n"
+ _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
+ "\n\n"
+ _REF
+ "\n\n"
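+ """**Outcome**: `oracle_grade.score_workspace` returns `outcome_score` and an optional **quality** (computed by the task, which reads the designated artifacts and combines the dimensions below into a 0–1 value); the harness blends the two using **HARNESSBENCH_OUTCOME_LLM_WEIGHT** / the task's **outcome_llm_weight**.
This JSON grades only the three process dimensions and **security_gate**.
When implementing **quality** inside the oracle, consider the following:
1) **provider_reasoning**: whether vendor strengths/constraints are consistent with the trace/config.
2) **routing_quality**: whether the primary/secondary routing and fallback are reasonable.
3) **diagnostic_playbook**: whether the cache/usage diagnostic steps are actionable.
4) **artifact_coherence**: whether the scorecard, patch JSON, playbook, and notes contradict one another.
**tool_use_appropriate**: which trace/config files were read, and whether the write paths are reasonable.
**consistency**: evidence → analysis → output files.
**robustness**: full marks if there is no substantive failure; after a failure, whether the run recovered or explained it.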
"""
+ "\n\n"
+ _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
+ "\n\n"
+ _dr.RUBRIC_USER_JSON_NO_QUALITY
+ "\n\n"
+ """Return ONLY JSON:
{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, "security_gate": 1, "notes": "one line"}}
Optional **total**: mean of the three process scores; harness recomputes process mean and outcome blend.
--- PROXY TRACE JSON BELOW ---
{payload}"""
)

Completion Grader
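import json
import re
from pathlib import Path
from typing import Any

# _GT is defined elsewhere in the task module (not shown here);
# it is presumably the path to the ground-truth JSON file.

def _load_json(path: Path) -> Any:
    return json.loads(path.read_text(encoding="utf-8"))

def _get_path(data: dict[str, Any], dotted: str) -> Any:
    # Walk nested dicts along a dotted path; return None if a segment is missing.
    cur: Any = data
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur

def _provider_name(value: Any) -> str:
    # Normalize a provider or model string to "anthropic", "openai", or "gemini".
    text = str(value or "").lower()
    if "anthropic" in text or "claude" in text:
        return "anthropic"
    if "openai" in text or "gpt" in text:
        return "openai"
    if "gemini" in text or "google" in text:
        return "gemini"
    return text.strip()
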
def score_workspace(workspace: Path) -> dict[str, Any]:
    w = workspace.resolve()
    out = w / "out"
    gt = _load_json(_GT)
    checks: list[dict[str, Any]] = []
    weights = gt["scoring"]["weights"]
    workload_expectations: dict[str, dict[str, str]] = gt["required_workloads"]
    min_reason = int(gt.get("min_reason_codes_per_workload", 2))
    min_risk = int(gt.get("min_risk_notes_per_workload", 1))

    def add(cid: str, label: str, ok: bool, weight: float, detail: Any = None) -> None:
        checks.append({"id": cid, "label": label, "pass": bool(ok), "weight": weight, "detail": detail})

    # provider_scorecard.json: per-workload routing plus top-level metadata.
    scorecard_path = out / "provider_scorecard.json"
    scorecard_score = 0.0
    if scorecard_path.is_file():
        try:
            scorecard = _load_json(scorecard_path)
            workloads = scorecard.get("workloads", {}) if isinstance(scorecard, dict) else {}
            per_workload = 1.0 / max(len(workload_expectations), 1)
            for wid, exp in workload_expectations.items():
                row = workloads.get(wid, {}) if isinstance(workloads, dict) else {}
                primary_ok = _provider_name(row.get("primary_provider")) == exp["primary_provider"]
                fallback_ok = _provider_name(row.get("fallback_provider")) == exp["fallback_provider"]
                reasons_ok = isinstance(row.get("reason_codes"), list) and len(row["reason_codes"]) >= min_reason
                risks_ok = isinstance(row.get("risk_notes"), list) and len(row["risk_notes"]) >= min_risk
                # Row weights: primary 0.45, fallback 0.25, reasons 0.20, risks 0.10.
                row_score = (0.45 * primary_ok) + (0.25 * fallback_ok) + (0.20 * reasons_ok) + (0.10 * risks_ok)
                scorecard_score += per_workload * row_score
            health_ok = isinstance(scorecard.get("provider_health"), dict) and len(scorecard.get("provider_health", {})) >= 3
            defaults_ok = isinstance(scorecard.get("recommended_defaults"), dict) and bool(scorecard.get("recommended_defaults"))
            # Blend: 85% workload rows, 10% provider_health, 5% recommended_defaults.
            scorecard_score = min(1.0, scorecard_score * 0.85 + 0.10 * health_ok + 0.05 * defaults_ok)
            add("scorecard", "provider_scorecard.json workload routing and metadata", scorecard_score >= 0.70, weights["scorecard"], {"score": round(scorecard_score, 4)})
        except Exception as exc:
            add("scorecard_parse", "provider_scorecard.json parseable", False, weights["scorecard"], str(exc))
    else:
        add("scorecard_missing", "provider_scorecard.json exists", False, weights["scorecard"], "missing")

    patch_path = out / "openclaw_config_patch.json"
    patch_score = 0.0
    if patch_path.is_file():
        try:
            patch = _load_json(patch_path)
            key_hits = sum(1 for key in gt["required_patch_keys"] if _get_path(patch, key) is not None)
            cache_trace_ok = _get_path(patch, "diagnostics.cacheTrace.enabled") is True
            text = json.dumps(patch, ensure_ascii=False).lower()
            retention_mentions = len(re.findall(r"cacheretention|cache_retention|cache retention", text))
            provider_mentions = sum(1 for name in ("anthropic", "openai", "gemini") if name in text)
            patch_score = (
                0.45 * (key_hits / len(gt["required_patch_keys"]))
                + 0.25 * bool(cache_trace_ok)
                + 0.15 * min(retention_mentions / 2, 1)
                + 0.15 * min(provider_mentions / 3, 1)
            )
            add("config_patch", "openclaw_config_patch.json includes required routing/cache keys", patch_score >= 0.70, weights["config_patch"], {"score": round(patch_score, 4), "key_hits": key_hits})
        except Exception as exc:
            add("config_patch_parse", "openclaw_config_patch.json parseable", False, weights["config_patch"], str(exc))
    else:
        add("config_patch_missing", "openclaw_config_patch.json exists", False, weights["config_patch"], "missing")

    playbook_path = out / "failover_playbook.md"
    playbook_score = 0.0
    if playbook_path.is_file():
        text = playbook_path.read_text(encoding="utf-8", errors="replace")
        phrase_hits = sum(1 for phrase in gt["required_playbook_phrases"] if phrase.lower() in text.lower())
        table_ok = "|" in text and "primary" in text.lower() and "fallback" in text.lower()
        workload_hits = sum(1 for wid in workload_expectations if wid in text)
        playbook_score = (
            0.55 * (phrase_hits / len(gt["required_playbook_phrases"]))
            + 0.25 * bool(table_ok)
            + 0.20 * (workload_hits / len(workload_expectations))
        )
        add("playbook", "failover_playbook.md has provider-specific failover guidance", playbook_score >= 0.70, weights["playbook"], {"score": round(playbook_score, 4), "phrase_hits": phrase_hits, "workload_hits": workload_hits})
    else:
        add("playbook_missing", "failover_playbook.md exists", False, weights["playbook"], "missing")

    notes_path = out / "audit_notes.md"
    notes_score = 0.0
    if notes_path.is_file():
        text = notes_path.read_text(encoding="utf-8", errors="replace").lower()
        evidence_ok = "traces/" in text or "provider_capabilities.json" in text or "gateway_config.json" in text
        # "建议" means "recommend/suggest" and "修改" means "modify"; kept so Chinese notes also match.
        recommendation_ok = "recommend" in text or "建议" in text or "修改" in text
        # Counts "-" at the very start of the text or directly after a newline.
        issue_count = len(re.findall(r"^-|\n-", text))
        notes_score = 0.4 * bool(evidence_ok) + 0.3 * bool(recommendation_ok) + 0.3 * min(issue_count / 4, 1)
        add("audit_notes", "audit_notes.md cites evidence and recommendations", notes_score >= 0.65, weights["audit_notes"], {"score": round(notes_score, 4)})
    else:
        add("audit_notes_missing", "audit_notes.md exists", False, weights["audit_notes"], "missing")

    total = (
        scorecard_score * weights["scorecard"]
        + patch_score * weights["config_patch"]
        + playbook_score * weights["playbook"]
        + notes_score * weights["audit_notes"]
    )
    thresholds = gt["scoring"]["thresholds"]
    level = (
        "excellent" if total >= thresholds["excellent"]
        else "good" if total >= thresholds["good"]
        else "pass" if total >= thresholds["pass"]
        else "fail"
    )
    return {
        "task": "018-provider-failover-audit",
        "workspace": str(w),
        "outcome_score": round(float(total), 4),
        "level": level,
        "checks": checks,
        "summary": {
            "scorecard": round(float(scorecard_score), 4),
            "config_patch": round(float(patch_score), 4),
            "playbook": round(float(playbook_score), 4),
            "audit_notes": round(float(notes_score), 4),
        },
    }
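
To make the scorecard arithmetic in score_workspace concrete, the Python sketch below recomputes it for two hypothetical submissions. The weights (0.45, 0.25, 0.20, 0.10 per row, then a 0.85 / 0.10 / 0.05 blend) and the 0.70 pass condition are taken from the grader code above; `row_score`, `scorecard_score`, and the two scenarios are illustrative and not part of the grader.

```python
def row_score(primary_ok: bool, fallback_ok: bool, reasons_ok: bool, risks_ok: bool) -> float:
    """Per-workload score using the grader's row weights."""
    return 0.45 * primary_ok + 0.25 * fallback_ok + 0.20 * reasons_ok + 0.10 * risks_ok


def scorecard_score(rows: list[tuple[bool, bool, bool, bool]], health_ok: bool, defaults_ok: bool) -> float:
    """Average the rows, then blend with the two metadata checks."""
    per_workload = 1.0 / max(len(rows), 1)
    base = sum(per_workload * row_score(*row) for row in rows)
    return min(1.0, base * 0.85 + 0.10 * health_ok + 0.05 * defaults_ok)


# Scenario 1 (hypothetical): all six primaries match, every fallback is wrong.
primaries_right = scorecard_score([(True, False, True, True)] * 6, True, True)

# Scenario 2 (hypothetical): every primary is wrong, all fallbacks match.
primaries_wrong = scorecard_score([(False, True, True, True)] * 6, True, True)

print(round(primaries_right, 4), primaries_right >= 0.70)  # 0.7875 True
print(round(primaries_wrong, 4), primaries_wrong >= 0.70)  # 0.6175 False
```

In the first scenario each row scores 0.75, so the blend is 0.75 × 0.85 + 0.10 + 0.05 = 0.7875, which clears the 0.70 check. In the second each row scores 0.55, giving 0.55 × 0.85 + 0.15 = 0.6175, which fails it even though every other field is correct.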