OpenClaw Multi‑Provider API Routing and Failover Audit

Software Engineering & Codebase Maintenance · Task 5 · Oracle + LLM scoring
Model Runs: 6 harnesses & 8 models evaluated on this task.
Prompt

Task: OpenClaw Multi‑Provider API Routing and Failover Audit

You are given a set of offline materials in $WORKSPACE/in/ that simulate real runtime records of OpenClaw connecting to three model providers: OpenAI, Anthropic, and Gemini. Do not go online; complete the analysis solely based on the local files.

Input files:

  • provider_capabilities.json: Provider capabilities and known limitations.
  • workload_catalog.json: Six OpenClaw workload types and their hard requirements.
  • gateway_config.json: Current OpenClaw routing configuration draft.
  • traces/*.jsonl: Simulated usage / cache / tool‑call / error records.
  • incident_notes.md: Online incident notes and cost anomaly remarks.
  • monthly_cost_report.csv: Per‑model cost per thousand tokens (input/output) for the last month.

Produce the following artifacts, all written to $WORKSPACE/out/:

  1. provider_scorecard.json
     • Top‑level fields must include workloads, provider_health, recommended_defaults.
     • workloads is an object; each workload id maps to:
       • primary_provider: Recommended primary provider.
       • fallback_provider: Recommended fallback provider.
       • reason_codes: Array of strings, at least 2 reasons.
       • risk_notes: Array of strings, at least 1 risk.
     • Must cover all six workloads: long_context_research, strict_json_tools, vision_pdf_triage, low_latency_alerts, cache_heavy_followups, cost_guarded_bulk.
  2. openclaw_config_patch.json
     • Provide a human‑readable JSON patch draft; it does not need to strictly follow RFC 6902, but must be a JSON object.
     • Must include, using nested object paths: agents.defaults.model.primary, agents.defaults.model.fallbacks, agents.defaults.models, diagnostics.cacheTrace.enabled; do not write as flat dotted‑key strings.
     • Must explicitly enable cache trace, and configure different cacheRetention (or equivalent comment) for at least two providers/models.
  3. failover_playbook.md
     • Write mitigation strategies for scenarios where the primary provider experiences timeout, structured output errors, cache hit drop, or vision task failure.
     • Must include a statement: “Do not use a single cross‑provider cache hit rate threshold.”
     • Must include a table listing workload, primary, fallback, and verification signal.
  4. audit_notes.md
     • Use short bullet points to list the current configuration issues you identified, the evidence source files, and recommended changes.
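
The nested-path requirement for openclaw_config_patch.json can be illustrated with a short sketch. Only the four required paths come from the task statement. The model identifiers, the cacheRetention values, and the shape of agents.defaults.models are placeholders and assumptions, and the choice of which provider appears as primary is purely illustrative.

```python
import json

# Placeholder identifiers: real model names must come from the files in $WORKSPACE/in/.
PRIMARY = "anthropic/<primary-model>"
FALLBACKS = ["openai/<fallback-model>", "gemini/<fallback-model>"]

patch = {
    "agents": {
        "defaults": {
            "model": {"primary": PRIMARY, "fallbacks": FALLBACKS},
            # Assumed shape: one entry per model, each with its own cacheRetention.
            "models": {
                PRIMARY: {"cacheRetention": "<provider-specific value A>"},
                FALLBACKS[0]: {"cacheRetention": "<provider-specific value B>"},
            },
        }
    },
    "diagnostics": {"cacheTrace": {"enabled": True}},
}


def lookup(data, dotted):
    """Walk a dotted path through nested dicts; return None if any step is missing."""
    cur = data
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


# A flat dotted key is a single top-level key, so a nested walk does not find it.
flat = {"diagnostics.cacheTrace.enabled": True}

print(json.dumps(patch, indent=2))
print("nested lookup:", lookup(patch, "diagnostics.cacheTrace.enabled"))
print("flat lookup:  ", lookup(flat, "diagnostics.cacheTrace.enabled"))
```

The nested form resolves every required path step by step, while the flat form yields nothing for the same dotted path, which is why the task asks for nested objects.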

Scoring focus:

  • Ability to differentiate cache, tool calling, structured output, and multimodal differences across API providers.
  • Ability to connect offline traces to OpenClaw configuration change recommendations.
  • Ability to form an actionable, auditable, rollback‑able failover strategy.
Input Files (9 files)
.DS_Store
in/gateway_config.json
in/incident_notes.md
in/monthly_cost_report.csv
in/provider_capabilities.json
in/traces/anthropic.jsonl
in/traces/gemini.jsonl
in/traces/openai.jsonl
in/workload_catalog.json
Hooks
from pathlib import Path
from typing import Any


def prepare_runtime(context: dict[str, Any]) -> dict[str, Any]:
    workspace = Path(context["workspace"])
    (workspace / "out").mkdir(parents=True, exist_ok=True)
    return {
        "WORKLOAD_COUNT": "6",
        "TRACE_DIR": str(workspace / "in" / "traces"),
    }


def after_round(context: dict[str, Any], runtime_state: dict[str, Any], adapter_result: Any) -> dict[str, Any]:
    return runtime_state


def cleanup_runtime(context: dict[str, Any], runtime_state: dict[str, Any]) -> None:
    pass
LLM Rubric
_REF = """
Task: Multi-provider routing audit — produce provider_scorecard.json, openclaw_config_patch.json, failover_playbook.md, audit_notes.md.
Compare Anthropic / OpenAI / Gemini traces (cache, structured output, tools, multimodal, latency) and propose routing + fallback + diagnostics.
""".strip()

USER_TEMPLATE = (
    "Task name: {task_name}\n\n"
    + _dr.RUBRIC_IGNORE_BOOTSTRAP_READS
    + "\n\n"
    + _REF
    + "\n\n"
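    + """**Outcome**: `oracle_grade.score_workspace` returns `outcome_score` and an optional **quality** (computed by the task from the specified artifacts and combined across the dimensions below into a 0–1 value); the harness blends the two using **HARNESSBENCH_OUTCOME_LLM_WEIGHT** / the task's **outcome_llm_weight**.
This JSON scores only the three process dimensions and **security_gate**.

Suggested reference when implementing **quality** inside the oracle:
1) **provider_reasoning**: whether vendor strengths/constraints are consistent with the traces/config.
2) **routing_quality**: whether the primary/secondary routing and fallbacks are reasonable.
3) **diagnostic_playbook**: whether the cache/usage diagnostic steps are actionable.
4) **artifact_coherence**: whether the scorecard, patch JSON, playbook, and notes contradict one another.

**tool_use_appropriate**: which traces/configs were read, and whether the write paths are reasonable.
**consistency**: evidence → analysis → output files.
**robustness**: full marks if there was no substantive failure; otherwise, whether the run recovered or explained the failure.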

"""
    + "\n\n"
    + _dr.RUBRIC_USER_SECURITY_GATE_DETAIL
    + "\n\n"
    + _dr.RUBRIC_USER_JSON_NO_QUALITY
    + "\n\n"
    + """Return ONLY JSON:
{{"scores": {{"tool_use_appropriate": 0.0, "consistency": 0.0, "robustness": 0.0}}, "security_gate": 1, "notes": "one line"}}

Optional **total**: mean of the three process scores; harness recomputes process mean and outcome blend.

--- PROXY TRACE JSON BELOW ---
{payload}"""
)
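
The doubled braces in the final segment of the template are escapes: when the string is filled in with str.format, `{{` and `}}` become literal braces, while `{task_name}` and `{payload}` are substituted. The source does not show the harness's formatting call, so the use of str.format below is an assumption, and the template is a shortened stand-in rather than the real USER_TEMPLATE.

```python
import json

# Shortened stand-in for USER_TEMPLATE; only the brace handling matters here.
template = (
    "Task name: {task_name}\n\n"
    "Return ONLY JSON:\n"
    '{{"scores": {{"tool_use_appropriate": 0.0}}, "security_gate": 1}}\n\n'
    "{payload}"
)

# Braces inside substituted values are not re-interpreted by str.format.
rendered = template.format(task_name="demo-task", payload='{"events": []}')

lines = rendered.split("\n")
example_json = json.loads(lines[3])  # the line that held the doubled braces

print(rendered)
print(example_json["security_gate"])
```

After formatting, the example response line is valid JSON with single braces, and the payload is appended unchanged.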
Completion Grader
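import json
import re
from pathlib import Path
from typing import Any

# _GT is defined elsewhere in the grader module (not shown here); it is passed to
# _load_json below, so it is presumably the path to the ground-truth JSON file.


def _load_json(path: Path) -> Any: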
    return json.loads(path.read_text(encoding="utf-8"))


def _get_path(data: dict[str, Any], dotted: str) -> Any:
    cur: Any = data
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def _provider_name(value: Any) -> str:
    text = str(value or "").lower()
    if "anthropic" in text or "claude" in text:
        return "anthropic"
    if "openai" in text or "gpt" in text:
        return "openai"
    if "gemini" in text or "google" in text:
        return "gemini"
    return text.strip()


def score_workspace(workspace: Path) -> dict[str, Any]:
    w = workspace.resolve()
    out = w / "out"
    gt = _load_json(_GT)
    checks: list[dict[str, Any]] = []

    weights = gt["scoring"]["weights"]
    workload_expectations: dict[str, dict[str, str]] = gt["required_workloads"]
    min_reason = int(gt.get("min_reason_codes_per_workload", 2))
    min_risk = int(gt.get("min_risk_notes_per_workload", 1))

    def add(cid: str, label: str, ok: bool, weight: float, detail: Any = None) -> None:
        checks.append({"id": cid, "label": label, "pass": bool(ok), "weight": weight, "detail": detail})

    scorecard_path = out / "provider_scorecard.json"
    scorecard_score = 0.0
    if scorecard_path.is_file():
        try:
            scorecard = _load_json(scorecard_path)
            workloads = scorecard.get("workloads", {}) if isinstance(scorecard, dict) else {}
            per_workload = 1.0 / max(len(workload_expectations), 1)
            for wid, exp in workload_expectations.items():
                row = workloads.get(wid, {}) if isinstance(workloads, dict) else {}
                primary_ok = _provider_name(row.get("primary_provider")) == exp["primary_provider"]
                fallback_ok = _provider_name(row.get("fallback_provider")) == exp["fallback_provider"]
                reasons_ok = isinstance(row.get("reason_codes"), list) and len(row["reason_codes"]) >= min_reason
                risks_ok = isinstance(row.get("risk_notes"), list) and len(row["risk_notes"]) >= min_risk
                row_score = (0.45 * primary_ok) + (0.25 * fallback_ok) + (0.20 * reasons_ok) + (0.10 * risks_ok)
                scorecard_score += per_workload * row_score
            health_ok = isinstance(scorecard.get("provider_health"), dict) and len(scorecard.get("provider_health", {})) >= 3
            defaults_ok = isinstance(scorecard.get("recommended_defaults"), dict) and bool(scorecard.get("recommended_defaults"))
            scorecard_score = min(1.0, scorecard_score * 0.85 + 0.10 * health_ok + 0.05 * defaults_ok)
            add("scorecard", "provider_scorecard.json workload routing and metadata", scorecard_score >= 0.70, weights["scorecard"], {"score": round(scorecard_score, 4)})
        except Exception as exc:
            add("scorecard_parse", "provider_scorecard.json parseable", False, weights["scorecard"], str(exc))
    else:
        add("scorecard_missing", "provider_scorecard.json exists", False, weights["scorecard"], "missing")

    patch_path = out / "openclaw_config_patch.json"
    patch_score = 0.0
    if patch_path.is_file():
        try:
            patch = _load_json(patch_path)
            key_hits = sum(1 for key in gt["required_patch_keys"] if _get_path(patch, key) is not None)
            cache_trace_ok = _get_path(patch, "diagnostics.cacheTrace.enabled") is True
            text = json.dumps(patch, ensure_ascii=False).lower()
            retention_mentions = len(re.findall(r"cacheretention|cache_retention|cache retention", text))
            provider_mentions = sum(1 for name in ("anthropic", "openai", "gemini") if name in text)
            patch_score = (
                0.45 * (key_hits / len(gt["required_patch_keys"]))
                + 0.25 * bool(cache_trace_ok)
                + 0.15 * min(retention_mentions / 2, 1)
                + 0.15 * min(provider_mentions / 3, 1)
            )
            add("config_patch", "openclaw_config_patch.json includes required routing/cache keys", patch_score >= 0.70, weights["config_patch"], {"score": round(patch_score, 4), "key_hits": key_hits})
        except Exception as exc:
            add("config_patch_parse", "openclaw_config_patch.json parseable", False, weights["config_patch"], str(exc))
    else:
        add("config_patch_missing", "openclaw_config_patch.json exists", False, weights["config_patch"], "missing")

    playbook_path = out / "failover_playbook.md"
    playbook_score = 0.0
    if playbook_path.is_file():
        text = playbook_path.read_text(encoding="utf-8", errors="replace")
        phrase_hits = sum(1 for phrase in gt["required_playbook_phrases"] if phrase.lower() in text.lower())
        table_ok = "|" in text and "primary" in text.lower() and "fallback" in text.lower()
        workload_hits = sum(1 for wid in workload_expectations if wid in text)
        playbook_score = 0.55 * (phrase_hits / len(gt["required_playbook_phrases"])) + 0.25 * bool(table_ok) + 0.20 * (workload_hits / len(workload_expectations))
        add("playbook", "failover_playbook.md has provider-specific failover guidance", playbook_score >= 0.70, weights["playbook"], {"score": round(playbook_score, 4), "phrase_hits": phrase_hits, "workload_hits": workload_hits})
    else:
        add("playbook_missing", "failover_playbook.md exists", False, weights["playbook"], "missing")

    notes_path = out / "audit_notes.md"
    notes_score = 0.0
    if notes_path.is_file():
        text = notes_path.read_text(encoding="utf-8", errors="replace").lower()
        evidence_ok = "traces/" in text or "provider_capabilities.json" in text or "gateway_config.json" in text
        recommendation_ok = "recommend" in text or "建议" in text or "修改" in text
        issue_count = len(re.findall(r"^-|\n-", text))
        notes_score = 0.4 * bool(evidence_ok) + 0.3 * bool(recommendation_ok) + 0.3 * min(issue_count / 4, 1)
        add("audit_notes", "audit_notes.md cites evidence and recommendations", notes_score >= 0.65, weights["audit_notes"], {"score": round(notes_score, 4)})
    else:
        add("audit_notes_missing", "audit_notes.md exists", False, weights["audit_notes"], "missing")

    total = (
        scorecard_score * weights["scorecard"]
        + patch_score * weights["config_patch"]
        + playbook_score * weights["playbook"]
        + notes_score * weights["audit_notes"]
    )
    thresholds = gt["scoring"]["thresholds"]
    level = "excellent" if total >= thresholds["excellent"] else "good" if total >= thresholds["good"] else "pass" if total >= thresholds["pass"] else "fail"
    return {
        "task": "018-provider-failover-audit",
        "workspace": str(w),
        "outcome_score": round(float(total), 4),
        "level": level,
        "checks": checks,
        "summary": {
            "scorecard": round(float(scorecard_score), 4),
            "config_patch": round(float(patch_score), 4),
            "playbook": round(float(playbook_score), 4),
            "audit_notes": round(float(notes_score), 4),
        },
    }
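
The scorecard arithmetic in score_workspace is easier to follow with numbers. The sketch below reuses the weights from the grader (0.45 / 0.25 / 0.20 / 0.10 per row, then 0.85 / 0.10 / 0.05 for rows, provider_health, and recommended_defaults). The helper names and the three scenarios are hypothetical and exist only for illustration.

```python
def row_score(primary_ok, fallback_ok, reasons_ok, risks_ok):
    """Per-workload score using the row weights from score_workspace."""
    return 0.45 * primary_ok + 0.25 * fallback_ok + 0.20 * reasons_ok + 0.10 * risks_ok


def scorecard_score(rows, health_ok=True, defaults_ok=True):
    """Average the row scores, then blend in the two metadata checks."""
    avg = sum(row_score(*r) for r in rows) / max(len(rows), 1)
    return min(1.0, avg * 0.85 + 0.10 * health_ok + 0.05 * defaults_ok)


# Each tuple is (primary_ok, fallback_ok, reasons_ok, risks_ok) for one workload.
scenarios = {
    "all_correct": [(True, True, True, True)] * 6,
    "wrong_fallbacks": [(True, False, True, True)] * 6,
    "wrong_primaries": [(False, True, True, True)] * 6,
}

results = {name: round(scorecard_score(rows), 4) for name, rows in scenarios.items()}
for name, value in results.items():
    # The grader marks the scorecard check as passed at 0.70 or above.
    print(f"{name}: {value} ({'pass' if value >= 0.70 else 'fail'})")
```

Under these weights, a scorecard with every fallback wrong still clears the 0.70 check (0.7875), whereas one with every primary wrong does not (0.6175), so the primary provider choice dominates this component.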