Reliability Flywheel (Eval + Risk Gates)

Synced from github.com/CoWork-OS/CoWork-OS/docs

This document describes the reliability system added to CoWork OS. The system turns production failures into repeatable regression cases, gates risky task completions, and hardens releases.

Goals

  • Increase task reliability by replaying known failures continuously.
  • Apply stronger review only when risk justifies it.
  • Keep all reliability data local-first (no telemetry upload path required).
  • Convert reliability policy from docs-only guidance into merge and release gates.

Scope Implemented

  • Phase 1 foundation: eval schema, local corpus, replay runner, baseline metrics.
  • Phase 2 foundation: risk scoring and policy-driven tiered review gate.
  • Phase 3 foundation: prompt architecture modularization, deduped shared policy blocks, skill routing budgets.
  • Phase 4 foundation: nightly hardening workflow, PR-targeted eval gate, release hardening gate.
  • Reliability V2 hardening: balanced fail-closed completion for required contracts, split KPI tracking, and regression tags for contract/verification/dependency failures.

Architecture

Eval Data Model (SQLite)

Schema and migrations are in src/electron/database/schema.ts.

Added task-level reliability metadata:

  • tasks.risk_level
  • tasks.eval_case_id
  • tasks.eval_run_id

Added eval tables:

  • eval_cases
  • eval_suites
  • eval_runs
  • eval_case_runs

Added indexes:

  • idx_tasks_risk_level
  • idx_tasks_eval_case_id
  • idx_tasks_eval_run_id

Shared Types and IPC

Added in src/shared/types.ts:

  • AgentConfig.reviewPolicy?: "off" | "balanced" | "strict"
  • AgentConfig.entropySweepPolicy?: "off" | "balanced" | "strict"
  • AgentConfig.stepIntentAlignmentPolicy?: "off" | "balanced" | "strict"
  • AgentConfig.stepDecompositionPolicy?: "off" | "balanced" | "strict"
  • Task.riskLevel?: "low" | "medium" | "high"
  • Task.evalCaseId?: string
  • Task.evalRunId?: string
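For orientation, the fields listed above could be written out as follows. This is only a sketch: the interface names AgentConfigReliabilityFields and TaskReliabilityFields and the helper isLinkedToEval are hypothetical, and the real AgentConfig and Task types in src/shared/types.ts carry other fields not shown in this document.

```typescript
// The three policy levels and three risk levels come from the field list above.
type PolicyLevel = "off" | "balanced" | "strict";
type RiskLevel = "low" | "medium" | "high";

// Hypothetical name: only the reliability-related part of AgentConfig.
interface AgentConfigReliabilityFields {
  reviewPolicy?: PolicyLevel;
  entropySweepPolicy?: PolicyLevel;
  stepIntentAlignmentPolicy?: PolicyLevel;
  stepDecompositionPolicy?: PolicyLevel;
}

// Hypothetical name: only the reliability-related part of Task.
interface TaskReliabilityFields {
  riskLevel?: RiskLevel;
  evalCaseId?: string;
  evalRunId?: string;
}

// Hypothetical helper: true when a task carries any eval linkage.
function isLinkedToEval(task: TaskReliabilityFields): boolean {
  return task.evalCaseId !== undefined || task.evalRunId !== undefined;
}
```

All fields are optional, so existing configs and tasks that predate the reliability system remain valid.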

IPC channels:

  • eval:listSuites
  • eval:runSuite
  • eval:getRun
  • eval:getCase
  • eval:createCaseFromTask

Wired in:

  • src/electron/ipc/handlers.ts
  • src/electron/preload.ts

Eval Runtime Services

Deterministic local eval service:

  • src/electron/eval/EvalService.ts

Risk scoring + gate decision matrix:

  • src/electron/eval/risk.ts

Risk scoring defaults:

  • +2 shell/git mutation evidence
  • +2 more than 5 changed files
  • +2 tests expected but missing test evidence
  • +1 repeated tool failures (>2)

Risk levels:

  • 0-2: low
  • 3-5: medium
  • 6+: high
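The scoring defaults and level thresholds above can be sketched as a pair of pure functions. The RiskEvidence shape and the function names are assumptions made for illustration; the actual input type and exports of src/electron/eval/risk.ts are not shown in this document.

```typescript
type RiskLevel = "low" | "medium" | "high";

// Hypothetical evidence shape; field names are illustrative only.
interface RiskEvidence {
  hasShellOrGitMutation: boolean;
  changedFileCount: number;
  testsExpected: boolean;
  hasTestEvidence: boolean;
  repeatedToolFailures: number;
}

// Sum the default weights listed above.
function scoreRisk(e: RiskEvidence): number {
  let score = 0;
  if (e.hasShellOrGitMutation) score += 2; // shell/git mutation evidence
  if (e.changedFileCount > 5) score += 2; // more than 5 changed files
  if (e.testsExpected && !e.hasTestEvidence) score += 2; // tests expected but missing
  if (e.repeatedToolFailures > 2) score += 1; // repeated tool failures (>2)
  return score;
}

// Map a score onto the documented bands: 0-2 low, 3-5 medium, 6+ high.
function toRiskLevel(score: number): RiskLevel {
  if (score >= 6) return "high";
  if (score >= 3) return "medium";
  return "low";
}
```

With these defaults the maximum score is 7, and a task reaches high only when all three +2 signals fire.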

Review policies:

  • off: no extra gate behavior
  • balanced: quality pass for mutating tasks, strict contract for medium/high, verification agent for high
  • strict: quality pass for all, strict contract for all, verification/evidence for medium/high
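One way to read the three policies is as a decision matrix over policy, risk level, and whether the task mutates anything. The sketch below encodes that reading; GateDecision, decideGates, and the isMutating flag are hypothetical names, and how the real gate detects a mutating task is not specified here.

```typescript
type ReviewPolicy = "off" | "balanced" | "strict";
type RiskLevel = "low" | "medium" | "high";

// Hypothetical result shape: which extra gates apply to a completion.
interface GateDecision {
  qualityPass: boolean;
  strictContract: boolean;
  verification: boolean; // verification agent / evidence requirement
}

function decideGates(policy: ReviewPolicy, risk: RiskLevel, isMutating: boolean): GateDecision {
  const nonLow = risk !== "low";
  if (policy === "off") {
    // off: no extra gate behavior
    return { qualityPass: false, strictContract: false, verification: false };
  }
  if (policy === "balanced") {
    // balanced: quality pass for mutating tasks, strict contract for
    // medium/high, verification for high
    return { qualityPass: isMutating, strictContract: nonLow, verification: risk === "high" };
  }
  // strict: quality pass and strict contract for all, verification for medium/high
  return { qualityPass: true, strictContract: true, verification: nonLow };
}
```

For example, a mutating medium-risk task under balanced gets a quality pass and a strict contract but no verification agent.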

Post-completion entropy sweep:

  • off: disabled
  • balanced: run for high-risk or clearly mutating tasks
  • strict: run for mutating tasks and non-low-risk tasks
  • Default resolution: explicit config, then COWORK_ENTROPY_SWEEP_DEFAULT, then reviewPolicy
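The default resolution order can be sketched as a small fallback chain. This sketch assumes COWORK_ENTROPY_SWEEP_DEFAULT is read as a string (its naming suggests an environment variable), that unrecognized values are ignored, and that the final fallback is "off" when nothing is set; none of these details is confirmed by this document, and resolveEntropySweepPolicy is a hypothetical name.

```typescript
type Policy = "off" | "balanced" | "strict";

function isPolicy(value: unknown): value is Policy {
  return value === "off" || value === "balanced" || value === "strict";
}

// Resolution order from the list above: explicit config, then the
// COWORK_ENTROPY_SWEEP_DEFAULT value, then reviewPolicy.
function resolveEntropySweepPolicy(
  explicit: Policy | undefined,
  sweepDefault: string | undefined,
  reviewPolicy: Policy | undefined,
): Policy {
  if (explicit !== undefined) return explicit;
  if (isPolicy(sweepDefault)) return sweepDefault; // assumption: invalid values are skipped
  return reviewPolicy ?? "off"; // assumption: "off" when nothing is configured
}
```

In a Node process the second argument would typically be process.env.COWORK_ENTROPY_SWEEP_DEFAULT.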

Daemon Enforcement Path

Completion flow computes risk and applies gate policy in:

  • src/electron/agent/daemon.ts

The completion flow also passes verified-mode evidence bundles to the quality gate and the post-completion verifier, so the results of deterministic checks are visible on the final audit path.

After completion, the daemon may launch a read-only entropy sweep for the task's blast radius to look for stale docs, contradictions, and dead-code hints. The sweep is non-blocking and only reports findings.

Optional auto-policy defaults (for code/operations domains) can be enabled via env vars:

  • COWORK_REVIEW_POLICY_ENABLE_AUTO
  • COWORK_REVIEW_POLICY_AUTO_DEFAULT (balanced or strict)

Eval Corpus and Replay Workflows

Corpus Build

Script:

  • scripts/qa/build_eval_corpus.cjs

Behavior:

  • Extracts failed/partial/failure-class tasks into eval_cases.
  • Sanitizes prompts for secrets/PII before storing sanitized_prompt.
  • Links source task to case via tasks.eval_case_id.
  • Adds case to reliability-regressions suite.

Suite Replay

Script:

  • scripts/qa/run_eval_suite.cjs

Modes:

  • deterministic: evaluates case assertions against source task/events.
  • hooks: triggers replay tasks through hooks, then evaluates assertions.

Both scripts use the sqlite3 CLI (not better-sqlite3) and fail fast when the CLI is missing.

Reliability V2 tags promoted into eval assertions/metadata:

  • contract_unmet_write_required
  • missing_required_workspace_artifact
  • verification_required_fail
  • dependency_unavailable

Baseline Metrics

Computed in EvalService.getBaselineMetrics(...):

  • taskSuccessRate
  • toolFailureRateByTool
  • retriesPerTask
  • approvalDeadEndRate
  • verificationPassRate
  • agent_core_success_rate
  • dependency_availability_rate
  • verification_block_rate
  • artifact_contract_failure_rate
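This document does not give the formulas behind these metrics. As a rough illustration only, three of them could be computed from per-task records as below; the TaskRecord and ToolCall shapes, the baseline function, and the exact definitions (simple ratios and a mean) are assumptions, not the behavior of EvalService.getBaselineMetrics(...).

```typescript
// Hypothetical record shapes for illustration.
interface ToolCall { tool: string; ok: boolean }
interface TaskRecord { succeeded: boolean; retries: number; toolCalls: ToolCall[] }

// Avoid division by zero on empty inputs.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function baseline(tasks: TaskRecord[]) {
  const taskSuccessRate = ratio(tasks.filter((t) => t.succeeded).length, tasks.length);
  const retriesPerTask = ratio(tasks.reduce((sum, t) => sum + t.retries, 0), tasks.length);

  // Count calls and failures per tool name.
  const counts = new Map<string, { calls: number; failures: number }>();
  for (const task of tasks) {
    for (const call of task.toolCalls) {
      const entry = counts.get(call.tool) ?? { calls: 0, failures: 0 };
      entry.calls += 1;
      if (!call.ok) entry.failures += 1;
      counts.set(call.tool, entry);
    }
  }
  const toolFailureRateByTool: Record<string, number> = {};
  counts.forEach((entry, tool) => {
    toolFailureRateByTool[tool] = ratio(entry.failures, entry.calls);
  });

  return { taskSuccessRate, retriesPerTask, toolFailureRateByTool };
}
```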

Prompt and Skill Reliability Hardening

Modular Prompt Composition

Added shared prompt section module:

  • src/electron/agent/executor-prompt-sections.ts

Capabilities:

  • section-level token budgets
  • session-scoped memoization for stable sections
  • turn-scoped recomputation for dynamic sections
  • provider-aware stable-prefix prompt caching derived from section scopes
  • total prompt budget composition
  • optional section dropping by priority
  • truncation and dropped-section reporting
  • shared mode/domain policy builder
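To make the budgeting capabilities concrete, here is a minimal sketch of section-level truncation followed by priority-based dropping against a total budget. It assumes a crude estimate of about 4 characters per token and a hypothetical PromptSection shape; the real module's tokenizer, types, and drop rules may differ.

```typescript
// Hypothetical section shape; higher priority means more important.
interface PromptSection {
  id: string;
  text: string;
  budget: number; // per-section token budget
  priority: number;
  optional: boolean; // only optional sections may be dropped
}

interface ComposeResult {
  prompt: string;
  truncated: string[]; // ids of sections cut to their own budget
  dropped: string[]; // ids of sections removed to fit the total budget
}

// Assumption: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;
const estimateTokens = (text: string): number => Math.ceil(text.length / CHARS_PER_TOKEN);

function composePrompt(sections: PromptSection[], totalBudget: number): ComposeResult {
  const truncated: string[] = [];
  const dropped: string[] = [];

  // Step 1: clamp each section to its own budget.
  let kept: PromptSection[] = sections.map((s) => {
    if (estimateTokens(s.text) <= s.budget) return s;
    truncated.push(s.id);
    return { ...s, text: s.text.slice(0, s.budget * CHARS_PER_TOKEN) };
  });

  const totalTokens = (list: PromptSection[]): number =>
    list.reduce((sum, s) => sum + estimateTokens(s.text), 0);

  // Step 2: drop optional sections, lowest priority first, until the total fits.
  const candidates = kept.filter((s) => s.optional).sort((a, b) => a.priority - b.priority);
  for (const candidate of candidates) {
    if (totalTokens(kept) <= totalBudget) break;
    kept = kept.filter((s) => s.id !== candidate.id);
    dropped.push(candidate.id);
  }

  // Separators are not counted against the budget in this sketch.
  return { prompt: kept.map((s) => s.text).join("\n\n"), truncated, dropped };
}
```

Returning the truncated and dropped ids alongside the prompt is what makes the reporting capability possible: callers can log exactly which sections were cut and why.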

Executor Integration

Wired into src/electron/agent/executor.ts:

  • shared policy core reused across planning/execution/follow-up prompts
  • shared section builder reused by execution and follow-up turns
  • explicit section budgets (role/context/memory/playbook/infra/personality/guidelines/tool descriptions)
  • plan prompt total budget (PLAN_SYSTEM_PROMPT_TOTAL_BUDGET)
  • execution/follow-up system prompt total budget (EXECUTION_SYSTEM_PROMPT_TOTAL_BUDGET)

Tool Prompt Rendering

Tool guidance now follows a shared render pipeline instead of duplicating routing hints across prompt templates.

  • tool-local internal prompt metadata is attached to LLMTool
  • one render source produces both compact planning text and final provider-facing descriptions
  • rendering happens only after tool visibility and policy filtering
  • rendered tool arrays are cached against tool-catalog version plus stable render context

This keeps prompt guidance closer to the tool definition while reducing executor-prompt duplication.
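The caching rule in the last bullet can be illustrated with a small memo keyed on catalog version plus render context. RenderCache and its key format are hypothetical; the sketch only shows why bumping the catalog version forces a re-render while repeated requests in the same context do not.

```typescript
// Hypothetical cache: one entry per (catalog version, render context) pair.
class RenderCache<T> {
  private readonly entries = new Map<string, T>();

  get(catalogVersion: number, contextKey: string, render: () => T): T {
    const key = `${catalogVersion}:${contextKey}`;
    if (this.entries.has(key)) {
      return this.entries.get(key) as T; // cache hit: skip rendering
    }
    const rendered = render();
    this.entries.set(key, rendered);
    return rendered;
  }
}
```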

The same prompt architecture now also feeds provider-side prompt caching: stable session sections form the cacheable prefix, dynamic turn sections stay uncached, and cache telemetry (cachedTokens, cacheWriteTokens) is available to cost accounting when providers expose it.

Adaptive Output Budget Recovery

Execution and follow-up turns now share one provider-aware output-budget policy instead of relying on provider defaults or one-off token floors.

Added policy module:

  • src/electron/agent/llm/output-token-policy.ts

Wired through:

  • src/electron/agent/executor-llm-turn-utils.ts
  • src/electron/agent/runtime/SessionRuntime.ts
  • src/electron/agent/runtime/turn-kernel.ts
  • src/electron/agent/executor-loop-utils.ts

Capabilities:

  • provider-family budgeting for Anthropic, Bedrock Claude, OpenAI, Azure OpenAI, Gemini, OpenRouter, and a conservative generic fallback
  • centralized transport-field mapping for max_tokens, max_completion_tokens, and max_output_tokens
  • explicit output budgets on agentic turns in adaptive mode instead of backend defaults
  • one same-request escalation on truncation before continuation prompting
  • truncation classification that distinguishes visible partial output from reasoning-budget exhaustion
  • targeted exhaustion guidance when a retry still produces no usable answer text
  • structured runtime logging for budget choice, escalation, truncation classification, and continuation fallback decisions

This improves truncation recovery while keeping adapter hard caps, task-level overrides, and context-headroom clamps in place.
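Two of these capabilities, transport-field mapping and the single same-request escalation, can be sketched as follows. BudgetPolicy, budgetParams, and nextStep are hypothetical names, and which provider family uses which field is deliberately left to the caller rather than asserted here.

```typescript
// The three transport fields named in the capability list above.
type TransportField = "max_tokens" | "max_completion_tokens" | "max_output_tokens";

// Hypothetical per-provider policy.
interface BudgetPolicy {
  field: TransportField;
  initial: number; // budget for the first attempt
  escalated: number; // budget for the single escalation attempt
  hardCap: number; // adapter hard cap that is never exceeded
}

// Build the request fragment for a given attempt (0 = first, 1 = escalation).
function budgetParams(policy: BudgetPolicy, attempt: 0 | 1): Record<string, number> {
  const requested = attempt === 0 ? policy.initial : policy.escalated;
  return { [policy.field]: Math.min(requested, policy.hardCap) };
}

type NextStep = "complete" | "escalate" | "continue";

// One escalation on truncation, then fall back to continuation prompting.
function nextStep(truncated: boolean, attempt: number): NextStep {
  if (!truncated) return "complete";
  return attempt === 0 ? "escalate" : "continue";
}
```

Clamping inside budgetParams keeps the hard cap in force even when the escalated budget is configured above it.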

Skill Routing Controls

Skill shortlist and budget controls are in:

  • src/electron/agent/custom-skill-loader.ts
  • src/electron/agent/tools/registry.ts

Defaults:

  • shortlist size 20
  • low-confidence threshold 0.55
  • fallback instruction to use skill_list
  • hard cap on injected skill text
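A minimal sketch of how the shortlist defaults might interact is shown below. It assumes each candidate skill has a relevance score between 0 and 1 and that "low confidence" means the best score falls below the threshold; the ScoredSkill shape and buildShortlist are hypothetical, and the real loader may score and flag skills differently.

```typescript
// Defaults from the list above.
const SHORTLIST_SIZE = 20;
const LOW_CONFIDENCE_THRESHOLD = 0.55;

// Hypothetical scored candidate.
interface ScoredSkill { id: string; score: number }

interface Shortlist {
  skills: ScoredSkill[];
  lowConfidence: boolean; // when true, instruct the agent to call skill_list
}

function buildShortlist(candidates: ScoredSkill[]): Shortlist {
  // Sort a copy by descending score and keep the top entries.
  const skills = [...candidates].sort((a, b) => b.score - a.score).slice(0, SHORTLIST_SIZE);
  const bestScore = skills[0]?.score ?? 0;
  return { skills, lowConfidence: bestScore < LOW_CONFIDENCE_THRESHOLD };
}
```

When lowConfidence is set, the prompt would carry the fallback instruction to use skill_list instead of relying on the shortlist alone.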

CI, Nightly, and Release Gates

PR Regression Policy Gate

New CI job in .github/workflows/ci.yml:

  • Regression Policy Gate

Enforcement script:

  • scripts/qa/enforce_eval_regression_policy.cjs

Policy:

  • If a PR indicates that it fixes a production failure or incident, it must add or update at least one eval case JSON file under scripts/qa/eval-cases/.
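The core of this policy reduces to a check over the PR's changed file paths. The sketch below assumes the incident-fix signal has already been extracted into a boolean (how the real script reads it from the PR is not described here) and uses a hypothetical function name.

```typescript
const EVAL_CASE_DIR = "scripts/qa/eval-cases/";

// Returns true when the PR satisfies the regression policy.
function satisfiesRegressionPolicy(isIncidentFix: boolean, changedFiles: string[]): boolean {
  if (!isIncidentFix) return true; // policy only applies to incident fixes
  return changedFiles.some((file) => file.startsWith(EVAL_CASE_DIR) && file.endsWith(".json"));
}
```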

PR template updated in:

  • .github/PULL_REQUEST_TEMPLATE.md

Targeted Eval Gate

The existing targeted eval gate now runs on Node 24 and installs the sqlite3 CLI before replay:

  • .github/workflows/ci.yml

Path trigger:

  • src/electron/agent/**
  • src/electron/agent/tools/**

Nightly Hardening

Workflow:

  • .github/workflows/nightly-hardening.yml

Runs:

  • eval corpus build
  • deterministic eval suite
  • battery suite (when hooks secrets exist)

Artifacts:

  • grouped human-readable summary (summary.md)
  • machine-readable report (report.json)

Stability-window behavior:

  • non-blocking before cutoff
  • blocking after cutoff (HARDENING_REQUIRED_AFTER_UTC)
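The stability window amounts to a date comparison that decides whether a hardening failure fails the job. This sketch assumes HARDENING_REQUIRED_AFTER_UTC holds an ISO 8601 timestamp, that a missing or unparseable value means non-blocking, and that the cutoff instant itself counts as blocking; the function names are hypothetical.

```typescript
type Enforcement = "blocking" | "non-blocking";

function enforcementAt(now: Date, cutoffUtc: string | undefined): Enforcement {
  if (!cutoffUtc) return "non-blocking"; // assumption: no cutoff configured
  const cutoffMs = Date.parse(cutoffUtc);
  if (Number.isNaN(cutoffMs)) return "non-blocking"; // assumption: ignore bad values
  return now.getTime() >= cutoffMs ? "blocking" : "non-blocking";
}

// A hardening failure only fails the job once enforcement is blocking.
function exitCodeFor(hardeningPassed: boolean, mode: Enforcement): number {
  return !hardeningPassed && mode === "blocking" ? 1 : 0;
}
```

Before the cutoff a failing run still produces its summary and report but exits successfully, which lets the team watch trends without blocking work.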

Release Hardening Gate

Workflow:

  • .github/workflows/release.yml

Added job:

  • Hardening Release Gate

Behavior:

  • runs deterministic eval and battery checks
  • applies same date-based strictness window
  • blocks release after cutoff when hardening fails

Local Developer Commands

# Build eval corpus from recent failures
npm run qa:eval:build -- --window-days 30 --limit 300 --suite reliability-regressions

# Run deterministic replay
npm run qa:eval:run -- --suite reliability-regressions --mode deterministic

# Enforce PR production-failure regression policy (CI uses PR event context)
npm run qa:eval:enforce-regressions

# Full reliability loop (eval + battery)
npm run qa:reliability

Optional DB override:

COWORK_DB_PATH=/tmp/cowork-eval.db npm run qa:eval:run -- --suite reliability-regressions --mode deterministic

Local-Only Data Policy

  • Reliability data is stored in local SQLite (userData/cowork-os.db).
  • Eval corpus entries are sanitized before persistence.
  • No required telemetry upload path is introduced by this reliability system.

Remaining Non-Code Work

The following require runtime history, not new code:

  • 90-day KPI attainment proof (+15% eval pass, -30% repeated tool failure loops, -25% verification-failed-after-complete).
  • Trend monitoring and policy tuning over real task volume.

Source Map

Core implementation files:

  • src/electron/eval/EvalService.ts
  • src/electron/eval/risk.ts
  • src/electron/agent/daemon.ts
  • src/electron/agent/executor.ts
  • src/electron/agent/executor-prompt-sections.ts
  • src/electron/database/schema.ts
  • src/electron/database/repositories.ts
  • src/electron/ipc/handlers.ts
  • src/electron/preload.ts
  • src/shared/types.ts

Operational scripts and workflows:

  • scripts/qa/build_eval_corpus.cjs
  • scripts/qa/run_eval_suite.cjs
  • scripts/qa/enforce_eval_regression_policy.cjs
  • .github/workflows/ci.yml
  • .github/workflows/nightly-hardening.yml
  • .github/workflows/release.yml