Reliability Flywheel (Eval + Risk Gates)
Synced from github.com/CoWork-OS/CoWork-OS/docs
This document describes the reliability system added to CoWork OS to turn production failures into repeatable regressions, gate risky task completions, and harden releases.
Goals
- Increase task reliability by replaying known failures continuously.
- Apply stronger review only when risk justifies it.
- Keep all reliability data local-first (no telemetry upload path required).
- Convert reliability policy from docs-only guidance into merge and release gates.
Scope Implemented
- Phase 1 foundation: eval schema, local corpus, replay runner, baseline metrics.
- Phase 2 foundation: risk scoring and policy-driven tiered review gate.
- Phase 3 foundation: prompt architecture modularization, deduped shared policy blocks, skill routing budgets.
- Phase 4 foundation: nightly hardening workflow, PR-targeted eval gate, release hardening gate.
- Reliability V2 hardening: balanced fail-closed completion for required contracts, split KPI tracking, and regression tags for contract/verification/dependency failures.
Architecture
Eval Data Model (SQLite)
Schema and migrations are in src/electron/database/schema.ts.
Added task-level reliability metadata:
- tasks.risk_level
- tasks.eval_case_id
- tasks.eval_run_id
Added eval tables:
- eval_cases
- eval_suites
- eval_runs
- eval_case_runs
Added indexes:
- idx_tasks_risk_level
- idx_tasks_eval_case_id
- idx_tasks_eval_run_id
Shared Types and IPC
Added in src/shared/types.ts:
- AgentConfig.reviewPolicy?: "off" | "balanced" | "strict"
- AgentConfig.entropySweepPolicy?: "off" | "balanced" | "strict"
- AgentConfig.stepIntentAlignmentPolicy?: "off" | "balanced" | "strict"
- AgentConfig.stepDecompositionPolicy?: "off" | "balanced" | "strict"
- Task.riskLevel?: "low" | "medium" | "high"
- Task.evalCaseId?: string
- Task.evalRunId?: string
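Read as TypeScript declarations, the additions above might look like the sketch below. Only the listed fields come from this document; the `PolicyLevel` and `RiskLevel` alias names are hypothetical, and every other field of the real interfaces in src/shared/types.ts is omitted.

```typescript
// Shared three-level policy setting. The alias name is hypothetical;
// the document only lists the literal union on each field.
type PolicyLevel = "off" | "balanced" | "strict";

// Alias name is also hypothetical.
type RiskLevel = "low" | "medium" | "high";

// Only the reliability-related fields are shown here.
interface AgentConfig {
  reviewPolicy?: PolicyLevel;
  entropySweepPolicy?: PolicyLevel;
  stepIntentAlignmentPolicy?: PolicyLevel;
  stepDecompositionPolicy?: PolicyLevel;
}

interface Task {
  riskLevel?: RiskLevel;
  evalCaseId?: string;
  evalRunId?: string;
}

// Example values: a strict review configuration and a high-risk task
// linked to an eval run (the run id is made up).
const exampleConfig: AgentConfig = { reviewPolicy: "strict" };
const exampleTask: Task = { riskLevel: "high", evalRunId: "run-001" };
```

All fields are optional, so existing configs and tasks that predate the reliability system stay valid.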
IPC channels:
- eval:listSuites
- eval:runSuite
- eval:getRun
- eval:getCase
- eval:createCaseFromTask
Wired in:
- src/electron/ipc/handlers.ts
- src/electron/preload.ts
Eval Runtime Services
Deterministic local eval service:
src/electron/eval/EvalService.ts
Risk scoring + gate decision matrix:
src/electron/eval/risk.ts
Risk scoring defaults:
- +2: shell/git mutation evidence
- +2: more than 5 changed files
- +2: tests expected but missing test evidence
- +1: repeated tool failures (>2)
Risk levels:
- 0-2: low
- 3-5: medium
- 6+: high
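The scoring defaults and level thresholds above can be sketched as two small functions. This is an illustration, not the code in src/electron/eval/risk.ts: the `RiskEvidence` shape and its field names are hypothetical, and "repeated tool failures (>2)" is assumed to mean a failure count greater than 2.

```typescript
type RiskLevel = "low" | "medium" | "high";

// Evidence signals named in the scoring defaults. Field names are hypothetical.
interface RiskEvidence {
  hasShellOrGitMutation: boolean;
  changedFileCount: number;
  testsExpected: boolean;
  hasTestEvidence: boolean;
  toolFailureCount: number;
}

// Sum the default weights for every signal that is present.
function scoreRisk(e: RiskEvidence): number {
  let score = 0;
  if (e.hasShellOrGitMutation) score += 2; // shell/git mutation evidence
  if (e.changedFileCount > 5) score += 2; // more than 5 changed files
  if (e.testsExpected && !e.hasTestEvidence) score += 2; // tests expected but missing
  if (e.toolFailureCount > 2) score += 1; // repeated tool failures (assumed: count > 2)
  return score;
}

// Map a score onto the documented bands: 0-2 low, 3-5 medium, 6+ high.
function riskLevelFor(score: number): RiskLevel {
  if (score >= 6) return "high";
  if (score >= 3) return "medium";
  return "low";
}

// A task that mutated git state, changed 8 files, and skipped expected tests.
const exampleScore = scoreRisk({
  hasShellOrGitMutation: true,
  changedFileCount: 8,
  testsExpected: true,
  hasTestEvidence: false,
  toolFailureCount: 0,
});
const exampleLevel = riskLevelFor(exampleScore);
```

With these weights the maximum score is 7, so a task reaches the high band only when all three +2 signals are present.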
Review policies:
- off: no extra gate behavior
- balanced: quality pass for mutating tasks, strict contract for medium/high, verification agent for high
- strict: quality pass for all, strict contract for all, verification/evidence for medium/high
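The policy descriptions above amount to a small decision matrix. The sketch below encodes it as a pure function; the `GateDecision` shape and the `isMutating` flag are assumptions made for illustration, and the single `verification` flag stands in for both the "verification agent" and "verification/evidence" steps.

```typescript
type ReviewPolicy = "off" | "balanced" | "strict";
type RiskLevel = "low" | "medium" | "high";

// Which extra gates to apply at completion. The shape is hypothetical.
interface GateDecision {
  qualityPass: boolean;
  strictContract: boolean;
  verification: boolean;
}

function decideGates(policy: ReviewPolicy, risk: RiskLevel, isMutating: boolean): GateDecision {
  const elevated = risk !== "low"; // medium or high
  if (policy === "off") {
    return { qualityPass: false, strictContract: false, verification: false };
  }
  if (policy === "balanced") {
    return {
      qualityPass: isMutating, // quality pass for mutating tasks
      strictContract: elevated, // strict contract for medium/high
      verification: risk === "high", // verification agent for high
    };
  }
  // strict: quality pass and strict contract for all, verification for medium/high
  return { qualityPass: true, strictContract: true, verification: elevated };
}

// A medium-risk mutating task under the balanced policy.
const exampleDecision = decideGates("balanced", "medium", true);
```

Keeping the matrix in one pure function makes each policy cell easy to cover with a unit test.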
Post-completion entropy sweep:
- off: disabled
- balanced: run for high-risk or clearly mutating tasks
- strict: run for mutating tasks and non-low-risk tasks
- Default resolution: explicit config, then COWORK_ENTROPY_SWEEP_DEFAULT, then reviewPolicy
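The resolution order and the per-policy trigger rules for the entropy sweep can be sketched as follows. The function names are hypothetical, the environment is passed in as a plain object to keep the example self-contained, and the final "off" fallback (when none of the three sources is set) is an assumption.

```typescript
type PolicyLevel = "off" | "balanced" | "strict";
type RiskLevel = "low" | "medium" | "high";

function isPolicyLevel(value: string | undefined): value is PolicyLevel {
  return value === "off" || value === "balanced" || value === "strict";
}

// Resolution order: explicit config, then COWORK_ENTROPY_SWEEP_DEFAULT, then reviewPolicy.
function resolveEntropySweepPolicy(
  explicit: PolicyLevel | undefined,
  env: Record<string, string | undefined>,
  reviewPolicy: PolicyLevel | undefined,
): PolicyLevel {
  if (explicit) return explicit;
  const fromEnv = env["COWORK_ENTROPY_SWEEP_DEFAULT"];
  if (isPolicyLevel(fromEnv)) return fromEnv;
  return reviewPolicy ?? "off"; // assumption: nothing configured means disabled
}

// Decide whether the post-completion sweep should run for a task.
function shouldRunEntropySweep(policy: PolicyLevel, risk: RiskLevel, isMutating: boolean): boolean {
  if (policy === "off") return false;
  if (policy === "balanced") return risk === "high" || isMutating;
  return isMutating || risk !== "low"; // strict
}

// No explicit setting, so the environment default wins over reviewPolicy.
const resolvedPolicy = resolveEntropySweepPolicy(
  undefined,
  { COWORK_ENTROPY_SWEEP_DEFAULT: "strict" },
  "balanced",
);
```

In an Electron main process the `env` argument would typically be `process.env`.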
Daemon Enforcement Path
Completion flow computes risk and applies gate policy in:
src/electron/agent/daemon.ts
Completion flow also passes verified-mode evidence bundles into the quality gate and post-completion verifier, so deterministic checks are visible to the final audit path.
After completion, the daemon may launch a read-only entropy sweep for the task's blast radius to look for stale docs, contradictions, and dead-code hints. The sweep is non-blocking and only reports findings.
Optional auto-policy defaults (for code/operations domains) can be enabled via env vars:
- COWORK_REVIEW_POLICY_ENABLE_AUTO
- COWORK_REVIEW_POLICY_AUTO_DEFAULT (balanced or strict)
Eval Corpus and Replay Workflows
Corpus Build
Script:
scripts/qa/build_eval_corpus.cjs
Behavior:
- Extracts failed/partial/failure-class tasks into eval_cases.
- Sanitizes prompts for secrets/PII before storing sanitized_prompt.
- Links source task to case via tasks.eval_case_id.
- Adds case to reliability-regressions suite.
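As an illustration of the sanitization step, the sketch below redacts a few common secret and PII shapes before a prompt would be stored. The rules actually used by scripts/qa/build_eval_corpus.cjs are not described in this document; the patterns, labels, and function name here are all assumptions.

```typescript
// Hypothetical redaction rules: email addresses, AWS-style access key ids,
// and bearer tokens. A real corpus builder would likely need more.
const REDACTION_RULES: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "AWS_KEY", pattern: /\bAKIA[0-9A-Z]{16}\b/g },
  { label: "BEARER", pattern: /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g },
];

// Apply every rule in order, replacing each match with a labeled placeholder.
function sanitizePrompt(prompt: string): string {
  return REDACTION_RULES.reduce(
    (text, rule) => text.replace(rule.pattern, `[REDACTED_${rule.label}]`),
    prompt,
  );
}

const sanitizedExample = sanitizePrompt(
  "mail alice@example.com with key AKIAABCDEFGHIJKLMNOP",
);
// sanitizedExample === "mail [REDACTED_EMAIL] with key [REDACTED_AWS_KEY]"
```

Labeled placeholders keep the sanitized prompt readable during replay while removing the sensitive value itself.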
Suite Replay
Script:
scripts/qa/run_eval_suite.cjs
Modes:
- deterministic: evaluates case assertions against source task/events.
- hooks: triggers replay tasks through hooks, then evaluates assertions.
Both scripts use the sqlite3 CLI (not better-sqlite3) and fail fast when the CLI is missing.
Reliability V2 tags promoted into eval assertions/metadata:
- contract_unmet_write_required
- missing_required_workspace_artifact
- verification_required_fail
- dependency_unavailable
Baseline Metrics
Computed in EvalService.getBaselineMetrics(...):
- taskSuccessRate
- toolFailureRateByTool
- retriesPerTask
- approvalDeadEndRate
- verificationPassRate
- agent_core_success_rate
- dependency_availability_rate
- verification_block_rate
- artifact_contract_failure_rate
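To make two of these metrics concrete, the sketch below computes taskSuccessRate and toolFailureRateByTool from in-memory records. The record shapes and the exact definitions (successes over total tasks, failures over calls per tool) are assumptions; the formulas in EvalService.getBaselineMetrics(...) may differ.

```typescript
// Hypothetical record shapes; the real service reads from SQLite.
interface ToolCallRecord {
  tool: string;
  failed: boolean;
}

interface TaskRecord {
  succeeded: boolean;
  toolCalls: ToolCallRecord[];
}

// Fraction of tasks that succeeded (0 for an empty set).
function taskSuccessRate(tasks: TaskRecord[]): number {
  if (tasks.length === 0) return 0;
  return tasks.filter((t) => t.succeeded).length / tasks.length;
}

// Per-tool failure rate: failed calls divided by total calls for that tool.
function toolFailureRateByTool(tasks: TaskRecord[]): Record<string, number> {
  const calls: Record<string, number> = {};
  const failures: Record<string, number> = {};
  for (const task of tasks) {
    for (const call of task.toolCalls) {
      calls[call.tool] = (calls[call.tool] ?? 0) + 1;
      failures[call.tool] = (failures[call.tool] ?? 0) + (call.failed ? 1 : 0);
    }
  }
  const rates: Record<string, number> = {};
  for (const tool of Object.keys(calls)) {
    rates[tool] = (failures[tool] ?? 0) / (calls[tool] ?? 1);
  }
  return rates;
}

const sampleTasks: TaskRecord[] = [
  {
    succeeded: true,
    toolCalls: [
      { tool: "shell", failed: false },
      { tool: "shell", failed: true },
    ],
  },
  { succeeded: false, toolCalls: [{ tool: "git", failed: true }] },
];
const sampleSuccessRate = taskSuccessRate(sampleTasks); // 0.5
const sampleToolRates = toolFailureRateByTool(sampleTasks); // { shell: 0.5, git: 1 }
```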
Prompt and Skill Reliability Hardening
Modular Prompt Composition
Added shared prompt section module:
src/electron/agent/executor-prompt-sections.ts
Capabilities:
- section-level token budgets
- session-scoped memoization for stable sections
- turn-scoped recomputation for dynamic sections
- provider-aware stable-prefix prompt caching derived from section scopes
- total prompt budget composition
- optional section dropping by priority
- truncation and dropped-section reporting
- shared mode/domain policy builder
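The budget-related capabilities above (section budgets, total budget, priority-based dropping, and reporting) can be sketched in a single composition function. This is not the code in executor-prompt-sections.ts: the `PromptSection` shape, the character-based token estimate, and the "drop lowest priority first" rule are assumptions, and memoization and caching are left out.

```typescript
// Hypothetical section shape.
interface PromptSection {
  key: string;
  text: string;
  maxTokens: number; // section-level budget
  priority: number; // lower values are dropped first
  required?: boolean; // required sections are never dropped
}

interface ComposeResult {
  prompt: string;
  truncated: string[]; // keys of sections cut to their budget
  dropped: string[]; // keys of sections removed to fit the total budget
}

// Rough estimate (about 4 characters per token); a real implementation
// would use the provider's tokenizer.
const CHARS_PER_TOKEN = 4;
const estimateTokens = (text: string): number => Math.ceil(text.length / CHARS_PER_TOKEN);

function composePrompt(sections: PromptSection[], totalBudget: number): ComposeResult {
  const truncated: string[] = [];
  const dropped: string[] = [];

  // Step 1: clamp each section to its own budget.
  const kept: PromptSection[] = sections.map((s): PromptSection => {
    if (estimateTokens(s.text) <= s.maxTokens) return s;
    truncated.push(s.key);
    return { ...s, text: s.text.slice(0, s.maxTokens * CHARS_PER_TOKEN) };
  });

  const totalTokens = (): number =>
    kept.reduce((sum, s) => sum + estimateTokens(s.text), 0);

  // Step 2: drop optional sections, lowest priority first, until the total fits.
  while (totalTokens() > totalBudget) {
    const optional = kept.filter((s) => !s.required);
    if (optional.length === 0) break; // only required sections remain
    const lowest = optional.reduce((a, b) => (b.priority < a.priority ? b : a));
    kept.splice(kept.indexOf(lowest), 1);
    dropped.push(lowest.key);
  }

  return { prompt: kept.map((s) => s.text).join("\n\n"), truncated, dropped };
}
```

Returning the `truncated` and `dropped` keys alongside the prompt gives callers what they need for the truncation and dropped-section reporting mentioned in the list.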
Executor Integration
Wired into src/electron/agent/executor.ts:
- shared policy core reused across planning/execution/follow-up prompts
- shared section builder reused by execution and follow-up turns
- explicit section budgets (role/context/memory/playbook/infra/personality/guidelines/tool descriptions)
- plan prompt total budget (PLAN_SYSTEM_PROMPT_TOTAL_BUDGET)
- execution/follow-up system prompt total budget (EXECUTION_SYSTEM_PROMPT_TOTAL_BUDGET)
Tool Prompt Rendering
Tool guidance now follows a shared render pipeline instead of duplicating routing hints across prompt templates.
- tool-local internal prompt metadata is attached to LLMTool
- one render source produces both compact planning text and final provider-facing descriptions
- rendering happens only after tool visibility and policy filtering
- rendered tool arrays are cached against tool-catalog version plus stable render context
This keeps prompt guidance closer to the tool definition while reducing executor-prompt duplication.
The same prompt architecture now also feeds provider-side prompt caching: stable session sections form the cacheable prefix, dynamic turn sections stay uncached, and cache telemetry (cachedTokens, cacheWriteTokens) is available to cost accounting when providers expose it.
Adaptive Output Budget Recovery
Execution and follow-up turns now share one provider-aware output-budget policy instead of relying on provider defaults or one-off token floors.
Added policy module:
src/electron/agent/llm/output-token-policy.ts
Wired through:
- src/electron/agent/executor-llm-turn-utils.ts
- src/electron/agent/runtime/SessionRuntime.ts
- src/electron/agent/runtime/turn-kernel.ts
- src/electron/agent/executor-loop-utils.ts
Capabilities:
- provider-family budgeting for Anthropic, Bedrock Claude, OpenAI, Azure OpenAI, Gemini, OpenRouter, and a conservative generic fallback
- centralized transport-field mapping for max_tokens, max_completion_tokens, and max_output_tokens
- explicit output budgets on agentic turns in adaptive mode instead of backend defaults
- one same-request escalation on truncation before continuation prompting
- truncation classification that distinguishes visible partial output from reasoning-budget exhaustion
- targeted exhaustion guidance when a retry still produces no usable answer text
- structured runtime logging for budget choice, escalation, truncation classification, and continuation fallback decisions
This improves truncation recovery while keeping adapter hard caps, task-level overrides, and context-headroom clamps in place.
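A minimal sketch of two of these ideas, the transport-field mapping and the single same-request escalation, follows. It is not the code in output-token-policy.ts: the `OutputBudgetPolicy` shape, the function names, and the numeric budgets are hypothetical, and the turn is modeled as a synchronous call for brevity where a real provider call would be asynchronous.

```typescript
// The three request fields named in this document.
type OutputTokenField = "max_tokens" | "max_completion_tokens" | "max_output_tokens";

// Hypothetical policy shape for one provider family.
interface OutputBudgetPolicy {
  field: OutputTokenField; // which request field carries the budget
  initial: number; // budget for the first attempt
  escalated: number; // budget for the one retry after truncation
  hardCap: number; // adapter hard cap, always respected
}

interface TurnResult {
  text: string;
  truncated: boolean;
}

// Build the provider-specific request parameter for an attempt (0 or 1).
function buildTokenParam(policy: OutputBudgetPolicy, attempt: 0 | 1): Record<string, number> {
  const wanted = attempt === 0 ? policy.initial : policy.escalated;
  return { [policy.field]: Math.min(wanted, policy.hardCap) };
}

// Send once; on truncation retry the same request once with a larger budget.
// If the retry is still truncated, the caller falls back to continuation prompting.
function runTurnWithEscalation(
  policy: OutputBudgetPolicy,
  send: (params: Record<string, number>) => TurnResult,
): { result: TurnResult; escalated: boolean; needsContinuation: boolean } {
  const first = send(buildTokenParam(policy, 0));
  if (!first.truncated) {
    return { result: first, escalated: false, needsContinuation: false };
  }
  const second = send(buildTokenParam(policy, 1));
  return { result: second, escalated: true, needsContinuation: second.truncated };
}
```

Capping the escalated budget with `hardCap` mirrors the note above that adapter hard caps stay in place even when a retry asks for more output.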
Skill Routing Controls
Skill shortlist and budget controls are in:
- src/electron/agent/custom-skill-loader.ts
- src/electron/agent/tools/registry.ts
Defaults:
- shortlist size 20
- low-confidence threshold 0.55
- fallback instruction to use skill_list
- hard cap on injected skill text
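The defaults above could combine roughly as in the sketch below. How the real loader scores skills and what the threshold is compared against are not stated here, so the `SkillCandidate` shape, the "top score below threshold" rule, the fallback wording, and the 4000-character cap are all assumptions.

```typescript
// Hypothetical candidate shape: a skill id, a relevance score in [0, 1],
// and a short summary to inject into the prompt.
interface SkillCandidate {
  id: string;
  score: number;
  summary: string;
}

const SHORTLIST_SIZE = 20; // documented default
const LOW_CONFIDENCE_THRESHOLD = 0.55; // documented default
const MAX_SKILL_TEXT_CHARS = 4000; // hypothetical value; the document gives none

function buildSkillPromptText(
  candidates: SkillCandidate[],
  maxChars: number = MAX_SKILL_TEXT_CHARS,
): string {
  // Keep the highest-scoring candidates, up to the shortlist size.
  const shortlist = candidates
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, SHORTLIST_SIZE);

  const lines: string[] = [];
  const top = shortlist[0];
  // Assumption: low confidence means the best match is below the threshold.
  if (!top || top.score < LOW_CONFIDENCE_THRESHOLD) {
    lines.push("No confident skill match; call skill_list to see all skills.");
  }
  for (const s of shortlist) {
    lines.push(`- ${s.id}: ${s.summary}`);
  }
  // Hard cap on the injected text.
  return lines.join("\n").slice(0, maxChars);
}
```

Placing the fallback instruction first means it survives the hard cap even when the shortlist text is cut.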
CI, Nightly, and Release Gates
PR Regression Policy Gate
New CI job in .github/workflows/ci.yml:
Regression Policy Gate
Enforcement script:
scripts/qa/enforce_eval_regression_policy.cjs
Policy:
- if PR indicates a production failure/incident fix, at least one eval case JSON under scripts/qa/eval-cases/ must be added/updated.
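The policy check can be sketched as a pure function over the PR's changed files. How enforce_eval_regression_policy.cjs detects a production-failure fix is not described here, so the `isProductionFailureFix` flag (for example, derived from a PR template checkbox) and the result shape are assumptions.

```typescript
// Matches eval case JSON files under scripts/qa/eval-cases/ (including subfolders).
const EVAL_CASE_PATTERN = /^scripts\/qa\/eval-cases\/.+\.json$/;

// Hypothetical view of the PR event context.
interface PullRequestContext {
  isProductionFailureFix: boolean;
  changedFiles: string[]; // repo-relative paths of added or modified files
}

function checkRegressionPolicy(pr: PullRequestContext): { ok: boolean; reason: string } {
  if (!pr.isProductionFailureFix) {
    return { ok: true, reason: "not a production-failure fix" };
  }
  const touchesEvalCase = pr.changedFiles.some((file) => EVAL_CASE_PATTERN.test(file));
  return touchesEvalCase
    ? { ok: true, reason: "eval case added or updated" }
    : { ok: false, reason: "production-failure fix without an eval case under scripts/qa/eval-cases/" };
}

// An incident fix that also adds a regression case (file names are made up).
const policyResult = checkRegressionPolicy({
  isProductionFailureFix: true,
  changedFiles: ["src/electron/agent/daemon.ts", "scripts/qa/eval-cases/incident-timeout.json"],
});
```

A CI wrapper would exit non-zero when `ok` is false so the job fails the PR.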
PR template updated in:
.github/PULL_REQUEST_TEMPLATE.md
Targeted Eval Gate
The existing targeted eval gate now runs with Node 24 and installs the sqlite3 CLI before replay:
.github/workflows/ci.yml
Path trigger:
- src/electron/agent/**
- src/electron/agent/tools/**
Nightly Hardening
Workflow:
.github/workflows/nightly-hardening.yml
Runs:
- eval corpus build
- deterministic eval suite
- battery suite (when hooks secrets exist)
Artifacts:
- grouped human-readable summary (summary.md)
- machine-readable report (report.json)
Stability-window behavior:
- non-blocking before cutoff
- blocking after cutoff (HARDENING_REQUIRED_AFTER_UTC)
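The stability-window rule can be sketched as a date comparison that decides whether a hardening failure should fail the job. The function names are hypothetical, and three details are assumptions: the cutoff is an ISO 8601 UTC timestamp, an unset cutoff means non-blocking, and the cutoff instant itself counts as blocking.

```typescript
// Decide whether hardening failures are blocking at the given time.
function isHardeningBlocking(cutoffUtc: string | undefined, now: Date): boolean {
  if (!cutoffUtc) return false; // assumption: no cutoff configured means non-blocking
  const cutoff = Date.parse(cutoffUtc);
  if (isNaN(cutoff)) {
    // Fail loudly rather than silently staying non-blocking forever.
    throw new Error("HARDENING_REQUIRED_AFTER_UTC is not a valid timestamp: " + cutoffUtc);
  }
  return now.getTime() >= cutoff;
}

// Exit code for the workflow step: failures only fail the job after the cutoff.
function hardeningExitCode(failed: boolean, cutoffUtc: string | undefined, now: Date): number {
  return failed && isHardeningBlocking(cutoffUtc, now) ? 1 : 0;
}

// A failing run one second before a cutoff of 2025-01-01T00:00:00Z is reported only.
const beforeCutoffCode = hardeningExitCode(
  true,
  "2025-01-01T00:00:00Z",
  new Date("2024-12-31T23:59:59Z"),
);
```

In a workflow, `cutoffUtc` would be read from the HARDENING_REQUIRED_AFTER_UTC variable and `now` would be the current time.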
Release Hardening Gate
Workflow:
.github/workflows/release.yml
Added job:
Hardening Release Gate
Behavior:
- runs deterministic eval and battery checks
- applies same date-based strictness window
- blocks release after cutoff when hardening fails
Local Developer Commands
# Build eval corpus from recent failures
npm run qa:eval:build -- --window-days 30 --limit 300 --suite reliability-regressions
# Run deterministic replay
npm run qa:eval:run -- --suite reliability-regressions --mode deterministic
# Enforce PR production-failure regression policy (CI uses PR event context)
npm run qa:eval:enforce-regressions
# Full reliability loop (eval + battery)
npm run qa:reliability
Optional DB override:
COWORK_DB_PATH=/tmp/cowork-eval.db npm run qa:eval:run -- --suite reliability-regressions --mode deterministic
Local-Only Data Policy
- Reliability data is stored in local SQLite (userData/cowork-os.db).
- Eval corpus entries are sanitized before persistence.
- No required telemetry upload path is introduced by this reliability system.
Remaining Non-Code Work
The following require runtime history, not new code:
- 90-day KPI attainment proof (+15% eval pass, -30% repeated tool failure loops, -25% verification-failed-after-complete).
- Trend monitoring and policy tuning over real task volume.
Source Map
Core implementation files:
- src/electron/eval/EvalService.ts
- src/electron/eval/risk.ts
- src/electron/agent/daemon.ts
- src/electron/agent/executor.ts
- src/electron/agent/executor-prompt-sections.ts
- src/electron/database/schema.ts
- src/electron/database/repositories.ts
- src/electron/ipc/handlers.ts
- src/electron/preload.ts
- src/shared/types.ts
Operational scripts and workflows:
- scripts/qa/build_eval_corpus.cjs
- scripts/qa/run_eval_suite.cjs
- scripts/qa/enforce_eval_regression_policy.cjs
- .github/workflows/ci.yml
- .github/workflows/nightly-hardening.yml
- .github/workflows/release.yml