Evaluation: 2026-03-05 Baseline
- Analyst: aegis/crew/stryder
- Date: 2026-03-05
- Dataset: 13,925 desires, 13,806 invocations (2026-02-09 to 2026-03-05)
- Source: Claude Code PostToolUseFailure hooks across the Gas Town multi-agent system (13,903 of 13,925 desires; the other 22 come from transcript analysis)
Executive Summary
This is the first comprehensive evaluation of desirepath data from a production multi-agent system. It covers 13,925 tool failures recorded over 25 days from ~15 agents. Three key findings: (1) Bash command failures account for 93% of all failures, and about 15% of those (1,981) are correctable CLI misuse; (2) only 3 aliases exist, covering a small fraction of the correctable errors; (3) MCP infrastructure downtime caused 440 silent failures with no alerting.
Dataset Overview
| Metric | Value |
|---|---|
| Total desires | 13,925 |
| Unique tool names | 29 |
| Date range | 2026-02-09 to 2026-03-05 (25 days) |
| Daily average | 557 failures/day |
| Source | claude-code (13,903), transcript-analysis (22) |
| Aliases configured | 3 |
Failure Distribution by Tool
| Tool | Count | % | Category |
|---|---|---|---|
| Bash | 12,932 | 92.9% | Command execution |
| Read | 502 | 3.6% | File access |
| mcp__homelab__batch_probe | 193 | 1.4% | MCP infrastructure |
| mcp__homelab__prometheus_query | 49 | 0.4% | MCP infrastructure |
| WebFetch | 47 | 0.3% | Network |
| mcp__homelab__container_status | 43 | 0.3% | MCP infrastructure |
| mcp__homelab__service_health | 41 | 0.3% | MCP infrastructure |
| Other MCP tools | 76 | 0.5% | MCP infrastructure |
| Other | 42 | 0.3% | Various |
Analysis by Error Category
1. CLI Misuse (Bash) — 93% of all failures
The 12,932 Bash failures break down into subcategories:
| Subcategory | Count | % of Bash | Actionable? |
|---|---|---|---|
| gt unknown commands | 743 | 5.7% | Yes — document or implement |
| bd unknown flags | 364 | 2.8% | Yes — aliases or flag additions |
| Command not found | 178 | 1.4% | Yes — install or alias |
| Git push rejected | 334 | 2.6% | Partially — workflow issue |
| Not a git repo | 188 | 1.5% | Yes — cwd detection |
| Git unstaged changes | 62 | 0.5% | Partially — workflow issue |
| bd sync required | 112 | 0.9% | Yes — auto-sync or docs |
| Normal dev errors | ~10,951 | 84.7% | No (expected during development) |
Key insight: ~15% of Bash failures (1,981) are correctable through aliases, documentation, or tooling improvements. The remaining 85% are normal development friction (test failures, build errors, typos).
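The 15% figure follows directly from the subcategory counts. The short check below uses only numbers from the table above; the variable names are illustrative.

```python
# Counts copied from the Bash subcategory table in this report.
BASH_TOTAL = 12_932
correctable = {
    "gt unknown commands": 743,
    "bd unknown flags": 364,
    "Command not found": 178,
    "Git push rejected": 334,
    "Not a git repo": 188,
    "Git unstaged changes": 62,
    "bd sync required": 112,
}

correctable_total = sum(correctable.values())
normal_dev_errors = BASH_TOTAL - correctable_total
correctable_share = correctable_total / BASH_TOTAL

print(f"correctable: {correctable_total} ({correctable_share:.1%})")
print(f"normal dev errors: {normal_dev_errors} ({1 - correctable_share:.1%})")
```

Note that the correctable total includes the two rows marked "Partially" (git push rejected, git unstaged changes), so 15% is an upper bound on what aliases and documentation alone could remove.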
2. Top Non-Existent GT Commands
Agents repeatedly try commands that don’t exist:
| Command | Count | What Agent Expected |
|---|---|---|
| gt deacon pending | 175 | Check deacon task queue |
| gt await-signal | 101 | Wait for async event |
| gt mol hook | 43 | Hook a molecule (correct: gt hook) |
| gt health | 41 | System health check |
| gt plugin status | 39 | Check plugin state |
| gt mq integration list | 28 | List MQ integrations |
| gt wisp | 26 | Manage wisps directly |
| gt plugin due | 25 | Check plugin schedule |
| gt sessions | 20 | List active sessions (correct: gt session) |
| gt rig health | 19 | Rig health check |
Recommendation: File desire-path beads for the top 5. Either implement or create aliases with helpful error messages.
3. Top Non-Existent BD Flags
| Flag Attempted | Count | Correct Alternative |
|---|---|---|
| --gated | 64 | (removed feature) |
| --wisp | 35 | (not a filter) |
| --assign | 27 | --assignee (-a) |
| --rig | 23 | (use prefix routing) |
| --comment | 21 | --append-notes |
| --prefix | 14 | (use prefix routing) |
| --mol | 11 | (not a filter) |
| --stdin | 11 | (pipe via heredoc) |
| --owner | 10 | --assignee (-a) |
| --epic | 7 | (not implemented) |
Current alias coverage: Only 3 aliases exist:
- --assign → --assignee (bd flag): covers 27 failures
- --owner → --assignee (bd flag): covers 10 failures
- bd note X → bd update X --append-notes: covers ~8 failures
Gap: --comment → --append-notes could prevent 21 more failures/month.
--gated appears 64 times but was a removed feature — needs a helpful error message.
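To make the idea of a flag alias concrete, here is a minimal sketch of the rewrite step. This report does not describe how dp applies aliases internally, so `FLAG_ALIASES` and `rewrite_flags` are hypothetical names and the logic is an assumption, not the dp implementation.

```python
import shlex

# Hypothetical alias table. The first two entries mirror the existing bd
# aliases; --comment is the proposed addition.
FLAG_ALIASES = {
    "--assign": "--assignee",
    "--owner": "--assignee",
    "--comment": "--append-notes",
}


def rewrite_flags(command: str, tool: str = "bd") -> str:
    """Rewrite aliased flags in a command line for the given tool."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != tool:
        return command  # leave other tools untouched
    rewritten = []
    for token in tokens:
        # Handle both "--flag value" and "--flag=value" forms.
        flag, sep, value = token.partition("=")
        rewritten.append(FLAG_ALIASES.get(flag, flag) + sep + value)
    return shlex.join(rewritten)


print(rewrite_flags("bd update x1 --comment 'needs review'"))
```

A removed flag such as --gated cannot be rewritten this way. It would need a separate table that maps the flag to an explanatory message instead of a replacement.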
4. Read Tool Failures
| Error Type | Count | Root Cause |
|---|---|---|
| EISDIR (read directory) | 162 | Agent used Read instead of ls/Bash |
| File not found | 218 | Agent guessed wrong path |
| File too large | 8 | Exceeded 25K token limit |
Recommendation: Bobbin could inject directory tree output on EISDIR errors
(bead aegis-qalm1v filed). tree command now installed on luvu + kota.
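As an illustration of what such an injection could contain, the sketch below builds a short listing when the requested path turns out to be a directory. The function name `directory_hint`, the hint wording, and the entry limit are assumptions; Bobbin's actual behaviour is not specified in this report.

```python
from pathlib import Path
from typing import Optional


def directory_hint(path: str, max_entries: int = 20) -> Optional[str]:
    """Return a short listing if `path` is a directory, otherwise None."""
    p = Path(path)
    if not p.is_dir():
        return None
    # Directories first, then files, each group sorted by name.
    entries = sorted(p.iterdir(), key=lambda e: (e.is_file(), e.name))
    lines = [f"{path} is a directory; Read expects a file. Contents:"]
    for entry in entries[:max_entries]:
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"  {entry.name}{suffix}")
    if len(entries) > max_entries:
        lines.append(f"  ... {len(entries) - max_entries} more")
    return "\n".join(lines)
```

Capping the listing keeps the injected text small for large directories, which matters given the Read tool's token limit noted in the table above.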
5. MCP Server Downtime
| MCP Tool | Failures | Error |
|---|---|---|
| batch_probe | 193 | no available server |
| prometheus_query | 49 | no available server |
| container_status | 43 | no available server |
| service_health | 41 | no available server |
| list_containers | 21 | no available server |
| container_logs | 20 | no available server |
| Other MCP | 73 | no available server |
Total: 440 MCP failures, all “no available server” — homelab-mcp was down. No alerts fired. Bead aegis-ixx1e9 filed for maldoon to add monitoring.
6. Env-Need Analysis (dp env-needs output)
dp env-needs reports 43 “missing tools” but many are false positives.
The env-need categorizer incorrectly flags shell builtins and installed tools:
| Reported Missing | Actual Status | Issue |
|---|---|---|
| ls, cd, cat, echo | Shell builtins or coreutils | False positive: always available |
| ssh, git, grep | Installed | False positive: the command failed for a reason other than “command not found” |
| just | Not installed | True positive: just (justfile runner) not on luvu |
| dig, nslookup, host | Not installed | True positive: DNS tools missing |
| sqlite3 | Not installed | True positive: was missing, now installed |
| python3 | Installed | False positive: python3 is installed; only the unversioned python name is missing |
Recommendation: env-need categorizer needs refinement. Should check if the command actually produced “command not found” vs other errors. High false positive rate (>50%) reduces trust in the output.
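One way to implement that check is to require the shell's own "command not found" wording before counting a tool as missing. The sketch below covers the bash and zsh message formats; `missing_command` is a hypothetical helper, not part of the dp codebase.

```python
import re
from typing import Optional

# Patterns are tried in order. The zsh form must come first, otherwise the
# bash pattern would capture "zsh" from "zsh: command not found: just".
NOT_FOUND_PATTERNS = [
    re.compile(r"command not found: (?P<cmd>[^\s:]+)"),  # zsh
    re.compile(r"(?P<cmd>[^\s:]+): command not found"),  # bash
]


def missing_command(error_text: str) -> Optional[str]:
    """Return the missing command name, or None if the error is something else."""
    for pattern in NOT_FOUND_PATTERNS:
        match = pattern.search(error_text)
        if match:
            return match.group("cmd")
    return None


print(missing_command("bash: line 1: just: command not found"))
print(missing_command("ls: cannot access 'x': No such file or directory"))
```

With this rule, a failing `ls` or `git` invocation no longer counts as a missing tool, because the error text names a different problem.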
Agent Workspace Analysis
| Workspace | Failures | Primary Issues |
|---|---|---|
| deacon | 3,007 | gt mol squash wrong flags, patrol loop errors |
| aegis/crew/ellie | 1,042 | gt mol squash, read failures |
| deacon/dogs/boot | 982 | Patrol loop startup errors |
| aegis/crew/malcolm | 627 | CLI misuse, path guessing |
| aegis/witness | 603 | Patrol loop errors |
| aegis/crew/goldblum | 603 | Build errors, flag guessing |
| aegis/refinery/rig | 510 | Merge queue errors |
| aegis/crew/ian | 443 | Build/test errors |
| bucket/refinery/rig | 418 | Merge queue errors |
| mayor | 394 | CLI dispatch errors |
Insight: Deacon + dogs account for 29% of all failures. Most are repetitive patrol loop errors (gt mol squash with wrong flags). A single alias or patrol fix would eliminate thousands of failures.
Turn Pattern Analysis
dp turns shows tool call sequences. The dominant pattern is long Bash-only
turns (264, 152, and 151 calls). This suggests agents spend many consecutive
calls within a turn retrying failed Bash commands rather than changing approach.
Recommendation: Consider a “struggling detection” feature — if an agent has >5 consecutive Bash failures on similar commands, surface documentation or suggest an alternative approach.
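A minimal sketch of that rule follows. It treats two commands as "similar" when their first two words match, which is an assumed heuristic; the `ToolCall` record and function names are also hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToolCall:
    tool: str
    command: str
    failed: bool


def command_key(command: str, depth: int = 2) -> str:
    """Crude similarity key: the first `depth` words of the command."""
    return " ".join(command.split()[:depth])


def longest_failure_streak(calls: List[ToolCall]) -> Tuple[str, int]:
    """Return (key, length) of the longest run of consecutive failed Bash
    calls that share the same similarity key."""
    best_key, best_len = "", 0
    current_key, current_len = "", 0
    for call in calls:
        if call.tool == "Bash" and call.failed:
            key = command_key(call.command)
            if key == current_key:
                current_len += 1
            else:
                current_key, current_len = key, 1
            if current_len > best_len:
                best_key, best_len = current_key, current_len
        else:
            # Any success or non-Bash call breaks the streak.
            current_key, current_len = "", 0
    return best_key, best_len


def is_struggling(calls: List[ToolCall], threshold: int = 5) -> bool:
    """True when a streak exceeds `threshold` consecutive similar failures."""
    return longest_failure_streak(calls)[1] > threshold
```

The returned key (for example "gt mol") identifies which documentation to surface once the threshold is crossed.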
Alias Effectiveness
Current Aliases (3)
| Alias | Type | Estimated Monthly Prevents |
|---|---|---|
| --assign → --assignee | flag (bd) | ~27 |
| --owner → --assignee | flag (bd) | ~10 |
| bd note → bd update --append-notes | regex | ~8 |
| Total | | ~45 |
Recommended New Aliases (5)
| Alias | Type | Estimated Monthly Prevents |
|---|---|---|
| --comment → --append-notes | flag (bd) | ~21 |
| gt sessions → gt session | command | ~20 |
| gt mol hook → gt hook | command | ~43 |
| gt health → gt rig status | command | ~41 |
| --no-digest → (removed, explain) | flag (bd) | ~8 |
| Total | | ~133 |
Unrealized Value
- With current 3 aliases: ~45 prevents/month (0.3% of failures)
- With 8 aliases: ~178 prevents/month (1.3% of failures)
- With doc-mapping for top 50 patterns: ~800 informed/month (5.7% of failures)
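These percentages can be reproduced from the alias tables and the dataset total. The check below assumes, as the figures above do, that the 13,925 failures recorded over 25 days serve as the monthly denominator.

```python
TOTAL_FAILURES = 13_925  # all desires in the 25-day dataset

current_aliases = 27 + 10 + 8            # existing 3 aliases
new_aliases = 21 + 20 + 43 + 41 + 8      # recommended 5 aliases

scenarios = {
    "3 aliases": current_aliases,
    "8 aliases": current_aliases + new_aliases,
    "doc-mapping (top 50 patterns)": 800,
}

for name, prevented in scenarios.items():
    share = prevented / TOTAL_FAILURES
    print(f"{name}: ~{prevented} ({share:.1%} of failures)")
```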
dp Feature Utilization Assessment
| dp Feature | Currently Used? | Value | Action Needed |
|---|---|---|---|
| dp record | Yes (PostToolUseFailure hook) | High | Working well |
| dp ingest | Yes (via record) | High | Working well |
| dp stats | Manually by operators | Medium | Could auto-report |
| dp paths | Manually by operators | Medium | Good for evaluation |
| dp aliases | Yes (3 configured) | High | Add 5 more aliases |
| dp pave --hook | Yes (PreToolUse) | High | Working well |
| dp pave --agents-md | Not used | Medium | Should generate rules |
| dp env-needs | Not used (high false positives) | Low | Needs refinement |
| dp turns | Not used | Low | Useful for evaluation only |
| dp similar | Not used | Low | Niche use case |
| dp suggest | Not implemented yet | High | Would synthesize all signals |
| dp serve | Not used | Low | No consumer yet |
Immediate Actions
- Add 5 new aliases: dp alias for --comment, gt sessions, gt mol hook, gt health, --no-digest
- Run dp pave --agents-md: generate AGENTS.md rules from alias data and append to agent instruction files
- Fix env-needs false positives: check for “command not found” in error text, not just exit code
Future Features Needed
- dp suggest (planned, dp-9 design exists) — Synthesize all data sources into prioritized recommendations
- dp map (bead aegis-420cz6) — Map documentation to failing tool patterns
- Struggling detection — Identify agents retrying same failure pattern
- Recovery tracking (bead aegis-gvr2vh) — Detect when fixes reduce failures
Methodology Notes
- Data extracted via Python sqlite3 queries against ~/.dp/desires.db
- dp CLI commands (stats, paths, env-needs, aliases, turns) used for built-in analytics
- Error categorization done via substring matching on error text
- Agent identification via cwd field (workspace path → agent name)
- All counts are raw (no dedup by session or time window)
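The extraction and categorization steps can be sketched as follows. The schema of desires.db is not documented in this report, so the table name `desires`, its columns, and the substring rules are all assumptions; the example runs against an in-memory database with made-up rows, not the real data.

```python
import sqlite3
from collections import Counter

# Assumed substring rules, checked in order. The real rule set used for this
# evaluation is not listed in the report.
RULES = [
    ("no available server", "MCP downtime"),
    ("command not found", "Command not found"),
    ("not a git repository", "Not a git repo"),
    ("eisdir", "Read on directory"),
]


def categorize(error_text: str) -> str:
    """Return the first category whose substring appears in the error text."""
    lowered = error_text.lower()
    for needle, category in RULES:
        if needle in lowered:
            return category
    return "Other"


def count_categories(conn: sqlite3.Connection) -> Counter:
    # Hypothetical schema: table `desires` with an `error` text column.
    rows = conn.execute("SELECT error FROM desires")
    return Counter(categorize(error or "") for (error,) in rows)


# Demo with an in-memory stand-in for ~/.dp/desires.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE desires (tool_name TEXT, error TEXT, cwd TEXT)")
conn.executemany(
    "INSERT INTO desires VALUES (?, ?, ?)",
    [
        ("Bash", "bash: just: command not found", "/work/a"),
        ("Bash", "fatal: not a git repository", "/work/b"),
        ("Read", "EISDIR: illegal operation on a directory, read", "/work/a"),
        ("Bash", "exit status 1", "/work/a"),
    ],
)
print(count_categories(conn))
```

Because the counts are raw, a retry loop that repeats one failing command inflates its category. Grouping by session before counting would give a deduplicated view.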
Next Evaluation
Schedule next evaluation for 2026-03-15 (10 days). Track:
- Did the 5 new aliases reduce failures?
- Did MCP monitoring (aegis-ixx1e9) reduce downtime?
- Did bobbin tagging sweep improve injection quality (feedback noise ratio)?
- Total desire count growth rate
- New error patterns emerging