Evaluation: 2026-03-05 Baseline

Analyst: aegis/crew/stryder

Date: 2026-03-05

Dataset: 13,925 desires, 13,806 invocations (2026-02-09 to 2026-03-05)

Source: Claude Code PostToolUseFailure hooks across the Gas Town multi-agent system (13,903 records, 99.8%), plus 22 records from transcript analysis

Executive Summary

This is the first comprehensive evaluation of desirepath data from a production multi-agent system: 13,925 tool failures recorded over 25 days from ~15 agents. Three key findings: (1) Bash failures account for 93% of the total, and roughly 15% of them (1,981) are CLI misuse that aliases, documentation, or tooling could correct; (2) only 3 aliases exist, covering a small fraction of correctable errors (~45 per month); (3) MCP infrastructure downtime caused 440 silent failures with no alerting.

Dataset Overview

| Metric | Value |
|---|---|
| Total desires | 13,925 |
| Unique tool names | 29 |
| Date range | 2026-02-09 to 2026-03-05 (25 days) |
| Daily average | 557 failures/day |
| Source | claude-code (13,903), transcript-analysis (22) |
| Aliases configured | 3 |

Failure Distribution by Tool

| Tool | Count | % | Category |
|---|---|---|---|
| Bash | 12,932 | 92.9% | Command execution |
| Read | 502 | 3.6% | File access |
| mcp__homelab__batch_probe | 193 | 1.4% | MCP infrastructure |
| mcp__homelab__prometheus_query | 49 | 0.4% | MCP infrastructure |
| WebFetch | 47 | 0.3% | Network |
| mcp__homelab__container_status | 43 | 0.3% | MCP infrastructure |
| mcp__homelab__service_health | 41 | 0.3% | MCP infrastructure |
| Other MCP tools | 76 | 0.5% | MCP infrastructure |
| Other | 42 | 0.3% | Various |

Analysis by Error Category

1. CLI Misuse (Bash): 93% of all failures come from Bash

The 12,932 Bash failures break down into subcategories:

| Subcategory | Count | % of Bash | Actionable? |
|---|---|---|---|
| gt unknown commands | 743 | 5.7% | Yes: document or implement |
| bd unknown flags | 364 | 2.8% | Yes: aliases or flag additions |
| Command not found | 178 | 1.4% | Yes: install or alias |
| Git push rejected | 334 | 2.6% | Partially: workflow issue |
| Not a git repo | 188 | 1.5% | Yes: cwd detection |
| Git unstaged changes | 62 | 0.5% | Partially: workflow issue |
| bd sync required | 112 | 0.9% | Yes: auto-sync or docs |
| Normal dev errors | ~10,951 | 84.7% | No: expected during development |

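Key insight: ~15% of Bash failures (1,981) are fully or partially correctable through aliases, documentation, or tooling improvements. Of these, 396 (git push rejections and unstaged changes) are workflow issues that tooling only partly addresses. The remaining 85% are normal development friction (test failures, build errors, typos).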
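The shares quoted here can be reproduced from the subcategory counts. The sketch below is illustrative only: the counts are copied from the table above, and the names `BASH_TOTAL`, `SUBCATEGORIES`, and `share` are introduced for this example rather than taken from dp.

```python
# Counts copied from the Bash subcategory table in this report.
BASH_TOTAL = 12_932
SUBCATEGORIES = {
    "gt unknown commands": 743,
    "bd unknown flags": 364,
    "Command not found": 178,
    "Git push rejected": 334,
    "Not a git repo": 188,
    "Git unstaged changes": 62,
    "bd sync required": 112,
}


def share(count: int, total: int = BASH_TOTAL) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Everything that is at least partially actionable.
correctable = sum(SUBCATEGORIES.values())
# Whatever is left is treated as normal development friction.
normal_dev = BASH_TOTAL - correctable

print(f"correctable: {correctable} ({share(correctable)}% of Bash failures)")
print(f"normal dev:  {normal_dev} ({share(normal_dev)}% of Bash failures)")
```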

2. Top Non-Existent GT Commands

Agents repeatedly try commands that don’t exist:

| Command | Count | What Agent Expected |
|---|---|---|
| gt deacon pending | 175 | Check deacon task queue |
| gt await-signal | 101 | Wait for async event |
| gt mol hook | 43 | Hook a molecule (correct: gt hook) |
| gt health | 41 | System health check |
| gt plugin status | 39 | Check plugin state |
| gt mq integration list | 28 | List MQ integrations |
| gt wisp | 26 | Manage wisps directly |
| gt plugin due | 25 | Check plugin schedule |
| gt sessions | 20 | List active sessions (correct: gt session) |
| gt rig health | 19 | Rig health check |

Recommendation: File desire-path beads for the top 5 commands. For each one, either implement the command or create an alias that returns a helpful error message.

3. Top Non-Existent BD Flags

| Flag Attempted | Count | Correct Alternative |
|---|---|---|
| --gated | 64 | (removed feature) |
| --wisp | 35 | (not a filter) |
| --assign | 27 | --assignee (-a) |
| --rig | 23 | (use prefix routing) |
| --comment | 21 | --append-notes |
| --prefix | 14 | (use prefix routing) |
| --mol | 11 | (not a filter) |
| --stdin | 11 | (pipe via heredoc) |
| --owner | 10 | --assignee (-a) |
| --epic | 7 | (not implemented) |

Current alias coverage: Only 3 aliases exist:

  1. --assign → --assignee (bd flag): covers 27 failures
  2. --owner → --assignee (bd flag): covers 10 failures
  3. bd note X → bd update X --append-notes: covers ~8 failures

Gap: a --comment → --append-notes alias could prevent 21 more failures/month. --gated appears 64 times but is a removed feature, so it needs a helpful error message rather than an alias.
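To make the idea of a flag alias concrete, here is a minimal sketch of rewriting a bd invocation before it runs. It is not dp's implementation: the function `rewrite_bd_flags`, the two lookup tables, the wording of the removed-flag note, and the example invocations are all assumptions for illustration. Only the flag pairs themselves come from the table above.

```python
import shlex

# Flag pairs taken from the table above (existing and proposed aliases).
FLAG_ALIASES = {
    "--assign": "--assignee",
    "--owner": "--assignee",
    "--comment": "--append-notes",
}
# Hypothetical note text for a flag that has no replacement.
REMOVED_FLAGS = {"--gated": "--gated was removed and has no replacement"}


def rewrite_bd_flags(command: str) -> tuple[str, list[str]]:
    """Rewrite known-wrong bd flags and collect notes for removed flags.

    Commands that do not start with `bd` are returned unchanged.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "bd":
        return command, []
    notes: list[str] = []
    rewritten: list[str] = []
    for token in tokens:
        # Handle both `--flag value` and `--flag=value` spellings.
        name, sep, value = token.partition("=")
        if name in REMOVED_FLAGS:
            notes.append(REMOVED_FLAGS[name])
        rewritten.append(FLAG_ALIASES.get(name, name) + sep + value)
    return shlex.join(rewritten), notes


print(rewrite_bd_flags("bd update X --assign alice"))
```

Keeping removed flags in a separate table reflects the distinction drawn above: an alias silently corrects the call, while a removed feature can only be explained.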

4. Read Tool Failures

| Error Type | Count | Root Cause |
|---|---|---|
| EISDIR (read directory) | 162 | Agent used Read instead of ls/Bash |
| File not found | 218 | Agent guessed wrong path |
| File too large | 8 | Exceeded 25K token limit |

Recommendation: Bobbin could inject directory tree output on EISDIR errors (bead aegis-qalm1v filed). The tree command is now installed on luvu and kota.
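The EISDIR recommendation amounts to answering a directory read with a listing instead of an error. A minimal sketch of that behavior follows. The function `read_or_list` and its `max_entries` cap are hypothetical and do not describe how Bobbin or the Read tool work.

```python
from pathlib import Path


def read_or_list(path: str, max_entries: int = 50) -> str:
    """Return file contents, or a short listing if the path is a directory."""
    p = Path(path)
    if p.is_dir():
        # Mark subdirectories with a trailing slash, as `ls -p` does.
        entries = sorted(
            child.name + ("/" if child.is_dir() else "") for child in p.iterdir()
        )
        shown = entries[:max_entries]
        omitted = len(entries) - len(shown)
        lines = [f"{path} is a directory. Contents:"] + shown
        if omitted > 0:
            lines.append(f"... and {omitted} more entries")
        return "\n".join(lines)
    return p.read_text()
```

Capping the listing keeps the injected text small; the cap of 50 is an arbitrary placeholder.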

5. MCP Server Downtime

| MCP Tool | Failures | Error |
|---|---|---|
| batch_probe | 193 | no available server |
| prometheus_query | 49 | no available server |
| container_status | 43 | no available server |
| service_health | 41 | no available server |
| list_containers | 21 | no available server |
| container_logs | 20 | no available server |
| Other MCP | 73 | no available server |

Total: 440 MCP failures, all “no available server”, because homelab-mcp was down. No alerts fired. Bead aegis-ixx1e9 was filed for maldoon to add monitoring.
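One way such monitoring could work is a sliding-window check over the timestamps of recorded “no available server” failures. The sketch below is an assumption about a possible approach, not the monitoring planned in aegis-ixx1e9. The function name, the 10-minute window, and the threshold of 5 are all placeholders.

```python
from datetime import datetime, timedelta


def should_alert(
    failure_times: list[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=10),
    threshold: int = 5,
) -> bool:
    """Return True if at least `threshold` failures occurred within `window` before `now`."""
    recent = [t for t in failure_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


now = datetime(2026, 3, 1, 12, 0)
burst = [now - timedelta(minutes=m) for m in range(6)]
print(should_alert(burst, now))
```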

6. Env-Need Analysis (dp env-needs output)

dp env-needs reports 43 “missing tools” but many are false positives. The env-need categorizer incorrectly flags shell builtins and installed tools:

| Reported Missing | Actual Status | Issue |
|---|---|---|
| ls, cd, cat, echo | Shell builtins / coreutils | False positive: always available |
| ssh, git, grep | Installed | False positive: the command failed for a reason other than “command not found” |
| just | Not installed | True positive: just (justfile runner) not on luvu |
| dig, nslookup, host | Not installed | True positive: DNS tools missing |
| sqlite3 | Not installed | True positive: was missing, now installed |
| python3 | Installed | False positive: python3 exists, python doesn’t |

Recommendation: The env-need categorizer needs refinement. It should check whether the command actually produced “command not found” rather than some other error. The high false positive rate (>50%) reduces trust in the output.
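A minimal sketch of the refined check follows. It treats a tool as missing only when the error text names it in a “not found” message and it cannot be resolved on PATH. The function `is_missing_tool` and its injectable `which` parameter are hypothetical, and the sketch assumes the check runs on the same host as the agent.

```python
import re
import shutil
from typing import Callable, Optional


def is_missing_tool(
    command: str,
    error_text: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True only if the shell reported `command` as not found
    and it cannot be resolved on PATH."""
    # Matches messages such as "bash: just: command not found"
    # and "sh: 1: just: not found".
    pattern = rf"\b{re.escape(command)}: (command )?not found"
    reported = re.search(pattern, error_text) is not None
    return reported and which(command) is None


# A failing git command is not a missing tool.
print(is_missing_tool("git", "fatal: not a git repository"))
```

Passing `which` as a parameter keeps the function testable without depending on what is installed locally.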

Agent Workspace Analysis

| Workspace | Failures | Primary Issues |
|---|---|---|
| deacon | 3,007 | gt mol squash wrong flags, patrol loop errors |
| aegis/crew/ellie | 1,042 | gt mol squash, read failures |
| deacon/dogs/boot | 982 | Patrol loop startup errors |
| aegis/crew/malcolm | 627 | CLI misuse, path guessing |
| aegis/witness | 603 | Patrol loop errors |
| aegis/crew/goldblum | 603 | Build errors, flag guessing |
| aegis/refinery/rig | 510 | Merge queue errors |
| aegis/crew/ian | 443 | Build/test errors |
| bucket/refinery/rig | 418 | Merge queue errors |
| mayor | 394 | CLI dispatch errors |

Insight: Deacon + dogs account for 29% of all failures. Most are repetitive patrol loop errors (gt mol squash with wrong flags). A single alias or patrol fix would eliminate thousands of failures.

Turn Pattern Analysis

dp turns shows tool call sequences. The dominant pattern is long Bash-only turns (the longest had 264, 152, and 151 calls). This suggests agents retry failed Bash commands many times rather than changing approach.

Recommendation: Consider a “struggling detection” feature — if an agent has >5 consecutive Bash failures on similar commands, surface documentation or suggest an alternative approach.
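A sketch of how such a detector might look is given below. The threshold of more than 5 consecutive failures comes from the recommendation above. The `ToolCall` shape, the use of `difflib.SequenceMatcher` to judge similarity, and the 0.8 similarity cutoff are assumptions for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ToolCall:
    tool: str
    command: str
    failed: bool


def is_struggling(
    calls: list[ToolCall], threshold: int = 5, min_similarity: float = 0.8
) -> bool:
    """Return True if more than `threshold` consecutive failed Bash calls
    ran commands similar to the one before them."""
    streak = 0
    previous = None
    for call in calls:
        if call.tool == "Bash" and call.failed:
            similar = (
                previous is not None
                and SequenceMatcher(None, previous, call.command).ratio()
                >= min_similarity
            )
            # Extend the streak on a similar command, otherwise start a new one.
            streak = streak + 1 if similar else 1
            previous = call.command
            if streak > threshold:
                return True
        else:
            # Any success or non-Bash call breaks the streak.
            streak = 0
            previous = None
    return False


retries = [ToolCall("Bash", f"gt mol squash --flag{i}", True) for i in range(6)]
print(is_struggling(retries))
```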

Alias Effectiveness

Current Aliases (3)

| Alias | Type | Estimated Monthly Prevents |
|---|---|---|
| --assign → --assignee | flag (bd) | ~27 |
| --owner → --assignee | flag (bd) | ~10 |
| bd note → bd update --append-notes | regex | ~8 |
| Total | | ~45 |

Proposed Aliases (5)

| Alias | Type | Estimated Monthly Prevents |
|---|---|---|
| --comment → --append-notes | flag (bd) | ~21 |
| gt sessions → gt session | command | ~20 |
| gt mol hook → gt hook | command | ~43 |
| gt health → gt rig status | command | ~41 |
| --no-digest → (removed, explain) | flag (bd) | ~8 |
| Total | | ~133 |

Unrealized Value

  • With current 3 aliases: ~45 prevents/month (0.3% of failures)
  • With 8 aliases: ~178 prevents/month (1.3% of failures)
  • With doc-mapping for top 50 patterns: ~800 informed/month (5.7% of failures)
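These percentages follow from the per-alias estimates and the dataset total. The sketch below reproduces them. The counts come from this report, the variable names are introduced for the example, and it treats the 25-day total of 13,925 failures as roughly one month, which is an approximation.

```python
TOTAL_FAILURES = 13_925  # whole dataset (25 days), used as the monthly baseline

CURRENT_ALIASES = {"--assign": 27, "--owner": 10, "bd note": 8}
PROPOSED_ALIASES = {
    "--comment": 21,
    "gt sessions": 20,
    "gt mol hook": 43,
    "gt health": 41,
    "--no-digest": 8,
}
DOC_MAPPING_ESTIMATE = 800  # estimate quoted in this report, not derived here


def pct_of_failures(count: int) -> float:
    """Return count as a percentage of all recorded failures, one decimal."""
    return round(100 * count / TOTAL_FAILURES, 1)


current = sum(CURRENT_ALIASES.values())
with_proposed = current + sum(PROPOSED_ALIASES.values())

for label, count in [
    ("3 aliases", current),
    ("8 aliases", with_proposed),
    ("doc-mapping", DOC_MAPPING_ESTIMATE),
]:
    print(f"{label}: ~{count}/month ({pct_of_failures(count)}% of failures)")
```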

dp Feature Utilization Assessment

| dp Feature | Currently Used? | Value | Action Needed |
|---|---|---|---|
| dp record | Yes (PostToolUseFailure hook) | High | Working well |
| dp ingest | Yes (via record) | High | Working well |
| dp stats | Manually by operators | Medium | Could auto-report |
| dp paths | Manually by operators | Medium | Good for evaluation |
| dp aliases | Yes (3 configured) | High | Add 5 more aliases |
| dp pave --hook | Yes (PreToolUse) | High | Working well |
| dp pave --agents-md | Not used | Medium | Should generate rules |
| dp env-needs | Not used (high false positives) | Low | Needs refinement |
| dp turns | Not used | Low | Useful for evaluation only |
| dp similar | Not used | Low | Niche use case |
| dp suggest | Not implemented yet | High | Would synthesize all signals |
| dp serve | Not used | Low | No consumer yet |

Immediate Actions

  1. Add 5 new aliases: dp alias for --comment, gt sessions, gt mol hook, gt health, and --no-digest
  2. Run dp pave --agents-md: generate AGENTS.md rules from alias data and append them to agent instruction files
  3. Fix env-needs false positives: check for “command not found” in the error text, not just the exit code

Future Features Needed

  1. dp suggest (planned, dp-9 design exists) — Synthesize all data sources into prioritized recommendations
  2. dp map (bead aegis-420cz6) — Map documentation to failing tool patterns
  3. Struggling detection — Identify agents retrying same failure pattern
  4. Recovery tracking (bead aegis-gvr2vh) — Detect when fixes reduce failures
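As a rough illustration of what recovery tracking could compute, the sketch below compares mean daily failures before and after a fix date. It is not the design in aegis-gvr2vh. The function names and the sample counts are hypothetical, and the sketch assumes every day in the period has an entry, including days with zero failures.

```python
from datetime import date
from typing import Optional


def mean_daily(counts_by_day: dict[date, int], days: list[date]) -> float:
    """Average failures per day over the given days."""
    return sum(counts_by_day[d] for d in days) / len(days)


def relative_change(
    counts_by_day: dict[date, int], fix_date: date
) -> Optional[float]:
    """Relative change in mean daily failures from before to after `fix_date`.

    Returns None when either side has no data or the baseline is zero.
    Negative values mean failures dropped after the fix.
    """
    before = [d for d in counts_by_day if d < fix_date]
    after = [d for d in counts_by_day if d >= fix_date]
    if not before or not after:
        return None
    baseline = mean_daily(counts_by_day, before)
    if baseline == 0:
        return None
    return (mean_daily(counts_by_day, after) - baseline) / baseline


# Made-up daily counts for one error pattern, for illustration only.
sample = {
    date(2026, 3, 1): 40,
    date(2026, 3, 2): 40,
    date(2026, 3, 3): 10,
    date(2026, 3, 4): 10,
}
print(relative_change(sample, fix_date=date(2026, 3, 3)))
```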

Methodology Notes

  • Data extracted via Python sqlite3 queries against ~/.dp/desires.db
  • dp CLI commands (stats, paths, env-needs, aliases, turns) used for built-in analytics
  • Error categorization done via substring matching on error text
  • Agent identification via cwd field (workspace path → agent name)
  • All counts are raw (no dedup by session or time window)
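The methodology above can be sketched as follows. The table and column names (`desires`, `error`, `cwd`), the `/gt/` workspace root, the substring list, and the sample rows are assumptions, since the real schema of ~/.dp/desires.db is not shown in this report. The example builds an in-memory database so it runs on its own.

```python
import sqlite3
from collections import Counter

# Hypothetical substring -> category rules, checked in order.
CATEGORY_SUBSTRINGS = [
    ("command not found", "Command not found"),
    ("not a git repository", "Not a git repo"),
    ("no available server", "MCP downtime"),
    ("eisdir", "Read directory"),
]
# A few workspace names from this report; the root path is an assumption.
WORKSPACES = ["deacon", "deacon/dogs/boot", "aegis/crew/ellie"]


def categorize(error_text: str) -> str:
    """Assign a category by case-insensitive substring match on the error text."""
    lowered = error_text.lower()
    for needle, category in CATEGORY_SUBSTRINGS:
        if needle in lowered:
            return category
    return "Other"


def agent_from_cwd(cwd: str, root: str = "/gt/") -> str:
    """Map a working directory to the longest matching workspace name."""
    rel = cwd[len(root):] if cwd.startswith(root) else cwd
    for ws in sorted(WORKSPACES, key=len, reverse=True):
        if rel == ws or rel.startswith(ws + "/"):
            return ws
    return "unknown"


def summarize(conn: sqlite3.Connection) -> Counter:
    """Count raw failures per (agent, category), with no deduplication."""
    counts: Counter = Counter()
    for error, cwd in conn.execute("SELECT error, cwd FROM desires"):
        counts[(agent_from_cwd(cwd), categorize(error))] += 1
    return counts


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE desires (tool_name TEXT, error TEXT, cwd TEXT)")
    conn.executemany(
        "INSERT INTO desires VALUES (?, ?, ?)",
        [
            ("Bash", "bash: just: command not found", "/gt/deacon/dogs/boot/src"),
            ("Bash", "fatal: not a git repository", "/gt/deacon"),
            ("mcp__homelab__batch_probe", "no available server", "/gt/aegis/crew/ellie/work"),
            ("Read", "EISDIR: illegal operation on a directory, read", "/tmp/x"),
        ],
    )
    return conn


print(summarize(demo_connection()))
```

Matching the longest workspace prefix first keeps nested workspaces such as deacon/dogs/boot from being counted under deacon.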

Next Evaluation

Schedule next evaluation for 2026-03-15 (10 days). Track:

  • Did the 5 new aliases reduce failures?
  • Did MCP monitoring (aegis-ixx1e9) reduce downtime?
  • Did bobbin tagging sweep improve injection quality (feedback noise ratio)?
  • Total desire count growth rate
  • New error patterns emerging