Evaluation: 2026-03-05 Baseline

Analyst: aegis/crew/stryder
Date: 2026-03-05
Dataset: 13,925 desires, 13,806 invocations (2026-02-09 to 2026-03-05)
Source: 100% Claude Code PostToolUseFailure hooks across Gas Town multi-agent system

Executive Summary

First comprehensive evaluation of desirepath data from a production multi-agent system: 13,925 tool failures recorded over 25 days from ~15 agents. Three key findings: (1) Bash command failures dominate at 93% of all failures, and about 15% of those are correctable CLI misuse; (2) only 3 aliases exist, covering about 0.3% of failures; (3) MCP infrastructure downtime caused 440 silent failures with no alerting.

Dataset Overview

Metric	Value
Total desires	13,925
Unique tool names	29
Date range	2026-02-09 to 2026-03-05 (25 days)
Daily average	557 failures/day
Source	claude-code (13,903), transcript-analysis (22)
Aliases configured	3

Failure Distribution by Tool

Tool	Count	%	Category
Bash	12,932	92.9%	Command execution
Read	502	3.6%	File access
mcp__homelab__batch_probe	193	1.4%	MCP infrastructure
mcp__homelab__prometheus_query	49	0.4%	MCP infrastructure
WebFetch	47	0.3%	Network
mcp__homelab__container_status	43	0.3%	MCP infrastructure
mcp__homelab__service_health	41	0.3%	MCP infrastructure
Other MCP tools	76	0.5%	MCP infrastructure
Other	42	0.3%	Various

Analysis by Error Category

1. CLI Misuse (Bash) — 93% of all failures

The 12,932 Bash failures break down into subcategories:

Subcategory	Count	% of Bash	Actionable?
gt unknown commands	743	5.7%	Yes: document or implement
bd unknown flags	364	2.8%	Yes: aliases or flag additions
Command not found	178	1.4%	Yes: install or alias
Git push rejected	334	2.6%	Partially: workflow issue
Not a git repo	188	1.5%	Yes: cwd detection
Git unstaged changes	62	0.5%	Partially: workflow issue
bd sync required	112	0.9%	Yes: auto-sync or docs
Normal dev errors	~10,951	84.7%	No: expected during development

Key insight: ~15% of Bash failures (1,981) are fully or partially correctable through aliases, documentation, or tooling improvements. The remaining 85% are normal development friction (test failures, build errors, typos).
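
The 15% figure follows directly from the subcategory counts in the table above. A minimal Python check, using only numbers from that table (the variable names are illustrative):

```python
# Subcategory counts copied from the Bash failure table above
# (includes the two "Partially" actionable rows).
correctable_counts = {
    "gt unknown commands": 743,
    "bd unknown flags": 364,
    "Command not found": 178,
    "Git push rejected": 334,
    "Not a git repo": 188,
    "Git unstaged changes": 62,
    "bd sync required": 112,
}
total_bash = 12_932  # all Bash failures in the dataset

correctable = sum(correctable_counts.values())
normal_dev = total_bash - correctable

correctable_pct = round(100 * correctable / total_bash, 1)
normal_pct = round(100 * normal_dev / total_bash, 1)

print(f"correctable: {correctable} ({correctable_pct}%)")
print(f"normal dev:  {normal_dev} ({normal_pct}%)")
```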

2. Top Non-Existent GT Commands

Agents repeatedly try commands that don’t exist:

Command	Count	What Agent Expected
gt deacon pending	175	Check deacon task queue
gt await-signal	101	Wait for async event
gt mol hook	43	Hook a molecule (correct: gt hook)
gt health	41	System health check
gt plugin status	39	Check plugin state
gt mq integration list	28	List MQ integrations
gt wisp	26	Manage wisps directly
gt plugin due	25	Check plugin schedule
gt sessions	20	List active sessions (correct: gt session)
gt rig health	19	Rig health check

Recommendation: File desire-path beads for the top 5, then either implement each command or create an alias with a helpful error message.

3. Top Non-Existent BD Flags

Flag Attempted	Count	Correct Alternative
--gated	64	(removed feature)
--wisp	35	(not a filter)
--assign	27	--assignee (-a)
--rig	23	(use prefix routing)
--comment	21	--append-notes
--prefix	14	(use prefix routing)
--mol	11	(not a filter)
--stdin	11	(pipe via heredoc)
--owner	10	--assignee (-a)
--epic	7	(not implemented)

Current alias coverage: Only 3 aliases exist:

--assign → --assignee (bd flag) — covers 27 failures
--owner → --assignee (bd flag) — covers 10 failures
bd note X → bd update X --append-notes — covers ~8 failures

Gap: --comment → --append-notes could prevent 21 more failures/month. --gated appears 64 times but refers to a removed feature, so it needs a helpful error message.
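
To illustrate what a flag alias does, the sketch below rewrites known-wrong bd flags before a command runs. It is a hypothetical illustration, not dp's implementation: `FLAG_ALIASES` and `rewrite_flags` are assumed names, the bead ID in the usage note is made up, and the mappings are the ones listed in the table above.

```python
import shlex

# Hypothetical alias table; the mappings come from the flag table above.
FLAG_ALIASES = {
    "--assign": "--assignee",
    "--owner": "--assignee",
    "--comment": "--append-notes",
}

def rewrite_flags(command: str) -> str:
    """Replace aliased flags in a bd command line, leaving other tokens alone."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "bd":
        return command  # only bd commands are rewritten in this sketch
    rewritten = [FLAG_ALIASES.get(tok, tok) for tok in tokens]
    return shlex.join(rewritten)
```

For example, `rewrite_flags("bd update xx-123 --assign alice")` yields `bd update xx-123 --assignee alice`, while a non-bd command is returned unchanged.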

4. Read Tool Failures

Error Type	Count	Root Cause
EISDIR (read directory)	162	Agent used Read instead of ls/Bash
File not found	218	Agent guessed wrong path
File too large	8	Exceeded 25K token limit

Recommendation: Bobbin could inject directory tree output on EISDIR errors (bead aegis-qalm1v filed). The tree command is now installed on luvu and kota.
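
One way to picture the EISDIR recommendation: when a read target turns out to be a directory, return a listing instead of an error. The sketch below is a hypothetical Python illustration, not Bobbin's implementation; `read_or_list` and its `max_entries` parameter are assumed names.

```python
import os

def read_or_list(path: str, max_entries: int = 50) -> str:
    """Return file contents, or a sorted directory listing if path is a directory."""
    if os.path.isdir(path):
        entries = sorted(os.listdir(path))[:max_entries]
        # Mark subdirectories with a trailing slash so they are easy to spot.
        lines = [
            name + "/" if os.path.isdir(os.path.join(path, name)) else name
            for name in entries
        ]
        return f"{path} is a directory. Contents:\n" + "\n".join(lines)
    with open(path, encoding="utf-8") as fh:
        return fh.read()
```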

5. MCP Server Downtime

MCP Tool	Failures	Error
batch_probe	193	no available server
prometheus_query	49	no available server
container_status	43	no available server
service_health	41	no available server
list_containers	21	no available server
container_logs	20	no available server
Other MCP	73	no available server

Total: 440 MCP failures, all “no available server”, because homelab-mcp was down. No alerts fired. Bead aegis-ixx1e9 has been filed for maldoon to add monitoring.

6. Env-Need Analysis (dp env-needs output)

dp env-needs reports 43 “missing tools” but many are false positives. The env-need categorizer incorrectly flags shell builtins and installed tools:

Reported Missing	Actual Status	Issue
ls, cd, cat, echo	Builtins/coreutils	False positive: cd and echo are Bash builtins, ls and cat are coreutils; all are always available
ssh, git, grep	Installed	False positive: the failures were not “command not found” errors
just	Not installed	True positive: `just` (justfile runner) not on luvu
dig, nslookup, host	Not installed	True positive: DNS tools missing
sqlite3	Installed (was missing)	True positive: was missing, now installed
python3	Installed	False positive: python3 exists, `python` doesn’t

Recommendation: The env-need categorizer needs refinement. It should check whether the command actually produced “command not found” rather than some other error. The high false-positive rate (>50%) reduces trust in the output.
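
The refinement described above can be sketched as a check on the error text itself. This is a minimal illustration under assumptions, not dp's categorizer: the function name `is_missing_tool` is hypothetical, and the matching covers the usual bash, sh, and zsh "not found" messages.

```python
def is_missing_tool(tool: str, error_text: str) -> bool:
    """True only if the shell itself reported that `tool` could not be found."""
    for line in error_text.splitlines():
        lowered = line.lower().rstrip()
        if "command not found" in lowered or lowered.endswith(": not found"):
            # Require the tool name to appear as a whole colon-delimited field,
            # so "python: command not found" does not count against python3.
            fields = [field.strip() for field in line.split(":")]
            if tool in fields:
                return True
    return False
```

With this check, an error such as `bash: line 1: just: command not found` counts as a missing tool, while a git or ls failure with some other message does not.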

Agent Workspace Analysis

Workspace	Failures	Primary Issues
deacon	3,007	gt mol squash wrong flags, patrol loop errors
aegis/crew/ellie	1,042	gt mol squash, read failures
deacon/dogs/boot	982	Patrol loop startup errors
aegis/crew/malcolm	627	CLI misuse, path guessing
aegis/witness	603	Patrol loop errors
aegis/crew/goldblum	603	Build errors, flag guessing
aegis/refinery/rig	510	Merge queue errors
aegis/crew/ian	443	Build/test errors
bucket/refinery/rig	418	Merge queue errors
mayor	394	CLI dispatch errors

Insight: deacon and deacon/dogs/boot together account for 3,989 failures, 29% of the total. Most are repetitive patrol loop errors (gt mol squash with wrong flags), so a single alias or patrol fix could eliminate thousands of failures.

Current Aliases (3)

Alias	Type	Estimated Monthly Prevents
--assign → --assignee	flag (bd)	~27
--owner → --assignee	flag (bd)	~10
bd note → bd update --append-notes	regex	~8
Total		~45

Recommended New Aliases (5)

Alias	Type	Estimated Monthly Prevents
--comment → --append-notes	flag (bd)	~21
gt sessions → gt session	command	~20
gt mol hook → gt hook	command	~43
gt health → gt rig status	command	~41
--no-digest → (removed, explain)	flag (bd)	~8
Total		~133

Unrealized Value

With current 3 aliases: ~45 prevents/month (0.3% of failures)
With 8 aliases: ~178 prevents/month (1.3% of failures)
With doc-mapping for top 50 patterns: ~800 informed/month (5.7% of failures)
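
These percentages can be reproduced from the alias tables, taking the 13,925 total failures in the dataset as the denominator, as this report does. A short Python check (names are illustrative; the 800 figure for doc-mapping is the estimate stated above):

```python
TOTAL_FAILURES = 13_925

current = 27 + 10 + 8              # prevents from the 3 existing aliases
proposed = 21 + 20 + 43 + 41 + 8   # prevents from the 5 recommended aliases

def share(n: int) -> float:
    """Percentage of all recorded failures, rounded to one decimal place."""
    return round(100 * n / TOTAL_FAILURES, 1)

scenarios = [
    ("3 aliases", current),
    ("8 aliases", current + proposed),
    ("doc-mapping", 800),
]
for label, n in scenarios:
    print(f"{label}: ~{n} ({share(n)}% of failures)")
```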

dp Feature Utilization Assessment

dp Feature	Currently Used?	Value	Action Needed
`dp record`	Yes (PostToolUseFailure hook)	High	Working well
`dp ingest`	Yes (via record)	High	Working well
`dp stats`	Manually by operators	Medium	Could auto-report
`dp paths`	Manually by operators	Medium	Good for evaluation
`dp aliases`	Yes (3 configured)	High	Add 5 more aliases
`dp pave --hook`	Yes (PreToolUse)	High	Working well
`dp pave --agents-md`	Not used	Medium	Should generate rules
`dp env-needs`	Not used (high false positives)	Low	Needs refinement
`dp turns`	Not used	Low	Useful for evaluation only
`dp similar`	Not used	Low	Niche use case
`dp suggest`	Not implemented yet	High	Would synthesize all signals
`dp serve`	Not used	Low	No consumer yet

Immediate Actions

Add 5 new aliases: dp alias for --comment, gt sessions, gt mol hook, gt health, --no-digest
Run dp pave --agents-md: Generate AGENTS.md rules from alias data and append to agent instruction files
Fix env-needs false positives: Check for “command not found” in error text, not just exit code

Future Features Needed

dp suggest (planned, dp-9 design exists) — Synthesize all data sources into prioritized recommendations
dp map (bead aegis-420cz6) — Map documentation to failing tool patterns
Struggling detection — Identify agents retrying same failure pattern
Recovery tracking (bead aegis-gvr2vh) — Detect when fixes reduce failures
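
Struggling detection could be approximated by counting repeated identical failures per agent within a time window. The sketch below is one possible approach, not a design for dp: the record shape `(agent, signature, timestamp)`, the threshold of 3, and the 10-minute window are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def find_struggling(records, threshold=3, window=timedelta(minutes=10)):
    """Return (agent, signature) pairs that failed >= threshold times within window.

    records: iterable of (agent, signature, datetime) tuples (hypothetical shape).
    """
    by_key = defaultdict(list)
    for agent, signature, ts in records:
        by_key[(agent, signature)].append(ts)

    struggling = set()
    for key, stamps in by_key.items():
        stamps.sort()
        # Compare each timestamp with the one threshold-1 positions earlier.
        for i in range(threshold - 1, len(stamps)):
            if stamps[i] - stamps[i - threshold + 1] <= window:
                struggling.add(key)
                break
    return struggling
```

Grouping by an error signature rather than the raw error text would let near-identical retries (same command, different arguments) count toward the same pattern; how to derive that signature is left open here.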

Methodology Notes

Data extracted via Python sqlite3 queries against ~/.dp/desires.db
dp CLI commands (stats, paths, env-needs, aliases, turns) used for built-in analytics
Error categorization done via substring matching on error text
Agent identification via cwd field (workspace path → agent name)
All counts are raw (no dedup by session or time window)
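
As an illustration of the approach described above, the sketch below combines substring-based categorization with cwd-based agent identification using Python's sqlite3 module. The table name `desires`, its columns (`tool_name`, `error`, `cwd`), the substring rules, the `/gt/` workspace root, and the sample rows are all assumptions for this example; the actual schema of ~/.dp/desires.db is not shown in this report, so the sketch runs against an in-memory database.

```python
import sqlite3
from collections import Counter

# Hypothetical substring rules, checked in order; first match wins.
RULES = [
    ("unknown command", "gt unknown commands"),
    ("unknown flag", "bd unknown flags"),
    ("command not found", "Command not found"),
    ("not a git repository", "Not a git repo"),
]

def categorize(error: str) -> str:
    lowered = error.lower()
    for needle, category in RULES:
        if needle in lowered:
            return category
    return "Normal dev errors"

def agent_from_cwd(cwd: str, root: str = "/gt/") -> str:
    """Map a workspace path to an agent name; the /gt/ root is an assumed layout."""
    return cwd.split(root, 1)[-1].strip("/")

conn = sqlite3.connect(":memory:")  # stand-in for ~/.dp/desires.db
conn.execute("CREATE TABLE desires (tool_name TEXT, error TEXT, cwd TEXT)")
conn.executemany(
    "INSERT INTO desires VALUES (?, ?, ?)",
    [   # made-up sample rows, not real data
        ("Bash", 'Error: unknown command "health" for "gt"', "/home/u/gt/deacon"),
        ("Bash", "Error: unknown flag: --gated", "/home/u/gt/aegis/crew/ellie"),
        ("Bash", "bash: just: command not found", "/home/u/gt/deacon"),
    ],
)

by_category = Counter()
by_agent = Counter()
query = "SELECT error, cwd FROM desires WHERE tool_name = 'Bash'"
for error, cwd in conn.execute(query):
    by_category[categorize(error)] += 1
    by_agent[agent_from_cwd(cwd)] += 1
```

Because the rules are plain substring matches checked in order, a failure that matches more than one rule is counted only under the first, which keeps the category totals summing to the number of rows.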

Next Evaluation

Schedule next evaluation for 2026-03-15 (10 days). Track:

Did the 5 new aliases reduce failures?
Did MCP monitoring (aegis-ixx1e9) reduce downtime?
Did the Bobbin tagging sweep improve injection quality (feedback noise ratio)?
Total desire count growth rate
New error patterns emerging
