Evaluation Methodology
Bobbin’s evaluation framework measures how semantic code context affects AI agent performance on real bug fixes across open-source projects.
Approach: Commit-Revert
Each evaluation task is based on a real bug fix from a well-tested open-source project:
- Select a commit that fixes a bug and has a passing test suite
- Check out the parent of that commit (the broken state)
- Give the agent the bug description and test command
- Measure whether the agent can reproduce the fix
This approach has several advantages:
- Ground truth exists — the actual commit shows exactly what needed to change
- Tests are authoritative — the project’s own test suite validates correctness
- Difficulty is natural — real bugs have realistic complexity and cross-file dependencies
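The checkout step above is plain git plumbing. A minimal sketch, assuming the runner simply checks out the fix commit's first parent (the function name is illustrative, not the real runner's API):

```python
import subprocess

def checkout_broken_state(repo_dir: str, fix_commit: str) -> None:
    """Check out the parent of a bug-fix commit, i.e. the broken state
    the agent starts from. `<sha>^` is git syntax for 'first parent'."""
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", "--detach", f"{fix_commit}^"],
        check=True,
    )
```

The ground-truth diff is then simply `git diff <fix_commit>^ <fix_commit>`, which is what the file-level scoring below compares against.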
Two Approaches
Each task is run twice under controlled conditions:
| Approach | Description | Settings |
|---|---|---|
| no-bobbin | Agent works with only its built-in knowledge and the prompt | Empty hooks (isolated from user config) |
| with-bobbin | Agent receives semantic code context via bobbin’s hook system | `bobbin hook inject-context` on each prompt |
The with-bobbin approach injects relevant code snippets automatically when the agent processes its prompt, giving it awareness of related files, function signatures, and code patterns.
Isolation
Each run uses an independent, freshly cloned workspace in a temporary directory. The no-bobbin approach uses an explicit empty settings file (`settings-no-bobbin.json`) to prevent contamination from user-level Claude Code hooks. This ensures the control group never receives bobbin context.
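One way the isolation setup might be wired up. This is a sketch: the filename `settings-no-bobbin.json` is from above, but the function name and the `{"hooks": {}}` payload shape are assumptions:

```python
import json
import os
import tempfile

def make_control_workspace(run_id: str) -> str:
    """Create a fresh temp workspace for a no-bobbin run, with an explicit
    empty settings file so user-level hooks never leak into the control."""
    workspace = tempfile.mkdtemp(prefix=f"{run_id}-")
    settings = os.path.join(workspace, "settings-no-bobbin.json")
    with open(settings, "w") as f:
        json.dump({"hooks": {}}, f)  # assumed payload: explicitly no hooks
    return workspace
```

Using a freshly created directory per run (rather than reusing a clone) also guarantees no stale index or metrics log survives between runs.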
Scoring Dimensions
Test Pass Rate
Does the agent’s fix make the test suite pass? This is the primary success metric — a fix that doesn’t pass tests is a failed attempt, regardless of how close the code looks.
File-Level Precision
Definition: Of the files the agent modified, what fraction were also modified in the ground truth commit?
Precision = |agent_files ∩ ground_truth_files| / |agent_files|
What it measures: Surgical accuracy. High precision (close to 1.0) means the agent only touched files that actually needed changing. Low precision means the agent made unnecessary modifications — touching files that weren’t part of the real fix.
Example: Ground truth modifies files A, B, C. Agent modifies A, B, D, E.
Precision = |{A,B}| / |{A,B,D,E}| = 2/4 = 0.50. The agent found 2 correct files but also touched 2 unnecessary ones.
File-Level Recall
Definition: Of the files modified in the ground truth commit, what fraction did the agent also modify?
Recall = |agent_files ∩ ground_truth_files| / |ground_truth_files|
What it measures: Completeness. High recall (close to 1.0) means the agent found all files that needed changing. Low recall means the agent missed some required files.
Example: Ground truth modifies files A, B, C. Agent modifies A, B, D, E.
Recall = |{A,B}| / |{A,B,C}| = 2/3 = 0.67. The agent found 2 of the 3 required files but missed file C.
F1 Score
Definition: The harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why harmonic mean? Unlike an arithmetic mean, the harmonic mean penalizes extreme imbalances. An agent that touches every file in the repo would have recall = 1.0 but precision ≈ 0.0, and F1 would correctly be near 0 rather than 0.5.
Interpretation guide:
| F1 Range | Meaning |
|---|---|
| 1.0 | Perfect — agent modified exactly the same files as the ground truth |
| 0.7-0.9 | Strong — agent found most files with minimal extras |
| 0.4-0.6 | Partial — agent found some files but missed others or added extras |
| 0.0-0.3 | Weak — agent’s changes have little overlap with the ground truth |
Why F1 matters for context engines: Bobbin’s value proposition is that semantic context helps agents find the right files to modify. Without context, agents often explore broadly (low precision) or miss related files (low recall). F1 captures both failure modes in a single number.
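The three file-level metrics follow directly from their set definitions. A minimal sketch (the function name is illustrative):

```python
def file_level_scores(agent_files: set[str], truth_files: set[str]) -> dict:
    """Precision, recall, and F1 over the sets of modified files."""
    hit = agent_files & truth_files
    precision = len(hit) / len(agent_files) if agent_files else 0.0
    recall = len(hit) / len(truth_files) if truth_files else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Worked example from above: ground truth {A,B,C}, agent {A,B,D,E}
scores = file_level_scores({"A", "B", "D", "E"}, {"A", "B", "C"})
# precision = 0.50, recall ≈ 0.67, F1 = 4/7 ≈ 0.57
```

Note how the harmonic mean lands at 0.57 rather than the arithmetic mean of 0.58; the penalty grows sharply as the two components diverge.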
Duration
Wall-clock time for the agent to complete its work. Includes thinking, tool calls, and compilation. Faster is better, all else equal, but correctness always trumps speed.
GPU-Accelerated Indexing
The with-bobbin approach requires indexing the target codebase before the agent starts. Bobbin automatically detects NVIDIA CUDA GPUs and uses them for embedding inference:
| Project | Files | Chunks | CPU Index Time | GPU Index Time |
|---|---|---|---|---|
| flask | 210 | ~700 | ~2s | ~2s |
| polars | 3,089 | ~50K | Pending | Pending |
| ruff | 5,874 | ~57K | >30 min (timeout) | ~83s |
GPU acceleration makes large-codebase evaluation practical. Without it, indexing ruff’s 57K chunks was the primary bottleneck — consistently timing out at the 30-minute mark. With GPU (RTX 4070 Super), embedding throughput jumps from ~100 chunks/s to ~2,400 chunks/s.
The GPU is only used during the indexing phase. Search queries are sub-100ms regardless.
Native Metrics
Each with-bobbin run captures detailed observability data via bobbin’s metrics infrastructure:
- `BOBBIN_METRICS_SOURCE` — The eval runner sets this env var before each agent invocation, tagging all metric events with a unique run identifier (e.g., `ruff-001_with-bobbin_1`).
- `.bobbin/metrics.jsonl` — Append-only log of metric events emitted by bobbin commands and hooks during the run.
Events captured include:
| Event | Source | What It Captures |
|---|---|---|
command | CLI dispatch | Every bobbin invocation: command name, duration, success/failure |
hook_injection | inject-context hook | Files returned, chunks returned, top semantic score, budget lines used |
hook_gate_skip | inject-context hook | Query text, top score, gate threshold (when injection is skipped due to weak match) |
hook_dedup_skip | inject-context hook | When injection is skipped because context hasn’t changed since last prompt |
After each run, the eval runner reads the metrics log and computes:
- Injection count — How many times bobbin injected context during the run
- Gate skip count — How many prompts were below the relevance threshold
- Injection-to-ground-truth overlap — Precision and recall of the files bobbin injected vs. the files that actually needed changing
This data appears in the bobbin_metrics field of each result JSON, enabling analysis of why bobbin helped (or didn’t) on a given task.
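A sketch of that post-run aggregation, assuming each JSONL line carries an `event` name, a `source` tag (from `BOBBIN_METRICS_SOURCE`), and a `files` list on injection events; the exact field names are assumptions, not bobbin’s documented schema:

```python
import json

def summarize_run(metrics_path: str, run_id: str) -> dict:
    """Aggregate bobbin metric events for one eval run (sketch)."""
    injections, gate_skips = 0, 0
    injected_files: set[str] = set()
    with open(metrics_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("source") != run_id:  # other runs share the log
                continue
            if event.get("event") == "hook_injection":
                injections += 1
                injected_files.update(event.get("files", []))
            elif event.get("event") == "hook_gate_skip":
                gate_skips += 1
    return {"injection_count": injections,
            "gate_skip_count": gate_skips,
            "injected_files": injected_files}
```

The injection-to-ground-truth overlap is then just file-level precision and recall computed with `injected_files` in place of the agent’s modified files.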
LLM Judge
Optionally, an LLM judge performs pairwise comparison of agent diffs across three dimensions:
- Consistency: Does the solution follow codebase conventions?
- Completeness: Are edge cases handled?
- Minimality: Is the diff surgical, or does it include unnecessary changes?
The judge uses a flip-and-draw protocol (running comparison in both orders) to detect and mitigate position bias.
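A sketch of that protocol, with `judge` standing in for a single LLM comparison call that answers "first" or "second" (all names here are hypothetical):

```python
def flip_and_draw(judge, diff_a: str, diff_b: str) -> str:
    """Run the pairwise comparison in both presentation orders; only a
    verdict that survives the flip counts, otherwise score a draw."""
    first_pass = judge(diff_a, diff_b)   # A shown first
    second_pass = judge(diff_b, diff_a)  # same question, order flipped
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    return winner_1 if winner_1 == winner_2 else "draw"
```

A judge that always favors whichever diff it sees first contradicts itself across the flip and produces a draw, so pure position bias cannot manufacture a win.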
Task Selection Criteria
Tasks are curated to be:
- Self-contained — fixable without external documentation or API access
- Well-tested — the project’s test suite reliably catches the bug
- Cross-file — the fix typically touches 2-5 files (not trivial single-line changes)
- Diverse — spanning multiple languages, frameworks, and bug categories
Current task suites:
| Suite | Project | Language | Files | Code Lines | Tasks | Difficulty |
|---|---|---|---|---|---|---|
| flask | pallets/flask | Python | 210 | 26K | 5 | easy-medium |
| polars | pola-rs/polars | Rust+Python | 3,089 | 606K | 5 | easy-medium |
| ruff | astral-sh/ruff | Rust+Python | 5,874 | 696K | 5 | easy-medium |
See Project Catalog for full LOC breakdowns and index statistics.
Adding a New Project
To add a new evaluation project:
1. Find bug-fix commits — Look for commits in well-tested repos where the test suite catches the bug. The fix should touch 2-5 files.

2. Create task YAML files — Add `eval/tasks/<project>-NNN.yaml` with:

   ```yaml
   id: project-001
   repo: org/repo
   commit: <full-sha>
   description: |
     Description of the bug and what the agent should fix.
     Implement the fix. Run the test suite with the test command to verify.
   setup_command: "<build/install steps>"
   test_command: "<specific test command>"
   language: rust
   difficulty: easy
   tags: [bug-fix, ...]
   ```

3. Run `tokei` on the cloned repo at the pinned commit and add a section to Project Catalog.

4. Create a results page — Add `eval/<project>.md` to the book with task descriptions and a placeholder for results.

5. Update SUMMARY.md — Add the new results page to the Evaluation section.

6. Run evals — `just eval-task <project>-001` runs a single task. Results are written to `eval/results/runs/`.