Evaluation Methodology
Bobbin’s evaluation framework measures how semantic code context affects AI agent performance on real bug fixes across open-source projects.
Approach: Commit-Revert
Each evaluation task is based on a real bug fix from a well-tested open-source project:
- Select a commit that fixes a bug and has a passing test suite
- Check out the parent of that commit (the broken state)
- Give the agent the bug description and test command
- Measure whether the agent can reproduce the fix
This approach has several advantages:
- Ground truth exists — the actual commit shows exactly what needed to change
- Tests are authoritative — the project’s own test suite validates correctness
- Difficulty is natural — real bugs have realistic complexity and cross-file dependencies
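The checkout step above is plain git plumbing. A minimal sketch, assuming the runner simply checks out the fix commit's first parent (the function name is illustrative, not the real runner's API):

```python
import subprocess

def checkout_broken_state(repo_dir: str, fix_commit: str) -> None:
    """Check out the parent of a bug-fix commit, i.e. the broken state
    the agent starts from. `<sha>^` is git syntax for 'first parent'."""
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", "--detach", f"{fix_commit}^"],
        check=True,
    )
```

The ground-truth diff is then simply `git diff <fix_commit>^ <fix_commit>`, which is what the file-level scoring below compares against.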
Two Approaches
Each task is run twice under controlled conditions:
| Approach | Description | Settings |
|---|---|---|
| no-bobbin | Agent works with only its built-in knowledge and the prompt | Empty hooks (isolated from user config) |
| with-bobbin | Agent receives semantic code context via bobbin’s hook system | `bobbin hook inject-context` on each prompt |
The with-bobbin approach injects relevant code snippets automatically when the agent processes its prompt, giving it awareness of related files, function signatures, and code patterns.
Isolation
Each run uses an independent, freshly cloned workspace in a temporary directory. The no-bobbin approach uses an explicit empty settings file (`settings-no-bobbin.json`) to prevent contamination from user-level Claude Code hooks. This ensures the control group never receives bobbin context.
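One way the isolation setup might be wired up. This is a sketch: the filename `settings-no-bobbin.json` is from above, but the function name and the `{"hooks": {}}` payload shape are assumptions:

```python
import json
import os
import tempfile

def make_control_workspace(run_id: str) -> str:
    """Create a fresh temp workspace for a no-bobbin run, with an explicit
    empty settings file so user-level hooks never leak into the control."""
    workspace = tempfile.mkdtemp(prefix=f"{run_id}-")
    settings = os.path.join(workspace, "settings-no-bobbin.json")
    with open(settings, "w") as f:
        json.dump({"hooks": {}}, f)  # assumed payload: explicitly no hooks
    return workspace
```

Using a freshly created directory per run (rather than reusing a clone) also guarantees no stale index or metrics log survives between runs.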
Scoring Dimensions
Test Pass Rate
Does the agent’s fix make the test suite pass? This is the primary success metric — a fix that doesn’t pass tests is a failed attempt, regardless of how close the code looks.
File-Level Precision
Definition: Of the files the agent modified, what fraction were also modified in the ground truth commit?
Precision = |agent_files ∩ ground_truth_files| / |agent_files|
What it measures: Surgical accuracy. High precision (close to 1.0) means the agent only touched files that actually needed changing. Low precision means the agent made unnecessary modifications — touching files that weren’t part of the real fix.
Example: Ground truth modifies files A, B, C. Agent modifies A, B, D, E.
Precision = |{A,B}| / |{A,B,D,E}| = 2/4 = 0.50. The agent found 2 correct files but also touched 2 unnecessary ones.
File-Level Recall
Definition: Of the files modified in the ground truth commit, what fraction did the agent also modify?
Recall = |agent_files ∩ ground_truth_files| / |ground_truth_files|
What it measures: Completeness. High recall (close to 1.0) means the agent found all files that needed changing. Low recall means the agent missed some required files.
Example: Ground truth modifies files A, B, C. Agent modifies A, B, D, E.
Recall = |{A,B}| / |{A,B,C}| = 2/3 = 0.67. The agent found 2 of the 3 required files but missed file C.
F1 Score
Definition: The harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why harmonic mean? Unlike an arithmetic mean, the harmonic mean penalizes extreme imbalances. An agent that touches every file in the repo would have recall = 1.0 but precision ≈ 0.0, and F1 would correctly be near 0 rather than 0.5.
Interpretation guide:
| F1 Range | Meaning |
|---|---|
| 1.0 | Perfect — agent modified exactly the same files as the ground truth |
| 0.7-0.9 | Strong — agent found most files with minimal extras |
| 0.4-0.6 | Partial — agent found some files but missed others or added extras |
| 0.0-0.3 | Weak — agent’s changes have little overlap with the ground truth |
Why F1 matters for context engines: Bobbin’s value proposition is that semantic context helps agents find the right files to modify. Without context, agents often explore broadly (low precision) or miss related files (low recall). F1 captures both failure modes in a single number.
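The three file-level metrics follow directly from their set definitions. A minimal sketch (the function name is illustrative):

```python
def file_level_scores(agent_files: set[str], truth_files: set[str]) -> dict:
    """Precision, recall, and F1 over the sets of modified files."""
    hit = agent_files & truth_files
    precision = len(hit) / len(agent_files) if agent_files else 0.0
    recall = len(hit) / len(truth_files) if truth_files else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Worked example from above: ground truth {A,B,C}, agent {A,B,D,E}
scores = file_level_scores({"A", "B", "D", "E"}, {"A", "B", "C"})
# precision = 0.50, recall ≈ 0.67, F1 = 4/7 ≈ 0.57
```

Note how the harmonic mean lands at 0.57 rather than the arithmetic mean of 0.58; the penalty grows sharply as the two components diverge.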
Duration
Wall-clock time for the agent to complete its work. Includes thinking, tool calls, and compilation. Faster is better, all else equal, but correctness always trumps speed.
GPU-Accelerated Indexing
The with-bobbin approach requires indexing the target codebase before the agent starts. Bobbin automatically detects NVIDIA CUDA GPUs and uses them for embedding inference:
| Project | Files | Chunks | CPU Index Time | GPU Index Time |
|---|---|---|---|---|
| flask | 210 | ~700 | ~2s | ~2s |
| polars | 3,089 | ~50K | Pending | Pending |
| ruff | 5,874 | ~57K | >30 min (timeout) | ~83s |
GPU acceleration makes large-codebase evaluation practical. Without it, indexing ruff’s 57K chunks was the primary bottleneck — consistently timing out at the 30-minute mark. With GPU (RTX 4070 Super), embedding throughput jumps from ~100 chunks/s to ~2,400 chunks/s.
The GPU is only used during the indexing phase. Search queries are sub-100ms regardless.
Native Metrics
Each with-bobbin run captures detailed observability data via bobbin’s metrics infrastructure:
- `BOBBIN_METRICS_SOURCE` — The eval runner sets this env var before each agent invocation, tagging all metric events with a unique run identifier (e.g., `ruff-001_with-bobbin_1`).
- `.bobbin/metrics.jsonl` — Append-only log of metric events emitted by bobbin commands and hooks during the run.
Events captured include:
| Event | Source | What It Captures |
|---|---|---|
command | CLI dispatch | Every bobbin invocation: command name, duration, success/failure |
hook_injection | inject-context hook | Files returned, chunks returned, top semantic score, budget lines used |
hook_gate_skip | inject-context hook | Query text, top score, gate threshold (when injection is skipped due to weak match) |
hook_dedup_skip | inject-context hook | When injection is skipped because context hasn’t changed since last prompt |
After each run, the eval runner reads the metrics log and computes:
- Injection count — How many times bobbin injected context during the run
- Gate skip count — How many prompts were below the relevance threshold
- Injection-to-ground-truth overlap — Precision and recall of the files bobbin injected vs. the files that actually needed changing
This data appears in the bobbin_metrics field of each result JSON, enabling analysis of why bobbin helped (or didn’t) on a given task.
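A sketch of that post-run aggregation, assuming each JSONL line carries an `event` name, a `source` tag (from `BOBBIN_METRICS_SOURCE`), and a `files` list on injection events; the exact field names are assumptions, not bobbin’s documented schema:

```python
import json

def summarize_run(metrics_path: str, run_id: str) -> dict:
    """Aggregate bobbin metric events for one eval run (sketch)."""
    injections, gate_skips = 0, 0
    injected_files: set[str] = set()
    with open(metrics_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("source") != run_id:  # other runs share the log
                continue
            if event.get("event") == "hook_injection":
                injections += 1
                injected_files.update(event.get("files", []))
            elif event.get("event") == "hook_gate_skip":
                gate_skips += 1
    return {"injection_count": injections,
            "gate_skip_count": gate_skips,
            "injected_files": injected_files}
```

The injection-to-ground-truth overlap is then just file-level precision and recall computed with `injected_files` in place of the agent’s modified files.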
LLM Judge
Optionally, an LLM judge performs pairwise comparison of agent diffs across three dimensions:
- Consistency: Does the solution follow codebase conventions?
- Completeness: Are edge cases handled?
- Minimality: Is the diff surgical, or does it include unnecessary changes?
The judge uses a flip-and-draw protocol (running comparison in both orders) to detect and mitigate position bias.
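A sketch of that protocol, with `judge` standing in for a single LLM comparison call that answers "first" or "second" (all names here are hypothetical):

```python
def flip_and_draw(judge, diff_a: str, diff_b: str) -> str:
    """Run the pairwise comparison in both presentation orders; only a
    verdict that survives the flip counts, otherwise score a draw."""
    first_pass = judge(diff_a, diff_b)   # A shown first
    second_pass = judge(diff_b, diff_a)  # same question, order flipped
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    return winner_1 if winner_1 == winner_2 else "draw"
```

A judge that always favors whichever diff it sees first contradicts itself across the flip and produces a draw, so pure position bias cannot manufacture a win.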
Task Selection Criteria
Tasks are curated to be:
- Self-contained — fixable without external documentation or API access
- Well-tested — the project’s test suite reliably catches the bug
- Cross-file — the fix typically touches 2-5 files (not trivial single-line changes)
- Diverse — spanning multiple languages, frameworks, and bug categories
Current task suites:
| Suite | Project | Language | Files | Code Lines | Tasks | Difficulty |
|---|---|---|---|---|---|---|
| flask | pallets/flask | Python | 210 | 26K | 5 | easy-medium |
| polars | pola-rs/polars | Rust+Python | 3,089 | 606K | 5 | easy-medium |
| ruff | astral-sh/ruff | Rust+Python | 5,874 | 696K | 5 | easy-medium |
See Project Catalog for full LOC breakdowns and index statistics.
Adding a New Project
To add a new evaluation project:
1. Find bug-fix commits — Look for commits in well-tested repos where the test suite catches the bug. The fix should touch 2-5 files.

2. Create task YAML files — Add `eval/tasks/<project>-NNN.yaml` with:

   ```yaml
   id: project-001
   repo: org/repo
   commit: <full-sha>
   description: |
     Description of the bug and what the agent should fix.
     Implement the fix. Run the test suite with the test command to verify.
   setup_command: "<build/install steps>"
   test_command: "<specific test command>"
   language: rust
   difficulty: easy
   tags: [bug-fix, ...]
   ```

3. Run `tokei` on the cloned repo at the pinned commit and add a section to Project Catalog.

4. Create a results page — Add `eval/<project>.md` to the book with task descriptions and a placeholder for results.

5. Update SUMMARY.md — Add the new results page to the Evaluation section.

6. Run evals — `just eval-task <project>-001` runs a single task. Results are written to `eval/results/runs/`.