
Evaluation Methodology

Bobbin’s evaluation framework measures how semantic code context affects AI agent performance on real bug fixes across open-source projects.

Approach: Commit-Revert

Each evaluation task is based on a real bug fix from a well-tested open-source project:

  1. Select a commit that fixes a bug and has a passing test suite
  2. Check out the parent of that commit (the broken state)
  3. Give the agent the bug description and test command
  4. Measure whether the agent can reproduce the fix

This approach has several advantages:

  • Ground truth exists — the actual commit shows exactly what needed to change
  • Tests are authoritative — the project’s own test suite validates correctness
  • Difficulty is natural — real bugs have realistic complexity and cross-file dependencies
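The checkout steps above can be sketched as a small harness function. This is a minimal illustration, not bobbin’s actual API: the helper name and signature are hypothetical, and only standard git plumbing is used.

```python
import subprocess
import tempfile

def prepare_task(repo_url: str, fix_commit: str) -> str:
    """Clone the project and check out the *parent* of the bug-fix commit.

    Hypothetical helper: repo_url and fix_commit would come from the
    task definition. Returns the workspace path in the broken state.
    """
    workdir = tempfile.mkdtemp(prefix="eval-")
    subprocess.run(["git", "clone", repo_url, workdir], check=True)
    # The parent of the fix commit is the broken state the agent must repair.
    parent = subprocess.run(
        ["git", "-C", workdir, "rev-parse", f"{fix_commit}^"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["git", "-C", workdir, "checkout", parent], check=True)
    return workdir
```

Checking out the parent (rather than reverting the fix in place) keeps the workspace history-accurate: the agent sees exactly what a contributor saw before the fix landed.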

Two Approaches

Each task is run twice under controlled conditions:

| Approach | Description | Settings |
| --- | --- | --- |
| no-bobbin | Agent works with only its built-in knowledge and the prompt | Empty hooks (isolated from user config) |
| with-bobbin | Agent receives semantic code context via bobbin’s hook system | bobbin hook inject-context on each prompt |

The with-bobbin approach injects relevant code snippets automatically when the agent processes its prompt, giving it awareness of related files, function signatures, and code patterns.

Isolation

Each run uses an independent, freshly cloned workspace in a temporary directory. The no-bobbin approach uses an explicit empty settings file (settings-no-bobbin.json) to prevent contamination from user-level Claude Code hooks. This ensures the control group never receives bobbin context.
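The control-group isolation can be sketched as follows. The {"hooks": {}} shape is an assumption about the settings schema; the point is that an explicit empty file overrides any user-level hooks rather than silently inheriting them.

```python
import json
import os
import tempfile

def make_control_settings(run_dir: str) -> str:
    """Write an explicit empty-hooks settings file for the no-bobbin run.

    Hypothetical helper: the {"hooks": {}} payload is an assumed schema.
    An explicit empty file shadows user-level hooks, so the control run
    can never receive injected context by accident.
    """
    path = os.path.join(run_dir, "settings-no-bobbin.json")
    with open(path, "w") as f:
        json.dump({"hooks": {}}, f)
    return path
```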

Scoring Dimensions

Test Pass Rate

Does the agent’s fix make the test suite pass? This is the primary success metric — a fix that doesn’t pass tests is a failed attempt, regardless of how close the code looks.

File-Level Precision

Definition: Of the files the agent modified, what fraction were also modified in the ground truth commit?

Precision = |agent_files ∩ ground_truth_files| / |agent_files|

What it measures: Surgical accuracy. High precision (close to 1.0) means the agent only touched files that actually needed changing. Low precision means the agent made unnecessary modifications — touching files that weren’t part of the real fix.

Example: Ground truth modifies files A, B, C. Agent modifies A, B, D, E.

  • Precision = |{A,B}| / |{A,B,D,E}| = 2/4 = 0.50
  • The agent found 2 correct files but also touched 2 unnecessary ones.

File-Level Recall

Definition: Of the files modified in the ground truth commit, what fraction did the agent also modify?

Recall = |agent_files ∩ ground_truth_files| / |ground_truth_files|

What it measures: Completeness. High recall (close to 1.0) means the agent found all files that needed changing. Low recall means the agent missed some required files.

Example: Ground truth modifies files A, B, C. Agent modifies A, B, D, E.

  • Recall = |{A,B}| / |{A,B,C}| = 2/3 = 0.67
  • The agent found 2 of the 3 required files but missed file C.

F1 Score

Definition: The harmonic mean of precision and recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why harmonic mean? Unlike an arithmetic mean, the harmonic mean penalizes extreme imbalances. An agent that touches every file in the repo would have recall = 1.0 but precision ≈ 0.0, and F1 would correctly be near 0 rather than 0.5.
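All three file-level metrics follow directly from the set definitions above. A minimal sketch (helper name illustrative), applied to the worked example of ground truth {A, B, C} versus agent changes {A, B, D, E}:

```python
def file_overlap_scores(agent_files, truth_files):
    """Precision, recall, and F1 over sets of modified file paths."""
    overlap = len(agent_files & truth_files)
    precision = overlap / len(agent_files) if agent_files else 0.0
    recall = overlap / len(truth_files) if truth_files else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Worked example from above: truth = {A, B, C}, agent = {A, B, D, E}.
p, r, f1 = file_overlap_scores({"A", "B", "D", "E"}, {"A", "B", "C"})
# p = 0.50, r ≈ 0.67, f1 = 4/7 ≈ 0.57
```

Note the degenerate cases are defined as 0.0 rather than raising on division by zero, so a run where the agent changed nothing still scores.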

Interpretation guide:

| F1 Range | Meaning |
| --- | --- |
| 1.0 | Perfect — agent modified exactly the same files as the ground truth |
| 0.7-0.9 | Strong — agent found most files with minimal extras |
| 0.4-0.6 | Partial — agent found some files but missed others or added extras |
| 0.0-0.3 | Weak — agent’s changes have little overlap with the ground truth |

Why F1 matters for context engines: Bobbin’s value proposition is that semantic context helps agents find the right files to modify. Without context, agents often explore broadly (low precision) or miss related files (low recall). F1 captures both failure modes in a single number.

Duration

Wall-clock time for the agent to complete its work. Includes thinking, tool calls, and compilation. Faster is better, all else equal, but correctness always trumps speed.

GPU-Accelerated Indexing

The with-bobbin approach requires indexing the target codebase before the agent starts. Bobbin automatically detects NVIDIA CUDA GPUs and uses them for embedding inference:

| Project | Files | Chunks | CPU Index Time | GPU Index Time |
| --- | --- | --- | --- | --- |
| flask | 210 | ~700 | ~2s | ~2s |
| polars | 3,089 | ~50K | Pending | Pending |
| ruff | 5,874 | ~57K | >30 min (timeout) | ~83s |

GPU acceleration makes large-codebase evaluation practical. Without it, indexing ruff’s 57K chunks was the primary bottleneck — consistently timing out at the 30-minute mark. With GPU (RTX 4070 Super), embedding throughput jumps from ~100 chunks/s to ~2,400 chunks/s.
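A back-of-envelope check of the throughput figures quoted above. Note the ~83s measured wall clock exceeds the pure-embedding estimate, which suggests the remainder goes to non-embedding work (parsing, chunking, index writes); that split is an inference, not a measured breakdown.

```python
# Values taken from the table and paragraph above.
chunks = 57_000    # ruff chunk count (~57K)
cpu_rate = 100     # chunks/s, CPU embedding throughput
gpu_rate = 2_400   # chunks/s, RTX 4070 Super

speedup = gpu_rate / cpu_rate      # 24x throughput improvement
cpu_embed_s = chunks / cpu_rate    # 570.0 s of pure embedding on CPU
gpu_embed_s = chunks / gpu_rate    # 23.75 s of pure embedding on GPU
```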

The GPU is only used during the indexing phase. Search queries are sub-100ms regardless.

Native Metrics

Each with-bobbin run captures detailed observability data via bobbin’s metrics infrastructure:

  • BOBBIN_METRICS_SOURCE — The eval runner sets this env var before each agent invocation, tagging all metric events with a unique run identifier (e.g., ruff-001_with-bobbin_1).
  • .bobbin/metrics.jsonl — Append-only log of metric events emitted by bobbin commands and hooks during the run.

Events captured include:

| Event | Source | What It Captures |
| --- | --- | --- |
| command | CLI dispatch | Every bobbin invocation: command name, duration, success/failure |
| hook_injection | inject-context hook | Files returned, chunks returned, top semantic score, budget lines used |
| hook_gate_skip | inject-context hook | Query text, top score, gate threshold (when injection is skipped due to a weak match) |
| hook_dedup_skip | inject-context hook | When injection is skipped because context hasn’t changed since the last prompt |

After each run, the eval runner reads the metrics log and computes:

  • Injection count — How many times bobbin injected context during the run
  • Gate skip count — How many prompts were below the relevance threshold
  • Injection-to-ground-truth overlap — Precision and recall of the files bobbin injected vs. the files that actually needed changing

This data appears in the bobbin_metrics field of each result JSON, enabling analysis of why bobbin helped (or didn’t) on a given task.
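The post-run aggregation described above can be sketched as a single pass over the JSONL log. The event field names ("event", "files") are assumptions about the metric schema, and the helper name is illustrative:

```python
import json

def summarize_metrics(jsonl_path, truth_files):
    """Aggregate a metrics.jsonl log into per-run summary counts.

    Hypothetical helper: assumes each line is a JSON object with an
    "event" kind and, for injections, a "files" list.
    """
    injections, gate_skips = 0, 0
    injected_files = set()
    with open(jsonl_path) as f:
        for line in f:
            event = json.loads(line)
            kind = event.get("event")
            if kind == "hook_injection":
                injections += 1
                injected_files.update(event.get("files", []))
            elif kind == "hook_gate_skip":
                gate_skips += 1
    overlap = len(injected_files & truth_files)
    return {
        "injection_count": injections,
        "gate_skip_count": gate_skips,
        # Precision/recall of injected files vs. files that needed changing.
        "injection_precision": overlap / len(injected_files) if injected_files else 0.0,
        "injection_recall": overlap / len(truth_files) if truth_files else 0.0,
    }
```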

LLM Judge

Optionally, an LLM judge performs pairwise comparison of agent diffs across three dimensions:

  • Consistency: Does the solution follow codebase conventions?
  • Completeness: Are edge cases handled?
  • Minimality: Is the diff surgical, or does it include unnecessary changes?

The judge uses a flip-and-draw protocol (running comparison in both orders) to detect and mitigate position bias.
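The flip-and-draw protocol can be sketched as follows. Here judge(a, b) stands in for the LLM call and is a hypothetical callable returning "first", "second", or "draw"; only an order-consistent preference counts as a win.

```python
def flip_and_draw(judge, diff_x, diff_y):
    """Run a pairwise judge in both presentation orders.

    judge(a, b) is a hypothetical stand-in for the LLM comparison,
    returning "first", "second", or "draw". A win is recorded only
    when the judge prefers the same diff in both orders; any
    disagreement is treated as a draw, which cancels position bias.
    """
    forward = judge(diff_x, diff_y)
    backward = judge(diff_y, diff_x)
    if forward == "first" and backward == "second":
        return "x"
    if forward == "second" and backward == "first":
        return "y"
    return "draw"  # order-dependent answers are discounted
```

A judge that always prefers whichever diff is shown first produces only draws under this protocol, which is exactly the bias it is designed to neutralize.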

Task Selection Criteria

Tasks are curated to be:

  • Self-contained — fixable without external documentation or API access
  • Well-tested — the project’s test suite reliably catches the bug
  • Cross-file — the fix typically touches 2-5 files (not trivial single-line changes)
  • Diverse — spanning multiple languages, frameworks, and bug categories

Current task suites:

| Suite | Project | Language | Files | Code Lines | Tasks | Difficulty |
| --- | --- | --- | --- | --- | --- | --- |
| flask | pallets/flask | Python | 210 | 26K | 5 | easy-medium |
| polars | pola-rs/polars | Rust+Python | 3,089 | 606K | 5 | easy-medium |
| ruff | astral-sh/ruff | Rust+Python | 5,874 | 696K | 5 | easy-medium |

See Project Catalog for full LOC breakdowns and index statistics.

Adding a New Project

To add a new evaluation project:

  1. Find bug-fix commits — Look for commits in well-tested repos where the test suite catches the bug. The fix should touch 2-5 files.

  2. Create task YAML files — Add eval/tasks/<project>-NNN.yaml with:

    id: project-001
    repo: org/repo
    commit: <full-sha>
    description: |
      Description of the bug and what the agent should fix.
      Implement the fix. Run the test suite with the test command to verify.
    setup_command: "<build/install steps>"
    test_command: "<specific test command>"
    language: rust
    difficulty: easy
    tags: [bug-fix, ...]
    
  3. Run tokei on the cloned repo at the pinned commit and add a section to Project Catalog.

  4. Create a results page — Add eval/<project>.md to the book with task descriptions and a placeholder for results.

  5. Update SUMMARY.md — Add the new results page to the Evaluation section.

  6. Run evals — just eval-task <project>-001 runs a single task. Results are written to eval/results/runs/.