Results Summary

Overall Comparison

| Metric | no-bobbin | with-bobbin | with-bobbin+blame_bridging=false | with-bobbin+coupling_depth=0 | with-bobbin+doc_demotion=0.0 | with-bobbin+gate_threshold=1.0 | with-bobbin+recency_weight=0.0 | with-bobbin+semantic_weight=0.0 |
|---|---|---|---|---|---|---|---|---|
| Runs | 30 | 36 | 3 | 3 | 3 | 3 | 3 | 4 |
| Test Pass Rate | 66.7% | 58.3% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| Avg Precision | 85.1% | 90.1% | 33.3% | 55.6% | 55.6% | 77.8% | 77.8% | 23.6% |
| Avg Recall | 60.2% | 64.5% | 33.3% | 33.3% | 55.6% | 55.6% | 55.6% | 33.3% |
| Avg F1 | 68.3% | 71.9% | 33.3% | 38.9% | 55.6% | 61.1% | 61.1% | 25.2% |
| Avg Duration | 4.2m | 3.6m | 4.6m | 4.7m | 5.0m | 5.4m | 4.9m | 4.7m |
| Avg Cost | $1.24 | $1.44 | $1.25 | $1.45 | $1.48 | $1.39 | $1.43 | $1.45 |
| Avg Input Tokens | 1,250,122 | 1,788,633 | 1,517,002 | 1,780,506 | 1,882,003 | 1,711,955 | 1,813,406 | 1,812,429 |
| Avg Output Tokens | 7,411 | 8,918 | 7,551 | 8,723 | 8,933 | 8,572 | 8,413 | 9,374 |
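
As a reading aid, the headline differences between the two main approaches can be worked out directly from the table above. The following is a minimal sketch, with the numbers copied by hand from the no-bobbin and with-bobbin columns; the `OVERALL` dictionary and both helper functions are illustrative names and are not part of the evaluation tooling.

```python
# Values copied by hand from the no-bobbin and with-bobbin columns of the
# Overall Comparison table. All names here are illustrative assumptions.
OVERALL = {
    "no-bobbin": {
        "precision": 85.1, "recall": 60.2, "f1": 68.3,
        "cost_usd": 1.24, "input_tokens": 1_250_122,
    },
    "with-bobbin": {
        "precision": 90.1, "recall": 64.5, "f1": 71.9,
        "cost_usd": 1.44, "input_tokens": 1_788_633,
    },
}


def point_delta(metric: str) -> float:
    """Difference in percentage points (with-bobbin minus no-bobbin)."""
    base = OVERALL["no-bobbin"][metric]
    new = OVERALL["with-bobbin"][metric]
    return round(new - base, 1)


def relative_change(metric: str) -> float:
    """Relative change in percent, measured against the no-bobbin value."""
    base = OVERALL["no-bobbin"][metric]
    new = OVERALL["with-bobbin"][metric]
    return round((new - base) / base * 100, 1)


if __name__ == "__main__":
    for metric in ("precision", "recall", "f1"):
        print(f"{metric}: {point_delta(metric):+.1f} points")
    for metric in ("cost_usd", "input_tokens"):
        print(f"{metric}: {relative_change(metric):+.1f}%")
```

Percentage metrics are compared in points and cost and token counts in relative terms, since the two kinds of figure are not on the same scale.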

Metric Overview

[Figure: summary_metrics.svg]

F1 Score by Task

[Figure: summary_f1_by_task.svg]

Score Distribution

[Figure: summary_f1_boxplot.svg]

Duration

[Figure: summary_duration.svg]

Recent Trend

[Figure: summary_trend.svg]

Full historical trends

Per-Task Results

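| Task | Language | Difficulty | Approach | Tests | Precision | Recall | F1 | Duration | Cost |
|---|---|---|---|---|---|---|---|---|---|
| cargo-001 | rust | easy | no-bobbin | 100.0% | 100.0% | 100.0% | 100.0% | 5.2m | $1.04 |
| cargo-001 | rust | easy | with-bobbin | 100.0% | 100.0% | 100.0% | 100.0% | 4.6m | $1.03 |
| flask-001 | | | no-bobbin | 0.0% | 100.0% | 33.3% | 50.0% | 1.3m | $0.00 |
| flask-001 | | | with-bobbin | 0.0% | 100.0% | 33.3% | 50.0% | 1.3m | $0.00 |
| flask-002 | | | no-bobbin | 0.0% | 100.0% | 66.7% | 80.0% | 3.1m | $0.00 |
| flask-002 | | | with-bobbin | 0.0% | 100.0% | 55.6% | 70.0% | 3.5m | $0.00 |
| flask-003 | | | no-bobbin | 0.0% | 100.0% | 60.0% | 75.0% | 2.2m | $0.00 |
| flask-003 | | | with-bobbin | 0.0% | 100.0% | 60.0% | 75.0% | 2.4m | $0.00 |
| flask-004 | | | no-bobbin | 0.0% | 100.0% | 70.0% | 81.9% | 3.3m | $0.00 |
| flask-004 | | | with-bobbin | 0.0% | 100.0% | 60.0% | 75.0% | 3.2m | $0.00 |
| flask-005 | | | no-bobbin | 0.0% | 100.0% | 50.0% | 66.7% | 2.6m | $0.00 |
| flask-005 | | | with-bobbin | 0.0% | 100.0% | 58.3% | 73.0% | 1.9m | $0.00 |
| polars-004 | rust | medium | no-bobbin | 100.0% | 100.0% | 66.7% | 80.0% | 4.4m | $0.81 |
| polars-005 | rust | medium | no-bobbin | 100.0% | 100.0% | 66.7% | 79.4% | 6.4m | $1.74 |
| ruff-001 | rust | medium | no-bobbin | 100.0% | 31.7% | 33.3% | 32.4% | 4.3m | $1.23 |
| ruff-001 | rust | medium | with-bobbin | 100.0% | 70.2% | 61.9% | 63.6% | 4.4m | $1.52 |
| ruff-001 | rust | medium | with-bobbin+blame_bridging=false | 100.0% | 33.3% | 33.3% | 33.3% | 4.6m | $1.25 |
| ruff-001 | rust | medium | with-bobbin+coupling_depth=0 | 100.0% | 55.6% | 33.3% | 38.9% | 4.7m | $1.45 |
| ruff-001 | rust | medium | with-bobbin+doc_demotion=0.0 | 100.0% | 55.6% | 55.6% | 55.6% | 5.0m | $1.48 |
| ruff-001 | rust | medium | with-bobbin+gate_threshold=1.0 | 100.0% | 77.8% | 55.6% | 61.1% | 5.4m | $1.39 |
| ruff-001 | rust | medium | with-bobbin+recency_weight=0.0 | 100.0% | 77.8% | 55.6% | 61.1% | 4.9m | $1.43 |
| ruff-001 | rust | medium | with-bobbin+semantic_weight=0.0 | 100.0% | 23.6% | 33.3% | 25.2% | 4.7m | $1.45 |
| ruff-002 | rust | easy | no-bobbin | 100.0% | 100.0% | 40.0% | 57.1% | 4.8m | $0.00 |
| ruff-002 | rust | easy | with-bobbin | 100.0% | 100.0% | 40.0% | 57.1% | 4.3m | $1.38 |
| ruff-003 | rust | medium | no-bobbin | 100.0% | 100.0% | 83.3% | 90.0% | 9.2m | $0.00 |
| ruff-003 | rust | medium | with-bobbin | 100.0% | 100.0% | 77.8% | 86.7% | 6.3m | $1.92 |
| ruff-004 | rust | easy | no-bobbin | 100.0% | 46.7% | 66.7% | 54.2% | 3.9m | $0.00 |
| ruff-004 | rust | easy | with-bobbin | 100.0% | 63.3% | 83.3% | 70.8% | 4.5m | $1.67 |
| ruff-005 | rust | easy | no-bobbin | 100.0% | 100.0% | 100.0% | 100.0% | 3.6m | $0.00 |
| ruff-005 | rust | easy | with-bobbin | 100.0% | 100.0% | 100.0% | 100.0% | 2.8m | $0.63 |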
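
The page does not say how the F1 column is computed. Assuming it is the usual harmonic mean of precision and recall, the flask-001 rows are consistent with that reading (precision 100.0%, recall 33.3%, F1 50.0%). Some rows, such as flask-004 no-bobbin (100.0%, 70.0%, 81.9%), sit slightly below the harmonic mean of the displayed averages, which is what one would expect if F1 is calculated per run and then averaged. The sketch below illustrates both points; the per-run values in `hypothetical_runs` are invented for illustration and are not taken from the evaluation data.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Check against a table row: flask-001 (precision 100.0%, recall 33.3%).
flask_001_f1 = round(f1_score(100.0, 33.3), 1)

# HYPOTHETICAL per-run (precision, recall) pairs, made up for illustration.
hypothetical_runs = [(100.0, 50.0), (100.0, 90.0)]

# Option A: compute F1 per run, then average the F1 values.
mean_of_f1 = sum(f1_score(p, r) for p, r in hypothetical_runs) / len(hypothetical_runs)

# Option B: average precision and recall first, then compute a single F1.
mean_p = sum(p for p, _ in hypothetical_runs) / len(hypothetical_runs)
mean_r = sum(r for _, r in hypothetical_runs) / len(hypothetical_runs)
f1_of_means = f1_score(mean_p, mean_r)

if __name__ == "__main__":
    print(f"flask-001 F1: {flask_001_f1}")
    print(f"mean of per-run F1: {mean_of_f1:.1f}")
    print(f"F1 of mean P/R:     {f1_of_means:.1f}")
```

When recall varies between runs, the two options give different values, so the F1 column should not be expected to match a recomputation from the averaged Precision and Recall columns exactly.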