Storage & Data Flow
Bobbin uses a dual-storage architecture: LanceDB as the primary store for chunks, embeddings, and full-text search, with SQLite handling temporal coupling data and global metadata.
LanceDB (Primary Storage)
All chunk data, embeddings, and full-text search live in LanceDB:
chunks table:
- id: string # SHA256-based unique chunk ID
- vector: float[384] # MiniLM embedding
- repo: string # Repository name (for multi-repo support)
- file_path: string # Relative file path
- file_hash: string # Content hash for incremental indexing
- language: string # Programming language
- chunk_type: string # function, method, class, section, etc.
- chunk_name: string? # Function/class/section name (nullable)
- start_line: uint32 # Starting line number
- end_line: uint32 # Ending line number
- content: string # Original chunk content
- full_context: string? # Context-enriched text used for embedding (nullable)
- indexed_at: string # Timestamp
LanceDB also maintains an FTS index on the content field for keyword search.
SQLite (Auxiliary)
SQLite only stores temporal coupling data and global metadata:
-- Temporal coupling (git co-change relationships)
CREATE TABLE coupling (
file_a TEXT NOT NULL,
file_b TEXT NOT NULL,
score REAL,
co_changes INTEGER,
last_co_change INTEGER,
PRIMARY KEY (file_a, file_b)
);
-- Global metadata
CREATE TABLE meta (
key TEXT PRIMARY KEY,
value TEXT
);
Key Types
Chunk
A semantic unit extracted from source code:
#![allow(unused)]
fn main() {
struct Chunk {
id: String, // SHA256-based unique ID
file_path: String, // Source file path
chunk_type: ChunkType,// function, class, struct, etc.
name: Option<String>, // Function/class name
start_line: u32, // Starting line number
end_line: u32, // Ending line number
content: String, // Actual code content
language: String, // Programming language
}
}
SearchResult
#![allow(unused)]
fn main() {
struct SearchResult {
chunk: Chunk, // The matched chunk
score: f32, // Relevance score
match_type: Option<MatchType>, // How it was matched
}
}
Hybrid Search (RRF)
The hybrid search combines semantic (vector) and keyword (FTS) results using Reciprocal Rank Fusion:
RRF_score = semantic_weight / (k + semantic_rank) + keyword_weight / (k + keyword_rank)
Where:
k = 60(standard RRF constant)semantic_weightfrom config (default 0.7)keyword_weight = 1 - semantic_weight
Results that appear in both searches get boosted scores and are marked as [hybrid] matches.