Disk-Native BM25 Backend¶
barrel_vectordb provides a disk-native BM25 backend optimized for large-scale text search with predictable memory usage and fast query performance.
Overview¶
The bm25_disk backend stores inverted indexes on SSD with Block-Max MaxScore
for early termination, similar to how production search engines like Elasticsearch
and Lucene operate.
Key Features¶
- Disk-native storage - Postings on SSD, only metadata in RAM
- Block-Max MaxScore - Early termination skips 95%+ of postings
- Hot layer - Sub-millisecond writes, background compaction
- O(1) startup - Lazy loading, no full index load
- Cluster-ready - Scatter-gather search across shards
- Hybrid search - Combine BM25 + vector search with RRF or linear fusion
When to Use¶
| Use Case | Recommended Backend |
|---|---|
| Small corpus (<100K docs) | memory - simpler, lower latency |
| Large corpus (>100K docs) | disk - predictable memory |
| Memory-constrained environments | disk |
| Write-heavy workloads | disk with large hot layer |
| Hybrid search (BM25 + vectors) | Either - both support hybrid |
Memory Usage¶
| Component | Size | Notes |
|---|---|---|
| Lexicon | ~40 bytes/term | Term → integer ID in RocksDB |
| Block-max index | ~50 bytes/block | mmap'd, OS cached |
| Doc stats | ~8 bytes/doc | In ETS |
| Hot layer | configurable | Default 50K docs max |
Example sizing for 1M documents with 500K unique terms:
- Memory backend: ~500 MB - 1 GB
- Disk backend: ~50 MB + OS page cache
Configuration¶
Standalone Store¶
%% Create store with disk BM25
{ok, _} = barrel_vectordb:start_link(#{
name => my_store,
path => "/data/my_store",
dimensions => 768,
bm25_backend => disk,
bm25_disk => #{
base_path => "/data/my_store/bm25",
hot_max_size => 50000,
hot_compaction_threshold => 0.8,
k1 => 1.2,
b => 0.75
}
}).
Configuration Options¶
| Option | Default | Description |
|---|---|---|
base_path |
{store_path}/bm25 |
Directory for BM25 files |
hot_max_size |
50000 |
Max docs in hot layer before compaction |
hot_compaction_threshold |
0.8 |
Trigger compaction at this % of max |
k1 |
1.2 |
BM25 term frequency saturation |
b |
0.75 |
BM25 document length normalization |
block_size |
128 |
Documents per posting block |
BM25 Parameters¶
The BM25 scoring formula is:
- k1 (default: 1.2) - Controls term frequency saturation. Higher values give more weight to repeated terms.
- b (default: 0.75) - Controls document length normalization. 0 = no normalization, 1 = full normalization.
API Reference¶
Adding Documents¶
Documents are automatically indexed in BM25 when added to the store:
%% Add single document - automatically indexed in BM25
ok = barrel_vectordb:add(Store, <<"doc1">>, <<"Hello world">>, #{}).
%% Add with metadata
ok = barrel_vectordb:add(Store, <<"doc2">>, <<"Erlang is great">>,
#{type => article, author => <<"joe">>}).
%% Batch add (more efficient)
{ok, _} = barrel_vectordb:add_batch(Store, [
{<<"doc3">>, <<"First document">>, #{}},
{<<"doc4">>, <<"Second document">>, #{}}
]).
BM25 Search¶
%% Basic BM25 search
{ok, Results} = barrel_vectordb:search_bm25(Store, <<"erlang programming">>, #{k => 10}).
%% Returns: [{<<"doc2">>, 2.45}, {<<"doc1">>, 1.82}, ...]
%% With more results
{ok, Results} = barrel_vectordb:search_bm25(Store, <<"erlang">>, #{k => 50}).
Hybrid Search (BM25 + Vector)¶
Combine keyword matching with semantic similarity:
%% Combined search with RRF fusion (recommended)
{ok, Results} = barrel_vectordb:search_hybrid(Store, <<"erlang concurrency">>, #{
k => 10,
bm25_weight => 0.5,
vector_weight => 0.5,
fusion => rrf
}).
%% Linear fusion (simpler, score-based)
{ok, Results} = barrel_vectordb:search_hybrid(Store, <<"erlang">>, #{
k => 10,
bm25_weight => 0.3,
vector_weight => 0.7,
fusion => linear
}).
Fusion Algorithms:
| Algorithm | Description | Best For |
|---|---|---|
rrf |
Reciprocal Rank Fusion with k=60 | General use, rank-based |
linear |
Weighted sum of normalized scores | When score magnitude matters |
Index Management¶
%% Force compaction (hot layer -> disk)
ok = barrel_vectordb_server:bm25_compact(Store).
%% Get index statistics
{ok, Stats} = barrel_vectordb_server:bm25_info(Store).
%% Returns: #{
%% backend => disk,
%% total_docs => 100000,
%% vocab_size => 50000,
%% hot_docs => 1234,
%% disk_docs => 98766,
%% avgdl => 150.5
%% }
Cluster Usage¶
Creating a Collection with BM25¶
%% Create a sharded collection with disk BM25
ok = barrel_vectordb:create_collection(<<"articles">>, #{
dimension => 768,
shards => 4,
replication_factor => 2,
backend => diskann,
bm25_backend => disk,
bm25_config => #{
hot_max_size => 100000,
k1 => 1.2,
b => 0.75
}
}).
Each shard gets its own independent BM25 index with:
- Own postings files on disk
- Own block-max index (mmap'd)
- Own hot layer
- Own RocksDB term/doc ID mapping
Cluster BM25 Search¶
%% BM25 search across all shards (scatter-gather)
{ok, Results} = barrel_vectordb:cluster_search_bm25(<<"articles">>, <<"erlang">>, #{
k => 10
}).
%% Returns: [{<<"doc123">>, 3.45}, {<<"doc456">>, 2.89}, ...]
Cluster Hybrid Search¶
%% Hybrid search across all shards
{ok, Results} = barrel_vectordb:cluster_search_hybrid(<<"articles">>, <<"erlang">>, #{
k => 10,
bm25_weight => 0.5,
vector_weight => 0.5,
fusion => rrf
}).
The cluster hybrid search:
- Embeds the query once (using collection's embedder)
- Scatters both BM25 and vector queries to all shards
- Each shard returns its local results
- Gathers and deduplicates results
- Applies fusion algorithm (RRF or linear)
- Returns top-K results
Architecture¶
File Layout¶
/data/my_store/bm25/
├── bm25.meta # Header + global stats (4KB aligned)
├── bm25.postings # Compressed posting blocks (4KB aligned)
├── bm25.blockmax # Block-max index (mmap'd)
└── bm25.ids/ # RocksDB: term/doc ID mapping
├── terms_fwd # term string → integer ID
├── terms_rev # integer ID → term string
├── docs_fwd # doc string ID → integer ID
└── docs_rev # integer ID → doc string ID
Block-Max MaxScore Algorithm¶
The search algorithm efficiently skips blocks that cannot contribute to top-k:
1. Initialize
- Load block-max lists for query terms
- Set threshold = 0
2. For each document position:
a. Find pivot: smallest doc_id where sum(max_impacts) > threshold
b. For each term's block at pivot:
- If block.max_impact < threshold - accumulated: SKIP block
- Otherwise: decompress and score
c. If score > threshold:
- Add to result heap
- Update threshold = heap.min_score
3. Return top-K from heap
This typically skips 95%+ of postings for selective queries.
Hot Layer Architecture¶
New documents go to an in-memory hot layer for fast writes:
Write Request
│
▼
┌─────────────┐
│ Hot Layer │ ← In-memory maps
│ (fast) │ Sub-ms writes
└──────┬──────┘
│ Compaction (at 80% capacity)
▼
┌─────────────┐
│ Disk Layer │ ← Block-max postings
│ (large) │ mmap'd block-max index
└─────────────┘
│
▼
Search combines both layers
Varint Encoding¶
Postings use delta-encoded varints for compact storage:
Original doc IDs: [100, 105, 108, 150]
Delta encoded: [100, 5, 3, 42]
Varint bytes: [0x64, 0x05, 0x03, 0x2A] (4 bytes vs 16 bytes)
Performance Tuning¶
Memory vs Latency Tradeoffs¶
| Setting | Effect |
|---|---|
Increase hot_max_size |
More RAM, fewer compactions, faster writes |
Decrease hot_max_size |
Less RAM, more frequent compactions |
Decrease block_size |
Finer skip granularity, more overhead |
| Larger OS page cache | More hot blocks in RAM, lower disk reads |
Recommended Settings¶
Write-heavy workload:
Read-heavy workload:
Memory-constrained:
High-throughput hybrid search:
Comparison with Memory Backend¶
| Feature | Memory Backend | Disk Backend |
|---|---|---|
| Storage | In-memory maps | Disk + hot layer |
| Startup | Load all data | O(1) lazy load |
| Memory | O(corpus size) | O(vocab + hot) |
| Write latency | ~microseconds | ~milliseconds |
| Search latency | ~milliseconds | ~milliseconds |
| Max corpus | Limited by RAM | Limited by SSD |
| Persistence | No | Yes |
Migration Guide¶
From Memory Backend¶
%% 1. Stop writes to the store
%% 2. Update config to use disk backend
NewConfig = OldConfig#{bm25_backend => disk},
%% 3. Restart store - documents need to be re-indexed
barrel_vectordb:stop(Store),
{ok, _} = barrel_vectordb:start_link(NewConfig).
%% 4. Re-index existing documents if needed
%% (Documents added after restart will be auto-indexed)
From External Search Engine¶
%% Export documents and index them
Documents = external_search:export_all(),
lists:foreach(fun({Id, Text, Meta}) ->
barrel_vectordb:add(Store, Id, Text, Meta)
end, Documents).
Troubleshooting¶
High Memory Usage¶
If memory usage is higher than expected:
- Check hot layer size:
barrel_vectordb_server:bm25_info(Store) - Reduce
hot_max_sizein config - Lower
hot_compaction_thresholdfor more frequent compaction
Slow Search¶
If BM25 searches are slow:
- Ensure OS has sufficient page cache for block-max index
- Check if hot layer is very large (may need compaction)
- For multi-term queries, ensure block-max index is loaded
Disk Space¶
The disk backend uses approximately:
- ~50 bytes per posting (term-doc pair) after compression
- ~20 bytes per block-max entry
- RocksDB overhead for ID mapping
For 1M documents with average 100 terms each: - Postings: ~5 GB - Block-max: ~100 MB - ID mapping: ~50 MB
Testing¶
The BM25 implementation includes tests for formula correctness and relevance quality.
Running BM25 Tests¶
# Formula correctness tests (11 tests)
rebar3 eunit --module=barrel_vectordb_bm25_formula_tests
# Corpus-based relevance tests (10 tests)
rebar3 eunit --module=barrel_vectordb_bm25_corpus_tests
# Both modules
rebar3 eunit --module=barrel_vectordb_bm25_formula_tests,barrel_vectordb_bm25_corpus_tests
# All BM25 tests (includes memory backend tests)
rebar3 eunit --module=barrel_vectordb_bm25_tests,barrel_vectordb_bm25_disk_tests,barrel_vectordb_bm25_formula_tests,barrel_vectordb_bm25_corpus_tests
Formula Correctness Tests¶
Located in test/barrel_vectordb_bm25_formula_tests.erl, these tests verify the BM25 scoring formula with hand-calculated expected values:
| Test | Description |
|---|---|
single_term_single_doc |
Verifies basic BM25 score calculation |
multi_term_score |
Scores should sum across query terms |
tf_impact |
Higher TF should increase score |
idf_impact |
Rare terms should have higher IDF contribution |
doc_length_normalization |
Shorter docs favored with equal TF |
k1_parameter |
k1=0 saturates TF immediately, k1=high allows more TF effect |
b_parameter |
b=0 no length normalization, b=1 full normalization |
tf_saturation |
Score(TF=10) < 2 × Score(TF=5) due to diminishing returns |
edge_cases |
Empty queries, missing terms, single doc corpus |
hand_calculated_example |
Full worked example from documentation |
backend_consistency |
Memory and disk backends produce identical scores |
Corpus-Based Relevance Tests¶
Located in test/barrel_vectordb_bm25_corpus_tests.erl, these tests evaluate search quality using IR metrics:
Test Corpus: 100 documents across 5 topic clusters:
- Erlang/OTP (docs 1-20)
- Python/Data Science (docs 21-40)
- Java/Enterprise (docs 41-60)
- Distributed Systems (docs 61-80)
- General Programming (docs 81-100)
Test Queries: 20 queries with relevance judgments covering:
- Single-term topic queries
- Multi-term queries
- Cross-topic queries
- Rare term queries (high IDF)
- Common term queries (low IDF)
Metrics Evaluated:
| Test | Description |
|---|---|
precision_at_5 |
P@5 averaged across all queries |
precision_at_10 |
P@10 averaged across all queries |
recall_at_10 |
R@10 for highly relevant documents |
ndcg_at_10 |
Normalized DCG with graded relevance |
relevant_in_top_k |
At least one relevant doc in top 5 for topic queries |
ranking_order |
Results sorted by descending score |
cross_topic_queries |
Queries matching multiple topic clusters |
rare_term_queries |
High IDF queries return correct results |
memory_disk_consistency |
Both backends produce similar results |
Test Data Files¶
| File | Description |
|---|---|
test/data/bm25_test_corpus.terms |
100 documents in Erlang term format |
test/data/bm25_test_queries.terms |
20 queries with relevance judgments |
Adding Custom Test Documents¶
The corpus file uses Erlang term format:
%% test/data/bm25_test_corpus.terms
[
{<<"doc1">>, <<"Your document text here">>},
{<<"doc2">>, <<"Another document">>},
...
].
Query judgments specify relevant and partially relevant documents: