barrel_docdb Design Document¶
Overview¶
barrel_docdb is an embeddable document database for Erlang applications. It provides document storage with MVCC (Multi-Version Concurrency Control), binary attachments, secondary indexes, a changes feed, and replication primitives.
Design Goals¶
- Embeddable: Run as part of your Erlang application, no external services
- Reliable: ACID transactions via RocksDB, crash-safe operations
- Replication-ready: CouchDB-compatible revision model for sync
- Distributed: HLC-based ordering for decentralized deployments
- Efficient: Optimized storage for documents and large attachments
- Reactive: Real-time subscriptions for document changes
- Simple API: Clean, intuitive public interface
Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Application │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ barrel_docdb (Public API) │
│ put_doc, get_doc, find, subscribe, replicate, get_hlc, ... │
└─────────────────────────────────────────────────────────────────┘
│
┌──────────────┬───────────┼───────────┬──────────────┐
▼ ▼ ▼ ▼ ▼
┌────────┐ ┌────────────┐ ┌────────┐ ┌────────┐ ┌────────────┐
│barrel │ │barrel_view │ │barrel │ │barrel │ │barrel_ │
│db_server│ │(per-view) │ │_rep │ │_sub │ │query_sub │
└────────┘ └────────────┘ └────────┘ └────────┘ └────────────┘
│ │ │ │ │
│ │ │ └──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ barrel_hlc (HLC Clock) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ barrel_store_rocksdb │
│ (storage abstraction) │
└─────────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ Document Store │ │ Attachment Store │
│ (RocksDB) │ │ (RocksDB + BlobDB) │
└──────────────────────────┘ └──────────────────────────┘
Core Components¶
1. Database Server (barrel_db_server)¶
Each database runs as a separate gen_server process:
-record(state, {
name :: binary(), %% Database name
db_path :: string(), %% Data directory path
store_ref :: rocksdb:db_handle(),%% Document store reference
att_ref :: rocksdb:db_handle(),%% Attachment store reference
view_sup :: pid() %% View supervisor
}).
Responsibilities: - Document CRUD operations - Revision management - Sequence number generation - Coordination with views and changes
Process Registry:
Databases are registered in persistent_term for fast lookup:
2. Document Model¶
Documents use a CouchDB-compatible revision model:
┌─────────────────────────────────────────────────────────────────┐
│ Document │
├─────────────────────────────────────────────────────────────────┤
│ id : <<"user:alice">> │
│ body : #{<<"name">> => <<"Alice">>, ...} │
├─────────────────────────────────────────────────────────────────┤
│ Revision Tree │
│ │
│ 1-aaa (root) │
│ │ │
│ 2-bbb │
│ ╱ ╲ │
│ 3-ccc 3-ddd (conflict) │
│ │
├─────────────────────────────────────────────────────────────────┤
│ Current Rev : 3-ccc (winning) │
│ Deleted : false │
│ Sequence : {0, 42} │
└─────────────────────────────────────────────────────────────────┘
Revision Format: <generation>-<sha256_hex>
The hash is computed from:
3. Storage Layer¶
Dual-Database Architecture¶
Each barrel database uses two separate RocksDB instances:
| Store | Purpose | Optimization |
|---|---|---|
| Document Store | Docs, metadata, sequences, views | Standard RocksDB, optimized for small values |
| Attachment Store | Binary attachments | BlobDB enabled, optimized for large values |
Rationale: Separating large attachments from small documents avoids write amplification during compaction. RocksDB's BlobDB stores large values in separate blob files.
Key Schema¶
Document Store Keys:
├── doc_info/{db}/{docid} → DocInfo (metadata + revtree)
├── doc_rev/{db}/{docid}/{rev} → Document body
├── doc_hlc/{db}/{hlc} → Change entry (HLC-ordered)
├── path_hlc/{db}/{topic}/{hlc} → Path-indexed change (exact match)
├── local/{db}/{docid} → Local document (not replicated)
├── ars/{db}/{path} → Path index for queries
├── view_meta/{db}/{viewid} → View metadata
├── view_hlc/{db}/{viewid} → View indexed HLC
├── view_index/{db}/{viewid}:{key}:{docid} → View index entry
└── view_by_docid/{db}/{viewid}:{docid} → Reverse index
Posting CF Keys (posting_cf):
├── ars_posting/{db}/{field}/{value} → DocId posting list
└── prefix_changes/{db}/{prefix}/{bucket} → HLC-ordered changes posting list
Attachment Store Keys:
└── att/{db}/{docid}/{attname} → Attachment binary data
Keys are designed for efficient range scans and prefix matching.
Key Prefixes: | Prefix | Hex | Description | |--------|-----|-------------| | DOC_INFO | 0x01 | Document metadata | | DOC_REV | 0x02 | Document body by revision | | DOC_HLC | 0x0D | Changes ordered by HLC | | PATH_HLC | 0x0E | Path-indexed changes (exact match) | | PREFIX_CHANGES | 0x1B | Sharded prefix changes (wildcard) | | ARS | 0x09 | Path index for queries | | LOCAL | 0x03 | Local documents | | VIEW_* | 0x04-0x08 | View storage |
RocksDB Optimizations¶
Both stores share a common LRU block cache managed by barrel_cache:
┌─────────────────────────────────────────────────────────────────┐
│ Shared Block Cache (512MB) │
├─────────────────────────────────────────────────────────────────┤
│ Document Store │ Attachment Store │
│ ├── Data blocks │ ├── Metadata blocks │
│ ├── Index blocks │ └── BlobDB index │
│ └── Bloom filters │ │
└─────────────────────────────────────────────────────────────────┘
Document Store Optimizations:
| Setting | Default | Purpose |
|---------|---------|---------|
| block_cache | 512MB shared | LRU cache for data/index blocks |
| bloom_filter | 10 bits/key | Reduce disk reads for point lookups |
| write_buffer_size | 64MB | Memtable size before flush |
| max_write_buffer_number | 3 | Concurrent memtables |
| compression | snappy | Fast compression for all levels |
| bottommost_compression | snappy | Configurable (zstd if available) |
| level0_file_num_compaction_trigger | 4 | Trigger L0→L1 compaction |
| max_background_jobs | schedulers | Parallel compaction threads |
Attachment Store Optimizations:
| Setting | Default | Purpose |
|---------|---------|---------|
| enable_blob_files | true | Store large values in blob files |
| min_blob_size | 4KB | Threshold for blob storage |
| blob_file_size | 256MB | Maximum blob file size |
| blob_compression_type | snappy | Blob compression (zstd if available) |
| blob_garbage_collection_age_cutoff | 0.25 | GC blobs older than 25% |
| blob_garbage_collection_force_threshold | 0.5 | Force GC at 50% garbage |
Configuration:
%% Application environment (sys.config)
{barrel_docdb, [
{block_cache_size, 536870912}, %% 512MB shared cache
{data_dir, "/var/lib/barrel"}
]}
%% Per-database options
barrel_docdb:create_db(<<"mydb">>, #{
store_opts => #{
write_buffer_size => 128 * 1024 * 1024, %% 128MB
max_open_files => 2000,
rate_limit_bytes_per_sec => 100 * 1024 * 1024 %% 100MB/s
},
att_opts => #{
blob_file_size => 512 * 1024 * 1024, %% 512MB blobs
min_blob_size => 8192 %% 8KB threshold
}
}).
4. HLC (Hybrid Logical Clock)¶
Every document modification is timestamped with an HLC:
HLC Timestamp: {WallTime, LogicalCounter, NodeId}
├── WallTime : Physical time in microseconds
├── LogicalCounter : Logical extension for same-time events
└── NodeId : Unique node identifier
Properties: - Causally consistent ordering across distributed nodes - Monotonically increasing within a node - No central coordinator required - Clock skew detection and handling
API:
%% Get current HLC
Ts = barrel_docdb:get_hlc().
%% Generate new timestamp (advances clock)
NewTs = barrel_docdb:new_hlc().
%% Sync with remote node
{ok, SyncedTs} = barrel_docdb:sync_hlc(RemoteTs).
5. Changes Feed¶
Every document modification generates an HLC-timestamped change entry:
%% Change entry structure
#{
id => DocId,
hlc => HlcTimestamp,
rev => RevId,
deleted => boolean(),
changes => [RevId]
}
Changes are stored by HLC for efficient streaming:
%% Get changes since HLC timestamp
barrel_changes:fold_changes(StoreRef, DbName, SinceHlc, Fun, Acc)
Use Cases: - Real-time notifications - Incremental view updates - Replication source - Path-filtered change feeds
6. Path Subscriptions¶
Subscribe to document changes matching MQTT-style path patterns:
┌─────────────────────────────────────────────────────────────────┐
│ Path Subscription Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Client subscribes with pattern (e.g., "users/+/profile") │
│ │ │
│ ▼ │
│ 2. barrel_sub registers pattern in match_trie │
│ │ │
│ ▼ │
│ 3. On document write, paths extracted from document │
│ │ │
│ ▼ │
│ 4. Paths matched against registered patterns │
│ │ │
│ ▼ │
│ 5. Matching subscribers receive notification message │
│ │
└─────────────────────────────────────────────────────────────────┘
Pattern Syntax:
| Pattern | Description |
|---------|-------------|
| users/alice | Exact match |
| users/+ | Single segment wildcard |
| users/# | Multi-segment wildcard |
| +/orders/# | Mixed wildcards |
7. Query Subscriptions¶
Subscribe to changes for documents matching a query:
Query = #{where => [{path, [<<"type">>], <<"user">>}]},
{ok, SubRef} = barrel_docdb:subscribe_query(DbName, Query).
Optimization: - Query paths are extracted at subscription time - Changes are first filtered by path intersection - Full query evaluation only when paths overlap - Avoids evaluating query for every document change
8. Path-Indexed Changes¶
Changes are indexed by path at write time for efficient filtered queries:
Example document:
Creates index entries:
Benefits: - O(k) query time where k = matching changes (vs O(n) full scan) - Efficient filtered replication - Real-time path subscriptions use same index
8b. Sharded Prefix Changes (Wildcard Queries)¶
For wildcard path queries (e.g., paths => [<<"users/#">>]), a separate sharded posting list index provides efficient HLC-ordered iteration:
Key Format:
Value Format (Posting List):
Sharding Strategy:
- Bucket = wall_time div 3600 (1-hour granularity)
- Bounds posting list growth for high-write workloads
- Range scan discovers only existing buckets (no empty iteration)
Example for document:
Creates entries in multiple prefix buckets:
prefix_changes/mydb/type/user/482345 → [<< hlc1, "doc1", ... >>]
prefix_changes/mydb/status/active/482345 → [<< hlc1, "doc1", ... >>]
Query Execution:
1. Compute start bucket from since HLC
2. Range scan posting_cf for prefix + bucket keys
3. Iterate sorted entries, filter by HLC
4. Collect until limit reached
Performance: - 50x faster than path_hlc prefix scan with deduplication - Native RocksDB merge operator keeps entries sorted - No post-processing or deduplication needed
9. Views (Secondary Indexes)¶
Views are incremental map-reduce indexes:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Changes │──────│ View │──────│ Index │
│ Feed │ │ gen_statem │ │ Storage │
└─────────────┘ └─────────────┘ └─────────────┘
View Lifecycle:
1. Register view with module implementing barrel_view behaviour
2. View process subscribes to changes feed
3. For each change, call Module:map(Doc) to get key-value pairs
4. Store/update index entries
5. Track indexed sequence for incremental updates
Automatic Rebuild:
When Module:version() changes, the view clears and rebuilds from scratch.
10. Replication¶
Replication follows the CouchDB protocol with HLC-based ordering:
┌──────────────────────────────────────────────────────────────┐
│ Replication Flow │
├──────────────────────────────────────────────────────────────┤
│ │
│ 1. Read checkpoint (last_hlc from local doc) │
│ │ │
│ ▼ │
│ 2. Get changes from source since last_hlc │
│ (optionally filtered by paths/query) │
│ │ │
│ ▼ │
│ 3. Sync target HLC with source timestamps │
│ │ │
│ ▼ │
│ 4. For each change: │
│ ├── Call revsdiff(target, docid, revs) │
│ ├── Get missing revisions from source with history │
│ └── Put to target using put_rev(doc, history, deleted) │
│ │ │
│ ▼ │
│ 5. Write checkpoint with new last_hlc │
│ │ │
│ ▼ │
│ 6. Repeat until no more changes │
│ │
└──────────────────────────────────────────────────────────────┘
Filtered Replication: Replicate only documents matching filters:
barrel_rep:replicate(Source, Target, #{
filter => #{
paths => [<<"users/#">>], %% Path pattern filter
query => #{where => [...]} %% Query filter
}
}).
%% Both filters use AND logic when combined
Transport Abstraction:
The barrel_rep_transport behaviour allows pluggable transports:
- barrel_rep_transport_local - Same Erlang VM
- Custom HTTP, TCP, or other transports
Checkpoints: Stored as local documents (not replicated):
Key: <<"replication-checkpoint-{rep_id}">>
Value: #{<<"history">> => [#{<<"source_last_hlc">> => ...}]}
11. Query Parallelization¶
Large queries (>100 documents) use parallel CBOR decode + condition matching via barrel_parallel:
Query Coordinator
│
┌──────────────────┼──────────────────┐
│ │ │
Worker 1 Worker 2 Worker N
(DocIds 1-100) (DocIds 101-200) (DocIds N*100+)
│ │ │
Decode CBOR Decode CBOR Decode CBOR
Match Conds Match Conds Match Conds
│ │ │
└──────────────────┼──────────────────┘
│
Merge Results
(preserve order)
Configuration:
- 4 workers (PostgreSQL-style conservative default)
- Threshold: >100 documents triggers parallel execution
- Worker pool managed by barrel_parallel gen_server
Tradeoffs:
| Workers | Throughput | Reason |
|---|---|---|
| 4 | Best | Optimal for RocksDB |
| 8 | ~50% slower | RocksDB contention, cache thrashing |
| 2 | ~30% slower | Underutilizes CPU |
12. Chunked Query Execution¶
Queries return results in chunks with continuation tokens for memory-efficient iteration:
%% First chunk (default 1000 results)
{ok, Results, #{has_more := true, continuation := Token}} =
barrel_docdb:find(Db, Query).
%% Next chunk
{ok, More, #{has_more := false}} =
barrel_docdb:find(Db, Query, #{continuation => Token}).
Cursor Management:
- Cursors stored in ETS with 60-second TTL
- Each access extends TTL
- Cursors hold RocksDB snapshots for consistent reads
- Snapshots released when has_more => false or cursor expires
Design Decisions¶
Why RocksDB?¶
- Embedded: No external service dependencies
- LSM-tree: Optimized for write-heavy workloads
- Atomic batches: Multiple operations in one atomic write
- BlobDB: Efficient large value storage
- Snapshots: Consistent reads during iteration
Why Revision Trees?¶
- Conflict detection: Multiple concurrent updates create branches
- Replication: Only transfer missing revisions
- History: Track document evolution
- Deterministic winners: Same data = same winning revision
Why Separate Attachment Store?¶
RocksDB stores values inline in SST files. Large values cause: - High write amplification during compaction - Wasted space in block cache - Slower reads due to large block sizes
BlobDB stores large values in separate blob files, solving these issues.
Why Local Documents?¶
Local documents: - Are not replicated - Don't have revision history - Are used for per-database metadata (checkpoints, config)
This separates replication state from user data.
Data Flow Examples¶
Put Document¶
put_doc(Db, Doc)
│
├── Validate document
├── Generate/validate ID
├── Compute new revision hash
│
├── If update:
│ ├── Check existing doc exists
│ ├── Verify _rev matches current
│ └── Extend revision tree
│
├── Get next sequence number
│
├── Atomic batch write:
│ ├── doc_info (metadata + revtree)
│ ├── doc_rev (body at new revision)
│ └── doc_seq (change entry)
│
└── Return {ok, #{id, rev, ok}}
Query View¶
query_view(Db, ViewId, Opts)
│
├── Get view process
├── Ensure view is up-to-date (refresh if needed)
│
├── Build key range from start_key/end_key
│
├── Iterate view_index entries:
│ ├── Decode key and value
│ ├── Apply limit
│ └── Optionally fetch full document
│
└── Return {ok, Results}
Replicate¶
replicate(Source, Target)
│
├── Generate replication ID
├── Read checkpoint (get last_seq)
│
├── Loop:
│ ├── Get changes batch from source
│ │
│ ├── For each change:
│ │ ├── revsdiff(target, docid, revs)
│ │ ├── If missing revs:
│ │ │ ├── get_doc(source, docid, {history: true})
│ │ │ └── put_rev(target, doc, history, deleted)
│ │
│ ├── Write checkpoint
│ └── Continue until no changes
│
└── Return {ok, stats}
Supervision Tree¶
barrel_docdb_sup (one_for_one)
├── barrel_cache # Shared RocksDB block cache
├── barrel_hlc_clock # Global HLC clock
├── barrel_sub # Path subscriptions manager
├── barrel_query_sub # Query subscriptions manager
├── barrel_path_dict # Path ID interning for posting lists
├── barrel_query_cursor # Chunked query cursor management
├── barrel_parallel # Worker pool for parallel queries
└── barrel_db_sup (simple_one_for_one)
├── barrel_db_server (db1)
│ └── barrel_view_sup (simple_one_for_one)
│ ├── barrel_view (view1)
│ └── barrel_view (view2)
├── barrel_db_server (db2)
│ └── ...
└── ...
Performance Considerations¶
Write Path¶
- Batch operations reduce disk I/O
- Sequence numbers enable efficient change tracking
- Revision computation is CPU-bound (SHA-256)
Read Path¶
- Document lookup is O(1) key access
- View queries use RocksDB iterators
- Snapshots provide consistent reads
Memory Usage¶
- RocksDB block cache for hot data
- Views process changes incrementally
- Large attachments stored in blob files
Disk Usage¶
- RocksDB compaction reclaims space
- Old revisions can be pruned
- Attachments use content-addressable storage
Future Considerations¶
Planned Features¶
- Continuous replication
- HTTP transport for replication
- Reduce functions for views
- Conflict resolution helpers
Extension Points¶
- Custom storage backends (via behaviour)
- Custom replication transports
- View behaviours for different index types
File Structure¶
src/
├── barrel_docdb_app.erl # Application callbacks
├── barrel_docdb_sup.erl # Top-level supervisor
├── barrel_docdb.erl # Public API
│
├── barrel_db_server.erl # Per-database gen_server
├── barrel_db_sup.erl # Database supervisor
│
├── barrel_doc.erl # Document utilities
├── barrel_revtree_bin.erl # Revision tree (compact binary encoding)
│
├── barrel_hlc.erl # HLC clock management
│
├── barrel_att.erl # Attachment API
├── barrel_att_store.erl # Attachment storage
│
├── barrel_changes.erl # Changes feed API (HLC-based)
├── barrel_changes_stream.erl # Streaming changes
│
├── barrel_sub.erl # Path subscriptions manager
├── barrel_sub_sup.erl # Subscription supervisor
├── barrel_query_sub.erl # Query subscriptions manager
│
├── barrel_query.erl # Declarative query compiler & executor
├── barrel_query_cursor.erl # Chunked query cursor management
├── barrel_ars.erl # Path index API
├── barrel_ars_index.erl # Path index implementation
├── barrel_path_dict.erl # Path ID interning for posting lists
│
├── barrel_parallel.erl # Parallel map/filtermap with worker pool
├── barrel_doc_body_store.erl # Batch document body operations
│
├── barrel_view.erl # View gen_statem
├── barrel_view_index.erl # View index storage (HLC-tracked)
├── barrel_view_sup.erl # View supervisor
│
├── barrel_cache.erl # Shared RocksDB block cache
├── barrel_store_rocksdb.erl # RocksDB storage (optimized)
├── barrel_store_keys.erl # Key encoding (doc_hlc, path_hlc, ars)
├── barrel_docdb_codec_cbor.erl # CBOR encoding/decoding
│
├── barrel_rep.erl # Replication API (filtered)
├── barrel_rep_alg.erl # Replication algorithm (HLC sync)
├── barrel_rep_checkpoint.erl # Checkpoint management (HLC-based)
├── barrel_rep_transport.erl # Transport behaviour
└── barrel_rep_transport_local.erl # Local transport