barrel_docdb Design Document¶

Overview¶

barrel_docdb is an embeddable document database for Erlang applications. It provides document storage with MVCC (Multi-Version Concurrency Control), binary attachments, secondary indexes, a changes feed, and replication primitives.

Design Goals¶

Embeddable: Run as part of your Erlang application, no external services
Reliable: ACID transactions via RocksDB, crash-safe operations
Replication-ready: CouchDB-compatible revision model for sync
Distributed: HLC-based ordering for decentralized deployments
Efficient: Optimized storage for documents and large attachments
Reactive: Real-time subscriptions for document changes
Simple API: Clean, intuitive public interface

Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                        Application                               │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                     barrel_docdb (Public API)                   │
│  put_doc, get_doc, find, subscribe, replicate, get_hlc, ...    │
└─────────────────────────────────────────────────────────────────┘
                               │
    ┌──────────────┬───────────┼───────────┬──────────────┐
    ▼              ▼           ▼           ▼              ▼
┌─────────┐ ┌────────────┐ ┌────────┐ ┌────────┐  ┌────────────┐
│barrel_  │ │barrel_view │ │barrel  │ │barrel  │  │barrel_     │
│db_server│ │(per-view)  │ │_rep    │ │_sub    │  │query_sub   │
└─────────┘ └────────────┘ └────────┘ └────────┘  └────────────┘
    │              │           │           │              │
    │              │           │           └──────────────┘
    │              │           │                  │
    ▼              ▼           ▼                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                      barrel_hlc (HLC Clock)                     │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   barrel_store_rocksdb                          │
│                   (storage abstraction)                         │
└─────────────────────────────────────────────────────────────────┘
          │                                        │
          ▼                                        ▼
┌──────────────────────────┐         ┌──────────────────────────┐
│    Document Store        │         │   Attachment Store       │
│    (RocksDB)             │         │   (RocksDB + BlobDB)     │
└──────────────────────────┘         └──────────────────────────┘

Core Components¶

1. Database Server (barrel_db_server)¶

Each database runs as a separate gen_server process:

-record(state, {
    name      :: binary(),           %% Database name
    db_path   :: string(),           %% Data directory path
    store_ref :: rocksdb:db_handle(),%% Document store reference
    att_ref   :: rocksdb:db_handle(),%% Attachment store reference
    view_sup  :: pid()               %% View supervisor
}).

Responsibilities:

- Document CRUD operations
- Revision management
- Sequence number generation
- Coordination with views and changes

Process Registry: Databases are registered in persistent_term for fast lookup:

persistent_term:put({barrel_db, DbName}, Pid)
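A minimal sketch of how such a registry entry can be written and read back, assuming the key shape shown above. The module and function names (`registry_sketch`, `register_db/2`, `whereis_db/1`) are hypothetical and not part of barrel_docdb; only `persistent_term:put/2` and `persistent_term:get/2` are real OTP calls.

```erlang
-module(registry_sketch).
-export([register_db/2, whereis_db/1]).

%% Store the server pid under the same key shape as in the text above.
register_db(DbName, Pid) when is_binary(DbName), is_pid(Pid) ->
    persistent_term:put({barrel_db, DbName}, Pid).

%% Look the pid up without a message round-trip. persistent_term:get/2
%% returns the given default when the key is absent.
whereis_db(DbName) when is_binary(DbName) ->
    persistent_term:get({barrel_db, DbName}, undefined).
```

Reads from persistent_term are cheap, while updating or erasing a term is comparatively expensive, so this layout suits data that changes rarely, such as databases being opened or closed.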

2. Document Model¶

Documents use a CouchDB-compatible revision model:

┌─────────────────────────────────────────────────────────────────┐
│                         Document                                 │
├─────────────────────────────────────────────────────────────────┤
│ id          : <<"user:alice">>                                  │
│ body        : #{<<"name">> => <<"Alice">>, ...}                 │
├─────────────────────────────────────────────────────────────────┤
│                       Revision Tree                              │
│                                                                  │
│         1-aaa (root)                                            │
│            │                                                     │
│         2-bbb                                                    │
│        ╱     ╲                                                   │
│     3-ccc   3-ddd  (conflict)                                   │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│ Current Rev : 3-ccc (winning)                                   │
│ Deleted     : false                                             │
│ Sequence    : {0, 42}                                           │
└─────────────────────────────────────────────────────────────────┘

Revision Format: <generation>-<sha256_hex>

The hash is computed from:

crypto:hash(sha256, term_to_binary({DocBody, ParentRev, Deleted}))
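The revision format and hash input above can be combined as in the following sketch. The module name `rev_sketch`, the lowercase hex encoding, and the use of `undefined` as the parent of a first revision are assumptions made for illustration; the source only specifies the `<generation>-<sha256_hex>` shape and the hashed tuple.

```erlang
-module(rev_sketch).
-export([new_rev/4]).

%% Build "<generation>-<sha256_hex>" from the inputs named in the text.
%% Generation is the parent's generation + 1 (1 for a first revision).
new_rev(Generation, DocBody, ParentRev, Deleted)
  when is_integer(Generation), Generation > 0, is_boolean(Deleted) ->
    Hash = crypto:hash(sha256, term_to_binary({DocBody, ParentRev, Deleted})),
    <<(integer_to_binary(Generation))/binary, $-, (to_hex(Hash))/binary>>.

%% Lowercase hex encoding, one character per 4-bit nibble (assumed casing).
to_hex(Bin) ->
    << <<(hex_digit(N))>> || <<N:4>> <= Bin >>.

hex_digit(N) when N < 10 -> $0 + N;
hex_digit(N) -> $a + N - 10.
```

Because the hash covers the body, the parent revision, and the deleted flag, two nodes applying the same edit to the same parent arrive at the same revision identifier.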

3. Storage Layer¶

Dual-Database Architecture¶

Each barrel database uses two separate RocksDB instances:

Store	Purpose	Optimization
Document Store	Docs, metadata, sequences, views	Standard RocksDB, optimized for small values
Attachment Store	Binary attachments	BlobDB enabled, optimized for large values

Rationale: Separating large attachments from small documents reduces write amplification during compaction, because attachment bytes are not rewritten each time document data is compacted. RocksDB's BlobDB stores large values in separate blob files.

Key Schema¶

Document Store Keys:
├── doc_info/{db}/{docid}              → DocInfo (metadata + revtree)
├── doc_rev/{db}/{docid}/{rev}         → Document body
├── doc_hlc/{db}/{hlc}                 → Change entry (HLC-ordered)
├── path_hlc/{db}/{topic}/{hlc}        → Path-indexed change (exact match)
├── local/{db}/{docid}                 → Local document (not replicated)
├── ars/{db}/{path}                    → Path index for queries
├── view_meta/{db}/{viewid}            → View metadata
├── view_hlc/{db}/{viewid}             → View indexed HLC
├── view_index/{db}/{viewid}:{key}:{docid} → View index entry
└── view_by_docid/{db}/{viewid}:{docid}    → Reverse index

Posting CF Keys (posting_cf):
├── ars_posting/{db}/{field}/{value}   → DocId posting list
└── prefix_changes/{db}/{prefix}/{bucket} → HLC-ordered changes posting list

Attachment Store Keys:
└── att/{db}/{docid}/{attname}         → Attachment binary data

Keys are designed for efficient range scans and prefix matching.

Key Prefixes:

| Prefix | Hex | Description |
|--------|-----|-------------|
| DOC_INFO | 0x01 | Document metadata |
| DOC_REV | 0x02 | Document body by revision |
| DOC_HLC | 0x0D | Changes ordered by HLC |
| PATH_HLC | 0x0E | Path-indexed changes (exact match) |
| PREFIX_CHANGES | 0x1B | Sharded prefix changes (wildcard) |
| ARS | 0x09 | Path index for queries |
| LOCAL | 0x03 | Local documents |
| VIEW_* | 0x04-0x08 | View storage |

RocksDB Optimizations¶

Both stores share a common LRU block cache managed by barrel_cache:

┌─────────────────────────────────────────────────────────────────┐
│                  Shared Block Cache (512MB)                      │
├─────────────────────────────────────────────────────────────────┤
│  Document Store          │  Attachment Store                    │
│  ├── Data blocks         │  ├── Metadata blocks                 │
│  ├── Index blocks        │  └── BlobDB index                    │
│  └── Bloom filters       │                                      │
└─────────────────────────────────────────────────────────────────┘

Document Store Optimizations:

| Setting | Default | Purpose |
|---------|---------|---------|
| block_cache | 512MB shared | LRU cache for data/index blocks |
| bloom_filter | 10 bits/key | Reduce disk reads for point lookups |
| write_buffer_size | 64MB | Memtable size before flush |
| max_write_buffer_number | 3 | Concurrent memtables |
| compression | snappy | Fast compression for all levels |
| bottommost_compression | snappy | Configurable (zstd if available) |
| level0_file_num_compaction_trigger | 4 | Trigger L0→L1 compaction |
| max_background_jobs | schedulers | Parallel compaction threads |

Attachment Store Optimizations:

| Setting | Default | Purpose |
|---------|---------|---------|
| enable_blob_files | true | Store large values in blob files |
| min_blob_size | 4KB | Threshold for blob storage |
| blob_file_size | 256MB | Maximum blob file size |
| blob_compression_type | snappy | Blob compression (zstd if available) |
| blob_garbage_collection_age_cutoff | 0.25 | GC applies to the oldest 25% of blob files |
| blob_garbage_collection_force_threshold | 0.5 | Force GC once those files are 50% garbage |

Configuration:

%% Application environment (sys.config)
{barrel_docdb, [
    {block_cache_size, 536870912},  %% 512MB shared cache
    {data_dir, "/var/lib/barrel"}
]}

%% Per-database options
barrel_docdb:create_db(<<"mydb">>, #{
    store_opts => #{
        write_buffer_size => 128 * 1024 * 1024,  %% 128MB
        max_open_files => 2000,
        rate_limit_bytes_per_sec => 100 * 1024 * 1024  %% 100MB/s
    },
    att_opts => #{
        blob_file_size => 512 * 1024 * 1024,  %% 512MB blobs
        min_blob_size => 8192  %% 8KB threshold
    }
}).

4. HLC (Hybrid Logical Clock)¶

Every document modification is timestamped with an HLC:

HLC Timestamp: {WallTime, LogicalCounter, NodeId}
├── WallTime       : Physical time in microseconds
├── LogicalCounter : Logical extension for same-time events
└── NodeId         : Unique node identifier

Properties:

- Causally consistent ordering across distributed nodes
- Monotonically increasing within a node
- No central coordinator required
- Clock skew detection and handling

API:

%% Get current HLC
Ts = barrel_docdb:get_hlc().

%% Generate new timestamp (advances clock)
NewTs = barrel_docdb:new_hlc().

%% Sync with remote node
{ok, SyncedTs} = barrel_docdb:sync_hlc(RemoteTs).
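To make the properties above concrete, here is a sketch of the textbook hybrid logical clock update rules applied to the `{WallTime, LogicalCounter, NodeId}` tuple. This is the generic HLC algorithm, not a copy of barrel_hlc; the module `hlc_sketch` and its functions `tick/2` and `merge/3` are hypothetical, and the physical clock reading is passed in explicitly so the functions stay pure.

```erlang
-module(hlc_sketch).
-export([tick/2, merge/3]).

%% Local event: adopt the physical clock if it moved forward, otherwise
%% keep the wall time and bump the logical counter.
tick({Wall, _Logical, Node}, PhysNow) when PhysNow > Wall ->
    {PhysNow, 0, Node};
tick({Wall, Logical, Node}, _PhysNow) ->
    {Wall, Logical + 1, Node}.

%% Receive event: take the largest wall time seen so far and pick a
%% logical counter greater than every counter tied to that wall time.
merge({Wall, Logical, Node}, {RWall, RLogical, _RNode}, PhysNow) ->
    NewWall = lists:max([Wall, RWall, PhysNow]),
    NewLogical =
        if
            NewWall =:= Wall, NewWall =:= RWall -> max(Logical, RLogical) + 1;
            NewWall =:= Wall -> Logical + 1;
            NewWall =:= RWall -> RLogical + 1;
            true -> 0
        end,
    {NewWall, NewLogical, Node}.
```

The result of either function always compares greater than the previous local timestamp, which is what gives monotonic ordering within a node even when the physical clock stalls or steps backwards.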

5. Changes Feed¶

Every document modification generates an HLC-timestamped change entry:

%% Change entry structure
#{
    id => DocId,
    hlc => HlcTimestamp,
    rev => RevId,
    deleted => boolean(),
    changes => [RevId]
}

Changes are stored by HLC for efficient streaming:

%% Get changes since HLC timestamp
barrel_changes:fold_changes(StoreRef, DbName, SinceHlc, Fun, Acc)

Use Cases:

- Real-time notifications
- Incremental view updates
- Replication source
- Path-filtered change feeds

6. Path Subscriptions¶

Subscribe to document changes matching MQTT-style path patterns:

┌─────────────────────────────────────────────────────────────────┐
│                    Path Subscription Flow                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Client subscribes with pattern (e.g., "users/+/profile")    │
│                         │                                        │
│                         ▼                                        │
│  2. barrel_sub registers pattern in match_trie                  │
│                         │                                        │
│                         ▼                                        │
│  3. On document write, paths extracted from document            │
│                         │                                        │
│                         ▼                                        │
│  4. Paths matched against registered patterns                   │
│                         │                                        │
│                         ▼                                        │
│  5. Matching subscribers receive notification message           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Pattern Syntax:

| Pattern | Description |
|---------|-------------|
| users/alice | Exact match |
| users/+ | Single segment wildcard |
| users/# | Multi-segment wildcard |
| +/orders/# | Mixed wildcards |
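The pattern rules above can be illustrated with a small recursive matcher. This is a sketch that assumes MQTT topic-filter semantics (`+` matches exactly one segment, a trailing `#` matches any remaining segments, including none); it compares segment lists directly and does not reproduce the match_trie used by barrel_sub. The module name `path_match_sketch` is hypothetical.

```erlang
-module(path_match_sketch).
-export([match/2]).

%% Split both pattern and path on "/" and compare segment by segment.
match(Pattern, Path) when is_binary(Pattern), is_binary(Path) ->
    match_segments(binary:split(Pattern, <<"/">>, [global]),
                   binary:split(Path, <<"/">>, [global])).

%% A trailing "#" accepts whatever is left, including nothing.
match_segments([<<"#">>], _Rest) -> true;
%% "+" consumes exactly one path segment.
match_segments([<<"+">> | Ps], [_ | Ss]) -> match_segments(Ps, Ss);
%% A literal segment must be identical in pattern and path.
match_segments([Seg | Ps], [Seg | Ss]) -> match_segments(Ps, Ss);
match_segments([], []) -> true;
match_segments(_, _) -> false.
```

A linear matcher like this costs one comparison pass per registered pattern; a trie avoids that by sharing common prefixes across patterns, which matters once many subscribers are registered.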

7. Query Subscriptions¶

Subscribe to changes for documents matching a query:

Query = #{where => [{path, [<<"type">>], <<"user">>}]},
{ok, SubRef} = barrel_docdb:subscribe_query(DbName, Query).

Optimization:

- Query paths are extracted at subscription time
- Changes are first filtered by path intersection
- Full query evaluation only when paths overlap
- Avoids evaluating query for every document change

8. Path-Indexed Changes¶

Changes are indexed by path at write time for efficient filtered queries:

Key: path_hlc/{db}/{topic}/{hlc} → {doc_id, rev, deleted}

Example document:

#{<<"type">> => <<"user">>, <<"org">> => <<"acme">>}

Creates index entries:

path_hlc/mydb/type/user/{hlc1}
path_hlc/mydb/org/acme/{hlc1}

Benefits:

- O(k) query time where k = matching changes (vs O(n) full scan)
- Efficient filtered replication
- Real-time path subscriptions use same index
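The mapping from a document to its `{topic}` components can be sketched as a fold over the document map. The module `path_topics_sketch` is hypothetical, and the handling of nested maps and non-binary values is an assumption; the source only shows the flat two-field example above.

```erlang
-module(path_topics_sketch).
-export([topics/1]).

%% Return the sorted "field/.../value" topics for a document map.
topics(Doc) when is_map(Doc) ->
    lists:sort(topics(Doc, [])).

topics(Map, Prefix) ->
    maps:fold(
      fun(Key, Value, Acc) when is_map(Value) ->
              %% Nested maps extend the path with the current key.
              topics(Value, Prefix ++ [Key]) ++ Acc;
         (Key, Value, Acc) when is_binary(Value) ->
              [join(Prefix ++ [Key, Value]) | Acc];
         (_Key, _Value, Acc) ->
              %% Other value types are skipped in this sketch.
              Acc
      end, [], Map).

join(Segments) ->
    iolist_to_binary(lists:join(<<"/">>, Segments)).
```

For the example document this yields `org/acme` and `type/user`, each of which would be combined with the database name and the change's HLC to form one path_hlc key.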

8b. Sharded Prefix Changes (Wildcard Queries)¶

For wildcard path queries (e.g., paths => [<<"users/#">>]), a separate sharded posting list index provides efficient HLC-ordered iteration:

Key: prefix_changes/{db}/{prefix}/{bucket} → Posting list of changes

Key Format:

PREFIX_CHANGES (0x1B) | encoded_db | prefix | 0x00 | bucket (4 bytes BE)

Value Format (Posting List):

Each entry: << HLC:12/binary, DocId/binary, Rev/binary, Deleted:1 >>

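Sharding Strategy:

- Bucket = wall_time div 3600, with wall_time in seconds (1-hour granularity). The HLC section gives WallTime in microseconds, so it has to be converted to seconds first.
- Bounds posting list growth for high-write workloads
- Range scan discovers only existing buckets (no empty iteration)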
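The key format and bucket rule above can be sketched as follows. The sketch assumes the wall time has already been converted to seconds and that the database name is passed in its encoded form; the module `prefix_key_sketch` and its function names are hypothetical.

```erlang
-module(prefix_key_sketch).
-export([bucket/1, key/3]).

-define(PREFIX_CHANGES, 16#1B).
-define(BUCKET_SECONDS, 3600).

%% One bucket per hour of wall time (wall time given in seconds).
bucket(WallTimeSeconds) when is_integer(WallTimeSeconds), WallTimeSeconds >= 0 ->
    WallTimeSeconds div ?BUCKET_SECONDS.

%% PREFIX_CHANGES | encoded_db | prefix | 0x00 | bucket (4 bytes, big-endian).
key(EncodedDb, Prefix, Bucket)
  when is_binary(EncodedDb), is_binary(Prefix), is_integer(Bucket) ->
    <<?PREFIX_CHANGES, EncodedDb/binary, Prefix/binary, 0, Bucket:32/big-unsigned>>.
```

Encoding the bucket big-endian makes byte-wise key order equal to chronological order within one prefix, so a range scan starting at the bucket of the since timestamp visits buckets oldest first.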

Example for document:

#{<<"type">> => <<"user">>, <<"status">> => <<"active">>}

Creates entries in multiple prefix buckets:

prefix_changes/mydb/type/user/482345         → [<< hlc1, "doc1", ... >>]
prefix_changes/mydb/status/active/482345     → [<< hlc1, "doc1", ... >>]

Query Execution:

1. Compute start bucket from since HLC
2. Range scan posting_cf for prefix + bucket keys
3. Iterate sorted entries, filter by HLC
4. Collect until limit reached

Performance:

- 50x faster than path_hlc prefix scan with deduplication
- Native RocksDB merge operator keeps entries sorted
- No post-processing or deduplication needed

9. Views (Secondary Indexes)¶

Views are incrementally maintained map indexes. Reduce functions are listed under Planned Features and are not part of the current design:

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Changes    │──────│  View       │──────│  Index      │
│  Feed       │      │  gen_statem │      │  Storage    │
└─────────────┘      └─────────────┘      └─────────────┘

View Lifecycle:

1. Register view with module implementing barrel_view behaviour
2. View process subscribes to changes feed
3. For each change, call Module:map(Doc) to get key-value pairs
4. Store/update index entries
5. Track the last indexed HLC (stored under view_hlc) for incremental updates

Automatic Rebuild: When Module:version() changes, the view clears and rebuilds from scratch.
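A view module following the lifecycle above might look like the sketch below. The callbacks `version/0` and `map/1` are named in the text, but the module name `by_type_view`, the integer version, and the list-of-pairs return shape are assumptions for illustration; check the barrel_view behaviour for the exact callback contract.

```erlang
-module(by_type_view).
%% -behaviour(barrel_view).  %% enable when compiling against barrel_docdb
-export([version/0, map/1]).

%% Bump this value whenever map/1 changes so the index is rebuilt.
version() -> 1.

%% Emit one {Key, Value} pair per document that carries a <<"type">> field.
%% The list-of-pairs return shape is an assumption of this sketch.
map(#{<<"type">> := Type}) -> [{Type, 1}];
map(_Doc) -> [].
```

Keeping `map/1` a pure function of the document matters here: the view can then be rebuilt from the changes feed at any time and produce the same index.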

10. Replication¶

Replication follows the CouchDB protocol with HLC-based ordering:

┌──────────────────────────────────────────────────────────────┐
│                    Replication Flow                           │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  1. Read checkpoint (last_hlc from local doc)                │
│                     │                                         │
│                     ▼                                         │
│  2. Get changes from source since last_hlc                   │
│     (optionally filtered by paths/query)                     │
│                     │                                         │
│                     ▼                                         │
│  3. Sync target HLC with source timestamps                   │
│                     │                                         │
│                     ▼                                         │
│  4. For each change:                                         │
│     ├── Call revsdiff(target, docid, revs)                   │
│     ├── Get missing revisions from source with history       │
│     └── Put to target using put_rev(doc, history, deleted)   │
│                     │                                         │
│                     ▼                                         │
│  5. Write checkpoint with new last_hlc                       │
│                     │                                         │
│                     ▼                                         │
│  6. Repeat until no more changes                             │
│                                                               │
└──────────────────────────────────────────────────────────────┘
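Step 4 of the flow above hinges on the revision diff. A minimal sketch of that idea, assuming the target's known revisions for a document are available as a list: the module `revsdiff_sketch` is hypothetical and only computes the missing set, whereas a full implementation would typically also consult the revision tree.

```erlang
-module(revsdiff_sketch).
-export([revsdiff/2]).

%% Return the source revisions the target does not have yet, preserving
%% the order in which the source listed them.
revsdiff(TargetRevs, SourceRevs) ->
    Known = sets:from_list(TargetRevs),
    [Rev || Rev <- SourceRevs, not sets:is_element(Rev, Known)].
```

Only the revisions returned here need to be fetched from the source and written with put_rev, which is what keeps repeated replication runs cheap when most documents are already in sync.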

Filtered Replication: Replicate only documents matching filters:

barrel_rep:replicate(Source, Target, #{
    filter => #{
        paths => [<<"users/#">>],           %% Path pattern filter
        query => #{where => [...]}          %% Query filter
    }
}).
%% Both filters use AND logic when combined

Transport Abstraction: The barrel_rep_transport behaviour allows pluggable transports:

- barrel_rep_transport_local: same Erlang VM
- Custom HTTP, TCP, or other transports

Checkpoints: Stored as local documents (not replicated):

Key: <<"replication-checkpoint-{rep_id}">>
Value: #{<<"history">> => [#{<<"source_last_hlc">> => ...}]}

11. Query Parallelization¶

Large queries (>100 documents) use parallel CBOR decode + condition matching via barrel_parallel:

                    Query Coordinator
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
     Worker 1          Worker 2           Worker N
  (DocIds 1-100)    (DocIds 101-200)   (DocIds N*100+)
        │                  │                  │
   Decode CBOR        Decode CBOR        Decode CBOR
   Match Conds        Match Conds        Match Conds
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
                    Merge Results
                   (preserve order)

Configuration:

- 4 workers (PostgreSQL-style conservative default)
- Threshold: >100 documents triggers parallel execution
- Worker pool managed by barrel_parallel gen_server
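The split, process, and ordered merge shown in the diagram can be sketched with plain processes. This is an illustration of the technique only: it spawns one process per chunk instead of using a fixed pool of four workers, does not handle worker crashes beyond the link, and the module `parallel_sketch` is hypothetical rather than the barrel_parallel API.

```erlang
-module(parallel_sketch).
-export([filtermap/3]).

%% Split Items into chunks, run lists:filtermap/2 on each chunk in its own
%% process, then concatenate the chunk results in the original order.
filtermap(Fun, Items, ChunkSize) when is_integer(ChunkSize), ChunkSize > 0 ->
    Parent = self(),
    Refs = [spawn_chunk(Parent, Fun, Chunk) || Chunk <- chunk(Items, ChunkSize)],
    %% Receiving by reference in spawn order preserves result order,
    %% whichever worker finishes first.
    lists:append([receive {Ref, Result} -> Result end || Ref <- Refs]).

spawn_chunk(Parent, Fun, Chunk) ->
    Ref = make_ref(),
    spawn_link(fun() -> Parent ! {Ref, lists:filtermap(Fun, Chunk)} end),
    Ref.

chunk([], _Size) -> [];
chunk(Items, Size) when length(Items) =< Size -> [Items];
chunk(Items, Size) ->
    {Head, Tail} = lists:split(Size, Items),
    [Head | chunk(Tail, Size)].
```

In the query path, `Fun` would stand for the decode-and-match step, returning `{true, Doc}` for documents that satisfy the conditions and `false` otherwise.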

Tradeoffs:

Workers	Throughput	Reason
4	Best	Optimal for RocksDB
8	~50% slower	RocksDB contention, cache thrashing
2	~30% slower	Underutilizes CPU

12. Chunked Query Execution¶

Queries return results in chunks with continuation tokens for memory-efficient iteration:

%% First chunk (default 1000 results)
{ok, Results, #{has_more := true, continuation := Token}} =
    barrel_docdb:find(Db, Query).

%% Next chunk
{ok, More, #{has_more := false}} =
    barrel_docdb:find(Db, Query, #{continuation => Token}).

Cursor Management:

- Cursors stored in ETS with 60-second TTL
- Each access extends TTL
- Cursors hold RocksDB snapshots for consistent reads
- Snapshots released when has_more => false or cursor expires
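Draining every chunk follows directly from the return shapes shown above. The helper below is a sketch: the module `find_all_sketch` is hypothetical, and it takes the query as a fun of the options map so it can be exercised without a database. Passing an empty options map for the first call is an assumption, since the source shows find/2 for the first chunk.

```erlang
-module(find_all_sketch).
-export([collect/1]).

%% Find is a fun(Opts) returning {ok, Results, Meta} in the shape shown above.
collect(Find) when is_function(Find, 1) ->
    collect(Find, #{}, []).

collect(Find, Opts, Acc) ->
    case Find(Opts) of
        {ok, Results, #{has_more := true, continuation := Token}} ->
            %% More chunks pending: pass the token back on the next call.
            collect(Find, #{continuation => Token}, [Results | Acc]);
        {ok, Results, #{has_more := false}} ->
            %% Last chunk: restore chunk order and flatten.
            lists:append(lists:reverse([Results | Acc]))
    end.
```

Against a real database this could be called as `find_all_sketch:collect(fun(Opts) -> barrel_docdb:find(Db, Query, Opts) end)`. Collecting everything gives up the memory benefit of chunking, so it is best kept for result sets known to be small; because each access extends the cursor TTL, a consumer that processes chunks slowly should still fetch the next chunk within 60 seconds.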

Design Decisions¶

Why RocksDB?¶

Embedded: No external service dependencies
LSM-tree: Optimized for write-heavy workloads
Atomic batches: Multiple operations in one atomic write
BlobDB: Efficient large value storage
Snapshots: Consistent reads during iteration

Why Revision Trees?¶

Conflict detection: Multiple concurrent updates create branches
Replication: Only transfer missing revisions
History: Track document evolution
Deterministic winners: Same data = same winning revision

Why Separate Attachment Store?¶

RocksDB stores values inline in SST files. Large values cause:

- High write amplification during compaction
- Wasted space in block cache
- Slower reads due to large block sizes

BlobDB stores large values in separate blob files, solving these issues.

Why Local Documents?¶

Local documents:

- Are not replicated
- Don't have revision history
- Are used for per-database metadata (checkpoints, config)

This separates replication state from user data.

Data Flow Examples¶

Put Document¶

put_doc(Db, Doc)
    │
    ├── Validate document
    ├── Generate/validate ID
    ├── Compute new revision hash
    │
    ├── If update:
    │   ├── Check existing doc exists
    │   ├── Verify _rev matches current
    │   └── Extend revision tree
    │
    ├── Get next sequence number
    │
    ├── Atomic batch write:
    │   ├── doc_info (metadata + revtree)
    │   ├── doc_rev (body at new revision)
    │   └── doc_hlc (HLC-ordered change entry)
    │
    └── Return {ok, #{id, rev, ok}}

Query View¶

query_view(Db, ViewId, Opts)
    │
    ├── Get view process
    ├── Ensure view is up-to-date (refresh if needed)
    │
    ├── Build key range from start_key/end_key
    │
    ├── Iterate view_index entries:
    │   ├── Decode key and value
    │   ├── Apply limit
    │   └── Optionally fetch full document
    │
    └── Return {ok, Results}

Replicate¶

replicate(Source, Target)
    │
    ├── Generate replication ID
    ├── Read checkpoint (get last_hlc)
    │
    ├── Loop:
    │   ├── Get changes batch from source
    │   │
    │   ├── For each change:
    │   │   ├── revsdiff(target, docid, revs)
    │   │   ├── If missing revs:
    │   │   │   ├── get_doc(source, docid, #{history => true})
    │   │   │   └── put_rev(target, doc, history, deleted)
    │   │
    │   ├── Write checkpoint
    │   └── Continue until no changes
    │
    └── Return {ok, stats}

Supervision Tree¶

barrel_docdb_sup (one_for_one)
├── barrel_cache              # Shared RocksDB block cache
├── barrel_hlc_clock          # Global HLC clock
├── barrel_sub                # Path subscriptions manager
├── barrel_query_sub          # Query subscriptions manager
├── barrel_path_dict          # Path ID interning for posting lists
├── barrel_query_cursor       # Chunked query cursor management
├── barrel_parallel           # Worker pool for parallel queries
└── barrel_db_sup (simple_one_for_one)
    ├── barrel_db_server (db1)
    │   └── barrel_view_sup (simple_one_for_one)
    │       ├── barrel_view (view1)
    │       └── barrel_view (view2)
    ├── barrel_db_server (db2)
    │   └── ...
    └── ...

Performance Considerations¶

Write Path¶

Batch operations reduce disk I/O
Sequence numbers enable efficient change tracking
Revision computation is CPU-bound (SHA-256)

Read Path¶

Document lookup is a single RocksDB point read by key, with bloom filters limiting disk reads
View queries use RocksDB iterators
Snapshots provide consistent reads

Memory Usage¶

RocksDB block cache for hot data
Views process changes incrementally
Large attachments stored in blob files

Disk Usage¶

RocksDB compaction reclaims space
Old revisions can be pruned
Attachments use content-addressable storage

Future Considerations¶

Planned Features¶

Continuous replication
HTTP transport for replication
Reduce functions for views
Conflict resolution helpers

Extension Points¶

Custom storage backends (via behaviour)
Custom replication transports
View behaviours for different index types

File Structure¶

src/
├── barrel_docdb_app.erl        # Application callbacks
├── barrel_docdb_sup.erl        # Top-level supervisor
├── barrel_docdb.erl            # Public API
│
├── barrel_db_server.erl        # Per-database gen_server
├── barrel_db_sup.erl           # Database supervisor
│
├── barrel_doc.erl              # Document utilities
├── barrel_revtree_bin.erl      # Revision tree (compact binary encoding)
│
├── barrel_hlc.erl              # HLC clock management
│
├── barrel_att.erl              # Attachment API
├── barrel_att_store.erl        # Attachment storage
│
├── barrel_changes.erl          # Changes feed API (HLC-based)
├── barrel_changes_stream.erl   # Streaming changes
│
├── barrel_sub.erl              # Path subscriptions manager
├── barrel_sub_sup.erl          # Subscription supervisor
├── barrel_query_sub.erl        # Query subscriptions manager
│
├── barrel_query.erl            # Declarative query compiler & executor
├── barrel_query_cursor.erl     # Chunked query cursor management
├── barrel_ars.erl              # Path index API
├── barrel_ars_index.erl        # Path index implementation
├── barrel_path_dict.erl        # Path ID interning for posting lists
│
├── barrel_parallel.erl         # Parallel map/filtermap with worker pool
├── barrel_doc_body_store.erl   # Batch document body operations
│
├── barrel_view.erl             # View gen_statem
├── barrel_view_index.erl       # View index storage (HLC-tracked)
├── barrel_view_sup.erl         # View supervisor
│
├── barrel_cache.erl            # Shared RocksDB block cache
├── barrel_store_rocksdb.erl    # RocksDB storage (optimized)
├── barrel_store_keys.erl       # Key encoding (doc_hlc, path_hlc, ars)
├── barrel_docdb_codec_cbor.erl # CBOR encoding/decoding
│
├── barrel_rep.erl              # Replication API (filtered)
├── barrel_rep_alg.erl          # Replication algorithm (HLC sync)
├── barrel_rep_checkpoint.erl   # Checkpoint management (HLC-based)
├── barrel_rep_transport.erl    # Transport behaviour
└── barrel_rep_transport_local.erl  # Local transport