Bitcask

Tuesday, June 3, 2025

Phil Eaton issued a HYIBY? – have you implemented bitcask yet? – challenge yesterday. Of course, I immediately realized that I have not and also that it would be fun to build in Factor.

Bitcask is described in the original Bitcask paper as a “log-structured hash-table for fast key/value data”, and was part of the software developed by Basho Technologies as part of the Riak distributed database. Besides the original paper, various developers over the years have bumped into Bitcask and implemented it in different programming languages. Arpit Bhayani, for example, has a nice blog post describing Bitcask that is worth reading for more background.

At its core, Bitcask describes an append-only storage mechanism for building a key-value database. It consists of one-or-more data files, each of which has an optional index file to allow faster recovery when initializing the database, and generally supports GET, PUT, and DELETE operations.

Data Files

Our data file contains a series of entry records. Each record consists of a key length, value length, key bytes, and value bytes. A simple word provides a way to write these bytes to a file:

: write-entry-bytes ( key value -- )
    [ dup length 4 >be write ] bi@ [ write ] bi@ ;

Then, using the serialize vocabulary we can store Factor objects quite simply:

: write-entry ( key value -- )
    [ object>bytes ] bi@ write-entry-bytes ;

We need the ability to store tombstone records which indicate that a key has been deleted from the database. In this case, we choose to store a zero-sized value to indicate that:

: write-tombstone ( key -- )
    object>bytes f write-entry-bytes ;

Assuming that a data file has had it’s seek position moved to the beginning of an entry record, we can read the value that it contains, or return a boolean indicating that it is not found because it was stored as a tombstone:

: read-entry ( -- value/f ? )
    4 read be> 4 read be> [
        drop f f
    ] [
        [ seek-relative seek-input ]
        [ read bytes>object t ] bi*
    ] if-zero ;

Index Files

Our index file contains hints that provide a way to recover the record offsets into the data files. These hints consist of a series of index records. Each record consists of a key length, key bytes, and file offset.

We can write our index mapping of keys to offsets:

: write-index ( index -- )
    [
        [ object>bytes dup length 4 >be write write ]
        [ 4 >be write ] bi*
    ] assoc-each ;

And then read it back into memory:

: read-index ( -- index )
    H{ } clone [
        4 read [
            be> read bytes>object 4 read be>
            swap pick set-at t
        ] [ f ] if*
    ] loop ;

We want to make the index files optional, continuing to recover the index by first seeking to the last entry that we have in the index, and then continuing to iterate across the records in the data file to recover the full index, making sure to delete any items that are subsequently observed to contain tombstone entries:

: recover-index ( index -- index' )
    dup values [ maximum seek-absolute seek-input ] unless-empty
    [
        tell-input 4 read [
            be> 4 read be> [ read bytes>object ] dip
            [ pick delete-at drop ] [
                [ pick set-at ]
                [ seek-relative seek-input ] bi*
            ] if-zero t
        ] [ drop f ] if*
    ] loop ;

Bitcask Implementation

The associative mapping protocol describes the words that an assoc should support. This type of object provides a mapping of key to value, with ways to add, update, and delete these mappings.

We want our bitcask type to use a single data file, reading and recovering from an index file, and then providing ways to modify – by appending to the data file – the database.

TUPLE: bitcask path index ;

:: <bitcask> ( path -- bitcask )
    path dup touch-file
    path ".idx" append dup touch-file
    binary [ read-index ] with-file-reader
    path binary [ recover-index ] with-file-reader
    bitcask boa ;

INSTANCE: bitcask assoc

The application should control when and how these index files are persisted:

: save-index ( bitcask -- )
    dup path>> ".idx" append binary
    [ index>> write-index ] with-file-writer ;

The first operation we support will be set-at, updating the index after writing the entry.

M:: bitcask set-at ( value key bitcask -- )
    bitcask path>> binary [
        tell-output
        key value write-entry
        key bitcask index>> set-at
    ] with-file-appender ;

Next, we support at*, to lookup a value by seeking in the data file and reading the entry:

M:: bitcask at* ( key bitcask -- value/f ? )
    key bitcask index>> at* [
        bitcask path>> binary [
            seek-absolute seek-input read-entry
        ] with-file-reader
    ] [ drop f f ] if ;

And finally, delete-at removes a key from the index after writing a tombstone:

M:: bitcask delete-at ( key bitcask -- )
    key bitcask index>> key? [
        bitcask path>> binary [
            key write-tombstone
            key bitcask index>> delete-at
        ] with-file-appender
    ] when ;

The assoc-size of our database is the size of the index:

M: bitcask assoc-size
    index>> assoc-size ;

It is helpful to implement >alist to provide a conversion to an assocation list, although if the database gets quite large, this might be of less practical value:

M:: bitcask >alist ( bitcask -- alist )
    bitcask path>> binary [
        bitcask index>> [
            seek-absolute seek-input read-entry t assert=
        ] { } assoc-map-as
    ] with-file-reader ;

And a way to clear-assoc by writing tombstones and clearing the index:

M:: bitcask clear-assoc ( bitcask -- )
    bitcask path>> binary [
        bitcask index>>
        dup keys [ write-tombstone ] each
        clear-assoc
    ] with-file-appender ;

There are some elements desirable in a production database that are not implemented, for example:

reducing the amount of opening and closing files to increase performance
controlling when file writes are flushed to disk
storing other metadata such as timestamps for each entry
rolling over data log files as they reach a maximum size
providing a way to vacuum the database files to remove tombstone entries
compressing the database entries or data log files if size is a consideration
protocol for accessing it over the network by other parts of the application

This is now available in the development version in the bitcask vocabulary!