Re: Factor

Factor: the language, the theory, and the practice.

Cuckoo Filters

Monday, August 8, 2016

#math

A Cuckoo filter is a Bloom filter replacement that allows for space-efficient probabilistic membership checks. Unlike a standard Bloom filter, a Cuckoo filter lets you add and remove items dynamically without significantly degrading space efficiency or performance. Membership checks can return false positives, but the rate is typically low.

This data structure is explained by Bin Fan, Dave Andersen, Michael Kaminsky, and Michael Mitzenmacher in two papers: "Cuckoo Filter: Better Than Bloom" and "Cuckoo Filter: Practically Better Than Bloom". There is also a C++ implementation that can be used as a reference.

The Cuckoo filter is basically a dense hash table that can support high load factors (up to 95%) without degraded performance. Instead of storing the objects themselves, we store a hashed fingerprint of each one.

Buckets

First, we need to create a number of buckets, each holding 4 fingerprints. The number of buckets is rounded up to a power of 2, and if the requested capacity would fill those buckets beyond a 96% load factor, we double the number of buckets:

! The number of fingerprints to store in each bucket
CONSTANT: bucket-size 4

! The maximum load factor we allow before growing the capacity
CONSTANT: max-load-factor 0.96

: #buckets ( capacity -- #buckets )
    [ bucket-size /i next-power-of-2 ] keep
    over / bucket-size / max-load-factor > [ 2 * ] when ;
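
As a quick check of the sizing rule, here is a sketch of what #buckets returns for two capacities, assuming the definitions above are loaded in the listener (the capacities are arbitrary illustrative values):

USING: prettyprint ;

! 100 /i 4 = 25, rounded up to 32 buckets.
! Load is 100 / (32 * 4), about 0.78, so nothing else happens.
100 #buckets .    ! 32

! 1000 /i 4 = 250, rounded up to 256 buckets.
! Load is 1000 / (256 * 4), about 0.977, which is over 0.96,
! so the bucket count is doubled.
1000 #buckets .   ! 512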

Our buckets are then just an array of arrays:

: <cuckoo-buckets> ( capacity -- buckets )
    #buckets [ bucket-size f <array> ] replicate ;
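
A small sketch of what this produces, assuming the words above are loaded. Because replicate runs its quotation once per element, each bucket is a separate array, so storing into one bucket leaves the others untouched (the value 42 is just a placeholder):

USING: kernel prettyprint sequences ;

! A capacity of 8 would fill 2 buckets completely, so we get 4
8 <cuckoo-buckets>
dup length .              ! 4
42 0 pick first set-nth   ! store 42 in slot 0 of the first bucket
dup first .               ! { 42 f f f }
second .                  ! { f f f f }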

Given a fingerprint, we can check if it is in a bucket by calling member?:

: bucket-lookup ( fingerprint bucket -- ? )
    member? ;

To insert a fingerprint into the bucket, we find the first empty slot and replace it with the fingerprint. We return a boolean value indicating if we were able to insert it or not:

: bucket-insert ( fingerprint bucket -- ? )
    dup [ not ] find drop [ swap set-nth t ] [ 2drop f ] if* ;

To delete a fingerprint, we find its index (if present) and set that slot back to f:

: bucket-delete ( fingerprint bucket -- ? )
    [ f ] 2dip [ index ] keep over [ set-nth t ] [ 3drop f ] if ;

If the bucket is full, we need to be able to swap a fingerprint into the bucket, evicting a randomly chosen existing fingerprint and returning it:

: bucket-swap ( fingerprint bucket -- fingerprint' )
    [ length random ] keep [ swap ] change-nth ;
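
The four bucket words can be exercised together in the listener. This is only a sketch: it assumes the definitions above are loaded, and 42 and 99 are arbitrary stand-ins for fingerprints.

USING: arrays kernel locals prettyprint ;

[let
    bucket-size f <array> :> bucket
    42 bucket bucket-insert .   ! t
    bucket .                    ! { 42 f f f }
    42 bucket bucket-lookup .   ! t
    42 bucket bucket-delete .   ! t
    42 bucket bucket-delete .   ! f (already removed)
    ! bucket-swap stores the new fingerprint in a random slot and
    ! returns whatever was there before (f here, since the bucket
    ! is empty again)
    99 bucket bucket-swap .     ! f
    99 bucket bucket-lookup .   ! t
]

In the filter itself, bucket-swap is only called on a full bucket, so the value it returns is always a real fingerprint that needs a new home.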

Hashing

Our hashing strategy will be to generate the SHA-1 hash value for a given byte-array and take its first 8 bytes as two 32-bit values (a 32-bit fingerprint and a 32-bit index value). We will also generate an alternate index value, using a constant from MurmurHash to mix the fingerprint with the primary index:

: hash-index ( hash -- fingerprint index )
    4 over <displaced-alien> [ uint deref ] bi@ ;

: alt-index ( fingerprint index -- alt-index )
    [ 0x5bd1e995 w* ] [ bitxor ] bi* ;

: hash-indices ( bytes -- fingerprint index alt-index )
    sha1 checksum-bytes hash-index 2dup alt-index ;
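
One property worth noting: applying alt-index twice with the same fingerprint returns the original index, because XORing with the same value twice cancels out. This means a fingerprint's other bucket can be found from the fingerprint and its current index alone, without the original item. A minimal check, assuming the words above are loaded (the fingerprint and index values are arbitrary):

USING: kernel locals prettyprint ;

[let
    0xdeadbeef :> fp
    12345 :> i1
    fp i1 alt-index :> i2
    ! From either index we can recover the other one
    fp i2 alt-index i1 = .   ! t
]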

Insert/Lookup/Delete

Our Cuckoo filter holds our buckets:

TUPLE: cuckoo-filter buckets ;

: <cuckoo-filter> ( capacity -- cuckoo-filter )
    <cuckoo-buckets> cuckoo-filter boa ;

To insert an item into the Cuckoo filter, we calculate its hash-indices and then try inserting its fingerprint into the bucket specified by the first index, then the bucket specified by the second index. If both buckets are full, we go through a “kickdown” process that moves fingerprints to their alternate buckets until we find a bucket that has space, or give up after exceeding the maximum number of attempts:

! The maximum number of times we kick down items/displace from
! their buckets
CONSTANT: max-cuckoo-count 500

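:: cuckoo-insert ( bytes cuckoo-filter -- ? )
    bytes hash-indices :> ( fp! i1 i2 )
    cuckoo-filter buckets>> :> buckets
    buckets length :> n
    ! Try both candidate buckets first
    {
        [ fp i1 n mod buckets nth bucket-insert ]
        [ fp i2 n mod buckets nth bucket-insert ]
    } 0|| [
        t
    ] [
        ! Both are full: start kicking from one of them at random
        2 random zero? i1 i2 ? :> i!
        max-cuckoo-count [
            drop
            ! Evict a random fingerprint from bucket i, taking its place
            fp i n mod buckets nth bucket-swap fp!
            ! Move on to the evicted fingerprint's alternate bucket
            fp i alt-index i!

            fp i n mod buckets nth bucket-insert
        ] find-integer >boolean
    ] if ;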

To look up an item, we calculate the hash-indices and then check the two buckets to see if the fingerprint can be found:

:: cuckoo-lookup ( bytes cuckoo-filter -- ? )
    bytes hash-indices :> ( fp i1 i2 )
    cuckoo-filter buckets>> :> buckets
    buckets length :> n
    {
        [ fp i1 n mod buckets nth bucket-lookup ]
        [ fp i2 n mod buckets nth bucket-lookup ]
    } 0|| ;
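
This also gives a rough feel for the false positive rate. A lookup compares one 32-bit fingerprint against at most two buckets of bucket-size slots each, so, assuming fingerprints are uniformly distributed, the chance that an absent item matches some stored fingerprint is at most 2 * 4 / 2^32. A back-of-the-envelope calculation (an upper bound under that assumption, not a measurement):

USING: math prettyprint ;

! 8 candidate slots, each matching with probability 1/2^32
2 bucket-size * 32 2^ / >float .   ! roughly 1.9e-9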

To delete an item, we calculate the hash-indices and then try to remove its fingerprint from the first bucket, or from the second bucket if it was not found in the first:

:: cuckoo-delete ( bytes cuckoo-filter -- ? )
    bytes hash-indices :> ( fp i1 i2 )
    cuckoo-filter buckets>> :> buckets
    buckets length :> n
    {
        [ fp i1 n mod buckets nth bucket-delete ]
        [ fp i2 n mod buckets nth bucket-delete ]
    } 0|| ;
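
Putting it all together, a round trip through the filter might look like the following sketch. It assumes the definitions from this post are loaded; the capacity and byte arrays are arbitrary, and the lookup of an item that was never inserted is only almost certainly f, since false positives are possible:

USING: kernel locals prettyprint ;

[let
    1000 <cuckoo-filter> :> cf
    B{ 1 2 3 } cf cuckoo-insert .   ! t
    B{ 1 2 3 } cf cuckoo-lookup .   ! t
    B{ 4 5 6 } cf cuckoo-lookup .   ! f, barring a false positive
    B{ 1 2 3 } cf cuckoo-delete .   ! t
    B{ 1 2 3 } cf cuckoo-lookup .   ! f
]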

This is available in the cuckoo-filters vocabulary along with some tests, documentation, and a few extra features including the ability to change the checksum being used.