Re: Factor

Factor: the language, the theory, and the practice.

LEB128

Sunday, September 3, 2023

#math

LEB128 – Little Endian Base 128 – is a variable-length encoding format designed to store arbitrarily large integers in a small number of bytes. There are two variations: unsigned LEB128 and signed LEB128. These vary slightly, so a user program that wants to decode LEB128 values would explictly choose the appropriate unsigned or signed methods.

Recently, I implemented these in Factor and wanted to describe the implementation:

Decoding

For decoding unsigned LEB128 values, we accumulate the integer 7 bits at a time:

: uleb128> ( byte-array -- n )
    0 [ [ 7 bits ] [ 7 * shift ] bi* + ] reduce-index ;

For decoding signed LEB128 values, we do the same thing, but the last byte indicates if the value was negative, and if it was we bring the sign bit back in:

: leb128> ( byte-array -- n )
    [ uleb128> ] keep dup last 6 bit?
    [ length 7 * 2^ neg bitor ] [ drop ] if ;

Encoding

For encoding unsigned LEB128 values, we output in 7-bit segments, and then for the final segment use the eighth bit to indicate the end of the stream of bytes.

:: >uleb128 ( n -- byte-array )
    BV{ } clone :> accum
    n assert-non-negative [
        [ -7 shift dup zero? not ] [ 7 bits ] bi
        over [ 0x80 bitor ] when accum push
    ] loop drop accum B{ } like ;

For encoding signed LEB128 values, we output in 7-bit segments, but our exit condition depends on reaching the end of the integer and conditionally whether the sixth bit was set in that segment.

:: >leb128 ( n -- byte-array )
    BV{ } clone :> accum
    n [
        [ -7 shift dup ] [ 7 bits ] bi :> ( i b )
        {
            { [ i  0 = ] [ b 6 bit? not ] }
            { [ i -1 = ] [ b 6 bit? ] }
            [ f ]
        } cond b over [ 0x80 bitor ] when accum push
    ] loop drop accum B{ } like ;

Testing

Some test cases for unsigned LEB128:

[ -1 >uleb128 ] [ non-negative-number-expected? ] must-fail-with
{ B{ 255 255 127 } } [ 0x1fffff >uleb128 ] unit-test
{ 0x1fffff } [ B{ 255 255 127 } uleb128> ] unit-test
{ B{ 0xe5 0x8e 0x26 } } [ 624485 >uleb128 ] unit-test
{ 624485 } [ B{ 0xe5 0x8e 0x26 } uleb128> ] unit-test

Some test cases for signed LEB128:

{ B{ 255 255 255 0 } } [ 0x1fffff >leb128 ] unit-test
{ 0x1fffff } [ B{ 255 255 255 0 } leb128> ] unit-test
{ B{ 0xc0 0xbb 0x78 } } [ -123456 >leb128 ] unit-test
{ -123456 } [ B{ 0xc0 0xbb 0x78 } leb128> ] unit-test

Performance

This is available in a recent nightly build, with some support for reading and writing LEB128 values to streams, and some performance improvements by specifying some type information so the compiler can understand we are working with integers and byte-arrays and produce more optimal code.

My current laptop decodes and encodes around 40 million per second, which is pretty good for a dynamic language. There are some references in the fast decoding section to some papers that present some SIMD techniques called “Masked VByte” and “Stream VByte” for significantly improving upon this simple scalar implementation.