Re: Factor

Factor: the language, the theory, and the practice.

Pickle

Monday, August 18, 2025

#data

Pretty much everything pickle is great: sweet, dill, bread and butter, full sour, half sour, gherkins, achar, even pickleball. In addition to being yummy, fun, and a great Tuesday night on the Playa, pickle is also the name of Python’s object serialization module.

There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

  • Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
  • Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
  • Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
  • Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
  • Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.
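
As a rough illustration of the list above, this Python sketch pickles one value with every available protocol and checks for the two-byte header that protocols 2 and up write. The sample value is made up; only standard `pickle` calls are used.

```python
import pickle

value = {"name": "gherkin", "count": 3}  # illustrative sample data

for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(value, protocol=proto)
    # Protocols 2 and up begin with the PROTO opcode (0x80) and the version byte.
    has_header = blob[:2] == bytes([0x80, proto])
    print(proto, len(blob), has_header)
    # Every protocol should round-trip to an equal object.
    assert pickle.loads(blob) == value
```

The header is how a reader can tell which protocol produced a binary pickle; protocols 0 and 1 have no such marker.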

While recently learning about how the pickle protocol works, I was able to build a basic unpickler in Factor. The implementation is about 300 lines of code, and has some decent tests. There are a few more features we should add for completeness, but it’s a good start!

I thought I’d go over a few parts of the implementation here.

The pickle protocol is stack-based, so we represent its stack with a growable vector. It also uses a memoization feature to refer to objects by integer keys when they repeat in the data stream; we store those entries in a hashtable:

CONSTANT: stack V{ }

CONSTANT: memo H{ }

ERROR: invalid-memo key ;

: get-memo ( i -- )
    memo ?at [ stack push ] [ invalid-memo ] if ;

: put-memo ( i -- )
    [ stack last ] dip memo set-at ;
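
To see the memo at work from the Python side, here is a small sketch that pickles a list containing the same object twice and lists the opcodes in the stream using the standard `pickletools.genops`. The variable names are illustrative.

```python
import pickle
import pickletools

shared = [1]
blob = pickle.dumps([shared, shared], protocol=4)

# List the opcode names in the stream, in order.
names = [op.name for op, arg, pos in pickletools.genops(blob)]
print(names)

# MEMOIZE stores the top of the stack in the memo;
# BINGET pushes a previously memoized object back onto the stack.
loaded = pickle.loads(blob)
assert loaded[0] is loaded[1]
```

The second occurrence of `shared` is written as a memo lookup rather than a second copy, which is why the loaded list holds the same object twice.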

It also has the concept of markers, which can be placed using the +marker+ symbol and then used, for example, to pop every item above the most recent marker:

SYMBOL: +marker+

: pop-from-marker ( -- items )
    +marker+ stack last-index
    [ 1 + stack swap tail ] [ stack shorten ] bi ;
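
For readers less familiar with Factor, the same idea can be sketched in Python. The names `MARK` and `pop_from_marker` are hypothetical stand-ins for illustration, not part of any library.

```python
MARK = object()  # sentinel standing in for +marker+


def pop_from_marker(stack):
    """Remove the most recent MARK and return the items that were above it."""
    # Search from the top of the stack for the last marker.
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] is MARK:
            items = stack[i + 1:]
            del stack[i:]  # drop the marker and everything above it
            return items
    raise ValueError("no marker on the stack")


stack = ["a", MARK, 1, 2, MARK, 3, 4]
print(pop_from_marker(stack))
print(pop_from_marker(stack))
```

As in the Factor word, the marker itself is discarded along with the items that are returned.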

Unpickling starts with a dispatch loop that acts on each supported opcode. We can use a +no-return+ symbol to indicate that we are not ready to return an object until the STOP opcode is seen.

ERROR: invalid-opcode opcode ;

SYMBOL: +no-return+

: unpickle-dispatch ( opcode -- value )
    +no-return+ swap {
        ! Protocol 0 and 1
        { CHAR: ( [ load-mark ] }
        { CHAR: . [ drop stack pop ] }
        { CHAR: 0 [ load-pop ] }
        { CHAR: 1 [ load-pop-mark ] }
        { CHAR: 2 [ load-dup ] }
        { CHAR: F [ load-float ] }
        { CHAR: I [ load-int ] }
        { CHAR: J [ load-binint ] }
        { CHAR: K [ load-binint1 ] }
        { CHAR: L [ load-long ] }
        { CHAR: M [ load-binint2 ] }
        { CHAR: N [ load-none ] }
        { CHAR: P [ load-persid ] }
        { CHAR: Q [ load-binpersid ] }
        { CHAR: R [ load-reduce ] }
        { CHAR: S [ load-string ] }
        { CHAR: T [ load-binstring ] }
        { CHAR: U [ load-short-binstring ] }
        { CHAR: V [ load-unicode ] }
        { CHAR: X [ load-binunicode ] }
        { CHAR: a [ load-append ] }
        { CHAR: b [ load-build ] }
        { CHAR: c [ load-global ] }
        { CHAR: d [ load-dict ] }
        { CHAR: } [ load-empty-dict ] }
        { CHAR: e [ load-appends ] }
        { CHAR: g [ load-get ] }
        { CHAR: h [ load-binget ] }
        { CHAR: i [ load-inst ] }
        { CHAR: j [ load-long-binget ] }
        { CHAR: l [ load-list ] }
        { CHAR: ] [ load-empty-list ] }
        { CHAR: o [ load-obj ] }
        { CHAR: p [ load-put ] }
        { CHAR: q [ load-binput ] }
        { CHAR: r [ load-long-binput ] }
        { CHAR: s [ load-setitem ] }
        { CHAR: t [ load-tuple ] }
        { CHAR: ) [ load-empty-tuple ] }
        { CHAR: u [ load-setitems ] }
        { CHAR: G [ load-binfloat ] }

        ! Protocol 2
        { 0x80 [ load-proto ] }
        { 0x81 [ load-newobj ] }
        { 0x82 [ load-ext1 ] }
        { 0x83 [ load-ext2 ] }
        { 0x84 [ load-ext4 ] }
        { 0x85 [ load-tuple1 ] }
        { 0x86 [ load-tuple2 ] }
        { 0x87 [ load-tuple3 ] }
        { 0x88 [ load-true ] }
        { 0x89 [ load-false ] }
        { 0x8a [ load-long1 ] }
        { 0x8b [ load-long4 ] }

        ! Protocol 3 (Python 3.x)
        { CHAR: B [ load-binbytes ] }
        { CHAR: C [ load-short-binbytes ] }

        ! Protocol 4 (Python 3.4+)
        { 0x8c [ load-short-binunicode ] }
        { 0x8d [ load-binunicode8 ] }
        { 0x8e [ load-binbytes8 ] }
        { 0x8f [ load-empty-set ] }
        { 0x90 [ load-additems ] }
        { 0x91 [ load-frozenset ] }
        { 0x92 [ load-newobj-ex ] }
        { 0x93 [ load-stack-global ] }
        { 0x94 [ load-memoize ] }
        { 0x95 [ load-frame ] }

        ! Protocol 5 (Python 3.8+)
        { 0x96 [ load-bytearray8 ] }
        { 0x97 [ load-readonly-buffer ] }
        { 0x98 [ load-next-buffer ] }

        [ invalid-opcode ]
    } case ;
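
The grouping of opcodes by protocol in the table above can be cross-checked against Python's own `pickletools` metadata. This is just a sanity-check sketch using the standard `pickletools.opcodes` and `pickletools.code2op` tables.

```python
import pickletools
from collections import defaultdict

# Group opcode names by the protocol that introduced them.
by_proto = defaultdict(list)
for op in pickletools.opcodes:
    by_proto[op.proto].append(op.name)

for proto in sorted(by_proto):
    print(proto, len(by_proto[proto]), by_proto[proto])

# code2op maps a one-character opcode string to its description.
binfloat = pickletools.code2op["G"]
print(binfloat.name, binfloat.proto)
```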

With that, we can build our unpickle word that acts on an input-stream, first clearing state and then looping until we see an object to return:

: unpickle ( -- obj )
    stack delete-all memo clear-assoc
    f [ drop read1 unpickle-dispatch dup +no-return+ = ] loop ;
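
The overall shape of the loop (read an opcode, dispatch, stop on STOP) can also be sketched in Python. This is a deliberately tiny interpreter that handles only a handful of opcodes; `mini_unpickle`, `last_mark`, and `MARK` are hypothetical names, and this is not how Python's or Factor's real unpickler is written.

```python
import io
import pickle
import struct

MARK = object()  # sentinel object standing in for the MARK opcode


def last_mark(stack):
    """Index of the most recent MARK on the stack."""
    return max(i for i, item in enumerate(stack) if item is MARK)


def mini_unpickle(data):
    """Interpret a small subset of pickle opcodes (illustration only)."""
    stream = io.BytesIO(data)
    stack, memo = [], {}
    while True:
        op = stream.read(1)
        if op == b"\x80":        # PROTO: skip the version byte
            stream.read(1)
        elif op == b"\x95":      # FRAME: skip the 8-byte frame length
            stream.read(8)
        elif op == b"K":         # BININT1: one unsigned byte
            stack.append(stream.read(1)[0])
        elif op == b"G":         # BINFLOAT: big-endian 8-byte double
            stack.append(struct.unpack(">d", stream.read(8))[0])
        elif op == b"\x8c":      # SHORT_BINUNICODE: length byte, then UTF-8
            n = stream.read(1)[0]
            stack.append(stream.read(n).decode("utf-8"))
        elif op == b"]":         # EMPTY_LIST
            stack.append([])
        elif op == b"(":         # MARK
            stack.append(MARK)
        elif op == b"\x94":      # MEMOIZE: remember the top of the stack
            memo[len(memo)] = stack[-1]
        elif op == b"h":         # BINGET: push a memoized object
            stack.append(memo[stream.read(1)[0]])
        elif op == b"a":         # APPEND: append top item to the list below it
            item = stack.pop()
            stack[-1].append(item)
        elif op == b"e":         # APPENDS: extend list with items above MARK
            i = last_mark(stack)
            items = stack[i + 1:]
            del stack[i:]
            stack[-1].extend(items)
        elif op == b".":         # STOP: the result is the top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode {op!r}")


shared = [1]
blob = pickle.dumps(["abc", 123, 4.56, shared, shared], protocol=4)
result = mini_unpickle(blob)
print(result)
```

The sketch keeps its stack and memo local to each call, which avoids the explicit reset step at the cost of not being able to inspect them afterwards.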

For convenience, a pickle> word acts on concrete data:

GENERIC: pickle> ( string -- obj )

M: string pickle> [ unpickle ] with-string-reader ;

M: byte-array pickle> binary [ unpickle ] with-byte-reader ;

In addition, we needed to support Python’s string escapes, which differ slightly from the ones that Factor defines (mainly \u#### and \U########), and to add support for some of the basic class types that we might encounter, such as byte-arrays, decimals, and timestamps.
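
One place these escapes show up is the text encoding used by protocol 0. The sketch below assumes that is the case being described; the sample string is made up.

```python
import pickle

# Euro sign (U+20AC) and cucumber emoji (U+1F952), sample data only.
text = "pickle \u20ac \U0001f952"
blob = pickle.dumps(text, protocol=0)
print(blob)

# Characters outside Latin-1 are written as \uXXXX or \UXXXXXXXX escapes.
assert pickle.loads(blob) == text
```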

We currently do not support persistent IDs, read-only versus read/write buffers, out-of-band buffers, the object build opcode, or the extension registry. And of course, this is just unpickling: we do not yet support pickling of Factor objects, although that shouldn’t be too hard to add.

But, despite that, it works pretty well!

Here’s an example where we store some mixed data in a pickles file with Python:

>>> data = ["abc", 123, 4.56, {"a":1}, {17,37,52}]

>>> import pickle

>>> with open('pickles', 'wb') as f:
...     pickle.dump(data, f)
...

And then inspect and load that pickles file with Factor!

IN: scratchpad USE: tools.hexdump

IN: scratchpad "pickles" hexdump-file
00000000  80 04 95 29 00 00 00 00 00 00 00 5d 94 28 8c 03  ...).......].(..
00000010  61 62 63 94 4b 7b 47 40 12 3d 70 a3 d7 0a 3d 7d  abc.K{G@.=p...=}
00000020  94 8c 01 61 94 4b 01 73 8f 94 28 4b 11 4b 34 4b  ...a.K.s..(K.K4K
00000030  25 90 65 2e                                      %.e.
00000034

IN: scratchpad USE: pickle

IN: scratchpad "pickles" binary file-contents pickle> .
V{ "abc" 123 4.56 H{ { "a" 1 } } HS{ 17 52 37 } }
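
As a cross-check from the Python side, the bytes in the hexdump above can be walked opcode by opcode with the standard `pickletools.genops`. The hex string is copied from the dump; everything else is illustrative.

```python
import pickle
import pickletools

# Bytes copied from the hexdump above.
blob = bytes.fromhex(
    "80 04 95 29 00 00 00 00 00 00 00 5d 94 28 8c 03"
    "61 62 63 94 4b 7b 47 40 12 3d 70 a3 d7 0a 3d 7d"
    "94 8c 01 61 94 4b 01 73 8f 94 28 4b 11 4b 34 4b"
    "25 90 65 2e"
)

# Print the offset, opcode name, and argument for each step in the stream.
names = []
for op, arg, pos in pickletools.genops(blob):
    names.append(op.name)
    print(pos, op.name, arg)

# The stream decodes back to the data that was pickled.
data = pickle.loads(blob)
print(data)
```

Reading the opcode names next to the dispatch table makes it easier to see which Factor words run for this particular file.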

This is available in the latest development version.