Pickle
Monday, August 18, 2025
Pretty much everything pickle is great: sweet, dill, bread and butter, full sour, half sour, gherkins, achar, even pickleball. In addition to being yummy, fun, and a great Tuesday night on the Playa, pickle is also the name for Python object serialization.
There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.
- Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
- Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
- Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
- Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
- Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
- Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.
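To make the differences between protocols a bit more concrete, here is a small Python sketch that pickles the same integer with every protocol the running interpreter supports. The variable names are just for illustration.

```python
import pickle

value = 123

# Protocol 0 is text-like: the INT opcode "I", the decimal digits, a newline,
# and finally the STOP opcode ".".
proto0 = pickle.dumps(value, protocol=0)
assert proto0 == b"I123\n."

# Protocol 2 and later start with the PROTO opcode (0x80) and a version byte,
# followed here by BININT1 ("K"), the byte 0x7b ("{", which is 123), and STOP.
proto2 = pickle.dumps(value, protocol=2)
assert proto2 == b"\x80\x02K{."

# Every protocol round-trips the value; only the encoding differs.
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(value, protocol=protocol)
    assert pickle.loads(data) == value
    print(protocol, data)
```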
While recently learning about how the pickle protocol works, I was able to build a basic unpickler in Factor. The implementation is about 300 lines of code, and has some decent tests. There are a few more features we should add for completeness, but it’s a good start!
I thought I’d go over a few parts of the implementation here.
The pickle protocol is stack-based, and we represent the stack with a growable vector. It also uses a memoization feature that lets the data stream refer to repeated objects by integer keys, and we store those in a hashtable:
CONSTANT: stack V{ }
CONSTANT: memo H{ }
ERROR: invalid-memo key ;
: get-memo ( i -- )
    memo ?at [ stack push ] [ invalid-memo ] if ;
: put-memo ( i -- )
    [ stack last ] dip memo set-at ;
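For readers who don't speak Factor, here is a rough Python sketch of the same idea. The Machine class and its method names are hypothetical stand-ins for the stack and memo globals above, not part of any library. The last two lines show why the memo matters: Python's own pickle preserves object identity when the same object appears twice.

```python
import pickle


class Machine:
    """Hypothetical stand-in for the Factor `stack` and `memo` globals."""

    def __init__(self):
        self.stack = []  # growable vector of partially built objects
        self.memo = {}   # integer key -> object seen earlier in the stream

    def put_memo(self, key):
        # Remember the object on top of the stack without popping it.
        self.memo[key] = self.stack[-1]

    def get_memo(self, key):
        # Push a previously remembered object back onto the stack.
        if key not in self.memo:
            raise KeyError(f"invalid memo key: {key}")
        self.stack.append(self.memo[key])


vm = Machine()
shared = ["abc"]
vm.stack.append(shared)
vm.put_memo(0)
vm.get_memo(0)  # the stack now holds the same list object twice

# The real pickle module relies on its memo to keep shared references shared.
first, second = pickle.loads(pickle.dumps([shared, shared]))
```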
The protocol also has the concept of markers. We place one using the +marker+ symbol and later use it, for example, to pop every item above the most recent marker on the stack:
SYMBOL: +marker+
: pop-from-marker ( -- items )
    +marker+ stack last-index
    [ 1 + stack swap tail ] [ stack shorten ] bi ;
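The same operation can be sketched in Python. MARK and pop_from_marker are illustrative names, and the sentinel object plays the role of +marker+. As in the Factor word, the marker itself is removed along with the items above it.

```python
MARK = object()  # unique sentinel, playing the role of +marker+


def pop_from_marker(stack):
    """Remove and return the items above the most recent MARK, dropping the MARK too."""
    # Search from the top of the stack for the most recent marker.
    for index in range(len(stack) - 1, -1, -1):
        if stack[index] is MARK:
            items = stack[index + 1:]
            del stack[index:]
            return items
    raise ValueError("no marker on the stack")


stack = ["x", MARK, 1, MARK, 2, 3]
items = pop_from_marker(stack)  # takes only what sits above the second marker
```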
Unpickling starts with a dispatch loop that acts on each supported opcode. We can use a +no-return+ symbol to indicate that we are not ready to return an object until the STOP opcode is seen.
ERROR: invalid-opcode opcode ;
SYMBOL: +no-return+
: unpickle-dispatch ( opcode -- value )
    +no-return+ swap {
        ! Protocol 0 and 1
        { CHAR: ( [ load-mark ] }
        { CHAR: . [ drop stack pop ] }
        { CHAR: 0 [ load-pop ] }
        { CHAR: 1 [ load-pop-mark ] }
        { CHAR: 2 [ load-dup ] }
        { CHAR: F [ load-float ] }
        { CHAR: I [ load-int ] }
        { CHAR: J [ load-binint ] }
        { CHAR: K [ load-binint1 ] }
        { CHAR: L [ load-long ] }
        { CHAR: M [ load-binint2 ] }
        { CHAR: N [ load-none ] }
        { CHAR: P [ load-persid ] }
        { CHAR: Q [ load-binpersid ] }
        { CHAR: R [ load-reduce ] }
        { CHAR: S [ load-string ] }
        { CHAR: T [ load-binstring ] }
        { CHAR: U [ load-short-binstring ] }
        { CHAR: V [ load-unicode ] }
        { CHAR: X [ load-binunicode ] }
        { CHAR: a [ load-append ] }
        { CHAR: b [ load-build ] }
        { CHAR: c [ load-global ] }
        { CHAR: d [ load-dict ] }
        { CHAR: } [ load-empty-dict ] }
        { CHAR: e [ load-appends ] }
        { CHAR: g [ load-get ] }
        { CHAR: h [ load-binget ] }
        { CHAR: i [ load-inst ] }
        { CHAR: j [ load-long-binget ] }
        { CHAR: l [ load-list ] }
        { CHAR: ] [ load-empty-list ] }
        { CHAR: o [ load-obj ] }
        { CHAR: p [ load-put ] }
        { CHAR: q [ load-binput ] }
        { CHAR: r [ load-long-binput ] }
        { CHAR: s [ load-setitem ] }
        { CHAR: t [ load-tuple ] }
        { CHAR: ) [ load-empty-tuple ] }
        { CHAR: u [ load-setitems ] }
        { CHAR: G [ load-binfloat ] }
        ! Protocol 2
        { 0x80 [ load-proto ] }
        { 0x81 [ load-newobj ] }
        { 0x82 [ load-ext1 ] }
        { 0x83 [ load-ext2 ] }
        { 0x84 [ load-ext4 ] }
        { 0x85 [ load-tuple1 ] }
        { 0x86 [ load-tuple2 ] }
        { 0x87 [ load-tuple3 ] }
        { 0x88 [ load-true ] }
        { 0x89 [ load-false ] }
        { 0x8a [ load-long1 ] }
        { 0x8b [ load-long4 ] }
        ! Protocol 3 (Python 3.x)
        { CHAR: B [ load-binbytes ] }
        { CHAR: C [ load-short-binbytes ] }
        ! Protocol 4 (Python 3.4+)
        { 0x8c [ load-short-binunicode ] }
        { 0x8d [ load-binunicode8 ] }
        { 0x8e [ load-binbytes8 ] }
        { 0x8f [ load-empty-set ] }
        { 0x90 [ load-additems ] }
        { 0x91 [ load-frozenset ] }
        { 0x92 [ load-newobj-ex ] }
        { 0x93 [ load-stack-global ] }
        { 0x94 [ load-memoize ] }
        { 0x95 [ load-frame ] }
        ! Protocol 5 (Python 3.8+)
        { 0x96 [ load-bytearray8 ] }
        { 0x97 [ load-readonly-buffer ] }
        { 0x98 [ load-next-buffer ] }
        [ invalid-opcode ]
    } case ;
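If you want to cross-check the characters and byte values in a table like this, Python's pickletools module exposes the opcode definitions. The sketch below looks up a few codes from the table; by_code is just a local name chosen for the example.

```python
import pickletools

# pickletools.opcodes is a list of OpcodeInfo records. `code` is a
# one-character Latin-1 string, `name` is the opcode name, and `proto`
# is the protocol version that introduced the opcode.
by_code = {op.code: op for op in pickletools.opcodes}

for code in ["(", ".", "\x80", "\x95", "\x96"]:
    op = by_code[code]
    print(f"0x{ord(code):02x} {op.name:<12} protocol {op.proto}")
```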
With that, we can build our unpickle word that acts on an input-stream, first clearing state and then looping until we see an object to return:
: unpickle ( -- obj )
    stack delete-all memo clear-assoc
    f [ drop read1 unpickle-dispatch dup +no-return+ = ] loop ;
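To show how the loop, the stack, the memo, and the markers fit together, here is a deliberately tiny Python sketch. It is not a port of the Factor code: mini_unpickle, take, and pop_from_marker are made-up names, and it handles only the handful of protocol 4 opcodes that appear in the example pickles file shown later in this post, whose bytes are copied in below.

```python
import struct

MARK = object()  # sentinel pushed by the MARK opcode


def pop_from_marker(stack):
    # Return the items above the most recent MARK and drop the MARK itself.
    index = max(i for i, item in enumerate(stack) if item is MARK)
    items = stack[index + 1:]
    del stack[index:]
    return items


def mini_unpickle(data):
    """Decode a small subset of protocol 4; other opcodes raise ValueError."""
    stack, memo, pos = [], {}, 0

    def take(n):
        # Read the next n bytes from the input.
        nonlocal pos
        chunk = data[pos:pos + n]
        pos += n
        return chunk

    while True:
        op = take(1)[0]
        if op == 0x80:    # PROTO: one version byte, ignored here
            take(1)
        elif op == 0x95:  # FRAME: 8-byte frame length, ignored here
            take(8)
        elif op == 0x5D:  # EMPTY_LIST
            stack.append([])
        elif op == 0x7D:  # EMPTY_DICT
            stack.append({})
        elif op == 0x8F:  # EMPTY_SET
            stack.append(set())
        elif op == 0x28:  # MARK
            stack.append(MARK)
        elif op == 0x94:  # MEMOIZE: store the top of stack under the next key
            memo[len(memo)] = stack[-1]
        elif op == 0x4B:  # BININT1: one unsigned byte
            stack.append(take(1)[0])
        elif op == 0x47:  # BINFLOAT: 8-byte big-endian double
            stack.append(struct.unpack(">d", take(8))[0])
        elif op == 0x8C:  # SHORT_BINUNICODE: length byte, then UTF-8 bytes
            stack.append(take(take(1)[0]).decode("utf-8"))
        elif op == 0x73:  # SETITEM: dict[key] = value
            value, key = stack.pop(), stack.pop()
            stack[-1][key] = value
        elif op == 0x90:  # ADDITEMS: add the marked items to a set
            items = pop_from_marker(stack)
            stack[-1].update(items)
        elif op == 0x65:  # APPENDS: extend a list with the marked items
            items = pop_from_marker(stack)
            stack[-1].extend(items)
        elif op == 0x2E:  # STOP: the finished object is on top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")


# Bytes of the example pickles file, copied from the hexdump in this post.
example = bytes.fromhex(
    "80 04 95 29 00 00 00 00 00 00 00 5d 94 28 8c 03"
    "61 62 63 94 4b 7b 47 40 12 3d 70 a3 d7 0a 3d 7d"
    "94 8c 01 61 94 4b 01 73 8f 94 28 4b 11 4b 34 4b"
    "25 90 65 2e"
)
result = mini_unpickle(example)
print(result)
```

The memo is filled in but never read here, because this particular pickle contains no GET opcodes.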
For convenience, a pickle> word acts on concrete data:
GENERIC: pickle> ( string -- obj )
M: string pickle> [ unpickle ] with-string-reader ;
M: byte-array pickle> binary [ unpickle ] with-byte-reader ;
In addition, we needed to support Python’s string escapes, which differ slightly from the ones that Factor defines (mainly \u#### and \U########), and then add support for some of the basic class types that we might encounter, such as byte-arrays, decimals, and timestamps.
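One place these escapes show up is protocol 0, where Python writes text using its raw-unicode-escape codec. The Python sketch below shows that codec decoding both escape forms. Whether this is the exact code path the Factor implementation needed them for is an assumption on my part.

```python
import pickle

# A euro sign (inside the BMP) and an emoji (outside it).
text = "\u20ac and \U0001f600"

# Protocol 0 stores text with the UNICODE opcode "V", escaping characters
# that do not fit in Latin-1.
encoded = pickle.dumps(text, protocol=0)
print(encoded)

# Decoding the two escape forms by hand with the same codec.
euro = b"\\u20ac".decode("raw-unicode-escape")
emoji = b"\\U0001f600".decode("raw-unicode-escape")

assert euro == "\u20ac"
assert emoji == "\U0001f600"
assert pickle.loads(encoded) == text
```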
We currently do not support persistent IDs, readonly vs. read/write buffers, out-of-band buffers, the object build opcode, or the extension registry. And of course, this is just unpickling. We do not yet support pickling of Factor objects, although that shouldn’t be too hard to add.
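A quick way to tell whether a given pickle would run into one of these gaps is to scan its opcodes with Python's pickletools. In the sketch below, the UNSUPPORTED set is my own guess at how the features listed above map onto opcode names, and unsupported_opcodes is a hypothetical helper.

```python
import pickle
import pickletools
import types

# Assumed mapping from the unsupported features to pickle opcode names.
UNSUPPORTED = {
    "PERSID", "BINPERSID",             # persistent IDs
    "READONLY_BUFFER", "NEXT_BUFFER",  # protocol 5 buffer opcodes
    "BUILD",                           # object build
    "EXT1", "EXT2", "EXT4",            # extension registry
}


def unsupported_opcodes(data):
    """Return the names of opcodes in `data` that appear in UNSUPPORTED."""
    used = {op.name for op, arg, pos in pickletools.genops(data)}
    return used & UNSUPPORTED


# Plain containers and scalars only need the basic opcodes.
plain = pickle.dumps(["abc", 123, 4.56], protocol=4)

# An object that pickles with separate state goes through BUILD.
stateful = pickle.dumps(types.SimpleNamespace(a=1), protocol=2)

print(unsupported_opcodes(plain))
print(unsupported_opcodes(stateful))
```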
But, despite that, it works pretty well!
Here’s an example where we store some mixed data in a pickles file with Python:
>>> data = ["abc", 123, 4.56, {"a":1}, {17,37,52}]
>>> import pickle
>>> with open('pickles', 'wb') as f:
...     pickle.dump(data, f)
...
We can then inspect that pickles file and load it with Factor!
IN: scratchpad USE: tools.hexdump
IN: scratchpad "pickles" hexdump-file
00000000 80 04 95 29 00 00 00 00 00 00 00 5d 94 28 8c 03 ...).......].(..
00000010 61 62 63 94 4b 7b 47 40 12 3d 70 a3 d7 0a 3d 7d abc.K{G@.=p...=}
00000020 94 8c 01 61 94 4b 01 73 8f 94 28 4b 11 4b 34 4b ...a.K.s..(K.K4K
00000030 25 90 65 2e %.e.
00000034
IN: scratchpad USE: pickle
IN: scratchpad "pickles" binary file-contents pickle> .
V{ "abc" 123 4.56 H{ { "a" 1 } } HS{ 17 52 37 } }
This is available in the latest development version.