Text-or-Binary?
Wednesday, August 4, 2010
Sometimes it is useful to be able to tell if a file should be treated as a stream of text or binary characters. Rather than use the file extension (which might be missing or wrong), Subversion has a simple heuristic based on the file contents:
Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary.
Someone implemented this in a library written in Clojure. Here’s my take, but in Factor.
Some vocabularies we will use, and a namespace:
USING: io io.encodings.binary io.files kernel math sequences ;
IN: text-or-binary
Checking if any of the bytes are zero:
: includes-zeros? ( seq -- ? )
0 swap member? ;
The first 32 characters (i.e., 0-31) of ASCII are reserved for non-printing control characters. Checking that more than 85% of the characters are printable (and assuming an empty sequence is printable):
: majority-printable? ( seq -- ? )
[ t ] [
[ [ 31 > ] count ] [ length ] bi / 0.85 >
] if-empty ;
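To make the threshold concrete, here is a minimal sketch that exposes only the fraction being compared against 0.85. The word printable-ratio is a hypothetical helper for illustration and is not part of the original code. It also shows a side effect of the control-character test: tab and newline fall in the 0-31 range, so they count as non-printing here.

```factor
USING: kernel math prettyprint sequences ;
IN: scratchpad

! Hypothetical helper (not in the original vocabulary): the fraction
! of elements above the ASCII control range 0-31. Dividing two
! integers yields an exact ratio. Not safe for empty input (0 0 /).
: printable-ratio ( seq -- ratio )
    [ [ 31 > ] count ] [ length ] bi / ;

! "Hi" plus a newline: the newline (10) counts as non-printing.
B{ 72 105 10 } printable-ratio .          ! 2/3
B{ 72 105 10 } printable-ratio 0.85 > .   ! f

! A string is a sequence of code points, so it works too.
"hello world\n" printable-ratio .         ! 11/12
"hello world\n" printable-ratio 0.85 > .  ! t
```

The division by zero on empty input is presumably why the real word wraps the calculation in if-empty.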
Then, determining whether a sequence of bytes is text:
: text? ( seq -- ? )
[ includes-zeros? not ] [ majority-printable? ] bi and ;
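A few unit tests can pin down how the two checks combine. These are hypothetical cases written for illustration, not necessarily the tests from the author's repository, and they assume the words above are loaded from the text-or-binary vocabulary.

```factor
USING: text-or-binary tools.test ;
IN: text-or-binary.tests

! Empty input: no zeros, and treated as printable.
{ t } [ B{ } text? ] unit-test

! Plain ASCII text.
{ t } [ "hello world" text? ] unit-test

! A single zero byte is enough to call it binary.
{ f } [ B{ 72 0 105 } text? ] unit-test

! No zeros, but only 1 of 5 bytes is above the control range.
{ f } [ B{ 1 2 3 4 65 } text? ] unit-test
```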
And implementing the operation to check if a file is text or binary:
: text-file? ( path -- ? )
binary [ 1024 read text? ] with-file-reader ;
Using it is pretty easy:
IN: scratchpad "/usr/share/dict/words" text-file? .
t
IN: scratchpad "/bin/sh" text-file? .
f
The code for this (and some tests) is available on my GitHub.