Wednesday, August 4, 2010
Sometimes it is useful to be able to tell whether a file should be treated as text or as binary data. Rather than rely on the file extension (which might be missing or wrong), Subversion uses a simple heuristic based on the file contents:
Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary.
Some vocabularies we will use, and a namespace:
USING: io io.encodings.binary io.files kernel math sequences ;
IN: text-or-binary
Checking if any of the bytes are zero:
: includes-zeros? ( seq -- ? ) 0 swap member? ;
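As a small sketch of what this check relies on: `member?` takes an element and a sequence, and it works the same way on a byte array or a string. The sample values here are made up for illustration.

```factor
USING: prettyprint sequences ;

! A byte array whose last byte is zero
0 B{ 72 105 0 } member? .   ! prints t

! A string with no zero character in it
0 "Hi" member? .            ! prints f
```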
The first 32 characters (i.e., 0-31) of ASCII are reserved for non-printing control characters. Checking that a large majority (over 85%) of characters are printable (and assuming an empty sequence is printable):
: majority-printable? ( seq -- ? )
    [ t ] [ [ [ 31 > ] count ] [ length ] bi / 0.85 > ] if-empty ;
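To make the arithmetic concrete, here is a minimal sketch that computes only the fraction being compared against 0.85. The word `printable-ratio` and the vocabulary name `text-or-binary.sketch` are hypothetical names introduced for this illustration, not part of the original code.

```factor
USING: kernel math prettyprint sequences ;
IN: text-or-binary.sketch

! Hypothetical helper (not from the post): the fraction of elements
! greater than 31. Dividing two integers yields an exact ratio.
! An empty sequence would divide by zero, so callers must check first.
: printable-ratio ( seq -- ratio )
    [ [ 31 > ] count ] [ length ] bi / ;

"abc\n" printable-ratio .          ! prints 3/4
"abc\n" printable-ratio 0.85 > .   ! prints f
```

Note that the newline (10) falls in the 0-31 range, so three of the four characters count as printable. That is 75%, which is below the threshold, so a very short line of text can fail this check on its own.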
Then, determining whether a sequence of bytes is text:
: text? ( seq -- ? ) [ includes-zeros? not ] [ majority-printable? ] bi and ;
And implementing the operation to check if a file is text or binary:
: text-file? ( path -- ? ) binary [ 1024 read text? ] with-file-reader ;
Using it is pretty easy:
IN: scratchpad "/usr/share/dict/words" text-file? .
t
IN: scratchpad "/bin/sh" text-file? .
f
The code for this (and some tests) is available on my GitHub.