Re: Factor

Factor: the language, the theory, and the practice.

Text-or-Binary?

Wednesday, August 4, 2010

#text

Sometimes it is useful to be able to tell whether a file should be treated as text or as binary data. Rather than rely on the file extension (which might be missing or wrong), Subversion uses a simple heuristic based on the file contents:

Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary.

Someone implemented this in a library written in Clojure. Here’s my take, but in Factor.

Some vocabularies we will use, and a namespace:

USING: io io.encodings.binary io.files kernel math sequences ;

IN: text-or-binary

Checking if any of the bytes are zero:

: includes-zeros? ( seq -- ? )
    0 swap member? ;

The first 32 characters (i.e., 0-31) of ASCII are reserved for non-printing control characters. Checking that a clear majority (over 85%) of the characters are printable, that is, above 31 (and treating an empty sequence as printable):

: majority-printable? ( seq -- ? )
    [ t ] [
        ! the ratio of bytes above 31 to the total length must exceed 0.85
        [ [ 31 > ] count ] [ length ] bi / 0.85 >
    ] if-empty ;
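
To make the 85% threshold concrete, here is a small listener sketch. It assumes majority-printable? compares the printable ratio against 0.85 with a strict >, and the 100-element test arrays are made up for illustration (control byte 1 stands in for any non-printing character):

```factor
IN: scratchpad USE: text-or-binary

! 90 printable bytes plus 10 control bytes: 90/100 exceeds 0.85
IN: scratchpad 90 CHAR: a <array> 10 1 <array> append majority-printable? .
t

! 80 printable bytes plus 20 control bytes: 80/100 does not exceed 0.85
IN: scratchpad 80 CHAR: a <array> 20 1 <array> append majority-printable? .
f
```

Dividing two integers in Factor yields an exact ratio (9/10 and 4/5 here), which is then compared against the float 0.85.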

Then, determining whether a sequence of bytes is text:

: text? ( seq -- ? )
    [ includes-zeros? not ] [ majority-printable? ] bi and ;
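
A quick way to see the two rules interact is to try text? on a few in-memory sequences. This is only an illustrative sketch: the inputs are invented, and it assumes majority-printable? applies the strict "over 85%" comparison described above:

```factor
IN: scratchpad USE: text-or-binary

! every character is above 31 and none is zero
IN: scratchpad "hello, world" text? .
t

! mostly control bytes, and one of them is zero
IN: scratchpad B{ 0 1 2 3 } text? .
f

! "hello world" followed by a single zero byte:
! 11 of 12 bytes are printable, but the zero alone makes it binary
IN: scratchpad { 104 101 108 108 111 32 119 111 114 108 100 0 } text? .
f
```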

And implementing the operation to check if a file is text or binary:

: text-file? ( path -- ? )
    binary [ 1024 read text? ] with-file-reader ;
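
The same quotation can be tried without touching the disk by reading from a byte array instead of a file. The sketch below assumes the with-byte-reader word from io.streams.byte-array, and the five input bytes (the ASCII codes for "hello") are made up for the example:

```factor
IN: scratchpad USING: io io.encodings.binary io.streams.byte-array text-or-binary ;

! read returns at most 1024 bytes; here the stream holds only five
IN: scratchpad B{ 104 101 108 108 111 } binary [ 1024 read text? ] with-byte-reader .
t
```

Because read hands back fewer than 1024 bytes when the input is shorter, small files are judged on their entire contents.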

Using it is pretty easy:

IN: scratchpad "/usr/share/dict/words" text-file? .
t

IN: scratchpad "/bin/sh" text-file? .
f

The code for this (and some tests) is available on my GitHub.