Re: Factor

Factor: the language, the theory, and the practice.

wc -l

Thursday, November 10, 2011

#command-line #simd

The wc program is a utility that can count the number of lines in a file. It has a number of options that are described on its man page that can change its function to count characters or word or produce the length of the longest line.

For the last month, Joe Groff has been improving the performance of Factor’s I/O libraries. Yesterday, we were investigating slow performance when doing lots of small reads from a file. Joe was able to make a number of nice speedups, which will be in the next release of Factor.

After suggesting that we use wc -l as a benchmark to aspire to, we came up with several approaches with varying performance. I want to demonstrate these, using timing information from running it on my computer. Although the Factor image file is binary data, we are going to count the number of newline characters in it.

Note: some of these require the latest development version of Factor to run.

Our “gold standard” will be wc -l, which takes just over 0.1 seconds:

$ time wc -l factor.image
 6149212 factor.image

real 0m0.111s
user 0m0.090s
sys  0m0.020s

Our first attempt in Factor is the shortest amount of code but takes 5.8 seconds (you can time this by running “USE: system [ image wc-file-lines ] time”):

: wc-file-lines ( path -- n )
    binary file-lines length ;

We don’t really need the lines, just their count, so perhaps just increment each-line in a loop. This is an improvement at just over 3 seconds:

: wc-each-line ( path -- n )
    binary [ 0 [ drop 1 + ] each-line ] with-file-reader ;

Trying to use read-until to look for the next newline, is a bit slower at 3.4 seconds:

: wc-read-until ( path -- n )
    binary [
        0 [ "\n" read-until [ drop 1 + ] dip ] loop
    ] with-file-reader ;

Instead of reading each line at a time, we can just read 65,536 byte blocks and count the number of newlines. This takes about 1.5 seconds:

: wc-each-block ( path -- n )
    binary [
        0 [ [ CHAR: \n = ] count + ] each-block
    ] with-file-reader ;

Since we are only counting characters in each block, we don’t need to allocate and copy the bytes out of the I/O buffer. Instead, we can look at a slice. This takes about 1 second:

: wc-each-block-slice ( path -- n )
    binary [
        0 [ [ CHAR: \n = ] count + ] each-block-slice
    ] with-file-reader ;

The stream functions (such as read) operate on an input-stream dynamic variable, which introduces some overhead. If we remove that, the compiler can eliminate some of the dynamic dispatches. taking 0.320 seconds:

: wc-fast ( path -- n )
    binary <file-reader> [
        0 swap [
            [ CHAR: \n = ] count +
        ] each-stream-block-slice
    ] with-disposal ;

If we make an assumption that the number of lines in a file will fit into a fixnum, then we can get a bit faster at 0.240 seconds:

: wc-faster ( path -- n )
    binary <file-reader> [
        0 swap [
            [ CHAR: \n = ] count + >fixnum
        ] each-stream-block-slice
    ] with-disposal ;

And, if we cheat and just run the wc -l process directly, we can get to 0.210 seconds:

: wc-system ( path -- n )
    "wc -l " prepend utf8 [
        readln " " split harvest first string>number
    ] with-process-reader ;

Overall, not too bad! Factor is getting within shouting distance of programs written in “faster” languages. Probably with a few more hours, we could close the gap, but thats enough for today.

Update: using SIMD, Joe was able to get the time down to 0.120 seconds!

: aligned-slices ( seq -- head tail )
   dup length 15 bitnot bitand cut-slice ; inline

: wc-simd ( path -- n )
   [
       0 swap binary <file-reader> &dispose [
           aligned-slices [
               uchar-16 cast-array swap
               [ 10 uchar-16-with v= vcount + >fixnum ] reduce
           ] [ [ 10 = ] count + >fixnum ] bi*
       ] each-stream-block-slice
   ] with-destructors ;