Thursday, November 10, 2011
wc program is a utility that can count the number of lines in a
file. It has a number of options that are described on its man
page that can change its function to
count characters or word or produce the length of the longest line.
For the last month, Joe Groff has been improving the performance of Factor’s I/O libraries. Yesterday, we were investigating slow performance when doing lots of small reads from a file. Joe was able to make a number of nice speedups, which will be in the next release of Factor.
After suggesting that we use
wc -l as a benchmark to aspire to, we
came up with several approaches with varying performance. I want to
demonstrate these, using timing information from running it on my
computer. Although the Factor image file is binary data, we are going to
count the number of newline characters in it.
Note: some of these require the latest development version of Factor to run.
Our “gold standard” will be
wc -l, which takes just over 0.1 seconds:
$ time wc -l factor.image 6149212 factor.image real 0m0.111s user 0m0.090s sys 0m0.020s
Our first attempt in Factor is the shortest amount of code but takes 5.8
seconds (you can time this by running
USE: system [ image wc-file-lines ] time”):
: wc-file-lines ( path -- n ) binary file-lines length ;
We don’t really need the lines, just their count, so perhaps just
each-line in a loop. This is an improvement at just over 3
: wc-each-line ( path -- n ) binary [ 0 [ drop 1 + ] each-line ] with-file-reader ;
Trying to use
read-until to look for the next newline, is a bit slower
at 3.4 seconds:
: wc-read-until ( path -- n ) binary [ 0 [ "\n" read-until [ drop 1 + ] dip ] loop ] with-file-reader ;
Instead of reading each line at a time, we can just read 65,536 byte blocks and count the number of newlines. This takes about 1.5 seconds:
: wc-each-block ( path -- n ) binary [ 0 [ [ CHAR: \n = ] count + ] each-block ] with-file-reader ;
Since we are only counting characters in each block, we don’t need to
allocate and copy the bytes out of the I/O buffer. Instead, we can look
slice. This takes about 1 second:
: wc-each-block-slice ( path -- n ) binary [ 0 [ [ CHAR: \n = ] count + ] each-block-slice ] with-file-reader ;
The stream functions (such as
read) operate on an
dynamic variable, which introduces some overhead. If we remove that, the
compiler can eliminate some of the dynamic dispatches. taking 0.320
: wc-fast ( path -- n ) binary <file-reader> [ 0 swap [ [ CHAR: \n = ] count + ] each-stream-block-slice ] with-disposal ;
If we make an assumption that the number of lines in a file will fit
fixnum, then we can get a bit faster at 0.240 seconds:
: wc-faster ( path -- n ) binary <file-reader> [ 0 swap [ [ CHAR: \n = ] count + >fixnum ] each-stream-block-slice ] with-disposal ;
And, if we cheat and just run the
wc -l process directly, we can get
to 0.210 seconds:
: wc-system ( path -- n ) "wc -l " prepend utf8 [ readln " " split harvest first string>number ] with-process-reader ;
Overall, not too bad! Factor is getting within shouting distance of programs written in “faster” languages. Probably with a few more hours, we could close the gap, but thats enough for today.
Update: using SIMD, Joe was able to get the time down to 0.120 seconds!
: aligned-slices ( seq -- head tail ) dup length 15 bitnot bitand cut-slice ; inline : wc-simd ( path -- n ) [ 0 swap binary <file-reader> &dispose [ aligned-slices [ uchar-16 cast-array swap [ 10 uchar-16-with v= vcount + >fixnum ] reduce ] [ [ 10 = ] count + >fixnum ] bi* ] each-stream-block-slice ] with-destructors ;