Re: Factor

Factor: the language, the theory, and the practice.

Fast "shuffle"

Tuesday, February 26, 2013

#ffi #performance #random

Randomly shuffling the elements in an array is a pretty common task. The standard algorithm for this is called the Fisher-Yates shuffle. The modern version of it looks like this in pseudo-code:

To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]

We’re going to implement this in Factor, and then look at improving the performance when shuffling an array of 10,000,000 numbers (generated with “10,000,000 <iota> >array”). Afterwards, we are going to look at Ruby and Python versions.

Factor

Version 1:

Our first implementation is a straight-forward translation of the algorithm (a simplification of the randomize word included in the Factor standard library), and looks something like this:

: shuffle1 ( seq -- seq )
    dup length [ dup 1 > ] [
       [ random ] [ 1 - ] bi
       [ pick exchange ] keep
    ] while drop ;

On our test case, this takes 2.972 seconds.

Version 2:

Our second implementation uses exchange-unsafe, which does not perform bounds checks (unnecessary given our loop constraints).

: shuffle2 ( seq -- seq )
    dup length [ dup 1 > ] [
       [ random ] [ 1 - ] bi
       [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.830 seconds (a 5% improvement on version 1!).

Version 3:

Our third implementation uses the typed vocabulary to provide type information to the compiler to produce more optimal code:

TYPED: shuffle3 ( seq: array -- seq )
    dup length [ dup 1 > ] [
       [ random ] [ 1 - ] bi
       [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.554 seconds (a 15% improvement on version 1!).

Version 4:

Our fourth implementation instead removes the generic dispatch of calling random as well as the dynamic variable lookup for the current random number generator:

: shuffle4 ( seq -- seq )
    dup length [ dup 1 > ]
    random-generator get '[
       [ _ (random-integer) ] [ 1 - ] bi
       [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.408 seconds (a 19% improvement on version 1!).

Version 5:

Our fifth implementation combines the optimizations in version 3 and 4:

TYPED: shuffle5 ( seq: array -- seq )
    dup length [ dup 1 > ]
    random-generator get '[
        [ _ (random-integer) ] [ 1 - ] bi
        [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.254 seconds (a 24% improvement on version 1!).

Version 6:

Our sixth implementation declares that the indices to exchange-unsafe are fixnums (removing a minor check that isn’t optimized out by the compiler for some reason):

TYPED: shuffle6 ( seq: array -- seq )
    dup length [ dup 1 > ]
    random-generator get '[
        [ _ (random-integer) ] [ 1 - ] bi
        { fixnum fixnum } declare
        [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.187 seconds (a 27% improvement on version 1!).

Version 7:

Our seventh implementation is just version 6 with a quick-and-dirty random number generator using rand():

FUNCTION: int rand ( )

SINGLETON: c-random
M: c-random random-32* drop rand ;

: with-c-random ( quot -- )
    [ c-random ] dip with-random ; inline

On our test case, this takes 2.015 seconds (a 32% improvement on version 1!).

Version 8:

Our eighth (and final!) version drops down to C, to implement a primitive version of shuffle:

void factor_vm::primitive_shuffle_array()
{
    array* a = untag_check<array>(ctx->peek());
    cell capacity = array_capacity(a);
    for (cell i = capacity - 1; i > 0; i--) {
        cell j = i + rand() / (RAND_MAX / (capacity - i) + 1);
        cell tmp = array_nth(a, i);
        set_array_nth(a, i, array_nth(a, j));
        set_array_nth(a, j, tmp);
    }
}

On our test case, this takes 0.494 seconds (a 83% improvement on version 1!).

Python

I wanted to see how various Python versions performed, so I wrote a simple version that takes 19.025 seconds on Python 2.7.3, 12.436 seconds on Jython 2.5.3, and 1.392 seconds with PyPy 1.9.0:

from random import randrange
import time

n = 10000000
l = list(xrange(n))
t0 = time.time()
i = n
while i > 1:
    j = randrange(i)
    i -= 1
    l[i], l[j] = l[j], l[i]
print 'took %.3f seconds' % (time.time() - t0)

A version using Numpy takes 3.273 seconds:

import numpy as np
import time

a = np.arange(10000000)
t0 = time.time()
np.random.shuffle(a)
print 'took %.3f seconds' % (time.time() - t0)

Ruby

Curious what Ruby would look like, I tested two versions. The first implements shuffle in “pure” Ruby, taking 149.975 seconds on Ruby 1.8.7, 25.086 seconds in Ruby 1.9.3, 8.054 seconds on MacRuby 0.12, 7.258 seconds on JRuby 1.7.3, and 4.682 seconds on Rubinius 1.2.4:

def shuffle!(a)
    size = a.length
    size.times do |i|
        r = i + Kernel.rand(size - i)
        a[i], a[r] = a[r], a[i]
    end
end

a = (1..10000000).to_a
t0 = Time.new
shuffle!(a)
t1 = Time.new
printf "took %.3f seconds", t1-t0

But since Ruby 1.8.7, Array.shuffle! is a builtin (implemented in C), so lets look at how performance differs. It takes 0.443 seconds in Ruby 1.8.7, 0.554 seconds in Ruby 1.9.3, 0.884 seconds in MacRuby 0.12, 2.170 seconds in JRuby 1.7.3, and 3.479 seconds in Rubinius 1.2.4:

a = (1..10000000).to_a
t0 = Time.new
a = a.shuffle!
t1 = Time.new
printf "took %.3f seconds", t1-t0

Conclusion

C is awesome. PyPy is really fast. Factor can be pretty fast. Numpy should be faster. Jython could be faster. Python isn’t very fast at all. Ruby is both blazing fast and slower than snails!