## Fast "shuffle"

Tuesday, February 26, 2013

Randomly shuffling the elements in an array is a pretty common task. The standard algorithm for this is called the Fisher-Yates shuffle. The modern version of it looks like this in pseudo-code:

```
To shuffle an array a of n elements (indices 0..n-1):
for i from n − 1 downto 1 do
j ← random integer with 0 ≤ j ≤ i
exchange a[j] and a[i]
```

We’re going to implement this in Factor, and
then look at improving the performance when shuffling an array of
10,000,000 numbers (generated with “`10,000,000 <iota> >array`

”).
Afterwards, we are going to look at Ruby and Python versions.

### Factor

**Version 1:**

Our first implementation is a straight-forward translation of the algorithm (a simplification of the randomize word included in the Factor standard library), and looks something like this:

```
: shuffle1 ( seq -- seq )
dup length [ dup 1 > ] [
[ random ] [ 1 - ] bi
[ pick exchange ] keep
] while drop ;
```

On our test case, this takes **2.972** seconds.

**Version 2:**

Our second implementation uses exchange-unsafe, which does not perform bounds checks (unnecessary given our loop constraints).

```
: shuffle2 ( seq -- seq )
dup length [ dup 1 > ] [
[ random ] [ 1 - ] bi
[ pick exchange-unsafe ] keep
] while drop ;
```

On our test case, this takes **2.830** seconds (a 5% improvement on
version 1!).

**Version 3:**

Our third implementation uses the typed vocabulary to provide type information to the compiler to produce more optimal code:

```
TYPED: shuffle3 ( seq: array -- seq )
dup length [ dup 1 > ] [
[ random ] [ 1 - ] bi
[ pick exchange-unsafe ] keep
] while drop ;
```

On our test case, this takes **2.554** seconds (a 15% improvement on
version 1!).

**Version 4:**

Our fourth implementation instead removes the generic dispatch of calling random as well as the dynamic variable lookup for the current random number generator:

```
: shuffle4 ( seq -- seq )
dup length [ dup 1 > ]
random-generator get '[
[ _ (random-integer) ] [ 1 - ] bi
[ pick exchange-unsafe ] keep
] while drop ;
```

On our test case, this takes **2.408** seconds (a 19% improvement on
version 1!).

**Version 5:**

Our fifth implementation combines the optimizations in version 3 and 4:

```
TYPED: shuffle5 ( seq: array -- seq )
dup length [ dup 1 > ]
random-generator get '[
[ _ (random-integer) ] [ 1 - ] bi
[ pick exchange-unsafe ] keep
] while drop ;
```

On our test case, this takes **2.254** seconds (a 24% improvement on
version 1!).

**Version 6:**

Our sixth implementation declares that the indices to exchange-unsafe are fixnums (removing a minor check that isn’t optimized out by the compiler for some reason):

```
TYPED: shuffle6 ( seq: array -- seq )
dup length [ dup 1 > ]
random-generator get '[
[ _ (random-integer) ] [ 1 - ] bi
{ fixnum fixnum } declare
[ pick exchange-unsafe ] keep
] while drop ;
```

On our test case, this takes **2.187** seconds (a 27% improvement on
version 1!).

**Version 7:**

Our seventh implementation is just version 6 with a quick-and-dirty random number generator using rand():

```
FUNCTION: int rand ( )
SINGLETON: c-random
M: c-random random-32* drop rand ;
: with-c-random ( quot -- )
[ c-random ] dip with-random ; inline
```

On our test case, this takes **2.015** seconds (a 32% improvement on
version 1!).

**Version 8:**

Our eighth (and final!) version drops down to C, to implement a primitive version of shuffle:

```
void factor_vm::primitive_shuffle_array()
{
array* a = untag_check<array>(ctx->peek());
cell capacity = array_capacity(a);
for (cell i = capacity - 1; i > 0; i--) {
cell j = i + rand() / (RAND_MAX / (capacity - i) + 1);
cell tmp = array_nth(a, i);
set_array_nth(a, i, array_nth(a, j));
set_array_nth(a, j, tmp);
}
}
```

On our test case, this takes **0.494** seconds (a 83% improvement on
version 1!).

### Python

I wanted to see how various Python versions
performed, so I wrote a simple version that takes **19.025** seconds on
Python 2.7.3, **12.436** seconds on Jython 2.5.3, and **1.392** seconds
with PyPy 1.9.0:

```
from random import randrange
import time
n = 10000000
l = list(xrange(n))
t0 = time.time()
i = n
while i > 1:
j = randrange(i)
i -= 1
l[i], l[j] = l[j], l[i]
print 'took %.3f seconds' % (time.time() - t0)
```

A version using Numpy takes **3.273** seconds:

```
import numpy as np
import time
a = np.arange(10000000)
t0 = time.time()
np.random.shuffle(a)
print 'took %.3f seconds' % (time.time() - t0)
```

### Ruby

Curious what Ruby would look like, I tested two versions. The first
implements shuffle in “pure” Ruby, taking **149.975** seconds on Ruby
1.8.7, **25.086** seconds in Ruby 1.9.3, **8.054** seconds on MacRuby
0.12, **7.258** seconds on JRuby 1.7.3, and **4.682** seconds on
Rubinius 1.2.4:

```
def shuffle!(a)
size = a.length
size.times do |i|
r = i + Kernel.rand(size - i)
a[i], a[r] = a[r], a[i]
end
end
a = (1..10000000).to_a
t0 = Time.new
shuffle!(a)
t1 = Time.new
printf "took %.3f seconds", t1-t0
```

But since Ruby 1.8.7,
Array.shuffle!
is a builtin (implemented in
C), so lets look
at how performance differs. It takes **0.443** seconds in Ruby 1.8.7,
**0.554** seconds in Ruby 1.9.3, **0.884** seconds in MacRuby 0.12,
**2.170** seconds in JRuby 1.7.3, and **3.479** seconds in Rubinius
1.2.4:

```
a = (1..10000000).to_a
t0 = Time.new
a = a.shuffle!
t1 = Time.new
printf "took %.3f seconds", t1-t0
```

### Conclusion

C is awesome. PyPy is really fast. Factor can be pretty fast. Numpy should be faster. Jython could be faster. Python isn’t very fast at all. Ruby is both blazing fast and slower than snails!