## Fast Factorial

Monday, November 4, 2013

You might remember that I implemented a bunch of factorial words several months ago.

Recently, I was looking at Python’s latest mathmodule.c, and noticed an implementation of a “divide-and-conquer factorial algorithm”. The algorithm is based on one described by Peter Luschny as the binary-split formula of the factorial of n. It seemed like a fun thing to implement and benchmark in Factor.

It turns out this was one of the
improvements made during the Python
3.2 development cycle. Calculating the 50,000th factorial takes **0.724
seconds** with Python 2.7.5 and only **0.064 seconds** with Python
3.3.2 - a nice improvement!

### slow-factorial

The most basic factorial would just multiply the numbers from `1`

to
`n`

:

```
: slow-factorial ( n -- n! )
dup 1 > [ [1..b] 1 [ * ] reduce ] [ drop 1 ] if ;
```

Calculating the 50,000 factorial this way takes **0.915 seconds**.

### factorial

It turns out that doing that causes a lot of arbitrary-precision bignum multiplication. We can instead use binary-reduce (used by the standard library product and sum words), which is a generic algorithm that recursively multiplies groups of numbers which allows for more fixnum multiplications.

```
: factorial ( n -- n! )
dup 1 > [ [1..b] 1 [ * ] binary-reduce ] [ drop 1 ] if ;
```

Calculating the 50,000 factorial takes **0.217 seconds**.

### fast-factorial

We can implement Peter Luschny’s algorithm, keeping the structure very close to his sample code:

```
:: part-product ( n m -- x )
{
{ [ m n 1 + <= ] [ n ] }
{ [ m n 2 + = ] [ n m * ] }
[
n m + 2 /i dup even? [ 1 - ] when :> k
n k k 2 + m [ part-product ] 2bi@ *
]
} cond ;
:: factorial-loop ( n p r -- p' r' )
n 2 > [
n 2 /i :> mid
mid p r factorial-loop
[
mid 1 + mid 1 bitand +
n 1 - n 1 bitand + part-product *
] [ dupd * ] bi*
] [ p r ] if ;
: fast-factorial ( x -- n )
[ 1 1 factorial-loop nip ] [ dup bit-count - ] bi shift ;
```

Calculating the 50,000 factorial now only takes **0.171 seconds**!

The code for this is on my GitHub.

*Note: there are other
algorithms that are
likely faster, but for this purpose I wanted to implement a similar
algorithm to Python. Their implementation has a few modifications
including a tight loop for multiplications that fit into an unsigned
long and is over 150 lines of code.*