Fast Factorial
Monday, November 4, 2013
You might remember that I implemented a bunch of factorial words several months ago.
Recently, I was looking at Python’s latest mathmodule.c, and noticed an implementation of a “divide-and-conquer factorial algorithm”. The algorithm is based on one described by Peter Luschny as the binary-split formula of the factorial of n. It seemed like a fun thing to implement and benchmark in Factor.
It turns out this was one of the improvements made during the Python 3.2 development cycle. Calculating the 50,000th factorial takes 0.724 seconds with Python 2.7.5 and only 0.064 seconds with Python 3.3.2 - a nice improvement!
slow-factorial
The most basic factorial would just multiply the numbers from 1
to
n
:
: slow-factorial ( n -- n! )
dup 1 > [ [1..b] 1 [ * ] reduce ] [ drop 1 ] if ;
Calculating the 50,000 factorial this way takes 0.915 seconds.
factorial
It turns out that doing that causes a lot of arbitrary-precision bignum multiplication. We can instead use binary-reduce (used by the standard library product and sum words), which is a generic algorithm that recursively multiplies groups of numbers which allows for more fixnum multiplications.
: factorial ( n -- n! )
dup 1 > [ [1..b] 1 [ * ] binary-reduce ] [ drop 1 ] if ;
Calculating the 50,000 factorial takes 0.217 seconds.
fast-factorial
We can implement Peter Luschny’s algorithm, keeping the structure very close to his sample code:
:: part-product ( n m -- x )
{
{ [ m n 1 + <= ] [ n ] }
{ [ m n 2 + = ] [ n m * ] }
[
n m + 2 /i dup even? [ 1 - ] when :> k
n k k 2 + m [ part-product ] 2bi@ *
]
} cond ;
:: factorial-loop ( n p r -- p' r' )
n 2 > [
n 2 /i :> mid
mid p r factorial-loop
[
mid 1 + mid 1 bitand +
n 1 - n 1 bitand + part-product *
] [ dupd * ] bi*
] [ p r ] if ;
: fast-factorial ( x -- n )
[ 1 1 factorial-loop nip ] [ dup bit-count - ] bi shift ;
Calculating the 50,000 factorial now only takes 0.171 seconds!
The code for this is on my GitHub.
Note: there are other algorithms that are likely faster, but for this purpose I wanted to implement a similar algorithm to Python. Their implementation has a few modifications including a tight loop for multiplications that fit into an unsigned long and is over 150 lines of code.