Re: Factor

Factor: the language, the theory, and the practice.

Next Combination

Friday, August 25, 2023

#math #performance

Eleven years ago, I blogged about computing the next permutation of a sequence.

As part of some performance improvements to the math.combinatorics vocabulary that I was making around the same time, I needed a way of calculating the indices of the next combination of a given sequence.

Implementation

Without going into a great explanation, we split the problem into a few steps, operating on a sequence of indices and modifying in place – either incrementing the indices after the “max index” or the last index.

: find-max-index ( seq n -- i )
    over length - '[ _ + >= ] find-index drop ; inline

: increment-rest ( i seq -- )
    [ nth-unsafe ] [ swap index-to-tail <slice-unsafe> ] 2bi
    [ drop 1 + dup ] map! 2drop ; inline

: increment-last ( seq -- )
    [ index-of-last [ 1 + ] change-nth-unsafe ] unless-empty ; inline

:: next-combination ( seq n -- )
    seq n find-max-index [
        1 [-] seq increment-rest
    ] [
        seq increment-last
    ] if* ; inline

We can show how it works by using clone to see the intermediate results of “5 choose 3”:

IN: scratchpad { 0 1 2 } 10 [
                   [ clone 5 next-combination ] keep
               ] replicate nip .
{
    { 0 1 2 }
    { 0 1 3 }
    { 0 1 4 }
    { 0 2 3 }
    { 0 2 4 } 
    { 0 3 4 }
    { 1 2 3 }
    { 1 2 4 }
    { 1 3 4 }
    { 2 3 4 }
}

This is used to implement our various combinations words: each-combination, map-combinations, filter-combinations, all-combinations, find-combination, reduce-combinations, etc.

Benchmarking

The benchmark I have chosen is one that intentionally creates a large array of almost 65 million items with the result of allocating all combinations of 200 items choosing 4 items at a time.

We can run this benchmark on an Apple Mac mini (2018) with a 3.2 GHz 6-Core Intel Core i7 processor and 16 GB of memory that is used as part of the Factor nightly build farm.

Factor 0.99 takes 8.3 seconds:

IN: scratchpad [ 200 <iota> 4 all-combinations length ] time .
Running time: 8.269610590999999 seconds

Additional information was collected.
dispatch-stats.  - Print method dispatch statistics
gc-events.       - Print all garbage collection events
gc-stats.        - Print breakdown of different garbage collection events
gc-summary.      - Print aggregate garbage collection statistics
64684950

Look at the gc-summary. to see that more than half the time (60% or 4.9 seconds) is involved in garbage collection as part of allocating the almost 65 million results:

IN: scratchpad gc-summary.
Collections:          1,480
Cards scanned:        7,070,518
Decks scanned:        9,124
Code blocks scanned:  2
Total time:           4,910,722 µs
Card scan time:       3,256,711 µs
Code block scan time: 162 µs
Marking time:         0 µs
Data heap sweep time: 0 µs
Code heap sweep time: 0 µs
Data compaction time: 0 µs

Other Languages

On that same machine we can try some other programming languages and see how they compare, attempting to get the same behavior as the Factor example above.

Python 3.11.4 takes 8.4 seconds:

In [1]: from itertools import combinations

In [2]: %time print(len(list(combinations(range(200), 4))))
64684950
CPU times: user 5.64 s, sys: 2.77 s, total: 8.41 s
Wall time: 8.44 s

PyPy 7.3.12 (compatible with Python 3.10) takes 17 seconds:

>>>> from itertools import combinations
>>>> from time import time
>>>> t0 = time(); print(len(list(combinations(range(200), 4)))); time() - t0
64684950
17.00031805038452

Ruby 3.2.2 takes 37.4 seconds:

irb(main):001:0> measure
TIME is added.
=> nil

irb(main):002:0> (0...200).to_a.combination(4).to_a.length
processing time: 37.413635s
=> 64684950

Julia 1.9.2 takes 10.7 seconds:

julia> ]add Combinatorics

julia> using Combinatorics

julia> @time length(collect(combinations(range(1, 200), 4)))
 10.668982 seconds (129.37 M allocations: 8.193 GiB, 37.74% gc time)
64684950

Crystal 1.9.2 takes 515 seconds (interpreted) or 13.7 seconds (compiled):

# Run crystal interpreted
$ crystal i
icr:1> t0 = Time.utc;
       puts (0...200).to_a.combinations(4).to_a.size;
       Time.utc - t0
64684950 => 00:08:34.728442000

# Make a crystal source file
$ echo "t0 = Time.utc
puts (0...200).to_a.combinations(4).to_a.size
puts (Time.utc - t0)" > combo.cr

# Run crystal compiled
$ crystal combo.cr
64684950
00:00:13.739838000

Clojure 1.11.1 takes 10.7 seconds:

user=> (ns user (:use [clojure.math.combinatorics]))

user=> (time (count (combinations (range 0 200) 4)))
"Elapsed time: 10685.687736 msecs"
64684950

Rakudo v2023.08 takes 808.9 seconds:

$ time raku -e "print (1..200).combinations(4).Array.elems"
759.45s user 44.32s system 99% cpu 13:28.86 total
64684950

Considering that we only take the length of the large array of combinations, it is a bit artificial as a benchmark, with room for various optimizations, but it does seem to highlight both algorithmic and garbage collector issues in various languages.

Factor compares pretty favorably!