Re: Factor

Factor: the language, the theory, and the practice.

One Billion Loops

Wednesday, July 16, 2025

#performance

The Primeagen published a great video, Language Performance Comparisons Are Junk, about six months ago discussing the 1 Billion nested loop iterations benchmark that Benjamin Dicken wrote. You can find a copy of the benchmark code on GitHub.

The loops benchmark can be summarized by a version in Python:

import sys
import random

u = int(sys.argv[1])          # Get an input number from the command line
r = random.randint(0, 10000)  # Get a random number 0 <= r < 10k
a = [0] * 10000               # Array of 10k elements initialized to 0
for i in range(10000):        # 10k outer loop iterations
    for j in range(100000):   # 100k inner loop iterations, per outer loop iteration
        a[i] += j % u         # Simple sum
    a[i] += r                 # Add a random value to each element in array
print(a[r])                   # Print out a single element from the array

As a benchmark, it has some flaws, but it was fun to iterate on, and I would like to compare Factor’s runtime as a percentage of Zig’s – the fastest solution – and find out if Factor is faster than Zig! again.

The Zig version takes about 1.3 seconds on the computer I used for benchmarking:

$ git clone https://github.com/bddicken/languages.git

$ cd languages/loops/zig/

$ zig version
0.14.1

$ zig build-exe -O ReleaseFast code.zig

$ time ./code 100
4958365

real    0m1.292s
user    0m1.288s
sys     0m0.002s

Benjamin has some thoughts about using test runners to eliminate startup time and arrive at stable benchmark times. I’m going to ignore this for the purpose of this blog post, but it might be worth revisiting at some point to see what the steady-state performance of Factor is.
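
As a rough sketch of what measuring that might look like, the tools.time vocabulary can time a quotation inside a running Factor instance, which avoids counting image startup (this assumes one of the loops-benchmark versions below has already been loaded):

USE: tools.time

! Time the word inside the running image, excluding startup cost.
[ 100 loops-benchmark ] time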

Version 1

Our first implementation will be the simplest direct copy of the Python version:

:: loops-benchmark ( u -- )
    10,000 random :> r
    10,000 0 <array> :> a
    10,000 [| i |
        100,000 [| j |
            i a [ j u mod + ] change-nth
        ] each-integer
        i a [ r + ] change-nth
    ] each-integer r a nth . ;

This takes 4.7 seconds or about 3.6x of Zig.

Version 2

By default, change-nth performs bounds checking, but we can notice that our indices are always within bounds, so we can use the unsafe version instead:

:: loops-benchmark ( u -- )
    10,000 random :> r
    10,000 0 <array> :> a
    10,000 [| i |
        100,000 [| j |
-            i a [ j u mod + ] change-nth
+            i a [ j u mod + ] change-nth-unsafe
        ] each-integer
-        i a [ r + ] change-nth
+        i a [ r + ] change-nth-unsafe
-    ] each-integer r a nth . ;
+    ] each-integer r a nth-unsafe . ;

This does not really speed the benchmark up, taking 4.7 seconds.

Version 3

We can improve our math dispatch by specifying that the argument is an integer:

-:: loops-benchmark ( u -- )
+TYPED:: loops-benchmark ( u: integer -- )
    10,000 random :> r
    10,000 0 <array> :> a
    10,000 [| i |
        100,000 [| j |
            i a [ j u mod + ] change-nth-unsafe
        ] each-integer
        i a [ r + ] change-nth-unsafe
    ] each-integer r a nth-unsafe . ;

This takes 4.5 seconds or about 3.5x of Zig.

Version 4

The Zig version operates on 32-bit unsigned ints, so we could enforce using fixnum integers:

-TYPED:: loops-benchmark ( u: integer -- )
+TYPED:: loops-benchmark ( u: fixnum -- )
-    10,000 random :> r
+    10,000 random { fixnum } declare :> r
    10,000 0 <array> :> a
    10,000 [| i |
        100,000 [| j |
-            i a [ j u mod + ] change-nth-unsafe
+            i a [ j u mod fixnum+fast ] change-nth-unsafe
        ] each-integer
-        i a [ r + ] change-nth-unsafe
+        i a [ r fixnum+fast ] change-nth-unsafe
    ] each-integer r a nth-unsafe . ;

This takes 3.5 seconds or about 2.7x of Zig.

Version 5

We could operate on those same 32-bit unsigned ints using specialized-arrays:

+SPECIALIZED-ARRAY: uint32_t

TYPED:: loops-benchmark ( u: fixnum -- )
    10,000 random { fixnum } declare :> r
-    10,000 0 <array> :> a
+    10,000 uint32_t <c-array> :> a
    10,000 [| i |
        100,000 [| j |
            i a [ j u mod fixnum+fast ] change-nth-unsafe
        ] each-integer
        i a [ r fixnum+fast ] change-nth-unsafe
    ] each-integer r a nth-unsafe . ;

This takes 3.1 seconds or about 2.4x of Zig.

Version 6

We could notice that the value added to each array element can be computed first and then added in a single step:

TYPED:: loops-benchmark ( u: fixnum -- )
    10,000 random { fixnum } declare :> r
    10,000 uint32_t <c-array> :> a
    10,000 [| i |
-        100,000 [| j |
-            i a [ j u mod fixnum+fast ] change-nth-unsafe
-        ] each-integer
+        100,000 <iota> [ u mod ] map-sum :> v
+        i a [ v + ] change-nth-unsafe
        i a [ r fixnum+fast ] change-nth-unsafe
    ] each-integer r a nth-unsafe . ;

This takes 2.4 seconds or about 1.8x of Zig.

Version 7

While the previous version did the same number of loop iterations, it made fewer modifications to the array memory. It turns out that not only can the value added to each array element be computed first, it can be computed outside the loop entirely. And once you do that, you’ll notice that every element of the array gets the same value, so you don’t need to compute the array at all. This is effectively a compiler optimization that the compiler isn’t doing; after writing a little proof in your head, the whole benchmark reduces down to:

:: loops-benchmark ( u -- )
    10,000 random 100,000 <iota> [ u mod ] map-sum + . ;
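
As a quick sanity check of that reduction, using u = 100 as in the Zig run above: j mod 100 cycles through 0..99 exactly 1,000 times over the 100,000 inner iterations, each cycle sums to 4,950, so the map-sum is 4,950,000. The Zig run printed 4,958,365, which suggests its random r happened to be 8,365, and the one-liner prints that same sum-plus-r value.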

This takes 0.0003 seconds, which is now about 4300x faster than Zig.

Conclusion

I’ve always thought of Factor as able to run within about 2x-4x of C’s performance with reasonable-looking generic and dynamic code. This depends somewhat on which benchmark is being considered, and on occasion we can get as fast as C.

We can easily profile code using the sampling profiler, visualize profiles using the flamegraph vocabulary, print optimized output from the compiler, as well as disassemble words to investigate the actual machine code that our optimizing compiler generates.
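
For example (a hedged sketch, assuming one of the loops-benchmark versions above has been compiled in a running listener), the latter two look something like:

USING: compiler.tree.debugger tools.disassembler ;

\ loops-benchmark optimized.   ! print the compiler's optimized output for the word
\ loops-benchmark disassemble  ! show the machine code the compiler generated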

In addition to all the open issues about performance, we have an open issue to improve fixnum iteration that would likely help this benchmark, and that I hope someday gets resolved. There are also likely many other improvements we could make to our use of tagged fixnum and integer unions, or to generic dispatch, to improve the un-typed arithmetic in the examples above.

Some interesting results on the relative value of different Factor optimizations!