Re: Factor

Factor: the language, the theory, and the practice.

Faster Leap Year?

Wednesday, May 21, 2025

#assembly #performance

Last week, Falk Hüffner wrote about making a leap year check in three instructions:

With the following code, we can check whether a year 0 ≤ y ≤ 102499 is a leap year with only about 3 CPU instructions:

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

How does this work? The answer is surprisingly complex. This article explains it, mostly to have some fun with bit-twiddling; at the end, I’ll briefly discuss the practical use.

This is how a leap year check is typically implemented:

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

It would be fun to see how that works in Factor and compare the relative performance between a simple version and the new super-fast-highly-optimized 3 instruction version. To do that, we can use the benchmark word to record execution time by calling it repeatedly and returning an average time-per-call in nanoseconds:

TYPED: average-benchmark ( n: fixnum quot -- nanos-per-call )
    over [ '[ _ _ times ] benchmark ] dip /f ; inline

Note: We are forcing the iteration loop above to be fixnum to reduce its overhead, and due to the design of the benchmark words below, are going to have code blocks with predictable inputs. Testing your program with random inputs is also important to see the impact of CPU optimizations such as cache and branch predictions, or across multiple CPU architectures. Performance is also impacted by use of code generation features such as inline and compiler steps such as dead-code elimination. Benchmarking is hard.

Simple implementation

The simple – and typical – implementation can be easily written as:

: leap-year? ( year -- ? )
    dup 100 divisor? 400 4 ? divisor? ;

And in fact, that’s how it is implemented in the standard library.

We can write a quick benchmarking word. This ensures we are using the optimizing compiler and also asserts that the result of the word is as expected:

: bench-leap-year ( n year ? -- nanos )
     '[ _ leap-year? _ assert= ] average-benchmark ;

And then call it one hundred million times, to see how long it takes each call on average:

IN: scratchpad 100,000,000 2028 t bench-leap-year .
10.53904317

Just under 11 nanoseconds, including the loop and the assert…

Fast implementation

The fast implementation suggested by Falk can be written directly as:

: fast-leap-year? ( year -- ? )
    1073750999 * 3221352463 bitand 126976 <= ;

And then write a benchmarking word:

: bench-fast-leap-year ( n year ? -- nanos )
     '[ _ fast-leap-year? _ assert= ] average-benchmark ;

And see how long it takes:

IN: scratchpad 100,000,000 2028 t bench-fast-leap-year .
4.74783302

Just under 5 nanoseconds…

Faster implementation

Well, generally Factor supports arbitrarily large integers by allowing integers to implicitly promote from word-sized fixnum to overflow into bignum. And, as they say, you can write the C programming language in any language.

A faster implementation might check the input is a fixnum and then force math without overflow:

TYPED: faster-leap-year? ( year: fixnum -- ? )
    1073750999 fixnum*fast 3221352463 fixnum-bitand 126976 fixnum<= ;

And write a benchmark word:

: bench-faster-leap-year ( n year ? -- nanos )
     '[ _ faster-leap-year? _ assert= ] average-benchmark ;

It’s a bit faster:

IN: scratchpad 100,000,000 2028 t bench-faster-leap-year .
3.24267145

Just under 4 nanoseconds…

Fastest implementation

But, to make sure that we take advantage of the least amount of instructions possible, we can make it slightly-less-safe by declaring the input to be a fixnum to avoid the run-time type checks. This could cause issues if it is called with other types on the stack.

: fastest-leap-year? ( year -- ? )
    { fixnum } declare
    1073750999 fixnum*fast 3221352463 fixnum-bitand 126976 fixnum<= ;

And write a benchmark word:

: bench-fastest-leap-year ( n year ? -- nanos )
     '[ _ fastest-leap-year? _ assert= ] average-benchmark ;

And then you can see it gets quite fast indeed:

IN: scratchpad 100,000,000 2028 t bench-fastest-leap-year .
2.82150476

Just under 3 nanoseconds!

But, is it also just 3 instructions?

IN: scratchpad \ fastest-leap-year? disassemble
000075f0afa19490: 89056a5bd1fe          mov [rip-0x12ea496], eax
000075f0afa19496: 498b06                mov rax, [r14]
000075f0afa19499: 48c1f804              sar rax, 0x4
000075f0afa1949d: 4869c0d7230040        imul rax, rax, 0x400023d7
000075f0afa194a4: bb0ff001c0            mov ebx, 0xc001f00f
000075f0afa194a9: 4821d8                and rax, rbx
000075f0afa194ac: 4881f800f00100        cmp rax, 0x1f000
000075f0afa194b3: b801000000            mov eax, 0x1
000075f0afa194b8: 48bb5c0e388cf0750000  mov rbx, 0x75f08c380e5c
000075f0afa194c2: 480f4ec3              cmovle rax, rbx
000075f0afa194c6: 498906                mov [r14], rax
000075f0afa194c9: 8905315bd1fe          mov [rip-0x12ea4cf], eax
000075f0afa194cf: c3                    ret

Pretty close!

There is an extra instruction near the beginning to untag our fixnum input. Due to the convention around handling booleans in Factor, there are a couple of extra instructions at the end for converting the result into a return value of either t or f.

And it could get even faster if either the assert= was removed, or the code was made inline so the function prologue and epilogue could be elided into the outer scope.

So much fun.