Faster Leap Year?
Wednesday, May 21, 2025
Last week, Falk Hüffner wrote about making a leap year check in three instructions:
With the following code, we can check whether a year 0 ≤ y ≤ 102499 is a leap year with only about 3 CPU instructions:
bool is_leap_year_fast(uint32_t y) { return ((y * 1073750999) & 3221352463) <= 126976; }
How does this work? The answer is surprisingly complex. This article explains it, mostly to have some fun with bit-twiddling; at the end, I’ll briefly discuss the practical use.
This is how a leap year check is typically implemented:
bool is_leap_year(uint32_t y) { if ((y % 4) != 0) return false; if ((y % 100) != 0) return true; if ((y % 400) == 0) return true; return false; }
It would be fun to see how that works in Factor and compare the relative performance between a simple version and the new super-fast-highly-optimized 3 instruction version. To do that, we can use the benchmark word to record execution time by calling it repeatedly and returning an average time-per-call in nanoseconds:
TYPED: average-benchmark ( n: fixnum quot -- nanos-per-call )
over [ '[ _ _ times ] benchmark ] dip /f ; inline
Note: We are forcing the iteration loop above to be fixnum
to reduce its
overhead, and due to the design of the benchmark words below, are going to have
code blocks with predictable inputs. Testing your program with random inputs is
also important to see the impact of CPU optimizations such as cache and branch
predictions, or across multiple CPU architectures. Performance is also impacted
by use of code generation features such as
inline and
compiler steps such as dead-code elimination. Benchmarking is hard.
Simple implementation
The simple – and typical – implementation can be easily written as:
: leap-year? ( year -- ? )
dup 100 divisor? 400 4 ? divisor? ;
And in fact, that’s how it is implemented in the standard library.
We can write a quick benchmarking word. This ensures we are using the optimizing compiler and also asserts that the result of the word is as expected:
: bench-leap-year ( n year ? -- nanos )
'[ _ leap-year? _ assert= ] average-benchmark ;
And then call it one hundred million times, to see how long it takes each call on average:
IN: scratchpad 100,000,000 2028 t bench-leap-year .
10.53904317
Just under 11 nanoseconds, including the loop and the assert…
Fast implementation
The fast implementation suggested by Falk can be written directly as:
: fast-leap-year? ( year -- ? )
1073750999 * 3221352463 bitand 126976 <= ;
And then write a benchmarking word:
: bench-fast-leap-year ( n year ? -- nanos )
'[ _ fast-leap-year? _ assert= ] average-benchmark ;
And see how long it takes:
IN: scratchpad 100,000,000 2028 t bench-fast-leap-year .
4.74783302
Just under 5 nanoseconds…
Faster implementation
Well, generally Factor supports arbitrarily large
integers by
allowing integers to implicitly promote from word-sized fixnum
to
overflow into bignum
. And, as they say, you can write the C programming
language in any
language.
A faster implementation might check the input is a fixnum
and then force
math without overflow:
TYPED: faster-leap-year? ( year: fixnum -- ? )
1073750999 fixnum*fast 3221352463 fixnum-bitand 126976 fixnum<= ;
And write a benchmark word:
: bench-faster-leap-year ( n year ? -- nanos )
'[ _ faster-leap-year? _ assert= ] average-benchmark ;
It’s a bit faster:
IN: scratchpad 100,000,000 2028 t bench-faster-leap-year .
3.24267145
Just under 4 nanoseconds…
Fastest implementation
But, to make sure that we take advantage of the least amount of instructions
possible, we can make it slightly-less-safe by declaring the input to be a
fixnum
to avoid the run-time type checks. This could cause issues if it is
called with other types on the stack.
: fastest-leap-year? ( year -- ? )
{ fixnum } declare
1073750999 fixnum*fast 3221352463 fixnum-bitand 126976 fixnum<= ;
And write a benchmark word:
: bench-fastest-leap-year ( n year ? -- nanos )
'[ _ fastest-leap-year? _ assert= ] average-benchmark ;
And then you can see it gets quite fast indeed:
IN: scratchpad 100,000,000 2028 t bench-fastest-leap-year .
2.82150476
Just under 3 nanoseconds!
But, is it also just 3 instructions?
IN: scratchpad \ fastest-leap-year? disassemble
000075f0afa19490: 89056a5bd1fe mov [rip-0x12ea496], eax
000075f0afa19496: 498b06 mov rax, [r14]
000075f0afa19499: 48c1f804 sar rax, 0x4
000075f0afa1949d: 4869c0d7230040 imul rax, rax, 0x400023d7
000075f0afa194a4: bb0ff001c0 mov ebx, 0xc001f00f
000075f0afa194a9: 4821d8 and rax, rbx
000075f0afa194ac: 4881f800f00100 cmp rax, 0x1f000
000075f0afa194b3: b801000000 mov eax, 0x1
000075f0afa194b8: 48bb5c0e388cf0750000 mov rbx, 0x75f08c380e5c
000075f0afa194c2: 480f4ec3 cmovle rax, rbx
000075f0afa194c6: 498906 mov [r14], rax
000075f0afa194c9: 8905315bd1fe mov [rip-0x12ea4cf], eax
000075f0afa194cf: c3 ret
Pretty close!
There is an extra instruction near the beginning to untag our fixnum
input. Due to the convention around handling booleans in
Factor, there
are a couple of extra instructions at the end for converting the result into
a return value of either t
or f
.
And it could get even faster if either the assert=
was removed, or the code
was made inline
so the function prologue and
epilogue could
be elided into the outer scope.
So much fun.