Re: Factor

Factor: the language, the theory, and the practice.

Jaro-Winkler

Saturday, June 21, 2025

#math

Jaro-Winkler distance is a measure of string similarity and edit distance between two sequences:

The higher the Jaro–Winkler distance for two strings is, the less similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The original paper actually defined the metric in terms of similarity, so the distance is defined as the inversion of that value (distance = 1 − similarity).

There are actually two different concepts – and RosettaCode tasks – implied by this algorithm:

  1. Jaro similarity and Jaro distance
  2. Jaro-Winkler similarity and Jaro-Winkler distance.

Let’s build an implementation of these in Factor!

Jaro Similarity

The base that all of these are built upon is the Jaro similarity. It is calculated as a score by measuring the number of matches (m) between the strings, counting the number of transpositions divided by 2 (t), and then returning a weighted score using the formula using the lengths of each sequence (|s1| and |s2|):

In particular, it considers a matching character to be one that is found in the other string within a match distance away, calculated by the formula:

There are multiple ways to go about this, with varying performance, but I decided one longer function was simpler to understand than breaking out the steps into their own words. We use a bit-array to efficiently track which characters have been matched already as we iterate:

:: jaro-similarity ( s1 s2 -- n )
    s1 s2 [ length ] bi@       :> ( len1 len2 )
    len1 len2 max 2/ 1 [-]     :> delta
    len2 <bit-array>           :> flags

    s1 [| ch i |
        i delta [-]            :> from
        i delta + 1 + len2 min :> to

        from to [| j |
            j flags nth [ f ] [
                ch j s2 nth = dup j flags set-nth
            ] if
        ] find-integer-from
    ] filter-index

    [ 0 ] [
        [ length ] keep s2 flags [ nip ] 2filter [ = not ] 2count
        :> ( #matches #transpositions )

        #matches len1 /f #matches len2 /f +
        #matches #transpositions 2/ - #matches /f + 3 /
    ] if-empty ;

The Jaro distance is then just a subtraction:

: jaro-distance ( s1 s2 -- n )
    jaro-similarity 1.0 swap - ;

I’m curious if anyone else has a simpler implementation – please share!

Jaro-Winkler Similarity

The Jaro-Winkler similarity builds upon this by factoring in the length of the common prefix (l) times a constant scaling factor (p) that is usually set to 0.1 in most implementations I’ve seen:

We can implement this by calcuting the Jaro similarity and then computing the common prefix and then generating the result:

:: jaro-winkler-similarity ( s1 s2 -- n )
    s1 s2 jaro-similarity :> jaro
    s1 s2 min-length 4 min :> len
    s1 s2 [ len head-slice ] bi@ [ = ] 2count :> #common
    1 jaro - #common 0.1 * * jaro + ;

The Jaro-Winkler distance is again just a subtraction:

: jaro-winkler-distance ( a b -- n )
    jaro-winkler-similarity 1.0 swap - ;

Try it out

The Wikipedia article compares the similarity of FARMVILLE and FAREMVIEL:

IN: scratchpad "FARMVILLE" "FAREMVIEL" jaro-similarity .
0.8842592592592592

We can also see that the algorithm considers the transposition of two close characters to be less of a penalty than the transposition of two characters farther away from each other. It also penalizes additions and substitutions of characters that cannot be expressed as transpositions.

IN: scratchpad "My string" "My tsring" jaro-winkler-similarity .
0.9740740740740741

IN: scratchpad "My string" "My ntrisg" jaro-winkler-similarity .
0.8962962962962963

We can compare the rough performance of Julia using the same algorithm:

julia> using Random

julia> s = randstring(10_000)

julia> t = randstring(10_000)

julia> @time jarowinklerdistance(s, t)
  1.492011 seconds (108.32 M allocations: 2.178 GiB, 1.87% gc time)
0.19016926812348256

Note: I’m not a Julia developer, I just play one on TV. I adapted this implementation in Julia, which originally took over 4.5 seconds. A better developer could probably improve it quite a bit. In fact, it was pointed out that we are indexing UTF-8 String in a loop, and should instead collect the Char into a Vector first. That does indeed make it super fast.

To the implementation in Factor that we built above, which runs quite a bit faster:

IN: scratchpad USE: random.data

IN: scratchpad 10,000 random-string
               10,000 random-string
               gc [ jaro-winkler-distance ] time .
Running time: 0.259643166 seconds

0.1952856823031448

Thats not bad for a first version that uses safe indexing with unnecessary bounds-checking, generic iteration on integers when usually the indices are fixnum (something I hope to fix someday automatically), and should probably order the input sequences by length for consistency.

If we fix those problems, it gets even faster:

IN: scratchpad USE: random.data

IN: scratchpad 10,000 random-string
               10,000 random-string
               gc [ jaro-winkler-distance ] time .
Running time: 0.068086625 seconds

0.19297898770334765

This is available in the development version in the math.similarity and the math.distances vocabularies.