Re: Factor

Factor: the language, the theory, and the practice.

Substrings

Sunday, July 10, 2011

#math #text

One year ago, I wrote about some new combinatoric functions that I wrote. I had a recent need for a couple of “substring” functions. Since I couldn’t find them in Factor’s standard library, I thought I would contribute these:

all-subseqs

Our first word takes a sequence, and then returns all (consecutive) subsequences that can be found. In Factor, the grouping vocabulary provides a clumping feature that splits a sequence into overlapping, fixed-length, subsequences. We can use this to find clumps of every possible length to solve this problem:

USING: grouping kernel ranges sequences ;

: all-subseqs ( seq -- seqs )
    dup length [1..b] [ <clumps> ] with map concat ;

You can see how this works:

IN: scratchpad "abcd" all-subseqs .
{ "a" "b" "c" "d" "ab" "bc" "cd" "abc" "bcd" "abcd" }

Note: we specifically don’t include the “empty” string in the results.

longest-subseq

Several algorithms and implementations can be found for the longest common substring problem. Basically, we want a word that returns the longest (consecutive) substring that is common between two strings.

Using locals, I translated the Python solution:

USING: arrays kernel locals math ranges sequences ;

:: longest-subseq ( seq1 seq2 -- subseq )
    seq1 length :> len1
    seq2 length :> len2
    0 :> n!
    0 :> end!
    len1 1 + [ len2 1 + 0 <array> ] replicate :> table
    len1 [1..b] [| x |
        len2 [1..b] [| y |
            x 1 - seq1 nth
            y 1 - seq2 nth = [
                y 1 - x 1 - table nth nth 1 + :> len
                len y x table nth set-nth
                len n > [ len n! x end! ] when
            ] [ 0 y x table nth set-nth ] if
        ] each
    ] each end n - end seq1 subseq ;

Below, you can see how it works:

IN: scratchpad "abc" "def" longest-subseq .
""

IN: scratchpad "abcd" "abcde" longest-subseq .
"abcd"

Note: don’t confuse this with the longest common subsequence problem (see substring vs. subsequence for more details), which is implemented in Factor by the lcs vocabulary.