Substrings
Sunday, July 10, 2011
One year ago, I wrote about some new combinatoric functions that I wrote. I had a recent need for a couple of “substring” functions. Since I couldn’t find them in Factor’s standard library, I thought I would contribute these:
all-subseqs
Our first word takes a sequence, and then returns all (consecutive) subsequences that can be found. In Factor, the grouping vocabulary provides a clumping feature that splits a sequence into overlapping, fixed-length, subsequences. We can use this to find clumps of every possible length to solve this problem:
USING: grouping kernel ranges sequences ;
: all-subseqs ( seq -- seqs )
dup length [1..b] [ <clumps> ] with map concat ;
You can see how this works:
IN: scratchpad "abcd" all-subseqs .
{ "a" "b" "c" "d" "ab" "bc" "cd" "abc" "bcd" "abcd" }
Note: we specifically don’t include the “empty” string in the results.
longest-subseq
Several algorithms and implementations can be found for the longest common substring problem. Basically, we want a word that returns the longest (consecutive) substring that is common between two strings.
Using locals, I translated the Python solution:
USING: arrays kernel locals math ranges sequences ;
:: longest-subseq ( seq1 seq2 -- subseq )
seq1 length :> len1
seq2 length :> len2
0 :> n!
0 :> end!
len1 1 + [ len2 1 + 0 <array> ] replicate :> table
len1 [1..b] [| x |
len2 [1..b] [| y |
x 1 - seq1 nth
y 1 - seq2 nth = [
y 1 - x 1 - table nth nth 1 + :> len
len y x table nth set-nth
len n > [ len n! x end! ] when
] [ 0 y x table nth set-nth ] if
] each
] each end n - end seq1 subseq ;
Below, you can see how it works:
IN: scratchpad "abc" "def" longest-subseq .
""
IN: scratchpad "abcd" "abcde" longest-subseq .
"abcd"
Note: don’t confuse this with the longest common subsequence problem (see substring vs. subsequence for more details), which is implemented in Factor by the lcs vocabulary.