Finding Subsequences

Tuesday, November 12, 2024

Recently, I’ve been inspired by conversations taking place on our Factor Discord server. This sometimes reflects areas of interest from new contributors, curiousity exploring similarities and differences between Factor and other programming languages, or even early learning moments when exploring concatenative languages in general.

Today, someone asked about how to think about “accumulation of values in an array… to find all occurences (their position) of a subseq in a seq”. The solution to this might have this word name and stack effect:

: subseq-indices ( seq subseq -- indices ) ... ;

Before answering, I wanted to make sure they wanted to find overlapping indices vs. non-overlapping indices, and they clarified that they expect it to find this result – allowing overlapping subsequences:

IN: scratchpad "abcabcabc" "abcabc" subseq-indices .
{ 0 3 }

So, now that we have a reasonable specification, how do we think about solving this problem when we are at the same time learning to solve problems in stack languages and trying to see what features of Factor’s standard library would help.

There are a lot of ways to think about this, and I often recommend one of three approaches:

starting inside-out (working on the inner part of the loop)
starting outside-in (modeling the outer loop and then figuring out what’s inside it)
using local variables (helpful when coming from an applicative language background)

So let’s look at each approach in turn:

Inside Out

The inner logic is going to require something like “take an index to start from and find the next matching subseq index”, which looks an awful lot like subseq-index-from – except you also want to increment the found index afterwards to make sure you are progressing through the sequence.

: next-subseq-index ( index seq subseq -- next-index/f found-index/f )
    subseq-index-from [ [ 1 + ] keep ] [ f f ] if* ;

Then you could use it like so in a loop with an accumulator:

: subseq-indices ( seq subseq -- indices )
    [ V{ } clone 0 ] 2dip '[
        _ _ next-subseq-index dup [ [ pick push ] keep ] when
    ] loop drop ;

But that feels like we had to work hard to do that, directly using an accumulator, conditionals, and some stack shuffling. Luckily we have some higher level words that might help, for example the make vocabulary which has an implicit accumulator that we can use , or % to push into:

: subseq-indices ( seq subseq -- indices )
    [ 0 ] 2dip '[
        [ _ _ next-subseq-index dup [ , ] when* ] loop
    ] { } make nip ;

Or even using a while* loop, which is less code:

: subseq-indices ( seq subseq -- indices )
    [ 0 ] 2dip '[
        [ _ _ next-subseq-index ] [ , ] while*
    ] { } make nip ;

But that feels like a lot too, simpler might be produce:

: subseq-indices ( seq subseq -- indices )
    [ 0 ] 2dip '[ _ _ next-subseq-index dup ] [ ] produce 2nip ;

Or using follow, adjusting our start index and increment:

: subseq-indices ( seq subseq -- indices )
    [ -1 ] 2dip '[ 1 + _ _ subseq-index-from ] follow rest ;

Outside In

The outer logic approach would be something like “we need to loop from the start of the sequence, finding the next match, and accumulating it, until we hit some exit condition and then return a result” which you could write in a kind of non-functional stack pseudocode:

: subseq-indices ( seq subseq -- indices )
    0 [ find-next-match ] [ accumulate-match ] while ;

Then you have to kind of figure out what goes into those blocks:

: find-next-match ( seq subseq n -- found-index/f )
    -rot subseq-index-from ;

And also something like:

: accumulate-match ( accum found-index -- accum next-index )
    [ suffix! ] keep 1 + ;

Taking those, and maybe thinking about what items should be on the stack and in what order to reduce stack shuffling, becomes something like:

: subseq-indices ( seq subseq -- indices )
    [ V{ } clone 0 ] 2dip
    '[ _ _ subseq-index-from ] [ [ suffix! ] keep 1 + ] while* ;

It is true that [ suffix! ] keep 1 + is also [ suffix! ] [ 1 + ] bi, with varying aesthetics and ease of understanding, but sometimes when learning a new language especially a stack language with combinators, it is sometimes easy to start with stack shuffling and then learn about these forms later to see if they can improve your code.

Locals

Instead of those two stack approaches, we could instead use our local variables and write one big word in a manner similar to applicative languages, stepping back and focusing on the result we want:

:: subseq-indices ( seq subseq -- indices )
    V{ } clone :> accum
    0 :> i!

    [ i seq subseq subseq-index-from ]
    [ dup accum push 1 + i! ] while*

    accum ;

When working on this stuff, it’s nice to remember you can put a B to set a breakpoint in places to examine the stack at some inner point, or perhaps write a comment showing the incoming stack and optionally the outgoing stack that a piece of code is expected to have so that you understand what is happening in the next few lines:

! the next block of code finds the next index
! ( index seq subseq -- found-index )

! and pushes it into an accumulator
! ( accum found-index -- accum )

This was added to the developer branch in the sequences.extras vocabulary.

We love to hear questions and it’s even better when we can provide answers or guidance for learning and solving problems. Feel free to join our conversations and explore learning Factor!