Group By

Friday, April 8, 2011

When dealing with sequences, it can be useful to “group” their elements into smaller sequences based on some criterion. I couldn’t find a word in Factor that quite solved my problem, so I wrote group-by:

USING: assocs kernel sequences ;

: group-by ( seq quot: ( elt -- key ) -- assoc )
    H{ } clone [
        [ push-at ] curry compose [ dup ] prepose each
    ] keep ; inline

Examples

For example, we could use it to split the first 20 non-negative integers (0 through 19) into two groups based on whether they are prime or not:

IN: scratchpad USE: math.primes

IN: scratchpad 20 <iota> [ prime? ] group-by .
H{
    { f V{ 0 1 4 6 8 9 10 12 14 15 16 18 } }
    { t V{ 2 3 5 7 11 13 17 19 } }
}

Or, we could group the subsets of a string by their length:

IN: scratchpad USE: math.combinatorics

IN: scratchpad "abc" all-subsets [ length ] group-by .
H{
    { 0 V{ "" } }
    { 1 V{ "a" "b" "c" } }
    { 2 V{ "ab" "ac" "bc" } }
    { 3 V{ "abc" } }
}

Or, group some random numbers by their bit-count:

IN: scratchpad USE: math.bitwise

IN: scratchpad 20 [ 100 random ] replicate
              [ bit-count ] group-by .
H{
    { 1 V{ 32 1 32 4 } }
    { 2 V{ 12 } }
    { 3 V{ 13 74 56 98 44 } }
    { 4 V{ 83 30 45 46 75 77 } }
    { 5 V{ 61 59 61 } }
    { 6 V{ 63 } }
}

How it works

The Factor code is roughly equivalent to the following Python code:

from collections import defaultdict

def group_by(seq, f):
    d = defaultdict(list)
    for value in seq:
        key = f(value)
        d[key].append(value)
    return d
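
As a quick check, the Python version can be run against the same input as the first Factor example. This is only a sketch: is_prime is a hypothetical trial-division helper standing in for Factor's prime? word, and range(20) stands in for 20 <iota>.

```python
from collections import defaultdict

def group_by(seq, f):
    d = defaultdict(list)
    for value in seq:
        d[f(value)].append(value)
    return d

def is_prime(n):
    # Hypothetical helper: simple trial division, fine for small n.
    if n < 2:
        return False
    return all(n % k != 0 for k in range(2, int(n ** 0.5) + 1))

groups = group_by(range(20), is_prime)
print(groups[False])  # [0, 1, 4, 6, 8, 9, 10, 12, 14, 15, 16, 18]
print(groups[True])   # [2, 3, 5, 7, 11, 13, 17, 19]
```

The keys are True and False rather than Factor's t and f, but the buckets match the listener output shown earlier.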

Let’s take it step-by-step. First, start defining a word (function) called group-by.

: group-by

Next, declare its stack effect: it takes two arguments, a sequence (list or array) of values and a quotation (anonymous function or lambda) that computes a key for each element, and it outputs an assoc (association, map, or dict). The names here are used only for documentation; it could just as well take a foo and a bar and return a baz.

( seq quot: ( elt -- key ) -- assoc )

The code inside the word is everything until the “;”. We want to output a hashtable, so we first create one by cloning an empty hashtable (H{ }).

H{ } clone

We will build a quotation that duplicates each element, computes a key from the copy, and uses that key to push the element into the appropriate bucket (a vector) in the hashtable. The push-at word has the signature ( value key assoc -- ). For example, if grouping by the length of a string, we want something that looks a bit like [ dup length H{ } push-at ], where H{ } stands for the hashtable we just created:

[ push-at ] curry compose [ dup ] prepose
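
The composition above may be easier to follow in Python terms. The following is a rough sketch, not a translation of Factor's semantics: push_at and make_step are hypothetical names, with push_at modelling the ( value key assoc -- ) signature and make_step modelling what curry, compose, and [ dup ] prepose build together.

```python
def push_at(value, key, table):
    # Append value to the bucket stored under key, creating it if needed.
    table.setdefault(key, []).append(value)

def make_step(key_func, table):
    # "curry": fix the table as the last argument of push_at.
    def push_into_table(value, key):
        push_at(value, key, table)

    # "compose" plus "[ dup ] prepose": use the element twice, once to
    # compute its key and once as the value that gets pushed.
    def step(elt):
        push_into_table(elt, key_func(elt))

    return step

table = {}
step = make_step(len, table)
for word in ["a", "bc", "de", "f"]:
    step(word)
print(table)  # {1: ['a', 'f'], 2: ['bc', 'de']}
```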

We then apply this quotation to each element in the sequence:

each

Finally, we want to make sure that the hashtable we created isn’t “consumed” by the quotation, but is kept on the stack as the return value. Wrapping the body in a quotation and calling keep does exactly that:

[ ... ] keep ;
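
The role of keep can be approximated in Python as well. This is a loose model under stated assumptions: Factor's keep works on the data stack, whereas the hypothetical keep function below takes the preserved value as an explicit argument and returns it after running the body.

```python
from collections import defaultdict

def keep(quot, x):
    # Hypothetical model of keep: run quot on x, then hand x back.
    quot(x)
    return x

def group_by(seq, key_func):
    def fill(table):
        # The body "consumes" the table, much as each does in the Factor word.
        for elt in seq:
            table[key_func(elt)].append(elt)
    return keep(fill, defaultdict(list))

groups = group_by(["", "a", "b", "ab"], len)
print(dict(groups))  # {0: [''], 1: ['a', 'b'], 2: ['ab']}
```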