Group By
Friday, April 8, 2011
When dealing with sequences, it can be useful to “group” them based on
some criteria into smaller sequences. I couldn’t find a word that quite
solved my problem in Factor, so I wrote
group-by
:
USING: assocs kernel sequences ;
: group-by ( seq quot: ( elt -- key ) -- assoc )
H{ } clone [
[ push-at ] curry compose [ dup ] prepose each
] keep ; inline
Examples
For example, we could use it to split the first 20 numbers into two groups based on whether they are prime or not:
IN: scratchpad USE: math.primes
IN: scratchpad 20 <iota> [ prime? ] group-by .
H{
{ f V{ 0 1 4 6 8 9 10 12 14 15 16 18 } }
{ t V{ 2 3 5 7 11 13 17 19 } }
}
Or, we could group the subsets of a string by their length:
IN: scratchpad USE: math.combinatorics
IN: scratchpad "abc" all-subsets [ length ] group-by .
H{
{ 0 V{ "" } }
{ 1 V{ "a" "b" "c" } }
{ 2 V{ "ab" "ac" "bc" } }
{ 3 V{ "abc" } }
}
Or, group some random numbers by their bit-count:
IN: scratchpad USE: math.bitwise
IN: scratchpad 20 [ 100 random ] replicate
[ bit-count ] group-by .
H{
{ 1 V{ 32 1 32 4 } }
{ 2 V{ 12 } }
{ 3 V{ 13 74 56 98 44 } }
{ 4 V{ 83 30 45 46 75 77 } }
{ 5 V{ 61 59 61 } }
{ 6 V{ 63 } }
}
How it works
The Factor code is roughly equivalent to the following Python code:
from collections import defaultdict
def group_by(seq, f):
d = defaultdict(list)
for value in seq:
key = f(value)
d[key].append(value)
return d
Let’s take it step-by-step. First, start defining a word (function)
called group-by
.
: group-by
Next, define it to take two arguments, a
sequence
(list or array) of values and a
quotation
(anonymous function or lambda) that computes a key for each element, and
outputs an
assoc
(association, map, or dict). Names here are used only for documentation,
it could take a foo
and bar
and return a baz
.
( seq quot: ( elt -- key ) -- assoc )
The code inside the word is everything until the “;
”. We want to
output a
hashtable,
so we first create one by
cloning an
empty hashtable (H{ }
).
H{ } clone
We will
compose a
word that
duplicates
each element to compute a key that is used to
push each
element into an appropriate bucket (a
vector) in
the hashtable. The push-at
word has the signature
( value key assoc -- )
. For example, if grouping by the length of a
string, we want something that looks a bit like
[ dup length H{ } push-at ]
:
[ push-at ] curry compose [ dup ] prepose
We then, apply this quotation to each element in the sequence:
each
And, finally, we want to make sure that the hashtable that we created isn’t “consumed”, but kept on the stack as a return value.
[ ... ] keep ;