Improved hash-sets

Friday, March 22, 2013

For quite a long time, the hash-sets vocabulary in Factor has had this note at the top:

! In a better implementation, less memory would be used
TUPLE: hash-set { table hashtable read-only } ;

By wrapping a hashtable, we had a pleasantly simple set implementation:

M: hash-set in? table>> key? ;
M: hash-set adjoin table>> dupd set-at ;
M: hash-set delete table>> delete-at ;
M: hash-set members table>> keys ;
M: hash-set null? table>> assoc-empty? ;
M: hash-set cardinality table>> assoc-size ;

Unfortunately, this implementation had the consequence of increased memory usage (members of the set would be stored twice - as both key and value in the hashtable) and some performance implications due to collecting the underlying hashtable into an array and then throwing out the values to return only the keys.

After a new hash-set implementation and a series of minor improvements, we had a greatly improved hash-set. While working on this, I realized we could speed up growing the backing array and applied a similar improvement to hashtables.

Some performance results using our benchmark.hash-sets test:

Before:

IN: scratchpad gc [ hash-sets-benchmark ] time
Running time: 3.066324117 seconds

After:

IN: scratchpad gc [ hash-sets-benchmark ] time
Running time: 0.746381819 seconds

I have since converted various parts of the compiler infrastructure to use hash-sets which had previously used hashtables for set operations and have gotten some overall performance improvements. This is available now in the development version of Factor.