Re: Factor

Factor: the language, the theory, and the practice.

Parsing Chemistry

Friday, October 24, 2025

#ebnf

In Python, the chemparse project is available as a “lightweight package for parsing chemical formula strings into python dictionaries” mapping chemical elements to numeric counts.

It supports several kinds of formula, such as:

  • simple formulas like "H2O"
  • fractional stoichiometry like "C1.5O3"
  • groups such as "(CH3)2"
  • nested groups such as "((CH3)2)3"
  • square brackets such as "K4[Fe(SCN)6]"

I thought it would be fun to build similar functionality using Factor.

We are going to use the EBNF syntax support to write a parsing expression grammar more simply. As is often the most useful way to implement things, we break it down into steps. A symbol is an uppercase letter optionally followed by a lowercase letter, a number is an integer or float (with an optional exponent), and a pair is either a symbol or a bracketed group of pairs, with an optional number before and after it.

EBNF: split-formula [=[

symbol = [A-Z] [a-z]? => [[ sift >string ]]

number = [0-9]+ { "." [0-9]+ }? { { "e" | "E" } { "+" | "-" }? [0-9]+ }?
       => [[ first3 [ concat ] bi@ "" 3append-as string>number ]]

pair   = number? { symbol | "("~ pair+ ")"~ | "["~ pair+ "]"~ } number?
       => [[ first3 swapd [ 1 or ] bi@ * 2array ]]

pairs  = pair+

]=]

We can test that this works:

IN: scratchpad "H2O" split-formula .
V{ { "H" 2 } { "O" 1 } }

IN: scratchpad "(CH3)2" split-formula .
V{ { V{ { "C" 1 } { "H" 3 } } 2 } }
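Two more cases are worth a look before flattening. This is a sketch that assumes the split-formula grammar above is loaded in the listener; the outputs are worked out from the grammar rules rather than copied from a recorded session. Fractional counts come back as floats, and each level of nesting adds one more vector wrapped in a pair:

```factor
! Assumes the EBNF: split-formula definition above is already loaded.

IN: scratchpad "C1.5O3" split-formula .
V{ { "C" 1.5 } { "O" 3 } }

IN: scratchpad "((CH3)2)3" split-formula .
V{ { V{ { V{ { "C" 1 } { "H" 3 } } 2 } } 3 } }
```

Note that the group counts (2 and 3) are only recorded at this stage; nothing has been multiplied through yet.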

But we need to recursively flatten these into an assoc, mapping element to count.

: flatten-formula ( elt n assoc -- )
    [ [ first2 ] [ * ] bi* ] dip pick string?
    [ swapd at+ ] [ '[ _ _ flatten-formula ] each ] if ;
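To see how the multiplier is threaded through the recursion, flatten-formula can be exercised directly on hand-built pairs. The following is a minimal sketch, assuming the word above is loaded and that unit-test from tools.test is in scope; the input pairs are made up for illustration:

```factor
! Assumes flatten-formula from above is loaded and tools.test is in scope.

! Base case: a { symbol count } pair is scaled by the multiplier (2 * 3).
{ H{ { "H" 6 } } } [
    H{ } clone [ [ { "H" 2 } 3 ] dip flatten-formula ] keep
] unit-test

! Recursive case: the group count (2) and the outer multiplier (3)
! combine to 6 before being applied to each inner pair.
{ H{ { "C" 6 } { "H" 18 } } } [
    H{ } clone [ [ { V{ { "C" 1 } { "H" 3 } } 2 } 3 ] dip flatten-formula ] keep
] unit-test
```

The word mutates the assoc it is given and leaves nothing on the stack, which is why each test wraps the call in keep to get the hashtable back.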

And combine those two steps to parse a formula:

: parse-formula ( str -- assoc )
    split-formula H{ } clone [
        '[ 1 _ flatten-formula ] each
    ] keep ;

We can now test that this works with a few unit tests that show each of the features we hoped to support:

{ H{ { "H" 2 } { "O" 1 } } } [ "H2O" parse-formula ] unit-test

{ H{ { "C" 1.5 } { "O" 3 } } } [ "C1.5O3" parse-formula ] unit-test

{ H{ { "C" 2 } { "H" 6 } } } [ "(CH3)2" parse-formula ] unit-test

{ H{ { "C" 6 } { "H" 18 } } } [ "((CH3)2)3" parse-formula ] unit-test

{ H{ { "K" 4 } { "Fe" 1 } { "S" 6 } { "C" 6 } { "N" 6 } } }
[ "K4[Fe(SCN)6]" parse-formula ] unit-test
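One behavior the tests above do not exercise is an element that appears more than once. Because the flattening step uses at+, repeated symbols should accumulate into a single count. These extra tests are a sketch under the same assumptions as the ones above (parse-formula loaded, tools.test in scope), and the formulas are chosen only for illustration:

```factor
! Assumes parse-formula from above is loaded and tools.test is in scope.

! Repeated symbols accumulate: C appears twice, H as 3 + 1, O twice.
{ H{ { "C" 2 } { "H" 4 } { "O" 2 } } } [ "CH3COOH" parse-formula ] unit-test

! A two-letter symbol followed directly by a group.
{ H{ { "Ca" 1 } { "O" 2 } { "H" 2 } } } [ "Ca(OH)2" parse-formula ] unit-test
```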

This is available on my GitHub.