Parsing Chemistry
Friday, October 24, 2025
In Python, the chemparse project is available as a “lightweight package for parsing chemical formula strings into python dictionaries” mapping chemical elements to numeric counts.
It supports several kinds of formula, such as:
- simple formulas like "H2O"
- fractional stoichiometry like "C1.5O3"
- groups such as "(CH3)2"
- nested groups such as "((CH3)2)3"
- square brackets such as "K4[Fe(SCN)6]"
I thought it would be fun to build similar functionality using Factor.
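We are going to use the EBNF syntax support to write a parsing expression grammar more simply. As is often the case, it helps to break the problem down into steps. We can parse a symbol as an uppercase letter optionally followed by a lowercase letter, a number as an integer or float (with an optional exponent), and then a pair, which is either a symbol or a parenthesized or bracketed group of pairs, with an optional number prefix and postfix. The prefix and postfix each default to 1 and are multiplied together to give the pair's count.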
EBNF: split-formula [=[
    symbol = [A-Z] [a-z]? => [[ sift >string ]]
    number = [0-9]+ { "." [0-9]+ }? { { "e" | "E" } { "+" | "-" }? [0-9]+ }?
        => [[ first3 [ concat ] bi@ "" 3append-as string>number ]]
    pair = number? { symbol | "("~ pair+ ")"~ | "["~ pair+ "]"~ } number?
        => [[ first3 swapd [ 1 or ] bi@ * 2array ]]
    pairs = pair+
]=]
We can test that this works:
IN: scratchpad "H2O" split-formula .
V{ { "H" 2 } { "O" 1 } }
IN: scratchpad "(CH3)2" split-formula .
V{ { V{ { "C" 1 } { "H" 3 } } 2 } }
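Square brackets nest the same way as parentheses. As a sketch, assuming split-formula is defined as above and the tools.test vocabulary is loaded, the bracketed formula from the list of variants should split into a group within a group. It is written as a unit test rather than a listener session because the printed form would wrap across lines:

```factor
! The outer [ ] group has no explicit count, so it defaults to 1;
! the inner ( ) group keeps its count of 6.
{
    V{
        { "K" 4 }
        { V{ { "Fe" 1 } { V{ { "S" 1 } { "C" 1 } { "N" 1 } } 6 } } 1 }
    }
} [ "K4[Fe(SCN)6]" split-formula ] unit-test
```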
But we need to recursively flatten these into an assoc, mapping element to count, multiplying each group's count into the counts of its children.
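! Scale the pair's count by the multiplier n, then either add it to
! the assoc (for an element symbol) or recurse into the group's pairs.
: flatten-formula ( elt n assoc -- )
    [ [ first2 ] [ * ] bi* ] dip pick string?
    [ swapd at+ ] [ '[ _ _ flatten-formula ] each ] if ;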
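To see the multiplier at work, here is a small sketch that calls flatten-formula directly. The input pairs are hand-built for illustration, in the same shape that split-formula produces, and the tests assume the tools.test vocabulary is loaded:

```factor
! A leaf pair: the count 2 is scaled by the outer multiplier 3.
{ H{ { "H" 6 } } }
[ { "H" 2 } 3 H{ } clone [ flatten-formula ] keep ] unit-test

! A group pair: the group's count 2 is pushed down to each child.
{ H{ { "C" 2 } { "H" 6 } } }
[
    { V{ { "C" 1 } { "H" 3 } } 2 } 1
    H{ } clone [ flatten-formula ] keep
] unit-test
```

Passing a fresh hashtable and using keep leaves the filled assoc on the stack, which is the same pattern parse-formula relies on.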
And combine those two steps to parse a formula:
: parse-formula ( str -- assoc )
    split-formula H{ } clone [
        '[ 1 _ flatten-formula ] each
    ] keep ;
We can now test that this works with a few unit tests that show each of the features we hoped to support:
{ H{ { "H" 2 } { "O" 1 } } } [ "H2O" parse-formula ] unit-test
{ H{ { "C" 1.5 } { "O" 3 } } } [ "C1.5O3" parse-formula ] unit-test
{ H{ { "C" 2 } { "H" 6 } } } [ "(CH3)2" parse-formula ] unit-test
{ H{ { "C" 6 } { "H" 18 } } } [ "((CH3)2)3" parse-formula ] unit-test
{ H{ { "K" 4 } { "Fe" 1 } { "S" 6 } { "C" 6 } { "N" 6 } } }
[ "K4[Fe(SCN)6]" parse-formula ] unit-test
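One property the tests above do not exercise: because flatten-formula accumulates with at+, an element that appears more than once in a formula has its counts summed. A sketch of one more test, using acetic acid as an illustrative input that is not in the original list:

```factor
! C and O each appear twice; H appears as H3 and again as H.
{ H{ { "C" 2 } { "H" 4 } { "O" 2 } } }
[ "CH3COOH" parse-formula ] unit-test
```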
This is available on my GitHub.