TXON
Tuesday, February 21, 2012
TXON, also known as “Text Object Notation”, is a proposed format for structured data.
Although it is much less popular than formats such as JSON, XML, or even INI files, I thought it would still be fun to implement encode and decode words in Factor.
An example TXON document might look something like this:
Factor:`
url:`https://factorcode.org`
development:`Started in 2003`
license:`Open source (BSD license)`
influences:`Forth, Lisp, and Smalltalk`
`
Encoding
Since TXON uses “`” characters to delimit values, we need to escape them:
: encode-value ( string -- string' )
R" `" "\\`" re-replace ;
To implement encoding in a generic way, we dispatch on the type of object being encoded:
GENERIC: >txon ( object -- string )
M: sequence >txon
[ >txon ] map "\n" join ;
M: assoc >txon
>alist [
first2 [ encode-value ] [ >txon ] bi* "%s:`%s`" sprintf
] map "\n" join ;
M: string >txon
encode-value ;
M: number >txon
number>string >txon ;
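As a quick check of the dispatch, here is a small listener sketch. It assumes the encode-value and >txon definitions above are already loaded, and the sample data is made up for illustration:

```factor
! Assumes encode-value and >txon (defined above) are already loaded.
USING: io ;

! A nested assoc: the inner hashtable is encoded first, then wrapped
! in the outer name:`...` pair.
H{ { "Factor" H{ { "license" "BSD" } } } } >txon print
! prints: Factor:`license:`BSD``

! Numbers are converted to strings before being encoded.
42 >txon print
! prints: 42
```

Note that only string values pass through encode-value; the text produced for a nested assoc is inserted as-is between the outer delimiters.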
Decoding
Although the TXON specification includes an EBNF grammar, I am going to show one way to build a parser from scratch. In the tradition of concatenative languages, we will build our decoder from several smaller words.
For symmetry with the encode-value word, we need a way to unescape the ` characters:
: decode-value ( string -- string' )
R" \\`" "`" re-replace ;
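A short round-trip sketch shows the two words undoing each other. It assumes encode-value and decode-value from above are already loaded; the input string is just an illustration:

```factor
! Assumes encode-value and decode-value (defined above) are loaded.
USING: kernel prettyprint ;

! Escape the backticks, show the result, then unescape it again.
"say `hi`" encode-value dup .    ! prints "say \\`hi\\`"
decode-value .                   ! prints "say `hi`"
```

The doubled backslashes in the first output are only how the prettyprinter displays a single backslash inside a string literal.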
Since the TXON format is a series of name:`value` pairs, we can parse the name by finding the separator and then decoding the name (which might contain escaped characters):
: parse-name ( string -- remain name )
":`" split1 swap decode-value ;
To build a word that finds the first (unescaped) ` character, we will first make a word that looks at adjacent characters, returning true if the second character is an unescaped `:
: `? ( ch1 ch2 -- ? )
[ CHAR: \ = not ] [ CHAR: ` = ] bi* and ;
By grouping the string into adjacent characters, we can find the first unescaped ` (specially handling the case where the first character is an `):
: (find-`) ( string -- n/f )
2 clump [ first2 `? ] find drop [ 1 + ] [ f ] if* ;
: find-` ( string -- n/f )
dup ?first CHAR: ` = [ drop 0 ] [ (find-`) ] if ;
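To see how the index is computed, here is a small listener sketch. It assumes the find-` word above is already loaded; the test strings are made up for illustration:

```factor
! Assumes `? , (find-`) and find-` (defined above) are loaded.
USING: prettyprint ;

"abc`" find-` .      ! prints 3 (the pair "c`" is at clump index 2, plus 1)
"`abc" find-` .      ! prints 0 (leading backtick is handled specially)
"abc" find-` .       ! prints f (no backtick at all)
"a\\`b`" find-` .    ! prints 4 (the escaped backtick at index 2 is skipped)
```

The `1 +` in (find-`) is needed because find reports the index of the matching pair, and the backtick is the second character of that pair.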
Parsing the value is slightly complicated by the fact that TXON supports values which might themselves be a single value, a sequence of values, or a series of name/value pairs. Basically, that means we need to:
- find the first ` character
- check if the previous character is a : (indicating a name/value)
- parse all name/values if so, otherwise decode the value(s)
That algorithm can be translated into this code:
DEFER: name/values
: (parse-value) ( string -- values )
decode-value string-lines dup length 1 = [ first ] when ;
: parse-value ( string -- remain value )
dup find-` [
dup 1 - pick ?nth CHAR: : =
[ drop name/values ] [ cut swap (parse-value) ] if
[ rest [ blank? ] trim-head ] dip
] [ f swap ] if* ;
We want to parse a “name=value” pair, which should be as easy as parsing the name, then the value, then associating into a hashtable:
: (name=value) ( string -- remain term )
parse-name [ parse-value ] dip associate ;
The string might contain a “name=value” pair, or just a single value:
: name=value ( string -- remain term )
[ blank? ] trim
":`" over subseq? [ (name=value) ] [ f swap ] if ;
We finish by building a word to produce all “name=value” pairs, used in the parse-value word earlier:
: name/values ( string -- remain terms )
[ dup { [ empty? not ] [ first CHAR: ` = not ] } 1&& ]
[ name=value ] produce assoc-combine ;
Putting all of that together, we can make a word to parse a TXON string, producing “name=value” pairs until exhausted:
: parse-txon ( string -- objects )
[ dup empty? not ] [ name=value ] produce nip ;
: txon> ( string -- object )
parse-txon dup length 1 = [ first ] when ;
Try It
You can try this out in the listener:
IN: scratchpad H{ { "a" "123" } } >txon .
"a:`123`"
IN: scratchpad "a:`123`" txon> .
H{ { "a" "123" } }
Can you improve on this? Maybe by using the peg.ebnf vocabulary to create an EBNF parsing word?
The code for this (and a bunch of tests) is on my GitHub.