Re: Factor

Factor: the language, the theory, and the practice.

TXON

Tuesday, February 21, 2012

#parsing

The TXON, also known as “Text Object Notation”, is a proposed format for structured data.

Much less popular than other formats such as JSON, XML, or even INI files - I thought it would still be fun to implement encode and decode words in Factor.

An example TXON might look something like this:

Factor:`
    url:`https://factorcode.org`
    development:`Started in 2003`
    license:`Open source (BSD license)`
    influences:`Forth, Lisp, and Smalltalk`
`

Encoding

Since TXON uses “`” characters to delimit values, we need to escape them:

: encode-value ( string -- string' )
    R" `" "\\`" re-replace ;

To implement encoding in a generic way, we dispatch on the type of object being encoded:

GENERIC: >txon ( object -- string )

M: sequence >txon
    [ >txon ] map "\n" join ;

M: assoc >txon
    >alist [
        first2 [ encode-value ] [ >txon ] bi* "%s:`%s`" sprintf
    ] map "\n" join ;

M: string >txon
    encode-value ;

M: number >txon
    number>string >txon ;

Decoding

Although the TXON specification includes an EBNF grammar, I am going to show one way to build a parser from scratch. In the tradition of concatenative languages, we will build our decoder from several smaller words.

For symmetry with the encode-value word, we need a way to unescape the ` characters:

: decode-value ( string -- string' )
    R" \\`" "`" re-replace ;

Since the TXON format is a series of name:`value` pairs, we can parse the name by finding the separator and then decoding the name (which might contain escaped characters):

: parse-name ( string -- remain name )
    ":`" split1 swap decode-value ;

To build a word that finds the first (unescaped) ` character, we will first make a word that looks at adjacent characters, returning true if the second character is an unescaped `:

: `? ( ch1 ch2 -- ? )
    [ CHAR: \ = not ] [ CHAR: ` = ] bi* and ;

By grouping the string into adjacent characters, we can find the first unescaped ` (specially handling the case where the first character is an `):

: (find-`) ( string -- n/f )
    2 clump [ first2 `? ] find drop [ 1 + ] [ f ] if* ;

: find-` ( string -- n/f )
    dup ?first CHAR: ` = [ drop 0 ] [ (find-`) ] if ;

Parsing the value is slightly complicated by the fact that TXON supports values which might themselves be a single value, a sequence of values, or a series of name/value pairs. Basically, that means we need to:

  1. find the first ` character
  2. checks if the previous character is a : (indicating a name/value)
  3. parse all name/values if so, otherwise decode the value(s)

That algorithm can be translated into this code:

DEFER: name/values

: (parse-value) ( string -- values )
    decode-value string-lines dup length 1 = [ first ] when ;

: parse-value ( string -- remain value )
    dup find-` [
        dup 1 - pick ?nth CHAR: : =
        [ drop name/values ] [ cut swap (parse-value) ] if
        [ rest [ blank? ] trim-head ] dip
    ] [ f swap ] if* ;

We want to parse a “name=value” pair, which should be as easy as parsing the name, then the value, then associating into a hashtable:

: (name=value) ( string -- remain term )
    parse-name [ parse-value ] dip associate ;

The string might contain a “name=value” pair, or just a single value:

: name=value ( string -- remain term )
    [ blank? ] trim
    ":`" over subseq? [ (name=value) ] [ f swap ] if ;

We finish by building a word to produce all “name=value” pairs, used in the parse-value word earlier.

: name/values ( string -- remain terms )
    [ dup { [ empty? not ] [ first CHAR: ` = not ] } 1&& ]
    [ name=value ] produce assoc-combine ;

Putting all of that together, we can make a word to parse a TXON string, producing “name=value” pairs until exhausted:

: parse-txon ( string -- objects )
    [ dup empty? not ] [ name=value ] produce nip ;

: txon> ( string -- object )
    parse-txon dup length 1 = [ first ] when ;

Try It

You can try this out in the listener:

IN: scratchpad H{ { "a" "123" } } >txon .
"a:`123`"

IN: scratchpad "a:`123`" txon> .
H{ { "a" "123" } }

Can you improve on this? Maybe by using the peg.ebnf vocabulary to create an EBNF parsing word?

The code for this (and a bunch of tests) are on my GitHub.