Re: Factor

Factor: the language, the theory, and the practice.

Case Conversion

Friday, May 5, 2023

#text #unicode

Work with enough programming languages and programmers and you will run into differing opinions on the proper case conventions for class names, variable names, and other attribute names. Sometimes you need to convert a name from one convention to another.

Looking around at other programming languages, you can find modules such as Change Case for JavaScript, case-converter for Python, a code golf challenge, a regular expression approach to converting strings between case styles, and even a PHP module written by Jawira Portugal called Case Converter that handles quite a few, ahem, cases:

Convert strings between 13 naming conventions: Snake case, Camel case, Kebab case, Pascal case, Ada case, Train case, Cobol case, Macro case, Upper case, Lower case, Title case, Sentence case and Dot notation.

Examples of each might look something like:

  • snake_case
  • camelCase
  • kebab-case
  • PascalCase
  • Ada_Case
  • Train-Case
  • COBOL-CASE
  • MACRO_CASE
  • UPPER CASE
  • lower case
  • Title Case
  • Sentence case
  • dot.case

I thought it would be an interesting exercise to make a Unicode-aware case conversion library for Factor that handles all of those same cases in a small amount of code (fewer than 35 lines!).

The first word finds the first lowercase grapheme, then returns the index of the next grapheme that is not lowercase (or f if there is none):

: case-index ( str -- i/f )
    dup [ lower? ] find [
        swap [ lower? not ] find-from drop
    ] [ nip ] if ;
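To make the return value concrete, here is a small sketch that repeats the definition so it can be pasted into a listener on its own. It assumes a recent Factor build in which `lower?` and `>graphemes` live in the `unicode` vocabulary, and it splits the sample strings into graphemes first, the same way split-case does. The sample inputs are made up for illustration.

```factor
USING: kernel prettyprint sequences unicode ;
IN: scratchpad

! Same definition as in the post, repeated so the snippet stands alone.
: case-index ( str -- i/f )
    dup [ lower? ] find [
        swap [ lower? not ] find-from drop
    ] [ nip ] if ;

! The first lowercase grapheme is "h" at index 0; the next
! non-lowercase one is "W" at index 5, so this should print 5.
"helloWorld" >graphemes case-index .

! Nothing non-lowercase follows the first lowercase grapheme,
! so there is no boundary and this should print f.
"hello" >graphemes case-index .
```

An f result is what lets the caller fall back to the full length and take the remainder as a single token.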

We can then use that word to split the graphemes at these case boundaries:

: split-case ( str -- words )
    >graphemes [ dup empty? not ] [
        dup [ case-index ] [ length or ] bi
        cut-slice swap concat
    ] produce nip ;

Next, we split the input into tokens: first on the common token separators (space, hyphen, underscore, and period), and then on the case boundaries:

: split-tokens ( str -- words )
    " -_." split [ split-case ] map concat ;

And now the core of the algorithm: split an input string into tokens, transform them, and join the results using a provided glue string. There are two variants: case1 applies one quotation to every token, while case2 handles the first token differently from the rest.

: case1 ( str quot glue -- str' )
    [ split-tokens ] [ map ] [ join ] tri* ; inline

: case2 ( str first-quot rest-quot glue -- str' )
    {
        [ split-tokens 0 over ]
        [ change-nth dup rest-slice ]
        [ map! drop ]
        [ join ]
    } spread ; inline
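As a quick check of how the pieces fit together, the following sketch repeats the definitions so far and calls case1 and case2 directly. The input string and the "/" and ":" glue strings are made up for illustration and are not conventions from the post. The expected results in the comments were worked out by hand from the definitions, assuming a recent Factor build.

```factor
USING: combinators kernel prettyprint sequences splitting
unicode ;
IN: scratchpad

! Definitions repeated from the post so the snippet stands alone.
: case-index ( str -- i/f )
    dup [ lower? ] find [
        swap [ lower? not ] find-from drop
    ] [ nip ] if ;

: split-case ( str -- words )
    >graphemes [ dup empty? not ] [
        dup [ case-index ] [ length or ] bi
        cut-slice swap concat
    ] produce nip ;

: split-tokens ( str -- words )
    " -_." split [ split-case ] map concat ;

: case1 ( str quot glue -- str' )
    [ split-tokens ] [ map ] [ join ] tri* ; inline

: case2 ( str first-quot rest-quot glue -- str' )
    {
        [ split-tokens 0 over ]
        [ change-nth dup rest-slice ]
        [ map! drop ]
        [ join ]
    } spread ; inline

! Separators give "parseHttp", "response", "body"; the case
! boundary then splits "parseHttp" into "parse" and "Http".
"parseHttp_response-body" split-tokens .

! case1: upper-case every token, join with "/".
! Should print "PARSE/HTTP/RESPONSE/BODY".
"parseHttp_response-body" [ >upper ] "/" case1 .

! case2: upper-case the first token, lower-case the rest, join with ":".
! Should print "PARSE:http:response:body".
"parseHttp_response-body" [ >upper ] [ >lower ] ":" case2 .
```

In case2, each quotation passed to spread consumes one of the four inputs in order: the first tokenizes the string, the second rewrites element 0 in place with change-nth, the third rewrites the remaining elements in place through rest-slice and map!, and the last joins the modified array with the glue.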

Now that’s everything we need to implement all the case conversions!

: >camelcase ( str -- str' ) [ >lower ] [ >title ] "" case2 ;
: >pascalcase ( str -- str' ) [ >title ] "" case1 ;
: >snakecase ( str -- str' ) [ >lower ] "_" case1 ;
: >adacase ( str -- str' ) [ >title ] "_" case1 ;
: >macrocase ( str -- str' ) [ >upper ] "_" case1 ;
: >kebabcase ( str -- str' ) [ >lower ] "-" case1 ;
: >traincase ( str -- str' ) [ >title ] "-" case1 ;
: >cobolcase ( str -- str' ) [ >upper ] "-" case1 ;
: >lowercase ( str -- str' ) [ >lower ] " " case1 ;
: >uppercase ( str -- str' ) [ >upper ] " " case1 ;
: >titlecase ( str -- str' ) [ >title ] " " case1 ;
: >sentencecase ( str -- str' ) [ >title ] [ >lower ] " " case2 ;
: >dotcase ( str -- str' ) [ >lower ] "." case1 ;
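For a feel of how these behave, here is a short listener-style sketch. It assumes a build that ships the tokencase vocabulary; the sample input is made up, and the expected results in the comments were derived by hand from the definitions above rather than copied from a real session.

```factor
USING: prettyprint tokencase ;

! The input tokenizes to "hello", "Big", "world", "example".
"helloBig_world example" >camelcase .     ! "helloBigWorldExample"
"helloBig_world example" >pascalcase .    ! "HelloBigWorldExample"
"helloBig_world example" >snakecase .     ! "hello_big_world_example"
"helloBig_world example" >macrocase .     ! "HELLO_BIG_WORLD_EXAMPLE"
"helloBig_world example" >traincase .     ! "Hello-Big-World-Example"
"helloBig_world example" >sentencecase .  ! "Hello big world example"
"helloBig_world example" >dotcase .       ! "hello.big.world.example"
```

Because every converter tokenizes its input first, each one accepts any of the other conventions as input, so converting from snake case to Pascal case is a single call to >pascalcase.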

These words are available in the tokencase vocabulary, which is included in the latest nightly builds.