Unicode

Thursday, May 4, 2023

The Rust programming language is pretty cool. I’ve enjoyed many aspects of the Rewrite It In Rust meme that appears as a part of the Rust Evangelism Strike Force. The Rust documentation includes a pretty awesome Rust book that is probably a gold standard for programming language documentation.

In the Rust book, there is a section on Storing UTF-8 Encoded Text with Strings. It contains a neat example that I would like to use to show how Factor string objects work, how we handle Unicode and other character encodings, and show how we probably can make some improvements in the future. At the time of this blog post, we support Unicode 15.0.0 which was released in September 2022.

Factor strings are a sequence of Unicode code points which we explore to see how they work.

Bytes

If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is stored as a vector of u8 values that looks like this:

IN: scratchpad "नमस्ते" utf8 encode .
B{
    224 164 168 224 164 174 224 164 184 224 165 141 224 164 164
    224 165 135
}

Or, as a series of hex values:

IN: scratchpad "नमस्ते" utf8 encode .h
B{
    0xe0 0xa4 0xa8 0xe0 0xa4 0xae 0xe0 0xa4 0xb8 0xe0 0xa5 0x8d
    0xe0 0xa4 0xa4 0xe0 0xa5 0x87
}

You could instead print them as octal or binary quite easily.

Code Points

That’s 18 bytes and is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Rust’s char type is, those bytes look like this:

IN: scratchpad "नमस्ते" [ 1string ] { } map-as .
{ "न" "म" "स" "्" "त" "े" }

You can see what the code point numeric values are:

IN: scratchpad "नमस्ते" >array .
{ 2344 2350 2360 2381 2340 2375 }

Or even see what the code point names are:

IN: scratchpad "नमस्ते" [ char>name ] { } map-as .
{
    "devanagari-letter-na"
    "devanagari-letter-ma"
    "devanagari-letter-sa"
    "devanagari-sign-virama"
    "devanagari-letter-ta"
    "devanagari-vowel-sign-e"
}

Characters

There are six char values here, but the fourth and sixth are not letters: they’re diacritics that don’t make sense on their own. Finally, if we look at them as grapheme clusters, we’d get what a person would call the four letters that make up the Hindi word:

IN: scratchpad "नमस्ते" >graphemes [ >string ] map .
{ "न" "म" "स्" "ते" }

These graphemes are code points grouped like so:

IN: scratchpad "नमस्ते" >graphemes [ >array ] map .
{ { 2344 } { 2350 } { 2360 2381 } { 2340 2375 } }

Encodings

Rust provides different ways of interpreting the raw string data that computers store so that each program can choose the interpretation it needs, no matter what human language the data is in.

Factor supports many encodings which can be used for interacting with other computer systems. These include ASCII, many legacy 8-bit encodings (including MacRoman, EBCDIC, and others), other Unicode variants (such as UTF-7, UTF-16, and UTF-32), ISO-2022, and several others.

There are a couple of space optimizations to save memory when only small code points are used, which is common in English as well as formats such as Base64. Looking at the Rust standard library, the improvements made to the Python unicode support, or other languages such as Strings and Characters in Swift, there are likely improvements we can make when working with text in Factor.