String Length
Saturday, August 23, 2025
I was recently reminded of a great article on Unicode string lengths:
It's Not Wrong that
"🤦🏼‍♂️".length == 7
But It's Better that
"🤦🏼‍♂️".len() == 17
and Rather Useless that
len("🤦🏼‍♂️") == 5
This comes at a time of an emoji tsunami, thanks to the proliferation of large language models and probably lots of Gen Z in the training data sets. Sometimes emojis are fun and useful, like in Base256Emoji, and sometimes things get carried away, like in the Emoji Kitchen.
I have written about Factor’s Unicode support before and wanted to use this example to show a bit more about how Factor represents text using the Unicode standard.
IN: scratchpad "🤦" length .
1
IN: scratchpad "🤦🏼‍♂️" length .
5
Wat.
Well, the current strings vocabulary stores Unicode code points, so length counts code points rather than visual characters. This can be both useful and useless depending on the task at hand. We can print out which ones are used in this example:
IN: scratchpad "🤦🏼‍♂️" [ char>name . ] each
"face-palm"
"emoji-modifier-fitzpatrick-type-3"
"zero-width-joiner"
"male-sign"
"variation-selector-16"
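As a cross-check outside of Factor, the same five code points can be listed with Python's standard `unicodedata` module. This is only a sketch: the `FACEPALM` constant and the `code_point_names` helper are names made up for illustration, and the emoji is written with escapes so the source survives any encoding trouble.

```python
import unicodedata

# The facepalm emoji from the article, spelled out code point by code point.
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def code_point_names(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


# Python strings are sequences of code points, so len() counts code points.
print(len(FACEPALM))

for ch in FACEPALM:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Python reports the names in the upper-case form used by the Unicode character database, while the Factor output above uses a lower-case, hyphenated spelling of the same names.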
When a developer expresses a need to store or retrieve textual data, they likely need to know about character encodings. In this case, we can see the number of bytes required to store this string in different encodings:
IN: scratchpad "🤦🏼‍♂️" utf8 encode length .
17
IN: scratchpad "🤦🏼‍♂️" utf16 encode length .
16
IN: scratchpad "🤦🏼‍♂️" utf32 encode length .
24
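The UTF-16 and UTF-32 numbers are each a little larger than the raw code units alone would need (7 two-byte units is 14 bytes, 5 four-byte units is 20 bytes). One explanation consistent with these results is a leading byte order mark. The Python sketch below shows the same arithmetic, since Python's generic `utf-16` and `utf-32` codecs prepend a BOM while the explicit little-endian variants do not. The `encoded_length` helper is a hypothetical name used only here.

```python
# The facepalm emoji from the article, spelled out code point by code point.
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def encoded_length(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


# "utf-16" and "utf-32" include a byte order mark; the "-le" variants do not.
sizes = {
    enc: encoded_length(FACEPALM, enc)
    for enc in ("utf-8", "utf-16", "utf-16-le", "utf-32", "utf-32-le")
}

for enc, size in sizes.items():
    print(f"{enc:10} {size} bytes")
```

The 14 bytes of BOM-less UTF-16 are the 7 code units that the JavaScript `.length` in the article title reports.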
But what if we just want to know how many visual characters are in the string?
IN: scratchpad "🤦🏼‍♂️" >graphemes length .
1
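To give a feel for why the five code points collapse into one visual character, here is a deliberately rough Python sketch. It is not the Unicode text segmentation algorithm (UAX #29) and not how Factor implements `>graphemes`; it only assumes that combining marks, the zero-width joiner, and emoji skin-tone modifiers attach to the previous character, and that anything following a joiner stays in the same cluster. The names `is_extender` and `rough_grapheme_count` are hypothetical. Python's standard library has no grapheme segmentation, so real code would more likely reach for a third-party library such as the `regex` module and its `\X` pattern.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def is_extender(ch):
    """Rough test for code points that attach to the preceding character."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, variation selectors
        or ch == ZWJ
        or 0x1F3FB <= ord(ch) <= 0x1F3FF  # emoji skin-tone modifiers
    )


def rough_grapheme_count(text):
    """Approximate count of visual characters; a simplification of UAX #29."""
    count = 0
    prev = None
    for ch in text:
        # Start a new cluster unless this code point extends the previous one
        # or directly follows a joiner (an over-approximation of the real rule).
        if prev is None or not (is_extender(ch) or prev == ZWJ):
            count += 1
        prev = ch
    return count


FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
print(rough_grapheme_count(FACEPALM))
```

This simplification ignores cases such as flag sequences and Hangul syllables, which the full algorithm handles.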
This is covered in The Absolute Minimum Every Software Developer Must Know About Unicode in 2023, another great article that also covers a number of other aspects of the Unicode standard.