Re: Factor

Factor: the language, the theory, and the practice.

String Length

Saturday, August 23, 2025

#unicode

I was recently reminded of a great article on Unicode string lengths:

It’s Not Wrong that "🤦🏼‍♂️".length == 7

But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5

This comes at a time of an emoji tsunami, thanks to the proliferation of large language models and probably lots of Gen Z in the training data sets. Sometimes emojis are fun and useful, as in Base256Emoji, and sometimes things get carried away, as in the Emoji Kitchen.

I have written about Factor’s Unicode support before and wanted to use this example to show a bit more about how Factor represents text using the Unicode standard.

IN: scratchpad "🤦" length .
1

IN: scratchpad "🤦🏼‍♂️" length .
5

Wat.

Well, what is happening is that the current strings vocabulary stores Unicode code points. This can be both useful and useless depending on the task at hand. We can print out which ones are used in this example:

IN: scratchpad "🤦🏼‍♂️" [ char>name . ] each
"face-palm"
"emoji-modifier-fitzpatrick-type-3"
"zero-width-joiner"
"male-sign"
"variation-selector-16"
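
For readers who want to cross-check the same breakdown outside Factor, here is a minimal Python sketch using the standard `unicodedata` module. The constant `FACEPALM` and the helper `code_point_names` are names made up for this illustration. The emoji is written with escapes so that the file's own encoding cannot mangle it.

```python
import unicodedata

# The facepalm emoji from the post, spelled out as escapes:
# face palm, skin tone modifier, zero width joiner, male sign, variation selector.
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def code_point_names(text):
    """Return (hex code point, Unicode name) pairs, one per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Python strings, like Factor strings, are sequences of code points.
print(len(FACEPALM))
for code_point, name in code_point_names(FACEPALM):
    print(code_point, name)
```

Python reports the official uppercase Unicode names (for example `FACE PALM`), whereas `char>name` above prints them lowercased and hyphenated.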

When a developer expresses a need to store or retrieve textual data, they likely need to know about character encodings. In this case, we can see the number of bytes required to store this string in different encodings:

IN: scratchpad "🤦🏼‍♂️" utf8 encode length .
17

IN: scratchpad "🤦🏼‍♂️" utf16 encode length .
16

IN: scratchpad "🤦🏼‍♂️" utf32 encode length .
24
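
The 16 and 24 above are consistent with a byte order mark being prepended to the encoded bytes (14 + 2 and 20 + 4). That is an inference from the numbers, not something stated here. The following Python sketch shows the same arithmetic using Python's codecs, where `utf-16` and `utf-32` write a BOM and the `-le` variants do not. The names `FACEPALM` and `encoded_sizes` are made up for this illustration.

```python
# The facepalm emoji from the post, written as escapes.
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def encoded_sizes(text):
    """Return the encoded byte length of text under several Python codecs."""
    codecs = ["utf-8", "utf-16-le", "utf-16", "utf-32-le", "utf-32"]
    return {codec: len(text.encode(codec)) for codec in codecs}


sizes = encoded_sizes(FACEPALM)
for codec, size in sizes.items():
    print(f"{codec:10} {size} bytes")

# UTF-16 code units (two bytes each, no BOM): the quantity that
# JavaScript's .length reports in the article title.
utf16_units = sizes["utf-16-le"] // 2
print("UTF-16 code units:", utf16_units)
```

This accounts for all three numbers in the article title: 7 UTF-16 code units, 17 UTF-8 bytes, and 5 code points.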

But, what if we just want to know how many visual characters are in the string?

IN: scratchpad "🤦🏼‍♂️" >graphemes length .
1

This is covered in The Absolute Minimum Every Software Developer Must Know About Unicode in 2023, another great article that also covers a number of other aspects of the Unicode standard.