Unicode Strings are more complicated than you think

June 2nd 2023 — 2 minute read

Work is all abuzz with switching to MathJax for math rendering. As part of that project, we have to handle math that was written to be beautiful and illuminating in languages other than English, which means coming face-to-face with the complexities of Unicode.

The biggest thing to remember is that code points are not graphemes: what looks like one distinct block on screen might actually be many code points long. A code point is what the computer sees as a "character" and a grapheme is what the user sees as a character. The fun example is emoji, which are often made up of several distinct code points, and my very favorite example is the family emoji:

[...'👨‍👩‍👦'] // ["👨", "‍", "👩", "‍", "👦"]
'👨‍👩‍👦'.length // 8

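The family emoji is made up of a man, a woman, and a child, all joined by zero-width joiners (which TIL are called zwidges!). Spreading the string splits it into those five code points, while .length reports 8 because JavaScript counts UTF-16 code units and each of the three people takes up two of them. It turns out that for languages like Vietnamese the combining can be even more complicated, since a single letter can carry more than one diacritic (and at work we need to show beautiful math in Vietnamese).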
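
If you want to count what the user sees rather than what the computer sees, here is a minimal sketch. It assumes a runtime that ships Intl.Segmenter (recent browsers and Node), and countGraphemes is a name I made up for illustration, not something from MathJax or our codebase. The emoji is written with escapes so the invisible joiners are visible in the source.

```javascript
// man (U+1F468), ZWJ (U+200D), woman (U+1F469), ZWJ (U+200D), boy (U+1F466)
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

console.log(family.length);          // 8 UTF-16 code units
console.log([...family].length);     // 5 code points
console.log(countGraphemes(family)); // 1 grapheme

// Vietnamese "ệ" can be one precomposed code point (NFC form)
// or a plain "e" followed by two combining marks (NFD form).
const composed = '\u1EC7';
const decomposed = composed.normalize('NFD');

console.log([...decomposed].length);                   // 3 code points
console.log(composed === decomposed);                  // false, even though they look identical
console.log(composed === decomposed.normalize('NFC')); // true once both are normalized
console.log(countGraphemes(decomposed));               // 1 grapheme
```

The takeaway for me is to normalize before comparing strings and to segment by grapheme before counting or truncating anything a person will read.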

I have a special relationship with Unicode. A long time ago I was working on an ecommerce platform and was asked to "upgrade the database so that we can support names that have umlauts in them". I was very confident that this would be an easy task. This was not an easy task. I still feel extremely cautious any time a project involves explicitly complicated non-ASCII characters.

Resources

Here are some resources that were passed around during our discussions today: