The Lisowe

You can do things the right way, the wrong way, or the Lisowe

We’ve been told time and time again that variables in Python must begin with a letter or underscore, but that’s not quite the whole story. In fact, given how expansive Unicode is, it’s almost none of the story. The truth is that as of Python 3.9, which uses Unicode 13, you can start your variables with any one of 131,459 different characters. Letters and underscores? Pfft, who needs them.

As it turns out, Unicode characters have different properties associated with them. There are a lot of them, but the two we are interested in are xid_start and xid_continue. Characters with the xid_start property are allowed at the beginning of an identifier and throughout it, while characters with only the xid_continue property are allowed anywhere except the first position. Letters and 𒐫 have the first property, while numbers and 𑇩 have the second.
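Python exposes this distinction directly through str.isidentifier(), which makes the rule easy to poke at. A quick sketch (the cuneiform character below is one of the xid_start options):

```python
# A character with xid_start may open an identifier; a digit has only
# xid_continue, so it may appear anywhere except the first position.
print("𒐫".isidentifier())   # cuneiform, xid_start: a legal identifier
print("1x".isidentifier())  # opens with a continue-only digit: illegal
print("x1".isidentifier())  # digit after the first position: legal
```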

Naturally, after discovering this, I set out to find a list of these characters, but the closest I could find was an incomplete list from Unicode 9. Not only that, but it only displays the codes rather than the characters themselves. I want to know every single option I have and what it looks like, so I decided I’d throw together a script to find the rest of these identifier characters and dump them to a file for later reference.

Cue two or three weeks of debugging various unsuccessful methods of displaying, sorting, and collecting the various characters and their properties.

The method I ended up settling on was building and parsing a list of valid xid_start and xid_continue ranges based on the text in the original unicode specification.

xid_start = [
    "0041..005A",
    "0061..007A",
    "00AA",
    ...
    "2CEB0..2EBE0",
    "2F800..2FA1D",
    "30000..3134A",
]
xid_continue = [
    "0030..0039",
    "0041..005A",
    "005F",
    ...
    "2F800..2FA1D",
    "30000..3134A",
    "E0100..E01EF",
]

Next we write some code to loop through those values, add the characters to a dictionary for easy reference, and print them along with their counts.

from pprint import pprint

xid_start_glyphs = {}
xid_continue_glyphs = {}

# Each entry is either a single code point ("00AA") or a range ("0041..005A");
# split("..") handles both, since [0] and [-1] coincide for a single entry.
for unicode_range in xid_start:
    unicode_range = unicode_range.split("..")
    first_number = int(unicode_range[0], 16)
    last_number = int(unicode_range[-1], 16)
    for glyph in range(first_number, last_number + 1):
        xid_start_glyphs[glyph] = chr(glyph)

for unicode_range in xid_continue:
    unicode_range = unicode_range.split("..")
    first_number = int(unicode_range[0], 16)
    last_number = int(unicode_range[-1], 16)
    for glyph in range(first_number, last_number + 1):
        xid_continue_glyphs[glyph] = chr(glyph)

pprint(xid_start_glyphs)
pprint(xid_continue_glyphs)
print(len(xid_start_glyphs))
print(len(xid_continue_glyphs))

And there we go: 131,459 characters with the xid_start property and 134,415 characters with the xid_continue property. Every xid_start character is also an xid_continue character, but not the other way around. But there’s still a problem…
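That containment claim can be spot-checked with str.isidentifier(): a character that can open an identifier should also be legal after the first position, while a digit only works in the trailing spots. A small sanity check over a few samples, not a proof over all 131,459 characters:

```python
# Every sampled xid_start character also continues an identifier legally.
for ch in ["A", "_", "𒐫"]:
    assert ch.isidentifier()          # legal as the first character
    assert ("x" + ch).isidentifier()  # still legal after the first
# The reverse fails: a digit continues an identifier but cannot start one.
assert "x1".isidentifier() and not "1".isidentifier()
print("spot check passed")
```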

 ...
 201537: '𱍁',
 201538: '𱍂',
 201539: '𱍃',
 201540: '𱍄',
 201541: '𱍅',
 201542: '𱍆',
 201543: '𱍇',
 201544: '𱍈',
 201545: '𱍉',
 201546: '𱍊'}

A lot of characters aren’t displaying properly, yet Python was still apparently able to handle them. What gives?

What we’re seeing here is a consequence of the OpenType font specification, which limits a font to 2^16 glyphs, so the maximum number of characters a single font can display is 65,536. Considering we can’t just assign a value to a character code à la "\u9665" = 10, that leaves nearly half of the xid_start characters at best unavailable to us. Technically we can bind a value to a non-displayable glyph, but I want to actually see the characters that my values are bound to.
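The "\u9665" = 10 remark is easy to confirm: the escape form is a string literal, not a name, so the assignment fails at compile time, while the character itself is a perfectly legal identifier. A minimal check:

```python
# Assigning to a string literal is rejected before the code ever runs.
try:
    compile('"\\u9665" = 10', "<demo>", "exec")
except SyntaxError as err:
    print("SyntaxError:", err.msg)

# Using the character itself (U+9665, a CJK ideograph) compiles fine.
compile(chr(0x9665) + " = 10", "<demo>", "exec")
print("identifier form compiled")
```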

It was at this point that I thought I had hit a wall and figured that if it was impossible to view all the characters, I might as well filter out the non-displayable ones. To filter them out, I needed a font file for reference, to know which characters are and aren’t designed within the font. Adding some code lets us check any individual Unicode character against a font to see if the font supports it:

from fontTools.ttLib.ttFont import TTFont
from pathlib import Path

font = TTFont(file=Path("C:\\Windows\\fonts\\arial.ttf"))

def char_in_font(unicode_char, font):
    for cmap in font["cmap"].tables:
        if cmap.isUnicode():
            if ord(unicode_char) in cmap.cmap:
                return True
    return False

And then modifying our parser to include this check for displayable characters:

for unicode_range in xid_start:
    unicode_range = unicode_range.split("..")
    first_number = int(unicode_range[0], 16)
    last_number = int(unicode_range[-1], 16)
    for glyph in range(first_number, last_number + 1):
        if char_in_font(chr(glyph), font):
            xid_start_glyphs[glyph] = chr(glyph)

for unicode_range in xid_continue:
    unicode_range = unicode_range.split("..")
    first_number = int(unicode_range[0], 16)
    last_number = int(unicode_range[-1], 16)
    for glyph in range(first_number, last_number + 1):
        if char_in_font(chr(glyph), font):
            xid_continue_glyphs[glyph] = chr(glyph)

Re-running the script after these changes gets us 2613 start characters… ouch, that’s a terrible ratio compared to the maximum. Different fonts support different glyphs though, so maybe we’ll have better luck with another font?

  • Calibri: 2730
  • Courier New: 2488
  • Microsoft Sans Serif Regular: 2943

Well, that’s not good; that’s not even 3% of our limit. I took a closer look at which characters were being shown with and without the non-displayable-characters filter and noticed something peculiar: with the filter turned on, there were no East Asian characters. How could specifying a font to check against have restricted my selection? Font families, that’s how.

By default, my editor was rendering text against a group of several different 65,536-glyph-limited fonts. What you refer to as a single font may in fact be a family of several different font files. Take Courier New for example:

  • Courier New Regular
  • Courier New Italic
  • Courier New Bold
  • Courier New Bold Italic

When I specified arial.ttf I was cutting that down to just one font. So what happens if we mix several unrelated fonts together to check against?

list_of_fonts = [
    TTFont(file=Path("C:\\Windows\\fonts\\arial.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\calibri.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\cour.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\ebrima.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\LeelawUI.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\micross.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\taile.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\msyi.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\Nirmala.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\segoeui.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\seguihis.ttf")),
    TTFont(file=Path("C:\\Windows\\fonts\\seguisym.ttf")),
]

And looping over the fonts:

for unicode_range in xid_start:
    unicode_range = unicode_range.split("..")
    first_number = int(unicode_range[0], 16)
    last_number = int(unicode_range[-1], 16)
    for glyph in range(first_number, last_number + 1):
        # Stop at the first font that contains the glyph
        for font in list_of_fonts:
            if char_in_font(chr(glyph), font):
                xid_start_glyphs[glyph] = chr(glyph)
                break

for unicode_range in xid_continue:
    unicode_range = unicode_range.split("..")
    first_number = int(unicode_range[0], 16)
    last_number = int(unicode_range[-1], 16)
    for glyph in range(first_number, last_number + 1):
        for font in list_of_fonts:
            if char_in_font(chr(glyph), font):
                xid_continue_glyphs[glyph] = chr(glyph)
                break

Doing this, we get… 10038 and 11177! It’s not all of them, and there are still some non-displayable characters, but we’re heading in the right direction, and we’re left with clues about how to find the rest: if we open up the fonts folder, we can see that different fonts are designed to support different languages.
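The underlying idea is simple set union: each font's cmap covers some set of code points, and a character counts as displayable if any font in the collection covers it. Here's a font-free sketch with hand-made sets standing in for real cmaps (the names and ranges are illustrative, not actual font contents):

```python
# Hypothetical stand-ins for two fonts' unicode cmaps.
latin_cmap = set(range(0x0041, 0x005B)) | set(range(0x0061, 0x007B))  # A-Z, a-z
cjk_cmap = set(range(0x4E00, 0x4E10))  # a tiny slice of CJK ideographs

def displayable(code_point, cmaps):
    # Displayable if any cmap in the collection contains the code point
    return any(code_point in cmap for cmap in cmaps)

print(displayable(ord("A"), [latin_cmap]))          # covered by Latin alone
print(displayable(0x4E00, [latin_cmap]))            # missing from Latin
print(displayable(0x4E00, [latin_cmap, cjk_cmap]))  # covered by the union
```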

Now all we need to do is make sure we have a font that handles each language in the xid_start and xid_continue lists and we should be good! Sublime Text 3, the editor I’m using, is clearly drawing on a more expansive set of fonts than my cobbled-together baker’s dozen. Using a script, I grab the path of every font installed on my system and output it to a list.

import matplotlib.font_manager
from pprint import pprint

fonts = matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')
pprint(fonts)

Important note: as of this writing, matplotlib has not yet been updated for Python 3.9, so the script was invoked with Python 3.7. This ends up producing a list of fonts over 700 lines long, which I then load using the fontTools library.

from fontTools.ttLib.ttFont import TTFont
from pathlib import Path

list_of_fonts = [
    TTFont(file=Path("C:\\Windows\\Fonts\\arial.ttf")),
    TTFont(file=Path("C:\\Windows\\Fonts\\corbelli.ttf")),
    TTFont(file=Path("C:\\Windows\\Fonts\\tahomabd.ttf")),
    TTFont(file=Path("C:\\Windows\\Fonts\\RAGE.TTF")),
    TTFont(file=Path("C:\\Windows\\Fonts\\LATINWD.TTF")),
    ...
]

From here I can use xid_start, xid_continue, and list_of_fonts to find all the characters my computer is unable to display and which Unicode block they belong to. I can cross-reference the resulting character codes with the Unicode specification I used at the beginning of this article to figure out the name of each block.

import unicodedata

xid_start_displayable = {}
xid_start_non_displayable = {}

for unicode_range in xid_start:
    print(unicode_range, f"{xid_start.index(unicode_range)+1}/{len(xid_start)+1}")
    unicode_range_list = unicode_range.split("..")
    xid_start_range = int(unicode_range_list[0], 16)
    xid_end_range = int(unicode_range_list[-1], 16)
    for character in range(xid_start_range, xid_end_range + 1):
        if character in xid_start_displayable or character in xid_start_non_displayable:
            continue
        try:
            character_name = unicodedata.name(chr(character))
        except ValueError:
            character_name = ""
        entry = [
            chr(character),
            chr(character).encode("raw_unicode_escape"),
            character_name,
        ]
        # A character is displayable if any font in the list contains it
        if any(char_in_font(chr(character), font) for font in list_of_fonts):
            xid_start_displayable[character] = entry
        else:
            xid_start_non_displayable[character] = entry

    # Print the non-displayable characters found in this range
    for key in xid_start_non_displayable:
        if xid_start_range <= key <= xid_end_range:
            print(chr(key))
yields

...
16F50 609/701
𖽐
16F93..16F9F 610/701
𖾓
𖾔
𖾕
𖾖
𖾗
𖾘
𖾙
𖾚
𖾛
𖾜
𖾝
𖾞
𖾟
16FE0..16FE1 611/701
𖿠
𖿡
16FE3 612/701
𖿣
...

At that point, it’s just a quick Google search away to find the fonts that have those characters!

http://www.fileformat.info was extremely helpful in this regard as it allowed me to type in a character code to find fonts that supported it.

Fonts used:

  • Unifont Medium
  • Everson Mono
  • Last Resort
    • I used this font literally as a last resort since there were a TON of characters that simply weren’t in any font that I could find
  • Unifont Upper Medium
  • sim-ch_n5100
    • Unicode 13.0.0 was first released in March 2020, so I can’t say I’m surprised that there were virtually no fonts that supported the newly released characters

I ran the categorization script after adding each newly installed font’s file path to my list of fonts to check against. Working this way, I was quickly able to whittle down what I still needed until I had complete Unicode coverage.

Unfortunately, Sublime Text 3 (and Visual Studio, for that matter) doesn’t allow you to reference multiple font families, so even though I have all the requisite fonts installed, I’m still unable to view all the characters at once. The only way around this seems to be a feature request, which doesn’t look like it’s happening anytime soon. I can still view every character; it just requires changing my font.

And after all that, here’s a list of fun unicode characters to try out in production:

  • 1589: ["ص", b"\u0635", "ARABIC LETTER SAD"],
  • 1590: ["ض", b"\u0636", "ARABIC LETTER DAD"],
    • There are a lot of characters in Arabic (the 2 above included) which, when rendered in Sublime Text, contain an abnormally large space to the left, shoving the character visually outside of the quotation. This lets you place one character on top of another.
  • 3424: ["ൠ", b"\u0d60", "MALAYALAM LETTER VOCALIC RR"]
  • 4138: ["ဪ", b"\u102a", "MYANMAR LETTER AU"]
  • 4447: ["ᅟ", b"\u115f", "HANGUL CHOSEONG FILLER"]
    • No, I did not forget to place a character, it’s just not visible
  • 5158: ["ᐦ", b"\u1426", "CANADIAN SYLLABICS FINAL DOUBLE SHORT VERTICAL STROKES"]
  • 5171: ["ᐳ", b"\u1433", "CANADIAN SYLLABICS PO"]
  • 5176: ["ᐸ", b"\u1438", "CANADIAN SYLLABICS PA"]
  • 8505: ["ℹ", b"\u2139", "INFORMATION SOURCE"]
    • Variables contain info, do they not?
  • 12339: ["〳", b"\u3033", "VERTICAL KANA REPEAT MARK UPPER HALF"]
  • 12341: ["〵", b"\u3035", "VERTICAL KANA REPEAT MARK LOWER HALF"]
  • 12484: ["ツ", b"\u30c4", "KATAKANA LETTER TU"]
  • 73776: ["𒀰", b"\U00012030", "CUNEIFORM SIGN AN PLUS NAGA OPPOSING AN PLUS NAGA"]
    • The cuneiform blocks (hex blocks 12000-12399, 12400-1246E, 12480-1254E) are FILLED with shit like this
  • 74482: ["𒋲", b"\U000122f2", "CUNEIFORM SIGN TAB SQUARED"]
    • Deus Vult
  • 74795: ["𒐫", b"\U0001242b", "CUNEIFORM NUMERIC SIGN NINE SHAR2"]
  • 78063: ["𓃯", b"\U000130ef", "EGYPTIAN HIEROGLYPH E025"]
  • 120433: ["𝙱", b"\U0001d671", "MATHEMATICAL MONOSPACE CAPITAL B"]

I only pulled from the xid_start characters in the list above, and only from fonts I didn’t need to download, so that they would render for the greatest number of people. Honorable mention to the surprising number of phallic-looking symbols (looking at you, cuneiform) and the crazy number of characters that look exactly like other characters; it’s easy to see why Unicode abuse in web URLs is so common. Also, in case it wasn’t obvious: please never use these in production or any other serious project. It won’t end well.
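One more caveat before anyone gets clever: Python normalizes identifiers to NFKC form (PEP 3131), so some of the prettier characters above silently alias plain ASCII names. For example, the monospace B and the information source both collapse to ordinary letters:

```python
import unicodedata

# 𝙱 (MATHEMATICAL MONOSPACE CAPITAL B) and ℹ (INFORMATION SOURCE) both
# NFKC-normalize to ASCII, so  𝙱 = 3  actually binds the plain name B.
print(unicodedata.normalize("NFKC", "𝙱"))  # B
print(unicodedata.normalize("NFKC", "ℹ"))  # i
```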

The code and formatted lists of characters for this project can be found on my GitHub here.