Querying the Unicode database

In a previous post, I referred to Unicode code points by their names. In fact, almost every assigned code point does have a name, and they’re often quite unwieldy. For that post, I wanted to choose character names that were especially long, and it wasn’t immediately obvious how to do so. Here’s how I figured it out.

The official source-of-truth for the Unicode database is the Unicode Character Database, or UCD. The database’s full documentation is available in Unicode Standard Annex #44, also known as Technical Report #44, which weighs in at about forty-six pages. Luckily, we don’t need too much of it. Section 5.1 lists all of the properties of a Unicode code point one might possibly be interested in, and right at the top is “Name.” Clicking on “Name” reveals that the normative reference for the Name property is UnicodeData.txt. Don’t close TR-44 just yet, though — we’ll need it again soon.

The original UCD page indicated that the latest version of the UCD itself could be found here. A couple subdirectories down is UnicodeData.txt, 1.7 MB of semicolon-separated data rows. UnicodeData.txt provides a lot of information, but we’re just interested in the Name field. Unfortunately, it’s clear from the first 48 lines that there’s something odd going on with the Name. Some rows appear to have the name in a field near the end:

0007;<control>;Cc;0;BN;;;;;N;BELL;;;;

while others show it in the second field:

0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;

TR-44 is here to help, though. Table 9 lists each field of UnicodeData.txt, and it has this to say about field 1, the field after the code point in hexadecimal:

When a string value not enclosed in <angle brackets> occurs in this field, it specifies the character’s Name property value, which matches exactly the name published in the code charts. The Name property value for most ideographic characters and for Hangul syllables is derived instead by various rules. See Section 4.8, Name in [Unicode] for a full specification of those rules. Strings enclosed in <angle brackets> in this field either provide label information used in the name derivation rules, or—in the case of characters which have a null string as their Name property value, such as control characters—provide other information about their code point type.

In typically long-winded fashion, TR-44 is saying that some characters have the null string as their Name, and that these characters carry some other value in angle brackets in this field instead. But BELL, above, does appear for the 0007 row, just much later. Some careful counting shows that it’s in field 10 (counting from zero, as TR-44 does). What does TR-44 have to say about that field, called Unicode_1_Name?

Old name as published in Unicode 1.0 or ISO 6429 names for control functions. This field is empty unless it is significantly different from the current name for the character. No longer used in code chart production. See Name_Alias.

Ah, that sounds perfect. Though field 10 is Informative rather than Normative, we should be able to get away with using it when the Name field isn’t helpful. Let’s get started.
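Before downloading anything, the “careful counting” is easy to double-check. The sketch below splits the BELL row from above on semicolons and prints every non-empty field with both its TR-44 index (counting from zero) and the awk variable that holds it (counting from one). The variable names row and numbered are just placeholders for this illustration.

```shell
# Sample row copied from the listing above, so no download is needed.
row='0007;<control>;Cc;0;BN;;;;;N;BELL;;;;'

# For each non-empty field, print the TR-44 index (0-based) next to
# the awk field variable (1-based) and the field's value.
numbered=$(printf '%s\n' "$row" | awk -F ';' '{
    for (i = 1; i <= NF; i++)
        if ($i != "") printf "field %d ($%d): %s\n", i - 1, i, $i
}')
printf '%s\n' "$numbered"
```

The last line of output is “field 10 ($11): BELL”, which is why the awk commands that follow use $2 for Name and $11 for Unicode_1_Name.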

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
head

0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;
0003;<control>;Cc;0;BN;;;;;N;END OF TEXT;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0005;<control>;Cc;0;BN;;;;;N;ENQUIRY;;;;
0006;<control>;Cc;0;BN;;;;;N;ACKNOWLEDGE;;;;
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;

OK, we can download it no problem. Let’s print the Name…

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
awk -F ';' '{ print $2 }' | \
head -5

<control>
<control>
<control>
<control>
<control>

Not especially exciting, but looks right. We can try some characters a little farther along with sed:

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
awk -F ';' '{ print $2 }' | \
sed '49,54!d'

DIGIT ZERO
DIGIT ONE
DIGIT TWO
DIGIT THREE
DIGIT FOUR
DIGIT FIVE

Now let’s try to deal with those <control> characters. First, let’s handle the case where the name doesn’t include a <:

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
awk -F ';' '!match($2, "<") { print $2 }' | \
head -5

SPACE
EXCLAMATION MARK
QUOTATION MARK
NUMBER SIGN
DOLLAR SIGN

And now we just have to deal with the case where it does:

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
awk -F ';' '!match($2, "<") { print $2; next } { print $11 }' | \
head -5

NULL
START OF HEADING
START OF TEXT
END OF TEXT
END OF TRANSMISSION
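One caveat: not every bracketed row has an old name to fall back on. Range markers, such as the row that opens the CJK ideograph block, have an empty field 10, so the command above prints a blank line for them. That’s harmless when hunting for the longest names, but if the blanks are unwanted, a variant along these lines should skip them. It’s a sketch run on three hand-picked sample rows (the third is how I’d expect a range-marker row to look), and it tests for a leading < with a regex rather than match(), which should behave the same on this data.

```shell
# Three sample rows: a control character, an ordinary character,
# and a range marker whose field 10 is empty.
sample='0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;'

names=$(printf '%s\n' "$sample" | awk -F ';' '
    $2 !~ /^</ { print $2; next }   # ordinary Name in field 1
    $11 != ""  { print $11 }        # bracketed label: use field 10 if set
')
printf '%s\n' "$names"
```

This prints BELL and EXCLAMATION MARK and silently drops the range marker.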

Now that we’ve got a great long list of names, let’s also include the lengths and code points:

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
awk -F ';' '!match($2, "<") { print $1, length($2), $2; next } { print $1, length($11), $11 }' | \
head -5

0000 4 NULL
0001 16 START OF HEADING
0002 13 START OF TEXT
0003 11 END OF TEXT
0004 19 END OF TRANSMISSION

Almost done. We can use sort to get the top N longest names. -r and -n give us reversed numeric sorting, and -k2 starts the sort key at the second field, in this case the length of the name:

$ curl -s https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt | \
awk -F ';' '!match($2, "<") { print $1, length($2), $2; next } { print $1, length($11), $11 }' | \
sort -rnk2 | \
head

FBF9 83 ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM
FBFB 82 ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA INITIAL FORM
FBFA 80 ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA FINAL FORM
1F502 78 CLOCKWISE RIGHTWARDS AND LEFTWARDS OPEN CIRCLE ARROWS WITH CIRCLED ONE OVERLAY
0753 75 ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE
2B83 74 DOWNWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF UPWARDS TRIANGLE-HEADED ARROW
2B81 74 UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW
2965 73 DOWNWARDS HARPOON WITH BARB LEFT BESIDE DOWNWARDS HARPOON WITH BARB RIGHT
2969 72 RIGHTWARDS HARPOON WITH BARB DOWN ABOVE LEFTWARDS HARPOON WITH BARB DOWN
2967 72 LEFTWARDS HARPOON WITH BARB DOWN ABOVE RIGHTWARDS HARPOON WITH BARB DOWN

And there’s the answer for “what are the Unicode characters with the longest names?” All it took was a bit of documentation and some classic text processing.
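If you’d rather not hit unicode.org on every run, the same pipeline can be wrapped in a small function that reads rows from standard input, so it works equally well on a saved copy of the file. This is only a sketch: longest_names is a name I made up, the optional argument is the number of rows to keep, and -k2,2 limits the sort key to the length column alone. It’s demonstrated on three sample rows rather than the real file.

```shell
# Hypothetical helper: print the N longest names (default 10) from
# UnicodeData.txt-style rows on stdin, as "codepoint length name".
longest_names() {
    awk -F ';' '
        $2 !~ /^</ { print $1, length($2), $2; next }
        { print $1, length($11), $11 }
    ' | sort -rn -k2,2 | head -n "${1:-10}"
}

sample='0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;'

top=$(printf '%s\n' "$sample" | longest_names 2)
printf '%s\n' "$top"
```

On the sample rows this prints “0004 19 END OF TRANSMISSION” followed by “0021 16 EXCLAMATION MARK”. With a local copy saved as UnicodeData.txt, the equivalent of the final command above would be longest_names < UnicodeData.txt.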