13.3.5 Reader

To support Unicode characters, the reader has been extended to recognize characters written as hexadecimal codepoints after the #\U+ prefix. Thus #\U+41 is the ASCII capital letter “A”, since 41 is the hexadecimal code for that letter. The reader also recognizes the Unicode name of a character, with the spaces in the name replaced by underscores.
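As a quick check of the two notations, the following forms can be evaluated at the REPL. This is only a sketch: the name Latin_Capital_Letter_A does not appear in the text above; it is derived from the Unicode name LATIN CAPITAL LETTER A by the underscore rule, so treat it as an assumption.

```lisp
;; 41 hexadecimal is 65 decimal, the code of capital A.
(char-code #\U+41)                     ; => 65
(char= #\U+41 #\A)                     ; => T

;; Assumed name: the Unicode name "LATIN CAPITAL LETTER A" with
;; spaces replaced by underscores. Character names are read
;; case-insensitively.
(char= #\Latin_Capital_Letter_A #\A)
```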

Recall, however, that characters in CMUCL are only 16 bits long, so Unicode characters with codepoints above #xFFFF cannot be represented as single characters. Strings, on the other hand, can represent all Unicode characters.
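One way a string of 16-bit characters can still hold every codepoint is the UTF-16 surrogate-pair scheme. The sketch below assumes that strings are stored as UTF-16, which this section does not itself state. The function utf16-surrogates is a hypothetical helper written for illustration, not part of CMUCL; it uses only standard integer arithmetic.

```lisp
(defun utf16-surrogates (code)
  "Return the high and low UTF-16 surrogates for CODE as two values.
CODE must lie outside the 16-bit range, between #x10000 and #x10FFFF."
  (assert (<= #x10000 code #x10FFFF))
  (let ((offset (- code #x10000)))
    ;; The top 10 bits of the offset go into the high surrogate,
    ;; the bottom 10 bits into the low surrogate.
    (values (+ #xD800 (ldb (byte 10 10) offset))
            (+ #xDC00 (ldb (byte 10 0) offset)))))

;; U+1D11E (Musical Symbol G Clef) splits into #xD834 and #xDD1E.
(utf16-surrogates #x1d11e)             ; => 55348, 56606
```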

Note that, while not quite legal, #\U+ accepts codepoints that do not fit in 16 bits. Thus #\U+1d11e works even though (code-char #x1d11e) signals an error. This is quite handy for dealing with all Unicode characters. For example:

CL-USER> (describe #\u+1d11e)
#\𝄞 is a BASE-CHAR.
Its code is #x1D11E.
Its name is Musical_Symbol_G_Clef.

and also

CL-USER> (describe #\Musical_Symbol_G_Clef)
#\𝄞 is a BASE-CHAR.
Its code is #x1D11E.
Its name is Musical_Symbol_G_Clef.

Note: the displayed character may be incorrect. Also note that describe reports the character as a base-char; this cannot be correct, since char-code-limit is 65536 and the code #x1D11E exceeds it.

Use this feature with care.
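Code that builds characters from codepoints at run time cannot rely on the reader extension and has to allow for code-char failing. The following is a minimal sketch using only standard condition handling; try-code-char is a hypothetical helper name, not a CMUCL function.

```lisp
(defun try-code-char (code)
  "Return the character for CODE, or NIL if CODE-CHAR signals an error."
  (handler-case (code-char code)
    (error (condition)
      ;; Report the failure and fall back to NIL instead of
      ;; entering the debugger.
      (format t "code-char failed for #x~X: ~A~%" code condition)
      nil)))

(try-code-char #x41)      ; a codepoint that fits in 16 bits
(try-code-char #x1d11e)   ; the codepoint the text above says signals an error
```

Returning NIL lets the caller decide how to handle codepoints that do not fit in a single character, for example by skipping them or by working with strings instead.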

When symbols are read, the symbol name is converted to Unicode NFC form before the symbol is interned into the package. Hence, (symbol-name (intern "string")) may produce a string that is not string= to “string”. However, once the original string is also converted to NFC form, the two strings are identical.
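The effect can be observed with a name that is not already in NFC form. The sketch below reads a decomposed spelling, “E” followed by U+0301 (combining acute accent), whose NFC form is the single character U+00C9. The names *decomposed* and read-symbol-name are hypothetical, introduced only for this illustration, and the sketch assumes the combining character is accepted as part of a symbol name. If the reader normalizes as described, the symbol name has length 1 and is not string= to the original two-character string.

```lisp
;; "E" followed by U+0301 COMBINING ACUTE ACCENT: two characters,
;; the decomposed spelling of U+00C9.
(defparameter *decomposed*
  (coerce (list #\E (code-char #x301)) 'string))

(defun read-symbol-name (string)
  "Read STRING as a symbol with the standard reader and return its name."
  (symbol-name (read-from-string string)))

(length *decomposed*)                                   ; => 2
;; Compare the name the reader produced with the original string.
(length (read-symbol-name *decomposed*))
(string= *decomposed* (read-symbol-name *decomposed*))
```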