To support Unicode characters, the reader has been extended to
recognize characters written in hexadecimal. Thus #\U+41 is the
ASCII capital letter “A”, since 41 is the hexadecimal code for
that letter. The Unicode name of the character is also recognized,
except that spaces in the name are replaced by underscores.
Recall, however, that characters in CMUCL are only 16 bits long, so many Unicode characters cannot be represented as a single character. Strings, on the other hand, can represent all Unicode characters.
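The following sketch illustrates how a code point that does not fit in 16 bits can still be held in a string. It assumes the string stores such a code point as a UTF-16 surrogate pair, which the text above does not state explicitly. The helper name surrogate-pair is hypothetical and is not part of CMUCL.

```lisp
;; Sketch only: assumes UTF-16 storage for code points above #xFFFF.
;; SURROGATE-PAIR is a hypothetical helper, not a CMUCL function.
(defun surrogate-pair (code)
  "Return the high and low UTF-16 surrogates for CODE as two values."
  (assert (<= #x10000 code #x10FFFF))
  (let ((offset (- code #x10000)))
    ;; The upper 10 bits select the high surrogate, the lower 10 bits the low one.
    (values (+ #xD800 (ldb (byte 10 10) offset))
            (+ #xDC00 (ldb (byte 10 0) offset)))))

;; The pair for U+1D11E is D834 DD1E.
(multiple-value-bind (high low) (surrogate-pair #x1D11E)
  (format t "~4,'0X ~4,'0X~%" high low))
```

Each value returned fits in 16 bits, so under this assumption the code point occupies two string elements rather than one.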
Note that, while not quite legal, #\U+ supports codepoints that need
more than 16 bits. Thus #\U+1d11e works even though
(code-char #x1d11e) signals an error. This is quite handy for
dealing with all Unicode characters. For example:
    CL-USER> (describe #\u+1d11e)
    #\𝄞 is a BASE-CHAR.
    Its code is #x1D11E.
    Its name is Musical_Symbol_G_Clef.
and also
    CL-USER> (describe #\Musical_Symbol_G_Clef)
    #\𝄞 is a BASE-CHAR.
    Its code is #x1D11E.
    Its name is Musical_Symbol_G_Clef.
Note: the displayed character may be incorrect. Also note that
describe reports the object as a base-char; this is clearly not
correct, since char-code-limit is 65536. Use this feature with care.
When symbols are read, the symbol name is converted to Unicode NFC
form before the symbol is interned into the package. Hence,
(symbol-name (intern "string")) may produce a string that is not
string= to "string". After both strings are converted to NFC form,
however, they are identical.
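As a rough way to observe this, the sketch below compares a string with the name of the symbol interned from it. The helper name-preserved-p and the variable *decomposed* are hypothetical names introduced only for illustration. The result for the decomposed string depends on the implementation and on whether it was built with Unicode support, so no output is shown for it.

```lisp
;; Sketch only: NAME-PRESERVED-P is a hypothetical helper, not a CMUCL function.
(defun name-preserved-p (string &optional (package *package*))
  "True if the name of the symbol interned from STRING is STRING= to STRING."
  (string= string (symbol-name (intern string package))))

;; "e" followed by U+0301 COMBINING ACUTE ACCENT: a decomposed (non-NFC)
;; spelling of U+00E9, built with CODE-CHAR so the source stays ASCII.
(defparameter *decomposed* (coerce (list #\e (code-char #x301)) 'string))

(print (name-preserved-p "FOO"))         ; plain ASCII is already in NFC
(print (name-preserved-p *decomposed*))  ; result depends on the implementation
```

If the name is normalized as described above, the second call returns false, because the symbol name then holds the single precomposed character U+00E9 rather than the two-character sequence.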