There are several ways to support Unicode. One choice is to support
both an 8-bit base-char and a 21-bit (or larger) character, since
Unicode codepoints use 21 bits. This generally means strings are much
larger, and it complicates the compiler, which must support both the
base-char and character types and the corresponding string types. It
also burdens the user, who must understand the difference between the
various string and character types.
Another choice is to have just one character type and one string type, each able to hold any Unicode codepoint. This simplifies the compiler and reduces the burden on the user, but it significantly increases memory usage for strings.
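
The memory cost of wide characters can be made concrete with a small sketch. It is written in Python purely for illustration and is not CMUCL code; the sample string and variable names are assumptions. It compares the storage needed for the same text when each character occupies 8 bits and when each occupies 32 bits, the usual practical width for holding a full 21-bit codepoint.

```python
# Illustrative only: compare fixed-width storage sizes for the same text.
text = "hello, world"  # hypothetical sample; every character fits in 8 bits

# One 8-bit unit per character (Latin-1).
narrow_bytes = len(text.encode("latin-1"))

# One 32-bit unit per character (UTF-32, little-endian, no byte-order mark).
wide_bytes = len(text.encode("utf-32-le"))

print(narrow_bytes, wide_bytes)
```

For text that fits in 8 bits, the wide representation takes four times the space, which is the cost the single-type approach pays for every string.
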
The solution chosen by CMUCL is to trade off size against complexity
by having only 16-bit characters. Most of the important languages can
be encoded using only 16 bits; the rest of the codepoints are mostly
for rare languages or ancient scripts. Thus, memory usage is
significantly reduced while still supporting the most important
languages. Compiler complexity is also reduced, since base-char and
character are the same type, as are the string types. But we still
want to support the full Unicode character set. This is achieved by
making strings be UTF-16 strings internally. Hence, Lisp strings are
UTF-16 strings, and Lisp characters are UTF-16 code units.
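
To show what a UTF-16 code unit is in practice, here is a minimal sketch in Python (for illustration only; it is not CMUCL code, and the helper names `utf16_code_units` and `to_surrogates` are hypothetical). A codepoint that fits in 16 bits is stored as a single code unit, while a codepoint above U+FFFF is stored as a pair of surrogate code units.

```python
def utf16_code_units(s):
    """Return the list of 16-bit code units in the UTF-16 encoding of s."""
    data = s.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogates(codepoint):
    """Split a codepoint above U+FFFF into (high, low) surrogate code units."""
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# U+00E9 fits in 16 bits: one code unit.
print([hex(u) for u in utf16_code_units("\u00e9")])

# U+1D11E (MUSICAL SYMBOL G CLEF) does not: two code units.
print([hex(u) for u in utf16_code_units("\U0001D11E")])
print([hex(u) for u in to_surrogates(0x1D11E)])
```

Under the scheme described above, where each Lisp character is one code unit, a codepoint of the second kind would be expected to occupy two character positions in a string.
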