13.1.1 Design Choices

There are several ways to support Unicode. One choice is to support both an 8-bit base-char and a 21-bit (or larger) character, since Unicode codepoints use 21 bits. This generally makes strings much larger, and it complicates the compiler, which must support both the base-char and character types and the corresponding string types. It also burdens the user, who must understand the differences between the string and character types.

Another choice is to have just one character type and one string type, able to hold any Unicode codepoint. This simplifies the compiler and reduces the burden on the user, but it significantly increases memory usage for strings.
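To put a rough number on the memory cost, the following sketch compares the storage for the same text at one byte per character and at four bytes per character. It is an illustration in Python, not CMUCL code, and it assumes that a 21-bit character would be stored in a 32-bit slot, as UTF-32 does; the sample text is arbitrary.

```python
# Assumption: each 21-bit character occupies a full 32-bit slot (as in UTF-32).
text = "Hello, world"

# 8-bit characters: one byte per character.
bytes_8bit = len(text.encode("latin-1"))

# 32-bit characters: four bytes per character ("-le" avoids a byte-order mark).
bytes_32bit = len(text.encode("utf-32-le"))

print(bytes_8bit, bytes_32bit)
```

Under that assumption, every string takes four times the space of its 8-bit form, even when it contains only ASCII text.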

The solution chosen by CMUCL is to trade off size against complexity by having only 16-bit characters. Most of the important languages can be encoded using only 16 bits; the remaining codepoints are largely for rare languages or ancient scripts. Thus, memory usage is significantly lower than with 21-bit characters, while the most important languages are still supported. Compiler complexity is also reduced, since base-char and character are the same type, as are the string types. But we still want to support the full Unicode character set. This is achieved by making strings UTF-16 strings internally. Hence, Lisp strings are UTF-16 strings, and Lisp characters are UTF-16 code units.
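The practical consequence of using UTF-16 code units as characters can be sketched as follows. This is an illustration in Python, not CMUCL code; the helper names utf16_code_units and to_surrogate_pair are hypothetical and exist only for this example. A codepoint that fits in 16 bits is a single code unit, while a codepoint above U+FFFF is represented by two code units, called a surrogate pair.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into its (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = codepoint - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


# U+03BB (Greek small letter lambda) fits in 16 bits: one code unit.
bmp = utf16_code_units("\u03bb")

# U+10400 (a Deseret letter) does not fit in 16 bits: two code units.
astral = utf16_code_units("\U00010400")

print([hex(u) for u in bmp], [hex(u) for u in astral])
```

Since Lisp characters are UTF-16 code units, it follows that a codepoint such as U+10400 would occupy two characters in a Lisp string, so the length of a string can exceed the number of codepoints it contains.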