 
 
 
CMUCL supports internationalization by supporting Unicode
characters internally and by adding support for external formats to
convert from the internal format to an appropriate external character
coding format.
To understand the support for Unicode, we refer the reader to the
Unicode standard.
To support internationalization, the following changes to Common Lisp
functions have been made.
To support Unicode, there are many approaches. One choice is to
support both 8-bit base-char and a 21-bit (or larger)
character since Unicode codepoints use 21 bits. This generally
means strings are much larger, and complicates the compiler by having
to support both base-char and character types and the
corresponding string types. This also adds complexity for the user to
understand the difference between the different string and character
types.
Another choice is to have just one character and string type that can
hold the entire Unicode codepoint. While simplifying the compiler and
reducing the burden on the user, this significantly increases memory
usage for strings.
The solution chosen by CMUCL is to trade off size and complexity
by having only 16-bit characters. Most of the important languages can
be encoded using only 16 bits. The rest of the codepoints are for
rare languages or ancient scripts. Thus, the memory usage is
significantly reduced while still supporting the most important
languages. Compiler complexity is also reduced since base-char
and character are the same, as are the string types. But we
still want to support the full Unicode character set. This is
achieved by making strings be UTF-16 strings internally. Hence, Lisp
strings are UTF-16 strings, and Lisp characters are UTF-16 code units.
Characters are now 16 bits long instead of 8 bits, and base-char
and character types are the same. This difference is
naturally indicated by changing char-code-limit from 256 to
65536.
In CMUCL there is only one type of string---base-string and
string are the same. 
Internally, the strings are encoded using UTF-16. This means that in
some rare cases the number of Lisp characters in a string is not the
same as the number of codepoints in the string.
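As a small illustration of this difference, the sketch below builds a one-codepoint string by hand. It assumes a Unicode-enabled CMUCL; the variable name *clef* is only for this example.

```lisp
;; U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in 16 bits, so it is
;; stored as the surrogate pair #xD834 #xDD1E.
(defparameter *clef* (map 'string #'code-char '(#xD834 #xDD1E)))

;; LENGTH counts Lisp characters (UTF-16 code units), not codepoints,
;; so this one-codepoint string has length 2.
(length *clef*)

;; Every Lisp character fits in 16 bits.
char-code-limit
```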
To communicate with the external world, CMUCL uses external formats
to convert between external character encodings and
CMUCL's string format. The external format is specified in several
ways. The standard streams *standard-input*,
*standard-output*, and *standard-error* take the format
from the value specified by *default-external-format*. The
default value of *default-external-format* is :iso8859-1.
For files, OPEN takes the :external-format
parameter to specify the format. The default external format is
:default. 
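The following sketch shows the :external-format parameter in use. The file name is a hypothetical scratch path chosen for this example; any writable path would do.

```lisp
(defparameter *demo-file* "/tmp/cmucl-extfmt-demo.txt") ; hypothetical path

;; Write U+00E9 (LATIN SMALL LETTER E WITH ACUTE) as UTF-8.
(with-open-file (out *demo-file* :direction :output
                                 :if-exists :supersede
                                 :external-format :utf-8)
  (write-line (string (code-char #xE9)) out))

;; Reading with the same format gives back one character.
(defparameter *as-utf8*
  (with-open-file (in *demo-file* :external-format :utf-8)
    (read-line in)))

;; Reading with :iso8859-1 maps each of the two UTF-8 octets
;; (#xC3 #xA9) to its own character.
(defparameter *as-latin1*
  (with-open-file (in *demo-file* :external-format :iso8859-1)
    (read-line in)))
```

Comparing (length *as-utf8*) with (length *as-latin1*) shows why the external format used for reading has to match the one used for writing.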
 [Function]
stream:set-system-external-format terminal &optional filenames    
 
 
 This function changes the external format used for
 *standard-input*, *standard-output*, and
 *standard-error* to the external format specified by
 terminal. Additionally, the Unix file name encoding can be
 set to the value specified by filenames if non-nil.
13.2.1 Available External Formats
The available external formats are listed below in
Table 13.1. The first column gives the
external format, and the second column gives a list of aliases that
can be used for this format. The set of aliases can be changed by
changing the aliases file.
For all of these formats, if an illegal sequence is encountered, no
error or warning is signaled. Instead, the offending sequence is
silently replaced with the Unicode REPLACEMENT CHARACTER (U+FFFD).
 
| Format | Aliases | Description | 
| :iso8859-1 | :latin1 :latin-1 :iso-8859-1 | ISO8859-1 | 
| :iso8859-2 | :latin2 :latin-2 :iso-8859-2 | ISO8859-2 | 
| :iso8859-3 | :latin3 :latin-3 :iso-8859-3 | ISO8859-3 | 
| :iso8859-4 | :latin4 :latin-4 :iso-8859-4 | ISO8859-4 | 
| :iso8859-5 | :cyrillic :iso-8859-5 | ISO8859-5 | 
| :iso8859-6 | :arabic :iso-8859-6 | ISO8859-6 | 
| :iso8859-7 | :greek :iso-8859-7 | ISO8859-7 | 
| :iso8859-8 | :hebrew :iso-8859-8 | ISO8859-8 | 
| :iso8859-9 | :latin5 :latin-5 :iso-8859-9 | ISO8859-9 | 
| :iso8859-10 | :latin6 :latin-6 :iso-8859-10 | ISO8859-10 | 
| :iso8859-13 | :latin7 :latin-7 :iso-8859-13 | ISO8859-13 | 
| :iso8859-14 | :latin8 :latin-8 :iso-8859-14 | ISO8859-14 | 
| :iso8859-15 | :latin9 :latin-9 :iso-8859-15 | ISO8859-15 | 
| :utf-8 | :utf :utf8 | UTF-8 | 
| :utf-16 | :utf16 | UTF-16 with optional BOM | 
| :utf-16-be | :utf-16be :utf16-be | UTF-16 big-endian (without BOM) | 
| :utf-16-le | :utf-16le :utf16-le | UTF-16 little-endian (without BOM) | 
| :utf-32 | :utf32 | UTF-32 with optional BOM | 
| :utf-32-be | :utf-32be :utf32-be | UTF-32 big-endian (without BOM) | 
| :utf-32-le | :utf-32le :utf32-le | UTF-32 little-endian (without BOM) | 
| :cp1250 |  |  | 
| :cp1251 |  |  | 
| :cp1252 | :windows-1252 :windows-cp1252 :windows-latin1 |  | 
| :cp1253 |  |  | 
| :cp1254 |  |  | 
| :cp1255 |  |  | 
| :cp1256 |  |  | 
| :cp1257 |  |  | 
| :cp1258 |  |  | 
| :koi8-r |  |  | 
| :mac-cyrillic |  |  | 
| :mac-greek |  |  | 
| :mac-icelandic |  |  | 
| :mac-latin2 |  |  | 
| :mac-roman |  |  | 
| :mac-turkish |  |  | 
 
Table 13.1: External formats
 
13.2.2 Composing External Formats
A composing external format is an external format that converts between
one codepoint and another, rather than between codepoints and octets.
A composing external format must be used in conjunction with another
(octet-producing) external format. This is specified by
using a list as the external format. For example, we can use
'(:latin1 :crlf) as the external format. In this
particular example, the external format is latin1, but whenever a
carriage-return/linefeed sequence is read, it is converted to the Lisp
#\Newline character. Conversely, whenever a string is written,
a Lisp #\Newline character is converted to a
carriage-return/linefeed sequence. Without the :crlf composing
format, the carriage-return and linefeed will be read in as separate
characters, and on output the Lisp #\Newline character is
output as a single linefeed character.
Table 13.2 lists the available composing formats.
 
| Format | Aliases | Description | 
| :crlf | :dos | Composing format for converting to/from DOS (CR/LF) end-of-line sequence to #\Newline | 
| :cr | :mac | Composing format for converting to/from Mac (CR) end-of-line sequence to #\Newline | 
| :beta-gk |  | Composing format that translates (lower-case) Beta code (an ASCII encoding of ancient Greek) | 
| :final-sigma |  | Composing format that attempts to detect sigma in word-final position and change it from U+3C3 to U+3C2 | 
 
Table 13.2: Composing external formats
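A minimal sketch of the '(:latin1 :crlf) example described above. The file name is a hypothetical scratch path; the file is written with the composing format and read back without it, so that the carriage-return and linefeed show up as separate characters.

```lisp
(defparameter *crlf-file* "/tmp/cmucl-crlf-demo.txt") ; hypothetical path

;; On output, #\Newline becomes a carriage-return/linefeed sequence.
(with-open-file (out *crlf-file* :direction :output
                                 :if-exists :supersede
                                 :external-format '(:latin1 :crlf))
  (write-string (format nil "a~%b") out))

;; Without :crlf, the two octets are read as two separate characters.
(defparameter *raw-codes*
  (with-open-file (in *crlf-file* :external-format :latin1)
    (loop for c = (read-char in nil)
          while c
          collect (char-code c))))

;; With :crlf, the sequence is folded back into a single #\Newline.
(defparameter *composed-codes*
  (with-open-file (in *crlf-file* :external-format '(:latin1 :crlf))
    (loop for c = (read-char in nil)
          while c
          collect (char-code c))))
```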
 
 
 [Variable]
extensions:*default-external-format*     
 
 
 This is the default external format to use for all newly opened
 files. It is also the default format to use for
 *standard-input*, *standard-output*, and
 *standard-error*. The default value is :iso8859-1.
Setting this will cause the standard streams to start using the new
 format immediately. If a stream has been created with external
 format :default, then setting *default-external-format*
 will cause all subsequent input and output to use the new value of
 *default-external-format*.
Remember that CMUCL's characters are only 16 bits long but Unicode
codepoints are up to 21 bits long. Hence there are codepoints that
cannot be represented via Lisp characters. Operating on individual
characters is not recommended. Operations on strings are better.
(This would be true even if CMUCL's characters could hold a
full Unicode codepoint.)
 [Function]
char-equal &rest characters    
 
 
 
 [Function]
char-not-equal &rest characters    
 
 
 [Function]
char-lessp &rest characters    
 
 
 [Function]
char-greaterp &rest characters    
 
 
 [Function]
char-not-greaterp &rest characters    
 
 
 [Function]
char-not-lessp &rest characters    
 
 For the comparison, the characters are converted to lowercase and
 the corresponding char-code values are compared.
 [Function]
alpha-char-p character    
 
 
 Returns non-nil if the Unicode category is a letter category.
 [Function]
alphanumericp character    
 
 
 Returns non-nil if the Unicode category is a letter category or an ASCII
 digit.
 [Function]
digit-char-p character &optional radix    
 
 
 Only recognizes ASCII digits (and ASCII letters if the radix is larger
 than 10).
 [Function]
graphic-char-p character    
 
 
 Returns non-nil if the Unicode category is a graphic category.
 [Function]
upper-case-p character    
 
 
 
 [Function]
lower-case-p character    
 
 Returns non-nil if the Unicode category is an uppercase
 (lowercase) character.
 [Function]
lisp:title-case-p character    
 
 
 Returns non-nil if the Unicode category is a titlecase character.
 [Function]
both-case-p character    
 
 
 Returns non-nil if the Unicode category is an uppercase,
 lowercase, or titlecase character.
 [Function]
char-upcase character    
 
 
 
 [Function]
char-downcase character    
 
 The Unicode uppercase (lowercase) letter is returned.
 [Function]
lisp:char-titlecase character    
 
 
 The Unicode titlecase letter is returned.
 [Function]
char-name char    
 
 
 If possible the name of the character char is returned. If
 there is a Unicode name, the Unicode name is returned, except
 spaces are converted to underscores and the string is capitalized
 via string-capitalize. If there is no Unicode name, the
 form #\U+xxxx is returned where "xxxx" is the
 char-code of the character, in hexadecimal.
 [Function]
name-char name    
 
 
 The inverse to char-name. If no character has the name
 name, then nil is returned. Unicode names are not
 case-sensitive, and spaces and underscores are optional.
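A short sketch of the naming rules described above, assuming the Unicode database (unidata.bin) is available so that Unicode names can be looked up.

```lisp
;; U+03B1 is GREEK SMALL LETTER ALPHA. Its CHAR-NAME is the Unicode name
;; with spaces replaced by underscores and capitalized.
(defparameter *alpha-name* (char-name (code-char #x3B1)))

;; NAME-CHAR is the inverse; the lookup is not case-sensitive.
(defparameter *alpha* (name-char "GREEK_SMALL_LETTER_ALPHA"))
```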
Strings in CMUCL are UTF-16 strings. That is, for Unicode code
points greater than 65535, surrogate pairs are used. We refer the
reader to the Unicode standard for more information about surrogate
pairs. We just want to make a note that because of the UTF-16
encoding of strings, there is a distinction between Lisp characters
and Unicode codepoints. The standard string operations know about
this encoding and handle the surrogate pairs correctly.
 [Function]
string-upcase string &key :start
 :end :casing    
 
 
 
 [Function]
string-downcase string &key :start
 :end :casing    
 
 
 [Function]
string-capitalize string &key :start
 :end :casing    
 
 The case of the string is changed appropriately. Surrogate
 pairs are handled correctly. The conversion to the appropriate case
 is done based on the Unicode conversion. The additional argument
 :casing controls how case conversion is done. The default
 value is :simple, which uses simple Unicode case conversion.
 If :casing is :full, then full Unicode case conversion is
 done where the string may actually increase in length.
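 As an illustration of the two :casing modes, the sketch below uses U+00DF (LATIN SMALL LETTER SHARP S), which has no single-character uppercase form in the simple Unicode mapping but uppercases to "SS" in the full mapping. It assumes the Unicode database is available; the character is built with code-char so the example does not depend on the encoding of the source file.

```lisp
(defparameter *word*
  (concatenate 'string "stra" (string (code-char #xDF)) "e"))

;; Simple casing maps character to character, so the length is unchanged.
(defparameter *simple* (string-upcase *word*))

;; Full casing may lengthen the string.
(defparameter *full* (string-upcase *word* :casing :full))
```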
 [Function]
nstring-upcase string &key :start :end    
 
 
 
 [Function]
nstring-downcase string &key :start :end    
 
 
 [Function]
nstring-capitalize string &key :start
 :end    
 
 The case of the string is changed appropriately. Surrogate
 pairs are handled correctly. The conversion to the appropriate case
 is done based on the Unicode conversion. (Full casing is not
 available because the string length cannot be increased when needed.)
 [Function]
string= s1 s2 &key :start1
 :end1 :start2 :end2    
 
 
 
 [Function]
string/= s1 s2 &key :start1 :end1 :start2 :end2    
 
 
 [Function]
string> s1 s2 &key :start1 :end1 :start2 :end2    
 
 
 [Function]
string>= s1 s2 &key :start1 :end1 :start2 :end2    
 
 The string comparison is done in codepoint order. (This is
 different from just comparing the order of the individual characters
 due to surrogate pairs.) Unicode collation is not done.
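 A small sketch of the difference between codepoint order and code-unit order. U+FF21 is an arbitrary 16-bit character above the surrogate range, and U+10000 is the first codepoint that needs a surrogate pair; the variable names are only for this example.

```lisp
(defparameter *bmp* (string (code-char #xFF21)))                  ; U+FF21
(defparameter *supp* (map 'string #'code-char '(#xD800 #xDC00)))  ; U+10000

;; Compared as individual characters, #xFF21 is greater than the
;; leading surrogate #xD800 ...
(char> (char *bmp* 0) (char *supp* 0))

;; ... but compared as strings, U+10000 is greater than U+FF21,
;; because the comparison is done in codepoint order.
(string> *supp* *bmp*)
```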
 [Function]
string-equal s1 s2 &key :start1
 :end1 :start2 :end2    
 
 
 
 [Function]
string-not-equal s1 s2 &key :start1 :end1 :start2 :end2    
 
 
 [Function]
string-lessp s1 s2 &key :start1 :end1 :start2 :end2    
 
 
 [Function]
string-greaterp s1 s2 &key :start1 :end1 :start2 :end2    
 
 
 [Function]
string-not-greaterp s1 s2 &key :start1 :end1 :start2 :end2    
 
 
 [Function]
string-not-lessp s1 s2 &key :start1 :end1 :start2 :end2    
 
 Each codepoint in each string is converted to lowercase and the
 appropriate comparison of the codepoint values is done. Unicode
 collation is not done.
 [Function]
string-left-trim bag string    
 
 
 
 [Function]
string-right-trim bag string    
 
 
 [Function]
string-trim bag string    
 
 Removes any characters in bag from the left, right, or both
 ends of the string string, respectively. This has potential
 problems if you want to remove a surrogate character from the
 string, since a single character cannot represent a surrogate. As
 an extension, if bag is a string, we properly handle
 surrogate characters in the bag.
Since strings are also sequences, the sequence functions can be used
on strings. We note here some issues with these functions. Most
issues are due to the fact that strings are UTF-16 strings and
characters are UTF-16 code units, not Unicode codepoints.
 [Function]
remove-duplicates sequence
 &key :from-end :test :test-not :start
 :end :key    
 
 
 
 [Function]
delete-duplicates sequence
 &key :from-end :test :test-not :start
 :end :key    
 
 Because of surrogate pairs these functions may remove a high or low
 surrogate value, leaving the string in an invalid state. Use these
 functions carefully with strings.
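 The sketch below shows how this can happen. The two codepoints U+1D11E and U+1D122 share the same high surrogate (#xD834), which remove-duplicates treats as a repeated element.

```lisp
;; U+1D11E = #xD834 #xDD1E, U+1D122 = #xD834 #xDD22
(defparameter *two-symbols*
  (map 'string #'code-char '(#xD834 #xDD1E #xD834 #xDD22)))

;; The earlier #xD834 is discarded, leaving the code units
;; #xDD1E #xD834 #xDD22: an unpaired low surrogate followed by one
;; valid pair.
(defparameter *damaged*
  (map 'list #'char-code (remove-duplicates *two-symbols*)))
```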
To support Unicode characters, the reader has been extended to
recognize characters written in hexadecimal. Thus #\U+41 is
the ASCII capital letter "A", since 41 is the hexadecimal code for
that letter. The Unicode name of the character is also recognized,
except spaces in the name are replaced by underscores.
Recall, however, that characters in CMUCL are only 16 bits long so
many Unicode characters cannot be represented. However, strings can
represent all Unicode characters.
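For example, the following forms use both notations. This is a sketch that assumes the Unicode database is available for the name lookup.

```lisp
;; Hexadecimal notation: U+41 is the letter A.
(char= #\U+41 #\A)

;; Unicode name, with underscores in place of spaces.
(char-code #\Greek_Small_Letter_Alpha)
```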
When symbols are read, the symbol name is converted to Unicode NFC
form before interning the symbol into the package. Hence,
(symbol-name (intern "string")) may produce a string that is
not string= to "string". However, after conversion to NFC
form, the strings will be identical.
When printing characters, if the character is a graphic character, the
character is printed. Thus #\U+41 is printed as
#\A. If the character is not a graphic character, the Lisp
name (e.g., #\Tab) is used if possible;
if there is no Lisp name, the Unicode name is used. If there is no
Unicode name, the hexadecimal char-code is
printed. For example, #\U+34e, which is not a graphic
character, is printed as #\Combining_Upwards_Arrow_Below,
and #\U+9f which is not a graphic character and does not have a
Unicode name, is printed as #\U+009F.
CMUCL loads external formats using the search-list
ext-formats:. The aliases file is also located using
this search-list.
The Unicode database is stored in compressed form in the file
ext-formats:unidata.bin. If this file is not found, Unicode
support is severely reduced; you can only use ASCII characters.
Since strings are UTF-16 and hence may contain surrogate pairs, some
utility functions are provided to make access easier.
 [Function]
lisp:codepoint string i
 &optional end    
 
 
 Return the codepoint value from string at position i.
 If the code unit at that position is a surrogate value, it is combined
 with either the previous or following code unit (when possible) to
 compute the codepoint. The first return value is the codepoint
 itself. The second return value is nil if the position is not a
 surrogate pair. Otherwise, +1 or -1 is returned if the position
 is the high (leading) or low (trailing) surrogate value, respectively.
This is useful for iterating through a string in codepoint sequence.
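A minimal sketch of such an iteration, based on the return values described above. The function name string-codepoints is hypothetical and not part of CMUCL.

```lisp
(defun string-codepoints (string)
  "Return a list of the codepoints in STRING, in order."
  (let ((result '())
        (i 0)
        (n (length string)))
    (loop while (< i n) do
      (multiple-value-bind (code surrogate) (lisp:codepoint string i)
        (push code result)
        ;; A second value of +1 means position I holds the high
        ;; surrogate of a pair, so the next code unit is skipped.
        (incf i (if (eql surrogate 1) 2 1))))
    (nreverse result)))

;; "A", U+1D11E (as a surrogate pair), "B"
(string-codepoints (map 'string #'code-char '(#x41 #xD834 #xDD1E #x42)))
```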
 [Function]
lisp:surrogates-to-codepoint hi lo    
 
 
 Convert the given hi and lo surrogate characters to the
 corresponding codepoint value.
 [Function]
lisp:surrogates codepoint    
 
 
 Convert the given codepoint value to the corresponding high
 and low surrogate characters. If the codepoint is less than 65536,
 the second value is nil since the codepoint does not need to be
 represented as a surrogate pair.
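 For readers unfamiliar with the encoding, the arithmetic behind these conversions can be sketched in portable Common Lisp. The two helpers below are hypothetical and are not CMUCL's implementation; they work on integer code values rather than on characters.

```lisp
(defun codepoint-to-units (codepoint)
  "Return the UTF-16 code unit values for CODEPOINT as two values."
  (if (< codepoint #x10000)
      (values codepoint nil)                  ; fits in one code unit
      (let ((offset (- codepoint #x10000)))   ; 20 significant bits
        (values (+ #xD800 (ash offset -10))            ; high 10 bits
                (+ #xDC00 (logand offset #x3FF))))))   ; low 10 bits

(defun units-to-codepoint (hi lo)
  "Combine the surrogate code unit values HI and LO into a codepoint."
  (+ #x10000
     (ash (- hi #xD800) 10)
     (- lo #xDC00)))

;; U+1D11E corresponds to the code units #xD834 and #xDD1E.
(codepoint-to-units #x1D11E)
(units-to-codepoint #xD834 #xDD1E)
```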
 [Function]
stream:string-encode string
 external-format &optional (start 0) end    
 
 
 string-encode encodes string using the format
 external-format, producing an array of octets. Each octet is
 converted to a character via code-char and the resulting
 string is returned.
The optional argument start, defaulting to 0, specifies the
 starting index and end, defaulting to the length of the
 string, is the end of the string.
 [Function]
stream:string-decode string
 external-format &optional (start 0) end    
 
 
 string-decode decodes string using the format
 external-format and produces a new string. Each character of
 string is converted to octet (by char-code) and the
 resulting array of octets is used by the external format to produce
 a string. This is the inverse of string-encode.
The optional argument start, defaulting to 0, specifies the
 starting index and end, defaulting to the length of the
 string, is the end of the string.
string must consist of characters whose char-code is
 less than 256.
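 A short round-trip sketch using these two functions. U+00E9 takes two octets in UTF-8, so the encoded string has two characters, each holding one octet value.

```lisp
(defparameter *original* (string (code-char #xE9)))

;; Each character of the result holds one octet of the UTF-8 encoding.
(defparameter *encoded* (stream:string-encode *original* :utf-8))

;; Decoding with the same external format recovers the original string.
(defparameter *decoded* (stream:string-decode *encoded* :utf-8))
```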
 [Function]
string-to-octets string &key :start
 :end :external-format :buffer    
 
 
 string-to-octets converts string to a sequence of
 octets according to the external format specified by
 external-format. The string to be converted is bounded by
 start, which defaults to 0, and end, which defaults to
 the length of the string. If buffer is specified, the octets
 are placed in buffer. If buffer is not specified, a new
 array is allocated to hold the octets. In all cases the buffer is
 returned.
 [Function]
octets-to-string octets &key :start
 :end :external-format :string :s-start
 :s-end :state    
 
 
 octets-to-string converts the sequence of octets in
 octets to a string. octets must be a
 (simple-array (unsigned-byte 8) (*)). The octets to be
 converted are bounded by start and end, which default to
 0 and the length of the array, respectively. The conversion is
 performed according to the external format specified by
 external-format. If string is specified, the octets are
 converted and stored in string, starting at s-start
 (defaulting to 0) and ending just before s-end (defaulting to
 the end of string). string must be a simple-string.
 If the bounded string is not large enough to hold all of the
 characters, then some octets will not be converted. If string
 is not specified, a new string is created.
The state is used as the initial state for the external
 format. This is useful when converting buffers of octets where the
 buffers are not on character boundaries, and state information is
 needed between buffers.
Four values are returned: the string, the number of characters
 written to the string, the number of octets consumed to produce
 the characters, and the final state of the external format after
 converting the octets.
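A round-trip sketch for the two octet conversion functions. It assumes that both are accessible with the stream: package prefix, like string-encode above; adjust the prefix if your CMUCL exports them elsewhere.

```lisp
(defparameter *text*
  (concatenate 'string "caf" (string (code-char #xE9))))

;; Three ASCII octets plus two octets for U+00E9.
(defparameter *octets*
  (stream:string-to-octets *text* :external-format :utf-8))

;; The second value is the number of characters written to the string;
;; it is used here to bound the valid part of the result.
(defparameter *round-trip*
  (multiple-value-bind (string count)
      (stream:octets-to-string *octets* :external-format :utf-8)
    (subseq string 0 count)))
```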
13.4 Writing External Formats
Users may write their own external formats. It is probably easiest to
look at existing external formats to see how to do this.
An external format basically needs two functions:
octets-to-code to convert octets to Unicode codepoints and
code-to-octets to convert Unicode codepoints to octets. The
external format is defined using the macro
stream::define-external-format.
 [Macro]
stream:define-external-format name
 (&key min max size) (&rest slots)
 octets-to-code code-to-octets
 flush-state copy-state
 
 [Macro]
stream:define-external-format name
 (base) (&rest slots)    
 
The first form defines a new external format of the name :name.
 min, max, and size specify the minimum and maximum
 number of octets that make up a character. (:size n is
 just a shortcut for :min n :max n.) The arguments
 octets-to-code and code-to-octets are not optional in
 this case. They specify how to convert octets to codepoints and
 vice versa, respectively. These should be backquoted forms for the
 body of a function to do the conversion. See the description below
 for these functions. Some good examples are the external format for
 :utf-8 or :utf-16. The :slots argument is a list of
 read-only slots, similar to defstruct. The slot names are available as
 local variables inside the code-to-octets and octets-to-code
 bodies.
The second form above defines an external format with the name
 :name that is based on a previously defined format :base.
 The slots are inherited from the :base format by default,
 although the definition may alter their values and add new slots.
 See, for example, the :mac-greek external format.
 [Macro]
octets-to-code state input
 unput &rest args    
 
 
 This defines a form to be used by an external format to convert
 octets to a code point. state is a form that can be used by
 the body to access the state variable of a stream. This can be used
 for any reason to hold anything needed by octets-to-code.
 input is a form that returns one octet from the input stream.
 unput will put back N octets to the stream. args is a
 list of variables that need to be defined for any symbols in the
 body of the macro.
 [Macro]
code-to-octets code state
 output &rest args    
 
 
 Defines a form to be used by the external format to convert a code
 point to octets for output. code is the code point to be
 converted. state is a form to access the current value of the
 stream's state variable. output is a form that writes one
 octet to the output stream.
 [Macro]
flush-state state
 output &rest args    
 
 
 Defines a form to be used by the external format to flush out
 any state when an output stream is closed. Similar to
 code-to-octets, but there is no code point to be output.
If nil, then nothing special is needed to flush the state to the
 output.
This is called only when an output character stream is being closed.
 [Macro]
copy-state state &rest args    
 
 
 Defines a form to copy any state needed by the external format.
 This should probably be a deep copy so that if the original
 state is modified, the copy is not.
If not given, then nothing special is needed to copy the state,
 either because there is no state for the external format or because no
 special copier is needed.
13.4.2 Composing External Formats
 [Macro]
stream:define-composing-external-format name
 (&key min max size) input
 output    
 
 
 This is the same as define-external-format, except that a
 composing external format is created.
 
 
