CMUCL supports internationalization by supporting Unicode
characters internally and by adding support for external formats to
convert from the internal format to an appropriate external character
coding format.
To understand the support for Unicode, we refer the reader to the
Unicode standard.
To support internationalization, the following changes have been made
to Common Lisp functions.
To support Unicode, there are many approaches. One choice is to
support both 8-bit base-char and a 21-bit (or larger)
character since Unicode codepoints use 21 bits. This generally
means strings are much larger, and complicates the compiler by having
to support both base-char and character types and the
corresponding string types. This also adds complexity for the user to
understand the difference between the different string and character
types.
Another choice is to have just one character and string type that can
hold the entire Unicode codepoint. While simplifying the compiler and
reducing the burden on the user, this significantly increases memory
usage for strings.
The solution chosen by CMUCL is to trade off size and complexity
by having only 16-bit characters. Most of the important languages can
be encoded using only 16 bits. The rest of the codepoints are for
rare languages or ancient scripts. Thus, the memory usage is
significantly reduced while still supporting the most important
languages. Compiler complexity is also reduced since base-char
and character are the same, as are the string types. But we
still want to support the full Unicode character set. This is
achieved by making strings be UTF-16 strings internally. Hence, Lisp
strings are UTF-16 strings, and Lisp characters are UTF-16 code units.
Characters are now 16 bits long instead of 8 bits, and base-char
and character types are the same. This difference is
naturally indicated by changing char-code-limit from 256 to
65536.
In CMUCL there is only one type of string: base-string and
string are the same.
Internally, the strings are encoded using UTF-16. This means that in
some rare cases the number of Lisp characters in a string is not the
same as the number of codepoints in the string.
To communicate with the external world, CMUCL supports
external formats that convert between external encodings and
CMUCL's string format. The external format is specified in several
ways. The standard streams *standard-input*,
*standard-output*, and *standard-error* take the format
from the value specified by *default-external-format*. The
default value of *default-external-format* is :iso8859-1.
For files, OPEN takes the :external-format
parameter to specify the format. The default external format is
:default.
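As a minimal sketch of the :external-format argument to OPEN, the following writes a string to a file as UTF-8 and reads it back. The function name write-then-read-utf8 and its path argument are hypothetical and chosen only for illustration; :utf-8 is one of the formats listed in Table 13.1.

```lisp
;; Hypothetical helper: round-trips TEXT through a UTF-8 encoded file.
;; TEXT should not contain a newline, since only one line is read back.
(defun write-then-read-utf8 (path text)
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format :utf-8)
    (write-string text out))
  (with-open-file (in path :direction :input
                           :external-format :utf-8)
    ;; VALUES with a single argument returns only the line itself.
    (values (read-line in))))
```

If the same format is used for writing and reading, the returned string should be string= to the original text.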
[Function]
stream:set-system-external-format terminal &optional filenames
This function changes the external format used for
*standard-input*, *standard-output*, and
*standard-error* to the external format specified by
terminal. Additionally, the Unix file name encoding can be
set to the value specified by filenames if non-nil.
13.2.1 Available External Formats
The available external formats are listed below in
Table 13.1. The first column gives the
external format, and the second column gives a list of aliases that
can be used for this format. The set of aliases can be changed by
changing the aliases file.
For all of these formats, if an illegal sequence is encountered, no
error or warning is signaled. Instead, the offending sequence is
silently replaced with the Unicode REPLACEMENT CHARACTER (U+FFFD).
Format | Aliases | Description
:iso8859-1 | :latin1 :latin-1 :iso-8859-1 | ISO8859-1
:iso8859-2 | :latin2 :latin-2 :iso-8859-2 | ISO8859-2
:iso8859-3 | :latin3 :latin-3 :iso-8859-3 | ISO8859-3
:iso8859-4 | :latin4 :latin-4 :iso-8859-4 | ISO8859-4
:iso8859-5 | :cyrillic :iso-8859-5 | ISO8859-5
:iso8859-6 | :arabic :iso-8859-6 | ISO8859-6
:iso8859-7 | :greek :iso-8859-7 | ISO8859-7
:iso8859-8 | :hebrew :iso-8859-8 | ISO8859-8
:iso8859-9 | :latin5 :latin-5 :iso-8859-9 | ISO8859-9
:iso8859-10 | :latin6 :latin-6 :iso-8859-10 | ISO8859-10
:iso8859-13 | :latin7 :latin-7 :iso-8859-13 | ISO8859-13
:iso8859-14 | :latin8 :latin-8 :iso-8859-14 | ISO8859-14
:iso8859-15 | :latin9 :latin-9 :iso-8859-15 | ISO8859-15
:utf-8 | :utf :utf8 | UTF-8
:utf-16 | :utf16 | UTF-16 with optional BOM
:utf-16-be | :utf-16be :utf16-be | UTF-16 big-endian (without BOM)
:utf-16-le | :utf-16le :utf16-le | UTF-16 little-endian (without BOM)
:utf-32 | :utf32 | UTF-32 with optional BOM
:utf-32-be | :utf-32be :utf32-be | UTF-32 big-endian (without BOM)
:utf-32-le | :utf-32le :utf32-le | UTF-32 little-endian (without BOM)
:cp1250 | |
:cp1251 | |
:cp1252 | :windows-1252 :windows-cp1252 :windows-latin1 |
:cp1253 | |
:cp1254 | |
:cp1255 | |
:cp1256 | |
:cp1257 | |
:cp1258 | |
:koi8-r | |
:mac-cyrillic | |
:mac-greek | |
:mac-icelandic | |
:mac-latin2 | |
:mac-roman | |
:mac-turkish | |
Table 13.1: External formats
13.2.2 Composing External Formats
A composing external format is an external format that converts between
one codepoint and another, rather than between codepoints and octets.
A composing external format must be used in conjunction with another
(octet-producing) external format. This is specified by
using a list as the external format. For example, we can use
'(:latin1 :crlf) as the external format. In this
particular example, the external format is latin1, but whenever a
carriage-return/linefeed sequence is read, it is converted to the Lisp
#\Newline character. Conversely, whenever a string is written,
a Lisp #\Newline character is converted to a
carriage-return/linefeed sequence. Without the :crlf composing
format, the carriage-return and linefeed will be read in as separate
characters, and on output the Lisp #\Newline character is
output as a single linefeed character.
Table 13.2 lists the available composing formats.
Format | Aliases | Description
:crlf | :dos | Composing format for converting to/from DOS (CR/LF) end-of-line sequence to #\Newline
:cr | :mac | Composing format for converting to/from Mac (CR) end-of-line sequence to #\Newline
:beta-gk | | Composing format that translates (lower-case) Beta code (an ASCII encoding of ancient Greek)
:final-sigma | | Composing format that attempts to detect sigma in word-final position and change it from U+3C3 to U+3C2
Table 13.2: Composing external formats
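The effect of a composing format can be observed with a small experiment. The sketch below is CMUCL-specific and assumes the list syntax '(:latin1 :crlf) described in this section; the function name crlf-demo and its path argument are hypothetical. It writes one line through the composing format and then reads the file back with plain :latin1, returning the character codes seen.

```lisp
;; Hypothetical helper: writes one line with the (:latin1 :crlf) format,
;; then reads the file back with plain :latin1 and returns the char-codes.
(defun crlf-demo (path)
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format '(:latin1 :crlf))
    (write-line "one" out))
  (with-open-file (in path :direction :input
                           :external-format :latin1)
    (loop for ch = (read-char in nil nil)
          while ch
          collect (char-code ch))))
```

If the composing format behaves as described, the returned list should end with the codes 13 and 10 (carriage return, then linefeed) rather than a single 10.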
[Variable]
extensions:*default-external-format*
This is the default external format to use for all newly opened
files. It is also the default format to use for
*standard-input*, *standard-output*, and
*standard-error*. The default value is :iso8859-1.
Setting this will cause the standard streams to start using the new
format immediately. If a stream has been created with external
format :default, then setting *default-external-format*
will cause all subsequent input and output to use the new value of
*default-external-format*.
Remember that CMUCL's characters are only 16 bits long but Unicode
codepoints are up to 21 bits long. Hence there are codepoints that
cannot be represented via Lisp characters. Operating on individual
characters is not recommended. Operations on strings are better.
(This would be true even if CMUCL's characters could hold a
full Unicode codepoint.)
[Function]
char-equal &rest characters
[Function]
char-not-equal &rest characters
[Function]
char-lessp &rest characters
[Function]
char-greaterp &rest characters
[Function]
char-not-greaterp &rest characters
[Function]
char-not-lessp &rest characters
For the comparison, the characters are converted to lowercase and
the corresponding char-codes are compared.
[Function]
alpha-char-p character
Returns non-nil if the Unicode category is a letter category.
[Function]
alphanumericp character
Returns non-nil if the Unicode category is a letter category or an ASCII
digit.
[Function]
digit-char-p character &optional radix
Only recognizes ASCII digits (and ASCII letters if the radix is larger
than 10).
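A short sketch of this behavior follows. The function name digit-weights is hypothetical. The first three calls are ordinary Common Lisp; the last one probes a non-ASCII digit, which according to the description above CMUCL does not recognize.

```lisp
;; Hypothetical helper collecting a few DIGIT-CHAR-P results.
(defun digit-weights ()
  (list (digit-char-p #\7)        ; weight of the decimal digit 7
        (digit-char-p #\a 16)     ; weight of the hexadecimal digit a
        (digit-char-p #\a)        ; not a digit in the default radix 10
        ;; U+0661 ARABIC-INDIC DIGIT ONE: a Unicode digit outside ASCII.
        (digit-char-p (code-char #x661))))
```

The first three elements are 7, 10, and nil in any conforming implementation; the fourth should be nil in CMUCL if only ASCII digits are recognized.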
[Function]
graphic-char-p character
Returns non-nil if the Unicode category is a graphic category.
[Function]
upper-case-p character
[Function]
lower-case-p character
Returns non-nil if the Unicode category is an uppercase
(lowercase) character.
[Function]
lisp:title-case-p character
Returns non-nil if the Unicode category is a titlecase character.
[Function]
both-case-p character
Returns non-nil if the Unicode category is an uppercase,
lowercase, or titlecase character.
[Function]
char-upcase character
[Function]
char-downcase character
The Unicode uppercase (lowercase) letter is returned.
[Function]
lisp:char-titlecase character
The Unicode titlecase letter is returned.
[Function]
char-name char
If possible the name of the character char is returned. If
there is a Unicode name, the Unicode name is returned, except
spaces are converted to underscores and the string is capitalized
via string-capitalize. If there is no Unicode name, the
form #\U+xxxx is returned where "xxxx" is the
char-code of the character, in hexadecimal.
[Function]
name-char name
The inverse to char-name. If no character has the name
name, then nil is returned. Unicode names are not
case-sensitive, and spaces and underscores are optional.
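The relationship between char-name and name-char can be checked with a small helper. This is only a sketch; the name name-round-trip is hypothetical, and the exact name strings returned depend on the Unicode database being available.

```lisp
;; Hypothetical helper: returns the name of CHAR and whether NAME-CHAR
;; maps that name back to the same character.
(defun name-round-trip (char)
  (let ((name (char-name char)))
    (values name
            (and name (eql (name-char name) char)))))
```

For example, (name-round-trip (code-char #x3B1)) asks for the name of the Greek small letter alpha. Under the rules above, the first value should be the Unicode name with underscores in place of spaces and capitalized by string-capitalize, and the second value should be true.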
Strings in CMUCL are UTF-16 strings. That is, for Unicode code
points greater than 65535, surrogate pairs are used. We refer the
reader to the Unicode standard for more information about surrogate
pairs. We just want to make a note that because of the UTF-16
encoding of strings, there is a distinction between Lisp characters
and Unicode codepoints. The standard string operations know about
this encoding and handle the surrogate pairs correctly.
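For readers who want the surrogate arithmetic spelled out, here is a minimal sketch in portable Common Lisp. The function names split-codepoint and join-surrogates are hypothetical and work on integers only; they illustrate the UTF-16 encoding rule and are not the functions CMUCL itself uses.

```lisp
;; Hypothetical helpers working on integers only; they mirror the
;; UTF-16 arithmetic and do not call any CMUCL-specific function.
(defun split-codepoint (codepoint)
  "Return the high and low surrogate code units for CODEPOINT as two
values, or CODEPOINT and NIL if it fits in a single 16-bit code unit."
  (if (< codepoint #x10000)
      (values codepoint nil)
      (let ((offset (- codepoint #x10000)))
        (values (+ #xD800 (ash offset -10))          ; top 10 bits
                (+ #xDC00 (logand offset #x3FF)))))) ; bottom 10 bits

(defun join-surrogates (high low)
  "Combine a high and a low surrogate code unit into a codepoint."
  (+ #x10000
     (ash (- high #xD800) 10)
     (- low #xDC00)))
```

For example, (split-codepoint #x1F600) returns #xD83D and #xDE00, and (join-surrogates #xD83D #xDE00) returns #x1F600. A string holding that one codepoint therefore has a length of two Lisp characters.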
[Function]
string-upcase string &key :start
:end :casing
[Function]
string-downcase string &key :start
:end :casing
[Function]
string-capitalize string &key :start
:end :casing
The case of the string is changed appropriately. Surrogate
pairs are handled correctly. The conversion to the appropriate case
is done based on the Unicode conversion. The additional argument
:casing controls how case conversion is done. The default
value is :simple, which uses simple Unicode case conversion.
If :casing is :full, then full Unicode case conversion is
done where the string may actually increase in length.
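The difference between the two casing modes can be explored with a sketch like the following. It is CMUCL-specific, since :casing is not a standard keyword, and the function name upcase-both-ways is hypothetical.

```lisp
;; CMUCL-specific: the :casing keyword is the extension documented above.
(defun upcase-both-ways (string)
  (list (string-upcase string :casing :simple)
        (string-upcase string :casing :full)))
```

A useful test input is "stra", followed by U+00DF (LATIN SMALL LETTER SHARP S), followed by "e", built for instance with (concatenate 'string "stra" (string (code-char #xDF)) "e"). Unicode full case conversion maps the sharp s to two letters, so the second result may be one character longer than the first.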
[Function]
nstring-upcase string &key :start :end
[Function]
nstring-downcase string &key :start :end
[Function]
nstring-capitalize string &key :start
:end
The case of the string is changed appropriately. Surrogate
pairs are handled correctly. The conversion to the appropriate case
is done based on the Unicode conversion. (Full casing is not
available because the string length cannot be increased when needed.)
[Function]
string= s1 s2 &key :start1
:end1 :start2 :end2
[Function]
string/= s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string< s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string<= s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string> s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string>= s1 s2 &key :start1 :end1 :start2 :end2
The string comparison is done in codepoint order. (This is
different from just comparing the order of the individual characters
due to surrogate pairs.) Unicode collation is not done.
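The following sketch shows why codepoint order and code-unit order can disagree. It compares U+FF5E, which fits in one code unit, with U+10000, whose first code unit is the high surrogate #xD800. The function name compare-first-elements is hypothetical, and the first form uses integers only so it can be evaluated anywhere; the second form is CMUCL-specific.

```lisp
;; Integers standing in for the first element of each string:
;;   U+FF5E  is the single code unit #xFF5E
;;   U+10000 is the surrogate pair   #xD800 #xDC00
(defun compare-first-elements ()
  (let ((bmp-unit #xFF5E)
        (high-surrogate #xD800)
        (bmp-codepoint #xFF5E)
        (supplementary-codepoint #x10000))
    (list :by-code-unit (< bmp-unit high-surrogate)
          :by-codepoint (< bmp-codepoint supplementary-codepoint))))

;; The same comparison on real strings, under CMUCL only.
#+cmu
(let ((a (string (code-char #xFF5E)))
      (b (coerce (list (code-char #xD800) (code-char #xDC00)) 'string)))
  (string< a b))
```

The first form returns (:by-code-unit nil :by-codepoint t). Since the comparison functions work in codepoint order, the string< call should report that the first string sorts before the second, even though its first code unit is numerically larger.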
[Function]
string-equal s1 s2 &key :start1
:end1 :start2 :end2
[Function]
string-not-equal s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string-lessp s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string-greaterp s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string-not-greaterp s1 s2 &key :start1 :end1 :start2 :end2
[Function]
string-not-lessp s1 s2 &key :start1 :end1 :start2 :end2
Each codepoint in each string is converted to lowercase and the
appropriate comparison of the codepoint values is done. Unicode
collation is not done.
[Function]
string-left-trim bag string
[Function]
string-right-trim bag string
[Function]
string-trim bag string
Removes any characters in bag from the left, right, or both
ends of the string string, respectively. This has potential
problems if you want to remove a codepoint that needs a surrogate
pair, since a single Lisp character cannot represent such a codepoint. As
an extension, if bag is a string, we properly handle
surrogate pairs in the bag.
Since strings are also sequences, the sequence functions can be used
on strings. We note here some issues with these functions. Most
issues are due to the fact that strings are UTF-16 strings and
characters are UTF-16 code units, not Unicode codepoints.
[Function]
remove-duplicates sequence
&key :from-end :test :test-not :start
:end :key
[Function]
delete-duplicates sequence
&key :from-end :test :test-not :start
:end :key
Because of surrogate pairs these functions may remove a high or low
surrogate value, leaving the string in an invalid state. Use these
functions carefully with strings.
To support Unicode characters, the reader has been extended to
recognize characters written in hexadecimal. Thus #\U+41 is
the ASCII capital letter "A", since 41 is the hexadecimal code for
that letter. The Unicode name of the character is also recognized,
except spaces in the name are replaced by underscores.
Recall, however, that characters in CMUCL are only 16 bits long so
many Unicode characters cannot be represented. However, strings can
represent all Unicode characters.
When symbols are read, the symbol name is converted to Unicode NFC
form before interning the symbol into the package. Hence,
(symbol-name (intern "string")) may produce a string that is
not string= to "string". However, after conversion to NFC
form, the strings will be identical.
When printing characters, if the character is a graphic character, the
character is printed. Thus #\U+41 is printed as
#\A. If the character is not a graphic character, the Lisp
name (e.g., #\Tab) is used if possible;
if there is no Lisp name, the Unicode name is used. If there is no
Unicode name, the hexadecimal char-code is
printed. For example, #\U+34e, which is not a graphic
character, is printed as #\Combining_Upwards_Arrow_Below,
and #\U+9f which is not a graphic character and does not have a
Unicode name, is printed as #\U+009F.
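The hexadecimal character syntax and the printing rule for graphic characters can be tried together. The sketch below is CMUCL-specific and uses a hypothetical function name, read-hex-char-demo. The character is read from a string at run time so that the form itself does not depend on the extended reader syntax.

```lisp
;; CMUCL-specific reader syntax, read at run time from a string so that
;; this definition can be read by any Common Lisp reader.
(defun read-hex-char-demo ()
  (let ((ch (read-from-string "#\\U+41")))
    (list (char-code ch)            ; numeric code of the character read
          (prin1-to-string ch))))   ; how the printer writes it back
```

Following the text above, the result should be the code 65 together with the printed form #\A.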
CMUCL loads external formats using the search-list
ext-formats:. The aliases file is also located using
this search-list.
The Unicode database is stored in compressed form in the file
ext-formats:unidata.bin. If this file is not found, Unicode
support is severely reduced; you can only use ASCII characters.
Since strings are UTF-16 and hence may contain surrogate pairs, some
utility functions are provided to make access easier.
[Function]
lisp:codepoint string i
&optional end
Return the codepoint value from string at position i.
If the code unit at that position is a surrogate value, it is combined
with either the previous or following code unit (when possible) to
compute the codepoint. The first return value is the codepoint
itself. The second return value is nil if the position is not a
surrogate pair. Otherwise, +1 or -1 is returned if the position
is the high (leading) or low (trailing) surrogate value, respectively.
This is useful for iterating through a string in codepoint sequence.
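One way to use the second return value for such an iteration is sketched below. This is CMUCL-specific code built only on the description above; the function name string-codepoints is hypothetical, and the handling of unpaired surrogates is left to whatever lisp:codepoint returns for them.

```lisp
;; CMUCL-specific sketch: collect the codepoints of STRING in order,
;; using the second value of LISP:CODEPOINT to skip low surrogates.
(defun string-codepoints (string)
  (let ((codepoints '())
        (i 0))
    (loop while (< i (length string))
          do (multiple-value-bind (codepoint surrogate)
                 (lisp:codepoint string i)
               (push codepoint codepoints)
               ;; +1 means position I holds the high surrogate of a pair,
               ;; so the next code unit belongs to the same codepoint.
               (incf i (if (eql surrogate 1) 2 1))))
    (nreverse codepoints)))
```

Because the index advances by two after a high surrogate, the loop never starts a lookup on the low half of a well-formed pair.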
[Function]
lisp:surrogates-to-codepoint hi lo
Convert the given hi and lo surrogate characters to the
corresponding codepoint value.
[Function]
lisp:surrogates codepoint
Convert the given codepoint value to the corresponding high
and low surrogate characters. If the codepoint is less than 65536,
the second value is nil since the codepoint does not need to be
represented as a surrogate pair.
[Function]
stream:string-encode string
external-format &optional (start 0) end
string-encode encodes string using the format
external-format, producing an array of octets. Each octet is
converted to a character via code-char and the resulting
string is returned.
The optional argument start, defaulting to 0, specifies the
starting index and end, defaulting to the length of the
string, is the end of the string.
[Function]
stream:string-decode string
external-format &optional (start 0) end
string-decode decodes string using the format
external-format and produces a new string. Each character of
string is converted to an octet (by char-code) and the
resulting array of octets is used by the external format to produce
a string. This is the inverse of string-encode.
The optional argument start, defaulting to 0, specifies the
starting index and end, defaulting to the length of the
string, is the end of the string.
string must consist of characters whose char-code is
less than 256.
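A round trip through the two functions might look like the sketch below. It is CMUCL-specific and relies only on the signatures given above; the function name encode-decode-demo is hypothetical.

```lisp
;; CMUCL-specific sketch using the two functions documented above.
(defun encode-decode-demo (text)
  (let ((encoded (stream:string-encode text :utf-8)))
    (values (map 'list #'char-code encoded)   ; octet values, one per character
            (stream:string-decode encoded :utf-8))))
```

For the one-character string (string (code-char #xE9)), the UTF-8 encoding is the two octets #xC3 and #xA9, so the first value should be the list (195 169) and the second value should be string= to the input.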
[Function]
string-to-octets string &key :start
:end :external-format :buffer
string-to-octets converts string to a sequence of
octets according to the external format specified by
external-format. The string to be converted is bounded by
start, which defaults to 0, and end, which defaults to
the length of the string. If buffer is specified, the octets
are placed in buffer. If buffer is not specified, a new
array is allocated to hold the octets. In all cases the buffer is
returned.
[Function]
octets-to-string octets &key :start
:end :external-format :string :s-start
:s-end :state
octets-to-string converts the sequence of octets in
octets to a string. octets must be a
(simple-array (unsigned-byte 8) (*)). The octets to be
converted are bounded by start and end, which default to
0 and the length of the array, respectively. The conversion is
performed according to the external format specified by
external-format. If string is specified, the octets are
converted and stored in string, starting at s-start
(defaulting to 0) and ending just before s-end (defaulting to
the end of string). string must be a simple-string.
If the bounded string is not large enough to hold all of the
characters, then some octets will not be converted. If string
is not specified, a new string is created.
The state is used as the initial state for the external
format. This is useful when converting buffers of octets where the
buffers are not on character boundaries, and state information is
needed between buffers.
Four values are returned: the string, the number of characters
written to the string, the number of octets consumed to produce
the characters, and the final state of the external format after
converting the octets.
13.4 Writing External Formats
Users may write their own external formats. It is probably easiest to
look at existing external formats to see how to do this.
An external format basically needs two functions:
octets-to-code to convert octets to Unicode codepoints and
code-to-octets to convert Unicode codepoints to octets. The
external format is defined using the macro
stream::define-external-format.
[Macro]
stream:define-external-format name
(&key min max size) (&rest slots)
octets-to-code code-to-octets
flush-state copy-state
[Function]
stream:define-external-format name
(base) (&rest slots)
The first defines a new external format of the name :name.
min and max are the minimum and maximum
number of octets that make up a character. (:size n is
just a shortcut for :min n :max n.) The arguments
octets-to-code and code-to-octets are not optional in
this case. They specify how to convert octets to codepoints and
vice versa, respectively. These should be backquoted forms for the
body of a function to do the conversion. See the description below
for these functions. Some good examples are the external format for
:utf-8 or :utf-16. The :slots argument is a list of
read-only slots, similar to defstruct. The slot names are available as
local variables inside the code-to-octets and octets-to-code
bodies.
The second form above defines an external format with the name
:name that is based on a previously defined format :base.
The slots are inherited from the :base format by default,
although the definition may alter their values and add new slots.
See, for example, the :mac-greek external format.
[Macro]
octets-to-code state input
unput &rest args
This defines a form to be used by an external format to convert
octets to a code point. state is a form that can be used by
the body to access the state variable of a stream. This can be used
for any reason to hold anything needed by octets-to-code.
input is a form that returns one octet from the input stream.
unput will put back N octets to the stream. args is a
list of variables that need to be defined for any symbols in the
body of the macro.
[Macro]
code-to-octets code state
output &rest args
Defines a form to be used by the external format to convert a code
point to octets for output. code is the code point to be
converted. state is a form to access the current value of the
stream's state variable. output is a form that writes one
octet to the output stream.
[Macro]
flush-state state
output &rest args
Defines a form to be used by the external format to flush out
any state when an output stream is closed. Similar to
code-to-octets, but there is no code point to be output.
If nil, then nothing special is needed to flush the state to the
output.
This is called only when an output character stream is being closed.
[Macro]
copy-state state &rest args
Defines a form to copy any state needed by the external format.
This should probably be a deep copy so that if the original
state is modified, the copy is not.
If not given, then nothing special is needed to copy the state,
either because there is no state for the external format or because no
special copier is needed.
13.4.2 Composing External Formats
[Macro]
stream:define-composing-external-format name
(&key min max size) input
output
This is the same as define-external-format, except that a
composing external format is created.