CMUCL User's Manual: Internationalization

Internationalization

CMUCL supports internationalization by representing characters internally as Unicode and by providing external formats that convert between the internal representation and an appropriate external character encoding.

To understand the support for Unicode, we refer the reader to the Unicode standard.

13.1

Changes

To support internationalization, the following changes have been made to Common Lisp functions.

13.1.1

Design Choices

There are many approaches to supporting Unicode. One choice is to support both an 8-bit base-char and a 21-bit (or larger) character, since Unicode codepoints use 21 bits. This generally means strings are much larger, and it complicates the compiler, which has to support both base-char and character types and the corresponding string types. It also requires the user to understand the difference between the different string and character types.

Another choice is to have just one character and string type that can hold the entire Unicode codepoint. While simplifying the compiler and reducing the burden on the user, this significantly increases memory usage for strings.

The solution chosen by CMUCL is to trade off size against complexity by having only 16-bit characters. Most of the important languages can be encoded using only 16 bits. The rest of the codepoints are for rare languages or ancient scripts. Thus, memory usage is significantly reduced while still supporting the most important languages. Compiler complexity is also reduced since base-char and character are the same, as are the string types. But we still want to support the full Unicode character set. This is achieved by making strings be UTF-16 strings internally. Hence, Lisp strings are UTF-16 strings, and Lisp characters are UTF-16 code units.

13.1.2

Characters

Characters are now 16 bits long instead of 8 bits, and base-char and character types are the same. This difference is naturally indicated by changing char-code-limit from 256 to 65536.

13.1.3

Strings

In CMUCL there is only one type of string: base-string and string are the same.

Internally, the strings are encoded using UTF-16. This means that in some rare cases the number of Lisp characters in a string is not the same as the number of codepoints in the string.
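
As a rough illustration of why the two counts can differ, the following portable sketch counts the UTF-16 code units needed for a list of codepoints. The function name utf16-length is made up for this example and is not part of CMUCL.

```lisp
;; Hypothetical helper (not a CMUCL function): number of UTF-16 code
;; units needed to encode a list of Unicode codepoints.
(defun utf16-length (codepoints)
  (reduce #'+ codepoints
          ;; Codepoints above #xFFFF need a surrogate pair (two code units).
          :key (lambda (cp) (if (> cp #xFFFF) 2 1))))

;; Two codepoints (U+0041 and U+1D11E), but three code units.
(utf16-length '(#x41 #x1D11E))   ; => 3
```

A Lisp string holding these two codepoints would therefore have a length of 3, not 2.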

13.2

External Formats

To communicate with the external world, CMUCL supports external formats that convert between external character encodings and CMUCL's internal string format. The external format can be specified in several ways. The standard streams *standard-input*, *standard-output*, and *standard-error* take the format from the value specified by *default-external-format*. The default value of *default-external-format* is :iso8859-1.

For files, OPEN takes the :external-format parameter to specify the format. The default external format is :default.
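
A minimal sketch of passing :external-format when opening files, here through with-open-file, which hands its options to OPEN. The file name is a placeholder chosen for this example, and the sketch assumes a Unicode-enabled CMUCL with a writable /tmp directory.

```lisp
;; The file name is a placeholder chosen for this example.
(defparameter *demo-file* "/tmp/cmucl-i18n-demo.txt")

;; Write "caf" followed by U+00E9 and a newline, encoded as UTF-8.
(with-open-file (out *demo-file* :direction :output
                                 :if-exists :supersede
                                 :external-format :utf-8)
  (format out "caf~C~%" (code-char #xE9)))

;; Read the line back, decoding the octets with the same format.
(with-open-file (in *demo-file* :external-format :utf-8)
  (read-line in))
```

Reading the file with a different external format, such as :latin1, would decode the same octets into different characters.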

[Function]
stream:set-system-external-format terminal &optional filenames

This function changes the external format used for *standard-input*, *standard-output*, and *standard-error* to the external format specified by terminal. Additionally, the Unix file name encoding can be set to the value specified by filenames if non-nil.

13.2.1

Available External Formats

The available external formats are listed below in Table 13.1. The first column gives the external format, and the second column gives a list of aliases that can be used for this format. The set of aliases can be changed by changing the aliases file.

For all of these formats, if an illegal sequence is encountered, no error or warning is signaled. Instead, the offending sequence is silently replaced with the Unicode REPLACEMENT CHARACTER (U+FFFD).

Format          Aliases                         Description
:iso8859-1      :latin1 :latin-1 :iso-8859-1    ISO8859-1
:iso8859-2      :latin2 :latin-2 :iso-8859-2    ISO8859-2
:iso8859-3      :latin3 :latin-3 :iso-8859-3    ISO8859-3
:iso8859-4      :latin4 :latin-4 :iso-8859-4    ISO8859-4
:iso8859-5      :cyrillic :iso-8859-5           ISO8859-5
:iso8859-6      :arabic :iso-8859-6             ISO8859-6
:iso8859-7      :greek :iso-8859-7              ISO8859-7
:iso8859-8      :hebrew :iso-8859-8             ISO8859-8
:iso8859-9      :latin5 :latin-5 :iso-8859-9    ISO8859-9
:iso8859-10     :latin6 :latin-6 :iso-8859-10   ISO8859-10
:iso8859-13     :latin7 :latin-7 :iso-8859-13   ISO8859-13
:iso8859-14     :latin8 :latin-8 :iso-8859-14   ISO8859-14
:iso8859-15     :latin9 :latin-9 :iso-8859-15   ISO8859-15
:utf-8          :utf :utf8                      UTF-8
:utf-16         :utf16                          UTF-16 with optional BOM
:utf-16-be      :utf-16be :utf16-be             UTF-16 big-endian (without BOM)
:utf-16-le      :utf-16le :utf16-le             UTF-16 little-endian (without BOM)
:utf-32         :utf32                          UTF-32 with optional BOM
:utf-32-be      :utf-32be :utf32-be             UTF-32 big-endian (without BOM)
:utf-32-le      :utf-32le :utf32-le             UTF-32 little-endian (without BOM)
:cp1250
:cp1251
:cp1252         :windows-1252 :windows-cp1252 :windows-latin1
:cp1253
:cp1254
:cp1255
:cp1256
:cp1257
:cp1258
:koi8-r
:mac-cyrillic
:mac-greek
:mac-icelandic
:mac-latin2
:mac-roman
:mac-turkish

Table 13.1: External formats

13.2.2

Composing External Formats

A composing external format is an external format that converts between one codepoint and another, rather than between codepoints and octets. A composing external format must be used in conjunction with another (octet-producing) external format. This is specified by using a list as the external format. For example, we can use '(:latin1 :crlf) as the external format. In this particular example, the external format is latin1, but whenever a carriage-return/linefeed sequence is read, it is converted to the Lisp #\Newline character. Conversely, whenever a string is written, a Lisp #\Newline character is converted to a carriage-return/linefeed sequence. Without the :crlf composing format, the carriage-return and linefeed will be read in as separate characters, and on output the Lisp #\Newline character is output as a single linefeed character.
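
The effect of :crlf can be observed with a small sketch like the following, which writes a line through '(:latin1 :crlf) and then reads the raw characters back through plain :latin1. The file name is a placeholder for this example. If the behaviour described above holds, the collected codes should end with 13 and 10 (carriage return, linefeed).

```lisp
(defparameter *crlf-file* "/tmp/cmucl-crlf-demo.txt") ; placeholder path

;; Write one line; with :crlf the #\Newline should go out as CR LF.
(with-open-file (out *crlf-file* :direction :output
                                 :if-exists :supersede
                                 :external-format '(:latin1 :crlf))
  (write-line "one" out))

;; Read the raw characters back without the composing format and
;; collect their char-codes to see what was actually written.
(with-open-file (in *crlf-file* :external-format :latin1)
  (loop for c = (read-char in nil)
        while c
        collect (char-code c)))
```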

Table 13.2 lists the available composing formats.

Format        Aliases  Description
:crlf         :dos     Composing format for converting to/from DOS (CR/LF) end-of-line sequence to #\Newline
:cr           :mac     Composing format for converting to/from Mac (CR) end-of-line sequence to #\Newline
:beta-gk               Composing format that translates (lower-case) Beta code (an ASCII encoding of ancient Greek)
:final-sigma           Composing format that attempts to detect sigma in word-final position and change it from U+3C3 to U+3C2

Table 13.2: Composing external formats

13.3

Dictionary

13.3.1

Variables

[Variable]
extensions:*default-external-format*

This is the default external format to use for all newly opened files. It is also the default format to use for *standard-input*, *standard-output*, and *standard-error*. The default value is :iso8859-1.

Setting this will cause the standard streams to start using the new format immediately. If a stream has been created with external format :default, then setting *default-external-format* will cause all subsequent input and output to use the new value of *default-external-format*.

13.3.2

Characters

Remember that CMUCL's characters are only 16 bits long but Unicode codepoints are up to 21 bits long. Hence there are codepoints that cannot be represented via Lisp characters. Operating on individual characters is not recommended. Operations on strings are better. (This would be true even if CMUCL's characters could hold a full Unicode codepoint.)

[Function]
char-equal &rest characters

[Function]
char-not-equal &rest characters

[Function]
char-lessp &rest characters

[Function]
char-greaterp &rest characters

[Function]
char-not-greaterp &rest characters

[Function]
char-not-lessp &rest characters
For the comparison, the characters are converted to lowercase and the corresponding char-codes are compared.

[Function]
alpha-char-p character

Returns non-nil if the Unicode category is a letter category.

[Function]
alphanumericp character

Returns non-nil if the character is in a Unicode letter category or is an ASCII digit.

[Function]
digit-char-p character &optional radix

Only recognizes ASCII digits (and ASCII letters if the radix is larger than 10).

[Function]
graphic-char-p character

Returns non-nil if the Unicode category is a graphic category.

[Function]
upper-case-p character

[Function]
lower-case-p character
Returns non-nil if the Unicode category is an uppercase (lowercase) character.

[Function]
lisp:title-case-p character

Returns non-nil if the Unicode category is a titlecase character.

[Function]
both-case-p character

Returns non-nil if the Unicode category is an uppercase, lowercase, or titlecase character.

[Function]
char-upcase character

[Function]
char-downcase character
The Unicode uppercase (lowercase) letter is returned.

[Function]
lisp:char-titlecase character

The Unicode titlecase letter is returned.

[Function]
char-name char

If possible, the name of the character char is returned. If there is a Unicode name, the Unicode name is returned, except spaces are converted to underscores and the string is capitalized via string-capitalize. If there is no Unicode name, the form #\U+xxxx is returned where "xxxx" is the char-code of the character, in hexadecimal.

[Function]
name-char name

The inverse to char-name. If no character has the name name, then nil is returned. Unicode names are not case-sensitive, and spaces and underscores are optional.
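
A short sketch of the round trip between char-name and name-char, assuming the Unicode database is available. The variable name *alpha* is chosen for this example only.

```lisp
;; GREEK SMALL LETTER ALPHA is U+03B1.
(defparameter *alpha* (code-char #x3B1))

;; Per the rule above, this should give the Unicode name with spaces
;; replaced by underscores and capitalized by string-capitalize.
(char-name *alpha*)

;; name-char is the inverse, so the round trip should give back *alpha*.
(char= *alpha* (name-char (char-name *alpha*)))
```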

13.3.3

Strings

Strings in CMUCL are UTF-16 strings. That is, for Unicode codepoints greater than 65535, surrogate pairs are used. We refer the reader to the Unicode standard for more information about surrogate pairs. Note that, because of the UTF-16 encoding of strings, there is a distinction between Lisp characters and Unicode codepoints. The standard string operations know about this encoding and handle the surrogate pairs correctly.

[Function]
string-upcase string &key :start :end :casing

[Function]
string-downcase string &key :start :end :casing

[Function]
string-capitalize string &key :start :end :casing
The case of the string is changed appropriately. Surrogate pairs are handled correctly. The conversion to the appropriate case is done based on the Unicode conversion. The additional argument :casing controls how case conversion is done. The default value is :simple, which uses simple Unicode case conversion. If :casing is :full, then full Unicode case conversion is done where the string may actually increase in length.
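
The difference between the two :casing values can be sketched with a word containing LATIN SMALL LETTER SHARP S (U+00DF). In Unicode this letter has no single-character uppercase form, while its full uppercase mapping is "SS". With :casing :full the result would therefore be expected to be one character longer than the input. The variable name *word* is chosen for this example only.

```lisp
;; "stra" + U+00DF + "e": six characters.
(defparameter *word* (format nil "stra~Ce" (code-char #xDF)))

;; Simple casing: the length of the result equals the length of the input.
(length (string-upcase *word*))

;; Full casing: the result may be longer than the input.
(length (string-upcase *word* :casing :full))
```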

[Function]
nstring-upcase string &key :start :end

[Function]
nstring-downcase string &key :start :end

[Function]
nstring-capitalize string &key :start :end
The case of the string is changed appropriately. Surrogate pairs are handled correctly. The conversion to the appropriate case is done based on the Unicode conversion. (Full casing is not available because the string length cannot be increased when needed.)

[Function]
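string= s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string/= s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string< s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string> s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string<= s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string>= s1 s2 &key :start1 :end1 :start2 :end2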
The string comparison is done in codepoint order. (This is different from just comparing the order of the individual characters due to surrogate pairs.) Unicode collation is not done.
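
To see why codepoint order differs from the order of the individual characters, consider U+FFFD and U+10000. The portable sketch below uses a made-up helper, first-code-unit, that is not part of CMUCL. It shows that the first UTF-16 code unit of U+10000 is a high surrogate, which is numerically smaller than #xFFFD.

```lisp
;; Hypothetical helper: the first UTF-16 code unit used to encode CODEPOINT.
(defun first-code-unit (codepoint)
  (if (< codepoint #x10000)
      codepoint
      ;; High (leading) surrogate of the pair.
      (+ #xD800 (ash (- codepoint #x10000) -10))))

;; Codepoint order: U+FFFD sorts before U+10000.
(< #xFFFD #x10000)                                      ; => T

;; Code-unit order: #xFFFD is compared with #xD800, reversing the result.
(< (first-code-unit #xFFFD) (first-code-unit #x10000))  ; => NIL
```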

[Function]
string-equal s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string-not-equal s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string-lessp s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string-greaterp s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string-not-greaterp s1 s2 &key :start1 :end1 :start2 :end2

[Function]
string-not-lessp s1 s2 &key :start1 :end1 :start2 :end2
Each codepoint in each string is converted to lowercase and the appropriate comparison of the codepoint values is done. Unicode collation is not done.

[Function]
string-left-trim bag string

[Function]
string-right-trim bag string

[Function]
string-trim bag string
Removes any characters in bag from the left, right, or both ends of the string string, respectively. This has potential problems if you want to remove a surrogate character from the string, since a single character cannot represent a surrogate. As an extension, if bag is a string, we properly handle surrogate characters in the bag.

13.3.4

Sequences

Since strings are also sequences, the sequence functions can be used on strings. We note here some issues with these functions. Most issues are due to the fact that strings are UTF-16 strings and characters are UTF-16 code units, not Unicode codepoints.

[Function]
remove-duplicates sequence &key :from-end :test :test-not :start :end :key

[Function]
delete-duplicates sequence &key :from-end :test :test-not :start :end :key
Because of surrogate pairs these functions may remove a high or low surrogate value, leaving the string in an invalid state. Use these functions carefully with strings.
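
The problem can be illustrated portably on a list of UTF-16 code unit values instead of a string. The two codepoints U+1D11E and U+1D160 share the same high surrogate, so removing duplicates breaks the first pair. The variable name *units* is chosen for this example only.

```lisp
;; UTF-16 code units for the two codepoints U+1D11E and U+1D160.
;; Both pairs share the same high surrogate, #xD834.
(defparameter *units* '(#xD834 #xDD1E #xD834 #xDD60))

;; The earlier duplicate #xD834 is dropped, so the result starts with
;; an unpaired low surrogate: a list equal to (#xDD1E #xD834 #xDD60).
(remove-duplicates *units*)
```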

13.3.5

Reader

To support Unicode characters, the reader has been extended to recognize characters written in hexadecimal. Thus #\U+41 is the ASCII capital letter "A", since 41 is the hexadecimal code for that letter. The Unicode name of the character is also recognized, except spaces in the name are replaced by underscores.

Recall, however, that characters in CMUCL are only 16 bits long, so many Unicode characters cannot be represented as a single Lisp character. Strings, on the other hand, can represent all Unicode characters.

When symbols are read, the symbol name is converted to Unicode NFC form before interning the symbol into the package. Hence, (symbol-name (intern "string")) may produce a string that is not string= to "string". However, after conversion to NFC form, the strings will be identical.

13.3.6

Printer

When printing characters, if the character is a graphic character, the character is printed. Thus #\U+41 is printed as #\A. If the character is not a graphic character, the Lisp name (e.g., #\Tab) is used if possible; if there is no Lisp name, the Unicode name is used. If there is no Unicode name, the hexadecimal char-code is printed. For example, #\U+34e, which is not a graphic character, is printed as #\Combining_Upwards_Arrow_Below, and #\U+9f which is not a graphic character and does not have a Unicode name, is printed as #\U+009F.

13.3.7

Miscellaneous

13.3.7.1

Files

CMUCL loads external formats using the search-list ext-formats:. The aliases file is also located using this search-list.

The Unicode database is stored in compressed form in the file ext-formats:unidata.bin. If this file is not found, Unicode support is severely reduced; you can only use ASCII characters.

13.3.7.2

Utilities

Since strings are UTF-16 and hence may contain surrogate pairs, some utility functions are provided to make access easier.

[Function]
lisp:codepoint string i &optional end

Return the codepoint value from string at position i. If the code unit at that position is a surrogate value, it is combined with either the previous or following code unit (when possible) to compute the codepoint. The first return value is the codepoint itself. The second return value is nil if the position is not part of a surrogate pair. Otherwise, +1 or -1 is returned if the position is the high (leading) or low (trailing) surrogate value, respectively.

This is useful for iterating through a string in codepoint sequence.
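
One way such an iteration might look is sketched below. The function name string-codepoints is made up for this example. The sketch relies only on the return values described above: when the second value is +1, the position holds the high surrogate of a pair, so the index advances by two.

```lisp
;; Hypothetical helper: list of the codepoints in STRING, in order.
(defun string-codepoints (string)
  (let ((i 0)
        (n (length string))
        (result '()))
    (loop while (< i n)
          do (multiple-value-bind (cp surrogate) (lisp:codepoint string i)
               (push cp result)
               ;; Skip the low surrogate when a pair was consumed.
               (incf i (if (eql surrogate 1) 2 1))))
    (nreverse result)))

;; Build "A" followed by U+1D11E, using lisp:surrogates for the pair.
(defparameter *sample*
  (multiple-value-bind (hi lo) (lisp:surrogates #x1D11E)
    (coerce (list #\A hi lo) 'string)))

(string-codepoints *sample*)   ; two codepoints from three characters
```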

[Function]
lisp:surrogates-to-codepoint hi lo

Convert the given hi and lo surrogate characters to the corresponding codepoint value.

[Function]
lisp:surrogates codepoint

Convert the given codepoint value to the corresponding high and low surrogate characters. If the codepoint is less than 65536, the second value is nil since the codepoint does not need to be represented as a surrogate pair.
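
The arithmetic behind these two conversions is the standard UTF-16 formula. The sketch below shows it in portable Common Lisp, working on integer code values rather than characters. The function names are made up for this example; they are not the CMUCL functions and only mirror what those are described as doing.

```lisp
;; Hypothetical portable sketch; not the CMUCL implementation.
(defun codepoint-to-surrogate-codes (codepoint)
  (if (< codepoint #x10000)
      ;; No surrogate pair needed: the second value is nil.
      (values codepoint nil)
      (let ((offset (- codepoint #x10000)))
        (values (+ #xD800 (ash offset -10))           ; high: top 10 bits
                (+ #xDC00 (logand offset #x3FF))))))  ; low: bottom 10 bits

(defun surrogate-codes-to-codepoint (hi lo)
  (+ #x10000
     (ash (- hi #xD800) 10)
     (- lo #xDC00)))

(codepoint-to-surrogate-codes #x1D11E)         ; => #xD834, #xDD1E
(surrogate-codes-to-codepoint #xD834 #xDD1E)   ; => #x1D11E
```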

[Function]
stream:string-encode string external-format &optional (start 0) end

string-encode encodes string using the format external-format, producing an array of octets. Each octet is converted to a character via code-char and the resulting string is returned.

The optional argument start, defaulting to 0, specifies the starting index and end, defaulting to the length of the string, is the end of the string.

[Function]
stream:string-decode string external-format &optional (start 0) end

string-decode decodes string using the format external-format and produces a new string. Each character of string is converted to octet (by char-code) and the resulting array of octets is used by the external format to produce a string. This is the inverse of string-encode.

The optional argument start, defaulting to 0, specifies the starting index and end, defaulting to the length of the string, is the end of the string.

string must consist of characters whose char-code is less than 256.
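
A small round-trip sketch using the two functions as described above. UTF-8 encodes U+00E9 as the two octets #xC3 and #xA9, so the encoded string should contain two characters with those char-codes. The variable names are chosen for this example only.

```lisp
;; A one-character string holding U+00E9.
(defparameter *text* (string (code-char #xE9)))

;; Each character of the result stands for one octet of the encoding.
(defparameter *encoded* (stream:string-encode *text* :utf-8))

;; Octet values of the encoding, as a list of integers.
(map 'list #'char-code *encoded*)

;; Decoding with the same format should give a string that is
;; string= to *text*.
(stream:string-decode *encoded* :utf-8)
```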

[Function]
string-to-octets string &key :start :end :external-format :buffer

string-to-octets converts string to a sequence of octets according to the external format specified by external-format. The string to be converted is bounded by start, which defaults to 0, and end, which defaults to the length of the string. If buffer is specified, the octets are placed in buffer. If buffer is not specified, a new array is allocated to hold the octets. In all cases the buffer is returned.

[Function]
octets-to-string octets &key :start :end :external-format :string :s-start :s-end :state

octets-to-string converts the sequence of octets in octets to a string. octets must be a (simple-array (unsigned-byte 8) (*)). The octets to be converted are bounded by start and end, which default to 0 and the length of the array, respectively. The conversion is performed according to the external format specified by external-format. If string is specified, the octets are converted and stored in string, starting at s-start (defaulting to 0) and ending just before s-end (defaulting to the end of string). string must be a simple-string. If the bounded string is not large enough to hold all of the characters, then some octets will not be converted. If string is not specified, a new string is created.

The state is used as the initial state for the external format. This is useful when converting buffers of octets where the buffers are not on character boundaries, and state information is needed between buffers.

Four values are returned: the string, the number of characters written to the string, the number of octets consumed to produce the characters, and the final state of the external format after converting the octets.

13.4

Writing External Formats

13.4.1

External Formats

Users may write their own external formats. It is probably easiest to look at existing external formats to see how to do this.

An external format basically needs two functions: octets-to-code to convert octets to Unicode codepoints and code-to-octets to convert Unicode codepoints to octets. The external format is defined using the macro stream::define-external-format.

[Macro]
stream:define-external-format name (&key min max size) (&rest slots) octets-to-code code-to-octets flush-state copy-state

[Macro]
stream:define-external-format name (base) (&rest slots)

The first form defines a new external format named :name. min and max are the minimum and maximum number of octets that make up a character. (:size n is just a shortcut for :min n :max n.) The arguments octets-to-code and code-to-octets are not optional in this case. They specify how to convert octets to codepoints and vice versa, respectively. These should be backquoted forms for the body of a function to do the conversion. See the description below for these functions. Some good examples are the external formats for :utf-8 and :utf-16. The :slots argument is a list of read-only slots, similar to defstruct. The slot names are available as local variables inside the code-to-octets and octets-to-code bodies.

The second form above defines an external format with the name :name that is based on a previously defined format :base. The slots are inherited from the :base format by default, although the definition may alter their values and add new slots. See, for example, the :mac-greek external format.

[Macro]
octets-to-code state input unput &rest args

This defines a form to be used by an external format to convert octets to a code point. state is a form that can be used by the body to access the state variable of a stream. This can be used for any reason to hold anything needed by octets-to-code. input is a form that returns one octet from the input stream. unput will put back N octets to the stream. args is a list of variables that need to be defined for any symbols in the body of the macro.

[Macro]
code-to-octets code state output &rest args

Defines a form to be used by the external format to convert a code point to octets for output. code is the code point to be converted. state is a form to access the current value of the stream's state variable. output is a form that writes one octet to the output stream.

[Macro]
flush-state state output &rest args

Defines a form to be used by the external format to flush out any state when an output stream is closed. Similar to code-to-octets, but there is no code point to be output.

If nil, then nothing special is needed to flush the state to the output.

This is called only when an output character stream is being closed.

[Macro]
copy-state state &rest args

Defines a form to copy any state needed by the external format. This should probably be a deep copy so that if the original state is modified, the copy is not.

If not given, then nothing special is needed to copy the state, either because there is no state for the external format or because no special copier is needed.

13.4.2

Composing External Formats

[Macro]
stream:define-composing-external-format name (&key min max size) input output

This is the same as define-external-format, except that a composing external format is created.
