13.3.3 Strings

Strings in CMUCL are UTF-16 strings. That is, Unicode codepoints greater than 65535 (#xFFFF) are represented by surrogate pairs. We refer the reader to the Unicode standard for more information about surrogate pairs. Because of the UTF-16 encoding, a Lisp character is not always the same thing as a Unicode codepoint: a codepoint above #xFFFF occupies two Lisp characters in a string. The standard string operations know about this encoding and handle surrogate pairs correctly.
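The following sketch illustrates the distinction between Lisp characters and codepoints. The helper surrogates-to-codepoint* is a hypothetical name introduced only for this illustration (it is not part of CMUCL's interface); it applies the UTF-16 arithmetic defined by the Unicode standard.

```lisp
;; Hypothetical helper (not part of CMUCL's API): combine a high and a
;; low surrogate into the codepoint they encode, per the UTF-16 definition.
(defun surrogates-to-codepoint* (high low)
  (+ #x10000
     (ash (- (char-code high) #xD800) 10)
     (- (char-code low) #xDC00)))

;; U+1D11E (MUSICAL SYMBOL G CLEF) is encoded in UTF-16 as the
;; surrogate pair #xD834 #xDD1E.
(defvar *clef*
  (coerce (list (code-char #xD834) (code-char #xDD1E)) 'string))

(length *clef*)                       ; => 2 (two Lisp characters)
(surrogates-to-codepoint* (char *clef* 0) (char *clef* 1))
                                      ; => 119070, that is, #x1D11E
```

So a string holding a single codepoint above #xFFFF has length 2, and indexing it with char returns the individual surrogates.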

Function: unicode:string-upcase string &key :start :end :casing
Function: unicode:string-downcase string &key :start :end :casing

The case of the string is changed appropriately, following the Unicode case conversion rules. Surrogate pairs are handled correctly. The additional argument :casing controls how case conversion is done. The default value is :simple, which uses simple Unicode case conversion and is equivalent to the function of the same name in the COMMON-LISP package. If :casing is :full, full Unicode case conversion is done, and the resulting string may be longer than the original.
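As a rough illustration of the difference between the two :casing values, the sketch below upcases a word containing U+00DF (LATIN SMALL LETTER SHARP S). In the Unicode full case mappings this character uppercases to the two letters "SS", so the second call would be expected to return a longer string than the first; the exact results are not shown here because they depend on the Unicode data built into the running CMUCL.

```lisp
;; "strasse" spelled with a sharp s, built with CODE-CHAR so the example
;; does not depend on the encoding of the source file.
(defvar *word*
  (concatenate 'string "stra" (string (code-char #xDF)) "e"))

;; Simple casing (the default) maps each character to a single
;; character, so the length of the result equals that of the argument.
(unicode:string-upcase *word*)

;; Full casing may map one character to several, so the result can be
;; longer than the argument.
(unicode:string-upcase *word* :casing :full)
```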

Function: unicode:string-capitalize string &key :start :end :casing :unicode-word-break

Given a string, returns a copy of the string with the first character of each “word” converted to upper case, and the remaining characters in the word converted to lower case. The value of :casing is :simple, :full or :title for simple, full or title case conversion, respectively. The default value for :casing is :title. If :unicode-word-break is non-NIL, then the Unicode word-breaking algorithm is used to determine the word boundaries. Otherwise, a “word” is defined to be a string of case-modifiable characters delimited by non-case-modifiable characters. The default for :unicode-word-break is T.
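The calls below sketch how the keyword arguments might be combined. The sample text is chosen because an apostrophe is not case-modifiable: with :unicode-word-break NIL it delimits a word, whereas the Unicode word-breaking rules normally keep an apostrophe between letters inside a word. Results are deliberately not shown, since they depend on the Unicode data in the running image.

```lisp
(defvar *text* "don't stop")

;; Defaults: title casing, word boundaries from the Unicode
;; word-breaking algorithm.
(unicode:string-capitalize *text*)

;; Word boundaries at every non-case-modifiable character instead,
;; so the apostrophe ends the first word.
(unicode:string-capitalize *text* :unicode-word-break nil)

;; Simple case conversion rather than title case conversion.
(unicode:string-capitalize *text* :casing :simple)
```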

Function: nstring-upcase string &key :start :end
Function: nstring-downcase string &key :start :end
Function: nstring-capitalize string &key :start :end

The case of the string is changed in place. Surrogate pairs are handled correctly. The conversion to the appropriate case is done based on the Unicode conversion. (Full casing is not available because these functions modify the string destructively and the length of the string cannot be increased.)
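A minimal sketch of the destructive variants follows. Since the string is modified in place, the example works on a fresh copy rather than on a literal.

```lisp
;; Literal strings must not be modified, so work on a fresh copy.
(defvar *buffer* (copy-seq "hello, world"))

;; Upcase only the first five characters, in place.
(nstring-upcase *buffer* :start 0 :end 5)

*buffer*                              ; => "HELLO, world"
(length *buffer*)                     ; => 12, unchanged
```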

Function: string= s1 s2 &key :start1 :end1 :start2 :end2
Function: string/= s1 s2 &key :start1 :end1 :start2 :end2
Function: string< s1 s2 &key :start1 :end1 :start2 :end2
Function: string> s1 s2 &key :start1 :end1 :start2 :end2
Function: string<= s1 s2 &key :start1 :end1 :start2 :end2
Function: string>= s1 s2 &key :start1 :end1 :start2 :end2

The string comparison is done in codepoint order. (Because of surrogate pairs, this can differ from comparing the individual Lisp characters in order.) Unicode collation is not done.
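To see why codepoint order and character order can disagree, consider the sketch below. The first string is the single character U+FFFD; the second is U+10000, stored as a surrogate pair whose first character, #xD800, has a smaller character code than #xFFFD.

```lisp
;; A: the single character U+FFFD.
(defvar *a* (string (code-char #xFFFD)))

;; B: U+10000, stored as the surrogate pair #xD800 #xDC00.
(defvar *b*
  (coerce (list (code-char #xD800) (code-char #xDC00)) 'string))

;; Compared character by character, A's first character is the larger one.
(char> (char *a* 0) (char *b* 0))     ; => T

;; As codepoints, U+FFFD is less than U+10000, so according to the
;; description above STRING< should order A before B.
(string< *a* *b*)
```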

Function: string-equal s1 s2 &key :start1 :end1 :start2 :end2
Function: string-not-equal s1 s2 &key :start1 :end1 :start2 :end2
Function: string-lessp s1 s2 &key :start1 :end1 :start2 :end2
Function: string-greaterp s1 s2 &key :start1 :end1 :start2 :end2
Function: string-not-greaterp s1 s2 &key :start1 :end1 :start2 :end2
Function: string-not-lessp s1 s2 &key :start1 :end1 :start2 :end2

Each codepoint in each string is converted to lowercase and the appropriate comparison of the codepoint values is done. Unicode collation is not done.
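The short sketch below illustrates both points: case is ignored, and the ordering follows codepoint values rather than any language-specific collation.

```lisp
(string-equal "Hello" "HELLO")        ; => T

;; U+00E9 (e with acute accent) has a larger codepoint than "z", so it
;; compares as greater, although many collation orders would sort it
;; next to "e".
(defvar *e-acute* (string (code-char #xE9)))

(string-lessp "z" *e-acute*)          ; true (returns the mismatch index)
(string-lessp *e-acute* "z")          ; => NIL
```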

Function: string-left-trim bag string
Function: string-right-trim bag string
Function: string-trim bag string

Removes any characters in bag from the left, right, or both ends of the string string, respectively. This is a potential problem if you want to remove a character that is represented by a surrogate pair, since such a character cannot be given as a single Lisp character. As an extension, if bag is a string, surrogate pairs in the bag are handled properly.
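As a sketch of the string-bag form, the example below trims U+10000 from a string. The codepoint is written as a two-character string because no single Lisp character can represent it.

```lisp
;; U+10000 as a surrogate pair.
(defvar *pair*
  (coerce (list (code-char #xD800) (code-char #xDC00)) 'string))

;; The pair, then "abc", then the pair again: 7 Lisp characters.
(defvar *padded* (concatenate 'string *pair* "abc" *pair*))

;; Passing the bag as a string lets it name the whole pair.
(string-trim *pair* *padded*)         ; => "abc"

;; Only the leading pair is removed; the result has 5 Lisp characters.
(string-left-trim *pair* *padded*)
```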