20 String Encodings

The scheme_utf8_decode function decodes a char array as UTF-8 into either a UCS-4 mzchar array or a UTF-16 short array. The scheme_utf8_encode function encodes either a UCS-4 mzchar array or a UTF-16 short array into a UTF-8 char array.

These functions can be used to check or measure an encoding or decoding without actually producing the resulting decoding or encoding, and variations of the functions provide control over the handling of decoding errors.

int scheme_utf8_decode(const unsigned char* s,
                       int start,
                       int end,
                       mzchar* us,
                       int dstart,
                       int dend,
                       intptr_t* ipos,
                       char utf16,
                       int permissive)

Decodes a byte array as UTF-8 to produce either Unicode code points into us (when utf16 is zero) or UTF-16 code units into us cast to short* (when utf16 is non-zero). No nul terminator is added to us.

The result is non-negative when all of the given bytes are decoded, and the result is the length of the decoding (in mzchars or shorts). A -2 result indicates an invalid encoding sequence in the given bytes (possibly because the range to decode ended mid-encoding), and a -3 result indicates that decoding stopped because not enough room was available in the result string.

The start and end arguments specify a range of s to be decoded. If end is negative, strlen(s) is used as the end.

If us is NULL, then decoded characters are not produced, but the result is valid as if they were written. The dstart and dend arguments specify a target range in us (in mzchar or short units) for the decoding; a negative value for dend indicates that any number of units can be written to us, which is normally sensible only when us is NULL for measuring the length of the decoding.

If ipos is non-NULL, it is filled with the first undecoded index within s. If the function result is non-negative, then *ipos is set to the ending index (which is end if non-negative, strlen(s) otherwise). If the result is -1 or -2, then *ipos effectively indicates how many bytes were decoded before decoding stopped.

If permissive is non-zero, it is used as the decoding of any byte that is not part of a valid UTF-8 encoding, including when the input ends in the middle of an encoding. Thus, the function result can be -1 or -2 only if permissive is 0.

On Windows, when utf16 is non-zero, decoding supports a natural extension of UTF-8 that can produce unpaired UTF-16 surrogates in the result.

This function does not allocate or trigger garbage collection.
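The result convention described above can be illustrated with a small standalone sketch. It is not the Racket implementation: sketch_utf8_decode is a hypothetical stand-in that uses uint32_t in place of mzchar, handles only the UCS-4 case, and has no utf16 or permissive arguments. Setting *ipos on a -3 result is a choice made for the sketch, not documented behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the Racket function: decodes s[start..end) as
   UTF-8 into code points. Returns the number of code points, -2 for an
   invalid or truncated sequence, or -3 when the target range is full. */
int sketch_utf8_decode(const unsigned char *s, int start, int end,
                       uint32_t *us, int dstart, int dend, intptr_t *ipos)
{
  int i = start, j = dstart, result = 0;
  assert(end >= start);
  while (i < end) {
    unsigned char c = s[i];
    uint32_t v = 0, min = 0;
    int extra = 0, k;
    if (c < 0x80)                { v = c;        extra = 0; min = 0; }
    else if ((c & 0xE0) == 0xC0) { v = c & 0x1F; extra = 1; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { v = c & 0x0F; extra = 2; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { v = c & 0x07; extra = 3; min = 0x10000; }
    else { result = -2; break; }                  /* bad lead byte */
    if (i + extra >= end) { result = -2; break; } /* range ends mid-encoding */
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) break;       /* not a continuation byte */
      v = (v << 6) | (s[i + k] & 0x3F);
    }
    /* Reject bad continuations, overlong forms, surrogates, out-of-range. */
    if (k <= extra || v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
      result = -2;
      break;
    }
    if (dend >= 0 && j >= dend) { result = -3; break; } /* no room left */
    if (us) us[j] = v;          /* a NULL target only measures */
    j++;
    i += extra + 1;
  }
  if (ipos) *ipos = i;          /* first undecoded index */
  return result < 0 ? result : j - dstart;
}
```

A typical use is two passes: call once with a NULL target and a negative dend to measure, size a buffer from the result, then call again with the buffer and its bounds.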

int scheme_utf8_decode_offset_prefix(const unsigned char* s,
                                     int start,
                                     int end,
                                     mzchar* us,
                                     int dstart,
                                     int dend,
                                     intptr_t* ipos,
                                     char utf16,
                                     int permissive)

Like scheme_utf8_decode, but returns -1 if the input ends in the middle of a UTF-8 encoding even if permissive is non-zero.

Added in version 6.0.1.13.

int scheme_utf8_decode_as_prefix(const unsigned char* s,
                                 int start,
                                 int end,
                                 mzchar* us,
                                 int dstart,
                                 int dend,
                                 intptr_t* ipos,
                                 char utf16,
                                 int permissive)

Like scheme_utf8_decode, but the result is always the number of decoded mzchars or shorts. If a decoding error is encountered, the result is still the size of the decoding up until the error.

int scheme_utf8_decode_all(const unsigned char* s,
                           int len,
                           mzchar* us,
                           int permissive)

Like scheme_utf8_decode, but with fewer arguments. The decoding produces UCS-4 mzchars. If the buffer us is non-NULL, it is assumed to be long enough to hold the decoding (which cannot be longer than the length of the input, though it may be shorter). If len is negative, strlen(s) is used as the input length.

int scheme_utf8_decode_prefix(const unsigned char* s,
                              int len,
                              mzchar* us,
                              int permissive)

Like scheme_utf8_decode, but with fewer arguments. The decoding produces UCS-4 mzchars. The buffer us must be non-NULL, and it is assumed to be long enough to hold the decoding (which cannot be longer than the length of the input, though it may be shorter). If len is negative, strlen(s) is used as the input length.

In addition to the result of scheme_utf8_decode, the result can be -1 to indicate that the input ended with a partial (valid) encoding. A -1 result is possible even when permissive is non-zero.

mzchar* scheme_utf8_decode_to_buffer(const unsigned char* s,
                                     int len,
                                     mzchar* buf,
                                     int blen)

Like scheme_utf8_decode_all with permissive as 0, but if buf is not large enough (as indicated by blen) to hold the result, a new buffer is allocated. Unlike other functions, this one adds a nul terminator to the decoding result. The function result is either buf (if it was big enough) or a buffer allocated with scheme_malloc_atomic.

mzchar* scheme_utf8_decode_to_buffer_len(const unsigned char* s,
                                         int len,
                                         mzchar* buf,
                                         int blen,
                                         intptr_t* ulen)

Like scheme_utf8_decode_to_buffer, but the length of the result (not including the terminator) is placed into ulen if ulen is non-NULL.

int scheme_utf8_decode_count(const unsigned char* s,
                             int start,
                             int end,
                             int* state,
                             int might_continue,
                             int permissive)

Like scheme_utf8_decode, but without producing the decoded mzchars, and always returning the number of decoded mzchars up until a decoding error (if any). If might_continue is non-zero, then a partial valid encoding at the end of the input is not decoded when permissive is also non-zero.

If state is non-NULL, it holds information about partial encodings; it should be set to zero for an initial call, and then passed back to scheme_utf8_decode_count along with bytes that extend the given input (i.e., without any unused partial encodings). Typically, this mode makes sense only when might_continue and permissive are non-zero.

int scheme_utf8_encode(const mzchar* us,
                       int start,
                       int end,
                       unsigned char* s,
                       int dstart,
                       char utf16)

Encodes the given UCS-4 array of mzchars (if utf16 is zero) or UTF-16 array of shorts (if utf16 is non-zero) into s. The end argument must be no less than start.

The array s is assumed to be long enough to contain the encoding, but no encoding is written if s is NULL. The dstart argument indicates a starting place in s to hold the encoding. No nul terminator is added to s.

The result is the number of bytes produced for the encoding (or that would be produced if s was non-NULL). Encoding never fails.

On Windows, when utf16 is non-zero, encoding supports unpaired surrogates in the input UTF-16 code-unit sequence, in which case encoding generates a natural extension of UTF-8 that encodes unpaired surrogates.

This function does not allocate or trigger garbage collection.
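As an illustration of measuring versus producing an encoding, here is a minimal standalone sketch. It is not the Racket implementation: sketch_utf8_encode is a hypothetical stand-in that uses uint32_t in place of mzchar, covers only the UCS-4 case (utf16 as zero), and assumes every input value is a valid code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the Racket function: encodes us[start..end) as
   UTF-8 into s starting at dstart, or only counts bytes when s is NULL.
   Assumes each value is a valid code point (at most 0x10FFFF). */
int sketch_utf8_encode(const uint32_t *us, int start, int end,
                       unsigned char *s, int dstart)
{
  int i, j = dstart;
  assert(end >= start);
  for (i = start; i < end; i++) {
    uint32_t v = us[i];
    if (v < 0x80) {                 /* 1 byte: 0xxxxxxx */
      if (s) s[j] = (unsigned char)v;
      j += 1;
    } else if (v < 0x800) {         /* 2 bytes: 110xxxxx 10xxxxxx */
      if (s) {
        s[j]     = (unsigned char)(0xC0 | (v >> 6));
        s[j + 1] = (unsigned char)(0x80 | (v & 0x3F));
      }
      j += 2;
    } else if (v < 0x10000) {       /* 3 bytes */
      if (s) {
        s[j]     = (unsigned char)(0xE0 | (v >> 12));
        s[j + 1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | (v & 0x3F));
      }
      j += 3;
    } else {                        /* 4 bytes */
      if (s) {
        s[j]     = (unsigned char)(0xF0 | (v >> 18));
        s[j + 1] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        s[j + 3] = (unsigned char)(0x80 | (v & 0x3F));
      }
      j += 4;
    }
  }
  return j - dstart;                /* bytes produced (or that would be) */
}
```

Because the target is assumed to be long enough, a caller would normally measure first with a NULL target and then allocate that many bytes, plus one if a nul terminator is wanted.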

int scheme_utf8_encode_all(const mzchar* us,
                           int len,
                           unsigned char* s)

Like scheme_utf8_encode with 0 for start, len for end, 0 for dstart and 0 for utf16.

char* scheme_utf8_encode_to_buffer(const mzchar* s,
                                   int len,
                                   char* buf,
                                   int blen)

Like scheme_utf8_encode_all, but the length of buf is given, and if it is not long enough to hold the encoding, a buffer is allocated. A nul terminator is added to the encoded array. The result is either buf or an array allocated with scheme_malloc_atomic.

char* scheme_utf8_encode_to_buffer_len(const mzchar* s,
                                       int len,
                                       char* buf,
                                       int blen,
                                       intptr_t* rlen)

Like scheme_utf8_encode_to_buffer, but the length of the resulting encoding (not including a nul terminator) is reported in rlen if it is non-NULL.

unsigned short* scheme_ucs4_to_utf16(const mzchar* text,
                                     int start,
                                     int end,
                                     unsigned short* buf,
                                     int bufsize,
                                     intptr_t* ulen,
                                     int term_size)

Converts a UCS-4 encoding (the indicated range of text) to a UTF-16 encoding. The end argument must be no less than start.

A result buffer is allocated if buf is not long enough (as indicated by bufsize). If ulen is non-NULL, it is filled with the length of the UTF-16 encoding. The term_size argument indicates a number of shorts to reserve at the end of the result buffer for a terminator (but no terminator is actually written).
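The buffer-or-allocate behavior and the role of term_size can be sketched as follows. This is a hypothetical stand-in, not the Racket implementation: sketch_ucs4_to_utf16 uses uint32_t in place of mzchar, and plain malloc stands in for whatever allocator the real function uses, so the caller of the sketch must free an allocated result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in, NOT the Racket function: converts text[start..end)
   from UCS-4 to UTF-16. Uses buf when it can hold the result plus term_size
   reserved units; otherwise allocates with malloc (an assumption here). */
unsigned short *sketch_ucs4_to_utf16(const uint32_t *text, int start, int end,
                                     unsigned short *buf, int bufsize,
                                     intptr_t *ulen, int term_size)
{
  int i, j, need;
  assert(end >= start);
  /* Each code point above U+FFFF needs a surrogate pair: one extra unit. */
  need = end - start;
  for (i = start; i < end; i++)
    if (text[i] > 0xFFFF) need++;
  if (need + term_size > bufsize) {
    buf = malloc((size_t)(need + term_size) * sizeof(unsigned short));
    if (!buf) return NULL;
  }
  for (i = start, j = 0; i < end; i++) {
    uint32_t v = text[i];
    if (v > 0xFFFF) {
      v -= 0x10000;
      buf[j++] = (unsigned short)(0xD800 | ((v >> 10) & 0x3FF)); /* high */
      buf[j++] = (unsigned short)(0xDC00 | (v & 0x3FF));         /* low */
    } else {
      buf[j++] = (unsigned short)v;
    }
  }
  /* Space for term_size units is reserved, but no terminator is written. */
  if (ulen) *ulen = j;
  return buf;
}
```

Comparing the returned pointer with the buffer passed in tells the caller whether an allocation happened.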

mzchar* scheme_utf16_to_ucs4(const unsigned short* text,
                             int start,
                             int end,
                             mzchar* buf,
                             int bufsize,
                             intptr_t* ulen,
                             int term_size)

Converts a UTF-16 encoding (the indicated range of text) to a UCS-4 encoding. The end argument must be no less than start.

A result buffer is allocated if buf is not long enough (as indicated by bufsize). If ulen is non-NULL, it is filled with the length of the UCS-4 encoding. The term_size argument indicates a number of mzchars to reserve at the end of the result buffer for a terminator (but no terminator is actually written).
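The reverse direction can be sketched the same way. Again this is a hypothetical stand-in, not the Racket implementation: sketch_utf16_to_ucs4 uses uint32_t in place of mzchar and malloc as the allocator. Passing an unpaired surrogate through unchanged is an assumption of the sketch; the text above does not say how such input is treated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int sketch_is_high(unsigned v) { return v >= 0xD800 && v <= 0xDBFF; }
int sketch_is_low(unsigned v)  { return v >= 0xDC00 && v <= 0xDFFF; }

/* Hypothetical stand-in, NOT the Racket function: converts text[start..end)
   from UTF-16 to UCS-4, combining surrogate pairs. Unpaired surrogates are
   copied through unchanged (an assumption of this sketch). */
uint32_t *sketch_utf16_to_ucs4(const unsigned short *text, int start, int end,
                               uint32_t *buf, int bufsize,
                               intptr_t *ulen, int term_size)
{
  int i, j, need = 0;
  assert(end >= start);
  /* First pass: count code points, treating each valid pair as one. */
  for (i = start; i < end; i++) {
    need++;
    if (sketch_is_high(text[i]) && i + 1 < end && sketch_is_low(text[i + 1]))
      i++;
  }
  if (need + term_size > bufsize) {
    buf = malloc((size_t)(need + term_size) * sizeof(uint32_t));
    if (!buf) return NULL;
  }
  /* Second pass: produce the code points. */
  for (i = start, j = 0; i < end; i++) {
    uint32_t v = text[i];
    if (sketch_is_high(v) && i + 1 < end && sketch_is_low(text[i + 1])) {
      v = 0x10000 + ((v - 0xD800) << 10) + (uint32_t)(text[i + 1] - 0xDC00);
      i++;
    }
    buf[j++] = v;
  }
  /* Space for term_size units is reserved, but no terminator is written. */
  if (ulen) *ulen = j;
  return buf;
}
```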