# unicode.h
## Overview
API for converting between UTF-8 encoded text and UTF-16 or UTF-32. All strings in The Machinery are UTF-8 encoded, but UTF-16 and UTF-32 are sometimes needed to communicate with external APIs. For example, Windows uses UTF-16.

!!! TODO: API-REVIEW
    * Add `tm_str_t codepoint_range(tm_str_t s)`.

## Index
`struct tm_unicode_api`

UTF-8
`is_valid()`
`truncate()`

UTF-32
`utf8_encode()`
`utf8_decode()`
`utf8_num_codepoints()`
`utf8_decode_n()`
`utf8_to_utf32()`
`utf8_to_utf32_n()`
`utf32_to_utf8()`
`utf32_to_utf8_n()`

UTF-16
`utf16_encode()`
`utf16_decode()`
`utf8_to_utf16()`
`utf8_to_utf16_n()`
`utf16_to_utf8()`
`utf16_to_utf8_n()`

`tm_unicode_api_version`
`tm_codepoint_to_utf8()`
## API
### `struct tm_unicode_api`

UTF-8

#### `is_valid()`

~~~c
bool (*is_valid)(const char *utf8);
~~~

Returns *true* if `utf8` is a valid UTF-8 string, *false* otherwise.

#### `truncate()`

~~~c
void (*truncate)(char *utf8);
~~~

Fixes the truncation of a UTF-8 encoded string `utf8` by replacing any split codepoint at the end of the string with `\0` bytes. Use this after truncating a string to make sure that the result is still a valid UTF-8 string.
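To make the idea behind `truncate()` concrete, here is a minimal standalone sketch of how a split trailing codepoint can be detected and zeroed out. It is not The Machinery's implementation: `sketch_sequence_length()` and `sketch_fix_truncation()` are hypothetical names, and the sketch assumes the string was valid UTF-8 before it was cut.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not part of tm_unicode_api): returns the total length
// in bytes of the sequence started by `lead`, or 0 if `lead` is a
// continuation byte or not a valid lead byte.
static size_t sketch_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical sketch: if the string ends in the middle of a multi-byte
// sequence, overwrite the partial sequence with zero bytes.
void sketch_fix_truncation(char *utf8)
{
    size_t n = strlen(utf8);
    if (n == 0)
        return;

    // Step back over at most three trailing continuation bytes (10xxxxxx)
    // to find the byte that starts the last sequence.
    size_t i = n - 1;
    while (i > 0 && n - i < 4 && ((unsigned char)utf8[i] & 0xC0) == 0x80)
        --i;

    // If that sequence needs more bytes than remain, it was split.
    // Input that does not start with a lead byte here is left untouched.
    size_t expected = sketch_sequence_length((unsigned char)utf8[i]);
    if (expected > n - i)
        memset(utf8 + i, 0, n - i);
}
```

For instance, cutting `"h\xC3\xA9"` (the text "hé") after two bytes leaves a dangling `0xC3` lead byte. Running the sketch on that buffer zeroes the dangling byte, so the string reads as `"h"`.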

UTF-32

#### `utf8_encode()`

~~~c
char *(*utf8_encode)(char *utf8, uint32_t codepoint);
~~~

Encodes `codepoint` as UTF-8 into `utf8` and returns a pointer to the position where the next codepoint should be inserted. `utf8` should have room for at least four bytes (the maximum size of a UTF-8 encoded codepoint).

#### `utf8_decode()`

~~~c
uint32_t (*utf8_decode)(const char **utf8);
~~~

Decodes and returns the first codepoint in the UTF-8 string `utf8`. The string pointer is advanced to point to the next codepoint in the string. Generates an error message if the string is not a UTF-8 string.

#### `utf8_num_codepoints()`

~~~c
uint32_t (*utf8_num_codepoints)(const char *utf8);
~~~

Returns the number of codepoints in `utf8`.

#### `utf8_decode_n()`

~~~c
uint32_t (*utf8_decode_n)(uint32_t *codepoints, uint32_t n, const char *utf8);
~~~

Decodes the first `n` codepoints in `utf8` to the `codepoints` buffer. If `utf8` contains fewer than `n` codepoints, decodes as many codepoints as there are in `utf8`. Returns the number of decoded codepoints.

#### `utf8_to_utf32()`

~~~c
uint32_t *(*utf8_to_utf32)(const char *utf8, struct tm_temp_allocator_i *ta);
~~~

Converts a UTF-8 encoded string to a UTF-32 encoded one, allocated with the supplied temp allocator. Generates an error message if the string is not a UTF-8 string.

#### `utf8_to_utf32_n()`

~~~c
uint32_t *(*utf8_to_utf32_n)(const char *utf8, uint32_t n, struct tm_temp_allocator_i *ta);
~~~

As `utf8_to_utf32()`, but uses an explicit length instead of a zero-terminated string. Note that the resulting string is still zero-terminated.

#### `utf32_to_utf8()`

~~~c
char *(*utf32_to_utf8)(const uint32_t *utf32, struct tm_temp_allocator_i *ta);
~~~

Converts a UTF-32 encoded string to a UTF-8 encoded one, allocated with the specified temp allocator. Generates an error if the data is outside the UTF-8 encoding range.

#### `utf32_to_utf8_n()`

~~~c
char *(*utf32_to_utf8_n)(const uint32_t *utf32, uint32_t n, struct tm_temp_allocator_i *ta);
~~~

As `utf32_to_utf8()`, but uses an explicit length instead of a zero-terminated string. Note that the resulting string is still zero-terminated.
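The pointer-advancing convention used by `utf8_encode()` and `utf8_decode()` is easiest to see in code. The following standalone sketch mirrors those signatures but is not the library's implementation: the `sketch_` names are hypothetical, and unlike the real API it performs no validation or error reporting, so it assumes well-formed input and codepoints no larger than `0x10FFFF`.

```c
#include <stdint.h>

// Hypothetical sketch: writes `cp` as 1 to 4 UTF-8 bytes and returns a
// pointer just past the last byte written.
char *sketch_utf8_encode(char *utf8, uint32_t cp)
{
    if (cp < 0x80) {
        *utf8++ = (char)cp;
    } else if (cp < 0x800) {
        *utf8++ = (char)(0xC0 | (cp >> 6));
        *utf8++ = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *utf8++ = (char)(0xE0 | (cp >> 12));
        *utf8++ = (char)(0x80 | ((cp >> 6) & 0x3F));
        *utf8++ = (char)(0x80 | (cp & 0x3F));
    } else {
        *utf8++ = (char)(0xF0 | (cp >> 18));
        *utf8++ = (char)(0x80 | ((cp >> 12) & 0x3F));
        *utf8++ = (char)(0x80 | ((cp >> 6) & 0x3F));
        *utf8++ = (char)(0x80 | (cp & 0x3F));
    }
    return utf8;
}

// Hypothetical sketch: decodes one codepoint and advances `*utf8` past it.
// Assumes the input is well-formed UTF-8.
uint32_t sketch_utf8_decode(const char **utf8)
{
    const unsigned char *s = (const unsigned char *)*utf8;
    uint32_t cp;
    int extra; // Number of continuation bytes that follow the lead byte.

    if (s[0] < 0x80) {
        cp = s[0];
        extra = 0;
    } else if ((s[0] & 0xE0) == 0xC0) {
        cp = s[0] & 0x1F;
        extra = 1;
    } else if ((s[0] & 0xF0) == 0xE0) {
        cp = s[0] & 0x0F;
        extra = 2;
    } else {
        cp = s[0] & 0x07;
        extra = 3;
    }
    for (int i = 1; i <= extra; ++i)
        cp = (cp << 6) | (s[i] & 0x3F);

    *utf8 += extra + 1;
    return cp;
}

// Hypothetical sketch: counts codepoints by decoding until the terminator.
uint32_t sketch_utf8_num_codepoints(const char *utf8)
{
    uint32_t n = 0;
    while (*utf8) {
        sketch_utf8_decode(&utf8);
        ++n;
    }
    return n;
}
```

Because the encoder returns the next write position and the decoder advances the read position, both can be called in a loop without tracking byte offsets separately.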

UTF-16

#### `utf16_encode()`

~~~c
uint16_t *(*utf16_encode)(uint16_t *utf16, uint32_t codepoint);
~~~

Encodes `codepoint` as UTF-16 into `utf16` and returns a pointer to the position where the next codepoint should be inserted. `utf16` should have room for at least two `uint16_t` (the maximum size of a UTF-16 encoded codepoint).

#### `utf16_decode()`

~~~c
uint32_t (*utf16_decode)(const uint16_t **utf16);
~~~

Decodes and returns the first codepoint in the UTF-16 string `utf16`. The string pointer is advanced to point to the next codepoint in the string.

#### `utf8_to_utf16()`

~~~c
uint16_t *(*utf8_to_utf16)(const char *utf8, struct tm_temp_allocator_i *ta);
~~~

Converts a UTF-8 encoded string to a UTF-16 encoded one, allocated with the supplied temp allocator. Generates an error message if the data is outside the UTF-8 encoding range.

#### `utf8_to_utf16_n()`

~~~c
uint16_t *(*utf8_to_utf16_n)(const char *utf8, uint32_t n, struct tm_temp_allocator_i *ta);
~~~

As `utf8_to_utf16()`, but uses an explicit length instead of a zero-terminated string. Note that the resulting string is still zero-terminated.

#### `utf16_to_utf8()`

~~~c
char *(*utf16_to_utf8)(const uint16_t *utf16, struct tm_temp_allocator_i *ta);
~~~

Converts a UTF-16 encoded string to a UTF-8 encoded one, allocated with the specified temp allocator. Generates an error message if the string is not a UTF-16 string.

#### `utf16_to_utf8_n()`

~~~c
char *(*utf16_to_utf8_n)(const uint16_t *utf16, uint32_t n, struct tm_temp_allocator_i *ta);
~~~

As `utf16_to_utf8()`, but uses an explicit length instead of a zero-terminated string. Note that the resulting string is still zero-terminated.
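The two-`uint16_t` maximum mentioned for `utf16_encode()` comes from surrogate pairs: codepoints above `0xFFFF` are split across two code units. The standalone sketch below illustrates this with the same pointer-advancing signatures. The `sketch_` names are hypothetical, this is not the library's code, and it assumes well-formed input with no unpaired surrogates.

```c
#include <stdint.h>

// Hypothetical sketch: writes `cp` as one or two UTF-16 code units and
// returns a pointer just past the last unit written.
uint16_t *sketch_utf16_encode(uint16_t *utf16, uint32_t cp)
{
    if (cp < 0x10000) {
        *utf16++ = (uint16_t)cp;
    } else {
        // Split the 20 bits above 0x10000 into a high and a low surrogate.
        cp -= 0x10000;
        *utf16++ = (uint16_t)(0xD800 | (cp >> 10));
        *utf16++ = (uint16_t)(0xDC00 | (cp & 0x3FF));
    }
    return utf16;
}

// Hypothetical sketch: decodes one codepoint and advances `*utf16` past it.
// Assumes every high surrogate is followed by a low surrogate.
uint32_t sketch_utf16_decode(const uint16_t **utf16)
{
    const uint16_t *s = *utf16;
    uint32_t cp = s[0];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
        uint32_t high = cp - 0xD800;
        uint32_t low = (uint32_t)s[1] - 0xDC00;
        cp = 0x10000 + ((high << 10) | low);
        *utf16 += 2;
    } else {
        *utf16 += 1;
    }
    return cp;
}
```

With this scheme the codepoint `0x1F600` becomes the pair `0xD83D 0xDE00`, while anything in the Basic Multilingual Plane stays a single code unit.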
### `tm_unicode_api_version`
~~~c
#define tm_unicode_api_version
~~~
### `tm_codepoint_to_utf8()`
~~~c
#define tm_codepoint_to_utf8(cp)
~~~

Returns a UTF-8 string representing the codepoint `cp`. The string is stack allocated.
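The macro body is not shown here, so the following is only a guess at how a stack-allocating macro of this shape could be written in C99, using a compound literal as scratch storage. `sketch_encode_terminated()` and `sketch_codepoint_to_utf8()` are hypothetical names, and the use of a compound literal is an assumption rather than something the header confirms.

```c
#include <stdint.h>

// Hypothetical helper: writes `cp` as UTF-8 into `buf` (which must hold at
// least 5 bytes), zero-terminates it, and returns `buf` so the call can be
// used as an expression.
char *sketch_encode_terminated(char *buf, uint32_t cp)
{
    char *p = buf;
    if (cp < 0x80) {
        *p++ = (char)cp;
    } else if (cp < 0x800) {
        *p++ = (char)(0xC0 | (cp >> 6));
        *p++ = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *p++ = (char)(0xE0 | (cp >> 12));
        *p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
        *p++ = (char)(0x80 | (cp & 0x3F));
    } else {
        *p++ = (char)(0xF0 | (cp >> 18));
        *p++ = (char)(0x80 | ((cp >> 12) & 0x3F));
        *p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
        *p++ = (char)(0x80 | (cp & 0x3F));
    }
    *p = 0;
    return buf;
}

// Assumption: a C99 compound literal provides the storage. When used inside
// a function it has automatic storage duration, so the returned pointer is
// only valid until the enclosing block ends.
#define sketch_codepoint_to_utf8(cp) \
    sketch_encode_terminated((char[5]){ 0 }, (cp))
```

Whatever the actual mechanism is, the practical consequence of stack allocation is the same: use the result immediately, or copy it if it needs to outlive the enclosing block.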