mirror of
https://github.com/fish-shell/fish-shell.git
synced 2026-05-19 09:51:16 -03:00
Accurately computing the width of arbitrary strings is a non-trivial problem. We outsource the logic for it to the `unicode-width` crate. But directly passing our PUA-encoded strings to the crate would give incorrect results whenever a PUA codepoint is encoded in our string, since one input PUA codepoint is converted into 3 consecutive codepoints in our encoding. Therefore, we need to decode before performing width calculations.

Our regular decoding decodes to raw bytes, which is incompatible with the `unicode-width` crate, since it expects `char`s, and the decoded bytes could be invalid UTF-8, making their width undefined. We tackle this problem by building a custom iterator which does on-the-fly decoding. Encoded PUA codepoints are turned back into the original codepoints, and any other PUA-encoded bytes are replaced by one replacement character (U+FFFD) per byte.

The replacement is not strictly necessary: PUA codepoints have a defined width of 1, so we could instead forward the PUA codepoints which encode invalid UTF-8 input unchanged. The choice to use the replacement character is made to avoid producing a char sequence where some PUA codepoints represent themselves, whereas others still encode non-UTF-8 bytes. Such a mix of semantics would be confusing if the char sequence is ever used for anything else. Replacement characters make it clear that there are no remaining encoded semantics.

Note that using the char sequences produced in this way for any purpose other than width computation is not intended. For output, our pre-existing decoding to bytes should be used, which allows preserving non-UTF-8 bytes.

The implementation of the iterator is not entirely straightforward, since we need to read up to 3 chars to be able to decide whether we have an encoded PUA character. Therefore, we need to cache some chars across invocations of the iterator's `next` and `next_back` methods. This is done via a custom buffer struct, which does not require dynamic allocations.

The tests for the new functionality are only in the main crate because the encoding function is not available in the `fish-widestring` crate. Once that is resolved, the tests should be moved.

Part of #12457
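The decoding rule described above can be illustrated with a minimal, forward-only sketch. It is not the code from this commit: the names `ENCODE_BASE`, `encoded_byte` and `decode_for_width` are hypothetical, the sketch assumes that raw bytes are encoded as the codepoints U+F600 through U+F6FF, and it collects into a `Vec` rather than using the allocation-free, double-ended buffer the commit describes.

```rust
/// Assumed start of the PUA range that encodes raw bytes (hypothetical
/// constant for this sketch; byte `b` is assumed to map to `ENCODE_BASE + b`).
const ENCODE_BASE: u32 = 0xF600;

/// Returns the raw byte if `c` lies in the assumed encoding range.
fn encoded_byte(c: char) -> Option<u8> {
    let v = c as u32;
    if (ENCODE_BASE..ENCODE_BASE + 0x100).contains(&v) {
        Some((v - ENCODE_BASE) as u8)
    } else {
        None
    }
}

/// Decodes an encoded char sequence into chars suitable for width computation.
/// Three consecutive encoded bytes that form the UTF-8 encoding of a codepoint
/// in the encoding range become that codepoint again; every other encoded
/// byte becomes one U+FFFD.
fn decode_for_width(input: &[char]) -> Vec<char> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let b0 = match encoded_byte(input[i]) {
            Some(b) => b,
            None => {
                // Ordinary char: forward unchanged.
                out.push(input[i]);
                i += 1;
                continue;
            }
        };
        // Look ahead two more chars to check for an encoded PUA codepoint.
        if i + 2 < input.len() {
            if let (Some(b1), Some(b2)) =
                (encoded_byte(input[i + 1]), encoded_byte(input[i + 2]))
            {
                let bytes = [b0, b1, b2];
                if let Ok(s) = std::str::from_utf8(&bytes) {
                    let mut chars = s.chars();
                    if let (Some(c), None) = (chars.next(), chars.next()) {
                        if encoded_byte(c).is_some() {
                            // The three bytes spell one codepoint from the
                            // encoding range: restore the original codepoint.
                            out.push(c);
                            i += 3;
                            continue;
                        }
                    }
                }
            }
        }
        // Encoded byte that is not part of an encoded PUA codepoint.
        out.push('\u{FFFD}');
        i += 1;
    }
    out
}
```

Under these assumptions, the encoded form of U+F600 (UTF-8 bytes `EF 98 80`, that is U+F6EF U+F698 U+F680) decodes back to the single char U+F600, while a lone U+F6FF decodes to U+FFFD. The resulting chars could then be passed to a width function; as noted above, they are not meant for output.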
15 lines
262 B
TOML
[package]
name = "fish-widestring"
edition.workspace = true
rust-version.workspace = true
version = "0.0.0"
repository.workspace = true
license.workspace = true

[dependencies]
unicode-width.workspace = true
widestring.workspace = true

[lints]
workspace = true