Files
fish-shell/crates/widestring/Cargo.toml
Daniel Rainer c38dd1f420 unicode: add function for width computation
Accurately computing the width of arbitrary strings is a non-trivial
problem. We outsource the logic for it to the `unicode-width` crate. But
directly passing our PUA-encoded strings to the crate would give
incorrect results whenever a PUA codepoint is encoded in our string,
since one input PUA codepoint is converted into 3 consecutive codepoints
in our encoding. Therefore, we need to decode before performing width
calculations. Our regular decoding decodes to raw bytes, which is
incompatible with the `unicode-width` crate, since it expects `char`s,
and the decoded bytes could be invalid UTF-8, making their width
undefined. We tackle this problem by building a custom iterator which
does on-the-fly decoding. Encoded PUA codepoints are turned back into
the original codepoints, and any other PUA-encoded bytes are replaced by
one replacement character (U+FFFD) per byte. The latter is not necessary
since PUA codepoints have a defined width of 1, so we could also forward
the PUA-encoded bytes which encode invalid UTF-8 input instead of
inserting the replacement character. The choice to use the replacement
character is made to avoid producing a char sequence where some PUA
codepoints represent themselves, whereas others still encode non-UTF-8
bytes. Such a mix of semantics would be confusing if the char sequence
is ever used for anything else. Replacement characters make it clear
that there are no remaining encoded semantics. Note that using the char
sequences produced in this way for any purpose other than width
computation is not intended. For output, our pre-existing decoding to
bytes should be used, which allows preserving non-UTF-8 bytes.

The implementation of the iterator is not entirely straightforward,
since we need to read up to 3 chars to be able to decide whether we have
an encoded PUA character. Therefore, we need to cache some chars across
invocations of the iterator's `next` and `next_back` invocations. This
is done via a custom buffer struct, which does not require dynamic
allocations.

The tests for the new functionality are only in the main crate because
the encoding function is not available in the `fish-widestring` crate.
Once that is resolved, the tests should be moved.

Part of #12457
2026-02-25 18:18:21 +11:00

15 lines
262 B
TOML

[package]
name = "fish-widestring"
edition.workspace = true
rust-version.workspace = true
version = "0.0.0"
repository.workspace = true
license.workspace = true
[dependencies]
unicode-width.workspace = true
widestring.workspace = true
[lints]
workspace = true