Switch to builtin gettext implementation

This completely removes our runtime dependency on gettext. As a
replacement, we have our own code for runtime localization in
`src/wutil/gettext.rs`. It considers the relevant locale variables to
decide which message catalogs to take localizations from. The use of
locale variables is mostly the same as in gettext, with the notable
exception that we do not support "default dialects". If `LANGUAGE=ll` is
set and we don't have a `ll` catalog but a `ll_CC` catalog, we will use
the catalog with the country code suffix. If multiple such catalogs
exist, we use an arbitrary one. (At the moment we have at most one
catalog per language, so this is not particularly relevant.)
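
The fallback described above could be sketched roughly as follows. This is only an illustration, not the code in `src/wutil/gettext.rs`: the helper name `pick_catalog` and the representation of available catalogs as a slice of names are assumptions made for the example.

```rust
/// Hypothetical helper: choose a catalog name for one requested language.
/// `available` stands in for the names of the catalogs built into the binary.
fn pick_catalog<'a>(requested: &str, available: &[&'a str]) -> Option<&'a str> {
    // Prefer an exact match such as `de` or `pt_BR`.
    if let Some(exact) = available.iter().copied().find(|name| *name == requested) {
        return Some(exact);
    }
    // Otherwise accept a catalog of the form `<requested>_CC`.
    // If several exist, which one wins is arbitrary; this sketch takes the
    // first one in the slice.
    available.iter().copied().find(|name| {
        name.strip_prefix(requested)
            .is_some_and(|rest| rest.starts_with('_'))
    })
}
```

With this sketch, `pick_catalog("pt", &["fr", "pt_BR"])` selects `pt_BR`, while a request for a language without any matching catalog yields `None`.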

By using an `EnvStack` to pass variables to gettext at runtime, we now
respect locale variables which are not exported.
For early output, we don't have an `EnvStack` to pass, so we add an
initialization function which constructs an `EnvStack` containing the
relevant locale variables from the corresponding environment variables.
Treat `LANGUAGE` as a path variable. This adds automatic colon-splitting.
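
As an illustration of what colon-splitting means for a value like `LANGUAGE=de:fr`, here is a minimal sketch. In fish the splitting comes from path-variable handling, so the helper `split_language_list` below is purely hypothetical, and skipping empty entries is an assumption of the sketch.

```rust
/// Hypothetical helper: split a `LANGUAGE`-style value into its entries,
/// in the order they were given, skipping empty entries.
fn split_language_list(value: &str) -> Vec<&str> {
    value.split(':').filter(|entry| !entry.is_empty()).collect()
}
```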

The sourcing of catalogs is completely reworked. Instead of looking for
MO files at runtime, we create catalogs as Rust maps at build time, by
converting PO files into MO data, which is not stored, but immediately
parsed to extract the mappings. From the mappings, we create Rust source
code as a build artifact, which is then macro-included in the crate's
library, i.e. `crates/gettext-maps/src/lib.rs`. The code in
`src/wutil/gettext.rs` includes the message catalogs from this library,
resulting in the message catalogs being built into the executable.
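
To make the shape of the embedded data concrete, here is a rough sketch of a lookup through a map of per-language maps. It uses `std::collections::HashMap` as a stand-in for the generated `phf::Map` statics so that it is self-contained. The function names, the sample entries, and the choice to fall through to the next language when a message is missing are all assumptions of the sketch, not a description of `src/wutil/gettext.rs`.

```rust
use std::collections::HashMap;

/// Stand-in for one generated per-language map (msgid -> msgstr).
type Catalog = HashMap<&'static str, &'static str>;

/// Stand-in for the generated `CATALOGS` static, with made-up sample entries.
fn example_catalogs() -> HashMap<&'static str, Catalog> {
    let de = Catalog::from([("Hello", "Hallo"), ("Goodbye", "Auf Wiedersehen")]);
    let fr = Catalog::from([("Hello", "Bonjour"), ("Yes", "Oui")]);
    HashMap::from([("de", de), ("fr", fr)])
}

/// Hypothetical lookup: try each language in order of preference and
/// return the untranslated message if no catalog contains it.
fn localize<'a>(
    catalogs: &HashMap<&'static str, Catalog>,
    languages: &[&str],
    msgid: &'a str,
) -> &'a str {
    for lang in languages {
        let translation = catalogs
            .get(*lang)
            .and_then(|catalog| catalog.get(msgid))
            .copied();
        if let Some(translation) = translation {
            return translation;
        }
    }
    msgid
}
```

Because the maps are plain data compiled into the binary, a lookup like this needs no file access at runtime.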

The `localize-messages` feature can now be used to control whether to
build with gettext support. By default, it is enabled. If `msgfmt` is
not available at build time and the feature is enabled, a warning is
emitted and fish is built with gettext support but without any message
catalogs, so localization will not work.

As a performance optimization, for each language we cache a separate
Rust source file containing its catalog as a map. This allows us to
reuse parsing results if the corresponding PO files have not changed
since we cached the parsing result.
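
The freshness check behind this cache can be sketched as a comparison of modification times. The helper name `cache_is_fresh` is hypothetical, and treating a missing cache file as "not fresh" instead of an error is a choice made for this sketch.

```rust
use std::{fs, io, path::Path};

/// Hypothetical helper: report whether the cached map for a PO file can be
/// reused, based on modification times alone.
fn cache_is_fresh(po_file: &Path, cached_map: &Path) -> io::Result<bool> {
    let cached_mtime = match fs::metadata(cached_map) {
        Ok(metadata) => metadata.modified()?,
        // A missing cache file just means the map has to be generated.
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    let po_mtime = fs::metadata(po_file)?.modified()?;
    // Only a strictly newer cache file counts as up to date.
    Ok(cached_mtime > po_mtime)
}
```

A strict comparison errs on the side of regenerating the map when both files carry the same timestamp.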

Note that this approach does not eliminate our build-time dependency on
gettext. The process for generating PO files (which uses `msguniq` and
`msgmerge`) is unchanged, and we still need `msgfmt` to translate from
PO to MO. We could parse PO files directly, but these are significantly
more complex to parse, so we use `msgfmt` to do it for us and parse the
resulting MO data.
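
To show why MO data is comparatively easy to parse, here is a small self-contained sketch that builds a little-endian MO image with a single entry in memory and reads it back. It is not the parser added in this commit. All function names are hypothetical, and the sketch handles only little-endian data and only the first entry.

```rust
/// Build a little-endian MO image holding a single msgid/msgstr pair.
fn build_single_entry_mo(msgid: &str, msgstr: &str) -> Vec<u8> {
    const HEADER_LEN: u32 = 28; // seven u32 header fields
    let originals_table = HEADER_LEN;
    let translations_table = originals_table + 8; // one (length, offset) pair
    let msgid_offset = translations_table + 8;
    let msgstr_offset = msgid_offset + msgid.len() as u32 + 1; // skip the NUL
    let fields = [
        0x950412de, // magic number
        0,          // file format revision
        1,          // number of strings
        originals_table,
        translations_table,
        0, // hash table size (no hash table in this sketch)
        0, // hash table offset (unused here)
        msgid.len() as u32,
        msgid_offset,
        msgstr.len() as u32,
        msgstr_offset,
    ];
    let mut data: Vec<u8> = fields.iter().flat_map(|f| f.to_le_bytes()).collect();
    data.extend_from_slice(msgid.as_bytes());
    data.push(0);
    data.extend_from_slice(msgstr.as_bytes());
    data.push(0);
    data
}

/// Read a little-endian u32 at `offset`, or `None` if the data is too short.
fn read_u32_le(data: &[u8], offset: usize) -> Option<u32> {
    let bytes = data.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Return the string described by the first (length, offset) pair of a table.
fn first_string_in_table(data: &[u8], table: usize) -> Option<&[u8]> {
    let len = read_u32_le(data, table)? as usize;
    let start = read_u32_le(data, table.checked_add(4)?)? as usize;
    data.get(start..start.checked_add(len)?)
}

/// Extract the first msgid/msgstr pair from little-endian MO data.
fn first_entry(data: &[u8]) -> Option<(&[u8], &[u8])> {
    if read_u32_le(data, 0)? != 0x950412de {
        return None;
    }
    if read_u32_le(data, 8)? == 0 {
        return None; // no strings in the file
    }
    let originals_table = read_u32_le(data, 12)? as usize;
    let translations_table = read_u32_le(data, 16)? as usize;
    Some((
        first_string_in_table(data, originals_table)?,
        first_string_in_table(data, translations_table)?,
    ))
}
```

The whole format reduces to a fixed header plus two tables of (length, offset) pairs, which is what makes MO data a convenient intermediate representation compared to PO syntax.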

Advantages of the new approach:
- We have no runtime dependency on gettext anymore.
- The implementation has the same behavior everywhere.
- Our implementation is significantly simpler than GNU gettext.
- We can have localization in cargo-only builds by embedding
  localizations into the code.
  Previously, localization in such builds could only work reliably as
  long as the binary was not moved from the build directory.
- We no longer have to take care of building and installing MO files in
  build systems; everything we need for localization to work happens
  automatically when building fish.
- Reduced overhead when disabling localization, both in compilation time
  and binary size.

Disadvantages of this approach:
- Our own runtime implementation of gettext needs to be maintained.
- The implementation has a more limited feature set (but I don't think
  it lacks any features which have been in use by fish).

Part of #11726
Closes #11583
Closes #11725
Closes #11683
Daniel Rainer
2025-08-22 20:03:45 +02:00
committed by Johannes Altmanninger
parent 3a196c3a08
commit ad323d03b6
30 changed files with 756 additions and 322 deletions

@@ -5,10 +5,22 @@ pub fn workspace_root() -> &'static Path {
manifest_dir.ancestors().nth(2).unwrap()
}
pub fn cargo_target_dir() -> Cow<'static, Path> {
fn cargo_target_dir() -> Cow<'static, Path> {
option_env!("CARGO_TARGET_DIR")
.map(|d| Cow::Borrowed(Path::new(d)))
.unwrap_or(std::borrow::Cow::Owned(workspace_root().join("target")))
.unwrap_or(Cow::Owned(workspace_root().join("target")))
}
pub fn fish_build_dir() -> Cow<'static, Path> {
// FISH_BUILD_DIR is set by CMake, if we are using it.
option_env!("FISH_BUILD_DIR")
.map(|d| Cow::Borrowed(Path::new(d)))
.unwrap_or(cargo_target_dir())
}
// TODO Move this to rsconf
pub fn rebuild_if_path_changed<P: AsRef<Path>>(path: P) {
rsconf::rebuild_if_path_changed(path.as_ref().to_str().unwrap());
}
// TODO Move this to rsconf

@@ -1,11 +1,8 @@
#[cfg(not(clippy))]
use std::path::Path;
use fish_build_helper::cargo_target_dir;
fn main() {
let cargo_target_dir = cargo_target_dir();
let mandir = cargo_target_dir.join("fish-man");
let mandir = fish_build_helper::fish_build_dir().join("fish-man");
let sec1dir = mandir.join("man1");
// Running `cargo clippy` on a clean build directory panics, because when rust-embed tries to
// embed a directory which does not exist it will panic.

@@ -0,0 +1,18 @@
[package]
name = "fish-gettext-maps"
edition.workspace = true
rust-version.workspace = true
version = "0.0.0"
repository.workspace = true
[dependencies]
phf.workspace = true
[build-dependencies]
fish-build-helper.workspace = true
fish-gettext-mo-file-parser.workspace = true
phf_codegen.workspace = true
rsconf.workspace = true
[lints]
workspace = true

@@ -0,0 +1,142 @@
use std::{
    env,
    ffi::OsStr,
    path::{Path, PathBuf},
    process::Command,
};

fn main() {
    let cache_dir =
        PathBuf::from(fish_build_helper::fish_build_dir()).join("fish-localization-map-cache");
    embed_localizations(&cache_dir);
    fish_build_helper::rebuild_if_path_changed(fish_build_helper::workspace_root().join("po"));
}

fn embed_localizations(cache_dir: &Path) {
    use fish_gettext_mo_file_parser::parse_mo_file;
    use std::{
        fs::File,
        io::{BufWriter, Write},
    };
    let po_dir = fish_build_helper::workspace_root().join("po");
    // Ensure that the directory is created, because clippy cannot compile the code if the
    // directory does not exist.
    std::fs::create_dir_all(cache_dir).unwrap();
    let localization_map_path =
        Path::new(&env::var("OUT_DIR").unwrap()).join("localization_maps.rs");
    let mut localization_map_file = BufWriter::new(File::create(&localization_map_path).unwrap());
    // This will become a map which maps from language identifiers to maps containing localizations
    // for the respective language.
    let mut catalogs = phf_codegen::Map::new();
    match Command::new("msgfmt").arg("-h").status() {
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
            rsconf::warn!(
                "Cannot find msgfmt to build gettext message catalogs. Localization will not work."
            );
            rsconf::warn!(
                "If you install it now you need to trigger a rebuild to get localization support."
            );
            rsconf::warn!(
                "One way to achieve that is running `touch po` followed by the build command."
            );
        }
        Err(e) => {
            panic!("Error when trying to run `msgfmt -h`: {e:?}");
        }
        Ok(_) => {
            for dir_entry_result in po_dir.read_dir().unwrap() {
                let dir_entry = dir_entry_result.unwrap();
                let po_file_path = dir_entry.path();
                if po_file_path.extension() != Some(OsStr::new("po")) {
                    continue;
                }
                let lang = po_file_path
                    .file_stem()
                    .expect("All entries in the po directory must be regular files.");
                let language = lang.to_str().unwrap().to_owned();
                // Each language gets its own static map for the mapping from message in the source code to
                // the localized version.
                let map_name = format!("LANG_MAP_{language}");
                let cached_map_path = cache_dir.join(lang);
                // Include the file containing the map for this language in the main generated file.
                writeln!(
                    &mut localization_map_file,
                    "include!(\"{}\");",
                    cached_map_path.display()
                )
                .unwrap();
                // Map from the language identifier to the map containing the localizations for this
                // language.
                catalogs.entry(language, format!("&{map_name}"));
                if let Ok(metadata) = std::fs::metadata(&cached_map_path) {
                    // Cached map file exists, but might be outdated.
                    let cached_map_mtime = metadata.modified().unwrap();
                    let po_mtime = dir_entry.metadata().unwrap().modified().unwrap();
                    if cached_map_mtime > po_mtime {
                        // Cached map file is considered up-to-date.
                        continue;
                    };
                }
                // Generate the map file.
                // Try to create new MO data and load it into `mo_data`.
                let output = Command::new("msgfmt")
                    .arg("--check-format")
                    .arg("--output-file=-")
                    .arg(&po_file_path)
                    .output()
                    .unwrap();
                let mo_data = output.stdout;
                // Extract map from MO data.
                let language_localizations = parse_mo_file(&mo_data).unwrap();
                // This file will contain the localization map for the current language.
                let mut cached_map_file = File::create(&cached_map_path).unwrap();
                let mut single_language_localization_map = phf_codegen::Map::new();
                // The values will be written into the source code as is, meaning escape sequences and
                // double quotes in the data will be interpreted by the Rust compiler, which is undesirable.
                // Converting them to raw strings prevents this. (As long as no input data contains `"###`.)
                fn to_raw_str(s: &str) -> String {
                    assert!(!s.contains("\"###"));
                    format!("r###\"{s}\"###")
                }
                for (msgid, msgstr) in language_localizations {
                    single_language_localization_map.entry(
                        String::from_utf8(msgid.into()).unwrap(),
                        to_raw_str(&String::from_utf8(msgstr.into()).unwrap()),
                    );
                }
                writeln!(&mut cached_map_file, "#[allow(non_upper_case_globals)]").unwrap();
                write!(
                    &mut cached_map_file,
                    "static {}: phf::Map<&'static str, &'static str> = {}",
                    &map_name,
                    single_language_localization_map.build()
                )
                .unwrap();
                writeln!(&mut cached_map_file, ";").unwrap();
            }
        }
    }
    write!(
        &mut localization_map_file,
        "pub static CATALOGS: phf::Map<&str, &phf::Map<&str, &str>> = {}",
        catalogs.build()
    )
    .unwrap();
    writeln!(&mut localization_map_file, ";").unwrap();
}

@@ -0,0 +1 @@
include!(concat!(env!("OUT_DIR"), "/localization_maps.rs"));

@@ -0,0 +1,9 @@
[package]
name = "fish-gettext-mo-file-parser"
edition.workspace = true
rust-version.workspace = true
version = "0.0.0"
repository.workspace = true
[lints]
workspace = true

@@ -0,0 +1,131 @@
use std::collections::HashMap;

const U32_SIZE: usize = std::mem::size_of::<u32>();

fn read_le_u32(bytes: &[u8]) -> u32 {
    u32::from_le_bytes(bytes[..U32_SIZE].try_into().unwrap())
}

fn read_be_u32(bytes: &[u8]) -> u32 {
    u32::from_be_bytes(bytes[..U32_SIZE].try_into().unwrap())
}

fn get_u32_reader_from_magic_number(magic_number: &[u8]) -> std::io::Result<fn(&[u8]) -> u32> {
    match magic_number {
        [0x95, 0x04, 0x12, 0xde] => Ok(read_be_u32),
        [0xde, 0x12, 0x04, 0x95] => Ok(read_le_u32),
        _ => Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "First 4 bytes of MO file must correspond to magic number 0x950412de, either big or little endian.",
        )),
    }
}

/// Returns an error if an unknown major revision is detected.
/// There are no relevant differences between supported revisions.
fn check_if_revision_is_supported(revision: u32) -> std::io::Result<()> {
    // From the reference:
    // A program seeing an unexpected major revision number should stop reading the MO file entirely;
    // whereas an unexpected minor revision number means that the file can be read
    // but will not reveal its full contents,
    // when parsed by a program that supports only smaller minor revision numbers.
    let major_revision = revision >> 16;
    match major_revision {
        0 | 1 => {
            // At time of writing, these are the only major revisions which exist.
            // There is no documented difference and the GNU gettext code does not seem to
            // differentiate between the two either.
            // All features we care about are supported in minor revision 0,
            // so we do not need to care about the minor revision.
            Ok(())
        }
        _ => Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "Major revision must be 0 or 1",
        )),
    }
}

fn as_usize(value: u32) -> usize {
    use std::mem::size_of;
    const _: () = assert!(size_of::<u32>() <= size_of::<usize>());
    usize::try_from(value).unwrap()
}

fn parse_strings(
    file_content: &[u8],
    num_strings: usize,
    table_offset: usize,
    read_u32: fn(&[u8]) -> u32,
) -> std::io::Result<Vec<&[u8]>> {
    let file_too_short_error = || {
        Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "MO file is too short.",
        ))
    };
    if table_offset + num_strings * 2 * U32_SIZE > file_content.len() {
        return file_too_short_error();
    }
    let mut strings = Vec::with_capacity(num_strings);
    let mut offset = table_offset;
    let mut get_next_u32 = || {
        let val = read_u32(&file_content[offset..]);
        offset += U32_SIZE;
        val
    };
    for _ in 0..num_strings {
        // not including NUL terminator
        let string_length = as_usize(get_next_u32());
        let string_offset = as_usize(get_next_u32());
        let string_end = string_offset.checked_add(string_length).unwrap();
        if string_end > file_content.len() {
            return file_too_short_error();
        }
        // Contexts are stored by storing the concatenation of the context, a EOT byte, and the original string, instead of the original string.
        // Contexts are not supported by this implementation.
        // The format allows plural forms to appear behind singular forms, separated by a NUL byte,
        // where `string_length` includes the length of both.
        // This is not supported here.
        // Do not include the NUL terminator in the slice.
        strings.push(&file_content[string_offset..string_end]);
    }
    Ok(strings)
}

/// Parse a MO file.
/// Format reference used: <https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html>
pub fn parse_mo_file(file_content: &[u8]) -> std::io::Result<HashMap<&[u8], &[u8]>> {
    if file_content.len() < 7 * U32_SIZE {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "File too short to contain header.",
        ));
    }
    // The first 4 bytes are a magic number, from which the endianness can be determined.
    let read_u32 = get_u32_reader_from_magic_number(&file_content[0..U32_SIZE])?;
    let mut offset = U32_SIZE;
    let mut get_next_u32 = || {
        let val = read_u32(&file_content[offset..]);
        offset += U32_SIZE;
        val
    };
    let file_format_revision = get_next_u32();
    check_if_revision_is_supported(file_format_revision)?;
    let num_strings = as_usize(get_next_u32());
    let original_strings_offset = as_usize(get_next_u32());
    let translation_strings_offset = as_usize(get_next_u32());
    let original_strings =
        parse_strings(file_content, num_strings, original_strings_offset, read_u32)?;
    let translated_strings = parse_strings(
        file_content,
        num_strings,
        translation_strings_offset,
        read_u32,
    )?;
    let mut translation_map = HashMap::with_capacity(num_strings);
    for i in 0..num_strings {
        translation_map.insert(original_strings[i], translated_strings[i]);
    }
    Ok(translation_map)
}