Just realized it's impossible to use UCS-2 (UTF-16) for passing arguments to unix programs, because arguments are nul-terminated, and in UCS-2, almost every other byte is zero...
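A minimal C sketch of what goes wrong (the argument "-l" and the little byte dump are just my example):

    /* The UTF-16 code units of a plain ASCII argument like "-l" are full of
     * zero bytes, and execve(2) hands each argument to the program as a
     * NUL-terminated char*, so the string ends at the very first zero byte. */
    #include <stdio.h>
    #include <string.h>
    #include <uchar.h>

    int main(void) {
        const char16_t arg[] = u"-l";          /* code units 0x002D, 0x006C, 0x0000 */
        const char *bytes = (const char *)arg; /* view the same memory as raw bytes */

        for (size_t i = 0; i < sizeof arg; i++)
            printf("%02x ", (unsigned char)bytes[i]);
        printf("\n");                          /* 2d 00 6c 00 00 00 on little-endian */

        /* strlen() stops at the first zero byte, i.e. after a single byte here,
         * which is exactly where a NUL-terminated argv entry would end. */
        printf("length as a C string: %zu\n", strlen(bytes));
        return 0;
    }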
@wolf480pl Unix programs? Those are GNU's Not Unix programs sir.
UTF-16 is a useless format: it's still a multibyte (variable-width) encoding, yet it almost doubles the storage size of mostly-ASCII text, unless all you are encoding is Chinese characters.
Just use UTF-8 - it's ASCII compatible and you can pass it to whatever program and it will work unless the program does something stupid.
If you have some UTF-16 encoded files, you can convert them to UTF-8 with GNU iconv.
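On the command line that's something like iconv -f UTF-16 -t UTF-8 in.txt > out.txt; the same conversion is also available from C through the POSIX iconv(3) API. A rough sketch (fixed output buffer, minimal error handling, and the hard-coded UTF-16LE input is just my example):

    #include <iconv.h>
    #include <stdio.h>

    int main(void) {
        char in[] = "h\0i\0!\0";              /* "hi!" spelled out as UTF-16LE bytes */
        char out[64] = {0};

        iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        char *inp = in, *outp = out;
        size_t inleft = sizeof in - 1;        /* drop the literal's trailing NUL */
        size_t outleft = sizeof out;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");

        printf("%s\n", out);                  /* prints: hi! */
        iconv_close(cd);
        return 0;
    }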
@a1ba @wolf480pl
>It's double the size for most things (most things are ASCII)
>It's somewhat faster to decode.
????
Thinking about the differences between UTF-8's and UTF-16's variable-length encodings, I don't see how either is meaningfully faster to decode than the other.
@Suiseiseki @wolf480pl utf-16 is somewhat faster to decode. It doesn't even have to be Chinese; it's true even for Cyrillic text and the non-ASCII half of Latin-1.
But then it's still double the size on everything that's ASCII.
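The byte counts back that up; a quick C sketch (the sample strings are just my picks):

    #include <stdio.h>
    #include <uchar.h>

    int main(void) {
        /* sizeof a string literal includes the terminator, so subtract it
         * to get the number of payload bytes in each encoding. */
        printf("ASCII    \"hello\":  UTF-8 %zu bytes, UTF-16 %zu bytes\n",
               sizeof(u8"hello") - 1,  sizeof(u"hello") - sizeof(char16_t));
        printf("Cyrillic \"привет\": UTF-8 %zu bytes, UTF-16 %zu bytes\n",
               sizeof(u8"привет") - 1, sizeof(u"привет") - sizeof(char16_t));
        printf("CJK      \"你好\":    UTF-8 %zu bytes, UTF-16 %zu bytes\n",
               sizeof(u8"你好") - 1,   sizeof(u"你好") - sizeof(char16_t));
        return 0;                      /* 5 vs 10, 12 vs 12, 6 vs 4 */
    }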
@divVerent @Suiseiseki @wolf480pl @a1ba utf-16 was propped up by microsoft and sun as a misguided attempt to get out of how unicode turns indexing into a string from O(1) to O(n). the idea was that if you just use 16-bit cells, you're back to being able to reach an arbitrary rune directly.
this is false because diacritics still exist in utf-16, and utf-16 STILL has characters that don't fit in a single 16-bit cell (the ones outside the Basic Multilingual Plane), so you STILL have to perform local checks to see if you are about to slice directly into a rune at the wrong place.
basically unicode sucks and some corporate coders tried to get around it and made everything suck even more.
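The local check for the surrogate-pair case is small but unavoidable; a minimal C sketch (the function name and sample string are mine, and combining marks are a separate problem on top of this):

    #include <stdbool.h>
    #include <stdio.h>
    #include <uchar.h>

    /* Cutting a UTF-16 string at an arbitrary code-unit index can land between
     * the two halves of a surrogate pair (a code point outside the BMP). */
    static bool utf16_cut_is_safe(const char16_t *s, size_t len, size_t i) {
        if (i == 0 || i >= len)
            return true;
        /* unsafe if s[i] is a low surrogate, i.e. the second half of a pair */
        return !(s[i] >= 0xDC00 && s[i] <= 0xDFFF);
    }

    int main(void) {
        const char16_t s[] = u"a\U0001F600b";      /* 'a', an emoji (one pair), 'b' */
        size_t len = sizeof s / sizeof s[0] - 1;   /* 4 code units */

        for (size_t i = 0; i <= len; i++)
            printf("cut at code unit %zu: %s\n", i,
                   utf16_cut_is_safe(s, len, i) ? "ok" : "splits a surrogate pair");
        return 0;
    }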
@Suiseiseki @wolf480pl @a1ba "It depends". UTF-16 is definitely faster to decode because you have fewer loop iterations for the same string (8-bit and 16-bit RAM reads are about the same speed on the CPU).
HOWEVER, especially when all codepoints are ASCII, UTF-16 uses twice the memory bandwidth. And that hurts too.
So it ultimately depends on the character set / language used.
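As a toy illustration, counting how many units a straightforward decode loop has to touch for the same text (the function names and sample strings are mine):

    #include <stdio.h>
    #include <uchar.h>

    /* One "step" per unit read from memory: a UTF-8 loop touches every byte,
     * a UTF-16 loop touches every 16-bit unit (which moves twice the bytes). */
    static size_t utf8_steps(const char *s) {
        size_t n = 0;
        for (; *s; s++) n++;
        return n;
    }

    static size_t utf16_steps(const char16_t *s) {
        size_t n = 0;
        for (; *s; s++) n++;
        return n;
    }

    int main(void) {
        printf("ASCII    \"hello\":  %zu byte reads vs %zu 16-bit reads\n",
               utf8_steps("hello"),                  utf16_steps(u"hello"));
        printf("Cyrillic \"привет\": %zu byte reads vs %zu 16-bit reads\n",
               utf8_steps((const char *)u8"привет"), utf16_steps(u"привет"));
        return 0;   /* 5 vs 5 (but double the bytes moved), 12 vs 6 */
    }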
@divVerent @Suiseiseki @a1ba @wolf480pl this assumes you care about being correct. if you don't, and evidently companies in current year do not, then SHRUG: as long as you normalize the input and confine everything to the BMP, it kind of works.
@divVerent @Suiseiseki @wolf480pl @a1ba back when i had that code i basically made several accessors and iterators to deal with it. you told it whether you were dealing with graphemes or just code points, and it had heuristics and loops to check the nearest safe split point at a given byte.
i don't think i still have that C code, it's been ages. shame, since it would have been neat resume fodder before chatgpt.
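For UTF-8 the nearest-safe-split check is the easy part; a small sketch in that spirit (not the original code, and the function name and sample string are mine):

    #include <stdio.h>

    /* Back up from an arbitrary byte offset to the nearest offset that is not
     * in the middle of a multi-byte sequence; continuation bytes are 10xxxxxx. */
    static size_t utf8_safe_split(const unsigned char *s, size_t len, size_t pos) {
        if (pos >= len)
            return len;                              /* cutting at the end is always fine */
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)   /* 0x80..0xBF = continuation byte */
            pos--;
        return pos;
    }

    int main(void) {
        const unsigned char text[] = u8"na\u00EFve"; /* bytes: 6e 61 c3 af 76 65 */
        size_t cut = utf8_safe_split(text, sizeof text - 1, 3);
        printf("requested byte 3, safe split at byte %zu\n", cut);   /* prints 2 */
        return 0;
    }

That only finds a code point boundary; stopping at a grapheme boundary needs the extra heuristics the post mentions.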
@icedquinn @Suiseiseki @wolf480pl @a1ba TBH diacritics are less of an issue - most operations on strings, such as word wrapping, can easily work on a per-codepoint basis - you just need to handle diacritics and other combining codepoints as if they were word characters.
And for stuff like line length computation, you need to take the different per-character width of your font into account anyway.
What's really annoying is string comparison, as you now have to apply normalization first...
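A tiny C sketch of why a plain byte-wise compare isn't enough (the sample strings are mine):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* The same visible text as two different code point sequences:
         * precomposed U+00E9 vs 'e' followed by combining acute U+0301. */
        const char precomposed[] = u8"caf\u00E9";
        const char decomposed[]  = u8"cafe\u0301";

        printf("bytes equal? %s\n", strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
        printf("lengths: %zu vs %zu bytes\n", strlen(precomposed), strlen(decomposed));

        /* They render identically but compare unequal byte-for-byte (and
         * code-point-for-code-point), so you normalize to NFC or NFD first,
         * e.g. with a library such as ICU. */
        return 0;
    }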