GNU social JP
GNU social JP is a Japanese GNU social server.
Conversation

Notices

    Wolf480pl (wolf480pl@mstdn.io)'s status on Saturday, 29-Mar-2025 20:35:50 JST

    Just realized it's impossible to use UCS-2 (UTF-16) for passing arguments to unix programs, because arguments are nul-terminated, and in UCS-2, almost every other byte is zero...

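The zero-byte problem described above is easy to see directly; a minimal sketch in Python (used here purely for illustration):

```python
# UTF-16/UCS-2 stores each ASCII character in two bytes, one of them zero,
# so a NUL-terminated C string (or an execve() argument) is cut short.
arg = "ls"
utf16 = arg.encode("utf-16-le")   # little-endian, no BOM
utf8 = arg.encode("utf-8")

print(utf16)        # b'l\x00s\x00' -- an embedded NUL right after the 'l'
print(0 in utf16)   # True: argv handling would stop at the second byte
print(0 in utf8)    # False: UTF-8 keeps ASCII text free of zero bytes
```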
    翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 29-Mar-2025 20:35:49 JST
      @wolf480pl Unix programs? Those are GNU's Not Unix programs sir.

      UTF-16 is a useless format, as it's a multibyte encoding that almost doubles the storage size of text, unless all you are encoding is Chinese characters.

      Just use UTF-8 - it's ASCII compatible and you can pass it to whatever program and it will work unless the program does something stupid.

      If you have some UTF-16 encoded files, you can convert them to UTF-8 with GNU iconv.
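The conversion step mentioned above (GNU iconv) has a close analogue in Python's codec machinery; a sketch of the same re-encoding job, not a substitute for iconv itself:

```python
# Re-encode UTF-16 data as UTF-8, the same job `iconv -f UTF-16 -t UTF-8`
# performs on files. Python's "utf-16" codec honours a BOM and picks the
# byte order automatically.
def utf16_to_utf8(data: bytes) -> bytes:
    return data.decode("utf-16").encode("utf-8")

utf16_bytes = "héllo".encode("utf-16")   # BOM + two bytes per character
print(utf16_to_utf8(utf16_bytes))        # b'h\xc3\xa9llo'
```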
    翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 29-Mar-2025 20:46:04 JST, in reply to :umu: :umu:
      @a1ba @wolf480pl >It's double the size for most things (most things are ASCII)
      >It's somewhat faster to decode.
      ????

      Thinking about the differences between the variable encodings of UTF-8 and UTF-16, I don't see how either is meaningfully faster to decode than the other.
    :umu: :umu: (a1ba@suya.place)'s status on Saturday, 29-Mar-2025 20:46:05 JST, in reply to 翠星石
      @Suiseiseki @wolf480pl utf-16 is somewhat faster to decode. It doesn't even have to be Chinese; that's true even for Cyrillic text and the upper half of Latin-1.

      But then it's still double the size on everything that's ASCII.

      Just use the right tools to achieve the goal.
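The size claims traded in this exchange are quick to check; a sketch in Python comparing encoded lengths per script:

```python
# Bytes per sample string: UTF-16 doubles ASCII, ties UTF-8 on Cyrillic
# (both need two bytes per character), and beats UTF-8 on CJK text.
samples = {"ascii": "hello", "cyrillic": "привет", "cjk": "你好"}
for name, text in samples.items():
    print(name, len(text.encode("utf-8")), len(text.encode("utf-16-le")))
# ascii 5 10
# cyrillic 12 12
# cjk 6 4
```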
    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:02:28 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @divVerent @Suiseiseki @wolf480pl @a1ba utf-16 was propped up by microsoft and sun as a misguided attempt to get out of how unicode turns string indexing from O(1) to O(n). the idea was that if you just use 16-bit cells then you are back to being able to reach an arbitrary rune at an arbitrary index.

      this is false because diacritics still exist in utf-16. and utf-16 STILL has characters it cannot represent in a single cell (the ones outside the Basic Multilingual Plane), so you STILL have to perform local checks to see if you are about to slice directly into a rune at the wrong place.

      basically unicode sucks and some corporate coders tried to get around it and made everything suck even more.
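The point about characters outside the BMP can be made concrete; a small Python check (the treble-clef codepoint is just an arbitrary non-BMP example):

```python
# Codepoints above U+FFFF do not fit in one 16-bit cell; UTF-16 encodes
# them as a surrogate pair, so fixed-width 16-bit indexing breaks there.
clef = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF
units = len(clef.encode("utf-16-le")) // 2 # number of 16-bit code units
print(ord(clef) > 0xFFFF)  # True: outside the Basic Multilingual Plane
print(units)               # 2: a surrogate pair, not a single cell
```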
    divVerent (divverent@social.vivaldi.net)'s status on Saturday, 29-Mar-2025 21:02:31 JST, in reply to 翠星石, :umu: :umu:
      @Suiseiseki @wolf480pl @a1ba "It depends". UTF-16 is definitely faster to decode because you have fewer loop iterations for the same string (8-bit and 16-bit RAM reads are about the same speed on the CPU).

      HOWEVER, especially when all codepoints are ASCII, UTF-16 uses twice the memory bandwidth. And that hurts too.

      So, ultimately depends on the character set / language used.

    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:03:45 JST, in reply to 翠星石, iced depresso, :umu: :umu:, divVerent
      @divVerent @Suiseiseki @a1ba @wolf480pl this assumes you care about being correct. if you don't, and evidently companies in current year do not, then SHRUG: as long as you normalize the input and confine everything to the BMP, it kind of works.
    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:06:48 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @wolf480pl @Suiseiseki @divVerent @a1ba yes :cat_sad:
    Wolf480pl (wolf480pl@mstdn.io)'s status on Saturday, 29-Mar-2025 21:06:49 JST, in reply to 翠星石, iced depresso, :umu: :umu:, divVerent

      @icedquinn @Suiseiseki @divVerent @a1ba ok but like

      Did Unicode contain surrogates and modifier codepoints at the time when UTF-16 was designed?

    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:09:41 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @wolf480pl @Suiseiseki @divVerent @a1ba no. it's more that they coded for precursor formats https://www.ibm.com/docs/en/i/7.4?topic=unicode-ucs-2-its-relationship-utf-16 and then tried to upgrade by gesticulating wildly.

      (i wrote a whole utf-8 module once upon a horrible time)

      Attachments

      1. UCS-2 and its relationship to Unicode (UTF-16)
         The UCS-2 standard, an early version of Unicode, is limited to 65,535 characters. However, the data processing industry needs over 94,000 characters; the UCS-2 standard has been superseded by the Unicode UTF-16 standard.
    Wolf480pl (wolf480pl@mstdn.io)'s status on Saturday, 29-Mar-2025 21:09:42 JST, in reply to 翠星石, iced depresso, :umu: :umu:, divVerent

      @icedquinn @Suiseiseki @divVerent @a1ba
      So I can't even blame this on the Unicode Consortium's scope creep ;_;

    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:23:15 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @divVerent @Suiseiseki @wolf480pl @a1ba back when i had that code i basically made several accessors and iterators to deal with it. you told it if you were dealing with graphemes, or just code points, and it had heuristics and loops to check the nearest safe split point at a given byte.

      i don't think i still have that C code after all these years. shame, since it would have been neat resume fodder before chatgpt.
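The "nearest safe split point" heuristic described above can be sketched in Python: back up past combining marks so a diacritic is never separated from its base character (codepoints only; full grapheme clusters need more than this):

```python
import unicodedata

def safe_split_index(s: str, i: int) -> int:
    # Move the split point left while it lands on a combining mark, so the
    # mark stays attached to the base character on its left.
    while i > 0 and unicodedata.combining(s[i]):
        i -= 1
    return i

s = "cafe\u0301 au lait"        # 'e' followed by COMBINING ACUTE ACCENT
print(safe_split_index(s, 4))   # 4 points at the accent -> backs up to 3
print(safe_split_index(s, 2))   # 2 points at 'f', already safe -> 2
```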
    divVerent (divverent@social.vivaldi.net)'s status on Saturday, 29-Mar-2025 21:23:16 JST, in reply to 翠星石, iced depresso, :umu: :umu:

      @icedquinn @Suiseiseki @wolf480pl @a1ba TBH diacritics are less of an issue - most operations on strings can easily work on a per-codepoint basis, such as word wrapping - you just need to handle diacritics and other combining codepoints as if they're a word character.

      And for stuff like line length computation, you need to take the different per-character width of your font into account anyway.

      What's really annoying is string comparing, as you now have to apply a normalization first...

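The normalization problem in that last point is easy to demonstrate; a minimal Python sketch using the standard unicodedata module:

```python
import unicodedata

# The same text can arrive composed (NFC) or decomposed (NFD); a naive
# codepoint-by-codepoint comparison sees two different strings.
composed = "caf\u00e9"      # 'é' as one codepoint
decomposed = "cafe\u0301"   # 'e' plus COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True after normalizing
```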

GNU social JP is a social network, courtesy of GNU social JP管理人. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.